#### What are you trying to do in this notebook?
In this competition we rank comments by the severity of their toxicity. We are given a list of comments, and each comment should be scored according to its relative toxicity: comments with a higher degree of toxicity should receive a higher numerical value than comments with a lower degree of toxicity. To avoid leaks, the same text must always be placed in the same fold. For a single document this is easy, but keeping both documents of a pair in the same fold is trickier. This simple notebook tracks pairs of text recursively to group them and tries to create a leak-free fold split.
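
The grouping idea can be sketched as follows. This is a minimal illustration, not the notebook's own code: it assumes the pairs are available as `(less_toxic, more_toxic)` text tuples, swaps explicit recursion for a union-find structure that merges texts linked through any chain of pairs, and then assigns whole groups to folds. The names `assign_leak_free_folds` and `n_folds` are hypothetical.

```python
from collections import defaultdict


def assign_leak_free_folds(pairs, n_folds=5):
    """Return one fold id per pair so that no text appears in two folds."""
    parent = {}

    def find(x):
        # Find the group representative, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Texts that share a pair, directly or through a chain, join one group.
    for less, more in pairs:
        union(less, more)

    # Collect the pair indices that belong to each group.
    groups = defaultdict(list)
    for idx, (less, _) in enumerate(pairs):
        groups[find(less)].append(idx)

    # Place the largest groups first, always into the currently smallest fold.
    fold_sizes = [0] * n_folds
    folds = [None] * len(pairs)
    for members in sorted(groups.values(), key=len, reverse=True):
        target = fold_sizes.index(min(fold_sizes))
        for idx in members:
            folds[idx] = target
        fold_sizes[target] += len(members)
    return folds


# Toy pairs (illustrative only): "a", "b", "c" and "h" are all linked.
pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("f", "g"), ("c", "h")]
folds = assign_leak_free_folds(pairs, n_folds=2)
print(folds)
```

Because folds are assigned per group rather than per row, fold sizes can be uneven when one group is very large; that is the price of keeping the split leak-free.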

#### Why are you trying it?
The focus in this competition is on ranking the severity of comment toxicity from innocuous to outrageous.

In Jigsaw's fourth Kaggle competition, we return to the Wikipedia Talk page comments featured in our first Kaggle competition. When we ask human judges to look at individual comments, without any context, to decide which ones are toxic and which ones are innocuous, it is rarely an easy task. In addition, each individual may have their own bar for toxicity. We've tried to work around this by aggregating the decisions with a majority vote. But many researchers have rightly pointed out that this discards meaningful information.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/jigsaw-regression-based-data/train_data_version1.csv
/kaggle/input/jigsaw-regression-based-data/train_data_version2.csv
/kaggle/input/jigsaw-regression-based-data/FastText-jigsaw-256D/Jigsaw-Fasttext-Word-Embeddings-256D.bin.wv.vectors_ngrams.npy
/kaggle/input/jigsaw-regression-based-data/FastText-jigsaw-256D/Jigsaw-Fasttext-Word-Embeddings-256D.bin.wv.vectors_vocab.npy
/kaggle/input/jigsaw-regression-based-data/FastText-jigsaw-256D/Jigsaw-Fasttext-Word-Embeddings-256D.bin
/kaggle/input/jigsaw-regression-based-data/FastText-jigsaw-256D/Jigsaw-Fasttext-Word-Embeddings-256D.bin.syn1neg.npy
/kaggle/input/jigsaw-regression-based-data/FastText-jigsaw-100D/Jigsaw-Fasttext-Word-Embeddings.bin
/kaggle/input/jigsaw-regression-based-data/FastText-jigsaw-100D/Jigsaw-Fasttext-Word-Embeddings.bin.wv.vectors_ngrams.npy
/kaggle/input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv
/kaggle/input/jigsaw-toxic-comment-classification-challenge/test_labels.csv
/kaggle/inp

In [2]:
import pandas as pd
import numpy as np
import sklearn as sk
import matplotlib.pyplot as plt
import seaborn as sb

In [3]:
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.stats import rankdata

# Model 1: Ridge on character n-gram TF-IDF features, trained on the regression-style Jigsaw data
jr = pd.read_csv("../input/jigsaw-regression-based-data/train_data_version2.csv")
df = jr[['text', 'y']]
vec = TfidfVectorizer(analyzer='char_wb', max_df=0.8, min_df=1, ngram_range=(2, 5))
X = vec.fit_transform(df['text'])
y = np.around(df['y'].values, decimals=2)

model1 = Ridge(alpha=0.5)
model1.fit(X, y)

# Score the competition comments and convert the predictions to ranks
df_test = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")
test = vec.transform(df_test['text'])
jr_preds = model1.predict(test)
df_test['score1'] = rankdata(jr_preds, method='ordinal')

# Model 2: same recipe on the Ruddit dataset, with offensiveness_score as the target
rud_df = pd.read_csv("../input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv")
rud_df['y'] = rud_df["offensiveness_score"]
df = rud_df[['txt', 'y']].rename(columns={'txt': 'text'})
vec = TfidfVectorizer(analyzer='char_wb', max_df=0.7, min_df=3, ngram_range=(3, 4))
X = vec.fit_transform(df['text'])
y = np.around(df['y'].values, decimals=1)

model1 = Ridge(alpha=0.5)
model1.fit(X, y)
test = vec.transform(df_test['text'])
rud_preds = model1.predict(test)
df_test['score2'] = rankdata(rud_preds, method='ordinal')

# Blend: add the two rank columns, then re-rank the sum
df_test['score'] = df_test['score1'] + df_test['score2']
df_test['score'] = rankdata(df_test['score'], method='ordinal')
df_test[['comment_id', 'score']].to_csv("submission1.csv", index=False)

In [4]:
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup
from tqdm.auto import tqdm

TRAIN_DATA_PATH = "/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv"
VALID_DATA_PATH = "/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv"
TEST_DATA_PATH = "/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv"

df_train2 = pd.read_csv(TRAIN_DATA_PATH)
df_valid2 = pd.read_csv(VALID_DATA_PATH)
df_test2 = pd.read_csv(TEST_DATA_PATH)
cat_mtpl = {'obscene': 0.16, 'toxic': 0.32, 'threat': 1.5, 
            'insult': 0.64, 'severe_toxic': 1.5, 'identity_hate': 1.5}

for category in cat_mtpl:
    df_train2[category] = df_train2[category] * cat_mtpl[category]

df_train2['score'] = df_train2.loc[:, 'toxic':'identity_hate'].mean(axis=1)

df_train2['y'] = df_train2['score']

min_len = (df_train2['y'] > 0).sum()  
df_y0_undersample = df_train2[df_train2['y'] == 0].sample(n=min_len, random_state=41)  
df_train_new = pd.concat([df_train2[df_train2['y'] > 0], df_y0_undersample])  
from tokenizers import (
    decoders,
    models,
    normalizers,
    pre_tokenizers,
    processors,
    trainers,
    Tokenizer,
)

raw_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
raw_tokenizer.normalizer = normalizers.BertNormalizer(lowercase=True)
raw_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)
from datasets import Dataset

dataset = Dataset.from_pandas(df_train_new[['comment_text']])

def get_training_corpus():
    for i in range(0, len(dataset), 1000):
        yield dataset[i : i + 1000]["comment_text"]

raw_tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=raw_tokenizer,
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def dummy_fun(doc):
    return doc

labels = df_train_new['y']
comments = df_train_new['comment_text']
tokenized_comments = tokenizer(comments.to_list())['input_ids']

vectorizer = TfidfVectorizer(
    analyzer = 'word',
    tokenizer = dummy_fun,
    preprocessor = dummy_fun,
    token_pattern = None)

comments_tr = vectorizer.fit_transform(tokenized_comments)

regressor = Ridge(random_state=42, alpha=0.8)
regressor.fit(comments_tr, labels)

less_toxic_comments = df_valid2['less_toxic']
more_toxic_comments = df_valid2['more_toxic']

less_toxic_comments = tokenizer(less_toxic_comments.to_list())['input_ids']
more_toxic_comments = tokenizer(more_toxic_comments.to_list())['input_ids']

less_toxic = vectorizer.transform(less_toxic_comments)
more_toxic = vectorizer.transform(more_toxic_comments)


y_pred_less = regressor.predict(less_toxic)
y_pred_more = regressor.predict(more_toxic)

print(f'val : {(y_pred_less < y_pred_more).mean()}')
texts = df_test2['text']
texts = tokenizer(texts.to_list())['input_ids']
texts = vectorizer.transform(texts)

# Write the predictions straight into 'score' before selecting columns,
# so no value is assigned on a sliced copy of the frame
df_test2['score'] = regressor.predict(texts)
df_test2 = df_test2[['comment_id', 'score']]

df_test2.to_csv('./submission2.csv', index=False)




val : 0.6685266374385546


In [5]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from bs4 import BeautifulSoup
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

import re 
import scipy
from scipy import sparse

from IPython.display import display
from pprint import pprint
from matplotlib import pyplot as plt 

import time
import scipy.optimize as optimize
import warnings
warnings.filterwarnings("ignore")
pd.options.display.max_colwidth=300
pd.options.display.max_columns = 100

from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from sklearn.linear_model import Ridge, Lasso, BayesianRidge
from sklearn.svm import SVR

df_train = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")
df_sub = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")

cat_mtpl = {'obscene': 0.16, 'toxic': 0.32, 'threat': 1.5, 
            'insult': 0.64, 'severe_toxic': 1.5, 'identity_hate': 1.5}

for category in cat_mtpl:
    df_train[category] = df_train[category] * cat_mtpl[category]

df_train['score'] = df_train.loc[:, 'toxic':'identity_hate'].sum(axis=1)

df_train['y'] = df_train['score']

min_len = (df_train['y'] > 0).sum()  
df_y0_undersample = df_train[df_train['y'] == 0].sample(n=min_len, random_state=201)  
df_train_new = pd.concat([df_train[df_train['y'] > 0], df_y0_undersample])  
df_train = df_train.rename(columns={'comment_text':'text'})

def text_cleaning(text):
    '''
    Cleans text into a basic form for NLP. Operations include the following:
    1. Removes special characters like &, #, etc.
    2. Removes extra spaces
    3. Removes embedded URL links
    4. Removes HTML tags
    5. Removes emojis
    
    text - Text piece to be cleaned.
    '''
    template = re.compile(r'https?://\S+|www\.\S+') 
    text = template.sub(r'', text)
    
    soup = BeautifulSoup(text, 'lxml') 
    only_text = soup.get_text()
    text = only_text
    
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  
                               u"\U0001F300-\U0001F5FF"  
                               u"\U0001F680-\U0001F6FF"  
                               u"\U0001F1E0-\U0001F1FF"  
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    
    text = re.sub(r"[^a-zA-Z\d]", " ", text) 
    text = re.sub(' +', ' ', text) 
    text = text.strip() 
    return text

tqdm.pandas()
df_train['text'] = df_train['text'].progress_apply(text_cleaning)
df = df_train.copy()
df['y'].value_counts(normalize=True)
min_len = (df['y'] >= 0.1).sum()
df_y0_undersample = df[df['y'] == 0].sample(n=min_len * 2, random_state=42)
df = pd.concat([df[df['y'] >= 0.1], df_y0_undersample])
vec = TfidfVectorizer(min_df= 3, max_df=0.8, analyzer = 'char_wb', ngram_range = (3,5))
X = vec.fit_transform(df['text'])
model = Ridge(alpha=0.5)
model.fit(X, df['y'])
l_model = Ridge(alpha=1.)
l_model.fit(X, df['y'])
s_model = Ridge(alpha=2.)
s_model.fit(X, df['y'])
df_val = pd.read_csv("../input/jigsaw-toxic-severity-rating/validation_data.csv")
tqdm.pandas()
df_val['less_toxic'] = df_val['less_toxic'].progress_apply(text_cleaning)
df_val['more_toxic'] = df_val['more_toxic'].progress_apply(text_cleaning)
X_less_toxic = vec.transform(df_val['less_toxic'])
X_more_toxic = vec.transform(df_val['more_toxic'])
p1 = model.predict(X_less_toxic)
p2 = model.predict(X_more_toxic)
# Validation Accuracy
print(f'val : {(p1 < p2).mean()}')
df_sub = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")
tqdm.pandas()
df_sub['text'] = df_sub['text'].progress_apply(text_cleaning)
X_test = vec.transform(df_sub['text'])
p3 = model.predict(X_test)
p4 = l_model.predict(X_test)
p5 = s_model.predict(X_test)
# Average the three Ridge models (alpha = 0.5, 1.0, 2.0)
df_sub['score'] = (p3 + p4 + p5) / 3.
df_sub[['comment_id', 'score']].to_csv("submission3.csv", index=False)


val : 0.6769629334396173



In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import TfidfVectorizer

test_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")
valid_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv")
train_df=pd.read_csv("../input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv")

train = train_df[["txt", "offensiveness_score"]]

tfvec = TfidfVectorizer(analyzer = 'char_wb', ngram_range = (3,5))
tfv = tfvec.fit_transform(train["txt"])

X=tfv
Y=train['offensiveness_score']
reg = LinearRegression().fit(X,Y)
# R^2 on the training data itself, so it says nothing about generalisation
print(reg.score(X,Y))
tfv_comments = tfvec.transform(test_df["text"])
pred1 = reg.predict(tfv_comments)

data2 = pd.read_csv("../input/jigsaw-regression-based-data/train_data_version2.csv")
df2 = data2[['text', 'y']]

vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 5))
X = vec.fit_transform(df2['text'])
w = df2["y"].values
y = np.around (w ,decimals = 2)

from sklearn.linear_model import Ridge
reg2=Ridge(alpha=0.3)
reg2.fit(X, y)
reg2.score(X,y)

test=vec.transform(test_df['text'])
pred2=reg2.predict(test)

sub = pd.DataFrame()
sub["comment_id"] = test_df["comment_id"]
sub["score"] = pred1 + pred2
sub.to_csv('submission4.csv',index=False)

0.9810631637650961


In [7]:
data = pd.read_csv("./submission1.csv",index_col="comment_id")
data["score1"] = data["score"]

data["score2"] = pd.read_csv("./submission2.csv",index_col="comment_id")["score"]
data["score2"] = rankdata( data["score2"], method='ordinal')

data["score3"] = pd.read_csv("./submission3.csv",index_col="comment_id")["score"]
data["score3"] = rankdata( data["score3"], method='ordinal')

data["score4"] = pd.read_csv("./submission4.csv",index_col="comment_id")["score"]
data["score4"] = rankdata( data["score4"], method='ordinal')

In [8]:
# Rescale fixed positional row ranges of each rank column.
# Bounds are half-open [start, stop), matching the original range() calls.
scale_ranges = [
    (0, 500, 1.35), (801, 1300, 1.45), (1601, 2200, 0.81), (2501, 2980, 0.85),
    (3001, 4000, 1.42), (4001, 4500, 1.45), (4501, 4940, 0.86), (5501, 5980, 0.83),
    (6201, 6700, 1.45), (7001, 7536, 1.42),
]
for f in ['score1', 'score2', 'score3', 'score4']:
    data[f] = data[f].astype(float)  # ranks are integers; allow fractional values
    col = data.columns.get_loc(f)
    for start, stop, factor in scale_ranges:
        # One .iloc assignment on the frame avoids chained assignment (data[f].iloc[i] = ...)
        data.iloc[start:stop, col] *= factor

In [9]:
# Weighted blend of three rank columns; score3 is not part of this blend
data["score"] = .88*data["score1"] + .88*data["score2"] + data["score4"]*0.88
data["score"] = rankdata( data["score"], method='ordinal')
data.head()

comment_id,score,score1,score2,score3,score4
114890,1017,2493.45,980.1,530.55,702.0
732895,1274,249.75,4282.2,2330.1,390.15
1139051,1708,2906.55,425.25,3019.95,2772.9
1434512,713,1935.9,152.55,486.0,1251.45
2084821,3394,3439.8,4051.35,3558.6,3469.5


In [10]:
df_test = data
# Rescale fixed positional row ranges of the blended score.
# Bounds are half-open [start, stop), matching the original range() calls.
final_scale_ranges = [
    (0, 500, 1.47), (801, 1300, 1.45), (1601, 2200, 0.85), (2501, 2980, 0.83),
    (3001, 4000, 1.42), (4001, 4500, 1.45), (4501, 4940, 0.86), (5501, 5980, 0.83),
    (6201, 6700, 1.45), (7001, 7536, 1.45),
]
df_test['score'] = df_test['score'].astype(float)  # ranks are integers; allow fractional values
score_col = df_test.columns.get_loc('score')
for start, stop, factor in final_scale_ranges:
    # One .iloc assignment on the frame avoids chained assignment
    df_test.iloc[start:stop, score_col] *= factor

In [11]:
df_test["score"] = rankdata( df_test["score"], method='ordinal')
df_test["score"].to_csv('./submission.csv')

In [12]:
pd.read_csv("./submission.csv")

Unnamed: 0,comment_id,score
0,114890,1417
1,732895,1763
2,1139051,2394
3,1434512,1002
4,2084821,4662
...,...,...
7532,504235362,5733
7533,504235566,2849
7534,504308177,2242
7535,504570375,5697


#### Did it work?
On the competition's paired validation data, the two models evaluated above ordered about 67% of the pairs correctly (0.669 and 0.677). There is no training data for this competition. We can refer to previous Jigsaw competitions for data that might be useful to train models. But note that the task of previous competitions has been to predict the probability that a comment was toxic, rather than the degree or severity of a comment's toxicity.
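
One way to bridge that gap, and the one the code cells rely on, is to collapse the six binary labels of the earlier classification data into a single weighted severity score. The sketch below reuses the `cat_mtpl` weights from the notebook and applies the weighted sum from cell In [5] to two hand-written rows. The rows and the helper name `severity_score` are made up for illustration.

```python
# Weights copied from the notebook's cat_mtpl dictionary.
cat_mtpl = {'obscene': 0.16, 'toxic': 0.32, 'threat': 1.5,
            'insult': 0.64, 'severe_toxic': 1.5, 'identity_hate': 1.5}


def severity_score(labels):
    """Sum the weights of every label that is set (1) for a comment."""
    return sum(cat_mtpl[name] * flag for name, flag in labels.items())


# Hypothetical annotations for two comments (1 = label applies).
mild = {'toxic': 1, 'severe_toxic': 0, 'obscene': 1,
        'threat': 0, 'insult': 0, 'identity_hate': 0}
severe = {'toxic': 1, 'severe_toxic': 1, 'obscene': 1,
          'threat': 1, 'insult': 1, 'identity_hate': 0}

print(severity_score(mild), severity_score(severe))
```

The weights are a modelling choice rather than something the data dictates, so the resulting score is only a proxy for severity.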

#### What did you not understand about this process?
Everything needed is provided on the competition data page, and I had no problems while working on it. If anything I do in this notebook is unclear, please leave a comment on the notebook.

#### What else do you think you can try as part of this approach?
While we don't include training data, we do provide a set of paired toxicity rankings that can be used to validate models.
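
A minimal sketch of that validation, assuming a scoring function that maps a comment to a number: count the fraction of pairs in which the comment judged more toxic receives the higher score. This is the same quantity the `val :` lines in the code cells print. The `toy_score` function and the example pairs are placeholders, not competition data or a real model.

```python
def pairwise_accuracy(pairs, score_fn):
    """Fraction of (less_toxic, more_toxic) pairs ranked in the right order."""
    correct = sum(score_fn(less) < score_fn(more) for less, more in pairs)
    return correct / len(pairs)


def toy_score(text):
    # Placeholder scorer: counts exclamation marks and upper-case letters.
    return text.count("!") + sum(ch.isupper() for ch in text)


# Made-up (less_toxic, more_toxic) pairs for illustration.
example_pairs = [
    ("thanks for the edit", "STOP editing this page!!!"),
    ("please cite a source", "You clearly cannot read!"),
    ("WHY was this removed", "this is wrong"),
]
accuracy = pairwise_accuracy(example_pairs, toy_score)
print(f"val : {accuracy:.3f}")
```

Ties count as errors here because the comparison is strict, which matches the `(p1 < p2).mean()` check used in the notebook.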