# Background
This notebook averages the predictions of two different models: a Naive Bayes classifier and a ridge regression. Both models are trained on outside data plus a fraction of the competition's validation data.

The code below was inspired by the following:
* notebook by JULIÁN PELLER: https://www.kaggle.com/julian3833/jigsaw-incredibly-simple-naive-bayes-0-768
* notebook by ANDREJ MARINCHENKO: https://www.kaggle.com/andrej0marinchenko/jigsaw-ensemble-0-86
* notebook by MANAV: https://www.kaggle.com/manabendrarout/pytorch-roberta-ranking-baseline-jrstc-infer

So far this notebook is split into three submissions:
1. Using the current competition's data as validation data for models trained on outside data
2. Using the current competition's data as separate training data to generate a second set of models
3. Using the current competition's data as additional training data to generate a single set of models

# Imports
Import the libraries and load data from the following sources:
* Jigsaw Toxic Comment Classification challenge - flags each comment as 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', and/or 'identity_hate' (1 or 0 for each)
* Ruddit Jigsaw dataset - scores between -1 (maximally supportive) and 1 (maximally offensive)
* Jigsaw Unintended Bias in Toxicity Classification - scores each comment on 'toxic', 'severe_toxicity', 'obscene', 'threat', 'insult', and 'identity_attack' (fractional values, thresholded at 0.5 below)
* Jigsaw Toxic Severity rating - the current competition, which ranks how toxic a comment is compared to other comments

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from imblearn.over_sampling import SMOTE # used to over-sample imbalanced dataset
from sklearn.feature_extraction.text import TfidfVectorizer # converts sentences into vectors
from sklearn.naive_bayes import MultinomialNB # Naive Bayes model for discrete scores
from sklearn.linear_model import Ridge # Ridge regression model
import re # regular expression library
from bs4 import BeautifulSoup # used to interpret websites
from tqdm.auto import tqdm # progress bar
import optuna # finds best parameters
from sklearn.model_selection import train_test_split # used to split data into training and validation sets
from sklearn.model_selection import StratifiedKFold # used to cross-validate the data


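# ignore SettingWithCopyWarning
import warnings
try:
    from pandas.errors import SettingWithCopyWarning  # newer pandas versions
except ImportError:
    from pandas.core.common import SettingWithCopyWarning  # older pandas versions
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)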

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# set seed for randomness
rseed=201

In [None]:
import imblearn

# Create training data from other jigsaw competitions
To train the models we need comments (X) as features and a "ground truth" for each comment's toxicity (y). We use the Jigsaw Toxic Comment Classification challenge training data, which flags each comment with any of 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', or 'identity_hate' using 1's and 0's. Since this competition treats all of these categories as toxic, we collapse them into a single label that states whether each comment is toxic or not.

## Import Jigsaw Toxic Comment Classification challenge data

For training Naive Bayes we label each comment 1 if it is toxic and 0 if it is not.

In [None]:
pre_train_df1 = pd.read_csv(
    "/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv")

# naive bayes target
pre_train_df1['y_nb'] = (
    pre_train_df1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
    .sum(axis=1) > 0 ).astype(int)

For training ridge regression we must put weights on each of the categories to measure the toxicity of a comment.

In [None]:
# Create a score that measures how toxic a comment is
toxicity_weights = {'obscene': 0.02, 'toxic': 0.05, 'threat': 0.27, 
            'insult': 0.11, 'severe_toxic': 0.28, 'identity_hate': 0.27}

for cat in toxicity_weights:
    pre_train_df1[cat] = pre_train_df1[cat] * toxicity_weights[cat]
    
pre_train_df1['y_rr'] = pre_train_df1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1)
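
As a worked illustration of the two targets, the sketch below recomputes them in plain Python for a few made-up rows. The rows are hypothetical and the helper names `nb_target` and `rr_target` are not part of the notebook; the weights are the ones used above. Because the six weights sum to 1, the ridge regression target always lies between 0 and 1.

```python
# Weights copied from the cell above; they sum to 1.0.
TOXICITY_WEIGHTS = {'obscene': 0.02, 'toxic': 0.05, 'threat': 0.27,
                    'insult': 0.11, 'severe_toxic': 0.28, 'identity_hate': 0.27}


def nb_target(row):
    """Return 1 if any category flag is set, otherwise 0."""
    return int(any(row[cat] for cat in TOXICITY_WEIGHTS))


def rr_target(row):
    """Return the weighted sum of the category flags."""
    return sum(row[cat] * weight for cat, weight in TOXICITY_WEIGHTS.items())


# Hypothetical example rows (not taken from the dataset).
clean_row = dict.fromkeys(TOXICITY_WEIGHTS, 0)
rude_row = {**clean_row, 'toxic': 1, 'obscene': 1, 'insult': 1}
worst_row = dict.fromkeys(TOXICITY_WEIGHTS, 1)

for name, row in [('clean', clean_row), ('rude', rude_row), ('worst', worst_row)]:
    print(name, nb_target(row), round(rr_target(row), 2))
```

A comment flagged as toxic, obscene, and insulting gets a binary label of 1 but a ridge target of only 0.18, since the heavily weighted categories (threat, severe_toxic, identity_hate) are not set.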

In [None]:
# organize the dataframe to only include the text and targets
pre_train_df1 = pre_train_df1[['comment_text', 'y_nb','y_rr']].rename(
    columns={'comment_text': 'text'})
pre_train_df1.sample(5)

## Import Ruddit Jigsaw dataset
For training Naive Bayes we label each comment 1 (toxic) if its offensiveness score is greater than 0 and 0 (not toxic) otherwise.

In [None]:
pre_train_df2 = pd.read_csv(
    "/kaggle/input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv")

# rename the columns containing the text data and score
pre_train_df2 = pre_train_df2[['txt', 'offensiveness_score']].rename(columns={'txt': 'text',
                                                          'offensiveness_score':'score'})


# label all comments with an offensiveness score greater than 0 as 1's, otherwise 0's
pre_train_df2['y_nb'] = (pre_train_df2['score'] > 0 ).astype(int)




For training ridge regression we set the target to zero if the offensiveness score is less than 0.

In [None]:
pre_train_df2['y_rr'] = pre_train_df2['score'].copy()
pre_train_df2.loc[pre_train_df2['y_rr']<0,'y_rr']=0
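
The mapping from a Ruddit score to the two targets can be sketched for single values as follows. This is a plain-Python illustration with made-up scores; `ruddit_targets` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
def ruddit_targets(offensiveness_score):
    """Map a Ruddit score in [-1, 1] to the pair (y_nb, y_rr)."""
    y_nb = int(offensiveness_score > 0)    # 1 = toxic, 0 = not toxic
    y_rr = max(offensiveness_score, 0.0)   # supportive comments are floored at 0
    return y_nb, y_rr


# Hypothetical scores: supportive, neutral, mildly offensive.
for score in (-0.6, 0.0, 0.35):
    print(score, ruddit_targets(score))
```

Flooring at 0 puts the Ruddit target on the same 0-to-1 scale as the weighted target of the first dataset, at the cost of treating all supportive comments as equally non-toxic.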

In [None]:
# drop the score column containing the offensiveness values
pre_train_df2.drop(['score'], axis=1, inplace=True)
pre_train_df2.sample(5)

## Import Jigsaw Unintended Bias challenge data (deprecated)

I tried using this third dataset, but it did not improve the model, so the cells in this section are disabled.

For training Naive Bayes, comments were labeled as toxic (y_nb=1) if any of the 'toxic', 'severe_toxicity', 'obscene', 'threat', 'insult', and 'identity_attack' values were equal to or greater than 0.5. 

In [None]:
if 1==0:
    pre_train_df3 = pd.read_csv(
        "/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv")
    print(len(pre_train_df3))
    pre_train_df3['y_nb'] = (
        pre_train_df3[['toxic', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack']] >= .5 ).any(axis=1).astype(int)

For training ridge regression we must put weights on each of the categories to measure the toxicity of a comment. After weighting these scores we take the maximum value as the toxicity score.


In [None]:
if 1==0:
    # Create a score that measures how toxic a comment is
    toxicity_weights = {'obscene': 0.02, 'toxic': 0.05, 'threat': 0.27, 
                'insult': 0.11, 'severe_toxicity': 0.28, 'identity_attack': 0.27}

    for cat in toxicity_weights:
        pre_train_df3[cat] = pre_train_df3[cat] * toxicity_weights[cat]

    pre_train_df3['y_rr'] = pre_train_df3[['toxic', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack']].max(axis=1)


    pre_train_df3.loc[pre_train_df3['y_nb']>0,['comment_text','toxic', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack','y_nb']].sample(5)

In [None]:
if 1==0:
    pre_train_df3 = pre_train_df3[['comment_text', 'y_nb','y_rr']].rename(
        columns={'comment_text': 'text'})

## Imbalanced dataset
Below we check whether each dataset contains a roughly equal number of toxic and non-toxic comments.

In [None]:
print ("The first training dataset has %i rows." % len(pre_train_df1))
print ("The first training dataset has %i toxic comments." % (pre_train_df1['y_nb'] == 1).sum())
print ("The first training dataset has %i non-toxic comments." % (pre_train_df1['y_nb'] == 0).sum())

print ("The second training dataset has %i rows." % len(pre_train_df2))
print ("The second training dataset has %i toxic comments." % (pre_train_df2['y_nb'] == 1).sum())
print ("The second training dataset has %i non-toxic comments." % (pre_train_df2['y_nb'] == 0).sum())

if 1==0:
    print ("The third training dataset has %i rows." % len(pre_train_df3))
    print ("The third training dataset has %i toxic comments." % (pre_train_df3['y_nb'] == 1).sum())
    print ("The third training dataset has %i non-toxic comments." % (pre_train_df3['y_nb'] == 0).sum())

The datasets are very imbalanced. The code below undersamples the non-toxic comments down to the number of toxic comments. I also tried over-sampling, but it decreased the accuracy of the model.

In [None]:
train_df_list=[]
for cur_train_df in [pre_train_df1,pre_train_df2]:
    # undersample to the number of toxic comments (undersample_n)
    undersample_n = (cur_train_df['y_nb'] == 1).sum()

    # perform undersample
    cur_train_df_y0_undersample = cur_train_df.loc[cur_train_df['y_nb'] == 0,:].sample(
        n=undersample_n, random_state=rseed)

    # generate new training dataframe given undersampled comments
    cur_train_df = pd.concat([cur_train_df.loc[cur_train_df['y_nb'] == 1,:], cur_train_df_y0_undersample])

    train_df_list.append(cur_train_df)
    print(cur_train_df['y_nb'].value_counts())
train_df = pd.concat(train_df_list)
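
The undersampling step can be sketched without pandas as follows. This is a minimal illustration on made-up rows; `undersample` is a hypothetical helper, and it assumes (as the cell above does) that the non-toxic comments are the majority class.

```python
import random


def undersample(rows, seed):
    """Keep every toxic row and an equally sized random sample of non-toxic rows."""
    toxic = [row for row in rows if row['y_nb'] == 1]
    non_toxic = [row for row in rows if row['y_nb'] == 0]
    # Assumes the non-toxic rows outnumber the toxic ones.
    sampled = random.Random(seed).sample(non_toxic, k=len(toxic))
    return toxic + sampled


# Hypothetical toy data: 3 toxic and 9 non-toxic comments.
toy_rows = [{'text': f'comment {i}', 'y_nb': int(i % 4 == 0)} for i in range(12)]
balanced = undersample(toy_rows, seed=201)
print(sum(row['y_nb'] for row in balanced), len(balanced))
```

Fixing the seed makes the sample reproducible, which is the role `random_state=rseed` plays in the notebook.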

## Comments to vectors

Below we use a cleaning function from MANAV.

In [None]:
def text_cleaning(text):
    '''
    Cleans text into a basic form for NLP. Operations include the following:-
    1. Removes special characters like &, #, etc
    2. Removes extra spaces
    3. Removes embedded URL links
    4. Removes HTML tags
    5. Removes emojis
    
    text - Text piece to be cleaned.
    '''
    template = re.compile(r'https?://\S+|www\.\S+') #Removes website links
    text = template.sub(r'', text)
    
    soup = BeautifulSoup(text, 'lxml') #Removes HTML tags
    only_text = soup.get_text()
    text = only_text
    
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    
    text = re.sub(r"[^a-zA-Z\d]", " ", text) #Remove special characters
    text = re.sub(' +', ' ', text) #Remove Extra Spaces
    text = text.strip() # remove spaces at the beginning and at the end of string

    return text
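
To see what the regular-expression steps do on their own, here is a reduced sketch that uses only the standard library. `basic_clean` is a hypothetical name; it leaves out the HTML and emoji steps of `text_cleaning`, and the input string is made up.

```python
import re


def basic_clean(text):
    """Apply the URL, special-character, and whitespace steps of text_cleaning."""
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop links
    text = re.sub(r'[^a-zA-Z\d]', ' ', text)           # non-alphanumerics -> space
    text = re.sub(r' +', ' ', text)                    # collapse runs of spaces
    return text.strip()


print(basic_clean('Visit https://example.com NOW!!!   ok'))  # prints: Visit NOW ok
```

The order matters: links are removed first, because once punctuation has been replaced by spaces the URL pattern no longer matches.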

Below we use a cleaning function from ANDREJ MARINCHENKO.

In [None]:
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')

def clean(data, col):
    
    data[col] = data[col].str.replace(r'https?://\S+|www\.\S+', ' social medium ', regex=True)      
        
    data[col] = data[col].str.lower()
    data[col] = data[col].str.replace("4", "a") 
    data[col] = data[col].str.replace("2", "l")
    data[col] = data[col].str.replace("5", "s") 
    data[col] = data[col].str.replace("1", "i") 
    data[col] = data[col].str.replace("!", "i") 
    data[col] = data[col].str.replace("|", "i", regex=False) 
    data[col] = data[col].str.replace("0", "o") 
    data[col] = data[col].str.replace("l3", "b") 
    data[col] = data[col].str.replace("7", "t") 
    data[col] = data[col].str.replace("7", "+")  # no-op: every "7" was already replaced by "t" on the previous line
    data[col] = data[col].str.replace("8", "ate") 
    data[col] = data[col].str.replace("3", "e") 
    data[col] = data[col].str.replace("9", "g")
    data[col] = data[col].str.replace("6", "g")
    data[col] = data[col].str.replace("@", "a")
    data[col] = data[col].str.replace("$", "s", regex=False)
    data[col] = data[col].str.replace("#ofc", " of fuckin course ")
    data[col] = data[col].str.replace("fggt", " faggot ")
    data[col] = data[col].str.replace("your", " your ")
    data[col] = data[col].str.replace("self", " self ")
    data[col] = data[col].str.replace("cuntbag", " cunt bag ")
    data[col] = data[col].str.replace("fartchina", " fart china ")    
    data[col] = data[col].str.replace("youi", " you i ")
    data[col] = data[col].str.replace("cunti", " cunt i ")
    data[col] = data[col].str.replace("sucki", " suck i ")
    data[col] = data[col].str.replace("pagedelete", " page delete ")
    data[col] = data[col].str.replace("cuntsi", " cuntsi ")
    data[col] = data[col].str.replace("i'm", " i am ")
    data[col] = data[col].str.replace("offuck", " of fuck ")
    data[col] = data[col].str.replace("centraliststupid", " central ist stupid ")
    data[col] = data[col].str.replace("hitleri", " hitler i ")
    data[col] = data[col].str.replace("i've", " i have ")
    data[col] = data[col].str.replace("i'll", " sick ")
    data[col] = data[col].str.replace("fuck", " fuck ")
    data[col] = data[col].str.replace("f u c k", " fuck ")
    data[col] = data[col].str.replace("shit", " shit ")
    data[col] = data[col].str.replace("bunksteve", " bunk steve ")
    data[col] = data[col].str.replace('wikipedia', ' social medium ')
    data[col] = data[col].str.replace("faggot", " faggot ")
    data[col] = data[col].str.replace("delanoy", " delanoy ")
    data[col] = data[col].str.replace("jewish", " jewish ")
    data[col] = data[col].str.replace("sexsex", " sex ")
    data[col] = data[col].str.replace("allii", " all ii ")
    data[col] = data[col].str.replace("i'd", " i had ")
    data[col] = data[col].str.replace("'s", " is ")
    data[col] = data[col].str.replace("youbollocks", " you bollocks ")
    data[col] = data[col].str.replace("dick", " dick ")
    data[col] = data[col].str.replace("cuntsi", " cuntsi ")
    data[col] = data[col].str.replace("mothjer", " mother ")
    data[col] = data[col].str.replace("cuntfranks", " cunt ")
    data[col] = data[col].str.replace("ullmann", " jewish ")
    data[col] = data[col].str.replace("mr.", " mister ", regex=False)
    data[col] = data[col].str.replace("aidsaids", " aids ")
    data[col] = data[col].str.replace("njgw", " nigger ")
    data[col] = data[col].str.replace("wiki", " social medium ")
    data[col] = data[col].str.replace("administrator", " admin ")
    data[col] = data[col].str.replace("gamaliel", " jewish ")
    data[col] = data[col].str.replace("rvv", " vanadalism ")
    data[col] = data[col].str.replace("admins", " admin ")
    data[col] = data[col].str.replace("pensnsnniensnsn", " penis ")
    data[col] = data[col].str.replace("pneis", " penis ")
    data[col] = data[col].str.replace("pennnis", " penis ")
    data[col] = data[col].str.replace("pov.", " point of view ", regex=False)
    data[col] = data[col].str.replace("vandalising", " vandalism ")
    data[col] = data[col].str.replace("cock", " dick ")
    data[col] = data[col].str.replace("asshole", " asshole ")
    data[col] = data[col].str.replace("youi", " you ")
    data[col] = data[col].str.replace("afd", " all fucking day ")
    data[col] = data[col].str.replace("sockpuppets", " sockpuppetry ")
    data[col] = data[col].str.replace("iiprick", " iprick ")
    data[col] = data[col].str.replace("penisi", " penis ")
    data[col] = data[col].str.replace("warrior", " warrior ")
    data[col] = data[col].str.replace("loil", " laughing out insanely loud ")
    data[col] = data[col].str.replace("vandalise", " vanadalism ")
    data[col] = data[col].str.replace("helli", " helli ")
    data[col] = data[col].str.replace("lunchablesi", " lunchablesi ")
    data[col] = data[col].str.replace("special", " special ")
    data[col] = data[col].str.replace("ilol", " i lol ")
    data[col] = data[col].str.replace(r'\b[uU]\b', 'you', regex=True)
    data[col] = data[col].str.replace(r"what's", "what is ")
    data[col] = data[col].str.replace(r"\'s", " is ", regex=False)
    data[col] = data[col].str.replace(r"\'ve", " have ", regex=False)
    data[col] = data[col].str.replace(r"can't", "cannot ")
    data[col] = data[col].str.replace(r"n't", " not ")
    data[col] = data[col].str.replace(r"i'm", "i am ")
    data[col] = data[col].str.replace(r"\'re", " are ", regex=False)
    data[col] = data[col].str.replace(r"\'d", " would ", regex=False)
    data[col] = data[col].str.replace(r"\'ll", " will ", regex=False)
    data[col] = data[col].str.replace(r"\'scuse", " excuse ", regex=False)
    data[col] = data[col].str.replace(r'\s+', ' ', regex=True)  # collapses runs of whitespace into a single space
#     text = re.sub(r'\b([^\W\d_]+)(\s+\1)+\b', r'\1', re.sub(r'\W+', ' ', text).strip(), flags=re.I)  # remove repeating words coming immediately one after another
    data[col] = data[col].str.replace(r'(.)\1+', r'\1\1', regex=True) # 2 or more characters are replaced by 2 characters
#     text = re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags = re.I)
    data[col] = data[col].str.replace("[:|♣|'|§|♠|*|/|?|=|%|&|-|#|•|~|^|>|<|►|_]", '', regex=True)
    
    
    data[col] = data[col].str.replace(r"what's", "what is ")    
    data[col] = data[col].str.replace(r"\'ve", " have ", regex=False)
    data[col] = data[col].str.replace(r"can't", "cannot ")
    data[col] = data[col].str.replace(r"n't", " not ", regex=False)
    data[col] = data[col].str.replace(r"i'm", "i am ", regex=False)
    data[col] = data[col].str.replace(r"\'re", " are ", regex=False)
    data[col] = data[col].str.replace(r"\'d", " would ", regex=False)
    data[col] = data[col].str.replace(r"\'ll", " will ", regex=False)
    data[col] = data[col].str.replace(r"\'scuse", " excuse ", regex=False)
    data[col] = data[col].str.replace(r"\'s", " ", regex=False)

    # Clean some punctutations
    data[col] = data[col].str.replace('\n', ' \n ')
    data[col] = data[col].str.replace(r'([a-zA-Z]+)([/!?.])([a-zA-Z]+)',r'\1 \2 \3', regex=True)
    # Replace repeating characters more than 3 times to length of 3
    data[col] = data[col].str.replace(r'([*!?\'])\1\1{2,}',r'\1\1\1', regex=True)    
    # Add space around repeating characters
    data[col] = data[col].str.replace(r'([*!?\']+)',r' \1 ', regex=True)    
    # patterns with repeating characters 
    data[col] = data[col].str.replace(r'([a-zA-Z])\1{2,}\b',r'\1\1', regex=True)
    data[col] = data[col].str.replace(r'([a-zA-Z])\1\1{2,}\B',r'\1\1\1', regex=True)
    data[col] = data[col].str.replace(r'[ ]{2,}',' ', regex=True).str.strip()   
    data[col] = data[col].str.replace(r'[ ]{2,}',' ', regex=True).str.strip()   
    data[col] = data[col].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
    tqdm.pandas()
    data[col] = data[col].progress_apply(text_cleaning)
    return data

In [None]:
# Test clean function
test_clean_df = pd.DataFrame({"text":
                              ["heyy\n\nkkdsfj",
                               "hi   how/are/U ???",
                               "hey?????",
                               "noooo!!!!!!!!!   comeone !! ",
                              "cooooooooool     brooooooooooo  coool brooo",
                              "naaaahhhhhhh","'re been cool"]})
display(test_clean_df)
clean(test_clean_df,'text')

In [None]:
# create TF-IDF object
vec = TfidfVectorizer()

# fit the TF-IDF object to the CLEAN training comments
train_df = clean(train_df,'text')
X = vec.fit_transform(train_df['text'])
X
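
With its default settings, `TfidfVectorizer` weights each term by its count times a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and then scales every row to unit length. The sketch below reproduces that arithmetic in plain Python on two made-up comments. `tfidf_rows` is a hypothetical helper, and it splits on whitespace, which is a simplification of the vectorizer's own tokenizer.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for tokens in tokenised:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


# Hypothetical two-comment corpus.
rows = tfidf_rows(['you are nice', 'you are rude'])
print(rows[0])
```

Words shared by both comments ("you", "are") end up with a lower weight than the word that distinguishes them ("nice" or "rude"), which is why TF-IDF features are useful for separating toxic from non-toxic text.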

## Fit Models
We fit a Naive Bayes classifier (sklearn `MultinomialNB`) to the binary toxic/non-toxic labels of the training comments, and a ridge regression model (sklearn `Ridge`) to their weighted toxicity scores.

In [None]:
X

In [None]:
nb_model = MultinomialNB() 
nb_model.fit(X, train_df['y_nb'])

rr_model = Ridge(alpha=0.5)
rr_model.fit(X, train_df['y_rr'])

To validate the models we use the validation data provided for this competition.

In [None]:
val_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv")
val_df.head()

Both models are run on the comments labeled "less_toxic" and "more_toxic" separately. A more toxic comment should receive a higher value.

In [None]:
val_df = clean(val_df,'less_toxic')
val_df = clean(val_df,'more_toxic')
X_less_toxic = vec.transform(val_df['less_toxic'])
X_more_toxic = vec.transform(val_df['more_toxic'])

The `predict_proba` function returns a 2D array whose first column contains the probability that the comment is not toxic and whose second column contains the probability that the comment is toxic.

In [None]:
nb_p1 = nb_model.predict_proba(X_less_toxic)
nb_p2 = nb_model.predict_proba(X_more_toxic)
rr_p1 = rr_model.predict(X_less_toxic)
rr_p2 = rr_model.predict(X_more_toxic)


To validate the models we measure how often the "more_toxic" comment receives a higher value than the "less_toxic" comment.

In [None]:
# Validation Accuracy
#naive bayes
nb_acc = (nb_p1[:, 1] < nb_p2[:, 1]).mean()
# ridge regression
rr_acc = (rr_p1 < rr_p2).mean()
# both
mean_acc = (((nb_p1[:, 1]+rr_p1)/2) < ((nb_p2[:, 1]+rr_p2)/2)).mean()

print ("Naive Bayes Validation: %f" % nb_acc)
print ("Ridge Regression Validation: %f" % rr_acc)
print ("Mean Validation: %f" % mean_acc)
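
The accuracy measure above is the fraction of pairs that are ordered correctly. The sketch below works through it on four made-up pairs; all scores are hypothetical and `pairwise_accuracy` is not a function from this notebook. The numbers are chosen to show how averaging can fix a pair that each model gets wrong on its own.

```python
def pairwise_accuracy(less_scores, more_scores):
    """Fraction of pairs where the "more_toxic" comment scored strictly higher."""
    pairs = list(zip(less_scores, more_scores))
    return sum(less < more for less, more in pairs) / len(pairs)


# Hypothetical scores for four validation pairs.
nb_less, nb_more = [0.10, 0.80, 0.40, 0.30], [0.90, 0.60, 0.70, 0.35]
rr_less, rr_more = [0.05, 0.20, 0.50, 0.10], [0.30, 0.45, 0.40, 0.20]

# Average the two models comment by comment.
mean_less = [(a + b) / 2 for a, b in zip(nb_less, rr_less)]
mean_more = [(a + b) / 2 for a, b in zip(nb_more, rr_more)]

print(pairwise_accuracy(nb_less, nb_more))      # Naive Bayes alone
print(pairwise_accuracy(rr_less, rr_more))      # ridge regression alone
print(pairwise_accuracy(mean_less, mean_more))  # average of both
```

In this toy case each model misorders a different pair, so the average orders all four correctly. Real data will not be this tidy, but it is the reason for comparing the mean against each model separately.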

## Submission

In [None]:
sub_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")
sub_df = clean(sub_df,'text')
X_test = vec.transform(sub_df['text'])
nb_p3 = nb_model.predict_proba(X_test)
rr_p3 = rr_model.predict(X_test)

Add the predictions to the submission data frame.

In [None]:
sub_df['score'] =  ((nb_p3[:, 1]+rr_p3)/2)

In [None]:
# uncomment below to generate original submission file
#sub_df[['comment_id', 'score']].to_csv("submission.csv", index=False)

# Add more training data using current competition
The current competition provides 30108 rows of pairwise comparisons, which we can split for both training and validation. Unlike the comments from the first competition, which carry binary labels (1=toxic, 0=not), each comment from the current competition can be given a fractional label: the share of its pairwise comparisons in which it was judged the more toxic comment.

First we generate a table with two columns: 'text' and 'y_rr'. The 'y_rr' will equal *0* if the comment was considered less toxic and *1* if the comment was considered more toxic in the pairwise comparisons.

In [None]:
less_toxic_score_df=pd.DataFrame()
less_toxic_score_df["text"] = val_df["less_toxic"].copy()
less_toxic_score_df["y_rr"] = 0
more_toxic_score_df=pd.DataFrame()
more_toxic_score_df["text"] = val_df["more_toxic"].copy()
more_toxic_score_df["y_rr"] = 1
toxic_score_df = pd.concat([less_toxic_score_df, more_toxic_score_df], ignore_index=True)

In [None]:
toxic_score_df.head()

Group the `y_rr` values by comment text and take the average for each distinct comment. Then set `y_nb` to 0 if the average is less than or equal to 0.5 and to 1 otherwise.

In [None]:
# sort the comments (not necessary)
toxic_score_df = toxic_score_df.sort_values(by = 'text').copy()

# use groupby function to group the comments
val_score_df = toxic_score_df.groupby('text')['y_rr'].mean().reset_index()

# set `y_nb` to 0 if the average is less than or equal to 0.5 and to 1 otherwise
val_score_df['y_nb'] = (val_score_df['y_rr'] > .5).astype(int)

# drop the `score` column
#val_score_df.drop(['score'], axis=1, inplace=True)
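
The group-and-average step can be traced by hand on a tiny example. The sketch below uses made-up pairs of comment identifiers in place of the real comment texts and only the standard library.

```python
from collections import defaultdict

# Hypothetical (less_toxic, more_toxic) pairs standing in for the validation rows.
pairs = [('A', 'B'), ('A', 'B'), ('B', 'A'), ('C', 'B'), ('A', 'C')]

votes = defaultdict(list)
for less_toxic, more_toxic in pairs:
    votes[less_toxic].append(0)   # judged less toxic in this comparison
    votes[more_toxic].append(1)   # judged more toxic in this comparison

# Fractional target: share of comparisons in which the comment was "more toxic".
y_rr = {text: sum(v) / len(v) for text, v in votes.items()}
# Binary target: 1 only if the comment was "more toxic" more often than not.
y_nb = {text: int(score > 0.5) for text, score in y_rr.items()}

print(y_rr)
print(y_nb)
```

Comment C is judged more toxic in exactly half of its comparisons, so its average is 0.5 and it receives the binary label 0, matching the "less than or equal to 0.5" rule.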

In [None]:
print ("The training data has %i rows." % len(val_score_df))
print ("The training data has %i toxic comments." % (val_score_df['y_nb'] == 1).sum())
print ("The training data has %i non-toxic comments." % (val_score_df['y_nb'] == 0).sum())

This dataset does not look seriously imbalanced so we can continue by splitting the full validation set into a smaller training and validation set.

In [None]:
# the test size was chosen arbitrarily and could be tuned for better results;
# test_size=0.70 keeps 30% of the comments for training and 70% for validation
train_df2, val_df2 = train_test_split(val_score_df, test_size=0.70, random_state = rseed)

In [None]:
print ("The training data has %i rows." % len(train_df2))
print ("The training data has %i toxic comments." % (train_df2['y_nb'] == 1).sum())
print ("The training data has %i non-toxic comments." % (train_df2['y_nb'] == 0).sum())

### Comments to vectors

In [None]:
vec2 = TfidfVectorizer()

# fit the TF-IDF object to the training comments
train_df2 = clean(train_df2,'text')
X2 = vec2.fit_transform(train_df2['text'])
X2

### Fit Models
We fit the naive bayes and ridge regression models to the comments from this new training data

In [None]:
nb_model2 = MultinomialNB()
nb_model2.fit(X2, train_df2['y_nb'])

rr_model2 = Ridge(alpha=0.7)
rr_model2.fit(X2, train_df2['y_rr'])

### Use validation data from current competition to create separate model
Then validate our results on the new validation data

In [None]:
val_df2 = clean(val_df2,'text')
X_predict = vec2.transform(val_df2['text'])
nb_predictions = nb_model2.predict_proba(X_predict)
rr_predictions = rr_model2.predict(X_predict)

In [None]:
val_df2.loc[:,'nb_pred']=nb_predictions[:, 1]
val_df2.loc[:,'rr_pred']=rr_predictions
# limit values to be between 0 to 1
val_df2.loc[val_df2["rr_pred"]<0,"rr_pred"] = 0
val_df2.loc[val_df2["rr_pred"]>1,"rr_pred"] = 1
val_df2.loc[:,'pred_score']=(val_df2["nb_pred"] + val_df2["rr_pred"])/2
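
The reason for clipping is that the Naive Bayes output is a probability between 0 and 1, while ridge regression predictions are unbounded. A minimal sketch of the clip-then-average step for single values, with `blend` as a hypothetical helper name and made-up inputs:

```python
def blend(nb_prob, rr_pred):
    """Clip the ridge prediction to [0, 1] and average it with the NB probability."""
    rr_clipped = min(1.0, max(0.0, rr_pred))
    return (nb_prob + rr_clipped) / 2


print(blend(0.8, 1.4))    # ridge prediction above 1 is treated as 1
print(blend(0.2, -0.3))   # ridge prediction below 0 is treated as 0
```

Without the clip, a single extreme ridge prediction could dominate the average and outweigh the Naive Bayes probability entirely.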



Below I created a function to measure the expected margin rank loss given the validation comment scores.

In [None]:
# Takes a dataframe in which each comment has a predicted and a true score,
# splits the rows into two equal halves, pairs them row by row, and returns
# the margin ranking loss (margin 0) of each pair
def marginRankLossF(df, predScoreCol, trueScoreCol):
    df = df.reset_index(drop=True)
    # if the number of rows is odd, the last row is left out
    half_rows = len(df) // 2
    first = df.iloc[:half_rows].reset_index(drop=True)
    second = df.iloc[half_rows:2 * half_rows].reset_index(drop=True)
    # target is 1 if the first comment is truly more toxic, otherwise -1 (ties count as -1)
    target = (first[trueScoreCol] > second[trueScoreCol]).astype(int).replace(0, -1)
    # loss = max(0, -target * (pred1 - pred2)); it is 0 when the pair is ordered correctly
    loss = -target * (first[predScoreCol] - second[predScoreCol])
    return loss.clip(lower=0)

In [None]:
print(marginRankLossF(val_df2, 'pred_score', 'y_rr').describe())
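
For a single pair, the loss computed above is max(0, -y * (x1 - x2)), where x1 and x2 are the predicted scores and y is 1 when the first comment is truly more toxic and -1 otherwise. The sketch below evaluates it for two made-up pairs; `margin_rank_loss` is a hypothetical scalar version of the function above.

```python
def margin_rank_loss(pred1, pred2, true1, true2):
    """Margin ranking loss with margin 0 for one pair of comments."""
    target = 1 if true1 > true2 else -1
    return max(0.0, -target * (pred1 - pred2))


# Hypothetical pairs, given as (pred1, pred2, true1, true2).
print(margin_rank_loss(0.9, 0.2, 1.0, 0.0))  # correctly ordered pair
print(margin_rank_loss(0.2, 0.9, 1.0, 0.0))  # wrongly ordered pair
```

A correctly ordered pair contributes no loss, while a wrongly ordered pair is penalized by the size of the gap between the two predictions, so lower values are better.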

### Submission with new model

In [None]:
sub_df = sub_df.rename(
    columns={'score': 'score1'})
sub_df = clean(sub_df,'text')
X_test2 = vec2.transform(sub_df['text'])

nb_sub_pred2 = nb_model2.predict_proba(X_test2)
rr_sub_pred2 = rr_model2.predict(X_test2)

# limit prediction values to be between 0 to 1
rr_sub_pred2[rr_sub_pred2 < 0] = 0
rr_sub_pred2[rr_sub_pred2 > 1] = 1

In [None]:
sub_df['score2'] = (nb_sub_pred2[:, 1] + rr_sub_pred2)/2

Get the average of the two models.

In [None]:
sub_df.loc[:,'score'] = sub_df.loc[:,['score1','score2']].astype(float).mean(axis=1)

Generate the submission file.

In [None]:
#sub_df[['comment_id', 'score']].to_csv("submission.csv", index=False)

## Use validation data from current competition to add to data from the original competition

In [None]:
train_df3 = pd.concat([train_df,train_df2])

### Comments to vectors

In [None]:
# create TF-IDF object
vec3 = TfidfVectorizer()

# fit the TF-IDF object to the CLEAN training comments
train_df3 = clean(train_df3,'text')
X3 = vec3.fit_transform(train_df3['text'])
X3

### Fit Models
We fit the naive bayes and ridge regression models to the comments from this new training data

In [None]:
nb_model3 = MultinomialNB()
nb_model3.fit(X3, train_df3['y_nb'])

rr_model3 = Ridge(alpha=0.7)
rr_model3.fit(X3, train_df3['y_rr'])

### Use validation data from current competition to create separate model
Then validate our results on the new validation data

In [None]:
val_df2 = clean(val_df2,'text')
X_predict = vec3.transform(val_df2['text'])
nb_predictions = nb_model3.predict_proba(X_predict)
rr_predictions = rr_model3.predict(X_predict)

In [None]:
val_df2.loc[:,'nb_pred']=nb_predictions[:, 1]
val_df2.loc[:,'rr_pred']=rr_predictions
# limit values to be between 0 to 1
val_df2.loc[val_df2["rr_pred"]<0,"rr_pred"] = 0
val_df2.loc[val_df2["rr_pred"]>1,"rr_pred"] = 1
val_df2.loc[:,'pred_score']=(val_df2["nb_pred"] + val_df2["rr_pred"])/2

# note: despite its name, this column is 1 minus the absolute error, so higher is better
val_df2.loc[:,'error'] = 1-abs(val_df2['y_rr']-val_df2['pred_score'])

In [None]:
val_df2['error'].describe()

In [None]:
val_df2['error'].mean()

### Submission with new model

In [None]:
sub_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")
sub_df = clean(sub_df,'text')
X_test3 = vec3.transform(sub_df['text'])

nb_sub_pred3 = nb_model3.predict_proba(X_test3)
rr_sub_pred3 = rr_model3.predict(X_test3)

# limit prediction values to be between 0 to 1
rr_sub_pred3[rr_sub_pred3 < 0] = 0
rr_sub_pred3[rr_sub_pred3 > 1] = 1

sub_df['score'] = (nb_sub_pred3[:, 1] + rr_sub_pred3)/2

Generate the submission file.

In [None]:
#sub_df[['comment_id', 'score']].to_csv("submission.csv", index=False)

# Optimize the category weights of the Jigsaw Toxic Comment Classification challenge data for the Ridge Regression model

In [None]:

def objective(trial):
    
    # get dataset1
    pre_train_df1 = pd.read_csv(
        "/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv")

    # naive bayes target for dataset1
    pre_train_df1['y_nb'] = (
        pre_train_df1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
        .sum(axis=1) > 0 ).astype(int)

    
    obscene_w1 = trial.suggest_float('obscene_w1', 0, 1)
    toxic_w1 = trial.suggest_float('toxic_w1', 0, 1)
    threat_w1 = trial.suggest_float('threat_w1', 1, 2)
    insult_w1 = trial.suggest_float('insult_w1', 0, 1)
    severe_toxic_w1 = trial.suggest_float('severe_toxic_w1', 1, 2)
    identity_hate_w1 = trial.suggest_float('identity_hate_w1', 1, 2)
    
    
    # Create a score that measures how toxic a comment is
    toxicity_weights = {'obscene': obscene_w1, 'toxic': toxic_w1, 'threat': threat_w1, 
                'insult': insult_w1, 'severe_toxic': severe_toxic_w1, 'identity_hate': identity_hate_w1}

    for cat in toxicity_weights:
        pre_train_df1[cat] = pre_train_df1[cat] * toxicity_weights[cat]

    # ridge regression target for dataset1
    pre_train_df1['y_rr'] = pre_train_df1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1)

    # organize the dataframe to only include the text and targets
    pre_train_df1 = pre_train_df1[['comment_text', 'y_nb','y_rr']].rename(
        columns={'comment_text': 'text'})
    
    
    # get training data from validation set
    error_results=[]
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=rseed)
    for train_df2_index, val_df2_index in skf.split(val_score_df['text'], val_score_df['y_nb']):
        
        train_df2 = val_score_df.loc[train_df2_index,:]
        val_df2 = val_score_df.loc[val_df2_index,:]
    
        train_df_list=[train_df2]
        for cur_train_df in [pre_train_df1,pre_train_df2]:
            # undersample to the number of toxic comments (undersample_n)
            undersample_n = (cur_train_df['y_nb'] == 1).sum()

            # perform undersample
            cur_train_df_y0_undersample = cur_train_df.loc[cur_train_df['y_nb'] == 0,:].sample(
                n=undersample_n, random_state=rseed)

            # generate new training dataframe given undersampled comments
            cur_train_df = pd.concat([cur_train_df.loc[cur_train_df['y_nb'] == 1,:], cur_train_df_y0_undersample])

            train_df_list.append(cur_train_df)
        train_df3 = pd.concat(train_df_list)

        # create TF-IDF object
        vec3 = TfidfVectorizer()

        # fit the TF-IDF object to the CLEAN training comments
        train_df3 = clean(train_df3,'text')
        X3 = vec3.fit_transform(train_df3['text'])

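        # Naive Bayes classifier on the binary toxic / non-toxic target
        nb_model3 = MultinomialNB()
        nb_model3.fit(X3, train_df3['y_nb'])

        # Ridge regularization strength tuned by Optuna (repeated suggestions
        # within one trial return the same value, so every fold shares it)
        alpha_v = trial.suggest_float('alpha_v', 0, 2)

        # Ridge regression on the weighted toxicity score
        rr_model3 = Ridge(alpha=alpha_v)
        rr_model3.fit(X3, train_df3['y_rr'])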

        val_df2 = clean(val_df2,'text')
        X_predict = vec3.transform(val_df2['text'])
        nb_predictions = nb_model3.predict_proba(X_predict)
        rr_predictions = rr_model3.predict(X_predict)

        val_df2.loc[:,'nb_pred']=nb_predictions[:, 1]
        val_df2.loc[:,'rr_pred']=rr_predictions
        # limit ridge predictions to the range 0 to 1
        val_df2.loc[val_df2["rr_pred"]<0,"rr_pred"] = 0
        val_df2.loc[val_df2["rr_pred"]>1,"rr_pred"] = 1
        val_df2.loc[:,'pred_score']=(val_df2["nb_pred"] + val_df2["rr_pred"])/2

        #val_df2.loc[:,'error'] = 1-abs(val_df2['y_rr']-val_df2['pred_score'])
        #error_results.append(val_df2['error'].mean())
        error_results.append(marginRankLossF(val_df2, 'pred_score', 'y_rr').mean())
    print(error_results)
    return (sum(error_results) / len(error_results))

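# objective returns the mean of marginRankLossF across folds;
# minimize is Optuna's default direction and is made explicit here
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
best_param_dict = study.best_params
best_param_dict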
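
Inside the objective, the two model outputs are combined by clipping the ridge prediction to the range 0 to 1 and averaging it with the Naive Bayes probability of the toxic class. Here is a minimal sketch of that blending step on toy numbers. The function name `blend_scores` and the toy arrays are my own placeholders, not part of the notebook.

```python
import numpy as np

def blend_scores(nb_proba, rr_pred):
    """Average the NB toxic-class probability with a ridge score clipped to [0, 1].

    Hypothetical helper: mirrors the clip-then-average logic of the objective,
    but works on plain arrays instead of dataframe columns.
    """
    nb_proba = np.asarray(nb_proba, dtype=float)
    # Ridge regression is unbounded, so clip it to the probability range first
    rr_clipped = np.clip(np.asarray(rr_pred, dtype=float), 0.0, 1.0)
    return (nb_proba + rr_clipped) / 2.0

# Toy values (assumed): one ridge score above 1 and one below 0
nb_toy = [0.9, 0.2, 0.5]
rr_toy = [1.4, -0.3, 0.5]
blended = blend_scores(nb_toy, rr_toy)
print(blended)
```

Without the clipping, a single extreme ridge prediction could dominate the average, since the Naive Bayes probability can never leave the 0 to 1 range.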

Now I will rebuild the targets with the optimized weights, retrain both models using the entire validation set as additional training data, and generate a new submission.
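
As a rough illustration of how the weights turn the six binary labels into a single ridge regression target, here is a small sketch on a toy dataframe. The dictionary `example_best_params` holds made-up values standing in for `study.best_params`, and the toy rows are invented. Unlike the cell that follows, this sketch leaves the label columns unchanged and only adds `y_rr`.

```python
import pandas as pd

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Made-up values standing in for study.best_params (assumption, not real results)
example_best_params = {
    'toxic_w1': 0.3, 'severe_toxic_w1': 1.5, 'obscene_w1': 0.2,
    'threat_w1': 1.0, 'insult_w1': 0.6, 'identity_hate_w1': 1.2,
    'alpha_v': 1.0,
}

# Invented toy labels: three comments, six binary categories
toy_df = pd.DataFrame({
    'toxic':         [1, 0, 1],
    'severe_toxic':  [0, 0, 1],
    'obscene':       [1, 0, 0],
    'threat':        [0, 0, 0],
    'insult':        [0, 0, 1],
    'identity_hate': [0, 0, 0],
})

# One weight per label column, looked up by the '<label>_w1' naming pattern
weights = pd.Series({col: example_best_params[f'{col}_w1'] for col in label_cols})

# Weighted sum across the label columns gives the ridge regression target
toy_df['y_rr'] = toy_df[label_cols].mul(weights, axis=1).sum(axis=1)
print(toy_df['y_rr'].tolist())
```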

In [None]:
# get dataset1
pre_train_df1 = pd.read_csv(
    "/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv")

# naive bayes target
pre_train_df1['y_nb'] = (
    pre_train_df1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
    .sum(axis=1) > 0 ).astype(int)


# Create a score that measures how toxic a comment is
toxicity_weights = {'obscene': study.best_params['obscene_w1'],
                    'toxic': study.best_params['toxic_w1'],
                    'threat': study.best_params['threat_w1'], 
                    'insult': study.best_params['insult_w1'],
                    'severe_toxic': study.best_params['severe_toxic_w1'],
                    'identity_hate': study.best_params['identity_hate_w1']}

for cat in toxicity_weights:
    pre_train_df1[cat] = pre_train_df1[cat] * toxicity_weights[cat]

# ridge regression target for dataset1
pre_train_df1['y_rr'] = pre_train_df1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1)

# organize the dataframe to only include the text and targets
pre_train_df1 = pre_train_df1[['comment_text', 'y_nb','y_rr']].rename(
    columns={'comment_text': 'text'})

# this time use entire validation set for training the model
train_df_list=[]
for cur_train_df in [pre_train_df1,pre_train_df2,val_score_df]:
    # undersample to the number of toxic comments (undersample_n)
    undersample_n = (cur_train_df['y_nb'] == 1).sum()

    # perform undersample
    cur_train_df_y0_undersample = cur_train_df.loc[cur_train_df['y_nb'] == 0,:].sample(
        n=undersample_n, random_state=rseed)

    # generate new training dataframe from the undersampled comments
    cur_train_df = pd.concat([cur_train_df.loc[cur_train_df['y_nb'] == 1,:], cur_train_df_y0_undersample])

    train_df_list.append(cur_train_df)
train_df3 = pd.concat(train_df_list)

# create TF-IDF object
vec3 = TfidfVectorizer()

# fit the TF-IDF object to the CLEAN training comments
train_df3 = clean(train_df3,'text')
X3 = vec3.fit_transform(train_df3['text'])

nb_model3 = MultinomialNB()
nb_model3.fit(X3, train_df3['y_nb'])

rr_model3 = Ridge(alpha=study.best_params['alpha_v'])
rr_model3.fit(X3, train_df3['y_rr'])

# predict submission comments

sub_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")
sub_df = clean(sub_df,'text')
X_test3 = vec3.transform(sub_df['text'])

nb_sub_pred3 = nb_model3.predict_proba(X_test3)
rr_sub_pred3 = rr_model3.predict(X_test3)

# limit ridge prediction values to the range 0 to 1
rr_sub_pred3[rr_sub_pred3 < 0] = 0
rr_sub_pred3[rr_sub_pred3 > 1] = 1

sub_df['score'] = (nb_sub_pred3[:, 1] + rr_sub_pred3)/2

sub_df[['comment_id', 'score']].to_csv("submission.csv", index=False)
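
The undersampling loop appears in both the objective and the final training cell, so it could be factored into one helper. Below is a minimal sketch of such a helper on a toy dataframe. The name `undersample_negatives` and the toy data are hypothetical. One deliberate difference from the loop above: the sample size is capped at the number of non-toxic rows, because `DataFrame.sample` without replacement raises an error when asked for more rows than exist.

```python
import pandas as pd

def undersample_negatives(df, label_col='y_nb', seed=0):
    """Keep all toxic rows and an equally sized random sample of non-toxic rows.

    Hypothetical helper, not part of the notebook.
    """
    pos = df.loc[df[label_col] == 1]
    neg = df.loc[df[label_col] == 0]
    # Cap the sample size so sample() never asks for more rows than available
    n = min(len(pos), len(neg))
    return pd.concat([pos, neg.sample(n=n, random_state=seed)])

# Toy data (assumed): 3 toxic and 7 non-toxic comments
toy = pd.DataFrame({
    'text': [f'comment {i}' for i in range(10)],
    'y_nb': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})
balanced = undersample_negatives(toy, seed=42)
print(balanced['y_nb'].value_counts())
```

With a helper like this, the training list could be built as `[undersample_negatives(d, seed=rseed) for d in (pre_train_df1, pre_train_df2, val_score_df)]`, assuming each dataframe carries a `y_nb` column as above.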