# Background
This notebook was inspired by the following notebooks:
* notebook by JULIÁN PELLER: https://www.kaggle.com/julian3833/jigsaw-incredibly-simple-naive-bayes-0-768
* notebook by ANDREJ MARINCHENKO: https://www.kaggle.com/andrej0marinchenko/jigsaw-ensemble-0-86
* notebook by MANAV: https://www.kaggle.com/manabendrarout/pytorch-roberta-ranking-baseline-jrstc-infer

So far this notebook is split into three submissions:
1. Using the current competition training datasets as validation data for running a Naive Bayes model
2. Using the current competition training datasets as separate training data to generate two Naive Bayes models
3. Using the current competition training datasets as additional training data to generate a single Naive Bayes model

# Imports
Import the libraries and the data from the following sources:
* Jigsaw Toxic Comment Classification challenge - flags each comment with 1 or 0 for 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', and 'identity_hate'
* Ruddit Jigsaw dataset - scores each comment between -1 (maximally supportive) and 1 (maximally offensive)
* Jigsaw Unintended Bias in Toxicity Classification - scores each comment for 'toxic', 'severe_toxicity', 'obscene', 'threat', 'insult', and 'identity_attack'
* Jigsaw Toxic Severity rating - ranks how toxic a comment is compared to other comments

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import TfidfVectorizer # converts sentences into vectors
from sklearn.naive_bayes import MultinomialNB # Naive Bayes model for discrete scores
import re # regular expression library
from bs4 import BeautifulSoup # parses HTML so that tags can be stripped
from tqdm.auto import tqdm # progress bar
from scipy.sparse import vstack # concatenate sparse matrices

 
# imbalanced dataset methods
from imblearn.under_sampling import RandomUnderSampler # used to randomly under-sample imbalanced dataset
from imblearn.over_sampling import SMOTE # used to over-sample imbalanced dataset with SMOTE
from imblearn.over_sampling import ADASYN # used to over-sample imbalanced dataset with ADASYN
# ignore SettingWithCopyWarning
import warnings
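try:
    from pandas.errors import SettingWithCopyWarning # pandas >= 1.5
except ImportError:
    from pandas.core.common import SettingWithCopyWarning # older pandas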
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# set seed for randomness
rseed=201

# Create training data from other jigsaw competitions
To train the model we need comments (X) as features and a "ground truth" for each comment's toxicity (y). We use the Jigsaw Toxic Comment Classification challenge training data, which flags each comment for 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', and 'identity_hate' using 1's and 0's. Since all of these categories count as toxic in this competition, we collapse them into a single binary label: a comment is toxic if any of the six flags is set.
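As a minimal sketch of that label collapse, the plain-Python version below marks a comment as toxic when any of the six flags equals 1. The three rows are made-up examples, and `to_binary_label` is a hypothetical helper name, not part of the notebook.

```python
# Column names of the six label flags in the classification data.
LABELS = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

def to_binary_label(row):
    """Return 1 if any of the six label flags is set, else 0."""
    return int(any(row[name] == 1 for name in LABELS))

# Made-up rows for illustration only.
rows = [
    {'toxic': 0, 'severe_toxic': 0, 'obscene': 0, 'threat': 0, 'insult': 0, 'identity_hate': 0},
    {'toxic': 1, 'severe_toxic': 0, 'obscene': 1, 'threat': 0, 'insult': 0, 'identity_hate': 0},
    {'toxic': 0, 'severe_toxic': 0, 'obscene': 0, 'threat': 0, 'insult': 1, 'identity_hate': 0},
]
binary_labels = [to_binary_label(r) for r in rows]
print(binary_labels)
```

The pandas cell further down does the same thing in one vectorized step by summing the six columns and checking whether the sum is greater than 0.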

## Import Jigsaw Toxic Comment Classification challenge data

In [None]:
pre_train_df1 = pd.read_csv(
    "/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv")
pre_train_df1['y'] = (
    pre_train_df1[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
    .sum(axis=1) > 0 ).astype(int)
pre_train_df1 = pre_train_df1[['comment_text', 'y']].rename(
    columns={'comment_text': 'text'})
pre_train_df1.sample(5)

## Import Ruddit Jigsaw dataset

In [None]:
pre_train_df2 = pd.read_csv(
    "/kaggle/input/ruddit-jigsaw-dataset/Dataset/ruddit_with_text.csv")

# rename the columns containing the text data and score
pre_train_df2 = pre_train_df2[['txt', 'offensiveness_score']].rename(columns={'txt': 'text',
                                                          'offensiveness_score':'score'})


# label all comments with an offensiveness score greater than 0 as 1's, otherwise 0's
pre_train_df2['y'] = (pre_train_df2['score'] > 0 ).astype(int)

# uncomment below to transform the offensiveness score to range 0-1
#pre_train_df2['score'] = ((pre_train_df2['score'] - pre_train_df2.score.min())
#                      / (pre_train_df2.score.max() - pre_train_df2.score.min()))

pre_train_df2.drop(['score'], axis=1, inplace=True)
pre_train_df2.sample(5)

## Import Jigsaw Unintended Bias challenge data

In [None]:
if 1==0:
    pre_train_df3 = pd.read_csv(
        "/kaggle/input/jigsaw-multilingual-toxic-comment-classification/jigsaw-unintended-bias-train.csv")
    print(len(pre_train_df3))
    pre_train_df3['y'] = (
        pre_train_df3[['toxic', 'severe_toxicity', 'obscene', 'threat', 'insult', 'identity_attack']]
        .sum(axis=1) > 0 ).astype(int)
    pre_train_df3 = pre_train_df3[['comment_text', 'y']].rename(
        columns={'comment_text': 'text'})
    pre_train_df3.sample(3)


## Comments to vectors

Below we define a text-cleaning function adapted from MANAV's notebook.

In [None]:
def text_cleaning(text):
    '''
    Cleans text into a basic form for NLP. Operations include the following:-
    1. Removes special characters like &, #, etc
    2. Removes extra spaces
    3. Removes embedded URL links
    4. Removes HTML tags
    5. Removes emojis
    
    text - Text piece to be cleaned.
    '''
    template = re.compile(r'https?://\S+|www\.\S+') #Removes website links
    text = template.sub(r'', text)
    
    soup = BeautifulSoup(text, 'lxml') #Removes HTML tags
    only_text = soup.get_text()
    text = only_text
    
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    
    text = re.sub(r"[^a-zA-Z\d]", " ", text) #Remove special characters
    text = re.sub(' +', ' ', text) #Remove Extra Spaces
    text = text.strip() # remove spaces at the beginning and at the end of string

    return text

Below we define a cleaning function adapted from ANDREJ MARINCHENKO's notebook.

In [None]:
import nltk
from nltk.corpus import stopwords
stop = stopwords.words('english')

def clean(data, col):
    
    data[col] = data[col].str.replace(r'https?://\S+|www\.\S+', ' social medium ', regex=True)      
        
    data[col] = data[col].str.lower()
    data[col] = data[col].str.replace("4", "a") 
    data[col] = data[col].str.replace("2", "l")
    data[col] = data[col].str.replace("5", "s") 
    data[col] = data[col].str.replace("1", "i") 
    data[col] = data[col].str.replace("!", "i") 
    data[col] = data[col].str.replace("|", "i", regex=False) 
    data[col] = data[col].str.replace("0", "o") 
    data[col] = data[col].str.replace("l3", "b") 
    data[col] = data[col].str.replace("7", "t") 
    data[col] = data[col].str.replace("7", "+") 
    data[col] = data[col].str.replace("8", "ate") 
    data[col] = data[col].str.replace("3", "e") 
    data[col] = data[col].str.replace("9", "g")
    data[col] = data[col].str.replace("6", "g")
    data[col] = data[col].str.replace("@", "a")
    data[col] = data[col].str.replace("$", "s", regex=False)
    data[col] = data[col].str.replace("#ofc", " of fuckin course ")
    data[col] = data[col].str.replace("fggt", " faggot ")
    data[col] = data[col].str.replace("your", " your ")
    data[col] = data[col].str.replace("self", " self ")
    data[col] = data[col].str.replace("cuntbag", " cunt bag ")
    data[col] = data[col].str.replace("fartchina", " fart china ")    
    data[col] = data[col].str.replace("youi", " you i ")
    data[col] = data[col].str.replace("cunti", " cunt i ")
    data[col] = data[col].str.replace("sucki", " suck i ")
    data[col] = data[col].str.replace("pagedelete", " page delete ")
    data[col] = data[col].str.replace("cuntsi", " cuntsi ")
    data[col] = data[col].str.replace("i'm", " i am ")
    data[col] = data[col].str.replace("offuck", " of fuck ")
    data[col] = data[col].str.replace("centraliststupid", " central ist stupid ")
    data[col] = data[col].str.replace("hitleri", " hitler i ")
    data[col] = data[col].str.replace("i've", " i have ")
    data[col] = data[col].str.replace("i'll", " sick ")
    data[col] = data[col].str.replace("fuck", " fuck ")
    data[col] = data[col].str.replace("f u c k", " fuck ")
    data[col] = data[col].str.replace("shit", " shit ")
    data[col] = data[col].str.replace("bunksteve", " bunk steve ")
    data[col] = data[col].str.replace('wikipedia', ' social medium ')
    data[col] = data[col].str.replace("faggot", " faggot ")
    data[col] = data[col].str.replace("delanoy", " delanoy ")
    data[col] = data[col].str.replace("jewish", " jewish ")
    data[col] = data[col].str.replace("sexsex", " sex ")
    data[col] = data[col].str.replace("allii", " all ii ")
    data[col] = data[col].str.replace("i'd", " i had ")
    data[col] = data[col].str.replace("'s", " is ")
    data[col] = data[col].str.replace("youbollocks", " you bollocks ")
    data[col] = data[col].str.replace("dick", " dick ")
    data[col] = data[col].str.replace("cuntsi", " cuntsi ")
    data[col] = data[col].str.replace("mothjer", " mother ")
    data[col] = data[col].str.replace("cuntfranks", " cunt ")
    data[col] = data[col].str.replace("ullmann", " jewish ")
    data[col] = data[col].str.replace("mr.", " mister ", regex=False)
    data[col] = data[col].str.replace("aidsaids", " aids ")
    data[col] = data[col].str.replace("njgw", " nigger ")
    data[col] = data[col].str.replace("wiki", " social medium ")
    data[col] = data[col].str.replace("administrator", " admin ")
    data[col] = data[col].str.replace("gamaliel", " jewish ")
    data[col] = data[col].str.replace("rvv", " vanadalism ")
    data[col] = data[col].str.replace("admins", " admin ")
    data[col] = data[col].str.replace("pensnsnniensnsn", " penis ")
    data[col] = data[col].str.replace("pneis", " penis ")
    data[col] = data[col].str.replace("pennnis", " penis ")
    data[col] = data[col].str.replace("pov.", " point of view ", regex=False)
    data[col] = data[col].str.replace("vandalising", " vandalism ")
    data[col] = data[col].str.replace("cock", " dick ")
    data[col] = data[col].str.replace("asshole", " asshole ")
    data[col] = data[col].str.replace("youi", " you ")
    data[col] = data[col].str.replace("afd", " all fucking day ")
    data[col] = data[col].str.replace("sockpuppets", " sockpuppetry ")
    data[col] = data[col].str.replace("iiprick", " iprick ")
    data[col] = data[col].str.replace("penisi", " penis ")
    data[col] = data[col].str.replace("warrior", " warrior ")
    data[col] = data[col].str.replace("loil", " laughing out insanely loud ")
    data[col] = data[col].str.replace("vandalise", " vanadalism ")
    data[col] = data[col].str.replace("helli", " helli ")
    data[col] = data[col].str.replace("lunchablesi", " lunchablesi ")
    data[col] = data[col].str.replace("special", " special ")
    data[col] = data[col].str.replace("ilol", " i lol ")
    data[col] = data[col].str.replace(r'\b[uU]\b', 'you', regex=True)
    data[col] = data[col].str.replace(r"what's", "what is ")
    # with regex=False the pattern is matched literally, so the apostrophe must not be backslash-escaped
    data[col] = data[col].str.replace("'s", " is ", regex=False)
    data[col] = data[col].str.replace("'ve", " have ", regex=False)
    data[col] = data[col].str.replace(r"can't", "cannot ")
    data[col] = data[col].str.replace(r"n't", " not ")
    data[col] = data[col].str.replace(r"i'm", "i am ")
    data[col] = data[col].str.replace("'re", " are ", regex=False)
    data[col] = data[col].str.replace("'d", " would ", regex=False)
    data[col] = data[col].str.replace("'ll", " will ", regex=False)
    data[col] = data[col].str.replace("'scuse", " excuse ", regex=False)
    data[col] = data[col].str.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace into a single space
#     text = re.sub(r'\b([^\W\d_]+)(\s+\1)+\b', r'\1', re.sub(r'\W+', ' ', text).strip(), flags=re.I)  # remove repeating words coming immediately one after another
    data[col] = data[col].str.replace(r'(.)\1+', r'\1\1', regex=True) # 2 or more characters are replaced by 2 characters
#     text = re.sub(r'((\b\w+\b.{1,2}\w+\b)+).+\1', r'\1', text, flags = re.I)
    data[col] = data[col].str.replace("[:|♣|'|§|♠|*|/|?|=|%|&|-|#|•|~|^|>|<|►|_]", '', regex=True)
    
    
    data[col] = data[col].str.replace(r"what's", "what is ")    
    data[col] = data[col].str.replace(r"\'ve", " have ", regex=False)
    data[col] = data[col].str.replace(r"can't", "cannot ")
    data[col] = data[col].str.replace(r"n't", " not ", regex=False)
    data[col] = data[col].str.replace(r"i'm", "i am ", regex=False)
    data[col] = data[col].str.replace(r"\'re", " are ", regex=False)
    data[col] = data[col].str.replace(r"\'d", " would ", regex=False)
    data[col] = data[col].str.replace(r"\'ll", " will ", regex=False)
    data[col] = data[col].str.replace(r"\'scuse", " excuse ", regex=False)
    data[col] = data[col].str.replace(r"\'s", " ", regex=False)

    # Clean some punctutations
    data[col] = data[col].str.replace('\n', ' \n ')
    data[col] = data[col].str.replace(r'([a-zA-Z]+)([/!?.])([a-zA-Z]+)',r'\1 \2 \3', regex=True)
    # Replace repeating characters more than 3 times to length of 3
    data[col] = data[col].str.replace(r'([*!?\'])\1\1{2,}',r'\1\1\1', regex=True)    
    # Add space around repeating characters
    data[col] = data[col].str.replace(r'([*!?\']+)',r' \1 ', regex=True)    
    # patterns with repeating characters 
    data[col] = data[col].str.replace(r'([a-zA-Z])\1{2,}\b',r'\1\1', regex=True)
    data[col] = data[col].str.replace(r'([a-zA-Z])\1\1{2,}\B',r'\1\1\1', regex=True)
    data[col] = data[col].str.replace(r'[ ]{2,}',' ', regex=True).str.strip()   
    data[col] = data[col].str.replace(r'[ ]{2,}',' ', regex=True).str.strip()   
    data[col] = data[col].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
    tqdm.pandas()
    data[col] = data[col].progress_apply(text_cleaning)
    return data

The code below runs the cleaning functions above on a few sample strings.

In [None]:
# Test clean function
test_clean_df = pd.DataFrame({"text":
                              ["heyy\n\nkkdsfj",
                               "hi   how/are/U ???",
                               "hey?????",
                               "noooo!!!!!!!!!   comeone !! ",
                              "cooooooooool     brooooooooooo  coool brooo",
                              "naaaahhhhhhh","'re been cool"]})
display(test_clean_df)
clean(test_clean_df,'text')

Now we can clean up the comments in the training data.

In [None]:
# clean the comments in both training datasets
pre_train_df1 = clean(pre_train_df1,'text')
pre_train_df2 = clean(pre_train_df2,'text')

# combine the CLEAN training comments so the TF-IDF object can be fit on all of them
pre_train_df = pd.concat([pre_train_df1,pre_train_df2])

Now we can vectorize the comments in each training dataset. Note that we must first fit the TF-IDF vectorizer on all the comments from both datasets so that they share one vocabulary. When building the vocabulary I chose to ignore terms that appear in fewer than 0.001% of the documents (`min_df=1e-5`).
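To make the `min_df` threshold concrete, here is a small sketch of how a fractional `min_df` translates into a document count. It assumes scikit-learn's behaviour of multiplying a float `min_df` by the number of documents and dropping terms whose document count falls strictly below that product. The corpus sizes are illustrative, not the real ones, and `smallest_kept_document_frequency` is a hypothetical helper.

```python
import math

def smallest_kept_document_frequency(min_df, n_docs):
    """Smallest number of documents a term must appear in to survive a fractional min_df."""
    # The cut-off in documents is min_df * n_docs; terms strictly below it are dropped,
    # so the smallest surviving count is the cut-off rounded up (and at least 1).
    return max(1, math.ceil(min_df * n_docs))

# Illustrative corpus sizes (assumptions, not the notebook's actual row counts).
print(smallest_kept_document_frequency(1e-5, 165_000))
print(smallest_kept_document_frequency(1e-5, 50_000))
```

Under these assumptions, a corpus of roughly 165,000 comments would drop only the terms that occur in a single comment, while a corpus of 50,000 comments would keep every term.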

In [None]:
# generate a model for the words
vec = TfidfVectorizer(min_df=1e-5)
vec.fit(pre_train_df['text'])

# print the vocabulary size
print(len(vec.vocabulary_))

# transform the text into sparse matrix format
X_1 = vec.transform(pre_train_df1['text'])
X_2 = vec.transform(pre_train_df2['text'])
X_2

## Imbalanced dataset
Below we check whether each dataset has a roughly equal number of toxic and non-toxic comments.

In [None]:
print ("The first training dataset has %i rows." % len(pre_train_df1))
print ("The first training dataset has %i toxic comments." % (pre_train_df1['y'] == 1).sum())
print ("The first training dataset has %i non-toxic comments." % (pre_train_df1['y'] == 0).sum())

print ("The second training dataset has %i rows." % len(pre_train_df2))
print ("The second training dataset has %i toxic comments." % (pre_train_df2['y'] == 1).sum())
print ("The second training dataset has %i non-toxic comments." % (pre_train_df2['y'] == 0).sum())


The first dataset is heavily imbalanced and the second is slightly imbalanced. Below is code to under-sample the majority class so that both classes have the same number of comments, which keeps the majority class from dominating the model. However, we lose a lot of training data this way.
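The idea behind random under-sampling can be sketched in a few lines of plain Python: keep a random subset of each class whose size equals the minority-class count. This is only an illustration of the technique with made-up data, not the `imblearn` implementation, and `random_undersample` is a hypothetical helper.

```python
import random
from collections import Counter

def random_undersample(texts, labels, seed=201):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))  # sample without replacement
    kept.sort()  # preserve the original row order
    return [texts[i] for i in kept], [labels[i] for i in kept]

# Made-up data: seven non-toxic comments and three toxic ones.
texts = [f"comment {i}" for i in range(10)]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
sampled_texts, sampled_labels = random_undersample(texts, labels)
print(Counter(sampled_labels))
```

In this toy case four of the seven non-toxic comments are discarded, which shows why the approach becomes costly when the imbalance is large.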

In [None]:
if 1==0:
    rus = RandomUnderSampler(random_state=rseed)
    rus_pre_train_x1, rus_pretrain_y1 = (
        rus.fit_resample(X_1, pre_train_df1["y"]))
    rus_pre_train_x2, rus_pretrain_y2 = (
        rus.fit_resample(X_2, pre_train_df2["y"]))
    
    X_train = vstack([rus_pre_train_x1,rus_pre_train_x2])
    y_train = pd.concat([rus_pretrain_y1, rus_pretrain_y2])

To avoid discarding training data, I want to try over-sampling the minority class using SMOTE.
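For intuition, SMOTE creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step on two made-up feature vectors. The neighbour search is omitted and `smote_like_sample` is a hypothetical helper, so this is not the `imblearn` implementation.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the line segment between point and neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(201)
# Made-up TF-IDF-like vectors for two minority-class comments.
minority_point = [0.0, 0.2, 0.0]
minority_neighbor = [0.4, 0.0, 0.0]
synthetic = smote_like_sample(minority_point, minority_neighbor, rng)
print(synthetic)
```

Because the synthetic point lies between two non-negative vectors, its features stay non-negative, which matters here since `MultinomialNB` expects non-negative inputs.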

In [None]:
if 1==1:
    sm = SMOTE(random_state=rseed)
    sm_pre_train_x1, sm_pretrain_y1 = (
        sm.fit_resample(X_1, pre_train_df1["y"]))
    sm_pre_train_x2, sm_pretrain_y2 = (
        sm.fit_resample(X_2, pre_train_df2["y"]))
    
    X_train = vstack([sm_pre_train_x1,sm_pre_train_x2])
    y_train = pd.concat([sm_pretrain_y1, sm_pretrain_y2])

In [None]:
print ("The resampled training data has %i rows." % len(y_train))
print ("The resampled training data has %i toxic comments." % (y_train == 1).sum())
print ("The resampled training data has %i non-toxic comments." % (y_train == 0).sum())

## Fit Naive Bayes
We fit the Naive Bayes model to the training comments, using the binary toxic/non-toxic label as the target.
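As a rough sketch of what a multinomial Naive Bayes classifier computes, the code below estimates class priors and Laplace-smoothed word likelihoods from raw token counts and turns them into class probabilities. It is a simplification under stated assumptions: it uses whitespace tokens and raw counts instead of TF-IDF weights, the four training sentences are made up, and both function names are hypothetical. It is not scikit-learn's implementation.

```python
import math
from collections import Counter

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods from token counts."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    classes = sorted(set(labels))
    log_prior, log_likelihood = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(class_docs) / len(docs))
        counts = Counter(tok for d in class_docs for tok in d.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {tok: math.log((counts[tok] + alpha) / total) for tok in vocab}
    return classes, log_prior, log_likelihood

def predict_proba(doc, model):
    """Return {class: posterior probability} for one whitespace-tokenized document."""
    classes, log_prior, log_likelihood = model
    scores = {}
    for c in classes:
        score = log_prior[c]
        for tok in doc.split():
            if tok in log_likelihood[c]:  # tokens outside the vocabulary are ignored
                score += log_likelihood[c][tok]
        scores[c] = score
    top = max(scores.values())  # subtract the max before exponentiating for stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    norm = sum(exp_scores.values())
    return {c: v / norm for c, v in exp_scores.items()}

# Made-up training sentences: 1 = toxic, 0 = not toxic.
docs = ["you are an idiot", "idiot go away", "thanks for the help", "have a nice day"]
labels = [1, 1, 0, 0]
model = fit_multinomial_nb(docs, labels)
proba = predict_proba("what an idiot", model)
print(proba)
```

The probability assigned to class 1 plays the same role as the second column of `predict_proba` used later for scoring comments.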

In [None]:
model = MultinomialNB() 
model.fit(X_train, y_train)

## Validate using the data from the current competition
To validate the Naive Bayes model we use the training data for this competition.

In [None]:
val_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/validation_data.csv")
val_df.head()

The Naive Bayes model is run on the comments labeled "less_toxic" and "more_toxic" separately. If a comment is more toxic, it should receive a higher value.

In [None]:
val_df = clean(val_df,'less_toxic')
val_df = clean(val_df,'more_toxic')
X_less_toxic = vec.transform(val_df['less_toxic'])
X_more_toxic = vec.transform(val_df['more_toxic'])

The `predict_proba` function returns a 2D array with one row per comment: the first column holds the probability that the comment is not toxic and the second column holds the probability that it is toxic.

In [None]:
p1 = model.predict_proba(X_less_toxic)
p2 = model.predict_proba(X_more_toxic)

To validate the model we measure how often the "more_toxic" comment received a higher value than its paired "less_toxic" comment.
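This pairwise check can be sketched in plain Python as the fraction of pairs in which the "more toxic" score is strictly larger. The scores are made-up numbers and `pairwise_accuracy` is a hypothetical helper name.

```python
def pairwise_accuracy(less_toxic_scores, more_toxic_scores):
    """Fraction of pairs where the 'more toxic' comment received the strictly higher score."""
    pairs = list(zip(less_toxic_scores, more_toxic_scores))
    correct = sum(1 for less, more in pairs if less < more)
    return correct / len(pairs)

# Made-up predicted toxicity probabilities for four comment pairs.
less = [0.10, 0.80, 0.30, 0.50]
more = [0.90, 0.40, 0.70, 0.50]
accuracy = pairwise_accuracy(less, more)
print(accuracy)
```

Note that a tie counts as a miss because the comparison is strict, which matches the `<` used in the notebook cell.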

In [None]:
# Validation Accuracy
(p1[:, 1] < p2[:, 1]).mean()

## Submission

In [None]:
sub_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")
sub_df = clean(sub_df,'text')
X_test = vec.transform(sub_df['text'])
p3 = model.predict_proba(X_test)

Add the predictions to the submission data frame.

In [None]:
sub_df['score'] = p3[:, 1]

In [None]:
# uncomment below to generate original submission file
#sub_df[['comment_id', 'score']].to_csv("submission.csv", index=False)

# Add more training data using current competition
We can split the 30108 rows of training data from the current competition into training and validation sets. Unlike the comments from the first competition, which have binary labels (1 = toxic, 0 = not), the comments from the current competition can be given a fractional score: the share of pairwise comparisons in which a comment was judged the more toxic of the pair.

First we generate a table with two columns: 'text' and 'score'. The score will equal *0* if the comment was considered less toxic and *1* if the comment was considered more toxic in the pairwise comparisons.

In [None]:
less_toxic_score_df=pd.DataFrame()
less_toxic_score_df["text"] = val_df["less_toxic"].copy()
less_toxic_score_df["score"] = 0
more_toxic_score_df=pd.DataFrame()
more_toxic_score_df["text"] = val_df["more_toxic"].copy()
more_toxic_score_df["score"] = 1
toxic_score_df = pd.concat([less_toxic_score_df, more_toxic_score_df], ignore_index=True)

Group the scores by comment text, take the average score of each distinct comment, and then set `y` to "0" if the average is less than or equal to 0.5 and to "1" otherwise.
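The whole aggregation (one 0/1 vote per appearance, averaging per comment, thresholding at 0.5) can be sketched in plain Python as follows. The pairs are made-up placeholders and `aggregate_pairwise_labels` is a hypothetical helper. The notebook itself does this with a pandas `groupby`.

```python
from collections import defaultdict

def aggregate_pairwise_labels(pairs, threshold=0.5):
    """Map each comment to (mean score, binary label) from (less_toxic, more_toxic) pairs."""
    votes = defaultdict(list)
    for less_toxic, more_toxic in pairs:
        votes[less_toxic].append(0)  # judged less toxic in this comparison
        votes[more_toxic].append(1)  # judged more toxic in this comparison
    result = {}
    for text, scores in votes.items():
        mean_score = sum(scores) / len(scores)
        result[text] = (mean_score, int(mean_score > threshold))
    return result

# Made-up comparisons between three placeholder comments "a", "b" and "c".
pairs = [("a", "b"), ("a", "c"), ("c", "b"), ("b", "a")]
labels = aggregate_pairwise_labels(pairs)
print(labels)
```

A comment whose average is exactly 0.5, like "c" here, is labeled 0 because the threshold comparison is strict.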

In [None]:
# sort the comments (not necessary)
toxic_score_df = toxic_score_df.sort_values(by = 'text').copy()

# use groupby function to group the comments
val_score_df = toxic_score_df.groupby('text')['score'].mean().reset_index()

# set `y` to "0" if the average is less than or equal to 0.5 and set `y` to "1" otherwise
val_score_df['y'] = (val_score_df['score'] > .5).astype(int)

# drop the `score` column
#val_score_df.drop(['score'], axis=1, inplace=True)

In [None]:
print ("The training data has %i rows." % len(val_score_df))
print ("The training data has %i toxic comments." % (val_score_df['y'] == 1).sum())
print ("The training data has %i non-toxic comments." % (val_score_df['y'] == 0).sum())

This dataset does not look seriously imbalanced, so we can continue by splitting the full validation set into a smaller training set and a smaller validation set.

In [None]:
from sklearn.model_selection import train_test_split

# the test size was chosen arbitrarily; tuning it may improve results
train_df2, val_df2 = train_test_split(val_score_df, test_size=0.70, random_state = rseed)

In [None]:
print ("The training data has %i rows." % len(train_df2))
print ("The training data has %i toxic comments." % (train_df2['y'] == 1).sum())
print ("The training data has %i non-toxic comments." % (train_df2['y'] == 0).sum())

### Comments to vectors

In [None]:
vec2 = TfidfVectorizer()

# fit the TF-IDF object to the training comments
train_df2 = clean(train_df2,'text')
X2 = vec2.fit_transform(train_df2['text'])
X2

### Fit Naive Bayes
We fit the Naive Bayes model to the comments from this new training data.

In [None]:
# fit a second Naive Bayes model on the new training data
model2 = MultinomialNB()
model2.fit(X2, train_df2['y'])

### Use validation data from current competition to create separate model
Then validate our results on the new validation data.

In [None]:
val_df2 = clean(val_df2,'text')
X_predict = vec2.transform(val_df2['text'])
predictions = model2.predict_proba(X_predict)

In [None]:
val_df2.loc[:,'pred_score']=predictions[:, 1]
# despite its name, 'error' is 1 - |score - pred_score|, so values closer to 1 are better
val_df2.loc[:,'error'] = 1-abs(val_df2['score']-val_df2['pred_score'])

In [None]:
val_df2['error'].describe()

### Submission with new model

In [None]:
sub_df = sub_df.rename(
    columns={'score': 'score1'})
sub_df = clean(sub_df,'text')
X_test2 = vec2.transform(sub_df['text'])
submission_predictions2 = model2.predict_proba(X_test2)
sub_df['score2'] = submission_predictions2[:, 1]

Get the average of the two models' scores.

In [None]:
sub_df.loc[:,'score'] = sub_df.loc[:,['score1','score2']].astype(float).mean(axis=1)

Generate the submission file.

In [None]:
#sub_df[['comment_id', 'score']].to_csv("submission.csv", index=False)

## Use validation data from current competition to add to data from the original competition

In [None]:
pre_train_df = pd.concat([pre_train_df,train_df2.loc[:,["text","y"]]])

Generate sparse matrices using the comments from all the training data.

In [None]:
# generate a model for the words
vec3 = TfidfVectorizer(min_df=1e-5)
vec3.fit(pre_train_df['text'])

# print the vocabulary size
print(len(vec3.vocabulary_))

# transform the text into sparse matrix format
X_1 = vec3.transform(pre_train_df1['text'])
X_2 = vec3.transform(pre_train_df2['text'])
X_3 = vec3.transform(train_df2['text'])
X_3

Recall that the data is imbalanced, so we over-sample again.

In [None]:
print ("The first training dataset has %i rows." % len(pre_train_df1))
print ("The first training dataset has %i toxic comments." % (pre_train_df1['y'] == 1).sum())
print ("The first training dataset has %i non-toxic comments." % (pre_train_df1['y'] == 0).sum())

print ("The second training dataset has %i rows." % len(pre_train_df2))
print ("The second training dataset has %i toxic comments." % (pre_train_df2['y'] == 1).sum())
print ("The second training dataset has %i non-toxic comments." % (pre_train_df2['y'] == 0).sum())

print ("The third training dataset has %i rows." % len(train_df2))
print ("The third training dataset has %i toxic comments." % (train_df2['y'] == 1).sum())
print ("The third training dataset has %i non-toxic comments." % (train_df2['y'] == 0).sum())

In [None]:
if 1==1:
    sm = SMOTE(random_state=rseed)
    sm_pre_train_x1, sm_pretrain_y1 = (
        sm.fit_resample(X_1, pre_train_df1["y"]))
    sm_pre_train_x2, sm_pretrain_y2 = (
        sm.fit_resample(X_2, pre_train_df2["y"]))
    sm_pre_train_x3, sm_pretrain_y3 = (
        sm.fit_resample(X_3, train_df2["y"]))
    
    X_train = vstack([sm_pre_train_x1, sm_pre_train_x2, sm_pre_train_x3])
    y_train = pd.concat([sm_pretrain_y1, sm_pretrain_y2, sm_pretrain_y3])

### Fit Naive Bayes
We fit the Naive Bayes model to the comments from this combined training data.

In [None]:
X_train = vstack([sm_pre_train_x1, sm_pre_train_x2, sm_pre_train_x3])
y_train = pd.concat([sm_pretrain_y1, sm_pretrain_y2, sm_pretrain_y3])

model3 = MultinomialNB()
model3.fit(X_train, y_train)

### Validate the combined model using validation data from current competition
Then validate our results on the new validation data.

In [None]:
val_df2 = clean(val_df2,'text')
X_predict = vec3.transform(val_df2['text'])
predictions = model3.predict_proba(X_predict)

In [None]:
val_df2.loc[:,'pred_score']=predictions[:, 1]
# despite its name, 'error' is 1 - |score - pred_score|, so values closer to 1 are better
val_df2.loc[:,'error'] = 1-abs(val_df2['score']-val_df2['pred_score'])

In [None]:
val_df2['error'].describe()

### Submission with new model

In [None]:
sub_df = pd.read_csv("/kaggle/input/jigsaw-toxic-severity-rating/comments_to_score.csv")

In [None]:
sub_df = clean(sub_df,'text')
X_test3 = vec3.transform(sub_df['text'])
submission_predictions3 = model3.predict_proba(X_test3)
sub_df['score'] = submission_predictions3[:, 1]

Generate the submission file.

In [None]:
sub_df[['comment_id', 'score']].to_csv("submission.csv", index=False)