<div style='color:white;background-color:#0b8b10; height: 100px; border-radius: 25px;'><h1 style='text-align:center;padding: 3%'>Jigsaw Rate Severity of Toxic Comments Competition</h1></div>

## Table of Contents
* [Datasets Information](#data_information)
    - Description of the Data Present within this Competition
        - Files
    - Description of the Data Present within Jigsaw Toxic Comment Classification Challenge
        - Files
* [EDA for Jigsaw Toxic Comment Classification Challenge Dataset](#eda)
    - Install and Import Libraries
    - General Dataset Information
    - Clean Data
    - Comments Distribution
    - Distribution of top n-grams
    - Sentiment Polarity
    - Word Clouds
* [Models](#models)
    - Assumptions
    - Preprocessing
    - Train Test Split
    - TF-IDF
    - SVD
    - LightGBM Model
    - Inference

<a id='data_information'></a>
# Datasets Information
## Description of the Data Present within this Competition

<div>The data used for this competition are Wikipedia Talk page comments. The purpose is to rank the severity of comment toxicity from innocuous to outrageous, where the middle matters as much as the extremes.<br>
<b>Important</b>:
There is no training data for this competition. You can refer to previous Jigsaw competitions for data that might be useful to train models.
<h4> Competition URL:</h4> https://www.kaggle.com/c/jigsaw-toxic-severity-rating
<h3> Files </h3>
<span style="background-color:#e1e6e3;">comments_to_score.csv</span> - collection of comments <br>
<span style="background-color:#e1e6e3;">validation_data.csv</span> - pair rankings that can be used to validate models <br>
<span style="background-color:#e1e6e3;">sample_submission.csv</span> - a sample submission file in the correct format <br>
<br>
<b style='margin-top:1.5%;margin-left:1%;background-color:#fbffb3'><i>Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.</i></b></div>

## Description of the Data Present within Jigsaw Toxic Comment Classification Challenge

<div>
You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are: 
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate
    
<h4> Competition URL: </h4> https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/overview
<h3> Files </h3>
<span style="background-color:#e1e6e3;">train.csv</span> - the training set, contains comments with their binary labels <br>
<span style="background-color:#e1e6e3;">test.csv</span> - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring <br>
<span style="background-color:#e1e6e3;">sample_submission.csv</span> - a sample submission file in the correct format <br>
<span style="background-color:#e1e6e3;">test_labels.csv</span> - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!) 
</div>

<a id='eda'></a>
# EDA for Jigsaw Toxic Comment Classification Challenge Dataset
I explored the data provided for this competition in a separate notebook: https://www.kaggle.com/serquet/jigsaw-full-eda
The EDA below therefore covers only the external data used for training.

## Install and Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import random

from textblob import TextBlob
from stop_words import get_stop_words
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import mean_squared_error

from wordcloud import WordCloud, STOPWORDS

import lightgbm as ltb


plt.style.use('ggplot')

pd.set_option('display.max_columns', None)
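# Use fully qualified option names; short names can be ambiguous in recent pandas versions
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)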

## General Dataset Information
### Load Data

In [None]:
df = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")

In [None]:
df.shape

In [None]:
df.info()

## Clean Data

In [None]:
def clean_text(text):
    # Strip HTML-like tags
    text = re.sub(r'<[^<]+?>', '', text)
    # Drop leftover HTML entities and backslashes
    # (str.replace matches literally, so no regex-style parentheses here)
    text = text.replace('&lt;', '').replace('&gt;', '')
    text = text.replace("\\", "")
    # Collapse newlines, non-breaking spaces and repeated whitespace into one space
    text = re.sub(r'\s+', ' ', text)
    
    return text

In [None]:
df['comment_text'] = df['comment_text'].apply(clean_text)

## Comments Distribution

How many comments contain more than one type of toxicity?

In [None]:
print(f'{df[df.toxic+df.severe_toxic+df.obscene+df.threat+df.insult+df.identity_hate > 1].shape[0]} comments contain more than one type of toxicity')

How many non-toxic comments?

In [None]:
nontoxic_comments = df[df.toxic+df.severe_toxic+df.obscene+df.threat+df.insult+df.identity_hate == 0]
print(f'Out of {df.shape[0]} comments, {nontoxic_comments.shape[0]} are non-toxic and {df.shape[0]-nontoxic_comments.shape[0]} are toxic. \n'
      f'So we have a percentage of toxic comments of {round((df.shape[0]-nontoxic_comments.shape[0])/df.shape[0]*100,3)}%')

There are different types of toxicity. Let's see if toxic and severely toxic include all other classes:

In [None]:
print(f'There are: \n'
      f'- {df[df.toxic == 1].shape[0]} comments classified as toxic \n'
      f'- {df[df.severe_toxic == 1].shape[0]} comments classified as severe toxic \n'
      f'- {df[df.obscene == 1].shape[0]} comments classified as obscene \n'
      f'- {df[df.threat == 1].shape[0]} comments classified as threat \n'
      f'- {df[df.insult == 1].shape[0]} comments classified as insult \n'
      f'- {df[df.identity_hate == 1].shape[0]} comments classified as identity hate \n'
      f'Comments classified as toxic or severe toxic are {df[(df.toxic == 1)|(df.severe_toxic == 1)].shape[0]}. \n'
      f'There are {df[(df.obscene+df.threat+df.insult+df.identity_hate > 0)&(df.toxic+df.severe_toxic==0)].shape[0]} comments that belong to some class of toxicity but have not been assigned to either the toxic or severe toxic class \n'
      f'There are {df[(df.severe_toxic==1)&(df.toxic==1)].shape[0]} comments that belong both to the toxic class and to the severe toxic class, while {df[(df.severe_toxic==1)&(df.toxic==0)].shape[0]} that belong to the severe_toxic class but not to the toxic class \n')

<b style='margin-top:1.5%;margin-left:1%;background-color:#f6e51d'><i>SO ALL COMMENTS CLASSIFIED AS SEVERE TOXIC ARE ALSO CLASSIFIED AS TOXIC</i></b>

Let's see some examples for each class

In [None]:
df[df.toxic == 1].sample(1, random_state=42)

In [None]:
df[df.severe_toxic == 1].sample(1, random_state=42)

In [None]:
df[df.obscene == 1].sample(1, random_state=42)

In [None]:
df[df.threat == 1].sample(1, random_state=42)

In [None]:
df[df.insult == 1].sample(1, random_state=42)

In [None]:
df[df.identity_hate == 1].sample(1, random_state=42)

## Distribution of top n-grams for non-toxic, toxic, severe toxic

Unigrams

In [None]:
def get_top_n_words(corpus, n=None, remove_stop_words=False, n_words=1): # if n_words=1 -> unigrams, if n_words=2 -> bigrams..
    if remove_stop_words:
        vec = CountVectorizer(stop_words = 'english', ngram_range=(n_words, n_words)).fit(corpus)
    else:
        vec = CountVectorizer(ngram_range=(n_words, n_words)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
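
To make explicit what `get_top_n_words` counts, here is a minimal standard-library sketch of n-gram counting on a made-up corpus. It is a simplification under stated assumptions: tokens are split on whitespace only, whereas `CountVectorizer` applies its own tokenizer and optional stop-word list, so counts on real data can differ. `top_ngrams` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def top_ngrams(corpus, n_words=1, top=3):
    """Count word n-grams over all documents and return the most frequent ones."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.lower().split()  # naive whitespace tokenizer (assumption)
        # Slide a window of n_words tokens over the document
        counts.update(
            " ".join(tokens[i:i + n_words])
            for i in range(len(tokens) - n_words + 1)
        )
    return counts.most_common(top)

toy_corpus = ["thank you for the edit", "thank you very much"]
print(top_ngrams(toy_corpus, n_words=2, top=1))  # -> [('thank you', 2)]
```

Setting `n_words` to 1, 2 or 3 corresponds to the unigram, bigram and trigram cases plotted below.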

In [None]:
text_no_toxic = df[df.toxic+df.severe_toxic+df.obscene+df.threat+df.insult+df.identity_hate == 0].comment_text.values
text_toxic = df[(df.toxic == 1)&(df.severe_toxic == 0)].comment_text.values
text_severe_toxic = df[(df.toxic == 1)&(df.severe_toxic == 1)].comment_text.values


common_words_non_toxic = get_top_n_words(text_no_toxic, 20, remove_stop_words=True, n_words=1)
common_words_toxic = get_top_n_words(text_toxic, 20, remove_stop_words=True, n_words=1)
common_words_severe_toxic = get_top_n_words(text_severe_toxic, 20, remove_stop_words=True, n_words=1)

df_tmp_non_toxic = pd.DataFrame(common_words_non_toxic, columns = ['text' , 'count'])
df_tmp_toxic = pd.DataFrame(common_words_toxic, columns = ['text' , 'count'])
df_tmp_severe_toxic = pd.DataFrame(common_words_severe_toxic, columns = ['text' , 'count'])

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#0b8b10")
ax1.set_title('Toxic Unigram Distribution')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_severe_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
   kind='bar', color = "#0b8b10")
ax2.set_title('Severe Toxic Unigram Distribution')
ax2.set_xlabel("Unigrams")
ax2.set_ylabel("Frequency")

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_non_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#0b8b10")
ax1.set_title('Non Toxic Unigram Distribution')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

plt.show()

Bigrams

In [None]:
text_no_toxic = df[df.toxic+df.severe_toxic+df.obscene+df.threat+df.insult+df.identity_hate == 0].comment_text.values
text_toxic = df[(df.toxic == 1)&(df.severe_toxic == 0)].comment_text.values
text_severe_toxic = df[(df.toxic == 1)&(df.severe_toxic == 1)].comment_text.values


common_words_non_toxic = get_top_n_words(text_no_toxic, 20, remove_stop_words=True, n_words=2)
common_words_toxic = get_top_n_words(text_toxic, 20, remove_stop_words=True, n_words=2)
common_words_severe_toxic = get_top_n_words(text_severe_toxic, 20, remove_stop_words=True, n_words=2)

df_tmp_non_toxic = pd.DataFrame(common_words_non_toxic, columns = ['text' , 'count'])
df_tmp_toxic = pd.DataFrame(common_words_toxic, columns = ['text' , 'count'])
df_tmp_severe_toxic = pd.DataFrame(common_words_severe_toxic, columns = ['text' , 'count'])

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#0b8b10")
ax1.set_title('Toxic Bigram Distribution')
ax1.set_xlabel("Bigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_severe_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
   kind='bar', color = "#0b8b10")
ax2.set_title('Severe Toxic Bigram Distribution')
ax2.set_xlabel("Bigrams")
ax2.set_ylabel("Frequency")

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_non_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#0b8b10")
ax1.set_title('Non Toxic Bigram Distribution')
ax1.set_xlabel("Bigrams")
ax1.set_ylabel("Frequency")

plt.show()

Trigrams

In [None]:
text_no_toxic = df[df.toxic+df.severe_toxic+df.obscene+df.threat+df.insult+df.identity_hate == 0].comment_text.values
text_toxic = df[(df.toxic == 1)&(df.severe_toxic == 0)].comment_text.values
text_severe_toxic = df[(df.toxic == 1)&(df.severe_toxic == 1)].comment_text.values


common_words_non_toxic = get_top_n_words(text_no_toxic, 20, remove_stop_words=True, n_words=3)
common_words_toxic = get_top_n_words(text_toxic, 20, remove_stop_words=True, n_words=3)
common_words_severe_toxic = get_top_n_words(text_severe_toxic, 20, remove_stop_words=True, n_words=3)

df_tmp_non_toxic = pd.DataFrame(common_words_non_toxic, columns = ['text' , 'count'])
df_tmp_toxic = pd.DataFrame(common_words_toxic, columns = ['text' , 'count'])
df_tmp_severe_toxic = pd.DataFrame(common_words_severe_toxic, columns = ['text' , 'count'])

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#0b8b10")
ax1.set_title('Toxic Trigram Distribution')
ax1.set_xlabel("Trigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_severe_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
   kind='bar', color = "#0b8b10")
ax2.set_title('Severe Toxic Trigram Distribution')
ax2.set_xlabel("Trigrams")
ax2.set_ylabel("Frequency")

fig = plt.figure(figsize=(15,6))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_non_toxic.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#0b8b10")
ax1.set_title('Non Toxic Trigram Distribution')
ax1.set_xlabel("Trigrams")
ax1.set_ylabel("Frequency")

plt.show()

## Sentiment Polarity

Let's use TextBlob to calculate sentiment polarity. The sentiment polarity value lies in the range of [-1, 1] where 1 means positive sentiment and -1 means a negative sentiment:

In [None]:
polarity_toxic_not_severe = df[(df.toxic==1)&(df.severe_toxic==0)]['comment_text'].map(lambda text: TextBlob(text).sentiment.polarity)
polarity_severe_toxic = df[df.severe_toxic==1]['comment_text'].map(lambda text: TextBlob(text).sentiment.polarity)
polarity_non_toxic = df[(df.toxic+df.severe_toxic+df.obscene+df.threat+df.insult+df.identity_hate == 0)]['comment_text'].map(lambda text: TextBlob(text).sentiment.polarity)

In [None]:
fig, axes = plt.subplots(ncols=3, nrows=1, figsize=(25, 6))
ax1, ax2, ax3 = axes.flatten()

ax1.hist(polarity_non_toxic, color = "#0b8b10", bins=25)
ax1.set_title('Polarity Distribution for Non-Toxic Comments')
ax1.set_xlabel("Sentiment")
ax1.set_ylabel("Frequency")

ax2.hist(polarity_toxic_not_severe,  color = "#0b8b10", bins=25)
ax2.set_title('Polarity Distribution for Toxic Comments')
ax2.set_xlabel("Sentiment")
ax2.set_ylabel("Frequency")

ax3.hist(polarity_severe_toxic, color = "#0b8b10", bins=25)
ax3.set_title('Polarity Distribution for Severe Toxic Comments')
ax3.set_xlabel("Sentiment")
ax3.set_ylabel("Frequency")

plt.show()

From these graphs we can see that the sentiment for non-toxic comments is predominantly neutral/positive, for toxic comments neutral and negative, while for severely toxic comments the sentiment is neutral, negative and strongly negative.

## Word Clouds for non-toxic, toxic, severe toxic

In [None]:
def color_func(word, font_size, position, orientation, random_state=None, hsl=(125, 75, 25),
                    **kwargs):
    # Jitter saturation and lightness by +/-10 around the base HSL colour
    return f"hsl({hsl[0]}, {random.randint(hsl[1]-10, hsl[1]+10)}%, {random.randint(hsl[2]-10, hsl[2]+10)}%)"

In [None]:
wc_text_no_toxic = WordCloud(background_color="#fff",max_words=1000,stopwords=set(STOPWORDS))
wc_text_toxic = WordCloud(background_color="#fff",max_words=1000,stopwords=set(STOPWORDS))
wc_text_severe_toxic = WordCloud(background_color="#fff",max_words=1000,stopwords=set(STOPWORDS))


wc_text_no_toxic.generate(" ".join(text_no_toxic))
wc_text_toxic.generate(" ".join(text_toxic))
wc_text_severe_toxic.generate(" ".join(text_severe_toxic))


fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_text_toxic.recolor(color_func=color_func, random_state=42),
           interpolation="bilinear")
ax1 = plt.title("Toxic", fontsize=20)

ax2 = fig.add_subplot(122)
ax2 = plt.imshow(wc_text_severe_toxic.recolor(color_func=color_func, random_state=42),
           interpolation="bilinear")
ax2 = plt.title("Severe Toxic", fontsize=20)

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_text_no_toxic.recolor(color_func=color_func, random_state=42),
           interpolation="bilinear")
ax1 = plt.title("Non Toxic", fontsize=20)


plt.show()

<a id='models'></a>
# Models

<b style='margin-top:1.5%;margin-left:1%;background-color:#f6e51d'>In this notebook I applied a simple machine learning model. In the next notebooks I will use neural-network-based models, which are better suited to unstructured data.</b>

## Assumptions

We assume that the severity score of a comment is the sum of its six binary labels: toxic, severe_toxic, obscene, threat, insult and identity_hate. 
The sum is then multiplied by 3 if the comment is labeled severe_toxic and left unchanged if it is only toxic.
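
As a quick check of the arithmetic, the rule can be written as a small plain-Python function. This is only an illustrative sketch: `severity_target` and `LABELS` are hypothetical names not used elsewhere in the notebook, and the rows are made-up examples.

```python
LABELS = ("toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate")

def severity_target(row):
    """Sum the six binary labels, then triple the sum for severe_toxic comments."""
    score = sum(row[label] for label in LABELS)
    if row["severe_toxic"] == 1:
        score *= 3
    return score

clean = dict.fromkeys(LABELS, 0)
toxic_insult = {**clean, "toxic": 1, "insult": 1}
severe = {**clean, "toxic": 1, "severe_toxic": 1, "obscene": 1, "insult": 1}

print(severity_target(clean))         # -> 0
print(severity_target(toxic_insult))  # -> 2
print(severity_target(severe))        # -> 12  (4 labels, tripled)
```

Under this rule the target ranges from 0 to 18 (all six labels set, then tripled).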

In [None]:
df['target'] = df.toxic+df.severe_toxic+df.obscene+df.threat+df.insult+df.identity_hate

In [None]:
# Use .loc instead of chained indexing so the assignment reliably modifies df
df.loc[df.severe_toxic == 1, 'target'] *= 3

In [None]:
fig = plt.figure(figsize=(10,6))

ax1 = df['target'].plot.hist(bins=25, color='#0b8b10')
ax1.set_title('Target Distribution')
ax1.set_ylabel("Frequency")

plt.show()

Let's downsample the comments with a target of zero to a quarter of their number, to reduce the imbalance of the dataset

In [None]:
# Fix the seed so the downsampled subset is reproducible
df0 = df[df.target == 0].sample((round(df[df.target == 0].shape[0]/4)), random_state=42)

In [None]:
df = df[df.target != 0]
df = pd.concat([df0,df], axis = 0)

In [None]:
df.target.value_counts()

In [None]:
fig = plt.figure(figsize=(10,6))

ax1 = df['target'].plot.hist(bins=25, color='#0b8b10')
ax1.set_title('Target Distribution')
ax1.set_ylabel("Frequency")

plt.show()

## Preprocessing

In [None]:
# remove stop words

# Merge the two stop-word lists into a set for fast membership tests
stop_words = set(get_stop_words('en'))
stop_words.update(stopwords.words('english'))

# Note: the comparison is case-sensitive, so capitalised stop words such as 'The' are kept
df['clean_comment_text_NO_STOPWORDS'] = df['comment_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [None]:
df[['comment_text', 'clean_comment_text_NO_STOPWORDS']].head(2)

In [None]:
# stemming
stemmer = SnowballStemmer("english")

In [None]:
df['clean_comment_text_NO_STOPWORDS_stemmed'] = df['clean_comment_text_NO_STOPWORDS'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [None]:
df[['comment_text', 'clean_comment_text_NO_STOPWORDS_stemmed']].head(2)

## Train-Test Split

In [None]:
X = df['clean_comment_text_NO_STOPWORDS_stemmed']
y = df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
print(f'Number of comments in train data: {X_train.shape[0]} \n'
     f'Number of comments in test data: {X_test.shape[0]}')

## TF-IDF

In [None]:
vectorizer = TfidfVectorizer(analyzer = 'char_wb', max_df = 0.5, min_df = 3, ngram_range = (3,5), lowercase=False)
vect_X_train = vectorizer.fit_transform(X_train)
vect_X_test = vectorizer.transform(X_test)

In [None]:
# vocabulary_ is available in every scikit-learn version (get_feature_names was removed in 1.2)
print(f'Final number of features: {len(vectorizer.vocabulary_)}')
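
The vectorizer above uses `analyzer = 'char_wb'` with `ngram_range = (3,5)`, i.e. character n-grams taken inside word boundaries. The sketch below illustrates which features this produces for a single word. It follows my understanding of scikit-learn's behaviour (each word is padded with one space on both sides, and a word shorter than the n-gram size is emitted once) and should be read as an approximation; `char_wb_ngrams` is a hypothetical helper, not a library function.

```python
def char_wb_ngrams(text, min_n=3, max_n=5):
    """Character n-grams per whitespace-separated word, padded with spaces."""
    grams = []
    for word in text.split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            if n >= len(padded):
                # The padded word fits in a single n-gram: emit it once and stop
                grams.append(padded)
                break
            # Slide a window of n characters over the padded word
            grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

print(char_wb_ngrams("cat"))
# -> [' ca', 'cat', 'at ', ' cat', 'cat ', ' cat ']
```

Because the n-grams never span two words, misspelled or obfuscated variants of a word can still share most of their features with the original spelling.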

## SVD

In [None]:
truncatedSVD = TruncatedSVD(n_components=2000, random_state=42)

In [None]:
truncatedSVD.fit(vect_X_train)

In [None]:
truncatedSVD.explained_variance_ratio_.sum()
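
The summed ratio tells us how much of the TF-IDF variance the 2000 components retain. If one wanted to choose the number of components from a target share of variance instead, a sketch could look like this. `components_for_variance` is a hypothetical helper and the ratios are made up; in the notebook the input would be `truncatedSVD.explained_variance_ratio_`, taking the components in the order returned.

```python
from itertools import accumulate

def components_for_variance(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return None  # the threshold is never reached with the given components

toy_ratios = [0.4, 0.3, 0.2, 0.1]
print(components_for_variance(toy_ratios, 0.85))  # -> 3
```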

In [None]:
X_train_SVD = truncatedSVD.transform(vect_X_train)
X_test_SVD = truncatedSVD.transform(vect_X_test)

## LightGBM Model

The hyperparameters of the selected model were found with the following random search, kept here (commented out) for reproducibility:

In [None]:
# param_grid = {"learning_rate": [0.05, 0.1, 0.2, 0.3, 0.5],
#               "max_depth": [20, 22, 24, 30],
#               "colsample_bytree": [0.6, 0.8, 0.9],
#               "subsample": [0.3, 0.5, 0.7, 0.9],
#               "reg_alpha": [0.1, 0.2, 0.6, 0.8],
#               "reg_lambda": [10, 12, 14, 16],
#               "n_estimators": [40, 60, 80]
#               }
# reg_tree = ltb.LGBMRegressor()
# reg = RandomizedSearchCV(reg_tree, param_distributions=param_grid, n_iter=30,
#                          scoring='neg_mean_absolute_error', verbose=1, cv=4, random_state=42, n_jobs=-1)
# result = reg.fit(X_train_SVD, y_train)
# params = result.best_params_
# params

The selected hyperparameters were:

In [None]:
params = {'subsample': 0.7,
 'reg_lambda': 12,
 'reg_alpha': 0.6,
 'n_estimators': 80,
 'max_depth': 22,
 'learning_rate': 0.5,
 'colsample_bytree': 0.8,
 'random_state': 42}

In [None]:
reg_tree = ltb.LGBMRegressor(**params)
reg_tree.fit(X_train_SVD, y_train)

In [None]:
preds = reg_tree.predict(X_test_SVD)

In [None]:
mean_squared_error(y_test, preds)
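
The MSE above measures how well the model fits the constructed target, while the competition's `validation_data.csv` contains pair rankings. A minimal sketch of a pairwise check follows. `pairwise_agreement` and `toy_score` are hypothetical, and the toy pairs are invented. With the real file, each pair would come from its two text columns (assumed here to be named `less_toxic` and `more_toxic`) and the score from the cleaning, TF-IDF, SVD and LightGBM steps of this notebook.

```python
def pairwise_agreement(pairs, score):
    """Fraction of (less_toxic, more_toxic) pairs ranked in the expected order."""
    hits = sum(score(less) < score(more) for less, more in pairs)
    return hits / len(pairs)

# Toy stand-in for the model: counts words from a tiny hypothetical list
RUDE_WORDS = {"idiot", "stupid"}

def toy_score(text):
    return sum(word in RUDE_WORDS for word in text.lower().split())

toy_pairs = [
    ("thanks for the edit", "you are an idiot"),
    ("please stop", "stupid idiot"),
    ("you are stupid", "I disagree"),  # the toy scorer gets this pair wrong
]

agreement = pairwise_agreement(toy_pairs, toy_score)
print(f"Pairwise agreement: {agreement:.2f}")  # 2 of 3 pairs -> 0.67
```

Only the ordering of the scores matters for this check, so the scale of the target does not affect it.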

## Inference

In [None]:
df_sub = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")

## Clean Data and Preprocessing

In [None]:
df_sub.head(2)

In [None]:
df_sub['text'] = df_sub['text'].apply(clean_text)

In [None]:
df_sub['clean_comment_text_NO_STOPWORDS'] = df_sub['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

In [None]:
df_sub['clean_comment_text_NO_STOPWORDS_stemmed'] = df_sub['clean_comment_text_NO_STOPWORDS'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [None]:
X = vectorizer.transform(df_sub['clean_comment_text_NO_STOPWORDS_stemmed'])

In [None]:
X_SVD = truncatedSVD.transform(X)

## Predict

In [None]:
final_preds = reg_tree.predict(X_SVD)

In [None]:
df_sub.columns

In [None]:
final_df = pd.DataFrame(pd.concat([df_sub['comment_id'], pd.Series(final_preds)], axis=1))

In [None]:
final_df.head()

In [None]:
final_df.columns = ['comment_id', 'score']

In [None]:
final_df_sorted = final_df.sort_values('score', ascending = False)

## Save Submission File

In [None]:
final_df_sorted.to_csv("submission.csv", index=False)

<div style='color:white;background-color:#0b8b10; height: 50px; border-radius: 25px;'><h1 style='text-align:center;padding: 1%'>The End</h1></div>

#### In this notebook, an exploratory analysis of a dataset published in an earlier Jigsaw competition was performed. The model used is trained on this dataset. The proposed model is a simple LightGBM, but future notebooks will use models more suitable for unstructured data.