# Jigsaw Detox Multilingual Toxic Comment Classification Exploratory Data Analysis

# Introduction

The following notebook contains an Exploratory Data Analysis of the training, validation and testing data provided in the Jigsaw Detox Multilingual Toxic Comment Classification Competition. The goal of this competition is to accurately classify toxic comments in multiple languages given English-only training data. A separate notebook contains the modelling phase and is available here. 

## Exploratory Data Analysis

The following section contains an Exploratory Data Analysis of the data provided in the competition. This analysis will inform the subsequent preprocessing and modelling steps for my solution. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS

## Training Data

### Visualizing The Data 

In [None]:
DIR_INPUT= '/kaggle/input/jigsaw-multilingual-toxic-comment-classification'
train_df1 = pd.read_csv(DIR_INPUT + '/jigsaw-toxic-comment-train.csv')
train_df1.head()

In [None]:
train_df2 = pd.read_csv(DIR_INPUT + '/jigsaw-unintended-bias-train.csv')
train_df2.head()

In [None]:
cols_filter = ['id', 'comment_text', 'toxic']
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
train_df = pd.concat([train_df1[cols_filter], train_df2[cols_filter]])
train_df.head()

In [None]:
print(train_df.shape)

### Distribution of Target Variable

In [None]:
# Binarize the target: scores above 0.5 are treated as toxic
train_df['toxic'] = train_df['toxic'] > 0.5

In [None]:
train_df['toxic'].value_counts(normalize=True)

In [None]:
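# sort_index() guarantees the order (not toxic, toxic) so the counts match the x labels
sns.barplot(x=['Not-toxic', 'Toxic'], y=train_df.toxic.value_counts().sort_index().values)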

### Word Frequency 

In [None]:
rnd_comments = train_df[train_df['toxic'] == 0].sample(n=10000)['comment_text'].values
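# set.update() returns None, so build the extended stop word set with union() instead
stopwords = STOPWORDS.union({'Trump', 'people', 'one', 'will'})
wc = WordCloud(background_color="black", max_words=2000, stopwords=stopwords)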
wc.generate(" ".join(rnd_comments))

plt.figure(figsize=(20,10))
plt.axis("off")
plt.title("Frequent words in non-toxic comments", fontsize=20)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)
plt.show()

In [None]:
rnd_comments = train_df[train_df['toxic'] == 1].sample(n=10000)['comment_text'].values
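# set.update() returns None, so build the extended stop word set with union() instead
stopwords = STOPWORDS.union({'Trump', 'people', 'one', 'will'})
wc = WordCloud(background_color="black", max_words=2000, stopwords=stopwords)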
wc.generate(" ".join(rnd_comments))

plt.figure(figsize=(20,10))
plt.axis("off")
plt.title("Frequent words in toxic comments", fontsize=20)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)
plt.show()


### Remarks

**1. Two datasets (train dataset 1 and train dataset 2) containing English comments and various features describing them are supplied as training data. Each dataset contains a different set of features, but the ones kept in the aggregate dataset are: id, comment_text(comment text) and toxic(toxicity). Although the other features could carry meaningful information for our prediction task, the validation and test datasets do not contain them, so we cannot leverage them.**

**2. The comments contain cased characters, stop words, punctuation, web URLs, user mentions and newline characters, among other structures that are not ideal when tokenizing and modelling natural language. As such, these occurrences should be removed from the comments in the preprocessing phase.**
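
The cleaning described in this remark could look roughly like the sketch below. The function name `clean_comment` and the regular expressions are illustrative assumptions, not the preprocessing used in the actual solution. Stop word removal is left out on purpose, since it depends on the language and needs a stop word list per language.

```python
import re

# Hypothetical patterns for the structures listed in the remark
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # web URLs
MENTION_RE = re.compile(r"@\w+")                # user mentions
PUNCT_RE = re.compile(r"[^\w\s]")               # punctuation and symbols
WHITESPACE_RE = re.compile(r"\s+")              # newlines, tabs, repeated spaces


def clean_comment(text):
    """Lowercase a comment and strip URLs, mentions, punctuation and extra whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    text = WHITESPACE_RE.sub(" ", text)
    return text.strip()


example = "Check THIS out:\nhttps://example.com @someuser, it's great!!!"
cleaned = clean_comment(example)
print(cleaned)  # check this out it s great
```

Because `\w` is Unicode-aware in Python 3, accented and non-Latin letters are kept, which matters for the non-English comments in the validation and test data.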

**3. The two training datasets store the toxic value differently: one as a float ranging between 0 and 1, the other as an integer. In each case, a value of 0 means the comment is not toxic and a value of 1 means the comment is toxic. A threshold, presumably 0.5, can be applied to the float-valued column to convert it to an integer (0 or 1).**

**4. The comments are solely in English but, as the name of the competition suggests, the test dataset contains comments in six languages. Possible solutions:**
   * Find or use an API to translate all data to a single language, presumably English. The toxicity of a comment may not be preserved through translation, so that has to be considered.
   * Leverage a multilingual model, such as XLM-RoBERTa, that is trained on data across many languages and can be fine-tuned for a specific classification task.
   * Upsample comments in low-resource languages.
   * Downsample observations from high-resource languages.
    

**5. There is a high imbalance between toxic and non-toxic comments. Toxic comments account for only roughly 6 percent of the comments, whereas non-toxic comments account for the remaining 94 percent. Possible solutions:**
   * Upsample the toxic comments by sampling observations multiple times.
   * Generate synthetic toxic comments based on the existing ones using pseudolabelling or another technique. Mislabelled generated comments can be counteracted with label smoothing, which prevents the model from predicting the labels too confidently.
   * Downsample the majority class. This is an easy way to obtain balanced data, but in doing so we lose a lot of data.
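
As a rough illustration of two of the options above, the sketch below upsamples the minority class by sampling with replacement and applies binary label smoothing. The helper names `oversample_minority` and `smooth_labels`, the toy data and the smoothing factor `epsilon=0.1` are all assumptions made for the example, not values taken from the solution.

```python
import pandas as pd


def oversample_minority(df, label_col="toxic", random_state=0):
    """Duplicate randomly chosen minority-class rows until both classes are the same size."""
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    n_extra = int(counts.max() - counts.min())
    minority_rows = df[df[label_col] == minority_label]
    # Sampling with replacement lets us draw more rows than the minority class contains
    extra = minority_rows.sample(n=n_extra, replace=True, random_state=random_state)
    return pd.concat([df, extra], ignore_index=True)


def smooth_labels(labels, epsilon=0.1):
    """Binary label smoothing: 0 becomes epsilon / 2 and 1 becomes 1 - epsilon / 2."""
    return labels.astype(float) * (1.0 - epsilon) + epsilon / 2.0


# Toy data with a 4:1 imbalance, standing in for the real training set
toy = pd.DataFrame({
    "comment_text": ["a", "b", "c", "d", "e"],
    "toxic": [0, 0, 0, 0, 1],
})

balanced = oversample_minority(toy)
soft_targets = smooth_labels(balanced["toxic"])
print(balanced["toxic"].value_counts())
```

Upsampling only repeats existing comments, so it should be applied to the training split alone. Otherwise, duplicated rows can leak into the data used for evaluation.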

**6. Word clouds were used to display the most frequent words in both the toxic and the non-toxic comments. Stop words were removed in order to denoise the data. The most frequent words in the non-toxic comments are words that are commonly used in the English language and, to a lesser extent, words related to politics. The most frequent words in the toxic comments are swear words, racial slurs, offensive terms and words related to politics.** 




## Validation Data

### Visualizing the Data

In [None]:
valid_df = pd.read_csv(DIR_INPUT + '/validation.csv')
valid_df.head()

In [None]:
per_lang = valid_df['lang'].value_counts()

### Distribution of Variables

In [None]:
sns.barplot(x=per_lang.index, y=per_lang.values)

In [None]:
valid_df.toxic.value_counts(normalize=True)

In [None]:
# sort_index() guarantees the order (0, 1) so the counts match the x labels
sns.barplot(x=['Not-toxic', 'Toxic'], y=valid_df.toxic.value_counts().sort_index().values)

In [None]:
per_lang = valid_df.groupby(by=['lang', 'toxic']).count()[['id']]
per_lang

In [None]:
data=[]
for lang in valid_df['lang'].unique():
    y = per_lang[per_lang.index.get_level_values('lang') == lang].values.flatten()
    data.append(go.Bar(name=lang, x=['Non-toxic', 'Toxic'], y=y))
fig = go.Figure(data=data)
fig.update_layout(
    title='Language distribution in the validation dataset',
    barmode='group'
)
fig.show()

### Remarks
**1. The validation dataset contains only 8000 observations that span three languages.**

**2. The comments contain cased characters, stop words, punctuation, web URLs, user mentions and newline characters, among other structures that are not ideal for the tokenizer and modelling phases. As such, the comments should be preprocessed to remove these occurrences.**

**3. The validation dataset contains three columns that coincide with the features that we used in the training dataset: comment_text(comment text), toxic(toxicity), and lang(language).**

**4. The target column is an integer: the value is 0 if the comment is not toxic and 1 if the comment is toxic.**

**5. The validation dataset contains only three of the six languages in the test data: tr(Turkish), it(Italian) and es(Spanish). Potential solutions:**
   * Find or use an API to translate all data to a single language, presumably English. The toxicity of a comment may not be preserved through translation.
   * Leverage a multilingual model, such as XLM-RoBERTa, that is trained on data across many languages and can be fine-tuned for a specific classification task.
   * Upsample comments in low-resource languages.
   * Downsample observations from high-resource languages.
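
The up- and downsampling options can be combined by resampling every language to the same number of rows. The sketch below is one possible way to do this, assuming a `lang` column like the one in the validation data. The function name `resample_per_language`, the target size and the toy data are made up for illustration.

```python
import pandas as pd


def resample_per_language(df, n_per_lang, lang_col="lang", random_state=0):
    """Return a frame with exactly n_per_lang rows for every language."""
    parts = []
    for _, group in df.groupby(lang_col):
        # Languages with too few rows are upsampled (with replacement);
        # languages with enough rows are downsampled (without replacement)
        replace = len(group) < n_per_lang
        parts.append(group.sample(n=n_per_lang, replace=replace, random_state=random_state))
    return pd.concat(parts, ignore_index=True)


# Toy data: 5 Spanish, 2 Italian and 1 Turkish comment
toy_langs = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(8)],
    "lang": ["es"] * 5 + ["it"] * 2 + ["tr"] * 1,
})

balanced_langs = resample_per_language(toy_langs, n_per_lang=3)
print(balanced_langs["lang"].value_counts())
```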


**6. There is an imbalance between toxic and non-toxic comments, both overall and within each language in the validation dataset. Potential solutions:**
   * Upsample the toxic comments.
   * Generate synthetic toxic comments based on the existing ones using pseudolabelling or another technique. Mislabelled generated comments can be dealt with using label smoothing, which prevents the model from predicting the targets too confidently.
   * Downsample the majority class. This is an easy way to obtain balanced data, but in doing so we lose a lot of data.

**7. The validation dataset has only 8000 observations. This is substantially smaller than the train and test datasets, which have 2125743 and 63812 records respectively. This may result in an inaccurate estimate of the model's performance on the validation set. To prevent this, a small portion of observations from the training data can be set aside and included as part of the validation dataset.** 

**8. Since the validation dataset contains comments across three languages, it may be beneficial to add it to the training dataset, because the training dataset only includes English comments. However, this may yield only a limited improvement because the validation dataset is relatively small compared to the training data.**
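
One way to add validation comments to the training data while still keeping something to evaluate on is to move only part of the validation set, stratified by language and label. The sketch below illustrates the idea on toy data. The function name `split_validation` and the 50/50 split are assumptions, and the approach relies on the frame having a unique index.

```python
import pandas as pd


def split_validation(valid_df, train_frac=0.5, random_state=0):
    """Split the validation set into a part to add to training and a held-out part.

    Sampling is done within each (lang, toxic) group so that both parts keep
    roughly the same language and class proportions.
    """
    to_train = valid_df.groupby(["lang", "toxic"]).sample(
        frac=train_frac, random_state=random_state
    )
    # Rows that were not sampled remain available for evaluation
    held_out = valid_df.drop(to_train.index)
    return to_train, held_out


# Toy stand-in for validation.csv: two languages, two rows per class
toy_valid = pd.DataFrame({
    "id": range(8),
    "comment_text": [f"comment {i}" for i in range(8)],
    "lang": ["es"] * 4 + ["tr"] * 4,
    "toxic": [0, 0, 1, 1, 0, 0, 1, 1],
})

to_train, held_out = split_validation(toy_valid)
print(len(to_train), len(held_out))
```

The `to_train` part could then be concatenated with the English training data using `pd.concat`, while `held_out` is kept aside for model evaluation.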

## Test Data

### Visualizing the Data

In [None]:
test_df = pd.read_csv(DIR_INPUT + '/test.csv')
test_df.head()

### Distribution of Language Variable 

In [None]:
test_df['lang'].value_counts()

In [None]:
per_lang = test_df['lang'].value_counts()
sns.barplot(x=per_lang.index, y=per_lang.values)

### Remarks 
**1. The testing dataset contains 63812 observations that span 6 languages. Its columns include id, comment_text(comment text) and lang(language).** 

**2. The comments contain cased characters, stop words, punctuation, web URLs, user mentions and newline characters, among other structures that are not ideal for the tokenizer and modelling phases. As such, the comments should be preprocessed to remove these occurrences.**

**3. The testing dataset contains six languages: tr(Turkish), pt(Portuguese), ru(Russian), fr(French), it(Italian) and es(Spanish). Implications:**
   * Find or use an API to translate all data to a single language, presumably English. The toxicity of a comment may not be preserved through translation.
   * Leverage a multilingual model, such as XLM-RoBERTa, that is trained on data across many languages and can be fine-tuned for a specific classification task.



## Conclusion 
This concludes the preliminary exploratory data analysis of the training, validation and testing data for the Jigsaw Detox Multilingual Toxic Comment Classification Competition. Once the competition concludes, I will release the notebooks that contain the preprocessing and modelling portions of my solution and include the corresponding links below.


### As of June 15th, my submission is ranked 325 out of 1550 solutions (Top 20 percent of solutions). 