<h1><center><font size="6">Jigsaw Multilingual Toxic Comment Classification</font></center></h1>
<h1><center><font size="6">EDA</font></center></h1>


The goal of the [Jigsaw Multilingual Toxic Comment Classification](https://www.kaggle.com/c/jigsaw-multilingual-toxic-comment-classification/overview) competition is to take advantage of [Kaggle's new TPU support](https://www.kaggle.com/docs/tpu) to build multilingual models with English-only training data.

## Acknowledgements ❤  

This notebook was compiled from the best parts of several excellent EDA notebooks. Thanks to their authors!

1. [Jigsaw Multilingual: Quick EDA & TPU Modeling](https://www.kaggle.com/ipythonx/jigsaw-multilingual-quick-eda-tpu-modeling)
2. [Jigsaw Multilingual Toxicity : EDA + Models](https://www.kaggle.com/tarunpaparaju/jigsaw-multilingual-toxicity-eda-models)
3. [Stop the S@# - Toxic Comments EDA](https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda/notebook)

# Table of Contents

- <a href='#1'>Introduction</a>  
- <a href='#2'>Load Libraries</a>  
- <a href='#3'>Prepare Data for EDA</a>   
- <a href='#4'>EDA</a>     
    - <a href='#41'>Example Comments</a>   
    - <a href='#42'>Distribution of Toxicity</a>   
    - <a href='#43'>Languages</a>   
    - <a href='#44'>Wordclouds - Frequent Words</a>   
    - <a href='#45'>EDA of Indirect Features</a>   
        - <a href='#451'>Distribution of Characters & Words</a>   
    - <a href='#46'>Sentiment vs. Toxicity: Sentiment Analysis of Comment Toxicity</a> 
        - <a href='#461'>Negative Sentiment</a>   
        - <a href='#462'>Positive Sentiment</a>  
        - <a href='#463'>Neutral Sentiment</a>  
        - <a href='#464'>Compound Sentiment</a>  

# <a id='1'>Introduction</a>  

**What should we expect the data format to be?**

> The primary data for the competition is, in each provided file, the `comment_text` column. This contains the text of a comment which has been classified as `toxic` or non-toxic (0...1 in the toxic column). The train set’s comments are entirely in English and come either from Civil Comments or Wikipedia talk page edits. The test data's `comment_text` columns are composed of multiple non-English languages.
The `*-train.csv` files and `validation.csv` file also contain a toxic column that is the target to be trained on. 

> The `jigsaw-toxic-comment-train.csv` and `jigsaw-unintended-bias-train.csv` contain training data (`comment_text` and `toxic`) from the two previous Jigsaw competitions, as well as additional columns that you may find useful. `*-seqlen128.csv` files contain training, validation, and test data that has been processed for input into BERT.

**What am I predicting?**

> You are predicting the probability that a comment is `toxic`. A toxic comment would receive a `1.0`. A benign, `non-toxic` comment would receive a `0.0`. In the `test set`, all comments are classified as either a `1.0` or a `0.0`.

# <a id='2'>Load Libraries</a>  

In [None]:
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import os
import re
from tqdm import tqdm
tqdm.pandas()


from wordcloud import WordCloud, STOPWORDS
from PIL import Image
from kaggle_datasets import KaggleDatasets
from colorama import Fore, Back, Style, init
import plotly.graph_objects as go

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# <a id='3'>Prepare Data for EDA</a> 

**Most Important Files**
- `jigsaw-toxic-comment-train.csv`:  data from [this competition ](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)
- `jigsaw-unintended-bias-train.csv`: data from [this competition](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification)
- `validation.csv`: comments from Wikipedia talk pages in different non-English languages
- `test.csv`: comments from Wikipedia talk pages in different non-English languages
- `sample_submission.csv`: a sample submission file in the correct format

In [None]:
# Named data_dir to avoid shadowing the built-in dir()
data_dir = '/kaggle/input/jigsaw-multilingual-toxic-comment-classification'

train_set1 = pd.read_csv(os.path.join(data_dir, 'jigsaw-toxic-comment-train.csv'))
train_set2 = pd.read_csv(os.path.join(data_dir, 'jigsaw-unintended-bias-train.csv'))
# Round the toxicity score in train set 2 to a binary 0/1 label
train_set2.toxic = train_set2.toxic.round().astype(int)

valid = pd.read_csv(os.path.join(data_dir, 'validation.csv'))
test = pd.read_csv(os.path.join(data_dir, 'test.csv'))

In [None]:
# Combine train1 with a subset of train2
train = pd.concat([
    train_set1[['comment_text', 'toxic']],
    train_set2[['comment_text', 'toxic']].query('toxic==1'),
    train_set2[['comment_text', 'toxic']].query('toxic==0').sample(n=100000, random_state=0)
])

# <a id='4'>EDA</a> 

In [None]:
print(train.shape)
train.head()

In [None]:
print(valid.shape)
valid.head()

In [None]:
print(test.shape)
test.head()

In [None]:
# Count missing values per column in each dataset
print("Check for missing values in Train dataset")
print(train.isnull().sum())
print("Check for missing values in Validation dataset")
print(valid.isnull().sum())
print("Check for missing values in Test dataset")
print(test.isnull().sum())

# Assign the result back instead of using inplace=True on a column,
# which may silently do nothing in recent pandas versions
print("filling NA with \"unknown\"")
train["comment_text"] = train["comment_text"].fillna("unknown")
valid["comment_text"] = valid["comment_text"].fillna("unknown")

****
## <a id='41'>Example Comments</a>  

In [None]:
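# Use positional indexing: after pd.concat the index labels are not unique,
# so train['comment_text'][i] could return several rows instead of one comment
for i in range(3):
    print(f'[Comment {i+1}]\n', train['comment_text'].iloc[i])
    print()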

In [None]:
print("Toxic comments:")
print(train[train.toxic==1].iloc[:10,0])

****
## <a id='42'>Distribution of Toxicity</a>  

So far, we have seen the following columns:

- `id` - identifier within each file.
- `comment_text` - the text of the comment to be classified.
- `lang` - the language of the comment.
- `toxic` - whether or not the comment is classified as toxic. (Does not exist in `test.csv`.)

In [None]:
#print(train.toxic.value_counts())
#print(valid.toxic.value_counts())

print("Train set")
print("Toxic comments = ",len(train[train['toxic']==1]))
print("Non-toxic comments = ",len(train[train['toxic']==0]))

print("\nValidation set")
print("Toxic comments = ",len(valid[valid['toxic']==1]))
print("Non-toxic comments = ",len(valid[valid['toxic']==0]))

In [None]:
sns.set(style="darkgrid")

# Pass the column as x=... ; newer seaborn versions no longer accept it positionally
f = plt.figure(figsize=(20,5))
f.add_subplot(1,2,1)
sns.countplot(x=train_set1.toxic)
plt.title('Toxic Comment Distribution in Train Set 1')
f.add_subplot(1,2,2)
sns.countplot(x=train_set2.toxic)
plt.title('Toxic Comment Distribution in Train Set 2')

- We can see that there is a huge imbalance in the data. It therefore made sense to build a new train set that combines Train Sets 1 & 2 but downsamples the non-toxic comments from Train Set 2. Below you can see the slightly more balanced combined Train Set.

In [None]:
f = plt.figure(figsize=(20,5))
f.add_subplot(1,2,1)
sns.countplot(x=train.toxic)
plt.title('Toxic Comment Distribution in Train Set')
f.add_subplot(1,2,2)
sns.countplot(x=valid.toxic)
plt.title('Toxic Comment Distribution in Validation Set')

There is still a clear class imbalance, which could bias a model towards classifying comments as 0 (non-toxic). We can manage this bias by resampling the data before training: upsampling the toxic class, downsampling the non-toxic class, or both.
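
As a minimal sketch of the downsampling option, the snippet below balances a small made-up DataFrame by randomly dropping rows from the larger class. The `toy` data and the helper name `downsample_majority` are assumptions for illustration only; they are not part of the competition data or of this notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy data standing in for the real train set: 2 toxic, 8 non-toxic
toy = pd.DataFrame({
    "comment_text": [f"comment {i}" for i in range(10)],
    "toxic": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

def downsample_majority(df, label="toxic", random_state=0):
    """Randomly drop rows so every class has as many rows as the rarest class."""
    n_min = df[label].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label)
    ]
    # Shuffle so the classes are interleaved again
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)

balanced = downsample_majority(toy)
print(balanced["toxic"].value_counts())
```

Downsampling discards data, so on the real train set it may be worth comparing it against upsampling or class weights before settling on one approach.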

****
## <a id='43'>Languages</a>  

In [None]:
print(valid.lang.value_counts())
print(test.lang.value_counts())

In [None]:
f = plt.figure(figsize=(20,5))
f.add_subplot(1,2,1)
sns.countplot(x=valid.lang)
plt.title('Languages in Validation Set')
f.add_subplot(1,2,2)
sns.countplot(x=test.lang)
plt.title('Languages in Final Test Set')

****
## <a id='44'>Wordclouds - Frequent Words</a>  

In [None]:
stopword=set(STOPWORDS)

#wordcloud of all comments
plt.figure(figsize=(10,10))
text = train.comment_text.values
wc = WordCloud(background_color="black",max_words=2000,stopwords=stopword)
wc.generate(" ".join(text))
plt.axis("off")
plt.title("Common words in All Comments", fontsize=16)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)

In [None]:
#non-toxic wordcloud
clean_mask=np.array(Image.open("../input/imagesforkernal/safe-zone.png"))
clean_mask=clean_mask[:,:,1]

plt.figure(figsize=(20,20))
plt.subplot(121)
subset = train.query("toxic == 0")
text = subset.comment_text.values
wc = WordCloud(background_color="black",max_words=1000,mask=clean_mask,stopwords=stopword)
wc.generate(" ".join(text))
plt.axis("off")
plt.title("Common words in non-Toxic Comments", fontsize=16)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)

#toxic wordcloud
clean_mask=np.array(Image.open("../input/imagesforkernal/swords.png"))
clean_mask=clean_mask[:,:,1]

plt.subplot(122)
subset = train.query("toxic == 1")
text = subset.comment_text.values
wc = WordCloud(background_color="black",max_words=1000,mask=clean_mask,stopwords=stopword)
wc.generate(" ".join(text))
plt.axis("off")
plt.title("Common words in Toxic Comments", fontsize=16)
plt.imshow(wc.recolor(colormap= 'viridis' , random_state=17), alpha=0.98)

plt.show()

****
## <a id='45'>EDA of Indirect Features</a>  


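Indirect features are simple statistics derived from the comment text rather than from its meaning. Candidates include:

- count of sentences
- count of words
- count of unique words
- count of characters
- count of punctuation marks
- count of uppercase words/letters
- count of stop words
- average length of each word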
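
A rough sketch of how several of these features could be computed with pandas is shown below. The two toy comments, the column names, and the tiny `stop_words` set are assumptions made for illustration; on the real data the same expressions would be applied to `train["comment_text"]`.

```python
import string
import pandas as pd

# Hypothetical toy comments standing in for train["comment_text"]
toy = pd.DataFrame({"comment_text": ["Hello world! Hello again.", "STOP IT NOW"]})
text = toy["comment_text"]

# Tiny illustrative stop-word list (an assumption, not the wordcloud STOPWORDS set)
stop_words = {"it", "the", "a", "again"}

toy["count_word"] = text.str.split().str.len()
toy["count_unique_word"] = text.apply(lambda s: len(set(s.split())))
toy["count_char"] = text.str.len()
toy["count_punct"] = text.apply(lambda s: sum(ch in string.punctuation for ch in s))
toy["count_upper_word"] = text.apply(lambda s: sum(w.isupper() for w in s.split()))
# Strip surrounding punctuation and lowercase before the stop-word lookup
toy["count_stopword"] = text.apply(
    lambda s: sum(w.lower().strip(string.punctuation) in stop_words for w in s.split())
)
# Guard against empty comments with max(..., 1)
toy["mean_word_len"] = text.apply(
    lambda s: sum(len(w) for w in s.split()) / max(len(s.split()), 1)
)

print(toy.drop(columns="comment_text"))
```

Words are split on whitespace here, so punctuation attached to a word counts towards its length; a proper tokenizer would give slightly different numbers.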

### <a id='451'>Distribution of Characters & Words</a>  

In [None]:
nums_1 = train[train['toxic']==1]['comment_text'].sample(frac=0.1).str.len()
nums_2 = train[train['toxic']==0]['comment_text'].sample(frac=0.1).str.len()

fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-toxic"],
                         colors=["red", "green"], show_hist=False)

# The curves are kernel density estimates, so the y-axis is a density, not a percentage
fig.update_layout(title_text="Number of characters per comment vs. Toxicity", xaxis_title="No of characters per comment", 
                  yaxis_title="Density", template="simple_white")
fig.show()

In [None]:
nums_1 = train[train['toxic']==1]['comment_text'].sample(frac=0.1).str.split().str.len()
nums_2 = train[train['toxic']==0]['comment_text'].sample(frac=0.1).str.split().str.len()

fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-toxic"],
                         colors=["red", "green"], show_hist=False)

# The curves are kernel density estimates, so the y-axis is a density, not a percentage
fig.update_layout(title_text="Number of words per comment vs. Toxicity", xaxis_title="No of words per comment", 
                  yaxis_title="Density", template="simple_white")
fig.show()

- Toxic comments typically have more characters & words.

****
## <a id='46'>Sentiment vs. Toxicity: Sentiment Analysis of Comment Toxicity</a>  

**Do _Sentiment_ and _Toxicity_ have a relationship? We could hypothesize that the more negative a comment's sentiment is, the more likely the comment is to be toxic.**

Sentiment and polarity are quantities that reflect the emotion and intention behind a sentence. Now, I will assign a sentiment intensity score to each comment using the VADER analyzer from the NLTK (Natural Language Toolkit) library.

In [None]:
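# VADER-based analyzer; it needs the 'vader_lexicon' NLTK resource to be available
SIA = SentimentIntensityAnalyzer()

# Fallback for non-string input: a fully neutral score dict, so that later
# lookups such as x["neg"] keep working (the original sentinel 1000 would break them)
NEUTRAL_SCORES = {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

def polarity(x):
    if isinstance(x, str):
        return SIA.polarity_scores(x)
    return dict(NEUTRAL_SCORES)
    
train["polarity"] = train["comment_text"].progress_apply(polarity)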

In [None]:
print(train[train.toxic==1].iloc[4,0])

polarity(train[train.toxic==1].iloc[4,0])

The Positive, Negative and Neutral scores represent the proportion of the text that falls into each category, so they should add up to 1. Here, our sentence was rated as 0% Positive, 45% Neutral and 55% Negative.

From this example alone we can already see that sentiment scores are not a reliable reflection of how toxic a comment is. This comment is clearly toxic, yet it has only been rated as 55% negative. Would you say this comment is only 55% toxic? Probably not... Let's explore the sentiment analysis anyway.

### <a id='461'>Negative Sentiment</a>  
Negative sentiment refers to negative or pessimistic emotions. It is a score between 0 and 1; the greater the score, the more negative the comment is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["neg"] for pols in train["polarity"] if pols["neg"] != 0], marker=dict(color='red')))
fig.update_layout(xaxis_title="Negative sentiment", title_text="Negative sentiment", 
                  yaxis_title="Number of comments", template="simple_white")

- Most comments have low negative sentiment.

In [None]:
train["negativity"] = train["polarity"].apply(lambda x: x["neg"])

nums_1 = train.sample(frac=0.1).query("toxic == 1")["negativity"]
nums_2 = train.sample(frac=0.1).query("toxic == 0")["negativity"]

fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-Toxic"],
                         colors=["red", "green"], show_hist=False)

# The curves are kernel density estimates, so the y-axis is a density, not a count
fig.update_layout(title_text="Negative Sentiment vs. Toxicity", xaxis_title="Negative Sentiment", 
                  yaxis_title="Density", template="simple_white")
fig.show()

- Comments with low negative sentiment are more likely to be non-toxic, and comments with high negative sentiment are more likely to be toxic. 
- A comment is likely to be non-toxic if it has a negativity of 0.
- A comment is likely to be toxic if it has a negativity of more than 0.8.
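
Observations like these are read off the density curves; one way to check them numerically is to compute the share of toxic comments among the rows that satisfy each condition. The sketch below does this on made-up values. The `toy` data and the helper `toxic_rate` are hypothetical, and on the real data the columns would be `train["negativity"]` and `train["toxic"]`.

```python
import pandas as pd

# Hypothetical toy scores standing in for train["negativity"] and train["toxic"]
toy = pd.DataFrame({
    "negativity": [0.0, 0.0, 0.1, 0.5, 0.85, 0.9, 0.95, 0.0],
    "toxic":      [0,   0,   0,   1,   1,    1,   0,    1],
})

def toxic_rate(df, mask):
    """Share of toxic comments among the rows selected by a boolean mask."""
    selected = df[mask]
    return selected["toxic"].mean() if len(selected) else float("nan")

rate_zero = toxic_rate(toy, toy["negativity"] == 0)   # rows with no negative sentiment
rate_high = toxic_rate(toy, toy["negativity"] > 0.8)  # rows with very negative sentiment

print(f"P(toxic | negativity == 0)  = {rate_zero:.2f}")
print(f"P(toxic | negativity > 0.8) = {rate_high:.2f}")
```

Because the train set is imbalanced, these rates are most informative when compared with the overall toxic share, `train["toxic"].mean()`.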

### <a id='462'>Positive Sentiment</a>  
Positive sentiment refers to positive or optimistic emotions. It is a score between 0 and 1; the greater the score, the more positive the comment is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["pos"] for pols in train["polarity"] if pols["pos"] != 0], marker=dict(color='green')))
fig.update_layout(xaxis_title="Positive sentiment", title_text="Positive sentiment", 
                  yaxis_title="Number of comments", template="simple_white")

- Most comments have low positive sentiment.

In [None]:
train["positivity"] = train["polarity"].apply(lambda x: x["pos"])

nums_1 = train.sample(frac=0.1).query("toxic == 1")["positivity"]
nums_2 = train.sample(frac=0.1).query("toxic == 0")["positivity"]

fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-Toxic"],
                         colors=["red", "green"], show_hist=False)

# The curves are kernel density estimates, so the y-axis is a density, not a count
fig.update_layout(title_text="Positive Sentiment vs. Toxicity", xaxis_title="Positive Sentiment", 
                  yaxis_title="Density", template="simple_white")
fig.show()

- The higher the positivity sentiment, the more likely the comment is non-toxic. 
- However, we can see that both the distributions are very similar, indicating that positive sentiment is not an accurate indicator of comment toxicity.

### <a id='463'>Neutral Sentiment</a>  
Neutral sentiment is the proportion of the text that carries neither positive nor negative sentiment. It is a score between 0 and 1; the greater the score, the more neutral the comment is.

In [None]:
fig = go.Figure(go.Histogram(x=[pols["neu"] for pols in train["polarity"] if pols["neu"] != 1], marker=dict(color='grey')))
fig.update_layout(xaxis_title="Neutral sentiment", title_text="Neutral sentiment", 
                  yaxis_title="Number of comments", template="simple_white")

- Most comments are neutral -- meaning that they are unopinionated or unbiased. 

In [None]:
train["neutral"] = train["polarity"].apply(lambda x: x["neu"])

nums_1 = train.sample(frac=0.1).query("toxic == 1")["neutral"]
nums_2 = train.sample(frac=0.1).query("toxic == 0")["neutral"]

fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-Toxic"],
                         colors=["red", "green"], show_hist=False)

# The curves are kernel density estimates, so the y-axis is a density, not a count
fig.update_layout(title_text="Neutral Sentiment vs. Toxicity", xaxis_title="Neutral Sentiment", 
                  yaxis_title="Density", template="simple_white")
fig.show()

- A comment with a neutral sentiment of exactly 1 is likely a non-toxic comment. This is because the probability density of the non-toxic distribution jumps suddenly at 1, while the probability density of the toxic distribution is significantly lower at the same point.

### <a id='464'>Compound Sentiment</a>  
Compound sentiment summarizes the overall sentiment of the comment in a single number. It is a score between -1 and 1; scores near -1 are strongly negative, scores near 1 are strongly positive, and scores near 0 are neutral.
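
To see how a score bounded by -1 and 1 can arise: VADER's reference implementation sums the valence of the words in a text and then normalizes the sum `x` as `x / sqrt(x^2 + alpha)` with `alpha = 15`. The sketch below re-implements only that normalization step; the input sums are made-up values, and the real analyzer also adjusts the sum for things like punctuation emphasis before normalizing.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clip defensively, although the formula already stays inside (-1, 1)
    return max(-1.0, min(1.0, score))

# Hypothetical valence sums, from strongly negative to strongly positive
for valence_sum in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(valence_sum, round(normalize(valence_sum), 4))
```

The function is odd and monotonic, so the sign of the compound score always matches the sign of the summed valence, and larger sums saturate towards -1 or 1.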

In [None]:
fig = go.Figure(go.Histogram(x=[pols["compound"] for pols in train["polarity"] if pols["compound"] != 0], marker=dict(color='yellow')))
fig.update_layout(xaxis_title="Compound sentiment", title_text="Compound sentiment", 
                  yaxis_title="Number of comments", template="simple_white")

- Comments are spread fairly evenly across the different levels of compound sentiment, meaning that the comments express a variety of emotions.

In [None]:
train["compound"] = train["polarity"].apply(lambda x: x["compound"])

nums_1 = train.sample(frac=0.1).query("toxic == 1")["compound"]
nums_2 = train.sample(frac=0.1).query("toxic == 0")["compound"]

fig = ff.create_distplot(hist_data=[nums_1, nums_2],
                         group_labels=["Toxic", "Non-Toxic"],
                         colors=["red", "green"], show_hist=False)

# The curves are kernel density estimates, so the y-axis is a density, not a count
fig.update_layout(title_text="Compound Sentiment vs. Toxicity", xaxis_title="Compound Sentiment", 
                  yaxis_title="Density", template="simple_white")
fig.show()

- Non-toxic comments have a higher compound sentiment.
- Toxic comments have a lower compound sentiment.
- Compound sentiment (compared to Negative, Positive and Neutral sentiment) seems to have a more visible correlation with toxicity.
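
One way to put a number on "visible correlation" is to compute the Pearson correlation between each sentiment score and the binary `toxic` label (for a binary label this is the point-biserial correlation). The values below are made up purely to illustrate the call; on the real data the same `corrwith` expression could be applied to the `negativity`, `positivity`, `neutral` and `compound` columns of `train`.

```python
import pandas as pd

# Hypothetical toy scores; not taken from the competition data
toy = pd.DataFrame({
    "negativity": [0.0, 0.1, 0.6, 0.7, 0.0, 0.5],
    "positivity": [0.3, 0.2, 0.0, 0.1, 0.4, 0.1],
    "neutral":    [0.7, 0.7, 0.4, 0.2, 0.6, 0.4],
    "compound":   [0.6, 0.2, -0.8, -0.9, 0.7, -0.5],
    "toxic":      [0, 0, 1, 1, 0, 1],
})

score_cols = ["negativity", "positivity", "neutral", "compound"]

# Correlation of each sentiment score with the toxic label, most negative first
correlations = toy[score_cols].corrwith(toy["toxic"]).sort_values()
print(correlations)
```

The sign shows the direction of the relationship and the absolute value its strength, which makes the four scores directly comparable.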

## The next step is to build models!