<div style='color:white;background-color:#f2dde8; height: 100px; border-radius: 25px;'><h1 style='text-align:center;padding: 3%'>Jigsaw Rate Severity of Toxic Comments Competition</h1></div>

## Table of Contents
* [Dataset Information](#data_information)
    - Data description
    - Files
* [Preliminary Data Exploration](#preliminary_eda)
    - Install and Import Libraries
    - Load Data
    - General Dataset Information
* [Exploratory Data Analysis](#eda)
    - [Comments to Score Dataset](#cts)
        - Clean Text
        - Language Detection
        - Comment Length Distribution
        - Word Count Distribution
        - Distribution of Top Unigrams
        - Distribution of Top Bigrams
        - Distribution of Top Trigrams
        - Unique Words Analysis
        - Sentiment Polarity
        - Word Clouds
    - [Validation Dataset](#val)
        - Clean Text
        - Comment Length Distribution
        - Word Count Distribution
        - Distribution of Top Unigrams
        - Distribution of Top Bigrams
        - Distribution of Top Trigrams
        - Unique Words Analysis
        - Sentiment Polarity
        - Word Clouds
        - Worker Analysis

<div style='color:#40192e;background-color:#f2dde8; height: 20px; border-radius: 5px;'></div>

<a id='data_information'></a>
# Dataset Information
## Data Description

<div>
The data used for this competition are Wikipedia Talk page comments. The goal is to rank comments by toxicity severity, from innocuous to outrageous, where the middle of the scale matters as much as the extremes.<br>
<b>Important</b>:
There is no training data for this competition. You can refer to previous Jigsaw competitions for data that might be useful to train models.

<h3>Files</h3>
<span style="background-color:#e1e6e3;">comments_to_score.csv</span> - collection of comments <br>
<span style="background-color:#e1e6e3;">validation_data.csv</span> - pair rankings that can be used to validate models <br>
<span style="background-color:#e1e6e3;">sample_submission.csv</span> - a sample submission file in the correct format <br>
<br>
<b style='margin-top:1.5%;background-color:#fbffb3'><i>Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.</i></b></div>

<div style='color:#40192e;background-color:#f2dde8; height: 20px; border-radius: 5px;'></div>

<a id='preliminary_eda'></a>
# Preliminary Data Exploration

## Install and Import Libraries

In [None]:
! pip install langdetect

In [None]:
import numpy as np
import pandas as pd
import os
import re
import matplotlib.pyplot as plt

from langdetect import detect
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from PIL import Image
from wordcloud import WordCloud, STOPWORDS
import random


plt.style.use('ggplot')

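# Use full option names: short patterns such as 'max_rows' can match several options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)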

## Load Data

In [None]:
df = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')
val_df = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')

## General Dataset Information

In [None]:
print(f"The shape of the Comments to Score dataset is {df.shape} \n"
f"The shape of the validation dataset is {val_df.shape}")

In [None]:
df.info()

In [None]:
val_df.info()

In [None]:
df.head(2)

In [None]:
val_df.head(2)

<div style='color:#40192e;background-color:#f2dde8; height: 20px; border-radius: 5px;'></div>

<a id='eda'></a>
# Exploratory Data Analysis

<a id='cts'></a>
# Comments to Score Dataset
## Clean Text

In [None]:
def clean_text(text):
    # Strip HTML-like tags
    text = re.sub(r'<[^<]+?>', '', text)
    # Remove leftover angle-bracket entities (str.replace is literal, so use a regex)
    text = re.sub(r'&[lg]t;?', '', text)
    # Remove stray backslashes
    text = text.replace("\\", "")
    # Turn non-breaking spaces into regular spaces
    text = text.replace('\xa0', ' ')
    # Collapse newlines and repeated whitespace into a single space
    text = re.sub(r'\s+', ' ', text)

    return text

In [None]:
df['text'] = df['text'].apply(clean_text)

Let's take a look at the first two clean comments:

In [None]:
df.head(2)

## Language Detection

Let's detect the language of each comment:

In [None]:
df['language'] = df['text'].apply(detect)

In [None]:
count_all_language = df['language'].value_counts()
count_language_not_eng = df['language'][df.language != 'en'].value_counts()

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = count_all_language.plot(kind='bar', color = "#640372")
ax1.set_title('Frequency of languages in all comments')
ax1.set_xlabel("Languages")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = count_language_not_eng.plot(kind='bar', color = "#640372")
ax2.set_title('Frequency of languages in non-English comments')
ax2.set_xlabel("Languages")
ax2.set_ylabel("Frequency")

plt.show()

Comments classified as German:

In [None]:
df['text'][df.language=='de'].head(2)

Comments classified as Italian:

In [None]:
df['text'][df.language=='it'].head(2)

On inspection, these comments are in English, even though they were classified as another language. The large number of swear words and slang terms probably reduces the accuracy of "langdetect".
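
A minimal sketch of a more defensive detection step, in case some comments cannot be classified at all. It assumes the detector is passed in as a callable; in this notebook that would be `detect` from langdetect, which, as far as I know, raises an exception when a text has no usable features (for example an empty string). The names `safe_detect`, `fallback` and `toy_detector` are hypothetical. The langdetect README also notes that results are non-deterministic unless `DetectorFactory.seed` is set before the first call, which may add to the instability on short, slang-heavy comments.

```python
def safe_detect(text, detector, fallback="unknown"):
    """Return detector(text), or `fallback` when detection is not possible."""
    # Blank or non-string comments carry no signal, so skip the detector
    if not isinstance(text, str) or not text.strip():
        return fallback
    try:
        return detector(text)
    except Exception:
        # Assumption: any detector failure is treated as "language unknown"
        return fallback


# Stand-in detector for illustration only (this is not langdetect)
def toy_detector(text):
    if text.isdigit():
        raise ValueError("no features in text")
    return "en"


print(safe_detect("hello there", toy_detector))  # en
print(safe_detect("   ", toy_detector))          # unknown
print(safe_detect("12345", toy_detector))        # unknown
```

In the notebook this could be applied as `df['text'].apply(lambda t: safe_detect(t, detect))`.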

## Comment Length Distribution

In [None]:
comment_length = df['text'].apply(len)

fig = plt.figure(figsize=(10,8))

ax1 = comment_length.plot(kind='hist', color = "#640372", bins=100)
ax1.set_title('Comment Length Distribution')
ax1.set_xlabel("Comment Length")
ax1.set_ylabel("Frequency")

plt.show()

## Word Count Distribution

In [None]:
word_count = df['text'].apply(lambda x: len(str(x).split()))

fig = plt.figure(figsize=(10,8))

ax1 = word_count.plot(kind='hist', color = "#640372", bins=100)
ax1.set_title('Word Count Distribution')
ax1.set_xlabel("Word Count")
ax1.set_ylabel("Frequency")

plt.show()

## Distribution of Top Unigrams

In [None]:
def get_top_n_words(corpus, n=None, remove_stop_words=False, n_words=1):
    """Return the n most frequent n-grams (n_words=1 -> unigrams, n_words=2 -> bigrams, ...)."""
    stop_words = 'english' if remove_stop_words else None
    vec = CountVectorizer(stop_words=stop_words, ngram_range=(n_words, n_words)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    # Sum each n-gram column over all documents to get corpus-wide counts
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
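
To make explicit what `get_top_n_words` counts, here is a rough standard-library sketch of the same idea. It assumes `CountVectorizer`'s default behaviour (lowercasing and keeping only tokens of two or more word characters) and ignores stop-word removal. `top_ngrams` is a hypothetical helper, not part of scikit-learn.

```python
import re
from collections import Counter

# Approximation of CountVectorizer's default token pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def top_ngrams(corpus, n=None, n_words=1):
    """Count contiguous n-grams of n_words tokens across all documents."""
    counts = Counter()
    for doc in corpus:
        tokens = TOKEN_PATTERN.findall(doc.lower())
        # Slide a window of n_words tokens over the document
        counts.update(
            " ".join(tokens[i:i + n_words])
            for i in range(len(tokens) - n_words + 1)
        )
    return counts.most_common(n)


toy_corpus = ["the cat sat on the mat", "the cat ran"]
print(top_ngrams(toy_corpus, n=1, n_words=1))  # [('the', 3)]
print(top_ngrams(toy_corpus, n=1, n_words=2))  # [('the cat', 2)]
```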

Distribution of top unigrams before removing stop words:

In [None]:
common_words = get_top_n_words(df['text'], 20, remove_stop_words=False, n_words=1)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['text' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

plt.show()

Distribution of top unigrams after removing stop words:

In [None]:
common_words = get_top_n_words(df['text'], 20, remove_stop_words=True, n_words=1)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['text' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

plt.show()

## Distribution of Top Bigrams

Distribution of top bigrams before removing stop words:

In [None]:
common_words = get_top_n_words(df['text'], 20, remove_stop_words=False, n_words=2)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['text' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Bigram Distribution')
ax1.set_xlabel("Bigrams")
ax1.set_ylabel("Frequency")

plt.show()

Distribution of top bigrams after removing stop words:

In [None]:
common_words = get_top_n_words(df['text'], 20, remove_stop_words=True, n_words=2)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['text' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Bigram Distribution')
ax1.set_xlabel("Bigrams")
ax1.set_ylabel("Frequency")

plt.show()

## Distribution of Top Trigrams

Distribution of top trigrams before removing stop words:

In [None]:
common_words = get_top_n_words(df['text'], 20, remove_stop_words=False, n_words=3)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['text' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Trigram Distribution')
ax1.set_xlabel("Trigram")
ax1.set_ylabel("Frequency")

plt.show()

Distribution of top trigrams after removing stop words:

In [None]:
common_words = get_top_n_words(df['text'], 20, remove_stop_words=True, n_words=3)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['text' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('text').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Trigram Distribution')
ax1.set_xlabel("Trigrams")
ax1.set_ylabel("Frequency")

plt.show()

For trigrams, removing the stop words does not change the picture much.

## Unique Words Analysis

These analyses show that many comments consist largely of the same words repeated. Let's repeat the analysis considering only the unique words in each comment, after lowercasing the text and removing every non-alphabetic character.

In [None]:
sorted(set(["b", "a"]))

In [None]:
def get_unique_words(string):
    # Lowercase, then replace every non-alphabetic character with a space
    string = re.sub(r'[^a-z]', ' ', string.lower())
    words = string.split()
    # dict.fromkeys keeps the first occurrence of each word, in original order
    return " ".join(dict.fromkeys(words))

In [None]:
df['set_of_words'] = df['text'].apply(get_unique_words)

In [None]:
df.head(2)

Let's look at the most frequent words once each word is counted at most once per comment.

Before removing stop words:

In [None]:
common_words = get_top_n_words(df['set_of_words'], 20, remove_stop_words=False, n_words=1)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

plt.show()

After removing stop words:

In [None]:
common_words = get_top_n_words(df['set_of_words'], 20, remove_stop_words=True, n_words=1)
for word, freq in common_words:
    print(word, freq)

In [None]:
df_tmp = pd.DataFrame(common_words, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(10,8))

ax1 = df_tmp.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

plt.show()

After removing repeated words, several swear words drop in the ranking of the most frequent words. This is because the dataset contains many comments in which the same swear word is repeated many times. For example:

In [None]:
df['text'].iloc[3028]

In [None]:
df['text'].iloc[4949]
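
The effect described above is the difference between counting every occurrence of a word and counting each word once per comment. A small sketch on made-up comments, using only the standard library (`term_and_document_frequency` and the toy data are hypothetical, and the neutral word "spam" stands in for a repeated swear word):

```python
from collections import Counter


def term_and_document_frequency(comments):
    """Return (total word counts, number of comments containing each word)."""
    term_freq = Counter()
    doc_freq = Counter()
    for comment in comments:
        words = comment.lower().split()
        term_freq.update(words)      # every occurrence counts
        doc_freq.update(set(words))  # each word counts once per comment
    return term_freq, doc_freq


toy_comments = [
    "spam spam spam spam spam",
    "please read the article",
    "the article is fine",
]
term_freq, doc_freq = term_and_document_frequency(toy_comments)
print(term_freq["spam"], term_freq["the"])  # 5 2
print(doc_freq["spam"], doc_freq["the"])    # 1 2
```

A single comment that repeats one word dominates the raw counts, but falls behind once each comment contributes a word only once.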

## Sentiment Polarity

Let's use TextBlob to calculate sentiment polarity. The polarity value lies in the range of [-1, 1] where 1 means positive sentiment and -1 means a negative sentiment:

In [None]:
polarity = df['text'].map(lambda text: TextBlob(text).sentiment.polarity)

In [None]:
fig = plt.figure(figsize=(10,8))

ax1 = polarity.plot(kind='hist', color = "#640372", bins=100)
ax1.set_title('Polarity Distribution')
ax1.set_xlabel("Sentiment")
ax1.set_ylabel("Frequency")

plt.show()

Most comments have zero polarity, i.e. neutral sentiment. Let's check whether the polarity computed this way is reliable:

In [None]:
df['polarity'] = df['text'].map(lambda text: TextBlob(text).sentiment.polarity)
print(f"""A comment with the most neutral polarity: \n {df['text'][df.polarity == 0].sample(1, random_state=42).values[0]} \n
A comment with negative polarity: \n {df['text'][df.polarity == -1].sample(1, random_state=42).values[0]} \n
A comment with positive polarity: \n {df['text'][df.polarity == 1].sample(1, random_state=42).values[0]}""")

Let's take 10 random comments with neutral polarity and 10 with positive polarity:

In [None]:
print("10 comments with neutral polarity: \n")
comments = df.loc[df.polarity == 0, ['text']].sample(10, random_state=42).values
for comment in comments:
    print(f"""- {comment[0]}\n""")

In [None]:
print("10 comments with positive polarity: \n")
comments = df.loc[df.polarity == 1, ['text']].sample(10, random_state=42).values
for comment in comments:
    print(f"""- {comment[0]}\n""")

These samples suggest that the polarity computed by TextBlob is not very accurate for these comments.

## Word Clouds

Word cloud of all the comments:

In [None]:
mask = np.array(Image.open("../input/wiki-img/Wikipedia_W.png"))

In [None]:
mask = mask[:,:,3]
text = df.text.values

In [None]:
def purple_color_func(word, font_size, position, orientation, random_state=None,
                    **kwargs):
    return f"hsl(312, {random.randint(20, 60)}%, {random.randint(20, 60)}%)"

In [None]:
wc= WordCloud(background_color="#fcebff",max_words=1000,mask=mask,stopwords=set(STOPWORDS))
wc.generate(" ".join(text))
plt.figure(figsize=(15,10))
plt.axis("off")
plt.title("Word Cloud", fontsize=20)
plt.imshow(wc.recolor(color_func=purple_color_func, random_state=42),
           interpolation="bilinear")
plt.show()

Word clouds based on sentiment:

In [None]:
mask_joy = np.array(Image.open("../input/emoji-imgs/joy.png"))
mask_sad = np.array(Image.open("../input/emoji-imgs/sad.png"))

In [None]:
mask_joy = mask_joy[:,:,1]
mask_sad = mask_sad[:,:,3]

In [None]:
text_positive_polarity = df[(df.polarity > 0.9)].text.values
text_negative_polarity = df[(df.polarity < -0.9)].text.values

In [None]:
wc_positive_polarity = WordCloud(background_color="#fcebff",max_words=1000,mask=mask_joy,stopwords=set(STOPWORDS))
wc_negative_polarity = WordCloud(background_color="#fcebff",max_words=1000,mask=mask_sad,stopwords=set(STOPWORDS))

wc_positive_polarity.generate(" ".join(text_positive_polarity))
wc_negative_polarity.generate(" ".join(text_negative_polarity))

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_positive_polarity.recolor(color_func=purple_color_func, random_state=42),
           interpolation="bilinear")
ax1 = plt.title("Positive Polarity Comments", fontsize=20)

ax2 = fig.add_subplot(122)
ax2 = plt.imshow(wc_negative_polarity.recolor(color_func=purple_color_func, random_state=42),
           interpolation="bilinear")
ax2 = plt.title("Negative Polarity Comments", fontsize=20)

plt.show()

<div style='color:#40192e;background-color:#f2dde8; height: 20px; border-radius: 5px;'></div>

<a id='val'></a>
# Validation Dataset

Let's move on to the validation dataset, in which each row holds two comments, one labeled 'less toxic' and the other 'more toxic'. Each rating is made by comparing only those two comments.

In [None]:
val_df.head(2)
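
Since the pairs are meant for validating models, one natural check is the fraction of pairs in which a scoring function gives the 'more toxic' comment the higher score. The sketch below assumes pairs are given as `(less_toxic, more_toxic)` tuples; `pairwise_agreement`, `toy_score` and the toy pairs are hypothetical and only illustrate the computation.

```python
def pairwise_agreement(pairs, score):
    """Fraction of (less_toxic, more_toxic) pairs that score ranks correctly."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    correct = sum(score(less) < score(more) for less, more in pairs)
    return correct / len(pairs)


# Hypothetical scoring rule, for illustration only: count exclamation marks
def toy_score(text):
    return text.count("!")


toy_pairs = [
    ("thanks for the edit", "this is terrible!!!"),
    ("please stop", "go away!"),
    ("no way!", "this page is bad"),
]
print(round(pairwise_agreement(toy_pairs, toy_score), 3))  # 0.667
```

With the notebook's data, the pairs could be built as `list(zip(val_df['less_toxic'], val_df['more_toxic']))`.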

## Clean Text

In [None]:
val_df['less_toxic'] = val_df['less_toxic'].apply(clean_text)
val_df['more_toxic'] = val_df['more_toxic'].apply(clean_text)

## Comment Length Distribution

Let's look at the length of the comments according to the column they belong to:

In [None]:
comment_length_less_toxic = val_df['less_toxic'].apply(len)
comment_length_more_toxic = val_df['more_toxic'].apply(len)


fig = plt.figure(figsize=(10,8))

ax1 = comment_length_less_toxic.plot(kind='hist', color = "#ff7033", bins=100, alpha=1)
ax1 = comment_length_more_toxic.plot(kind='hist', color = "#622864", bins=100, alpha=0.6)
ax1.set_title('Comment Length Distribution More Toxic vs Less Toxic')
ax1.set_xlabel("Comment Length")
ax1.set_ylabel("Frequency")
ax1.legend()

plt.show()

Note that the more toxic comments tend to be shorter in general, although they show a peak around length 500.

## Word Count Distribution

In [None]:
word_count_less_toxic = val_df['less_toxic'].apply(lambda x: len(str(x).split()))
word_count_more_toxic = val_df['more_toxic'].apply(lambda x: len(str(x).split()))


fig = plt.figure(figsize=(10,8))

ax1 = word_count_less_toxic.plot(kind='hist', color = "#ff7033", bins=100)
ax1 = word_count_more_toxic.plot(kind='hist', color = "#622864", bins=100, alpha=0.7)
ax1.set_title('Word Count Distribution More Toxic vs Less Toxic')
ax1.set_xlabel("Word Count")
ax1.set_ylabel("Frequency")
ax1.legend()

plt.show()

In terms of word count, the more toxic comments have a lower word count on average than the less toxic comments.

## Distribution of Top Unigrams

Distribution of top unigrams before removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['less_toxic'], 20, remove_stop_words=False, n_words=1)
common_words_more_toxic = get_top_n_words(val_df['more_toxic'], 20, remove_stop_words=False, n_words=1)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Unigram Distribution for More Toxic Comments')
ax2.set_xlabel("Unigrams")
ax2.set_ylabel("Frequency")

plt.show()

Distribution of top unigrams after removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['less_toxic'], 20, remove_stop_words=True, n_words=1)
common_words_more_toxic = get_top_n_words(val_df['more_toxic'], 20, remove_stop_words=True, n_words=1)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Unigram Distribution for More Toxic Comments')
ax2.set_xlabel("Unigrams")
ax2.set_ylabel("Frequency")

plt.show()

The top unigrams of the less toxic comments, which contain few swear words, differ markedly from those of the more toxic comments.

## Distribution of Top Bigrams

Distribution of top bigrams before removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['less_toxic'], 20, remove_stop_words=False, n_words=2)
common_words_more_toxic = get_top_n_words(val_df['more_toxic'], 20, remove_stop_words=False, n_words=2)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Bigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Bigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Bigram Distribution for More Toxic Comments')
ax2.set_xlabel("Bigrams")
ax2.set_ylabel("Frequency")

plt.show()

As for bigrams, even without removing the stop words we can see how the comments differ between the two categories.

Distribution of top bigrams after removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['less_toxic'], 20, remove_stop_words=True, n_words=2)
common_words_more_toxic = get_top_n_words(val_df['more_toxic'], 20, remove_stop_words=True, n_words=2)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Bigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Bigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Bigram Distribution for More Toxic Comments')
ax2.set_xlabel("Bigrams")
ax2.set_ylabel("Frequency")

plt.show()

Several 'toxic' words can be spotted in the less toxic comments. This is because a comment is only labeled less toxic relative to another one; there is no guarantee that it is not toxic in absolute terms.

## Distribution of Top Trigrams

Distribution of top trigrams before removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['less_toxic'], 20, remove_stop_words=False, n_words=3)
common_words_more_toxic = get_top_n_words(val_df['more_toxic'], 20, remove_stop_words=False, n_words=3)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Trigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Trigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Trigram Distribution for More Toxic Comments')
ax2.set_xlabel("Trigrams")
ax2.set_ylabel("Frequency")

plt.show()

Distribution of top trigrams after removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['less_toxic'], 20, remove_stop_words=True, n_words=3)
common_words_more_toxic = get_top_n_words(val_df['more_toxic'], 20, remove_stop_words=True, n_words=3)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Trigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Trigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Trigram Distribution for More Toxic Comments')
ax2.set_xlabel("Trigrams")
ax2.set_ylabel("Frequency")

plt.show()

The same observation made for bigrams applies here. Toxic trigrams appear very frequently in the less toxic comments, because a comment is only labeled less toxic relative to another one; there is no guarantee that it is not toxic in absolute terms.

## Unique Words Analysis

As for the dataset in the previous section, let's look at the most frequent unigrams after keeping only the unique words of each comment.

In [None]:
val_df['set_of_words_less_toxic'] = val_df['less_toxic'].apply(get_unique_words)
val_df['set_of_words_more_toxic'] = val_df['more_toxic'].apply(get_unique_words)

Distribution of top unigrams before removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['set_of_words_less_toxic'], 20, remove_stop_words=False, n_words=1)
common_words_more_toxic = get_top_n_words(val_df['set_of_words_more_toxic'], 20, remove_stop_words=False, n_words=1)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Unigram Distribution for More Toxic Comments')
ax2.set_xlabel("Unigrams")
ax2.set_ylabel("Frequency")

plt.show()

Distribution of top unigrams after removing stop words:

In [None]:
common_words_less_toxic = get_top_n_words(val_df['set_of_words_less_toxic'], 20, remove_stop_words=True, n_words=1)
common_words_more_toxic = get_top_n_words(val_df['set_of_words_more_toxic'], 20, remove_stop_words=True, n_words=1)

In [None]:
df_tmp_less_toxic = pd.DataFrame(common_words_less_toxic, columns = ['set_of_words' , 'count'])
df_tmp_more_toxic = pd.DataFrame(common_words_more_toxic, columns = ['set_of_words' , 'count'])

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = df_tmp_less_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax1.set_title('Unigram Distribution for Less Toxic Comments')
ax1.set_xlabel("Unigrams")
ax1.set_ylabel("Frequency")

ax2 = fig.add_subplot(122)
ax2 = df_tmp_more_toxic.groupby('set_of_words').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', color = "#640372")
ax2.set_title('Unigram Distribution for More Toxic Comments')
ax2.set_xlabel("Unigrams")
ax2.set_ylabel("Frequency")

plt.show()

As seen before, several swear words drop in the ranking of the most frequent words after removing repeated words. This is because the dataset contains many comments in which swear words are repeated many times.

## Sentiment Polarity

Let's use TextBlob to calculate sentiment polarity. The sentiment polarity value lies in the range of [-1, 1] where 1 means positive sentiment and -1 means a negative sentiment:

In [None]:
polarity_less_toxic = val_df['less_toxic'].map(lambda text: TextBlob(text).sentiment.polarity)
polarity_more_toxic = val_df['more_toxic'].map(lambda text: TextBlob(text).sentiment.polarity)

In [None]:
fig = plt.figure(figsize=(10,8))

# Same colors as the earlier plots: orange = less toxic, purple = more toxic
ax1 = polarity_less_toxic.plot(kind='hist', color = "#ff7033", bins=100)
ax1 = polarity_more_toxic.plot(kind='hist', color = "#622864", bins=100, alpha=0.7)
ax1.set_title('Polarity Distribution Less vs More Toxic Comments')
ax1.set_xlabel("Sentiment")
ax1.set_ylabel("Frequency")
ax1.legend()

plt.show()

We can see that the less toxic comments tend to have a more neutral or positive polarity, while the more toxic comments tend to have a more negative polarity.

## Word Clouds

In [None]:
text_less_toxic = val_df['less_toxic'].values
text_more_toxic = val_df['more_toxic'].values

In [None]:
wc_less_toxic = WordCloud(background_color="#fcebff",max_words=1000,mask=mask_joy,stopwords=set(STOPWORDS))
wc_more_toxic = WordCloud(background_color="#fcebff",max_words=1000,mask=mask_sad,stopwords=set(STOPWORDS))

wc_less_toxic.generate(" ".join(text_less_toxic))
wc_more_toxic.generate(" ".join(text_more_toxic))

fig = plt.figure(figsize=(20,8))

ax1 = fig.add_subplot(121)
ax1 = plt.imshow(wc_less_toxic.recolor(color_func=purple_color_func, random_state=42),
           interpolation="bilinear")
ax1 = plt.title("Word Cloud Less Toxic Comments", fontsize=20)

ax2 = fig.add_subplot(122)
ax2 = plt.imshow(wc_more_toxic.recolor(color_func=purple_color_func, random_state=42),
           interpolation="bilinear")
ax2 = plt.title("Word Cloud More Toxic Comments", fontsize=20)

plt.show()

## Worker Analysis

In [None]:
print(f'We have {val_df.worker.nunique()} different workers')

Let's see if the order of toxicity between two comments belonging to a pair is consistent across workers.

In [None]:
dict_check = {}
for less, more in zip(val_df['less_toxic'], val_df['more_toxic']):
    # Order-independent key, so (A, B) and (B, A) refer to the same pair
    key_comment = tuple(sorted((less, more)))
    counts = dict_check.setdefault(key_comment, {'less_more': 0, 'more_less': 0})
    # 'less_more': the less toxic comment sorts first; 'more_less': the opposite
    if less < more:
        counts['less_more'] += 1
    else:
        counts['more_less'] += 1

In [None]:
print(f'The dataset contains {val_df.shape[0]} rated comment pairs, but far fewer unique pairs:\n'
      f'number of unique pairs of comments: {len(dict_check)}')

Let's look at how many comment pairs were ranked differently by different workers:

In [None]:
key_col = []
less_more_col = []
more_less_col = []
for key in dict_check.keys():
    key_col.append(key)
    less_more_col.append(dict_check[key]['less_more'])
    more_less_col.append(dict_check[key]['more_less'])
final_df = pd.concat([pd.Series(key_col), pd.Series(less_more_col), pd.Series(more_less_col)], axis = 1)
final_df.columns = ['comments', 'less_more', 'more_less']

In [None]:
max_time = np.max(final_df['less_more']+final_df['more_less'])
print(f'Maximum number of times a single comment pair was rated: {max_time}')

In [None]:
differently_classified = final_df[(final_df.less_more>0)&(final_df.more_less>0)]

In [None]:
print(f"The number of comment pairs ranked differently depending on who ranked them is: {differently_classified.shape[0]} out of a total of {len(dict_check)}.\n"
"This is expected: both comments often contain terms that are considered toxic, so deciding which one is 'more toxic' and which is 'less toxic' is subjective.")
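
To put a number on this disagreement, one option is to compute, for each unique pair, the share of ratings that agree with the majority ordering. This is a sketch on made-up rows, not a result from the dataset; `annotator_agreement` and `toy_rows` are hypothetical, and each row is assumed to be a `(less_toxic, more_toxic)` tuple.

```python
from collections import defaultdict


def annotator_agreement(rows):
    """Map each unordered comment pair to the share of ratings matching the majority."""
    votes = defaultdict(lambda: [0, 0])
    for less, more in rows:
        key = tuple(sorted((less, more)))
        # Index 0: the first comment of the key was rated less toxic; index 1: the opposite
        votes[key][0 if less == key[0] else 1] += 1
    return {key: max(v) / sum(v) for key, v in votes.items()}


toy_rows = [("a", "b"), ("a", "b"), ("b", "a"), ("c", "d")]
agreement = annotator_agreement(toy_rows)
print(agreement[("c", "d")])            # 1.0
print(round(agreement[("a", "b")], 3))  # 0.667
```

A value of 1.0 means all workers ordered the pair the same way; values close to 0.5 mark the pairs where the ranking is most subjective.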

<div style='color:white;background-color:#f2dde8; height: 50px; border-radius: 25px;'><h1 style='text-align:center;padding: 1%'>The End</h1></div>

This notebook ends here. I will try to update it with new analyses as I go along. Thanks for making it to the end :)