<h1><center>Comprehensive EDA for the Jigsaw comp + Sentiment analysis using Huggingface Pipeline + Benchmark Baseline</center></h1>
                           
                           
<center><img src = "https://i.imgur.com/iRX7hwu.png" width = "1000" height = "400"/></center>                                                                            

In [None]:
%config Completer.use_jedi = False

!pip install -Uqqq plotnine

!pip install -Uqqq pyicu
!pip install -Uqqq pycld2
!pip install -Uqqq morfessor
!pip install -Uqqq polyglot

!pip install -Uqqq transformers

<h3 style='background:orange; color:black'><center>Consider upvoting this notebook if you found it helpful.</center></h3>

This EDA attempts to answer the following questions I had:
1. How many unique comments are in the dataset?
1. What types of 'noise' are in the text? (spaces, and so on)
1. Are all comments compared three times?
1. How often is each comment ranked?
1. How many workers?
1. How many pairs did each worker label?
1. What is the frequency of unanimity?
1. Are some workers notably better? (i.e., do they take part in more unanimous decisions?)
1. Is the length of toxic comments any different from that of non-toxic ones?
1. Are all comments written in English?
1. What are the main sentiments in this dataset, according to the following models:
    1. Positive vs. Negative
    1. Sadness, joy, love, anger, fear and surprise
    1. Toxic, severe toxic, obscene, threat, insult, identity hate

<h3 style='background:orange; color:black'><center>Thank you =)</center></h3>


In [None]:
import matplotlib.pyplot as plt
from plotnine import *
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

comments_to_score = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')
validation_data = pd.read_csv('../input/jigsaw-toxic-severity-rating/validation_data.csv')

# 1. How many unique comments are in the dataset?

In [None]:
unique_comments = pd.Series(validation_data.melt('worker')['value'].unique())
len(unique_comments)

So in this dataset, we apparently have 14251 unique comments. Let's check if we can sanitize any of them.

# 2. What types of 'noise' are in the text? (spaces, and so on)

In [None]:
uc1 = unique_comments.str.strip().unique()
uc2 = unique_comments.str.strip().str.lower().unique()
uc3 = unique_comments.str.replace("\n", " ").str.strip().str.lower().unique()
print(f'''
    Number of unique comments: {len(unique_comments)}
    Removing leading/trailing spaces: {len(uc1)}
    And lowering the case: {len(uc2)}
    And replacing newlines: {len(uc3)}
    '''
)

Removing leading/trailing spaces and lowercasing reduced the number of unique comments by 13. There might be more stuff to clean, but I will leave it like that.
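
To make the effect of these cleaning steps concrete, here is a minimal sketch on a few made-up strings (they are not taken from the dataset). The helper name `normalize` is mine, used only for this illustration; it applies the same three steps as the cell above.

```python
# Toy comments (made up for illustration, not taken from the dataset).
raw = ["Hello world", "hello world ", "HELLO\nWORLD", "Something else"]

def normalize(text):
    # Same three steps as above: newline -> space, strip, lowercase.
    return text.replace("\n", " ").strip().lower()

print(len(set(raw)))                     # distinct raw strings
print(len({normalize(t) for t in raw}))  # distinct strings after cleaning
```

Strings that differ only by case, surrounding spaces or line breaks collapse into a single entry, which is why the unique count can only go down.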

For the remainder of my analysis, it might be handy to have unique IDs for each comment, so let's do that now.

In [None]:
def sanitize(texts): return pd.Series(texts).str.replace('\n', ' ').str.strip().str.lower()

unique_comments = sanitize(unique_comments)

id2txt = {k:v for k, v in enumerate(unique_comments)}
txt2id = {k:v for v, k in enumerate(unique_comments)}

_validation_data = validation_data.copy()
_validation_data['less_toxic_id'] = sanitize(_validation_data['less_toxic']).apply(lambda x: txt2id[x])
_validation_data['more_toxic_id'] = sanitize(_validation_data['more_toxic']).apply(lambda x: txt2id[x])
_validation_data['hashed_pair'] = (_validation_data[['less_toxic_id', 'more_toxic_id']]
                                   .astype(str).apply(lambda x: '-'.join(sorted(x)), axis=1))
_validation_data['hashed_pair_un'] = (_validation_data[['less_toxic_id', 'more_toxic_id']]
                                   .astype(str).apply(lambda x: '-'.join(x), axis=1))
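
The two keys built above serve different purposes, which a small sketch may make clearer. The function names `pair_key` and `ordered_key` are hypothetical and only mirror the `hashed_pair` and `hashed_pair_un` columns.

```python
def pair_key(id_a, id_b):
    # Order-independent: (3, 7) and (7, 3) produce the same key,
    # so it identifies the pair regardless of how a worker ranked it.
    return "-".join(sorted([str(id_a), str(id_b)]))

def ordered_key(less_toxic_id, more_toxic_id):
    # Order-dependent: the key also encodes which comment was judged less toxic.
    return f"{less_toxic_id}-{more_toxic_id}"

print(pair_key(3, 7), pair_key(7, 3))
print(ordered_key(3, 7), ordered_key(7, 3))
```

The first key is handy for counting how many times a pair was evaluated, the second for checking whether the workers agreed on the direction.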

# 3. Are all comments compared three times?

In [None]:
df = _validation_data['hashed_pair'].value_counts().to_frame('n')

(ggplot(df, aes('n'))
 + geom_bar(fill = 'orange', color = 'black')
 + geom_text(
     aes(label=after_stat('prop*100'), group=1),
     stat='count',
     va='bottom',
     format_string='{:.1f}%'
 )
 + ggtitle('Number of times each pair of comments is evaluated')
 + xlab('Number of comparisons')
 + ylab('Frequency')
)

In [None]:
df['n'].value_counts()

The pairs with only one comparison are probably a bug or could be sanitization problems, so let's further investigate this.

## Investigating if there are any more sanitization problems...

In [None]:
idx = df.index[df['n'] < 3]
single_pairs = _validation_data.set_index('hashed_pair').loc[idx]
single_comments = single_pairs[['less_toxic', 'more_toxic']].values.flatten()

from difflib import SequenceMatcher
similarity = []
for i, commentA in enumerate(tqdm(single_comments)):
    for j, commentB in enumerate(single_comments):
        if i < j:
            similarity.append(SequenceMatcher(None, commentA, commentB).ratio())
        else: 
            similarity.append(0)
        
similarity = np.array(similarity).reshape((len(single_comments), -1))

plt.imshow(similarity);

From the image, we can see that there are indeed some quite similar comments. Let's check a few of them.

In [None]:
for j in [-1, -2, -3, -4]:
    id_max = np.unravel_index(similarity.flatten().argsort()[j], similarity.shape)
    for i in [0, 1]:
        print(f'Comment {i + 1}:\n{"-"*50}')
        print(single_comments[id_max[i]])
        print("-"*50, sep = '')
    print('\n', '='*50, '\n', sep = '')

Ok, apparently, the remaining comments are a fluke of the labeling process and not just the same comment saved with artifacts.
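
For reference, the similarity score used in this check is `difflib`'s `SequenceMatcher.ratio()`, which is 1.0 for identical strings and gets closer to 0 as they share less. Below is a minimal sketch on made-up comments; it visits each unordered pair only once using `itertools.combinations`, and the helper name `pairwise_similarity` is just for this illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Made-up comments for illustration only.
texts = ["you are wrong", "you are wrong!", "completely different text"]

def pairwise_similarity(items):
    # Visit each unordered pair once (i < j), i.e. only the upper triangle.
    return {
        (i, j): SequenceMatcher(None, items[i], items[j]).ratio()
        for i, j in combinations(range(len(items)), 2)
    }

scores = pairwise_similarity(texts)
most_similar = max(scores, key=scores.get)
print(most_similar, round(scores[most_similar], 3))
```

Near-duplicates (like the first two strings) get a ratio close to 1, so sorting by this score is a cheap way to surface comments that might be the same text saved with small artifacts.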

From now on, **THOSE PAIRS ARE FILTERED OUT OF THE REST OF THE EDA**

In [None]:
idx = df.index[df['n'] == 3]
_validation_data = _validation_data.set_index('hashed_pair').loc[idx].reset_index()
_validation_data.shape

Ok, that is interesting. The new dataset has EXACTLY 30k comparisons (i.e., 10k unique pairs).

Let me save this dataset for future use.

In [None]:
_validation_data.to_csv('validation_data_clean.csv', index = False)

In [None]:
_validation_data

# 4. How often is each comment ranked?

In [None]:
df = (_validation_data[['less_toxic_id', 'more_toxic_id']]
      .melt(value_name = 'comment_id')['comment_id']
      .value_counts()
      .to_frame('n')  # name the count column explicitly so aes('factor(n)') finds it
     )
(ggplot(df, aes('factor(n)'))
 + geom_bar(fill = 'orange', color = 'black')
 + geom_text(
     aes(label=after_stat('prop*100'), group=1),
     stat='count',
     ha = 'left',
     nudge_y = 200,
     format_string='{:.1f}%'
 )
 + coord_flip()
 + ggtitle('Number of times each comment is compared')
 + xlab('Number of comparisons per comment')
 + ylab('Frequency')
 + ylim(0, 10500)
)

From this, we can see that most comments (over 8k) are compared only against a single comment (i.e., three times). But a few others are compared against multiple other comments.

In [None]:
(ggplot(df, aes('factor(n)'))
 + geom_bar(fill = 'orange', color = 'black')
 + scale_y_log10()
 + coord_flip()
 + ggtitle('Number of times each comment is compared (log scale)')
 + xlab('Number of comparisons per comment')
 + ylab('Frequency (log10)')
)

Visualizing the same data on a log scale shows a roughly linear relationship, i.e., the number of comments decays approximately exponentially with the number of comparisons, which is quite neat.

# 5. How many workers?

In [None]:
_validation_data['worker'].nunique()

753 people worked as labelers, which is quite a lot. Let's check how many pairs each of them labeled.

# 6. How many pairs did each worker label?

In [None]:
df = _validation_data['worker'].value_counts().to_frame('n')
df.T

From the table we can see that it ranges from 1 pair to 248 pairs.

In [None]:
(ggplot(df, aes('n'))
 + geom_histogram(bins = 20, fill = 'orange', color = 'black')
 + ggtitle('Histogram of Workload')
 + xlab('Number of comparisons per worker')
 + ylab('Absolute Frequency (number of workers at the bin)')
)

But the majority of workers labeled fewer than 5 pairs. Only a handful labeled more than 200.

# 7. What is the frequency of unanimity?

We define a decision as unanimous when all three workers agree.

In [None]:
unanimous_pairs = (_validation_data['hashed_pair_un'].value_counts() == 3).to_dict()
_validation_data['is_unanimous'] = _validation_data['hashed_pair_un'].map(unanimous_pairs)

In [None]:
df = _validation_data.groupby('hashed_pair').head(1)

(ggplot(df, aes('is_unanimous'))
 + geom_bar(fill = 'orange', color = 'black')
 + geom_text(
     aes(label=after_stat('prop*100'), group=1),
     stat='count',
     va='bottom',
     format_string='{:.1f}%'
 )
 + ggtitle('Quantity of Unanimous decisions')
 + xlab('Is unanimous?')
 + ylab('Absolute Frequency')
)

Remarkably, almost half of the decisions were unanimous. Keep in mind that if the three workers chose at random, only 25% of the decisions would be unanimous.
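
The 25% figure can be checked by brute force. The sketch below assumes each of the three workers picks one of the two comments independently and with equal probability (my simplifying assumption, not something the dataset guarantees), and simply enumerates every possible outcome.

```python
from itertools import product

# Each of the three workers independently picks comment "A" or "B" as the more toxic one.
outcomes = list(product("AB", repeat=3))

# A decision is unanimous when all three picks are the same.
unanimous = [o for o in outcomes if len(set(o)) == 1]

p_unanimous = len(unanimous) / len(outcomes)
print(p_unanimous)
```

Only two of the eight equally likely outcomes (AAA and BBB) are unanimous, which gives the random-chance baseline.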

# 8. Are some workers notably better? (i.e., do they take part in more unanimous decisions?)

In [None]:
_validation_data.groupby('worker')['is_unanimous'].agg(['mean', 'sum']).T

Right off the bat, we can see that some workers have zero unanimous decisions and some have 70, so the spread is wide.

In [None]:
df = (_validation_data
      .groupby('worker')['is_unanimous']
      .agg(['mean', 'sum', 'count'])
      .reset_index()
     )
df['bins_mean'] = pd.cut(df['mean'], 10)

(ggplot(df.query('count > 5'), aes(x = 'bins_mean', y = after_stat('count')))
 + geom_bar(fill = 'orange', color = 'black')
 + ggtitle('Frequency of Unanimity\n(for workers with more than 5 labels)')
 + coord_flip()
 + xlab('Relative frequency of unanimity per worker')
 + ylab('Number of workers at that frequency of unanimity')
)

Not surprisingly, the majority of workers reach unanimity about half the time (just like the global average). But a few are significantly better and a few are worse.

In [None]:
_df = df.query('count > 5').sort_values('mean', ascending = False).head(20)
(ggplot(_df, aes(x = 'factor(worker)', y = 'mean'))
 + geom_bar(stat = 'identity', fill = 'orange', color = 'black')
 + ggtitle('Unanimous frequency by worker\n(top 20 of workers with over 5 labels)')
 + coord_flip()
 + xlab('Worker id')
 + ylab('Unanimity rate')
 + scale_x_discrete(limits = _df['worker'][::-1])
)

In [None]:
_df = df.query('count > 5').sort_values('mean', ascending = False).tail(20)
(ggplot(_df, aes(x = 'factor(worker)', y = 'mean'))
 + geom_bar(stat = 'identity', fill = 'orange', color = 'black')
 + ggtitle('Unanimous frequency by worker\n(worst 20 of workers with over 5 labels)')
 + coord_flip()
 + xlab('Worker id')
 + ylab('Unanimity rate')
 + scale_x_discrete(limits = _df['worker'][::-1])
)

Let's investigate a bit further and see how the unanimity rate correlates with the total number of labels per worker.

In [None]:
(ggplot(df, aes(x = 'count', y = 'mean'))
 + geom_hline(yintercept = 0.5, color = 'red')
 + geom_point(color = 'orange')
 + geom_smooth(method = 'lm', color = 'darkorange')
)

As expected, the more pairs a worker labels, the closer their unanimity rate gets to the mean. In my opinion, we should not exploit this information, as it is most likely noise.
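
One way to see why this looks like noise: if every worker had the same underlying 50% chance of landing on a unanimous pair (an assumption I am making for the sake of the argument, together with independence between pairs), the observed rate of a worker with few labels would still swing a lot. The sketch below uses the standard error of a binomial proportion; `binomial_se` is a hypothetical helper name.

```python
import math

def binomial_se(p, n):
    # Standard error of an observed proportion from n independent trials.
    return math.sqrt(p * (1 - p) / n)

for n in (6, 50, 200):
    print(n, round(binomial_se(0.5, n), 3))
```

The spread shrinks with the square root of the number of labels, so a funnel shape around 0.5 is what pure chance would produce.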

# 9. Is the length of toxic comments any different from that of non-toxic ones?

In [None]:
df = (_validation_data
      .query('is_unanimous')
      .groupby('hashed_pair')
      .head(1)[['less_toxic', 'more_toxic']]
      .apply(lambda col: col.str.len())  # comment lengths, column by column
      .melt(var_name = 'toxicity', value_name = 'length')
     )
df_mean = df.groupby('toxicity').median().reset_index()

In [None]:
(ggplot(df, aes(x = 'length', fill = 'toxicity'))
 + geom_density(color = 'black', alpha = 0.5)
 + geom_vline(df_mean, aes(xintercept = 'length', color = 'toxicity'))
 + xlab('Length of the comment')
)

In [None]:
(ggplot(df.query('length < 500'), aes(x = 'length', fill = 'toxicity'))
 + geom_density(color = 'black', alpha = 0.5)
 + geom_vline(df_mean, aes(xintercept = 'length', color = 'toxicity'))
 + xlab('Length of the comment (clipped at 500)')
)

Indeed, the less toxic comments are a bit longer than the more toxic ones. Let's see if this could be used as a predictor for this dataset.

In [None]:
(_validation_data['less_toxic'].str.len() > _validation_data['more_toxic'].str.len()).mean()

Close, but no cigar. This naive predictor would give us 54% accuracy, which is not *that* terrible, but I am sure we can do much better using deep learning.
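
The same idea can be written as a generic pairwise evaluation, which is handy for comparing any scoring function against the labeled pairs. This is only a sketch on made-up pairs; `length_score` and `pairwise_accuracy` are hypothetical helper names, not part of the notebook.

```python
# Made-up (less_toxic, more_toxic) pairs for illustration only.
pairs = [
    ("I respectfully disagree with this edit, see the talk page.", "You idiot."),
    ("Thanks!", "Stop vandalizing or I will find you and make you regret it."),
    ("Please cite a source for that claim.", "Get lost."),
]

def length_score(text):
    # Naive scorer: the shorter the text, the higher the "toxicity" score.
    return -len(text)

def pairwise_accuracy(pairs, score):
    # A pair counts as correct when the less toxic comment gets the lower score.
    hits = [score(less) < score(more) for less, more in pairs]
    return sum(hits) / len(hits)

print(pairwise_accuracy(pairs, length_score))
```

Swapping `length_score` for a model-based scorer keeps the evaluation unchanged, so different baselines can be compared on equal footing.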

# 10. Are all comments written in English?

In [None]:
from polyglot.detect import Detector

In [None]:
def get_language(text):
    return Detector("".join(x for x in text if x.isprintable()), quiet=True).languages[0].name

In [None]:
%%capture
langs = [get_language(comment) for comment in unique_comments]

In [None]:
df = pd.DataFrame({
    'text': unique_comments,
    'lang': langs
})
df.tail()

In [None]:
df['lang'].value_counts().to_frame().T

From this, we have over 14k comments in English and over 50 in an unknown language. Let's check the unknown ones.

In [None]:
df[df['lang'] == 'un']

Ok, they are clearly too poorly written to be detected as English... lmao

In [None]:
df[df['lang'] == 'German']

The same is also true for most of the other 'non-English' labels. So we can assume that at least the vast majority is written in (sort of) English.

# 11. What are the main sentiments in this dataset?

In [None]:
from transformers import pipeline
from tqdm.notebook import tqdm

In [None]:
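def predict(text):
    # Uses the global `classifier`; returns a one-row DataFrame with one column per label.
    try:
        p = classifier(text)[0]
        df = pd.DataFrame(p).set_index('label').T
        df['text'] = text
        return df
    except Exception:
        # Skip comments the pipeline cannot handle (presumably inputs that are too long for the model)
        return None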
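
To show what `predict` does with the classifier output, here is a sketch in plain Python. Judging from how `predict` indexes its result, I am assuming the pipeline returns a nested list with one `{label, score}` dict per class; the scores below are made up, and `to_row` is a hypothetical helper that plays the role of the DataFrame pivot.

```python
# Hypothetical classifier output for a single comment (scores are made up).
raw_output = [[
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.02},
]]

def to_row(output, text):
    # Take the scores of the first (and only) input and pivot them into one flat record.
    row = {item["label"]: item["score"] for item in output[0]}
    row["text"] = text
    return row

print(to_row(raw_output, "some comment"))
```

Each comment ends up as one record with a column per label, which is the shape the box plots below rely on after melting.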

## a. Positive vs. Negative

In [None]:
classifier = pipeline("text-classification", model='distilbert-base-uncased-finetuned-sst-2-english', return_all_scores=True)

In [None]:
unique_comments[1]

In [None]:
predict(unique_comments[1])

In [None]:
predictions = [predict(text) for text in tqdm(unique_comments[:250])]
predictions = pd.concat(predictions).reset_index(drop = True)
predictions.tail(1)

In [None]:
(ggplot(predictions.melt('text'), aes(x = 'label', y = 'value'))
+ geom_boxplot(fill = 'orange', color = 'black')
)

In [None]:
(predictions['NEGATIVE'] > 0.5).mean()

In [None]:
predictions.sort_values('NEGATIVE').reset_index().loc[1, 'text']

## b. Sadness, joy, love, anger, fear and surprise

In [None]:
classifier = pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores=True, function_to_apply = 'sigmoid')

In [None]:
predictions = [predict(text) for text in tqdm(unique_comments[:250])]
predictions = pd.concat(predictions).reset_index(drop = True)
predictions.tail(1)

In [None]:
(ggplot(predictions.melt('text'), aes(x = 'label', y = 'value'))
+ geom_boxplot(fill = 'orange', color = 'black')
)

In [None]:
predictions.sort_values('love', ascending = False).reset_index().loc[1, 'text']

## c. Toxic, severe toxic, obscene, threat, insult, identity hate

In [None]:
classifier = pipeline("text-classification", model='unitary/toxic-bert', return_all_scores=True, function_to_apply = 'sigmoid')

In [None]:
predictions = [predict(text) for text in tqdm(unique_comments[:250])]
predictions = pd.concat(predictions).reset_index(drop = True)
predictions.tail(1)

In [None]:
(ggplot(predictions.melt('text'), aes(x = 'label', y = 'value'))
+ geom_boxplot(fill = 'orange', color = 'black')
)
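
A note on `function_to_apply = 'sigmoid'`: as far as I understand, it scores each label independently, which suits a multi-label problem where a comment can be, say, both obscene and an insult. The sketch below contrasts it with softmax on made-up logits, using only the standard library.

```python
import math

def sigmoid(x):
    # Maps a single logit to (0, 1), independently of the other labels.
    return 1 / (1 + math.exp(-x))

def softmax(xs):
    # Maps all logits jointly to probabilities that sum to 1.
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for three labels.
logits = [2.0, 1.5, -3.0]
print([round(sigmoid(x), 3) for x in logits])
print([round(p, 3) for p in softmax(logits)])
```

With sigmoid, two labels can both score high at the same time, while softmax forces the labels to compete for a total of 1.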

# Baseline using unitary/toxic-bert model

**For the offline version check:**
https://www.kaggle.com/coldfir3/simple-inference-notebook-on-pre-trained-model

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import torch

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
MODEL_NAME = 'unitary/toxic-bert'

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to(device)

In [None]:
# remove the slicing of [:36] to run the whole set
comments_to_score = pd.read_csv('../input/jigsaw-toxic-severity-rating/comments_to_score.csv')[:36] 
# comments_to_score_texts = comments_to_score['text'].values

In [None]:
BS = 8
def get_comments_to_score():
    txts = comments_to_score['text'].values
    for i in range(0, len(txts), BS):
        yield txts[i : i + BS].tolist()
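
A quick sanity check of the batching logic, sketched on plain integers instead of comments (the helper name `batches` is mine). Note that the last batch may be shorter, so the number of batches is the ceiling of the division, not the floor.

```python
import math

def batches(items, batch_size):
    # Yield consecutive slices; the last one may be shorter than batch_size.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

items = list(range(36))
sizes = [len(b) for b in batches(items, 8)]
print(sizes, math.ceil(len(items) / 8))
```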

In [None]:
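outputs = []
n_batches = (len(comments_to_score) + BS - 1) // BS  # round up so the last partial batch is counted
model.eval()
with torch.no_grad():  # inference only, so skip gradient tracking
    for sequences in tqdm(get_comments_to_score(), total = n_batches):
        tokens = tokenizer(sequences,
                           padding=True,
                           truncation=True,
                           add_special_tokens=True,
                           return_tensors="pt").to(device)
        output = model(**tokens)
        outputs.append(output['logits'].cpu().numpy())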

In [None]:
# keep only the first logit per comment (assumed to correspond to the model's 'toxic' label)
predictions = np.concatenate(outputs)[:,0]

In [None]:
comments_to_score['score'] = predictions
comments_to_score = comments_to_score.drop('text', axis = 1)

In [None]:
comments_to_score.to_csv('submission.csv', index = False)
comments_to_score