# Data Exploration
We start by printing a few example comments for each toxicity label:
'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate'.

This gives us a better sense of what each label really means.

Let's first import the necessary packages and data.

In [1]:
# Import packages
%matplotlib inline
import os
import pandas as pd
import numpy as np
import seaborn as sns
import nltk

In [2]:
# Import dataset
parent_path = os.path.dirname(os.getcwd())
fname = 'train.csv'
csv_path = parent_path + '/data/raw/' + fname
data = pd.read_csv(csv_path, index_col='id')

Now, let's build the function that searches the loaded data for comments of a given toxicity type.

In [3]:
def find_comments(data, toxic_type, n, shuffle=True):
    """
    Finds n comments of given toxicity type in given data.
    :param data: pandas.DataFrame
        Comments and corresponding classification
    :param toxic_type: str
        One of
            'toxic', 'severe_toxic', 'obscene', 'threat', 
            'insult', 'identity_hate'
    :param n: int
        Number of comments to retrieve
    :param shuffle: bool, default True
        If False, the first n comments of given type are returned.
        Else, a random sample of size n is returned.
    :return: pandas.DataFrame
        n comments of given toxicity type
    """
    filt = data[data[toxic_type] == 1]
    if shuffle:
        filt = filt.iloc[np.random.permutation(len(filt))]
    
    return filt.head(n)
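Because the shuffle above is unseeded, each run prints different comments. A minimal sketch of a reproducible variant follows; the function name `find_comments_seeded`, the `seed` parameter, and the toy DataFrame are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

def find_comments_seeded(data, toxic_type, n, seed=None):
    """Hypothetical variant of find_comments with reproducible sampling."""
    filt = data[data[toxic_type] == 1]
    # Never ask for more rows than are available
    n = min(n, len(filt))
    # random_state makes the random sample repeatable across runs
    return filt.sample(n=n, random_state=seed)

# Toy data standing in for train.csv (illustrative only)
toy = pd.DataFrame(
    {'comment_text': ['a', 'b', 'c', 'd'], 'toxic': [1, 0, 1, 1]},
    index=pd.Index(['id1', 'id2', 'id3', 'id4'], name='id'),
)
first = find_comments_seeded(toy, 'toxic', n=2, seed=0)
second = find_comments_seeded(toy, 'toxic', n=2, seed=0)
```

Calling it twice with the same seed returns the same rows, which makes printed examples easier to discuss and compare.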

Finally, let's print 5 comments of each toxic comment type.

In [4]:
output_classes = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
for t in output_classes:
    comments = find_comments(data, t, n=5)
    print('>>> Category: %s <<<' % t)
    for comm_id, r in comments.iterrows():
        print('===========================================')
        print('[Comment ID: %s]' % comm_id)
        print(r['comment_text'])
        print(r.loc[output_classes].to_dict())

>>> Category: toxic <<<
[Comment ID: dcef49cfef632147]
Decade montage images 

WHAT THE FUCK DID YOU DO TO MY FREE IMAGES LIKE THIS ONE, , YOU LAZY SHIT. I WORKED ON THOSE FOR TWO MONTHS, AND ALL THE IMAGES CAME FROM WIKIPEDIA AND WHERE FREE, YOU DUMB FUCK. MAYBE IF YOU DIDN'T HAVE SHIT FOR BRAINS, YOU WOULD CHECK THE SOURCES I HAD POSTED IN THE DESCRIPTION BOX, YOU LITTLE ASSHOLE.
{'toxic': 1, 'severe_toxic': 0, 'obscene': 1, 'threat': 0, 'insult': 1, 'identity_hate': 0}
[Comment ID: 9944de1897da0c1d]
"
READY TO FUCK YOU UP OLD SCHOOL """"
See the inviting place with it's friendly and fair administrators
ready to give you an olde time wiki welcome.
See the how this admin feels about wikilov

This admin , a Democrat cyber thug, banns me due to differences in writing style.

Gamaliel uses this place as a sword for his own power drunk political agenda.

Gamaliel have absolutely no evidence to say I'm a ""sock of joehazleton"" other then 
The asinine ""duck test""  I would call the duck t

## Feature Engineering
Now, let's try to find some valuable features in loaded comments.

#### Capital letters percentage
Let's start by computing the percentage of capital letters used in each comment and check if it has anything to do with the outputs.

In [17]:
def capital_letters_pct(text):
    """ Computes the fraction of letters in given text that are uppercase, ignoring non-letter characters """
    n_caps = sum(x.isupper() for x in text)
    n_lower = sum(x.islower() for x in text)
    if n_caps == 0 and n_lower == 0:
        return 0.0
    return float(n_caps) / (n_lower + n_caps)

# Some tests
assert capital_letters_pct('HELLO WORLD') == 1.0
assert capital_letters_pct('HhEeLlLlOo') == 0.5
assert capital_letters_pct('hello world!!!') == 0.0
assert capital_letters_pct('14:53,') == 0.0

In [6]:
# Compute capital letters percentage for each comment
data['caps_pct'] = data['comment_text'].apply(capital_letters_pct)

In [7]:
# Compute Pearson correlations
def print_correlations(data, label):
    for c in output_classes:
        df = data[[label, c]].dropna()
        corr = np.corrcoef(df[label].values, df[c].values)[0, 1]
        print('> %s: %.3f' % (c, corr))
        
print_correlations(data, label='caps_pct')

> toxic: 0.221
> severe_toxic: 0.169
> obscene: 0.183
> threat: 0.056
> insult: 0.169
> identity_hate: 0.089


We can see a weak positive correlation with toxic comments (0.221), while the correlation with threat comments is close to zero (0.056). Let's compute the correlations only for comments where this percentage is above a threshold (here, 0.1).

In [8]:
# Keep only comments where more than 10% of the letters are capitals
caps_data = data[data['caps_pct'] > 0.1]
print('Length filtered data: %d/%d' % (len(caps_data), len(data)))
print_correlations(caps_data, label='caps_pct')

Length filtered data: 18678/159571
> toxic: 0.425
> severe_toxic: 0.265
> obscene: 0.335
> threat: 0.111
> insult: 0.317
> identity_hate: 0.160


Now we can see that the correlation increased for every class, although only on the filtered subset (18678 of 159571 comments).
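Since every output class is binary, each Pearson coefficient printed above is a point-biserial correlation: it depends only on the gap between the mean feature value of positive and negative comments, the spread of the feature, and the class balance. The sketch below checks this identity on made-up numbers; the helper name `point_biserial` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def point_biserial(feature, label):
    """Pearson correlation between a numeric feature and a 0/1 label,
    written in terms of the two group means."""
    feature = np.asarray(feature, dtype=float)
    label = np.asarray(label).astype(bool)
    p = label.mean()                    # share of positive examples
    mean_pos = feature[label].mean()    # mean feature value where label == 1
    mean_neg = feature[~label].mean()   # mean feature value where label == 0
    # feature.std() is the population standard deviation (ddof=0)
    return (mean_pos - mean_neg) / feature.std() * np.sqrt(p * (1 - p))

# Made-up values, not taken from the dataset
feature = np.array([0.9, 0.8, 0.1, 0.2, 0.05, 0.5])
label = np.array([1, 1, 0, 0, 0, 1])
r_formula = point_biserial(feature, label)
r_pearson = np.corrcoef(feature, label)[0, 1]
```

Both values agree up to floating-point error. This also suggests why rare classes such as threat tend to show small coefficients: the factor sqrt(p * (1 - p)) shrinks as the class becomes rarer.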

#### Repetitions
Let's now try to find repetitive comments and check whether repetition is related to any of the outputs.
To measure this, we will compute the percentage of unique words in each comment.

In [29]:
def unique_words_pct(text):
    """ Computes the fraction of unique tokens in given text (NaN if there are no tokens) """
    words = nltk.word_tokenize(text)
    if len(words) == 0:
        return np.nan
    return float(len(set(words))) / len(words)
    
# Some tests
assert unique_words_pct('a b c d') == 1.0
assert unique_words_pct('rep rep rep rep') == 0.25
assert unique_words_pct('this is a test. this is a test.') == 0.5
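Note that `nltk.word_tokenize` returns punctuation marks as separate tokens and does not lowercase the text, so 'Test' and 'test' count as two different words. A minimal sketch of an alternative follows, assuming a simple regex tokenizer is acceptable; the name `unique_words_pct_simple` and the tokenizer pattern are illustrative choices, not the notebook's method.

```python
import re

def unique_words_pct_simple(text):
    """Fraction of unique words, ignoring case and punctuation.
    Returns NaN when the text contains no words."""
    # Lowercase first, then keep runs of letters, digits and apostrophes
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return float('nan')
    return len(set(words)) / len(words)

same_word_ratio = unique_words_pct_simple('Spam spam SPAM!!!')
no_words_ratio = unique_words_pct_simple('!!!')
```

With this version, a comment that shouts the same word in different casings is treated as fully repetitive.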

In [10]:
# Compute unique words pct and print correlations
data['unique_words'] = data['comment_text'].apply(unique_words_pct)
print_correlations(data, label='unique_words')

> toxic: 0.049
> severe_toxic: -0.020
> obscene: 0.039
> threat: -0.000
> insult: 0.040
> identity_hate: 0.013


All the correlations are close to zero, so on the full dataset this feature carries almost no linear signal. Let's compute the correlations only when the unique words percentage is below a certain threshold.

In [11]:
# Print correlations
unique_data = data[data['unique_words'] < 0.5]
print('Length filtered data: %d/%d' % (len(unique_data), len(data)))
print_correlations(unique_data, label='unique_words')

Length filtered data: 6235/159571
> toxic: -0.572
> severe_toxic: -0.441
> obscene: -0.473
> threat: -0.125
> insult: -0.442
> identity_hate: -0.206


Now we can see a moderate to strong negative correlation for most classes (for example, -0.572 for toxic), so this can still be an interesting feature for highly repetitive comments.

#### IP address
Another feature to explore is the presence of an IP address in the comment text. If IP addresses appear only for anonymous users, this could be an interesting feature.

In [12]:
import re
def has_ip(text):
    """ Checks whether given text contains an IPv4-like pattern (four dot-separated numbers) """
    return re.search(r'[0-9]+(?:\.[0-9]+){3}', text) is not None

assert not has_ip('a b c d')
assert has_ip('my ip is 192.159.124.13:50')
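The pattern above accepts any four dot-separated numbers, including out-of-range values such as 999.999.999.999 and fragments of longer dotted sequences. A stricter check can be sketched with the standard library's `ipaddress` module; the name `has_valid_ipv4` and the candidate pattern are assumptions for illustration. Four-part version strings such as 1.2.3.4 would still pass, since they are valid addresses.

```python
import ipaddress
import re

# Four groups of 1-3 digits, not embedded in a longer dotted number
IP_CANDIDATE = re.compile(r'(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?!\.?\d)')

def has_valid_ipv4(text):
    """Returns True if the text contains at least one valid IPv4 address."""
    for candidate in IP_CANDIDATE.findall(text):
        try:
            # Raises a ValueError subclass for out-of-range octets
            ipaddress.IPv4Address(candidate)
            return True
        except ValueError:
            continue
    return False
```

Whether the stricter check changes the count of matching comments would need to be verified on the dataset itself.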

In [13]:
# Compute feature and check how many comments have an IP address
data['has_ip'] = data['comment_text'].apply(has_ip)
print('%d/%d' % (len(data[data['has_ip']]), len(data)))

10083/159571


In [14]:
# Print correlations
print_correlations(data, label='has_ip')

> toxic: 0.034
> severe_toxic: -0.004
> obscene: 0.017
> threat: -0.002
> insult: 0.014
> identity_hate: 0.006


The correlations are all close to 0, which indicates this might not be a good feature on its own. In combination with other features, it may still prove more useful.

#### Sentiment analysis

Let's now compute the sentiment of each comment. For this purpose, we will use VADER sentiment analysis. This link explains how it works: http://datameetsmedia.com/vader-sentiment-analysis-explained/

In [15]:
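from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Requires the VADER lexicon; if it is missing, run nltk.download('vader_lexicon') once
sid = SentimentIntensityAnalyzer()
def vader_sentiment(text):
    """ Returns the VADER compound score of given text, from -1 (most negative) to 1 (most positive) """
    return sid.polarity_scores(text)['compound']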

In [16]:
# Compute VADER sentiment and print correlations
data['sentiment'] = data['comment_text'].apply(vader_sentiment)
print_correlations(data, label='sentiment')

> toxic: -0.292
> severe_toxic: -0.139
> obscene: -0.251
> threat: -0.075
> insult: -0.246
> identity_hate: -0.099


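We can see weak to moderate negative correlations in some categories, notably toxic (-0.292), obscene (-0.251) and insult (-0.246): comments with more negative sentiment are more likely to carry these labels.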
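To compare all engineered features side by side, the per-feature printouts can be collected into a single table. The sketch below assumes pandas' `DataFrame.corr`, which computes Pearson correlations and drops missing values pairwise, much like the `dropna` call in `print_correlations`; the helper name `correlation_table` and the toy DataFrame are hypothetical.

```python
import pandas as pd

def correlation_table(data, features, labels):
    """Pearson correlations with features as rows and labels as columns."""
    # Cast to float so boolean features such as has_ip are handled uniformly
    numeric = data[features + labels].astype(float)
    return numeric.corr().loc[features, labels]

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    'caps_pct': [0.0, 0.5, 1.0, 0.2],
    'has_ip': [True, False, True, False],
    'toxic': [0, 1, 1, 0],
    'threat': [1, 0, 1, 0],
})
table = correlation_table(toy, ['caps_pct', 'has_ip'], ['toxic', 'threat'])
```

On the real data, the call would take the four feature columns created in this notebook and `output_classes` as the labels.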