# Data Exploration: Toxic Comments

This kernel explores and summarizes the Wikipedia Toxic Comments data set. The summary focuses
on correlations between the comment labels, missing or otherwise odd data, and the most common
terms in toxic comments.

## Set Up: Load modules and training data

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

train = pd.read_csv("../input/train.csv")

## Label Exploration

First we need to learn about the class labels. Let's count the toxic comments for starters.

In [None]:
train.toxic.value_counts()

The good news here is that, even though it might feel worse, the percentage of toxic 
comments is not too high.
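To put a number on "not too high", the share of positive rows can be read off directly, since each label is a 0/1 column and its mean is the fraction of flagged comments. The sketch below is only an illustration: `label_rates` and the `toy` frame are hypothetical stand-ins, not part of the data set.

```python
import pandas as pd

def label_rates(df, label_cols):
    """Return the fraction of rows flagged with each binary (0/1) label."""
    return df[label_cols].mean()

# Hypothetical toy frame standing in for `train`.
toy = pd.DataFrame({
    "toxic":        [1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    "severe_toxic": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})
rates = label_rates(toy, ["toxic", "severe_toxic"])
print(rates)
```

On the real data, `train.toxic.value_counts(normalize=True)` should give the same proportion for a single label.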

In [None]:
pd.crosstab(train.toxic, train.severe_toxic)


As we might expect, all severely toxic comments are also toxic comments, but not all toxic 
comments are severe.

In [None]:
pd.crosstab(train.toxic, [train.obscene, train.threat, train.insult, train.identity_hate])

Interestingly, over a third of toxic comments are "civilly" toxic, meaning they carry none of
the obscene, threat, insult, or identity_hate labels, yet they are still disruptive. However,
comments that carry any of these labels are far more likely to be toxic. The worst cases, the
58 comments that carry all four labels, are 100% toxic.
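The same reading of the cross-tabulation can be reduced to two small summaries: the share of toxic comments with no other label, and the toxic rate grouped by how many of the other labels a comment has. This is a rough sketch under assumptions: the helper names and the `toy` frame are made up for illustration, and only the column names come from the data set.

```python
import pandas as pd

OTHER_LABELS = ["obscene", "threat", "insult", "identity_hate"]

def civil_toxic_share(df, other_labels=OTHER_LABELS):
    """Fraction of toxic rows that carry none of the other labels."""
    toxic_rows = df[df["toxic"] == 1]
    no_other = toxic_rows[other_labels].sum(axis=1) == 0
    return no_other.mean()

def toxic_rate_by_label_count(df, other_labels=OTHER_LABELS):
    """Toxic rate grouped by the number of other labels on each row."""
    n_other = df[other_labels].sum(axis=1)
    return df.groupby(n_other)["toxic"].mean()

# Hypothetical toy frame standing in for `train`.
toy = pd.DataFrame({
    "toxic":         [1, 1, 1, 0, 0, 0],
    "obscene":       [0, 1, 1, 0, 0, 1],
    "threat":        [0, 0, 0, 0, 0, 0],
    "insult":        [0, 1, 0, 0, 0, 0],
    "identity_hate": [0, 0, 0, 0, 0, 0],
})
share = civil_toxic_share(toy)
by_count = toxic_rate_by_label_count(toy)
print(share)
print(by_count)
```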

In [None]:
pd.crosstab(train.severe_toxic, [train.obscene, train.threat, train.insult, train.identity_hate])

The cases of "civil" severely toxic comments are much rarer - there are only 22. Generally, a 
smaller portion of comments are severely toxic, no matter what other labels we condition on. 
Even in the worst cases, only 20 of the 58 obscene + threat + insult + identity hate comments 
are severely toxic.

In [None]:
train.iloc[:, 2:8].corr()

The correlation matrix is another way of summarizing the relationships between labels,
although here we only see pair-wise correlations rather than the full cross-tabulation.
The story stays the same - all the correlations are positive, and the correlations for 
severe_toxic are always smaller than for toxic. 

The correlation matrix will make for a good sanity check later when making multi-label
predictions. We should expect the correlation matrix of the predicted probabilities to look
very similar to this one, else something is likely awry.
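One way such a sanity check might look is to compare the two correlation matrices entry by entry and report the largest gap. This is a minimal sketch, assuming the labels and the predicted probabilities are DataFrames with the same column names; `max_corr_gap` and the toy frames are hypothetical, and what counts as "too large" a gap is left to judgment.

```python
import pandas as pd

def max_corr_gap(y_true, y_pred):
    """Largest absolute difference between the two pairwise correlation matrices."""
    gap = (y_true.corr() - y_pred.corr()).abs()
    return gap.to_numpy().max()

# Hypothetical toy labels standing in for the label columns of `train`.
labels = pd.DataFrame({
    "toxic":        [1, 1, 0, 0, 1, 0],
    "severe_toxic": [1, 0, 0, 0, 0, 0],
})

# Predictions that track the labels keep the same correlation structure.
good_preds = labels * 0.8 + 0.1

# Predictions with one column inverted flip the sign of its correlations.
bad_preds = good_preds.copy()
bad_preds["severe_toxic"] = 1 - bad_preds["severe_toxic"]

gap_good = max_corr_gap(labels, good_preds)
gap_bad = max_corr_gap(labels, bad_preds)
print(gap_good, gap_bad)
```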

## What makes a comment toxic?

Let's start out with an overview of the comments' structures.

In [None]:
train[train.comment_text.isnull()]

There are no missing values for the comment texts, so let's check for empty strings.

In [None]:
train[train.comment_text == '']

Looks okay. If we find secretly missing values later we can deal with them then.
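One kind of "secretly missing" value that the equality check above would not catch is a comment made up only of spaces or newlines. A possible check is sketched below; `blank_comment_mask` and the toy Series are hypothetical names for illustration.

```python
import pandas as pd

def blank_comment_mask(comments):
    """True where a comment is missing, empty, or whitespace-only."""
    return comments.isna() | comments.fillna("").str.strip().eq("")

# Hypothetical toy Series standing in for `train.comment_text`.
toy_comments = pd.Series(["hello", "", "   \n", None, "ok"])
mask = blank_comment_mask(toy_comments)
print(mask.tolist())
```

On the real data this would be `train[blank_comment_mask(train.comment_text)]`.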

In [None]:
train['comment_length'] = train.comment_text.str.len()
train.comment_length.describe()

The mean is about double the median, so there are some huge comments skewing the data. The
largest comment is 5000 characters, while the interquartile range is only 96 to 435
characters. Let's look at the longest comments and see if they are naughty or nice.

In [None]:
train = train.sort_values(by="comment_length", ascending=False)
pd.set_option('display.max_colwidth', None)  # None disables truncation (pandas >= 1.0; -1 is deprecated)
train.comment_text.head(1)

Well, I've only displayed 1 comment, but change this to head(10) or so, and you'll see for
yourself that these are very vulgar and spammy. You could probably flag these basic spam posts
by looking for a low ratio of unique words to comment length. For the record, you and I both
surely do not want Jimmy Wales to die!
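As a rough illustration of that spam heuristic, the ratio could be computed per comment as below. This is a sketch under assumptions: it measures length in whitespace-separated words rather than characters, lowercases the text, and `unique_word_ratio` is a hypothetical helper name.

```python
def unique_word_ratio(text):
    """Share of distinct words among all whitespace-separated words in `text`."""
    words = text.lower().split()
    if not words:
        return 0.0  # treat empty comments as having no unique content
    return len(set(words)) / len(words)

# Repetitive text scores low, varied text scores high.
spammy = unique_word_ratio("spam spam spam spam")
varied = unique_word_ratio("a b c")
print(spammy, varied)
```

Applied to the data, this would be `train.comment_text.apply(unique_word_ratio)`. Any cut-off for "low" would need to be chosen by inspecting the results.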

In [None]:
one_percent = int(np.ceil(train.shape[0] / 100))
train_sub = train.iloc[0:one_percent, :]
train_sub.toxic.value_counts()

Long comments in general aren't especially toxic. Even among the longest 1% of comments, over
80% are not toxic.
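The same check can be wrapped in a small function so that other cut-offs (the longest 5%, 10%, and so on) are easy to try. This is a minimal sketch; `toxic_rate_in_longest` and the toy frame are hypothetical, and only the `comment_text` and `toxic` column names come from the data set.

```python
import numpy as np
import pandas as pd

def toxic_rate_in_longest(df, frac):
    """Toxic rate among the longest `frac` share of comments (by character count)."""
    n = int(np.ceil(len(df) * frac))
    with_length = df.assign(comment_length=df["comment_text"].str.len())
    longest = with_length.nlargest(n, "comment_length")
    return longest["toxic"].mean()

# Hypothetical toy frame: ten comments of lengths 1..10, one of them toxic.
toy = pd.DataFrame({
    "comment_text": ["x" * k for k in range(1, 11)],
    "toxic":        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})
rate_top_half = toxic_rate_in_longest(toy, 0.5)
rate_all = toxic_rate_in_longest(toy, 1.0)
print(rate_top_half, rate_all)
```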

## Most common terms in toxic comments

Let's use the TfidfVectorizer class from scikit-learn to analyze common words in the
toxic comments. I'll pass it the list of common English stop-words, but otherwise let's
not worry about cleaning up much. This should give us an idea of which words appear most
commonly in toxic comments, but not so much in general (an advantage over simple counting).
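To see why IDF weighting favors words that are not common everywhere, here is a hand-rolled sketch of the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn documents for its default `smooth_idf=True` setting. The tokenization here is plain whitespace splitting and there is no row normalization, so the numbers will not match TfidfVectorizer's output exactly. `smoothed_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def smoothed_idf(docs):
    """Smoothed IDF per term: ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, however often it repeats.
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1
            for term, df in doc_freq.items()}

# Hypothetical toy corpus: "wikipedia" appears in most documents, "awful" in one.
docs = [
    "wikipedia is great",
    "wikipedia page edit",
    "wikipedia vandal",
    "awful vandal",
]
idf = smoothed_idf(docs)
print(idf["wikipedia"], idf["awful"])
```

A term found in nearly every document gets a weight close to 1, while a rare term gets a larger weight, so raw frequency alone does not decide the ranking.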

In [None]:
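# Sort so the toxic comments come first, then slice them off the top.
train = train.sort_values(by='toxic', ascending=False)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train.comment_text)
# Count the toxic rows rather than hard-coding 9237.
n_toxic = int(train.toxic.sum())
X_toxic = X[:n_toxic, :]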

In [None]:
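# Mean TF-IDF weight of each term across the toxic comments, as a flat array.
means = np.asarray(X_toxic.mean(axis=0)).ravel()
top_ten = np.argsort(-means)[:10]

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (1.0+).
feature_names = vectorizer.get_feature_names_out()
for rank, ind in enumerate(top_ten, start=1):
    print(rank, ":", feature_names[ind])
# feature_names[np.argmax(means)]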

No major surprises there, eh? It makes sense that most folks using the F word on
Wikipedia are not terribly constructive in their comments. Interestingly, "wikipedia" still
shows up in the list even after re-weighting with IDF. That's all for now, but maybe I'll be
back with some visualization soon.