# Toxic Comment Classification Challenge
### Identify and classify toxic online comments

<img src='https://storage.googleapis.com/kaggle-media/competitions/jigsaw/003-avatar.png' height=150 width=150/>
Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

The [Conversation AI](https://conversationai.github.io/) team, a research initiative founded by [Jigsaw](https://jigsaw.google.com/) and Google (both a part of Alphabet), is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the [Perspective API](https://perspectiveapi.com/), including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s [current models](https://github.com/conversationai/unintended-ml-bias-analysis). You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

Disclaimer: the dataset for this competition contains text that may be considered profane, vulgar, or offensive.

## Let's load the necessary packages
___
The libraries below will be used to load and explore the Toxic Comment Classification Challenge data
<img src='https://media.giphy.com/media/12Q9qZRnnab0T6/giphy.gif' height=100 />

**File descriptions**  
*train.csv* - the training set, contains comments with their binary labels  
*test.csv* - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.  
*sample_submission.csv* - a sample submission file in the correct format

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # for visuals
import matplotlib.pyplot as plt # for plots
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))
%matplotlib inline

## Loading Data
___
Loading training, testing and sample submission datasets

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
submission = pd.read_csv('../input/sample_submission.csv')

## Exploring Data
___
> Errors using inadequate data are much less than those using no data at all.   
> by **Charles Babbage**


In [None]:
train.head()

In [None]:
train.shape

## Dataset Overview
___
The dataset has 
* About 160,000 records 
* 7 columns, excluding the id column

Let's get started with the statistical analysis.

### Checking Data types
___
It is important to know what type of data you are working with. Even though we already know this dataset, it is worth checking the data types to be sure.

In [None]:
train.dtypes

### Checking Missing Values
___
The data is not expected to contain missing values, and the check below verifies that. Finding missing values in text can be challenging, since an empty string is not reported as null.

In [None]:
train.isnull().any()
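
`isnull()` only flags true nulls (NaN or None), so an empty or whitespace-only comment would pass the check above. Here is a minimal sketch of an extra check, assuming a small made-up frame in place of `train`. The `toy` frame and the `blank_mask` name are illustrative and not part of this notebook.

```python
import pandas as pd

# Toy stand-in for `train`; only the column name matches the real data.
toy = pd.DataFrame({"comment_text": ["fine comment", "", "   ", None]})

# True nulls, as reported by isnull()
null_mask = toy["comment_text"].isnull()

# Non-null entries that are empty once surrounding whitespace is stripped
stripped = toy["comment_text"].fillna("").str.strip()
blank_mask = stripped.eq("") & ~null_mask

print("null:", int(null_mask.sum()), "blank:", int(blank_mask.sum()))
```

The same two masks could be computed on `train.comment_text` to confirm that no comment is effectively empty.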

### Statistical Overview of Labels
___
The labels in the dataset are as follows: 
* 'toxic'  
* 'severe_toxic'  
* 'obscene'  
* 'threat'  
* 'insult'  
* 'identity_hate'  

In [None]:
train.describe()

### Let's plot the results to understand them graphically

In [None]:
train.describe().plot(kind='bar')

### Visualizing data on a pairplot graph
___
Plot pairwise relationships in the dataset

In [None]:
sns.pairplot(train)

### The diagonal plots are as follows, with the numbers that reflect the bar graphs
___

#### Displaying the numbers first

In [None]:
print(train.obscene.value_counts())
print(train.threat.value_counts())
print(train.insult.value_counts())
print(train.identity_hate.value_counts())
print(train.toxic.value_counts())
print(train.severe_toxic.value_counts())
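
Because each label is a 0/1 column, the same information can also be read from one compact table: the column sum is the number of positive comments and the column mean is the positive rate. A small sketch on a made-up frame follows; the values are illustrative only and are not taken from the real data.

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Toy stand-in for `train`; the values are made up for illustration.
toy = pd.DataFrame(
    [[1, 0, 1, 0, 1, 0],
     [0, 0, 0, 0, 0, 0],
     [1, 1, 1, 0, 1, 0],
     [0, 0, 0, 0, 0, 0]],
    columns=label_cols,
)

summary = pd.DataFrame({
    "positives": toy[label_cols].sum(),       # count of rows labelled 1
    "positive_rate": toy[label_cols].mean(),  # share of rows labelled 1
})
print(summary)
```

Replacing `toy` with `train` would give the per-label counts printed above in a single frame.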

The following is a simplified version of the diagonal, from top left to bottom right

In [None]:
# One count plot per label, laid out on a 2x3 grid
fig, plots = plt.subplots(2,3,figsize=(18,12))
plot1, plot2, plot3, plot4, plot5, plot6 = plots.flatten()
# Pass the column as x= so newer seaborn versions do not treat it as `data`
sns.countplot(x=train['obscene'], palette='deep', ax=plot1)
sns.countplot(x=train['threat'], palette='muted', ax=plot2)
sns.countplot(x=train['insult'], palette='pastel', ax=plot3)
sns.countplot(x=train['identity_hate'], palette='dark', ax=plot4)
sns.countplot(x=train['toxic'], palette='colorblind', ax=plot5)
sns.countplot(x=train['severe_toxic'], palette='bright', ax=plot6)

# Let's do a bit of text cleaning
___

The patterns for converting informal text to formal syntax were obtained from the notebook below:  
https://www.kaggle.com/gakngm/some-predictions-for-toxic-comments  
Titled: **Some predictions for Toxic Comments**  by  [Gael Kngm](https://www.kaggle.com/gakngm)   
Thanks to Gael for the good start  



In [None]:
import re

# (pattern, replacement) pairs; specific contractions come before the generic rules
structured_patterns = [
 (r'won\'t', 'will not'),
 (r'can\'t', 'cannot'),
 (r'i\'m', 'i am'),
 (r'ain\'t', 'is not'),
 (r'(\w+)\'ll', r'\g<1> will'),
 (r'(\w+)n\'t', r'\g<1> not'),
 (r'(\w+)\'ve', r'\g<1> have'),
 (r'(\w+)\'s', r'\g<1> is'),
 (r'(\w+)\'re', r'\g<1> are'),
 (r'(\w+)\'d', r'\g<1> would')
]

class RegexpReplacer(object):
    def __init__(self, patterns=structured_patterns):
        # Compile each pattern once so it can be reused for every comment
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]

    def replace(self, text):
        # Apply the substitutions in order and return the expanded text
        s = text
        for (pattern, repl) in self.patterns:
            s = pattern.sub(repl, s)
        return s
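
The order of the patterns matters: if the generic `(\w+)n't` rule ran before the specific `won't` rule, "won't" would become "wo not". A minimal, self-contained sketch of the same idea follows, using a subset of the patterns. The `expand` helper is a hypothetical name used only for this illustration.

```python
import re

# A subset of the patterns above; specific contractions are listed first
# so that "won't" is not caught by the generic (\w+)n't rule.
patterns = [
    (re.compile(r"won't"), "will not"),
    (re.compile(r"can't"), "cannot"),
    (re.compile(r"(\w+)n't"), r"\g<1> not"),
    (re.compile(r"(\w+)'s"), r"\g<1> is"),
]

def expand(text):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, repl in patterns:
        text = pattern.sub(repl, text)
    return text

print(expand("i can't go and they won't say why"))
print(expand("she doesn't know"))
print(expand("gael's notebook"))  # a possessive 's is expanded to "is" as well
```

The last line shows a limitation of the simple rules: possessives are treated like the contraction of "is".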


### Removing symbols in the text 
___
Example:  
*     **from** hello, i need two$   
*     **to** hello i need two  

In [None]:
import re
def strip_symbols(text):
    return ' '.join(re.compile(r'\W+', re.UNICODE).split(text))
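
As a quick check of the example above, the sketch below repeats the function so it runs on its own and applies it to the sample sentence. One detail worth noting is that a trailing symbol leaves a trailing space, and that `_` counts as a word character, so it is kept.

```python
import re

def strip_symbols(text):
    return ' '.join(re.compile(r'\W+', re.UNICODE).split(text))

cleaned = strip_symbols("hello, i need two$")
print(repr(cleaned))          # the trailing "$" leaves a trailing space
print(repr(cleaned.strip()))  # strip() removes it if that matters
print(strip_symbols("snake_case stays"))  # "_" is a word character, so it is kept
```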

### Convert the text to lowercase
___
Standardizing the text to lowercase and replacing newlines with spaces, to avoid creating new words from the text

In [None]:
train.comment_text = train.comment_text.str.lower()
train.comment_text = train.comment_text.str.replace('\n',' ')
replacer = RegexpReplacer()

#### Removing symbols and converting text from informal to formal text
___
Example:   
    **from** :  I Can't do it    
    **to** : i cannot do it 

In [None]:
train.comment_text = train.comment_text.apply(lambda x:replacer.replace(x))
train.comment_text = train.comment_text.apply(lambda x:strip_symbols(x))

## Display the clean text 
___
Displaying the first few rows of the data 

In [None]:
train.comment_text.head()

## Wordclouds for clean dataset 
___
**Let's define word clouds first:**    
Word cloud: an image composed of words used in a particular text or subject, in which the size of each word indicates its frequency or importance.   
**Warning**  
    Please note that some words are toxic, since the dataset contains toxic comments 


In [None]:
from wordcloud import WordCloud
wordcloud = WordCloud(width=1440, height=1080).generate(" ".join(train.comment_text.astype(str)))
plt.figure(figsize=(20, 15))
plt.imshow(wordcloud)
plt.axis('off')

# Machine Learning
___
<img src='http://www.princeton.edu/~samory/samoryDraw1.jpg' height=250 width=600/>
Load the necessary packages for model fitting and making predictions

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

* toxic
* severe_toxic
* obscene
* threat
* insult
* identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.


## Load Machine Learning Packages 
___
* Loading Bernoulli Naive Bayes, since it works well with text for a sample notebook; this can be improved later by moving to TensorFlow and Keras with models such as RNN, LSTM and GRU  
* This notebook will be updated soon; for now it imports the TF-IDF utilities but fits the model on CountVectorizer features
* One-versus-Rest classifier for fitting multiple labels 

In [None]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer, CountVectorizer

## Stop words removal
___
Removing English stop words from the text to reduce noise and keep the more meaningful words, then fitting the model on the training dataset 

In [None]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(train.comment_text)
y = train.loc[:,'toxic':'identity_hate']
clf = BernoulliNB()
model = OneVsRestClassifier(clf)
model.fit(X, y)
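
To make the shape of this step concrete, here is a minimal sketch of the same pipeline on a few made-up comments with two made-up label columns. The documents, labels and variable names are assumptions for illustration only. `OneVsRestClassifier` fits one `BernoulliNB` per label column, and `predict_proba` returns one probability column per label.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB

# Made-up comments and two made-up label columns (toxic, threat).
docs = ["you are great", "you are awful", "awful awful threat", "great work"]
labels = np.array([[0, 0], [1, 0], [1, 1], [0, 0]])

vec = CountVectorizer()
X_toy = vec.fit_transform(docs)

# One BernoulliNB is fitted per label column
clf = OneVsRestClassifier(BernoulliNB()).fit(X_toy, labels)

probs = clf.predict_proba(vec.transform(["awful threat", "great"]))
print(probs.shape)  # one row per comment, one column per label
print(probs)
```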

## Testing Dataset
___
We will apply all the steps used on the training dataset to the comment_text column only, since it is the only text column in the test dataset. The target results are the probabilities of every class label for each text given to the model

In [None]:
## display the first few records for test dataset
test.head()

#### Converting to lowercase and replacing newlines with spaces

In [None]:
test.comment_text = test.comment_text.str.lower()
test.comment_text = test.comment_text.str.replace('\n',' ')

## Check dimensions
___
We need to check the size of the test dataset. Later we will make sure its feature dimension is the same as that of the training dataset

In [None]:
test.shape

### Removing symbols and converting informal text to formal

In [None]:
test.comment_text = test.comment_text.apply(lambda x:replacer.replace(x))
test.comment_text = test.comment_text.apply(lambda x:strip_symbols(x))

## Making train and test have the same dimensions
* To make sure the dimensions are the same, the vectorizer uses .transform(text) here, since it was already fitted with .fit_transform() on the training text 
* Let's print the dimensions of X_test and X_train to show they have the same number of columns

In [None]:
X_test = vectorizer.transform(test.comment_text)

In [None]:
## remember: the number of columns must match, not the number of rows
print("X train shape : ",X.shape )
print("X test shape : ",X_test.shape)
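
The reason the column counts match is that `fit_transform` learns the vocabulary once, and `transform` only reuses it, ignoring any word it has not seen. A small sketch with made-up documents follows; the names `train_docs`, `test_docs`, `X_train_toy` and `X_test_toy` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good comment", "bad comment"]  # made-up training text
test_docs = ["good unseen word"]              # contains words never seen in training

vec = CountVectorizer()
X_train_toy = vec.fit_transform(train_docs)  # learns the vocabulary
X_test_toy = vec.transform(test_docs)        # reuses it; unseen words are ignored

print(X_train_toy.shape, X_test_toy.shape)   # same number of columns
print(sorted(vec.vocabulary_))
print(X_test_toy.toarray())
```

Calling `fit_transform` again on the test text would instead learn a new vocabulary, and the columns would no longer line up with the ones the model was trained on.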

# Predictions 
___
Let's predict probabilities for the test dataset
<img src='https://media-exp2.licdn.com/mpr/mpr/AAEAAQAAAAAAAAliAAAAJGJhNWZmYWM2LTVjMjAtNDkwNS05MzJiLWE4MzAxNmVjNzliZQ.png'>

In [None]:
probs = model.predict_proba(X_test)

## Replacing sample probabilities
___
The sample submission frame below is overwritten with the model's predicted probabilities

In [None]:
submission.loc[:,'toxic':'identity_hate'] = probs

## Identity hate probabilities  
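* Distribution of up to 3000 identity_hate probabilities greater than 0.5 (green) and up to 3000 less than 0.2 (red), next to a violin plot of insult against toxic for the first 50,000 training rows 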

In [None]:
plt.figure(figsize=(12, 8))
plt.subplot(1,2,1)
sns.violinplot(x = 'toxic', y = 'insult', data = train[0:50000])
plt.subplot(1,2,2)
sns.distplot(submission[submission['identity_hate'] > 0.5]['identity_hate'][0:3000], color = 'green')
sns.distplot(submission[submission['identity_hate'] < 0.2 ]['identity_hate'][0:3000], color = 'red')

In [None]:
submission.to_csv('submission.csv', index=False)

## ROC Curve

In [None]:
from sklearn.metrics import roc_curve, roc_auc_score
# The test set has no labels, so score against the true training labels.
# This is an in-sample estimate and therefore optimistic.
y_true = y['severe_toxic']              # label column 1
y_score = model.predict_proba(X)[:, 1]  # predicted probability for the same label
fpr, tpr, thresholds = roc_curve(y_true, y_score)
bernoulli_auc = roc_auc_score(y_true, y_score)
plt.figure()
plt.plot(fpr, tpr, label='Bernoulli Naive Bayes (area = %0.2f)' % bernoulli_auc)
plt.plot([0, 1], [0, 1], 'k--', label='Base Rate')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve (severe_toxic)')
plt.legend(loc='lower right')
plt.show()
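
A ROC curve compares predicted scores against true labels, and test.csv has none, so a fairer estimate would come from a held-out part of train.csv. Below is a minimal sketch of the per-label computation, with small made-up arrays standing in for held-out labels and `predict_proba` output. The arrays and names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up ground truth and scores: four comments (rows), two labels (columns).
y_true = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_score = np.array([[0.10, 0.40], [0.40, 0.80], [0.35, 0.30], [0.80, 0.70]])

# One AUC per label column, then the average across labels
aucs = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])]
mean_auc = float(np.mean(aucs))
print(aucs, mean_auc)
```

With real data, `y_true` and `y_score` would be the labels and predicted probabilities for rows that the model did not see during fitting.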

## Both negative and positive comments are welcome