## Overview

This project aims to build a multi-headed model capable of detecting different types of toxicity, such as threats, obscenity, insults, and identity-based hate. We use a dataset of comments from Wikipedia’s talk page edits, made available through Kaggle. Improvements to the current model are intended to help online discussion become more productive and respectful.

##### Packages Load

In [1]:
# Packages for data processing and wrangling
import pandas as pd
import numpy as np

# Packages for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py 
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

# Packages for feature engineering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# Packages for modeling
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

## Data Description

##### Data Load

In [2]:
train = pd.read_csv('./data/train.csv')
test_X = pd.read_csv('./data/test.csv')
test_y = pd.read_csv('./data/test_labels.csv')
submission = pd.read_csv('./data/sample_submission.csv')

In [7]:
train.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


In [13]:
test_X.head()

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.


In [14]:
test_y.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,-1,-1,-1,-1,-1,-1
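Every label in the rows shown above is -1. A reasonable reading, stated here as an assumption rather than something the data itself confirms, is that -1 marks comments with no usable label, so those rows would need to be dropped before any evaluation. The sketch below shows one way to do that on toy stand-ins for `test_X` and `test_y`; the helper name `keep_labeled_rows` and the toy ids are hypothetical.

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def keep_labeled_rows(comments: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Join comments to labels on `id` and drop rows where any label is -1."""
    merged = comments.merge(labels, on="id", how="inner")
    has_labels = (merged[LABELS] != -1).all(axis=1)
    return merged.loc[has_labels].reset_index(drop=True)

# Toy stand-ins for test_X and test_y (hypothetical ids and text)
toy_X = pd.DataFrame({"id": ["a1", "b2", "c3"],
                      "comment_text": ["first", "second", "third"]})
toy_y = pd.DataFrame({"id": ["a1", "b2", "c3"]})
for name in LABELS:
    toy_y[name] = [-1, 0, 0]      # row a1 is unlabeled in every column
toy_y.loc[2, "toxic"] = 1         # row c3 is labeled toxic

labeled = keep_labeled_rows(toy_X, toy_y)
print(labeled[["id", "toxic"]])
```

On the real data the call would be `keep_labeled_rows(test_X, test_y)`, assuming both frames share the `id` column as the previews suggest.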


In [15]:
submission.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.5,0.5,0.5,0.5,0.5,0.5
1,0000247867823ef7,0.5,0.5,0.5,0.5,0.5,0.5
2,00013b17ad220c46,0.5,0.5,0.5,0.5,0.5,0.5
3,00017563c3f7919a,0.5,0.5,0.5,0.5,0.5,0.5
4,00017695ad8997eb,0.5,0.5,0.5,0.5,0.5,0.5


## Exploratory Data Analysis

## Feature Engineering

In [19]:
# Vectorizing the text data
tokenizer = TfidfVectorizer(stop_words='english',
                            token_pattern='(?u)\\b[a-zA-Z]{1,}\\b')
train_X = tokenizer.fit_transform(train.comment_text)
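To make the effect of these settings concrete, here is a minimal sketch on a toy corpus (the sentences are invented for illustration). It uses the same `stop_words` and `token_pattern` values as the cell above, so tokens containing digits and English stop words are discarded. It also shows the usual pattern for held-out text: call `transform`, not `fit_transform`, so the test matrix shares the training vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy comments; not taken from the dataset
toy_train = ["You are my hero 123",
             "The edits were reverted in 2016",
             "hero edits"]
toy_test = ["my hero made 3 edits"]

vec = TfidfVectorizer(stop_words="english",
                      token_pattern=r"(?u)\b[a-zA-Z]{1,}\b")

X_train = vec.fit_transform(toy_train)  # learns vocabulary and IDF weights
X_test = vec.transform(toy_test)        # reuses them; unseen words are ignored

vocab = sorted(vec.vocabulary_)
print(vocab)
print(X_train.shape, X_test.shape)
```

Both matrices have the same number of columns, which is what a downstream classifier fitted on `X_train` requires.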

In [20]:
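# Check tokens (get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is its replacement)
tokenizer.get_feature_names_out().tolist()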

['aa',
 'aaa',
 'aaaa',
 'aaaaa',
 'aaaaaaaa',
 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaany',
 'aaaaaaaaaah',
 'aaaaaaaaaahhhhhhhhhhhhhh',
 'aaaaaaaaadm',
 'aaaaaaaacfo',
 'aaaaaaaaczy',
 'aaaaaaaahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh',
 'aaaaaaaayui',
 'aaaaaaahhhhhhhhhhhhhhhhhhhhhhhh',
 'aaaaaaw',
 'aaaaah',
 'aaaah',
 'aaaannnnyyyywwwwhhhheeeerrrreeee',
 'aaaawwww',
 'aaages',
 'aaaghh',
 'aaah',
 'aaahhh',
 'aaahs',
 'aaai',
 'aaajade',
 'aaand',
 'aaarrrgggh',
 'aaaww',
 'aab',
 'aaba',
 'aaberg',
 'aabove',
 'aac',
 'aacargo',
 'aacd',
 'aachen',
 'aachi',
 'aacs',
 'aadd',
 'aademia',
 'aadil',
 'aadmi',
 'aadu',
 'aae',
 'aaets',
 'aaf',
 'aaffect',
 'aafia',
 'aafp',
 'aafs',
 'aagadu',
 'aage',
 'aagf',
 'aagin',
 'aah',
 'aahahahahahaha',
 'aahank',
 'aahh',
 'aahil',
 'aahoa',
 'aai',
 'aaiha',
 'aajacksoniv',
 'aajonus',
 'aakash',
 'aake',
 'aalborg',
 'aalertbot',
 'aaliya',
 'aaliyah',
 'aals',
 'aalst',
 'aam',
 'aami',
 'aamir',
 'aamirjamil'
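Many of the tokens listed above look like elongated or one-off strings that probably occur in very few comments. One possible way to shrink the vocabulary, offered as a suggestion and not as a step the notebook takes, is the `min_df` parameter of `TfidfVectorizer`, which drops terms that appear in fewer than the given number of documents. The sketch below compares the two settings on an invented toy corpus; the threshold of 2 is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy comments containing elongated one-off tokens
toy_docs = ["aaah that edit was great",
            "that edit was reverted",
            "aaaaaah we discuss the edit"]

pattern = r"(?u)\b[a-zA-Z]{1,}\b"

# Same settings as the notebook cell: every surviving token is kept
full = TfidfVectorizer(stop_words="english", token_pattern=pattern).fit(toy_docs)

# min_df=2 keeps only terms that occur in at least two documents
pruned = TfidfVectorizer(stop_words="english", token_pattern=pattern,
                         min_df=2).fit(toy_docs)

print(len(full.vocabulary_), "terms without pruning")
print(sorted(pruned.vocabulary_), "after min_df=2")
```

A suitable threshold for the real corpus would have to be chosen empirically, for example by checking validation scores at a few values.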

## Modeling

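Since there are six target labels (toxic, severe_toxic, obscene, threat, insult, identity_hate), each stored as its own binary column, one straightforward option is to train six binary classifiers, one per label.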
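The one-classifier-per-label idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's final model: the toy comments and their label matrix are invented, and `LogisticRegression` is picked only because it is among the imported estimators. Each model outputs a probability for its label, which lines up with the per-label columns of the sample submission.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Invented toy comments and hypothetical labels (one column per entry in LABELS)
texts = ["thanks for the helpful edit",
         "you are an idiot",
         "i will hurt you",
         "great article, well sourced",
         "stupid filthy idiot",
         "please see the talk page"]
Y = np.array([[0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 1, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1],
              [0, 0, 0, 0, 0, 0]])

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Fit one independent binary classifier per label column
models = {name: LogisticRegression(max_iter=1000).fit(X, Y[:, j])
          for j, name in enumerate(LABELS)}

# Score new comments: column j holds P(label j = 1)
X_new = vec.transform(["what an idiot", "nice edit, thanks"])
probs = np.column_stack([models[name].predict_proba(X_new)[:, 1]
                         for name in LABELS])
print(probs.shape)
```

Fitting the labels independently ignores any correlation between them (for example between `toxic` and `severe_toxic`), which is a known trade-off of this simple setup.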

### Reference