# ☣️ Jigsaw - Super simple Naive Bayes [LB=0.768]

## Very simple Naive Bayes with `LB=0.768`.

Using data from [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)

I have uploaded this data as a public Kaggle dataset:
* [jigsaw-toxic-comment-classification-challenge](https://www.kaggle.com/julian3833/jigsaw-toxic-comment-classification-challenge)


# Please, _DO_ upvote!

# Imports

In [1]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Create train data

The original competition was multi-label: each comment carries six toxicity labels.

We collapse them into a binary toxic / non-toxic target: `y = 1` if any of the six labels is set.

In [2]:
df = pd.read_csv("data/train.csv")
df['y'] = (df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1) > 0 ).astype(int)
df = df[['comment_text', 'y']].rename(columns={'comment_text': 'text'})
df.sample(5)

Unnamed: 0,text,y
85488,"""\n\n Please do not vandalize pages, as you di...",0
51692,"If you're that sensitive, why don't you kill y...",1
17303,Category:Articles requiring a direct DNB link,0
71714,Thought you'd like to know ),0
66298,Do not remove my accurate commentary \n\ndont ...,0
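As a small illustration of the label collapse above, here is a sketch on a hypothetical three-row frame (the `toy` frame and its values are made up for the example; only the column names come from the competition data):

```python
import pandas as pd

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical three-row frame standing in for train.csv
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],   # no label set        -> y = 0
     [1, 0, 1, 0, 1, 0],   # several labels set  -> y = 1
     [0, 0, 0, 1, 0, 0]],  # a single label set  -> y = 1
    columns=label_cols,
)

# A row is toxic (y=1) if at least one of the six labels is set
toy['y'] = (toy[label_cols].sum(axis=1) > 0).astype(int)
y_list = toy['y'].tolist()
print(y_list)
```

A single positive label is enough to mark a comment as toxic, so the severity information in the original labels is discarded.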


# Undersample

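The dataset is heavily imbalanced: only about 10% of the comments are toxic. Here we undersample the majority class so that both classes end up with the same number of rows. Other strategies might work better.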

In [3]:
df['y'].value_counts(normalize=True)

0    0.898321
1    0.101679
Name: y, dtype: float64

In [4]:
min_len = (df['y'] == 1).sum()

In [5]:
df_y0_undersample = df[df['y'] == 0].sample(n=min_len, random_state=201)

In [6]:
df = pd.concat([df[df['y'] == 1], df_y0_undersample])

In [7]:
df['y'].value_counts()

1    16225
0    16225
Name: y, dtype: int64
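The undersampling steps above could be wrapped in a small helper. This is only a sketch: `undersample` is a hypothetical name, and it samples every class down to the size of the smallest one rather than treating the classes separately as the cells above do.

```python
import pandas as pd

def undersample(frame, label_col='y', random_state=201):
    """Downsample every class to the size of the smallest class."""
    n_min = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    return pd.concat(parts)

# Hypothetical toy frame: 8 rows of class 0, 2 rows of class 1
toy = pd.DataFrame({'text': list('abcdefghij'), 'y': [0] * 8 + [1] * 2})
balanced = undersample(toy)
counts = balanced['y'].value_counts().to_dict()
print(counts)
```

For the minority class, sampling `n_min` rows without replacement simply returns all of its rows in shuffled order, so no toxic comment is dropped.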

# Clean text

In [8]:
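import re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# Requires the NLTK 'stopwords' and 'wordnet' corpora to be available locally
stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
all_stopwords = set(stopwords.words('english'))  # set for fast membership tests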

In [9]:
# https://www.kaggle.com/kishalmandal/multi-label-stratified-k-fold-toxic-comments
def clean(comment):
    comment = re.sub('[^a-zA-Z]', ' ', comment)  # replace non-letters with spaces
    comment = comment.lower()
    comment = comment.split()
    # Drop stopwords and stem the remaining words, then lemmatize them
    comment = [stemmer.stem(word) for word in comment if word not in all_stopwords]
    comment = [lemmatizer.lemmatize(word) for word in comment]
    comment = ' '.join(comment)
    return comment

In [10]:
df['text'].iloc[0]

'COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK'

In [11]:
clean(df['text'].iloc[0])

'cocksuck piss around work'

In [12]:
df['text'] = df['text'].apply(clean)
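To see the first steps of `clean` in isolation, here is a dependency-free sketch. It assumes a tiny hypothetical stopword set (`toy_stopwords`) instead of NLTK's English list and skips stemming and lemmatization, so its output will differ from the real function.

```python
import re

# Hypothetical, tiny stopword set; the notebook uses NLTK's full English list
toy_stopwords = {'do', 'not', 'the', 'you', 'on', 'my', 'before'}

def clean_basic(comment):
    """Regex filtering, lowercasing and stopword removal only (no stemming)."""
    comment = re.sub('[^a-zA-Z]', ' ', comment)  # non-letters become spaces
    tokens = comment.lower().split()             # lowercase, split on whitespace
    tokens = [t for t in tokens if t not in toy_stopwords]
    return ' '.join(tokens)

cleaned = clean_basic("Please do NOT vandalize the pages!!! 123")
print(cleaned)  # please vandalize pages
```

Digits and punctuation disappear entirely, which also means that obfuscated spellings are split into separate fragments.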

# TF-IDF

In [13]:
vec = TfidfVectorizer()

In [14]:
X = vec.fit_transform(df['text'])
X

<32450x45777 sparse matrix of type '<class 'numpy.float64'>'
	with 728904 stored elements in Compressed Sparse Row format>
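The difference between `fit_transform` and `transform` matters later, when the validation and submission text is vectorized. A minimal sketch on made-up documents (all names and strings here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['piss around work', 'thank you for your work', 'please stop']
new_docs = ['work stop unseenword']

toy_vec = TfidfVectorizer()
X_train = toy_vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_new = toy_vec.transform(new_docs)          # reuses them; unknown words are dropped

print(X_train.shape, X_new.shape)
print('unseenword' in toy_vec.vocabulary_)
```

Both matrices have one column per vocabulary term learned from the training documents, so words that appear only in new text contribute nothing to the features.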

# Fit Naive Bayes

In [15]:
model = MultinomialNB()
model.fit(X, df['y'])
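The fitted model is later used through `predict_proba`, whose columns follow `model.classes_`. A toy sketch with a hypothetical four-document corpus shows how the toxic-class probability can be picked out explicitly instead of assuming it sits in column 1:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = toxic, 0 = non-toxic
texts = ['idiot stupid', 'stupid fool', 'thank you', 'nice work thank']
labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer()
toy_model = MultinomialNB()
toy_model.fit(toy_vec.fit_transform(texts), labels)

proba = toy_model.predict_proba(toy_vec.transform(['stupid idiot', 'thank you']))

# Columns of predict_proba are ordered like toy_model.classes_
toxic_col = list(toy_model.classes_).index(1)
toxic_scores = proba[:, toxic_col]
print(toxic_scores)
```

With labels 0 and 1 the toxic class ends up in column 1, which is why `[:, 1]` is used in the rest of the notebook.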

# Validate

In [16]:
df_val = pd.read_csv("jigsaw/validation_data.csv")

In [17]:
X_less_toxic = vec.transform(df_val['less_toxic'].apply(clean))
X_more_toxic = vec.transform(df_val['more_toxic'].apply(clean))

In [18]:
p1 = model.predict_proba(X_less_toxic)
p2 = model.predict_proba(X_more_toxic)

In [19]:
# Validation accuracy: fraction of pairs where the more_toxic comment gets the higher score
(p1[:, 1] < p2[:, 1]).mean()

0.6675634382888269
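The metric above is a pairwise agreement rate. A minimal sketch with made-up scores (the arrays are hypothetical) shows how it is computed and how ties are treated:

```python
import numpy as np

# Hypothetical scores for four (less_toxic, more_toxic) pairs
score_less = np.array([0.1, 0.5, 0.3, 0.9])
score_more = np.array([0.4, 0.2, 0.8, 0.9])

# A pair counts as correct only if the more_toxic score is strictly higher
agreement = (score_less < score_more).mean()
print(agreement)  # 0.5
```

Because the comparison is strict, the last pair (a tie at 0.9) counts as a miss, just like the second pair where the order is wrong.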

# Submission

In [20]:
df_sub = pd.read_csv("jigsaw/comments_to_score.csv")
# Note: `clean` is not applied here, unlike for the training and validation text
X_test = vec.transform(df_sub['text'])
p3 = model.predict_proba(X_test)

In [21]:
df_sub

Unnamed: 0,comment_id,text
0,114890,"""\n \n\nGjalexei, you asked about whether ther..."
1,732895,"Looks like be have an abuser , can you please ..."
2,1139051,I confess to having complete (and apparently b...
3,1434512,"""\n\nFreud's ideas are certainly much discusse..."
4,2084821,It is not just you. This is a laundry list of ...
...,...,...
7532,504235362,"Go away, you annoying vandal."
7533,504235566,This user is a vandal.
7534,504308177,""" \n\nSorry to sound like a pain, but one by f..."
7535,504570375,Well it's pretty fucking irrelevant now I'm un...


In [22]:
df_sub['score'] = p3[:, 1]

In [23]:
df_sub['score'].count()

7537

In [24]:
# 7537 scores but fewer distinct values: comments that share a score tie when compared with each other
df_sub['score'].nunique()

7464
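One possible way to avoid tied scores, not used in this notebook, is to submit ranks instead of raw probabilities. The sketch below uses a hypothetical `toy_scores` series; `rank(method='first')` breaks ties by row order, which is arbitrary and adds no real information about which comment is more toxic.

```python
import pandas as pd

# Hypothetical scores with one tie (rows 0 and 2)
toy_scores = pd.Series([0.2, 0.5, 0.2, 0.9])

# method='first' gives tied values distinct ranks in order of appearance
ranked = toy_scores.rank(method='first')
print(ranked.tolist())  # [1.0, 3.0, 2.0, 4.0]
```

The ranking preserves the order of all non-tied scores, so pairwise comparisons between comments with different scores are unchanged.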

In [25]:
df_sub[['comment_id', 'score']].to_csv("submission.csv", index=False)
