# ☣️ Jigsaw - Super simple Naive Bayes [LB=0.768]

## A very simple Naive Bayes baseline with `LB=0.768`.

Using data from [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)

I have created a public dataset with this data:
* [jigsaw-toxic-comment-classification-challenge](https://www.kaggle.com/julian3833/jigsaw-toxic-comment-classification-challenge)


# Please, _DO_ upvote!

# Imports

In [None]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Create train data

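The original competition was multi-label: each comment carries six toxicity labels.

We collapse them into a single binary toxic / non-toxic target: a comment is positive (`y = 1`) if at least one of the six labels is set.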

In [None]:
df = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")
df['y'] = (df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1) > 0 ).astype(int)
df = df[['comment_text', 'y']].rename(columns={'comment_text': 'text'})
df.sample(5)
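
As a small illustration of the label collapse, here is a sketch on a made-up toy frame (the rows are invented, only the column names come from `train.csv`). It uses `any(axis=1)`, which gives the same result as `sum(axis=1) > 0` for 0/1 labels.

```python
import pandas as pd

# Toy frame with the six label columns from train.csv (values are made up).
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [1, 0, 1, 0, 0, 0],
     [0, 0, 0, 0, 0, 1]],
    columns=label_cols,
)

# A row is positive when at least one of the six labels is set.
toy['y'] = toy[label_cols].any(axis=1).astype(int)
toy_y = toy['y'].tolist()
print(toy_y)
```

Only the first toy row, with no label set, ends up as `y = 0`.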

# Undersample

The dataset is very imbalanced. Here we undersample the majority class (`y = 0`) down to the number of toxic comments. Other strategies might work better.

In [None]:
df['y'].value_counts(normalize=True)

In [None]:
min_len = (df['y'] == 1).sum()

In [None]:
df_y0_undersample = df[df['y'] == 0].sample(n=min_len, random_state=201)

In [None]:
df = pd.concat([df[df['y'] == 1], df_y0_undersample])

In [None]:
df['y'].value_counts()
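
The undersampling steps above could also be wrapped in one helper. This is only a sketch: `undersample` is a hypothetical name, the toy frame is made up, and unlike the cells above it also reshuffles the minority class because it samples every group.

```python
import pandas as pd

def undersample(frame, label_col='y', seed=201):
    """Sample every class down to the size of the rarest class."""
    n_min = frame[label_col].value_counts().min()
    # GroupBy.sample draws n_min rows without replacement from each class.
    return frame.groupby(label_col).sample(n=n_min, random_state=seed)

# Made-up toy data: six negatives, two positives.
toy = pd.DataFrame({'text': list('abcdefgh'), 'y': [0, 0, 0, 0, 0, 0, 1, 1]})
balanced = undersample(toy)
counts = balanced['y'].value_counts().to_dict()
print(counts)
```

After the call both classes have the same number of rows, which is what the `value_counts()` check above verifies on the real data.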

# Clean text

In [None]:
import re

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
# A set gives constant-time membership tests when filtering stopwords.
all_stopwords = set(stopwords.words('english'))

In [None]:
# Source: https://www.kaggle.com/kishalmandal/multi-label-stratified-k-fold-toxic-comments
def clean(comment):
    """Keep letters only, lowercase, drop stopwords, then stem and lemmatize."""
    stop = set(all_stopwords)  # built once per call, not once per word
    words = re.sub('[^a-zA-Z]', ' ', comment).lower().split()
    words = [stemmer.stem(word) for word in words if word not in stop]
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

In [None]:
df['text'].iloc[0]

In [None]:
clean(df['text'].iloc[0])

In [None]:
df['text'] = df['text'].apply(clean)
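
To see what the first steps of `clean` do in isolation, here is a sketch that runs only the regex, lowercasing and splitting on a made-up sentence. It leaves out the stopword filter, stemmer and lemmatizer so it needs no NLTK data.

```python
import re

sample = "You're SO dumb!!! 123"

# Replace every character that is not an ASCII letter with a space,
# then lowercase and split on whitespace.
tokens = re.sub('[^a-zA-Z]', ' ', sample).lower().split()
print(tokens)
```

Note that the apostrophe splits `You're` into two tokens, and digits and punctuation disappear entirely.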

# TF-IDF

In [None]:
vec = TfidfVectorizer()

In [None]:
X = vec.fit_transform(df['text'])
X
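
For intuition about the matrix `X`, here is a sketch of `TfidfVectorizer` with default settings on three made-up documents. The names `docs`, `toy_vec` and `toy_X` are only for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus.
docs = ["you are great", "you are awful", "great great day"]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(docs)

# One row per document, one column per distinct token in the corpus.
toy_vocab = sorted(toy_vec.vocabulary_)
print(toy_X.shape)
print(toy_vocab)
```

The result is a sparse matrix, which is why displaying `X` above shows a summary rather than the values.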

# Fit Naive Bayes

In [None]:
model = MultinomialNB()
model.fit(X, df['y'])
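
The vectorizer and the classifier can also be chained in a scikit-learn pipeline, so that raw text goes in and probabilities come out. This is a sketch on made-up toy data, not the setup used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data: 1 = toxic, 0 = non-toxic.
toy_texts = ["you idiot", "stupid idiot", "nice day", "lovely nice day"]
toy_labels = [1, 1, 0, 0]

toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# Columns of predict_proba follow toy_model.classes_, so column 1 is P(y = 1).
toy_proba = toy_model.predict_proba(["idiot", "nice"])
print(toy_model.classes_)
print(toy_proba[:, 1])
```

Column 1 of `predict_proba` is the value this notebook uses as the toxicity score.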

# Validate

In [None]:
df_val = pd.read_csv("../input/jigsaw-toxic-severity-rating/validation_data.csv")

In [None]:
X_less_toxic = vec.transform(df_val['less_toxic'].apply(clean))
X_more_toxic = vec.transform(df_val['more_toxic'].apply(clean))

In [None]:
p1 = model.predict_proba(X_less_toxic)
p2 = model.predict_proba(X_more_toxic)

In [None]:
# Validation accuracy: share of pairs where the `more_toxic` comment gets the strictly higher score
(p1[:, 1] < p2[:, 1]).mean()
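
The pairwise metric above can be written as a small function. This is a sketch with a hypothetical name (`pairwise_accuracy`) and made-up scores. Because the comparison is strict, a tie counts as a miss.

```python
import numpy as np

def pairwise_accuracy(less_scores, more_scores):
    """Fraction of pairs where the 'more toxic' score is strictly higher."""
    less = np.asarray(less_scores, dtype=float)
    more = np.asarray(more_scores, dtype=float)
    return float((less < more).mean())

# Four made-up pairs: two ordered correctly, one inverted, one tied.
acc = pairwise_accuracy([0.1, 0.7, 0.4, 0.5], [0.9, 0.2, 0.8, 0.5])
print(acc)
```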

# Submission

In [None]:
df_sub = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")
# Clean the test comments the same way as the training and validation text
X_test = vec.transform(df_sub['text'].apply(clean))
p3 = model.predict_proba(X_test)

In [None]:
df_sub

In [None]:
df_sub['score'] = p3[:, 1]

In [None]:
df_sub['score'].count()

In [None]:
# Number of distinct scores: comments with tied scores (9 here) cannot be ranked against each other
df_sub['score'].nunique()
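
To inspect ties more directly, here is a sketch on made-up scores. It counts how many rows share a score with another row, and shows one possible way to force unique ranks with `rank(method='first')`. The tie-breaking step is an assumption for illustration and is not part of this notebook.

```python
import pandas as pd

# Made-up scores with two groups of ties.
scores = pd.Series([0.10, 0.35, 0.35, 0.80, 0.80, 0.80])

n_distinct = scores.nunique()
# keep=False marks every member of a tied group, not just the repeats.
n_tied = int(scores.duplicated(keep=False).sum())

# Ties are broken by order of appearance, so every rank is unique.
ranks = scores.rank(method='first')
print(n_distinct, n_tied, ranks.tolist())
```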

In [None]:
df_sub[['comment_id', 'score']].to_csv("submission.csv", index=False)
