# Toxic Comments Classification

This notebook is my way into sentiment analysis and the Naïve Bayes algorithm.  
For this project I used TF-IDF features with Naïve Bayes + Logistic Regression to create a model. 

My work is based on the superb notebook by Jeremy Howard (https://www.kaggle.com/code/jhoward/nb-svm-strong-linear-baseline/data).  
All credit goes to him. 

However, the training dataset contains only a few toxic comments relative to its size (<10% for toxic, <1% for some labels). With so few positive examples, a model trained on this dataset is unlikely to be accurate on the rare labels.  

For this reason, when I look at the overall accuracy (a single error on any of the labels makes the whole prediction for that comment wrong), I get a little more than 11%, which is very low. So I am posting this project on Kaggle to submit it and check how accuracy is calculated there.
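
The "overall accuracy" above is an exact-match score: a comment only counts as correct when every label is right. Here is a minimal sketch on made-up arrays (`y_true` and `y_prob` are hypothetical, not taken from the dataset) that compares it with per-label accuracy and with the mean column-wise ROC AUC, which, as far as I know, is the metric the competition uses for scoring.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: 4 comments, 3 labels (invented for illustration)
y_true = np.array([[0, 0, 0],
                   [1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_prob = np.array([[0.1, 0.2, 0.1],
                   [0.8, 0.6, 0.7],
                   [0.3, 0.9, 0.2],
                   [0.6, 0.4, 0.4]])
y_pred = (y_prob >= 0.5).astype(int)   # threshold the probabilities at 0.5

# Exact match: a row is correct only if all of its labels are correct
exact_match = (y_pred == y_true).all(axis=1).mean()

# Per-label accuracy: each column is scored on its own
per_label = (y_pred == y_true).mean(axis=0)

# Mean column-wise ROC AUC: works on probabilities, no threshold needed
mean_auc = np.mean([roc_auc_score(y_true[:, k], y_prob[:, k])
                    for k in range(y_true.shape[1])])

print(exact_match, per_label, mean_auc)
```

On this toy data the exact-match score is much lower than the per-label scores, because one wrong label is enough to fail the whole row.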

Feel free to comment and give advice. I am here to learn. 

The complete notebook is on my GitHub, where I used the labeled dataset to test my models and calculate accuracy.

https://github.com/JeremyArancio/Toxic_Comments_Classification

Happy reading!

Jérémy


# Updates

**V6**  
TF-IDF + Logistic Regression  
Punctuation & stopwords removed

**V7**  
Logistic Regression with inverse regularization strength C=4  
Punctuation & stopwords kept

**V8**  
Unigrams & Bigrams

# Libraries

In [None]:
import pandas as pd
import numpy as np
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Import Data

## File descriptions
* train.csv - the training set, contains comments with their binary labels
* test.csv - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
* sample_submission.csv - a sample submission file in the correct format
* test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)
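
Since rows marked with -1 were not scored, they should be dropped before comparing predictions with `test_labels.csv`. A minimal sketch, using a hypothetical miniature of the file (the values and the `label_cols` list below are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of test_labels.csv (values invented for illustration)
labels = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "toxic": [0, -1, 1, -1],
    "insult": [0, -1, 0, -1],
})

label_cols = ["toxic", "insult"]

# Keep only the rows that were actually scored (no -1 in any label column)
scored = labels[(labels[label_cols] != -1).all(axis=1)]
print(scored)
```

The same mask can be applied to the predictions so that both frames contain the same comments.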

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
train_data = pd.read_csv("/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv.zip")
test_data = pd.read_csv('/kaggle/input/jigsaw-toxic-comment-classification-challenge/test.csv.zip')
test_label_data = pd.read_csv('/kaggle/input/jigsaw-toxic-comment-classification-challenge/test_labels.csv.zip')
samp_subm = pd.read_csv("/kaggle/input/jigsaw-toxic-comment-classification-challenge/sample_submission.csv.zip")

In [None]:
train_data.info()

In [None]:
train_data.isna().sum()

In [None]:
train_data.head(10)

As we can see, the training dataset contains:
* the comment ID
* the raw text
* the different categories of toxicity

In [None]:
# Let's check some comments
for i in range(10):
    print(train_data['comment_text'][i])
    print('---------------')

In [None]:
# Let's check test.csv
test_data.head(10)

Here we just have the IDs and the comments, with no classification.

In [None]:
# Submission data set
samp_subm

# Clean the corpus

In [None]:
# Let's define a function that preprocesses the text

def preprocess(corpus):
    
    '''
    Lowercase each text, then remove hyperlinks and words containing numbers.
    Input : an iterable of strings
    Output : a generator yielding one cleaned string per input text
    '''

    for text in corpus:

        text = text.lower()                                               # Lowercase
        text = re.sub(r'https?://[^\s\n\r]+', '', text)                   # Remove links
        #text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # Remove punctuation
        text = re.sub(r'\w*\d\w*', '', text)                              # Remove words containing numbers
    
        yield text  # Lazily yield the cleaned string
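
To see what these substitutions do, here is a small sketch that applies the same steps to one made-up sentence (the `sample` string is hypothetical, not from the dataset):

```python
import re

# Hypothetical sample comment, invented for illustration
sample = "Check https://example.com NOW, user123 said 2nd time!"

text = sample.lower()                              # Lowercase
text = re.sub(r'https?://[^\s\n\r]+', '', text)    # Remove links
text = re.sub(r'\w*\d\w*', '', text)               # Remove words containing numbers

print(repr(text))
```

The removed link and words leave double spaces behind. That should be harmless here, because `TfidfVectorizer` splits the text into word tokens itself. If needed, `' '.join(text.split())` collapses the extra whitespace.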

In [None]:
%%time

# We save the cleaned comments in a list to be easily manipulated
clean_comments = list(preprocess(train_data['comment_text']))

In [None]:
for i in range(10):
    print(clean_comments[i])
    print('------------')

Some meaningless words and typos remain. The cleaning could be improved, but we will work with this for a first version.



In [None]:
%%time
# We do the same for the test set
test_clean_comments = list(preprocess(test_data["comment_text"]))

# Models

In [None]:
# Let's define target, which is the classification made by humans
target = train_data[['toxic', 'severe_toxic', 'obscene', 'threat','insult', 'identity_hate']]
# target = np.array(target) #transform dataframe into array
target.head()

Let's check whether the target values are balanced.   
In other words, does the target contain as many toxic as non-toxic comments?

In [None]:
target.sum(axis=0) / target.shape[0]

As we can see, the target set is not balanced.
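
To illustrate why this imbalance makes plain accuracy misleading, here is a minimal sketch with a made-up label vector (5% positives, chosen for illustration and not the real proportions): a "model" that always predicts 0 looks accurate but never finds a toxic comment.

```python
import numpy as np

# Hypothetical label vector with 5% positives (not the real data)
y = np.zeros(1000, dtype=int)
y[:50] = 1

# A trivial baseline that always answers "not toxic"
always_zero = np.zeros_like(y)

accuracy = (always_zero == y).mean()     # share of correct answers
recall = always_zero[y == 1].mean()      # share of toxic comments found

print(accuracy, recall)
```

This is one reason to look at a ranking metric such as ROC AUC rather than accuracy alone.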

We define the Naïve Bayes relation.
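
In symbols, here is my reading of the code that follows. The notation is my own assumption: $x_{d,w}$ is the weight of word $w$ in document $d$, $y_d$ is the label of document $d$, and $V$ is the vocabulary size.

```latex
\hat{p}(w \mid c) = \frac{1 + \sum_{d \,:\, y_d = c} x_{d,w}}{V + \sum_{w'} \sum_{d \,:\, y_d = c} x_{d,w'}},
\qquad
r_w = \log \frac{\hat{p}(w \mid 1)}{\hat{p}(w \mid 0)},
\qquad
s_d = \sum_{w} x_{d,w} \, r_w
```

The $+1$ and $+V$ terms are add-one smoothing, so a word that never appears in a class still gets a non-zero probability. The score $s_d$ is the single feature passed to the logistic regression.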

In [None]:
def probNB(bow,target,cat):

    '''
    Smoothed Naive Bayes probability of each word within one class
    Inputs :
    bow : bag of words (with docs in rows and words in columns)
    target : classification vector (filled with 1 and 0)
    cat : the class to select in target (1 or 0)
    Output : 
    Vector of Naive Bayes probabilities with add-one smoothing (n_words,1)
    '''

    # Sum each word's weight over the documents of class `cat`
    # (np.asarray turns a pandas Series into a plain boolean mask)
    p = np.array(bow[np.asarray(target)==cat].sum(axis=0))

    return np.transpose((p+1) / (p.sum() + bow.shape[1]))
    


In [None]:
def get_model(bow,target):

    '''
    Fit a logistic regression on the Naive Bayes log-likelihood ratio of each document
    Inputs :
    bow : bag of words (n_doc,n_words)
    target : classification of comments (n_doc,)
    Outputs : 
    model : LogisticRegression fitted on the document scores (n_doc,1)
    log : log ratio of the word probabilities between classes 1 and 0 (n_words,1)
    '''

    log = np.log(probNB(bow,target,1)/probNB(bow,target,0))  # Per-word log ratio
    m = bow.dot(log)                                          # Per-document score
    model = LogisticRegression(C=4).fit(m,target)
    return model , log
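
To make the two functions above concrete, here is a self-contained sketch of the same idea on a toy corpus. The documents, the labels and the helper name `smoothed_prob` are all invented for illustration. It uses flat 1-D arrays instead of column vectors, so it is a variation on the notebook's code and not a copy of it.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus and labels, invented for illustration
docs = ["you are stupid", "stupid idiot", "have a nice day",
        "nice work thank you", "what an idiot", "thank you for the help"]
y = np.array([1, 1, 0, 0, 1, 0])

X = TfidfVectorizer().fit_transform(docs)           # sparse (n_doc, n_words)

def smoothed_prob(X, y, cat):
    # Summed TF-IDF weight of each word within one class, with add-one smoothing
    p = np.asarray(X[y == cat].sum(axis=0)).ravel()  # (n_words,)
    return (p + 1) / (p.sum() + X.shape[1])

# Per-word log ratio between the toxic and the non-toxic class
log_ratio = np.log(smoothed_prob(X, y, 1) / smoothed_prob(X, y, 0))

# One Naive Bayes score per document, used as the only feature
score = X.dot(log_ratio).reshape(-1, 1)             # (n_doc, 1)

clf = LogisticRegression(C=4).fit(score, y)
proba = clf.predict_proba(score)[:, 1]              # probability of class 1

print(score.ravel())
print(proba)
```

Words seen mostly in toxic documents get a positive log ratio, so toxic documents end up with higher scores. The logistic regression then only has to turn that score into a probability.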

## TF-IDF NB-Logistic regression

In [None]:
# TF-IDF features (unigrams and bigrams)
tfidf_vec = TfidfVectorizer(ngram_range=(1,2),min_df=3,max_df=0.9, strip_accents='unicode', use_idf=True,
               smooth_idf=True, sublinear_tf=True )
tfidf = tfidf_vec.fit_transform(clean_comments)
tfidf_test = tfidf_vec.transform(test_clean_comments)

# Let's create our model
df_classification = pd.DataFrame() #We store probabilities into a Dataframe
df_classification['Comments'] = test_data['comment_text']

for j in target.columns:
    print('fit', j)
    model,log = get_model(tfidf,target[j])
    df_classification[j] = model.predict_proba(tfidf_test.dot(log))[:,1]


# Submission


In [None]:
keys = target.columns
submid = pd.DataFrame({"id" : samp_subm["id"]})
submission = pd.concat([submid,df_classification[keys]],axis=1)
submission.to_csv('submission.csv', index=False)
print('Done!')