# Description

<p>An online store is launching a new service that lets customers edit descriptions and comment on other people's changes. My task is to develop a machine learning model, with an F1 score above 0.75, that classifies comments as positive or negative and flags the negative ones, also known as toxic, so that they can be sent for moderation.</p>

<p>The data are provided in a file called `toxic_comments.csv`. The *text* column holds the comments, and *toxic* is the target variable.</p>

## Overview

In [1]:
import pandas as pd

data = pd.read_csv('/datasets/toxic_comments.csv')

In [2]:
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


The comments are written in English, and the dataframe holds almost 160 thousand of them.

## Text Preprocessing

In [3]:
%%time
import re
import nltk

# Whitespace tokenizer and WordNet lemmatizer from NLTK
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Lowercase, then replace every character that is not a letter or a space
    clean_and_lower = re.sub(r'[^a-zA-Z ]', ' ', text.lower())
    # Split on whitespace and lemmatize each token as a verb (pos='v')
    result = [lemmatizer.lemmatize(w, 'v') for w in w_tokenizer.tokenize(clean_and_lower)]
    # Join the lemmas back into one space-separated string
    return " ".join(result)

data['text_lemmatized'] = data['text'].apply(lemmatize_text)

Wall time: 48.7 s


The text was converted to lowercase, every character other than a letter or a space was replaced with a space, the result was split into whitespace-separated tokens, and each token was lemmatized as a verb (part of speech = 'v').
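As a quick illustration of the cleaning step alone, the sketch below applies the same regular expression to the visible start of one comment from the table and splits on whitespace. It uses plain `str.split()` as a stand-in for `WhitespaceTokenizer` and leaves out lemmatization so that it runs without NLTK data; the helper name `clean_and_split` is hypothetical.

```python
import re

def clean_and_split(text):
    """Lowercase, blank out non-letters, split on whitespace (no lemmatization)."""
    cleaned = re.sub(r'[^a-zA-Z ]', ' ', text.lower())
    # str.split() with no arguments also collapses runs of spaces
    return cleaned.split()

tokens = clean_and_split("D'aww! He matches this background colour")
print(tokens)
```

The apostrophe in `D'aww` is replaced by a space, which is why the lemmatized column starts with `d aww`.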

In [4]:
data

Unnamed: 0,text,toxic,text_lemmatized
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edit make under my usernam...
1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour i m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man i m really not try to edit war it s ju...
3,"""\nMore\nI can't make any real suggestions on ...",0,more i can t make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,you sir be my hero any chance you remember wha...
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,and for the second time of ask when your view ...
159567,You should be ashamed of yourself \n\nThat is ...,0,you should be ashamed of yourself that be a ho...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,spitzer umm theres no actual article for prost...
159569,And it looks like it was actually you who put ...,0,and it look like it be actually you who put on...


In [5]:
from nltk.corpus import stopwords as nltk_stopwords
nltk.download('stopwords')  # fetch the stop-word corpus if it is missing
# English stop words as a list, the type recent scikit-learn versions expect for stop_words
stopwords = list(nltk_stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ww\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Loaded the English stop-word list.

In [6]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(data['text_lemmatized'], data['toxic'], train_size=0.85, random_state=0)
#85% train
x_valid, x_test, y_valid, y_test = train_test_split(x_valid, y_valid, test_size=0.5, random_state=0)
#7.5% validation 7.5% test data set

<p>Split the full dataframe into training, validation and test sets: 85%, 7.5% and 7.5% respectively.</p>
<p>The number of rows:</p>

In [7]:
x_train.shape, y_train.shape, x_valid.shape, y_valid.shape, x_test.shape, y_test.shape

((135635,), (135635,), (11968,), (11968,), (11968,), (11968,))
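The row counts above can be checked against the intended 85% / 7.5% / 7.5% split with a few lines of arithmetic. This is only a sanity check on the printed shapes; the dictionary name `sizes` is an assumption made for the example.

```python
# Row counts copied from the shapes printed above
sizes = {"train": 135635, "valid": 11968, "test": 11968}

total = sum(sizes.values())
shares = {name: n / total for name, n in sizes.items()}

for name, share in shares.items():
    print(f"{name}: {sizes[name]} rows, {share:.3%} of {total}")
```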

## Training

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [9]:
%%time
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
tf_idf_train = count_tf_idf.fit_transform(x_train)
tf_idf_valid = count_tf_idf.transform(x_valid)
tf_idf_test = count_tf_idf.transform(x_test)

Wall time: 8.66 s


Computed TF-IDF features with the stop words excluded. The vectorizer is fitted on the training set only and then applied to the validation and test sets.
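To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three toy documents. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is what `TfidfVectorizer` uses by default, but it tokenizes with a plain `split()` and is not a replacement for the vectorizer. The function name `tfidf` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(docs, stop_words=frozenset()):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [[w for w in d.split() if w not in stop_words] for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {w: count * idf[w] for w, count in tf.items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()} if norm else {})
    return rows

docs = ["you be my hero", "you be a troll", "thank you hero"]
rows = tfidf(docs, stop_words={"you", "be", "my", "a"})
for row in rows:
    print(row)
```

In the third document, `thank` gets a larger weight than `hero` because it occurs in fewer documents.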

In [10]:
print("Matrix size:", tf_idf_train.shape)

Matrix size: (135635, 142204)


### Logistic Regression

In [11]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(tf_idf_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
tf_idf_train_pred = classifier.predict(tf_idf_train)
tf_idf_valid_pred = classifier.predict(tf_idf_valid)

In [13]:
from sklearn.metrics import f1_score
f1_logreg = f1_score(y_valid, tf_idf_valid_pred)

print(f1_logreg)

0.7627772420443587


A logistic regression model was trained and reached an F1 score of 0.763 on the validation set.

### Decision Tree Classifier

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import make_scorer
f_one_score = make_scorer(f1_score, average='macro', greater_is_better = True)

#param_grid = {
#    'max_depth': range(90,201,10),
#    'random_state': [0]
#}

#model = DecisionTreeClassifier()

#grid_search = GridSearchCV(estimator = model, param_grid = param_grid, verbose = 1, cv = 3, scoring = f_one_score)

In [15]:
#%%time
#grid_search.fit(tf_idf_train, y_train)
#print(grid_search.best_params_)

{'max_depth': 160, 'random_state': 0} <br>
CPU times: user 58min 55s, sys: 2.02 s, total: 58min 57s <br>
Wall time: 59min 4s <br>

In [16]:
%%time

model = DecisionTreeClassifier(max_depth = 160, random_state = 0)

model.fit(tf_idf_train, y_train)
y_valid_pred = model.predict(tf_idf_valid)
print(f1_score(y_valid, y_valid_pred))

0.7300479720889665
Wall time: 1min 19s


<p>The decision tree classifier reaches an F1 score of 0.73 on the validation set, about 0.03 below logistic regression. It is also much slower: a single fit takes over a minute, and the grid search took about an hour. Getting better results from DecisionTreeClassifier would probably require far more tuning time.</p>
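One detail worth keeping in mind: the `f_one_score` scorer passed to the grid search is built with `average='macro'`, while the validation scores printed in this notebook come from `f1_score` with its default binary averaging (F1 of the positive class only). The sketch below, in plain Python with made-up labels, shows how the two numbers can differ on imbalanced data. The helper names are hypothetical.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_macro(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, l) for l in labels) / len(labels)

# Toy data: 8 non-toxic and 2 toxic comments, one toxic comment missed
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]

binary_f1 = f1_for_class(y_true, y_pred, 1)
macro_f1 = f1_macro(y_true, y_pred)
print(f"binary F1 = {binary_f1:.3f}, macro F1 = {macro_f1:.3f}")
```

Because the majority class is easy to predict, the macro average comes out higher than the toxic-class F1 here, so parameters selected by one metric are not guaranteed to be the best for the other.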

### LightGBM

In [17]:
import lightgbm as lgb

In [18]:
#gridParams = {
#    'boosting_type': ['gbdt'],
#    'learning_rate': [0.5, 0.1],
#    'max_depth': range(10,51,10),
#    'n_estimators': range(10,51,10),
#    'random_state': [12345],
#    #max_bin = [5]    
#}

#lgbclass = lgb.LGBMClassifier()

In [19]:
#lgb_grid = GridSearchCV(lgbclass, gridParams, verbose=1, cv=3, scoring = f_one_score)

In [20]:
#%%time

#lgb_grid.fit(tf_idf_train, y_train)

In [21]:
#lgb_grid.best_params_

{'boosting_type': 'gbdt', <br>
 'learning_rate': 0.5, <br>
 'max_depth': 20, <br>
 'n_estimators': 50, <br>
 'random_state': 12345} <br>

In [22]:
%%time
model = lgb.LGBMClassifier(boosting_type = 'gbdt', learning_rate = 0.5, max_depth = 20, 
                          n_estimators = 50, random_state = 12345)

model.fit(tf_idf_train, y_train)

Wall time: 9.75 s


LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.5, max_depth=20,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=50, n_jobs=-1, num_leaves=31, objective=None,
               random_state=12345, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [23]:
y_valid_pred = model.predict(tf_idf_valid)
f1_lgbm = f1_score(y_valid, y_valid_pred)
print("F1 =", f1_lgbm)

F1 = 0.7728482697426797


LightGBM's gradient boosting model showed the best validation result (F1 of about 0.773), although the margin over logistic regression (0.763) is small. Further hyperparameter tuning would probably improve it: the grid search picked the largest `n_estimators` value it was offered (50), which suggests that a wider range is worth trying, at the cost of longer training time.
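A small check like the following can flag parameters whose best value sits on the boundary of the searched grid, which is the situation described above. The grid and the best parameters are copied from the cells in this section; the helper `params_on_grid_edge` is a hypothetical name, not part of scikit-learn or LightGBM.

```python
def params_on_grid_edge(grid, best):
    """Return {param: 'lower' | 'upper'} for best values on the edge of their grid."""
    edges = {}
    for name, values in grid.items():
        values = list(values)
        if len(values) < 2:
            continue  # a single candidate carries no information
        if best[name] == max(values):
            edges[name] = "upper"
        elif best[name] == min(values):
            edges[name] = "lower"
    return edges

grid = {
    "boosting_type": ["gbdt"],
    "learning_rate": [0.5, 0.1],
    "max_depth": range(10, 51, 10),
    "n_estimators": range(10, 51, 10),
    "random_state": [12345],
}
best = {
    "boosting_type": "gbdt",
    "learning_rate": 0.5,
    "max_depth": 20,
    "n_estimators": 50,
    "random_state": 12345,
}

edges = params_on_grid_edge(grid, best)
print(edges)
```

Both `n_estimators` and `learning_rate` land on the upper edge of their ranges, so extending either range would be a reasonable next experiment.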

### Test

In [24]:
from scipy.sparse import vstack
x_train = vstack([tf_idf_train, tf_idf_valid])
y_train = pd.concat([y_train, y_valid])

The training and validation TF-IDF matrices are stacked into a single training set, together with their labels.

In [25]:
model = lgb.LGBMClassifier(boosting_type = 'gbdt', learning_rate = 0.5, max_depth = 20, 
                          n_estimators = 50, random_state = 12345)

model.fit(x_train, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.5, max_depth=20,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=50, n_jobs=-1, num_leaves=31, objective=None,
               random_state=12345, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [26]:
%%time
y_test_pred = model.predict(tf_idf_test)
f1_test = f1_score(y_test, y_test_pred)
print(f1_test)

0.7796917497733455
Wall time: 67.9 ms


Refitting the model on the combined training and validation sets gives an F1 score of almost 0.78 on the test set, which is above the 0.75 target.