# A model for classifying comments as positive or negative.

## Contents:
1. Reviewing of the data and preprocessing.
2. Training models.
3. Testing models.
4. Overall conclusion.


The online store "Wikishop" is launching a new service. Now users can edit and supplement product descriptions, just like in wiki communities. That is, customers can suggest their own edits and comment on the changes of others. The store needs a tool that will search for toxic comments and send them for moderation. 

We will train a model to classify comments as positive or negative. We have a dataset with labels about toxicity of edits. 

We will build a model with an F1 metric value of no less than 0.75.

## Description of data:
* text - it contains the text of the comment,
* toxic - the target feature.

## 1. Reviewing of the data and preprocessing.

Import the libraries and tools.

In [1]:
!pip install lightgbm
!pip install xgboost
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m14.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import pandas as pd
import numpy as np

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

import spacy
import re

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier

from time import time
from tqdm import tqdm
from tqdm import notebook

import lightgbm as lgb
import xgboost as xgb
from lightgbm import LGBMRegressor

pd.options.display.max_columns = None
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', 50)
pd.options.display.float_format = '{:,.2f}'.format
pd.options.display.max_columns = None

Let's load additional data for NLTK.

In [3]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/vladimir/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/vladimir/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/vladimir/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/vladimir/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
stopwords = nltk.corpus.stopwords.words('english')

Open the dataset.

In [5]:
url = 'https://code.s3.yandex.net/datasets/toxic_comments.csv'
data = pd.read_csv(url)

In [6]:
data.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


Let's look at the missing values.

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [8]:
data = data[['text', 'toxic']]

In [9]:
data.duplicated().sum()

0

There are no duplicates.

We'll change the data type to save memory.

In [10]:
data['toxic'] = data['toxic'].astype('uint8')

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  uint8 
dtypes: object(1), uint8(1)
memory usage: 1.4+ MB


Let's look at the class balance.

In [12]:
data['toxic'].value_counts(normalize=True)

0   0.90
1   0.10
Name: toxic, dtype: float64

The target is not balanced.

### We'll prepare the text.

We'll clean the text and lemmatize it with a function.

In [13]:
def lemmatize(text):
    l = WordNetLemmatizer()
    word_list = nltk.word_tokenize(text)
    lemm_list = []
    for word in word_list:
        lemm_list.append(l.lemmatize(word))
    lemm_text = " ".join(lemm_list)
    return lemm_text

def clear_text(text):
    re_text = re.sub(r"[^a-zA-Z ]", " ", text)
    re_text = re.sub(r"[UTC]", "", re_text)
    re_text = re_text.lower()
    return " ".join(re_text.split())

Let's clean the text.

In [14]:
%%time

data['clear_text'] = data['text'].apply(clear_text)
data.head()

CPU times: user 4.48 s, sys: 112 ms, total: 4.6 s
Wall time: 4.73 s


Unnamed: 0,text,toxic,clear_text
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,d aww he matches this background colour i m se...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man i m really not trying to edit war it s...
3,"""\nMore\nI can't make any real suggestions on ...",0,more i can t make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,you sir are my hero any chance you remember wh...


We'll lemmatize it with spacy.

In [15]:
def lemmatize_spacy(text, lemmatizer):
    doc = lemmatizer(text)
    lemm_text = " ".join([token.lemma_ for token in doc])
    return lemm_text

In [16]:
%%time

sp = spacy.load('en_core_web_sm', 
                disable=['parser', 'ner'])

data['lemms'] = data['clear_text'].apply(lemmatize_spacy, lemmatizer=sp)

CPU times: user 27min 24s, sys: 2.76 s, total: 27min 27s
Wall time: 27min 29s


We'll split the dataset into samples.

In [17]:
train_features, test_features, train_target, test_target = train_test_split(
    data.drop('toxic', axis=1),
    data['toxic'],
    test_size=0.2,
    random_state=12345,
    stratify=data['toxic'],
    )

corpus_train = train_features['lemms']
corpus_test = test_features['lemms']
corpus_train

4627      I don t mean to intrude but I have notice the ...
23542     god or whoever whatever I now decree you the c...
128384    an we keep this neat and sequential we have al...
31111     here s what you say may rama block expire jun ...
41207     sir giggsy have often say that he like to keep...
                                ...                        
54715     indeed bigdunc that page rightly say that para...
75564     I m not try to make a point or anything except...
71196     forgive my cruddy formattingi m still relative...
55751     alk grasshopper scout I move your comment from...
11543                                   oppose wp ommonname
Name: lemms, Length: 127433, dtype: object

We'll use TfidfVectorizer to clean the bag of words and add stop words to it.

In [18]:
count_tf_idf = TfidfVectorizer(stop_words=stopwords)
tf_idf_train = count_tf_idf.fit_transform(corpus_train)
tf_idf_test = count_tf_idf.transform(corpus_test)

## 2. Training models.

To automate the process, we'll write a function to select hyperparameters and calculate metrics.

In [19]:
analisys = pd.DataFrame({'model':[], 'F1_model':[], 'F1_on_train':[]})
all_models = []

def train_model(model, parameters):
    
    model_random = RandomizedSearchCV(
        estimator=model,
        param_distributions=parameters,
        scoring='f1', 
        n_jobs=-1,
        cv=4, 
        verbose=0
        )
    
    start = time()
    model_random.fit(tf_idf_train, train_target)
    print('Time to search for hyperparameters %.2f sec.' %(time() - start))
    
    f1 = model_random.best_score_
    f1_on_train = f1_score(train_target, model_random.predict(tf_idf_train))
    
    print('Best params:', model_random.best_params_)
    print('F1 of trained model:', round(f1, 3))
    print('F1 on train set:', round(f1_on_train, 3))

    all_models.append(model_random)
    row = []
    row.extend([model, f1, f1_on_train])
    analisys.loc[len(analisys.index)] = row
    
    return model_random

### Logistic Regression

In [20]:
%%time
ran_lr = {
    "penalty": ['l1', 'l2', 'elasticnet', 'none'],
    "class_weight": ['balanced', 'none'],
    }

logr = LogisticRegression(max_iter=100)

lr_random = train_model(logr, ran_lr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

Time to search for hyperparameters 94.23 sec.
Best params: {'penalty': 'l2', 'class_weight': 'balanced'}
F1 of trained model: 0.75
F1 on train set: 0.832
CPU times: user 19 s, sys: 1.39 s, total: 20.3 s
Wall time: 1min 34s


### DecisionTreeClassifier

In [21]:
%%time
ran_grid_tree = {
    "max_depth": list(range(50, 60)),
    }

dtr = DecisionTreeClassifier()

dtr_random = train_model(dtr, ran_grid_tree)

Time to search for hyperparameters 1480.32 sec.
Best params: {'max_depth': 59}
F1 of trained model: 0.701
F1 on train set: 0.838
CPU times: user 1min 6s, sys: 284 ms, total: 1min 6s
Wall time: 24min 40s


### Random Forest Classifier

In [22]:
%%time
ran_grid_forest = {
    'max_depth': list(range(300, 320)),
    'n_estimators': [12, 16],
    }

rfc = RandomForestClassifier(n_jobs=-1)

rfc_random = train_model(rfc, ran_grid_forest)

Time to search for hyperparameters 3250.55 sec.
Best params: {'n_estimators': 16, 'max_depth': 314}
F1 of trained model: 0.633
F1 on train set: 0.914
CPU times: user 5min 47s, sys: 3.34 s, total: 5min 51s
Wall time: 54min 13s


### LightGBM

In [23]:
%%time
rand_lgbm_param = {
    'max_depth': [20, 25, 30],
    'learning_rate': [0.15, 0.3, 0.45],
    }

gbm = lgb.LGBMClassifier(
    boosting_type='gbdt',
    n_jobs=-1,
    )

gbm_random = train_model(gbm, rand_lgbm_param)

Time to search for hyperparameters 5162.22 sec.
Best params: {'max_depth': 25, 'learning_rate': 0.45}
F1 of trained model: 0.77
F1 on train set: 0.892
CPU times: user 7min 11s, sys: 2.44 s, total: 7min 13s
Wall time: 1h 26min 7s


### XGBoost

In [24]:
%%time
rand_xgb_param = {
    'max_depth': [6, 7, 8, 9],
    'learning_rate': [0.3, 0.5, 1.0],
    }

xb = xgb.XGBClassifier(booster='gbtree', 
                      use_rmm=True,
                      n_jobs=-1)

xb_random = train_model(xb, rand_xgb_param)

Time to search for hyperparameters 4416.96 sec.
Best params: {'max_depth': 9, 'learning_rate': 0.5}
F1 of trained model: 0.753
F1 on train set: 0.861
CPU times: user 4min 42s, sys: 505 ms, total: 4min 42s
Wall time: 1h 13min 37s


### Model analysis:

In [25]:
all_names = pd.DataFrame({'names':['LogisticRegression', 
                                   'DecisionTree', 
                                   'RandomForest', 
                                   'LightGBM', 
                                   'XGBoost']})
analisys = pd.concat([analisys, all_names], 
                     axis=1, 
                     join='inner')
display(analisys)

Unnamed: 0,model,F1_model,F1_on_train,names
0,LogisticRegression(),0.75,0.83,LogisticRegression
1,DecisionTreeClassifier(),0.7,0.84,DecisionTree
2,RandomForestClassifier(n_jobs=-1),0.63,0.91,RandomForest
3,LGBMClassifier(),0.77,0.89,LightGBM
4,"XGBClassifier(base_score=None, booster='gbtree...",0.75,0.86,XGBoost


### Conclusion:
We can recommend three models to the customer as the most accurate according to the given metric. Let's test the following models: LogisticRegression, XGBoost, and LightGBM.

## 3. Testing models.

In [31]:
predicted = lr_random.predict(tf_idf_test)
print('F1 LogisticRegression on the test set:', f1_score(test_target, predicted))

predicted = xb_random.predict(tf_idf_test)
print('F1 XGBoost on the test set:', f1_score(test_target, predicted))

predicted = gbm_random.predict(tf_idf_test)
print('F1 LightGBM on the test set:', f1_score(test_target, predicted))

F1 LogisticRegression on the test set: 0.7579626394301842
F1 XGBoost on the test set: 0.7513116474291709
F1 LightGBM on the test set: 0.7751728790689829


## 4. Overall conclusion.
A dataset of 159292 rows was received for work. The data was prepared, the data type of the target feature was changed to save memory. The dataset was checked for missing values and duplicates, which were not found. Further data preparation, text cleaning and lemmatization were carried out before training the models. 

The dataset was divided into samples, and a function with hyperparameter selection was written to automate the calculation of metrics for the used models. As a result of testing the selected models, only LightGBM meets the customer's requirements, its F1 metric exceeds the acceptable threshold of 0.75 and is 0.776. This result was obtained based on the use of RandomizedSearchCV - a hyperparameter selection tool. 

When selecting hyperparameters using GridSearchCV, higher accuracy values can be achieved for the considered models, but unfortunately this will lead to an increase in the training time, which is already 33 minutes on average.

