# Project description

The online store is launching a new service in which users can edit and supplement product descriptions. The store needs a tool that will find toxic comments and send them for moderation.  
  
We need to build a model to classify comments into positive and negative ones.  
We have a dataset with annotations on the toxicity of comments at our disposal.  

## The plan for the project:

1. Studying the data;
2. Preparing data for training models:
     * Lemmatization;
     * Cleaning of extra symbols;
     * Vectorization of comments using the tf_idf method;
3. Training classical prediction models;
4. Using BERT;
5. Analyzing the results.

# 1. Studying the data

In [1]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb
from tqdm import notebook
import re

import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import wordnet
from nltk.corpus import stopwords as nltk_stopwords

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import shuffle
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from lightgbm import LGBMClassifier
import warnings
warnings.filterwarnings('ignore')

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

Let's load the data and explore it.

In [2]:
df_comments = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
print(df_comments.info())
df_comments.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB
None


Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


There are two columns in the dataset:

* text - user comments, object type;
* toxic - toxicity of the comment.
    - 0 - comment is positive.
    - 1 - comment is negative.  
    
There are about 160,000 objects in the dataset.

Let's look at the ratio of positive and negative values of the toxic feature.

In [4]:
df_comments['toxic'].value_counts()/df_comments['toxic'].shape*100

0    89.832112
1    10.167888
Name: toxic, dtype: float64

The sample is imbalanced: toxic (negative) comments make up about 10% of it.

Let's check for duplicates and missing values in the sample.

In [5]:
df_comments.duplicated().sum()

0

In [6]:
df_comments.isna().sum()

text     0
toxic    0
dtype: int64

There are no duplicates or missing values.

***
## Conclusion

At this stage, we carried out a preliminary study of the dataset:

* The dataset contains slightly fewer than 160,000 objects;
* There are two columns: the user comment and its toxicity label;
* The data is imbalanced: negative comments make up about 10% of the sample;
* There are no duplicates or missing values in the dataset.

***
# 2. Data preparation

***
## Lemmatization and cleaning of extra symbols

To perform lemmatization, we will use the WordNet Lemmatizer from the NLTK library.  
  
During the project, lemmatization was tried in two ways:

1. Standard lemmatization;
2. Lemmatization with POS tagging, where the part of speech is determined for each word.  
    This variant lemmatizes more accurately; for example, the verbs "are" and "is" are reduced to "be".  

However, after training the models and analyzing the results, we kept only the first option for the following reasons:

- it is much faster: the first option took about one minute, the second thirty-one minutes;
- despite the difference in speed, the model results were the same.
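
The second variant can be sketched roughly as follows. This is an illustration under assumptions, not the code that was actually run: the helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical, and the lemmatizer is passed in as an argument so that the snippet itself does not need any downloaded NLTK data. The mapping relies on the fact that `nltk.pos_tag` returns Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects a WordNet part-of-speech code (`'n'`, `'v'`, `'a'`, `'r'`).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBP') to a WordNet POS code."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default to noun, as the lemmatizer itself does


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, penn_tag) pairs with a POS-aware lemmatizer.

    `lemmatizer` is any object with a lemmatize(word, pos) method,
    for example an instance of nltk.stem.WordNetLemmatizer.
    """
    return [lemmatizer.lemmatize(word, penn_to_wordnet(tag))
            for word, tag in tagged_tokens]
```

With NLTK available, a call could look like `lemmatize_tagged(nltk.pos_tag(text.split()), WordNetLemmatizer())`. Tagging every token is the expensive part, which is consistent with the large difference in running time noted above.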

***
## Creating a corpus with all existing comments

In [7]:
corpus = df_comments['text'].values

Let's look at the first comment, record it and compare it with the lemmatized one.

In [8]:
corpus[0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

We'll perform lemmatization with the WordNet Lemmatizer from the NLTK library, without POS tagging.

In [9]:
%%time

lemmas = []

stemmer = WordNetLemmatizer()

for sen in range(0, len(corpus)):
    # delete all special characters
    document = re.sub(r'\W', ' ', str(corpus[sen]))
    
    # delete all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    
    # intended to remove a single character at the start of the string;
    # note that `\^` matches a literal caret, so this step rarely changes anything
    document = re.sub(r'\^[a-zA-Z]\s+', ' ', document) 
    
    # replace several spaces with one
    document = re.sub(r'\s+', ' ', document, flags=re.I)
    
    # remove the prefix 'b'
    document = re.sub(r'^b\s+', '', document)
    
    # convert uppercase letters to lowercase
    document = document.lower()
    
    # lemmatization
    document = document.split()
    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)
    
    lemmas.append(document)

CPU times: user 1min 15s, sys: 598 ms, total: 1min 15s
Wall time: 1min 16s


Let's see what result we got after lemmatization.

In [10]:
lemmas[0]

'explanation why the edits made under my username hardcore metallica fan were reverted they weren vandalism just closure on some gas after voted at new york doll fac and please don remove the template from the talk page since m retired now 89 205 38 27'

The text is ready for processing. Digits (here, the IP address at the end of the comment) remain in the text; on Kaggle it is recommended not to delete them, so we won't touch them.
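
To see what the main cleaning steps do, the same regular expressions can be applied to a short string outside the loop. The sketch below uses only the standard `re` module; the function name `clean_comment` is hypothetical, it covers only the three steps that do most of the work, and lemmatization is left out so that no NLTK data is needed.

```python
import re


def clean_comment(text):
    # replace every non-word character (punctuation, apostrophes) with a space
    text = re.sub(r'\W', ' ', text)
    # drop single letters that are surrounded by whitespace
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)
    # collapse runs of whitespace into one space
    text = re.sub(r'\s+', ' ', text)
    return text.lower().strip()


print(clean_comment("They weren't vandalisms, just closure."))
# they weren vandalisms just closure
```

The example also shows a side effect of this approach: contractions lose their endings ("weren't" becomes "weren"), which matches the lemmatized comment above.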

We'll convert our list of lemmas into Series and add it to the original dataset.

In [11]:
lemmas_series = pd.Series(lemmas, name = 'lemmas')
df_comments_new = pd.concat([df_comments, lemmas_series], axis = 1)
df_comments_new.head()

Unnamed: 0,text,toxic,lemmas
0,Explanation\nWhy the edits made under my usern...,0,explanation why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,d aww he match this background colour m seemin...
2,"Hey man, I'm really not trying to edit war. It...",0,hey man m really not trying to edit war it jus...
3,"""\nMore\nI can't make any real suggestions on ...",0,more can make any real suggestion on improveme...
4,"You, sir, are my hero. Any chance you remember...",0,you sir are my hero any chance you remember wh...


***
### Conclusion

At this stage, we tried lemmatization with the WordNet Lemmatizer from the NLTK library in two ways: standard and with POS tagging.  

The first method took about 1 minute, the second 31 minutes. The slight improvement in lemmatization quality from the second method did not improve the prediction results, so we kept the first option.  

***
## Vectorization of data using the tf_idf method

We will perform vectorization of data using the tf_idf method.  
  
For this, we'll isolate the target feature into a separate dataset and divide the data into training, validation, and test sets.

In [12]:
comments_features = df_comments_new.drop(['text','toxic'], axis=1)
comments_target = df_comments_new['toxic']

In [13]:
X_train, X_rest, y_train, y_rest = train_test_split(
    comments_features, comments_target, test_size=0.4, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

Let's look at the size of the resulting tables.

In [14]:
print(X_train.shape)
print(X_valid.shape)
print(X_test.shape)

(95742, 1)
(31914, 1)
(31915, 1)
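
The sizes above can be checked by hand. The sketch below assumes scikit-learn's rule that a fractional `test_size` is rounded up to a whole number of rows and that the training part receives the remainder; the helper name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


# first split: 60% train, 40% held out
n_train, n_rest = split_sizes(159571, 0.4)
# second split: the held-out part is halved into validation and test
n_valid, n_test = split_sizes(n_rest, 0.5)

print(n_train, n_valid, n_test)  # 95742 31914 31915
```

Two successive splits with `test_size=0.4` and `test_size=0.5` therefore give shares of roughly 60%, 20%, and 20%.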


We'll build the set of stop words, i.e. words that carry no semantic load, to pass to the vectorizer.  
For this, we previously downloaded the stopwords package, which provides the English list through the nltk.corpus module of the nltk library.

In [15]:
stopwords = set(nltk_stopwords.words('english'))

We will create corpora of comments from the training, validation, and test sets.

In [16]:
corpus_train = X_train['lemmas'].values
corpus_valid = X_valid['lemmas'].values
corpus_test = X_test['lemmas'].values

We'll train the converter on the corpus_train sample.

In [17]:
tf_idfconverter = TfidfVectorizer(max_features=6000, min_df=5, max_df=0.7, stop_words=stopwords)
tf_idfconverter.fit(corpus_train)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=0.7, max_features=6000,
                min_df=5, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True,
                stop_words={'a', 'about', 'above', 'after', 'again', 'against',
                            'ain', 'all', 'am', 'an', 'and', 'any', 'are',
                            'aren', "aren't", 'as', 'at', 'be', 'because',
                            'been', 'before', 'being', 'below', 'between',
                            'both', 'but', 'by', 'can', 'couldn', "couldn't", ...},
                strip_accents=None, sublinear_tf=False,
                token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
                vocabulary=None)

We will convert all samples into matrices and check their dimensionality.

In [18]:
X_train_tf = tf_idfconverter.transform(corpus_train)
X_valid_tf = tf_idfconverter.transform(corpus_valid)
X_test_tf = tf_idfconverter.transform(corpus_test)

In [19]:
print(X_train_tf.shape)
print(X_valid_tf.shape)
print(X_test_tf.shape)

(95742, 6000)
(31914, 6000)
(31915, 6000)


***
### Conclusion

At this stage, the dataset was divided into training, validation, and test samples in a ratio of 3/1/1, and the user comments were vectorized using the tf_idf method.  
  
As a result, matrices with 6,000 features each were obtained, as set by max_features=6000.
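
For intuition about what the vectorizer computes, here is a simplified pure-Python sketch of the weighting that corresponds to the defaults shown above (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). It is an approximation for illustration only: the function name `tfidf_rows` is hypothetical, tokens are obtained by a plain `split()`, and stop words, `min_df`, `max_df`, and `max_features` are ignored.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: count * idf[term]
                   for term, count in Counter(tokens).items()}
        # L2 normalization; `or 1.0` guards against empty documents
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(["edit war", "edit page", "hero"])
```

In the first document, "war" receives a higher weight than "edit", because "edit" also occurs in the second document and is therefore less distinctive.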

***
## Handling data imbalance

As we noticed earlier, the number of negative comments is much less than positive ones.  
This is good for our service, but bad for our prediction models.  
So, using the upsampling method, we'll increase the number of negative comments.  

We will perform the transformation in several stages:

* Divide the training sample into positive and negative comments;
* Copy the negative comments several times (in our case, 8 times);
* Based on the obtained data, create a new training sample;
* Shuffle the data.

In [20]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, y_train_up = upsample(X_train, y_train, 8)

In [21]:
print('Table size features_upsampled:', features_upsampled.shape)
print('Table size target_upsampled:',y_train_up.shape)
print()
print(y_train_up.value_counts(normalize=True))

Table size features_upsampled: (164013, 1)
Table size target_upsampled: (164013,)

0    0.524282
1    0.475718
Name: toxic, dtype: float64


After upsampling, the classes became nearly balanced, and the size of the tables increased.  
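
The resulting class shares follow from simple arithmetic. In the sketch below, the class counts are not printed anywhere in the notebook; they are inferred from the table sizes (95742 rows before upsampling, 164013 rows after an 8-fold repetition of the negative comments), and the helper name `toxic_share` is hypothetical.

```python
def toxic_share(n_nontoxic, n_toxic, repeat):
    """Share of toxic rows after repeating them `repeat` times."""
    upsampled = n_toxic * repeat
    return upsampled / (n_nontoxic + upsampled)


# inferred counts in the training sample: 85989 non-toxic, 9753 toxic
print(round(toxic_share(85989, 9753, 1), 3))  # 0.102
print(round(toxic_share(85989, 9753, 8), 6))  # 0.475718
```

The second value matches the share of class 1 reported above.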

We'll create a new corpus of training features.

In [22]:
corpus_train_up = features_upsampled['lemmas'].values

We'll convert the new sample into a matrix and check its dimensionality.

In [23]:
X_train_up = tf_idfconverter.transform(corpus_train_up)

In [24]:
X_train_up.shape

(164013, 6000)

The dimensionality matches the samples obtained earlier. Everything is great!

***
## Conclusion

During the data preparation for forecasting, we successfully:

- Lemmatized the comments using the WordNet Lemmatizer from the NLTK library;
- Cleaned the data of unnecessary characters;
- Vectorized the comments using the tf_idf method;
- Balanced the number of positive and negative comments in the training sample.

# 3. Model training

We'll conduct the training of models and look at the results obtained on the validation sample.  

For this, we'll create a function for training models.

In [25]:
def ml_models(model, ft, tt, fv, tv):
    # fit on the training data (ft, tt) and report F1 on the evaluation data (fv, tv)
    model.fit(ft, tt)
    predictions_valid = model.predict(fv)
    print('f1 = {:.2f}'.format(f1_score(tv, predictions_valid)))

We'll train several models, compare the results and working time.
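
The models are compared by F1 rather than accuracy. A minimal pure-Python sketch of the metric for the positive (toxic) class is given below to show why; the function name `f1_binary` is hypothetical, and in the notebook itself `sklearn.metrics.f1_score` is used.

```python
def f1_binary(y_true, y_pred):
    """F1 for class 1: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# a model that never flags anything is 90% accurate here, but its F1 is 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10
print(f1_binary(y_true, y_pred))  # 0.0
```

With about 10% toxic comments, a classifier that labels everything as non-toxic would look good by accuracy, while F1 exposes that it finds no toxic comments at all.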

**Logistic regression with class_weight='balanced' on the original (not upsampled) data**

In [26]:
%%time
ml_models(LogisticRegression(random_state=12345, class_weight = 'balanced'), X_train_tf, y_train, X_valid_tf, y_valid)

f1 = 0.71
CPU times: user 2.23 s, sys: 462 µs, total: 2.23 s
Wall time: 2.24 s
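
As a rough illustration of what `class_weight='balanced'` does, the weights can be computed by hand with the formula `n_samples / (n_classes * class_count)`. The class counts below are inferred from the upsampling output rather than printed directly in the notebook, and the helper name `balanced_class_weights` is hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weights that are inversely proportional to class frequency."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}


weights = balanced_class_weights({0: 85989, 1: 9753})
print({label: round(w, 3) for label, w in weights.items()})
# {0: 0.557, 1: 4.908}
```

Each toxic comment thus counts about 8.8 times as much as a non-toxic one in the loss, which has an effect similar to the 8-fold upsampling used in the next cell, without enlarging the training matrix.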


**Logistic regression on data balanced using upsampling**

In [27]:
%%time
ml_models(LogisticRegression(random_state=12345), X_train_up, y_train_up, X_valid_tf, y_valid)

f1 = 0.72
CPU times: user 3.75 s, sys: 12.7 ms, total: 3.76 s
Wall time: 3.79 s


**Random forest**

In [28]:
%%time
ml_models(RandomForestClassifier(), X_train_up, y_train_up, X_valid_tf, y_valid)

f1 = 0.68
CPU times: user 58.2 s, sys: 19 ms, total: 58.3 s
Wall time: 58.9 s


**LGBMClassifier**

In [29]:
%%time
ml_models(LGBMClassifier(), X_train_up, y_train_up, X_valid_tf, y_valid)

f1 = 0.73
CPU times: user 6min 37s, sys: 1.53 s, total: 6min 39s
Wall time: 6min 41s


**GradientBoostingClassifier**

In [30]:
%%time
ml_models(GradientBoostingClassifier(), X_train_up, y_train_up, X_valid_tf, y_valid)

f1 = 0.68
CPU times: user 3min 14s, sys: 25.7 ms, total: 3min 14s
Wall time: 3min 17s


In [31]:
d = {'Model' : ['LogisticRegression_b', 'LogisticRegression', 'RandomForestClassifier', 'LGBMClassifier', 'GradientBoostingClassifier'],
    'F1' :pd.Series([0.75, 0.76, 0.64, 0.73, 0.68]),
     'time, s': pd.Series([18, 22, 150, 230, 240])
    }
df1 = pd.DataFrame(d)
df1

Unnamed: 0,Model,F1,"time, s"
0,LogisticRegression_b,0.75,18
1,LogisticRegression,0.76,22
2,RandomForestClassifier,0.64,150
3,LGBMClassifier,0.73,230
4,GradientBoostingClassifier,0.68,240
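
The values in this table were entered by hand. One possible way to collect such measurements programmatically is a small timing wrapper like the one below. It is only a sketch: `timed_call` is a hypothetical helper, and using it for the table would require a variant of `ml_models` that returns the score instead of printing it.

```python
import time


def timed_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# toy usage with a built-in function
result, seconds = timed_call(sum, [1, 2, 3])
print(result)  # 6
```

Collecting the pairs in a list and passing it to `pd.DataFrame` would keep the table in step with the cell outputs.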


## Checking on the test sample

In [32]:
%%time
ml_models(LogisticRegression(random_state=12345), X_train_up, y_train_up, X_test_tf, y_test)

f1 = 0.76
CPU times: user 6.08 s, sys: 27.9 ms, total: 6.11 s
Wall time: 6.11 s


***
## Conclusion

The logistic regression model trained on the upsampled data showed the best combination of quality and speed:

- f1 = 0.76
- Wall time: about 6 s  

The result was confirmed on the test sample.  
  
The LGBMClassifier model looks promising. In theory, its hyperparameters could be tuned to improve the result, but its running time makes this impractical.

***
*Model running times vary significantly between runs.*

***
# 4. BERT

At the time of the project, the most promising method for processing and classifying texts was the BERT neural network.

BERT (Bidirectional Encoder Representations from Transformers) is a neural network for building a language model. It was developed by Google and has been used to improve the relevance of search results. The model takes the context of a query into account rather than analyzing phrases in isolation. For machine learning, it is valuable because it helps to build vector representations of text. In text analysis, a model already pretrained on a large corpus is normally used. Pretrained multilingual versions of BERT are suitable for working with texts in 104 languages, including Russian.

However, it is almost impossible to train it at home, so choosing a model pretrained on a relevant corpus plays a big role.

In our case, we use the DistilBERT model.  
  
DistilBERT is a smaller version of BERT, developed and made publicly available by the Hugging Face team. It is faster and lighter than the original, but still comparable in performance.

Even though the model is already pretrained, running it takes a lot of time, so in this project we will apply it to a sample of 200 comments.

In a trial version, a sample of 2000 comments was also used, but in that case the model ran for about 30 minutes.
We did not keep it in the final version of the project; its result is briefly described below.

In [33]:
batch_1 = df_comments[:200].copy()

Load the pretrained DistilBERT model and tokenizer

In [34]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

DistilBERT accepts sequences of at most 512 tokens. As a simple safeguard, we limit each comment to 512 characters, which keeps the token count within that limit for this sample.

In [35]:
batch_1['text'] = batch_1['text'].str[:512]

Let's tokenize the data

In [36]:
tokenized = batch_1['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

The tokenized dataset is now a list of lists of different lengths. Before DistilBERT can process it, we need to bring the vectors to the same length by appending the identifier 0 (padding) to the shorter ones.

In [37]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

Now let's tell the model that the zeros carry no meaningful information. This is needed by the model component called "attention".  

We will create a mask that marks the real tokens with 1 and the padding positions with 0, so that the padding is ignored.

In [38]:
attention_mask = np.where(padded != 0, 1, 0)
print(attention_mask.shape)

(200, 230)
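
The padding and masking steps can be illustrated on a toy input without NumPy or the tokenizer. In this sketch the helper name `pad_and_mask` is hypothetical and the token ids are illustrative; the mask is built from the original lengths, which gives the same result as comparing with 0 as long as 0 is used only as the padding id.

```python
def pad_and_mask(token_ids, pad_id=0):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in token_ids)
    padded = [ids + [pad_id] * (max_len - len(ids)) for ids in token_ids]
    # 1 marks a real token, 0 marks padding
    mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in token_ids]
    return padded, mask


padded_toy, mask_toy = pad_and_mask([[101, 7592, 102], [101, 102]])
print(padded_toy)  # [[101, 7592, 102], [101, 102, 0]]
print(mask_toy)    # [[1, 1, 1], [1, 1, 0]]
```

In the notebook, the same idea produces a 200 × 230 mask: 200 comments, each padded to the length of the longest tokenized comment.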


We'll create embeddings for user comments, limiting the batch size to 50 objects.

In [39]:
batch_size = 50
embeddings = []
# iterate over full batches (200 comments / 50 = 4 batches)
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])

    # inference only, so gradients are not needed
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)

    # keep the hidden state of the first token ([CLS]) as the comment embedding
    embeddings.append(batch_embeddings[0][:,0,:].numpy())
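
The loop above iterates over `padded.shape[0] // batch_size` batches, which covers every row only when the sample size is a multiple of the batch size (as with 200 and 50). For other sizes, the slice boundaries could be generated as in the sketch below, so that the last, shorter batch is not dropped; the helper name `batch_ranges` is hypothetical.

```python
def batch_ranges(n_items, batch_size):
    """Return (start, stop) index pairs that cover all items."""
    return [(start, min(start + batch_size, n_items))
            for start in range(0, n_items, batch_size)]


print(batch_ranges(200, 50))
# [(0, 50), (50, 100), (100, 150), (150, 200)]
print(batch_ranges(230, 100))
# [(0, 100), (100, 200), (200, 230)]
```

Each pair can then be used to slice both `padded` and `attention_mask`, so that the number of embeddings always equals the number of comments.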

We will collect all embeddings into a feature matrix by calling the concatenate function, highlight the target feature, and divide the data into training and testing samples.

In [40]:
features = np.concatenate(embeddings)
target = batch_1['toxic']

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size = 0.2, random_state=12345)

Let's train a logistic regression model.

In [41]:
%%time
ml_models(LogisticRegression(random_state=12345, class_weight = 'balanced'), features_train, target_train, features_test, target_test)

f1 = 0.67
CPU times: user 69.1 ms, sys: 36.6 ms, total: 106 ms
Wall time: 59.2 ms


***
## Conclusion

On a sample of 200 comments, BERT in conjunction with logistic regression showed an f1 score of 0.67, with a running time of about 4 minutes. With 200 comments the test part contains only 40 of them, so this estimate is rough.
On a sample of 2000 comments, the result was 0.72, with a running time of 35 minutes.  
  
We can assume that on the full sample the result would be equal to or better than that of our logistic regression model trained on tf_idf vectors, but BERT would take a very long time to run.

# 5. Summary

In this work, we trained a model to classify comments as positive or negative.
The logistic regression model trained on vectorized and balanced data showed the best result in terms of quality and speed:

- f1 = 0.76
- Wall time: about 6 s  

Other models could in theory show a better result, but their running time is significantly longer, and tuning them would take a very large amount of time with the available computational resources.