# Detection of toxic comments

The online store "Wikishop" is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

I will train a model to classify comments as toxic or non-toxic. I have a dataset in which each comment is labeled for toxicity.

The model must reach an *F1* score of at least 0.75.
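
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from confusion-matrix counts; the function name `f1_from_counts` and the counts are hypothetical and are not taken from the project data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true/false positive and false negative counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)  # share of flagged comments that are really toxic
    recall = tp / (tp + fn)     # share of toxic comments that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 60 toxic comments caught, 20 false alarms, 20 missed.
score = f1_from_counts(tp=60, fp=20, fn=20)
print(round(score, 2))
```

With these illustrative counts, precision and recall are both 0.75, so F1 sits exactly at the project threshold.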

**Project steps**

1. Download and prepare data.
2. Train different models.
3. Draw conclusions.

**Data Description**

The data is in the `toxic_comments.csv` file. The *text* column contains the text of the comment, and *toxic* is the target attribute.

## Preprocessing

Loading the necessary libraries.

In [1]:
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem.snowball import EnglishStemmer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

# Stop-word list used later by the TF-IDF vectorizer.
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv')
df.head()

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Notably, the data is complete: neither column has missing values.

Separate the target from the features.

In [4]:
target = df['toxic']
features = df['text']

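The data is split into training and test sets in a 70/30 ratio. The split is made without shuffling (`shuffle=False`), so the first 70% of the rows form the training set.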

In [5]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=.3, shuffle=False)

Convert the text columns to NumPy arrays of Unicode strings.

In [6]:
corpus_train = features_train.values.astype('U')

In [7]:
corpus_test = features_test.values.astype('U')

I clean the text with a regular expression, keeping only Latin letters and replacing every other character with a space.

In [8]:
def clear_text(text):
    # Replace every character that is not a Latin letter with a space.
    cleaned = re.sub(r'[^a-zA-Z]', ' ', text)
    return cleaned

In [9]:
for i in range(len(corpus_train)):
    corpus_train[i] = clear_text(corpus_train[i])

In [10]:
for i in range(len(corpus_test)):
    corpus_test[i] = clear_text(corpus_test[i])
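
To make the effect of the cleaning step concrete, here is a small standalone sketch that applies the same regular expression to one made-up comment. The sample string is an assumption for illustration only; it is not a row from the dataset.

```python
import re


def clear_text(text: str) -> str:
    # Same pattern as in the notebook: anything that is not a Latin letter
    # (digits, punctuation, newlines, apostrophes) becomes a single space.
    return re.sub(r'[^a-zA-Z]', ' ', text)


sample = "D'aww! He matches this background colour, 100%."
cleaned = clear_text(sample)

# The string keeps its length, because each removed character is replaced
# one-for-one by a space; splitting on whitespace yields the surviving words.
tokens = cleaned.split()
print(tokens)
```

Note that the apostrophe splits `D'aww` into two tokens and that the number disappears entirely, which is worth keeping in mind when reading the stemmed corpus.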

I reduce the words to their stems using the Snowball stemmer (`EnglishStemmer`) from the nltk library.

In [11]:
stemmer = EnglishStemmer(ignore_stopwords=False)

In [12]:
%%time
for i in range(len(corpus_train)):
    word_list = nltk.word_tokenize(corpus_train[i])
    corpus_train[i] = ' '.join([stemmer.stem(w) for w in word_list])

CPU times: user 1min 41s, sys: 48.8 ms, total: 1min 41s
Wall time: 1min 44s


In [13]:
%%time
for i in range(len(corpus_test)):
    word_list = nltk.word_tokenize(corpus_test[i])
    corpus_test[i] = ' '.join([stemmer.stem(w) for w in word_list])

CPU times: user 42.9 s, sys: 7.77 ms, total: 42.9 s
Wall time: 43.7 s


Load the English stop-word list from nltk into the `stopwords` variable.

In [14]:
stopwords = set(nltk_stopwords.words('english'))

Data vectorization.

In [15]:
# Pass a list: recent scikit-learn versions reject a set for stop_words.
count_tf_idf = TfidfVectorizer(stop_words=list(stopwords))
tf_idf = count_tf_idf.fit(corpus_train)

In [16]:
train_x = tf_idf.transform(corpus_train)

In [17]:
test_x = tf_idf.transform(corpus_test)
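
As a rough illustration of what the vectorizer computes, the sketch below reproduces the smoothed TF-IDF weighting that scikit-learn documents as its default (raw term counts, `idf = ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row). It is a simplified stand-in, not the library code: tokenization is a plain whitespace split, there is no stop-word filtering, and the helper name `tfidf_rows` and the three toy documents are assumptions.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


docs = ["you are my hero", "you are wrong", "edit war"]
vocab, rows = tfidf_rows(docs)
for tok, weight in zip(vocab, rows[0]):
    print(f"{tok:>6}: {weight:.3f}")
```

In the first document, the rare word `hero` gets a higher weight than `you`, which also appears in the second document. As in the notebook, the statistics (vocabulary and document frequencies) should be learned from the training corpus only and then reused to transform the test corpus.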

###### Conclusion

The data are prepared, transformed and divided into samples. Everything is ready to train the models.

## Training

### Logistic regression

First, I train a logistic regression and then measure F1 on the test set.

In [18]:
model = LogisticRegression(random_state=12345, solver = 'liblinear', class_weight='balanced')

In [19]:
model.fit(train_x, target_train)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=12345, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)
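
The `class_weight='balanced'` option compensates for the rarity of toxic comments. According to the scikit-learn documentation, the weight of each class is `n_samples / (n_classes * count_of_that_class)`. The sketch below applies that formula to a made-up label list; the helper name `balanced_class_weights` and the 9:1 ratio are assumptions chosen for illustration, not the real class ratio of the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Per-class weights following the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical labels: nine non-toxic comments for every toxic one.
labels = [0] * 9 + [1] * 1
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives a proportionally larger weight, so mistakes on toxic comments cost more during training. This typically raises recall on the toxic class at the expense of precision.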

In [20]:
predicted = pd.Series(model.predict(test_x))

In [21]:
# f1_score expects (y_true, y_pred); binary F1 is unchanged by the swap.
f1_score(target_test, predicted).round(2)

0.75

### Random Forest

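Next, I tune a random forest model using 3-fold cross-validation. Note that `cross_val_score` is called without a `scoring` argument, so the scores below are the classifier's default metric, accuracy, and not F1.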

In [22]:
for max_depth in range(5, 26, 5):
    rf_model = RandomForestClassifier(class_weight = 'balanced', max_depth = max_depth, n_estimators = 100)
    rf_cv = cross_val_score(rf_model, train_x, target_train, cv=3)
    print("Score at max_depth =", max_depth, ":", rf_cv)
    print("Score mean =", sum(rf_cv)/len(rf_cv))
    print()

Score at max_depth = 5 : [0.61846699 0.64037279 0.61906425]
Score mean = 0.625968008641557

Score at max_depth = 10 : [0.69218993 0.68393629 0.69974753]
Score mean = 0.6919579186706691

Score at max_depth = 15 : [0.73575227 0.75083931 0.71003438]
Score mean = 0.7322086525583004

Score at max_depth = 20 : [0.75425686 0.7384041  0.7372153 ]
Score mean = 0.7432920881873307

Score at max_depth = 25 : [0.77969061 0.76547686 0.75580146]
Score mean = 0.7669896427976592


In [23]:
for estim in range(50, 201, 50):
    rf_model = RandomForestClassifier(class_weight = 'balanced', max_depth = 5, n_estimators = estim)
    rf_cv = cross_val_score(rf_model, train_x, target_train, cv=3)
    print("Score at n_estimators =", estim, ":", rf_cv)
    print("Score mean =", sum(rf_cv)/len(rf_cv))
    print()

Score at n_estimators = 50 : [0.62942472 0.61936991 0.67756231]
Score mean = 0.6421189817061267

Score at n_estimators = 100 : [0.67156362 0.64273628 0.64412333]
Score mean = 0.6528077472467088

Score at n_estimators = 150 : [0.62638986 0.66215454 0.65545767]
Score mean = 0.6480006899597318

Score at n_estimators = 200 : [0.67191277 0.68603121 0.63308444]
Score mean = 0.6636761400878753


In [28]:
model_1 = RandomForestClassifier(class_weight = 'balanced',
                    max_depth = 25,
                    n_estimators = 200)

In [29]:
model_1.fit(train_x, target_train)

RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=25, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=200, n_jobs=None, oob_score=False,
                       random_state=None, verbose=0, warm_start=False)

In [30]:
predicted_1 = pd.Series(model_1.predict(test_x))

In [31]:
# f1_score expects (y_true, y_pred); binary F1 is unchanged by the swap.
f1_score(target_test, predicted_1).round(2)

0.42
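
The cross-validation scores above and this test score are not directly comparable: `cross_val_score` without a `scoring` argument reports accuracy for a classifier, while the test set is evaluated with F1. The sketch below shows, on made-up labels, how a classifier can look acceptable by accuracy and still have a low F1 when the positive class is rare. The label lists and helper names are hypothetical and unrelated to the real predictions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_positive(y_true, y_pred):
    """F1 for class 1, using the equivalent form 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical set: 100 comments, 10 of them toxic.
# The classifier flags 25 comments: 8 truly toxic and 17 clean ones.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 8 + [0] * 2 + [1] * 17 + [0] * 73

acc = accuracy(y_true, y_pred)
f1 = f1_positive(y_true, y_pred)
print(f"accuracy = {acc:.2f}, F1 = {f1:.2f}")
```

To tune the forest against the project metric, `cross_val_score` accepts `scoring='f1'`, which would make the cross-validation numbers comparable with the test result.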

## General Conclusion

Results so far:

* The data was examined.
* The data was split into training and test sets.
* Data types were converted (most of the kernel errors occurred at this stage).
* The text was cleaned with a regular expression, keeping only Latin letters.
* Stemming was done with nltk.
* Stop words were removed.
* The data was vectorized with TF-IDF.
* Logistic regression reached the target value of the metric (F1 = 0.75 on the test set).
* The random forest model did not perform well on this task: its F1 on the test set is 0.42, below the required threshold.