# Project for Wikishop

The Wikishop online store is launching a new service. Now users can edit and add product descriptions, just like in wiki communities. That is, clients propose their own edits and comment on the changes of others. The store needs a tool that will look for toxic comments and send them for moderation.

Let's train a model to classify comments as toxic or non-toxic. We have a dataset in which each edit is labeled for toxicity.

The goal is a model with an *F1* score of at least 0.75.
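As a reminder of what the target metric measures, here is a minimal sketch that computes F1 for the positive (toxic) class from confusion-matrix counts. The helper name `f1_from_counts` and the counts are hypothetical and are not taken from the project data.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives and false negatives."""
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is taken as 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 toxic comments missed
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 3))  # 0.762
```

Because F1 ignores true negatives, it is not inflated by the large share of non-toxic comments the way accuracy can be.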

**Instructions for completing the project**

1. Load and prepare the data.
2. Train different models.
3. Draw conclusions.

It is not necessary to use *BERT* to complete the project, but you can try.

**Description of data**

The data is in the file `toxic_comments.csv`.

The *text* column in it contains the text of the comment, and *toxic* is the target attribute.

## Preparation

Let's import the necessary libraries and take a look at the data.

In [1]:
import pandas as pd

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import f1_score

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
import re

df = pd.read_csv('/datasets/toxic_comments.csv')

display(df.head())

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


| Unnamed: 0 | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


Let's write a lemmatization function and apply it to the comments.

In [2]:
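# WordNetLemmatizer needs the WordNet corpus; the download is skipped if it is already present
nltk.download('wordnet', quiet=True)

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    # Split on whitespace and lemmatize each token (treated as a noun by default)
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

# Lemmatize every comment, then join the tokens back into a single string
df['lemma_text'] = df['text'].apply(lemmatize_text)
df['lemma_text'] = df['lemma_text'].apply(" ".join)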

Let's remove all non-letter characters.

In [3]:
def clear_text(text):
    # Lowercase, replace every non-letter character with a space, then collapse repeated whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(text.split())

df['lemma_text'] = df['lemma_text'].apply(clear_text)
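To make the effect of the cleaning step concrete, here is a small self-contained sketch that applies the same logic to one made-up comment. The sample string is hypothetical and is not taken from the dataset.

```python
import re


def clear_text(text: str) -> str:
    # Same steps as the notebook cell: lowercase, keep letters only, collapse whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(text.split())


# Hypothetical sample comment
sample = "D'aww! He's 100% right..."
cleaned = clear_text(sample)
print(cleaned)  # d aww he s right
```

Note that apostrophes are replaced with spaces too, so contractions are split into separate tokens ("he's" becomes "he" and "s").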

Let's split the data into training and test sets of equal size.

In [4]:
train, test = train_test_split(df, test_size = 0.5, random_state = 12345)

print(train.shape)
print(test.shape)

(79785, 3)
(79786, 3)


In [5]:
corpus = train['lemma_text'].values.astype('U')

In [6]:
stopwords = set(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(sublinear_tf = True, stop_words=stopwords)
tf_idf = count_tf_idf.fit_transform(corpus)

In [7]:
# The labels are cast to strings, so the positive (toxic) class is '1' rather than 1
target_train = train['toxic'].values.astype('U')
target_test = test['toxic'].values.astype('U')

In [8]:
# Transform only: the vocabulary and IDF weights come from the training corpus
features_test = count_tf_idf.transform(test['lemma_text'])
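The vectorizer above is created with `sublinear_tf=True`. The following pure-Python sketch shows what that weighting does on a toy corpus, assuming scikit-learn's default smoothed IDF and L2 row normalization. The function `tfidf_sublinear` and the toy corpus are illustrative only; stop-word removal and scikit-learn's own tokenization are deliberately left out.

```python
import math
from collections import Counter


def tfidf_sublinear(corpus):
    """TF-IDF rows with tf -> 1 + ln(tf), smoothed idf, and L2 normalization."""
    docs = [Counter(doc.split()) for doc in corpus]  # simple whitespace tokenization (assumption)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in doc)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        weights = {t: (1 + math.log(tf)) * idf[t] for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy_corpus = ["you are my hero", "you you you are wrong"]
toy_rows = tfidf_sublinear(toy_corpus)
for row in toy_rows:
    print({t: round(w, 3) for t, w in row.items()})
```

With sublinear scaling, the word repeated three times in the second document gets a term-frequency factor of 1 + ln 3 ≈ 2.1 rather than 3, so heavy repetition in a comment counts for less.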

## Model training

<b>Logistic regression</b>

In [9]:
lr = LogisticRegression(random_state=12345)

# liblinear and saga are the solvers that support the l1 penalty
parameters_lr_1 = {'penalty': ['l1'],
                   'solver': ['liblinear', 'saga'],
                   'class_weight': [None, 'balanced']}
parameters_lr_2 = {'penalty': ['l2'],
                   'solver': ['lbfgs', 'saga', 'sag', 'newton-cg'],
                   'class_weight': [None, 'balanced']}

# No scoring argument is passed, so the search ranks candidates by accuracy (the classifier default)
search_lr_1 = RandomizedSearchCV(lr, parameters_lr_1, cv=5, n_jobs=-1)
search_lr_1.fit(tf_idf, target_train)
best_lr_1 = search_lr_1.best_estimator_
predict_lr_1 = best_lr_1.predict(features_test)

search_lr_2 = RandomizedSearchCV(lr, parameters_lr_2, cv=5, n_jobs=-1)
search_lr_2.fit(tf_idf, target_train)
best_lr_2 = search_lr_2.best_estimator_
predict_lr_2 = best_lr_2.predict(features_test)

print('F1 for Logistic Regression (l1 regularization) on the test set:', f1_score(target_test, predict_lr_1, pos_label='1'))
print()
print('F1 for Logistic Regression (l2 regularization) on the test set:', f1_score(target_test, predict_lr_2, pos_label='1'))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


F1 for Logistic Regression (l1 regularization) on the test set: 0.7711739527050733

F1 for Logistic Regression (l2 regularization) on the test set: 0.7132268663559257
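The searches above are called without a `scoring` argument, in which case scikit-learn falls back to the estimator's own `score` method, which is accuracy for classifiers. Since the project target is F1, one option is to pass an F1 scorer explicitly. The sketch below shows this on synthetic data; the data, the parameter grid, and the names `X_demo`, `y_demo`, `f1_toxic`, and `search_demo` are assumptions for illustration, not the notebook's actual setup, and the notebook's reported scores were not produced this way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the TF-IDF features (an assumption)
X_demo, y_int = make_classification(
    n_samples=300, n_features=20, weights=[0.9, 0.1], random_state=12345
)
y_demo = y_int.astype(str)  # string labels '0'/'1', mirroring the notebook's targets

# With string labels the positive class must be named explicitly
f1_toxic = make_scorer(f1_score, pos_label='1')

param_grid = {'C': [0.1, 1.0, 10.0], 'class_weight': [None, 'balanced']}

search_demo = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    n_iter=6,           # the grid has only 6 combinations, so all of them are tried
    scoring=f1_toxic,   # rank candidates by F1 instead of accuracy
    cv=3,
    random_state=12345,
)
search_demo.fit(X_demo, y_demo)
print(search_demo.best_params_, round(search_demo.best_score_, 3))
```

Selecting hyperparameters by the same metric that is reported at the end keeps the search aligned with the project goal, which matters most when the classes are imbalanced.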


<b>SGD</b>

In [10]:
sgd = SGDClassifier(random_state=12345)
parameters_sgd = {'penalty': ['l1', 'l2', 'elasticnet'],
                  'alpha': [0.00001, 0.00005, 0.0001, 0.001, 0.01, 0.1],
                  'max_iter': range(100, 1500, 200),
                  'early_stopping': [True, False],
                  'class_weight': [None, 'balanced']}
# Only a random subset of the combinations is tried (10 by default)
search_sgd = RandomizedSearchCV(sgd, parameters_sgd, cv=5, n_jobs=-1)
search_sgd.fit(tf_idf, target_train)
best_sgd = search_sgd.best_estimator_
predict_sgd = best_sgd.predict(features_test)
print('F1 for SGD on the test set:', f1_score(target_test, predict_sgd, pos_label='1'))

F1 for SGD on the test set: 0.7035253921416369


## Conclusions

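Logistic regression suits this task best. With l1 regularization it reaches F1 = 0.77 on the test set, above the required 0.75, while logistic regression with l2 regularization (F1 = 0.71) and SGD (F1 = 0.70) fall short of the threshold.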