
# Project for an online store

A new service is being built in which users can edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

We need to train a model that classifies comments as positive or negative (non-toxic or toxic). We have a dataset in which each edit is labeled for toxicity.


**Project stages**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.


**Data description**

The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

Load all the libraries we use.

In [1]:
import pandas as pd
import numpy as np
import re
from pymystem3 import Mystem
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import f1_score, make_scorer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
import nltk
from nltk.corpus import wordnet
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier

import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')

Let's take a look at the data.

In [3]:
data

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


Check for duplicates.

In [4]:
data.duplicated().value_counts()

False    159571
dtype: int64

Let's find out the ratio of negative (class 0) to positive (class 1, toxic) comments.

In [5]:
# normalize=True returns class shares instead of raw counts
class_frequency = data['toxic'].value_counts(normalize=True)
class_frequency

0    0.898321
1    0.101679
Name: toxic, dtype: float64

Well, there are no duplicates, and class 1 (toxic) comments make up only about 10% of the data.<br>
To train the model effectively, we will use the "stratify" parameter when splitting the data, so that the training and test sets keep the same proportion of class 0 and class 1 observations.
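
To illustrate what "stratify" does, here is a minimal sketch on synthetic labels with a similar 90/10 imbalance. The arrays `X` and `y` are made up for illustration; only `train_test_split` comes from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (an assumption for illustration): 900 zeros and 100 ones
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature column

# With stratify=y both parts keep the 10% share of class 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=12345
)

train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```

Without `stratify`, the share of the rare class in a small test set can drift away from 10% purely by chance.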

Now let's preprocess the data.<br>

Clean the text.

In [6]:
otziv = []
for i in data['text']:
    clear = re.sub(r'[^a-zA-Z0-9 ]', ' ', i)
    otziv.append(' '.join(clear.split()))
    
data['clear'] = otziv

data

Unnamed: 0,text,toxic,clear
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He matches this background colour I m se...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not trying to edit war It s...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestions on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm theres no actual article for prost...
159569,And it looks like it was actually you who put ...,0,And it looks like it was actually you who put ...
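
The cleaning step can also be written as a small reusable function. The sketch below applies the same regular expression as the loop above; the function name `clean_text` and the sample string are assumptions made for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Replace everything except Latin letters, digits and spaces with a space,
    then collapse runs of whitespace into single spaces."""
    only_alnum = re.sub(r'[^a-zA-Z0-9 ]', ' ', text)
    return ' '.join(only_alnum.split())

sample = "D'aww! He matches this\nbackground colour"
cleaned = clean_text(sample)
print(cleaned)  # D aww He matches this background colour
```

With such a function the column could be filled with `data['text'].apply(clean_text)`. Note that apostrophes are replaced too, so contractions are split ("I'm" becomes "I m"), as the table above shows.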


Since 159.5 thousand rows are too many to train on in a reasonable time, we take a sample of 80 thousand rows.

In [7]:
sample = 80000
corpus = data.sample(n = sample, random_state = 1234).reset_index(drop=True)

Lemmatize the text with "WordNetLemmatizer".

In [8]:
m = WordNetLemmatizer()

def lemma(corpus):
    corpus_new = []
    for sentence in corpus:
        word_list = nltk.word_tokenize(sentence)
        corpus_new.append(' '.join([m.lemmatize(w) for w in word_list]))
    return corpus_new
    
corpus['lemm_text'] = lemma(corpus['clear'])

corpus

Unnamed: 0,text,toxic,clear,lemm_text
0,"""\n\nYour latest edit has again had to be remo...",0,Your latest edit has again had to be removed a...,Your latest edit ha again had to be removed an...
1,"""\n\nWorks for me. BaStun not BaTsun """,0,Works for me BaStun not BaTsun,Works for me BaStun not BaTsun
2,List of newspapers in India by readership\n\nT...,0,List of newspapers in India by readership This...,List of newspaper in India by readership This ...
3,Happy New Years\nCheers ),0,Happy New Years Cheers,Happy New Years Cheers
4,Photos \nSouthwestern’s official website has m...,0,Photos Southwestern s official website has man...,Photos Southwestern s official website ha many...
...,...,...,...,...
79995,The way of the force....very one-sided? \n\nHa...,0,The way of the force very one sided Have you e...,The way of the force very one sided Have you e...
79996,"""\n\n thanks for your help. I agree that """"he...",0,thanks for your help I agree that he may be fa...,thanks for your help I agree that he may be fa...
79997,I've added a couple of articles about the game...,0,I ve added a couple of articles about the game...,I ve added a couple of article about the game ...
79998,"""\n\nWhen was the last time you got laid, """"Sc...",0,When was the last time you got laid Scientizzl...,When wa the last time you got laid Scientizzle...


In [9]:
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Let's write a function that determines a word's part of speech and returns the corresponding WordNet tag, looked up in a dictionary.

In [10]:
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {'J': wordnet.ADJ,
               'N': wordnet.NOUN,
               'V': wordnet.VERB,
               'R': wordnet.ADV,}
    return tag_dict.get(tag, wordnet.NOUN)
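
The mapping itself does not depend on NLTK and can be sketched as a pure function over Penn Treebank tags. The name `penn_to_wordnet` and the hand-written tagged sentence are assumptions for illustration; the one-letter codes are the values of `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`.

```python
# WordNet POS codes as plain strings (the values behind wordnet.ADJ, NOUN, VERB, ADV)
TAG_TO_WORDNET = {'J': 'a', 'N': 'n', 'V': 'v', 'R': 'r'}

def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBD', 'NNS') to a WordNet POS code.
    Tags outside the four groups fall back to noun, as in the notebook."""
    return TAG_TO_WORDNET.get(tag[:1].upper(), 'n')

# Hypothetical tagger output for one sentence (written by hand, not produced by NLTK here)
tagged = [('edits', 'NNS'), ('were', 'VBD'), ('quickly', 'RB'),
          ('removed', 'VBN'), ('the', 'DT')]
codes = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(codes)
```

In the notebook, such pairs could come from a single `nltk.pos_tag(nltk.word_tokenize(sentence))` call per comment, which gives the tagger the sentence context instead of one isolated word at a time. Whether this is faster on this dataset has not been measured here.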

We also write a function that lemmatizes the text taking pos_tag into account and drops stop words.

In [11]:
def get_word_text(corpus):
    corpus_new = []
    for sentence in corpus:
        corpus_new.append(' '.join([m.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence) if not w in stop_words]))
    return corpus_new

In [12]:
%%time
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

corpus['lemma_text'] = get_word_text(corpus['clear'])

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


CPU times: user 11min 12s, sys: 45.9 s, total: 11min 58s
Wall time: 12min 10s


In [13]:
corpus.head(15)

Unnamed: 0,text,toxic,clear,lemm_text,lemma_text
0,"""\n\nYour latest edit has again had to be remo...",0,Your latest edit has again had to be removed a...,Your latest edit ha again had to be removed an...,Your late edit remove various reason I already...
1,"""\n\nWorks for me. BaStun not BaTsun """,0,Works for me BaStun not BaTsun,Works for me BaStun not BaTsun,Works BaStun BaTsun
2,List of newspapers in India by readership\n\nT...,0,List of newspapers in India by readership This...,List of newspaper in India by readership This ...,List newspaper India readership This page writ...
3,Happy New Years\nCheers ),0,Happy New Years Cheers,Happy New Years Cheers,Happy New Years Cheers
4,Photos \nSouthwestern’s official website has m...,0,Photos Southwestern s official website has man...,Photos Southwestern s official website ha many...,Photos Southwestern official website many phot...
5,"How's this? It's a lot shorter, but it's more ...",0,How s this It s a lot shorter but it s more fo...,How s this It s a lot shorter but it s more fo...,How It lot shorter focus settlement rather hal...
6,Why did you remove skinwalker instead of movin...,0,Why did you remove skinwalker instead of movin...,Why did you remove skinwalker instead of movin...,Why remove skinwalker instead move new character
7,"""I repeat, the text was about a minority schoo...",0,I repeat the text was about a minority school ...,I repeat the text wa about a minority school t...,I repeat text minority school school every min...
8,bay pigs article \n\nhi i noticed that you wer...,1,bay pigs article hi i noticed that you were li...,bay pig article hi i noticed that you were lic...,bay pig article hi notice lick wikiadmins as o...
9,"MusikAnimal is a liar, a disinformant and plag...",0,MusikAnimal is a liar a disinformant and plagu...,MusikAnimal is a liar a disinformant and plagu...,MusikAnimal liar disinformant plague WP Please...


We now have clean, lemmatized text.<br>
Next, split the sample into training and test sets using the "stratify" parameter.

In [14]:
train, test = train_test_split(corpus, stratify = corpus['toxic'], test_size = 0.20, random_state = 12345)

Next, compute TF-IDF. Keep in mind that the vectorizer must be fitted on the training set only, so that word frequencies from the test set do not leak into the model.

In [15]:
train['lemma_text'] = train['lemma_text'].values.astype('U')
test['lemma_text'] = test['lemma_text'].values.astype('U')

# Pass a list: newer scikit-learn versions validate this parameter and may reject a set
count_tf_idf = TfidfVectorizer(stop_words=list(stop_words))
# Learn the vocabulary and IDF weights from the training set only
count_tf_idf.fit(train['lemma_text'])
tf_idf_train = count_tf_idf.transform(train['lemma_text'])
tf_idf_test = count_tf_idf.transform(test['lemma_text'])

print('Training set size:', tf_idf_train.shape)
print('Test set size:', tf_idf_test.shape)


Training set size: (64000, 99505)
Test set size: (16000, 99505)
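
To show why fitting on the training set matters, here is a minimal pure-Python sketch of the IDF part. It uses the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is scikit-learn's default, and leaves out the L2 normalization that `TfidfVectorizer` also applies. The helper names and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def fit_idf(train_docs):
    """Smoothed IDF per term, computed from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def transform(doc, idf):
    """Unnormalized TF-IDF weights for one document; terms unseen during fitting are dropped."""
    counts = Counter(doc.split())
    return {term: count * idf[term] for term, count in counts.items() if term in idf}

train_docs = ["good edit thanks", "bad edit", "thanks for the edit"]
idf = fit_idf(train_docs)

# A "test" document: its words do not change the IDF weights learned above
test_vector = transform("bad bad unseenword edit", idf)
print(test_vector)
```

The word `unseenword` gets no column at all, which is why the training and test matrices have the same number of features.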


Let's move on to training.

## Training

### Logistic regression

In [16]:
model_linear = LogisticRegression(random_state = 123, class_weight = 'balanced')
model_linear.fit(tf_idf_train, train['toxic'])
prediction = model_linear.predict(tf_idf_test)
# f1_score expects the true labels first, then the predictions
print('F1 score:', f1_score(test['toxic'], prediction))

F1 score: 0.7380623792437208


0.738 < 0.75... Let's keep going.
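
As a reminder of what is being compared with 0.75, here is a minimal sketch of the binary F1 score computed by hand. The function name `f1_binary` and the toy labels are assumptions for illustration.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 3 toxic comments out of 10, one missed and one false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

score = f1_binary(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(score, accuracy)
```

Accuracy is 0.8 here while F1 is only about 0.67, which is why F1 is the more informative metric for a rare class.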

In [17]:
%%time

parameters = {'C': np.linspace(0.01, 130, 10), 'class_weight': ['balanced']}

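# No `scoring` is passed, so candidates are ranked by accuracy (the default for classifiers), not by F1
grid_search = GridSearchCV(estimator = LogisticRegression(), param_grid = parameters, n_jobs = -1, cv = 5)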

grid_search.fit(tf_idf_train, train['toxic'])
grid_search.best_params_

CPU times: user 13min 58s, sys: 12min 21s, total: 26min 20s
Wall time: 26min 21s


{'C': 14.453333333333335, 'class_weight': 'balanced'}

In [18]:
model_linear_01 = LogisticRegression(**grid_search.best_params_, random_state = 123)
model_linear_01.fit(tf_idf_train, train['toxic'])
pred_test = model_linear_01.predict(tf_idf_test)
print('F1 score:', f1_score(test['toxic'], pred_test))

F1 score: 0.7468317123489537


After tuning the regularization parameter "C", the result (0.747) is close to the 0.75 target. It might have been reached with a larger sample. Still, let's continue and tune the classification threshold.

In [19]:
%%time

model_linear = LogisticRegression(random_state = 123, class_weight = 'balanced')
model_linear.fit(tf_idf_train, train['toxic'])
probabilities_train = model_linear.predict_proba(tf_idf_train)
probabilities_one_train = probabilities_train[:, 1]
f1_best = 0
best_threshold = 0
for threshold in np.arange(0.5, 1, 0.02):
    predicted_train = probabilities_one_train > threshold
    f1 = f1_score(train['toxic'], predicted_train)
    if f1 > f1_best:
        f1_best = f1
        best_threshold = threshold

print('F1 score:', f1_best, 'Optimal classification threshold:', best_threshold)

F1 score: 0.8752624339390429 Optimal classification threshold: 0.6000000000000001
CPU times: user 6.56 s, sys: 5.23 s, total: 11.8 s
Wall time: 11.8 s
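
The threshold here is selected on the same training data the model was fitted on, so the F1 of 0.875 is an optimistic estimate. A common alternative is to pick the threshold on a held-out validation split. Below is a minimal sketch of such a selection loop in plain Python; the labels, the probabilities and the helper names are all made up for illustration.

```python
def f1_binary(y_true, y_pred):
    """Binary F1 written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold_on(y_true, proba, thresholds):
    """Return (threshold, f1) with the highest F1; ties keep the lowest threshold."""
    best = (thresholds[0], -1.0)
    for thr in thresholds:
        y_pred = [int(p > thr) for p in proba]
        score = f1_binary(y_true, y_pred)
        if score > best[1]:
            best = (thr, score)
    return best

# Made-up validation labels and predicted probabilities of class 1
y_valid = [0, 0, 0, 0, 1, 0, 1, 1]
proba_valid = [0.10, 0.30, 0.55, 0.58, 0.62, 0.65, 0.80, 0.90]

threshold, score = best_threshold_on(y_valid, proba_valid, [0.5, 0.6, 0.7, 0.8])
print(threshold, score)
```

In the notebook this would mean splitting `train` once more and passing the validation part's `predict_proba` output to the loop, leaving the test set for the final check only.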


Now let's try the selected threshold on the test data.

In [20]:
model_linear = LogisticRegression(random_state = 123, class_weight = 'balanced')
model_linear.fit(tf_idf_train, train['toxic'])
probabilities_test = model_linear.predict_proba(tf_idf_test)
probabilities_one_test = probabilities_test[:, 1]
predicted_test = probabilities_one_test > best_threshold
f1 = f1_score(test['toxic'], predicted_test)
# Report the threshold actually used instead of a hard-coded value
print('F1 score:', f1, f'| At classification threshold: {best_threshold:.2f}')

F1 score: 0.7543859649122806 | At classification threshold: 0.60


**BINGO! The target metric value has been reached: F1 > 0.75**

### Decision tree

For cross-validation, we create our own F1 scorer.

In [22]:
f1 = make_scorer(f1_score)

In [23]:
%%time

model_tree = DecisionTreeClassifier(random_state = 12345)
pred = cross_val_score(model_tree, tf_idf_train, train['toxic'], cv = 5, scoring = f1)
print('F1 score:', pred.mean())

F1 score: 0.6934490734779982
CPU times: user 11min 53s, sys: 0 ns, total: 11min 53s
Wall time: 12min 15s


The decision tree performs poorly.

Let's try tuning the depth of the decision tree with GridSearchCV.

In [25]:
model_tree = DecisionTreeClassifier(random_state = 1234)
param = {'max_depth': range(1, 30)}

grid_tree = GridSearchCV(model_tree, param, cv=5, scoring = f1, verbose = False)
grid_tree.fit(tf_idf_train, train['toxic'])
grid_tree.best_params_

{'max_depth': 28}

In [26]:
model_tree = DecisionTreeClassifier(**grid_tree.best_params_, random_state = 123)
model_tree.fit(tf_idf_train, train['toxic'])
predict_test = model_tree.predict(tf_idf_test)
print('F1 score:', f1_score(test['toxic'], predict_test))

F1 score: 0.6602907421576129


The decision tree still performs poorly.

### Random forest

In [28]:
%%time

model_forest = RandomForestClassifier(random_state = 1234)
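# Note: these lists are literal candidate values (1, 30 and 3 trees; depths 2, 20 and 2),
# not (start, stop, step) ranges
param = {'n_estimators': [1, 30, 3],
         'max_depth': [2, 20, 2]}

grid_forest = GridSearchCV(model_forest, param, cv=5, scoring = f1, verbose = False)
grid_forest.fit(tf_idf_train, train['toxic'])
# Use a separate name so that the `f1` scorer is not overwritten
best_f1 = grid_forest.best_score_
print('F1:', best_f1)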

F1: 0.11142951630123196
CPU times: user 59.6 s, sys: 0 ns, total: 59.6 s
Wall time: 1min 1s


The score is far below the values obtained earlier, so we will not run this model on the test set.
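
Part of the reason may be the grid itself: the lists in the cell above are read as three literal values per parameter. The sketch below compares that grid with the one the `(start, stop, step)` triples may have been meant to express; that intent is an assumption, and `count_candidates` is a helper introduced only for illustration.

```python
from itertools import product

# Grid exactly as written in the cell above: three literal values per parameter
literal_grid = {'n_estimators': [1, 30, 3], 'max_depth': [2, 20, 2]}

# Grid built from ranges (an assumption about the intended search space)
range_grid = {'n_estimators': list(range(1, 30, 3)),
              'max_depth': list(range(2, 20, 2))}

def count_candidates(grid):
    """Number of parameter combinations an exhaustive grid search would try."""
    return len(list(product(*grid.values())))

n_literal = count_candidates(literal_grid)
n_range = count_candidates(range_grid)
print(range_grid['n_estimators'])  # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
print(n_literal, n_range)          # 9 90
```

The range-based grid has ten times as many candidates, so with `cv=5` it would take correspondingly longer than the roughly one minute measured above.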

### CatBoostClassifier

In [29]:
%%time
text_features = ['lemma_text']

model_cat_01 = CatBoostClassifier(verbose = 50, 
                                  loss_function='Logloss', 
                                  eval_metric = 'F1',
                                  iterations=1000, 
                                  learning_rate=0.2,
                                  random_seed = 0)
model_cat_01.fit(tf_idf_train, train['toxic'])
prediction = model_cat_01.predict(tf_idf_test)
print("F1 score:", f1_score(test['toxic'], prediction))

0:	learn: 0.3971188	total: 3.41s	remaining: 56m 47s
50:	learn: 0.6625853	total: 2m 24s	remaining: 44m 45s
100:	learn: 0.7148471	total: 4m 51s	remaining: 43m 17s
150:	learn: 0.7492537	total: 7m 16s	remaining: 40m 52s
200:	learn: 0.7609278	total: 9m 39s	remaining: 38m 21s
250:	learn: 0.7751257	total: 12m 3s	remaining: 35m 57s
300:	learn: 0.7898917	total: 14m 26s	remaining: 33m 32s
350:	learn: 0.8002151	total: 16m 50s	remaining: 31m 7s
400:	learn: 0.8058556	total: 19m 14s	remaining: 28m 44s
450:	learn: 0.8091508	total: 21m 37s	remaining: 26m 19s
500:	learn: 0.8126776	total: 24m	remaining: 23m 55s
550:	learn: 0.8147622	total: 26m 24s	remaining: 21m 31s
600:	learn: 0.8176023	total: 28m 47s	remaining: 19m 7s
650:	learn: 0.8214947	total: 31m 11s	remaining: 16m 43s
700:	learn: 0.8240790	total: 33m 34s	remaining: 14m 19s
750:	learn: 0.8253744	total: 35m 58s	remaining: 11m 55s
800:	learn: 0.8313719	total: 38m 23s	remaining: 9m 32s
850:	learn: 0.8354475	total: 40m 47s	remaining: 7m 8s
900:	learn:

F1 = 0.74 is too low a score for this much training time.

## Conclusions

The data analysis showed that the classes are strongly imbalanced: about 90% of the comments belong to the negative class (0) versus about 10% to the positive class (1, toxic).<br>
The English text was cleaned and lemmatized (including with the help of "pos_tag").<br>
Among the trained models, the best F1 score was obtained by logistic regression with balanced class weights and the classification threshold raised to 0.6. The target was reached: F1 = 0.754 on the test set.