<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for "Wikishop"

The online store "Wikishop" is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you are welcome to try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
# Import the libraries used later in the notebook:
import pandas as pd
import numpy as np
from catboost import Pool, CatBoostClassifier, CatBoost, cv
import lightgbm as lgb
from lightgbm import LGBMClassifier
from tqdm import notebook
import warnings

# sklearn
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, MinMaxScaler
from sklearn.utils import shuffle
from sklearn.compose import ColumnTransformer
from sklearn.multioutput import MultiOutputRegressor

In [2]:
# Import libraries for text processing
import torch, transformers, nltk, re
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import wordnet
from nltk.corpus import stopwords as nltk_stopwords

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
data = pd.read_csv('/datasets/toxic_comments.csv')
data.info()
print()
print(data.columns)
print()
display(data.head())
print()
print('Number of missing values:')
print(data.isna().sum())
print()
print('Number of duplicates:', data.duplicated().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB

Index(['text', 'toxic'], dtype='object')



Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0



Number of missing values:
text     0
toxic    0
dtype: int64

Number of duplicates: 0


There are no missing values or exact duplicates, and the column names need no changes.  
Next, the `text` column needs to be lemmatized.

In [4]:
# Build the text corpus and cast it to Unicode strings:
data['text'] = data['text'].astype('U').values

Define a function for lemmatizing English text and check how it works.

In [5]:
# Lemmatize with POS Tag
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)
# Create the lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatize the first row of the corpus
print(' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(data.loc[0,'text'])]))

Explanation Why the edits make under my username Hardcore Metallica Fan be revert ? They be n't vandalism , just closure on some GAs after I vote at New York Dolls FAC . And please do n't remove the template from the talk page since I 'm retire now.89.205.38.27


Clean the text with regular expressions.

In [6]:
def clear_text(text):
    text = re.sub(r'[^a-zA-Z ]', ' ', text) 
    text = " ".join(text.split())
    return text

print("Original text:", data.loc[0,'text'])
print("Lemmatized text:", [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(data.loc[0,'text'])])
print("Cleaned and lemmatized text:", " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in
                                                nltk.word_tokenize(clear_text(data.loc[0,'text']))]))

Original text: Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
Lemmatized text: ['Explanation', 'Why', 'the', 'edits', 'make', 'under', 'my', 'username', 'Hardcore', 'Metallica', 'Fan', 'be', 'revert', '?', 'They', 'be', "n't", 'vandalism', ',', 'just', 'closure', 'on', 'some', 'GAs', 'after', 'I', 'vote', 'at', 'New', 'York', 'Dolls', 'FAC', '.', 'And', 'please', 'do', "n't", 'remove', 'the', 'template', 'from', 'the', 'talk', 'page', 'since', 'I', "'m", 'retire', 'now.89.205.38.27']
Cleaned and lemmatized text: Explanation Why the edits make under my username Hardcore Metallica Fan be revert They weren t vandalism just closure on some GAs after I vote at New York Dolls FAC And please don t remove the template from the talk page since I m retire now
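The cleanup step replaces every non-letter with a space, so contractions are split into fragments such as `weren t`, which the lemmatizer can no longer reduce to `be`. The sketch below reproduces that effect on a shortened sample. The variant `clear_text_keep_apostrophes` is hypothetical and not part of the original notebook; it only illustrates one possible way to keep contractions intact for the tokenizer.

```python
import re

def clear_text(text):
    """Cleanup as used in the notebook: keep Latin letters and single spaces."""
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(text.split())

def clear_text_keep_apostrophes(text):
    """Hypothetical variant: additionally keep apostrophes."""
    text = re.sub(r"[^a-zA-Z' ]", ' ', text)
    return " ".join(text.split())

sample = "They weren't vandalisms, just closure.89.205.38.27"
cleaned = clear_text(sample)
cleaned_keep = clear_text_keep_apostrophes(sample)
print(cleaned)
print(cleaned_keep)
```

Whether keeping apostrophes improves the final F1 is not tested here; it would have to be checked on the actual data.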


The lemmatizer works as intended on the sample. Now lemmatize every text in the dataset.

*CPU times: user 30min 12s, sys: 2min 33s, total: 32min 46s*  
*Wall time: 32min 59s*

In [7]:
%%time
data['lemm_text'] = data['text'].map(lambda x: " ".join([lemmatizer.lemmatize(w, get_wordnet_pos(w))
                                                         for w in nltk.word_tokenize(clear_text(x))]))

CPU times: user 31min 9s, sys: 2min 38s, total: 33min 48s
Wall time: 34min 2s
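The run above takes roughly 34 minutes, largely because `get_wordnet_pos` calls `nltk.pos_tag` once per token and the same words recur many times in the corpus. Since each word is tagged in isolation, its lemma depends only on the word itself, so the result can be memoized. The following is a minimal sketch of that idea. `slow_lemmatize` is a hypothetical stand-in, used so the block runs without NLTK data; in the notebook it would wrap `lemmatizer.lemmatize(word, get_wordnet_pos(word))`.

```python
from functools import lru_cache

calls = {"count": 0}

def slow_lemmatize(word):
    """Hypothetical stand-in for lemmatizer.lemmatize(word, get_wordnet_pos(word))."""
    calls["count"] += 1
    return word.lower()  # placeholder logic, not real lemmatization

@lru_cache(maxsize=None)
def cached_lemmatize(word):
    """Compute the result once per distinct word and reuse it afterwards."""
    return slow_lemmatize(word)

tokens = "the edits the talk page the edits".split()
lemmas = [cached_lemmatize(t) for t in tokens]
print(lemmas, "expensive calls:", calls["count"])
```

Seven tokens trigger only four expensive calls here, one per distinct word. The actual speed-up on the full corpus was not measured.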


Lemmatization is complete. We can now compute **TF-IDF** on the lemmatized corpus to build the feature set.

In [8]:
display(data.head())

Unnamed: 0,text,toxic,lemm_text
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits make under my userna...
1,D'aww! He matches this background colour I'm s...,0,D aww He match this background colour I m seem...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I m really not try to edit war It s ju...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I can t make any real suggestion on impro...
4,"You, sir, are my hero. Any chance you remember...",0,You sir be my hero Any chance you remember wha...


## Training

We will build the TF-IDF matrix from the lemmatized corpus. First, however, we split the dataset into training and test sets in a 75:25 ratio, so that the vectorizer is fitted on the training part only.

In [9]:
r = 42  # random_state shared by the split and the models
train, test = train_test_split(data, test_size=0.25, random_state=r)
print(train.shape, test.shape)

(119678, 3) (39893, 3)


Separate the target from the features:

In [10]:
target_train = train['toxic']
target_test = test['toxic']
features_train = train['lemm_text'].astype('U').values
features_test = test['lemm_text'].astype('U').values
print(features_train.shape, features_test.shape, target_train.shape, target_test.shape)

(119678,) (39893,) (119678,) (39893,)


Create the vectorizer and pass it the stop words.

In [11]:
stopwords = set(nltk_stopwords.words('english'))
#count_tf_idf = TfidfVectorizer(stop_words=stopwords, ngram_range=(2,2)) 

In [12]:
count_tf_idf=TfidfVectorizer(stop_words=stopwords, lowercase=True, min_df=0.0001)
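A float `min_df` is read by scikit-learn as a share of documents, not an absolute count. The short check below works out what `min_df=0.0001` means for the 119,678 training documents reported above, assuming the usual rule that a term is kept when its document frequency is at least `min_df * n_docs`.

```python
import math

n_train_docs = 119678   # size of the training split reported above
min_df = 0.0001         # value passed to TfidfVectorizer

threshold = min_df * n_train_docs
min_docs = math.ceil(threshold)  # smallest integer document count that passes
print(f"threshold = {threshold:.4f}; a term must occur in at least {min_docs} documents")
```

Under that assumption, words that appear in fewer than 12 training comments are dropped from the vocabulary.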

Compute TF-IDF for the text corpus.

In [13]:
%%time
tfidf_features_train = count_tf_idf.fit_transform(features_train)
display(tfidf_features_train)
print(tfidf_features_train.shape)

<119678x13943 sparse matrix of type '<class 'numpy.float64'>'
	with 2931194 stored elements in Compressed Sparse Row format>

(119678, 13943)
CPU times: user 7.41 s, sys: 109 ms, total: 7.52 s
Wall time: 7.54 s


In [14]:
%%time
tfidf_features_test = count_tf_idf.transform(features_test)
print(tfidf_features_test.shape)

(39893, 13943)
CPU times: user 2.37 s, sys: 277 µs, total: 2.37 s
Wall time: 2.49 s
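To make the weights in these matrices concrete, here is a from-scratch sketch of the formula that scikit-learn documents for the default `TfidfVectorizer` settings: a smoothed inverse document frequency, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The toy corpus is made up, and the helper `tfidf` is illustrative only.

```python
import math
from collections import Counter

# Made-up, already tokenized toy corpus
docs = [["edit", "page", "edit"], ["talk", "page"], ["revert", "edit"]]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    tf = Counter(doc)
    raw = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document `edit` occurs twice and `page` once, and both appear in two documents, so `edit` receives twice the weight of `page` before normalization.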


The samples are now ready for training the different models.

In [15]:
pd.set_option('mode.chained_assignment', None)
warnings.filterwarnings("ignore")

# Helper that wraps a model in GridSearchCV to search for the best hyperparameters:
def gridsrch_model(model, params):
    grid_model = GridSearchCV(model, param_grid=params, scoring='f1', cv=3, verbose=100)  
    return grid_model
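In the cells above, the TF-IDF vocabulary is fitted on the whole training split before cross-validation, so every validation fold has already contributed to the IDF weights. One common way to avoid this is to place the vectorizer inside a `Pipeline`, which `GridSearchCV` then refits on each training fold. The sketch below shows the pattern on a made-up toy corpus. `toy_texts` and `toy_labels` are hypothetical, and the setup is not the one used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data, only to keep the example self-contained
toy_texts = ["you are an idiot", "thanks for the edit", "stupid idiot page",
             "nice work on the page", "what an idiot", "please see the talk page",
             "idiot and stupid", "good edit thanks", "stupid revert idiot"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1]

# The vectorizer is a pipeline step, so it is refitted on every training fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-name prefix "clf__" routes the parameter to the classifier
param_grid = {"clf__class_weight": ["balanced", None]}

grid = GridSearchCV(pipeline, param_grid=param_grid, scoring="f1", cv=3)
grid.fit(toy_texts, toy_labels)
print(grid.best_params_, round(grid.best_score_, 3))
```

With this layout `grid.predict` accepts raw strings, so the test set needs no separate `transform` call.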

- **'Logistic Regression'**

In [16]:
model_linear = LogisticRegression()
grid_params_linear = {'class_weight': ['balanced', None]}

grid_model = gridsrch_model(model_linear, grid_params_linear)

*CPU times: user 45.2 s, sys: 36.1 s, total: 1min 21s*  
*Wall time: 1min 21s*

In [18]:
%%time
# Fit the model
grid_model.fit(tfidf_features_train, target_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] class_weight=balanced ...........................................
[CV] ............... class_weight=balanced, score=0.723, total=  12.3s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   12.3s remaining:    0.0s
[CV] class_weight=balanced ...........................................
[CV] ............... class_weight=balanced, score=0.727, total=  14.3s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   26.6s remaining:    0.0s
[CV] class_weight=balanced ...........................................
[CV] ............... class_weight=balanced, score=0.728, total=  14.1s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   40.7s remaining:    0.0s
[CV] class_weight=None ...............................................
[CV] ................... class_weight=None, score=0.710, total=   8.5s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=None,
             param_grid={'class_weight': ['balanced', None]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1', verbose=100)

In [19]:
print('Best logistic regression model:')
print(grid_model.best_estimator_)
print('Best cross-validation F1: {:.4f}'.format(grid_model.best_score_))

Best logistic regression model:
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
Best cross-validation F1: 0.7260
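The search selected `class_weight='balanced'`. With this option scikit-learn weights each class by `n_samples / (n_classes * n_samples_in_class)`, so the rarer class gets the larger weight. The sketch below works through that formula. The label counts are made up for illustration, because the notebook does not report the class shares.

```python
from collections import Counter

# Made-up labels: 9 non-toxic comments for every toxic one (illustrative only)
labels = [0] * 90 + [1] * 10

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# 'balanced' heuristic: n_samples / (n_classes * number of samples in the class)
class_weights = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print(class_weights)
```

In this made-up case an error on a toxic comment costs nine times as much as an error on a non-toxic one.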


In [20]:
predictions = grid_model.predict(tfidf_features_test)
f1_test = f1_score(target_test, predictions)
print('F1 on the test set for the best logistic regression model: {:.4f}'.format(f1_test))

F1 on the test set for the best logistic regression model: 0.7318
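F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it from true-positive, false-positive and false-negative counts. `f1_from_labels` is a hypothetical helper written for illustration, and the labels are made up.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1) from two equal-length label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # without true positives the score is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # made-up labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # made-up predictions
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))
```

Here three toxic comments are caught, one is missed and one clean comment is flagged, so precision and recall are both 0.75 and so is F1.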


In [21]:
# Record the F1 value on the test set.
model_1 = 0.7318

- **'Decision tree'**

In [22]:
model_tree = DecisionTreeClassifier(random_state=r)
grid_params_tree = {'max_depth': range(1, 6)}

grid_model = gridsrch_model(model_tree, grid_params_tree)

*CPU times: user 23.2 s, sys: 33.5 ms, total: 23.3 s*  
*Wall time: 23.5 s*


In [23]:
%%time
# Fit the model
grid_model.fit(tfidf_features_train, target_train)

Fitting 3 folds for each of 5 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] max_depth=1 .....................................................
[CV] ......................... max_depth=1, score=0.349, total=   0.7s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.7s remaining:    0.0s
[CV] max_depth=1 .....................................................
[CV] ......................... max_depth=1, score=0.356, total=   0.6s
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.2s remaining:    0.0s
[CV] max_depth=1 .....................................................
[CV] ......................... max_depth=1, score=0.358, total=   0.6s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    1.8s remaining:    0.0s
[CV] max_depth=2 .....................................................
[CV] ......................... max_depth=2, score=0.410, total=   0.9s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed

GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=DecisionTreeClassifier(class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort=False, random_state=42,
                                              splitter='best'),
             iid='warn', n_jobs=None, param_grid={'max_depth': range(1, 6)},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f

In [26]:
print('Best decision tree model:')
print(grid_model.best_estimator_)
print('Best cross-validation F1: {:.4f}'.format(grid_model.best_score_))

Best decision tree model:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=42, splitter='best')
Best cross-validation F1: 0.5146


In [25]:
predictions = grid_model.predict(tfidf_features_test)
f1_test = f1_score(target_test, predictions)
print('F1 on the test set for the best decision tree: {:.4f}'.format(f1_test))

F1 on the test set for the best decision tree: 0.5402


In [27]:
# Record the F1 value on the test set.
model_2 = 0.5402

- **'Random forest'**

*CPU times: user 36.7 s, sys: 28.8 ms, total: 36.7 s*  
*Wall time: 37.2 s*

In [28]:
%%time
#model_forest = RandomForestClassifier(random_state=r, n_estimators=50, max_depth=10, class_weight='balanced')
#model_forest = RandomForestClassifier(random_state=r, n_estimators=100, max_depth=5, class_weight='balanced')
#model_forest = RandomForestClassifier(random_state=r, n_estimators=200, max_depth=5, class_weight='balanced')
#model_forest = RandomForestClassifier(random_state=r, n_estimators=200, max_depth=10, class_weight='balanced')
#model_forest = RandomForestClassifier(random_state=r, n_estimators=200, max_depth=50, class_weight='balanced')
model_forest = RandomForestClassifier(random_state=r, class_weight='balanced')
model_forest.fit(tfidf_features_train, target_train)

CPU times: user 36.7 s, sys: 28.8 ms, total: 36.7 s
Wall time: 37.2 s


RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=10, n_jobs=None, oob_score=False,
                       random_state=42, verbose=0, warm_start=False)

In [29]:
predictions = model_forest.predict(tfidf_features_test)
f1_test = f1_score(target_test, predictions)
print('F1 on the test set for the random forest: {:.4f}'.format(f1_test))

F1 on the test set for the random forest: 0.6505


In [31]:
# Record the F1 value on the test set.
model_3 = 0.6505

- **'CatBoostClassifier'**

*CPU times: user 55min 11s, sys: 1min 13s, total: 56min 24s*  
*Wall time: 56min 40s*

In [32]:
%%time
model_catboost =  CatBoostClassifier(random_state=r)
model_catboost.fit(tfidf_features_train, target_train)

Learning rate set to 0.064894
0:	learn: 0.6256163	total: 4.21s	remaining: 1h 10m 9s
1:	learn: 0.5678953	total: 7.61s	remaining: 1h 3m 18s
2:	learn: 0.5175887	total: 11s	remaining: 1h 59s
3:	learn: 0.4759956	total: 14.4s	remaining: 59m 49s
4:	learn: 0.4409404	total: 17.8s	remaining: 59m 4s
5:	learn: 0.4109206	total: 21.2s	remaining: 58m 34s
6:	learn: 0.3854108	total: 24.7s	remaining: 58m 24s
7:	learn: 0.3643473	total: 28.1s	remaining: 58m 5s
8:	learn: 0.3453903	total: 31.5s	remaining: 57m 49s
9:	learn: 0.3294656	total: 34.9s	remaining: 57m 35s
10:	learn: 0.3158190	total: 38.3s	remaining: 57m 24s
11:	learn: 0.3039829	total: 41.8s	remaining: 57m 22s
12:	learn: 0.2937493	total: 45.4s	remaining: 57m 27s
13:	learn: 0.2848313	total: 48.9s	remaining: 57m 24s
14:	learn: 0.2761583	total: 52.6s	remaining: 57m 34s
15:	learn: 0.2693633	total: 56.2s	remaining: 57m 36s
16:	learn: 0.2638044	total: 59.6s	remaining: 57m 27s
17:	learn: 0.2588968	total: 1m 3s	remaining: 57m 22s
18:	learn: 0.2545263	total:

<catboost.core.CatBoostClassifier at 0x7f4eb67cdf10>

In [33]:
predictions = model_catboost.predict(tfidf_features_test)
f1_test = f1_score(target_test, predictions)
print('F1 on the test set for CatBoostClassifier: {:.4f}'.format(f1_test))

F1 on the test set for CatBoostClassifier: 0.7519


In [35]:
model_4 = 0.7519

- **LightGBM gradient boosting:**

In [36]:
model_lgbm = lgb.LGBMClassifier(boosting_type='gbdt', random_state=r)

*CPU times: user 10min 4s, sys: 2.33 s, total: 10min 7s*  
*Wall time: 10min 12s*

In [37]:
%%time
# Fit the model
model_lgbm.fit(tfidf_features_train, target_train)

CPU times: user 10min 4s, sys: 2.33 s, total: 10min 7s
Wall time: 10min 12s


LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, silent=True,
               subsample=1.0, subsample_for_bin=200000, subsample_freq=0)

In [38]:
predictions = model_lgbm.predict(tfidf_features_test)
f1_test = f1_score(target_test, predictions)
print('F1 on the test set for LGBMClassifier: {:.4f}'.format(f1_test))

F1 on the test set for LGBMClassifier: 0.7451


In [39]:
model_5 = 0.7451

## Conclusions

Collect the F1 values obtained on the test data into a Series.

In [41]:
index = ['logistic_regression', 'decision_tree', 'random_forest', 'catboost','lgbmboost']
table_f1 = pd.Series([model_1, model_2, model_3, model_4, model_5], index=index)
table_f1.sort_values(ascending=False)

catboost               0.7519
lgbmboost              0.7451
logistic_regression    0.7318
random_forest          0.6505
decision_tree          0.5402
dtype: float64

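CatBoostClassifier achieved the best result, with a test F1 of 0.7519. It is the only model that reaches the required threshold of 0.75, and it is also the slowest to train (about 57 minutes of wall time). LightGBM (0.7451) and logistic regression (0.7318) fall just short of the threshold, while the random forest and the decision tree are far below it.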
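The comparison against the 0.75 target can also be checked programmatically. The sketch below uses the test-set F1 values recorded in this notebook. The names `TARGET_F1`, `f1_by_model` and `passing` are introduced here for illustration only.

```python
TARGET_F1 = 0.75  # threshold from the task statement

# Test-set F1 values recorded in the notebook
f1_by_model = {
    "logistic_regression": 0.7318,
    "decision_tree": 0.5402,
    "random_forest": 0.6505,
    "catboost": 0.7519,
    "lgbmboost": 0.7451,
}

# Models whose test F1 reaches the target
passing = [name for name, f1 in f1_by_model.items() if f1 >= TARGET_F1]

# Print the models from best to worst with a pass/fail note
for name, f1 in sorted(f1_by_model.items(), key=lambda kv: kv[1], reverse=True):
    status = "meets target" if f1 >= TARGET_F1 else "below target"
    print(f"{name:<20} {f1:.4f}  {status}")
```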