<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Вывод" data-toc-modified-id="Вывод-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Summary</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Лемматизация-и-подготовка-к-обучению" data-toc-modified-id="Лемматизация-и-подготовка-к-обучению-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Lemmatization and preparation for training</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#LGBMClassifier" data-toc-modified-id="LGBMClassifier-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>LGBMClassifier</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labelled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.


**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
pd.options.mode.chained_assignment = None

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [3]:
df = pd.read_csv("/datasets/toxic_comments.csv")
df.info()
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0
...,...,...
159566,""":::::And for the second time of asking, when ...",0
159567,You should be ashamed of yourself \n\nThat is ...,0
159568,"Spitzer \n\nUmm, theres no actual article for ...",0
159569,And it looks like it was actually you who put ...,0


In [4]:
df.duplicated().sum()

0

In [5]:
df["toxic"].mean()

0.10167887648758234

### Summary
The data contains no missing values and no duplicate rows.
10.17% of the comments in the dataset are toxic, so the classes are imbalanced.
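
With only about 10% toxic comments, accuracy is a misleading yardstick, which is presumably why the task asks for F1. The sketch below uses made-up labels (not the real dataset) and two hypothetical helper functions to show how a classifier that never flags anything scores on both metrics.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Share of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical sample: 100 comments, 10 of them toxic (label 1).
y_true = [1] * 10 + [0] * 90
always_clean = [0] * 100                      # never flags a comment
catches_half = [1] * 5 + [0] * 5 + [0] * 90   # finds 5 of the 10 toxic ones

print(accuracy(y_true, always_clean), f1_binary(y_true, always_clean))
print(accuracy(y_true, catches_half), f1_binary(y_true, catches_half))
```

The do-nothing classifier reaches 90% accuracy with an F1 of zero, while the second one is rewarded by F1 only in proportion to the toxic comments it actually finds.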

## Training

### Lemmatization and preparation for training

In [6]:
def ft_clear(text):
    cleaned = re.sub(r"[^a-zA-Z\' ]", ' ', text)
    return " ".join(cleaned.split())
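
To make the effect of the cleaning step concrete, here is the same function applied to a made-up string (the sample text is an assumption, not a row from the dataset). Everything except Latin letters, apostrophes and spaces is replaced by a space, and runs of whitespace are then collapsed.

```python
import re


def ft_clear(text):
    # Same logic as the cell above: keep Latin letters, apostrophes and spaces.
    cleaned = re.sub(r"[^a-zA-Z' ]", " ", text)
    # split() + join() collapses repeated whitespace and trims the ends.
    return " ".join(cleaned.split())


sample = "D'aww!\nHe matches 123 this   colour."
print(ft_clear(sample))
```

Note that letter case is preserved and digits are dropped entirely, so a comment made up only of digits and punctuation becomes an empty string.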

In [7]:
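lemmatizer = WordNetLemmatizer()
def ft_lemmatizer(text):
    # lemmatize() is called without a POS tag, so every token is treated as a noun:
    # "suggestions" -> "suggestion", but also "was" -> "wa" (see the output below).
    words_list = nltk.word_tokenize(text)
    return ' '.join([lemmatizer.lemmatize(word) for word in words_list])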

In [8]:
df['lem_nltk'] = df['text'].apply(ft_clear)

In [9]:
df['lem_nltk'] = df['lem_nltk'].apply(ft_lemmatizer)
df

Unnamed: 0,text,toxic,lem_nltk
0,Explanation\nWhy the edits made under my usern...,0,Explanation Why the edits made under my userna...
1,D'aww! He matches this background colour I'm s...,0,D'aww He match this background colour I 'm see...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man I 'm really not trying to edit war It ...
3,"""\nMore\nI can't make any real suggestions on ...",0,More I ca n't make any real suggestion on impr...
4,"You, sir, are my hero. Any chance you remember...",0,You sir are my hero Any chance you remember wh...
...,...,...,...
159566,""":::::And for the second time of asking, when ...",0,And for the second time of asking when your vi...
159567,You should be ashamed of yourself \n\nThat is ...,0,You should be ashamed of yourself That is a ho...
159568,"Spitzer \n\nUmm, theres no actual article for ...",0,Spitzer Umm there no actual article for prosti...
159569,And it looks like it was actually you who put ...,0,And it look like it wa actually you who put on...


In [10]:
stopwords = set(stopwords.words('english'))

In [11]:
# the notebook crashes after this line
#corpus = df['lem_nltk'].values.astype('U')

In [12]:
target = df['toxic']

In [13]:
X_train, X_test, y_train, y_test = train_test_split(df['lem_nltk'], target, test_size=0.25, random_state=42)

In [14]:
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)

X_train: (119678,)
X_test: (39893,)


In [15]:
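# stop_words is passed as a list: some scikit-learn versions validate this parameter and reject a set
count_vect = TfidfVectorizer(stop_words=list(stopwords))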
tf_idf  = count_vect.fit_transform(X_train) 
tf_idf.shape

(119678, 138815)

In [16]:
tf_idf_test = count_vect.transform(X_test) 
tf_idf_test.shape

(39893, 138815)
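
Both matrices have 138815 columns because the vocabulary and IDF weights are learned from `X_train` only and then reused for `X_test`. The sketch below reproduces that fit/transform split in plain Python on toy sentences. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn documents as its defaults, and it tokenises by whitespace, which is simpler than what `TfidfVectorizer` actually does. The helper names `fit_idf` and `transform` are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.lower().split()))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def transform(doc, idf):
    """Map a document to an L2-normalised {term: weight} dict.

    Terms that were not seen during fitting are dropped.
    """
    counts = Counter(t for t in doc.lower().split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


train = ["you are my hero", "you are wrong", "thanks for the edit"]
idf = fit_idf(train)
vec = transform("you are a troll", idf)  # 'a' and 'troll' are unseen
print(sorted(vec))
```

Words that appear only in the test text contribute nothing, which is the intended behaviour: the test set must not influence the features the model was trained on.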

### Logistic Regression

In [17]:
%%time
LR = LogisticRegression(random_state=42)
LR.fit(tf_idf, y_train)
predicted = LR.predict(tf_idf_test)
f1 = f1_score(y_test, predicted)
print(f"F1 on the test set: {f1:.3f}")

F1 on the test set: 0.734
CPU times: user 24 s, sys: 27.4 s, total: 51.5 s
Wall time: 51.5 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [18]:
%%time
LR = LogisticRegression(random_state=42)
params = {"class_weight": ["balanced", None], 'C':[1,10,20]}

grid = GridSearchCV(LR, params,
                  cv=3,
                  scoring='f1',
                  n_jobs=-1,
                  verbose=2)
grid.fit(tf_idf, y_train)

Fitting 3 folds for each of 6 candidates, totalling 18 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


[CV] END .........................C=1, class_weight=balanced; total time=  49.5s


[CV] END .........................C=1, class_weight=balanced; total time=  49.4s
[CV] END .........................C=1, class_weight=balanced; total time=  29.9s
[CV] END .............................C=1, class_weight=None; total time=  47.6s
[CV] END .............................C=1, class_weight=None; total time=  47.7s
[CV] END .............................C=1, class_weight=None; total time=  45.8s


[CV] END ........................C=10, class_weight=balanced; total time=  46.0s


[CV] END ........................C=10, class_weight=balanced; total time=  46.4s


[CV] END ........................C=10, class_weight=balanced; total time=  46.0s


[CV] END ............................C=10, class_weight=None; total time=  45.2s


[CV] END ............................C=10, class_weight=None; total time=  46.3s


[CV] END ............................C=10, class_weight=None; total time=  46.3s


[CV] END ........................C=20, class_weight=balanced; total time=  44.3s


[CV] END ........................C=20, class_weight=balanced; total time=  46.8s


[CV] END ........................C=20, class_weight=balanced; total time=  46.0s


[CV] END ............................C=20, class_weight=None; total time=  44.0s


[CV] END ............................C=20, class_weight=None; total time=  46.4s


[CV] END ............................C=20, class_weight=None; total time=  46.9s
CPU times: user 6min 38s, sys: 7min 55s, total: 14min 33s
Wall time: 14min 34s


GridSearchCV(cv=3, estimator=LogisticRegression(random_state=42), n_jobs=-1,
             param_grid={'C': [1, 10, 20], 'class_weight': ['balanced', None]},
             scoring='f1', verbose=2)

In [19]:
print("grid.best_score_ f1:", grid.best_score_)
print("grid.best_params_", grid.best_params_)

grid.best_score_ f1: 0.7574474369288406
grid.best_params_ {'C': 10, 'class_weight': 'balanced'}
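
The search picked `class_weight='balanced'`. scikit-learn documents that setting as weighting each class by `n_samples / (n_classes * count_of_class)`. The sketch below applies the formula to the training-set class counts reported in the LightGBM log further down. The helper name is hypothetical, and inside the grid search the weights are computed per fold, so the exact numbers there differ slightly.

```python
def balanced_class_weights(class_counts):
    """Weights of the form n_samples / (n_classes * count_of_class)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Training-set class counts as reported in the LightGBM log (0 = clean, 1 = toxic).
counts = {0: 107512, 1: 12166}
weights = balanced_class_weights(counts)
print(weights)
```

Each toxic comment ends up weighted roughly nine times as much as a clean one, so both classes carry the same total weight in the loss.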


In [20]:
predicted = grid.predict(tf_idf_test)
f1 = f1_score(y_test, predicted)
print(f"F1 on the test set: {f1:.3f}")

F1 on the test set: 0.767


### LGBMClassifier

In [21]:
%%time
LGBM = LGBMClassifier(random_state=42)
LGBM.fit(tf_idf, y_train)
predicted = LGBM.predict(tf_idf_test)
f1 = f1_score(y_test, predicted)
print(f"F1 on the test set: {f1:.3f}")

F1 on the test set: 0.751
CPU times: user 2min 44s, sys: 0 ns, total: 2min 44s
Wall time: 2min 45s


In [22]:
%%time
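# Note: 'f1' is not one of LightGBM's built-in metric names, and eval_set is only used for
# monitoring here (no early stopping), so the fitted model should match the default one above.
LGBM = LGBMClassifier(metric='f1', verbose=10, n_jobs=-1, random_state=42)
LGBM.fit(tf_idf, y_train, eval_set=[(tf_idf_test, y_test)], eval_metric='f1')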

[LightGBM] [Info] Number of positive: 12166, number of negative: 107512
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.997741
[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.997741
[LightGBM] [Debug] init for col-wise cost 32.804062 seconds, init for row-wise cost 33.513323 seconds
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 579093
[LightGBM] [Info] Number of data points in the train set: 119678, number of used features: 10968
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.101656 -> initscore=-2.178957
[LightGBM] [Info] Start training from score -2.178957
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 30
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 30
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 30
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 30
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 30
[LightGBM] [De

LGBMClassifier(metric='f1', random_state=42, verbose=10)

In [23]:
predicted = LGBM.predict(tf_idf_test)
f1 = f1_score(y_test, predicted)
print(f"F1 on the test set: {f1:.3f}")

F1 on the test set: 0.751


In [24]:
%%time
LGBM = LGBMClassifier(random_state=42)
params = {
        'learning_rate':[0.02,0.1],
        'num_leaves':[50, 100],
        'max_depth': [6, 12]
        }

grid = GridSearchCV(LGBM, params,
                  cv=3,
                  scoring='f1',
                  n_jobs=-1,
                  verbose=2)
grid.fit(tf_idf, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] END .....learning_rate=0.02, max_depth=6, num_leaves=50; total time= 1.1min
[CV] END .....learning_rate=0.02, max_depth=6, num_leaves=50; total time=  51.2s
[CV] END .....learning_rate=0.02, max_depth=6, num_leaves=50; total time=  50.9s
[CV] END ....learning_rate=0.02, max_depth=6, num_leaves=100; total time=  52.1s
[CV] END ....learning_rate=0.02, max_depth=6, num_leaves=100; total time=  50.0s
[CV] END ....learning_rate=0.02, max_depth=6, num_leaves=100; total time=  50.5s
[CV] END ....learning_rate=0.02, max_depth=12, num_leaves=50; total time= 1.4min
[CV] END ....learning_rate=0.02, max_depth=12, num_leaves=50; total time= 1.4min
[CV] END ....learning_rate=0.02, max_depth=12, num_leaves=50; total time= 1.4min
[CV] END ...learning_rate=0.02, max_depth=12, num_leaves=100; total time= 1.7min
[CV] END ...learning_rate=0.02, max_depth=12, num_leaves=100; total time= 1.7min
[CV] END ...learning_rate=0.02, max_depth=12, num

GridSearchCV(cv=3, estimator=LGBMClassifier(random_state=42), n_jobs=-1,
             param_grid={'learning_rate': [0.02, 0.1], 'max_depth': [6, 12],
                         'num_leaves': [50, 100]},
             scoring='f1', verbose=2)

In [25]:
print("grid.best_score_ f1:", grid.best_score_)
print("grid.best_params_", grid.best_params_)

grid.best_score_ f1: 0.6921929974687394
grid.best_params_ {'learning_rate': 0.1, 'max_depth': 12, 'num_leaves': 50}
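
The "8 candidates, 24 fits" line in the log follows directly from the grid: every combination of the three parameters is tried on each of the 3 folds. The sketch below enumerates the combinations and also flags a likely redundancy. A binary tree of depth `d` has at most `2**d` leaves, so, assuming LightGBM enforces both limits, `num_leaves=100` cannot be reached when `max_depth=6`. That would be consistent with the near-identical fold times for `num_leaves=50` and `num_leaves=100` at `max_depth=6` in the log. The `leaf_cap` key is a name introduced here for illustration.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.02, 0.1],
    "num_leaves": [50, 100],
    "max_depth": [6, 12],
}
cv_folds = 3

# Cartesian product of all parameter values, one dict per candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * cv_folds
print(len(candidates), "candidates,", total_fits, "fits")

# A depth limit of d allows at most 2**d leaves, which can make a larger
# num_leaves setting unreachable.
for params in candidates:
    params["leaf_cap"] = min(params["num_leaves"], 2 ** params["max_depth"])

capped = [p for p in candidates if p["leaf_cap"] < p["num_leaves"]]
print(capped)
```

Dropping such combinations, or widening the grid in a direction that is not capped, would spend the same compute budget on more distinct models.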


In [26]:
predicted = grid.predict(tf_idf_test)
f1 = f1_score(y_test, predicted)
print(f"F1 on the test set: {f1:.3f}")

F1 on the test set: 0.709


## Conclusions

Data preprocessing included:
- cleaning and lemmatizing the texts with nltk;
- converting the corpus to TF-IDF vectors with TfidfVectorizer, which also dropped the nltk English stop words.

LogisticRegression achieved the best F1 on the test set: 0.767 (with C=10 and class_weight='balanced').

LGBMClassifier with default parameters also meets the requirement, with an F1 of 0.751 on the test set. The grid-searched LGBMClassifier scored lower, at 0.709.

LogisticRegression also took less time to train than LGBMClassifier: 14min 33s versus 31min 45s.

For classifying comments by toxicity, I recommend the LogisticRegression model.