# Project for "Wikishop"

The online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset with toxicity labels for the edits is available.

Build a model with an *F1* score of at least 0.75.
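
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand; the labels are made up for illustration and do not come from the dataset.

```python
# Made-up labels for illustration only (1 = toxic, 0 = not toxic).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # toxic, flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # normal, flagged by mistake
fn = sum(t == 1 and p == 0 for t, p in pairs)  # toxic, missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

`sklearn.metrics.f1_score(y_true, y_pred)` computes the same quantity and is what the notebook uses later.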


## Preparation

In [1]:
import re
import time

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import seaborn as sns
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from tqdm.notebook import tqdm

# POS tagger data used by nltk.pos_tag during lemmatization
nltk.download('averaged_perceptron_tagger')
# Enable Series.progress_apply
tqdm.pandas()

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')
data

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [4]:
data.isna().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

In [5]:
data.duplicated().sum()

0

In [6]:
data['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

In [7]:
data['toxic'].value_counts().loc[0] / len(data) * 100

89.83878663084147

An initial look at the data shows 159,292 comments with no missing values and no duplicates. I checked the target: assuming 0 marks a normal comment and 1 a toxic one, about 10% of the comments are toxic (89.8% are labelled 0), so the classes are imbalanced.
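
This imbalance is why accuracy would be a misleading metric here. As a back-of-the-envelope check using the counts above, a hypothetical baseline that labels every comment as non-toxic already reaches about 89.8% accuracy while catching no toxic comments at all:

```python
# Class counts taken from the value_counts() output above.
n_normal, n_toxic = 143106, 16186
n_total = n_normal + n_toxic

# Hypothetical baseline: predict 0 (non-toxic) for every comment.
tp, fp, fn, tn = 0, 0, n_toxic, n_normal

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)
# Precision is 0/0 when nothing is predicted positive;
# F1 is conventionally reported as 0 in that case.
f1 = 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

print(f"accuracy = {accuracy:.4f}, recall = {recall:.1f}, F1 = {f1:.1f}")
```

F1 on the toxic class penalizes such a model, which fits the moderation task better.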

In [8]:
data['text'] = data['text'].str.lower() 

In [9]:
%%time
lemmatizer = WordNetLemmatizer()
# Map the first letter of a Penn Treebank tag to the matching WordNet POS
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}


def lemmatize_words(text):
    """Lemmatize whitespace-separated tokens using their POS tags."""
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join(
        # Tags outside the map fall back to NOUN
        lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN))
        for word, pos in pos_tagged_text
    )


data["text"] = data["text"].progress_apply(lemmatize_words)
data.head()


CPU times: user 7min 58s, sys: 6.01 s, total: 8min 4s
Wall time: 8min 18s


,Unnamed: 0,text,toxic
0,0,explanation why the edits make under my userna...,0
1,1,d'aww! he match this background colour i'm see...,0
2,2,"hey man, i'm really not try to edit war. it's ...",0
3,3,""" more i can't make any real suggestion on imp...",0
4,4,"you, sir, be my hero. any chance you remember ...",0


In [10]:
def clear_text(text):
    """Keep only Latin letters and collapse runs of whitespace."""
    cleaned = re.sub(r'[^a-zA-Z]', ' ', text)
    return " ".join(cleaned.split())

In [11]:
%%time
data['text'] = data['text'].apply(clear_text)

CPU times: user 4 s, sys: 24 ms, total: 4.03 s
Wall time: 4.04 s


In [12]:
data.head()

,Unnamed: 0,text,toxic
0,0,explanation why the edits make under my userna...,0
1,1,d aww he match this background colour i m seem...,0
2,2,hey man i m really not try to edit war it s ju...,0
3,3,more i can t make any real suggestion on impro...,0
4,4,you sir be my hero any chance you remember wha...,0
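
To make the effect of the cleaning step concrete, here is a small self-contained sketch on a made-up string. The regular expression is the one used in `clear_text`; the name `strip_non_letters` and the sample sentence are illustrative only. Apostrophes are replaced by spaces, which is why contractions such as `i'm` appear as `i m` in the table above.

```python
import re


def strip_non_letters(text: str) -> str:
    """Replace every non-Latin-letter character with a space, then collapse whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(letters_only.split())


sample = "hey man, i'm really not try to edit war. it's 2 late!!"
print(strip_non_letters(sample))
# prints: hey man i m really not try to edit war it s late
```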


In [13]:
data_train, data_test = train_test_split(data, test_size=0.1, random_state=12345)
print(data_train.shape)
print(data_test.shape)

(143362, 3)
(15930, 3)


In [14]:
features_train = data_train.drop(['toxic'], axis=1)
target_train = data_train['toxic']
features_test = data_test.drop(['toxic'], axis=1)
target_test = data_test['toxic']

I preprocessed the data: lemmatized the text and stripped non-letter characters. I then split the dataset into two samples, train and test, in a 90/10 ratio. A separate validation sample is not needed because the models are tuned with cross-validation.
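
Because only about 10% of the comments are toxic, one option worth considering is a stratified split, which keeps the class ratio the same in both samples. The split above does not use it; the sketch below only illustrates the idea on a made-up frame, with the `stratify` argument of `train_test_split` as the single change.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with the same ~10% positive rate as the real data.
toy = pd.DataFrame({
    'text': [f'comment {i}' for i in range(200)],
    'toxic': [1 if i % 10 == 0 else 0 for i in range(200)],
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.1,
    random_state=12345,
    stratify=toy['toxic'],  # preserve the class ratio in both parts
)

print(toy_train['toxic'].mean(), toy_test['toxic'].mean())
```

With stratification both parts keep the 10% share of positives, so a small test sample cannot end up with unusually few toxic comments by chance.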

## Training

In [15]:
%%time
lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression()),
])

parameters = {
    'clf__solver': ['liblinear', 'lbfgs'],
    'clf__C': [1, 10, 100],
    'clf__class_weight': ['balanced'],
}

lr_grid = GridSearchCV(estimator=lr_pipeline, scoring='f1', cv=3, n_jobs=-1, param_grid=parameters)
lr_grid.fit(features_train['text'], target_train)
lr_best_parameters = lr_grid.best_params_
print('Best parameters for logistic regression', lr_best_parameters)
print('Best score for logistic regression', lr_grid.best_score_.round(2))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best parameters for logistic regression {'clf__C': 10, 'clf__class_weight': 'balanced', 'clf__solver': 'liblinear'}
Best score for logistic regression 0.76
CPU times: user 7min, sys: 5min 37s, total: 12min 38s
Wall time: 12min 39s
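
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` warning is raised when the `lbfgs` solver stops at the default `max_iter=100` of `LogisticRegression` before converging. One possible remedy, not applied in the run above, is to raise `max_iter`. The sketch below shows the same pipeline shape on a tiny made-up corpus; the value 1000 and the toy texts are assumptions for illustration, and the scores it produces say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (1 = toxic, 0 = not toxic), for illustration only.
texts = [
    'you are an idiot', 'thank you for the edit', 'stupid idiot go away',
    'nice article well done', 'what a stupid change', 'great work thanks',
    'idiot editor ruined it', 'thanks for the helpful fix',
    'go away stupid troll', 'well written and helpful',
    'you stupid troll', 'good edit thank you',
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(sublinear_tf=True)),
    # max_iter=1000 is an assumed value; the library default is 100.
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced')),
])

grid = GridSearchCV(
    pipeline,
    param_grid={'clf__solver': ['liblinear', 'lbfgs'], 'clf__C': [1, 10]},
    scoring='f1',
    cv=3,
)
grid.fit(texts, labels)
print(grid.best_params_)
```

In this notebook the best configuration used `liblinear`, so the warning affected only the `lbfgs` candidates; a higher `max_iter` would mainly make the comparison between the two solvers fairer.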


In [16]:
%%time
# Train CatBoost on TF-IDF features
cat_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('cat', CatBoostClassifier()),
])

parameters = {
    'cat__n_estimators': [50, 100],
    'cat__max_depth': [5, 7, 10],
}
model_cat = GridSearchCV(cat_pipeline,
                         parameters,
                         scoring='f1',
                         cv=3,
                         n_jobs=-1,
                         verbose=10)
model_cat.fit(features_train['text'], target_train)
print('Best parameters for CatBoost', model_cat.best_params_)
print('Best score for CatBoost', model_cat.best_score_.round(2))

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV 1/3; 1/6] START cat__max_depth=5, cat__n_estimators=50......................
Learning rate set to 0.5
0:	learn: 0.3575457	total: 1.48s	remaining: 1m 12s
1:	learn: 0.2726031	total: 2.74s	remaining: 1m 5s
2:	learn: 0.2443653	total: 3.98s	remaining: 1m 2s
3:	learn: 0.2314736	total: 5.25s	remaining: 1m
4:	learn: 0.2227647	total: 6.44s	remaining: 58s
5:	learn: 0.2130087	total: 7.64s	remaining: 56s
6:	learn: 0.2059945	total: 8.84s	remaining: 54.3s
7:	learn: 0.2018987	total: 10s	remaining: 52.6s
8:	learn: 0.1981358	total: 11.2s	remaining: 51s
9:	learn: 0.1951909	total: 12.3s	remaining: 49.4s
10:	learn: 0.1919790	total: 13.5s	remaining: 48s
11:	learn: 0.1890610	total: 14.8s	remaining: 46.9s
12:	learn: 0.1859762	total: 16s	remaining: 45.6s
13:	learn: 0.1832799	total: 17.2s	remaining: 44.1s
14:	learn: 0.1808463	total: 18.3s	remaining: 42.8s
15:	learn: 0.1789310	total: 19.5s	remaining: 41.4s
16:	learn: 0.1769437	total: 20.6s	remainin

In [17]:
%%time
# Train LightGBM on TF-IDF features
LGBM_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('lgbm', LGBMClassifier(random_state=12345)),
])
parameters = {
    'lgbm__n_estimators': [50, 100],
    'lgbm__max_depth': [5, 7, 10],
}
model_LGBM = GridSearchCV(LGBM_pipeline,
                          parameters,
                          scoring='f1',
                          cv=3,
                          n_jobs=-1,
                          verbose=10)
model_LGBM.fit(features_train['text'], target_train)
print('Best parameters for LGBM', model_LGBM.best_params_)
print('Best score for LGBM', model_LGBM.best_score_.round(2))

Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV 1/3; 1/6] START lgbm__max_depth=5, lgbm__n_estimators=50....................
[CV 1/3; 1/6] END ..lgbm__max_depth=5, lgbm__n_estimators=50; total time= 3.4min
[CV 2/3; 1/6] START lgbm__max_depth=5, lgbm__n_estimators=50....................
[CV 2/3; 1/6] END ..lgbm__max_depth=5, lgbm__n_estimators=50; total time= 5.1min
[CV 3/3; 1/6] START lgbm__max_depth=5, lgbm__n_estimators=50....................
[CV 3/3; 1/6] END ..lgbm__max_depth=5, lgbm__n_estimators=50; total time= 3.4min
[CV 1/3; 2/6] START lgbm__max_depth=5, lgbm__n_estimators=100...................
[CV 1/3; 2/6] END .lgbm__max_depth=5, lgbm__n_estimators=100; total time= 5.0min
[CV 2/3; 2/6] START lgbm__max_depth=5, lgbm__n_estimators=100...................
[CV 2/3; 2/6] END .lgbm__max_depth=5, lgbm__n_estimators=100; total time= 2.8min
[CV 3/3; 2/6] START lgbm__max_depth=5, lgbm__n_estimators=100...................
[CV 3/3; 2/6] END .lgbm__max_depth=5, lgbm__n_est

## Conclusions

During this work I explored the data, lowercased and lemmatized the text, and removed non-letter characters. I then trained three models: logistic regression, LightGBM and CatBoost. Logistic regression achieved the highest cross-validated F1 (0.76), CatBoost was slightly lower (0.75), and LightGBM was the lowest (0.68). Let's check logistic regression on the test set.

In [20]:
predictions_test = lr_grid.predict(features_test['text'])
model_lr_grid_score = f1_score(target_test, predictions_test)
print('F1 for logistic regression on the test set', model_lr_grid_score.round(2))

F1 for logistic regression on the test set 0.77


In [21]:
result = {
    'Model': ['LogisticRegression', 'LGBMClassifier', 'CatBoostClassifier'],
    'Best CV F1': pd.Series([0.76, 0.68, 0.75]),
    # Only the best model was evaluated on the test set
    'Test F1 (best model)': pd.Series([0.77, None, None]),
}
display(pd.DataFrame(result))

,Model,Best CV F1,Test F1 (best model)
0,LogisticRegression,0.76,0.77
1,LGBMClassifier,0.68,NaN
2,CatBoostClassifier,0.75,NaN
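
Rather than typing the scores by hand, the summary could be assembled from the fitted search objects, which avoids transcription slips. A minimal sketch, assuming each entry is a fitted `GridSearchCV` such as `lr_grid`, `model_cat` or `model_LGBM` above; the helper `summarize` is hypothetical, and the toy search exists only to make the example runnable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def summarize(searches):
    """Build a table of best cross-validated scores.

    `searches` maps a display name to a fitted GridSearchCV (hypothetical helper).
    """
    rows = [
        {
            'Model': name,
            'Best CV F1': round(float(search.best_score_), 2),
            'Best params': search.best_params_,
        }
        for name, search in searches.items()
    ]
    return pd.DataFrame(rows)


# Toy stand-in for the fitted searches in this notebook.
X, y = make_classification(n_samples=200, random_state=12345)
toy_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [1, 10]},
    scoring='f1',
    cv=3,
).fit(X, y)

summary = summarize({'LogisticRegression': toy_search})
print(summary)
```

In the notebook this would be called as `summarize({'LogisticRegression': lr_grid, 'CatBoostClassifier': model_cat, 'LGBMClassifier': model_LGBM})`.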


Logistic regression scores F1 = 0.77 on the test set, which is above the required 0.75, so the goal has been met.