# Project for Wikishop

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which the edits are labelled for toxicity.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* for this project is optional, but you are welcome to try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import pandas as pd
import numpy as np
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.ensemble import AdaBoostClassifier


import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score, accuracy_score

from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.io as pio
pio.templates.default = 'seaborn'

import torch
import transformers
from tqdm import notebook

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
data = pd.read_csv('/datasets/toxic_comments.csv')
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


Let's look at the class ratio.

In [3]:
data['toxic'].value_counts(normalize=True)

0    0.898321
1    0.101679
Name: toxic, dtype: float64

The share of toxic comments is small (about 10%), so the classes are imbalanced.

Next steps:
- Text preprocessing (removing stop words and extraneous characters)
- Tokenization
- Lemmatization (reducing each word to its base form)
- Vectorizing the corpus

In [4]:

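lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Split the text into tokens, lemmatize each one and join them back with spaces.
    # Without a POS tag, WordNetLemmatizer treats every token as a noun.
    word_list = nltk.word_tokenize(text)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lemmatized_output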

In [5]:
def text_cleaner(text):
    # Replace everything except Latin letters and spaces with a space
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    return " ".join(text.split())
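
To see what this cleaning step does in practice, here is a small standalone sketch that re-declares `text_cleaner` and runs it on a made-up sample string (the sample is an assumption, not a row from the dataset). Apostrophes become spaces, so contractions split into fragments, which is consistent with tokens such as `weren t` and `don t` in the lemmatized corpus printed further down.

```python
import re

def text_cleaner(text):
    # Replace every character that is not a Latin letter or a space with a space
    text = re.sub(r'[^a-zA-Z ]', ' ', text)
    # Collapse runs of whitespace into single spaces
    return " ".join(text.split())

# Hypothetical sample with punctuation, a newline, digits and an IP address
sample = "They weren't vandalisms!\nSee 89.205.38.27 (talk) 21:51"
cleaned = text_cleaner(sample)
print(cleaned)  # They weren t vandalisms See talk
```

The case of the letters is left untouched here; `TfidfVectorizer` lowercases tokens by default at the vectorization step.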

In [6]:
corpus = data['text'].values.astype('U')


In [7]:

corpus[0:5]

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)",
       "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",
       '"\nMore\nI can\'t make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences f

In [8]:
lemmatized_corpus = [lemmatize(text_cleaner(text)) for text in corpus]

In [9]:
lemmatized_corpus[0:5]

['Explanation Why the edits made under my username Hardcore Metallica Fan were reverted They weren t vandalism just closure on some GAs after I voted at New York Dolls FAC And please don t remove the template from the talk page since I m retired now',
 'D aww He match this background colour I m seemingly stuck with Thanks talk January UTC',
 'Hey man I m really not trying to edit war It s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page He seems to care more about the formatting than the actual info',
 'More I can t make any real suggestion on improvement I wondered if the section statistic should be later on or a subsection of type of accident I think the reference may need tidying so that they are all in the exact same format ie date format etc I can do that later on if no one else doe first if you have any preference for formatting style on reference or want to do it yourself please let me know There appears to be

In [10]:
stopwords = set(nltk_stopwords.words('english'))

In [11]:
# Split into train/test

X_train, X_test, y_train, y_test = train_test_split(lemmatized_corpus, data['toxic'], 
                                                    test_size=0.2,
                                                    random_state=42)
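
Because only about 10% of the comments are toxic, one option is to pass `stratify` to `train_test_split` so that both parts keep the same class ratio. The sketch below uses synthetic labels (an assumption for illustration) rather than the notebook's data, and it is not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 "texts", 10% labelled toxic
texts = np.array([f"comment {i}" for i in range(1000)])
labels = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels,
    test_size=0.2,
    random_state=42,
    stratify=labels,  # keep the toxic share equal in both parts
)

print(f"toxic share in train: {y_tr.mean():.3f}")
print(f"toxic share in test:  {y_te.mean():.3f}")
```

Without `stratify` the two shares are usually close on a dataset of this size, so the effect is mostly a matter of reproducibility.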


In [12]:
print('Train shape:', len(X_train))
print('Test shape:', len(X_test))


Train shape: 127656
Test shape: 31915


In [13]:
# Vectorize the corpus and remove stop words

tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1), stop_words=list(stopwords),
                              min_df=3, max_df=0.9, strip_accents='unicode',
                              use_idf=True, smooth_idf=True, sublinear_tf=True)

X_train_vect = tf_idf_vect.fit_transform(X_train)
X_test_vect = tf_idf_vect.transform(X_test)


In [14]:
print('X_train vectorized shape:', X_train_vect.shape)
print('X_test vectorized shape:', X_test_vect.shape)

X_train vectorized shape: (127656, 41830)
X_test vectorized shape: (31915, 41830)
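
As a rough illustration of what the `TfidfVectorizer` settings used above compute, the sketch below reproduces the weighting by hand. With `smooth_idf` the inverse document frequency is ln((1 + n) / (1 + df)) + 1, and with `sublinear_tf` the raw count tf is replaced by 1 + ln(tf), as described in the scikit-learn documentation. The document counts are made up, the helper names are hypothetical, and the final L2 normalization of each row is left out.

```python
import math

def smooth_idf(n_docs, doc_freq):
    # Smoothed idf: behaves as if one extra document contained every term
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def sublinear_tf(count):
    # Dampen repeated occurrences: 1 + ln(count) for count > 0
    return 1 + math.log(count) if count > 0 else 0.0

n_docs = 1000  # hypothetical corpus size

# The same in-document count (3) for a rare term and for a frequent term
rare = sublinear_tf(3) * smooth_idf(n_docs, doc_freq=9)
common = sublinear_tf(3) * smooth_idf(n_docs, doc_freq=899)

print(f"weight of the rare term:   {rare:.3f}")
print(f"weight of the common term: {common:.3f}")
```

A term that occurs in few documents gets a larger weight than one that occurs almost everywhere, which is the effect the vectorizer relies on.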


### Preparation summary

At this stage the text corpus was prepared: we removed all extraneous characters, reduced the words to their base form, and then vectorized the texts, i.e. mapped them into a vector space.

## Training

Let's introduce a constant model that answers _toxic=0_ for every object. The results of the other models will be compared against it.

In [16]:
baseline_predictor = pd.Series(data=np.zeros(len(y_test)), index=y_test.index, dtype='int16')
baseline_accuracy = accuracy_score(y_test, baseline_predictor)
print(f'Accuracy for baseline model:{baseline_accuracy:.3f}')


Accuracy for baseline model:0.898
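
One caveat about this baseline: a constant `toxic=0` predictor never produces a true positive, so its F1 is 0 even though its accuracy equals the share of the majority class. The sketch below shows this on made-up labels with plain Python; the helper names are hypothetical and the labels are not taken from the dataset.

```python
def accuracy(y_true, y_pred):
    # Share of positions where the prediction equals the label
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: F1 is taken to be 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels with 10% positives, similar in spirit to the class ratio above
y_true = [0] * 9 + [1]
y_const = [0] * len(y_true)

const_accuracy = accuracy(y_true, y_const)
const_f1 = f1(y_true, y_const)
print(f"accuracy: {const_accuracy:.3f}, F1: {const_f1:.3f}")
```

This is why accuracy is used for the comparison with the baseline, while F1 remains the target metric for the trained models.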


### Logistic Regression

Let's train a logistic regression model and then look at the cross-validation results. The grid searches over the regularization parameter C (in scikit-learn, the inverse of the regularization strength).

In [17]:
parameters = {'C': np.linspace(10, 20, num = 11, endpoint = True),
             'max_iter': [1000]}
lr_model = LogisticRegression()
clf = GridSearchCV(lr_model, parameters,
                  cv=5,
                  scoring='f1',
                  n_jobs=-1,
                  verbose=2)
clf.fit(X_train_vect, y_train)

Fitting 5 folds for each of 11 candidates, totalling 55 fits
[CV] C=10.0, max_iter=1000 ...........................................


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.




[CV] ............................ C=10.0, max_iter=1000, total=  16.4s
[CV] C=10.0, max_iter=1000 ...........................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   16.4s remaining:    0.0s


[CV] ............................ C=10.0, max_iter=1000, total=  16.3s
[CV] C=10.0, max_iter=1000 ...........................................
[CV] ............................ C=10.0, max_iter=1000, total=  16.1s
[CV] C=10.0, max_iter=1000 ...........................................
[CV] ............................ C=10.0, max_iter=1000, total=  16.4s
[CV] C=10.0, max_iter=1000 ...........................................
[CV] ............................ C=10.0, max_iter=1000, total=  14.2s
[CV] C=11.0, max_iter=1000 ...........................................
[CV] ............................ C=11.0, max_iter=1000, total=  15.4s
[CV] C=11.0, max_iter=1000 ...........................................
[CV] ............................ C=11.0, max_iter=1000, total=  15.2s
[CV] C=11.0, max_iter=1000 ...........................................
[CV] ............................ C=11.0, max_iter=1000, total=  15.8s
[CV] C=11.0, max_iter=1000 ...........................................
[CV] .

[Parallel(n_jobs=-1)]: Done  55 out of  55 | elapsed: 14.9min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'C': array([10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20.]),
                         'max_iter': [1000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1', verbose=2)

In [18]:
print(f"Best F1 on CV: {clf.best_score_:.3f}")
print(f"Best logistic regression parameters: {clf.best_params_}")

Best F1 on CV: 0.773
Best logistic regression parameters: {'C': 13.0, 'max_iter': 1000}
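
In the search above, the TF-IDF vocabulary and idf weights were fitted on the whole training set before cross-validation, so each validation fold has already influenced the features. A possible alternative, sketched here on a tiny invented corpus, wraps the vectorizer and the classifier in a `Pipeline` so that the vectorizer is refitted inside every fold. The corpus, the grid values and the variable names are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny invented corpus: 1 = toxic, 0 = not toxic
toy_texts = [
    "you are an idiot", "what a stupid idiot", "shut up you moron",
    "stupid moron go away", "thanks for the helpful edit",
    "great work on the article", "please see the talk page",
    "thanks for fixing the reference",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step name>__<parameter>
grid = GridSearchCV(pipe, {"clf__C": [1.0, 10.0]}, cv=2, scoring="f1")
grid.fit(toy_texts, toy_labels)

print(grid.best_params_)
```

On the real data this would re-vectorize the texts for every fit, so the search would take noticeably longer than the one in the notebook.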


In [20]:
lrm = LogisticRegression(C=13, max_iter=1000)
lrm.fit(X_train_vect, y_train)
predict = lrm.predict(X_test_vect)
f1_lr = f1_score(y_test, predict)

In [21]:
print(f'F1 on the test set: {f1_lr:.3f}')

F1 on the test set: 0.782


Next, let's sanity-check the logistic regression model by comparing its accuracy with that of the baseline model.

In [22]:
lrm_accuracy = accuracy_score(y_test, predict)
print(f"Logistic regression accuracy: {lrm_accuracy:.3f}")


Logistic regression accuracy: 0.960


The logistic regression accuracy (0.960) exceeds that of the constant model (0.898), so the model passes the sanity check.

### AdaBoost

In [23]:
ada_boost_model = AdaBoostClassifier(n_estimators=100)
ada_boost_model.fit(X_train_vect, y_train)
pred = ada_boost_model.predict(X_test_vect)
f1_ada = f1_score(y_test, pred)

In [24]:
print(f'F1 on the test set: {f1_ada:.3f}')


F1 on the test set: 0.711


In [25]:
param_grid = {'n_estimators': [500, 1000, 2000],
              'learning_rate': [0.001, 0.01, 0.1]}

search = GridSearchCV(estimator=ada_boost_model, param_grid=param_grid,
                      scoring='f1', n_jobs=-1, cv=5)

In [26]:
search.fit(X_train_vect, y_train)

KeyboardInterrupt: 
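
A back-of-the-envelope count suggests why this search was impractical here. Assuming the cost of fitting AdaBoost grows roughly in proportion to `n_estimators` (an approximation, since boosting can stop early), the grid amounts to tens of thousands of boosting rounds across the folds, compared with 100 rounds for the single fit in the earlier cell.

```python
from itertools import product

# Same grid and number of folds as in the interrupted search
param_grid = {'n_estimators': [500, 1000, 2000],
              'learning_rate': [0.001, 0.01, 0.1]}
cv = 5

# Every combination of the two parameters is one candidate
candidates = list(product(param_grid['n_estimators'], param_grid['learning_rate']))

# Each candidate is fitted once per fold (the final refit is not counted)
n_fits = len(candidates) * cv
total_rounds = sum(n for n, _ in candidates) * cv

print(f"fits: {n_fits}, boosting rounds (upper bound): {total_rounds}")
```

A smaller grid, fewer folds or a subsample of the training set would be natural ways to make such a search finish, at the cost of a coarser estimate.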

In [30]:
ada_boost_model = AdaBoostClassifier(n_estimators=170)
ada_boost_model.fit(X_train_vect, y_train)
pred = ada_boost_model.predict(X_test_vect)
f1_ada = f1_score(y_test, pred)

print(f'F1 on the test set: {f1_ada:.3f}')

F1 on the test set: 0.730


### SVM

#### LinearSVC

In [31]:
parameters = {'C': np.linspace(1, 31, num = 7, endpoint = True)}
lsvcm = LinearSVC(max_iter = 1000)
clf_lsvc = GridSearchCV(lsvcm, parameters,
                  cv=5,
                  scoring='f1',
                  n_jobs=-1,
                  verbose=2)

In [33]:
clf_lsvc.fit(X_train_vect, y_train)

Fitting 5 folds for each of 7 candidates, totalling 35 fits
[CV] C=1.0 ...........................................................


[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ............................................ C=1.0, total=   0.8s
[CV] C=1.0 ...........................................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s


[CV] ............................................ C=1.0, total=   0.8s
[CV] C=1.0 ...........................................................
[CV] ............................................ C=1.0, total=   0.8s
[CV] C=1.0 ...........................................................
[CV] ............................................ C=1.0, total=   0.9s
[CV] C=1.0 ...........................................................
[CV] ............................................ C=1.0, total=   0.8s
[CV] C=6.0 ...........................................................
[CV] ............................................ C=6.0, total=   2.8s
[CV] C=6.0 ...........................................................
[CV] ............................................ C=6.0, total=   2.1s
[CV] C=6.0 ...........................................................
[CV] ............................................ C=6.0, total=   2.5s
[CV] C=6.0 ...........................................................
[CV] .


Liblinear failed to converge, increase the number of iterations.



[CV] ........................................... C=21.0, total=   6.1s
[CV] C=21.0 ..........................................................
[CV] ........................................... C=21.0, total=   5.4s
[CV] C=21.0 ..........................................................
[CV] ........................................... C=21.0, total=   6.1s
[CV] C=21.0 ..........................................................
[CV] ........................................... C=21.0, total=   6.1s
[CV] C=21.0 ..........................................................
[CV] ........................................... C=21.0, total=   6.0s
[CV] C=26.0 ..........................................................
[CV] ........................................... C=26.0, total=   6.5s
[CV] C=26.0 ..........................................................
[CV] ........................................... C=26.0, total=   6.3s
[CV] C=26.0 ..........................................................
[CV] .

[Parallel(n_jobs=-1)]: Done  35 out of  35 | elapsed:  2.6min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=LinearSVC(C=1.0, class_weight=None, dual=True,
                                 fit_intercept=True, intercept_scaling=1,
                                 loss='squared_hinge', max_iter=1000,
                                 multi_class='ovr', penalty='l2',
                                 random_state=None, tol=0.0001, verbose=0),
             iid='warn', n_jobs=-1,
             param_grid={'C': array([ 1.,  6., 11., 16., 21., 26., 31.])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='f1', verbose=2)

In [34]:

print(f"Best F1 on cross-validation: {clf_lsvc.best_score_:.3f}")
print(f"Regularization parameter of the best model: {clf_lsvc.best_params_}")

Best F1 on cross-validation: 0.777
Regularization parameter of the best model: {'C': 1.0}


In [15]:
lsvcm = LinearSVC(C=1, max_iter=1000)
lsvcm.fit(X_train_vect, y_train)
predict = lsvcm.predict(X_test_vect)
f1_lsvc = f1_score(y_test, predict)

In [16]:
print(f"F1 on the test set: {f1_lsvc:.3f}")


F1 on the test set: 0.787


## Conclusions

We analysed the tone of the comments (toxic vs. non-toxic) and compared three models: **Logistic Regression, LinearSVC, AdaBoostClassifier**.

 - LinearSVC showed the highest F1 score on the test set (0.787).
 - Logistic Regression reached 0.782 and also passed the sanity check on accuracy (0.960 against 0.898 for the constant baseline).
 - AdaBoost reached 0.730.

AdaBoost might have given better results with hyperparameter tuning, but its grid search was interrupted before it finished. Logistic regression and the linear SVM both solved the task: the F1 threshold of 0.75 is met.