# Project for Wikishop («Викишоп»)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for edit toxicity.

Build a model with an *F1* score of at least 0.75.
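
For reference, F1 is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it in plain Python; the function name `f1_from_labels` and the toy labels are illustrative assumptions, not part of the dataset.

```python
def f1_from_labels(y_true, y_pred):
    """Compute F1 for the positive class (label 1) from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy labels: 1 = toxic, 0 = non-toxic
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_f1 = f1_from_labels(y_true, y_pred)
print(round(toy_f1, 3))
```

Because true negatives do not enter the formula, F1 is not inflated by a large non-toxic majority the way accuracy can be.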

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is optional for this project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

In [1]:
# standard library
import re

# third-party
import nltk
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from pymystem3 import Mystem
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier
from tqdm import tqdm

# NLTK data used below; nltk.word_tokenize also needs the 'punkt' tokenizer
# data, which is assumed to be installed already
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Preparation

In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
df.head(5)

| | Unnamed: 0 | text | toxic |
|---|---|---|---|
| 0 | 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [5]:
df.shape

(159292, 3)

In [6]:
df.duplicated().sum()

0

In [7]:
df.isna().sum()

Unnamed: 0    0
text          0
toxic         0
dtype: int64

In [8]:
df['Unnamed: 0'].duplicated().sum()

0

In [9]:
# drop the 'Unnamed: 0' column: it duplicates the index
df.drop('Unnamed: 0', axis = 1, inplace = True)

In [10]:
# text-cleaning function: keep Latin letters only, collapse whitespace, lowercase
def clear(text):
    text_clr = re.sub(r'[^a-zA-Z ]', ' ', text).split()
    text_clr = " ".join(text_clr)
    text_clr = text_clr.lower()
    return text_clr

In [11]:
df['text'] = df['text'].apply(clear)
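
To make the effect of `clear` concrete, here is a small standalone run on one made-up string. The function body repeats the notebook's logic; the `sample` text is an assumption for illustration only.

```python
import re

def clear(text):
    # Same logic as the notebook's clear(): replace everything except Latin
    # letters and spaces with a space, collapse whitespace, then lowercase
    words = re.sub(r'[^a-zA-Z ]', ' ', text).split()
    return " ".join(words).lower()

sample = "D'aww! He matches\nthis background colour, 100%..."
cleaned = clear(sample)
print(cleaned)  # d aww he matches this background colour
```

Note that apostrophes are replaced by spaces, so contractions such as "I'm" become two tokens ("i m"), which is visible in the cleaned data further down.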

In [12]:
# create a WordNetLemmatizer instance; its lemmatize() method is called in token_lem()

lemmatizer = WordNetLemmatizer()

In [13]:
# map a word's Penn Treebank POS tag to the matching WordNet POS constant (noun by default)
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [14]:
# tokenize the text, then lemmatize every token
def token_lem(text):
    tok = nltk.word_tokenize(text)
    lem = ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tok])
    return lem

In [15]:
%%time
df['text_lem'] = df['text'].apply(token_lem)

CPU times: user 18min 22s, sys: 1min 37s, total: 19min 59s
Wall time: 20min
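
The lemmatization step takes about 20 minutes. Since `get_wordnet_pos` tags each word in isolation, the lemma depends only on the word itself, so the result for a repeated word could be cached. The sketch below shows the idea with `functools.lru_cache`. It is not the notebook's code: `slow_lemma` is a hypothetical stand-in for `lemmatizer.lemmatize(w, get_wordnet_pos(w))` so that the example runs without NLTK data, and the actual speed-up on this dataset has not been measured.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the expensive function really runs

def slow_lemma(word):
    # Hypothetical stand-in for lemmatizer.lemmatize(word, get_wordnet_pos(word)).
    # It only strips a trailing "s"; it is NOT a real lemmatizer.
    calls["count"] += 1
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

@lru_cache(maxsize=None)
def cached_lemma(word):
    # Each distinct word is processed once; repeats are served from the cache
    return slow_lemma(word)

def token_lem_cached(text):
    # The text is already cleaned, so whitespace splitting is used in this sketch
    return " ".join(cached_lemma(w) for w in text.split())

texts = ["he matches edits", "she edits pages", "he edits pages"]
result = [token_lem_cached(t) for t in texts]
print(result)
print(calls["count"])  # number of distinct words, not the number of tokens
```

With the real lemmatizer, the same decorator could wrap a one-argument helper that calls `lemmatizer.lemmatize(w, get_wordnet_pos(w))`.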


In [28]:
df

| | text | toxic | text_lem |
|---|---|---|---|
| 0 | explanation why the edits made under my userna... | 0 | explanation why the edits make under my userna... |
| 1 | d aww he matches this background colour i m se... | 0 | d aww he match this background colour i m seem... |
| 2 | hey man i m really not trying to edit war it s... | 0 | hey man i m really not try to edit war it s ju... |
| 3 | more i can t make any real suggestions on impr... | 0 | more i can t make any real suggestion on impro... |
| 4 | you sir are my hero any chance you remember wh... | 0 | you sir be my hero any chance you remember wha... |
| ... | ... | ... | ... |
| 159287 | and for the second time of asking when your vi... | 0 | and for the second time of ask when your view ... |
| 159288 | you should be ashamed of yourself that is a ho... | 0 | you should be ashamed of yourself that be a ho... |
| 159289 | spitzer umm theres no actual article for prost... | 0 | spitzer umm there no actual article for prosti... |
| 159290 | and it looks like it was actually you who put ... | 0 | and it look like it be actually you who put on... |


### Conclusions
- The data was inspected.
- No missing values or duplicates were found.
- The text was cleaned, tokenized, and lemmatized.
- The data is ready for modeling.

## Training

In [17]:
# define the stop words
stop_words = set(nltk_stopwords.words('english'))
stop_words_list = list(stop_words)

In [18]:
# define the features and the target
df_features = df['text_lem']
df_target = df['toxic']

In [19]:
# split the data into training and test sets in a 90/10 ratio
train_features, test_features, train_target, test_target = train_test_split(
    df_features, df_target, test_size = 0.1, random_state = 12345)
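
The split above is not stratified. If the target is imbalanced, passing `stratify` to `train_test_split` keeps the share of the positive class the same in both parts. The sketch below uses synthetic data; the 10% positive share is an assumed figure for illustration, not a measurement from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, 10% positive class (an assumed ratio)
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 100 + [0] * 900)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=12345, stratify=y)

# Both parts keep the same share of positives as the full data
print(y_train.mean(), y_test.mean())
```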

In [20]:
display(train_features.shape)
display(test_features.shape)
display(train_target.shape)
display(test_target.shape)

(143362,)

(15930,)

(143362,)

(15930,)

In [21]:
# create the TF-IDF vectorizer and fit it on the training set only
count_tf_idf = TfidfVectorizer(stop_words=stop_words_list)
count_tf_idf.fit(train_features)
tf_idf_train_features = count_tf_idf.transform(train_features)
tf_idf_test_features = count_tf_idf.transform(test_features)
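
The fit/transform separation matters because the vocabulary and IDF weights come from the training texts only, so the test matrix has the same columns and any unseen words are dropped. A toy illustration follows; the two small corpora are made up for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["you be my hero", "stop edit war", "edit war be bad"]
test_docs = ["hero stop vandalism"]  # "vandalism" never appears in train_docs

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them, learns nothing new

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have one column per training-vocabulary term, which is why the shapes printed in the notebook share the same second dimension.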

In [22]:
display(tf_idf_train_features.shape)
display(tf_idf_test_features.shape)

(143362, 142204)

(15930, 142204)

**LogisticRegression**

In [23]:
%%time

model = LogisticRegression(random_state=12345, max_iter=150, solver='liblinear')

tuning_model=GridSearchCV(estimator=model,
                          param_grid={'penalty' : ['l1','l2'], 
                                     'C' : [5,10,15]},
                          scoring='f1',
                          cv=5,
                          verbose=5)

tuning_model.fit(tf_idf_train_features, train_target)
print(tuning_model.best_params_)
print('F1 = ', tuning_model.best_score_)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV 1/5] END ................................C=5, penalty=l1; total time=   2.2s
[CV 2/5] END ................................C=5, penalty=l1; total time=   2.5s
[CV 3/5] END ................................C=5, penalty=l1; total time=   2.3s
[CV 4/5] END ................................C=5, penalty=l1; total time=   2.6s
[CV 5/5] END ................................C=5, penalty=l1; total time=   2.4s
[CV 1/5] END ................................C=5, penalty=l2; total time=  14.0s
[CV 2/5] END ................................C=5, penalty=l2; total time=  14.1s
[CV 3/5] END ................................C=5, penalty=l2; total time=  14.2s
[CV 4/5] END ................................C=5, penalty=l2; total time=  13.8s
[CV 5/5] END ................................C=5, penalty=l2; total time=  13.6s
[CV 1/5] END ...............................C=10, penalty=l1; total time=   3.8s
[CV 2/5] END ...............................C=10,
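
In the grid search above, the TF-IDF matrix was built from the whole training set before cross-validation, so every validation fold also contributed to the vocabulary and IDF weights. One possible way to avoid this is to put the vectorizer and the classifier into a `Pipeline` and search over that. The sketch below is not the notebook's code: the toy corpus is hypothetical and the parameter grid is reduced so that it runs quickly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for train_features / train_target
texts = ["you be a stupid idiot", "thanks for the helpful edit",
         "shut up idiot", "nice article thanks",
         "stupid vandal go away", "please see the talk page"] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),  # refit on the training part of every fold
    ("clf", LogisticRegression(solver="liblinear", random_state=12345)),
])

search = GridSearchCV(
    pipe,
    param_grid={"clf__penalty": ["l1", "l2"], "clf__C": [5, 10]},
    scoring="f1",
    cv=3)
search.fit(texts, labels)  # raw texts go in; vectorization happens inside the folds
print(search.best_params_)
```

The fitted `search.best_estimator_` accepts raw text directly, so the same object can be used for prediction on the test set without a separate `transform` call.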

**LGBMClassifier**

In [24]:
%%time
model = LGBMClassifier(random_state=12345)
tuning_model=GridSearchCV(estimator = model,
                          param_grid= {'max_depth' : [21,31]},
                          scoring='f1',
                          cv=5,
                          verbose=5)

tuning_model.fit(tf_idf_train_features, train_target)
print(tuning_model.best_params_)
print('F1 = ', tuning_model.best_score_)

Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END ...................................max_depth=21; total time= 9.1min
[CV 2/5] END ...................................max_depth=21; total time=11.6min
[CV 3/5] END ...................................max_depth=21; total time= 2.6min
[CV 4/5] END ...................................max_depth=21; total time= 2.7min
[CV 5/5] END ...................................max_depth=21; total time= 6.8min
[CV 1/5] END ...................................max_depth=31; total time= 7.4min
[CV 2/5] END ...................................max_depth=31; total time= 2.9min
[CV 3/5] END ...................................max_depth=31; total time= 2.8min
[CV 4/5] END ...................................max_depth=31; total time= 2.8min
[CV 5/5] END ...................................max_depth=31; total time= 2.8min
{'max_depth': 31}
F1 =  0.742883932470747
CPU times: user 54min 30s, sys: 2.32 s, total: 54min 32s
Wall time: 54min 55s


**DecisionTreeClassifier**

In [25]:
%%time
model = DecisionTreeClassifier(random_state=12345)
tuning_model=GridSearchCV(estimator = model,
                          param_grid= {'max_depth' : range(1,10, 2)},
                          scoring='f1',
                          cv=5,
                          verbose=5)

tuning_model.fit(tf_idf_train_features, train_target)
print(tuning_model.best_params_)
print('F1 = ', tuning_model.best_score_)

Fitting 5 folds for each of 5 candidates, totalling 25 fits
[CV 1/5] END ....................................max_depth=1; total time=  10.8s
[CV 2/5] END ....................................max_depth=1; total time=  11.4s
[CV 3/5] END ....................................max_depth=1; total time=  10.6s
[CV 4/5] END ....................................max_depth=1; total time=  11.0s
[CV 5/5] END ....................................max_depth=1; total time=  11.3s
[CV 1/5] END ....................................max_depth=3; total time=  11.2s
[CV 2/5] END ....................................max_depth=3; total time=  12.3s
[CV 3/5] END ....................................max_depth=3; total time=  11.7s
[CV 4/5] END ....................................max_depth=3; total time=  11.9s
[CV 5/5] END ....................................max_depth=3; total time=  12.1s
[CV 1/5] END ....................................max_depth=5; total time=  12.0s
[CV 2/5] END ....................................

Three models were compared with cross-validation, with the following F1 scores:
1. LogisticRegression, F1 = 0.771443484781444
2. LGBMClassifier, F1 = 0.742883932470747
3. DecisionTreeClassifier, F1 = 0.596346505374412

LogisticRegression scored best, so we evaluate it on the test set.

In [27]:
%%time
model = LogisticRegression(random_state=12345, max_iter=150, solver='liblinear', C = 5, penalty = 'l1')
model.fit(tf_idf_train_features, train_target)
prediction_final = model.predict(tf_idf_test_features)
f1 = f1_score(test_target,prediction_final)
print('F1 =', f1)

F1 = 0.7776261937244202
CPU times: user 3.09 s, sys: 91 ms, total: 3.18 s
Wall time: 3.18 s


**The task requirement is met: F1 on the test set is above 0.75.**

## Conclusions

The model was built in the following steps:
1. Preprocessing: the text was cleaned, tokenized, and lemmatized.
2. Stop words were removed (via the `stop_words` parameter of `TfidfVectorizer`).
3. A TF-IDF vectorizer was fit on the training set.
4. Three models were tuned with cross-validation:
    1. LogisticRegression, F1 = 0.771443484781444
    2. LGBMClassifier, F1 = 0.742883932470747
    3. DecisionTreeClassifier, F1 = 0.596346505374412
5. On the test set, LogisticRegression reached F1 = 0.7776261937244202, which meets the task requirement.