<h1>Содержание<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1">Подготовка</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2">Обучение</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3">Выводы</a></span></li></ul></div>

# Оценка комментариев

Пользователи Интернет-магазина могут редактировать и дополнять описания товаров. Магазину нужен инструмент, который будет искать токсичные комментарии и отправлять их на модерацию. 

Цель: Обучить модель классифицировать комментарии на позитивные и негативные. В нашем распоряжении набор данных с разметкой о токсичности правок. Модель должна иметь значение метрики качества *F1* не меньше 0.75. 



## Подготовка

In [211]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from tqdm import notebook
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
from sklearn.ensemble import RandomForestClassifier
from nltk.stem import WordNetLemmatizer
import re 
import lightgbm as lgb



import warnings
warnings.filterwarnings("ignore")

pd.set_option('display.float_format', '{:,.2f}'.format)

In [212]:
# чтение файла с данными и сохранение 
try: data = pd.read_csv('/datasets/toxic_comments.csv') 
except: data = pd.read_csv('toxic_comments.csv')

In [213]:
# просмотрим информацию 
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [214]:
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


## Обучение

In [215]:
# загрузим модули
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stopwords = set(nltk_stopwords.words('english'))

# создадим счётчик, указав в нём стоп-слова
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/yuliaanikeeva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/yuliaanikeeva/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/yuliaanikeeva/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [216]:
# посмотрим текст до обработки
data['text'].head()

159561    "\nNo he did not, read it again (I would have ...
159562    "\n Auto guides and the motoring press are not...
159563    "\nplease identify what part of BLP applies be...
159564    Catalan independentism is the social movement ...
159565    The numbers in parentheses are the additional ...
159566    ":::::And for the second time of asking, when ...
159567    You should be ashamed of yourself \n\nThat is ...
159568    Spitzer \n\nUmm, theres no actual article for ...
159569    And it looks like it was actually you who put ...
159570    "\nAnd ... I really don't think you understand...
Name: text, dtype: object

In [190]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

In [217]:
# напишем функции для очистки и лемматизации
def clear_text(text):
    return " ".join(re.sub(r'[^a-zA-z]', ' ', text).split())

def lemmatize_text(text):
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))

In [218]:
# проведем обработку
data['text_clear'] = data['text'].apply(clear_text)
data['text_lemm'] = data['text_clear'].apply(lemmatize_text)

In [220]:
# посмотрим текст после обработки
data['text_lemm'].head()

159561    No he did not read it again I would have thoug...
159562    Auto guide and the motoring press are not good...
159563    please identify what part of BLP applies becau...
159564    Catalan independentism is the social movement ...
159565    The number in parenthesis are the additional d...
159566    And for the second time of asking when your vi...
159567    You should be ashamed of yourself That is a ho...
159568    Spitzer Umm there no actual article for prosti...
159569    And it look like it wa actually you who put on...
159570    And I really don t think you understand I came...
Name: text_lemm, dtype: object

In [221]:
# разделим выборку
features = data['text_lemm']
target = data['toxic']
train, valid = train_test_split(data, test_size=0.2)
features_train = train['text_lemm']
features_valid = valid['text_lemm']
target_train = train['toxic']
target_valid = valid['toxic']


print(features_train.shape, features_valid.shape)
print(target_train.shape, target_valid.shape)

In [255]:
# Вычислим TF-IDF для корпуса текстов
corpus = features_train.values
tf_idf = count_tf_idf.fit_transform(corpus)
print("Размер матрицы:", tf_idf.shape)

Размер матрицы: (127656, 147426)


In [256]:
# вычислим величину TF-IDF для валидационной выборки
corpus_valid = features_valid.values
tf_idf_valid = count_tf_idf.transform(corpus_valid)
print("Размер матрицы:", tf_idf_valid.shape)

Размер матрицы: (31915, 147426)


In [257]:
# Логистическая регрессия 
model = LogisticRegression(random_state=12345)
model.fit(tf_idf, target_train)
predictions = model.predict(tf_idf_valid)

f1 = f1_score(target_valid, predictions)
f1

0.7262569832402235

In [258]:
# Логистическая регрессия с кросс-вариацией
model = LogisticRegression(random_state=12345)
scores = cross_val_score(model, tf_idf, target_train, scoring = 'f1', cv=10) 
final_score = scores.mean()
final_score


0.719663186196813

In [281]:
# Градиентный бустинг LightGBM
model = lgb.LGBMModel(objective="binary", n_estimators=200,  learning_rate=(0.2), max_depth=10)
model.fit(tf_idf, target_train, eval_set=[(tf_idf_valid, target_valid),], eval_metric="f1")

predictions = model.predict(tf_idf_valid)
pred_bi = [1 if pred >= 0.5 else 0 for pred in predictions]
f1 = f1_score(target_valid, pred_bi)
f1


[1]	valid_0's binary_logloss: 0.266662
[2]	valid_0's binary_logloss: 0.248263
[3]	valid_0's binary_logloss: 0.235654
[4]	valid_0's binary_logloss: 0.226922
[5]	valid_0's binary_logloss: 0.220032
[6]	valid_0's binary_logloss: 0.213619
[7]	valid_0's binary_logloss: 0.208642
[8]	valid_0's binary_logloss: 0.20441
[9]	valid_0's binary_logloss: 0.200447
[10]	valid_0's binary_logloss: 0.197022
[11]	valid_0's binary_logloss: 0.19399
[12]	valid_0's binary_logloss: 0.191296
[13]	valid_0's binary_logloss: 0.189012
[14]	valid_0's binary_logloss: 0.18639
[15]	valid_0's binary_logloss: 0.18437
[16]	valid_0's binary_logloss: 0.182143
[17]	valid_0's binary_logloss: 0.180345
[18]	valid_0's binary_logloss: 0.178533
[19]	valid_0's binary_logloss: 0.176808
[20]	valid_0's binary_logloss: 0.175267
[21]	valid_0's binary_logloss: 0.174127
[22]	valid_0's binary_logloss: 0.172929
[23]	valid_0's binary_logloss: 0.171548
[24]	valid_0's binary_logloss: 0.170141
[25]	valid_0's binary_logloss: 0.168878
[26]	valid_0'

0.7539936102236422

In [287]:
# Случайный лес
model = RandomForestClassifier(random_state=12345, n_estimators=200, max_depth=30) 
model.fit(tf_idf, target_train)
predictions = model.predict(tf_idf_valid)

f1 = f1_score(target_valid, predictions)
f1

0.011550151975683891

## Выводы

В ходе работы был Обработан текст сообщений: Сначала текст был очищен от знаков препинания и других символов, отличных от букв латинского алфавита, разбит на токены, далее проведена лемматизация с помощью word net, а также создан счётчик и посчитана TF-IDF. 

Для подготовки предсказаний обучены модели логистической регрессии, случайного леса в классификации и LGBM градиентный бустинг. Для оценки качества использовалась метрика F1. **Наилучший результат F1 (0,75) получен на модели градиентного бустинга LGBM**. 