# Project for Wikishop

The Wikishop online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you are welcome to try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

In [None]:
!pip install -q spacy

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

import nltk
from nltk.corpus import stopwords
import spacy
import re

import joblib
import warnings

In [None]:
TEST_SIZE = 0.25
RANDOM_STATE = 42

warnings.filterwarnings('ignore')

## Preparation

In [None]:
# load the data
df = pd.read_csv('/datasets/toxic_comments.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


Keep only the `text` and `toxic` columns; the two `Unnamed` columns merely duplicate the row index.

In [None]:
df = df[['text', 'toxic']]

Define a function that prints the basic information about a dataset.

In [None]:
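def data_info(data):
    # first rows, column types, and summary statistics
    display(data.head())
    data.info()
    # datetime_is_numeric was dropped: there are no datetime columns,
    # and the argument no longer exists in pandas 2.0+
    display(data.describe(include='all').T)
    print()
    print('Class distribution:')
    # use the function argument rather than the global df
    print(data['toxic'].value_counts())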

In [None]:
data_info(df)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
text,159292.0,159292.0,sorry \n\nFor stuffing the battery page - mt i...,1.0,,,,,,,
toxic,159292.0,,,,0.101612,0.302139,0.0,0.0,0.0,0.0,1.0



Class distribution:
0    143106
1     16186
Name: toxic, dtype: int64


The dataset contains no missing values or duplicates, and the data types are correct. The classes are imbalanced: only 16,186 of the 159,292 comments (about 10%) are toxic.
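
The imbalance is also a reason to judge models by F1 rather than accuracy. A minimal sketch, using the class counts printed above and a hypothetical baseline that labels every comment as non-toxic (the helper `f1_from_counts` is introduced here for illustration only):

```python
# Class counts taken from the value_counts() output above
n_clean, n_toxic = 143106, 16186
total = n_clean + n_toxic


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0 when there are no true positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if tp else 0.0


# Hypothetical baseline: predict "non-toxic" for every comment,
# so every toxic comment becomes a false negative
baseline_accuracy = n_clean / total
baseline_f1 = f1_from_counts(tp=0, fp=0, fn=n_toxic)

print(f"toxic share:       {n_toxic / total:.4f}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"baseline F1:       {baseline_f1:.4f}")
```

Such a baseline is right for roughly nine comments out of ten, yet it never finds a toxic one, which F1 on the toxic class makes visible.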

Split the dataset into training and test sets and build lists of texts.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['toxic'], axis=1),
    df['toxic'],
    test_size = TEST_SIZE,
    random_state = RANDOM_STATE
)
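
The split above is purely random. With only about 10% toxic comments, one option is to stratify by the target so that both parts keep the same class ratio. A minimal sketch on synthetic labels (the `toy_*` arrays are assumptions made for illustration, not the project data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with a 10% positive class, roughly as in the dataset
toy_y = np.array([0] * 900 + [1] * 100)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

# stratify=toy_y keeps the class proportions equal in both parts
toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X,
    toy_y,
    test_size=0.25,
    random_state=42,
    stratify=toy_y,
)

print("positive share in train:", toy_y_tr.mean())
print("positive share in test: ", toy_y_te.mean())
```

In the notebook this would amount to passing `stratify=df['toxic']` to `train_test_split`; the reported results were obtained without it.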

In [None]:
train_texts = X_train['text'].to_list()
test_texts = X_test['text'].to_list()

Define a function that cleans and lemmatizes the texts.

In [None]:
# load spaCy's small English pipeline
nlp = spacy.load("en_core_web_sm")

def lemm_text(texts):

    # replace everything except Latin letters and spaces with a space
    clean_texts = [re.sub(r'[^a-zA-Z ]', ' ', text) for text in texts]

    # helper that splits a list into chunks of a given size
    make_parts = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]
    # split the texts into chunks of 100
    text_parts = make_parts(clean_texts, 100)

    # join the texts of each chunk, marking the end of every text with '*'
    # (safe as a separator because the cleaning step removed every '*')
    res = []
    for part in text_parts:
        united_texts = ' '.join([text + '*' for text in part])

        # lemmatize the joined chunk in a single spaCy call
        nlp_text_object = nlp(united_texts)
        lemm_text_part = ' '.join([token.lemma_ for token in nlp_text_object])

        # split back into individual texts and collect the results
        lemm_text_list = lemm_text_part.split('*')
        for text in lemm_text_list[:-1]:
            res.append(text)

    return res
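
The batching trick in `lemm_text` relies on the cleaning regex removing every `*`, so the separator can only appear where the function adds it. A minimal sketch of that round trip, with `str.lower` standing in for the spaCy lemmatizer (the stand-in and the names `clean`, `chunks`, and `process_in_batches` are illustrative assumptions):

```python
import re


def clean(text):
    # keep Latin letters and spaces only, as in the notebook's regex
    return re.sub(r'[^a-zA-Z ]', ' ', text)


def chunks(items, size):
    # split a list into consecutive chunks of at most `size` items
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_in_batches(texts, transform, size=100, sep='*'):
    out = []
    for part in chunks([clean(t) for t in texts], size):
        # mark the end of each text with the separator and join the chunk
        joined = ' '.join(t + sep for t in part)
        processed = transform(joined)
        # the piece after the last separator is empty, so drop it
        out.extend(processed.split(sep)[:-1])
    return out


sample_texts = ["Hello, World!", "2 * 2 = 4", "Third text."]
sample_result = process_in_batches(sample_texts, str.lower, size=2)
print(sample_result)
```

The second sample contains a literal `*`, which the cleaning step replaces with a space before the separator is added, so the number of outputs still matches the number of inputs.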

In [None]:
%%time
# build the lemmatized corpus for the training set
# (lemm_text does not use joblib, so this context manager does not parallelize the call)
with joblib.parallel_backend("threading"):
    X_train_corpus = lemm_text(train_texts)

CPU times: user 21min 34s, sys: 8.11 s, total: 21min 43s
Wall time: 21min 44s


In [None]:
# sanity check of the lemmatized training set
print(len(X_train_corpus), len(train_texts))
print('=' * 50)
print(train_texts[0])
print('-' * 50)
print(X_train_corpus[0])
print('=' * 50)
print(train_texts[-1])
print('-' * 50)
print(X_train_corpus[-1])

119469 119469
The source was also not acceptable for reasons already stated. You also just rambled on about a load of nonsense.
--------------------------------------------------
the source be also not acceptable for reason already state   you also just ramble on about a load of nonsense 
"
I'm sorry, Tip; the deduction was obvious (and wrong). The article as written did not contain any links to reliable sources, depending instead on the subject's own website, amazon.com, and other notoriously unreliable sources. Remember also that Notability is not contagious; being the agent for a notable person does not make this guy notable. I'd suggest you rewrite this one in a sandbox, using more reliable sources; and if there are no such sources, consider what this says about the guy's notability. Your fellow cheesehead,   |  Talk "
--------------------------------------------------
    I m sorry   Tip   the deduction be obvious   and wrong    the article as write do not contain any link to reli

In [None]:
%%time
# build the lemmatized corpus for the test set
# (as above, the joblib context manager has no effect on lemm_text)
with joblib.parallel_backend("threading"):
    X_test_corpus = lemm_text(test_texts)

CPU times: user 7min 20s, sys: 4.76 s, total: 7min 25s
Wall time: 7min 26s


In [None]:
# sanity check of the lemmatized test set
print(len(X_test_corpus), len(test_texts))
print('=' * 50)
print(test_texts[0])
print('-' * 50)
print(X_test_corpus[0])
print('=' * 50)
print(test_texts[3])
print('-' * 50)
print(X_test_corpus[3])

39823 39823
Sometime back, I just happened to log on to www.izoom.in with a friend’s reference and I was amazed to see the concept Fresh Ideas Entertainment has come up with. So many deals… all under one roof. This website is very user friendly and easy to use and is fun to be on.
You have Gossip, Games, Facts… Another exciting feature to add to it is Face of the Week… Every week, 4 new faces are selected and put up as izoom faces. It’s great to have been selected in four out of a group of millions. 
This new start up has already got many a deals in its kitty. Few of them being TheFortune Hotel, The Beach… are my personal favorites. izoom.in has a USP of mobile coupons. Coupons are available even when a user cannot access internet. You just need to SMS izoom support to 56767 and you get attended immediately.
All I can say is izoom.in is a must visit website for everyone before they go out for shopping or dining or for outing.
Cheers!!!
--------------------------------------------------

In [None]:
# create the TF-IDF vectorizer with English stop words from NLTK
count = TfidfVectorizer(stop_words=set(stopwords.words('english')))

In [None]:
%%time
# fit the vectorizer on the training corpus and transform it
X_train_tf_idf = count.fit_transform(X_train_corpus)
# transform the test corpus with the already fitted vectorizer
X_test_tf_idf = count.transform(X_test_corpus)

CPU times: user 6.4 s, sys: 20 ms, total: 6.42 s
Wall time: 6.45 s
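
To make the weights produced by the vectorizer less opaque, here is a hand-rolled sketch of TF-IDF on a toy tokenized corpus. It assumes scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, raw term counts); the documents and the helper `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is already split into tokens
toy_docs = [
    ["good", "edit"],
    ["bad", "edit"],
    ["edit", "edit", "war"],
]
n_docs = len(toy_docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}


def tfidf_vector(doc):
    # raw term count times IDF, followed by L2 normalization
    counts = Counter(doc)
    raw = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}


first_vector = tfidf_vector(toy_docs[0])
print(idf)
print(first_vector)
```

A term that occurs in every document (`edit`) gets the minimum IDF of 1, so within the first document the rarer term `good` receives the larger weight.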


## Training

Initialize the pipeline used to train the models.

In [None]:
pipe_final = Pipeline([
    ('models', LogisticRegression(random_state=RANDOM_STATE))
])
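
In this notebook the vectorizer is fitted on the whole training corpus before cross-validation, so every validation fold has already contributed to the vocabulary and IDF weights. One possible alternative, sketched here on a toy corpus, is to make the vectorizer a pipeline step so that it is refitted inside each fold. The texts, labels, and the step name `tfidf` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data for illustration only (1 = toxic, 0 = not toxic)
toy_texts = [
    "you are an idiot",
    "thanks for the helpful edit",
    "stupid idiot go away",
    "great article thank you",
    "what a stupid edit",
    "nice work on the page",
]
toy_labels = [1, 0, 1, 0, 1, 0]

# The vectorizer is part of the pipeline, so fit() learns the vocabulary
# only from the texts passed to it
toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("models", LogisticRegression(solver="liblinear", random_state=42)),
])

toy_pipe.fit(toy_texts, toy_labels)
toy_pred = toy_pipe.predict(["what an idiot"])
print(toy_pred)
```

With this layout the grid search would receive the lemmatized texts instead of a precomputed matrix, at the cost of re-vectorizing in every fit.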

Define the hyperparameter grid.

In [None]:
param_grid = [
    {
        'models': [LogisticRegression(
            random_state=RANDOM_STATE,
            solver='liblinear',
            penalty='l2'
        )],
        'models__C': [5, 10, 15],
    },

    {
        'models': [DecisionTreeClassifier(random_state=RANDOM_STATE)],
        'models__max_features': range(6, 8),
        'models__max_depth': range(8, 10)
    },

    {
        'models': [KNeighborsClassifier()],
        'models__n_neighbors': [5, 25]
    }
]
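
The grid above consists of three independent sub-grids. As a quick sanity check, the number of candidates can be counted by hand. The sketch below uses plain strings in place of the estimator objects (an illustrative simplification) and assumes 5 cross-validation folds, as in the `GridSearchCV` call used in this notebook.

```python
from math import prod

# Same shape as param_grid, with strings standing in for the estimators
grid_shape = [
    {"models": ["LogisticRegression"], "models__C": [5, 10, 15]},
    {
        "models": ["DecisionTreeClassifier"],
        "models__max_features": range(6, 8),
        "models__max_depth": range(8, 10),
    },
    {"models": ["KNeighborsClassifier"], "models__n_neighbors": [5, 25]},
]

# Each sub-grid contributes the product of its option counts
n_candidates = sum(prod(len(values) for values in sub.values()) for sub in grid_shape)
n_folds = 5
n_fits = n_candidates * n_folds

print(n_candidates, "candidates,", n_fits, "fits")
```

The count is 3 + 2 * 2 + 2 = 9 candidates, or 45 fits with 5 folds.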

In [None]:
grid_search = GridSearchCV(
    pipe_final,
    param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=1
)

In [None]:
%%time
# run the hyperparameter search
with joblib.parallel_backend("threading"):
    grid_search.fit(X_train_tf_idf, y_train)

print('Best model and its parameters:\n\n', grid_search.best_estimator_)
print('Mean cross-validated F1 of the best model:', grid_search.best_score_)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best model and its parameters:

 Pipeline(steps=[('models',
                 LogisticRegression(C=15, random_state=42,
                                    solver='liblinear'))])
Mean cross-validated F1 of the best model: 0.766682041196146
CPU times: user 13min 24s, sys: 6min 9s, total: 19min 34s
Wall time: 19min 36s


In [None]:
results = pd.DataFrame(grid_search.cv_results_)
results.sort_values(by='rank_test_score', inplace=True)
results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_models,param_models__C,param_models__max_depth,param_models__max_features,param_models__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
2,13.785714,0.686751,0.011516,0.000972,"LogisticRegression(C=15, random_state=42, solv...",15.0,,,,"{'models': LogisticRegression(C=15, random_sta...",0.765125,0.759722,0.778465,0.760439,0.769659,0.766682,0.006893,1
1,14.92907,0.932095,0.01043,0.001102,"LogisticRegression(C=15, random_state=42, solv...",10.0,,,,"{'models': LogisticRegression(C=15, random_sta...",0.764842,0.758653,0.778496,0.76,0.767104,0.765819,0.00705,2
0,13.167403,0.761498,0.010342,0.001279,"LogisticRegression(C=15, random_state=42, solv...",5.0,,,,"{'models': LogisticRegression(C=15, random_sta...",0.76112,0.75232,0.773723,0.756976,0.76061,0.76095,0.007121,3
7,0.030183,0.003997,93.850744,0.66345,KNeighborsClassifier(),,,,5.0,"{'models': KNeighborsClassifier(), 'models__n_...",0.249917,0.250171,0.25611,0.251768,0.243852,0.250364,0.00394,4
8,0.032839,0.000534,95.767897,1.770554,KNeighborsClassifier(),,,,25.0,"{'models': KNeighborsClassifier(), 'models__n_...",0.117375,0.116647,0.112273,0.103623,0.094716,0.108927,0.008628,5
3,0.101449,0.004676,0.028232,0.001807,DecisionTreeClassifier(random_state=42),,8.0,6.0,,{'models': DecisionTreeClassifier(random_state...,0.0,0.001642,0.000821,0.0,0.0,0.000493,0.000657,6
5,0.107041,0.00711,0.030295,0.002285,DecisionTreeClassifier(random_state=42),,9.0,6.0,,{'models': DecisionTreeClassifier(random_state...,0.0,0.001642,0.000821,0.0,0.0,0.000493,0.000657,6
4,0.103524,0.003744,0.027805,0.0018,DecisionTreeClassifier(random_state=42),,8.0,7.0,,{'models': DecisionTreeClassifier(random_state...,0.0,0.0,0.000821,0.0,0.0,0.000164,0.000329,8
6,0.110142,0.007792,0.028804,0.002238,DecisionTreeClassifier(random_state=42),,9.0,7.0,,{'models': DecisionTreeClassifier(random_state...,0.0,0.0,0.000821,0.0,0.0,0.000164,0.000329,8


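The table above ranks all candidates by mean cross-validated F1: the three LogisticRegression configurations lead (0.761 to 0.767), KNeighborsClassifier reaches at most 0.25, and DecisionTreeClassifier stays close to zero.

Check the quality of the best model on the test set.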

In [None]:
pred = grid_search.best_estimator_.predict(X_test_tf_idf)

print("F1 on the test set:", f1_score(y_test, pred))

F1 on the test set: 0.7802182621909103


## Conclusions

The following steps were carried out to classify comments as toxic or non-toxic:
- the data was split at random into training (75%) and test (25%) sets; the split was not stratified, although toxic comments make up only about 10% of the dataset,
- the texts were cleaned of non-letter characters and lemmatized with spaCy (`en_core_web_sm`), and stop words from the NLTK English list were filtered out during vectorization,
- a feature set for model training was built with TfidfVectorizer,
- several models were trained: LogisticRegression, DecisionTreeClassifier, and KNeighborsClassifier,
- the F1 score of the best model was measured on the test set.

In cross-validation, KNeighborsClassifier performed poorly (F1 of about 0.25 at best, with n_neighbors=5) and DecisionTreeClassifier scored close to zero. The best model was LogisticRegression with regularization parameter C=15: its mean cross-validated F1 was 0.767, and on the test set it reached 0.78, which is above the required 0.75.