<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Project for "Wikishop"

The online store "Wikishop" is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. We have a dataset in which the edits are labeled for toxicity.

**Goal:**
Build a model that finds toxic comments (classifies comments as positive or negative, that is, non-toxic or toxic) and sends them for moderation, with an *F1* score of at least 0.75.
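
As a reminder of what the 0.75 threshold measures, here is a minimal sketch of F1 as the harmonic mean of precision and recall for the toxic class. The helper name `f1_from_counts` and the counts are hypothetical and are not taken from the project data.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) for the positive (toxic) class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 toxic comments caught, 20 false alarms, 30 missed
precision, recall, f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

With these made-up counts the score lands just above the project threshold, which shows that F1 can pass 0.75 even when a noticeable share of toxic comments is missed.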

**Project steps**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column holds the comment text, and *toxic* is the target.

## Preparation

In [1]:
import sys  # must be imported before {sys.executable} can be interpolated

!{sys.executable} -m pip install nltk
!{sys.executable} -m pip install spacy
!{sys.executable} -m spacy download en_core_web_sm


In [2]:
import re
import sys
import warnings

import nltk
import pandas as pd
import spacy
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Silence library warnings to keep the cell output readable
warnings.filterwarnings('ignore')

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
try:
    try:
        # local copy; raw string so that "\t" is not read as a tab character
        comments = pd.read_csv(r'\datasets\toxic_comments.csv')
    except FileNotFoundError:
        # hosted copy
        comments = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    print('Could not read the file. The path may be incorrect.')

In [4]:
comments.head()

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


In [5]:
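def clean_data(row):
    """Normalize a comment: drop line breaks, keep only Latin letters and spaces, lowercase."""
    row = re.sub(r"(?:\n|\r)", " ", row)           # line breaks -> spaces
    row = re.sub(r"[^a-zA-Z ]+", "", row).strip()  # remove everything except letters and spaces
    row = row.lower()
    return row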
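
To see what this cleaning step does to a comment, here is a small sketch on made-up strings. It repeats `clean_data` as defined in the cell above. The second helper, `clean_data_spaced`, is a hypothetical variant that is not part of the project; it only illustrates one trade-off of deleting punctuation outright.

```python
import re


def clean_data(row):
    row = re.sub(r"(?:\n|\r)", " ", row)
    row = re.sub(r"[^a-zA-Z ]+", "", row).strip()
    return row.lower()


def clean_data_spaced(row):
    # Hypothetical variant: replace every non-letter run with one space
    row = re.sub(r"[^a-zA-Z]+", " ", row)
    return row.strip().lower()


sample = "D'aww! He matches\nthis background colour."
print(clean_data(sample))

# Punctuation between words is deleted, so the words get glued together
glued = clean_data("well-known,problem")
print(glued)
print(clean_data_spaced("well-known,problem"))
```

The variant keeps word boundaries but splits contractions into fragments ("i'm" becomes "i m"), so neither choice is strictly better.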

In [6]:
comments['text'] = comments['text'].apply(clean_data)
comments.head()

,Unnamed: 0,text,toxic
0,0,explanation why the edits made under my userna...,0
1,1,daww he matches this background colour im seem...,0
2,2,hey man im really not trying to edit war its j...,0
3,3,more i cant make any real suggestions on impro...,0
4,4,you sir are my hero any chance you remember wh...,0


In [7]:
comments.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [8]:
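# Load the spaCy model once; loading it inside the function would repeat the work on every call
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def lemmatize(text):
    """Return the text with every token replaced by its lemma."""
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)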
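
The later cells call `lemmatize` on element `[0]` of each corpus only, so the remaining comments go into TF-IDF without lemmatization. A minimal sketch of lemmatizing a whole corpus in batches with spaCy's `nlp.pipe` could look like the following. The helper names `load_nlp` and `lemmatize_corpus` and the batch size are assumptions, and the scores reported in this notebook were not produced this way.

```python
def load_nlp(model_name="en_core_web_sm"):
    """Load the spaCy model once; the parser and NER are not needed for lemmas."""
    import spacy  # imported lazily so lemmatize_corpus can be used with any pipeline object

    return spacy.load(model_name, disable=["parser", "ner"])


def lemmatize_corpus(texts, nlp, batch_size=1000):
    """Lemmatize every text; nlp.pipe streams the documents through the model in batches."""
    return [
        " ".join(token.lemma_ for token in doc)
        for doc in nlp.pipe(texts, batch_size=batch_size)
    ]


# Usage with the notebook's data (requires the en_core_web_sm model):
# nlp = load_nlp()
# train_corpus = lemmatize_corpus(df_train['text'], nlp)
```

On roughly 160 thousand comments this can still take a while, which may be why the notebook limits lemmatization to a single document.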

In [10]:
df_train_valid, df_test = train_test_split(comments, 
                                           test_size = 0.1,
                                           random_state = 42)

In [11]:
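# The rows were already shuffled by the first split, so this split only cuts off the tail;
# random_state has no effect when shuffle=False and is omitted
df_train, df_valid = train_test_split(df_train_valid,
                                      shuffle=False,
                                      test_size=0.25)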

In [12]:
print(df_train.shape, df_valid.shape, df_test.shape)

(107521, 3) (35841, 3) (15930, 3)
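
The shapes above can be reproduced by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of rows, which matches the printed sizes; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test part up."""
    n_test = math.ceil(n_samples * test_size)
    return n_samples - n_test, n_test


n_total = 159292  # number of rows reported by comments.info()
n_train_valid, n_test = split_sizes(n_total, 0.1)
n_train, n_valid = split_sizes(n_train_valid, 0.25)

# Share of the full dataset that ends up in each part
shares = [round(n / n_total, 3) for n in (n_train, n_valid, n_test)]
print(n_train, n_valid, n_test, shares)
```

So the two consecutive splits give roughly 67.5% of the data for training, 22.5% for validation, and 10% for testing.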


In [13]:
features_train = df_train.drop('toxic', axis=1).values
target_train = df_train['toxic'].values

In [14]:
features_valid = df_valid.drop('toxic', axis=1).values
target_valid = df_valid['toxic'].values

In [15]:
features_test = df_test.drop('toxic', axis=1).values
target_test = df_test['toxic'].values

In [16]:
print(features_train.shape, features_valid.shape, features_test.shape, target_train.shape[0])

(107521, 2) (35841, 2) (15930, 2) 107521


In [17]:
train_corpus = df_train['text'].values.astype('U')

In [20]:
# Lemmatizes only the first document; the rest of the corpus stays as cleaned text
train_corpus[0] = lemmatize(str(train_corpus[0]))
train_corpus

array(['stop delete image without first consulting   quadell you need to stop delete image without first consulting with the person who post they that be bad behavior on your part cease and desist if you think there s a problem first discuss it with the person involve do not take matter into your own hand and just take action do that show disrespect to other and make people angry   ledastray',
       'controversial material of any kind that is unsourced or poorly sourced must be removed immediately especially if potentially libelous blp rules were applied it seems',
       'image source info   hi there as i was clearing out catns i came across this image imageeichmann trial  in glass boxjpg that you uploaded is it possible that you could provide source info for it so that it wont be deleted under csd i at a later date sorry to bother  rw',
       ...,
       'about the peaceful loving gonna destroy the uk as a whole because of icecream religion of peace that is islam',
       'jps and 

In [21]:
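# Pass the stop words as a list: recent scikit-learn versions reject a set here
stopwords = nltk_stopwords.words('english')
count_tf_idf = TfidfVectorizer(stop_words=stopwords, ngram_range=(1,1))
# Fit the vocabulary and IDF weights on the training corpus only
tf_idf_train = count_tf_idf.fit_transform(train_corpus)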
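
The vectorizer is fitted on the training corpus and later only applied (`transform`) to the validation and test corpora. A small sketch on made-up sentences shows what that implies: the vocabulary is fixed at fit time, and words that appear only in new text are ignored. The documents and variable names here are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, not taken from the project data
train_docs = ["you are great", "you are rude", "great article"]
new_docs = ["rude and unknown words"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_new = vectorizer.transform(new_docs)          # reuses them, learns nothing new

vocab = sorted(vectorizer.vocabulary_)
print(vocab)
print(X_train.shape, X_new.shape, X_new.nnz)
```

Only "rude" from the new sentence is in the learned vocabulary, so the new row has a single non-zero entry.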

In [23]:
valid_corpus = df_valid['text'].values.astype('U')

# Lemmatizes only the first document; the rest of the corpus stays as cleaned text
valid_corpus[0] = lemmatize(str(valid_corpus[0]))
valid_corpus

array(['ps can you send I an empty email just for reestablish contact I have a question to you',
       'i dont care i have at least one proof link you are so stupid that you dont know how to translate my link and of couse you dont have any argument or link additionaly you must read discussion page and again t dont have any ammo in turret',
       'no it would be a new tmeplate and the terrorism template could either be below the new template or be a link inside the new template i do not agree either with the fact that all islamist terrorism related entries branch off from this one they are simply related to this one  then why have a muhammad template all articles related to muhammad also branch off from islam correct there you go  talkcontribs',
       ...,
       'john bull vs adolf hitler     look at the pictures of these two people what do you think',
       'im sorry tip the deduction was obvious and wrong the article as written did not contain any links to reliable sources depend

In [24]:
tf_idf_valid = count_tf_idf.transform(valid_corpus)

In [25]:
test_corpus = df_test['text'].values.astype('U')
test_corpus

array(['sometime back i just happened to log on to wwwizoomin with a friends reference and i was amazed to see the concept fresh ideas entertainment has come up with so many deals all under one roof this website is very user friendly and easy to use and is fun to be on you have gossip games facts another exciting feature to add to it is face of the week every week  new faces are selected and put up as izoom faces its great to have been selected in four out of a group of millions  this new start up has already got many a deals in its kitty few of them being thefortune hotel the beach are my personal favorites izoomin has a usp of mobile coupons coupons are available even when a user cannot access internet you just need to sms izoom support to  and you get attended immediately all i can say is izoomin is a must visit website for everyone before they go out for shopping or dining or for outing cheers',
       'the latest edit is much better dont make this article state super at all',
    

In [27]:
# Lemmatizes only the first document; the rest of the corpus stays as cleaned text
test_corpus[0] = lemmatize(str(test_corpus[0]))
test_corpus

array(['sometime back I just happen to log on to wwwizoomin with a friend reference and I be amazed to see the concept fresh idea entertainment have come up with so many deal all under one roof this website be very user friendly and easy to use and be fun to be on you have gossip game fact another exciting feature to add to it be face of the week every week   new face be select and put up as izoom face its great to have be select in four out of a group of million   this new start up have already get many a deal in its kitty few of they be thefortune hotel the beach be my personal favorite izoomin have a usp of mobile coupon coupon be available even when a user can not access internet you just need to sms izoom support to   and you get attend you all I can say be izoomin be a must visit website for everyone before they go out for shopping or dining or for outing cheer',
       'the latest edit is much better dont make this article state super at all',
       'october  utc  i would think

In [28]:
tf_idf_test = count_tf_idf.transform(test_corpus)

## Training

### LogisticRegression

In [29]:
%%time
lr = LogisticRegression(random_state=42, 
                        solver='liblinear',
                        max_iter=100
                       )

params = {'penalty':['l1', 'l2'],
          'C':list(range(1,15,3)) 
         }

lr_gs = GridSearchCV(lr,
                     params,
                     cv=3,
                     scoring='f1',
                     verbose=True).fit(tf_idf_train, target_train)

print("Best Params", lr_gs.best_params_)
print("Best Score", lr_gs.best_score_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Params {'C': 4, 'penalty': 'l1'}
Best Score 0.7675195516618311
CPU times: user 2min 2s, sys: 1min 20s, total: 3min 23s
Wall time: 3min 23s
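
The grid search above runs on a TF-IDF matrix fitted on the whole training set, so in every cross-validation fold the IDF weights also reflect the held-out part. One possible alternative, sketched here on hypothetical toy data, is to put the vectorizer and the classifier into a `Pipeline` so that TF-IDF is refitted inside each fold. The step names `tfidf` and `clf` are arbitrary, and this is not how the scores in this notebook were obtained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data for illustration only
texts = [
    "you are an idiot", "thanks for the helpful edit",
    "stupid idiot go away", "great article well sourced",
    "what a stupid page", "please add a source here",
    "idiot stop editing", "nice work on the article",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", random_state=42)),
])

# Same grid as in the notebook; pipeline parameters use the "<step>__<param>" form
param_grid = {"clf__penalty": ["l1", "l2"], "clf__C": list(range(1, 15, 3))}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)
print(search.best_params_)
```

After fitting, `search.best_estimator_` holds the refitted pipeline and accepts raw (cleaned) text directly.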


In [30]:
lr_best = LogisticRegression(random_state=42,
                             class_weight = 'balanced',
                             C = 4,
                             penalty = 'l1',
                             solver='liblinear',
                             max_iter=100)

lr_best.fit(tf_idf_train, target_train)

LogisticRegression(C=4, class_weight='balanced', penalty='l1', random_state=42,
                   solver='liblinear')
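
Unlike the grid search, the final model also sets `class_weight='balanced'`. The scikit-learn documentation describes this mode as weighting each class by `n_samples / (n_classes * count)`. The sketch below computes these weights for a hypothetical 90/10 label split; the real class proportions of the dataset are not shown in this notebook.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


labels = [0] * 90 + [1] * 10  # hypothetical imbalance, not the project data
weights = balanced_class_weights(labels)
print(weights)
```

The rare class gets the larger weight, so errors on toxic comments cost more during training.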

In [31]:
pred1 = lr_best.predict(tf_idf_valid)
print(f'F1 of the LogisticRegression model on the validation set: {f1_score(target_valid, pred1)}')

F1 of the LogisticRegression model on the validation set: 0.7683500826866811


### DecisionTreeClassifier

In [32]:
%%time
tree = DecisionTreeClassifier(random_state = 42)
params = {'criterion':['gini', 'entropy'],
          'max_depth':list(range(1,15,5)) 
         }

tree_gs = GridSearchCV(tree,
                       params,
                       cv=3,
                       scoring='f1',
                       verbose=True).fit(tf_idf_train, target_train)

print("Best parameters", tree_gs.best_params_)
print("Best F1 of the DecisionTreeClassifier model", tree_gs.best_score_)

Fitting 3 folds for each of 6 candidates, totalling 18 fits
Best parameters {'criterion': 'gini', 'max_depth': 11}
Best F1 of the DecisionTreeClassifier model 0.5820756016268165
CPU times: user 3min 18s, sys: 1.09 s, total: 3min 20s
Wall time: 3min 20s


In [33]:
tree_best = DecisionTreeClassifier(random_state = 42,
                                   criterion='gini',
                                   max_depth=11)

tree_best.fit(tf_idf_train, target_train)

DecisionTreeClassifier(max_depth=11, random_state=42)

In [34]:
pred2 = tree_best.predict(tf_idf_valid)
print(f'F1 of the DecisionTreeClassifier model on the validation set: {f1_score(target_valid, pred2)}')

F1 of the DecisionTreeClassifier model on the validation set: 0.5971512052593133


**The LogisticRegression model meets the project requirement: its F1 exceeds the 0.75 threshold both in cross-validation on the training set (0.768) and on the validation set (0.768), while DecisionTreeClassifier reaches only about 0.60 on validation. Let's check the LogisticRegression model on the test set:**

In [35]:
pred1 = lr_best.predict(tf_idf_test)      

In [36]:
f1_lr = f1_score(target_test, pred1)     
f1_lr

0.7768643570170287

## Conclusions

**The 0.75 threshold is exceeded: F1 on the test set is 0.78.**
The best model is logistic regression (`solver='liblinear'`, `class_weight='balanced'`) with tuned hyperparameters:
- C = 4,
- penalty = 'l1'

The project goal is achieved: the customer can find toxic comments (classify comments as positive or negative) and send them for moderation.
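
As a rough illustration of the moderation step, the sketch below routes new comments through the same stages as the notebook: cleaning, TF-IDF, prediction. The function name `flag_for_moderation` is hypothetical. It takes the cleaning function, the fitted vectorizer, and the fitted model as arguments and does not describe any actual Wikishop integration.

```python
def flag_for_moderation(texts, clean, vectorizer, model):
    """Return the original texts that the model labels as toxic (class 1)."""
    cleaned = [clean(text) for text in texts]
    predictions = model.predict(vectorizer.transform(cleaned))
    return [text for text, label in zip(texts, predictions) if label == 1]


# Usage with the objects from this notebook (new_comments is a hypothetical list of strings):
# flagged = flag_for_moderation(new_comments, clean_data, count_tf_idf, lr_best)
```

Returning the original text rather than the cleaned one keeps the comment readable for the moderator.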

## Review checklist

- [x]  Jupyter Notebook opens
- [x]  All code runs without errors
- [x]  Code cells are in execution order
- [x]  Data is loaded and prepared
- [x]  Models are trained
- [x]  The *F1* score is at least 0.75
- [x]  Conclusions are written