<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Загрузка-и-изучение-данных" data-toc-modified-id="Загрузка-и-изучение-данных-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading and exploring the data</a></span><ul class="toc-item"><li><span><a href="#Промежуточный-вывод" data-toc-modified-id="Промежуточный-вывод-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Interim conclusion</a></span></li></ul></li><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Избавимся-от-лишних-символов" data-toc-modified-id="Избавимся-от-лишних-символов-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Removing extraneous characters</a></span></li><li><span><a href="#Проверка-на-дубликаты" data-toc-modified-id="Проверка-на-дубликаты-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Checking for duplicates</a></span></li><li><span><a href="#Проверка-на-пропуски" data-toc-modified-id="Проверка-на-пропуски-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Checking for missing values</a></span></li><li><span><a href="#Токкенизация,-лемматизация-и-ссылки" data-toc-modified-id="Токкенизация,-лемматизация-и-ссылки-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Tokenization, lemmatization, and links</a></span></li><li><span><a href="#Разделение-на-выборки" data-toc-modified-id="Разделение-на-выборки-2.5"><span class="toc-item-num">2.5&nbsp;&nbsp;</span>Splitting into samples</a></span></li><li><span><a href="#Вычисление-TF-IDF" data-toc-modified-id="Вычисление-TF-IDF-2.6"><span class="toc-item-num">2.6&nbsp;&nbsp;</span>Computing TF-IDF</a></span></li><li><span><a href="#Промежуточный-вывод" data-toc-modified-id="Промежуточный-вывод-2.7"><span class="toc-item-num">2.7&nbsp;&nbsp;</span>Interim conclusion</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-3"><span 
class="toc-item-num">3&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Логистическая-регрессия" data-toc-modified-id="Логистическая-регрессия-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Случайный-лес" data-toc-modified-id="Случайный-лес-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Random forest</a></span></li><li><span><a href="#Промежуточный-вывод" data-toc-modified-id="Промежуточный-вывод-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Interim conclusion</a></span></li></ul></li><li><span><a href="#Тестирование" data-toc-modified-id="Тестирование-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Testing</a></span></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusions</a></span></li><li><span><a href="#Чек-лист-проверки" data-toc-modified-id="Чек-лист-проверки-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Review checklist</a></span></li></ul></div>

# Toxic comment classification project

An online store is launching a new service that lets users edit and extend product descriptions, much like a wiki community. Customers can propose their own edits and comment on changes made by others. To keep the service safe, the store needs a tool that automatically detects toxic comments and sends them for moderation.

In this project we train a model that classifies comments as positive or negative (that is, non-toxic or toxic). A dataset of edits labeled for toxicity is available for this purpose.

# Project plan

1. Load and explore the data
2. Prepare the data
  - Remove extraneous text elements so that duplicates can be found
  - Check for missing values
  - Tokenize and lemmatize the text
  - Split the data into training, validation, and test sets
  - Compute TF-IDF
3. Tune hyperparameters for several models with the Optuna framework and pick the best model
4. Evaluate the model on the test set

## Loading and exploring the data

In [1]:
!pip -q install optuna

In [2]:
import pandas as pd
import numpy as np

import optuna
import re
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.dummy import DummyClassifier
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from tqdm import notebook
from pymystem3 import Mystem
from tqdm import tqdm
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer 

In [3]:
pd.set_option('display.max_colwidth', None)

In [5]:
df.head()

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0


In [6]:
df.shape

(159292, 2)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


Let's examine the class balance.

In [8]:
df['toxic'].value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

### Interim conclusion

- The dataframe contains 159,292 rows.
- Each row holds a comment text and a label indicating whether the text is toxic.
- The classes are imbalanced: 143,106 non-toxic texts versus 16,186 toxic ones (about 10% toxic).

## Preparation

### Removing extraneous characters

- Since our task is to train a model to detect toxicity, we do not need punctuation, repeated spaces, or newline characters.
- Removing these elements reduces the number of tokens and helps reveal duplicates.
- To detect duplicates, it also helps to convert the text to lowercase.

In [9]:
def prepare_text(text):
    # Convert the text to lowercase
    text = text.lower()
    # Replace punctuation with spaces
    text = re.sub(r"[^\w\s]", " ", text)
    # Replace newline characters with spaces
    text = text.replace('\n', ' ')
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

In [10]:
df['text'] = df['text'].apply(prepare_text)
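
To make the effect of the cleaning step concrete, here is a small sketch that applies the same function to a made-up comment (the sample strings are illustrative, not taken from the dataset). It also shows why cleaning helps with duplicate detection: two differently punctuated texts collapse to the same string.

```python
import re


def prepare_text(text):
    # Same steps as in the notebook: lowercase, strip punctuation, normalize whitespace
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = text.replace("\n", " ")
    text = re.sub(r"\s+", " ", text)
    return text


# Hypothetical sample comment
sample = "Hey,  man!\nI'm REALLY not trying to edit-war..."
cleaned = prepare_text(sample)
print(repr(cleaned))  # 'hey man i m really not trying to edit war '

# Differently punctuated variants become identical after cleaning
same_after_cleaning = prepare_text("Thanks!!!") == prepare_text("thanks?")
print(same_after_cleaning)  # True
```

Note that the function does not call `strip()`, so a trailing space can remain; this does not affect the duplicate check as long as every text goes through the same function.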

### Checking for duplicates

In [11]:
df.duplicated().sum()

607

In [12]:
df['text'].duplicated().sum()

629

There are 607 fully duplicated rows but 629 duplicates in the `text` column. The 22 extra rows (629 - 607) repeat an earlier text with a different target value, so some identical texts carry conflicting labels. We need to manually drop the duplicate rows whose target value is incorrect.
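
The gap between the two duplicate counts can be reproduced on a toy dataframe. This is a minimal sketch with made-up data; the variable names `toy`, `labels_per_text`, and `conflicting` are assumptions used only for illustration.

```python
import pandas as pd

# Toy data: "a" is a full duplicate, "b" appears twice with conflicting labels
toy = pd.DataFrame({
    "text": ["a", "a", "b", "b", "c"],
    "toxic": [0, 0, 0, 1, 1],
})

full_dupes = int(toy.duplicated().sum())          # rows repeating both text and label
text_dupes = int(toy["text"].duplicated().sum())  # rows repeating the text only

# Texts that carry more than one distinct label
labels_per_text = toy.groupby("text")["toxic"].nunique()
conflicting = labels_per_text[labels_per_text > 1].index.tolist()

print(full_dupes, text_dupes, conflicting)  # 1 2 ['b']
```

The difference `text_dupes - full_dupes` counts rows whose text was seen before but with another label, which is the group that needs manual review.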

In [13]:
text_toxic_counts = df.groupby('text')['toxic'].nunique()

In [14]:
duplicates = df[df['text'].isin(text_toxic_counts[text_toxic_counts > 1].index)].sort_values(by='text')

In [15]:
duplicates

Unnamed: 0,text,toxic
119123,,1
137547,,0
98503,ahhhh it always feels good to have the last word to get in the last word although i don t sign my edits you can see that from the sinbot that i am the person that started this discussion of formatting the table lol yes good to see that everybody has more or less agreed upon a non descript colour format fair enough better than no colour formatting for headings jack merridew i see that you went and posted somewhere else that miley cyrus s fans can t have their own colour scheme idiot i had never even heard of this miley cyrus creature until i happened upon her page and to be quite frank i don t think i would like her music if i were to hear it i m 55 years old nitwit but you re a know it all aren t you jack you know how i know that because i went and looked at your edits and you apparently spend half of every day editing wikipedia entries and most of your edits are to do with table formatting and font formatting that you don t like please get a life,0
56063,ahhhh it always feels good to have the last word to get in the last word although i don t sign my edits you can see that from the sinbot that i am the person that started this discussion of formatting the table lol yes good to see that everybody has more or less agreed upon a non descript colour format fair enough better than no colour formatting for headings jack merridew i see that you went and posted somewhere else that miley cyrus s fans can t have their own colour scheme idiot i had never even heard of this miley cyrus creature until i happened upon her page and to be quite frank i don t think i would like her music if i were to hear it i m 55 years old nitwit but you re a know it all aren t you jack you know how i know that because i went and looked at your edits and you apparently spend half of every day editing wikipedia entries and most of your edits are to do with table formatting and font formatting that you don t like please get a life,0
122688,ahhhh it always feels good to have the last word to get in the last word although i don t sign my edits you can see that from the sinbot that i am the person that started this discussion of formatting the table lol yes good to see that everybody has more or less agreed upon a non descript colour format fair enough better than no colour formatting for headings jack merridew i see that you went and posted somewhere else that miley cyrus s fans can t have their own colour scheme idiot i had never even heard of this miley cyrus creature until i happened upon her page and to be quite frank i don t think i would like her music if i were to hear it i m 55 years old nitwit but you re a know it all aren t you jack you know how i know that because i went and looked at your edits and you apparently spend half of every day editing wikipedia entries and most of your edits are to do with table formatting and font formatting that you don t like please get a life,1
124302,blocking mardyks excellent work there shii we can t have his kind getting us to think about what the maya actually say about their own prophecies we insulted him offended him and abused him and he just had to be ethical and persistant block those mother fukkers taking out the entire santa fe public library system is a great preemptive strike also there may be others of his kind that sympathize with those indians these people actually love the earth and that is without reliable sources we kicked their asses and have the right to write their history and interpret their sacred teachings however we please we need more from college students who have been indoctrinated in the church of academia that piece by stitler is one of the most exaggerated and opinionated and so yeah use that as the title of the page and by all means give john major jenkins his own section not a single scholar or mayanists agrees with his appropriated theory and this kind of hypocrisy and arrogance is what wiki is all about we can get away with it by continuing to use our power to censor free thinkers like mardyks and his kind sony pictures is paying us all off with tickets so let us know how many you want free popcorn too whoopee best wishes from jimini cricket 97 123 26 228,0
107445,blocking mardyks excellent work there shii we can t have his kind getting us to think about what the maya actually say about their own prophecies we insulted him offended him and abused him and he just had to be ethical and persistant block those mother fukkers taking out the entire santa fe public library system is a great preemptive strike also there may be others of his kind that sympathize with those indians these people actually love the earth and that is without reliable sources we kicked their asses and have the right to write their history and interpret their sacred teachings however we please we need more from college students who have been indoctrinated in the church of academia that piece by stitler is one of the most exaggerated and opinionated and so yeah use that as the title of the page and by all means give john major jenkins his own section not a single scholar or mayanists agrees with his appropriated theory and this kind of hypocrisy and arrogance is what wiki is all about we can get away with it by continuing to use our power to censor free thinkers like mardyks and his kind sony pictures is paying us all off with tickets so let us know how many you want free popcorn too whoopee best wishes from jimini cricket 97 123 26 228,1
142055,honestly that was not threats of harassment maybe it sounds like it but you don t know about the actual situation and what happened in the past gvnayr is a very dangerous wikipedian i can deepen it i m editing wikipedia since 2010 and i can say that he is the lamest weirdest most dangerous etc in my opinion being blocked indefinitely because of this is not fair 1 2 months is more fair sadly blocking this user hydao who always fought against vandalism and addition of nonsense things will not solve the problem sooner the user gvnayr will start adding completely stupid and nonsense things on wikipedia which is unacceptable i will continue fighting against vandalism and nonsense stuff added by losers nolifers like gvnayr with or without hydao now you are thinking hydao you are insulting a user by calling loser and no lifer but that s not an insult it s just the reality a high functioning autistic who doesn t work doesn t have a job doesn t have friends use the government pension to buy playstation games and pay the internet bill someone who spend his days months years doing nothing useful except making frequently stupid edits on wikipedia and wasting other ppl precious time is a big loser who needs a life lesson i don t mind if this hydao is blocked the only thing i know is that gvnayr needs an ultimate lesson well at least i can say that hydao died or is dying for a decent noble cause,1
42161,honestly that was not threats of harassment maybe it sounds like it but you don t know about the actual situation and what happened in the past gvnayr is a very dangerous wikipedian i can deepen it i m editing wikipedia since 2010 and i can say that he is the lamest weirdest most dangerous etc in my opinion being blocked indefinitely because of this is not fair 1 2 months is more fair sadly blocking this user hydao who always fought against vandalism and addition of nonsense things will not solve the problem sooner the user gvnayr will start adding completely stupid and nonsense things on wikipedia which is unacceptable i will continue fighting against vandalism and nonsense stuff added by losers nolifers like gvnayr with or without hydao now you are thinking hydao you are insulting a user by calling loser and no lifer but that s not an insult it s just the reality a high functioning autistic who doesn t work doesn t have a job doesn t have friends use the government pension to buy playstation games and pay the internet bill someone who spend his days months years doing nothing useful except making frequently stupid edits on wikipedia and wasting other ppl precious time is a big loser who needs a life lesson i don t mind if this hydao is blocked the only thing i know is that gvnayr needs an ultimate lesson well at least i can say that hydao died or is dying for a decent noble cause,0
117381,i know what your world wide conspiracy you are involved in is ian as you speak about on your user page you are a member of the chruch of satan and part of a world wide masonic conspiracy for a holocaust of christians you were quick to remove the truth about the colors of the church of satan logo of the red and purple being the colors the whore of babylon is said to be clothed in in revelation a dead give away that you are not a christian as you claim you are ian thomson is satanist 777 in multiples of 3 triple seven is god word 777 in multiples of 3 the alphanumerics from yah 777 in multiples of 3 satanists hate yehovah 777 in multiples of 3 freemasons hate yehovah god 777 in multiples of 3 yehovah loves yeshuwa 777 in multiples of 3 yeshuwa loves yehovah 777 in multiples of 3 yehovah is truly great 777 in multiples of 3 yehovah god is very great 777 in multiples of 3 yehovah is yeshuwa s god 777 in multiples of 3 yeshuwa first creation 777 in multiples of 3 christian religion is good 777 in multiples of 3 yehovah favored sam a moser 777 in multiples of 3 a 3 b 6 c 9 d 12 e 15 f 18 and so on all the way to z in multiples of 3 i am sam a moser see this link http groups google com group alt support depression manic browse_thread thread 3c3f7a279fba92cd i love yehovah god,0


Save the indices of the incorrectly labeled duplicate rows.

In [16]:
id_with_wrong_target = [119123, 137547, 98503, 56063, 124302, 42161,
                        117381, 120135, 102049, 139663, 133218, 148154,
                        151479, 98566, 3965, 66690, 64515, 78027, 61630,
                        83816, 59363, 34087, 80616, 8924, 84519, 107578]

In [17]:
df = df.drop(index=id_with_wrong_target)

Drop the remaining duplicates.

In [18]:
df = df.drop_duplicates()

In [19]:
df['text'].duplicated().sum()

0

### Checking for missing values

In [20]:
df.isna().sum()

text     0
toxic    0
dtype: int64

### Tokenization, lemmatization, and links

In [21]:
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [22]:
def get_wordnet_pos(word):
    
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

Let's combine tokenization, link removal, and lemmatization into a single function.

In [23]:
def preprocess_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Regular expression for matching URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    # Drop tokens that look like links
    # (prepare_text has already replaced punctuation with spaces, so tokens no longer
    # contain '://' or '.', and this filter is effectively a no-op at this point)
    tokens_no_links = [token for token in tokens if not re.match(url_pattern, token)]
    # Lemmatize alphabetic tokens using their part-of-speech tag; keep other tokens as-is
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token, pos=get_wordnet_pos(token)) if token.isalpha() else token for token in tokens_no_links]
    # Join the lemmas back into a single string
    string = ' '.join(lemmas)
    
    return string

In [25]:
df['lemmas'] = df['text'].apply(preprocess_text)
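
In the duplicates output above, a link survives as plain words (`http groups google com ...`), which suggests that punctuation was stripped before the URL filter ran. The sketch below shows one possible alternative ordering, where links are removed first. It is not the notebook's method; the function name `strip_urls_then_clean` and the sample string are hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def strip_urls_then_clean(text):
    # Remove links while '://' and '.' are still present in the text
    text = URL_PATTERN.sub(" ", text.lower())
    # Then apply the same punctuation and whitespace cleanup as before
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical raw comment containing a link
raw = "See this link: http://groups.google.com/group/alt.support now"
result = strip_urls_then_clean(raw)
print(result)  # see this link now
```

Adopting this ordering would change the cleaned texts, so the duplicate counts and model scores reported in this notebook would need to be recomputed.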

In [27]:
df.head()

Unnamed: 0,text,toxic,lemmas
0,explanation why the edits made under my username hardcore metallica fan were reverted they weren t vandalisms just closure on some gas after i voted at new york dolls fac and please don t remove the template from the talk page since i m retired now 89 205 38 27,0,explanation why the edits make under my username hardcore metallica fan be revert they weren t vandalism just closure on some gas after i vote at new york doll fac and please don t remove the template from the talk page since i m retire now 89 205 38 27
1,d aww he matches this background colour i m seemingly stuck with thanks talk 21 51 january 11 2016 utc,0,d aww he match this background colour i m seemingly stuck with thanks talk 21 51 january 11 2016 utc
2,hey man i m really not trying to edit war it s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info,0,hey man i m really not try to edit war it s just that this guy be constantly remove relevant information and talk to me through edits instead of my talk page he seem to care more about the format than the actual info
3,more i can t make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if no one else does first if you have any preferences for formatting style on references or want to do it yourself please let me know there appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up it s listed in the relevant form eg wikipedia good_article_nominations transport,0,more i can t make any real suggestion on improvement i wonder if the section statistic should be later on or a subsection of type of accident i think the reference may need tidy so that they be all in the exact same format ie date format etc i can do that later on if no one else do first if you have any preference for format style on reference or want to do it yourself please let me know there appear to be a backlog on article for review so i guess there may be a delay until a reviewer turn up it s list in the relevant form eg wikipedia good_article_nominations transport
4,you sir are my hero any chance you remember what page that s on,0,you sir be my hero any chance you remember what page that s on


### Splitting into samples

In [28]:
X = df['lemmas']
y = df['toxic']

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=1)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, stratify=y_test, random_state=1)
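
The two-step split yields roughly 70% training, 15% validation, and 15% test data, because the 30% hold-out is halved by the second call. A minimal sketch on synthetic data (1,000 samples with 10% positives, chosen only for illustration) confirms the proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and labels
X = np.arange(1000)
y = np.array([1] * 100 + [0] * 900)

# Step 1: hold out 30% of the data
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)
# Step 2: split the hold-out in half into test and validation sets
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=1
)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)  # (700, 150, 150)
```

With `stratify`, each part keeps approximately the same share of toxic examples as the full dataset.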

### Computing TF-IDF

In [30]:
tf_idf = TfidfVectorizer()

In [31]:
tf_idf_train = tf_idf.fit_transform(X_train) 

In [32]:
tf_idf_val = tf_idf.transform(X_val)
tf_idf_test = tf_idf.transform(X_test)
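
The vectorizer is fitted on the training texts only and then reused for the validation and test texts, so no vocabulary or IDF statistics leak from the held-out data. A small sketch with made-up documents shows the consequence: words unseen during fitting are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lemmatized documents
train_docs = ["you be my hero", "please remove the template"]
val_docs = ["you be an unseen_word hero"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
val_matrix = vectorizer.transform(val_docs)          # reuses them unchanged

vocab = vectorizer.vocabulary_
print(train_matrix.shape, val_matrix.shape)  # (2, 8) (1, 8)
print("unseen_word" in vocab)                # False
```

Both matrices have the same number of columns, which is what lets a model trained on `train_matrix` score `val_matrix`.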

### Interim conclusion

1. We removed extraneous characters and converted the texts to lowercase.
2. We checked for duplicates, found several identical texts with opposite target values, manually reviewed them, dropped the duplicates with an incorrect target value, and then dropped all remaining duplicates.
3. We tokenized the text, removed links, and lemmatized it.
4. We split the data into training, validation, and test sets in a 70/15/15 ratio.
5. We computed TF-IDF.

## Training

Let's try several machine learning models:

- LogisticRegression
- RandomForest

### Logistic regression

In [33]:
def objective_lg(trial):
    
    params = {
        'C': trial.suggest_float('C', 1e-5, 1e5, log=True),
        'solver': trial.suggest_categorical('solver', ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'])
    }
    
    
    model = LogisticRegression(**params, class_weight='balanced')
    
    model.fit(tf_idf_train, y_train)
    y_pred = model.predict(tf_idf_val)
    
    
    f1 = f1_score(y_val, y_pred)
    
    return f1
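
The `class_weight='balanced'` option compensates for the class imbalance. According to the scikit-learn documentation, it weights each class by `n_samples / (n_classes * count)`. The sketch below applies that formula to the class counts shown earlier in the notebook; the real weights differ slightly, because the model only sees the training split after duplicates are removed.

```python
# Class counts from df['toxic'].value_counts() (before deduplication and splitting)
counts = {0: 143106, 1: 16186}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}
ratio = weights[1] / weights[0]

print({label: round(w, 3) for label, w in weights.items()})
print(round(ratio, 2))  # a toxic example weighs about 8.84 times a non-toxic one
```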

In [34]:
study = optuna.create_study(direction='maximize')
study.optimize(objective_lg, n_trials=10)

[I 2023-11-27 18:23:37,554] A new study created in memory with name: no-name-a4906f1f-9ba4-4900-8955-b035a5210faa
[I 2023-11-27 18:23:55,198] Trial 0 finished with value: 0.7574452816648726 and parameters: {'C': 1.282151746213506, 'solver': 'liblinear'}. Best is trial 0 with value: 0.7574452816648726.
[I 2023-11-27 18:24:36,178] Trial 1 finished with value: 0.7387355920363256 and parameters: {'C': 0.5891026995000157, 'solver': 'lbfgs'}. Best is trial 0 with value: 0.7574452816648726.
[I 2023-11-27 18:24:41,254] Trial 2 finished with value: 0.4799791313421157 and parameters: {'C': 0.00036369552581321413, 'solver': 'newton-cg'}. Best is trial 0 with value: 0.7574452816648726.
[I 2023-11-27 18:24:44,398] Trial 3 finished with value: 0.5166712593000827 and parameters: {'C': 0.0016864731173860453, 'solver': 'liblinear'}. Best is trial 0 with value: 0.7574452816648726.
[I 2023-11-27 18:24:55,086] Trial 4 finished with value: 0.77524061143612 and parameters: {'C': 6.703882298832313, 'solver':

In [35]:
print('Best hyperparameters:', study.best_params)
print('Best F1:', study.best_value)

Best hyperparameters: {'C': 6.703882298832313, 'solver': 'sag'}
Best F1: 0.77524061143612


In [36]:
best_params_lg = study.best_params

### Random forest

In [37]:
def objective_rf(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 10, 400),
        'max_depth': trial.suggest_int('max_depth', 5, 100),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
        'max_features': trial.suggest_categorical('max_features', ['auto', 'sqrt', 'log2'])
    }

    model = RandomForestClassifier(**params, class_weight='balanced', random_state=1)
    
    model.fit(tf_idf_train, y_train)
    y_pred = model.predict(tf_idf_val)
    
    f1 = f1_score(y_val, y_pred)
    
    return f1

In [38]:
study = optuna.create_study(direction='maximize')
study.optimize(objective_rf, n_trials=10)


[I 2023-11-27 18:32:32,073] A new study created in memory with name: no-name-473f7b4a-c7c3-4bc0-815d-2aacfce2927a
[I 2023-11-27 18:33:12,417] Trial 0 finished with value: 0.5425906101850435 and parameters: {'n_estimators': 103, 'max_depth': 92, 'min_samples_split': 15, 'min_samples_leaf': 2, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5425906101850435.
[I 2023-11-27 18:33:23,245] Trial 1 finished with value: 0.4521375464684015 and parameters: {'n_estimators': 97, 'max_depth': 60, 'min_samples_split': 12, 'min_samples_leaf': 17, 'max_features': 'auto'}. Best is trial 0 with value: 0.5425906101850435.
[I 2023-11-27 18:33:28,159] Trial 2 finished with value: 0.4585971748660497 and parameters: {'n_estimators': 36, 'max_depth': 75, 'min_samples_split': 4, 'min_samples_leaf': 14, 'max_features': 'sqrt'}. Best is trial 0 with value: 0.5425906101850435.
[I 2023-11-27 18:33:49,912] Trial 3 finished with value: 0.4175026680896478 and parameters: {'n_estimators': 238, 'max_depth': 44, 

In [39]:
print('Best hyperparameters:', study.best_params)
print('Best F1:', study.best_value)

Best hyperparameters: {'n_estimators': 399, 'max_depth': 96, 'min_samples_split': 9, 'min_samples_leaf': 2, 'max_features': 'auto'}
Best F1: 0.549167452089224


In [40]:
best_params_rf = study.best_params

### Interim conclusion

We tried logistic regression and random forest. Random forest performed poorly (best validation F1 of about 0.549), whereas logistic regression did well, reaching a validation F1 of about 0.775.

## Testing

Let's evaluate the model on the test set and run a sanity check by comparing its accuracy with that of a constant model.

In [41]:
def test_model(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=1)
    recall = recall_score(y_test, y_pred, zero_division=1)
    accuracy = accuracy_score(y_test, y_pred)
    print(f'F1: {f1}\nPrecision: {precision}\nRecall: {recall}\nAccuracy: {accuracy}')

In [42]:
model = LogisticRegression(**best_params_lg, class_weight='balanced')

In [43]:
test_model(model, tf_idf_train, y_train, tf_idf_test, y_test)

F1: 0.7776934749620638
Precision: 0.7180385288966725
Recall: 0.848158874637981
Accuracy: 0.9507542333711501
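
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two values printed above. This sketch uses only the reported numbers, not the model itself.

```python
# Values copied from the test output above
precision = 0.7180385288966725
recall = 0.848158874637981

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.7777
```

The result agrees with the reported F1, and it shows the trade-off the model makes: recall is noticeably higher than precision, so it catches most toxic comments at the cost of some false alarms.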




In [44]:
dummy_model = DummyClassifier(strategy="most_frequent")

In [45]:
test_model(dummy_model, tf_idf_train, y_train, tf_idf_test, y_test)

F1: 0.0
Precision: 1.0
Recall: 0.0
Accuracy: 0.8984411109710492
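
The constant model's precision of 1.0 is an artifact of `zero_division=1` rather than a sign of quality: the model never predicts the toxic class, so precision is undefined and the fallback value is returned. A small sketch on made-up labels illustrates this:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 10% positives, and a model that always predicts the majority class
y_true = [0] * 9 + [1]
y_pred = [0] * 10

p_fallback_one = precision_score(y_true, y_pred, zero_division=1)
p_fallback_zero = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=1)
acc = accuracy_score(y_true, y_pred)

print(p_fallback_one, p_fallback_zero)  # 1.0 0.0
print(rec, acc)                         # 0.0 0.9
```

Accuracy simply equals the share of the majority class, which is why F1 and recall are the more informative metrics for this imbalanced task.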


## Conclusions

In this project we built a model for classifying toxic comments for the online store, covering data preprocessing and model selection.

We considered two models, Logistic Regression and Random Forest, and tuned their hyperparameters with the Optuna library. Logistic Regression was the better model, with a validation F1 of about 0.775, which meets the customer's requirements.

On the test set, Logistic Regression scored F1 0.778, Precision 0.718, Recall 0.848, and Accuracy 0.951. As a sanity check, we compared it with a constant model on Accuracy: the constant model scores 0.898 but has an F1 of 0, since it never predicts the toxic class. These results indicate that the model detects toxic comments effectively.

The model is therefore a reliable tool for moderating comments in the new service.