The Wikishop online store is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset labeled for edit toxicity is available.

Build a model with an *F1* score of at least 0.75.

### Project instructions

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

### Data description

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

# 1. Preparation

First, import all the required libraries.

In [1]:
import pandas as pd
from pymystem3 import Mystem
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
import string
import nltk
import re
import sys
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
import lightgbm as lgb
from sklearn.metrics import accuracy_score, f1_score
from tqdm import notebook


Next, read the provided dataset.

In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv')

Let's look at the data:

In [3]:
df.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


No problems were found in the data: both columns hold 159,571 non-null values.

The functions needed for feature extraction are defined below.

In [5]:
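# English stop words, loaded once instead of on every call.
# Requires the NLTK 'stopwords' corpus to be available.
STOPWORDS = set(nltk.corpus.stopwords.words('english'))

def remove_punct(text):
    # Remove ASCII punctuation characters, then strip all digit runs.
    text = "".join([char for char in text if char not in string.punctuation])
    text = re.sub(r'[0-9]+', '', text)
    return text

def tokenization(text):
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning).
    text = re.split(r'\W+', text)
    return text

def remove_stopwords(text):
    # Drop English stop words from a list of tokens.
    text = [word for word in text if word not in STOPWORDS]
    return text

def stemming(text):
    # Reduce each token to its Porter stem.
    # The raised recursion limit is kept from the original run.
    sys.setrecursionlimit(160000)
    ps = nltk.PorterStemmer()
    text = [ps.stem(word) for word in text]
    return text

def lemmatizer(text):
    # Lemmatize each token with WordNet (default part of speech: noun).
    wn = nltk.WordNetLemmatizer()
    text = [wn.lemmatize(word) for word in text]
    return text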
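
As a quick sanity check, the two cleaning helpers can be tried on a single string. The sketch below repeats `remove_punct` and `tokenization` so that it runs on its own; the sample sentence is an assumption, modeled on the second comment in the dataset.

```python
import re
import string


def remove_punct(text):
    # Drop ASCII punctuation, then strip digit runs.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"[0-9]+", "", text)


def tokenization(text):
    # Split on runs of non-word characters.
    return re.split(r"\W+", text)


# Made-up sample, not a row from the dataset.
sample = "D'aww! He matches this background colour, 100%."
cleaned = remove_punct(sample)
tokens = tokenization(cleaned.lower())
print(repr(cleaned))
print(tokens)
```

Note the empty string at the end of `tokens`: `re.split` returns an empty token when the text ends with a separator, so filtering out empty tokens may be worthwhile before further processing.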

All the functions are applied below, in order.

1. First, apply the function that removes all digits and punctuation

In [6]:
df['punct'] = df['text'].apply(lambda x: remove_punct(x))
df.head(3)

Unnamed: 0,text,toxic,punct
0,Explanation\nWhy the edits made under my usern...,0,Explanation\nWhy the edits made under my usern...
1,D'aww! He matches this background colour I'm s...,0,Daww He matches this background colour Im seem...
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man Im really not trying to edit war Its j...


2. The cleaned text is tokenized

In [7]:
df['tokenized'] = df['punct'].apply(lambda x: tokenization(x.lower()))
df.head(3)

Unnamed: 0,text,toxic,punct,tokenized
0,Explanation\nWhy the edits made under my usern...,0,Explanation\nWhy the edits made under my usern...,"[explanation, why, the, edits, made, under, my..."
1,D'aww! He matches this background colour I'm s...,0,Daww He matches this background colour Im seem...,"[daww, he, matches, this, background, colour, ..."
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man Im really not trying to edit war Its j...,"[hey, man, im, really, not, trying, to, edit, ..."


3. All stop words are removed from the tokenized list

In [8]:
df['nonstop'] = df['tokenized'].apply(lambda x: remove_stopwords(x))
df.head(3)

Unnamed: 0,text,toxic,punct,tokenized,nonstop
0,Explanation\nWhy the edits made under my usern...,0,Explanation\nWhy the edits made under my usern...,"[explanation, why, the, edits, made, under, my...","[explanation, edits, made, username, hardcore,..."
1,D'aww! He matches this background colour I'm s...,0,Daww He matches this background colour Im seem...,"[daww, he, matches, this, background, colour, ...","[daww, matches, background, colour, im, seemin..."
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man Im really not trying to edit war Its j...,"[hey, man, im, really, not, trying, to, edit, ...","[hey, man, im, really, trying, edit, war, guy,..."
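
Stop-word filtering is a membership test per token, so a `set` lookup is a cheap way to do it. A minimal sketch, assuming a tiny hand-written stop list (`STOPWORDS` here is hypothetical and far shorter than NLTK's English list):

```python
# Hypothetical mini stop list; the notebook itself uses NLTK's English stop words.
STOPWORDS = {"he", "this", "the", "to", "not", "why", "my", "under"}


def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only tokens that are not in the stop list.
    return [tok for tok in tokens if tok not in stopwords]


tokens = ["daww", "he", "matches", "this", "background", "colour"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['daww', 'matches', 'background', 'colour']
```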


4. The list with stop words removed is stemmed

In [9]:
df['stemmed'] = df['nonstop'].apply(lambda x: stemming(x))
df.head(3)

Unnamed: 0,text,toxic,punct,tokenized,nonstop,stemmed
0,Explanation\nWhy the edits made under my usern...,0,Explanation\nWhy the edits made under my usern...,"[explanation, why, the, edits, made, under, my...","[explanation, edits, made, username, hardcore,...","[explan, edit, made, usernam, hardcor, metalli..."
1,D'aww! He matches this background colour I'm s...,0,Daww He matches this background colour Im seem...,"[daww, he, matches, this, background, colour, ...","[daww, matches, background, colour, im, seemin...","[daww, match, background, colour, im, seemingl..."
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man Im really not trying to edit war Its j...,"[hey, man, im, really, not, trying, to, edit, ...","[hey, man, im, really, trying, edit, war, guy,...","[hey, man, im, realli, tri, edit, war, guy, co..."


5. The list with stop words removed is lemmatized

In [11]:
df['lemmatized'] = df['nonstop'].apply(lambda x: lemmatizer(x))
df.head(3)

Unnamed: 0,text,toxic,punct,tokenized,nonstop,stemmed,lemmatized
0,Explanation\nWhy the edits made under my usern...,0,Explanation\nWhy the edits made under my usern...,"[explanation, why, the, edits, made, under, my...","[explanation, edits, made, username, hardcore,...","[explan, edit, made, usernam, hardcor, metalli...","[explanation, edits, made, username, hardcore,..."
1,D'aww! He matches this background colour I'm s...,0,Daww He matches this background colour Im seem...,"[daww, he, matches, this, background, colour, ...","[daww, matches, background, colour, im, seemin...","[daww, match, background, colour, im, seemingl...","[daww, match, background, colour, im, seemingl..."
2,"Hey man, I'm really not trying to edit war. It...",0,Hey man Im really not trying to edit war Its j...,"[hey, man, im, really, not, trying, to, edit, ...","[hey, man, im, really, trying, edit, war, guy,...","[hey, man, im, realli, tri, edit, war, guy, co...","[hey, man, im, really, trying, edit, war, guy,..."


### Conclusions:
1. The data was loaded
2. Functions for feature extraction were written
3. Features were extracted from the source texts, namely:
    - punct: digits and punctuation removed from the source text
    - tokenized: the text cleaned of digits and punctuation, tokenized
    - nonstop: stop words removed from the tokenized text
    - stemmed: stemming applied to the text with stop words removed
    - lemmatized: lemmatization applied to the text with stop words removed

# 2. Training

The preprocessed dataset is split into training and test sets.

In [10]:
train, test = train_test_split(df, test_size=.30, random_state=123)
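
When the positive class is rare, a plain random split can leave the two parts with different shares of toxic comments. A minimal sketch of a stratified split with `train_test_split`, on made-up data (the 10% toxic share and the `texts`/`labels` names are assumptions for illustration, not taken from the dataset):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Made-up labels: 10 toxic (1) and 90 non-toxic (0) comments.
labels = [1] * 10 + [0] * 90
texts = [f"comment {i}" for i in range(100)]

# stratify=labels keeps the class proportions equal in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, random_state=123, stratify=labels
)

print(Counter(y_train))
print(Counter(y_test))
```

With these numbers both parts keep the same 10% share of the positive class.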

A list is created to store the test results: the feature name and the F1 value.

In [11]:
result_list = []

A TF-IDF vectorizer is created.

In [12]:
count_tf_idf = TfidfVectorizer()
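
To make the weighting concrete, here is a pure-Python sketch of TF-IDF on pre-tokenized documents. It follows what I understand to be the default settings of `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalized rows); the function name `tfidf` and the toy documents are assumptions, and the real vectorizer also does its own tokenization and returns a sparse matrix.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2 norm of the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


# Toy documents, made up for illustration.
docs = [["edit", "war", "edit"], ["hero", "edit"], ["war", "hero", "hero"]]
rows = tfidf(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document "edit" and "war" occur in the same number of documents, so they share an IDF and "edit" ends up with exactly twice the weight of "war" because it appears twice.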

Next, for each feature, the best parameters are searched for and the F1 value is computed.
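
F1 is the harmonic mean of precision and recall on the positive class, so it can sit well below accuracy when toxic comments are rare. A minimal sketch with made-up labels (the 10% toxic share and the helper name `f1_from_labels` are assumptions for illustration, not figures or code from the project):

```python
def f1_from_labels(y_true, y_pred):
    # Count true positives, false positives and false negatives for class 1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# 10 toxic and 90 non-toxic comments; the model finds half of the toxic ones
# and raises no false alarms.
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 5 + [0] * 5 + [0] * 90

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_from_labels(y_true, y_pred)
print(accuracy)      # 0.95
print(round(f1, 3))  # 0.667
```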

#### Explanation
Initially I tried to train models and search for optimal parameters with lgbmregressor. The first run trained for 4 days without returning any results. I then simplified the algorithm somewhat, and the kernel started dying (a dead kernel error was returned). After 4 more days of attempts (and having long since run past every deadline) I decided to train a plain LogisticRegression model. I also use GridSearchCV to demonstrate that I can work with this tool.
I also tried to run the training locally, but:
- I could not get pymystem3 installed
- I judged my laptop's processor to be much weaker than the one used on the server

In [33]:
%%time

features = ['lemmatized','text','punct','tokenized','nonstop','stemmed']
for feature in features:
    X_train = train[feature]
    y_train = train['toxic']
    X_test = test[feature]
    y_test = test['toxic']

    # Cast to unicode strings; token lists become their string representation.
    train_corpus = X_train.astype('U')
    # Fit TF-IDF on the training corpus only, then reuse it for the test corpus.
    tfidf = count_tf_idf.fit_transform(train_corpus)
    corpus_test = X_test.astype('U')
    tfidf_test = count_tf_idf.transform(corpus_test)

    # Grid over class weighting and regularization strength.
    param_grid={'class_weight':['balanced', None],
          'C':[1,10,100]}
    # No `scoring` is passed, so candidates are ranked by the
    # estimator's default score (accuracy), not by F1.
    gs_rf = GridSearchCV(LogisticRegression(), param_grid=param_grid, n_jobs=-1, verbose=10)
    gs_rf.fit(tfidf, y_train)
    bp = gs_rf.best_params_

    # Refit with the best parameters and evaluate F1 on the test set.
    model = LogisticRegression(
        C=bp["C"],
        class_weight=bp["class_weight"]
    )
    model.fit(tfidf, y_train)
    predictions = model.predict(tfidf_test)
    fs = f1_score(y_test, predictions)
    result_list.append(['feature = '+feature, f'f1 score = {round(fs,2)}'])

[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] C=1, class_weight=balanced ......................................
[CV] .......... C=1, class_weight=balanced, score=0.943, total=  12.7s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   12.7s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.944, total=  12.4s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   25.1s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.943, total=  13.2s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   38.4s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.951, total=   9.4s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:   47.7s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.949, total=  10.0s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   57.8s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.950, total=   9.2s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.1min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.950, total=  28.8s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.6min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  27.1s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:  2.0min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  28.9s
[CV] C=10, class_weight=None .........................................


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  2.5min remaining:    0.0s


[CV] ............. C=10, class_weight=None, score=0.958, total=  15.2s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.957, total=  16.2s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.958, total=  17.7s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.949, total=  45.9s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.951, total=  47.0s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.951, total=  49.2s
[CV] C=100, class_weight=None ........................................
[CV] ............ C=100, class_weight=None, score=0.955, total=  25.4s
[CV] C=100, class_weight=None ........................................
[CV] .

[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  7.1min finished
[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] C=1, class_weight=balanced ......................................
[CV] .......... C=1, class_weight=balanced, score=0.938, total=  13.5s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   13.5s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.939, total=  14.6s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   28.1s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.939, total=  14.0s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   42.1s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.951, total=   9.6s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:   51.7s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.950, total=   9.4s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.0min remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.950, total=   9.4s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.2min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  27.5s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.6min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.953, total=  27.4s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:  2.1min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.950, total=  27.9s
[CV] C=10, class_weight=None .........................................


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  2.6min remaining:    0.0s


[CV] ............. C=10, class_weight=None, score=0.958, total=  16.1s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.959, total=  15.8s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.958, total=  16.2s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.951, total=  44.2s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.954, total=  43.2s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.952, total=  44.2s
[CV] C=100, class_weight=None ........................................
[CV] ............ C=100, class_weight=None, score=0.956, total=  23.0s
[CV] C=100, class_weight=None ........................................
[CV] .

[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  6.9min finished
[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] C=1, class_weight=balanced ......................................
[CV] .......... C=1, class_weight=balanced, score=0.938, total=  13.8s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   13.9s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.939, total=  14.2s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   28.0s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.940, total=  13.7s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   41.8s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.950, total=   9.8s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:   51.6s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.949, total=   9.4s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.0min remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.949, total=   9.2s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.2min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  27.3s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.6min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.953, total=  27.9s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:  2.1min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  26.6s
[CV] C=10, class_weight=None .........................................


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  2.5min remaining:    0.0s


[CV] ............. C=10, class_weight=None, score=0.958, total=  16.0s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.957, total=  15.4s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.958, total=  15.7s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.952, total=  42.8s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.954, total=  42.5s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.952, total=  42.8s
[CV] C=100, class_weight=None ........................................
[CV] ............ C=100, class_weight=None, score=0.956, total=  25.3s
[CV] C=100, class_weight=None ........................................
[CV] .

[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  7.1min finished
[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] C=1, class_weight=balanced ......................................
[CV] .......... C=1, class_weight=balanced, score=0.938, total=  18.8s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   18.8s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.939, total=  15.1s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   33.9s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.940, total=  13.8s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   47.7s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.950, total=   9.8s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:   57.5s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.949, total=   9.4s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:  1.1min remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.949, total=   9.2s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.3min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  30.0s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.8min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.953, total=  28.0s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:  2.2min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  26.9s
[CV] C=10, class_weight=None .........................................


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  2.7min remaining:    0.0s


[CV] ............. C=10, class_weight=None, score=0.958, total=  16.1s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.957, total=  19.2s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.958, total=  15.7s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.952, total=  42.9s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.954, total=  41.8s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.952, total=  43.0s
[CV] C=100, class_weight=None ........................................
[CV] ............ C=100, class_weight=None, score=0.956, total=  26.1s
[CV] C=100, class_weight=None ........................................
[CV] .

[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  7.3min finished
[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] C=1, class_weight=balanced ......................................
[CV] .......... C=1, class_weight=balanced, score=0.942, total=  11.1s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   11.1s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.942, total=  11.1s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   22.2s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.942, total=  11.1s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   33.3s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.949, total=   8.0s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:   41.3s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.948, total=   8.7s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   50.0s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.949, total=   8.1s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:   58.1s remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  23.4s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.4min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.951, total=  23.0s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:  1.7min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.950, total=  22.8s
[CV] C=10, class_weight=None .........................................


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  2.1min remaining:    0.0s


[CV] ............. C=10, class_weight=None, score=0.957, total=  14.4s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.956, total=  13.8s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.957, total=  14.0s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.950, total=  38.5s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.951, total=  37.6s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.951, total=  40.0s
[CV] C=100, class_weight=None ........................................
[CV] ............ C=100, class_weight=None, score=0.954, total=  29.9s
[CV] C=100, class_weight=None ........................................
[CV] .

[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  6.3min finished
[Parallel(n_jobs=-1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 6 candidates, totalling 18 fits
[CV] C=1, class_weight=balanced ......................................
[CV] .......... C=1, class_weight=balanced, score=0.942, total=  11.7s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   11.7s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.943, total=  11.3s
[CV] C=1, class_weight=balanced ......................................


[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:   23.0s remaining:    0.0s


[CV] .......... C=1, class_weight=balanced, score=0.941, total=  12.5s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:   35.5s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.952, total=   8.6s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   4 out of   4 | elapsed:   44.1s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.951, total=   8.2s
[CV] C=1, class_weight=None ..........................................


[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   52.3s remaining:    0.0s


[CV] .............. C=1, class_weight=None, score=0.951, total=   8.2s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed:  1.0min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.949, total=  25.0s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   7 out of   7 | elapsed:  1.4min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.950, total=  24.9s
[CV] C=10, class_weight=balanced .....................................


[Parallel(n_jobs=-1)]: Done   8 out of   8 | elapsed:  1.8min remaining:    0.0s


[CV] ......... C=10, class_weight=balanced, score=0.949, total=  25.5s
[CV] C=10, class_weight=None .........................................


[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  2.3min remaining:    0.0s


[CV] ............. C=10, class_weight=None, score=0.958, total=  14.3s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.958, total=  14.6s
[CV] C=10, class_weight=None .........................................
[CV] ............. C=10, class_weight=None, score=0.957, total=  16.2s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.947, total=  45.6s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.950, total=  43.0s
[CV] C=100, class_weight=balanced ....................................
[CV] ........ C=100, class_weight=balanced, score=0.948, total=  43.7s
[CV] C=100, class_weight=None ........................................
[CV] ............ C=100, class_weight=None, score=0.955, total=  27.3s
[CV] C=100, class_weight=None ........................................
[CV] .

[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:  6.5min finished


CPU times: user 27min 13s, sys: 18min 35s, total: 45min 49s
Wall time: 45min 57s


In [34]:
result_list

[['feature = lemmatized', 'f1 score = 0.78'],
 ['feature = text', 'f1 score = 0.79'],
 ['feature = punct', 'f1 score = 0.78'],
 ['feature = tokenized', 'f1 score = 0.78'],
 ['feature = nonstop', 'f1 score = 0.78'],
 ['feature = stemmed', 'f1 score = 0.78']]
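
In the loop above, `GridSearchCV` is called without `scoring`, so it ranks candidates by the classifier's default score (accuracy), and the vectorizer is fitted on the whole training set before cross-validation. A minimal sketch of one possible alternative: wrap the vectorizer and the classifier in a `Pipeline` and search with `scoring="f1"`. The toy corpus and the names `pipeline` and `search` are assumptions made up for illustration; this is not the setup that produced the results above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus, made up for illustration: 1 = toxic, 0 = not toxic.
texts = [
    "you are an idiot", "stupid idiot go away", "shut up you moron",
    "what a stupid moron", "idiot moron shut up", "go away stupid",
    "thanks for the edit", "nice article thanks", "please see the talk page",
    "good edit on the article", "thanks please see the page", "nice talk page edit",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The vectorizer is refitted inside every CV fold, so validation
# texts never influence the vocabulary or the IDF weights.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step-prefixed names route each parameter to the right pipeline step.
param_grid = {
    "clf__C": [1, 10, 100],
    "clf__class_weight": ["balanced", None],
}

# scoring="f1" ranks candidates by F1 on the positive class.
search = GridSearchCV(pipeline, param_grid=param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

With the default `refit=True`, `search.best_estimator_` is already refitted on all the training data, so it can be used for prediction directly.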

# 3. Conclusions

1. Additional features were derived from the source dataset: punct, tokenized, nonstop, stemmed, lemmatized
2. A TF-IDF vectorizer was used to turn each feature into numeric values
3. A logistic regression model was trained, with the following results on the test set:
    - Feature: text, F1: 0.79
    - Feature: lemmatized, F1: 0.78
    - Feature: punct, F1: 0.78
    - Feature: tokenized, F1: 0.78
    - Feature: nonstop, F1: 0.78
    - Feature: stemmed, F1: 0.78
4. All six variants exceed the required F1 of 0.75; the raw text scores highest
    