## Assignment 5.1

The dataset is available at https://github.com/sismetanin/rureviews and also in the [Data](https://drive.google.com/drive/folders/1YAMe7MiTxA-RSSd8Ex2p-L0Dspe6Gs4L) folder. Those who prefer to work with English can use the `sms_spam` dataset.

Let's apply the skills we have learned to sentiment analysis of reviews.

You need to reproduce the whole pipeline, from raw texts to a trained model.

Required pipeline steps:
1. tokenization
2. lowercasing
3. stop-word removal
4. lemmatization
5. vectorization (with hyperparameter tuning)
6. model training
7. model quality evaluation

The following vectorizers are required:
1. bag of n-grams (choose the range for n yourself; using unigrams only is not allowed).
2. tf-idf (choose the range for n yourself; also tune the `max_df`, `min_df`, and `max_features` parameters)
3. character n-grams (choose the range for n yourself)

The classifier must be a naive Bayes classifier.

To compare the vectorizers with each other, use precision, recall, f1-score, and accuracy. To do this, build a dataframe with the vectorizers in the rows, the quality metrics in the columns, and the corresponding metric values in the cells.
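
The comparison table described above can be sketched as follows. This is a minimal illustration on toy labels, not the assignment solution: the helper name `metrics_row`, the vectorizer names, and the toy predictions are assumptions made for the example, and weighted averaging is only one possible choice for a multi-class task.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def metrics_row(y_true, y_pred):
    """Return weighted precision, recall, f1 and plain accuracy as a dict."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy_score(y_true, y_pred),
    }


# Toy labels and predictions standing in for real model output (hypothetical).
y_true = ["negative", "positive", "positive", "negative"]
predictions = {
    "vectorizer_a": ["negative", "positive", "positive", "negative"],
    "vectorizer_b": ["negative", "negative", "positive", "negative"],
}

# Rows are vectorizers, columns are metrics.
comparison = pd.DataFrame(
    [metrics_row(y_true, y_pred) for y_pred in predictions.values()],
    index=list(predictions.keys()),
)
print(comparison)
```

Collecting all rows first and building the dataframe once avoids growing it cell by cell.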

In [None]:
!pip install pymorphy2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import numpy as np
import math
import nltk 
import string
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from pymorphy2 import MorphAnalyzer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
!gdown 1XBVRjzM3pOja1fIjrVQDifnXtcJDS927

Downloading...
From: https://drive.google.com/uc?id=1XBVRjzM3pOja1fIjrVQDifnXtcJDS927
To: /content/women-clothing-accessories.csv
100% 21.8M/21.8M [00:00<00:00, 69.4MB/s]


# Load the dataset and look at its basic properties.

In [None]:
df = pd.read_csv("/content/women-clothing-accessories.csv", sep='\t', usecols=[0, 1])

In [None]:
df.head(3)

Unnamed: 0,review,sentiment
0,качество плохое пошив ужасный (горловина напер...,negative
1,"Товар отдали другому человеку, я не получила п...",negative
2,"Ужасная синтетика! Тонкая, ничего общего с пре...",negative


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90000 entries, 0 to 89999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     90000 non-null  object
 1   sentiment  90000 non-null  object
dtypes: object(2)
memory usage: 1.4+ MB


In [None]:
df.describe()

Unnamed: 0,review,sentiment
count,90000,90000
unique,87321,3
top,Товар не пришёл,negative
freq,58,30000


In [None]:
df['sentiment'].value_counts()

negative    30000
neautral    30000
positive    30000
Name: sentiment, dtype: int64

# Preprocess the data.

In [None]:
def data_preprocessing(data_series):
  # Tokenization
  data_series = data_series.apply(word_tokenize)

  # Lowercasing
  data_series = data_series.apply(lambda sentence: [word.lower() for word in sentence])

  # Lemmatization
  pymorphy2_analyzer = MorphAnalyzer()
  data_series = data_series.apply(lambda sentence: [pymorphy2_analyzer.parse(word)[0].normal_form for word in sentence])

  # Stop-word removal
  noise = set(stopwords.words('russian'))
  data_series = data_series.apply(lambda sentence: [word for word in sentence if word not in noise])

  # Join the tokens back into one string per review for the vectorizers
  data_series = data_series.apply(lambda sentence: ' '.join(sentence))
  return data_series
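
Because the same word forms repeat many times across 90,000 reviews, the lemmatization step can be memoized. The sketch below is only an illustration under stated assumptions: `TOY_LEMMAS`, `lemmatize`, and `preprocess` are hypothetical stand-ins (the real lookup would be `pymorphy2_analyzer.parse(word)[0].normal_form`), used so the example runs without extra dependencies.

```python
from functools import lru_cache

# Hypothetical stand-in for a real morphological analyzer.
TOY_LEMMAS = {"платья": "платье", "пришли": "прийти", "плохие": "плохой"}


@lru_cache(maxsize=None)
def lemmatize(word):
    """Return the lemma of a word; each distinct word is looked up only once."""
    return TOY_LEMMAS.get(word, word)


def preprocess(text, stop_words=frozenset({"и", "не"})):
    """Lowercase, split on whitespace, lemmatize, drop stop words, re-join."""
    tokens = text.lower().split()
    lemmas = [lemmatize(token) for token in tokens]
    return " ".join(lemma for lemma in lemmas if lemma not in stop_words)


result = preprocess("Платья пришли плохие и платья не пришли")
print(result)
print(lemmatize.cache_info())
```

The cache statistics show how many lookups were answered without calling the analyzer again.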

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df.review, df.sentiment, train_size = 0.8)
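
The split above is reshuffled on every run, so the metrics change slightly between runs. A minimal sketch of a reproducible, stratified split is shown below; the toy data and the `random_state=42` value are arbitrary assumptions for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins for df.review and df.sentiment (hypothetical data).
reviews = [f"review {i}" for i in range(30)]
labels = ["negative", "neautral", "positive"] * 10

x_train, x_test, y_train, y_test = train_test_split(
    reviews,
    labels,
    train_size=0.8,
    random_state=42,   # fixes the shuffle so results are repeatable
    stratify=labels,   # keeps class proportions equal in both parts
)

print(Counter(y_train))
print(Counter(y_test))
```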

In [None]:
x_train_preprocessed = data_preprocessing(x_train)
x_test_preprocessed = data_preprocessing(x_test)

# Bag of n-grams

## n-gram range(1,1)

In [None]:
# Declare the vectorizer.
# The vectorizer turns each text into a numeric vector (here, n-gram counts)
# that a machine learning algorithm expecting numeric tabular data can work with.
vectorizer_CountVectorizer_ng11 = CountVectorizer(ngram_range=(1, 1))

vectorized_x_train = vectorizer_CountVectorizer_ng11.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_ng11.transform(x_test_preprocessed)

# Naive Bayes classifier
clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

# Store the results in a dataframe so they can be compared at the end
comparison_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_ng11'])

## n-gram range (1,2)

In [None]:
vectorizer_CountVectorizer_ng12 = CountVectorizer(ngram_range=(1, 2))

vectorized_x_train = vectorizer_CountVectorizer_ng12.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_ng12.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_ng12'])

comparison_df = pd.concat([comparison_df, new_df])

## n-gram range(1,3)

In [None]:
vectorizer_CountVectorizer_ng13 = CountVectorizer(ngram_range=(1, 3))

vectorized_x_train = vectorizer_CountVectorizer_ng13.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_ng13.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_ng13'])

comparison_df = pd.concat([comparison_df, new_df])

## n-gram range(2,2)

In [None]:
vectorizer_CountVectorizer_ng22 = CountVectorizer(ngram_range=(2, 2))

vectorized_x_train = vectorizer_CountVectorizer_ng22.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_ng22.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)
new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_ng22'])
comparison_df = pd.concat([comparison_df, new_df])

## n-gram range(2,3)

In [None]:
vectorizer_CountVectorizer_ng23 = CountVectorizer(ngram_range=(2, 3))

vectorized_x_train = vectorizer_CountVectorizer_ng23.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_ng23.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)
new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_ng23'])
comparison_df = pd.concat([comparison_df, new_df])

## n-gram (2, 4)

In [None]:
vectorizer_CountVectorizer_ng24 = CountVectorizer(ngram_range=(2, 4))

vectorized_x_train = vectorizer_CountVectorizer_ng24.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_ng24.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_ng24'])

comparison_df = pd.concat([comparison_df, new_df])

## n-gram (3, 3)

In [None]:
vectorizer_CountVectorizer_ng33 = CountVectorizer(ngram_range=(3, 3))

vectorized_x_train = vectorizer_CountVectorizer_ng33.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_ng33.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_ng33'])

comparison_df = pd.concat([comparison_df, new_df])

Current state of the dataframe

In [None]:
comparison_df

Unnamed: 0,precision,recall,f1,accuracy
vectorizer_CountVectorizer_ng11,0.709831,0.703167,0.703724,0.703167
vectorizer_CountVectorizer_ng12,0.721927,0.719389,0.719457,0.719389
vectorizer_CountVectorizer_ng13,0.718108,0.717556,0.717256,0.717556
vectorizer_CountVectorizer_ng22,0.692024,0.690667,0.690765,0.690667
vectorizer_CountVectorizer_ng23,0.683873,0.685056,0.684105,0.685056
vectorizer_CountVectorizer_ng24,0.681791,0.6835,0.682318,0.6835
vectorizer_CountVectorizer_ng33,0.614148,0.595056,0.599782,0.595056


# TF-IDF

## ngram_range=(1, 1)

In [None]:
vectorizer_TfidfVectorizer_ng11 = TfidfVectorizer(ngram_range=(1, 1))

vectorized_x_train = vectorizer_TfidfVectorizer_ng11.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng11.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng11'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 2)

In [None]:
vectorizer_TfidfVectorizer_ng12 = TfidfVectorizer(ngram_range=(1, 2))

vectorized_x_train = vectorizer_TfidfVectorizer_ng12.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12'])

comparison_df = pd.concat([comparison_df, new_df])

# Now let's tune the hyperparameters max_df, min_df, and max_features.

In [None]:
vocab_len = len(vectorizer_TfidfVectorizer_ng11.vocabulary_)
vocab_len

28471
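
Before tuning, it helps to recall how the frequency thresholds are interpreted: an integer `min_df` or `max_df` is an absolute document count, while a float is a fraction of the documents. The toy corpus and the helper `vocabulary` below are hypothetical illustrations of that behaviour, not part of the assignment data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "good" occurs in 3 of 4 documents, "dress" in 2.
docs = ["good dress", "good shoes", "bad dress", "good bag"]


def vocabulary(**params):
    """Fit a TfidfVectorizer with the given parameters and return its sorted terms."""
    return sorted(TfidfVectorizer(**params).fit(docs).vocabulary_)


print(vocabulary(min_df=2))        # int: keep terms found in at least 2 documents
print(vocabulary(min_df=0.5))      # float: at least 50% of documents (also 2 here)
print(vocabulary(max_df=0.5))      # float: drop terms found in more than 50% of documents
print(vocabulary(max_features=1))  # keep only the most frequent term in the corpus
```

With 72,000 training reviews, `min_df=0.01` therefore requires a term to occur in at least 720 of them, which is a far stricter filter than `min_df=5`.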

## ngram_range=(1, 2), max_df=0.5, min_df=0.01, max_features=int(vocab_len/2)

In [None]:
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001_maxf2 = TfidfVectorizer(
    ngram_range=(1, 2), max_df=0.5, min_df=0.01, max_features=int(vocab_len/2)
    )

vectorized_x_train = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001_maxf2.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001_maxf2.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001_maxf2'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 2), max_df=0.5

In [None]:
vectorizer_TfidfVectorizer_ng12_maxdf05 = TfidfVectorizer(
    ngram_range=(1, 2), max_df=0.5
    )

vectorized_x_train = vectorizer_TfidfVectorizer_ng12_maxdf05.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12_maxdf05.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12_maxdf05'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 2), max_df=0.1

In [None]:
vectorizer_TfidfVectorizer_ng12_maxdf01 = TfidfVectorizer(
    ngram_range=(1, 2), max_df=0.1
    )

vectorized_x_train = vectorizer_TfidfVectorizer_ng12_maxdf01.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12_maxdf01.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12_maxdf01'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 2), max_df=0.5, min_df=0.01

In [None]:
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001 = TfidfVectorizer(
    ngram_range=(1, 2), max_df=0.5, min_df=0.01
    )

vectorized_x_train = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 2), max_df=0.5, min_df=5

In [None]:
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5 = TfidfVectorizer(
    ngram_range=(1, 2), max_df=0.5, min_df=5
    )

vectorized_x_train = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 2), max_df=0.5, min_df=5, max_features=int(vocab_len/2)

In [None]:
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf2 = TfidfVectorizer(
    ngram_range=(1, 2), max_df=0.5, min_df=5, max_features=int(vocab_len/2)
    )

vectorized_x_train = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf2.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf2.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf2'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 2), max_df=0.5, min_df=5, max_features=int(vocab_len*0.99)


In [None]:
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf099 = TfidfVectorizer(
    ngram_range=(1, 2), max_df=0.5, min_df=5, max_features=int(vocab_len*0.99)
    )

vectorized_x_train = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf099.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf099.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf099'])

comparison_df = pd.concat([comparison_df, new_df])

Current state of the dataframe

In [None]:
comparison_df

Unnamed: 0,precision,recall,f1,accuracy
vectorizer_CountVectorizer_ng11,0.709831,0.703167,0.703724,0.703167
vectorizer_CountVectorizer_ng12,0.721927,0.719389,0.719457,0.719389
vectorizer_CountVectorizer_ng13,0.718108,0.717556,0.717256,0.717556
vectorizer_CountVectorizer_ng22,0.692024,0.690667,0.690765,0.690667
vectorizer_CountVectorizer_ng23,0.683873,0.685056,0.684105,0.685056
vectorizer_CountVectorizer_ng24,0.681791,0.6835,0.682318,0.6835
vectorizer_CountVectorizer_ng33,0.614148,0.595056,0.599782,0.595056
vectorizer_TfidfVectorizer_ng11,0.708238,0.700944,0.70189,0.700944
vectorizer_TfidfVectorizer_ng12,0.72311,0.720556,0.720519,0.720556
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf001_maxf2,0.684455,0.672167,0.67407,0.672167


# Character n-grams

Character n-grams are used, for example, for language identification. Another useful property of character features is that they do not require tokenization or lemmatization, so the approach can be applied to languages that lack ready-made analyzers.
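
To give a rough idea of what character n-grams look like, here is a pure-Python sketch. The helper `char_ngrams` is a hypothetical illustration; scikit-learn's `analyzer='char'` additionally normalizes whitespace and counts the resulting n-grams.

```python
def char_ngrams(text, n_min, n_max):
    """Return all character n-grams of text with n_min <= n <= n_max."""
    return [
        text[start:start + n]
        for n in range(n_min, n_max + 1)
        for start in range(len(text) - n + 1)
    ]


print(char_ngrams("cat", 2, 3))
print(char_ngrams("не очень", 3, 4))
```

Note that n-grams may span the space between words, so some word-order information is captured even without tokenization.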

## ngram_range=(3, 6)

In [None]:
vectorizer_CountVectorizer_char_ng3_6 = CountVectorizer(analyzer='char', ngram_range=(3, 6))

vectorized_x_train = vectorizer_CountVectorizer_char_ng3_6.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_char_ng3_6.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_char_ng3_6'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(1, 10)

In [None]:
vectorizer_CountVectorizer_char_ng1_10 = CountVectorizer(analyzer='char', ngram_range=(1, 10))

vectorized_x_train = vectorizer_CountVectorizer_char_ng1_10.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_char_ng1_10.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_char_ng1_10'])

comparison_df = pd.concat([comparison_df, new_df])

## ngram_range=(2, 5)

In [None]:
vectorizer_CountVectorizer_char_ng2_5 = CountVectorizer(analyzer='char', ngram_range=(2, 5))

vectorized_x_train = vectorizer_CountVectorizer_char_ng2_5.fit_transform(x_train_preprocessed)
vectorized_x_test = vectorizer_CountVectorizer_char_ng2_5.transform(x_test_preprocessed)

clf = MultinomialNB()
clf.fit(vectorized_x_train, y_train)
y_pred = clf.predict(vectorized_x_test)

new_df = pd.DataFrame(
    {'precision': [precision_score(y_test, y_pred, average='weighted')],
     'recall': [recall_score(y_test, y_pred, average='weighted')],
     'f1': [f1_score(y_test, y_pred, average='weighted')],
     'accuracy': [accuracy_score(y_test, y_pred)]}, 
    index=['vectorizer_CountVectorizer_char_ng2_5'])

comparison_df = pd.concat([comparison_df, new_df])

# Final dataframe

In [None]:
comparison_df.sort_values(by=['accuracy'], ascending=False)

Unnamed: 0,precision,recall,f1,accuracy
vectorizer_TfidfVectorizer_ng12,0.72311,0.720556,0.720519,0.720556
vectorizer_TfidfVectorizer_ng12_maxdf05,0.72311,0.720556,0.720519,0.720556
vectorizer_CountVectorizer_char_ng1_10,0.727433,0.720167,0.721787,0.720167
vectorizer_CountVectorizer_ng12,0.721927,0.719389,0.719457,0.719389
vectorizer_TfidfVectorizer_ng12_maxdf01,0.721351,0.717889,0.718026,0.717889
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf099,0.726654,0.717833,0.719155,0.717833
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5,0.726327,0.717667,0.718934,0.717667
vectorizer_CountVectorizer_ng13,0.718108,0.717556,0.717256,0.717556
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf2,0.726734,0.716889,0.718433,0.716889
vectorizer_CountVectorizer_char_ng3_6,0.721762,0.710944,0.712875,0.710944


In [None]:
comparison_df.sort_values(by=['f1'], ascending=False)

Unnamed: 0,precision,recall,f1,accuracy
vectorizer_CountVectorizer_char_ng1_10,0.727433,0.720167,0.721787,0.720167
vectorizer_TfidfVectorizer_ng12,0.72311,0.720556,0.720519,0.720556
vectorizer_TfidfVectorizer_ng12_maxdf05,0.72311,0.720556,0.720519,0.720556
vectorizer_CountVectorizer_ng12,0.721927,0.719389,0.719457,0.719389
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf099,0.726654,0.717833,0.719155,0.717833
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5,0.726327,0.717667,0.718934,0.717667
vectorizer_TfidfVectorizer_ng12_maxdf05_mindf5_maxf2,0.726734,0.716889,0.718433,0.716889
vectorizer_TfidfVectorizer_ng12_maxdf01,0.721351,0.717889,0.718026,0.717889
vectorizer_CountVectorizer_ng13,0.718108,0.717556,0.717256,0.717556
vectorizer_CountVectorizer_char_ng3_6,0.721762,0.710944,0.712875,0.710944


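# Conclusion:
Based on the results above, the character n-gram vectorizer with ngram_range=(1, 10) achieves the highest f1 (0.721787), while the tf-idf vectorizer with ngram_range=(1, 2) has a marginally higher accuracy (0.720556 versus 0.720167). The top configurations differ by less than 0.002 on both metrics, so on this split they perform essentially the same.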

## Assignment 5.2 Regular expressions

Regular expressions are a way to search and analyze strings. For example, you can find which dates in a set of strings are written in the DD/MM/YYYY format and which use other formats.

Or, before working with a text, you may need to clean it of noise such as user mentions, URLs, and so on.

It is a useful skill, so let's practice it too.

The **re** library is used for working with regular expressions.

In [None]:
import re

In regular expressions, besides ordinary literal characters, there are special characters:
* **a?** - zero or one character **a**
* **a+** - one or more characters **a**
* **a\*** - zero or more characters **a** (not to be confused with +)
* **.** - any single character (except a newline by default)

Example:
The expression a\*b?. fully matches the strings a, ab, abc, aa, aac, and also abb, BUT NOT abbb!
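
In Python's `re` syntax, quantifiers are written after the character they apply to (`a*`, `b?`), and `.` stands for exactly one character. A small check with `re.fullmatch`, using arbitrarily chosen test strings:

```python
import re

pattern = re.compile(r"a*b?.")

candidates = ["a", "ab", "abc", "aa", "aac", "abb", "abbb"]
# fullmatch succeeds only if the whole string fits the pattern.
results = {s: bool(pattern.fullmatch(s)) for s in candidates}

for candidate, matched in results.items():
    print(candidate, matched)
```

Only `abbb` fails: after an optional `a`-run, one optional `b`, and one arbitrary character, there is still a character left over.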

Let's look closely at a few of the most useful functions:

### findall
returns a list of all non-overlapping matches found.

The regular expression **ab+c.**: 
* **a** - just the character **a**
* **b+** - one or more characters **b**
* **c** - just the character **c**
* **.** - any single character


In [None]:
result = re.findall('ab+c.', 'abcdefghijkabcabcxabc') 
print(result)

['abcd', 'abca']


Attention check: why is abcx missing?
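
The answer lies in the word "non-overlapping": once `abca` is matched, its final `a` is consumed, so the search resumes at `bcx...` and `abcx` can no longer start there. If overlapping matches are needed, one common approach (shown here as a sketch) is to wrap the pattern in a zero-width lookahead with a capturing group:

```python
import re

text = 'abcdefghijkabcabcxabc'

# Plain findall: matches cannot share characters.
plain = re.findall(r'ab+c.', text)
print(plain)

# A lookahead consumes nothing, so matches are allowed to overlap.
overlapping = re.findall(r'(?=(ab+c.))', text)
print(overlapping)
```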

**Task**: return a list of the first two letters of each word in a string consisting of several words.

In [None]:
find_test_txt = 'With the lights out, its less dangerous, Here we are now, entertain us'
re.findall(r'\b\w{1,2}', find_test_txt) 

['Wi', 'th', 'li', 'ou', 'it', 'le', 'da', 'He', 'we', 'ar', 'no', 'en', 'us']

### split
splits a string by the given pattern


In [None]:
result = re.split(',', 'itsy, bitsy, teenie, weenie') 
print(result)

['itsy', ' bitsy', ' teenie', ' weenie']


you can specify the maximum number of splits

In [None]:
result = re.split(',', 'itsy, bitsy, teenie, weenie', maxsplit=2) 
print(result)

['itsy', ' bitsy', ' teenie, weenie']


**Task**: split a string consisting of several sentences at the periods, but into no more than 3 parts.

In [None]:
split_test_txt1 = 'Im on. the highway. to hell.On the highway. to hell'
print(re.split('[.]', split_test_txt1, maxsplit=2))

split_test_txt2 = 'So close, no matter how far. Couldnt be much more from the heart. Forever trusting who we are. And nothing else matters'
print(re.split('[.]', split_test_txt2, maxsplit=2))

['Im on', ' the highway', ' to hell.On the highway. to hell']
['So close, no matter how far', ' Couldnt be much more from the heart', ' Forever trusting who we are. And nothing else matters']
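
The results above keep the space that follows each period. One possible variation, shown as a sketch, is to let the pattern absorb the whitespace after the period:

```python
import re

split_test_txt1 = 'Im on. the highway. to hell.On the highway. to hell'

# \. is a literal period; \s* also swallows any whitespace after it.
parts = re.split(r'\.\s*', split_test_txt1, maxsplit=2)
print(parts)
```

`maxsplit=2` still limits the result to three parts, so the remainder of the string is returned untouched as the last element.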


### sub
searches for the pattern in a string and replaces all matches with the given substring

parameters: (pattern, repl, string)

In [None]:
result = re.sub('a', 'b', 'abcabc')
print(result)

bbcbbc


**Task**: write a regular expression that replaces all digits in a string with "DIG".

In [None]:
text = "This is 10 percent luck, 20 percent skill, 15 percent concentrated power of will, 5 percent pleasure, 15 percent pain, And a 100 percent reason to remember the name "
re.sub(r"\d", "DIG", text)

'This is DIGDIG percent luck, DIGDIG percent skill, DIGDIG percent concentrated power of will, DIG percent pleasure, DIGDIG percent pain, And a DIGDIGDIG percent reason to remember the name '
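
`\d` matches one digit at a time, so a multi-digit number produces several "DIG" tokens. If each whole number should become a single token instead (a different reading of the task, shown only as a sketch on a shortened example string), a `+` quantifier can be added:

```python
import re

text = "10 percent luck, 5 percent pleasure, 100 percent reason"

# One replacement per digit.
per_digit = re.sub(r"\d", "DIG", text)
# One replacement per run of consecutive digits.
per_number = re.sub(r"\d+", "DIG", text)

print(per_digit)
print(per_number)
```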

**Task**: write a regular expression that removes URLs from a string.

In [None]:
text2 = 'Forever trusting who we are https://www.amalgama-lab.com/songs/m/metallica/nothing_else_matters.html And nothing else matters'
print(text2)
print(re.sub(r'http\S+', '', text2))

Forever trusting who we are https://www.amalgama-lab.com/songs/m/metallica/nothing_else_matters.html And nothing else matters
Forever trusting who we are  And nothing else matters
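
Removing the URL leaves a double space where it used to be. A sketch of one way to avoid that, assuming URLs start with `http://` or `https://` and contain no whitespace:

```python
import re

text2 = ('Forever trusting who we are '
         'https://www.amalgama-lab.com/songs/m/metallica/nothing_else_matters.html '
         'And nothing else matters')

# Match the scheme, the non-whitespace rest of the URL, and trailing whitespace.
cleaned = re.sub(r'https?://\S+\s*', '', text2)
print(cleaned)
```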


### compile
compiles a regular expression into a separate object

In [None]:
# Example: building a list of all words in a string
# (Ё is listed explicitly because it lies outside the А-Я range)
prog = re.compile(r'[А-Яа-яЁё\-]+')
prog.findall("Слова? Да, больше, ещё больше слов! Что-то ещё.")

['Слова', 'Да', 'больше', 'ещё', 'больше', 'слов', 'Что-то', 'ещё']

**Task**: for the chosen string, build a list of words longer than three characters.

In [None]:
pattern1 = re.compile(r'[А-Яа-яЁё\-]{4,}')
pattern1.findall("Слова? Да, больше, ещё больше слов! Что-то ещё.")

['Слова', 'больше', 'больше', 'слов', 'Что-то']

**Task**: return the list of domains (such as @gmail.com) from a list of email addresses:

```
abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz
```

In [None]:
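# Match "@" followed by the domain: word characters, dots, and hyphens
pattern2 = re.compile(r'@[\w.-]+')
pattern2.findall("abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz")

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']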