# Comment Sentiment Classification

I have been given user comments labeled by sentiment (a binary `toxic` column). The task is to train a model that classifies comments as positive or negative.

### Contents:

#### 1) <a href='#First look'> Exploring the data</a>
#### 2) <a href='#Preprocesing'> Data preprocessing</a>
#### 3) <a href='#Model training'> Model training</a>

- <a href='#CountVectorizer'> Training with CountVectorizer</a>
- <a href='#TfidfVectorizer'> Training with TfidfVectorizer</a>
- <a href='#DistilBert'> Training with DistilBert</a>

#### 4) <a href='#Conclusion'> Conclusion</a>

<a id='First look'></a>
## Exploring the data

In [1]:
import pandas as pd
import numpy as np

import nltk
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

import torch
import transformers

import re
from tqdm import notebook

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

import lightgbm as lgb

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\yansa\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
comms = pd.read_csv('C:\\Users\\yansa\\YP_Projects\\YP_DataSets\\SP12\\toxic_comments.csv')

In [3]:
comms.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [4]:
comms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [5]:
comms['toxic'].value_counts()

0    143346
1     16225
Name: toxic, dtype: int64

There is a strong class imbalance: only about 10% of the comments (16,225 of 159,571) are labeled toxic.
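Why plain accuracy would be misleading on data this skewed can be sketched from the class counts above. The snippet scores a hypothetical baseline that labels every comment as non-toxic; the baseline and the variable names are illustrative and not part of the original notebook.

```python
# Class counts taken from the value_counts() output above.
n_non_toxic = 143346
n_toxic = 16225

# Hypothetical baseline: predict "non-toxic" (0) for every comment.
true_positives = 0
false_positives = 0
false_negatives = n_toxic
true_negatives = n_non_toxic

accuracy = (true_positives + true_negatives) / (n_non_toxic + n_toxic)

# F1 for the toxic class; taken as 0 when the denominator is 0.
denominator = 2 * true_positives + false_positives + false_negatives
f1 = 2 * true_positives / denominator if denominator else 0.0

print(f"accuracy = {accuracy:.3f}")
print(f"f1       = {f1:.3f}")
```

The baseline reaches roughly 90% accuracy while its F1 for the toxic class is zero, so F1 is the more informative metric for this dataset.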

<a id='Preprocesing'></a>
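## Data preprocessing

I tokenize the text and remove digits, punctuation, and extra whitespace (note that `text_preprocessing` does not actually lemmatize, although `WordNetLemmatizer` is imported):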

In [6]:
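def text_preprocessing(text):
    # Split into tokens (contractions such as "weren't" become "were" + "n't").
    tokenized = nltk.word_tokenize(text)
    joined = ' '.join(tokenized)
    # Replace every character that is not a Latin letter with a space.
    text_only = re.sub(r'[^a-zA-Z]', ' ', joined)
    # Collapse runs of whitespace into single spaces.
    final = ' '.join(text_only.split())
    return final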

In [7]:
comms['text'][0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

In [8]:
notebook.tqdm.pandas() 
comms['text_final'] = comms['text'].progress_apply(text_preprocessing)


In [9]:
comms['text_final'][0]

'Explanation Why the edits made under my username Hardcore Metallica Fan were reverted They were n t vandalisms just closure on some GAs after I voted at New York Dolls FAC And please do n t remove the template from the talk page since I m retired now'
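The character-filtering step can be tried in isolation. This is a minimal sketch that leaves out the `nltk.word_tokenize` call, so contractions split slightly differently than in the output above (`weren t` rather than `were n t`); the function name `strip_non_letters` is illustrative.

```python
import re

def strip_non_letters(text: str) -> str:
    """Keep only Latin letters and collapse whitespace (no tokenization step)."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(letters_only.split())

sample = "They weren't vandalisms, just closure.89.205.38.27"
cleaned = strip_non_letters(sample)
print(cleaned)
```

Digits, punctuation, and the trailing IP address are all replaced by spaces, which the final `split`/`join` then collapses.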

I split the data into training and test sets (`train_test_split` defaults to a 75/25 split), stratifying by the target:

In [10]:
x_comms = comms['text_final']
y_comms = comms['toxic']

In [11]:
x_train_comms, x_test_comms, y_train_comms, y_test_comms = train_test_split(x_comms, y_comms, random_state=0, 
                                                                            stratify=y_comms)
x_train_comms.shape, x_test_comms.shape, y_train_comms.shape, y_test_comms.shape

((119678,), (39893,), (119678,), (39893,))
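What `stratify=y_comms` achieves can be sketched in plain Python: each class is shuffled and split separately, so both parts keep the same class ratio. This is a simplified illustration under that assumption, not scikit-learn's actual implementation; `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, test_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

# Toy labels with roughly the same 10% positive rate as the dataset.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

Both rates come out equal to the overall positive rate, which is the property the notebook relies on when it compares train and test scores.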

In [12]:
print('Class distribution in the training set:')
print(y_train_comms.value_counts()[0] / y_train_comms.value_counts().sum())
print(y_train_comms.value_counts()[1] / y_train_comms.value_counts().sum())
print()
print('Class distribution in the test set:')
print(y_test_comms.value_counts()[0] / y_test_comms.value_counts().sum())
print(y_test_comms.value_counts()[1] / y_test_comms.value_counts().sum())

Class distribution in the training set:
0.8983188221728304
0.10168117782716957

Class distribution in the test set:
0.8983280274734916
0.1016719725265084


I build the corpus:

In [13]:
x_train_comms_corpus = x_train_comms.values.astype('U')
x_test_comms_corpus = x_test_comms.values.astype('U')

I load the stop words:

In [14]:
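nltk.download('stopwords')
# Stored as a list: recent scikit-learn versions validate `stop_words` as a list, not a set.
# Note that this rebinds `stopwords`, shadowing the imported nltk module on re-runs.
stopwords = sorted(set(stopwords.words('english')))
len(stopwords)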

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yansa\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


179

I prepare a dataframe for collecting results and a function for training the models:

In [15]:
results = pd.DataFrame({
    'Preprocessing model' : [], 'Learning model' : [], 'Train f1 score' : [], 'Test f1 score' : []
})

In [16]:
models = [
    LogisticRegression(max_iter=1000),
    lgb.LGBMClassifier(n_estimators = 1000, learning_rate = 0.1)
]

In [17]:
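def learn_models(models_list, x_train, y_train, x_test, y_test, prepr_model: str):
    """Fit each model and append its train/test F1 scores to the global `results` frame."""
    global results
    for model in models_list:
        # With an empty grid, GridSearchCV only runs 5-fold CV and refits on the full training set.
        clf_gs = GridSearchCV(model, {}, cv=5, scoring='f1')
        clf_gs.fit(x_train, y_train)

        train_f1_score = f1_score(y_train, clf_gs.predict(x_train))
        test_f1_score = f1_score(y_test, clf_gs.predict(x_test))

        name = type(model).__name__

        # DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
        row = pd.DataFrame([{
            'Preprocessing model': prepr_model, 'Learning model': name,
            'Train f1 score': round(train_f1_score, 2), 'Test f1 score': round(test_f1_score, 2)}])
        results = pd.concat([results, row], ignore_index=True)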

<a id='Model training'></a>
## Model training

<a id='CountVectorizer'></a>
### Training with count vectorization

In [18]:
vectorizer = CountVectorizer(stop_words=stopwords, dtype=np.float32) 

In [19]:
x_train_comms_vectorized = vectorizer.fit_transform(x_train_comms_corpus)
x_test_comms_vectorized = vectorizer.transform(x_test_comms_corpus)

In [20]:
print(x_train_comms_vectorized.shape)
print(x_train_comms_vectorized[:5].toarray())

(119678, 142722)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
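The matrix above is almost entirely zeros because each row only has non-zero counts for the few vocabulary terms that occur in that comment. A minimal bag-of-words sketch on a toy corpus shows the idea; it uses plain whitespace splitting and a made-up stop-word subset, so it is a simplification of what `CountVectorizer` does rather than a reimplementation.

```python
from collections import Counter

corpus = ["you are my hero", "you are not my hero you are rude"]
stop_words = {"you", "are", "my"}  # illustrative subset, not the NLTK list

# Tokenize on whitespace and drop stop words.
tokenized = [[t for t in doc.lower().split() if t not in stop_words] for doc in corpus]

# Vocabulary: sorted unique remaining tokens, one matrix column per term.
vocabulary = sorted({t for doc in tokenized for t in doc})

# One row per document holding the count of each vocabulary term.
counts = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
print(counts)
```

With 142,722 columns and short comments, nearly every entry is zero, which is why scikit-learn returns a sparse matrix and why `toarray()` is only called on the first five rows.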


In [21]:
%%time
learn_models(models, x_train_comms_vectorized, y_train_comms, x_test_comms_vectorized, y_test_comms,
             prepr_model = 'CountVectorizer')

Wall time: 4min 31s


In [22]:
results[-2:]

Unnamed: 0,Preprocessing model,Learning model,Train f1 score,Test f1 score
0,CountVectorizer,LogisticRegression,0.9,0.76
1,CountVectorizer,LGBMClassifier,0.92,0.77


<a id='TfidfVectorizer'></a>
### Training with TF-IDF

In [23]:
tf_idf = TfidfVectorizer(stop_words=stopwords)

In [24]:
x_train_comms_tf_idf = tf_idf.fit_transform(x_train_comms)
x_test_comms_tf_idf = tf_idf.transform(x_test_comms)

In [25]:
print(x_train_comms_tf_idf.shape)
print(x_train_comms_tf_idf[:5].toarray())

(119678, 142722)
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
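TF-IDF keeps the same vocabulary as the count matrix (hence the identical shape above) but reweights each count by how rare the term is across documents. The sketch below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the toy counts and the helper `tfidf_row` are illustrative.

```python
import math

# Term counts for three toy documents over a three-term vocabulary.
vocabulary = ["hero", "not", "rude"]
counts = [[1, 0, 0], [1, 1, 1], [0, 0, 2]]

n_docs = len(counts)
# Document frequency: number of documents that contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocabulary))]
# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by IDF, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm if norm else 0.0 for v in weighted]

tfidf = [tfidf_row(row) for row in counts]
for row in tfidf:
    print([round(v, 3) for v in row])
```

In the second document the term "not" appears in only one document, so it receives a larger weight than "hero" or "rude" even though all three have the same raw count.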


In [26]:
%%time
learn_models(models, x_train_comms_tf_idf, y_train_comms, x_test_comms_tf_idf, y_test_comms,
             prepr_model = 'TfidfVectorizer')

Wall time: 13min 51s


In [27]:
results[-2:]

Unnamed: 0,Preprocessing model,Learning model,Train f1 score,Test f1 score
2,TfidfVectorizer,LogisticRegression,0.76,0.72
3,TfidfVectorizer,LGBMClassifier,0.95,0.78


<a id='DistilBert'></a>
### Training with DistilBert

To speed up DistilBert, I take a slice of the first 2,000 rows of the main dataframe.

I check whether the slice preserves the class imbalance of the full dataframe:

In [28]:
print('Class distribution in the full dataset:')
print(comms['toxic'].value_counts()[0] / comms['toxic'].value_counts().sum())
print(comms['toxic'].value_counts()[1] / comms['toxic'].value_counts().sum())
print()
print('Class distribution in the slice used for Bert:')
print(comms['toxic'][:2000].value_counts()[0] / comms['toxic'][:2000].value_counts().sum())
print(comms['toxic'][:2000].value_counts()[1] / comms['toxic'][:2000].value_counts().sum())

Class distribution in the full dataset:
0.8983211235124177
0.10167887648758234

Class distribution in the slice used for Bert:
0.895
0.105


In [29]:
model_class, tokenizer_class, pretrained_weights = (transformers.DistilBertModel, transformers.DistilBertTokenizer, 
                                                    'distilbert-base-uncased')
#model_class, tokenizer_class, pretrained_weights = (transformers.BertModel, transformers.BertTokenizer, 
#                                                    'bert-base-uncased')

In [30]:
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Tokenization

In [31]:
# Note: x[:512] truncates each comment to 512 characters (not tokens) before encoding.
tokenized = comms['text_final'][:2000].progress_apply(lambda x: tokenizer.encode(x[:512], add_special_tokens=True))


Padding

In [32]:
len(tokenized[0])

56

In [33]:
padded = np.array([i + [0]*(512-len(i)) for i in tokenized.values])

In [34]:
len(padded[0])

512

Masking

In [35]:
attention_mask = np.where(padded != 0, 1, 0)
padded.shape, attention_mask.shape

((2000, 512), (2000, 512))
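The padding and masking steps above can be sketched on two short sequences in plain Python. The token ids and the small `MAX_LEN` are made up for readability (the notebook pads to 512); the pad id of 0 matches the `[0]` padding used in the notebook.

```python
MAX_LEN = 8  # the notebook uses 512; a small value keeps the example readable
PAD_ID = 0   # same padding value as in the notebook's padding cell

# Illustrative token id sequences of different lengths.
token_ids = [[101, 7592, 102], [101, 2017, 2024, 2307, 102]]

# Right-pad every sequence to MAX_LEN.
padded = [ids + [PAD_ID] * (MAX_LEN - len(ids)) for ids in token_ids]

# Mask is 1 for real tokens and 0 for padding, so the model ignores padded positions.
attention_mask = [[1 if token != PAD_ID else 0 for token in row] for row in padded]

for row, mask in zip(padded, attention_mask):
    print(row, mask)
```

Every row ends up the same length, which is what allows the sequences to be stacked into a single rectangular tensor per batch.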

In [36]:
batch_size = 10
embeddings = []

In [37]:
for i in notebook.tqdm(range(padded.shape[0] // batch_size)):
    batch = torch.LongTensor(padded[batch_size*i:batch_size*(i+1)])
    attention_mask_batch = torch.LongTensor(attention_mask[batch_size*i:batch_size*(i+1)])
    
    with torch.no_grad():
        batch_embeddings = model(batch, attention_mask=attention_mask_batch)
    embeddings.append(batch_embeddings[0][:,0,:].numpy())


In [38]:
x_bert = np.concatenate(embeddings)
y_bert = comms['toxic'][:2000]
x_bert.shape, y_bert.shape

((2000, 768), (2000,))
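The batching loop above uses floor division, so it covers every row only when the row count is a multiple of `batch_size` (it is here: 2000 / 10 gives 200 batches and 2,000 embeddings). For other sizes, a slicing helper along these lines would keep the shorter final batch; `batch_slices` is a hypothetical name, not part of the notebook.

```python
def batch_slices(n_rows, batch_size):
    """Yield (start, stop) pairs that cover all rows, including a shorter final batch."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

uneven = list(batch_slices(25, 10))
even_count = len(list(batch_slices(2000, 10)))

print(uneven)
print(even_count)
```

Each `(start, stop)` pair can be used to slice both the padded ids and the attention mask, so the two stay aligned.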

In [39]:
x_train_bert, x_test_bert, y_train_bert, y_test_bert = train_test_split(x_bert, y_bert, random_state=0, stratify=y_bert)

In [40]:
%%time
learn_models(models, x_train_bert, y_train_bert, x_test_bert, y_test_bert,
             prepr_model = 'DistilBert')

Wall time: 31.2 s


In [41]:
results[-2:]

Unnamed: 0,Preprocessing model,Learning model,Train f1 score,Test f1 score
4,DistilBert,LogisticRegression,0.87,0.68
5,DistilBert,LGBMClassifier,1.0,0.58


<a id='Conclusion'></a>
## Conclusion

In [42]:
results.sort_values('Test f1 score', ascending=False)

Unnamed: 0,Preprocessing model,Learning model,Train f1 score,Test f1 score
3,TfidfVectorizer,LGBMClassifier,0.95,0.78
1,CountVectorizer,LGBMClassifier,0.92,0.77
0,CountVectorizer,LogisticRegression,0.9,0.76
2,TfidfVectorizer,LogisticRegression,0.76,0.72
4,DistilBert,LogisticRegression,0.87,0.68
5,DistilBert,LGBMClassifier,1.0,0.58


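The best result came from LGBMClassifier on text vectorized with TfidfVectorizer (test F1 of 0.78), narrowly ahead of LGBMClassifier on CountVectorizer features (0.77). DistilBert most likely performed poorly because its classifiers were trained on embeddings of only 2,000 rows, while the other models used the whole dataframe. The gap between train and test F1 for LGBMClassifier on DistilBert embeddings (1.0 versus 0.58) also points to overfitting on that small sample.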