### Lesson 7. Recurrent networks for sequence processing

# Homework

1. Try training a GRU/LSTM neural network to predict the sentiment of Twitter messages, using the dataset at https://www.kaggle.com/datasets/arkhoshghalb/twitter-sentiment-analysis-hatred-speech

2. Describe the result you obtained. What helped you improve the model's accuracy?

If you cannot work through Kaggle (no account verification), you can take the data from this link: https://drive.google.com/file/d/1czQcI0Zgvgo6DjW1-yTFUhL8_XVsF6vi/view?usp=sharing

In [1]:
# Context
# The objective of this task is to detect hate speech in tweets. For the sake of simplicity, we say a tweet contains hate 
# speech if it has a racist or sexist sentiment associated with it. So, the task is to classify racist or sexist tweets from 
# other tweets.
# Formally, given a training sample of tweets and labels, where label '1' denotes the tweet is racist/sexist and label '0' 
# denotes the tweet is not racist/sexist, your objective is to predict the labels on the test dataset.

In [2]:
import pandas as pd
import numpy as np

In [3]:
path = r'\PyTorch-7'  # raw string: '\P' is not a valid escape sequence
path = ''

In [4]:
df_train = pd.read_csv("train.csv",index_col=0)
df_test = pd.read_csv("test.csv",index_col=0)

In [5]:
df_train

id,label,tweet
1,0,@user when a father is dysfunctional and is s...
2,0,@user @user thanks for #lyft credit i can't us...
3,0,bihday your majesty
4,0,#model i love u take with u all the time in ...
5,0,factsguide: society now #motivation
...,...,...
31958,0,ate @user isz that youuu?ðððððð...
31959,0,to see nina turner on the airwaves trying to...
31960,0,listening to sad songs on a monday morning otw...
31961,1,"@user #sikh #temple vandalised in in #calgary,..."


In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31962 entries, 1 to 31962
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   31962 non-null  int64 
 1   tweet   31962 non-null  object
dtypes: int64(1), object(1)
memory usage: 749.1+ KB


In [7]:
df_test

id,tweet
31963,#studiolife #aislife #requires #passion #dedic...
31964,@user #white #supremacists want everyone to s...
31965,safe ways to heal your #acne!! #altwaystohe...
31966,is the hp and the cursed child book up for res...
31967,"3rd #bihday to my amazing, hilarious #nephew..."
...,...
49155,thought factory: left-right polarisation! #tru...
49156,feeling like a mermaid ð #hairflip #neverre...
49157,#hillary #campaigned today in #ohio((omg)) &am...
49158,"happy, at work conference: right mindset leads..."


In [8]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17197 entries, 31963 to 49159
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   tweet   17197 non-null  object
dtypes: object(1)
memory usage: 268.7+ KB


In [9]:
from sklearn.model_selection import train_test_split

In [10]:
# Split the dataset into train and val

In [11]:
df_train, df_val = train_test_split(df_train, 
                                    test_size=0.2, 
                                    random_state=10, 
                                    stratify=df_train['label'])

df_train.shape, df_val.shape

((25569, 2), (6393, 2))
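
The `stratify=` argument keeps the share of each label roughly the same in both parts of the split. The block below is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the helper `stratified_split` and the 90/10 toy labels are made up for illustration and do not reflect the dataset's real class ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=10):
    """Return (train_idx, val_idx) so each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle inside each class
        n_val = round(len(indices) * test_size)   # per-class validation quota
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 90 negatives and 10 positives (a made-up ratio)
labels = [0] * 90 + [1] * 10
train_idx, val_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx) / len(train_idx)
val_pos = sum(labels[i] for i in val_idx) / len(val_idx)
print(len(train_idx), len(val_idx), train_pos, val_pos)
```

With a plain random split, a rare class can end up over- or under-represented in the validation part, which makes the F1-score less stable between runs.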

In [12]:
# Preprocessing

In [13]:
from string import punctuation
from stop_words import get_stop_words
from pymorphy2 import MorphAnalyzer
import re

In [14]:
puncts = set(punctuation)
# Keep apostrophes for now; they are replaced with spaces later,
# since NLTK's built-in English stop words will filter out the leftover fragments anyway
puncts = puncts - {"'"}

In [15]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

In [16]:
lemmatizer = WordNetLemmatizer()
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Relict/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Relict/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [17]:
# Build the stop-word set once instead of re-reading the list for every word
stop_words = set(stopwords.words('english'))

def preprocess_text(txt):
    txt = str(txt)
    txt = ''.join(char for char in txt if char not in puncts) # strip punctuation (apostrophes are kept)
    txt = txt.replace("'", " ") # split contractions: "can't" -> "can t"
    txt = txt.lower().split()
    txt = [word for word in txt if word.isalpha()] # drop tokens containing digits or symbols
    # Lemmatization runs before stop-word removal, so e.g. 'was' becomes 'wa' and is kept
    txt = [lemmatizer.lemmatize(word) for word in txt]
    txt = [word for word in txt if word not in stop_words] # remove stop words
    return ' '.join(txt)

df_train['tweet'] = df_train['tweet'].apply(preprocess_text)
df_val['tweet'] = df_val['tweet'].apply(preprocess_text)
df_test['tweet'] = df_test['tweet'].apply(preprocess_text)

In [18]:
df_train.head()

id,label,tweet
28599,0,attending user user
27507,0,nimwit libtards embrace crookedhillary cronyga...
16242,0,still sad pray today boanoite triste instagram...
23285,0,user le month till york host one greatest annu...
19694,0,user user user person change way living listen...
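
The cleaned tweets above contain the bare word `user` because stripping punctuation turns `@user` into `user`. The sketch below replays the same cleaning steps on one made-up tweet. It leaves out lemmatization, and `TOY_STOP_WORDS` is a hypothetical mini stop list standing in for NLTK's English list, so that the block runs without any downloads.

```python
from string import punctuation

# Hypothetical mini stop list, for illustration only
TOY_STOP_WORDS = {"i", "can", "t", "for", "the", "a", "to"}
PUNCTS = set(punctuation) - {"'"}   # same trick as in the notebook: keep apostrophes

def toy_preprocess(txt):
    txt = ''.join(ch for ch in txt if ch not in PUNCTS)    # drops '@', '#', '!' ...
    txt = txt.replace("'", " ")                             # "can't" -> "can t"
    words = txt.lower().split()
    words = [w for w in words if w.isalpha()]               # drops '2day'
    words = [w for w in words if w not in TOY_STOP_WORDS]   # drops 'i', 'can', 't', 'for'
    return ' '.join(words)

cleaned = toy_preprocess("@user I can't wait for #friday 2day!!")
print(cleaned)
```

Replacing the apostrophe with a space is what lets the stop list catch the leftover pieces of a contraction such as `can` and `t`.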


In [19]:
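# Join with a space so the last word of one tweet does not fuse with the first word
# of the next. The outputs recorded below were produced with ''.join, which is why
# they contain tokens such as 'usernimwit' and 'anemicstill'.
train_corpus = ' '.join(df_train['tweet'].values)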

In [20]:
from nltk.tokenize import word_tokenize

In [21]:
tokens = word_tokenize(train_corpus)
tokens[:10]

['attending',
 'user',
 'usernimwit',
 'libtards',
 'embrace',
 'crookedhillary',
 'cronygarchy',
 'braincell',
 'anemicstill',
 'sad']
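
The tokens `usernimwit` and `anemicstill` in the output above are each two words fused across a tweet boundary, which is what happens when tweets are concatenated with an empty separator. A minimal sketch of the effect follows. It uses `str.split` as a stand-in for `word_tokenize`, and the second tweet is shortened to its first three words.

```python
tweets = ["attending user user", "nimwit libtards embrace"]

glued = ''.join(tweets).split()    # empty separator: tweet boundaries disappear
spaced = ' '.join(tweets).split()  # space separator: every word stays intact

print(glued)
print(spaced)
```

Fused tokens are almost always unique, so they inflate the vocabulary with noise and take counts away from the real words.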

In [22]:
MAX_WORDS = 2000
MAX_LEN = 20

In [23]:
from nltk.probability import FreqDist

In [24]:
dist = FreqDist(tokens)
tokens_top = [items[0] for items in dist.most_common(MAX_WORDS - 1)]

tokens_top[:10]

['user', 'day', 'love', 'u', 'amp', 'like', 'life', 'happy', 'get', 'wa']

In [25]:
vocabulary = {word: idx for idx, word in enumerate(tokens_top, 1)}  # index 0 is reserved for padding

In [26]:
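# Convert tweets to sequences of vocabulary indices and add padding

def text_to_sequence(txt, maxlen):
    result = []
    tokens = word_tokenize(txt)
    for word in tokens:
        if word in vocabulary:  # out-of-vocabulary words are skipped
            result.append(vocabulary[word])

    result = result[-maxlen:]  # keep at most the last maxlen tokens
    padding = [0] * (maxlen - len(result))  # pad on the right with index 0
    return result + padding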

In [27]:
X_train = np.array([text_to_sequence(txt, MAX_LEN) for txt in df_train['tweet'].values])
X_val = np.array([text_to_sequence(txt, MAX_LEN) for txt in df_val['tweet'].values])

X_train.shape, X_val.shape

((25569, 20), (6393, 20))
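
To make the padding and truncation rules concrete, here is a small self-contained sketch of the same conversion. The three-word `toy_vocabulary` is hypothetical, and `str.split` stands in for `word_tokenize`.

```python
toy_vocabulary = {"user": 1, "day": 2, "love": 3}  # hypothetical; 0 is the padding index

def toy_text_to_sequence(txt, maxlen, vocabulary):
    ids = [vocabulary[w] for w in txt.split() if w in vocabulary]  # skip unknown words
    ids = ids[-maxlen:]                                            # keep the last maxlen ids
    return ids + [0] * (maxlen - len(ids))                         # right-pad with zeros

short = toy_text_to_sequence("user love sunny day", 5, toy_vocabulary)
long = toy_text_to_sequence("user user love day love day", 4, toy_vocabulary)
print(short)
print(long)
```

The unknown word `sunny` is dropped without a placeholder, so a tweet made entirely of out-of-vocabulary words becomes a row of zeros.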

In [28]:
import torch
import torch.nn as nn
from torchinfo import summary

In [29]:
class GRU_Net(nn.Module):
    def __init__(self, vocab_size=2000, embedding_dim=512, out_dim=256, use_last=True, threshold=0.5, num_classes=1):
        super().__init__()
        self.threshold = threshold  # probability cut-off used by predict()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)  # index 0 is padding
        self.gru = nn.GRU(embedding_dim, out_dim, batch_first=True)
        self.linear = nn.Linear(out_dim, num_classes)
        self.dp = nn.Dropout(0.5)
        self.use_last = use_last

    def forward(self, x):
        x = self.embedding(x)
        x = self.dp(x)
        x, _ = self.gru(x)  # (batch, seq_len, out_dim)

        if self.use_last:
            x = x[:, -1, :]  # output at the last time step (a padding position for short tweets)
        else:
            x = x.mean(dim=1)  # average over all time steps

        x = self.dp(x)
        x = self.linear(x)
        x = torch.sigmoid(x)  # probability of the positive class
        return x

    def predict(self, x):
        # Relies on the global `device` defined later in the notebook
        x = torch.IntTensor(x).to(device)
        x = self.forward(x)
        x = torch.squeeze((x > self.threshold).int())
        return x
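
With `use_last=True` and right-padded inputs, the vector taken at the last time step has already passed through the padding positions of a short tweet. One possible alternative, which the notebook itself does not use, is to pack the batch so that the GRU stops at each tweet's true length. The sketch below assumes a toy vocabulary of 10 ids and tiny layer sizes chosen only for the demonstration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

torch.manual_seed(0)
embedding = nn.Embedding(10, 4, padding_idx=0)   # toy sizes, not the notebook's
gru = nn.GRU(4, 3, batch_first=True)

# Two right-padded sequences with true lengths 3 and 1
x = torch.tensor([[5, 2, 7, 0, 0],
                  [3, 0, 0, 0, 0]])
# Packing rejects zero-length rows, so an all-padding tweet is treated as length 1
lengths = (x != 0).sum(dim=1).clamp(min=1)

with torch.no_grad():
    packed = pack_padded_sequence(embedding(x), lengths, batch_first=True,
                                  enforce_sorted=False)
    _, h_n = gru(packed)
    last_valid = h_n[-1]              # (batch, hidden): state after the last real token

    out_full, _ = gru(embedding(x))   # unpacked run over all 5 positions
    last_padded = out_full[:, -1, :]  # what x[:, -1, :] would return

print(last_valid.shape)
```

Here `last_valid[i]` matches `out_full[i, lengths[i] - 1]`, the state right after the final real token, while `last_padded` has additionally been updated by the padding steps.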

In [30]:
summary(GRU_Net(), input_data=torch.IntTensor(X_train[np.newaxis, 0]))

Layer (type:depth-idx)                   Output Shape              Param #
GRU_Net                                  [1, 1]                    --
├─Embedding: 1-1                         [1, 20, 512]              1,024,000
├─Dropout: 1-2                           [1, 20, 512]              --
├─GRU: 1-3                               [1, 20, 256]              591,360
├─Dropout: 1-4                           [1, 256]                  --
├─Linear: 1-5                            [1, 1]                    257
Total params: 1,615,617
Trainable params: 1,615,617
Non-trainable params: 0
Total mult-adds (M): 12.85
Input size (MB): 0.00
Forward/backward pass size (MB): 0.12
Params size (MB): 6.46
Estimated Total Size (MB): 6.59
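
The parameter counts in the summary above can be checked by hand. The sketch below assumes a single-layer, unidirectional `nn.GRU` with biases enabled, which has three weight blocks (reset gate, update gate, candidate state), each with input weights, hidden weights and two bias vectors. The helper name `gru_param_count` is made up for this check.

```python
def gru_param_count(input_size, hidden_size):
    # Per block: W_ih (hidden x input), W_hh (hidden x hidden), b_ih and b_hh (hidden each)
    per_block = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size
    return 3 * per_block

embedding_params = 2000 * 512            # vocab_size x embedding_dim
gru_params = gru_param_count(512, 256)   # embedding_dim -> out_dim
linear_params = 256 * 1 + 1              # weights plus one bias

total = embedding_params + gru_params + linear_params
print(embedding_params, gru_params, linear_params, total)
```

Almost two thirds of the weights sit in the embedding table, so `MAX_WORDS` and `embedding_dim` are the first knobs to consider when the model overfits.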

In [31]:
from torch.utils.data import DataLoader, Dataset

In [32]:
class DataWrapper(Dataset):
    def __init__(self, data, target):
        self.data = torch.from_numpy(data)
        self.target = torch.from_numpy(target)
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
            
        return x, y
    
    def __len__(self):
        return self.data.shape[0]

In [33]:
torch.random.manual_seed(10)

train_dataset = DataWrapper(X_train, df_train['label'].values)
train_loader = DataLoader(train_dataset, batch_size=512, shuffle=True)

val_dataset = DataWrapper(X_val, df_val['label'].values)
val_loader = DataLoader(val_dataset, batch_size=512, shuffle=True)

In [34]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [47]:
def train_nn(epochs, embedding_dim=512, hidden_size=256, lr=0.001, threshold=0.5):
    # Fix the seed so that runs with different settings are comparable
    torch.random.manual_seed(10)
    torch.backends.cudnn.deterministic = True

    net = GRU_Net(vocab_size=MAX_WORDS, embedding_dim=embedding_dim,
                  out_dim=hidden_size, threshold=threshold).to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    criterion = nn.BCELoss()

    def f1_from_counts(tp, fp, fn):
        precision = tp / (tp + fp) if (tp + fp) != 0 else 0
        recall = tp / (tp + fn) if (tp + fn) != 0 else 0
        return 2 * precision * recall / (precision + recall) if (precision + recall) != 0 else 0

    # History of training F1-scores; the two seed values let the first epoch pass the check
    f1 = [0, 0.01]

    for epoch in range(epochs):
        # Early stopping: quit as soon as the training F1-score stops increasing
        if f1[epoch] >= f1[epoch + 1]:
            break

        train_losses, val_losses = [], []
        tp, fp, fn = 0, 0, 0

        for inputs, labels in train_loader:
            net.train()
            inputs, labels = inputs.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = net(inputs)
            loss = criterion(outputs, labels.float().view(-1, 1))
            loss.backward()
            optimizer.step()
            train_losses.append(loss.item())

            # Re-score the batch in eval mode (dropout off) for the training F1-score
            net.eval()
            with torch.no_grad():
                preds = (net(inputs) > threshold).int().view(-1)
            tp += ((labels == 1) & (preds == 1)).sum().item()
            fp += ((labels == 0) & (preds == 1)).sum().item()
            fn += ((labels == 1) & (preds == 0)).sum().item()

        f1_score = f1_from_counts(tp, fp, fn)
        f1.append(f1_score)
        print(f'Epoch [{epoch + 1}/{epochs}]. '
              f'Loss: {np.mean(train_losses):.3f}. '
              f'F1-score: {f1_score:.3f}', end='. ')

        # Evaluate on the validation split (printed as "Test" in the log)
        tp, fp, fn = 0, 0, 0
        net.eval()
        with torch.no_grad():
            for inputs, labels in val_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = net(inputs)
                val_losses.append(criterion(outputs, labels.float().view(-1, 1)).item())

                preds = (outputs > threshold).int().view(-1)
                tp += ((labels == 1) & (preds == 1)).sum().item()
                fp += ((labels == 0) & (preds == 1)).sum().item()
                fn += ((labels == 1) & (preds == 0)).sum().item()

        f1_score = f1_from_counts(tp, fp, fn)
        print(f'Test loss: {np.mean(val_losses):.3f}. Test F1-score: {f1_score:.3f}.')

    print('Training is finished!')
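
The training loop reports the F1-score rather than accuracy. A short worked example with made-up confusion counts (not taken from this dataset) shows why that matters when the positive class is rare; `toy_f1` is a hypothetical helper that mirrors the formula used in the loop.

```python
def toy_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of predicted positives that are right
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of real positives that are found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 1000 tweets, 50 of them positive
tp, fp, fn, tn = 30, 10, 20, 940

accuracy = (tp + tn) / (tp + fp + fn + tn)
f1 = toy_f1(tp, fp, fn)
print(f'accuracy={accuracy:.3f}, f1={f1:.3f}')
```

In this example accuracy is 0.97 even though 20 of the 50 positive tweets are missed, while the F1-score of about 0.667 reflects the missed positives directly.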


In [48]:
train_nn(epochs=100)

Epoch [1/100]. Loss: 0.299. F1-score: 0.174. Test loss: 0.180. Test F1-score: 0.444.
Epoch [2/100]. Loss: 0.173. F1-score: 0.565. Test loss: 0.154. Test F1-score: 0.502.
Epoch [3/100]. Loss: 0.151. F1-score: 0.633. Test loss: 0.144. Test F1-score: 0.585.
Epoch [4/100]. Loss: 0.132. F1-score: 0.690. Test loss: 0.142. Test F1-score: 0.613.
Epoch [5/100]. Loss: 0.122. F1-score: 0.723. Test loss: 0.138. Test F1-score: 0.600.
Epoch [6/100]. Loss: 0.114. F1-score: 0.756. Test loss: 0.148. Test F1-score: 0.609.
Epoch [7/100]. Loss: 0.105. F1-score: 0.777. Test loss: 0.135. Test F1-score: 0.625.
Epoch [8/100]. Loss: 0.099. F1-score: 0.798. Test loss: 0.141. Test F1-score: 0.626.
Epoch [9/100]. Loss: 0.093. F1-score: 0.821. Test loss: 0.154. Test F1-score: 0.640.
Epoch [10/100]. Loss: 0.087. F1-score: 0.836. Test loss: 0.143. Test F1-score: 0.634.
Epoch [11/100]. Loss: 0.077. F1-score: 0.865. Test loss: 0.163. Test F1-score: 0.629.
Epoch [12/100]. Loss: 0.074. F1-score: 0.872. Test loss: 0.160.

### Conclusion

The best result on the held-out data was reached at epoch 9, with an F1-score of 0.64 (the network parameters were kept the same as in Lesson 6). The F1-score in Lesson 7 is higher than the one obtained in Lesson 6. At epoch 21, once the network had begun to overfit, the training loop was stopped.
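
The notebook's loop stops when the training F1-score stops increasing. A common alternative, swapped in here purely for illustration, is to monitor the held-out F1-score with a patience counter and remember the best epoch. The sketch below replays the held-out F1 values printed in the log above for epochs 1 to 11; `early_stop_epoch` and the patience values are assumptions, not part of the original code.

```python
# Held-out F1-scores copied from the training log, epochs 1..11
val_f1 = [0.444, 0.502, 0.585, 0.613, 0.600, 0.609, 0.625, 0.626, 0.640, 0.634, 0.629]

def early_stop_epoch(scores, patience):
    """Return (best_epoch, stop_epoch), 1-based; stop_epoch is None if patience never runs out."""
    best_score, best_epoch, waited = float('-inf'), None, 0
    for epoch, score in enumerate(scores, 1):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0   # new best: reset the counter
        else:
            waited += 1                                        # no improvement this epoch
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None

result_p2 = early_stop_epoch(val_f1, patience=2)
result_p3 = early_stop_epoch(val_f1, patience=3)
print(result_p2, result_p3)
```

On these values a patience of 2 stops at epoch 6 and keeps epoch 4, whereas a patience of 3 survives the dip at epochs 5 and 6 and reaches the best epoch, 9. In a real run the model weights from the best epoch would also be saved, for example with `torch.save(net.state_dict(), ...)`.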