# Description of the notebook

In the current notebook I've conducted experiments of applying **LSTM** model trained from scratch for classification task

## Preprocessing steps:

* tokenization using TweetTokenizer

* lemmatization using WordNetLemmatizer

* filtering out punctuation symbols and stopwords

Exactly that scheme demonstrated the best results on the 1st HW task

## Hyperparameters of the LSTM model

* embedding dimension: grid = [64, 128, 256] (tests the impact of feature vector size on model expressiveness versus computational cost)

* dimension of hidden representation in recurrent neural network: grid = [64, 128, 256]

* number of layers of our reccurent neural network: grid = [1, 5, 10] (explores model depth to capture complex patterns, with deeper models risking overfitting)

* one- and bidirectional represenation options (bidirectional LSTMs use both past and future context, potentially improving performance)

* dropout rate = 0.2 (provides regularization to reduce overfitting)

## LSTM experiments results

| is_bidirectional | embedding_dim | hidden_dim | n_layers | val_loss | val_f1  |
|------------------|--------------|------------|----------|----------|---------|
| True            | 64           | 64         | 1        | 0.5948   | 0.6209  |
| True            | 64           | 64         | 5        | 0.5376   | 0.7049  |
| True            | 64           | 64         | 10       | 0.5657   | 0.6778  |
| True            | 64           | 128        | 1        | 0.5930   | 0.6363  |
| True            | 64           | 128        | 5        | 0.5317   | 0.7102  |
| True            | 64           | 128        | 10       | 0.5279   | 0.6929  |
| True            | 64           | 256        | 1        | 0.5678   | 0.6711  |
| True            | 64           | 256        | 5        | 0.5373   | 0.6945  |
| True            | 64           | 256        | 10       | 0.5330   | 0.6951  |
| True            | 128          | 64         | 1        | 0.5641   | 0.6420  |
| True            | 128          | 64         | 5        | 0.5417   | 0.7015  |
| True            | 128          | 64         | 10       | 0.5296   | 0.7095  |
| True            | 128          | 128        | 1        | 0.5277   | 0.6932  |
| True            | 128          | 128        | 5        | 0.5378   | 0.7127  |
| True            | 128          | 128        | 10       | 0.5223   | 0.7010  |
| True            | 128          | 256        | 1        | 0.5528   | 0.6747  |
| True            | 128          | 256        | 5        | 0.5322   | 0.6990  |
| True            | 128          | 256        | 10       | 0.5634   | 0.6985  |
| True            | 256          | 64         | 1        | 0.5084   | 0.7058  |
| True            | 256          | 64         | 5        | 0.5452   | 0.6924  |
| True            | 256          | 64         | 10       | 0.5431   | 0.6989  |
| True            | 256          | 128        | 1        | 0.5163   | 0.6990  |
| True            | 256          | 128        | 5        | 0.5254   | 0.7063  |
| True            | 256          | 128        | 10       | 0.5429   | 0.7020  |
| True            | 256          | 256        | 1        | 0.5393   | 0.6861  |
| True            | 256          | 256        | 5        | 0.5437   | 0.7048  |
| True            | 256          | 256        | 10       | 0.5032   | **0.7293**  |
| False           | 64           | 64         | 1        | 0.6235   | 0.6442  |
| False           | 64           | 64         | 5        | 0.5612   | 0.6639  |
| False           | 64           | 64         | 10       | 0.5926   | 0.6511  |
| False           | 64           | 128        | 1        | 0.5550   | 0.6845  |
| False           | 64           | 128        | 5        | 0.5499   | 0.6678  |
| False           | 64           | 128        | 10       | 0.5295   | 0.6959  |
| False           | 64           | 256        | 1        | 0.5287   | 0.6867  |
| False           | 64           | 256        | 5        | 0.5322   | 0.6776  |
| False           | 64           | 256        | 10       | 0.5384   | 0.6786  |
| False           | 128          | 64         | 1        | 0.6041   | 0.6628  |
| False           | 128          | 64         | 5        | 0.5792   | 0.6787  |
| False           | 128          | 64         | 10       | 0.5609   | 0.7026  |
| False           | 128          | 128        | 1        | 0.5598   | 0.7038  |
| False           | 128          | 128        | 5        | 0.5423   | 0.7026  |
| False           | 128          | 128        | 10       | 0.5559   | 0.6944  |
| False           | 128          | 256        | 1        | 0.5477   | 0.6981  |
| False           | 128          | 256        | 5        | 0.5279   | 0.7123  |
| False           | 128          | 256        | 10       | 0.5273   | 0.7142  |
| False           | 256          | 64         | 1        | 0.6279   | 0.6513  |
| False           | 256          | 64         | 5        | 0.5825   | 0.6852  |
| False           | 256          | 64         | 10       | 0.5629   | 0.6921  |
| False           | 256          | 128        | 1        | 0.5699   | 0.7061  |
| False           | 256          | 128        | 5        | 0.5605   | 0.6947  |
| False           | 256          | 128        | 10       | 0.5418   | 0.7067  |
| False           | 256          | 256        | 1        | 0.5635   | 0.6986  |
| False           | 256          | 256        | 5        | 0.5326   | 0.7038  |
| False           | 256          | 256        | 10       | 0.5425   | 0.7061  |


---

## Code:

In [1]:
import numpy as np
import pandas as pd
import nltk
import re
import matplotlib.pyplot as plt
import seaborn

In [10]:
data_full = pd.read_csv('train_data.csv')

In [11]:
data_full.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [12]:
data = data_full[['text', 'target']]

In [13]:
from nltk.tokenize import TweetTokenizer

In [14]:
from nltk.stem import WordNetLemmatizer

In [15]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\yunes\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [16]:
tknzr = TweetTokenizer()
lemmatizer = WordNetLemmatizer()

In [17]:
def tokenize_and_lemmatize(text):
    tokens = tknzr.tokenize(text)
    return list(map(lemmatizer.lemmatize, tokens))

In [18]:
data['tokenized_text'] = data['text'].apply(
    lambda sent: tokenize_and_lemmatize(sent)
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['tokenized_text'] = data['text'].apply(


In [19]:
nltk.download('stopwords', quiet=True)

True

In [20]:
from nltk.corpus import stopwords
from string import punctuation

In [21]:
stopwords_set = set(stopwords.words("english"))
punctuation_set = set(punctuation)
noise = stopwords_set.union(punctuation_set)

In [22]:
data['filtered_text'] = data['tokenized_text'].apply(
    lambda tokens: [token.lower() for token in tokens if token.lower() not in noise]
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['filtered_text'] = data['tokenized_text'].apply(


In [23]:
data['filtered_text_joined'] = data['filtered_text'].apply(lambda tokens: ' '.join(tokens))

In [24]:
data

Unnamed: 0,text,target,tokenized_text,filtered_text,filtered_text_joined
0,Our Deeds are the Reason of this #earthquake M...,1,"[Our, Deeds, are, the, Reason, of, this, #eart...","[deeds, reason, #earthquake, may, allah, forgi...",deeds reason #earthquake may allah forgive u
1,Forest fire near La Ronge Sask. Canada,1,"[Forest, fire, near, La, Ronge, Sask, ., Canada]","[forest, fire, near, la, ronge, sask, canada]",forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,1,"[All, resident, asked, to, ', shelter, in, pla...","[resident, asked, shelter, place, notified, of...",resident asked shelter place notified officer ...
3,"13,000 people receive #wildfires evacuation or...",1,"[13,000, people, receive, #wildfires, evacuati...","[13,000, people, receive, #wildfires, evacuati...","13,000 people receive #wildfires evacuation or..."
4,Just got sent this photo from Ruby #Alaska as ...,1,"[Just, got, sent, this, photo, from, Ruby, #Al...","[got, sent, photo, ruby, #alaska, smoke, #wild...",got sent photo ruby #alaska smoke #wildfires p...
...,...,...,...,...,...
7608,Two giant cranes holding a bridge collapse int...,1,"[Two, giant, crane, holding, a, bridge, collap...","[two, giant, crane, holding, bridge, collapse,...",two giant crane holding bridge collapse nearby...
7609,@aria_ahrary @TheTawniest The out of control w...,1,"[@aria_ahrary, @TheTawniest, The, out, of, con...","[@aria_ahrary, @thetawniest, control, wild, fi...",@aria_ahrary @thetawniest control wild fire ca...
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1,"[M1, ., 94, [, 01:04, UTC, ], ?, 5km, S, of, V...","[m1, 94, 01:04, utc, 5km, volcano, hawaii, htt...",m1 94 01:04 utc 5km volcano hawaii http://t.co...
7611,Police investigating after an e-bike collided ...,1,"[Police, investigating, after, an, e-bike, col...","[police, investigating, e-bike, collided, car,...",police investigating e-bike collided car littl...


In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_train, X_val, y_train, y_val = train_test_split(data['filtered_text_joined'], data['target'], test_size=0.2, random_state=42)

In [27]:
def build_vocab(texts, max_words=20000):
    token_count_dict = {}
    for text in texts:
        for token in text.split():
            if token not in token_count_dict:
                token_count_dict[token] = 1
            else:
                token_count_dict[token] = token_count_dict[token] + 1

    tokens_freq_list = list(token_count_dict.items())
    tokens_freq_list.sort(key=lambda x: x[1], reverse=True)
    sorted_tokens = tokens_freq_list[:max_words - 2]

    vocabulary = {
        "<pad>": 0,
        "<oov>": 1,
    }

    for i, (token, count) in enumerate(sorted_tokens):
        vocabulary[token] = i

    return vocabulary

In [28]:
vocab = build_vocab(data['filtered_text_joined'])

In [29]:
vocab

{'<pad>': 0,
 '<oov>': 1,
 '...': 0,
 '\x89': 1,
 'wa': 2,
 'like': 3,
 'û_': 4,
 'fire': 5,
 'get': 6,
 'ha': 7,
 'new': 8,
 'via': 9,
 'one': 10,
 'u': 11,
 'people': 12,
 '2': 13,
 'video': 14,
 'emergency': 15,
 'disaster': 16,
 'time': 17,
 'body': 18,
 'police': 19,
 'day': 20,
 'year': 21,
 'would': 22,
 'still': 23,
 'building': 24,
 'say': 25,
 'go': 26,
 'news': 27,
 'home': 28,
 'crash': 29,
 'storm': 30,
 'back': 31,
 '..': 32,
 'burning': 33,
 'know': 34,
 'suicide': 35,
 '3': 36,
 'got': 37,
 'california': 38,
 'see': 39,
 'man': 40,
 'look': 41,
 'car': 42,
 'first': 43,
 'attack': 44,
 'life': 45,
 'death': 46,
 'bomb': 47,
 'train': 48,
 'going': 49,
 'make': 50,
 'love': 51,
 'family': 52,
 'rt': 53,
 'two': 54,
 'killed': 55,
 'world': 56,
 'dead': 57,
 'flood': 58,
 'û': 59,
 'accident': 60,
 'nuclear': 61,
 'today': 62,
 'full': 63,
 'want': 64,
 'war': 65,
 'need': 66,
 'good': 67,
 'think': 68,
 'may': 69,
 "can't": 70,
 'way': 71,
 'pm': 72,
 'watch': 73,
 'ûªs'

In [30]:
def text_to_id(text, vocab):
    return [vocab.get(token, vocab['<oov>']) for token in text.split()]

In [31]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [32]:
class TweetDisasterDataset(Dataset):
    def __init__(self, df, vocab):
        self.texts = list(df['filtered_text_joined'].values)
        self.labels = list(df['target'].values)
        self.vocab = vocab

    def __getitem__(self, idx):
        sequence = torch.tensor(text_to_id(self.texts[idx], self.vocab), dtype=torch.long)
        label = torch.tensor(self.labels[idx], dtype=torch.float)
        return sequence, label

    def __len__(self):
        return len(self.labels)

In [33]:
train_dataset = TweetDisasterDataset(pd.concat([X_train, y_train], axis=1), vocab)
val_dataset = TweetDisasterDataset(pd.concat([X_val, y_val], axis=1), vocab)

In [34]:
def collate_fn(batch):
    sequences, labels = zip(*batch)
    sequences_padded = pad_sequence(sequences, batch_first=True, padding_value=vocab['<pad>'])
    labels = torch.tensor(labels, dtype=torch.float)

    return sequences_padded, labels

In [35]:
batch_size = 64

In [36]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

In [37]:
class TweetDisasterLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, bidirectional=False, num_layers=1, dropout=0.3):
        super(TweetDisasterLSTMClassifier, self).__init__()

        self.embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, num_layers=num_layers, bidirectional=bidirectional,
                            batch_first=True, dropout=dropout if num_layers > 1 else 0)
        self.dropout = nn.Dropout(p=dropout)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embedding_vector = self.embedding(x)
        out, (hid, c) = self.lstm(embedding_vector)

        return torch.sigmoid(self.fc(self.dropout(hid[-1]))).squeeze()

In [38]:
embedding_dim = 128
hidden_dim = 64
num_layers = 1
dropout_p = 0.3
vocab_size = len(vocab)

In [39]:
model = TweetDisasterLSTMClassifier(vocab_size, embedding_dim, hidden_dim, False, num_layers, dropout_p)

In [40]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


In [41]:
model = model.to(device)

In [42]:
loss_fn = nn.BCELoss()

In [43]:
optimizer = optim.Adam(model.parameters(), lr=1e-4)

In [44]:
best_lstm_val_loss = np.inf
max_epochs_early_stopping= 3
counter_early_stopping = 0
num_epochs = 25

In [None]:
for epoch in range(num_epochs):
    model.train()
    train_losses = []
    for sequences, labels in train_dataloader:
        sequences, labels = sequences.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(sequences)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())

    avg_train_loss = np.mean(train_losses)

    model.eval()
    val_losses = []
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            outputs = model(sequences)
            loss = loss_fn(outputs, labels)
            val_losses.append(loss.item())

    avg_val_loss = np.mean(val_losses)
    print(f"epoch {epoch+1}/{num_epochs}: train Loss: {avg_train_loss:.4f}, val loss: {avg_val_loss:.4f}")

    if avg_val_loss < best_lstm_val_loss:
        best_lstm_val_loss = avg_val_loss
        counter_early_stopping = 0
        torch.save(model.state_dict(), "best_lstm_model.pt")
        print("model saved.")
    else:
        counter_early_stopping += 1
        if counter_early_stopping >= max_epochs_early_stopping:
            print("early stopping triggered.")
            break


epoch 1/25: train Loss: 0.6903, val loss: 0.6870
model saved.
epoch 2/25: train Loss: 0.6857, val loss: 0.6824
model saved.
epoch 3/25: train Loss: 0.6816, val loss: 0.6776
model saved.
epoch 4/25: train Loss: 0.6766, val loss: 0.6717
model saved.
epoch 5/25: train Loss: 0.6692, val loss: 0.6636
model saved.
epoch 6/25: train Loss: 0.6525, val loss: 0.6537
model saved.
epoch 7/25: train Loss: 0.6199, val loss: 0.6495
model saved.
epoch 8/25: train Loss: 0.5873, val loss: 0.6455
model saved.
epoch 9/25: train Loss: 0.5498, val loss: 0.6436
model saved.
epoch 10/25: train Loss: 0.5052, val loss: 0.6306
model saved.
epoch 11/25: train Loss: 0.4642, val loss: 0.6200
model saved.
epoch 12/25: train Loss: 0.4382, val loss: 0.6166
model saved.
epoch 13/25: train Loss: 0.4062, val loss: 0.6225
epoch 14/25: train Loss: 0.3791, val loss: 0.6331
epoch 15/25: train Loss: 0.3639, val loss: 0.6329
early stopping triggered.


In [None]:
model.load_state_dict(torch.load("best_lstm_model.pt"))
model.eval()

TweetDisasterLSTMClassifier(
  (embedding): Embedding(20000, 128, padding_idx=0)
  (lstm): LSTM(128, 64, batch_first=True)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=64, out_features=1, bias=True)
)

In [None]:
val_preds = []
val_labels = []

with torch.no_grad():
    for sequences, labels in val_dataloader:
        sequences = sequences.to(device)
        outputs = model(sequences)
        preds = (outputs > 0.5).int().cpu().numpy()
        val_preds.extend(preds)
        val_labels.extend(labels.cpu().numpy())

In [None]:
from sklearn.metrics import accuracy_score, classification_report, f1_score

In [None]:
accuracy = accuracy_score(val_labels, val_preds)
print("val acc:", accuracy)
print("classification report:")
print(classification_report(val_labels, val_preds))

val acc: 0.7051871306631649
classification report:
              precision    recall  f1-score   support

         0.0       0.76      0.71      0.73       874
         1.0       0.64      0.70      0.67       649

    accuracy                           0.71      1523
   macro avg       0.70      0.70      0.70      1523
weighted avg       0.71      0.71      0.71      1523



In [None]:
def train_and_evaluate(embedding_dim, hidden_dim, bidirectional=False, num_layers=1, dropout_rate=0.2, num_epochs=10, patience=3):
    model = TweetDisasterLSTMClassifier(vocab_size, embedding_dim, hidden_dim, bidirectional, num_layers, dropout_p)
    model.to(device)
    loss_fn = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    best_val_loss = np.inf
    counter = 0
    patience = 3
    for epoch in range(num_epochs):
        model.train()
        train_losses = []
        for sequences, labels in train_dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(sequences)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            train_losses.append(loss.item())
        avg_train_loss = np.mean(train_losses)
        model.eval()
        val_losses = []
        with torch.no_grad():
            for sequences, labels in val_dataloader:
                sequences, labels = sequences.to(device), labels.to(device)
                outputs = model(sequences)
                loss = loss_fn(outputs, labels)
                val_losses.append(loss.item())
        avg_val_loss = np.mean(val_losses)
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            counter = 0
            best_model_state = model.state_dict()
        else:
            counter += 1
            if counter >= patience:
                break
    model.load_state_dict(best_model_state)
    model.eval()
    val_preds = []
    val_labels = []
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            sequences = sequences.to(device)
            outputs = model(sequences)
            preds = (outputs > 0.5).int().cpu().numpy()
            val_preds.extend(preds)
            val_labels.extend(labels.cpu().numpy())
    f1_val_score = f1_score(val_labels, val_preds)
    return best_val_loss, f1_val_score

In [None]:
bidirectional = [True, False]
embedding_dims = [64, 128, 256]
hidden_dims = [64, 128, 256]
num_layers = [1, 5, 10]

In [None]:
results = {}
for is_bidirectional in bidirectional:
    for emb_dim in embedding_dims:
        for hid_dim in hidden_dims:
            for n_layers in num_layers:
                loss_val, f1_val = train_and_evaluate(emb_dim, hid_dim, is_bidirectional, num_layers=n_layers, num_epochs=15)
                results[(is_bidirectional, emb_dim, hid_dim, n_layers)] = (loss_val, f1_val)
                print(f"is_bidirectional: {is_bidirectional}, embedding_dim: {emb_dim}, hidden_dim: {hid_dim}, n_layers: {n_layers}, val_loss: {loss_val:.4f}, val_f1: {f1_val:.4f}")

is_bidirectional: True, embedding_dim: 64, hidden_dim: 64, n_layers: 1, val_loss: 0.5948, val_f1: 0.6209
is_bidirectional: True, embedding_dim: 64, hidden_dim: 64, n_layers: 5, val_loss: 0.5376, val_f1: 0.7049
is_bidirectional: True, embedding_dim: 64, hidden_dim: 64, n_layers: 10, val_loss: 0.5657, val_f1: 0.6778
is_bidirectional: True, embedding_dim: 64, hidden_dim: 128, n_layers: 1, val_loss: 0.5930, val_f1: 0.6363
is_bidirectional: True, embedding_dim: 64, hidden_dim: 128, n_layers: 5, val_loss: 0.5317, val_f1: 0.7102
is_bidirectional: True, embedding_dim: 64, hidden_dim: 128, n_layers: 10, val_loss: 0.5279, val_f1: 0.6929
is_bidirectional: True, embedding_dim: 64, hidden_dim: 256, n_layers: 1, val_loss: 0.5678, val_f1: 0.6711
is_bidirectional: True, embedding_dim: 64, hidden_dim: 256, n_layers: 5, val_loss: 0.5373, val_f1: 0.6945
is_bidirectional: True, embedding_dim: 64, hidden_dim: 256, n_layers: 10, val_loss: 0.5330, val_f1: 0.6951
is_bidirectional: True, embedding_dim: 128, hi

In [None]:
best_params = max(results, key=lambda k: results[k][1])
print("best hyperparameters:", best_params, "with val_loss:", results[best_params][0], "and f1_val:", results[best_params][1])

best hyperparameters: (True, 256, 256, 10) with val_loss: 0.503241935124 and f1_val: 0.7292817679558011


In [45]:
embedding_dim_one_directional = 256
hidden_dim_one_directional = 128
num_layers_one_directional = 10
is_bidirectional = False

In [46]:
# for bidirectional
embedding_dim = 256
hidden_dim = 256
num_layers = 10
is_bidirectional = True

In [None]:
model = TweetDisasterLSTMClassifier(vocab_size, embedding_dim, hidden_dim, is_bidirectional, num_layers, dropout_p)
model.to(device)
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
best_val_loss = np.inf
counter = 0
patience = 3
for epoch in range(num_epochs):
    model.train()
    train_losses = []
    for sequences, labels in train_dataloader:
        sequences, labels = sequences.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(sequences)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    avg_train_loss = np.mean(train_losses)
    model.eval()
    val_losses = []
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            outputs = model(sequences)
            loss = loss_fn(outputs, labels)
            val_losses.append(loss.item())
    avg_val_loss = np.mean(val_losses)
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        counter = 0
        best_model_state = model.state_dict()
    else:
        counter += 1
        if counter >= patience:
            break
model.load_state_dict(best_model_state)
model.eval()

In [None]:
tknzr = TweetTokenizer()
lemmatizer = WordNetLemmatizer()

In [None]:
def tokenize_and_lemmatize(text):
    tokens = tknzr.tokenize(text)
    return list(map(lemmatizer.lemmatize, tokens))

In [None]:
tokenizer = tokenize_and_lemmatize

In [None]:
test_data = pd.read_csv('test_data.csv')[['text']]

In [None]:
test_data['tokenized_text'] = test_data['text'].apply(
    lambda sent: tokenizer(sent)
)

In [None]:
test_data['filtered_text'] = test_data['tokenized_text'].apply(
    lambda tokens: [token.lower() for token in tokens if token.lower() not in noise]
)

test_data['filtered_text_joined'] = test_data['filtered_text'].apply(lambda tokens: " ".join(tokens))

In [None]:
test_sequences = [torch.tensor(text_to_id(text, vocab), dtype=torch.long) for text in test_data['filtered_text_joined']]

In [None]:
test_sequences_padded = pad_sequence(test_sequences, batch_first=True, padding_value=vocab["<pad>"])
test_sequences_padded = test_sequences_padded.to(device)

In [None]:
model.eval()
with torch.no_grad():
    test_outputs = model(test_sequences_padded)
    test_predictions = (test_outputs > 0.5).int().cpu().numpy()

In [None]:
test_predictions

array([1, 0, 1, ..., 1, 1, 1], dtype=int32)

In [None]:
sample_submission = pd.read_csv('sample_submission.csv')

In [None]:
test_submission_lstm = pd.DataFrame(test_predictions, index=sample_submission.id, columns=['target'])

In [None]:
test_submission_lstm.index.name = 'id'

In [None]:
test_submission_lstm.to_csv('test_submission_lstm_bidirectional.csv')

LSTM experiments results

| is_bidirectional | embedding_dim | hidden_dim | n_layers | val_loss | val_f1  |
|------------------|--------------|------------|----------|----------|---------|
| True            | 64           | 64         | 1        | 0.5948   | 0.6209  |
| True            | 64           | 64         | 5        | 0.5376   | 0.7049  |
| True            | 64           | 64         | 10       | 0.5657   | 0.6778  |
| True            | 64           | 128        | 1        | 0.5930   | 0.6363  |
| True            | 64           | 128        | 5        | 0.5317   | 0.7102  |
| True            | 64           | 128        | 10       | 0.5279   | 0.6929  |
| True            | 64           | 256        | 1        | 0.5678   | 0.6711  |
| True            | 64           | 256        | 5        | 0.5373   | 0.6945  |
| True            | 64           | 256        | 10       | 0.5330   | 0.6951  |
| True            | 128          | 64         | 1        | 0.5641   | 0.6420  |
| True            | 128          | 64         | 5        | 0.5417   | 0.7015  |
| True            | 128          | 64         | 10       | 0.5296   | 0.7095  |
| True            | 128          | 128        | 1        | 0.5277   | 0.6932  |
| True            | 128          | 128        | 5        | 0.5378   | 0.7127  |
| True            | 128          | 128        | 10       | 0.5223   | 0.7010  |
| True            | 128          | 256        | 1        | 0.5528   | 0.6747  |
| True            | 128          | 256        | 5        | 0.5322   | 0.6990  |
| True            | 128          | 256        | 10       | 0.5634   | 0.6985  |
| True            | 256          | 64         | 1        | 0.5084   | 0.7058  |
| True            | 256          | 64         | 5        | 0.5452   | 0.6924  |
| True            | 256          | 64         | 10       | 0.5431   | 0.6989  |
| True            | 256          | 128        | 1        | 0.5163   | 0.6990  |
| True            | 256          | 128        | 5        | 0.5254   | 0.7063  |
| True            | 256          | 128        | 10       | 0.5429   | 0.7020  |
| True            | 256          | 256        | 1        | 0.5393   | 0.6861  |
| True            | 256          | 256        | 5        | 0.5437   | 0.7048  |
| True            | 256          | 256        | 10       | 0.5032   | **0.7293**  |
| False           | 64           | 64         | 1        | 0.6235   | 0.6442  |
| False           | 64           | 64         | 5        | 0.5612   | 0.6639  |
| False           | 64           | 64         | 10       | 0.5926   | 0.6511  |
| False           | 64           | 128        | 1        | 0.5550   | 0.6845  |
| False           | 64           | 128        | 5        | 0.5499   | 0.6678  |
| False           | 64           | 128        | 10       | 0.5295   | 0.6959  |
| False           | 64           | 256        | 1        | 0.5287   | 0.6867  |
| False           | 64           | 256        | 5        | 0.5322   | 0.6776  |
| False           | 64           | 256        | 10       | 0.5384   | 0.6786  |
| False           | 128          | 64         | 1        | 0.6041   | 0.6628  |
| False           | 128          | 64         | 5        | 0.5792   | 0.6787  |
| False           | 128          | 64         | 10       | 0.5609   | 0.7026  |
| False           | 128          | 128        | 1        | 0.5598   | 0.7038  |
| False           | 128          | 128        | 5        | 0.5423   | 0.7026  |
| False           | 128          | 128        | 10       | 0.5559   | 0.6944  |
| False           | 128          | 256        | 1        | 0.5477   | 0.6981  |
| False           | 128          | 256        | 5        | 0.5279   | 0.7123  |
| False           | 128          | 256        | 10       | 0.5273   | 0.7142  |
| False           | 256          | 64         | 1        | 0.6279   | 0.6513  |
| False           | 256          | 64         | 5        | 0.5825   | 0.6852  |
| False           | 256          | 64         | 10       | 0.5629   | 0.6921  |
| False           | 256          | 128        | 1        | 0.5699   | 0.7061  |
| False           | 256          | 128        | 5        | 0.5605   | 0.6947  |
| False           | 256          | 128        | 10       | 0.5418   | 0.7067  |
| False           | 256          | 256        | 1        | 0.5635   | 0.6986  |
| False           | 256          | 256        | 5        | 0.5326   | 0.7038  |
| False           | 256          | 256        | 10       | 0.5425   | 0.7061  |


Results on the kaggle test dataset:

* public score (bidirectional, hidden_dim = 256, embedding_dim = 256, n_layers = 10): 0.76248
* public score (one directional, hidden_dim = 128, embedding_dim = 256, n_layers = 10): 0.72295