## Description of the notebook

In this notebook I experiment with a **Transformer** model trained from scratch for a binary tweet classification task.

## Preprocessing steps:

* tokenization using TweetTokenizer

* lemmatization using WordNetLemmatizer

* filtering out punctuation symbols and stopwords

This scheme gave the best results on the first homework task.
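The filtering step can be sketched without NLTK as follows. This is only a minimal illustration: the tiny `STOPWORDS` set and the `filter_tokens` helper are hypothetical stand-ins for NLTK's English stopword list and the lambda used in the code cells, and the input is assumed to be already tokenized and lemmatized.

```python
from string import punctuation

# Hypothetical, tiny stand-in for nltk.corpus.stopwords.words("english").
STOPWORDS = {"our", "are", "the", "of", "this", "a", "in", "to"}
NOISE = STOPWORDS | set(punctuation)


def filter_tokens(tokens):
    """Lowercase tokens and drop stopwords and single punctuation symbols."""
    return [tok.lower() for tok in tokens if tok.lower() not in NOISE]


tokens = ["Forest", "fire", "near", "La", "Ronge", "Sask", ".", "Canada"]
print(filter_tokens(tokens))
# ['forest', 'fire', 'near', 'la', 'ronge', 'sask', 'canada']
```

Because `set(punctuation)` holds single characters only, multi-character tokens such as `...` pass through the filter, which is why they show up in the vocabulary.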

## Hyperparameters of the Transformer model

* embedding dimension: grid = [64, 128, 256] (determines the size of token representations and influences model capacity and efficiency)

* number of attention heads: grid = [2, 4, 8] (controls parallel attention streams; more heads can capture diverse relationships)

* number of layers: grid = [1, 2, 4] (defines model depth; deeper models may capture more complex patterns)

* dropout rate = 0.3 (applies regularization to prevent overfitting)
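As a rough sketch of what this grid implies, the snippet below enumerates all combinations and checks that every embedding dimension is divisible by every head count, which multi-head attention requires. The variable names mirror the grid above; `invalid` is just an illustrative name.

```python
from itertools import product

embedding_dims = [64, 128, 256]
nheads = [2, 4, 8]
num_layers_list = [1, 2, 4]

# Full Cartesian product of the three grids.
grid = list(product(embedding_dims, nheads, num_layers_list))

# Multi-head attention splits the embedding evenly across heads,
# so the embedding dimension must be divisible by the head count.
invalid = [(d, h, n) for d, h, n in grid if d % h != 0]

print(len(grid))  # 27
print(invalid)    # []
print(64 // 8)    # 8, the per-head dimension for the smallest embedding with the most heads
```

All 27 combinations are valid, so the full grid can be trained without skipping any configuration.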

## Quality of the classification for different sets of hyperparameters:

| embedding dim | nhead | num_layers | val loss | val F1 score |
|---------------|-------|------------|----------|--------------|
| 64            | 2     | 1          | 0.6272   | 0.6094  |
| 64            | 2     | 2          | 0.5820   | 0.6708  |
| 64            | 2     | 4          | 0.6005   | 0.6604  |
| 64            | 4     | 1          | 0.6220   | 0.5687  |
| 64            | 4     | 2          | 0.6277   | 0.6347  |
| 64            | 4     | 4          | 0.5548   | 0.6962  |
| 64            | 8     | 1          | 0.6093   | 0.5952  |
| 64            | 8     | 2          | 0.6033   | 0.6517  |
| 64            | 8     | 4          | 0.5934   | 0.6581  |
| 128           | 2     | 1          | 0.6138   | 0.6548  |
| 128           | 2     | 2          | 0.5837   | 0.6844  |
| 128           | 2     | 4          | 0.5337   | **0.7128**  |
| 128           | 4     | 1          | 0.6052   | 0.6543  |
| 128           | 4     | 2          | 0.5621   | 0.7052  |
| 128           | 4     | 4          | 0.5753   | 0.6698  |
| 128           | 8     | 1          | 0.5843   | 0.6786  |
| 128           | 8     | 2          | 0.5470   | 0.6992  |
| 128           | 8     | 4          | 0.5902   | 0.6322  |
| 256           | 2     | 1          | 0.5930   | 0.6793  |
| 256           | 2     | 2          | 0.6341   | 0.6352  |
| 256           | 2     | 4          | 0.5908   | 0.6918  |
| 256           | 4     | 1          | 0.6231   | 0.6137  |
| 256           | 4     | 2          | 0.5596   | 0.7089  |
| 256           | 4     | 4          | 0.5546   | 0.6908  |
| 256           | 8     | 1          | 0.5980   | 0.6579  |
| 256           | 8     | 2          | 0.5768   | 0.6977  |
| 256           | 8     | 4          | 0.5566   | 0.6232  |
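
A small sketch of how the best configuration can be read off such results. It uses only five rows copied from the table, stored in the same `(embedding_dim, nhead, num_layers) -> (val_loss, val_f1)` layout as the `results` dictionary in the code below; `best_by_f1` and `best_by_loss` are illustrative names.

```python
# A few rows copied from the table: (embedding_dim, nhead, num_layers) -> (val_loss, val_f1)
results = {
    (64, 4, 4): (0.5548, 0.6962),
    (128, 2, 4): (0.5337, 0.7128),
    (128, 4, 2): (0.5621, 0.7052),
    (256, 4, 2): (0.5596, 0.7089),
    (256, 4, 4): (0.5546, 0.6908),
}

# Highest validation F1 and lowest validation loss, selected independently.
best_by_f1 = max(results, key=lambda k: results[k][1])
best_by_loss = min(results, key=lambda k: results[k][0])

print(best_by_f1)    # (128, 2, 4)
print(best_by_loss)  # (128, 2, 4)
```

Here both criteria agree on the same configuration; in general they can differ, so it is worth checking both.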


---

## Code:

In [None]:
import numpy as np
import pandas as pd
import nltk
import re
import matplotlib.pyplot as plt
import seaborn

In [None]:
data_full = pd.read_csv('train_data.csv')

In [None]:
data_full.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [None]:
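# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning.
data = data_full[['text', 'target']].copy()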

In [None]:
from nltk.tokenize import TweetTokenizer

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
tknzr = TweetTokenizer()
lemmatizer = WordNetLemmatizer()

In [None]:
def tokenize_and_lemmatize(text):
    tokens = tknzr.tokenize(text)
    return list(map(lemmatizer.lemmatize, tokens))

In [None]:
data['tokenized_text'] = data['text'].apply(
    lambda sent: tokenize_and_lemmatize(sent)
)


In [None]:
nltk.download('stopwords', quiet=True)

True

In [None]:
from nltk.corpus import stopwords
from string import punctuation

In [None]:
stopwords_set = set(stopwords.words("english"))
punctuation_set = set(punctuation)
noise = stopwords_set.union(punctuation_set)

In [None]:
data['filtered_text'] = data['tokenized_text'].apply(
    lambda tokens: [token.lower() for token in tokens if token.lower() not in noise]
)


In [None]:
data['filtered_text_joined'] = data['filtered_text'].apply(lambda tokens: ' '.join(tokens))

In [None]:
data

Unnamed: 0,text,target,tokenized_text,filtered_text,filtered_text_joined
0,Our Deeds are the Reason of this #earthquake M...,1,"[Our, Deeds, are, the, Reason, of, this, #eart...","[deeds, reason, #earthquake, may, allah, forgi...",deeds reason #earthquake may allah forgive u
1,Forest fire near La Ronge Sask. Canada,1,"[Forest, fire, near, La, Ronge, Sask, ., Canada]","[forest, fire, near, la, ronge, sask, canada]",forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,1,"[All, resident, asked, to, ', shelter, in, pla...","[resident, asked, shelter, place, notified, of...",resident asked shelter place notified officer ...
3,"13,000 people receive #wildfires evacuation or...",1,"[13,000, people, receive, #wildfires, evacuati...","[13,000, people, receive, #wildfires, evacuati...","13,000 people receive #wildfires evacuation or..."
4,Just got sent this photo from Ruby #Alaska as ...,1,"[Just, got, sent, this, photo, from, Ruby, #Al...","[got, sent, photo, ruby, #alaska, smoke, #wild...",got sent photo ruby #alaska smoke #wildfires p...
...,...,...,...,...,...
7608,Two giant cranes holding a bridge collapse int...,1,"[Two, giant, crane, holding, a, bridge, collap...","[two, giant, crane, holding, bridge, collapse,...",two giant crane holding bridge collapse nearby...
7609,@aria_ahrary @TheTawniest The out of control w...,1,"[@aria_ahrary, @TheTawniest, The, out, of, con...","[@aria_ahrary, @thetawniest, control, wild, fi...",@aria_ahrary @thetawniest control wild fire ca...
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1,"[M1, ., 94, [, 01:04, UTC, ], ?, 5km, S, of, V...","[m1, 94, 01:04, utc, 5km, volcano, hawaii, htt...",m1 94 01:04 utc 5km volcano hawaii http://t.co...
7611,Police investigating after an e-bike collided ...,1,"[Police, investigating, after, an, e-bike, col...","[police, investigating, e-bike, collided, car,...",police investigating e-bike collided car littl...


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(data['filtered_text_joined'], data['target'], test_size=0.2, random_state=42)

In [None]:
def build_vocab(texts, max_words=20000):
    token_count_dict = {}
    for text in texts:
        for token in text.split():
            if token not in token_count_dict:
                token_count_dict[token] = 1
            else:
                token_count_dict[token] = token_count_dict[token] + 1

    tokens_freq_list = list(token_count_dict.items())
    tokens_freq_list.sort(key=lambda x: x[1], reverse=True)
    sorted_tokens = tokens_freq_list[:max_words - 2]

    vocabulary = {
        "<pad>": 0,
        "<oov>": 1,
    }

    # Start at 2 so real tokens do not collide with the reserved <pad>/<oov> ids.
    for i, (token, count) in enumerate(sorted_tokens, start=2):
        vocabulary[token] = i

    return vocabulary

In [None]:
vocab = build_vocab(data['filtered_text_joined'])

In [None]:
vocab

{'<pad>': 0,
 '<oov>': 1,
 '...': 2,
 '\x89': 3,
 'wa': 4,
 'like': 5,
 'û_': 6,
 'fire': 7,
 'get': 8,
 'ha': 9,
 'new': 10,
 'via': 11,
 'one': 12,
 'u': 13,
 'people': 14,
 '2': 15,
 'video': 16,
 'emergency': 17,
 'disaster': 18,
 'time': 19,
 'body': 20,
 'police': 21,
 'day': 22,
 'year': 23,
 'would': 24,
 'still': 25,
 'building': 26,
 'say': 27,
 'go': 28,
 'news': 29,
 'home': 30,
 'crash': 31,
 'storm': 32,
 'back': 33,
 '..': 34,
 'burning': 35,
 'know': 36,
 'suicide': 37,
 '3': 38,
 'got': 39,
 'california': 40,
 'see': 41,
 'man': 42,
 'car': 43,
 'look': 44,
 'first': 45,
 'attack': 46,
 'life': 47,
 'death': 48,
 'bomb': 49,
 'train': 50,
 'going': 51,
 'make': 52,
 'love': 53,
 'family': 54,
 'rt': 55,
 'two': 56,
 'killed': 57,
 'world': 58,
 'dead': 59,
 'flood': 60,
 'û': 61,
 'accident': 62,
 'nuclear': 63,
 'today': 64,
 'full': 65,
 'want': 66,
 'war': 67,
 'need': 68,
 'good': 69,
 'think': 70,
 'may': 71,
 "can't": 72,
 'way': 73,
 'pm': 74,
 'watch': 75,
 'ûªs'

In [None]:
def text_to_id(text, vocab):
    return [vocab.get(token, vocab['<oov>']) for token in text.split()]

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
class TweetDisasterDataset(Dataset):
    def __init__(self, df, vocab):
        self.texts = list(df['filtered_text_joined'].values)
        self.labels = list(df['target'].values)
        self.vocab = vocab

    def __getitem__(self, idx):
        sequence = torch.tensor(text_to_id(self.texts[idx], self.vocab), dtype=torch.long)
        label = torch.tensor(self.labels[idx], dtype=torch.float)
        return sequence, label

    def __len__(self):
        return len(self.labels)

In [None]:
train_dataset = TweetDisasterDataset(pd.concat([X_train, y_train], axis=1), vocab)
val_dataset = TweetDisasterDataset(pd.concat([X_val, y_val], axis=1), vocab)

In [None]:
def collate_fn(batch):
    sequences, labels = zip(*batch)
    sequences_padded = pad_sequence(sequences, batch_first=True, padding_value=vocab['<pad>'])
    labels = torch.tensor(labels, dtype=torch.float)

    return sequences_padded, labels

In [None]:
batch_size = 64

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

In [None]:
class TweetDisasterTransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, nhead, num_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        encoder_layer = nn.TransformerEncoderLayer(d_model=embedding_dim, nhead=nhead, dropout=dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(embedding_dim, 1)

    def forward(self, x):
        # x: (batch, seq_len) token ids
        emb = self.embedding(x)
        # The encoder defaults to batch_first=False, so move to (seq_len, batch, emb).
        emb = emb.transpose(0, 1)
        out = self.transformer_encoder(emb)
        # Mean-pool over the sequence dimension (padded positions are included).
        out = out.mean(dim=0)
        out = self.dropout(out)
        out = self.fc(out)
        # squeeze(-1) keeps the batch dimension even when the batch has one sample.
        return torch.sigmoid(out).squeeze(-1)

In [None]:
embedding_dim = 128
nhead = 4
num_layers_transformer = 2
dropout_p = 0.3
vocab_size = len(vocab)

In [None]:
model = TweetDisasterTransformerClassifier(vocab_size, embedding_dim, nhead, num_layers_transformer, dropout_p)



In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [None]:
model = model.to(device)

In [None]:
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

In [None]:
best_transformer_val_loss = np.inf
max_epochs_early_stopping = 3
counter_early_stopping = 0
num_epochs = 25
for epoch in range(num_epochs):
    model.train()
    train_losses = []
    for sequences, labels in train_dataloader:
        sequences, labels = sequences.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(sequences)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    avg_train_loss = np.mean(train_losses)
    model.eval()
    val_losses = []
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            outputs = model(sequences)
            loss = loss_fn(outputs, labels)
            val_losses.append(loss.item())
    avg_val_loss = np.mean(val_losses)
    print(f"epoch {epoch+1}/{num_epochs}: train Loss: {avg_train_loss:.4f}, val loss: {avg_val_loss:.4f}")
    if avg_val_loss < best_transformer_val_loss:
        best_transformer_val_loss = avg_val_loss
        counter_early_stopping = 0
        torch.save(model.state_dict(), "best_transformer_model.pt")
        print("model saved.")
    else:
        counter_early_stopping += 1
        if counter_early_stopping >= max_epochs_early_stopping:
            print("early stopping triggered.")
            break

epoch 1/25: train Loss: 0.6820, val loss: 0.6650
model saved.
epoch 2/25: train Loss: 0.6457, val loss: 0.6554
model saved.
epoch 3/25: train Loss: 0.6171, val loss: 0.6236
model saved.
epoch 4/25: train Loss: 0.5999, val loss: 0.6085
model saved.
epoch 5/25: train Loss: 0.5660, val loss: 0.5998
model saved.
epoch 6/25: train Loss: 0.5512, val loss: 0.5820
model saved.
epoch 7/25: train Loss: 0.5093, val loss: 0.5911
epoch 8/25: train Loss: 0.4853, val loss: 0.5752
model saved.
epoch 9/25: train Loss: 0.4516, val loss: 0.5797
epoch 10/25: train Loss: 0.4107, val loss: 0.5791
epoch 11/25: train Loss: 0.3791, val loss: 0.6200
early stopping triggered.


In [None]:
model.load_state_dict(torch.load("best_transformer_model.pt"))
model.eval()

TweetDisasterTransformerClassifier(
  (embedding): Embedding(20000, 128, padding_idx=0)
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-1): 2 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=2048, bias=True)
        (dropout): Dropout(p=0.3, inplace=False)
        (linear2): Linear(in_features=2048, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.3, inplace=False)
        (dropout2): Dropout(p=0.3, inplace=False)
      )
    )
  )
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)

In [None]:
val_preds = []
val_labels = []
with torch.no_grad():
    for sequences, labels in val_dataloader:
        sequences, labels = sequences.to(device), labels.to(device)
        outputs = model(sequences)
        preds = (outputs > 0.5).int().cpu().numpy()
        val_preds.extend(preds)
        val_labels.extend(labels.cpu().numpy())

In [None]:
from sklearn.metrics import accuracy_score, classification_report, f1_score

In [None]:
accuracy = accuracy_score(val_labels, val_preds)
print("val acc:", accuracy)
print("classification report:")
print(classification_report(val_labels, val_preds))

val acc: 0.7321076822061721
classification report:
              precision    recall  f1-score   support

         0.0       0.76      0.78      0.77       874
         1.0       0.69      0.67      0.68       649

    accuracy                           0.73      1523
   macro avg       0.73      0.72      0.73      1523
weighted avg       0.73      0.73      0.73      1523
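
As a quick sanity check on the report, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded per-class values shown above; `f1_from` is a hypothetical helper, and working from rounded inputs only reproduces the report to two decimals.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall taken from the classification report.
print(round(f1_from(0.76, 0.78), 2))  # class 0.0 -> 0.77
print(round(f1_from(0.69, 0.67), 2))  # class 1.0 -> 0.68
```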



In [None]:
def train_and_evaluate(embedding_dim, nhead, num_layers, dropout_rate, num_epochs=10, patience=3):
    model = TweetDisasterTransformerClassifier(vocab_size, embedding_dim, nhead, num_layers, dropout_rate)
    model.to(device)
    loss_fn = nn.BCELoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-4)
    best_val_loss = np.inf
    counter = 0
    for epoch in range(num_epochs):
        model.train()
        train_losses = []
        for sequences, labels in train_dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(sequences)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()
            train_losses.append(loss.item())
        avg_train_loss = np.mean(train_losses)
        model.eval()
        val_losses = []
        with torch.no_grad():
            for sequences, labels in val_dataloader:
                sequences, labels = sequences.to(device), labels.to(device)
                outputs = model(sequences)
                loss = loss_fn(outputs, labels)
                val_losses.append(loss.item())
        avg_val_loss = np.mean(val_losses)
        if avg_val_loss < best_val_loss:
            best_val_loss = avg_val_loss
            counter = 0
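            # Clone the tensors: state_dict() returns references that later updates overwrite.
            best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}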
        else:
            counter += 1
            if counter >= patience:
                break
    model.load_state_dict(best_model_state)
    model.eval()
    val_preds = []
    val_labels = []
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            outputs = model(sequences)
            preds = (outputs > 0.5).int().cpu().numpy()
            val_preds.extend(preds)
            val_labels.extend(labels.cpu().numpy())
    f1_val_score = f1_score(val_labels, val_preds)
    return best_val_loss, f1_val_score

In [None]:
embedding_dims = [64, 128, 256]
nheads = [2, 4, 8]
num_layers_list = [1, 2, 4]

In [None]:
results = {}
for emb_dim in embedding_dims:
    for head in nheads:
        for n_layers in num_layers_list:
            loss_val, f1_val = train_and_evaluate(emb_dim, head, n_layers, dropout_p, num_epochs=15)
            results[(emb_dim, head, n_layers)] = (loss_val, f1_val)
            print(f"embedding_dim: {emb_dim}, nhead: {head}, num_layers: {n_layers}, val_loss: {loss_val:.4f}, val_f1: {f1_val:.4f}")



embedding_dim: 64, nhead: 2, num_layers: 1, val_loss: 0.6272, val_f1: 0.6094
embedding_dim: 64, nhead: 2, num_layers: 2, val_loss: 0.5820, val_f1: 0.6708
embedding_dim: 64, nhead: 2, num_layers: 4, val_loss: 0.6005, val_f1: 0.6604
embedding_dim: 64, nhead: 4, num_layers: 1, val_loss: 0.6220, val_f1: 0.5687
embedding_dim: 64, nhead: 4, num_layers: 2, val_loss: 0.6277, val_f1: 0.6347
embedding_dim: 64, nhead: 4, num_layers: 4, val_loss: 0.5548, val_f1: 0.6962
embedding_dim: 64, nhead: 8, num_layers: 1, val_loss: 0.6093, val_f1: 0.5952
embedding_dim: 64, nhead: 8, num_layers: 2, val_loss: 0.6033, val_f1: 0.6517
embedding_dim: 64, nhead: 8, num_layers: 4, val_loss: 0.5934, val_f1: 0.6581
embedding_dim: 128, nhead: 2, num_layers: 1, val_loss: 0.6138, val_f1: 0.6548
embedding_dim: 128, nhead: 2, num_layers: 2, val_loss: 0.5837, val_f1: 0.6844
embedding_dim: 128, nhead: 2, num_layers: 4, val_loss: 0.5337, val_f1: 0.7128
embedding_dim: 128, nhead: 4, num_layers: 1, val_loss: 0.6052, val_f1: 0.6543
embedding_dim: 128, nhead: 4, num_layers: 2, val_loss: 0.5621, val_f1: 0.7052
embedding_dim: 128, nhead: 4, num_layers: 4, val_loss: 0.5753, val_f1: 0.6698
embedding_dim: 128, nhead: 8, num_layers: 1, val_loss: 0.5843, val_f1: 0.6786
embedding_dim: 128, nhead: 8, num_layers: 2, val_loss: 0.5470, val_f1: 0.6992
embedding_dim: 128, nhead: 8, num_layers: 4, val_loss: 0.5902, val_f1: 0.6322
embedding_dim: 256, nhead: 2, num_layers: 1, val_loss: 0.5930, val_f1: 0.6793
embedding_dim: 256, nhead: 2, num_layers: 2, val_loss: 0.6341, val_f1: 0.6352
embedding_dim: 256, nhead: 2, num_layers: 4, val_loss: 0.5908, val_f1: 0.6918
embedding_dim: 256, nhead: 4, num_layers: 1, val_loss: 0.6231, val_f1: 0.6137
embedding_dim: 256, nhead: 4, num_layers: 2, val_loss: 0.5596, val_f1: 0.7089
embedding_dim: 256, nhead: 4, num_layers: 4, val_loss: 0.5546, val_f1: 0.6908
embedding_dim: 256, nhead: 8, num_layers: 1, val_loss: 0.5980, val_f1: 0.6579
embedding_dim: 256, nhead: 8, num_layers: 2, val_loss: 0.5768, val_f1: 0.6977
embedding_dim: 256, nhead: 8, num_layers: 4, val_loss: 0.5566, val_f1: 0.6232


In [None]:
best_params = max(results, key=lambda k: results[k][1])
print("best hyperparameters:", best_params, "with val_loss:", results[best_params][0], "and f1_val:", results[best_params][1])

best hyperparameters: (128, 2, 4) with val_loss: 0.5336895485719045 and f1_val: 0.7128082736674622


In [None]:
embedding_dim = 128
nhead = 2
num_layers = 4

In [None]:
model = TweetDisasterTransformerClassifier(vocab_size, embedding_dim, nhead, num_layers, dropout_p)
model.to(device)



TweetDisasterTransformerClassifier(
  (embedding): Embedding(20000, 128, padding_idx=0)
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-3): 4 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=2048, bias=True)
        (dropout): Dropout(p=0.3, inplace=False)
        (linear2): Linear(in_features=2048, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.3, inplace=False)
        (dropout2): Dropout(p=0.3, inplace=False)
      )
    )
  )
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)

In [None]:
loss_fn = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
best_val_loss = np.inf
counter = 0
patience = 3

In [None]:
for epoch in range(num_epochs):
    model.train()
    train_losses = []
    for sequences, labels in train_dataloader:
        sequences, labels = sequences.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(sequences)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        train_losses.append(loss.item())
    avg_train_loss = np.mean(train_losses)
    model.eval()
    val_losses = []
    with torch.no_grad():
        for sequences, labels in val_dataloader:
            sequences, labels = sequences.to(device), labels.to(device)
            outputs = model(sequences)
            loss = loss_fn(outputs, labels)
            val_losses.append(loss.item())
    avg_val_loss = np.mean(val_losses)
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        counter = 0
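        # Clone the tensors: state_dict() returns references that later updates overwrite.
        best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}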
    else:
        counter += 1
        if counter >= patience:
            break
model.load_state_dict(best_model_state)
model.eval()

TweetDisasterTransformerClassifier(
  (embedding): Embedding(20000, 128, padding_idx=0)
  (transformer_encoder): TransformerEncoder(
    (layers): ModuleList(
      (0-3): 4 x TransformerEncoderLayer(
        (self_attn): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=128, out_features=128, bias=True)
        )
        (linear1): Linear(in_features=128, out_features=2048, bias=True)
        (dropout): Dropout(p=0.3, inplace=False)
        (linear2): Linear(in_features=2048, out_features=128, bias=True)
        (norm1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (dropout1): Dropout(p=0.3, inplace=False)
        (dropout2): Dropout(p=0.3, inplace=False)
      )
    )
  )
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=128, out_features=1, bias=True)
)

In [None]:
tokenizer = tokenize_and_lemmatize

In [None]:
test_data = pd.read_csv('test_data.csv')[['text']]

In [None]:
test_data['tokenized_text'] = test_data['text'].apply(lambda sent: tokenizer(sent))
test_data['filtered_text'] = test_data['tokenized_text'].apply(lambda tokens: [token.lower() for token in tokens if token.lower() not in noise])
test_data['filtered_text_joined'] = test_data['filtered_text'].apply(lambda tokens: " ".join(tokens))

In [None]:
test_sequences = [torch.tensor(text_to_id(text, vocab), dtype=torch.long) for text in test_data['filtered_text_joined']]
test_sequences_padded = pad_sequence(test_sequences, batch_first=True, padding_value=vocab["<pad>"])
test_sequences_padded = test_sequences_padded.to(device)

In [None]:
model.eval()
with torch.no_grad():
    test_outputs = model(test_sequences_padded)
    test_predictions = (test_outputs > 0.5).int().cpu().numpy()

In [None]:
sample_submission = pd.read_csv('sample_submission.csv')

In [None]:
test_submission_transformer = pd.DataFrame(test_predictions, index=sample_submission.id, columns=['target'])

In [None]:
test_submission_transformer.index.name = 'id'

In [None]:
test_submission_transformer.to_csv('test_submission_transformer.csv')

Public score on the Kaggle test dataset: 0.71897