## Description of the notebook

In this notebook I fine-tune the **DistilBERT** model from the Hugging Face `transformers` library for our tweet classification task.

## Preprocessing steps performed by the DistilBERT tokenizer:

* lowercasing

* tokenization using the WordPiece algorithm
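
To make the two steps concrete, here is a minimal sketch of lowercasing followed by WordPiece-style greedy longest-match splitting. The vocabulary `TOY_VOCAB` and the helper names are hypothetical and only illustrate the idea; the real tokenizer uses a 30,522-entry vocabulary and also handles punctuation and special tokens.

```python
# Toy illustration of WordPiece-style greedy longest-match-first splitting.
# TOY_VOCAB is a hypothetical vocabulary, not the real DistilBERT one.
TOY_VOCAB = {"[UNK]", "fire", "wild", "##fire", "##s", "evac", "##uation"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercased word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    # Lowercase first (the "uncased" step), then split on whitespace.
    return [piece for word in text.lower().split() for piece in wordpiece(word, vocab)]


print(tokenize("Wildfires evacuation", TOY_VOCAB))
# ['wild', '##fire', '##s', 'evac', '##uation']
```

A word with no matching pieces, such as `"xyz"` here, collapses to the single `[UNK]` token.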

## DistilBERT

- is a distilled version of BERT that preserves much of the original model’s performance while reducing its size by compressing the architecture to 6 transformer encoder layers (the base BERT model has 110 million parameters, DistilBERT approximately 67 million). It is often used to retrieve informative representations from text.

## DistilBERT configuration:

* activation function = "gelu"

* dropout rate = 0.1

* dimension of hidden representation = 768 (feed-forward inner dimension = 3072)

* number of attention heads = 12

* number of layers = 6

* vocabulary size = 30522

* overall number of parameters = 66,955,010
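
The parameter total can be reproduced by hand from the values in the `DistilBertConfig` printed later in the notebook (`dim` = 768, `hidden_dim` = 3072, `max_position_embeddings` = 512). The sketch below is plain arithmetic. The layer breakdown (word and position embeddings, four attention projections, a two-layer feed-forward block, two layer norms per encoder layer, and a two-layer classification head) is my assumption about how the model is assembled, not something read from the library.

```python
# Parameter count reconstructed from the printed configuration values.
# The layer breakdown is an assumption about the DistilBERT classification model.
vocab_size, max_positions = 30522, 512
dim, ffn_dim, n_layers, num_labels = 768, 3072, 6, 2


def linear(n_in, n_out):
    return n_in * n_out + n_out  # weights plus bias


layer_norm = 2 * dim  # scale and shift vectors

embeddings = vocab_size * dim + max_positions * dim + layer_norm
attention = 4 * linear(dim, dim)  # query, key, value and output projections
feed_forward = linear(dim, ffn_dim) + linear(ffn_dim, dim)
encoder_layer = attention + layer_norm + feed_forward + layer_norm
head = linear(dim, dim) + linear(dim, num_labels)  # pre_classifier + classifier

total = embeddings + n_layers * encoder_layer + head
print(f"{total:,}")  # 66,955,010
```

Under these assumptions the embeddings account for roughly a third of all parameters, which is why the model is still fairly large despite having only 6 layers.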

## Hyperparameters of fine-tuning process:

* learning rate = 2e-5

* weight decay coefficient for L2-regularization = 0.01

* number of training epochs = 3

## Results

* validation F1 score = 0.792779 (after epoch 3)
* test F1 score = 0.82224 (public test score)

---

## Code:

In [None]:
import numpy as np
import pandas as pd
import nltk
import re
import matplotlib.pyplot as plt

In [None]:
data_full = pd.read_csv('train_data.csv')

In [None]:
data_full.head(10)

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


In [None]:
# Take an explicit copy so that new columns can be added without a SettingWithCopyWarning.
data = data_full[['text', 'target']].copy()

In [None]:
from nltk.tokenize import TweetTokenizer

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [None]:
tknzr = TweetTokenizer()
lemmatizer = WordNetLemmatizer()

In [None]:
def tokenize_and_lemmatize(text):
    tokens = tknzr.tokenize(text)
    return list(map(lemmatizer.lemmatize, tokens))

In [None]:
data['tokenized_text'] = data['text'].apply(
    lambda sent: tokenize_and_lemmatize(sent)
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['tokenized_text'] = data['text'].apply(


In [None]:
nltk.download('stopwords', quiet=True)

True

In [None]:
from nltk.corpus import stopwords
from string import punctuation

In [None]:
stopwords_set = set(stopwords.words("english"))
punctuation_set = set(punctuation)
noise = stopwords_set.union(punctuation_set)

In [None]:
data['filtered_text'] = data['tokenized_text'].apply(
    lambda tokens: [token.lower() for token in tokens if token.lower() not in noise]
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['filtered_text'] = data['tokenized_text'].apply(


In [None]:
data['filtered_text_joined'] = data['filtered_text'].apply(lambda tokens: ' '.join(tokens))

In [None]:
data['filtered_text_joined'].str.len().describe()

Unnamed: 0,filtered_text_joined
count,7613.0
mean,80.572442
std,30.222827
min,3.0
25%,60.0
50%,83.0
75%,105.0
max,142.0


In [None]:
data

Unnamed: 0,text,target,tokenized_text,filtered_text,filtered_text_joined
0,Our Deeds are the Reason of this #earthquake M...,1,"[Our, Deeds, are, the, Reason, of, this, #eart...","[deeds, reason, #earthquake, may, allah, forgi...",deeds reason #earthquake may allah forgive u
1,Forest fire near La Ronge Sask. Canada,1,"[Forest, fire, near, La, Ronge, Sask, ., Canada]","[forest, fire, near, la, ronge, sask, canada]",forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,1,"[All, resident, asked, to, ', shelter, in, pla...","[resident, asked, shelter, place, notified, of...",resident asked shelter place notified officer ...
3,"13,000 people receive #wildfires evacuation or...",1,"[13,000, people, receive, #wildfires, evacuati...","[13,000, people, receive, #wildfires, evacuati...","13,000 people receive #wildfires evacuation or..."
4,Just got sent this photo from Ruby #Alaska as ...,1,"[Just, got, sent, this, photo, from, Ruby, #Al...","[got, sent, photo, ruby, #alaska, smoke, #wild...",got sent photo ruby #alaska smoke #wildfires p...
...,...,...,...,...,...
7608,Two giant cranes holding a bridge collapse int...,1,"[Two, giant, crane, holding, a, bridge, collap...","[two, giant, crane, holding, bridge, collapse,...",two giant crane holding bridge collapse nearby...
7609,@aria_ahrary @TheTawniest The out of control w...,1,"[@aria_ahrary, @TheTawniest, The, out, of, con...","[@aria_ahrary, @thetawniest, control, wild, fi...",@aria_ahrary @thetawniest control wild fire ca...
7610,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1,"[M1, ., 94, [, 01:04, UTC, ], ?, 5km, S, of, V...","[m1, 94, 01:04, utc, 5km, volcano, hawaii, htt...",m1 94 01:04 utc 5km volcano hawaii http://t.co...
7611,Police investigating after an e-bike collided ...,1,"[Police, investigating, after, an, e-bike, col...","[police, investigating, e-bike, collided, car,...",police investigating e-bike collided car littl...


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(data['filtered_text_joined'], data['target'], test_size=0.2, random_state=42)

In [None]:
def build_vocab(texts, max_words=20000):
    token_count_dict = {}
    for text in texts:
        for token in text.split():
            if token not in token_count_dict:
                token_count_dict[token] = 1
            else:
                token_count_dict[token] = token_count_dict[token] + 1

    tokens_freq_list = list(token_count_dict.items())
    tokens_freq_list.sort(key=lambda x: x[1], reverse=True)
    sorted_tokens = tokens_freq_list[:max_words - 2]

    vocabulary = {
        "<pad>": 0,
        "<oov>": 1,
    }

    # Start at 2 so that real tokens do not overwrite the <pad> (0) and <oov> (1) ids.
    for i, (token, count) in enumerate(sorted_tokens, start=2):
        vocabulary[token] = i

    return vocabulary

In [None]:
vocab = build_vocab(data['filtered_text_joined'])

In [None]:
def text_to_id(text, vocab):
    return [vocab.get(token, vocab['<oov>']) for token in text.split()]
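
As a quick illustration of the lookup, here is a sketch with a hypothetical toy vocabulary in which ids 0 and 1 are reserved for padding and unknown tokens. The vocabulary contents are made up for the example.

```python
# Hypothetical toy vocabulary; ids 0 and 1 are reserved for padding and unknown tokens.
toy_vocab = {"<pad>": 0, "<oov>": 1, "fire": 2, "forest": 3, "near": 4}


def text_to_id(text, vocab):
    # Same lookup as in the notebook: unseen tokens fall back to the <oov> id.
    return [vocab.get(token, vocab["<oov>"]) for token in text.split()]


ids = text_to_id("forest fire near canada", toy_vocab)
print(ids)  # [3, 2, 4, 1]
```

The last token, `canada`, is not in the toy vocabulary, so it maps to the `<oov>` id.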

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

In [None]:
class TweetDisasterDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=128):
        self.data = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.loc[idx, 'filtered_text_joined']
        label = int(self.data.loc[idx, 'target'])
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        item = {key: val.squeeze() for key, val in encoding.items()}
        item['labels'] = torch.tensor(label, dtype=torch.long)
        return item
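
The `truncation=True` and `padding='max_length'` arguments make every encoded tweet exactly `max_length` ids long. The sketch below imitates that behaviour in plain Python. It is a simplification: the helper `pad_or_truncate` is hypothetical, the ids are illustrative placeholders, and the real tokenizer also keeps the closing special token when it truncates.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = input_ids[:max_length]     # truncation: drop everything past max_length
    mask = [1] * len(ids)            # 1 marks real tokens
    padding = max_length - len(ids)  # padding: fill the remainder with pad_id
    return ids + [pad_id] * padding, mask + [0] * padding


# Illustrative placeholder ids, not produced by the real tokenizer.
ids, mask = pad_or_truncate([101, 3224, 2543, 102], max_length=6)
print(ids)   # [101, 3224, 2543, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets the model ignore the padded positions, so fixed-length padding costs compute but does not change the predictions.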

In [None]:
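# Requires `tokenizer`, which is loaded in a later cell; both datasets are rebuilt there with max_length=156.
train_dataset = TweetDisasterDataset(pd.concat([X_train, y_train], axis=1), tokenizer)
val_dataset = TweetDisasterDataset(pd.concat([X_val, y_val], axis=1), tokenizer)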

In [None]:
def collate_fn(batch):
    # Each item is a dict of tensors already padded to max_length, so stacking per key is enough.
    return {key: torch.stack([item[key] for item in batch]) for key in batch[0]}

In [None]:
batch_size = 64

In [None]:
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate_fn)

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

In [None]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
num_labels = len(data['target'].unique())
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
print(model.config)

DistilBertConfig {
  "_attn_implementation_autoset": true,
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.48.3",
  "vocab_size": 30522
}



In [None]:
sum([p.numel() for p in model.parameters()])

66955010

In [None]:
max_length = 156
train_dataset = TweetDisasterDataset(pd.concat([X_train, y_train], axis=1), tokenizer, max_length)
val_dataset = TweetDisasterDataset(pd.concat([X_val, y_val], axis=1), tokenizer, max_length)

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)



In [None]:
from sklearn.metrics import accuracy_score, f1_score, classification_report

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)
    f1 = f1_score(labels, preds)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
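
To see what the metric function computes, here is a dependency-free sketch that takes the argmax of each logit row and derives binary F1 from true positives, false positives and false negatives. The toy logits and labels are invented for the example, and the helper names are hypothetical.

```python
def argmax_rows(logits):
    # Index of the largest logit in each row, i.e. the predicted class.
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def binary_f1(labels, preds):
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented toy data, not taken from the notebook's validation set.
toy_logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.2, 0.3], [-1.0, 2.0]]
toy_labels = [0, 1, 0, 1, 1]

toy_preds = argmax_rows(toy_logits)
toy_f1 = binary_f1(toy_labels, toy_preds)
print(toy_preds, round(toy_f1, 4))  # [0, 1, 1, 0, 1] 0.6667
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, and so is F1.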


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)


In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3343,0.40356,0.822062,0.780567
2,0.3159,0.412324,0.826658,0.787097
3,0.3,0.45512,0.826658,0.792779


TrainOutput(global_step=1143, training_loss=0.35851998143517333, metrics={'train_runtime': 297.6654, 'train_samples_per_second': 61.378, 'train_steps_per_second': 3.84, 'total_flos': 737398402846560.0, 'train_loss': 0.35851998143517333, 'epoch': 3.0})
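
The `global_step=1143` reported above can be checked by hand. The sketch assumes a single device, no gradient accumulation, and that the last incomplete batch of each epoch is kept. It also assumes a linear learning-rate decay to zero without warmup, which I believe is the `Trainer` default but which the notebook does not state.

```python
import math

n_samples = 7613                    # rows in the training CSV (count in the describe() output)
n_val = math.ceil(0.2 * n_samples)  # the validation share is rounded up
n_train = n_samples - n_val
batch_size, epochs, base_lr = 16, 3, 2e-5

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
print(n_train, n_val, steps_per_epoch, total_steps)  # 6090 1523 381 1143


def linear_lr(step, total=total_steps, lr=base_lr):
    # Assumed schedule: linear decay from base_lr to zero with no warmup.
    return lr * max(0.0, (total - step) / total)


for step in (0, steps_per_epoch, 2 * steps_per_epoch, total_steps):
    print(step, f"{linear_lr(step):.2e}")
```

Under this assumed schedule the learning rate has already dropped by a third at the end of the first epoch.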

In [None]:
eval_results = trainer.evaluate()
print("Evaluation Results:", eval_results)

Evaluation Results: {'eval_loss': 0.40355968475341797, 'eval_accuracy': 0.8220617202889035, 'eval_f1': 0.7805668016194331, 'eval_runtime': 7.1884, 'eval_samples_per_second': 211.869, 'eval_steps_per_second': 13.355, 'epoch': 3.0}


In [None]:
test_data = pd.read_csv('test_data.csv')[['text']]

In [None]:
# Apply the same NLTK preprocessing as for the training data (not the DistilBERT tokenizer).
test_data['tokenized_text'] = test_data['text'].apply(tokenize_and_lemmatize)
test_data['filtered_text'] = test_data['tokenized_text'].apply(lambda tokens: [token.lower() for token in tokens if token.lower() not in noise])
test_data['filtered_text_joined'] = test_data['filtered_text'].apply(lambda tokens: " ".join(tokens))

In [None]:
class TestTweetDisasterDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=128):
        self.data = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text = self.data.loc[idx, 'filtered_text_joined']
        encoding = self.tokenizer(
            text,
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        item = {key: val.squeeze() for key, val in encoding.items()}
        return item

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# Use the same max_length (156) as for the training and validation datasets.
test_dataset = TestTweetDisasterDataset(test_data[['filtered_text_joined']], tokenizer, max_length)

In [None]:
test_predictions = trainer.predict(test_dataset)

In [None]:
test_predictions

PredictionOutput(predictions=array([[-0.10567158, -0.01350216],
       [-0.7699423 ,  0.6951574 ],
       [-0.53195107,  0.5693046 ],
       ...,
       [-1.3361688 ,  1.4661204 ],
       [-1.0142916 ,  0.9938146 ],
       [-0.64462376,  0.45996076]], dtype=float32), label_ids=None, metrics={'test_runtime': 14.3074, 'test_samples_per_second': 228.064, 'test_steps_per_second': 14.258})

In [None]:
test_preds = test_predictions.predictions.argmax(axis=-1)

In [None]:
test_preds

array([1, 1, 1, ..., 1, 1, 1])
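
The raw predictions are logits, not probabilities. As a sketch, the first three rows of the `PredictionOutput` above can be turned into class probabilities with a softmax. Reading class 1 as "disaster" is my assumption, based on the labelled training examples.

```python
import math

# First three logit rows copied from the PredictionOutput shown above.
logits = [
    [-0.10567158, -0.01350216],
    [-0.7699423, 0.6951574],
    [-0.53195107, 0.5693046],
]


def softmax(row):
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]
preds = [row.index(max(row)) for row in logits]  # same result as argmax(axis=-1)

for p, label in zip(probs, preds):
    print(f"P(class 1) = {p[1]:.3f} -> predicted class {label}")
```

The first row shows why probabilities are worth a look: it is assigned class 1, but with a probability only slightly above 0.5.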

In [None]:
sample_submission = pd.read_csv('sample_submission.csv')

In [None]:
test_submission_distilbert = pd.DataFrame(test_preds, index=sample_submission.id, columns=['target'])

In [None]:
test_submission_distilbert.index.name = 'id'

In [None]:
test_submission_distilbert.to_csv('test_submission_distilbert.csv')

public test score: 0.82224