# Classic ML models

## Dataset

The dataset consists of tweets labeled as either disaster-related (1) or non-disaster (0). Initial exploration shows:
- Columns: `id`, `keyword`, `location`, `text`, `target`.
- `keyword` and `location` have missing values, so the focus will be on the `text` column for classification.

In [1]:
import pandas as pd
import re

train_df = pd.read_csv("train.csv")
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


## Preproccessing

I chose NLTK’s TweetTokenizer for tokenization due to its specialized handling of Twitter data. Unlike generic tokenizers, it effectively preserves **hashtags** (e.g., #NLP), @mentions (e.g., @username), emojis, and contractions (e.g., "don’t" - "do", "n’t"), ensuring Twitter-specific elements remain intact for downstream analysis.

For lemmatization, WordNetLemmatizer was selected because it leverages WordNet’s lexical database and considers word context (via part-of-speech tagging). This allows it to produce accurate base forms (e.g., converting "running" to "run" when tagged as a verb), making it ideal for nuanced text processing in English.

To compare lemmatization with a faster but less context-aware approach, PorterStemmer was included. While stemming aggressively truncates suffixes (e.g., "running" → "run", "happily" → "happili"), it offers computational efficiency, which is advantageous for large datasets.

Moreover, I decided to try using stopwords list, which filters some extra words, thus reducing overall dimensionality and improving text proccessing. Tweets' hashtags are treated as full words, excluding '#' symbol. Hashtags like `#wildfires` are split into constituent words ("wildfires") to convert them into meaningful tokens.

In [2]:
from nltk import TweetTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk import download
from nltk.corpus import stopwords
download('wordnet')
download('stopwords')

tk = TweetTokenizer()
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english')) - set(['not', 'no']) #prevent certain words from deleting as they may contain essential clarification (e.g., fire != no fire)

#Applying tokenization
train_df['tokenized'] = train_df['text'].apply(lambda tweet: tk.tokenize(tweet))

#Replace hashtags with words
def remove_hashtags(token):
    hashtags = re.findall(r'#\w+', token)
    for hashtag in hashtags:
        words = re.findall(r'[A-Z]?[a-z]+', hashtag[1:])
        token = token.replace(hashtag, ' '.join(words))
    
    return token

#Remove words from stoplist
def remove_stopwords(text):
    filtered_tokens = [word for word in text if word.lower() not in stop_words]
    return filtered_tokens

#Preproccessing
train_df['preprocessed'] = train_df['tokenized'].apply(lambda list: remove_stopwords([remove_hashtags(token) for token in list]))


#Lemmatization and stemming
train_df['lemmatized'] = train_df['preprocessed'].apply(lambda list: ' '.join([lemmatizer.lemmatize(token) for token in list]))
train_df['stemmed'] = train_df['preprocessed'].apply(lambda list: ' '.join([stemmer.stem(token) for token in list]))


[nltk_data] Downloading package wordnet to /home/nixos/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/nixos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
train_df.head()

Unnamed: 0,id,keyword,location,text,target,tokenized,preprocessed,lemmatized,stemmed
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,"[Our, Deeds, are, the, Reason, of, this, #eart...","[Deeds, Reason, earthquake, May, ALLAH, Forgiv...",Deeds Reason earthquake May ALLAH Forgive u,deed reason earthquak may allah forgiv us
1,4,,,Forest fire near La Ronge Sask. Canada,1,"[Forest, fire, near, La, Ronge, Sask, ., Canada]","[Forest, fire, near, La, Ronge, Sask, ., Canada]",Forest fire near La Ronge Sask . Canada,forest fire near la rong sask . canada
2,5,,,All residents asked to 'shelter in place' are ...,1,"[All, residents, asked, to, ', shelter, in, pl...","[residents, asked, ', shelter, place, ', notif...",resident asked ' shelter place ' notified offi...,resid ask ' shelter place ' notifi offic . no ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,"[13,000, people, receive, #wildfires, evacuati...","[13,000, people, receive, wildfires, evacuatio...","13,000 people receive wildfire evacuation orde...","13,000 peopl receiv wildfir evacu order califo..."
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,"[Just, got, sent, this, photo, from, Ruby, #Al...","[got, sent, photo, Ruby, Alaska, smoke, wildfi...",got sent photo Ruby Alaska smoke wildfire pour...,got sent photo rubi alaska smoke wildfir pour ...


## Training

### Classical Models
For basic ML approach I have chosen three different models: 
- LogisticRegression, which works well with high-dimensional text data and provides interpretability via coefficients.
   - **Parameters**: Grid search over `C` (regularization), `penalty` (L1/L2), and `solver`.

- Support Vector Machine (SVM), which is effective in high-dimensional spaces with kernel tricks (tested: linear, RBF kernels).
   - **Parameters**: Optimized `C`, `kernel`, and `gamma`.

- Naive Bayes (NB), which is fast and suitable for sparse text data with Laplace smoothing (`alpha`) and also known to work well in text classification.
   - **Parameters**: Tested `alpha` (smoothing) and `fit_prior` (class balance).

### Split dataset

In [8]:
#splitting in train/test sets
from sklearn.model_selection import train_test_split
y = train_df.pop('target')
X_train_lem, X_val_lem, y_train, y_val = train_test_split(train_df['lemmatized'], y, test_size=0.3, random_state=42, stratify=y)
X_train_stem, X_val_stem, y_train, y_val = train_test_split(train_df['stemmed'], y, test_size=0.3, random_state=42, stratify=y)

### Training

The training pipeline includes different vectorizers such as CountVectorizer and TfidfVectorizer.
>CountVectorizer is a pre-processing technique used to convert text data into numerical form. This creates a bag of words where each word is treated as a separate feature and the count of each word in a given document is used as the value of that feature.

>TfidfVectorizer is based on the logic that words that are too abundant in a corpus and words that are too rare are both not statistically important for finding a pattern. The Logarithmic factor in tfidf mathematically penalizes the words that are too abundant or too rare in the corpus by giving them low tfidf scores.

Both of them are used for vectorizing texts and support context, so we will try them in comparance to find the best option.

The full pipeline uncludes testing of the following combinations:
- LogisticRegression with CountVectorizer and lemmatized text
- SVM with CountVectorizer and lemmatized text
- Multinomial Naive Bayes with CountVectorizer and lemmatized text
- LogisticRegression with TfidfVectorizer and stemmed text
- SVM with TfidfVectorizer and stemmed text
- Multinomial Naive Bayes with TfidfVectorizer and stemmed text
- Multinomial Naive Bayes with CountVectorizer and stemmed text

For every option GridSearch is used for tuning hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Pipelines with parameter grids
param_grids = {
    'bow_LogReg': {
        'pipeline': Pipeline([
            ('vectorizer', CountVectorizer()),
            ('model', LogisticRegression(max_iter=1000))
        ]),
        'params': {
            'vectorizer__ngram_range': [(1,1), (1,2)],
            'vectorizer__max_features': [5000, 10000],
            'model__C': [0.1, 1, 10],
            'model__penalty': ['l1', 'l2'],
            'model__solver': ['liblinear', 'saga']
        },
        'data': 'lem'  # using lemmatized data
    },
    
    'bow_SVM': {
        'pipeline': Pipeline([
            ('vectorizer', CountVectorizer()),
            ('model', SVC())
        ]),
        'params': {
            'vectorizer__ngram_range': [(1,2)],
            'vectorizer__max_features': [10000],
            'model__C': [0.1, 1, 10],
            'model__kernel': ['linear', 'rbf'],
            'model__gamma': ['scale', 'auto']
        },
        'data': 'lem' # using lemmatized data
    },
    
    'bow_MultiNB': {
        'pipeline': Pipeline([
            ('vectorizer', CountVectorizer()),
            ('model', MultinomialNB())
        ]),
        'params': {
            'vectorizer__ngram_range': [(1,1)],
            'vectorizer__max_features': [20000],
            'model__alpha': [0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 5.0],
            'model__fit_prior': [True, False]
        },
        'data': 'lem' # using lemmatized data
    },
    
    'stem_LogReg': {
        'pipeline': Pipeline([
            ('vectorizer', TfidfVectorizer()),
            ('model', LogisticRegression(max_iter=1000))
        ]),
        'params': {
            'vectorizer__ngram_range': [(1,1), (1,2)],
            'vectorizer__max_features': [5000, 10000],
            'vectorizer__use_idf': [True, False],
            'model__C': [0.1, 1, 10],
            'model__penalty': ['l2'],
            'model__solver': ['sag', 'saga']
        },
        'data': 'stem'  # using stem data
    },
    
    'stem_SVM': {
        'pipeline': Pipeline([
            ('vectorizer', TfidfVectorizer()),
            ('model', SVC())
        ]),
        'params': {
            'vectorizer__ngram_range': [(1,2)],
            'vectorizer__max_features': [10000],
            'model__C': [1, 10],
            'model__kernel': ['linear'],
            'model__gamma': ['scale']
        },
        'data': 'stem' # using stem data
    },
    
    'stem_MultiNB': {
        'pipeline': Pipeline([
            ('vectorizer', TfidfVectorizer()),
            ('model', MultinomialNB())
        ]),
        'params': {
            'vectorizer__ngram_range': [(1,1)],
            'vectorizer__max_features': [20000],
            'model__alpha': [0.1, 1.0],
            'model__fit_prior': [True]
        },
        'data': 'stem' # using stem data
    },

    #Additional option
    'bow_MultiNB_stem': {
        'pipeline': Pipeline([
            ('vectorizer', CountVectorizer()),
            ('model', MultinomialNB())
        ]),
        'params': {
            'vectorizer__ngram_range': [(1,1)],
            'vectorizer__max_features': [20000],
            'model__alpha': [0.01, 0.1, 0.5, 1.0, 1.5, 2.0, 5.0],
            'model__fit_prior': [True, False]
        },
        'data': 'stem' # using stem data
    }
}

# GridSearch
best_models = {}

for name, config in param_grids.items():
    print(f"\n=== Training {name} ===")
    
    X_train = X_train_lem if config['data'] == 'lem' else X_train_stem
    
    grid_search = GridSearchCV(
        estimator=config['pipeline'],
        param_grid=config['params'],
        scoring='f1',
        cv=3,
        n_jobs=-1,
        verbose=1
    )
    
    grid_search.fit(X_train, y_train)
    best_models[name] = {
        'model': grid_search.best_estimator_,
        'score': grid_search.best_score_,
        'params': grid_search.best_params_
    }

# Best model score
best_model_info = max(best_models.items(), key=lambda x: x[1]['score'])
print(f"\nBest model: {best_model_info[0]}")
print(f"Best F1-score: {best_model_info[1]['score']:.4f}")
print(f"Best params: {best_model_info[1]['params']}")


=== Training bow_LogReg ===
Fitting 3 folds for each of 48 candidates, totalling 144 fits





=== Training bow_SVM ===
Fitting 3 folds for each of 12 candidates, totalling 36 fits

=== Training bow_MultiNB ===
Fitting 3 folds for each of 14 candidates, totalling 42 fits

=== Training stem_LogReg ===
Fitting 3 folds for each of 48 candidates, totalling 144 fits

=== Training stem_SVM ===
Fitting 3 folds for each of 2 candidates, totalling 6 fits

=== Training stem_MultiNB ===
Fitting 3 folds for each of 2 candidates, totalling 6 fits

=== Training bow_MultiNB_stem ===
Fitting 3 folds for each of 14 candidates, totalling 42 fits

Best model: bow_MultiNB_stem
Best F1-score: 0.7437
Best params: {'model__alpha': 2.0, 'model__fit_prior': True, 'vectorizer__max_features': 20000, 'vectorizer__ngram_range': (1, 1)}


Based on F1-score, Multinomial Naive Bayes with stemmed text was picked as the best model and will be used to make predictions for the test data.

## Predict

In [14]:
test_df = pd.read_csv("test.csv")
test_df.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


We will again apply necessary transformations to make our test data suitable for prediction. 

In [None]:
#Applying tokenization
test_df['tokenized'] = test_df['text'].apply(lambda tweet: tk.tokenize(tweet))
#Preproccessing
test_df['preprocessed'] = train_df['tokenized'].apply(lambda list: remove_stopwords([remove_hashtags(token) for token in list]))
#Lemmatization and stemming
test_df['lemmatized'] = test_df['preprocessed'].apply(lambda list: ' '.join([lemmatizer.lemmatize(token) for token in list]))
test_df['stemmed'] = test_df['preprocessed'].apply(lambda list: ' '.join([stemmer.stem(token) for token in list]))

# Best model prediction
if 'stem' in best_model_info[0]:
    X_test = test_df['stemmed']
else:
    X_test = test_df['lemmatized']

final_model = best_model_info[1]['model']
y_pred = final_model.predict(X_test)

Stemming + NB performed better than lemmatization, likely due to reduced dimensionality without losing critical signal.

In [None]:
submission_df = pd.DataFrame({
    'id': test_df['id'],
    'target': y_pred
})

submission_df.to_csv('submission.csv', index=False)

   id  target
0   0       1
1   2       1
2   3       1
3   9       1
4  11       1


# Neural Networks

### Architecture Choices
Tweets contain contextual dependencies where word order matters (e.g., "flood warning" vs. "warning flood"). LSTM/GRU cells address vanishing gradients in vanilla RNNs, enabling long-term dependency capture. 

Bidirectional Layers process text in both forward/backward directions to capture context from past *and* future tokens. Critical for phrases like *"not safe"* where negation ("not") informs the meaning of subsequent words. 

Embedding Layer converts tokens to dense vectors (`embedding_dim=100-200`) to represent semantic relationships. Larger dimensions (e.g., 200) help capture nuanced meanings but risk overfitting on small datasets.

**Hyperparameter Choices**:
   - **Hidden Dimension**: `128-256` balances model capacity and computational cost. Larger sizes (256) improve context retention but require more data.
   - **Dropout**: `0.3-0.5` regularizes the model by randomly disabling neurons, mitigating overfitting on noisy tweet data.
   - **Layers**: Stacked LSTM/GRU layers (`n_layers=2`) learn hierarchical features but increase complexity. Single layers suffice for shorter texts like tweets.

## Preprocessing

I have chosen stemming Over Lemmatization, because it reduces vocabulary size (e.g., "running" → "run") while retaining disaster-related root words. Critical for computational efficiency in embedding layers.
All sequences fixed to `max_len=100`. Shorter tweets padded with `<pad>`; longer ones truncated to avoid noise.

In [32]:
from collections import Counter
from nltk.tokenize import TweetTokenizer

class TextProcessor:
    def __init__(self, max_vocab=20000):
        self.tokenizer = TweetTokenizer()
        self.stemmer = PorterStemmer()
        self.stop_words = set(stopwords.words('english')) - {'not', 'no'}
        self.vocab = {}
        self.max_vocab = max_vocab

    def remove_hashtags(self, token):
        hashtags = re.findall(r'#\w+', token)
        for hashtag in hashtags:
            words = re.findall(r'[A-Z]?[a-z]+', hashtag[1:])
            token = token.replace(hashtag, ' '.join(words))
        
        return token
    
    def preprocess_text(self, text):
        tokens = self.tokenizer.tokenize(text)
        
        processed = [
            self.remove_hashtags(token) 
            for token in tokens
            if token.lower() not in self.stop_words
        ]
        
        return ' '.join([self.stemmer.stem(token) for token in processed])
        
    def build_vocab(self, texts):
        processed_texts = [self.preprocess_text(text) for text in texts]
        counts = Counter()
        for text in processed_texts:
            tokens = self.tokenizer.tokenize(text)
            counts.update(tokens)
        
        vocab = ['<pad>', '<unk>'] + [word for word, _ in counts.most_common(self.max_vocab-2)]
        self.vocab = {word: idx for idx, word in enumerate(vocab)}
    
    def text_to_sequence(self, text, max_len=100):
        tokens = self.tokenizer.tokenize(text)[:max_len]
        sequence = [self.vocab.get(token, self.vocab['<unk>']) for token in tokens]
        return sequence + [self.vocab['<pad>']] * (max_len - len(tokens))

In [2]:
from torch.utils.data import Dataset, DataLoader
import torch

class TweetDataset(Dataset):
    def __init__(self, texts, labels, processor, max_len=100):
        self.texts = texts
        self.labels = labels
        self.processor = processor
        self.max_len = max_len
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        sequence = self.processor.text_to_sequence(text, self.max_len)
        if self.labels is not None:
            return torch.LongTensor(sequence), torch.tensor(self.labels.iloc[idx], dtype=torch.long)
        return torch.LongTensor(sequence)

- Embedding Layer converts token IDs to dense vectors.
- LSTM/GRU processes sequences step-by-step, updating hidden states to retain context.
- Bidirectional layers concatenate forward/backward outputs for richer representations.
- Final Hidden State: Used as the tweet's "summary" for classification.

In [3]:
import torch.nn as nn

class TweetClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=128, n_layers=2, dropout=0.5, bidirectional=True):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers, bidirectional=bidirectional, dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        fc_input_dim = hidden_dim * 2 if bidirectional else hidden_dim
        self.fc = nn.Linear(fc_input_dim, 2)

    def forward(self, x):
        embedded = self.embedding(x)
        _, (hidden, _) = self.lstm(embedded)

        if self.lstm.bidirectional:
                hidden = self.dropout(torch.cat((hidden[-2], hidden[-1]), dim=1))
        else:
            hidden = self.dropout(hidden[-1])
            
        return self.fc(hidden)

In [None]:
class RNNClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=128, 
                 n_layers=1, dropout=0.3, rnn_type='lstm'):
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        if rnn_type.lower() == 'lstm':
            self.rnn = nn.LSTM(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        elif rnn_type.lower() == 'gru':
            self.rnn = nn.GRU(embedding_dim, hidden_dim, n_layers, dropout=dropout, batch_first=True)
        else:
            raise ValueError("Unsupported RNN type")
            
        self.fc = nn.Linear(hidden_dim, 2)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):
        embedded = self.dropout(self.embedding(x))
        _, hidden = self.rnn(embedded)
        
        if isinstance(hidden, tuple):  # For LSTM
            hidden = hidden[0]
            
        out = self.fc(self.dropout(hidden[-1]))
        return out

- Loss Function: Cross-entropy loss compares predicted probabilities vs. true labels (disaster/non-disaster).
- Optimizer: Adam (adaptive learning rate) with `lr=0.001` balances speed and stability.
- Gradient Clipping: Prevents exploding gradients in RNNs (`max_norm=1.0`).

In [None]:
import torch.optim as optim
import numpy as np
from sklearn.model_selection import train_test_split
torch.manual_seed(42)
np.random.seed(42)


class ExperimentRunner:
    def __init__(self, train_df, test_df):
        self.train_df = train_df
        self.test_df = test_df
        self.results = []
        self.best_model = None
        self.best_acc = 0
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        
    def run_experiments(self, configs, epochs=10, batch_size=64):
        processor = TextProcessor(max_vocab=20000)
        processor.build_vocab(train_df['text'])
        
        X_train, X_val, y_train, y_val = train_test_split(
            self.train_df['text'], self.train_df['target'],
            test_size=0.2, stratify=self.train_df['target'], random_state=42
        )
        
        train_dataset = TweetDataset(X_train, y_train, processor)
        val_dataset = TweetDataset(X_val, y_val, processor)
        
        for config in configs:
            print(f"\nRunning experiment with config: {config}")
            
            model = RNNClassifier(
                vocab_size=len(processor.vocab),
                **config
            )
            model.to(self.device)
            
            optimizer = optim.Adam(model.parameters(), lr=config.get('lr', 0.001))
            criterion = nn.CrossEntropyLoss()
            
            train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
            val_loader = DataLoader(val_dataset, batch_size=batch_size)
            
            for epoch in range(epochs):
                model.train()
                for inputs, labels in train_loader:
                    inputs, labels = inputs.to(self.device), labels.to(self.device)
                    
                    optimizer.zero_grad()
                    outputs = model(inputs)
                    loss = criterion(outputs, labels)
                    loss.backward()
                    optimizer.step()
                
                val_acc = self.evaluate(model, val_loader, self.device)
                print(f"Epoch {epoch+1}/{epochs} | Val Acc: {val_acc:.4f}")
                
                if val_acc > self.best_acc:
                    self.best_acc = val_acc
                    self.best_model_info = {
                        'config': config,
                        'state_dict': model.state_dict(),
                        'vocab_size': len(processor.vocab)
                    }
                    
            self.results.append({
                **config,
                'val_acc': val_acc
            })
            
        results_df = pd.DataFrame(self.results)
        print("\nExperiment Results:")
        print(results_df.sort_values('val_acc', ascending=False))
        
    def evaluate(self, model, data_loader, device):
        model.eval()
        correct = 0
        total = 0
        
        with torch.no_grad():
            for inputs, labels in data_loader:
                inputs, labels = inputs.to(device), labels.to(device)
                outputs = model(inputs)
                _, predicted = torch.max(outputs, 1)
                correct += (predicted == labels).sum().item()
                total += labels.size(0)
                
        return correct / total
    
    def predict(self, test_df):
        if not self.best_model_info:
            raise ValueError("Train model first!")
            
        processor = TextProcessor()
        processor.build_vocab(self.train_df['text'])
        
        model = RNNClassifier(
            vocab_size=self.best_model_info['vocab_size'],
            **self.best_model_info['config']
        )
        
        model.to(self.device)
        
        model.load_state_dict(self.best_model_info['state_dict'])
        model.eval()
        
        test_dataset = TweetDataset(test_df['text'], None, processor)
        test_loader = DataLoader(test_dataset, batch_size=64)
        
        predictions = []
        with torch.no_grad():
            for inputs in test_loader:
                inputs = inputs.to(self.device)
                outputs = model(inputs)
                _, preds = torch.max(outputs, 1)
                predictions.extend(preds.cpu().numpy())
                
        return predictions

To experiment with different parameters 3 configs are used with the ability of adding additional ones. They have different amount of layers, different dimensionality. But the main feature to experiment with is the type of the RNN. While LSTM supports long-term memory which is often useful for translation and long text analyzing, GRU is simplified version which, to my mind, will be the best to analyze such short messages as tweets. For classifying tweets (short texts, local context), **GRU** is often sufficient and more effective. However, if tweets contain complex contextual relationships (e.g., sarcasm, multi-step events), **LSTM** may show better accuracy by preserving long-term dependencies.

In [18]:
experiment_configs = [
    {'embedding_dim': 100, 'hidden_dim': 128, 'n_layers': 1, 'dropout': 0.3, 'rnn_type': 'lstm'},
    {'embedding_dim': 200, 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.4, 'rnn_type': 'gru'},
    {'embedding_dim': 150, 'hidden_dim': 192, 'n_layers': 1, 'dropout': 0.5, 'rnn_type': 'lstm'},
]

In [None]:
import pandas as pd

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

runner = ExperimentRunner(train_df, test_df)
runner.run_experiments(experiment_configs, epochs=10)

predictions = runner.predict(test_df)

submission_df = pd.DataFrame({
    'id': test_df['id'],
    'target': predictions
})
submission_df.to_csv('best_predictions.csv', index=False)


Running experiment with config: {'embedding_dim': 100, 'hidden_dim': 128, 'n_layers': 1, 'dropout': 0.3, 'rnn_type': 'lstm'}




Epoch 1/10 | Val Acc: 0.5706
Epoch 2/10 | Val Acc: 0.5706
Epoch 3/10 | Val Acc: 0.5706
Epoch 4/10 | Val Acc: 0.5706
Epoch 5/10 | Val Acc: 0.5706
Epoch 6/10 | Val Acc: 0.5706
Epoch 7/10 | Val Acc: 0.5706
Epoch 8/10 | Val Acc: 0.5706
Epoch 9/10 | Val Acc: 0.5706
Epoch 10/10 | Val Acc: 0.5706

Running experiment with config: {'embedding_dim': 200, 'hidden_dim': 256, 'n_layers': 2, 'dropout': 0.4, 'rnn_type': 'gru'}
Epoch 1/10 | Val Acc: 0.4294
Epoch 2/10 | Val Acc: 0.5706
Epoch 3/10 | Val Acc: 0.6225
Epoch 4/10 | Val Acc: 0.6770
Epoch 5/10 | Val Acc: 0.6907
Epoch 6/10 | Val Acc: 0.7104
Epoch 7/10 | Val Acc: 0.7058
Epoch 8/10 | Val Acc: 0.7183
Epoch 9/10 | Val Acc: 0.7282
Epoch 10/10 | Val Acc: 0.7255

Running experiment with config: {'embedding_dim': 150, 'hidden_dim': 192, 'n_layers': 1, 'dropout': 0.5, 'rnn_type': 'lstm'}




Epoch 1/10 | Val Acc: 0.5706
Epoch 2/10 | Val Acc: 0.5706
Epoch 3/10 | Val Acc: 0.5706
Epoch 4/10 | Val Acc: 0.5706
Epoch 5/10 | Val Acc: 0.5706
Epoch 6/10 | Val Acc: 0.5706
Epoch 7/10 | Val Acc: 0.5706
Epoch 8/10 | Val Acc: 0.5706
Epoch 9/10 | Val Acc: 0.5706
Epoch 10/10 | Val Acc: 0.5706

Experiment Results:
   embedding_dim  hidden_dim  n_layers  dropout rnn_type   val_acc
1            200         256         2      0.4      gru  0.725542
0            100         128         1      0.3     lstm  0.570584
2            150         192         1      0.5     lstm  0.570584


# BERT

BERT is a **pre-trained language model** developed by Google in 2018. It is based on the **transformer** architecture and has revolutionised NLP by learning to understand the context of words in two directions (left to right and right to left).

For tweet classification, BERT excels because tweets often rely on subtle context, sarcasm, or localized slang. For instance, the phrase "fire on the mountain" could indicate disaster or metaphor. BERT’s bidirectional attention captures such nuances by analyzing how words interact across the entire tweet. Additionally, its pretrained knowledge of language patterns allows it to generalize well even with limited labeled disaster data. Fine-tuning BERT on task-specific data adapts these universal language features to identify keywords (e.g., "evacuation," "flood") while filtering noise like informal abbreviations or hashtags. This makes it superior to simpler models for short, context-dependent text.ё

In [39]:
class TweetBERTDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx] if self.labels is not None else -1
        
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'label': torch.tensor(label, dtype=torch.long) if label != -1 else torch.tensor(0)
        }

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
from transformers import get_linear_schedule_with_warmup

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
    output_attentions=False,
    output_hidden_states=False
)

train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df.text.values,
    train_df.target.values,
    test_size=0.2,
    random_state=42,
    stratify=train_df.target.values
)

def create_data_loader(texts, labels, tokenizer, max_len=128, batch_size=16):
    dataset = TweetBERTDataset(
        texts=texts,
        labels=labels,
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(dataset, batch_size=batch_size)

In [None]:
BATCH_SIZE = 16
MAX_LEN = 128

train_loader = create_data_loader(train_texts, train_labels, tokenizer, MAX_LEN, BATCH_SIZE)
val_loader = create_data_loader(val_texts, val_labels, tokenizer, MAX_LEN, BATCH_SIZE)
test_loader = create_data_loader(test_df.text.values, None, tokenizer, MAX_LEN, BATCH_SIZE)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

EPOCHS = 3
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_loader) * EPOCHS

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)

loss_fn = torch.nn.CrossEntropyLoss().to(device)



In [44]:
def train_epoch(model, data_loader, optimizer, scheduler, device):
    model.train()
    total_loss = 0
    
    for batch in data_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        loss = outputs.loss
        total_loss += loss.item()
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    
    return total_loss / len(data_loader)

In [45]:
def eval_model(model, data_loader, device):
    model.eval()
    correct_predictions = 0
    
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['label'].to(device)
            
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            
            _, preds = torch.max(outputs.logits, dim=1)
            correct_predictions += torch.sum(preds == labels)
    
    return correct_predictions.double() / len(data_loader.dataset)

In [None]:
best_accuracy = 0
for epoch in range(EPOCHS):
    print(f'Epoch {epoch + 1}/{EPOCHS}')
    print('-' * 10)
    
    train_loss = train_epoch(model, train_loader, optimizer, scheduler, device)
    print(f'Train loss: {train_loss}')
    
    val_acc = eval_model(model, val_loader, device)
    print(f'Validation accuracy: {val_acc}')
    
    if val_acc > best_accuracy:
        torch.save(model.state_dict(), 'best_bert_model.bin')
        best_accuracy = val_acc

model.load_state_dict(torch.load('best_bert_model.bin'))
model.eval()

predictions = []
with torch.no_grad():
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        
        _, preds = torch.max(outputs.logits, dim=1)
        predictions.extend(preds.cpu().numpy())


submission_df = pd.DataFrame({
    'id': test_df['id'],
    'target': predictions
})
submission_df.to_csv('bert_predictions.csv', index=False)

Epoch 1/3
----------
Train loss: 0.46010255981774467
Validation accuracy: 0.8220617202889035
Epoch 2/3
----------
Train loss: 0.30855737548820145
Validation accuracy: 0.8371634931057124
Epoch 3/3
----------
Train loss: 0.22307961451970137
Validation accuracy: 0.8207485226526592


  model.load_state_dict(torch.load('best_bert_model.bin'))


The experimental results highlight a clear progression in performance across the three approaches, directly correlated with model complexity and contextual understanding.

1. The classical ML models (Logistic Regression, SVM, Naive Bayes) achieved a *Kaggle* score of **0.54183**, reflecting their limitations in handling noisy, context-dependent text like tweets. While techniques such as stemming and stopword removal improved feature quality, these models struggled with semantic nuances—for example, distinguishing metaphorical phrases ("fire in the sky" vs. literal disasters) or sarcasm. Their reliance on bag-of-words representations inherently ignores word order and context, leading to suboptimal performance despite computational efficiency.

2. The neural network approach (RNN/GRU) improved the score to **0.76739**, demonstrating the advantage of sequence-aware architectures. By processing text bidirectionally and capturing local dependencies, these models better interpreted phrases like "no fire" or "evacuation ordered." However, their performance plateaued due to limited pretrained knowledge and an inability to grasp deeper semantic relationships (e.g., linking "flood" with "rescue" across a tweet). Training time increased moderately, but the trade-off was justified by the accuracy gain.

3. BERT achieved the highest score (**0.82960**), showcasing the power of transformer-based architectures. Its bidirectional attention mechanism resolved ambiguities by analyzing entire tweets holistically—for instance, recognizing that "storm" in "storm of protests" is metaphorical, while "storm surge" indicates a disaster. Pretraining on vast corpora allowed it to generalize patterns (e.g., associating "power outage" with disaster reports) even with limited labeled data. However, this came at a computational cost: training BERT required significantly more time and resources, making it less practical for real-time applications without hardware acceleration.

### Critical Analysis
The notebook’s preprocessing pipeline—particularly hashtag decomposition and negation-aware stopword removal—likely enhanced all models, but BERT benefited most. Classical models lacked the capacity to leverage these refined features fully, while BERT’s self-attention dynamically prioritized relevant tokens (e.g., "wildfire" > "photo"). Notably, the neural network’s intermediate performance suggests hybrid approaches (e.g., BERT + GRU) could further balance speed and accuracy.

To optimize efficiency, I may consider distilled BERT variants (e.g., DistilBERT) or quantization. For classical models, integrating contextual embeddings (e.g., BERT-as-a-feature) might bridge the accuracy gap. Finally, error analysis on misclassified tweets (e.g., sarcastic or ambiguous posts) could guide targeted improvements, such as data augmentation or domain-specific pretraining.