## Natural Language Processing with Disaster Tweets

### Approach: Hugging Face Transformers `Trainer` class with the pre-trained DistilRoBERTa-base model

---
### Data Analysis
- **Dataset size:** 
    - Training: 7613 samples
    - Testing: 3263 samples
- **Problem Nature:**
    - Tweets are short (max 280 characters).
    - Tweets often include noisy data (hashtags, misspellings, slang, abbreviations).
    - Context is crucial as tweets might be sarcastic or humorous.
---

#### Dependencies

In [20]:
# https://pytorch.org/get-started/locally/ | torch
# pip install transformers pandas scikit-learn ftfy

import pandas as pd
import re
import html
from ftfy import fix_text
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score
import warnings

# Suppress for a cleaner console
warnings.filterwarnings("ignore")

# Seed PyTorch's random number generator for reproducibility
torch.manual_seed(70)

<torch._C.Generator at 0x27ac580f9b0>

In [22]:
# Local sanity check: print the name of the first NVIDIA GPU (fails if CUDA is unavailable)
print(torch.cuda.get_device_name(0))

NVIDIA GeForce RTX 5070 Ti


#### Pre-processing

In [3]:

def clean_text(text):
    """
    Cleaning pipeline helper.

    Notes:
        - DistilRoBERTa-base can tokenize and learn from hashtags, mentions, and emojis, so we keep them.
        - Replacing links with a [URL] placeholder signals to the model that a URL was present, which may be useful context.
        - Repeated punctuation could indicate emotion, so it is left unchanged.
    """
    # Try to fix broken unicode errors found in dataset (e.g. "Û÷Institute" and "PeaceÛª")
    text = fix_text(text)

    # replace links with [URL] placeholder
    text = re.sub(r'https?://\S+|www\.\S+', ' [URL] ', text)  

    # Decode common HTML entities (like &amp;)
    text = html.unescape(text)

    # Replace multiple spaces/tabs/newlines with a single space
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text
    

def preprocess_data(df):
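    # Handle rows where keyword or location is missing.
    # Assign the result back instead of calling fillna(inplace=True) on a column selection,
    # which is deprecated chained assignment in recent pandas and may not modify df.
    df['keyword'] = df['keyword'].fillna('')
    df['location'] = df['location'].fillna('')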

    # combine text, keyword, and location into a single column for better context
    df['combined_text'] = df[['text', 'keyword', 'location']].apply(lambda x: ' '.join(x), axis=1)

    # apply clean_text function to the combined text column
    df['combined_text'] = df['combined_text'].apply(clean_text)
    
    return df
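
To make the cleaning steps concrete, here is a minimal sketch that applies the URL replacement, HTML unescaping, and whitespace normalization to a made-up tweet. It is a simplified illustration, not the notebook's `clean_text`: the hypothetical helper `clean_text_stdlib` skips the `ftfy.fix_text` step so that it runs with the standard library only, and the sample text is invented.

```python
import html
import re


def clean_text_stdlib(text):
    # Hypothetical stdlib-only variant of clean_text (the ftfy.fix_text step is omitted)
    # Replace links with the [URL] placeholder
    text = re.sub(r'https?://\S+|www\.\S+', ' [URL] ', text)
    # Decode HTML entities such as &amp;
    text = html.unescape(text)
    # Collapse runs of spaces, tabs, and newlines into a single space
    return re.sub(r'\s+', ' ', text).strip()


# Made-up example tweet, not taken from the dataset
sample = "Fire &amp; smoke   near the bridge!!! http://t.co/abc123\n#wildfire"
cleaned = clean_text_stdlib(sample)
print(cleaned)  # Fire & smoke near the bridge!!! [URL] #wildfire
```

Note that the hashtag and the repeated punctuation survive unchanged, which matches the intent described in the docstring.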

#### Helpers

In [4]:
# return accuracy and F1 score for predictions.
def compute_metrics(pred):
    labels = pred.label_ids # True labels
    preds = np.argmax(pred.predictions, axis=1) # Predicted labels
    acc = accuracy_score(labels, preds)
    f1 = f1_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1
    }


# https://huggingface.co/transformers/v3.2.0/custom_datasets.html
class TweetDataset(torch.utils.data.Dataset):
    """
    Custom Dataset for tweets:
        - Tokenizes the text
        - Handles cases with or without labels (for training vs. testing)
    """
    def __init__(self, texts, labels, tokenizer, max_length=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        # Tokenize the text and convert to tensors
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        item = {key: val.squeeze(0) for key, val in encoding.items()}
    
        # add labels if they exist (for training)
        if self.labels is not None:
            item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
            
        return item
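
As a rough illustration of what `compute_metrics` reports, the sketch below works through the same two steps by hand in plain Python: an argmax over each row of logits, then accuracy and the F1 score for the positive class (label 1). The logits and labels are made up for the example; the notebook itself relies on `np.argmax` and scikit-learn rather than this hand-rolled version.

```python
def argmax(row):
    # Index of the largest logit in one row (what np.argmax does per row)
    return max(range(len(row)), key=lambda i: row[i])


# Made-up logits for five tweets (columns: class 0, class 1) and made-up true labels
logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4], [0.0, 1.2]]
labels = [0, 1, 1, 1, 0]

preds = [argmax(row) for row in logits]

# Confusion counts for the positive class (label 1)
tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)

accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall

print(preds, accuracy, round(f1, 4))  # [0, 1, 0, 1, 1] 0.6 0.6667
```

Because F1 ignores true negatives, it can diverge from accuracy when the classes are imbalanced, which is one plausible reason to select the best checkpoint by F1.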

#### Processing

In [6]:
# Load the datasets
train_df = pd.read_csv('../data/train.csv')
test_df = pd.read_csv('../data/test.csv')

In [6]:
# Preview of cleaned tweets from the training dataset
sample_tweets = train_df.sample(4)
for _, tweet in sample_tweets.iterrows():
    original = tweet['text']
    cleaned = clean_text(original)
    
    print(f"Original: {original}")
    print(f"Cleaned:  {cleaned}")
    print("-" * 80)

Original: @Daorcey @nsit_ YOUR a great pair. Like a couple of Graywardens fighting the blight...
Cleaned:  @Daorcey @nsit_ YOUR a great pair. Like a couple of Graywardens fighting the blight...
--------------------------------------------------------------------------------
Original: @ArianaGrande I literally walked out of the concert and screamed MY SOUL HAS BEEN BLESSED
Cleaned:  @ArianaGrande I literally walked out of the concert and screamed MY SOUL HAS BEEN BLESSED
--------------------------------------------------------------------------------
Original: MEN CRUSH EVERY FUCKING DAY???????????????????????????? http://t.co/Fs4y1c9mNf
Cleaned:  MEN CRUSH EVERY FUCKING DAY???????????????????????????? [URL]
--------------------------------------------------------------------------------
Original: ÛÏ@based_georgie: yo forreal we need to have like an emergency action plan incase donald trump becomes presidentÛ
Whipe that lil baby
Cleaned:  ‰ÛÏ@based_georgie: yo forreal we need to have

In [23]:
# Preprocess the data
train_df = preprocess_data(train_df)
test_df = preprocess_data(test_df)

# Split training data into training and validation sets
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_df['combined_text'].tolist(), 
    train_df['target'].tolist(), 
    test_size=0.2, # 20% validation, 80% training split
    random_state=404
)

# Model configuration
model_name = "distilroberta-base"  
tokenizer = AutoTokenizer.from_pretrained(model_name) # Load the tokenizer class
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2) # Load model for sequence classification (binary)

# Create training and validation datasets using custom Dataset class
train_dataset = TweetDataset(train_texts, train_labels, tokenizer)
val_dataset = TweetDataset(val_texts, val_labels, tokenizer)

# help(TrainingArguments)

# Arguments for the Trainer
training_args = TrainingArguments(
    output_dir='../results',             # Directory to save checkpoints and model predictions
    num_train_epochs=4,                  # Number of passes over the training dataset (can overfit quickly since the dataset is very small)
    per_device_train_batch_size=16,      # Batch size per device during training
    per_device_eval_batch_size=16,       # Batch size per device during evaluation
    eval_strategy="epoch",               # Evaluate at the end of each epoch
    weight_decay=0.01,                   # Regularization that helps generalization
    logging_dir='../logs',               # Directory for training logs
    logging_steps=10,                    # Log training info every 10 steps
    seed=35,                             # Seed used by the Trainer
    load_best_model_at_end=True,         # Reload the best checkpoint when training ends
    metric_for_best_model="f1",          # Use F1 score to determine the best model
    greater_is_better=True,              # Higher F1 is better
    save_strategy="epoch"                # Save a checkpoint every epoch (matches eval_strategy)
)

# Initialize the Trainer (AdamW optimizer is used by default)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)

# fine-tune it on the training data
trainer.train()

# Save model and tokenizer
model.save_pretrained("../saved_model")
tokenizer.save_pretrained("../saved_model")

# Evaluate the model on the validation set and print results
eval_result = trainer.evaluate()
print("Validation results:", eval_result)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at distilroberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


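| Epoch | Training Loss | Validation Loss | Accuracy | F1 |
|-------|---------------|-----------------|----------|----|
| 1 | 0.3869 | 0.428163 | 0.828628 | 0.797203 |
| 2 | 0.3367 | 0.464606 | 0.83979 | 0.807571 |
| 3 | 0.3658 | 0.562177 | 0.826001 | 0.795367 |
| 4 | 0.1843 | 0.620817 | 0.831911 | 0.7952 |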


Validation results: {'eval_loss': 0.46460622549057007, 'eval_accuracy': 0.8397898883782009, 'eval_f1': 0.807570977917981, 'eval_runtime': 1.2577, 'eval_samples_per_second': 1210.959, 'eval_steps_per_second': 76.331, 'epoch': 4.0}
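
The reported validation results match the epoch 2 row of the per-epoch log (loss 0.4646, F1 0.8076), which is consistent with `load_best_model_at_end=True` reloading the checkpoint with the highest F1. The short sketch below re-derives that selection from the logged numbers and checks that validation loss rises every epoch, a pattern that usually points to overfitting. The variable names are illustrative; only the numbers come from the log.

```python
# (epoch, training_loss, validation_loss, accuracy, f1) copied from the training log
epoch_log = [
    (1, 0.3869, 0.428163, 0.828628, 0.797203),
    (2, 0.3367, 0.464606, 0.839790, 0.807571),
    (3, 0.3658, 0.562177, 0.826001, 0.795367),
    (4, 0.1843, 0.620817, 0.831911, 0.795200),
]

# Pick the epoch with the highest F1, mirroring metric_for_best_model="f1"
best_epoch = max(epoch_log, key=lambda row: row[4])

# Check whether validation loss increases from each epoch to the next
val_losses = [row[2] for row in epoch_log]
loss_rises_every_epoch = all(a < b for a, b in zip(val_losses, val_losses[1:]))

print(best_epoch[0], best_epoch[2], loss_rises_every_epoch)  # 2 0.464606 True
```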


---

#### Predictions on Test Data

In [24]:
# Format the test set (no labels) into an acceptable format for the Trainer to predict on
test_dataset = TweetDataset(test_df['combined_text'].tolist(), labels=None, tokenizer=tokenizer)

# inference/prediction
test_outputs = trainer.predict(test_dataset)

# Extract the predicted class (0 or 1)
test_predictions = np.argmax(test_outputs.predictions, axis=1)
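
The `predictions` array returned by `trainer.predict` holds raw logits, and the argmax above turns them into hard class labels. If class probabilities are also of interest (for example, to inspect low-confidence predictions), a softmax over each row gives them. The following is a minimal pure-Python sketch on made-up logits; it assumes class 1 is the disaster label, as in the competition's `target` column, and is not part of the original pipeline.

```python
import math


def softmax(row):
    # Subtract the max logit for numerical stability before exponentiating
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three tweets (columns: class 0, class 1)
example_logits = [[1.2, -0.8], [-0.5, 2.0], [0.3, 0.1]]

probs = [softmax(row) for row in example_logits]

# The argmax of the probabilities equals the argmax of the logits
pred_classes = [max(range(len(p)), key=lambda i: p[i]) for p in probs]

# Probability assigned to class 1 (assumed to be the disaster class)
disaster_prob = [round(p[1], 3) for p in probs]

print(pred_classes, disaster_prob)
```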

#### Create Submission File

In [26]:
# Create a submission format
submission = pd.DataFrame({
    "id": test_df["id"],
    "target": test_predictions
})

# Save as a CSV file
file_name = "../results/ap_2.csv" 
submission.to_csv(file_name, index=False)
print(f"Submission saved as {file_name}")

Submission saved as ../results/ap_2.csv
