# GPT Model for Disaster Tweet Classification

In this notebook, we train a GPT-2 model to classify tweets as disaster-related or not. We use the original raw `train.csv`, without any data cleaning, because we first want to find out whether this model is worth our time at all. If it does not achieve significantly better results than our initial classification models, we will most likely stop fine-tuning GPT-2 and stick with those models.

We also decided to drop all features except `text` and `keyword`: these two have already proven to be important, while the other features are either dirty or numerical, which GPT-2 does not take as input.

## Import the necessary Libraries

In [1]:
import pandas as pd
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from transformers import GPT2Tokenizer, GPT2Config, GPT2ForSequenceClassification, TrainingArguments, Trainer, TextClassificationPipeline
import re

## Load the Dataset

In [2]:
# Concatenate keyword and text so that it can be passed to GPT-2 model as input
def concatenate_keyword_text(file_path):
    data = pd.read_csv(file_path)
    data['keyword_text'] = data['keyword'].fillna('') + ' ' + data['text']
    return data

train = concatenate_keyword_text('../preprocessing/train.csv')
test = concatenate_keyword_text('../preprocessing/test.csv')

train_data = train.copy()
test_data = test.copy()

# Split the training data into train set and validation set
X_train, X_val, y_train, y_val = train_test_split(train_data["keyword_text"], train_data["target"], test_size=0.2, random_state=42)

## Load and Prepare GPT-2 Model

In [3]:
# Load the GPT-2 tokenizer, model, and configuration
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
config = GPT2Config.from_pretrained("gpt2", num_labels=2)
config.pad_token_id = tokenizer.eos_token_id
model = GPT2ForSequenceClassification.from_pretrained("gpt2", config=config)

# Set a padding token for the GPT-2 tokenizer
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the training and validation data
train_encodings = tokenizer(X_train.to_list(), truncation=True, padding=True)
val_encodings = tokenizer(X_val.to_list(), truncation=True, padding=True)

# Create a dataset object for the trainer
class DisasterDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = DisasterDataset(train_encodings, y_train.to_list())
val_dataset = DisasterDataset(val_encodings, y_val.to_list())

## Train the GPT-2 Model

In [None]:
# Set up the training arguments and trainer
training_args = TrainingArguments(
    output_dir="./gpt2_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    save_strategy="no",
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

## Make predictions on Test Data

In [None]:
# Make predictions on the test data
test_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
predictions = test_pipeline(test_data["keyword_text"].to_list())

# Save the predictions to a CSV file
test_data["target"] = [prediction["label"].split("_")[-1] for prediction in predictions]
test_data[["id", "target"]].to_csv("predictions/gpt2_predictions.csv", index=False)

## Unexpected Results

The model achieved an accuracy of 0.82224, which is significantly higher than the score of any other model we have trained, and this is before any cleaning of the dataset. Given this result, we decided to continue fine-tuning it.

We decided to clean the data by removing URLs, mentions, and special characters, and to train the GPT-2 model again on the cleaned data. This reduces noise in the dataset and keeps the data consistent. We deliberately chose not to lowercase the text, remove stop words, or lemmatize, because GPT-2 can process these words in context and produce more meaningful predictions.

We expect this model to perform slightly better than the first one.

## Cleaning the Data

In [None]:
def clean_text(text):
    # Remove URLs (http\S+ also covers https links)
    text = re.sub(r"http\S+|www\S+", '', text)
    # Remove @mentions entirely and drop the '#' sign from hashtags
    text = re.sub(r'@\w+|#', '', text)
    # Replace every remaining run of non-alphanumeric characters with a space
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    # Collapse repeated whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

X_train = X_train.apply(clean_text)
X_val = X_val.apply(clean_text)
test_data["keyword_text"] = test_data["keyword_text"].apply(clean_text)

## Check the cleaned Dataset

In [None]:
print("Cleaned X_train:")
print(X_train.head())

print("\nCleaned X_val:")
print(X_val.head())

print("\nCleaned test_data:")
print(test_data["keyword_text"].head())

## Load and Prepare GPT-2 Model with the cleaned Data

In [None]:
model = GPT2ForSequenceClassification.from_pretrained("gpt2", config=config)

train_encodings = tokenizer(X_train.to_list(), truncation=True, padding=True)
val_encodings = tokenizer(X_val.to_list(), truncation=True, padding=True)

train_dataset = DisasterDataset(train_encodings, y_train.to_list())
val_dataset = DisasterDataset(val_encodings, y_val.to_list())

## Train the GPT-2 Model with the cleaned Data

In [None]:
# Set up the training arguments and trainer
training_args = TrainingArguments(
    output_dir="./gpt2_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    save_strategy="no",
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

## Make predictions on Test Data

In [None]:
test_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
predictions = test_pipeline(test_data["keyword_text"].to_list())

test_data["target"] = [prediction["label"].split("_")[-1] for prediction in predictions]
test_data[["id", "target"]].to_csv("predictions/gpt2_predictions2.csv", index=False)

## Worse Result

Surprisingly, the accuracy dropped to 0.8155 with the cleaned data. We suspect that cleaning removed some important contextual information. For example, stripping URLs may have removed links to news articles that provided valuable context for disaster-related content. Similarly, hashtags, mentions, and other special characters may have helped the model judge the context and relevance of a tweet.
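
To make the possible loss of information concrete, here is a minimal sketch that applies the same cleaning steps to a single tweet. The tweet is hypothetical and invented for illustration; it is not taken from `train.csv`.

```python
import re

def clean_text(text):
    # Same cleaning steps as the clean_text function used in this notebook
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)
    text = re.sub(r'\@\w+|\#', '', text)
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Hypothetical tweet (keyword prefix "wildfire" followed by the tweet text)
raw = "wildfire BREAKING: #Wildfire near @CityNews HQ!! Evacuate now... http://t.co/abc123"
cleaned = clean_text(raw)

print(cleaned)
# wildfire BREAKING Wildfire near HQ Evacuate now
```

In this example the link, the mentioned account, the hashtag marker, and all punctuation disappear, so the cleaned text no longer shows that the tweet cited a source or addressed a news account.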

We therefore decided to go back to the original, non-cleaned data and tune two hyperparameters instead: the learning rate and the number of epochs. We use a grid search to find the best combination of the two.

This involves training and evaluating the model once for each combination of learning rate and number of epochs; the model with the highest validation accuracy is then used as the final model for predictions. This approach can help find better hyperparameters, which may lead to better predictions.
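
One way to obtain validation accuracy directly from the `Trainer` is its `compute_metrics` hook. The sketch below is an assumption about how that could look here, not part of the original experiment: the function name and the made-up logits are for illustration only.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy from the (logits, labels) pair passed in at evaluation time."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Standalone check with made-up logits for three examples and two classes
example_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5]])
example_labels = np.array([0, 1, 0])
result = compute_metrics((example_logits, example_labels))
print(result)
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` should make each evaluation report the value under the key `eval_accuracy`, which could then be compared across hyperparameter combinations.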

## Prepare GPT-2 Model with the non-cleaned Data

In [None]:
X_train, X_val, y_train, y_val = train_test_split(train_data["keyword_text"], train_data["target"], test_size=0.2, random_state=42)
test_data = test.copy()

train_encodings = tokenizer(X_train.to_list(), truncation=True, padding=True)
val_encodings = tokenizer(X_val.to_list(), truncation=True, padding=True)

train_dataset = DisasterDataset(train_encodings, y_train.to_list())
val_dataset = DisasterDataset(val_encodings, y_val.to_list())

## Grid Search for Learning Rate and Number of Epochs

`learning_rates = [1e-5, 5e-5, 1e-4]`: These learning rates are chosen because they are within the typical range used in practice for fine-tuning pre-trained models like GPT-2. A smaller learning rate, like 1e-5, might lead to slower convergence but can be more stable, while a larger learning rate, like 1e-4, can speed up the training process but may cause overshooting and instability in training. The value 5e-5 is chosen as a middle ground between these two extremes.

`num_epochs = [3, 5, 7]`: These values for the number of epochs are chosen based on the assumption that the pre-trained GPT-2 model has already learned useful text representations, and a smaller number of epochs might be sufficient for fine-tuning. Using a smaller number of epochs can help prevent overfitting and reduce training time. However, it's essential to monitor the model's performance on the validation dataset and adjust the number of epochs accordingly.
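
Before running the search, it can help to see how large this grid is. The following sketch only enumerates the combinations and adds up the training cost in epochs; it does not train anything, and the variable names `grid` and `total_epochs` are ours.

```python
from itertools import product

learning_rates = [1e-5, 5e-5, 1e-4]
num_epochs = [3, 5, 7]

# Every (learning rate, number of epochs) pair the grid search will train
grid = list(product(learning_rates, num_epochs))

# Each pair is trained from the pre-trained checkpoint, so the epochs add up
total_epochs = sum(epochs for _, epochs in grid)

print(f"{len(grid)} runs, {total_epochs} training epochs in total")
# 9 runs, 45 training epochs in total
```

Compared with the single 3-epoch run used earlier, the full grid therefore costs roughly fifteen times as much training.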

In [None]:
def compute_val_accuracy(model, val_dataset):
    val_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
    # Convert tokenized inputs back to original text
    val_texts = [tokenizer.decode(x["input_ids"], skip_special_tokens=True) for x in val_dataset]
    val_predictions = val_pipeline(val_texts)
    val_labels = [x["labels"].tolist() for x in val_dataset]
    val_accuracy = accuracy_score(val_labels, [int(prediction["label"].split("_")[-1]) for prediction in val_predictions])
    return val_accuracy

learning_rates = [1e-5, 5e-5, 1e-4]
num_epochs = [3, 5, 7]

best_model = None
best_val_accuracy = 0
best_lr = None
best_epoch = None

for lr in learning_rates:
    for epoch in num_epochs:
        training_args = TrainingArguments(
            output_dir=f"./gpt2_results_lr{lr}_epoch{epoch}",
            num_train_epochs=epoch,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=8,
            logging_dir="./logs",
            logging_steps=100,
            evaluation_strategy="steps",
            save_strategy="no",
            seed=42,
            learning_rate=lr,
        )
        
        model = GPT2ForSequenceClassification.from_pretrained("gpt2", config=config)

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
        )

        trainer.train()

        val_accuracy = compute_val_accuracy(model, val_dataset)
        print(f"Learning rate: {lr}, Num Epochs: {epoch}, Validation accuracy: {val_accuracy}")

        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_model = model
            best_lr = lr
            best_epoch = epoch

print(f"Best validation accuracy: {best_val_accuracy}, Best learning rate: {best_lr}, Best number of epochs: {best_epoch}")

## Make Predictions on Test Data

In [None]:
test_pipeline = TextClassificationPipeline(model=best_model, tokenizer=tokenizer)
predictions = test_pipeline(test_data["keyword_text"].to_list())

test_data["target"] = [prediction["label"].split("_")[-1] for prediction in predictions]
test_data[["id", "target"]].to_csv("predictions/gpt2_best_predictions.csv", index=False)

## Worst Result

Unexpectedly, the model selected by the grid search, the one with the highest validation accuracy, performed the worst of our three GPT-2 runs, with an accuracy of 0.81366. Here are some possible reasons we can think of:

1. **Overfitting**: The model might have overfit the training data due to the chosen combination of learning rate and number of epochs. This can happen when the model becomes too specialized in learning the patterns in the training data, and as a result, performs poorly on new, unseen data.

2. **Local minima**: It is possible that the model got stuck in a local minimum during the training process, which may have resulted in suboptimal performance. The chosen learning rate and number of epochs might not have been suitable to escape the local minimum and reach a better solution.

3. **Randomness**: The training process involves a certain amount of randomness, which can lead to different results with different runs, even with the same hyperparameters. The worse performance might be attributed to the inherent randomness in the training process, and perhaps with additional runs, the performance might improve.
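
The randomness explanation could be checked by repeating a configuration with several seeds and looking at the spread of the scores. The helper below is a minimal sketch of that idea; `train_and_score` is a hypothetical callable that we did not implement in this notebook.

```python
from statistics import mean, stdev

def repeat_over_seeds(train_and_score, seeds):
    """Run a training function once per seed and summarize the scores.

    `train_and_score` is a hypothetical callable that takes a seed, trains a
    fresh model, and returns its validation accuracy as a float.
    """
    scores = [train_and_score(seed) for seed in seeds]
    # The sample standard deviation needs at least two scores
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {"scores": scores, "mean": mean(scores), "std": spread}
```

In this notebook, such a callable would presumably call `transformers.set_seed(seed)` before loading the model (the classification head is randomly initialized), pass `seed=seed` to `TrainingArguments`, train, and return the result of `compute_val_accuracy`. If the standard deviation across seeds is comparable to the differences between our three scores, those differences say little about which setup is actually better.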

## Conclusion

In conclusion, while the GPT-2 model initially showed promising results, neither cleaning the data nor tuning the hyperparameters yielded the expected improvements. This surprised us, as we had believed that both steps would significantly improve the model's performance.

Nevertheless, the GPT-2 model still outperformed every other model we have trained. The outcome was unexpected, but the experience was eye-opening and gave us several valuable insights:

1. **Importance of raw data**: The fact that the GPT-2 model performed better on raw data suggests that the context and structure of the original data might be more important than initially thought. This highlights the importance of carefully considering the impact of preprocessing on the model's performance.

2. **Complexity of fine-tuning**: Fine-tuning a pre-trained model like GPT-2 can be a complex process, and finding the optimal combination of hyperparameters might not always lead to the best results. This underlines the importance of exploring different strategies and being open to experimentation during the fine-tuning process.

3. **Model robustness**: The GPT-2 model's ability to perform well on raw data, without any preprocessing or cleaning, showcases the robustness of this model. It demonstrates that it can effectively handle noisy and unstructured data, making it a strong contender for natural language processing tasks.