# GPT Model for Disaster Tweet Classification

In this notebook, we will fine-tune a GPT-2 model to classify tweets as describing a disaster or not.

We will first train the GPT-2 model on the unprocessed `text` feature from train.csv. This lets us compare the transformer's performance with that of the traditional models when it is given a raw, unoptimized feature.

## Import the necessary Libraries

In [5]:
import re
import pandas as pd
import torch
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from transformers import (
    GPT2Config, GPT2ForSequenceClassification, GPT2Tokenizer,
    TextClassificationPipeline, Trainer, TrainingArguments,
)

## Load the Dataset

In [6]:
train = pd.read_csv('../preprocessing/train.csv')
test = pd.read_csv('../preprocessing/test.csv')

train_data = train.copy()
test_data = test.copy()

# Split the training data into train set and validation set
X_train, X_val, y_train, y_val = train_test_split(train_data["text"], train_data["target"], test_size=0.2, random_state=42)

## Load and Prepare GPT-2 Model

In [None]:
# Load the GPT-2 tokenizer, configuration, and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
config = GPT2Config.from_pretrained("gpt2", num_labels=2)
# GPT-2 has no padding token by default, so reuse the end-of-sequence token;
# the classification head uses pad_token_id to locate the last real token of each sequence
config.pad_token_id = tokenizer.eos_token_id
model = GPT2ForSequenceClassification.from_pretrained("gpt2", config=config)

# Let the tokenizer pad with the same token
tokenizer.pad_token = tokenizer.eos_token

# Tokenize the training and validation data
train_encodings = tokenizer(X_train.to_list(), truncation=True, padding=True)
val_encodings = tokenizer(X_val.to_list(), truncation=True, padding=True)

# Create a dataset object for the trainer
class DisasterDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = DisasterDataset(train_encodings, y_train.to_list())
val_dataset = DisasterDataset(val_encodings, y_val.to_list())

## Train the GPT-2 Model

In [None]:
# Set up the training arguments and trainer
training_args = TrainingArguments(
    output_dir="./gpt2_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    save_strategy="no",
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

## Make predictions on Test Data

In [None]:
# Make predictions on the test data
test_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
predictions = test_pipeline(test_data["text"].to_list())

# Save the predictions to a CSV file
test_data["target"] = [prediction["label"].split("_")[-1] for prediction in predictions]
test_data[["id", "target"]].to_csv("predictions/gpt2_predictions.csv", index=False)
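The list comprehension above assumes that the pipeline returns one dictionary per tweet and that its `label` has the default form `LABEL_<id>`, since no custom `id2label` mapping was configured. A minimal sketch of that parsing step on hand-written stand-in predictions (the scores are made up for illustration, and `parse_label` is a hypothetical helper that is not part of this notebook):

```python
def parse_label(prediction):
    """Extract the integer class id from a label such as 'LABEL_1'."""
    return int(prediction["label"].split("_")[-1])

# Stand-in pipeline output; the scores are invented for illustration only
sample_predictions = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_0", "score": 0.88},
]

targets = [parse_label(p) for p in sample_predictions]
print(targets)
```

Unlike the notebook's comprehension, this version casts the id to `int`; both write the same values to the CSV file.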

## Impressive Results

The model achieved an accuracy score of 0.81918, significantly higher than that of any other model we have trained, and this is before any preprocessing of the dataset. Given these results, we decided to continue fine-tuning it.

We decided to use the `text`, `keyword`, `tweet_length` and `punctuation_count` features to train the GPT-2 model. These features were shown to have a strong relationship with `target` in data_preprocessing.ipynb.

Since the GPT-2 model takes only a single text input, our plan is to concatenate all of the features into one `combined_text` feature and pass it to the GPT-2 model as its text input.

Examples of what `combined_text` will look like:

1. "Courageous and honest analysis of need to use Atomic Bomb in 1945. #Hiroshima70 Japanese military refused surrender. https://t.co/VhmtyTptGR (Keyword: military, Tweet length: 140, Punctuation Count: 8)"
2. "Typhoon Soudelor kills 28 in China and Taiwan (Keyword: , Tweet length: 45, Punctuation Count: 0)"

By appending `keyword`, `tweet_length` and `punctuation_count` to the end of `text` explicitly in parentheses, we give the GPT-2 model more context about each tweet, which may help it make better predictions.
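The two numeric features come from data_preprocessing.ipynb, which is not shown here. As a rough sketch, assuming that `tweet_length` is the character count of the raw tweet and that `punctuation_count` counts the characters found in Python's `string.punctuation` (an assumption that happens to reproduce the numbers in the examples above), a single row could be assembled as follows. `build_combined_text` is a hypothetical helper for illustration only:

```python
import string

def build_combined_text(text, keyword=None):
    """Append keyword, length and punctuation metadata to a tweet."""
    # ASSUMPTION: length = number of characters, punctuation = string.punctuation
    tweet_length = len(text)
    punctuation_count = sum(ch in string.punctuation for ch in text)
    keyword = keyword or ""  # a missing keyword becomes an empty string
    return (
        f"{text} (Keyword: {keyword}, Tweet length: {tweet_length}, "
        f"Punctuation Count: {punctuation_count})"
    )

print(build_combined_text("Typhoon Soudelor kills 28 in China and Taiwan"))
# Typhoon Soudelor kills 28 in China and Taiwan (Keyword: , Tweet length: 45, Punctuation Count: 0)
```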

## Load the Dataset

In [8]:
def combine_text(data, text):
    # Append the metadata to the tweet text in a fixed, parenthesized format
    data[text] = (
        data['text']
        + ' (Keyword: ' + data['keyword'].fillna('')
        + ', Tweet length: ' + data['tweet_length'].astype(str)
        + ', Punctuation Count: ' + data['punctuation_count'].astype(str)
        + ')'
    )
    return data

# Load the train dataset with tweet_length and punctuation_count
train_meta_data = pd.read_csv('../preprocessing/train_data_mod.csv')
train_meta_data = train_meta_data.drop(['target', 'text', 'keyword'], axis=1)

# Load the test dataset with tweet_length and punctuation_count
test_meta_data = pd.read_csv('../preprocessing/test_data_mod.csv')
test_meta_data = test_meta_data.drop(['text', 'keyword'], axis=1)

# Merge the data
train_data = train_data.merge(train_meta_data, on='id')
test_data = test_data.merge(test_meta_data, on='id')

# Concatenate the features
train_data = combine_text(train_data, 'combined_text')
test_data = combine_text(test_data, 'combined_text')

# Split the training data into train set and validation set
X_train, X_val, y_train, y_val = train_test_split(train_data["combined_text"], train_data["target"], test_size=0.2, random_state=42)

## Check the concatenated Data

In [9]:
pd.set_option('display.max_colwidth', 1000)

print("X_train:")
print(X_train.head())

print("\nX_val:")
print(X_val.head())

print("\ntest_data:")
print(test_data["combined_text"].head())

X_train:
4996        Courageous and honest analysis of need to use Atomic Bomb in 1945. #Hiroshima70 Japanese military refused surrender. https://t.co/VhmtyTptGR (Keyword: military, Tweet length: 140, Punctuation Count: 8)
3263                                                   @ZachZaidman @670TheScore wld b a shame if that golf cart became engulfed in flames. #boycottBears (Keyword: engulfed, Tweet length: 98, Punctuation Count: 4)
4907    Tell @BarackObama to rescind medals of 'honor' given to US soldiers at the Massacre of Wounded Knee. SIGN NOW &amp; RT! https://t.co/u4r8dRiuAc (Keyword: massacre, Tweet length: 143, Punctuation Count: 12)
2855                               Worried about how the CA drought might affect you? Extreme Weather: Does it Dampen Our Economy? http://t.co/fDzzuMyW8i (Keyword: drought, Tweet length: 118, Punctuation Count: 8)
4716                                                                       @YoungHeroesID Lava Blast &amp; Power Red #PantherAttack @Ja

## Load and Prepare GPT-2 Model with the concatenated Data

In [None]:
model = GPT2ForSequenceClassification.from_pretrained("gpt2", config=config)

train_encodings = tokenizer(X_train.to_list(), truncation=True, padding=True)
val_encodings = tokenizer(X_val.to_list(), truncation=True, padding=True)

train_dataset = DisasterDataset(train_encodings, y_train.to_list())
val_dataset = DisasterDataset(val_encodings, y_val.to_list())

## Train the GPT-2 Model with the concatenated Data

In [None]:
# Set up the training arguments and trainer
training_args = TrainingArguments(
    output_dir="./gpt2_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    save_strategy="no",
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

## Make predictions on Test Data

In [None]:
test_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
predictions = test_pipeline(test_data["combined_text"].to_list())

test_data["target"] = [prediction["label"].split("_")[-1] for prediction in predictions]
test_data[["id", "target"]].to_csv("predictions/gpt2_predictions2.csv", index=False)

## Slightly Better Results

The model achieved an accuracy score of 0.82715, slightly higher than the score obtained with the non-preprocessed data (0.81918). The increase is smaller than we expected, because we thought that incorporating additional features would have a more significant impact on the model's performance. However, it is still an improvement, so we continued refining the model.

Here, we decided to clean up the `text` feature by

1. removing any URLs present in the text (http, https, or www),
2. removing any mentions (@user) and stripping the `#` symbol from hashtags (the hashtag word itself is kept),
3. replacing every run of non-alphanumeric characters with a single space, so that only letters and digits remain,
4. collapsing repeated whitespace characters (spaces, tabs, etc.) into a single space and removing any leading/trailing spaces,

and then training the GPT-2 model with the cleaned data. This reduces noise in the dataset and keeps the data more consistent. We also decided not to convert characters to lowercase, remove stop words, or lemmatize, because the GPT-2 model can process these words in context, which should produce more meaningful predictions.

Our prediction is that this model will perform slightly better than the previous model.
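To make the four steps concrete, the sketch below applies them one at a time to a tweet from the training set and records each intermediate string. `clean_steps` is a hypothetical helper written for this illustration; its patterns mirror the steps listed above.

```python
import html
import re

def clean_steps(text):
    """Return the intermediate string after each of the four cleaning steps."""
    steps = []
    text = re.sub(r"http\S+|www\S+", "", text)   # 1. drop URLs
    steps.append(text)
    text = re.sub(r"@\w+|#", "", text)           # 2. drop mentions and '#' symbols
    steps.append(text)
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)   # 3. non-alphanumeric runs -> one space
    steps.append(text)
    text = re.sub(r"\s+", " ", text).strip()     # 4. collapse whitespace and trim
    steps.append(text)
    return steps

tweet = ("Tell @BarackObama to rescind medals of 'honor' given to US soldiers "
         "at the Massacre of Wounded Knee. SIGN NOW &amp; RT! https://t.co/u4r8dRiuAc")

for number, result in enumerate(clean_steps(tweet), start=1):
    print(f"step {number}: {result!r}")

# Optional variation (not used in this notebook): decode HTML entities first
print(clean_steps(html.unescape(tweet))[-1])
```

One side effect worth knowing is that HTML entities such as `&amp;` lose their punctuation and survive as the word `amp`. Decoding entities first with `html.unescape` avoids this, although that is an optional variation and not what produced the scores reported here.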

## Cleaning the Data

In [10]:
def clean_text(text):
    # 1. Remove URLs (http, https, or www)
    text = re.sub(r"http\S+|www\S+", '', text)
    # 2. Remove mentions entirely and strip the '#' symbol from hashtags
    text = re.sub(r'@\w+|#', '', text)
    # 3. Replace runs of non-alphanumeric characters with a single space
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    # 4. Collapse repeated whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Clean the text column in train and test dataset
train_data['text'] = train_data['text'].apply(clean_text)
test_data['text'] = test_data['text'].apply(clean_text)

# Concatenate the cleaned text with other features
train_data = combine_text(train_data, 'cleaned_combined_text')
test_data = combine_text(test_data, 'cleaned_combined_text')

# Split the training data into train set and validation set
X_train, X_val, y_train, y_val = train_test_split(train_data["cleaned_combined_text"], train_data["target"], test_size=0.2, random_state=42)

## Check the cleaned Dataset

In [11]:
print("Cleaned X_train:")
print(X_train.head())

print("\nCleaned X_val:")
print(X_val.head())

print("\nCleaned test_data:")
print(test_data["cleaned_combined_text"].head())

Cleaned X_train:
4996    Courageous and honest analysis of need to use Atomic Bomb in 1945 Hiroshima70 Japanese military refused surrender (Keyword: military, Tweet length: 140, Punctuation Count: 8)
3263                                                wld b a shame if that golf cart became engulfed in flames boycottBears (Keyword: engulfed, Tweet length: 98, Punctuation Count: 4)
4907                Tell to rescind medals of honor given to US soldiers at the Massacre of Wounded Knee SIGN NOW amp RT (Keyword: massacre, Tweet length: 143, Punctuation Count: 12)
2855                          Worried about how the CA drought might affect you Extreme Weather Does it Dampen Our Economy (Keyword: drought, Tweet length: 118, Punctuation Count: 8)
4716                                                                                    Lava Blast amp Power Red PantherAttack (Keyword: lava, Tweet length: 82, Punctuation Count: 6)
Name: cleaned_combined_text, dtype: object

Cleaned X_val:
2644     

## Load and Prepare GPT-2 Model with the cleaned Data

In [None]:
model = GPT2ForSequenceClassification.from_pretrained("gpt2", config=config)

train_encodings = tokenizer(X_train.to_list(), truncation=True, padding=True)
val_encodings = tokenizer(X_val.to_list(), truncation=True, padding=True)

train_dataset = DisasterDataset(train_encodings, y_train.to_list())
val_dataset = DisasterDataset(val_encodings, y_val.to_list())

## Train the GPT-2 Model with the cleaned Data

In [None]:
# Set up the training arguments and trainer
training_args = TrainingArguments(
    output_dir="./gpt2_results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=100,
    evaluation_strategy="steps",
    save_strategy="no",
    seed=42
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

## Make predictions on Test Data

In [None]:
test_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
predictions = test_pipeline(test_data["cleaned_combined_text"].to_list())

test_data["target"] = [prediction["label"].split("_")[-1] for prediction in predictions]
test_data[["id", "target"]].to_csv("predictions/gpt2_predictions3.csv", index=False)

## Better Result

As predicted, the accuracy score improved very slightly with the cleaned data, to 0.8296 (vs. 0.82715).

We will further fine-tune the GPT-2 model by tuning the learning rate and the number of epochs. We decided to perform a grid search to find the best combination of the two.

This involves training and evaluating the model with each combination of learning rate and number of epochs. The model with the highest validation accuracy is then used as the final model for predictions. This approach can help find better hyperparameters for the model, potentially leading to better prediction results.

## Grid Search for Learning Rate and Number of Epochs

`learning_rates = [1e-5, 5e-5, 1e-4]`: These learning rates are chosen because they are within the typical range used in practice for fine-tuning pre-trained models like GPT-2. A smaller learning rate, like 1e-5, might lead to slower convergence but can be more stable, while a larger learning rate, like 1e-4, can speed up the training process but may cause overshooting and instability in training. The value 5e-5 is chosen as a middle ground between these two extremes.

`num_epochs = [3, 5, 7]`: These values for the number of epochs are chosen based on the assumption that the pre-trained GPT-2 model has already learned useful text representations, and a smaller number of epochs might be sufficient for fine-tuning. Using a smaller number of epochs can help prevent overfitting and reduce training time.
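As a quick check on the cost of this search, the grid can be enumerated with `itertools.product`. Every combination retrains GPT-2 from the pre-trained checkpoint, so the total number of training epochs is the sum over all combinations. A small sketch (the names `grid` and `total_epochs` are introduced here for illustration):

```python
from itertools import product

learning_rates = [1e-5, 5e-5, 1e-4]
num_epochs = [3, 5, 7]

# Every (learning rate, epochs) pair is trained as a separate run
grid = list(product(learning_rates, num_epochs))
total_epochs = sum(epochs for _, epochs in grid)

for lr, epochs in grid:
    print(f"lr={lr:g}, epochs={epochs}")
print(f"{len(grid)} runs, {total_epochs} training epochs in total")
# last line printed: 9 runs, 45 training epochs in total
```

That is 15 times the training cost of one of the 3-epoch runs used so far.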

In [None]:
def compute_val_accuracy(model, val_dataset):
    """Return the accuracy of `model` on `val_dataset`, using a text-classification pipeline."""
    val_pipeline = TextClassificationPipeline(model=model, tokenizer=tokenizer)
    # The pipeline expects raw text, so decode the token ids back to strings;
    # skip_special_tokens drops the padding (EOS) tokens
    val_texts = [tokenizer.decode(x["input_ids"], skip_special_tokens=True) for x in val_dataset]
    val_predictions = val_pipeline(val_texts)
    val_labels = [x["labels"].item() for x in val_dataset]
    # Pipeline labels look like "LABEL_1"; keep the integer after the underscore
    val_preds = [int(prediction["label"].split("_")[-1]) for prediction in val_predictions]
    return accuracy_score(val_labels, val_preds)

learning_rates = [1e-5, 5e-5, 1e-4]
num_epochs = [3, 5, 7]

best_model = None
best_val_accuracy = 0
best_lr = None
best_epoch = None

for lr in learning_rates:
    for epoch in num_epochs:
        training_args = TrainingArguments(
            output_dir=f"./gpt2_results_lr{lr}_epoch{epoch}",
            num_train_epochs=epoch,
            per_device_train_batch_size=8,
            per_device_eval_batch_size=8,
            logging_dir="./logs",
            logging_steps=100,
            evaluation_strategy="steps",
            save_strategy="no",
            seed=42,
            learning_rate=lr,
        )
        
        model = GPT2ForSequenceClassification.from_pretrained("gpt2", config=config)

        trainer = Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=val_dataset,
        )

        trainer.train()

        val_accuracy = compute_val_accuracy(model, val_dataset)
        print(f"Learning rate: {lr}, Num Epochs: {epoch}, Validation accuracy: {val_accuracy}")

        if val_accuracy > best_val_accuracy:
            best_val_accuracy = val_accuracy
            best_model = model
            best_lr = lr
            best_epoch = epoch

print(f"Best validation accuracy: {best_val_accuracy}, Best learning rate: {best_lr}, Best number of epochs: {best_epoch}")
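The helper above decodes the token ids back into text and runs them through a pipeline a second time. An alternative worth considering, sketched here as an assumption rather than as what this notebook does, is to pass a `compute_metrics` function to the `Trainer`, so that accuracy is computed directly from the logits at each evaluation. The function receives the predictions and the labels as arrays:

```python
import numpy as np

def compute_metrics(eval_pred):
    """Accuracy from raw logits, in the form accepted by Trainer(compute_metrics=...)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per row
    return {"accuracy": float((predictions == labels).mean())}

# Stand-in logits and labels, invented purely to exercise the function
example_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.4]])
example_labels = np.array([0, 1, 0, 0])

result = compute_metrics((example_logits, example_labels))
print(result)
```

With `Trainer(..., compute_metrics=compute_metrics)`, a call to `trainer.evaluate()` would then report this value under the key `eval_accuracy`, which could replace `compute_val_accuracy` in the loop.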

## Make Predictions on Test Data

In [None]:
test_pipeline = TextClassificationPipeline(model=best_model, tokenizer=tokenizer)
predictions = test_pipeline(test_data["cleaned_combined_text"].to_list())

test_data["target"] = [prediction["label"].split("_")[-1] for prediction in predictions]
test_data[["id", "target"]].to_csv("predictions/gpt2_best_predictions.csv", index=False)

## Unexpected Result

Unexpectedly, the model selected by the grid search, that is, the combination of learning rate and number of epochs with the highest validation accuracy, **performed the worst**, with an accuracy score of 0.81366. Here are some possible reasons we can think of for this:

1. **Overfitting**: The model might have overfit the training data due to the chosen combination of learning rate and number of epochs. This can happen when the model becomes too specialized in learning the patterns in the training data, and as a result, performs poorly on new, unseen data.

2. **Local minima**: It is possible that the model got stuck in a local minimum during the training process, which may have resulted in suboptimal performance. The chosen learning rate and number of epochs might not have been suitable to escape the local minimum and reach a better solution.

3. **Randomness**: The training process involves a certain amount of randomness, which can lead to different results with different runs, even with the same hyperparameters. The worse performance might be attributed to the inherent randomness in the training process, and perhaps with additional runs, the performance might improve.
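To put the randomness argument in perspective, one rough yardstick is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below compares the four scores reported in this notebook against it. The test-set size `n_test = 3000` is a placeholder assumption, not a figure from this notebook; substitute `len(test_data)`. The formula also ignores the fact that all four models are scored on the same tweets, so it is only an order-of-magnitude guide.

```python
import math

# Accuracy scores reported in this notebook
scores = {
    "raw text": 0.81918,
    "combined text": 0.82715,
    "cleaned combined text": 0.8296,
    "grid search best": 0.81366,
}

n_test = 3000  # ASSUMPTION: placeholder size, replace with len(test_data)

baseline = scores["raw text"]
# Standard error of a proportion estimated from n_test independent predictions
std_error = math.sqrt(baseline * (1 - baseline) / n_test)

for name, score in scores.items():
    delta = score - baseline
    print(f"{name}: {score:.5f} (delta {delta:+.5f}, {delta / std_error:+.1f} SE)")
```

Under that placeholder size, every difference from the baseline is smaller than two standard errors, which would be consistent with the randomness explanation.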

## Conclusion

In conclusion, while the GPT-2 model initially showed promising results, the subsequent attempts with additional features, text cleaning and hyperparameter tuning did not yield the improvements we expected: the gains were marginal, and the grid-searched model scored lowest. This surprised us, as we initially believed that preprocessing the data and tuning the hyperparameters would lead to significant improvements in the model's performance.

Nevertheless, the GPT-2 model still outperformed all the other models we have trained. It was an unexpected outcome but an eye-opening experience that has allowed us to draw several valuable insights:

1. **Importance of raw data**: The fact that the GPT-2 model performed really well on raw data suggests that the context and structure of the original data might be more important than initially thought. This highlights the importance of carefully considering the impact of preprocessing on the model's performance.

2. **Complexity of fine-tuning**: Fine-tuning a pre-trained model like GPT-2 can be a complex process, and the combination of hyperparameters that scores best on the validation set might not give the best results on unseen data. This underlines the importance of exploring different strategies and being open to experimentation during the fine-tuning process.

3. **Model robustness**: The GPT-2 model's ability to perform well on raw data, without any preprocessing or cleaning, showcases the robustness of this model. It demonstrates that it can effectively handle noisy and unstructured data, making it a strong contender for natural language processing tasks.