My attempt at a sequence classification task using BERT via Hugging Face, as part of the [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started) getting started competition. The task is to classify whether tweets are about a real disaster or not. This is just a simple project to familiarize myself with these libraries.

Much is owed to the Hugging Face fine-tuning tutorial [here](https://huggingface.co/docs/transformers/training), as well as [this notebook](https://www.kaggle.com/code/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert#7.-Model) for some text cleaning and information plotting techniques.

In [None]:
import pandas as pd
import numpy as np

#libraries for data exploration and display
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#libraries for text preprocessing

import string
import re
import emoji
from bs4 import BeautifulSoup

#machine learning tools

import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from transformers import TrainingArguments, Trainer
from transformers import EarlyStoppingCallback
from transformers import BertTokenizer, BertForSequenceClassification

We begin with some exploratory data analysis, examining the tweets and related data.

In [None]:
#take a look at training data
train_data = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
train_data.head(10)

In [None]:
#take a look at test data
test_data = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
test_data.head(10)

In [None]:
#examine the distribution of disaster and non-disaster tweets
true_count = train_data[train_data['target'] == 1].shape[0]
false_count = train_data[train_data['target'] == 0].shape[0]

print(f'Disaster tweets: {true_count}')
print(f'Non-disaster tweets: {false_count}')

In [None]:
#check data types and look for null values
#(info() prints its report directly and returns None, so it is not wrapped in print)
print('Training data information:')
train_data.info()
print('\nTest data information:')
test_data.info()

The keyword column has a few missing values, and the location column is missing a significant portion of its values.

In [None]:
#examine the contents of keyword
train_data['keyword'].value_counts()

In [None]:
#examine the contents of location
train_data['location'].value_counts()

In [None]:
#the tail of the counts shows many values that appear only once, so check the total unique count
train_data['location'].nunique()

With so many unique values, it seems best to drop location.

In [None]:
#drop() returns a new dataframe, so assign the result back
train_data = train_data.drop('location', axis=1)
test_data = test_data.drop('location', axis=1)

Now, we work on the keyword column:

In [None]:
#replace null values with empty strings
train_data['keyword'] = train_data['keyword'].fillna('') 
test_data['keyword'] = test_data['keyword'].fillna('') 

In [None]:
#associate each train keyword with the fraction of its tweets that are real disasters
train_data['target_mean'] = train_data.groupby('keyword')['target'].transform('mean')
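
To make the groupby/transform step concrete, here is a minimal pure-Python sketch of the same idea. The rows are made up for illustration and are not taken from the competition data; the names `rows`, `keyword_mean`, and `target_mean` are likewise only illustrative.

```python
from collections import defaultdict

# Hypothetical (keyword, target) rows, made up for illustration only.
rows = [("wildfire", 1), ("wildfire", 1), ("wildfire", 0),
        ("aftershock", 0), ("aftershock", 0), ("", 1)]

# Accumulate the label sum and the row count for each keyword.
totals = defaultdict(lambda: [0, 0])
for keyword, target in rows:
    totals[keyword][0] += target
    totals[keyword][1] += 1

# Fraction of real-disaster tweets per keyword.
keyword_mean = {k: s / n for k, (s, n) in totals.items()}

# Like transform('mean'), broadcast each group's mean back onto every row.
target_mean = [keyword_mean[k] for k, _ in rows]
print(keyword_mean)
```

The last step is what distinguishes transform from a plain aggregation: the result has one value per original row, so it can be stored as a new column.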

In [None]:
#plot the keywords to observe their association with real disasters
plt.figure(figsize=(8, 60))

#sort once, then plot from the sorted dataframe
sorted_train = train_data.sort_values(by='target_mean', ascending=False)
sns.countplot(data=sorted_train, y='keyword', hue='target')

Some keywords are significantly more associated with real disasters than others, suggesting it may be a good idea to incorporate keyword information. For now, I experiment with simply appending the keywords to the text.

In [None]:
#experiment: simply append the keywords to the text
#assign the result back to the 'text' column so that the change persists
for df in (train_data, test_data):
    keyword = df['keyword'].str.replace('%20', ' ', regex=False) #minor text cleaning
    df['text'] = (df['text'] + ' ' + keyword).str.strip()
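
The `%20` in the keywords is a percent-encoded space. As a hedged alternative to replacing that one sequence by hand, the standard library's `urllib.parse.unquote` decodes any percent-escape. The helper names `clean_keyword` and `append_keyword` below are hypothetical, and the sample strings are made up rather than taken from the data.

```python
from urllib.parse import unquote

def clean_keyword(keyword):
    """Decode percent-escapes such as %20 and trim surrounding whitespace."""
    return unquote(keyword).strip()

def append_keyword(text, keyword):
    """Append the cleaned keyword to the tweet text, skipping empty keywords."""
    keyword = clean_keyword(keyword)
    return f"{text} {keyword}" if keyword else text

# Made-up examples, not rows from the competition data.
print(append_keyword("smoke over the ridge tonight", "forest%20fire"))
print(append_keyword("what a lovely day", ""))
```

A function like this could be mapped over the keyword and text columns together, which also avoids adding a trailing space to tweets that have no keyword.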

We now examine the text contents of the tweets and determine how to preprocess them.

In [None]:
#create corpora for each target.
disaster_corpus = []
for tweet in train_data[train_data['target']==1]['text'].str.split():
    for word in tweet:
        disaster_corpus.append(word)
        
nondisaster_corpus = []
for tweet in train_data[train_data['target']==0]['text'].str.split():
    for word in tweet:
        nondisaster_corpus.append(word)
        
disaster_corpus

We can look at these corpora to determine how to clean the text, finding a variety of misspelled words, links, usernames, and so on which can negatively impact our model's ability to handle the text.

We now clean the text, removing URLs, HTML, emojis, and other undesirable elements:

In [None]:
#cleaning function to be mapped onto all tweets
def clean_text(text):
    '''Performs various text cleaning operations, returning a processed string'''
    #make all text lowercase
    text = text.lower()
    
    #handle sequences like '\x89ÛÏ'
    text = text.encode('ascii', 'ignore').decode('utf-8') 
    
    #remove links
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text) 
    
    #remove html
    soup = BeautifulSoup(text, 'lxml')
    text = soup.get_text(strip=True)
    
    #remove emoji
    text = emoji.replace_emoji(text, replace='') 
    
    #remove mentions including user name
    text = re.sub(r'@\S+', '', text) 
    
    #remove numbers and mixed alphanumeric words (often user names, l33t, etc.)
    text = re.sub(r'\w*\d+\w*', '', text) 
    
    #remove punctuation
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)
    
    return text

In [None]:
#apply text cleaning
train_data['cleaned_text'] = train_data['text'].apply(clean_text)
test_data['cleaned_text'] = test_data['text'].apply(clean_text)
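
A quick way to sanity-check the regex steps is to run them on a single made-up tweet. The sketch below is a reduced, standard-library-only variant of the cleaning function (the name `clean_text_light` is hypothetical): it omits the HTML and emoji handling, and it adds a whitespace-collapsing step that the original function does not have.

```python
import re
import string

def clean_text_light(text):
    """Reduced variant of the cleaning steps: no HTML or emoji handling."""
    text = text.lower()
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)   # links
    text = re.sub(r'@\S+', '', text)                      # mentions
    text = re.sub(r'\w*\d+\w*', '', text)                 # numbers, mixed alphanumerics
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())                         # collapse leftover whitespace (added here)

# Made-up tweet, not taken from the competition data.
sample = "@user123 Fire near http://t.co/abc123 at 5pm!! #wildfire"
print(clean_text_light(sample))
```

Note that the number rule also removes tokens such as "5pm", so times and counts are lost along with user names. Whether that trade-off helps is something to check against validation scores.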

In [None]:
#new corpora of cleaned words, used during text cleaning experiments
disaster_corpus = []
for tweet in train_data[train_data['target']==1]['cleaned_text'].str.split():
    for word in tweet:
        disaster_corpus.append(word)
        
nondisaster_corpus = []
for tweet in train_data[train_data['target']==0]['cleaned_text'].str.split():
    for word in tweet:
        nondisaster_corpus.append(word)

disaster_corpus

Satisfied with how the text has been cleaned, we now begin tokenizing and encoding the text.

In [None]:
#create tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
#split the training dataframe into training and validation sets
X_train, X_val = train_test_split(train_data, test_size=0.2)
X_test = test_data

In [None]:
#examine how many tokens the tokenizer makes for our data, and set max_length accordingly
#(these counts exclude the [CLS] and [SEP] tokens added at encoding time)
temp_tokenized = train_data['cleaned_text'].apply(tokenizer.tokenize)
sns.histplot(x=temp_tokenized.apply(len))

Based on this histogram, we set max_length to 64, which leaves some headroom in case the test set contains slightly longer tweets.
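
As a rough sketch of how such a limit could be derived from the token counts rather than read off the plot, the hypothetical helper below picks the length that covers a chosen fraction of examples, adds two for the [CLS] and [SEP] tokens, and rounds up. The function name, the `coverage` and `multiple` parameters, and the sample counts are all assumptions made for illustration.

```python
import math

def suggest_max_length(token_counts, coverage=0.99, special_tokens=2, multiple=8):
    """Return a max_length that covers `coverage` of the examples.

    special_tokens accounts for [CLS] and [SEP]; the result is rounded up
    to a multiple of `multiple` (a convenience choice, not a requirement).
    """
    ordered = sorted(token_counts)
    # Index of the smallest count that covers the requested fraction.
    idx = min(len(ordered) - 1, math.ceil(coverage * len(ordered)) - 1)
    needed = ordered[idx] + special_tokens
    return math.ceil(needed / multiple) * multiple

# Hypothetical token counts, made up for illustration.
counts = [12, 18, 25, 31, 40, 22, 17, 29, 35, 45]
print(suggest_max_length(counts))
```

Anything longer than the chosen limit is truncated by the tokenizer, so a lower coverage trades a little lost text for shorter, faster batches.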

We then obtain encoding data (input ids and attention masks) for all our tweets:

In [None]:
#function to directly create a list of dictionaries to be loaded into the model:
def tokenize_df(tokenizer, df, has_label=True):
    '''Tokenizes df['cleaned_text'], returning one dictionary per tweet'''
    #convert to list
    tweet_list = df['cleaned_text'].tolist()
    #returns a dictionary-like object with keys: input_ids, attention_mask
    encoding_dict = tokenizer(tweet_list, padding="max_length", max_length=64,
                              truncation=True, return_token_type_ids=False)
    #add 'label' key if we are making train or validation data
    if has_label:
        encoding_dict['label'] = df['target'].tolist()
    #convert dictionary of lists into list of dictionaries
    return [dict(zip(encoding_dict, t)) for t in zip(*encoding_dict.values())]

#apply above function to create lists to use for our data:
X_train_tokenized = tokenize_df(tokenizer, X_train)
X_val_tokenized = tokenize_df(tokenizer, X_val)
X_test_tokenized = tokenize_df(tokenizer, X_test, has_label=False)
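
The final line of the function is compact, so here is a small sketch of the same dictionary-of-lists to list-of-dictionaries conversion on a toy input. The token IDs below are made up and far shorter than real encodings.

```python
# Toy encoding with the same layout as the tokenizer output: one list per key.
# The values are made up for illustration.
encoding_dict = {
    'input_ids': [[101, 2543, 102], [101, 3637, 102]],
    'attention_mask': [[1, 1, 1], [1, 1, 1]],
    'label': [1, 0],
}

# zip(*values) yields one tuple per example; zipping that tuple with the
# keys rebuilds a per-example dictionary.
examples = [dict(zip(encoding_dict, row)) for row in zip(*encoding_dict.values())]

for example in examples:
    print(example)
```

Iterating over a dictionary yields its keys in insertion order, which matches the order of `values()`, so each key is paired with the right element.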

The tokenized training data is now a list of dictionaries, each holding the input IDs, attention mask, and label for one example:

In [None]:
X_train_tokenized[0]

In [None]:
#dataset approach - I am currently experimenting with not using a dataset object 
#to better understand what is passed into and out of the model
'''#Create torch datasets for our dictionaries:
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = Dataset(X_train_tokenized, y_train)
eval_dataset = Dataset(X_eval_tokenized, y_val)
test_dataset = Dataset(X_test_tokenized)'''

#dataset object free method:
train_dataset = X_train_tokenized
eval_dataset = X_val_tokenized
test_dataset = X_test_tokenized

I have chosen to use a very standard BERT classification model for simplicity, and because BERT has been widely used and well studied, which makes it easy to find examples and discussions of its performance.

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels = 2)

In [None]:
#use gpu if available
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)
print(f'Using {device}')

In [None]:
#establish metrics to use when training:
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc,
            'f1': f1,
            'precision': precision,
            'recall': recall}
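
For reference, the sketch below computes the same four metrics by hand for the positive class (label 1), using only the standard library. It is meant to show what the scikit-learn calls measure, not to replace them; the function name `binary_metrics` and the sample labels are made up.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    correct = sum(1 for y, p in pairs if y == p)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': correct / len(pairs),
            'f1': f1,
            'precision': precision,
            'recall': recall}

# Made-up labels and predictions for six tweets.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 1, 0]
print(binary_metrics(labels, preds))
```

Precision asks how many predicted disasters were real, recall asks how many real disasters were found, and F1 is their harmonic mean.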

In [None]:
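training_args = TrainingArguments(
    output_dir="output",
    learning_rate=2e-5,
    num_train_epochs=2,
    report_to="none",
    #evaluate, save, and log on the same 100-step schedule so that the best
    #checkpoint can be reloaded at the end of training
    evaluation_strategy="steps", #named eval_strategy in newer transformers releases
    eval_steps=100,
    save_steps=100,
    logging_steps=100,
    load_best_model_at_end=True
)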

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])

In [None]:
#run the model
trainer.train()

In [None]:
metrics=trainer.evaluate()
print(metrics)

Let's look at some simple charts for our training, in addition to the trainer.train() output above:

In [None]:
log_df = pd.DataFrame(trainer.state.log_history)
sns.lineplot(data=log_df[['loss', 'eval_loss', 'eval_accuracy', 'eval_f1']])

It seems that the model fits the data quickly, and early stopping halts training at around the one-epoch mark.
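
The patience rule behind early stopping can be sketched in a few lines. This is a simplified illustration, assuming the monitored metric is the evaluation loss (lower is better) and ignoring the improvement threshold that the real callback also supports; the function name and the loss values are made up.

```python
def early_stop_step(eval_losses, patience=3):
    """Return the index of the evaluation at which training would stop,
    or None if patience is never exhausted. Lower loss is better."""
    best = float('inf')
    bad_evals = 0
    for i, loss in enumerate(eval_losses):
        if loss < best:
            # New best value: remember it and reset the counter.
            best = loss
            bad_evals = 0
        else:
            # No improvement: count it, and stop once patience runs out.
            bad_evals += 1
            if bad_evals >= patience:
                return i
    return None

# Hypothetical evaluation losses, one per evaluation round (made up).
losses = [0.52, 0.45, 0.41, 0.43, 0.44, 0.46, 0.50]
print(early_stop_step(losses))
```

With a patience of 3 and an evaluation every 100 steps, training stops 300 steps after the best evaluation, and the best checkpoint is the one that gets reloaded.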

We can now make predictions on the test data:

In [None]:
#make predictions:
predictions = trainer.predict(test_dataset)
predictions

predictions is a PredictionOutput with several fields (predictions, label_ids, and metrics), so we extract the predictions themselves:

In [None]:
y_pred = predictions.predictions
y_pred

There are two values (logits) per example, one for each class. We use argmax to pick the higher-scoring class for each example and collect the results in a list:

In [None]:
y_pred = [np.argmax(pred) for pred in y_pred]
y_pred[:20]
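
The two values per example are raw scores rather than probabilities. As an optional illustration, a softmax turns each pair into probabilities that sum to 1 without changing which class wins. The sketch below uses only the standard library and made-up logits; it is not part of the submission pipeline.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three tweets (made up for illustration).
logits = [[1.2, -0.8], [-0.3, 0.9], [0.1, 0.0]]

probs = [softmax(row) for row in logits]
# The index of the largest probability is the predicted class.
labels = [row.index(max(row)) for row in probs]
print(labels)
```

Because softmax preserves the ordering of the scores, taking argmax of the logits directly gives the same labels. The probabilities are only useful if a confidence estimate or a custom decision threshold is wanted.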

We then load that data into a submission file:

In [None]:
#submit file (by replacing a column in the sample file)
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
sample_submission["target"] = y_pred
sample_submission.to_csv("submission.csv", index=False)