In [1]:
import os
import re
import torch
import pandas as pd
import numpy as np

from torch.optim import SGD
from torch import nn
from tqdm import tqdm
from transformers import BertTokenizerFast, BertModel
from sklearn.model_selection import train_test_split

tqdm.pandas()

DATA_FOLDER = 'data'
DATA_TEST_FILE = 'test.csv'
DATA_TRAIN_FILE = 'train.csv'
DATA_SAMPLE_SUBMISSION_FILE = 'sample_submission.csv'

# Data exploration

## Dataset Description
## What files do I need?
You'll need `train.csv`, `test.csv` and `sample_submission.csv`.

## What should I expect the data format to be?
Each sample in the train and test set has the following information:

- The `text` of a tweet
- A `keyword` from that tweet (although this may be blank!)
- The `location` the tweet was sent from (may also be blank)

## What am I predicting?
You are predicting whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

## Files
- `train.csv` - the training set
- `test.csv` - the test set
- `sample_submission.csv` - a sample submission file in the correct format
  
## Columns
- `id` - a unique identifier for each tweet
- `text` - the text of the tweet
- `location` - the location the tweet was sent from (may be blank)
- `keyword` - a particular keyword from the tweet (may be blank)
- `target` - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

The next cell loads the training and test datasets with pandas. It reads the CSV files from the `DATA_FOLDER` directory and replaces any missing values with an empty string.

The `train_df` and `test_df` dataframes hold the training and test sets respectively. Both have the columns `id`, `keyword`, `location` and `text`; `train_df` additionally has `target`, the binary label indicating whether a tweet is about a real disaster (1) or not (0).

The `sample_submission_df` dataframe is loaded from `sample_submission.csv` in the same folder. It shows the format that the submission file must follow.

`train_df.head(5)` displays the first 5 rows of `train_df`.

Finally, four `print` statements output the number of unique values in the `keyword` and `location` columns of the training and test datasets. This shows how many distinct keywords and locations are present, which is useful for understanding the data and for feature engineering later on.

In [2]:
train_df = pd.read_csv(os.path.join(DATA_FOLDER, DATA_TRAIN_FILE))
train_df = train_df.fillna('')
test_df = pd.read_csv(os.path.join(DATA_FOLDER, DATA_TEST_FILE))
test_df = test_df.fillna('')

sample_submission_df = pd.read_csv(os.path.join(DATA_FOLDER, DATA_SAMPLE_SUBMISSION_FILE))
train_df.head(5)

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [3]:
print(f"Training - Number of unique values in keyword = {train_df['keyword'].nunique()}")
print(f"Training - Number of unique values in location = {train_df['location'].nunique()}\n")

print(f"Testing - Number of unique values in keyword = {test_df['keyword'].nunique()}")
print(f"Testing - Number of unique values in location = {test_df['location'].nunique()}")

Training - Number of unique values in keyword = 222
Training - Number of unique values in location = 3342

Testing - Number of unique values in keyword = 222
Testing - Number of unique values in location = 1603
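
One detail is worth keeping in mind when reading these counts: `nunique()` ignores missing values by default, but after `fillna('')` a blank entry is an ordinary string and is counted like any other value. The sketch below illustrates this on a made-up toy column (the values are not taken from the dataset).

```python
import pandas as pd

# Toy data standing in for the real `keyword` column (values are made up).
toy = pd.DataFrame({"keyword": ["fire", None, "flood", "fire", None]})

# Missing values are ignored by nunique() by default.
before = toy["keyword"].nunique()

# After fillna(''), the empty string counts as one more distinct value.
filled = toy["keyword"].fillna("")
after = filled.nunique()

# Excluding blanks explicitly recovers the number of real keywords.
non_blank = filled[filled != ""].nunique()

print(before, after, non_blank)
```

If the number of real keywords matters, filtering out the empty string before counting, as in the last step, is one way to get it.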


In [4]:
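def remove_urls(text):
    # remove "http" and all non-whitespace characters that follow it
    return re.sub(r"http\S+", "", text)

# note: only the training text is cleaned here; test_df['text'] is left as-is
train_df['text'] = train_df['text'].apply(remove_urls)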
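
To see what this pattern does and does not catch, here is a small check on made-up strings. The `" ".join(cleaned.split())` step is an optional extra that is not part of the notebook; it only collapses the double space left behind by a removed URL.

```python
import re

def remove_urls(text):
    # same pattern as in the cell above
    return re.sub(r"http\S+", "", text)

sample = "Forest fire near La Ronge http://t.co/abc123 stay safe"  # made-up tweet
cleaned = remove_urls(sample)

# The URL is gone, but the spaces on both sides of it remain.
normalized = " ".join(cleaned.split())

# Links without an "http" prefix are not matched by this pattern.
untouched = remove_urls("see www.example.com for updates")

print(repr(cleaned))
print(repr(normalized))
print(repr(untouched))
```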

The next cell defines a PyTorch module called `BertClassifier`, which performs binary classification on text using a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model.

The `__init__` method sets up the architecture. `BertModel.from_pretrained('bert-base-cased')` loads a pre-trained BERT model from the Hugging Face Transformers library. The `dropout` layer randomly zeroes 10% of its inputs during training to reduce overfitting. The fully connected layers `fc1` (768 to 256) and `fc2` (256 to 1) map the BERT output to a single logit per input text.

The `forward` method defines how the input flows through the network. `input_ids` and `attention_mask` are passed through BERT to obtain `pooler_output`: the last-layer hidden state of the first (`[CLS]`) token, further processed by a linear layer and a tanh activation, which serves as a summary of the whole sequence. This vector goes through dropout, `fc1`, a ReLU activation and `fc2`. The result, `logits`, is a raw score rather than a probability; applying a sigmoid converts it to the probability that the text is about a real disaster.

The `tokenizer` line creates a BERT tokenizer from the Hugging Face Transformers library, which converts text into the input features that `BertClassifier` expects. `BertTokenizerFast.from_pretrained('bert-base-cased')` loads the pre-trained vocabulary with the fast (Rust-backed) tokenizer implementation.
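
The shapes involved can be checked without downloading BERT. The sketch below is a stand-alone copy of the dropout, `fc1`, ReLU and `fc2` layers under the hypothetical name `Head`, fed with a random tensor in place of the real `pooler_output`. It only illustrates shapes and the logit-to-probability step, not the behaviour of the trained model.

```python
import torch
from torch import nn

class Head(nn.Module):
    """Stand-alone copy of the classification layers (hypothetical name)."""
    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.fc1 = nn.Linear(768, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, pooled_output):
        hidden = nn.functional.relu(self.fc1(self.dropout(pooled_output)))
        return self.fc2(hidden)

torch.manual_seed(0)
head = Head().eval()              # eval() disables dropout
pooled = torch.randn(4, 768)      # random stand-in for BERT's pooler_output

with torch.no_grad():
    logits = head(pooled)         # shape (4, 1), unbounded real values
probs = torch.sigmoid(logits)     # same shape, values between 0 and 1

print(logits.shape, probs.flatten())
```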

In [5]:
class BertClassifier(nn.Module):
    def __init__(self):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.dropout = nn.Dropout(0.1)
        self.fc1 = nn.Linear(768, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs.pooler_output
        pooled_output = self.dropout(pooled_output)
        hidden = nn.functional.relu(self.fc1(pooled_output))
        logits = self.fc2(hidden)
        return logits
    
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# Training the model

This code defines a function `train_and_evaluate` that trains and evaluates a given BERT-based binary text classifier model using the provided training and validation datasets. The function takes three inputs:

- `model`: an instance of the `BertClassifier` class defined earlier, representing the BERT-based text classifier.
- `train_data`: a pandas dataframe containing the training data, with columns 'text' and 'target'.
- `val_data`: a pandas dataframe containing the validation data, with columns 'text' and 'target'.

The function performs the following steps:

1. Extract the labels and texts from the training and validation datasets.
2. Tokenize the texts using the BERT tokenizer, truncate them to at most 32 tokens, pad each set to the length of its longest sequence, and convert the result to PyTorch tensors.
3. Create PyTorch TensorDataset objects for the tokenized texts and labels, and then create PyTorch DataLoader objects for the training and validation datasets.
4. Set the device for the PyTorch tensors to either 'cuda' or 'cpu' depending on availability of a GPU.
5. Define the loss function and optimizer for training the BERT model.
6. Train the model for a specified number of epochs (EPOCHS), using the training dataset and validation dataset in each epoch.
7. For each epoch, compute and print the training and validation loss and accuracy metrics.
8. Return the trained model.
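
The truncation and padding in step 2 can be pictured with a toy function. `pad_and_truncate` below is a hypothetical helper written for illustration, not the tokenizer's real implementation (for instance, the real tokenizer keeps the closing special token when it truncates), and the token ids are made up. It contrasts padding to the longest sequence (`padding=True`) with padding to the full length (`padding='max_length'`).

```python
def pad_and_truncate(batch, max_length, pad_to_max=False, pad_id=0):
    """Toy illustration of truncation, padding and the attention mask."""
    truncated = [ids[:max_length] for ids in batch]
    # pad either to max_length or to the longest sequence in the batch
    target = max_length if pad_to_max else max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in truncated]
    return input_ids, attention_mask

batch = [[101, 7, 8, 102], [101, 9, 102]]  # made-up token ids

ids_longest, mask_longest = pad_and_truncate(batch, max_length=32)
ids_max, mask_max = pad_and_truncate(batch, max_length=32, pad_to_max=True)

print(ids_longest, mask_longest)
print(len(ids_max[0]), sum(mask_max[1]))
```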

The model is evaluated on the validation dataset at the end of each epoch, which makes overfitting visible (evaluation alone does not prevent it). `train_and_evaluate` uses binary cross-entropy with logits (`nn.BCEWithLogitsLoss()`) as the loss function and stochastic gradient descent with momentum (`SGD()`) as the optimizer. The model output has shape `(batch_size, 1)`, so it is flattened with `flatten()` to match the shape of the labels tensor before both are passed to the loss function. For each training batch, the gradients are reset, the loss is backpropagated, and the optimizer updates the parameters. Accuracy is the number of correct predictions (a logit above 0 counts as class 1) divided by the number of samples. Note that the reported loss is the sum of the per-batch mean losses divided by the number of samples, not by the number of batches, so with a batch size of 32 it is roughly 1/32 of the average per-sample loss.
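
The effect of that normalization can be reproduced with plain Python. The sketch below uses made-up logits of 0.0, for which every example has a loss of ln(2), about 0.693, and compares dividing the summed batch means by the number of samples (as the training loop does) with dividing by the number of batches.

```python
import math

def bce_with_logits(logit, label):
    """Binary cross-entropy for one example, computed from a raw logit."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

BATCH_SIZE = 32
logits = [0.0] * 64        # made-up outputs of an undecided model
labels = [1, 0] * 32       # made-up labels

batches = [range(i, i + BATCH_SIZE) for i in range(0, len(logits), BATCH_SIZE)]
batch_means = [
    sum(bce_with_logits(logits[i], labels[i]) for i in b) / len(b)
    for b in batches
]

per_sample = sum(batch_means) / len(logits)   # normalization used in the loop
per_batch = sum(batch_means) / len(batches)   # average loss per example

print(round(per_sample, 4), round(per_batch, 4))
```

Under these assumptions the first value is ln(2)/32, about 0.0217, which is the scale of the losses printed in this notebook, while the second is the usual per-example loss.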

In [6]:
EPOCHS = 10
LR = 1e-4
MOMENTUM = 0.9
    
def train_and_evaluate(model, train_data, val_data):
    train_labels = train_data['target'].values
    train_texts = train_data['text'].values
    
    val_labels = val_data['target'].values
    val_texts = val_data['text'].values

    # tokenize texts
    train_inputs = tokenizer(list(train_texts), padding=True, truncation=True, max_length=32, return_tensors='pt')
    val_inputs = tokenizer(list(val_texts), padding=True, truncation=True, max_length=32, return_tensors='pt')

    # create dataset and dataloader
    dataset_train = torch.utils.data.TensorDataset(train_inputs['input_ids'], train_inputs['attention_mask'], torch.tensor(train_labels))
    dataset_val = torch.utils.data.TensorDataset(val_inputs['input_ids'], val_inputs['attention_mask'], torch.tensor(val_labels))
    
    dataloader_train = torch.utils.data.DataLoader(dataset_train, batch_size=32, shuffle=True)
    dataloader_val = torch.utils.data.DataLoader(dataset_val, batch_size=32, shuffle=False)

    # set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # define loss function and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = SGD(model.parameters(), lr=LR, momentum=MOMENTUM)

    # move model to device
    model.to(device)

    for epoch in range(EPOCHS):
        model.train()
        
        epoch_loss = 0
        epoch_acc = 0
        
        for input_ids, attention_mask, labels in tqdm(dataloader_train):
            # move inputs and labels to device
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)

            # forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs.flatten(), labels.float())

            # backward pass and optimization step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # calculate metrics
            acc = ((outputs > 0) == labels.unsqueeze(-1)).sum().item()
            epoch_loss += loss.item()
            epoch_acc += acc
            
        model.eval()

        total_acc_val = 0
        total_loss_val = 0
        
        # gradients are not needed for validation, so disable autograd tracking
        with torch.no_grad():
            for input_ids, attention_mask, labels in tqdm(dataloader_val):
                # move inputs and labels to device
                input_ids = input_ids.to(device)
                attention_mask = attention_mask.to(device)
                labels = labels.to(device)

                # forward pass
                outputs = model(input_ids=input_ids, attention_mask=attention_mask)
                loss = criterion(outputs.flatten(), labels.float())

                # calculate metrics
                acc = ((outputs > 0) == labels.unsqueeze(-1)).sum().item()
                total_loss_val += loss.item()
                total_acc_val += acc

        train_loss = epoch_loss / len(dataset_train)
        train_accuracy = epoch_acc / len(dataset_train)
        
        val_loss = total_loss_val / len(dataset_val)
        val_accuracy = total_acc_val / len(dataset_val)
            
        

        print(f"Epoch {epoch+1}/{EPOCHS}: train_loss={train_loss:.4f}, train_accuracy={train_accuracy:.4f}, val_loss={val_loss:.4f}, val_accuracy={val_accuracy:.4f}")

    return model


This code defines a `train()` function that takes in a pre-trained `model` and a `train_data` dataset, and fine-tunes the model on the training data.

The function first extracts the labels and texts from the training data, tokenizes the texts using the `tokenizer` object, and creates a `TensorDataset` and a `DataLoader` for the training data.

Then, the function moves the model to the appropriate device (GPU or CPU), defines the loss function and optimizer (using the same values for `LR` and `MOMENTUM` as in the `train_and_evaluate()` function), and enters a training loop that runs for `EPOCHS` epochs.

In each epoch, the function iterates over the batches in the training data, moves the inputs and labels to the device, performs a forward pass through the model to obtain the outputs, calculates the loss and accuracy, performs backpropagation and an optimization step, and accumulates the epoch-level loss and accuracy metrics.

At the end of each epoch, the function prints out the epoch-level training loss and accuracy.

Finally, the function returns the trained model.

Note that this function does not evaluate the model on a validation set or perform early stopping, so it is not a complete training and evaluation pipeline. In this notebook it is used for the final fit on the entire training set, after validation has been done with `train_and_evaluate()`.
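
Early stopping is not implemented anywhere in this notebook. As a rough idea of what it could look like, the sketch below defines a hypothetical `EarlyStopping` helper and replays the ten validation losses logged by the `train_and_evaluate()` run in this notebook. The class name, `patience` and `min_delta` are assumptions chosen for illustration.

```python
class EarlyStopping:
    """Hypothetical helper: stop after `patience` epochs without improvement."""
    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# validation losses printed for epochs 1 to 10 in this notebook
val_losses = [0.0209, 0.0193, 0.0167, 0.0151, 0.0138,
              0.0132, 0.0130, 0.0131, 0.0135, 0.0128]

stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

With a patience of 2 this rule would stop after epoch 9 and miss the slightly lower loss at epoch 10, which shows how sensitive the outcome is to the patience setting.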

In [7]:
def train(model, train_data):
    train_labels = train_data['target'].values
    train_texts = train_data['text'].values

    # tokenize texts
    train_inputs = tokenizer(list(train_texts), padding=True, truncation=True, max_length=32, return_tensors='pt')
    
    # create dataset and dataloader
    dataset_train = torch.utils.data.TensorDataset(train_inputs['input_ids'], train_inputs['attention_mask'], torch.tensor(train_labels))
    
    dataloader_train = torch.utils.data.DataLoader(dataset_train, batch_size=32, shuffle=True)

    # set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # define loss function and optimizer
    criterion = nn.BCEWithLogitsLoss()
    optimizer = SGD(model.parameters(), lr=LR, momentum=MOMENTUM)

    # move model to device
    model.to(device)

    for epoch in range(EPOCHS):
        model.train()
        
        epoch_loss = 0
        epoch_acc = 0
        
        for input_ids, attention_mask, labels in tqdm(dataloader_train):
            # move inputs and labels to device
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            labels = labels.to(device)

            # forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs.flatten(), labels.float())

            # backward pass and optimization step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # calculate metrics
            acc = ((outputs > 0) == labels.unsqueeze(-1)).sum().item()
            epoch_loss += loss.item()
            epoch_acc += acc

        train_loss = epoch_loss / len(dataset_train)
        train_accuracy = epoch_acc / len(dataset_train)
        

        print(f"Epoch {epoch+1}/{EPOCHS}: train_loss={train_loss:.4f}, train_accuracy={train_accuracy:.4f}")

    return model


## Training and evaluating the model

The code first creates an instance of the `BertClassifier` class, which is a PyTorch model that uses the BERT architecture for text classification.

Then, it splits the `train_df` dataframe into two subsets using `train_test_split()` from scikit-learn, with a train size of 80% and stratified sampling based on the `keyword` column. The resulting subsets are assigned to `df_train` and `df_val`.

Finally, the `train_and_evaluate()` function is called with the `bert` model, `df_train` as the training data, and `df_val` as the validation data. The function trains the model on the training data, evaluates it on the validation data, and returns the trained model. The variable `model` is assigned the trained model.
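
Stratifying on `keyword` keeps the keyword distribution similar in both subsets; it does not directly control the ratio of positive to negative labels. A common alternative, shown here only as a sketch on made-up data, is to stratify on the label itself so that both subsets keep the same class ratio. The arrays and `random_state=0` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 30 of them positive.
X = np.arange(100)
y = np.array([1] * 30 + [0] * 70)

# Stratify on the labels so both subsets keep the 30% positive rate.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)

print(len(X_train), len(X_val))
print(y_train.mean(), y_val.mean())
```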

In [8]:
bert = BertClassifier()

df_train, df_val = train_test_split(train_df, train_size=0.8, stratify=train_df['keyword'])

# train BERT and report training and validation metrics for each epoch
model = train_and_evaluate(bert, df_train, df_val)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 191/191 [00:16<00:00, 11.28it/s]
100%|██████████| 48/48 [00:01<00:00, 39.35it/s]


Epoch 1/10: train_loss=0.0213, train_accuracy=0.5677, val_loss=0.0209, val_accuracy=0.5817


100%|██████████| 191/191 [00:16<00:00, 11.81it/s]
100%|██████████| 48/48 [00:01<00:00, 38.41it/s]


Epoch 2/10: train_loss=0.0203, train_accuracy=0.6284, val_loss=0.0193, val_accuracy=0.7367


100%|██████████| 191/191 [00:16<00:00, 11.69it/s]
100%|██████████| 48/48 [00:01<00:00, 38.10it/s]


Epoch 3/10: train_loss=0.0182, train_accuracy=0.7415, val_loss=0.0167, val_accuracy=0.7754


100%|██████████| 191/191 [00:16<00:00, 11.54it/s]
100%|██████████| 48/48 [00:01<00:00, 37.62it/s]


Epoch 4/10: train_loss=0.0159, train_accuracy=0.7810, val_loss=0.0151, val_accuracy=0.7787


100%|██████████| 191/191 [00:16<00:00, 11.48it/s]
100%|██████████| 48/48 [00:01<00:00, 37.55it/s]


Epoch 5/10: train_loss=0.0145, train_accuracy=0.8021, val_loss=0.0138, val_accuracy=0.8109


100%|██████████| 191/191 [00:16<00:00, 11.44it/s]
100%|██████████| 48/48 [00:01<00:00, 37.03it/s]


Epoch 6/10: train_loss=0.0138, train_accuracy=0.8107, val_loss=0.0132, val_accuracy=0.8181


100%|██████████| 191/191 [00:16<00:00, 11.40it/s]
100%|██████████| 48/48 [00:01<00:00, 36.80it/s]


Epoch 7/10: train_loss=0.0132, train_accuracy=0.8156, val_loss=0.0130, val_accuracy=0.8175


100%|██████████| 191/191 [00:16<00:00, 11.38it/s]
100%|██████████| 48/48 [00:01<00:00, 37.37it/s]


Epoch 8/10: train_loss=0.0127, train_accuracy=0.8273, val_loss=0.0131, val_accuracy=0.8207


100%|██████████| 191/191 [00:16<00:00, 11.37it/s]
100%|██████████| 48/48 [00:01<00:00, 36.94it/s]


Epoch 9/10: train_loss=0.0125, train_accuracy=0.8314, val_loss=0.0135, val_accuracy=0.8122


100%|██████████| 191/191 [00:16<00:00, 11.34it/s]
100%|██████████| 48/48 [00:01<00:00, 37.06it/s]

Epoch 10/10: train_loss=0.0121, train_accuracy=0.8348, val_loss=0.0128, val_accuracy=0.8299





## Training the model on the entire training dataset

This step reuses the same training loop, but trains a fresh BERT classifier on the entire `train_df` dataset, without splitting it into a training and validation set.

The `train` function takes the BERT model and the training data, tokenizes the texts, creates a dataset and dataloader, defines the loss function and optimizer, moves the model to the device, and trains the model for `EPOCHS` epochs.

In each epoch, the function loops through the batches in the dataloader, moves the inputs and labels to the device, performs a forward pass, computes the loss, and performs a backward pass and optimization step to update the model's parameters. Loss and accuracy are accumulated over the batches and printed at the end of each epoch.

The function returns the trained model.

In [9]:
bert = BertClassifier()

# train BERT on the full training set
model = train(bert, train_df)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
100%|██████████| 238/238 [00:20<00:00, 11.35it/s]


Epoch 1/10: train_loss=0.0208, train_accuracy=0.6128


100%|██████████| 238/238 [00:20<00:00, 11.33it/s]


Epoch 2/10: train_loss=0.0175, train_accuracy=0.7423


100%|██████████| 238/238 [00:21<00:00, 11.31it/s]


Epoch 3/10: train_loss=0.0148, train_accuracy=0.7936


100%|██████████| 238/238 [00:20<00:00, 11.35it/s]


Epoch 4/10: train_loss=0.0137, train_accuracy=0.8089


100%|██████████| 238/238 [00:21<00:00, 11.31it/s]


Epoch 5/10: train_loss=0.0132, train_accuracy=0.8157


100%|██████████| 238/238 [00:21<00:00, 11.31it/s]


Epoch 6/10: train_loss=0.0128, train_accuracy=0.8256


100%|██████████| 238/238 [00:21<00:00, 11.32it/s]


Epoch 7/10: train_loss=0.0124, train_accuracy=0.8278


100%|██████████| 238/238 [00:21<00:00, 11.31it/s]


Epoch 8/10: train_loss=0.0121, train_accuracy=0.8342


100%|██████████| 238/238 [00:21<00:00, 11.32it/s]


Epoch 9/10: train_loss=0.0116, train_accuracy=0.8466


100%|██████████| 238/238 [00:21<00:00, 11.33it/s]

Epoch 10/10: train_loss=0.0115, train_accuracy=0.8472





# Making predictions

The next cell defines a function `predict` that takes a trained model and a sentence and predicts whether the sentence is about a real disaster. The sentence is tokenized with the same `tokenizer` and the same `max_length` of 32 as during training; unlike training, which pads to the longest sequence, it always pads to the full 32 tokens (`padding='max_length'`). The token IDs and attention mask are passed to the model to obtain a logit, which is converted to a probability with the sigmoid function. The probability is rounded to the nearest integer to obtain the predicted label (1 for a disaster, 0 otherwise).

The function is then applied to each row of the `text` column of `test_df`, and the predicted labels are assigned to the `target` column of `sample_submission_df`. `progress_apply`, which `tqdm.pandas()` adds to pandas objects, displays a progress bar during prediction. The model must be put in evaluation mode with `model.eval()` before making predictions, so that dropout is disabled.
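
Rounding the sigmoid of a logit gives the same label as checking whether the logit is above 0, which is the rule the training loops use for accuracy. The sketch below checks this in plain Python on made-up logits. Python's `round`, like `torch.round`, rounds an exact 0.5 to 0.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [-3.2, -0.1, 0.0, 0.4, 5.0]   # made-up raw model outputs

# label via sigmoid followed by rounding, as in predict()
via_round = [round(sigmoid(z)) for z in logits]

# label via the sign of the logit, as in the accuracy computation
via_sign = [int(z > 0) for z in logits]

print(via_round, via_sign)
```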

In [10]:
def predict(model, sentence):
    # tokenize the sentence and convert to tensor
    inputs = tokenizer(sentence, padding='max_length', max_length=32, truncation=True, return_tensors='pt')

    # set device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    
    # get the predicted logits for the input
    with torch.no_grad():
        logits = model(inputs['input_ids'].to(device), attention_mask=inputs['attention_mask'].to(device)).flatten()

    # convert logits to probabilities and return the predicted label
    probs = torch.sigmoid(logits).squeeze()
    label = torch.round(probs).item()

    return int(label)
    
model.eval()
sample_submission_df['target'] = test_df['text'].progress_apply(lambda x: predict(model, x))

100%|██████████| 3263/3263 [00:21<00:00, 153.85it/s]


In [11]:
sample_submission_df.describe()

| | id | target |
|---|---|---|
| count | 3263.0 | 3263.0 |
| mean | 5427.152927 | 0.368986 |
| std | 3146.427221 | 0.482604 |
| min | 0.0 | 0.0 |
| 25% | 2683.0 | 0.0 |
| 50% | 5500.0 | 0.0 |
| 75% | 8176.0 | 1.0 |
| max | 10875.0 | 1.0 |
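
Because `target` only takes the values 0 and 1, its mean is the fraction of tweets predicted to be about a disaster. The short calculation below turns the `count` and `mean` reported above into absolute numbers.

```python
n_rows = 3263            # count reported by describe()
mean_target = 0.368986   # mean of the 0/1 predictions reported by describe()

# number of tweets predicted as disasters (label 1) and as non-disasters (label 0)
predicted_disasters = round(n_rows * mean_target)
predicted_other = n_rows - predicted_disasters

print(predicted_disasters, predicted_other)
```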


In [12]:
sample_submission_df.to_csv('submission.csv', index=False)