### This is a Python notebook for predicting whether a tweet is about a real disaster or not.

__The dataset is taken from Kaggle:__ https://www.kaggle.com/c/nlp-getting-started/overview

__Main objective:__
1. Utilise a pre-trained language model from HuggingFace, such as `bert-base-uncased`.
2. Implement the code using object-oriented programming (OOP).

To understand how OOP is applied in a deep learning project, I relied on this video: https://www.youtube.com/watch?v=5FYBf-HG3as

In [1]:
!pip install transformers==3


In [2]:
#importing necessary packages
import pandas as pd
import torch
import transformers
from sklearn import model_selection
from sklearn import metrics
from transformers import AdamW
from transformers import get_linear_schedule_with_warmup
import torch.nn as nn
import numpy as np
from tqdm import tqdm

In [3]:
#variables storing configuration settings
DEVICE = "cuda"
MAX_LEN = 64
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 10
BERT_PATH = "bert-base-uncased"
MODEL_PATH = "/content/drive/MyDrive/dataset/recommendation/model.bin"
TRAINING_FILE = "/content/drive/MyDrive/dataset/recommendation/train.csv"
TOKENIZER = transformers.BertTokenizer.from_pretrained(BERT_PATH, do_lower_case=True)

In [4]:
#importing the training dataset
data = pd.read_csv("/content/drive/MyDrive/dataset/nlp_disaster/train.csv")
data.shape

(7613, 5)

In [5]:
data.head()

| | id | keyword | location | text | target |
|---|---|---|---|---|---|
| 0 | 1 | | | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | | | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | | | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | | | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | | | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [6]:
data.isnull().sum()

id             0
keyword       61
location    2533
text           0
target         0
dtype: int64

In [7]:
# Removing 'id' (this is an index column) and 'keyword', 'location' (they have missing data)
data.drop(['id','keyword','location'],axis=1,inplace=True)

# Renaming the input column 'text' to 'review' and the target column 'target' to 'sentiment'
data.columns = ["review","sentiment"]

In [8]:
#splitting the dataset into train and validation sets with a 90:10 ratio.

df_train, df_valid = model_selection.train_test_split(
        data, test_size=0.1, random_state=42, stratify=data.sentiment.values
    )
df_train = df_train.reset_index(drop=True)
df_valid = df_valid.reset_index(drop=True)
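
As a quick sanity check on the 90:10 split, the subset sizes and the resulting number of batches per epoch can be worked out by hand. The sketch below assumes that scikit-learn rounds a fractional `test_size` up to a whole number of rows, and it uses the batch sizes 8 and 4 from the configuration cell; the helper name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_valid) for a fractional test_size, rounding the validation part up."""
    n_valid = math.ceil(test_size * n_samples)
    return n_samples - n_valid, n_valid

n_train, n_valid = split_sizes(7613, 0.1)   # 7613 rows, as reported by data.shape
train_batches = math.ceil(n_train / 8)      # TRAIN_BATCH_SIZE = 8
valid_batches = math.ceil(n_valid / 4)      # VALID_BATCH_SIZE = 4
print(n_train, n_valid, train_batches, valid_batches)
```

Under these assumptions the split gives 6851 training rows and 762 validation rows, that is, 857 training batches and 191 validation batches per epoch.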

In [9]:
# Dataset class that tokenizes each text and pads/truncates it to a fixed length for the BERT model.

class BERTDataset:
    def __init__(self, review, target):
        self.review = review
        self.target = target
        self.tokenizer = TOKENIZER
        self.max_len = MAX_LEN

    def __len__(self):
        return len(self.review)

    def __getitem__(self, item):
        review = str(self.review[item])
        review = " ".join(review.split())

        inputs = self.tokenizer.encode_plus(
            review,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,  # explicitly truncate texts longer than max_len
            pad_to_max_length=True,
        )

        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]
        token_type_ids = inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "targets": torch.tensor(self.target[item], dtype=torch.float),
        }
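
To make the fixed-length encoding concrete, here is a minimal pure-Python sketch of what padding and truncating a single sentence to `MAX_LEN` amounts to. It is not the tokenizer's actual implementation: the helper `pad_or_truncate` and the example token ids 7, 8, 9 are hypothetical, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are assumed from the `bert-base-uncased` vocabulary.

```python
def pad_or_truncate(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Build fixed-length input ids and an attention mask for one sentence."""
    # Leave room for the two special tokens, then truncate the content.
    content = token_ids[: max_len - 2]
    ids = [cls_id] + content + [sep_id]
    mask = [1] * len(ids)            # 1 marks real tokens
    padding = max_len - len(ids)
    ids += [pad_id] * padding        # pad on the right
    mask += [0] * padding            # 0 marks padding the model should ignore
    return ids, mask

ids, mask = pad_or_truncate([7, 8, 9], max_len=8)
print(ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every item returned by the dataset therefore has the same length, which is what lets the `DataLoader` stack items into a batch.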

In [10]:
#class for Pre-trained model BERT Base Uncased

class BERTBaseUncased(nn.Module):
    def __init__(self):
        super(BERTBaseUncased, self).__init__()
        self.bert = transformers.BertModel.from_pretrained(BERT_PATH)
        self.bert_drop = nn.Dropout(0.3)
        self.out = nn.Linear(768, 1)

    def forward(self, ids, mask, token_type_ids):
        _, o2 = self.bert(ids, attention_mask=mask, token_type_ids=token_type_ids)
        bo = self.bert_drop(o2)
        output = self.out(bo)
        return output

In [11]:
# Creating the train and validation datasets and data loaders used by the PyTorch model

train_dataset = BERTDataset(review=df_train.review.values, target=df_train.sentiment.values)
train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=TRAIN_BATCH_SIZE, num_workers=4)
valid_dataset = BERTDataset(review=df_valid.review.values, target=df_valid.sentiment.values)
valid_data_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=VALID_BATCH_SIZE, num_workers=1)

In [12]:
# Initialisation of BERT model using Pytorch with optimizer parameters

device = torch.device(DEVICE)
model = BERTBaseUncased()
model.to(device)
param_optimizer = list(model.named_parameters())
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
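optimizer_parameters = [
        {
            "params": [
                p for n, p in param_optimizer if not any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.001,
        },
        {
            # biases and LayerNorm parameters are still optimized, but without weight decay
            "params": [
                p for n, p in param_optimizer if any(nd in n for nd in no_decay)
            ],
            "weight_decay": 0.0,
        },
    ]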

In [13]:
# Setting up the AdamW optimizer and a linear learning-rate scheduler for the model
num_train_steps = int(len(df_train) / TRAIN_BATCH_SIZE * EPOCHS)
optimizer = AdamW(optimizer_parameters, lr=3e-5)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_train_steps)
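
For intuition, the learning-rate schedule can be sketched in plain Python: the rate ramps up linearly during warmup and then decays linearly to zero at `num_training_steps`. The function `linear_schedule_factor` below is a hypothetical stand-in written for illustration, not the library's code, and `TOTAL_STEPS` is a made-up value. It returns the multiplier applied to the base rate of 3e-5.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 up to 1 during warmup
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 down to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

BASE_LR = 3e-5
TOTAL_STEPS = 1000  # illustrative value, not the notebook's num_train_steps
for step in (0, 250, 500, 1000):
    print(step, BASE_LR * linear_schedule_factor(step, 0, TOTAL_STEPS))
```

With `num_warmup_steps=0`, as in the cell above, the rate starts at its full value and only decreases. Because the schedule is sized for `EPOCHS` epochs, a run that stops earlier never reaches a rate of zero.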

In [14]:
# Defining the loss function: binary cross-entropy computed on the raw logits

def loss_fn(outputs, targets):
    return nn.BCEWithLogitsLoss()(outputs, targets.view(-1, 1))


# Defining the training function, consisting of the forward pass (to get the loss) and backpropagation (to update the weights)
def train_fn(data_loader, model, optimizer, device, scheduler):
    model.train()

    for bi, d in tqdm(enumerate(data_loader), total=len(data_loader)):
        ids = d["ids"]
        token_type_ids = d["token_type_ids"]
        mask = d["mask"]
        targets = d["targets"]

        ids = ids.to(device, dtype=torch.long)
        token_type_ids = token_type_ids.to(device, dtype=torch.long)
        mask = mask.to(device, dtype=torch.long)
        targets = targets.to(device, dtype=torch.float)

        optimizer.zero_grad()
        outputs = model(ids=ids, mask=mask, token_type_ids=token_type_ids)

        loss = loss_fn(outputs, targets)
        loss.backward()
        optimizer.step()
        scheduler.step()

# Evaluating the model on the validation dataset (no weights are updated)
def eval_fn(data_loader, model, device):
    model.eval()
    fin_targets = []
    fin_outputs = []
    with torch.no_grad():
        for bi, d in tqdm(enumerate(data_loader), total=len(data_loader)):
            ids = d["ids"]
            token_type_ids = d["token_type_ids"]
            mask = d["mask"]
            targets = d["targets"]

            ids = ids.to(device, dtype=torch.long)
            token_type_ids = token_type_ids.to(device, dtype=torch.long)
            mask = mask.to(device, dtype=torch.long)
            targets = targets.to(device, dtype=torch.float)

            outputs = model(ids=ids, mask=mask, token_type_ids=token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets
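
The loss and the evaluation step are two views of the same quantity: the model emits a raw logit, `BCEWithLogitsLoss` applies the sigmoid internally, and `eval_fn` applies `torch.sigmoid` explicitly to obtain a probability. A minimal pure-Python sketch of the per-example loss, in its numerically stable form, is given below. The function names are hypothetical and the code is meant for intuition only, not as a replacement for the PyTorch implementation.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

for logit, target in [(2.0, 1.0), (2.0, 0.0), (0.0, 1.0)]:
    # Textbook form, for comparison with the stable form
    p = sigmoid(logit)
    naive = -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))
    print(logit, target, round(bce_with_logits(logit, target), 4), round(naive, 4))
```

Since `sigmoid(0)` is 0.5, thresholding the probability at 0.5 is equivalent to thresholding the logit at 0.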

In [15]:
# Training the model and keeping the checkpoint with the best validation accuracy.

best_accuracy = 0
#for epoch in range(EPOCHS):
for epoch in range(2):
    train_fn(train_data_loader, model, optimizer, device, scheduler)
    outputs, targets = eval_fn(valid_data_loader, model, device)
    outputs = np.array(outputs) >= 0.5
    accuracy = metrics.accuracy_score(targets, outputs)
    f1score = metrics.f1_score(targets, outputs)
    roc_auc_scr = metrics.roc_auc_score(targets, outputs)
    print(f"Accuracy Score = {accuracy}")
    print(f"roc auc Score = {roc_auc_scr}")
    print(f"F1 Score = {f1score}")
    if accuracy > best_accuracy:
        torch.save(model.state_dict(), MODEL_PATH)
        best_accuracy = accuracy


Accuracy Score = 0.8451443569553806
roc auc Score = 0.8393124538648107
F1 Score = 0.8156249999999999



Accuracy Score = 0.8162729658792651
roc auc Score = 0.8121269640409153
F1 Score = 0.7852760736196318





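### The evaluation metrics are:

1. Accuracy Score = 0.8162729658792651
2. ROC AUC Score = 0.8121269640409153
3. F1 Score = 0.7852760736196318

These are the validation metrics after the second epoch. The model reaches an accuracy of about 82% and, more importantly, decent ROC AUC and F1 scores, which matter because we want to capture as many disaster tweets as possible. Note that the first epoch scored higher (accuracy 0.845, F1 0.816), so the checkpoint saved to `MODEL_PATH` is the one from the first epoch.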
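
One caveat when reading the ROC AUC figure: in the training cell, `roc_auc_score` receives the predictions after they have been thresholded at 0.5, so it reflects a single operating point rather than the ranking quality of the predicted probabilities. The sketch below illustrates the difference on made-up scores; the data and the helper `pairwise_auc` are hypothetical and are not taken from the notebook's run.

```python
def pairwise_auc(targets, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(targets, scores) if t == 1]
    neg = [s for t, s in zip(targets, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

targets = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.6, 0.4, 0.45, 0.2, 0.1]   # made-up probabilities
hard = [int(p >= 0.5) for p in probs]     # thresholded predictions

print(pairwise_auc(targets, probs))  # AUC on probabilities
print(pairwise_auc(targets, hard))   # AUC on thresholded predictions
```

On this toy data the two values differ (8/9 versus 7.5/9), because thresholding turns many distinct scores into ties. Passing the probabilities to the metric would be the more common choice when the ranking itself is of interest.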

### Future work:

1. Implement a prediction pipeline in which an unseen data file is uploaded through a REST API and the predictions are returned immediately.
2. Compare the performance of this classifier with other masked language models.