# Real or Not? NLP with Disaster Tweets
## Predict which Tweets are about real disasters and which ones are not
Competition Description

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:


Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480


The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand-classified.



# $\color{blue}{\text{Summary of main results}}$

## - We fine-tune a BERT model for this task

## - We take the output of the [CLS] token from all layers of the BERT model and average them for better performance

## - Cross-validation (CV) scores are obtained using 5-fold CV

## - Accuracies of up to 85.5% and F1 scores of up to 0.85 can be obtained using this approach

## - To run this notebook locally, change the paths of the data files and the pretrained BERT model files accordingly
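
The layer-averaging idea in the summary can be illustrated without loading BERT. The sketch below uses random NumPy arrays as stand-ins for the per-layer hidden states; the shapes and the helper name `mean_cls_over_layers` are assumptions made for illustration, and the notebook itself performs the equivalent step with `torch.stack` inside the model's `forward`.

```python
import numpy as np

def mean_cls_over_layers(hidden_states):
    """Average the first-token ([CLS]) vector over all layers.

    hidden_states: sequence of arrays shaped (batch, seq_len, hidden),
    one per layer (an illustrative stand-in for BERT's hidden states).
    Returns an array shaped (batch, hidden).
    """
    # Position 0 of every sequence holds the [CLS] token.
    cls_per_layer = np.stack([layer[:, 0, :] for layer in hidden_states], axis=2)
    # cls_per_layer has shape (batch, hidden, n_layers); average over layers.
    return cls_per_layer.mean(axis=2)

rng = np.random.default_rng(0)
# 13 stand-in layers (embedding output + 12 encoder layers for bert-base),
# batch of 4, sequence length 84, hidden size 768.
fake_hidden_states = [rng.normal(size=(4, 84, 768)) for _ in range(13)]
pooled = mean_cls_over_layers(fake_hidden_states)
print(pooled.shape)  # (4, 768)
```

The result has one vector per example, which is the shape a final `nn.Linear(hidden_size, 1)` layer expects.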

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-cased-vocab.txt
/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-multilingual-cased-vocab.txt
/kaggle/input/pretrained-bert-models-for-pytorch/bert-large-uncased-vocab.txt
/kaggle/input/pretrained-bert-models-for-pytorch/bert-large-cased-vocab.txt
/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-uncased-vocab.txt
/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-multilingual-uncased-vocab.txt
/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-chinese-vocab.txt
/kaggle/input/pretrained-bert-models-for-pytorch/bert-large-uncased/bert_config.json
/kaggle/input/pretrained-bert-models-for-pytorch/bert-large-uncased/pytorch_model.bin
/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-uncased/bert_config.json
/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-uncased/pytorch_model.bin
/kaggle/input/pretrained-bert-models-for-pytorch/bert-base-cased/bert_config.json
/kaggle/input/pre

In [2]:
from transformers import BertTokenizer, BertModel, BertConfig

In [3]:
train_df_data = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv').fillna('')

In [4]:
import torch

In [5]:
tokenizer = BertTokenizer.from_pretrained('../input/pretrained-bert-models-for-pytorch/bert-base-uncased-vocab.txt',
                                         do_lower_case=True)

In [6]:
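def numericalize(input_list, m_len):
    """Tokenize each text and return (input_ids, token_type_ids, attention_mask),
    each a tensor of shape [len(input_list), m_len]."""
    output_list = []
    output_list_token_type_ids = []
    output_list_attention_mask = []
    for row in input_list:
        # Note: recent transformers releases deprecate these two arguments in
        # favour of truncation=True and padding='max_length'.
        temp = tokenizer.encode_plus(row, max_length=m_len,
                                     truncation_strategy='longest_first',
                                     pad_to_max_length=True, return_tensors='pt')
        output_list.append(temp['input_ids'])
        output_list_token_type_ids.append(temp['token_type_ids'])
        output_list_attention_mask.append(temp['attention_mask'])
    # Each encoded row has shape [1, m_len]; squeeze only that dimension so a
    # single-row input still yields a 2-D tensor.
    output_tensor = torch.stack(output_list).squeeze(1)
    output_tensor_ids = torch.stack(output_list_token_type_ids).squeeze(1)
    output_tensor_mask = torch.stack(output_list_attention_mask).squeeze(1)
    return output_tensor, output_tensor_ids, output_tensor_mask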
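
To make the three returned tensors concrete, here is a toy, pure-Python sketch of what padding and truncating to `m_len` produces for a single-sentence input. The whitespace tokenizer, the tiny vocabulary, and the helper `toy_encode` are hypothetical stand-ins, not the `BertTokenizer` API; only the layout of the three outputs mirrors what `encode_plus` returns.

```python
# Hypothetical toy vocabulary; real BERT uses a much larger WordPiece vocab.
TOY_VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "forest": 4, "fire": 5, "near": 6, "town": 7}

def toy_encode(text, m_len):
    """Return (input_ids, token_type_ids, attention_mask), each of length m_len."""
    words = text.lower().split()
    # Reserve two slots for [CLS] and [SEP], then truncate the words.
    words = words[: m_len - 2]
    tokens = ["[CLS]"] + words + ["[SEP]"]
    input_ids = [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(input_ids)
    n_pad = m_len - len(input_ids)
    input_ids += [TOY_VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad
    # Single-sentence input: every position belongs to segment 0.
    token_type_ids = [0] * m_len
    return input_ids, token_type_ids, attention_mask

ids, type_ids, mask = toy_encode("Forest fire near town", m_len=8)
print(ids)       # [2, 4, 5, 6, 7, 3, 0, 0]
print(type_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 0, 0]
```

Because tweets are single sentences, the segment ids are all zero, while the attention mask is what distinguishes real tokens from padding.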

In [7]:
from torch.utils.data import TensorDataset, DataLoader

In [8]:
train_on_gpu=torch.cuda.is_available()

In [9]:
import torch.nn as nn
bert_model_config = '../input/pretrained-bert-models-for-pytorch/bert-base-uncased/bert_config.json'
bert_config = BertConfig.from_json_file(bert_model_config)
bert_config.output_hidden_states=True
bert_model = BertModel.from_pretrained('../input/pretrained-bert-models-for-pytorch/bert-base-uncased/', config = bert_config)        

class reornot(nn.Module):
    
    def __init__(self):
        super(reornot, self).__init__()
        self.bert = bert_model
        for param in self.bert.parameters():
            param.requires_grad = True
        self.fc1 = nn.Linear(bert_config.hidden_size, 1) 
        self.sigmoid = nn.Sigmoid()
        
        
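    def forward(self, tokens, ids, masks):
        # Pass by keyword: BertModel.forward takes attention_mask before
        # token_type_ids, so positional (tokens, ids, masks) would swap them.
        h = self.bert(tokens, attention_mask=masks, token_type_ids=ids)[2]
        # h holds the embedding output plus one hidden state per encoder layer;
        # take the [CLS] position (index 0) from each and average over layers.
        cls_outs = torch.stack([layer[:,0,:] for layer in h], dim = 2)
        cls_output = cls_outs.mean(2)
        x = self.fc1(cls_output)
        output = self.sigmoid(x)
        return output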

In [10]:
def accuracy(y_actual, y_pred):
    y_ = np.round(np.array(y_pred))
    return np.sum(y_actual == y_) / y_actual.shape[0]
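
One thing to keep in mind when this helper is applied batch by batch: averaging per-batch accuracies weights a small final batch the same as a full one, so the result can differ slightly from the accuracy over all examples. The sketch below illustrates this with made-up labels and probabilities (every value is invented for illustration).

```python
import numpy as np

def accuracy(y_actual, y_pred):
    # Same rule as the notebook helper: round probabilities to 0/1 and compare.
    y_ = np.round(np.array(y_pred))
    return np.sum(y_actual == y_) / y_actual.shape[0]

# Made-up labels and predicted probabilities, split into batches of 4, 4 and 2.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.8, 0.7, 0.1, 0.3, 0.6, 0.4, 0.2, 0.9])
batches = [slice(0, 4), slice(4, 8), slice(8, 10)]

per_batch = [float(accuracy(y_true[b], y_prob[b])) for b in batches]
mean_of_batches = float(np.mean(per_batch))
overall = float(accuracy(y_true, y_prob))

print(per_batch)        # [1.0, 1.0, 0.0]
print(mean_of_batches)  # about 0.667
print(overall)          # 0.8
```

With equal-sized batches the two numbers coincide; the gap only appears when the last batch is smaller than the rest.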

In [11]:
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
m_len = 84
n=5
seed = 1072
kf = KFold(n_splits=n, random_state=seed, shuffle=True)
net = reornot()
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
net = net.to(device)
model_name = 'init.net'
checkpoint = {'state_dict': net.state_dict()}
with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)
res_acc = []
res_f1 = []
f1_s_old = 0.0
for train_index, val_index in kf.split(train_df_data):
    train_df = train_df_data.iloc[train_index]
    val_df = train_df_data.iloc[val_index]
    tweets_train = train_df['text'].values
    tweets_val = val_df['text'].values
    y_train = train_df['target'].values
    y_val = val_df['target'].values
    tweets_num, ids, masks = numericalize(tweets_train, m_len)
    tweets_num_val, val_ids, val_masks = numericalize(tweets_val, m_len)
    train_data = TensorDataset(tweets_num, ids, masks, torch.from_numpy(y_train))
    val_data = TensorDataset(tweets_num_val, val_ids, val_masks, torch.from_numpy(y_val))
    train_bs = 64
    train_loader = DataLoader(train_data, shuffle = True, batch_size=train_bs)
    valid_loader = DataLoader(val_data, shuffle = False, batch_size=train_bs)  # order is irrelevant for evaluation
    with open('init.net', 'rb') as f:
        checkpoint = torch.load(f)
        net.load_state_dict(checkpoint['state_dict'])
    criterion = nn.BCELoss()
    opt = torch.optim.Adam(net.parameters(), lr=0.00001)
    net.train()
    EPOCHS = 3
    loss_vs_epoch = []
    valloss_vs_epoch = []
    for epoch in range(EPOCHS):
        losses = []
        for i, (tokens, ids, masks, target) in enumerate(train_loader):
            y_pred = net(tokens.long().to(device), ids.long().to(device), masks.long().to(device))
            loss = criterion(y_pred, target[:, None].float().to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())
            print('\rEpoch: %d/%d, %f%% loss: %0.2f'% (epoch+1, EPOCHS, i/len(train_loader)*100, loss.item()), end='')
        print(' Average loss:', np.mean(losses))
    val_losses = []
    net.eval()
    avg_acc = 0
    preds = []
    originals = []
    with torch.no_grad():  # validation only, so skip gradient tracking
        for i, (tokens, ids, masks, target) in enumerate(valid_loader):
            y_pred = net(tokens.long().to(device), ids.long().to(device), masks.long().to(device))
            loss = criterion(y_pred,  target[:, None].float().to(device))
            acc = accuracy(target.cpu().numpy(), y_pred.detach().cpu().numpy().squeeze())
            avg_acc += acc
            val_losses.append(loss.item())
            preds.append(y_pred.cpu().detach().numpy())
            originals.append(target.cpu().detach().numpy())
            print('\r%0.2f%% loss: %0.2f, accuracy %0.2f'% (i/len(valid_loader)*100, loss.item(), acc), end='')
    print(' Average val loss:', np.mean(val_losses))    
    print('\nAverage accuracy: ', avg_acc / len(valid_loader))
    res_acc.append(avg_acc / len(valid_loader))
    f1_s = f1_score(np.concatenate(originals).squeeze(), np.round(np.concatenate(preds)).squeeze(), average='macro')
    print('\nF1 Score',f1_s)
    res_f1.append(f1_s)
    print()
    if f1_s > f1_s_old:
        f1_s_old = f1_s
        model_name = 'best_model.net'
        checkpoint = {'state_dict': net.state_dict()}
        with open(model_name, 'wb') as f:
            torch.save(checkpoint, f)         
for i, result in enumerate(res_acc, 1):
    print(f"Fold-{i} accuracy: {result}")
for i, result in enumerate(res_f1, 1):
    print(f"Fold-{i} F1: {result}")
print(f"{n}-fold CV accuracy result: Mean: {np.mean(res_acc)} Standard deviation:{np.std(res_acc)}")
print(f"{n}-fold CV F1 result: Mean: {np.mean(res_f1)} Standard deviation:{np.std(res_f1)}")    

Epoch: 1/3, 98.958333% loss: 0.67 Average loss: 0.5003274992729226
Epoch: 2/3, 98.958333% loss: 0.26 Average loss: 0.37597498918573063
Epoch: 3/3, 98.958333% loss: 0.23 Average loss: 0.33007651086275774
95.83% loss: 0.39, accuracy 0.86 Average val loss: 0.41958513110876083

Average accuracy:  0.8315206290849674

F1 Score 0.826426277808598

Epoch: 1/3, 98.958333% loss: 0.35 Average loss: 0.4987368894120057
Epoch: 2/3, 98.958333% loss: 0.44 Average loss: 0.37824069429188967
Epoch: 3/3, 98.958333% loss: 0.34 Average loss: 0.32758076426883537
95.83% loss: 0.43, accuracy 0.80 Average val loss: 0.41850362718105316

Average accuracy:  0.8310227736928105

F1 Score 0.8265041389169003

Epoch: 1/3, 98.958333% loss: 0.40 Average loss: 0.4985437532886863
Epoch: 2/3, 98.958333% loss: 0.30 Average loss: 0.3841154268011451
Epoch: 3/3, 98.958333% loss: 0.10 Average loss: 0.3356547951698303
95.83% loss: 0.32, accuracy 0.86 Average val loss: 0.38611518157025176

Average accuracy:  0.839984170751634

F1 S

In [12]:
with open('best_model.net', 'rb') as f:
    checkpoint = torch.load(f)
loaded = reornot()
loaded.load_state_dict(checkpoint['state_dict'])

<All keys matched successfully>

In [13]:
test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv').fillna('')
tweets_test = test_df['text'].values
submission = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

In [14]:
tweets_num, ids, masks = numericalize(tweets_test, m_len)
test_data = TensorDataset(tweets_num, ids, masks)
test_bs = 2
test_loader = DataLoader(test_data, shuffle = False, batch_size=test_bs)

In [15]:
loaded = loaded.to(device)
loaded.eval()
preds = []
with torch.no_grad():  # inference only, so skip gradient tracking
    for i, (tokens, ids, masks) in enumerate(test_loader):
        y_pred = loaded(tokens.long().to(device), ids.long().to(device), masks.long().to(device))
        preds.append(y_pred.cpu().detach())

In [16]:
submit_list = []
for i in range(len(test_loader)):
    submit_list.append(preds[i].numpy())

In [17]:
nsub = np.concatenate(submit_list)

In [18]:
# nsub has shape (N, 1); flatten it to one 0/1 label per row
submission['target'] = nsub.round().astype(int).ravel()
submission.to_csv('submission.csv', index=False)