# Real or Not? NLP with Disaster Tweets

## 1. Introduction

In this notebook, we train a neural network, and also Google's BERT model, on our dataset.
Below is a list of approaches that we will try.
1. Word embeddings using Bag of Words
2. Word embeddings using TF-IDF
3. Word embeddings using GloVe
4. Confusion matrices on the validation set for the 2 trained models

## 2. Import libraries
The code below imports the necessary Python libraries.

In [1]:
import keras
from keras.models import Sequential
from keras.initializers import Constant
from keras.layers import (LSTM,
                          Embedding,
                          BatchNormalization,
                          Dense,
                          TimeDistributed,
                          Dropout,
                          Bidirectional,
                          Flatten,
                          GlobalMaxPool1D)

from nltk.tokenize import word_tokenize
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from keras.optimizers import Adam


import pandas as pd
import numpy as np

from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    classification_report,
    accuracy_score
)

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


## 3. Load data
The code below loads the data and prints the shapes of the train and test sets.

In [2]:
dataset = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
submission = pd.read_csv('./sample_submission.csv')


print('There are {} rows and {} columns in train'.format(dataset.shape[0],dataset.shape[1]))
print('There are {} rows and {} columns in test'.format(test.shape[0],test.shape[1]))

train = dataset.text.values
test = test.text.values
sentiments = dataset.target.values

There are 7613 rows and 5 columns in train
There are 3263 rows and 4 columns in test


## 4. Word embedding: transformation from words to vectors
The challenge with textual data is that it has to be represented in a form that can be used mathematically. In simple words, we need a numeric representation of each word: first an integer index, then a vector of real numbers.

GloVe stands for Global Vectors for Word Representation. It maps each word in a corpus to a point in a high-dimensional vector space, positioned so that words used in similar contexts end up close together.
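A minimal sketch of what "close together" means in practice, using cosine similarity. The three vectors are made up for illustration and are not taken from any GloVe file; real vectors have many more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-dimensional vectors (assumption, for illustration only).
toy_vectors = {
    "fire":  [0.9, 0.8, 0.1],
    "flame": [0.8, 0.9, 0.2],
    "pizza": [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_vectors["fire"], toy_vectors["flame"])
sim_unrelated = cosine_similarity(toy_vectors["fire"], toy_vectors["pizza"])

# Related words should score higher than unrelated ones.
print(sim_related > sim_unrelated)
```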

The first task is to download pre-trained word vectors, which are available in several dimensionalities (for example 50, 100 and 200). The code below uses the 200-dimensional vectors trained on Twitter data. Before we load the vectors in code, we have to understand how the text file is formatted.
Each line of the text file contains a word followed by N numbers. The N numbers are the components of that word's vector. N depends on which file is downloaded; for us, N is 200, since we are using glove.twitter.27B.200d.
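The line format described above can be parsed with a few lines of Python. This is a sketch only: `parse_glove_line` is a hypothetical helper, and the sample line is invented and shortened to 4 values for readability.

```python
def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector)."""
    word, *values = line.rstrip().split(" ")
    return word, [float(v) for v in values]

# Invented sample line; a real line carries N values after the word.
sample_line = "fire 0.12 -0.45 0.33 0.08\n"
word, vector = parse_glove_line(sample_line)
print(word, len(vector))
```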

The code below performs these tasks:
1. Build a dictionary of the unique words in the training corpus with the Keras Tokenizer.
2. Split each tweet into words with word_tokenize to find the length of the longest tweet (72 tokens).
3. Convert every tweet in Train and Test to a sequence of word indices, padded with zeros to 72 entries.
4. Parse the GloVe vectors file into a dictionary of word vectors, and copy the vectors of known words into an embedding matrix.
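To make the index-and-pad step concrete, here is a simplified pure-Python stand-in. It is not the Keras implementation: `build_word_index` and `to_padded_sequences` are hypothetical helpers, the tokenization is a plain whitespace split, and truncation simply keeps the first `max_len` tokens.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent first; 0 is reserved for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index, max_len):
    """Convert texts to integer sequences and pad with zeros at the end ('post')."""
    padded = []
    for text in texts:
        seq = [word_index[w] for w in text.lower().split() if w in word_index]
        seq = seq[:max_len]  # simplification: keep the first max_len tokens
        padded.append(seq + [0] * (max_len - len(seq)))
    return padded

# Made-up mini corpus (assumption, for illustration only).
toy_texts = ["forest fire near town", "fire alarm"]
word_index = build_word_index(toy_texts)
padded = to_padded_sequences(toy_texts, word_index, max_len=5)
print(word_index)
print(padded)
```

Every row has the same length, which is what lets the tweets be stacked into a single 2-D array.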

Notice the Train shape printed below: it has changed from 5 columns to 72 columns, one per token position.

In [3]:
def embed(corpus):
    """Convert texts to sequences of word indices."""
    return word_tokenizer.texts_to_sequences(corpus)


word_tokenizer = Tokenizer()
word_tokenizer.fit_on_texts(train)
vocab_length = len(word_tokenizer.word_index) + 1

longest_train = max(train, key=lambda sentence: len(word_tokenize(sentence)))
length_long_sentence = len(word_tokenize(longest_train))
print("number of words in the longest sentence in Train={}".format(length_long_sentence))


padded_sentences = pad_sequences(embed(train), length_long_sentence, padding='post')
test_sentences = pad_sequences( embed(test), length_long_sentence, padding='post')

# Load the pre-trained Twitter GloVe vectors into a word -> vector dictionary.

embeddings_dictionary = dict()
embedding_dim = 200
glove_path = './glove-twitter/glove.twitter.27B.' + str(embedding_dim) + 'd.txt'

# The context manager closes the file even if parsing fails.
with open(glove_path, encoding="utf8") as glove_file:
    for line in glove_file:
        records = line.split()
        word = records[0]
        vector_dimensions = np.asarray(records[1:], dtype='float32')
        embeddings_dictionary[word] = vector_dimensions

embedding_matrix = np.zeros((vocab_length, embedding_dim))

for word, index in word_tokenizer.word_index.items():
    if index >= vocab_length:
        continue
    embedding_vector = embeddings_dictionary.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
    
print('There are {} rows and {} columns in train'.format(padded_sentences.shape[0],padded_sentences.shape[1]))

number of words in the longest sentence in Train=72
There are 7613 rows and 72 columns in train
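The matrix-building loop can be seen in miniature below. Row `i` of the matrix holds the vector of the word whose tokenizer index is `i`; words missing from the pre-trained vectors keep an all-zero row. All names and values here are toy stand-ins, not the notebook's data.

```python
# Toy stand-ins for the notebook's objects (all values are made up).
toy_word_index = {"fire": 1, "alarm": 2, "zzzqx": 3}        # tokenizer word -> index
toy_embeddings = {"fire": [0.1, 0.2], "alarm": [0.3, 0.4]}  # pre-trained word -> vector
dim = 2

# Row 0 is reserved for padding; every row starts as zeros.
matrix = [[0.0] * dim for _ in range(len(toy_word_index) + 1)]
for word, index in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[index] = list(vector)

# Count how many vocabulary words actually received a pre-trained vector.
covered = sum(1 for w in toy_word_index if w in toy_embeddings)
print(matrix)
print("coverage: {}/{}".format(covered, len(toy_word_index)))
```

Running the same coverage count on the real vocabulary would show how many tweet tokens are left with zero vectors.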


## 5. Train Bidirectional LSTM Model with pre-trained GloVe word embeddings
In this section we build 5 networks and train each of them using the GloVe embeddings as input. All the networks share the same architecture, which is pictured below.
<img src="./images/blstm.png" width="700" height="900">

In [None]:
def BLSTM():
    model = Sequential()
    model.add(Embedding(input_dim=embedding_matrix.shape[0],
                        output_dim=embedding_matrix.shape[1],
                        weights=[embedding_matrix],
                        input_length=length_long_sentence,
                        trainable=False))

    model.add(Bidirectional(LSTM(length_long_sentence, return_sequences=True, recurrent_dropout=0.2)))
    model.add(GlobalMaxPool1D())
    model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(Dense(length_long_sentence, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(length_long_sentence, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])

    model.summary()
    return model


for idx in range(5):
    print("*" * 20 + '\nModel: ' + str(idx) + '\n')

    # Note: min_lr equals RMSprop's default learning rate (0.001), so this
    # callback cannot lower the rate any further.
    reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, verbose=1, patience=5, min_lr=0.001)

    # Keep only the weights of the epoch with the lowest validation loss.
    checkpoint = ModelCheckpoint('./models/model_' + str(idx) + '.h5', monitor='val_loss', mode='auto', verbose=1,
                                 save_weights_only=True, save_best_only=True)

    # A fresh random 50/50 train/validation split for every model.
    X_train, X_val, y_train, y_val = train_test_split(padded_sentences, sentiments, test_size=0.5)

    model = BLSTM()
    model.fit(X_train, y_train, batch_size=32, epochs=15, validation_data=(X_val, y_val),
              callbacks=[reduce_lr, checkpoint], verbose=1)

from glob import glob
import scipy.stats

x_models = []
labels = []

********************
Model: 0

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 72, 200)           4540200   
_________________________________________________________________
bidirectional_2 (Bidirection (None, 72, 144)           157248    
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 144)               0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 144)               576       
_________________________________________________________________
dropout_4 (Dropout)          (None, 144)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 72)                10440     
_________________________________________________________________
dropout_5 (Dropout)          (None, 72)      


Epoch 00008: val_loss did not improve from 0.42355
Epoch 9/15

Epoch 00009: val_loss did not improve from 0.42355
Epoch 10/15

Epoch 00010: ReduceLROnPlateau reducing learning rate to 0.001.

Epoch 00010: val_loss did not improve from 0.42355
Epoch 11/15

Epoch 00011: val_loss did not improve from 0.42355
Epoch 12/15

Epoch 00012: val_loss did not improve from 0.42355
Epoch 13/15

Epoch 00013: val_loss did not improve from 0.42355
Epoch 14/15

Epoch 00014: val_loss did not improve from 0.42355
Epoch 15/15

Epoch 00015: ReduceLROnPlateau reducing learning rate to 0.001.

Epoch 00015: val_loss did not improve from 0.42355
********************
Model: 2

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 72, 200)           4540200   
_________________________________________________________________
bidirectional_4 (Bidirection (None, 72, 144)           157248    
__________
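The parameter counts in the summaries above can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard formulas for LSTM, batch-normalization and dense layers; the vocabulary size is inferred from the embedding parameter count rather than read from the tokenizer.

```python
embedding_dim = 200   # width of the GloVe vectors, from the summary
units = 72            # LSTM units, equal to length_long_sentence

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

bilstm = 2 * lstm_params(embedding_dim, units)   # forward + backward LSTM
batch_norm = 4 * (2 * units)                     # gamma, beta, moving mean, moving variance
dense = (2 * units) * units + units              # weights + biases of the first Dense layer
vocab_rows = 4540200 // embedding_dim            # rows of the embedding matrix

print(bilstm, batch_norm, dense, vocab_rows)
# 157248 576 10440 22701
```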

We will now predict on the test set and create the submission file to upload to Kaggle. We predict classes with each of the 5 trained models and take the mode (the majority vote) for each tweet.
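Taking the mode across models is a majority vote per sample. A minimal pure-Python sketch with made-up predictions (`majority_vote` is a hypothetical helper, not part of the notebook):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one list of labels per model; returns the most common label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]

# Made-up predictions from 5 models on 4 samples.
toy_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
print(majority_vote(toy_predictions))
# [1, 0, 1, 0]
```

With an odd number of models and binary labels, a tie is impossible, so the vote is always well defined.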

In [None]:

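# Rebuild the architecture and load each checkpoint that
# ModelCheckpoint saved under ./models/ during training.
for weights_path in glob('./models/*.h5'):
    model = BLSTM()
    model.load_weights(weights_path)
    x_models.append(model)

# Threshold the sigmoid output at 0.5 to get class labels
# (what predict_classes does for a single sigmoid unit).
for model in x_models:
    preds = (model.predict(test_sentences) > 0.5).astype('int32')
    labels.append(preds)

# Majority vote across the models, per test tweet.
labels = scipy.stats.mode(labels)[0]
labels = np.squeeze(labels)

submission.target = labels
submission.to_csv("glovetwitter.bilstmsample.submission.csv", index=False)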