<p style="background-color:skyblue;font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:350%; text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px;">Natural Language Processing with Disaster Tweets</p>
<img src='https://images.unsplash.com/photo-1475776408506-9a5371e7a068?ixlib=rb-1.2.1&ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&auto=format&fit=crop&w=954&q=80' style='width:500px; border-radius: 100px 0px;'/>
<br/>

## This notebook covers EDA and data cleaning, followed by an LSTM model implemented in Keras and, finally, a concise implementation of a BERT model using ktrain.

# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-align:center; border-radius: 15px 50px; padding: 2px">Table of Contents</p>

* [1. Importing Libraries](#1)
* [2. Loading the Dataset](#2)
* [3. Data Visualization](#3)
    * [3.1 Distribution of Target Values](#3.1)
    * [3.2 Length of tweets v/s Number of tweets](#3.2)
    * [3.3 Number of Words v/s Number of Tweets](#3.3)
* [4. Missing Values Analysis](#4)
* [5. Outliers Analysis](#5)
* [6. Data Preprocessing](#6)
    * [6.1 Removing URL & Special Characters](#6.1)
    * [6.2 Tokenization](#6.2)
    * [6.3 Glove Vector Embeddings](#6.3)
* [7. LSTM Model](#7)
    * [7.1 Model Creation](#7.1)
    * [7.2 Model Training](#7.2)
    * [7.3 Prediction & Evaluation](#7.3)
* [8. Bert Model](#8)
    * [8.1 Text Preprocessing for Bert](#8.1)
    * [8.2 Model Creation](#8.2)
    * [8.3 Model Training](#8.3)
    * [8.4 Prediction & Evaluation](#8.4)

<a id='1'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-align:center; border-radius: 15px 50px; padding: 2px">Loading the Libraries</p>

In [None]:
!pip install ktrain

In [None]:
import numpy as np
import pandas as pd
import re
from collections import Counter
from matplotlib import pyplot as plt
import seaborn as sns
import spacy # For tokenization, lemmatization, removing stop words & punctuations.
from keras.layers import LSTM, Dense, Dropout, Activation # Layers of keras used in LSTM Model.
from keras.models import Sequential # Sequential Neural Network
from keras.callbacks import EarlyStopping # Early Stopping Callback in the NN
import tensorflow as tf
import ktrain # For Bert Model Implementation.
from ktrain import text # Preprocessing text for the Bert Model.
sns.set()
nlp = spacy.load("en_core_web_sm",disable=["tagger", "parser", "ner"])

<a id='2'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px; padding: 2px">Loading the Dataset</p>

In [None]:
train = pd.read_csv('../input/nlp-getting-started/train.csv') # Training Data
test = pd.read_csv('../input/nlp-getting-started/test.csv') # Testing Data
train_len = len(train)
test_len = len(test)
print('Training Dataset:',train_len)
print('Testing Dataset:',test_len)
train.head()

<a id='3'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px; padding: 2px">Data Visualization</p>

<a id='3.1'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">    Distribution of Target Values</p>


In [None]:
plt.figure(figsize=(15,5))
plt.subplot(1, 2, 1)
ax = sns.countplot(x='target',data=train)
ax.set_xticklabels(['Not a Disaster','Disaster']) # countplot orders the classes as 0, 1.
plt.suptitle("Distribution of Target Values")
terms = np.array(['Disaster', 'Not a Disaster'])
weightage = np.array([len(train[train['target'] == 1]),len(train[train['target'] == 0])])
plt.subplot(1, 2, 2)
plt.pie(weightage,labels=terms, autopct="%1.1f%%")
plt.title('target')
plt.show()

<a id='3.2'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:;  border-radius: 10px 20px; padding: 2px">Length of tweets v/s Number of tweets</p>

In [None]:
plt.figure(figsize=(10,5))
char_len = train['text'].map(len) # Number of characters in each tweet.
plt.hist(char_len)
plt.xlabel('Length of tweets')
plt.ylabel('Number of tweets')
plt.title('Length of tweets v/s Number of tweets')
plt.show()

<a id='3.3'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Number of Words v/s Number of Tweets</p>

In [None]:
plt.figure(figsize=(10,5))
word_len = train['text'].str.split().map(lambda x: len(x))
plt.hist(word_len)
plt.xlabel('Number of Words')
plt.ylabel('Number of tweets')
plt.title('Number of Words v/s Number of Tweets')
plt.show()

<a id='4'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px; padding: 2px">Missing Values Analysis</p>

In [None]:
def missing_val_analysis(data):
    missing_values = data.isnull().sum()
    missing_values = missing_values[missing_values > 0].sort_values(ascending = False)
    missing_values_data = pd.DataFrame(missing_values)
    missing_values_data.reset_index(level=0, inplace=True)
    missing_values_data.columns = ['Feature','Number of Missing Values']
    missing_values_data['Percentage of Missing Values'] = (100.0*missing_values_data['Number of Missing Values'])/len(data)
    return missing_values_data

In [None]:
missing_val_analysis(train) # Missing value analysis in the training data.

### Since very few rows have a missing keyword, we fill the missing values with an empty string, append the keyword to the end of the text, and drop the column.
### Since location is unlikely to help the model's predictions, we drop that column for now.

In [None]:
train['keyword'] = train['keyword'].fillna('') # Assign back instead of relying on inplace on a column.
train['text'] = train['text'] + ' ' + train['keyword']
train['text'] = train['text'].apply(lambda x: x.strip())
train.drop(['keyword'],axis=1,inplace=True)
train.drop(['location'],axis=1,inplace=True)
train.head()

In [None]:
missing_val_analysis(test)

### We take the same action here as for the training dataset.

In [None]:
test['keyword'] = test['keyword'].fillna('') # Assign back instead of relying on inplace on a column.
test['text'] = test['text'] + ' ' + test['keyword']
test['text'] = test['text'].apply(lambda x: x.strip())
test.drop(['keyword'],axis=1,inplace=True)
test.drop(['location'],axis=1,inplace=True)
test.head()

<a id='5'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px; padding: 2px">Outliers Analysis</p>

### A careful look at the training dataset shows multiple records with the same text, and for some of these texts the target labels contradict each other. We therefore analyze these cases and remove the duplicate records along with the outliers.

In [None]:
duplicate_records = train[train.duplicated(['text','target'],keep=False)] # Duplicate records with same targets.
print('Records having same text and targets:',len(duplicate_records))
duplicate_records.head()

In [None]:
train.drop_duplicates(['text','target'],inplace=True) # Dropping the duplicate records having same targets.

In [None]:
contradicting_records = train[train.duplicated(['text'],keep=False)] # Remaining duplicates share the text but differ in target.
print('Records having same text but different targets:',len(contradicting_records))
contradicting_records.head()

### Since there are very few such records, a list containing the indices of all outliers to be removed was created by manual inspection.

In [None]:
records_to_drop = [610,2832,3243,3985,4244,4232,4292,4305,4306,4312,4320,4381,4618,5620,6091,6616] # Outliers.

train.drop(records_to_drop,inplace=True) # Dropping the outliers.
train = train.reset_index(drop=True) # Resetting the indexes.
train.head()
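
The manual inspection above can be cross-checked programmatically. The following is a minimal sketch on a toy DataFrame (the rows and the names `toy`, `labels_per_text` and `rows_to_review` are invented for illustration and are not taken from the competition data). It flags every text that appears with more than one distinct target; deciding which of the flagged rows to drop still requires a human look.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'text': ['fire downtown', 'fire downtown', 'nice day', 'nice day', 'flood alert'],
    'target': [1, 0, 0, 0, 1],
})

# Number of distinct targets seen for each text.
labels_per_text = toy.groupby('text')['target'].nunique()

# Texts labelled both 0 and 1 are contradictory.
contradicting_texts = labels_per_text[labels_per_text > 1].index

# Indices of all rows whose text is contradictory.
rows_to_review = toy.index[toy['text'].isin(contradicting_texts)].tolist()
print(rows_to_review)
```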

<a id='6'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px; padding: 2px">Data Preprocessing</p>

<a id='6.1'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Removing URL & Special Characters</p>

In [None]:
def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+','',text)
def remove_char(text):
    return re.sub(r'[^A-Za-z0-9 ]+', '', text)

def remove_preprocess(text):
    return remove_char(remove_url(text))

In [None]:
train['text'] = train['text'].apply(remove_preprocess)
test['text'] = test['text'].apply(remove_preprocess)
train.head()
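
To see what the two regular expressions do, here is a small sketch that applies the same patterns to a sample string invented for illustration. Note that removing a URL leaves two consecutive spaces behind; the optional `collapsed` step (an addition, not part of the original cleaning) shows one way to normalize them.

```python
import re

def remove_url(text):
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_char(text):
    return re.sub(r'[^A-Za-z0-9 ]+', '', text)

# Sample string invented for illustration.
sample = "Flood warning for downtown!! More at https://example.com/alert #StaySafe"
cleaned = remove_char(remove_url(sample))

# Optional extra step: collapse runs of whitespace left behind by the removals.
collapsed = ' '.join(cleaned.split())

print(repr(cleaned))
print(repr(collapsed))
```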

<a id='6.2'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Tokenization</p>

### Here we combine the training and testing data, since tokenization needs to be performed on both datasets. We also analyze the combined data further to produce more visualizations.

In [None]:
data = pd.concat([train,test],axis=0,sort=False)
data.drop(['target'],axis=1,inplace=True)
data.head()

In [None]:
def filtered_token(token):
    if token.is_stop or token.is_space or token.like_num or token.like_url or token.like_email or token.is_digit or token.is_punct or len(token.lemma_) <= 2:
        return False
    return True
        

def tokenize(text):
    tokens = []
    doc = nlp(text)
    for token in doc:
        if filtered_token(token):
            tokens.append(token.lemma_.lower())
    return tokens

In [None]:
data['tokens'] = data['text'].apply(lambda x: tokenize(x))
data.head()

In [None]:
plt.figure(figsize=(10,5))
word_len = data['tokens'].map(lambda x: len(x))
plt.hist(word_len)
plt.xlabel('Number of tokens')
plt.ylabel('Number of tweets')
plt.title('Number of tokens v/s Number of tweets')
plt.show()
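
Another view of the tokenized data is the list of most frequent tokens. A minimal sketch using `collections.Counter` follows; the toy `token_lists` stands in for the `tokens` column and is invented for illustration.

```python
from collections import Counter

# Toy token lists standing in for data['tokens'] (invented for illustration).
token_lists = [
    ['fire', 'forest', 'evacuate'],
    ['flood', 'fire'],
    ['fire', 'flood', 'rescue'],
]

# Count every token across all tweets.
token_counts = Counter(token for tokens in token_lists for token in tokens)

# The two most frequent tokens with their counts.
top_tokens = token_counts.most_common(2)
print(top_tokens)
```

On the real data, the same counter could be built from `data['tokens']` and the result passed to a bar plot.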

<a id='6.3'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Glove Vector Embeddings</p>

### Here we use the 300-dimensional GloVe vectors file.

In [None]:
embeddings = {}
with open('../input/glove6b/glove.6B.300d.txt', encoding='utf-8') as glove_vec_file:
    for line in glove_vec_file:
        values = line.split()
        word = values[0] # The first field is the word.
        embedding = np.asarray(values[1:], dtype='float32') # The remaining fields are the vector components.
        embeddings[word] = embedding
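
Each line of a GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two fake 4-dimensional lines (invented for illustration; the real file has 300 components per line) and converts the components to `float32` explicitly so the vectors are numeric rather than strings.

```python
import io
import numpy as np

# Two fake lines in the GloVe text format: a word followed by its vector.
fake_file = io.StringIO("fire 0.1 0.2 0.3 0.4\nflood -0.5 0.0 0.5 1.0\n")

toy_embeddings = {}
for line in fake_file:
    values = line.split()
    # First field is the word, the rest are the vector components.
    toy_embeddings[values[0]] = np.asarray(values[1:], dtype='float32')

print(toy_embeddings['flood'].dtype, toy_embeddings['flood'].shape)
```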

In [None]:
def embeddings_out(data,maxlen=20):
    output = np.zeros((data.shape[0],maxlen,300)) # Zeros act as padding for short tweets and unknown words.
    for ix in range(len(data)):
        tokens = data.iloc[ix]['tokens']
        curr_len = min(maxlen,len(tokens)) # Truncate tweets longer than maxlen.
        for jx in range(curr_len):
            word = str(tokens[jx])
            if word in embeddings:
                output[ix][jx] = embeddings[word]
    return output
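
The padding and truncation behaviour is easier to see on a tiny example. The sketch below uses a hypothetical helper `pad_embed` with 2-dimensional toy vectors (all names and values are invented for illustration): short tweets and unknown words are left as zeros, and tweets longer than `maxlen` are cut off.

```python
import numpy as np

def pad_embed(token_lists, vectors, maxlen, dim):
    """Hypothetical helper: embed token lists into a zero-padded (n, maxlen, dim) array."""
    output = np.zeros((len(token_lists), maxlen, dim))
    for ix, tokens in enumerate(token_lists):
        for jx, word in enumerate(tokens[:maxlen]):  # truncate long tweets
            if word in vectors:                      # unknown words stay as zeros
                output[ix, jx] = vectors[word]
    return output

# Toy vectors and tokens, invented for illustration.
toy_vectors = {'fire': np.array([1.0, 1.0]), 'flood': np.array([2.0, 2.0])}
toy_tokens = [['fire', 'unknown'], ['flood', 'fire', 'fire', 'flood']]

out = pad_embed(toy_tokens, toy_vectors, maxlen=3, dim=2)
print(out.shape)
```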

In [None]:
X = embeddings_out(data) # Embedded features for the combined training and testing data.
X.shape

### Splitting the dataset into training and testing parts.

In [None]:
X_train = X[:len(train)]
X_test = X[len(train):]
y_train = train['target'].values
len(X_train),len(X_test),len(y_train)

<a id='7'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px; padding: 2px">LSTM Model</p>

<a id='7.1'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Model Creation</p>

### Here we stack three LSTM layers, aiming for better accuracy than a single LSTM layer.

In [None]:
def create_model():
    model = Sequential()
    model.add(LSTM(64,input_shape=(X.shape[1],X.shape[2]),return_sequences=True)) # Only the first layer needs input_shape.
    model.add(Dropout(0.2))
    model.add(LSTM(64,return_sequences=True)) # Pass the full sequence on to the next LSTM.
    model.add(Dropout(0.2))
    model.add(LSTM(64)) # The last LSTM returns only its final output.
    model.add(Dropout(0.1))
    model.add(Dense(64,activation='relu'))
    model.add(Dense(64,activation='relu'))
    model.add(Activation('softmax'))
    model.add(Dense(1,activation='sigmoid')) # Probability that the tweet describes a disaster.
    model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
    return model

In [None]:
model = create_model()
model.summary()

<a id='7.2'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Model Training</p>

## We use an early stopping callback and hold out one tenth of the training data as a validation set to estimate the number of epochs that prevents overfitting.

In [None]:
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

hist = model.fit(X_train,y_train,epochs=100,batch_size=64,shuffle=True,validation_split=0.1,callbacks=[early_stop])

In [None]:
losses = pd.DataFrame(model.history.history)
losses.plot()
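
The patience rule behind early stopping can be illustrated in plain Python. This is a simplified sketch, not the Keras implementation: the helper `early_stopping_epoch` and the loss values are invented for illustration, and the real callback has further options such as `min_delta` and `restore_best_weights`.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Hypothetical helper: return (stop_epoch, best_epoch), both 1-indexed."""
    best_loss = float('inf')
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Invented validation losses: no improvement after epoch 3.
toy_val_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.48, 0.50, 0.52, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(toy_val_losses, patience=5)
print(stop_epoch, best_epoch)
```

Training stops `patience` epochs after the best validation loss, so the epoch with the lowest validation loss, rather than the stopping epoch, is the natural estimate for how long to train.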

## Training the model on the full training data with the estimated number of epochs.

In [None]:
model = create_model()

In [None]:
history = model.fit(x=X_train,y=y_train,
          batch_size=64,epochs=20,shuffle=True)

In [None]:
losses = pd.DataFrame(model.history.history)
losses.plot()

<a id='7.3'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Prediction & Evaluation</p>

In [None]:
pred = (model.predict(X_test) > 0.5).astype("int32").ravel() # Threshold the sigmoid outputs at 0.5 and flatten to 1-D.
pred
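
The thresholding step can be checked on a few hand-written probabilities. In this sketch the `probs` array is invented for illustration and shaped `(n_samples, 1)`, which is the shape a single-unit sigmoid output layer produces.

```python
import numpy as np

# Invented sigmoid outputs with shape (n_samples, 1).
probs = np.array([[0.91], [0.12], [0.50], [0.73]])

# Strict comparison: a probability of exactly 0.5 maps to class 0.
labels = (probs > 0.5).astype("int32").ravel()
print(labels)
```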

In [None]:
result = pd.DataFrame()
result['id'] = test['id']
result['target'] = pred
result.head()

In [None]:
result.to_csv('submission_LSTM.csv',index=False)

### This gives us a score of 0.80079. It can be improved further through hyperparameter tuning, but for now we move on to the BERT model to improve the score.

<a id='8'></a>
# <p style="font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:200%; background: linear-gradient(to right, black, #eee, black); text-shadow: 2px 2px 2px #CCCCCC; text-align:center; border-radius: 15px 50px; padding: 2px">Bert Model</p>

<a id='8.1'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Text Preprocessing for Bert</p>

In [None]:
train_data = train.head(7000).copy()
val_data = train.tail(525).copy()
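
The split above takes fixed row counts from the head and tail of the DataFrame. As an alternative, not what this notebook does, the sketch below shuffles the rows first and derives the validation size from a fraction. The toy DataFrame, `val_fraction` and `random_state` values are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy_train = pd.DataFrame({
    'text': [f'tweet {i}' for i in range(10)],
    'target': [0, 1] * 5,
})

val_fraction = 0.2  # assumed value for illustration

# Shuffle all rows reproducibly, then cut off the last n_val rows for validation.
shuffled = toy_train.sample(frac=1.0, random_state=42).reset_index(drop=True)
n_val = int(len(shuffled) * val_fraction)

toy_train_part = shuffled.iloc[:-n_val].copy()
toy_val_part = shuffled.iloc[-n_val:].copy()
print(len(toy_train_part), len(toy_val_part))
```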

In [None]:
(X_train, y_train), (X_val, y_val), preproc = text.texts_from_df(train_df=train_data,
text_column = 'text',label_columns = 'target',val_df = val_data,maxlen = 256,preprocess_mode = 'bert')

<a id='8.2'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Model Creation</p>

In [None]:
model = text.text_classifier(name = 'bert',
                             train_data = (X_train, y_train),
                             preproc = preproc)

In [None]:
learner = ktrain.get_learner(model=model, train_data=(X_train, y_train),
                   val_data = (X_val, y_val),
                   batch_size = 16)

<a id='8.3'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Model Training</p>

In [None]:
learner.fit_onecycle(lr = 2e-5, epochs = 2)

predictor = ktrain.get_predictor(learner.model, preproc)

<a id='8.4'></a>
## <p style="opacity:0.8; font-family:'Myriad Pro', 'Myriad', helvetica, arial, sans-serif;font-size:150%; background: linear-gradient(to right, skyblue, blue); text-shadow: 2px 2px 2px #CCCCCC; text-align:; border-radius: 10px 20px; padding: 2px">Prediction & Evaluation</p>

In [None]:
result = pd.DataFrame()
result['id'] = test['id']
result['target'] = predictor.predict(test['text'].values)
result['target'] = result['target'].map(lambda x:1 if x=='target' else 0)
result.head()

In [None]:
result.to_csv('submission_bert.csv',index=False)

### Thanks a lot for reading this notebook. Feedback and upvotes are most welcome!