# Table of Contents
1.  General Advice
1.  Some important study material links related to this project
1.  References of some other notebooks used in this project
1.  Importing relevant Libraries
1.  Loading the dataset
1.  EDA
     *      Check for the presence of imbalanced dataset.
     *      Checking number of characters in disastrous & non-disastrous tweets
     *      Checking stop words for disastrous & Non-disastrous tweets
     *      Checking common words in both types of tweets
     *      Cleaning the dataset
     *      Printing WordCloud
1.  Word Embedding using GloVe  
1.  Building an LSTM Model with GloVe results
1.  Predicting train,cv,test outputs from the LSTM model
1.  Plotting Confusion Matrix
1.  Applying BERT
1.  Testing the BERT model on train & test dataset
1.  Comparing LSTM with GloVe vs BERT model

# 1. General Advice

* Ensure that the Internet option is turned "ON" in Kaggle so that libraries, datasets, etc. can be downloaded.
 
* Enable the GPU accelerator before running the notebook for faster performance. You can enable it by clicking the arrow to the right of the "Save Version" button.





# 2. Some important study material links related to this project

http://jalammar.github.io/illustrated-bert/

http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

# 3. References of some other notebooks used in this project

* https://www.kaggle.com/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert

* https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub

* https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert

* https://www.kaggle.com/shahules/basic-eda-cleaning-and-glove

# 4. Importing relevant Libraries

In [None]:
import pandas as pd
import numpy as np
import os

import matplotlib
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns

from nltk.corpus import stopwords
from nltk.util import ngrams

from wordcloud import WordCloud

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.metrics import classification_report,confusion_matrix

from collections import defaultdict
from collections import Counter
plt.style.use('ggplot')
stop=set(stopwords.words('english'))

import re
from nltk.tokenize import word_tokenize
import gensim
import string

from tqdm import tqdm
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM,Dense, SpatialDropout1D, Dropout
from keras.initializers import Constant
from keras.optimizers import Adam

# 5. Loading the dataset

In [None]:
df_train= pd.read_csv('../input/nlp-getting-started/train.csv')
df_test=pd.read_csv('../input/nlp-getting-started/test.csv')
submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")

In [None]:
print('Training Set Shape = {}'.format(df_train.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(df_train.memory_usage().sum() / 1024**2))
print('Test Set Shape = {}'.format(df_test.shape))
print('Test Set Memory Usage = {:.2f} MB'.format(df_test.memory_usage().sum() / 1024**2))

df_train.head()
print(df_train.columns)
print(df_train['text'].values[0])
print("="*50)
print(df_test['text'].values[150])
print("="*50)

# 6. EDA

* ## Check for the presence of imbalanced dataset

In [None]:
# extracting the number of examples of each class
Real_len = df_train[df_train['target'] == 1].shape[0]
Not_len = df_train[df_train['target'] == 0].shape[0]
# bar plot of the 2 classes
plt.rcParams['figure.figsize'] = (7, 5)
plt.bar(10,Real_len,3, label="Disastrous tweets counts", color='red')
plt.bar(15,Not_len,3, label="Non-Disastrous tweets counts", color='green')
plt.legend()
plt.ylabel('Number of examples')
plt.title('Proportion of examples')
plt.show()

> ### CONCLUSION : Disastrous and non-disastrous tweets occur in roughly comparable numbers, so class imbalance is not a problem here.

*What if the dataset were imbalanced? We would then need techniques such as upsampling the minority class or creating synthetic points (data augmentation). Downsampling the majority class is generally not recommended because it discards data, which ultimately means losing information.*
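
The upsampling idea mentioned above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: `upsample_minority`, `label_of` and the toy rows are hypothetical names and data, not part of this notebook, and random duplication is just one of several possible upsampling strategies.

```python
import random
from collections import Counter

def upsample_minority(rows, label_of, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # sample with replacement to fill the gap up to the majority size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy data (made up): 6 non-disaster tweets (label 0) and 2 disaster tweets (label 1)
toy = [("tweet %d" % i, 0) for i in range(6)] + [("alert %d" % i, 1) for i in range(2)]
balanced = upsample_minority(toy, label_of=lambda row: row[1])
counts = Counter(label for _, label in balanced)
print(counts)
```

If something like this were used, it should be applied to the training split only, after the train/cv split, so that duplicated rows do not leak into the validation data.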

* ## Checking number of characters in disastrous & non-disastrous tweets

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
tweet_len=df_train[df_train['target']==1]['text'].str.len()
ax1.hist(tweet_len,color='red')
ax1.set_title('disaster tweets')
tweet_len=df_train[df_train['target']==0]['text'].str.len()
ax2.hist(tweet_len,color='green')
ax2.set_title('Not disaster tweets')
fig.suptitle('Characters in tweets')
plt.show()

> ### CONCLUSION - We ran this check to see whether the number of characters differs between disastrous and non-disastrous tweets. Tweets of ~120-140 characters are very common in both classes, and the overall distributions look very similar, so character count per tweet is not a useful feature for separating the two classes.

* ## Checking stop words for disastrous & Non-disastrous tweets

In [None]:
def create_corpus(target): # creating corpus of all the words in our dataset
    corpus=[]
    for x in df_train[df_train['target']==target]['text'].str.split():
        for i in x:
            corpus.append(i)
    return corpus

In [None]:
corpus=create_corpus(0)

dic=defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word]+=1
        
top_0=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:20]

####################################################################

corpus=create_corpus(1)

dic=defaultdict(int)
for word in corpus:
    if word in stop:
        dic[word]+=1
        
top_1=sorted(dic.items(), key=lambda x:x[1],reverse=True)[:20]

In [None]:
plt.rcParams['figure.figsize'] = (18.0, 6.0)
x0,y0=zip(*top_0)
x1,y1=zip(*top_1)
plt.bar(x0,y0, color=['green'], label = "Non-disaster")                # indicates non-disaster tweets
plt.bar(x1,y1, color=['red'], label = "Disaster")              # indicates disaster tweets
plt.legend()

> ### CONCLUSION : In both types of tweets the stop word "the" appears most often. The stop words "on", "by", "at", "from", "are", "after" and "as" do not appear among the top stop words of non-disastrous tweets, but their counts are too small to make them a differentiating feature. In summary, stop words do not tell us much either.

* ## Checking common words in both types of tweets

In [None]:
corpus = create_corpus(0)   # target 0 = non-disastrous tweets, matching the plot title
plt.figure(figsize=(16,5))
counter=Counter(corpus)
most=counter.most_common()
x=[]
y=[]
for word,count in most[:40]:
    if (word not in stop) :
        x.append(word)
        y.append(count)
        
sns.barplot(x=y,y=x).set_title('Words most frequently occurring in Non-disastrous tweets')

> ### CONCLUSION : Tokens like "-" and "..." appear in many non-disastrous tweets. We will remove them because they add no information about the tweet.

*But why remove unnecessary words instead of leaving them as they are? Some of them occur in almost every tweet, so they take up space without adding any value for telling the two classes apart.*

In [None]:
corpus = create_corpus(1)   # target 1 = disastrous tweets, matching the plot title
plt.figure(figsize=(16,5))
counter=Counter(corpus)
most=counter.most_common()
x=[]
y=[]
for word,count in most[:40]:
    if (word not in stop) :
        x.append(word)
        y.append(count)
        
sns.barplot(x=y,y=x).set_title('Words most frequently occurring in Disastrous tweets')

> ### CONCLUSION : The same argument applies to the most frequent words in disastrous tweets. The whole dataset needs cleaning.



* ## Cleaning the dataset

*Concatenating train & test data to apply cleaning on whole dataset*

In [None]:
df=pd.concat([df_train,df_test])       
df.shape

*Removing URLs from tweets*

In [None]:
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

*Removing HTML tags from tweets*

In [None]:
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

*Removing emojis from tweets*

In [None]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("Omg another Earthquake 😔😔")

*Removing punctuations from tweets*

In [None]:
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

*Applying all the data cleaning methods*

In [None]:
df['text']=df['text'].apply(lambda x : remove_URL(x))
df['text']=df['text'].apply(lambda x : remove_html(x))
df['text']=df['text'].apply(lambda x: remove_emoji(x))
df['text']=df['text'].apply(lambda x : remove_punct(x))
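
To see what the cleaning steps do to a single tweet, here is a self-contained sketch that chains the URL, HTML-tag and punctuation removal in the same order as above (emoji removal is left out for brevity). `clean_text` and the sample tweet are made up for illustration.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    """Apply URL, HTML-tag and punctuation removal, in that order."""
    text = URL_RE.sub('', text)
    text = HTML_RE.sub('', text)
    return text.translate(PUNCT_TABLE)

# Made-up tweet containing a tag, a URL, punctuation and a hashtag
sample = "Forest fire near <b>La Ronge</b>! Details: https://example.com/x #wildfire"
cleaned = clean_text(sample)
print(cleaned)
```

The order matters: if punctuation were stripped first, the `://` and `<` `>` characters would already be gone and the URL and HTML patterns would no longer match.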

* ## Printing WordCloud

In [None]:
# Creating separate corpora for disastrous & non-disastrous tweets.
# Note: create_corpus reads df_train, so the cleaning applied to df above is not reflected here.
corpus_new1 = create_corpus(1)
corpus_new0 = create_corpus(0)

In [None]:
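plt.figure(figsize=(12,8))
# Note: only the first 50 tokens of each corpus are passed to WordCloud below ([:50]),
# so each cloud reflects the first few tweets of its class rather than the whole class.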
word_cloud1 = WordCloud(
                          background_color='black',
                          max_font_size = 80
                         ).generate(" ".join(corpus_new1[:50]))

word_cloud0 = WordCloud(
                          background_color='black',
                          max_font_size = 80
                         ).generate(" ".join(corpus_new0[:50]))


In [None]:
fig, ax = plt.subplots(1,2, figsize=(20, 5))
ax[0].imshow(word_cloud1)
ax[0].set_title('Most frequently appearing words in disastrous tweets',fontsize = 20)
ax[0].axis("off")

ax[1].imshow(word_cloud0)
ax[1].set_title('Most frequently appearing words in non-disastrous tweets',fontsize = 20)
ax[1].axis("off")

plt.show()

> ## CONCLUSION - Words like "Shelter", "evacuation", "Forest" and "Reason" stand out in the word cloud for disastrous tweets. Words like "wonderful" and "lovely", along with other words unrelated to disasters, appear in the non-disastrous cloud, as expected.

*Why draw word clouds? They give a quick idea of the most common words in a document (tweets in this case). For this dataset, they also show which words people most commonly use in disastrous and non-disastrous tweets.*

# 7. Word Embedding using GloVe

*Creating a new corpus*

In [None]:
def create_corpus_new(df):
    corpus=[]
    for tweet in tqdm(df['text']):
        words=[word.lower() for word in word_tokenize(tweet)]
        corpus.append(words)
    return corpus   

In [None]:
corpus=create_corpus_new(df)

*Importing pretrained GloVe vector representation of words*


*Note that we are creating an embedding dictionary that stores the vector representation of each word. These words are not taken from our dataset; they are the words stored in the pretrained GloVe file.*

In [None]:
embedding_dict={}
# the `with` block closes the file automatically, so no explicit f.close() is needed
with open('../input/glove-global-vectors-for-word-representation/glove.6B.100d.txt','r',encoding='utf-8') as f:
    for line in f:
        values=line.split()
        word=values[0]
        vectors=np.asarray(values[1:],'float32')
        embedding_dict[word]=vectors
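
The GloVe text format is simple: each line holds a token followed by its vector components, separated by spaces. The sketch below parses two made-up 4-dimensional lines from an in-memory file, so it runs without the real `glove.6B.100d.txt`. `load_glove` is a hypothetical helper that uses plain lists instead of NumPy arrays.

```python
import io

def load_glove(handle):
    """Parse GloVe's text format: one token followed by its float components per line."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(' ')
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Two made-up 4-dimensional vectors standing in for the real 100-dimensional file
fake_file = io.StringIO("fire 0.1 -0.2 0.3 0.4\nflood 0.5 0.6 -0.7 0.8\n")
toy_embeddings = load_glove(fake_file)
print(toy_embeddings["fire"])
print(toy_embeddings.get("earthquakee"))  # misspelled word is absent, so .get returns None
```

Words that are missing from the pretrained vocabulary (typos, hashtags, slang) simply have no entry, which is why the later lookup uses `.get` and checks for `None`.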

*Padding the tokenized tweets so that every sequence has the same length (MAX_LEN), because the models that follow require fixed-length inputs.*

In [None]:
MAX_LEN=50
tokenizer_obj=Tokenizer()
tokenizer_obj.fit_on_texts(corpus)
sequences=tokenizer_obj.texts_to_sequences(corpus)

tweet_pad=pad_sequences(sequences,maxlen=MAX_LEN,truncating='post',padding='post')
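
What `truncating='post'` and `padding='post'` do to a single sequence can be sketched in a few lines of plain Python. `pad_post` is a hypothetical stand-in for illustration, not the Keras implementation, and the token ids are made up.

```python
def pad_post(sequence, max_len, pad_value=0):
    """Cut extra tokens from the end, then fill with pad_value at the end."""
    clipped = sequence[:max_len]
    return clipped + [pad_value] * (max_len - len(clipped))

short = pad_post([12, 7, 301], max_len=5)               # shorter than max_len: zeros appended
long = pad_post([4, 8, 15, 16, 23, 42, 99], max_len=5)  # longer than max_len: tail dropped
print(short)  # [12, 7, 301, 0, 0]
print(long)   # [4, 8, 15, 16, 23]
```

Index 0 is never assigned to a real word by the Keras `Tokenizer`, which is why it can safely serve as the padding value.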

*Integer sequence of the first tweet in our corpus. Note that tweet_pad[0] has length 50, the MAX_LEN defined above.*

In [None]:
tweet_pad[0][0:]

In [None]:
word_index=tokenizer_obj.word_index
print('Total number of unique words in corpus (full dataset):',len(word_index))

*Here we create the embedding matrix, which stores the GloVe vector of each word in our corpus. This is the first point at which words from our own dataset are mapped to vectors: embedding_matrix[i] is the vector of the word whose tokenizer index is i, and words missing from GloVe keep an all-zero row.
Each word is represented by a 100-dimensional vector.*

In [None]:
num_words=len(word_index)+1
embedding_matrix=np.zeros((num_words,100))

for word,i in tqdm(word_index.items()):
    if i < num_words:
        emb_vec=embedding_dict.get(word)
        if emb_vec is not None:
            embedding_matrix[i]=emb_vec
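
The same lookup can be sketched on toy data to show which rows stay zero and how much of the vocabulary the pretrained vectors cover. `build_embedding_matrix`, the three-word vocabulary and the 2-dimensional vectors are all made up for illustration.

```python
def build_embedding_matrix(word_index, embedding_dict, dim):
    """Row i holds the pretrained vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]  # row 0 is reserved for padding
    missing = []
    for word, i in word_index.items():
        vector = embedding_dict.get(word)
        if vector is None:
            missing.append(word)
        else:
            matrix[i] = list(vector)
    return matrix, missing

toy_word_index = {"fire": 1, "flood": 2, "omg": 3}       # tokenizer-style indices start at 1
toy_vectors = {"fire": [0.1, 0.2], "flood": [0.3, 0.4]}  # "omg" has no pretrained vector
matrix, missing = build_embedding_matrix(toy_word_index, toy_vectors, dim=2)
coverage = 1 - len(missing) / len(toy_word_index)
print(matrix)
print(missing, round(coverage, 2))
```

Checking the coverage on the real `word_index` is a quick way to see how many tweet tokens end up with an all-zero vector and therefore carry no pretrained information into the LSTM.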

# 8. Building an LSTM Model with GloVe results

*Note that we pass embedding_matrix as the embeddings_initializer of the Embedding layer and freeze it with trainable=False.
If no initializer is given, the Keras Embedding layer starts from random weights and learns the word embeddings itself during training.*

*Pretrained embeddings are popular because they were trained on very large corpora; GloVe, Word2Vec and BERT are among the most widely used.*

In [None]:
model=Sequential()

embedding=Embedding(num_words,100,embeddings_initializer=Constant(embedding_matrix),
                   input_length=MAX_LEN,trainable=False)

model.add(embedding)
model.add(SpatialDropout1D(0.2))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))


optimizer=Adam(learning_rate=3e-4)

model.compile(loss='binary_crossentropy',optimizer=optimizer,metrics=['accuracy'])

In [None]:
model.summary()

*Splitting back into train & test sets. Note that tweet_pad was built from df, which holds the whole dataset, so the first df_train.shape[0] rows are the training tweets.*

In [None]:
train=tweet_pad[:df_train.shape[0]]
test=tweet_pad[df_train.shape[0]:]

In [None]:
train.shape

*Splitting the train set further into train and cv sets*

In [None]:
X_train,X_cv,y_train,y_cv=train_test_split(train,df_train["target"],test_size=0.1)
print('Shape of train  ->',X_train.shape)
print("Shape of Validation ->",X_cv.shape)

*Fitting the model with the train dataset keeping cv dataset as validation_data*

In [None]:
%%time
# Recommended 10-20 epochs
history=model.fit(X_train,y_train,batch_size=64,epochs=10,validation_data=(X_cv,y_cv),verbose=2)

> ## RESULTS : We get ~82% accuracy on the cv dataset (not used for training the model)

# 9. Predicting train,cv,test outputs from the LSTM model

In [None]:
train_pred_GloVe = model.predict(X_train)
train_pred_GloVe_int = train_pred_GloVe.round().astype('int')

cv_pred_GloVe = model.predict(X_cv)
cv_pred_GloVe_int = cv_pred_GloVe.round().astype('int')

test_pred_GloVe = model.predict(test)
test_pred_GloVe_int = test_pred_GloVe.round().astype('int')
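
The model outputs sigmoid probabilities, and `.round()` turns them into 0/1 labels, which amounts to a decision threshold of 0.5. A plain-Python sketch with an explicit threshold, using made-up scores and a hypothetical `to_labels` helper:

```python
def to_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to 0/1 labels using an explicit decision threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

scores = [0.91, 0.08, 0.5, 0.49]     # made-up sigmoid outputs
default_labels = to_labels(scores)
lenient_labels = to_labels(scores, threshold=0.3)
print(default_labels)  # [1, 0, 1, 0]
print(lenient_labels)  # [1, 0, 1, 1]
```

One small difference: NumPy rounds exact halves to the nearest even number, so `.round()` maps a score of exactly 0.5 to 0, whereas this sketch maps it to 1. An explicit threshold also makes it easy to trade precision against recall if one type of error is more costly.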


# 10. Plotting Confusion Matrix

In [None]:
# Showing Confusion Matrix
def plot_cm(y_true, y_pred, title, figsize):
    cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
    cm_sum = np.sum(cm, axis=1, keepdims=True)
    cm_perc = cm / cm_sum.astype(float) * 100
    annot = np.empty_like(cm).astype(str)
    nrows, ncols = cm.shape
    for i in range(nrows):
        for j in range(ncols):
            c = cm[i, j]
            p = cm_perc[i, j]
            if i == j:
                s = cm_sum[i, 0]  # row total as a plain scalar
                annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
            elif c == 0:
                annot[i, j] = ''
            else:
                annot[i, j] = '%.1f%%\n%d' % (p, c)
    cm = pd.DataFrame(cm, index=np.unique(y_true), columns=np.unique(y_true))
    cm.index.name = 'Actual'
    cm.columns.name = 'Predicted'
    fig, ax = plt.subplots(figsize=figsize)
    plt.title(title)
    sns.heatmap(cm, cmap= "YlGnBu", annot=annot, fmt='', ax=ax)
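
The percentages in the plot are computed per row, that is, per actual class, so the order of the `y_true` and `y_pred` arguments matters. The sketch below rebuilds the counts in plain Python on made-up labels; `confusion_counts` and `row_percentages` are hypothetical helpers, not scikit-learn functions.

```python
def confusion_counts(y_true, y_pred, labels=(0, 1)):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

def row_percentages(matrix):
    """Share of each actual class that landed in each predicted column."""
    return [[100.0 * cell / sum(row) if sum(row) else 0.0 for cell in row] for row in matrix]

# Made-up labels: 4 actual disaster tweets (1) and 6 actual non-disaster tweets (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
cm = confusion_counts(y_true, y_pred)
swapped = confusion_counts(y_pred, y_true)   # arguments in the wrong order
print(cm, row_percentages(cm))
print(swapped, row_percentages(swapped))
```

With the arguments in the right order, the diagonal percentages are per-class recall (75% of actual disaster tweets found here). With the arguments swapped, the matrix is transposed and the same cells read as precision instead (60% of predicted disaster tweets are real), so the two must not be mixed up when quoting results.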

In [None]:
# Showing Confusion Matrix for LSTM+GloVe model on cv dataset
# plot_cm expects (y_true, y_pred): actual labels first, predictions second
plot_cm(y_cv, cv_pred_GloVe_int.ravel(), 'Confusion matrix for LSTM+GloVe model on cv dataset', figsize=(7,7))

> ## RESULTS : Our model correctly classifies ~82% of disastrous tweets as disastrous and ~79% of non-disastrous tweets as non-disastrous.

*Storing model's output for data in submission file*

In [None]:
submission['target'] = test_pred_GloVe_int
submission.head(10)

submission.to_csv("submission.csv", index=False, header=True)
submission.head()

# 11. Applying BERT

*I strongly recommend reading the external reference links provided at the start of this kernel. The following code will be much clearer after that.*

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub


# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py
    
import tokenization

In [None]:
# Thanks to https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
def bert_encode(texts, tokenizer, max_len=512):
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
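
To make the three outputs of `bert_encode` concrete, here is a self-contained sketch that encodes one sentence with a toy tokenizer. `ToyTokenizer` and `encode_one` are hypothetical: the real `FullTokenizer` splits text into WordPiece units and uses BERT's own vocabulary ids, so the numbers below are illustrative only.

```python
class ToyTokenizer:
    """Hypothetical whitespace tokenizer with a vocabulary that grows on demand."""
    def __init__(self):
        self.vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2}

    def tokenize(self, text):
        return text.lower().split()

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in tokens]

def encode_one(text, tokenizer, max_len):
    """Build token ids, attention mask and segment ids for a single sentence."""
    # leave room for the two special tokens
    tokens = ["[CLS]"] + tokenizer.tokenize(text)[:max_len - 2] + ["[SEP]"]
    pad_len = max_len - len(tokens)
    ids = tokenizer.convert_tokens_to_ids(tokens) + [0] * pad_len
    mask = [1] * len(tokens) + [0] * pad_len   # 1 = real token, 0 = padding
    segments = [0] * max_len                   # single-sentence input: all segment 0
    return ids, mask, segments

ids, mask, segments = encode_one("Forest fire near town", ToyTokenizer(), max_len=8)
print(ids)       # [1, 3, 4, 5, 6, 2, 0, 0]
print(mask)      # [1, 1, 1, 1, 1, 1, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The mask tells the model which positions are padding, and the segment ids would only differ from zero for a second sentence in sentence-pair tasks, which this classification setup does not use.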

In [None]:
def build_model(bert_layer, max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    clf_output = sequence_output[:, 0, :]
    
    if Dropout_num == 0:
        # Without Dropout
        out = Dense(1, activation='sigmoid')(clf_output)
    else:
        # With Dropout(Dropout_num), Dropout_num > 0
        x = Dropout(Dropout_num)(clf_output)
        out = Dense(1, activation='sigmoid')(x)

    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model

In [None]:
# Load BERT from the Tensorflow Hub
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

*Note that we are loading the dataset again*

In [None]:
# Load CSV files containing training data
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

## Cleaning the data

*Concatenating train & test data to apply cleaning on the whole dataset, as we did earlier*

In [None]:
df=pd.concat([train,test])       
df.shape

In [None]:
df['text']=df['text'].apply(lambda x : remove_URL(x))
df['text']=df['text'].apply(lambda x : remove_html(x))
df['text']=df['text'].apply(lambda x: remove_emoji(x))
df['text']=df['text'].apply(lambda x : remove_punct(x))

*Note that df here includes the train as well as the test dataset, so we have to split it back into train and test before applying the BERT model.*

In [None]:
train = df[:train.shape[0]]
test = df[train.shape[0]:]

*Splitting train further into train and cv datasets*

In [None]:
X_train_BERT,X_cv_BERT,y_train_BERT,y_cv_BERT=train_test_split(train,train['target'].values,test_size=0.2)
print('Shape of train set ->',X_train_BERT.shape)
print("Shape of Validation set ->",X_cv_BERT.shape)
print("Shape of test set ->",test.shape)

*Load tokenizer from the bert layer*

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)

*Encoding the text into tokens, masks, and segment flags which are taken as input to BERT model*

In [None]:
# Thanks to https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
train_input = bert_encode(X_train_BERT.text.values, tokenizer, max_len=160)
cv_input = bert_encode(X_cv_BERT.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = y_train_BERT
cv_labels = y_cv_BERT

*Building the BERT model with our own choice of parameters*

In [None]:
# Thanks to https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
Dropout_num = 0
model_BERT = build_model(bert_layer, max_len=160)
model_BERT.summary()

In [None]:
# Thanks to https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
checkpoint = ModelCheckpoint('model_BERT.h5', monitor='val_loss', save_best_only=True)

train_history = model_BERT.fit(
    train_input, train_labels,
    validation_split = 0.1,
    epochs = 3, # recommended 3-5 epochs
    callbacks=[checkpoint],
    batch_size = 16
)

# 12. Testing the BERT model on train & test dataset

In [None]:
# Thanks to https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub
# Prediction by BERT model with my tuning
model_BERT.load_weights('model_BERT.h5')
train_pred_BERT = model_BERT.predict(train_input)
train_pred_BERT_int = train_pred_BERT.round().astype('int')

cv_pred_BERT = model_BERT.predict(cv_input)
cv_pred_BERT_int = cv_pred_BERT.round().astype('int')

In [None]:
test_pred_BERT = model_BERT.predict(test_input)
test_pred_BERT_int = test_pred_BERT.round().astype('int')

In [None]:
submission['target'] = test_pred_BERT_int
submission.head(10)

In [None]:
submission.to_csv("submission.csv", index=False, header=True)

## Printing Confusion Matrix

In [None]:
# plot_cm expects (y_true, y_pred): actual labels first, predictions second
plot_cm(cv_labels, cv_pred_BERT_int.ravel(), 'Confusion matrix for BERT model on cv dataset', figsize=(7,7))

# 13. Comparing LSTM with GloVe vs BERT model

> ## CONCLUSION : With BERT, ~86% of disastrous tweets and ~85% of non-disastrous tweets are predicted correctly, compared with ~82% and ~79% respectively for the LSTM+GloVe model. So the BERT model improves on the LSTM+GloVe model. The two cv sets are different random splits, so the comparison is approximate.

# ✔️ PLEASE GIVE THIS NOTEBOOK AN UPVOTE IF YOU LIKED IT!!!