# <a id="top_section"></a>

<div align='center'><font size="6" color="#000000"><b>NLP with disaster tweets!(BERT explained) <br>(~84.5% Accuracy)</b></font></div>
<hr>
<div align='center'><font size="5" color="#000000">About the problem</font></div>
<hr>

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.<br>
I have two notebooks for this competition: the first uses a basic Naive Bayes model, whereas this one uses a pre-trained BERT model. If you're a beginner, I highly recommend checking out the basic model notebook first! Here is the link: <br>
#### [NLP w/ Disaster tweets!(Explained)](https://www.kaggle.com/friskycodeur/nlp-w-disaster-tweets-explained)
<br>
<img src='https://c7.uihere.com/files/932/486/348/tweet-bird-logotwitter-icon-buttonflat-social-vector.jpg' height=500 width=500>
<br>

### Here are the things I will try to cover in this Notebook:

- A brief introduction to BERT
- Defining helper functions
- Data cleaning
- Using a BERT model to get higher accuracy
- Is it possible to get 100% accuracy, and if so, how?

### If you liked this kernel feel free to upvote and leave feedback, thanks!

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home" align='center'>Table of Contents</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#top_section" role="tab" aria-controls="profile">About the Problem<span class="badge badge-primary badge-pill">1</span></a>
<a class="list-group-item list-group-item-action" data-toggle="list" href="#bert" role="tab" aria-controls="messages">A brief introduction to BERT<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#sec1" role="tab" aria-controls="messages">Importing basic libraries and data<span class="badge badge-primary badge-pill">3</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#sec2" role="tab" aria-controls="settings">Defining helpful functions<span class="badge badge-primary badge-pill">4</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#sec3" role="tab" aria-controls="settings">Data Cleaning<span class="badge badge-primary badge-pill">5</span></a> 
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#sec4" role="tab" aria-controls="settings">Loading pre-trained BERT<span class="badge badge-primary badge-pill">6</span></a>
    <a class="list-group-item list-group-item-action" data-toggle="list" href="#sec5" role="tab" aria-controls="settings">Modelling<span class="badge badge-primary badge-pill">7</span></a>
    <a class="list-group-item list-group-item-action" data-toggle="list" href="#sec6" role="tab" aria-controls="settings">Submission<span class="badge badge-primary badge-pill">8</span></a>
    <a class="list-group-item list-group-item-action" data-toggle="list" href="#sec7" role="tab" aria-controls="settings">The secret to 100% accuracy<span class="badge badge-primary badge-pill">9</span></a>    
    <a class="list-group-item list-group-item-action" data-toggle="list" href="#sec8" role="tab" aria-controls="settings">References and Some last words<span class="badge badge-primary badge-pill">10</span></a>  
</div>

Let us start with some basics.

> If you want to read about the EDA of this data and see how the problem can be approached with Naive Bayes, refer to this notebook >>> [NLP w/ Disaster tweets!(Explained)](https://www.kaggle.com/friskycodeur/nlp-w-disaster-tweets-explained)

<a id="bert"></a>
<h1 align='left'> BERT </h1>

BERT stands for Bidirectional Encoder Representations from Transformers. It is built by stacking a number of Transformer encoder blocks. Because it is pre-trained on large amounts of text, BERT already has a general understanding of language and can be applied to a variety of problems such as question answering, sentiment analysis, text summarization, etc. <br>
Steps to use a BERT model: <br>
- Pre-training BERT: to understand language
- Fine-tuning BERT: to help us with our specific task

#### Pretraining BERT 
- The goal is to make BERT learn what language is.
- It has two parts: Masked Language Modelling (MLM) and Next Sentence Prediction (NSP).
- Both of these tasks are trained simultaneously.
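
As a rough illustration of the masked language modelling objective, the sketch below hides a fraction of the tokens and records the originals as prediction targets. It is a simplification: the 15% rate follows the BERT paper, but the paper's rule of sometimes substituting a random token or leaving the token unchanged is omitted. The `mask_tokens` helper and the sample sentence are made up for this example and are not part of any library.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for a toy MLM example.

    targets maps each masked position to the original token that the
    model would be trained to predict.
    """
    rng = random.Random(seed)
    # Mask at least one token so that short inputs still yield a target.
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = "[MASK]"
    return masked, targets

tokens = "heavy rain causes flooding in the city centre".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

During pre-training the model sees `masked` as input and is scored on how well it recovers the tokens stored in `targets`.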

#### Fine tuning BERT
- It is quite a fast process.
- Only the output layer's parameters are learnt from scratch, whereas the rest of the parameters are only slightly fine-tuned, which in turn makes the process faster.

If you want to read more about BERT, see the [original paper](https://arxiv.org/pdf/1810.04805.pdf).

***

<a id="sec1"></a>
## Importing required libraries and Data

In [None]:
# We will use the official tokenization script created by the Google team
!wget --quiet https://raw.githubusercontent.com/tensorflow/models/master/official/nlp/bert/tokenization.py

We will start by importing the libraries to be used and the dataset provided.

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
import tensorflow_hub as hub

import re
import tokenization
import string

In [None]:
train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")
submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")

<a id="sec2"></a>
## Defining helpful functions

In this part we will define some functions that we can use later to make the process smoother. <br><br>
First we will create a function that takes as input the texts we want to work on and the tokenizer we are using. It encodes the texts for BERT using the tokenizer and returns the tokens, masks and segments!

In [None]:
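def bert_encode(texts, tokenizer, max_len=512):
    """Turn raw texts into BERT's three inputs: token ids, padding masks and segment ids."""
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
        
        # Truncate so that [CLS] and [SEP] still fit within max_len
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        # Token ids, right-padded with 0 up to max_len
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        tokens += [0] * pad_len
        # 1 marks a real token, 0 marks padding
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # Single-sentence input, so every position belongs to segment 0
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)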
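
To make the three outputs concrete, here is a small sketch of the same layout for a single sentence. The whitespace tokenizer, the tiny vocabulary with its ids, and the `encode_one` helper are stand-ins invented for illustration; the real tokenizer uses WordPiece and the vocabulary file shipped with the BERT module.

```python
# Toy vocabulary (hypothetical ids; the real ones come from BERT's vocab file).
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "fire": 4, "in": 5, "the": 6, "forest": 7}

def encode_one(text, max_len=8):
    """Mirror the layout of bert_encode for one text, using a whitespace tokenizer."""
    pieces = text.lower().split()[:max_len - 2]   # leave room for [CLS] and [SEP]
    sequence = ["[CLS]"] + pieces + ["[SEP]"]
    pad_len = max_len - len(sequence)
    token_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in sequence] + [0] * pad_len
    mask = [1] * len(sequence) + [0] * pad_len    # 1 = real token, 0 = padding
    segment_ids = [0] * max_len                   # single-sentence input: all zeros
    return token_ids, mask, segment_ids

token_ids, mask, segment_ids = encode_one("Fire in the forest")
print(token_ids)    # [2, 4, 5, 6, 7, 3, 0, 0]
print(mask)         # [1, 1, 1, 1, 1, 1, 0, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 0, 0, 0]
```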

Now we will write a model-building function, which takes the BERT layer as input and returns the compiled model.

In [None]:
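def build_model(bert_layer, max_len=512):
    # The three inputs BERT expects, matching the outputs of bert_encode
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    input_mask = Input(shape=(max_len,), dtype=tf.int32, name="input_mask")
    segment_ids = Input(shape=(max_len,), dtype=tf.int32, name="segment_ids")

    # This TF Hub module returns (pooled_output, sequence_output); we only use the latter
    _, sequence_output = bert_layer([input_word_ids, input_mask, segment_ids])
    # Take the hidden state of the first token ([CLS]) as the tweet representation
    clf_output = sequence_output[:, 0, :]
    # A single sigmoid unit outputs the probability that the tweet is about a real disaster
    out = Dense(1, activation='sigmoid')(clf_output)
    
    model = Model(inputs=[input_word_ids, input_mask, segment_ids], outputs=out)
    # learning_rate replaces the deprecated lr argument
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    
    return model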
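
The slice `sequence_output[:, 0, :]` keeps only the vector at position 0 of every sequence, which is where `bert_encode` placed the `[CLS]` token. A small NumPy sketch with made-up dimensions shows the shape change; the sizes here are arbitrary and much smaller than BERT's.

```python
import numpy as np

batch_size, seq_len, hidden_size = 2, 5, 4   # arbitrary toy sizes
# Stand-in for BERT's per-token output: one hidden vector per token.
sequence_output = np.arange(batch_size * seq_len * hidden_size, dtype=float)
sequence_output = sequence_output.reshape(batch_size, seq_len, hidden_size)

# Keep the first token ([CLS]) of every sequence and drop the rest.
clf_output = sequence_output[:, 0, :]

print(sequence_output.shape)  # (2, 5, 4)
print(clf_output.shape)       # (2, 4)
```

The `Dense` layer then maps each of these per-tweet vectors to a single probability.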

<a id="sec3"></a>
## Data cleaning

Now we come to the data cleaning part.<br><br>
First off, let's convert everything to lowercase.

In [None]:
def lowercase_text(text):
    return text.lower()

train.text = train.text.apply(lowercase_text)
test.text = test.text.apply(lowercase_text)

Now we will remove noise from the text: text in square brackets, URLs, HTML tags, punctuation, newlines and words containing digits.

In [None]:
def remove_noise(text):
    # Raw strings keep the regex backslashes from being treated as string escapes
    text = re.sub(r'\[.*?\]', '', text)                               # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # URLs
    text = re.sub(r'<.*?>+', '', text)                                # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                    # newlines
    text = re.sub(r'\w*\d\w*', '', text)                              # words containing digits
    return text

train.text = train.text.apply(remove_noise)
test.text = test.text.apply(remove_noise)
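
To see what each pattern does, here is a quick check on a single made-up, already lowercased tweet. The function is repeated so the snippet runs on its own; the sample text is invented for illustration and does not come from the dataset.

```python
import re
import string

def remove_noise(text):
    """Same patterns as the cleaning function above, written with raw strings."""
    text = re.sub(r'\[.*?\]', '', text)                               # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # URLs
    text = re.sub(r'<.*?>+', '', text)                                # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                                    # newlines
    text = re.sub(r'\w*\d\w*', '', text)                              # words containing digits
    return text

sample = "breaking: flood at <b>main st</b> [video] https://t.co/abc123 2nd alert!!!"
cleaned = remove_noise(sample)

# Removing whole tokens leaves runs of spaces behind, so collapse them for display.
print(" ".join(cleaned.split()))  # breaking flood at main st alert
```

Note that the cleaned text can contain consecutive spaces where tokens were removed. The BERT tokenizer splits on whitespace, so this should not affect the encoded result.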

Let's see how our data looks now. It should be cleaner.

In [None]:
train.text.head(5)

<a id="sec4"></a>
## Loading pre-trained BERT

- First we will load BERT from TensorFlow Hub
- From the BERT layer we will load the tokenizer
- We will encode the data and convert it into BERT's input format

In [None]:
module_url = "https://tfhub.dev/tensorflow/bert_en_uncased_L-24_H-1024_A-16/1"
bert_layer = hub.KerasLayer(module_url, trainable=True)

We will now load the tokenizer from our BERT layer!

In [None]:
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
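
Under the hood, this tokenizer splits words into WordPiece sub-word units using a greedy longest-match-first search over the vocabulary. The following is a minimal sketch of that idea with a tiny made-up vocabulary. It is not the library's code, and `wordpiece` and `toy_vocab` are hypothetical names used only here.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]                 # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"earth", "##quake", "flood", "##ing", "##s"}
print(wordpiece("earthquakes", toy_vocab))  # ['earth', '##quake', '##s']
print(wordpiece("flooding", toy_vocab))     # ['flood', '##ing']
print(wordpiece("tsunami", toy_vocab))      # ['[UNK]']
```

This is why rare or misspelled words in tweets can still be represented: they are broken into smaller pieces that the vocabulary does contain.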

Now let's encode our text using this tokenizer and the **bert_encode** function we created earlier.

In [None]:
train_input = bert_encode(train.text.values, tokenizer, max_len=160)
test_input = bert_encode(test.text.values, tokenizer, max_len=160)
train_labels = train.target.values

<a id="sec5"></a>
## Modelling

We have loaded the pre-trained BERT layer and encoded our data; now we will build our model on top of it.

In [None]:
model = build_model(bert_layer, max_len=160)
model.summary()

Let's train our model now and see how it's doing.<br>
You can try and play with epochs and batch_size to see if you can get better accuracy.<br>
That's how I got from 82% to 84.5% ;)

In [None]:
checkpoint = ModelCheckpoint('model.h5', monitor='val_loss', save_best_only=True)

train_history = model.fit(
    train_input, train_labels,
    validation_split=0.2,
    epochs=2,
    callbacks=[checkpoint],
    batch_size=15
)
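
Keras takes the `validation_split` fraction from the end of the input arrays, before any shuffling, so the last 20% of the rows become the validation set. The sketch below mimics that arithmetic and the resulting number of steps per epoch. The sample count is a made-up placeholder and `split_sizes` is a hypothetical helper, not a Keras function.

```python
import math

def split_sizes(n_samples, validation_split, batch_size):
    """Return (n_train, n_val, steps_per_epoch) for a Keras-style tail split."""
    n_train = int(n_samples * (1.0 - validation_split))  # rows before the split point
    n_val = n_samples - n_train                           # last rows, held out
    steps_per_epoch = math.ceil(n_train / batch_size)     # the final batch may be smaller
    return n_train, n_val, steps_per_epoch

n_train, n_val, steps = split_sizes(n_samples=1000, validation_split=0.2, batch_size=15)
print(n_train, n_val, steps)  # 800 200 54
```

A smaller `batch_size` means more weight updates per epoch, which is one reason changing it can shift the final accuracy.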

In [None]:
# Per-epoch loss and accuracy for the training and validation sets
metrics = pd.DataFrame(train_history.history)
metrics

<a id="sec6"></a>
## Submission

We have successfully loaded, built and trained our model.<br>
Now it is time to get the predictions and submit our solution.

In [None]:
model.load_weights('model.h5')
test_pred = model.predict(test_input)

In [None]:
submission['target'] = test_pred.round().astype(int)
submission.to_csv('submission.csv', index=False)
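
`test_pred` holds sigmoid probabilities of shape `(n, 1)`, and `round()` turns them into 0/1 labels. NumPy rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes 0. If you prefer to make the cut-off explicit, a comparison does the same job. The probabilities below are made up for illustration.

```python
import numpy as np

probs = np.array([[0.08], [0.50], [0.51], [0.97]])  # made-up sigmoid outputs

rounded = probs.round().astype(int).ravel()          # what the notebook does
thresholded = (probs > 0.5).astype(int).ravel()      # explicit cut-off at 0.5

print(rounded)      # [0 0 1 1]
print(thresholded)  # [0 0 1 1]
```

Writing the threshold out also makes it easy to try a different cut-off if the validation results suggest one.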

<a id="sec7"></a>
## The secret to 100% accuracy

100% accuracy is not really possible in this problem, as far as I have seen with my models and from reading the kernels of other Kagglers. Still, there are people at the top of the leaderboard with 100% accuracy, so what is the secret?
<br>
The secret is simply leaked labels. The labels of the test set used to compute our score are available on another site, and hence a lot of people are using them to reach that 100% accuracy mark.
<br>
This competition is for learning purposes only, not for ranks or anything.<br>
Here is a notebook explaining it [A Real Disaster - Leaked Label](https://www.kaggle.com/szelee/a-real-disaster-leaked-label)

<a id="sec8"></a>
## References
<br>

- [NLP - EDA, Bag of Words, TF IDF, GloVe, BERT](https://www.kaggle.com/vbmokin/nlp-eda-bag-of-words-tf-idf-glove-bert)
- [Disaster NLP: Keras BERT using TFHub](https://www.kaggle.com/xhlulu/disaster-nlp-keras-bert-using-tfhub)

# Some last words:

Thank you for reading! I'm still a beginner and want to improve myself in every way I can. So if you have any ideas or feedback, please let me know in the comments section!


<div align='center'><font size="3" color="#000000"><b>And again please star if you liked this notebook so it can reach more people, Thanks!</b></font></div>

<img src='https://thumbs.gfycat.com/EnormousRegularClam-size_restricted.gif'>