# BERT

BERT stands for Bidirectional Encoder Representations from Transformers. BERT is a “deeply bidirectional” model. Bidirectional means that BERT learns information from both the left and the right side of a token’s context during the training phase.

The bidirectionality of a model is important for truly understanding the meaning of a language. Let’s see an example to illustrate this. There are two sentences in this example and both of them involve the word “bank”:
![BERT captures both the left and right context](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/sent_context.png)

If we try to predict the nature of the word “bank” by only taking either the left or the right context, then we will be making an error in at least one of the two given examples.

One way to deal with this is to consider both the left and the right context before making a prediction. That’s exactly what BERT does! Traditionally, language models were either trained to predict the next word from the left-to-right context only (as in GPT), or they combined separately trained left-to-right and right-to-left models through a shallow concatenation (as in ELMo). Either way, part of the context was lost, which made the models more prone to errors.

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/bert-vs-openai-.jpg)

It’s evident from the above image: BERT is bidirectional, GPT is unidirectional (information flows only from left to right), and ELMo is shallowly bidirectional.
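
The difference between unidirectional and bidirectional context can be made concrete with a small visibility table. The sketch below is only an illustration, not code from BERT or GPT: `visibility_mask` is a hypothetical helper, the sentence is made up, and a `1` in row *i*, column *j* means that token *i* may use token *j* as context.

```python
def visibility_mask(num_tokens, bidirectional):
    """Return a num_tokens x num_tokens table of 0/1 flags (hypothetical helper).

    mask[i][j] == 1 means token i may look at token j.
    """
    mask = []
    for i in range(num_tokens):
        row = []
        for j in range(num_tokens):
            if bidirectional:
                row.append(1)                    # every token sees every token
            else:
                row.append(1 if j <= i else 0)   # only tokens at or before position i
        mask.append(row)
    return mask


# Made-up example sentence
tokens = ["the", "bank", "of", "the", "river"]
left_to_right = visibility_mask(len(tokens), bidirectional=False)
both_sides = visibility_mask(len(tokens), bidirectional=True)

for name, mask in [("left-to-right", left_to_right), ("bidirectional", both_sides)]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>6} {row}")
```

In the left-to-right table the row for "bank" has zeros in the columns for "of the river", so a purely left-to-right model cannot use "river" to decide which kind of bank is meant. In the bidirectional table every row is fully populated.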

BERT is pre-trained on two NLP tasks:

* Masked Language Modeling
* Next Sentence Prediction

(This excerpt is taken from:
https://www.analyticsvidhya.com/blog/2019/09/demystifying-bert-groundbreaking-nlp-framework/
Please visit the link for a detailed explanation.)
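
To give a feel for the Masked Language Modeling task listed above, here is a simplified sketch of how a training input could be corrupted. It rests on several assumptions: `mask_tokens` is a hypothetical helper, tokens are whole words rather than WordPiece ids, the sentence is made up, and every selected position is replaced with `[MASK]` (the original BERT recipe selects about 15% of the tokens and, for a fraction of them, swaps in a random token or keeps the original instead).

```python
import random


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and remember what was hidden.

    Returns (corrupted_tokens, targets), where targets maps each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[position] = "[MASK]"
            targets[position] = token
    return corrupted, targets


# Made-up sentence; a higher mask_prob is used so the short example shows some masks
sentence = "there is a forest fire near the river bank".split()
corrupted, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(corrupted)
print(targets)
```

During pre-training the model sees `corrupted` and is asked to recover the entries of `targets`, which it can only do well by using the words on both sides of each mask.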

The model structure implemented in this notebook is similar to this:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2019/09/bert_pipeline2.png)

In [None]:
import re
import pandas as pd
import numpy as np
from tqdm import tqdm
import transformers
import tensorflow.keras as keras 
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt

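pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 20)
# None disables column-width truncation (the older -1 value has been deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)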

# The Dataset
### 1-real disaster
### 0-not a disaster

In [None]:
df = pd.read_csv('../input/nlp-getting-started/train.csv')
print(len(df))
print(df.columns)

df

In [None]:
sns.countplot(x = 'target', data = df)
plt.xlabel('Classes')
plt.ylabel('Count')
plt.show()

# Text Cleaning and Preprocessing

In [None]:
def clean_text(text):
    
    # Remove http/https links
    text = re.sub(r'http\S+', '', text)  
    
    # Remove mentions
    text = re.sub(r"(?:\@)\w+", '', text)
    
    # Remove any character that is not a letter, a digit, whitespace, or one of ' . , ? $ &
    text = re.sub(r'[^a-zA-Z0-9\'.,?$&\s]', '', text)  
    
    # Lower case all the alphabets
    text = text.lower()
    
    return text

# Let's view some random tweets with their cleaned versions
for i in range(10):
    index = np.random.randint(low=0, high=len(df))
    print('Raw text:', df['text'][index])
    print('Cleaned text:', clean_text(df['text'][index]))
    print('Label: ', df['target'][index], '\n')
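
As a quick sanity check of the cleaning rules above, the sketch below applies the same regular expressions to a made-up tweet (the text, link and handle are invented for illustration). The whitespace-collapsing step at the end is an optional extra and is not part of the notebook's `clean_text`.

```python
import re


def clean_text(text):
    text = re.sub(r'http\S+', '', text)                 # drop links
    text = re.sub(r'@\w+', '', text)                    # drop mentions
    text = re.sub(r"[^a-zA-Z0-9'.,?$&\s]", '', text)    # drop other symbols
    return text.lower()


raw = "Forest FIRE near La Ronge!! http://t.co/abc123 @some_user #wildfire"
cleaned = clean_text(raw)
print(repr(cleaned))

# Removing links and mentions leaves runs of spaces behind;
# collapsing them is an optional extra step.
collapsed = " ".join(cleaned.split())
print(repr(collapsed))
```

Note that the `#` symbol is stripped but the hashtag word itself is kept, so the information carried by hashtags such as "wildfire" still reaches the model.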

# Tokenizing the Inputs

In [None]:
# Tokenizing the sentences 
def convert_to_features(data, tokenizer, max_len=None):
    # Replace newlines with spaces so that words on adjacent lines are not glued together
    data = data.replace('\n', ' ')
    
    # Return a dictionary containing 'input_ids', 'attention_mask' & 'token_type_ids' each of shape (1, max_len)
    if max_len is not None:
        tokenized = tokenizer.encode_plus(
            data, 
            padding ='max_length',
            max_length=max_len, 
            truncation=True,
            return_tensors='np',
            return_attention_mask=True,
            return_token_type_ids=True,
        )
        
    else:
        tokenized = tokenizer.encode_plus(
            data,  
            return_tensors='np',
            return_attention_mask=True,
            return_token_type_ids=True,
        )
    return tokenized

# Create dataset for data with labels
def create_inputs_with_targets(x, y, tokenizer, max_len=128):
    
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
        'labels': []
    }
    
    # zip() has no length, so pass total explicitly to let tqdm show a progress bar
    for sentence, label in tqdm(zip(x, y), total=len(x)):
        cleaned_sentence = clean_text(sentence)
        temp = convert_to_features(cleaned_sentence, tokenizer, max_len=max_len)
        dataset_dict["input_ids"].append(temp["input_ids"][0])
        dataset_dict["attention_mask"].append(temp["attention_mask"][0])
        dataset_dict["labels"].append(label)

    x = [
        np.array(dataset_dict["input_ids"]),
        np.array(dataset_dict["attention_mask"]),
    ]
    
    y = np.array(dataset_dict['labels'])
    
    return x, y

# Create dataset for data without labels
def create_inputs_without_targets(x, tokenizer, max_len=128):
    
    dataset_dict = {
        "input_ids": [],
        "attention_mask": [],
    }
    
    for sentence in tqdm(x):
        cleaned_sentence = clean_text(sentence)
        temp = convert_to_features(cleaned_sentence, tokenizer, max_len=max_len)
        dataset_dict["input_ids"].append(temp["input_ids"][0])
        dataset_dict["attention_mask"].append(temp["attention_mask"][0])

    x = [
        np.array(dataset_dict["input_ids"]),
        np.array(dataset_dict["attention_mask"]),
    ]
    
    return x
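
The effect of `padding='max_length'` and `truncation=True` on `input_ids` and `attention_mask` can be pictured with a toy stand-in. This is a simplified sketch, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the ids between the special tokens are arbitrary, and unlike the real tokenizer it does not re-append the closing `[SEP]` token after truncating. In `bert-base-uncased`, 101 is `[CLS]`, 102 is `[SEP]` and 0 is `[PAD]`.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Return (input_ids, attention_mask), both of length max_len."""
    kept = token_ids[:max_len]                        # truncation
    num_pad = max_len - len(kept)                     # how much padding is needed
    input_ids = kept + [pad_id] * num_pad
    attention_mask = [1] * len(kept) + [0] * num_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Arbitrary ids wrapped in [CLS] (101) and [SEP] (102)
short_ids = [101, 7, 8, 102]
long_ids = [101, 7, 8, 9, 10, 11, 12, 102]

short_padded = pad_or_truncate(short_ids, max_len=6)
long_truncated = pad_or_truncate(long_ids, max_len=6)
print(short_padded)
print(long_truncated)
```

The attention mask is what lets the encoder ignore the padded positions, which is why the functions above collect it alongside the ids.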

In [None]:
# I will be using the BERT base model here
base_model = 'bert-base-uncased'

bert_tokenizer = transformers.BertTokenizer.from_pretrained(base_model)
max_len = 80

In [None]:
# Splitting the dataframe into 80-20% training and validation dataframes
validation_data_indices = df.sample(frac=0.2).index
validation_df = df.loc[validation_data_indices, :].reset_index(drop=True)
train_df = df.drop(validation_data_indices, axis=0).reset_index(drop=True)
test_df = pd.read_csv('../input/nlp-getting-started/test.csv')

x_train, y_train = create_inputs_with_targets(list(train_df['text']), list(train_df['target']), tokenizer=bert_tokenizer, max_len=max_len)
x_val, y_val = create_inputs_with_targets(list(validation_df['text']), list(validation_df['target']), tokenizer=bert_tokenizer, max_len=max_len)
x_test = create_inputs_without_targets(list(test_df['text']), tokenizer=bert_tokenizer, max_len=max_len)

print('Training dataframe size: ', len(train_df))
print('Validation dataframe size: ', len(validation_df))
print('Test dataframe size: ', len(test_df))
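
The split above uses `df.sample(frac=0.2)` without a seed, so the validation set changes on every run. As a possible variant, the sketch below shows a seeded split that also preserves the class balance. It is an illustration only: `split_indices` is a hypothetical helper and the labels are made up. scikit-learn's `train_test_split` offers the same behaviour through its `random_state` and `stratify` parameters.

```python
import random


def split_indices(labels, val_fraction=0.2, seed=42):
    """Split row indices into train/validation lists, keeping the class balance."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


labels = [0] * 60 + [1] * 40          # made-up labels with a 60/40 class mix
train_idx, val_idx = split_indices(labels)
print(len(train_idx), len(val_idx))
print(sum(labels[i] for i in val_idx) / len(val_idx))   # share of class 1 in validation
```

With a fixed seed the same rows land in the validation set every time, which makes validation scores comparable across runs.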

# Building The Model

In [None]:
def create_model(model_name, max_len=128):
    
    
    # BERT encoder
    encoder = transformers.TFAutoModel.from_pretrained(model_name)
    
    # Unfreeze the base model weights so that BERT is fine-tuned end to end
    encoder.trainable = True

    # Define input shapes 
    input_ids = keras.layers.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = keras.layers.Input(shape=(max_len,), dtype=tf.int32)
     
    sequence_output = encoder(input_ids, attention_mask=attention_mask)['last_hidden_state']
    
    # Add a trainable BiLSTM on top of the (also trainable) encoder to adapt the pretrained features to the new data.
    bi_lstm = tf.keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(sequence_output)
    
    # Applying hybrid pooling approach 
    avg_pool = tf.keras.layers.GlobalAveragePooling1D()(bi_lstm)
    max_pool = tf.keras.layers.GlobalMaxPooling1D()(bi_lstm)
    
    concat = tf.keras.layers.concatenate([avg_pool, max_pool])
    dropout = tf.keras.layers.Dropout(0.3)(concat)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(dropout)
    
    model = tf.keras.models.Model(
        inputs=[input_ids, attention_mask], outputs=[output]
    )
    
    return model
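
The "hybrid pooling" step in `create_model` concatenates the average and the maximum of each feature over the time axis. Here is a minimal plain-Python sketch of that idea, assuming a tiny made-up sequence; `hybrid_pool` is a hypothetical helper, not a Keras function.

```python
def hybrid_pool(sequence):
    """Concatenate the per-feature mean and max over the time axis.

    `sequence` is a list of time steps, each a list of feature values.
    """
    num_steps = len(sequence)
    columns = list(zip(*sequence))                     # one tuple per feature
    avg_pool = [sum(col) / num_steps for col in columns]
    max_pool = [max(col) for col in columns]
    return avg_pool + max_pool


# 3 time steps, 2 features per step (made-up numbers)
sequence = [[1.0, 4.0],
            [3.0, 0.0],
            [2.0, 2.0]]
pooled = hybrid_pool(sequence)
print(pooled)
```

The pooled vector is twice as wide as a single time step. In the model above, the bidirectional LSTM with 64 units per direction should yield 128 features per step, so the concatenated vector fed to the dropout layer would have 256 entries.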

# Initializing and Training the Model on TPU 

In [None]:
epochs = 20
lr = 2e-4

# Initialize the model on TPU (set use_tpu = False to run on CPU/GPU)
use_tpu = True
if use_tpu:
    # Create distribution strategy
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    # On TensorFlow versions before 2.3, use tf.distribute.experimental.TPUStrategy instead
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default strategy: runs on whatever single device is available
    strategy = tf.distribute.get_strategy()

# Create and compile the model inside the strategy scope
with strategy.scope():
    model = create_model(base_model, max_len=max_len)

    optimizer = keras.optimizers.Adam(learning_rate=lr)

    model.compile(optimizer=optimizer,
                  loss=keras.losses.BinaryCrossentropy(),
                  metrics=[keras.metrics.BinaryAccuracy()])

model.summary()
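
The model is compiled with binary cross-entropy, which penalises confident wrong predictions far more than hesitant ones. The following is a small plain-Python sketch of that loss for intuition only; `binary_cross_entropy` is a hypothetical helper with made-up inputs, not the Keras implementation.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        prob = min(max(prob, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))
    return total / len(y_true)


# Same targets, predictions that are confidently right vs. confidently wrong
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))
```

The second value is more than twenty times larger than the first, which is the behaviour that pushes the sigmoid output towards the correct class during training.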

In [None]:
# Train the model
my_callbacks = [keras.callbacks.EarlyStopping(monitor='val_binary_accuracy', patience=2, mode='max', restore_best_weights=True)]

hist = model.fit(x_train, 
                 y_train,
                 validation_data = (x_val, y_val),
                 epochs= epochs, 
                 batch_size= 128,
                 callbacks = my_callbacks,
                 verbose= 1)
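
The `EarlyStopping` callback above stops training once the validation accuracy has failed to improve for `patience` consecutive epochs, and `restore_best_weights=True` rolls the model back to its best epoch. The sketch below is a simplified re-implementation of that idea, not the Keras source; `early_stopping_epoch` is a hypothetical helper and the accuracy history is made up.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the score has failed to improve for `patience`
    consecutive epochs; best_epoch is where the weights would be restored.
    """
    best_score = float("-inf")
    best_epoch = None
    wait = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training ends with the last epoch
    return len(val_scores) - 1, best_epoch


# Made-up validation accuracies, one per epoch
history = [0.78, 0.81, 0.80, 0.79, 0.83]
stop_epoch, best_epoch = early_stopping_epoch(history, patience=2)
print(stop_epoch, best_epoch)
```

In this made-up history training stops after the fourth epoch, so the later jump to 0.83 is never reached. That is the trade-off of a small `patience`: less wasted compute, at the risk of stopping before a late improvement.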

# Submission

In [None]:
predictions = model.predict(x_test)

ids = list(test_df['id'])
# Threshold the sigmoid outputs at 0.5 to get integer 0/1 labels
target = [int(p[0] > 0.5) for p in predictions]

sub = pd.DataFrame({'id':ids, 'target':target}, index=None)
sub.to_csv('submission.csv', index=False)
sub
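
The model's sigmoid output is a probability, so turning it into a 0/1 label amounts to choosing a cutoff. The sketch below uses made-up probabilities and a hypothetical helper, `to_labels`, to show how the cutoff changes the predicted labels.

```python
def to_labels(probabilities, threshold=0.5):
    """Map probabilities to integer 0/1 labels using a fixed cutoff."""
    return [int(p > threshold) for p in probabilities]


probs = [0.03, 0.35, 0.51, 0.97]      # made-up model outputs
default_labels = to_labels(probs)
lenient_labels = to_labels(probs, threshold=0.3)
print(default_labels)
print(lenient_labels)
```

Lowering the cutoff flags more tweets as disasters, which trades precision for recall. Any value other than 0.5 is best chosen on the validation set rather than on the test predictions.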

# Do not submit this

### Here is the perfect submission file. It is included only in case you want to test your model's performance.

In [None]:
# !git clone https://github.com/mitramir55/Kaggle_NLP_competition.git
# perfect = pd.read_csv('Kaggle_NLP_competition/perfect_submission.csv')
# cheat = list(perfect['target'])