# <p style="background-color:#80ccff; font-family:newtimeroman; font-size:150%; text-align:center; border-radius:  80px 5px; padding-top:8px; padding-bottom:8px;">BERT - Bidirectional Encoder Representations from Transformers</p>

# Attention Is All You Need

BERT (Bidirectional Encoder Representations from Transformers) is a deep learning language model for NLP developed by Google.

The main technical innovation is bidirectional training: instead of reading the text only from left to right or only from right to left, the model uses the context on both sides of each word. The result is a more deeply trained model than a single-direction model, yielding better results.

## Transformer

<img src="https://i2.wp.com/neptune.ai/wp-content/uploads/Transformer-network.png?resize=400%2C549&ssl=1">

The original Transformer model is a stack of 6 layers. The output of layer *l* is the input of layer *l+1* until the 
final prediction is reached. There is a 6-layer encoder stack on the left and a 6-layer decoder stack on the right. 

On the left, the inputs enter the encoder side of the Transformer through an attention sub-layer and a FeedForward Network (FFN) sub-layer. On the right, the target outputs go into the decoder side of the Transformer through two attention sub-layers and an FFN sub-layer. We immediately notice that there is no RNN, LSTM, or CNN. Recurrence has been abandoned. 

The attention mechanism is a "word-to-word" operation. It finds how each word is related to all other words in a sequence, including the word being analyzed itself. 

<img src="https://cdn-images-1.medium.com/max/800/1*X92uPDSMofn49e3oyOwf-Q.png">

Because every word is scored against every other word, the attention mechanism captures deeper relationships between words, including distant ones, and produces better results.
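
The "word-to-word" scoring can be sketched with scaled dot-product attention in NumPy. This is a minimal illustration, not the notebook's model: the toy input `x` is random, and it is used directly as queries, keys and values, whereas a real Transformer first applies learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q, k, v: arrays of shape (seq_len, d_k). Returns (output, weights)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # (seq_len, seq_len) word-to-word scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # 4 toy "words", 8 features each (assumption)
out, attn_weights = scaled_dot_product_attention(x, x, x)
print(attn_weights.shape)                # (4, 4)
```

Row *i* of `attn_weights` says how much word *i* attends to every word in the sequence, itself included.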

For each attention sub-layer, the original Transformer model runs not one but eight attention mechanisms in parallel to speed up the calculations. This process is named "multi-head attention".
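
As a rough sketch of what "eight in parallel" means for tensor shapes: in the original Transformer the model dimension is 512, so each of the 8 heads works on a 64-dimensional slice. The helper names `split_heads` and `merge_heads` are hypothetical, and the learned projections are left out.

```python
import numpy as np

d_model, num_heads = 512, 8
d_head = d_model // num_heads            # 64 dimensions per head

def split_heads(x, num_heads):
    """(seq_len, d_model) -> (num_heads, seq_len, d_head)."""
    seq_len, d_model = x.shape
    return x.reshape(seq_len, num_heads, d_model // num_heads).transpose(1, 0, 2)

def merge_heads(x):
    """(num_heads, seq_len, d_head) -> (seq_len, d_model)."""
    num_heads, seq_len, d_head = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)

tokens = np.random.default_rng(0).normal(size=(10, d_model))  # 10 toy positions
heads = split_heads(tokens, num_heads)   # attention would run independently on each head
restored = merge_heads(heads)            # heads are concatenated back to d_model
print(heads.shape, restored.shape)       # (8, 10, 64) (10, 512)
```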

## Input Embedding

<img src="https://iq.opengenus.org/content/images/2020/06/encoder-1.png">

The input embedding sub-layer converts the input tokens to vectors of dimension d_model = 512 using learned embeddings in the original Transformer model.

The Transformer contains a learned embedding sub-layer. Many embedding methods can be applied to the tokenized input.
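
An embedding sub-layer is essentially a lookup table with one row per token. The sketch below assumes a toy vocabulary of 1000 tokens and made-up token IDs; in a trained model the table values are learned rather than random.

```python
import numpy as np

vocab_size, d_model = 1000, 512          # toy vocabulary size (assumption)
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([12, 7, 431, 7])    # hypothetical token IDs
embedded = embedding_table[token_ids]    # one row lookup per token
print(embedded.shape)                    # (4, 512)
```

The same token ID always maps to the same vector, so positions 1 and 3 above receive identical embeddings.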

## Multi-Head Attention

The multi-head attention sub-layer contains eight heads and is followed by post-layer normalization, which adds a residual connection to the output of the sub-layer and normalizes the result.
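
Post-layer normalization can be written as `LayerNorm(x + Sublayer(x))`. The sketch below is a simplification: it omits the learned gain and bias of a real layer normalization, and the sub-layer is a stand-in lambda rather than an attention block.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position (row) to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by normalization.
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).normal(size=(4, 512))
y = add_and_norm(x, lambda h: 0.5 * h)   # hypothetical stand-in sub-layer
print(y.shape)                           # (4, 512)
```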

The input of the multi-head attention sub-layer of the first layer of the encoder stack is a vector that contains the embedding and the positional encoding of each word. 
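
The original Transformer paper uses fixed sinusoidal positional encodings, with sine on the even dimensions and cosine on the odd ones. A minimal sketch follows; the sequence length of 30 is chosen only for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]      # the "2i" indices
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding(seq_len=30, d_model=512)
print(pe.shape)                                        # (30, 512)
```

The vector fed to the first encoder layer is then the element-wise sum of each token embedding and the row of `pe` for its position.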

## Decoder

The structure of the decoder layers remains the same as the encoder for all the N=6 layers of the Transformer model. Each layer contains three sub-layers: a multi-headed masked attention mechanism, a multi-headed attention mechanism, and a fully connected position-wise feedforward network. 
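
"Position-wise" means that the same two-layer network is applied to every position independently. In the original Transformer the inner dimension is 2048 and the activation is ReLU. The weights below are random placeholders, not trained values.

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

def position_wise_ffn(x):
    """x: (seq_len, d_model). Computes max(0, x W1 + b1) W2 + b2 for each row."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

x = rng.normal(size=(5, d_model))        # 5 toy positions
ffn_out = position_wise_ffn(x)
print(ffn_out.shape)                     # (5, 512)
```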

The decoder has a third main sub-layer, which is the masked multi-head attention mechanism. In this sub-layer's output, at a given position, the following words are masked so that the Transformer bases its assumptions on its own inferences without seeing the rest of the sequence. That way, the model cannot see future parts of the sequence.

## Attention Layers

The Transformer is an autoregressive model. It uses the previous output sequences as an additional input. The multi-head attention layers of the decoder use the same process as the encoder.
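
Autoregressive generation can be sketched as a loop that feeds the tokens produced so far back in as input. `toy_next_token` is a hypothetical stand-in for a real decoder, used only to make the loop runnable.

```python
def toy_next_token(prefix):
    """Hypothetical stand-in for a decoder: returns the last token + 1, capped at 5."""
    return min(prefix[-1] + 1, 5)

def greedy_decode(next_token_fn, start_token, eos_token, max_len):
    output = [start_token]
    while len(output) < max_len:
        token = next_token_fn(output)    # sees only what has been generated so far
        output.append(token)
        if token == eos_token:
            break
    return output

decoded = greedy_decode(toy_next_token, start_token=0, eos_token=5, max_len=10)
print(decoded)                           # [0, 1, 2, 3, 4, 5]
```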

However, the masked multi-head attention sub-layer 1 only lets attention apply to the positions up to and including the current position. The future words are hidden from the Transformer, and this forces it to learn how to predict. 
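
One common way to hide future positions is a lower-triangular mask applied to the attention scores before the softmax. The sketch below uses all-zero scores so the effect of the mask alone is visible; the function names are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    # True where attention is allowed: positions up to and including the current one.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)             # hide future positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)                                   # exp(-inf) is exactly 0
    return e / e.sum(axis=-1, keepdims=True)

mask = causal_mask(4)
weights = masked_softmax(np.zeros((4, 4)), mask)
print(weights)
```

With uniform scores, row *i* spreads its weight evenly over positions 0 to *i* and assigns zero weight to everything after it.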

# <p style="background-color:#80ccff; font-family:newtimeroman; font-size:150%; text-align:center; border-radius:  80px 5px; padding-top:8px; padding-bottom:8px;">Import</p>

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
plt.style.use('fivethirtyeight')

import re
import string

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau, LearningRateScheduler

import transformers
from tqdm.notebook import tqdm
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer

import warnings
warnings.filterwarnings('ignore')

In [1]:
data = pd.read_csv('../input/nlp-getting-started/train.csv')

# <p style="background-color:#80ccff; font-family:newtimeroman; font-size:150%; text-align:center; border-radius:  80px 5px; padding-top:8px; padding-bottom:8px;">EDA</p>

In [1]:
data.head()

In [1]:
data.drop(columns=['keyword','location'], inplace=True)

In [1]:
data.head()

In [1]:
counts = data['target'].value_counts().sort_index()  # sort by class so the labels line up with the values
fig = go.Figure(data=[go.Pie(labels=['Count 0','Count 1'], values=counts.values)])
fig.update_layout(template="plotly_dark",title={'text': "Count of Type",'y':0.9,
                                                'x':0.45,'xanchor': 'center','yanchor': 'top'},
                  font=dict(size=18, color='white', family="Courier New, monospace"))
fig.show()

In [1]:
data['message_len'] = data['text'].apply(lambda x: len(x.split(' ')))
data.head()

In [1]:
fig = px.histogram(data, x='message_len')
fig.update_layout(template="plotly_dark",title={'text': "Phrase Length",'y':0.9,
                                                'x':0.45,'xanchor': 'center','yanchor': 'top'},
                  font=dict(size=18, color='white', family="Courier New, monospace"))
fig.show()

# <p style="background-color:#80ccff; font-family:newtimeroman; font-size:150%; text-align:center; border-radius:  80px 5px; padding-top:8px; padding-bottom:8px;">Preprocessing</p>

### Function to clean the text and prepare it for the tokenizer

In [1]:
def text_clear(data):
    tx = data.apply(lambda x: re.sub(r"http\S+", '', str(x)))  # remove URLs
    # Remove hashtags and mentions before '#', '@' and '_' are stripped, otherwise these patterns never match
    tx = tx.apply(lambda x: re.sub(r'(#[A-Za-z]+[A-Za-z0-9_-]+)', '', x)) # remove hashtags
    tx = tx.apply(lambda x: re.sub(r'(@[A-Za-z]+[A-Za-z0-9_-]+)', '', x)) # remove @user mentions
    # Match "rt" as a whole word only, so words such as "earthquake" stay intact
    tx = tx.apply(lambda x: re.sub(r'\brt\b', '', x, flags=re.IGNORECASE)) # remove retweet markers
    tx = tx.apply(lambda x: re.sub(u'[^a-zA-Z0-9áéíóúÁÉÍÓÚâêîôÂÊÎÔãõÃÕçÇ: ]', '',x))
    tx = tx.apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
    tx = tx.apply(lambda x: re.sub(' +', ' ', x)) # collapse repeated spaces
    return tx

In [1]:
data['text'] = text_clear(data['text'])
data.head()

### Token, Attention Mask and Padding

In [1]:
# Use the tokenizer that matches the 'bert-base-uncased' model loaded below
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def bert_encode(data, maximum_length):
    input_ids = []
    attention_masks = []

    for text in data:
        encoded = tokenizer.encode_plus(
            text, 
            add_special_tokens=True,
            max_length=maximum_length,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            truncation=True,
            return_attention_mask=True
        )
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
        
    return np.array(input_ids),np.array(attention_masks)

In [1]:
texts = data['text']
target = data['target']

train_input_ids, train_attention_masks = bert_encode(texts,30)
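
To make the two returned arrays concrete, here is a toy re-implementation of the truncation, padding and mask step. `pad_and_mask` is a hypothetical helper, not part of `transformers`, and the word-piece IDs passed to it are arbitrary. The special-token IDs are those of the `bert-base-uncased` vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0     # [CLS], [SEP], [PAD] in bert-base-uncased

def pad_and_mask(token_ids, maximum_length):
    """Hypothetical helper: mimic truncation, padding and attention-mask creation."""
    body = token_ids[: maximum_length - 2]        # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                         # 1 = real token
    padding = maximum_length - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding   # 0 = padding

ids, mask = pad_and_mask([7592, 2088], maximum_length=6)
print(ids)    # [101, 7592, 2088, 102, 0, 0]
print(mask)   # [1, 1, 1, 1, 0, 0]
```

The attention mask tells BERT which positions hold real tokens, so the padding added to reach 30 tokens does not influence the result.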

In [1]:
from transformers import TFBertModel
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# <p style="background-color:#80ccff; font-family:newtimeroman; font-size:150%; text-align:center; border-radius:  80px 5px; padding-top:8px; padding-bottom:8px;">Model</p>

In [1]:
import tensorflow as tf
from tensorflow.keras.optimizers import Adam

def create_model(bert_model):
    
    input_ids = tf.keras.Input(shape=(30,),dtype='int32')
    attention_masks = tf.keras.Input(shape=(30,),dtype='int32')

    output = bert_model([input_ids,attention_masks])
    output = output[1]  # pooled output: the transformed [CLS] representation
    output = tf.keras.layers.Dense(32,activation='relu')(output)
    output = tf.keras.layers.Dropout(0.2)(output)
    output = tf.keras.layers.Dense(1,activation='sigmoid')(output)
    
    model = tf.keras.models.Model(inputs = [input_ids,attention_masks],outputs = output)
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    return model

In [1]:
model = create_model(bert_model)

* EarlyStopping: Stop training when a monitored metric has stopped improving.

* ReduceLROnPlateau: Reduce learning rate when a metric has stopped improving.
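
The patience logic behind both callbacks can be approximated in plain Python. This is a simplified sketch, not the Keras implementation: both rules watch the same validation loss, cooldown is ignored, and `simulate_callbacks` and its parameter names are hypothetical.

```python
def simulate_callbacks(val_losses, lr, stop_patience=5, lr_patience=3,
                       factor=0.5, min_lr=1e-7, min_delta=1e-4):
    """Return (epoch at which training stops, final learning rate)."""
    best_stop, wait_stop = float("inf"), 0
    best_lr, wait_lr = float("inf"), 0
    for epoch, loss in enumerate(val_losses, start=1):
        # EarlyStopping-style rule: count epochs without a real improvement.
        if loss < best_stop - min_delta:
            best_stop, wait_stop = loss, 0
        else:
            wait_stop += 1
        # ReduceLROnPlateau-style rule: shrink the rate after a plateau.
        if loss < best_lr:
            best_lr, wait_lr = loss, 0
        else:
            wait_lr += 1
            if wait_lr >= lr_patience:
                lr = max(lr * factor, min_lr)
                wait_lr = 0
        if wait_stop >= stop_patience:
            return epoch, lr
    return len(val_losses), lr

losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.48, 0.49, 0.50, 0.51]  # made-up values
stopped_epoch, final_lr = simulate_callbacks(losses, lr=1e-5)
print(stopped_epoch, final_lr)
```

In this made-up run the best loss is reached at epoch 3, the learning rate is halved once after three stagnant epochs, and training stops after five stagnant epochs.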

In [1]:
stoped = EarlyStopping(monitor='val_loss', patience=5, min_delta=0.0001)
# min_lr must be below the initial learning rate (1e-5), otherwise the rate can never be reduced
redutor = ReduceLROnPlateau(monitor='val_accuracy', patience=3, verbose=1, factor=0.5, min_lr=1e-7)

In [1]:
history = model.fit([train_input_ids, train_attention_masks],
    target, validation_split=0.2, epochs=30, batch_size=16, callbacks=[stoped, redutor])

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(15,5))
axes[0].plot(history.history['accuracy'])
axes[0].plot(history.history['val_accuracy'])
axes[0].set_xlabel('Epochs')
axes[0].set_ylabel('Accuracy')
axes[0].legend(['Train accuracy','Validation accuracy'])
axes[0].grid(True)

axes[1].plot(history.history['loss'])
axes[1].plot(history.history['val_loss'])
axes[1].set_xlabel('Epochs')
axes[1].set_ylabel('Loss')
axes[1].legend(['Train loss','Validation loss'])
axes[1].grid(True)