## NLP Workshop-2: Model building and Training


Authored by [abhilash1910](https://www.kaggle.com/abhilash1910)


### Movie Reviews !!

The second, and most interesting, part of the curriculum is building different models, both statistical and deep learning, to arrive at a sound classification model for our use case. We will first create a baseline with statistical classification algorithms, using both non-semantic and semantic vectors. Later we will try to improve on these baselines with standard neural models such as Bidirectional LSTMs, Convolutional Networks and Gated Recurrent Units. We will then investigate how much these traditional neural models improve over the baselines when we explore advanced architectures, particularly encoder-decoders. Finally, we will move on to attention-based encoder-decoder modules such as Transformers, using different flavours from BERT to GPT.

The following is the workflow of this notebook:

- Statistical Classifiers
  - With Non Semantic TfIdf Baseline
  - With Semantic Static Embeddings
  - With Semantic Dynamic Embeddings 
  - With Transformer Embeddings
  - Models: LR,SVM,NB,XGboost,DT,RF,LGBM,LDA,KNN
  
  
- Traditional NN models
  - With Static Embeddings
  - With Dynamic Embeddings
  - Models: LSTM,CNN,BiLSTM,GRU,Encoder-Decoders


- Advanced Architectures
  - Transformers
  - Attention
  - BERT TPU
  - All BERT variants
  - GPT TPU
  - All variants of GPT2
  - Hybrid Transformer
  

This is an in depth approach to analyse the performance of different models on this task.

<img src="https://lumiere-a.akamaihd.net/v1/images/eu_bpan_showcase_hero_v4_m_823e00f3.jpeg?region=0,0,750,668">


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Recap from Workshop-1

In the previous [Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop), we learned about cleaning, EDA, vectorization and embeddings. We saw that cleaning plays a significant role in the quality of the embeddings learned from the corpus, and we also measured the similarity between words and sentences.

We will be following a few steps from the elaborate Notebook:

- Use the cleaning methods (regex) for redundancy removal
- Lemmatization
- Vectorization
- Embeddings (Static,Dynamic,Transformer)

Since we have already implemented the regex cleaning methods, we can apply the same ones here. In the first section of this notebook, we will run statistical classifiers on three differently vectorized versions of the data:

- Non semantic TFIDF Vectorized Data
- Static Embedding Vectorized Data
- Dynamic Embedding Vectorized Data

For the first use case of statistical models, we will be relying on TFIDF Baseline with Statistical classifiers.


<img src="https://i.pinimg.com/originals/b0/ec/e4/b0ece436f4244f1f97bab3facf4d6b8a.gif">


In [None]:
#Load the set
train_df=pd.read_csv('../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
train_df.head()

In [None]:
# Standard library
import collections
import string
from collections import Counter, defaultdict

# Numerics and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud,STOPWORDS
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go

# NLP utilities
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
stop_words = stopwords.words('english')

# scikit-learn: preprocessing, splitting, vectorization and decomposition
from sklearn import preprocessing,metrics,manifold
from sklearn.manifold import TSNE
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV,cross_val_predict
from sklearn.model_selection import StratifiedKFold,KFold,StratifiedShuffleSplit
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD,SparsePCA

# scikit-learn: classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestRegressor,GradientBoostingClassifier,RandomForestClassifier,AdaBoostClassifier,BaggingClassifier,ExtraTreesClassifier

# scikit-learn: metrics
from sklearn.metrics import accuracy_score,classification_report,roc_auc_score,roc_curve,r2_score,recall_score,confusion_matrix,precision_recall_curve

# Imbalanced-learn, XGBoost and LightGBM
from imblearn.over_sampling import ADASYN,SMOTE
from imblearn.under_sampling import NearMiss
from imblearn.metrics import classification_report_imbalanced
import xgboost
from xgboost import XGBClassifier as xg
from lightgbm import LGBMClassifier as lg

In [None]:
%%time
#Convert the labels into integers (numerics) for reference.

# Vectorized mapping: 'positive' -> 1, anything else -> 0
train_df['Binary']=(train_df['sentiment']=='positive').astype(int)
train_df.head()

In [None]:
%%time
#Running the Preprocessing and cleaning phase as well as the TFIDF Vectorization

import re
#Removes Punctuations
def remove_punctuations(data):
    punct_tag=re.compile(r'[^\w\s]')
    data=punct_tag.sub(r'',data)
    return data

#Removes HTML syntaxes (replaced by a space so that words around a <br /> tag do not get glued together)
def remove_html(data):
    html_tag=re.compile(r'<.*?>')
    data=html_tag.sub(r' ',data)
    return data

#Removes URL data (http:// and https:// links as well as bare www. links)
def remove_url(data):
    url_clean= re.compile(r"https?://\S+|www\.\S+")
    data=url_clean.sub(r'',data)
    return data

#Removes Emojis
def remove_emoji(data):
    emoji_clean= re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    data=emoji_clean.sub(r'',data)
    return data

#Lemmatize the corpus word by word
#(iterating directly over the string would visit single characters, not words)
def lemma_traincorpus(data):
    lemmatizer=WordNetLemmatizer()
    return " ".join(lemmatizer.lemmatize(word) for word in data.split())

def tfidf(data):
    tfidfv = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), lowercase=True, max_features=150000)
    fit_data_tfidf=tfidfv.fit_transform(data)
    return fit_data_tfidf


#HTML tags and URLs are stripped before punctuation: removing punctuation first
#would delete the '<', '>' and '://' characters that those patterns rely on
train_df['review']=train_df['review'].apply(remove_html)
train_df['review']=train_df['review'].apply(remove_url)
train_df['review']=train_df['review'].apply(remove_emoji)
train_df['review']=train_df['review'].apply(remove_punctuations)
count_good=train_df[train_df['sentiment']=='positive']
count_bad=train_df[train_df['sentiment']=='negative']
train_df['review']=train_df['review'].apply(lemma_traincorpus)


## Revisiting TFIDF for our Baseline

TFIDF vectorization is a non-semantic, frequency-based technique. It weights each term by its frequency within a document (TF), scaled by the logarithm of the inverse document frequency (IDF), so that words which appear in many documents receive lower weights. A descriptive formulation is provided here:

<img src="https://plumbr.io/app/uploads/2016/06/tf-idf.png">


The rationale for using TFIDF over other vectorization strategies to embed documents in a high-dimensional space is that it emphasises rare but informative words across the corpus. These vectorized representations can then be used to train a statistical model.

<img src="https://cdn-images-1.medium.com/max/876/1*_OsV8gO2cjy9qcFhrtCdiw.jpeg">
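To make the weighting concrete, here is a minimal sketch that recomputes TFIDF by hand and compares it with sklearn. The toy corpus is hypothetical, and the formula shown is the one sklearn's ```TfidfVectorizer``` uses with its default settings (smoothed IDF followed by L2 normalisation), which differs slightly from the textbook formula in the figure above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
docs = ["good movie", "bad movie", "good good plot"]

# sklearn defaults: smooth_idf=True, norm='l2'
vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Raw term counts per document (TF)
counts = np.array([[doc.split().count(t) for t in terms] for doc in docs], dtype=float)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# TF * IDF, then L2-normalise each document vector
manual = counts * idf
manual /= np.linalg.norm(manual, axis=1, keepdims=True)

print(terms)
print(np.allclose(X, manual))
```

The word "movie" appears in two of the three documents and so receives a lower IDF than "plot", which appears in only one.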

In [None]:
%%time
#TFIDF Vectorize the Data

train_set=tfidf(train_df['review'])

## Statistical Training Without Balancing

We covered balancing techniques in the previous [Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop); approaches like [SMOTE](https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/) and [ADASYN](https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.ADASYN.html) are used to balance the classes. In the first case, we will not balance the set, and will instead evaluate the preliminary training of the statistical models as is. This initial benchmark can then be improved upon by balancing the dataset.




## Splitting the set


We start from the transformed (vectorized) data seen in the previous notebook, whose t-SNE projection looked like this:

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..jgD4rWmavXEjaRs2D5hj1A.oiGPZCbEUh-SE708RH5SVGYnriaNgG8ZQr-jCjhIhA_NfrLPNRkn5yemLGwtV2_YHBt5K5pxw5oS3IlczfFEFbPHuJVfiSXazjeucA_MICTf8t7GQ0Qp8Eb-2uR2vchwySVuto8Sox5FOwbWswurk-VZKPIOn4whx2pUdeGIe2uEVEd6LcklBb_J0OlYGjytC-4Qh32eyWE1q5KrXIktcUXPDeuChQQYOvKiWJlK_Sz6mLz4bW4ZuYvgcOtmOc8IMx_muTyVbmd5RlBycC4cdme6Q5qHVS7SR-2eM9FLxKpyCtnor5sdDCSVZ749-eylO0KQ2xKX_KPUnYAXKVwuHPalAUNcWylpjR67Q_SxVckm5qzT1mI_iBUh4fKqe0Fq4QyoQt8E1ulug_AdE9UhyGENgn2AYa2WyueUKXPXc3xZSrWvI7AwPmvy9vV5Bh96qDod5vOYtkKHofOzAMXMjvB4JgGyYQoN4l39XKwu99RZGN0V8nUmugorm_kSxvKrqeZDGXNiU0OZZo6XXbO9NGtG5XU8gcZ2DzGqonuI8p0sROGP_nhybo28Z-MXkTelqS_ZSMkEbNN-2uJI0EsZACwPBj0xRRVYc-lkqcGkklAbWUkJO-pNYozWTED9uqPjNuJocY1DXFPJZX-4eHxliWUCV5-9zRrYekkx7zXZ2ds.I4RU-iNAzcSSEj2dlQ6ZnA/__results___files/__results___50_0.png">


In generic supervised learning, we train the model after splitting the TFIDF-vectorized data into training and test sets. The split itself does not transform the features; it simply partitions the rows according to the ratio (test size) we provide. For this we will use the splitters in sklearn's ```model_selection``` module. There are several ways of splitting, depending on the data; some of them are:

- [Test Train Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [Stratified Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
- [Stratified Shuffle Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)
- [Shuffle Split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html)
- [Stratified K Fold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html)

Jason's [blog](https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/) provides a good idea about splitting techniques for generic classification problems.

In [None]:
%%time
#Normal Train Test Split

train_y=train_df['sentiment']
train_x,test_x,train_y,test_y=train_test_split(train_set,train_y,test_size=0.2,random_state=42)
train_x.shape,train_y.shape,test_x.shape,test_y.shape

## Splitting in Imbalanced Classes

There are several ways to split imbalanced classes. One is to use the ```stratify``` option (a stratified split) of ```train_test_split```. This splits each class proportionally, so the class ratios of the full dataset are maintained in both the training and the testing sets.

In [None]:
%%time
#Stratified Train Test Split

train_stratify_y=train_df['sentiment']
train_stratified_x,test_stratified_x,train_stratified_y,test_stratified_y=train_test_split(train_set,train_stratify_y,test_size=0.2,random_state=42,stratify=train_stratify_y)
train_stratified_x.shape,train_stratified_y.shape,test_stratified_x.shape,test_stratified_y.shape
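Because the IMDB labels are balanced, the effect of stratification is hard to see on our data. The following minimal sketch uses a hypothetical, deliberately imbalanced label vector to show that the positive-class ratio is preserved in both partitions when ```stratify``` is passed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of positives in each partition
overall_pos_ratio = y.mean()
train_pos_ratio = y_tr.mean()
test_pos_ratio = y_te.mean()
print(overall_pos_ratio, train_pos_ratio, test_pos_ratio)
```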

## Analysing TFIDF-LR Baseline with simple split 

In this case, we want to evaluate the performance of a Logistic Regression classifier on the TFIDF-vectorized data produced by the plain train_test_split. Logistic Regression is one of the standard generalized linear models for supervised learning: it passes a linear combination of the features through the sigmoid function to obtain a probability, and fits the weights by minimising a convex cost function. The sigmoid function is denoted by the formulation:

<img src="https://www.gstatic.com/education/formulas2/-1/en/sigmoid_function.svg">


Because the sigmoid saturates smoothly (it tends to 0 as x → −∞ and to 1 as x → +∞) and is differentiable everywhere, its output can be read as a probability and thresholded (typically at 0.5) to give binary labels. The curve has horizontal asymptotes at y=0 and y=1 and passes through 0.5 at x=0. To train the model, we optimise the cost function with respect to the weights and biases using a gradient-based method such as (stochastic) gradient descent; sklearn's ```LogisticRegression``` uses the L-BFGS solver by default. In gradient descent, the gradient of the cost with respect to the weights is computed at each step, and the weights are updated in the direction of steepest descent. The loss function for logistic regression is the log loss (binary cross-entropy), E = −[y_actual·log(y_predicted) + (1 − y_actual)·log(1 − y_predicted)], rather than the squared error. This [blog](https://machinelearningmastery.com/gradient-descent-for-machine-learning/) provides an idea.



<img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xoKf0Xfi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AXisqJi764DTxragiRFexSQ.png">
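The ideas above can be sketched in a few lines of NumPy. This is an illustrative batch gradient descent on a hypothetical one-dimensional toy dataset, not what sklearn runs internally; the learning rate and number of steps are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    """Map any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy, averaged over the samples."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# Hypothetical toy data: negatives on the left, positives on the right
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(1)   # weights
b = 0.0           # bias
lr = 0.5          # learning rate (arbitrary)

initial_loss = log_loss(y, sigmoid(X @ w + b))
for _ in range(200):
    p = sigmoid(X @ w + b)
    # Gradient of the log loss with respect to w and b
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b
final_loss = log_loss(y, sigmoid(X @ w + b))

# Threshold the probabilities at 0.5 to obtain binary labels
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print(initial_loss, final_loss, preds)
```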

Some resources:

- [Blog](https://www.analyticsvidhya.com/blog/2020/10/how-does-the-gradient-descent-algorithm-work-in-machine-learning/)
- [Sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- [KDNuggets](https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-algorithm.html)
- [Blog](https://towardsdatascience.com/machine-learning-101-an-intuitive-introduction-to-gradient-descent-366b77b52645)



In [None]:
#Applying Logistic Regression on split tfidf baseline
model=LogisticRegression()
model.fit(train_x,train_y)
pred=model.predict(test_x)
print("Evaluate confusion matrix for LR")
print(confusion_matrix(test_y,pred))
print(f"Accuracy Score for LR with C=1.0  ={accuracy_score(test_y,pred)}")

## MultiNomial Naive Bayes on TFIDF Baseline

[MultiNomial NB](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/modules/generated/sklearn.naive_bayes.MultinomialNB.html) is a probabilistic classification model which uses conditional probability to classify samples. It works well with discrete integer-valued features (such as count vectorization) but can also be used with TFIDF vectors. In particular, it uses Bayes' theorem, which computes the posterior probability of a class from its prior probability and the likelihood of the observed features, as shown in the figure:

<img src="https://storage.googleapis.com/coderzcolumn/static/tutorials/machine_learning/article_image/Scikit-Learn%20-%20Naive%20Bayes.jpg">

The underlying concept for this family is the statistics of [Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html). There are many variants:

- Gaussian NB which relies on Gaussian distribution of the input features

<img src="https://www.researchgate.net/profile/Yune_Lee/publication/255695722/figure/fig1/AS:297967207632900@1448052327024/Illustration-of-how-a-Gaussian-Naive-Bayes-GNB-classifier-works-For-each-data-point.png">


- Complement NB, which is suited to imbalanced classes and relies on the statistics of the complement of each class to compute the weights. It often outperforms Multinomial NB on text classification; it uses a smoothing hyperparameter alpha and can optionally normalise the weights so that longer documents do not dominate the parameter estimates.


<img src="https://www.researchgate.net/profile/Motaz_Saad/publication/231521157/figure/fig7/AS:667829850345476@1536234452166/Figure-31-Complement-Naive-Bayes-Algorithm-72.png">

- Bernoulli NB which relies on multivariate bernoulli distributions of the input features and also expects the data to be in binary format.

<img src="https://www.astroml.org/_images/fig_simple_naivebayes_1.png">

Other variants include:

- Categorical NB
- Partial Fit of NB models

Resources:

- [Blog](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/)
- [Kernel](https://www.kaggle.com/abhilash1910/nlp-workshop-ml-india#Vectorization-and-Benchmarking)
- [Jason's Blog](https://machinelearningmastery.com/classification-as-conditional-probability-and-the-naive-bayes-algorithm/)
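As a minimal sketch of how Bayes' theorem drives the prediction, the snippet below fits a ```MultinomialNB``` on a hypothetical four-document corpus and then rebuilds the posterior by hand from the fitted log prior and log likelihoods. The corpus, labels and query text are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus: 1 = positive, 0 = negative
docs = [
    "great film great cast",
    "loved this film",
    "terrible boring film",
    "boring plot terrible cast",
]
labels = np.array([1, 1, 0, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
nb = MultinomialNB(alpha=1.0).fit(X, labels)

# Word counts of a new review
x_new = vectorizer.transform(["great cast"]).toarray()

# Bayes' theorem in log space:
# log P(class | words) is proportional to log P(class) + sum(count * log P(word | class))
joint_log = x_new @ nb.feature_log_prob_.T + nb.class_log_prior_

# Normalise so the posteriors sum to 1
posterior = np.exp(joint_log - joint_log.max(axis=1, keepdims=True))
posterior /= posterior.sum(axis=1, keepdims=True)

print(posterior)
print(nb.predict_proba(x_new))
```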




In [None]:
#Applying MultiNomial Naive Bayes on split tfidf baseline
model=MultinomialNB()
model.fit(train_x,train_y)
pred=model.predict(test_x)
print("Evaluate confusion matrix for NB")
print(confusion_matrix(test_y,pred))
print(f"Accuracy Score for NB ={accuracy_score(test_y,pred)}")

## Multiple Baseline computation using KFold and Cross Validation


In this section, we will train multiple statistical models using KFold cross-validation.
KFold cross-validators provide train/test indices that split the dataset into 'k' folds (without shuffling by default; pass ```shuffle=True``` to shuffle first). The general methodology for KFold cross-validation is given below:

- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
   - Take the group as a hold out or test data set
   - Take the remaining groups as a training data set
   - Fit a model on the training set and evaluate it on the test set
   - Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores

The fold sizes follow this rule: the first ```n_samples % n_splits``` folds have size ```n_samples // n_splits + 1```, and the other folds have size ```n_samples // n_splits```, where n_samples is the number of samples.
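The fold-size rule can be checked directly. In this small sketch the sample count of 23 and the 5 splits are arbitrary values chosen so that the division leaves a remainder.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 23, 5          # arbitrary illustrative values
X = np.zeros((n_samples, 1))

# Size of the held-out fold in each of the k iterations
fold_sizes = [len(test_idx) for _, test_idx in KFold(n_splits=n_splits).split(X)]

# Sizes predicted by the rule
remainder = n_samples % n_splits
base = n_samples // n_splits
expected_sizes = [base + 1] * remainder + [base] * (n_splits - remainder)

print(fold_sizes, expected_sizes)
```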

A typical flowchart of cross validation is provided below:

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png">

This allows for a better hyperparameter search using GridSearch CV algorithms, which will be covered later.
The following procedure is followed for each of the k “folds”:

- A model is trained using k-1 of the folds as training data;

- The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.

<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png">
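The procedure can be written out as an explicit loop and compared with ```cross_val_score```, which performs the same steps internally. This is a minimal sketch on synthetic data from ```make_classification```; the dataset size, the number of folds and the ```max_iter``` value are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the vectorized reviews
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# shuffle=True so that random_state controls the shuffling
kfold = KFold(n_splits=5, shuffle=True, random_state=7)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    # Fit on k-1 folds, evaluate on the held-out fold, then discard the model
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold, scoring="accuracy"
)

# The reported performance is the mean over the folds
print(np.mean(manual_scores), cv_scores.mean())
```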

Some resources:

- [Blog](https://machinelearningmastery.com/k-fold-cross-validation/)
- [Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
- [Documentation](https://scikit-learn.org/stable/modules/cross_validation.html)



## Brief Introduction of Statistical Models


This is a brief introduction to the different statistical models that will be used together with k-fold cross-validation to examine their accuracy. Here we focus on accuracy as the main KPI; later we will look at other metrics such as F1 and ROC.


### Decision Trees

[Decision Trees](https://scikit-learn.org/stable/modules/tree.html) are a supervised model for classification and regression. They work by creating decision branches, each of which tests a criterion on a feature, and are often regarded as a simple (white box) classification model because the stages of each decision can be easily traced. A regression tree appears as follows:

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_tree_regression_0011.png">

The [algorithms](https://scikit-learn.org/stable/modules/tree.html) include ID3,C4.5/C5.0,CART which can be analysed as follows:

- ID3(Iterative Dichotomiser 3) was developed in 1986 by Ross Quinlan. The algorithm creates a multiway tree, finding for each node (i.e. in a greedy manner) the categorical feature that will yield the largest information gain for categorical targets. Trees are grown to their maximum size and then a pruning step is usually applied to improve the ability of the tree to generalise to unseen data.

- C4.5 is the successor to ID3 and removed the restriction that features must be categorical by dynamically defining a discrete attribute (based on numerical variables) that partitions the continuous attribute value into a discrete set of intervals. C4.5 converts the trained trees (i.e. the output of the ID3 algorithm) into sets of if-then rules. The accuracy of each rule is then evaluated to determine the order in which they should be applied. Pruning is done by removing a rule’s precondition if the accuracy of the rule improves without it.

- C5.0 is Quinlan’s latest version release under a proprietary license. It uses less memory and builds smaller rulesets than C4.5 while being more accurate.

- CART (Classification and Regression Trees) is very similar to C4.5, but it differs in that it supports numerical target variables (regression) and does not compute rule sets. CART constructs binary trees using the feature and threshold that yield the largest information gain at each node.


In general, the core computation in Decision Trees is an impurity measure such as Entropy or the Gini Index, defined as follows:

<img src="https://qph.fs.quoracdn.net/main-qimg-690a5cee77c5927cade25f26d1e53e77">

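Note that the Gini index used by decision trees (the Gini impurity) is distinct from the Gini coefficient used in economics, which is typically evaluated from the area between the ```y=x``` line and the Lorenz curve: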

<img src="https://i.stack.imgur.com/iawuF.jpg">


Misclassification error is another criterion:

<img src="https://miro.medium.com/max/2180/1*O5eXoV-SePhZ30AbCikXHw.png">
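The three impurity criteria, and the information gain of a split, can be sketched with a few small NumPy functions. The function names and the toy labels are illustrative and are not part of sklearn's API.

```python
import numpy as np

def class_probabilities(labels):
    """Relative frequency of each class in a node."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p))."""
    p = class_probabilities(labels)
    return float(-np.sum(p * np.log2(p)))

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p^2)."""
    p = class_probabilities(labels)
    return float(1.0 - np.sum(p ** 2))

def misclassification_error(labels):
    """Misclassification error: 1 - max(p)."""
    return float(1.0 - class_probabilities(labels).max())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# A perfectly mixed node that is split into two pure children
parent = [0, 0, 1, 1]
left, right = [0, 0], [1, 1]

print(entropy(parent), gini_impurity(parent), misclassification_error(parent))
print(information_gain(parent, left, right))
```

A perfectly mixed binary node has the maximum impurity under all three criteria, and a split that yields two pure children recovers the full entropy of the parent as information gain.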

Typically a decision tree appears as follows:

<img src="https://scikit-learn.org/stable/_images/iris.png">

Some resources:

- [Blog](https://towardsdatascience.com/scikit-learn-decision-trees-explained-803f3812290d)
- [Blog](https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/)
- [Blog](https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/)



### Random Forests 


A [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the ```max_samples``` parameter ```if bootstrap=True (default)```, otherwise the whole dataset is used to build each tree. When splitting each node during the construction of a tree, the best split is found either from all input features or a random subset of size ```max_features```. (See the parameter tuning guidelines for more details.) The purpose of these two sources of randomness is to decrease the variance of the forest estimator. Indeed, individual decision trees typically exhibit high variance and tend to overfit. The injected randomness in forests yields decision trees with somewhat decoupled prediction errors. By taking an average of those predictions, some errors can cancel out. Random forests achieve a reduced variance by combining diverse trees, sometimes at the cost of a slight increase in bias. In practice the variance reduction is often significant, hence yielding an overall better model.

<img src="https://www.researchgate.net/profile/Hung_Cao12/publication/333438248/figure/fig6/AS:763710377299970@1559094151459/Random-Forest-model-with-majority-voting.ppm">
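The two sources of randomness and the averaging step can be seen in a short sketch. The data here is synthetic, and the hyperparameter values are arbitrary choices made only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the vectorized reviews
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25,
    bootstrap=True,        # each tree sees a bootstrap sample of the rows...
    max_samples=0.5,       # ...of half the dataset size
    max_features="sqrt",   # each split considers a random subset of features
    random_state=0,
)
forest.fit(X, y)

# The forest's probability is the average of the individual trees' probabilities
tree_probas = [tree.predict_proba(X) for tree in forest.estimators_]
avg_proba = np.mean(tree_probas, axis=0)

print(len(forest.estimators_))
print(np.allclose(avg_proba, forest.predict_proba(X)))
```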


### Gradient Boosting Forests and Trees

[Gradient Boosting](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/) is a central part of ensemble modelling in sklearn.

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.

Two families of ensemble methods are usually distinguished:

- In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.

  Examples: Bagging methods, Forests of randomized trees

- By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

  Examples: AdaBoost, Gradient Tree Boosting


Pictorially these can be represented as :

<img src="https://miro.medium.com/max/3908/1*FoOt85zXNCaNFzpEj7ucuA.png">


Several boosting models fall into this category:

#### [AdaBoosting](https://blog.paperspace.com/adaboost-optimizer/#:~:text=AdaBoost%20is%20an%20ensemble%20learning,turn%20them%20into%20strong%20ones.): 
The core principle of AdaBoost is to fit a sequence of weak learners (i.e., models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data. The predictions from all of them are then combined through a weighted majority vote (or sum) to produce the final prediction. The data modifications at each so-called boosting iteration consist of applying weights w_1, w_2, ..., w_N to each of the training samples. Initially, those weights are all set to w_i = 1/N, so that the first step simply trains a weak learner on the original data. For each successive iteration, the sample weights are individually modified and the learning algorithm is reapplied to the reweighted data. At a given step, those training examples that were incorrectly predicted by the boosted model induced at the previous step have their weights increased, whereas the weights are decreased for those that were predicted correctly. As iterations proceed, examples that are difficult to predict receive ever-increasing influence. Each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence.

<img src="https://scikit-learn.org/stable/_images/sphx_glr_plot_adaboost_hastie_10_2_0011.png">
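One round of the reweighting can be worked through by hand. The sketch below follows the classic discrete AdaBoost update for labels in {-1, +1}; sklearn's implementation may differ in details, and the labels and weak-learner predictions are hypothetical.

```python
import numpy as np

# Hypothetical labels and the predictions of one weak learner
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # the second sample is misclassified

n = len(y_true)
weights = np.full(n, 1.0 / n)           # initial weights w_i = 1/N

# Weighted error rate of the weak learner
miss = y_pred != y_true
err = np.sum(weights[miss]) / np.sum(weights)

# Say of the learner in the final vote: larger when the error is smaller
alpha = 0.5 * np.log((1 - err) / err)

# Increase the weights of misclassified samples, decrease the others, renormalise
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print(err, alpha)
print(new_weights)
```

After the update, the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.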


### [LightGBM](https://lightgbm.readthedocs.io/en/latest/)

LightGBM is another gradient boosting strategy which relies on trees. It has the following advantages:

- Faster training speed and higher efficiency.

- Lower memory usage.

- Better accuracy.

- Support of parallel and GPU learning.

- Capable of handling large-scale data.

LightGBM grows trees leaf-wise (best-first): at each step it chooses the leaf with the maximum delta loss to grow.

<img src="https://lightgbm.readthedocs.io/en/latest/_images/leaf-wise.png">


Some resources:

- [XGB](https://xgboost.readthedocs.io/en/latest/tutorials/model.html)
- [Blogs](https://www.analyticsvidhya.com/blog/tag/gradient-boosting/)

In [None]:
%%time
#KFold and cross validation on tfidf baseline
models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
# models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#The loss is left at its default (binomial deviance, named 'log_loss' in recent sklearn releases)
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
#The algorithm is left at its default, since 'SAMME.R' is no longer accepted by recent sklearn releases
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
# models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
model_result=[]
scoring='accuracy'
print("Statistical Model TFIDF- Baseline Evaluation")
for name,model in models:
    #shuffle=True is required for random_state to take effect
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_x,train_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a cross-validated score of", round(results.mean() * 100, 2), "% accuracy score")
    model_result.append(results.mean())

### Applying SMOTE balancing on TFIDF Baseline

Here, we will apply SMOTE to the TFIDF-vectorized data to create a different baseline. Since the data in our case is already balanced, SMOTE has very little to resample, so we should not expect it to improve the models; it mainly adds computation.



In [None]:
#Balancing the Sample for TFIDF Baseline
#SMOTE oversampling
smote=SMOTE(random_state=42,k_neighbors=2)
#fit_resample replaces the older fit_sample, which was removed from imbalanced-learn
smote_train_x,smote_train_y=smote.fit_resample(train_x,train_y)
smote_train_x.shape,smote_train_y.shape

In [None]:
%%time
#Applying SMOTE TFIDF Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#The loss is left at its default (binomial deviance, named 'log_loss' in recent sklearn releases)
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
#The algorithm is left at its default, since 'SAMME.R' is no longer accepted by recent sklearn releases
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
# models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
model_training_result,model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model SMOTE TFIDF- Baseline Evaluation")
for name,model in models:
    #shuffle=True is required for random_state to take effect
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,smote_train_x,smote_train_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a cross-validated score of", round(results.mean() * 100, 2), "% accuracy score")
    #Refit on the full SMOTE-balanced training set and evaluate on the held-out test set
    model.fit(smote_train_x,smote_train_y)
    predictions=model.predict(test_x)
    accuracy = accuracy_score(test_y,predictions)
    model_training_result.append(results.mean())
    model_validation_result.append(accuracy)

final_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
#Store the model names rather than the (name, estimator) tuples
final_outcomes['Model']=[name for name,_ in models]
final_outcomes['Training Acc']=model_training_result
final_outcomes['Validation Acc']=model_validation_result
final_outcomes.to_csv('TFIDF-SMOTE-Baseline.csv',index=False)
final_outcomes

## Concluding Non Semantic Baseline Techniques


We have seen how statistical classifiers perform on non semantic TFIDF vectorized data and compared the accuracy of the different algorithms side by side. The purpose of these statistical models is to provide an initial benchmark, which we then try to improve with other models. This gives a quick overview of how a traditional classifier can be used for non semantic classification; in the next case we will use semantic embeddings (vectors) with these same traditional classifiers.


<img src="https://media2.giphy.com/media/118u58QrLaLnDG/giphy.gif">

## Static Semantic Embedding Baseline


In this section, we apply the traditional ML classifiers to Word2Vec, Glove and Fasttext vectors to create a classification model. These embeddings were introduced earlier, and some images of applying them to the corpus are taken from the Notebook [here](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop):

### Word2Vec From [Gensim](https://www.analyticsvidhya.com/blog/tag/word2vec/)

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___63_1.png">

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___59_1.png">

### Google News Vectors Variant of [Word2Vec](https://code.google.com/archive/p/word2vec/)


<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___66_0.png">

### [Glove](https://www.analyticsvidhya.com/blog/tag/glove/) [Embeddings](https://nlp.stanford.edu/projects/glove/)

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___70_0.png">

### [Fasttext Embeddings](https://fasttext.cc/docs/en/supervised-tutorial.html)

<img src="https://www.kaggleusercontent.com/kf/48903343/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..C4q3IW8yHfXenn6fNzGSXQ.eHWK4_UfbdG6wh4wTDUBKfapys5hjPEyZqNdH-szO_domUFCC5uIpgEqg6IWleffop3sIT6lPyKsKPaMMNxUR9GtqbkZroidkwVVGL_4hdybmE0o6-yCbbAGxJfO44uwvSIc8ak_QqhFLLxhANMRRceiSBn7jK4mK9iUhnx4EBXhj9JQfuwlHDlCyWrL9FtyQV0a_iY0yRpJ39EvAjGAwYSnwk2FqjADkptTUaKA3liDy_ZohvtUGXZh4BEX0SCLLXpfKkleqq5sTeTLMU_h-AHH2z8AyVXFpSTMVAmXh2urgGjl1BbQjyf1fhxETZFj1eoCnpFddvNuK8hrIqdvuDyaybFnb_MTFScC3104AWu7sI7ke3-7fUFc0dGSzBzY2RE1s17MdaepXCNmDuBr40Yd2O4fN9VziTSgAZUKEhnbJe1C_MZtuYkUMZY7kRjiHstQQrS4lVYnALNlFlqzDy61-dl7MMgQlcK-EePn6lK4umB94lq1EC-AFLzFaoMeX4iC6z8LUPcSaQjjlkuOHBWpdfj3fYCQ_uaXeD7-vKuqTr-c1EAhrdXJjgOCzQjLmlvw1oisiTPpVDSjl2P4J6fL8WxEG4TmiBIHhQJywYr0lMN2kZdLDa3U98KtbFf5ed4_7upTP_IzQ2g8VzKvZa8W-9qRGFWLoW1hehO0DxQ.dFpEskCVbLZ4qvYaycKJcg/__results___files/__results___73_0.png">


In [None]:
%%time
## Load the Word2Vec algorithm from gensim and vectorize the words
from gensim.models import Word2Vec,KeyedVectors
check_df=list(train_df['review'].str.split())
#gensim>=4.0 names the number of training passes `epochs`; releases before 4.0 call it `iter`
model=Word2Vec(check_df,min_count=1,epochs=20)

In [None]:
#Label Encode the labels
from sklearn.preprocessing import LabelEncoder
label_y= LabelEncoder()
labels=label_y.fit_transform(train_df['sentiment'])
labels

## Creating Sentence Vectors from Word2Vec

Word2Vec creates a vector embedding for each individual word in the corpus by mapping it into a vector space, so to classify whole reviews we need to combine these word vectors into effective document/sentence vectors. The concept of [pooling](https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8#:~:text=A%20pooling%20layer%20is%20another,in%20pooling%20is%20max%20pooling) comes from neural networks, particularly [Convolution Neural Network Architectures](https://analyticsindiamag.com/max-pooling-in-convolutional-neural-network-and-its-features/), where MaxPooling means taking the maximum over a range (typically the window of input features covered by a kernel/filter). A typical Maxpooling diagram is as follows:


<img src="https://i.redd.it/61tcfy2xy2u41.png">
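As a small illustration of the max pooling operation shown in the diagram, the following NumPy sketch keeps the largest value in each non-overlapping 2x2 window. The helper name `max_pool_2d` and the example matrix are assumptions for illustration only.

```python
import numpy as np

def max_pool_2d(feature_map, window=2):
    """Non-overlapping max pooling: keep the largest value in each window x window block."""
    h, w = feature_map.shape
    # Trim so the map divides evenly into blocks (a simplification of this sketch)
    h, w = h - h % window, w - w % window
    blocks = feature_map[:h, :w].reshape(h // window, window, w // window, window)
    return blocks.max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 0],
                        [7, 2, 9, 8],
                        [1, 0, 3, 4]])
pooled = max_pool_2d(feature_map)
# pooled is [[6, 5], [7, 9]]: one maximum per 2x2 block
```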



For document embeddings, however, the usual choice is [Average pooling](https://i.redd.it/61tcfy2xy2u41.png), also called Mean pooling: the document vector is the element-wise average of all the word vectors in the context. A schematic diagram of the same is provided:



<img src="https://yashuseth.files.wordpress.com/2018/08/5.jpg?w=834">


As we move on to more complex embeddings, we will keep using Mean Pooling to create sentence/paragraph vectors from the individual word vectors. There are also other strategies that combine the two, applying Max Pooling and then Mean Pooling to the word vectors to create the complete vectors.

<img src="https://www.researchgate.net/profile/Xingsheng_Yuan/publication/332810604/figure/fig2/AS:754128875683841@1556809743129/Simple-word-embedding-based-model-with-modified-hierarchical-pooling-strategy.png">
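The pooling strategies described above can be sketched in a few lines of NumPy. The three-dimensional word vectors below are invented purely for illustration (real Word2Vec vectors are learned and much larger), and `pool_sentence` is a hypothetical helper, not part of gensim.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, invented purely for illustration
toy_vectors = {
    "great": np.array([1.0, 0.0, 2.0]),
    "movie": np.array([0.0, 2.0, 2.0]),
    "boring": np.array([-1.0, 1.0, 0.0]),
}

def pool_sentence(tokens, vectors, how="mean"):
    """Combine the vectors of known tokens into one fixed-size sentence vector."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:                       # no known word: fall back to a zero vector
        return np.zeros(dim)
    stacked = np.vstack(known)
    return stacked.mean(axis=0) if how == "mean" else stacked.max(axis=0)

tokens = "a great movie".split()        # "a" is out of vocabulary and is skipped
mean_vec = pool_sentence(tokens, toy_vectors, how="mean")   # [0.5, 1.0, 2.0]
max_vec = pool_sentence(tokens, toy_vectors, how="max")     # [1.0, 2.0, 2.0]
```

Whatever the sentence length, the result has the same dimensionality as a single word vector, which is what lets a classifier consume reviews of different lengths.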


Some resources

- [Research](https://www.researchgate.net/figure/Simple-word-embedding-based-model-with-modified-hierarchical-pooling-strategy_fig2_332810604)
- [Some paper](https://www.cs.tau.ac.il/~wolf/papers/qagg.pdf)
- [Huggingface](https://medium.com/huggingface/universal-word-sentence-embeddings-ce48ddc8fc3a)

In [None]:
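#Convert word vectors to sentence vectors by applying mean pooling

def convert_sentence(data):
    #Each review is a string, so split it into tokens before the vocabulary lookup
    vocab=[w for w in data.split() if w in model.wv]
    if not vocab:
        #No known token: return a zero vector instead of averaging an empty selection
        return np.zeros(model.wv.vector_size)
    avg_pool=np.mean(model.wv[vocab],axis=0)
#     sum_pool=np.sum(model.wv[vocab],axis=0)
#     min_pool=np.min(model.wv[vocab],axis=0)
#     max_pool=np.max(model.wv[vocab],axis=0)
    return avg_pool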

train_df['Vectorized_Reviews']=train_df['review'].apply(convert_sentence)

#Split the dataset into training and testing sets
train_y=train_df['sentiment']
train_x,test_x,train_y,test_y=train_test_split(train_df['Vectorized_Reviews'],train_y,test_size=0.2,random_state=42)
train_x.shape,train_y.shape,test_x.shape,test_y.shape
    

### Convert the sentence vectors to List

After the split, each input is a pandas Series whose elements are NumPy arrays. Converting it to a plain list of equal-length vectors lets scikit-learn read it as a 2-D (samples x features) array, which can be easily fed into any statistical classifier for our use case.
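The difference between the two layouts can be seen in a short NumPy sketch. The vectors here are hypothetical stand-ins for the pooled review vectors.

```python
import numpy as np

# Three hypothetical 4-dimensional sentence vectors
sentence_vectors = [np.arange(4, dtype=float) + i for i in range(3)]

# Stored one-object-per-row (as in a pandas Series of arrays) there is no feature axis
as_objects = np.empty(len(sentence_vectors), dtype=object)
for i, vec in enumerate(sentence_vectors):
    as_objects[i] = vec

# A list of equal-length vectors stacks into a (n_samples, n_features) matrix
feature_matrix = np.vstack(sentence_vectors)
```

Here `as_objects.shape` is `(3,)` while `feature_matrix.shape` is `(3, 4)`; the second form is the one classifiers expect.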




In [None]:
test_x=list(test_x)
train_x=list(train_x)

## Apply Statistical Models on Static Embeddings

Now we apply the statistical models to the mean-pooled sentence vectors. This also lets us run gradient boosting algorithms on the static embeddings computed by taking the mean of the word vectors. The train and test sets have already been converted to lists for this purpose.

Steps:

- Apply Word2Vec on the Corpus
- Create Sentence Vectors by Mean Pooling
- Run the input sentence vectors with Kfold Cross Validation on Traditional and gradient boosting classifiers.
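The cross-validation step in the list above can be sketched without any ML library. The sketch below is an assumption-laden toy: `kfold_indices` is a hand-written splitter, a nearest-centroid rule stands in for the real classifiers, and two synthetic clusters stand in for the pooled review vectors.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=7):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]

def nearest_centroid_accuracy(X, y, train_idx, val_idx):
    """Stand-in classifier: assign each validation vector to the closest class centroid."""
    classes = np.unique(y[train_idx])
    centroids = np.array([X[train_idx][y[train_idx] == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X[val_idx][:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == y[val_idx]).mean())

# Toy "sentence vectors": two well-separated clusters standing in for the two sentiments
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.5, size=(50, 8)), rng.normal(2.0, 0.5, size=(50, 8))])
y = np.array([0] * 50 + [1] * 50)

fold_scores = [nearest_centroid_accuracy(X, y, tr, va) for tr, va in kfold_indices(len(X))]
mean_cv_accuracy = float(np.mean(fold_scores))
```

Averaging the per-fold scores gives the single figure that the notebook reports as the training score for each classifier.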


<img src="https://media1.tenor.com/images/0e438477bb88b5683690bfe101cf1181/tenor.gif?itemid=10724659">

In [None]:
%%time
#Applying W2V Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#loss and algorithm are left at their defaults: the 'deviance' and 'SAMME.R' spellings are deprecated or rejected in recent scikit-learn releases
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
w2v_model_training_result,w2v_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Word2Vec- Baseline Evaluation")
for name,model in models:
    #shuffle=True is needed for random_state to take effect
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_x,train_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean()*100, 2), "% accuracy score")
    w2v_model_training_result.append(results.mean())
    #Fit on the training vectors, then score on the held-out split
    model.fit(train_x,train_y)
    predictions=model.predict(test_x)
    accuracy = accuracy_score(test_y,predictions)
    w2v_model_validation_result.append(accuracy)

final_w2v_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
#Store the model names rather than the (name, estimator) tuples
final_w2v_outcomes['Model']=[name for name,_ in models]
final_w2v_outcomes['Training Acc']=w2v_model_training_result
final_w2v_outcomes['Validation Acc']=w2v_model_validation_result
final_w2v_outcomes.to_csv('W2V-Baseline.csv',index=False)
final_w2v_outcomes

## Applying XGBoost and LightGBM

This will allow us to compare the results of XGBoost with those of LightGBM.

<img src="https://miro.medium.com/max/1554/1*FLshv-wVDfu-i54OqvZdHg.png">

In [None]:
#Evaluating XGBoost & LightGBM on the dataset
from xgboost import XGBClassifier as xg
from lightgbm import LGBMClassifier as lg
model_xgb= xg(n_estimators=100,random_state=42)
model_xgb.fit(train_x,train_y)
y_pred_xgb=model_xgb.predict(test_x)
#Report the XGBoost accuracy (the LightGBM predictions do not exist yet at this point)
print(accuracy_score(test_y,y_pred_xgb))
# print("Confusion matrix")
model_lgbm= lg(n_estimators=100,random_state=42)
model_lgbm.fit(train_x,train_y)
y_pred_lgbm=model_lgbm.predict(test_x)
# print("Confusion matrix")
# print(confusion_matrix(test_y,y_pred_lgbm))
print(accuracy_score(test_y,y_pred_lgbm))

## Converting Other Vectors (Glove,Fasttext) to Word2Vec 

In this case, we will use the Glove and Fasttext vectors by converting them to the Word2Vec format and then applying mean pooling on them. The method of conversion is taken from the [previous Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop). The steps involved are as follows:

- Convert Glove, Fasttext and Google News vectors to the Word2Vec format by using Gensim
- Apply Mean Pooling on the Word Vectors to create Sentence Vectors
- Apply Statistical Classifiers with Kfold Cross Validation

There are alternate strategies to apply, but this is by far the simplest one, with minimal code.
Some resources:

- [Good Alternate Script](https://www.kaggle.com/eswarbabu88/toxic-comment-glove-logistic-regression)
- [Blog](https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa)
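The conversion step is lighter than it sounds: the word2vec text format is the same "word followed by its values" layout as a Glove file, preceded by one header line holding the number of vectors and their dimension. The sketch below illustrates that idea on two made-up entries; `add_word2vec_header` is a hypothetical helper written for this example, not the Gensim function used in the next cell.

```python
import io

def add_word2vec_header(glove_lines):
    """Prefix Glove-style lines ('word v1 v2 ...') with the '<count> <dim>' header
    that the word2vec text format expects."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    dim = len(rows[0].split(" ")) - 1           # the first field is the word itself
    return "\n".join([f"{len(rows)} {dim}"] + rows) + "\n"

# Two made-up 3-dimensional entries standing in for a real Glove file
fake_glove = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
converted = add_word2vec_header(fake_glove)
# The first line of `converted` is "2 3": two vectors of dimension three
```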



In [None]:
from gensim.models import Word2Vec,KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

##Google News Vectors to word2vec format for mean pooling
google_news_embed='../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin'
google_loaded_model=KeyedVectors.load_word2vec_format(google_news_embed,binary=True)
print(google_loaded_model)
##Glove Vectors to word2vec format for mean pooling
glove_file='../input/glove-global-vectors-for-word-representation/glove.6B.50d.txt'
#Name the converted file after the 50-dimensional source file it is built from
word2vec_output_file = 'glove.6B.50d.txt.word2vec'
glove2word2vec(glove_file, word2vec_output_file)
glove_loaded = KeyedVectors.load_word2vec_format(word2vec_output_file, binary=False)
print(glove_loaded)
##Fasttext to word2vec format for mean pooling
fasttext_file="../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec"
fasttext_model = KeyedVectors.load_word2vec_format(fasttext_file, binary=False)
print(fasttext_model)

In [None]:
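def convert_sentence_embeddings(data,model):
    #`model` is a KeyedVectors object, so membership tests and lookups go through it directly
    #Each review is a string, so split it into tokens first
    vocab=[w for w in data.split() if w in model]
    if not vocab:
        #No known token: return a zero vector instead of averaging an empty selection
        return np.zeros(model.vector_size)
    avg_pool=np.mean(model[vocab],axis=0)
#     sum_pool=np.sum(model[vocab],axis=0)
#     min_pool=np.min(model[vocab],axis=0)
#     max_pool=np.max(model[vocab],axis=0)
    return avg_pool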

#Google vectors
print('Google Vectors')
train_google_df=train_df
train_google_df['Google_News_Vectorized_Reviews']=train_df['review'].apply(lambda z: convert_sentence_embeddings(z,google_loaded_model) )
#Split the dataset into training and testing sets
train_google_y=train_df['sentiment']
train_google_x,test_google_x,train_google_y,test_google_y=train_test_split(train_google_df['Google_News_Vectorized_Reviews'],train_google_y,test_size=0.2,random_state=42)
train_google_x=list(train_google_x)
test_google_x=list(test_google_x)
# train_google_x.shape,train_google_y.shape,test_google_x.shape,test_google_y.shape


#Glove Vectors
#Enabled because the Glove classifier cell further down reads train_glove_x and test_glove_x
print('Glove Vectors')
train_glove_df=train_df
train_glove_df['Glove_Vectorized_Reviews']=train_df['review'].apply(lambda z: convert_sentence_embeddings(z,glove_loaded) )
#Split the dataset into training and testing sets
train_glove_y=train_df['sentiment']
train_glove_x,test_glove_x,train_glove_y,test_glove_y=train_test_split(train_glove_df['Glove_Vectorized_Reviews'],train_glove_y,test_size=0.2,random_state=42)
train_glove_x=list(train_glove_x)
test_glove_x=list(test_glove_x)
# train_glove_x.shape,train_glove_y.shape,test_glove_x.shape,test_glove_y.shape

# #FastText Vectors
# print('Fasttext Vectors')
# train_fasttext_df=train_df
# train_fasttext_df['Fasttext_Vectorized_Reviews']=train_df['review'].apply(lambda z: convert_sentence_embeddings(z,fasttext_model) )
# #Split the dataset into training and testing sets
# train_fasttext_y=train_df['sentiment']
# train_fasttext_x,test_fasttext_x,train_fasttext_y,test_fasttext_y=train_test_split(train_fasttext_df['Fasttext_Vectorized_Reviews'],train_fasttext_y,test_size=0.2,random_state=42)
# train_fasttext_x=list(train_fasttext_x)
# test_fasttext_x=list(test_fasttext_x)
# # train_fasttext_x.shape,train_fasttext_y.shape,test_fasttext_x.shape,test_fasttext_y.shape



In [None]:
#FastText Vectors
print('Fasttext Vectors')
train_fasttext_df=train_df
train_fasttext_df['Fasttext_Vectorized_Reviews']=train_df['review'].apply(lambda z: convert_sentence_embeddings(z,fasttext_model) )
#Split the dataset into training and testing sets
train_fasttext_y=train_df['sentiment']
train_fasttext_x,test_fasttext_x,train_fasttext_y,test_fasttext_y=train_test_split(train_fasttext_df['Fasttext_Vectorized_Reviews'],train_fasttext_y,test_size=0.2,random_state=42)
train_fasttext_x=list(train_fasttext_x)
test_fasttext_x=list(test_fasttext_x)
# train_fasttext_x.shape,train_fasttext_y.shape,test_fasttext_x.shape,test_fasttext_y.shape


## Applying the Classifiers

In the following cells we apply the classifiers to each set of sentence vectors in turn.

In [None]:
%%time
#Applying Google Vectors Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#loss and algorithm are left at their defaults: the 'deviance' and 'SAMME.R' spellings are deprecated or rejected in recent scikit-learn releases
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
models.append(('XGBoosting',xg(n_estimators=100,random_state=42)))
models.append(('LightGBM',lg(n_estimators=100,random_state=42)))
google_model_training_result,google_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Google Vectors- Baseline Evaluation")
for name,model in models:
    #shuffle=True is needed for random_state to take effect
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_google_x,train_google_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean()*100, 2), "% accuracy score")
    #Append to the Google result lists, which feed the DataFrame below
    google_model_training_result.append(results.mean())
    #Fit on the training vectors, then score on the held-out split
    model.fit(train_google_x,train_google_y)
    predictions=model.predict(test_google_x)
    accuracy = accuracy_score(test_google_y,predictions)
    google_model_validation_result.append(accuracy)

final_google_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
#Store the model names rather than the (name, estimator) tuples
final_google_outcomes['Model']=[name for name,_ in models]
final_google_outcomes['Training Acc']=google_model_training_result
final_google_outcomes['Validation Acc']=google_model_validation_result
final_google_outcomes.to_csv('GoogleNewsVectors-Baseline.csv',index=False)
final_google_outcomes

In [None]:
%%time
#Applying Glove Vectors Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#loss and algorithm are left at their defaults: the 'deviance' and 'SAMME.R' spellings are deprecated or rejected in recent scikit-learn releases
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
models.append(('XGBoosting',xg(n_estimators=100,random_state=42)))
models.append(('LightGBM',lg(n_estimators=100,random_state=42)))
glove_model_training_result,glove_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Glove Vectors- Baseline Evaluation")
for name,model in models:
    #shuffle=True is needed for random_state to take effect
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_glove_x,train_glove_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean()*100, 2), "% accuracy score")
    #Append to the Glove result lists, which feed the DataFrame below
    glove_model_training_result.append(results.mean())
    #Fit on the training vectors, then score on the held-out split
    model.fit(train_glove_x,train_glove_y)
    predictions=model.predict(test_glove_x)
    accuracy = accuracy_score(test_glove_y,predictions)
    glove_model_validation_result.append(accuracy)

final_glove_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
#Store the model names rather than the (name, estimator) tuples
final_glove_outcomes['Model']=[name for name,_ in models]
final_glove_outcomes['Training Acc']=glove_model_training_result
final_glove_outcomes['Validation Acc']=glove_model_validation_result
#Write to a Glove-specific file so the Google News results are not overwritten
final_glove_outcomes.to_csv('Glove-Baseline.csv',index=False)
final_glove_outcomes

In [None]:
%%time
#Applying Fasttext Vectors Balanced Baseline with KFold

models=[]
models.append(('LogisticRegression',LogisticRegression(C=1.0,penalty='l2')))
models.append(('KNearestNeighbors',KNeighborsClassifier()))
models.append(('DecisionTree',DecisionTreeClassifier(criterion='entropy')))
#models.append(('RandomForestRegressor',RandomForestRegressor(n_estimators = 1000, random_state = 42)))
#models.append(('RandomForestClassifier',RandomForestClassifier(n_estimators = 1000, criterion='gini')))
#loss and algorithm are left at their defaults: the 'deviance' and 'SAMME.R' spellings are deprecated or rejected in recent scikit-learn releases
models.append(('GradientBoostClassifier',GradientBoostingClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('AdaBoostClassifier',AdaBoostClassifier(learning_rate=1e-2,n_estimators=100)))
models.append(('ExtraTreesClassifier',ExtraTreesClassifier(n_estimators=10, max_depth=None,min_samples_split=2)))
models.append(('BagClassifier',BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)))
# models.append(('HistGradientBoostClassifier',HistGradientBoostingClassifier(max_iter=100)))
#models.append(('SupportVectorClassifier',SVC(C=1.0,kernel='sigmoid')))
models.append(('XGBoosting',xg(n_estimators=100,random_state=42)))
models.append(('LightGBM',lg(n_estimators=100,random_state=42)))
fasttext_model_training_result,fasttext_model_validation_result=[],[]
scoring='accuracy'
print("Statistical Model Fasttext Vectors- Baseline Evaluation")
for name,model in models:
    #shuffle=True is needed for random_state to take effect
    kfold=KFold(n_splits=10,shuffle=True,random_state=7)
    results=cross_val_score(model,train_fasttext_x,train_fasttext_y,cv=kfold,scoring=scoring)
    print("=======================")
    print("Classifiers: ",name, "Has a training score of", round(results.mean()*100, 2), "% accuracy score")
    #Append to the Fasttext result lists, which feed the DataFrame below
    fasttext_model_training_result.append(results.mean())
    #Fit on the training vectors, then score on the held-out split
    model.fit(train_fasttext_x,train_fasttext_y)
    predictions=model.predict(test_fasttext_x)
    accuracy = accuracy_score(test_fasttext_y,predictions)
    fasttext_model_validation_result.append(accuracy)

final_fasttext_outcomes=pd.DataFrame(columns=['Model','Training Acc','Validation Acc'])
#Store the model names rather than the (name, estimator) tuples
final_fasttext_outcomes['Model']=[name for name,_ in models]
final_fasttext_outcomes['Training Acc']=fasttext_model_training_result
final_fasttext_outcomes['Validation Acc']=fasttext_model_validation_result
#Write to a Fasttext-specific file so the Google News results are not overwritten
final_fasttext_outcomes.to_csv('Fasttext-Baseline.csv',index=False)
final_fasttext_outcomes

## Standard Neural Networks with Static Semantic Embeddings Baseline


<img src="https://miro.medium.com/max/688/1*zR61FG9RUd6ul4ecXA_euQ.jpeg">


In this section, we build a preliminary deep model using neural networks, specifically variants of RNNs. We start with a simple LSTM model to measure how deep models fare against the statistical ones. In the first case, we use the Keras Embedding layer and visualize the results before moving on to the embedding models.

- [Keras LSTM](https://keras.io/api/layers/recurrent_layers/lstm/)
- [Keras](https://keras.io/)
- [Keras Starter Guides](https://keras.io/examples/nlp/)
- [Tensorflow Starter](https://www.tensorflow.org/tutorials/keras/text_classification)
- [Tensorflow Hub](https://www.tensorflow.org/tutorials/keras/text_classification_with_hub)
- [Jason's Blog-Best practises](https://machinelearningmastery.com/best-practices-document-classification-deep-learning/)
- [Jason's Blog-Convolution Networks](https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-sentiment/)


More resources will be provided later; for now we focus on building specific RNN (Recurrent Neural Network) variants, with and without Static Semantic Embeddings, to create a Neural Model Baseline.


<img src="https://miro.medium.com/max/875/1*n-IgHZM5baBUjq0T7RYDBw.gif">


### Recurrent Neural Networks

Recurrent neural networks (RNN) are a class of neural networks that is powerful for modeling sequence data such as time series or natural language.

Schematically, a RNN layer uses a for loop to iterate over the timesteps of a sequence, while maintaining an internal state that encodes information about the timesteps it has seen so far.
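The "for loop over timesteps" can be written out directly. The following is a minimal NumPy sketch of a vanilla RNN forward pass, with randomly initialised weights and arbitrary sizes chosen only for illustration; it is not how Keras implements its layers internally.

```python
import numpy as np

def simple_rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over a sequence and return the hidden state at every timestep."""
    state = np.zeros(W_h.shape[0])             # internal state before any timestep
    states = []
    for x_t in inputs:                         # the for loop over timesteps
        # The new state depends on the current input and on the previous state
        state = np.tanh(W_x @ x_t + W_h @ state + b)
        states.append(state)
    return np.array(states)

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 5, 3, 4       # arbitrary sizes for the sketch
inputs = rng.normal(size=(seq_len, input_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

hidden_states = simple_rnn_forward(inputs, W_x, W_h, b)   # shape (5, 4)
```

The last row of `hidden_states` summarises the whole sequence, while the full array gives one state per timestep.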

The Keras RNN API is designed with a focus on:

- Ease of use: the built-in keras.layers.RNN, keras.layers.LSTM, keras.layers.GRU layers enable you to quickly build recurrent models without having to make difficult configuration choices.

- Ease of customization: You can also define your own RNN cell layer (the inner part of the for loop) with custom behavior, and use it with the generic keras.layers.RNN layer (the for loop itself). This allows you to quickly prototype different research ideas in a flexible way with minimal code.


A classic RNN appears as follows:

<img src="https://miro.medium.com/max/627/1*go8PHsPNbbV6qRiwpUQ5BQ.png">

This [video](https://youtu.be/8HyCNIVRbSU) provides a good description of how RNNs work.


In particular, an RNN works on the following logic:


<img src="https://miro.medium.com/max/875/1*3mDe6V5DRXqpHYKDfxN4Rg.png">


There are various kinds of such networks:


- Encoding Recurrent Neural Networks are just folds. They’re often used to allow a neural network to take a variable length list as input, for example taking a sentence as input.


<img src="https://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-encoding.png">


- Generating Recurrent Neural Networks are just unfolds. They’re often used to allow a neural network to produce a list of outputs, such as words in a sentence.


<img src="https://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-generating.png">


- General Recurrent Neural Networks are accumulating maps. They’re often used when we’re trying to make predictions in a sequence. For example, in voice recognition, we might wish to predict a phoneme for every time step in an audio segment, based on past context.


<img src="https://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-general.png">


- Bidirectional Recurrent Neural Networks are a more obscure variant, which I mention primarily for flavor. In functional programming terms, they are a left and a right accumulating map zipped together. They’re used to make predictions over a sequence with both past and future context.

<img src="https://colah.github.io/posts/2015-09-NN-Types-FP/img/RNN-bidirectional.png">
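The functional programming analogy in the list above maps onto Python's standard library quite directly. In this sketch a plain addition stands in for the RNN cell and a doubling rule stands in for the generator's state update; both are placeholders chosen only to keep the example readable.

```python
from functools import reduce
from itertools import accumulate

def step(state, x):
    """Stand-in for an RNN cell: combine the running state with one input."""
    return state + x

sequence = [1, 2, 3, 4]

# Encoding RNN = fold: consume the whole sequence, keep only the final state
encoded = reduce(step, sequence, 0)

# Generating RNN = unfold: start from a state and emit one output per step
def unfold(state, n_steps):
    outputs = []
    for _ in range(n_steps):
        outputs.append(state)
        state = state * 2                      # placeholder state update
    return outputs

generated = unfold(1, 4)

# General RNN = accumulating map: one output per timestep, based on past context
left_to_right = list(accumulate(sequence, step))

# Bidirectional RNN = a left and a right accumulating map zipped together
right_to_left = list(accumulate(reversed(sequence), step))[::-1]
bidirectional = list(zip(left_to_right, right_to_left))
```

Each element of `bidirectional` pairs what is known from the past with what is known from the future at that position.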

 
 
Some resources for understanding the derivatives and optimization inside the RNNs:

- [Maths](https://www.cs.toronto.edu/~tingwuwang/rnn_tutorial.pdf)
- [Blog](https://colah.github.io/posts/2015-09-NN-Types-FP/)
- [Blog](https://towardsdatascience.com/under-the-hood-of-neural-networks-part-2-recurrent-af091247ba78)
- [Kernel](https://www.kaggle.com/abhilash1910/nlp-workshop-ml-india#Neural-Networks)


These are some starter resources for creating preliminary networks for sentiment analysis and text/intent classification. Some advanced architectures will be covered later.


### Long Short Term Memory (LSTM)

[Drawbacks of RNNs](https://colah.github.io/posts/2015-08-Understanding-LSTMs/): One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, such as using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context – it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place that it’s needed is small, RNNs can learn to use the past information.

But there are also cases where we need more context. Consider trying to predict the last word in the text “I grew up in France… I speak fluent French.” Recent information suggests that the next word is probably the name of a language, but if we want to narrow down which language, we need the context of France, from further back. It’s entirely possible for the gap between the relevant information and the point where it is needed to become very large.

Unfortunately, as that gap grows, RNNs become unable to learn to connect the information. In theory, RNNs are absolutely capable of handling such “long-term dependencies.” A human could carefully pick parameters for them to solve toy problems of this form. Sadly, in practice, RNNs don’t seem to be able to learn them. The problem was explored in depth by Hochreiter (1991) [German] and Bengio, et al. (1994), who found some pretty fundamental reasons why it might be difficult.

Thankfully, LSTMs don’t have this problem!
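One commonly cited reason for the difficulty described above is the vanishing gradient: the learning signal that links a late output to an early input is a product of one factor per timestep, and that product shrinks quickly when the factors are below 1. The sketch below shows this for a scalar tanh RNN under a strong simplifying assumption (the same pre-activation at every step); the weight and activation values are arbitrary choices for illustration.

```python
import math

def gradient_through_time(n_steps, recurrent_weight=0.9, pre_activation=0.5):
    """Size of d h_T / d h_0 for a scalar tanh RNN whose pre-activation is held constant.
    By the chain rule each step contributes a factor w * (1 - tanh(a)^2)."""
    per_step = recurrent_weight * (1.0 - math.tanh(pre_activation) ** 2)
    return per_step ** n_steps

short_gap = gradient_through_time(2)     # relevant word two steps back
long_gap = gradient_through_time(50)     # relevant word fifty steps back
```

With these values the gradient across 50 steps is many orders of magnitude smaller than across 2 steps, so the distant context barely influences learning.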

- LSTMs:
 
 <img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png">
 
The first step in our LSTM is to decide what information we’re going to throw away from the cell state. This decision is made by a sigmoid layer called the “forget gate layer.” It looks at $h_{t-1}$ and $x_t$, and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$. A 1 represents “completely keep this” while a 0 represents “completely get rid of this.”

Let’s go back to our example of a language model trying to predict the next word based on all the previous ones. In such a problem, the cell state might include the gender of the present subject, so that the correct pronouns can be used. When we see a new subject, we want to forget the gender of the old subject.

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png">


The next step is to decide what new information we’re going to store in the cell state. This has two parts. First, a sigmoid layer called the “input gate layer” decides which values we’ll update. Next, a tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added to the state. In the next step, we’ll combine these two to create an update to the state.

In the example of our language model, we’d want to add the gender of the new subject to the cell state, to replace the old one we’re forgetting.

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png">

It’s now time to update the old cell state, $C_{t-1}$, into the new cell state $C_t$. The previous steps already decided what to do, we just need to actually do it.

We multiply the old state by $f_t$, forgetting the things we decided to forget earlier. Then we add $i_t * \tilde{C}_t$. This is the new candidate values, scaled by how much we decided to update each state value.

In the case of the language model, this is where we’d actually drop the information about the old subject’s gender and add the new information, as we decided in the previous steps.

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png">

Finally, we need to decide what we’re going to output. This output will be based on our cell state, but will be a filtered version. First, we run a sigmoid layer which decides what parts of the cell state we’re going to output. Then, we put the cell state through tanh (to push the values to be between −1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.

For the language model example, since it just saw a subject, it might want to output information relevant to a verb, in case that’s what is coming next. For example, it might output whether the subject is singular or plural, so that we know what form a verb should be conjugated into if that’s what follows next.

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png">
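
The four steps described above can be put together in a few lines of NumPy. This is only a minimal sketch for intuition, not the Keras implementation: the function name `lstm_step`, the packing of all four gate weights into one matrix `W`, and the toy sizes are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step (hypothetical helper for illustration).

    W has shape (4 * hidden, hidden + input) and b has shape (4 * hidden,).
    The four row blocks hold the forget, input, candidate and output weights.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0 * hidden:1 * hidden])      # forget gate: what to drop from C(t-1)
    i_t = sigmoid(z[1 * hidden:2 * hidden])      # input gate: which values to update
    c_tilde = np.tanh(z[2 * hidden:3 * hidden])  # candidate values
    o_t = sigmoid(z[3 * hidden:4 * hidden])      # output gate: what to expose
    c_t = f_t * c_prev + i_t * c_tilde           # new cell state
    h_t = o_t * np.tanh(c_t)                     # filtered output / new hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):     # a toy sequence of 5 time steps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Because the output is a sigmoid gate multiplied by a tanh, every entry of `h` stays strictly between -1 and 1, as described in the last step.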


An illustrated working of the LSTM is provided:


<img src="https://miro.medium.com/max/1900/1*GjehOa513_BgpDDP6Vkw2Q.gif">


Some blogs:

- [Blog](https://www.google.com/url?sa=i&url=https%3A%2F%2Ftowardsdatascience.com%2Fillustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21&psig=AOvVaw3GJ2-g9jyCgtlUxlTAmyJ8&ust=1608535825759000&source=images&cd=vfe&ved=0CA0QjhxqFwoTCLjax4eF3O0CFQAAAAAdAAAAABAD)
- [Blog](https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-introduction-to-lstm/)
- [Blog](https://machinelearningmastery.com/gentle-introduction-long-short-term-memory-networks-experts/)
- [Paper](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43905.pdf)

There are several variants of LSTMs, one of the best known being the Gated Recurrent Unit (GRU):

A slightly more dramatic variation on the LSTM is the Gated Recurrent Unit, or GRU, introduced by Cho, et al. (2014). It combines the forget and input gates into a single “update gate.” It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than standard LSTM models, and has been growing increasingly popular.

<img src="https://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-var-GRU.png">
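
For comparison with the LSTM, a single GRU step can be sketched in NumPy as well. This follows the formulation shown in the figure above, with biases left out for brevity; the helper name `gru_step` and the toy sizes are assumptions for this example, and library implementations may order or combine the gates slightly differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU time step (hypothetical helper, biases omitted)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx)                                        # update gate
    r_t = sigmoid(W_r @ hx)                                        # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde                    # blend old and new

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
shape = (hidden_size, hidden_size + input_size)
W_z, W_r, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
h = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):   # a toy sequence of 5 time steps
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h.shape)
```

Note that there is no separate cell state here: the single vector `h` plays both roles, which is why the GRU has fewer parameters than an LSTM of the same width.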




In [None]:
#Import libraries
#All Keras objects are imported from tensorflow.keras so that layers, models and utilities come from one package
import numpy as np  
import pandas as pd 
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as k
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, LSTM, GRU, Embedding, Dense, Flatten, Conv1D, Conv2D, GlobalMaxPooling1D, GlobalMaxPool1D, Concatenate, TimeDistributed, Bidirectional
from tensorflow.keras.models import Model,Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical, plot_model
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

### Creating a Basic LSTM Neural Model without Embeddings

In this case, we will not be using any pretrained static/dynamic embeddings; instead, we will build a simple LSTM neural network model whose embedding layer is learned from scratch. The steps are as follows:


- Tokenize the input data (Keras.Preprocessing)
- Creating the limits of Maxlen, Max Features and Embedding Size for our Embedding Matrix
- Pad the tokenized data to maintain uniformity in length of the input features

A more descriptive overview is found [here](https://www.kaggle.com/abhilash1910/nlp-workshop-ml-india#Neural-Networks). The TensorFlow guide on Keras RNNs also gives an [idea](https://www.tensorflow.org/guide/keras/rnn) of the workflow.
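
To make the tokenize-and-pad steps concrete, here is a small pure-Python sketch on three toy reviews. It is a simplified stand-in and not the Keras code itself: the helpers `fit_word_index`, `texts_to_ids` and `pad_pre` are hypothetical names, and they only mimic what we understand to be the default behaviour (ids ranked by word frequency, only ids below `num_words` kept, zeros padded at the front and long sequences cut from the front).

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets id 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Replace words by ids and drop every word whose id is not below num_words."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad_pre(sequences, maxlen, value=0):
    """Left-pad or left-truncate each list of ids to exactly maxlen entries."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                        # keep the last maxlen ids
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

texts = ["great movie great cast", "great movie", "dull"]
word_index = fit_word_index(texts)
ids = texts_to_ids(texts, word_index, num_words=3)       # keeps only ids 1 and 2
padded = pad_pre(ids, maxlen=4)
print(padded)   # [[0, 1, 2, 1], [0, 0, 1, 2], [0, 0, 0, 0]]
```

The third review contains only a rare word, so after the vocabulary cut it becomes a row of padding. This is the price of a small `max_features`.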




In [None]:
##First Step is to test model performance without pretrained Embeddings
## Will be using only Keras Embeddings in this case with a minimal neural network model

maxlen=1000
max_features=5000 
embed_size=300

#clean some null words or use the previously cleaned & lemmatized corpus

train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)

val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequences so that every tokenized review has the same length
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)


## Creating the Model architecture

Here we create a [sequential model](https://keras.io/api/models/sequential/), and the embedding has to be the first layer for our use case. In a neural text model like this one, the Embedding layer comes first, followed by other layers such as LSTM/GRU. The hierarchy of the model can be represented as below:

<img src="https://media.springernature.com/lw685/springer-static/image/art%3A10.1186%2Fs12859-019-3079-8/MediaObjects/12859_2019_3079_Fig2_HTML.png">


A proper model overview comprising LSTMs and Embeddings is provided here:

<img src="https://d3i71xaburhd42.cloudfront.net/6ac8328113639044d2beb83246b9d07f513ac6c8/3-Figure1-1.png">

Some resources:

- [Kernels](https://www.kaggle.com/rajmehra03/a-detailed-explanation-of-keras-embedding-layer)
- [Kernels](https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings)
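
To see why the embedding comes first, it helps to view it as a plain lookup table: every token id selects one row of a matrix. The NumPy sketch below uses a tiny made-up table (the sizes and values are assumptions for illustration, not our real 5000 x 300 matrix).

```python
import numpy as np

vocab_size, embed_dim = 6, 4
# Toy embedding table: row i is the vector for token id i
embedding_table = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

# Two padded reviews of length 4 (0 is the padding id)
padded_ids = np.array([[0, 0, 3, 5],
                       [1, 2, 2, 4]])

embedded = embedding_table[padded_ids]   # integer indexing performs the lookup
print(embedded.shape)                    # (2, 4, 4): (batch, maxlen, embed_dim)
```

The recurrent layer that follows consumes this (batch, maxlen, embed_dim) tensor one time step at a time.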


In [None]:
##Design a Simple Network

model=Sequential()
model.add(Embedding(max_features,embed_size,input_length=maxlen))
model.add(LSTM(60))
model.add(Dense(16,activation='relu'))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
plot_model(
    model,
    to_file="simple_model.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96,
)
model.fit(train_x,train_y,batch_size=512,epochs=3,verbose=2)


## Model Visualization

The model with its individual layers can be visualized as follows:

<img src="https://i.imgur.com/qI6zRqM.png">


We can also look into the model parameters and the weights of the intermediate layers. We can visualize the sizes of the hidden layers and the output of each sequential layer in the model. Some resources:

- [Keras](https://keras.io/getting_started/faq/#how-can-i-obtain-the-output-of-an-intermediate-layer-feature-extraction)
- [Stack Overflow](https://stackoverflow.com/questions/41711190/keras-how-to-get-the-output-of-each-layer)
- [Kite](https://www.kite.com/python/answers/how-to-get-the-output-of-each-layer-of-a-keras-model-in-python#:~:text=A%20Keras%20model%20runs%20data,and%20applies%20the%20layer%20funtion.)


In [None]:
## Get to know individual layer sizes and parameters 

from tensorflow.keras import backend as k
inputs=model.input
outputs=[layer.output for layer in model.layers]
print(f"Outputs of the sequential layers: {outputs}")
#One backend function per layer, mapping the model input to that layer's output
functions=[k.function([inputs],[outs]) for outs in outputs]
print(f"Sequential model layer functions: {functions}")





## Inference from the Model

The model has roughly 1.6 million trainable parameters even though no pretrained embeddings are used; about 1.5 million of them (5000 x 300) belong to the Embedding layer. Keras makes it simple to train such large parametric models.
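
The parameter count can be checked by hand from the layer sizes used above (`max_features=5000`, `embed_size=300`, an LSTM with 60 units, then Dense layers with 16 and 1 units). The short calculation below is a sanity check using the standard per-layer formulas, and should agree with what `model.summary()` reports for this architecture.

```python
max_features, embed_size = 5000, 300
lstm_units, dense_units = 60, 16

embedding_params = max_features * embed_size                               # lookup table
# 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embed_size + lstm_units) + lstm_units)
dense_params = lstm_units * dense_units + dense_units                      # Dense(16)
output_params = dense_units * 1 + 1                                        # Dense(1)

total = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total)
# 1500000 86640 976 17 1587633
```

Almost 95% of the weights sit in the embedding table, which is one reason why initializing it from pretrained vectors is attractive.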

<img src="https://media1.tenor.com/images/54603c681d37cecb2973e7974dea7f43/tenor.gif?itemid=16430080">

In [None]:
#Fit and validate
model.fit(train_x,train_y,batch_size=128,epochs=3,verbose=2,validation_data=(val_x,val_y))

## Build a Static Semantic Embedding Neural Network(LSTM) Baseline

In this case, we will be using pretrained embeddings for our use case. For this we will be using the embedding matrix creation code from our [previous Notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop).

Particularly these lines of code:

```python
from keras.preprocessing.sequence import pad_sequences
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
maxlen=1000
max_features=5000 
embed_size=300

train_sample=train_df['review']

#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_sample))
train_sample=tokenizer.texts_to_sequences(train_sample)

#Pad the sequence- To allow same length for all vectorized words
train_sample=pad_sequences(train_sample,maxlen=maxlen)



EMBEDDING_FILE = '../input/wikinews300d1msubwordvec/wiki-news-300d-1M-subword.vec'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE) if len(o)>100)

all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
plt.plot(embedding_matrix[20])
plt.show()
```

In [None]:
##Build Static Embedding on top of a Neural Model

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
maxlen=1000
max_features=5000 
embed_size=300

#Reuse the tokenizer fitted on the training split in the earlier cell.
#Refitting it here on the full corpus would reorder word_index, so the rows of
#embedding_matrix would no longer match the token ids already stored in train_x/val_x.



EMBEDDING_FILE = '../input/glove-global-vectors-for-word-representation/glove.6B.50d.txt'
def get_coefs(word,*arr): return word, np.asarray(arr, dtype='float32')
#Each line holds a word followed by its vector; very short lines are skipped
embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(EMBEDDING_FILE) if len(o)>100)

#np.stack needs a sequence, so the dict values are materialized as a list
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
#Words missing from Glove keep a random vector drawn with the Glove mean and std
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= nb_words: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector

## Run the same model with the Pretrained Embeddings

Now we will run a model similar to the earlier one, this time with a bidirectional LSTM and the pretrained static embeddings (Glove in our use case). I have trained it for 3 epochs, but it can be trained for more epochs.

In [None]:
inp=Input(shape=(maxlen,))
z=Embedding(max_features,embed_size,weights=[embedding_matrix])(inp)
z=Bidirectional(LSTM(60,return_sequences=True))(z)
z=GlobalMaxPool1D()(z)
z=Dense(16,activation='relu')(z)
z=Dense(1,activation='sigmoid')(z)
model=Model(inputs=inp,outputs=z)
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
plot_model(
    model,
    to_file="glove_simple_model.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96,
)

model.fit(train_x,train_y,batch_size=128,epochs=3,verbose=2,validation_data=(val_x,val_y))

## Model Visualization

The bidirectional LSTM model with Glove pretrained embeddings:

<img src="https://i.imgur.com/7FpjJVP.png">

## Working With ELMO

[ELMO](https://tfhub.dev/google/elmo/3) is a deep contextual embedding model comprising stacked bidirectional LSTM layers. Salient features of the model:

- Computes contextualized word representations using character-based word representations and bidirectional LSTMs, as described in the paper "Deep contextualized word representations" [1].

- This module supports inputs both in the form of raw text strings and tokenized text strings.

- The module outputs fixed embeddings at each LSTM layer, a learnable aggregation of the 3 layers, and a fixed mean-pooled vector representation of the input.

- The complex architecture achieves state of the art results on several benchmarks.

The entire architectural model is as follows:



<img src="http://jalammar.github.io/images/Bert-language-modeling.png">


The deep contextual representations are retained with the help of bidirectional LSTM layers, which combine look-ahead and look-back mechanisms to represent the context of each word.


<img src="http://jalammar.github.io/images/elmo-forward-backward-language-model-embedding.png">


Some resources:

- [Paper](https://arxiv.org/abs/1802.05365)
- [Kernel](https://www.kaggle.com/sarthak221995/textclassification-95-5-accuracy-elmo)
- [Blog](https://towardsdatascience.com/elmo-contextual-language-embedding-335de2268604)
- [Blog](https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/)
- [Jay's Blog](http://jalammar.github.io/illustrated-bert/)


## Some Issues with ELMO


ELMO appears to work well with TF 1.15 (or any TF version < 2.0.0), since the code below relies on the TF1-style `Module` API of TensorFlow Hub. To use ELMO from Tensorflow [Hub](https://tfhub.dev/google/elmo/3), we have to follow these steps:

- Restart the Kernel
- Run the cell containing:
  ```python
   !pip install -U tensorflow==1.15
  ```
- Check if the older version of tensorflow is installed (the session will automatically get restarted).
  ```python
   import tensorflow as tf
   tf.__version__
  ```
- Make sure the ELMO embeddings load and can be used by running the example cell below this markdown.
- Create the ELMO embeddings and feed them into the classifier.


Since ELMO is already built from bidirectional LSTMs, the only layers that gave a proper accuracy in our case are Dense layers (placed after the embedding layer). The token-level ELMO embeddings have a shape of (?,?,1024), i.e. (batch, tokens, 1024), and compressing these multidimensional embeddings into a Dense layer (e.g. 256 units) takes a huge computation time. The classifier below therefore uses the mean-pooled `default` output of shape (?,1024). The program for the ELMO embedding classifier is written in TensorFlow.
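
The shape reduction from (?,?,1024) to (?,1024) is just an average over the token axis. The NumPy sketch below illustrates that idea with random numbers in place of real ELMO vectors; the sentence lengths and the masking of padded positions are assumptions made for this example, not a description of the hub module's internals.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, max_len, dim = 2, 5, 1024
token_embeddings = rng.normal(size=(batch, max_len, dim))   # stand-in for the per-token output
lengths = np.array([5, 3])                                  # true token count of each sentence

# 1.0 for real tokens, 0.0 for padded positions
mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(float)

# Average only over the real tokens of each sentence
sentence_vectors = (token_embeddings * mask[:, :, None]).sum(axis=1) / lengths[:, None]
print(sentence_vectors.shape)   # (2, 1024)
```

A Dense layer on top of such a pooled vector needs far fewer operations than one applied to every token position.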

In [None]:
#Make sure ELMO embeddings are working in tf version 1.15

# !pip install tensorflow==1.15
import tensorflow as tf
import tensorflow_hub as tf_hub

elmo = tf_hub.Module("https://tfhub.dev/google/elmo/2")
embeddings = elmo(
    ["the cat is on the mat", "dogs are in the fog"],
    signature="default",
    as_dict=True)["elmo"]
embeddings

In [None]:
import tensorflow as tf
import tensorflow_hub as tf_hub

# from keras.layers import Input, Lambda, Dense
# from keras import backend as k
elmo_embed=tf_hub.Module("https://tfhub.dev/google/elmo/2",trainable=True)

#Creating the elmo embeddings by squeezing the inputs
def create_embedding(z):
    #Squeeze only axis 1 so that a batch containing a single review keeps its batch dimension
    return elmo_embed(tf.squeeze(tf.cast(z,tf.string),axis=1),signature='default',as_dict=True)["default"]

train_y=labels
X=train_df['review'].tolist()
train_x,test_x,train_y,test_y=train_test_split(np.asarray(X),train_y,test_size=0.2,random_state=42)
#Create the model with ELMO Embeddings and Dense Layers

inp=tf.keras.layers.Input(shape=(1,),dtype=tf.string)
z=tf.keras.layers.Lambda(create_embedding,output_shape=(1024,))(inp)
z=tf.keras.layers.Dense(128,activation='relu')(z)
z=tf.keras.layers.Dense(1,activation='sigmoid')(z)
model=tf.keras.Model(inputs=inp,outputs=z)
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
plot_model(
    model,to_file="elmo_dense_model.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)


with tf.Session() as session:
#     K.set_session(session)
    session.run(tf.global_variables_initializer())  
    session.run(tf.tables_initializer())
    history = model.fit(train_x, train_y, epochs=1, batch_size=16)
    model.save_weights('./response-elmo-model.h5')

# with tf.Session() as session:
#     K.set_session(session)
#     session.run(tf.global_variables_initializer())
#     session.run(tf.tables_initializer())
#     model.load_weights('./response-elmo-model.h5')  
#     predicts = model.predict(x_test, batch_size=16)


In [None]:
!pip uninstall tensorflow -y
!pip uninstall tensorflow-cloud -y
!pip install -U tensorflow==1.15

In [None]:
!pip uninstall pytorch-lightning -y
!pip uninstall tensorflow-probability -y

### Checking for the Installed Tensorflow Version

This code block checks for the tensorflow version (downgraded)

In [None]:
import tensorflow as tf
tf.__version__


## Conclusion of ELMO Embeddings

We used ELMO embeddings to see how they perform in our classification model. For the actual implementation of ELMO (Peters et al.), these resources are provided:

- [Github](https://github.com/allenai/bilm-tf)
- [Resources](https://paperswithcode.com/paper/deep-contextualized-word-representations)


<img src="https://media3.giphy.com/media/3o7budMRwZvNGJ3pyE/giphy.gif">


## Encoder Decoder Architectures

[Encoder-Decoders](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html) are a classic architecture, most popular for sequence-to-sequence (seq2seq) learning and in particular for neural machine translation (often combined with attention). The general workflow revolves around stacks of RNNs (LSTMs/GRUs/TimeDistributed cells). The encoder is built from 3 parameters (max_features, embed_size and maxlen in our example), reads the input sequence and returns an output together with its 2 final LSTM states, the hidden state h and the cell state c, which we save. We design the decoder model in a similar manner (if the internal layers are modified, it becomes a hybrid decoder). While passing the inputs to the decoder, we also pass the h and c states saved from the encoder as the initial state of the decoder. The output of the decoder is then passed through an activation/distribution function (a sigmoid in our case) to optimize our target loss function.


<img src="https://miro.medium.com/max/875/1*CkeGXClZ5Xs0MhBc7xFqSA.png">



### Descriptive overview of NMT with Encoder Decoders:



In the context of NMT, the words of one language should be mapped to a different language (machine translation). An example of such an architecture is as follows:



<img src="https://blog.keras.io/img/seq2seq/seq2seq-teacher-forcing.png">



In the general case, input sequences and output sequences have different lengths (e.g. machine translation) and the entire input sequence is required in order to start predicting the target. This requires a more advanced setup, which is what people commonly refer to when mentioning "sequence to sequence models" with no further context. Here's how it works:

- A RNN layer (or stack thereof) acts as "encoder": it processes the input sequence and returns its own internal state. Note that we discard the outputs of the encoder RNN, only recovering the state. This state will serve as the "context", or "conditioning", of the decoder in the next step.

- Another RNN layer (or stack thereof) acts as "decoder": it is trained to predict the next characters of the target sequence, given previous characters of the target sequence. Specifically, it is trained to turn the target sequences into the same sequences but offset by one timestep in the future, a training process called "teacher forcing" in this context. Importantly, the decoder uses as initial state the state vectors from the encoder, which is how the decoder obtains information about what it is supposed to generate. Effectively, the decoder learns to generate targets[t+1...] given targets[...t], conditioned on the input sequence.
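
The "offset by one timestep" idea behind teacher forcing is easy to see on a toy target sequence. In the sketch below the token ids and the `START`/`END` markers are made up for illustration.

```python
import numpy as np

START, END = 1, 2                         # hypothetical special token ids
target = np.array([START, 15, 8, 42, END])

decoder_input = target[:-1]               # what the decoder is fed at each step
decoder_target = target[1:]               # what it must predict at that step

for step, (seen, expected) in enumerate(zip(decoder_input, decoder_target)):
    print(f"t={step}: input token {seen} -> predict {expected}")
```

At every step the decoder is fed the true previous token rather than its own earlier prediction, which is what makes this "teacher forcing".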



### Pictorial Representation of Encoder-Decoders for Generative Modelling



<img src="https://miro.medium.com/max/1250/1*LYGO4IxqUYftFdAccg5fVQ.png">



Some resources:

- [TF-Blog](https://www.tensorflow.org/tutorials/text/nmt_with_attention)
- [Blog](https://towardsdatascience.com/how-to-implement-seq2seq-lstm-model-in-keras-shortcutnlp-6f355f3e5639)
- [Blog](https://machinelearningmastery.com/develop-encoder-decoder-model-sequence-sequence-prediction-keras/)


In [None]:
maxlen=1000
max_features=5000 
embed_size=300

#clean some null words or use the previously cleaned & lemmatized corpus

train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)

val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequences so that every tokenized review has the same length
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)

#sequence to sequence basic lstm encoder and lstm decoder
def seq2seq_encoder_decoder(maxlen,max_features,embed_size):
    #Creating LSTM  encoder neural model with no pretrained embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,input_length=maxlen,trainable=True)(encoder_inp)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_lstm_h,encoder_state_lstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,input_length=maxlen,trainable=True)(decoder_inp)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[encoder_state_lstm_h,encoder_state_lstm_c])
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
    
    
model=seq2seq_encoder_decoder(maxlen,max_features,embed_size)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)
    
model.fit([train_x,train_x],train_y,batch_size=512,epochs=3,verbose=2)


## Encoder-Decoder Model Architecture

The model architecture by using trainable Embeddings (no pretrained embeddings):

<img src="https://i.imgur.com/sGBqoBB.png">



## Homologous Encoder Decoder With Pretrained Embeddings

In this context, we will apply a pretrained static embedding (the Glove embedding matrix) to our Encoder Decoder model (comprising LSTM units). Then we will visualize the training and performance on our dataset.



In [None]:
def seq2seq_encoder_decoder_glove(maxlen,max_features,embedding_matrix):
    #Embedding width is read from the pretrained matrix instead of a global variable
    embed_size=embedding_matrix.shape[1]
    #Creating LSTM encoder neural model with pretrained Glove embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_lstm_h,encoder_state_lstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[encoder_state_lstm_h,encoder_state_lstm_c])
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
    
    
model=seq2seq_encoder_decoder_glove(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)
    
model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Homologous Encoder-Decoder Model Architecture

The model architecture with pretrained Glove 50d embeddings:

<img src="https://i.imgur.com/WQww2yU.png">

In [None]:
#Bidirectional LSTM Encoder-Decoder
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequences so that every tokenized review has the same length
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)

def seq2seq_encoder_decoder_glove_bilstm(maxlen,max_features,embedding_matrix):
    #Embedding width is read from the pretrained matrix instead of a global variable
    embed_size=embedding_matrix.shape[1]
    #Creating Bidirectional LSTM encoder neural model with pretrained Glove embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
    encoder_lstm_cell=Bidirectional(LSTM(60,return_state=True))
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c,encoder_state_blstm_h,encoder_state_blstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    #Forward (h,c) and backward (h,c) states, in the order expected by a Bidirectional decoder
    encoded_states=[encoder_state_flstm_h,encoder_state_flstm_c,encoder_state_blstm_h,encoder_state_blstm_c]
    #Creating Bidirectional LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    
    decoder_lstm_cell=Bidirectional(LSTM(60,return_sequences=True,return_state=True),merge_mode="concat")
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
    
    
model=seq2seq_encoder_decoder_glove_bilstm(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)
    
model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Model Architecture

The model architecture is shown here:

<img src="https://i.imgur.com/5FAd57t.png">

## Hybrid Encoder Decoder Models

These classes of encoder decoders mix different variants of RNNs (LSTM/BiLSTM) between the encoder and the decoder, which together act as a variational circuit. Hybrid decoder models generally have a compression decoder, which implies that the decoder can be a plain GRU/LSTM while the encoder can be any bidirectional version of it. The forward and backward hidden and cell states of the encoder are merged so that they fit the decoder (summed in the code below; concatenation is another option), which allows a smooth compression of the tensors.
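
The effect of the two merge options on the state shapes can be checked with a few lines of NumPy. The arrays below are random stand-ins for the forward and backward encoder states (60 units, as in our models); the variable names are assumptions for this illustration.

```python
import numpy as np

batch, units = 2, 60
rng = np.random.default_rng(0)
fwd_h, fwd_c = rng.normal(size=(batch, units)), rng.normal(size=(batch, units))
bwd_h, bwd_c = rng.normal(size=(batch, units)), rng.normal(size=(batch, units))

# Option 1: sum the directions, so a 60-unit unidirectional decoder can take the states as they are
summed_states = [fwd_h + bwd_h, fwd_c + bwd_c]

# Option 2: concatenate the directions, which would need a 120-unit decoder instead
concat_states = [np.concatenate([fwd_h, bwd_h], axis=-1),
                 np.concatenate([fwd_c, bwd_c], axis=-1)]

print([s.shape for s in summed_states])   # [(2, 60), (2, 60)]
print([s.shape for s in concat_states])   # [(2, 120), (2, 120)]
```

Summing keeps the decoder as small as a single encoder direction, while concatenation preserves both directions separately at the cost of a wider decoder.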

In [None]:
#Hybrid Encoder-Decoder: Bidirectional LSTM encoder with a unidirectional LSTM decoder
# maxlen=1000
# max_features=5000 
# embed_size=300

train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequences so that every tokenized review has the same length
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)

def seq2seq_encoder_decoder_glove_bilstm_hybrid(maxlen,max_features,embedding_matrix):
    #Embedding width is read from the pretrained matrix instead of a global variable
    embed_size=embedding_matrix.shape[1]
    #Creating Bidirectional LSTM encoder neural model with pretrained Glove embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
    encoder_lstm_cell=Bidirectional(LSTM(60,return_state=True),merge_mode='sum')
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c,encoder_state_blstm_h,encoder_state_blstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    #Sum the forward and backward states so they fit a unidirectional 60-unit decoder
    encoded_states=[encoder_state_flstm_h+encoder_state_blstm_h,encoder_state_flstm_c+encoder_state_blstm_c]
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
    
    
model=seq2seq_encoder_decoder_glove_bilstm_hybrid(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm_hybrid.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)
    
model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Hybrid Encoder Decoder With Attention

This section covers hybrid encoder-decoder architectures combined with several variants of the attention mechanism. As an introduction: attention lets the model assign learned weights to different parts of the input, so that the most relevant parts contribute most to the prediction, which in turn tends to help model performance.

A key paper on attention is [Attention Is All You Need](https://paperswithcode.com/paper/attention-is-all-you-need).
A preview of the mechanism is provided in the images below:

<img src="https://www.tensorflow.org/images/seq2seq/attention_mechanism.jpg">

<img src="https://www.tensorflow.org/images/seq2seq/attention_equation_0.jpg">


More details on the different variants follow in the sections below. For now, this TensorFlow tutorial should [help](https://www.tensorflow.org/tutorials/text/nmt_with_attention).

In [None]:
!pip install MiniAttention

In [None]:
#Bidirectional LSTM Hybrid Encoder-Decoder with Hierarchical Attention
import MiniAttention.MiniAttention as ma
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)

def seq2seq_encoder_decoder_glove_bilstm_hybrid_attention(maxlen,max_features,embedding_matrix):
    #Creating the Bidirectional LSTM encoder on top of the pretrained GloVe embeddings, with a MiniAttention block applied to the embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
    encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=Bidirectional(LSTM(60,return_state=True),merge_mode="sum")
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c,encoder_state_blstm_h,encoder_state_blstm_c=encoder_lstm_cell(encoder_embed_attention)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h+encoder_state_blstm_h,encoder_state_flstm_c+encoder_state_blstm_c]
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
    
    
model=seq2seq_encoder_decoder_glove_bilstm_hybrid_attention(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm_hybrid_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)
    
model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Bahdanau Attention



<img src="https://miro.medium.com/max/639/1*qhOlQHLdtfZORIXYuoCtaA.png">

Bahdanau et al. proposed an attention mechanism that learns to align and translate jointly. It is also known as additive attention, because the alignment score is computed from a sum of linear projections of the encoder states and the decoder state.

Let’s look at the attention mechanism proposed by Bahdanau et al.:

- All hidden states of the encoder (forward and backward) and the decoder are used to generate the context vector, unlike seq2seq without attention, where only the last encoder hidden state is used.
- The attention mechanism aligns the input and output sequences, with an alignment score parameterized by a feed-forward network. It helps to pay attention to the most relevant information in the source sequence.
- The model predicts a target word based on the context vectors associated with the source position and the previously generated target words.


### Alignment Score

The alignment score measures how well the inputs around position “j” and the output at position “i” match. The score is based on the decoder’s previous hidden state, s₍ᵢ₋₁₎, just before predicting the target word, and the hidden state hⱼ of the input sentence.


<img src="https://miro.medium.com/max/535/1*u2YdTRPjN34Fpr-zxvoJsg.png">


The decoder decides which part of the source sentence it needs to pay attention to, instead of having the encoder encode all the information of the source sentence into a fixed-length vector.
The alignment vector has the same length as the source sequence and is computed at every time step of the decoder.
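
As a rough illustration of the additive score, here is a minimal sketch in plain Python (no framework, so each step is visible). The form `v · tanh(W_s s + W_h h)` follows the Bahdanau et al. paper; the matrices `W_s`, `W_h`, the vector `v` and the function names are made-up toy values and hypothetical names standing in for learned parameters.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def additive_score(s_prev, h_j, W_s, W_h, v):
    """Additive alignment score: v . tanh(W_s s_prev + W_h h_j)."""
    projected = [a + b for a, b in zip(matvec(W_s, s_prev), matvec(W_h, h_j))]
    return sum(v_k * math.tanh(p) for v_k, p in zip(v, projected))

# Toy parameters (assumed values; in a real model these are learned)
W_s = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]   # projects the decoder state
W_h = [[0.2, 0.0], [0.1, 0.1], [-0.2, 0.4]]   # projects an encoder state
v = [1.0, -1.0, 0.5]                          # reduces the result to a scalar

s_prev = [0.5, -0.5]                                    # previous decoder state s_(i-1)
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h_1, h_2, h_3

# One score per source position j
scores = [additive_score(s_prev, h_j, W_s, W_h, v) for h_j in encoder_states]
print(scores)
```

The list `scores` has one entry per source position, which is why the alignment vector has the same length as the source sequence.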


### Attention Weights


We apply a softmax activation function to the alignment scores to obtain the attention weights.

<img src="https://miro.medium.com/max/685/1*3aCyU9aSVHvxzOwvQdExdQ.png">
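
A small sketch of this step in plain Python, with hypothetical scores and encoder states: softmax turns the alignment scores into weights that are positive and sum to 1, and the context vector is then the weighted sum of the encoder hidden states. The function names are illustrative only.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into weights that are positive and sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, encoder_states):
    """Weighted sum of encoder hidden states, one weight per source position."""
    size = len(encoder_states[0])
    return [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(size)]

scores = [2.0, 0.5, -1.0]                               # hypothetical alignment scores
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # hypothetical hidden states

weights = softmax(scores)
context = context_vector(weights, encoder_states)
print(weights, context)
```

The source position with the highest score receives the largest weight, so its hidden state dominates the context vector.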
 

Some resources:


- [Paper](https://arxiv.org/abs/1409.0473)
- [Lilian's Blog](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#:~:text=Self%2Dattention%2C%20also%20known%20as,summarization%2C%20or%20image%20description%20generation.)
- [Nice Blog](https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a)
- [Blog](https://medium.com/analytics-vidhya/neural-machine-translation-using-bahdanau-attention-mechanism-d496c9be30c3)

In [None]:
import MiniAttention.MiniAttention as ma
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)



class Simple_Attention(tf.keras.layers.Layer):
    def __init__(self,units):
        super(Simple_Attention,self).__init__()
        self.units=units
        self.Wq=tf.keras.layers.Dense(self.units)
        self.Wk=tf.keras.layers.Dense(self.units)
        self.Wv=tf.keras.layers.Dense(60)
        
    def call(self,q,v):
        #q: encoder final state, v: encoder output; both are (batch,60) because the encoder does not return sequences
        self.q=q
        self.v=v
        #Additive (Bahdanau-style) score: Wv(tanh(Wq(q)+Wk(v)))
        score=self.Wv(tf.nn.tanh(self.Wq(self.q)+self.Wk(self.v)))
        #Softmax over axis 1, which is the feature axis for these 2-D inputs
        attention_wts=tf.nn.softmax(score,axis=1)
        context_vector=(attention_wts*self.v)
        return context_vector,attention_wts
    

        
def seq2seq_encoder_decoder_glove_bilstm_hybrid_bahdanau(maxlen,max_features,embedding_matrix):
    #Creating the LSTM encoder on top of the pretrained GloVe embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
#     encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h,encoder_state_flstm_c]
    
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    bahdanau_attention=Simple_Attention(60)
    
    decoder_embed_attention_h,decoder_embed_wghts_h=bahdanau_attention(encoder_state_flstm_h,encoder_outputs)
    decoder_embed_attention_c,decoder_embed_wghts_c=bahdanau_attention(encoder_state_flstm_c,encoder_outputs)
#     print(decoder_embed_wghts)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[decoder_embed_wghts_h,decoder_embed_wghts_c])
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
model=seq2seq_encoder_decoder_glove_bilstm_hybrid_bahdanau(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm_bahdanau_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Encoder Decoder Model With Bahdanau Attention

The model architecture is as follows:

<img src="https://i.imgur.com/lIMEt59.png">

In [None]:
# import MiniAttention.MiniAttention as ma
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)



class Simple_Attention(tf.keras.layers.Layer):
    def __init__(self,units):
        super(Simple_Attention,self).__init__()
        self.units=units
        self.Wq=tf.keras.layers.Dense(self.units)
        self.Wk=tf.keras.layers.Dense(self.units)
        self.Wv=tf.keras.layers.Dense(60)
        
    def call(self,q,v):
        #q: encoder final state, v: encoder output; both are (batch,60) because the encoder does not return sequences
        self.q=q
        self.v=v
        #Additive (Bahdanau-style) score: Wv(tanh(Wq(q)+Wk(v)))
        score=self.Wv(tf.nn.tanh(self.Wq(self.q)+self.Wk(self.v)))
        #Softmax over axis 1, which is the feature axis for these 2-D inputs
        attention_wts=tf.nn.softmax(score,axis=1)
        context_vector=(attention_wts*self.v)
        return context_vector,attention_wts
    

        
def seq2seq_encoder_decoder_glove_bilstm_hybrid_bahdanau(maxlen,max_features,embedding_matrix):
    #Creating the GRU encoder on top of the pretrained GloVe embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
#     encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=GRU(60,return_state=True)
    encoder_outputs,encoder_state_flstm_h=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h]
    
    #Creating the GRU decoder; a GRU has a single hidden state (h) and no cell state (c)
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    bahdanau_attention=Simple_Attention(60)
    
    decoder_embed_attention_h,decoder_embed_wghts_h=bahdanau_attention(encoder_state_flstm_h,encoder_outputs)
#     decoder_embed_attention_c,decoder_embed_wghts_c=bahdanau_attention(encoder_state_flstm_c,encoder_outputs)
#     print(decoder_embed_wghts)
    decoder_lstm_cell=GRU(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h=decoder_lstm_cell(decoder_embed,initial_state=[decoder_embed_wghts_h])
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
model=seq2seq_encoder_decoder_glove_bilstm_hybrid_bahdanau(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_gru_bahdanau_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Encoder Decoder With Luong Attention

In this case, we replicate the training process with the [Luong Dot Product (Multiplicative Style) Attention Mechanism](https://arxiv.org/abs/1508.04025).


### Global Attention


<img  src="https://miro.medium.com/max/626/1*LhEapXF1mtaB3rDgIjcceg.png">


Luong, et al., 2015 proposed the “global” and “local” attention. The global attention is similar to the soft attention, while the local one is an interesting blend between hard and soft, an improvement over the hard attention to make it differentiable: the model first predicts a single aligned position for the current target word and a window centered around the source position is then used to compute a context vector.

What global and local attention have in common:

- At each time step t in the decoding phase, both approaches, global and local attention, first take the hidden state hₜ at the top layer of a stacked LSTM as input.
- The goal of both approaches is to derive a context vector 𝒸ₜ to capture relevant source-side information to help predict the current target word yₜ
- Attentional vectors are fed as inputs to the next time steps to inform the model about past alignment decisions.
- Global and local attention models differ in how the context vector 𝒸ₜ is derived
- Before we discuss the global and local attention, let’s understand the conventions used by Luong’s attention mechanism for any given time t
  - 𝒸ₜ : context vector
  - aₜ : alignment vector
  - hₜ : current target hidden state
  - hₛ : current source hidden state
  - yₜ: predicted current target word
  - h̃ₜ : attentional vector

- The global attentional model considers all the hidden states of the encoder when calculating the context vector 𝒸ₜ.
- A variable-length alignment vector aₜ, whose size equals the number of time steps in the source sequence, is derived by comparing the current target hidden state hₜ with each source hidden state hₛ.
- The alignment score is a content-based function, for which Luong et al. consider three alternatives: dot (hₜᵀhₛ), general (hₜᵀWₐhₛ) and concat (vₐᵀ tanh(Wₐ[hₜ; hₛ])).
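
The three content-based score functions from Luong et al. (dot, general and concat) can be sketched in plain Python as follows. This is only an illustration: the vectors, the matrices `W_identity` and `W_concat`, the vector `v_a` and all function names are hypothetical toy values, not parameters from this notebook.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, vector) for row in matrix]

def score_dot(h_t, h_s):
    """dot: h_t . h_s"""
    return dot(h_t, h_s)

def score_general(h_t, h_s, W_a):
    """general: h_t . (W_a h_s)"""
    return dot(h_t, matvec(W_a, h_s))

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a . tanh(W_a [h_t; h_s])"""
    hidden = matvec(W_a, h_t + h_s)  # list '+' concatenates the two states
    return dot(v_a, [math.tanh(x) for x in hidden])

h_t = [1.0, 2.0]     # current target hidden state (toy values)
h_s = [0.5, -1.0]    # one source hidden state (toy values)

W_identity = [[1.0, 0.0], [0.0, 1.0]]
W_concat = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
v_a = [1.0, 1.0, -1.0]

print(score_dot(h_t, h_s))
print(score_general(h_t, h_s, W_identity))
print(score_concat(h_t, h_s, W_concat, v_a))
```

With an identity matrix, the general score reduces to the dot score, which shows that dot is the parameter-free special case.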


### Local Attention


<img src="https://miro.medium.com/max/538/1*YXjdGl3CnSfHfzYpQiObgg.png">


- Local attention focuses on only a small subset of source positions per target word, unlike global attention, which uses the entire source sequence
- It is computationally less expensive than global attention
- The local attention model first generates an aligned position pₜ for each target word at time t.
- The context vector 𝒸ₜ is derived as a weighted average over the set of source hidden states within the selected window
- The aligned position can be selected monotonically (pₜ = t) or predicted by the model
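
The windowing idea can be sketched in plain Python. Following the predictive variant described by Luong et al., the aligned position is taken as S · sigmoid(·) for a source length S, and alignment weights inside the window [pₜ − D, pₜ + D] are damped by a Gaussian with σ = D/2. The function names, the uniform alignment weights and the raw score of 0.0 are assumptions made for illustration, and the sketch takes the alignment weights as given instead of computing them from hidden states.

```python
import math

def predicted_position(source_length, raw_score):
    """Predictive alignment: p_t = S * sigmoid(raw_score), a real number in [0, S]."""
    return source_length * (1.0 / (1.0 + math.exp(-raw_score)))

def local_weights(align_weights, p_t, D):
    """Zero out positions outside [p_t - D, p_t + D]; damp the rest by a Gaussian centred on p_t."""
    sigma = D / 2.0
    weights = []
    for s, a in enumerate(align_weights):
        if abs(s - p_t) <= D:
            weights.append(a * math.exp(-((s - p_t) ** 2) / (2 * sigma ** 2)))
        else:
            weights.append(0.0)
    return weights

align = [1.0 / 7] * 7                      # hypothetical alignment weights for 7 source positions
p_t = predicted_position(len(align), 0.0)  # a raw score of 0.0 places p_t in the middle
weights = local_weights(align, p_t, D=2)
print(p_t, weights)
```

For the monotonic variant, `p_t` would simply be set to the current target step `t` instead of being predicted.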


### Formula 

<img src="https://miro.medium.com/max/875/1*_Ta67S8_lXTbVzJMztkxKg.png">


Some resources:

- [Lilian's Blog](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#:~:text=Self%2Dattention%2C%20also%20known%20as,summarization%2C%20or%20image%20description%20generation.)
- [Paper](https://arxiv.org/pdf/1508.04025.pdf)


In [None]:
import MiniAttention.MiniAttention as ma
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)




class Luong_Attention(tf.keras.layers.Layer):
    def __init__(self,units):
        super(Luong_Attention,self).__init__()
        self.units=units
        self.Wq=tf.keras.layers.Dense(self.units)
        self.Wk=tf.keras.layers.Dense(self.units)
        self.Wv=tf.keras.layers.Dense(60)
        
    def call(self,q,v):
        #q: encoder final state, v: encoder output; both are (batch,60) here
        self.q=q
        self.v=v
        #Multiplicative score: element-wise product of q and v (Wq, Wk and Wv are not used in this variant)
        score=(self.q)*(self.v)
        attention_wts=tf.nn.softmax(score,axis=1)
        context_vector=(attention_wts*self.v)
        return context_vector,attention_wts
    

        
def seq2seq_encoder_decoder_glove_bilstm_hybrid_luong(maxlen,max_features,embedding_matrix):
    #Creating the LSTM encoder on top of the pretrained GloVe embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
#     encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h,encoder_state_flstm_c]
    
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    luong_attention=Luong_Attention(128)
    
    decoder_embed_attention_h,decoder_embed_wghts_h=luong_attention(encoder_state_flstm_h,encoder_outputs)
    decoder_embed_attention_c,decoder_embed_wghts_c=luong_attention(encoder_state_flstm_c,encoder_outputs)
#     print(decoder_embed_wghts)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[decoder_embed_wghts_h,decoder_embed_wghts_c])
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
model=seq2seq_encoder_decoder_glove_bilstm_hybrid_luong(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm_luong_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Encoder Decoder Model Architecture with Luong Attention

The architecture is as follows:

<img src="https://i.imgur.com/ZAJ2iTH.png">

## Graves Cosine Attention

Here we apply a cosine transformation to the dot-product attention score.
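
For reference, content-based attention in the Neural Turing Machine of Graves et al. scores a query against each state by cosine similarity. The sketch below shows that form in plain Python with made-up vectors and a hypothetical function name. It is an illustration under that assumption, and it is not the same computation as the `tf.math.cos` call in the code cell below, which applies the cosine function element-wise to the product of the two tensors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]                                        # toy decoder-side query
encoder_states = [[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]    # toy encoder states

# Same direction, orthogonal, opposite direction
scores = [cosine_similarity(query, h) for h in encoder_states]
print(scores)
```

Because the score depends only on direction, rescaling an encoder state does not change it; the scores always lie between -1 and 1 before the softmax is applied.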

<img src="https://theaisummer.com/assets/img/posts/attention/attention-calculation.png">


Architecture

<img src="https://miro.medium.com/max/2048/0*hMbmU5-BjN-i6mZh.jpg">

In [None]:
import MiniAttention.MiniAttention as ma
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)




class Graves_Attention(tf.keras.layers.Layer):
    def __init__(self,units):
        super(Graves_Attention,self).__init__()
        self.units=units
        self.Wq=tf.keras.layers.Dense(self.units)
        self.Wk=tf.keras.layers.Dense(self.units)
        self.Wv=tf.keras.layers.Dense(60)
        
    def call(self,q,v):
        #q: encoder final state, v: encoder output; both are (batch,60) here
        self.q=q
        self.v=v
        #Cosine function applied element-wise to the product of q and v (Wq, Wk and Wv are not used in this variant)
        score=tf.math.cos((self.q)*(self.v))
        attention_wts=tf.nn.softmax(score,axis=1)
        context_vector=(attention_wts*self.v)
        return context_vector,attention_wts
    

        
def seq2seq_encoder_decoder_glove_bilstm_hybrid_graves(maxlen,max_features,embedding_matrix):
    #Creating the LSTM encoder on top of the pretrained GloVe embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
#     encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h,encoder_state_flstm_c]
    
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    graves_attention=Graves_Attention(128)
    
    decoder_embed_attention_h,decoder_embed_wghts_h=graves_attention(encoder_state_flstm_h,encoder_outputs)
    decoder_embed_attention_c,decoder_embed_wghts_c=graves_attention(encoder_state_flstm_c,encoder_outputs)
#     print(decoder_embed_wghts)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[decoder_embed_wghts_h,decoder_embed_wghts_c])
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
model=seq2seq_encoder_decoder_glove_bilstm_hybrid_graves(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm_graves_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Encoder Decoder Model Architecture with Graves Attention

The architecture is as follows:

<img src="https://i.imgur.com/2AInGw0.png">

In [None]:
import MiniAttention.MiniAttention as ma
import math
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)




class Scaled_Dot_Product_Attention(tf.keras.layers.Layer):
    def __init__(self,units):
        super(Scaled_Dot_Product_Attention,self).__init__()
        self.units=units
        self.Wq=tf.keras.layers.Dense(self.units)
        self.Wk=tf.keras.layers.Dense(self.units)
        self.Wv=tf.keras.layers.Dense(60)
        
    def call(self,q,v,n):
        #q: encoder final state, v: encoder output, n: dimension used to scale the scores
        self.q=q
        self.v=v
        self.n=n
        #Element-wise product of q and v, divided by sqrt(n) (Wq, Wk and Wv are not used in this variant)
        score=((self.q)*(self.v))/math.sqrt(self.n)
        attention_wts=tf.nn.softmax(score,axis=1)
        context_vector=(attention_wts*self.v)
        return context_vector,attention_wts
    

        
def seq2seq_encoder_decoder_glove_bilstm_hybrid_scaled_dot_product(maxlen,max_features,embedding_matrix):
    #Creating the LSTM encoder on top of the pretrained GloVe embeddings
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
#     encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h,encoder_state_flstm_c]
    
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    scaled_dot_product_attention=Scaled_Dot_Product_Attention(128)
    
    decoder_embed_attention_h,decoder_embed_wghts_h=scaled_dot_product_attention(encoder_state_flstm_h,encoder_outputs,64)
    decoder_embed_attention_c,decoder_embed_wghts_c=scaled_dot_product_attention(encoder_state_flstm_c,encoder_outputs,64)
#     print(decoder_embed_wghts)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[decoder_embed_wghts_h,decoder_embed_wghts_c])
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
model=seq2seq_encoder_decoder_glove_bilstm_hybrid_scaled_dot_product(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm_scaled_dot_product_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Encoder Decoder Model Architecture with Scaled Dot Product Attention

The architecture is as follows:

<img src="https://i.imgur.com/pBnshbv.png">
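
For comparison with the layer used in this model, the scaled dot-product form from "Attention Is All You Need" is softmax(QKᵀ/√d_k)V, where d_k is the key dimension. A minimal plain-Python sketch with toy queries, keys and values (all hypothetical, as are the function names) looks like this:

```python
import math

def softmax(scores):
    """Normalise scores into positive weights that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query: softmax(q . k / sqrt(d_k)) over all keys, then a weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

queries = [[1.0, 1.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]      # identical keys give equal weights
values = [[0.0, 2.0], [4.0, 6.0]]

output = scaled_dot_product_attention(queries, keys, values)
print(output)
```

Because both keys are identical here, the two weights are equal and the output is the average of the two value vectors. Dividing by √d_k keeps the scores from growing with the vector size before the softmax.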

In [None]:
import MiniAttention.MiniAttention as ma
import math
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)




class Scaled_Dot_Product_Self_Attention(tf.keras.layers.Layer):
    def __init__(self,units):
        super(Scaled_Dot_Product_Self_Attention,self).__init__()
        self.units=units
        self.Wq=tf.keras.layers.Dense(self.units)
        self.Wk=tf.keras.layers.Dense(self.units)
        self.Wv=tf.keras.layers.Dense(60)
        
    def call(self,q,k,v,n):
        #q, k: query and key tensors, v: values, n: dimension used to scale the scores
        self.q=q
        self.v=v
        self.n=n
        self.k=k
        #Element-wise product of the projected query and key, divided by sqrt(n)
        score=(self.Wq(self.q)*self.Wk(self.k))/math.sqrt(n)
        attention_wts=tf.nn.softmax(score,axis=1)
        context_vector=(attention_wts*self.v)
        #Sum the weighted values over axis 1
        context_vector=tf.reduce_sum(context_vector,axis=1)
        return context_vector,attention_wts
    

        
def seq2seq_encoder_decoder_glove_bilstm_hybrid_scaled_dot_product_self(maxlen,max_features,embedding_matrix):
    #Creating LSTM encoder model initialised with the pretrained (GloVe) embedding matrix
    encoder_inp=Input(shape=(maxlen,))
    print(embedding_matrix.shape)
    encoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(encoder_inp)
#     encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h,encoder_state_flstm_c]
    
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(max_features,embed_size,weights=[embedding_matrix])(decoder_inp)
    scaled_dot_product_attention=Scaled_Dot_Product_Self_Attention(60)
    
    decoder_embed_attention_h,decoder_embed_wghts_h=scaled_dot_product_attention(encoder_state_flstm_h,encoder_state_flstm_h,encoder_outputs,64)
    decoder_embed_attention_c,decoder_embed_wghts_c=scaled_dot_product_attention(encoder_state_flstm_c,encoder_state_flstm_c,encoder_outputs,64)
#     print(decoder_embed_wghts)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[decoder_embed_wghts_h,decoder_embed_wghts_c])
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model
model=seq2seq_encoder_decoder_glove_bilstm_hybrid_scaled_dot_product_self(maxlen,max_features,embedding_matrix)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="seq2seq_encoder_decoder_model_glove_bilstm_scaled_dot_self_product_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Encoder Decoder Model Architecture with Self Attention

The architecture is as follows:

<img src="https://i.imgur.com/3sdKFMW.png">

## Self Attention 


Self-attention was introduced by Google Research in [Attention Is All You Need](http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf). It is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing. [Jay Alammar's blog](http://jalammar.github.io/illustrated-transformer/) gives a very good explanation of this logic.

<img src="http://jalammar.github.io/images/t/transformer_self-attention_visualization.png">


Three vectors q, k and v (query, key and value) are computed for each word in the self-attention mechanism. In the original paper, each of them has 64 dimensions.


<img src="http://jalammar.github.io/images/t/transformer_self_attention_vectors.png">


The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.


<img  src="http://jalammar.github.io/images/t/transformer_self_attention_score.png">


The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key vectors used in the paper, which is 64), then pass the result through a softmax operation. The division leads to more stable gradients; other values are possible here, but this is the default. Softmax normalizes the scores so they’re all positive and add up to 1.


<img src="http://jalammar.github.io/images/t/self-attention_softmax.png">



This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.

The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).


<img  src="http://jalammar.github.io/images/t/self-attention-output.png">

The cumulative computation can be thought of like this:

<img src="http://jalammar.github.io/images/t/self-attention-matrix-calculation-2.png">
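
As a rough illustration of the six steps above, here is a minimal NumPy sketch of scaled dot-product self-attention for a single sequence. The helper name `self_attention`, the toy sizes and the random weight matrices are assumptions made for illustration only; they are not the layers used in this notebook's models.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Hypothetical helper. x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # step 1: query, key and value vectors
    scores = q @ k.T                            # step 2: dot product of every q_i with every k_j
    scores = scores / np.sqrt(k.shape[-1])      # step 3: divide by sqrt(d_k) (8 when d_k = 64)
    weights = softmax(scores, axis=-1)          # step 4: softmax over the keys
    return weights @ v, weights                 # steps 5-6: weight the values and sum them

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 16, 64               # toy sizes; d_k = 64 as in the paper
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))
z, attn = self_attention(x, w_q, w_k, w_v)
print(z.shape, attn.shape)
```

Each row of `attn` holds the softmax scores of one word over all words in the sequence, and the matching row of `z` is the output of the self-attention layer at that position.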



In [None]:
# import MiniAttention.MiniAttention as ma
import math
import transformers
from transformers import AutoTokenizer,AutoModelForQuestionAnswering

maxlen=1000
embed_size=768
max_features=1000
train_df=train_df[:1000]
labels=label_y.fit_transform(train_df['sentiment'])
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
#Tokenizing steps- must be remembered
tokenizer=Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_x))
train_x=tokenizer.texts_to_sequences(train_x)
val_x=tokenizer.texts_to_sequences(val_x)

#Pad the sequence- To allow same length for all vectorized words
train_x=pad_sequences(train_x,maxlen=maxlen)
val_x=pad_sequences(val_x,maxlen=maxlen)
val_y=test_y
print("Padded and Tokenized Training Sequence",train_x.shape)
print("Training Target Values Shape",train_y.shape)
print("Padded and Tokenized Validation Sequence",val_x.shape)
print("Validation Target Values Shape",val_y.shape)

def build_model(transformer, max_len=maxlen):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    #Replaces the Embedding+LSTM/CNN layers
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    return cls_token,sequence_output
    


class Scaled_Dot_Product_Self_Attention(tf.keras.layers.Layer):
    def __init__(self,units):
        super(Scaled_Dot_Product_Self_Attention,self).__init__()
        self.units=units
        self.Wq=tf.keras.layers.Dense(self.units)
        self.Wk=tf.keras.layers.Dense(self.units)
        self.Wv=tf.keras.layers.Dense(60)
        
    def call(self,q,k,v,n):
        self.q=q
        self.v=v
        self.n=n
        self.k=k
#         print(self.q.shape)
        q_t=tf.expand_dims(self.q,1)
#         self.q=tf.transpose(self.q)
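        #Element-wise product of the projected query and key (not a full q.k^T matrix product), scaled by sqrt(n)
        score=(self.Wq(self.q)*self.Wk(self.k))/math.sqrt(n)
        #Normalise the scores along axis 1 so that they sum to 1
        attention_wts=tf.nn.softmax(score,axis=1)
#         print(attention_wts.shape)
        #Weight the values by the attention weights and sum along axis 1
        context_vector=(attention_wts*self.v)
        context_vector=tf.reduce_sum(context_vector,axis=1)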
#         print(context_vector.shape)
        return context_vector,attention_wts
    

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
        
        
def fetch_vectors(string_list,pretrained_model, batch_size=64):
    # inspired by https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
    model = transformers.TFDistilBertModel.from_pretrained(pretrained_model)
    
    fin_features = []
    for data in chunks(string_list, batch_size):
        tokenized = []
        for x in data:
            x = " ".join(x.strip().split()[:300])
            tok = tokenizer.encode(x, add_special_tokens=True)
            tokenized.append(tok[:512])

        max_len = 512
        #DistilBERT takes input ids and an attention mask (unlike BERT, it has no segment ids)
        padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized])
        attention_mask = np.where(padded != 0, 1, 0)
        input_ids = tf.convert_to_tensor(padded)
        attention_mask = tf.convert_to_tensor(attention_mask)

        last_hidden_states = model(input_ids, attention_mask=attention_mask)

        # TensorFlow tensors have no .cpu() method (that is PyTorch); .numpy() is enough here
        features = last_hidden_states[0][:, 0, :].numpy()
        fin_features.append(features)

    fin_features = np.vstack(fin_features)
    return fin_features

        
def distilbert_encoder_decoder_attention(maxlen,max_features,distilbert_embeddings):
    #Creating LSTM encoder model with the DistilBERT feature matrix used as the embedding weights
    encoder_inp=Input(shape=(maxlen,))
    encoder_embed=Embedding(distilbert_embeddings.shape[0],embed_size,weights=[distilbert_embeddings])(encoder_inp)
    print(encoder_inp.shape)
#     encoder_embed_attention=ma.MiniAttentionBlock(None,None,None,keras.regularizers.L2(l2=0.02),None,None,None,None,None)(encoder_embed)
    encoder_lstm_cell=LSTM(60,return_state=True)
    encoder_outputs,encoder_state_flstm_h,encoder_state_flstm_c=encoder_lstm_cell(encoder_embed)
    print(f'Encoder Outputs Shape: {encoder_outputs.shape}')
    encoded_states=[encoder_state_flstm_h,encoder_state_flstm_c]
    
    #Creating LSTM decoder model and feeding the output states (h,c) of lstm of encoders
    decoder_inp=Input(shape=(maxlen,))
    decoder_embed=Embedding(distilbert_embeddings.shape[0],embed_size,weights=[distilbert_embeddings])(decoder_inp)
    scaled_dot_product_attention=Scaled_Dot_Product_Self_Attention(60)
    
    decoder_embed_attention_h,decoder_embed_wghts_h=scaled_dot_product_attention(encoder_state_flstm_h,encoder_state_flstm_h,encoder_outputs,64)
    decoder_embed_attention_c,decoder_embed_wghts_c=scaled_dot_product_attention(encoder_state_flstm_c,encoder_state_flstm_c,encoder_outputs,64)
#     print(decoder_embed_wghts)
    decoder_lstm_cell=LSTM(60,return_sequences=True,return_state=True)
    decoder_outputs,decoder_state_lstm_h,decoder_state_lstm_c=decoder_lstm_cell(decoder_embed,initial_state=[decoder_embed_wghts_h,decoder_embed_wghts_c])
#     decoderoutputs,_,_=decoder_lstm_cell(decoder_embed,initial_state=encoded_states)
    
    decoder_dense_cell=Dense(16,activation='relu')
    decoder_d_output=decoder_dense_cell(decoder_outputs)
    decoder_dense_cell2=Dense(1,activation='sigmoid')
    decoder_output=decoder_dense_cell2(decoder_d_output)
    model=Model([encoder_inp,decoder_inp],decoder_output)
    model.summary()
    return model



# transformer_layer = (
#         transformers.TFDistilBertModel
#         .from_pretrained('distilbert-base-multilingual-cased'))
# embedding_matrix,embedding_vector=build_model(transformer_layer,maxlen)
# embedding_matrix=tf.keras.layers.Reshape((maxlen, 768))(embedding_matrix)
distilbert_embeddings = fetch_vectors(train_df.review.values,'distilbert-base-uncased')

model=distilbert_encoder_decoder_attention(maxlen,max_features,distilbert_embeddings)  
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
plot_model(
    model,to_file="distilbert_encoder_decoder_attention.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

model.fit([train_x,train_x],train_y,batch_size=512,epochs=2,verbose=2)


## Conclusion on Attention

We have seen different flavours of the attention mechanism. In the next section, we move on to Transformers.


<img src="https://i.pinimg.com/originals/4a/2e/2b/4a2e2b7a3aabd03daebf11d1a2e970cc.gif">

## Enter Transformers


### Multi Head Self Scaled Dot Product Attention


Multi-head attention improves the performance of the attention layer in two ways:

- It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. This is useful when translating a sentence like “The animal didn’t cross the street because it was too tired”, where we want to know which word “it” refers to.

- It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.


<img src="http://jalammar.github.io/images/t/transformer_attention_heads_qkv.png">


The entire modelling of multi head attention can be summed up in this image:


<img src="http://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png">
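
The recap above can be sketched in NumPy as follows: project the input once, split the projections into heads, run scaled dot-product attention per head, concatenate the heads and apply an output projection. The function name `multi_head_self_attention`, the toy sizes and the random weights are assumptions for illustration, not the Keras implementation used later in this notebook.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_head):
    """Hypothetical helper.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, n_head * d_k);
    w_o: (n_head * d_k, d_model).
    """
    seq_len = x.shape[0]
    d_k = w_q.shape[1] // n_head

    def split_heads(t):
        # (seq_len, n_head * d_k) -> (n_head, seq_len, d_k)
        return t.reshape(seq_len, n_head, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)    # (n_head, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # one attention map per head
    heads = weights @ v                                  # (n_head, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, n_head * d_k)
    return concat @ w_o, weights                         # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model, n_head, d_k = 5, 32, 8, 4              # eight heads, as in the paper
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, n_head * d_k)) for _ in range(3))
w_o = rng.normal(size=(n_head * d_k, d_model))
out, mh_weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, n_head)
print(out.shape, mh_weights.shape)
```

Because every head has its own slice of the query, key and value projections, each head can learn to attend to different positions, which is the "representation subspaces" idea described above.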


### Combining Multi Head Attention with Encoder Decoders With Layer Normalization


In the classic encoder-decoder model, we add the multi-head attention mechanism along with some modifications. The encoder contains a self-attention head followed by an Add (residual connection) and Layer Normalization layer. [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf) normalizes each sample with the mean and variance computed over all the hidden units in a layer, rather than over the mini-batch (as in batch normalization). This helps to mitigate the "covariate shift" issue to an extent.

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTJyynaajlr9lye4s0p28jOsE2VLZQ1R3l9rw&usqp=CAU">
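
To make the difference from batch normalization concrete, here is a small NumPy sketch of layer normalization. It uses the same form, `gamma * (x - mean) / (std + eps) + beta`, as the `LayerNormalization` class defined later in this notebook; the function name `layer_norm` and the toy input are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Statistics are computed per sample over the hidden units (last axis),
    # not over the batch axis as batch normalization would do.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

rng = np.random.default_rng(0)
hidden = rng.normal(loc=3.0, scale=2.0, size=(2, 8))   # (batch, hidden_units)
normed = layer_norm(hidden, gamma=np.ones(8), beta=np.zeros(8))
print(normed.mean(axis=-1), normed.std(axis=-1))
```

Each row ends up with roughly zero mean and unit standard deviation regardless of the other rows in the batch, so the result does not depend on the batch size.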


#### Encoder


The encoder model with this modification appears as follows:


<img src="http://jalammar.github.io/images/t/transformer_resideual_layer_norm_2.png">

 
The classic Transformer consists of 6 stacked encoder units and 6 stacked decoder units, each with multi-head self-attention (8 heads) and layer normalization, along with an FFNN dense network inside each of them. This leads to a more robust architecture compared to standard encoder-decoder models.


<img src="http://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png">


#### Decoder


Now that we’ve covered most of the concepts on the encoder side, we basically know how the components of decoders work as well. But let’s take a look at how they work together.

The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence:


<img src="http://jalammar.github.io/images/t/transformer_decoding_1.gif">


The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.


<img src="http://jalammar.github.io/images/t/transformer_decoding_2.gif">
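
The positional encoding mentioned above can be sketched with the sinusoidal formula from the paper: sine on the even dimensions and cosine on the odd ones. This is a minimal NumPy version under assumed toy sizes; the function name `sinusoidal_positional_encoding` and the zero-valued placeholder embeddings are illustrative only.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    positions = np.arange(max_len)[:, None]              # (max_len, 1)
    dims = np.arange(d_model)[None, :]                   # (1, d_model)
    # Each pair of dimensions (2i, 2i+1) shares one frequency.
    angles = positions / np.power(10000, 2 * (dims // 2) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dimensions 2i
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
token_embeddings = np.zeros((50, 16))                    # placeholder word embeddings
decoder_inputs = token_embeddings + pe                   # positions are added, not concatenated
print(pe.shape)
```

Since the encoding is added to the embeddings, it must have the same dimensionality as the embedding vectors.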


#### Masking


The self attention layers in the decoder operate in a slightly different way than the one in the encoder:

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.

The “Encoder-Decoder Attention” layer works just like multiheaded self-attention, except it creates its Queries matrix from the layer below it, and takes the Keys and Values matrix from the output of the encoder stack.
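
The masking step in the decoder self-attention can be illustrated with a short NumPy sketch. A lower-triangular mask marks the positions each word may attend to, and future positions get a very large negative score before the softmax. The helper names `causal_mask` and `masked_softmax` and the value `-1e9` (standing in for -inf) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    # 1 where attention is allowed (current and earlier positions), 0 for future positions.
    return np.tril(np.ones((seq_len, seq_len)))

def masked_softmax(scores, mask):
    # Future positions receive a large negative score, so softmax gives them zero weight.
    return softmax(np.where(mask == 1, scores, -1e9), axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))           # raw attention scores for a 4-word output sequence
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 3))
```

The first row attends only to itself, the second row to the first two positions, and so on, while every row still sums to 1.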


### Final Linear and Softmax Activation


This is similar to the softmax-activated output of a final FFNN layer. The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer, which is followed by a Softmax layer.

The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.

Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.

The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.


<img src="http://jalammar.github.io/images/t/transformer_decoder_output_softmax.png">
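
The final projection can be sketched in a few lines of NumPy. The 10,000-word vocabulary follows the example above; the decoder width of 512, the random weights and the placeholder vocabulary `word_0 ... word_9999` are assumptions for illustration, so the chosen word is meaningless here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000                         # 10,000 words as in the example above
vocab = [f"word_{i}" for i in range(vocab_size)]          # placeholder output vocabulary

decoder_vector = rng.normal(size=(d_model,))              # decoder stack output for one time step
w = rng.normal(scale=0.02, size=(d_model, vocab_size))    # Linear layer weights (random here)
b = np.zeros(vocab_size)

logits = decoder_vector @ w + b                           # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                      # softmax: all positive, sums to 1.0
next_word = vocab[int(np.argmax(probs))]                  # pick the highest-probability word
print(logits.shape, next_word)
```

Picking the highest-probability word at every step is the greedy strategy described above.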


Some resources:

- [Research](https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html)
- [Repo](https://github.com/tensorflow/tensor2tensor)
- [Jupyter](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb)
- [Talk](https://www.youtube.com/watch?v=rBCqOTEfxvg)

In [None]:
#Taken from Google-research implementation of transformer

import random, os, sys
import numpy as np
from keras.models import *
from keras.layers import *
from keras.callbacks import *
from keras.initializers import *
import tensorflow as tf
from keras.layers import Layer  # keras.engine.topology is not available in recent Keras versions
import keras.backend as K
try:
    from dataloader import TokenList, pad_to_longest
    # for transformer
except ImportError: pass

#Layer normalization class
class LayerNormalization(Layer):
    def __init__(self, eps=1e-6, **kwargs):
        self.eps = eps
        super(LayerNormalization, self).__init__(**kwargs)
    def build(self, input_shape):
        #Adding custom weights
        self.gamma = self.add_weight(name='gamma', shape=input_shape[-1:],
                                     initializer=Ones(), trainable=True)
        self.beta = self.add_weight(name='beta', shape=input_shape[-1:],
                                    initializer=Zeros(), trainable=True)
        super(LayerNormalization, self).build(input_shape)
    def call(self, x):
        mean = K.mean(x, axis=-1, keepdims=True)
        std = K.std(x, axis=-1, keepdims=True)
        return self.gamma * (x - mean) / (std + self.eps) + self.beta
    def compute_output_shape(self, input_shape):
        return input_shape

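#Scaled dot product attention: softmax(q.k^T/temper).v; note temper=sqrt(d_model) here, whereas the paper divides by sqrt(d_k)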
class ScaledDotProductAttention():
    def __init__(self, d_model, attn_dropout=0.1):
        self.temper = np.sqrt(d_model)
        self.dropout = Dropout(attn_dropout)
    def __call__(self, q, k, v, mask):
        attn = Lambda(lambda x:K.batch_dot(x[0],x[1],axes=[2,2])/self.temper)([q, k])
        if mask is not None:
            mmask = Lambda(lambda x:(-1e+10)*(1-x))(mask)
            attn = Add()([attn, mmask])
        attn = Activation('softmax')(attn)
        attn = self.dropout(attn)
        output = Lambda(lambda x:K.batch_dot(x[0], x[1]))([attn, v])
        return output, attn

class MultiHeadAttention():
    # mode 0 - big matrices, faster; mode 1 - clearer implementation
    def __init__(self, n_head, d_model, d_k, d_v, dropout, mode=0, use_norm=True):
        self.mode = mode
        self.n_head = n_head
        self.d_k = d_k
        self.d_v = d_v
        self.dropout = dropout
        if mode == 0:
            self.qs_layer = Dense(n_head*d_k, use_bias=False)
            self.ks_layer = Dense(n_head*d_k, use_bias=False)
            self.vs_layer = Dense(n_head*d_v, use_bias=False)
        elif mode == 1:
            self.qs_layers = []
            self.ks_layers = []
            self.vs_layers = []
            for _ in range(n_head):
                self.qs_layers.append(TimeDistributed(Dense(d_k, use_bias=False)))
                self.ks_layers.append(TimeDistributed(Dense(d_k, use_bias=False)))
                self.vs_layers.append(TimeDistributed(Dense(d_v, use_bias=False)))
        #Joining scaled dot product
        self.attention = ScaledDotProductAttention(d_model)
        self.layer_norm = LayerNormalization() if use_norm else None
        self.w_o = TimeDistributed(Dense(d_model))

    def __call__(self, q, k, v, mask=None):
        d_k, d_v = self.d_k, self.d_v
        n_head = self.n_head

        if self.mode == 0:
            qs = self.qs_layer(q)  # [batch_size, len_q, n_head*d_k]
            ks = self.ks_layer(k)
            vs = self.vs_layer(v)

            def reshape1(x):
                s = tf.shape(x)   # [batch_size, len_q, n_head * d_k]
                x = tf.reshape(x, [s[0], s[1], n_head, d_k])
                x = tf.transpose(x, [2, 0, 1, 3])  
                x = tf.reshape(x, [-1, s[1], d_k])  # [n_head * batch_size, len_q, d_k]
                return x
            qs = Lambda(reshape1)(qs)
            ks = Lambda(reshape1)(ks)
            vs = Lambda(reshape1)(vs)

            if mask is not None:
                mask = Lambda(lambda x:K.repeat_elements(x, n_head, 0))(mask)
            head, attn = self.attention(qs, ks, vs, mask=mask)  
                
            def reshape2(x):
                s = tf.shape(x)   # [n_head * batch_size, len_v, d_v]
                x = tf.reshape(x, [n_head, -1, s[1], s[2]]) 
                x = tf.transpose(x, [1, 2, 0, 3])
                x = tf.reshape(x, [-1, s[1], n_head*d_v])  # [batch_size, len_v, n_head * d_v]
                return x
            head = Lambda(reshape2)(head)
        elif self.mode == 1:
            heads = []; attns = []
            for i in range(n_head):
                qs = self.qs_layers[i](q)   
                ks = self.ks_layers[i](k) 
                vs = self.vs_layers[i](v) 
                head, attn = self.attention(qs, ks, vs, mask)
                heads.append(head); attns.append(attn)
            head = Concatenate()(heads) if n_head > 1 else heads[0]
            attn = Concatenate()(attns) if n_head > 1 else attns[0]

        outputs = self.w_o(head)
        outputs = Dropout(self.dropout)(outputs)
        if not self.layer_norm: return outputs, attn
        outputs = Add()([outputs, q])
        return self.layer_norm(outputs), attn
#Feedforward layer using Conv1D (kernel size 1) with a residual connection and layer normalization.
class PositionwiseFeedForward():
    def __init__(self, d_hid, d_inner_hid, dropout=0.1):
        self.w_1 = Conv1D(d_inner_hid, 1, activation='relu')
        self.w_2 = Conv1D(d_hid, 1)
        self.layer_norm = LayerNormalization()
        self.dropout = Dropout(dropout)
    def __call__(self, x):
        output = self.w_1(x) 
        output = self.w_2(output)
        output = self.dropout(output)
        output = Add()([output, x])
        return self.layer_norm(output)
#Encoder layer containing self/multi head attention with positionwisefeedforward
class EncoderLayer():
    def __init__(self, d_model, d_inner_hid, n_head, d_k, d_v, dropout=0.1):
        self.self_att_layer = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn_layer  = PositionwiseFeedForward(d_model, d_inner_hid, dropout=dropout)
    def __call__(self, enc_input, mask=None):
        output, slf_attn = self.self_att_layer(enc_input, enc_input, enc_input, mask=mask)
        output = self.pos_ffn_layer(output)
        return output, slf_attn
#Decoder layer: self attention (optionally masked), encoder-decoder attention over the encoder output, then positionwisefeedforward.
class DecoderLayer():
    def __init__(self, d_model, d_inner_hid, n_head, d_k, d_v, dropout=0.1):
        self.self_att_layer = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.enc_att_layer  = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout)
        self.pos_ffn_layer  = PositionwiseFeedForward(d_model, d_inner_hid, dropout=dropout)
    def __call__(self, dec_input, enc_output, self_mask=None, enc_mask=None):
        output, slf_attn = self.self_att_layer(dec_input, dec_input, dec_input, mask=self_mask)
        output, enc_attn = self.enc_att_layer(output, enc_output, enc_output, mask=enc_mask)
        output = self.pos_ffn_layer(output)
        return output, slf_attn, enc_attn
#Sinusoidal (sin and cosine) positional encoding from the paper "Attention is all you need"; row 0 is left as zeros for the padding position
def GetPosEncodingMatrix(max_len, d_emb):
    pos_enc = np.array([
        [pos / np.power(10000, 2 * (j // 2) / d_emb) for j in range(d_emb)] 
        if pos != 0 else np.zeros(d_emb) 
            for pos in range(max_len)
            ])
    pos_enc[1:, 0::2] = np.sin(pos_enc[1:, 0::2]) # dim 2i
    pos_enc[1:, 1::2] = np.cos(pos_enc[1:, 1::2]) # dim 2i+1
    return pos_enc

#Padding mask function: 1 where the key position holds a real token (non-zero id), 0 for padding
def GetPadMask(q, k):
    ones = K.expand_dims(K.ones_like(q, 'float32'), -1)
    mask = K.cast(K.expand_dims(K.not_equal(k, 0), 1), 'float32')
    mask = K.batch_dot(ones, mask, axes=[2,1])
    return mask

def GetSubMask(s):
    len_s = tf.shape(s)[1]
    bs = tf.shape(s)[:1]
    mask = K.cumsum(tf.eye(len_s, batch_shape=bs), 1)
    return mask

class Encoder():
    def __init__(self, d_model, d_inner_hid, n_head, d_k, d_v, \
                layers=6, dropout=0.1, word_emb=None, pos_emb=None):
        self.emb_layer = word_emb
        self.pos_layer = pos_emb
        self.emb_dropout = Dropout(dropout)
        self.layers = [EncoderLayer(d_model, d_inner_hid, n_head, d_k, d_v, dropout) for _ in range(layers)]
        
    def __call__(self, src_seq, src_pos, return_att=False, active_layers=999):
        x = self.emb_layer(src_seq)
        if src_pos is not None:
            pos = self.pos_layer(src_pos)
            x = Add()([x, pos])
        x = self.emb_dropout(x)
        if return_att: atts = []
        mask = Lambda(lambda x:GetPadMask(x, x))(src_seq)
        for enc_layer in self.layers[:active_layers]:
            x, att = enc_layer(x, mask)
            if return_att: atts.append(att)
        return (x, atts) if return_att else x


class Transformer():
    def __init__(self, len_limit, d_model=embed_size, \
              d_inner_hid=512, n_head=10, d_k=64, d_v=64, layers=2, dropout=0.1, \
              share_word_emb=False, **kwargs):
        self.name = 'Transformer'
        self.len_limit = len_limit
        self.src_loc_info = True
        self.d_model = d_model
        self.decode_model = None
        d_emb = d_model

        pos_emb = Embedding(len_limit, d_emb, trainable=False, \
                            weights=[GetPosEncodingMatrix(len_limit, d_emb)])

        i_word_emb = Embedding(max_features, d_emb, weights=[embedding_matrix]) # Add Kaggle provided embedding here

        self.encoder = Encoder(d_model, d_inner_hid, n_head, d_k, d_v, layers, dropout, \
                               word_emb=i_word_emb, pos_emb=pos_emb)

        
    def get_pos_seq(self, x):
        mask = K.cast(K.not_equal(x, 0), 'int32')
        pos = K.cumsum(K.ones_like(x, 'int32'), 1)
        return pos * mask

    def compile(self, active_layers=999):
        src_seq_input = Input(shape=(None,))
        src_seq = src_seq_input
        src_pos = Lambda(self.get_pos_seq)(src_seq)
        if not self.src_loc_info: src_pos = None

        x = self.encoder(src_seq, src_pos, active_layers=active_layers)
        # x = GlobalMaxPool1D()(x) # Not sure about this layer. Just wanted to reduce dimension
        x = GlobalAveragePooling1D()(x)
        outp = Dense(1, activation="sigmoid")(x)

        self.model = Model(inputs=src_seq_input, outputs=outp)
        self.model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

classic_transformer = Transformer(maxlen, layers=1)
classic_transformer.compile()
model = classic_transformer.model
model.summary()
plot_model(
    model,to_file="Classic_Transformer.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)


## Classic Transformer Architecture - MultiHead Self Attention

The following is the architecture:

<img src="https://i.imgur.com/nImDpAf.png">

In [None]:
model.fit(train_x[:1000],train_y[:1000],epochs=2,verbose=2,batch_size=512)

In [None]:
import numpy as np
from transformers import AutoTokenizer, pipeline, TFDistilBertModel
from scipy.spatial.distance import cosine
def transformer_embedding(name,inp,model_name):

    model = model_name.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    pipe = pipeline('feature-extraction', model=model, 
                tokenizer=tokenizer)
    features = pipe(inp)
    features = np.squeeze(features)
    return features
embedding_features1=transformer_embedding('distilbert-base-uncased',z[0],TFDistilBertModel)
embedding_features2=transformer_embedding('distilbert-base-uncased',z[1],TFDistilBertModel)


## Huggingface Transformers


We will be leveraging the power of Transformers (from [Huggingface](https://huggingface.co/)) for training the corpus using any variant of transformer architecture.  Some information regarding TPU usage:


### TPU


<img src="https://storage.googleapis.com/kaggle-media/tpu/tpuv3angle.jpg">


In this context, we will be using the TPU cluster available from the Notebook (hardware accelerators). TPUs can provide better performance than GPUs for TensorFlow and Keras tensor computations, but they have to be explicitly initialised in the code.

The [Kaggle documentation on TPUs](https://www.kaggle.com/docs/tpu) provides an excellent starting point for this, and going through it is highly recommended.

Steps to check and run the TPU cluster:

```python
# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)

# instantiate a distribution strategy
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

# instantiating the model in the strategy scope creates the model on the TPU
with tpu_strategy.scope():
    model = tf.keras.Sequential( … ) # define your model normally
    model.compile( … )

# train model normally
model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)
```

Some points on TPUs:

- TPUs are network-connected accelerators and you must first locate them on the network. This is what TPUClusterResolver() does.
- To go fast on a TPU, increase the batch size. The rule of thumb is to use batches of 128 elements per core (ex: batch size of 128*8=1024 for a TPU with 8 cores). At this size, the 128x128 hardware matrix multipliers of the TPU (see hardware section below) are most likely to be kept busy. You start seeing interesting speedups from a batch size of 8 per core though. In the sample above, the batch size is scaled with the core count through this line of code:

```python
BATCH_SIZE = 16 * tpu_strategy.num_replicas_in_sync
```

<img src="https://storage.googleapis.com/kaggle-media/tpu/tpu_rule_of_thumb.png">

- With larger batch sizes, TPUs will be crunching through the training data faster. This is only useful if the larger training batches produce more “training work” and get your model to the desired accuracy faster. That is why the rule of thumb also calls for increasing the learning rate with the batch size. You can start with a proportional increase but additional tuning may be necessary to find the optimal learning rate schedule for a given model and accelerator

- Because TPUs are very fast, many models ported to TPU end up with a data bottleneck. The TPU is sitting idle, waiting for data for the most part of each training epoch. TPUs read training data exclusively from GCS (Google Cloud Storage), and GCS can sustain a pretty large throughput if it is continuously streaming from multiple files in parallel. Following a couple of best practices will optimize the throughput: for TPU training, organize your data in GCS in a reasonable number (10s to 100s) of reasonably large files (10s to 100s of MB).

- To enable parallel streaming from multiple TFRecord files, we can set the following options:

   - num_parallel_reads=AUTO instructs the API to read from multiple files if available. It figures out how many automatically.
   - experimental_deterministic = False disables data order enforcement. We will be shuffling the data anyway so order is not important. With this setting the API can use any TFRecord as soon as it is streamed in.
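
The batch-size and learning-rate rules of thumb above can be sketched in plain Python. This is only an illustration: `scale_for_replicas`, `BASE_LR` and the numbers passed in are hypothetical, and linear scaling of the learning rate is just the proportional starting point mentioned above, not a tuned schedule.

```python
def scale_for_replicas(per_replica_batch, base_lr, num_replicas):
    """Return (global_batch_size, scaled_lr) for synchronous data-parallel training.

    Assumption: the learning rate grows proportionally with the replica count,
    which is a starting point that usually needs further tuning.
    """
    global_batch = per_replica_batch * num_replicas
    scaled_lr = base_lr * num_replicas
    return global_batch, scaled_lr


BASE_LR = 1e-5  # hypothetical single-replica learning rate

# 16 examples per core on an 8-core TPU, as in the BATCH_SIZE line above
batch, lr = scale_for_replicas(16, BASE_LR, 8)
print(batch, lr)

# The "128 per core" rule of thumb on the same 8 cores
rule_of_thumb_batch, _ = scale_for_replicas(128, BASE_LR, 8)
print(rule_of_thumb_batch)
```

In a real run, `num_replicas` would come from `strategy.num_replicas_in_sync` rather than a hard-coded 8.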
   

#### TPU Hardware

At approximately 20 inches (50 cm), a TPU v3-8 board is a fairly sizeable piece of hardware. It sports 4 dual-core TPU chips for a total of 8 TPU cores. Each TPU core has a traditional vector processing part (VPU) as well as dedicated matrix multiplication hardware capable of processing 128x128 matrices. This is the part that specifically accelerates machine learning workloads.

TPUs are equipped with 128GB of high-speed memory, allowing larger batches, larger models and also larger training inputs. For the text models in this notebook, that headroom means longer token sequences (a larger `MAX_LEN`) or larger batches can be tried on a TPU v3-8.

<img src="https://storage.googleapis.com/kaggle-media/tpu/tpu_cores_and_chips.png">

Some resources:

- [Cloud TPU](https://cloud.google.com/tpu/docs/tpus)
- [Tensorflow TPU](https://www.tensorflow.org/tfrc)


<img src="https://i.pinimg.com/originals/73/d3/a1/73d3a14d212314ab1f7268b71d639c15.gif">

In [None]:
import os

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import ModelCheckpoint
from kaggle_datasets import KaggleDatasets
import transformers
from tqdm.notebook import tqdm
from tokenizers import BertWordPieceTokenizer
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, pipeline, TFDistilBertModel
from tensorflow.keras.utils import plot_model  # used by the plot_model(...) calls in later cells

try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)
# AUTOTUNE lets tf.data choose prefetch buffer sizes at runtime
AUTO = tf.data.experimental.AUTOTUNE

# Data access
GCS_DS_PATH = KaggleDatasets().get_gcs_path('imdb-dataset-of-50k-movie-reviews')

# Configuration of hyperparameters
EPOCHS = 3
# Global batch size: 16 examples per replica, split across the replicas in sync
BATCH_SIZE = 16 * strategy.num_replicas_in_sync
MAX_LEN = 192



In [None]:
from kaggle_datasets import KaggleDatasets
GCS_PATH = KaggleDatasets().get_gcs_path('imdb-dataset-of-50k-movie-reviews')
!gsutil ls $GCS_PATH

In [None]:
!ls /kaggle/input

## Transformer Workflow with TPU


In this section, we build a reusable workflow for training any Transformer model with TensorFlow on a TPU, using data from a Google Cloud Storage bucket. The steps are:


- Load the TPU cluster.
- Encode the data quickly with a tokenizer from [Huggingface](https://github.com/huggingface/tokenizers). This is done in chunks (batches) of texts.
- Use the Transformer embeddings (which we created in [Notebook-1](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop)).
- For all BERT-variant transformer models, the standard approach is to extract the last hidden layer of the outputs.
- This layer contains the embedding vectors, generally of size (?,768) for BERT base and (?,1024) for BERT large variants.
- The model then applies a feed-forward dense layer with a sigmoid/softmax activation to produce the output probabilities.
- Convert the input data (train data/validation data) to a [Tensorflow Dataset](https://www.tensorflow.org/guide/data), which can leverage the power of the TPU.
- Train with a distributed training pattern on the TPU using [tf.distribute.Strategy](https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy).

We use pretrained [Huggingface](https://huggingface.co/transformers/pretrained_models.html) models. The first one is the [DistilBERT model](https://huggingface.co/transformers/model_doc/distilbert.html).
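
To make the chunked encoding step concrete, here is a minimal sketch that truncates and pads token ids to a fixed length, one chunk of texts at a time. The whitespace tokenizer and the tiny vocabulary are hypothetical stand-ins for the Huggingface tokenizer, chosen only so the example runs without any downloads.

```python
PAD_ID, UNK_ID = 0, 1
toy_vocab = {"the": 2, "movie": 3, "was": 4, "great": 5, "boring": 6}  # hypothetical


def encode_one(text, max_len):
    """Map words to ids, truncate to max_len, then right-pad with PAD_ID."""
    ids = [toy_vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def encode_in_chunks(texts, max_len, chunk_size):
    """Encode texts chunk_size at a time, so only one chunk is handled per step."""
    all_ids = []
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        all_ids.extend(encode_one(text, max_len) for text in chunk)
    return all_ids


reviews = ["The movie was great", "Boring", "The movie was not great at all"]
encoded = encode_in_chunks(reviews, max_len=5, chunk_size=2)
for row in encoded:
    print(row)
```

Every row has the same length, which is what allows the encoded reviews to be stacked into a single array and fed to a `tf.data.Dataset`.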


<img src="https://cdn.nextjournal.com/data/QmNQFSULXLPYnGhHSCxmeGk8oHjfdWnybmZGFztfS26fgZ?filename=2019-05-26%2023-43-43%20%E7%9A%84%E8%9E%A2%E5%B9%95%E6%93%B7%E5%9C%96.png&content-type=image/png">




## BERT 

[BERT](https://arxiv.org/abs/1810.04805) is a [bidirectional encoder Transformer model](https://github.com/google-research/bert).


<img src="http://jalammar.github.io/images/distilBERT/bert-output-tensor.png">



The entire workflow can be summarized by the following image:


<img src="http://jalammar.github.io/images/distilBERT/bert-input-to-output-tensor-recap.png">


#### Slicing the important part

For sentence classification, we’re only interested in BERT’s output for the [CLS] token, so we select that slice of the cube and discard everything else.
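
As a small illustration of that slice, the sketch below uses nested Python lists as a stand-in for the output tensor of shape (batch, sequence length, hidden size). The numbers are made up; in the notebook's `build_model` the same selection is written as `sequence_output[:, 0, :]`.

```python
# Toy "BERT output": 2 reviews, 3 token positions, hidden size 4.
# Position 0 of every sequence holds the [CLS] token.
batch_output = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]

# Equivalent to tensor[:, 0, :] on an array of shape (batch, seq_len, hidden)
cls_vectors = [sequence[0] for sequence in batch_output]

print(len(cls_vectors), len(cls_vectors[0]))
print(cls_vectors)
```

The result has one vector per review, which is the input the final dense classification layer expects.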


<img src="http://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png">


BERT Model

<img src="https://miro.medium.com/max/740/1*G6PYuBxc7ryP4Pz7nrZJgQ@2x.png">




## DistilBERT Model


DistilBERT is a student network trained to mimic a BERT teacher. It does not generally outperform BERT; the aim is to keep most of BERT's accuracy in a much smaller and faster model, with the teacher's soft predictions guiding the student throughout training.


<img src="https://storage.googleapis.com/groundai-web-prod/media%2Fusers%2Fuser_14%2Fproject_391208%2Fimages%2FKD_figures%2Ftransformer_distillation.png">


DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark. Besides distillation, two other common techniques for compressing a model are:

- Quantization: approximating the weights of a network with a smaller numerical precision.
- Weights pruning: removing some connections from the network.
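
To make the two techniques in the list above concrete, here is a toy sketch on a plain list of numbers. It is not how DistilBERT itself was built; the function names, the uniform quantization scheme and the magnitude-based pruning rule are assumptions picked for illustration.

```python
def quantize(weights, num_levels=16):
    """Snap each weight to the nearest of num_levels evenly spaced values."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (num_levels - 1)
    return [lo + round((w - lo) / step) * step for w in weights]


def prune_by_magnitude(weights, fraction=0.5):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_prune = set(order[:n_prune])
    return [0.0 if i in to_prune else w for i, w in enumerate(weights)]


weights = [0.82, -0.05, 0.31, -0.67, 0.02, 0.44]  # hypothetical layer weights

quantized = quantize(weights, num_levels=4)
pruned = prune_by_magnitude(weights, fraction=0.5)
print(quantized)
print(pruned)
```

Quantization keeps every connection but stores it less precisely, while pruning keeps full precision but drops the weakest connections.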

Knowledge distillation (sometimes also referred to as teacher-student learning) is a compression technique in which a small model is trained to reproduce the behavior of a larger model (or an ensemble of models). It was introduced by Bucila et al. and generalized by Hinton et al. a few years later. We will follow the latter method. Rather than training with a cross-entropy over the hard targets (one-hot encoding of the gold class), we transfer the knowledge from the teacher to the student with a cross-entropy over the soft targets (probabilities of the teacher). Our training loss thus becomes:

<img src="https://miro.medium.com/max/311/1*GZkQPjKC_Wqx1F4Uu3FdiQ.png">

This loss is a richer training signal since a single example enforces much more constraint than a single hard target.
To further expose the mass of the distribution over the classes, Hinton et al. introduce a softmax-temperature:

<img src="https://miro.medium.com/max/291/1*BaVyKMXRWaudFvcI9So8MQ.png">

When T → 0, the distribution becomes a Kronecker delta (equivalent to the one-hot target vector); when T → +∞, it becomes a uniform distribution. The same temperature parameter is applied both to the student and the teacher at training time, further revealing more signals for each training example. At inference, T is set to 1 to recover the standard softmax.
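
The effect of the temperature and the soft-target loss can be checked numerically with a short sketch. The logits below are hypothetical, and `distillation_loss` is an illustrative name for the cross-entropy between the teacher's and the student's temperature-softened distributions described above.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature: p_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature):
    """Cross-entropy of the student against the teacher's soft targets."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(t_i * math.log(s_i) for t_i, s_i in zip(t, s))


teacher_logits = [4.0, 1.0, 0.5]  # hypothetical teacher outputs for 3 classes
student_logits = [2.5, 1.5, 0.2]  # hypothetical student outputs

# Low T sharpens the distribution towards one-hot, high T flattens it
for T in (0.1, 1.0, 100.0):
    print(T, [round(p, 3) for p in softmax(teacher_logits, T)])

print(distillation_loss(teacher_logits, student_logits, temperature=2.0))
```

The loss is smallest when the student's softened distribution matches the teacher's exactly, which is what pushes the student towards the teacher's behavior on every class and not only the top one.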


Some resources:

- [Blog](https://medium.com/huggingface/distilbert-8cf3380435b5)
- [Huggingface](https://huggingface.co/transformers/model_doc/distilbert.html)
- [Paper](https://arxiv.org/abs/1910.01108)

In [None]:
# Tokenize the data in chunks of 256 texts, truncating/padding each text to maxlen tokens

maxlen = 512
chunk_size = 256


def fast_encode(texts, tokenizer, chunk_size=chunk_size, maxlen=maxlen):
    """Encode a pandas Series of strings into an array of token ids, one row per text."""
    tokenizer.enable_truncation(max_length=maxlen)
    # Note: newer `tokenizers` releases may expect `length=maxlen` here instead
    tokenizer.enable_padding(max_length=maxlen)
    all_ids = []
    # Encode chunk_size texts at a time (plain batching, not a sliding window)
    for i in tqdm(range(0, len(texts), chunk_size)):
        text_chunk = texts[i:i+chunk_size].tolist()
        encs = tokenizer.encode_batch(text_chunk)
        all_ids.extend([enc.ids for enc in encs])

    return np.array(all_ids)


def build_model(transformer, max_len=512):
    """Binary classifier: transformer encoder -> first-token vector -> sigmoid."""
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    # The transformer replaces the Embedding + LSTM/CNN layers used earlier
    sequence_output = transformer(input_word_ids)[0]
    # Keep only the hidden state of the first ([CLS]) token
    cls_token = sequence_output[:, 0, :]
    out = Dense(1, activation='sigmoid')(cls_token)

    model = Model(inputs=input_word_ids, outputs=out)
    # `learning_rate` replaces the deprecated `lr` argument
    model.compile(Adam(learning_rate=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
    model.summary()
    return model

# First load the real tokenizer
tokenizer = transformers.DistilBertTokenizer.from_pretrained('distilbert-base-multilingual-cased')
# Save the loaded tokenizer locally
tokenizer.save_pretrained('.')
# Reload it with the huggingface tokenizers library
fast_tokenizer = BertWordPieceTokenizer('vocab.txt', lowercase=False)
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
val_y=test_y
train_x = fast_encode(train_x.astype(str), fast_tokenizer, maxlen=MAX_LEN)
val_x = fast_encode(test_x.astype(str), fast_tokenizer, maxlen=MAX_LEN)

train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((train_x, train_y))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((val_x, val_y))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

In [None]:
with strategy.scope():
    transformer_layer = (
        transformers.TFDistilBertModel
        .from_pretrained('distilbert-base-multilingual-cased')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
plot_model(
    model,to_file="Distilbert_Transformer.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)
n_steps = train_x.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)

## DistilBERT base Architecture -768 D

The model architecture for Distilbert for classification is provided here:

<img src="https://i.imgur.com/uUzt9dk.png">

## Albert Transformer

[ALBERT](https://arxiv.org/abs/1909.11942) is a lightweight BERT that reduces the parameter count through cross-layer parameter sharing and a factorized embedding matrix.

According to the paper:


'The first one is a factorized embedding parameterization. By decomposing
the large vocabulary embedding matrix into two small matrices, we separate the size of the hidden
layers from the size of vocabulary embedding. This separation makes it easier to grow the hidden
size without significantly increasing the parameter size of the vocabulary embeddings. The second
technique is cross-layer parameter sharing. This technique prevents the parameter from growing
with the depth of the network. Both techniques significantly reduce the number of parameters for
BERT without seriously hurting performance, thus improving parameter-efficiency. An ALBERT
configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster.
The parameter reduction techniques also act as a form of regularization that stabilizes the training
and helps with generalization.
To further improve the performance of ALBERT, we also introduce a self-supervised loss for
sentence-order prediction (SOP). SOP primary focuses on inter-sentence coherence and is designed
to address the ineffectiveness (Yang et al., 2019; Liu et al., 2019) of the next sentence prediction
(NSP) loss proposed in the original BERT.'
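
The saving from the factorized embedding parameterization in the quote can be worked out with simple arithmetic. The sketch below assumes a vocabulary of 30,000 tokens, a hidden size of 768 and an embedding size of 128, which are roughly the sizes of a base-sized ALBERT; the function name is illustrative.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token-embedding block.

    Without factorization: V x H.
    With factorization:    V x E + E x H.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128  # assumed sizes, see the note above

bert_style = embedding_params(V, H)
albert_style = embedding_params(V, H, E)
print(bert_style)    # 23040000
print(albert_style)  # 3938304
print(round(bert_style / albert_style, 2))
```

Because the vocabulary-sized matrix now depends on E and not on H, the hidden size can grow while the embedding block stays almost the same size.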


Resources:

- [Github](https://github.com/google-research/albert)
- [Huggingface](https://huggingface.co/transformers/model_doc/albert.html)


In [None]:
## Testing AlbertTransformer

tokenizer = transformers.AlbertTokenizer.from_pretrained('albert-base-v1')

# ALBERT ships a SentencePiece model rather than a WordPiece vocab.txt, so reloading
# through BertWordPieceTokenizer('vocab.txt') would reuse the DistilBERT vocabulary
# saved earlier. Encode with ALBERT's own tokenizer instead
# (calling the tokenizer directly assumes transformers >= 3.0).
def encode(texts, tokenizer, maxlen=MAX_LEN):
    enc = tokenizer(texts.tolist(), truncation=True, padding='max_length', max_length=maxlen)
    return np.array(enc['input_ids'])

train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
val_y=test_y
train_x = encode(train_x.astype(str), tokenizer)
val_x = encode(test_x.astype(str), tokenizer)

train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((train_x, train_y))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((val_x, val_y))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)
with strategy.scope():
    transformer_layer = (
        transformers.TFAlbertModel
        .from_pretrained('albert-base-v1')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
plot_model(
    model,to_file="AlbertTransformer.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

n_steps = train_x.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)

## Albert Base Architecture -768 D

The model architecture for Albert is as follows:

<img src="https://i.imgur.com/ztcIbsb.png">

## XLM Roberta/Roberta


[RoBERTa](https://arxiv.org/pdf/1907.11692.pdf) builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates. XLM-RoBERTa is its multilingual counterpart.

<img src="https://camo.githubusercontent.com/f5c0d05eb0635cdd0e17e137265af23fa825b1d4/68747470733a2f2f646c2e666261697075626c696366696c65732e636f6d2f584c4d2f786c6d5f6669677572652e6a7067">

Tips:


This implementation is the same as BertModel with a tiny embeddings tweak as well as a setup for Roberta pretrained models.

RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.

RoBERTa doesn’t have token_type_ids, you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>)

CamemBERT is a wrapper around RoBERTa.
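
The segment-separation tip above can be illustrated with a few lines of Python. The helper name is hypothetical and it works on already-split tokens; `<s>` and `</s>` are RoBERTa's default start and separator tokens, and a real tokenizer would add them for you.

```python
def build_roberta_style_input(tokens_a, tokens_b=None, cls_token="<s>", sep_token="</s>"):
    """Join one or two token lists with RoBERTa-style special tokens.

    Single segment: <s> A </s>
    Segment pair:   <s> A </s></s> B </s>
    No token_type_ids are produced; the separators alone mark the boundary.
    """
    tokens = [cls_token] + list(tokens_a) + [sep_token]
    if tokens_b is not None:
        tokens += [sep_token] + list(tokens_b) + [sep_token]
    return tokens


single = build_roberta_style_input(["great", "movie"])
pair = build_roberta_style_input(["great", "movie"], ["would", "watch", "again"])
print(single)
print(pair)
```

For the single-review classification task in this notebook only the first form is needed.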


Resources:

- [FAIR](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/)
- [Pytorch](https://pytorch.org/hub/pytorch_fairseq_roberta/)
- [Github](https://github.com/pytorch/fairseq/tree/master/examples/roberta)
- [Huggingface](https://huggingface.co/transformers/model_doc/roberta.html)

In [None]:
## Testing RoBERTa Transformer
# Imports, `strategy` and the hyperparameters are reused from the setup cell above.

# Load the tokenizer that matches the 'roberta-base' weights used below
tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')

# RoBERTa uses byte-level BPE (vocab.json + merges.txt), not a WordPiece vocab.txt,
# so encode with its own tokenizer (calling it directly assumes transformers >= 3.0).
def encode(texts, tokenizer, maxlen=MAX_LEN):
    enc = tokenizer(texts.tolist(), truncation=True, padding='max_length', max_length=maxlen)
    return np.array(enc['input_ids'])

train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
val_y=test_y
train_x = encode(train_x.astype(str), tokenizer)
val_x = encode(test_x.astype(str), tokenizer)

train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((train_x, train_y))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((val_x, val_y))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)
with strategy.scope():
    transformer_layer = (
        transformers.TFRobertaModel
        .from_pretrained('roberta-base')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
plot_model(
    model,to_file="Roberta-Transformer.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

n_steps = train_x.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)

## Roberta Base Architecture -768 D

The model architecture for RoBERTa is as follows:

<img src="https://i.imgur.com/n6wDjpP.png">

## Conclusion of BERT-base Transformers


This section concludes the classification models built on BERT-based transformers, from BERT to ALBERT and RoBERTa. These bidirectional encoder models are discriminative Transformer architectures and are well suited to classification tasks in general (although they are also used for language modelling and question answering).

Now we move forward to some Generative transformers like GPT.



<img src="https://i.pinimg.com/originals/76/04/48/760448c0de6bed1e9b810b006d264561.gif">


## GPT-Generative Pretraining


<img src="https://jalammar.github.io/images/gpt2/gpt2-self-attention-split-attention-heads-1.png">

[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

Tips:

- GPT-2 is a model with absolute position embeddings so it’s usually advised to pad the inputs on the right rather than the left.

- GPT-2 was trained with a causal language modeling (CLM) objective and is therefore powerful at predicting the next token in a sequence. Leveraging this feature allows GPT-2 to generate syntactically coherent text as it can be observed in the run_generation.py example script.

- The PyTorch models can take the past as input, which is the previously computed key/value attention pairs.
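
The causal objective mentioned in the tips means each position can attend only to itself and to earlier positions. A minimal sketch of such a mask in plain Python (the function name is illustrative and this is not GPT-2's actual implementation):

```python
def causal_mask(seq_len):
    """mask[i][j] == 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    print(row)

# Number of tokens each position can see: the first sees 1, the last sees all 4
visible = [sum(row) for row in mask]
print(visible)
```

One consequence for classification is that, in a causal model, the hidden state at the last real token has seen the whole review, whereas the state at position 0 has seen only the first token.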

 Resource:
 
 - [Jay's Blog](http://jalammar.github.io/illustrated-gpt2/)


In [None]:
## Testing GPT2Transformer
# Imports, `strategy` and the hyperparameters are reused from the setup cell above.

tokenizer = transformers.GPT2Tokenizer.from_pretrained('gpt2-medium')
# GPT-2 has no padding token by default; reuse the end-of-text token for padding
tokenizer.pad_token = tokenizer.eos_token

# GPT-2 uses byte-level BPE (vocab.json + merges.txt), not a WordPiece vocab.txt,
# so encode with its own tokenizer (calling it directly assumes transformers >= 3.0).
def encode(texts, tokenizer, maxlen=MAX_LEN):
    enc = tokenizer(texts.tolist(), truncation=True, padding='max_length', max_length=maxlen)
    return np.array(enc['input_ids'])

# Caveat: build_model classifies from position 0, which in a causal model has
# attended only to the first token; the last real token is usually more informative.
train_y=labels
train_x,test_x,train_y,test_y=train_test_split(train_df['review'],train_y,test_size=0.2,random_state=42)
val_x=test_x
val_y=test_y
train_x = encode(train_x.astype(str), tokenizer)
val_x = encode(test_x.astype(str), tokenizer)

train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((train_x, train_y))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((val_x, val_y))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)
with strategy.scope():
    transformer_layer = (
        transformers.TFGPT2Model
        .from_pretrained('gpt2-medium')
    )
    model = build_model(transformer_layer, max_len=MAX_LEN)
model.summary()
plot_model(
    model,to_file="GPT2-Transformer.png",
    show_shapes=True,
    show_layer_names=True,
    rankdir="TB",
    expand_nested=False,
    dpi=96)

n_steps = train_x.shape[0] // BATCH_SIZE
train_history = model.fit(
    train_dataset,
    steps_per_epoch=n_steps,
    validation_data=valid_dataset,
    epochs=EPOCHS
)

## Conclusion of Notebook


This concludes Notebook-2 of the workshop. The same codebase can be used with other models from the Huggingface repository, provided each model is paired with its own tokenizer. In the next [notebook](https://www.kaggle.com/colearninglounge/nlp-end-to-end-cll-nlp-workshop-3), we will briefly look into TensorBoard graphs and train a simple model with TensorBoard.



<img src="https://i.pinimg.com/originals/1d/cd/04/1dcd045c688cb9b8c85c79ab05834094.gif">

In [None]:
# Testing hybrid classical-quantum circuits with PennyLane
import pennylane as qml
import tensorflow as tf
from tensorflow import keras

n_qubits = 2
dev = qml.device("default.qubit", wires=n_qubits)


@qml.qnode(dev)
def qnode(inputs, weights):
    # Encode the classical features as rotation angles, then entangle the qubits
    qml.templates.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.templates.BasicEntanglerLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(wires=i)) for i in range(n_qubits)]


n_layers = 6
weight_shapes = {"weights": (n_layers, n_qubits)}
qlayer = qml.qnn.KerasLayer(qnode, weight_shapes, output_dim=n_qubits)

# Classical Dense(2) -> quantum layer -> classical 2-way softmax
clayer_1 = tf.keras.layers.Dense(2)
clayer_2 = tf.keras.layers.Dense(2, activation="softmax")
model = tf.keras.models.Sequential([clayer_1, qlayer, clayer_2])
opt = tf.keras.optimizers.SGD(learning_rate=0.2)
model.compile(opt, loss="mae", metrics=["accuracy"])
model.fit(train_x,train_y,batch_size=128,epochs=1,verbose=2,validation_data=(val_x,val_y))
model.summary()
