### Transformers in NLP : Part 3

### Business Context:

**In Part-1** of the Transformer series, we walked through the **entirety of NLP**, from **Bag of Words** to the **Transformer architecture.**

We closed with **BERT**, one of the **state-of-the-art Transformer** models for downstream NLP tasks.

**In Part-2**, we looked at the **limitations of BERT** and the ways of improving it.

We covered the concepts of **Auto Regression** and **Auto Encoding**, and explored 2 new models, **RoBERTa** & **XLNet**, which **improved on BERT's performance** significantly through changes in training technique and architecture.

**In Part-3** of this Transformer series, we will deal with the **shortcomings** of these models in terms of **memory usage, prediction latency & storage space.**

We will look at new techniques and architecture modifications that address these issues when deploying a model to production, and study 2 new models in detail:

* **ALBERT : A Lite BERT for Self-supervised Learning of Language Representations**
* **DistilBERT : A distilled version of BERT: smaller, faster, cheaper and lighter**

We will see how these reduce space and memory usage with only minor changes in prediction accuracy.

Finally we will have a **comparative study across all the 5 transformer models** discussed in this lecture series.

### References:

* Google Images
* Transformers research paper -> https://arxiv.org/abs/1706.03762
* BERT research paper -> https://arxiv.org/abs/1810.04805
* ALBERT research paper -> https://arxiv.org/abs/1909.11942
* DistilBERT research paper -> https://arxiv.org/abs/1910.01108
* Wikipedia, Google

### Understanding Production Scenario:

* Full-fledged ML solution
* Solution integrated with other applications via RESTful API calls
* Deployed on Windows/Linux servers hosted within the organization
* Deployed on cloud services like AWS, Azure, GCP
* Deployed as a standalone solution
* Deployed as a containerized Docker image
* Integration with CI/CD tools
* Re-training on new data at fixed intervals

### Problems with heavy duty models:

* Running/operating cost is very high
* Scaling the solution costs even more
* High storage space usage
* Difficult to package with Docker because of the solution size
* High inference time, resulting in prediction latency
* Lower customer satisfaction
* CPU/GPU/RAM costs can exceed business profits in some scenarios
* Very large, complex models with many parameters can cause out-of-memory errors
* Impractical to run inside on-device ("on the go") applications such as mobile or edge apps
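To make the space-usage point concrete, the sketch below estimates how much room the raw weights alone take. The parameter counts are the approximate figures reported in the respective papers; the function name `weight_size_mb` and the choice of 32-bit floats (4 bytes per parameter) are assumptions made for illustration. Activations, optimizer state and framework overhead come on top of this.

```python
# Approximate parameter counts as reported in the respective papers.
PARAM_COUNTS = {
    "BERT-base": 110_000_000,
    "BERT-large": 340_000_000,
    "DistilBERT": 66_000_000,
    "ALBERT-base": 12_000_000,
}


def weight_size_mb(num_params, bytes_per_param=4):
    """Size of the raw weights in MB (1 MB = 10**6 bytes).

    bytes_per_param=4 assumes fp32 storage (an illustrative assumption).
    """
    return num_params * bytes_per_param / 1e6


for name, n in PARAM_COUNTS.items():
    print(f"{name:12s} ~{weight_size_mb(n):6.0f} MB of fp32 weights")
```

Storing the weights in 16-bit precision (`bytes_per_param=2`) halves these figures, which is one reason reduced precision is popular in deployment.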

### Recap on BERT :

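For a detailed explanation, refer to **Transformers in NLP : Part 1**

* Bidirectional Encoder Representations from Transformers
* Built from Transformer encoder blocks only; unlike the original Transformer, BERT has no decoder
* Multiple identical encoder layers stacked on top of each other (12 in BERT-base, 24 in BERT-large), all with the same number of units. The figure below shows the full encoder-decoder Transformer; BERT uses only the encoder side.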

![encoder-decoder_2.png](attachment:encoder-decoder_2.png)

* Uses self-attention to better understand context

![self-attention.png](attachment:self-attention.png)

* Learns information from both the left and the right side of a token’s context
* Pre-trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words).

![transformer.png](attachment:transformer.png)
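As a reminder of what self-attention computes, here is a minimal pure-Python sketch of scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`. The toy vectors and the function names are assumptions for illustration; a real encoder first maps the token vectors through learned query, key and value projections and uses several attention heads.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy token vectors used directly as Q, K and V
# (no learned projections, purely for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Every token attends to all positions, left and right, which is what makes the resulting representation bidirectional.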


### ALBERT : 

A Lite BERT for Self-supervised Learning of Language Representations

* Generally, the larger the network and the more parameters it has, the better the model performs on downstream tasks
* But larger models have high memory & GPU requirements
* ALBERT helps to scale BERT
* Helps to ease GPU/memory limitations
* Helps to bring down training time

### ALBERT implementations:

### 1. Factorized Embedding Parameterization:

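* Decomposes the large vocabulary embedding matrix (vocabulary size V × hidden size H) into 2 smaller matrices (V × E and E × H, with embedding size E much smaller than H)
* Helps to grow the hidden size without significantly increasing the parameter count of the vocabulary embeddings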


![albert-hidden-dimension.png](attachment:albert-hidden-dimension.png)


![embedding-decompose-albert.png](attachment:embedding-decompose-albert.png)
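The saving from factorized embeddings is simple arithmetic: a single V × H table is replaced by a V × E table plus an E × H projection. The sketch below compares the two counts. The sizes V = 30,000, H = 768 and E = 128 are illustrative values in the range used by the ALBERT paper, and the function name `embedding_params` is an assumption.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Parameter count of the token-embedding block.

    Without embedding_size: one V x H table (BERT-style).
    With embedding_size:    a V x E table plus an E x H projection (ALBERT-style).
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30_000, 768, 128  # illustrative sizes

bert_style = embedding_params(V, H)
albert_style = embedding_params(V, H, E)
print(f"V x H         : {bert_style:,}")
print(f"V x E + E x H : {albert_style:,}")

# Doubling the hidden size doubles the BERT-style table,
# but only the small E x H projection grows in the factorized version.
print(f"H doubled, BERT-style  : {embedding_params(V, 2 * H):,}")
print(f"H doubled, factorized  : {embedding_params(V, 2 * H, E):,}")
```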



### 2. Cross Layer Parameter Sharing:

* Each layer in BERT has its own set of parameters
* In ALBERT, the same parameters are shared across all layers of the network
* Prevents the parameter count from growing with depth

![bert-albert.png](attachment:bert-albert.png)
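A rough parameter count shows why sharing matters. The sketch below counts the weights of one standard Transformer encoder layer (four attention projections, a two-layer feed-forward block and two layer norms, all with biases) and compares a 12-layer stack with and without sharing. The function names and the BERT-base-like sizes (hidden 768, intermediate 3072) are assumptions for illustration.

```python
def encoder_layer_params(hidden, intermediate):
    """Parameters in one Transformer encoder layer."""
    # Query, key, value and output projections, each H x H plus bias.
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: H -> intermediate -> H, with biases.
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two layer norms, each with a scale and a bias vector.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


def encoder_stack_params(num_layers, hidden, intermediate, share_across_layers):
    """Parameters in the full stack; sharing stores one layer's weights only."""
    per_layer = encoder_layer_params(hidden, intermediate)
    return per_layer if share_across_layers else num_layers * per_layer


unshared = encoder_stack_params(12, 768, 3072, share_across_layers=False)
shared = encoder_stack_params(12, 768, 3072, share_across_layers=True)
print(f"12 independent layers : {unshared:,}")
print(f"12 shared layers      : {shared:,}")
```

Sharing reduces what has to be stored, not what has to be computed: the input still passes through 12 layers at inference time, so the saving is mainly in memory and model size.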

### 3. Sentence Order Prediction:

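* RoBERTa demonstrated that BERT's Next Sentence Prediction (NSP) objective is ineffective, so ALBERT replaces it with Sentence Order Prediction (SOP)
* SOP focuses on inter-sentence coherence: the model must tell whether two consecutive segments appear in their original order or have been swapped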

![sop.png](attachment:sop.png)
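The construction of SOP training pairs can be sketched as follows. A positive example is two consecutive segments in their original order; a negative example is the same two segments swapped. The helper name `make_sop_example`, the 50/50 split and the toy sentences are assumptions for illustration, not the paper's exact data pipeline.

```python
import random


def make_sop_example(segment_a, segment_b, rng):
    """Build one SOP example from two consecutive segments.

    segment_a precedes segment_b in the source document.
    Returns (first, second, label): label 1 = original order, 0 = swapped.
    """
    if rng.random() < 0.5:
        return segment_a, segment_b, 1
    return segment_b, segment_a, 0


rng = random.Random(0)  # seeded so the run is repeatable
sentences = [
    "The model is trained on a large corpus.",
    "It is then fine-tuned on a downstream task.",
    "Fine-tuning needs far less labelled data.",
    "The result is evaluated on a held-out set.",
]

examples = [make_sop_example(a, b, rng) for a, b in zip(sentences, sentences[1:])]
for first, second, label in examples:
    print(label, "|", first, "->", second)
```

In NSP, by contrast, the negative example takes its second segment from a different document, so the model can often solve the task from topic cues alone. SOP negatives share the same topic, which forces the model to learn ordering and coherence.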

### Experimental Results and Observations:

* Discussed in detail with reference from the research paper

### ALBERT Code Implementation:

* Covered in separate notebook

### DistilBERT:

A distilled version of BERT: smaller, faster, cheaper and lighter

* Larger pre-trained models tend to perform better
* The larger the model, the more parameters there are to train

![parameter-graph.png](attachment:parameter-graph.png)

* Higher cost of scaling and more compute needed
* Large models can't be used on-device ("on the go") because of space and memory limitations
* DistilBERT is a smaller general-purpose language model
* Uses Knowledge Distillation on BERT to bring down the size and increase speed
* 40% smaller than BERT
* 60% faster
* Retains 97% of BERT's performance

### Knowledge Distillation:



![kd-3.jpg](attachment:kd-3.jpg)

![kd-1.png](attachment:kd-1.png)
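In knowledge distillation, a small student model is trained to reproduce the output distribution of a large teacher, usually softened with a temperature so that the teacher's relative preferences among the non-top classes are visible. The sketch below shows only this soft-target term in pure Python. The function names, the toy logits and the temperature of 2.0 are assumptions for illustration; DistilBERT's actual training objective combines a distillation loss of this kind with a masked language modelling loss and a cosine embedding loss.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; higher temperature gives a flatter distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions."""
    teacher_probs = softmax_with_temperature(teacher_logits, temperature)
    student_probs = softmax_with_temperature(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Toy logits over three classes (hypothetical values).
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]   # close to the teacher
poor_student = [0.5, 4.0, 1.0]   # prefers a different class

print("T=1:", [round(p, 3) for p in softmax_with_temperature(teacher, 1.0)])
print("T=4:", [round(p, 3) for p in softmax_with_temperature(teacher, 4.0)])
print("loss, good student:", round(distillation_loss(good_student, teacher), 3))
print("loss, poor student:", round(distillation_loss(poor_student, teacher), 3))
```

The loss is smallest when the student's softened distribution matches the teacher's, so minimizing it pulls the student towards the teacher's behaviour rather than towards hard labels alone.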

### Experimental Results and Observations:

* Discussed in detail with reference from the research paper

### DistilBERT Code Implementation:

* Covered in separate notebook

### Comparative study:

![bert-roberta-distilbert-albert.png](attachment:bert-roberta-distilbert-albert.png)

![bert-roberta-distilbert-xlnet.png](attachment:bert-roberta-distilbert-xlnet.png)