# Week 5: Sequence-to-Sequence Models

- Introduction to sequence-to-sequence models and their architecture
- Understanding of Encoder-Decoder Models and their variants
- Introduction to the attention mechanism and its role in sequence-to-sequence models
- Understanding of Beam Search and its application in sequence-to-sequence models
- Implementing machine translation models using PyTorch or TensorFlow
- Understanding of evaluation metrics for machine translation
- Understanding of transfer learning and fine-tuning pre-trained models for machine translation tasks
- Introduction to unsupervised machine translation and its techniques
- Understanding of Multilingual models and their application in NLP tasks
- Understanding the concept of zero-shot learning and its application in machine translation tasks
- Understanding the concept of back-translation and its application in machine translation tasks
- Understanding the concept of ensembling in machine translation tasks
- Understanding the concept of language model pre-training and its application in machine translation tasks

## Sequence-to-sequence models and their architecture

Sequence-to-sequence (often abbreviated to seq2seq) models are a special class of recurrent neural network architectures that are typically (but not exclusively) used to solve complex language problems such as machine translation, question answering, chatbot creation, and text summarization.

<img src="images/sequence.jpg" width ="600px" height ="600px">

Image source: [Link to source](https://miro.medium.com/v2/resize:fit:669/0*iDgmgGnrzq65dPXy.jpg)


<img src="images/sequence 1.png" width ="600px" height ="600px">

Image source: [Link to source](https://miro.medium.com/v2/resize:fit:700/1*y4D1XNJQmx-Gii1oHeHy_A.png)



## Encoder-Decoder Models and their variants

#### The Encoder-Decoder Network

Encoder-decoder networks have been applied to a very wide range of applications, including machine translation, text summarisation, question answering and dialogue. The underlying idea is simple: the encoder takes the input sequence and creates a contextual representation of it (also called the context), and the decoder takes this contextual representation as input and generates the output sequence.

<img src="images/encoder decoder core.png" width ="600px" height ="600px">

Image source: [Link to source](https://medium.com/nerd-for-tech/nlp-theory-and-code-encoder-decoder-models-part-11-30-e686bcb61dc7)

### Encoder:

The encoder takes the input sequence and generates a context, which captures the essence of the input and is passed to the decoder.

<img src="images/encoder.png" width ="600px" height ="600px">

Image source: [Link to source](https://medium.com/nerd-for-tech/nlp-theory-and-code-encoder-decoder-models-part-11-30-e686bcb61dc7)

The entire purpose of the encoder is to generate a contextual representation (context) of the input sequence.

### Decoder:

The decoder takes the context as input and generates an output sequence. When an RNN is used as the decoder, the context is the final hidden state of the RNN encoder.

<img src="images/decoder.png" width ="600px" height ="600px">

Image source: [Link to source](https://medium.com/nerd-for-tech/nlp-theory-and-code-encoder-decoder-models-part-11-30-e686bcb61dc7)

The first decoder RNN cell takes “CONTEXT” as its prior hidden state. The decoder then generates output until the end-of-sequence marker is produced.
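The encoder and decoder described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a trained translation model: the weights are random, the helper names (`init_params`, `encode`, `decode`), the token ids, and the greedy token choice are all assumptions made for the example, so the generated tokens carry no meaning. What it does show is the data flow: the encoder's final hidden state becomes the context, and the decoder starts from that context and stops at the end-of-sequence marker or at a length limit.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_step(w_in, w_h, x, h):
    """One vanilla RNN step: h_new = tanh(W_in x + W_h h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_h, h))]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def init_params(vocab, hidden, seed=0):
    """Random, untrained weights (hypothetical stand-ins for a trained model)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "vocab": vocab, "hidden": hidden,
        "enc_in": mat(hidden, vocab), "enc_h": mat(hidden, hidden),
        "dec_in": mat(hidden, vocab), "dec_h": mat(hidden, hidden),
        "out": mat(vocab, hidden),
    }

def encode(params, source_ids):
    """Run the encoder RNN; its final hidden state is the context."""
    h = [0.0] * params["hidden"]
    for token in source_ids:
        x = one_hot(token, params["vocab"])
        h = rnn_step(params["enc_in"], params["enc_h"], x, h)
    return h

def decode(params, context, sos, eos, max_len=10):
    """Greedy decoding: start from the context, stop at EOS or max_len."""
    h, token, output = context, sos, []
    for _ in range(max_len):
        x = one_hot(token, params["vocab"])
        h = rnn_step(params["dec_in"], params["dec_h"], x, h)
        logits = matvec(params["out"], h)
        token = max(range(len(logits)), key=lambda i: logits[i])
        if token == eos:
            break
        output.append(token)
    return output

SOS, EOS = 0, 1  # assumed ids for the start- and end-of-sequence markers
params = init_params(vocab=6, hidden=4)
context = encode(params, [2, 3, 4])              # toy source sentence
translation = decode(params, context, SOS, EOS)  # meaningless until trained
print(len(context), translation)
```

In a real system the weights would be learned from parallel text, and a framework such as PyTorch or TensorFlow would replace the hand-written matrix code.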

- Complete encoder-decoder model

<img src="images/complete encoder decoder.png" width ="600px" height ="600px">

Image source: [Link to source](https://medium.com/nerd-for-tech/nlp-theory-and-code-encoder-decoder-models-part-11-30-e686bcb61dc7)

<img src="images/encoder decoder 2.png" width ="600px" height ="600px">

Image source: [Link to source](https://towardsdatascience.com/what-is-an-encoder-decoder-model-86b3d57c5e1a)

### Types of Encoders and Decoders

There are two main types of encoder and decoder: 
- Linear 
- Nonlinear

#### Linear encoders and decoders:

These are the most common type. They work by taking an input signal and converting it into an output signal that is proportional to the input.

#### Nonlinear encoders and decoders:

These are less common but are more versatile. They work by taking an input signal and converting it into an output signal that is not proportional to the input.









## Introduction to attention mechanism and its role in sequence-to-sequence models

<img src="images/attention mechanism.png" width ="600px" height ="600px">

Image source: [Link to source](https://miro.medium.com/v2/resize:fit:1022/1*qhOlQHLdtfZORIXYuoCtaA.png)



A Seq2Seq model with an attention mechanism consists of an encoder, a decoder, and an attention layer.

The attention layer consists of:

- Alignment layer
- Attention weights
- Context vector
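The three parts of the attention layer listed above can be sketched as follows. This is a minimal example under stated assumptions: the alignment score is a plain dot product between the decoder state and each encoder state (one common choice among several), and the function name `attend` and the toy vectors are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw alignment scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    # 1. Alignment: score each encoder state against the decoder state
    #    (dot-product scoring is an assumption of this sketch).
    scores = [dot(decoder_state, h) for h in encoder_states]
    # 2. Attention weights: normalise the scores with a softmax.
    weights = softmax(scores)
    # 3. Context vector: weighted sum of the encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source token
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

The second encoder state is orthogonal to the decoder state, so it receives the smallest weight and contributes least to the context vector.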

## Understanding of Beam Search and its application in sequence-to-sequence models


<img src="images/beam1.png" width ="600px" height ="600px">

Image source: [Link to source](https://medium.com/@dhartidhami/beam-search-in-seq2seq-model-7606d55b21a5)



<img src="images/beam2.png" width ="600px" height ="600px">

Image source: [Link to source](https://medium.com/@dhartidhami/beam-search-in-seq2seq-model-7606d55b21a5)



<img src="images/beam3.png" width ="600px" height ="600px">

Image source: [Link to source](https://medium.com/@dhartidhami/beam-search-in-seq2seq-model-7606d55b21a5)


### APPLICATIONS 

A beam search is most often used to maintain tractability in large systems with insufficient memory to store the entire search tree. For example:
- It has been used in many machine translation systems.
- Each part of the sentence is processed, and many different ways of translating the words are generated.
- The best translations, ranked according to their sentence structures, are kept and the rest are discarded. The translator then evaluates the remaining translations against a given criterion and chooses the one that best meets the goals.
- The first use of a beam search was in the Harpy Speech Recognition System, CMU 1976.
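The keep-the-best, discard-the-rest idea can be sketched as a small beam search over a toy next-token table. The table `TABLE`, the function names, and the probabilities are hypothetical; a real decoder would obtain the next-token probabilities from a trained model, and production systems usually add length normalisation, which this sketch omits.

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "</s>": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}

def toy_model(sequence):
    """Return log-probabilities for the next token given the sequence so far."""
    return {tok: math.log(p) for tok, p in TABLE[sequence[-1]].items()}

def beam_search(next_log_probs, start, eos, beam_width=2, max_len=5):
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by one token.
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + [tok], score + lp))
        # Keep only the top `beam_width` candidates; discard the rest.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off by max_len
    return sorted(finished, key=lambda c: c[1], reverse=True)

best_seq, best_score = beam_search(toy_model, "<s>", "</s>", beam_width=2)[0]
greedy_seq, greedy_score = beam_search(toy_model, "<s>", "</s>", beam_width=1)[0]
print(best_seq, math.exp(best_score))
print(greedy_seq, math.exp(greedy_score))
```

With a beam width of 1 the search is greedy and commits to `a` because it has the highest first-step probability; with a width of 2 it also keeps `b`, which leads to a more probable complete sequence.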

## Evaluation metrics for machine translation
 
 
 
 
<img src="images/evaluation matrices.png"  width ="1000px" height ="1000px">

Image source: [Link to source](https://www.mdpi.com/2227-7390/11/4/1006)

 


## Transfer learning and fine-tuning pre-trained models for machine translation tasks


<img src="images/transferlearning.png" width ="800px" height ="800px">

Image source: [Link to source](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/616b35e3dcd432047dd02ea5_uYLdnVpAfjC3DC7eWJM2xWyQin_dbVcak0JlRpd7S2bAkdylh-9JITWttww3Wq8fKI56Tl3_v7Y-aVh4nKgl4mZl4ZvcoUIViQRJhBBSw2cpC087oc2iZYvBytr8o1ks1FY1LQxh%3Ds0.png)


<img src="images/transfer2.png" width ="800px" height ="800px">

Image source: [Link to source](https://slds-lmu.github.io/seminar_nlp_ss20/figures/02-00-transfer-learning-for-nlp/compare-classical-transferlearning-ml.png)


<img src="images/PRETRAINED.png" width ="800px" height ="800px">

Image source: [Link to source](https://link.springer.com/article/10.1007/s13218-021-00746-2/figures/1)



<img src="images/multiphased.png" width ="800px" height ="800px">

Image source: [Link to source](https://link.springer.com/article/10.1007/s13218-021-00746-2/figures/1)

- Top: source network pre-trained on ImageNet. 
- Bottom: proposed multi-phase fine-tuning approach. The layers to be fine-tuned in the target network are adapted over several phases starting from the top layer.



### Understanding of fine-tuning pre-trained models with an example:

Most of the world’s text is not in English. To enable researchers and practitioners to build impactful solutions in their domains, understanding how NLP architectures fare in many languages needs to be more than an afterthought. The MultiFiT paper studies multilingual text classification and introduces MultiFiT, a novel method based on ULMFiT. MultiFiT, trained on 100 labeled documents in the target language, outperforms multilingual BERT. It also outperforms the cutting-edge LASER algorithm, even though LASER requires a corpus of parallel texts and MultiFiT does not.
 

<img src="images/pre2.png" width ="800px" height ="800px">

Image source: [Link to source](https://eisenjulian.github.io/content/images/size/w2000/2020/08/multifit_bootstrapping.png)
 
See the source link above for further details.


<img src="images/How-does-fine-tuning-pre-trained-models-work (1).png" width ="800px" height ="800px">

Image source: [Link to source](https://d3lkc3n5th01x7.cloudfront.net/wp-content/uploads/2023/02/02020040/How-does-fine-tuning-pre-trained-models-work.png)
 
 
 
 
## What is fine-tuning a pre-trained model?
Fine-tuning is used to optimize a model’s performance on a new or different task. It tailors a model to a specific need or domain, such as cancer detection in healthcare. Pre-trained models are first trained on large amounts of labeled data for a certain task, such as Natural Language Processing (NLP) or image classification. Once trained, the model can be adapted to similar new tasks or datasets that have limited labeled data by fine-tuning it.

Fine-tuning is commonly used in transfer learning, where a pre-trained model is the starting point for training a new model on a different but related task. A pre-trained model can significantly reduce the labeled data required to train a new model, making it an effective tool for tasks where labeled data is scarce or expensive.
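A minimal sketch of the idea, assuming a two-layer setup: a frozen "pre-trained" layer whose weights are never updated, and a small task-specific head that is trained on a handful of labelled examples. The weight values in `PRETRAINED_W`, the toy dataset, and the function names are hypothetical stand-ins, not taken from any real model.

```python
import math

# Hypothetical "pre-trained" weights; in practice these come from a large model.
PRETRAINED_W = [[1.0, -1.0], [0.5, 0.5]]

def features(x):
    """Frozen base layer: used as-is, never updated during fine-tuning."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_W]

def predict(head, x):
    """Task-specific head: logistic regression on the frozen features."""
    z = sum(w * f for w, f in zip(head["w"], features(x))) + head["b"]
    return 1.0 / (1.0 + math.exp(-z))

def loss(head, data):
    """Mean binary cross-entropy over the dataset."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = predict(head, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)

def fine_tune(head, data, lr=0.5, epochs=200):
    """Update only the head with stochastic gradient descent."""
    for _ in range(epochs):
        for x, y in data:
            f = features(x)
            err = predict(head, x) - y  # gradient of the loss w.r.t. the logit
            head["w"] = [w - lr * err * fi for w, fi in zip(head["w"], f)]
            head["b"] -= lr * err
    return head

# A small labelled dataset for the new task (toy values).
data = [([2.0, 0.0], 1), ([0.0, 2.0], 0), ([1.5, 0.5], 1), ([0.5, 1.5], 0)]
head = {"w": [0.0, 0.0], "b": 0.0}
loss_before = loss(head, data)
fine_tune(head, data)
loss_after = loss(head, data)
print(loss_before, loss_after)
```

Freezing the whole base is only one option; another common choice is to unfreeze some or all pre-trained layers and train them with a small learning rate.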



## Unsupervised machine translation and its techniques

<img src="images/svsu.png" width ="600px" height ="600px">

Image source: [Link to source](https://upload.wikimedia.org/wikipedia/commons/thumb/9/90/Task-guidance.png/450px-Task-guidance.png)


### Unsupervised Learning
Supervised learning covers common machine learning algorithms for both regression and classification problems. The second approach in the spectrum of machine learning is unsupervised learning.

In contrast with supervised learning, we don’t need to provide the model with the ground-truth label of each data point during training. The model learns the patterns in the data points by itself, hence the name ‘unsupervised’.

In real-life applications, unsupervised learning is very useful because most data is unlabeled and providing a ground-truth label for each data point is very time-consuming.

Common use cases for unsupervised learning include dimensionality reduction, clustering, and anomaly detection.

<img src="images/differencesupervised.png" width ="600px" height ="600px">

Image source: [Link to source](https://www.stratascratch.com/blog/supervised-vs-unsupervised-learning/)

### What is unsupervised Learning?


<img src="images/supervised-vs-unsupervised-machine-learning-768x624.png" width ="600px" height ="600px">

Image source: [Link to source](https://vitalflux.com/wp-content/uploads/2021/11/supervised-vs-unsupervised-machine-learning-768x624.png)


Unsupervised learning is a training technique in which machine learning models are not provided with any labelled data and must learn from the input/environment themselves. Unsupervised machine-learning techniques try to find patterns in a pool of unlabelled examples (even though such an example is missing some information by definition). Unsupervised learning is primarily of two types:

- Clustering: This method of unsupervised learning creates clusters from the input data. Data points that are similar end up in the same cluster, and predictions are made using those clusters.
- Association: The second method of unsupervised learning is association, in which the algorithms find rules in the input data, and predictions are then made based on those rules and the data.

Here are key points regarding unsupervised machine learning:

- The training dataset is a pool of unlabelled examples.
- The goal of unsupervised learning is to discover hidden patterns within the dataset where the output is not predefined.
- Unsupervised learning models can solve complex clustering and association problems.
- Some examples of unsupervised learning algorithms include the following:
  - Hierarchical clustering
  - K-means clustering
  - Principal component analysis
  - DBSCAN
  - Apriori algorithm for association
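As an illustration of one algorithm from the list, here is a minimal k-means clustering sketch. It assumes a naive initialisation (the first `k` points become the starting centroids) and uses made-up 2-D points; library implementations typically use smarter initialisation. Note that no labels are supplied: the grouping emerges from the data alone.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, iters=20):
    centroids = [points[i] for i in range(k)]  # naive init (an assumption)
    for _ in range(iters):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                # Move the centroid to the mean of its members.
                new_centroids.append(tuple(sum(d) / len(members)
                                           for d in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Unlabelled toy data: two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```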

Some real-world examples of applications leveraging unsupervised learning algorithms include customer segmentation, user profiling, fraud detection, machine quality inspection, machine failure prediction, etc.






 

## Multilingual models and their application in NLP tasks

### What is multilingual Natural Language Processing (NLP)? 

Multilingual NLP is a technology that integrates linguistics, artificial intelligence, and computer science to serve the purpose of processing and analyzing substantial amounts of natural human language in numerous settings.

<img src="images/multi 1.png" width ="600px" height ="600px">

Image source: [Link to source](https://stagezero.ai/blog/multilingual-nlp-solutions/)


### How does multilingual NLP work? 

There are many different forms of multilingual NLP, but in general, it enables software to understand the language of given texts, along with contextual nuances. Multilingual NLP is also capable of extracting specific data and delivering key insights. In short, multilingual NLP technology makes it practical to process and analyze large amounts of data, a task that would otherwise require a very labor- and time-intensive approach.


### How multilingual NLP can break down language barriers

There have been recent advancements in building models that cater to a diverse spectrum of languages, helping researchers overcome the biggest causes of language barriers.

<img src="images/multi 2.png" width ="1000px" height ="600px">

Image source: [Link to source](https://www.allerin.com/wp-blog/wp-content/uploads/2022/07/How-Multilingual-NLP-Can-Break-Down-Language-Barriers.png)



# Zero-shot learning and its application in machine translation tasks

## What is Zero-Shot Learning?
Transfer learning finds its inspiration in humans’ capacity to generalize from experience. Humans are very good at using previous knowledge to handle new situations. For instance, a person who speaks Swedish can use that experience to learn a similar language such as Norwegian.

Zero-shot learning (ZSL) is a form of transfer learning that aims to learn patterns from labeled data in order to detect classes that were never seen during training. Because a lack of labeled data and limited scalability are recurring problems in machine learning applications, ZSL has gained much attention in recent years thanks to its ability to predict unseen classes.
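One common way to predict an unseen class is to describe every class by a vector of attributes and pick the class whose description best matches what the model detects in the input. The sketch below shows only that matching step. The attribute names, the class vectors, and the `predicted` values are hypothetical; in practice the attribute predictor would be trained on the seen classes only.

```python
import math

# Hypothetical attribute vectors: (has_hooves, has_stripes, is_carnivore).
SEEN_CLASSES = {"horse": (1, 0, 0), "tiger": (0, 1, 1)}
UNSEEN_CLASSES = {"zebra": (1, 1, 0)}  # no training examples, only a description

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def zero_shot_classify(predicted_attributes, class_attributes):
    """Pick the class whose attribute description is most similar."""
    return max(class_attributes,
               key=lambda name: cosine(predicted_attributes, class_attributes[name]))

# Assumed output of an attribute predictor for a new input:
# strong evidence of hooves and stripes, little evidence of a carnivore.
predicted = (0.9, 0.8, 0.1)
all_classes = {**SEEN_CLASSES, **UNSEEN_CLASSES}
label = zero_shot_classify(predicted, all_classes)
print(label)
```

The input is assigned to `zebra` even though no zebra example was used in training, because its detected attributes match the zebra description more closely than those of the seen classes.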


<img src="images/zero 1.jpg" width ="600px" height ="600px">

Image source: [Link to source](https://modulai.io/wp-content/uploads/2022/08/ZSL_example_updated-1280x757.jpg)



<img src="images/zero 2.png" width ="600px" height ="600px">

Image source: [Link to source](https://editor.analyticsvidhya.com/uploads/68432img3.PNG)

<img src="images/zero 3.jpg" width ="600px" height ="600px">

Image source: [Link to source](https://opendatascience.com/wp-content/uploads/2019/11/Creativity-Inspired-Zero-shot-Learning-2-640x300.jpg)

# Back-translation and its application in machine translation tasks


## What is back translation?

Back translation, also called reverse translation, is the process of re-translating content from the target language back to its source language in literal terms. For example, if you’re translating content from English to Swedish, the translator would also write a back translation in English so the intention of the translated option is easily understood. Back translations don’t impact the translation memory or other resources like glossaries used by the translator.

Back translation is most helpful when the content at hand includes taglines, slogans, titles, product names, clever phrases and puns, because the implied meaning of the content in one language doesn’t necessarily work for another language or region. The back translation offers the content owner the opportunity to review what creative liberties the translators took to adapt the content for their market. Often, for content this complex, the translator will offer multiple options so the source content owner can make the decision that makes the most sense for the brand.

Back translation is often confused with double translation, which is a method of translation where one translator creates multiple versions of a piece of content to account for the nuances in different words and phrases.

Back translation is often used as a quality assurance method. The back translation process looks like this:

- A linguist translates the original source text into the new language.
- Then the linguist translates the localized string back into the source language literally to convey the meaning of the translation.
- The content owner or project manager selects the option that best represents the brand, tone and intention of the source content.



<img src="images/backtranslation.png" width ="900px" height ="900px">

Image source: [Link to source](https://repository-images.githubusercontent.com/196985080/c4892580-bef8-11e9-9ba7-b888350d6fa7)




# Ensembling in machine translation tasks

## What is Ensemble Learning?

Ensemble Learning is a method of reaching a consensus in predictions by fusing the salient properties of two or more models. The final ensemble learning framework is more robust than the individual models that constitute the ensemble because ensembling reduces the variance in the prediction errors.

Ensemble Learning tries to capture complementary information from its different contributing models—that is, an ensemble framework is successful when the contributing models are statistically diverse. 

In other words, models that display performance variation when evaluated on the same dataset are better suited to form an ensemble.
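A minimal sketch of reaching a consensus, assuming the simplest combination rule: averaging the probability distributions that several models assign to the same candidates. The three distributions below are made-up values standing in for the outputs of three different models, and `ensemble_average` is a hypothetical helper name.

```python
def ensemble_average(distributions):
    """Average several probability distributions over the same candidates."""
    candidates = set()
    for dist in distributions:
        candidates.update(dist)
    n = len(distributions)
    # A candidate missing from one model's output counts as probability 0.
    return {c: sum(d.get(c, 0.0) for d in distributions) / n for c in candidates}

# Hypothetical next-word probabilities from three diverse models.
model_outputs = [
    {"cat": 0.5, "dog": 0.4, "cow": 0.1},
    {"cat": 0.3, "dog": 0.6, "cow": 0.1},
    {"cat": 0.2, "dog": 0.5, "cow": 0.3},
]

averaged = ensemble_average(model_outputs)
consensus = max(averaged, key=averaged.get)
individual_choices = [max(d, key=d.get) for d in model_outputs]
print(individual_choices, consensus)
```

The first model on its own prefers a different candidate from the other two; averaging smooths out that individual disagreement, which is the variance-reduction effect described above.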

<img src="images/ensemble-learning 1.png" width ="600px" height ="600px">

Image source: [Link to source](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61f7bbd4e90cce440b88ea32_ensemble-learning.png)

## How does ensemble learning work?
Ensemble learning combines the mapping functions learned by different classifiers to generate an aggregated mapping function. 

The diverse methods proposed over the years use different strategies for computing this combination. 

The most popular methods in the literature are listed below.
- Bagging

<img src="images/essemble 2.png" width ="800px" height ="800px">

Image source: [Link to source](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61a4414d28946a3ac3e69ed9_q-FrlRMLk-5nSxZ_3ONlFpu5hQ61PsuAxkusTD1vEX5NqkdH2Ie0u_75rIySTZKXVI4VBxM-AIw3APQvRboG3kv-3l3cA5c5qyMwwTMe2OLXzoAgA051Dqbx7XVfdJaDyNwrSLUf.png)

- Boosting

<img src="images/essemble 3.png" width ="800px" height ="800px">

Image source: [Link to source](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61a4414d5e568a661fb7896c_mji7xyiAlyQAdxQde14HY1OVvAVzDyyKhDOo4a4bg53_m2OHUvHhMGexaHuHCfKGRVQQlfFlihuodX7LD5hugPgGw8ZzJV4bHjHc648Zr0LyVr2I0i6ciJvJri_OFCuQpOf81xcn.png)

- Stacking

<img src="images/essemble 4.png" width ="800px" height ="800px">

Image source: [Link to source](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61a4414dba9e9f94d7b31368_RhQ6ctlYepNo3J-yChyk_jLM_siHT9eGIJTpcI0NEPhADEcGic31JW4TWwLLzWv0LvqyDjFx9yQ8m16kKENTtPZeW-fY-9z6k7m-rsmPseGIeHhB-IiI0V5t4hImEPZRnEWPChAo.png)

- Mixture of Experts

<img src="images/essemble 5.png" width ="800px" height ="800px">

Image source: [Link to source](https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61a4414da1649f856e553d74_jiy3kw7cLPoe3-DClXGWwEZo4FNQuf3NzqSpelWXN0di_Ydnyz9QnHqGiFAd87pc-WELMSXGKKdw1wqeA5pjSioytSVNXCmU6wdaV3nFUYZjkKKs_deV_XyUjLOfU8K9o5RWu_n4.png)



