# Week 4: Word Embeddings and Language Models


- Introduction to word embeddings and their types
- Understanding of word2vec and GloVe embeddings
- Implementing word embeddings using PyTorch or TensorFlow
- Understanding of language models and their types
- Introduction to Markov Chains and Hidden Markov Models
- Introduction to n-gram language models
- Understanding of Recurrent Neural Language Models
- Introduction to Generative Pre-trained Transformer (GPT) and its variants
- Introduction to fine-tuning pre-trained language models
- Understanding of evaluation metrics for language models
- Understanding of text generation using language models
- Introduction to machine translation and its techniques
- Understanding the role of attention mechanism in language models
- Understanding the concept of language model pre-training and its application in NLP tasks
- Understanding the concept of zero-shot learning and its application in NLP tasks

## Word embeddings and their types

<img src = "images/word embeding.png" width="600px" height="600px">

Image Source:[image source](https://www.researchgate.net/profile/Hamid-Bekamiri/publication/361134482/figure/fig1/AS:1164172024918017@1654571648469/Types-of-word-embedding-techniques-Selva-and-Kanniga-2021.ppm)


<img src = "images/word embeding 2.png" width="600px" height="600px">

Image Source:[image source](https://www.mathworks.com/help/examples/textanalytics/win64/VisualizeWordEmbeddingsUsingTextScatterPlotsExample_01.png)

Word embeddings are a form of word representation that bridges human understanding of language and that of a machine. They are learned representations of text in an n-dimensional space in which words with similar meanings have similar representations: two similar words are represented by vectors that lie close together in the vector space. They are essential for solving most natural language processing problems.

<img src = "images/word to vec.png" width="400px" height="400px">

Image Source:[image source](https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/07/03000751/we1.png)

## Word2vec and GloVe embeddings

Word2vec is a method for efficiently creating word embeddings with a two-layer neural network.
Its input is a text corpus and its output is a set of feature vectors that represent the words in that corpus. Although Word2vec is not a deep neural network, it turns text into a numerical form that deep neural networks can process.
The Word2vec objective function causes words that appear in similar contexts to have similar embeddings, so these words lie close together in the vector space. Mathematically, the cosine of the angle θ between such vectors should be close to 1, i.e. the angle should be close to 0.

<img src = "images/w2vec.png" width="350px" height="350px">

Image Source:[image source](https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/07/03000917/we3.png)
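
To make the cosine claim concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hypothetical values picked by hand for illustration; they are not the output of a trained Word2vec model, and real embeddings usually have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand (assumption, not trained values).
embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])

print(f"king vs queen: {sim_related:.3f}")
print(f"king vs apple: {sim_unrelated:.3f}")
```

With these toy values the related pair scores close to 1 and the unrelated pair scores much lower, which is the behaviour the objective function is meant to produce.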

Word2vec is not a single algorithm but a combination of two techniques: CBOW (Continuous Bag-of-Words) and the Skip-gram model. Both are shallow neural networks that map one or more input words to a target that is also one or more words. Both learn weights that act as the word vector representations.


<img src = "images/cbow and skip-grams.png" width="600px" height="600px">

Image Source:[image source](https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/07/03000954/we4-1068x651.png)


### Continuous Bag-of-Words model (CBOW)


<img src = "images/cbow.png" width="600px" height="600px">

Image Source:[image source](https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/07/03001031/we5.png)


### Skip-gram model

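The skip-gram model works much like CBOW, but with the direction reversed: CBOW predicts a target word from its surrounding context words, whereas skip-gram predicts the context words from the target word. This changes the architecture of the neural network and the way the weight matrix is generated, as shown in the figure below: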


<img src = "images/skip grams.png" width="600px" height="600px">

Image Source:[image source](https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/07/03001127/we7.png)
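
The difference between the two models is easiest to see in the training examples each one is fed. The sketch below builds both kinds of examples from a toy sentence. The function names (`context_window`, `cbow_examples`, `skipgram_examples`) and the window size are illustrative assumptions, and the network that would consume these examples is left out.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """CBOW: the input is the list of context words, the label is the center word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """Skip-gram: the input is the center word, the label is one context word."""
    return [(tokens[i], context)
            for i in range(len(tokens))
            for context in context_window(tokens, i, window)]

tokens = "the quick brown fox jumps".split()
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)

for example in cbow:
    print("CBOW     ", example)
for example in skipgram:
    print("skip-gram", example)
```

CBOW yields one example per position with several input words, while skip-gram yields one example per (center, context) pair, so the same sentence produces more skip-gram examples.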


## GloVe

GloVe (Global Vectors for Word Representation) is an alternative method for creating word embeddings. It is based on matrix factorization of the word-context matrix. A large co-occurrence matrix is constructed in which each row is a word and each column is a context, and each entry counts how often the word appears in that context in a large corpus. The corpus is usually scanned as follows: for each term, context terms are collected within a window of fixed size before and after the term, and more distant words are given less weight.
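
The scanning step can be sketched as follows. This is a minimal illustration, not the reference GloVe implementation: the function name `cooccurrence_counts` and the toy sentence are assumptions, and the weighting gives a pair that is `d` positions apart a contribution of `1/d`, which is one common way to give less weight to distant words.

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Accumulate distance-weighted co-occurrence counts X[word][context]."""
    counts = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue  # a word is not its own context
            distance = abs(i - j)
            # Nearby context words count more than distant ones.
            counts[word][tokens[j]] += 1.0 / distance
    return counts

tokens = "the cat sat on the mat".split()
X = cooccurrence_counts(tokens, window=2)

for word in sorted(X):
    print(word, dict(X[word]))
```

Because the window is symmetric, the resulting counts are symmetric too: `X[a][b]` equals `X[b][a]` for every pair.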

The number of “contexts” is, of course, large, since it is essentially combinatorial in size. The matrix is therefore factorized to yield a lower-dimensional matrix in which each row is the vector representation of one word. In general, this is done by minimizing a “reconstruction loss” that looks for the lower-dimensional representations which explain most of the variance in the high-dimensional data.
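
For reference, the loss used in the original GloVe paper (Pennington et al., 2014) is a weighted least-squares objective on the logarithm of the co-occurrence counts. The symbols below are not defined elsewhere in these notes and are introduced here as assumptions: $X_{ij}$ is the co-occurrence count of word $i$ with context $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are bias terms, and $V$ is the vocabulary size.

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

$$
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
1 & \text{otherwise}
\end{cases}
$$

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur. The paper reports using $x_{\max} = 100$ and $\alpha = 3/4$.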

## Language models and their types

<img src = "images/language models .png" width="800px" height="800px">

<img src = "images/common examples of language models.png" width="800px" height="800px">

Image Source:[Further Explanation](https://insights.daffodilsw.com/blog/what-are-language-models-in-nlp)

<img src = "images/language models.png" width="700px" height="700px">

Image Source:[image source](https://www.researchgate.net/profile/Mrinal-Bachute/publication/358420139/figure/fig3/AS:11431281092693731@1666938327851/Types-of-language-models-based-on-Deep-Learning.png)

## Markov Chains and Hidden Markov Models

On the surface, Markov Chains (MCs) and Hidden Markov Models (HMMs) look very similar.
We will clarify their differences in two ways: first, by diving into their mathematical details, and second, by considering the different problems each one is used to solve.
Together, this builds an understanding of the logic behind the models and their potential applications.
Before doing that, let us start with their common building block: stochastic processes.

## Stochastic Processes

<img src = "images/stochastic process.png" width="400px" height="400px">

Image Source:[image source](https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75)



<img src = "images/markove chain.png" width="400px" height="400px">

Image Source:[image source](https://www.youtube.com/watch?v=i3AkTO9HLXo)


<img src = "images/Markove chain 1.png" width="400px" height="400px">




<img src = "images/random walk markove.png" width="400px" height="400px">




<img src = "images/infinity walke.png" width="400px" height="400px">

Image Source:[Further Explanation](https://www.youtube.com/watch?v=i3AkTO9HLXo)

<img src="https://play-lh.googleusercontent.com/2El-X0RCUTz3jdOrERZh3NosHTpyYznQqWvQV4gnibCJq02tLztlQGjdOio4GY-oEt8" width="300px" height="300px">