<font color='red'>Q0. What is the difference between a `shallow neural network` and a `deep neural network`?</font>

Previously we saw two feature extraction techniques for text data -
- Bag of Words
- TF-IDF

Apart from these, we can also use the following feature extraction techniques for text - 
- One-hot encoding
- Word vectors

### One hot encoding

Let's assume we have a word dictionary as shown below -

![alt text](WE1.png 'word dictionary')

![alt text](WE2.png 'one hot vectors')

If we have sentences and use `one-hot encoding` for feature extraction, the sentences would look like this -

![alt text](WE3.png 'one hot representation')
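As a rough illustration of the idea, here is a minimal sketch of one-hot encoding in plain Python. The vocabulary and the sentence are hypothetical placeholders, not the ones shown in the figures, and the tokenizer is just a whitespace split.

```python
# Hypothetical toy vocabulary (not the one from the figures above)
vocab = ["i", "like", "reading", "books"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of len(vocab) zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def encode_sentence(sentence):
    """Encode a whitespace-tokenized sentence as one one-hot vector per word."""
    return [one_hot(token) for token in sentence.lower().split()]

encoded = encode_sentence("I like books")
for row in encoded:
    print(row)
```

Each sentence becomes a matrix with one row per word and one column per vocabulary entry, so the width of the matrix grows with the size of the dictionary.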

<font color='red'>Q1. What are some disadvantages of using `one hot vectors`?</font>

Answer - 
- It creates a very large vector for each word, so the text becomes a very `sparse matrix` that takes a lot of memory and computation time
- No embedded meaning: the vectors do not carry the words' meaning, as shown below - 

![alt text](WE4.png 'disadvantage of one hot representation')
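The "no embedded meaning" point can be checked numerically. The sketch below uses a hypothetical three-word vocabulary and shows that the dot product between any two different one-hot vectors is always zero, so related words look exactly as dissimilar as unrelated ones.

```python
def one_hot(index, size):
    """Return a one-hot list of the given size with a 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical vocabulary: two related words and one unrelated word
vocab = ["hotel", "motel", "banana"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

sim_related = dot(vectors["hotel"], vectors["motel"])
sim_unrelated = dot(vectors["hotel"], vectors["banana"])
print(sim_related, sim_unrelated)
```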

### Word vectors

![alt text](WE5.png 'word vectors')

Similarly, we can keep adding more dimensions to the word vectors above so that they capture more of the `semantic` relationships between the words.

<font color='red'>Q2. What are the advantages of using `word vectors`?</font>

Answer - 
- They are lower-dimensional than one-hot vectors
- Word vectors are able to capture the meaning of words, that is, the semantic relationships among them

![alt text](WE6.png 'Title')
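A common way to measure how related two word vectors are is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration (real embeddings have tens to hundreds of dimensions, and these values do not come from any trained model).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up dense vectors, chosen only to illustrate the idea
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.9],
    "apple": [0.1, 0.0, 0.5],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(sim_king_queen, sim_king_apple)
```

Unlike one-hot vectors, dense vectors can give a higher similarity to related words than to unrelated ones.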

### How to create word vectors?

* 1\. Corpus
    * The corpus should reflect the context of the target domain. For example, if we want to create word embeddings for legal/law-based clients, then the corpus should consist of law books, legal contracts, etc. A Wikipedia corpus will not help much in capturing the context of legal/law-based business cases
* 2\. Embedding method
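    * Word embeddings are generally a `by-product` of a `self-supervised` (sometimes loosely called `semi-supervised`) machine/deep learning task, where the training labels come from the text itself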

### Word embedding methods

- **Word2Vec** (you can read the research paper [here](https://arxiv.org/pdf/1301.3781.pdf)). It uses a `shallow neural network` to train the word vectors and it has two model architectures - 
    - **Continuous bag of words (CBOW)**: This architecture tries to predict a missing word given its surrounding words
    - **Continuous skip-gram/skip-gram with negative sampling (SGNS)**: This architecture tries to predict the surrounding words given an input word
    
- **Global vectors (GloVe)** (you can read the research paper [here](https://nlp.stanford.edu/pubs/glove.pdf)). It generates the word vectors by factorizing the logarithm of the corpus's word co-occurrence matrix.

- **fastText** (you can read the research paper [here](https://arxiv.org/pdf/1607.04606.pdf)). It also uses the skip-gram architecture and represents each word as a bag of character n-grams. There is another embedding model from Facebook - [StarSpace](https://arxiv.org/pdf/1709.03856.pdf)
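To make the character n-gram idea behind fastText more concrete, here is a minimal sketch that extracts the n-grams of a word after wrapping it in the boundary markers `<` and `>`, as described in the paper. The function name `char_ngrams` is hypothetical, and the sketch leaves out the whole-word token and the hashing that the real library uses.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of '<word>' for n in [min_n, max_n]."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

trigrams = char_ngrams("where", min_n=3, max_n=3)
print(trigrams)
```

In fastText, the vector of a word is built from the vectors of its n-grams, which is why it can produce embeddings even for words that never appeared in the training corpus.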

<font color='red'>The embedding methods above give each word the same vector no matter in what context it is used. But there are a few more advanced embedding models in which a word's embedding changes based on the context in which it is used. Below are some examples - </font>

- **Embeddings from language models (ELMo)**, you can read the research paper [here](https://arxiv.org/pdf/1802.05365.pdf)
- **Bidirectional encoder representations from transformers (BERT)**, you can read the research paper [here](https://arxiv.org/pdf/1810.04805.pdf)
- **Generative pre-trained transformer 3 (GPT-3)**, you can read the research paper [here](https://arxiv.org/pdf/2005.14165.pdf)

<font color='red'>You can fine-tune the three models above to get customized word embeddings based on the corpus of the business problem you are solving</font>

### Continuous bag of words (CBOW) model

In this architecture we try to predict the center word based on its surrounding words. The idea is that if two different words are frequently surrounded by similar sets of words across various sentences, then those words tend to be related in meaning. In other words, the two words are `semantically` similar.

![alt text](WE7.png 'skip-gram intuition')

With a large enough corpus, the model will learn to predict that the missing word is `reading`, `studying` or `writing`, or a synonym of one of these words.

![alt text](WE8.png 'skip-gram intuition')

![alt text](WE9.png 'skip-gram architecture')

<font color='red'>Q3. How can we prepare the above data for training? As we already know, machines don't understand strings</font>

![alt text](WE10.png 'skip-gram architecture')
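One possible way to prepare such data is to slide a window over each sentence, collect (context words, center word) pairs, and then replace every word by its integer index in the vocabulary. The sketch below assumes a whitespace tokenizer, a hypothetical example sentence, and a window of one word on each side; the helper name `cbow_pairs` is made up for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, center_word) pairs for CBOW-style training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip words that have no neighbours at all
            pairs.append((context, target))
    return pairs

tokens = "i like reading books".split()
vocab = sorted(set(tokens))
word_to_index = {word: i for i, word in enumerate(vocab)}

pairs = cbow_pairs(tokens, window=1)
# Replace strings by vocabulary indices so a network can consume them
indexed = [([word_to_index[w] for w in ctx], word_to_index[t]) for ctx, t in pairs]

for (ctx, target), row in zip(pairs, indexed):
    print(ctx, "->", target, "| indices:", row)
```

The integer indices can then be turned into one-hot vectors (or fed to an embedding lookup) at the input of the network.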

### How to extract the `word embeddings` from the above network?

![alt text](WE11.png 'extracting word embeddings')
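A small NumPy sketch of the extraction step, assuming the common convention that the input-to-hidden weight matrix has one row per vocabulary word (the figure may lay the matrix out differently). The weights here are random stand-ins for trained weights, and `W_in` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["books", "i", "like", "reading"]
vocab_size, embedding_dim = len(vocab), 3

# Random stand-in for the trained input-to-hidden weights, shape (V, N)
W_in = rng.normal(size=(vocab_size, embedding_dim))

index = vocab.index("reading")
one_hot = np.zeros(vocab_size)
one_hot[index] = 1.0

# Multiplying a one-hot vector by W_in simply selects one row of W_in
hidden = one_hot @ W_in
embedding = W_in[index]

print(embedding)
```

Under this assumption, the embedding of a word is just the corresponding row of the weight matrix, so no forward pass is needed once training is finished.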

### GloVe word embeddings

GloVe word embeddings are built from `co-occurrence` matrices, and a `matrix factorization` technique is then used to compute the word embeddings. There are two types of co-occurrence - 

- first-order co-occurrences (bee and honey): the two words appear near each other
- second-order co-occurrences (bee and bumblebee): the two words appear in similar contexts

![alt text](WE12.png 'co-occurrence matrices')

To compute the blue values above, we can -
- keep a window size of our choice (this is a hyperparameter)
- compute the word co-occurrence counts $n_{uv}$
- or compute the pointwise mutual information (PMI)
$$PMI(u, v) = \log\frac{P(u, v)}{P(u) \cdot P(v)}$$
This measure helps to down-weight word pairs that co-occur by chance and do not actually have any correlation between them
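The steps above can be sketched in plain Python. The corpus below is a made-up toy example, the probabilities are estimated from ordered pair counts inside the window, and the function names are hypothetical. Pairs that never co-occur get a PMI of negative infinity in this sketch.

```python
import math
from collections import Counter

def cooccurrence_counts(sentences, window=1):
    """Count ordered pairs (u, v) where v is within `window` words of u."""
    pair_counts = Counter()
    for tokens in sentences:
        for i, u in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    pair_counts[(u, tokens[j])] += 1
    return pair_counts

def pmi(pair_counts, u, v):
    """PMI(u, v) = log(P(u, v) / (P(u) * P(v))), estimated from pair counts."""
    total = sum(pair_counts.values())
    n_uv = pair_counts[(u, v)]
    if n_uv == 0:
        return float("-inf")
    p_uv = n_uv / total
    p_u = sum(c for (a, _), c in pair_counts.items() if a == u) / total
    p_v = sum(c for (_, b), c in pair_counts.items() if b == v) / total
    return math.log(p_uv / (p_u * p_v))

# Made-up toy corpus
sentences = [
    ["bee", "makes", "honey"],
    ["bee", "likes", "honey"],
    ["dog", "likes", "bone"],
]
counts = cooccurrence_counts(sentences, window=2)
pmi_bee_honey = pmi(counts, "bee", "honey")
pmi_bee_likes = pmi(counts, "bee", "likes")
print(pmi_bee_honey, pmi_bee_likes)
```

In this toy corpus, `bee` and `honey` get a higher PMI than `bee` and `likes`, because `likes` also appears with unrelated words.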

![alt text](WE13.png 'matrix factorization')

<font color='red'>Q4. But how does GloVe factorize the PMI matrix?</font>

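Strictly speaking, the original GloVe model does not run an SVD: it learns the factorization by minimizing a weighted least-squares loss on the logarithm of the co-occurrence counts. A closely related count-based approach, illustrated here, applies `truncated SVD` to the PMI matrix.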

![alt text](WE14.png 'SVD')

The number of non-zero `singular values` is equal to the `rank` of the PMI matrix. If the rank is smaller than the size of the matrix, then `size - rank` of the singular values are zero. Even if the rank equals the size of the matrix, the last few singular values are usually very small and do not convey much meaningful information, so we can ignore these zero/small singular values. Therefore, we can use `truncated SVD` to remove the unnecessary/meaningless information from our matrix factorization. But how do we truncate these matrices?

![alt text](WE15.png 'Truncated SVD')
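A minimal NumPy sketch of the truncation: compute the full SVD, keep only the `k` largest singular values and the matching columns and rows, and use the scaled left singular vectors as word embeddings. The matrix below is made up, and scaling by the square root of the singular values is only one common choice, assumed here for illustration.

```python
import numpy as np

def truncated_svd_embeddings(matrix, k):
    """Keep the top-k singular triplets; return embeddings and the rank-k approximation."""
    U, singular_values, Vt = np.linalg.svd(matrix, full_matrices=False)
    U_k = U[:, :k]              # first k left singular vectors
    S_k = singular_values[:k]   # k largest singular values (sorted descending)
    Vt_k = Vt[:k, :]            # first k right singular vectors
    embeddings = U_k * np.sqrt(S_k)       # one k-dimensional row per word
    approximation = (U_k * S_k) @ Vt_k    # best rank-k approximation of matrix
    return embeddings, approximation

# Made-up symmetric matrix standing in for a PMI matrix of 4 words
M = np.array([
    [2.0, 1.5, 0.1, 0.0],
    [1.5, 2.0, 0.0, 0.1],
    [0.1, 0.0, 2.0, 1.4],
    [0.0, 0.1, 1.4, 2.0],
])

embeddings, approx = truncated_svd_embeddings(M, k=2)
print(embeddings.shape)
print(np.linalg.norm(M - approx))  # reconstruction error of the rank-2 approximation
```

Choosing `k` is a trade-off: a smaller `k` gives more compact embeddings but discards more of the information in the original matrix.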

### Analogies and biases of word vectors

![alt text](WE16.png 'word analogies')
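The analogy idea can be sketched as vector arithmetic followed by a nearest-neighbour search. The 2-dimensional vectors below are made up for illustration (loosely, one axis for gender and one for royalty) and do not come from a trained model; the helper `analogy` is a hypothetical name.

```python
import numpy as np

# Made-up toy vectors, not taken from any trained model
toy_vectors = {
    "king":  np.array([0.90, 0.95]),
    "queen": np.array([-0.90, 0.90]),
    "man":   np.array([1.00, 0.10]),
    "woman": np.array([-1.00, 0.05]),
    "apple": np.array([0.10, -0.80]),
}

def cosine(u, v):
    """Cosine similarity between two NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vector(b) - vector(a) + vector(c)."""
    query = vectors[b] - vectors[a] + vectors[c]
    # Exclude the input words, otherwise one of them is often the nearest vector
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

answer = analogy("man", "king", "woman", toy_vectors)
print(answer)
```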

Word vectors might have `biases` because of the inherent biases present in the training data itself. You can read more about word vector biases and how to debias them in this [paper](https://arxiv.org/pdf/1607.06520.pdf)

![alt text](WE17.png 'word vector biases')

In [1]:
from gensim.models.keyedvectors import KeyedVectors

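In [2]:
# Load the pretrained Google News word2vec vectors (300 dimensions) from the binary file
wv_embeddings = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

In [9]:
# Look up the embedding vector of a single word
wv_embeddings['anjali']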

array([-0.00772095,  0.09619141,  0.03125   ,  0.26953125, -0.07080078,
       -0.03686523,  0.09960938, -0.06591797, -0.18945312,  0.06396484,
       -0.05029297, -0.13085938,  0.02905273,  0.2109375 , -0.13574219,
        0.00878906, -0.02661133, -0.04248047, -0.01489258, -0.02026367,
        0.0559082 ,  0.04858398,  0.00350952, -0.06835938, -0.04418945,
       -0.15234375, -0.10253906, -0.07519531,  0.02307129, -0.07080078,
       -0.11035156,  0.10205078,  0.02062988,  0.01452637, -0.11132812,
       -0.16699219,  0.11425781, -0.01989746,  0.05908203, -0.09912109,
       -0.03015137, -0.05102539,  0.03710938,  0.06176758, -0.05932617,
       -0.25195312, -0.1640625 , -0.01324463,  0.06982422, -0.02929688,
        0.03295898,  0.12890625,  0.0112915 ,  0.11328125,  0.13183594,
        0.02514648, -0.06494141, -0.09130859, -0.02270508, -0.20800781,
       -0.05859375, -0.04541016, -0.08544922, -0.08544922,  0.09228516,
       -0.03613281, -0.17871094,  0.04467773,  0.07080078, -0.01

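In [None]:
# Analogy query: "man" is to "king" as "woman" is to ?
wv_embeddings.most_similar(positive=['woman', 'king'], negative=['man'])

In [None]:
# Find the word that does not belong with the others
wv_embeddings.doesnt_match(['breakfast', 'cereal', 'dinner', 'lunch'])

In [None]:
# From the candidate list, pick the word most similar to 'music'
wv_embeddings.most_similar_to_given('music', ['water', 'sound', 'backpack', 'mouse'])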