## ML Techniques 
* Naive Bayes: Assumes that each feature is conditionally independent of every other feature, given the class.
* SVM: Handles noise well, but is slow to train and scales poorly to large datasets.
* HMM:
* * Each hidden state depends on the previous state (the Markov assumption).
* * Utilizes the sequential nature of language.

![](helper/1.JPG)
* Conditional Random Fields (CRF): 
* * Performs classification task for each element in a sequence. 
* * Often outperforms HMMs on sequence tasks such as part-of-speech tagging.

## DL Techniques
* RNN 
* LSTM, GRU 
* CNN
* Transformers, self attention 
* Autoencoders 

## Why DL isn't yet a silver bullet
* Overfitting on small datasets
* Lack of domain generalization
* Doesn't yet grasp common-sense knowledge or logic
* Costly
* Less interpretable
* Fewer avenues for few-shot learning and data augmentation
* Difficult to deploy on edge devices
* Higher technical debt


We typically lowercase the text before stemming. We don’t remove tokens or lowercase the text before lemmatization, because we have to know the part of speech of a word to get its lemma, and that requires all tokens in the sentence to be intact. A good practice is to prepare a sequential list of pre-processing tasks once we have a clear understanding of how to process our data.

## NLP Pipeline 
DO READ IT AGAIN. Quite Comprehensive 

## Text Representation 
* How to convert text into numbers. 
* It isn't as straightforward as for images or speech, where the input is inherently represented as numbers.
* * Basic vectorization approaches
* * Distributed representations
* * Universal language representation
* * Handcrafted features

## Vectorization approaches
### One Hot Encoding 
* Build a vocabulary of all the words in the corpus, mapping each word to a unique ID.
* Every word is represented as a binary vector whose length is the vocabulary size, with vector[unique ID] = 1 and 0 everywhere else. We have one such vector for each word in a sentence.
* Problems:
* * The size of a one-hot vector is directly proportional to the size of the vocabulary, so the vectors are sparse.
* * The representation of a sentence varies with its length; we would prefer a fixed size.
* * Words are treated as atomic units, with no notion of similarity or dissimilarity between them.
* * Out-of-vocabulary (OOV) problem: we may need to retrain the model when a new word is added.
* * Seldom used in practice.

[code](./TextRepresentation/OneHot.ipynb)
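
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, independent of the linked notebook: the helper names `build_vocab` and `one_hot_encode` are hypothetical, tokenization is a simple whitespace split, and OOV words are silently skipped (one of several possible choices).

```python
def build_vocab(sentences):
    """Map each distinct lowercased word to a unique integer ID (first-seen order)."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def one_hot_encode(sentence, vocab):
    """Return one binary vector per in-vocabulary word of the sentence."""
    vectors = []
    for word in sentence.lower().split():
        if word in vocab:  # assumption: OOV words are dropped
            vector = [0] * len(vocab)
            vector[vocab[word]] = 1
            vectors.append(vector)
    return vectors


corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
vocab = build_vocab(corpus)
encoded = one_hot_encode("dog bites man", vocab)
```

Each sentence yields as many vectors as it has known words, which shows the variable-size problem directly; a sentence containing an unseen word such as "cat" loses that word entirely.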
### Bag of Words
* We can build a vocabulary and keep a count of the occurrences of each word in a document.
* We can also just record which words occurred, without counts (used in sentiment analysis).
* Advantages:
* * Easy to interpret and implement.
* * Documents sharing the same words will be closer in Euclidean space, so it captures a degree of semantic similarity between documents.
* * Fixed-length encoding.
* Disadvantages:
* * Size of the vector increases with the size of the vocabulary.
* * Doesn't capture similarity between words.
* * Out-of-vocabulary problem.
* * Word order is lost.
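
A rough sketch of both variants (counts and presence-only) in plain Python follows. The function name `bow_vector` and the alphabetical ordering of the vocabulary are assumptions made for the example; libraries such as scikit-learn offer production-ready equivalents.

```python
from collections import Counter

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Assumption: vocabulary IDs are assigned in alphabetical order.
words = sorted({word for doc in corpus for word in doc.lower().split()})
vocab = {word: idx for idx, word in enumerate(words)}


def bow_vector(document, vocab, binary=False):
    """Fixed-length vector of word counts (or 0/1 flags when binary=True)."""
    counts = Counter(w for w in document.lower().split() if w in vocab)
    vector = [0] * len(vocab)
    for word, count in counts.items():
        vector[vocab[word]] = 1 if binary else count
    return vector


v1 = bow_vector("dog bites man", vocab)
v2 = bow_vector("man bites dog", vocab)
v3 = bow_vector("dog and dog are friends", vocab)
v3_binary = bow_vector("dog and dog are friends", vocab, binary=True)
```

Here `v1` and `v2` come out identical even though the sentences mean different things, which is the "word order is lost" disadvantage in action.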

### Bag of N-grams
* Breaks the text into chunks of n contiguous words/tokens (n-grams).
* Adv/disadv:
* * Captures some context and word order, so it can capture semantic similarity to some degree.
* * As 'n' increases, the dimensionality (and sparsity) increases rapidly.
* * Has the out-of-vocabulary problem.
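
As an illustration, the sketch below builds a bag of bigrams for a toy corpus. The helper names (`ngrams`, `build_ngram_vocab`, `bag_of_ngrams`) are hypothetical and tokenization is a plain whitespace split.

```python
def ngrams(tokens, n):
    """Return every run of n contiguous tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_vocab(documents, n):
    """Assign an ID to each distinct n-gram, in first-seen order."""
    vocab = {}
    for doc in documents:
        for gram in ngrams(doc.lower().split(), n):
            vocab.setdefault(gram, len(vocab))
    return vocab


def bag_of_ngrams(document, vocab, n):
    """Count vector over the n-gram vocabulary; unseen n-grams are ignored."""
    vector = [0] * len(vocab)
    for gram in ngrams(document.lower().split(), n):
        if gram in vocab:
            vector[vocab[gram]] += 1
    return vector


corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
bigram_vocab = build_ngram_vocab(corpus, 2)
v1 = bag_of_ngrams("dog bites man", bigram_vocab, 2)
v2 = bag_of_ngrams("man bites dog", bigram_vocab, 2)
```

Unlike plain bag of words, `v1` and `v2` now differ, because "dog bites" and "bites dog" are separate features. The cost is visible even here: six distinct words become eight distinct bigrams.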

### TF-IDF (term frequency-inverse document frequency)
* Quantifies the importance of a given word relative to other words in the document and in the corpus.
* Used for information retrieval.
* If word w occurs many times in one particular document but not often in the rest of the documents in the corpus, that word is important for that document.
* Term frequency:
* * How often a term occurs in a doc 
* * Larger doc may have higher count, hence we normalize term count by length of the document. 
$$ TF = \frac{\text{number of occurrences of term } t \text{ in doc } d}{\text{total number of terms in doc } d} $$

* Inverse document frequency: 
* * TF gives equal importance to all terms.
* * IDF weighs down very common terms (stop words) and weighs up rare ones.
* * $$ IDF = \log_e \frac{\text{total number of documents in corpus}}{\text{number of docs with term } t \text{ in them}} $$


$$ \text{TF-IDF score} = TF \times IDF $$

* For corpus:
* S1 = 'dog bites man'
* S2 = 'man bites dog'
* S3 = 'dog eats meat'
* S4 = 'man eats food'

<img src="helper/2.JPG" alt="Drawing" style="width: 600px;"/> 
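
For reference, the TF and IDF formulas above can be applied to this four-sentence corpus with a short script. This is a sketch that follows the formulas exactly as written in these notes (natural log, no smoothing); the function names are hypothetical, and tables or libraries that use a different log base or add smoothing will give different absolute values.

```python
import math

corpus = {
    "S1": "dog bites man",
    "S2": "man bites dog",
    "S3": "dog eats meat",
    "S4": "man eats food",
}
tokenized = {name: text.split() for name, text in corpus.items()}
all_docs = list(tokenized.values())


def term_frequency(term, tokens):
    """Occurrences of the term divided by the document length."""
    return tokens.count(term) / len(tokens)


def inverse_document_frequency(term, docs):
    """log_e(total docs / docs containing the term); the term must occur somewhere."""
    docs_with_term = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / docs_with_term)


# TF-IDF score of every term in every document.
scores = {
    name: {
        term: term_frequency(term, tokens) * inverse_document_frequency(term, all_docs)
        for term in set(tokens)
    }
    for name, tokens in tokenized.items()
}
```

In S3, "meat" appears in only one document while "dog" appears in three, so `scores["S3"]["meat"]` is larger than `scores["S3"]["dog"]`: the rarer term is weighted up.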


## Distributed representations
* Distributional Similarity: The meaning of a word can be understood from the contexts in which it appears.
* Distributional Hypothesis: Words that occur in similar contexts have similar meanings.
* Distributional Representation: High-dimensional representation based on the occurrence of words in their contexts, e.g. TF-IDF, one-hot.
* Distributed Representation: Compresses the distributional representations into compact, dense vectors.

### Word Embeddings
##### Pretrained 
* Smaller, denser vectors. 
* Derives meaning of the word from the context.
* Projects words to a vector space where similar words cluster together.

* Pretrained word embeddings:
* * Word2Vec, GloVe, fastText
* * Better to start with these, and train a custom embedding only if they don't work.

#### Disadv: 
* Can't distinguish homonyms (words with the same spelling but different meanings): dog bark vs tree bark
* Memory intensive
* Unintentional bias
* Corpus dependent

#### Train Embedding 
#### CBOW
* Continuous Bag of Words: predict the centre word given its context. With a context size of k, we have 2k context words and make one prediction: the middle word.
* Salient points in the diagrams below:
* * Input is a one-hot vector of dimension V (the vocabulary size) for each context word.
* * Weights are V*N (N is the length of our embedding, i.e. the hidden layer size).
* * Output: one word, represented as a one-hot vector.

<img src="helper/3.JPG" alt="Drawing" style="width: 600px;"/> 
<img src="helper/4.JPG" alt="Drawing" style="width: 600px;"/> 
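
To make the CBOW training setup concrete, the sketch below generates (context words, centre word) training examples from a token list. The function name `cbow_pairs` is hypothetical, and the handling of sentence edges (using fewer than 2k context words rather than padding) is an assumption.

```python
def cbow_pairs(tokens, k):
    """Return (context_words, centre_word) examples for a context size of k."""
    pairs = []
    for i, centre in enumerate(tokens):
        # Up to k words on each side; fewer near the edges of the text.
        context = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
        if context:
            pairs.append((context, centre))
    return pairs


tokens = "the quick brown fox jumps".split()
examples = cbow_pairs(tokens, k=2)
```

For the middle word "brown", the context is the full 2k = 4 surrounding words; for "the" at the start, only the two words to its right are available.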

#### Skipgram
* Predict the context words from the centre word.
* We take a window of 2k+1 words and, from the centre word, predict the 2k surrounding words. So each window gives 2k (centre, context) word pairs.
* We shift the window across the whole corpus to generate training data.
* Salient points in the diagrams below:
* * We start with the one-hot vector of the middle word.
* * A V*N weight matrix converts it to the embedding.
* * A separate N*V weight matrix maps the embedding to scores over the vocabulary, from which each context word is predicted.

<img src="helper/5.JPG" alt="Drawing" style="width: 600px;"/> 
<img src="helper/6.JPG" alt="Drawing" style="width: 600px;"/> 
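
The sliding-window pair generation for skip-gram can be sketched as follows. The function name `skipgram_pairs` is hypothetical; windows are simply truncated at the edges of the text, so edge words yield fewer than 2k pairs.

```python
def skipgram_pairs(tokens, k):
    """Return (centre_word, context_word) pairs using a window of 2k+1 words."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - k)
        stop = min(len(tokens), i + k + 1)
        for j in range(start, stop):
            if j != i:  # skip the centre word itself
                pairs.append((centre, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, k=2)
brown_pairs = [pair for pair in pairs if pair[0] == "brown"]
```

A word with a full window, such as "brown" here, contributes exactly 2k = 4 pairs, matching the description above.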

Some important hyperparameters: 
* Dimensionality of word vectors. (50-500)
* Context window


#### Going beyond words
We can find the embeddings of the constituent words and take their sum, average, etc. Though we lose the ordering information, it works surprisingly well.
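
A minimal sketch of the averaging approach is shown below. The three-dimensional vectors are made-up toy values, not taken from any real model, and the function name `average_embedding` is hypothetical.

```python
def average_embedding(tokens, embeddings):
    """Element-wise mean of the vectors of all known tokens (None if none known)."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]


# Toy embeddings: illustrative values only.
embeddings = {
    "dog": [1.0, 0.0, 2.0],
    "bites": [0.0, 1.0, 1.0],
    "man": [2.0, 2.0, 0.0],
}

e1 = average_embedding("dog bites man".split(), embeddings)
e2 = average_embedding("man bites dog".split(), embeddings)
```

Both sentences map to the same vector, which is the loss of ordering information mentioned above.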

Both self-trained and pretrained embeddings depend heavily on the vocabulary. Vocabulary overlap is a great metric to gauge the performance of an NLP model.
* Out-of-vocabulary (OOV) words can be dealt with by simply removing them during pre-processing.
* Or we assign random embeddings to OOV words; this boosts performance by 1-2% compared with just excluding them.
* Use subword properties. fastText has embeddings for words and character n-grams together; a word's embedding is an aggregation of its character n-grams.
* Distributed Representation:
* * Even with fastText n-grams, 'dog bites man' and 'man bites dog' will have the same embedding when word vectors are simply aggregated, which isn't right.
* * Doc2Vec takes arbitrary-length text and, in addition to word vectors, also learns a 'paragraph vector' (learns words with context).
* * While training on a large corpus, paragraph vectors are unique to the text (of arbitrary length) under consideration, while word vectors are shared across all texts.
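
Two of the OOV strategies listed above can be sketched in plain Python: assigning a random vector to an unseen word, and splitting a word into character n-grams in the style of fastText (boundary markers `<` and `>`, n from 3 to 6). The function names and the random range of ±0.25 are assumptions for illustration; this is not fastText's actual implementation.

```python
import random


def embed_with_random_oov(word, embeddings, dim, rng):
    """Look up a word; on a miss, create and cache a random vector for it."""
    if word not in embeddings:
        # Assumption: small uniform values; the vector is stored so the same
        # OOV word always maps to the same embedding afterwards.
        embeddings[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return embeddings[word]


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


rng = random.Random(0)
embeddings = {"dog": [0.1, 0.2, 0.3]}
first = embed_with_random_oov("doge", embeddings, dim=3, rng=rng)
second = embed_with_random_oov("doge", embeddings, dim=3, rng=rng)

shared = set(char_ngrams("bites")) & set(char_ngrams("biting"))
```

Because "bites" and "biting" share n-grams such as `<bi` and `bit`, a subword model can build a reasonable vector for one of them even if only the other was seen during training.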

#### Universal Text representation
* Contextual word representations: in the most primitive form, a language model predicts the next word from the previous n-gram of words.
* Transformers can be used as pretrained models to get text representations.

### Things to remember
* Any text representation has an inherent bias based on the text it was trained on. E.g. an embedding trained on tech news will place apple closer to microsoft than to pears or oranges.
* Embeddings are bulky. Word2Vec is about 4.5GB, which can be a bottleneck in deployment. Hack: use an in-memory database like Redis with a cache.
* We need more than embeddings for some tasks, e.g. sarcasm detection.


## Visualization

### t-SNE: t-distributed stochastic neighbor embedding
