# Natural Language Processing

## Natural language understanding and generation

### What is NLP? 
- Field of computer science, artificial intelligence, and linguistics
- Concerned with giving computers the ability to support and manipulate data in human language
- Up to the 1980s, most NLP systems were based on complex sets of hand-written rules
- 1990s: Machine learning algorithms started to be used successfully in NLP tasks
- 2000s: Research focused on learning from data that has not been hand-annotated, increasingly using unsupervised and semi-supervised learning techniques
- 2003: The word n-gram model was outperformed by a [multi-layer perceptron](https://dl.acm.org/doi/10.5555/944919.944966) from Yoshua Bengio and co-authors.
- 2010s: Deep learning revolution in NLP, with neural network advances applied to language modeling, word embeddings, and many other areas. 
- [List of NLP tasks](https://en.wikipedia.org/wiki/Natural_language_processing#Common_NLP_tasks)
- two components: 
    - Natural language understanding (NLU): ability to understand human language
    - Natural language generation (NLG): ability to generate human language

#### NL Understanding
There are different levels of understanding:
- Morphological / Lexical level: breaking down words into their smallest units of meaning (morphemes)
    - eg: `unhappiness` -> `un-` (prefix), `happi-` (root), `-ness` (suffix)
- Syntactic level: understanding the grammatical structure of sentences, relationships between words, and the order in which they appear
    - eg: `The cat chased the mouse` -> subject-verb-object structure
- Semantic level: understanding the meaning of words, phrases, and sentences, and how they combine to convey meaning
    - eg: `iron horse` -> locomotive, not a literal horse made of iron
- Discourse level: understanding larger units of language, such as conversations, texts, or speeches
    - eg: analyzing political speech to understand how language is employed to persuade and convey specific messages
- Pragmatic level: understanding how language is used in context to achieve communicative goals, considering factors such as speaker intentions, implied meanings, and the social context
    - eg: `Can you pass the salt?` -> polite request rather than a literal question about someone's ability to pass salt

#### NL Generation
NLG will decide what to say and how to say it. It will generate a text that is grammatically correct and coherent. Two steps:
- Deep planning: decide what to say
- Syntactic generation: decide how to say it

## Language Modeling

Goal: assign a probability to a sequence of words. $$ P(W) = P(w_1, w_2, ..., w_n) $$
Related task is to predict the next word in a sequence: $$ P(w_n | w_1, w_2, ..., w_{n-1}) $$
A model that computes either of these probabilities is called a language model.

This can be used for:
1. Machine translation
    - $P(\text{the cat is small}) > P(\text{small is the cat})$
2. Spell correction
    - $P(\text{about 15 minutes}) > P(\text{about 15 minuetes})$
3. Speech recognition
    - $P(\text{I saw a van}) > P(\text{eyes awe of an})$
4. Summarization, question answering, sentiment analysis, etc.

The chain rule can be applied to decompose the joint probability of a sequence of words into a product of conditional probabilities:
$$ P(\text{the dog barked}) = P(\text{the}) \times P(\text{dog} | \text{the}) \times P(\text{barked} | \text{the dog}) $$

How do we estimate these probabilities? 
- just count and divide?
    - $P(\text{the | its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}$
    - no, because there are too many possible sentences, and we don't have enough data to estimate all of these probabilities
- Markov assumption? - a simplifying assumption that makes the probabilities estimable: condition only on the last few words
    - $P(\text{the | its water is so transparent that}) \approx P(\text{the | that})$
    - not ideal, because the previous word alone is not always a good predictor of the next word
    - so maybe use the previous two words? $P(\text{the | its water is so transparent that}) \approx P(\text{the | transparent that})$
        - better
    - For computing $P(w_1 w_2 \dots w_n) = \prod_{i=1}^n P(w_i | w_1 w_2 \dots w_{i-1}) $, the factors are simplified using:
        - $ P(w_i | w_1 w_2 \dots w_{i-1}) = P(w_i | w_{i-k} \dots w_{i-1}) $
        - $k$ is the order of the Markov model

### [N-gram Language Models](https://youtu.be/hM49MPmakNI?si=aPfWb8x5UJXmL-fC)

- Unigram Model $$ P(w_1 w_2 \dots w_n) = \prod_{i=1}^n P(w_i) $$
    - words are independent of each other
- Bigram Model $$ P(w_1 w_2 \dots w_n) = \prod_{i=1}^n P(w_i | w_{i-1}) $$
    - each word depends only on the previous word; generated text looks a little more like English, but is still not very good
- in general, this can be extended to tri-gram, 4-gram, etc.
- however this is an insufficient model, because it is not able to capture long-distance dependencies
- for example, in the sentence "I grew up in France ... I speak fluent French", the word "French" is dependent on the word "France", but they are far apart
- "The computer which I had just put into the machine room on the fifth floor crashed." - The word "crashed" is dependent on the word "computer", but they are far apart
- in practice, we can get away with this for simple applications

### [Estimating bigram probabilities](https://youtu.be/UyC0bBiZY-A?si=RKWYZW4k2qNzRTSl)
- The maximum likelihood estimate of the bigram probability is:
$$ P(w_i | w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})} $$
- $P(\text{I want english food}) = P(\text{I | <s>}) \times P(\text{want | I}) \times P(\text{english | want}) \times P(\text{food | english}) \times P(\text{</s> | food})$
- what kind of knowledge is captured by this model?
    - fact about the world: "want" is more likely to be followed by "chinese" than "english"
    - fact about language / grammar: "want" is more likely to be followed by a noun than a verb
- in practice, we do everything in log space:
    - to avoid underflow, and also adding is faster than multiplying
    - $p1 \times p2 \times p3 = \exp(\log(p1) + \log(p2) + \log(p3))$
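
The bigram estimate and the log-space trick can be sketched in a few lines. This is a minimal illustration, not a full language model: the toy corpus and the helper names (`train_bigram`, `bigram_prob`, `sentence_logprob`) are assumptions made up for the example.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration); <s> and </s> mark sentence boundaries.
corpus = [
    ["<s>", "i", "want", "english", "food", "</s>"],
    ["<s>", "i", "want", "chinese", "food", "</s>"],
    ["<s>", "i", "want", "chinese", "food", "</s>"],
]

def train_bigram(sentences):
    """Count contexts C(w_{i-1}) and bigrams C(w_{i-1} w_i)."""
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        for prev, word in zip(sent, sent[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, word, context_counts, bigram_counts):
    """MLE estimate C(prev word) / C(prev); 0.0 if the context was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_logprob(sent, context_counts, bigram_counts):
    """Sum of log bigram probabilities; -inf if any bigram is unseen."""
    total = 0.0
    for prev, word in zip(sent, sent[1:]):
        p = bigram_prob(prev, word, context_counts, bigram_counts)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

context_counts, bigram_counts = train_bigram(corpus)
p_chinese = bigram_prob("want", "chinese", context_counts, bigram_counts)  # 2/3
p_english = bigram_prob("want", "english", context_counts, bigram_counts)  # 1/3
logp = sentence_logprob(corpus[0], context_counts, bigram_counts)          # log(1/3)
```

Any unseen bigram drives the whole sentence probability to zero (here `-inf` in log space), which is the problem smoothing addresses.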
    
### [Evaluation and Perplexity](https://youtu.be/B_2bntDYano?list=PLaZQkZp6WhWwJllbfwOD9cbIHXmdkOICY)
- evaluate if our LM prefers good sentences to bad sentences
    - assign higher probability to real / more frequent sentences, than ungrammatical / less frequent sentences
- we train the model on a training set, and evaluate it on a test set
    - the evaluation metric tells us how well the model generalizes to unseen data
- best evaluation metric is the one that correlates best with human judgement
    - put each model in a task (like speech recognition, spelling correction, machine translation, etc.), and see which one performs best
    - run the task and get an accuracy score for A and B
    - if A is better than B, then A is a better model
    - extrinsic evaluation, but it's time consuming, can take days or weeks
- intrinsic evaluation: evaluate the model directly on the task of language modeling
    - perplexity: how surprised is the model to see the test set?
    - perplexity is the inverse probability of the test set, normalized by the number of words
    - it is a bad approximation to extrinsic evaluation, unless the test set is very similar to the training set
    - it is generally only useful in pilot experiments, to quickly compare different models

#### Perplexity
Intuition:
- Shannon game : how well can we predict the next word?
    - I always order coffee with ___
    - A better model will assign higher probability to the word that actually comes next
    - Gives the highest $P(\text{sentence})$ to the test set
- Perplexity is the inverse probability of the test set, normalized by the number of words
    $$PP(W) = P(w_1, w_2, ..., w_n)^{-\frac{1}{n}} = \sqrt[n]{\frac{1}{P(w_1, w_2, ..., w_n)}} = \sqrt[n]{\prod_{i=1}^n \frac{1}{P(w_i | w_1, w_2, ..., w_{i-1})}}$$
    For bigrams, $$PP(W) = \sqrt[n]{\prod_{i=1}^n \frac{1}{P(w_i | w_{i-1})}}$$
- Minimizing perplexity is equivalent to maximizing the probability of the test set
- Perplexity is the average branching factor
    - eg: phone system gets 120k calls, with "operator", "sales", and "support" in a fourth of the calls each, and 30k different names in the remaining calls
        - this means each of the 30k names has a probability of $\frac{1}{120k}$, as they appear once in 120k calls, while the other three words have a probability of $\frac{1}{4}$, and they appear in 30k calls each
        - $PP = \sqrt[120k]{\frac{1}{{\frac{1}{4}}^{30k} \times {\frac{1}{4}}^{30k} \times {\frac{1}{4}}^{30k} \times {\frac{1}{120k}}^{30k}}} = \sqrt[120k]{\frac{1}{\frac{1}{4^{90k}} \times \frac{1}{120k^{30k}}}} = \sqrt[120k]{4^{90k} \times 120k^{30k}} = \sqrt[4]{4^{3} \times 120000} = 52.64$
    - eg: sentence consisting of random digits, perplexity of a model that assigns equal probability to all digits is 10
        - $PP(W) = P(w_1, w_2, ..., w_n)^{-\frac{1}{n}} = \left(\frac{1}{10^n}\right)^{-\frac{1}{n}} = 10$
- Trained on 38M words and tested on 1.5M words of WSJ text, the perplexities were: unigram (962), bigram (170), trigram (109)
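
The two worked examples can be checked numerically. The sketch below assumes we already have the model probability of each test token; the function name `perplexity` and the scaled-down phone example are illustrative choices.

```python
import math

def perplexity(token_probs):
    """Perplexity from per-token model probabilities.

    Computed in log space as exp(-(1/n) * sum(log p_i)) to avoid underflow.
    """
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# Random digits: every digit has probability 1/10, so the perplexity is 10.
digits_pp = perplexity([0.1] * 50)

# Phone example scaled down by 30,000: 3 tokens with p = 1/4 for every
# 1 token with p = 1/120000 (same proportions as 90k vs. 30k calls).
phone_pp = perplexity([1 / 4] * 3 + [1 / 120000])  # about 52.64
```

Scaling the phone example down is safe because perplexity is a geometric mean and depends only on the proportions of each probability.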

### Smoothing
- Problem: if we have a bigram that we haven't seen in the training set, then the probability will be 0

#### MLE estimator 
- The MLE of some parameter of a model $M$ given data $D$ is the value of the parameter that maximizes the likelihood $P(D | M)$
- Suppose word $w$ occurs $C(w)$ times in the training set, and the training set has $N$ words. Then the answer to the question "what's the probability that some word from the training set is $w$?" is $\frac{C(w)}{N}$. However, this may be a bad estimate for some other corpus
- Any kind of smoothing is a non-MLE estimator, because we are changing the probability distribution from the one that maximizes the likelihood of the training set in a hope to generalize better to other data

#### Add-one smoothing
- Add one to the count of every bigram (also called Laplace smoothing)
- If MLE estimate is $P_{MLE} = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}$, then add-one estimate is $P_{add-1} = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}$, where $V$ is the vocabulary size
- If you reconstitute counts : $C^*(w_{i-1} w_i) = \frac{[C(w_{i-1} w_i) + 1] \times C(w_{i-1})}{C(w_{i-1}) + V}$, there is a huge change in the counts 
    - so add-1 isn't used in practice for bigrams
    - but we do use it in practice where the number of zeros is not very large
    - it is used in other domains, like text classification
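
A small sketch of add-one smoothing and the reconstituted counts, assuming a toy token stream; the names `p_add_one` and `reconstituted_count` are made up for the example.

```python
from collections import Counter

# Toy training data (an assumption for illustration).
tokens = "<s> i want chinese food </s> <s> i want english food </s>".split()
vocab = set(tokens)
V = len(vocab)

# Contexts are every token that has a successor in the stream.
context_counts = Counter(tokens[:-1])
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_add_one(prev, word):
    """Add-one estimate: (C(prev word) + 1) / (C(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + V)

def reconstituted_count(prev, word):
    """C*(prev word) = (C(prev word) + 1) * C(prev) / (C(prev) + V)."""
    c_prev = context_counts[prev]
    return (bigram_counts[(prev, word)] + 1) * c_prev / (c_prev + V)

p_seen = p_add_one("want", "chinese")              # (1 + 1) / (2 + 7)
p_unseen = p_add_one("want", "food")               # (0 + 1) / (2 + 7), no longer zero
c_star = reconstituted_count("want", "chinese")    # 4/9, down from an observed count of 1
```

Even in this tiny example the observed count of 1 shrinks to 4/9, which illustrates how much probability mass add-one moves to unseen bigrams.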

#### Backoff and interpolation
- sometimes it helps to use less context
    - condition on less context, for context where we haven't learned much
- backoff : use trigram if you have good evidence, otherwise use bigram, otherwise use unigram, etc
- interpolation : mix unigram, bigram, trigram, etc
    - works better than backoff in practice
    - linear interpolation: 
        - simple interpolation: $P_{interp}(w_i | w_{i-2}w_{i-1}) = \lambda_1 P(w_i | w_{i-2}w_{i-1}) + \lambda_2 P(w_i | w_{i-1}) + \lambda_3 P(w_i)$, with $\sum_j \lambda_j = 1$
        - lambdas conditional on context: $P_{interp}(w_i | w_{i-2}w_{i-1}) = \lambda_1(w_{i-2}^{i-1}) P(w_i | w_{i-2}w_{i-1}) + \lambda_2(w_{i-2}^{i-1}) P(w_i | w_{i-1}) + \lambda_3(w_{i-2}^{i-1}) P(w_i)$
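
Simple interpolation can be sketched as below. The toy token stream and the lambda values `(0.6, 0.3, 0.1)` are assumptions for illustration; in practice the lambdas are tuned on held-out data.

```python
from collections import Counter

# Toy token stream (an assumption for illustration).
tokens = "<s> <s> the cat sat on the mat </s>".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

def ratio(numerator, denominator):
    """MLE ratio that returns 0.0 when the context was never seen."""
    return numerator / denominator if denominator else 0.0

def p_interp(word, prev2, prev1, lambdas=(0.6, 0.3, 0.1)):
    """Simple interpolation for the word sequence `prev2 prev1 word`."""
    l1, l2, l3 = lambdas  # assumed values; they must sum to 1
    p_tri = ratio(trigrams[(prev2, prev1, word)], bigrams[(prev2, prev1)])
    p_bi = ratio(bigrams[(prev1, word)], unigrams[prev1])
    p_uni = unigrams[word] / N
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

seen = p_interp("mat", "on", "the")    # trigram "on the mat" was observed
unseen = p_interp("cat", "on", "the")  # trigram unseen; bigram and unigram still contribute
```

Because the lambdas sum to 1 and each component is a probability distribution for this context, the mixture is also a distribution.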

##### Huge web-scale n-grams
- How to deal with, e.g., Google N-gram corpus
- Pruning
    - Entropy-based pruning
    - Only store N-grams with count > threshold.
- Efficiency
    - Efficient data structures like tries
    - Bloom filters: approximate language models
    - Store words as indexes, not strings
        - Use Huffman coding to fit large numbers of words into two bytes

##### Stupid backoff
$$ S(w_i | w_{i-k+1}^{i-1}) = \begin{cases} \frac{C(w_{i-k+1}^{i-1} w_i)}{C(w_{i-k+1}^{i-1})} & \text{if } C(w_{i-k+1}^{i-1} w_i) > 0 \\ \alpha S(w_i | w_{i-k+2}^{i-1}) & \text{otherwise} \end{cases} $$

where $\alpha = 0.4$ and $S(w_i) = \frac{C(w_i)}{N}$
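
A recursive sketch of the scoring rule, assuming n-gram counts are kept in a single `Counter` keyed by word tuples. The helper names and the toy sentence are illustrative. Note that $S$ is a score, not a normalized probability.

```python
from collections import Counter

ALPHA = 0.4

def ngram_counts(tokens, max_order):
    """Count all n-grams of order 1..max_order, keyed by tuples of words."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def stupid_backoff(word, context, counts, total_tokens):
    """Score S(word | context), where context is a tuple of preceding words."""
    if not context:
        return counts[(word,)] / total_tokens   # base case: unigram relative frequency
    ngram = context + (word,)
    if counts[ngram] > 0:
        return counts[ngram] / counts[context]  # seen: plain relative frequency
    # Unseen: drop the earliest context word and discount by alpha.
    return ALPHA * stupid_backoff(word, context[1:], counts, total_tokens)

tokens = "the cat sat on the mat".split()
counts = ngram_counts(tokens, max_order=3)
seen = stupid_backoff("sat", ("the", "cat"), counts, len(tokens))    # 1.0
backed = stupid_backoff("mat", ("cat", "the"), counts, len(tokens))  # 0.4 * 1/2
unseen = stupid_backoff("cat", ("sat", "on"), counts, len(tokens))   # 0.4 * 0.4 * 1/6
```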

## Vector Semantics and Embeddings

#### Language concepts
- Lemmas: base form of a word
    - eg: "walk", "walked", "walking", "walks" are all wordforms of the lemma "walk"
- senses: meaning of a word
    - eg: "bank" has two senses: financial institution, and the land alongside a river
- relations : synonymy (never perfect, like $water \neq H_2O$, $big \neq large$), antonymy (opposite meaning, eg: "hot" and "cold"), similarity (eg: "tea" and "coffee"), relatedness (eg: "tea" and "cup"), connotation (eg: "cheap" and "inexpensive")
- semantic field : words that cover a particular semantic domain (eg: "tea", "coffee", "cup", "mug", "saucer", "spoon" are all in the semantic field of "drinking tea")
- semantic frame : perspectives on a situation (eg: verbs like buy / sell / pay all are different perspectives on the same situation)
- distributional hypothesis : words that occur in similar contexts tend to have similar meanings
    - Instead of representing words based on explicit linguistic features or definitions, the focus shifted to capturing the distributional patterns of words in large corpora of text
    - this is the basis of modern word embeddings

### Word Embeddings

Two types : 
- frequency-based (long (|V| = 20000 to 50000) sparse vectors, words represented by counts of neighboring words)
    - eg: count vectors, TF-IDF vectors, Co-occurrence vectors
- prediction-based (short (50 to 100 length) dense vectors, representation created by a neural network classifier that predicts neighboring words)
    - eg: word2vec, GloVe

#### Term-document matrix
- rows are words, columns are documents
- each cell is the number of times a word occurs in a document
- this is a sparse matrix (bag-of-words representation)
- two documents are similar if their columns are similar, and two words are similar if their rows are similar
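
As a minimal sketch, a term-document matrix can be built with nested lists. The two toy documents are assumptions for illustration.

```python
from collections import Counter

# Toy documents (assumptions for illustration).
docs = {
    "d1": "battle soldier battle fool".split(),
    "d2": "fool wit fool good".split(),
}
vocab = sorted({w for words in docs.values() for w in words})
doc_ids = sorted(docs)
doc_counts = {d: Counter(words) for d, words in docs.items()}

# matrix[i][j] = number of times vocab[i] occurs in doc_ids[j]
matrix = [[doc_counts[d][w] for d in doc_ids] for w in vocab]

# A word is represented by its row, a document by its column.
word_vector = dict(zip(vocab, matrix))
doc_vector = {d: [row[j] for row in matrix] for j, d in enumerate(doc_ids)}
```

With a realistic vocabulary most cells are zero, so real implementations usually store the matrix in a sparse format.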

<img alt="picture 0" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/2b699d35d8970eef1f4709663fc7b4031d30eea724bf304fcd3b3645ad29d4a2.png" width="500" style="display: block; margin-left: auto; margin-right: auto;" />

#### Term-context matrix
- rows are words, columns are context words
<img alt="picture 1" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/d496eb30f9c3d6f6da1ee065a8e5cd4cc1d0723d2aa2579daaac079e6df6a53f.png" width="500" style="display: block; margin-left: auto; margin-right: auto;" />

#### TF-IDF
TF-IDF is a weighting scheme that assigns each term in a document a weight based on its term frequency (TF) and inverse document frequency (IDF). The TF-IDF weight is defined as $\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$, where $t$ is a term, $d$ is a document, and $D$ is a corpus of documents. 

The term frequency $\text{tf}(t, d)$ can be defined as either $\text{tf}(t, d) = f_{t, d}$, where $f_{t, d}$ is the raw count of term $t$ in document $d$, or logarithmically scaled, so that very high raw counts (common in longer documents) do not dominate, as $\text{tf}(t, d) = \log(1 + f_{t, d})$. 

The inverse document frequency $\text{idf}(t, D)$ is a measure of how much information the word provides, i.e., if it's common or rare across all documents. It is defined as $$\text{idf}(t, D) = \log_{10} \frac{\text{Number of documents in D}}{\text{Number of documents containing t}}$$.
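
The definitions above translate directly into code. This sketch uses the log-scaled term frequency and assumes base 10 for both logarithms; the toy documents and function names are illustrative.

```python
import math
from collections import Counter

# Toy corpus of tokenized documents (an assumption for illustration).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

def tf(term, doc):
    """Log-scaled term frequency: log10(1 + raw count)."""
    return math.log10(1 + Counter(doc)[term])

def idf(term, docs):
    """log10(N / df). Only defined for terms that occur in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

w_cat = tf_idf("cat", docs[0], docs)  # rare term, higher weight
w_the = tf_idf("the", docs[0], docs)  # frequent across documents, lower weight
w_dog = tf_idf("dog", docs[0], docs)  # absent from this document, weight 0
```

Although "the" occurs twice in the first document and "cat" only once, "cat" gets the larger weight because it appears in fewer documents.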

#### Cosine similarity
- measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them
- $$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}}$$
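
A direct translation of the formula, as a small sketch (the function name is an illustrative choice):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # close to 1: length is ignored
orthogonal = cosine_similarity([1, 0], [0, 1])            # 0: no shared dimensions
```

For count or TF-IDF vectors all entries are non-negative, so the similarity falls between 0 and 1.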

<img alt="picture 2" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/c9fc96966a3577ee15eb66e8f3549b3a5b889110a2a2887bcf090a6b9bf424f5.png" width="500" style="display: block; margin-left: auto; margin-right: auto;" />

### Word2Vec
- prediction-based algorithm for generating word embeddings: dense vectors in a continuous vector space that capture semantic and syntactic word relationships
- focuses on learning distributed representations based on the context in which words appear
- two models: 
    1. Continuous Bag of Words (CBOW) - predicts the current word given the context
    2. Skip-gram - predicts the context given the current word
        - made efficient with hierarchical softmax or negative sampling, which avoid computing a full softmax over the vocabulary

#### Word2Vec: CBOW 


#### Word2Vec: Skip-gram with negative sampling
The intuition of skip-gram is:
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings.
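
The four steps can be sketched in plain Python. This is a simplified illustration, not the reference word2vec implementation: negatives are sampled uniformly (word2vec uses a smoothed unigram distribution), and the hyperparameters, toy sentence, and function names are all assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def training_examples(tokens, window, k, rng):
    """Yield (target, context, label): 1 for real neighbors, 0 for sampled noise."""
    vocab = sorted(set(tokens))
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield target, tokens[j], 1          # step 1: positive example
            for _ in range(k):
                # Step 2: uniform negative sampling (a simplification).
                yield target, rng.choice(vocab), 0

def train(tokens, dim=8, window=2, k=2, lr=0.05, epochs=20, seed=0):
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    W = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # target vectors
    C = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}  # context vectors
    for _ in range(epochs):
        for target, context, label in training_examples(tokens, window, k, rng):
            w, c = W[target], C[context]
            # Step 3: logistic regression on the score w . c;
            # g is the gradient of the log loss with respect to that score.
            g = sigmoid(dot(w, c)) - label
            for d in range(dim):
                w_d = w[d]
                w[d] -= lr * g * c[d]
                c[d] -= lr * g * w_d
    return W, C  # step 4: W (or W and C combined) serves as the embeddings

tokens = "the cat sat on the mat".split()
W, C = train(tokens)
```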


        



## Neural Networks
- networks made of layers of neurons, each neuron is a function that takes in a vector of inputs and produces an output
- neuron computation:
    - $z = \mathbf{w} \cdot \mathbf{x} + b$
    - $a = \sigma(z) = \frac{1}{1 + e^{-z}}$
    - $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the input vector, $b$ is the bias, $z$ is the weighted sum of inputs, $a$ is the activation, $\sigma$ is the activation function
    - $\sigma$ is usually a non-linear function, like sigmoid, tanh, ReLU, etc.
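
A single sigmoid unit, as a minimal sketch; the input, weight, and bias values are assumptions chosen for illustration.

```python
import math

def neuron(x, w, b):
    """Weighted sum z = w . x + b followed by a sigmoid activation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values (assumptions): z = 0.1 + 0.18 + 0.09 + 0.5 = 0.87
a = neuron(x=[0.5, 0.6, 0.1], w=[0.2, 0.3, 0.9], b=0.5)  # about 0.70
```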

<img alt="picture 3" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/8af6e21a4c29b591e9d46db17670e392849a8dabaf9ba1252632d02d883556be.png" style="display: block; margin-left: auto; margin-right: auto; width: 200px"/>

#### XOR problem
- XOR is not linearly separable, so a single perceptron cannot compute it
    - perceptron has equation : $$ y = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases} $$

<img alt="picture 4" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/931adf0dd47129de66f28ac80b8cf3bf884be7da08e8465e872746e06a3203c9.png" width="700" style="display: block; margin-left: auto; margin-right: auto"/>

<img alt="picture 5" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/c8f206bd3f1b724cb8cccea16f4a1d5f1f3be82e93dbf7ab5f61fc8da62e4052.png" width="300" style="display: block; margin-left: auto; margin-right: auto; padding-top: 20px"/>

Solution : add a hidden layer, which is a non-linear transformation
- we can think of a neural network classifier with one hidden layer as building a vector $h$ which is a hidden layer representation of the input, and then running standard logistic regression on the features that the network develops in $h$.
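
One well-known hand-set solution uses two ReLU hidden units. The weights below are an assumption for illustration (they are not learned here and may differ from the ones in the figure).

```python
def relu(z):
    return max(0.0, z)

def hidden(x1, x2):
    """Hidden layer: two ReLU units with weights [1, 1] and biases 0 and -1."""
    return relu(x1 + x2 + 0.0), relu(x1 + x2 - 1.0)

def xor_net(x1, x2):
    """Output layer: weights [1, -2] and bias 0 applied to the hidden vector."""
    h1, h2 = hidden(x1, x2)
    return h1 - 2.0 * h2

outputs = {(a, b): xor_net(a, b) for a in (0, 1) for b in (0, 1)}
# {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0}
```

The hidden layer maps both (0, 1) and (1, 0) to the same point (1, 0), so in the space of $h$ the two classes become linearly separable.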

### Feedforward Neural Networks
- a feedforward neural network is an artificial neural network wherein connections between the nodes do not form a cycle
- one or more hidden layers between the input and output layers
- equations for a two layer neural network:
    - $z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$
    - $a^{[1]} = g^{[1]}(z^{[1]})$
    - $z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}$
    - $a^{[2]} = g^{[2]}(z^{[2]})$
    - $\hat{y} = a^{[2]}$
- activation functions are generally different for final layer and hidden layers
    - final layer: sigmoid, softmax
    - hidden layers: tanh, ReLU, leaky ReLU, etc.
- we replace the bias term with $a^{[i]}_0 = 1$, so that we can write the equations in matrix form
    - $z^{[1]} = W^{[1]} a^{[0]}$
    - $a^{[1]} = g^{[1]}(z^{[1]})$
    - $z^{[2]} = W^{[2]} a^{[1]}$
    - $a^{[2]} = g^{[2]}(z^{[2]})$
    - $\hat{y} = a^{[2]}$
- define loss function (binary cross-entropy for one example, averaged over the $m$ training examples to get the cost):
    - $L(\hat{y}, y) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$
    - $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})$
- for a multinomial classification problem with $k$ classes,
    - $L(\hat{y}, y) = -\sum_{j=1}^k y_j \log(\hat{y}_j)$
    - $J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)})$
- steps in summary:
    - For every training tuple $(x, y)$
        - Run forward computation to find our estimate $\hat{y}$
        - Run backward computation to update weights:
            - For every output node
                - Compute loss $L$ between true $y$ and the estimated $\hat{y}$
                - For every weight $w$ from hidden layer to the output layer
                    - Update the weight
            - For every hidden node
                - Assess how much blame it deserves for the current answer
                - For every weight $w$ from input layer to the hidden layer
                    - Update the weight
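
The loop above can be sketched for a two-layer network with a tanh hidden layer and a single sigmoid output. The layer sizes, learning rate, XOR data, and function names are assumptions for illustration, and no claim is made about how well this particular run converges.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Forward pass: returns the hidden activations a1 and the estimate y_hat."""
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    z2 = sum(w * a for w, a in zip(W2, a1)) + b2
    return a1, sigmoid(z2)

def loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy for one example (y_hat clipped to avoid log(0))."""
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def sgd_step(x, y, params, lr):
    """One forward and backward pass for a single (x, y); updates params in place."""
    W1, b1, W2, b2 = params
    a1, y_hat = forward(x, params)
    delta2 = y_hat - y  # dL/dz2 for sigmoid output with cross-entropy loss
    # Blame for each hidden unit: dL/dz1_j = delta2 * W2_j * (1 - a1_j^2)
    delta1 = [delta2 * W2[j] * (1 - a1[j] ** 2) for j in range(len(a1))]
    for j in range(len(a1)):
        W2[j] -= lr * delta2 * a1[j]           # hidden -> output weights
        b1[j] -= lr * delta1[j]
        for i in range(len(x)):
            W1[j][i] -= lr * delta1[j] * x[i]  # input -> hidden weights
    params[3] = b2 - lr * delta2
    return loss(y_hat, y)                      # loss before the update

rng = random.Random(0)
params = [
    [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(3)],  # W1: 3 hidden units
    [0.0, 0.0, 0.0],                                             # b1
    [rng.uniform(-1, 1) for _ in range(3)],                      # W2
    0.0,                                                         # b2
]
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
for epoch in range(100):
    for x, y in data:
        sgd_step(x, y, params, lr=0.1)
```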

### Computation Graphs
- Error backpropagation (backward differentiation on a computation graph) is used to compute the gradients of the loss function for a network
- Neural language models use a neural network as a probabilistic classifier, to compute the probability of the next word given the previous n words
- Neural language models can use pretrained embeddings, or can learn embeddings from scratch in the process of language modeling

Example: $L(a, b, c) = c(a + 2b)$, computed through the intermediate nodes $d = 2b$, $e = a + d$ and $L = c \cdot e$.

1. First we do the forward pass,
    <img alt="picture 6" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/adebdf40c90c52c1126e918616285d6f5feaf2f5522e5ba476f211c885f262b9.png" width="500" style="display: block; margin-left: auto; margin-right: auto"/>
2. Then we can compute the derivatives using the chain rule,
    - $\frac{\partial L}{\partial c} = e, \frac{\partial L}{\partial a} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial a}, \frac{\partial L}{\partial b} = \frac{\partial L}{\partial e} \frac{\partial e}{\partial d} \frac{\partial d}{\partial b}$  
3. Now we compute 
    - $\frac{\partial L}{\partial e} = c, \frac{\partial L}{\partial c} = e$
    - $\frac{\partial e}{\partial a} = 1, \frac{\partial e}{\partial d} = 1$
    - $\frac{\partial d}{\partial b} = 2$
4. Then we do backward pass,
    <img alt="picture 7" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/a135e1010c19f7550ed6d8472ec297388fd1b6e2406fa53f0fbcb77778e06da8.png" width="500" style="display: block; margin-left: auto; margin-right: auto"/>
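
The same forward and backward pass can be written out by hand. This sketch assumes the intermediate nodes are $d = 2b$ and $e = a + d$, which is what the local derivatives listed above imply; the input values are illustrative.

```python
def forward_backward(a, b, c):
    """Forward and backward pass for L = c * (a + 2b)."""
    # Forward pass through the graph d = 2b, e = a + d, L = c * e.
    d = 2 * b
    e = a + d
    L = c * e
    # Backward pass: multiply local derivatives along each path (chain rule).
    dL_de = c
    dL_dc = e
    dL_da = dL_de * 1  # de/da = 1
    dL_dd = dL_de * 1  # de/dd = 1
    dL_db = dL_dd * 2  # dd/db = 2
    return L, {"a": dL_da, "b": dL_db, "c": dL_dc}

L, grads = forward_backward(a=3, b=1, c=-2)
# L = -10, grads = {"a": -2, "b": -4, "c": 5}
```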

Sample computation graph for a simple 2-layer neural net with two input dimensions and 2 hidden dimensions.

<img alt="picture 8" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/e729d679c81d447bdf098a9cf6b74ceddc3e9a2fc974fa2335a9e48fe9218ccf.png" width="500" style="display: block; margin-left: auto; margin-right: auto"/>

### Representation learning
- The real power of deep learning comes from the ability to learn features from the data

<img alt="picture 9" src="https://cdn.jsdelivr.net/gh/sharatsachin/images-cdn@master/images/37ac55a7ef2157255bd96af917860415b8102458e82288a980b7969ed7d80ced.png" width="500" style="display: block; margin-left: auto; margin-right: auto"/>

## POS Tagging
- Part-of-speech tagging is the process of assigning a part-of-speech tag (Noun, Verb, Adjective...) to each word in an input text
- POS examples:
    - Noun (NN) - chair/bandwidth/pacing
    - Verb (V) - study/debate/munch
    - Adjective (ADJ) - purple/tall/ridiculous
    - Adverb (ADV) - unfortunately/slowly/loudly
    - Preposition (P) - of/by/to/with/on/for
    - Pronoun (PRO) - I/me/mine/myself
    - Determiner (DET) - the/a/that/those/this
- types:
    - open class: nouns, verbs, adjectives, adverbs (new words are added to these classes regularly)
    - closed class: prepositions, pronouns, determiners 
- tagset: set of all possible tags used for a dataset, eg: Penn Treebank tagset
- approaches:
    1. rules based (hand-crafted rules)
        - stage 1 : assign potential POS tags based on dictionary lookup
        - stage 2 : assign POS tags based on hand-crafted rules to sort out ambiguities
    2. statistical (learned from data, context-HMM tagger)
    3. hybrid (rules + statistics, eg: Brill transformation-based tagger)

#### Hidden Markov Models
- Markov models - statistical model that assumes the Markov property, i.e., the probability distribution over future states depends only on the current state, and not on the sequence of events that preceded it $$P(q_i | q_1, q_2, ..., q_{i-1}) = P(q_i | q_{i-1})$$
- Hidden Markov models - Markov models with hidden states, i.e., the state is not directly visible, but output, dependent on the state, is visible
- Formalized: 
    - $Q = q_1, q_2, ..., q_T$ is the sequence of states, each drawn from a set of $N$ states $S = \{s_1, s_2, ..., s_N\}$
    - $O = o_1, o_2, ..., o_T$ is the sequence of observations, each drawn from a vocabulary $V = \{v_1, v_2, ..., v_{|V|}\}$
    - $A = a_{ij}$ is the state transition probability matrix, where $a_{ij} = P(q_t = s_j | q_{t-1} = s_i)$
        - it represents the probability of moving from state $s_i$ to state $s_j$
    - $B = b_j(k)$ is the emission/observation probability matrix, where $b_j(k) = P(o_t = v_k | q_t = s_j)$
        - it represents the probability of observing $v_k$ from state $s_j$
    - $\pi = \pi_i$ is the initial state distribution, where $\pi_i = P(q_1 = s_i)$

### HMM based POS tagging
- HMM has two parts:
    - transition probabilities: $P(t_i | t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$
        - probability of a tag given the previous tag
    - emission probabilities: $P(w_i | t_i) = \frac{C(t_i, w_i)}{C(t_i)}$
        - probability given a tag, that it will be associated with a word
- goal is to get tag sequence $\hat{t}_{1:n} = \arg\max_{t_{1:n}} P(t_{1:n} | w_{1:n}) = \arg\max_{t_{1:n}} P(w_{1:n} | t_{1:n}) P(t_{1:n})$ using Bayes rule (the denominator $P(w_{1:n})$ is the same for every tag sequence and can be dropped), then applying simplifying assumptions:
    - $P(w_{1:n} | t_{1:n}) \approx \prod_{i=1}^n P(w_i | t_i)$
    - $P(t_{1:n}) \approx \prod_{i=1}^n P(t_i | t_{i-1})$
- therefore, we have $\hat{t}_{1:n} = \arg\max_{t_{1:n}} \prod_{i=1}^n P(w_i | t_i) P(t_i | t_{i-1})$, where the two factors are the emission and transition probabilities respectively
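
Both kinds of probability are relative frequencies over a tagged corpus. The sketch below assumes a tiny hand-tagged corpus and a start symbol `<s>` for the first transition; the names are illustrative.

```python
from collections import Counter

# Tiny hand-tagged corpus (an assumption for illustration).
tagged_sentences = [
    [("the", "DET"), ("dog", "NN"), ("barks", "V")],
    [("a", "DET"), ("cat", "NN"), ("sleeps", "V")],
    [("the", "DET"), ("cat", "NN"), ("barks", "V")],
]

START = "<s>"
tag_counts, transition_counts, emission_counts = Counter(), Counter(), Counter()
for sentence in tagged_sentences:
    prev = START
    tag_counts[START] += 1
    for word, tag in sentence:
        transition_counts[(prev, tag)] += 1
        emission_counts[(tag, word)] += 1
        tag_counts[tag] += 1
        prev = tag

def p_transition(tag, prev):
    """P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})"""
    return transition_counts[(prev, tag)] / tag_counts[prev]

def p_emission(word, tag):
    """P(w_i | t_i) = C(t_i, w_i) / C(t_i)"""
    return emission_counts[(tag, word)] / tag_counts[tag]

p_nn_after_det = p_transition("NN", "DET")  # 3/3
p_the_given_det = p_emission("the", "DET")  # 2/3
```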

#### Viterbi algorithm
- finds the optimal sequence of tags: given an observation sequence and an HMM $\lambda = (A, B)$, it returns the state path through the HMM that assigns maximum likelihood to the observation sequence
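
A compact sketch of the algorithm using dictionaries for $\pi$, $A$ and $B$. The toy probabilities are made up for illustration. Plain probabilities are multiplied here for readability; for long sequences, log probabilities would be safer against underflow.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability of that path and the observations)."""
    # v[t][s] = probability of the best path that ends in state s at time t
    v = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointer.append({})
        for s in states:
            best_prev = max(states, key=lambda r: v[t - 1][r] * trans_p[r][s])
            v[t][s] = (v[t - 1][best_prev] * trans_p[best_prev][s]
                       * emit_p[s].get(observations[t], 0.0))
            backpointer[t][s] = best_prev
    # Follow the backpointers from the best final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    path.reverse()
    return path, v[-1][last]

# Toy HMM (all numbers are assumptions for illustration).
states = ["DET", "NN", "V"]
start_p = {"DET": 0.8, "NN": 0.1, "V": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NN": 0.9, "V": 0.05},
    "NN": {"DET": 0.1, "NN": 0.3, "V": 0.6},
    "V": {"DET": 0.5, "NN": 0.3, "V": 0.2},
}
emit_p = {
    "DET": {"the": 0.7, "a": 0.3},
    "NN": {"dog": 0.4, "cat": 0.4, "barks": 0.2},
    "V": {"barks": 0.6, "sleeps": 0.4},
}
path, prob = viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p)
# path = ["DET", "NN", "V"], prob = 0.56 * 0.9 * 0.4 * 0.6 * 0.6
```

The table `v` has one entry per (time step, state), so the running time is $O(T N^2)$ instead of the $O(N^T)$ needed to enumerate every tag sequence.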
