![Tulane](https://github.com/tulane-cmps6730/main/blob/main/img/banner.png?raw=true)

<center>

<font size="+3">Sequence Models I</font>

[Aron Culotta](https://cs.tulane.edu/~aculotta/)  
[Tulane University](https://cs.tulane.edu/)

<a href="http://colab.research.google.com/github/tulane-cmps6730/main/blob/main/notebooks/06_Sequence_Models_I.ipynb">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Google_Colaboratory_SVG_Logo.svg/320px-Google_Colaboratory_SVG_Logo.svg.png"  width=10%/></a>
<a href="https://github.com/tulane-cmps6730/main/tree/main">
        <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c2/GitHub_Invertocat_Logo.svg/240px-GitHub_Invertocat_Logo.svg.png" width=6%/></a>

In this module, we'll learn about models that better capture the structure of languages, including sequences and trees.

</center>

<hr size=10 color=#285C4D>

## Sequence Classification

Classification assumes each input has a single output label:

- **input:** email
- **output:** spam or not


However, for many language processing tasks, the output is a sequence of labels.

Examples include:


- Part-of-speech tagging:

| Det | N   | V      | P    | Det | N    |
|-----|-----|--------|------|-----|------|
| The | cow | jumped | over | the | moon. |


- Named-entity recognition

| Person    | Person|  _     | _    | Location  | _    |
|-----------|-------|--------|------|-----------|------|
| President | Trump | flew   | to   | D.C.      | today. |

- Speech recognition

| "hu"      | "ow" |  "ah"  | "r"  | "ye"      | "ooh"    |
|-----------|------|--------|------|-----------|----------|
|-----------|------|--------|------|-----------|----------|

<img src='https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/audio.jpg?raw=1' width=50%/>

<br>

All of the above examples have a sequence of input symbols (e.g., words) and a sequence of output symbols (e.g., parts-of-speech).

<br><br><br>
Over the next two modules, we will consider different models that support sequence predictions. Much like we did with classification, we'll begin with traditional, statistical approaches that look a lot like Naive Bayes and n-gram language models. Then, we will proceed to neural network versions of these ideas.



<hr size=10 color=#285C4D>

## Hidden Markov Models


Hidden Markov Models (HMMs) are probabilistic models of sequences of input variables and output variables. They were used widely in the 1980s-2000s, initially for speech recognition systems.

They can be viewed similarly to n-gram language models, but whereas language models assume all variables are observable, HMMs extend to sequences of output variables that are not observed.

The output at time $i$ depends on the input at time $i$ and the output at previous times $i-1$, $i-2$, ...:

- **Language model**: $p(w_i|w_{i-2}, w_{i-1})$
- **Hidden Markov model**: $p(y_i|w_i, y_{i-1}, y_{i-2}, \ldots)$
  - e.g., $y_i$ is part-of-speech tag at time $i$

<br><br><br>

### Markov chain




To begin, let's consider an **observed** Markov chain.

The language models we saw previously are examples of this.  
E.g., a bigram language model assumes the next word is independent of all earlier words given the previous word:

$$p(w_i \mid w_{i-n} \ldots w_{i-1}) \triangleq p(w_i \mid w_{i-1})$$

This is a **first-order Markov assumption**. We can draw this model as a **weighted** finite state machine.

E.g., assume our language model only has three words  
{"snow", "white", "is"}:

![snow](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/snow.png?raw=1)

- $p($is $ \mid $ white, snow$)  \triangleq   p($ is $ \mid $ snow$) = a_{21}$
- Each edge is weighted by a conditional probability.


<br><br>

A Markov chain consists of the following:

- $Q=q_1 \ldots q_n$, a set of $n$ **states**
- $A = a_{01}a_{02} \ldots a_{n1} \ldots a_{nn}$, a **transition probability matrix**.
  - Each $a_{ij}=p(q_j \mid q_i)$ is the probability of the transition from state $i$ to state $j$
  - All transitions from a state must sum to 1: $\sum_{j=1}^n a_{ij} = 1 $ $\forall i$
- $q_0$: start state
- $q_F$: end (final) state


### Sequence probabilities


**A Markov chain assigns a probability to a sequence of words:**
- Equal to the product of the probabilities for an accepting path
  - If multiple accepting paths, equal to the largest value for any path
  
  

![snow](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/snow.png?raw=1)

$p($"snow white is"$)$    
$= p($snow $\mid q_0) * p($ white $\mid $ snow$ ) * p($ is $\mid $ white$) * p(q_F \mid $ is $)$  
$= a_{02} * a_{23} * a_{31} * a_{14}$
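
The product above can be sketched in a few lines of Python. The transition probabilities here are made-up values for illustration (the real edge weights are in the figure), and the names `START`, `END`, and `sequence_probability` are hypothetical.

```python
# Hypothetical transition probabilities for the three-word chain.
# Each row is p(next | current) and sums to 1.
START, END = "<s>", "</s>"
transitions = {
    START:   {"snow": 0.5, "white": 0.3, "is": 0.2},
    "snow":  {"snow": 0.1, "white": 0.6, "is": 0.2, END: 0.1},
    "white": {"snow": 0.3, "white": 0.1, "is": 0.5, END: 0.1},
    "is":    {"snow": 0.3, "white": 0.3, "is": 0.1, END: 0.3},
}

def sequence_probability(words, transitions):
    """Multiply the transition probabilities along start -> words -> end."""
    path = [START] + list(words) + [END]
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        # A missing edge means the transition has probability 0.
        prob *= transitions[prev].get(cur, 0.0)
    return prob

p = sequence_probability(["snow", "white", "is"], transitions)
print(p)  # approximately 0.5 * 0.6 * 0.5 * 0.3 = 0.045
```

Because every word is observed, there is exactly one path through the chain for a given word sequence, so a single product is enough.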


**"Unrolled" chain**

![snow_unrolled](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/snow_unrolled.png?raw=1)





### From Observed to Hidden Markov Chains



- A Markov chain over **unobserved** ("hidden") variables

![pos](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/pos.png?raw=1)

![pos_u](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/pos_unrolled.png?raw=1)


### Example: weather inference



- A future climatologist wants to know:
  - What was the weather like in New Orleans in March 2024?
- We do not have records of temperature, but luckily I kept records of the number of ice creams I ate each day.

**Problem:**

- Given a sequence of observations $O$
  - ints representing number of ice creams I ate per day
- Predict the hidden sequence $Q$ of weather states
  - "H" for Hot or "C" for Cold
  
![icecream](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/icecream.png?raw=1)

### Formal Definition of HMMs



A Hidden Markov Model consists of the following:

- $Q=q_1 \ldots q_n$, a set of $n$ **states**
- $A = a_{01}a_{02} \ldots a_{n1} \ldots a_{nn}$, a **transition probability matrix**.
  - Each $a_{ij}=p(q_j \mid q_i)$ is the probability of the transition from state $i$ to state $j$
  - All transitions from a state must sum to 1: $\sum_{j=1}^n a_{ij} = 1 $ $\forall i$
- $O = o_1 \ldots o_T$, a sequence of $T$ **observations**
  - Each is drawn from vocabulary $V = v_1 \ldots v_V$
- $B = b_i(o_t)$, a set of **observation likelihoods**, aka **emission probabilities**
  - The probability of observation $o_t$ being generated by state $i$
  - $p(o_t \mid q_t = i)$
- $q_0$: start state and $q_F$: end (final) state
  - Neither is associated with observations
  - Each has transition probabilities for states:
    - $a_{01} \ldots a_{0n}$ for start transitions
    - $a_{1F} \ldots a_{nF}$ for end transitions
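
As a rough sketch, the components above could be stored as plain Python dictionaries. The numbers are assumed values for a two-state ice cream HMM (H and C); treat them as illustrative rather than as the exact values in the figure. End transitions are omitted here for simplicity, and `is_distribution` and `validate_hmm` are hypothetical helper names.

```python
# Assumed parameters for a two-state ice cream HMM (illustrative values).
states = ["H", "C"]                                   # Q
start_p = {"H": 0.8, "C": 0.2}                        # a_{0j}
trans_p = {"H": {"H": 0.6, "C": 0.4},                 # a_{ij} = p(q_j | q_i)
           "C": {"H": 0.5, "C": 0.5}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4},              # b_i(o) = p(o | q_i)
          "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def is_distribution(probs, tol=1e-9):
    """True if the values are non-negative and sum to 1."""
    return all(p >= 0 for p in probs.values()) and abs(sum(probs.values()) - 1.0) < tol

def validate_hmm(states, start_p, trans_p, emit_p):
    """Check that the start, transition, and emission rows are all distributions."""
    return (is_distribution(start_p)
            and all(is_distribution(trans_p[s]) for s in states)
            and all(is_distribution(emit_p[s]) for s in states))

valid = validate_hmm(states, start_p, trans_p, emit_p)
print(valid)  # True
```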

### HMM Assumptions



1. **Markov assumption**:
  - $p(q_i \mid q_1 \ldots q_{i-1}) \triangleq p(q_i \mid q_{i-1})$  
<br><br>
2. **Output independence**:
  - $p(o_i \mid q_1 \ldots q_i \ldots q_T, o_1 \ldots o_i \ldots o_T) \triangleq p(o_i \mid q_i)$
  
  
<br><br>
Why make these assumptions?


<br><br><br>

HMMs may be:
- **fully connected** ("ergodic"): every state is reachable from every other state
- or not: e.g., illegal to transition from "of" to present tense verb ("of jump")

<hr size=10 color=#285C4D>

## Three Fundamental Problems of HMMs



1. **Likelihood:**
  - Given an HMM $\lambda = (A,B)$ and an observation sequence $O$
  - Compute the likelihood $p(O \mid \lambda)$.
  - I.e., what is the probability of this observation sequence given this HMM?
<br><br><br>  
2. **Decoding:**
  - Given an observation sequence $O$ and an HMM $\lambda=(A,B)$
  - Find the most probable sequence $Q$
<br><br><br>
3. **Learning:**
  - Given an observation sequence $O$ and the set of states in the HMM
  - Learn the HMM parameters $A$ and $B$.
  
<br><br><br><br>

### 1. Computing Likelihood with the Forward Algorithm



 Given an HMM $\lambda = (A,B)$ and an observation sequence $O$
  - Compute the likelihood $p(O \mid \lambda)$.
  - I.e., what is the probability of this observation sequence given this HMM?
  
<br><br>
Recall how we computed the likelihood of an *observed* sequence in a **Markov chain**
- Just multiply transition probabilities

$p($"snow white is"$)$    
$= p($snow $\mid q_0) * p($ white $\mid $ snow$ ) * p($ is $\mid $ white$) * p(q_F \mid $ is $)$  
$= a_{02} * a_{23} * a_{31} * a_{14}$

![snow_unrolled](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/snow_unrolled.png?raw=1)

<br><br>

**Why doesn't this work for HMMs?**

![pos_u](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/pos_unrolled.png?raw=1)

<br><br><br>
We don't observe the hidden states!

<br><br>
Let's start with a simpler problem:
- Assume we know the hidden states for this observation sequence
- E.g., in ice cream example
  - I ate {3, 1, 3} ice creams
  - The weather was {hot, hot, cold}
  - $O=\{3,1,3\}$
  - $Q = \{H, H, C\}$
  
<br><br>
We can compute probability as:

$$
p(O \mid Q) = \prod_{i=1}^T p(o_i \mid q_i)
$$

$$p(\{3, 1, 3\} \mid \{H, H, C\}) = p(3 \mid H) * p(1 \mid H) * p(3 \mid C)$$  
$$= .4 * .2 * .1 $$

![l1](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/likelihood1.png?raw=1)


<br>

Since we don't know the true weather state sequence, we will sum over all possible state sequences, weighted by probability.

First, note that the joint probability $p(O,Q)$ just multiplies all the emission probabilities ($B$) and transition probabilities ($A$) along the path:
$$
p(O, Q) = p(O \mid Q) \times p(Q) = \prod_{i=1}^T p(o_i \mid q_i) \times \prod_{i=1}^T p(q_i \mid q_{i-1})
$$

![icj](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/icecream_joint.png?raw=1)
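
Here is a small sketch of the two factors for $O=\{3,1,3\}$ and $Q=\{H,H,C\}$. The emission values match the $.4 * .2 * .1$ computation above; the start and transition values are assumptions for illustration, and the function names are hypothetical.

```python
# Assumed ice cream HMM parameters (start/transition values are illustrative).
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def likelihood_given_states(obs, hidden, emit_p):
    """p(O | Q): product of emission probabilities."""
    prob = 1.0
    for o, q in zip(obs, hidden):
        prob *= emit_p[q][o]
    return prob

def state_sequence_probability(hidden, start_p, trans_p):
    """p(Q): start probability times the transition probabilities."""
    prob = start_p[hidden[0]]
    for prev, cur in zip(hidden, hidden[1:]):
        prob *= trans_p[prev][cur]
    return prob

obs, hidden = [3, 1, 3], ["H", "H", "C"]
p_o_given_q = likelihood_given_states(obs, hidden, emit_p)   # .4 * .2 * .1
p_q = state_sequence_probability(hidden, start_p, trans_p)   # .8 * .6 * .4
joint = p_o_given_q * p_q                                    # p(O, Q)
print(p_o_given_q, p_q, joint)
```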

To compute $p(O)$ from $p(O, Q)$, recall the notion of *marginalization* in probability:

$$p(x=x_j) = \sum_{y_i} p(x=x_j,y=y_i)$$  

E.g., if $x=1$ means a student passed test 1, and $y=1$ means a student passed test 2,  
then the probability that a student passes test 1 is:

$$
p(x=1) = p(x=1,y=0) + p(x=1,y=1)
$$

<br>

We will use marginalization to sum over possible state sequences, to compute the probability of the observation sequence.

$$
p(O) = \sum_Q p(O, Q) = \sum_Q p(O \mid Q) p(Q)
$$
- last step by chain rule: $p(X,Y) = p(X \mid Y) p(Y)$  


<br>
Back to the ice cream example:

$$ p(\{3, 1, 3\}) = p(\{3, 1, 3\}, \{C, C, C\}) + p(\{3, 1, 3\}, \{C, C, H\}) + \ldots $$  
$$ + p(\{3, 1, 3\}, \{H, H, C\}) + p(\{3, 1, 3\}, \{H, H, H\}) $$


**Problem with this approach**: If we have $N$ states and $T$ observations, how many possible hidden sequences must we sum over?

<br><br><br><br>
$$N^T$$
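
To make the blow-up concrete, here is a brute-force sketch that enumerates every hidden sequence with `itertools.product` and sums the joint probabilities. The HMM parameters are assumed values for illustration, and `brute_force_likelihood` is a hypothetical name.

```python
from itertools import product

# Assumed ice cream HMM parameters (illustrative values).
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def brute_force_likelihood(obs, states, start_p, trans_p, emit_p):
    """Sum p(O, Q) over all N^T hidden sequences Q."""
    total, n_sequences = 0.0, 0
    for hidden in product(states, repeat=len(obs)):
        p = start_p[hidden[0]] * emit_p[hidden[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans_p[hidden[t - 1]][hidden[t]] * emit_p[hidden[t]][obs[t]]
        total += p
        n_sequences += 1
    return total, n_sequences

total, n_sequences = brute_force_likelihood([3, 1, 3], states, start_p, trans_p, emit_p)
print(n_sequences)  # 8 = 2^3 hidden sequences
print(total)        # approximately 0.028562 under these assumed parameters
```

With only 2 states and 3 observations this is cheap, but the number of terms doubles with every additional observation.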

<br><br><br>
Is there a polynomial time algorithm?
<br><br>
Yes: Dynamic programming to the rescue.
- There is a lot of duplicate work in the summation above.
- We will store intermediate values as we scan the input from left to right.
<br><br>


#### Forward algorithm

##### Forward Trellis
Stores observation probabilities up to time step $t$

- $\alpha_t(j)=$ the probability of being in state $j$ after seeing the first $t$ observations  
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$=p(o_1\ldots o_t, q_t=j \mid \lambda) = \sum_{i=1}^N \alpha_{t-1}(i)a_{ij}b_j(o_t)$
  - $\alpha_{t-1}(i)$ the previous forward path probability
  - $a_{ij}$ the transition probability from previous state $i$ to current state $j$
  - $b_j(o_t)$ the likelihood of the observation at time $t$, $p(o_t \mid q_t = j)$
<br><br>
- E.g., $\alpha_2(H)$ computes forward probability of being in state H at time 2 having generated the partial observation $\{3, 1\}$
  - Sum two terms, one for each possible previous state:
    - $\alpha_1(H) \times p(H \mid H) \times p(1 \mid H)$ (if previous state was $H$)
    - $\alpha_1(C) \times p(H \mid C) \times p(1 \mid H)$ (if previous state was $C$)
    
![trellis](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/trellis.png?raw=1)


##### Recursive definition

1. Initialization:
  - $\alpha_1(j) = a_{0j}b_j(o_1) $ &nbsp; $1 \le j \le N$
  - probabilities of starting in state $j$ and emitting observation $o_1$
<br><br>
2. Recursion
  - $\alpha_t(j) = \sum_{i=1}^N \alpha_{t-1}(i)a_{ij}b_j(o_t) $ &nbsp;&nbsp;$ 1 \le j \le N, 1 < t \le T$
  - the probability of being in state $j$ after seeing the first $t$ observations
  - sums over all possible states we could have been in at time step $t-1$
<br><br>
3. Termination
  - $ p(O \mid \lambda) = \alpha_T(q_F) = \sum_{i=1}^N \alpha_T(i)a_{iF}$
  
![trellis2](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/trellis2.png?raw=1)

#### Pseudocode

![code](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/forward_code.png?raw=1)
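
The recursive definition can also be sketched directly in Python. This is a minimal version under two assumptions: the HMM parameters are illustrative values for the ice cream example, and there is no explicit end state, so termination simply sums the final column of the trellis.

```python
# Assumed ice cream HMM parameters (illustrative values).
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs, states, start_p, trans_p, emit_p):
    """Return p(O) and the trellis; alpha[t][j] covers the first t+1 observations."""
    # Initialization: start in state j and emit the first observation.
    alpha = [{j: start_p[j] * emit_p[j][obs[0]] for j in states}]
    # Recursion: sum over every state i we could have been in at the previous step.
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * trans_p[i][j] for i in states) * emit_p[j][obs[t]]
            for j in states
        })
    # Termination (no end state assumed): sum the last column.
    return sum(alpha[-1].values()), alpha

likelihood, alpha = forward([3, 1, 3], states, start_p, trans_p, emit_p)
print(alpha[1]["H"])  # approximately 0.0404
print(likelihood)     # approximately 0.028562
```

The two nested loops over states inside a loop over time steps are where the $N^2T$ cost comes from.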

**Runtime?**
<br><br><br><br>
$$O(N^2T) \ll O(N^T)$$

<br><br><br><br>


### 2. Decoding with the Viterbi algorithm


  - Given an observation sequence $O$ and an HMM $\lambda=(A,B)$
  - Find the most probable sequence $Q$
  - E.g., given the ice cream observations $\{3, 1, 3\}$, what is the most likely temperature sequence?
  
  
**Naive approach:**  
- Score each possible state sequence $Q$ by its joint probability with the observations, $p(O, Q \mid \lambda)$
- Return the state sequence with the highest score
- Still exponential: $O(N^T)$

**Efficient approach:**

#### Viterbi algorithm

A dynamic program, similar to the forward algorithm, to find the best path through the trellis.

![viterbi](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/viterbi.png?raw=1)

$$v_t(j) = \max_{q_0 \ldots q_{t-1}}p(q_0\ldots q_{t-1}, o_1 \ldots o_t, q_t = j \mid \lambda)$$  
$$ = \max_{i=1}^N v_{t-1}(i)a_{ij}b_j(o_t)$$
- $v_{t-1}(i)$: the previous Viterbi path probability from the prior time step
- $a_{ij}$: the transition probability from prior state $q_i$ to current state $q_j$
- $b_j(o_t)$: the probability of the observation at time $t$: $p(o_t \mid q_t)$

Contrast with the forward algorithm: $\alpha_t(j)=\sum_{i=1}^N \alpha_{t-1}(i)a_{ij}b_j(o_t)$  
Two differences:
 - sum becomes $\max$
 - Also need additional bookkeeping to store the best path itself, rather than just the probability.






##### Pseudocode

![vit_code](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/viterbi_code.png?raw=1)

##### Recursive definition

1. Initialization:
  - $v_1(j) = a_{0j}b_j(o_1) $ &nbsp; $1 \le j \le N$
    - probabilities of starting in state $j$ and emitting observation $o_1$
  - $bt_1(j) = 0$ back trace to store best path
<br><br>
2. Recursion
  - $v_t(j) = \max_{i=1}^N v_{t-1}(i)a_{ij}b_j(o_t) $ &nbsp;&nbsp;$ 1 \le j \le N, 1 < t \le T$
    - the probability of the best path continuing to state $j$ after seeing the first $t$ observations
  - $bt_t(j) = \mathrm{argmax}_{i=1}^N v_{t-1}(i)a_{ij}b_j(o_t) $ &nbsp;&nbsp;$ 1 \le j \le N, 1 < t \le T$
    - update the best path so far
<br><br>
3. Termination
  - $ P^* = v_T(q_F) = \max_{i=1}^N v_T(i)a_{iF}$
    - the best score
  - $q_T^* = bt_T(q_F) = \mathrm{argmax}_{i=1}^N v_T(i) * a_{iF}$
    - the start of the backtrace (the best second to last state)
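
A minimal Python sketch of these three steps follows. As before, the HMM parameters are assumed values for illustration and no explicit end state is modeled, so termination picks the best state in the last column. The example uses the observations $\{3, 1, 3\}$ from earlier.

```python
# Assumed ice cream HMM parameters (illustrative values).
states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.6, "C": 0.4}, "C": {"H": 0.5, "C": 0.5}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path and its joint probability with obs."""
    # Initialization
    v = [{j: start_p[j] * emit_p[j][obs[0]] for j in states}]
    backptr = [{j: None for j in states}]
    # Recursion: keep only the best previous state for each current state.
    for t in range(1, len(obs)):
        v.append({})
        backptr.append({})
        for j in states:
            best_prev = max(states, key=lambda i: v[t - 1][i] * trans_p[i][j])
            v[t][j] = v[t - 1][best_prev] * trans_p[best_prev][j] * emit_p[j][obs[t]]
            backptr[t][j] = best_prev
    # Termination (no end state assumed): best state in the last column.
    last = max(states, key=lambda j: v[-1][j])
    best_prob = v[-1][last]
    # Follow the back pointers from the last state to the first.
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(backptr[t][path[-1]])
    path.reverse()
    return path, best_prob

path, best_prob = viterbi([3, 1, 3], states, start_p, trans_p, emit_p)
print(path)       # ['H', 'C', 'H'] under these assumed parameters
print(best_prob)  # approximately 0.0128
```

The only structural changes from the forward sketch are `max` in place of `sum` and the extra `backptr` table.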
  

### 3. Learning




When the training data is labeled with the true states, learning uses the same simple "counting" estimation we used for language models and naive Bayes.



#### Transition probabilities

$$p(q_i \mid q_{i-1}) = \frac{C(q_{i-1}, q_i)}{C(q_{i-1})}$$

E.g., if the labeled training set $D$ has only two sentences:

> D &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; N &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; V &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; N  
> The &nbsp; dog &nbsp; ate &nbsp; food  
>  
> D &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; A &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; N &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; V &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  
> A black cat drank

Then we compute:

$$
p(N \mid D) = \frac{C(D, N)}{C(D)} = \frac{1}{2}
$$

We can again use add-$k$ smoothing, e.g., $k=1$:
$$
p(N \mid D) = \frac{C(D, N) + k}{C(D) + N*k} = \frac{1+1}{2 + 4} = \frac{2}{6}
$$
- where $N$ in the denominator is the number of part-of-speech tags (here 4: D, N, V, A), not the noun tag.
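
The counts for this two-sentence training set can be reproduced with a short sketch. One assumption to flag: the denominator here counts only tags that are followed by another tag, which for D gives the same $C(D)=2$ as above. The names `tagged` and `transition_prob` are hypothetical.

```python
from collections import Counter

# The two labeled training sentences as (word, tag) pairs.
tagged = [
    [("The", "D"), ("dog", "N"), ("ate", "V"), ("food", "N")],
    [("A", "D"), ("black", "A"), ("cat", "N"), ("drank", "V")],
]
tagset = sorted({tag for sent in tagged for _, tag in sent})  # A, D, N, V

bigram_counts = Counter()  # C(prev, cur)
prev_counts = Counter()    # C(prev), counting only tags followed by another tag
for sent in tagged:
    tags = [tag for _, tag in sent]
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[(prev, cur)] += 1
        prev_counts[prev] += 1

def transition_prob(prev, cur, k=0.0):
    """Add-k estimate of p(cur | prev); k=0 gives the unsmoothed estimate."""
    return (bigram_counts[(prev, cur)] + k) / (prev_counts[prev] + len(tagset) * k)

print(transition_prob("D", "N"))       # 0.5
print(transition_prob("D", "N", k=1))  # 2/6
```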



#### Emission probabilities

We can do the same to estimate the emission probabilities:

$$
p(o_i \mid q_i) = \frac{C(o_i, q_i)}{C(q_i)}
$$

e.g.

$$
p(dog \mid N) = \frac{C(dog, N)}{C(N)} = \frac{1}{3}
$$
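
The same counting idea, sketched for emissions on the two training sentences (the names `tagged` and `emission_prob` are hypothetical, and no smoothing is applied):

```python
from collections import Counter

# The two labeled training sentences as (word, tag) pairs.
tagged = [
    [("The", "D"), ("dog", "N"), ("ate", "V"), ("food", "N")],
    [("A", "D"), ("black", "A"), ("cat", "N"), ("drank", "V")],
]

emission_counts = Counter()  # C(word, tag)
tag_counts = Counter()       # C(tag)
for sent in tagged:
    for word, tag in sent:
        emission_counts[(word, tag)] += 1
        tag_counts[tag] += 1

def emission_prob(word, tag):
    """Unsmoothed estimate of p(word | tag)."""
    return emission_counts[(word, tag)] / tag_counts[tag]

print(emission_prob("dog", "N"))  # 1/3: "dog" is one of three N-tagged words
```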




#### Start/end transitions

We may optionally include probabilities to represent start and end transitions:


$$p(q_i \mid \mathrm{start} ) = \frac{C( \mathrm{start} , q_i)}{C(\mathrm{start})}$$

$$p(\mathrm{end} \mid q_i) = \frac{C(q_i, \mathrm{end})}{C(q_i)}$$

e.g.:

$$p(D \mid \mathrm{start}) = \frac{C(\mathrm{start}, D)}{C(\mathrm{start})} = \frac{2}{2} = 1$$


<br><br>

- Note: when computing a smoothed estimate, the start state does not count toward the number of tags $N$ in the denominator:

$$
p(N \mid D) = \frac{C(D, N) + k}{C(D) + N*k} = \frac{1+1}{2 + 4} = \frac{2}{6}
$$


$$p(\mathrm{start} \mid D) = 0$$

<br><br>

- Note also: if we do not model end states, then when calculating transition probabilities, we handle the final tag in a sentence differently. E.g.
$$p(V \mid N) = \frac{C(N, V)}{C(N, *)} = \frac{2}{2} = 1$$
  - where $C(N, *)$ means "count the number of N tags followed by any other tag"
  - Thus, we ignore the "N" tag for "food" in the first sentence.
  - The reason for this is we need to have $\sum_i p(q_i \mid N)=1$.
  
<br><br><br><br>

#### Unsupervised HMM Learning

We won't cover it here, but there are also methods that estimate HMM parameters in an **unsupervised** way.

Input is an *unlabeled training set* $D = \{\vec{o}_1, \ldots \vec{o}_k\}$ and the number $n$ indicating the number of HMM states allowed.

How can we possibly learn $p(q_i \mid q_{i-1})$ and $p(o_i \mid q_i)$ without any data about $q$?

Crazy idea:
- Initialize the probabilities randomly.
- Repeat:
  - Predict the labels for all the unlabeled data.
  - Re-estimate the probabilities as if the predicted labels were the true labels.
  
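This is the intuition behind the **EM Algorithm** (Expectation-Maximization), which re-estimates from expected (soft) label counts rather than hard predictions. See the book (Appendix A) for more details.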


## From Sequences to Trees

While the above models **sequences** of outputs, we know that language exhibits a hierarchical structure.

For example, words can be grouped into phrases:

> **Noun phrases**: "the big house" or "a beautiful day"  
> **Adjective phrases**: "very useful"  
> **Prepositional phrases**: "on the hill"  
> **Verb phrases**: "saw the dog"  

<br><br>

**Language is recursive**

- A sentence has many parts, many of which have subparts, many of which have subparts, ...

> I saw the dog with one eye on the hill with the tree by the lake...

We need a way to compactly represent this recursion. What data structures do you know of that are great at representing recursion?

<br><br><br><br><br>

Trees!

![figs/parse.png](https://github.com/tulane-cmps6730/main/blob/main/lec/overview/figs/green.png?raw=1)



### Context Free Grammar (CFG)



A CFG is a set of rules that define a language.


Informally:

- Set of **rules** or **productions**
  - Define how constituents can be grouped
- **Lexicon**: list of words and symbols

**Example: CFG for Noun Phrases**

> NP $\rightarrow$ Det Nominal  
> NP $\rightarrow$ ProperNoun  
> Nominal $\rightarrow$ Noun | Noun Nominal

Rules can be part of a hierarchy:

> Det $\rightarrow$ a  
> Det $\rightarrow$ the  
> Noun $\rightarrow$ flight  

- **Terminal** symbols: words in the language (e.g., "a", "flight")
- **Nonterminal** symbols: clusters or generalizations of terminals (e.g., Noun, Nominal, NP)




#### Derivation

- A sequence of rule expansions to generate a given string.
- This sequence is most commonly shown as a **parse tree**

![figs/parse.png](https://github.com/tulane-cmps6730/main/blob/main/lec/parsing/figs/parse.png?raw=1)


**Derivation**  
1. NP $\rightarrow$ Det Nom
2. Det $\rightarrow$ a
3. Nom $\rightarrow$  Noun
4. Noun $\rightarrow$ flight
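
The four-step derivation can be replayed in code. This sketch stores the noun-phrase grammar from the example above as a dictionary and rewrites one nonterminal at a time; `grammar` and `apply_rule` are hypothetical names, and the full symbol `Nominal` is used where the derivation abbreviates it as `Nom`.

```python
# Toy grammar: each nonterminal maps to its possible right-hand sides.
grammar = {
    "NP": [["Det", "Nominal"], ["ProperNoun"]],
    "Nominal": [["Noun"], ["Noun", "Nominal"]],
    "Det": [["a"], ["the"]],
    "Noun": [["flight"]],
}

def apply_rule(symbols, lhs, rhs):
    """Rewrite the leftmost occurrence of nonterminal `lhs` with `rhs`."""
    if rhs not in grammar.get(lhs, []):
        raise ValueError(f"{lhs} -> {' '.join(rhs)} is not a rule in the grammar")
    i = symbols.index(lhs)
    return symbols[:i] + rhs + symbols[i + 1:]

derivation = [
    ("NP", ["Det", "Nominal"]),
    ("Det", ["a"]),
    ("Nominal", ["Noun"]),
    ("Noun", ["flight"]),
]

symbols = ["NP"]
steps = [symbols]
for lhs, rhs in derivation:
    symbols = apply_rule(symbols, lhs, rhs)
    steps.append(symbols)

for step in steps:
    print(" ".join(step))
# NP
# Det Nominal
# a Nominal
# a Noun
# a flight
```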


#### CFGs, Formally

A context-free grammar is a four-tuple:

1. A set of non-terminal symbols (or 'variables') $N$
2. A set of terminal symbols $\Sigma$ (disjoint from $N$)
3. A set of productions $P$ of the form $A \rightarrow \alpha$, where
 - $A \in N$ is a non-terminal
 - $\alpha$ is a string of symbols from the infinite set $(\Sigma \cup N)^*$
4. A start symbol $S$


A string $\alpha_1$ **derives** a string $\alpha_m$ if $\alpha_1$ can be rewritten as $\alpha_m$ by a series of rule applications from $P$.

$$\alpha_1 \Rightarrow \alpha_2,  \alpha_2 \Rightarrow \alpha_3,  \ldots,  \alpha_{m-1} \Rightarrow \alpha_m$$  

Denoted: $\alpha_1 \Rightarrow^* \alpha_m$

If $A \rightarrow \beta$ is a production in $P$, and $\alpha$ and $\gamma$ are strings in $(\Sigma \cup N)^*$,
- $\alpha A \gamma$ **directly derives** $\alpha \beta \gamma$
- denoted: $\alpha A \gamma \Rightarrow \alpha \beta \gamma$

> a $Noun$ ride $\Rightarrow$ a train ride (by applying $Noun \rightarrow train$)


#### Example: CFG for airline reservation system

![figs/lexicon.png](https://github.com/tulane-cmps6730/main/blob/main/lec/parsing/figs/lexicon.png?raw=1)

![figs/grammar.png](https://github.com/tulane-cmps6730/main/blob/main/lec/parsing/figs/grammar.png?raw=1)

![figs/flight.png](https://github.com/tulane-cmps6730/main/blob/main/lec/parsing/figs/flight.png?raw=1)

### Probabilistic CFGs (PCFGs)



But wait, what about all the ambiguity of language??

Recall: **"I made her duck."**

Need to model the probabilities of each possible derivation.

<br><br>

Simple idea: augment each rule in $P$ with a conditional probability:

$$ A \rightarrow \beta  [p] $$

where $p \triangleq p(A \rightarrow \beta \mid A)$  
Probability that nonterminal $A$ is expanded into $\beta$  
Probabilities of all expansions of $A$ must sum to 1.

So, a PCFG is a five-tuple $G = (N, \Sigma, P, S, D)$  
where the additional term $D: P \rightarrow [0, 1]$ assigns a probability to each rule in $P$

![pcfg1](https://github.com/tulane-cmps6730/main/blob/main/lec/parsing/figs/pcfg1.png?raw=1)




#### Why PCFGs? Resolving ambiguity

![ambig](https://github.com/tulane-cmps6730/main/blob/main/lec/parsing/figs/ambig.png?raw=1)

#### Probability of a parse

$$ p(T, S) = \prod_{n \in T} p(r(n)) $$
- $T$ is a parse
- $S$ is a sentence
- $r(n)$ is the rule used to expand node $n$ in the parse tree
- Probability of a parse is the product of the probabilities of each rule in that parse.

Note that
$$p(T, S) = p(T)p(S \mid T) = p(T) $$
since $p(S \mid T) = 1$: a parse tree $T$ already includes all the words of the sentence $S$.

E.g., probability of $[S \rightarrow$ Aux NP VP $ \wedge $ Aux $\rightarrow$ can] = .15 * .40
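
A small sketch of the product over rules, with a parse tree stored as nested tuples. Only the two probabilities quoted above (.15 and .40) come from the text; the other rule probabilities and the example tree are made up for illustration, and `tree_rules` and `parse_probability` are hypothetical names.

```python
from math import prod

# Rule probabilities keyed by (lhs, rhs). Only the first two values are from
# the text; the rest are hypothetical.
rule_probs = {
    ("S", ("Aux", "NP", "VP")): 0.15,
    ("Aux", ("can",)): 0.40,
    ("NP", ("Pronoun",)): 0.35,
    ("Pronoun", ("you",)): 0.40,
    ("VP", ("Verb",)): 0.35,
    ("Verb", ("book",)): 0.30,
}

# A parse tree as (label, child, ...); leaves are plain strings (words).
tree = ("S",
        ("Aux", "can"),
        ("NP", ("Pronoun", "you")),
        ("VP", ("Verb", "book")))

def tree_rules(tree):
    """Yield the (lhs, rhs) rule used at every internal node of the tree."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from tree_rules(child)

def parse_probability(tree, rule_probs):
    """p(T): product of the probabilities of all rules used in the parse."""
    return prod(rule_probs[rule] for rule in tree_rules(tree))

p_tree = parse_probability(tree, rule_probs)
print(p_tree)  # approximately 0.000882 with these hypothetical probabilities
```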

<br><br><br>

By decomposing the probability into a product of production probabilities, what assumptions have we made?

Recall independence assumption $p(A,B) \triangleq p(A) p(B)$

- Each production is independent of all other productions
  - E.g., Seeing S $\rightarrow $NP VP has no impact on NP $\rightarrow$ Det N
- Insensitive to lexical information (words)
  - E.g., Seeing NP $\rightarrow$ Det N is the same whether the noun is "cat" or "people"


#### Probability of a sentence

$$p(S) = \sum_{T \in \tau(S)} p(T, S) = \sum_{T \in \tau(S)} p(T) $$
- $\tau(S)$ is set of all valid parse trees of sentence S
- Probability of a sentence is the sum of the probabilities of all valid parses of that sentence.

<br><br>

Recall the fluency component of translation models.
For a source sentence $A$, find the best target sentence $B^*$ by:

$$B^* \leftarrow \mathrm{argmax}_{\: B} \: P(B) P(A \mid B)$$

- $P(B)$: "fluency". How likely is this sentence $B$ overall? <-- **Can use sentence probability above!**
- $P(A \mid B):$  "faithfulness". How closely does the meaning of $B$ match that of $A$?




#### Parsing with PCFGs

For a given sentence, we often want to know the highest probability parse:

$$
T^* = \mathrm{argmax}_{T \in \tau(S)} p(T)
$$
<br><br><br>

**CKY Parsing**

 Cocke-Younger-Kasami algorithm for finding the most likely parse of a sentence:

$$
T^* = \mathrm{argmax}_{T \in \tau(S)} p(T)
$$

- Analogous to the Viterbi algorithm for HMMs, but extended to parse trees instead of sequences.
- CKY is a bottom-up parser using dynamic programming to store intermediate results (for subtrees)
- Assumes rules are in **Chomsky Normal Form (CNF)**
 - Each production is either A $\rightarrow$ B C or A $\rightarrow a$.
 - Any CFG can be converted to CNF (though may expand grammar quadratically)

We won't cover this algorithm in detail in this course, but see the textbook for more details (Jurafsky and Martin, Chapter 17).

<hr size=10 color=#285C4D>

## Group Task


Consider again the ice cream example (from [Figure A.2 in Appendix A](https://github.com/tulane-cmps6730/main/blob/main/read/slpA.pdf))

  
![icecream](https://github.com/tulane-cmps6730/main/blob/main/lec/sequence/figs/icecream.png?raw=1)

Apply the Viterbi algorithm to find the most probable path for the input <1,3,1>. In Canvas, enter the most probable path as well as its probability.

#### image sources
- https://www.cs.colorado.edu/~martin/SLP/
