# Parsing and Grammar
## Parsing
Sentence $\rightarrow$ components of sentence
* Input: sentence
* Output: parse tree with a PoS for each word in the sentence
    * Explains the "who did what to whom and why?"


## Grammar
The syntactic structure of a language

* Phrase: meaningful unit of words
* Clause: subject + predicate + phrases
* Sentence: main verb + one or more clauses

### Context Free Grammar (CFG)
Defined as $G = (N, \Sigma, R, S)$
* N: Non-terminals
* $\Sigma$: Terminals
* R: Rules of the language, e.g. $A \rightarrow B_1 \ B_2 \ \dots \ B_n$ with $A \in N$ and $B_i \in N \cup \Sigma$
* S: Start symbol, $S \in N$
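
The four components can be written down directly as a data structure. The following is a minimal sketch, not a standard API: the `CFG` tuple, the `is_well_formed` helper and the toy grammar are all made up for illustration.

```python
from typing import NamedTuple

class CFG(NamedTuple):
    nonterminals: frozenset   # N
    terminals: frozenset      # Sigma
    rules: tuple              # R, as (lhs, rhs) pairs where rhs is a tuple of symbols
    start: str                # S

def is_well_formed(g: CFG) -> bool:
    """Check S in N, N and Sigma disjoint, every LHS in N, every RHS symbol in N or Sigma."""
    symbols = g.nonterminals | g.terminals
    return (
        g.start in g.nonterminals
        and not (g.nonterminals & g.terminals)
        and all(
            lhs in g.nonterminals and all(s in symbols for s in rhs)
            for lhs, rhs in g.rules
        )
    )

# Hypothetical toy grammar for sentences like "the dog chased the cat".
toy = CFG(
    nonterminals=frozenset({"S", "NP", "VP", "Det", "N", "V"}),
    terminals=frozenset({"the", "dog", "cat", "chased"}),
    rules=(
        ("S", ("NP", "VP")),
        ("NP", ("Det", "N")),
        ("VP", ("V", "NP")),
        ("Det", ("the",)),
        ("N", ("dog",)),
        ("N", ("cat",)),
        ("V", ("chased",)),
    ),
    start="S",
)

print(is_well_formed(toy))
```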

Problem: a CFG can produce multiple valid parse trees for the same sentence (the ambiguity problem).

### CFG Parsing
Task: assign a parse tree (derivation) to input strings, such that
* the tree covers all and only the input elements and starts with $S$
* several such trees might exist

Parsing Strategies
* Top-down: start with S, try to reach the terminals
* Bottom-up: start from the terminals, try to reach S
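
As an illustration of the top-down strategy, here is a minimal recursive recognizer. It is only a sketch: the grammar `RULES` is a made-up toy example, and the approach assumes the grammar has no left-recursive rules (otherwise the recursion would not terminate).

```python
# Hypothetical toy grammar: non-terminal -> list of right-hand sides.
# Any symbol that is not a key of RULES is treated as a terminal (a word).
RULES = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP"), ("V",)],
    "Det": [("the",)],
    "N": [("dog",), ("cat",)],
    "V": [("chased",), ("slept",)],
}

def derives(symbols, words):
    """Top-down: expand the leftmost symbol until the terminals match the words."""
    if not symbols:
        return not words                  # success only if all input is consumed
    first, rest = symbols[0], symbols[1:]
    if first not in RULES:                # terminal: must match the next word
        return bool(words) and words[0] == first and derives(rest, words[1:])
    # Non-terminal: try every rule that rewrites it.
    return any(derives(rhs + rest, words) for rhs in RULES[first])

print(derives(("S",), tuple("the dog chased the cat".split())))
print(derives(("S",), tuple("dog the chased".split())))
```

A bottom-up parser works in the opposite direction, starting from the words and combining them until it reaches S, as CKY does.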

### Chomsky Normal Form (CNF)
Transforms the grammar so that every right-hand side is either exactly two non-terminals or a single terminal, i.e. rules have the form $A \rightarrow BC$ or $A \rightarrow a$.

Transforming a rule of the form $A \rightarrow BCD$ is done as follows:
* $A \rightarrow X D$
* $X\rightarrow BC$

The same applies to longer right-hand sides: we repeat the step until every rule is binary.

Example: $A \rightarrow BCDE$ can be binarized as follows:
* $A \rightarrow X Y$
* $X\rightarrow BC$
* $Y\rightarrow DE$

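Rules whose right-hand side is already two non-terminals or a single terminal are left untouched and copied directly to the new grammar. (A full CNF conversion also has to eliminate unit productions $A \rightarrow B$ and terminals mixed into longer right-hand sides.)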
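
The binarization step can be sketched in a few lines of Python. This is an illustrative sketch only: the `binarize` function and the fresh names `X1`, `X2`, ... are assumptions (they must not clash with existing non-terminals), and it always splits off the two leftmost symbols, so for $A \rightarrow BCDE$ it gives a left-branching result rather than the balanced one shown above. Both are valid binarizations.

```python
from itertools import count

def binarize(rules):
    """Binarize rules given as (lhs, rhs_tuple) pairs; short rules are copied unchanged."""
    fresh = count(1)
    out = []
    for lhs, rhs in rules:
        rhs = tuple(rhs)
        while len(rhs) > 2:
            new = f"X{next(fresh)}"        # hypothetical fresh non-terminal name
            out.append((new, rhs[:2]))     # X -> B C
            rhs = (new,) + rhs[2:]         # remaining rule becomes A -> X D ...
        out.append((lhs, rhs))
    return out

for rule in binarize([("A", ("B", "C", "D", "E")), ("S", ("A", "B"))]):
    print(rule)
```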

### CKY Recognition
* Dynamic programming approach to building parse trees.
* Requires all rules to be in Chomsky Normal Form
    * This allows the algorithm to encode the parse tree as a two-dimensional array

Algorithm
```
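CKY(S, G) -> T
    let N = len(S)                  // words are S[1]...S[N]
    let T = table of empty sets with rows 0...N-1 and columns 1...N
                                    // T[i, j] covers words i+1...j

    for j = 1...N:

        // FILL OUT DIAGONAL: non-terminals that produce word j
        for all A which satisfy (A -> S[j] in G):
            T[j-1, j] = T[j-1, j] UNION {A}

        // FILL OUT SUPER-DIAGONAL: rest of column j, from the bottom up
        for i = j-2 down to 0:
            for k = i+1...j-1:
                for all A which satisfy (A -> B C in G)
                    and (B in T[i, k]) // left
                    and (C in T[k, j]): // below
                        T[i, j] = T[i, j] UNION {A}

    return T                        // S is recognized iff the start symbol is in T[0, N]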
```
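
The pseudocode above can be sketched in Python as follows. This is a minimal sketch under assumptions: the `lexical` and `binary` dictionaries are a hypothetical encoding of a CNF grammar, and the toy grammar is made up for illustration.

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """Return True if `words` can be derived from `start` with a CNF grammar.

    lexical: dict word -> set of A with a rule A -> word
    binary:  dict (B, C) -> set of A with a rule A -> B C
    """
    n = len(words)
    table = defaultdict(set)                   # table[i, j] covers words[i:j]
    for j in range(1, n + 1):
        # Diagonal: non-terminals that produce the j-th word.
        table[j - 1, j] |= lexical.get(words[j - 1], set())
        # Rest of column j, from the bottom up.
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):          # split point
                for b in table[i, k]:          # left part
                    for c in table[k, j]:      # right part (cell below)
                        table[i, j] |= binary.get((b, c), set())
    return start in table[0, n]

lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
binary = {("NP", "VP"): {"S"}, ("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}}

print(cky_recognize("the dog chased the cat".split(), lexical, binary))
print(cky_recognize("the dog the cat".split(), lexical, binary))
```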

### CKY Parsing
The above algorithm only builds a table that tells us whether a parse exists, not the parse itself.
To recover the parses, the algorithm needs two additions:
* Each non-terminal also stores backpointers to the table entries it was derived from
* Multiple versions of the same non-terminal are allowed in a cell

A parse tree is then recovered by following the backpointers from $S$ in cell $T[0, N]$. With a PCFG, a Viterbi-style variant keeps only the most probable derivation of each non-terminal per cell.
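
The two additions can be sketched as follows. This is an illustrative sketch, not a reference implementation: the grammar encoding (`lexical`, `binary`), the nested-tuple tree format and the toy PP-attachment grammar are assumptions made for the example. Each cell maps a non-terminal to a list of backpointers, so the same non-terminal can be derived in several ways.

```python
def cky_parse(words, lexical, binary, start="S"):
    """Return all parse trees of `words` as nested tuples, via backpointers."""
    n = len(words)
    # back[i, j][A] lists how A was derived over words[i:j]:
    # either the word itself, or (k, B, C) for A -> B C with split point k.
    back = {(i, j): {} for i in range(n) for j in range(i + 1, n + 1)}
    for j in range(1, n + 1):
        for a in lexical.get(words[j - 1], ()):
            back[j - 1, j].setdefault(a, []).append(words[j - 1])
        for i in range(j - 2, -1, -1):
            for k in range(i + 1, j):
                for b in back[i, k]:
                    for c in back[k, j]:
                        for a in binary.get((b, c), ()):
                            back[i, j].setdefault(a, []).append((k, b, c))

    def build(a, i, j):
        """Follow the backpointers of non-terminal `a` in cell (i, j)."""
        for entry in back[i, j].get(a, []):
            if isinstance(entry, str):         # lexical rule A -> word
                yield (a, entry)
            else:                              # binary rule A -> B C
                k, b, c = entry
                for left in build(b, i, k):
                    for right in build(c, k, j):
                        yield (a, left, right)

    return list(build(start, 0, n)) if n else []

# Hypothetical ambiguous toy grammar (PP attachment).
lexical = {"I": {"NP"}, "saw": {"V"}, "stars": {"NP"},
           "with": {"P"}, "telescopes": {"NP"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("VP", "PP"): {"VP"},
          ("NP", "PP"): {"NP"}, ("P", "NP"): {"PP"}}

trees = cky_parse("I saw stars with telescopes".split(), lexical, binary)
for tree in trees:
    print(tree)
```

The example sentence has two parses, one attaching the prepositional phrase to the verb phrase and one attaching it to the noun phrase, which is the ambiguity problem in miniature.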


### Statistical Parsing (PCFG)
Probabilistic version of CFG, each rule is assigned a probability - usually derived from empirical data.

$$PCFG = (N, \Sigma, R, S)$$
* N: Set of non terminals
* $\Sigma$: Set of terminals
* R: Set of rules/productions of the form $A \rightarrow  \beta [p]$
    * A: non-terminal
    * $\beta$: String of symbols produced
    * p: the probability of choosing $\beta$, given A, $p(\beta|A)$
        * Must satisfy $\sum_\beta p(A\rightarrow \beta) = 1$ for every non-terminal A
* S: Start symbol
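
The normalization constraint is easy to check mechanically. The sketch below assumes a hypothetical encoding of the rules as a dictionary from `(lhs, rhs)` to probability; the toy probabilities are invented.

```python
from collections import defaultdict
import math

# Hypothetical toy PCFG rules: (lhs, rhs) -> p(rhs | lhs).
PCFG_RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.7,
    ("NP", ("NP", "PP")): 0.3,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("VP", "PP")): 0.4,
}

def is_normalized(rules, tol=1e-9):
    """Check that the probabilities of all rules with the same LHS sum to 1."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return all(math.isclose(t, 1.0, abs_tol=tol) for t in totals.values())

print(is_normalized(PCFG_RULES))
```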

#### Disambiguation with PCFGs
Probabilities can be used to choose the most likely parse tree, by looking at the conditional probability for generating it.

Let the i'th rule be defined as $LHS_i \rightarrow RHS_i$.

The joint probability of a parse tree $T$ that uses $n$ rules and a sentence $S$ is then:

$$P(T,S) = \prod_{i=1}^n P(RHS_i \ | \ LHS_i)$$

$P(T,S)$ is both the joint probability of the parse and the sentence, and the probability of the parse $P(T)$.
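
The product over rules can be computed by walking the tree. The following is a small sketch with invented rule probabilities; the nested-tuple tree format and the names `P` and `tree_prob` are assumptions for illustration.

```python
# Hypothetical rule probabilities p(rhs | lhs), including lexical rules.
P = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 0.6,
    ("VP", ("V",)): 0.4,
    ("Det", ("the",)): 1.0,
    ("N", ("dog",)): 0.5,
    ("N", ("cat",)): 0.5,
    ("V", ("chased",)): 1.0,
}

def tree_prob(tree):
    """Product of the probabilities of all rules used in the tree.

    A tree is (label, child, ...); a leaf child is a plain string (a word).
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = P[(label, rhs)]                    # probability of the rule at this node
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)              # multiply in the rules used below
    return p

t = ("S",
     ("NP", ("Det", "the"), ("N", "dog")),
     ("VP", ("V", "chased"), ("NP", ("Det", "the"), ("N", "cat"))))
print(tree_prob(t))
```

Here the only rules with probability below 1 are the two noun rules (0.5 each) and $VP \rightarrow V \ NP$ (0.6), so the product is $0.5 \cdot 0.6 \cdot 0.5 = 0.15$ up to floating-point rounding.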

__Explanation__
* Joint probability of T, S: $P(T,S) = P(S|T) \cdot P(T) = P(T|S) \cdot P(S)$

* Since a parse tree $T$ includes all words in $S$, the probability $P(w_1, w_2, ..., w_n|T)$ is 1 - i.e. $P(S|T) = 1$.
* This means $P(T,S) = P(S|T)\cdot P(T) = 1 \cdot P(T) = P(T)$

__Disambiguation__

The words of $S$ are called the yield of a parse tree $T$ over $S$.

Disambiguation is then picking the most probable parse tree among those that yield $S$:

$$T^*(S) = argmax_T \ P(T|S)$$

We can rewrite the conditional $P(T|S) = \frac{P(T,S)}{P(S)}$ and therefore achieve:

$$T^*(S) = argmax_T \ \frac{P(T,S)}{P(S)}$$

As shown above, $P(T,S)$ is equal to $P(T)$, and $P(S)$ is the same for every candidate tree, so it does not affect which tree maximizes the expression:

$$T^*(S) = argmax_T \ P(T)$$

This means that, among the trees that yield $S$, it is enough to choose the one with the highest probability $P(T)$ to get the most likely parse.
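
As a tiny illustration of the argmax, assume we already have two candidate parses of the same sentence, each described by the probabilities of the rules it uses. The candidate names and numbers below are invented.

```python
import math

# Hypothetical candidates for one sentence: name -> probabilities of the rules used.
candidates = {
    "attach PP to VP": [1.0, 0.4, 0.6, 0.7, 0.7],
    "attach PP to NP": [1.0, 0.6, 0.3, 0.7, 0.7],
}

def best_parse(cands):
    """argmax over T of P(T), with P(T) the product of the rule probabilities in T."""
    return max(cands, key=lambda name: math.prod(cands[name]))

print(best_parse(candidates))
```

For long sentences the product becomes very small, so implementations typically sum log probabilities instead; the argmax is unchanged.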

# Dependency Parsing

* Dependency parsing relies on __Dependency Grammars__
* Constituency parsing relies on __Context Free Grammars__

Idea:
* Phrase structure is not important
* Syntactic structure is important

__Typed Dependency__
* Dependencies between words belong to a specific class (type)
    * e.g. det, root, nsubj, nmod
* The structure of a sentence is described by directed relations between the lexical items (words)

__Free word order__

Some languages have very relaxed rules when it comes to word order
* This means many CFG rules would be needed, which makes a constituency grammar infeasible
* A dependency grammar has one relation per word, pointing to another lexical item, no matter the language

__Grammatical Relation (Binary relations)__
* Head: the primary noun in a noun phrase or the verb in a verb phrase
* Dependent: a word directly linked to a head. In a dependency grammar, the head-dependent relationship arises from links between a head and the words immediately dependent on it

Grammatical Function
* Role of the dependent, relative to the head
* Subject, direct object, indirect object
* In English, strongly correlated with word position
    * Not in many other languages
    
Dependency Parsing Formalism
* Model as a directed graph $G = (V, A)$, where $V$ is the set of vertices and $A$ is the set of arcs
* $V:$ words, stems, affixes, punctuation
* $A:$ Grammatical function relationships
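
A dependency graph for a short sentence can be written down directly as the pair $(V, A)$. The sketch below is illustrative: the `Arc` type, the artificial `ROOT` vertex at index 0 and the `obj` label are assumptions, not part of the formalism above.

```python
from typing import NamedTuple

class Arc(NamedTuple):
    head: int        # index of the head in V (0 is an artificial ROOT vertex)
    dependent: int   # index of the dependent in V
    label: str       # grammatical function, e.g. "nsubj"

# "the dog chased the cat", with ROOT added at index 0.
V = ["ROOT", "the", "dog", "chased", "the", "cat"]
A = [
    Arc(0, 3, "root"),
    Arc(3, 2, "nsubj"),
    Arc(2, 1, "det"),
    Arc(3, 5, "obj"),
    Arc(5, 4, "det"),
]

for arc in A:
    print(f"{arc.label}({V[arc.head]}, {V[arc.dependent]})")
```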


Dependency Tree Constraints
* One root node - with no incoming arcs
* Each node has exactly 1 incoming arc (except root)
* There exists a unique path from the root to all other vertices
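
The three constraints can be checked on a set of arcs. This is a minimal sketch assuming nodes are numbered `0..n_nodes-1` and arcs are `(head, dependent)` pairs; the function name is made up.

```python
def is_dependency_tree(n_nodes, arcs, root=0):
    """Check the three tree constraints for (head, dependent) arcs."""
    heads = {}
    for head, dep in arcs:
        if dep == root or dep in heads:    # arc into the root, or a second incoming arc
            return False
        heads[dep] = head
    if len(heads) != n_nodes - 1:          # some non-root node has no incoming arc
        return False
    # With one head per node, the root reaches a node iff following heads
    # upwards from that node ends at the root (and the path is then unique).
    for node in range(n_nodes):
        seen = set()
        while node != root:
            if node in seen or node not in heads:   # cycle or unknown node
                return False
            seen.add(node)
            node = heads[node]
    return True

print(is_dependency_tree(6, [(0, 3), (3, 2), (2, 1), (3, 5), (5, 4)]))
print(is_dependency_tree(3, [(1, 2), (2, 1)]))
```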

Projectivity
* An arc is projective iff there exists a path from head to every word between the head and dependent
* A Dependency Tree is projective iff all arcs are projective
    * I.e. no crossing arcs
* Languages with flexible word order often produce non-projective trees
* Trees derived from CFGs are projective
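
The definition translates into a direct check. The sketch below assumes the tree is given as a list `heads`, where `heads[d]` is the head of word `d`, index 0 is an artificial root with `heads[0] = None`, and the input is a valid tree (no cycles). The encoding and function name are assumptions for illustration.

```python
def is_projective(heads):
    """True iff every arc has a path from its head to each word it spans."""
    def dominated_by(node, ancestor):
        # Follow heads upwards from `node` and see whether we meet `ancestor`.
        while node is not None:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head is None:                   # the root has no incoming arc
            continue
        lo, hi = sorted((head, dep))
        # Every word strictly between head and dependent must descend from head.
        if not all(dominated_by(w, head) for w in range(lo + 1, hi)):
            return False
    return True

# "ROOT the dog chased the cat": det(dog, the), nsubj(chased, dog), ...
print(is_projective([None, 2, 3, 0, 5, 3]))
# Abstract four-word example where arc 3 -> 1 crosses arc 0 -> 2.
print(is_projective([None, 3, 0, 2, 2]))
```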

Dependency parsing
* Lexical head: N = head(NP) and V = head(VP) 
* Head is the most important word in a phrase

## Exercises:

__What makes dependency parsing better than constituency parsing when dealing with languages with flexible word orders?__
* Constituency parsing requires a CFG with rules for every possible word order, whereas dependency parsing uses a single head-dependent relation that covers all possible word orderings

__What are the characteristics of the parses generated through dependency parsing that make them more suitable for tasks such as coreference resolution or question answering?__
* Dependency parsing finds the head-dependent relationships directly, whereas constituency parsing requires these relationships to be given beforehand.

__What are the three restrictions that apply to dependency trees?__
* Exactly one root node with no incoming arcs
* All nodes except the root node have exactly one incoming arc
* There is a unique path from the root node to every other node in the graph

__An additional constraint is applied to dependency trees, projectivity. What does it mean and why is it important?__

What is it?
* Projectivity: an arc is projective iff there is a path from the head to every word between the head and the dependent.
    * Equivalently, no arcs cross when the words are laid out in sentence order.
    * A tree consisting of only projective arcs is said to be projective.

Why is it important?
* The common transition-based parsing approaches can only produce projective trees, so sentences with non-projective structures will necessarily be parsed with some errors.
* English dependency treebanks were derived from phrase-structure treebanks through the use of head-finding rules, which always produce projective trees.

__There are two dominant approaches for dependency parsing, transition-based and graph-based. What are their main advantages and disadvantages?__
* Transition-based: linear time with respect to the number of words, but based on a greedy algorithm (unless beam search is used)
* Graph-based: exhaustive search over the possible trees, which is much slower

