#### Dependency Grammar and Parsing

Previously we looked at `constituency grammars`, which describe the syntactic structure of sentences in terms of `hierarchical/nested phrasal constituents`. Another common and useful grammar formalism, which we will now explore, is the `dependency grammar`. In a dependency grammar, the syntax of a sentence is described entirely by `binary asymmetric grammatical relations` between words, called `dependencies`. Such a relation can be depicted by a `labelled arrow` that goes from a `head` word to its `dependent` word. The dependents of a head word act as modifiers or arguments of that head.

All the dependency relations in a sentence are then captured in a `directed-acyclic-graph`, which we call a `dependency tree`, as shown in the example below. 

<img src="dependency_tree_example.png" width="430" height="150">

The `head of a sentence is usually a tensed verb`, also called the `predicate` (in the example above, the verb "cancelled"), and every other word connects to this head through a dependency path. Each word is a dependent of exactly one head. A designated root node is added as the head of the predicate, which is itself the head of the entire sentence.
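The structural constraints above (one head per word, every word connected to the root) can be checked mechanically. The following is a minimal sketch, assuming a tree is stored as a list of 1-based head indices with 0 standing for the root; the function name `is_valid_dependency_tree`, the single-root requirement, and the hand-written head lists are illustrative assumptions, not part of this notebook's code.

```python
def is_valid_dependency_tree(heads):
    """heads[i] is the 1-based index of the head of word i+1; 0 means ROOT."""
    n = len(heads)
    # every head must be ROOT or another word in the sentence (no self-loops)
    if any(h < 0 or h > n or h == i + 1 for i, h in enumerate(heads)):
        return False
    # assumption: exactly one word (the predicate) attaches directly to ROOT
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # following head links from any word must reach ROOT without revisiting a word
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False  # cycle detected
            seen.add(node)
            node = heads[node - 1]
    return True

# "Book me the morning flight":
# Book <- ROOT, me <- Book, the <- flight, morning <- flight, flight <- Book
print(is_valid_dependency_tree([0, 1, 5, 5, 1]))  # True
# words 2 and 3 point at each other, so they never reach ROOT
print(is_valid_dependency_tree([0, 3, 2, 1, 1]))  # False
```

Storing only the head index per word works because each word has exactly one head, so the whole tree fits in a flat list.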

Dependency relations can be broadly classified into two main categories: `clausal argument relations` and `nominal modifier relations`. Clausal relations describe the syntactic roles that words play with respect to the predicate, such as `nominal/noun subject` (the word "United" in the example) and `direct/indirect object` (the word "flights" in the example). Modifier relations categorize the ways in which words can modify their heads, such as `adjectival modifiers` (adjectives), `nominal modifiers` (nouns), `determiners` and `case modifiers` (prepositions). In the example above, for the phrase "morning flights", the head word is "flights" and the dependent word "morning" is a nominal modifier of this head.

Given any sentence, the goal of a `dependency parser` is to generate the dependency tree of that sentence. There are two main types of dependency parsing algorithms:

1) Greedy Transition-Based Parsers

2) Graph-Based Parsers

Transition-based parsers are implemented in terms of a `state machine`. Parsing involves starting from an initial state and executing a sequence of `shift-reduce` operations to reach a goal/terminal state. An `oracle` is used to decide which operation to execute at each step. Such an oracle is trained using supervised machine learning.

On the other hand, a graph-based method starts with a `fully-connected graph`, where the words are the vertices and the edges represent all possible head-dependent assignments. A `scoring model`, which can also be trained using supervised machine learning, assigns a weight/score to each edge (along with scores for all possible labels of that edge). Parsing then amounts to finding the tree with the largest sum of edge scores, which can be done by constructing a `maximum spanning tree` from the initial fully-connected graph.
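To make the "largest sum of edge scores" objective concrete, here is a minimal sketch that finds the best tree by brute force on a toy three-word sentence. It is not an efficient maximum-spanning-tree algorithm (practical parsers use something like Chu-Liu/Edmonds); it simply enumerates every head assignment. The score matrix is made up for illustration, edge labels are ignored, and the names `best_tree_brute_force` and `reaches_root` are hypothetical.

```python
from itertools import product

def reaches_root(heads):
    """True if every word reaches ROOT (index 0) by following head links."""
    for start in range(1, len(heads) + 1):
        node, steps = start, 0
        while node != 0:
            node = heads[node - 1]
            steps += 1
            if steps > len(heads):
                return False  # stuck in a cycle
    return True

def best_tree_brute_force(scores):
    """scores[h][d] is the score of the edge head h -> dependent d (0 is ROOT).
    Returns (best_total, heads), where heads[d-1] is the chosen head of word d."""
    n = len(scores) - 1  # number of words
    best_total, best_heads = float("-inf"), None
    # try every way of giving each of the n words one head from 0..n
    for heads in product(range(n + 1), repeat=n):
        if not reaches_root(heads):
            continue  # not a tree
        total = sum(scores[h][d] for d, h in enumerate(heads, start=1))
        if total > best_total:
            best_total, best_heads = total, list(heads)
    return best_total, best_heads

# hypothetical edge scores for a 3-word sentence; row = head, column = dependent
scores = [
    [0, 9, 1, 2],  # ROOT  -> word1, word2, word3
    [0, 0, 2, 8],  # word1 -> ...
    [0, 1, 0, 3],  # word2 -> ...
    [0, 2, 7, 0],  # word3 -> ...
]
total, heads = best_tree_brute_force(scores)
print(total, heads)  # 24 [0, 3, 1]
```

The enumeration grows as $(n+1)^n$, so it is only useful for checking intuition on very short sentences.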

In this notebook, we will look at some of the basic ideas behind a greedy transition-based parser, and create a `training oracle` for generating training data, which we will later use in a different notebook to train an oracle using supervised machine learning.

#### Greedy Transition-Based Parsing Algorithm:

For this algorithm, we have the following components: a `stack`, a `buffer`, a `list of dependency relations`, a `set of operations` and an `oracle`. The stack is initialized with the designated `ROOT` of the tree and the buffer is the list of words of the sentence to be parsed. At each step, the oracle can choose from the following actions: `LEFTARC`, `RIGHTARC` and `SHIFT`.

The `LEFTARC` operation assigns a head-dependent relation between the word at the top of the stack (the head) and the second word on the stack (the dependent), then removes the second word. The second word cannot be the `ROOT`.

The `RIGHTARC` operation assigns a head-dependent relation between the second word on the stack and the top word on the stack, then removes the top word. 

The `SHIFT` operation takes the first word from the front of the buffer and places it on the top of the stack.

`LEFTARC` and `RIGHTARC` are also called `reduce operations`. Each time one of these operations is executed, we add the corresponding head-dependent relation to the list of dependency relations. Note that these operations create unlabeled dependency relations. In order to accommodate labeled dependency relations, we need separate `LEFTARC` and `RIGHTARC` operations for each possible relation, e.g. for the direct-object label we would have `LEFTARC-DOBJ` and `RIGHTARC-DOBJ`.
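With labels, the action set grows to one `SHIFT` plus two arc actions per label. The following sketch builds such an action vocabulary from a list of labels; the helper name `build_action_vocabulary` and the three sample labels are assumptions for illustration, and the labels are kept in whatever casing they are given.

```python
def build_action_vocabulary(labels):
    """Return SHIFT plus one LEFTARC and one RIGHTARC action per distinct label."""
    actions = ['SHIFT']
    for label in sorted(set(labels)):
        actions.append(f'LEFTARC-{label}')
        actions.append(f'RIGHTARC-{label}')
    return actions

# hypothetical label inventory with three relations
vocab = build_action_vocabulary(['nsubj', 'dobj', 'det'])
print(len(vocab))  # 7, i.e. 1 + 2 * 3
print(vocab)
```

A classifier-based oracle would then predict one of these action strings (or its index) at every step.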

The state/configuration of the parse is defined by the state of the stack, buffer and dependency relations list. The goal/terminal state is the state where the stack only contains the `ROOT` and the buffer is empty. Parsing involves starting from the initial state and performing a sequence of operations (chosen by the oracle) to arrive at the goal state. Since this is a greedy algorithm, once an operation is executed, the new state cannot be undone, so a single wrong operation can lead to the parse being incorrect at the end. 

We will consider a simplified example of parsing where we ignore the dependency labels and assume a perfect oracle.

Example sentence: "Book me the morning flight"

| Step | Stack | Buffer | Operation | Relation Added |
|------|-------|--------|-----------|----------------|
| 0 | [ROOT] | [Book, me, the, morning, flight] | SHIFT | |
| 1 | [ROOT, Book] | [me, the, morning, flight] | SHIFT | |
| 2 | [ROOT, Book, me] | [the, morning, flight] | RIGHTARC | (Book $\to$ me) |
| 3 | [ROOT, Book] | [the, morning, flight] | SHIFT | |
| 4 | [ROOT, Book, the] | [morning, flight] | SHIFT | |
| 5 | [ROOT, Book, the, morning] | [flight] | SHIFT | |
| 6 | [ROOT, Book, the, morning, flight] | [] | LEFTARC | (morning $\gets$ flight) |
| 7 | [ROOT, Book, the, flight] | [] | LEFTARC | (the $\gets$ flight) |
| 8 | [ROOT, Book, flight] | [] | RIGHTARC | (Book $\to$ flight) |
| 9 | [ROOT, Book] | [] | RIGHTARC | (ROOT $\to$ Book) |
| 10 | [ROOT] | [] | DONE | |
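The trace in the table can be replayed in a few lines of code. This is a minimal sketch of the unlabeled transition system, assuming plain strings for words (which only works here because no word repeats in the sentence); the function name `apply_action` is hypothetical and the action sequence is copied from the table rather than chosen by an oracle.

```python
def apply_action(stack, buffer, relations, action):
    """Apply one unlabeled transition in place.
    The top of the stack is stack[-1]; the front of the buffer is buffer[0]."""
    if action == 'SHIFT':
        stack.append(buffer.pop(0))
    elif action == 'LEFTARC':
        # the top of the stack is the head of the second item, which is removed
        assert stack[-2] != 'ROOT', "ROOT cannot be a dependent"
        dependent = stack.pop(-2)
        relations.append((stack[-1], dependent))
    elif action == 'RIGHTARC':
        # the second item is the head of the top item, which is removed
        dependent = stack.pop()
        relations.append((stack[-1], dependent))
    else:
        raise ValueError(f"unknown action: {action}")

stack = ['ROOT']
buffer = ['Book', 'me', 'the', 'morning', 'flight']
relations = []  # (head, dependent) pairs in the order they are created
for action in ['SHIFT', 'SHIFT', 'RIGHTARC', 'SHIFT', 'SHIFT', 'SHIFT',
               'LEFTARC', 'LEFTARC', 'RIGHTARC', 'RIGHTARC']:
    apply_action(stack, buffer, relations, action)

print(stack, buffer)  # ['ROOT'] []
print(relations)
# [('Book', 'me'), ('flight', 'morning'), ('flight', 'the'), ('Book', 'flight'), ('ROOT', 'Book')]
```

The parse ends in the goal state (only `ROOT` on the stack, empty buffer) after ten operations: one `SHIFT` and one reduce operation per word.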


To train a neural-network-based oracle, we need to pair features extracted from the current state of a parse with the corresponding ground-truth operation that needs to be executed next. We will draw instances from the Penn Treebank dataset, which contains full dependency parse trees. We will then create (state features, next operation) pairs from these trees.


In [25]:
""" 

Our training data consists of dependency parse trees expressed in `CoNLL-U format`. An example of a parsed sentence in this format is shown below:

1	The	_	DET	DT	_	4	det	_	_
2	luxury	_	NOUN	NN	_	4	compound	_	_
3	auto	_	NOUN	NN	_	4	compound	_	_
4	maker	_	NOUN	NN	_	7	nsubj	_	_
5	last	_	ADJ	JJ	_	6	amod	_	_
6	year	_	NOUN	NN	_	7	nmod:tmod	_	_
7	sold	_	VERB	VBD	_	0	root	_	_
8	1,214	_	NUM	CD	_	9	nummod	_	_
9	cars	_	NOUN	NNS	_	7	dobj	_	_
10	in	_	ADP	IN	_	12	case	_	_
11	the	_	DET	DT	_	12	det	_	_
12	U.S.	_	PROPN	NNP	_	7	nmod	_	_


Each line has a sequence of tab separated fields:  `TOKEN_ID    WORD_FORM   LEMMA   U_POS   X_POS   FEATS   HEAD_ID    DEPREL   DEPS    MISC`

where LEMMA is the base form of the word, U_POS is the universal part-of-speech tag, and X_POS is the language-specific part-of-speech tag. The HEAD_ID is the id of the token that is the parent of the current token in the parse tree, and DEPREL is the dependency relation between the current token and its parent. The DEPS field is a list of secondary dependencies, and the MISC field is a catch-all for other information.

A lot of these fields are blank in the file containing our dataset, because we don't need that information for our task. We will only use the WORD_FORM, X_POS, HEAD_ID, and DEPREL fields.

"""

import os

def read_conllu(file_path):
    """
    Read a CoNLL-U file and return a list of sentences, where each sentence is a list of dictionaries, one for each token.
    """
    field_names = ['id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc']
    with open(file_path, 'r', encoding='utf-8') as f:
        # sentences are separated by a blank line
        sentences = f.read().strip().split('\n\n')
    examples = []
    for sentence in sentences:
        token_dicts = []
        for line in sentence.split('\n'):
            # skip empty lines and comment lines
            if not line.strip() or line.startswith('#'):
                continue
            fields = dict(zip(field_names, line.split('\t')))
            # only keep form, xpostag, head, and deprel
            token_dicts.append({key: fields[key] for key in ('form', 'xpostag', 'head', 'deprel')})
        examples.append(token_dicts)
    return examples
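As a quick illustration of the ten-field layout, the sketch below splits a single token line from the example sentence into a dictionary. The helper name `parse_conllu_line` and the constant `CONLLU_FIELDS` are hypothetical and exist only for this example; the field keys follow the ones used in `read_conllu`.

```python
CONLLU_FIELDS = ['id', 'form', 'lemma', 'upostag', 'xpostag',
                 'feats', 'head', 'deprel', 'deps', 'misc']

def parse_conllu_line(line):
    """Split one tab-separated CoNLL-U token line into a field dictionary."""
    values = line.rstrip('\n').split('\t')
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected {len(CONLLU_FIELDS)} fields, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))

# token 7 of the example sentence: the root verb "sold"
token = parse_conllu_line("7\tsold\t_\tVERB\tVBD\t_\t0\troot\t_\t_")
print(token['form'], token['xpostag'], token['head'], token['deprel'])  # sold VBD 0 root
```

A `head` value of `0` marks the token attached to the root, which is why the later code maps it to `ROOT`.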

In [26]:
data_train = read_conllu(os.path.join('data', 'train.conll'))
data_val = read_conllu(os.path.join('data', 'dev.conll'))

print(f"Number of sentences in the training data: {len(data_train)}")
print(f"Number of sentences in the validation data: {len(data_val)}")

Number of sentences in the training data: 39832
Number of sentences in the validation data: 1700


In [88]:
# function for extracting all the tokens and labelled head-dependency relations from the data
def get_tokens_relations(data_instance):
    """
    Extract all the labeled dependency relations from the data.
    """
    tokens = []
    relations = []
    for token_id, token in enumerate(data_instance):
        head_id = int(token['head'])
        if head_id == 0:
            head = 'ROOT'
        else:
            head = data_instance[head_id - 1]['form']
        dependent = token['form']
        tokens.append((dependent, token_id+1))
        relation = token['deprel']
        relations.append(((head, head_id), (dependent, token_id+1), relation))
    return tokens, relations


In [89]:
example = data_train[5]
print(example)

tokens, relations = get_tokens_relations(example)
print(tokens)
print(relations)

[{'form': 'BELL', 'xpostag': 'NNP', 'head': '3', 'deprel': 'compound'}, {'form': 'INDUSTRIES', 'xpostag': 'NNP', 'head': '3', 'deprel': 'compound'}, {'form': 'Inc.', 'xpostag': 'NNP', 'head': '4', 'deprel': 'nsubj'}, {'form': 'increased', 'xpostag': 'VBD', 'head': '0', 'deprel': 'root'}, {'form': 'its', 'xpostag': 'PRP$', 'head': '6', 'deprel': 'nmod:poss'}, {'form': 'quarterly', 'xpostag': 'JJ', 'head': '4', 'deprel': 'dobj'}, {'form': 'to', 'xpostag': 'TO', 'head': '9', 'deprel': 'case'}, {'form': '10', 'xpostag': 'CD', 'head': '9', 'deprel': 'nummod'}, {'form': 'cents', 'xpostag': 'NNS', 'head': '4', 'deprel': 'nmod'}, {'form': 'from', 'xpostag': 'IN', 'head': '12', 'deprel': 'case'}, {'form': 'seven', 'xpostag': 'CD', 'head': '12', 'deprel': 'nummod'}, {'form': 'cents', 'xpostag': 'NNS', 'head': '4', 'deprel': 'nmod'}, {'form': 'a', 'xpostag': 'DT', 'head': '14', 'deprel': 'det'}, {'form': 'share', 'xpostag': 'NN', 'head': '12', 'deprel': 'nmod:npmod'}, {'form': '.', 'xpostag': '.'

#### Training Oracle

We will now implement the training oracle, whose job is to predict the next operation for a given parse state using the parsed sentences from our dataset. This is easy to do. Given a state, the next action must be chosen as follows:

* Choose `LEFTARC-label` if $(S_1, S_2, label) \in R_p$

* Choose `RIGHTARC-label` if $(S_2, S_1, label) \in R_p$ and for every $(S_1, w, label') \in R_p$ we also have $(S_1, w, label') \in R_c$, i.e. all dependents of $S_1$ have already been assigned.

* Choose `SHIFT` otherwise


where $S_1$ and $S_2$ denote the top and second items on the stack, $R_c$ is the set of dependency relations in the current state of the parse and $R_p$ is the set of all labelled dependency relations in the reference parse. Each relation is written as a (head, dependent, label) triple.

In [121]:
def training_oracle(instance_idx, max_iters=100, verbose=False):
    # get the tokens and relations for the reference parse
    tokens, Rp = get_tokens_relations(data_train[instance_idx])
    if verbose: print(f"refenrence parse: {Rp}")

    head_dep = [(r[0], r[1]) for r in Rp]
    labels = [r[2] for r in Rp]
    # initialize the stack (with the first token already shifted) and the buffer
    stack = [('ROOT', 0), tokens[0]]
    buffer = tokens[1:]
    Rc = []
    states = [(stack.copy(), buffer.copy(), Rc.copy())]
    actions = ['SHIFT']
    # parse the sentence to get the sequence of states and actions
    niters = 0
    
    if verbose: 
        print(f"\nStack: {stack}")
        print(f"Buffer: {buffer}")    

    while (buffer or len(stack) > 1) and niters < max_iters:
        # get top two elements of stack
        S1 = stack[-1]
        S2 = stack[-2] 
        niters += 1

        # check if LEFTARC possible
        if (S1, S2) in head_dep:
            # remove second element of stack
            stack.pop(-2)
            rel = Rp[head_dep.index((S1, S2))]
            Rc.append(rel)
            next_action = 'LEFTARC-' + rel[2]

        # check if RIGHTARC possible
        elif (S2, S1) in head_dep:
            # get all head-dependent relations with S1 as head
            S1_rels = [r for r in Rp if r[0] == S1]
            # check if all dependents of S1 are in Rc
            if all([r in Rc for r in S1_rels]):
                stack.pop(-1)
                label = labels[head_dep.index((S2, S1))]
                Rc.append((S2, S1, label))
                next_action = 'RIGHTARC-' + label
            else:
                stack.append(buffer.pop(0))
                next_action = 'SHIFT'

        # otherwise SHIFT    
        else:
            stack.append(buffer.pop(0))
            next_action = 'SHIFT'

        actions.append(next_action)
        states.append((stack.copy(), buffer.copy(), Rc.copy()))

        if verbose:
            print(f"\nAction: {next_action}")
            print(f"Stack: {stack}")
            print(f"Buffer: {buffer}")
            print(f"Rc: {Rc}")      

    if niters == max_iters:
        print("Maximum number of iterations reached!")

    # make sure Rc and Rp are consistent (compare against this instance's own reference parse)
    assert all([r in Rc for r in Rp]) and len(Rc) == len(Rp), "Rc not consistent with Rp"

    return actions, states    


In [122]:
actions, states = training_oracle(5, verbose=True)

refenrence parse: [(('Inc.', 3), ('BELL', 1), 'compound'), (('Inc.', 3), ('INDUSTRIES', 2), 'compound'), (('increased', 4), ('Inc.', 3), 'nsubj'), (('ROOT', 0), ('increased', 4), 'root'), (('quarterly', 6), ('its', 5), 'nmod:poss'), (('increased', 4), ('quarterly', 6), 'dobj'), (('cents', 9), ('to', 7), 'case'), (('cents', 9), ('10', 8), 'nummod'), (('increased', 4), ('cents', 9), 'nmod'), (('cents', 12), ('from', 10), 'case'), (('cents', 12), ('seven', 11), 'nummod'), (('increased', 4), ('cents', 12), 'nmod'), (('share', 14), ('a', 13), 'det'), (('cents', 12), ('share', 14), 'nmod:npmod'), (('increased', 4), ('.', 15), 'punct')]

Stack: [('ROOT', 0), ('BELL', 1)]
Buffer: [('INDUSTRIES', 2), ('Inc.', 3), ('increased', 4), ('its', 5), ('quarterly', 6), ('to', 7), ('10', 8), ('cents', 9), ('from', 10), ('seven', 11), ('cents', 12), ('a', 13), ('share', 14), ('.', 15)]

Action: SHIFT
Stack: [('ROOT', 0), ('BELL', 1), ('INDUSTRIES', 2)]
Buffer: [('Inc.', 3), ('increased', 4), ('its', 5), (