# Deep Transition Dependency Parser in PyTorch

- In this problem set, you will implement a deep transition dependency parser in [PyTorch](https://pytorch.org).

- You will see how more complicated network architectures can be used to effectively solve a structured prediction problem.

You will:

- Implement an arc-standard transition-based dependency parser in PyTorch
- Implement various methods of computing word embeddings
- Implement neural network components for choosing actions and combining stack elements
- Train your network to parse English and Norwegian sentences

# 0. Setup

In order to develop this assignment, you will need [python 3.6](https://www.python.org/downloads/) and the following libraries. Most if not all of these are part of [anaconda](https://www.continuum.io/downloads), so a good starting point would be to install that.

- [jupyter](http://jupyter.readthedocs.org/en/latest/install.html)
- [numpy](https://docs.scipy.org/doc/numpy/user/install.html)
- [matplotlib](http://matplotlib.org/users/installing.html)
- [nosetests](https://nose.readthedocs.org/en/latest/)
- [pytorch](https://pytorch.org)

Here is some help on installing packages in Python: https://packaging.python.org/installing/. You can use ```pip install --user``` to install locally without sudo.

## About this assignment

- This is a Jupyter notebook. You can execute cell blocks by pressing control-enter.
- Most of your coding will be in the python source files in the directory ```gtnlplib```.
- The directory ```tests``` contains unit tests that will be used to grade your assignment, using ```nosetests```. You should run them as you work on the assignment to see that you're on the right track. You are free to look at their source code, if that helps -- though most of the relevant code is also here in this notebook. Learn more about running unit tests at http://pythontesting.net/framework/nose/nose-introduction/
- You may want to add more tests, but that is completely optional. 
- **To submit this assignment, run the script ```make-submission.sh```, and submit the tarball ```pset3-submission.tgz``` on Canvas.**

In [1]:
import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as ag

import nose
import numpy as np

from importlib import reload

In [2]:
print('My library versions')

print('numpy: {}'.format(np.__version__))
print('nose: {}'.format(nose.__version__))
print('torch: {}'.format(torch.__version__))

My library versions
numpy: 1.14.2
nose: 1.3.7
torch: 0.3.1


To test whether your libraries are the right version, run:

`nosetests tests/test_environment.py`

In [3]:
# use ! to run shell commands in notebook
! nosetests tests/test_environment.py

.
----------------------------------------------------------------------
Ran 1 test in 0.001s

OK


In [4]:
import gtnlplib.parsing as parsing
import gtnlplib.data_tools as data_tools
import gtnlplib.constants as consts
import gtnlplib.evaluation as evaluation
import gtnlplib.utils as utils
import gtnlplib.feat_extractors as feat_extractors
import gtnlplib.neural_net as neural_net

In [5]:
# Read in the datasets
reload(data_tools)
en_dataset = data_tools.Dataset(consts.EN_TRAIN_FILE, consts.EN_DEV_FILE, consts.EN_TEST_FILE)
nr_dataset = data_tools.Dataset(consts.NR_TRAIN_FILE, consts.NR_DEV_FILE, consts.NR_TEST_FILE)

# Assign each word a unique index, including two special tokens needed for parsing logic
word_to_ix_en = { word: i for i, word in enumerate(en_dataset.vocab) }
word_to_ix_nr = { word: i for i, word in enumerate(nr_dataset.vocab) }

In [6]:
# Some constants to keep around
LSTM_NUM_LAYERS = 1
TEST_EMBEDDING_DIM = 4
WORD_EMBEDDING_DIM = 64
STACK_EMBEDDING_DIM = 100
NUM_FEATURES = 3

# Hyperparameters
ETA_0 = 0.01
DROPOUT = 0.0

# High-Level Overview of the Parser
Be sure that you have reviewed the notes on transition-based dependency parsing, and are familiar with the relevant terminology.
Parsing will proceed as follows:
* Initialize your parsing stack and input buffer.
* At each step, until the parse is done:
  * Extract some features.  We will start with simple features, but these can be anything: words in the sentence, the configuration of the stack, the configuration of the input buffer, the previous action, etc.
  * Send these features through a feed-forward (FF) network to get a probability distribution over actions (`SHIFT`, `ARC_L`, `ARC_R`).  The next action you choose is the one with the highest probability.
  * If the action is an arc- operation, you use a neural network to combine the two items in the operation and get a dense output to place back on the input buffer.

The key classes you will fill in code for are:
* Feature extraction in `feat_extractors.py`
* The `ParserState` class, which keeps track of the input buffer and parse stack, and offers a public interface for doing the parsing actions to update the state
* The `TransitionParser` class, which is a PyTorch module where the core parsing logic resides, in `parsing.py`.
* The neural network components in `neural_net.py`

The network components are compartmentalized as follows:
* `TransitionParser`, the base component that contains and coordinates the other substitutable components

* Embedding Lookup: You will implement three flavors of embeddings. These embeddings are used to initialize the input buffer, and will be shifted on the stack / serve as inputs to the combiner networks.
  - `VanillaWordEmbedding` just gets embeddings from a lookup table, as we used in pset 2.
  - `BiLSTMWordEmbedding` will run a sequence model in both directions over the sentence. The hidden state at step `t` is the embedding for the `t`-th word of the sentence.
  - `SuffixAndWordEmbedding` gets embeddings for words as in the vanilla embeddings, also gets embeddings for word suffixes, and concatenates the two.
* Action Choosing: You will implement two action choosing components:
  - `FFActionChooser` is a simple feed-forward neural network that outputs log probabilities over the three actions given the extracted features as input.
  - `LSTMActionChooser` applies a sequence model that takes the hidden state of the previous action decision as input.

* Combiners: You will implement two combiners, which are the network components that take the two embeddings of the items in an arc- operation and create a single vector.
  - `FFCombiner` takes the two input embeddings and gives a dense output
  - `LSTMCombiner` applies a sequence model, where the output embedding is the hidden state of the next timestep.

### Parsing example

The following is how the input buffer and stack look at each step of a parse, up to the first arc.  The input sentence is "the dog ran away".  Our action chooser network takes the top element of the stack, the top element of the input buffer, plus a one-token "lookahead" in the input buffer.  $C(x,y)$ refers to calling our combiner network on arguments $x, y$.  Also let $A$ be the set of actions: $\{ \text{SHIFT}, \text{ARC-L}, \text{ARC-R} \}$, and let $q_w$ be the embedding for word $w$.

1. 
  * Input Buffer: $\left[ q_\text{the}, q_\text{dog}, q_\text{ran}, q_\text{away}, q_\text{END-INPUT} \right]$
  * Stack: $\left[ q_\text{ROOT} \right]$
  * Action: $ \text{argmax}_{a \in A} \ \text{ActionChooser}(q_\text{ROOT}, q_\text{the}, \overbrace{q_\text{dog}}^\text{lookahead}) \Rightarrow \text{SHIFT}$
  
2.
  * Input Buffer: $\left[ q_\text{dog}, q_\text{ran}, q_\text{away}, q_\text{END-INPUT} \right]$
  * Stack: $\left[ q_\text{ROOT}, q_\text{the} \right]$
  * Action: $ \text{argmax}_{a \in A} \ \text{ActionChooser}(q_\text{the}, q_\text{dog}, q_\text{ran}) \Rightarrow \text{ARC-L}$
  
3.
  * Input Buffer: $\left[C(q_\text{dog}, q_\text{the}), q_\text{ran}, q_\text{away}, q_\text{END-INPUT} \right]$
  * Stack: $\left[ q_\text{ROOT} \right]$
  
This is a partial picture of parsing - we keep more than just the embedding on the stack and input buffer.  We also keep the word and its position in the sentence so that when we create an arc, we know what edge was just created.
So, for example, the initial input buffer really looks like

$$ \left[ (\text{the}, 0, q_\text{the}), (\text{dog}, 1, q_\text{dog}), (\text{ran}, 2, q_\text{ran}), (\text{away}, 3, q_\text{away}), (\text{END-INPUT}, 4, q_\text{END-INPUT}) \right] $$

Before beginning, I recommend completing the parse by hand, drawing the input buffer and stack at each step, and explicitly listing the arguments to the action chooser.
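If you want to check your by-hand trace, the first steps of the example can be reproduced symbolically. This is a minimal sketch that uses strings in place of embeddings; `apply_action` and the `C(...)` strings are illustrative stand-ins and are not part of `gtnlplib`.

```python
ROOT, END = "ROOT", "END-INPUT"

def apply_action(stack, buffer, action):
    """Apply one action to (stack, buffer) and return the new pair.

    Items are plain strings here; the real parser stores embeddings.
    """
    stack, buffer = list(stack), list(buffer)
    if action == "SHIFT":
        stack.append(buffer.pop(0))
    elif action == "ARC-L":
        # Head is the front of the buffer, modifier is the top of the stack.
        head, modifier = buffer.pop(0), stack.pop()
        buffer.insert(0, "C({}, {})".format(head, modifier))
    elif action == "ARC-R":
        # Head is the top of the stack, modifier is the front of the buffer.
        head, modifier = stack.pop(), buffer.pop(0)
        buffer.insert(0, "C({}, {})".format(head, modifier))
    else:
        raise ValueError("unknown action: {}".format(action))
    return stack, buffer

stack, buffer = [ROOT], ["the", "dog", "ran", "away", END]
for action in ["SHIFT", "ARC-L"]:
    stack, buffer = apply_action(stack, buffer, action)
    print(action, stack, buffer)
```

The state after the second action matches step 3 of the example: the stack holds only the root, and the combined item sits at the front of the input buffer.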

# 1. Managing and Updating the Parser State 
### (4650: 2.5 points, 7650: 1.5 points)

In this part of the assignment, you will work with the `ParserState` class, which keeps track of the parser's input buffer and stack.

### Deliverable 1.1: Implementing Arc

#### 1.1a: Get arc components (0.25 points)
You will implement the generalized arc- operation of the `ParserState` in `parsing.py`, in the function `_arc`, in two parts.

First, fill in `_get_arc_components` in `parsing.py` to select the head and modifier according to the action passed in. This method should also remove the items from the stack and input buffer.
Arc actions follow the arc-standard procedure, from section 10.3.1 in the course notes.

- **Test**: `test_parser.py:test_get_arc_components_d1_1a`

In [7]:
reload(parsing)
test_sentence = "The man ran away".split()
parser_state = parsing.ParserState(test_sentence + [consts.END_OF_INPUT_TOK], 
                                   [None] * (len(test_sentence)+1),
                                   utils.DummyCombiner())

In [8]:
parser_state.shift()
parser_state.shift()
print(parser_state)

head, modifier = parser_state._get_arc_components(consts.Actions.ARC_L)
print(head, modifier)

head, modifier = parser_state._get_arc_components(consts.Actions.ARC_R)
print(head, modifier)

Stack: ['<ROOT>', 'The', 'man']
Input Buffer: ['ran', 'away', '<END-OF-INPUT>']

StackEntry(headword='ran', headword_pos=2, embedding=None) StackEntry(headword='man', headword_pos=1, embedding=None)
StackEntry(headword='The', headword_pos=0, embedding=None) StackEntry(headword='away', headword_pos=3, embedding=None)


#### 1.1b: Create the arc (0.25 points)
Now, fill in `_create_arc` in `parsing.py` to use the `ParserState`'s `combiner` component to **combine** the passed in head and modifier, put the combination on the input buffer, and create a new dependency graph edge. At this point, we are just using a dummy combiner so we can test the logic.

You will want to familiarize yourself with the `StackEntry` and `DepGraphEdge` objects used by the `ParserState` object for this one.
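The cell outputs in this notebook suggest the shape of these two records. The sketch below defines stand-in namedtuples with the same field names as those outputs; the real definitions live in `gtnlplib` and may differ, so treat everything here as an assumption.

```python
from collections import namedtuple

# Stand-ins that mirror the reprs printed in this notebook
# (assumed, not the gtnlplib definitions).
StackEntry = namedtuple("StackEntry", ["headword", "headword_pos", "embedding"])
DepGraphEdge = namedtuple("DepGraphEdge", ["head", "modifier"])

head = StackEntry("man", 1, None)
modifier = StackEntry("The", 0, None)

# An edge records (word, position) pairs for the head and the modifier.
edge = DepGraphEdge((head.headword, head.headword_pos),
                    (modifier.headword, modifier.headword_pos))
print(edge)

# The combined entry keeps the head's word and position. Its embedding would
# come from the combiner; it is None here because this sketch has no network.
combined = StackEntry(head.headword, head.headword_pos, None)
print(combined)
```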

- **Test**: `test_parser.py:test_create_arc_d1_1b`

In [9]:
reload(parsing)
parser_state = parsing.ParserState(test_sentence + [consts.END_OF_INPUT_TOK], 
                                   [None] * (len(test_sentence)+1),
                                   utils.DummyCombiner())

print(parser_state)

parser_state.shift()
print(parser_state)

arc = parser_state.arc_left()
print("First arc: Head: {}, Modifier: {}".format(arc[0], arc[1]), "\n")
print(parser_state)

parser_state.shift()
arc = parser_state.arc_left()
print("Second arc: Head: {}, Modifier: {}".format(arc[0], arc[1]), "\n")
print(parser_state)

Stack: ['<ROOT>']
Input Buffer: ['The', 'man', 'ran', 'away', '<END-OF-INPUT>']

Stack: ['<ROOT>', 'The']
Input Buffer: ['man', 'ran', 'away', '<END-OF-INPUT>']

First arc: Head: ('man', 1), Modifier: ('The', 0) 

Stack: ['<ROOT>']
Input Buffer: ['man', 'ran', 'away', '<END-OF-INPUT>']

Second arc: Head: ('ran', 2), Modifier: ('man', 1) 

Stack: ['<ROOT>']
Input Buffer: ['ran', 'away', '<END-OF-INPUT>']



### Deliverable 1.2: Parser Terminating Condition (4650: 1 point, 7650: 0.5 points)
In this short (one line) deliverable, implement `done_parsing()` in `ParserState`.  Think about what the input buffer and stack look like at the end of a parse.

- **Test**: `test_parser.py:test_stack_terminating_cond_d1_2`

In [10]:
# reload(parsing)
parser_state = parsing.ParserState(test_sentence + [consts.END_OF_INPUT_TOK], 
                                   [None] * (len(test_sentence)+1),
                                   utils.DummyCombiner())

parser_state.shift()
parser_state.arc_left()
parser_state.shift()
parser_state.arc_left()

print(parser_state.done_parsing())

parser_state.shift()
parser_state.arc_right()
print(parser_state.done_parsing())

parser_state.arc_right()
print(parser_state.done_parsing())

parser_state.shift()
print(parser_state.input_buffer)
print(parser_state.done_parsing())

False
False
False
[StackEntry(headword='<END-OF-INPUT>', headword_pos=4, embedding=None)]
True


### Deliverable 1.3: Validating parser actions (4650: 1 point, 7650: 0.5 points)
Implement the `_validate_action` method in `parsing.ParserState`. This will be used in the prediction setting, when the gold standard is not available. We need to ensure that any action we take is legal. Here are the rules:

- You cannot shift when the input buffer has <= 2 items on it (including the end of input token), UNLESS the stack is empty.
  - **In this case, do `ARC_R` by default.**
- You cannot do an arc- operation when the stack is empty (this will happen after creating an arc with ROOT).
  - **In this case, do `SHIFT` by default.**
- You cannot do an arc-left operation when the root token is on top of the stack.
  - **In this case, do `SHIFT` or `ARC_R` depending on the state of the input buffer.**
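The rules above can be written as a small function over plain lists. This is a sketch under stated assumptions: the action labels are illustrative strings, the function name and signature are hypothetical and do not match `_validate_action`, and the third rule's fallback is read as "shift if shifting is legal, otherwise arc right".

```python
SHIFT, ARC_L, ARC_R = "SHIFT", "ARC_L", "ARC_R"  # illustrative action labels
ROOT = "<ROOT>"

def validate_action(action, stack, input_buffer):
    """Return a legal action, falling back to the defaults listed above."""
    if not stack:
        # Empty stack: no arc is possible, and shifting is always allowed.
        return SHIFT
    can_shift = len(input_buffer) > 2
    if action == SHIFT and not can_shift:
        return ARC_R
    if action == ARC_L and stack[-1] == ROOT:
        # Assumption: prefer SHIFT when it is legal, otherwise ARC_R.
        return SHIFT if can_shift else ARC_R
    return action

buffer = ["The", "man", "ran", "away", "<END-OF-INPUT>"]
# ROOT is on top of the stack and shifting is legal.
print(validate_action(ARC_L, [ROOT], buffer))
# Only two items remain in the buffer, so a shift is not allowed.
print(validate_action(SHIFT, [ROOT, "The", "man", "ran"], buffer[3:]))
```

Passing the unit test is still the real check; this sketch only covers the cases spelled out in the rules.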
  
**Test:**
- `test_parser.py:test_validate_action_d1_3`

**Make sure you pass the test before you move on. The code blocks below are not meant to be comprehensive.**

In [11]:
reload(parsing)
parser_state = parsing.ParserState(test_sentence + [consts.END_OF_INPUT_TOK], 
                                   [None] * (len(test_sentence)+1),
                                   utils.DummyCombiner())
ix_to_action = consts.Actions.ix_to_action

In [12]:
print("parser_state: ", parser_state)
act_to_do = consts.Actions.ARC_L
valid_action = parser_state._validate_action(act_to_do)
print("Chosen action: %s, Valid action: %s\n" % (ix_to_action[act_to_do], ix_to_action[valid_action]))

parser_state.shift()

print("parser_state: ", parser_state)
act_to_do = consts.Actions.ARC_L
valid_action = parser_state._validate_action(act_to_do)
print("Chosen action: %s, Valid action: %s\n" % (ix_to_action[act_to_do], ix_to_action[valid_action]))

parser_state.shift()
parser_state.shift()

print("parser_state: ", parser_state)
act_to_do = consts.Actions.SHIFT
valid_action = parser_state._validate_action(act_to_do)
print("Chosen action: %s, Valid action: %s\n" % (ix_to_action[act_to_do], ix_to_action[valid_action]))

parser_state:  Stack: ['<ROOT>']
Input Buffer: ['The', 'man', 'ran', 'away', '<END-OF-INPUT>']

Chosen action: ARC_L, Valid action: SHIFT

parser_state:  Stack: ['<ROOT>', 'The']
Input Buffer: ['man', 'ran', 'away', '<END-OF-INPUT>']

Chosen action: ARC_L, Valid action: ARC_L

parser_state:  Stack: ['<ROOT>', 'The', 'man', 'ran']
Input Buffer: ['away', '<END-OF-INPUT>']

Chosen action: SHIFT, Valid action: ARC_R



# 2. Neural Network for Action Decisions (2 points)
In this part of the assignment, you will use PyTorch to create a neural network which examines the current state of the parse and makes the decision to either shift, arc left, or arc right.

### Deliverable 2.1: Word Embedding Lookup (0.5 points)
Implement the class `VanillaWordEmbedding` in `neural_net.py`.
[Here are the docs for PyTorch embeddings](http://pytorch.org/docs/nn.html#embedding).

Hint: You will have to turn the input, which is a list of strings (the words in the sentence), into a format that your embedding lookup table can take. 
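One way to read the hint: map each word to its integer index before doing the lookup. The sketch below uses a plain list of rows as a stand-in for the lookup table so that it runs without PyTorch; `lookup` and `table` are illustrative names only. With an embedding layer, the indices would go into a `LongTensor` rather than being used to index a Python list.

```python
import random

word_to_ix = {"natural": 0, "language": 1, "processing": 2}
embedding_dim = 4

random.seed(0)
# One row of `embedding_dim` random values per vocabulary entry.
table = [[random.gauss(0.0, 1.0) for _ in range(embedding_dim)]
         for _ in range(len(word_to_ix))]

def lookup(sentence):
    """Return one embedding row per word, in sentence order."""
    indices = [word_to_ix[word] for word in sentence]
    return [table[ix] for ix in indices]

embeds = lookup("natural language processing".split())
print(len(embeds), len(embeds[0]))
```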

**Test:** `test_parser.py:test_word_embed_lookup_d2_1`

In [13]:
torch.manual_seed(765) # DO NOT CHANGE
reload(neural_net)
test_sentence = "natural language processing".split()
test_word_to_ix = { "natural": 0, "language": 1, "processing": 2 }

word_embedder = neural_net.VanillaWordEmbedding(test_word_to_ix, TEST_EMBEDDING_DIM)
embeds = word_embedder(test_sentence)
print(type(embeds))
print(len(embeds), "\n")
print("Embedding for 'natural':\n {}".format(embeds[0]))

<class 'list'>
3 

Embedding for 'natural':
 Variable containing:
-0.7945  1.7483 -0.6024 -0.8473
[torch.FloatTensor of size 1x4]



### Deliverable 2.2: Feature Extraction (0.5 points)
Fill in the `SimpleFeatureExtractor` class in `feat_extractors.py` to give the following 3 features as a list **in this order**:
* The embedding of the top of the stack
* The embedding of the first token in the input buffer
* The embedding of the next token in the input buffer (one-token lookahead)
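Here is a sketch of this ordering on a toy state, using strings where the parser stores embeddings. The function and the list-based state are hypothetical; the real `ParserState` has its own way of exposing the stack and buffer, and this sketch does not handle a buffer with fewer than two items.

```python
def simple_features(stack, input_buffer):
    """Return [top of stack, first buffer item, second buffer item]."""
    return [stack[-1], input_buffer[0], input_buffer[1]]

# Toy state after one shift on "The Sound and the Fury".
stack = ["<ROOT>", "The"]
input_buffer = ["Sound", "and", "the", "Fury"]
print(simple_features(stack, input_buffer))
```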

If at this point you have not poked around `ParserState` to see how it stores the state, now would be a good time.

**Test:** `test_parser.py:test_feature_extraction_d2_2`

In [14]:
reload(feat_extractors)
torch.manual_seed(1) # DO NOT CHANGE
test_sentence = "The Sound and the Fury".split()
test_word_to_ix = { word: i for i, word in enumerate(set(test_sentence)) }

embedder = neural_net.VanillaWordEmbedding(test_word_to_ix, TEST_EMBEDDING_DIM)
embeds = embedder(test_sentence)

state = parsing.ParserState(test_sentence, embeds, utils.DummyCombiner())

state.shift()
feat_extractor = feat_extractors.SimpleFeatureExtractor()
feats = feat_extractor.get_features(state)

print("Embedding for 'The':\n {}".format(feats[0]))
print("Embedding for 'Sound':\n {}".format(feats[1]))
print("Embedding for 'and' (from buffer lookahead):\n {}".format(feats[2]))

Embedding for 'The':
 Variable containing:
-0.4212 -0.5107 -1.5727 -0.1232
[torch.FloatTensor of size 1x4]

Embedding for 'Sound':
 Variable containing:
-0.4519 -0.1661 -1.5228  0.3817
[torch.FloatTensor of size 1x4]

Embedding for 'and' (from buffer lookahead):
 Variable containing:
 0.6614  0.2669  0.0617  0.6213
[torch.FloatTensor of size 1x4]



### Deliverable 2.3: Feedforward Network for Choosing Actions (0.5 points)
Implement the class `neural_net.FFActionChooser` according to the specification.
You will need to take the list of embeddings passed in (that come from your feature extractor) and concatenate them into one long row vector (size [1 x (num features * embedding dim)]). The output of the network has size [1 x num actions].

This network takes as input the features from your feature extractor, concatenates them, runs them through a feedforward network, and outputs log probabilities over actions.
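To make "log probabilities over actions" concrete, here is a plain-Python sketch of a log-softmax over three raw scores. The scores are arbitrary, and the helper is an illustration rather than the layer you should use in `FFActionChooser`. The last lines check that the example values printed in the cell output of this deliverable exponentiate to a distribution that sums to roughly 1.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

log_probs = log_softmax([0.3, 0.7, 0.25])  # arbitrary scores for three actions
total = sum(math.exp(lp) for lp in log_probs)
print(log_probs, total)

# Values printed in this deliverable's example output (rounded to 4 places).
printed = [-1.2443, -0.8323, -1.2844]
printed_total = sum(math.exp(lp) for lp in printed)
print(printed_total)
```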

**Test:** `test_parser.py:test_action_chooser_d2_3`

In [15]:
reload(neural_net)
torch.manual_seed(1) # DO NOT CHANGE, you can compare my output below to yours
act_chooser = neural_net.FFActionChooser(TEST_EMBEDDING_DIM * NUM_FEATURES)
feats = [ ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM)) for _ in range(NUM_FEATURES) ] # make some dummy feature embeddings
log_probs = act_chooser(feats)
print(log_probs)

Variable containing:
-1.2443 -0.8323 -1.2844
[torch.FloatTensor of size 1x3]



### Deliverable 2.4: Network for Combining Stack Items (0.5 points)
Implement the class `neural_net.FFCombiner` according to the specification. 
Recall that what this component does is take two embeddings, the head and modifier, during an arc- operation and output a combined embedding (of size [1 x embedding_dim]), which is then pushed back onto the input buffer during parsing.

**Test:** `test_parser.py:test_combiner_d2_4`

In [16]:
reload(neural_net)
torch.manual_seed(1) # DO NOT CHANGE
combiner = neural_net.FFCombiner(TEST_EMBEDDING_DIM)

# Again, make dummy inputs
head_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))
modifier_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))
combined = combiner(head_feat, modifier_feat)
print(combined)

Variable containing:
 0.4285 -0.1363  0.4046  0.6006
[torch.FloatTensor of size 1x4]



# 3. Return of the Parser (2.5 points)

### Deliverable 3.1: Parser Training Code (2 points)
We will now complete the parser and train it on our data. It is important to understand the difference between the following tasks:

* Training: Training the model involves passing it sentences along with the correct sequence of actions, and updating weights.
* Evaluation: We can evaluate the parser by passing it sentences along with the correct sequence of actions, and see how many actions it predicts correctly.  This is identical to training, except the weights are not updated after making a prediction.
* Prediction: After setting the weights, we give it a raw sentence (no gold-standard actions), and let it follow its own predicted actions to create a dependency graph, which we can compare to the ground truth.
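The evaluation setting comes down to counting matches between predicted and gold actions. A minimal sketch follows, with a hypothetical helper name and made-up predictions; it is not necessarily how `parsing.evaluate` computes its accuracy.

```python
def action_accuracy(predicted, gold):
    """Fraction of positions where the predicted action equals the gold action."""
    if len(predicted) != len(gold):
        raise ValueError("sequences must have the same length")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

gold = ["SHIFT", "ARC_L", "SHIFT", "ARC_L", "SHIFT", "ARC_R", "ARC_R", "SHIFT"]
# Made-up predictions that differ from the gold sequence at one position.
predicted = ["SHIFT", "ARC_L", "SHIFT", "SHIFT", "SHIFT", "ARC_R", "ARC_R", "SHIFT"]
print(action_accuracy(predicted, gold))
```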

You will implement the `forward()` function in `gtnlplib.parsing.TransitionParser`.

At this point, it is necessary to have all of the components from part 2 in place for constructing the parser.

The parsing logic is roughly as follows:
* Loop until parsing state is in its terminating state (deliverable 1.2)
* Get the features from the parsing state (deliverable 2.2)
* Send them through your action chooser network to get log probabilities over actions (deliverable 2.3)
* If you have `gold_actions`, do them. Otherwise (when predicting), take the argmax of your log probabilities, validate the action (deliverable 1.3), and do that. An argmax function is provided for you in `utils.argmax`.

As you go, make sure to do the bookkeeping that the function expects:
* Do all of your actions by calling the appropriate function on your `parser_state`
* Append each output autograd.Variable from your action_chooser to the outputs list
* Append each action you do to `actions_done`
* Build the set of dependency edges as you go
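The branch between following gold actions and predicting can be isolated in a few lines. This is a sketch with hypothetical names: `argmax` here works on a plain list and is only a stand-in for the provided `utils.argmax`, and `validate` stands for whatever validation callable you pass in.

```python
def argmax(values):
    """Index of the largest value in a plain list."""
    return max(range(len(values)), key=lambda i: values[i])

def choose_action(log_probs, gold_action=None, validate=lambda action: action):
    """Follow the gold action when one is given; otherwise use a validated argmax."""
    if gold_action is not None:
        return gold_action
    return validate(argmax(log_probs))

log_probs = [-1.2443, -0.8323, -1.2844]          # one score per action index
print(choose_action(log_probs, gold_action=0))   # training or evaluation
print(choose_action(log_probs))                  # prediction
```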

**Tests:**
- `test_parser.py:test_parse_logic_d3_1`
- `test_parser.py:test_predict_after_train_d3_1`

In [17]:
test_sentence = "The man ran away".split()
test_word_to_ix = { word: i for i, word in enumerate(set(test_sentence)) }
test_word_to_ix[consts.END_OF_INPUT_TOK] = len(test_word_to_ix)
test_sentence_vocab = set(test_sentence)
gold_actions = ["SHIFT", "ARC_L", "SHIFT", "ARC_L", "SHIFT", "ARC_R", "ARC_R", "SHIFT"]

In [18]:
reload(parsing)
torch.manual_seed(1)
feat_extractor = feat_extractors.SimpleFeatureExtractor()
word_embedding_lookup = neural_net.VanillaWordEmbedding(test_word_to_ix, STACK_EMBEDDING_DIM)
action_chooser = neural_net.FFActionChooser(STACK_EMBEDDING_DIM * NUM_FEATURES)
combiner_network = neural_net.FFCombiner(STACK_EMBEDDING_DIM)
parser = parsing.TransitionParser(feat_extractor, word_embedding_lookup,
                                     action_chooser, combiner_network)
output, depgraph, actions_done = parser(test_sentence, gold_actions)
print(depgraph)
print(actions_done)

{DepGraphEdge(head=('ran', 2), modifier=('man', 1)), DepGraphEdge(head=('man', 1), modifier=('The', 0)), DepGraphEdge(head=('ran', 2), modifier=('away', 3)), DepGraphEdge(head=('<ROOT>', -1), modifier=('ran', 2))}
[0, 1, 0, 1, 0, 2, 2, 0]


### Now Train the Parser!

Training your parser may take some time. On the test below, I get about 5 seconds per loop on an i7 processor.

The test uses 100 sentences and there are 10,000 training sentences, so multiply this measurement by 100 to estimate the time for one pass over the full training set.
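The estimate is simple arithmetic. The sketch below plugs in the 9.12 s per loop reported by the `%%timeit` output in this notebook; your own measurement will differ.

```python
seconds_per_100_sentences = 9.12   # from the %%timeit output in this notebook
total_sentences = 10000

estimated_seconds = seconds_per_100_sentences * total_sentences / 100
print("{:.0f} s, about {:.1f} minutes for one pass".format(
    estimated_seconds, estimated_seconds / 60))
```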

In [19]:
def train_parser(parser, optimizer, dataset, n_epochs=1, n_train_insts=1000):
    for epoch in range(n_epochs):
        print("Epoch {}".format(epoch+1))

        parser.train() # turn on dropout layers if they are there
        parsing.train(dataset.training_data[:n_train_insts], parser, optimizer, verbose=True)

        print("Dev Evaluation")
        parser.eval() # turn them off for evaluation
        parsing.evaluate(dataset.dev_data, parser, verbose=True)
        print("F-Score: {}".format(evaluation.compute_metric(parser, dataset.dev_data, evaluation.fscore)))
        print("Attachment Score: {}".format(evaluation.compute_attachment(parser, dataset.dev_data)))
        print("\n")

In [20]:
reload(parsing)
torch.manual_seed(1)
feat_extractor = feat_extractors.SimpleFeatureExtractor()
word_embedding_lookup = neural_net.VanillaWordEmbedding(word_to_ix_en, STACK_EMBEDDING_DIM)
action_chooser = neural_net.FFActionChooser(STACK_EMBEDDING_DIM * NUM_FEATURES)
combiner_network = neural_net.FFCombiner(STACK_EMBEDDING_DIM)
parser = parsing.TransitionParser(feat_extractor, word_embedding_lookup,
                                     action_chooser, combiner_network)
optimizer = optim.SGD(parser.parameters(), lr=ETA_0)

In [21]:
%%timeit
torch.manual_seed(1)
parsing.train(en_dataset.training_data[:100], parser, optimizer, verbose=True)

Number of instances: 100    Number of network actions: 4836
Acc: 0.695822994210091  Loss: 33.27972549080849
Number of instances: 100    Number of network actions: 4836
Acc: 0.83767576509512  Loss: 18.323778986930847
Number of instances: 100    Number of network actions: 4836
Acc: 0.9050868486352357  Loss: 11.343122403025626
Number of instances: 100    Number of network actions: 4836
Acc: 0.9443755169561621  Loss: 6.971744049191475
1 loops, best of 3: 9.12 s per loop


In [22]:
# train the parser for a while here.
# Shouldn't take *too* long, even on a laptop
torch.manual_seed(1)
train_parser(parser, optimizer, en_dataset, n_train_insts=1000)

Epoch 1
Number of instances: 1000    Number of network actions: 44560
Acc: 0.8244389587073608  Loss: 19.962331902943085
Dev Evaluation
Number of instances: 501    Number of network actions: 15846
Acc: 0.8014009844755774  Loss: 18.568025672161383
F-Score: 0.3974101636736301
Attachment Score: 0.37372207497160165




### Deliverable 3.2: Test Data Predictions (0.25 points)
Run the code below to output your predictions on the test data and dev data.  You can run the dev test to verify you are correct up to this point.  The test data evaluation is for us.

**Test**: `test_parser.py:test_dev_d3_2_english`

In [23]:
dev_sentences = [ sentence for sentence, _ in en_dataset.dev_data ]
evaluation.output_preds(consts.EN_D3_2_DEV_FILENAME, parser, dev_sentences)

In [24]:
evaluation.output_preds(consts.EN_D3_2_TEST_FILENAME, parser, en_dataset.test_data)

### Deliverable 3.3: Dependency parsing in Norwegian (0.25 points)
Run the code below to output your predictions on the **Norwegian** test data and dev data.  You can run the dev test to verify you are correct up to this point.  The test data evaluation is for us.

**Test**: `test_parser.py:test_dev_d3_3_norwegian`

In [25]:
#first, make the parser
reload(parsing)
torch.manual_seed(1)
feat_extractor_nr = feat_extractors.SimpleFeatureExtractor()
word_embedding_lookup_nr = neural_net.VanillaWordEmbedding(word_to_ix_nr, STACK_EMBEDDING_DIM)
action_chooser_nr = neural_net.FFActionChooser(STACK_EMBEDDING_DIM * NUM_FEATURES)
combiner_network_nr = neural_net.FFCombiner(STACK_EMBEDDING_DIM)
parser_nr = parsing.TransitionParser(feat_extractor_nr, word_embedding_lookup_nr,
                                     action_chooser_nr, combiner_network_nr)
optimizer_nr = optim.SGD(parser_nr.parameters(), lr=ETA_0)

In [26]:
torch.manual_seed(1)
train_parser(parser_nr, optimizer_nr, nr_dataset, n_epochs=2, n_train_insts=1000)

Epoch 1
Number of instances: 1000    Number of network actions: 30942
Acc: 0.8131665697110724  Loss: 13.960280205373651
Dev Evaluation
Number of instances: 501    Number of network actions: 16028
Acc: 0.8241826803094584  Loss: 13.742586671012248
F-Score: 0.4583598528806764
Attachment Score: 0.4311205390566509


Epoch 2
Number of instances: 1000    Number of network actions: 30942
Acc: 0.877092624911124  Loss: 9.447790956835728
Dev Evaluation
Number of instances: 501    Number of network actions: 16028
Acc: 0.821936610930871  Loss: 16.20854713876491
F-Score: 0.4530140489080682
Attachment Score: 0.42612927377090093




In [27]:
reload(evaluation)
dev_sentences_nr = [ sentence for sentence, _ in nr_dataset.dev_data ]
evaluation.output_preds(consts.NR_D3_3_DEV_FILENAME, parser_nr, dev_sentences_nr)

In [28]:
evaluation.output_preds(consts.NR_D3_3_TEST_FILENAME, parser_nr, nr_dataset.test_data)

# 4. Evaluation and Training Improvements (3 points)

### Deliverable 4.1: BiLSTM Word Embeddings (0.5 points)
Implement the class `BiLSTMWordEmbedding` in `neural_net.py`.
This class can replace your `VanillaWordEmbedding`.
This class implements a sequence model over the sentence, where the `t`-th word's embedding is the hidden state at timestep `t`.
This means that, rather than have our embeddings on the stack only include the semantics of a single word, our embeddings will contain information from all parts of the sentence (the LSTM will, in principle, learn what information is relevant).

**Test**: `tests/test_parser.py:test_bilstm_word_embeds_d4_1`

In [29]:
reload(neural_net)
torch.manual_seed(1) # DO NOT CHANGE
test_sentence = "Noam Chomsky".split()
test_word_to_ix = { "Noam": 0, "Chomsky": 1 }

lstm_word_embedder = neural_net.BiLSTMWordEmbedding(test_word_to_ix,
                                                    WORD_EMBEDDING_DIM,
                                                    STACK_EMBEDDING_DIM,
                                                    num_layers=LSTM_NUM_LAYERS,
                                                    dropout=DROPOUT)
    
lstm_embeds = lstm_word_embedder(test_sentence)
print(type(lstm_embeds))
print(len(lstm_embeds), "\n")
print("Embedding for Noam:\n {}".format(lstm_embeds[0]))

<class 'list'>
2 

Embedding for Noam:
 Variable containing:

Columns 0 to 9 
 0.1056  0.0944  0.0412  0.3668 -0.0773 -0.0040  0.1634 -0.1230 -0.1293  0.1256

Columns 10 to 19 
 0.1295  0.0352  0.1901  0.0478  0.0433  0.0087 -0.0291  0.1329  0.3459 -0.1165

Columns 20 to 29 
-0.3131 -0.0233 -0.0261  0.1224  0.2036 -0.1736 -0.2412  0.0233  0.0510 -0.1779

Columns 30 to 39 
-0.0900 -0.0254 -0.1700 -0.0838 -0.0048 -0.0954  0.1243 -0.2872 -0.0251  0.0371

Columns 40 to 49 
 0.0771 -0.2720  0.1354  0.1249  0.0127 -0.2583  0.1458 -0.0781 -0.0400 -0.0394

Columns 50 to 59 
-0.1927  0.0358  0.0909  0.0357  0.1218  0.1649 -0.0762  0.1449  0.2308 -0.0049

Columns 60 to 69 
-0.2879 -0.0810  0.1228  0.1740 -0.2155  0.1019  0.3102  0.0146  0.3096  0.0209

Columns 70 to 79 
 0.0164 -0.1165  0.1106  0.2169  0.2170 -0.0204 -0.1604 -0.1576 -0.0712 -0.0304

Columns 80 to 89 
-0.1149 -0.0915 -0.1964  0.1715  0.0541 -0.1103 -0.3502 -0.1217  0.0280 -0.0770

Columns 90 to 99 
 0.0541  0.0820  0.1399  0.1348

### Deliverable 4.2: Suffix Embeddings (0.5 points)
We can also try to more explicitly include morphological information by embedding the suffix of a word in addition to the word itself. We approximate the "suffix" by just looking at the last two characters of a word.

First, implement the function `build_suff_to_ix` in `utils.py`. It should take in a `word_to_ix` lookup and return a `suff_to_ix` lookup.

Then, implement the class `SuffixAndWordEmbedding` in `neural_net.py`.
This class embeds the words and suffixes in a sentence and then concatenates them to form one embedding. 
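For the first step, here is a sketch of one way to build such a lookup. It assumes the suffix is the last two characters and assigns indices in sorted order; the ordering and the handling of words shorter than two characters are assumptions, so check the docstring in `utils.py` and the unit test for the required behavior. The function name is hypothetical.

```python
def build_suffix_index(word_to_ix, suffix_len=2):
    """Map each distinct word-final `suffix_len` characters to an index."""
    # Slicing keeps the whole word when it is shorter than `suffix_len`.
    suffixes = sorted({word[-suffix_len:] for word in word_to_ix})
    return {suffix: i for i, suffix in enumerate(suffixes)}

word_to_ix = {"prefix": 0, "fixsuf": 1, "fixinfix": 2}
suff_to_ix = build_suffix_index(word_to_ix)
print(suff_to_ix)
```

Two of the three words share the suffix `ix`, so the lookup has two entries.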

**Test**: `tests/test_parser.py:test_suff_word_embeds_d4_2`

In [30]:
reload(utils)
suff_to_ix_en = utils.build_suff_to_ix(word_to_ix_en)
suff_to_ix_nr = utils.build_suff_to_ix(word_to_ix_nr)

In [31]:
len(suff_to_ix_en), len(suff_to_ix_nr)

(1145, 849)

In [32]:
reload(neural_net)
torch.manual_seed(1) # DO NOT CHANGE
test_sentence = "prefix fixsuf fixinfix".split()
test_word_to_ix = { "prefix": 0, "fixsuf": 1 , "fixinfix": 2}
test_suff_to_ix = utils.build_suff_to_ix(test_word_to_ix)

suff_word_embedder = neural_net.SuffixAndWordEmbedding(test_word_to_ix, test_suff_to_ix, TEST_EMBEDDING_DIM)
test_embs = suff_word_embedder(test_sentence)

In [33]:
test_embs[0]

Variable containing:
 0.6614  0.2669 -1.5228  0.3817
[torch.FloatTensor of size 1x4]

### Deliverable 4.3: Pretrained Embeddings (0.5 points)

Fill in the function `initialize_with_pretrained` in `utils.py`.

It will take a word embedding lookup component and initialize its lookup table with pretrained embeddings, which are provided. Note that this is only applicable for the Vanilla, BiLSTM, and SuffixAndWord embedding components.
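The idea can be sketched on plain lists before you touch the embedding weights. Everything here is a stand-in: `table` plays the role of the embedding weight matrix, the vectors are made up, and leaving words without a pretrained vector unchanged is an assumption rather than the required behavior.

```python
def copy_pretrained(pretrained, word_to_ix, table):
    """Overwrite rows of `table` for words that have a pretrained vector."""
    replaced = 0
    for word, ix in word_to_ix.items():
        if word in pretrained:
            table[ix] = list(pretrained[word])
            replaced += 1
    return replaced

word_to_ix = {"four": 0, "zebra": 1}
table = [[0.0, 0.0], [9.0, 9.0]]        # stand-in for an embedding weight matrix
pretrained = {"four": [0.12, -0.11]}    # made-up two-dimensional vector
replaced = copy_pretrained(pretrained, word_to_ix, table)
print(replaced, table)
```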

**Test**: `tests/test_parser.py:test_pretrained_embeddings_d4_3`

In [34]:
import pickle
with open(consts.PRETRAINED_EMBEDS_FILE, 'rb') as f:
    pretrained_embeds = pickle.load(f)
print(pretrained_embeds['four'][:5])

[0.12429751455783844, -0.11472601443529129, -0.5684014558792114, -0.396965891122818, 0.22938089072704315]


In [35]:
torch.manual_seed(1)
embedder = neural_net.VanillaWordEmbedding(word_to_ix_en,64)

In [36]:
embedder.forward(['four'])[0][0,:5]

Variable containing:
 0.2170
-1.2324
-0.3781
 1.9561
 0.4290
[torch.FloatTensor of size 5]

In [37]:
reload(utils);
utils.initialize_with_pretrained(pretrained_embeds,embedder)
print(embedder.forward(['four'])[0][0,:5])

Variable containing:
 0.1243
-0.1147
-0.5684
-0.3970
 0.2294
[torch.FloatTensor of size 5]



### Deliverable 4.4: Better Arc Component Combination (0.5 points)
Previously, to combine two embeddings during an arc operation (left or right), we passed them through a feed-forward network and took its dense output. Now we will use a sequence model of the stack instead: the combined embedding from each arc operation is produced by the next time step of an LSTM. Implement `neural_net.LSTMCombiner`.

**Test**: `tests/test_parser.py:test_lstm_combiner_d4_4`
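
One way to picture "the next time step of an LSTM" is a module that keeps its hidden state between calls, so each combination depends on the earlier ones. The sketch below is an illustration under assumptions, not the reference solution: it assumes the head and modifier are concatenated into a single LSTM input, it uses a recent PyTorch API, and the names `SketchLSTMCombiner`, `init_hidden`, and `clear_hidden_state` are chosen for this example.

```python
import torch
import torch.nn as nn


class SketchLSTMCombiner(nn.Module):
    """Combine head and modifier embeddings with one LSTM step per call."""

    def __init__(self, embedding_dim, num_layers=1, dropout=0.0):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.num_layers = num_layers
        # Input is the concatenated pair; output has the size of one embedding.
        self.lstm = nn.LSTM(input_size=2 * embedding_dim,
                            hidden_size=embedding_dim,
                            num_layers=num_layers,
                            dropout=dropout)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        shape = (self.num_layers, 1, self.embedding_dim)
        return (torch.zeros(shape), torch.zeros(shape))

    def forward(self, head_embed, modifier_embed):
        # Shape (seq_len=1, batch=1, 2 * embedding_dim): a single time step.
        step_input = torch.cat([head_embed, modifier_embed], dim=1).view(1, 1, -1)
        output, self.hidden = self.lstm(step_input, self.hidden)
        return output.view(1, -1)

    def clear_hidden_state(self):
        # Call between sentences so history does not leak across them.
        self.hidden = self.init_hidden()


torch.manual_seed(1)
sketch_combiner = SketchLSTMCombiner(embedding_dim=4, num_layers=2)
head = torch.randn(1, 4)
modifier = torch.randn(1, 4)
first = sketch_combiner(head, modifier)
second = sketch_combiner(head, modifier)  # same inputs, different history
print(first.shape, torch.allclose(first, second))
```

Calling the combiner twice on identical inputs gives different outputs because the second call starts from the hidden state left by the first; that carried state is what makes this a sequence model of the stack rather than a stateless function.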

In [38]:
reload(neural_net)
torch.manual_seed(1)
combiner = neural_net.LSTMCombiner(TEST_EMBEDDING_DIM, num_layers=LSTM_NUM_LAYERS, dropout=DROPOUT)
head_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))
mod_feat = ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM))

In [39]:
combined = combiner(head_feat, mod_feat)
combined

Variable containing:
 0.0532 -0.1534  0.1484 -0.0595
[torch.FloatTensor of size 1x4]

### Deliverable 4.5: Better action choosing (0.5 points)
Instead of choosing the action from the combiner output independently at each time step, let's use an LSTM to predict the action. This way, past actions can influence the current decision directly. 

Implement `neural_net.LSTMActionChooser`. Use a linear layer to predict the action from the LSTM hidden state.

**Test**: `tests/test_parser.py:test_lstm_action_chooser_d4_5`
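
The description above can be sketched roughly as follows. The assumptions are: the feature embeddings are concatenated into one LSTM input, there are three actions (suggested by the 1x3 sample output in this section), and the output is a log-softmax over actions (the sample output exponentiates to values that sum to about 1). The class name, `num_actions` parameter, and toy sizes are hypothetical, and the code targets a recent PyTorch release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchLSTMActionChooser(nn.Module):
    """Predict action log-probabilities from an LSTM over parser steps."""

    def __init__(self, input_dim, num_layers=1, dropout=0.0, num_actions=3):
        super().__init__()
        self.input_dim = input_dim
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=input_dim,
                            num_layers=num_layers, dropout=dropout)
        # Linear layer maps the LSTM output to one score per action.
        self.to_actions = nn.Linear(input_dim, num_actions)
        self.hidden = self.init_hidden()

    def init_hidden(self):
        shape = (self.num_layers, 1, self.input_dim)
        return (torch.zeros(shape), torch.zeros(shape))

    def forward(self, feats):
        # Concatenate the feature embeddings into a single time step.
        step_input = torch.cat(feats, dim=1).view(1, 1, -1)
        output, self.hidden = self.lstm(step_input, self.hidden)
        scores = self.to_actions(output.view(1, -1))
        return F.log_softmax(scores, dim=1)


torch.manual_seed(1)
toy_feats = [torch.randn(1, 4) for _ in range(3)]  # 3 features of size 4
sketch_chooser = SketchLSTMActionChooser(input_dim=4 * 3)
log_probs = sketch_chooser(toy_feats)
print(log_probs.shape, log_probs.exp().sum().item())
```

Because the hidden state persists across calls, earlier decisions influence later ones; it would need to be reset between sentences, as with the combiner.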

In [53]:
reload(neural_net)
torch.manual_seed(1)
action_chooser = neural_net.LSTMActionChooser(TEST_EMBEDDING_DIM * NUM_FEATURES,
                                                     LSTM_NUM_LAYERS,
                                                     dropout=DROPOUT)
feats = [ag.Variable(torch.randn(1, TEST_EMBEDDING_DIM)) for _ in range(NUM_FEATURES)]

In [54]:
output = action_chooser(feats)
output

Variable containing:
-1.0328 -1.1798 -1.0887
[torch.FloatTensor of size 1x3]

### Retrain with the new components

In [55]:
reload(neural_net)
reload(parsing)
torch.manual_seed(1)
stack_dim = STACK_EMBEDDING_DIM
feat_extractor = feat_extractors.SimpleFeatureExtractor()
#BiLSTM word embeddings will probably work best, but feel free to experiment with the others you developed
word_embedding_lookup = neural_net.BiLSTMWordEmbedding(word_to_ix_en,
                                                       WORD_EMBEDDING_DIM,
                                                       STACK_EMBEDDING_DIM,
                                                       num_layers=LSTM_NUM_LAYERS,
                                                       dropout=DROPOUT)
utils.initialize_with_pretrained(pretrained_embeds, word_embedding_lookup)
action_chooser = neural_net.LSTMActionChooser(STACK_EMBEDDING_DIM * NUM_FEATURES,
                                              LSTM_NUM_LAYERS,
                                              dropout=DROPOUT)
combiner = neural_net.LSTMCombiner(STACK_EMBEDDING_DIM,
                                   num_layers=LSTM_NUM_LAYERS,
                                   dropout=DROPOUT)
parser = parsing.TransitionParser(feat_extractor, word_embedding_lookup,
                                  action_chooser, combiner)
optimizer = optim.SGD(parser.parameters(), lr=ETA_0)

In [56]:
# The LSTMs will make this take longer, probably just a few minutes
train_parser(parser, optimizer, en_dataset, n_epochs=2, n_train_insts=1000)

Epoch 1
Number of instances: 1000    Number of network actions: 44560
Acc: 0.7996858168761221  Loss: 20.545610046975316
Dev Evaluation
Number of instances: 501    Number of network actions: 15846
Acc: 0.8592704783541588  Loss: 10.756946038016064
F-Score: 0.5800587021811232
Attachment Score: 0.571500694181497


Epoch 2
Number of instances: 1000    Number of network actions: 44560
Acc: 0.8861310592459605  Loss: 12.337165224332363
Dev Evaluation
Number of instances: 501    Number of network actions: 15846
Acc: 0.8787075602675755  Loss: 9.554229947888922
F-Score: 0.6364896694444863
Attachment Score: 0.6127729395431024




### Deliverable 4.6: Test Predictions: English (0.25 points)

**Test**: `tests/test_parser.py:test_dev_preds_d4_6_english`

In [57]:
dev_sentences = [ sentence for sentence, _ in en_dataset.dev_data ]
evaluation.output_preds(consts.EN_D4_6_DEV_FILENAME, parser, dev_sentences)

In [58]:
evaluation.output_preds(consts.EN_D4_6_TEST_FILENAME, parser, en_dataset.test_data)

In [59]:
!nosetests tests/test_parser.py:test_dev_preds_d4_6_english

.
----------------------------------------------------------------------
Ran 1 test in 0.012s

OK


### Deliverable 4.7: Test Predictions: Norwegian (0.25 points)

**Test**: `tests/test_parser.py:test_dev_preds_d4_7_norwegian`

In [60]:
torch.manual_seed(1)
feat_extractor_nr = feat_extractors.SimpleFeatureExtractor()
#BiLSTM word embeddings will probably work best, but feel free to experiment with the others you developed
word_embedding_lookup_nr = neural_net.BiLSTMWordEmbedding(word_to_ix_nr,
                                                          WORD_EMBEDDING_DIM,
                                                          STACK_EMBEDDING_DIM,
                                                          num_layers=LSTM_NUM_LAYERS,
                                                          dropout=DROPOUT)
action_chooser_nr = neural_net.FFActionChooser(STACK_EMBEDDING_DIM * NUM_FEATURES)
combiner_nr = neural_net.LSTMCombiner(STACK_EMBEDDING_DIM,
                                          num_layers=LSTM_NUM_LAYERS,
                                          dropout=DROPOUT)
parser_nr = parsing.TransitionParser(feat_extractor_nr, word_embedding_lookup_nr,
                                  action_chooser_nr, combiner_nr)
optimizer_nr = optim.SGD(parser_nr.parameters(), lr=ETA_0)

In [61]:
train_parser(parser_nr, optimizer_nr, nr_dataset, n_epochs=3, n_train_insts=1000)

Epoch 1
Number of instances: 1000    Number of network actions: 30942
Acc: 0.8116152802016676  Loss: 13.38444953562133
Dev Evaluation
Number of instances: 501    Number of network actions: 16028
Acc: 0.8383453955577739  Loss: 12.016908238210647
F-Score: 0.4970846621987839
Attachment Score: 0.47541801846768156


Epoch 2
Number of instances: 1000    Number of network actions: 30942
Acc: 0.8814556266563247  Loss: 8.603796067149844
Dev Evaluation
Number of instances: 501    Number of network actions: 16028
Acc: 0.8609308709757924  Loss: 10.8729981517646
F-Score: 0.5679072205309386
Attachment Score: 0.544172697778887


Epoch 3
Number of instances: 1000    Number of network actions: 30942
Acc: 0.9139034322280396  Loss: 6.370467250758666
Dev Evaluation
Number of instances: 501    Number of network actions: 16028
Acc: 0.8625530321936611  Loss: 11.12967840797251
F-Score: 0.5624383133719605
Attachment Score: 0.5499126528574994




In [62]:
dev_sentences_nr = [ sentence for sentence, _ in nr_dataset.dev_data ]
evaluation.output_preds(consts.NR_D4_7_DEV_FILENAME, parser_nr, dev_sentences_nr)

In [63]:
evaluation.output_preds(consts.NR_D4_7_TEST_FILENAME, parser_nr, nr_dataset.test_data)

In [64]:
!nosetests tests/test_parser.py:test_dev_preds_d4_7_norwegian

.
----------------------------------------------------------------------
Ran 1 test in 0.013s

OK


# 5. Bakeoff! (2 points)


We will have another Kaggle bakeoff for this problem set. The links will be posted on Canvas.

Try to implement new features and tune your network's architecture and hyperparameters to get the best network.
Section 3 of [this paper](https://pdfs.semanticscholar.org/55b8/1991fbb025038d98e8c71acf7dc2b78ee5e9.pdf) may help with hyperparameter tuning if you are new to neural networks.
To be truly competitive, you may need to train for a long time (leaving it running overnight should be fine).  Here are some suggestions:
* Tune your learning rate.
* Tune your other hyperparameters.
* Try different optimizers.  `torch.optim` provides many different training algorithms.  SGD was used in this pset because it is fast, but it is the most vanilla of them.  Trying others, like Adam, will almost certainly boost performance.
* Try adding regularization to your network if you see evidence that it is overfitting. This can be done with:
  * L2 regularization using the [weight decay argument](http://pytorch.org/docs/master/optim.html#torch.optim.SGD)
  * adding dropout (already an input argument to some of the neural net components)
  * implementing early stopping (stop training if dev performance on some metric does not improve for k epochs)
* Try customizing any of the 3 components (word embeddings, action choosing, combining) in clever ways.  You can create new classes that expose the same public interface and use them here (just leave your required ones untouched). Building word embeddings from characters using an RNN or convolutional layer may help.
* Try new features.  Write new classes that expose the same public interface as SimpleFeatureExtractor.  Try looking further into stack history, or more input buffer lookahead, or features based on the action sequence.  The possibilities are endless.
* Check out [this book](http://www.deeplearningbook.org/), which is undoubtedly the best deep learning book (and it is free online!). It has great information on regularization, optimization, and different network architectures.
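
The early stopping rule from the list above can be sketched in plain Python. The helper name `epochs_before_early_stop` and the `patience` parameter (the "k epochs" above) are assumptions for this example; the scores are the rounded dev numbers from the three-epoch Norwegian run in Deliverable 4.7.

```python
def epochs_before_early_stop(dev_scores, patience):
    """Return (best_epoch, stop_epoch), 1-indexed, for per-epoch dev scores.

    Training stops once `patience` epochs pass without a new best score.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch, score in enumerate(dev_scores, start=1):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never triggered: training ran for every available epoch.
    return best_epoch, len(dev_scores)


# Dev scores from the Norwegian run above, rounded to four places.
nr_dev_f_scores = [0.4971, 0.5679, 0.5624]
nr_dev_attachment = [0.4754, 0.5442, 0.5499]

print(epochs_before_early_stop(nr_dev_f_scores, patience=1))    # (2, 3)
print(epochs_before_early_stop(nr_dev_attachment, patience=1))  # (3, 3)
```

The two metrics disagree here: F-score peaks at epoch 2 while attachment score is still improving at epoch 3, so the choice of metric matters. In a real training loop you would also save the parameters at the best epoch and restore them after stopping.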

**Tests**: 
- `tests/test_parser.py:test_dev_preds_bakeoff_d5_1_english`
- `tests/test_parser.py:test_dev_preds_bakeoff_d5_2_norwegian`

**Rubric**:
English dev:
- $\geq$ 0.74: 0.15 points
- $\geq$ 0.75: 0.25 points
- $\geq$ 0.76: 0.5 points

English test:
- $\geq$ 0.70: 0.15 points
- $\geq$ 0.71: 0.25 points
- $\geq$ 0.73: 0.5 points

Norwegian dev:
- $\geq$ 0.66: 0.15 points
- $\geq$ 0.70: 0.25 points
- $\geq$ 0.71: 0.5 points

Norwegian test:
- $\geq$ 0.66: 0.15 points
- $\geq$ 0.69: 0.25 points
- $\geq$ 0.70: 0.5 points
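
To see how the thresholds translate into points, here is a small sketch. It assumes that only the highest tier reached counts within each category (consistent with the 2-point total for this section, since 4 × 0.5 = 2) and that the score is the attachment score; the names `RUBRIC` and `rubric_points` are made up for this example.

```python
# (threshold, points) tiers copied from the rubric above.
RUBRIC = {
    "English dev": [(0.74, 0.15), (0.75, 0.25), (0.76, 0.5)],
    "English test": [(0.70, 0.15), (0.71, 0.25), (0.73, 0.5)],
    "Norwegian dev": [(0.66, 0.15), (0.70, 0.25), (0.71, 0.5)],
    "Norwegian test": [(0.66, 0.15), (0.69, 0.25), (0.70, 0.5)],
}


def rubric_points(category, score):
    """Points for the highest tier whose threshold `score` meets (assumed non-cumulative)."""
    points = 0.0
    for threshold, value in RUBRIC[category]:
        if score >= threshold:
            points = max(points, value)
    return points


print(rubric_points("English dev", 0.755))   # 0.25
print(rubric_points("Norwegian dev", 0.65))  # 0.0
max_total = sum(tiers[-1][1] for tiers in RUBRIC.values())
print(max_total)                             # 2.0
```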

**Extra credit**: 
- +0.25 if you beat the best TA/prof system in English or Norwegian
- +0.25 if you are #1 in CS4650 in English or Norwegian
- +0.25 if you are #1 in CS7650 in English or Norwegian

Current staff best attachment scores are: 
- English dev: 0.786
- English test: 0.753
- Norwegian dev: 0.736
- Norwegian test: 0.736

### Using CUDA
You can use CUDA to train your network, and you should expect a decent speedup if you have a GPU and the CUDA toolkit installed.
If you want to use CUDA in this assignment, set the `HAVE_CUDA` variable to `True` in `constants.py` and call `.to_cuda()` on your parser. You may also need to reconfigure your Embedding layers if you did not account for CUDA when writing them.

We are not officially supporting CUDA, though.  If you have problems installing or running CUDA, please just use the CPU; we cannot help you debug it.
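
The usual reason Embedding layers need attention is that the index tensor fed to the lookup must live on the same device as the embedding weights. The sketch below shows this with the generic `.to(device)` idiom of recent PyTorch releases; it does not use the assignment's own `HAVE_CUDA` flag or `.to_cuda()` method, and the toy sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no GPU is available, so the code runs anywhere.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

toy_embedding = nn.Embedding(10, 4).to(device)

# Index tensors must be created on (or moved to) the same device as the layer.
word_indices = torch.tensor([1, 2, 3], dtype=torch.long, device=device)
embedded = toy_embedding(word_indices)
print(embedded.shape, embedded.device.type)
```

If the indices were left on the CPU while the layer sat on the GPU, the lookup would raise a device mismatch error.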

In [None]:
# Set your hyperparameters here
# e.g., learning rate, regularization, lr annealing, dimensionality of embeddings, number of epochs, early stopping, etc.

In [None]:
# Make your parser here
# name your TransitionParser bakeoff_parser_en to output your predictions below
# bakeoff_parser_en = TransitionParser(...)

# Also, choose an optimizer.
# bakeoff_optimizer_en = optim....
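
For reference, swapping optimizers or adding L2 regularization is a one-line change in `torch.optim`. The sketch below uses a stand-in `nn.Linear` model rather than a real `TransitionParser`, and the learning rates and weight decay values are placeholders, not recommendations.

```python
import torch.nn as nn
import torch.optim as optim

toy_model = nn.Linear(4, 3)  # stand-in for parser.parameters()

# SGD with L2 regularization via the weight_decay argument.
sgd_with_l2 = optim.SGD(toy_model.parameters(), lr=0.01, weight_decay=1e-4)

# Adam, an adaptive alternative; it accepts weight_decay as well.
adam = optim.Adam(toy_model.parameters(), lr=1e-3, weight_decay=1e-4)

print(type(sgd_with_l2).__name__, type(adam).__name__)
```

Only one optimizer should be stepped during training; two are built here purely to show both constructors.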

In [None]:
# train for bakeoff
train_parser(bakeoff_parser_en, bakeoff_optimizer_en, en_dataset, n_epochs=5, n_train_insts=10000)

In [None]:
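# dev_data holds pairs whose first element is the sentence, so pass only the sentences (as in Deliverable 4.6)
evaluation.output_preds("bakeoff-dev-en.preds", bakeoff_parser_en, [sentence for sentence, _ in en_dataset.dev_data])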

In [None]:
evaluation.output_preds("bakeoff-test-en.preds", bakeoff_parser_en, en_dataset.test_data)

In [None]:
# Use this to output predictions in Kaggle-ready format
evaluation.kaggle_output("KAGGLE-bakeoff-preds-en.csv", bakeoff_parser_en, en_dataset.test_data)

In [None]:
# Now make your Norwegian parser if necessary
# name your TransitionParser bakeoff_parser_nr to output your predictions below
# bakeoff_parser_nr = TransitionParser(...)

# Also, choose an optimizer.
# bakeoff_optimizer_nr = optim....

In [None]:
# train for bakeoff
train_parser(bakeoff_parser_nr, bakeoff_optimizer_nr, nr_dataset, n_epochs=5, n_train_insts=10000)

In [None]:
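# dev_data holds pairs whose first element is the sentence, so pass only the sentences (as in Deliverable 4.7)
evaluation.output_preds("bakeoff-dev-nr.preds", bakeoff_parser_nr, [sentence for sentence, _ in nr_dataset.dev_data])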

In [None]:
evaluation.output_preds("bakeoff-test-nr.preds", bakeoff_parser_nr, nr_dataset.test_data)

In [None]:
evaluation.kaggle_output("KAGGLE-bakeoff-preds-nr.csv", bakeoff_parser_nr, nr_dataset.test_data)

# 6. 7650 only: Research question (1 point)

Describe a paper that uses dependency trees for some downstream task. Your response should answer the following questions, with 1-2 sentences each.

1. What is the task that is being solved?
2. Briefly (one sentence) explain the metric for success on this task.
3. Why are dependency features expected to help with this task?
4. How are dependency features incorporated into the solution?
5. Does the paper evaluate whether dependency features improve performance on the downstream task? If so, what is their impact? If not, why not?

You should select a paper from 2008-2018 at one of the following venues: ACL, NAACL, EACL, EMNLP, TACL. Exceptions to this list will be considered on a case-by-case basis; please post a private question to Piazza.

Here are some suggested papers. Free downloads of all these papers can be found by searching online.

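- Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. "SQuAD: 100,000+ Questions for Machine Comprehension of Text." In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 2383-2392. 2016.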
- Stanovsky, Gabriel, Judith Eckle-Kohler, Yevgeniy Puzikov, Ido Dagan, and Iryna Gurevych. "Integrating Deep Linguistic Features in Factuality Prediction over Unified Datasets." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, pp. 352-357. 2017.
- Xu, Yan, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. "Classifying relations via long short term memory networks along shortest dependency paths." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1785-1794. 2015.
- Madaan, Aman, Ashish Mittal, Mausam, Ganesh Ramakrishnan, and Sunita Sarawagi. "Numerical Relation Extraction with Minimal Supervision." In AAAI, pp. 2764-2771. 2016.
- Lee, Kenton, Yoav Artzi, Yejin Choi, and Luke Zettlemoyer. "Event detection and factuality assessment with non-expert supervision." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1643-1648. 2015.
- Kim, Sun, Haibin Liu, Lana Yeganova, and W. John Wilbur. "Extracting drug–drug interactions from literature using a rich feature-based linear kernel approach." Journal of biomedical informatics 55 (2015): 23-30.
- Shwartz, Vered, Yoav Goldberg, and Ido Dagan. "Improving Hypernymy Detection with an Integrated Path-based and Distributional Method." In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 2389-2398. 2016.
