**Note**: I adapted the starter code to make it compatible with Python 3. All code was run in Python 3 (rather than the officially supported Python 2) unless otherwise noted.

### Q1: TensorFlow Softmax

**1a** and **1b**.

In [1]:
%%bash
source activate py36
python q1_softmax.py

Softmax test 1 passed!
Softmax test 2 passed!
Basic (non-exhaustive) softmax tests pass

Cross-entropy test 1 passed!
Basic (non-exhaustive) cross-entropy tests pass


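**1c**. Placeholders are nodes in the computational graph that hold no value when the graph is built; a feed dictionary maps each placeholder to a concrete value at run time. Together they make it possible to feed data (for example, a different minibatch on each step) into the same graph.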

**1d** and **1e**.

In [2]:
%%bash
source activate py36
python q1_classifier.py

Epoch 0: loss = 59.18 (0.114 sec)
Epoch 1: loss = 20.32 (0.009 sec)
Epoch 2: loss = 10.92 (0.008 sec)
Epoch 3: loss = 7.30 (0.008 sec)
Epoch 4: loss = 5.44 (0.008 sec)
Epoch 5: loss = 4.32 (0.009 sec)
Epoch 6: loss = 3.58 (0.009 sec)
Epoch 7: loss = 3.05 (0.009 sec)
Epoch 8: loss = 2.65 (0.008 sec)
Epoch 9: loss = 2.35 (0.008 sec)
Epoch 10: loss = 2.11 (0.008 sec)
Epoch 11: loss = 1.91 (0.007 sec)
Epoch 12: loss = 1.75 (0.007 sec)
Epoch 13: loss = 1.61 (0.007 sec)
Epoch 14: loss = 1.49 (0.007 sec)
Epoch 15: loss = 1.39 (0.007 sec)
Epoch 16: loss = 1.30 (0.007 sec)
Epoch 17: loss = 1.22 (0.007 sec)
Epoch 18: loss = 1.15 (0.006 sec)
Epoch 19: loss = 1.09 (0.007 sec)
Epoch 20: loss = 1.03 (0.007 sec)
Epoch 21: loss = 0.98 (0.007 sec)
Epoch 22: loss = 0.94 (0.006 sec)
Epoch 23: loss = 0.89 (0.007 sec)
Epoch 24: loss = 0.86 (0.006 sec)
Epoch 25: loss = 0.82 (0.006 sec)
Epoch 26: loss = 0.79 (0.007 sec)
Epoch 27: loss = 0.76 (0.006 sec)
Epoch 28: loss = 0.73 (0.006 sec)
Epoch 29: loss = 0.71

Running the op first performs forward propagation to compute the loss, then backpropagation to compute the derivatives of the loss with respect to the variables in the graph, and finally updates those variables.
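The three phases can be illustrated outside TensorFlow. The following is a minimal plain-Python sketch, not the assignment's `q1_classifier.py`: the function names (`forward`, `backward`, `sgd_update`), the toy input, and the learning rate are all assumptions chosen for illustration. It trains a single-example softmax classifier and records the loss at each step.

```python
import math

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, x, label):
    # Forward propagation: logits -> probabilities -> cross-entropy loss.
    n_classes = len(W[0])
    logits = [sum(x[i] * W[i][k] for i in range(len(x))) for k in range(n_classes)]
    probs = softmax(logits)
    loss = -math.log(probs[label])
    return probs, loss

def backward(x, probs, label):
    # Backpropagation: d(loss)/d(logits) = probs - one_hot(label),
    # and the chain rule gives d(loss)/dW[i][k] = x[i] * dlogits[k].
    dlogits = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [[xi * d for d in dlogits] for xi in x]

def sgd_update(W, dW, lr):
    # Update: move each variable against its gradient.
    return [[w - lr * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, dW)]

# Hypothetical toy problem: 2 features, 3 classes, one training example.
x = [1.0, 2.0]
label = 1
W = [[0.0, 0.0, 0.0] for _ in range(2)]

losses = []
for _ in range(5):
    probs, loss = forward(W, x, label)
    losses.append(loss)
    W = sgd_update(W, backward(x, probs, label), lr=0.1)
```

In TensorFlow, the backward and update phases are generated automatically when the optimizer's minimize op is built; here they are written out by hand only to make each phase visible.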

### Q2: Neural Transition-Based Dependency Parsing

**2a**.

Stack | Buffer | New dependency | Transition
--- | --- | ---| ---
[ROOT] | [I, parsed, this, sentence, correctly] | | Initial Configuration
[ROOT, I] | [parsed, this, sentence, correctly] | | SHIFT
[ROOT, I, parsed] | [this, sentence, correctly] | | SHIFT
[ROOT, parsed] | [this, sentence, correctly] | parsed → I | LEFT-ARC
[ROOT, parsed, this] | [sentence, correctly] | | SHIFT
[ROOT, parsed, this, sentence] | [correctly] | | SHIFT
[ROOT, parsed, sentence] | [correctly] | sentence → this | LEFT-ARC
[ROOT, parsed] | [correctly] | parsed → sentence | RIGHT-ARC
[ROOT, parsed, correctly] | [] | | SHIFT
[ROOT, parsed] | [] | parsed → correctly | RIGHT-ARC
[ROOT] | [] | ROOT → parsed | RIGHT-ARC

**2b**. A sentence containing $n$ words will be parsed in $2n$ steps: each word must be shifted onto the stack once (SHIFT) and removed from the stack once (LEFT-ARC or RIGHT-ARC).
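The step count can be checked by replaying the transition sequence from 2a. This is a minimal sketch, not the `q2_parser_transitions.py` implementation; the function name `apply_transitions` and the short transition codes `"S"`, `"LA"`, `"RA"` are assumptions made for this example.

```python
def apply_transitions(words, transitions):
    """Replay a transition sequence and return the final stack, buffer, and arcs."""
    stack = ["ROOT"]
    buffer = list(words)
    dependencies = []  # (head, dependent) pairs, in the order they are created
    for transition in transitions:
        if transition == "S":
            # SHIFT: move the first buffer word onto the stack.
            stack.append(buffer.pop(0))
        elif transition == "LA":
            # LEFT-ARC: the second-from-top item becomes a dependent of the top item.
            dependent = stack.pop(-2)
            dependencies.append((stack[-1], dependent))
        elif transition == "RA":
            # RIGHT-ARC: the top item becomes a dependent of the item below it.
            dependent = stack.pop()
            dependencies.append((stack[-1], dependent))
        else:
            raise ValueError("unknown transition: %r" % transition)
    return stack, buffer, dependencies

words = ["I", "parsed", "this", "sentence", "correctly"]
transitions = ["S", "S", "LA", "S", "S", "LA", "RA", "S", "RA", "RA"]
stack, buffer, dependencies = apply_transitions(words, transitions)
```

With $n = 5$ words the sequence has 10 transitions: 5 SHIFTs and 5 arc transitions, one arc per word.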

**2c** and **2d**.

In [3]:
%%bash
source activate py36
python q2_parser_transitions.py

SHIFT test passed!
LEFT-ARC test passed!
RIGHT-ARC test passed!
parse test passed!
minibatch_parse test passed!


**2e**.

In [4]:
%%bash
source activate py36
python q2_initialization.py

Running basic tests...
Basic (non-exhaustive) Xavier initialization tests pass


**2f**.

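$$\gamma = \frac{1}{1 - p_{drop}}$$

This follows from requiring $\mathbb{E}_{p_{drop}}[h_{drop}]_i = h_i$. Each mask entry $d_i$ is 0 with probability $p_{drop}$ and 1 otherwise, so

$$\mathbb{E}_{p_{drop}}[h_{drop}]_i = \mathbb{E}_{p_{drop}}[\gamma d_i h_i] = p_{drop}\cdot 0 + (1-p_{drop})\cdot \gamma h_i = (1-p_{drop})\cdot \gamma h_i$$

Setting the right-hand side equal to $h_i$ gives

$$\gamma = \frac{1}{1 - p_{drop}}$$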
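The result can be sanity-checked numerically. The sketch below is a plain-Python Monte Carlo estimate, assuming a hypothetical hidden vector `h` and drop probability; the helper name `dropout` and all constants are illustrative and not taken from the assignment code.

```python
import random

def dropout(h, p_drop, rng):
    # Drop each unit with probability p_drop; scale survivors by gamma.
    gamma = 1.0 / (1.0 - p_drop)
    return [gamma * h_i if rng.random() >= p_drop else 0.0 for h_i in h]

rng = random.Random(0)      # fixed seed so the run is repeatable
h = [2.0, -1.0, 0.5]        # hypothetical hidden activations
p_drop = 0.3
n_samples = 100000

# Average many independent dropout samples to estimate E[h_drop].
sums = [0.0] * len(h)
for _ in range(n_samples):
    for i, value in enumerate(dropout(h, p_drop, rng)):
        sums[i] += value
means = [s / n_samples for s in sums]
```

Each entry of `means` should lie close to the corresponding entry of `h`, which is what the choice of $\gamma$ is meant to guarantee.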

**2g**. **(i)** Each update depends on the previous ones: the previous $m$ is multiplied by $\beta_1$ and the current gradient by $(1 - \beta_1)$, so $m$ is a rolling average of the gradients. This average behaves like a gradient computed over a larger minibatch, so the variance of each update is reduced and each update is closer to the gradient over the whole dataset.

**(ii)** Because the update is divided by $\sqrt{v}$, parameters with smaller gradients (on average) get larger updates; this helps them move out of flat regions of the loss surface (such as saddle points).
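Both effects can be seen in a small sketch of an Adam-style update. This version assumes the simplified form without bias correction, and the small `eps` term in the denominator is an addition for numerical safety; the function name `adam_step` and the default hyperparameters are illustrative rather than taken from the assignment code.

```python
import math

def adam_step(theta, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # (i) Momentum: rolling average of the gradients.
    m = [beta1 * m_i + (1 - beta1) * g for m_i, g in zip(m, grad)]
    # (ii) Adaptive learning rates: rolling average of the squared gradients.
    v = [beta2 * v_i + (1 - beta2) * g * g for v_i, g in zip(v, grad)]
    # Dividing by sqrt(v) rescales each parameter's step by its gradient magnitude.
    theta = [t - lr * m_i / (math.sqrt(v_i) + eps)
             for t, m_i, v_i in zip(theta, m, v)]
    return theta, m, v

# Two hypothetical parameters whose gradients differ by a factor of 100.
theta = [0.0, 0.0]
grad = [10.0, 0.1]
new_theta, m, v = adam_step(theta, grad, m=[0.0, 0.0], v=[0.0, 0.0])

adam_steps = [abs(n - t) for n, t in zip(new_theta, theta)]
sgd_steps = [0.001 * abs(g) for g in grad]  # plain SGD with the same learning rate
```

Under plain SGD the two step sizes differ by the same factor of 100 as the gradients, whereas the Adam-style steps come out nearly equal, so the parameter with the small gradient moves much further than it would under SGD.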

**2h**.

In [5]:
%%bash
source activate py36
python q2_run_h1.py

INITIALIZING
Loading data...
took 1.14 seconds
Building parser...
took 0.02 seconds
Loading pretrained embeddings...
took 1.29 seconds
Vectorizing data...
took 0.03 seconds
Preprocessing training data...
took 0.79 seconds
Building model...
took 0.11 seconds

TRAINING
Epoch 1 out of 10
Evaluating on dev set
- dev UAS: 55.74

Epoch 2 out of 10
Evaluating on dev set
- dev UAS: 60.93

Epoch 3 out of 10
Evaluating on dev set
- dev UAS: 63.93

Epoch 4 out of 10
Evaluating on dev set
- dev UAS: 65.97

Epoch 5 out of 10
Evaluating on dev set
- dev UAS: 68.35

Epoch 6 out of 10
Evaluating on dev set
- dev UAS: 69.00

Epoch 7 out of 10
Evaluating on dev set
- dev UAS: 70.32

Epoch 8 out of 10
Evaluating on dev set
- dev UAS: 72.07

Epoch 9 out of 10
Evaluating on dev set
- dev UAS: 71.51

Epoch 10 out of 10
Evaluating on dev set
- dev UAS: 72.12



The best UAS on the dev set is 88.71; the UAS on the test set is 89.10.

**2i**. I added an additional hidden layer. In principle, an additional hidden layer increases the expressiveness of the model because it can capture more nonlinearities, but it also makes the model harder to train.

The best UAS on the dev set is 88.40; the UAS on the test set is 88.75.

### Q3: Recurrent Neural Networks: Language Modeling

(See solution)