**Note**: I adapted the starter code to make it compatible with Python 3. All code was run in Python 3 (rather than Python 2, the officially supported version) unless stated otherwise. As a result, some of my results differ slightly from the reference solution, but all of them passed the sanity checks.

### Q1: Softmax

**1a**. Proof: $$\text{softmax}(x+c)_i = \frac{e^{x_i+c}}{\sum_je^{x_j+c}}=\frac{e^ce^{x_i}}{e^c\sum_je^{x_j}}
=\frac{e^{x_i}}{\sum_je^{x_j}}=\text{softmax}(x)_i$$
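
The invariance can also be checked numerically. Below is a minimal sketch using only the standard library (it is not the `q1_softmax.py` implementation). Choosing $c=-\max_i x_i$ keeps every exponent non-positive, so large inputs do not overflow.

```python
import math


def softmax(x):
    """Softmax of a list of floats, computed after subtracting max(x)."""
    shift = max(x)                           # corresponds to c = -max(x) in 1a
    exps = [math.exp(v - shift) for v in x]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


small = softmax([1.0, 2.0])
large = softmax([1001.0, 1002.0])  # math.exp(1001.0) alone would overflow
```

For `[1.0, 2.0]` this gives approximately `[0.26894142, 0.73105858]`, and the shifted input `[1001.0, 1002.0]` yields the same probabilities.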

**1b**.

In [1]:
%%bash
source activate py36
python q1_softmax.py

Running basic tests...
[0.26894142 0.73105858]
[[0.26894142 0.73105858]
 [0.26894142 0.73105858]]
[[0.73105858 0.26894142]]
You should be able to verify these results by hand!



### Q2: Neural Network Basics

**2a**.
$$\sigma'(x)=\frac{-1}{(1+e^{-x})^2}\cdot(-e^{-x})=\frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}
=\sigma(x)(1-\sigma(x))$$
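
As a quick check of this identity, the sketch below compares $\sigma(x)(1-\sigma(x))$ with a central finite difference at $x=1$. The helper names are illustrative and not necessarily those used in `q2_sigmoid.py`.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(s):
    """Derivative of the sigmoid, written in terms of s = sigmoid(x)."""
    return s * (1.0 - s)


def numeric_derivative(f, x, eps=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


analytic = sigmoid_grad(sigmoid(1.0))
numeric = numeric_derivative(sigmoid, 1.0)
```

Both values come out close to 0.19661193.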

**2b**. Assume that only the k-th dimension of $\boldsymbol{y}$ is 1 and others are 0. We have
$$CE(\boldsymbol{y}, \boldsymbol{\hat{y}})=-y_k\log (\hat{y}_k)=-y_k(\log e^{\theta_k}-\log \sum_i e^{\theta_i}) = \log \sum_i e^{\theta_i} - \theta_k$$
$$\frac{\partial CE(\boldsymbol{y}, \boldsymbol{\hat{y}})}{\partial\theta_k}
=\frac{e^{\theta_k}}{\sum_i e^{\theta_i}}-1=\hat{y}_k-1$$
For $j \neq k$,
$$\frac{\partial CE(\boldsymbol{y}, \boldsymbol{\hat{y}})}{\partial\theta_j}
=\frac{e^{\theta_j}}{\sum_i e^{\theta_i}}=\hat{y}_j$$

$$\therefore \frac{\partial CE(\boldsymbol{y}, \boldsymbol{\hat{y}})}{\partial\boldsymbol{\theta}}=\boldsymbol{\hat{y}}-\boldsymbol{y}$$
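
The result $\hat{y}-y$ can be sanity-checked against finite differences. This is a minimal sketch with made-up scores `theta` and an arbitrary true class `k`; only the standard library is used.

```python
import math


def softmax(theta):
    shift = max(theta)
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(theta, k):
    """CE loss when the one-hot label has its 1 at index k."""
    return -math.log(softmax(theta)[k])


def ce_grad(theta, k):
    """Analytic gradient y_hat - y with respect to theta."""
    return [p - (1.0 if i == k else 0.0) for i, p in enumerate(softmax(theta))]


theta = [0.5, -1.0, 2.0]  # made-up scores
k = 2                     # made-up true class
analytic = ce_grad(theta, k)

eps = 1e-5
numeric = []
for i in range(len(theta)):
    plus, minus = theta[:], theta[:]
    plus[i] += eps
    minus[i] -= eps
    numeric.append((cross_entropy(plus, k) - cross_entropy(minus, k)) / (2 * eps))
```

Because the entries of $\hat{y}$ and of $y$ each sum to 1, the gradient entries sum to zero.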

**2c**. Denote $z_1=xW_1+b_1$ and $z_2=hW_2+b_2$, then

$$\frac{\partial J}{\partial x} = \left(\frac{\partial J}{\partial z_2}\frac{\partial z_2}{\partial h}\odot\frac{\partial h}{\partial z_1}\right)\frac{\partial z_1}{\partial x}
=\left((\hat{y}-y)W_2^\text{T}\odot h\odot(1-h)\right)W_1^\text{T}$$
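
To make the order of operations concrete, here is a hedged sketch of the backward pass to the input for a tiny network with random weights: first multiply $\hat{y}-y$ by $W_2^\text{T}$, then scale elementwise by $h(1-h)$, then multiply by $W_1^\text{T}$. The dimensions, the seed, and the function names are assumptions for illustration; this is not the `q2_neural.py` code.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """Return (h, y_hat); x is a row vector, W1 is (Dx, H), W2 is (H, Dy)."""
    h = [sigmoid(sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    z2 = [sum(h[j] * W2[j][k] for j in range(len(h))) + b2[k]
          for k in range(len(b2))]
    return h, softmax(z2)


def loss(x, params, label):
    _, y_hat = forward(x, *params)
    return -math.log(y_hat[label])


def grad_x(x, params, label):
    W1, _, W2, _ = params
    h, y_hat = forward(x, *params)
    # dJ/dz2 = y_hat - y
    delta2 = [p - (1.0 if k == label else 0.0) for k, p in enumerate(y_hat)]
    # dJ/dz1 = (delta2 W2^T), scaled elementwise by h (1 - h)
    delta1 = [sum(delta2[k] * W2[j][k] for k in range(len(delta2))) * h[j] * (1.0 - h[j])
              for j in range(len(h))]
    # dJ/dx = delta1 W1^T
    return [sum(delta1[j] * W1[i][j] for j in range(len(h))) for i in range(len(x))]


random.seed(0)
Dx, H, Dy = 3, 4, 2  # made-up sizes
params = (
    [[random.gauss(0, 1) for _ in range(H)] for _ in range(Dx)],  # W1
    [random.gauss(0, 1) for _ in range(H)],                       # b1
    [[random.gauss(0, 1) for _ in range(Dy)] for _ in range(H)],  # W2
    [random.gauss(0, 1) for _ in range(Dy)],                      # b2
)
x = [random.gauss(0, 1) for _ in range(Dx)]
label = 1

analytic = grad_x(x, params, label)
eps = 1e-5
numeric = []
for i in range(Dx):
    plus, minus = x[:], x[:]
    plus[i] += eps
    minus[i] -= eps
    numeric.append((loss(plus, params, label) - loss(minus, params, label)) / (2 * eps))
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```

The analytic gradient and the finite-difference estimate should agree to within the finite-difference error.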

**2d**. The total number of parameters is $(D_x+1)H+(H+1)D_y$.
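
The count simply adds the weights and biases of the two layers. A small sketch with made-up sizes:

```python
def count_parameters(d_x, h, d_y):
    """Weights plus biases of a network with one hidden layer."""
    first_layer = d_x * h + h      # W1 is (d_x, h), b1 is (h,)
    second_layer = h * d_y + d_y   # W2 is (h, d_y), b2 is (d_y,)
    return first_layer + second_layer


example = count_parameters(10, 5, 10)  # (10 + 1) * 5 + (5 + 1) * 10 = 115
```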

**2e**.

In [2]:
%%bash
source activate py36
python q2_sigmoid.py

Running basic tests...
[[0.73105858 0.88079708]
 [0.26894142 0.11920292]]
[[0.19661193 0.10499359]
 [0.19661193 0.10499359]]
You should verify these results by hand!



**2f**.

In [3]:
%%bash
source activate py36
python q2_gradcheck.py

Running sanity checks...
Gradient check passed!
Gradient check passed!
Gradient check passed!



**2g**.

In [4]:
%%bash
source activate py36
python q2_neural.py

Running sanity check...
Gradient check passed!


### Q3: word2vec

**3a**. We know that $u_w$ and $v_w$ are column vectors of size, say, $H$. Then the shape of $U$ is $(H, V)$. Let $\hat{y}$ be the column vector of softmax predictions, and $y$ be the one-hot labels as a column vector. Then,
$$\frac{\partial J}{\partial v_c}=U(\hat{y}-y)$$
The elements $\hat{y}_w$ and $y_w$ of $\hat{y}$ and $y$ are scalars. Thus equivalently,
$$\frac{\partial J}{\partial v_c}=\sum_{w=1}^V \hat{y}_wu_w-u_o.$$


**3b**. 
$$\frac{\partial J}{\partial U} = v_c(\hat{y}-y)^\text{T}$$
Equivalently,
$$\frac{\partial J}{\partial u_k} = 
\begin{cases}
     (\hat{y}_k-1)v_c, & k=o\\
     \hat{y}_kv_c, & k \neq o
\end{cases}$$
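
The gradients from 3a and 3b can be written down directly and checked numerically. The sketch below stores $U$ as a list of its columns $u_w$ and uses made-up vectors; the function name and data are illustrative, not the `q3_word2vec.py` implementation.

```python
import math


def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cost_and_grads(v_c, out_vectors, o):
    """out_vectors[w] holds u_w (column w of U); o indexes the true outside word."""
    scores = [sum(a * b for a, b in zip(u, v_c)) for u in out_vectors]
    y_hat = softmax(scores)
    cost = -math.log(y_hat[o])
    err = [p - (1.0 if w == o else 0.0) for w, p in enumerate(y_hat)]  # y_hat - y
    # dJ/dv_c = U (y_hat - y)
    grad_v = [sum(err[w] * out_vectors[w][i] for w in range(len(err)))
              for i in range(len(v_c))]
    # dJ/du_w = (y_hat_w - y_w) v_c, i.e. column w of v_c (y_hat - y)^T
    grad_u = [[e * x for x in v_c] for e in err]
    return cost, grad_v, grad_u


v_c = [0.1, -0.2, 0.3]  # made-up center vector
out_vectors = [[0.2, 0.1, -0.1], [-0.3, 0.4, 0.2], [0.5, -0.5, 0.1], [0.0, 0.3, -0.2]]
o = 1
cost, grad_v, grad_u = softmax_cost_and_grads(v_c, out_vectors, o)

eps = 1e-5
numeric_v = []
for i in range(len(v_c)):
    plus, minus = v_c[:], v_c[:]
    plus[i] += eps
    minus[i] -= eps
    numeric_v.append((softmax_cost_and_grads(plus, out_vectors, o)[0]
                      - softmax_cost_and_grads(minus, out_vectors, o)[0]) / (2 * eps))
```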

**3c**.
$$\frac{\partial J}{\partial v_c}=
-(1-\sigma(u_o^\text{T}v_c))u_o+\sum_{k=1}^K(1-\sigma(-u_k^\text{T}v_c))u_k$$

$$\frac{\partial J}{\partial u_w}= \left\{
\begin{array}{ll}
-(1-\sigma(u_w^\text{T}v_c))v_c, & w=o\\
(1-\sigma(-u_w^\text{T}v_c))v_c, & w=1, 2, \ldots, K\\
0, & \text{otherwise}
\end{array}\right.$$

The negative sampling loss involves fewer terms than the softmax-CE loss: each update touches only $K+1$ output vectors instead of all $V$, so the speed-up ratio is $O(V/K)$.
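
A minimal sketch of the negative sampling loss and its gradients, with a finite-difference check on $v_c$. The vectors are made up, the $K$ negative samples are assumed to be distinct words, and the function name is illustrative rather than taken from `q3_word2vec.py`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_cost_and_grads(v_c, u_o, negatives):
    """negatives is a list of the K sampled output vectors u_1..u_K."""
    pos = sigmoid(dot(u_o, v_c))
    cost = -math.log(pos)
    grad_v = [-(1.0 - pos) * x for x in u_o]
    grad_u_o = [-(1.0 - pos) * x for x in v_c]
    grad_negs = []
    for u_k in negatives:  # only K + 1 output vectors are touched in total
        neg = sigmoid(-dot(u_k, v_c))
        cost -= math.log(neg)
        grad_v = [g + (1.0 - neg) * x for g, x in zip(grad_v, u_k)]
        grad_negs.append([(1.0 - neg) * x for x in v_c])
    return cost, grad_v, grad_u_o, grad_negs


v_c = [0.1, -0.2, 0.3]
u_o = [-0.3, 0.4, 0.2]
negatives = [[0.2, 0.1, -0.1], [0.5, -0.5, 0.1]]  # K = 2 made-up samples
cost, grad_v, grad_u_o, grad_negs = neg_sampling_cost_and_grads(v_c, u_o, negatives)

eps = 1e-5
numeric_v = []
for i in range(len(v_c)):
    plus, minus = v_c[:], v_c[:]
    plus[i] += eps
    minus[i] -= eps
    numeric_v.append((neg_sampling_cost_and_grads(plus, u_o, negatives)[0]
                      - neg_sampling_cost_and_grads(minus, u_o, negatives)[0]) / (2 * eps))
```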

**3d**.
$$\frac{\partial J_{\text{skip-gram}}(w_{t-m \ldots t+m})}{\partial U}=\sum_{-m\le j \le m, j\neq 0}\frac{\partial F(w_{t+j}, v_c)}{\partial U}$$

$$\frac{\partial J_{\text{skip-gram}}(w_{t-m \ldots t+m})}{\partial v_k}= \left\{
\begin{array}{ll}
\sum_{-m\le j \le m, j \neq 0}\frac{\partial F(w_{t+j},v_c)}{\partial v_k}, & k = c\\
0,&k \neq c
\end{array}\right.$$

$$\frac{\partial J_{\text{CBOW}}(w_{t-m \ldots t+m})}{\partial U}=\frac{\partial F(w_t, \hat{v})}{\partial U}$$

$$\frac{\partial J_{\text{CBOW}}(w_{t-m \ldots t+m})}{\partial v_k}=\left\{
\begin{array}{ll}
\frac{\partial F(w_t, \hat{v})}{\partial v_k}, & k=w_{t+j},\ j \in \{-m, \ldots,-1, +1, \ldots, +m\}\\
0, & \text{otherwise}
\end{array}\right.$$
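
The accumulation pattern can be sketched as follows. The cost $F$ is taken to be the softmax-CE loss, $\hat{v}$ is assumed to be the sum of the context input vectors, and all vectors and word indices are made up; this is an illustration rather than the `q3_word2vec.py` code.

```python
import math


def softmax_cost_and_grads(v, out_vectors, o):
    """Softmax-CE cost F(o, v) with gradients w.r.t. v and every output vector."""
    scores = [sum(a * b for a, b in zip(u, v)) for u in out_vectors]
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    err = [e / total - (1.0 if w == o else 0.0) for w, e in enumerate(exps)]
    cost = -math.log(exps[o] / total)
    grad_v = [sum(err[w] * out_vectors[w][i] for w in range(len(err)))
              for i in range(len(v))]
    grad_out = [[e * x for x in v] for e in err]
    return cost, grad_v, grad_out


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def skipgram(center, context, in_vectors, out_vectors):
    """Sum F over the context word ids; only row `center` of grad_in is touched."""
    dim = len(in_vectors[0])
    cost = 0.0
    grad_in = [[0.0] * dim for _ in in_vectors]
    grad_out = [[0.0] * dim for _ in out_vectors]
    for o in context:
        c, g_v, g_out = softmax_cost_and_grads(in_vectors[center], out_vectors, o)
        cost += c
        grad_in[center] = add(grad_in[center], g_v)
        grad_out = [add(row, g) for row, g in zip(grad_out, g_out)]
    return cost, grad_in, grad_out


def cbow(target, context, in_vectors, out_vectors):
    """A single F(w_t, v_hat), with v_hat the sum of the context input vectors."""
    dim = len(in_vectors[0])
    v_hat = [0.0] * dim
    for k in context:
        v_hat = add(v_hat, in_vectors[k])
    cost, g_v, grad_out = softmax_cost_and_grads(v_hat, out_vectors, target)
    grad_in = [[0.0] * dim for _ in in_vectors]
    for k in context:  # a context word that appears twice accumulates twice
        grad_in[k] = add(grad_in[k], g_v)
    return cost, grad_in, grad_out


in_vectors = [[0.1, 0.2, -0.1], [0.3, -0.2, 0.0], [-0.4, 0.1, 0.2],
              [0.0, 0.5, -0.3], [0.2, 0.2, 0.2]]
out_vectors = [[0.05, -0.1, 0.2], [0.3, 0.1, -0.2], [-0.1, 0.4, 0.0],
               [0.2, -0.3, 0.1], [0.0, 0.1, 0.3]]
sg_cost, sg_in, sg_out = skipgram(2, [0, 1, 3], in_vectors, out_vectors)
cb_cost, cb_in, cb_out = cbow(2, [0, 1, 3], in_vectors, out_vectors)
```

In the skip-gram case only the center word's row of the input gradient is non-zero, while in the CBOW case every context word receives the same gradient.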

**3e** and **3h**.

In [5]:
%%bash
source activate py36
python q3_word2vec.py

Testing normalizeRows...
[[0.6        0.8       ]
 [0.4472136  0.89442719]]

==== Gradient check for skip-gram ====
Gradient check passed!
Gradient check passed!

==== Gradient check for CBOW      ====
Gradient check passed!
Gradient check passed!

=== Results ===
(11.16610900153398, array([[ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [-1.26947339, -1.36873189,  2.45158957],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ]]), array([[-0.41045956,  0.18834851,  1.43272264],
       [ 0.38202831, -0.17530219, -1.33348241],
       [ 0.07009355, -0.03216399, -0.24466386],
       [ 0.09472154, -0.04346509, -0.33062865],
       [-0.13638384,  0.06258276,  0.47605228]]))
(14.093692760899629, array([[ 0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ],
       [-3.86802836, -1.12713967, -1.52668625],
       [ 0.        ,  0.        ,  0.        ],
       [ 0.       

**3f**.

In [6]:
%%bash
source activate py36
python q3_sgd.py

Running sanity checks...
iter 100: 0.004578
iter 200: 0.004353
iter 300: 0.004136
iter 400: 0.003929
iter 500: 0.003733
iter 600: 0.003546
iter 700: 0.003369
iter 800: 0.003200
iter 900: 0.003040
iter 1000: 0.002888
test 1 result: 8.414836786079764e-10
iter 100: 0.000000
iter 200: 0.000000
iter 300: 0.000000
iter 400: 0.000000
iter 500: 0.000000
iter 600: 0.000000
iter 700: 0.000000
iter 800: 0.000000
iter 900: 0.000000
iter 1000: 0.000000
test 2 result: 0.0
iter 100: 0.041205
iter 200: 0.039181
iter 300: 0.037222
iter 400: 0.035361
iter 500: 0.033593
iter 600: 0.031913
iter 700: 0.030318
iter 800: 0.028802
iter 900: 0.027362
iter 1000: 0.025994
test 3 result: -2.524451035823933e-09
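
For intuition about the first sanity check, here is a minimal sketch of plain gradient descent on $f(x)=x^2$. The objective, the starting point 0.5, and the step size 0.01 are assumptions (the log above does not state them), and this is not the `q3_sgd.py` implementation.

```python
def sgd(f, x0, step, iterations):
    """Plain gradient descent on a scalar; f returns (cost, gradient)."""
    x = x0
    for _ in range(iterations):
        _, grad = f(x)
        x -= step * grad
    return x


def quad(x):
    """Assumed test objective: f(x) = x^2 with gradient 2x."""
    return x * x, 2.0 * x


result = sgd(quad, 0.5, 0.01, 1000)
```

Under these assumptions each update multiplies $x$ by $0.98$, so after 1000 iterations $x = 0.5\cdot 0.98^{1000}\approx 8.4\times 10^{-10}$, the same order of magnitude as the first test result.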



**3g**.

![](q3_word_vectors.png)

**Note**: My result here differs from the one in the reference solution, presumably because I used Python 3 rather than the officially supported Python 2 and adapted some of the code to make it compatible with Python 3. The final loss is 9.4384, which passed the sanity check.

**Explanation**: Words with similar meanings are clustered together. Some punctuation marks and frequent words are isolated from the word clusters.

### Q4: Sentiment Analysis

**4a**.
```python
import numpy as np


def getSentenceFeatures(tokens, wordVectors, sentence):
    """
    Obtain the sentence feature for sentiment analysis by averaging its
    word vectors
    """

    # Implement computation for the sentence features given a sentence.

    # Inputs:
    # tokens -- a dictionary that maps words to their indices in
    #           the word vector list
    # wordVectors -- word vectors (each row) for all tokens
    # sentence -- a list of words in the sentence of interest

    # Output:
    # - sentVector: feature vector for the sentence

    sentVector = np.zeros((wordVectors.shape[1],))

    ### YOUR CODE HERE
    sentVector = np.mean([wordVectors[tokens[w]] for w in sentence], axis=0)
    ### END YOUR CODE

    assert sentVector.shape == (wordVectors.shape[1],)
    return sentVector
```

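**4b**. Regularization penalizes large weights, which helps prevent overfitting to the training set and lets the classifier generalize better to unseen examples.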

**4c**.
```python
def getRegularizationValues():
    """Try different regularizations

    Return a sorted list of values to try.
    """
    # values = None   # Assign a list of floats in the block below
    ### YOUR CODE HERE
    values = [10 ** p for p in range(-6, 4)]
    ### END YOUR CODE
    return sorted(values)


def chooseBestModel(results):
    """Choose the best model based on dev set performance.

    Arguments:
    results -- A list of python dictionaries of the following format:
        {
            "reg": regularization,
            "clf": classifier,
            "train": trainAccuracy,
            "dev": devAccuracy,
            "test": testAccuracy
        }

    Each dictionary represents the performance of one model.

    Returns:
    Your chosen result dictionary.
    """
    bestResult = None

    ### YOUR CODE HERE
    bestResult = max(results, key=lambda m: m['dev'])
    ### END YOUR CODE

    return bestResult
```
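
As a usage sketch, the snippet below applies the same `max(..., key=...)` selection to a few dev accuracies copied from the `--yourvectors` run in 4d (only the `reg` and `dev` fields are kept for brevity) and builds the same log-spaced regularization grid.

```python
# Dev accuracies copied from the --yourvectors output in 4d; other fields omitted.
results = [
    {"reg": 1e-3, "dev": 29.972752},
    {"reg": 1e-2, "dev": 30.426885},
    {"reg": 1e-1, "dev": 30.063579},
    {"reg": 1.0, "dev": 26.521344},
]

# Pick the model with the highest dev accuracy, never the test accuracy.
best = max(results, key=lambda m: m["dev"])

# Ten log-spaced values from 1e-6 up to 1e3.
reg_values = sorted(10 ** p for p in range(-6, 4))
```

Here `best["reg"]` is `0.01`. Selecting on the dev set keeps the test set untouched for the final evaluation.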

**4d**. First, we need to fix the encoding bug in datasetSentences.txt.

In [7]:
%%bash
source activate py27
cd utils
python fix_encoding.py

After this, I manually fixed the encoding bug in dictionary.txt.

In [8]:
%%bash
source activate py36
python q4_sentiment.py --yourvectors

Training for reg=0.000001
Train accuracy (%): 30.138109
Dev accuracy (%): 29.700272
Test accuracy (%): 29.230769
Training for reg=0.000010
Train accuracy (%): 30.102996
Dev accuracy (%): 29.609446
Test accuracy (%): 29.185520
Training for reg=0.000100
Train accuracy (%): 30.056180
Dev accuracy (%): 29.700272
Test accuracy (%): 29.185520
Training for reg=0.001000
Train accuracy (%): 30.044476
Dev accuracy (%): 29.972752
Test accuracy (%): 29.185520
Training for reg=0.010000
Train accuracy (%): 29.985955
Dev accuracy (%): 30.426885
Test accuracy (%): 28.959276
Training for reg=0.100000
Train accuracy (%): 29.552903
Dev accuracy (%): 30.063579
Test accuracy (%): 28.597285
Training for reg=1.000000
Train accuracy (%): 28.300562
Dev accuracy (%): 26.521344
Test accuracy (%): 26.108597
Training for reg=10.000000
Train accuracy (%): 27.212079
Dev accuracy (%): 25.522252
Test accuracy (%): 23.076923
Training for reg=100.000000
Train accuracy (%): 27.247191
Dev accuracy (%): 25.522252
Test accu

In [9]:
%%bash
source activate py36
python q4_sentiment.py --pretrained

Training for reg=0.000001
Train accuracy (%): 39.899345
Dev accuracy (%): 36.512262
Test accuracy (%): 37.013575
Training for reg=0.000010
Train accuracy (%): 39.969569
Dev accuracy (%): 36.512262
Test accuracy (%): 36.968326
Training for reg=0.000100
Train accuracy (%): 39.946161
Dev accuracy (%): 36.421435
Test accuracy (%): 36.968326
Training for reg=0.001000
Train accuracy (%): 39.899345
Dev accuracy (%): 36.512262
Test accuracy (%): 37.104072
Training for reg=0.010000
Train accuracy (%): 39.957865
Dev accuracy (%): 36.239782
Test accuracy (%): 37.194570
Training for reg=0.100000
Train accuracy (%): 39.782303
Dev accuracy (%): 36.239782
Test accuracy (%): 37.149321
Training for reg=1.000000
Train accuracy (%): 39.513109
Dev accuracy (%): 36.512262
Test accuracy (%): 37.420814
Training for reg=10.000000
Train accuracy (%): 38.611891
Dev accuracy (%): 36.875568
Test accuracy (%): 37.692308
Training for reg=100.000000
Train accuracy (%): 36.317884
Dev accuracy (%): 35.059037
Test accu

**Reasons** why the pretrained GloVe vectors perform better:
1. In this case, the GloVe vectors have a higher dimension than our word2vec vectors, so they can encode more information.
2. The GloVe vectors used here were trained on a much larger corpus.
3. GloVe makes more efficient use of global co-occurrence statistics.

**4e**.

![](q4_reg_v_acc.png)

Training accuracy decreases as regularization increases. Validation accuracy peaks at an intermediate regularization strength and falls off as the regularization parameter moves away from that optimum in either direction.

**4f**.

![](q4_dev_conf.png)

The sentiment for most samples is predicted correctly, although there are some mistakes. There are not many extreme predictions (i.e., very negative or very positive).
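
For reference, a confusion matrix like the one plotted above can be tallied as in the sketch below. The labels are made up (they are not the dev-set predictions), and putting the true label on the rows is an assumption about the plot's layout.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes=5):
    """matrix[t][p] counts samples with true label t and predicted label p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


# Made-up labels on the 0..4 sentiment scale, for illustration only.
true_labels = [0, 1, 1, 3, 4, 2, 3]
predicted_labels = [1, 1, 3, 3, 3, 1, 3]
matrix = confusion_matrix(true_labels, predicted_labels)

# Accuracy is the diagonal mass divided by the number of samples.
accuracy = sum(matrix[i][i] for i in range(5)) / len(true_labels)
```

In this toy example no sample is predicted as class 0 or class 4, which is the kind of pattern meant by a lack of extreme predictions.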

**4g**. Example 1: true: 3; predicted: 1; text: but taken as a stylish and energetic one-shot , the queen of the damned can not be said to suck .  
Explanation: the model was unable to handle negation.

Example 2: true: 1; predicted: 3; text: too much of the humor falls flat .  
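Explanation: the sentence contains no explicit negation; its negative sentiment comes from the idiom "falls flat", which averaged word vectors do not capture, while "humor" likely pulls the prediction toward positive.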

Example 3: true: 1; predicted: 3; text: plays like a volatile and overlong w magazine fashion spread .  
Explanation: the model fails to capture word order and context.
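
The word-order limitation follows directly from averaging: any permutation of the same words yields the same feature vector. A small standard-library sketch with a made-up vocabulary and 2-dimensional vectors (the function mirrors the idea of `getSentenceFeatures` but is not that implementation):

```python
def average_vectors(tokens, word_vectors, sentence):
    """Average the word vectors of a sentence (bag-of-words feature)."""
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for w in sentence:
        total = [a + b for a, b in zip(total, word_vectors[tokens[w]])]
    return [t / len(sentence) for t in total]


# Hypothetical vocabulary and vectors, for illustration only.
tokens = {"not": 0, "good": 1, "bad": 2, ",": 3}
word_vectors = [[-0.5, 0.25], [1.0, 0.5], [-0.75, 0.125], [0.0, 0.0]]

features_a = average_vectors(tokens, word_vectors, ["not", "bad", ",", "good"])
features_b = average_vectors(tokens, word_vectors, ["not", "good", ",", "bad"])
```

The two sentences have opposite sentiment but identical features, so no classifier trained on these features can tell them apart.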