# Assignment 1

### Q1 - Ambiguity in English

#### (a) Lexical ambiguity

**Example:** ***"You will see lots of plants in the south of the city."*** </p>
**Explanation:** The word "plant" could mean either greenery including trees, flowers and herbs, or it could mean industrial facilities such as power plants. 

#### (b) Syntactic ambiguity

**Example:** ***"She stared at a girl in a blue dress"*** </p>
**Explanation:** The phrase "in a blue dress" can attach to either person: "she" could be staring at a girl who wears a blue dress, or she could be in a blue dress herself while staring at a girl.

#### (c) Semantic ambiguity

**Example:** ***"I took off Joe's jacket."*** </p>
**Explanation:** The sentence could suggest two different actions: taking Joe's jacket off myself or taking the jacket off Joe.

### Q2 - Validation

Let $U$ be the set of all possible data, $H$ be the hypothesis set containing all candidate functions $f$, $f^*$ be the function in $H$ that minimizes the error, and $A$ be the algorithm used to find $f^*$.</p>

##### Statement 1: 
_An unbiased estimate of the generalized error rate for a decision tree with max depth 5, trained on T , is 0.35._ </p>

**Answer:** </p>
False. Let the solution found by $A$ on the training set $T$ be $f^A_T$. In this case, both the training set and the validation set $V$ are i.i.d. samples of all the data $U$. However, since there are many possible decision trees with max depth 5, the tree found by $A$ need not be the optimal one, i.e. $f^A_T \neq f^*$ in general. So 0.35 is not an unbiased estimate of the generalized error rate for a decision tree with max depth 5, trained on $T$.

##### Statement 2:
_An unbiased estimate of the generalized error rate for a tuned decision tree, trained on T , is 0.35._ </p>

**Answer:** </p>
True. By the definition given in the question, a ***tuned decision tree*** is the decision tree, chosen from the three decision trees trained on $T$, for which the validation error rate is minimum. Since we specified the training set $T$, the number of optimal models with each depth is fixed. And because our validation set $V$ is an i.i.d. sample of $U$ (assume it has size $m$), the expected value of the average test error can be written as:

$E[\text{avg test error}] = {1 \over m} \sum_{(x_i, y_i) \in V} E[\text{error}(x_i, y_i, f^A_T)]$

Each term $E[\text{error}(x_i, y_i, f^A_T)]$ has the same value regardless of $i$, which means that the average is an unbiased estimate of the generalization error rate for a ***tuned decision tree***.

##### Statement 3:
_An unbiased estimate of the generalized error rate of the random decision tree is 0.35._ </p>

**Answer:** </p>

False. Generalization error rate refers to the expected value of the misclassification error over **all** possible examples for the task. Since we defined the random decision tree trained on $T$ to be one model chosen at random from the 3 candidate models, the generalization error rate should be calculated across all 3 models.

### Q3 - Classification Models

##### (a)

By definition, logistic regression models the probability of the positive class as: 

$$
\begin{aligned}
p(\mathbf{x}) = {1 \over {1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}}}
\end{aligned}
$$

When applied to a binary classification problem, with $0$ and $1$ as the class labels, we usually threshold $p(\mathbf{x})$ at 0.5, so the decision boundary is the set of points where $p(\mathbf{x}) = {1 \over 2}$.

From 

$$
\begin{aligned}
p(\mathbf{x}) = {1 \over 2} = {1 \over {1 + e^{-(\mathbf{w}^T \mathbf{x} + w_0)}}}
\end{aligned}
$$

we get: 

$$
\begin{aligned}
e^{-(\mathbf{w}^T \mathbf{x} + w_0)} = 1
\end{aligned}
$$

$$
\begin{aligned}
\mathbf{w}^T \mathbf{x} + w_0 = 0
\end{aligned}
$$

The equation $\mathbf{w}^T \mathbf{x} + w_0 = 0$ is linear in $\mathbf{x}$, so the decision boundary is a hyperplane and logistic regression is a linear model for classification.
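As a quick numerical check of this derivation, here is a minimal sketch. The weights `w`, the bias `w0`, and the grid of sample points are made-up values, not part of the assignment, and the code assumes the standard sigmoid $\sigma(z) = 1/(1+e^{-z})$ as implemented by `torch.sigmoid`. It checks that thresholding the probability at 0.5 gives the same labels as thresholding the linear score at 0.

```python
import torch

# Hypothetical parameters of a 2-feature logistic regression model.
w = torch.tensor([2.0, -1.0])
w0 = torch.tensor(0.5)

# A grid of integer points; with these parameters the linear score is
# always a half-integer, so no grid point lies exactly on the boundary.
axis = torch.arange(-5.0, 6.0)
X = torch.cartesian_prod(axis, axis)       # shape (121, 2)

z = X @ w + w0                             # linear score w^T x + w0 per row
p = torch.sigmoid(z)                       # sigma(z) = 1 / (1 + exp(-z))

labels_from_p = p > 0.5                    # threshold the probability at 0.5
labels_from_z = z > 0                      # threshold the linear score at 0
same = torch.equal(labels_from_p, labels_from_z)
print(same)

# A point exactly on the hyperplane gets probability 0.5.
x_on_boundary = torch.tensor([0.0, 0.5])   # 2*0 - 1*0.5 + 0.5 = 0
p_boundary = torch.sigmoid(x_on_boundary @ w + w0)
print(p_boundary)
```

Both rules split the grid identically, which is what it means for the decision boundary to be the hyperplane $\mathbf{w}^T \mathbf{x} + w_0 = 0$.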



##### (b)

Any such classifier is a linear model. Reason: </p>

Suppose there are $K$ classes $C_1, C_2, \dots, C_K$, and consider any two classes $C_k$ and $C_j$ with $k \neq j$, where $0 \leq P(C_k | x) \leq 1$ and $0 \leq P(C_j | x) \leq 1$. </p>

The decision boundary between classes $C_k$ and $C_j$ is the set of points where $P(C_k | x) = P(C_j | x)$. </p>

Since for any class $C_k$, $P(C_k | x) = f(\mathbf{w_k}^T \mathbf{x} + w_{k,0})$, the equation for the boundary between $C_k$ and $C_j$ can be converted into: 

$$
\begin{aligned}
f(\mathbf{w_k}^T \mathbf{x} + w_{k,0}) = f(\mathbf{w_j}^T \mathbf{x} + w_{j,0})
\end{aligned}
$$

Because the non-linear function $f$ is one-to-one, equal outputs imply equal inputs:

$$
\begin{aligned}
\mathbf{w_k}^T \mathbf{x} + w_{k,0} = \mathbf{w_j}^T \mathbf{x} + w_{j,0}
\end{aligned}
$$

$$
\begin{aligned}
(\mathbf{w_k - w_j})^T \mathbf{x} + (w_{k,0} - w_{j,0}) = 0
\end{aligned}
$$

Writing $\mathbf{w_m} = \mathbf{w_k} - \mathbf{w_j}$ and $w_{m,0} = w_{k,0} - w_{j,0}$, this equation can be simplified as:

$$
\begin{aligned}
\mathbf{w_m}^T \mathbf{x} + w_{m,0} = 0
\end{aligned}
$$

This equation is linear in $\mathbf{x}$, so every pairwise decision boundary is a hyperplane, and the classifier for multi-class classification with logistic regression is a linear model.
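The pairwise boundary derived above can be checked numerically with a small sketch. Everything here is an illustrative assumption: the sizes `K` and `d`, the random parameters, and the choice of the sigmoid as the one-to-one function $f$. The code compares "class $k$ is more probable than class $j$" with "the point lies on the positive side of the hyperplane $(\mathbf{w_k} - \mathbf{w_j})^T \mathbf{x} + (w_{k,0} - w_{j,0}) = 0$".

```python
import torch

torch.manual_seed(0)
K, d = 4, 3                                    # hypothetical: 4 classes, 3 features
W = torch.randn(K, d, dtype=torch.float64)     # row k holds w_k
b = torch.randn(K, dtype=torch.float64)        # entry k holds w_{k,0}
X = torch.randn(500, d, dtype=torch.float64)   # random sample points

# P(C_k | x) = f(w_k^T x + w_{k,0}), with f taken to be the sigmoid,
# which is one-to-one and increasing.
probs = torch.sigmoid(X @ W.T + b)

k, j = 0, 1
prefers_k_by_prob = probs[:, k] > probs[:, j]

# The same comparison using only the linear difference of the two scores.
w_m = W[k] - W[j]
w_m0 = b[k] - b[j]
prefers_k_by_plane = (X @ w_m + w_m0) > 0

agree = torch.equal(prefers_k_by_prob, prefers_k_by_plane)
print(agree)
```

The sketch relies on $f$ being increasing as well as one-to-one; for a decreasing $f$ the two sides of the hyperplane would swap, but the boundary itself would stay the same.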


Introduction to Torch's tensor library
======================================

All of deep learning is computations on tensors, which are
generalizations of a matrix that can be indexed in more than 2
dimensions. We will see exactly what this means in-depth later. First,
let's look at what we can do with tensors.



In [34]:
# Original Author: Robert Guthrie.
# Adapted by Amitabh Chaudhary

import torch
import torch.nn as nn
import torch.nn.functional as F


torch.manual_seed(1)

<torch._C.Generator at 0x7f6ed46497d0>

#### Creating Tensors

Tensors can be created from Python lists with the torch.tensor()
function.




In [35]:
# torch.tensor(data) creates a torch.Tensor object with the given data.
V_data = [1., 2., 3.]
V = torch.tensor(V_data)
print(V)

# Creates a matrix
M_data = [[1., 2., 3.], [4., 5., 6.]]
M = torch.tensor(M_data)
print(M)

# Create a 3D tensor of size 2x2x2.
T_data = [[[1., 2.], [3., 4.]],
          [[5., 6.], [7., 8.]]]
T = torch.tensor(T_data)
print(T)

tensor([1., 2., 3.])
tensor([[1., 2., 3.],
        [4., 5., 6.]])
tensor([[[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]]])


What is a 3D tensor anyway? Think about it like this. If you have a
vector, indexing into the vector gives you a scalar. If you have a
matrix, indexing into the matrix gives you a vector. If you have a 3D
tensor, then indexing into the tensor gives you a matrix!

A note on terminology:
when I say "tensor" in this tutorial, it refers
to any torch.Tensor object. Vectors and matrices are special cases of
torch.Tensors, where their dimension is 1 and 2 respectively. When I am
talking about 3D tensors, I will explicitly use the term "3D tensor".




In [36]:
# Index into V and get a scalar (0 dimensional tensor)
print(V[0])
# Get a Python number from it
print(V[0].item())

# Index into M and get a vector
print(M[0])

# Index into T and get a matrix
print(T[0])

# Index into T to get the last column of the second matrix.
print(T[1,:,-1])

tensor(1.)
1.0
tensor([1., 2., 3.])
tensor([[1., 2.],
        [3., 4.]])
tensor([6., 8.])


You can also create tensors of other data types. To create a tensor of integer types, try
torch.tensor([[1, 2], [3, 4]]) (where all elements in the list are integers).
You can also specify a data type by passing in ``dtype=torch.data_type``.
Check the documentation for more data types, but
Float and Long will be the most common.




You can create a tensor with random data and the supplied dimensionality
with torch.randn()




In [37]:
torch.tensor([1,2,3], dtype=torch.float64)

tensor([1., 2., 3.], dtype=torch.float64)
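To make the data type rules concrete, here is a small sketch (the variable names are illustrative only). It shows the dtype that torch.tensor() infers from Python integers and floats, an explicit dtype, and a conversion with .to().

```python
import torch

a = torch.tensor([[1, 2], [3, 4]])                 # all Python ints
b = torch.tensor([1.0, 2.0, 3.0])                  # Python floats
c = torch.tensor([1, 2, 3], dtype=torch.float64)   # explicit dtype
d = a.to(torch.float32)                            # convert an existing tensor

print(a.dtype)  # torch.int64 (Long)
print(b.dtype)  # torch.float32 (Float), assuming the default dtype is unchanged
print(c.dtype)  # torch.float64
print(d.dtype)  # torch.float32
```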

In [38]:
x = torch.randn((3, 4, 5))
print(x)

tensor([[[-1.5256, -0.7502, -0.6540, -1.6095, -0.1002],
         [-0.6092, -0.9798, -1.6091, -0.7121,  0.3037],
         [-0.7773, -0.2515, -0.2223,  1.6871,  0.2284],
         [ 0.4676, -0.6970, -1.1608,  0.6995,  0.1991]],

        [[ 0.8657,  0.2444, -0.6629,  0.8073,  1.1017],
         [-0.1759, -2.2456, -1.4465,  0.0612, -0.6177],
         [-0.7981, -0.1316,  1.8793, -0.0721,  0.1578],
         [-0.7735,  0.1991,  0.0457,  0.1530, -0.4757]],

        [[-0.1110,  0.2927, -0.1578, -0.0288,  0.4533],
         [ 1.1422,  0.2486, -1.7754, -0.0255, -1.0233],
         [-0.5962, -1.0055,  0.4285,  1.4761, -1.7869],
         [ 1.6103, -0.7040, -0.1853, -0.9962, -0.8313]]])


#### Operations with Tensors


You can operate on tensors in the ways you would expect.



In [39]:
x = torch.tensor([1., 2., 3.])
y = torch.tensor([4., 5., 6.])
z = x + y
print(z)

tensor([5., 7., 9.])


See the documentation ([pytorch.org/docs/torch.html](https://pytorch.org/docs/torch.html)) for a
complete list of the large number of operations available to you. They
extend beyond just mathematical operations.

One helpful operation that we will make use of later is concatenation.




In [40]:
# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print(z_1)

# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
# second arg specifies which axis to concat along
z_2 = torch.cat([x_2, y_2], 1)
print(z_2)

# If your tensors are not compatible, torch will complain.  Uncomment to see the error
#torch.cat([x_1, x_2])

tensor([[-0.8029,  0.2366,  0.2857,  0.6898, -0.6331],
        [ 0.8795, -0.6842,  0.4533,  0.2912, -0.8317],
        [-0.5525,  0.6355, -0.3968, -0.6571, -1.6428],
        [ 0.9803, -0.0421, -0.8206,  0.3133, -1.1352],
        [ 0.3773, -0.2824, -2.5667, -1.4303,  0.5009]])
tensor([[ 0.5438, -0.4057,  1.1341, -0.1473,  0.6272,  1.0935,  0.0939,  1.2381],
        [-1.1115,  0.3501, -0.7703, -1.3459,  0.5119, -0.6933, -0.1668, -0.9999]])


To see the shape of the tensor use either the attribute .shape (with no parentheses following it) or the function .size() (with parentheses since this is a function call).

In [41]:
print(z_2.shape) #.shape is an attribute
print(z_2.size()) #.size() is a member function

torch.Size([2, 8])
torch.Size([2, 8])


#### Reshaping Tensors


Use the .view() method to reshape a tensor. This method is used frequently, because many neural network components expect their inputs in a certain shape. You will often need to reshape your tensor before passing it to a component.




In [42]:
x = torch.randn(2, 3, 4)
print(x)
print(x.view(2, 12))  # Reshape to 2 rows, 12 columns
# Same as above.  If one of the dimensions is -1, its size can be inferred
print(x.view(2, -1))

tensor([[[ 0.4175, -0.2127, -0.8400, -0.4200],
         [-0.6240, -0.9773,  0.8748,  0.9873],
         [-0.0594, -2.4919,  0.2423,  0.2883]],

        [[-0.1095,  0.3126,  1.5038,  0.5038],
         [ 0.6223, -0.4481, -0.2856,  0.3880],
         [-1.1435, -0.6512, -0.1032,  0.6937]]])
tensor([[ 0.4175, -0.2127, -0.8400, -0.4200, -0.6240, -0.9773,  0.8748,  0.9873,
         -0.0594, -2.4919,  0.2423,  0.2883],
        [-0.1095,  0.3126,  1.5038,  0.5038,  0.6223, -0.4481, -0.2856,  0.3880,
         -1.1435, -0.6512, -0.1032,  0.6937]])
tensor([[ 0.4175, -0.2127, -0.8400, -0.4200, -0.6240, -0.9773,  0.8748,  0.9873,
         -0.0594, -2.4919,  0.2423,  0.2883],
        [-0.1095,  0.3126,  1.5038,  0.5038,  0.6223, -0.4481, -0.2856,  0.3880,
         -1.1435, -0.6512, -0.1032,  0.6937]])


The newer function .reshape() is similar to .view(), and actually more general.  There is a [slight difference](https://jdhao.github.io/2019/07/10/pytorch_view_reshape_transpose_permute/) between the two, which is unimportant at this point.

In [43]:
x = torch.randn(3, 4)
print(x)
print(x.reshape(2, 6))
print(x.view(2, 6))

tensor([[-1.6476,  1.0156, -0.2020, -1.2865],
        [ 0.8231, -0.6101, -1.2960, -0.9434],
        [ 0.6684,  1.1628, -0.3229,  1.8782]])
tensor([[-1.6476,  1.0156, -0.2020, -1.2865,  0.8231, -0.6101],
        [-1.2960, -0.9434,  0.6684,  1.1628, -0.3229,  1.8782]])
tensor([[-1.6476,  1.0156, -0.2020, -1.2865,  0.8231, -0.6101],
        [-1.2960, -0.9434,  0.6684,  1.1628, -0.3229,  1.8782]])
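For the curious, here is a minimal sketch of one case where the two differ. It assumes a small integer tensor built with torch.arange() purely for illustration. After a transpose the tensor is no longer contiguous in memory: .reshape() still works because it copies the data when needed, while .view() raises an error because it never copies.

```python
import torch

x = torch.arange(6).reshape(2, 3)
xt = x.T                       # transposing makes the tensor non-contiguous
print(xt.is_contiguous())      # False

flat = xt.reshape(6)           # reshape copies the data when it has to
print(flat)                    # tensor([0, 3, 1, 4, 2, 5])

try:
    xt.view(6)                 # view never copies, so it refuses here
    view_failed = False
except RuntimeError:
    view_failed = True
print(view_failed)             # True
```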


One can create new tensors that retain the shape and datatype of a given tensor.  Two such functions are ones_like() and rand_like().

In [44]:
y = torch.ones_like(x)
print(y)
print(torch.rand_like(x.reshape(-1,6))) #-1 implies "infer the size"

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])
tensor([[0.7444, 0.1408, 0.3854, 0.8637, 0.8960, 0.9729],
        [0.3985, 0.1114, 0.9923, 0.3935, 0.2943, 0.6219]])


Multiply matrices together using the @ operator or the matmul() function.  These will give an error if the sizes don't match.

In [45]:
x = torch.tensor([[1, 2, 3], [0, 0, 1]])
y = torch.tensor([[4, 5]])
z = torch.tensor([[1], [2], [3]])

print(y @ x)
print(y.matmul(x))
print(x.matmul(z))
print(x @ z)

tensor([[ 4,  8, 17]])
tensor([[ 4,  8, 17]])
tensor([[14],
        [ 3]])
tensor([[14],
        [ 3]])
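The size rule can be sketched as follows, using all-ones matrices chosen only for illustration: an (n, m) matrix can be multiplied by an (m, p) matrix, giving an (n, p) result, and any other pairing of inner sizes raises a RuntimeError.

```python
import torch

a = torch.ones(2, 3)
b = torch.ones(3, 4)

prod = a @ b                  # (2, 3) @ (3, 4) -> (2, 4)
print(prod.shape)

try:
    b @ a                     # (3, 4) @ (2, 3): inner sizes 4 and 2 differ
    raised = False
except RuntimeError:
    raised = True
print(raised)
```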


For matrices of compatible sizes use * or mul() to compute an element-wise product. These follow [broadcasting rules](https://numpy.org/doc/stable/user/basics.broadcasting.html) as in numpy.

In [46]:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
y = torch.tensor([[1, 1, 3], [0, 0, 1]])
z = torch.tensor([[2],[3]])
u = torch.tensor([1, 3, 0])
print(x * y)
print(y.mul(x))
print(x * z)
print(x.mul(u))

tensor([[1, 2, 9],
        [0, 0, 6]])
tensor([[1, 2, 9],
        [0, 0, 6]])
tensor([[ 2,  4,  6],
        [12, 15, 18]])
tensor([[ 1,  6,  0],
        [ 4, 15,  0]])
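A short sketch of which shapes broadcast against a (2, 3) matrix, with made-up values: sizes are compared from the last dimension backwards, and each pair must either match or contain a 1.

```python
import torch

x = torch.ones(2, 3)
col = torch.tensor([[10.0], [20.0]])   # shape (2, 1): stretched across columns
row = torch.tensor([1.0, 2.0, 3.0])    # shape (3,): treated as (1, 3), stretched across rows
bad = torch.tensor([1.0, 2.0])         # shape (2,): last sizes 3 and 2 do not match

by_col = x * col
by_row = x * row
print(by_col.shape, by_row.shape)

try:
    x * bad
    raised = False
except RuntimeError:
    raised = True
print(raised)
```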


Use the attribute .T or the function transpose(0, 1) to transpose a matrix.  More generally, transpose(dim1, dim2) swaps any two dimensions of a tensor.

In [47]:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(x.T)
print(x.transpose(0,1))

y = torch.tensor([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9],[10, 11, 12]]])
print(y)
print(y.transpose(0, 1))
print(y.transpose(1, 2))

tensor([[1, 4],
        [2, 5],
        [3, 6]])
tensor([[1, 4],
        [2, 5],
        [3, 6]])
tensor([[[ 1,  2,  3],
         [ 4,  5,  6]],

        [[ 7,  8,  9],
         [10, 11, 12]]])
tensor([[[ 1,  2,  3],
         [ 7,  8,  9]],

        [[ 4,  5,  6],
         [10, 11, 12]]])
tensor([[[ 1,  4],
         [ 2,  5],
         [ 3,  6]],

        [[ 7, 10],
         [ 8, 11],
         [ 9, 12]]])


We often need to remove a dimension (of size 1) from a tensor or add a dimension (of size 1) to it.  For these use squeeze() and unsqueeze().

In [48]:
x = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(x, x.shape)
y = x.unsqueeze(1)
print(y, y.shape)
z = y[:,:,1]
print(z, z.shape)
u = z.squeeze(1)
print(u, u.shape)

tensor([[1, 2, 3],
        [4, 5, 6]]) torch.Size([2, 3])
tensor([[[1, 2, 3]],

        [[4, 5, 6]]]) torch.Size([2, 1, 3])
tensor([[2],
        [5]]) torch.Size([2, 1])
tensor([2, 5]) torch.Size([2])


In [49]:
'''
PROBLEM 4 (a)

Write a function repeat() that takes a tensor of any shape and "repeats"
the vectors in the last dimension.
So if x is tensor([2, 3]), repeat(x) should return
    tensor([[2, 3],
            [2, 3]])

And if x is tensor([[[2, 3],[4,5]],[[7, 8],[9,10]]]),
repeat(x) should return
    tensor([[[[ 2,  3],
              [ 2,  3]],
             [[ 4,  5],
              [ 4,  5]]],
            [[[ 7,  8],
              [ 7,  8]],
             [[ 9, 10],
              [ 9, 10]]]]).
Use only the functions torch.cat() and torch.unsqueeze() and no loops.
'''

def repeat(x):
    '''
    Repeat the vectors in the last dimension of a tensor
    '''
    # Insert a new axis of size 1 just before the last dimension,
    # then concatenate two copies of the result along that new axis.
    y = torch.unsqueeze(x, -2)
    return torch.cat([y, y], -2)

def test_repeat():
    x1 = torch.tensor([2, 3])
    x2 = torch.tensor([[2, 3],[1,5]])
    x3 = torch.tensor([[[2, 3],[4,5]],[[7, 8],[9,10]]])
    y1 = torch.tensor([[2, 3], [2, 3]])
    y2 = torch.tensor([[[2, 3],[2, 3]],[[1, 5],[1, 5]]]) 
    y3 = torch.tensor([[[[ 2,  3],[ 2,  3]],[[ 4,  5],[ 4,  5]]],
            [[[ 7,  8],[ 7,  8]],[[ 9, 10],[ 9, 10]]]])
    assert(torch.equal(repeat(x1),y1))
    assert(torch.equal(repeat(x2),y2))    
    assert(torch.equal(repeat(x3),y3))
    print('Passed all tests.')
test_repeat()

Passed all tests.


Deep learning building blocks: affine maps, softmax, and embeddings
==========================================================================

Deep learning consists of composing linearities with non-linearities in
clever ways. The introduction of non-linearities allows for powerful
models. In this section, we will learn about the affine map, which is a linearity, and softmax, which is a non-linearity.  We'll also learn about a common objective or loss function used with softmax: negative log likelihood loss.


#### Affine Maps


One of the core workhorses of deep learning is the affine map, which is
a function $f(x)$ where
$$
\begin{align}f(x) = Ax + b\end{align}
$$
for a matrix $A$ and vectors $x, b$. The parameters to be
learned here are $A$ and $b$. Often, $b$ is referred to
as the *bias* term.


PyTorch and most other deep learning frameworks do things a little
differently than traditional linear algebra. They map the rows of the
input instead of the columns. That is, the $i$'th row of the
output below is the mapping of the $i$'th row of the input under
$A$, plus the bias term. Look at the example below.




In [50]:
lin = nn.Linear(3, 2)
# The linear layer gets initialized with random parameters
# corresponding to A and b.
print("The weight parameter")
print(lin.weight)
print("The bias parameter")
print(lin.bias)

The weight parameter
Parameter containing:
tensor([[-0.4038,  0.3795,  0.3618],
        [-0.4581, -0.4742, -0.0506]], requires_grad=True)
The bias parameter
Parameter containing:
tensor([ 0.2425, -0.0167], requires_grad=True)


In [51]:
# Let us change these parameters to simpler numbers
with torch.no_grad():    
    lin.weight.copy_(torch.tensor([[3.0, 4.0, 3.0], 
                                   [3.0, 4.0, 1.0]]))
    lin.bias.copy_(torch.tensor([1.0, 2.5]))
# Let the data be 2 x 3 matrix.  The linear layer will 
# transform each row x to f(x)
data = torch.tensor([[4.0, 6.0, 1.0], [7.0, 1.0, 0.0]])
print(lin(data))

tensor([[40.0000, 39.5000],
        [26.0000, 27.5000]], grad_fn=<AddmmBackward0>)
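To see the row-wise convention explicitly, the sketch below recomputes the same output by hand. It reuses the weights, bias, and data from the cell above and assumes only that nn.Linear stores $A$ in `weight` and $b$ in `bias`; the name `manual` is illustrative.

```python
import torch
import torch.nn as nn

lin = nn.Linear(3, 2)
with torch.no_grad():
    lin.weight.copy_(torch.tensor([[3.0, 4.0, 3.0],
                                   [3.0, 4.0, 1.0]]))
    lin.bias.copy_(torch.tensor([1.0, 2.5]))

data = torch.tensor([[4.0, 6.0, 1.0], [7.0, 1.0, 0.0]])

A, b = lin.weight, lin.bias
# Each row x of data is mapped to A x + b, which for the whole
# matrix at once is data @ A^T + b (b is broadcast over the rows).
manual = data @ A.T + b
print(manual)
print(torch.allclose(manual, lin(data)))
```

For the first row, $3 \cdot 4 + 4 \cdot 6 + 3 \cdot 1 + 1 = 40$ and $3 \cdot 4 + 4 \cdot 6 + 1 \cdot 1 + 2.5 = 39.5$, matching the output above.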


#### Softmax and Probabilities

The function Softmax($x$) is a non-linearity.  It is usually the last operation done in a network, because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let $x$ be a vector of real numbers (each in $(-\infty, \infty)$). Then the $i$'th component of Softmax($x$) is

$$ \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$
 
It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is $1$.

You could also think of it as just applying an element-wise exponentiation operator to the input to make everything non-negative and then dividing by the normalization constant.

In [52]:
# Softmax is also in torch.nn.functional
data = torch.tensor([3.0, 4.0, 5.0])
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # there's also log_softmax

tensor([0.0900, 0.2447, 0.6652])
tensor(1.)
tensor([-2.4076, -1.4076, -0.4076])
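The negative log likelihood loss mentioned in the introduction to this section is usually computed from log_softmax outputs. The following is a minimal sketch with made-up scores and target classes: the loss is the average of $-\log p(\text{correct class})$ over the examples. F.nll_loss() and F.cross_entropy() are existing functions in torch.nn.functional; the tensors and the name `manual` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Made-up scores for 2 examples and 3 classes, with made-up correct classes.
logits = torch.tensor([[3.0, 4.0, 5.0],
                       [1.0, 0.0, -1.0]])
targets = torch.tensor([2, 0])

log_probs = F.log_softmax(logits, dim=1)

# Average of -log p(correct class), written out by hand.
manual = -(log_probs[0, 2] + log_probs[1, 0]) / 2

loss = F.nll_loss(log_probs, targets)      # expects log-probabilities
combined = F.cross_entropy(logits, targets)  # log_softmax + nll_loss in one call

print(manual, loss, combined)
```

All three values agree (about 0.4076 here, since the correct class has log-probability -0.4076 in both rows).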


In [53]:
'''
PROBLEM 4 (b)

Write a function mysoftmax() that takes a vector, and applies
softmax to it.  The returned vector consists of probabilities that 
add to 1.

So mysoftmax(torch.tensor([3.0,4.0, 5.0,-600.0])) returns
    tensor([0.0900, 0.2447, 0.6652, 0.0000]).
'''

def mysoftmax(x):
    '''
    Apply softmax to a vector 
    '''
    expons = torch.exp(x)
    return expons/torch.sum(expons)


def test_mysoftmax():
    x1 = torch.tensor([3.0,4.0,5.0,-600.0])
    x2 = torch.tensor([-33.44,4.44,-5.01,-6.0])
    assert(torch.allclose(mysoftmax(x1), F.softmax(x1,0)))
    assert(torch.allclose(mysoftmax(x2), F.softmax(x2,0)))
    print('Passed all tests.')

test_mysoftmax()    

Passed all tests.
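One caveat with computing softmax directly from its definition is overflow: exp() of a large input becomes infinite, and the division then yields nan. A common remedy, sketched here under the hypothetical name `stable_softmax`, is to subtract the maximum entry first. This does not change the result, because the common factor $e^{-\max(x)}$ cancels between numerator and denominator.

```python
import torch
import torch.nn.functional as F

def stable_softmax(x):
    """Softmax of a 1-D tensor, shifted by max(x) to avoid overflow."""
    shifted = x - torch.max(x)       # largest entry becomes 0, so exp() <= 1
    expons = torch.exp(shifted)
    return expons / torch.sum(expons)

big = torch.tensor([1000.0, 1001.0, 1002.0])

# Direct use of the definition: exp overflows to inf, and inf / inf is nan.
naive = torch.exp(big) / torch.sum(torch.exp(big))
print(naive)

stable = stable_softmax(big)
print(stable)
print(torch.allclose(stable, F.softmax(big, dim=0)))
```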


In [54]:
'''
PROBLEM 4(c)

Extend above mysoftmax to accept tensors of any shape
Write a function mysoftmaxex() that takes a tensor of any shape, and
a dimension d, and applies softmax along dimension d.  So slices 
along dimension d consist of probabilities that add to 1.
E.g., mysoftmaxex(tensor([[3.0,4.0, 2.3],[5.0,-600.0, 2.3]]),0)
returns
    tensor([[0.1192, 1.0000, 0.5000],
            [0.8808, 0.0000, 0.5000]])
and mysoftmaxex(tensor([[3.0,4.0, 2.3],[5.0,-600.0, 2.3]]),1)
returns
    tensor([[0.2373, 0.6449, 0.1178],
            [0.9370, 0.0000, 0.0630]]).
'''

def mysoftmaxex(x, d):
    '''
    Apply softmax along dimension d of a tensor of any shape
    '''
    expons = torch.exp(x)
    # keepdim=True keeps dimension d (with size 1) so the division broadcasts
    return expons / torch.sum(expons, dim=d, keepdim=True)

def test_mysoftmaxex():
    x = torch.tensor([[3.0,4.0, 2.3],[5.0,-600.0, 2.3]])
    assert(torch.allclose(mysoftmaxex(x,0), F.softmax(x,0)))
    assert(torch.allclose(mysoftmaxex(x,1), F.softmax(x,1)))
test_mysoftmaxex()

#### Word Embeddings

[Torchtext](https://pytorch.org/text/stable/index.html) is a library within the PyTorch framework that consists of data processing utilities and popular datasets for natural language processing.

[GloVe](https://nlp.stanford.edu/projects/glove/) is a set of dense vector representations, or embeddings, of words.  Torchtext has support for GloVe. (The following code takes several minutes to run the first time, since it downloads the GloVe embeddings.)

In [55]:
from torchtext.vocab import GloVe

glove = GloVe(name='6B')

words = ["hello", "hi", "king", "president"]
vecs = glove.get_vecs_by_tokens(words)

print(vecs.shape)
print('The first 10 values in the embedding for "hello" are',
     vecs[0,:10])

torch.Size([4, 300])
The first 10 values in the embedding for "hello" are tensor([-0.3371, -0.2169, -0.0066, -0.4162, -1.2555, -0.0285, -0.7219, -0.5289,
         0.0072,  0.3200])


In [56]:
'''
PROBLEM 4(d)

Write code to verify if in GloVe "similar words map into 
similar vectors".  Briefly discuss your results.
'''

# One way to check the similarity between two word embeddings is 
# to use cosine similarity score as a measurement, which should 
# range from -1 to 1

from itertools import combinations

# define cos() for calculating cosine similarity
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)

print("The cosine similarity between words:" )

for pair in combinations(range(4), 2):
    i, j = pair
    print("{} - {} : {}".format(words[i], words[j], round(float(cos(vecs[i], vecs[j])), 3)))

The cosine similarity between words:
hello - hi : 0.33
hello - king : 0.053
hello - president : 0.06
hi - king : 0.03
hi - president : -0.053
king - president : 0.267


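As the cosine similarities show, the two most similar pairs among the six possible pairs of the 4 given words are: </p>
cos(`hello`,`hi`) = 0.33, </p>
cos(`king`,`president`) = 0.267 </p>
All other pairs score close to 0 (between -0.053 and 0.06). This suggests that in GloVe, related words are mapped to more similar vectors than unrelated words, although a similarity of about 0.3 is still far from the maximum of 1.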
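For reference, the cosine similarity used above is $\cos(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$. The sketch below computes it by hand on small made-up vectors (not GloVe embeddings, so no download is needed) and compares the result with F.cosine_similarity(); the helper name `cosine` is illustrative.

```python
import torch
import torch.nn.functional as F

def cosine(a, b):
    """Cosine similarity of two 1-D tensors: a.b / (|a| |b|)."""
    return torch.dot(a, b) / (torch.norm(a) * torch.norm(b))

u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 4.0, 6.0])     # same direction as u
w = torch.tensor([-3.0, 0.0, 1.0])    # orthogonal to u: -3 + 0 + 3 = 0

print(cosine(u, v))                   # close to 1: same direction
print(cosine(u, w))                   # 0: orthogonal
print(cosine(u, -v))                  # close to -1: opposite direction
print(torch.allclose(cosine(u, v), F.cosine_similarity(u, v, dim=0)))
```

Because the score depends only on the angle between the vectors, scaling a vector (as with `v` above) leaves it unchanged.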