# NLU+ 2021-2022 Lab 3: Tensor Computation in PyTorch

#### Authors: Yao Fu and Frank Keller

This lab intends to teach computation with tensors, which is a fundamental paradigm in modern machine learning.

Students should work through Section 1 before the lab session, as this section introduces the basics of tensor computation. The lab session will then focus on more advanced computations with tensors, in Sections 2 and 3.

**Students are strongly encouraged to complete this lab before starting CW2. Many computations in the lab will be encountered again in CW2, so completing the lab will significantly reduce the difficulty of the coursework.**

We suggest using jupyter lab instead of jupyter notebook; jupyter lab is essentially an enhanced version of jupyter notebook. To install jupyter lab, run:

```bash
pip install jupyterlab
```

Then in a terminal, start jupyter lab with 
```bash
jupyter lab
```

In [6]:
import torch 
import torch.nn.functional as F

# Section 1. Basic Tensor Computation

## Definition of Tensor: High-dimensional Array

### Vector

A vector is a rank-1 (or one-dimensional) tensor.

In NLP, it could be a single sentence where each word is an integer index.

In [7]:
sent = torch.tensor([1, 3, 5, 7, 9])

In [8]:
sent

tensor([1, 3, 5, 7, 9])

### Matrix

A matrix is a rank-2 (or two-dimensional) tensor.

In NLP, this could be used to represent a batch of different sentences.

In [9]:
sents = torch.tensor([[1, 3, 5, 7, 9], 
                     [2, 4, 6, 8, 10]]) # a batch of two sentences

In [10]:
sents # we assume these indices correspond to some actual words

tensor([[ 1,  3,  5,  7,  9],
        [ 2,  4,  6,  8, 10]])

### Rank-3 Tensor

There can be a rank-3 (three-dimensional) tensor.

In NLP, this could be a batch of sentences, where each word in each sentence is represented as an embedding vector.

In [11]:
sents_emb = torch.rand([2, 5, 10])  # [batch_size = 2, sentence_length = 5, hidden_size = 10]
                                    # we use random vectors for the purpose of demonstration 
                                    # we use the name `hidden_size` because the embeddings are usually referred to as 'hidden representations' in neural networks

In [12]:
sents_emb

tensor([[[0.2552, 0.2932, 0.9128, 0.0385, 0.2886, 0.4012, 0.6359, 0.2640,
          0.1432, 0.1400],
         [0.0908, 0.7768, 0.9460, 0.5340, 0.3198, 0.2623, 0.6783, 0.2349,
          0.8589, 0.6096],
         [0.4103, 0.4208, 0.2557, 0.8428, 0.4656, 0.7158, 0.3926, 0.3101,
          0.3126, 0.1072],
         [0.8238, 0.9857, 0.5228, 0.8131, 0.4215, 0.2141, 0.4408, 0.4935,
          0.7624, 0.6851],
         [0.8025, 0.1173, 0.9148, 0.4197, 0.3450, 0.7211, 0.0858, 0.4441,
          0.3001, 0.2407]],

        [[0.5476, 0.8870, 0.0843, 0.4745, 0.1150, 0.6033, 0.3752, 0.1390,
          0.6719, 0.1939],
         [0.3876, 0.6765, 0.4742, 0.1300, 0.4539, 0.8930, 0.8830, 0.5100,
          0.4287, 0.0923],
         [0.1531, 0.2541, 0.2076, 0.9926, 0.9684, 0.1034, 0.4452, 0.5783,
          0.7238, 0.9626],
         [0.8897, 0.7663, 0.8566, 0.9073, 0.6333, 0.9959, 0.2404, 0.2135,
          0.9341, 0.3970],
         [0.6974, 0.9085, 0.0868, 0.4143, 0.1393, 0.9332, 0.4206, 0.3227,
          0.057

Here `sents_emb[i, j]` means the word embedding for the j-th word in the i-th sentence.

In [13]:
#  word embedding for 0-th sentence, 3-rd word
sents_emb[0, 3]

tensor([0.8238, 0.9857, 0.5228, 0.8131, 0.4215, 0.2141, 0.4408, 0.4935, 0.7624,
        0.6851])

### Rank-4 Tensor

As for a rank-4 (four-dimensional) tensor, this could be a batch of sentences, where each word of each sentence is further divided into different heads of attention keys for multi-head attention (you will encounter this in CW2).

In [14]:
sents_emb_keys = torch.rand([2, 5, 4, 10])  # [batch_size = 2, sentence_length = 5, number_of_heads = 4, hidden_size = 10]
                                            # Again, we use random vectors for the purpose of demonstration

Here `sents_emb_keys[i, j, k]` means the attention key vector in the i-th sentence, j-th word, k-th head.

In [15]:
# key vector for 0-th sentence, 3-rd word, 2-nd head
sents_emb_keys[0, 3, 2] 

tensor([0.6944, 0.5882, 0.0796, 0.2964, 0.8965, 0.8227, 0.9334, 0.3030, 0.8997,
        0.6403])

### Caveat: always keep in mind the meaning of shapes of tensors

It is important to always have a record of the meaning of the shape, otherwise one would quickly forget what each tensor dimension means.

In [16]:
sents_emb_keys.size()

torch.Size([2, 5, 4, 10])

Question: what does [2, 5, 4, 10] mean?

Answer: it means [batch_size = 2, sentence_length = 5, number_of_heads = 4, hidden_size = 10], where `number_of_heads` means the number of heads used in the attention mechanism.
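
One lightweight way to keep such a record is to unpack the size into named variables. The snippet below is only a sketch; the tensor `keys` and the variable names are illustrative and simply mirror the layout of `sents_emb_keys`.

```python
import torch

# Hypothetical tensor with the same layout as `sents_emb_keys` above.
keys = torch.rand(2, 5, 4, 10)

# Unpacking the size into named variables documents what each dimension means.
batch_size, sent_len, num_heads, hidden_size = keys.size()
print(batch_size, sent_len, num_heads, hidden_size)

# A cheap sanity check that fails loudly if the layout ever changes.
assert keys.size() == (batch_size, sent_len, num_heads, hidden_size)
```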

## Basic Tensor Operation

### Indexing

In [4]:
# first dimension
sents_emb_keys[0]   # 0-th sentence, all words, all heads, all hidden


In [37]:
sents_emb_keys[0].size()

torch.Size([5, 4, 10])

Q: what does [5, 4, 10] mean?

A: for the 0-th sentence, it has [length = 5, number_of_heads = 4, hidden_size = 10]

In other words, fixing the index of the first dimension of a four-dimensional tensor will result in a three-dimensional tensor.
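
The same rule applies every time an index is fixed: each fixed index removes one dimension. A minimal sketch, using an illustrative random tensor `x` with the same shape as `sents_emb_keys`:

```python
import torch

x = torch.rand(2, 5, 4, 10)  # [batch_size, sentence_length, number_of_heads, hidden_size]

# Each fixed index removes exactly one dimension.
print(x.dim())        # 4
print(x[0].dim())     # 3
print(x[0][1].dim())  # 2

print(x[0].size())    # torch.Size([5, 4, 10])
```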

In [55]:
sents_emb_keys[:, 1, :, :]  # all sentences, 1-st word, all heads, all hidden

tensor([[[0.2565, 0.9952, 0.2855, 0.7027, 0.4161, 0.5526, 0.5555, 0.8494,
          0.6023, 0.8167],
         [0.0980, 0.8770, 0.6813, 0.8324, 0.5196, 0.1883, 0.4374, 0.0931,
          0.2019, 0.0565],
         [0.0558, 0.1330, 0.3457, 0.1030, 0.9027, 0.1974, 0.9539, 0.4972,
          0.4778, 0.6240],
         [0.6143, 0.5695, 0.7340, 0.9583, 0.6507, 0.9898, 0.4018, 0.9045,
          0.9049, 0.9585]],

        [[0.1653, 0.9657, 0.1513, 0.5456, 0.2272, 0.2104, 0.1347, 0.7170,
          0.3753, 0.6655],
         [0.0749, 0.7884, 0.5357, 0.7545, 0.9891, 0.7123, 0.1937, 0.1788,
          0.7884, 0.5691],
         [0.1666, 0.2801, 0.4154, 0.3058, 0.2337, 0.4778, 0.6624, 0.3030,
          0.7515, 0.8743],
         [0.8686, 0.7914, 0.2702, 0.8953, 0.6699, 0.4771, 0.1721, 0.8616,
          0.7452, 0.9689]]])

In [42]:
sents_emb_keys[:, 1, :, :].size()

torch.Size([2, 4, 10])

Q: what does [2, 4, 10] mean?

A: when fixing the 1-st word, we have [batch_size = 2, number_of_heads = 4, hidden_size = 10].

Similarly we have:

In [39]:
sents_emb_keys[0, :, :, :] # 0-th sentence, all words, all heads, all hidden
sents_emb_keys[:, 0, :, :] # all sentences, 0-th word, all heads, all hidden
sents_emb_keys[:, :, 0, :] # all sentences, all words, 0-th head, all hidden
sents_emb_keys[:, :, :, 0] # all sentences, all words, all heads, 0-th hidden

tensor([[[0.9634, 0.3859, 0.6385, 0.1204],
         [0.1372, 0.4683, 0.8694, 0.5100],
         [0.8190, 0.6909, 0.9366, 0.9601],
         [0.1284, 0.5891, 0.2114, 0.6303],
         [0.3531, 0.2699, 0.9196, 0.9952]],

        [[0.9251, 0.8620, 0.1857, 0.3526],
         [0.3475, 0.9417, 0.3291, 0.0049],
         [0.1911, 0.0424, 0.2512, 0.0985],
         [0.3292, 0.8421, 0.6986, 0.6922],
         [0.8767, 0.6509, 0.7392, 0.6896]]])

In [40]:
# Q: interpret what the following shapes mean:
# A: --- YOUR ANSWER HERE ----
print(sents_emb_keys[0, :, :, :].size())
print(sents_emb_keys[:, 0, :, :].size())
print(sents_emb_keys[:, :, 0, :].size())
print(sents_emb_keys[:, :, :, 0].size())

torch.Size([5, 4, 10])
torch.Size([2, 4, 10])
torch.Size([2, 5, 10])
torch.Size([2, 5, 4])


The index can be further fixed at multiple dimensions, for example:

In [58]:
print(sents_emb_keys[0, 1, :, :]) # 0-th sentence, 1-st word, all heads, all hidden
print(sents_emb_keys[0, 1, 2, :]) # 0-th sentence, 1-st word, 2-nd head, all hidden
print(sents_emb_keys[:, 1, 2, :]) # all sentences, 1-st word, 2-nd head, all hidden

tensor([[0.2565, 0.9952, 0.2855, 0.7027, 0.4161, 0.5526, 0.5555, 0.8494, 0.6023,
         0.8167],
        [0.0980, 0.8770, 0.6813, 0.8324, 0.5196, 0.1883, 0.4374, 0.0931, 0.2019,
         0.0565],
        [0.0558, 0.1330, 0.3457, 0.1030, 0.9027, 0.1974, 0.9539, 0.4972, 0.4778,
         0.6240],
        [0.6143, 0.5695, 0.7340, 0.9583, 0.6507, 0.9898, 0.4018, 0.9045, 0.9049,
         0.9585]])
tensor([0.0558, 0.1330, 0.3457, 0.1030, 0.9027, 0.1974, 0.9539, 0.4972, 0.4778,
        0.6240])
tensor([[0.0558, 0.1330, 0.3457, 0.1030, 0.9027, 0.1974, 0.9539, 0.4972, 0.4778,
         0.6240],
        [0.1666, 0.2801, 0.4154, 0.3058, 0.2337, 0.4778, 0.6624, 0.3030, 0.7515,
         0.8743]])


Looking at the shape of the resulting tensors:

In [43]:
sents_emb_keys[:, 1, 2, :].size()

torch.Size([2, 10])

[2, 10] means [batch_size = 2, hidden_size = 10] when fixing the 1-st word and the 2-nd head.

In [57]:
# Q: interpret what the following shapes mean:
# A: ---- YOUR ANSWER HERE ---
print(sents_emb_keys[0, 1, :, :].size())
print(sents_emb_keys[0, 1, 2, :].size())

torch.Size([4, 10])
torch.Size([10])


### Slicing

Slicing takes a range of an index within any dimension:

In [62]:
sents_emb_keys[:, 1:4, :, :] # all sentences, words 1-3 (the right boundary 4 is exclusive), all heads, all hidden
sents_emb_keys[:, :4, :, :]  # all sentences, words 0-3 (an unspecified left boundary defaults to 0), all heads, all hidden
sents_emb_keys[:, 1:, :, :]  # all sentences, words 1 to the last (an unspecified right boundary defaults to the end of the dimension), all heads, all hidden
sents_emb_keys[:, :, 2:4, :] # all sentences, all words, heads 2-3, all hidden
sents_emb_keys[:, :, :, 1:9] # all sentences, all words, all heads, hidden entries 1-8

tensor([[[[0.2565, 0.9952, 0.2855, 0.7027, 0.4161, 0.5526, 0.5555, 0.8494,
           0.6023, 0.8167],
          [0.0980, 0.8770, 0.6813, 0.8324, 0.5196, 0.1883, 0.4374, 0.0931,
           0.2019, 0.0565],
          [0.0558, 0.1330, 0.3457, 0.1030, 0.9027, 0.1974, 0.9539, 0.4972,
           0.4778, 0.6240],
          [0.6143, 0.5695, 0.7340, 0.9583, 0.6507, 0.9898, 0.4018, 0.9045,
           0.9049, 0.9585]],

         [[0.9984, 0.8191, 0.6768, 0.5630, 0.6836, 0.2332, 0.9874, 0.7200,
           0.2859, 0.4261],
          [0.3903, 0.6026, 0.9944, 0.8326, 0.8898, 0.5297, 0.1403, 0.7895,
           0.1042, 0.9310],
          [0.3837, 0.3007, 0.8152, 0.3091, 0.4113, 0.5117, 0.0031, 0.3399,
           0.9229, 0.7710],
          [0.7565, 0.2816, 0.3472, 0.4599, 0.7831, 0.2791, 0.8118, 0.4355,
           0.7268, 0.5401]],

         [[0.5343, 0.0079, 0.2048, 0.3074, 0.9474, 0.0148, 0.4928, 0.2752,
           0.9744, 0.1238],
          [0.8444, 0.9804, 0.3297, 0.5908, 0.3706, 0.3828, 0.3094, 0.

You can also set a stepsize for slicing:

In [70]:
sents_emb_keys[:, 0::2, :, :] # all sentences, words 0, 2, 4

tensor([[[[0.6521, 0.3246, 0.3111, 0.2697, 0.5552, 0.8763, 0.6840, 0.0914,
           0.1995, 0.5810],
          [0.7291, 0.5008, 0.4971, 0.2134, 0.9420, 0.1241, 0.4952, 0.2472,
           0.8769, 0.8252],
          [0.8650, 0.9792, 0.6818, 0.9030, 0.4958, 0.7130, 0.0967, 0.6244,
           0.7655, 0.4095],
          [0.8535, 0.8848, 0.5556, 0.8642, 0.9925, 0.3077, 0.1043, 0.8635,
           0.5151, 0.5593]],

         [[0.9984, 0.8191, 0.6768, 0.5630, 0.6836, 0.2332, 0.9874, 0.7200,
           0.2859, 0.4261],
          [0.3903, 0.6026, 0.9944, 0.8326, 0.8898, 0.5297, 0.1403, 0.7895,
           0.1042, 0.9310],
          [0.3837, 0.3007, 0.8152, 0.3091, 0.4113, 0.5117, 0.0031, 0.3399,
           0.9229, 0.7710],
          [0.7565, 0.2816, 0.3472, 0.4599, 0.7831, 0.2791, 0.8118, 0.4355,
           0.7268, 0.5401]],

         [[0.2395, 0.7648, 0.2465, 0.8468, 0.1039, 0.0613, 0.9354, 0.5005,
           0.6134, 0.3457],
          [0.6280, 0.0104, 0.1284, 0.3472, 0.3275, 0.3812, 0.7000, 0.

In [69]:
sents_emb_keys[:, 1::2, :, :] # all sentences, words 1, 3

tensor([[[[0.2565, 0.9952, 0.2855, 0.7027, 0.4161, 0.5526, 0.5555, 0.8494,
           0.6023, 0.8167],
          [0.0980, 0.8770, 0.6813, 0.8324, 0.5196, 0.1883, 0.4374, 0.0931,
           0.2019, 0.0565],
          [0.0558, 0.1330, 0.3457, 0.1030, 0.9027, 0.1974, 0.9539, 0.4972,
           0.4778, 0.6240],
          [0.6143, 0.5695, 0.7340, 0.9583, 0.6507, 0.9898, 0.4018, 0.9045,
           0.9049, 0.9585]],

         [[0.5343, 0.0079, 0.2048, 0.3074, 0.9474, 0.0148, 0.4928, 0.2752,
           0.9744, 0.1238],
          [0.8444, 0.9804, 0.3297, 0.5908, 0.3706, 0.3828, 0.3094, 0.7239,
           0.1286, 0.8170],
          [0.5350, 0.7691, 0.3863, 0.2020, 0.5287, 0.1367, 0.5998, 0.9947,
           0.8692, 0.9149],
          [0.8383, 0.7545, 0.6470, 0.1198, 0.4272, 0.7342, 0.5708, 0.3814,
           0.2928, 0.3868]]],


        [[[0.1653, 0.9657, 0.1513, 0.5456, 0.2272, 0.2104, 0.1347, 0.7170,
           0.3753, 0.6655],
          [0.0749, 0.7884, 0.5357, 0.7545, 0.9891, 0.7123, 0.1937, 

In [71]:
sents_emb_keys[:, 0:4:2, :, :] # all sentences, words 0, 2

tensor([[[[0.6521, 0.3246, 0.3111, 0.2697, 0.5552, 0.8763, 0.6840, 0.0914,
           0.1995, 0.5810],
          [0.7291, 0.5008, 0.4971, 0.2134, 0.9420, 0.1241, 0.4952, 0.2472,
           0.8769, 0.8252],
          [0.8650, 0.9792, 0.6818, 0.9030, 0.4958, 0.7130, 0.0967, 0.6244,
           0.7655, 0.4095],
          [0.8535, 0.8848, 0.5556, 0.8642, 0.9925, 0.3077, 0.1043, 0.8635,
           0.5151, 0.5593]],

         [[0.9984, 0.8191, 0.6768, 0.5630, 0.6836, 0.2332, 0.9874, 0.7200,
           0.2859, 0.4261],
          [0.3903, 0.6026, 0.9944, 0.8326, 0.8898, 0.5297, 0.1403, 0.7895,
           0.1042, 0.9310],
          [0.3837, 0.3007, 0.8152, 0.3091, 0.4113, 0.5117, 0.0031, 0.3399,
           0.9229, 0.7710],
          [0.7565, 0.2816, 0.3472, 0.4599, 0.7831, 0.2791, 0.8118, 0.4355,
           0.7268, 0.5401]]],


        [[[0.1025, 0.6387, 0.0027, 0.6100, 0.9511, 0.9956, 0.9138, 0.4353,
           0.0430, 0.2796],
          [0.5632, 0.7684, 0.2373, 0.6559, 0.1192, 0.9709, 0.5268, 

As you can see, the slicing syntax is `some_tensor[start_index : end_index : step_size]`.
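
The printed tensors above are large, so the effect of `start_index : end_index : step_size` is easier to see on a small one-dimensional tensor. The tensor `x` below is purely illustrative:

```python
import torch

x = torch.arange(10)  # tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

print(x[1:4])    # tensor([1, 2, 3])        right boundary is exclusive
print(x[:4])     # tensor([0, 1, 2, 3])     left boundary defaults to 0
print(x[7:])     # tensor([7, 8, 9])        right boundary defaults to the end
print(x[0::2])   # tensor([0, 2, 4, 6, 8])  every second entry
print(x[1:8:3])  # tensor([1, 4, 7])        start at 1, stop before 8, step 3
```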

In [64]:
# Q: interpret what the following shapes mean:
# A: ---- YOUR ANSWER HERE ----
print(sents_emb_keys[:, 1:4, :, :].size())
print(sents_emb_keys[:, :, 3:, :].size())

torch.Size([2, 3, 4, 10])
torch.Size([2, 5, 1, 10])


### Concatenation

To concatenate two tensors within a specified dimension:

In [78]:
sent_1 = torch.tensor([[1, 2, 3]]) # size = [1, 3]
sent_2 = torch.tensor([[4, 5, 6]]) # size = [1, 3]

In [80]:
# concatenate two sentences into a batch
sents = torch.cat([sent_1, sent_2], dim=0)
print(sents)
print(sents.size())

tensor([[1, 2, 3],
        [4, 5, 6]])
torch.Size([2, 3])


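In [82]:
# concatenate the second sentence to the end of the first sentence
# (a separate name is used so that `sents` keeps its [2, 3] shape for the cells below)
sent_long = torch.cat([sent_1, sent_2], dim=1)
print(sent_long)
print(sent_long.size())

tensor([[1, 2, 3, 4, 5, 6]])
torch.Size([1, 6])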
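
Note that `torch.cat` only requires the tensors to agree on the dimensions that are not being concatenated. A small sketch with made-up tensors `a`, `b` and `c`:

```python
import torch

a = torch.zeros(2, 3)
b = torch.ones(1, 3)
c = torch.ones(2, 2)

# dim=0: sizes along dim 1 must match (3 == 3); sizes along dim 0 add up (2 + 1).
print(torch.cat([a, b], dim=0).size())  # torch.Size([3, 3])

# dim=1: sizes along dim 0 must match (2 == 2); sizes along dim 1 add up (3 + 2).
print(torch.cat([a, c], dim=1).size())  # torch.Size([2, 5])
```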


### Reshaping

Usually reshaping has two effects: (a) splitting tensors or (b) concatenating tensors.

In [85]:
# concatenating tensors
print(sents)
print(sents.size())
sents_cat = sents.view(6)
print(sents_cat)
print(sents_cat.size()) # pay attention how the shape of the tensor changes from [2, 3] -> [6]

tensor([[1, 2, 3],
        [4, 5, 6]])
torch.Size([2, 3])
tensor([1, 2, 3, 4, 5, 6])
torch.Size([6])


In [87]:
# splitting tensors
sents_split_1 = sents_cat.view(2, 3)
print(sents_split_1)
print(sents_split_1.size())
sents_split_2 = sents_cat.view(3, 2)
print(sents_split_2)
print(sents_split_2.size())

tensor([[1, 2, 3],
        [4, 5, 6]])
torch.Size([2, 3])
tensor([[1, 2],
        [3, 4],
        [5, 6]])
torch.Size([3, 2])


### Transposing

Transposing swaps two dimensions of a tensor:

In [92]:
print(sents) # shape = [batch, length]
sents_transposed = sents.transpose(0, 1)
print(sents_transposed) # shape = [length, batch]

tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 4],
        [2, 5],
        [3, 6]])


### Caveat: pay close attention to the difference between transposing and reshaping

Generally, reshaping does not change the order of the items within the tensor, while transposing changes the order.

In [99]:
# view does not change order
print(sents.view(6))
print(sents.view(2, 3))
print(sents.view(3, 2))

# transposing changes the order
print(sents.transpose(0, 1))
print(sents.transpose(0, 1).reshape(6))  # reshape rather than view: the transposed tensor is not contiguous in memory

tensor([1, 2, 3, 4, 5, 6])
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[1, 2],
        [3, 4],
        [5, 6]])
tensor([[1, 4],
        [2, 5],
        [3, 6]])
tensor([1, 4, 2, 5, 3, 6])
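
The last line uses `reshape` rather than `view` for a reason: a transposed tensor is generally no longer contiguous in memory, and `view` refuses to flatten it. The following sketch illustrates this on the same small example:

```python
import torch

sents = torch.tensor([[1, 2, 3],
                      [4, 5, 6]])
transposed = sents.transpose(0, 1)

print(sents.is_contiguous())       # True
print(transposed.is_contiguous())  # False

# view needs a compatible memory layout, so it fails on the transposed tensor.
try:
    transposed.view(6)
except RuntimeError as err:
    print("view failed:", type(err).__name__)

# reshape copies the data when needed; contiguous() followed by view is equivalent here.
print(transposed.reshape(6))            # tensor([1, 4, 2, 5, 3, 6])
print(transposed.contiguous().view(6))  # tensor([1, 4, 2, 5, 3, 6])
```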


## Basic Tensor Computations

### Tensor add scalar

In [103]:
# addition, subtraction, multiplication, division
sent_emb = torch.rand(3, 5)    # [length = 3, hidden = 5] 
                                # we assume we work with one sentence embedding where there are 3 words and 
                                # each word is associated with an embedding vector of length 5
print(sent_emb)

# the following operations are applied to every entry within the tensor
print(sent_emb + 1)
print(sent_emb - 1)
print(sent_emb * 2)
print(sent_emb / 2)

tensor([[0.8883, 0.1146, 0.6120, 0.8524, 0.0173],
        [0.4917, 0.2352, 0.0226, 0.0530, 0.0636],
        [0.4203, 0.9898, 0.9779, 0.4225, 0.4789]])
tensor([[1.8883, 1.1146, 1.6120, 1.8524, 1.0173],
        [1.4917, 1.2352, 1.0226, 1.0530, 1.0636],
        [1.4203, 1.9898, 1.9779, 1.4225, 1.4789]])
tensor([[-0.1117, -0.8854, -0.3880, -0.1476, -0.9827],
        [-0.5083, -0.7648, -0.9774, -0.9470, -0.9364],
        [-0.5797, -0.0102, -0.0221, -0.5775, -0.5211]])
tensor([[1.7766, 0.2293, 1.2239, 1.7049, 0.0346],
        [0.9834, 0.4704, 0.0452, 0.1060, 0.1273],
        [0.8405, 1.9796, 1.9559, 0.8451, 0.9578]])
tensor([[0.4442, 0.0573, 0.3060, 0.4262, 0.0087],
        [0.2459, 0.1176, 0.0113, 0.0265, 0.0318],
        [0.2101, 0.4949, 0.4890, 0.2113, 0.2395]])


### Vector dot product

In [166]:
# vector dot-product
word_1 = torch.rand(5)
word_2 = torch.rand(5)

prod = (word_1 * word_2).sum() # element-wise multiplication then sum 
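
As a quick check with fixed numbers (the values below are chosen only for illustration), the multiply-then-sum formulation gives the same result as the built-in `torch.dot`:

```python
import torch

word_1 = torch.tensor([1., 2., 3.])
word_2 = torch.tensor([4., 5., 6.])

manual = (word_1 * word_2).sum()      # 1*4 + 2*5 + 3*6 = 32
builtin = torch.dot(word_1, word_2)   # built-in dot product for 1-D tensors

print(manual)   # tensor(32.)
print(builtin)  # tensor(32.)
```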

### Matrix multiplication

In [106]:
# matrix multiplication
M1 = torch.rand(2, 5)
M2 = torch.rand(5, 10)

prod = torch.matmul(M1, M2)
print(prod.size()) # shape changes: [2, 5] * [5, 10] -> [2, 10]

torch.Size([2, 10])
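
To connect this with the dot product above: entry `[i, j]` of the product is the dot product of row `i` of `M1` and column `j` of `M2`. A short sketch (the indices `i` and `j` are arbitrary):

```python
import torch

M1 = torch.rand(2, 5)
M2 = torch.rand(5, 10)
prod = torch.matmul(M1, M2)

# Entry [i, j] is the dot product of row i of M1 and column j of M2.
i, j = 1, 7
entry = (M1[i, :] * M2[:, j]).sum()
print(torch.allclose(prod[i, j], entry, atol=1e-5))

# The @ operator is shorthand for torch.matmul.
print(torch.allclose(M1 @ M2, prod))
```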


## Alignment and broadcasting

### Rule 1: if the same shape, then element-wise computation

In [167]:
# adding two words, each word is a vector
word_1 = torch.rand(5) # hidden_size = 5
word_2 = torch.rand(5)
result = word_1 + word_2 # result[i] = word_1[i] + word_2[i] 

In [168]:
# adding two sentences, each sentence is a sequence of word vectors
sent_1 = torch.rand(3, 5) # sent_length = 3, hidden_size = 5
sent_2 = torch.rand(3, 5)
result = sent_1 + sent_2 # result[i, j] = sent_1[i, j] + sent_2[i, j]

### Rule 2: if different shape, first align, then element-wise computation

In [169]:
# Example 1. adding a single word to every word within a sentence
word = torch.tensor([1, 2, 3])              # [hidden = 3]
sent = torch.tensor([[0, 0, 0], 
                     [1, 1, 1]]) # [sent_length = 2, hidden = 3]

result = word.view(1, 3) + sent # underlying process: 
                                # step 1, alignment
                                # word.size() = [1, 3]
                                #                |  |   alignment
                                # sent.size() = [2, 3]
                                #
                                # step 2, repeat where dimension size is 1
                                # word_repeat = word.view(1, 3).repeat([2, 1]) -- 0th dim repeats 2 times, 1st dim repeats 1 time (= no repeat)
                                #
                                # step 3, align again
                                # word_repeat.size() = [2, 3]
                                #                       |  |   alignment, note that the 0th dim of word is repeated
                                # sent.size()        = [2, 3]
                                #
                                # step 4, element-wise addition
                                # result = word_repeat + sent
print(result)  # this is basically broadcasting the given word to each word in the given sentence

tensor([[1, 2, 3],
        [2, 3, 4]])


In [3]:
# Example 2. adding two words to two sentences, respectively
words = torch.tensor([[1, 2, 3], [4, 5, 6]])                      # batch_size = 2, hidden_size = 3
sents = torch.tensor([
                        [[0, 0, 0], [1, 1, 1], [2, 2, 2]], 
                        [[3, 3, 3], [4, 4, 4], [5, 5, 5]],        # batch_size = 2, length = 3, hidden_size = 3
                    ])

# we aim to add words[0] to sents[0] and words[1] to sents[1]
result = words.view(2, 1, 3) + sents    # underlying process: 
                                        # 
                                        # step 1, alignment
                                        # words.size() = [2, 1, 3]   two words, each has hidden dim = 3
                                        #                 |  |  |    alignment
                                        # sents.size() = [2, 3, 3]   two sentences, each has three words, each word has hidden dim = 3
                                        # 
                                        # step 2, repeat where dimension size is 1 for broadcasting
                                        # repeat words along 1st dimension
                                        # words_repeat = words.view(2, 1, 3).repeat([1, 3, 1]) -- 0th dim repeats 1 time, 1st dim repeats 3 times, 2nd dim repeats 1 time
                                        # 
                                        # step 3, align again
                                        # words_repeat.size() = [2, 3, 3]
                                        #                        |  |  |   alignment, note that the 1st dim of words is repeated
                                        # sents.size()        = [2, 3, 3]
                                        #
                                        # step 4, element-wise addition
                                        # result = words_repeat + sents
print(result)
print(result.size())

tensor([[[ 1,  2,  3],
         [ 2,  3,  4],
         [ 3,  4,  5]],

        [[ 7,  8,  9],
         [ 8,  9, 10],
         [ 9, 10, 11]]])
torch.Size([2, 3, 3])


In [125]:
# Example 3, outer product
# the outer product of two vectors is a matrix
v1 = torch.tensor([1, 2, 3])
v2 = torch.tensor([4, 5, 6])

outer_prod = v1.view(3, 1) * v2.view(1, 3)  # underlying process: 
                                            # 
                                            # step 1, alignment
                                            # v1.size() = [3, 1]  
                                            #              |  |  alignment
                                            # v2.size() = [1, 3] 
                                            # 
                                            # step 2, repeat where dimension size is 1
                                            # repeat v1 along 1st dimension
                                            # v1_repeat = v1.view(3, 1).repeat([1, 3]) -- 0th dim repeats 1 time, 1st dim repeats 3 times
                                            # repeat v2 along 0th dimension
                                            # v2_repeat = v2.view(1, 3).repeat([3, 1]) -- 0th dim repeats 3 times, 1st dim repeats 1 time
                                            #
                                            # step 3, align again
                                            # v1_repeat.size() = [3, 3]
                                            #                     |  | 
                                            # v2_repeat.size() = [3, 3]
                                            #
                                            # step 4, elementwise multiplication
                                            # outer_prod = v1_repeat * v2_repeat

print(outer_prod)

tensor([[ 4,  5,  6],
        [ 8, 10, 12],
        [12, 15, 18]])
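
The "underlying process" described in the comments can be written out explicitly. The sketch below performs the repeat steps by hand and compares the result with broadcasting and with the built-in `torch.outer`; the names `v1_repeat` and `v2_repeat` follow the comments above.

```python
import torch

v1 = torch.tensor([1, 2, 3])
v2 = torch.tensor([4, 5, 6])

# Broadcasting does the repetition implicitly.
broadcast = v1.view(3, 1) * v2.view(1, 3)

# The same computation with the repetition written out.
v1_repeat = v1.view(3, 1).repeat(1, 3)  # [3, 1] -> [3, 3]
v2_repeat = v2.view(1, 3).repeat(3, 1)  # [1, 3] -> [3, 3]
explicit = v1_repeat * v2_repeat

print(torch.equal(broadcast, explicit))             # True
print(torch.equal(broadcast, torch.outer(v1, v2)))  # True
```

Broadcasting gives the same values without materialising the repeated copies, which is why it is preferred in practice.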


In [None]:
# Pay attention to the shape of the tensor, again
v1 = torch.tensor([1, 2, 3])
v2 = torch.tensor([4, 5, 6])

# Q: what is the difference between the following two operations: 
# A: ---- YOUR ANSWER HERE ----
result_1 = v1.view(3, 1) * v2.view(1, 3)
result_2 = v1.view(1, 3) * v2.view(3, 1)

# Section 2. Batchified Tensor Computation

In this section, we will study how to compute with tensors whose first dimension is the batch size.

## Batch matrix multiplication

In [126]:
# linear transform of the word embeddings of two sentences
sents = torch.rand(2, 5, 10) # [batch_size = 2, length = 5, hidden_size = 10]
weight = torch.rand(10, 20)  # we transform each word from a length-10 vector to a length-20 vector

sents_transform = torch.matmul(sents, weight)   # underlying process:
                                                # sents.size() = [2,  5, 10]
                                                #                 |       |         alignment. 
                                                # view weight as [1,     10, 20]    repeat where dimension is 1
                                                # matrix multiplication happens at the final two dimensions, [2, 5, 10] x [10, 20] -> [2, 5, 20]
print(sents_transform.size())

torch.Size([2, 5, 20])
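
One way to convince yourself of what the batched call does is to compare it with a loop over sentences. This is only a sketch for checking understanding; the loop is not how one would write it in practice.

```python
import torch

sents = torch.rand(2, 5, 10)  # [batch_size = 2, length = 5, hidden_size = 10]
weight = torch.rand(10, 20)

# Batched version: weight is shared across the batch dimension.
batched = torch.matmul(sents, weight)

# Loop version: transform each sentence separately, then stack along the batch dimension.
per_sentence = torch.stack(
    [torch.matmul(sents[i], weight) for i in range(sents.size(0))], dim=0
)

print(batched.size())                                    # torch.Size([2, 5, 20])
print(torch.allclose(batched, per_sentence, atol=1e-5))
```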


## Retrieving embedding vectors from an embedding matrix

In [13]:
# in deep learning practice, we usually start with sequences of word indices
sent = torch.tensor([[10, 25, 59, 77, 88],  # suppose [10, 25, 59, 77, 88] means ['Oh', 'I', 'really', 'like', 'cats']
                     [16, 29, 40, 56, 3]]   # suppose [16, 29, 40, 56, 3] means ['Jack', 'does', 'not', 'have', 'dogs']
                    )
print(sent.size())

# neural networks require words to be represented as embedding vectors, not indices. So we store an embedding matrix with one row for every word
vocab_size = 100
hidden_size = 10
embedding_matrix = torch.rand([vocab_size, hidden_size])

# converting indices to embeddings consists of two steps:
# 1. convert the index representation to a one-hot representation
sent_one_hot = F.one_hot(sent, vocab_size).float()
# 2. batch matrix multiplication between the one-hot representation and the embedding matrix
#    NOTE: when a one-hot vector is multiplied with a matrix, it essentially retrieves the corresponding row vector from the matrix
sent_emb = torch.matmul(sent_one_hot, embedding_matrix) # underlying process:
                                                        # sent_one_hot.size() =    [2,  5, 100]
                                                        #                           |       |         alignment. 
                                                        # view embedding_matrix as [1,     100, 10]   repeat where dimension is 1
                                                        # matrix multiplication happens at the final two dimensions, [2, 5, 100] x [100, 10] -> [2, 5, 10]
                                                        # In this case, multiplying a one-hot vector with a matrix retrieves the corresponding 
                                                        # row vector from the matrix

print(sent_emb.size())

# the following shows that the embedding vector for word index 10 is retrieved for sent[0, 0]
print(sent_emb[0, 0])
print(embedding_matrix[10])

# similarly, the embedding vector for word index 25 is retrieved for sent[0, 1]
print(sent_emb[0, 1])
print(embedding_matrix[25])

# ---- YOUR TASK ----
# verify how a one-hot vector can retrieve the corresponding row vector from the matrix

torch.Size([2, 5])
torch.Size([2, 5, 10])
tensor([0.5894, 0.4493, 0.0103, 0.4794, 0.9032, 0.1871, 0.7477, 0.2201, 0.6163,
        0.0064])
tensor([0.5894, 0.4493, 0.0103, 0.4794, 0.9032, 0.1871, 0.7477, 0.2201, 0.6163,
        0.0064])
tensor([0.5246, 0.6199, 0.2446, 0.8485, 0.2647, 0.0523, 0.4590, 0.7056, 0.8788,
        0.3406])
tensor([0.5246, 0.6199, 0.2446, 0.8485, 0.2647, 0.0523, 0.4590, 0.7056, 0.8788,
        0.3406])
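
In practice the one-hot step is usually skipped, because indexing the embedding matrix directly retrieves the same rows. The sketch below compares the one-hot route with plain indexing and with `F.embedding`; the names `via_one_hot`, `via_indexing` and `via_functional` are illustrative only.

```python
import torch
import torch.nn.functional as F

vocab_size, hidden_size = 100, 10
embedding_matrix = torch.rand(vocab_size, hidden_size)
sent = torch.tensor([[10, 25, 59, 77, 88],
                     [16, 29, 40, 56, 3]])

# Route 1: one-hot vectors times the embedding matrix (as in the cell above).
via_one_hot = torch.matmul(F.one_hot(sent, vocab_size).float(), embedding_matrix)

# Route 2: index the rows of the embedding matrix directly.
via_indexing = embedding_matrix[sent]

# Route 3: the functional embedding lookup.
via_functional = F.embedding(sent, embedding_matrix)

print(via_indexing.size())                       # torch.Size([2, 5, 10])
print(torch.allclose(via_one_hot, via_indexing))
print(torch.equal(via_indexing, via_functional))
```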


## Linear transform of the word embeddings

In [128]:
# linear transform of the word embeddings of two sentences where each word has 4 attention heads
# this computation is performed in multi-head attention in CW2

sents = torch.rand(2, 5, 4, 10) # [batch_size = 2, length = 5, number_of_heads = 4, hidden_size = 10]
weight = torch.rand(10, 20)  # we transform each word from a length-10 vector to a length-20 vector

sents_transform = torch.matmul(sents, weight)   # underlying process:
                                                # sents.size() = [2,  5, 4, 10]
                                                #                 |   |      |         alignment. 
                                                # view weight as [1,  1,    10, 20]    repeat where dimension is 1
                                                # matrix multiplication happens at the final two dimensions, [2, 5, 4, 10] * [10, 20] -> [2, 5, 4, 20]
print(sents_transform.size())

torch.Size([2, 5, 4, 20])


## Similarity: one single vector vs. a batch of sentences

In [129]:
# similarity: one single vector v.s. a batch of sentence representations
query = torch.rand(10)
sents = torch.rand(2, 5, 10)

# suppose we would like to compute the dot-product (as a measure of similarity) between the query vector and all words within all sentences
similarity = (query.view(1, 1, 10) * sents).sum(dim=2)
print(similarity.size())  # similarity[i, j] means the vector dot-product between the query vector and word j in sentence i

torch.Size([2, 5])
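
The multiply-then-sum pattern above is a batch of dot products, so it can also be written as a matrix-vector product. A sketch comparing the two (the names `via_broadcast` and `via_matmul` are illustrative):

```python
import torch

query = torch.rand(10)
sents = torch.rand(2, 5, 10)

# Broadcast, multiply element-wise, then sum over the hidden dimension.
via_broadcast = (query.view(1, 1, 10) * sents).sum(dim=2)

# Equivalent batched matrix-vector product: [2, 5, 10] x [10] -> [2, 5]
via_matmul = torch.matmul(sents, query)

print(via_matmul.size())                                    # torch.Size([2, 5])
print(torch.allclose(via_broadcast, via_matmul, atol=1e-5))
```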


## Similarity: a batch of vectors vs. a batch of sentences

In [130]:
# similarity: a batch of vectors v.s. a batch of sentence representations 
query = torch.rand(2, 10) # pay attention to the difference from the previous case
sents = torch.rand(2, 5, 10)

similarity = (query.view(2, 1, 10) * sents).sum(dim=2)
print(similarity.size())  # similarity[i, j] means the vector dot-product between query[i] and word j in sentence i

torch.Size([2, 5])


## Similarity: a sentence vs. a batch of sentences

In [135]:
# similarity: a sentence v.s. a batch of sentence representations
sent_0 = torch.rand(4, 10)
sents = torch.rand(2, 5, 10)

# Q: what is the underlying process? hint: recall the previous align-and-broadcast
# A: ---- YOUR ANSWER HERE ----
similarity = (sent_0.view(1, 4, 1, 10) * sents.view(2, 1, 5, 10)).sum(dim=3) 
print(similarity.size()) # similarity[i, j, k] means the vector dot-product between sent_0[j] (j-th word) and sents[i, k] (i-th sent, k-th word)

similarity = (sent_0.view(1, 1, 4, 10) * sents.view(2, 5, 1, 10)).sum(dim=3)
print(similarity.size()) # similarity[i, j, k] means the vector dot-product between sent_0[k] (k-th word) and sents[i, j] (i-th sent, j-th word)

# NOTE: pay attention to the difference between the two shapes

torch.Size([2, 4, 5])
torch.Size([2, 5, 4])


## Similarity: two batches of sentences

In [138]:
# similarity: two batches of sentences representations
# this usually happens in machine translation, where we have a batch of sentences in the source language and another batch of sentences in the target language
source_sents = torch.rand(2, 5, 10) # [batch_size = 2, source_sent_length = 5, hidden_size = 10]
target_sents = torch.rand(2, 6, 10) # [batch_size = 2, target_sent_length = 6, hidden_size = 10]

# Q: what is the underlying process? hint: recall the previous align-and-broadcast
# A: ---- YOUR ANSWER HERE ----
similarity = (source_sents.view(2, 5, 1, 10) * target_sents.view(2, 1, 6, 10)).sum(dim = 3)
print(similarity.size()) # similarity[i, j, k] means source_sents[i, j] v.s. target_sents[i, k]

similarity = (source_sents.view(2, 1, 5, 10) * target_sents.view(2, 6, 1, 10)).sum(dim = 3)
print(similarity.size()) # similarity[i, j, k] means source_sents[i, k] v.s. target_sents[i, j]
# NOTE: pay attention to the difference between the two shapes

torch.Size([2, 5, 6])
torch.Size([2, 6, 5])
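
The first similarity tensor can also be obtained with a batched matrix multiplication, which avoids building the intermediate `[2, 5, 6, 10]` tensor. This is a sketch of an alternative formulation, not a replacement for working through the broadcast version:

```python
import torch

source_sents = torch.rand(2, 5, 10)  # [batch_size, source_sent_length, hidden_size]
target_sents = torch.rand(2, 6, 10)  # [batch_size, target_sent_length, hidden_size]

# Broadcast version: creates a [2, 5, 6, 10] intermediate before summing.
via_broadcast = (source_sents.view(2, 5, 1, 10) * target_sents.view(2, 1, 6, 10)).sum(dim=3)

# Batched matmul version: [2, 5, 10] x [2, 10, 6] -> [2, 5, 6]
via_bmm = torch.bmm(source_sents, target_sents.transpose(1, 2))

print(via_bmm.size())                                    # torch.Size([2, 5, 6])
print(torch.allclose(via_broadcast, via_bmm, atol=1e-5))
```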


# Section 3. Masked Batch Computation

When putting sentences of different lengths into a batch, common practice is to pad them to the same pre-specified maximum length with a special PAD token, then mask out computation involving the PAD token.

## Basic Masking

In [141]:
# a batch of sentences with different lengths
PAD = 0
sents = torch.tensor([[1, 2, 3, 0, 0], # length = 3, 2 PAD tokens,
                      [3, 4, 0, 0, 0], # length = 2, 3 PAD tokens
                      [5, 6, 7, 8, 9]] # length = 5, no PAD token
                    )

In [142]:
# create a batch of mask vectors
mask = (sents != PAD).float()
print(mask)

tensor([[1., 1., 1., 0., 0.],
        [1., 1., 0., 0., 0.],
        [1., 1., 1., 1., 1.]])
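
The mask and the sentence lengths carry the same information, so each can be recovered from the other. A minimal sketch, assuming PAD tokens only appear at the end of a sentence (the names `lengths`, `positions` and `mask_from_lengths` are introduced here):

```python
import torch

PAD = 0
sents = torch.tensor([[1, 2, 3, 0, 0],
                      [3, 4, 0, 0, 0],
                      [5, 6, 7, 8, 9]])
mask = (sents != PAD).float()                # [3, 5]

# lengths from mask: count the non-PAD positions in each sentence
lengths = mask.sum(dim=1).long()             # [3]
print(lengths)

# mask from lengths: position j is a real word if j < length (align-and-broadcast again)
positions = torch.arange(sents.size(1))      # [5]
mask_from_lengths = (positions.view(1, 5) < lengths.view(3, 1)).float()  # [3, 5]
print(torch.equal(mask, mask_from_lengths))
```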


## Masked Batchified Computation

We now repeat the computations of the previous section, but this time we mask out any computation involving PAD tokens.

## Similarity: one single vector vs. a batch of sentences + mask

In [145]:
# one single vector v.s. a batch of sentence representations + a batch of masks
query = torch.rand(10)
sents = torch.rand(3, 5, 10)

# suppose we would like to compute the dot-product (as a measure of similarity) between the query vector and all words within all sentences
similarity = (query.view(1, 1, 10) * sents * mask.view(3, 5, 1)).sum(dim=2) # query.size = [1, 1, 10]
                                                                            # sents.size = [3, 5, 10]
                                                                            # mask.size  = [3, 5, 1]
                                                                            # recall the align-then-broadcast introduced in the previous section
print(similarity)
print(similarity.size())  # similarity[i, j] means the vector dot-product between the query vector and word j of sentence i
                          # similarity scores for PAD tokens are masked to be 0

tensor([[2.2061, 1.4723, 1.4071, 0.0000, 0.0000],
        [1.8124, 1.3987, 0.0000, 0.0000, 0.0000],
        [1.8645, 2.5112, 2.2962, 3.0032, 2.4920]])
torch.Size([3, 5])
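
Because the mask is constant along the hidden dimension, it can also be applied after the sum instead of before it. The sketch below checks this equivalence and then uses the mask to average the scores over real words only; `masked_before`, `masked_after` and `mean_similarity` are names introduced for illustration.

```python
import torch

torch.manual_seed(0)
mask = torch.tensor([[1., 1., 1., 0., 0.],
                     [1., 1., 0., 0., 0.],
                     [1., 1., 1., 1., 1.]])   # [3, 5]
query = torch.rand(10)                        # [10]
sents = torch.rand(3, 5, 10)                  # [3, 5, 10]

# mask inside the product, as in the cell above
masked_before = (query.view(1, 1, 10) * sents * mask.view(3, 5, 1)).sum(dim=2)  # [3, 5]

# mask applied to the finished scores: [3, 5] * [3, 5]
masked_after = (query.view(1, 1, 10) * sents).sum(dim=2) * mask                 # [3, 5]
print(torch.allclose(masked_before, masked_after))

# average similarity per sentence, dividing by the number of real words rather than by 5
mean_similarity = masked_after.sum(dim=1) / mask.sum(dim=1)                     # [3]
print(mean_similarity.size())
```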


## Similarity: a sentence vs. a batch of sentences + mask

In [149]:
# similarity: a sentence v.s. a batch of sentence representations
sent_0 = torch.rand(4, 10)
sent_0_mask = torch.tensor([1, 1, 1, 0]) # we assume the length of sent_0 is 3

sents = torch.rand(3, 5, 10)

# Q: what is the underlying process? hint: recall the previous align-and-broadcast
# A: ---- YOUR ANSWER HERE ----
#
# hint: first mask out sent_0
#       sent_0.size        = [1, 4, 1, 10]
#       sent_0_mask.size   = [1, 4, 1, 1]
#       sent_0_masked.size = [1, 4, 1, 10]
#
#       then mask out sents
#       sents.size         = [3, 1, 5, 10]
#       mask.size          = [3, 1, 5, 1 ]
#       sents_masked.size  = [3, 1, 5, 10]
#
#       finally compute the similarity score for the two masked tensors
#       sent_0_masked.size = [1, 4, 1, 10]
#       sents_masked.size  = [3, 1, 5, 10]
#       similarity.size    = [3, 4, 5]

similarity = (sent_0.view(1, 4, 1, 10) * sent_0_mask.view(1, 4, 1, 1) * sents.view(3, 1, 5, 10) * mask.view(3, 1, 5, 1)).sum(dim=3) 

print(similarity)
print(similarity.size()) # similarity[i, j, k] means the vector dot-product between sent_0[j] (j-th word) and sents[i, k] (i-th sent, k-th word)
                         # similarity scores for PAD tokens are masked to be 0

tensor([[[2.0493, 2.1468, 2.2325, 0.0000, 0.0000],
         [3.1543, 3.7314, 4.3647, 0.0000, 0.0000],
         [2.3197, 2.5866, 3.1946, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],

        [[2.2435, 2.1819, 0.0000, 0.0000, 0.0000],
         [3.1755, 3.7108, 0.0000, 0.0000, 0.0000],
         [2.0227, 3.0320, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],

        [[1.8279, 3.1262, 1.9105, 1.4596, 1.3352],
         [3.0546, 4.7720, 3.2631, 2.5259, 2.6082],
         [2.7099, 3.2255, 2.7408, 1.7844, 1.7518],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]])
torch.Size([3, 4, 5])
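
The two masks can also be combined into a single pairwise mask that is applied to the unmasked scores. The following is a sketch of that alternative, not the lab's reference answer; `pair_mask`, `masked_inputs` and `masked_scores` are hypothetical names.

```python
import torch

torch.manual_seed(0)
sent_0 = torch.rand(4, 10)
sent_0_mask = torch.tensor([1, 1, 1, 0])          # length of sent_0 is 3
sents = torch.rand(3, 5, 10)
mask = torch.tensor([[1., 1., 1., 0., 0.],
                     [1., 1., 0., 0., 0.],
                     [1., 1., 1., 1., 1.]])

# version from the cell above: mask both inputs, then sum over the hidden dimension
masked_inputs = (sent_0.view(1, 4, 1, 10) * sent_0_mask.view(1, 4, 1, 1)
                 * sents.view(3, 1, 5, 10) * mask.view(3, 1, 5, 1)).sum(dim=3)  # [3, 4, 5]

# pairwise mask: entry [i, j, k] is 1 only if sent_0[j] and sents[i, k] are both real words
pair_mask = sent_0_mask.view(1, 4, 1) * mask.view(3, 1, 5)                      # [3, 4, 5]

# unmasked scores, then one multiplication by the pairwise mask
unmasked = (sent_0.view(1, 4, 1, 10) * sents.view(3, 1, 5, 10)).sum(dim=3)      # [3, 4, 5]
masked_scores = unmasked * pair_mask

print(pair_mask.size())
print(torch.allclose(masked_inputs, masked_scores))
```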


## Similarity: two batches of masked sentences

In [151]:
# similarity: two batches of sentence representations
# this usually happens in machine translation, where we have a batch of source-language sentences and another batch of target-language sentences
# here we additionally assume that each sentence has its own mask
source_sents = torch.rand(2, 5, 10) # [batch_size = 2, source_sent_length = 5, hidden_size = 10]
source_mask  = torch.tensor([[1, 1, 1, 1, 0], # length = 4
                             [1, 1, 1, 1, 1]] # length = 5
                           )
target_sents = torch.rand(2, 6, 10) # [batch_size = 2, target_sent_length = 6, hidden_size = 10]
target_mask  = torch.tensor([[1, 1, 1, 0, 0, 0], # length = 3
                             [1, 1, 1, 1, 1, 1]] # length = 6
                           )

# Q: what is the underlying process? hint: recall the previous align-and-broadcast-and-mask
# A: ---- YOUR ANSWER HERE ----
similarity = (source_sents.view(2, 5, 1, 10) * source_mask.view(2, 5, 1, 1) * target_sents.view(2, 1, 6, 10) * target_mask.view(2, 1, 6, 1)).sum(dim = 3)
print(similarity)
print(similarity.size()) # similarity[i, j, k] means source_sents[i, j] v.s. target_sents[i, k]

tensor([[[2.5672, 2.6088, 2.5834, 0.0000, 0.0000, 0.0000],
         [2.0834, 1.5814, 2.3227, 0.0000, 0.0000, 0.0000],
         [2.5505, 2.8746, 2.9579, 0.0000, 0.0000, 0.0000],
         [2.6716, 2.0214, 2.7910, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]],

        [[2.0494, 2.5466, 2.8356, 2.0703, 2.9785, 3.8640],
         [0.9491, 1.5451, 1.6202, 1.1407, 1.5279, 1.9398],
         [3.0602, 2.4551, 3.5492, 2.8821, 3.5459, 4.9329],
         [2.5598, 2.3238, 2.8777, 2.4559, 2.8768, 4.1201],
         [2.0323, 1.8204, 2.0931, 2.0048, 1.5199, 2.5985]]])
torch.Size([2, 5, 6])
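
Multiplying by a 0/1 mask is one way to zero out PAD positions; another is to overwrite them with `Tensor.masked_fill` using a boolean mask. The sketch below assumes the same shapes as the cell above and checks that both routes agree; `pair_is_real`, `multiplied` and `filled` are names introduced here.

```python
import torch

torch.manual_seed(0)
source_sents = torch.rand(2, 5, 10)
source_mask = torch.tensor([[1, 1, 1, 1, 0],
                            [1, 1, 1, 1, 1]])
target_sents = torch.rand(2, 6, 10)
target_mask = torch.tensor([[1, 1, 1, 0, 0, 0],
                            [1, 1, 1, 1, 1, 1]])

# multiplication-based masking, as in the cell above
multiplied = (source_sents.view(2, 5, 1, 10) * source_mask.view(2, 5, 1, 1)
              * target_sents.view(2, 1, 6, 10) * target_mask.view(2, 1, 6, 1)).sum(dim=3)  # [2, 5, 6]

# boolean pairwise mask: True where both the source word and the target word are real
pair_is_real = source_mask.bool().view(2, 5, 1) & target_mask.bool().view(2, 1, 6)         # [2, 5, 6]

# compute unmasked scores, then overwrite every position that involves a PAD token
unmasked = torch.bmm(source_sents, target_sents.transpose(1, 2))                            # [2, 5, 6]
filled = unmasked.masked_fill(~pair_is_real, 0.0)

print(torch.allclose(multiplied, filled))
```

One practical difference: multiplying by 0 does not clean up an entry that is already `inf` or `NaN` (since `inf * 0` is `NaN`), whereas `masked_fill` overwrites the entry regardless of its value.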


# Section 4. Final Application: Negative Log Likelihood

Finally, we put together everything we have learned and compute the per-word negative log likelihood (NLL) of a neural language model.

The per-word NLL for a sentence $x_1, ..., x_T$ is:

$$
- \frac{1}{T} \sum_{t=1}^{T} \log p(x_t \mid x_{1:t-1})
$$

which is the negative log probability of the words, averaged over the sentence length.
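
To make the formula concrete, here is a tiny worked example in plain Python. The probabilities in `word_probs` are made up for illustration: a three-word sentence whose words receive probability 0.5, 0.25 and 0.125 from some hypothetical model.

```python
import math

# p(x_t | x_{1:t-1}) for t = 1, 2, 3 (made-up values)
word_probs = [0.5, 0.25, 0.125]
T = len(word_probs)

# per-word NLL: minus the average log probability
per_word_nll = -sum(math.log(p) for p in word_probs) / T
print(per_word_nll)  # 2 * ln(2), approximately 1.386
```

The log probabilities are -1, -2 and -3 times ln(2), so their average is -2 ln(2) and the per-word NLL is 2 ln(2).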

In [4]:
# we start with the final layer of a neural language model, which outputs a tensor named logits (the logits are the scores before the softmax is applied):
logits = torch.rand(2, 5, 100)  # [batch_size = 2, max_sent_len = 5, vocab_size = 100]
                                # in practice this should be the output of the final layer of a neural language model. Here we just simulate it with a random tensor
vocab_size = 100 # in practice the vocabulary size is usually >10K; here we simplify it to 100 for demonstration

# putting the logits into a softmax function gives the probability of each word within the vocabulary
probs = F.softmax(logits, dim=2)

# negative log likelihood requires the log of probability
log_probs = F.log_softmax(logits, dim=2)

# a language model aims to maximize the probability of each word in every sentence, but the PAD token should be excluded
target_sent = torch.tensor([[10, 25, 59, 77, 0],   # length = 4, suppose [10, 25, 59, 77, 0] means ['I', 'really', 'like', 'cats', 'PAD']
                            [16, 29, 40, 56, 3]]   # length = 5, suppose [16, 29, 40, 56, 3] means ['Jack', 'does', 'not', 'have', 'dogs']
                          )

# transform the index representation to the one-hot representation
target_one_hot = F.one_hot(target_sent, vocab_size)

# create mask for the target sentence
mask = (target_sent != 0).float()

# the following computation computes the negative log likelihood
# ---- YOUR TASK ----
# 1. annotate the shape of every tensor in the following computation
# 2. explain every computation step
negative_log_likelihood = (((-target_one_hot * log_probs).sum(dim=2) * mask).sum(dim=1) / mask.sum(dim=1)).mean()
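
One way to check your shape annotations is to recompute the same quantity by a different route and compare. The sketch below replaces the one-hot multiplication with `Tensor.gather`, which picks out the log probability of each target word directly. It is an alternative formulation offered as a cross-check, not the lab's reference solution; `target_log_probs`, `per_sent_nll`, `nll_one_hot` and `nll_gather` are names introduced here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 100
logits = torch.rand(2, 5, vocab_size)                 # [2, 5, 100]
log_probs = F.log_softmax(logits, dim=2)              # [2, 5, 100]
target_sent = torch.tensor([[10, 25, 59, 77, 0],
                            [16, 29, 40, 56, 3]])     # [2, 5]
mask = (target_sent != 0).float()                     # [2, 5]

# one-hot route, as in the cell above
target_one_hot = F.one_hot(target_sent, vocab_size)   # [2, 5, 100]
nll_one_hot = (((-target_one_hot * log_probs).sum(dim=2) * mask).sum(dim=1) / mask.sum(dim=1)).mean()

# gather route: select log_probs[i, t, target_sent[i, t]] for every position
target_log_probs = log_probs.gather(2, target_sent.unsqueeze(2)).squeeze(2)  # [2, 5]
per_sent_nll = -(target_log_probs * mask).sum(dim=1) / mask.sum(dim=1)       # [2]
nll_gather = per_sent_nll.mean()                                             # scalar

print(per_sent_nll.size())
print(torch.allclose(nll_one_hot, nll_gather))
```

The gather route avoids building the [2, 5, 100] one-hot tensor, which becomes noticeable when the vocabulary has tens of thousands of entries.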

# Further Reading

By this stage, you have annotated a great many tensor shapes. Is there a way to write code that is self-explanatory, rather than requiring a shape annotation on every single line?

The answer is yes: [einsum](https://rockt.github.io/2018/04/30/einsum) and [einops](https://einops.rocks/). You are encouraged to explore them yourself.
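
As a small taste of einsum, the batch-versus-batch similarity from Section 2 can be written with `torch.einsum`, where the index string names each dimension. This is a minimal sketch; the letters `b`, `s`, `t` and `h` (batch, source position, target position, hidden) are labels chosen here for readability.

```python
import torch

torch.manual_seed(0)
source_sents = torch.rand(2, 5, 10)  # [batch b, source position s, hidden h]
target_sents = torch.rand(2, 6, 10)  # [batch b, target position t, hidden h]

# align-and-broadcast version from Section 2
broadcast_sim = (source_sents.view(2, 5, 1, 10) * target_sents.view(2, 1, 6, 10)).sum(dim=3)

# einsum version: h appears in both inputs but not in the output, so it is summed out
einsum_sim = torch.einsum('bsh,bth->bst', source_sents, target_sents)  # [2, 5, 6]

print(einsum_sim.size())
print(torch.allclose(broadcast_sim, einsum_sim))

# writing 'bts' as the output gives the transposed layout, [2, 6, 5]
print(torch.einsum('bsh,bth->bts', source_sents, target_sents).size())
```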