# Exercise 2: PyTorch

You can work in pairs or individually.

Upload your solution on OLAT before the deadline: **Friday, 8th November 2014, at 12:15**.

If you have any questions, post them on OLAT.

**Submission Format**
- Filename: **olatnameStudent1_olatnameStudent2_ml_ex2.ipynb**
- Include the names of **both team members** in the block below.
- If you have multiple files, place your file(s) in a compressed folder (zip).


Good luck! :)

---
Group Members:

Mert Erol, 20-915-245

#### Please ignore the "type: ignore" comments; they are there so that my linter doesn't produce random warnings.
---


In [33]:
import torch # type: ignore
import torch.nn as nn # type: ignore
import torch.nn.functional as F # type: ignore

## Task 1: Tensor Manipulation

### Task 1.1 Squeeze and Unsqueeze
- The first task: for the first three blocks, explain ONLY squeeze and unsqueeze dimension in SIMPLE and SHORT sentences

In [2]:
x = torch.randn(2, 1, 4, 1, 3)
x = torch.squeeze(x)
# squeeze removes dimensions of size 1 from the shape of the tensor
x.shape

torch.Size([2, 4, 3])

In [3]:
x = torch.randn(3, 4)
x = torch.unsqueeze(x, 0) # add a dimension of size 1 to the shape at index 0
x = torch.unsqueeze(x, -1) # add a dimension of size 1 to the shape at the last index
x.shape

torch.Size([1, 3, 4, 1])

In [4]:
x = torch.randn(2, 3, 1, 4)
x = torch.squeeze(x, 2) # remove dimension of size 1 at index 2
x = torch.unsqueeze(x, 1) # add dimension of size 1 to the shape at index 1
x.shape

torch.Size([2, 1, 3, 4])

- The second task is to use squeeze and unsqueeze functions to achieve the target dimension

In [5]:
# Starting shape: (3, 1, 4, 1)
# Target shape: (1, 3, 4, 1, 1)
x = torch.randn(3, 1, 4, 1)
x = torch.squeeze(x, 1)
x = torch.unsqueeze(x, 0)
x = torch.unsqueeze(x, -1)

x.shape

torch.Size([1, 3, 4, 1, 1])
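
As a side note to the cells above, here is a minimal sketch of how `squeeze` behaves with and without a `dim` argument. The tensor shape is a made-up example (a batch that happens to contain one sample), not part of the exercise.

```python
import torch

# Hypothetical example: a batch with a single sample, shape (batch=1, features=3, extra=1)
x = torch.randn(1, 3, 1)

# Without a dim argument, every size-1 dimension is removed, including the batch axis.
squeezed_all = torch.squeeze(x)

# With a dim argument, only that axis is removed (and only if its size is 1).
squeezed_last = torch.squeeze(x, -1)

# Squeezing an axis whose size is not 1 leaves the shape unchanged.
unchanged = torch.squeeze(x, 1)

print(squeezed_all.shape, squeezed_last.shape, unchanged.shape)
```

Passing `dim` explicitly, as in the cells above, avoids accidentally dropping an axis that only happens to have size 1.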

### Task 1.2 Batch Matrix Multiplication
Given the following operations involving batch matrix multiplication (BMM)
- fill in the code (the correct dimension for d)
- answer questions.

You can find information about batch matrix multiplication here:

https://pytorch.org/docs/stable/generated/torch.bmm.html

https://stackoverflow.com/questions/50826644/why-do-we-do-batch-matrix-matrix-product

In [None]:
# Mini-batch size of 64
a = torch.randn(64, 128, 256)   # [batch_size, seq_length, embedding_dim]
b = torch.randn(64, 256, 512)   # [batch_size, embedding_dim, hidden_dim]

# First BMM operation
x = torch.bmm(a, b)             # DONE: What's the shape? Calculate it!
print("shape of x: ", x.shape)

# Create tensor d with correct dimensions
d = torch.randn(64, 512, 9)   # DONE: Fill in the correct dimensions
print("shape of d: ", d.shape)

# Second BMM operation
y = torch.bmm(x, d)             # Target shape: [64, 128, 9]
                                # (batch_size, seq_length, output_dim)
print("shape of y: ",y.shape)

shape of x:  torch.Size([64, 128, 512])
shape of d:  torch.Size([64, 512, 9])
shape of y:  torch.Size([64, 128, 9])


1. What is the shape of intermediate tensor x?

      torch.Size([64, 128, 512])

2. Fill in the shape of tensor d. Explain why you did so according to bmm.
   
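      Tensor d should have the shape [64, 512, 9] to be compatible with the next bmm operation. torch.bmm multiplies a [b, n, m] tensor by a [b, m, p] tensor and returns [b, n, p]. x has the shape [64, 128, 512], so d needs the same batch size (64), an inner dimension of 512 to match the last dimension of x, and a last dimension of 9 to reach the target shape [64, 128, 9].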

3. Think: Why does this sequence of operations make sense in the context of
   deep learning (hint: think about sequence processing)? Then guess and answer: what might seq_length and embedding_dim mean? How about the output dimension of y? (Points will be given for all guesses that are relevant to Text/Speech/Images/Sciences etc.)

      The operations could represent successive linear transformations applied to every step of a sequence. seq_length is the number of steps in the sequence (like words in a sentence for NLP tasks), embedding_dim is the size of the vector that represents each step, and hidden_dim is the size of that vector after the first projection. The final output dimension of 9 might represent scores for 9 categories or classes at each step.
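
The shape rule used in the answers above can be checked on a small example. This is a sketch with made-up sizes (4, 3, 5, 2), chosen only to keep the tensors small; it compares `torch.bmm` against multiplying each batch element separately.

```python
import torch

torch.manual_seed(0)
a = torch.randn(4, 3, 5)   # [batch, n, m]
b = torch.randn(4, 5, 2)   # [batch, m, p]

# One call multiplies all 4 matrix pairs: result is [batch, n, p]
batched = torch.bmm(a, b)

# The same result, computed one batch element at a time
looped = torch.stack([a[i] @ b[i] for i in range(a.size(0))])

print(batched.shape)
print(torch.allclose(batched, looped, atol=1e-5))
```

The batch dimension is never mixed: each of the 4 matrices in `a` only meets the matrix at the same index in `b`.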


### Task 1.3: Embeddings

Take a closer look at the code below, finish it (TODO) and then answer the questions. You can add print statements to get insight into the individual steps and output.

In [None]:
# We have a dataset of 10 tokens, and each token is represented by a vector with a dimensionality of 5.
vocab_size = 10 #DONE
embedding_dim = 5 #DONE

embedding_layer = nn.Embedding(vocab_size, embedding_dim)

x = torch.LongTensor([1, 3, 5, 7, 9])

embedded_x = embedding_layer(x)
embedded_x

tensor([[-0.4952, -0.7276, -1.2611, -0.3340,  1.2518],
        [-1.6660, -0.4927,  1.5976,  2.1014,  0.0643],
        [-1.0812,  0.1321,  1.6172,  0.1853,  1.2440],
        [ 0.9381, -1.6118,  0.5916,  0.7331,  0.5981],
        [-0.8053, -0.6769,  1.2444,  0.0452,  1.8524]],
       grad_fn=<EmbeddingBackward0>)

TODO: Answer the following questions: (I have no idea what happened with markdown here)
- a) What is the purpose of the embedding layer in neural networks?

        The layer maps discrete inputs (token indices, which correspond to high-dimensional one-hot vectors) to dense vectors in a lower-dimensional space. The vectors are learned during training, which allows the network to capture relationships between inputs and to process the data more efficiently.

- b) What is the role of 'vocab_size' and 'embedding_dim' in the 'nn.Embedding' layer?

        vocab_size is the size of the dictionary of embeddings (the number of distinct tokens) and embedding_dim is the size of each embedding vector: nn.Embedding(vocab_size, embedding_dim)
        
- c) What happens if you try to embed a category that is out of the range defined by vocab_size in the nn.Embedding layer? Give an example of an out of range input value.

        The indices need to be in [0, vocab_size - 1]. An index that is negative or greater than vocab_size - 1 raises an error.
        For example:

        import torch
        import torch.nn as nn

        vocab_size = 10  # Assume a vocabulary size of 10
        embedding_dim = 5
        embedding_layer = nn.Embedding(vocab_size, embedding_dim)

        input_tensor = torch.tensor([10])  # 10 is out of range
        output = embedding_layer(input_tensor)
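
The example above can be extended into a small check. This is a sketch, assuming the lookup runs on the CPU, where an out-of-range index raises an `IndexError`; on a GPU the failure may surface differently. It also shows that a lookup simply returns rows of the layer's weight matrix.

```python
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10, 5
embedding_layer = nn.Embedding(vocab_size, embedding_dim)

# Valid indices are 0 .. vocab_size - 1; a lookup returns the matching weight rows.
valid = torch.tensor([0, 9])
same_rows = torch.equal(embedding_layer(valid), embedding_layer.weight[valid])
print(same_rows)

# Index 10 is one past the last valid row, so the lookup fails (IndexError on CPU).
try:
    embedding_layer(torch.tensor([10]))
    raised = False
except IndexError as err:
    raised = True
    print(f"IndexError: {err}")
```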



## Task 2: Network Parameters

Imagine you're training a neural network for a 10-class classification problem that takes an input vector of size 236. You implement 2 hidden layers with 472 neurons each. You don't learn any bias terms.
<br>

 1) How many different parameters does the neural network contain?
    
    236 * 472 + 472^2 + 472*10 = 338'896

<br> <br>
2) What's the difference between parameters and hyper-parameters? Name 3 different hyper-parameters and explain their role inside a neural network.
    
    Parameter: Internal values learned by the network during training, like the weights of the connections between neurons. These are adjusted in response to the data to minimize the loss function and improve the model's predictions.

    Hyperparameter: External configuration values that are set before training, like the architecture of the model or the learning rate. These are not learned from the data but are instead set manually or tuned to optimize performance.

    E.g. hyperparameters:
        - learning rate: the step size of each parameter update
        - batch size: the number of samples used to compute each gradient update
        - number of epochs: how many times training passes over the whole dataset
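
The parameter count from question 1 can be double-checked by building the network and counting. This is a sketch: the task does not name an activation function, so the `nn.ReLU` layers here are an assumption (they add no parameters either way).

```python
import torch.nn as nn

# 236 inputs -> two hidden layers of 472 neurons -> 10 classes, no bias terms
model = nn.Sequential(
    nn.Linear(236, 472, bias=False),
    nn.ReLU(),  # assumed activation, contributes no parameters
    nn.Linear(472, 472, bias=False),
    nn.ReLU(),  # assumed activation, contributes no parameters
    nn.Linear(472, 10, bias=False),
)

# Sum the number of elements over all weight matrices
n_params = sum(p.numel() for p in model.parameters())
print(n_params)
```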


## Task 3: PyTorch Model

### Task 3.1: Realize the following model in PyTorch:
  - [ ] Realize matrices as linear layers.
  - [ ] Create an instance of your model.
  - [ ] Apply it to $\mathbf{x}$, defined in the second code cell.
  - [ ] Determine the Cross-Entropy loss.

**Model Equations:**

- $\mathbf{e}=\mathbf{E}\mathbf{x}$
- $\mathbf{h}=\sigma(\mathbf{W}\mathbf{e}+\mathbf{b})$
- $\mathbf{z}=\mathbf{U}\mathbf{h}+\mathbf{b}$
- $\hat{\mathbf{y}}=\mathbf{z}$

**Details:**

- We are performing a classification task with **4 classes**.
- $\mathbf{x}$ is a batch of 4 integers that represent some tokens.
- $\mathbf{E}$ is an 8 x 6 embedding layer, realized with PyTorch functionality.
- $\mathbf{W}$ and $\mathbf{U}$ are weight matrices. Both layers learn bias terms. The output dimension of $\mathbf{W}$ is 10.
- Activation function $\sigma$ is ReLU.
- Model output is 4 logits per token (raw values).

Note: Don't change any of the existing code, only add your solutions where it's indicated (TODO statements). Print statements can be placed wherever you need them.

In [None]:
import torch # type: ignore
import torch.nn as nn # type: ignore
import torch.optim as optim # type: ignore


seed = 42
torch.manual_seed(seed)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(8, 6) #DONE
        self.W = nn.Linear(6, 10)#DONE
        self.U = nn.Linear(10, 4) #DONE
        self.relu = nn.ReLU() #DONE

    def forward(self,x):
        #DONE: ... (as many lines as you need)
        e = self.embedding(x)
        h = self.relu(self.W(e))
        z = self.U(h)

        return z #DONE:...


In [None]:
X=torch.tensor([1,2,3,7],dtype=torch.long)
y=torch.tensor([1,0,1,3],dtype=torch.long)

model = Net() #DONE
output = model(X) #DONE
print(f'model output: {output}')

#DONE: determine the loss

loss_function = nn.CrossEntropyLoss() #DONE
loss = loss_function(output, y) #DONE
print(f'loss: {loss}')

model output: tensor([[ 0.3125,  0.0869,  0.2802, -0.1106],
        [-0.2266, -0.4833,  0.0509, -0.4157],
        [-0.3728, -0.2066, -0.1796, -0.3994],
        [-0.1099, -0.2404, -0.3024, -0.1897]], grad_fn=<AddmmBackward0>)
loss: 1.3744735717773438


### Task 3.2

Take another look at the cells above.

- Is it a binary or a multiclass classification task?
    - multiclass
- Why is there no activation function applied to the output? (Hint: check the documentation for Cross Entropy Loss) TODO
    - CrossEntropyLoss expects raw logits as input and applies log-softmax internally. Applying a softmax to the model output as well would apply it twice, which gives incorrect loss values.
- What is the vocabulary size of this model? TODO
    - 8
- What do the numbers in $\mathbf{y}$ represent? TODO
    - class labels
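
The answer about the missing output activation can be illustrated with a short sketch. It reuses the logits printed by the cell above (rounded to four decimals) and the labels in `y`, and compares the built-in loss against a manual log-softmax followed by the negative log-likelihood loss.

```python
import torch
import torch.nn.functional as F

# Logits copied from the printed model output above (rounded), labels from y
logits = torch.tensor([[ 0.3125,  0.0869,  0.2802, -0.1106],
                       [-0.2266, -0.4833,  0.0509, -0.4157],
                       [-0.3728, -0.2066, -0.1796, -0.3994],
                       [-0.1099, -0.2404, -0.3024, -0.1897]])
targets = torch.tensor([1, 0, 1, 3])

# Built-in cross entropy on raw logits
loss_builtin = F.cross_entropy(logits, targets)

# The same computation written out: log-softmax, then negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Applying softmax first means it is effectively applied twice, changing the loss
loss_double = F.cross_entropy(F.softmax(logits, dim=1), targets)

print(loss_builtin, loss_manual, loss_double)
```

The first two values agree with the loss printed above, while the third does not, which is why the model returns raw logits.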

## Task 4 (Bidirectional Self-)Attention
Given a simple sentence "he likes cats" and the following weight matrices:
- Weight matrix for Query ($W_q$)
- Weight matrix for Key ($W_k$)

Dimensionalities: (important!)
- the dimensionality of input is [sentence length, d]
- d stands for hidden embeddings
- the dimensionality of weight matrices should be [d, d_k]
- **from above, you can see that d_k has nothing to do with input x.**

You have two tasks.
- Follow the steps to fill in the code - calculate the weight matrix.
- Look at the weight matrix and explain it.

In [22]:
# Don't change this block, just read and run
# Weight matrices
Wq = torch.tensor([[0.5, 2.0],
                    [3.0, 1.5]])

Wk = torch.tensor([[1.0, 0.4],
                    [0.6, 1.2]])

# Word embeddings for "he likes cats"
x = torch.tensor([[1.0, 2.0],  # "he"
                    [2.0, 1.0],  # "likes"
                    [1.5, 1.5]]) # "cats"

Follow the steps to compute attention weights and interpret the results.

In [None]:
# Step 1: calculate Q and K
Q = torch.matmul(x, Wq) # DONE
K = torch.matmul(x, Wk) # DONE

In [None]:
# Step 2: Compute attention scores
# hint1: d_k has nothing to do with d_x, but instead, it is related to the weight matrix k.
# hint2: to compute the scores, you need to transpose K.
import math

d_k = K.size(1) # DONE
scores = torch.matmul(Q, K.T) # DONE
scores_scaled = scores / math.sqrt(d_k) #DONE, here you divide the scores by ...

In [None]:
# Step 3.1:
# It is always a good habit to read documentation. Always read documentation.
# Find the documentation in torch on how you can calculate the variance of a tensor, and paste the link here

# https://pytorch.org/docs/stable/generated/torch.var.html

In [None]:
# Step 3.2:
# Read the documentation
# Use what you find in the documentation and calculate the variance of scores and scores_scaled
# Hint: It should be a very simple code

var_scores = torch.var(scores, dim = None) # DONE
var_scores_scaled = torch.var(scores_scaled, dim = None) # DONE

In [31]:
# Step 3.3
# the code below gives you the ratio of var_scores and var_scores_scaled
ratio = var_scores / var_scores_scaled
print(ratio)

# Answer: what number is the ratio close to?

# The ratio is 2, which equals d_k. Dividing the scores by sqrt(d_k) divides their variance by d_k, because Var(X / c) = Var(X) / c^2.

tensor(2.0000)
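
The ratio does not depend on the particular scores. Here is a sketch with randomly generated stand-in scores (hypothetical values, not the ones from this task) showing that dividing by `sqrt(d_k)` always divides the variance by `d_k`.

```python
import math
import torch

torch.manual_seed(0)
d_k = 2

# Stand-in scores (hypothetical values, not the Q K^T scores from the task)
scores = torch.randn(3, 3) * 5.0
scores_scaled = scores / math.sqrt(d_k)

# Var(X / c) = Var(X) / c^2, so the ratio equals sqrt(d_k)^2 = d_k
ratio = torch.var(scores) / torch.var(scores_scaled)
print(ratio)
```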


In [32]:
# Step 4: Apply softmax to get attention weights
weights = F.softmax(scores_scaled, dim = 1) # DONE
weights

tensor([[0.5047, 0.1876, 0.3077],
        [0.6624, 0.0915, 0.2461],
        [0.5874, 0.1331, 0.2796]])

TODO: Answer the questions below

a) What do the values in the first row of attention weights represent?
    How much "he" (as the query) attends to each word: he to he // he to likes // he to cats. The values in the row sum to 1.

b) Look at the last row of weights - what does each number tell us?
    How much "cats" attends to each word: cats to he // cats to likes // cats to cats

c) Which row or column should you look at if you want to find out which word pays most attention to "likes"? And what is the word?
    The second column; the word is "he" (0.1876 is the largest value in that column).
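
The answers above can be checked programmatically. This sketch recomputes the attention weights from the given matrices and reads off the row sums and the "likes" column; the `words` list is a helper added here for readability and is not part of the exercise.

```python
import math
import torch
import torch.nn.functional as F

Wq = torch.tensor([[0.5, 2.0], [3.0, 1.5]])
Wk = torch.tensor([[1.0, 0.4], [0.6, 1.2]])
x = torch.tensor([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]])
words = ["he", "likes", "cats"]  # helper list, same order as the rows of x

Q = x @ Wq
K = x @ Wk
weights = F.softmax(Q @ K.T / math.sqrt(K.size(1)), dim=1)

# Each row is one query word's distribution over all key words, so rows sum to 1.
row_sums = weights.sum(dim=1)

# The "likes" column collects the attention every word pays to "likes".
likes_col = weights[:, words.index("likes")]
top_query = words[int(torch.argmax(likes_col))]

print(row_sums)
print(top_query)
```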

## Task 5: Training Optimization Problems

### Task 5.1:

For each scenario, choose between GD, SGD and mini-batch GD.

a) You are training a deep neural network. You have a lot of training samples but limited GPU memory. Now you want to have some stability during training and enable parallel processing.

mini-batch GD

b) You're training a model with data points coming in in real time, and you want to process them immediately.

SGD
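
In PyTorch, the three variants differ mainly in the batch size handed to the data loader. This is a sketch on a made-up dataset of 100 random samples; the batch size of 16 is an arbitrary choice for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Toy dataset: 100 samples with 3 features and 1 target each (random values)
dataset = TensorDataset(torch.randn(100, 3), torch.randn(100, 1))

# Number of batches, and therefore parameter updates, per epoch for each variant
updates_per_epoch = {
    "GD (full batch)": len(DataLoader(dataset, batch_size=len(dataset))),
    "mini-batch GD": len(DataLoader(dataset, batch_size=16)),
    "SGD": len(DataLoader(dataset, batch_size=1)),
}
print(updates_per_epoch)
```

Full-batch GD needs the whole dataset in memory for a single update, SGD updates after every sample, and mini-batch GD sits in between.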





### Task 5.2:

You're training a neural model. After several epochs you notice that the loss decreases very slowly, so you increase the learning rate. Now the loss fluctuates significantly. What were the issues (before and after changing the learning rate)?

**#TODO**
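
The two regimes described in the question can be reproduced on a toy problem. This is only a sketch under simple assumptions: plain gradient descent on f(w) = w^2, with learning rates picked by hand to show a step size that is too small, one that works, and one that is too large. A real network's loss surface is far less regular.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the loss after each step."""
    w = w0
    losses = []
    for _ in range(steps):
        w = w - lr * 2 * w  # the gradient of w**2 is 2*w
        losses.append(w ** 2)
    return losses

slow = gradient_descent(lr=0.001)     # too small: the loss barely moves
good = gradient_descent(lr=0.1)       # reasonable: the loss shrinks quickly
unstable = gradient_descent(lr=1.05)  # too large: every step overshoots the minimum

print(slow[-1], good[-1], unstable[-1])
```

With the largest learning rate, each update jumps past the minimum to the other side and lands further away than before, so the loss grows instead of shrinking.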