# CS224N: PyTorch Tutorial (Winter '21)

### Author: Dilara Soylu

In this notebook, we give a basic introduction to `PyTorch` and work through a demo network that solves a word window classification task. The following resources were used in preparing this notebook:
* Matt's tutorial
* Official PyTorch tutorial
* GAN Class tutorial [TODO properly cite]

## Introduction
[PyTorch](https://pytorch.org/) is a machine learning framework that is used in both academia and industry for various applications. PyTorch started off as a more flexible alternative to [TensorFlow](https://www.tensorflow.org/), which is another popular machine learning framework. At the time of its release, `PyTorch` appealed to users because of its user-friendly nature: as opposed to defining static graphs before performing an operation as in `TensorFlow`, `PyTorch` allowed users to define their operations as they go, which is also the approach adopted by `TensorFlow` in its later releases. Although `TensorFlow` is more widely preferred in industry, `PyTorch` is often the preferred machine learning framework for researchers.

Now that we have learned enough about the background of `PyTorch`, let's start by importing it into our notebook. To install `PyTorch`, you can follow the instructions here. Alternatively, you can open this notebook using `Google Colab`, which already has `PyTorch` installed in its base kernel. Once you are done with the installation process, run the following cell:

### Constructor

In [None]:
import torch
import torch.nn as nn

In [None]:
# Initialize a tensor from a Python List
data = [
        [0, 1], 
        [2, 3],
        [4, 5]
       ]
x_python = torch.tensor(data)

# Print the tensor
x_python

In [None]:
x_python.dtype

PyTorch is open source

In [None]:
# We are using the dtype to create a tensor of particular type
x_float = torch.tensor(data, dtype=torch.float)
x_float

In [None]:
# We are using the dtype to create a tensor of particular type
x_bool = torch.tensor(data, dtype=torch.bool)
x_bool

In [None]:
x_python.to(torch.long)

In [None]:
torch.Tensor(data).dtype

#### From a NumPy Array

In [None]:
import numpy as np

arr = np.array(data)
x_numpy = torch.tensor(arr)
x_numpy[:, 0] = 30

In [None]:
x_numpy

In [None]:
arr

In [None]:
x_numpy_container = torch.from_numpy(arr)
x_numpy_container[:, 0] = 30

In [None]:
x_numpy_container

In [None]:
arr
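
The cells above differ in one important way: `torch.tensor` copies the NumPy data, while `torch.from_numpy` returns a tensor that shares memory with the array. The following is a minimal sketch of that difference on a small made-up array; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
import torch

arr = np.array([[0, 1], [2, 3]])

copied = torch.tensor(arr)      # copies the data out of the array
shared = torch.from_numpy(arr)  # shares memory with the array

# Writing to the copy leaves the NumPy array untouched
copied[0, 0] = 30
snapshot_after_copy_edit = arr.copy()

# Writing to the shared tensor also changes the NumPy array
shared[0, 0] = 99

print(snapshot_after_copy_edit[0, 0], arr[0, 0])
```

If you need an independent tensor, prefer `torch.tensor`; if you want to avoid the copy and are fine with the array changing, `torch.from_numpy` is the lighter option.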

#### From a tensor

In [None]:
# Initialize a base tensor
x = torch.tensor([[1., 2.], [3., 4.]])
x

In [None]:
# Initialize a tensor of 0s
x_zeros = torch.zeros_like(x)
x_zeros

In [None]:
# Initialize a tensor of 1s
x_ones = torch.ones_like(x)
x_ones

In [None]:
# Initialize a tensor where each element is sampled from a uniform distribution
# between 0 and 1
x_rand = torch.rand_like(x)
x_rand

In [None]:
# Initialize a tensor where each element is sampled from a normal distribution
x_randn = torch.randn_like(x)
x_randn

#### Specifying a Shape

In [None]:
# Initialize a 4x2x2 tensor of 0s
shape = (4, 2, 2)
x_zeros = torch.zeros(shape) # x_zeros = torch.zeros(4, 2, 2) is an alternative
x_zeros

#### With torch.arange

In [None]:
x = torch.arange(10)
x

### Properties

#### data types

In [None]:
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.ones(3, 2, dtype=torch.float64)
x.dtype

In [None]:
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.Tensor([[1, 2], [3, 4], [5, 6]])
x.shape

In [None]:
x_view = x.reshape(-1, 2)
x_view[:, 0] = 90
x_view, x

In [None]:
x_view = x.view(-1, 2)
x_view[:, 0] = 90
x_view, x
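
The two cells above behave the same because `x` is contiguous, so both `view` and `reshape` return a tensor that shares memory with `x`. They differ once the tensor is not contiguous: `view` refuses to run, while `reshape` falls back to making a copy. A small sketch of this, using an illustrative tensor called `base` that is not part of the notebook:

```python
import torch

base = torch.arange(6).reshape(2, 3)

# A contiguous tensor can be flattened with view; no data is copied
flat_view = base.view(-1)
view_shares_memory = flat_view.data_ptr() == base.data_ptr()

# Transposing makes the tensor non-contiguous, so view(-1) would raise here.
# reshape handles this case by copying the data into a new tensor.
transposed = base.t()
flat_copy = transposed.reshape(-1)
copy_shares_memory = flat_copy.data_ptr() == base.data_ptr()

# Writing through the view changes base; writing to the copy does not
flat_view[0] = 90
flat_copy[1] = -1

print(view_shares_memory, copy_shares_memory)
print(base)
```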

In [None]:
# Build a 5x2 tensor, add three size-1 dimensions to get shape (1, 1, 5, 2, 1),
# then squeeze them away again to recover the 5x2 tensor
x = torch.arange(10).reshape(5, 2).unsqueeze(2).unsqueeze(0).unsqueeze(0)
x.squeeze()

In [None]:
x.numel()

#### Device

In [None]:
x.device

#### tensor indexing

In [None]:
# Initialize an example tensor
x = torch.Tensor([
                  [[1, 2], [3, 4]],
                  [[5, 6], [7, 8]], 
                  [[9, 10], [11, 12]] 
                 ])
x

In [None]:
# Access the 0th element along the first dimension, which is the first 2x2 matrix
x[0] # Equivalent to x[0, :]

In [None]:
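# Take column 0 of every other 2x2 matrix and try to flatten it with view.
# Note: this slice is not contiguous, so view raises a RuntimeError here;
# reshape, used in the next cell, handles this case by copying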
x[::2, :, 0].view([-1, ])

In [None]:
x_reshape = x[::2, :, 0].reshape([-1, ])
x_reshape[:] = 200
x

In [None]:
x[[0, 0, 1, 1]]

In [None]:
x[[1, 2], 0]

### Operations

In [None]:
# Create an example tensor
x = torch.ones((3,2,2))
x

In [None]:
# Perform elementwise addition
# Use - for subtraction
x + 2

In [None]:
# Perform elementwise multiplication
# Use / for division
x * 2

In [None]:
# Create a 4x3 tensor of 6s
a = torch.ones((4,3)) * 6
a

In [None]:
# Create a 1D tensor of 2s
b = torch.ones(3) * 2
a / b

In [None]:
# Alternative to a.matmul(b)
# a @ b.T returns the same result since b is 1D tensor and the 2nd dimension
# is inferred
a @ b 
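
The division and the matrix product above both combine a `(4, 3)` tensor with a `(3,)` tensor, but they do different things: `/` broadcasts `b` across the rows of `a`, while `@` contracts over the shared dimension. The sketch below recreates the same two tensors so it can run on its own and makes the resulting shapes explicit.

```python
import torch

a = torch.ones((4, 3)) * 6   # shape (4, 3), every entry is 6
b = torch.ones(3) * 2        # shape (3,), every entry is 2

# b is broadcast across the 4 rows of a, so every entry becomes 6 / 2 = 3
quotient = a / b

# Matrix-vector product: each output entry is 6*2 + 6*2 + 6*2 = 36
product = a @ b

print(quotient.shape, product.shape)
```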

In [None]:
# Create an example tensor
m = torch.tensor(
    [
     [1., 1.],
     [2., 2.],
     [3., 3.],
     [4., 4.]
    ]
)

print("Mean: {}".format(m.mean()))
print("Mean in the 0th dimension: {}".format(m.mean(0)))
print("Mean in the 1st dimension: {}".format(m.mean(1)))

In [None]:
# Concatenate in dimension 0 and 1
a_cat0 = torch.cat([a, a, a], dim=0)
a_cat1 = torch.cat([a, a, a], dim=1)

print("Initial shape: {}".format(a.shape))
print("Shape after concatenation in dimension 0: {}".format(a_cat0.shape))
print("Shape after concatenation in dimension 1: {}".format(a_cat1.shape))

### Autograd

In [None]:
# Create an example tensor
# requires_grad=True tells PyTorch to track operations on x and store its gradient
x = torch.tensor([2.], requires_grad=True)

# Print the gradient if it is calculated
# Currently None since backward() has not been called yet
print(x.grad)

In [None]:
y = x * x * 3
y.backward()
z = x * x * 3
u = y + z * 2
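
To make the effect of `backward()` visible, it helps to inspect `x.grad` after each call. The sketch below is a self-contained variation on the cell above: for `y = 3x^2` the derivative is `6x`, which is 12 at `x = 2`. It also shows that gradients accumulate across `backward()` calls until they are reset; the names `first` and `accumulated` are illustrative.

```python
import torch

x = torch.tensor([2.], requires_grad=True)

# y = 3 * x^2, so dy/dx = 6 * x = 12 at x = 2
y = x * x * 3
y.backward()
first = x.grad.clone()

# A second backward pass adds to the stored gradient instead of replacing it
z = x * x * 3
z.backward()
accumulated = x.grad.clone()

# Reset the gradient in place before the next computation
x.grad.zero_()

print(first, accumulated, x.grad)
```

This accumulation is the reason training loops clear gradients at the start of every iteration.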

In [None]:
isinstance(x, torch.autograd.Variable)

In [None]:
isinstance(a, torch.autograd.Variable)

### Neural Network Module

In [None]:
import torch.nn as nn

#### Linear Layer

In [None]:
# Create the inputs
inputs = torch.ones(2,3,4)

# Make a linear layer transforming (N, *, H_in)-dimensional inputs to
# (N, *, H_out)-dimensional outputs
linear = nn.Linear(4, 2)
linear_output = linear(inputs)
linear_output

In [None]:
linear.weight, linear.bias
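
A linear layer computes `inputs @ weight.T + bias` over the last dimension of its input. The following sketch checks this by hand on a freshly created layer; the seed is set only to make the run repeatable, and the name `manual` is illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

linear = nn.Linear(4, 2)
inputs = torch.ones(2, 3, 4)

# The layer maps the last dimension from 4 to 2 and leaves the rest untouched
layer_output = linear(inputs)

# The same computation written out: weight has shape (2, 4), bias has shape (2,)
manual = inputs @ linear.weight.T + linear.bias

print(layer_output.shape)
print(torch.allclose(layer_output, manual, atol=1e-6))
```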

#### Other Module Layers

In [None]:
nn.Conv2d??

#### Nonlinear Layer

In [None]:
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output

#### Sequential Pipe

In [None]:
block = nn.Sequential(
    nn.Linear(4, 2),
    nn.Sigmoid()
)

inputs = torch.ones(2,3,4)
output = block(inputs)
output

In [None]:
for x, y in block.named_children():
    print(x, y)

In [None]:
block[0]

#### Custom Layer

In [None]:
class MultilayerPerceptron(nn.Module):

  def __init__(self, input_size, hidden_size):
    # Call to the __init__ function of the super class
    super(MultilayerPerceptron, self).__init__()

    # Bookkeeping: Saving the initialization parameters
    self.input_size = input_size 
    self.hidden_size = hidden_size 

    # Defining of our layers
    self.linear = nn.Linear(self.input_size, self.hidden_size)
    self.relu = nn.ReLU()
    self.linear2 = nn.Linear(self.hidden_size, self.input_size)
    self.sigmoid = nn.Sigmoid()
    
  def forward(self, x):
    linear = self.linear(x)
    relu = self.relu(linear)
    linear2 = self.linear2(relu)
    output = self.sigmoid(linear2)
    return output

In [None]:
# Make a sample input
inputs = torch.randn(2, 5)

# Create our model
model = MultilayerPerceptron(5, 3)

# Pass our input through our model
model(inputs)

In [None]:
output1 = inputs @ model.linear.weight.T + model.linear.bias
output2 = model.relu(output1)
output3 = model.linear2(output2)
output4 = model.sigmoid(output3)

In [None]:
for x in model.named_children():
    print(x)

In [None]:
list(model.named_parameters())

### Optimization

In [None]:
import torch.optim as optim

In [None]:
# Create the y data
y = torch.ones(10, 5)

# Add some noise to our goal y to generate our x
# We want our model to predict our original data, despite the noise
x = y + torch.randn_like(y)
x

In [None]:
# Instantiate the model
model = MultilayerPerceptron(5, 3)

# Define the optimizer
adam = optim.Adam(model.parameters(), lr=1e-1)

# Define loss using a predefined loss function
loss_function = nn.BCELoss()

# Calculate how our model is doing now
y_pred = model(x)
loss_function(y_pred, y).item()

In [None]:
# Set the number of epochs, which determines the number of training iterations
n_epoch = 10 

for epoch in range(n_epoch):
  # Set the gradients to 0
  adam.zero_grad()

  # Get the model predictions
  y_pred = model(x)

  # Get the loss
  loss = loss_function(y_pred, y)

  # Print stats
  print(f"Epoch {epoch}: training loss: {loss}")

  # Compute the gradients
  loss.backward()

  # Take a step to optimize the weights
  adam.step()

In [None]:
# See how our model performs on the training data
y_pred = model(x)
y_pred
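
When a model is only being evaluated, as in the cell above, there is no need for PyTorch to record operations for backpropagation. One common pattern, sketched here under the assumption that a small `nn.Sequential` stand-in behaves like our model, is to wrap the forward pass in `torch.no_grad()`.

```python
import torch
import torch.nn as nn

# Stand-in model for illustration; any nn.Module works the same way
model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 5), nn.Sigmoid())
x = torch.randn(10, 5)

model.eval()             # switch layers such as dropout to inference behavior
with torch.no_grad():    # do not record operations for autograd
    y_pred = model(x)

# The predictions carry no gradient history
print(y_pred.requires_grad)
```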

## Demo: Word Window Classification

#### Data

In [None]:
# Our raw data, which consists of sentences
corpus = [
    "We always come to Paris",
    "The professor is from Australia",
    "I live in Stanford",
    "He comes from Taiwan",
    "The capital of Turkey is Ankara",
]

In [None]:
# The preprocessing function we will use to generate our training examples
# Our function is a simple one, we lowercase the letters
# and then tokenize the words.
def preprocess_sentence(sentence):
  return sentence.lower().split()

# Create our training set
train_sentences = [preprocess_sentence(sent) for sent in corpus]
train_sentences

In [None]:
# Set of locations that appear in our corpus
locations = set(["australia", "ankara", "paris", "stanford", "taiwan", "turkey"])

# Our train labels
train_labels = [[1 if word in locations else 0 for word in sent] for sent in train_sentences]
train_labels

In [None]:
# Find all the unique words in our corpus 
vocabulary = set(w for s in train_sentences for w in s)
vocabulary

In [None]:
# Add the unknown token to our vocabulary
vocabulary.add("<unk>")

In [None]:
# Add the <pad> token to our vocabulary
vocabulary.add("<pad>")

# Function that pads the given sentence
# We are introducing this function here as an example
# We will be utilizing it later in the tutorial
def pad_window(sentence, window_size, pad_token="<pad>"):
  window = [pad_token] * window_size
  return window + sentence + window

# Show padding example
window_size = 2
pad_window(train_sentences[0], window_size=window_size)

In [None]:
# We are just converting our vocabulary to a list to be able to index into it
# Sorting is not necessary, we sort to show an ordered word_to_ix dictionary
# That being said, we will see that having the index for the padding token
# be 0 is convenient as some PyTorch functions use it as a default value
# such as nn.utils.rnn.pad_sequence, which we will cover in a bit
ix_to_word = sorted(list(vocabulary))

# Creating a dictionary to find the index of a given word
word_to_ix = {word: ind for ind, word in enumerate(ix_to_word)}
word_to_ix

In [None]:
# Given a sentence of tokens, return the corresponding indices
def convert_token_to_indices(sentence, word_to_ix):
  indices = []
  for token in sentence:
    # Check if the token is in our vocabulary. If it is, get its index.
    # If not, get the index for the unknown token.
    if token in word_to_ix:
      index = word_to_ix[token]
    else:
      index = word_to_ix["<unk>"]
    indices.append(index)
  return indices


# Show an example
example_sentence = ["we", "always", "come", "to", "kuwait"]
example_indices = convert_token_to_indices(example_sentence, word_to_ix)
restored_example = [ix_to_word[ind] for ind in example_indices]

print(f"Original sentence is: {example_sentence}")
print(f"Going from words to indices: {example_indices}")
print(f"Going from indices to words: {restored_example}")
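
The lookup with an unknown-token fallback can also be written more compactly with `dict.get`. This is only an alternative sketch, not the function used in the rest of the notebook; the function name and the toy vocabulary are made up for illustration.

```python
def convert_tokens_to_indices(sentence, word_to_ix, unk_token="<unk>"):
    # Look up the fallback index once, then use it for any missing token
    unk_ix = word_to_ix[unk_token]
    return [word_to_ix.get(token, unk_ix) for token in sentence]


# Toy vocabulary, not the one built from our corpus
toy_word_to_ix = {"<pad>": 0, "<unk>": 1, "come": 2, "we": 3}
toy_indices = convert_tokens_to_indices(["we", "come", "kuwait"], toy_word_to_ix)
print(toy_indices)
```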

In [None]:
# Converting our sentences to indices
example_padded_indices = [convert_token_to_indices(s, word_to_ix) for s in train_sentences]
example_padded_indices

In [None]:
# Creating an embedding table for our words
embedding_dim = 5
embeds = nn.Embedding(len(vocabulary), embedding_dim)

# Printing the parameters in our embedding table
list(embeds.parameters())

In [None]:
# Get the embedding for the word Paris
index = word_to_ix["paris"]
index_tensor = torch.tensor(index, dtype=torch.long)
paris_embed = embeds(index_tensor)
paris_embed

In [None]:
# We can also get multiple embeddings at once
index_paris = word_to_ix["paris"]
index_ankara = word_to_ix["ankara"]
indices = [index_paris, index_ankara]
indices_tensor = torch.tensor(indices, dtype=torch.long)
embeddings = embeds(indices_tensor)
embeddings
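
An embedding layer accepts index tensors of any shape and appends the embedding dimension to that shape. If `padding_idx` is set, the row for that index is initialized to zeros and receives no gradient updates. The sketch below uses a toy table with made-up sizes to show both points.

```python
import torch
import torch.nn as nn

# Toy table with 6 entries; index 0 is reserved for padding
toy_embeds = nn.Embedding(num_embeddings=6, embedding_dim=5, padding_idx=0)

# A batch of 2 sequences of length 3, padded with index 0
batch_indices = torch.tensor([[2, 3, 0], [4, 0, 0]], dtype=torch.long)
batch_embeddings = toy_embeds(batch_indices)

# The output shape is the index shape plus the embedding dimension
print(batch_embeddings.shape)

# Every padded position maps to the all-zero vector
pad_rows = batch_embeddings[batch_indices == 0]
print(pad_rows)
```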

#### Batching Sentences

In [None]:
x, y = zip(*zip(train_sentences, train_labels))
# Pad the train examples.
x = [pad_window(s, window_size=window_size) for s in x]
# Convert the train examples into indices.
x = [convert_token_to_indices(s, word_to_ix) for s in x]

In [None]:
# We will now pad the examples so that the lengths of all the examples in 
# one batch are the same, making it possible to do matrix operations. 
# We set the batch_first parameter to True so that the returned matrix has 
# the batch as the first dimension.
pad_token_ix = word_to_ix["<pad>"]
# pad_sequence function expects the input to be a tensor, so we turn x into one
x = [torch.LongTensor(x_i) for x_i in x]
x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)

In [None]:
# We will also pad the labels. Before doing so, we will record the number 
# of labels so that we know how many words existed in each example. 
lengths = [len(label) for label in y]
lengths = torch.LongTensor(lengths)

In [None]:
y = [torch.LongTensor(y_i) for y_i in y]
y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

In [None]:
y_padded
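
To see what `pad_sequence` does in isolation, here is a minimal sketch on two toy label sequences of different lengths. The shorter sequence is filled with `padding_value` on the right until it matches the longest one.

```python
import torch
import torch.nn as nn

# Two toy label sequences of lengths 3 and 1
toy_labels = [torch.LongTensor([0, 0, 1]), torch.LongTensor([1])]

# batch_first=True gives a (batch, max_length) tensor
toy_padded = nn.utils.rnn.pad_sequence(toy_labels, batch_first=True, padding_value=0)
print(toy_padded)
```

Note that a padded label of 0 is indistinguishable from a real label of 0, which is why the lengths are recorded separately.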

In [None]:
from torch.utils.data import DataLoader
from functools import partial

def custom_collate_fn(batch, window_size, word_to_ix):
    x, y = zip(*batch)
    # Pad the train examples.
    x = [pad_window(s, window_size=window_size) for s in x]
    # Convert the train examples into indices.
    x = [convert_token_to_indices(s, word_to_ix) for s in x]
    # We will now pad the examples so that the lengths of all the examples in 
    # one batch are the same, making it possible to do matrix operations. 
    # We set the batch_first parameter to True so that the returned matrix has 
    # the batch as the first dimension.
    pad_token_ix = word_to_ix["<pad>"]
    # pad_sequence function expects the input to be a tensor, so we turn x into one
    x = [torch.LongTensor(x_i) for x_i in x]
    x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)
    # We will also pad the labels. Before doing so, we will record the number 
    # of labels so that we know how many words existed in each example. 
    lengths = [len(label) for label in y]
    lengths = torch.LongTensor(lengths)
    
    y = [torch.LongTensor(y_i) for y_i in y]
    y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)
    # We are now ready to return our variables. The order we return our variables
    # here will match the order we read them in our training loop.
    return x_padded, y_padded, lengths

In [None]:
# Parameters to be passed to the DataLoader
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

# Instantiate the DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

# Go through one loop
counter = 0
for batched_x, batched_y, batched_lengths in loader:
    print(f"Iteration {counter}")
    print("Batched Input:")
    print(batched_x)
    print("Batched Labels:")
    print(batched_y)
    print("Batched Lengths:")
    print(batched_lengths)
    print("")
    counter += 1

In [None]:
# Print the original tensor
print(f"Original Tensor: ")
print(batched_x)
print("")

# Create the 2 * 2 + 1 chunks
chunk = batched_x.unfold(1, window_size*2 + 1, 1)
print(f"Windows: ")
print(chunk)
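
The `unfold` call is easier to follow on a tiny input. In this sketch, a single toy "sentence" of 7 indices is split into sliding windows of size `2 * window_size + 1 = 5` with a step of 1, which yields `7 - 5 + 1 = 3` windows.

```python
import torch

window_size = 2

# One toy sentence with indices 0..6, shape (1, 7)
toy_sentence = torch.arange(7).unsqueeze(0)

# unfold(dimension, size, step): slide a window of 5 along dimension 1
toy_windows = toy_sentence.unfold(1, 2 * window_size + 1, 1)

print(toy_windows.shape)
print(toy_windows)
```

Because each sentence was padded with `window_size` tokens on both sides, the number of windows equals the number of original words.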

#### Model

In [None]:
class WordWindowClassifier(nn.Module):

    def __init__(self, hyperparameters, vocab_size, pad_ix=0):
        super(WordWindowClassifier, self).__init__()

        """ Instance variables """
        self.window_size = hyperparameters["window_size"]
        self.embed_dim = hyperparameters["embed_dim"]
        self.hidden_dim = hyperparameters["hidden_dim"]
        self.freeze_embeddings = hyperparameters["freeze_embeddings"]

        """ Embedding Layer 
        Takes in a tensor containing embedding indices, and returns the 
        corresponding embeddings. The output is of dim 
        (number_of_indices * embedding_dim).

        If freeze_embeddings is True, set the embedding layer parameters to be
        non-trainable. This is useful if we only want the parameters other than the
        embeddings parameters to change. 

        """
        self.embeds = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_ix)
        if self.freeze_embeddings:
            self.embeds.weight.requires_grad = False

        """ Hidden Layer
        """
        full_window_size = 2 * self.window_size + 1
        self.hidden_layer = nn.Sequential(
            nn.Linear(full_window_size * self.embed_dim, self.hidden_dim), 
            nn.Tanh()
        )

        """ Output Layer
        """
        self.output_layer = nn.Linear(self.hidden_dim, 1)

        """ Probabilities 
        """
        self.probabilities = nn.Sigmoid()

    def forward(self, inputs):
        """
        Let B:= batch_size
            L:= window-padded sentence length
            D:= self.embed_dim
            S:= full window size, 2 * self.window_size + 1
            H:= self.hidden_dim

        inputs: a (B, L) tensor of token indices
        """
        B, L = inputs.size()

        """
        Reshaping.
        Takes in a (B, L) LongTensor
        Outputs a (B, L~, S) LongTensor
        """
        # First, get our word windows for each word in our input.
        token_windows = inputs.unfold(1, 2 * self.window_size + 1, 1)
        _, adjusted_length, _ = token_windows.size()

        # Good idea to do internal tensor-size sanity checks, at the least in comments!
        assert token_windows.size() == (B, adjusted_length, 2 * self.window_size + 1)

        """
        Embedding.
        Takes in a torch.LongTensor of size (B, L~, S) 
        Outputs a (B, L~, S, D) FloatTensor.
        """
        embedded_windows = self.embeds(token_windows)

        """
        Reshaping.
        Takes in a (B, L~, S, D) FloatTensor.
        Resizes it into a (B, L~, S*D) FloatTensor.
        -1 argument "infers" what the last dimension should be based on leftover axes.
        """
        embedded_windows = embedded_windows.view(B, adjusted_length, -1)

        """
        Layer 1.
        Takes in a (B, L~, S*D) FloatTensor.
        Resizes it into a (B, L~, H) FloatTensor
        """
        layer_1 = self.hidden_layer(embedded_windows)

        """
        Layer 2
        Takes in a (B, L~, H) FloatTensor.
        Resizes it into a (B, L~, 1) FloatTensor.
        """
        output = self.output_layer(layer_1)

        """
        Sigmoid.
        Takes in a (B, L~, 1) FloatTensor of unnormalized scores.
        Outputs a (B, L~, 1) FloatTensor of probabilities in (0, 1),
        which is then reshaped into a (B, L~) FloatTensor.
        """
        output = self.probabilities(output)
        output = output.view(B, -1)

        return output

#### Training

In [None]:
# Initialize a model
# It is useful to put all the model hyperparameters in a dictionary
model_hyperparameters = {
    "batch_size": 4,
    "window_size": 2,
    "embed_dim": 10,
    "hidden_dim": 25,
    "freeze_embeddings": False,
}

vocab_size = len(word_to_ix)
model = WordWindowClassifier(model_hyperparameters, vocab_size)


In [None]:
# Define a loss function, which computes the binary cross entropy loss
def loss_function(batch_outputs, batch_labels, batch_lengths):   
    # Calculate the loss for the whole batch
    bceloss = nn.BCELoss()
    loss = bceloss(batch_outputs, batch_labels.float())

    # Rescale the loss. Remember that we have used lengths to store the 
    # number of words in each training example
    loss = loss / batch_lengths.sum().float()

    return loss
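
The loss above averages over every position in the padded batch, including padded ones. One possible alternative, shown here only as a sketch and not as the method used in this notebook, is to compute the per-token loss and average it over real tokens only, using the lengths to build a mask. The function name `masked_bce_loss` and the toy tensors are assumptions made for illustration.

```python
import torch
import torch.nn as nn


def masked_bce_loss(batch_outputs, batch_labels, batch_lengths):
    # Per-token loss, shape (B, L~), with no averaging applied yet
    per_token = nn.BCELoss(reduction="none")(batch_outputs, batch_labels.float())

    # mask[i, j] is 1 if position j is a real word of example i, else 0
    max_len = batch_outputs.size(1)
    positions = torch.arange(max_len, device=batch_outputs.device).unsqueeze(0)
    mask = (positions < batch_lengths.unsqueeze(1)).float()

    # Average over real tokens only
    return (per_token * mask).sum() / mask.sum()


# Toy batch: the second example has 2 real words and 2 padded positions
toy_outputs = torch.full((2, 4), 0.5)
toy_labels = torch.tensor([[0, 1, 0, 1], [1, 0, 0, 0]])
toy_lengths = torch.LongTensor([4, 2])

toy_loss = masked_bce_loss(toy_outputs, toy_labels, toy_lengths)
print(toy_loss.item())
```

With every prediction at 0.5, each real token contributes `-log(0.5)`, so the masked average is about 0.693 regardless of how much padding the batch contains.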

In [None]:
# Function that will be called in every epoch
def train_epoch(loss_function, optimizer, model, loader):

    # Keep track of the total loss for the epoch
    total_loss = 0
    for batch_inputs, batch_labels, batch_lengths in loader:
        # Clear the gradients
        optimizer.zero_grad()
        # Run a forward pass
        outputs = model.forward(batch_inputs)
        # Compute the batch loss
        loss = loss_function(outputs, batch_labels, batch_lengths)
        # Calculate the gradients
        loss.backward()
        # Update the parameters
        optimizer.step()
        total_loss += loss.item()

    return total_loss


# Function containing our main training loop
def train(loss_function, optimizer, model, loader, num_epochs=10000):

    # Iterate through each epoch and call our train_epoch function
    for epoch in range(num_epochs):
        epoch_loss = train_epoch(loss_function, optimizer, model, loader)
        if epoch % 100 == 0: print(epoch_loss)

In [None]:
num_epochs = 1000
# Define an optimizer
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
train(loss_function, optimizer, model, loader, num_epochs=num_epochs)

In [None]:
inputs = torch.tensor([
    [ 0,  0, 22,  2,  6, 20, 15,  0,  0,  0],
    [ 0,  0, 19,  5, 14, 21, 12,  3,  0,  0]
])

In [None]:
B, L = inputs.size()
token_windows = inputs.unfold(1, 2 * model.window_size + 1, 1)
assert token_windows.size() == (B, L - 2 * model.window_size, 2 * model.window_size + 1)

In [None]:
embedded_windows = model.embeds(token_windows)
embedded_windows = embedded_windows.view(B, L - 2 * model.window_size, -1)

In [None]:
layer_1 = model.hidden_layer(embedded_windows)
output = model.output_layer(layer_1)
output = model.probabilities(output)
output = output.view(B, -1)

In [None]:
output.shape

#### Test

In [None]:
# Create test sentences
test_corpus = ["She comes from Paris"]
test_sentences = [s.lower().split() for s in test_corpus]
test_labels = [[0, 0, 0, 1]]

# Create a test loader
test_data = list(zip(test_sentences, test_labels))
batch_size = 1
shuffle = False
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=2, word_to_ix=word_to_ix)
test_loader = torch.utils.data.DataLoader(test_data, 
                                           batch_size=1, 
                                           shuffle=False, 
                                           collate_fn=collate_fn)

In [None]:
for test_instance, labels, _ in test_loader:
    outputs = model.forward(test_instance)
    print(labels)
    print(outputs)
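
The model outputs one probability per word, so a final step is usually needed to turn these into 0/1 predictions. A simple sketch is to threshold at 0.5 and compare against the labels. The probabilities below are made-up values for illustration, not outputs of the trained model.

```python
import torch

# Hypothetical probabilities for a 4-word sentence, not real model outputs
toy_outputs = torch.tensor([[0.10, 0.35, 0.20, 0.90]])
toy_labels = torch.tensor([[0, 0, 0, 1]])

# Predict "location" wherever the probability exceeds the threshold
predictions = (toy_outputs > 0.5).long()

# Fraction of words whose prediction matches the label
accuracy = (predictions == toy_labels).float().mean().item()

print(predictions)
print(accuracy)
```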