# Assignment 2. Music Century Classification

For this task, we will construct models to predict the century in which a music piece was released. We will utilize the "YearPredictionMSD Data Set," which is derived from the Million Song Dataset from the UCI Machine Learning Repository. Make sure you download the version of the dataset from the moodle and not from UCI. Here are some relevant links to read on this dataset:

- https://archive.ics.uci.edu/ml/datasets/yearpredictionmsd
- http://millionsongdataset.com/pages/tasks-demos/#yearrecognition

Just like the last assignment, this one is divided into two files:
1. This file (ML_DL_Assignment2.ipynb)
2. A Python functions file which you will fill out (ML_DL_Functions2.py)

You will also need the YearPredictionMSD dataset file.

In this assignment you will mount and load the dataset and functions file from Google Drive. To start, make sure you have both the template Python functions file and the song dataset file (downloaded from the moodle) in the same directory in your Google Drive.

When you are finished with the assignment make sure you submit the following files:
1. This file (ML_DL_Assignment2.ipynb).
2. The functions file (ML_DL_Functions2.py).
3. The weights file from section 2.7 (assignment2_submission_optimal_weights.npy).
4. The bias file from section 2.7 (assignment2_submission_optimal_bias.npy).

**Make sure you fill in your personal ID in the functions file as the return value of function ID1. ID2 is filled in only if you were allowed to complete the assignment in pairs; otherwise keep its return value 0.**

Note that until section 2.9 you are not allowed to import additional packages **(especially not PyTorch)**. One of the objectives is to understand how the training procedure actually operates, before working with PyTorch's autograd engine, which does it all for us. Importing the PyTorch package before section 2.9 will cost you points.

## 1. Data

In [None]:
import pandas
import numpy as np
import matplotlib.pyplot as plt
import sys
def reload_functions():
  if 'ML_DL_Functions2' in sys.modules:
    del sys.modules['ML_DL_Functions2']
  functions_path = drive_path.replace(" ", "\\ ") + 'ML_DL_Functions2.py'  # escape spaces for the shell command below
  !cp $functions_path .

Just like in the last assignment, you should mount your Google Drive and make sure you have both the dataset from the moodle ('YearPredictionMSD.csv') and the functions file ('ML_DL_Functions2.py') in the same directory, which you will enter below:

In [None]:

from google.colab import drive
drive.mount('/content/gdrive')
drive_path = '/content/gdrive/My Drive/Intro_to_Deep_Learning/Assignment2/' # TODO - UPDATE ME WITH THE TRUE PATH!
csv_path = drive_path + 'YearPredictionMSD.csv'
t_label = ["year"]
x_labels = ["var%d" % i for i in range(1, 91)]
df = pandas.read_csv(csv_path)

Now that the data is loaded to your Colab notebook, you should be able to display the Pandas
DataFrame `df` as a table:

In [None]:
df

To set up our data for classification, we'll use the "year" field to represent
whether a song was released in the 21st century. In our case `df["year"]` will be 1 if
the song was released after 2000, and 0 otherwise.

In [None]:
df["year"] = df["year"].map(lambda x: int(x > 2000))

In [None]:
df.head(20)

### 1.1 - Train Test Split

The data set description text asks us to respect the below train/test split to
avoid the "producer effect". That is, we want to make sure that no song from a single artist
ends up in both the training and test set.

#### Food for thought:
Why would it be problematic to have some songs from an artist in the training set, and other songs from the same artist in the test set? (Hint: Remember that we want our test accuracy to predict how well the model will perform in practice on a song it hasn't learned about.)

In [None]:
# train test split
df_train = df[:463715]
df_test = df[463715:]

# convert to numpy
train_xs = df_train[x_labels].to_numpy()
train_ts = df_train[t_label].to_numpy()
test_xs = df_test[x_labels].to_numpy()
test_ts = df_test[t_label].to_numpy()

### Part (b) -- 7%
Normalize the data by subtracting the mean and dividing by the std just like the last assignment.
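
As a reminder of the mechanics, here is a minimal sketch on toy data. It assumes, as is common practice, that the mean and standard deviation are computed per feature on the training set only and then reused for the test set. The arrays `toy_train` and `toy_test` are hypothetical stand-ins for `train_xs` and `test_xs`.

```python
import numpy as np

# Toy data standing in for train_xs / test_xs (hypothetical values).
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

# Per-feature statistics (axis=0), computed on the training set only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)

# The same training statistics are applied to both sets.
toy_train_norm = (toy_train - mean) / std
toy_test_norm = (toy_test - mean) / std

print(toy_train_norm.mean(axis=0))  # close to 0 for each feature
print(toy_train_norm.std(axis=0))   # close to 1 for each feature
print(toy_test_norm)                # not necessarily zero-mean or unit-std
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.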

In [None]:
# Insert your code here:
train_norm_xs = ...
test_norm_xs = ...

### Part (c) -- 7%

Finally, we'll move some of the data in our training set into a validation set.
#### Food for thought:
Why should we limit how many times we use the test set, and how do we use the validation set during the model building process?

In [None]:
# shuffle the training set
reindex = np.random.permutation(len(train_xs))
train_xs = train_xs[reindex]
train_norm_xs = train_norm_xs[reindex]
train_ts = train_ts[reindex]

# use the first 50000 elements of `train_xs` as the validation set
train_xs, val_xs           = train_xs[50000:], train_xs[:50000]
train_norm_xs, val_norm_xs = train_norm_xs[50000:], train_norm_xs[:50000]
train_ts, val_ts           = train_ts[50000:], train_ts[:50000]

## Part 2. Classification (79%)

We will first build a *classification* model to perform century classification. We have written a few helper functions for you. You can find them in your functions file ('sigmoid', 'cross_entropy' and 'get_accuracy'). All other code that you write in this section should be vectorized whenever possible (i.e., avoid unnecessary loops). Feel free to add more tests to the notebook to validate your code in the functions file.

### 2.1 Prediction

Fill in the function `pred` in the functions file that computes the prediction `y` based on logistic regression, i.e., a single layer with weights `w` and bias `b`. The output is given by:
\begin{equation}
y = \sigma({\bf w}^T {\bf x} + b),
\end{equation}
where the value of $y$ is an estimate of the probability that the song is released in the current century, namely ${\rm year} =1$.
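
To make the shapes concrete, here is a small vectorized sketch of the formula above. The names `toy_sigmoid` and `toy_pred` are hypothetical and are not the functions file's implementation. The argument order `(w, b, X)` is an assumption inferred from the test call in the next cell.

```python
import numpy as np

def toy_sigmoid(z):
    # Logistic function; the functions file ships its own `sigmoid` helper.
    return 1.0 / (1.0 + np.exp(-z))

def toy_pred(w, b, X):
    # X has shape (N, 90), w has shape (90,), b is a scalar.
    # X @ w evaluates w^T x for every row at once, giving shape (N,).
    return toy_sigmoid(X @ w + b)

y = toy_pred(np.zeros(90), 1.0, np.ones((2, 90)))
print(y.shape)  # one probability per input row
print(y)        # with w = 0 every entry equals sigmoid(b)
```

With all-zero weights the prediction depends only on the bias, which makes this input a convenient sanity check.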

In [None]:
reload_functions()
import ML_DL_Functions2
ML_DL_Functions2.pred(np.zeros(90), 1, np.ones([2, 90]))

### 2.2 Cost
Assuming the loss function is the cross entropy, fill in the `cost` (risk) function in the functions file, which returns the mean of the loss over all inputs.
$$\mathcal{L}_\mathcal{P}(\text{Cross Entropy}) = \mathbb{E}_{(y,t)\sim\mathcal{P}}\left\{\text{CE}(t,y)\right\}$$
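
As a worked check, the sketch below evaluates the mean binary cross entropy for a constant prediction of 0.5 against labels that are all 1. The helpers `toy_cross_entropy` and `toy_cost` are hypothetical. The argument order `(y, t)` for the cost is an assumption based on the test call in the next cell, and the provided `cross_entropy` helper may differ in signature or numerical safeguards.

```python
import numpy as np

def toy_cross_entropy(t, y):
    # Binary cross entropy per example: -[t*log(y) + (1-t)*log(1-y)].
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

def toy_cost(y, t):
    # The empirical risk is the mean of the per-example losses.
    return float(np.mean(toy_cross_entropy(t, y)))

value = toy_cost(0.5 * np.ones(4), np.ones(4))
print(value)  # equals -ln(0.5) = ln(2)
```

A prediction of 0.5 carries no information about the label, so ln(2) is a useful baseline: a trained model should reach a lower cost.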


In [None]:
reload_functions()
import ML_DL_Functions2
print(ML_DL_Functions2.cost(0.5*np.ones(4), np.ones(4)))

### 2.3 Derivative of the cost -- 7%
Take a pen and paper and calculate the analytical derivative of the cost function with respect to the weights and bias. Use the resulting formula to fill in the function `derivative_cost`, which computes and returns the gradients
$\frac{\partial\mathcal{L}}{\partial {\bf w}}$ and
$\frac{\partial\mathcal{L}}{\partial b}$. Here, `X` is the input, `y` is the prediction, and `t` is the true label.


In [None]:
reload_functions()
import ML_DL_Functions2
dldw, dldb = ML_DL_Functions2.derivative_cost(np.ones([10,90]), np.ones(10), np.ones(10))
print(dldw.shape)
print(type(dldb))

### 2.4 Derivative approximation

We can check that our derivative is implemented correctly using the finite difference rule. In 1D, the
finite difference rule tells us that for small $h$, we should have

$$\frac{f(x+h) - f(x)}{h} \approx f'(x)$$

Make sure that $\frac{\partial\mathcal{L}}{\partial b}$ is implemented correctly
by comparing the result from `derivative_cost` with the empirical cost derivative computed using the above numerical approximation.
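
The rule can be tried first on a function whose derivative is known. The sketch below uses the scalar sigmoid, whose derivative is $\sigma(x)(1-\sigma(x))$; the names `f`, `f_prime` and `finite_difference` are hypothetical and unrelated to the functions file.

```python
import math

def f(x):
    # Scalar sigmoid, used here only as a function with a known derivative.
    return 1.0 / (1.0 + math.exp(-x))

def f_prime(x):
    # Analytical derivative of the sigmoid.
    s = f(x)
    return s * (1.0 - s)

def finite_difference(func, x, h=1e-6):
    # Forward finite difference approximation of func'(x).
    return (func(x + h) - func(x)) / h

x0 = 0.3
analytical = f_prime(x0)
numerical = finite_difference(f, x0)
print("analytical:", analytical)
print("numerical: ", numerical)
```

The two values should agree to several decimal places. The same comparison applies to the cost: perturb `b` (or one entry of `w`) by `h`, recompute the cost, and compare with the gradient returned by `derivative_cost`.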


In [None]:
reload_functions()
import ML_DL_Functions2
# Your code goes here

'''
r1 = ...
r2 = ...
print("The analytical results is -", r1)
print("The algorithm results is - ", r2)
'''

Make sure that $\frac{\partial\mathcal{L}}{\partial {\bf w}}$ is implemented correctly.

In [None]:
reload_functions()
import ML_DL_Functions2
# Your code goes here. You might find the code below helpful, but it's
# up to you to figure out how/why, and how to modify it

'''
r1 = ...
r2 = ...
print("The analytical results is -", r1)
print("The algorithm results is - ", r2)
'''


### 2.5 Gradient descent

Now that you have a gradient function that works, we can actually run gradient descent.
Complete the following code that will run stochastic gradient descent training:

In [None]:
def run_gradient_descent(w0, b0, mu=0.1, batch_size=100, max_iters=100):
  """Return the values of (w, b) after running gradient descent for max_iters.
  We use:
    - train_norm_xs and train_ts as the training set
    - val_norm_xs and val_ts as the validation set
    - mu as the learning rate
    - (w0, b0) as the initial values of (w, b)

  Precondition: np.shape(w0) == (90,)
                type(b0) == float

  Postcondition: np.shape(w) == (90,)
                 type(b) == float
  """
  w = w0
  b = b0
  iter = 0
  max_acc = 0
  opt_w = w
  opt_b = b
  cost_list = []
  acc_list  = []
  while iter < max_iters:
    # shuffle the training set (there is code above for how to do this)
    # <===

    for i in range(0, len(train_norm_xs), batch_size): # iterate over each minibatch
      # minibatch that we are working with:
      X = train_norm_xs[i:(i + batch_size)]
      t = train_ts[i:(i + batch_size), 0]

      # since batch_size does not divide len(train_norm_xs) evenly, we will skip over
      # the "last" (incomplete) minibatch
      if np.shape(X)[0] != batch_size:
        continue

      # compute the prediction
      # <===
      # calculate the gradient (backpropagate)
      # <===
      # update w and b (step)
      # <===
      # increment the iteration count
      iter += 1
      # compute and print the *validation* loss and accuracy
      if (iter % 40 == 0):
        # <===
        val_cost = ...
        val_acc = ...
        cost_list.append(val_cost)
        acc_list.append(val_acc)
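        # save the best weights and bias seen so far
        if val_acc > max_acc:
          max_acc = val_acc   # remember the best accuracy, otherwise every later check overwrites opt_w
          opt_w = np.copy(w)  # copy so later in-place updates to w do not alter the saved weights
          opt_b = b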

        print("Iter %d. [Val Acc %.0f%%, Loss %f]" % (
              iter, val_acc * 100, val_cost))

      if iter >= max_iters:
        break


  return opt_w, opt_b, cost_list, acc_list
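
The shuffling and minibatch pattern used in the loop above can be sketched on toy data. The arrays `toy_xs` and `toy_ts` are hypothetical, and a seeded `np.random.default_rng` is used here only to make the example reproducible (the notebook itself uses `np.random.permutation`). The key point is that inputs and labels are indexed with the same permutation so that they stay aligned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Row k of toy_xs starts with 3*k and its label is k, so alignment is easy to verify.
toy_xs = np.arange(10 * 3, dtype=float).reshape(10, 3)
toy_ts = np.arange(10).reshape(10, 1)
batch_size = 4

# Shuffle inputs and labels with the SAME permutation.
reindex = rng.permutation(len(toy_xs))
toy_xs, toy_ts = toy_xs[reindex], toy_ts[reindex]

batches = []
for i in range(0, len(toy_xs), batch_size):
    X = toy_xs[i:i + batch_size]
    t = toy_ts[i:i + batch_size, 0]
    # Skip the final, incomplete minibatch.
    if X.shape[0] != batch_size:
        continue
    batches.append((X, t))

print(len(batches))  # number of full minibatches
for X, t in batches:
    print(X[:, 0] / 3, t)  # first column divided by 3 matches the labels
```

With 10 rows and a batch size of 4, only two full minibatches are produced and the remaining two rows are skipped for that pass.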

### 2.6 Running everything!

Call `run_gradient_descent` with the weights and bias all initialized to zero. Test yourself with different $\mu$ values and show that if $\mu$ is too small then convergence is slow, and if $\mu$ is too large then the optimization algorithm does not converge. You can add more automation and plotting functions to help you find the best configuration.
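
The effect of the learning rate can be seen on a much simpler problem. The sketch below runs plain gradient descent on the one-dimensional function $f(w) = w^2$, not on the assignment's cost. The function `descend` and the chosen values of `mu` are hypothetical and serve only as an illustration.

```python
def descend(mu, w0=1.0, steps=20):
    # Minimize f(w) = w**2, whose gradient is 2*w and whose minimum is at w = 0.
    w = w0
    for _ in range(steps):
        w = w - mu * 2.0 * w  # each step multiplies w by (1 - 2*mu)
    return w

for mu in (0.01, 0.4, 1.1):
    # Distance from the minimum after a fixed number of steps.
    print("mu =", mu, "|w| =", abs(descend(mu)))
```

With a very small `mu` the iterate barely moves in 20 steps, with a moderate `mu` it gets very close to the minimum, and with a `mu` that is too large the factor `(1 - 2*mu)` exceeds 1 in magnitude and the iterate moves away from the minimum.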

In [None]:
reload_functions()
import ML_DL_Functions2
w0 = np.zeros(90)
b0 = np.zeros(1)[0]

# choose values
mu = ...
max_iters = ...
batch_size = ...


# Write your code here
# opt_w,opt_b,cost_list,acc_list = run_gradient_descent(w0,b0,mu,batch_size,max_iters)
# validation accuracy is recorded every 40 iterations, starting at iteration 40
# plt.plot(range(40,max_iters+1,40),acc_list,"r-")
# plt.title(r"classification accuracy per iteration; $\mu$="+str(mu)+" batch size="+str(batch_size))


### 2.7 Finding and saving optimal values

Find the optimal values of ${\bf w}$ and $b$ in terms of accuracy using your code. Notice that the choice of $\mu$ and the batch size are important for this. Run the code below to save these parameters to a file in your Google Drive directory. Submit to the moodle (alongside this file and the functions file) both the
"assignment2_submission_optimal_weights.npy" file and the "assignment2_submission_optimal_bias.npy" file.

In [None]:
#change these to the optimal weights and biases, leave the name the same
np.save(drive_path+"assignment2_submission_optimal_weights.npy",opt_w_)
np.save(drive_path+"assignment2_submission_optimal_bias.npy",opt_b_)

### 2.8 Results

Using the values of `w` and `b` from part 2.7, compute your training accuracy, validation accuracy,
and test accuracy. Are there any differences between those three values? If so, why?

In [None]:
w = np.load(drive_path+"assignment2_submission_optimal_weights.npy")
b = np.load(drive_path+"assignment2_submission_optimal_bias.npy")

# Write your code here

train_acc = ...
val_acc = ...
test_acc = ...

print('train_acc = ', train_acc, ' val_acc = ', val_acc, ' test_acc = ', test_acc)

### 2.9 Using PyTorch
Writing a classifier like this is instructive, and helps you understand what happens when
we train a model. However, in practice, we rarely write model building and training code
from scratch. Instead, we typically use one of the well-tested libraries available in a package. The following example shows you how this task could have been achieved using the deep learning library PyTorch. The library greatly simplifies the steps needed to create a learning model. Though there is nothing you need to complete in this section, we suggest you read it thoroughly and make sure you understand all the code. In the next assignment you will need to build a deep learning model yourself.

The first step required to use the PyTorch module is to create a class which will be our model. In this case we will use a linear layer with a custom size (in your assignment you used a (90, 1) linear layer, meaning an input size of 90 and an output size of 1). We also add a sigmoid function to restrict the values to between 0 and 1.

The forward function is called every time you call the model by name. It is equivalent to the prediction function you wrote, but it serves another purpose: it records all the operations applied to the tensor, which can then be used to calculate the gradients.

In [None]:
import torch
class single_layer(torch.nn.Module):
  def __init__(self,input_size,output_size):
    super(single_layer,self).__init__()
    self.neuron = torch.nn.Linear(input_size,output_size)
    self.sigmoid = torch.nn.Sigmoid()

  def forward(self,X):
    out = self.neuron(X)
    out = self.sigmoid(out)
    return out


We can now create a new model.

We don't have to write the binary cross entropy loss since it is already written for us (`criterion`).

Also, instead of writing the optimization process ourselves, which in our case was gradient descent ($w_{n+1} = w_n - \mu \, \partial\mathcal{L}/\partial w$), we can use a pre-built optimizer (SGD).

In [None]:
model = single_layer(90,1)
criterion = torch.nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(),lr = 0.05)

There are a few pre-built training functions, but usually the training function is written by hand. This function is similar to the one you wrote in this assignment, only now we can use the pre-built tensor functions. Make sure you understand all the differences between the two:

In [None]:
import pdb
def train_model(model, criterion, optimizer, batch_size=100, max_iters=100):
  iter = 0
  cost_list = []
  acc_list  = []
  train_norm_xs_shuff = train_norm_xs
  train_ts_shuff = train_ts
  val_X_tensor = torch.tensor(val_norm_xs,dtype=torch.float32)
  while iter < max_iters:
    # shuffle the training set (there is code above for how to do this)
    reindex = np.random.permutation(len(train_norm_xs))
    train_norm_xs_shuff = train_norm_xs_shuff[reindex]
    train_ts_shuff = train_ts_shuff[reindex]

    for i in range(0, len(train_norm_xs), batch_size): # iterate over each minibatch
      # minibatch that we are working with:
      X = train_norm_xs_shuff[i:(i + batch_size)]
      t = train_ts_shuff[i:(i + batch_size), 0]

      # since batch_size does not divide len(train_norm_xs) evenly, we will skip over
      # the "last" (incomplete) minibatch
      if np.shape(X)[0] != batch_size:
        continue
      # convert the numpy arrays into torch tensors
      X_tensor = torch.tensor(X,dtype=torch.float32)
      t_tensor = torch.tensor(t,dtype=torch.float32).unsqueeze(1) # the unsqueeze reshapes (N,) to (N,1)

      # a clean up step for PyTorch
      optimizer.zero_grad()
      # compute the prediction
      prediction = model(X_tensor)
      # compute the cost/loss
      loss = criterion(prediction,t_tensor)
      # calculate the gradient (backpropagate)
      loss.backward()
      # update w and b (step)
      optimizer.step()
      # increment the iteration count
      iter += 1
      # compute and print the *validation* accuracy
      if (iter % 40 == 0):
        val_pred = model(val_X_tensor)
        val_acc = ML_DL_Functions2.get_accuracy(val_pred,val_ts)
        acc_list.append(val_acc)

        print("Iter %d. [Val Acc %.1f%%]" % (
                iter, val_acc * 100))

      if iter >= max_iters:
        break


  return acc_list


We can now run the training process. You should get results similar to those from the model you wrote. Make sure that the results are in the same range, and if not, fix your model and try again.

In [None]:
reload_functions()
import ML_DL_Functions2
acc_list = train_model(model,criterion,optimizer,100,500)

You can also try to change the model (add or change layers), the optimizer, or the hyperparameters to improve the validation accuracy. If you want a challenge, try to reach a validation accuracy of 75%.