In [1]:
import numpy as np

# Suggested order to complete this assignment:
** Always watch the lecture videos and carefully read the instructions under every function!
1. distances.py
2. k_nearest_neighbor.py
3. collaborative_filtering.py

load_movielens.py is only needed for the free response (FR) questions, so you can complete it at any point in this order.

# distances.py
1. Check the lectures for the formulas.
2. For each distance function, the inputs are an M * K NumPy array (X) and an N * K NumPy array (Y). The output should be an M * N array, where entry [i, j] is the distance between the ith row of X and the jth row of Y.

Tip: there are many ways to implement these functions, but we strongly recommend using NumPy methods (particularly aggregate functions) to reduce the run time. If these distance functions are not optimized, the code needed for the FR questions may take a very long time to run.

So here are some useful functions:

In [2]:
X = np.array([1,2,3])
Y = np.array([1,4,9])

In [3]:
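np.linalg.norm(X - Y, ord=None)
# Note: for 1-D vectors, ord=None gives the Euclidean distance.
# For 2-D inputs, ord=None gives the Frobenius norm of the whole matrix instead.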

6.324555320336759

In [4]:
np.linalg.norm(X - Y, ord=1)
# Note: for 1-D vectors, ord=1 gives the Manhattan distance.
# For 2-D inputs, ord=1 gives the maximum absolute column sum instead.

8.0

In [7]:
np.sqrt([1, 4, 9, 16])

array([1., 2., 3., 4.])

In [6]:
np.dot([-1, 2, 3], [4, 5, 6])

24

In [5]:
np.square([-1, 2, 3])

array([1, 4, 9])
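
The helpers above can be combined with broadcasting to produce the M * N output described earlier. What follows is only a minimal sketch, not the reference solution: the function names `pairwise_euclidean` and `pairwise_manhattan` are hypothetical, and the assignment's own function names and formulas (see the lectures) take precedence.

```python
import numpy as np

def pairwise_euclidean(X, Y):
    """Return an (M, N) array of Euclidean distances between rows of X and Y."""
    # X[:, None, :] has shape (M, 1, K); Y[None, :, :] has shape (1, N, K).
    # Broadcasting the subtraction gives an (M, N, K) array of differences.
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt(np.square(diff).sum(axis=2))

def pairwise_manhattan(X, Y):
    """Return an (M, N) array of Manhattan distances between rows of X and Y."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.abs(diff).sum(axis=2)

X = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])  # M = 2, K = 3
Y = np.array([[1.0, 4.0, 9.0]])                   # N = 1, K = 3

D_euclid = pairwise_euclidean(X, Y)
D_manhattan = pairwise_manhattan(X, Y)
print(D_euclid.shape)  # one row per row of X, one column per row of Y
```

Note that the intermediate (M, N, K) array can use a lot of memory for large inputs, so this is a trade of memory for speed.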

# k_nearest_neighbor.py
1. fit function: think about whether there is an actual training process for KNN models...
2. predict function. For each row (i.e. each data point) in the input features matrix:
<br>
    a. Find its K nearest neighbors in the training features. Use the functions implemented in distances.py: compute the distances --> sort the training data by distance --> take the K nearest neighbors.
<br>
    b. If ignore_first = True, skip the closest neighbor (so find the K+1 nearest neighbors and discard the first one).
<br>
    c. Use the specified aggregator to predict the class of that data point. For example, if K = 5, aggregator = mode, and the classes of the 5 nearest neighbors are [0,0,2,3,4], the prediction for that data point will be 0.
<br>
    d. Do this for every row in the input features matrix.
<br>
<br>
Tip: note that the predictions depend directly on the training data (we do not build a model in the fit function and then use that model for prediction). Therefore, you need to access the training data in the predict function. How do you do that?


Here are some useful functions:

In [10]:
arr = np.array([1,-2,100,3])
print(np.sort(arr, axis=0))     # sorts array
print(np.argsort(arr, axis=0))  # returns indices of sorted array

[ -2   1   3 100]
[1 0 3 2]
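
To make the predict steps concrete, here is a rough sketch of how `np.argsort` can select neighbors for every row at once. It is an illustration under several assumptions, not the assignment's required interface: `knn_predict_sketch` and `mode_aggregate` are hypothetical names, the targets are assumed to be a 1-D array of class labels, Euclidean distance is hard-coded, and ties in the mode go to the smallest label.

```python
import numpy as np

def mode_aggregate(labels):
    """Most frequent label; ties go to the smallest label (an arbitrary choice)."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def knn_predict_sketch(train_X, train_y, test_X, k, ignore_first=False):
    # Pairwise Euclidean distances, shape (n_test, n_train).
    diff = test_X[:, None, :] - train_X[None, :, :]
    dists = np.sqrt(np.square(diff).sum(axis=2))
    # For every test row, indices of the training points, nearest first.
    order = np.argsort(dists, axis=1)
    # With ignore_first, drop the closest neighbor and take the next k.
    start = 1 if ignore_first else 0
    nearest = order[:, start:start + k]
    # Aggregate the labels of the selected neighbors, one test row at a time.
    return np.array([mode_aggregate(train_y[row]) for row in nearest])

train_X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
train_y = np.array([0, 0, 0, 1, 1])
test_X = np.array([[0.5], [10.5]])

preds = knn_predict_sketch(train_X, train_y, test_X, k=3)
print(preds)
```

In this sketch the training data is passed in directly; in a class-based implementation it would have to be made available to predict some other way.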


# collaborative_filtering.py
The goal is to predict a person's ratings using the ratings of their K nearest neighbors.
<br>
1. Essentially, we want to find the K nearest neighbors of every user (i.e. every row) in the input features. Then we use the information from those K nearest neighbors to predict the ratings for that user. Note that we predict a user's ratings for all movies, but we only replace 0 ratings with the predictions.
<br>
2. In this case, we do not have an explicit target array. You will use the same data as both the training features and the training targets.
<br>
3. When you call KNN.predict, make sure you set ignore_first to True. Otherwise, a data point's nearest neighbor will always be itself, which is not helpful for prediction.
<br>
<br>

P.S.: when you answer the FR questions, there may still be a few zeros in the data after imputation. This just means that none of a user's nearest neighbors rated that movie either. As long as your collaborative_filtering model is working correctly, a few zeros will not affect the MSEs much.


In [13]:
x = np.array([0, 2, 3, 4, 5, 5, 4, 3, 2, 0, 0, 0])
print(np.argwhere(x == 0))  # helpful when deciding which entries to impute

[[ 0]
 [ 9]
 [10]
 [11]]
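
The imputation idea can be sketched end to end on a tiny ratings matrix. This is a simplified illustration, not the assignment's implementation: `impute_sketch` is a hypothetical name, Euclidean distance and a mean aggregator are assumed, a neighbor's 0 is averaged in as if it were a rating of 0, and no two users are assumed to have identical rows (so each user's closest point is itself).

```python
import numpy as np

def impute_sketch(ratings, k):
    """Replace 0 entries with the mean rating of each user's k nearest neighbors."""
    # Pairwise Euclidean distances between users, shape (n_users, n_users).
    diff = ratings[:, None, :] - ratings[None, :, :]
    dists = np.sqrt(np.square(diff).sum(axis=2))
    # Column 0 of the sorted order is the user itself (distance 0), so skip it.
    # This plays the role of ignore_first=True.
    neighbors = np.argsort(dists, axis=1)[:, 1:k + 1]
    # ratings[neighbors] has shape (n_users, k, n_movies); average over the k axis.
    predicted = ratings[neighbors].mean(axis=1)
    # Keep observed ratings and fill in predictions only where the rating is 0.
    return np.where(ratings == 0, predicted, ratings)

ratings = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 2.0, 3.0],
    [5.0, 4.0, 0.0],
    [1.0, 1.0, 1.0],
])

imputed = impute_sketch(ratings, k=2)
print(imputed)
```

Using `np.where` keeps every nonzero rating untouched, which matches the requirement that only 0 ratings are replaced.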
