# Session 1
## Case Study 2
### Homework

### Data Source

* movie-train.csv

* movie-test.csv

These have been taken (and modified) from:
http://kevinmolloy.info/teaching/cs504_2017Fall/

This is a small subset of the original movielens dataset.
https://grouplens.org/datasets/movielens/


#### Objective
Use kNN as a simple recommendation/prediction method for movies.

#### Datasets

As discussed in class, you will build your model using the training data. To test your model, you will calculate a prediction for each entry in the test set (a userId/movieId pair). Since you know the real rating, you can, as an additional exercise, compare the two and determine how well your method performs. In this exercise we only consider whether a user has seen a movie or not, irrespective of the rating.

In other words, if a line with userId, movieId, rating exists, then the user has seen that movie.
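The evaluation step described above can be sketched as follows. This is a minimal sketch, not part of the original notebook: `evaluate` and the `predict` callable are hypothetical names, and the toy test cases are made up for illustration.

```python
def evaluate(test_cases, predict):
    """Return the fraction of test cases where predict(user, movie) matches the truth.

    test_cases: iterable of (user, movie, seen_flag) triples, seen_flag in {0, 1}.
    predict: hypothetical callable returning True/False for a (user, movie) pair.
    """
    cases = list(test_cases)
    if not cases:
        raise ValueError("no test cases")
    correct = sum(1 for u, m, truth in cases if bool(predict(u, m)) == bool(truth))
    return correct / len(cases)


# Toy usage with made-up data and a trivial predictor that always says "seen".
toy_cases = [(1, 10, 1), (1, 11, 0), (2, 10, 1), (2, 12, 1)]
accuracy = evaluate(toy_cases, lambda u, m: True)
print(accuracy)
```

In the notebook, `predict` could wrap the `prediction` function defined later with a fixed `k`.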


### Description

Consider the problem of recommending movies to users. We have M users and N movies.
We want to predict whether a given test user $x$ will watch movie $y$.

User $x$ has seen some movies in the past and not others. We will use $x$'s movie-watching history as the features for our recommendation system.

We will use kNN to find the K nearest neighbour users (users with similar taste) to $x$, and make predictions based on their entries for movie $y$.

A user has either seen a movie (1) or not seen it (0). We can represent this as a matrix of size M×N (M rows and N columns). In the code, this matrix is stored as a dictionary keyed by (userId, movieId).

Each element of the matrix is either zero or one. If the $(u, m)$ entry in this matrix is 1, then the $u^{th}$ user has seen movie $m$.

#### Training set
An M×N binary matrix indicating seen/not-seen.

#### Test set
L test cases, each an $(x, y)$ pair. $x$ is an N-dimensional binary vector with the $y^{th}$ entry missing, which is the entry we want to predict.
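To make the representation concrete, here is a small sketch with made-up data: three users, three movies, and zero-based IDs chosen purely for illustration. The names `toy_pairs`, `toy_seen`, `matrix` and `features` are hypothetical and do not appear in the notebook.

```python
# Toy ratings: each (userId, movieId) pair means "this user has seen this movie".
toy_pairs = [(0, 0), (0, 2), (1, 0), (1, 1), (2, 2)]
M, N = 3, 3  # number of users and movies in this toy example

# Dictionary form: every (user, movie) key maps to 1 (seen) or 0 (not seen).
toy_seen = {(u, m): 0 for u in range(M) for m in range(N)}
for pair in toy_pairs:
    toy_seen[pair] = 1

# Equivalent M x N matrix form, one row per user.
matrix = [[toy_seen[(u, m)] for m in range(N)] for u in range(M)]
for row in matrix:
    print(row)

# A test case (x, y): user x's row with the y-th entry left out.
x, y = 1, 1
features = [toy_seen[(x, m)] for m in range(N) if m != y]
print(features)
```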

**Exercise 1** :: Write a function to compute the Euclidean distance between two users over all entries except the missing $y^{th}$ entry.

The code below is provided for computing nearest neighbours with a cosine-style similarity.
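For reference, the cosine similarity between two binary vectors is the number of movies both users have seen, divided by the product of the two vector lengths; cosine distance is one minus that value. The `distance` function further down computes only the numerator (the dot product). A minimal sketch on plain lists, where `cosine_similarity` is an illustrative name and returning 0.0 for a user with no seen movies is an assumption:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero user is similar to nobody
    return dot / (norm_a * norm_b)


# Made-up viewing histories: both users have seen movies 0 and 3.
u1 = [1, 0, 1, 1]
u2 = [1, 1, 0, 1]
sim = cosine_similarity(u1, u2)
print(sim, 1 - sim)  # similarity and the corresponding cosine distance
```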

In [None]:
import pandas as pd
rated = pd.read_csv("../Datasets/movie-train.csv", converters={"userId":int, "movieId":int})
rated.describe()

In [None]:
userCount = max(rated.userId)
movieCount = max(rated.movieId)


In [None]:
seen = {}
for x in rated.values:
    seen[(int(x[0]), int(x[1]))] = 1

In [None]:
allUsersMovies = [(u,m) for u in range(userCount) for m in range(movieCount)]

In [None]:
for x in allUsersMovies:
    if x not in seen:
        seen[x] = 0

Now that we have the data loaded into a dictionary, let us recast the distance function to use it. Given two users, $u_1$ and $u_2$, and a movie $m$, we must ignore both users' entries for $m$ when computing the distance between them.

In [None]:
import math
userCount = max(rated.userId)
movieCount = max(rated.movieId)
seen = {} #dict
for x in rated.values:
    seen[(int(x[0]), int(x[1]))] = 1
for x in allUsersMovies:
    if x not in seen:
        seen[x] = 0

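# Not a true cosine distance: this is the unnormalised dot product, i.e. the
# number of movies other than mx that both users have seen (higher = closer).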
def distance(u1, u2, mx):
    d = 0 - seen[(u1, mx)] * seen[(u2, mx)]
    for m in range(movieCount):
        d += seen[(u1, m)] * seen[(u2, m)]
    return d


def kNN(k, givenUser, givenMovie):
    distances = []
    for u in range(userCount):
        if u != givenUser:
            distances.append([distance(u, givenUser, givenMovie), u])
    distances.sort()
    distances.reverse()  # the score is a similarity, so higher = closer
    return distances[:k]  # list of [score, user] pairs

def prediction(k, givenUser, givenMovie):
    neighbours = kNN(k, givenUser, givenMovie)
    howmanySaw = sum([seen[(u, givenMovie)] for d, u in neighbours])
    print(howmanySaw)
    return 2 * howmanySaw > k

In [None]:
prediction(4,3,101)

**Exercise 1** :: Run the code above and verify that it works.

**Exercise 2** :: Change the distance function to compute the Euclidean distance, and see if the prediction changes. Remember to modify the kNN function to pick the smallest distances: do not reverse()!

In [None]:
import math
userCount = max(rated.userId)
movieCount = max(rated.movieId)
seen = {} #dict
for x in rated.values:
    seen[(int(x[0]), int(x[1]))] = 1
for x in allUsersMovies:
    if x not in seen:
        seen[x] = 0

# Euclidean distance, ignoring the entry for movie mx

def distance(u1, u2, mx):
    sqSum = 0
    for m in range(movieCount):
        if m != mx:  # skip the movie we are trying to predict
            sqSum += (seen[(u1, m)] - seen[(u2, m)]) ** 2
    return math.sqrt(sqSum)

def kNN(k, givenUser, givenMovie):
    distances = []
    for u in range(userCount):
        if u != givenUser:
            distances.append([distance(u, givenUser, givenMovie), u])
    distances.sort()  # ascending: smallest distance = closest, so no reverse()
    return distances[:k]  # list of [distance, user] pairs

def prediction(k, givenUser, givenMovie):
    neighbours = kNN(k, givenUser, givenMovie)
    howmanySaw = sum([seen[(u, givenMovie)] for d, u in neighbours])
    print(howmanySaw)
    return 2 * howmanySaw > k

In [None]:
prediction(4,2,102)

**Exercise 3** :: Change the distance function to compute the Manhattan distance, and see if the prediction changes. Remember to modify the kNN function to pick the smallest distances: do not reverse()!
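A note on what to expect: on 0/1 vectors every coordinate difference is 0 or 1, so $|a_i - b_i| = (a_i - b_i)^2$ and the Manhattan distance equals the squared Euclidean distance. Because the square root is monotonic, one would expect both metrics to rank neighbours in the same order. A small check with made-up vectors (the names `manhattan`, `euclidean`, `target` and `others` are illustrative only):

```python
import math


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up binary viewing histories.
target = [1, 0, 1, 1, 0]
others = {
    "A": [1, 0, 0, 1, 0],  # differs from target in 1 position
    "B": [0, 1, 0, 0, 1],  # differs in all 5 positions
    "C": [0, 1, 1, 1, 0],  # differs in 2 positions
}

# Rank the other users from nearest to farthest under each metric.
rank_manhattan = sorted(others, key=lambda name: manhattan(target, others[name]))
rank_euclidean = sorted(others, key=lambda name: euclidean(target, others[name]))
print(rank_manhattan, rank_euclidean)
```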

In [None]:
import math
userCount = max(rated.userId)
movieCount = max(rated.movieId)
seen = {} #dict
for x in rated.values:
    seen[(int(x[0]), int(x[1]))] = 1
for x in allUsersMovies:
    if x not in seen:
        seen[x] = 0
# Manhattan distance, ignoring the entry for movie mx

def distance(u1, u2, mx):
    absSum = 0
    for m in range(movieCount):
        if m != mx:  # skip the movie we are trying to predict
            absSum += abs(seen[(u1, m)] - seen[(u2, m)])
    return absSum


def kNN(k, givenUser, givenMovie):
    distances = []
    for u in range(userCount):
        if u != givenUser:
            distances.append([distance(u, givenUser, givenMovie), u])
    distances.sort()  # ascending: smallest distance = closest, so no reverse()
    return distances[:k]  # list of [distance, user] pairs

def prediction(k, givenUser, givenMovie):
    neighbours = kNN(k, givenUser, givenMovie)
    howmanySaw = sum([seen[(u, givenMovie)] for d, u in neighbours])
    print(howmanySaw)
    return 2 * howmanySaw > k

In [None]:
prediction(4,3,101)