## Case Study: Movie Suggestion

<img align="right" style="padding-left:10px; height: 120%; width: 40%" src="http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg">

### Two MovieLens datasets.

* **The Small Dataset:** Comprises 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users. Here is the small dataset, [a (1 MB) subset of the MovieLens database](https://grouplens.org/datasets/movielens/latest/), downloaded and unzipped for your convenience. The copy used here consists of 9742 movies.

* **The Full Dataset:** The full MovieLens dataset consists of 26 million ratings and 750,000 tag applications from 270,000 users on 45,000 movies. It can be accessed [here](https://grouplens.org/datasets/movielens/latest/). We will not be using the full dataset in this exercise.

In [16]:
import pandas as pd
import numpy as np
from pandas import DataFrame
import random
movies = pd.read_csv('../data/movies-list.csv') 

movies.head()

Unnamed: 0,Movie ID:,Name:,Rating:,Genre:,Additional Genre:,Producer:,No info:
0,1,The Prestige,PG-13,Science_Fiction,Thriller,,
1,2,Fast & Furious,PG-13,Crime,Thriller,,
2,3,Infinity War,PG-13,Fantasy,Science_Fiction,,
3,4,Wedding Crashers,R,Romance,Comedy,,
4,5,Avatar,PG-13,Fantasy,Science_Fiction,,


### Movie Ratings

The 9742 movies above were rated by 610 users, which works out to about 165 ratings per user on average. The ratings are stored in the `ratings.csv` file, sampled in the DataFrame below.

In [2]:
# '../data/movies-list.csv'

ratings = pd.read_csv('../data/ratings.csv',header = 0) 
ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [22]:
# Ratings Group By userId:
# print (type(ratings.groupby(["userId"])["userId"].count())) # prints <class 'pandas.core.series.Series'>

# Convert Series to DataFrame
#     Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reset_index.html#pandas.Series.reset_index

counts = ratings.groupby(["userId"])["userId"].count().reset_index(name="Count")
counts

Unnamed: 0,userId,Count
0,1,232
1,2,29
2,3,39
3,4,216
4,5,44
...,...,...
605,606,1115
606,607,187
607,608,831
608,609,37
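
The per-user average quoted earlier is simply the mean of these counts. The sketch below shows the idea on a hypothetical miniature table (the `toy_ratings` values are invented for illustration); run against the real `ratings` DataFrame, the same calculation should reproduce the figure of roughly 165.

```python
import pandas as pd

# Hypothetical miniature ratings table (invented for illustration).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 3, 3],
    "movieId": [1, 3, 6, 1, 3, 6],
    "rating":  [4.0, 4.0, 5.0, 3.0, 2.5, 4.5],
})

# size() counts the rows in each group, i.e. the ratings made by each user.
toy_counts = toy_ratings.groupby("userId").size().reset_index(name="Count")

# The mean of the per-user counts equals total ratings / number of users.
mean_per_user = toy_counts["Count"].mean()
print(mean_per_user, len(toy_ratings) / toy_ratings["userId"].nunique())
```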


<a href="https://en.wikipedia.org/wiki/Collaborative_filtering"><img align="right" style="padding-left:10px; height: 40%; width: 40%" src="https://upload.wikimedia.org/wikipedia/commons/5/52/Collaborative_filtering.gif" ></a>

## General Approach

As discussed in [04-05-recommendations](../../04-analysis-and-visualization/04-05-recommendations/04-05-recommendations.ipynb), a generalized version of collaborative filtering, implied by the adjoining image, is a three-step process:

1. A user expresses their preferences by rating items (e.g. books, movies or CDs) in the system. These ratings can be viewed as an approximate representation of the user's interest in the corresponding domain. _The ratings have been collected by MovieLens and imported into the `ratings` DataFrame._
2. The system matches this user's ratings against other users' and finds the people with the most "similar" tastes. For the purpose of this case study, we shall determine the recommendations for the user with **userId = 607**.
3. From those similar users, the system recommends items that they have rated highly but that this user has not yet rated (the absence of a rating is presumed to mean the user is unfamiliar with the item).
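
One way to picture the ratings in step 1 is as a user-by-movie matrix in which a missing entry means "not rated". The sketch below builds such a matrix with `DataFrame.pivot` from a hypothetical long-format table laid out like `ratings.csv`; the `toy_ratings` values are invented for illustration and this is not necessarily the representation used later in this case study.

```python
import pandas as pd

# Hypothetical long-format ratings in the same layout as ratings.csv
# (values invented for illustration).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30, 30],
    "rating":  [5.0, 3.0, 4.5, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie; NaN marks "not rated".
matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")
print(matrix)

# Movies user 1 has not rated are the only ones eligible for recommendation.
row = matrix.loc[1]
unrated_by_1 = row[row.isna()].index.tolist()
print(unrated_by_1)
```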

<span style="color:blue">

### Solution Development
</span>

We proceed with the calculations as outlined above but first create a **tiny dataset** such that we can develop the solution and _verify the calculations manually._ 

Step 2 of the algorithm is for the system to match this user's ratings against other users' and find the people with the most "similar" tastes. We shall use **userId = 93** for this tiny dataset.

Once the solution has been developed, we will write functions and classes to package the developed code and use it for the given dataset.

To formulate the math behind the distance calculation, consider two users $U_i$ and $U_j$. The "distance" between them, $\Delta_{ij}$, is expressed as

$$\Delta_{ij} = \sqrt{\sum_{k} (r_{ik} - r_{jk})^2}$$

where the sum runs over all movies $k$ that have been rated by both $U_i$ _and_ $U_j$, and $r_{ik}$ and $r_{jk}$ are the ratings of movie $k$ by $U_i$ and $U_j$, respectively.
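
The formula above can be tried on a small example that is easy to check by hand. This is a minimal sketch: the `toy` table and the `user_distance` helper are hypothetical names introduced for illustration, and returning NaN when two users share no movies is an assumed convention.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for two users (values invented for illustration).
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2],
    "movieId": [10, 20, 30, 20, 30, 40],
    "rating":  [4.0, 3.0, 5.0, 5.0, 2.0, 1.0],
})

def user_distance(ratings, i, j):
    """Euclidean distance between users i and j over movies both have rated."""
    r_i = ratings[ratings["userId"] == i]
    r_j = ratings[ratings["userId"] == j]
    # The inner merge keeps only the movies k rated by both users.
    common = r_i.merge(r_j, on="movieId", suffixes=("_i", "_j"))
    if common.empty:
        return float("nan")  # assumed convention: no common movies, no distance
    diff = common["rating_i"] - common["rating_j"]
    return float(np.sqrt((diff ** 2).sum()))

d = user_distance(toy, 1, 2)
print(d)
```

Here the two users share movies 20 and 30, so the distance is $\sqrt{(3-5)^2 + (5-2)^2} = \sqrt{13}$.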

In [4]:
# Cell 4
# Initial Parameters
given_userId = 93

In [5]:
# Cell 5
tiny_movies = pd.read_csv('../data/movies-tiny.csv',header = 0) 
tiny_movies

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance


In [6]:
# Cell 6
# Keep only the ratings of the movies in the tiny dataset
tiny_ratings = pd.read_csv('../data/ratings.csv', header=0).drop(columns=['timestamp'])
tiny_ratings = tiny_ratings.merge(tiny_movies, how='inner', on='movieId').drop(columns=['title', 'genres'])

# Collect the users who rated more than one of the tiny movies
userIds = tiny_ratings['userId'].tolist()
seen = {}
dupes = []
for x in userIds:
    if x not in seen:
        seen[x] = 1
    else:
        if seen[x] == 1:
            dupes.append(x)
        seen[x] += 1

# The chosen user must have rated at least two of the tiny movies
try:
    assert given_userId in dupes
except AssertionError:
    print('Should have chosen a userId from amongst these\n', dupes)
    raise

### Libraries to use

We will be using `numpy` and `scipy` for most of the calculations, and `pandas` mostly for pretty printing.

Variable names will be chosen such that:

1. Variables ending in `_np` will be used for numpy arrays.
2. Variables ending in `_2d` will be used for 2-D numpy arrays.
3. Variables ending in `_df` will be used for pandas DataFrames.

In [7]:
# Cell 7
tiny_ratings_np = tiny_ratings.to_numpy(dtype=np.float32)
tiny_ratings_np

array([[  1. ,   1. ,   4. ],
       [  5. ,   1. ,   4. ],
       [  7. ,   1. ,   4.5],
       ...,
       [262. ,   4. ,   1. ],
       [411. ,   4. ,   2. ],
       [600. ,   4. ,   1.5]], dtype=float32)

In [8]:
# Find the ratings made by our user
x = tiny_ratings_np
the_r_2d = x[x[:, 0] == given_userId][:, [2]]   # column 2 holds the rating
print(the_r_2d)
# Pick one of the user's ratings at random and keep it as a (1, 1) array
random_index = random.randint(0, len(the_r_2d) - 1)
the_r_2d = the_r_2d[random_index].reshape(1, 1)
print(the_r_2d)
# Flatten to a 1-D array of length 1
the_r = the_r_2d.reshape(the_r_2d.shape[1])
the_r

[[3.]
 [5.]]
[[5.]]


array([5.], dtype=float32)

In [9]:
# Find the ratings by all users
all_r_2d = tiny_ratings_np[:, [2]]
all_r = all_r_2d.reshape(all_r_2d.shape[0])

all_u_2d = tiny_ratings_np[:, [0]]
all_u = all_u_2d.reshape(all_u_2d.shape[0])

all_m_2d = tiny_ratings_np[:, [1]]
all_m = all_m_2d.reshape(all_m_2d.shape[0])

all_u, all_m, all_r

(array([  1.,   5.,   7.,  15.,  17.,  18.,  19.,  21.,  27.,  31.,  32.,
         33.,  40.,  43.,  44.,  45.,  46.,  50.,  54.,  57.,  63.,  64.,
         66.,  68.,  71.,  73.,  76.,  78.,  82.,  86.,  89.,  90.,  91.,
         93.,  96.,  98., 103., 107., 112., 119., 121., 124., 130., 132.,
        134., 135., 137., 140., 141., 144., 145., 151., 153., 155., 156.,
        159., 160., 161., 166., 167., 169., 171., 177., 178., 179., 182.,
        185., 186., 191., 193., 200., 201., 202., 206., 213., 214., 216.,
        217., 219., 220., 223., 226., 229., 232., 233., 234., 239., 240.,
        247., 249., 252., 254., 263., 264., 266., 269., 270., 273., 274.,
        275., 276., 277., 279., 280., 282., 283., 288., 290., 291., 292.,
        293., 298., 304., 307., 314., 322., 323., 328., 330., 332., 334.,
        336., 337., 339., 341., 347., 350., 353., 357., 359., 364., 367.,
        372., 373., 378., 380., 381., 382., 385., 389., 391., 396., 399.,
        401., 411., 412., 414., 420., 

In [10]:
from scipy.spatial.distance import cdist, euclidean
from scipy.spatial import distance_matrix
print (the_r, all_r)
dm = distance_matrix(all_r_2d, all_r_2d)

[5.] [4.  4.  4.5 2.5 4.5 3.5 4.  3.5 3.  5.  3.  3.  5.  5.  3.  4.  5.  3.
 3.  5.  5.  4.  4.  2.5 5.  4.5 0.5 4.  2.5 4.  3.  3.  4.  3.  5.  4.5
 4.  4.  3.  3.5 4.  4.  3.  2.  3.  4.  4.  3.  4.  3.5 5.  5.  2.  3.
 4.  4.5 4.  4.  5.  3.5 4.5 5.  5.  4.  4.  4.  4.  4.  4.  2.  3.5 5.
 4.  5.  3.5 3.  3.  4.  3.5 5.  3.5 3.5 5.  3.5 3.  5.  4.  5.  5.  4.
 4.5 4.5 4.  4.  2.  5.  5.  5.  4.  5.  4.  4.  3.  4.5 4.5 3.  4.5 4.
 4.  4.  3.  2.  5.  4.  3.  3.5 3.5 5.  4.  4.  3.5 4.  4.  4.  5.  5.
 4.  5.  5.  4.  5.  5.  3.  3.  4.5 5.  3.5 4.5 4.  5.  3.  5.  4.  3.5
 5.  2.  4.  4.  4.  2.5 4.  4.  4.5 4.  5.  5.  5.  5.  4.5 1.5 4.  4.
 4.  5.  4.  4.  4.  3.  4.  4.5 4.5 3.5 4.  4.  4.  4.  4.  4.  3.  4.
 4.  2.5 3.  5.  4.  3.  3.  4.  4.  5.  3.  4.  4.5 3.5 4.  4.  5.  4.
 3.  5.  5.  4.  4.  4.  3.  2.5 4.  4.  3.  4.  2.5 4.  2.5 3.  5.  4.
 5.  3.  3.  4.  5.  3.  4.  3.  3.5 2.  3.  3.5 5.  3.5 3.  3.  3.  5.
 4.  1.  3.5 4.  4.  3.  4.  2.5 1.  3.  3.5 0.5 3.  3.  

In [11]:
# Manually verifying the euclidean calculation.
# The numbers produced by the previous cell and by this cell should match!
dists = cdist(the_r_2d, all_r_2d)
distances = list(dists.reshape(dists.shape[1]))
print (distances)

[1.0, 1.0, 0.5, 2.5, 0.5, 1.5, 1.0, 1.5, 2.0, 0.0, 2.0, 2.0, 0.0, 0.0, 2.0, 1.0, 0.0, 2.0, 2.0, 0.0, 0.0, 1.0, 1.0, 2.5, 0.0, 0.5, 4.5, 1.0, 2.5, 1.0, 2.0, 2.0, 1.0, 2.0, 0.0, 0.5, 1.0, 1.0, 2.0, 1.5, 1.0, 1.0, 2.0, 3.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.5, 0.0, 0.0, 3.0, 2.0, 1.0, 0.5, 1.0, 1.0, 0.0, 1.5, 0.5, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 1.5, 0.0, 1.0, 0.0, 1.5, 2.0, 2.0, 1.0, 1.5, 0.0, 1.5, 1.5, 0.0, 1.5, 2.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.5, 0.5, 1.0, 1.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 2.0, 0.5, 0.5, 2.0, 0.5, 1.0, 1.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 1.5, 1.5, 0.0, 1.0, 1.0, 1.5, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 2.0, 0.5, 0.0, 1.5, 0.5, 1.0, 0.0, 2.0, 0.0, 1.0, 1.5, 0.0, 3.0, 1.0, 1.0, 1.0, 2.5, 1.0, 1.0, 0.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.5, 3.5, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 2.0, 1.0, 0.5, 0.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.5, 2.0, 0.0, 1.0, 2.0, 2.0, 1.0, 1.0, 0.0, 2.0, 1.0, 0.5, 1.5, 1.0, 1.0, 0.0, 1.0, 2.0, 0.0,

In [12]:
# This cell will throw an exception if the results of euclidean calculation don't match
from math import sqrt
assert(np.isclose(euclidean(the_r, all_r), 
                  sqrt(sum([_*_ for _ in distances]))))
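
The cells above compare SciPy's distance helpers against a hand calculation. The following sketch repeats that comparison on a few hypothetical ratings, small enough to verify mentally; the array names are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical ratings, small enough to check by hand.
one_rating_2d = np.array([[5.0]])                   # shape (1, 1): a single rating
other_ratings_2d = np.array([[4.0], [2.5], [5.0]])  # shape (3, 1): three ratings

# cdist returns one row per point in the first argument and
# one column per point in the second, so the result has shape (1, 3).
pairwise = cdist(one_rating_2d, other_ratings_2d)
print(pairwise)

# With one-dimensional points the Euclidean distance is just |a - b|.
manual = np.abs(one_rating_2d - other_ratings_2d.T)
assert np.allclose(pairwise, manual)
```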

In [13]:
# Cell 13
dists = np.stack([all_u,
                  all_m,
                  np.apply_along_axis(lambda x: sqrt(sum([_*_ for _ in x])), 0, dm)])
dists_df = DataFrame(dists.transpose(), columns=['u', 'm', 'r']) \
           .astype({'u': 'int32', 'm': 'int32', 'r': 'float32'})
dists_df   #  [['u', 'r']]

Unnamed: 0,u,m,r
0,1,1,19.474342
1,5,1,19.474342
2,7,1,24.591665
3,15,1,29.236107
4,17,1,24.591665
...,...,...,...
379,84,4,22.455511
380,162,4,22.455511
381,262,4,55.301445
382,411,4,37.379807


In [14]:
# Cell 14
threshold_distance = 2 * dists_df['r'].min()
dists_df = dists_df[dists_df['r'] < threshold_distance]
dists_df

Unnamed: 0,u,m,r
0,1,1,19.474342
1,5,1,19.474342
2,7,1,24.591665
3,15,1,29.236107
4,17,1,24.591665
...,...,...,...
375,605,2,18.594355
377,6,4,22.455511
378,14,4,22.455511
379,84,4,22.455511


In **step 3 of the algorithm**, the system recommends items that the similar users have rated highly but that this user has not yet rated (the absence of a rating is presumed to mean the user is unfamiliar with the item).
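
Step 3 can be sketched in isolation, assuming the similar users are already known. Everything in this example is hypothetical: the `toy_ratings` values, the `similar_users` list, and the `min_rating` cut-off that stands in for "rated highly". It illustrates the idea and is not the code used in the cell that follows.

```python
import pandas as pd

# Hypothetical inputs (invented for illustration).
toy_ratings = pd.DataFrame({
    "userId":  [93, 93, 5, 5, 5, 7, 7],
    "movieId": [1, 2, 1, 3, 4, 2, 3],
    "rating":  [3.0, 5.0, 4.0, 4.5, 2.0, 5.0, 4.0],
})
target_user = 93
similar_users = [5, 7]   # assumed to come out of step 2
min_rating = 4.0         # assumed meaning of "rated highly"

# Movies the target user has already rated; these must not be recommended.
is_target = toy_ratings["userId"] == target_user
already_rated = set(toy_ratings.loc[is_target, "movieId"])

# Ratings by the similar users, restricted to high ratings of unseen movies.
mask = (
    toy_ratings["userId"].isin(similar_users)
    & (toy_ratings["rating"] >= min_rating)
    & ~toy_ratings["movieId"].isin(already_rated)
)

# Rank the candidates by their mean rating among the similar users.
recommendations = (
    toy_ratings[mask]
    .groupby("movieId")["rating"]
    .mean()
    .sort_values(ascending=False)
)
print(recommendations)
```

In this toy data only movie 3 survives all three filters: movies 1 and 2 were already rated by user 93, and movie 4 was rated too low.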

In [15]:
# Cell 15
# What movies has the user picked already? Don't recommend those!
candidates_df = dists_df[dists_df['u'] == given_userId].sort_values(by=['r'])
recommend_df = candidates_df.merge(tiny_movies, left_on='m', right_on='movieId').drop(columns=['u', 'm', 'r'])
recommend_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


## TO-DO

Your assignment is to package the code from <span style="color:green"><em># Initial Parameters</em></span> to the <span style="color:green"><em># Candidate Movies</em></span> cells.

1. Create a function `recommend_movies(userId, threshold, movies_directory)` that takes `userId`, `threshold` and `movies_directory` as parameters and produces recommendations for the user. Test the code first with userId = 607. Try various values of threshold such that the user gets at least 6 movie recommendations.
2. Back in the <span style="color:green"><em># What movies has the user picked already? Don't recommend those!</em></span> cell (cell 15), we had picked the first record for our user. Modify the code to begin instead with the movie our user liked the most!
3. Time your code for various values of `userId` and `threshold`. What accounts for the variation in timing?
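
For the timing task, one possible harness is sketched below using `time.perf_counter` from the standard library. The `time_call` helper and the `recommend_movies_stub` function are hypothetical stand-ins introduced for illustration; replace the stub with your own function and pass whatever arguments it takes.

```python
import time

def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in for the function you will write; it only burns some CPU time.
def recommend_movies_stub(user_id, threshold):
    return [m for m in range(100_000) if m % (user_id + 1) < threshold]

for user_id, threshold in [(607, 2), (607, 4), (5, 2)]:
    elapsed = time_call(recommend_movies_stub, user_id, threshold)
    print(f"userId={user_id} threshold={threshold}: {elapsed:.4f} s")
```

Taking the best of several runs reduces the influence of one-off delays such as disk caching on the first call.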