# Movie Recommendation System

**Name:**  Riddhi Mahesh Dange

**Download the dataset from here:** https://grouplens.org/datasets/movielens/1m/

In [1]:
# Import all the required libraries
import numpy as np
import pandas as pd

## Reading the Data


In [4]:
# Read the dataset from the two files into ratings_data and movies_data
column_list_ratings = ["UserID", "MovieID", "Ratings","Timestamp"]
ratings_data  = pd.read_csv('ratings.dat',sep='::',names = column_list_ratings, encoding= "latin-1",engine="python")
column_list_movies = ["MovieID","Title","Genres"]
movies_data = pd.read_csv('movies.dat',sep = '::',names = column_list_movies, encoding="latin-1", engine="python")
column_list_users = ["UserID","Gender","Age","Occupation","Zixp-code"]
user_data = pd.read_csv("users.dat",sep = "::",names = column_list_users, encoding="latin-1", engine="python")

`ratings_data`, `movies_data`, `user_data` corresponds to the data loaded from `ratings.dat`, `movies.dat`, and `users.dat` in Pandas.

## Data analysis

In [5]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)
data

Unnamed: 0,UserID,MovieID,Ratings,Timestamp,Gender,Age,Occupation,Zixp-code,Title,Genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama
...,...,...,...,...,...,...,...,...,...,...
1000204,5949,2198,5,958846401,M,18,17,47901,Modulations (1998),Documentary
1000205,5675,2703,3,976029116,M,35,14,30030,Broken Vessels (1998),Drama
1000206,5780,2845,1,958153068,M,18,17,92886,White Boys (1999),Drama
1000207,5851,3607,5,957756608,F,18,20,55410,One Little Indian (1973),Comedy|Drama|Western


Next, we can create a pivot table to match the ratings with a given movie title. Using `data.pivot_table`, we can aggregate (using the average/`mean` function) the reviews and find the average rating for each movie. We can save this pivot table into the `mean_ratings` variable. 

In [6]:
mean_ratings=data.pivot_table('Ratings','Title',aggfunc='mean')
mean_ratings

Unnamed: 0_level_0,Ratings
Title,Unnamed: 1_level_1
"$1,000,000 Duck (1971)",3.027027
'Night Mother (1986),3.371429
'Til There Was You (1997),2.692308
"'burbs, The (1989)",2.910891
...And Justice for All (1979),3.713568
...,...
"Zed & Two Noughts, A (1985)",3.413793
Zero Effect (1998),3.750831
Zero Kelvin (Kjærlighetens kjøtere) (1995),3.500000
Zeus and Roxanne (1997),2.521739


Now, we can take the `mean_ratings` and sort it by the value of the rating itself. Using this and the `head` function, we can display the top 15 movies by average rating.

In [7]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],aggfunc='mean')
top_15_mean_ratings = mean_ratings.sort_values(by = 'Ratings',ascending = False).head(15)
top_15_mean_ratings

Unnamed: 0_level_0,Ratings
Title,Unnamed: 1_level_1
Ulysses (Ulisse) (1954),5.0
Lured (1947),5.0
Follow the Bitch (1998),5.0
Bittersweet Motel (2000),5.0
Song of Freedom (1936),5.0
One Little Indian (1973),5.0
Smashing Time (1967),5.0
Schlafes Bruder (Brother of Sleep) (1995),5.0
"Gate of Heavenly Peace, The (1995)",5.0
"Baby, The (1973)",5.0


In [8]:
mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
mean_ratings

Gender,F,M
Title,Unnamed: 1_level_1,Unnamed: 2_level_1
"$1,000,000 Duck (1971)",3.375000,2.761905
'Night Mother (1986),3.388889,3.352941
'Til There Was You (1997),2.675676,2.733333
"'burbs, The (1989)",2.793478,2.962085
...And Justice for All (1979),3.828571,3.689024
...,...,...
"Zed & Two Noughts, A (1985)",3.500000,3.380952
Zero Effect (1998),3.864407,3.723140
Zero Kelvin (Kjærlighetens kjøtere) (1995),,3.500000
Zeus and Roxanne (1997),2.777778,2.357143


We can now sort the ratings as before, but instead of by `Rating`, but by the `F` and `M` gendered rating columns. Print the top rated movies by male and female reviews, respectively.

In [9]:
data=pd.merge(pd.merge(ratings_data,user_data),movies_data)

mean_ratings=data.pivot_table('Ratings',index=["Title"],columns=["Gender"],aggfunc='mean')
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
print(top_female_ratings.head(15))

top_male_ratings = mean_ratings.sort_values(by='M', ascending=False)
print(top_male_ratings.head(15))

Gender                                               F         M
Title                                                           
Clean Slate (Coup de Torchon) (1981)               5.0  3.857143
Ballad of Narayama, The (Narayama Bushiko) (1958)  5.0  3.428571
Raw Deal (1948)                                    5.0  3.307692
Bittersweet Motel (2000)                           5.0       NaN
Skipped Parts (2000)                               5.0  4.000000
Lamerica (1994)                                    5.0  4.666667
Gambler, The (A Játékos) (1997)                    5.0  3.166667
Brother, Can You Spare a Dime? (1975)              5.0  3.642857
Ayn Rand: A Sense of Life (1997)                   5.0  4.000000
24 7: Twenty Four Seven (1997)                     5.0  3.750000
Twice Upon a Yesterday (1998)                      5.0  3.222222
Woman of Paris, A (1923)                           5.0  2.428571
I Am Cuba (Soy Cuba/Ya Kuba) (1964)                5.0  4.750000
Gate of Heavenly Peace, T

In [10]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

Gender,F,M,diff
Title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
"James Dean Story, The (1957)",4.0,1.0,-3.0
Country Life (1994),5.0,2.0,-3.0
"Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)",4.0,1.0,-3.0
Babyfever (1994),3.666667,1.0,-2.666667
"Woman of Paris, A (1923)",5.0,2.428571,-2.571429
Cobra (1925),4.0,1.5,-2.5
"Other Side of Sunday, The (Søndagsengler) (1996)",5.0,2.928571,-2.071429
"To Have, or Not (1995)",4.0,2.0,-2.0
For the Moment (1994),5.0,3.0,-2.0
Phat Beach (1996),3.0,1.0,-2.0


Grouping the data-frame, instead, to see how different titles compare in terms of the number of ratings. Group by `Title` and then take the top 10 items by number of reviews. We can see here the most popularly-reviewed titles.

In [11]:
ratings_by_title=data.groupby('Title').size()
ratings_by_title.sort_values(ascending=False).head(10)

Title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
Saving Private Ryan (1998)                               2653
Terminator 2: Judgment Day (1991)                        2649
Matrix, The (1999)                                       2590
Back to the Future (1985)                                2583
Silence of the Lambs, The (1991)                         2578
dtype: int64

In [12]:
filtered_data= ratings_by_title.groupby('Title').filter(lambda x:(x>=2500).all())
filtered_data

Title
American Beauty (1999)                                   3428
Back to the Future (1985)                                2583
Fargo (1996)                                             2513
Jurassic Park (1993)                                     2672
Matrix, The (1999)                                       2590
Men in Black (1997)                                      2538
Raiders of the Lost Ark (1981)                           2514
Saving Private Ryan (1998)                               2653
Silence of the Lambs, The (1991)                         2578
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Terminator 2: Judgment Day (1991)                        2649
dtype: int64

Creating a ratings matrix using Numpy. This matrix allows us to see the ratings for a given movie and user ID. Every element $[i,j]$ is a rating for movie $i$ by user $j$. Print the **shape** of the matrix produced.


In [13]:
# Create the matrix
### use numpy to create a ratings data matrix
nr_users = np.max(ratings_data.UserID.values)
nr_movies = np.max(ratings_data.MovieID.values)
ratings_matrix = np.ndarray(shape=(nr_users, nr_movies),dtype=np.uint8)
ratings_matrix[ratings_data.UserID.values - 1, ratings_data.MovieID.values - 1] = ratings_data.Ratings.values

In [14]:
# Print the shape
ratings_matrix.shape

(6040, 3952)

In [15]:
ratings_matrix

array([[  5, 202, 225, ...,   0,   0,   0],
       [240, 192, 228, ...,   0,   0,   0],
       [240, 192, 228, ...,   0,   0,   0],
       ...,
       [ 96, 204,  78, ...,   0,   0,   0],
       [240,  76,  88, ...,   0,   0,   0],
       [  3,  39,  86, ...,   0,   0,   0]], dtype=uint8)

Normalizing the ratings matrix using Z-score normalization. While we can't use `sklearn`'s `StandardScaler` for this step, we can do the statistical calculations ourselves to normalize the data.

In [16]:
print(data.isna().sum())

UserID        0
MovieID       0
Ratings       0
Timestamp     0
Gender        0
Age           0
Occupation    0
Zixp-code     0
Title         0
Genres        0
dtype: int64


In [17]:
# ratings_col_average = np.mean(ratings_matrix, axis = 0)
ratings_col_average = ratings_matrix.mean(axis = 0)
print(ratings_col_average)
ratings_matrix = (ratings_matrix - ratings_col_average)

[1.04686921e+02 1.35459272e+02 1.15657119e+02 ... 3.27814570e-02
 2.58278146e-02 2.42880795e-01]


In [18]:
ratings_matrix = (ratings_matrix - ratings_matrix.mean(axis = 0))/ratings_matrix.std(axis = 0)
ratings_matrix[np.isnan(ratings_matrix)] = 0

  ratings_matrix = (ratings_matrix - ratings_matrix.mean(axis = 0))/ratings_matrix.std(axis = 0)


In [19]:
ratings_matrix.shape

(6040, 3952)

In [20]:
ratings_matrix

array([[-1.03580722,  0.83591235,  1.31608218, ..., -0.09136796,
        -0.07885485, -0.25386356],
       [ 1.40598449,  0.71028819,  1.35219103, ..., -0.09136796,
        -0.07885485, -0.25386356],
       [ 1.40598449,  0.71028819,  1.35219103, ..., -0.09136796,
        -0.07885485, -0.25386356],
       ...,
       [-0.09026234,  0.86103719, -0.45325185, ..., -0.09136796,
        -0.07885485, -0.25386356],
       [ 1.40598449, -0.74695214, -0.33288899, ..., -0.09136796,
        -0.07885485, -0.25386356],
       [-1.05658842, -1.21176156, -0.35696157, ..., -0.09136796,
        -0.07885485, -0.25386356]])

We're now going to perform Singular Value Decomposition (SVD) on the normalized ratings matrix.

In [21]:
# Compute the SVD of the normalised matrix
U, S, V = np.linalg.svd(ratings_matrix)

In [22]:
# Print the shapes
print("Shape of U is", U.shape)
print("Shape of S is", S.shape)
print("Shape of V is", V.shape)

Shape of U is (6040, 6040)
Shape of S is (3952,)
Shape of V is (3952, 3952)


Reconstructing four rank-k rating matrix $R_k$, where $R_k = U_kS_kV_k^T$ for k = [100, 1000, 2000, 3000].

In [23]:
r_1000 = None
for k in [100, 1000, 2000, 3000]:
  u_k = np.matrix(U[:, :k])
  s_k = np.diag(S[:k])
  v_k = np.matrix(V[:k, :])
  r_k = np.dot(np.dot(u_k, s_k), v_k)
  print("R", k, "shape:", r_k.shape)
  print(r_k)
  if k == 1000:
    r_1000 = r_k

R 100 shape: (6040, 3952)
[[ 0.09723825  0.5324566   1.21585487 ... -0.18679382 -0.01173059
  -0.17023171]
 [ 0.40711757  0.64906767  1.08668488 ... -0.06848748 -0.09136479
  -0.13488144]
 [ 0.55551149  0.4215691   1.28758247 ... -0.06718757 -0.02194689
  -0.2679966 ]
 ...
 [-0.38322408  0.34219653 -0.40044199 ... -0.03891093 -0.07205773
  -0.29803141]
 [ 0.60215119 -0.6703222  -0.51958483 ... -0.01094536 -0.24857907
  -0.28515274]
 [ 0.19125636 -0.83883476 -0.12176604 ... -0.60508681  0.30209566
  -0.69297208]]
R 1000 shape: (6040, 3952)
[[-0.63184106  0.73018712  1.13127481 ... -0.49639322 -0.1593363
  -0.21711673]
 [ 1.04335126  0.91240671  0.75161515 ... -0.10908332 -0.02911136
  -0.28862726]
 [ 0.11633925  0.56136193  1.23716647 ...  0.18865481 -0.0212118
  -0.17609038]
 ...
 [-0.31912257  0.70547138 -0.20498886 ...  0.24464518  0.17725873
  -0.30711504]
 [ 0.91162276 -0.93272326 -0.27501697 ...  0.20310239 -0.49451376
  -0.03848966]
 [-0.32267166 -0.71513125 -0.39829088 ... -0.06

### Cosine Similarity
Cosine similarity is a metric used to measure how similar two vectors are. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. Cosine similarity is high if the angle between two vectors is 0, and the output value ranges within $cosine(x,y) \in [0,1]$. $0$ means there is no similarity (perpendicular), where $1$ (parallel) means that both the items are 100% similar.

$$ cosine(x,y) = \frac{x^T y}{||x|| ||y||}  $$

**Based on the reconstruction rank-1000 rating matrix $R_{1000}$ and the cosine similarity,** sorting the movies which are most similar. Using Function `top_cosine_similarity` which sorts data by its similarity to a movie with ID `movie_id` and returns the top $n$ items, and a second function `print_similar_movies` which prints the titles of said similar movies. Return the top 5 movies for the movie with ID `1377` (*Batman Returns*):

In [24]:
def top_cosine_similarity(data, movieID, topN = 5):
  x_t = data[:, movieID - 1]
  y = data
  magnitude_x = np.linalg.norm(x_t)
  magnitude_y = np.linalg.norm(y)
  cosineSimilarity = np.dot(x_t, y)/ (magnitude_x * magnitude_y)
  topSortedIndices = np.argsort(-cosineSimilarity)
  returnIndices = topSortedIndices[1: topN + 1]
  return returnIndices

def print_similar_movies(movie_data,movieID,top_indexes):
  print('Most Similar movies: ')
  for id in top_indexes + 1:
      print(movie_data[movie_data["MovieID"] == id]["Title"].values[0])

In [25]:
k = 1000
movie_id = 1377
top_n = 5

ydata = V[:k, :]
indexes = top_cosine_similarity(ydata, movie_id, top_n)
print_similar_movies(movies_data, movie_id, indexes)

Most Similar movies: 
Batman Forever (1995)
Batman & Robin (1997)
Star Trek: Generations (1994)
Mirror Has Two Faces, The (1996)
Tall Tale (1994)
