# Movie Recommender System Using Dimensionality-Reduction Techniques

In [2]:
import numpy as np
import pandas as pd
import math
import statistics
import matplotlib.pyplot as plt

from numpy.linalg import svd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, NMF

## Scope:
For this `movie_review` dataset, we have reviews from 2670 users (i.e. rows of observations) for 2153 movies across 19 different genres. Because of its high dimensionality, let's try three different dimensionality-reduction techniques:<br>a) Singular Value Decomposition (SVD)<br>b) Non-negative Matrix Factorization (NMF)<br>c) Soft-impute Matrix Completion

With each technique, we'll then reconstruct the `movie_review` matrix from the components and assess how effective the reconstruction is using a metric called "average difference". This measures how well our reconstructed matrix captures the original matrix. It is the root-mean-square of the differences $x_{true} - x_{recon}$ over the $N$ non-missing entries of the original matrix: $$\sqrt{\frac{\sum (x_{true} - x_{recon})^2}{N}}$$
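As a minimal sketch of this metric, the helper below computes the root-mean-square difference over the non-missing entries only. The name `masked_rmse` and the toy matrices are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def masked_rmse(x_true, x_recon):
    """Root-mean-square difference over the entries of x_true that are not NaN."""
    observed = ~np.isnan(x_true)            # True where a real rating exists
    n_observed = observed.sum()             # this is N in the formula
    squared_error = (x_true[observed] - x_recon[observed]) ** 2
    return float(np.sqrt(squared_error.sum() / n_observed))

# Hypothetical 2x2 example: one rating is missing in x_true
x_true = np.array([[5.0, np.nan],
                   [3.0, 1.0]])
x_recon = np.array([[4.0, 2.5],
                    [3.0, 3.0]])

toy_rmse = masked_rmse(x_true, x_recon)     # sqrt((1 + 0 + 4) / 3)
print(toy_rmse)
```

The reconstructed value at the missing position (2.5) never enters the sum, which is the same effect the notebook gets from `np.nansum` divided by `N`.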

Finally, for each technique, we'll use the reconstructed matrix to recommend the movies that are quantitatively "best" for one arbitrary user.

First, let's read in the dataset

In [3]:
movie_review_tibble = pd.read_csv("MovieReviewMat.csv",index_col=0, low_memory=False)
movie_review_tibble

Unnamed: 0,Toy Story (1995),Jumanji (1995),Grumpier Old Men (1995),Waiting to Exhale (1995),Father of the Bride Part II (1995),Heat (1995),Sabrina (1995),Tom and Huck (1995),Sudden Death (1995),GoldenEye (1995),...,Madame Sin (1972),Illegally Yours (1988),A Daughter's Nightmare (2014),Beside Still Waters (2013),Fallen (2005),The Brittany Murphy Story (2014),Up from the Depths (1979),Tumult (2012),Love's Abiding Joy (2006),Bullyparade - Der Film (2017)
1,Adventure|Animation|Children|Comedy|Fantasy,Adventure|Children|Fantasy,Comedy|Romance,Comedy|Drama|Romance,Comedy,Action|Crime|Thriller,Comedy|Romance,Adventure|Children,Action,Action|Adventure|Thriller,...,Thriller,Comedy|Romance,Drama|Mystery|Thriller,(no genres listed),Drama,Drama,Horror,(no genres listed),Action|Drama|Romance,Comedy
2,3.5,3.5,3,,,,,,,,...,,,,,3.5,,,,,
3,2.5,,,,,3,,,,3,...,,,,,,,,,,
4,5,4.5,,,,5,,,,4,...,4,,,,,,,,,
5,4.5,4,,,,4,,,3,3.5,...,,,,,,,4,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2667,3,2.5,1.5,,,4.5,2,,3,3.5,...,4.5,,4,,3.5,4.5,3,,,3.5
2668,4,1,,,,3.5,3,,,,...,,,,,3,,,,,3.5
2669,3,3,3.5,,2,,,,,4,...,,,,,,,,,,
2670,4.5,,,,,3,,,,,...,,,,2,4.5,,,,,1


In [3]:
movie_review_names = np.array(movie_review_tibble.columns)
movie_review_genre = np.array(movie_review_tibble.iloc[0])
movie_review = np.array(movie_review_tibble.drop([1]), dtype=float)
movie_review

array([[3.5, 3.5, 3. , ..., nan, nan, nan],
       [2.5, nan, nan, ..., nan, nan, nan],
       [5. , 4.5, nan, ..., nan, nan, nan],
       ...,
       [3. , 3. , 3.5, ..., nan, nan, nan],
       [4.5, nan, nan, ..., nan, nan, 1. ],
       [4.5, 2.5, 0.5, ..., nan, nan, nan]])

Based on the results below, the single highest and lowest ratings given to any film are 5 and 0.5, respectively, so the observed ratings run from 0.5 (worst) to 5 (best).<br>

The average review is about 3.30. The cell below computes it as `np.nanmean(movie_review)`, which pools every non-missing rating in the matrix. An alternative is `np.mean(np.nanmean(movie_review,axis=1))`, where `axis=1` means we first take the average for each user (across all the movies they did review) and then take the mean across the 2670 users. If we had used `axis=0`, we would first be taking the average review **for each of the 2153 films**. These three averages generally give slightly different numbers.

In [4]:
print("Highest review given = " + str(np.nanmax(movie_review)))
print("Lowest review given = " + str(np.nanmin(movie_review)))
print("Overall average review (across all non-missing ratings) = " + str(np.nanmean(movie_review)))

Highest review given = 5.0
Lowest review given = 0.5
Overall average review (across all non-missing ratings) = 3.297080189685473
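To make the difference between the possible averages concrete, here is a small sketch on a made-up 2x3 matrix (the values are assumptions chosen so the arithmetic is easy to follow):

```python
import numpy as np

# Hypothetical ratings: user 0 rated one film, user 1 rated all three
toy = np.array([[5.0, np.nan, np.nan],
                [1.0, 2.0,    3.0]])

pooled_mean = np.nanmean(toy)                           # (5+1+2+3)/4 = 2.75
mean_of_user_means = np.mean(np.nanmean(toy, axis=1))   # (5 + 2)/2 = 3.5
mean_of_movie_means = np.mean(np.nanmean(toy, axis=0))  # (3 + 2 + 3)/3 = 2.666...

print(pooled_mean, mean_of_user_means, mean_of_movie_means)
```

The pooled mean weights prolific reviewers more heavily, while the mean of per-user means gives every user equal weight regardless of how many films they rated.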


### User 1462
We'll choose an arbitrary user for our project. Let's go with user1462

For this user, there are 11 movies that they gave the highest possible score (5) - see below for the names of all 11 movies. <br>Across these 11 movies, the most common genre is **Drama** - it is present in all 11 movies. We'll need these genres later on when we build the movie recommendations.

In [5]:
user1462 = movie_review[1460] #user 1462
index_review_1462 = np.where(user1462 == 5)[0] #len = 11

for i in index_review_1462:
    movie = movie_review_names[i]
    print(movie + " Genre:" + movie_review_tibble.loc[1,movie]) #print name of movies along with genre

Taxi Driver (1976) Genre:Crime|Drama|Thriller
Rob Roy (1995) Genre:Action|Drama|Romance|War
Little Princess, A (1995) Genre:Children|Drama
William Shakespeare's Romeo + Juliet (1996) Genre:Drama|Romance
Quiet Man, The (1952) Genre:Drama|Romance
Raging Bull (1980) Genre:Drama
Pink Floyd: The Wall (1982) Genre:Drama|Musical
Field of Dreams (1989) Genre:Children|Drama|Fantasy
Man Who Would Be King, The (1975) Genre:Adventure|Drama
Playing by Heart (1998) Genre:Drama|Romance
War of the Worlds, The (1953) Genre:Action|Drama|Sci-Fi


## Choosing the number of components
Before we can carry out any techniques, we first need to pick a specific number of components.

In terms of genre preference, I expect to see around as many types of users as there are genres. We can define a user's "preference" as the most common genre among their most highly rated movies. Under this definition, one individual has at most one "genre preference".

If we take a look at the first row of the tibble (containing movie genres), there appear to be 19 genres in total (Action, Adventure, Animation, Children, Comedy, Crime, Documentary, Drama, Fantasy, Film-Noir, Horror, IMAX, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western). So I'd expect to see around this many types of individuals. 

In conclusion, we'll use 19 components, representing the 19 possible genres that an individual can be interested in.
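The genre count can be checked programmatically rather than by eye. The sketch below shows the idea on a short, made-up list of pipe-delimited genre strings; on the real data the same loop could be run over `movie_review_genre`. The placeholder string `(no genres listed)` is taken from the dataset preview and is skipped here so that it is not counted as a genre.

```python
# Hypothetical stand-in for the genre row of the tibble
genre_row = ["Comedy|Romance", "Action|Crime|Thriller", "(no genres listed)", "Drama"]

distinct_genres = set()
for entry in genre_row:
    if entry == "(no genres listed)":    # placeholder, not an actual genre
        continue
    distinct_genres.update(entry.split("|"))

n_components = len(distinct_genres)      # 6 for this toy list
print(sorted(distinct_genres), n_components)
```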

## Approach 1: SVD


In [6]:
#Creating a "quick-SVD" function
def svd_reconstruct(M, k):
    u_full, s_full, vh_full = np.linalg.svd(M, full_matrices=False)

    #return only the leading k components: np.linalg.svd sorts singular values in descending order, so these are the k largest
    u = u_full[:, 0:k]   # (m x k) matrix - slicing only the first k columns
    s = s_full[0:k]      # length-k 1-D array of the k largest singular values (np.diag(s) gives the k x k matrix)
    vh = vh_full[0:k, :] # (k x n) matrix - slicing only the first k rows

    return u, s, vh
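As a quick sanity check of the truncation idea, the sketch below builds a rank-`k` reconstruction of a small random matrix (an assumption for illustration) and confirms a standard property of truncated SVD: the Frobenius-norm error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))    # hypothetical small matrix
k = 2

u_full, s_full, vh_full = np.linalg.svd(M, full_matrices=False)

# Keep the k leading components; s_full[:k] is a 1-D array, so wrap it in np.diag
M_k = u_full[:, :k] @ np.diag(s_full[:k]) @ vh_full[:k, :]

frobenius_error = np.linalg.norm(M - M_k)              # Frobenius norm for 2-D input
discarded_energy = np.sqrt(np.sum(s_full[k:] ** 2))    # from the dropped singular values

print(frobenius_error, discarded_energy)               # the two numbers agree
```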

#### Imputing missing (nan) values via column-mean padding
The matrix has quite a lot of missing values, so first, we need to come up with a way to impute them. For this dataset, let's choose a method called column-mean padding.

**Column-mean padding**: In this method, any missing value <ins>for a given film</ins> is replaced by the average score (across all users who rated it) <ins>of said film</ins>. We're assuming that most people who watched film A (but didn't rate it) would probably end up with opinions somewhere in between the already existing scores. Only a small minority would have extreme opinions, and those users would probably have already rated film A had they felt so strongly about the movie. Consequently, it makes sense to fill in the "silent majority" with a metric that represents "in-between opinions": the arithmetic mean.

In [7]:
col_mean = np.nanmean(movie_review,axis=0) #mean of each column
row_col_coord = np.where(np.isnan(movie_review)) #first element tells us the row where there is a nan-values
                                                    #second element tells us the column where there is a nan-values
                                                    #together, each pair forms (row,column) coord of each nan-values

movie_review_colpad = movie_review.copy()
movie_review_colpad[row_col_coord] = np.take(col_mean,indices = row_col_coord[1])

#movie_review_colpad[row_col_coord] selects only the nan entries of the matrix (4628129 of them) to be filled in
#np.take extracts values from a particular array (col_mean in this case) at the given indices
## np.where lists the nan coordinates in row-major order; each nan receives the mean of its column (2nd element of row_col_coord)
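The same padding step is easier to follow on a tiny matrix. The values below are made up for illustration:

```python
import numpy as np

# Hypothetical 3 users x 2 films, with two missing ratings
toy = np.array([[4.0,    np.nan],
                [2.0,    3.0],
                [np.nan, 5.0]])

col_mean = np.nanmean(toy, axis=0)       # per-film means: [3.0, 4.0]
rows, cols = np.where(np.isnan(toy))     # coordinates of the missing entries

toy_padded = toy.copy()
toy_padded[rows, cols] = col_mean[cols]  # each gap gets the mean of its own column

print(toy_padded)
```

One caveat worth keeping in mind: a film with no ratings at all has no column mean, so `np.nanmean` would return `nan` for it (with a warning) and the gap would remain unfilled.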

#### Rank-19 reconstruction (with column-mean padding)

In [8]:
#Getting the SVD with k=19
u19_col,s19_col,vh19_col = svd_reconstruct(movie_review_colpad,k=19)
movie_review_reconstruct_2 = u19_col @ np.diag(s19_col) @ vh19_col

N = np.sum(~np.isnan(movie_review))
difference_matrix_2 = movie_review - movie_review_reconstruct_2

print("Total number of true non-missing values (N) = " + str(N))
print("Avg difference (for true non-missing values) between review & reconstructed matrix = " 
      + str(math.sqrt(np.nansum(difference_matrix_2**2)/N)))

Total number of true non-missing values (N) = 1120381
Avg difference (for true non-missing values) between review & reconstructed matrix = 0.7580686426265324


For this rank-19 reconstruction (with column-mean padding), the average difference in true (non-missing) values (i.e. $x_{true}-x_{recon}$) is <ins>about 0.76</ins>.

## Approach 2: NMF

#### Non-negative Matrix Factorization (NMF) with K=19 components

In [9]:
NMF_model = NMF(n_components=19, init = 'random', random_state = 0, max_iter = 200) #create NMF model object

U_19 = NMF_model.fit_transform(movie_review_colpad) #U_19.shape = (2670, 19): this is transformed movie_review w/ 19 components
Vt_19 = NMF_model.components_ #Vt_19.shape = (19, 2153) for 19 NMF components (V_transpose)
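To show what a non-negative factorization does under the hood, here is a NumPy-only sketch using the classic multiplicative update rules. This is not the solver used in the cell above (scikit-learn's `NMF` defaults to coordinate descent); the function name `nmf_multiplicative`, the toy matrix, and the iteration count are all assumptions for illustration.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, seed=0, eps=1e-10):
    """Factor a non-negative X (m x n) as W (m x k) @ H (k x n) with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative by construction
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))   # Frobenius reconstruction error
    return W, H, errors

X = np.random.default_rng(1).random((8, 6))        # hypothetical non-negative data
W, H, errors = nmf_multiplicative(X, k=3)

print(W.shape, H.shape, errors[0], errors[-1])
```

Here `W` plays the role of `U_19` and `H` the role of `Vt_19`: both are entrywise non-negative, which is what makes the components readable as additive "parts".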



#### `movie_review` reconstruction (with column-mean padding) using 19 NMF components

In [10]:
movie_review_reconstruct_19 = U_19 @ Vt_19
difference_matrix_19 = movie_review - movie_review_reconstruct_19

print("Total number of true non-missing values (N) = " + str(N))
print("Avg difference (for true non-missing values) between review & reconstructed matrix = " 
      + str(math.sqrt(np.nansum(difference_matrix_19**2)/N)))

Total number of true non-missing values (N) = 1120381
Avg difference (for true non-missing values) between review & reconstructed matrix = 0.7683549412326072


In this 19-component NMF-based reconstruction, the average difference in true (non-missing) values (i.e. $x_{true}-x_{recon}$) is <ins>about 0.77</ins>.

### Comparison to SVD-based average difference: 
The average difference is <ins>almost identical</ins> to that found using SVD. The column-mean SVD has an average difference of 0.76. If we're splitting hairs, then this NMF has a worse average difference than column-mean SVD (0.768 > 0.758).

## Approach 3: Soft-impute Matrix Completion
#### Soft imputation
First, let's implement the soft imputation algorithm using a 100-iteration for-loop, keeping k=5 components and a penalty of lambda=100 in each iteration.

In [11]:
#Iteration t=0: initialize intermediate matrix Y with NaN = column-mean padding
Y_next = movie_review_colpad.copy() 
lambda_pen = 100

for iteration in range(0,100):
    u5,s5,vh5 = svd_reconstruct(Y_next,k=5)
    
    #Applying soft imputation (soft-thresholding the singular values) using penalty lambda = 100
    s5 = np.maximum(s5 - lambda_pen, 0)

    #Reconstruct Z^(t+1) for next iteration
    Z_next = u5 @ np.diag(s5) @ vh5

    #Iteration (t+1): fill missing values in movie_review w/ recon values in Z_next above
    Y_next = np.where(np.isnan(movie_review), Z_next, movie_review)

[[3.5        3.5        3.         ... 3.35294118 2.91428571 3.65909091]
 [2.5        3.02286902 2.77336198 ... 3.35294118 2.91428571 3.65909091]
 [5.         4.5        2.77336198 ... 3.35294118 2.91428571 3.65909091]
 ...
 [3.         3.         3.5        ... 3.35294118 2.91428571 3.65909091]
 [4.5        3.02286902 2.77336198 ... 3.35294118 2.91428571 1.        ]
 [4.5        2.5        0.5        ... 3.35294118 2.91428571 3.65909091]]


Now, after the final iteration (100), our last reconstructed matrix - `Z_next` - will be the one we use to find the top three movies to recommend to user1462.
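The loop above can be packaged as a small function and tried on a toy matrix. The sketch below follows the same steps (column-mean start, truncated SVD, soft-thresholding, refill of the missing entries); the function name `soft_impute`, the toy ratings, and the choice of `k=2`, `lam=0.5` are assumptions for illustration only.

```python
import numpy as np

def soft_impute(X, k, lam, n_iter=100):
    """Iteratively fill the NaN entries of X using a soft-thresholded rank-k SVD."""
    missing = np.isnan(X)
    col_mean = np.nanmean(X, axis=0)
    Y = np.where(missing, col_mean, X)              # iteration 0: column-mean padding
    for _ in range(n_iter):
        u, s, vh = np.linalg.svd(Y, full_matrices=False)
        s_shrunk = np.maximum(s[:k] - lam, 0.0)     # shrink the k leading singular values
        Z = u[:, :k] @ np.diag(s_shrunk) @ vh[:k, :]
        Y = np.where(missing, Z, X)                 # keep observed ratings, refill the gaps
    return Z, Y

# Hypothetical 4 users x 3 films
X = np.array([[5.0,    4.0,    np.nan],
              [4.0,    np.nan, 2.0],
              [np.nan, 3.0,    1.0],
              [5.0,    4.0,    2.0]])

Z, Y = soft_impute(X, k=2, lam=0.5, n_iter=50)
print(np.round(Y, 2))
```

`Z` is the low-rank reconstruction used for recommendations, while `Y` is the completed matrix in which the observed ratings are left untouched.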

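#### `movie_review` reconstruction and average difference evaluation
In this soft-imputed matrix reconstruction (k=5 components, as set in the loop above), the average difference in true (non-missing) values (i.e. $x_{true}-x_{recon}$) is <ins>about 0.84</ins>.


### Comparison to Approach 1 and 2:
Compared to the previous two, this one is the least effective, with the greatest average difference: 0.843 > 0.768 > 0.758. Note that this run keeps only 5 components and shrinks each singular value by 100, whereas the SVD and NMF reconstructions used 19 components, so the comparison is not like-for-like.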

In [12]:
difference_matrix_soft = movie_review - Z_next

print("Total number of true non-missing values (N) = " + str(N))
print("Avg difference (for true non-missing values) between review & reconstructed matrix = " 
      + str(math.sqrt(np.nansum(difference_matrix_soft**2)/N)))

Total number of true non-missing values (N) = 1120381
Avg difference (for true non-missing values) between review & reconstructed matrix = 0.843108776861315


## Recommender system
#### Recommender system - Top 1% 
For the recommender system, let's find the movies above the 99th percentile of user1462's reconstructed ratings. There are 22 movies in this "top 1%". 

Additionally, we need to filter out the movies that user1462 has already seen/rated.

## Approach 1 (cont.): SVD-based Recommendations


In [13]:
user1462_rec = movie_review_reconstruct_2[1460] #user1462
top_1 = np.where(user1462_rec > np.quantile(user1462_rec,0.99))[0] #indices where a movie is in top 1% for user1462
top_movie, top_genre, top_score = [],[],[]

#For-loop to check and store data for movies that meet the criteria
for i in top_1:
    movie = movie_review_names[i]
    genre = movie_review_genre[i]
    score = user1462[i]
    
    if math.isnan(score):
        top_movie.append(movie), top_genre.append(genre), top_score.append(user1462_rec[i])

#Create data frame containing movies recommended
rec_data = pd.DataFrame({'Movie Recommended':top_movie,
                         'Genre':top_genre,
                         'Rating Reconstruct':top_score})
rec_data = rec_data.sort_values(by=['Rating Reconstruct'],ascending=False).reset_index(drop=True)
rec_data

Unnamed: 0,Movie Recommended,Genre,Rating Reconstruct
0,Before Sunset (2004),Drama|Romance,4.097509
1,Apocalypse Now (1979),Action|Drama|War,4.046136
2,Blade II (2002),Action|Horror|Thriller,4.04406
3,Love Actually (2003),Comedy|Drama|Romance,4.034459
4,Being There (1979),Comedy|Drama,4.015774
5,"Slums of Beverly Hills, The (1998)",Comedy|Drama,3.997325
6,Titus (1999),Drama,3.992225
7,My Dog Skip (1999),Children|Drama,3.991485
8,"They Shoot Horses, Don't They? (1969)",Drama,3.991173
9,Reindeer Games (2000),Action|Thriller,3.98888


**$\implies$** The table above orders the movies by their respective reconstructed ratings. <ins>The top three recommended movies are "Before Sunset", "Apocalypse Now", and "Blade II".</ins> 

**Evaluation**: The 3rd-highest movie ("Blade II") does not align with my expectations because one of its genres (Horror) is not in the set of genres (see below) belonging to the 5-rated movies that user1462 originally had. "Apocalypse Now", by contrast, is fine on this count: Action, Drama and War all appear among the 5-rated movies. Because the genre of a movie is important to a person's liking, recommending "Blade II" may backfire. Therefore, let's see if we can find movies that are in the "top 1%" and also cater to user1462's preferences.
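The selection step used in the cell above (threshold at a quantile, then keep only the movies the user has not rated) is repeated for each approach, so it could be wrapped in one helper. The sketch below is one possible way to do that; the name `recommend_unseen` and the toy arrays are hypothetical.

```python
import numpy as np

def recommend_unseen(original_row, recon_row, names, quantile=0.99):
    """Return (name, reconstructed rating) pairs above the quantile that the user has not rated."""
    threshold = np.quantile(recon_row, quantile)
    keep = np.where((recon_row > threshold) & np.isnan(original_row))[0]
    order = keep[np.argsort(recon_row[keep])[::-1]]     # highest reconstructed rating first
    return [(str(names[i]), float(recon_row[i])) for i in order]

# Hypothetical user: films A and D are already rated
names = np.array(["A", "B", "C", "D", "E"])
original = np.array([5.0, np.nan, np.nan, 2.0, np.nan])
recon = np.array([4.9, 4.2, 4.6, 2.1, 3.0])

toy_recs = recommend_unseen(original, recon, names, quantile=0.4)
print(toy_recs)   # [('C', 4.6), ('B', 4.2)]
```

Film A clears the threshold but is dropped because the user already rated it, which mirrors the `math.isnan(score)` check in the notebook's loop.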

#### Additional Recommendations
First, let's build a set that contains all 11 genres appearing among user1462's favorite movies.

In [14]:
genre_rec = set()

for i in index_review_1462:
    genre = movie_review_genre[i]
    genre_list = genre.split('|')
    for genre in genre_list:
        genre_rec.add(genre)

print(genre_rec)

{'Adventure', 'Drama', 'Romance', 'Thriller', 'Sci-Fi', 'Action', 'Crime', 'Musical', 'Children', 'War', 'Fantasy'}


In [15]:
top_movie, top_genre, top_score = [],[],[]

#For-loop to check and store data for movies that meet the criteria
for i in top_1:
    movie = movie_review_names[i]
    genre = movie_review_genre[i]
    score = user1462[i]
    genre_list = genre.split('|')
    
    if all(genre in genre_rec for genre in genre_list) and math.isnan(score): 
        top_movie.append(movie), top_genre.append(genre), top_score.append(user1462_rec[i])
        #filter for only movies with all genre in common w/ 5-rated list and user1462 has not yet seen

#Create data frame containing movies recommended
rec_data = pd.DataFrame({'Movie Recommended':top_movie,
                         'Genre':top_genre,
                         'Rating Reconstruct':top_score})
rec_data = rec_data.sort_values(by=['Rating Reconstruct'],ascending=False).reset_index(drop=True)
rec_data

Unnamed: 0,Movie Recommended,Genre,Rating Reconstruct
0,Before Sunset (2004),Drama|Romance,4.097509
1,Apocalypse Now (1979),Action|Drama|War,4.046136
2,Titus (1999),Drama,3.992225
3,My Dog Skip (1999),Children|Drama,3.991485
4,"They Shoot Horses, Don't They? (1969)",Drama,3.991173
5,Reindeer Games (2000),Action|Thriller,3.98888


$\implies$ Now, "Blade II" has been replaced by "Titus". 
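The strict genre condition used above amounts to a subset test: every genre of a candidate movie must be contained in the set of liked genres. A minimal sketch, with a shortened, hypothetical set of liked genres and three candidate genre strings:

```python
# Hypothetical, shortened version of the liked-genre set
liked_genres = {"Drama", "Romance", "Action", "War"}

def all_genres_liked(genre_string, liked):
    """True when every pipe-separated genre in genre_string is in the liked set."""
    return set(genre_string.split("|")) <= liked     # <= is the subset operator for sets

candidates = ["Drama|Romance", "Action|Horror|Thriller", "Action|Drama|War"]
passing = [g for g in candidates if all_genres_liked(g, liked_genres)]

print(passing)   # ['Drama|Romance', 'Action|Drama|War']
```

A looser rule, such as requiring at least one shared genre, would keep more candidates; the notebook deliberately uses the strict version.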

#### Conclusion: 
Based on the SVD approach, I'd recommend **<ins>"Before Sunset", "Apocalypse Now" and "Titus"</ins>** as the top three because every genre of each of these movies also appears among the movies that user1462 originally rated highest.

## Approach 2 (cont.): NMF-based Recommendations

In [16]:
user1462_rec = movie_review_reconstruct_19[1460] #user1462
top_1 = np.where(user1462_rec > np.quantile(user1462_rec,0.99))[0] #indices where a movie is in top 1% for user1462
top_movie, top_genre, top_score = [],[],[]

#For-loop to check and store data for movies that meet the criteria
for i in top_1:
    movie = movie_review_names[i]
    genre = movie_review_genre[i]
    score = user1462[i]
    
    if math.isnan(score):
        top_movie.append(movie), top_genre.append(genre), top_score.append(user1462_rec[i])

#Create data frame containing movies recommended
rec_data = pd.DataFrame({'Movie Recommended':top_movie,
                         'Genre':top_genre,
                         'Rating Reconstruct':top_score})
rec_data = rec_data.sort_values(by=['Rating Reconstruct'],ascending=False).reset_index(drop=True)
rec_data

Unnamed: 0,Movie Recommended,Genre,Rating Reconstruct
0,Before Sunset (2004),Drama|Romance,4.089317
1,K-PAX (2001),Drama|Fantasy|Mystery|Sci-Fi,4.000641
2,Apocalypse Now (1979),Action|Drama|War,3.959615
3,My Dog Skip (1999),Children|Drama,3.954048
4,Wild at Heart (1990),Crime|Drama|Mystery|Romance|Thriller,3.945417
5,Reindeer Games (2000),Action|Thriller,3.938943
6,The Falcon and the Snowman (1985),Crime|Drama|Thriller,3.934128
7,"Slums of Beverly Hills, The (1998)",Comedy|Drama,3.928364


**$\implies$** The table above orders the movies by their respective reconstructed ratings.

**Evaluation**: The 2nd-highest movie ("K-PAX") does not align with my expectations because one of its genres (Mystery) is not in the set of genres belonging to the 5-rated movies that user1462 originally had. "Apocalypse Now", in 3rd place, is fine on this count: Action, Drama and War all appear in that set. Because the genre of a movie is important to a person's liking, recommending "K-PAX" may backfire. Therefore, let's see if we can find movies that are in the "top 1%" and also cater to user1462's preferences.

Again, let's filter for any "top 1%" movie that shares these genres. The condition is strict: if a movie has more than one genre, all of those genres must be in this set.

In [17]:
top_movie, top_genre, top_score = [],[],[]

#For-loop to check and store data for movies that meet the criteria
for i in top_1:
    movie = movie_review_names[i]
    genre = movie_review_genre[i]
    score = user1462[i]
    genre_list = genre.split('|')
    
    if all(genre in genre_rec for genre in genre_list) and math.isnan(score): 
        top_movie.append(movie), top_genre.append(genre), top_score.append(user1462_rec[i])
        #filter for only movies with all genre in common w/ 5-rated list and user1462 has not yet seen

#Create data frame containing movies recommended
rec_data = pd.DataFrame({'Movie Recommended':top_movie,
                         'Genre':top_genre,
                         'Rating Reconstruct':top_score})
rec_data = rec_data.sort_values(by=['Rating Reconstruct'],ascending=False).reset_index(drop=True)
rec_data

Unnamed: 0,Movie Recommended,Genre,Rating Reconstruct
0,Before Sunset (2004),Drama|Romance,4.089317
1,Apocalypse Now (1979),Action|Drama|War,3.959615
2,My Dog Skip (1999),Children|Drama,3.954048
3,Reindeer Games (2000),Action|Thriller,3.938943
4,The Falcon and the Snowman (1985),Crime|Drama|Thriller,3.934128


$\implies$ Now, "K-PAX" has been replaced by "My Dog Skip".

#### Conclusion: 
Based on this NMF, I'd recommend **<ins>"Before Sunset", "Apocalypse Now" and "My Dog Skip"</ins>** as the top three because every genre of each of these movies also appears among the movies that user1462 originally rated highest.

## Approach 3 (cont.): Soft-imputed Matrix Completion Recommendations
#### Recommender system - Top 5% 
Again, let's first find the movies above the 95th percentile of user1462's reconstructed ratings. For this user, I expanded the search to the "top 5%" because the 99th percentile ("top 1%") did not yield any movies: user1462 has already watched all the movies in the "top 1%".

The 95th percentile resulted in 10 movies that user1462 has not watched (see below).

In [18]:
Z_final = Z_next #the latest Z_next represents our final reconstructed matrix that will be used to recommend movies
user1462_rec = Z_final[1460] #user1462
top_1 = np.where(user1462_rec > np.quantile(user1462_rec,0.95))[0] #indices where a movie is in top 5% for user1462 (variable name kept from earlier cells)
top_movie, top_genre, top_score = [],[],[]

#For-loop to check and store data for movies that meet the criteria
for i in top_1:
    movie = movie_review_names[i]
    genre = movie_review_genre[i]
    score = user1462[i] #this is original score not reconstructed score

    if math.isnan(score):
        top_movie.append(movie), top_genre.append(genre), top_score.append(user1462_rec[i])

#Create data frame containing movies recommended
rec_data = pd.DataFrame({'Movie Recommended':top_movie,
                         'Genre':top_genre,
                         'Rating Reconstruct':top_score})
rec_data = rec_data.sort_values(by=['Rating Reconstruct'],ascending=False).reset_index(drop=True)
rec_data

Unnamed: 0,Movie Recommended,Genre,Rating Reconstruct
0,Local Hero (1983),Comedy,3.19157
1,Sgt. Bilko (1996),Comedy,3.131296
2,L.A. Story (1991),Comedy|Romance,3.119027
3,Atlantis: The Lost Empire (2001),Adventure|Animation|Children|Fantasy,3.115704
4,Dutch (1991),Comedy,3.113159
5,Casino (1995),Crime|Drama,3.106797
6,Jerry Maguire (1996),Drama|Romance,3.092969
7,Trees Lounge (1996),Drama,3.088268
8,Notorious (1946),Film-Noir|Romance|Thriller,3.079079
9,"Good, the Bad and the Ugly, The (Buono, il bru...",Action|Adventure|Western,3.072708


**$\implies$** The table above orders the movies by their respective reconstructed ratings. 

**Evaluation**: None of the three highest movies aligns with my expectations because their shared genre (Comedy) is not in the set of genres belonging to the 5-rated movies that user1462 originally had. Because the genre of a movie is important to a person's liking, one last time, let's see if we can find movies that are in the "top 5%" and also cater to user1462's preferences.

Again, the condition is strict: if a movie has more than one genre, all of those genres must be in this set.

In [19]:
top_movie, top_genre, top_score = [],[],[]

#For-loop to check and store data for movies that meet the criteria
for i in top_1:
    movie = movie_review_names[i]
    genre = movie_review_genre[i]
    score = user1462[i]
    genre_list = genre.split('|')
    
    if all(genre in genre_rec for genre in genre_list) and math.isnan(score): 
        top_movie.append(movie), top_genre.append(genre), top_score.append(user1462_rec[i])
        #filter for only movies with all genre in common w/ 5-rated list and user1462 has not yet seen

#Create data frame containing movies recommended
rec_data = pd.DataFrame({'Movie Recommended':top_movie,
                         'Genre':top_genre,
                         'Rating Reconstruct':top_score})
rec_data = rec_data.sort_values(by=['Rating Reconstruct'],ascending=False).reset_index(drop=True)
rec_data

Unnamed: 0,Movie Recommended,Genre,Rating Reconstruct
0,Casino (1995),Crime|Drama,3.106797
1,Jerry Maguire (1996),Drama|Romance,3.092969
2,Trees Lounge (1996),Drama,3.088268


$\implies$ Now, we're left with exactly three movies.

#### Conclusion: 
Based on the matrix completion and reconstruction using soft imputation, I'd recommend **<ins>"Casino", "Jerry Maguire", "Trees Lounge"</ins>** as the top three because every genre of each of these movies also appears among the movies that user1462 originally rated highest. Interestingly, this approach recommends three completely different movies compared to the SVD and NMF approaches.

## Comparison:
After a simple run of dimension reduction using a 19-component SVD, a 19-component NMF, and soft-impute matrix completion, we have arrived at a top-three recommendation for user1462 from each approach:

**SVD-based recommendation:** "Before Sunset", "Apocalypse Now" and "Titus".<br>**NMF-based recommendation:** "Before Sunset", "Apocalypse Now" and "My Dog Skip".<br>**Soft-imputed Matrix Completion recommendation:** "Casino", "Jerry Maguire" and "Trees Lounge".
<br><br>$\implies$ The only movie that differs between SVD and NMF is "Titus" vs. "My Dog Skip". Interestingly, the soft-imputation approach recommends three completely different movies.

## Why should one choose SVD or NMF?
#### Reason for SVD:
The left and right factors from SVD have orthonormal columns: the columns of U and of V each form <ins>an orthonormal basis and are therefore linearly independent</ins>. Because of this, the components of a low-rank approximation are mutually orthogonal, which sidesteps the multicollinearity issue (i.e. movies correlated with each other). In a review dataset like this one, where many product columns (movies) have somewhat overlapping characteristics (genres), the matrix is very high-dimensional. Consequently, this benefit of SVD is quite useful. 
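The orthonormality claim is easy to verify numerically. The sketch below uses a small random matrix (an assumption for illustration) and checks that the Gram matrices of the SVD factors are identity matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 3))      # hypothetical small matrix

U, s, Vh = np.linalg.svd(M, full_matrices=False)

gram_U = U.T @ U                 # 3 x 3, identity if the columns of U are orthonormal
gram_V = Vh @ Vh.T               # 3 x 3, identity if the columns of V are orthonormal

print(np.round(gram_U, 6))
print(np.round(gram_V, 6))
```

NMF factors carry no such guarantee: their columns are non-negative but generally not orthogonal, which is the trade-off discussed in the next section.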

#### Reason for NMF:
A commonly cited advantage NMF has over SVD is that it is more interpretable - and that is very true for this dataset. Because every entry of the NMF factors is non-negative, each component picks out a different part of the dataset, and as a result our data tend to lie close to specific axes (i.e. dimensions). We can visualize each data point as if it belongs to one single dimension (or some combination of dimensions). If we are able to define our dimensions as separate categories, we'd be able to interpret the NMF components meaningfully. This is perfect for our `movie_review` because it already has a well-defined set of categories: genres.

$\implies$ For this reason, <ins>I'd recommend using NMF</ins>, especially if we want to test different combinations of genres as separate dimensions (i.e. Comedy/Drama vs. Horror/Thriller) rather than just 19 individual genres (my initial approach).