<div style="text-align: right">INFO 6105 Data Science Eng Methods and Tools, Final Project</div> <br> <div style="text-align: right; color: green">
Movie Comment NLP & Recommender System</div> 
<div style="text-align: right">Zhe Xu, Yixin Guo, Qiuyi Zhang</div>

## Part II

This part of the final project covers the recommender system.


Generally, there are three categories of recommendation systems: content-based systems, collaborative filtering systems, and hybrid systems.

Collaborative filtering systems rest on the idea that when two users share the same taste in one movie, they are likely to share the same taste in another. For example, if users A and B both love movie 1, and A loves movie 2, then B might also love movie 2.

Content-based filtering systems rest on the similarity between movies. That is, if movies 1 and 2 are similar, a user who enjoyed movie 1 might also love movie 2.

Hybrid systems, as the name indicates, combine the other two.

The two approaches:

<center>
<img src="image/two_approach.png" width=600 />
</center>

In theory, these are the commonly used recommendation algorithms; of these, KNN is used in this project.
    
<center>
<img src="image/algo_types.png" width=150 />
</center>


Collaborative filtering systems use the actions of users to recommend other movies. In general, they can either be user-based or item-based. 

To develop our recommender system, we used an item-based collaborative filtering algorithm.

The item-based approach is often considered better than the user-based approach. The user-based approach is often harder to scale because of the dynamic nature of users, whereas items usually don’t change much, so an item-based approach can often be computed offline and served without constant re-training.

Collaborative filtering systems work with a rating matrix. The simplest algorithm computes the cosine or correlation similarity of rows (users) or columns (items) and recommends items that the k nearest neighbors (KNN) enjoyed.
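As a minimal sketch of the item-based idea, the example below builds a hypothetical three-movie, four-user rating matrix (the values and the names `toy_ratings`, `item_sim`, and `neighbors_of_a` are made up for illustration) and ranks movies by cosine similarity. It follows the layout used later in this notebook, where movies are rows and users are columns.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: rows are movies, columns are users, 0 means "not rated".
toy_ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],  # movie A
    [4.0, 5.0, 0.0, 0.0],  # movie B
    [0.0, 0.0, 5.0, 4.0],  # movie C
])

# Item-based: compare rows (movies). A user-based variant would compare columns.
item_sim = cosine_similarity(toy_ratings)

# Rank the other movies by similarity to movie A (row 0), most similar first.
order = np.argsort(-item_sim[0])
neighbors_of_a = [int(i) for i in order if i != 0]
print(neighbors_of_a)
```

Movie B ranks ahead of movie C here because the first two users rated A and B alike, while C was mostly rated by different users.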

# KNN 

KNN is a machine learning algorithm that finds clusters of similar movies based on common movie ratings and makes predictions using the average rating of the top-k nearest neighbors.

KNN calculates the “distance” between the target movie and every other movie in its database, then ranks these distances and returns the top K nearest neighbor movies as the most similar movie recommendations.
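The "compute distances, rank, take the top K" procedure can be sketched by hand with NumPy. This is only an illustration under our own assumptions: the helper `top_k_neighbors` and the matrix `toy` are hypothetical, and the notebook itself relies on scikit-learn's `NearestNeighbors` rather than this function.

```python
import numpy as np

def top_k_neighbors(matrix, query_row, k):
    """Return the indices of the k rows closest to `query_row` by cosine distance."""
    norms = np.linalg.norm(matrix, axis=1)
    # Guard against all-zero rows (movies without ratings) to avoid dividing by zero.
    safe = np.where(norms == 0, 1.0, norms)
    sims = (matrix @ matrix[query_row]) / (safe * safe[query_row])
    distances = 1.0 - sims
    distances[query_row] = np.inf  # exclude the query movie itself
    return np.argsort(distances)[:k]

# Hypothetical ratings: 4 movies (rows) rated by 3 users (columns).
toy = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 5.0],
    [5.0, 5.0, 1.0],
])

nearest = top_k_neighbors(toy, query_row=0, k=2)
print(nearest)
```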

<center>
<img src="image/123.png" width=600 />
</center>

We can get a brief view of the KNN algorithm from the picture below:

<center>
<img src="image/knnway.jpg" width=600 />
</center>

In [3]:
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
import numpy as np

Load data of movies

In [4]:

df_movies = pd.read_csv('data/ml-latest-small/movies.csv')

Get a brief view of the movies' data

In [242]:

df_movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


Load data of ratings

In [243]:

df_ratings = pd.read_csv('data/ml-latest-small/ratings.csv')

Get a brief view of the ratings' data

In [244]:

df_ratings.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


We need to transform the ratings dataframe into a format that fits a KNN model by pivoting the ratings data from above, so that each row is a movie and each column is a user. We then fill the missing observations with 0, since we are going to perform linear algebra operations on the matrix.

In [245]:
# pivot from long format (one row per rating) to wide format (one row per movie, one column per user)
# fill missing ratings with 0
df_movie_raings = df_ratings.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

In [246]:
#get a brief view of the processed ratings' data
df_movie_raings.head(5)

movieId / userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
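To make the reshaping step easier to follow, here is a small sketch on a hypothetical four-row ratings table (the names `toy_long` and `toy_wide` and all values are invented for illustration). It applies the same `pivot(...).fillna(0)` pattern and shows that every movie-user pair without a rating ends up as 0.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) rating.
toy_long = pd.DataFrame({
    'userId':  [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating':  [4.0, 5.0, 3.5, 2.0],
})

# Wide format: one row per movie, one column per user, 0 where no rating exists.
toy_wide = toy_long.pivot(index='movieId', columns='userId', values='rating').fillna(0)

# Share of cells that are 0, i.e. movie-user pairs with no rating.
zero_share = float((toy_wide.values == 0).mean())
print(toy_wide)
print(zero_share)
```

Note that a 0 in this matrix stands for "not rated", not for a rating of zero, which is one reason a direction-based measure such as cosine similarity is a reasonable fit for this layout.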


# Make recommendation

Cosine similarity measures how closely two vectors in a vector space point in the same direction. Mathematically, the cosine similarity function can be defined as follows:
<center>
<img src="image/cosine_similarity.png" width=150 />
</center>
Cosine similarity is generally used as a metric when the magnitude of the vectors does not matter. The corresponding cosine distance is 1 minus the cosine similarity.
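The standard definition, the dot product of two vectors divided by the product of their norms, can be checked by hand. The sketch below uses three made-up vectors and a hypothetical helper `cosine` to show that scaling a vector leaves the similarity unchanged, and that the manual result agrees with scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([4.0, 0.0, 5.0])
b = np.array([2.0, 0.0, 2.5])  # same direction as a, half the magnitude
c = np.array([0.0, 3.0, 0.0])  # shares no rated dimension with a

def cosine(u, v):
    """Dot product divided by the product of the Euclidean norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = cosine(a, b)  # magnitude does not matter, only direction
manual_ac = cosine(a, c)  # orthogonal vectors
sklearn_ab = float(cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0])

print(manual_ab, manual_ac, sklearn_ab)
```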

In [247]:
#use the matrix to calculate similarity of movies
movie_similarity = cosine_similarity(df_movie_raings,dense_output=True)
movie_similarity

array([[1.        , 0.41056206, 0.2969169 , ..., 0.        , 0.        ,
        0.        ],
       [0.41056206, 1.        , 0.28243799, ..., 0.        , 0.        ,
        0.        ],
       [0.2969169 , 0.28243799, 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 1.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        1.        ]])

Convert the dense matrix into a sparse matrix with `csr_matrix()`. Most entries of the rating matrix are 0, so a sparse representation saves memory and is handled efficiently by libraries such as scikit-learn.

In [248]:
movie_rating_train = csr_matrix(df_movie_raings.values)
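To give a rough sense of why the sparse format helps, the sketch below compares the memory footprint of a mostly empty matrix in dense and CSR form. The matrix is randomly generated and the sizes (1000 x 500, at most 10,000 filled cells) are arbitrary assumptions, not figures from the MovieLens data.

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Hypothetical rating matrix: 1000 movies x 500 users, about 2% of cells filled.
dense = np.zeros((1000, 500))
rows = rng.integers(0, 1000, size=10_000)
cols = rng.integers(0, 500, size=10_000)
dense[rows, cols] = rng.integers(1, 6, size=10_000)  # ratings from 1 to 5

sparse = csr_matrix(dense)

density = sparse.nnz / (dense.shape[0] * dense.shape[1])
dense_bytes = dense.nbytes
# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print(density, dense_bytes, sparse_bytes)
```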

Since the data is high-dimensional, we use cosine similarity for the nearest neighbor search in order to mitigate the "curse of dimensionality".

Configure our KNN model with proper hyper-params

In [249]:
movie_neighbors = NearestNeighbors(metric='cosine', algorithm='brute')

Fit data to the KNN model

In [250]:
movie_neighbors.fit(movie_rating_train)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=None, n_neighbors=5, p=2, radius=1.0)

Randomly select a movie by its row position, simulating a user who picks a movie before asking for recommendations

In [251]:
query_index = np.random.choice(df_movie_raings.shape[0])

Calculate the distances and the associated indices using the `.kneighbors` method from the scikit-learn library. We request 6 neighbors because the nearest one is the query movie itself

In [252]:
distances, indices = movie_neighbors.kneighbors(df_movie_raings.iloc[query_index,:].values.reshape(1,-1), n_neighbors = 6)

In [253]:
distances.flatten()

array([1.11022302e-16, 4.54086913e-01, 4.92177463e-01, 4.98627119e-01,
       5.14659315e-01, 5.30223894e-01])
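With `metric='cosine'`, the values returned by `.kneighbors` are cosine distances, that is, 1 minus the cosine similarity. For instance, the distance 0.454087 above corresponds to a similarity of 0.545913. The sketch below checks this relationship on a hypothetical four-movie matrix (`toy`, `model`, and `similarities` are illustrative names, not part of the notebook's pipeline).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors

# Hypothetical ratings: 4 movies (rows) rated by 4 users (columns).
toy = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 0.0, 0.0],
    [0.0, 0.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

model = NearestNeighbors(metric='cosine', algorithm='brute')
model.fit(toy)

query = 0
distances, indices = model.kneighbors(toy[query].reshape(1, -1), n_neighbors=3)

# Look up the similarity of the query movie to each returned neighbor.
similarities = cosine_similarity(toy)[query, indices.flatten()]

print(distances.flatten())
print(1 - similarities)
```

The first neighbor returned is the query movie itself, with a distance of (numerically) zero, which is why the loop that builds the recommendation list skips position 0.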

Iterate over the calculated distances, cross-reference the nearest neighbors of the query movie with the movies dataframe, and return a list of recommended movies by movieId

In [254]:
recommend_list = []

for i in range(len(distances.flatten())):
    if i == 0:
        print('Recommendations for {0}:\n'.format(df_movies.loc[df_movies['movieId']==df_movie_raings.index[query_index], 'title'].iloc[0]))
    else:
        recommend_list.append(df_movie_raings.index[indices.flatten()[i]])
        print ('{0}:{1}, with distance of {2}:'.format(i, df_movie_raings.index[indices.flatten()[i]], distances.flatten()[i]))

Recommendations for Big Chill, The (1983):

1:1958, with distance of 0.4540869125068342:
2:2243, with distance of 0.49217746269539553:
3:2289, with distance of 0.4986271190982092:
4:2247, with distance of 0.5146593150505063:
5:1299, with distance of 0.5302238935205185:


Now we get the result!

In [255]:
recommend_list

[1958, 2243, 2289, 2247, 1299]

Show the result more clearly, along with the movie details

In [256]:

recommend_res = pd.DataFrame(recommend_list, columns=['movieId'])
recommend_res

Unnamed: 0,movieId
0,1958
1,2243
2,2289
3,2247
4,1299


In [257]:
# merge with the movies dataframe loaded earlier (df_movies) to attach titles and genres
recommend_res = pd.merge(recommend_res, df_movies, on='movieId')
recommend_res

Unnamed: 0,movieId,title,genres
0,1958,Terms of Endearment (1983),Comedy|Drama
1,2243,Broadcast News (1987),Comedy|Drama|Romance
2,2289,"Player, The (1992)",Comedy|Crime|Drama
3,2247,Married to the Mob (1988),Comedy
4,1299,"Killing Fields, The (1984)",Drama|War
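The steps above (fit, query, drop the query movie, merge with titles) could be wrapped into one reusable function. The following is only a sketch under our own assumptions: `recommend_similar`, `toy_ratings`, `toy_movies`, and the placeholder titles are hypothetical, and the function refits the model on every call, which is fine for a demonstration but would be wasteful on the full dataset.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def recommend_similar(movie_id, ratings_wide, movies, k=5):
    """Return the k movies nearest to `movie_id` by cosine distance on ratings."""
    model = NearestNeighbors(metric='cosine', algorithm='brute')
    model.fit(ratings_wide.values)
    row = ratings_wide.index.get_loc(movie_id)
    distances, indices = model.kneighbors(
        ratings_wide.iloc[[row]].values, n_neighbors=k + 1
    )
    result = pd.DataFrame({
        'movieId': ratings_wide.index[indices.flatten()],
        'distance': distances.flatten(),
    })
    # Drop the query movie itself, then keep at most k neighbors.
    result = result[result['movieId'] != movie_id].head(k)
    return result.merge(movies, on='movieId', how='left')

# Hypothetical data in the same shape as ratings.csv and movies.csv.
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3, 4, 4],
    'movieId': [10, 20, 10, 20, 30, 40, 30, 10],
    'rating':  [5.0, 4.0, 4.0, 5.0, 5.0, 4.0, 4.0, 1.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20, 30, 40],
    'title': ['Movie A', 'Movie B', 'Movie C', 'Movie D'],
})
toy_wide = toy_ratings.pivot(index='movieId', columns='userId', values='rating').fillna(0)

recs = recommend_similar(10, toy_wide, toy_movies, k=2)
print(recs)
```

Filtering on `movieId` rather than always dropping the first row is a deliberate choice: if another movie has an identical rating vector, the query movie is not guaranteed to come back in first position.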


Get more information on the recommended movies

### Moving on: now that the recommended movies are determined, we confirm them against the movie similarities to check that the result is logically correct

### Movie Similarity Representation

In [260]:
similarity_compare = pd.DataFrame({'Recommend Similarity':movie_similarity[query_index],'Selected MovieId':df_movie_raings.index[query_index]})
similarity_compare = similarity_compare.set_index(df_movie_raings.index)
similarity_compare = similarity_compare.sort_values('Recommend Similarity',ascending=False)
similarity_compare_pivot = similarity_compare.T
similarity_compare

movieId,Recommend Similarity,Selected MovieId
2352,1.000000,2352
1958,0.545913,2352
2243,0.507823,2352
2289,0.501373,2352
2247,0.485341,2352
1299,0.469776,2352
2369,0.467111,2352
1674,0.465368,2352
1292,0.459691,2352
2240,0.459163,2352


In [261]:
similarity_compare.index.values

array([  2352,   1958,   2243, ...,  40870,  40851, 193609])

### Better Movie Similarity Representation 

In [263]:
similarity_compare_2 = pd.DataFrame(columns = similarity_compare.index.values)
# sort a copy in descending order; an in-place sort on a view would also reorder the row inside movie_similarity
query_movie_similarity = np.sort(movie_similarity[query_index])[::-1]
similarity_compare_2.loc[df_movie_raings.index[query_index]] = query_movie_similarity
similarity_compare_2.index.name = 'Query Movie'
similarity_compare_2

Query Movie,2352,1958,2243,2289,2247,1299,2369,1674,1292,2240,...,41569,40826,1857,41014,40966,40962,40955,40870,40851,193609
2352,1.0,0.545913,0.507823,0.501373,0.485341,0.469776,0.467111,0.465368,0.459691,0.459163,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Now Compare Movie Similarity Above to the Recommendation Result:

In [264]:
recommend_res

Unnamed: 0,movieId,title,genres
0,1958,Terms of Endearment (1983),Comedy|Drama
1,2243,Broadcast News (1987),Comedy|Drama|Romance
2,2289,"Player, The (1992)",Comedy|Crime|Drama
3,2247,Married to the Mob (1988),Comedy
4,1299,"Killing Fields, The (1984)",Drama|War


### See? The recommender system did make logical recommendations: similar movies that might intrigue a viewer who queries a given movie (movieId)
<center>
<img src="image/cheers.png" width=300 />
</center>

