## DATA 612 Project 2: Content-Based and Collaborative Filtering
#### Team members: Mia Chen, Wei Zhou
#### Date: 6/21/2020

The goal of this assignment is for us to try out different ways of implementing and configuring a recommender, and to evaluate the different approaches. We will use the MovieLens [dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset) from Kaggle. The dataset files contain metadata for all 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. This dataset captures feature points like cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts, and vote averages.

This dataset consists of the following files:

* movies_metadata.csv: This file contains information on ~45,000 movies featured in the Full MovieLens dataset. Features include posters, backdrops, budget, genre, revenue, release dates, languages, production countries, and companies.

* keywords.csv: Contains the movie plot keywords for our MovieLens movies. Available in the form of a stringified JSON Object.

* credits.csv: Consists of Cast and Crew Information for all the movies. Available in the form of a stringified JSON Object.

* links.csv: This file contains the TMDB and IMDB IDs of all the movies featured in the Full MovieLens dataset.

* links_small.csv: Contains the TMDB and IMDB IDs of a small subset of 9,000 movies of the Full Dataset.

* ratings_small.csv: The subset of 100,000 ratings from 700 users on 9,000 movies.

## Content-Based Filtering
Content-based filtering recommends movies that are similar to a particular movie. To achieve this, we will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

In [49]:
# Load modules
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
# Load Movies Metadata
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)

# Inspect the plots of a few movies
metadata['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

To measure how similar two plot overviews are, we first convert each overview into a Term Frequency-Inverse Document Frequency (TF-IDF) vector. This gives us a matrix in which each row represents a movie and each column represents a word in the overview vocabulary.

In [3]:
# Define a TF-IDF Vectorizer Object
# Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75823)

In [4]:
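# List mapping feature integer indices to feature names (terms)
# (get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out())
tfidf.get_feature_names()[1000:1010]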

[u'abdel',
 u'abdelatif',
 u'abdelhakim',
 u'abdelilah',
 u'abdelkader',
 u'abdicate',
 u'abdicated',
 u'abdicates',
 u'abdicating',
 u'abdication']

In [5]:
# Import sklearn's linear_kernel(): TF-IDF rows are L2-normalized, so the dot product
# equals the cosine similarity and is faster to compute than cosine_similarity()
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

cosine_sim.shape

(45466, 45466)
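The claim that `linear_kernel` can stand in for cosine similarity can be checked on a small scale. The sketch below uses three made-up overviews (not taken from the dataset) and relies on `TfidfVectorizer` normalizing each row to unit length by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy overviews (made up for illustration, not taken from the dataset)
docs = [
    "toys come to life when their owner leaves the room",
    "a board game releases jungle animals into the house",
    "two toys compete for the attention of their owner",
]

# TfidfVectorizer uses norm='l2' by default, so every non-empty row has unit length
toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(docs)

dot_products = linear_kernel(toy_tfidf, toy_tfidf)
cosines = cosine_similarity(toy_tfidf, toy_tfidf)

# On unit-length rows the plain dot product and the cosine similarity coincide
same = np.allclose(dot_products, cosines)
print(same)
```

At the scale of this notebook the dense result has 45,466 × 45,466 entries, which is roughly 16.5 GB if stored as 64-bit floats, so the full matrix may not fit in memory on a typical laptop.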

In [6]:
cosine_sim[1]

array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])

In [7]:
# Reverse mapping of movie titles and DataFrame indices
# Make title the index of a Series and keep the first row for each duplicated title
index = pd.Series(metadata.index, index=metadata['title'])
index = index[~index.index.duplicated(keep='first')]

index[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64
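When two movies share a title, a title-indexed Series returns several rows for that label, which breaks a lookup that expects a single row number. The sketch below, using invented titles, shows that `Series.drop_duplicates()` compares values rather than index labels, and that filtering with `Index.duplicated` keeps one row per title.

```python
import pandas as pd

# Invented titles; 'Heat' appears twice on purpose
titles = pd.Series(['Heat', 'Sabrina', 'Heat'])
lookup = pd.Series(titles.index, index=titles)

# drop_duplicates() compares the values (row numbers), which are all distinct here
still_duplicated = lookup.drop_duplicates()

# Filtering on the index labels keeps the first row for each title
unique_lookup = lookup[~lookup.index.duplicated(keep='first')]

print(len(still_duplicated), len(unique_lookup))
```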

In [8]:
# Define a function that takes in a movie title and outputs similar movies
def movie_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    ind = index[title]
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[ind]))
    # Sort the movies based on the similarity scores, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Get the scores of the 15 most similar movies (position 0 is the movie itself)
    sim_scores = sim_scores[1:16]
    # Get the movie indices
    movie_index = [i[0] for i in sim_scores]
    # Return the titles of the 15 most similar movies
    return metadata['title'].iloc[movie_index]
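The same ranking logic can be tried end to end on a miniature catalogue. This is a sketch with invented titles and overviews; `toy_recommendations` and the `toy_*` names are hypothetical and only mirror the structure of the function above. Unlike the original, it removes the query movie by row number instead of assuming it is always ranked first.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Hypothetical miniature catalogue; titles and overviews are invented
toy = pd.DataFrame({
    'title': ['Toy Film', 'Jungle Game', 'Toy Rivals', 'Space Race'],
    'overview': [
        'toys come to life when their owner leaves the room',
        'a board game releases jungle animals into the house',
        'two toys compete for the attention of their owner',
        'astronauts race to repair a damaged space station',
    ],
})

toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy['overview'])
toy_sim = linear_kernel(toy_matrix, toy_matrix)
toy_index = pd.Series(toy.index, index=toy['title'])

def toy_recommendations(title, top_n=2):
    # Rank every movie by similarity to the query, highest first
    row = toy_index[title]
    ranked = toy_sim[row].argsort()[::-1]
    # Drop the query movie itself, then keep the top_n remaining rows
    ranked = [i for i in ranked if i != row][:top_n]
    return toy['title'].iloc[ranked].tolist()

print(toy_recommendations('Toy Film', top_n=1))
```

Here "Toy Rivals" is returned because it is the only other overview that shares non-stop-word terms ("toys", "owner") with the query.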


### Item-Item Collaborative Filter

In [169]:
rating = pd.read_csv('ratings_small.csv',
                     usecols=['userId', 'movieId', 'rating'],
                     dtype={'userId': 'int32',
                            'movieId': 'int32',
                            'rating': 'float32'})
movie = pd.read_csv('movies_metadata.csv', low_memory=False)[['id', 'original_title']]

#### Pivot Ratings into Movie-features

In [170]:
from scipy.sparse import csr_matrix
# pivot ratings into movie features
df_movie_features = rating.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

mat_movie_features = csr_matrix(df_movie_features.values)

In [171]:
df_movie_features.shape

(9066, 671)
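To make the shape of this matrix concrete, here is a minimal sketch of the same pivot on five invented ratings. The `toy_*` names are hypothetical; the column names match `ratings_small.csv`.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Invented ratings: three users, three movies, five ratings in total
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 3, 3],
    'movieId': [10, 20, 10, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5],
})

# One row per movie, one column per user; unrated pairs become 0
toy_features = toy_ratings.pivot(index='movieId', columns='userId', values='rating').fillna(0)
toy_sparse = csr_matrix(toy_features.values)

print(toy_features.shape, toy_sparse.nnz)
```

Filling missing entries with 0 treats "not rated" the same as a rating of zero, which is a simplification worth keeping in mind when interpreting distances between rows.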

#### Keep only movies that have been rated at least 50 times, so that each has enough ratings to reflect how users react to it, and only active users who have rated at least 50 movies.


In [172]:
popularity_thres = 50
# True for movies (rows) with at least popularity_thres ratings
popular_movie_index = ((df_movie_features > 0).sum(axis=1) >= popularity_thres).tolist()

ratings_thres = 50
# True for users (columns) who rated at least ratings_thres movies
active_user_list = ((df_movie_features > 0).sum(axis=0) >= ratings_thres).tolist()

pop_movie_active_user = df_movie_features.loc[popular_movie_index, active_user_list]


  
  


In [173]:
## After filtering, we are left with 453 movies and 427 users.
pop_movie_active_user.shape

(453, 427)
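The filtering step can be illustrated on a matrix small enough to check by hand. This sketch uses invented ratings and a threshold of 2 as a stand-in for 50; `min_ratings` and the other names are hypothetical.

```python
import pandas as pd

# Invented movie-by-user matrix; 0 means "not rated"
toy = pd.DataFrame(
    [[5.0, 4.0, 0.0, 3.0],
     [0.0, 0.0, 2.0, 0.0],
     [4.0, 0.0, 0.0, 5.0]],
    index=pd.Index([1, 2, 3], name='movieId'),
    columns=pd.Index([11, 12, 13, 14], name='userId'),
)

min_ratings = 2  # stand-in for the threshold of 50 used above

rated = toy > 0
popular_movies = rated.sum(axis=1) >= min_ratings   # ratings received per movie
active_users = rated.sum(axis=0) >= min_ratings     # ratings given per user

toy_filtered = toy.loc[popular_movies, active_users]
print(toy_filtered.shape)
```

Both masks are computed on the unfiltered matrix, so after dropping inactive users a retained movie can end up with fewer ratings than the threshold among the remaining columns.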

#### Apply KNN Algorithm
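In KNN classification, a data point is assigned to the class most common among its k nearest neighbors, as measured by a distance function (the choice of function depends on whether the data is continuous or categorical). Here we use the unsupervised form: `NearestNeighbors` does not predict a class, it only returns the rows of the fitted matrix that are closest to a query vector, with closeness measured by the cosine distance between rating vectors. Each row is a movie, so the neighbors of a movie are the movies that the same users rated in a similar way. When the query is a movie that is already in the fitted matrix, its nearest neighbor (k = 1) is the movie itself, so we ask for one more neighbor than the number of recommendations we want.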


In [174]:
from sklearn.neighbors import NearestNeighbors
#make an object for the NearestNeighbors Class.
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)
# fit the dataset
model_knn.fit(pop_movie_active_user)


NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=-1, n_neighbors=20, p=2, radius=1.0)
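A small sketch of what `kneighbors` returns, using four invented rating vectors. The `toy_*` names are hypothetical; the constructor arguments mirror the ones used above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Invented rating vectors for four movies (rows) across three users (columns)
toy_vectors = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
    [5.0, 0.0, 1.0],
])

toy_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=2)
toy_knn.fit(toy_vectors)

# Query with the first movie's own vector
distances, positions = toy_knn.kneighbors(toy_vectors[[0]], n_neighbors=2)

# positions holds row numbers of the fitted matrix, nearest first
print(positions[0].tolist())
```

The first position is the query row itself (distance 0), and the values are row positions in the fitted matrix, not movie ids, so they need to be translated back through the matrix index before looking up titles.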

#### Making Recommendations

In [197]:
def make_recommendation(fav_movie, n_recommendations):
    # Get the id of the input movie from movies_metadata.csv
    # Note: this `id` is a TMDB id, while pop_movie_active_user is indexed by MovieLens movieId
    print('You have input movie:', fav_movie)
    movie_id = int(movie.loc[movie.original_title == fav_movie, 'id'].values[0])

    print('Recommendation system start to make inference')
    print('......\n')
    # Select the rating vector of the input movie (empty if movie_id is not in the index)
    k = pop_movie_active_user.loc[pop_movie_active_user.index == movie_id, :]
    # kneighbors returns row positions in the fitted matrix, nearest first
    recommend_movie_id = model_knn.kneighbors(k, n_neighbors=6)[1].tolist()[0]
    recommend_movie = movie.loc[movie.id.isin([str(i) for i in recommend_movie_id]), :]

    # print('Recommendations for {}:'.format(fav_movie))
    print(recommend_movie)

In [198]:
fav_movie='Toy Story'
n_recommendations=5

In [199]:
make_recommendation(fav_movie,n_recommendations)

('You have input movie:', 'Toy Story')
Recommendation system start to make inference
......



ValueError: Found array with 0 sample(s) (shape=(0, 427)) while a minimum of 1 is required.

In [178]:
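# Same lookup for movieId 10, which is present in the filtered matrix
k = pop_movie_active_user.loc[pop_movie_active_user.index == 10, :]
recommend_movie_id = model_knn.kneighbors(k, n_neighbors=6)[1].tolist()[0]
# recommend_movie_id holds row positions, which are matched against TMDB ids here
movie.loc[movie.id.isin([str(i) for i in recommend_movie_id]), :]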


| | id | original_title |
|---|---|---|
| 31 | 63 | Twelve Monkeys |
| 474 | 6 | Judgment Night |
| 1156 | 85 | Raiders of the Lost Ark |
| 1221 | 33 | Unforgiven |
| 2216 | 73 | American History X |


In [204]:
print('You have input movie:', fav_movie)
movie_id = int(movie.loc[movie.original_title == fav_movie, 'id'].values[0])

print('Recommendation system start to make inference')
print('......\n')
k = pop_movie_active_user.loc[pop_movie_active_user.index == movie_id, :]
recommend_movie_id = model_knn.kneighbors(k, n_neighbors=6)[1].tolist()[0]
# recommend_movie = movie.loc[movie.id.isin([str(i) for i in recommend_movie_id]), :]

('You have input movie:', 'Toy Story')
Recommendation system start to make inference
......



ValueError: Found array with 0 sample(s) (shape=(0, 427)) while a minimum of 1 is required.

In [207]:
pop_movie_active_user

userId,2,3,4,5,7,8,12,13,15,17,...,655,656,658,659,660,662,664,665,667,671
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,3.0,0.0,0.0,5.0,2.0,0.0,...,0.0,0.0,0.0,0.0,2.5,0.0,3.5,0.0,0.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,...,4.0,0.0,0.0,0.0,0.0,5.0,0.0,3.0,0.0,0.0
3,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.5,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,4.5,...,0.0,0.0,0.0,3.0,0.0,0.0,4.0,0.0,4.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,4.0,0.0,4.0,0.0,3.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0
11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.5,0.0,...,0.0,0.0,0.0,5.0,0.0,0.0,4.0,4.0,2.0,0.0
17,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,2.0,0.0
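
The `ValueError` above comes from mixing two id spaces. The `id` column of `movies_metadata.csv` is a TMDB id, while the index of `pop_movie_active_user` shown above holds MovieLens `movieId` values (1, 2, 3, 5, ...). The title lookup therefore returns an id that is not in the index, and the selection passed to `kneighbors` is empty. In addition, `kneighbors` returns row positions of the fitted matrix rather than movie ids. A minimal sketch of one possible repair follows, assuming `links_small.csv` exposes `movieId` and `tmdbId` columns. The small tables are stand-ins built in memory, and `recommend`, `ratings_matrix`, and `knn` are hypothetical names rather than part of the notebook.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Miniature stand-ins for the real tables. Column names follow the notebook;
# the `movieId`/`tmdbId` columns of `links` are an assumption about links_small.csv.
movie = pd.DataFrame({'id': ['862', '8844', '949'],
                      'original_title': ['Toy Story', 'Jumanji', 'Heat']})
links = pd.DataFrame({'movieId': [1, 2, 6], 'tmdbId': [862.0, 8844.0, 949.0]})
ratings_matrix = pd.DataFrame(
    [[5.0, 4.0, 0.0], [4.0, 5.0, 1.0], [0.0, 1.0, 5.0]],
    index=pd.Index([1, 2, 6], name='movieId'), columns=[11, 12, 13])

knn = NearestNeighbors(metric='cosine', algorithm='brute').fit(ratings_matrix.values)

# Translate between the two id spaces through the links table
links = links.dropna(subset=['tmdbId'])
tmdb_to_movielens = dict(zip(links['tmdbId'].astype(int).astype(str), links['movieId']))
movielens_to_tmdb = {ml: tmdb for tmdb, ml in tmdb_to_movielens.items()}

def recommend(title, n_recommendations=1):
    # Title -> TMDB id (string) -> MovieLens movieId
    tmdb_id = movie.loc[movie['original_title'] == title, 'id'].iloc[0]
    movie_id = tmdb_to_movielens[tmdb_id]
    query = ratings_matrix.loc[[movie_id]].values
    # Ask for one extra neighbour because the query movie is its own nearest neighbour
    positions = knn.kneighbors(query, n_neighbors=n_recommendations + 1)[1][0]
    # Row positions -> MovieLens ids -> TMDB ids -> titles
    neighbour_ids = [m for m in ratings_matrix.index[positions] if m != movie_id]
    tmdb_ids = [movielens_to_tmdb[m] for m in neighbour_ids[:n_recommendations]]
    titles = movie.set_index('id')['original_title']
    return titles.loc[tmdb_ids].tolist()

print(recommend('Toy Story'))
```

On the real data, a title can still be missing from the filtered matrix if the movie has fewer than 50 ratings, so a lookup along these lines would also need to handle that case.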
