# Movie Recommendation

This data science project develops a movie recommendation system. The system uses collaborative filtering, a technique that predicts a user's preferences from the rating patterns of other users. The project draws on a large dataset of user ratings and movie metadata to fit the model.

The notebook builds a user-item rating matrix, fits a nearest-neighbors model on it, and returns a personalized list of recommended titles for a given movie. By analyzing user behavior together with movie metadata, the system offers tailored suggestions. The project illustrates how machine learning algorithms can improve content discoverability on digital media platforms.

## Import libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Read the ratings file into a pandas dataframe

In [2]:
user_ratings_df = pd.read_csv("/Users/okguser/Downloads/archive/ratings.csv")
user_ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,110,1.0,1425941529
1,1,147,4.5,1425942435
2,1,858,5.0,1425941523
3,1,1221,5.0,1425941546
4,1,1246,5.0,1425941556
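
The `Unnamed: 0` column in the output above usually appears when a file was saved with its row index and then read back as ordinary data. A minimal sketch of one way to avoid it, assuming that is how this file was written; the inline CSV text stands in for the real file:

```python
import io
import pandas as pd

# Stand-in for ratings.csv: the first, unnamed column is a saved row index
csv_text = (
    ",userId,movieId,rating,timestamp\n"
    "0,1,110,1.0,1425941529\n"
    "1,1,147,4.5,1425942435\n"
)

# index_col=0 tells pandas to use the first column as the index
# instead of loading it as a data column named "Unnamed: 0"
ratings_preview = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(ratings_preview.columns.tolist())
```

With the real file, the same `index_col=0` argument would be passed alongside the file path.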


## Read the movie metadata information into a dataframe

In [3]:
movie_metadata = pd.read_csv("/Users/okguser/Downloads/archive/movies_metadata.csv")
movie_metadata.head()


Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


## Combine the two dataframes on the columns movieId and id

In [4]:
#convert both ID columns to numeric with pd.to_numeric()
#errors='coerce' replaces any value that cannot be converted to a number with NaN
user_ratings_df['movieId'] = pd.to_numeric(user_ratings_df['movieId'], errors='coerce')
movie_metadata['id'] = pd.to_numeric(movie_metadata['id'], errors='coerce')
#merge the two dataframes with pd.merge(), matching movieId against id
movie_data = pd.merge(user_ratings_df, movie_metadata, left_on='movieId', right_on='id')
movie_data.head()

Unnamed: 0,userId,movieId,rating,timestamp,adult,belongs_to_collection,budget,genres,homepage,id,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,1,110,1.0,1425941529,False,"{'id': 131, 'name': 'Three Colors Collection',...",0,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,110.0,...,1994-05-27,0.0,99.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Three Colors: Red,False,7.8,246.0
1,11,110,3.5,1231676989,False,"{'id': 131, 'name': 'Three Colors Collection',...",0,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,110.0,...,1994-05-27,0.0,99.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Three Colors: Red,False,7.8,246.0
2,22,110,5.0,1111937009,False,"{'id': 131, 'name': 'Three Colors Collection',...",0,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,110.0,...,1994-05-27,0.0,99.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Three Colors: Red,False,7.8,246.0
3,24,110,5.0,979870012,False,"{'id': 131, 'name': 'Three Colors Collection',...",0,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,110.0,...,1994-05-27,0.0,99.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Three Colors: Red,False,7.8,246.0
4,29,110,3.0,1044020005,False,"{'id': 131, 'name': 'Three Colors Collection',...",0,"[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'na...",,110.0,...,1994-05-27,0.0,99.0,"[{'iso_639_1': 'fr', 'name': 'Français'}]",Released,,Three Colors: Red,False,7.8,246.0
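
The merge above joins `movieId` from the ratings directly to `id` from the metadata. If the two files use different ID schemes (for example, MovieLens IDs in the ratings and TMDB IDs in the metadata), the join needs a mapping table in between. A minimal sketch, assuming a hypothetical `links` table with `movieId` and `tmdbId` columns that is not loaded anywhere in this notebook; all IDs and titles below are made up:

```python
import pandas as pd

# Toy ratings keyed by one ID scheme (made-up values)
ratings = pd.DataFrame({"userId": [1, 2], "movieId": [7, 8], "rating": [1.0, 4.0]})

# Hypothetical mapping table between the two ID schemes (assumption)
links = pd.DataFrame({"movieId": [7, 8], "tmdbId": [9002.0, 9001.0]})

# Toy metadata keyed by the other scheme; one id is deliberately malformed
metadata = pd.DataFrame({
    "id": ["9001", "9002", "not-a-number"],
    "title": ["Movie A", "Movie B", "Broken row"],
})

# Coerce malformed ids to NaN, then drop them so they cannot match anything
metadata["id"] = pd.to_numeric(metadata["id"], errors="coerce")
metadata = metadata.dropna(subset=["id"])

# ratings -> links -> metadata, keeping every rating row
merged = (
    ratings.merge(links, on="movieId", how="left")
           .merge(metadata, left_on="tmdbId", right_on="id", how="left")
)
print(merged[["userId", "movieId", "rating", "title"]])
```

Using `how="left"` keeps ratings whose movie has no metadata, so missing matches show up as `NaN` titles rather than disappearing silently.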


## Build the User-Item Matrix

With the dataset prepared, let's explore how collaborative filtering operates. This technique seeks to identify patterns in user preferences that can be used to provide recommendations.

One common approach is to use a user-item matrix, in which each row is a user, each column is an item, and each cell holds that user's rating of that item. The system then analyzes this matrix to find patterns and generate recommendations. This matrix leads us to one of the advantages of collaborative filtering.

It's excellent at discovering new and unexpected recommendations. Since it's based on user behavior, it can suggest a movie you might never have considered but will probably like.

Create a user-movie rating matrix for our dataset using the built-in `pivot` method of a pandas dataframe.

In [5]:
#pivot() reshapes the long ratings table into a wide table that is easier to read
#index specifies the column whose values become the rows
#columns specifies the column whose values become the columns
#values specifies the column used to fill the cells
#.fillna(0) replaces every NaN (a movie the user has not rated) with 0
user_item_matrix = user_ratings_df.pivot(index=['userId'], columns=['movieId'], values='rating').fillna(0)
user_item_matrix

userId \ movieId,1,2,3,4,5,6,7,8,9,10,...,170705,170813,170827,170945,171763,172547,173145,174055,174231,174585
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
473,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
474,4.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
475,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
476,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
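
To make the reshaping concrete, here is a minimal sketch of the same `pivot(...).fillna(0)` pattern on a tiny, made-up ratings table (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Long format: one row per (user, movie) rating
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.5],
})

# Wide format: one row per user, one column per movie, 0 where no rating exists
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating").fillna(0)

# Share of cells that are zero, i.e. user-movie pairs without a rating
sparsity = (toy_matrix == 0).to_numpy().mean()

print(toy_matrix)
print(f"sparsity: {sparsity:.2f}")
```

Note that filling with 0 makes "not rated" indistinguishable from a rating of 0, and that `pivot` raises an error if the same user-movie pair appears more than once.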


## Define and Train Model

Use K-nearest neighbors (KNN). Picture every item as a dot on a board: KNN finds the dot for your favourite item and then looks around for the dots closest to it.

The `metric` parameter in KNN is crucial. It's like the ruler the system uses to measure the distance between the dots. The metric used here is cosine similarity, which scikit-learn reports as cosine distance (1 minus the cosine similarity), so smaller values mean more similar.

### What is cosine similarity?
It is a metric that measures how similar two entities are (such as documents or vectors in a multi-dimensional space) by the angle between them, irrespective of their magnitude. Cosine similarity is widely used in NLP to find words that appear in similar contexts.
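
As a small illustration, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The helper below is a hypothetical sketch written for this explanation, not part of the notebook's pipeline:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(a @ b / denom)

# Same direction, different magnitude: similarity is 1 (up to rounding)
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])

# Orthogonal vectors (no shared non-zero entries): similarity is 0
orthogonal = cosine_similarity([1, 0], [0, 1])

print(same_direction, orthogonal)
```

The first pair shows why magnitude does not matter: doubling every rating leaves the angle, and therefore the similarity, unchanged.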

In [6]:
from sklearn.neighbors import NearestNeighbors

#define a KNN model that uses cosine distance
#metric='cosine' sets cosine distance (1 - cosine similarity) as the distance metric
#algorithm selects the method used to compute the nearest neighbors
#'brute' is brute-force search: it computes the distance from the query to every fitted point,
#which is computationally intensive but straightforward
#n_neighbors is the default number of nearest neighbors to return
#n_jobs=-1 tells the search to use all available CPUs
cf_knn_model = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=10, n_jobs=-1)

#fitting the model on the matrix
cf_knn_model.fit(user_item_matrix)
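
To show what a brute-force cosine search computes, here is a NumPy-only sketch. It is a simplified stand-in for illustration, not scikit-learn's implementation, and the function names are hypothetical. The key point is that the returned positions refer to rows of the fitted matrix:

```python
import numpy as np

def cosine_distances_to_rows(X, query):
    """Cosine distance (1 - similarity) from query to every row of X."""
    X = np.asarray(X, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(query)
    # Rows with zero norm get similarity 0, hence distance 1
    sims = np.divide(X @ query, norms, out=np.zeros(len(X)), where=norms > 0)
    return 1.0 - sims

def brute_force_neighbors(X, query, n_neighbors):
    """Return (distances, row positions) of the closest rows, nearest first."""
    dists = cosine_distances_to_rows(X, query)
    order = np.argsort(dists, kind="stable")[:n_neighbors]
    return dists[order], order

# Three fitted rows with three features each (made-up values)
X = np.array([[5, 0, 0],
              [4, 1, 0],
              [0, 0, 3]])

dists, idx = brute_force_neighbors(X, [1, 0, 0], n_neighbors=2)
print(idx, dists)
```

Because every query is compared against every row, the cost grows linearly with the number of fitted rows, which is the trade-off the `algorithm='brute'` comment describes.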

### Define function to provide the desired number of movie recommendations

In [7]:
from fuzzywuzzy import process
import pandas as pd

def movie_recommender_engine(movie_name, matrix, cf_model, n_recs):
    #fit the model passed in as cf_model (rather than the global one) on the matrix
    cf_model.fit(matrix)
    
    #fuzzy-match the input name against the titles in movie_metadata
    matched_title, score = process.extractOne(movie_name, movie_metadata['title'].tolist())
    
    #get corresponding movie ID from the matched title
    movie_id = movie_metadata.loc[movie_metadata['title'] == matched_title, 'id'].iloc[0]
    
    #check if the movie_id is in the matrix columns
    if movie_id in matrix.columns:
        movie_col_index = matrix.columns.get_loc(movie_id)
        
        #create query matrix with the same number of features as the original matrix
        #set all elements to zero except for the column corresponding to the movie ID
        query_matrix = np.zeros((1, len(matrix.columns)))
        query_matrix[0, movie_col_index] = 1 #set specific movie column to 1 
        
        #get the distances and positions of the nearest rows of the fitted matrix
        #request one extra neighbor so that n_recs remain if the queried movie is filtered out below
        distances, indices = cf_model.kneighbors(query_matrix, n_neighbors = n_recs + 1)
        
        #list to store recommendations
        cf_recs = []
        for idx, distance in zip(indices.squeeze(), distances.squeeze()):
            if idx != movie_col_index: #exclude the movie itself from recommendations
                rec_movie_id = matrix.columns[idx]
                rec_movie_title = movie_metadata[movie_metadata['id'] == rec_movie_id]['title'].iloc[0]
                cf_recs.append({'Title': rec_movie_title, 'Distance': distance})
                
        #create dataframe of recommendations
        df = pd.DataFrame(cf_recs, index = range(1, len(cf_recs) + 1))
        return df
    
    else:
        print("The movie ID was not found in the matrix columns.")
        return pd.DataFrame([], columns = ['Title', 'Distance'])
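
`kneighbors` returns positions of rows in the matrix the model was fitted on. For movie-to-movie recommendations, one option is therefore to fit on the transposed matrix, so that each row is a movie described by the ratings it received. A minimal sketch under that assumption; `similar_movies` and the toy matrix are hypothetical and not part of the notebook:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def similar_movies(user_item, movie_id, n_recs):
    """Return the n_recs movies whose rating vectors are closest to movie_id."""
    item_user = user_item.T                      # rows: movies, columns: users
    model = NearestNeighbors(metric="cosine", algorithm="brute")
    model.fit(item_user.to_numpy())

    pos = item_user.index.get_loc(movie_id)      # row position of the query movie
    query = item_user.to_numpy()[pos].reshape(1, -1)

    # Ask for one extra neighbor because the movie is its own nearest neighbor
    k = min(n_recs + 1, len(item_user))
    distances, indices = model.kneighbors(query, n_neighbors=k)

    recs = [
        (item_user.index[i], float(d))
        for i, d in zip(indices[0], distances[0])
        if i != pos                              # drop the query movie itself
    ]
    return pd.DataFrame(recs[:n_recs], columns=["movieId", "Distance"])

# Toy user-item matrix: 4 users (rows) x 4 movies (columns), made-up ratings
toy = pd.DataFrame(
    [[5, 4, 0, 0],
     [4, 5, 0, 1],
     [0, 0, 5, 4],
     [0, 1, 4, 5]],
    index=[1, 2, 3, 4],
    columns=[10, 20, 30, 40],
)

toy_recs = similar_movies(toy, movie_id=10, n_recs=2)
print(toy_recs)
```

In the toy data, movies 10 and 20 are liked by the same users, so movie 20 comes out as the nearest neighbor of movie 10.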



### Display the recommendations

In [8]:
#example usage 
n_recs = 10
recommendations = movie_recommender_engine('The Dark Knight', user_item_matrix, cf_knn_model, n_recs)
print(recommendations)

                                 Title  Distance
1                   As It Is in Heaven  0.963835
2                         Finding Nemo  0.984208
3                                 Cube  1.000000
4                              Ice Age  1.000000
5                              Vertigo  1.000000
6                            Mon oncle  1.000000
7                           Summer '04  1.000000
8   The Life Aquatic with Steve Zissou  1.000000
9       The Good, the Bad and the Ugly  1.000000
10                 Maria Full of Grace  1.000000
11                        Tough Enough  1.000000


In [9]:
#example usage 
n_recs = 10
recommendations = movie_recommender_engine('Batman', user_item_matrix, cf_knn_model, n_recs)
print(recommendations)

                             Title  Distance
1                    Trainspotting  0.959871
2            Bride of Frankenstein  0.969386
3                       Summer '04  1.000000
4                      The Pianist  1.000000
5                          Ice Age  1.000000
6                          Vertigo  1.000000
7                        Mon oncle  1.000000
8                             Cube  1.000000
9   The Good, the Bad and the Ugly  1.000000
10          The Day After Tomorrow  1.000000
11                    Mary Poppins  1.000000


In [10]:
#example usage 
n_recs = 10
recommendations = movie_recommender_engine('Trainspotting', user_item_matrix, cf_knn_model, n_recs)
print(recommendations)

                 Title  Distance
1      The Dark Knight  0.837511
2       Ocean's Eleven  0.882734
3             Fat Girl  0.917646
4   A Clockwork Orange  0.918214
5                Ghost  0.940517
6            Mon oncle  1.000000
7                 Cube  1.000000
8              Vertigo  1.000000
9          The Pianist  1.000000
10          Summer '04  1.000000
