# CA05 – kNN based Movie Recommender Engine

### Libraries & Data

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors

In [2]:
# Load the Data
df = pd.read_csv('https://raw.githubusercontent.com/ArinB/MSBA-CA-Data/main/CA05/movies_recommendation_data.csv')

In [3]:
df

,Movie ID,Movie Name,IMDB Rating,Biography,Drama,Thriller,Comedy,Crime,Mystery,History,Label
0,58,The Imitation Game,8.0,1,1,1,0,0,0,0,0
1,8,Ex Machina,7.7,0,1,0,0,0,1,0,0
2,46,A Beautiful Mind,8.2,1,1,0,0,0,0,0,0
3,62,Good Will Hunting,8.3,0,1,0,0,0,0,0,0
4,97,Forrest Gump,8.8,0,1,0,0,0,0,0,0
5,98,21,6.8,0,1,0,0,1,0,1,0
6,31,Gifted,7.6,0,1,0,0,0,0,0,0
7,3,Travelling Salesman,5.9,0,1,0,0,0,1,0,0
8,51,Avatar,7.9,0,0,0,0,0,0,0,0
9,47,The Karate Kid,7.2,0,1,0,0,0,0,0,0


In [4]:
# Dropping the 'Label' column, as we are not using this dataset for classification or regression
df = df.drop('Label', axis=1)

In [5]:
# Inspecting the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Movie ID     30 non-null     int64  
 1   Movie Name   30 non-null     object 
 2   IMDB Rating  30 non-null     float64
 3   Biography    30 non-null     int64  
 4   Drama        30 non-null     int64  
 5   Thriller     30 non-null     int64  
 6   Comedy       30 non-null     int64  
 7   Crime        30 non-null     int64  
 8   Mystery      30 non-null     int64  
 9   History      30 non-null     int64  
dtypes: float64(1), int64(8), object(1)
memory usage: 2.5+ KB


### Creating the kNN Model

In [6]:
# Selecting Features - excluding 'Movie ID' and 'Movie Name' for the kNN features matrix
X = df.drop(['Movie ID', 'Movie Name'], axis=1)

In [7]:
# Initialize and Train the kNN Model
knn = NearestNeighbors(n_neighbors=5, metric='euclidean')
knn.fit(X)

### The Recommendation System

When a user opens a movie page on our website (for example, "The Post") and scrolls down to the "More Like This" section, the following back-end program runs to produce the top 5 recommendations.

1. The program begins by creating a feature vector for "The Post" based on its genres and IMDB rating.

2. The kNN algorithm then compares "The Post" against our dataset of movies. It calculates the Euclidean distance between the feature vector of "The Post" and the feature vector of every movie in our dataset, using the genre flags and the IMDB rating.

3. Once the algorithm identifies the 5 closest neighbors, it selects these movies as recommendations.
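
The distance calculation in step 2 can be sketched by hand with NumPy. This is only an illustration, not the notebook's actual pipeline: it uses four rows copied from the data preview above instead of all 30 movies, and the names `features`, `query`, and `manual_distances` are chosen here for the example.

```python
import numpy as np

# Feature order: IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
# Rows copied from the data preview above (a small subset of the 30 movies).
names = ["The Imitation Game", "A Beautiful Mind", "Forrest Gump", "The Karate Kid"]
features = np.array([
    [8.0, 1, 1, 1, 0, 0, 0, 0],
    [8.2, 1, 1, 0, 0, 0, 0, 0],
    [8.8, 0, 1, 0, 0, 0, 0, 0],
    [7.2, 0, 1, 0, 0, 0, 0, 0],
])

# Query vector for "The Post" (same values as the notebook's feature vector)
query = np.array([7.2, 1, 1, 0, 0, 0, 0, 1])

# Euclidean distance: square root of the summed squared differences per row
manual_distances = np.sqrt(((features - query) ** 2).sum(axis=1))

# Smaller distance means more similar, so sort ascending
order = np.argsort(manual_distances)
for i in order:
    print(f"{names[i]}: {manual_distances[i]:.3f}")
```

`NearestNeighbors` with `metric='euclidean'` ranks movies by this same quantity, so the sketch is a way to sanity-check individual distances.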

In [8]:
# Manually creating a feature vector for the Query Movie – "The Post"
# IMDB Rating = 7.2, Biography = Yes, Drama = Yes, Thriller = No, Comedy = No, Crime = No, Mystery = No, History = Yes
the_post_vector = np.array([[7.2, 1, 1, 0, 0, 0, 0, 1]])
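
Typing the vector by hand makes it easy to put a flag in the wrong position. As a minimal sketch, a small helper could build it from a rating and a list of genre names. Both `make_feature_vector` and `GENRES` are hypothetical names introduced for this example, and the genre order is assumed to match the columns of `X`.

```python
import numpy as np

# Genre columns, assumed to be in the same order as the feature matrix X
GENRES = ["Biography", "Drama", "Thriller", "Comedy", "Crime", "Mystery", "History"]

def make_feature_vector(imdb_rating, genres):
    """Hypothetical helper: return a 1 x 8 array [rating, genre flags...]."""
    unknown = set(genres) - set(GENRES)
    if unknown:
        # Fail loudly on a misspelled or unsupported genre name
        raise ValueError(f"Unknown genres: {sorted(unknown)}")
    flags = [1.0 if g in genres else 0.0 for g in GENRES]
    return np.array([[imdb_rating, *flags]])

# Same values as the hand-written vector for "The Post"
example_vector = make_feature_vector(7.2, ["Biography", "Drama", "History"])
print(example_vector)
```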

In [9]:
# Finding the Most Similar Movies
the_post_vector_df = pd.DataFrame(the_post_vector, columns=X.columns)
distances, indices = knn.kneighbors(the_post_vector_df)

In [10]:
# Extracting and Displaying the Similar Movies
similar_movies = df.iloc[indices[0]]['Movie Name']
print("Movies similar to 'The Post':")
print(similar_movies)

Movies similar to 'The Post':
28    12 Years a Slave
27       Hacksaw Ridge
29      Queen of Katwe
16      The Wind Rises
2     A Beautiful Mind
Name: Movie Name, dtype: object
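
The `distances` array returned by `kneighbors` is not shown above, but it can be useful to display it next to the movie names. The following is a hedged sketch on a five-row sample copied from the data preview, not the full 30-movie dataset, so its neighbors differ from the output above. The names `sample`, `model`, and `report` are illustrative only.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# A few rows copied from the data preview; the notebook itself uses all 30 movies
sample = pd.DataFrame({
    "Movie Name": ["The Imitation Game", "Ex Machina", "A Beautiful Mind",
                   "Forrest Gump", "The Karate Kid"],
    "IMDB Rating": [8.0, 7.7, 8.2, 8.8, 7.2],
    "Biography": [1, 0, 1, 0, 0],
    "Drama": [1, 1, 1, 1, 1],
    "Thriller": [1, 0, 0, 0, 0],
    "Comedy": [0, 0, 0, 0, 0],
    "Crime": [0, 0, 0, 0, 0],
    "Mystery": [0, 1, 0, 0, 0],
    "History": [0, 0, 0, 0, 0],
})
features = sample.drop(columns=["Movie Name"])

# Fewer neighbors than the notebook's 5, because the sample is tiny
model = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(features)

# Query vector for "The Post", labelled with the same column names
query = pd.DataFrame([[7.2, 1, 1, 0, 0, 0, 0, 1]], columns=features.columns)
dist, idx = model.kneighbors(query)

# Pair each neighbor's name with its distance to the query
report = pd.DataFrame({
    "Movie Name": sample["Movie Name"].iloc[idx[0]].to_numpy(),
    "Distance": dist[0],
})
print(report)
```

In the notebook itself, the same pairing could be built from `df`, `indices[0]`, and `distances[0]`.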


These recommended films are close to "The Post" in both genre and IMDB rating, so the suggestions are relevant to the movie the user is currently viewing. The engine then sends these recommendations back to the website, where they are displayed to the user under the "More Like This" section.