# What question are we trying to answer?
Given a movies data set, which five movies are most similar to a query movie?

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv("https://github.com/ArinB/MSBA-CA-Data/raw/main/CA05/movies_recommendation_data.csv")
df

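| | Movie ID | Movie Name | IMDB Rating | Biography | Drama | Thriller | Comedy | Crime | Mystery | History | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | The Imitation Game | 8.0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | Ex Machina | 7.7 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 46 | A Beautiful Mind | 8.2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 62 | Good Will Hunting | 8.3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 97 | Forrest Gump | 8.8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 98 | 21 | 6.8 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 6 | 31 | Gifted | 7.6 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 3 | Travelling Salesman | 5.9 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8 | 51 | Avatar | 7.9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 47 | The Karate Kid | 7.2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |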


In [4]:
# print the shape of the DataFrame, its columns, and data types
print("DataFrame shape:", df.shape)
print()
print("DataFrame columns:")
print(df.columns)
print()
print("DataFrame data types:")
print(df.dtypes)

DataFrame shape: (30, 11)

DataFrame columns:
Index(['Movie ID', 'Movie Name', 'IMDB Rating', 'Biography', 'Drama',
       'Thriller', 'Comedy', 'Crime', 'Mystery', 'History', 'Label'],
      dtype='object')

DataFrame data types:
Movie ID         int64
Movie Name      object
IMDB Rating    float64
Biography        int64
Drama            int64
Thriller         int64
Comedy           int64
Crime            int64
Mystery          int64
History          int64
Label            int64
dtype: object


In [5]:
# Loop over the column names and print the unique values for each column
for column in df.columns:
    print("• Unique Values {}:".format(column), df[column].unique())

• Unique Values Movie ID: [58  8 46 62 97 98 31  3 51 47 50 49 30 94  6 73 44 65 48 27 57 14 69 17
 12  1 86]
• Unique Values Movie Name: ['The Imitation Game' 'Ex Machina' 'A Beautiful Mind' 'Good Will Hunting'
 'Forrest Gump' '21' 'Gifted' 'Travelling Salesman' 'Avatar'
 'The Karate Kid' 'A Brilliant Young Mind' 'A Time To Kill' 'Interstellar'
 'The Wolf of Wall Street' 'Black Panther' 'Inception' 'The Wind Rises'
 'Spirited Away' 'Finding Forrester' 'The Fountain' 'The DaVinci Code'
 'Stand and Deliver' 'The Terminator' '21 Jump Street' 'The Avengers'
 'Thor: Ragnarok' 'Spirit: Stallion of the Cimarron' 'Hacksaw Ridge'
 '12 Years a Slave' 'Queen of Katwe']
• Unique Values IMDB Rating: [8.  7.7 8.2 8.3 8.8 6.8 7.6 5.9 7.9 7.2 7.4 8.6 7.8 7.3 6.6 8.1 7.1]
• Unique Values Biography: [1 0]
• Unique Values Drama: [1 0]
• Unique Values Thriller: [1 0]
• Unique Values Comedy: [0 1]
• Unique Values Crime: [0 1]
• Unique Values Mystery: [0 1]
• Unique Values History: [0 1]
• Unique Values La

# Building your own Recommender System
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

## `KNeighborsClassifier` Training

In [6]:
from sklearn.neighbors import KNeighborsClassifier

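# Features: IMDB rating plus the seven genre flags (identifiers and the label are dropped)
X = df.drop(["Movie ID", "Movie Name", "Label"], axis=1)
# Target: the Label column, required by the classifier's fit() call
y = df["Label"]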

# Initialize the classifier with the desired parameters
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X, y)

KNeighborsClassifier()

## What recommendations will the user see?

In [7]:
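# Test data: feature vector for the query movie, The Post, in the same column order as X
# [IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History]
the_post = [7.2, 1, 1, 0, 0, 0, 0, 1]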

- The next cell passes the test data point to `knn.kneighbors` as a single-element list.
- `kneighbors` returns two arrays, `distances` and `indices`:
    - `distances` contains the distances between the test point and its k nearest neighbors in the dataset.
    - `indices` contains the positional indices of those neighbors in the original dataset.
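
To make the two return values concrete, here is a minimal brute-force sketch of the same idea in plain Python. It assumes Euclidean distance and uses only the ten rows displayed earlier, not the full 30-row dataset, so its ranking will differ from the notebook's result. The helper name `k_nearest` is hypothetical and is not part of scikit-learn.

```python
import math

# Feature order: IMDB Rating, Biography, Drama, Thriller, Comedy, Crime, Mystery, History
# Only the ten rows displayed earlier (an assumption for illustration, not the full dataset).
movies = [
    ("The Imitation Game",  [8.0, 1, 1, 1, 0, 0, 0, 0]),
    ("Ex Machina",          [7.7, 0, 1, 0, 0, 0, 1, 0]),
    ("A Beautiful Mind",    [8.2, 1, 1, 0, 0, 0, 0, 0]),
    ("Good Will Hunting",   [8.3, 0, 1, 0, 0, 0, 0, 0]),
    ("Forrest Gump",        [8.8, 0, 1, 0, 0, 0, 0, 0]),
    ("21",                  [6.8, 0, 1, 0, 0, 1, 0, 1]),
    ("Gifted",              [7.6, 0, 1, 0, 0, 0, 0, 0]),
    ("Travelling Salesman", [5.9, 0, 1, 0, 0, 0, 1, 0]),
    ("Avatar",              [7.9, 0, 0, 0, 0, 0, 0, 0]),
    ("The Karate Kid",      [7.2, 0, 1, 0, 0, 0, 0, 0]),
]


def k_nearest(query, rows, k=5):
    """Hypothetical helper: return (distances, indices) of the k rows closest to query."""
    # Euclidean distance from the query to every row
    dists = [math.dist(query, features) for _, features in rows]
    # Row positions sorted by ascending distance, truncated to the k nearest
    order = sorted(range(len(rows)), key=lambda i: dists[i])[:k]
    return [dists[i] for i in order], order


the_post = [7.2, 1, 1, 0, 0, 0, 0, 1]
distances, indices = k_nearest(the_post, movies)

for d, i in zip(distances, indices):
    print(f"{movies[i][0]}: {d:.2f}")
```

As with `kneighbors`, the two results are parallel: `distances[j]` is the distance to the row at position `indices[j]`, and both are ordered from nearest to farthest.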

In [8]:
distances, indices = knn.kneighbors([the_post])



In [9]:
print("distances: ", distances)
print()
print("indices: ", indices)

distances:  [[0.9        1.         1.0198039  1.16619038 1.41421356]]

indices:  [[28 27 29 16  2]]


The loop below iterates over the indices of the k nearest neighbors found in the previous step and prints each neighbor's "Movie Name" value from the corresponding row of the original dataset (`df`), together with its distance to the query.

In [12]:
for i, index in enumerate(indices[0]):
    distance = distances[0][i]
    movie_name = df.iloc[index]["Movie Name"]
    print(f"{movie_name}: {distance:.2f}")

12 Years a Slave: 0.90
Hacksaw Ridge: 1.00
Queen of Katwe: 1.02
The Wind Rises: 1.17
A Beautiful Mind: 1.41
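
As a check on the printed distances, and assuming the classifier uses its default metric (Minkowski with p = 2, which is the Euclidean distance), the value for _A Beautiful Mind_ can be reproduced by hand from its row shown earlier (rating 8.2, with Biography and Drama set). Here $q$ is the query vector `the_post` and $x$ is the movie's feature vector:

$$
d(q, x) = \sqrt{\sum_{j=1}^{8} (q_j - x_j)^2}
        = \sqrt{(7.2 - 8.2)^2 + (1 - 0)^2}
        = \sqrt{2} \approx 1.41
$$

Only the rating and the History flag differ; the other six features match and contribute zero.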


# Conclusion
Based on our data, the five movies most similar to _The Post_ are, from nearest to farthest, _12 Years a Slave_, _Hacksaw Ridge_, _Queen of Katwe_, _The Wind Rises_, and _A Beautiful Mind_.