# 50-get-5recommendation-using-euclidean-distance
> In this notebook, I am going to get the 5 recommendation movies using Euclidean distance.

Now, we have 4 datasets with different clustering methods and different columns.

1. including ['votes'] column + Kmode + Kmeans
2. including ['votes'] column + Kmode 
3. NOT including ['votes'] column + Kmode + Kmeans
4. NOT including ['votes'] column + Kmode 

Using Euclidian distance, let's create a function returning the 5 closest movie id.

In [112]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os.path
import glob

from scipy.spatial import distance

In [113]:
import warnings
warnings.filterwarnings('ignore')

# Import data

In [171]:
# Define file paths
prefix_path = os.path.expanduser('~/')
# If you want to run this code, please edit personal path below.
personal_path = 'Vanderbilt University/2021 Fall/'
folder_path = 'case2/Data/'

kmodes_withVotes = 'yesVotes_kmode_final_df.csv'
kmodes_noVotes = 'noVotes_kmode_final_df.csv'
kmodes_kmeans_withVotes = 'yesVotes_kmode_kmeans_final_df.csv'

# 1. including ['votes'] column + Kmode + Kmeans
df_kmodes_withVotes = pd.read_csv(prefix_path + personal_path + folder_path + kmodes_withVotes)

# 2. including ['votes'] column + Kmode 
df_kmodes_kmeans_withVotes = pd.read_csv(prefix_path + personal_path + folder_path + kmodes_kmeans_withVotes)

# 4. NOT including ['votes'] column + Kmode 
df_kmodes_noVotes = pd.read_csv(prefix_path + personal_path + folder_path + kmodes_noVotes)

####### at this time, Jay is working on creating "3. NOT including ['votes'] column + Kmode + Kmeans" data#########
####### Once she finished her job, we can also get 3rd way's recommendation #################

# Drop unnecessary column 'Unnamed: 0'
df_kmodes_withVotes = df_kmodes_withVotes.drop(['Unnamed: 0'], axis = 1)
df_kmodes_noVotes = df_kmodes_noVotes.drop(['Unnamed: 0'], axis = 1)
df_kmodes_kmeans_withVotes = df_kmodes_kmeans_withVotes.drop(['Unnamed: 0'], axis = 1)

In [151]:
df_kmodes_withVotes.shape

(3018, 11)

In [189]:
df_kmodes_withVotes.head()

Unnamed: 0,id,group,review_1st,review_2nd,description_1st,description_2nd,overview_1st,overview_2nd,votes,reviews_from_critics,popularity
0,tt0248123,1,0.809561,0.784496,0.450726,0.026841,0.841725,0.437456,0.738854,0.008197,0.000225
1,tt0443584,1,0.119524,0.42467,0.662225,0.261958,0.978468,0.143147,0.464968,0.012295,0.028406
2,tt0242527,1,0.384807,0.343383,0.740054,0.55419,0.817338,0.069299,0.700637,0.062842,0.02381
3,tt0374180,1,0.821214,0.161882,0.329482,0.479809,0.85441,0.196915,0.579618,0.043716,0.046704
4,tt0428856,1,0.672031,0.535561,0.473752,0.034677,0.912749,0.273534,0.713376,0.053279,0.029964


# Create a function to recommend the 5 closest movie ids when only used Kmode

In [161]:
def get_5recommendation_noKmeans(df, movie_id):
    """
    Get the dataframe having distance between input movies and others, and 5 recommendation movie ID
    (This function only treats a dataframe created by NO K-means clustering)
    
    Parameters
    ----------
    df : dataframe
        a dataframe among 4 I mentioned above
    movie_id : str
        input movie's id
        
    Returns
    -------
    group_df
        a dataframe having only movies within the same group as input movie + distance between input movie to others
    recommendation
        a list containig 5 movie's id (The 5 closest distance movie's id)
    """
    # Find the group number of input movie.
    group_num = df[df["id"] == movie_id]['group'].values[0]
    # Get dataframe of which the group number is same as the input movie's group
    group_df = df[df["group"] == group_num].reset_index(drop = True)
    # Convert input movie's features into list
    standard_point = group_df[group_df["id"] == movie_id].loc[:, "review_1st":"popularity"].values.tolist()
    
    # Calculate the euclidean distance between input movie and other movies within the same cluster
    for i in range(group_df.shape[0]):
        point = group_df.iloc[i]["review_1st":"popularity"].values.tolist()
        
        group_df.loc[i, "distance"] = distance.euclidean(standard_point, point)
    
    group_df_temp = group_df[group_df.id != movie_id]
    recommendation = group_df_temp.sort_values(by = ['distance'])[:5]['id'].values.tolist()
    
    return group_df, recommendation

# Create a function to recommend the 5 closest movie ids when used both Kmode and Kmeans

In [162]:
def get_5recommendation_yesKmeans(df, movie_id):
    """
    Get the dataframe having distance between input movies and others, and 5 recommendation movie ID
    (This function only treats a dataframe created by K-mode AND K-means clustering)
    
    Parameters
    ----------
    df : dataframe
        a dataframe among 4 I mentioned above
    movie_id : str
        input movie's id
        
    Returns
    -------
    subgroup_df
        a dataframe having only movies within the same group as input movie + distance between input movie to others
    recommendation
        a list containig 5 movie's id (The 5 closest distance movie's id)
    """
    # Find the group number of input movie.
    group_num = df[df["id"] == movie_id]['group'].values[0]
    # Find the subgroup number of input movie.
    subgroup_num = df[df["id"] == movie_id]['subgroup'].values[0]
    # Get dataframe of which the group number is same as the input movie's group
    subgroup_df = df[(df["group"] == group_num)&(df["subgroup"] == subgroup_num)].reset_index(drop = True)
    # Convert input movie's features into list
    standard_point = subgroup_df[subgroup_df["id"] == movie_id].loc[:, "review_1st":"popularity"].values.tolist()
    
    # Calculate the euclidean distance between input movie and other movies within the same cluster
    for i in range(subgroup_df.shape[0]):
        point = subgroup_df.iloc[i]["review_1st":"popularity"].values.tolist()
        
        subgroup_df.loc[i, "distance"] = distance.euclidean(standard_point, point)
    
    subgroup_df_temp = subgroup_df[subgroup_df.id != movie_id]
    recommendation = subgroup_df_temp.sort_values(by = ['distance'])[:5]['id'].values.tolist()
    
    return subgroup_df, recommendation

# Test

Let's test our functions.

In [182]:
movie_id = "tt0096486"

Let's remind the 4 recommendation methods:

1. including ['votes'] column + Kmode + Kmeans
2. including ['votes'] column + Kmode 
3. NOT including ['votes'] column + Kmode + Kmeans
4. NOT including ['votes'] column + Kmode 

In [185]:
# Get recommendation using "1. including ['votes'] column + Kmode + Kmeans"
_, recommendation5_yesKmeans_noVotes = get_5recommendation_yesKmeans(df_kmodes_kmeans_withVotes, movie_id)

In [183]:
# Get recommendation using "2. including ['votes'] column + Kmode" way
_, recommendation5_noKmeans_yesVotes = get_5recommendation_noKmeans(df_kmodes_withVotes, movie_id)

In [184]:
# Get recommendation using "4. NOT including ['votes'] column + Kmode"
_, recommendation5_noKmeans_noVotes = get_5recommendation_noKmeans(df_kmodes_withVotes, movie_id)

In [188]:
recommendation5_yesKmeans_noVotes

['tt0119415', 'tt0340110', 'tt0080546', 'tt0209368', 'tt0084728']

In [186]:
recommendation5_noKmeans_yesVotes

['tt0119415', 'tt0340110', 'tt0115886', 'tt0080546', 'tt0086189']

In [187]:
recommendation5_noKmeans_noVotes

['tt0119415', 'tt0340110', 'tt0115886', 'tt0080546', 'tt0086189']

2 and 4 shows exactly same recommendation movie IDs. When we look at 1's recommendation list, the first 2 movie ids are same as 2 and 4, but the last 3 movie ids are different.

Since the data is normalized, actually one feature 'Votes' cannot give a big impact on the distance. For this reason, I think with "votes" VS without "votes" actually does not matter in our modeling.