## Content-Based Recommendations

**Introduction**

A content-based recommender suggests items to a user based on item attributes, so the data used here describes the properties of each item.
To judge whether an item is similar to others that a user likes, we encode each item's categorical attributes as a vector and calculate distances between items. Because this approach does not depend on user feedback, it can also be applied to brand-new products that have never been purchased or viewed.
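
As a minimal sketch of the encoding idea, the snippet below turns attribute sets into 0/1 vectors and compares them. The item names, the three-attribute vocabulary, and the helpers `encode` and `mismatch_fraction` are all made up for illustration, and the simple mismatch fraction is only a placeholder distance; the notebook itself uses Jaccard distance further down.

```python
import numpy as np

# Hypothetical attribute vocabulary and items, made up for illustration
attributes = ["Adventure", "Comedy", "Romance"]
items = {
    "Item A": {"Adventure", "Comedy"},
    "Item B": {"Comedy", "Romance"},
    "Item C": {"Adventure", "Comedy"},
}

def encode(item_attributes):
    """Return a 0/1 vector with one position per attribute in the vocabulary."""
    return np.array([int(a in item_attributes) for a in attributes])

def mismatch_fraction(u, v):
    """Fraction of attribute positions where two encoded items disagree."""
    return float(np.mean(u != v))

vectors = {name: encode(attrs) for name, attrs in items.items()}

# Item A and Item C share every attribute; Item A and Item B differ in two of three
dist_ab = mismatch_fraction(vectors["Item A"], vectors["Item B"])
dist_ac = mismatch_fraction(vectors["Item A"], vectors["Item C"])
print(dist_ab, dist_ac)
```

Items with a smaller distance to something the user already likes would be the ones to recommend first.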

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import permutations
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform

In [2]:
# Load in the second dataset
movie_genre_df = pd.read_csv('movies.csv')
print(movie_genre_df.head())
print()

# Inspect data columns
print(movie_genre_df.info())
print()

# Count the unique values in each column
print(movie_genre_df.nunique())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None

movieId    9742
title      9737
genres      951
dtype: int64

In [3]:
# Split the pipe-delimited genres string into a list of genres
movie_genre_df['genres'] = movie_genre_df['genres'].str.split('|')

# Give each genre of a title its own row
movie_genre_df = movie_genre_df.explode('genres')

# Inspect unique value of genres
print(movie_genre_df['genres'].nunique())

movie_genre_df

20


,movieId,title,genres
0,1,Toy Story (1995),Adventure
0,1,Toy Story (1995),Animation
0,1,Toy Story (1995),Children
0,1,Toy Story (1995),Comedy
0,1,Toy Story (1995),Fantasy
...,...,...,...
9738,193583,No Game No Life: Zero (2017),Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Animation


In [4]:
# Inspect the rows for the 'Toy Story' movie series
movie_genre_df[movie_genre_df['title'].str.contains('toy story', case=False)]

,movieId,title,genres
0,1,Toy Story (1995),Adventure
0,1,Toy Story (1995),Animation
0,1,Toy Story (1995),Children
0,1,Toy Story (1995),Comedy
0,1,Toy Story (1995),Fantasy
2355,3114,Toy Story 2 (1999),Adventure
2355,3114,Toy Story 2 (1999),Animation
2355,3114,Toy Story 2 (1999),Children
2355,3114,Toy Story 2 (1999),Comedy
2355,3114,Toy Story 2 (1999),Fantasy


Since titles and genres are in a one-to-many relationship, a title with multiple genres now spans multiple rows. However, the desired outcome is one row per title, with each genre as a separate attribute column. Thus, in the next step, the dataset is pivoted from long to wide format with a cross-tabulation.

In [12]:
# Create a cross-tabulated DataFrame from the title and genres columns
movie_cross_table = pd.crosstab(movie_genre_df['title'], movie_genre_df['genres'])

# Inspect the 'Toy Story' movie series again
movie_cross_table[movie_cross_table.index.str.contains('toy story', case=False)]

title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Toy Story 2 (1999),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Toy Story 3 (2010),0,0,1,1,1,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0
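
The long-to-wide step can be tried on a small sample. The sketch below uses a hypothetical three-movie DataFrame in the same shape as `movies.csv` and compares the split/explode/crosstab route with pandas' `Series.str.get_dummies`, which one-hot encodes a delimited string column directly. The names `toy_df`, `wide_crosstab`, and `wide_dummies` are illustrative only.

```python
import pandas as pd

# Hypothetical three-movie sample in the same shape as movies.csv
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Adventure|Comedy", "Comedy|Romance", "Comedy"],
})

# Route 1: split, explode to one row per (title, genre), then cross-tabulate
long_df = toy_df.assign(genres=toy_df["genres"].str.split("|")).explode("genres")
wide_crosstab = pd.crosstab(long_df["title"], long_df["genres"])

# Route 2: one-hot encode the pipe-delimited string directly
wide_dummies = toy_df["genres"].str.get_dummies(sep="|")
wide_dummies.index = toy_df["title"]

print(wide_crosstab)
print((wide_crosstab.values == wide_dummies.values).all())
```

One difference to keep in mind: `pd.crosstab` counts rows per title, so titles that occur more than once in the source (the `nunique` output above shows 9737 distinct titles for 9742 rows) are merged into a single row and may receive counts above 1, whereas `get_dummies` keeps one row per original record.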


### 1. By using Jaccard Distance

Jaccard similarity: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$, where $A$ and $B$ are the sets of genres of two movies. The value is always between 0 and 1.

`pdist` with `metric='jaccard'` computes the Jaccard distance, which measures how different two rows are, so the similarity is obtained by subtracting the distance from 1.
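
As a worked example, the genre sets of 'Toy Story (1995)' and 'Toy Story 3 (2010)' from the cross table share 5 genres out of 6 in total, so the similarity is 5/6. The sketch below computes this once from plain Python sets and once as 1 minus the `pdist` distance on boolean vectors. The helper `jaccard_similarity` is a hypothetical name introduced here for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Genre sets copied from the cross-table rows for the two movies
toy_story = {"Adventure", "Animation", "Children", "Comedy", "Fantasy"}
toy_story_3 = {"Adventure", "Animation", "Children", "Comedy", "Fantasy", "IMAX"}

def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    return len(a & b) / len(a | b)

# 5 shared genres out of 6 distinct genres
set_based = jaccard_similarity(toy_story, toy_story_3)

# Same comparison on boolean vectors: similarity = 1 - Jaccard distance
genres = sorted(toy_story | toy_story_3)
matrix = np.array([[g in s for g in genres] for s in (toy_story, toy_story_3)])
distance_based = 1 - squareform(pdist(matrix, metric="jaccard"))[0, 1]

print(set_based, distance_based)
```

Both routes should agree up to floating-point rounding, which is a quick sanity check before running `pdist` on the full table.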

In [19]:
jaccard_score(movie_cross_table.loc['Toy Story (1995)'].values, movie_cross_table.loc['Toy Story 3 (2010)'].values)

0.8333333333333334

In [22]:
# Calculate all pairwise distances
jaccard_distances = pdist(movie_cross_table.values, metric='jaccard')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a dataframe
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index = movie_cross_table.index, columns = movie_cross_table.index)

jaccard_similarity_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.125,0.2,0.333333,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.4,0.2,0.2,0.2,0.4,0.4,0.4,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.125,1.0,0.0,0.0,0.0,0.0,0.2,0.0,0.142857,0.285714,...,0.0,0.0,0.0,0.0,0.0,0.142857,0.142857,0.142857,0.166667,0.166667
'Round Midnight (1986),0.2,0.0,1.0,0.2,0.333333,0.0,0.0,0.5,0.25,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.333333
'Salem's Lot (2004),0.333333,0.0,0.2,1.0,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.75,0.5,0.5,0.2,0.166667,0.166667,0.166667,0.0,0.0
'Til There Was You (1997),0.2,0.0,0.333333,0.2,1.0,0.5,0.0,0.5,0.666667,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0
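
With a square similarity DataFrame like this one, looking up the closest matches for a title can be wrapped in a small helper. The sketch below is one possible shape for it, not part of the original notebook: `top_similar` and its parameters `n` and `min_score` are hypothetical, and the 3x3 matrix is made-up data standing in for `jaccard_similarity_df`.

```python
import pandas as pd

# Hypothetical 3x3 similarity matrix standing in for jaccard_similarity_df
titles = ["Movie A", "Movie B", "Movie C"]
toy_similarity_df = pd.DataFrame(
    [[1.0, 0.5, 0.2],
     [0.5, 1.0, 0.8],
     [0.2, 0.8, 1.0]],
    index=titles,
    columns=titles,
)

def top_similar(similarity_df, title, n=5, min_score=0.0):
    """Return up to n titles most similar to `title`, excluding the title itself."""
    scores = similarity_df.loc[title].drop(labels=title)
    scores = scores[scores >= min_score]
    return scores.sort_values(ascending=False).head(n)

result = top_similar(toy_similarity_df, "Movie B", n=2)
print(result)
```

Dropping the queried title avoids returning the trivial self-match, whose similarity is always 1.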


In [56]:
# Find all titles that contain 'princess'
print([x for x in jaccard_similarity_df.index if 'princess' in x.lower()])
print()

# Inspect the genres of the movie 'Princess Diaries, The (2001)'
print(movie_genre_df[movie_genre_df['title'] == 'Princess Diaries, The (2001)'])
print()

# Sort the similarity scores for 'Princess Diaries, The (2001)' and keep those of at least 0.7
similarity_df = jaccard_similarity_df.loc['Princess Diaries, The (2001)'].sort_values(ascending=False)
print(similarity_df[similarity_df >= 0.7])

["Dragon Ball: Sleeping Princess in Devil's Castle (Doragon bôru: Majinjô no nemuri hime) (1987)", 'Ice Princess (2005)', 'Little Princess, A (1995)', 'Little Princess, The (1939)', 'Princess (Prinsessa) (2010)', 'Princess Blade, The (Shura Yukihime) (2001)', 'Princess Bride, The (1987)', 'Princess Caraboo (1994)', 'Princess Diaries 2: Royal Engagement, The (2004)', 'Princess Diaries, The (2001)', 'Princess Mononoke (Mononoke-hime) (1997)', 'Princess and the Frog, The (2009)', 'Princess and the Pirate, The (1944)', 'Princess and the Warrior, The (Krieger und die Kaiserin, Der) (2000)', 'Swan Princess, The (1994)', 'Tale of Princess Kaguya, The (Kaguyahime no monogatari) (2013)']

      movieId                         title    genres
3446     4700  Princess Diaries, The (2001)  Children
3446     4700  Princess Diaries, The (2001)    Comedy
3446     4700  Princess Diaries, The (2001)   Romance

title
Princess Diaries, The (2001)                              1.00
Lizzie McGuire Movie, The