## Content-Based Recommendations

**Introduction**

A content-based recommender suggests items to a user based on item attributes, so the data used here describes the properties of each item.
To decide whether an item is similar to others the user likes, we can encode each categorical attribute as a vector and compute distances between items. Because the method does not rely on user feedback, it can also be applied to brand-new products that have never been purchased or seen before.
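As a rough illustration of the encode-then-compare idea, the sketch below one-hot encodes pipe-separated genre strings and compares items with Jaccard similarity. The toy catalogue, the `jaccard_similarity` helper, and the choice of Jaccard as the measure are assumptions made for this example; the notebook itself does not prescribe a particular metric.

```python
import pandas as pd

# Hypothetical toy catalogue; titles and genre labels are illustrative only.
items = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Adventure|Children|Fantasy", "Adventure|Fantasy", "Comedy|Romance"],
})

# One-hot encode: one row per title, one 0/1 column per genre.
genre_vectors = items.set_index("title")["genres"].str.get_dummies(sep="|")


def jaccard_similarity(a, b):
    """Shared attributes divided by attributes present in either item."""
    intersection = ((a == 1) & (b == 1)).sum()
    union = ((a == 1) | (b == 1)).sum()
    return intersection / union if union else 0.0


# Compare Movie A against the other two items.
sim_ab = jaccard_similarity(genre_vectors.loc["Movie A"], genre_vectors.loc["Movie B"])
sim_ac = jaccard_similarity(genre_vectors.loc["Movie A"], genre_vectors.loc["Movie C"])
print(sim_ab, sim_ac)
```

Movie A and Movie B share two of three distinct genres, so they score higher than Movie A and Movie C, which share none. A new item needs only its attributes to be scored this way, which is why no purchase history is required.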

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import permutations

In [2]:
# Load in the second dataset
movie_genre_df = pd.read_csv('movies.csv')
print(movie_genre_df.head())
print()

# Inspect data columns
print(movie_genre_df.info())
print()

# Count unique values in each column
print(movie_genre_df.nunique())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None

movieId    9742
title      9737
genres      951
dtype: int64

In [3]:
# Split the pipe-separated genres string into a list of genres
movie_genre_df['genres'] = movie_genre_df['genres'].str.split('|')

# Give each genre of a title its own row
movie_genre_df = movie_genre_df.explode('genres')

# Count the unique genre values
print(movie_genre_df['genres'].nunique())

movie_genre_df

20


| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure |
| 0 | 1 | Toy Story (1995) | Animation |
| 0 | 1 | Toy Story (1995) | Children |
| 0 | 1 | Toy Story (1995) | Comedy |
| 0 | 1 | Toy Story (1995) | Fantasy |
| ... | ... | ... | ... |
| 9738 | 193583 | No Game No Life: Zero (2017) | Fantasy |
| 9739 | 193585 | Flint (2017) | Drama |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Action |
| 9740 | 193587 | Bungo Stray Dogs: Dead Apple (2018) | Animation |


In [4]:
# Inspect the rows for the 'Toy Story' movie series (case-insensitive title match)
movie_genre_df[movie_genre_df['title'].str.contains('toy story', case=False, regex=False)]

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure |
| 0 | 1 | Toy Story (1995) | Animation |
| 0 | 1 | Toy Story (1995) | Children |
| 0 | 1 | Toy Story (1995) | Comedy |
| 0 | 1 | Toy Story (1995) | Fantasy |
| 2355 | 3114 | Toy Story 2 (1999) | Adventure |
| 2355 | 3114 | Toy Story 2 (1999) | Animation |
| 2355 | 3114 | Toy Story 2 (1999) | Children |
| 2355 | 3114 | Toy Story 2 (1999) | Comedy |
| 2355 | 3114 | Toy Story 2 (1999) | Fantasy |


Because titles and genres have a one-to-many relationship, a title with several genres now spans several rows. The desired outcome, however, is one row per title with each genre as a separate attribute column. The next step therefore pivots the dataset from this long format into a wide one.
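One way to perform this kind of reshaping is `pd.crosstab`, sketched here on a small hypothetical frame that mirrors the exploded layout. The frame `long_df`, its titles, and the use of `crosstab` are assumptions for illustration and may differ from the approach the notebook takes next.

```python
import pandas as pd

# Hypothetical long-format frame: one row per (title, genre) pair.
long_df = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie B"],
    "genres": ["Adventure", "Comedy", "Comedy"],
})

# Count each (title, genre) pair; the result has one row per title
# and one column per genre, with 1 where the title has that genre.
wide_df = pd.crosstab(long_df["title"], long_df["genres"])
print(wide_df)
```

Each row of `wide_df` is then a genre vector for one title, which is the form needed for distance calculations between items.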