## Content-Based Recommendations

**Introduction**

A content-based recommender suggests items to a user based on item attributes, so the data it needs describes the properties of each item.
To find items similar to ones a user already likes, we can encode each categorical attribute as a vector and compute distances between items. Because this approach does not depend on user feedback, it also works for brand-new products that have never been purchased or viewed.
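
As a minimal sketch of the encoding idea above, the snippet below one-hot encodes two categorical attributes and computes pairwise distances. The items, attribute names, and the choice of Hamming distance are assumptions made purely for illustration; they are not part of the movie dataset used later.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical items with two categorical attributes (illustrative only)
items = pd.DataFrame(
    {"colour": ["red", "red", "blue"], "size": ["small", "large", "large"]},
    index=["item_a", "item_b", "item_c"],
)

# One-hot encode every attribute so each item becomes a 0/1 vector
encoded = pd.get_dummies(items).astype(int)

# Hamming distance: the fraction of vector positions in which two items differ
distances = pd.DataFrame(
    squareform(pdist(encoded.values, metric="hamming")),
    index=encoded.index,
    columns=encoded.index,
)
print(distances)
```

Here `item_a` and `item_b` share a colour but not a size, so they end up closer to each other than `item_a` and `item_c`, which share nothing.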

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from itertools import permutations
from sklearn.metrics import jaccard_score
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Load in the second dataset
movie_genre_df = pd.read_csv('movies.csv')
print(movie_genre_df.head())
print()

# Inspect data columns
print(movie_genre_df.info())
print()

# Count unique values in each column
print(movie_genre_df.nunique())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB
None

movieId    9742
title      9737
genres      951
dtype: int64

In [3]:
# Split the pipe-delimited genres column into lists
movie_genre_df['genres'] = [str(x).split('|') for x in movie_genre_df['genres']]

# Give each genre of a title its own row
movie_genre_df = movie_genre_df.explode('genres')

# Inspect unique value of genres
print(movie_genre_df['genres'].nunique())

movie_genre_df

20


,movieId,title,genres
0,1,Toy Story (1995),Adventure
0,1,Toy Story (1995),Animation
0,1,Toy Story (1995),Children
0,1,Toy Story (1995),Comedy
0,1,Toy Story (1995),Fantasy
...,...,...,...
9738,193583,No Game No Life: Zero (2017),Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Animation


In [4]:
# Inspect the subset of movie series 'Toy Story' 
movie_genre_df[movie_genre_df['title'].isin([x for x in movie_genre_df['title'] if 'toy story' in str(x.lower())])]

,movieId,title,genres
0,1,Toy Story (1995),Adventure
0,1,Toy Story (1995),Animation
0,1,Toy Story (1995),Children
0,1,Toy Story (1995),Comedy
0,1,Toy Story (1995),Fantasy
2355,3114,Toy Story 2 (1999),Adventure
2355,3114,Toy Story 2 (1999),Animation
2355,3114,Toy Story 2 (1999),Children
2355,3114,Toy Story 2 (1999),Comedy
2355,3114,Toy Story 2 (1999),Fantasy


Because titles and genres are in a one-to-many relationship, a title with multiple genres appears in multiple rows. The desired layout, however, is one row per title with each genre as a separate attribute column. In the next step, the dataset is therefore pivoted (cross-tabulated) into that shape.

In [5]:
# Create a cross-tabulated DataFrame from the title and genres columns
movie_cross_table = pd.crosstab(movie_genre_df['title'], movie_genre_df['genres'])

# Inspect again on the subset of movie series 'Toy Story'
movie_cross_table[movie_cross_table.index.isin([x for x in movie_cross_table.index if 'toy story' in str(x.lower())])]

title,(no genres listed),Action,Adventure,Animation,Children,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,IMAX,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
Toy Story (1995),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Toy Story 2 (1999),0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0
Toy Story 3 (2010),0,0,1,1,1,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0


### 1. By using Jaccard Distance

Jaccard similarity: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$, which always lies between 0 and 1.

`pdist` with `metric='jaccard'` returns the Jaccard *distance*, which measures how different two rows are, so to obtain the similarity, subtract the distance from 1.
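
As a small worked example of the formula, the sketch below computes the Jaccard similarity between 'Toy Story (1995)' and 'Toy Story 3 (2010)' directly from their genre sets, as shown in the cross table above, and checks that one minus SciPy's Jaccard distance gives the same value. The variable names are illustrative.

```python
import numpy as np
from scipy.spatial.distance import jaccard

# Genre sets read off the cross table above
toy_story = {"Adventure", "Animation", "Children", "Comedy", "Fantasy"}
toy_story_3 = toy_story | {"IMAX"}

# |A intersect B| / |A union B|: 5 shared genres out of 6 distinct genres
similarity = len(toy_story & toy_story_3) / len(toy_story | toy_story_3)
print(similarity)

# Same comparison on boolean vectors: similarity = 1 - Jaccard distance
all_genres = sorted(toy_story | toy_story_3)
u = np.array([g in toy_story for g in all_genres], dtype=bool)
v = np.array([g in toy_story_3 for g in all_genres], dtype=bool)
similarity_from_distance = 1 - jaccard(u, v)
print(similarity_from_distance)
```

Both values equal 5/6, which agrees with the `jaccard_score` result in the next cell.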

In [6]:
jaccard_score(movie_cross_table.loc['Toy Story (1995)'].values, movie_cross_table.loc['Toy Story 3 (2010)'].values)

0.8333333333333334

In [7]:
# Calculate all pairwise distances
jaccard_distances = pdist(movie_cross_table.values, metric='jaccard')

# Convert the distances to a square matrix
jaccard_similarity_array = 1 - squareform(jaccard_distances)

# Wrap the array in a dataframe
jaccard_similarity_df = pd.DataFrame(jaccard_similarity_array, index = movie_cross_table.index, columns = movie_cross_table.index)

jaccard_similarity_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
'71 (2014),1.0,0.125,0.2,0.333333,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.4,0.2,0.2,0.2,0.4,0.4,0.4,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.125,1.0,0.0,0.0,0.0,0.0,0.2,0.0,0.142857,0.285714,...,0.0,0.0,0.0,0.0,0.0,0.142857,0.142857,0.142857,0.166667,0.166667
'Round Midnight (1986),0.2,0.0,1.0,0.2,0.333333,0.0,0.0,0.5,0.25,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.333333
'Salem's Lot (2004),0.333333,0.0,0.2,1.0,0.2,0.0,0.0,0.25,0.166667,0.0,...,0.4,0.75,0.5,0.5,0.2,0.166667,0.166667,0.166667,0.0,0.0
'Til There Was You (1997),0.2,0.0,0.333333,0.2,1.0,0.5,0.0,0.5,0.666667,0.0,...,0.25,0.25,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0


In [8]:
# Find all titles containing 'princess'
print([x for x in jaccard_similarity_df.index if 'princess' in str(x.lower())])
print()

# Inspect the genres listed for that movie
print(movie_genre_df[movie_genre_df['title']=='Princess Diaries, The (2001)'])
print()

# Rank all titles by similarity to 'Princess Diaries, The (2001)' and keep scores >= 0.7
similarity_df = jaccard_similarity_df.loc['Princess Diaries, The (2001)'].sort_values(ascending=False)
print(similarity_df[similarity_df.values >= 0.7])

["Dragon Ball: Sleeping Princess in Devil's Castle (Doragon bôru: Majinjô no nemuri hime) (1987)", 'Ice Princess (2005)', 'Little Princess, A (1995)', 'Little Princess, The (1939)', 'Princess (Prinsessa) (2010)', 'Princess Blade, The (Shura Yukihime) (2001)', 'Princess Bride, The (1987)', 'Princess Caraboo (1994)', 'Princess Diaries 2: Royal Engagement, The (2004)', 'Princess Diaries, The (2001)', 'Princess Mononoke (Mononoke-hime) (1997)', 'Princess and the Frog, The (2009)', 'Princess and the Pirate, The (1944)', 'Princess and the Warrior, The (Krieger und die Kaiserin, Der) (2000)', 'Swan Princess, The (1994)', 'Tale of Princess Kaguya, The (Kaguyahime no monogatari) (2013)']

      movieId                         title    genres
3446     4700  Princess Diaries, The (2001)  Children
3446     4700  Princess Diaries, The (2001)    Comedy
3446     4700  Princess Diaries, The (2001)   Romance

title
Princess Diaries, The (2001)                              1.00
Lizzie McGuire Movie, The

### 2. Text-based Similarities (by using TF-IDF)

When items come with descriptive text attributes, such as a plot overview, similarity can be computed from the text itself.

In [9]:
# Load in another movie dataset obtained from Kaggle
meta_movie = pd.read_csv('movies_metadata.csv')
meta_movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re



In [10]:
# Keep only the columns that are needed
movie_all_language_df = meta_movie[['original_language', 'original_title', 'overview']]
print('Before cleaning: \n'+ f'{movie_all_language_df.nunique()}')
print()

# Keep only movies whose original language is English
movie_en_df = movie_all_language_df[movie_all_language_df['original_language']=='en']

# Inspect properties of the dataframe
print('After cleaning: \n' + f'{movie_en_df.nunique()}')
movie_en_df.head()

Before cleaning: 
original_language       92
original_title       43373
overview             44307
dtype: int64

After cleaning: 
original_language        1
original_title       30493
overview             32095
dtype: int64


,original_language,original_title,overview
0,en,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,en,Jumanji,When siblings Judy and Peter discover an encha...
2,en,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,en,Father of the Bride Part II,Just when George Banks has recovered from his ...


In [24]:
# Drop language column
movie_plot_df = movie_en_df.drop(columns = ['original_language'])
print(movie_plot_df.info())
print()

# Remove null values
movie_plot_df = movie_plot_df.dropna(axis=0)
print(movie_plot_df.info())

# Remove duplicate titles, keeping the last occurrence
movie_plot_df = movie_plot_df.drop_duplicates(subset=['original_title'],keep='last')
print(movie_plot_df.info())

# Change column's name
movie_plot_df.columns = ['Title', 'Plot']
movie_plot_df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32269 entries, 0 to 45465
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   original_title  32269 non-null  object
 1   overview        32200 non-null  object
dtypes: object(2)
memory usage: 756.3+ KB
None

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32200 entries, 0 to 45465
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   original_title  32200 non-null  object
 1   overview        32200 non-null  object
dtypes: object(2)
memory usage: 754.7+ KB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30429 entries, 0 to 45465
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   original_title  30429 non-null  object
 1   overview        30429 non-null  object
dtypes: object(2)
memory usage: 713.2+ KB
None


,Title,Plot
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...


In [23]:
movie_plot_df[movie_plot_df['Title']=='Twilight']

,Title,Plot
13133,Twilight,When Bella Swan moves to a small town in the P...


**TF-IDF**

The `max_df` argument removes terms that appear too frequently, also known as "corpus-specific stop words". The `min_df` argument removes terms that appear too infrequently.
For example:
- max_df = 0.50 means "ignore terms that appear in more than 50% of the documents".
- min_df = 0.01 means "ignore terms that appear in less than 1% of the documents".
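
The effect of these two thresholds can be seen on a tiny corpus. The following is a minimal sketch: the four documents and the `vocabulary` helper are made up for illustration, and the thresholds are set to 0.5 only because the corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus (illustrative only)
docs = ["the cat sat", "the dog sat", "the bird flew", "the fish swam"]

def vocabulary(**kwargs):
    """Fit a TfidfVectorizer with the given options and return its sorted terms."""
    vec = TfidfVectorizer(**kwargs)
    vec.fit(docs)
    return sorted(vec.vocabulary_)

vocab_all = vocabulary()                         # no filtering
vocab_max = vocabulary(max_df=0.5)               # drop terms in more than 50% of documents
vocab_min = vocabulary(min_df=0.5)               # drop terms in fewer than 50% of documents
vocab_both = vocabulary(min_df=0.5, max_df=0.5)  # apply both thresholds

print(vocab_all)
print(vocab_max)
print(vocab_min)
print(vocab_both)
```

"the" occurs in every document, so `max_df=0.5` removes it; "sat" occurs in exactly half of the documents, so it survives both thresholds; every other term occurs in a single document and is removed by `min_df=0.5`.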

In [12]:
# Instantiate the TF-IDF vectorizer
vectorizer = TfidfVectorizer(min_df=0.005, max_df=0.7, stop_words='english')
#min_df=0.01, max_df=0.5,
# Fit and transform the plot column
vectorized_data = vectorizer.fit_transform(movie_plot_df['Plot'])

# Look at the number of features generated
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs >= 1.0)
print(len(vectorizer.get_feature_names_out()))

947


In [13]:
# Create a DataFrame from the TF-IDF array
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
tfidf_df = pd.DataFrame(vectorized_data.toarray(), columns=vectorizer.get_feature_names_out())

# Assign movie title to index
tfidf_df.index = movie_plot_df['Title']
tfidf_df.head()

Title,000,10,12,15,20,30,abandoned,able,academy,accident,...,world,writer,written,wrong,year,years,york,young,younger,youth
Toy Story,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Jumanji,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.124431,0.0,0.0,0.0,0.0,0.138707,0.0,0.0,0.0,0.0
Grumpier Old Men,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Waiting to Exhale,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Father of the Bride Part II,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


**Cosine Similarity**
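
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends on direction rather than magnitude. The sketch below checks this on three made-up vectors and compares a manual calculation with scikit-learn's `cosine_similarity`; the vectors are illustrative and unrelated to the movie data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical vectors (illustrative only)
a = np.array([[1.0, 2.0, 0.0]])
b = np.array([[2.0, 4.0, 0.0]])  # same direction as a, twice the length
c = np.array([[0.0, 0.0, 3.0]])  # shares no non-zero component with a

# Manual calculation: dot product divided by the product of the norms
manual_ab = float((a @ b.T)[0, 0] / (np.linalg.norm(a) * np.linalg.norm(b)))

# scikit-learn returns a matrix with one row per row of the first argument
sklearn_ab = float(cosine_similarity(a, b)[0, 0])
sklearn_ac = float(cosine_similarity(a, c)[0, 0])

print(manual_ab, sklearn_ab, sklearn_ac)
```

Vectors pointing in the same direction score 1 regardless of length, and vectors with no overlapping terms score 0. Because TF-IDF values are non-negative, scores for plot vectors fall between 0 and 1.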

In [14]:
# Create an array of cosine similarity values
cosine_similarity_array = cosine_similarity(tfidf_df)

# Wrap the array in a pandas Dataframe
cosine_similarity_df = pd.DataFrame(cosine_similarity_array, index=tfidf_df.index, columns=tfidf_df.index)
cosine_similarity_df.head()

Title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II,Tom and Huck,Sudden Death,GoldenEye,The American President,Dracula: Dead and Loving It,...,Deep Hearts,The Morning After,House of Horrors,Shadow of the Blair Witch,The Burkittsville 7,Caged Heat 3000,Robin Hood,Betrayal,Satana likuyushchiy,Queerama
Toy Story,1.0,0.07031,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.036598,0.0
Jumanji,0.07031,1.0,0.07486,0.0,0.0,0.0,0.282389,0.0,0.031537,0.0,...,0.0,0.0,0.0,0.0,0.0,0.032686,0.0,0.0,0.071772,0.038988
Grumpier Old Men,0.0,0.07486,1.0,0.0,0.072844,0.021133,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.063822,0.0
Waiting to Exhale,0.0,0.0,0.0,1.0,0.0,0.033474,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.10196,0.0,0.0,0.050435,0.109745
Father of the Bride Part II,0.0,0.0,0.072844,0.0,1.0,0.0,0.080383,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.020444,0.0


In [15]:
# Find all titles containing 'princess diaries'
print([x for x in cosine_similarity_df.index if 'princess diaries' in str(x.lower())])

# Show the 10 titles most similar to 'The Princess Diaries'
cosine_similarity_df.loc['The Princess Diaries'].nlargest(10)

['The Princess Diaries', 'The Princess Diaries 2: Royal Engagement']


Title
The Princess Diaries           1.000000
Club Sandwich                  0.337777
The Diary of a Teenage Girl    0.334437
The Son-Daughter               0.333637
Gourmet Detective              0.333033
As Melhores Coisas do Mundo    0.328041
Free Ride                      0.311860
In Her Skin                    0.310102
San Francisco 2.0              0.304489
How to Cook Your Life          0.299432
Name: The Princess Diaries, dtype: float64

Excluding the movie itself, the most similar title in these results is 'Club Sandwich' (cosine similarity of about 0.338).

In [16]:
def find_movie(name):
    """Return all titles that contain `name`, ignoring case."""
    return [x for x in cosine_similarity_df.index if name.lower() in str(x).lower()]

In [17]:
find_movie('opera')

['Operation Dumbo Drop',
 'Phantom of the Opera',
 'Operation Petticoat',
 'Operation Pacific',
 'A Night at the Opera',
 'Operation Crossbow',
 'Operation Homecoming: Writing the Wartime Experience',
 'Standard Operating Procedure',
 'Repo! The Genetic Opera',
 'Operator 13',
 'Operation: Endgame',
 'Jerry Springer: The Opera',
 'Operation Mad Ball',
 'Operation: Daybreak',
 'Opera Australia: The Pirates of Penzance',
 'Charlie Chan at the Opera',
 'The Phantom of the Opera',
 'Björk at the Royal Opera House',
 'Modus Operandi',
 'Operation Dalmatian: The Big Adventure',
 'Operazione Goldman',
 'The Phantom of the Opera at the Royal Albert Hall',
 'Fighter Pilot: Operation Red Flag',
 'Operation Diplomat',
 'Owned & Operated',
 'Erebus: Operation Overdue',
 'Operation Avalanche',
 'Operator',
 'The Lonedale Operator',
 "What's Opera, Doc?",
 'Operation: Rabbit']

### 3. My User Profile

In [28]:
# Create a list of movies I like
list_of_movies_enjoyed = ['Coach Carter', 'Charlie and the Chocolate Factory',
'The Princess Diaries', 'Matilda', '21', 'Catch Me If You Can', 'The Wolf of Wall Street',
'X-Men Origins: Wolverine', 'The Black Panther', 'Sherlock Holmes: A Game of Shadows', 'Mulan',
'The Parent Trap', 'Monsters University', 'Clueless', 'Legally Blonde', 'Parasite', 'The Chronicles of Narnia: The Lion, the Witch and the Wardrobe',
'The Hunger Games', 'School of Rock','Indiana Jones and the Kingdom of the Crystal Skull',
'The Incredibles','Kingsman: The Secret Service','Argo','The Lord of the Rings: The Return of the King',
'White Chicks', 'The House Bunny', 'Suicide Squad','Pitch Perfect','The Devil Wears Prada', 'The Holiday','The Phantom of the Opera',
'Harry Potter and the Philosopher\'s Stone','Lemony Snicket\'s A Series of Unfortunate Events']

# Create a subset of only the movies in the target list
movies_enjoyed_df = tfidf_df.reindex(list_of_movies_enjoyed)
movies_enjoyed_df.head()

Title,000,10,12,15,20,30,abandoned,able,academy,accident,...,world,writer,written,wrong,year,years,york,young,younger,youth
Coach Carter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Charlie and the Chocolate Factory,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.568365,0.0,0.0,0.0,0.0,0.0,0.0,0.266935,0.0,0.0
The Princess Diaries,0.0,0.0,0.0,0.274565,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.173174,0.0,0.0,0.0,0.0,0.0
Matilda,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
21,0.205212,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.129256,0.0,0.108915,0.0,0.0


In [32]:
# Generate the user profile by finding the average scores of movies
user_prof = movies_enjoyed_df.mean()
user_prof

000        0.006219
10         0.000000
12         0.008383
15         0.008320
20         0.000000
             ...   
years      0.013190
york       0.013533
young      0.029100
younger    0.000000
youth      0.000000
Length: 947, dtype: float64
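
The profile-building step can be sketched end to end on a toy matrix. Everything below is hypothetical: the titles, terms, and scores are made up, and filtering the liked list down to titles that exist in the index is an added precaution (`reindex` would otherwise produce all-NaN rows and `drop` would raise a `KeyError` for a missing title), not something the notebook itself does.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF-like matrix: rows are movies, columns are terms
toy_tfidf = pd.DataFrame(
    [[1.0, 0.0, 0.0],
     [0.8, 0.6, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]],
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
    columns=["school", "girl", "war"],
)
liked = ["Movie A", "Movie B", "Not In Catalogue"]

# Keep only the liked titles that actually exist in the matrix
known = [title for title in liked if title in toy_tfidf.index]

# User profile: the mean term vector of the liked movies
profile = toy_tfidf.loc[known].mean()

# Score every movie the user has not already listed against the profile
candidates = toy_tfidf.drop(index=known)
scores = pd.Series(
    cosine_similarity(profile.values.reshape(1, -1), candidates.values)[0],
    index=candidates.index,
).sort_values(ascending=False)
print(scores)
```

'Movie C' ranks first because it shares the term "girl" with the profile, while 'Movie D' scores 0 because it shares no terms with any liked movie.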

In [36]:
# Find subset of tfidf_df that does not include movies in list_of_movies_enjoyed
tfidf_subset_df = tfidf_df.drop(list_of_movies_enjoyed, axis=0)

# Calculate the cosine_similarity and wrap it in a DataFrame
similarity_array = cosine_similarity(user_prof.values.reshape(1,-1), tfidf_subset_df)
similarity_df = pd.DataFrame(similarity_array.T, index = tfidf_subset_df.index, columns = ['similarity_score'])

# Sort the values from high to low by the values in the similarity_score
sorted_similarity_df = similarity_df.sort_values(by = 'similarity_score', ascending=False)
sorted_similarity_df['similarity_score'].nlargest(50)

Title
Sixteen Candles                            0.331357
Harriet the Spy: Blog Wars                 0.311832
The Godless Girl                           0.306106
Three O'Clock High                         0.302900
Flirting                                   0.298180
Girl, Positive                             0.296140
Avalon High                                0.294995
Teen Vamp                                  0.290873
LOL                                        0.290611
Mischief                                   0.290529
Mean Girls 2                               0.290312
Model Behavior                             0.287012
32A                                        0.286051
Fab Five: The Texas Cheerleader Scandal    0.284968
Please Vote for Me                         0.284835
Buster and Billie                          0.283678
Austin High                                0.282660
All the Real Girls                         0.282278
Very Good Girls                            0.282116
The Se