`20 April, 2022`

### **Content Based Recommendation System Exercise**

**Use `movies_metadata` and `ratings_small` datasets**

This exercise may be challenging, but you only need Python (pandas) methods to complete it.

You are free to write regular functions to keep your code tidy.

If you get stuck, follow the hints given in the code comments. I hope they help!

<hr>

### **Import libraries**

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from ast import literal_eval
from sklearn.metrics.pairwise import cosine_similarity

import warnings
warnings.filterwarnings('ignore')

### **Load datasets**

In [2]:
# Load the datasets and save them into variables named df_movies and df_ratings
df_movies = pd.read_csv('movies_metadata.csv')
df_ratings = pd.read_csv('ratings_small.csv')

In [3]:
# An example of df_movies data
df_movies.head(1)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0


In [4]:
# An example of df_ratings data
df_ratings

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
...,...,...,...,...
99999,671,6268,2.5,1065579370
100000,671,6269,4.0,1065149201
100001,671,6365,4.0,1070940363
100002,671,6385,2.5,1070979663


In [5]:
# Check df_movies info
df_movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [6]:
# One movie ID (4912) appears twice with identical rows, so we drop one copy manually
df_movies[df_movies['id'] == '4912']

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
5865,False,,30000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,4912,tt0270288,en,Confessions of a Dangerous Mind,"Television made him famous, but his biggest hi...",...,2002-12-30,33013805.0,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Some things are better left top secret.,Confessions of a Dangerous Mind,False,6.6,281.0
33826,False,,30000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,4912,tt0270288,en,Confessions of a Dangerous Mind,"Television made him famous, but his biggest hi...",...,2002-12-30,33013805.0,113.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Some things are better left top secret.,Confessions of a Dangerous Mind,False,6.6,281.0


In [7]:
df_ratings[df_ratings['movieId'] == 4912]

Unnamed: 0,userId,movieId,rating,timestamp
40803,294,4912,4.0,1143064198
44114,311,4912,3.5,1115160516
62456,452,4912,2.5,1133735678
79568,547,4912,4.5,1199391512
86086,575,4912,3.0,1012605645


In [8]:
# Drop one of the duplicates
df_movies.drop([df_movies.index[33826]], axis=0, inplace=True)

In [9]:
# Drop movies whose title is duplicated under a different movie ID
movie_id = ['2966', '2661', '2310', '1450', '26323', '3875', '923', '3019',
            '3022', '26147', '3604', '3035', '3057', '1678', '2082', '244', '4479', 
            '2212', '299', '2104', '479', '2103', '97936', '4706', '5516', '2982', '3003', 
            '8208', '32582', '1162', '6058', '8128', '912', '2135', '3577', '597']

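# Remove these movies from df_movies ('id' is stored as strings)
for i in movie_id:
    df_movies.drop(df_movies.index[df_movies['id'] == i], inplace = True)

# Note: df_ratings['movieId'] is still int64 here, so comparing it with the string
# IDs above matches no rows and this loop leaves df_ratings unchanged. The inner merge
# in In [12] excludes these ratings anyway because the IDs are gone from df_movies.
for i in movie_id:
    df_ratings.drop(df_ratings.index[df_ratings['movieId'] == i], inplace = True)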

In [10]:
df_movies.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

**As the cell below shows, each value in the 'genres' column is a JSON-like string (a list of dictionaries). We only need the genre names for each movie.**

In [11]:
df_movies['genres'][0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"
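The value above is a string, not a list, and it uses single quotes, so it is a Python literal rather than strict JSON. A minimal sketch of turning one such string into a list of genre names, using `literal_eval` (already imported in this notebook) on the sample value shown above:

```python
from ast import literal_eval

# The raw value is a string that looks like a Python list of dicts.
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

parsed = literal_eval(raw)                       # str -> list of dicts
genre_names = [item['name'] for item in parsed]  # keep only the names

print(genre_names)  # ['Animation', 'Comedy', 'Family']
```

The same two steps are applied to the whole column later in the exercise.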

**Since we only need some of the features from the two DataFrames, merge them so that the result contains the following features:**
    
    - userId
    - movieId
    - title
    - genre
    - rating

In [12]:
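# Merge the DataFrames and save the result into a variable named df_ratings_with_titles

# df_movies['id'] is stored as strings, so cast movieId to str to make the keys comparable
df_ratings['movieId'] = df_ratings['movieId'].astype(str)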
df_ratings_with_titles = pd.merge(
    left=df_ratings,
    right=df_movies[['id', 'title']],
    how='inner',
    left_on='movieId',
    right_on='id'
).drop(columns='timestamp')
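The cast to `str` matters because the two key columns must share a dtype before they can be matched. The sketch below illustrates this on toy stand-ins for the two DataFrames; the ratings are made up, and only the two movie IDs are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for df_ratings and df_movies (rating values are made up).
ratings = pd.DataFrame({'userId': [1, 2], 'movieId': [862, 11], 'rating': [4.0, 5.0]})
movies = pd.DataFrame({'id': ['862', '11'], 'title': ['Toy Story', 'Star Wars']})

# Keys must share a dtype: cast the integer IDs to strings before merging.
ratings['movieId'] = ratings['movieId'].astype(str)

merged = ratings.merge(movies, how='inner', left_on='movieId', right_on='id')
print(merged[['userId', 'title', 'rating']])
```

With `how='inner'`, ratings whose `movieId` has no match in `id` are silently dropped, which is why the merged DataFrame can be much smaller than the ratings table.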

In [13]:
# Merging output
df_ratings_with_titles.head(3)

Unnamed: 0,userId,movieId,rating,id,title
0,1,1371,2.5,1371,Rocky III
1,4,1371,4.0,1371,Rocky III
2,7,1371,3.0,1371,Rocky III


In [14]:
# Check df_ratings_with_titles info
df_ratings_with_titles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44432 entries, 0 to 44431
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   userId   44432 non-null  int64  
 1   movieId  44432 non-null  object 
 2   rating   44432 non-null  float64
 3   id       44432 non-null  object 
 4   title    44432 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 2.0+ MB


**The number of unique movie IDs should be 2794.**

In [15]:
# Check the number of unique movie IDs
df_ratings_with_titles['id'].nunique()

2794

In [16]:
# Check 'rating' feature descriptive stats
df_ratings_with_titles['rating'].describe()

count    44432.000000
mean         3.558528
std          1.053150
min          0.500000
25%          3.000000
50%          4.000000
75%          4.000000
max          5.000000
Name: rating, dtype: float64

In [17]:
df_ratings_with_titles['movieId']

0          1371
1          1371
2          1371
3          1371
4          1371
          ...  
44427    127728
44428    129009
44429       167
44430       563
44431       129
Name: movieId, Length: 44432, dtype: object

In [18]:
df_ratings_with_titles['movieId'].unique()

array(['1371', '1405', '2105', ..., '167', '563', '129'], dtype=object)

In [20]:
df_ratings_with_titles.shape

(44432, 5)

In [21]:
# Reset the index in place without keeping the old index as a column
df_ratings_with_titles.reset_index(drop=True, inplace=True)
df_ratings_with_titles

Unnamed: 0,userId,movieId,rating,id,title
0,1,1371,2.5,1371,Rocky III
1,4,1371,4.0,1371,Rocky III
2,7,1371,3.0,1371,Rocky III
3,19,1371,4.0,1371,Rocky III
4,21,1371,3.0,1371,Rocky III
...,...,...,...,...,...
44427,652,127728,5.0,127728,8:46
44428,652,129009,4.0,129009,Love Is a Ball
44429,659,167,4.0,167,K-PAX
44430,659,563,3.0,563,Starship Troopers


**As we can see, ratings range from 0.5 to 5, so 0 never occurs as a real rating. From this point on, we can use 0 to mark movies a user has not watched and mask them later.**

**`User-movie matrix`**

In [22]:
# Generate user-movie matrix and save it into a variable named df_user_movie_matrix
df_user_movie_matrix = df_ratings_with_titles.pivot_table(
    index='userId', columns='movieId', values='rating', fill_value=0)
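To make the effect of `fill_value=0` concrete, here is a small sketch on made-up data (the user IDs, movie IDs, and ratings are illustrative only). Each (user, movie) pair without a rating becomes 0 instead of `NaN`.

```python
import pandas as pd

# Made-up ratings: user 2 has not rated movie 'b'.
toy = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': ['a', 'b', 'a'],
    'rating':  [4.0, 2.5, 5.0],
})

matrix = toy.pivot_table(index='userId', columns='movieId', values='rating', fill_value=0)
print(matrix)
```

Because `movieId` holds strings in this notebook, the columns of the real matrix are sorted lexicographically rather than numerically.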

In [23]:
# Check the outcome
df_user_movie_matrix.head()

userId \ movieId,100,100017,100032,100272,100450,101,101362,1018,101904,102,...,987,988,99,990,991,99106,992,994,996,99846
1,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
2,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
3,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
4,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0
5,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0.0,0,0.0,0,0.0,0,0.0,0.0,0


In [24]:
# pd.set_option('display.max_rows', 3000)
df_user_movie_matrix.iloc[29].sort_values(ascending=False)

movieId
508      5.0
2028     5.0
3114     5.0
527      5.0
1997     5.0
        ... 
291      0.0
292      0.0
2923     0.0
2924     0.0
99846    0.0
Name: 30, Length: 2794, dtype: float64

**`Extract movie genres`**

**Once again, we only need the genre name from the 'genres' column.**

In [25]:
# Extract movie genres from movies metadata and save it into a variable named df_movie_genres
# You can create a regular function to iterate through each genre in each movie's genres list to get only the genre names 

def get_genre_list(genres):
    if isinstance(genres, list): # if type(genres) == list
        genre_names = [item['name'] for item in genres]
        return genre_names
    return []

df_movie_genres = df_movies[df_movies['id'].isin(df_ratings['movieId'].unique())][['id', 'genres']]

df_movie_genres['genres'] = df_movie_genres['genres'].apply(literal_eval).apply(get_genre_list)

In [26]:
# Check the outcome
df_movie_genres

Unnamed: 0,id,genres
5,949,"[Action, Crime, Drama, Thriller]"
9,710,"[Adventure, Action, Thriller]"
14,1408,"[Action, Adventure]"
15,524,"[Drama, Crime]"
16,4584,"[Drama, Romance]"
...,...,...
45318,80831,[Drama]
45353,3104,"[Horror, Science Fiction]"
45403,64197,"[Romance, Drama]"
45406,98604,"[Comedy, Romance]"


**`Movie-feature matrix`**

**To generate movie-feature matrix, we need to transform each genre into a column.**

**Follow these 3 steps!**

In [27]:
# Step 1
df_movie_genres_id = df_movie_genres.set_index('id')
df_movie_genres_id

id,genres
949,"[Action, Crime, Drama, Thriller]"
710,"[Adventure, Action, Thriller]"
1408,"[Action, Adventure]"
524,"[Drama, Crime]"
4584,"[Drama, Romance]"
...,...
80831,[Drama]
3104,"[Horror, Science Fiction]"
64197,"[Romance, Drama]"
98604,"[Comedy, Romance]"


In [28]:
df_movie_genres_id.loc['100450']

genres    []
Name: 100450, dtype: object

In [29]:
# Step 2
df_movie_genres_stacked = df_movie_genres_id['genres'].apply(pd.Series).stack()
df_movie_genres_stacked

id      
949    0       Action
       1        Crime
       2        Drama
       3     Thriller
710    0    Adventure
              ...    
98604  0       Comedy
       1      Romance
49280  0      Fantasy
       1       Action
       2     Thriller
Length: 6658, dtype: object

In [30]:
# Step 3
# Generate movie-feature matrix and save it into a variable named df_movie_feature_matrix
df_movie_feature_matrix = pd.get_dummies(df_movie_genres_stacked).groupby(level=0).sum()
df_movie_feature_matrix

id,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,Foreign,History,Horror,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western
100,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
100017,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
100032,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
100272,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
101,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99106,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
992,0,0,0,1,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0
994,0,0,0,0,1,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
996,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0
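The three steps can be tried on a toy Series of genre lists. This sketch uses `Series.explode()` in place of the `.apply(pd.Series).stack()` call from Step 2; that substitution is an assumption of this example, not the notebook's method, and the movie IDs `m1` to `m3` are made up. It also shows why a movie with an empty genre list (such as 100450) has no row in the movie-feature matrix.

```python
import pandas as pd

# Toy genre lists indexed by (made-up) movie id; 'm3' has no genres.
genres = pd.Series(
    [['Comedy', 'Crime'], ['Drama'], []],
    index=pd.Index(['m1', 'm2', 'm3'], name='id'),
)

# One row per (movie, genre); an empty list becomes NaN and is dropped.
exploded = genres.explode().dropna()

# One indicator column per genre, then collapse back to one row per movie.
features = pd.get_dummies(exploded).groupby(level=0).sum().astype(int)
print(features)
```

Movies without any genre therefore cannot be scored or recommended by the genre-based steps that follow.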


### **Similarity between items with Cosine Similarity**

In [31]:
# Count the number of ratings per user
df_ratings['userId'].value_counts()

547    2391
564    1868
624    1735
15     1700
73     1610
       ... 
296      20
289      20
249      20
221      20
1        20
Name: userId, Length: 671, dtype: int64

**Now, we can start to build our recommendation system based on item similarity**

In [32]:
# Pick a user to generate recommendations for. Here we choose userId = 30
random_user_id = 30

In [33]:
# Get the list of movies this user has watched and save it into a variable named movies_watched_by_user
movies_watched_by_user = df_ratings_with_titles[
    df_ratings_with_titles['userId'] == random_user_id]['movieId'].unique()

In [34]:
# Check the outcome
len(movies_watched_by_user)

423

In [35]:
# Get the movie titles watched by userID = 30
df_movies[df_movies['id'].isin(movies_watched_by_user)]['title'].values

array(['Leaving Las Vegas', "Mr. Holland's Opus", 'French Twist',
       'Beyond Rangoon', 'Strange Days', 'Drop Zone',
       'Interview with the Vampire', 'Star Wars', 'Nell',
       'Once Were Warriors', "A Pyromaniac's Love Story",
       'Three Colors: Red', 'While You Were Sleeping', 'Color of Night',
       'The House of the Spirits', 'Judgment Night', 'Killing Zoe',
       'Executive Decision', 'The Remains of the Day',
       'Romeo Is Bleeding', "Schindler's List", 'Sleepless in Seattle',
       'Blade Runner', 'Dances with Wolves', 'Mission: Impossible',
       'Space Jam', 'Dead Man', 'A Close Shave', 'A Time to Kill',
       'Vertigo', 'The Thin Man', 'The 39 Steps', 'Cat on a Hot Tin Roof',
       'Romeo + Juliet', 'Murder, My Sweet', 'Reservoir Dogs',
       'Monty Python and the Holy Grail', 'The Wrong Trousers',
       'Lawrence of Arabia', 'To Kill a Mockingbird',
       'Return of the Jedi', 'The Third Man', 'Alien', 'Psycho',
       'Duck Soup', 'Stand by Me', 'M', 

**`Similarity matrix`**

In [36]:
# Generate similarity matrix and save it into a variable named df_cosine_matrix
df_cosine_matrix = pd.DataFrame(
    data=cosine_similarity(X=df_movie_feature_matrix),
    columns=df_movie_feature_matrix.index.tolist(),
    index=df_movie_feature_matrix.index.tolist()
)
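As a sanity check on what `cosine_similarity` computes, the sketch below reproduces one entry by hand with NumPy. The two vectors are the genre flags of movies 100 (Comedy, Crime) and 100272 (Comedy, Drama, Horror) from the movie-feature matrix, restricted to the four genres that are non-zero for either movie; the helper name `cosine` is hypothetical.

```python
import numpy as np

# Columns: Comedy, Crime, Drama, Horror
movie_100 = np.array([1, 1, 0, 0])     # Comedy, Crime
movie_100272 = np.array([1, 0, 1, 1])  # Comedy, Drama, Horror

def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = cosine(movie_100, movie_100272)
print(round(sim, 6))  # 0.408248, i.e. 1 / sqrt(2 * 3)
```

The two movies share one genre out of two and three respectively, so the similarity is 1 / (sqrt(2) * sqrt(3)). Dropping columns that are zero for both movies does not change the result.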

In [37]:
df_cosine_matrix

Unnamed: 0,100,100017,100032,100272,101,101362,1018,101904,102,102165,...,987,988,99,990,991,99106,992,994,996,99846
100,1.000000,0.000000,0.000000,0.408248,0.408248,0.816497,0.000000,0.000000,0.000000,0.000000,...,0.408248,0.000000,0.500000,0.000000,0.000000,0.00000,0.353553,0.353553,0.000000,0.408248
100017,0.000000,1.000000,0.707107,0.577350,0.577350,0.000000,0.577350,0.707107,0.707107,0.707107,...,0.577350,0.707107,0.707107,1.000000,0.707107,0.00000,0.500000,0.500000,0.577350,0.577350
100032,0.000000,0.707107,1.000000,0.408248,0.408248,0.000000,0.408248,0.500000,0.500000,0.500000,...,0.408248,0.500000,0.500000,0.707107,0.500000,0.00000,0.353553,0.353553,0.408248,0.816497
100272,0.408248,0.577350,0.408248,1.000000,0.333333,0.333333,0.333333,0.408248,0.408248,0.408248,...,0.666667,0.408248,0.816497,0.577350,0.408248,0.57735,0.577350,0.288675,0.333333,0.333333
101,0.408248,0.577350,0.408248,0.333333,1.000000,0.333333,0.666667,0.408248,0.408248,0.408248,...,0.333333,0.816497,0.408248,0.577350,0.408248,0.00000,0.288675,0.866025,0.666667,0.666667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99106,0.000000,0.000000,0.000000,0.577350,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,1.00000,0.000000,0.000000,0.000000,0.000000
992,0.353553,0.500000,0.353553,0.577350,0.288675,0.288675,0.577350,0.707107,0.353553,0.353553,...,0.577350,0.353553,0.707107,0.500000,0.353553,0.00000,1.000000,0.500000,0.577350,0.288675
994,0.353553,0.500000,0.353553,0.288675,0.866025,0.288675,0.866025,0.353553,0.353553,0.353553,...,0.288675,0.707107,0.353553,0.500000,0.353553,0.00000,0.500000,1.000000,0.866025,0.577350
996,0.000000,0.577350,0.408248,0.333333,0.666667,0.000000,1.000000,0.408248,0.408248,0.408248,...,0.333333,0.816497,0.408248,0.577350,0.408248,0.00000,0.577350,0.866025,1.000000,0.333333


In [38]:
# Pick one movie the user has watched and save its ID into a variable named movieID_watched
movie_title_watched = 'Star Wars'
movieID_watched = df_movies[
    df_movies['title'] == movie_title_watched]['id'].values[0]

In [39]:
# Check the outcome
movieID_watched

'11'

In [40]:
# Get similarity vector for movie watched by the user and save it into a variable named df_movie_similarity
df_movie_similarity = df_cosine_matrix[movieID_watched].reset_index().rename(
    columns={'index': 'id', movieID_watched: 'cosine_similarity'}
)

# Merge
df_movie_similarity = pd.merge(
    left=df_movie_similarity,
    right=df_movies[['id', 'title']],
    how='left',
    on='id'
)

In [41]:
# Check the outcome
df_movie_similarity.head()

Unnamed: 0,id,cosine_similarity,title
0,100,0.0,"Lock, Stock and Two Smoking Barrels"
1,100017,0.0,Hounded
2,100032,0.408248,The Great Los Angeles Earthquake
3,100272,0.0,Harold's Going Stiff
4,101,0.0,Leon: The Professional


**Remember that we do not want to recommend movies the user has already watched, so we need to exclude those movies from 'df_movie_similarity'. Then we can show the top n recommendations for this user (the movies most similar to Star Wars).**

In [42]:
# Exclude watched movies from recommendation and show top n recommendations
n_recommendation = 5

df_movie_similarity[
    ~df_movie_similarity['id'].isin(movies_watched_by_user)
].sort_values(by='cosine_similarity', ascending=False).iloc[:n_recommendation]

Unnamed: 0,id,cosine_similarity,title
2449,830,1.0,Forbidden Planet
632,2164,1.0,Stargate
420,1891,1.0,The Empire Strikes Back
422,1894,1.0,Star Wars: Episode II - Attack of the Clones
2524,861,1.0,Total Recall


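**These are five unwatched movies that can be recommended to the chosen user based on a single watched movie, Star Wars. All five have a cosine similarity of 1.0, meaning their genre sets are identical to that of Star Wars.**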
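The filter-then-rank step can be wrapped in a small helper. The sketch below is illustrative only: the function name `top_n_unwatched`, the IDs, and the scores are all made up. It uses a stable sort so that tied scores keep their original order; with many ties at 1.0, as in the output above, which tied movies appear in the top n depends on that order.

```python
import pandas as pd

# Hypothetical similarity scores to one seed movie (IDs and values are made up).
similarity = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'cosine_similarity': [0.9, 0.8, 0.7, 0.2],
})
watched = ['a']  # already seen by the user

def top_n_unwatched(scores, watched_ids, n):
    """Return the n highest-scoring rows whose id is not in watched_ids."""
    unwatched = scores[~scores['id'].isin(watched_ids)]
    return unwatched.sort_values(
        by='cosine_similarity', ascending=False, kind='stable'
    ).head(n)

result = top_n_unwatched(similarity, watched, n=2)
print(result)
```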

<hr>

### **Content-based Filtering**

**Now we will build a recommendation system that uses all of the movies a user has watched rather than a single one.**

In [43]:
# Generate user-movie rating matrix of a user
# You can use the same userID from the previous steps. Here, I use userID = 30
df_current_user_ratings = df_ratings_with_titles[
    df_ratings_with_titles['userId'] == random_user_id][['movieId', 'rating']]

df_current_user_ratings

Unnamed: 0,movieId,rating
96,2105,2.0
142,2193,3.0
465,110,5.0
716,150,5.0
1040,161,5.0
...,...,...
37561,5961,4.0
37563,6103,2.0
37564,6318,3.0
37566,6436,5.0


In [44]:
# Generate movie-feature matrix for the chosen user
df_movie_feature_matrix_current_user = pd.merge(
    left=df_movie_feature_matrix.copy().reset_index(),
    right=df_current_user_ratings,
    how='left',
    left_on='id',
    right_on='movieId'
)

df_movie_feature_matrix_current_user.head(2)

Unnamed: 0,id,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Music,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,movieId,rating
0,100,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,100.0,4.0
1,100017,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,,


In [45]:
# Work on a copy of df_movie_feature_matrix_current_user
df_movie_feature_matrix_current_user_pref = df_movie_feature_matrix_current_user.copy()

# Add a boolean column named 'have_watched' that is True for movies the user has rated
# (a missing rating is NaN, and NaN > 0 evaluates to False)
df_movie_feature_matrix_current_user_pref['have_watched'] = df_movie_feature_matrix_current_user_pref['rating'] > 0

df_movie_feature_matrix_current_user_pref.head()

Unnamed: 0,id,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,movieId,rating,have_watched
0,100,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,100.0,4.0,True
1,100017,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,,,False
2,100032,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,,,False
3,100272,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,,,False
4,101,0,0,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,,,False


In [46]:
# Split the DataFrame into unwatched (have_watched is False) and watched (have_watched is True) movies
df_movie_feature_matrix_current_user_pref_not_watched = df_movie_feature_matrix_current_user_pref[
    ~df_movie_feature_matrix_current_user_pref['have_watched']
].fillna(0)

# .copy() makes the later column assignments operate on an independent DataFrame
df_movie_feature_matrix_current_user_pref = df_movie_feature_matrix_current_user_pref[
    df_movie_feature_matrix_current_user_pref['have_watched']
].copy()

In [47]:
# Check one of the outcomes
df_movie_feature_matrix_current_user_pref_not_watched.head()

Unnamed: 0,id,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,movieId,rating,have_watched
1,100017,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0.0,False
2,100032,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0.0,False
3,100272,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0.0,False
4,101,0,0,0,0,1,0,1,0,0,...,0,0,0,0,1,0,0,0,0.0,False
5,101362,0,1,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0.0,False


In [48]:
# Multiply each genre flag by the user's rating
genres = df_movie_feature_matrix.columns.tolist()

df_movie_feature_matrix_current_user_pref[genres] = (
    df_movie_feature_matrix_current_user_pref[genres]
    .mul(df_movie_feature_matrix_current_user_pref['rating'], axis=0)
)

In [49]:
df_movie_feature_matrix_current_user_pref.head(3)

Unnamed: 0,id,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Mystery,Romance,Science Fiction,TV Movie,Thriller,War,Western,movieId,rating,have_watched
0,100,0.0,0.0,0.0,4.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100,4.0,True
6,1018,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,...,3.0,0.0,0.0,0.0,3.0,0.0,0.0,1018,3.0,True
38,1073,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,...,3.0,0.0,0.0,0.0,3.0,0.0,0.0,1073,3.0,True


In [50]:
# Generate the user feature vector: rating-weighted genre totals, normalized to sum to 1
curr_user_feature_vector = (df_movie_feature_matrix_current_user_pref[genres].sum() 
                            / df_movie_feature_matrix_current_user_pref[genres].sum().sum())

# Sort them in descending order
curr_user_feature_vector.sort_values(ascending=False)

Drama              0.237664
Comedy             0.118894
Thriller           0.103893
Romance            0.075874
Action             0.072403
Crime              0.069427
Adventure          0.056038
Science Fiction    0.046367
Horror             0.041656
Fantasy            0.034218
Mystery            0.032730
Family             0.024052
History            0.023060
Music              0.013885
Animation          0.013885
War                0.013142
Western            0.009918
Documentary        0.006943
Foreign            0.005703
TV Movie           0.000248
dtype: float64

In [51]:
# Weight each genre flag of the unwatched movies by the user's preference for that genre
for genre in genres:
    df_movie_feature_matrix_current_user_pref_not_watched[genre] = (df_movie_feature_matrix_current_user_pref_not_watched[genre] 
                                                     * curr_user_feature_vector[genre])

In [52]:
# Sum the weighted genre columns to get an estimated preference score for each unwatched movie
df_movie_feature_matrix_current_user_pref_not_watched['est_pref_score'] = df_movie_feature_matrix_current_user_pref_not_watched[genres].sum(axis=1)
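The whole scoring pipeline (weight genres by rating, normalize into a user profile, score unwatched movies) can be followed on a two-genre toy example. All values here are made up, and the variable names are illustrative rather than the notebook's.

```python
import pandas as pd

genres = ['Action', 'Drama']

# Watched movies: genre flags plus the user's rating (made-up values).
watched = pd.DataFrame({'Action': [1, 0], 'Drama': [0, 1], 'rating': [2.0, 4.0]})

# Rating-weighted genre totals, normalized so the weights sum to 1.
weighted = watched[genres].mul(watched['rating'], axis=0)
profile = weighted.sum() / weighted.sum().sum()   # Action 1/3, Drama 2/3

# Unwatched movies: score = sum of the profile weights of the genres they have.
unwatched = pd.DataFrame({'Action': [1, 0, 1], 'Drama': [0, 1, 1]}, index=['x', 'y', 'z'])
scores = unwatched[genres].mul(profile, axis=1).sum(axis=1)

print(scores.sort_values(ascending=False))
```

Note that the score is a plain sum, so a movie tagged with more of the user's preferred genres (like `z` here) scores higher than one with a single matching genre. This differs from cosine similarity, which normalizes by vector length.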

In [53]:
# Check the outcome
df_movie_feature_matrix_current_user_pref_not_watched.head(2)

Unnamed: 0,id,Action,Adventure,Animation,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Romance,Science Fiction,TV Movie,Thriller,War,Western,movieId,rating,have_watched,est_pref_score
1,100017,0.0,0.0,0.0,0.0,0.0,0.0,0.237664,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,False,0.237664
2,100032,0.072403,0.0,0.0,0.0,0.0,0.0,0.237664,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0,0.0,False,0.310067


In [54]:
# Get top n recommendations based on estimated preference score
df_current_user_recommendations = pd.merge(
    left=df_movie_feature_matrix_current_user_pref_not_watched[['id', 'est_pref_score']],
    right=df_movies[['id', 'title']],
    how='left',
    on='id'
).drop_duplicates()

df_current_user_recommendations.sort_values(by='est_pref_score', ascending=False)[:n_recommendation]

Unnamed: 0,id,est_pref_score,title
2242,9005,0.658319,The Ice Harvest
1442,4990,0.635011,Hustle
1401,4912,0.605753,Confessions of a Dangerous Mind
1670,5965,0.588892,Scorcher
964,31921,0.569055,They All Laughed


<hr>