### Content Based Recommendations

In the previous notebook, you were introduced to collaborative filtering as a way to make recommendations.  However, that technique left a large number of users without any recommendations at all.  Other users were left with fewer than the ten recommendations that our function was set up to retrieve.

To help these users out, let's try another technique: **content based** recommendations.  Let's start off where we were in the previous notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import ast
from collections import defaultdict
from IPython.display import HTML
import progressbar
import tests as t
import pickle

In [3]:
# Read in the datasets
movies = pd.read_csv('data/movies_clean.csv', index_col=0)
reviews = pd.read_csv('data/reviews_clean.csv', index_col=0)

In [4]:
# This loads our solution dictionary (key: user_id and value: an array of recommended movie titles)
all_recs = {}
with open("data/all_recs.txt", newline="") as csvfile:
    data = csv.reader(csvfile, delimiter=",")
    for row in data:
        val = ast.literal_eval(row[1])
        all_recs[int(row[0])] = val
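To make the loading step above easier to follow, here is a minimal sketch of the same parsing logic run on an in-memory sample. The row layout (a user id followed by a stringified Python list) is an assumption inferred from the loader, and `buffer` and `sample_recs` are hypothetical names used only for illustration.

```python
import ast
import csv
import io

# Build a tiny in-memory file in the layout the loader appears to expect:
# column 0 is the user id, column 1 is a Python list written out as text.
# (Assumed layout; the real data/all_recs.txt is not shown here.)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([2, str(["Movie A", "Movie B"])])
writer.writerow([3, str([])])
buffer.seek(0)

# Parse it back the same way the notebook cell does
sample_recs = {}
for row in csv.reader(buffer, delimiter=","):
    # literal_eval safely turns the text "['Movie A', 'Movie B']" into a list
    sample_recs[int(row[0])] = ast.literal_eval(row[1])

print(sample_recs)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for reading lists stored as text.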

### Datasets

From the above, you now have access to three important items that you will be using throughout the rest of this notebook.  

`a.` **movies** - a dataframe of all of the movies in the dataset along with other content related information about the movies (genre and date)


`b.` **reviews** - this was the main dataframe used before for collaborative filtering, as it contains all of the interactions between users and movies.


`c.` **all_recs** - a dictionary where each key is a user, and the value is a list of movie recommendations based on collaborative filtering

For the individuals in **all_recs** who did receive 10 recommendations using collaborative filtering, we don't really need to worry about them.  However, there were a number of individuals in our dataset who did not receive any recommendations.

-----

`1.` To begin, let's find all of the users in our dataset who didn't get all 10 recommendations we would have liked them to have using collaborative filtering.  

In [15]:
rec_number = {user:len(rec) for user, rec in all_recs.items()}

In [19]:
rec_num = {}
rec_num['user_id'] = rec_number.keys()
rec_num['num_recommendations'] = rec_number.values()

In [21]:
rec_num = pd.DataFrame(rec_num)

In [26]:
(rec_num['num_recommendations']>=10).sum()

3494

In [27]:
rec_num['user_id'].nunique()

3494

In [28]:
reviews

Unnamed: 0,user_id,movie_id,rating,timestamp,date
0,1,114508,8,1381006850,2013-10-05 21:00:50
1,2,208092,5,1586466072,2020-04-09 21:01:12
2,2,358273,9,1579057827,2020-01-15 03:10:27
3,2,10039344,5,1578603053,2020-01-09 20:50:53
4,2,6751668,9,1578955697,2020-01-13 22:48:17
...,...,...,...,...,...
99996,8022,40746,9,1585954942,2020-04-03 23:02:22
99997,8022,41959,9,1586569384,2020-04-11 01:43:04
99998,8022,43014,9,1587085691,2020-04-17 01:08:11
99999,8022,44079,9,1586738312,2020-04-13 00:38:32


In [31]:
rec_num['user_id'].drop_duplicates().values

array([   2,    3,    4, ..., 8018, 8020, 8022], dtype=int64)

In [32]:
reviews['user_id'].drop_duplicates().values

array([   1,    2,    3, ..., 8020, 8021, 8022], dtype=int64)

In [33]:
users_with_all_recs = rec_num['user_id'].drop_duplicates().values
users = reviews['user_id'].drop_duplicates().values

In [41]:
(~np.isin(users, users_with_all_recs)).mean()

0.5644477686362503

In [39]:
users[~np.isin(users, users_with_all_recs)]

array([   1,    9,   10, ..., 8017, 8019, 8021], dtype=int64)

In [43]:
# Count the recommendations stored for each user in `all_recs`.
# Every key in `all_recs` has at least 10 recommendations (see the counts above),
# so its keys are exactly the users with a full list.
rec_number = {user: len(rec) for user, rec in all_recs.items()}
rec_num = pd.DataFrame({
    'user_id': list(rec_number.keys()),
    'num_recommendations': list(rec_number.values()),
})

# Users with all recommendations vs. every user who appears in the reviews
users_with_all_recs = rec_num['user_id'].drop_duplicates().values
users = reviews['user_id'].drop_duplicates().values

print("There are {} users with all recommendations from collaborative filtering.".format(len(users_with_all_recs)))

# Users in the reviews who are missing from the collaborative filtering results
users_who_need_recs = users[~np.isin(users, users_with_all_recs)]

print("There are {} users who still need recommendations.".format(len(users_who_need_recs)))
print("This means that only {}% of users received all 10 of their recommendations using collaborative filtering".format(round(100 * len(users_with_all_recs) / len(users), 2)))

There are 3494 users with all recommendations from collaborative filtering.
There are 4528 users who still need recommendations.
This means that only 43.56% of users received all 10 of their recommendations using collaborative filtering


In [44]:
# Sanity check on the number of users who still need recommendations
assert len(users_who_need_recs) == 4528
print("That's right, there were still another 4528 users who needed recommendations when we only used collaborative filtering!")

That's right, there were still another 4528 users who needed recommendations when we only used collaborative filtering!
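The steps above can be condensed into a small, self-contained sketch on toy data. Everything here (`toy_reviews`, `toy_recs`, `n_wanted`) is hypothetical and stands in for the notebook's real objects. Unlike the cells above, this version filters explicitly on list length, which would matter if some users in the dictionary had fewer than 10 recommendations.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `reviews` and `all_recs` (hypothetical data)
toy_reviews = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3, 4],
    "movie_id": [10, 11, 10, 12, 13, 14],
    "rating": [8, 6, 9, 7, 10, 5],
})
toy_recs = {
    2: ["Movie %d" % i for i in range(10)],  # a full list of 10
    3: ["Movie A", "Movie B"],               # only 2 recommendations
}
n_wanted = 10  # target number of recommendations per user

# Users whose collaborative filtering list reached the target length
toy_full = np.array(
    [user for user, recs in toy_recs.items() if len(recs) >= n_wanted]
)

# Every user who has rated at least one movie
toy_users = toy_reviews["user_id"].unique()

# Keep the users that are NOT in the "full list" group
toy_need = toy_users[~np.isin(toy_users, toy_full)]
print(toy_need)
```

On this toy data, users 1, 3, and 4 still need recommendations: users 1 and 4 have no entry at all, and user 3 has only two.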


### Content Based Recommendations

You will use a mix of content based and collaborative filtering to make recommendations for the users this time.  This will allow you to obtain recommendations in many cases where we couldn't make recommendations earlier.

`2.` Before finding recommendations, rank the user's ratings from highest to lowest. You will move through the movies in this order looking for other similar movies.
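One possible way to do this ranking is sketched below. It is not necessarily the notebook's intended solution: `rank_user_ratings` is a hypothetical helper name, and the small dataframe reuses a few rows from the `reviews` output shown earlier so the block runs on its own.

```python
import pandas as pd

# A few rows copied from the `reviews` display above, used as sample data
sample_reviews = pd.DataFrame({
    "user_id": [1, 2, 2, 2, 2],
    "movie_id": [114508, 208092, 358273, 10039344, 6751668],
    "rating": [8, 5, 9, 5, 9],
})


def rank_user_ratings(reviews_df, user_id):
    """Return one user's reviews ordered from highest to lowest rating.

    Hypothetical helper for illustration only.
    """
    user_rows = reviews_df[reviews_df["user_id"] == user_id]
    # mergesort is a stable algorithm, chosen so the ordering is deterministic
    return user_rows.sort_values(by="rating", ascending=False, kind="mergesort")


ranked = rank_user_ratings(sample_reviews, user_id=2)
print(ranked[["movie_id", "rating"]])
```

Applied to the full `reviews` dataframe, the same call would give the order in which to walk through each user's movies when looking for similar titles.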