### Content Based Recommendations

In the previous notebook, you were introduced to a way to make recommendations using collaborative filtering.  However, using this technique there are a large number of users who were left without any recommendations at all.  Other users were left with fewer than the ten recommendations that were set up by our function to retrieve...

In order to help these users out, let's try another technique **content based** recommendations.  Let's start off where we were in the previous notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
from IPython.display import HTML
import progressbar
import tests as t
import pickle


%matplotlib inline

# Read in the datasets
movies = pd.read_csv('movies_clean.csv')
reviews = pd.read_csv('reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']


all_recs = pickle.load(open("all_recs.p", "rb"))

### Datasets

From the above, you now have access to three important items that you will be using throughout the rest of this notebook.  

`a.` **movies** - a dataframe of all of the movies in the dataset along with other content related information about the movies (genre and date)


`b.` **reviews** - this was the main dataframe used before for collaborative filtering, as it contains all of the interactions between users and movies.


`c.` **all_recs** - a dictionary where each key is a user, and the value is a list of movie recommendations based on collaborative filtering

For the individuals in **all_recs** who did recieve 10 recommendations using collaborative filtering, we don't really need to worry about them.  However, there were a number of individuals in our dataset who did not receive any recommendations.

-----

`1.` To begin, let's start with finding all of the users in our dataset who didn't get all 10 ratings we would have liked them to have using collaborative filtering.  

In [2]:
users_with_all_recs = []
for user, movie_recs in all_recs.items():
    if len(movie_recs) > 9:
        users_with_all_recs.append(user)

print("There are {} users with all reccomendations from collaborative filtering.".format(len(users_with_all_recs)))

users = np.unique(reviews['user_id'])
users_who_need_recs = np.setdiff1d(users, users_with_all_recs)

print("There are {} users who still need recommendations.".format(len(users_who_need_recs)))
print("This means that only {}% of users received all 10 of their recommendations using collaborative filtering".format(round(len(users_with_all_recs)/len(np.unique(reviews['user_id'])), 4)*100))   

There are 22187 users with all reccomendations from collaborative filtering.
There are 46123 users who still need recommendations.
This means that only 32.48% of users received all 10 of their recommendations using collaborative filtering


In [3]:
# Some test here might be nice
assert len(users_with_all_recs) == 22187
print("That's right there were still another 31781 users who needed recommendations when we only used collaborative filtering!")

That's right there were still another 31781 users who needed recommendations when we only used collaborative filtering!


### Content Based Recommendations

You will be doing a bit of a mix of content and collaborative filtering to make recommendations for the users this time.  This will allow you to obtain recommendations in many cases where we didn't make recommendations earlier.     

`2.` Before finding recommendations, rank the user's ratings from highest ratings to lowest ratings. You will move through the movies in this order looking for other similar movies.

In [4]:
# create a dataframe similar to reviews, but ranked by rating for each user
ranked_reviews = reviews.sort_values(by=['user_id', 'rating'], ascending=False)

### Similarities

In the collaborative filtering sections, you became quite familiar with different methods of determining the similarity (or distance) of two users.  We can perform similarities based on content in much the same way.  

In many cases, it turns out that one of the fastest ways we can find out how similar items are to one another (when our matrix isn't totally sparse like it was in the earlier section) is by simply using matrix multiplication.  If you are not familiar with this, an explanation is available [here by 3blue1brown](https://www.youtube.com/watch?v=LyGKycYT2v0) and another quick explanation is provided [on the post here](https://math.stackexchange.com/questions/689022/how-does-the-dot-product-determine-similarity).

For us to pull out a matrix that describes the movies in our dataframe in terms of content, we might just use the indicator variables related to **year** and **genre** for our movies.  

Then we can obtain a matrix of how similar movies are to one another by taking the dot product of this matrix with itself.  Notice in the below that the dot product where our 1 values overlap gives a value of 2 indicating higher similarity.  In the second dot product, the 1 values don't match up.  This leads to a dot product of 0 indicating lower similarity.

<img src="images/dotprod1.png" alt="Dot Product" height="500" width="500">

We can perform the dot product on a matrix of movies with content characteristics to provide a movie by movie matrix where each cell is an indication of how similar two movies are to one another.  In the below image, you can see that movies 1 and 8 are most similar, movies 2 and 8 are most similar and movies 3 and 9 are most similar for this subset of the data.  The diagonal elements of the matrix will contain the similarity of a movie with itself, which will be the largest possible similarity (which will also be the number of 1's in the movie row within the orginal movie content matrix.

<img src="images/moviemat.png" alt="Dot Product" height="500" width="500">


`3.` Create a numpy array that is a matrix of indicator variables related to year (by century) and movie genres by movie.  Perform the dot prodoct of this matrix with itself (transposed) to obtain a similarity matrix of each movie with every other movie.  The final matrix should be 31245 x 31245.

In [5]:
# Subset so movie_content is only using the dummy variables for each genre and the 3 century based year dummy columns
movie_content = np.array(movies.iloc[:,4:])

# Take the dot product to obtain a movie x movie matrix of similarities
dot_prod_movies = movie_content.dot(np.transpose(movie_content))

In [6]:
# create checks for the dot product matrix
assert dot_prod_movies.shape[0] == 31245
assert dot_prod_movies.shape[1] == 31245
assert dot_prod_movies[0, 0] == np.max(dot_prod_movies[0])
print("Looks like you passed all of the tests.  Though they weren't very robust - if you want to write some of your own, I won't complain!")

AssertionError: 

### For Each User...


Now that you have a matrix where each user has their ratings ordered.  You also have a second matrix where movies are each axis, and the matrix entries are larger where the two movies are more similar and smaller where the two movies are dissimilar.  This matrix is a measure of content similarity. Therefore, it is time to get to the fun part.

For each user, we will perform the following:

    i. For each movie, find the movies that are most similar that the user hasn't seen.

    ii. Continue through the available, rated movies until 10 recommendations or until there are no additional movies.

As a final note, you may need to adjust the criteria for 'most similar' to obtain 10 recommendations.  As a first pass, I used only movies with the highest possible similarity to one another as similar enough to add as a recommendation.

`3.` In the below cell, complete each of the functions needed for making content based recommendations.

In [10]:
def find_similar_movies(movie_id):
    '''
    INPUT
    movie_id - a movie_id 
    OUTPUT
    similar_movies - an array of the most similar movies by title
    '''
    # find the row of each movie id
    if not np.where(movies['movie_id']==210156)[0]:
        return np.array([])
        
    movie_idx = np.where(movies['movie_id'] == movie_id)[0][0]
    
    # find the most similar movie indices - to start I said they need to be the same for all content
    similar_idxs = np.where(dot_prod_movies[movie_idx] == np.max(dot_prod_movies[movie_idx]))[0]
    
    # pull the movie titles based on the indices
    similar_movies = np.array(movies.iloc[similar_idxs, ]['movie'])
    
    return similar_movies
    
    
def get_movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movies - a list of movie names associated with the movie_ids
    
    '''
    movie_lst = list(movies[movies['movie_id'].isin(movie_ids)]['movie'])
   
    return movie_lst

def make_recs():
    '''
    INPUT
    None
    OUTPUT
    recs - a dictionary with keys of the user and values of the recommendations
    '''
    # Create dictionary to return with users and ratings
    recs = defaultdict(set)
    # How many users for progress bar
    n_users = len(users)

    
    # Create the progressbar
    cnter = 0
    bar = progressbar.ProgressBar(maxval=n_users+1, widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    bar.start()
    
    # For each user
    for user in users:
        
        # Update the progress bar
        cnter+=1 
        bar.update(cnter)

        # Pull only the reviews the user has seen
        reviews_temp = ranked_reviews[ranked_reviews['user_id'] == user]
        movies_temp = np.array(reviews_temp['movie_id'])
        movie_names = np.array(get_movie_names(movies_temp))

        # Look at each of the movies (highest ranked first), 
        # pull the movies the user hasn't seen that are most similar
        # These will be the recommendations - continue until 10 recs 
        # or you have depleted the movie list for the user
        for movie in movies_temp:
            rec_movies = find_similar_movies(movie)
            temp_recs = np.setdiff1d(rec_movies, movie_names)
            recs[user].update(temp_recs)

            # If there are more than 
            if len(recs[user]) > 9:
                break

    bar.finish()
    
    return recs

In [11]:
recs = make_recs()

                                                                               [                                                                        ] N/A%

input mv_id 114508
input mv_id 8579674
input mv_id 790636
input mv_id 1343092
input mv_id 2884206
input mv_id 2980516
input mv_id 1454029
input mv_id 1663202
input mv_id 237572
input mv_id 317740
input mv_id 361748
input mv_id 7286456
input mv_id 3226786
input mv_id 181875
input mv_id 1024855
input mv_id 1025100
input mv_id 2277860
input mv_id 1790885
input mv_id 2278388
input mv_id 110076
input mv_id 1843866
input mv_id 2380307
input mv_id 4520988
input mv_id 368891
input mv_id 1663662
input mv_id 2402927
input mv_id 7286456
input mv_id 2872732
input mv_id 1707386
input mv_id 68646
input mv_id 2334879
input mv_id 2302755
input mv_id 1661199
input mv_id 478970
input mv_id 454876
input mv_id 1979320
input mv_id 10367276
input mv_id 264464
input mv_id 2584384
input mv_id 1798709
input mv_id 3450650
input mv_id 2802144
input mv_id 1800241
input mv_id 7798634
input mv_id 137523
input mv_id 119282
input mv_id 961097
input mv_id 1542768
input mv_id 1951264
input mv_id 454876
input mv_id 1386

input mv_id 1483013
input mv_id 70735
input mv_id 1727388
input mv_id 8722346
input mv_id 1405810
input mv_id 1889414
input mv_id 8579674
input mv_id 1800241
input mv_id 1972591
input mv_id 1386697
input mv_id 452702
input mv_id 3787068
input mv_id 1170358
input mv_id 1981677
input mv_id 2802144
input mv_id 993846
input mv_id 1790809
input mv_id 1000774
input mv_id 307901
input mv_id 4666726
input mv_id 275491
input mv_id 1663202
input mv_id 1843866
input mv_id 1179933
input mv_id 1924429
input mv_id 158350
input mv_id 1891845
input mv_id 770828
input mv_id 8579674
input mv_id 1853728
input mv_id 1979388
input mv_id 1454468
input mv_id 2096673
input mv_id 272020
input mv_id 2671706
input mv_id 3390572
input mv_id 277027
input mv_id 1663662
input mv_id 2975590
input mv_id 50083
input mv_id 76759
input mv_id 50083
input mv_id 1454468
input mv_id 1392214
input mv_id 1396484
input mv_id 1645080
input mv_id 1905041
input mv_id 1843866
input mv_id 3504048
input mv_id 1843866
input mv_id 9862

                                                                               [                                                                        ]   1%

input mv_id 365907
input mv_id 3011894
input mv_id 66344
input mv_id 338013
input mv_id 1690953
input mv_id 50083
input mv_id 86425
input mv_id 4555426
input mv_id 117951
input mv_id 4630562
input mv_id 241527
input mv_id 133093
input mv_id 387564
input mv_id 162222
input mv_id 1853728
input mv_id 2024544
input mv_id 2305051
input mv_id 1454468
input mv_id 2788732
input mv_id 485947
input mv_id 54698
input mv_id 405159
input mv_id 89218
input mv_id 108052
input mv_id 1540741
input mv_id 1939659
input mv_id 1663662
input mv_id 3270538
input mv_id 15864
input mv_id 1178663
input mv_id 1408101
input mv_id 1392214
input mv_id 3099498
input mv_id 101765
input mv_id 8228288
input mv_id 2338151
input mv_id 1024648
input mv_id 1535109
input mv_id 790628
input mv_id 1723121
input mv_id 1430132
input mv_id 3315342
input mv_id 2024544
input mv_id 443543
input mv_id 1124037
input mv_id 266697
input mv_id 478970
input mv_id 2763304
input mv_id 5726616
input mv_id 3501632
input mv_id 1702439
input m

input mv_id 1483013
input mv_id 3262342
input mv_id 6472976
input mv_id 116282
input mv_id 2582496
input mv_id 68646
input mv_id 1535108
input mv_id 276919
input mv_id 79116
input mv_id 319343
input mv_id 6472976
input mv_id 1477834
input mv_id 1877832
input mv_id 110912
input mv_id 236640
input mv_id 993846
input mv_id 1255953
input mv_id 3742378
input mv_id 3896198
input mv_id 2194499
input mv_id 2802144
input mv_id 482571
input mv_id 31679
input mv_id 1504320
input mv_id 4003440
input mv_id 7286456
input mv_id 2503944
input mv_id 1727824
input mv_id 1229340
input mv_id 105236
input mv_id 111161
input mv_id 57012
input mv_id 4154796
input mv_id 174480
input mv_id 2883512
input mv_id 1386697
input mv_id 6864864
input mv_id 1905041
input mv_id 1234719
input mv_id 3890160
input mv_id 884732
input mv_id 9243946
input mv_id 7420342
input mv_id 268978
input mv_id 11388580
input mv_id 1727824
input mv_id 2333784
input mv_id 1483013
input mv_id 47478
input mv_id 76759
input mv_id 85267
input

input mv_id 770828
input mv_id 1119646
input mv_id 1306980
input mv_id 1433811
input mv_id 4139124
input mv_id 1232829
input mv_id 1408101
input mv_id 108052
input mv_id 1302006
input mv_id 4154796
input mv_id 1392214
input mv_id 1899353
input mv_id 1677720
input mv_id 7131622
input mv_id 446029
input mv_id 2582846
input mv_id 4154796
input mv_id 4648786
input mv_id 111161
input mv_id 1477834
input mv_id 83922
input mv_id 1980209
input mv_id 1596343
input mv_id 2382396
input mv_id 112471
input mv_id 2194499
input mv_id 3170832
input mv_id 8579674
input mv_id 1335975
input mv_id 3315342
input mv_id 2584384
input mv_id 816692
input mv_id 1853728
input mv_id 144084
input mv_id 1800241
input mv_id 1300854
input mv_id 78346
input mv_id 112744
input mv_id 8367814
input mv_id 765443
input mv_id 114814
input mv_id 97165


                                                                               [=                                                                       ]   2%

input mv_id 3315342
input mv_id 2310332
input mv_id 816711
input mv_id 304141
input mv_id 1872194
input mv_id 1024648
input mv_id 1907668
input mv_id 266543
input mv_id 110912
input mv_id 120794
input mv_id 107554
input mv_id 1843866
input mv_id 1392190
input mv_id 172495
input mv_id 2402927
input mv_id 1489887
input mv_id 62512
input mv_id 1567609
input mv_id 3622592
input mv_id 2404435
input mv_id 257044
input mv_id 4857264
input mv_id 454921
input mv_id 770828
input mv_id 1535108
input mv_id 7286456
input mv_id 68646
input mv_id 120907
input mv_id 2562232
input mv_id 1981677
input mv_id 443465
input mv_id 1065073
input mv_id 5084170
input mv_id 6663582
input mv_id 2793930
input mv_id 114898
input mv_id 359950
input mv_id 1454468
input mv_id 993846
input mv_id 68646
input mv_id 137523
input mv_id 1809398
input mv_id 816692
input mv_id 2250912
input mv_id 7286456
input mv_id 481499
input mv_id 892769
input mv_id 352248
input mv_id 1300854
input mv_id 6857166
input mv_id 3183660
input 

input mv_id 770828
input mv_id 2452386
input mv_id 5164432
input mv_id 1979376
input mv_id 1285016
input mv_id 401445
input mv_id 422295
input mv_id 111161
input mv_id 437405
input mv_id 2582846
input mv_id 2084970
input mv_id 8946378
input mv_id 247586
input mv_id 1305806
input mv_id 78841
input mv_id 172493
input mv_id 1911658
input mv_id 1428538
input mv_id 1959490
input mv_id 1959490
input mv_id 1663662
input mv_id 2975590
input mv_id 4196776
input mv_id 1065073
input mv_id 110413
input mv_id 6751668
input mv_id 842926
input mv_id 816692
input mv_id 5607714
input mv_id 119174
input mv_id 2358891
input mv_id 87843
input mv_id 71577
input mv_id 2488496
input mv_id 1951264
input mv_id 497378
input mv_id 5084170
input mv_id 2584384
input mv_id 230600
input mv_id 1727824
input mv_id 54215
input mv_id 2652118
input mv_id 395169
input mv_id 4154796
input mv_id 2084970
input mv_id 1860357
input mv_id 4430212
input mv_id 4218572
input mv_id 809432
input mv_id 1535108
input mv_id 1663662
inp

                                                                               [==                                                                      ]   3%

input mv_id 398286
input mv_id 816692
input mv_id 3183660
input mv_id 2802144
input mv_id 238552
input mv_id 111161
input mv_id 96895
input mv_id 88944
input mv_id 217505
input mv_id 108002
input mv_id 1974419
input mv_id 245429
input mv_id 1964418
input mv_id 1675434
input mv_id 2179136
input mv_id 71315
input mv_id 443272
input mv_id 4123430
input mv_id 1343092
input mv_id 1730687
input mv_id 1587310
input mv_id 81398
input mv_id 41841
input mv_id 7734218
input mv_id 1853728
input mv_id 848537
input mv_id 5690360
input mv_id 71562
input mv_id 120855
input mv_id 6958212
input mv_id 848228
input mv_id 1458169
input mv_id 3881784
input mv_id 1618434
input mv_id 120338
input mv_id 1853728
input mv_id 7339792
input mv_id 1213663
input mv_id 1825157
input mv_id 2334649
input mv_id 120735
input mv_id 457430
input mv_id 490215
input mv_id 758758
input mv_id 1302006
input mv_id 397313
input mv_id 1504320
input mv_id 421715
input mv_id 213149
input mv_id 99785
input mv_id 66921
input mv_id 813

input mv_id 7713068
input mv_id 5862312
input mv_id 75148
input mv_id 405159
input mv_id 1515091
input mv_id 2096673
input mv_id 119174
input mv_id 409379
input mv_id 1931533
input mv_id 3168230
input mv_id 111161
input mv_id 983193
input mv_id 455944
input mv_id 1707386
input mv_id 2381111
input mv_id 2649554
input mv_id 5862312
input mv_id 2582846
input mv_id 1623205
input mv_id 1457767
input mv_id 4550098
input mv_id 8228288
input mv_id 1024648
input mv_id 1853728
input mv_id 1206885
input mv_id 7131622
input mv_id 4034228
input mv_id 5247022
input mv_id 3460252
input mv_id 4196776
input mv_id 122690
input mv_id 7286456
input mv_id 1069238
input mv_id 117500
input mv_id 2467046
input mv_id 2659414
input mv_id 2126357
input mv_id 1355644
input mv_id 6212478
input mv_id 770828
input mv_id 1291150
input mv_id 101698
input mv_id 1351685
input mv_id 1464549
input mv_id 816711
input mv_id 50986
input mv_id 1950186
input mv_id 6857112
input mv_id 118789
input mv_id 453562
input mv_id 20325

input mv_id 8579674
input mv_id 993846
input mv_id 120338
input mv_id 3721936
input mv_id 1663662
input mv_id 4916630
input mv_id 1424432
input mv_id 8367814
input mv_id 110912
input mv_id 8579674
input mv_id 1582507
input mv_id 2455264
input mv_id 6320628
input mv_id 1386697
input mv_id 1832382
input mv_id 840361
input mv_id 1321511
input mv_id 2802144
input mv_id 8579674
input mv_id 1213663
input mv_id 1670345
input mv_id 1800241
input mv_id 77651
input mv_id 106308
input mv_id 1727824
input mv_id 816692
input mv_id 3170832
input mv_id 1204341
input mv_id 6320628
input mv_id 9243946
input mv_id 805564
input mv_id 408236
input mv_id 903624
input mv_id 8579674


                                                                               [===                                                                     ]   5%

input mv_id 2388819
input mv_id 3547740
input mv_id 2446042
input mv_id 52561
input mv_id 2515086
input mv_id 7286456
input mv_id 1408101
input mv_id 70047
input mv_id 993846
input mv_id 1911644
input mv_id 4425200
input mv_id 1360860
input mv_id 17136
input mv_id 1392190
input mv_id 1702439
input mv_id 2431286
input mv_id 274518
input mv_id 1408101
input mv_id 417
input mv_id 68646
input mv_id 1398426
input mv_id 90583
input mv_id 455944
input mv_id 816692
input mv_id 1988621
input mv_id 99685
input mv_id 2106476
input mv_id 1540933
input mv_id 2064713
input mv_id 327056
input mv_id 454876
input mv_id 243133
input mv_id 1670345
input mv_id 1130884
input mv_id 4425200
input mv_id 56172
input mv_id 1905041
input mv_id 78748
input mv_id 425112
input mv_id 317219
input mv_id 472181
input mv_id 1659337
input mv_id 2140479
input mv_id 2101441
input mv_id 95327
input mv_id 2872718
input mv_id 1659337
input mv_id 1723121
input mv_id 1690953
input mv_id 50083
input mv_id 2494362
input mv_id 94

IndexError: index 0 is out of bounds for axis 0 with size 0

  """Entry point for launching an IPython kernel.


True

### How Did We Do?

Now that you have made the recommendations, how did we do in providing everyone with a set of recommendations?

`4.` Use the cells below to see how many individuals you were able to make recommendations for, as well as explore characteristics about individuals who you were not able to make recommendations for.  

In [142]:
# Explore recommendations
users_without_all_recs = []
users_with_all_recs = []
no_recs = []
for user, movie_recs in recs.items():
    if len(movie_recs) < 10:
        users_without_all_recs.append(user)
    if len(movie_recs) > 9:
        users_with_all_recs.append(user)
    if len(movie_recs) == 0:
        no_recs.append(user)

In [145]:
# Some characteristics of my content based recommendations
print("There were {} users without all 10 recommendations we would have liked to have.".format(len(users_without_all_recs)))
print("There were {} users with all 10 recommendations we would like them to have.".format(len(users_with_all_recs)))
print("There were {} users with no recommendations at all!".format(len(no_recs)))

There were 2179 users without all 10 recommendations we would have liked to have.
There were 51789 users with all 10 recommendations we would like them to have.
There were 174 users with no recommendations at all!


In [146]:
# Closer look at individual user characteristics
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_by_movie = user_items.groupby(['user_id', 'movie_id'])['rating'].max().unstack()

def movies_watched(user_id):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    movies = user_by_movie.loc[user_id][user_by_movie.loc[user_id].isnull() == False].index.values

    return movies


movies_watched(189)

array([457430])

In [152]:
cnter = 0
print("Some of the movie lists for users without any recommendations include:")
for user_id in no_recs:
    print(user_id)
    print(get_movie_names(movies_watched(user_id)))
    cnter+=1
    if cnter > 10:
        break

Some of the movie lists for users without any recommendations include:
189
['El laberinto del fauno (2006)']
797
['The 414s (2015)']
1603
['Beauty and the Beast (2017)']
2056
['Brimstone (2016)']
2438
['Baby Driver (2017)']
3322
['Rosenberg (2013)']
3925
['El laberinto del fauno (2006)']
4325
['Beauty and the Beast (2017)']
4773
['The Frozen Ground (2013)']
4869
['Beauty and the Beast (2017)']
4878
['American Made (2017)']


### Now What?  

Well, if you were really strict with your criteria for how similar two movies (like I was initially), then you still have some users that don't have all 10 recommendations (and a small group of users who have no recommendations at all). 

As stated earlier, recommendation engines are a bit of an **art** and a **science**.  There are a number of things we still could look into - how do our collaborative filtering and content based recommendations compare to one another? How could we incorporate user input along with collaborative filtering and/or content based recommendations to improve any of our recommendations?  How can we truly gain recommendations for every user?

`5.` In this last step feel free to explore any last ideas you have with the recommendation techniques we have looked at so far.  You might choose to make the final needed recommendations using the first technique with just top ranked movies.  You might also loosen up the strictness in the similarity needed between movies.  Be creative and share your insights with your classmates!

In [153]:
# Cells for exploring