# Recommendations with MovieTweetings: Neighborhood Collab Filtering

This notebook demonstrates **neighborhood-based collaborative filtering**, using similarities between users. The approach is easy to extend to item-based collaborative filtering (see next notebook).

Collaborative filtering is one of the most popular methods for making recommendations. It uses the ratings that many users have given to many items to assist in making new recommendations.

Generally there are two main methods of performing collaborative filtering:

1. **Neighborhood-Based Collaborative Filtering**, which is based on the idea that we can either correlate items that are similar to provide recommendations or we can correlate users to one another to provide recommendations.

2. **Model Based Collaborative Filtering**, which is based on the idea that we can use machine learning and other mathematical models to understand the relationships that exist amongst items and users to predict ratings and provide recommendations (see SVD notebooks).


In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm
import pickle

# from scipy.sparse import csr_matrix

import matplotlib.pyplot as plt
%matplotlib inline

In [4]:
# Read in the datasets
movies = pd.read_csv('data/movies_clean.csv')
reviews = pd.read_csv('data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

reviews.head(2)

Unnamed: 0,user_id,movie_id,rating,timestamp,date,month_1,month_2,month_3,month_4,month_5,...,month_9,month_10,month_11,month_12,year_2013,year_2014,year_2015,year_2016,year_2017,year_2018
0,1,68646,10,1381620027,2013-10-12 23:20:27,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
1,1,113277,10,1379466669,2013-09-18 01:11:09,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0


## Create a User-Item Matrix

### User-Item Matrix

In order to calculate the similarities, it is common to put values in a matrix.  In this matrix, users are identified by each row, and items are represented by columns.  

![alt text](images/userxitem.png "User Item Matrix")

In the above matrix, you can see that **User 1** and **User 2** both used **Item 1**, and **User 2**, **User 3**, and **User 4** all used **Item 2**.  However, there are also a large number of missing values in the matrix for users who haven't used a particular item.  A matrix with many missing values (like the one above) is considered **sparse**.

Our first goal for this notebook is to create the above matrix with the **reviews** dataset.  However, instead of 1 values in each cell, you should have the actual rating.  

The users will indicate the rows, and the movies will exist across the columns. To create the user-item matrix, we only need the first three columns of the **reviews** dataframe, which you can see by running the cell below.
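
As a rough illustration of the target shape, the sketch below builds a tiny user-item matrix from made-up ratings. The `toy_reviews` and `toy_matrix` names and all values are assumptions for the example only; the column names mirror the **reviews** dataframe.

```python
import pandas as pd

# Toy long-format ratings table (made-up values, same column names as `reviews`)
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 3],
    'movie_id': [10, 20, 10, 30, 20],
    'rating':   [9, 7, 8, 6, 10],
})

# One row per user, one column per movie; unrated cells become NaN
toy_matrix = (
    toy_reviews
    .groupby(['user_id', 'movie_id'])['rating']
    .max()
    .unstack()
)

print(toy_matrix)

# Share of cells that are missing, a rough measure of how sparse the matrix is
print(toy_matrix.isna().to_numpy().mean())
```

Even in this toy case 4 of the 9 cells are empty; with thousands of users and movies the share of missing cells is typically far higher.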


---


### Measure of Similarity: Euclidean distance

Euclidean distance is a measure of the straight-line distance from one vector to another.  

$$ \text{EUCL}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$

Unlike the correlation coefficient, Euclidean distance is not normalized by the spread of the data (there is no denominator).  Therefore, you need to make sure all of your data are on the same scale when using this metric.

**Note:** Because measuring similarity is often based on looking at the distance between vectors, it is important in these cases to scale your data or to have all data be in the same scale.  If some measures are on a 5 point scale, while others are on a 100 point scale, you are likely to have non-optimal results due to the difference in variability of your features.  In this case, we will not need to scale data because they are all on a 10 point scale, but it is always something to keep in mind!
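
To make the formula concrete, here is a small worked example with made-up ratings from two hypothetical users on the same three movies. It computes the distance term by term and then checks it against `np.linalg.norm`.

```python
import numpy as np

# Made-up ratings from two users on the same three movies (10-point scale)
x = np.array([8, 6, 10])
y = np.array([7, 9, 10])

# Square root of the summed squared differences, written out as in the formula
manual = np.sqrt(np.sum((x - y) ** 2))

# The norm of the difference vector computes the same quantity
via_norm = np.linalg.norm(x - y)

print(manual, via_norm)
```

The differences are 1, -3 and 0, so the squared differences sum to 10 and the distance is the square root of 10, roughly 3.16.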

In [5]:
# work on a subset of the reviews table

user_items = reviews[['user_id', 'movie_id', 'rating']]
user_items.head()

Unnamed: 0,user_id,movie_id,rating
0,1,68646,10
1,1,113277,10
2,2,422720,8
3,2,454876,8
4,2,790636,7


**Avoiding memory errors:** To create the user-item matrix (like the one above), I started with a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) but quickly ran into a memory error. See [this Stack Overflow question](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) for a discussion of the problem.

In [None]:
# Create user-by-item matrix (it doesn't matter which agg function is used, we just want the value)

user_by_movie = reviews[['user_id', 'movie_id', 'rating']].groupby(['user_id', 'movie_id'])['rating'].max().unstack()

### Create a dictionary of rated movies per user 
Now that you have a matrix of users by movies, use this matrix to create a dictionary where the key is each user and the value is an array of the movies each user has rated.

In [None]:
# Create a dictionary with users and corresponding movies seen

def movies_watched(user_id):
    '''
    Helper function.
    
    INPUT:
    user_id - the user_id of an individual as int
    OUTPUT:
    movies - an array of movies the user has watched
    '''
    
    movies = user_by_movie.loc[user_by_movie.index == user_id]
    movies = movies.dropna(how = 'any', axis = 1)
    movies = list(movies.columns)

    return movies


def create_user_movie_dict():
    '''
    INPUT: None
    OUTPUT: movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    Creates the movies_seen dictionary
    '''
    movies_seen = {}
    for id in tqdm(user_by_movie.index):
        movies_seen[id] = movies_watched(id) 
    
    return movies_seen


# Use your function to return dictionary
movies_seen = create_user_movie_dict()
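
Looping over every user and slicing the matrix can be slow. As a possible alternative, and only a sketch under assumptions, the same kind of dictionary can be built directly from the long-format ratings table. The `toy_reviews` and `toy_movies_seen` names and values below are made up. Note that this version keeps movies in table order rather than sorted, and would repeat a movie_id if a user rated the same movie twice.

```python
import pandas as pd

# Toy stand-in for the `reviews` table (made-up values)
toy_reviews = pd.DataFrame({
    'user_id':  [1, 1, 2, 2, 2],
    'movie_id': [10, 20, 10, 30, 40],
    'rating':   [9, 7, 8, 6, 5],
})

# Collect each user's movie_ids into a list, then convert to a plain dict
toy_movies_seen = (
    toy_reviews
    .groupby('user_id')['movie_id']
    .apply(list)
    .to_dict()
)

print(toy_movies_seen)
```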

If a user hasn't rated more than 2 movies, we consider these users "too new".  Create a new dictionary that only contains users who have rated more than 2 movies.  This dictionary will be used for all the final steps of this workbook.

In [None]:
# Remove individuals who have watched 2 or fewer movies - don't have enough data to make recs

def create_movies_to_analyze(movies_seen, lower_bound=2):
    '''
    INPUT:  
    movies_seen - a dictionary where each key is a user_id and the value is an array of movie_ids
    lower_bound - (an int) a user must have more movies seen than the lower bound to be added to the movies_to_analyze dictionary

    OUTPUT: 
    movies_to_analyze - a dictionary where each key is a user_id and the value is an array of movie_ids
    
    The movies_seen and movies_to_analyze dictionaries should be the same except that the output dictionary has removed
    users who have seen lower_bound or fewer movies
    '''
    
    # Do things to create updated dictionary
    
#     movies_to_analyze = {}
#     for id, movies in movies_seen.items():
#         if len(movies) > lower_bound:
#             movies_to_analyze[id] = movies  
    
    movies_to_analyze = {user: movies for user, movies in movies_seen.items() if len(movies) > lower_bound}
    
    return movies_to_analyze


# Use your function to return your updated dictionary
movies_to_analyze = create_movies_to_analyze(movies_seen)

## Calculate User Similarities

Now that you have set up the **movies_to_analyze** dictionary, it is time to take a closer look at the similarities between users.  Below is the pseudocode for how I thought about determining the similarity between users:

```
for user1 in movies_to_analyze
    for user2 in movies_to_analyze
        see how many movies match between the two users
        if more than two movies in common
            pull the overlapping movies
            compute the distance/similarity metric between ratings on the same movies for the two users
            store the users and the distance metric
```

However, this took a very long time to run, and other methods of performing these operations did not fit on the workspace memory!
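
For reference, the pseudocode can be turned into a small runnable sketch on toy data. Everything here is an assumption for illustration: the `toy_ratings` dictionary, the `pairwise_distances` helper, and the choice of Euclidean distance as the metric. It visits each unordered pair of users once, but the work still grows with the square of the number of users, which is why the full computation is so slow.

```python
from itertools import combinations
from math import sqrt

# Toy ratings: user_id -> {movie_id: rating} (made-up values)
toy_ratings = {
    1: {10: 9, 20: 7, 30: 8},
    2: {10: 8, 20: 9, 30: 6, 40: 5},
    3: {40: 10, 50: 4},
}

def pairwise_distances(ratings, min_overlap=3):
    """Euclidean distance for every pair of users sharing enough movies."""
    results = []
    # combinations visits each unordered pair once instead of twice
    for user1, user2 in combinations(sorted(ratings), 2):
        shared = ratings[user1].keys() & ratings[user2].keys()
        if len(shared) < min_overlap:
            continue  # not enough overlap to compare these two users
        dist = sqrt(sum((ratings[user1][m] - ratings[user2][m]) ** 2 for m in shared))
        results.append((user1, user2, dist))
    return results

print(pairwise_distances(toy_ratings))
```

Only users 1 and 2 share more than two movies, so they are the only pair with a stored distance.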

Therefore, your task for this question is to look at a few specific examples of the [correlation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) between the ratings given by two users.

`4.` Using the **movies_to_analyze** dictionary and **user_by_movie** dataframe, create a function that computes the correlation between the ratings of similar movies for two users.  Then use your function to compare your results to ours using the tests below.  

In [None]:
def compute_correlation(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the correlation between the matching ratings between the two users
    '''
    
    movies1 = movies_to_analyze[user1]
    movies2 = movies_to_analyze[user2]
    
    similar_movies = np.intersect1d(movies1, movies2, assume_unique=True) # assume speeds things up
    
    # create a dense dataframe with similar movies only
    corr_df = user_by_movie.loc[[user1, user2], similar_movies] 
    corr = corr_df.transpose().corr().iloc[0,1] # transpose because corr is calculated col-wise
    
    return corr 

In [None]:
# for performance reasons not all correlations are calculated. you can import them here

# read_pickle accepts a file path directly, so no open file handle is left behind
corrs_import = pd.read_pickle("data/corrs.p")
df_corrs = pd.DataFrame(corrs_import)
df_corrs.columns = ['user1', 'user2', 'movie_corr']

In [None]:
# check out some examples

print(compute_correlation(2,3))
print(compute_correlation(2,104))

### Elaboration: Why the NaN's? - Problems with Correlations

One question is why we are still obtaining **NaN** values. These NaNs ultimately make the correlation coefficient a less than optimal measure of similarity between two users. They arise from the way the correlation is calculated: the standard deviation of each user's ratings appears in the denominator, so if either standard deviation is 0, the result is NaN. This happens here for the movies rated by both user 2 and user 104.
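
The effect is easy to reproduce on made-up data. In the sketch below, the hypothetical user `'a'` gives every shared movie the same rating, so the standard deviation of those ratings is 0 and the correlation with user `'b'` is undefined. The `shared` dataframe and its values are assumptions for the example.

```python
import pandas as pd

# Made-up ratings on three shared movies; user 'a' gives every movie an 8
shared = pd.DataFrame(
    {'m1': [8, 7], 'm2': [8, 9], 'm3': [8, 10]},
    index=['a', 'b'],
)

# Standard deviation per user: zero for 'a', positive for 'b'
print(shared.std(axis=1))

# Same transpose-then-corr pattern as in compute_correlation
corr = shared.transpose().corr().iloc[0, 1]
print(corr)
```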

In [None]:
# Which movies did both user 2 and user 104 see?

similar_movies = np.intersect1d(movies_to_analyze[2], movies_to_analyze[104])
similar_movies

In [None]:
# The ratings for user 2 have a std of 0, resulting in a NaN value for the correlation calc

user_by_movie.loc[2, similar_movies]

Because the correlation coefficient proved to be less than optimal for relating user ratings to one another, we could instead calculate the Euclidean distance between the ratings, which stays defined even when one user's ratings are all identical.  I found [this post](https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy) particularly helpful when I was setting up my function.  This function should be very similar to your previous function.  When you feel confident with your function, test it against our results.

In [None]:
def compute_euclidean_dist(user1, user2):
    '''
    INPUT
    user1 - int user_id
    user2 - int user_id
    OUTPUT
    the euclidean distance between user1 and user2
    '''
    
    movies1 = movies_to_analyze[user1]
    movies2 = movies_to_analyze[user2]
    
    similar_movies = np.intersect1d(movies1, movies2, assume_unique=True) # assume speeds things up
    
    # create a dense dataframe with similar movies only
    dist_df = user_by_movie.loc[[user1, user2], similar_movies] 
    dist = np.linalg.norm(dist_df.loc[user1] - dist_df.loc[user2])

    return dist

In [None]:
compute_euclidean_dist(2, 104)

In [None]:
# for performance reasons not all distances are calculated, in this cell you can load them.

# read_pickle accepts a file path directly, so no open file handle is left behind
dists_import = pd.read_pickle("data/Term2/recommendations/lesson1/data/dists.p")
df_dists = pd.DataFrame(dists_import)
df_dists.columns = ['user1', 'user2', 'movie_dist']

## Use Nearest Neighbors to Make Recommendations

In the previous questions, you read in **df_corrs** and **df_dists**. Therefore, you have a measure of similarity and a measure of distance from each user to every other user. These dataframes hold every possible combination of users, as well as the corresponding correlation or Euclidean distance, respectively.

Because of the **NaN** values that exist within **df_corrs**, we will proceed using **df_dists**. You will want to find the users that are 'nearest' each user.  Then you will want to find the movies the closest neighbors have liked to recommend to each user.
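
The two steps just described can be sketched on toy data. All names and values below (`toy_dists`, `toy_liked`, `toy_seen`, `toy_recommend`) are hypothetical stand-ins, with `toy_dists` laid out like **df_dists**. The sketch collects recommendations in a plain list so that movies from the nearest neighbors come first.

```python
import pandas as pd

# Toy distance table in the same three-column layout as df_dists (made-up values)
toy_dists = pd.DataFrame({
    'user1': [1, 1, 1, 1],
    'user2': [1, 2, 3, 4],
    'movie_dist': [0.0, 2.5, 1.0, 4.0],
})
# Toy "liked" and "seen" lookups, hypothetical stand-ins for the real tables
toy_liked = {2: [10, 30], 3: [20, 40], 4: [50]}
toy_seen = {1: [10, 20]}

def toy_recommend(user, num_recs=2):
    # Neighbors ordered from nearest to farthest, excluding the user itself
    rows = toy_dists[(toy_dists['user1'] == user) & (toy_dists['user2'] != user)]
    neighbors = rows.sort_values('movie_dist')['user2'].tolist()

    recs = []
    for neighbor in neighbors:
        for movie in toy_liked.get(neighbor, []):
            # Skip movies already seen or already recommended
            if movie not in toy_seen[user] and movie not in recs:
                recs.append(movie)
            if len(recs) == num_recs:
                return recs
    return recs

print(toy_recommend(1))
```

User 3 is the nearest neighbor, so the unseen movie that user liked (40) is recommended before movie 30 from user 2.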

I made use of the following objects:

* df_dists (to obtain the neighbors)
* user_items (to obtain the movies the neighbors and users have rated)
* movies (to obtain the names of the movies)

`7.` Complete the functions below, which allow you to find the recommendations for any user.  There are five functions which you will need:

* **find_closest_neighbors** - this returns a list of user_ids from closest neighbor to farthest neighbor using Euclidean distance


* **movies_liked** - returns an array of movie_ids


* **movie_names** - takes the output of movies_liked and returns a list of movie names associated with the movie_ids


* **make_recommendations** - takes a user id and goes through closest neighbors to return a list of movie names as recommendations


* **all_recommendations** - loops through every user and returns a dictionary with each user_id as a key and a list of movie recommendations as the value

In [None]:
def find_closest_neighbors(user_id):
    '''
    INPUT:
        user_id - (int) the user_id of the individual you want to find the closest users
    OUTPUT:
        closest_neighbors - an array of the id's of the users sorted from closest to farthest away
    '''
    # I treated ties as arbitrary and just kept whichever was easiest to keep using the head method
    # You might choose to do something less hand wavy - order the neighbors
    
    closest_neighbors = df_dists.loc[df_dists['user1'] == user_id].sort_values('movie_dist')
    closest_neighbors = closest_neighbors['user2'].values[1:]
    
    return closest_neighbors
    
    
    
def movies_liked(user_id, min_rating=7):
    '''
    INPUT:
    user_id - the user_id of an individual as int
    min_rating - the minimum rating for a movie to count as a "like" rather than a "dislike"
    OUTPUT:
    movies_liked - an array of movies the user has watched and liked
    '''
    movies_liked = np.array(user_items.query('user_id == @user_id and rating >= @min_rating')['movie_id'])
    
    return movies_liked


def movie_names(movie_ids):
    '''
    INPUT
    movie_ids - a list of movie_ids
    OUTPUT
    movie_list - a list of movie names associated with the movie_ids
    
    '''
    movie_list = [] 
    for id in movie_ids:
        title = movies.loc[movies['movie_id'] == id, 'movie'].tolist()
        movie_list.append(title[0])
        
    return movie_list
    
    
def make_recommendations(user, num_recs=10):
    '''
    INPUT:
        user - (int) a user_id of the individual you want to make recommendations for
        num_recs - (int) number of movies to return
    OUTPUT:
        recommendations - a list of movies - if there are "num_recs" recommendations return this many
                          otherwise return the total number of recommendations available for the "user"
                          which may just be an empty list
    '''
    # I wanted to make recommendations by pulling different movies than the user has already seen
    # Go in order from closest to farthest to find movies you would recommend
    # I also only considered movies that the neighbor liked (rated at least min_rating, 7 by default)
    
    # movies_seen by user (we don't want to recommend these)
    movies_seen = movies_watched(user)
    closest_neighbors = find_closest_neighbors(user)
    
    # Keep the recommended movies here
    recs = np.array([])
    
    # Go through the neighbors and identify movies they like the user hasn't seen
    for neighbor in closest_neighbors:
        neighbs_likes = movies_liked(neighbor)
        
        #Obtain recommendations for each neighbor
        new_recs = np.setdiff1d(neighbs_likes, movies_seen, assume_unique=True)
        
        # Update recs with new recs
        recs = np.unique(np.concatenate([new_recs, recs], axis=0))
        
        # If we have enough recommendations exit the loop
        if len(recs) > num_recs-1:
            break
    
    # Pull movie titles using movie ids, keeping at most num_recs of them
    recommendations = movie_names(recs[:num_recs])
    
    return recommendations


def all_recommendations(num_recs=10):
    '''
    INPUT 
        num_recs (int) the (max) number of recommendations for each user
    OUTPUT
        all_recs - a dictionary where each key is a user_id and the value is an array of recommended movie titles
    '''
    
    all_recs = {}
    for user_id in tqdm(user_by_movie.index):
        all_recs[user_id] = make_recommendations(user_id, num_recs)
        
    
    return all_recs

all_recs = all_recommendations(10)

In [None]:
all_recs