## Recommendations with MovieTweetings: Collaborative Filtering

One of the most popular methods for making recommendations is **collaborative filtering**.  In collaborative filtering, you use the recorded interactions between users and items (here, movie ratings) to make new recommendations.

In this notebook, you will perform **collaborative filtering**.  There are two main methods for doing so:

1. **User-based collaborative filtering:** In this type of recommendation, you find users who are similar to the user you would like to make recommendations for, and use their ratings to create a recommendation.

2. **Item-based collaborative filtering:** In this type of recommendation, you first find the items that are most related to each item (based on similar ratings).  Then you use an individual's ratings of those similar items to estimate whether that user will like the new item.

In this notebook you will be implementing **user-based collaborative filtering**.  However, it is easy to extend this approach to make recommendations using **item-based collaborative filtering**.  First, let's read in our data and necessary libraries.

**NOTE**: Because of the size of the datasets, some of your code cells here will take a while to execute, so be patient!

In [None]:
import numpy as np
import pandas as pd

# Read in the datasets
movies = pd.read_csv('../data/movies_clean.csv')
reviews = pd.read_csv('../data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

print(reviews.head())

In the above matrix, you can see that **User 1** and **User 2** both used **Item 1**, and **User 2**, **User 3**, and **User 4** all used **Item 2**.  However, there are also a large number of missing values in the matrix for users who haven't used a particular item.  A matrix with many missing values (like the one above) is considered **sparse**.
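
As a rough illustration of what such a sparse matrix looks like in pandas, here is a toy sketch. Only the `Item 1` and `Item 2` columns follow the description above; the `Item 3` column is an invented, empty column added purely to show sparsity, and the name `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: 1 marks an interaction, NaN marks "no interaction".
# Item 1 and Item 2 follow the text; Item 3 is an invented empty column.
toy = pd.DataFrame(
    {
        "Item 1": [1, 1, np.nan, np.nan],
        "Item 2": [np.nan, 1, 1, 1],
        "Item 3": [np.nan, np.nan, np.nan, np.nan],
    },
    index=["User 1", "User 2", "User 3", "User 4"],
)

# Fraction of cells with no recorded interaction
sparsity = toy.isna().to_numpy().mean()

print(toy)
print(f"Share of missing cells: {sparsity:.2f}")
```

Real rating data is usually far sparser than this, since most users rate only a small fraction of all movies.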

Our first goal for this notebook is to create the above matrix with the **reviews** dataset.  However, instead of a 1 in each cell, each cell should hold the actual rating.

Users will index the rows, and movies will run across the columns. To create the user-item matrix, we only need the first three columns of the **reviews** dataframe, which you can see by running the cell below.

In [None]:
user_items = reviews[['user_id', 'movie_id', 'rating']]
user_items.head()

### Creating the User-Item Matrix

To create the user-item matrix (like the one above), I personally started by using a [pivot table](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html).

However, I quickly ran into a memory error (a common theme throughout this notebook).  I will help you navigate around many of the errors I had, and achieve useful collaborative filtering results! 

_____

`1.` Create a matrix where the users are the rows, the movies are the columns, and the ratings exist in each cell, or a NaN exists in cells where a user hasn't rated a particular movie. If you get a memory error (like I did), [this link here](https://stackoverflow.com/questions/39648991/pandas-dataframe-pivot-memory-error) might help you!
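
One approach that tends to use less memory than a pivot table is to group by user and movie and then unstack the movie level into columns. The sketch below shows the idea on a tiny, made-up dataframe (`toy_reviews` and `toy_user_by_movie` are hypothetical names); it is not necessarily the intended solution, and whether it fits in memory on the full dataset depends on your machine.

```python
import pandas as pd

# Hypothetical stand-in for the first three columns of `reviews`
toy_reviews = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3, 3],
        "movie_id": [10, 20, 10, 20, 30],
        "rating": [8, 6, 9, 7, 10],
    }
)

# Group by (user, movie), keep the highest rating if a pair repeats,
# then move movie_id from the row index into the columns.
# Pairs with no rating become NaN.
toy_user_by_movie = (
    toy_reviews.groupby(["user_id", "movie_id"])["rating"].max().unstack()
)

print(toy_user_by_movie)
```

The result has one row per distinct `user_id`, which is what the check in the next cell looks for.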

In [None]:
# Create user-by-item matrix
user_by_movie = None  # Your code here

Check your results below to make sure your matrix is ready for the upcoming sections.

In [None]:
assert reviews.user_id.nunique() == user_by_movie.shape[0], "Oh no! Your matrix should have {} rows, and yours has {}!".format(reviews.user_id.nunique(), user_by_movie.shape[0])
print("Looks like you are all set! Proceed!")

`2.` For a first iteration, imagine we are only interested in which movies a user has viewed.  Therefore, update your solution to question 1 so that the matrix has a `0` for any movie a user has not rated and a `1` for any movie a user has rated.
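
A minimal sketch of this conversion on a made-up ratings matrix, assuming missing ratings are stored as NaN (`toy_ratings` and `toy_binary` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up ratings: rows are users, columns are movie ids, NaN means "not rated"
toy_ratings = pd.DataFrame(
    {10: [8.0, 9.0, np.nan], 20: [6.0, np.nan, 7.0]},
    index=pd.Index([1, 2, 3], name="user_id"),
)

# notna() is True where a rating exists; casting to int turns True/False into 1/0
toy_binary = toy_ratings.notna().astype(int)

print(toy_binary)
```

Testing for presence rather than comparing against a value means any rating, whatever its size, becomes a `1`.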

In [None]:
user_by_movie = None  # your solution here

`3.` A common similarity metric is `cosine_similarity`.  Complete the function below using this metric to determine which users are most similar to one another.
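
To make the metric concrete: cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy for three made-up users; the helper `cosine` is hypothetical and only for illustration. sklearn's `cosine_similarity` computes the same quantity for every pair of rows at once.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Binary "has rated" rows for three made-up users over four movies
user_a = [1, 1, 0, 1]
user_b = [1, 1, 0, 0]
user_c = [0, 0, 1, 0]

# a and b share 2 movies; a rated 3 movies and b rated 2, so this is 2 / sqrt(3 * 2)
sim_ab = cosine(user_a, user_b)

# a and c share no movies, so the dot product (and the similarity) is 0
sim_ac = cosine(user_a, user_c)

print(sim_ab, sim_ac)
```

For 0/1 vectors, the dot product counts the movies two users have in common, and the denominator scales that count by how many movies each user rated.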

In [None]:
# Let's use the cosine_similarity function from sklearn
from sklearn.metrics.pairwise import cosine_similarity

def find_similar_users(user_id, user_item, include_similarity=False):
    """
    INPUT:
    user_id - (int) a user_id
    user_item - (pandas dataframe) matrix of users by movies:
                1's when a user has interacted with a movie, 0 otherwise
    include_similarity - (bool) whether to include the similarity in the output

    OUTPUT:
    similar_users - (list) an ordered list where the closest users (largest cosine similarity)
                    are listed first

    Description:
    Computes the cosine similarity of every user to the provided user.
    Returns an ordered list of user ids. If include_similarity is True, returns a list of lists
    where the first element of each inner list is the user id and the second is the similarity.

    """
    # compute similarity of each user to the provided user
    # sort by similarity
    # remove the own user's id
    # create list of just the ids
    # create list of just the similarities
    # return a list of the users in order from most to least similar

`4.` Using your new function, who are the top 10 similar users to user `3508`?

In [None]:
users = None  # your solution here

`5.` Determine what movies you would recommend to user `3508` by finding movies they haven't rated that the top 10 users from the previous question have rated.
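
The idea can be sketched as a set difference: movies rated by the similar users, minus movies the target user has already rated. The example below uses a made-up 0/1 matrix and a hand-picked `neighbors` list that is simply assumed to hold the most similar users; all names are hypothetical.

```python
import pandas as pd

# Made-up 0/1 matrix: rows are users, columns are movie ids
toy_binary = pd.DataFrame(
    [[1, 1, 0, 0],
     [1, 1, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 1]],
    index=pd.Index([100, 200, 300, 400], name="user_id"),
    columns=pd.Index([10, 20, 30, 40], name="movie_id"),
)

target = 100
neighbors = [200, 300]  # assumed to be the most similar users to `target`

# Movies the target user has already rated
target_row = toy_binary.loc[target]
seen = set(target_row[target_row == 1].index)

# Movies rated by at least one of the similar users
neighbor_counts = toy_binary.loc[neighbors].sum(axis=0)
neighbor_seen = set(neighbor_counts[neighbor_counts > 0].index)

# Candidates: rated by a similar user but not yet by the target user
recs = sorted(int(m) for m in neighbor_seen - seen)

print(recs)
```

In the notebook itself, the `neighbors` list would come from your `find_similar_users` function rather than being written by hand.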

In [None]:
# movies 3508 has already rated

# movies rated by the top 10 most similar users to 3508