[View in Colaboratory](https://colab.research.google.com/github/schwaaweb/aimlds1_08-UnsupervisedLearning/blob/master/Th08_AS--AG--Recommendation_Systems_Assignment_Solution.ipynb)

# Assignment Description

Write a function `recommend_for_user(user)` that takes a user and returns the top 5 recommended movies for them to watch that they have not already rated.

Use the code from the lecture and coding challenge as a starting point, and experiment with different approaches: you can use the predicted ratings generated there, try different distance measures, or build a hybrid approach that also considers item similarity. You can also find the `n` users most similar to the given user and average their preferences to generate new ratings and recommendations.
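
The neighbor-averaging idea could be sketched roughly as below. This is only an illustration under stated assumptions: the function name `predict_from_neighbors`, the cosine similarity on zero-filled ratings, and the tiny `example` matrix are all hypothetical choices, not part of the lecture code.

```python
import numpy as np

def predict_from_neighbors(ratings, user_idx, n_neighbors=2):
    """Predict one user's ratings by averaging their most similar users.

    ratings: 2-D float array (rows=users, cols=movies), NaN where unrated.
    user_idx: 0-based row index of the target user (hypothetical convention).
    """
    # Treat unrated movies as 0 when measuring similarity (an assumption)
    filled = np.where(np.isnan(ratings), 0.0, ratings)
    norms = np.linalg.norm(filled, axis=1)
    norms[norms == 0] = 1.0  # avoid dividing by zero for users with no ratings
    # Cosine similarity between the target user and every user
    sims = filled @ filled[user_idx] / (norms * norms[user_idx])
    sims[user_idx] = -np.inf  # never pick the user as their own neighbor
    neighbors = np.argsort(sims)[::-1][:n_neighbors]
    # Average each movie over the neighbors who actually rated it
    block = ratings[neighbors]
    counts = (~np.isnan(block)).sum(axis=0)
    sums = np.nansum(block, axis=0)
    return np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)

# Made-up ratings: users 1 and 3 have tastes close to user 0
example = np.array([
    [5.0, 4.0, np.nan, np.nan],
    [5.0, 4.0, 2.0, np.nan],
    [1.0, np.nan, 5.0, 4.0],
    [4.0, 5.0, 1.0, np.nan],
])
preds = predict_from_neighbors(example, user_idx=0, n_neighbors=2)
```

A movie that none of the chosen neighbors rated stays NaN, so a recommender built on this would need to skip those entries when ranking.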

The end result should be a function that is reusable and suitable for an application and that satisfies this spec:

```
def recommend_for_user(user, n=5):
  """
  Generate movie recommendations for a user.
  Input: userId (from MovieLens data)
  Output: list of tuples (movieId, predictedRating) of top n movies recommended for user (not previously rated by them)
  """
  pass
```

As a stretch goal, build a test set from the data: take a subset of users, drop some of their top ratings (their favorite movies), then generate recommendations for those users and check whether the dropped movies are recommended and with what predicted rating.
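
One possible way to hide favorite movies is sketched here at the level of a ratings matrix. The helper name `hold_out_favorites`, its parameters, and the sample matrix are hypothetical, and it assumes unrated entries are stored as NaN.

```python
import numpy as np

def hold_out_favorites(ratings, user_rows, n_hidden=1):
    """Hide each selected user's highest ratings to build a test set.

    Returns (train, held_out): train is a copy of ratings with the hidden
    entries set to NaN, and held_out maps (row, col) -> the hidden rating.
    """
    train = np.array(ratings, dtype=float)  # copy, so the input is untouched
    held_out = {}
    for row in user_rows:
        rated = np.flatnonzero(~np.isnan(train[row]))
        # Columns of this user's rated movies, highest rating first
        top = rated[np.argsort(train[row, rated])[::-1][:n_hidden]]
        for col in top:
            held_out[(row, int(col))] = float(train[row, col])
            train[row, col] = np.nan
    return train, held_out

# Made-up ratings for three users and three movies
ratings = np.array([
    [5.0, 3.0, np.nan],
    [2.0, np.nan, 4.0],
    [1.0, 5.0, 3.0],
])
train, held_out = hold_out_favorites(ratings, user_rows=[0, 2], n_hidden=1)
```

After fitting a model on `train`, the predicted ratings at the positions in `held_out` can be compared with the hidden values, and the hidden movies can be checked against each user's top recommendations.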

Part of the goal here is to think about making your code general and reusable, so it could plug into an actual live recommendation system. As a super-stretch goal, you can explore popular Python web application frameworks - these are the tools used to build the back-end that exposes an API so data can be displayed to the user.

- http://flask.pocoo.org/ - popular minimal framework
- https://www.djangoproject.com/ - "industry-grade" framework (more complicated, includes more features)

In a real-world situation, your role in a team could be ensuring that the API endpoint for generating recommendations does what it should (so everyone else can build on it). In web application terminology, this means that some route (e.g. `/recommendations/<user>`) can accept requests and return JSON data in response, essentially wrapping the function you built for this assignment.
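
As a rough sketch of such a route, the example below wraps a recommendation function in a plain WSGI application using only the standard library, rather than Flask or Django, so it runs without extra installs. The `make_app` helper, the JSON field names, and the `fake_recommend` stand-in are assumptions made for illustration; a real deployment would pass in the function built for this assignment.

```python
import json

def make_app(recommend):
    """Build a WSGI app that answers /recommendations/<user> with JSON.

    recommend: callable (user, n) -> list of (movieId, predictedRating).
    """
    def app(environ, start_response):
        parts = environ.get('PATH_INFO', '').strip('/').split('/')
        if (len(parts) == 2 and parts[0] == 'recommendations'
                and parts[1].isdigit()):
            user = int(parts[1])
            body = json.dumps({
                'user': user,
                'recommendations': [
                    {'movieId': int(m), 'predictedRating': float(r)}
                    for m, r in recommend(user, 5)
                ],
            })
            status = '200 OK'
        else:
            body = json.dumps({'error': 'not found'})
            status = '404 Not Found'
        payload = body.encode('utf-8')
        start_response(status, [('Content-Type', 'application/json'),
                                ('Content-Length', str(len(payload)))])
        return [payload]
    return app

def fake_recommend(user, n=5):
    # Hypothetical stand-in for the real recommend_for_user
    return [(10, 4.5), (3, 4.1)][:n]

# Call the app directly, the way a WSGI server would
app = make_app(fake_recommend)
captured = {}

def start_response(status, headers):
    captured['status'] = status

response = app({'PATH_INFO': '/recommendations/5'}, start_response)
result = json.loads(response[0].decode('utf-8'))
```

To try it in a browser, the same `app` object could be served locally with `wsgiref.simple_server.make_server('', 8000, app).serve_forever()`; a Flask or Django route would play the same role in a larger application.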

# Assignment Solution (as a Python Class)

In [0]:
import numpy as np
import pandas as pd

class CollaborativeFilter:
  """Generate collaborative filtering powered recommendations for movie data."""
  
  def __init__(self, train_csv, test_csv=None):
    """Initialize an instance of the CollaborativeFilter class, loading data and
    forming a pivot table appropriate for running collaborative filtering.
    
    Args:
        train_csv (str): path (local or web) to CSV with MovieLens ratings data
        test_csv (str): path (local or web) to CSV with testing data (optional)
    """
    self.data_path = train_csv
    df = pd.read_csv(train_csv)
    self.pivot = pd.pivot_table(df, index='userid', columns='movieid',
                                aggfunc='max')
    self.test_path = test_csv
    if test_csv:
      df_test = pd.read_csv(test_csv)
      self.pivot_test = pd.pivot_table(df_test, index='userid',
                                       columns='movieid', aggfunc='max')
  
  def _matrix_factorization(self, R, P, Q, K, steps=5000, alpha=0.0002,
                            beta=0.02):
    """An implementation of matrix factorization.
    Source: http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
    
    Args:
        R     : a matrix to be factorized, dimension N x M
        P     : an initial matrix of dimension N x K
        Q     : an initial matrix of dimension M x K
        K     : the number of latent features
        steps : the maximum number of steps to perform the optimization
        alpha : the learning rate
        beta  : the regularization parameter
    
    Returns: a tuple (P, Q.T) reflecting the factorized values of R
    """
    Q = Q.T
    for step in range(steps):
      for i in range(len(R)):
        for j in range(len(R[i])):
          if R[i][j] > 0:
            eij = R[i][j] - np.dot(P[i,:], Q[:,j])
            for k in range(K):
              P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
              Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
      # Regularized squared error over the observed ratings
      e = 0
      for i in range(len(R)):
        for j in range(len(R[i])):
          if R[i][j] > 0:
            e = e + pow(R[i][j] - np.dot(P[i,:],Q[:,j]), 2)
            for k in range(K):
              e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
      if e < 0.001:
        break
    return P, Q.T
    
  def run_matrix_factorization(self, K=2):
    """Run the matrix factorization on training data, and save predicted
    ratings to self.predicted_ratings as a matrix (rows=users, cols=movies).
    
    Args:
        K: number of latent features (passed on to _matrix_factorization)
    """
    R = self.pivot.to_numpy()  # as_matrix() was removed in pandas 1.0
    N = len(R)
    M = len(R[0])
    
    P = np.random.rand(N, K) # Matrix of user attributes
    Q = np.random.rand(M, K) # Matrix of movie attributes
    
    nP, nQ = self._matrix_factorization(R, P, Q, K)
    # Find all predicted ratings by multiplying nP and nQ 
    self.predicted_ratings = np.dot(nP, nQ.T)
  
  def rmse(self):
    """Calculate root mean square error accuracy measurement.
    
    Returns: tuple (train_rmse, test_rmse), where test_rmse is None if there
        is no test data
    """
    if not hasattr(self, 'predicted_ratings'):
      raise Exception('Please execute run_matrix_factorization() to '
                      'generate predictions before calculating RMSE.')
    train_rmse = np.sqrt(np.nanmean(np.square(self.pivot.to_numpy() -
                                              self.predicted_ratings)))
    if hasattr(self, 'pivot_test'):
      test_rmse = np.sqrt(np.nanmean(np.square(self.pivot_test.to_numpy() -
                                               self.predicted_ratings)))
    else:
      test_rmse = None

    return train_rmse, test_rmse
  
  def recommend_for_user(self, user, n=5):
    """Generate movie recommendations for a user. Assumes user ids are
    contiguous and start at 1, so that user `u` is row `u - 1`.
    
    Args:
        user: id for a user in the MovieLens rating data
        n: number of movies (that they have not reviewed) to recommend
    
    Returns: list of tuples (movieid, predicted_rating) of recommendations
    """
    user_ratings = self.pivot.to_numpy()[user - 1, :]
    predicted_ratings = self.predicted_ratings[user - 1, :].copy()
    # Wherever user ratings exist (aren't nan), make predicted ratings -1
    # so the predicted movies are not recommended
    predicted_ratings[~np.isnan(user_ratings)] = -1
    # Assuming movieids start with 1, sort descending by predicted rating
    movies = range(len(predicted_ratings))
    movies = sorted(movies, key=lambda i: predicted_ratings[i], reverse=True)
    return [(movie + 1, predicted_ratings[movie]) for movie in movies[:n]]
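
Read directly from the code above, the update performed by `_matrix_factorization` can be written out as follows. For every observed rating $r_{ij} > 0$, with $p_{ik}$ an entry of `P` (N x K) and $q_{kj}$ an entry of the transposed `Q` (K x M), the error is computed once and each latent feature $k$ is then updated in turn. Note that the code updates `P[i][k]` in place before using it in the update of `Q[k][j]`. The loop stops early once the regularized error $E$ falls below 0.001.

```latex
\begin{aligned}
e_{ij} &= r_{ij} - \sum_{k=1}^{K} p_{ik}\, q_{kj} \\
p_{ik} &\leftarrow p_{ik} + \alpha \left( 2\, e_{ij}\, q_{kj} - \beta\, p_{ik} \right) \\
q_{kj} &\leftarrow q_{kj} + \alpha \left( 2\, e_{ij}\, p_{ik} - \beta\, q_{kj} \right) \\
E &= \sum_{(i,j)\,:\, r_{ij} > 0} \left[ \left( r_{ij} - \sum_{k=1}^{K} p_{ik}\, q_{kj} \right)^{2} + \frac{\beta}{2} \sum_{k=1}^{K} \left( p_{ik}^{2} + q_{kj}^{2} \right) \right]
\end{aligned}
```

Here $\alpha$ is the learning rate (`alpha`) and $\beta$ the regularization parameter (`beta`).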

In [0]:
train_data = 'https://www.dropbox.com/s/ej76ujnyxmxn2mi/movie_ratings_training.csv?raw=1'
test_data = 'https://www.dropbox.com/s/h5jwlmnaxu5xdvq/movie_ratings_testing.csv?raw=1'
cf = CollaborativeFilter(train_data, test_data)

In [0]:
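# predicted_ratings is not set until run_matrix_factorization() has been
# called, so rmse() raises an Exception at this point.
dir(cf)
print(cf.rmse())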

In [0]:
cf.run_matrix_factorization()
print(cf.rmse())

In [0]:
cf.pivot.to_numpy()[4, :]

In [0]:
cf.predicted_ratings[4, :]

In [0]:
cf.recommend_for_user(5)