# Recommendations with MovieTweetings: FunkSVD, incl. Evaluation

This notebook performs collaborative filtering with FunkSVD, the variant of SVD developed in the previous notebook (FunkSVD_Intro) that still works when the user-item matrix has many missing values.

The focus is on one question: how well does this solution work? To answer it, we simulate tuning our recommender on a chronological train/test split.

In [2]:
import numpy as np
import pandas as pd

# Read in the datasets
movies = pd.read_csv('data/movies_clean.csv')
reviews = pd.read_csv('data/reviews_clean.csv')

del movies['Unnamed: 0']
del reviews['Unnamed: 0']

### Create training and test sets

Using the **reviews** dataframe, perform the following tasks to create a training and a validation set that we can use to test the performance of the SVD algorithm with **off-line** validation techniques.

 * Order the reviews dataframe from earliest to most recent (IMPORTANT: always put the newer data in the test set)
 * Pull the first 10000 reviews from the dataset
 * Make the first 8000/10000 reviews the training data
 * Make the last 2000/10000 the test data
 * Return the training and test datasets

In [6]:
def create_train_test(reviews, order_by_column, training_size, testing_size):
    '''    
    Split dataframe in training and test sets. Order data from earliest to most recent.
    (Always put the newer data in the test set.)
    
    INPUT:
    reviews - (pandas df) dataframe to split into train and test
    order_by_column - (string) column name to sort by
    training_size - (int) number of rows in training set
    testing_size - (int) number of rows in the test set
    
    OUTPUT:
    train_df - (pandas df) dataframe of the training set
    val_df - (pandas df) dataframe of the test set
    '''
    
    reviews_set = reviews.sort_values(order_by_column)
    train_df = reviews_set.iloc[:training_size, :]
    val_df = reviews_set.iloc[training_size : (training_size + testing_size), :]
    

    return train_df, val_df

In [7]:
# apply function
train_df, val_df = create_train_test(reviews, 'date', 8000, 2000)

In [8]:
# Make sure the dataframes we are using are the right shape
assert train_df.shape[0] == 8000, "The number of rows doesn't look right in the training dataset."
assert val_df.shape[0] == 2000, "The number of rows doesn't look right in the validation dataset"
assert str(train_df.tail(1)['date']).split()[1] == '2013-03-15', "The last date in the training dataset doesn't look like what we expected."
assert str(val_df.tail(1)['date']).split()[1] == '2013-03-18', "The last date in the validation dataset doesn't look like what we expected."

In the real world, we might have all of the data up to this final date in the training data.  Then we want to see how well we are doing for each of the new ratings, which show up in the test data.
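
The idea of training on everything up to a given date can be sketched with a date cutoff instead of fixed row counts. This is only an illustration: `split_by_cutoff` and the toy dataframe are hypothetical and not part of the original notebook, and the sketch assumes the `date` column can be parsed by `pd.to_datetime`.

```python
import pandas as pd

def split_by_cutoff(reviews, date_column, cutoff):
    """Hypothetical helper: rows dated up to `cutoff` go to training,
    strictly later rows go to the test set."""
    dates = pd.to_datetime(reviews[date_column])
    cutoff = pd.Timestamp(cutoff)
    train = reviews[dates <= cutoff]
    test = reviews[dates > cutoff]
    return train, test

# Toy data, made up for illustration only
toy_reviews = pd.DataFrame({
    'user_id': [1, 2, 1, 3],
    'movie_id': [10, 10, 20, 30],
    'rating': [8, 6, 9, 4],
    'date': ['2013-03-14', '2013-03-15', '2013-03-16', '2013-03-18'],
})
toy_train, toy_test = split_by_cutoff(toy_reviews, 'date', '2013-03-15')
```

Note that the comparison uses full timestamps, so if the column also carries a time of day, a cutoff given as a bare date means midnight at the start of that day.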

### Define FunkSVD function (copy from last notebook)

In [None]:
# function defined in the last notebook (FunkSVD_Intro)

def FunkSVD(ratings_mat, latent_features=12, learning_rate=0.0001, iters=100):
    """ Perform matrix factorization using a basic form of FunkSVD with 
    no regularization.
    
    INPUT:
        ratings_mat: np.array, matrix with users as rows, movies as columns, 
            ratings as values (missing ratings are NaN)
        latent_features: int, number of latent features used
        learning_rate: float, the learning rate 
        iters: int, the number of iterations
    
    OUTPUT:
        user_mat: np.array, user by latent feature matrix
        movie_mat: np.array, latent feature by movie matrix
    """
    
    # Set up useful values to be used through the rest of the function
    n_users = ratings_mat.shape[0]
    n_movies = ratings_mat.shape[1]
    num_ratings = np.count_nonzero(~np.isnan(ratings_mat))
    
    # initialize the user and movie matrices with random values
    user_mat = np.random.rand(n_users, latent_features)
    movie_mat = np.random.rand(latent_features, n_movies)
    
    # initialize sse at 0 for first iteration
    sse_accum = 0
    
    # keep track of iteration and MSE
    print("Optimization Statistics")
    print("Iterations | Mean Squared Error ")
    
    # for each iteration
    for iteration in range(iters):

        # reset the sum of squared errors for this iteration
        sse_accum = 0
        
        # For each user-movie pair
        for i in range(n_users):
            for j in range(n_movies):
                
                # if the rating exists (NaN > 0 is False, so missing ratings are skipped)
                if ratings_mat[i, j] > 0:
                    
                    # compute the error as the actual minus the dot product of the user and movie latent features
                    diff = ratings_mat[i, j] - np.dot(user_mat[i, :], movie_mat[:, j])
                    
                    # Keep track of the sum of squared errors for the matrix
                    sse_accum += diff**2
                    
                    # gradient descent step: move each value against the gradient of the squared error
                    for k in range(latent_features):
                        user_mat[i, k] += learning_rate * (2*diff*movie_mat[k, j])
                        movie_mat[k, j] += learning_rate * (2*diff*user_mat[i, k])

        # print results
        print("%d \t\t %f" % (iteration+1, sse_accum / num_ratings))
        
    return user_mat, movie_mat 
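
The update inside the innermost loop follows from the gradient of the squared error for a single observed rating. The symbols below are notation introduced here for illustration, not taken from the original notebook: $r_{ij}$ is the rating (`ratings_mat[i, j]`), $u_{ik}$ and $v_{kj}$ are the entries of `user_mat` and `movie_mat`, $\alpha$ is `learning_rate`, and $e_{ij}$ is `diff`.

$$
e_{ij} = r_{ij} - \sum_{k} u_{ik} v_{kj}, \qquad
\frac{\partial e_{ij}^2}{\partial u_{ik}} = -2\, e_{ij}\, v_{kj}, \qquad
\frac{\partial e_{ij}^2}{\partial v_{kj}} = -2\, e_{ij}\, u_{ik}
$$

Stepping against the gradient gives the two updates used in the code:

$$
u_{ik} \leftarrow u_{ik} + \alpha \cdot 2\, e_{ij}\, v_{kj}, \qquad
v_{kj} \leftarrow v_{kj} + \alpha \cdot 2\, e_{ij}\, u_{ik}
$$

In the code as written, `diff` is computed once per rating before the loop over `k`, and the update of `movie_mat[k, j]` already sees the freshly updated `user_mat[i, k]`, so the second update is a slight variation of the formula above rather than a strictly simultaneous step.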

### Create user-by-item matrix and fit FunkSVD to training data

In [None]:
# create user-by-item matrix, you need this for the fit to work

def create_user_item_matrix(df):
    """Create user_item_matrix, with users as rows and items
    as columns.
    
    INPUT:
    df: DataFrame with (training) data.
    
    OUTPUT:
    user_item_df: DataFrame containing user_item_matrix
    user_item_np: Numpy Array containing user_item_matrix.
    """
    
    train_user_item = df[['user_id', 'movie_id', 'rating', 'timestamp']]
    train_data_df = train_user_item.groupby(
        ['user_id', 'movie_id'])['rating'].max().unstack()
    train_data_np = np.array(train_data_df)
    
    return train_data_df, train_data_np

# call function
train_data_df, train_data_np = create_user_item_matrix(train_df)
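
To see what the `groupby(...).max().unstack()` step produces, here is a minimal sketch on a made-up dataframe (the toy values are assumptions for illustration). Every user-movie pair without a rating becomes `NaN`, which is what `FunkSVD` later skips.

```python
import numpy as np
import pandas as pd

# Toy ratings, made up for illustration only
toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 30],
    'rating': [8, 6, 9, 4],
})

# Same reshaping as in create_user_item_matrix: users as rows, movies as columns
toy_matrix = toy.groupby(['user_id', 'movie_id'])['rating'].max().unstack()
toy_np = np.array(toy_matrix)

# Number of observed (non-NaN) ratings, as counted inside FunkSVD
n_observed = np.count_nonzero(~np.isnan(toy_np))
print(toy_matrix)
print("observed ratings:", n_observed)
```

With three users and three movies but only four ratings, five of the nine cells are `NaN`; the real training matrix is far sparser.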

In [None]:
# fit FunkSVD with the specified hyperparameters to the training data

user_mat, movie_mat = FunkSVD(train_data_np, latent_features=15, learning_rate=0.005, iters=100)

### Make predictions

Now that we have **user_mat** and **movie_mat**, we can predict how a user would rate a movie by computing the dot product of the user's row in **user_mat** and the movie's column in **movie_mat**.

In [39]:
def predict_rating(user_matrix, movie_matrix, user_id, movie_id):
    '''
    INPUT:
    user_matrix - user by latent factor matrix
    movie_matrix - latent factor by movie matrix
    user_id - the user_id from the reviews df
    movie_id - the movie_id according the movies df
    
    OUTPUT:
    pred - the predicted rating for user_id-movie_id according to FunkSVD
    '''
    # Use the training matrix (the global train_data_df) to recover the user and movie ordering used during fitting
    user_ids_series = np.array(train_data_df.index)
    movie_ids_series = np.array(train_data_df.columns)
    
    # User row and Movie Column
    user_row = np.where(user_ids_series == user_id)[0][0]
    movie_col = np.where(movie_ids_series == movie_id)[0][0]
    
    # Take dot product of that row and column in U and V to make prediction
    pred = np.dot(user_matrix[user_row, :], movie_matrix[:, movie_col])
    
    return pred

In [40]:
# Test the function with the first user-movie pair in the user-movie matrix (note that its actual rating is NaN)

pred_val = predict_rating(user_mat, movie_mat, 8, 2844)
pred_val

6.9847254075533831
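
A single prediction is one entry of the full matrix product, so all predictions can also be computed at once. The sketch below uses small random matrices as stand-ins for `user_mat` and `movie_mat` (the `toy_` names and shapes are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
toy_user_mat = rng.random((4, 3))    # 4 users x 3 latent features
toy_movie_mat = rng.random((3, 5))   # 3 latent features x 5 movies

# Predictions for every user-movie pair in one matrix product
all_preds = toy_user_mat @ toy_movie_mat

# The entry for user row 2 and movie column 1 equals the single dot product
single = np.dot(toy_user_mat[2, :], toy_movie_mat[:, 1])
print(np.isclose(all_preds[2, 1], single))
```

For the real data, the row and column positions still have to be looked up from `train_data_df.index` and `train_data_df.columns`, as `predict_rating` does.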

It is great that we now have a way to make predictions. However, it would be nice to get back a short phrase about the user, the movie, and the predicted rating.

The **print_prediction_summary** function below does this.

**Note:** The movie name doesn't come back in a great format, so the function cleans up the string a bit to make it nicer.

In [41]:
def print_prediction_summary(user_id, movie_id, prediction):
    '''
    INPUT:
    user_id - the user_id from the reviews df
    movie_id - the movie_id according the movies df
    prediction - the predicted rating for user_id-movie_id
    
    OUTPUT:
    None - prints a statement about the user, movie, and prediction made
    
    '''
    
    movie_name = str(movies[movies['movie_id'] == movie_id]['movie']) [5:]
    movie_name = movie_name.replace('\nName: movie, dtype: object', '')
    print("For user {} we predict a {} rating for the movie {}.".format(user_id, round(prediction, 2), str(movie_name)))


In [42]:
# Test the function with the result of the previous function

print_prediction_summary(8, 2844, pred_val)

For user 8 we predict a 6.98 rating for the movie  Fantômas - À l'ombre de la guillotine (1913).


### Validate results

Now that we have the ability to make predictions, let's see how well our predictions do on the test ratings we already have.  This will give an indication of how well we have captured the latent features, and our ability to use the latent features to make predictions in the future!

For each user-movie rating in the **val_df** dataset, compare the actual rating with the prediction we would make.

In [43]:
def validation_comparison(val_df, num_preds):
    '''
    INPUT:
    val_df - the validation dataset created with create_train_test above
    num_preds - (int) the number of rows (going in order) you would like to make predictions for
    
    OUTPUT:
    Nothing returned - prints a statement about the prediction made for each of the first num_preds rows of val_df
    '''
    val_users = np.array(val_df['user_id'])
    val_movies = np.array(val_df['movie_id'])
    val_ratings = np.array(val_df['rating'])
    
    
    for idx in range(num_preds):
        pred = predict_rating(user_mat, movie_mat, val_users[idx], val_movies[idx])
        print("The actual rating for user {} on movie {} is {}.\n While the predicted rating is {}.".format(val_users[idx], val_movies[idx], val_ratings[idx], round(pred))) 
    
# Perform the predicted vs. actual for the first 6 rows.  How does it look?

validation_comparison(val_df, 6)        

The actual rating for user 49056 on movie 1598822 is 8.
 While the predicted rating is 7.0.
The actual rating for user 49056 on movie 289879 is 9.
 While the predicted rating is 8.0.
The actual rating for user 49056 on movie 1563738 is 9.
 While the predicted rating is 7.0.
The actual rating for user 49056 on movie 1458175 is 4.
 While the predicted rating is 7.0.
The actual rating for user 28599 on movie 103639 is 8.
 While the predicted rating is 8.0.
The actual rating for user 50593 on movie 1560985 is 4.
 While the predicted rating is 3.0.
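
Eyeballing six rows only goes so far; a single error number makes runs comparable when tuning. As a rough sketch, the snippet below computes RMSE and MAE from the six printed rows above. Because the printed predictions are rounded, the result only approximates the error of the unrounded predictions.

```python
import numpy as np

# Actual and (rounded) predicted ratings copied from the six printed rows
actual = np.array([8, 9, 9, 4, 8, 4])
predicted = np.array([7, 8, 7, 7, 8, 3])

errors = actual - predicted
rmse = np.sqrt(np.mean(errors ** 2))   # root mean squared error
mae = np.mean(np.abs(errors))          # mean absolute error
print("RMSE: {:.2f}, MAE: {:.2f}".format(rmse, mae))
# prints: RMSE: 1.63, MAE: 1.33
```

Six rows are far too few to judge the model; the same computation over the whole validation set would be the meaningful figure.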


### Running into the cold start problem

In [44]:
# Perform the predicted vs. actual for the first 7 rows.  What happened?

validation_comparison(val_df, 7)        

The actual rating for user 49056 on movie 1598822 is 8.
 While the predicted rating is 7.0.
The actual rating for user 49056 on movie 289879 is 9.
 While the predicted rating is 8.0.
The actual rating for user 49056 on movie 1563738 is 9.
 While the predicted rating is 7.0.
The actual rating for user 49056 on movie 1458175 is 4.
 While the predicted rating is 7.0.
The actual rating for user 28599 on movie 103639 is 8.
 While the predicted rating is 8.0.
The actual rating for user 50593 on movie 1560985 is 4.
 While the predicted rating is 3.0.


IndexError: index 0 is out of bounds for axis 0 with size 0

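**Cold start problem:** The test set contains new users and movies that weren't in the training set. For these, `np.where` in `predict_rating` finds no matching row or column, and indexing the empty result raises the `IndexError` above. Collaborative filtering techniques cannot make predictions for them.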
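
One way to keep the validation loop from crashing is to check whether the lookup found anything before indexing. The function below is a hypothetical variant of `predict_rating`, not the notebook's own solution: it takes the id arrays as arguments instead of reading the global `train_data_df`, and it returns a caller-supplied `fallback` for unseen users or movies. The toy matrices are made up so the result is easy to check by hand.

```python
import numpy as np

def predict_or_fallback(user_mat, movie_mat, user_ids, movie_ids,
                        user_id, movie_id, fallback=None):
    """Hypothetical variant of predict_rating: returns `fallback` when the
    user or the movie was not part of the training matrix."""
    user_rows = np.where(user_ids == user_id)[0]
    movie_cols = np.where(movie_ids == movie_id)[0]
    if len(user_rows) == 0 or len(movie_cols) == 0:
        return fallback
    return float(np.dot(user_mat[user_rows[0], :], movie_mat[:, movie_cols[0]]))

# Toy factors, made up for illustration only
toy_user_ids = np.array([1, 2])
toy_movie_ids = np.array([10, 20])
toy_U = np.array([[1.0, 2.0], [3.0, 1.0]])   # users x latent features
toy_V = np.array([[2.0, 1.0], [1.0, 3.0]])   # latent features x movies

known = predict_or_fallback(toy_U, toy_V, toy_user_ids, toy_movie_ids, 1, 10)
unseen = predict_or_fallback(toy_U, toy_V, toy_user_ids, toy_movie_ids, 99, 10, fallback=7.0)
print(known, unseen)
```

In this notebook the id arrays would be `np.array(train_data_df.index)` and `np.array(train_data_df.columns)`. What to use as `fallback` (for example, a mean rating from the training data, or a separate content-based or rank-based recommender) is a design choice the notebook leaves open.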

---