# Building the Recommender System (Matrix Factorization)

### Intuition

Matrix factorization rests on the idea that a small number of latent features determine how a user rates an item. For example, two users might both rate a movie highly because they like its cast, or because it is an action movie and both of them prefer that genre.

Hence, if we can discover these latent features, we can predict the rating a given user would give a given item, because the features associated with the user should match the features associated with the item.
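
As a minimal sketch of this idea, with made-up numbers (the two features, "action" and "puzzle", and every value below are hypothetical), the predicted affinity can be taken as the dot product of a user's latent vector and an item's latent vector:

```python
import numpy as np

# Hypothetical latent features: [liking for action, liking for puzzles]
user_vec = np.array([0.9, 0.1])      # a user who mostly likes action games
action_game = np.array([1.0, 0.0])   # an item that is purely "action"
puzzle_game = np.array([0.0, 1.0])   # an item that is purely "puzzle"

# The better the user and item features line up, the larger the dot product
score_action = user_vec.dot(action_game)
score_puzzle = user_vec.dot(puzzle_game)

print(score_action, score_puzzle)
```

In the model built below the features are not named in advance; they are learned from the ratings.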

### Workflow

1. Read in the data.
2. Create the utility (user-product interaction) matrix.
3. Using the MF class, create two latent feature matrices (P and Q).
4. Use stochastic gradient descent to lower the root mean squared error (RMSE).
5. Score predictions based on the final RMSE and recall.

In [31]:
# Import standard libraries
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import sqlite3
from collections import Counter
from time import sleep
import time

sns.set_style('whitegrid')

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

### 1. Read in Data

In [2]:
# Read in Data
sqlite_db = 'datasets/amzn_vg_clean.db'
conn = sqlite3.connect(sqlite_db) 

query = '''
SELECT "customer_id", "product_title", "star_rating"
FROM video_games
'''

sample = pd.read_sql(query, con=conn)

In [3]:
# Matrix of user ratings for each product
ratings_df = sample.groupby(['customer_id','product_title'])['star_rating'].mean().unstack()
ratings_df = ratings_df.fillna(0)
ratings_df.head()


| customer_id | 007 Legends | 007 The World Is Not Enough PS | 007 The World is Not Enough | 10 Minute Solution | 100 All-Time Favorites - Nintendo DS | 100 Classic Books - Nintendo DS | 1001 Touch Games - Nintendo DS | 101-in-1 Explosive Megamix - Nintendo DS | 1080 Snowboarding | 1080° Avalanche | ... | iCarly - Nintendo DS | inFAMOUS - Playstation 3 | inFAMOUS 2 | inFAMOUS Collection - Playstation 3 | inFAMOUS Second Son - PlayStation 4 | inFAMOUS: Second Son Collector's Edition - PlayStation 4 | inFAMOUS: Second Son Limited Edition (PlayStation 4) | inFamous First Light - PS4 (Physical Version) | miCoach by Adidas | rFactor V. 1.255 - PC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11049 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 86525 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 100864 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 452646 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1132227 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [4]:
ratings_df.shape

(1996, 4669)
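
To see what the reshaping does, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical, laid out like `sample`). It builds the same kind of utility matrix and measures how many cells actually hold a rating, keeping in mind that 0 stands for "not rated" after `fillna(0)`:

```python
import pandas as pd

# Hypothetical toy data in the same three-column layout as `sample`
toy = pd.DataFrame({
    'customer_id':   [1, 1, 2, 3, 3],
    'product_title': ['A', 'B', 'A', 'B', 'C'],
    'star_rating':   [5, 3, 4, 2, 1],
})

# Same reshaping as above: one row per customer, one column per product
toy_ratings = (toy.groupby(['customer_id', 'product_title'])['star_rating']
                  .mean()
                  .unstack()
                  .fillna(0))

# Fraction of cells holding an actual rating (0 marks "not rated")
density = (toy_ratings.values != 0).mean()

print(toy_ratings)
print('density:', density)
```

Running the same `density` line on `ratings_df.values` would show how sparse the real 1996 x 4669 matrix is.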

### 2. Matrix Factorization Class

In [5]:
class MF():

    def __init__(self, R, K, alpha, beta, iterations):
        '''
        Perform matrix factorization to predict empty entries in a matrix.
        
        Arguments:
        - R (ndarray)   : user-item rating matrix
        - K (int)       : number of latent dimensions
        - alpha (float) : learning rate
        - beta (float)  : regularization parameter
        - iterations (int) : number of passes over the training samples
        '''

        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    def train(self):
        '''
        Train the MF model for the configured number of iterations.
        
        Process:
        1. Initialize user and item latent feature matrices.
        2. Initialize global, user and item biases.
        3. Implement gradient descent.
        4. Calculate RMSE at each iteration.
        '''
        # Initialize user and item latent feature matrices
        self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1./self.K, size=(self.num_items, self.K))

        # Initialize the biases
        self.b_u = np.zeros(self.num_users)
        self.b_i = np.zeros(self.num_items)
        self.b = np.mean(self.R[np.where(self.R != 0)])

        # Create a list of training samples
        self.samples = [
            (u, i, self.R[u, i]) 
            for u in range(self.num_users) 
            for i in range(self.num_items) 
            if self.R[u, i] > 0
        ]

        # Perform stochastic gradient descent for the given number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
            rmse = self.mse()
            training_process.append((i, rmse))
            if (i+1) % 100 == 0:
                print("Iteration: %d ; rmse = %.4f" % (i+1, rmse))

        return training_process

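    def mse(self):
        '''
        Despite its name, this returns the root mean squared error (RMSE)
        over the observed (non-zero) ratings.
        '''
        xs, ys = self.R.nonzero()
        predicted = self.full_matrix()
        error = 0
        for x, y in zip(xs, ys):
            error += pow(self.R[x, y] - predicted[x, y], 2)
        # Divide by the number of ratings before taking the root;
        # without this the value is the root of the total squared error
        return np.sqrt(error / len(xs))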

    def sgd(self):
        '''
        Perform stochastic gradient descent
        '''
        for u, i, r in self.samples:
            # Compute prediction and error
            prediction = self.get_rating(u, i)
            e = (r - prediction)

            # Update biases
            self.b_u[u] += self.alpha * (e - self.beta * self.b_u[u])
            self.b_i[i] += self.alpha * (e - self.beta * self.b_i[i])

            # Update user and item latent feature matrices
            self.P[u, :] += self.alpha * (e * self.Q[i, :] - self.beta * self.P[u, :])
            self.Q[i, :] += self.alpha * (e * self.P[u, :] - self.beta * self.Q[i, :])

    def get_rating(self, u, i):
        '''
        Get the predicted rating of user u for item i
        '''
        prediction = self.b + self.b_u[u] + self.b_i[i] + self.P[u, :].dot(self.Q[i, :].T)
        return prediction

    def full_matrix(self):
        '''
        Compute the full matrix using the resultant biases, P and Q
        '''
        return self.b + self.b_u[:, np.newaxis] + self.b_i[np.newaxis, :] + self.P.dot(self.Q.T)
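
Read off the `get_rating` and `sgd` methods above, the model and its per-sample updates can be summarized as follows. The symbols are introduced here for illustration: $p_u$ and $q_i$ are the rows `P[u, :]` and `Q[i, :]`, $b$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $\alpha$, $\beta$ correspond to `alpha` and `beta`.

```latex
\begin{aligned}
\hat{r}_{ui} &= b + b_u + b_i + p_u^{\top} q_i, \qquad e_{ui} = r_{ui} - \hat{r}_{ui} \\
b_u &\leftarrow b_u + \alpha \,(e_{ui} - \beta\, b_u) \\
b_i &\leftarrow b_i + \alpha \,(e_{ui} - \beta\, b_i) \\
p_u &\leftarrow p_u + \alpha \,(e_{ui}\, q_i - \beta\, p_u) \\
q_i &\leftarrow q_i + \alpha \,(e_{ui}\, p_u - \beta\, q_i)
\end{aligned}
```

Because the code updates `P[u, :]` before `Q[i, :]`, the last line sees the already-updated $p_u$. These steps match stochastic gradient descent on a squared error with an L2 penalty weighted by $\beta$, up to a constant factor absorbed into $\alpha$.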

**NEW** (variant without the user bias term or per-iteration RMSE tracking)

In [22]:
class MF():

    def __init__(self, R, K, alpha, beta, iterations):
        '''
        Perform matrix factorization to predict empty entries in a matrix.
        
        Arguments:
        - R (ndarray)   : user-item rating matrix
        - K (int)       : number of latent dimensions
        - alpha (float) : learning rate
        - beta (float)  : regularization parameter
        - iterations (int) : number of passes over the training samples
        '''

        self.R = R
        self.num_users, self.num_items = R.shape
        self.K = K
        self.alpha = alpha
        self.beta = beta
        self.iterations = iterations

    def train(self):
        '''
        Train the MF model for the configured number of iterations.
        
        Process:
        1. Initialize user and item latent feature matrices.
        2. Initialize global and item biases (the user bias is disabled).
        3. Implement gradient descent.
        4. Calculate RMSE at each iteration (currently commented out).
        '''
        # Initialize user and item latent feature matrices
        self.P = np.random.normal(scale=1./self.K, size=(self.num_users, self.K))
        self.Q = np.random.normal(scale=1./self.K, size=(self.num_items, self.K))

        # Initialize the biases
#         self.b_u = np.zeros(self.num_users)
        self.b_i = np.zeros(self.num_items)
        self.b = np.mean(self.R[np.where(self.R != 0)])

        # Create a list of training samples
        self.samples = [
            (u, i, self.R[u, i]) 
            for u in range(self.num_users) 
            for i in range(self.num_items) 
            if self.R[u, i] > 0
        ]

        # Perform stochastic gradient descent for number of iterations
        training_process = []
        for i in range(self.iterations):
            np.random.shuffle(self.samples)
            self.sgd()
#             mse = self.mse()
#             training_process.append((i, mse))
#             if (i+1) % 100 == 0:
#                 print("Iteration: %d ; rmse = %.4f" % (i+1, mse))

        return training_process

#     def mse(self):
#         '''
#         A function to compute the total mean square error
#         '''
#         xs, ys = self.R.nonzero()
#         predicted = [self.get_rating(xs, ys) for xs, ys in zip(xs, ys)]
#         error = 0
#         for x, y, p in zip(xs, ys, predicted):
#             error += pow(self.R[x, y] - p, 2)
#         return np.sqrt(error)

    def sgd(self):
        '''
        Perform stochastic gradient descent
        '''
        for u, i, r in self.samples:
            # Compute prediction and error
            prediction = self.get_rating(u, i)
            e = (r - prediction)

            # Update biases
#             self.b_u[u] += self.alpha * (e - self.beta * self.b_u[u])
            self.b_i[i] += self.alpha * (e - self.beta * self.b_i[i])

            # Update user and item latent feature matrices
            self.P[u, :] += self.alpha * (e * self.Q[i, :] - self.beta * self.P[u,:])
            self.Q[i, :] += self.alpha * (e * self.P[u, :] - self.beta * self.Q[i,:])

    def get_rating(self, u, i):
        '''
        Get the predicted rating of user u for item i
        '''
        prediction = self.b + self.b_i[i] + self.P[u, :].dot(self.Q[i, :].T)
        return prediction

    def full_matrix(self):
        '''
        Compute the full matrix using the resultant biases, P and Q
        '''
        return self.b + self.b_i[np.newaxis, :] + self.P.dot(self.Q.T)
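
Both versions of `full_matrix` rely on NumPy broadcasting to add the bias vectors to the `P.dot(Q.T)` product. The sketch below, using small random stand-in parameters (all values hypothetical), checks that the broadcast form with both biases agrees with the entry-by-entry `get_rating` formula; for the NEW variant the `b_u` term would simply be dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, K = 4, 3, 2

# Hypothetical small parameters standing in for a trained model
b = 3.5
b_u = rng.normal(size=num_users)
b_i = rng.normal(size=num_items)
P = rng.normal(size=(num_users, K))
Q = rng.normal(size=(num_items, K))

# Column vector of user biases and row vector of item biases, broadcast
# against the (num_users, num_items) product P.dot(Q.T)
full = b + b_u[:, np.newaxis] + b_i[np.newaxis, :] + P.dot(Q.T)

# Entry-by-entry version, mirroring get_rating(u, i)
loop = np.array([[b + b_u[u] + b_i[i] + P[u].dot(Q[i])
                  for i in range(num_items)]
                 for u in range(num_users)])

print(np.allclose(full, loop))
```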

### 3. Train Matrix Factorization Model

In [23]:
# Define R
R = ratings_df.values

# Initialize the matrix factorization class
mf = MF(R, K=2, alpha=0.01, beta=0.01, iterations=500)

# Note: on bigger datasets, alpha=0.1 caused problems in the matrix multiplication
# (presumably the updates diverging and overflowing); reducing alpha avoids this

In [24]:
# Time the training run
start = time.time()

# Train model
training_procc = mf.train()

end = time.time()
print('Time taken: {} seconds'.format(end - start))

Time taken: 345.47141194343567 seconds
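
On every call, `train()` builds `self.samples` with a Python double loop over each cell of `R` (1996 x 4669 here). A possible vectorized alternative is sketched below on a made-up matrix (`R_toy` is hypothetical). How much of the run time above this would save was not measured, since the SGD loop itself also runs in pure Python.

```python
import numpy as np

# Hypothetical small rating matrix; 0 means "not rated"
R_toy = np.array([[5, 0, 3],
                  [0, 0, 4],
                  [1, 2, 0]])

# Nested-loop version, as in MF.train()
samples_loop = [(u, i, int(R_toy[u, i]))
                for u in range(R_toy.shape[0])
                for i in range(R_toy.shape[1])
                if R_toy[u, i] > 0]

# Vectorized version: np.nonzero returns the row and column indices
# of the True cells in row-major order
users, items = np.nonzero(R_toy > 0)
samples_fast = [(int(u), int(i), int(r))
                for u, i, r in zip(users, items, R_toy[users, items])]

print(samples_fast == samples_loop)
```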


In [None]:
iterations = [itr[0] for itr in training_procc]
rmse = [score[1] for score in training_procc]

plt.figure(figsize=(10,8))
plt.plot(iterations, rmse)
plt.title('RMSE - {} Iterations'.format(mf.iterations), pad=20, fontsize=20)
plt.xlabel('Iterations', labelpad=20, fontsize=15)
plt.ylabel('RMSE', labelpad=20, fontsize=15)

**Closer Look at our Model**

In [None]:
# Latent Features for customers
print(mf.P.shape)

P_latent_features = pd.DataFrame(mf.P, 
                                 index=ratings_df.index, 
                                 columns=['Latent Feature 1','Latent Feature 2'])

P_latent_features.head()

In [None]:
# Latent Features for items 
print(mf.Q.shape)

Q_latent_features = pd.DataFrame(mf.Q, 
                                 index=ratings_df.columns, 
                                 columns=['Latent Feature 1','Latent Feature 2'])

Q_latent_features.head()


In [None]:
Q_latent_features['diff'] = Q_latent_features['Latent Feature 1'] - Q_latent_features['Latent Feature 2']
Q_latent_features.describe()

In [None]:
Q_latent_features[Q_latent_features['diff']>0]

In [None]:
Q_latent_features[Q_latent_features['diff']<0]

In [None]:
# User bias (requires the MF version that keeps b_u; the NEW variant disables it)
print(mf.b_u.shape)

user_bias = pd.DataFrame(mf.b_u, index=ratings_df.index, columns=['User Bias'])
user_bias

In [None]:
# Item bias
print(mf.b_i.shape)

item_bias = pd.DataFrame(mf.b_i, index=ratings_df.columns, columns=['Item Bias'])
item_bias

**Compare Predictions to Original**

In [None]:
pd.DataFrame(mf.full_matrix(), index=ratings_df.index, columns=ratings_df.columns).head()

In [None]:
ratings_df.head()

In [None]:
# RMSE on the observed ratings (requires the MF version that defines mse())
mf.mse()

In [None]:
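def mse(true, preds):
    '''
    Compute the root mean squared error (RMSE) over the observed
    (non-zero) entries of `true`, printing each (true, predicted) pair.
    '''
    xs, ys = true.nonzero()
    error = 0
    for x, y in zip(xs, ys):
        error += (true[x, y] - preds[x, y])**2
        print(true[x, y], preds[x, y])
    # Divide by the number of ratings before taking the root
    return np.sqrt(error / len(xs))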

In [None]:
mse(ratings_df.values, mf.full_matrix())
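
For larger matrices the same score can be computed without a Python loop. The following is a sketch only; `rmse_observed` is an illustrative name and the toy arrays are made up.

```python
import numpy as np

def rmse_observed(true, preds):
    '''
    RMSE over the entries of `true` that hold a rating (non-zero cells).
    `rmse_observed` is an illustrative name, not part of the notebook.
    '''
    mask = true != 0                      # True where a rating exists
    diff = true[mask] - preds[mask]       # errors on rated cells only
    return np.sqrt(np.mean(diff ** 2))

# Hypothetical toy check: errors of +1 and -1 on the two rated cells
true_toy = np.array([[5.0, 0.0], [0.0, 3.0]])
preds_toy = np.array([[4.0, 2.0], [1.0, 4.0]])
result = rmse_observed(true_toy, preds_toy)
print(result)
```

The predictions for unrated cells (2.0 and 1.0 in the toy arrays) do not affect the score, which is the intended behavior when 0 means "no rating".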

### 5. Make Recommendations for New Customer

In [25]:
def get_recommendations(new_ratings, utilmat, K=2, alpha=0.05, beta=0.02, iterations=500):
    '''
    Function that gets recommendation for a new customer.
    
    Arguments:
    - new_ratings   : Dictionary of new customer's items and ratings
    - utilmat       : Utility matrix
    - K             : Number of latent features
    - alpha         : Learning rate 
    - beta          : Regularization parameter
    - iterations    : Number of iterations
    '''
    # Add new customer to utility matrix
    new_customer = pd.DataFrame([new_ratings], columns=utilmat.columns).fillna(0)
    new_utilmat = pd.concat([new_customer, utilmat])
    
    # Define R
    R = new_utilmat.values

    # Initialize the matrix factorization class
    mf = MF(R, K=K, alpha=alpha, beta=beta, iterations=iterations)
    
    # Train model
    start = time.time()
    training_procc = mf.train()
    end = time.time()
    print('Time taken: {} seconds'.format(end - start))
    
    # Make new Dataframe with new customer
    preds = pd.DataFrame(mf.full_matrix(), index=new_utilmat.index, columns=new_utilmat.columns)
    
    print('\nBased on your preferences:\n', new_ratings, '\nWe Suggest:')

    return preds.loc[0].sort_values(ascending=False).head(15)
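
`get_recommendations` ranks every column of the predicted row, so titles the new customer has already rated can in principle show up among the suggestions. One way to exclude them is sketched below; `top_unrated` is a hypothetical helper and the scores are made up.

```python
import pandas as pd

def top_unrated(pred_row, rated_titles, n=15):
    '''
    Return the n highest-scoring items, skipping titles already rated.
    `top_unrated` is a hypothetical helper, not part of the notebook.
    '''
    # errors='ignore' tolerates rated titles that are missing from the row
    unrated = pred_row.drop(labels=list(rated_titles), errors='ignore')
    return unrated.sort_values(ascending=False).head(n)

# Hypothetical predicted scores for one customer
pred_row = pd.Series({'Destiny': 4.8, 'Halo': 4.5, 'Tetris': 3.9, 'Pong': 2.0})
top = top_unrated(pred_row, {'Destiny': 4}, n=2)
print(top)
```

Inside `get_recommendations`, the equivalent call would pass `preds.loc[0]` and `new_ratings`.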
    
    

In [26]:
# Count how many times each product appears in the entire dataset
prod_freq_sample = Counter(sample['product_title'])
sorted(prod_freq_sample.items(), key=lambda x: x[1], reverse=True)

[('The Last of Us', 203),
 ('Grand Theft Auto V', 183),
 ('Grand Theft Auto IV', 180),
 ('Call of Duty: Ghosts', 177),
 ("Assassin's Creed 4", 175),
 ('Watch Dogs', 155),
 ('Call of Duty 4: Modern Warfare', 148),
 ('Elder Scrolls V: Skyrim', 147),
 ('Destiny', 147),
 ('Mass Effect 3', 139),
 ('Dead Space', 139),
 ("Assassin's Creed III", 137),
 ('Mortal Kombat', 135),
 ('BioShock', 134),
 ('Nintendo Amiibo', 134),
 ("Assassin's Creed", 132),
 ('Tomb Raider', 131),
 ('Fallout 3', 130),
 ('Mass Effect 2', 130),
 ('Battlefield 3', 124),
 ('Call of Duty: Modern Warfare 2', 124),
 ('Borderlands', 119),
 ('Red Dead Redemption', 118),
 ('Resident Evil 5', 117),
 ('Dead Space 2', 117),
 ('Resident Evil 6', 116),
 ('Injustice: Gods Among Us', 115),
 ('Max Payne 3', 113),
 ('Star Wars: The Force Unleashed', 112),
 ('Assassins Creed II', 111),
 ('Batman Arkham Origins', 110),
 ('LA Noire', 108),
 ('Batman Arkham City', 107),
 ('Battlefield 4', 107),
 ('God of War III', 106),
 ('BioShock Infinite'
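
The same frequency table can also be produced with pandas alone. A small sketch on a made-up column (`titles` is a hypothetical stand-in for `sample['product_title']`):

```python
import pandas as pd
from collections import Counter

# Hypothetical toy column standing in for sample['product_title']
titles = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'])

counts = titles.value_counts()   # sorted by frequency, descending
by_counter = Counter(titles)     # same counts, unsorted

print(counts)
```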

In [27]:
# Make recommendations
new_ratings = {'Destiny': 4,
               "Assassin's Creed III": 5, 
               'Red Dead Redemption': 5,
               'Battlefield 4': 3}

get_recommendations(new_ratings, ratings_df, K=2, alpha=0.01, beta=0.02, iterations=500)

Time taken: 346.77094888687134 seconds

Based on your preferences:
 {'Destiny': 4, "Assassin's Creed III": 5, 'Red Dead Redemption': 5, 'Battlefield 4': 3} 
We Suggest:


product_title
Transformers Rise of the Dark Spark     9.034580
Velvet Assassin                         8.811642
Unreal II: The Awakening - Xbox         8.734482
Guild Wars Factions                     8.629912
SWAT 3                                  8.529739
Phantasy Star Online, Episode I & II    8.505751
Pikachu Hard Pouch                      8.309053
The Sims Deluxe Edition - PC            8.275162
Shaun White Snowboarding                8.198683
Blade Runner - PC                       8.092065
WWE Wrestlemania X8                     8.038657
Earth & Beyond                          7.983707
Killing Floor - PC                      7.647993
Tom Clancy's Rainbow Six Lockdown       7.610660
Shaun White Skateboarding               7.580981
Name: 0, dtype: float64