# Ratings prediction using SVD

We first try to gain some intuition on SVD using a small toy dataset. Let the following be a user-movie rating dataset where every user has rated every movie. 

In [None]:
import numpy as np
import pandas as pd

np.set_printoptions(formatter={'float': lambda x: "{0:0.2f}".format(x)})

#df = pd.DataFrame([[1,2,8,9,3,3],[2,1,9,8,4,2],[2,2,6,8,2,3],
#                   [9,7,2,3,1,1],[1,1,1,2,8,7],[2,2,3,2,8,8],
#                   [7,9,2,2,2,3],[9,8,2,3,1,3]], 
#                  columns=["horror1","horror2","drama1","drama2","art1","art2"], 
#                  index=["u0","u1","u2","u3","u4","u5","u6","u7"])

df = pd.DataFrame([[1,2,8,9,3,3],[2,1,9,8,4,2],[2,2,6,8,2,3],
                   [9,7,2,3,1,1],[1,1,1,2,8,7],[2,2,3,2,8,8],
                   [7,9,2,2,2,3],[9,8,2,3,1,3],[7,1,1,9,2,8]], 
                  columns=["horror1","horror2","drama1","drama2","art1","art2"], 
                  index=["u0","u1","u2","u3","u4","u5","u6","u7","u8"])


df

Let us extract the data in the form of a matrix and subtract the mean from each row. 

In [None]:
A = df.values
means = np.mean(A,axis=1).reshape((A.shape[0],1))
#print(means)
A = A - means
print(A)

Let us also examine the average rating of each movie, computed on the row-centered data.

In [None]:
print(np.mean(A,axis=0))

## Compute SVD

In [None]:
U, S, VT = np.linalg.svd(A, full_matrices=False)

print("U = \n", U, "\n")
print("S = ", S, "\n")
print("VT = \n", VT, "\n")
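
As a quick sanity check, the three factors should multiply back to the matrix they came from, and the columns of U and the rows of VT should be orthonormal. The sketch below uses a small made-up matrix `M` (not the notebook's `A`) and the illustrative names `U_m`, `S_m`, `VT_m` so that it runs on its own.

```python
import numpy as np

# Hypothetical 4x3 ratings matrix, used only for illustration
M = np.array([[1.0, 2.0, 8.0],
              [2.0, 1.0, 9.0],
              [9.0, 7.0, 2.0],
              [7.0, 9.0, 2.0]])
M = M - M.mean(axis=1, keepdims=True)   # center each row, as above

U_m, S_m, VT_m = np.linalg.svd(M, full_matrices=False)

# The factors reproduce M up to floating-point error
reconstruction_ok = np.allclose(U_m @ np.diag(S_m) @ VT_m, M)

# Columns of U_m and rows of VT_m are orthonormal
u_orthonormal = np.allclose(U_m.T @ U_m, np.eye(len(S_m)))
v_orthonormal = np.allclose(VT_m @ VT_m.T, np.eye(len(S_m)))

# Singular values come back sorted in non-increasing order
sorted_desc = bool(np.all(np.diff(S_m) <= 0))

print(reconstruction_ok, u_orthonormal, v_orthonormal, sorted_desc)
```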

## Dimension Reduction

We project the data onto the first $k$ left singular vectors (optionally scaled by the corresponding singular values). This represents each movie (column of $A$) by only $k$ coordinates, which should still retain the most important information. 

In [None]:
k = 3
Ak = np.diag(S[:k]) @ np.transpose(U[:,:k]) @ A
#Ak = np.transpose(U[:,:k])  @ A
print(Ak)
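
The unscaled projection can also be read directly off the SVD factors: $U_k^T A = S_k V_k^T$ gives the movies in $k$ dimensions, and $A V_k = U_k S_k$ gives the users. The following is a minimal sketch of that identity on a random stand-in matrix `M`; the names `movies_k` and `users_k` are illustrative and do not appear in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 6))          # made-up stand-in for the centered matrix A

U_m, S_m, VT_m = np.linalg.svd(M, full_matrices=False)
k = 3

# Movies (columns) in the k-dimensional space: shape (k, n_movies)
movies_k = U_m[:, :k].T @ M
# Users (rows) in the k-dimensional space: shape (n_users, k)
users_k = M @ VT_m[:k, :].T

# Both projections can be read off the factors directly
same_movies = np.allclose(movies_k, np.diag(S_m[:k]) @ VT_m[:k, :])
same_users = np.allclose(users_k, U_m[:, :k] @ np.diag(S_m[:k]))
print(movies_k.shape, users_k.shape, same_movies, same_users)
```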

# Using some real data (MovieLens)

In [None]:
import numpy as np
import pandas as pd

ratings = pd.read_csv("/Users/debapriyo/Dropbox/data/ml-latest-small/ratings.csv")
ratings

## Separate some data for testing 

We will use this data for our experiments. But first, we need to separate some data so that we can test our methods on that. We create a random partition of the indices. 

In [None]:
# Create the list of indices
idx = np.arange(0,len(ratings))

# Randomly shuffle them
np.random.shuffle(idx)

# Size of test data
testsize = 5000
testidx = idx[0:testsize]
trainidx = idx[testsize:]
print(testidx)
print(trainidx)
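
Because `np.random.shuffle` draws from the global random state, the split above changes on every run. If reproducible experiments are wanted, one option is a seeded `Generator`. The sketch below assumes a hypothetical dataset size `n_ratings` and an arbitrary seed; neither comes from the notebook.

```python
import numpy as np

n_ratings = 20        # hypothetical dataset size, for illustration only
testsize = 5

rng = np.random.default_rng(seed=42)   # the seed value is an arbitrary choice
idx = rng.permutation(n_ratings)       # shuffled copy of 0..n_ratings-1

testidx, trainidx = idx[:testsize], idx[testsize:]

# The two index sets are disjoint and together cover every rating once
disjoint = len(np.intersect1d(testidx, trainidx)) == 0
covers_all = np.array_equal(np.sort(idx), np.arange(n_ratings))
print(len(testidx), len(trainidx), disjoint, covers_all)
```

Re-running this cell with the same seed yields the same partition.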

Now, we partition the dataframe using the two sets of indices. 

In [None]:
# The sample for testing
ratings_test = ratings.iloc[testidx]
ratings_test

In [None]:
# The rest of the data
ratings = ratings.iloc[trainidx]
ratings

## Creating the ratings matrix from the data

Instead of a dense 2-D array, we first create a sparse matrix, with users as the rows and movies (items) as the columns. We will later convert it to a dense matrix, because the steps that follow (filling in the unrated entries and `np.linalg.svd`) operate on dense arrays. 

In [None]:
from scipy.sparse import csr_matrix

users = ratings["userId"].values.astype(int)
movies = ratings["movieId"].values.astype(int)
vals = ratings["rating"].values

#ii = np.array([1,2,1,2,4])
#jj = np.array([1,1,2,3,3])
#vv = np.array([1,1,2,3,2])

#R = csr_matrix((vv, (ii,jj)))
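# Fix the shape from both splits so that every test index is valid in the matrix
n_users = int(max(users.max(), ratings_test["userId"].max()))
n_movies = int(max(movies.max(), ratings_test["movieId"].max()))
RS = csr_matrix((vals,(users-1,movies-1)), shape=(n_users, n_movies))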

print("The data has %d items with %d dimensions (users)." %(RS.shape[1], RS.shape[0]))
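
To see how the `(values, (rows, columns))` constructor lays out the data, here is a small sketch with made-up triplets. The names `users_toy`, `movies_toy` and `vals_toy` are illustrative only. IDs are 1-based, as in the ratings file, so they are shifted by one to become 0-based positions.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up (userId, movieId, rating) triplets with 1-based IDs
users_toy = np.array([1, 1, 2, 3, 3])
movies_toy = np.array([1, 3, 2, 1, 4])
vals_toy = np.array([4.0, 3.5, 5.0, 2.0, 1.0])

# Shift IDs to 0-based positions; state the shape explicitly
R_toy = csr_matrix((vals_toy, (users_toy - 1, movies_toy - 1)), shape=(3, 4))

print(R_toy.shape)      # (3, 4): 3 users, 4 movies
print(R_toy.nnz)        # 5 stored ratings
print(R_toy.toarray())  # unrated positions show up as 0
```

Any movie ID that never occurs simply leaves an all-zero column, which is why the matrix can be much wider than the number of distinct movies.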

In [None]:
# Converting it to a dense matrix
R = RS.todense()

## Fill the zero entries with average ratings 

The zero entries in this matrix are those for which the corresponding user has not rated the movie. In other words, those ratings are not zero but unknown. To start with, we fill each zero entry with the average rating of the corresponding movie (computed over the users who have rated that movie). 

In [None]:
def colmeans(R):
    # Sum of the ratings for each movie
    sums = np.sum(R, axis=0)
    print(sums.shape)

    # Number of non-zero entries for each movie (number of ratings);
    # clamp to 1 so that movies with no ratings do not divide by zero
    nzcounts = np.maximum(np.sum(R.astype(bool), axis=0), 1)
    print(nzcounts.shape)

    # The average ratings for each movie
    means = sums/nzcounts
    print(means.shape)
    
    return means

In [None]:
means = colmeans(R)
print(means.shape)
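
A tiny worked example may help to check what "average over the users who rated the movie" means. This is a sketch on a made-up array `R_toy`, using plain NumPy arrays and an explicit guard for movies without any rating; it is an illustration, not the notebook's function.

```python
import numpy as np

# Made-up ratings: rows are users, columns are movies, 0 means "not rated"
R_toy = np.array([[4.0, 0.0, 3.0, 0.0],
                  [0.0, 5.0, 0.0, 0.0],
                  [2.0, 0.0, 0.0, 0.0]])

sums = R_toy.sum(axis=0)                 # total rating per movie
counts = (R_toy != 0).sum(axis=0)        # number of ratings per movie

# Divide only where a movie has at least one rating; leave 0 elsewhere
movie_means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
print(movie_means)
```

The first movie has ratings 4 and 2, so its mean is 3; the last movie has no ratings and keeps the placeholder 0.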

### Filling the unrated entries with the mean rating of each movie

Instead of running through loops, we will use masking. We create a Boolean matrix zero_R such that an entry of zero_R is True if and only if the corresponding entry of R is zero. 

In [None]:
# The Boolean matrix: True where R has no rating (zero entry), False elsewhere
zero_R = R==0
print(zero_R.shape)
print(zero_R)

Next, we multiply zero_R element-wise with the broadcast mean vector. The result is a matrix in which an entry is zero if the corresponding entry of R is non-zero, and equals the mean rating of the movie if the corresponding entry of R is zero (not rated).

In [None]:
unrated_R = np.multiply(means, zero_R)
print(unrated_R.shape)
print(unrated_R)

Now, we add the unrated part to the original R. 

In [None]:
R1 = R + unrated_R

print(R1)
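
The masking steps above can be traced on a small example. The sketch below uses a made-up matrix `R_toy` and made-up means `movie_means`, and also shows that `np.where` gives the same fill in a single call.

```python
import numpy as np

R_toy = np.array([[4.0, 0.0, 3.0],
                  [0.0, 5.0, 0.0],
                  [2.0, 0.0, 0.0]])
movie_means = np.array([3.0, 5.0, 3.0])   # made-up per-movie means

# Masking, as in the text: the means survive only where R_toy is zero
zero_mask = R_toy == 0
filled_by_mask = R_toy + movie_means * zero_mask

# The same fill expressed with np.where
filled_by_where = np.where(zero_mask, movie_means, R_toy)

print(filled_by_mask)
```

Known ratings are untouched, and every former zero now holds the mean of its column.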

## A naive prediction of ratings by average ratings

The matrix R1 is already a naive prediction of the ratings. Let us test how good that is. First, we convert our test data (ratings which are known but were separated in the beginning) into a sparse matrix.

In [None]:
users_test = ratings_test["userId"].values.astype(int)
movies_test = ratings_test["movieId"].values.astype(int)
vals_test = ratings_test["rating"].values

We write a function to compute the Root Mean Squared Error (RMSE) of any matrix with the same dimensions as R against the test data.

In [None]:
def RMSE_ratings(R_pred, users_test, movies_test, vals_test):
    R_pred_selected = R_pred[users_test-1,movies_test-1]
    error = R_pred_selected - vals_test
    #print(error)
    return np.sqrt(np.mean(np.multiply(error,error)))
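
The key step in this function is the paired (fancy) indexing, which picks one predicted value per test rating. Here is a hand-checkable sketch with a made-up 3x3 prediction matrix and three held-out ratings; all names ending in `_toy` or `_t` are illustrative.

```python
import numpy as np

# Made-up 3x3 prediction matrix and three held-out ratings (1-based IDs)
R_pred_toy = np.array([[4.0, 5.0, 3.0],
                       [3.0, 5.0, 3.0],
                       [2.0, 5.0, 3.0]])
users_t = np.array([1, 2, 3])
movies_t = np.array([2, 1, 3])
vals_t = np.array([4.0, 3.0, 5.0])

# Fancy indexing pairs the i-th user with the i-th movie: one value per test rating
picked = R_pred_toy[users_t - 1, movies_t - 1]      # [5.0, 3.0, 3.0]
errors = picked - vals_t                             # [1.0, 0.0, -2.0]
rmse_toy = np.sqrt(np.mean(errors ** 2))             # sqrt(5/3)
print(rmse_toy)
```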

And we test the RMSE with the matrix R1.

In [None]:
print(RMSE_ratings(R1, users_test, movies_test, vals_test))

## Back to SVD: Center the data (R1)

We center the data by making the mean of every row (user) zero. 

In [None]:
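def centerrows(R1):
    # Center the rows (users): subtract each user's mean rating
    # reshape(-1, 1) keeps the means as a column so they broadcast across each row
    means = np.mean(R1, axis=1).reshape(-1, 1)
    #print(means)
    print(R1.shape)
    print(means.shape)
    return R1 - means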

In [None]:
R2 = centerrows(R1)
print(R2)
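
The function above subtracts one mean per row, which relies on the means having a column shape so that they broadcast across each row. A minimal sketch on a made-up plain NumPy array, where `keepdims=True` provides that shape:

```python
import numpy as np

R_toy = np.array([[4.0, 5.0, 3.0],
                  [3.0, 5.0, 1.0]])

row_means = R_toy.mean(axis=1, keepdims=True)   # shape (2, 1), one mean per row
centered = R_toy - row_means                     # broadcasts across the columns

print(centered.mean(axis=1))                     # each row now averages to 0

# Adding the same means back undoes the centering
restored = centered + row_means
```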

## SVD 

Let us compute the SVD of the matrix now. 

In [None]:
U, S, VT = np.linalg.svd(R2, full_matrices=False)

print(U.shape, S.shape, VT.shape)

## The low-rank approximation

The low rank approximation is defined by $R_k = U_k S_k V_k^T$. The dimensions will remain the same as before. 

In [None]:
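def lowrank(U,S,VT,k):
    # Rank-k approximation built from the top k singular triplets

    # First k left singular vectors (columns of U)
    Uk = U[:,:k]
    #print(Uk.shape)

    # Top k singular values on the diagonal
    Sk = np.diag(S[:k])
    #print(Sk.shape)

    # First k right singular vectors (rows of VT)
    VkT = VT[:k,:]
    #print(VkT.shape)

    return Uk @ Sk @ VkT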
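
Two properties of the truncated product are worth checking on a small example: the result has the same shape as the input but rank at most $k$, and its Frobenius error equals the root of the sum of the squared discarded singular values (the Eckart-Young result). This is a sketch on a random stand-in matrix `M`, not on the ratings data.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(10, 7))                 # made-up matrix for illustration
U_m, S_m, VT_m = np.linalg.svd(M, full_matrices=False)

k = 3
M_k = U_m[:, :k] @ np.diag(S_m[:k]) @ VT_m[:k, :]

# Same shape as M, but rank at most k
rank_k = np.linalg.matrix_rank(M_k)

# Frobenius error equals the root of the sum of squared discarded singular values
frob_error = np.linalg.norm(M - M_k, "fro")
expected_error = np.sqrt(np.sum(S_m[k:] ** 2))
print(M_k.shape, rank_k, np.isclose(frob_error, expected_error))
```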

## Adding the row means back

Now we need to add the row means, which were subtracted during centering, back to the low-rank approximation.

In [None]:
k_range = np.array([2,5,10,20,30,50,75,100])
rmse = np.zeros(len(k_range))

for i, k in enumerate(k_range):
    # R1 - R2 holds the row means removed by centerrows; add them back
    Rk1 = lowrank(U,S,VT,k) + R1 - R2
    # print(Rk1)
    rmse[i] = RMSE_ratings(Rk1, users_test, movies_test, vals_test)
    print("k = ", k, ", RMSE: ", rmse[i])