<a href="https://colab.research.google.com/github/vlochub/MIT-Xpro-colab/blob/main/SoftImpute_for_Movielens.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The MovieLens Dataset

[MovieLens](https://movielens.org/) is a non-commercial web-based movie recommender system, created in 1997 by GroupLens, a research lab at the University of Minnesota, in order to gather movie rating data for research purposes.


## Getting the Data


The MovieLens dataset is hosted on the [GroupLens](https://grouplens.org/datasets/movielens/) website. Several versions are available. We will use the latest small version, [ml-latest-small.zip](https://files.grouplens.org/datasets/movielens/ml-latest-small.zip).
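
For readers who want to fetch the file programmatically, the following is a minimal sketch that downloads the archive and extracts `ratings.csv`. The function name `fetch_ratings_csv` and the destination folder `movielens` are assumptions made for illustration; the notebook itself reads `ratings.csv` from a Google Drive folder instead.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the text above; the destination folder name is an assumption.
ML_SMALL_URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"


def fetch_ratings_csv(url=ML_SMALL_URL, dest_dir="movielens"):
    """Download the archive at `url` and extract ratings.csv into `dest_dir`.

    Returns the path of the extracted ratings.csv file.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "ml-latest-small.zip"
    # Download the zip archive to disk
    urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        # Find the ratings file wherever it sits inside the archive
        member = next(n for n in archive.namelist() if n.endswith("ratings.csv"))
        # Write it directly as dest_dir/ratings.csv, dropping any inner folders
        out_path = dest / "ratings.csv"
        out_path.write_bytes(archive.read(member))
    return out_path
```

Calling `fetch_ratings_csv()` returns a path that can be passed to `pd.read_csv`. Nothing is downloaded until the function is called.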

## Custom Code

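The custom modules `soft_impute` and `functionsCF` must be available to the notebook (here they are loaded from a Google Drive folder, see below). The standard packages are installed with pip.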

In [None]:
# Install the standard packages
!pip install numpy
!pip install pandas
!pip install fancyimpute

Collecting fancyimpute
  Downloading fancyimpute-0.7.0.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting knnimpute>=0.1.0 (from fancyimpute)
  Downloading knnimpute-0.1.0.tar.gz (8.3 kB)
  Preparing metadata (setup.py) ... done
Collecting nose (from fancyimpute)
  Downloading nose-1.3.7-py3-none-any.whl.metadata (1.7 kB)
Downloading nose-1.3.7-py3-none-any.whl (154 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.7/154.7 kB 4.8 MB/s eta 0:00:00
Building wheels for collected packages: fancyimpute, knnimpute
  Building wheel for fancyimpute (setup.py) ... done
  Created wheel for fancyimpute: filename=fancyimpute-0.7.0-py3-none-any.whl size=29879 sha256=f3c1cca4b67ecc8120416d3dc6bf20c72a13a2c46411a801390561ce8f58bf65
  Stored in directory: /root/.cache/pip/wheels/1a/f3/a1/f7f10b5ae2c2459398762a3fcf4ac18c325311c7e3163d5a15
  Building wheel for knnimpute (setup.py) ... done
  C

Connecting Google Colab to Google Drive. See "External data: Local Files, Drive, Sheets, and Cloud Storage":
https://colab.research.google.com/notebooks/io.ipynb

In [None]:
# mount drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
# location of custom packages: soft_impute , functionsCF, and dataset ratings.csv
# CollaborativeFiltering folder in google drive
import sys
sys.path.append('/content/drive/My Drive/Colab Notebooks/CollaborativeFiltering/')

In [None]:
# change the working directory
import os
os.chdir("/content/drive/My Drive/Colab Notebooks/CollaborativeFiltering/")

In [None]:
# Import the necessary packages
import numpy as np
import pandas as pd
from fancyimpute import BiScaler
from soft_impute import SoftImpute
from functionsCF import GenerateTrainingSet

## Create the incomplete matrices for training and testing

In [None]:
# Read the MovieLens data from ratings.csv (the small MovieLens dataset) in the working directory.
# The file has 100836 rows and the columns userId, movieId, rating, timestamp.
# We use the small dataset rather than the full one to keep the run time short.
# Your results may vary depending on which MovieLens dataset you use; several are available online.
# Read in the values only (a NumPy array without the column names)
rating = pd.read_csv('ratings.csv', sep=',').values

In [None]:
# Here we only care about the ratings, so we keep only the first three columns, which contain user IDs, movie IDs, and ratings.
rating = rating[:,0:3]

In [None]:
# Show the first 5 rows
print(rating[:5, :])

[[ 1.  1.  4.]
 [ 1.  3.  4.]
 [ 1.  6.  4.]
 [ 1. 47.  5.]
 [ 1. 50.  5.]]


In [None]:
# Use all known information to create the incomplete matrix

# First, renumber the movies. Some movie IDs never appear in the ratings, so we keep only the
# movies that have at least one rating and renumber them 1..n_movies so that every column
# contains ratings.
# usedID is the sorted array of all movie IDs that appear in the data
usedID = np.unique(rating[:, 1])
# replace each movie ID by its (1-based) position in usedID
rating[:, 1] = np.searchsorted(usedID, rating[:, 1]) + 1

# Second, create the users x movies matrix with every entry set to nan ("no rating").
# Using userId - 1 as the row index assumes the user IDs run from 1 to n_users without gaps.
matrix_incomplete = np.full((len(np.unique(rating[:, 0])), len(usedID)), np.nan)

# Finally, fill in the known ratings.
# create the (row, column) index pair of the entries with ratings
indices = (rating[:, 0] - 1).astype(int), (rating[:, 1] - 1).astype(int)
# change the values in the corresponding positions to the known ratings
matrix_incomplete[indices] = rating[:, 2]
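
To make the renumbering step concrete, here is a small self-contained sketch on made-up ratings. The array `toy` and its values are invented for illustration; only NumPy is used.

```python
import numpy as np

# Toy ratings: columns are user ID, movie ID, rating (values are made up for illustration).
toy = np.array([
    [1, 10, 4.0],
    [1, 50, 5.0],
    [2, 10, 3.0],
    [3, 70, 2.0],
])

# Sorted unique movie IDs, plus each row's 0-based position within that array
movie_ids, movie_cols = np.unique(toy[:, 1], return_inverse=True)
user_rows = (toy[:, 0] - 1).astype(int)   # user IDs 1..3 become rows 0..2

# Start from an all-nan matrix and fill in only the known ratings
toy_matrix = np.full((3, len(movie_ids)), np.nan)
toy_matrix[user_rows, movie_cols] = toy[:, 2]
print(toy_matrix)
```

Movie IDs 10, 50, and 70 become columns 0, 1, and 2, so the matrix has no empty columns even though the original IDs have gaps.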

In [None]:
# Obtain the index pairs of the training set and the validation set, with a 90% training ratio
train_indices, validation_indices = GenerateTrainingSet(rating[:,0], rating[:,1], 0.90)
# Then use the index pairs to create the incomplete training set
matrix_train = matrix_incomplete.copy()
matrix_train[:] = np.nan
matrix_train[train_indices] = matrix_incomplete[train_indices]
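
The source of `GenerateTrainingSet` is not shown in this notebook. As a rough idea of what such a split involves, here is a hypothetical stand-in, `split_rating_indices`, that randomly assigns rating positions to the two sets. The name, the `seed` parameter, and the purely random strategy are all assumptions; the real function may behave differently.

```python
import numpy as np


def split_rating_indices(user_ids, movie_ids, train_fraction, seed=0):
    """Hypothetical stand-in for GenerateTrainingSet: random split of rating positions.

    user_ids and movie_ids are 1-based, as in the rating array above.
    Returns two (row_indices, column_indices) tuples usable for NumPy indexing.
    """
    rows = np.asarray(user_ids).astype(int) - 1
    cols = np.asarray(movie_ids).astype(int) - 1
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(rows))          # shuffle the rating positions
    n_train = int(round(train_fraction * len(rows)))
    train, valid = order[:n_train], order[n_train:]
    return (rows[train], cols[train]), (rows[valid], cols[valid])
```

A purely random split like this one can leave a user or a movie with no ratings in the training set, which is one reason an actual implementation might do something more careful.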

##  Run the softImpute model for collaborative filtering

In [None]:
# Create the BiScaler model
biscaler = BiScaler(scale_rows=False, scale_columns=False, max_iters=50, verbose=False)
# Center both rows and columns to have zero mean (scaling is switched off above)
matrix_train_normalized = biscaler.fit_transform(matrix_train)
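
To illustrate what centering rows and columns of an incomplete matrix means, here is a NumPy-only sketch that alternately removes row means and column means while ignoring nan entries. It is written in the spirit of the step above and is not fancyimpute's implementation; the function name `center_rows_and_columns` and the toy matrix are invented for illustration.

```python
import numpy as np


def center_rows_and_columns(X, n_iters=50):
    """Alternately remove row and column means, ignoring nan entries.

    Returns the centered matrix plus the accumulated row and column offsets, so that
    X == centered + row_offsets[:, None] + col_offsets[None, :] on observed entries.
    Assumes every row and every column has at least one observed entry.
    """
    centered = X.astype(float).copy()
    row_offsets = np.zeros(X.shape[0])
    col_offsets = np.zeros(X.shape[1])
    for _ in range(n_iters):
        r = np.nanmean(centered, axis=1)   # mean of observed entries per row
        centered -= r[:, None]
        row_offsets += r
        c = np.nanmean(centered, axis=0)   # mean of observed entries per column
        centered -= c[None, :]
        col_offsets += c
    return centered, row_offsets, col_offsets


toy = np.array([[5.0, 4.0, np.nan],
                [4.0, np.nan, 2.0],
                [np.nan, 3.0, 1.0]])
centered, row_offsets, col_offsets = center_rows_and_columns(toy)
```

Keeping the offsets is what makes the inverse transformation possible: adding them back restores the original scale, which is the role `biscaler.inverse_transform` plays later in the notebook.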

In [None]:
# Use softImpute to complete the matrix. J is the number of archetypes, random_seed is the seed
# for the internal random number generator, and verbose controls whether the algorithm prints its logs.
softImpute = SoftImpute(J = 4, maxit = 200, random_seed = 1, verbose = False)

In [None]:
# Run the softImpute model on the normalized training set
matrix_train_softImpute = softImpute.fit(matrix_train_normalized)
# Use the softImpute model to create the predicted matrix. If copyto is set to True, it
# directly changes the values of matrix_train_normalized
matrix_train_filled_normalized = matrix_train_softImpute.predict(matrix_train_normalized, copyto = False)
# Inverse transformation to undo the scaling
matrix_train_filled = biscaler.inverse_transform(matrix_train_filled_normalized)
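
The custom `soft_impute` module is not shown here, so the following is only a sketch of the general idea behind low-rank matrix completion: repeatedly fill the missing entries with the current estimate, take an SVD, shrink the singular values, and keep a few components. The function `soft_impute_sketch` and its parameters `rank`, `shrinkage`, and `n_iters` are assumptions chosen for illustration and may not match the module's algorithm or defaults.

```python
import numpy as np


def soft_impute_sketch(X, rank, shrinkage=0.0, n_iters=200):
    """Fill the nan entries of X with a low-rank approximation.

    Each iteration: fill missing entries with the current estimate, take an SVD,
    shrink the singular values by `shrinkage`, and keep the top `rank` components.
    """
    observed = ~np.isnan(X)
    estimate = np.zeros(X.shape)
    for _ in range(n_iters):
        # Observed entries come from X, missing entries from the current estimate
        filled = np.where(observed, X, estimate)
        u, d, vt = np.linalg.svd(filled, full_matrices=False)
        d = np.maximum(d[:rank] - shrinkage, 0.0)   # soft-threshold the kept singular values
        estimate = (u[:, :rank] * d) @ vt[:rank, :]
    return estimate


# Made-up rank-1 matrix with two entries hidden
truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
toy = truth.copy()
toy[0, 2] = np.nan
toy[3, 0] = np.nan
completed = soft_impute_sketch(toy, rank=1)
```

Here `rank` plays a role similar to `J` and `n_iters` a role similar to `maxit` in the call above, but that correspondence is an interpretation, not something the notebook states.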

## Analysis of the predicted ratings

### Out-of-sample R^2

In [None]:
# Create the baseline: predict the average training rating for every entry
train_average = np.average(matrix_train[train_indices])

In [None]:
# Calculate out-of-sample R2 and in-sample R2
# Your results may vary from the lesson due to the dataset size and the training/validation split.
validation_mse = ((matrix_train_filled[validation_indices] - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse = ((matrix_train_filled[train_indices] - matrix_incomplete[train_indices]) ** 2).mean()
validation_mse_baseline = ((train_average - matrix_incomplete[validation_indices]) ** 2).mean()
training_mse_baseline = ((train_average - matrix_incomplete[train_indices]) ** 2).mean()
print("out-of-sample R2: %.4f, in-sample R2: %.4f." % (1 - validation_mse / validation_mse_baseline, 1 - training_mse / training_mse_baseline))

out-of-sample R2: 0.1968, in-sample R2: 0.6343.
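
The R2 used above compares the model's mean squared error with that of a constant baseline. The following sketch isolates that formula on made-up numbers; the helper name `r_squared` and the sample values are invented for illustration.

```python
import numpy as np


def r_squared(predicted, actual, baseline):
    """1 - MSE(model) / MSE(baseline), where baseline is a constant prediction."""
    model_mse = np.mean((predicted - actual) ** 2)
    baseline_mse = np.mean((baseline - actual) ** 2)
    return 1 - model_mse / baseline_mse


actual = np.array([4.0, 2.0, 5.0, 3.0])      # made-up held-out ratings
predicted = np.array([3.5, 2.5, 4.5, 3.5])   # made-up model predictions
toy_r2 = r_squared(predicted, actual, baseline=3.5)
print(round(toy_r2, 4))  # model MSE 0.25, baseline MSE 1.25, so R2 is 0.8
```

Because the baseline is the training average rather than the validation average, the out-of-sample value can be negative when the model predicts worse than that constant.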


### Get low-rank factors

In [None]:
# Obtain the ratings of each archetype
# Each row of this matrix corresponds to a movie and each column corresponds to an archetype
softImpute.v

array([[-0.00120371, -0.0126243 , -0.01569624, -0.00489712],
       [-0.00588053, -0.00410622, -0.00701788, -0.00271441],
       [-0.00560296,  0.00767393, -0.00861866,  0.01191865],
       ...,
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        ,  0.        ]])

In [None]:
softImpute.v.shape

(9724, 4)

In [None]:
# (Optional)
# Obtain the weights of archetypes of each user
# each row of this matrix corresponds to a user and each column corresponds to an archetype
weights = np.dot(softImpute.u, np.diagflat(softImpute.d).T)
weights

array([[ -9.55732307,  -8.14603301,   8.80230765,   4.76432235],
       [-14.24049083,  -6.06898543,  -0.34951558,   5.47545152],
       [-60.13264768, -19.48473727,  57.0010829 ,  36.11188168],
       ...,
       [  0.7833923 ,  52.01879593,  25.71779472, -12.0670246 ],
       [ -6.24094062,  -0.68394357,   4.12038907,  -2.18463982],
       [  7.4236412 , -22.0291323 ,  -0.41552704,  13.79213509]])

In [None]:
weights.shape

(610, 4)

In [None]:
# The predicted matrix is then the product of the two low-rank factors
new_prediction = np.dot(weights, softImpute.v.T)

In [None]:
# We can see that it matches the output of the code in the previous section
# (the total absolute difference is numerically zero)
np.sum(np.abs(new_prediction - matrix_train_filled_normalized))

np.float64(7.909104785166668e-11)
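
The factorization used above can be checked on small made-up factors. In this sketch `u`, `d`, and `v` are random stand-ins with the same roles as `softImpute.u`, `softImpute.d`, and `softImpute.v`; their sizes and values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical small factors with the same roles as softImpute.u, softImpute.d, softImpute.v
u = rng.normal(size=(5, 2))      # users x archetypes
d = np.array([3.0, 1.5])         # one weight per archetype
v = rng.normal(size=(7, 2))      # movies x archetypes

weights_via_diag = u @ np.diagflat(d)   # same construction as in the notebook cell above
weights_via_broadcast = u * d           # scales column j of u by d[j]
prediction = weights_via_broadcast @ v.T

# A single predicted entry is the dot product of a user's weights and a movie's archetype ratings
single_entry = weights_via_broadcast[2] @ v[4]
```

Both ways of building the weights give the same matrix, and each predicted rating is a weighted sum of archetype ratings, which is what makes the factors interpretable.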

end of the note