<a href="https://colab.research.google.com/github/yckamra/MovieLens25m-Recommender/blob/main/MovieLens_Recommender_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
print("Installing project dependencies...")
!pip install numpy pandas scipy scikit-learn # Basic libraries we want
!pip install implicit # Installs implicit library for matrix factorization
!pip install torch
print("Dependencies installed successfully.")

Installing project dependencies...
Dependencies installed successfully.


In [19]:
from google.colab import auth
import zipfile
import os
import pandas as pd
import numpy as np
import implicit
import torch
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import coo_matrix, csr_matrix # Compressed Sparse Row

In [3]:
auth.authenticate_user()

!gcloud config set project movielens-recommender-451017

!gsutil cp gs://movielens-data/movielens_data.zip /content/

Updated property [core/project].
Copying gs://movielens-data/movielens_data.zip...
==> NOTE: You are downloading one or more large file(s), which would
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

/ [1 files][257.5 MiB/257.5 MiB]   23.2 MiB/s                                   
Operation completed over 1 objects/257.5 MiB.                                    


Dataset can be found at: https://www.kaggle.com/datasets/garymk/movielens-25m-dataset

In [4]:
!ls -lh /content/

total 258M
drwxr-xr-x 3 root root 4.0K May 24 17:59 data
-rw-r--r-- 1 root root 258M May 24 19:30 movielens_data.zip
drwxr-xr-x 1 root root 4.0K May 14 13:38 sample_data


In [5]:
zip_path = "/content/movielens_data.zip"  # Change to your actual zip file name
extract_to = "/content/data/"  # Where to extract files

# Create directory if it doesn't exist
os.makedirs(extract_to, exist_ok=True)

# Unzip the file
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_to)

print("Extraction complete! Files are in:", extract_to)
!ls -lh /content/data/ml-25m

Extraction complete! Files are in: /content/data/
total 1.1G
-rw-r--r-- 1 root root 416M May 24 19:30 genome-scores.csv
-rw-r--r-- 1 root root  18K May 24 19:30 genome-tags.csv
-rw-r--r-- 1 root root 1.4M May 24 19:30 links.csv
-rw-r--r-- 1 root root 2.9M May 24 19:30 movies.csv
-rw-r--r-- 1 root root 647M May 24 19:30 ratings.csv
-rw-r--r-- 1 root root  11K May 24 19:30 README.txt
-rw-r--r-- 1 root root  38M May 24 19:30 tags.csv


In [6]:
genome_scores_CSV = "/content/data/ml-25m/genome-scores.csv"
genome_tags_CSV = "/content/data/ml-25m/genome-tags.csv"
links_CSV = "/content/data/ml-25m/links.csv"
movies_CSV = "/content/data/ml-25m/movies.csv"
ratings_CSV = "/content/data/ml-25m/ratings.csv"
tags_CSV = "/content/data/ml-25m/tags.csv"

ratings.csv provides the user-movie ratings needed for the collaborative-filtering recommender.

In [7]:
# Load ratings data
ratings_path = ratings_CSV
ratings_df = pd.read_csv(ratings_path)

print(ratings_df.head())

# Sort by user
ratings_df.sort_values('userId', inplace=True, ignore_index=True)

# Count unique users and movies
num_users = ratings_df["userId"].nunique()
num_movies = ratings_df["movieId"].nunique()

print(f"Number of Unique Users: {num_users}")
print(f"Number of Unique Movies: {num_movies}")

   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510
Number of Unique Users: 162541
Number of Unique Movies: 59047


We will use collaborative filtering for recommendations. We build a 2D user-by-movie ratings matrix, with users as rows (y-axis) and movies as columns (x-axis). For a new user, we ask for ratings on a baseline number of movies to get a starting vector, then use cosine similarity to find the existing users most similar to them and recommend movies from those users. The more ratings a user provides, the more accurate the recommendations should become.
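
As a rough illustration of the idea, the sketch below compares a hypothetical new user against three existing users on a toy 3 x 4 ratings matrix. The data and the helper name `cosine_to_all` are made up for this example; it computes the same quantity as a cosine similarity routine, written out with NumPy so the arithmetic is visible.

```python
import numpy as np

# Toy ratings: 3 existing users (rows) x 4 movies (columns); 0 means "not rated".
existing = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
    [4.0, 5.0, 1.0, 0.0],
])

# Hypothetical new user who has rated only the first two movies.
new_user = np.array([5.0, 5.0, 0.0, 0.0])

def cosine_to_all(vector, matrix):
    """Cosine similarity between one vector and every row of a matrix."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    norms[norms == 0] = 1.0  # avoid dividing by zero for users with no ratings
    return matrix @ vector / norms

similarities = cosine_to_all(new_user, existing)
most_similar = int(np.argmax(similarities))  # row index of the closest existing user
print(similarities, most_similar)
```

Movies that the closest users rated highly, and that the new user has not seen, are then natural candidates to recommend.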

In [8]:
#numpy_matrix = np.zeros((num_users, num_movies))
#print(numpy_matrix)
#print(numpy_matrix.shape)

In order to map IDs of movies and users to the 2D matrix for collaborative filtering, we will use dictionaries to create key-value pairs between ID and index within the matrix.

In [9]:
user_id_map = {}
movie_id_map = {}
user_index = 0
movie_index = 0

# Map user and movie IDs to contiguous 2D matrix indices
for user in ratings_df["userId"]: # Index the sparse matrix with these mapped values, not raw userId, since raw IDs may have gaps if users were removed from the dataset
  if user not in user_id_map:
    user_id_map[user] = user_index
    user_index += 1

for movie in ratings_df["movieId"]:
  if movie not in movie_id_map:
    movie_id_map[movie] = movie_index
    movie_index += 1
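
A possible vectorized alternative to the loops above is `pd.factorize`, which assigns codes in order of first appearance and so should give the same mapping. The toy DataFrame and the `toy_*` names are assumptions made for this sketch only.

```python
import pandas as pd

# Toy ratings table with non-contiguous IDs (hypothetical values).
toy = pd.DataFrame({
    "userId": [10, 10, 42, 42, 7],
    "movieId": [296, 306, 296, 899, 306],
})

# factorize returns (codes, uniques): codes are 0-based indices in order of first appearance.
user_codes, user_ids = pd.factorize(toy["userId"])
movie_codes, movie_ids = pd.factorize(toy["movieId"])

# Rebuild ID -> index dictionaries equivalent to user_id_map / movie_id_map.
toy_user_id_map = {int(uid): idx for idx, uid in enumerate(user_ids)}
toy_movie_id_map = {int(mid): idx for idx, mid in enumerate(movie_ids)}

print(toy_user_id_map)
print(toy_movie_id_map)
```

The `user_codes` and `movie_codes` arrays can be used directly as row and column indices when building a sparse matrix.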

Notice that the cell below populates a dense matrix. Only ~25 million of the ~9.6 billion cells in our 162,541 x 59,047 matrix hold ratings; the rest are zeros, so a dense matrix suffers from severe sparsity and wasted memory. The dense version is therefore left commented out and kept only for reference. Going forward we will use a Compressed Sparse Row (CSR) matrix, which gives shorter training times and a much smaller memory footprint. We will use it to learn the user and movie latent feature matrices and to compute cosine similarities.
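
A back-of-the-envelope estimate makes the gap concrete. The sketch assumes float64 values and 32-bit CSR index arrays (the actual index dtype may differ), and takes the rating count from the final `indptr` entry printed later in this notebook.

```python
num_users = 162_541
num_movies = 59_047
num_ratings = 25_000_095  # total stored ratings

total_cells = num_users * num_movies
density = num_ratings / total_cells

# Dense: one float64 (8 bytes) per cell.
dense_bytes = total_cells * 8

# CSR: one float64 value + one int32 column index per rating,
# plus one int32 row pointer per user (and one extra at the end).
csr_bytes = num_ratings * (8 + 4) + (num_users + 1) * 4

print(f"Cells: {total_cells:,}  density: {density:.4%}")
print(f"Dense: {dense_bytes / 1e9:.1f} GB  CSR: {csr_bytes / 1e9:.2f} GB")
```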

In [10]:
#from IPython.display import clear_output

#iter = 0
#percent_done = 0
#total_rows = len(ratings_df)
# Iterate over the rows of the DataFrame
#for index, row in ratings_df.iterrows():
    #user_index = user_id_map[row["userId"]]
    #movie_index = movie_id_map[row["movieId"]]
    #numpy_matrix[user_index, movie_index] = row["rating"]
    #if iter % int(total_rows / 100) == 0:
      #clear_output(wait=True)
      #print(f"Percent of matrix populated with ratings: {percent_done}%")
      #percent_done += 1
    #iter += 1

In [11]:
#ratings_dataframe = pd.DataFrame(numpy_matrix)
#print(ratings_dataframe.head())

In [12]:
# Comparing the dataframe ratings to the original Kaggle dataset
#print(f"Original Rating: 4.0 Found Rating: {ratings_dataframe.loc[user_id_map[1], movie_id_map[8786]]}")
#print(f"Original Rating: 2.0 Found Rating: {ratings_dataframe.loc[user_id_map[2], movie_id_map[480]]}")
#print(f"Original Rating: 3.5 Found Rating: {ratings_dataframe.loc[user_id_map[3], movie_id_map[1270]]}")
#print(ratings_dataframe.shape)

Here I build the arrays of a sparse (CSR) matrix by hand. We will rely on SciPy for the actual conversion, since it is the more reliable route, but the hand-written version gives us another option and removes the black box.

In [13]:
def calculate_data_array(df):  # Numpy array of ratings
  ratings_df = df
  data = ratings_df["rating"].to_numpy()
  print(data)
  return data

def calculate_movie_indices(df):   # Numpy array of movie indices
  ratings_df = df
  # Map through pandas so the DataFrame's movieId column is not overwritten in place
  indices = ratings_df["movieId"].map(movie_id_map).to_numpy()
  print(indices)
  return indices


def calculate_indptr(df):   # Numpy array of pointers to indices
  ratings_df = df
  num_users = ratings_df["userId"].nunique()
  indptr = np.zeros(num_users + 1, dtype=int)
  last_user = ratings_df["userId"].iloc[0]
  index_in_indptr = 1
  user_ids = ratings_df["userId"].to_numpy()  # Read-only; assumes rows are sorted by userId

  for i in range(len(user_ids)):
    current_user = user_ids[i]
    if current_user != last_user:
      indptr[index_in_indptr] = i  # The next user's ratings start at position i
      index_in_indptr += 1
      last_user = current_user

  indptr[index_in_indptr] = len(user_ids) # Fills in the last entry (total number of ratings)

  print(indptr)
  return indptr

#data = {'userId': [0, 0, 1, 2, 2, 2, 3]} # For testing
#df = pd.DataFrame(data) # For testing
#calculate_indptr(df) # For testing
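
To check hand-built CSR arrays against SciPy, one option is to feed them to `csr_matrix((data, indices, indptr))` and compare with the matrix built from coordinates. The toy arrays below are assumptions for illustration, and `np.bincount` plus `np.cumsum` is used here as a compact stand-in for the loop in `calculate_indptr`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy ratings already sorted by user row: parallel arrays of (row, column, rating).
rows = np.array([0, 0, 1, 2, 2, 2])
cols = np.array([0, 2, 1, 0, 1, 3])
vals = np.array([5.0, 3.5, 4.0, 2.0, 3.0, 4.5])
num_rows, num_cols = 3, 4

# indptr[i] is where row i starts in vals/cols; the last entry is the total count.
counts = np.bincount(rows, minlength=num_rows)
indptr = np.concatenate(([0], np.cumsum(counts)))

# Build once from the hand-made CSR arrays and once from coordinates.
manual = csr_matrix((vals, cols, indptr), shape=(num_rows, num_cols))
reference = csr_matrix((vals, (rows, cols)), shape=(num_rows, num_cols))

print(indptr)
print((manual.toarray() == reference.toarray()).all())
```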

In [14]:
# Apply the maps to the DataFrame columns to get the 0-indexed row and column arrays
row_indices = ratings_df['userId'].map(user_id_map).to_numpy()
col_indices = ratings_df['movieId'].map(movie_id_map).to_numpy()
data_values = ratings_df['rating'].to_numpy() # The actual ratings

# Create the COO (Coordinate) Matrix
user_item_matrix_coo = coo_matrix(
    (data_values, (row_indices, col_indices)),
    shape=(num_users, num_movies)
)

# Convert to CSR (Compressed Sparse Row) for efficient operations. Gets the `indptr` array and sorts `indices`
user_item_matrix_csr = user_item_matrix_coo.tocsr()

Notice below that the indices array differs from the one our function produces. The entries are ordered by user index first, and within each user's range (the slice belonging to that user) the movie indices are sorted as well.

In [15]:
print("Converted to CSR Matrix.")
print("CSR Shape:", user_item_matrix_csr.shape)
print("CSR data:", user_item_matrix_csr.data)
print("CSR indices (column indices):", user_item_matrix_csr.indices)
print("CSR indptr (pointers to row starts):", user_item_matrix_csr.indptr)

Converted to CSR Matrix.
CSR Shape: (162541, 59047)
CSR data: [5.  3.5 4.  ... 2.  3.  2.5]
CSR indices (column indices): [    0     1     2 ... 12315 12777 14125]
CSR indptr (pointers to row starts): [       0       70      254 ... 24999825 24999913 25000095]
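
The ordering difference noted above can be reproduced on a toy matrix. In this sketch (made-up data), one user's ratings arrive with their column indices out of order, and the COO to CSR conversion is expected to store them sorted within the row.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Row 0 has three ratings whose column indices are deliberately unsorted.
rows = np.array([0, 0, 0, 1])
cols = np.array([3, 0, 2, 1])
vals = np.array([4.0, 5.0, 3.0, 2.5])

csr = coo_matrix((vals, (rows, cols)), shape=(2, 4)).tocsr()

print("indices:", csr.indices)  # column indices, grouped by row
print("data:   ", csr.data)     # ratings, reordered to match the indices
print("indptr: ", csr.indptr)
print("sorted within rows:", csr.has_sorted_indices)
```

The ratings in `data` move together with their column indices, so the matrix itself is unchanged; only the storage order differs from the input order.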


In [16]:
def get_cosine_similarity_userX_to_allUsers(userId, user_id_map, user_item_matrix_csr):
  user_index = user_id_map[userId]
  target_user_vector = user_item_matrix_csr[user_index : user_index + 1]
  return cosine_similarity(target_user_vector, user_item_matrix_csr)

In [17]:
print(get_cosine_similarity_userX_to_allUsers(1, user_id_map, user_item_matrix_csr))

[[1.         0.04086293 0.06130627 ... 0.01936879 0.04149169 0.07084903]]
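
A similarity row like the one above can be reduced to a short list of nearest neighbours. The sketch below uses a small made-up row of shape (1, num_users); `target_index` and `k` are hypothetical names for this example.

```python
import numpy as np

# Hypothetical similarity row, shaped like the output above: (1, num_users).
similarities = np.array([[1.0, 0.04, 0.61, 0.02, 0.35]])
target_index = 0  # matrix row of the user we compared against everyone
k = 2

scores = similarities.ravel().copy()
scores[target_index] = -np.inf          # exclude the user themself
top_k = np.argsort(scores)[::-1][:k]    # indices of the k most similar users

print(top_k, scores[top_k])
```

The returned values are matrix row indices, so they need to be translated back through the inverse of `user_id_map` to recover the original user IDs.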


Our model for matrix factorization will be Alternating Least Squares (ALS).
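
To show what "alternating" means, here is a minimal NumPy sketch of ALS on explicit ratings: user factors are solved by ridge regression while item factors are held fixed, then the roles swap. This is a simplified illustration under assumed toy data, not the library's algorithm; the `implicit` library implements a confidence-weighted variant designed for implicit feedback. The function name `als_explicit` and all values are hypothetical.

```python
import numpy as np

def als_explicit(R, mask, factors=2, reg=0.01, iterations=20, seed=0):
    """Factor R ~ U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, factors))
    V = rng.normal(scale=0.1, size=(n_items, factors))
    ridge = reg * np.eye(factors)
    for _ in range(iterations):
        # Fix V, solve a small regularized least-squares problem per user.
        for u in range(n_users):
            Vu = V[mask[u]]
            U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, mask[u]])
        # Fix U, solve the same kind of problem per item.
        for i in range(n_items):
            Ui = U[mask[:, i]]
            V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[mask[:, i], i])
    return U, V

# Toy rank-1 ratings matrix with two entries treated as unobserved.
R = np.outer([1.0, 0.8, 0.5], [5.0, 4.0, 2.0, 3.0])
mask = np.ones_like(R, dtype=bool)
mask[0, 3] = False
mask[2, 0] = False
R = np.where(mask, R, 0.0)  # unobserved cells are ignored by the solver

U, V = als_explicit(R, mask)
errors = (R - U @ V.T)[mask]
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(f"RMSE on observed entries: {rmse:.4f}")
```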

In [21]:
# Hyperparameters
num_factors = 50       # Number of dimensions to learn
regularization = 0.01  # L2 regularization to prevent overfitting
num_iterations = 20    # Number of training iterations

# ALS model
model = implicit.als.AlternatingLeastSquares(
    factors=num_factors,
    regularization=regularization,
    iterations=num_iterations,
    random_state=42,
    calculate_training_loss=True # Monitors loss
    #use_gpu=True # Puts training on the GPU
)

# The rating values are treated as confidence weights; the model learns a preference score per user-movie pair and does not predict ratings (0.5-5.0)
# The larger the score, the more likely the user is to like the movie
model.fit(user_item_matrix_csr)

# Store our user and movie abstract feature (latent) matrices
user_factors = model.user_factors # Shape is num_users x num_factors
item_factors = model.item_factors # Shape is num_items x num_factors

print("ALS model training complete.")
print(f"Learned User Factors Shape: {user_factors.shape}")
print(f"Learned Item Factors Shape: {item_factors.shape}")

ALS model training complete.
Learned User Factors Shape: (162541, 50)
Learned Item Factors Shape: (59047, 50)
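
Once the factor matrices exist, a user's score for every movie is the dot product of their factor vector with each item vector. The sketch below uses small random stand-ins for `model.user_factors` and `model.item_factors` (and assumes they are NumPy arrays, as on CPU); the helper `top_n_items` and the toy ID map are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the learned factors: 4 users, 6 movies, 3 latent dimensions.
toy_user_factors = rng.normal(size=(4, 3))
toy_item_factors = rng.normal(size=(6, 3))

def top_n_items(user_index, user_factors, item_factors, already_rated, n=3):
    """Return the n highest-scoring item indices the user has not rated yet."""
    scores = item_factors @ user_factors[user_index]  # one score per item
    scores[list(already_rated)] = -np.inf             # never re-recommend rated items
    return np.argsort(scores)[::-1][:n]

recommended = top_n_items(0, toy_user_factors, toy_item_factors, already_rated={1, 4})

# Translate matrix column indices back to movie IDs via an inverted ID map (toy values).
toy_movie_id_map = {296: 0, 306: 1, 307: 2, 665: 3, 899: 4, 1088: 5}
index_to_movie_id = {idx: movie_id for movie_id, idx in toy_movie_id_map.items()}
print([index_to_movie_id[int(i)] for i in recommended])
```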


In [22]:
def create_average_rating_per_movie_dictionary(ratings_df): # This will be used for baseline testing
  # Maps movieId -> [sum of ratings, number of ratings]; the average is sum / count
  # groupby avoids a row-by-row iterrows() pass over ~25 million ratings
  stats = ratings_df.groupby("movieId")["rating"].agg(["sum", "count"])
  average_rating_per_movie_dictionary = {
    movie_id: [rating_sum, rating_count]
    for movie_id, rating_sum, rating_count in stats.itertuples()
  }
  return average_rating_per_movie_dictionary

[[ 0.36005363 -0.79667735 -0.35361508 ...  0.02900936 -0.24478053
   0.17390235]
 [-0.3420918   0.32324147 -0.2431638  ...  0.7328276   0.7302562
  -0.3060542 ]
 [ 0.7235623  -0.30942324  0.305054   ... -0.31590626  0.15667811
   0.88717204]
 ...
 [ 0.05987119  0.42834234  0.02346726 ...  0.4322703  -0.0123726
  -0.06824803]
 [ 0.43982348  0.3695898   0.02304196 ... -0.04816718 -0.20397164
   0.11612535]
 [ 0.01913334 -0.03748711 -0.07723095 ... -0.19982308  0.21645366
  -0.2926144 ]]
[[-3.1941003e-01  3.7681155e-02  9.2802636e-02 ...  1.0823677e-01
  -8.8527150e-02 -2.1638710e-02]
 [ 1.4070044e-02  3.2144140e-02 -1.2952663e-01 ... -1.1457725e-01
  -2.6649756e-03 -4.5268625e-02]
 [ 1.5311753e-02 -3.2694701e-02  3.0742399e-04 ...  2.4349444e-02
   1.7238420e-03  2.2549885e-03]
 ...
 [ 8.7840592e-05 -1.0283425e-05  1.5136128e-04 ... -1.4595708e-04
  -2.0282352e-04 -2.4015728e-05]
 [ 8.7840584e-05 -1.0283427e-05  1.5136116e-04 ... -1.4595706e-04
  -2.0282352e-04 -2.4015717e-05]
 [ 7.68846