<a href="https://colab.research.google.com/github/yckamra/MovieLens25m-Recommender/blob/main/MovieLens_Recommender_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
from google.colab import auth
import zipfile
import os
import pandas as pd
import numpy as np

In [None]:
auth.authenticate_user()

!gcloud config set project movielens-recommender-451017

!gsutil cp gs://movielens-data/movielens_data.zip /content/

Updated property [core/project].
Copying gs://movielens-data/movielens_data.zip...
==> NOTE: You are downloading one or more large file(s), which would
run significantly faster if you enabled sliced object downloads. This
feature is enabled by default but requires that compiled crcmod be
installed (see "gsutil help crcmod").

| [1 files][257.5 MiB/257.5 MiB]                                                
Operation completed over 1 objects/257.5 MiB.                                    


Dataset can be found at: https://www.kaggle.com/datasets/garymk/movielens-25m-dataset

In [None]:
!ls -lh /content/

total 258M
-rw-r--r-- 1 root root 258M May 24 13:02 movielens_data.zip
drwxr-xr-x 1 root root 4.0K May 14 13:38 sample_data


In [None]:
zip_path = "/content/movielens_data.zip"  # Change to your actual zip file name
extract_to = "/content/data/"  # Where to extract files

# Create directory if it doesn't exist
os.makedirs(extract_to, exist_ok=True)

# Unzip the file
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_to)

print("Extraction complete! Files are in:", extract_to)
!ls -lh /content/data/ml-25m

Extraction complete! Files are in: /content/data/
total 1.1G
-rw-r--r-- 1 root root 416M May 24 13:02 genome-scores.csv
-rw-r--r-- 1 root root  18K May 24 13:02 genome-tags.csv
-rw-r--r-- 1 root root 1.4M May 24 13:02 links.csv
-rw-r--r-- 1 root root 2.9M May 24 13:02 movies.csv
-rw-r--r-- 1 root root 647M May 24 13:02 ratings.csv
-rw-r--r-- 1 root root  11K May 24 13:02 README.txt
-rw-r--r-- 1 root root  38M May 24 13:02 tags.csv


In [None]:
genome_scores_CSV = "/content/data/ml-25m/genome-scores.csv"
genome_tags_CSV = "/content/data/ml-25m/genome-tags.csv"
links_CSV = "/content/data/ml-25m/links.csv"
movies_CSV = "/content/data/ml-25m/movies.csv"
ratings_CSV = "/content/data/ml-25m/ratings.csv"
tags_CSV = "/content/data/ml-25m/tags.csv"

ratings.csv provides us with the data needed for the context-based recommender

In [None]:
# Load ratings data
ratings_path = ratings_CSV
ratings_df = pd.read_csv(ratings_path)

print(ratings_df.head())

# Sort by user
ratings_df.sort_values('userId', inplace=True, ignore_index=True)

# Count unique users and movies
num_users = ratings_df["userId"].nunique()
num_movies = ratings_df["movieId"].nunique()

print(f"Number of Unique Users: {num_users}")
print(f"Number of Unique Movies: {num_movies}")

   userId  movieId  rating   timestamp
0       1      296     5.0  1147880044
1       1      306     3.5  1147868817
2       1      307     5.0  1147868828
3       1      665     5.0  1147878820
4       1      899     3.5  1147868510
Number of Unique Users: 162541
Number of Unique Movies: 59047


We will be using collaborative filtering for recommendations. A 2D matrix will be created for movie to user ratings with movies on x-axis and users on y-axis. For new users, we ask for a baseline amount of movies to get a starting vector, and then we use cosine similarity to find their similarity to existing users to recommend them movies. With more ratings, comes greater accuracy.

In [None]:
#numpy_matrix = np.zeros((num_users, num_movies))
#print(numpy_matrix)
#print(numpy_matrix.shape)

In order to map IDs of movies and users to the 2D matrix for collaborative filtering, we will use dictionaries to create key-value pairs between ID and index within the matrix.

In [None]:
user_id_map = {}
movie_id_map = {}
user_index = 0
movie_index = 0

# Create user_id and movie_id to 2D matrix indices
for user in ratings_df["userId"]: # Use this for sparse matrix, not userId since users could be removed from dataset
  if user not in user_id_map:
    user_id_map[user] = user_index
    user_index += 1

for movie in ratings_df["movieId"]:
  if movie not in movie_id_map:
    movie_id_map[movie] = movie_index
    movie_index += 1

Notice below that we are populating a dense matrix, so if we have ~25 million cells to populate out of the ~9 billion from our 162,541 x 59047 matrix filled with zeroes, we are dealing with major sparcity problems and wasted memory. So below is just to later enjoy shorter train times and smaller matrices. Going forward we will be using a Compressed Sparse Row matrix. This will help in learning user and movie abstract feature matrices and cosine similarities.

In [None]:
#from IPython.display import clear_output

#iter = 0
#percent_done = 0
#total_rows = len(ratings_df)
# Iterate over the rows of the DataFrame
#for index, row in ratings_df.iterrows():
    #user_index = user_id_map[row["userId"]]
    #movie_index = movie_id_map[row["movieId"]]
    #numpy_matrix[user_index, movie_index] = row["rating"]
    #if iter % int(total_rows / 100) == 0:
      #clear_output(wait=True)
      #print(f"Percent of matrix populated with ratings: {percent_done}%")
      #percent_done += 1
    #iter += 1

In [None]:
#ratings_dataframe = pd.DataFrame(numpy_matrix)
#print(ratings_dataframe.head())

In [None]:
# Comparing the dataframe ratings to the original Kaggle dataset
#print(f"Original Rating: 4.0 Found Rating: {ratings_dataframe.loc[user_id_map[1], movie_id_map[8786]]}")
#print(f"Original Rating: 2.0 Found Rating: {ratings_dataframe.loc[user_id_map[2], movie_id_map[480]]}")
#print(f"Original Rating: 3.5 Found Rating: {ratings_dataframe.loc[user_id_map[3], movie_id_map[1270]]}")
#print(ratings_dataframe.shape)

In [None]:
from scipy.sparse import csr_matrix # Compressed Sparse Row

# Numpy array of ratings
data = ratings_df["rating"].to_numpy()
print(data)

# Numpy array of movie indices
indices = ratings_df["movieId"].to_numpy()

for i in range(len(indices)):
    indices[i] = movie_id_map[int(indices[i])]
print(indices)

# Numpy array of pointers to indices

#data = {'userId': [0, 1, 1]} # For testing
#ratings_df = pd.DataFrame(data) # For testing
#num_users = ratings_df["userId"].nunique() # For testing
indptr = np.zeros(num_users + 1, dtype=int)

last_user = ratings_df["userId"].iloc[0]
index_in_indptr = 1
iterator = 0

for i in range(len(ratings_df["userId"])):
  current_user = ratings_df["userId"].iloc[iterator]
  if current_user != last_user:
    indptr[index_in_indptr] = iterator
    index_in_indptr += 1
    last_user = current_user
  print(current_user)
  iterator += 1

indptr[index_in_indptr] = len(ratings_df["userId"]) # This fills up the last indice

print(indptr)

[5.  3.5 4.  ... 4.  5.  4.5]
[   0    1    2 ... 1348   47  774]


KeyboardInterrupt: 

0
1
1
[0 1 3]
