# Collaborative Filtering in RedisVL

Recommendation systems are a common application of machine learning and serve many industries from e-commerce to music streaming platforms.

There are many different architechtures that can be followed to build a recommender system. 

In this notebook we'll demonstrate how to build a [content filtering](https://en.wikipedia.org/wiki/Recommender_system#:~:text=of%20hybrid%20systems.-,Content%2Dbased%20filtering,-%5Bedit%5D)
recommender and use the movies dataset as our example data.

In [1]:
!pip install scikit-surprise --quiet

In [2]:
## IMPORTS
import os
import requests
import pandas as pd
import numpy as np

from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split


# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"

In [3]:

## EVALUATE MOVE TO COLLABORATIVE FILTERING SO WE CAN SHOW BETTER NUMBERS
#let's see how well this works. we can choose some users, and based on their first watched movie we can recommend them some more.
#we can then look at the set intersection between our recommendations and the movies they actually watched (and rated highly) to see how well we did.

In [4]:
## DONE
# clean up your index

#while remaining := index.clear():
#    print(f"Deleted {remaining} keys")

# YOLO FTW

To build a collaborative filtering example using the Surprise library and the Movies dataset, we need to first load the data, format it according to the requirements of Surprise, and then apply a collaborative filtering algorithm like SVD.

Since you mentioned a modified version of the dataset hosted on Kaggle, I’ll show you how to structure the code, assuming you have the dataset ready.

Here’s an example:

Step-by-Step Guide
Install necessary libraries: Ensure you have installed the Surprise library if you haven’t already.

Loading and Preparing the Data: Let’s assume the dataset contains at least two relevant files: ratings.csv (user, movie, rating) and movies.csv (movieId, title).

You’ll need to load the ratings data and prepare it for use with Surprise.

In [5]:
def fetch_dataframe(file_name):
    try:
        df = pd.read_csv('datasets/collaborative_filtering/' + file_name)
    except:
        url = 'https://redis-ai-resources.s3.us-east-2.amazonaws.com/recommenders/datasets/collaborative-filtering/'
        r = requests.get(url + file_name)
        if not os.path.exists('datasets/collaborative_filtering'):
            os.makedirs('datasets/collaborative_filtering')
        with open('datasets/collaborative_filtering/' + file_name, 'wb') as f:
            f.write(r.content)
        df = pd.read_csv('datasets/collaborative_filtering/' + file_name)
    return df


In [6]:
ratings_file = 'ratings_small.csv'

ratings_df = fetch_dataframe(ratings_file)

# only keep the columns we need: userId, movieId, rating
ratings_df = ratings_df[['userId', 'movieId', 'rating']]

reader = Reader(rating_scale=(0.0, 5.0))

ratings_data = Dataset.load_from_df(ratings_df, reader)

# Training Our Model

In [7]:
# split the data into training and testing sets (80% train, 20% test)
train_set, test_set = train_test_split(ratings_data, test_size=0.2)

# use SVD (Singular Value Decomposition) for collaborative filtering
svd_algo = SVD(biased=False)  # We'll set biased to False so that predictions are of the form "rating_prediction = user_vector dot item_vector"

# train the algorithm on the train_set
svd_algo.fit(train_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x13a767e50>

A lot happened in the cell above. We split our full data into train and test sets. We defined the collaborative filtering algorithm to use, which in this case is the Singular Value Decomposition (SVD) algorithm. lastly, we fit our model to our data.

It's worth going into more detail why we chose this algorithm and what it is computing in the `.fit(train_set)` method we're calling.
First, let's think about what data it's receiving - our ratings data. This only contains the user_ids, movie_ids, and the user's ratings of their watched movies on a scale of 1 to 5.

We can put this data into a matrix with rows being users and columns being movies

| RATINGS| movie_1 | movie_2 | movie_3 | movie_4 | movie_5 | movie_6 | ....... |
| -----  | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| user_1 |    4    |    1    |         |    4    |         |    5    |         |
| user_2 |         |    5    |    5    |    2    |    1    |         |         |
| user_3 |         |         |         |         |    1    |         |         |
| user_4 |    4    |    1    |         |    4    |         |    ?    |         |
| user_5 |         |    4    |    5    |    2    |         |         |         |
| ...... |         |         |         |         |         |         |         |

Our empty cells aren't zero's their missing ratings, so `user_1` has never rated `movie_3`. They may like it or hate it.

Unlike Content Filtering, here we're only considering the ratings that users assign. We don't know the plot or genre or release year of any of these films.
But we can still build recommender by assuming that users have similar tastes to each other. As an intuitive example, we can see that `user_1` and `user_4` have very similar ratings on several movies, so we can assume that `user_4` will rate `movie_6` highly, just as `user_1` did. This is the idea behind collaborative filtering.

That's the idea, but what about the math? Since we only have this matrix to work with what we want to do is decompose it into two constituent matrices.
Lets call our ratings matrix `[R]`. We want to find two other matrices, a user matrix `[U]`, and a movies matrix `[M]` that fit the equation:

`[U] * [M] = [R]`

`[U]` will look like:
|user_1_feature_1 | user_1_feature_2 | user_1_feature_3 | user_1_feature_4 | ... | user_1_feature_k |
| ----- | --------- | --------- | --------- | --- | --------- |
|user_2_feature_1 | user_2_feature_2 | user_2_feature_3 | user_2_feature_4 | ... | user_2_feature_k |
|user_3_feature_1 | user_3_feature_2 | user_3_feature_3 | user_3_feature_4 | ... | user_3_feature_k |
|  ...  | . | . | . | ... | . |
|user_N_feature_1 | user_N_feature_2 | user_N_feature_3 | user_N_feature_4 | ... | user_N_feature_k |

`[M]` will look like:

| movie_1_feature_1 | movie_2_feature_1 | movie_3_feature_1 | ... | movie_M_feature_1 |
| --- | --- | --- | --- | --- |
| movie_1_feature_2 | movie_2_feature_2 | movie_3_feature_2 | ... | movie_M_feature_1 |
| movie_1_feature_3 | movie_2_feature_3 | movie_3_feature_3 | ... | movie_M_feature_1 |
| movie_1_feature_4 | movie_2_feature_4 | movie_3_feature_4 | ... | movie_M_feature_1 |
|  ...  | . | . | ... | . |
| movie_1_feature_k | movie_2_feature_k | movie_3_feature_k | ... | movie_M_feature_k |


these features are called the latent features and are the values we're trying to find when we call the `.fit(training_data)` method. The algorithm that computes these features from our ratings matrix is the SVD algorithm.

Explanation:
Dataset Preparation: We load the ratings data and ensure it has the necessary format with userId, movieId, and rating columns.
Surprise Reader: This helps in specifying the format of the data.
SVD Algorithm: We use the SVD algorithm for collaborative filtering. It decomposes the user-item interaction matrix into the latent factors.
Accuracy: After training the model, we evaluate it using the RMSE (Root Mean Squared Error).
Next Steps:
You can experiment with different algorithms such as KNNBasic or NMF in the Surprise library.
If your dataset contains titles, you can join movies.csv to display movie names in recommendations.
Would you like more details on dataset preprocessing or any specific functionality in collaborative filtering?

To extract user and item (movie) vectors from an SVD model trained using Surprise and then store them in a Redis vector database, you'll need to:

Extract the learned latent factors (user and item vectors) from the SVD model.
Use Redis-py (Python client for Redis) to store those vectors in Redis as vectors in a vector database (assuming you are using Redis with the RedisVector or RedisSearch modules).
Here’s how to do it:

Step 1: Extract User and Item Vectors from the SVD Model
The Surprise SVD model stores user and item vectors (latent factors) in two attributes:

algo.pu: user factors matrix (a matrix where each row corresponds to the latent factors of a user).
algo.qi: item factors matrix (a matrix where each row corresponds to the latent factors of an item/movie).
These matrices store the vectors in the latent space after training.

Step 2: Save the Vectors in Redis
Redis stores vectors in vector databases, such as Redis' HNSW index for vector similarity search. You can store both user and movie vectors as hashes in Redis and then use them for similarity search or recommendations.

Install Redis and Redis-py
Make sure you have Redis installed with vector support (RediSearch or RedisVL), and install the Redis-py package:

In [8]:
# step 1: extract vectors
user_vectors = svd_algo.pu  # user latent features (matrix)
movie_vectors = svd_algo.qi  # movie latent features (matrix)

print(user_vectors.shape)
print(movie_vectors.shape)

(671, 100)
(8405, 100)


Explanation:
Extract Vectors:

algo.pu gives you a matrix where each row corresponds to a user’s latent factors (user vector).
algo.qi gives you a matrix where each row corresponds to an item/movie’s latent factors (item vector).
Store in Redis:

We store each vector under a unique Redis key (e.g., user:123, item:456).
The vector is stored as a hash in Redis with each dimension (dim_0, dim_1, etc.) being a field in the hash.
Step 3: Advanced Storage for Vector Similarity Search
If you want to store the vectors in a Redis vector search index (e.g., HNSW from RedisSearch for vector similarity queries), you would follow the Redis commands for indexing:



In [9]:
print(svd_algo.predict(347, 5515))

inner_uid = train_set.to_inner_uid(347)
inner_iid = train_set.to_inner_iid(5515)
print(np.dot(user_vectors[inner_uid], movie_vectors[inner_iid])) # surpirse casts userId and movieId to inner ids
svd_algo.predict(347, 5515)

user: 347        item: 5515       r_ui = None   est = 1.42   {'was_impossible': False}
1.4150893670982523


Prediction(uid=347, iid=5515, r_ui=None, est=1.4150893670982523, details={'was_impossible': False})

while our collaborative filtering algorithm was trained solely on user's ratings of movies, and doesn't require any data about the movies themselves - like the title, genres, or release year - we'll want that information stored as metadata.

We can grab this data from our `movies_metadata.csv` file, clean it, and join it to our user ratings via the `movieId` column

In [10]:
movies_df = fetch_dataframe('movies_metadata.csv')
movies_df.head()

Unnamed: 0,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,...,1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,...,1995-02-10,76578911,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173


In [11]:

import datetime
movies_df.drop(columns=['homepage', 'production_countries', 'production_companies', 'spoken_languages', 'video', 'original_title', 'video', 'poster_path', 'belongs_to_collection'], inplace=True)

# drop rows that have missing values
movies_df.dropna(subset=['imdb_id'], inplace=True)

movies_df['original_language'] = movies_df['original_language'].fillna('unknown')
movies_df['overview'] = movies_df['overview'].fillna('')
movies_df['popularity'] = movies_df['popularity'].fillna(0)
movies_df['release_date'] = movies_df['release_date'].fillna('1900-01-01').apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d").timestamp())
movies_df['revenue'] = movies_df['revenue'].fillna(0) # fill with average?
movies_df['runtime'] = movies_df['runtime'].fillna(0) # fill with average?
movies_df['status'] = movies_df['status'].fillna('unknown')
movies_df['tagline'] = movies_df['tagline'].fillna('')
movies_df['title'] = movies_df['title'].fillna('')
movies_df['vote_average'] = movies_df['vote_average'].fillna(0)
movies_df['vote_count'] = movies_df['vote_count'].fillna(0)
movies_df['genres'] = movies_df['genres'].apply(lambda x: [g['name'] for g in eval(x)] if x != '' else []) # convert to a list of genre names
movies_df['imdb_id'] = movies_df['imdb_id'].apply(lambda x: x[2:] if str(x).startswith('tt') else x).astype(int) # remove leading 'tt' from imdb_id

# make sure we've filled all missing values
movies_df.isnull().sum()

budget               0
genres               0
id                   0
imdb_id              0
original_language    0
overview             0
popularity           0
release_date         0
revenue              0
runtime              0
status               0
tagline              0
title                0
vote_average         0
vote_count           0
dtype: int64

We'll eventually have to map these movies to their ratings, which we'll do so with the `links.csv` file that matches `movieId`, `imdbId`, and `tmdbId`.

Let's do that now.

In [12]:

links_df = fetch_dataframe('links_small.csv')
movies_df = movies_df.merge(links_df, left_on='imdb_id', right_on='imdbId', how='inner')
movies_df.head()

Unnamed: 0,budget,genres,id,imdb_id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,title,vote_average,vote_count,movieId,imdbId,tmdbId
0,30000000,"[Animation, Comedy, Family]",862,114709,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,815040000.0,373554033,81.0,Released,,Toy Story,7.7,5415,1,114709,862.0
1,65000000,"[Adventure, Fantasy, Family]",8844,113497,en,When siblings Judy and Peter discover an encha...,17.015539,819014400.0,262797249,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413,2,113497,8844.0
2,0,"[Romance, Comedy]",15602,113228,en,A family wedding reignites the ancient feud be...,11.7129,819619200.0,0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92,3,113228,15602.0
3,16000000,"[Comedy, Drama, Romance]",31357,114885,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,819619200.0,81452156,127.0,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34,4,114885,31357.0
4,0,[Comedy],11862,113041,en,Just when George Banks has recovered from his ...,8.387519,792403200.0,76578911,106.0,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173,5,113041,11862.0


We'll want to move our SVD user vectors and movie vectors and their corresponding userId and movieId into 2 dataframes for later processing.

In [13]:
# build a dataframe out of the user vectors and their userIds
user_vectors_and_ids = {train_set.to_raw_uid(inner_id): user_vectors[inner_id].tolist() for inner_id in train_set.all_users()}
user_vector_df = pd.Series(user_vectors_and_ids).to_frame('user_vector')

# now do the same for the movie vectors and their movieIds
movie_vectors_and_ids = {train_set.to_raw_iid(inner_id): movie_vectors[inner_id].tolist() for inner_id in train_set.all_items()}
movie_vector_df = pd.Series(movie_vectors_and_ids).to_frame('movie_vector')

# merge the movie vector series with the movies dataframe using the movieId and id fields
movies_df = movies_df.merge(movie_vector_df, left_on='id', right_index=True, how='inner')

Querying Vectors
You can use Redis’ vector similarity search to find the most similar vectors once they’re stored.



Once you've stored your vectors in Redis, querying for vector similarity becomes straightforward, especially if you're using RediSearch with vector support (such as HNSW). I'll guide you through setting up and querying for vector similarity.

Query Setup
We'll assume:

You've already created a vector index using the HNSW algorithm (or another vector indexing mechanism).
You've stored your user or item vectors in Redis, either as fields in a Redis hash or as direct vector fields for vector similarity searches.
Step-by-Step Guide for Querying Vector Similarity
1. Create a Vector Index (If not already created)
Before you can perform similarity queries, you need to create a vector index using the FT.CREATE command. This defines how vectors are indexed in Redis.



In [14]:
from redis import Redis
from redisvl.schema import IndexSchema
from redisvl.index import SearchIndex

client = Redis.from_url(REDIS_URL)

movie_schema = IndexSchema.from_yaml("collaborative_filtering_schema.yaml")

movie_index = SearchIndex(movie_schema, redis_client=client)
movie_index.create(overwrite=True, drop=True)

user_schema = IndexSchema.from_yaml("user_schema.yaml")

user_index = SearchIndex(user_schema, redis_client=client)
user_index.create(overwrite=True, drop=True)

In [15]:
keys = movie_index.load(movies_df.to_dict(orient='records'))

In [16]:
number_of_movies = len(movies_df.to_dict(orient='records'))
size_of_movie_df = movies_df.size

print(number_of_movies)
print(size_of_movie_df)
unique_movie_ids = movies_df['id'].nunique()
print(unique_movie_ids)
unique_movie_titles = movies_df['title'].nunique()
print(unique_movie_titles)

unique_movies_rated = ratings_df['movieId'].nunique()
print(unique_movies_rated)

1494
28386
1494
1482
9065


Unlike in content filtering, where we want to compute vector similarity between items and we use cosine distance between items vectors to do so, in collaborative filtering we instead try to compute the predicted rating a user will give to a movie by taking the inner product of the user and movie vector.

This is why in our `collaborative_filtering_schema.yaml` we use `ip` (inner product) as our distance metric.

It's also why we'll use our user vector as the query vector when we do a vector query.

In [24]:
from redisvl.query import RangeQuery, FilterQuery
from redisvl.query.filter import Tag, Num, Text

user_1_vector = user_vectors[20].tolist()

# the distance metric 'ip' inner product is computing "score = 1 - u * v" and returning the minimum, which corresponds to the max of "u * v"
# this is what we want. The predicted rating on a scale of 0 to 5 is then -(score - 1) == -score + 1
query = RangeQuery(vector=user_1_vector,
                   vector_field_name='movie_vector',
                  num_results=20,
                  return_score=True,
                  return_fields=['title', 'genres']
                  )

results = movie_index.query(query)

for r in results:
    print(r)

{'id': 'movie:9df0babc731549909e929885973aee58', 'vector_distance': '-4.31072711945', 'title': 'The Million Dollar Hotel', 'genres': '["Drama","Thriller"]'}
{'id': 'movie:5d7079fac9534a0585608c9b0d01ba80', 'vector_distance': '-4.11799812317', 'title': "Pandora's Box", 'genres': '["Drama","Thriller","Romance"]'}
{'id': 'movie:120d744065394c499f76589586051337', 'vector_distance': '-4.10946702957', 'title': "Monsieur  Hulot's Holiday", 'genres': '["Comedy","Family"]'}
{'id': 'movie:ae34e2e4c64147c994f3c5c3c29f5190', 'vector_distance': '-4.01828145981', 'title': 'Scarface', 'genres': '["Action","Crime","Drama","Thriller"]'}
{'id': 'movie:a1506eff623a4ba0a165a850e7250d3e', 'vector_distance': '-4.00052165985', 'title': 'The Thomas Crown Affair', 'genres': '["Romance","Crime","Thriller","Drama"]'}
{'id': 'movie:3dbfc6b11f374d649663c0365792a3d2', 'vector_distance': '-3.99500846863', 'title': 'Dead Man', 'genres': '["Drama","Fantasy","Western"]'}
{'id': 'movie:4df54c2a6f674b85ab12ced9b6eddf3e',