![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# Recommendation Systems: Collaborative Filtering in RedisVL

<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/recommendation-systems/01_collaborative_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Recommendation systems are a common application of machine learning and serve many industries from e-commerce to music streaming platforms.

There are many different architectures that can be followed to build a recommendation system. In a previous example notebook we demonstrated how to do [content filtering with RedisVL](content_filtering.ipynb). We encourage you to start there before diving into this notebook.

In this notebook we'll demonstrate how to build a [collaborative filtering](https://en.wikipedia.org/wiki/Collaborative_filtering)
recommendation system and use the large IMDB movies dataset as our example data.

To generate our vectors, we'll use the popular Python package [Surprise](https://surpriselib.com/).

## Environment Setup

In [None]:
# NBVAL_SKIP
!pip install -q scikit-surprise redis redisvl pandas

### Install Redis Stack

Later in this tutorial, Redis will be used to store, index, and query vector
embeddings. **We need to make sure we have a Redis instance available.**

####  Redis in Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
# NBVAL_SKIP
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### Other ways to get Redis
There are many ways to get the necessary Redis Stack instance running:
1. In the cloud, deploy a [free instance of Redis](https://redis.io/try-free/). Or, if you have your
own version of Redis Enterprise running, that works too!
2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

### Define the Redis Connection URL

By default, this notebook connects to a local instance of Redis Stack. **If you have your own Redis Cloud or Redis Enterprise instance**, replace the `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` values with your own.

In [2]:
import os
import requests
import pandas as pd
import numpy as np

from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import train_test_split


# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
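The URL above always uses the `redis://` scheme and always includes a password segment. As a minimal sketch, assuming a hypothetical helper named `build_redis_url` (it is not part of this notebook or of redis-py), the SSL and no-password cases could be assembled like this:

```python
from urllib.parse import quote


def build_redis_url(host="localhost", port=6379, password="", use_ssl=False):
    """Hypothetical helper: assemble a Redis connection URL from its parts."""
    scheme = "rediss" if use_ssl else "redis"  # rediss:// is the TLS variant
    # percent-encode the password so characters like '@' or ':' don't break URL parsing
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


print(build_redis_url())
print(build_redis_url("example.host", 18374, "p@ss", use_ssl=True))
```

The host name and password here are made up for illustration; substitute the values from your own deployment.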

To build a collaborative filtering example using the Surprise library and the Movies dataset, we need to first load the data, format it according to the requirements of Surprise, and then apply a collaborative filtering algorithm like SVD.

In [3]:
def fetch_dataframe(file_name):
    local_dir = 'datasets/collaborative_filtering'
    local_path = os.path.join(local_dir, file_name)
    # download the file only if we don't already have a local copy
    if not os.path.exists(local_path):
        url = 'https://redis-ai-resources.s3.us-east-2.amazonaws.com/recommenders/datasets/collaborative-filtering/'
        r = requests.get(url + file_name)
        r.raise_for_status()  # fail loudly instead of saving an error page as a CSV
        os.makedirs(local_dir, exist_ok=True)
        with open(local_path, 'wb') as f:
            f.write(r.content)
    return pd.read_csv(local_path)


In [4]:
ratings_df = fetch_dataframe('ratings_small.csv') # for a larger example use 'ratings.csv' instead

# only keep the columns we need: userId, movieId, rating
ratings_df = ratings_df[['userId', 'movieId', 'rating']]

reader = Reader(rating_scale=(0.0, 5.0))

ratings_data = Dataset.load_from_df(ratings_df, reader)

## What Is Collaborative Filtering?

A lot is going to happen in the code cell below. We split our full data into train and test sets, define the collaborative filtering algorithm to use, which in this case is Singular Value Decomposition (SVD), and finally fit the model to our training data.

It's worth going into more detail about why we chose this algorithm and what the `svd.fit(train_set)` call is computing.
First, let's think about what data it's receiving: our ratings data. This contains only the userIds, movieIds, and the users' ratings of their watched movies on a scale of 0 to 5.

We can put this data into a matrix with rows being users and columns being movies:

| RATINGS| movie_1 | movie_2 | movie_3 | movie_4 | movie_5 | movie_6 | ....... |
| -----  | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
| user_1 |    4    |    1    |         |    4    |         |    5    |         |
| user_2 |         |    5    |    5    |    2    |    1    |         |         |
| user_3 |         |         |         |         |    1    |         |         |
| user_4 |    4    |    1    |         |    4    |         |    ?    |         |
| user_5 |         |    4    |    5    |    2    |         |         |         |
| ...... |         |         |         |         |         |         |         |

Our empty cells aren't zeros; they're missing ratings. For example, `user_1` has never rated `movie_3`, so we don't know whether they would like it or hate it.

Unlike Content Filtering, here we're only considering the ratings that users assign. We don't know the plot or genre or release year of any of these films. We don't even know the title.
But we can still build a recommender by assuming that users have similar tastes to each other. As an intuitive example, we can see that `user_1` and `user_4` have very similar ratings on several movies, so we will assume that `user_4` will rate `movie_6` highly, just as `user_1` did. This is the idea behind collaborative filtering.
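To make that intuition concrete, here is a small sketch that compares `user_4` with every other user in the toy table, using cosine similarity over the movies both users rated. This is only an illustration of "similar tastes"; it is an assumption of this example, not the computation SVD performs, and the raw (non-mean-centered) ratings are a simplification.

```python
from math import sqrt

# ratings from the toy table above; missing ratings are simply absent
ratings = {
    "user_1": {"movie_1": 4, "movie_2": 1, "movie_4": 4, "movie_6": 5},
    "user_2": {"movie_2": 5, "movie_3": 5, "movie_4": 2, "movie_5": 1},
    "user_3": {"movie_5": 1},
    "user_4": {"movie_1": 4, "movie_2": 1, "movie_4": 4},
    "user_5": {"movie_2": 4, "movie_3": 5, "movie_4": 2},
}


def cosine_on_shared(a, b):
    """Cosine similarity over the movies both users rated (0.0 if none)."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


similarities = {
    other: cosine_on_shared(ratings["user_4"], ratings[other])
    for other in ratings
    if other != "user_4"
}
most_similar = max(similarities, key=similarities.get)
print(most_similar, round(similarities[most_similar], 3))
```

The most similar user is the one whose rating of `movie_6` we would borrow as a first guess for `user_4`.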

That's the intuition, but what about the math? Since we only have this matrix to work with, what we want to do is decompose it into two constituent matrices.
Let's call our ratings matrix `[R]`. We want to find two other matrices, a user matrix `[U]` and a movie matrix `[M]`, whose product approximates the known entries of `[R]`:

`[U] * [M] ≈ [R]`

`[U]` will look like:

|user_1_feature_1 | user_1_feature_2 | user_1_feature_3 | user_1_feature_4 | ... | user_1_feature_k |
| ----- | --------- | --------- | --------- | --- | --------- |
|user_2_feature_1 | user_2_feature_2 | user_2_feature_3 | user_2_feature_4 | ... | user_2_feature_k |
|user_3_feature_1 | user_3_feature_2 | user_3_feature_3 | user_3_feature_4 | ... | user_3_feature_k |
|  ...  | . | . | . | ... | . |
|user_N_feature_1 | user_N_feature_2 | user_N_feature_3 | user_N_feature_4 | ... | user_N_feature_k |

`[M]` will look like:

| movie_1_feature_1 | movie_2_feature_1 | movie_3_feature_1 | ... | movie_M_feature_1 |
| --- | --- | --- | --- | --- |
| movie_1_feature_2 | movie_2_feature_2 | movie_3_feature_2 | ... | movie_M_feature_2 |
| movie_1_feature_3 | movie_2_feature_3 | movie_3_feature_3 | ... | movie_M_feature_3 |
| movie_1_feature_4 | movie_2_feature_4 | movie_3_feature_4 | ... | movie_M_feature_4 |
|  ...  | . | . | ... | . |
| movie_1_feature_k | movie_2_feature_k | movie_3_feature_k | ... | movie_M_feature_k |


These features are called latent features (or latent factors), and they are the values we're trying to find when we call `svd.fit(train_set)`. The algorithm that computes them from our ratings matrix is SVD. The numbers of users and movies are set by our data, while the size `k` of the latent feature vectors is a parameter we choose. We'll keep it at the default of 100 for this notebook.
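As a rough sketch of what "finding the latent features" means, the snippet below factorizes the toy ratings table with plain stochastic gradient descent in pure Python. It is an illustration under stated assumptions, not Surprise's implementation: the factor size `k = 3`, the learning rate, the regularization strength, and the number of epochs are all arbitrary choices made for this example.

```python
import random

random.seed(0)

# (user index, movie index, rating) triples from the toy table; user_4's movie_6 is unknown
known = [
    (0, 0, 4), (0, 1, 1), (0, 3, 4), (0, 5, 5),
    (1, 1, 5), (1, 2, 5), (1, 3, 2), (1, 4, 1),
    (2, 4, 1),
    (3, 0, 4), (3, 1, 1), (3, 3, 4),
    (4, 1, 4), (4, 2, 5), (4, 3, 2),
]

n_users, n_movies, k = 5, 6, 3
# one row of k latent features per user and per movie, initialized near zero
U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
M = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_movies)]


def predict(u, m):
    """Predicted rating = dot product of a user vector and a movie vector."""
    return sum(pu * qm for pu, qm in zip(U[u], M[m]))


def rmse():
    """Root-mean-square error over the known ratings only."""
    return (sum((r - predict(u, m)) ** 2 for u, m, r in known) / len(known)) ** 0.5


initial_rmse = rmse()
learning_rate, reg = 0.02, 0.01
for _ in range(2000):
    for u, m, r in known:
        err = r - predict(u, m)
        for f in range(k):
            pu, qm = U[u][f], M[m][f]
            # nudge both vectors to shrink the error, with a small penalty on their size
            U[u][f] += learning_rate * (err * qm - reg * pu)
            M[m][f] += learning_rate * (err * pu - reg * qm)
final_rmse = rmse()

print(f"RMSE on known ratings: {initial_rmse:.3f} -> {final_rmse:.3f}")
print(f"predicted rating of user_4 for movie_6: {predict(3, 5):.2f}")
```

Only the known ratings contribute to the error, which is how the empty cells stay "missing" rather than being treated as zeros.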

In [5]:
# split the data into training and testing sets (80% train, 20% test)
train_set, test_set = train_test_split(ratings_data, test_size=0.2, random_state=42)

# use SVD (Singular Value Decomposition) for collaborative filtering
svd = SVD(n_factors=100, biased=False)  # we'll set biased to False so that predictions are of the form "rating_prediction = user_vector dot item_vector"

# train the algorithm on the train_set
svd.fit(train_set)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x10f6af3d0>

## Extracting The User and Movie Vectors

Now that the SVD algorithm has computed our `[U]` and `[M]` matrices - which are both really just lists of vectors - we can load them into our Redis instance.

The Surprise SVD model stores user and movie vectors in two attributes:

`svd.pu`: user features matrix (a matrix where each row corresponds to the latent features of a user).

`svd.qi`: item features matrix (a matrix where each row corresponds to the latent features of an item/movie).

It's worth noting that the matrix `svd.qi` is the transpose of the matrix `[M]` we defined above. This way each row corresponds to one movie.

In [6]:
user_vectors = svd.pu  # user latent features (matrix)
movie_vectors = svd.qi  # movie latent features (matrix)

print(f'we have {user_vectors.shape[0]} users with feature vectors of size {user_vectors.shape[1]}')
print(f'we have {movie_vectors.shape[0]} movies with feature vectors of size {movie_vectors.shape[1]}')

we have 671 users with feature vectors of size 100
we have 8403 movies with feature vectors of size 100


## Predicting User Ratings
The great thing about collaborative filtering is that using our user and movie vectors we can predict the rating any user will give to any movie in our dataset.
And unlike content filtering, there is no assumption that all the movies a user will be recommended are similar to each other. A user can be recommended dark horror films and light-hearted animations.

Looking back at our SVD algorithm, the equation is `[User_features] * [Movie_features].transpose ≈ [Ratings]`.
So to predict what a user will rate a movie they haven't seen yet, we just need to take the dot product of that user's feature vector and that movie's feature vector.

In [7]:
# surprise casts userId and movieId to inner ids, so we have to use their mapping to know which rows to use
inner_uid = train_set.to_inner_uid(347) # userId
inner_iid = train_set.to_inner_iid(5515) # movieId

# predict one user's rating of one film
predicted_rating = np.dot(user_vectors[inner_uid], movie_vectors[inner_iid])
print(f'the predicted rating of user {347} on movie {5515} is {predicted_rating}')

the predicted rating of user 347 on movie 5515 is 0.8991088891906795


## Adding Movie Data
While our collaborative filtering algorithm was trained solely on users' ratings of movies, and doesn't require any data about the movies themselves (like the title, genres, or release year), we'll want that information stored as metadata.

We can grab this data from our `movies_metadata.csv` file, clean it, and join it to our user ratings via the `movieId` column.

In [8]:
movies_df = fetch_dataframe('movies_metadata.csv')
movies_df.head()

Unnamed: 0,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,popularity,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",21.946943,...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415
1,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,17.015539,...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413
2,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,11.7129,...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92
3,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",3.859495,...,1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34
4,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,8.387519,...,1995-02-10,76578911,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173


In [9]:

import ast
import datetime
movies_df.drop(columns=['homepage', 'production_countries', 'production_companies', 'spoken_languages', 'video', 'original_title', 'poster_path', 'belongs_to_collection'], inplace=True)

# drop rows that are missing an imdb_id, since we need it to join with the ratings
movies_df.dropna(subset=['imdb_id'], inplace=True)

movies_df['original_language'] = movies_df['original_language'].fillna('unknown')
movies_df['overview'] = movies_df['overview'].fillna('')
movies_df['popularity'] = movies_df['popularity'].fillna(0)
movies_df['release_date'] = movies_df['release_date'].fillna('1900-01-01').apply(lambda x: datetime.datetime.strptime(x, "%Y-%m-%d").timestamp())
movies_df['revenue'] = movies_df['revenue'].fillna(0)
movies_df['runtime'] = movies_df['runtime'].fillna(0)
movies_df['status'] = movies_df['status'].fillna('unknown')
movies_df['tagline'] = movies_df['tagline'].fillna('')
movies_df['title'] = movies_df['title'].fillna('')
movies_df['vote_average'] = movies_df['vote_average'].fillna(0)
movies_df['vote_count'] = movies_df['vote_count'].fillna(0)
movies_df['genres'] = movies_df['genres'].apply(lambda x: [g['name'] for g in ast.literal_eval(x)] if x != '' else []) # convert to a list of genre names; literal_eval only parses literals, unlike eval
movies_df['imdb_id'] = movies_df['imdb_id'].apply(lambda x: x[2:] if str(x).startswith('tt') else x).astype(int) # remove leading 'tt' from imdb_id

# make sure we've filled all missing values
movies_df.isnull().sum()

budget               0
genres               0
id                   0
imdb_id              0
original_language    0
overview             0
popularity           0
release_date         0
revenue              0
runtime              0
status               0
tagline              0
title                0
vote_average         0
vote_count           0
dtype: int64
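The two trickiest cleaning steps above are turning the `genres` string into a list of names and stripping the `tt` prefix from `imdb_id`. Here is a standalone sketch of both on single values, assuming hypothetical helper names `parse_genres` and `parse_imdb_id` that do not appear in the notebook. It uses `ast.literal_eval`, which only accepts Python literals and so cannot execute code embedded in a CSV cell.

```python
import ast


def parse_genres(raw):
    """Turn a string like "[{'id': 35, 'name': 'Comedy'}]" into ['Comedy']."""
    if not isinstance(raw, str) or not raw.strip():
        return []
    # literal_eval parses the list-of-dicts literal without evaluating arbitrary expressions
    return [g["name"] for g in ast.literal_eval(raw)]


def parse_imdb_id(raw):
    """Strip the leading 'tt' from an id like 'tt0114709' and return an int."""
    text = str(raw)
    return int(text[2:] if text.startswith("tt") else text)


print(parse_genres("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"))
print(parse_imdb_id("tt0114709"))
```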

We'll have to map these movies to their ratings, which we'll do with the `links_small.csv` file that matches `movieId`, `imdbId`, and `tmdbId`.
Let's do that now.

In [10]:
links_df = fetch_dataframe('links_small.csv') # for a larger example use 'links.csv' instead

movies_df = movies_df.merge(links_df, left_on='imdb_id', right_on='imdbId', how='inner')

We'll want to move our SVD user vectors and movie vectors, along with their corresponding userIds and movieIds, into two dataframes for later processing.

In [11]:
# build a dataframe out of the user vectors and their userIds
user_vectors_and_ids = {train_set.to_raw_uid(inner_id): user_vectors[inner_id].tolist() for inner_id in train_set.all_users()}
user_vector_df = pd.Series(user_vectors_and_ids).to_frame('user_vector')

# now do the same for the movie vectors and their movieIds
movie_vectors_and_ids = {train_set.to_raw_iid(inner_id): movie_vectors[inner_id].tolist() for inner_id in train_set.all_items()}
movie_vector_df = pd.Series(movie_vectors_and_ids).to_frame('movie_vector')

# merge the movie vectors into the movies dataframe by matching the movieId column against the vector dataframe's index
movies_df = movies_df.merge(movie_vector_df, left_on='movieId', right_index=True, how='inner')
movies_df['movieId'] = movies_df['movieId'].astype(str) # need to cast to a string as this is a tag field in our search schema
movies_df.head()

Unnamed: 0,budget,genres,id,imdb_id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,title,vote_average,vote_count,movieId,imdbId,tmdbId,movie_vector
0,30000000,"[Animation, Comedy, Family]",862,114709,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,815040000.0,373554033,81.0,Released,,Toy Story,7.7,5415,1,114709,862.0,"[0.3629597621031209, 0.09949090915092493, -0.3..."
1,65000000,"[Adventure, Fantasy, Family]",8844,113497,en,When siblings Judy and Peter discover an encha...,17.015539,819014400.0,262797249,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413,2,113497,8844.0,"[0.4218097358091202, 0.40147087972459594, 0.04..."
2,0,"[Romance, Comedy]",15602,113228,en,A family wedding reignites the ancient feud be...,11.7129,819619200.0,0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92,3,113228,15602.0,"[0.05688804187546483, 0.23857067106480734, -0...."
3,16000000,"[Comedy, Drama, Romance]",31357,114885,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,819619200.0,81452156,127.0,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34,4,114885,31357.0,"[0.19581296502262047, 0.13208694293045403, -0...."
4,0,[Comedy],11862,113041,en,Just when George Banks has recovered from his ...,8.387519,792403200.0,76578911,106.0,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173,5,113041,11862.0,"[0.10202142982800701, 0.07210970873780809, -0...."


## RedisVL Handles the Scale

Especially for large datasets like the 45,000-movie catalog we're dealing with, you'll want Redis to do the heavy lifting of vector search.
All that's needed is to define the search index and load the data we've cleaned and merged with our vectors.


In [12]:
from redis import Redis
from redisvl.schema import IndexSchema
from redisvl.index import SearchIndex

client = Redis.from_url(REDIS_URL)

movie_schema = IndexSchema.from_dict({
    'index': {
        'name': 'movies',
        'prefix': 'movie',
        'storage_type': 'json'
    },
    'fields': [
        {'name': 'movieId','type': 'tag'},
        {'name': 'genres', 'type': 'tag'},
        {'name': 'original_language', 'type': 'tag'},
        {'name': 'overview', 'type': 'text'},
        {'name': 'popularity', 'type': 'numeric'},
        {'name': 'release_date', 'type': 'numeric'},
        {'name': 'revenue', 'type': 'numeric'},
        {'name': 'runtime', 'type': 'numeric'},
        {'name': 'status', 'type': 'tag'},
        {'name': 'tagline', 'type': 'text'},
        {'name': 'title', 'type': 'text'},
        {'name': 'vote_average', 'type': 'numeric'},
        {'name': 'vote_count', 'type': 'numeric'},
        {
            'name': 'movie_vector',
            'type': 'vector',
            'attrs': {
                'dims': 100,
                'algorithm': 'flat',
                'datatype': 'float32',
                'distance_metric': 'ip'
            }
        }
    ]
})


movie_index = SearchIndex(movie_schema, redis_client=client)
movie_index.create(overwrite=True, drop=True)

movie_keys = movie_index.load(movies_df.to_dict(orient='records'))

16:32:12 redisvl.index.index INFO   Index already exists, overwriting.


In [13]:
# sanity check we merged all dataframes properly and have the right sizes of movies, users, vectors, ids, etc.
number_of_movies = len(movies_df.to_dict(orient='records'))
size_of_movie_df = movies_df.shape[0]

print('number of movies', number_of_movies)
print('size of movie df', size_of_movie_df)

unique_movie_ids = movies_df['id'].nunique()
print('unique movie ids', unique_movie_ids)

unique_movie_titles = movies_df['title'].nunique()
print('unique movie titles', unique_movie_titles)

unique_movies_rated = ratings_df['movieId'].nunique()
print('unique movies rated', unique_movies_rated)
movies_df.head()

number of movies 8365
size of movie df 8365
unique movie ids 8359
unique movie titles 8117
unique movies rated 9065


Unnamed: 0,budget,genres,id,imdb_id,original_language,overview,popularity,release_date,revenue,runtime,status,tagline,title,vote_average,vote_count,movieId,imdbId,tmdbId,movie_vector
0,30000000,"[Animation, Comedy, Family]",862,114709,en,"Led by Woody, Andy's toys live happily in his ...",21.946943,815040000.0,373554033,81.0,Released,,Toy Story,7.7,5415,1,114709,862.0,"[0.3629597621031209, 0.09949090915092493, -0.3..."
1,65000000,"[Adventure, Fantasy, Family]",8844,113497,en,When siblings Judy and Peter discover an encha...,17.015539,819014400.0,262797249,104.0,Released,Roll the dice and unleash the excitement!,Jumanji,6.9,2413,2,113497,8844.0,"[0.4218097358091202, 0.40147087972459594, 0.04..."
2,0,"[Romance, Comedy]",15602,113228,en,A family wedding reignites the ancient feud be...,11.7129,819619200.0,0,101.0,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,6.5,92,3,113228,15602.0,"[0.05688804187546483, 0.23857067106480734, -0...."
3,16000000,"[Comedy, Drama, Romance]",31357,114885,en,"Cheated on, mistreated and stepped on, the wom...",3.859495,819619200.0,81452156,127.0,Released,Friends are the people who let you be yourself...,Waiting to Exhale,6.1,34,4,114885,31357.0,"[0.19581296502262047, 0.13208694293045403, -0...."
4,0,[Comedy],11862,113041,en,Just when George Banks has recovered from his ...,8.387519,792403200.0,76578911,106.0,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,5.7,173,5,113041,11862.0,"[0.10202142982800701, 0.07210970873780809, -0...."


For a complete solution, we'll also store the user vectors and their watched lists in Redis. We won't be searching over these user vectors, so there's no need to define an index for them. A direct JSON lookup will suffice.

In [14]:
from redis.commands.json.path import Path

# use a Redis pipeline to queue all the user writes and send them in one batch
with client.pipeline() as pipe:
    for user_id, user_vector in user_vectors_and_ids.items():
        user_key = f"user:{user_id}"
        watched_list_ids = ratings_df[ratings_df['userId'] == user_id]['movieId'].tolist()

        user_data = {
            "user_vector": user_vector,
            "watched_list_ids": watched_list_ids
        }
        pipe.json().set(user_key, Path.root_path(), user_data)
    # execute once, after the loop, so every queued command goes out together
    pipe.execute()

In content filtering, we compute vector similarity between items and use the cosine distance between item vectors to do so. In collaborative filtering, we instead compute the predicted rating a user will give to a movie by taking the inner product of the user vector and the movie vector.

This is why our `movie_schema` uses `ip` (inner product) as its distance metric.
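As a small worked example of that relationship, the sketch below uses made-up two-dimensional vectors and the distance definition used in this notebook, `distance = 1 - u . v`. It shows that sorting by smallest distance is the same as sorting by largest inner product, and that `1 - distance` recovers the predicted rating. The vectors and movie names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ip_distance(u, v):
    """Inner-product distance as used in this notebook: 1 - u . v"""
    return 1.0 - dot(u, v)


# made-up vectors purely for illustration
user = [1.0, 2.0]
movies = {
    "movie_a": [0.5, 0.25],
    "movie_b": [1.0, 1.5],
    "movie_c": [2.0, 0.5],
}

# smallest distance first, which is what a vector search returns
by_distance = sorted(movies, key=lambda name: ip_distance(user, movies[name]))
for name in by_distance:
    distance = ip_distance(user, movies[name])
    print(name, "distance:", distance, "predicted rating:", 1.0 - distance)
```

Note that the distances can be negative whenever the inner product exceeds 1, which is expected for ratings on a 0 to 5 scale.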

It's also why we'll use our user vector as the query vector when we do a query. Let's pick a random user and their corresponding user vector to see what this looks like.

In [15]:
from redisvl.query import RangeQuery

user_vector = client.json().get(f"user:{352}")["user_vector"]

# the distance metric 'ip' inner product is computing "score = 1 - u * v" and returning the minimum, which corresponds to the max of "u * v"
# this is what we want. The predicted rating on a scale of 0 to 5 is then -(score - 1) == -score + 1
query = RangeQuery(vector=user_vector,
                    vector_field_name='movie_vector',
                    num_results=12,
                    return_score=True,
                    return_fields=['title', 'genres']
                    )

results = movie_index.query(query)

for r in results:
    # compute our predicted rating on a scale of 0 to 5 from our vector distance
    r['predicted_rating'] = - float(r['vector_distance']) + 1.
    print(f"vector distance: {float(r['vector_distance']):.08f},\t predicted rating: {r['predicted_rating']:.08f},\t title: {r['title']}, ")

vector distance: -3.70880890,	 predicted rating: 4.70880890,	 title: The Shawshank Redemption, 
vector distance: -3.64755058,	 predicted rating: 4.64755058,	 title: Gladiator 1992, 
vector distance: -3.59094477,	 predicted rating: 4.59094477,	 title: Spirited Away, 
vector distance: -3.55783939,	 predicted rating: 4.55783939,	 title: The Third Man, 
vector distance: -3.50615883,	 predicted rating: 4.50615883,	 title: Schindler's List, 
vector distance: -3.46187067,	 predicted rating: 4.46187067,	 title: My Neighbor Totoro, 
vector distance: -3.45508957,	 predicted rating: 4.45508957,	 title: Ran, 
vector distance: -3.44600630,	 predicted rating: 4.44600630,	 title: Saving Private Ryan, 
vector distance: -3.43901110,	 predicted rating: 4.43901110,	 title: The Lord of the Rings: The Two Towers, 
vector distance: -3.41369772,	 predicted rating: 4.41369772,	 title: Memento, 
vector distance: -3.39571905,	 predicted rating: 4.39571905,	 title: The Great Escape, 
vector distance: -3.36728716

## Adding All the Bells & Whistles
Vector search handles the bulk of our collaborative filtering recommendation system and is a great approach to generating personalized recommendations that are unique to each user.

To up our RecSys game even further, we can leverage RedisVL filter logic to give more control over what users are shown. Why have only one feed of recommended movies when you can have several, each with its own theme and personalized to each user?

In [16]:

from redisvl.query.filter import Tag, Num, Text

def get_recommendations(user_id, filters=None, num_results=10):
    user_vector = client.json().get(f"user:{user_id}")["user_vector"]
    query = RangeQuery(vector=user_vector,
                       vector_field_name='movie_vector',
                       num_results=num_results,
                       filter_expression=filters,
                       return_fields=['title', 'overview', 'genres'])

    results = movie_index.query(query)

    return [(r['title'], r['overview'], r['genres'], r['vector_distance']) for r in results]

Top_picks_for_you = get_recommendations(user_id=42) # general SVD results, no filter

block_buster_filter = Num('revenue') > 30_000_000
block_buster_hits = get_recommendations(user_id=42, filters=block_buster_filter)

classics_filter = Num('release_date') < datetime.datetime(1990, 1, 1).timestamp()
classics = get_recommendations(user_id=42, filters=classics_filter)

popular_filter = (Num('popularity') > 50) & (Num('vote_average') > 7)
Whats_popular = get_recommendations(user_id=42, filters=popular_filter)

indie_filter = (Num('revenue') < 1_000_000) & (Num('popularity') > 10)
indie_hits = get_recommendations(user_id=42, filters=indie_filter)

fruity = Text('title') % 'apple|orange|peach|banana|grape|pineapple'
fruity_films = get_recommendations(user_id=42, filters=fruity)


In [17]:
# put all these titles into a single pandas dataframe, where each column is one category
all_recommendations = pd.DataFrame(columns=["top picks", "block busters", "classics", "what's popular", "indie hits", "fruity films"])
all_recommendations["top picks"] = [m[0] for m in Top_picks_for_you]
all_recommendations["block busters"] = [m[0] for m in block_buster_hits]
all_recommendations["classics"] = [m[0] for m in classics]
all_recommendations["what's popular"] = [m[0] for m in Whats_popular]
all_recommendations["indie hits"] = [m[0] for m in indie_hits]
all_recommendations["fruity films"] = [m[0] for m in fruity_films]

all_recommendations.head(10)

Unnamed: 0,top picks,block busters,classics,what's popular,indie hits,fruity films
0,The Godfather,The Godfather,The Godfather,The Shawshank Redemption,Castle in the Sky,A Clockwork Orange
1,The Godfather: Part II,The Godfather: Part II,The Godfather: Part II,Pulp Fiction,The Professional,James and the Giant Peach
2,The Shawshank Redemption,The Silence of the Lambs,The African Queen,The Dark Knight,Shine,What's Eating Gilbert Grape
3,Band of Brothers,Spirited Away,Amadeus,Fight Club,My Neighbor Totoro,Pineapple Express
4,Gladiator 1992,Forrest Gump,Star Wars,Blade Runner,Seven Samurai,The Grapes of Wrath
5,The African Queen,Pulp Fiction,One Flew Over the Cuckoo's Nest,Guardians of the Galaxy,Once Upon a Time in America,Bananas
6,The Silence of the Lambs,The Fugitive,The Empire Strikes Back,Whiplash,All About Eve,Orange County
7,Spirited Away,The Dark Knight,Taxi Driver,The Avengers,La Haine,The Apple Dumpling Gang
8,Forrest Gump,Amadeus,Cinema Paradiso,Big Hero 6,Cube,Adam's Apples
9,Pulp Fiction,Star Wars,The Philadelphia Story,Gone Girl,Arsenic and Old Lace,Herbie Goes Bananas


## Keeping Things Fresh
You've probably noticed that a few movies get repeated in these lists. That's not surprising, as all our results are personalized and fields like `popularity`, `vote_average`, and `revenue` are likely highly correlated. It's also likely that at least some of the movies we predict a given user will rate highly are ones they've already watched and rated highly.

We need a way to filter out movies that a user has already seen, as well as movies that we've already recommended to them.
We could use a Tag filter on our queries to exclude movies by their id, but this gets cumbersome quickly.
Luckily, Redis offers an easy answer to keeping recommendations new and interesting: Bloom filters.
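For readers new to the data structure, here is a toy Bloom filter in pure Python. It is only a sketch of the idea and makes no claim about how Redis implements its Bloom filter commands; the class name `ToyBloomFilter`, the bit-array size, and the number of hashes are assumptions chosen for this example. The key property is that an added item is always reported as present, while an item that was never added is usually, but not always, reported as absent.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative Bloom filter; not how Redis implements it internally."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # derive several bit positions by salting a single hash function
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means definitely never added; True means probably added
        return all(self.bits[pos] for pos in self._positions(item))


watched = ToyBloomFilter()
for movie_id in (1, 318, 858):  # hypothetical watched movieIds for user 42
    watched.add(f"42:{movie_id}")

print(watched.might_contain("42:318"))   # this key was added, so it is reported as present
print(watched.might_contain("42:5515"))  # never added; a false positive is unlikely but possible
```

For recommendations, an occasional false positive only means a movie is skipped unnecessarily, which is an acceptable trade for the compact storage.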

In [18]:
# rewrite the get_recommendations() function to use a bloom filter and apply it before we return results
def get_unique_recommendations(user_id, filters=None, num_results=10):
    user_data = client.json().get(f"user:{user_id}")
    user_vector = user_data["user_vector"]
    watched_movies = user_data["watched_list_ids"]

    # use a Bloom Filter to filter out movies that the user has already watched
    client.bf().insert('user_watched_list', [f"{user_id}:{movie_id}" for movie_id in watched_movies])

    query = RangeQuery(vector=user_vector,
                       vector_field_name='movie_vector',
                       num_results=num_results * 5,  # fetch more results to account for watched movies
                       filter_expression=filters,
                       return_fields=['title', 'overview', 'genres', 'movieId'],
    )
    results = movie_index.query(query)
    if not results:
        return []  # nothing matched, so there is nothing to check or record

    matches = client.bf().mexists("user_watched_list", *[f"{user_id}:{r['movieId']}" for r in results])

    recommendations = [
        (r['title'], r['overview'], r['genres'], r['vector_distance'], r['movieId'])
        for i, r in enumerate(results) if matches[i] == 0
    ][:num_results]

    # add these recommendations to the bloom filter so they don't appear again
    if recommendations:
        client.bf().insert('user_watched_list', [f"{user_id}:{r[4]}" for r in recommendations])
    return recommendations

# example usage
# create a bloom filter for all our users
try:
    client.bf().create("user_watched_list", 0.01, 10000)
except Exception:
    # the filter already exists from a previous run, so delete it and start fresh
    client.delete("user_watched_list")
    client.bf().create("user_watched_list", 0.01, 10000)

user_id = 42

top_picks_for_you = get_unique_recommendations(user_id=user_id, num_results=5)  # general SVD results, no filter
block_buster_hits = get_unique_recommendations(user_id=user_id, filters=block_buster_filter, num_results=5)
classic_movies = get_unique_recommendations(user_id=user_id, filters=classics_filter, num_results=5)
whats_popular = get_unique_recommendations(user_id=user_id, filters=popular_filter, num_results=5)
indie_hits = get_unique_recommendations(user_id=user_id, filters=indie_filter, num_results=5)

In [19]:
# put all these titles into a single pandas dataframe, where each column is one category
top_picks = pd.DataFrame({"top picks":[m[0] for m in top_picks_for_you]})
block_busters = pd.DataFrame({"block busters": [m[0] for m in block_buster_hits]})
classics = pd.DataFrame({"classics": [m[0] for m in classic_movies]})
popular = pd.DataFrame({"what's popular": [m[0] for m in whats_popular]})
indies = pd.DataFrame({"indie hits": [m[0] for m in indie_hits]})

all_recommendations = pd.concat([top_picks, block_busters, classics, popular, indies], axis=1)
all_recommendations.head()

Unnamed: 0,top picks,block busters,classics,what's popular,indie hits
0,The Godfather,Spirited Away,Taxi Driver,Blade Runner,Castle in the Sky
1,The Godfather: Part II,Amadeus,Cinema Paradiso,Whiplash,The Professional
2,Gladiator 1992,One Flew Over the Cuckoo's Nest,The Philadelphia Story,Big Hero 6,Shine
3,The African Queen,Fight Club,The Great Escape,Gone Girl,My Neighbor Totoro
4,The Silence of the Lambs,Dead Poets Society,The Bridge on the River Kwai,Avatar,Seven Samurai


## Conclusion
That's it! That's all it takes to build a highly scalable, personalized, customizable collaborative filtering recommendation system with Redis and RedisVL.


In [20]:
# clean up your index
while remaining := movie_index.clear():
    print(f"Deleted {remaining} keys")

client.delete("user_watched_list")
client.delete(*[f"user:{user_id}" for user_id in user_vectors_and_ids.keys()])

Deleted 4365 keys
Deleted 2000 keys
Deleted 1000 keys
Deleted 500 keys
Deleted 500 keys


671