![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# Recommendation Systems: Content Filtering with RedisVL

<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/recommendation-systems/00_content_filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Recommendation systems are a common application of machine learning and serve many industries, from e-commerce to music streaming platforms. However, there are many different architectures that can be used.

In this notebook we'll demonstrate how to build a **[content filtering](https://en.wikipedia.org/wiki/Recommender_system#:~:text=of%20hybrid%20systems.-,Content%2Dbased%20filtering,-%5Bedit%5D)**
recommender from scratch using `redisvl` and an IMDB movies dataset.

## What is content filtering?
Content filtering recommender systems are built on the premise that a person will want to be recommended things that are similar to things they already like.

In the case of movies, if a person watches and enjoys a nature documentary we should recommend other nature documentaries. Or if they like classic black & white horror films we should recommend more of those.

The question we need to answer is: what does it mean for movies to be similar? There are exact-matching strategies, such as using a movie's labelled genre like 'Horror' or 'Sci Fi', but that can lock people into only a few genres. And what if it's not the genre that a person likes, but certain story arcs that are common across many genres?

For our content filtering recommender we'll measure similarity between movies as semantic similarity of their descriptions and keywords.
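
To make "semantic similarity" concrete: each description is later mapped to a vector, and two movies count as similar when their vectors point in nearly the same direction. Here is a minimal sketch of the cosine distance used for that comparison. The function name and the three-dimensional toy vectors are invented for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 means same direction, 2 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy vectors, invented for illustration only
ocean_documentary = [0.9, 0.1, 0.0]
submarine_adventure = [0.8, 0.3, 0.1]
courtroom_drama = [0.0, 0.2, 0.9]

print(cosine_distance(ocean_documentary, submarine_adventure))  # small distance
print(cosine_distance(ocean_documentary, courtroom_drama))      # large distance
```

A smaller distance means a closer match, which is why the queries later in this notebook keep only results below a distance threshold.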

## Let's Begin!

## Environment Setup

In [1]:
# NBVAL_SKIP
!pip install -q redis redisvl sentence_transformers pandas requests

### Install Redis Stack

Later in this tutorial, Redis will be used to store, index, and query vector
embeddings. **We need to make sure we have a Redis instance available.**

#### Redis in Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
%%sh
# NBVAL_SKIP
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### Other ways to get Redis
There are many ways to get the necessary Redis Stack instance running:
1. In the cloud, deploy a [FREE instance of Redis](https://redis.io/try-free/). Or, if you have your own version of Redis Enterprise running, that works too!
2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With Docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

### Define the Redis Connection URL

By default this notebook connects to the local instance of Redis Stack. **If you have your own Redis Enterprise or Redis Cloud instance**, replace the `REDIS_PASSWORD`, `REDIS_HOST`, and `REDIS_PORT` values with your own.

In [3]:
import ast
import os
import pandas as pd
import pickle
import requests
import warnings
warnings.filterwarnings('ignore')

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
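
If your password contains characters such as `@` or `:`, or your endpoint uses SSL, the URL needs a little more care. The sketch below shows one way to assemble it. `build_redis_url` is a hypothetical helper written for this example (it is not part of `redis` or `redisvl`), and the host and password values are placeholders.

```python
from urllib.parse import quote

def build_redis_url(host, port, password="", use_ssl=False):
    """Assemble a Redis connection URL (hypothetical helper, not a library API)."""
    scheme = "rediss" if use_ssl else "redis"
    # Percent-encode the password so characters like '@' or ':' do not break the URL
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"

print(build_redis_url("localhost", 6379))
print(build_redis_url("example-host", 18374, password="p@ss", use_ssl=True))
```

The helper also leaves out the `:@` prefix entirely when no password is set, which keeps the local-default URL easy to read.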

## Prepare The Dataset

Start by downloading the movies data and doing a quick inspection of it.

In [4]:
try:
    df = pd.read_csv("datasets/content_filtering/25k_imdb_movie_dataset.csv")
except FileNotFoundError:
    # download the file
    url = 'https://redis-ai-resources.s3.us-east-2.amazonaws.com/recommenders/datasets/content-filtering/25k_imdb_movie_dataset.csv'
    r = requests.get(url)
    r.raise_for_status()  # fail loudly instead of saving an error page as a CSV

    # save the file as a csv
    os.makedirs('./datasets/content_filtering', exist_ok=True)
    with open('./datasets/content_filtering/25k_imdb_movie_dataset.csv', 'wb') as f:
        f.write(r.content)
    df = pd.read_csv("datasets/content_filtering/25k_imdb_movie_dataset.csv")

df.head()

Unnamed: 0,title,runtime,rating,rating_count,genres,overview,keywords,director,cast,writer,year,path
0,The Story of the Kelly Gang,1 hour 10 minutes,6.0,772,"['Action', 'Adventure', 'Biography']","Story of Ned Kelly, an infamous 19th-century A...","['ned kelly', 'australia', 'historic figure', ...",Charles Tait,"['Elizabeth Tait', 'John Tait', 'Nicholas Brie...",Charles Tait,1906,/title/tt0000574/
1,Fantômas - À l'ombre de la guillotine,not-released,6.9,2300,"['Crime', 'Drama']",Inspector Juve is tasked to investigate and ca...,"['silent film', 'france', 'hotel', 'duchess', ...",Louis Feuillade,"['Louis Feuillade', 'Pierre Souvestre', 'René ...",Marcel Allain,1913,/title/tt0002844/
2,Cabiria,2 hours 28 minutes,7.1,3500,"['Adventure', 'Drama', 'History']",Cabiria is a Roman child when her home is dest...,"['carthage', 'slave', 'moloch', '3rd century b...",Giovanni Pastrone,"['Titus Livius', 'Giovanni Pastrone', 'Italia ...",Gabriele D'Annunzio,1914,/title/tt0003740/
3,The Life of General Villa,not-released,6.7,65,"['Action', 'Adventure', 'Biography']",The life and career of Panccho Villa from youn...,"['chihuahua mexico', 'chihuahua', 'sonora mexi...",Christy Cabanne,"['Frank E. Woods', 'Raoul Walsh', 'Eagle Eye',...",Raoul Walsh,1914,/title/tt0004223/
4,The Patchwork Girl of Oz,not-released,5.4,484,"['Adventure', 'Family', 'Fantasy']","Ojo and Unc Nunkie are out of food, so they de...","['silent film', 'journey', 'magic wand', 'wiza...",J. Farrell MacDonald,"['Violet MacMillan', 'Frank Moore', 'Raymond R...",L. Frank Baum,1914,/title/tt0004457/


As with any machine learning task, the first step is to clean our data.

We'll drop some columns that we don't plan to use, and fill missing values with some reasonable defaults.

Lastly, we'll do a quick check to make sure we've filled in all the null and missing values.

In [5]:
roman_numerals = ['(I)','(II)','(III)','(IV)', '(V)', '(VI)', '(VII)', '(VIII)', '(IX)', '(XI)', '(XII)', '(XVI)', '(XIV)', '(XXXIII)', '(XVIII)', '(XIX)', '(XXVII)']

def replace_year(x):
    if x in roman_numerals:
        return 1998 # the average year of the dataset
    else:
        return x

df.drop(columns=['runtime', 'writer', 'path'], inplace=True)
df['year'] = df['year'].apply(replace_year)             # replace roman numerals with average year
df['genres'] = df['genres'].apply(ast.literal_eval)     # convert string representation of list to list
df['keywords'] = df['keywords'].apply(ast.literal_eval) # convert string representation of list to list
df['cast'] = df['cast'].apply(ast.literal_eval)         # convert string representation of list to list
df = df[~df['overview'].isnull()]                       # drop rows with missing overviews
df = df[~df['overview'].isin(['none'])]                 # drop rows with 'none' as the overview

# make sure we've filled all missing values
df.isnull().sum()

Unnamed: 0,0
title,0
rating,0
rating_count,0
genres,0
overview,0
keywords,0
director,0
cast,0
year,0
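
The list-valued columns arrive from the CSV as plain strings, which is why the cell above runs them through `ast.literal_eval`. Here is a small standalone sketch of that conversion, using a sample value copied from the `genres` column shown earlier:

```python
import ast

raw = "['Action', 'Adventure', 'Biography']"  # how the CSV stores a genres cell
parsed = ast.literal_eval(raw)                 # evaluates Python literals only

print(type(raw).__name__, type(parsed).__name__)
print(parsed[0])
```

Unlike `eval`, `ast.literal_eval` only accepts literals such as lists, strings, and numbers, so it is a safer choice for text that comes from a file.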


## Generate Vector Embeddings

Since we define movie similarity as the semantic similarity of movie descriptions, we need a way to generate semantic vector embeddings of these descriptions.

RedisVL supports many different embedding generators. For this example we'll use a HuggingFace model that is rated well for semantic similarity use cases.

RedisVL also supports complex query logic beyond vector similarity. To showcase this, we'll generate an embedding from each movie's `title`, `overview` text, and list of plot `keywords`,
and use the remaining fields, like `genres`, `year`, and `rating`, as filterable fields to narrow our vector queries.


In [6]:
# add a column to the dataframe with all the text we want to embed
df["full_text"] = df["title"] + ". " + df["overview"] + " " + df['keywords'].apply(lambda x: ', '.join(x))
df["full_text"][0]

'The Story of the Kelly Gang. Story of Ned Kelly, an infamous 19th-century Australian outlaw. ned kelly, australia, historic figure, australian western, first of its kind, directorial debut, australian history, 19th century, victoria australia, australian'

In [None]:
# NBVAL_SKIP
# This step takes a while (about 10 minutes at the time of writing), but only needs to be done once for the entire dataset.
# We've saved the vectors to a file for you, so if you don't want to wait,
# skip this cell and load the vectors from the file in the next cell.
from redisvl.utils.vectorize import HFTextVectorizer

vectorizer = HFTextVectorizer(model='sentence-transformers/paraphrase-MiniLM-L6-v2')

df['embedding'] = df['full_text'].apply(lambda x: vectorizer.embed(x, as_buffer=False))
with open('datasets/content_filtering/text_embeddings.pkl', 'wb') as f:
    pickle.dump(df['embedding'], f)

In [7]:
try:
    with open('datasets/content_filtering/text_embeddings.pkl', 'rb') as vector_file:
        df['embedding'] = pickle.load(vector_file)
except FileNotFoundError:
    # the embeddings are not on disk yet, so download the precomputed file
    embeddings_url = 'https://redis-ai-resources.s3.us-east-2.amazonaws.com/recommenders/datasets/content-filtering/text_embeddings.pkl'
    r = requests.get(embeddings_url)
    r.raise_for_status()
    os.makedirs('./datasets/content_filtering', exist_ok=True)
    with open('./datasets/content_filtering/text_embeddings.pkl', 'wb') as f:
        f.write(r.content)
    with open('datasets/content_filtering/text_embeddings.pkl', 'rb') as vector_file:
        df['embedding'] = pickle.load(vector_file)
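
Both the dataset cell and the embeddings cell follow the same pattern: use the local file if it exists, otherwise download it first. That pattern could be factored into one helper, roughly as sketched below. `ensure_local_file` and its `fetch` parameter are hypothetical names introduced for this example, and the demo uses a fake fetcher and a placeholder URL so it runs without network access.

```python
import os
import tempfile
import urllib.request

def ensure_local_file(path, url, fetch=None):
    """Return `path`, downloading it from `url` first if it is not on disk yet."""
    if os.path.exists(path):
        return path
    if fetch is None:
        # Default fetcher based on the standard library
        def fetch(u):
            with urllib.request.urlopen(u) as response:
                return response.read()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        f.write(fetch(url))
    return path

# Demonstration with a fake fetcher, so no network access is needed
calls = []

def fake_fetch(url):
    calls.append(url)
    return b"fake bytes"

target = os.path.join(tempfile.mkdtemp(), "datasets", "demo.bin")
ensure_local_file(target, "https://example.invalid/demo.bin", fetch=fake_fetch)
ensure_local_file(target, "https://example.invalid/demo.bin", fetch=fake_fetch)
print(len(calls))  # the second call finds the file and skips the download
```

Passing the fetcher in as an argument is a design choice that keeps the helper easy to test; in the notebook itself you would simply rely on the default.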

## Define our Search Schema
Our data is now ready to be loaded into Redis. The last step is to define our search index schema that specifies each of our data fields and the size and type of our embedding vectors.

We'll define it inline as a Python dictionary below (the same definition could also be loaded from a YAML file such as the accompanying `content_filtering_schema.yaml`).

This schema defines what each entry will look like within Redis. It will need to specify the name of each field, like `title`, `rating`, and `rating_count`, as well as the type of each field, like `text` or `numeric`.

The vector component of each entry similarly needs its dimension (dims), distance metric, algorithm and datatype (dtype) attributes specified.

In [8]:
from redis import Redis
from redisvl.schema import IndexSchema
from redisvl.index import SearchIndex

# define a redis client
client = Redis.from_url(REDIS_URL)

# define our movie schema
movie_schema = IndexSchema.from_dict(
    {
        'index': {
            'name': 'movies_recommendation',
            'prefix': 'movie',
            'storage_type': 'json'
        },
        'fields': [
            {'name': 'title', 'type': 'text'},
            {'name': 'rating', 'type': 'numeric'},
            {'name': 'rating_count', 'type': 'numeric'},
            {'name': 'genres', 'type': 'tag'},
            {'name': 'overview', 'type': 'text'},
            {'name': 'keywords', 'type': 'tag'},
            {'name': 'cast', 'type': 'tag'},
            {'name': 'writer', 'type': 'text'},
            {'name': 'year', 'type': 'numeric'},
            {'name': 'full_text', 'type': 'text'},
            {
                'name': 'embedding',
                'type': 'vector',
                'attrs': {
                    'dims': 384,
                    'algorithm': 'flat',
                    'datatype': 'float32',
                    'distance_metric': 'cosine'
                }
            }
        ]
})

index = SearchIndex(movie_schema, redis_client=client)
index.create(overwrite=True, drop=True)
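
The `dims` value of 384 in the schema above has to equal the length of every embedding we load, and `float32` fixes the size of each component. The sketch below illustrates both points with a placeholder vector. `validate_embedding` is a hypothetical helper written for this example, not a RedisVL function.

```python
import struct

DIMS = 384  # must match the output size of the embedding model

def validate_embedding(vector, dims=DIMS):
    """Raise ValueError if the vector length differs from the schema's dims."""
    if len(vector) != dims:
        raise ValueError(f"expected {dims} values, got {len(vector)}")
    return vector

embedding = validate_embedding([0.0] * DIMS)  # placeholder vector of zeros

# float32 means 4 bytes per component when the vector is packed as raw bytes
packed = struct.pack(f"<{DIMS}f", *embedding)
print(len(packed))
```

Because this index uses JSON storage, we keep the embeddings as lists of floats (hence `as_buffer=False` earlier) rather than packed bytes; the byte count is shown only to give a feel for the per-vector footprint.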

## Load movies into the vector DB
Now that we have all our data cleaned and a defined schema, we can load the data into Redis.

We need to convert this data into a format that RedisVL can understand, which is a list of dictionaries.


In [9]:
data = df.to_dict(orient='records')
keys = index.load(data)

## Querying to get recommendations

We now have a working content filtering recommender system. All we need is a starting point, so let's say we want to find movies similar to "20,000 Leagues Under the Sea".

We can use the title to find the movie in the dataset and then use the vector to find similar movies.

In [10]:
from redisvl.query import RangeQuery

query_vector = df[df['title'] == '20,000 Leagues Under the Sea']['embedding'].values[0] # one good match

query = RangeQuery(
    vector=query_vector,
    vector_field_name='embedding',
    num_results=5,
    distance_threshold=0.7,
    return_fields=['title', 'overview', 'vector_distance']
)

results = index.query(query)
for r in results:
    print(r)

{'id': 'movie:345589922cb348a098930568d5e7d02a', 'vector_distance': '0.584869861603', 'title': 'The Odyssey', 'overview': 'The aquatic adventure of the highly influential and fearlessly ambitious pioneer, innovator, filmmaker, researcher, and conservationist, Jacques-Yves Cousteau, covers roughly thirty years of an inarguably rich in achievements life.'}
{'id': 'movie:5147986e894d43879f4d90d6ed85dfd0', 'vector_distance': '0.633292078972', 'title': 'The Inventor', 'overview': 'Inventing flying contraptions, war machines and studying cadavers, Leonardo da Vinci tackles the meaning of life itself with the help of French princess Marguerite de Nevarre.'}
{'id': 'movie:da53156795ab4026b51e9dde88b02fa6', 'vector_distance': '0.658123493195', 'title': 'Ruin', 'overview': 'The film follows a nameless ex-Nazi captain who navigates the ruins of post-WWII Germany determined to atone for his crimes during the war by hunting down the surviving members of his former SS Death Squad.'}
{'id': 'movie:3e
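
To see how `distance_threshold` and `num_results` interact, here is a brute-force analogue of a range query written in plain Python. It works on precomputed distances rather than talking to Redis. The first three distances are rounded from the output above, the last entry is invented, and treating the threshold as inclusive is an assumption of this sketch.

```python
def range_search(distances, distance_threshold, num_results):
    """Brute-force analogue of a range query over precomputed distances.

    `distances` maps a title to its distance from the query vector.
    """
    within_range = [(t, d) for t, d in distances.items() if d <= distance_threshold]
    within_range.sort(key=lambda pair: pair[1])  # closest first
    return within_range[:num_results]

distances = {
    "The Odyssey": 0.585,
    "The Inventor": 0.633,
    "Ruin": 0.658,
    "Hypothetical far match": 0.910,  # invented entry, outside the threshold
}

for title, d in range_search(distances, distance_threshold=0.7, num_results=5):
    print(title, d)
```

Raising the threshold admits looser matches, while `num_results` only caps how many of the in-range matches come back.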

## Generating user recommendations
This system works, but we can make it even better.

Production recommender systems often have fields that can be configured. Users can specify if they want to see a romantic comedy or a horror film, or only see new releases.

Let's go ahead and add this functionality by using the tags we've defined in our schema.

In [11]:
from redisvl.query.filter import Tag, Num, Text

def make_filter(genres=None, release_year=None, keywords=None):
    flexible_filter = (
        (Num("year") > release_year) &  # only show movies released after this year
        (Tag("genres") == genres) &     # only show movies that match at least one in list of genres
        (Text("full_text") % keywords)  # only show movies that contain at least one of the keywords
    )
    return flexible_filter

def get_recommendations(movie_vector, num_results=5, distance=0.6, filter=None):
    query = RangeQuery(
        vector=movie_vector,
        vector_field_name='embedding',
        num_results=num_results,
        distance_threshold=distance,
        return_fields=['title', 'overview', 'genres'],
        filter_expression=filter,
    )

    recommendations = index.query(query)
    return recommendations
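
It can help to restate what the year and genre clauses in `make_filter` mean, independent of Redis. The sketch below does that in plain Python over a few invented records. `passes_filter` is a hypothetical function for illustration only; the keyword clause is left out, and arguments left as `None` are treated here as "no restriction".

```python
def passes_filter(movie, genres=None, release_year=None):
    """Plain-Python restatement of the year and genre clauses (illustrative only)."""
    if release_year is not None and not movie["year"] > release_year:
        return False  # must be released strictly after the given year
    if genres is not None and not set(genres) & set(movie["genres"]):
        return False  # needs at least one genre in common
    return True

# Invented records for illustration
movies = [
    {"title": "A", "year": 1990, "genres": ["Horror"]},
    {"title": "B", "year": 2005, "genres": ["Horror", "Thriller"]},
    {"title": "C", "year": 2010, "genres": ["Comedy"]},
]

kept = [m["title"] for m in movies if passes_filter(m, genres=["Horror"], release_year=1990)]
print(kept)
```

Note that movie "A" is excluded: `>` is a strict comparison, so a film from exactly 1990 does not pass a `release_year=1990` filter.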

As a final demonstration we'll find movies similar to the classic horror film 'Nosferatu'.
The process has 3 steps:
- fetch the vector embedding of our film Nosferatu
- optionally define any hard filters we want. Here we'll specify we want horror movies made after 1990
- perform the vector range query to find similar movies that meet our filter criteria

In [12]:
movie_vector = df[df['title'] == 'Nosferatu']['embedding'].values[0]

filter = make_filter(genres=['Horror'], release_year=1990)

recs = get_recommendations(movie_vector, distance=0.8, filter=filter)

for rec in recs:
    print(f"- {rec['title']}:\n\t {rec['overview']}\n\t Genres: {rec['genres']}")

- Wolfman:
	 A man becomes afflicted by an ancient curse after he is bitten by a werewolf.
	 Genres: ["Horror"]
- Off Season:
	 Tenn's relentless search for his father takes him back to his childhood town only to find a community gripped by fear. As he travels deeper into the bitter winter wilderness of the town he uncovers a dreadful secret buried long ago.
	 Genres: ["Horror","Mystery","Thriller"]
- Pieces:
	 The co-eds of a Boston college campus are targeted by a mysterious killer who is creating a human jigsaw puzzle from their body parts.
	 Genres: ["Horror","Mystery","Thriller"]
- Cursed:
	 A prominent psychiatrist at a state run hospital wrestles with madness and a dark supernatural force as he and a female police detective race to stop an escaped patient from butchering five people held hostage in a remote mansion.
	 Genres: ["Horror","Thriller"]
- The Home:
	 The Home unfolds after a young man is nearly killed during an accident that leaves him physically and emotionally scarr

### Now you have a working content filtering recommender system with Redis.
Don't forget to clean up once you're done.

In [13]:
# clean up your index
while deleted := index.clear():
    print(f"Deleted {deleted} keys")

Deleted 143 keys
