# Movie recommendation system with Qdrant sparse vectors

This notebook is a simple example of how to build a movie recommendation system with Qdrant sparse vectors and the MovieLens 1M dataset.

## How it works

The MovieLens dataset contains a list of movies and the ratings that users gave them. We will use this data to build a recommendation system.

Our recommendation system will use an approach called **collaborative filtering**.

The idea behind collaborative filtering is that users with similar tastes tend to like similar movies.
We will use this idea to find the users whose ratings are most similar to our own and see which movies they liked that we haven't seen yet.


1. We will represent each user's ratings as a vector in a sparse high-dimensional space.
2. We will use Qdrant to index these vectors.
3. We will use Qdrant to find the users whose ratings are most similar to our own.
4. We will see which movies these similar users liked that we haven't seen yet.
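
To make steps 1 to 3 concrete, here is a minimal pure-Python sketch on made-up data. It assumes similarity is the dot product of two sparse vectors and uses a brute-force scan as a stand-in for the index that Qdrant builds; the names `toy_users`, `sparse_dot` and `most_similar` are hypothetical and not part of the notebook or of the Qdrant client.

```python
# Toy ratings: {user: {movie_id: normalized_rating}} (made-up data).
# Only rated movies are stored, which is what makes the vectors sparse.
toy_users = {
    "alice": {1: 1.0, 2: 1.0, 3: -1.0},
    "bob": {1: 1.0, 3: 1.0, 4: 1.0},
    "carol": {2: -1.0, 3: -1.0, 5: 1.0},
}


def sparse_dot(a, b):
    """Dot product of two sparse vectors; only shared movie ids contribute."""
    return sum(value * b[idx] for idx, value in a.items() if idx in b)


def most_similar(query, users, limit=2):
    """Brute-force stand-in for an indexed search: rank users by dot product."""
    scored = [(sparse_dot(query, vec), name) for name, vec in users.items()]
    scored.sort(reverse=True)
    return [(name, score) for score, name in scored[:limit]]


# We liked movie 1 and disliked movie 3.
me = {1: 1.0, 3: -1.0}
neighbours = most_similar(me, toy_users)
print(neighbours)
```

Agreeing on a dislike (both ratings negative) raises the score just as agreeing on a like does, while a disagreement lowers it. That is why signed ratings are useful here.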


In [None]:
!pip install qdrant-client pandas

In [25]:
# Download and unzip the dataset

!mkdir -p data
!wget https://files.grouplens.org/datasets/movielens/ml-1m.zip
!unzip ml-1m.zip -d data

In [26]:
from qdrant_client import QdrantClient, models
import pandas as pd

In [27]:
users = pd.read_csv(
    "./data/ml-1m/users.dat",
    sep="::",
    names=["user_id", "gender", "age", "occupation", "zip"],
    engine="python",
)
users

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,02460
4,5,M,25,20,55455
...,...,...,...,...,...
6035,6036,F,25,15,32603
6036,6037,F,45,1,76006
6037,6038,F,56,1,14706
6038,6039,F,45,0,01060


In [28]:
movies = pd.read_csv(
    "./data/ml-1m/movies.dat",
    sep="::",
    names=["movie_id", "title", "genres"],
    engine="python",
    encoding="latin-1",
)
movies

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
3878,3948,Meet the Parents (2000),Comedy
3879,3949,Requiem for a Dream (2000),Drama
3880,3950,Tigerland (2000),Drama
3881,3951,Two Family House (2000),Drama


In [29]:
ratings = pd.read_csv(
    "./data/ml-1m/ratings.dat",
    sep="::",
    names=["user_id", "movie_id", "rating", "timestamp"],
    engine="python",
)

In [30]:
# Normalize ratings

# Sparse vectors can take advantage of negative values, so we normalize ratings to have mean 0 and std 1.
# This way, movies we don't like are taken into account as well.

ratings.rating = (ratings.rating - ratings.rating.mean()) / ratings.rating.std()
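
As a small illustration of what this normalization does, the sketch below applies the same formula to a handful of made-up ratings using only the standard library. The list `raw` is hypothetical; `statistics.stdev` is the sample standard deviation, which matches the default of pandas' `.std()`.

```python
from statistics import mean, stdev

# Made-up 1-5 star ratings, standing in for the `rating` column.
raw = [1, 2, 3, 4, 5, 5, 4, 4]

mu = mean(raw)
sigma = stdev(raw)  # sample standard deviation (ddof=1), like pandas' default

# Same formula as in the cell above: subtract the mean, divide by the std.
normalized = [(r - mu) / sigma for r in raw]

for r, z in zip(raw, normalized):
    print(r, round(z, 2))
```

Ratings below the mean become negative and ratings above it become positive, so a low rating counts as evidence of dislike rather than as weak evidence of liking.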

In [31]:
# Convert ratings to sparse vectors

from collections import defaultdict

user_sparse_vectors = defaultdict(lambda: {"values": [], "indices": []})

for row in ratings.itertuples():
    user_sparse_vectors[row.user_id]["values"].append(row.rating)
    user_sparse_vectors[row.user_id]["indices"].append(row.movie_id)
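
To show the shape this loop produces, here is the same grouping applied to three made-up rating rows. The `Rating` namedtuple and the `toy_*` names are hypothetical stand-ins for the rows yielded by `ratings.itertuples()`.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the rows produced by ratings.itertuples().
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating"])

toy_ratings = [
    Rating(1, 10, 0.5),
    Rating(1, 42, -1.2),
    Rating(2, 10, 1.1),
]

# One entry per user: parallel lists of movie ids (indices) and ratings (values).
toy_vectors = defaultdict(lambda: {"values": [], "indices": []})
for row in toy_ratings:
    toy_vectors[row.user_id]["values"].append(row.rating)
    toy_vectors[row.user_id]["indices"].append(row.movie_id)

print(dict(toy_vectors))
```

Each movie id serves directly as a dimension index, and a user's vector only stores the movies that user actually rated.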

In [32]:
# For this small dataset we can use in-memory Qdrant.
# For production, we recommend using the server-based version.

qdrant = QdrantClient(":memory:")  # or QdrantClient("http://localhost:6333")

In [33]:
# Create a collection configured for sparse vectors
# Sparse vectors don't require a dimension to be specified, because it's inferred from the data automatically

qdrant.create_collection(
    "movielens",
    vectors_config={},
    sparse_vectors_config={"ratings": models.SparseVectorParams()},
)

True

In [34]:
# Upload all users' ratings as sparse vectors


def data_generator():
    for user in users.itertuples():
        yield models.PointStruct(
            id=user.user_id,
            vector={"ratings": user_sparse_vectors[user.user_id]},
            payload=user._asdict(),
        )


# The generator is consumed lazily, so points are created as they are uploaded
qdrant.upload_points("movielens", data_generator())

In [35]:
# Let's try to recommend something for ourselves

#  1 - like
# -1 - dislike

# To look up a movie id, search the titles, e.g.:
# movies[movies.title.str.contains("Matrix", case=False)]

my_ratings = {
    2571: 1,  # Matrix
    329: 1,  # Star Trek
    260: 1,  # Star Wars
    2288: -1,  # The Thing
    1: 1,  # Toy Story
    1721: -1,  # Titanic
    296: -1,  # Pulp Fiction
    356: 1,  # Forrest Gump
    2116: 1,  # Lord of the Rings
    1291: -1,  # Indiana Jones
    1036: -1,  # Die Hard
}

# Flipped preferences (likes become dislikes); not used in the cells below
inverse_ratings = {k: -v for k, v in my_ratings.items()}


def to_vector(ratings):
    vector = models.SparseVector(values=[], indices=[])
    for movie_id, rating in ratings.items():
        vector.values.append(rating)
        vector.indices.append(movie_id)
    return vector

In [None]:
# Find users with similar taste
# Note: recent qdrant-client releases deprecate `search` in favor of `query_points`

results = qdrant.search(
    "movielens",
    query_vector=models.NamedSparseVector(name="ratings", vector=to_vector(my_ratings)),
    with_vectors=True,  # We will use those to find new movies
    limit=20,
)

In [37]:
# Sum the normalized ratings that similar users gave to each movie we haven't rated


def results_to_scores(results):
    movie_scores = defaultdict(lambda: 0)

    for user in results:
        user_scores = user.vector["ratings"]
        for idx, rating in zip(user_scores.indices, user_scores.values):
            if idx in my_ratings:
                continue
            movie_scores[idx] += rating

    return movie_scores
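
The aggregation can be tried without a running collection. The sketch below repeats the same logic on two fake search hits built with `SimpleNamespace`; `fake_hit`, `toy_hits` and `already_rated` are hypothetical names, and the objects only mimic the `.vector["ratings"].indices` / `.values` access used above.

```python
from collections import defaultdict
from types import SimpleNamespace


def fake_hit(indices, values):
    """Hypothetical stand-in for a search hit carrying a named sparse vector."""
    sparse = SimpleNamespace(indices=indices, values=values)
    return SimpleNamespace(vector={"ratings": sparse})


toy_hits = [
    fake_hit([1, 2, 3], [1.0, 0.8, -0.5]),
    fake_hit([2, 3], [0.6, -0.7]),
]
already_rated = {1}  # plays the role of my_ratings

toy_scores = defaultdict(float)
for hit in toy_hits:
    sparse = hit.vector["ratings"]
    for idx, rating in zip(sparse.indices, sparse.values):
        if idx in already_rated:
            continue  # skip movies we have already rated
        toy_scores[idx] += rating

ranked = sorted(toy_scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Because the ratings are normalized, the score is a signed sum rather than a count: movie 2 rises because both neighbours rated it above average, while movie 3 sinks because both rated it below average.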

In [38]:
# Sort movies by score and print top 5

movie_scores = results_to_scores(results)
top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True)

for movie_id, score in top_movies[:5]:
    print(movies[movies.movie_id == movie_id].title.values[0], score)

Star Wars: Episode V - The Empire Strikes Back (1980) 20.023877887283938
Star Wars: Episode VI - Return of the Jedi (1983) 16.44318377549194
Princess Bride, The (1987) 15.84006760423755
Raiders of the Lost Ark (1981) 14.94489407628955
Sixth Sense, The (1999) 14.570321651488953


In [39]:
# Find users with similar taste, but only within my age group
# We can also filter by other fields, like `gender`, `occupation`, etc.

results = qdrant.search(
    "movielens",
    query_vector=models.NamedSparseVector(name="ratings", vector=to_vector(my_ratings)),
    query_filter=models.Filter(
        must=[models.FieldCondition(key="age", match=models.MatchValue(value=25))]
    ),
    with_vectors=True,
    limit=20,
)

movie_scores = results_to_scores(results)
top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True)

for movie_id, score in top_movies[:5]:
    print(movies[movies.movie_id == movie_id].title.values[0], score)

Princess Bride, The (1987) 16.214640029038147
Star Wars: Episode V - The Empire Strikes Back (1980) 14.652836719595939
Blade Runner (1982) 13.52911944519415
Usual Suspects, The (1995) 13.446604377087162
Godfather, The (1972) 13.300575698740357
