<div style="background-color: #1B1A21; text-align: right; margin-bottom: -1px">
    <img src="https://raw.githubusercontent.com/singlestore-labs/notebook-picture/main/singlestore-banner.png" style="padding: 0px; padding-right: 20px; margin: 0px; padding-top: 20px; height: 60px"/>
    <img src="https://raw.githubusercontent.com/singlestore-labs/notebook-picture/main/banner-colors.png" style="width:100%; height: 50px; padding: 0px; margin: 0px; margin-bottom: -8px"/>
</div>

# Movie Recommendation

*Source*: [Full MovieLens 25M Dataset](https://grouplens.org/datasets/movielens/25m/)

This notebook demonstrates how SingleStoreDB helps you build a simple Movie Recommender System.

<img src="https://raw.githubusercontent.com/singlestore-labs/notebook-picture/main/database_tables.png" width="1000">

## 1. Install required libraries

Install the library for vectorizing the data (up to 2 minutes).

In [9]:
!pip install sentence-transformers --quiet

## 2. Create database and ingest data

Create the `movie_recommender` database.

In [15]:
%%sql
DROP DATABASE IF EXISTS movie_recommender;

CREATE DATABASE movie_recommender;



<div class="alert alert-block alert-danger" style="font-size: 150%; font-weight: bold">
    <p style="float: left; padding-right: 20px; padding-left: 10px"><img src="https://raw.githubusercontent.com/singlestore-labs/notebook-picture/main/caution.png" style="height: 55px; vertical-align: middle"/></p>
    <p>Make sure to select the <tt style="font-size: 80%">movie_recommender</tt> database from the drop-down menu at the top of this notebook.
    It updates the <tt style="font-size: 80%">connection_url</tt> to connect to that database.</p>
</div>

Create `tags` table and start pipeline.

In [18]:
%%sql
CREATE TABLE IF NOT EXISTS tags (
    `userId` bigint(20) NULL,
    `movieId` bigint(20) NULL,
    `tag` text CHARACTER SET utf8 COLLATE utf8_general_ci NULL,
    `timestamp` bigint(20) NULL
);

CREATE PIPELINE tags
    AS LOAD DATA S3 'studiotutorials/movielens/tags.csv'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    MAX_PARTITIONS_PER_BATCH 1
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `tags`
    FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
    LINES TERMINATED BY '\r\n'
    NULL DEFINED BY ''
    IGNORE 1 LINES
    (userId, movieId, tag, timestamp);

START PIPELINE tags;



Create `ratings` table and start pipeline.

In [19]:
%%sql
CREATE TABLE IF NOT EXISTS ratings (
    userId bigint(20) DEFAULT NULL,
    movieId bigint(20) DEFAULT NULL,
    rating double DEFAULT NULL,
    timestamp bigint(20) DEFAULT NULL
);

CREATE PIPELINE ratings
    AS LOAD DATA S3 'studiotutorials/movielens/ratings.csv'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    MAX_PARTITIONS_PER_BATCH 1
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `ratings`
    FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
    LINES TERMINATED BY '\r\n'
    NULL DEFINED BY ''
    IGNORE 1 LINES
    (userId, movieId, rating, timestamp);

START PIPELINE ratings;



Create `movies` table and start pipeline.

In [20]:
%%sql
CREATE TABLE movies (
    movieId bigint(20) DEFAULT NULL,
    title text CHARACTER SET utf8 COLLATE utf8_general_ci,
    genres text CHARACTER SET utf8 COLLATE utf8_general_ci,
    FULLTEXT(title)
);

CREATE PIPELINE movies
    AS LOAD DATA S3 'studiotutorials/movielens/movies.csv'
    CONFIG '{\"region\":\"us-east-1\", \"disable_gunzip\": false}'
    BATCH_INTERVAL 2500
    MAX_PARTITIONS_PER_BATCH 1
    DISABLE OUT_OF_ORDER OPTIMIZATION
    DISABLE OFFSETS METADATA GC
    SKIP DUPLICATE KEY ERRORS
    INTO TABLE `movies`
    FIELDS TERMINATED BY ',' ENCLOSED BY '"' ESCAPED BY '\\'
    LINES TERMINATED BY '\r\n'
    NULL DEFINED BY ''
    IGNORE 1 LINES
    (movieId, title, genres);

START PIPELINE movies;



### Check that all the data has been loaded

There should be about 25 million rows in `ratings`, about 62,000 in `movies`, and about 1 million in `tags`. If the counts
are lower than that, the pipelines are still running; rerun the query in a few seconds.

In [21]:
%%sql
SELECT COUNT(*) AS count_rows FROM ratings
UNION ALL
SELECT COUNT(*) AS count_rows FROM movies
UNION ALL
SELECT COUNT(*) AS count_rows FROM tags

count_rows
0
62423
1093360
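
Because the pipelines load in the background, the count check may need to be repeated. The following is a minimal sketch of a polling helper, not part of the notebook or of SingleStoreDB: `wait_for_rows` and its parameters are hypothetical names, and `get_count` is assumed to be any zero-argument callable that returns the current row count (for example, a wrapper around `SELECT COUNT(*) FROM ratings`). A fake counter stands in for the database here.

```python
import time


def wait_for_rows(get_count, expected, timeout_s=300.0, poll_s=5.0, sleep=time.sleep):
    """Poll get_count() until it reaches `expected` or the timeout elapses.

    get_count is a hypothetical zero-argument callable returning the current
    row count. Returns the last observed count.
    """
    waited = 0.0
    count = get_count()
    while count < expected and waited < timeout_s:
        sleep(poll_s)
        waited += poll_s
        count = get_count()
    return count


# Stand-in for a real query: pretend the table fills up over three polls.
_fake_counts = iter([0, 10, 25])
final = wait_for_rows(lambda: next(_fake_counts), expected=25, sleep=lambda s: None)
print(final)
```

The `sleep` parameter is injectable only so the example runs instantly; in a notebook the default `time.sleep` would be the natural choice.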


### Join `movies` with `tags` and concatenate all tags per movie

In [23]:
%%sql
CREATE TABLE movies_with_tags AS
    SELECT 
        m.movieId, 
        m.title, 
        m.genres,
        GROUP_CONCAT(t.tag SEPARATOR ',') AS all_tags
    FROM movies m
    LEFT JOIN tags t ON m.movieId = t.movieId
    GROUP BY m.movieId, m.title, m.genres;
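
To see what the `LEFT JOIN` plus `GROUP_CONCAT` produces, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for SingleStoreDB. The toy rows are made up, and SQLite spells the separator as a second argument (`GROUP_CONCAT(t.tag, ',')`) rather than `SEPARATOR ','`, so treat it as an illustration of the shape of the result only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE movies (movieId INTEGER, title TEXT, genres TEXT);
    CREATE TABLE tags (userId INTEGER, movieId INTEGER, tag TEXT);
    INSERT INTO movies VALUES (1, 'Movie A (2001)', 'Drama');
    INSERT INTO movies VALUES (2, 'Movie B (2002)', 'Comedy');
    INSERT INTO tags VALUES (10, 1, 'slow burn');
    INSERT INTO tags VALUES (11, 1, 'ensemble cast');
    """
)

# One output row per movie; all of its tags are folded into a single string.
rows = conn.execute(
    """
    SELECT m.movieId, m.title, GROUP_CONCAT(t.tag, ',') AS all_tags
    FROM movies m
    LEFT JOIN tags t ON m.movieId = t.movieId
    GROUP BY m.movieId, m.title, m.genres
    ORDER BY m.movieId
    """
).fetchall()

for movie_id, title, all_tags in rows:
    print(movie_id, title, all_tags)
```

The movie without tags is kept by the `LEFT JOIN`, and its `all_tags` value comes back as `None`, which is worth remembering when the title and tags are concatenated later.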



## 3. Vectorize data

Initialize sentence transformer.

In [24]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')

Query the `movies_with_tags` table and store the output in a variable named `result`. The `result <<` syntax in the 
`%%sql` line indicates that the output from the query should get stored under that variable name.

In [25]:
%%sql result <<
SELECT * FROM movies_with_tags

Convert the result from the above SQL into a DataFrame and clean up quotes.

In [26]:
import pandas as pd

df = pd.DataFrame(result)

# Strip double quotes from titles, and both double and single quotes from tags
df['title'] = df['title'].str.replace('"', '')
df['all_tags'] = df['all_tags'].str.replace('"', '').str.replace("'", '')

# Convert from dataframe to list
all_titles = df.values.tolist()

Check the first row of the list.

In [27]:
all_titles[0]

[72165,
 "Cirque du Freak: The Vampire's Assistant (2009)",
 'Action|Adventure|Comedy|Fantasy|Horror|Thriller',
 'Paul Weitz,NO_FA_GANES,Patrick Fugit,New Orleans,vampires,stew,Cast,Salma Hayek,Salma Hayek,吸血鬼的助手,John C Reilly,Special Effects,Josh Hutcherson,vampires,antidote,vampires,best friend,vampire,Ken Watanabe,execution,Characters,wolfman,vampires,奇趣马戏团：吸血鬼的助手,Music,John C Reilly,vampires,vampires,Script,based on young adult novel,Costumes,spider,freak show,奇趣马戏团,execution']

Concatenate title and tags.

In [28]:
all_title_type_column = [f'{row[1]}-{row[3]}' if row[1] is not None else row[1] for row in all_titles]
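
Movies without tags have `None` in the tags position, so the f-string above yields text such as `Title-None` for them. The sketch below shows one possible way to handle missing values explicitly. `build_text` is a hypothetical helper, not part of the notebook, and returning an empty string for a missing title is an assumption made for illustration.

```python
def build_text(title, tags):
    """Join a title and its tag string, tolerating missing values."""
    if title is None:
        return ""          # assumption: encode an empty string instead of None
    if not tags:
        return title       # no tags: use the title alone
    return f"{title}-{tags}"


# Toy rows in the same column order as all_titles: id, title, genres, tags.
sample_rows = [
    [1, "Movie A (2001)", "Drama", "slow burn,ensemble cast"],
    [2, "Movie B (2002)", "Comedy", None],
]
texts = [build_text(row[1], row[3]) for row in sample_rows]
print(texts)
```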

Create the embeddings for Title & Tag (~3 minutes).

In [50]:
# Remove [:3000] if you want to vectorize all rows (~60 minutes)
all_embeddings = model.encode(all_title_type_column[:3000])
all_embeddings.shape

(300, 768)

Join the original data with the vector data.

In [42]:
# Remember the list will be only 3,000 elements
combined_data = [tuple(row) + (embedding,) for embedding, row in zip(all_embeddings, all_titles)]
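
The comment about the list length follows from how `zip` works: it stops at the shorter input, so only the rows that received an embedding are kept. A toy sketch with made-up rows and two-dimensional vectors:

```python
# Toy stand-ins: 5 source rows but only 3 embeddings, mirroring the [:3000] slice.
all_rows = [[i, f"Movie {i}", "Drama", "tag"] for i in range(5)]
embeddings = [[0.1 * i, 0.2 * i] for i in range(3)]

# zip() stops at the shorter input, so only the first 3 rows get a vector.
combined = [tuple(row) + (emb,) for emb, row in zip(embeddings, all_rows)]

print(len(combined))   # 3
print(combined[1])     # row 1 with its embedding appended as a fifth element
```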

## 4. Create table for movie information and vectors

In [43]:
%%sql
DROP TABLE IF EXISTS movie_with_tags_with_vectors;

CREATE TABLE movie_with_tags_with_vectors (
    movieId bigint(20) DEFAULT NULL,
    title text CHARACTER SET utf8 COLLATE utf8_general_ci,
    genres text CHARACTER SET utf8 COLLATE utf8_general_ci,
    all_tags longtext CHARACTER SET utf8mb4,
    vector blob
)



Create a database connection using SQLAlchemy. We are going to use an SQLAlchemy connection here because one
column of data is numpy arrays. The SingleStoreDB SQLAlchemy driver will automatically convert those to
the correct binary format when uploading, so it's a bit more convenient than doing the conversions and 
formatting manually for the `%sql` magic command.

In [44]:
from sqlalchemy import create_engine

db_connection = create_engine(connection_url).connect()
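
For intuition about the binary format mentioned above: to my understanding, vectors stored in a `blob` column for functions such as `DOT_PRODUCT` are packed 32-bit floats in little-endian order. The sketch below reproduces that layout with the standard `struct` module. `pack_f32` and `unpack_f32` are hypothetical helper names, and the driver's actual conversion code is not shown in this notebook.

```python
import struct


def pack_f32(values):
    """Pack a sequence of floats as little-endian 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_f32(blob):
    """Inverse of pack_f32: 4 bytes per element."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


vec = [0.5, -1.0, 2.0]           # values exactly representable in float32
blob = pack_f32(vec)
print(len(blob))                 # 12 bytes: 3 elements x 4 bytes
print(unpack_f32(blob) == vec)   # True
```

Under this layout a 768-dimensional embedding occupies 768 x 4 = 3072 bytes per row.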

Insert the data. Some rows might encounter errors due to unsupported characters.

In [45]:
sql_query = 'INSERT INTO movie_with_tags_with_vectors (movieId, title, genres, all_tags, vector) VALUES (%s, %s, %s, %s, %s)'

for i, row in enumerate(combined_data):
    try:
        db_connection.execute(sql_query, row)
    except Exception as e:
        print("Error inserting row {}: {}".format(i, e))
        continue

## 5. Marrying Search ❤️ Semantic Search ❤️ Analytics

### Build autocomplete search

This is an experiment we started with to implement full-text search.

In [47]:
%%sql
WITH queryouter AS (
    SELECT DISTINCT(title), movieId, MATCH(title) AGAINST ('Pocahontas*') AS relevance
    FROM movies
    WHERE MATCH(title) AGAINST ('Pocahontas*')
    ORDER BY relevance DESC
    LIMIT 10
)
SELECT title, movieId FROM queryouter;

title,movieId
Pocahontas (1995),48


### Create the user favorite movies table

In [48]:
%%sql
CREATE ROWSTORE TABLE IF NOT EXISTS user_choice (
    userid text CHARACTER SET utf8 COLLATE utf8_general_ci,
    title text CHARACTER SET utf8 COLLATE utf8_general_ci,
    ts datetime DEFAULT NULL,
    KEY userid (userid)
)



Enter dummy data for testing purposes.

In [49]:
%%sql
INSERT INTO user_choice (userid, title, ts)
    VALUES ('user1', 'Zone 39 (1997)', '2022-01-01 00:00:00'),
           ('user1', 'Star Trek II: The Wrath of Khan (1982)', '2022-01-01 00:00:00'),
           ('user1', 'Giver, The (2014)', '2022-01-01 00:00:00');



### Build semantic search for a movie recommendation

In [50]:
%%sql
WITH
    table_match AS (
        SELECT
            m.title,
            m.movieId,
            m.vector
        FROM
            user_choice t
            INNER JOIN movie_with_tags_with_vectors m ON m.title = t.title
        WHERE
            userid = 'user1'
    ),
    movie_pairs AS (
        SELECT
            m1.movieId AS movieId1,
            m1.title AS title1,
            m2.movieId AS movieId2,
            m2.title AS title2,
            DOT_PRODUCT(m1.vector, m2.vector) AS similarity
        FROM
            table_match m1
            CROSS JOIN movie_with_tags_with_vectors m2
        WHERE
            m1.movieId != m2.movieId
            AND NOT EXISTS (
                SELECT
                    1
                FROM
                    user_choice uc
                WHERE
                    uc.userid = 'user1'
                    AND uc.title = m2.title
            )
    ),
    movie_match AS (
        SELECT
            movieId1,
            title1,
            movieId2,
            title2,
            similarity
        FROM
            movie_pairs
        ORDER BY
            similarity DESC
    ),
    distinct_count AS (
        SELECT DISTINCT
            movieId2,
            title2 AS Title,
            ROUND(AVG(similarity), 4) AS Rating_Match
        FROM
            movie_match
        GROUP BY
            movieId2,
            title2
        ORDER BY
            Rating_Match DESC
    ),
    average_ratings AS (
        SELECT
            movieId,
            AVG(rating) AS Avg_Rating
        FROM
            ratings
        GROUP BY
            movieId
    )
SELECT
    dc.Title,
    dc.Rating_Match as 'Match Score',
    ROUND(ar.Avg_Rating, 1) AS 'Average User Rating'
FROM
    distinct_count dc
    JOIN average_ratings ar ON dc.movieId2 = ar.movieId
ORDER BY
    dc.Rating_Match DESC
LIMIT
    5;

Title,Match Score,Average User Rating
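
The core of the query above is: take the vectors of the user's chosen movies, compute the dot product against every other movie, average the scores per candidate, and keep the top matches. A pure-Python sketch of that logic on made-up two-dimensional vectors follows. `recommend` is a hypothetical name, and the join against average ratings is left out.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def recommend(favorites, catalog, k=5):
    """Rank non-favorite movies by mean dot product with the user's favorites.

    favorites: set of titles the user picked.
    catalog:   dict mapping title -> vector (list of floats).
    """
    fav_vectors = [catalog[t] for t in favorites if t in catalog]
    if not fav_vectors:
        return []   # no favorite has a vector, so nothing can be scored
    scores = []
    for title, vec in catalog.items():
        if title in favorites:
            continue   # never recommend something the user already chose
        mean_sim = sum(dot(f, vec) for f in fav_vectors) / len(fav_vectors)
        scores.append((title, round(mean_sim, 4)))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


catalog = {
    "Fav 1": [1.0, 0.0],
    "Fav 2": [0.0, 1.0],
    "Candidate A": [1.0, 1.0],
    "Candidate B": [1.0, 0.0],
    "Candidate C": [-1.0, 0.0],
}
print(recommend({"Fav 1", "Fav 2"}, catalog, k=2))
# [('Candidate A', 1.0), ('Candidate B', 0.5)]
```

As in the SQL, the result is empty when none of the user's titles has a vector. The query's inner joins behave the same way: it returns no rows if no title in `user_choice` appears in `movie_with_tags_with_vectors`, or if `ratings` has no rows for the candidate movies.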


## 6. What are you looking for?

In [52]:
search_embedding = model.encode("I want see a French comedy movie")

In [53]:
sql_query = "SELECT title, genres, DOT_PRODUCT(vector, %(vector)s) AS Score FROM movie_with_tags_with_vectors tv " + \
            "ORDER BY Score DESC " + \
            "LIMIT 10"

results = db_connection.execute(sql_query, dict(vector=search_embedding))

output_list = []
for res in results:
    output_list.append({
        'title': res[0],
        'genres': res[1],
        'score': res[2]
    })

for i, res in enumerate(output_list):
    print(f"{i + 1}: {res['title']} {res['genres']} Score: {res['score']}")

1: All About Actresses (Le bal des actrices) (2009) Comedy|Drama Score: 0.5936338901519775
2: Ce que mes yeux ont vu (2007) Drama|Mystery|Thriller Score: 0.5902963280677795
3: Right Now (À tout de suite) (2004) Crime|Drama|Romance|Thriller Score: 0.5659379363059998
4: Me Two (Personne aux deux personnes, La) (2008) Comedy|Fantasy Score: 0.5574517250061035
5: Fugitives (1986) Comedy|Crime Score: 0.5554698705673218
6: Jean-Luc Cinema Godard (2009) Documentary Score: 0.5506473779678345
7: Happiness Is in the Field (Bonheur est dans le pré, Le) (1995) Comedy Score: 0.5377820730209351
8: Fair is Fair (2010) Comedy Score: 0.5282392501831055
9: Le grand soir (2012) Comedy Score: 0.5223126411437988
10: Hugo (2011) Children|Drama|Mystery Score: 0.5180882811546326
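
A note on reading these scores: `DOT_PRODUCT` equals cosine similarity only when both vectors have unit length. Whether this model's embeddings are already normalized is not stated in this notebook, so treat that as an assumption to verify. The sketch below uses made-up two-dimensional vectors to show the relationship.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]
movie = [4.0, 3.0]

print(dot(query, movie))   # raw dot product: 24.0

# After normalization, the dot product matches the cosine similarity.
print(math.isclose(dot(normalize(query), normalize(movie)), cosine(query, movie)))   # True
```

If the stored vectors are not unit length, longer vectors get higher dot-product scores regardless of direction, so normalizing before insertion is one way to make the ranking purely directional.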


<img src="https://raw.githubusercontent.com/singlestore-labs/notebook-picture/main/banner-colors-reverse.png" style="width: 100%; margin: 0; padding: 0"/>