
# Recommending movies: retrieval
Real-world recommender systems are often composed of two stages:
1. Retrieval: selects an initial set of hundreds of candidates from all possible candidates. This stage needs to be computationally efficient.
2. Ranking: takes the candidates returned by the retrieval model and narrows them down to only the best recommendations.

Retrieval models are often composed of two sub-models:
1. Query model: computes the query representation (normally a fixed-dimensionality embedding vector) using query features.
2. Candidate model: computes the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then multiplied together (typically as a dot product of the two embedding vectors) to give a single query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.
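
To make the affinity score concrete, here is a minimal, framework-free sketch. The embedding values and movie names are made up for illustration, and the helper names `affinity` and `retrieve_top_k` are hypothetical; in the real model the vectors would be produced by the query and candidate towers rather than typed in by hand.

```python
# Toy embeddings (made-up numbers); in a real model these come from the two towers.
query_embedding = [0.9, 0.1, 0.3]  # representation of one user (the query)

candidate_embeddings = {
    "Movie A": [0.8, 0.0, 0.4],
    "Movie B": [0.1, 0.9, 0.2],
    "Movie C": [0.5, 0.2, 0.1],
}


def affinity(query, candidate):
    """Dot product of two equally sized vectors."""
    if len(query) != len(candidate):
        raise ValueError("query and candidate embeddings must have the same size")
    return sum(q * c for q, c in zip(query, candidate))


def retrieve_top_k(query, candidates, k):
    """Score every candidate against the query and return the k best titles."""
    scores = {title: affinity(query, emb) for title, emb in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


top_2 = retrieve_top_k(query_embedding, candidate_embeddings, k=2)
print(top_2)  # ['Movie A', 'Movie C']
```

Scoring every candidate like this is a brute-force search. It is fine for a toy example, but it is the step that an approximate nearest neighbours index is meant to speed up when the candidate set is large.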

In this notebook, I am going to build a two-tower model using the MovieLens dataset. I will:
1. Get the data and split into a training and test set.
2. Implement a retrieval model.
3. Fit and evaluate the model.
4. Export the model for efficient serving by building an approximate nearest neighbours (ANN) index.

## Imports

In [1]:
!pip install -q numpy==1.18.5 # we have to downgrade NumPy, otherwise we get an error

!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets
!pip install -q scann

In [2]:
import numpy as np

print(np.__version__)

1.18.5


In [3]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds

In [4]:
import tensorflow_recommenders as tfrs

## Data
We can use the MovieLens data in two ways:
1. Explicitly: use the ratings from 1 to 5.
2. Implicitly: treat each rating as a binary signal (0 or 1), where 1 means the user has watched the movie.

We are going to use the latter (implicit feedback), with the version of the data that has 100k ratings.
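
As a rough illustration of the implicit interpretation, the sketch below drops the rating value and keeps only which user watched which movie. The records and the field names (`user_id`, `movie_title`, `user_rating`) are assumptions made for this example, and `to_implicit` and `label` are hypothetical helpers, not part of any library.

```python
# Hypothetical explicit ratings; the field names are assumed for illustration.
explicit_ratings = [
    {"user_id": "1", "movie_title": "Movie A", "user_rating": 5.0},
    {"user_id": "1", "movie_title": "Movie B", "user_rating": 2.0},
    {"user_id": "2", "movie_title": "Movie A", "user_rating": 3.0},
]


def to_implicit(records):
    """Keep only (user, movie) pairs: a rating of any value counts as 'watched'."""
    return [
        {"user_id": r["user_id"], "movie_title": r["movie_title"]}
        for r in records
    ]


implicit_ratings = to_implicit(explicit_ratings)

# Set of observed (user, movie) pairs, used to look up the binary label.
watched = {(r["user_id"], r["movie_title"]) for r in implicit_ratings}


def label(user_id, movie_title):
    """Return 1 if the user has watched the movie, otherwise 0."""
    return 1 if (user_id, movie_title) in watched else 0


print(label("1", "Movie B"))  # 1: watched, even though the rating was low
print(label("2", "Movie B"))  # 0: no interaction recorded
```

Note that under this interpretation a low rating still counts as a positive signal, because only the fact that the user watched the movie is kept.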

In [5]:
# Ratings data
ratings = tfds.load("movielens/100k-ratings", split="train") # this data does not have any predefined splits

# Features of all the available movies
movies = tfds.load("movielens/100k-movies", split="train")