## In this notebook, we implement a movie recommender system using the MovieLens dataset from TensorFlow Datasets.

In [10]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade tensorflow-datasets

### typing: This module provides support for type hints in Python.
### Dict and Text: These are type hints used for defining the types of variables.
### numpy (imported as np): A popular library for numerical operations in Python.
### tensorflow (imported as tf): A powerful library for machine learning and deep learning.
### tensorflow_datasets (imported as tfds): A library for accessing and preprocessing various datasets in TensorFlow.
### tensorflow_recommenders (imported as tfrs): A library built on TensorFlow for building recommendation systems.
By using these libraries, you can leverage the capabilities of TensorFlow to create and train recommendation models using the provided datasets and algorithms from tensorflow_recommenders.

In [2]:
from typing import Dict, Text

import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

Loading the ratings data:

ratings = tfds.load('movielens/100k-ratings', split="train"): This line loads the ratings data from the MovieLens 100K dataset using the TensorFlow Datasets (tfds) library. The split parameter is set to "train," which indicates that you are loading the training split of the dataset.


Loading the movie data:

movies = tfds.load('movielens/100k-movies', split="train"): This line loads the movie data from the MovieLens 100K dataset using TensorFlow Datasets. The split parameter is again set to "train."

The first map() call transforms the ratings data: it keeps only the "movie_title" and "user_id" features of each rating record and drops every other feature.

The second map() call extracts the "movie_title" feature from each movie record, so the resulting dataset contains only movie titles.

In [3]:
# Ratings data.
ratings = tfds.load('movielens/100k-ratings', split="train")
# Features of all the available movies.
movies = tfds.load('movielens/100k-movies', split="train")

# Select the basic features.
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"]
})
movies = movies.map(lambda x: x["movie_title"])

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\stilinski\tensorflow_datasets\movielens\100k-ratings\0.1.1...


Dataset movielens downloaded and prepared to C:\Users\stilinski\tensorflow_datasets\movielens\100k-ratings\0.1.1. Subsequent calls will reuse this data.
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\stilinski\tensorflow_datasets\movielens\100k-movies\0.1.1...


Dataset movielens downloaded and prepared to C:\Users\stilinski\tensorflow_datasets\movielens\100k-movies\0.1.1. Subsequent calls will reuse this data.


Next, we create vocabulary look-up tables for the user IDs and movie titles using tf.keras.layers.StringLookup. These tables map string values to integer indices, which is the input format an embedding layer expects for categorical features.


Both user_ids_vocabulary and movie_titles_vocabulary are StringLookup layers created with mask_token=None, so no index is reserved for masking; index 0 is used for out-of-vocabulary values, which is the layer's default. Calling adapt() on each layer builds its vocabulary: user_ids_vocabulary is adapted on the "user_id" feature of the ratings dataset, and movie_titles_vocabulary is adapted on the movies dataset, which at this point contains only titles.

During adapt(), the layer scans the dataset, collects the unique strings, and assigns each one an integer index.

By adapting the vocabulary look-up tables using the respective datasets, the StringLookup layers learn the mapping between the string values (user IDs and movie titles) and their corresponding integer indices. This mapping can be used later for feature encoding or embedding when building recommendation models.
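
To make the mapping concrete, here is a minimal plain-Python sketch of what a string look-up table does. It is an analogue for illustration, not the StringLookup implementation: the helper names `build_vocabulary` and `lookup` are hypothetical, the sample user IDs are made up, and the alphabetical tie-break between equally frequent tokens is an assumption of this sketch.

```python
from collections import Counter


def build_vocabulary(values, oov_token="[UNK]"):
    """Return the tokens ordered by descending frequency, with the OOV token at index 0."""
    counts = Counter(values)
    # Assumption of this sketch: ties are broken alphabetically so the result is deterministic.
    tokens = sorted(counts, key=lambda token: (-counts[token], token))
    return [oov_token] + tokens


def lookup(vocabulary, values):
    """Map each string to its index; unseen strings fall back to index 0 (the OOV slot)."""
    index_of = {token: i for i, token in enumerate(vocabulary)}
    return [index_of.get(value, 0) for value in values]


# Made-up user IDs standing in for the "user_id" feature.
user_id_vocabulary = build_vocabulary(["138", "92", "138", "301"])
indices = lookup(user_id_vocabulary, ["92", "138", "999"])

print(user_id_vocabulary)
print(indices)
```

The ID "999" never appeared while building the vocabulary, so it maps to index 0. This is why the embedding tables defined later need one row more than the number of distinct users or movies.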


In [4]:
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(ratings.map(lambda x: x["user_id"]))

movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocabulary.adapt(movies)

__init__(self, user_model, movie_model, task): This is the constructor of the MovieLensModel class. It takes three arguments:

user_model: An instance of tf.keras.Model that maps user IDs to user embeddings.
movie_model: An instance of tf.keras.Model that maps movie titles to movie embeddings.
task: An instance of tfrs.tasks.Retrieval. The task defines how user and movie embeddings are compared to compute the loss and the metrics.

compute_loss(self, features, training=False): This method computes the loss for one batch. It takes the following arguments:

features: A dictionary of input features. Here it must contain the keys "user_id" and "movie_title", each mapped to a tensor.
training: A boolean indicating whether the model is being trained (default is False). The implementation below does not use it.

Inside the method, the user and movie embeddings are computed with the user and movie models, respectively. The task object is then called with both sets of embeddings and returns the loss. By default, tfrs.tasks.Retrieval scores every user in the batch against every movie in the batch and applies a softmax cross-entropy loss in which the movie a user actually rated is the positive and the other movies in the batch serve as negatives.

By implementing the MovieLensModel class, you define the structure and behavior of the recommendation model in terms of a user model, a movie model, and a retrieval task.
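
The loss computed by the task can be illustrated with a short NumPy sketch. It assumes the in-batch softmax cross-entropy formulation, summed over the batch; the function name `in_batch_softmax_loss` is hypothetical, and this is not the library's implementation.

```python
import numpy as np


def in_batch_softmax_loss(user_embeddings, movie_embeddings):
    """Softmax cross-entropy where row i of both inputs forms the positive pair."""
    # scores[i, j] is the dot product between user i and the movie in batch row j.
    scores = user_embeddings @ movie_embeddings.T
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The positive for user i sits on the diagonal; sum the negative log-likelihoods.
    return -np.sum(np.diag(log_probs))


# Untrained case: all-zero embeddings cannot tell the four movies apart.
uninformative_loss = in_batch_softmax_loss(np.zeros((4, 8)), np.zeros((4, 8)))

# Well-separated case: each user strongly matches only its own movie.
aligned_loss = in_batch_softmax_loss(10.0 * np.eye(3), 10.0 * np.eye(3))

print(uninformative_loss, aligned_loss)
```

When the embeddings carry no information, each of the four rows contributes log(4) to the loss. When every user embedding aligns only with its own movie, the loss approaches zero, which is the direction training pushes the two embedding tables.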

In [5]:
class MovieLensModel(tfrs.Model):
  # We derive from a custom base class to help reduce boilerplate. Under the hood,
  # these are still plain Keras Models.

  def __init__(
      self,
      user_model: tf.keras.Model,
      movie_model: tf.keras.Model,
      task: tfrs.tasks.Retrieval):
    super().__init__()

    # Set up user and movie representations.
    self.user_model = user_model
    self.movie_model = movie_model

    # Set up a retrieval task.
    self.task = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # Define how the loss is computed.

    user_embeddings = self.user_model(features["user_id"])
    movie_embeddings = self.movie_model(features["movie_title"])

    return self.task(user_embeddings, movie_embeddings)

Here, the user_model is defined as a sequential model using tf.keras.Sequential. It consists of two layers:

The first layer is user_ids_vocabulary, which is the vocabulary look-up table for user IDs that you created earlier. It maps the string user IDs to integer indices.
The second layer is an Embedding layer whose input dimension equals the size of the user ID vocabulary (including the out-of-vocabulary entry) and whose embedding dimension is 64. It learns one 64-dimensional vector per user, selected by the integer index that the look-up table produces.

Similarly, the movie_model is defined as a sequential model. It also consists of two layers:

The first layer is movie_titles_vocabulary, the vocabulary look-up table for movie titles created earlier. It maps the string movie titles to integer indices.
The second layer is an Embedding layer whose input dimension equals the size of the movie title vocabulary (including the out-of-vocabulary entry) and whose embedding dimension is 64. It learns one 64-dimensional vector per movie, selected by the integer index that the look-up table produces.
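
Conceptually, an embedding layer is a trainable matrix with one row per vocabulary entry, and applying it amounts to selecting rows by index. The NumPy sketch below illustrates this with a made-up vocabulary of five entries. The uniform initialization range is an assumption for illustration, and no training is performed.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocabulary_size = 5   # hypothetical: four known IDs plus the out-of-vocabulary slot
embedding_dim = 64    # same dimension as in the notebook

# One row per vocabulary index; in a real model these values are learned during training.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocabulary_size, embedding_dim))

# Integer indices as a look-up table would produce them for a batch of three strings.
indices = np.array([3, 1, 0])

# The embedding "layer" is a row selection: one 64-dimensional vector per input index.
embeddings = embedding_table[indices]

print(embeddings.shape)
```

Chaining the look-up table and the embedding table in a sequential model therefore turns a batch of strings directly into a batch of 64-dimensional vectors.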

The retrieval task, tfrs.tasks.Retrieval, is configured with the tfrs.metrics.FactorizedTopK metric. This metric evaluates the model by checking how often the movie a user actually rated appears among the top-K candidates retrieved for that user. To rank candidates, the metric needs the full candidate set: movies.batch(128).map(movie_model) supplies it as a dataset of movie embeddings, computed by movie_model in batches of 128 titles.
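
The idea behind a top-K retrieval metric can be sketched in NumPy as follows. This is an illustration of top-K accuracy under simplifying assumptions (all candidates fit in one array, and the true movie is given as a candidate index); the function name `top_k_accuracy` and the toy embeddings are hypothetical.

```python
import numpy as np


def top_k_accuracy(user_embeddings, true_movie_indices, candidate_embeddings, k):
    """Fraction of users whose true movie is among their k highest-scoring candidates."""
    # scores[i, j] is the dot product between user i and candidate movie j.
    scores = user_embeddings @ candidate_embeddings.T
    # Indices of the k best candidates per user, best first.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_movie_indices[:, None]).any(axis=1)
    return hits.mean()


# Four toy candidates with one-hot embeddings, and two toy users.
candidates = np.eye(4)
users = np.array([
    [1.0, 0.5, 0.0, 0.0],   # ranks movie 0 first, movie 1 second
    [0.0, 0.0, 1.0, 0.9],   # ranks movie 2 first, movie 3 second
])
true_movies = np.array([0, 3])

accuracy_at_1 = top_k_accuracy(users, true_movies, candidates, k=1)
accuracy_at_2 = top_k_accuracy(users, true_movies, candidates, k=2)

print(accuracy_at_1, accuracy_at_2)
```

The first user's true movie is ranked first, while the second user's true movie is ranked second, so the accuracy rises from 0.5 at K=1 to 1.0 at K=2.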

In [6]:
# Define user and movie models.
# vocabulary_size() replaces the deprecated vocab_size(); it includes the OOV token.
user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(user_ids_vocabulary.vocabulary_size(), 64)
])
movie_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(movie_titles_vocabulary.vocabulary_size(), 64)
])

# Define your objectives.
task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
        movies.batch(128).map(movie_model)
    )
)

Here, a retrieval model is created by instantiating the MovieLensModel class that you defined earlier. The user_model, movie_model, and task objects are passed as arguments to the constructor.
The retrieval model is compiled by specifying an optimizer for training. In this case, the Adagrad optimizer with a learning rate of 0.5 is used.
The retrieval model is trained using the fit() method. The ratings dataset is batched into batches of size 4096, and the model is trained for 10 epochs.
A brute-force search index is created with tfrs.layers.factorized_top_k.BruteForce, which takes the trained user_model as its query model. The index is then populated from the movies dataset, batched by 100, with each title paired with its embedding from movie_model.
Recommendations for a specific user (here, user ID "30") are obtained by calling the index with that user ID. The index returns scores and titles, and the top three titles are printed.

By creating the retrieval model, training it, setting up the retrieval index, and obtaining recommendations, you have completed the recommendation workflow using the MovieLens dataset and the defined model architecture.
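
What the brute-force index does at query time can be sketched in NumPy: score the query embedding against every candidate embedding and keep the best few. The function name `brute_force_top_k`, the two-dimensional embeddings, and the titles "Movie A" to "Movie D" are hypothetical placeholders, not values from the trained model.

```python
import numpy as np


def brute_force_top_k(query_embedding, candidate_embeddings, candidate_titles, k=3):
    """Score every candidate against the query and return the k best scores and titles."""
    # One dot product per candidate: exhaustive, hence "brute force".
    scores = candidate_embeddings @ query_embedding
    best = np.argsort(-scores)[:k]
    return scores[best], [candidate_titles[i] for i in best]


# Toy stand-ins for the trained user and movie embeddings.
query = np.array([1.0, 0.0])
movie_embeddings = np.array([
    [0.9, 0.0],
    [0.1, 1.0],
    [0.5, 0.5],
    [-1.0, 0.0],
])
movie_titles = ["Movie A", "Movie B", "Movie C", "Movie D"]

top_scores, top_titles = brute_force_top_k(query, movie_embeddings, movie_titles, k=2)

print(top_titles)
```

Exhaustive scoring is exact and simple, and it is affordable here because the candidate set is small. For much larger catalogues, an approximate nearest-neighbour index is the usual alternative.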

In [8]:
# Create a retrieval model.
model = MovieLensModel(user_model, movie_model, task)
model.compile(optimizer=tf.keras.optimizers.Adagrad(0.5))

# Train for 10 epochs.
model.fit(ratings.batch(4096), epochs=10)

# Use brute-force search to set up retrieval using the trained representations.
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)
index.index_from_dataset(
    movies.batch(100).map(lambda title: (title, model.movie_model(title))))

# Get some recommendations.
_, titles = index(np.array(["30"]))
print(f"Top 3 recommendations for user 30: {titles[0, :3]}")

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Top 3 recommendations for user 30: [b'Flubber (1997)' b'Mouse Hunt (1997)' b'Rocket Man (1997)']
