# **Lab: ML Lifecycle**
---
## Exercise 1: Tensorflow Recommender

We will be using the following datasets:
- https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36120-adv_mla/lab08/movies.csv
- https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36120-adv_mla/lab08/ratings.csv

The steps are:
1.   Setup Environment
2.   Load and explore dataset
3.   Prepare Data
4.   Split Dataset
5.   Define Architecture
6.   Train Tensorflow model
7.   Push Changes

### 1. Setup Environment

**[1.1]** Go to a folder of your choice on your computer (where you store projects)

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
cd /Users/anthonyso/Projects/adv_mla_2024

**[1.2]** Run the built Docker image

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
docker run  -dit --rm --name adv_mla_lab_7 -p 8888:8888 -v ~/Projects/adv_mla_2024/adv_mla_lab_7:/home/jovyan/work/ tensorflow-jupyter:latest

**[1.3]** Display last 50 lines of logs

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
docker logs --tail 50 adv_mla_lab_7

**[1.4]** Copy the URL displayed and paste it into a browser to launch Jupyter Lab

**[1.5]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `2_reco.ipynb`

### 2.   Load and Explore Dataset

**[2.1]** Launch magic commands to automatically reload modules


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
%load_ext autoreload
%autoreload 2

**[2.2]** Install your custom package with pip

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
! pip install -i https://test.pypi.org/simple/ my-krml-149874

**[2.3]** Import the pandas and numpy packages

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
import pandas as pd
import numpy as np

**[2.4]** Load the ratings and movies CSV files into 2 dataframes called `ratings_df` and `movies_df`

In [None]:
# Solution
ratings_df = pd.read_csv('https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36120-adv_mla/lab08/ratings.csv')
movies_df = pd.read_csv('https://raw.githubusercontent.com/aso-uts/labs_datasets/main/36120-adv_mla/lab08/movies.csv')

**[2.5]** Display the first 5 rows of `ratings_df`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ratings_df.head()

**[2.6]** Display the first 5 rows of `movies_df`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
movies_df.head()

**[2.7]** Display the dimensions of both dataframes

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
print(ratings_df.shape)
print(movies_df.shape)

**[2.8]** Display the descriptive statistics of `ratings_df`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ratings_df.describe()

**[2.9]** Display the descriptive statistics of `movies_df`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
movies_df.describe()
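
By default, `describe()` summarises numeric columns only, so for a dataframe that mostly holds text the output may be limited to the id column. The sketch below uses a small made-up dataframe (hypothetical values, not taken from the lab files) to show how `include="all"` also reports counts and unique values for text columns:

```python
import pandas as pd

# Toy stand-in for movies_df (hypothetical values)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie B"],
})

# Default behaviour: numeric columns only
numeric_summary = toy_movies.describe()

# include="all" adds text columns (count, unique, top, freq)
full_summary = toy_movies.describe(include="all")

print(numeric_summary.columns.tolist())
print(full_summary.loc["unique", "title"])
```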

### 3. Prepare Data

**[3.1]** Create a dataframe called `movies` that contains only the distinct values of the column `title`


In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
movies = pd.DataFrame(movies_df["title"].unique(), columns=["movie_title"])

**[3.2]** Join `movies_df` and `ratings_df` into a dataframe called `ratings`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ratings = ratings_df.merge(movies_df, on="movieId", how="left")
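
A left join keeps every row of the left dataframe, even when no matching `movieId` exists on the right; those rows get a missing title. The following sketch illustrates this on toy data (all values are hypothetical):

```python
import pandas as pd

# Toy stand-ins for ratings_df and movies_df (hypothetical values)
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 99],   # 99 has no entry in toy_movies
    "rating": [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A", "Movie B"],
})

# how="left" keeps all 3 ratings; the unmatched one gets a missing title
toy_joined = toy_ratings.merge(toy_movies, on="movieId", how="left")
missing_titles = int(toy_joined["title"].isna().sum())

print(toy_joined)
print(missing_titles)
```

Checking for missing titles after the join is a cheap way to spot ratings that refer to movies absent from the movies file.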

**[3.3]** Take the first 100000 rows of `ratings` and save it back to `ratings`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ratings = ratings[:100000]

**[3.4]** Rename the columns `title` to `movie_title` and `userId` to `user_id`, and keep only these two columns. Finally, convert the column `user_id` to string

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ratings.rename(columns={"title":"movie_title", "userId":"user_id"}, inplace=True)
ratings = ratings[["movie_title", "user_id"]].astype('str')
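
The ids are converted to strings because the `StringLookup` layers used later operate on string tokens. A minimal sketch of the same rename, select, and cast sequence on toy data (hypothetical values), written without `inplace=True` so that each step returns a new dataframe:

```python
import pandas as pd

# Toy stand-in for the joined ratings dataframe (hypothetical values)
toy = pd.DataFrame({
    "userId": [1, 2],
    "title": ["Movie A", "Movie B"],
    "rating": [4.0, 5.0],
})

# Rename, keep the two columns of interest, then cast everything to str
toy = toy.rename(columns={"title": "movie_title", "userId": "user_id"})
toy = toy[["movie_title", "user_id"]].astype(str)

print(toy["user_id"].tolist())
```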

**[3.5]** Import Tensorflow

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
import tensorflow as tf

**[3.6]** Convert `movies` into a Tensorflow dataset called `movies_tf`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
movies_tf = tf.data.Dataset.from_tensor_slices(dict(movies))

**[3.7]** Convert `ratings` into a Tensorflow dataset called `ratings_tf`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ratings_tf = tf.data.Dataset.from_tensor_slices(dict(ratings))

**[3.8]** Apply the `map` method on `ratings_tf` to keep only the `movie_title` and `user_id` features

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
ratings_tf = ratings_tf.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})

**[3.9]** Apply the `map` method on `movies_tf` to extract the `movie_title` values

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
movies_tf = movies_tf.map(lambda x: x["movie_title"])

**[3.10]** Create 2 lists called `unique_user_ids` and `unique_movies` that will respectively contain the unique user ids and movie titles

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
unique_user_ids = ratings["user_id"].unique()
unique_movies = ratings["movie_title"].unique()

### 4. Split Dataset

**[4.1]** Set a seed for Tensorflow

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
tf.random.set_seed(42)

**[4.2]** Randomly shuffle 100000 rows of `ratings_tf` and save the results into `shuffled_tf`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
shuffled_tf = ratings_tf.shuffle(100000, seed=42, reshuffle_each_iteration=False)

**[4.3]** Create a training set called `train_tf` by taking the first 80000 rows of `shuffled_tf`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
train_tf = shuffled_tf.take(80000)

**[4.4]** Create a testing set called `test_tf` by taking the remaining 20000 rows of `shuffled_tf`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
test_tf = shuffled_tf.skip(80000).take(20000)
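
The split relies on the shuffled order being fixed: with `reshuffle_each_iteration=False`, `take` and `skip` see the same order every time, so the two sets do not overlap. The idea can be sketched in NumPy on a tiny index range (the sizes below are stand-ins for 100000 and 80000; this is an illustration, not how `tf.data` is implemented):

```python
import numpy as np

n_rows = 10    # stand-in for the 100000 rows used in the lab
n_train = 8    # stand-in for the 80000 training rows

# Shuffle the row indices once with a fixed seed, then reuse that order
rng = np.random.default_rng(42)
shuffled_idx = rng.permutation(n_rows)

train_idx = shuffled_idx[:n_train]   # analogous to take(n_train)
test_idx = shuffled_idx[n_train:]    # analogous to skip(n_train)

print(len(train_idx), len(test_idx))
```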

### 5. Define Architecture

**[5.1]** Import tensorflow_recommenders package

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
import tensorflow_recommenders as tfrs

**[5.2]** Create a variable called `embedding_dimension` that will take the value 32

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
embedding_dimension = 32

**[5.3]** Create a Tensorflow Sequential model called `user_model` with a StringLookup for the unique list of user ids and an Embedding layer

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
user_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(vocabulary=unique_user_ids, mask_token=None),
  tf.keras.layers.Embedding(len(unique_user_ids) + 1, embedding_dimension)
])
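
The `+ 1` in the embedding size accounts for the extra index that `StringLookup` reserves for ids outside the vocabulary (index 0 when `mask_token=None` and the default single out-of-vocabulary bucket is used). A rough NumPy sketch of what the lookup plus embedding pair computes, with a hypothetical vocabulary and randomly initialised (untrained) weights:

```python
import numpy as np

vocabulary = ["1", "2", "3"]   # hypothetical user ids
embedding_dimension = 4

# Index 0 is reserved for out-of-vocabulary ids, so known ids start at 1
token_to_index = {token: i + 1 for i, token in enumerate(vocabulary)}

# One row per known id plus one row for the out-of-vocabulary bucket
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocabulary) + 1, embedding_dimension))

def embed(user_id: str) -> np.ndarray:
    """Map a string id to its row in the embedding table."""
    row = token_to_index.get(user_id, 0)
    return embedding_table[row]

print(embed("2").shape)
```

An id that never appeared in the vocabulary falls back to row 0, so all unseen users share the same vector. The same reasoning applies to the movie title lookup.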

**[5.4]** Create a Tensorflow Sequential model called `movie_model` with a StringLookup for the unique list of movie titles and an Embedding layer

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
movie_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(vocabulary=unique_movies, mask_token=None),
  tf.keras.layers.Embedding(len(unique_movies) + 1, embedding_dimension)
])

**[5.5]** Instantiate a FactorizedTopK called `metrics` with `movies_tf`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies_tf.batch(128).map(movie_model)
)

**[5.6]** Instantiate a Retrieval called `task` with `metrics`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
task = tfrs.tasks.Retrieval(
  metrics=metrics
)

**[5.7]** Create a class called `MovielensModel` that inherits from `tensorflow_recommenders.Model`, with 3 attributes (`movie_model`, `user_model` and `task`) and a method called `compute_loss()`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
class MovielensModel(tfrs.Model):

  def __init__(self, user_model, movie_model, task):
    super().__init__()
    self.movie_model: tf.keras.Model = movie_model
    self.user_model: tf.keras.Model = user_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features, training=False) -> tf.Tensor:
    user_embeddings = self.user_model(features["user_id"])
    positive_movie_embeddings = self.movie_model(features["movie_title"])

    return self.task(user_embeddings, positive_movie_embeddings)
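
To build intuition for what the retrieval task optimises, the sketch below computes an in-batch softmax cross-entropy in NumPy: each user embedding is scored against every movie embedding in the batch, and the movie on the same row is treated as the correct one. This is a simplified illustration under that assumption; the actual `tfrs.tasks.Retrieval` layer offers further options that are not reproduced here.

```python
import numpy as np

def in_batch_softmax_loss(user_emb: np.ndarray, movie_emb: np.ndarray) -> float:
    """Summed cross-entropy where row i's correct candidate is column i."""
    scores = user_emb @ movie_emb.T                       # (batch, batch) affinities
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.trace(log_probs))

# Hypothetical embeddings: users aligned with their own movie vs. a wrong one
user_emb = 5.0 * np.eye(3)
aligned_loss = in_batch_softmax_loss(user_emb, np.eye(3))
mismatched_loss = in_batch_softmax_loss(user_emb, np.roll(np.eye(3), 1, axis=0))

print(aligned_loss, mismatched_loss)
```

The loss is small when each user vector points towards its own movie vector and large when it points towards another movie in the batch.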

### 6. Train the Model

**[6.1]** Instantiate a `MovielensModel` called `model`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
model = MovielensModel(user_model, movie_model, task)

**[6.2]** Compile the model with an Adagrad optimiser

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))
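
Adagrad scales each parameter's step by the history of its own squared gradients, which suits embedding tables where some rows are updated far more often than others. A minimal NumPy sketch of the update rule, assuming an accumulator that starts at zero (Keras starts from a small non-zero value by default, so its numbers will differ):

```python
import numpy as np

def adagrad_step(param, grad, accumulator, learning_rate=0.1, epsilon=1e-7):
    """One Adagrad update: accumulate squared gradients, then scale the step."""
    accumulator = accumulator + grad ** 2
    param = param - learning_rate * grad / (np.sqrt(accumulator) + epsilon)
    return param, accumulator

param = np.array([1.0, 1.0])
acc = np.zeros(2)
grad = np.array([10.0, 0.1])   # very different gradient magnitudes

param, acc = adagrad_step(param, grad, acc)
print(param)
```

On this first step both coordinates move by roughly the learning rate, even though one gradient is 100 times larger than the other.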

**[6.3]** Split `train_tf` into batches of 4096 observations and cache the result in `cached_train`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
cached_train = train_tf.batch(4096).cache()
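
Since 80000 is not a multiple of 4096, the last batch is smaller than the others (`batch` keeps the remainder unless `drop_remainder=True` is passed). The arithmetic can be checked quickly:

```python
import math

n_train = 80000
batch_size = 4096

# Number of batches per epoch, counting the final partial batch
n_batches = math.ceil(n_train / batch_size)

# Size of that final partial batch
last_batch_size = n_train - (n_batches - 1) * batch_size

print(n_batches, last_batch_size)
```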

**[6.4]** Train the model on the cached dataset for 3 epochs

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
model.fit(cached_train, epochs=3)

**[6.5]** Split `test_tf` into batches of 4096 observations and cache the result in `cached_test`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
cached_test = test_tf.batch(4096).cache()

**[6.6]** Evaluate the performance of the model on the test set

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
model.evaluate(cached_test, return_dict=True)
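
The metrics reported by the evaluation should be top-K categorical accuracies: the share of test rows whose true movie appears among the K highest-scoring candidates. A small NumPy sketch of that definition, using a hypothetical score matrix:

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, true_idx: np.ndarray, k: int) -> float:
    """Fraction of rows whose true candidate is among the k best scores."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == true_idx[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical scores: 3 users (rows) against 3 candidate movies (columns)
scores = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
])
true_idx = np.array([0, 1, 1])   # index of the movie each user actually rated

top_1 = top_k_accuracy(scores, true_idx, k=1)
top_2 = top_k_accuracy(scores, true_idx, k=2)
print(top_1, top_2)
```

With thousands of candidate movies, top-1 accuracy is usually very low, so the larger values of K tend to be more informative.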

**[6.7]** Instantiate a factorized_top_k.BruteForce layer called `index` with `user_model`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model)

**[6.8]** Add the movies dataset to `index`

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
index.index_from_dataset(
  tf.data.Dataset.zip((movies_tf.batch(100), movies_tf.batch(100).map(model.movie_model)))
)

**[6.9]** Print the recommended movies for a given user id

In [None]:
# Placeholder for student's code (Python code)

In [None]:
# Solution
user_id = 0
_, titles = index(tf.constant([f"{user_id}"]))
print(f"Recommendations for user {user_id}: {titles[0, :3]}")
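
Conceptually, a brute-force index scores the query embedding against every candidate embedding and returns the best matches. The sketch below shows that idea in NumPy with hypothetical 2-dimensional embeddings; it is not the library's implementation:

```python
import numpy as np

def brute_force_top_k(query_emb, candidate_embs, candidate_ids, k=3):
    """Score every candidate with a dot product and return the k best."""
    scores = candidate_embs @ query_emb
    order = np.argsort(-scores)[:k]
    return [candidate_ids[i] for i in order], scores[order]

# Hypothetical movie embeddings and a single user embedding
candidate_ids = ["Movie A", "Movie B", "Movie C", "Movie D"]
candidate_embs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
])
query_emb = np.array([1.0, 0.2])

titles, top_scores = brute_force_top_k(query_emb, candidate_embs, candidate_ids, k=2)
print(titles)
```

Scoring every candidate is exact but grows linearly with the number of movies, which is acceptable for a dataset of this size.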

### 7.   Push changes

**[7.1]** Add your changes to the git staging area

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git add .

**[7.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git commit -m "first tf reco model"

**[7.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git push

**[7.4]** Go to GitHub and merge the branch after reviewing the code and fixing any conflicts


**[7.5]** Check out the master branch

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git checkout master

**[7.6]** Pull the latest updates

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
git pull

**[7.7]** Stop the Docker container

In [None]:
# Placeholder for student's code (command line)

In [None]:
# Solution:
docker stop adv_mla_lab_7