In [None]:
import os
import tensorflow as tf
import sys
sys.path.append("./src/")
sys.path.insert(0, "./src/modeling")
sys.path.insert(0, "./src/plotting")
sys.path.insert(0, "./src/predicting")

import constant as const
import numpy as np
from pathlib import Path
from plotting import plotting_utils
from modeling import train_rl_model
from predicting import predicting_utils

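# In a notebook, __file__ is undefined, so the string '__file__' is resolved
# relative to the current working directory instead. Reference:
# https://stackoverflow.com/questions/16771894/python-nameerror-global-name-file-is-not-defined
base_path = os.path.dirname(os.path.realpath('__file__'))
ROOT_PATH = os.path.join(base_path, "artifacts")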

### Define RL modules [locally]

Define a [MovieLens-specific bandits environment](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/environments/movielens_py_environment/MovieLensPyEnvironment), a [Linear UCB agent](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/agents/lin_ucb_agent) and the [regret metric](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/metrics/tf_metrics/RegretMetric).

In [None]:
# Start profiling; the profiler is stopped after training completes.
tf.profiler.experimental.start(const.PROFILER_DIR)
environment = train_rl_model.define_rl_envirioment()
agent = train_rl_model.define_rl_agent(environment)
metrics = train_rl_model.define_rl_metric(environment)

### Train the model [locally]

Define the training logic (on-policy training). The following function mirrors [trainer.train](https://github.com/tensorflow/agents/blob/r0.8.0/tf_agents/bandits/agents/examples/v2/trainer.py#L104), but it additionally keeps track of intermediate metric values and saves different artifacts to different locations. You can also invoke [trainer.train](https://github.com/tensorflow/agents/blob/r0.8.0/tf_agents/bandits/agents/examples/v2/trainer.py#L104) directly to train the policy.

Train the RL policy and gather intermediate metric results. At the same time, use [TensorBoard Profiler](https://www.tensorflow.org/guide/profiler) to profile the training process and its resource usage.

In [None]:
metric_results = train_rl_model.train(
    root_dir=ROOT_PATH,
    agent=agent,
    environment=environment,
    training_loops=const.TRAINING_LOOPS,
    steps_per_loop=const.STEPS_PER_LOOP,
    additional_metrics=metrics)

tf.profiler.experimental.stop()

### Evaluate RL metrics [locally]

You can visualize how the regret and average return metrics evolve over training steps.
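
As a rough illustration of what the regret curve measures, the sketch below computes per-step regret as the gap between a baseline reward and the reward actually received, then averages it within each training loop. It assumes the baseline is the optimal reward for each observation. The reward values and the helper name `mean_regret_per_loop` are made up for illustration; the notebook's own values come from `metric_results`.

```python
from typing import List, Sequence


def mean_regret_per_loop(
    optimal_rewards: Sequence[Sequence[float]],
    received_rewards: Sequence[Sequence[float]],
) -> List[float]:
    """Average (optimal - received) reward within each training loop."""
    means = []
    for optimal, received in zip(optimal_rewards, received_rewards):
        regrets = [best - got for best, got in zip(optimal, received)]
        means.append(sum(regrets) / len(regrets))
    return means


# Hypothetical rewards for three training loops of two steps each.
optimal = [[5.0, 4.0], [5.0, 4.0], [5.0, 4.0]]
received = [[2.0, 1.0], [4.0, 3.0], [5.0, 4.0]]

regret_curve = mean_regret_per_loop(optimal, received)
print(regret_curve)  # [3.0, 1.0, 0.0]
```

A curve that falls toward zero, as in this toy data, suggests the policy is choosing near-optimal actions more often as training progresses.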

In [None]:
plotting_utils.plot(metric_results, "RegretMetric")

In [None]:
plotting_utils.plot(metric_results, "AverageReturnMetric")

### Create custom prediction container

As with training, create a custom prediction container. This container handles the TF-Agents-specific logic that differs from serving a regular TensorFlow model. Specifically, it finds the predicted action using a trained policy.

#### Serve predictions:
- Use [`tensorflow.saved_model.load`](https://www.tensorflow.org/agents/api_docs/python/tf_agents/policies/PolicySaver#usage), instead of [`tf_agents.policies.policy_loader.load`](https://github.com/tensorflow/agents/blob/r0.8.0/tf_agents/policies/policy_loader.py#L26), to load the trained policy, because the latter produces an object of type [`SavedModelPyTFEagerPolicy`](https://github.com/tensorflow/agents/blob/402b8aa81ca1b578ec1f687725d4ccb4115386d2/tf_agents/policies/py_tf_eager_policy.py#L137) whose `action()` is not compatible for use here.
- Prediction requests contain only observation data, not reward. The prediction task is a standalone request that does not require prior knowledge of the system state, and end users only know what they observe at the moment. Reward becomes available only after the action has been taken, so end users cannot supply it. To handle a prediction request, create a [`TimeStep`](https://www.tensorflow.org/agents/api_docs/python/tf_agents/trajectories/TimeStep) object (consisting of `observation`, `reward`, `discount`, and `step_type`) using the [`restart()`](https://www.tensorflow.org/agents/api_docs/python/tf_agents/trajectories/restart) function, which takes an `observation`. This function creates the *first* `TimeStep` in a trajectory, where `reward` is 0, `discount` is 1, and `step_type` marks the first time step. In other words, each prediction request forms the first `TimeStep` in a brand-new trajectory.
- For the prediction response, avoid NumPy-typed values; convert them to native Python values using methods such as [`tolist()`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.tolist.html) rather than `list()`, because `list()` keeps the individual elements as NumPy scalars.
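
The last two points can be sketched without TF-Agents installed. The stand-in below mimics the fields that `restart()` fills in for a batch of observations, then shows why `tolist()` is preferred over `list()` when building a JSON response. `FirstTimeStep`, `first_time_step`, the action values, and the response layout are hypothetical and used only for illustration; real serving code would call `tf_agents.trajectories.restart` and the loaded policy instead.

```python
import json
from typing import NamedTuple

import numpy as np


class FirstTimeStep(NamedTuple):
    """Hypothetical stand-in for the fields of a TF-Agents TimeStep."""
    step_type: np.ndarray
    reward: np.ndarray
    discount: np.ndarray
    observation: np.ndarray


def first_time_step(observation) -> FirstTimeStep:
    """Build the first step of a new trajectory for a batch of observations."""
    obs = np.asarray(observation, dtype=np.float32)
    batch_size = obs.shape[0]
    return FirstTimeStep(
        step_type=np.zeros(batch_size, dtype=np.int32),  # 0 marks the first step
        reward=np.zeros(batch_size, dtype=np.float32),   # no reward is known yet
        discount=np.ones(batch_size, dtype=np.float32),
        observation=obs,
    )


step = first_time_step([[1.0] * 20 for _ in range(8)])

# Made-up actions standing in for the policy output (one movie index per user).
actions = np.array([3, 7, 1, 0, 12, 5, 9, 2], dtype=np.int32)

# tolist() yields native Python ints, which json can serialize.
response = json.dumps({"predictions": actions.tolist()})

# list() keeps NumPy scalars, which json rejects.
try:
    json.dumps({"predictions": list(actions)})
    list_is_serializable = True
except TypeError:
    list_is_serializable = False
```

Here `list_is_serializable` ends up `False`, which is the failure mode the `tolist()` advice is meant to avoid.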

### Predict on the Endpoint
- Put prediction input(s) into a list named `instances`. The observation should be of dimension (BATCH_SIZE, RANK_K). Read more about the MovieLens simulation environment observation [here](https://github.com/tensorflow/agents/blob/v0.8.0/tf_agents/bandits/environments/movielens_py_environment.py#L32-L138).

In [None]:
recommended_movie_ids = predicting_utils.predict_observations_by_users(
    observation={"observation": [np.ones(20).tolist() for _ in range(8)]})
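
Before sending a request, it can help to check that the payload has the expected shape. The sketch below assumes `BATCH_SIZE = 8` and `RANK_K = 20`, matching the values used in the call above. The constants and the `build_instances` helper are illustrative and are not part of `predicting_utils`.

```python
import numpy as np

BATCH_SIZE = 8   # assumed number of users per request
RANK_K = 20      # assumed length of each user's feature vector


def build_instances(observations) -> list:
    """Validate the observation shape and convert it to native Python lists."""
    observations = np.asarray(observations, dtype=np.float32)
    if observations.shape != (BATCH_SIZE, RANK_K):
        raise ValueError(
            f"expected shape {(BATCH_SIZE, RANK_K)}, got {observations.shape}"
        )
    # One instance holding the whole batch, as in the call above.
    return [{"observation": observations.tolist()}]


instances = build_instances(np.ones((BATCH_SIZE, RANK_K)))
```

Failing early with a `ValueError` keeps shape mistakes on the client side instead of surfacing as an opaque error from the endpoint.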