# Contextual Multi Armed Bandits with ✨Generative✨ Reward Functions

Inspiration from this idea - somewhat of a LLM distillation method:
https://arxiv.org/abs/2303.00001

### Reward Design with Language Models
##### Minae Kwon, Sang Michael Xie, Kalesha Bullard, Dorsa Sadigh
`Reward design in reinforcement learning (RL) is challenging since specifying human notions of desired behavior may be difficult via reward functions or require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. Our approach leverages this proxy reward function in an RL framework. Specifically, users specify a prompt once at the beginning of training. During training, the LLM evaluates an RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning`

In this we will use PALM Text Bison. We will do the following

1. Generate "trajectory" training examples where we have a batch of n users, shown a randomly selected batch of n_actions movies
2. Take the raw textual data of those features and format a prompt asking text bison to rate the movie (float) on a 0-5 rating scale
3. These rewards are then used as we generte the reward and update the trajectory
4. Either generate the rest of these examples online (suggested for scale), or generate as-you-train
5. Create a deep network for an epsilon-greed multi-armed bandit agent to explore/exlpoit the rewards generated by text-bison


Future ideas

- When tempurature is set to high and/or token count is extended, it can be interesting to see it rationalize the ratings. 

- Mapreduce multiple prompts with...

- Mixture-of-experts approach

# Train Bandits with per-arm features

**Exploring linear and nonlinear** (e.g., those with neural network-based value functions) bandit methods for recommendations using TF-Agents

> Neural linear bandits provide a nice way to leverage the representation power of deep learning and the bandit approach for uncertainty measure and efficient exploration

## Load notebook config

* use the prefix defined in `00-env-setup`

In [119]:
PREFIX = 'mabv1'

In [None]:
# # staging GCS
# import subprocess

# GCP_PROJECTS             = !gcloud config get-value project
# PROJECT_ID               = GCP_PROJECTS[0]

# # GCS bucket and paths
# BUCKET_NAME              = f'{PREFIX}-{PROJECT_ID}-bucket'
# BUCKET_URI               = f'gs://{BUCKET_NAME}'

# # config = !gsutil cat {BUCKET_URI}/config/notebook_env.py
# # print(config.n)
# # exec(config.n)

In [130]:
import subprocess

GCP_PROJECTS             = "wortz-project-352116"
PROJECT_ID               = "wortz-project-352116"

# GCS bucket and paths
BUCKET_NAME              = f'{PREFIX}-{PROJECT_ID}-bucket'
BUCKET_URI               = f'gs://{BUCKET_NAME}'
LOCATION                 = "us-central1"

# config = subprocess.run([f"gsutil", "cat", f"{BUCKET_URI}/config/notebook_env.py"])
# config


PROJECT_ID               = "wortz-project-352116"
PROJECT_NUM              = "679926387543"
LOCATION                 = "us-central1"

REGION                   = "us-central1"
BQ_LOCATION              = "US"
VPC_NETWORK_NAME         = "ucaip-haystack-vpc-network"

VERTEX_SA                = "679926387543-compute@developer.gserviceaccount.com"

PREFIX                   = "mabv1"
VERSION                  = "v1"

BUCKET_NAME              = "mabv1-wortz-project-352116-bucket"
BUCKET_URI               = "gs://mabv1-wortz-project-352116-bucket"
DATA_GCS_PREFIX          = "data"
DATA_PATH                = "gs://mabv1-wortz-project-352116-bucket/data"
VOCAB_SUBDIR             = "vocabs"
VOCAB_FILENAME           = "vocab_dict.pkl"

VPC_NETWORK_FULL         = "projects/679926387543/global/networks/ucaip-haystack-vpc-network"

BIGQUERY_DATASET_ID      = "wortz-project-352116.movielens_dataset_mabv1"
BIGQUERY_TABLE_ID        = "wortz-project-352116.movielens_dataset_mabv1.training_dataset"

REPO_D

CompletedProcess(args=['gsutil', 'cat', 'gs://mabv1-wortz-project-352116-bucket/config/notebook_env.py'], returncode=0)

In [None]:

PROJECT_ID               = "wortz-project-352116"
PROJECT_NUM              = "679926387543"
LOCATION                 = "us-central1"

REGION                   = "us-central1"
BQ_LOCATION              = "US"
VPC_NETWORK_NAME         = "ucaip-haystack-vpc-network"

VERTEX_SA                = "679926387543-compute@developer.gserviceaccount.com"

PREFIX                   = "mabv1"
VERSION                  = "v1"

BUCKET_NAME              = "mabv1-wortz-project-352116-bucket"
BUCKET_URI               = "gs://mabv1-wortz-project-352116-bucket"
DATA_GCS_PREFIX          = "data"
DATA_PATH                = "gs://mabv1-wortz-project-352116-bucket/data"
VOCAB_SUBDIR             = "vocabs"
VOCAB_FILENAME           = "vocab_dict.pkl"

VPC_NETWORK_FULL         = "projects/679926387543/global/networks/ucaip-haystack-vpc-network"

BIGQUERY_DATASET_ID      = "wortz-project-352116.movielens_dataset_mabv1"
BIGQUERY_TABLE_ID        = "wortz-project-352116.movielens_dataset_mabv1.training_dataset"

REPO_DOCKER_PATH_PREFIX  = "src"
RL_SUB_DIR               = "per_arm_rl"
REPOSITORY               = "rl-movielens-mabv1"

In [120]:
# # staging GCS
# GCP_PROJECTS             = !gcloud config get-value project
# PROJECT_ID               = GCP_PROJECTS[0]

# # GCS bucket and paths
# BUCKET_NAME              = f'{PREFIX}-{PROJECT_ID}-bucket'
# BUCKET_URI               = f'gs://{BUCKET_NAME}'

# config = !gsutil cat {BUCKET_URI}/config/notebook_env.py
# print(config.n)
# exec(config.n)

## imports

In [4]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

In [16]:
import functools
from collections import defaultdict
from typing import Callable, Dict, List, Optional, TypeVar
from datetime import datetime
import time
from pprint import pprint
import pickle as pkl
# import torch
# from transformers import LlamaForCausalLM, LlamaTokenizer

# logging
import logging
logging.disable(logging.WARNING)

import matplotlib.pyplot as plt
import numpy as np

# google cloud
from google.cloud import aiplatform, storage

# tensorflow
import tensorflow as tf

#we set mem growth so we only use what gpu is needed - to make room for torch/LLaMA
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
    
gpus

# from tf_agents.agents import TFAgent

# from tf_agents.bandits.environments import stationary_stochastic_per_arm_py_environment as p_a_env
from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics
# from tf_agents.drivers import dynamic_step_driver
# from tf_agents.environments import tf_py_environment
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

# from tf_agents.bandits.agents import lin_ucb_agent
# from tf_agents.bandits.agents import linear_thompson_sampling_agent as lin_ts_agent
from tf_agents.bandits.agents import neural_epsilon_greedy_agent
from tf_agents.bandits.agents import neural_linucb_agent
from tf_agents.bandits.networks import global_and_arm_feature_network
from tf_agents.bandits.policies import policy_utilities

from tf_agents.bandits.specs import utils as bandit_spec_utils
from tf_agents.trajectories import trajectory

# GPU
from numba import cuda 
import gc

import sys
sys.path.append("..")

# this repo
# from src.per_arm_rl import data_utils

# tf exceptions and vars
if tf.__version__[0] != "2":
    raise Exception("The trainer only runs with TensorFlow version 2.")

T = TypeVar("T")

#### Helper functions

In [None]:
# from src.per_arm_rl import data_utils

def get_all_features():
    
    feats = {
        # user - global context features
        'user_id': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
        'user_rating': tf.io.FixedLenFeature(shape=(), dtype=tf.float32),
        'bucketized_user_age': tf.io.FixedLenFeature(shape=(), dtype=tf.float32),
        'user_occupation_text': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
        'user_occupation_label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
        'user_zip_code': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
        'user_gender': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
        'timestamp': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),

        # movie - per arm features
        'movie_id': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
        'movie_title': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
        'movie_genres': tf.io.FixedLenFeature(shape=(1,), dtype=tf.int64),
    }
    
    return feats

def parse_tfrecord(example):
    """
    Reads a serialized example from GCS and converts to tfrecord
    """
    feats = get_all_features()
    
    example = tf.io.parse_example(
        example,
        feats
        # features=feats
    )
    return example

In [17]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Num GPUs Available:  0


In [18]:
# device = cuda.get_current_device()
# device.reset()
# gc.collect()

In [19]:
# cloud storage client
storage_client = storage.Client(project=PROJECT_ID)

# Vertex client
aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Data prep

### Read TF Records

In [20]:
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.AUTO

In [21]:
SPLIT = "train" # "train" | "val"

train_files = []
for blob in storage_client.list_blobs(f"{BUCKET_NAME}", prefix=f'{DATA_GCS_PREFIX}/{SPLIT}'):
    if '.tfrecord' in blob.name:
        train_files.append(blob.public_url.replace("https://storage.googleapis.com/", "gs://"))
        
train_files

['gs://mabv1-wortz-project-352116-bucket/data/train/ml-ratings-100k-train.tfrecord']

In [22]:
train_dataset = tf.data.TFRecordDataset(train_files)
train_dataset = train_dataset.map(parse_tfrecord)

for x in train_dataset.batch(1).take(1):
    pprint(x)

{'bucketized_user_age': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([35.], dtype=float32)>,
 'movie_genres': <tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[7]])>,
 'movie_id': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'898'], dtype=object)>,
 'movie_title': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Postman, The (1997)'], dtype=object)>,
 'timestamp': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([885409515])>,
 'user_gender': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'False'], dtype=object)>,
 'user_id': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'681'], dtype=object)>,
 'user_occupation_label': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([14])>,
 'user_occupation_text': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'marketing'], dtype=object)>,
 'user_rating': <tf.Tensor: shape=(1,), dtype=float32, numpy=array([4.], dtype=float32)>,
 'user_zip_code': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'97208'], dtype=o

### get vocab

**TODO:** 
* streamline vocab calls

In [23]:
GENERATE_VOCABS = False
print(f"GENERATE_VOCABS: {GENERATE_VOCABS}")

VOCAB_SUBDIR   = "vocabs"
VOCAB_FILENAME = "vocab_dict.pkl"

GENERATE_VOCABS: False


In [None]:
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"

    # The ID of your GCS object
    # source_blob_name = "storage-object-name"

    # The path to which the file should be downloaded
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()

    bucket = storage_client.bucket(bucket_name)

    # Construct a client side representation of a blob.
    # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
    # any content from Google Cloud Storage. As we don't need additional data,
    # using `Bucket.blob` is preferred here.
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Downloaded storage object {} from bucket {} to local file {}.".format(
            source_blob_name, bucket_name, destination_file_name
        )
    )

In [24]:
if not GENERATE_VOCABS:

    EXISTING_VOCAB_FILE = f'gs://{BUCKET_NAME}/{VOCAB_SUBDIR}/{VOCAB_FILENAME}'
    print(f"Downloading vocab...")
    
    # os.system(f'gsutil -q cp {EXISTING_VOCAB_FILE} .')
    download_blob(BUCKET_NAME, f"{VOCAB_SUBDIR}/{VOCAB_FILENAME}", VOCAB_FILENAME)
    print(f"Downloaded vocab from: {EXISTING_VOCAB_FILE}\n")

    filehandler = open(VOCAB_FILENAME, 'rb')
    vocab_dict = pkl.load(filehandler)
    filehandler.close()
    
    for key in vocab_dict.keys():
        pprint(key)

Downloading vocab...
Downloaded vocab from: gs://mabv1-wortz-project-352116-bucket/vocabs/vocab_dict.pkl

'movie_id'
'user_id'
'user_occupation_text'
'movie_genres'
'bucketized_user_age'
'max_timestamp'
'min_timestamp'
'timestamp_buckets'


## helper functions

**TODO:**
* modularize in a train_utils or similar

In [25]:
def _add_outer_dimension(x):
    """Adds an extra outer dimension."""
    if isinstance(x, dict):
        for key, value in x.items():
            x[key] = tf.expand_dims(value, 1)
        return x
    return tf.expand_dims(x, 1)

# Multi-Armed Bandits with Per-Arm Features

In [26]:
from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics
from tf_agents.replay_buffers import tf_uniform_replay_buffer

nest = tf.nest

## Preprocessing layers for global and arm features

The preproccesing layers will ultimately feed the two functions described below, both of which will ultimately feed the `Environment`

`global_context_sampling_fn`: 
* A function that outputs a random 1d array or list of ints or floats
* This output is the global context. Its shape and type must be consistent across calls.

`arm_context_sampling_fn`: 
* A function that outputs a random 1 array or list of ints or floats (same type as the output of `global_context_sampling_fn`). * This output is the per-arm context. Its shape must be consistent across calls.

In [27]:
NUM_OOV_BUCKETS        = 1
GLOBAL_EMBEDDING_SIZE  = 16
MV_EMBEDDING_SIZE      = 32 #32

### global context (user) features

#### user ID

In [28]:
user_id_input_layer = tf.keras.Input(
    name="user_id",
    shape=(1,),
    dtype=tf.string
)

user_id_lookup = tf.keras.layers.StringLookup(
    max_tokens=len(vocab_dict['user_id']) + NUM_OOV_BUCKETS,
    num_oov_indices=NUM_OOV_BUCKETS,
    mask_token=None,
    vocabulary=vocab_dict['user_id'],
)(user_id_input_layer)

user_id_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=len(vocab_dict['user_id']) + NUM_OOV_BUCKETS,
    output_dim=GLOBAL_EMBEDDING_SIZE
)(user_id_lookup)

user_id_embedding = tf.reduce_sum(user_id_embedding, axis=-2)

# global_inputs.append(user_id_input_layer)
# global_features.append(user_id_embedding)

In [29]:
test_user_id_model = tf.keras.Model(inputs=user_id_input_layer, outputs=user_id_embedding)

for x in train_dataset.batch(1).take(1):
    print(x["user_id"])
    print(test_user_id_model(x["user_id"]))

tf.Tensor([b'681'], shape=(1,), dtype=string)
tf.Tensor(
[[-0.0430223   0.04316026 -0.03505059  0.04390352  0.02443171 -0.01438611
   0.01287739  0.03737282 -0.0031909   0.04948736  0.04524032 -0.00800508
   0.04057597  0.02142993  0.03996793  0.02976451]], shape=(1, 16), dtype=float32)


#### user AGE

In [30]:
user_age_input_layer = tf.keras.Input(
    name="bucketized_user_age",
    shape=(1,),
    dtype=tf.float32
)

user_age_lookup = tf.keras.layers.IntegerLookup(
    vocabulary=vocab_dict['bucketized_user_age'],
    num_oov_indices=NUM_OOV_BUCKETS,
    oov_value=0,
)(user_age_input_layer)

user_age_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=len(vocab_dict['bucketized_user_age']) + NUM_OOV_BUCKETS,
    output_dim=GLOBAL_EMBEDDING_SIZE
)(user_age_lookup)

user_age_embedding = tf.reduce_sum(user_age_embedding, axis=-2)

# global_inputs.append(user_age_input_layer)
# global_features.append(user_age_embedding)

In [31]:
test_user_age_model = tf.keras.Model(inputs=user_age_input_layer, outputs=user_age_embedding)

for x in train_dataset.batch(1).take(1):
    print(x["bucketized_user_age"])
    print(test_user_age_model(x["bucketized_user_age"]))

tf.Tensor([35.], shape=(1,), dtype=float32)
tf.Tensor(
[[-0.03988954 -0.01103028  0.04024919  0.01274309 -0.03701185  0.00081088
  -0.04950501 -0.04429914 -0.04845876  0.03500685 -0.01687964 -0.03256346
  -0.02901732  0.0246805  -0.0419592   0.00757987]], shape=(1, 16), dtype=float32)


#### user OCC

In [32]:
user_occ_input_layer = tf.keras.Input(
    name="user_occupation_text",
    shape=(1,),
    dtype=tf.string
)

user_occ_lookup = tf.keras.layers.StringLookup(
    max_tokens=len(vocab_dict['user_occupation_text']) + NUM_OOV_BUCKETS,
    num_oov_indices=NUM_OOV_BUCKETS,
    mask_token=None,
    vocabulary=vocab_dict['user_occupation_text'],
)(user_occ_input_layer)

user_occ_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=len(vocab_dict['user_occupation_text']) + NUM_OOV_BUCKETS,
    output_dim=GLOBAL_EMBEDDING_SIZE
)(user_occ_lookup)

user_occ_embedding = tf.reduce_sum(user_occ_embedding, axis=-2)

# global_inputs.append(user_occ_input_layer)
# global_features.append(user_occ_embedding)

In [33]:
test_user_occ_model = tf.keras.Model(inputs=user_occ_input_layer, outputs=user_occ_embedding)

for x in train_dataset.batch(1).take(1):
    print(x["user_occupation_text"])
    print(test_user_occ_model(x["user_occupation_text"]))

tf.Tensor([b'marketing'], shape=(1,), dtype=string)
tf.Tensor(
[[-0.01944429  0.02837546  0.04978884  0.00983376  0.0448536  -0.01951627
   0.00986196  0.03108202 -0.04180235 -0.02979856 -0.02010383 -0.02731179
  -0.04827795 -0.04703623 -0.04032074 -0.01869842]], shape=(1, 16), dtype=float32)


#### user Timestamp

In [34]:
user_ts_input_layer = tf.keras.Input(
    name="timestamp",
    shape=(1,),
    dtype=tf.int64
)

user_ts_lookup = tf.keras.layers.Discretization(
    vocab_dict['timestamp_buckets'].tolist()
)(user_ts_input_layer)

user_ts_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=len(vocab_dict['timestamp_buckets'].tolist()) + NUM_OOV_BUCKETS,
    output_dim=GLOBAL_EMBEDDING_SIZE
)(user_ts_lookup)

user_ts_embedding = tf.reduce_sum(user_ts_embedding, axis=-2)

# global_inputs.append(user_ts_input_layer)
# global_features.append(user_ts_embedding)

In [35]:
test_user_ts_model = tf.keras.Model(inputs=user_ts_input_layer, outputs=user_ts_embedding)

for x in train_dataset.batch(1).take(1):
    print(x["timestamp"])
    print(test_user_ts_model(x["timestamp"]))

tf.Tensor([885409515], shape=(1,), dtype=int64)
tf.Tensor(
[[-0.02315004  0.03875582 -0.01551851 -0.01455063 -0.01816684 -0.01592218
  -0.01438422 -0.01885771  0.02474555 -0.00832601 -0.01128403 -0.02329888
   0.00698436 -0.04715705 -0.00511736 -0.03002429]], shape=(1, 16), dtype=float32)


#### define global sampling function

In [36]:
def _get_global_context_features(x):
    """
    This function generates a single global observation vector.
    """
    user_id_value = x['user_id']
    user_age_value = x['bucketized_user_age']
    user_occ_value = x['user_occupation_text']
    user_ts_value = x['timestamp']

    _id = test_user_id_model(user_id_value) # input_tensor=tf.Tensor(shape=(4,), dtype=float32)
    _age = test_user_age_model(user_age_value)
    _occ = test_user_occ_model(user_occ_value)
    _ts = test_user_ts_model(user_ts_value)

    # # tmp - insepct numpy() values
    # print(_id.numpy()) #[0])
    # print(_age.numpy()) #[0])
    # print(_occ.numpy()) #[0])
    # print(_ts.numpy()) #[0])

    # to numpy array
    _id = np.array(_id.numpy())
    _age = np.array(_age.numpy())
    _occ = np.array(_occ.numpy())
    _ts = np.array(_ts.numpy())

    concat = np.concatenate(
        [_id, _age, _occ, _ts], axis=-1 # -1
    ).astype(np.float32)
    
    user_info = [
                user_age_value.numpy(),
                user_occ_value.numpy(),
                user_ts_value.numpy(),
                x['user_zip_code'].numpy(),
                x['user_gender'].numpy(),
                x['movie_title'].numpy(),
                x['user_rating'].numpy()
                ]

    return concat, user_info

In [37]:
for epoch in range(1):
    
    iterator = iter(train_dataset.batch(5))
    data = next(iterator)

In [38]:
data

{'bucketized_user_age': <tf.Tensor: shape=(5,), dtype=float32, numpy=array([35., 18., 56., 45., 35.], dtype=float32)>,
 'movie_genres': <tf.Tensor: shape=(5, 1), dtype=int64, numpy=
 array([[7],
        [4],
        [9],
        [4],
        [7]])>,
 'movie_id': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'898', b'367', b'484', b'494', b'58'], dtype=object)>,
 'movie_title': <tf.Tensor: shape=(5,), dtype=string, numpy=
 array([b'Postman, The (1997)', b'Clueless (1995)',
        b'Maltese Falcon, The (1941)', b'His Girl Friday (1940)',
        b'Quiz Show (1994)'], dtype=object)>,
 'timestamp': <tf.Tensor: shape=(5,), dtype=int64, numpy=array([885409515, 883388887, 891249586, 878044851, 880130613])>,
 'user_gender': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'False', b'True', b'True', b'True', b'False'], dtype=object)>,
 'user_id': <tf.Tensor: shape=(5,), dtype=string, numpy=array([b'681', b'442', b'932', b'506', b'18'], dtype=object)>,
 'user_occupation_label': <tf.Ten

In [39]:
_get_global_context_features(data)

(array([[-4.30223010e-02,  4.31602634e-02, -3.50505933e-02,
          4.39035185e-02,  2.44317092e-02, -1.43861063e-02,
          1.28773935e-02,  3.73728164e-02, -3.19089741e-03,
          4.94873635e-02,  4.52403165e-02, -8.00508261e-03,
          4.05759700e-02,  2.14299299e-02,  3.99679281e-02,
          2.97645070e-02, -3.98895368e-02, -1.10302791e-02,
          4.02491949e-02,  1.27430893e-02, -3.70118506e-02,
          8.10883939e-04, -4.95050065e-02, -4.42991368e-02,
         -4.84587550e-02,  3.50068472e-02, -1.68796405e-02,
         -3.25634629e-02, -2.90173180e-02,  2.46804990e-02,
         -4.19592038e-02,  7.57987425e-03, -1.94442868e-02,
          2.83754580e-02,  4.97888438e-02,  9.83376428e-03,
          4.48536016e-02, -1.95162650e-02,  9.86195728e-03,
          3.10820229e-02, -4.18023467e-02, -2.97985561e-02,
         -2.01038253e-02, -2.73117907e-02, -4.82779518e-02,
         -4.70362306e-02, -4.03207429e-02, -1.86984167e-02,
         -2.31500398e-02,  3.87558229e-0

In [40]:
# #check how this works with batches - new JW

# batch_elem = train_dataset.batch(4)
# _get_global_context_features(batch_elem)
_get_global_context_features(data)[0].shape

(5, 64)

In [41]:
for x in train_dataset.batch(1).take(1):
    test_globals = _get_global_context_features(x)[0]


test_globals

array([[-0.0430223 ,  0.04316026, -0.03505059,  0.04390352,  0.02443171,
        -0.01438611,  0.01287739,  0.03737282, -0.0031909 ,  0.04948736,
         0.04524032, -0.00800508,  0.04057597,  0.02142993,  0.03996793,
         0.02976451, -0.03988954, -0.01103028,  0.04024919,  0.01274309,
        -0.03701185,  0.00081088, -0.04950501, -0.04429914, -0.04845876,
         0.03500685, -0.01687964, -0.03256346, -0.02901732,  0.0246805 ,
        -0.0419592 ,  0.00757987, -0.01944429,  0.02837546,  0.04978884,
         0.00983376,  0.0448536 , -0.01951627,  0.00986196,  0.03108202,
        -0.04180235, -0.02979856, -0.02010383, -0.02731179, -0.04827795,
        -0.04703623, -0.04032074, -0.01869842, -0.02315004,  0.03875582,
        -0.01551851, -0.01455063, -0.01816684, -0.01592218, -0.01438422,
        -0.01885771,  0.02474555, -0.00832601, -0.01128403, -0.02329888,
         0.00698436, -0.04715705, -0.00511736, -0.03002429]],
      dtype=float32)

### arm preprocessing layers

#### movie ID

In [42]:
mv_id_input_layer = tf.keras.Input(
    name="movie_id",
    shape=(1,),
    dtype=tf.string
)

mv_id_lookup = tf.keras.layers.StringLookup(
    max_tokens=len(vocab_dict['movie_id']) + NUM_OOV_BUCKETS,
    num_oov_indices=NUM_OOV_BUCKETS,
    mask_token=None,
    vocabulary=vocab_dict['movie_id'],
)(mv_id_input_layer)

mv_id_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=len(vocab_dict['movie_id']) + NUM_OOV_BUCKETS,
    output_dim=MV_EMBEDDING_SIZE
)(mv_id_lookup)

mv_id_embedding = tf.reduce_sum(mv_id_embedding, axis=-2)

# arm_inputs.append(mv_id_input_layer)
# arm_features.append(mv_id_embedding)

In [43]:
test_mv_id_model = tf.keras.Model(inputs=mv_id_input_layer, outputs=mv_id_embedding)

for x in train_dataset.batch(1).take(1):
    print(x["movie_id"])
    print(test_mv_id_model(x["movie_id"]))

tf.Tensor([b'898'], shape=(1,), dtype=string)
tf.Tensor(
[[ 0.04911545 -0.01634108 -0.01234639 -0.0070103  -0.00573139 -0.00559377
   0.01259658 -0.00546701  0.0216699   0.01893058 -0.0372486   0.00512571
   0.02640522 -0.04102753  0.04012236 -0.03196473 -0.02491291  0.03234743
   0.01537937  0.01648374  0.01373536  0.00863097  0.03960489 -0.04251634
  -0.00295169  0.02778888 -0.01608081  0.03596785  0.0189887   0.04032845
   0.00537063  0.00472279]], shape=(1, 32), dtype=float32)


#### movie genre

In [44]:
mv_genre_input_layer = tf.keras.Input(
    name="movie_genres",
    shape=(1,),
    dtype=tf.float32
)

mv_genre_lookup = tf.keras.layers.IntegerLookup(
    vocabulary=vocab_dict['movie_genres'],
    num_oov_indices=NUM_OOV_BUCKETS,
    oov_value=0,
)(mv_genre_input_layer)

mv_genre_embedding = tf.keras.layers.Embedding(
    # Let's use the explicit vocabulary lookup.
    input_dim=len(vocab_dict['movie_genres']) + NUM_OOV_BUCKETS,
    output_dim=MV_EMBEDDING_SIZE
)(mv_genre_lookup)

mv_genre_embedding = tf.reduce_sum(mv_genre_embedding, axis=-2)

# arm_inputs.append(mv_genre_input_layer)
# arm_features.append(mv_genre_embedding)

In [45]:
test_mv_gen_model = tf.keras.Model(inputs=mv_genre_input_layer, outputs=mv_genre_embedding)

for x in train_dataset.batch(1).take(1):
    print(x["movie_genres"])
    print(x["movie_id"])
    print(test_mv_gen_model(x["movie_genres"]))

tf.Tensor([[7]], shape=(1, 1), dtype=int64)
tf.Tensor([b'898'], shape=(1,), dtype=string)
tf.Tensor(
[[-0.04909344 -0.03367907  0.018834   -0.00841088  0.01948916 -0.04885552
  -0.00617999 -0.02069405 -0.04477356  0.03419046 -0.04278373  0.00344169
   0.04837879 -0.03245031 -0.02844121  0.02616468 -0.04857191 -0.00706835
   0.03614764 -0.03875374 -0.02769949 -0.03707442  0.00756137  0.04047615
  -0.00872046  0.00992896 -0.01910238 -0.01092043  0.02773506 -0.00496411
   0.04582066  0.02387423]], shape=(1, 32), dtype=float32)


#### define sampling function

In [46]:
def _get_per_arm_features(x):
    """
    This function generates a single per-arm observation vector
    """
    mv_id_value = x['movie_id']
    mv_gen_value = x['movie_genres']

    _mid = test_mv_id_model(mv_id_value)
    _mgen = test_mv_gen_model(mv_gen_value)

    # to numpy array
    _mid = np.array(_mid.numpy())
    _mgen = np.array(_mgen.numpy())

    # print(_mid)
    # print(_mgen)

    concat = np.concatenate(
        [_mid, _mgen], axis=-1 # -1
    ).astype(np.float32)
    # concat = tf.concat([_mid, _mgen], axis=-1).astype(np.float32)

    return concat #this is special to this example - there is only one action dimensions

In [47]:
_get_per_arm_features(data).shape #shape checks out at batchdim, nactions, arm feats

(5, 64)

### Create a moive lookup Table 🆕

This will be used in our trajectories to randomly select a movie. Using the produced embeddings, we will also have a reward function for each combination by taking the inner product via `tf_agents.bandits.networks.global_and_arm_feature_network.create_feed_forward_dot_product_network` [link](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/networks/global_and_arm_feature_network/create_feed_forward_dot_product_network)

In [48]:
movie_lookup_table = {'id': [],
                      'movie_features': [],
                      'movie_title': [],
                      'movie_genres': []
                     }
    
iterator = iter(train_dataset.batch(1000))
for data in iterator:
    _get_per_arm_features(data)
    movie_lookup_table['id'].extend(data['movie_id'].numpy())
    movie_lookup_table['movie_title'].extend(data['movie_title'].numpy())
    movie_lookup_table['movie_genres'].extend(data['movie_genres'].numpy())
    movie_lookup_table['movie_features'].extend(_get_per_arm_features(data))
    
#fix string ids to integers for random lookup later
movie_lookup_table['id'] = [int(x) for x in movie_lookup_table['id']]

In [49]:
import pandas as pd



movie_lookup_table = pd.DataFrame(movie_lookup_table)
movie_lookup_table.set_index(['id'])

unique_table = movie_lookup_table.groupby(['id'])[['movie_features', 'movie_title', 'movie_genres']].first().reset_index() #resetting index to get consecutive counts from min-max (no gaps)
# unique_table = unique_table['movie_features']
MAX_ARM_ID = len(unique_table)-1
MIN_ARM_ID = 0

# unique_table
# print(f"Max movie id is: {MAX_ARM_ID} \nMin movie id is: {MIN_ARM_ID}")

In [50]:
unique_table.iloc[2,:]['movie_features'] #example of getting a ra movie

array([ 0.01271609, -0.01063954, -0.04453922,  0.03591642,  0.00564051,
        0.02092992, -0.04488032, -0.02584858, -0.04740479, -0.03002661,
       -0.00383632,  0.03370481,  0.03419319, -0.00474702,  0.03322557,
        0.03964141, -0.01609331, -0.00516544,  0.01496688, -0.01403636,
        0.04762489, -0.00278686, -0.03502396, -0.04725162, -0.0474501 ,
       -0.01647393, -0.04304332, -0.02555263,  0.04443676,  0.04496859,
        0.01758147,  0.03720978, -0.04137355,  0.00880221,  0.04673413,
        0.01510568,  0.00661532, -0.01288216,  0.04426864, -0.01195104,
       -0.00610339, -0.03908002,  0.04702449,  0.03476996,  0.02464242,
       -0.04901971, -0.00125569, -0.04959689, -0.04279599, -0.02668892,
       -0.02922921,  0.02490618, -0.02546238, -0.03765206, -0.0131256 ,
       -0.00520775,  0.03591717, -0.02238215,  0.01095585,  0.01336155,
        0.01121206, -0.03909452,  0.00455445,  0.00605113], dtype=float32)

In [51]:
def get_random_arm_features(movie_id):
    movie_info = unique_table.iloc[movie_id]
    tensor = tf.constant(movie_info['movie_features'], dtype=tf.float32)
    return tf.reshape(tensor, [1, tensor.shape[0]]), [movie_info['movie_title'],
                                                     movie_info['movie_genres']]

get_random_arm_features(222)

(<tf.Tensor: shape=(1, 64), dtype=float32, numpy=
 array([[ 0.01961209, -0.04153495, -0.03037058, -0.04911233, -0.00121395,
         -0.00420331, -0.01246155,  0.01867643, -0.04896856,  0.04429226,
         -0.00705642, -0.04387381, -0.03706367,  0.0153186 , -0.00388533,
         -0.0074302 ,  0.04971996,  0.01255456,  0.02609159, -0.00529303,
         -0.04905823,  0.00490832,  0.02905465,  0.01013665,  0.02138368,
          0.00283175, -0.04763628,  0.039085  , -0.0075358 ,  0.01310653,
         -0.02010018,  0.02264884, -0.04909344, -0.03367907,  0.018834  ,
         -0.00841088,  0.01948916, -0.04885552, -0.00617999, -0.02069405,
         -0.04477356,  0.03419046, -0.04278373,  0.00344169,  0.04837879,
         -0.03245031, -0.02844121,  0.02616468, -0.04857191, -0.00706835,
          0.03614764, -0.03875374, -0.02769949, -0.03707442,  0.00756137,
          0.04047615, -0.00872046,  0.00992896, -0.01910238, -0.01092043,
          0.02773506, -0.00496411,  0.04582066,  0.02387423]],

In [52]:
def get_random_set_of_arm_features(n_actions):
    random_arm_ids = list(np.random.randint(MIN_ARM_ID, MAX_ARM_ID, n_actions))
    features = [get_random_arm_features(x) for x in random_arm_ids]
    just_features = [x[0] for x in features]
    movie_info = [x[1] for x in features]
    return tf.concat(just_features, axis=0), movie_info

In [53]:
get_random_set_of_arm_features(n_actions=2)[0] #NEW - there's a tuple returned with the movies we will use for PALM!

<tf.Tensor: shape=(2, 64), dtype=float32, numpy=
array([[ 0.00206195, -0.02604058, -0.04603716,  0.04289932,  0.00313123,
         0.0446151 , -0.04643131, -0.03961716,  0.02898718,  0.02279401,
         0.03361676,  0.03951753, -0.00993199,  0.03707865,  0.04852705,
        -0.00771183, -0.04796034,  0.03386337,  0.00921752,  0.03477183,
        -0.00896877, -0.03940413, -0.04002311, -0.01100897,  0.0089735 ,
        -0.00648671,  0.04059838,  0.03659462,  0.02175606, -0.00847901,
        -0.03849056, -0.04923638, -0.04909344, -0.03367907,  0.018834  ,
        -0.00841088,  0.01948916, -0.04885552, -0.00617999, -0.02069405,
        -0.04477356,  0.03419046, -0.04278373,  0.00344169,  0.04837879,
        -0.03245031, -0.02844121,  0.02616468, -0.04857191, -0.00706835,
         0.03614764, -0.03875374, -0.02769949, -0.03707442,  0.00756137,
         0.04047615, -0.00872046,  0.00992896, -0.01910238, -0.01092043,
         0.02773506, -0.00496411,  0.04582066,  0.02387423],
       [ 0.035

In [54]:
### Look at the raw input features to format a good prompt for ranking movies
NUM_ACTIONS = 5
batch_size = 8
iterator = iter(train_dataset.batch(batch_size))
data = next(iterator)

_, user_info = _get_global_context_features(data) #new - user info passes on the raw user features for prompting with PALM
###NEW - we are getting the arm features here
_, movie_info = get_random_set_of_arm_features(n_actions=NUM_ACTIONS)

print(user_info, movie_info)

[array([35., 18., 56., 45., 35., 25., 25., 35.], dtype=float32), array([b'marketing', b'student', b'educator', b'programmer', b'other',
       b'programmer', b'other', b'executive'], dtype=object), array([885409515, 883388887, 891249586, 878044851, 880130613, 892778202,
       879959212, 877131685]), array([b'97208', b'85282', b'06437', b'03869', b'37212', b'55414',
       b'06405', b'L1V3W'], dtype=object), array([b'False', b'True', b'True', b'True', b'False', b'True', b'False',
       b'True'], dtype=object), array([b'Postman, The (1997)', b'Clueless (1995)',
       b'Maltese Falcon, The (1941)', b'His Girl Friday (1940)',
       b'Quiz Show (1994)', b"Carlito's Way (1993)",
       b'Primal Fear (1996)', b'Aladdin (1992)'], dtype=object), array([4., 2., 5., 5., 4., 4., 5., 4.], dtype=float32)] [[b'Roman Holiday (1953)', array([4])], [b'Careful (1992)', array([4])], [b'Mille bolle blu (1993)', array([4])], [b'Two Bits (1995)', array([7])], [b"What's Love Got to Do with It (1993)", arr

In [55]:
from datetime import datetime
dt = datetime.utcfromtimestamp(885409515)
dt.ctime()

'Wed Jan 21 19:05:15 1998'

## Quick inspection of movie info
Here's an example of `N_ACTIONS` randomly selected movies

In [56]:
movie_info

[[b'Roman Holiday (1953)', array([4])],
 [b'Careful (1992)', array([4])],
 [b'Mille bolle blu (1993)', array([4])],
 [b'Two Bits (1995)', array([7])],
 [b"What's Love Got to Do with It (1993)", array([7])]]

##### Feature formats info reference

[BUCKETIZED AGE](https://www.tensorflow.org/datasets/catalog/movielens)


[GENRE_LIST](https://files.grouplens.org/datasets/movielens/ml-10m-README.html)

In [140]:
from datetime import datetime
from pprint import pprint

age_text_lookup = {
'1': "Under 18",
'18': "18-24",
'25': "25-34",
'35': "35-44",
'45': "45-49",
'50': "50-55",
'56': "56+"
}

genre_list = [
    "Action",
    "Adventure",
    "Animation",
    "Children's",
    "Comedy",
    "Crime",
    "Documentary",
    "Drama",
    "Fantasy",
    "Film-Noir",
    "Horror",
    "Musical",
    "Mystery",
    "Romance",
    "Sci-Fi",
    "Thriller",
    "War",
    "Western",
] #use this to lookup genres

def gender_movielens_translator(elem):
    if elem=="True":
        return "male" 
    else:
        return "non-male"

rating_scale = '''
5 - highly recommended movie
4 - somewhat recommended movie
3 - maybe watch movie
2 - not a good movie
1 - really bad movie
'''

age, occ, time, zipcode, gender, ex_movie, ex_movie_rating = user_info[0], user_info[1], user_info[2], user_info[3], user_info[4], user_info[5], user_info[6]

prompts = []
for i in range(len(age)):
    formatted_datetime = datetime.utcfromtimestamp(time[i]).ctime()
    gender = gender_movielens_translator(gender[i])
    prompt = f"""You are looking to watch a movie and need to review each movie based on user demographics 
Here are some info on this the user: 
the user is age is {age_text_lookup[str(int(age[i]))]}, {gender[i]}
and lives in zipcode {zipcode[i].decode("utf-8")}
the user's occupation is {occ[i].decode("utf-8")} 
the user previously reviewed {ex_movie[i].decode("utf-8")}, 
giving it a {float(ex_movie_rating[i])} out five star review during {formatted_datetime}
    
Please rate these movies below using using {rating_scale}
"""
    
    for j, movie in enumerate(movie_info):
        try:
            genre = genre_list[movie[1][0]]
        except:
            genre = 'NA'
        prompt += f"\n{j+1}. {movie[0].decode('utf-8')}, {genre}"
        total_movies = j+1
    prompt += f"\n please rate the {total_movies} movies"
    prompt += f"\n ensure you return the ratings as a python list of just the ratings for {total_movies} movies"
        
    ## next add in the movie selections
    prompts.append(prompt)
pprint(prompts[0])

('You are looking to watch a movie and need to review each movie based on user '
 'demographics \n'
 'Here are some info on this the user: \n'
 'the user is age is 35-44, n\n'
 'and lives in zipcode 97208\n'
 "the user's occupation is marketing \n"
 'the user previously reviewed Postman, The (1997), \n'
 'giving it a 4.0 out five star review during Wed Jan 21 19:05:15 1998\n'
 '    \n'
 'Please rate these movies below using using \n'
 '5 - highly recommended movie\n'
 '4 - somewhat recommended movie\n'
 '3 - maybe watch movie\n'
 '2 - not a good movie\n'
 '1 - really bad movie\n'
 '\n'
 '\n'
 '1. Roman Holiday (1953), Comedy\n'
 '2. Careful (1992), Comedy\n'
 '3. Mille bolle blu (1993), Comedy\n'
 '4. Two Bits (1995), Drama\n'
 "5. What's Love Got to Do with It (1993), Drama\n"
 ' please rate the 5 movies\n'
 ' ensure you return the ratings as a python list of just the ratings for 5 '
 'movies')


In [58]:
movie_info

[[b'Roman Holiday (1953)', array([4])],
 [b'Careful (1992)', array([4])],
 [b'Mille bolle blu (1993)', array([4])],
 [b'Two Bits (1995)', array([7])],
 [b"What's Love Got to Do with It (1993)", array([7])]]

In [59]:
def RL_prompt(user_info, movie_info):
    
    age, occ, time, zipcode, gender, ex_movie, ex_movie_rating = user_info[0], user_info[1], user_info[2], user_info[3], user_info[4], user_info[5], user_info[6]

    prompts = []
    for i in range(len(age)):
        formatted_datetime = datetime.utcfromtimestamp(time[i]).ctime()
        gender = gender_movielens_translator(gender[i])
        prompt = f"""CONTEXT: Pretend you are looking to watch a movie and need to review each movie on your user profile 
USER PROFILE: 
your age is {age_text_lookup[str(int(age[i]))]}, {gender[i]}
and lives you live in zipcode {zipcode[i].decode("utf-8")}
your occupation is {occ[i].decode("utf-8")} 
plus you previously reviewed {ex_movie[i].decode("utf-8")}, 
giving it a {int(ex_movie_rating[i])} out 5.0 star review during {formatted_datetime}

Review these movies using this scale: {rating_scale}
Here are the movies to you need to rate: """
        for j, movie in enumerate(movie_info):
            try:
                genre = genre_list[movie[1][0]]
            except:
                genre = 'NA'
            prompt += f"\n{j+1}. {movie[0].decode('utf-8')}, {genre}"
        # prompt += textwrap.dedent(f"\n Please rate these movies below using using this scale: {rating_scale}")
        # prompt += f"Q: return the ratings of these movies like so: 3.5, 4, ... for each movie:" #llm recency bias
        prompts.append(prompt)
    return prompts

In [60]:
prompts = RL_prompt(user_info, movie_info)

len(prompts)
pprint(prompts[0])

('CONTEXT: Pretend you are looking to watch a movie and need to review each '
 'movie on your user profile \n'
 'USER PROFILE: \n'
 'your age is 35-44, n\n'
 'and lives you live in zipcode 97208\n'
 'your occupation is marketing \n'
 'plus you previously reviewed Postman, The (1997), \n'
 'giving it a 4 out 5.0 star review during Wed Jan 21 19:05:15 1998\n'
 '\n'
 'Review these movies using this scale: \n'
 '5 - highly recomended movie\n'
 '4 - somewhat recommend movie\n'
 '3 - maybe watch movie\n'
 '2 - not a good movie\n'
 '1 - really bad movie\n'
 '\n'
 'Here are the movies to you need to rate: \n'
 '1. Roman Holiday (1953), Comedy\n'
 '2. Careful (1992), Comedy\n'
 '3. Mille bolle blu (1993), Comedy\n'
 '4. Two Bits (1995), Drama\n'
 "5. What's Love Got to Do with It (1993), Drama")


## Adding in reward function with PALM!

In [61]:
## Adding in reward function with PALM!

import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="wortz-project-352116", location="us-central1")
parameters = {
    "temperature": 0.0,
    "max_output_tokens": 400,
    "top_p": 0.8,
    "top_k": 40
}
llm = TextGenerationModel.from_pretrained("text-bison")
response = llm.predict(
    "How are you today?",
    **parameters
)
response.text

'I am doing well today, thank you for asking! I am excited to be learning more about natural language processing and how it can be used to improve the customer experience.'

In [62]:
# create a rate limiter function
import time

def wait(secs):
    def decorator(func):
        def wrapper(*args, **kwargs):
            time.sleep(secs)
            return func(*args, **kwargs)
        return wrapper
    return decorator

In [63]:
@wait(1)
def palm_rate_limited(prompt, *args, **kwargs):
    return llm.predict(prompt, *args, **kwargs)

In [64]:
palm_rate_limited("hello")

world

```
#include <stdio.h>

int main() {
  printf("Hello, world!\n");
  return 0;
}
```

Output:

```
Hello, world!
```

##### WIP Adding in reward funciton with OpenLLaMA

https://arxiv.org/abs/2302.13971

Running this locally may improve run times for reward examples

Note PEFT/LORA would be a good opition to do a pass over the entire dataset, then use fine-tuned model TODO

##### Follow pip install for packages below

```python
!pip install torch transformers sentencepiece --user
!pip install accelerate --user
# https://github.com/pytorch/pytorch/issues/90673
!echo Y | conda install libcusparse=11.7.3.50 -c nvidia
!echo Y |conda install cudatoolkit=11.8 -c pytorch -c nvidia #I did a thing...
```

In [65]:
# import torch
# from transformers import LlamaForCausalLM, LlamaTokenizer

# model_path = "openlm-research/open_llama_3b"


# tokenizer = LlamaTokenizer.from_pretrained(model_path)

# model = LlamaForCausalLM.from_pretrained(
#     model_path,
#     torch_dtype=torch.float16,
#     device_map="auto"
# )

# prompt = "Q: What is the largest animal?\nA:"
# input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# input_ids = input_ids.to("cuda")
# generation_output = model.generate(input_ids=input_ids, max_new_tokens=32)
# print(tokenizer.decode(generation_output[0]))

In [66]:
# def get_llama_predictions(prompt, max_new_tokens):
#     input_ids = tokenizer(prompt, return_tensors="pt").input_ids
#     input_ids = input_ids.to("cuda")
#     generation_output = model.generate(input_ids=input_ids, max_new_tokens=max_new_tokens)
#     return(tokenizer.decode(generation_output[0]))

In [131]:
pprint(prompts)

['CONTEXT: Pretend you are looking to watch a movie and need to review each '
 'movie on your user profile \n'
 'USER PROFILE: \n'
 'your age is 35-44, n\n'
 'and lives you live in zipcode 97208\n'
 'your occupation is marketing \n'
 'plus you previously reviewed Postman, The (1997), \n'
 'giving it a 4 out 5.0 star review during Wed Jan 21 19:05:15 1998\n'
 '\n'
 'Review these movies using this scale: \n'
 '5 - highly recomended movie\n'
 '4 - somewhat recommend movie\n'
 '3 - maybe watch movie\n'
 '2 - not a good movie\n'
 '1 - really bad movie\n'
 '\n'
 'Here are the movies to you need to rate: \n'
 '1. Roman Holiday (1953), Comedy\n'
 '2. Careful (1992), Comedy\n'
 '3. Mille bolle blu (1993), Comedy\n'
 '4. Two Bits (1995), Drama\n'
 "5. What's Love Got to Do with It (1993), Drama",
 'CONTEXT: Pretend you are looking to watch a movie and need to review each '
 'movie on your user profile \n'
 'USER PROFILE: \n'
 'your age is 18-24, o\n'
 'and lives you live in zipcode 85282\n'
 'yo

In [132]:
# %%time
# get_llama_predictions("what is your favorite color?", 30)

In [133]:
# %%time
# get_llama_predictions(prompt[0], 50)

In [134]:
# prompt = prompts[0]
# input_ids = tokenizer(prompt, return_tensors="pt").input_ids
# input_ids = input_ids.to("cuda")
# generation_output = model.generate(input_ids=input_ids, max_new_tokens=10)
# print(tokenizer.decode(generation_output[0]))

In [135]:
# %%time
### PALM test prompt!

rating = llm.predict(prompts[0], **parameters)
extraction_prompt = "extract the ratings in order in a simple comma seperated list:"
ratings = llm.predict(f"{rating.text} {extraction_prompt}", **parameters)
ratings.text

'5, 3, 2, 3, 4'

In [136]:
def llm_call(prompts):
    ratings_list = []
    for prompt in prompts:
        rating = palm_rate_limited(prompt, **parameters)
        extraction_prompt = "extract the numeric-only ratings a comma seperated list:"
        ratings = palm_rate_limited(f"given the output {rating.text}, {extraction_prompt}", **parameters)
        ratings_list.append(ratings.text)
    return ratings_list

In [137]:
#now try to put it together by getting ratings for a batch with multiple arms

print(batch_size, NUM_ACTIONS)

8 5


In [138]:
unvalidated_llm_response = llm_call(prompts)

In [139]:
unvalidated_llm_response

['5, 3, 2, 3, 4',
 '4, 3, 2, 3, 4',
 '5, 3, 3, 2, 4',
 '5, 3, 3, 2, 4',
 '5, 3, 2, 3, 4',
 '5, 3, 2, 3, 4',
 '5, 3, 2, 3, 4',
 '5, 3, 2, 3, 4']

In [145]:
import re
# ERROR 2023-08-22T03:37:47.867063105Z [resource.labels.taskName: workerpool0-0] ValueError: could not convert string to float: ' 4/5'
# ERROR 2023-08-22T03:37:47.867039137Z [resource.labels.taskName: workerpool0-0] return [[float(y) for y in x] for x in str_list]

def validate_llm_response(llm_response, n_actions):
    "this formats the text lists into a list of floats and also"
    "TODO - handles when LLM has poor output"
    str_list = []
    for resp in llm_response:
        str_elem = [y for y in resp.split(',')]
        if len(str_elem) != n_actions:
             str_elem = list(np.ones(n_actions)*3) #default rating of all threes if we can't figure it out TODO
        try:
            [float(y) for y in str_elem] #check if we can do a float conversion
        except:
            str_elem = list(np.ones(n_actions)*3)
        str_list.append(str_elem)
    # re_clean_list = [[re.findall(r'\d+', y) for y in x] for x in str_list]
    try:
        return [[float(y) for y in x] for x in str_list]
    except:
        return list(np.ones(n_actions)*3)

In [146]:
llm_rewards = validate_llm_response(unvalidated_llm_response, NUM_ACTIONS)
llm_rewards

[[5.0, 3.0, 2.0, 3.0, 4.0],
 [4.0, 3.0, 2.0, 3.0, 4.0],
 [5.0, 3.0, 3.0, 2.0, 4.0],
 [5.0, 3.0, 3.0, 2.0, 4.0],
 [5.0, 3.0, 2.0, 3.0, 4.0],
 [5.0, 3.0, 2.0, 3.0, 4.0],
 [5.0, 3.0, 2.0, 3.0, 4.0],
 [5.0, 3.0, 2.0, 3.0, 4.0]]

## Finally, put it together into the LLM reward


In [78]:
def llm_reward(user_info, movie_info, num_actions):
    prompts = RL_prompt(user_info, movie_info)
    unvalidated_llm_response = llm_call(prompts)
    return validate_llm_response(unvalidated_llm_response, num_actions)

In [79]:
# ### Look at the raw input features to format a good prompt for ranking movies
# NUM_ACTIONS = 5
# batch_size = 8
# iterator = iter(train_dataset.batch(batch_size))
# test_steps = 3
# for _ in range(3):
#     data = next(iterator)

#     _, user_info = _get_global_context_features(data) #new - user info passes on the raw user features for prompting with PALM
#     ###NEW - we are getting the arm features here
#     _, movie_info = get_random_set_of_arm_features(n_actions=NUM_ACTIONS)


#     llm_reward(user_info, movie_info, NUM_ACTIONS) #batch size by n_actions/arms

In [None]:
# add one more validation - we will add a null tie in case of bad formatting TODO
### should make sure we have the correct shapes

# TF-Agents implementation

In TF-Agents, the *per-arm features* implementation differs from the *global-only* feature examples in the following aspects:
* Reward is modeled not per-arm, but globally.
* The arms are permutation invariant: it doesn’t matter which arm is arm 1 or arm 2, only their features.
* One can have a different number of arms to choose from in every step (note that unspecified/dynamically changing number of arms will have a problem with XLA compatibility).

When implementing per-arm features in TF-Bandits, the following details have to be discussed:
* Observation spec and observations,
* Action spec and actions,
* Implementation of specific policies and agents.


**TODO:**
* outline the components and highlight their interactions, dependencies on eachother, etc.

In [80]:
BATCH_SIZE  = 8
NUM_ACTIONS = 5 

# GLOBAL_EMBEDDING_SIZE  = 16
# MV_EMBEDDING_SIZE      = 32 #32

GLOBAL_DIM = GLOBAL_EMBEDDING_SIZE * 4 # 4 global features in this example
PER_ARM_DIM = MV_EMBEDDING_SIZE * 2 # 2 movie features

print(f"BATCH_SIZE  : {BATCH_SIZE}")
print(f"NUM_ACTIONS : {NUM_ACTIONS}")

BATCH_SIZE  : 8
NUM_ACTIONS : 5


## Tensor Specs

**TODO:**
* explain relationship between Tensor Specs and their Tensor counterparts
* highlight the errors, lessons learned, and utility functions to address these

### Observation spec

**This observation spec allows the user to have a global observation of fixed dimension**, and an unspecified number of *per-arm* features (also of fixed dimension)
* The actions output by the policy are still integers as usual, and they indicate which row of the arm-features it has chosen 
* The action spec must be a single integer value without boundaries:

```python
global_spec = tensor_spec.TensorSpec([GLOBAL_DIM], tf.float32)
per_arm_spec = tensor_spec.TensorSpec([None, PER_ARM_DIM], tf.float32)
observation_spec = {'global': global_spec, 'per_arm': per_arm_spec}

action_spec = tensor_spec.TensorSpec((), tf.int32)
```
> Here the only difference compared to the action spec with global features only is that the tensor spec is not bounded, as we don’t know how many arms there will be at any time step

**XLA compatibility:**
* Since dynamic tensor shapes are not compatible with XLA, the number of arm features (and consequently, number of arms for a step) cannot be dynamic. 
* One workaround is to fix the maximum number of arms for a problem, then pad the arm features in steps with fewer arms, and use action masking to indicate how many arms are actually active.

```python
per_arm_spec = tensor_spec.TensorSpec([NUM_ACTIONS, PER_ARM_DIM], tf.float32)

action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum = 0, maximum = NUM_ACTIONS - 1
)
```

In [81]:
observation_spec = {
    'global': tf.TensorSpec([GLOBAL_DIM], tf.float32),
    'per_arm': tf.TensorSpec([NUM_ACTIONS, PER_ARM_DIM], tf.float32) #excluding action dim here
}
observation_spec

{'global': TensorSpec(shape=(64,), dtype=tf.float32, name=None),
 'per_arm': TensorSpec(shape=(5, 64), dtype=tf.float32, name=None)}

### Action spec

> The time_step_spec and action_spec are specifications for the input time step and the output action

```python
    if (
        not tensor_spec.is_bounded(action_spec)
        or not tensor_spec.is_discrete(action_spec)
        or action_spec.shape.rank > 1
        or action_spec.shape.num_elements() != 1
    ):
      raise NotImplementedError(
          'action_spec must be a BoundedTensorSpec of type int32 and shape (). '
          'Found {}.'.format(action_spec)
      )
```

* [src](https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/policies/reward_prediction_base_policy.py#L97)

In [82]:
action_spec = tensor_spec.BoundedTensorSpec(
    shape=[], 
    dtype=tf.int32,
    minimum=tf.constant(0),            
    maximum=NUM_ACTIONS-1, #n degrees of freedom and will dictate the expected mean reward spec shape
    name="action_spec"
)

action_spec

BoundedTensorSpec(shape=(), dtype=tf.int32, name='action_spec', minimum=array(0, dtype=int32), maximum=array(4, dtype=int32))

### TimeStep spec

In [83]:
time_step_spec = ts.time_step_spec(observation_spec)#, reward_spec=tf.TensorSpec([1, NUM_ACTIONS]))
time_step_spec

TimeStep(
{'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'observation': {'global': TensorSpec(shape=(64,), dtype=tf.float32, name=None),
                 'per_arm': TensorSpec(shape=(5, 64), dtype=tf.float32, name=None)},
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')})

## The Agent

**Note** that contextual bandits form a special case of RL, where the actions taken by the agent do not alter the state of the environment 

> “Contextual” refers to the fact that the agent chooses among a set of actions while having knowledge of the context (environment observation)

### Agent types

**Possible Agent Types:**

```
AGENT_TYPE = ['LinUCB', 'LinTS', 'epsGreedy', 'NeuralLinUCB']
```

**LinearUCBAgent:** (`LinUCB`)
* An agent implementing the Linear UCB bandit algorithm
* (whitepaper) [A contextual bandit approach to personalized news recommendation](https://arxiv.org/abs/1003.0146)
* [docs](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/agents/lin_ucb_agent/LinearUCBAgent)

**LinearThompsonSamplingAgent:** (`LinTS`)
* Implements the Linear Thompson Sampling Agent from the paper: [Thompson Sampling for Contextual Bandits with Linear Payoffs](https://arxiv.org/abs/1209.3352)
* the agent maintains two parameters `weight_covariances` and `parameter_estimators`, and updates them based on experience.
* The inverse of the weight covariance parameters are updated with the outer product of the observations using the Woodbury inverse matrix update, while the parameter estimators are updated by the reward-weighted observation vectors for every action
* [docs](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/agents/linear_thompson_sampling_agent/LinearThompsonSamplingAgent)

**NeuralEpsilonGreedyAgent:** (`epsGreedy`) 
* A neural network based epsilon greedy agent
* This agent receives a neural network that it trains to predict rewards
* The action is chosen greedily with respect to the prediction with probability `1 - epsilon`, and uniformly randomly with probability epsilon
* [docs](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/agents/neural_epsilon_greedy_agent/NeuralEpsilonGreedyAgent)

**NeuralLinUCBAgent:** (`NeuralLinUCB`)
* An agent implementing the LinUCB algorithm on top of a neural network
* `ENCODING_DIM` is the output dimension of the encoding network 
> * This output will be used by either a linear reward layer and epsilon greedy exploration, or by a LinUCB logic, depending on the number of training steps executed so far
* `EPS_PHASE_STEPS` is the number training steps to run for training the encoding network before switching to `LinUCB`
> * If negative, the encoding network is assumed to be already trained
> * If the number of steps is less than or equal to `EPS_PHASE_STEPS`, `epsilon greedy` is used, otherwise `LinUCB`
* [docs](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/agents/neural_linucb_agent/NeuralLinUCBAgent)

### network types

Which network architecture to use for the `epsGreedy` or `NeuralLinUCB` agents

```
NETWORK_TYPE = ['commontower', 'dotproduct']
```

**GlobalAndArmCommonTowerNetwork:** (`commontower`)
* This network takes the output of the global and per-arm networks, and leads them through a common network, that in turn outputs reward estimates
> * `GLOBAL_LAYERS` - Iterable of ints. Specifies the layers of the global tower
> * `ARM_LAYERS` - Iterable of ints. Specifies the layers of the arm tower
> * `COMMON_LAYERS` - Iterable of ints. Specifies the layers of the common tower
* The network produced by this function can be used either in `GreedyRewardPredictionPolicy`, or `NeuralLinUCBPolicy`
> * In the former case, the network must have `output_dim=1`, it is going to be an instance of `QNetwork`, and used in the policy as a reward prediction network
> * In the latter case, the network will be an encoding network with its output consumed by a reward layer or a `LinUCB` method. The specified `output_dim` will be the encoding dimension
* [docs](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/networks/global_and_arm_feature_network/GlobalAndArmCommonTowerNetwork)

**GlobalAndArmDotProductNetwork:** (`dotproduct`)
* This network calculates the **dot product** of the output of the global and per-arm networks and returns them as reward estimates
> * `GLOBAL_LAYERS` - Iterable of ints. Specifies the layers of the global tower
> * `ARM_LAYERS` - Iterable of ints. Specifies the layers of the arm tower
* [docs](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/networks/global_and_arm_feature_network/GlobalAndArmDotProductNetwork)

### define agent and network

In [84]:
# ================================
# Agents
# ================================
AGENT_TYPE      = 'epsGreedy' # 'LinUCB' | 'LinTS |, 'epsGreedy' | 'NeuralLinUCB'

# Parameters for linear agents (LinUCB and LinTS).
AGENT_ALPHA     = 0.1

# Parameters for neural agents (NeuralEpsGreedy and NerualLinUCB).
EPSILON         = 0.4
LR              = 0.005

# Parameters for NeuralLinUCB
ENCODING_DIM    = 1
EPS_PHASE_STEPS = 1000

# ================================
# Agent's Preprocess Network
# ================================
NETWORK_TYPE    = "dotproduct" # 'commontower' | 'dotproduct'

if AGENT_TYPE == 'NeuralLinUCB':
    NETWORK_TYPE = 'commontower'
    

GLOBAL_LAYERS   = [50, 50, 50]
ARM_LAYERS      = [50, 50, 50]
COMMON_LAYERS   = [100]

observation_and_action_constraint_splitter = None

HPARAMS = {  # TODO - streamline and consolidate
    "batch_size": BATCH_SIZE,
    "num_actions": NUM_ACTIONS,
    "model_type": AGENT_TYPE,
    "network_type": NETWORK_TYPE,
    "global_layers": GLOBAL_LAYERS,
    "per_arm_layers": ARM_LAYERS,
    "common_layers": COMMON_LAYERS,
    "learning_rate": LR,
    "epsilon": EPSILON,
}
pprint(HPARAMS)

{'batch_size': 8,
 'common_layers': [100],
 'epsilon': 0.4,
 'global_layers': [50, 50, 50],
 'learning_rate': 0.005,
 'model_type': 'epsGreedy',
 'network_type': 'dotproduct',
 'num_actions': 5,
 'per_arm_layers': [50, 50, 50]}


### Agent Factory

**TODO:**
* consolidate agent, network, and hparams

In [85]:
print("Quick check on the inputs of the agent - this can be used to diagnose spec shape inputs")
print("\ntime_step_spec: ", time_step_spec)
print("\naction_spec: ", action_spec)
print("\nobservation_spec: ", observation_spec)

Quick check on the inputs of the agent - this can be used to diagnose spec shape inputs

time_step_spec:  TimeStep(
{'discount': BoundedTensorSpec(shape=(), dtype=tf.float32, name='discount', minimum=array(0., dtype=float32), maximum=array(1., dtype=float32)),
 'observation': {'global': TensorSpec(shape=(64,), dtype=tf.float32, name=None),
                 'per_arm': TensorSpec(shape=(5, 64), dtype=tf.float32, name=None)},
 'reward': TensorSpec(shape=(), dtype=tf.float32, name='reward'),
 'step_type': TensorSpec(shape=(), dtype=tf.int32, name='step_type')})

action_spec:  BoundedTensorSpec(shape=(), dtype=tf.int32, name='action_spec', minimum=array(0, dtype=int32), maximum=array(4, dtype=int32))

observation_spec:  {'global': TensorSpec(shape=(64,), dtype=tf.float32, name=None), 'per_arm': TensorSpec(shape=(5, 64), dtype=tf.float32, name=None)}


In [86]:
# from tf_agents.bandits.policies import policy_utilities
# from tf_agents.bandits.agents import greedy_reward_prediction_agent

network = None
observation_and_action_constraint_splitter = None
global_step = tf.compat.v1.train.get_or_create_global_step()

if AGENT_TYPE == 'LinUCB':
    agent = lin_ucb_agent.LinearUCBAgent(
        time_step_spec=time_step_spec,
        action_spec=action_spec,
        alpha=AGENT_ALPHA,
        accepts_per_arm_features=True,
        dtype=tf.float32,
    )
elif AGENT_TYPE == 'LinTS':
    agent = lin_ts_agent.LinearThompsonSamplingAgent(
        time_step_spec=time_step_spec,
        action_spec=action_spec,
        alpha=AGENT_ALPHA,
        observation_and_action_constraint_splitter=(
            observation_and_action_constraint_splitter
        ),
        accepts_per_arm_features=True,
        dtype=tf.float32,
    )
elif AGENT_TYPE == 'epsGreedy':
    # obs_spec = per_arm_tf_env.observation_spec()
    if NETWORK_TYPE == 'commontower':
        network = global_and_arm_feature_network.create_feed_forward_common_tower_network(
            observation_spec = observation_spec, 
            global_layers = GLOBAL_LAYERS, 
            arm_layers = ARM_LAYERS, 
            common_layers = COMMON_LAYERS,
            # output_dim = 1
        )
    elif NETWORK_TYPE == 'dotproduct':
        network = global_and_arm_feature_network.create_feed_forward_dot_product_network(
            observation_spec = observation_spec, 
            global_layers = GLOBAL_LAYERS, 
            arm_layers = ARM_LAYERS
        )
    agent = neural_epsilon_greedy_agent.NeuralEpsilonGreedyAgent(
        time_step_spec=time_step_spec,
        action_spec=action_spec,
        reward_network=network,
        optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=HPARAMS['learning_rate']),
        epsilon=HPARAMS['epsilon'],
        observation_and_action_constraint_splitter=(
            observation_and_action_constraint_splitter
        ),
        accepts_per_arm_features=True,
        emit_policy_info=policy_utilities.InfoFields.PREDICTED_REWARDS_MEAN,
        train_step_counter=global_step,
        # info_fields_to_inherit_from_greedy=['predicted_rewards_mean'],
        name='OffpolicyNeuralEpsGreedyAgent'
    )

elif AGENT_TYPE == 'NeuralLinUCB':
    # obs_spec = per_arm_tf_env.observation_spec()
    network = (
        global_and_arm_feature_network.create_feed_forward_common_tower_network(
            observation_spec = observation_spec, 
            global_layers = GLOBAL_LAYERS, 
            arm_layers = ARM_LAYERS, 
            common_layers = COMMON_LAYERS,
            output_dim = ENCODING_DIM
        )
    )
    agent = neural_linucb_agent.NeuralLinUCBAgent(
        time_step_spec=per_arm_tf_env.time_step_spec(),
        action_spec=per_arm_tf_env.action_spec(),
        encoding_network=network,
        encoding_network_num_train_steps=EPS_PHASE_STEPS,
        encoding_dim=ENCODING_DIM,
        optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=LR),
        alpha=1.0,
        gamma=1.0,
        epsilon_greedy=EPSILON,
        accepts_per_arm_features=True,
        debug_summaries=True,
        summarize_grads_and_vars=True,
        emit_policy_info=policy_utilities.InfoFields.PREDICTED_REWARDS_MEAN,
    )
    
agent.initialize() # TODO - does this go here?
    
print(f"Agent: {agent.name}\n")
if network:
    print(f"Network: {network.name}")

Agent: OffpolicyNeuralEpsGreedyAgent

Network: GlobalAndArmDotProductNetwork


## Reward function

**TODO:**
* create a baseline reward function on user features and rating

In [87]:
# def _get_rewards(element):
#     """Calculates reward for the actions."""

#     def _calc_reward(x):
#         """Calculates reward for a single action."""
#         r0 = lambda: tf.constant(0.0)
#         r1 = lambda: tf.constant(-10.0)
#         r2 = lambda: tf.constant(2.0)
#         r3 = lambda: tf.constant(3.0)
#         r4 = lambda: tf.constant(4.0)
#         r5 = lambda: tf.constant(10.0)
#         c1 = tf.equal(x, 1.0)
#         c2 = tf.equal(x, 2.0)
#         c3 = tf.equal(x, 3.0)
#         c4 = tf.equal(x, 4.0)
#         c5 = tf.equal(x, 5.0)
#         return tf.case(
#             [(c1, r1), (c2, r2), (c3, r3),(c4, r4),(c5, r5)], 
#             default=r0, exclusive=True
#         )

#     return tf.map_fn(
#         fn=_calc_reward, 
#         elems=element['user_rating'], 
#         dtype=tf.float32
#     )

### New - exploring the dot product network

Let's get the dot proudcut of arm/global features for the trajectories

Looking at source [code](https://github.com/tensorflow/agents/blob/v0.17.0/tf_agents/bandits/networks/global_and_arm_feature_network.py#L54-L138)

```python
return GlobalAndArmDotProductNetwork(obs_spec_no_num_actions, global_network,
                                       arm_network)
```

Leads to [here](https://www.tensorflow.org/agents/api_docs/python/tf_agents/bandits/networks/global_and_arm_feature_network/GlobalAndArmDotProductNetwork#get_initial_state)

Also member the config

- GLOBAL_LAYERS   = [16, 4]
- ARM_LAYERS      = [16, 4]
- COMMON_LAYERS   = [4]

```python
network = global_and_arm_feature_network.create_feed_forward_dot_product_network(
            observation_spec = observation_spec, 
            global_layers = GLOBAL_LAYERS, 
            arm_layers = ARM_LAYERS
        )
```

## Trajectory function

**parking lot**
* does trajectory fn need concept of `dummy_chosen_arm_features`, similar to [this](https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/policies/reward_prediction_base_policy.py#L297)

```python
      dummy_chosen_arm_features = tf.nest.map_structure(
          lambda obs: tf.zeros_like(obs[:, 0, ...]),
          time_step.observation[bandit_spec_utils.PER_ARM_FEATURE_KEY],
      )
```

In [88]:
from tf_agents.bandits.specs import utils as bandit_spec_utils
from tf_agents.trajectories import trajectory

def _trajectory_fn(element, batch_size): # hparams
        
    """Converts a dataset element into a trajectory."""
    global_features, user_info = _get_global_context_features(element) #new - user info passes on the raw user features for prompting with PALM
    ###NEW - we are getting the arm features here
    arm_features, movie_info = get_random_set_of_arm_features(n_actions=NUM_ACTIONS)
    # arm_features = get_random_set_of_arm_features(n_actions=NUM_ACTIONS)
    
    #get the dot product reward of the feed-forward network
    reward = llm_reward(user_info, movie_info, NUM_ACTIONS)
    
    reward = tf.constant(reward, tf.float32)
    
    #chose an arm
    best_arm_ids = tf.argmax(reward, axis=1)
    # best_arm_ids = tf.cast(best_arm_ids, dtype=tf.int32)
    max_rewards = tf.math.reduce_max(reward, axis=1)
    max_rewards = _add_outer_dimension(max_rewards) # add time dim
    chosen_arm_feats = tf.gather(arm_features, best_arm_ids) # [batch_size, arm_features]
    
    chosen_arm_feats = _add_outer_dimension(chosen_arm_feats)
    # Adds a time dimension.
    arm_features = _add_outer_dimension(arm_features)

    # obs spec
    observation = {
        bandit_spec_utils.GLOBAL_FEATURE_KEY:
            _add_outer_dimension(global_features), #timedim bloat
    }
    
    
    reward = _add_outer_dimension(reward)
    
    ###TODO - not sure if this should actually go in the action for trajectory
    # best_arm_ids =  _add_outer_dimension(best_arm_ids)
    
    dummy_rewards = tf.zeros([batch_size, 1, NUM_ACTIONS])
    
    policy_info = policy_utilities.PerArmPolicyInfo(
        chosen_arm_features=chosen_arm_feats,
        # Pass dummy mean rewards here to match the model_spec for emitting
        # mean rewards in policy info
        predicted_rewards_mean=dummy_rewards
    )
    
    if HPARAMS['model_type'] == 'neural_ucb':
        policy_info = policy_info._replace(
            predicted_rewards_optimistic=dummy_rewards
        )
        
    return trajectory.single_step(
        observation=observation,
        action=tf.zeros_like(
            max_rewards, dtype=tf.int32
        ),  # Arm features are copied from policy info, put dummy zeros here
        policy_info=policy_info,
        reward=max_rewards,
        discount=tf.zeros_like(max_rewards)
    )

## Train loop

`agent.train(experience=...)`

where `experience` is a batch of trajectories data in the form of a Trajectory. 
* The structure of experience must match that of `self.training_data_spec`. 
* All tensors in experience must be shaped [batch, time, ...] where time must be equal to self.train_step_length if that property is not None.

In [89]:
##todo - create a function that selects the best movie features along with 

In [90]:
BATCH_SIZE, NUM_ACTIONS

(8, 5)

## Generate the training trajectories to disk
So we don't have to sync call the llm

In [115]:
## function to save
import pickle

# os.mkdir('trajectories')

def save_trajectories(trajectory, filename):
    # Open a file for writing
    with open(f"{filename}", "wb") as f:
        # Write the dictionary to the file
        pickle.dump(trajectory, f)

    # Close the file
    f.close()

In [116]:
from google.cloud import storage


def upload_blob(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # The ID of your GCS bucket
    # bucket_name = "your-bucket-name"
    # The path to your file to upload
    # source_file_name = "local/path/to/file"
    # The ID of your GCS object
    # destination_blob_name = "storage-object-name"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(destination_blob_name)

    # Optional: set a generation-match precondition to avoid potential race conditions
    # and data corruptions. The request to upload is aborted if the object's
    # generation number does not match your precondition. For a destination
    # object that does not yet exist, set the if_generation_match precondition to 0.
    # If the destination object already exists in your bucket, set instead a
    # generation-match precondition using its generation number.
    generation_match_precondition = 0

    blob.upload_from_filename(source_file_name, if_generation_match=generation_match_precondition)

In [142]:
import collections
from tf_agents.utils import common
from tf_agents.eval import metric_utils
from tf_agents.policies import policy_saver
import time
from tqdm import tqdm

ARTIFACTS_DIR = '.'


train_step_counter = tf.compat.v1.train.get_or_create_global_step()
saver = policy_saver.PolicySaver(
    agent.policy, 
    train_step=train_step_counter
)
starting_loop = 0

print(f"saving files...")
start_time = time.time()


big_number = 10_000_000
big_nubmer_len = len(str(big_number))

iterator = iter(train_dataset.batch(BATCH_SIZE))

for data in tqdm(iterator):
    trajectories = _trajectory_fn(data, BATCH_SIZE)
    filename = str(big_number)[1:big_nubmer_len]+'.p'
    save_trajectories(trajectories, f'/home/jupyter/{filename}') 
    upload_blob(BUCKET_NAME, f'/home/jupyter/{filename}', f'generative-trajectories/{filename}')
    big_number += 1

saving files...


0it [00:43, ?it/s]


PreconditionFailed: 412 POST https://storage.googleapis.com/upload/storage/v1/b/mabv1-wortz-project-352116-bucket/o?uploadType=multipart&ifGenerationMatch=0: {
  "error": {
    "code": 412,
    "message": "At least one of the pre-conditions you specified did not hold.",
    "errors": [
      {
        "message": "At least one of the pre-conditions you specified did not hold.",
        "domain": "global",
        "reason": "conditionNotMet",
        "locationType": "header",
        "location": "If-Match"
      }
    ]
  }
}
: ('Request failed with status code', 412, 'Expected one of', <HTTPStatus.OK: 200>)