# Starting Kit Guidance

This starting kit will guide you through the workflow of developing a solution for an offline RL problem. It contains two components: 

1. `baseline`: an example solution covering all the relevant procedures for building an effective offline RL model: preprocessing offline data, building a virtual environment, and training a policy based on the virtual environment. 

2. `sample_submission`: a ready-to-submit bundle based on the baseline model. You can directly submit the `sample_submission.zip` downloaded along with the starting kit, or replace the parameters file with your own locally trained baseline model.

By running `run_baseline.sh` in a bash-compatible shell, all these procedures will run sequentially and participants will obtain a usable baseline model along with an up-to-date sample submission bundle.

You may use the baseline model and sample submission code as the starting point for your own model implementation. The baseline is built with the `revive` SDK and the `stablebaselines3` library, but there is no restriction on how your solution is implemented.

## Requirements

1. `Linux x86_64`. Revive SDK requires this. Participants on a Windows machine can use WSL2 and develop on it seamlessly with `VSCode` + `Remote WSL extension`. GPU is also supported on WSL2 in Windows 11 or a newer Windows 10 build.

2. `Python 3.6, Python 3.7 or Python 3.8`. Revive SDK does not support Python 3.9 yet, so you may create a py38 environment with conda for the baseline model. **Note**: the evaluation program on the competition platform runs on Python 3.9, so feel free to use the latest Python features in your own model.

3. Some participants have reported that a non-ASCII file path (e.g. a path containing Chinese characters) triggers `UnicodeEncodeError` in some libraries, so it is recommended to store this starting kit in a path containing only ASCII characters.

## Data organization

All data files related to baseline model will be saved and organized in `baseline/data` folder, including:

1. Public data
  * `offline_592_1000.csv`: Offline dataset downloaded from public data in development phase, containing data of 1000 customers in 60 days.
2. Preprocessing data
  * `offline_592_3_dim_state.csv`: Processed offline dataset with user state inserted as a 3-dim vector (`total_num`,`average_num`,`average_fee`).
  * `user_states_by_day.npy`: States of all users over `60-x` days, with the data of the first `x` days reduced into the initial state. `x` is set to 30 in this baseline.
  * `evaluation_start_states.npy`: The states on the last day of the offline dataset, from which online evaluation will start.
3. Train a virtual environment
  * `license.lic`: License file required by Revive SDK. This starting kit provides a public license valid until the end of the competition; you can also apply for one on the Revive website.
  * `venv.yaml`: Revive metadata describing the decision flow of the offline RL problem. Pre-included in the baseline data folder.
  * `venv.npz`: Actual dataset corresponding to the layout described in `venv.yaml`. It is generated in the data preprocessing phase. 
  * `venv.py`: Expert functions used to introduce priors or to constrain the data output by the neural network.
  * `venv.pkl`: Trained parameters of the virtual environment. It is generated after running Revive to train the virtual environment.
4. Train a policy model
  * `model_checkpoints/rl_model_*_steps.zip`: Checkpoints of trained parameters of policy validation model.
  * `rl_model.zip`: The final policy model chosen to be submitted.
  
**Note**: `evaluation_start_states.npy` and `rl_model.zip` will be copied to `sample_submission/data` folder as the parameters for baseline model in online policy evaluation. `user_states.py` will be copied to `sample_submission` folder to provide the model with your state definition.

Let's start with checking the Python environment:

In [None]:
!which python3
!python3 --version

If you are using a conda environment, make sure the path printed by `which python3` is in the environment you prepared for this baseline. Also make sure the Python version matches the requirement stated above (3.6 ~ 3.8).
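
The same version check can also be scripted. The sketch below is only an illustration of the 3.6 ~ 3.8 requirement stated above; `is_supported_python` is a hypothetical helper name, not part of the starting kit.

```python
import sys

def is_supported_python(version_info=None):
    """Return True when the interpreter is Python 3.6, 3.7 or 3.8."""
    # Fall back to the running interpreter when no version is passed in
    major, minor = (version_info or sys.version_info)[:2]
    return major == 3 and 6 <= minor <= 8

if not is_supported_python():
    print("Warning: the baseline expects Python 3.6 ~ 3.8, found",
          ".".join(str(part) for part in sys.version_info[:3]))
```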

Next, let us define the baseline root environment variable:

In [None]:
import os

baseline_root = f"{os.getcwd()}/baseline"

%env BASELINE_ROOT=$baseline_root

**IMPORTANT NOTES:**
1. **This code cell should be executed every time you reopen this notebook, to ensure `$baseline_root` for Python and `$BASELINE_ROOT` for Shell are correctly set.**
2. **Make sure the notebook environment is started in `starting_kit` folder, or else the `$baseline_root` variable will not be set correctly.**

Now install the baseline folder to `PYTHONPATH`, along with some basic requirements:

In [None]:
%pushd $baseline_root
!pip install -e .
%popd

# Step 1: Derive user states from offline data

To begin with, switch to the data directory and download the public data from development phase:

In [None]:
%pushd $baseline_root/data
!wget -q https://codalab.lisn.upsaclay.fr/my/datasets/download/eea9f5b7-3933-47cf-ba6f-394218eeb913 -O public_data_dev.zip
!unzip -o public_data_dev.zip
%popd # Leave data directory

The public offline dataset contains only the company promotion action, the user response action, and metadata like index, step, date. We need to derive a user's state (the depiction of a user) from this offline dataset, and use it to build our virtual environment model.

As a baseline, here we define the user state as a simple 3-dim feature vector:

|   Feature   |                         Description                          |
| :---------: | :----------------------------------------------------------: |
|  total_num  |       The total number of orders in the user's history       |
| average_num | The average number of per-day orders in the user's history |
| average_fee | The average fee of per-day orders in the user's history |

Beware that **defining proper user states is key to an effective virtual environment**. The states defined in the table above are easy and straightforward, but participants will need to define more robust and reasonable user states.

For convenience, the code that defines the user states lives in `user_states.py`, where:
* A `get_state_names` function gives each state dimension a name.
* A `get_next_state` function computes the next state from the current state and the user actions (`day_order_num` and `day_average_order_fee`).

Therefore, when you want to change the user state definition, you only need to modify the code in `user_states.py`, and the modification will apply to all logic related to user states. 

Take a look at the code definition of states in baseline:

In [None]:
import inspect
import user_states

print(inspect.getsource(user_states.get_state_names))
print(inspect.getsource(user_states.get_next_state))

Some helpful notes for baseline's state implementation:
1. `get_next_state` is defined in a generic way, because this function is called with both numpy arrays and pytorch tensors. So be careful when you modify the state definition; for example, avoid APIs that exist in only one of the two libraries. 
2. If you find it impossible to keep `get_next_state` generic, you may copy this definition to the `get_next_state_torch` function in the same file, so that `numpy` arrays and `torch` tensors each use their own implementation. 
3. When dealing with pytorch tensors, be careful with their device, since tensors on both `cpu` and `cuda` devices are passed to `get_next_state` in Revive SDK. You may use the `states.new_<zeros/ones/empty>` API to create a new tensor on the same device as `states`.
4. `next_state` is passed as a parameter to `get_next_state`, meaning that you have to pre-allocate the array / tensor for `next_state` before calling `get_next_state`. This keeps the implementation generic, since numpy uses `np.empty` and torch uses `Tensor.new_empty`.
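
The pre-allocation described in notes 3 and 4 can be sketched as a small dispatch helper. This is an illustration only: `empty_like_states` is a hypothetical name, not a function from `user_states.py`, and the detection of torch tensors via the `new_empty` attribute is an assumption chosen so the snippet runs without importing torch.

```python
import numpy as np

def empty_like_states(states, feature_dim):
    """Pre-allocate a next-state buffer matching the batch shape of `states`."""
    # Keep every leading (batch) dimension and replace only the feature dimension
    shape = tuple(states.shape[:-1]) + (feature_dim,)
    if hasattr(states, "new_empty"):
        # torch.Tensor: new_empty keeps the dtype and device (cpu / cuda) of `states`
        return states.new_empty(shape)
    # numpy.ndarray: plain uninitialized array with the same dtype
    return np.empty(shape, dtype=states.dtype)

# Example with a numpy batch of 4 users x 30 days x 3 state features
buffer = empty_like_states(np.zeros((4, 30, 3), dtype=np.float32), 3)
print(buffer.shape, buffer.dtype)
```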

Some details on baseline's state definition design:
1. This is a recurrent state definition, meaning that a user's state on a certain day is the accumulated result of all days in this user's history. Therefore, the user state at day 31 is the reduced state of all previous 30 days.
2. The baseline computes the number of elapsed days as `total_num / average_num`, which is only correct when counting days with at least one order. To average correctly over all days, you need to bake this information into your newly defined user state, or use a strategy such as a discount factor.
3. Some constraints could be applied to the data before calculating the user state. For example, if your venv model outputs `day_order_num` as a continuous value, you can use `round()` to convert it to an integer. Besides, days with zero orders should also have zero fees, so the two values can be forced to 0 together whenever either one is 0. 
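
To make the recurrent update and the `total_num / average_num` trick concrete, here is a minimal numpy sketch. It is not the code in `user_states.py`: `next_state_sketch` is a hypothetical function, and it assumes that both averages are taken over days with at least one order and that `average_fee` is the mean of `day_average_order_fee` over those days.

```python
import numpy as np

def next_state_sketch(state, day_order_num, day_average_order_fee):
    """One recurrent update of (total_num, average_num, average_fee)."""
    total_num, average_num, average_fee = state[..., 0], state[..., 1], state[..., 2]
    # Recover the number of days with orders so far (0 when there is no history yet)
    safe_avg = np.where(average_num > 0, average_num, 1.0)
    days = np.where(average_num > 0, total_num / safe_avg, 0.0)
    has_order = day_order_num > 0
    new_days = days + has_order                # only days with orders are counted
    new_total = total_num + day_order_num
    safe_days = np.where(new_days > 0, new_days, 1.0)
    new_average_num = np.where(new_days > 0, new_total / safe_days, 0.0)
    # Running mean of the daily average fee; unchanged on days without orders
    new_average_fee = np.where(
        has_order, (average_fee * days + day_average_order_fee) / safe_days, average_fee
    )
    return np.stack([new_total, new_average_num, new_average_fee], axis=-1)

# One user: 2 orders (fee 10), then a day without orders, then 4 orders (fee 20)
state = np.zeros((1, 3))
for orders, fee in [(2.0, 10.0), (0.0, 0.0), (4.0, 20.0)]:
    state = next_state_sketch(state, np.array([orders]), np.array([fee]))
print(state)
```

Note how the day without orders leaves the state untouched: this is exactly the limitation described in item 2.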

To keep the baseline simple, we use the state at day 31 as the initial state, representing the user data collected in the first 30 days. We can then learn a virtual environment based on the transitions from day 31 to day 60. (Note that this assumption does not reflect the fact that user states are influenced by different promotion actions.)

This logic is implemented in `data_preprocess.py`. Run it to transform the public data `offline_592_1000.csv` into the processed data `offline_592_3_dim_state.csv`, `user_states_by_day.npy`, `evaluation_start_states.npy` and `venv.npz`:

In [None]:
%pushd $baseline_root/data
import numpy as np
import importlib
import data_preprocess
import sys

importlib.reload(data_preprocess)

offline_data_with_states, user_states_by_day, evaluation_start_states, offline_data_with_states_npz = data_preprocess.data_preprocess("offline_592_1000.csv")
print(offline_data_with_states.shape)
print(user_states_by_day.shape)
print(evaluation_start_states.shape)
offline_data_with_states.to_csv('offline_592_3_dim_state.csv', index=False)
np.save('user_states_by_day.npy', user_states_by_day)
np.save('evaluation_start_states.npy', evaluation_start_states)
np.savez('venv.npz', **offline_data_with_states_npz)
%popd

# Step 2: Learn a virtual environment

The decision flow graph of our offline RL problem is shown below:

![](baseline/docs/images/revive_graph_en.png)

Here, `User State` is defined by us, and `Promotion Action` is to be implemented in our submission policy, so for the entire flow to work we need to mock the `User Action` as a virtual environment. This is equivalent to learning a user policy model from the offline dataset that outputs the user action given the user state and the promotion action.


In the baseline we learn such a user policy (virtual environment) with [Revive SDK](https://www.revive.cn). First download the SDK with git:

In [None]:
%pushd $baseline_root
!git clone https://agit.ai/Polixir/revive.git
%popd

Here Revive SDK is cloned into `baseline/revive`. The starting kit provides a public license located at `baseline/data/license.lic`, which will be valid until the end of the competition. To apply for an individual license, please refer to the documentation in `baseline/docs`.

Now prepare the environment for Revive SDK by:
1. Installing the required dependencies with `pip`.
2. Setting the environment variable `PYARMOR_LICENSE` to the location of the license file, which Revive requires.

In [None]:
%pushd $baseline_root/revive
!git fetch
!git checkout 0.6.0
!python3 -m pip install -e .
%env PYARMOR_LICENSE=$baseline_root/data/license.lic
%popd

## Config tuning for training the virtual environment

Before running Revive, we may tune Revive's configs to best suit our needs. For more about the usage, refer to [Revive's documentation](https://revive.cn/help/polixir-revive-sdk/index.html).

In this notebook, designing the decision graph and tuning hyperparameters are done by writing directly to the config files with code. Revive SDK also provides a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Polixir.polixir-revive), which lets users configure the decision graph and parameters interactively. For a tutorial on the `VSCode Polixir Revive` extension, please refer to the documentation in `baseline/docs`.

### (a) Decision graph tuning

Revive uses `venv.yaml` to define its decision graph (as shown in above image). We could modify its structure to control the workflow of virtual environment.

#### Define data format for each column

The actual offline dataset is stored in `venv.npz` (you can view its data through `offline_592_3_dim_state.csv`), and each column in the offline dataset is attached to a specific node in the decision graph. 

For example, `day_order_num` is part of the user's action, corresponding to `action_2` in `venv.yaml`. This relationship is defined as follows:
```yaml
column:
  day_order_num:
    dim: action_2
```

To learn a user action policy from such offline dataset, apart from column-node relationship, Revive also needs to know the data format and bound of a column. For example, `day_order_num` is restricted to `[0, 1, 2, 3, 4, 5, 6]` in this competition's scenario, so the `min` and `max` bound for this column should be `0` and `6`. 

Revive provides three data types for describing a column's format:

* **`discrete`**: The discrete data type uses the `min`, `max` and `num` fields to control how many discrete values this column can take. For `day_order_num` it could be defined in this way:

  ```yaml
  column:
    day_order_num:
      dim: action_2
      type: discrete
      min: 0
      max: 6
      num: 7
  ```

* **`category`**: Category data type explicitly defines what `values` a column could hold. For `day_order_num` it could be defined in this way:
  ```yaml
  column:
    day_order_num:
      dim: action_2
      type: category
      values: [0, 1, 2, 3, 4, 5, 6]
  ```

* **`continuous`**: The continuous data type assumes a column can take any floating point value between the `min` and `max` bounds. For `day_order_num` it could be defined in this way:
  ```yaml
  column:
    day_order_num:
      dim: action_2
      type: continuous
      min: 0
      max: 6
  ```

For some columns like `day_average_order_fee` and `average_num`, it is natural to just use `continuous` data type. In this baseline, we also use `continuous` data type for integer fields like `day_order_num` and `day_deliver_coupon_num`. 
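
As a rough illustration of how `min`, `max` and `num` relate for a `discrete` column, the sketch below enumerates the values such a column would cover. It assumes the values are evenly spaced between the two bounds, which is an assumption made here for illustration; check Revive's documentation for the exact semantics. `discrete_values` is a hypothetical helper, not a Revive API.

```python
import numpy as np

def discrete_values(min_value, max_value, num):
    """Evenly spaced candidate values between the bounds (assumed semantics)."""
    return np.linspace(min_value, max_value, num)

# With min=0, max=6 and num=7, every integer order count gets its own value
order_values = discrete_values(0, 6, 7)
# A smaller num gives a coarser grid over the same range
coarse_values = discrete_values(0, 6, 4)
print(order_values, coarse_values)
```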

You may switch between these three data types, and even narrow the `min` and `max` bounds of some columns, to improve the performance of your virtual environment model. For the standard range of each data column, please refer to the [descriptions on the competition page](https://codalab.lisn.upsaclay.fr/competitions/823#learn_the_details-terms_and_conditions).

#### Define expert functions

For several reasons, we may want to introduce some domain knowledge into the decision graph:
1. Some nodes can be reliably computed from their ingress nodes. For example, in this competition user states are derived from user actions, so `next_state` in the decision graph can be computed using the same definition and does not need to be learned from the offline dataset. 
2. Some constraints can be applied to the output of nodes computed by a neural network. For example, days with zero orders should also have zero fees, but neural networks may not learn this behavior well. To explicitly enforce this constraint, a node with an expert function can be placed right after the neural network node in the decision graph.
3. Some priors can be applied to the computation of certain nodes. For example, `day_order_num` is mostly limited to 0~2 orders, and 5 or 6 orders rarely occur. By providing, for example, weights for each order number, we can apply priors to the input or output of a neural network node to obtain a more reasonable data distribution.

For the first scenario, we can explicitly specify the expert function to be used for the `next_state` node:
```yaml
expert_functions:
  next_state:
    'node_function' : 'venv.get_next_state'
```

where the expert function `get_next_state` is defined in `venv.py`, located in the `baseline/data` folder:

In [None]:
!cat $BASELINE_ROOT/data/venv.py

Here, `venv.get_next_state` will forward the state calculation to `user_states.get_next_state`, so updates will immediately take effect if you change state definition in `user_states.py`. Notice the usage of `states.new_empty`, which will create a new `next_states` tensor with device same as `states`. This is necessary since the device of inputs varies between `cpu` and `cuda`. We use the `[..., <i>]` indexing notation since the batch shape also varies. 

For the second scenario, if you choose the `continuous` data type for integer columns, it would be better to add function nodes after the neural network output to apply regularizations such as a `round()` operation. This baseline does not implement such constraints; participants may refer to the documentation and the examples above for guidance if desired.
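
A minimal numpy sketch of such a constraint is given below. It only shows the arithmetic; wiring it into `venv.yaml` as an expert function is not covered here. `constrain_user_action` is a hypothetical name, and the upper bound of 6 orders follows the range quoted earlier for `day_order_num`.

```python
import numpy as np

def constrain_user_action(day_order_num, day_average_order_fee, max_orders=6):
    """Round order counts to valid integers and keep orders and fees consistent."""
    order_num = np.clip(np.round(day_order_num), 0, max_orders)
    fee = np.maximum(day_average_order_fee, 0.0)
    # A day counts as active only when both the order count and the fee are positive
    active = (order_num > 0) & (fee > 0)
    return np.where(active, order_num, 0.0), np.where(active, fee, 0.0)

raw_orders = np.array([0.2, 1.6, 7.3, 2.0])
raw_fees = np.array([5.0, 12.0, 30.0, 0.0])
orders, fees = constrain_user_action(raw_orders, raw_fees)
print(orders, fees)
```

A torch version would need the equivalent `torch` calls (or a device-agnostic formulation), since Revive calls expert functions with tensors.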

For the third scenario, you could take the data distribution of each column in the offline dataset as a prior distribution, and insert an expert function node at the input or output of the neural network node. This baseline does not implement such priors; participants may refer to the documentation and the examples above for guidance if desired. 
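
As a starting point for such a prior, the empirical distribution of an integer column can be computed as sketched below. `empirical_prior` is a hypothetical helper, the sample values are made up for illustration, and how the resulting weights are applied inside an expert function is left open.

```python
import numpy as np

def empirical_prior(values, num_classes=7):
    """Relative frequency of each integer value 0 .. num_classes - 1."""
    counts = np.bincount(np.asarray(values, dtype=np.int64), minlength=num_classes)
    return counts / counts.sum()

# Toy sample of day_order_num values (not taken from the competition data)
prior = empirical_prior([0, 0, 1, 2, 0, 1])
print(prior)
```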

### Introduce coupon action to user state

In the baseline model, the user state depends exclusively on the user action. In the decision graph above, the `Promotion Action` node is not an ingress node of `User State`. However, as mentioned in the data preprocessing step, this assumption does not reflect the fact that user states are influenced by different promotion actions.

Therefore, when designing your own user state, you may also introduce the coupon action if necessary. `user_states.py` is designed to make this as easy as possible:

1. `get_next_state` accepts `coupon_num` and `coupon_discount` as input parameters, although the baseline model does not use them.
2. In `get_next_state_numpy`, `coupon_num` and `coupon_discount` are correctly provided where the function is referenced.
3. In `get_next_state_torch`, `coupon_num` and `coupon_discount` are passed as `None`, because the user state does not depend on the coupon action in the baseline decision graph and the coupon action is therefore unavailable during user state computation. Once it becomes available, this function will automatically pass the coupon action to `get_next_state`.

Therefore, to introduce the coupon action, we only need to add an edge from the `Promotion Action` node to the `User State` node:

```yaml
graph:
  next_state:
  - state
  - action_2
  - action_1 # Add promotion action (action_1) as user state's dependency
```

### (b) Hyper parameters tuning

Revive uses `config.json` to define its parameters for training. We could modify this config to find a balance between performance and time cost during the training of virtual environment.

In this notebook, we first define some utility functions for the convenience of editing `config.json` using Python code:

In [None]:
import json

def update_config(configs, name, default_value):
    for config in configs:
        if config["name"] == name:
            config["default"] = default_value
            return

def update_search_config(configs, name, search_mode, search_values):
    for config in configs:
        if config["name"] == name:
            config["search_mode"] = search_mode
            config["search_values"] = search_values
            return
        
def insert_config(configs, name, new_config):
    for index, config in enumerate(configs):
        if config["name"] == name:
            configs[index] = new_config
            return
    configs.insert(0, new_config)

Then, we select the base `config.json` file to be modified. In this starting kit, we use the default config provided by Revive SDK. You may change it to the local `config.json` after you have created one:

In [None]:
# Default config provided by Revive SDK
input_config_file = f"{baseline_root}/revive/data/config.json"

# # Config generated locally after following the notebook
# input_config_file = f"{baseline_root}/data/config.json"

with open(input_config_file, 'r') as f:
    config = json.load(f)

#### Base configs tuning

`base_config` in `config.json` controls general configuration for training. In this notebook, we select some of the configs that will possibly be fine-tuned, and as an example update them with basic explanations:

* `val_split_ratio`: Controls the ratio of the validation set when splitting the offline dataset. Defaults to 0.5; we set 0.2 here.
* `venv_rollout_horizon`, `test_horizon`: Control the number of steps to roll out from expert data. Both default to 10; since the policy will be evaluated for 14 days during submission, we set them to 14 here.
* `venv_metric`: Metric for scoring expert data and generated data, which will be elaborated in the next section. We select `wdist` here.
* `venv_algo`: Algorithm for learning the decision graph. Choose between `revive` (based on MAGAIL) and `bc`. We select `revive` here.
* `train_venv_trials`: Controls the total number of trials to search, on which Revive's training is based. Defaults to 25; to stop early we set 10 here.  

According to above analysis, update the JSON config using Python code:

In [None]:
base_config = config["base_config"]

update_config(base_config, "val_split_ratio", 0.2)
update_config(base_config, "venv_rollout_horizon", 14)
update_config(base_config, "test_horizon", 14)
update_config(base_config, "venv_metric", "wdist") # Use w-distance as metric, which will calculate on the whole rollout
update_config(base_config, "venv_algo", "revive")
update_config(base_config, "train_venv_trials", 10)
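
Note that `update_config` as defined above returns silently when no entry matches `name`, so a misspelled config name goes unnoticed. A minimal sketch of a stricter variant follows; `update_config_strict` and the toy config list are hypothetical and only illustrate the idea.

```python
def update_config_strict(configs, name, default_value):
    """Like update_config, but fail loudly when the entry does not exist."""
    for config in configs:
        if config["name"] == name:
            config["default"] = default_value
            return config
    raise KeyError(f"config entry {name!r} not found")

# Toy stand-in for config["base_config"], with made-up entries
toy_base_config = [
    {"name": "val_split_ratio", "default": 0.5},
    {"name": "test_horizon", "default": 10},
]
update_config_strict(toy_base_config, "val_split_ratio", 0.2)
```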

#### Training algorithm configs tuning

Revive provides two algorithms for learning the decision graph from the offline dataset: `revive` and `bc`, selected by the `venv_algo` config. Revive's training is trial-based: each trial trains the algorithm with a distinct set of hyperparameters for a large number of epochs (e.g. `revive_epoch` defaults to 5000). 

For example, a trial for `revive` consists of the following hyperparameters:
```json
{
  "d_lr": 0.000446,
  "d_steps": 2,
  "g_lr": 6e-05,
  "g_steps": 2,
  "ppo_runs": 2,
  "tag": "2_d_lr=0.000446,d_steps=2,g_lr=6e-05,g_steps=2,ppo_runs=2"
}
```
Such a trial is named `ReviveLog_8ac1b516_2_d_lr=0.000446,d_steps=2,g_lr=6e-05,g_steps=2,ppo_runs=2_2022-01-20_12-58-00`, where:
* `ReviveLog` is the log name prefix. In Revive SDK 0.5.0 it is `TorchTrainable`.
* `8ac1b516` is the random uuid of this trial.
* `2` is the trial id of this trial.
* `d_lr=0.000446,d_steps=2,g_lr=6e-05,g_steps=2,ppo_runs=2` is the hyperparameter set.
* `2022-01-20_12-58-00` is the timestamp of this trial.
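
When many trials accumulate, it can help to split such a directory name back into its parts. The sketch below is inferred only from the single example name above, so the pattern is an assumption and may need adjusting for other SDK versions; `parse_trial_name` is a hypothetical helper.

```python
import re

# prefix _ uuid _ trial id _ comma-separated hyperparameters _ timestamp
TRIAL_NAME = re.compile(
    r"^(?P<prefix>[^_]+)_(?P<uuid>[0-9a-f]+)_(?P<trial_id>\d+)_"
    r"(?P<params>.+)_(?P<timestamp>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})$"
)

def parse_trial_name(name):
    """Split a trial directory name into prefix, uuid, id, params and timestamp."""
    match = TRIAL_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized trial directory name: {name!r}")
    info = match.groupdict()
    info["trial_id"] = int(info["trial_id"])
    # Hyperparameter values are kept as strings, e.g. {"g_lr": "6e-05"}
    info["params"] = dict(item.split("=", 1) for item in info["params"].split(","))
    return info

example = parse_trial_name(
    "ReviveLog_8ac1b516_2_d_lr=0.000446,d_steps=2,g_lr=6e-05,"
    "g_steps=2,ppo_runs=2_2022-01-20_12-58-00"
)
print(example["trial_id"], example["params"])
```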

In this baseline, we focus on the `revive` algorithm. `revive` is a GAIL-based algorithm applied to a multi-agent scenario, where each node of the decision graph acts as an agent. 

Therefore, the hyperparameters here are generally related to GAIL (note: `policy` corresponds to the generator, and `matcher` to the discriminator):
* `*_hidden_layers`: The number of layers of every network (`policy`, `transition`, `matcher`) defaults to 4; as an example we set them to 5.
* `*_backbone`: The policy network's backbone is MLP in Revive SDK 0.5.0; as an example we set it to ResNet.
* `g_steps` and `d_steps`: The search range is set to [1, 2, 4], so `g_steps`:`d_steps` will vary in [1:4, 1:2, 1:1, 2:1, 4:1].
* `g_lr` and `d_lr`: As an example, the search is set to small continuous ranges: [1e-6, 1e-5] for `g_lr` and [1e-6, 1e-4] for `d_lr`.

Besides, for quick convergence, we apply BC initialization to `revive` in the first epochs of each trial (10 out of 5000 epochs):
* `bc_batch_size` set to 256.
* `bc_epoch` set to 10.
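
The `g_steps`:`d_steps` ratios listed above can be checked by enumerating the grid. This is plain arithmetic on the search values; whether Revive actually evaluates every combination depends on its search strategy and on `train_venv_trials`, which is not shown here.

```python
from fractions import Fraction
from itertools import product

steps = [1, 2, 4]

# Every (g_steps, d_steps) pair in the grid
combos = list(product(steps, steps))

# Distinct generator : discriminator update ratios, smallest first
ratios = sorted({Fraction(g, d) for g, d in combos})

print(len(combos), [f"{r.numerator}:{r.denominator}" for r in ratios])
```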

According to above analysis, update the JSON config using Python code:

In [None]:
venv_algo_config = config["venv_algo_config"]
revive_config = venv_algo_config["revive"]

update_config(revive_config, "policy_hidden_layers", 5)
update_config(revive_config, "policy_backbone", "res")
update_config(revive_config, "transition_hidden_layers", 5)
update_config(revive_config, "matcher_hidden_layers", 5)

update_search_config(revive_config, "g_steps", "grid", [1,2,4])
update_search_config(revive_config, "d_steps", "grid", [1,2,4])
update_search_config(revive_config, "g_lr", "continuous", [1e-06,1e-05])
update_search_config(revive_config, "d_lr", "continuous", [1e-06,1e-04])

bc_config = venv_algo_config["bc"]
insert_config(revive_config, "bc_batch_size", bc_config[0].copy())
insert_config(revive_config, "bc_epoch", bc_config[1].copy())
update_config(revive_config, "bc_epoch", 10)

Since GAIL is used for learning, we do not want to take models learned during BC initialization into account when selecting the best model. Therefore, we only start recording models after BC initialization finishes (after 10 epochs), by setting `save_start_epoch` to 10 (more configs like this can be found in Revive's source code): 

In [None]:
base_config.append({
    "name": "save_start_epoch",
    "abbreviation": "sse",
    "description": "We only save models after this epoch, default is 0 which means we save models from the beginning.",
    "type": "int",
    "default": 10,
    "tune": False
})

After deciding the tuned parameters for the training config, save it as `baseline/data/config.json`:

In [None]:
output_config_file = f"{baseline_root}/data/config.json"

with open(output_config_file, 'w') as f: # Write config.json to data directory
    json.dump(config, f, indent=2)

## Train and evaluate the virtual environment

We could start learning the virtual environment now. Firstly, define the run id used throughout the virtual environment training process:

In [None]:
# Define the run id. This id will be used in subsequent code cells.
run_id = "venv_baseline"
%env RUN_ID $run_id

Then start the training program (**NOTE**: To continue quickly with the notebook, you may stop this training cell as soon as a usable venv model has been generated. The actual venv training may be run as a background process outside of this notebook so that it does not block the kernel from subsequent commands).

In [None]:
%pushd $baseline_root/data
# Ensure the license is correctly set
%env PYARMOR_LICENSE=$baseline_root/data/license.lic

# Start training. Only train the minimum number of trials (3) so the notebook is not blocked for too long
!python $BASELINE_ROOT/revive/train.py --run_id $RUN_ID -rcf config.json --data_file venv.npz --config_file venv.yaml --venv_mode tune --policy_mode None --train_venv_trials 3
%popd

* `config.json` contains the parameters for training.
* `venv.yaml` contains the decision flow graph definition shown in the image above.
* `venv.npz` contains the actual dataset, with each field attached to a certain node of the decision flow graph described in `venv.yaml`.
* `--policy_mode` is set to `None` to train only the virtual environment. Policy training is delegated to the `stablebaselines3` library in the next section.
* `run_id` is set to `venv_baseline`. Correspondingly, the training log is located at `baseline/revive/logs/venv_baseline/`, and the parameters file of the best model is `venv_baseline/env.pkl`.

During virtual environment training, multiple sets of venv parameters are generated in different trials. It is important to select the best trained environment for subsequent policy learning, or else the policy will be trained on an environment far different from the real one. 

### (a) Automatic evaluation with specific metrics

Revive uses a specific metric to automatically evaluate the environments, sorts them in ascending order of the metric, and chooses the environment with the lowest metric value as the best one. This behavior can be checked in `train_venv.json`:

In [None]:
%pushd $baseline_root/revive/logs/$run_id
!cat train_venv.json
%popd

Here, the `metrics` field lists all trials with their metric value, accuracy, and corresponding trial subfolder. The trials in `metrics` are sorted in ascending order of the metric, and the `best_id` field refers to the first trial in `metrics`.

The specific metric to use is defined in `venv_metric` field in `config.json`, including:
* `mae`: Mean Absolute Error, computed between expert data and **1-step rollout from expert data**
* `mse`: Mean Square Error, computed between expert data and **1-step rollout from expert data**
* `nll`: Negative Log Likelihood, computed between expert data and **1-step rollout from expert data**
* `wdist`: Wasserstein Distance, computed between expert data and **Multi-step rollout from expert data**
* `shooting_mae`: Mean Absolute Error, computed between expert data and **Multi-step rollout from expert data**
* `shooting_mse`: Mean Square Error, computed between expert data and **Multi-step rollout from expert data**

That is, for `mae`, `mse` and `nll`, the test data is generated by rolling out only 1 step from the expert data, and the metric is computed between the test data and the expert data; for `wdist`, `shooting_mae` and `shooting_mse`, the test data is generated by rolling out multiple steps, as defined by the `venv_rollout_horizon` field in `config.json`.
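
The difference between the two families of metrics can be seen on a toy example. The sketch below is not Revive's implementation: `one_step_mae`, `shooting_mae`, the scalar trajectory and the biased model are all made up to show how a small per-step error compounds over a rollout.

```python
import numpy as np

def one_step_mae(model_step, expert):
    """Predict step t+1 from the expert value at step t, for every t."""
    pred = np.array([model_step(x) for x in expert[:-1]])
    return float(np.mean(np.abs(pred - expert[1:])))

def shooting_mae(model_step, expert, horizon):
    """Start at expert[0], then feed the model its own predictions."""
    x, errors = expert[0], []
    for t in range(1, horizon + 1):
        x = model_step(x)
        errors.append(abs(x - expert[t]))
    return float(np.mean(errors))

# Toy expert trajectory that grows by 1 per step, and a model that adds 1.1
expert = np.arange(5, dtype=float)

def biased_step(x):
    return x + 1.1

print(one_step_mae(biased_step, expert), shooting_mae(biased_step, expert, 4))
```

The 1-step error stays at about 0.1 per step, while the multi-step error grows with the horizon because each prediction starts from the previous, already biased one.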

There are more detailed differences between 1-step and multi-step metrics. What matters for our offline RL problem is that a multi-step metric better represents the similarity between the offline data and the virtual environment over a whole rollout. 

Our baseline selects `wdist` as the metric. If none of the metrics above suits you, you can develop your own by implementing the related code in Revive SDK yourself (contact the organizers if you need help understanding the Revive source code).


**To view a more specific training log, we can select one trial id from `train_venv.json`:**

In [None]:
%pushd $baseline_root/revive/logs/$run_id

import json

with open("train_venv.json") as f:
    report = json.load(f)
    
best_id = report["best_id"]

# Change this to the id of the trial you want to inspect
trail_id = best_id

trail_dir = report["metrics"][str(trail_id)]["traj_dir"]

print(f"* --- Trail id {trail_id} in dir {trail_dir} --- *")

%popd

Then either 1) view the statistics in `progress.csv`, or 2) view the training curves through tensorboard:

In [None]:
%load_ext tensorboard

%tensorboard --bind_all --logdir $trail_dir

### (b) Manual evaluation with histogram and rollout image

Sometimes a metric alone is not enough to choose a reasonable virtual environment out of all trials, and you may need to pick a suitable one yourself by manually examining the trials. 

Within each trial directory, Revive also saves histograms and rollout images to help your decision, organized as follows:

* histogram/
  * action_1.field_1-train/val.png
  * action_2.field_1-train/val.png
  * action_2.field_2-train/val.png
  * next_state.field_\*-train/val.png
* rollout_images/
  * action_1/
    - 0_action_1.png
    - 1_action_1.png
  * action_2/
    - \*_action_2.png
  * next_state/
    - \*_next_state.png
    
In `venv.yaml`, we define `action_2` as the user action, which is the virtual environment to learn. So we mainly focus on `action_2` related images here to determine whether a virtual environment is generating correct data.

Let us display some images from the trial selected in the last section:

In [None]:
%pushd $trail_dir

from IPython.display import Image, display

print("Frequency of each day's order num from user:")

display(Image(filename="histogram/action_2.day_order_num-train.png"))

print("User actions from one of the rollout:")

display(Image(filename="rollout_images/action_2/0_action_2.png"))

%popd

In the first image, the histogram shows the frequency of each action value (e.g. the user places 1, 3, 5 or 10 orders a day), collected from all users over all days. 

In the second image, the rollout image shows the action sequence generated in a specific rollout. 

Therefore, these two kinds of images depict the quality of the virtual environment in two ways: 1) the overall user tendency and 2) a specific user's responses over multiple days. Both are expected to be similar to those generated from the expert data, so participants can use them to manually select a good virtual environment.
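
If eyeballing the histograms is not enough, the same comparison can be expressed as a number. The sketch below computes the total variation distance between two order-count histograms; this is a swapped-in illustration rather than a metric that Revive reports, and `order_histogram`, `total_variation` and the sample values are hypothetical.

```python
import numpy as np

def order_histogram(orders, max_order):
    """Relative frequency of each order count 0 .. max_order."""
    counts = np.bincount(np.asarray(orders, dtype=np.int64), minlength=max_order + 1)
    return counts / counts.sum()

def total_variation(expert_orders, venv_orders):
    """0.0 for identical histograms, 1.0 for histograms with no overlap."""
    max_order = int(max(np.max(expert_orders), np.max(venv_orders)))
    p = order_histogram(expert_orders, max_order)
    q = order_histogram(venv_orders, max_order)
    return 0.5 * float(np.abs(p - q).sum())

expert_orders = [0, 0, 1, 1]
print(total_variation(expert_orders, [0, 0, 1, 2]))      # mostly overlapping
print(total_variation(expert_orders, [20, 20, 20, 20]))  # no overlap at all
```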

Here is an example histogram of a poorly learned virtual environment:

![](baseline/docs/images/venv-poor-learned.png)

Although the expert data show that users tend not to place more than 5 orders a day, this virtual environment has learned to predict that users will place more than 20 orders a day, up to a maximum of 80, which will result in an unrealistically high reward in policy learning.

Here is another example histogram, from a relatively well learned virtual environment:

![](baseline/docs/images/venv-well-learned.png)

Although this virtual environment is still far from matching the expert data, its histogram's upper bound (around 22) is nearly the same as the lower bound of the poorly learned one (around 19). Policy learning on this venv will produce a reasonable reward, as described in the next section.

## Get the model parameters for virtual environment

After obtaining a desired virtual environment, copy it to the `baseline/data` folder in one of these two ways:

### (a) Use the model with best metric

The model with the best metric is saved as `$RUN_ID/env.pkl`:

In [None]:
!cp -f $BASELINE_ROOT/revive/logs/$RUN_ID/env.pkl $BASELINE_ROOT/data/venv.pkl

### (b) Use the model with specific trail id

The model of a specific trial is saved as `venv.pkl` in its trial directory. We use the trial directory selected in the last section:

In [None]:
%env TRAIL_DIR=$trail_dir

!cp -f $TRAIL_DIR/venv.pkl $BASELINE_ROOT/data/venv.pkl

# Step 3: Learn a fair promotion policy from virtual environment

After learning a virtual environment, we can start learning a fair promotion policy based on it. An unfair promotion policy takes an individual user's state as input and outputs a different promotion action for each user. To learn a fair policy, by contrast, the input should be the states of the entire user community (such as all users in a city), and the output should be the same promotion action for all users.

## Environment and MDP Setup

The virtual environment we have learned in Step 2 is actually a decision graph. Take another look at the decision graph:

![](baseline/docs/images/revive_graph_en.png)

To compute a specific node on this graph, the values of its ingress nodes are required. Therefore, as shown in the graph, to predict a user's response action we must provide the user's current state and our promotion action.

Here, we take the initial states processed in Step 1 as the state, and choose sending no coupon as our action:

In [None]:
import numpy as np

# Use user states from first day (day 31) as initial states
initial_states = np.load(f"{baseline_root}/data/user_states_by_day.npy")[0] 

# Send no coupon (0, 1.00) to all users
zero_actions = np.array([(0, 1.00) for _ in range(initial_states.shape[0])])

Then users' response action could be predicted by virtual environment in this way:

In [None]:
import pickle as pk
import sys

with open(f"{baseline_root}/data/venv.pkl", "rb") as f:
    venv = pk.load(f, encoding="utf-8")
    sys.path.pop(0)
    

# Propagate from initial_states and zero_actions for one step, returning the values of all nodes after propagation
node_values = venv.infer_one_step({ "state": initial_states, "action_1": zero_actions })
user_action = node_values["action_2"]
print("Node values after propagating for one step:", node_values)
print("Predicted user actions:", user_action)

In the code above, we derive a set of user actions from a set of user states and the corresponding coupon actions, but this is not enough to train a policy with reinforcement learning methods. To train a policy, we need to define our MDP with concepts such as action space, observation space, and reward.

We use the `gym.Env` API here to formally set up our reinforcement-learning-ready environment.

### (a) Action space

Since a fair promotion policy sends the same action to all users, the action space must be a 1-d array of `[coupon_num, coupon_discount]`.

Here, we restrict the number of coupons sent in a day to fewer than 6 (0 to 5). In the real world, discounts are normally arranged at fixed intervals (e.g. 0.90, 0.85, 0.80...), so the possible values of the coupon discount are also finite and discrete. In the competition and the baseline, the coupon discount is restricted to one of \[0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60\].

Therefore, the action space could be described as `gym.MultiDiscrete([6, 8])`, where 
* `coupon_num = action[0]`, 
* `coupon_discount = 0.95 - 0.05 * action[1]`. 

Participants are free to change the action space used in their own policy; for example, you can use any real number in \[0.60, 1.00\] to represent the coupon discount. Note, however, that the online evaluation environment only supports the fixed coupon discount values mentioned above, so remember to standardize the coupon discount to these values in your submitted policy validation program.
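As a minimal sketch of the mapping and the standardization step described above, the helpers below decode a `MultiDiscrete([6, 8])` action and snap an arbitrary discount to the nearest allowed value. The names `decode_action` and `snap_discount` are hypothetical and are not part of the starting kit.

```python
# Allowed discounts as listed above: 0.95, 0.90, ..., 0.60
ALLOWED_DISCOUNTS = [round(0.95 - 0.05 * i, 2) for i in range(8)]


def decode_action(action):
    """Map a MultiDiscrete([6, 8]) action to (coupon_num, coupon_discount)."""
    coupon_num = int(action[0])                           # 0..5 coupons
    coupon_discount = ALLOWED_DISCOUNTS[int(action[1])]   # 0.95 down to 0.60
    return coupon_num, coupon_discount


def snap_discount(discount):
    """Return the allowed discount closest to an arbitrary real value."""
    return min(ALLOWED_DISCOUNTS, key=lambda d: abs(d - discount))


print(decode_action([3, 2]))  # (3, 0.85)
print(snap_discount(0.87))    # 0.85
```

A policy that works with continuous discounts internally could call something like `snap_discount` just before returning its action in the submitted program.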

### (b) Observation space

A fair promotion policy expects the states of all users as input. However, the states of all users form a fairly high-dimensional input, so instead of returning the original user states, we perform a dimension reduction and return a low-dimensional observation to the policy.

As a baseline, we adopt a very straightforward method: compute some basic statistics over all users' states, and append a few additional values to form the observation:
* `mean`: np.mean(states, axis=0)
* `std`: np.std(states, axis=0)
* `max`: np.max(states, axis=0)
* `min`: np.min(states, axis=0)
* `day_total_order_num`: Sum of orders of all users in last day
* `day_roi`: ROI of last day

Here, we compute an observation from the initial states, with the day order number and ROI defaulting to 0:

In [None]:
from user_states import states_to_observation

obs = states_to_observation(initial_states, 0, 0.0)
print(obs.shape) # 14 = 3 + 3 + 3 + 3 + 1 + 1
print(obs)
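For illustration only, an observation of this shape could be built as in the following sketch. It assumes `states` is a 2-d array of shape `(num_users, 3)`. The function name `states_to_observation_sketch` is hypothetical, and the actual `states_to_observation` in `user_states.py` may differ in ordering or detail.

```python
import numpy as np


def states_to_observation_sketch(states, day_total_order_num=0, day_roi=0.0):
    """Reduce per-user states to one low-dimensional observation vector."""
    states = np.asarray(states, dtype=np.float64)
    # Per-feature statistics over all users (axis 0 is the user axis)
    stats = [
        np.mean(states, axis=0),
        np.std(states, axis=0),
        np.max(states, axis=0),
        np.min(states, axis=0),
    ]
    # Community-level values from the previous day
    extras = np.array([day_total_order_num, day_roi], dtype=np.float64)
    return np.concatenate(stats + [extras])


# Toy input: 4 users with 3 state features each (made-up numbers)
toy_states = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]])
obs_sketch = states_to_observation_sketch(toy_states, 0, 0.0)
print(obs_sketch.shape)  # (14,)
```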

### (c) Reward

Review the target of this competition:

**After several days of evaluation (14 in development phase, 30 in final), gain as much GMV as possible, on the premise that all days' ROI >= 6.5.**

Therefore, the reward should be designed so that:
* Return negative or reduced value if ROI does not meet the threshold.
* When ROI meets the requirement, return positive value that grows only with GMV (Since extra ROI does not count anymore).

As a baseline, we design a simple reward as follows:
* It is a delayed reward: it is non-zero only on the last day of evaluation, since the ROI is calculated over all days.
* The ROI threshold is set a little higher than the one in the evaluation program: it is hard to train a policy that reliably passes the target threshold, so during training we set the threshold to 8.0, with the expectation that the policy will generally produce an ROI greater than 6.5.
* The positive reward is defined as the ratio between the actual GMV and the GMV when sending no coupon (set as `ZERO_GMV=81840` in the baseline). The negative reward is defined as the difference between the actual ROI and the threshold ROI.
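The three rules above can be sketched as a single function. This is a simplified illustration, not the baseline's actual code: the function name `delayed_reward` and its arguments are hypothetical, and treating an ROI exactly equal to the threshold as passing is an assumption.

```python
ZERO_GMV = 81840      # GMV when sending no coupon, as set in the baseline
ROI_THRESHOLD = 8.0   # training threshold, stricter than the 6.5 used in evaluation


def delayed_reward(total_gmv, total_roi, is_last_day):
    """Return 0 until the last day, then score the whole episode."""
    if not is_last_day:
        return 0.0
    if total_roi >= ROI_THRESHOLD:
        # Positive reward: ratio to the no-coupon GMV
        return total_gmv / ZERO_GMV
    # Negative reward: how far the ROI falls short of the threshold
    return total_roi - ROI_THRESHOLD


print(delayed_reward(163680, 9.0, is_last_day=True))   # 2.0
print(delayed_reward(100000, 6.0, is_last_day=True))   # -2.0
print(delayed_reward(100000, 6.0, is_last_day=False))  # 0.0
```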

## Training of the policy

After the definition of MDP, we can start to train the policy using any available reinforcement learning algorithm. In the baseline, we directly use the Proximal Policy Optimization (PPO) algorithm from `stablebaselines3` library for training.

Start tensorboard for policy training:

In [None]:
%load_ext tensorboard

!mkdir -p $baseline_root/data/logs

%tensorboard --bind_all --logdir $baseline_root/data/logs

Then start policy training, where the progress will be logged to tensorboard above:

In [None]:
%pushd $baseline_root/data

import importlib
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import CheckpointCallback
import virtual_env
from virtual_env import get_env_instance

importlib.reload(virtual_env)

save_path = 'model_checkpoints'

env = get_env_instance('user_states_by_day.npy', 'venv.pkl')
model = PPO("MlpPolicy", env, n_steps=840, batch_size=420, verbose=1, tensorboard_log='logs')
checkpoint_callback = CheckpointCallback(save_freq=int(1e4), save_path=save_path)
model.learn(total_timesteps=int(8e6), callback=[checkpoint_callback])

%popd

Since the policy is trained with a fixed set of hyperparameters, a model trained for more steps normally performs better.

**We copy the latest checkpoint to `baseline/data`, ready for evaluation in the next section:**

In [None]:
%pushd $baseline_root/data/model_checkpoints
!cp -f $(ls -Art . | tail -n 1) $BASELINE_ROOT/data/rl_model.zip
%popd
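The shell command above picks the most recently modified file. As an alternative sketch, the checkpoint with the largest step count can be selected by file name, assuming the checkpoints follow the `rl_model_<N>_steps.zip` naming that `CheckpointCallback` uses with its default prefix. The helper names `latest_checkpoint` and `copy_latest` are hypothetical.

```python
import re
import shutil
from pathlib import Path


def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint path with the largest step count, or None."""
    pattern = re.compile(r"rl_model_(\d+)_steps\.zip$")
    candidates = []
    for path in Path(checkpoint_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    # Compare by step count only, not by path
    return max(candidates, key=lambda item: item[0])[1]


def copy_latest(checkpoint_dir, target):
    """Copy the latest checkpoint to `target`; return the source path or None."""
    best = latest_checkpoint(checkpoint_dir)
    if best is not None:
        shutil.copyfile(best, target)
    return best
```

For example, running `copy_latest("model_checkpoints", "rl_model.zip")` from `baseline/data` would have the same effect as the cell above whenever modification times and step counts agree.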

## Evaluation of the policy

The most convincing evaluation of our policy is to upload it to the competition website and fetch the online evaluation result on the real environment. What we can do locally is evaluate the policy on the virtual environment, where the reliability of the result depends heavily on the accuracy of the virtual environment. Therefore, it should be noted again that **designing an effective virtual environment is very important**.

Here we perform a rollout validation using the virtual environment and the policy:

In [None]:
%pushd $baseline_root/data
import importlib
import numpy as np
import random
from stable_baselines3 import PPO
import virtual_env

importlib.reload(virtual_env)

env = virtual_env.get_env_instance("user_states_by_day.npy", "venv.pkl")
policy = PPO.load("rl_model.zip")
validation_length = 14

obs = env.reset()
for day_index in range(validation_length):
    coupon_action, _ = policy.predict(obs, deterministic=True) # Some randomness will be added to action if deterministic=False
    # coupon_action = np.array([random.randint(0, 5), random.randint(0, 7)]) # Random action (randint is inclusive, matching MultiDiscrete([6, 8]))
    obs, reward, done, info = env.step(coupon_action)
    if reward != 0:
        info["Reward"] = reward
    print(f"Day {day_index+1}: {info}")
%popd

There are some heuristic hints for manually determining whether the virtual environment and the policy work properly. Here are some examples:

**(a) Check output of each infer step.** Some outputs may indicate that the virtual environment was not learned correctly. Take sending 5 coupons of discount 0.95 from the initial states as an example:

In [None]:
import pickle as pk

with open(f"{baseline_root}/data/venv.pkl", "rb") as f:
    venv = pk.load(f, encoding="utf-8")
    
initial_states = np.load(f"{baseline_root}/data/user_states_by_day.npy")[10]
coupon_actions = np.array([(5, 0.95) for _ in range(initial_states.shape[0])])

node_values = venv.infer_one_step({ "state": initial_states, "action_1": coupon_actions })
user_actions = node_values['action_2']
day_order_num, day_avg_fee = user_actions[..., 0].round(), user_actions[..., 1].round(2)
print(day_order_num.reshape((-1,))[:100])
print(day_avg_fee.reshape((-1,))[:100])

If you find that fees are still generated (`day_avg_fee != 0`, with a value far greater than 0) when a user did not place any order (`day_order_num == 0`), then your venv may not have been learned correctly.

On the other hand, if the venv generally responds correctly, but returns a very strong user response (large order numbers or large order fees) for a certain coupon action that occurs rarely in the offline dataset, then your venv has also not been learned correctly. As a result, your policy will tend to select only this coupon action once it is found in the action space.
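The first symptom above (fees without orders) can also be checked numerically rather than by eye. The sketch below assumes `user_actions` has `day_order_num` and `day_avg_fee` as its last-axis columns 0 and 1, as in the cell above. The function name `zero_order_fee_ratio` and the `fee_tolerance` value are hypothetical.

```python
import numpy as np


def zero_order_fee_ratio(user_actions, fee_tolerance=1.0):
    """Fraction of zero-order predictions that still carry a noticeable fee."""
    user_actions = np.asarray(user_actions, dtype=np.float64)
    day_order_num = user_actions[..., 0].round()
    day_avg_fee = user_actions[..., 1]
    zero_orders = day_order_num == 0
    if not zero_orders.any():
        return 0.0
    # Among users predicted to place no order, count those with a large fee
    return float(np.mean(np.abs(day_avg_fee[zero_orders]) > fee_tolerance))


# Toy predictions (made-up numbers): columns are [day_order_num, day_avg_fee]
toy_user_actions = np.array([[0.1, 0.2], [0.0, 35.0], [2.0, 30.0], [0.3, 0.0]])
ratio = zero_order_fee_ratio(toy_user_actions)
print(f"{ratio:.2f}")  # 0.33
```

A ratio close to 0 is what a well learned venv would be expected to produce; what counts as too high is a judgment call.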

**(b) Check reward curve.** The reward during rollout should be reasonable. A normal reward curve usually follows this pattern:

1. It starts from a negative reward, since at the beginning the ROI may not meet the threshold requirement. This is optional if your policy handles ROI well.

2. The reward is no greater than 10. The positive reward is defined as the ratio between the actual GMV and the zero-action GMV (the GMV when sending no coupon, set to 81840). The GMV of any action will not exceed the zero-action GMV by too much, so if your policy appears 10 or even 50 times better than `ZERO_GMV` (meaning a reward > 10), then you must have selected a poor virtual environment (e.g. a venv that tells you your users will place 80 orders a day).

# Step 4: Generate a submission bundle

After obtaining a fair promotion policy, it's time to upload your model for online evaluation to get a final score. Only the policy needs to be included in your submission, since the competition platform will use the real environment to evaluate your policy.

## File structure

The uploaded file is a `.zip` bundle with the following file structure (as shown in `sample_submission`):
* `data/`: data folder containing 1) the initial states defined in Step 1; 2) the policy model parameters learned in Step 3.
* `metadata`: YAML-format description file specifying the requirements on the runtime environment (pytorch-1.8, pytorch-1.10, etc.)
* `policy_validation.py`: entrypoint file containing an interface class and a function to fetch participant's policy instance.
* `random_policy_validation.py`: an implementation of `PolicyValidation` that returns coupon actions randomly.
* `baseline_policy_validation.py`: an implementation of `PolicyValidation` that uses the baseline model.

## PolicyValidation file

The online evaluation program invokes participant's code through an interface named `PolicyValidation`. It is an abstract class defining required members and methods to be implemented by participants:

```py
from abc import abstractmethod
from typing import Any, List

import numpy as np


class PolicyValidation:
    """Abstract class defining the interfaces required by evaluation program.
    """

    # initial_states is the participant-defined first day's user states in evaluation,
    # derived from the offline data on May 18th.
    initial_states: Any = None

    @abstractmethod
    def __init__(self, *args, **kwargs):
        """Initialize the members required for your model here.
        You may also provide some parameters for __init__ method,
        but you must fill the arguments yourself in the get_pv_instance function.
        """

    @abstractmethod
    def get_next_states(self, cur_states: Any, coupon_action: np.ndarray, user_actions: List[np.ndarray]) -> Any:
        """Generate next day's user state from current day's coupon action and user's response action.
        """
        pass

    @abstractmethod
    def get_action_from_policy(self, user_states: Any) -> np.ndarray:
        """Generate current day's coupon action based on current day's user states depicted by participants.
        """
        pass
```
(For detailed interface documentation, see `policy_validation.py`).

Along with `PolicyValidation` class, there is a function `get_pv_instance() -> PolicyValidation`, which will be invoked by evaluation program to fetch participant's implementation of `PolicyValidation`:
```py
def get_pv_instance() -> PolicyValidation:
    from my_policy_validation import MyPolicyValidation
    return MyPolicyValidation(<your arguments...>)
```

Participants should inherit the `PolicyValidation` class, implement their own policy, and put initialization code in `get_pv_instance()`, so as to be successfully processed by online evaluation program.
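To show how these pieces fit together, here is a minimal, self-contained sketch of a subclass that always returns the same coupon action. Everything in it is illustrative: the stand-in base class only mirrors the interface above so the snippet runs on its own (a real submission would import `PolicyValidation` from `policy_validation.py`), the class name `FixedActionPolicyValidation` is hypothetical, and returning the action as `[coupon_num, coupon_discount]` is an assumption based on the action format described in Step 3.

```python
from abc import ABC, abstractmethod
from typing import Any, List

import numpy as np


class PolicyValidation(ABC):
    """Stand-in mirroring the interface above, so this sketch runs on its own."""
    initial_states: Any = None

    @abstractmethod
    def get_next_states(self, cur_states: Any, coupon_action: np.ndarray,
                        user_actions: List[np.ndarray]) -> Any: ...

    @abstractmethod
    def get_action_from_policy(self, user_states: Any) -> np.ndarray: ...


class FixedActionPolicyValidation(PolicyValidation):
    """Hypothetical policy that sends the same coupon action every day."""

    def __init__(self, initial_states: Any, coupon_num: int, coupon_discount: float):
        self.initial_states = initial_states
        self.action = np.array([coupon_num, coupon_discount])

    def get_next_states(self, cur_states, coupon_action, user_actions):
        # Placeholder: a real implementation would update the user states
        # from the coupon action and the users' responses.
        return cur_states

    def get_action_from_policy(self, user_states):
        return self.action


def get_pv_instance() -> PolicyValidation:
    # Toy initial states; a real submission would load them from data/
    toy_initial_states = np.zeros((4, 3))
    return FixedActionPolicyValidation(toy_initial_states, 1, 0.95)


pv = get_pv_instance()
print(pv.get_action_from_policy(pv.initial_states))
```

The shipped `random_policy_validation.py` and `baseline_policy_validation.py` remain the authoritative examples of a complete implementation.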

## Metadata

Each participant's submission runs in an individual docker container. Since different teams may have their own requirements on the runtime environment (e.g. pytorch version), we provide an `image` field in the `metadata` file, allowing participants to choose which docker image to use for their evaluation process:

```yaml
image: pytorch-1.8
```
We currently support `pytorch-1.8`, `pytorch-1.9`, and `pytorch-1.10`. If your team requires other DL frameworks such as tensorflow, keras, or mxnet, or depends on libraries that the provided images do not contain, you can contact the organizers with your list of dependencies. If appropriate, we will build an image satisfying the dependencies for you.

## Create a submission

If you are using the baseline model, you can use the following commands to update the parameters:

In [None]:
%pushd $baseline_root/../sample_submission
!cp -f $BASELINE_ROOT/data/evaluation_start_states.npy ./data/evaluation_start_states.npy
!cp -f $BASELINE_ROOT/data/rl_model.zip ./data/rl_model.zip
!cp -f $BASELINE_ROOT/user_states.py ./data/user_states.py
%popd

When your submission is ready, create a zip file out of your submission folder:

In [None]:
%pushd $baseline_root/../sample_submission
!zip -o -r --exclude='*.git*' --exclude='*__pycache__*' --exclude='*.DS_Store*' --exclude='*public_data*' ../sample_submission .;
%popd

Congratulations! Now you can upload your submission to https://codalab.lisn.upsaclay.fr/competitions/823#participate-submit_results.