# Summary

This notebook uses the Ray library (https://docs.ray.io/en/latest/index.html) to run trials of two agents in parallel. It may be especially useful if you develop agents on a local machine with a high CPU count. In the dual-CPU Kaggle kernel environment the speed-up will be limited.

I re-use aspects of other notebooks for which credit goes to their authors:

* @xhlulu https://www.kaggle.com/xhlulu/santa-2020-ucb-and-bayesian-ucb-starter   
* @isaienkov https://www.kaggle.com/xhlulu/santa-2020-ucb-and-bayesian-ucb-starter
* @JumabekAlihanov https://www.kaggle.com/jumabek/plot-comparison-ucb-vs-bayesian-isaienkov-s-code


Confession time: I'm an R-first programmer, so please excuse my poor Python; comments on improvements are welcome! The Ray library itself looks rather interesting, as it also includes agents and other capabilities for reinforcement learning.



## 1 Setup libraries

In [None]:
!pip install kaggle-environments --upgrade

from kaggle_environments import make
from collections import defaultdict
import numpy as np
import psutil
import ray
import scipy.signal
import matplotlib.pyplot as plt


## 2 Define two agents

The two agents are taken from the notebook of @xhlulu https://www.kaggle.com/xhlulu/santa-2020-ucb-and-bayesian-ucb-starter


In [None]:
%%writefile ucb_decay.py

import numpy as np

decay = 0.97
total_reward = 0
bandit = None

def agent(observation, configuration):
    global reward_sums, n_selections, total_reward, bandit
    
    n_bandits = configuration.banditCount

    if observation.step == 0:
        n_selections, reward_sums = np.full((2, n_bandits), 1e-32)
    else:
        reward_sums[bandit] += decay * (observation.reward - total_reward)
        total_reward = observation.reward

    avg_reward = reward_sums / n_selections    
    delta_i = np.sqrt(2 * np.log(observation.step + 1) / n_selections)
    bandit = int(np.argmax(avg_reward + delta_i))

    n_selections[bandit] += 1

    return bandit
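
To make the exploration term in this agent concrete, here is a minimal standalone sketch of the bonus `sqrt(2 * ln(step + 1) / n_selections)`, using only the standard library. The helper name `ucb_bonus` and the sample counts are illustrative assumptions, not part of the original agent.

```python
import math

def ucb_bonus(step, n_selections):
    """Exploration bonus sqrt(2 * ln(step + 1) / n), mirroring delta_i in the agent.

    Hypothetical helper for illustration only.
    """
    return math.sqrt(2 * math.log(step + 1) / n_selections)

# At a fixed step, bandits that were pulled more often get a smaller bonus.
step = 100
bonuses = {n: ucb_bonus(step, n) for n in (1, 5, 25)}
for n, bonus in bonuses.items():
    print(f"pulled {n:>2} times -> bonus {bonus:.3f}")
```

Because untried bandits start with a tiny count (`1e-32`) in the agent, their bonus is enormous from step 1 onwards, so each bandit tends to be tried once before the average rewards start to matter.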

In [None]:
%%writefile bayesian_ucb.py

import numpy as np
from scipy.stats import beta

post_a, post_b, bandit = [None] * 3
total_reward = 0
c = 3

def agent(observation, configuration):
    global total_reward, bandit, post_a, post_b, c

    if observation.step == 0:
        post_a, post_b = np.ones((2, configuration.banditCount))
    else:
        r = observation.reward - total_reward
        total_reward = observation.reward
        # Update Beta posterior (a counts successes, b counts failures)
        post_a[bandit] += r
        post_b[bandit] += 1 - r
    
    bound = post_a / (post_a + post_b) + beta.std(post_a, post_b) * c
    bandit = int(np.argmax(bound))
    
    return bandit
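
As a rough illustration of the bound this agent maximises, the sketch below recomputes `mean + c * std` for a Beta posterior with the closed-form formulas, so it needs no SciPy. The helper names and the win/loss counts are assumptions chosen for the example.

```python
import math

def beta_mean(a, b):
    # Mean of a Beta(a, b) distribution.
    return a / (a + b)

def beta_std(a, b):
    # Closed-form standard deviation of a Beta(a, b) distribution.
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

def bayesian_ucb_bound(a, b, c=3):
    # Same shape as the agent's bound: posterior mean plus c standard deviations.
    return beta_mean(a, b) + c * beta_std(a, b)

fresh = bayesian_ucb_bound(1, 1)            # untried bandit, uniform prior
lucky = bayesian_ucb_bound(1 + 8, 1 + 2)    # illustrative: 8 wins, 2 losses
unlucky = bayesian_ucb_bound(1 + 2, 1 + 8)  # illustrative: 2 wins, 8 losses

print(f"fresh={fresh:.3f} lucky={lucky:.3f} unlucky={unlucky:.3f}")
```

With `c = 3`, the untried bandit still has the largest bound here because its posterior is so wide, which is what drives exploration early in an episode.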

## 3 Create utility functions

The first function plots the final rewards from trials of two agents. It comes from the code of @JumabekAlihanov https://www.kaggle.com/jumabek/plot-comparison-ucb-vs-bayesian-isaienkov-s-code

In [None]:
def plot_final_rewards(hist):
    num_episodes = 0
    colors_db = ['g','b']

    plt.figure(figsize=(12,8))
    for i,agent in enumerate(hist.keys()):
        plt.plot(hist[agent], label=agent, color=colors_db[i])
        num_episodes = len(hist[agent])
        avg_final_reward = np.array(hist[agent]).mean()
        plt.plot([0, num_episodes-1],[avg_final_reward, avg_final_reward], label=agent+' avg.', color=colors_db[i],linestyle='dashed')
        
    plt.legend(bbox_to_anchor=(1.2, 0.5))
    plt.xlabel("Iterations")
    plt.ylabel("Final Reward")
    plt.title("Final Agent Rewards for " 
              + str(num_episodes) + " Episodes")
    plt.show()

Now initialise the worker pool that will process trials across the available CPUs.

In [None]:
num_cpus = psutil.cpu_count(logical=False)
ray.shutdown()
ray.init(num_cpus=num_cpus)

Define the function run_trial, which will be executed on remote workers.

In [None]:
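@ray.remote
def run_trial(agent1, agent2):
    # Play one full "mab" episode between the two agent files on a Ray worker.
    hist = defaultdict(list)
    env = make("mab")
    env.reset()
    env.run([agent1, agent2])
    # After the episode, each agent's state holds its final total reward.
    hist[agent1].append(env.state[0]['reward'])
    hist[agent2].append(env.state[1]['reward'])
    return hist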

Lastly, define a local function to collate the results from the remote workers.

In [None]:
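def run_trials(agent1, agent2, num_trials):
    # Submit every trial up front so Ray can schedule them across the workers.
    result_ids = []

    for i in range(num_trials):
        result_ids.append(run_trial.remote(agent1, agent2))

    # Block until all trials have finished, then merge the per-trial results.
    results = ray.get(result_ids)
    hist = defaultdict(list)

    for i in range(num_trials):
        # Each result holds a one-element list per agent; extend keeps hist flat.
        hist[agent1].extend(results[i][agent1])
        hist[agent2].extend(results[i][agent2])

    return hist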
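
One way to merge per-trial dictionaries into a single flat list of rewards per agent can be sketched without Ray. The `results` list below stands in for the output of `ray.get(...)`, and its rewards are made-up placeholder numbers, not measured results.

```python
from collections import defaultdict

# Stand-in for ray.get(result_ids): one dict per trial (made-up rewards).
results = [
    {"ucb_decay.py": [10], "bayesian_ucb.py": [12]},
    {"ucb_decay.py": [11], "bayesian_ucb.py": [9]},
]

hist = defaultdict(list)
for trial in results:
    for agent_name, rewards in trial.items():
        # extend (rather than append) flattens into one list per agent
        hist[agent_name].extend(rewards)

print(dict(hist))
```

A flat list per agent is the shape that `plot_final_rewards` reads most naturally: one final reward per episode.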

## 4 Run trial and plot results

In [None]:
hist = run_trials("ucb_decay.py", "bayesian_ucb.py", 30)
plot_final_rewards(hist)


By processing in parallel across the available CPUs, you can run more comparison trials in less time, enabling quicker refinement of agents. This may be especially beneficial if you have compute resources with a high CPU count available.
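
As a back-of-the-envelope check on how much the CPU count matters, the sketch below estimates the ideal wall-clock time when trials run in batches of `num_cpus`. It assumes every trial takes the same time and ignores scheduling overhead; the helper name `ideal_wall_time` and the 10-second trial duration are placeholders, not measurements.

```python
import math

def ideal_wall_time(num_trials, num_cpus, seconds_per_trial):
    """Idealised wall-clock time: trials run in batches of num_cpus.

    Hypothetical helper; assumes equal trial durations and no overhead.
    """
    batches = math.ceil(num_trials / num_cpus)
    return batches * seconds_per_trial

# 30 trials, as in the run above; 10.0 s per trial is a placeholder value.
for cpus in (1, 2, 8, 16):
    estimate = ideal_wall_time(30, cpus, 10.0)
    print(f"{cpus:>2} CPUs -> about {estimate:.0f} s")
```

Under these assumptions, two CPUs can at best halve the run time, which is why the gain in the dual-CPU Kaggle kernel is modest compared with a many-core local machine.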