In [1]:
import numpy as np
import pandas as pd
from tqdm import tqdm

### Multi-Armed Bandits

Consider a scenario where you are staying in an unfamiliar place for a few days. There are a few restaurants in the area, and you get a certain amount of happiness from eating at each one. That happiness is quantified by a happiness index, which we assume is normally distributed with a fixed mean and variance for each restaurant, since you will not get the same amount of happiness on every visit (it depends on many factors: cooking style, food quality, your mood, etc.). These parameters are known to the simulation; the strategies below see only the sampled rewards.

As a rational person, your objective is to maximize your total happiness index over the stay. This creates a dilemma: keep going to the restaurant that has served the best food among those visited so far, or keep trying other restaurants in search of better food. The former is called Exploitation, while the latter is called Exploration.

You can do only one of the two on any given day, so you face an **Exploration-Exploitation Tradeoff**.

In this notebook, we simulate several strategies and compare them statistically to see which one performs best for a given set of parameters.

In [2]:
n_days = 300
n_choices = 3
means = [10, 8, 5]
std_devs = [5, 4, 25]
n_iterations = 100
eps = 0.1

We will examine the following strategies:
1. **Explore only:** Randomly choose a restaurant each day.
2. **Exploit only:** Try each restaurant once, then eat at the restaurant that gave the best meal for the remaining days.
3. **Epsilon-Greedy:** A mix of the explore and exploit strategies:
    1. For epsilon equal to 10%:
        1. Explore 10% of the time: randomly choose a restaurant.
        2. Exploit 90% of the time: choose the restaurant with the highest average reward so far.
4. **UCB-1 (Upper Confidence Bound - 1) Strategy:**
    1. Limitation of the Epsilon-Greedy strategy: when it picks the highest average reward, it does not account for the number of visits. This matters when a restaurant has been visited only once and that single happiness index happened to be unusually low or high, which makes it a poor estimate of the true average.
    2. UCB-1 improves on Epsilon-Greedy by taking the number of visits into account.
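
For reference, the score that UCB-1 compares across restaurants can be written as below. The symbols are introduced here for illustration only: $\bar{x}_i$ is the average happiness observed so far at restaurant $i$, $n_i$ is the number of visits to it, and $t$ is the current day. This matches the expression used in `get_best_strategy_ucb` later in the notebook.

```latex
\[
\text{score}_i(t) = \bar{x}_i + \sqrt{\frac{2 \ln t}{n_i}}
\]
```

The second term shrinks as $n_i$ grows, so restaurants with few visits receive a larger bonus and are more likely to be tried again.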

In [3]:
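def explore_only(means, std_devs):
    """Spread the visits evenly across all restaurants."""
    total_reward = 0
    # Equal split instead of a daily random draw: the expected total is the
    # same as for uniform random choice. Uses the globals n_days and n_choices.
    n_iters = int(n_days / n_choices)

    for mean, std_dev in zip(means, std_devs):
        total_reward += np.sum(np.random.normal(mean, std_dev, n_iters))
    return total_reward

def exploit_only(means, std_devs):
    """Try each restaurant once, then stick with the best one observed."""
    total_reward = 0
    rewards = []
    # One trial visit per restaurant
    for mean, std in zip(means, std_devs):
        reward = np.random.normal(mean, std)
        rewards.append(reward)
        total_reward += reward

    # Spend the remaining days at the restaurant with the best single trial
    best_idx = rewards.index(max(rewards))
    total_reward += np.sum(np.random.normal(means[best_idx], std_devs[best_idx], n_days - n_choices))
    return total_reward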
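
As a sanity check on the simulations, two reference values can be computed directly from the means. This is a minimal sketch using the parameters of the first experiment; the variable names `expected_explore` and `expected_oracle` are introduced here for illustration and are not part of the notebook.

```python
import numpy as np

# Parameters copied from the first experiment in this notebook.
n_days = 300
means = [10, 8, 5]

# Explore-only visits every restaurant equally often, so its expected total
# is the number of days times the average of the means.
expected_explore = n_days * np.mean(means)

# Upper bound in expectation for any strategy: eat at the restaurant with
# the highest mean on every single day (requires knowing the means).
expected_oracle = n_days * max(means)

print(int(round(expected_explore)), expected_oracle)
```

A simulated explore-only average should land near the first value, and no strategy should exceed the second on average.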

#### Epsilon - Greedy Strategy

In [4]:
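def get_best_strategy_eps(rewards, choices):
    """Return the index of the restaurant with the highest average reward so far."""
    rewards = np.array(rewards)
    choices = np.array(choices)

    avg_rewards = []
    for choice_idx in range(n_choices):
        # Days on which this restaurant was visited
        idxs = np.where(choices==choice_idx)[0]
        
        if len(idxs):
            avg_reward = np.mean(rewards[idxs])
        else:
            # Unvisited restaurants default to an average of 0
            avg_reward = 0

        avg_rewards.append(avg_reward)
    # Ties are broken in favor of the lowest index
    choice_idx = avg_rewards.index(max(avg_rewards))
    return choice_idx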

def eps_greedy(means, std_devs):
    total_reward = 0
    rewards = []
    choices = []

    for _ in range(n_days):
        strategy = np.random.choice([0,1], p=[eps, 1-eps])

        if strategy==0:
            # explore
            choice_idx = np.random.choice(n_choices)
        else:
            # exploit
            # choose based on maximum average reward so far
            choice_idx = get_best_strategy_eps(rewards, choices)

        reward = np.random.normal(means[choice_idx], std_devs[choice_idx])
        total_reward += reward
        rewards.append(reward)
        choices.append(choice_idx)
    return total_reward
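
The helper above recomputes every average from the full history on each day. A common alternative keeps a running mean per restaurant and updates it in place. The following is a minimal sketch of that idea, not the notebook's implementation: the function name `eps_greedy_running`, its keyword arguments, and the use of a seeded `np.random.default_rng` are assumptions made for illustration.

```python
import numpy as np

def eps_greedy_running(means, std_devs, n_days=300, eps=0.1, seed=0):
    """Epsilon-greedy with incrementally updated averages (illustrative)."""
    rng = np.random.default_rng(seed)
    n_choices = len(means)
    counts = np.zeros(n_choices, dtype=int)  # visits per restaurant
    averages = np.zeros(n_choices)           # running mean; 0 while unvisited
    total_reward = 0.0

    for _ in range(n_days):
        if rng.random() < eps:
            choice = int(rng.integers(n_choices))  # explore
        else:
            choice = int(np.argmax(averages))      # exploit
        reward = rng.normal(means[choice], std_devs[choice])
        counts[choice] += 1
        # Incremental mean: new_avg = old_avg + (reward - old_avg) / n
        averages[choice] += (reward - averages[choice]) / counts[choice]
        total_reward += reward
    return total_reward, counts

total, counts = eps_greedy_running([10, 8, 5], [5, 4, 25])
print(total, counts)
```

Each day then costs a constant amount of work instead of a scan over all past days, which matters mainly when `n_days` or the number of restaurants is large.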

#### UCB1 Strategy

In [5]:
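def get_best_strategy_ucb(time_idx, rewards, choices):
    rewards = np.array(rewards)
    choices = np.array(choices)

    # UCB1 score per restaurant: average reward plus an exploration bonus
    # that shrinks as the restaurant is visited more often.
    ucb_scores = []
    for choice_idx in range(n_choices):
        idxs = np.where(choices==choice_idx)[0]
        
        if len(idxs):
            ucb_score = np.mean(rewards[idxs]) + np.sqrt(2 * np.log(time_idx) / len(idxs))
        else:
            # Unvisited restaurants score 0 here; textbook UCB1 would instead
            # try every restaurant once before comparing scores.
            ucb_score = 0

        ucb_scores.append(ucb_score)
    choice_idx = ucb_scores.index(max(ucb_scores))
    return choice_idx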

def ucb1_strategy(means, std_devs):
    total_reward = 0
    rewards = []
    choices = []

    for time_idx in range(1, n_days+1):
        strategy = np.random.choice([0,1], p=[eps, 1-eps])

        if strategy==0:
            # explore: same epsilon-random step as in eps_greedy
            # (textbook UCB1 has no random step)
            choice_idx = np.random.choice(n_choices)
        else:
            # exploit
            # choose based on the highest upper confidence bound so far
            choice_idx = get_best_strategy_ucb(time_idx, rewards, choices)

        reward = np.random.normal(means[choice_idx], std_devs[choice_idx])
        total_reward += reward
        rewards.append(reward)
        choices.append(choice_idx)
    return total_reward
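
For comparison, the textbook form of UCB1 has no epsilon-random step: it tries every restaurant once and then always picks the highest score. The sketch below illustrates that variant under stated assumptions. The name `ucb1_textbook`, its arguments, and the seeded generator are hypothetical and not part of the notebook. The `sqrt(2 * log(t) / n)` bonus is usually derived for rewards bounded in [0, 1], so with unbounded normal rewards it should be read as a heuristic.

```python
import numpy as np

def ucb1_textbook(means, std_devs, n_days=300, seed=0):
    """Deterministic UCB1: one initial visit per restaurant, then argmax of the score."""
    rng = np.random.default_rng(seed)
    n_choices = len(means)
    counts = np.zeros(n_choices, dtype=int)  # visits per restaurant
    averages = np.zeros(n_choices)           # running mean reward
    total_reward = 0.0

    for t in range(1, n_days + 1):
        if t <= n_choices:
            choice = t - 1  # initial pass: try every restaurant once
        else:
            # All counts are >= 1 here, so the division is safe
            bonus = np.sqrt(2 * np.log(t) / counts)
            choice = int(np.argmax(averages + bonus))
        reward = rng.normal(means[choice], std_devs[choice])
        counts[choice] += 1
        averages[choice] += (reward - averages[choice]) / counts[choice]
        total_reward += reward
    return total_reward, counts

total, counts = ucb1_textbook([10, 8, 5], [5, 4, 25])
print(total, counts)
```

Because the initial pass costs one day per restaurant, this variant spends a large share of the horizon on first visits when the number of restaurants approaches `n_days`.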

### Comparison between strategies

In [6]:
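def get_expected_reward(strategy, means, std_devs):
    # Monte Carlo estimate: average the total reward over n_iterations runs
    total_rewards = []
    for _ in range(n_iterations):
        total_rewards.append(strategy(means, std_devs))
    # int() truncates the average to a whole number for display
    return int(np.mean(total_rewards))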

print('Explore Only:', get_expected_reward(explore_only, means, std_devs))
print('Exploit Only:', get_expected_reward(exploit_only, means, std_devs))
print('Epsilon Greedy:', get_expected_reward(eps_greedy, means, std_devs))
print('UCB1:', get_expected_reward(ucb1_strategy, means, std_devs))

Explore Only: 2252
Exploit Only: 2274
Epsilon Greedy: 2881
UCB1: 2861
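
The gap between Epsilon Greedy (2881) and UCB1 (2861) is small, so it helps to know how noisy an average over 100 runs is. The sketch below reports a standard error next to the mean. The helper `summarize` and the toy strategy `always_first_restaurant` are hypothetical names introduced for illustration; they are not part of the notebook.

```python
import numpy as np

def summarize(strategy, n_iterations=100):
    """Return the mean total reward and its standard error over repeated runs."""
    totals = np.array([strategy() for _ in range(n_iterations)])
    mean = totals.mean()
    std_err = totals.std(ddof=1) / np.sqrt(n_iterations)
    return mean, std_err

rng = np.random.default_rng(0)

def always_first_restaurant():
    # Toy strategy: 300 days at a restaurant with mean 10 and std 5.
    return rng.normal(10, 5, 300).sum()

mean, std_err = summarize(always_first_restaurant)
print(f"{mean:.0f} +/- {std_err:.1f}")
```

Two strategies whose means differ by less than a couple of standard errors cannot be reliably ranked from a single batch of runs.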


#### Points to note:
1. Epsilon Greedy, which mixes exploration and exploitation, beats both explore-only and exploit-only **given the parameters** (2881 versus 2252 and 2274). This will not always be the case, so one should evaluate all strategies for the parameters at hand.
2. UCB1 (2861) does not stand out from Epsilon Greedy here. One possible reason is the low number of restaurants. Another is that `ucb1_strategy` as written keeps the same 10% random exploration, and its bonus term `sqrt(2 * log(t) / n)` is small compared with reward standard deviations of 4 to 25, so the two strategies make nearly the same choices.

Let's consider a more general scenario that varies the number of restaurants/choices and the spread of their rewards.

In [7]:
n_days = 300
mean_factor = 3
num_choices = [3, 10, 100]
deviation_ratios = [0.1, 0.5, 1]
n_iterations = 100
eps = 0.1

In [8]:
def get_mean_std(mean_factor, n_choices, deviation_ratio):
    means = mean_factor * np.arange(1, n_choices+1)
    std_devs = means * deviation_ratio
    return means, std_devs

def get_rewards(means, std_devs):
    val1 = get_expected_reward(explore_only, means, std_devs)
    val2 = get_expected_reward(exploit_only, means, std_devs)
    val3 = get_expected_reward(eps_greedy, means, std_devs)
    val4 = get_expected_reward(ucb1_strategy, means, std_devs)
    return [val1, val2, val3, val4]
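
To make the parameter grid concrete, here is what `get_mean_std` produces for one small configuration. The function is repeated so the snippet runs on its own; the argument values are chosen for illustration.

```python
import numpy as np

def get_mean_std(mean_factor, n_choices, deviation_ratio):
    # Means grow linearly with the restaurant index; each standard
    # deviation is a fixed fraction of the corresponding mean.
    means = mean_factor * np.arange(1, n_choices + 1)
    std_devs = means * deviation_ratio
    return means, std_devs

means, std_devs = get_mean_std(mean_factor=3, n_choices=3, deviation_ratio=0.5)
print(means)     # [3 6 9]
print(std_devs)  # [1.5 3.  4.5]
```

The last restaurant always has the highest mean, and also the largest absolute spread.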

In [9]:
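vals = []
# Note: the loop variable n_choices overwrites the global of the same name,
# which the strategy functions read, so each run uses the right arm count.
for n_choices in tqdm(num_choices):
    for deviation_ratio in deviation_ratios:
        means, std_devs = get_mean_std(mean_factor, n_choices, deviation_ratio)
        val = [n_choices] + [deviation_ratio] + get_rewards(means, std_devs)
        vals.append(val)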

100%|██████████| 3/3 [01:59<00:00, 39.81s/it]


In [10]:
cols = ['n_choices', 'deviation_ratio', 'explore', 'exploit', 'eps_greedy', 'ucb1_strategy']
data = np.array(vals)
df = pd.DataFrame(data, columns = cols)
cols.remove('deviation_ratio')
df[cols] = df[cols].astype('int')
df

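| | n_choices | deviation_ratio | explore | exploit | eps_greedy | ucb1_strategy |
|---|---|---|---|---|---|---|
| 0 | 3 | 0.1 | 1801 | 2692 | 2488 | 2497 |
| 1 | 3 | 0.5 | 1801 | 2374 | 2433 | 2462 |
| 2 | 3 | 1.0 | 1796 | 2235 | 2338 | 2338 |
| 3 | 10 | 0.1 | 4948 | 8708 | 7779 | 7801 |
| 4 | 10 | 0.5 | 4948 | 7675 | 7579 | 7539 |
| 5 | 10 | 1.0 | 4989 | 7462 | 7145 | 7152 |
| 6 | 100 | 0.1 | 45489 | 73203 | 73998 | 75458 |
| 7 | 100 | 0.5 | 45292 | 70288 | 72056 | 71434 |
| 8 | 100 | 1.0 | 45897 | 67521 | 68164 | 68930 |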
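
One way to read the results table is to ask which strategy scored highest in each row. The sketch below rebuilds the table from the values reported above and uses `DataFrame.idxmax`; the column name `best` is introduced here for illustration.

```python
import pandas as pd

# Values copied from the results table in this notebook.
df = pd.DataFrame({
    'n_choices':       [3, 3, 3, 10, 10, 10, 100, 100, 100],
    'deviation_ratio': [0.1, 0.5, 1.0] * 3,
    'explore':       [1801, 1801, 1796, 4948, 4948, 4989, 45489, 45292, 45897],
    'exploit':       [2692, 2374, 2235, 8708, 7675, 7462, 73203, 70288, 67521],
    'eps_greedy':    [2488, 2433, 2338, 7779, 7579, 7145, 73998, 72056, 68164],
    'ucb1_strategy': [2497, 2462, 2338, 7801, 7539, 7152, 75458, 71434, 68930],
})

strategies = ['explore', 'exploit', 'eps_greedy', 'ucb1_strategy']
# idxmax returns the first column holding the row maximum, so the tie in
# row 2 (eps_greedy and ucb1_strategy both 2338) is reported as eps_greedy.
df['best'] = df[strategies].idxmax(axis=1)
print(df[['n_choices', 'deviation_ratio', 'best']])
```

Since each entry is an average over 100 noisy runs, narrow wins in this ranking should be treated as tentative.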


### Observations:

1. Exploration only gives the lowest reward in every configuration tested. In other words, pure randomization is not really a strategy.

In [11]:
df[df['deviation_ratio']==0.1]

| | n_choices | deviation_ratio | explore | exploit | eps_greedy | ucb1_strategy |
|---|---|---|---|---|---|---|
| 0 | 3 | 0.1 | 1801 | 2692 | 2488 | 2497 |
| 3 | 10 | 0.1 | 4948 | 8708 | 7779 | 7801 |
| 6 | 100 | 0.1 | 45489 | 73203 | 73998 | 75458 |


In [12]:
df[df['deviation_ratio']==1]

| | n_choices | deviation_ratio | explore | exploit | eps_greedy | ucb1_strategy |
|---|---|---|---|---|---|---|
| 2 | 3 | 1.0 | 1796 | 2235 | 2338 | 2338 |
| 5 | 10 | 1.0 | 4989 | 7462 | 7145 | 7152 |
| 8 | 100 | 1.0 | 45897 | 67521 | 68164 | 68930 |


### Observations:

1. If the deviation ratio is low (0.1), happiness varies little from visit to visit, so one visit per restaurant is enough to identify the best one and little further exploration is needed. The exploitation strategy dominates for 3 and 10 choices.
2. However, a high number of choices can offset a low deviation ratio: with 100 choices, the mixed strategies (Epsilon Greedy and UCB1) come out ahead even at a deviation ratio of 0.1.
3. If the deviation ratio is high (1.0), happiness varies a lot and a single visit is a poor guide to a restaurant's average, so exploration is necessary. The mixed strategies win for 3 and 100 choices, although exploitation still leads for 10 choices in this run.

In [13]:
df[df['n_choices']==100]

| | n_choices | deviation_ratio | explore | exploit | eps_greedy | ucb1_strategy |
|---|---|---|---|---|---|---|
| 6 | 100 | 0.1 | 45489 | 73203 | 73998 | 75458 |
| 7 | 100 | 0.5 | 45292 | 70288 | 72056 | 71434 |
| 8 | 100 | 1.0 | 45897 | 67521 | 68164 | 68930 |


### Observations:

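1. As the number of choices increases, the exploitation strategy no longer dominates because more exploration is needed: with 100 choices, both Epsilon Greedy and UCB1 beat it at every deviation ratio.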