# Maximum Likelihood Estimation

MLE finds the parameter value that makes the observed data most likely. It does not give the most probable parameter given the data; for that we need the Bayesian approach. But first, it is good to practice with MLE.

Every single bandit pull can be thought of as a Bernoulli experiment that succeeds with probability \\(p(t)\\) and fails with probability \\(1-p(t)\\). Here \\(t\\) denotes time, since the bandit's assigned probability decays each time it is pulled. This can be simplified as:

\\[p(t) = p(n), \\]

where n is the number of times the bandit has been pulled so far, or:

\\[p(n) = p_0 d^{n} ,\\]

where \\(p_0\\) is the initial success probability and \\(d\\) is the decay rate (0.97).

In our case, the number of times each bandit has been pulled is known, but the initial success probabilities of the bandits are unknown and must be estimated. With a good estimate of each \\(p_0\\), we can calculate the success probability of every bandit in the next round and choose the most probable one as the next action.
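As a minimal sketch of that last step, assume some estimates `p0_hat` and pull counts `n_pulls` are already available (both arrays are made up here purely for illustration). The next-round success probabilities then follow directly from \\(p(n) = p_0 d^n\\):

```python
import numpy as np

d = 0.97  # decay rate from the text

# Hypothetical values for three bandits, for illustration only
p0_hat = np.array([0.80, 0.60, 0.90])   # estimated initial success probabilities
n_pulls = np.array([10, 0, 20])         # total pulls so far (by either player)

# p(n) = p0 * d**n for each bandit
next_prob = p0_hat * d ** n_pulls
best_bandit = int(np.argmax(next_prob))

print(next_prob.round(3))
print(best_bandit)
```

In this made-up case the untouched bandit is chosen even though its estimated \\(p_0\\) is the lowest, because the other two have already decayed.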

In a normal Bernoulli experiment with no decay, you would repeat the experiment and calculate the \\(p\\) that has the maximum likelihood. For example, if you repeat it 6 times with 4 successes and 2 failures, the likelihood is

\\[p^4 \cdot (1-p)^2\\]

and you then identify the \\(p\\) that maximizes it (in this case \\(p = 4/6 \approx 0.667\\)).
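This maximizer can also be found by hand. Taking the logarithm of the likelihood (which does not move the maximum) and setting its derivative to zero gives:

\\[\log L(p) = 4\log p + 2\log(1-p)\\]

\\[\frac{d}{dp}\log L(p) = \frac{4}{p} - \frac{2}{1-p} = 0 \;\Rightarrow\; 4(1-p) = 2p \;\Rightarrow\; p = \frac{4}{6} \approx 0.667\\]

More generally, \\(k\\) successes in \\(m\\) trials without decay give the estimate \\(k/m\\).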

In [None]:
import pylab as pl

f = lambda p: p**4 * (1-p)**2
x = pl.linspace(0,1,100)
pl.plot(x, f(x))
pl.plot(0.667, f(0.667), 'r*')
pl.xlabel('p')
pl.ylabel('Likelihood');

In our case we have two problems: first the decay, and second the unknown rewards from the opponent's pulls. After the bandit has been pulled 6 times, the rewards might look like this:

\\[ 1 \quad 0 \quad X \quad 1 \quad X \quad 0 \\]

where X denotes an unknown reward. The success and failure probabilities have also changed over time:

|experiment | 1 | 0 | X | 1 | X | 0 |
|:----------|:---:|:---:|:---:|:---:|:---:|:---:|
|**success prob** | \\(p\\) | \\(pd\\) | \\(pd^2\\) | \\(pd^3\\) | \\(pd^4\\) | \\(pd^5\\) |
|**fail prob** |\\((1-p)\\)| \\((1-pd)\\) | \\((1-pd^2)\\) | \\((1-pd^3)\\) | \\((1-pd^4)\\) | \\((1-pd^5)\\) |
| |<img width=60/>|<img width=60/>|<img width=60/>|<img width=60/>|<img width=60/>|<img width=60/>|

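The likelihood that we can infer from this experiment is the product of the factors for the observed pulls only. The pulls with unknown rewards contribute no factor, but they still advance the decay exponent:

\\[p \cdot (1-pd) \cdot pd^3 \cdot (1-pd^5)\\]

which is maximized at \\(p \approx 0.545\\).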

In [None]:
import pylab as pl

d = 0.97
f = lambda p: p * (1-p*d) * p*d**3 * (1-p*d**5)
x = pl.linspace(0,1,100)
pl.plot(x, f(x))
pl.plot(0.545, f(0.545), 'r*')
pl.xlabel('p')
pl.ylabel('Likelihood');
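The same calculation can be written for an arbitrary pull history. The sketch below is one possible way to do it and is not the code used by the agent: the helper name `log_likelihood` and the encoding of unknown rewards as `None` are assumptions made for illustration. It works with the log-likelihood, which has the same maximizer but avoids very small products on long histories.

```python
import numpy as np

def log_likelihood(p, outcomes, d=0.97):
    """Log-likelihood of the initial probability p for a pull history.

    outcomes[n] is 1 (success), 0 (failure) or None (reward unknown,
    e.g. the opponent pulled). Every pull decays the bandit, but only
    observed rewards contribute a term.
    """
    p = np.asarray(p, dtype=float)
    total = np.zeros_like(p)
    for n, outcome in enumerate(outcomes):
        prob = p * d ** n            # success probability at pull n
        if outcome == 1:
            total += np.log(prob)
        elif outcome == 0:
            total += np.log(1 - prob)
        # None: unknown reward, no term added
    return total

history = [1, 0, None, 1, None, 0]       # the "1 0 X 1 X 0" example
grid = np.linspace(0.001, 0.999, 999)    # avoid log(0) at the endpoints
p_hat = grid[np.argmax(log_likelihood(grid, history))]
print(round(float(p_hat), 3))
```

On this history the grid search lands close to the value marked in the plot above.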

## Agent

The following agent estimates each bandit's initial probability by maximum likelihood and then calculates the success probability of the next pull for every bandit. The estimates are noisy in the beginning but improve as more pulls are observed.

In [None]:
%%writefile ML_agent.py

import numpy as np
from collections import defaultdict

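def sigmoid(x, x0=10):
    # logistic curve centred at x0: 0.5 at x == x0, approaching 1 for x >> x0
    return 1 / (1 + np.exp(x0-x))

def maximize(f, bound=(0, 1), res=0.1, tol=1e-4):
    # recursive grid search: evaluate f on a grid of spacing `res`, then zoom
    # into the neighbourhood of the best point with a 10x finer grid
    x = np.linspace(bound[0], bound[1], int((bound[1]-bound[0])/res)+1)
    a = x[f(x).argmax()]
    if res < tol:
        return a
    else:
        return maximize(f, [max(bound[0], a-res), min(bound[1], a+res)], res/10, tol)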


num_activated, num_activated_byme, LL, total_rewards, my_last_action, p_estimate = (None, )* 6

def agent(observation, configuration):
    global num_activated, num_activated_byme, LL, total_rewards, my_last_action, p_estimate
    
    N = configuration['banditCount']
    d = configuration['decayRate']
    
    # initialization
    if num_activated is None:
        num_activated = np.zeros(N)
        num_activated_byme = np.zeros(N)
        LL = defaultdict(lambda: '1')
        total_rewards = 0
        my_last_action = -1
        p_estimate = np.zeros(N) + 0.5

    
    # update
    if observation['lastActions']:
        num_activated[observation['lastActions'][0]] += 1
        num_activated[observation['lastActions'][1]] += 1
        num_activated_byme[my_last_action] += 1
        last_reward = observation['reward'] - total_rewards
        total_rewards = observation['reward']
        # append this pull's likelihood factor: p*d^n on success, (1 - p*d^n) on failure
        decay = d ** (num_activated[my_last_action]-1)
        LL[my_last_action] += f' * (p* {decay})' if last_reward else f' * (1 - p* {decay})'
    
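    # decision
    # exploration: pull every bandit at least twice before trusting any estimate
    if min(num_activated_byme) < 2:
        my_last_action = int(num_activated_byme.argmin())
        return my_last_action
    
    # re-estimate p0 only for the bandits pulled in the last round
    next_prob_estimate = np.zeros(N)
    for b in observation['lastActions']:
        f = lambda p: eval(LL[b])  # likelihood of bandit b as a function of p
        best_p = maximize(f, bound=[0,1])
        # shrink towards 0.5 while we have few pulls of our own on this bandit
        r = sigmoid(num_activated_byme[b], 3)
        p_estimate[b] = best_p * r + 0.5 * (1-r)
    # estimated success probability of the next pull: p0 * d^n
    for b in range(N):
        next_prob_estimate[b] = p_estimate[b] * d ** num_activated[b]
    
    # exploit: pull the bandit with the highest estimated probability
    my_last_action = int(np.argmax(next_prob_estimate))
    return my_last_action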
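
The agent does not fully trust the raw maximum-likelihood value when it has pulled a bandit only a few times: it blends `best_p` with the initial guess 0.5 using the weight `r = sigmoid(n, 3)`. The short sketch below copies the `sigmoid` definition from the agent to show how that weight grows with the number of own pulls `n`; the value `best_p = 1.0` is a made-up example.

```python
import numpy as np

def sigmoid(x, x0=10):
    # Same definition as in ML_agent.py
    return 1 / (1 + np.exp(x0 - x))

best_p = 1.0  # hypothetical MLE, e.g. after seeing only successes
for n in range(2, 9, 2):
    r = sigmoid(n, 3)                       # weight given to the MLE
    blended = best_p * r + 0.5 * (1 - r)    # estimate stored by the agent
    print(n, round(float(r), 3), round(float(blended), 3))
```

With three own pulls the MLE and the initial guess are weighted equally; after about eight pulls the stored estimate is almost entirely the MLE.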

In [None]:
%%writefile random_agent.py

import random

def random_agent(observation, configuration):
    return random.randrange(configuration.banditCount)

In [None]:
%%writefile ordinal_agent.py

last = -1
def ordinal_agent(observation, configuration):
    global last
    
    last += 1
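    # cycle through the bandits in order, using the configured count instead of a hard-coded 100
    return last % configuration.banditCount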

In [None]:
!pip install kaggle-environments --upgrade

## The Agent against the random Agent

In [None]:
from kaggle_environments import make

env = make("mab", debug=True)

env.run(["ML_agent.py", "random_agent.py"])
env.render(mode="ipython", width=800, height=800)

## A way to evaluate Agents before submission

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np


def run(agent1, agent2):
    env = make("mab", debug=True)
    env.run([agent1, agent2])
    return env.state[0]['reward'] - env.state[1]['reward']


def evaluate(agents, episodes=5):
    result = pd.DataFrame(columns=agents, index=agents)
    for i, agent1 in enumerate(agents):
        for j, agent2 in enumerate(agents[:i+1]):
            scores = [run(agent1, agent2) for _ in range(episodes)]
            # .at sets a single cell, so the whole list is stored as one object
            result.at[agent1, agent2] = scores
            result.at[agent2, agent1] = [-score for score in scores]
    return result

result = evaluate(['random_agent.py', 'ML_agent.py', 'ordinal_agent.py'], 2)
sns.heatmap(result.applymap(np.nanmax), cbar_kws={'label':'Maximum Lead'});

In [None]:
sns.heatmap(result.applymap(np.nanmin), cbar_kws={'label':'Minimum Lead'});

In [None]:
sns.heatmap(result.applymap(np.nanmean), cbar_kws={'label':'Average Lead'});

In [None]:
result