# Rationale

Version 1 of ACS (Ant Colony System) for MAB (multi-armed bandits) was way too unsophisticated.

The new idea is to use a directed graph (an asymmetric TSP instance) in which each city's outgoing edges each represent an arm. Hence, for a $k$-armed MAB we have $k+1$ cities, so every city has exactly $k$ outgoing edges, one per arm.

We also have to make sure that each set of edges sharing the same $\theta_i$ forms a tour, so that the shortest tour identifies the arm with the smallest $\theta_i$. That is, only $k$ of the $k!$ possible directed tours on $k+1$ cities actually represent arms.
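A quick brute-force check of this counting argument for small $k$ (a sketch; the helper names are hypothetical). With $n = k+1$ cities, a complete directed graph has $(n-1)! = k!$ distinct directed tours. If arm $j$'s edges are $u \to (u + j) \bmod n$, which is the labelling `generate_arm_tours` in the code section uses, they form a single tour exactly when $\gcd(j, n) = 1$.

```python
from itertools import permutations
from math import factorial, gcd


def count_directed_tours(n):
    """Count distinct directed Hamiltonian cycles on n cities by brute force."""
    tours = set()
    for perm in permutations(range(n)):
        i = perm.index(0)
        tours.add(perm[i:] + perm[:i])  # rotate so every tour starts at city 0
    return len(tours)


def arm_edges_form_tour(j, n):
    """True if following u -> (u + j) % n from city 0 visits all n cities."""
    seen, u = set(), 0
    while u not in seen:
        seen.add(u)
        u = (u + j) % n
    return len(seen) == n


for k in range(2, 7):
    n = k + 1
    assert count_directed_tours(n) == factorial(k)  # k! tours on k + 1 cities
    for j in range(1, k + 1):
        # the arm-j edges close into one tour iff j and n are coprime
        assert arm_edges_form_tour(j, n) == (gcd(j, n) == 1)
```

With `K = 100` arms there are 101 cities, and since 101 is prime, every arm's edges form a valid tour under this labelling.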

Denote the true thetas $\left\{ \theta_i = \frac{\alpha_i}{\beta_i} \right\}_{i=1}^k$.

The cost of each edge leading out of an arbitrary node $u$ is
$$
\delta(u,v_i) = \theta_i
$$
(note that the subscript $i$ indexes the arm, not the city: $v_i$ is whichever end node the arm-$i$ edge out of $u$ leads to, so which city it is depends on $u$)

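Then, using a $\Gamma$ distribution with shape $\alpha_i$ and rate $\beta_i$, and choosing the rate $\beta_i = 10 e^{-\theta_i}$,
$$
\mathbb{E} [\delta(u, v_i)] = \frac{\alpha_i}{\beta_i} = \theta_i
\implies \alpha_i = \theta_i \beta_i = 10\, \theta_i e^{- \theta_i}
$$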
such that
$$
\eta (u, v_i) \sim \Gamma ( \alpha_i, \beta_i )
$$
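A minimal sketch of this parameterisation with NumPy, assuming the shape/rate convention above. NumPy's `Generator.gamma` takes a shape and a *scale*, so the scale passed in is $1/\beta_i$. The function names `gamma_params` and `sample_edge_costs` are hypothetical.

```python
import numpy as np


def gamma_params(theta):
    """Shape alpha_i and rate beta_i such that alpha_i / beta_i == theta_i."""
    theta = np.asarray(theta, dtype=float)
    beta = 10.0 * np.exp(-theta)  # chosen rate: beta_i = 10 exp(-theta_i)
    alpha = theta * beta          # shape: alpha_i = theta_i * beta_i
    return alpha, beta


def sample_edge_costs(theta, rng, size=None):
    """Draw eta(u, v_i) ~ Gamma(alpha_i, beta_i) for every arm i."""
    alpha, beta = gamma_params(theta)
    return rng.gamma(shape=alpha, scale=1.0 / beta, size=size)  # scale = 1 / rate


rng = np.random.default_rng(0)
theta = rng.uniform(size=5)
draws = sample_edge_costs(theta, rng, size=(200_000, 5))
print(np.round(draws.mean(axis=0) - theta, 3))  # sample means should be close to theta
```

The variance under this choice is $\alpha_i / \beta_i^2 = \theta_i e^{\theta_i} / 10$, so for $\theta_i \in [0, 1)$ the draws stay fairly tight around their means.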

Underneath, this is just a Gamma bandit: choosing an edge $\implies$ pulling the corresponding arm. That gives us a valid way to quantify regret.
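Because every edge choice is an arm pull, regret can be computed straight from the true $\theta_i$. A minimal sketch, assuming the edge values are costs (lower $\theta$ is better, consistent with the shortest tour marking the best arm) and that pseudo-regret is measured against $\min_i \theta_i$; `cumulative_regret` is a hypothetical helper, and arms are 0-indexed here, unlike the 1..k labels in `generate_arm_tours`.

```python
import numpy as np


def cumulative_regret(true_thetas, pulled_arms):
    """Cumulative pseudo-regret of a sequence of pulls when lower theta is better.

    true_thetas: shape (k,), mean cost of each arm (0-indexed).
    pulled_arms: one arm index per round.
    """
    true_thetas = np.asarray(true_thetas, dtype=float)
    gaps = true_thetas[np.asarray(pulled_arms)] - true_thetas.min()  # per-round gap
    return np.cumsum(gaps)


thetas = np.array([0.5, 0.2, 0.9])
print(cumulative_regret(thetas, [0, 1, 2, 1]))
```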


# Les Codez

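```python
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook
# 'seaborn-poster' is called 'seaborn-v0_8-poster' in newer Matplotlib releases
poster = 'seaborn-v0_8-poster' if 'seaborn-v0_8-poster' in plt.style.available else 'seaborn-poster'
plt.style.use(['ggplot', poster])
```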

## MAB

```python
K = 100  # number of arms; the graph has K + 1 cities
```

## ACS

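```python
# ACS parameters; comments give their usual roles in Ant Colony System
ANTS = 10         # ants per iteration
ITERATIONS = 100
Q_0 = 0.9         # probability of exploiting the best edge instead of sampling
RHO = 0.1         # local pheromone decay
ALPHA = 0.1       # global pheromone decay
BETA = 2          # weight of the heuristic relative to pheromone
```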
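For reference, these names match the standard Ant Colony System of Dorigo and Gambardella, where `Q_0` and `BETA` drive the pseudo-random proportional rule for picking the next edge. The sketch below is that textbook rule, not necessarily how this notebook will use it; `choose_next_city`, the pheromone matrix `tau`, and the heuristic matrix `eta` (conventionally the inverse edge cost) are assumed inputs.

```python
import numpy as np


def choose_next_city(current, unvisited, tau, eta, rng, q_0=0.9, beta=2.0):
    """ACS pseudo-random proportional rule.

    With probability q_0, exploit: take the edge maximising tau * eta**beta.
    Otherwise explore: sample an edge with probability proportional to tau * eta**beta.
    """
    unvisited = np.asarray(unvisited)
    scores = tau[current, unvisited] * eta[current, unvisited] ** beta
    if rng.random() < q_0:
        return int(unvisited[np.argmax(scores)])
    return int(rng.choice(unvisited, p=scores / scores.sum()))


rng = np.random.default_rng(1)
n = 4
tau = np.ones((n, n))                            # uniform initial pheromone
eta = 1.0 / rng.uniform(0.1, 1.0, size=(n, n))   # inverse costs as heuristic
print(choose_next_city(0, [1, 2, 3], tau, eta, rng))
```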

```python
TRUE_THETAS = np.random.uniform(size=K)  # theta_i ~ Uniform[0, 1), one per arm
```

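```python
def generate_arm_tours(k):
    """
    k: number of arms
    k+1: number of cities

    Returns a (k+1, k+1) matrix U where U[u, v] is the arm label (1..k) of the
    edge u -> v, with -1 on the diagonal (no self-loops). Arm j's edges are
    u -> (u + j) mod (k+1); they form a single tour only if gcd(j, k+1) == 1,
    which holds for every arm when k+1 is prime (e.g. k = 100).
    """
    U = -np.eye(k+1)
    for i in range(k+1):
        # row i: cities i+1, ..., i+k (wrapping around) get arm labels 1, ..., k
        np.put(U[i, :], range(i+1, i+1+k), range(1, k+1), mode='wrap')
    return U
```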

```python
def generate_starts():
    # one random start city (0..K) per ant; the graph has K + 1 cities
    return np.random.choice(K + 1, size=ANTS)
```
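In the same standard ACS (using the 1997 paper's naming), `RHO` is typically the local-update decay and `ALPHA` the global-update decay. A sketch of both updates, assuming a pheromone matrix indexed by city pairs, a user-chosen initial level `tau_0`, and a global update that deposits $1/L$ along the best tour of length $L$; `local_update` and `global_update` are hypothetical names, and none of these details are fixed by the notebook so far.

```python
import numpy as np


def local_update(tau, u, v, rho=0.1, tau_0=0.01):
    """Applied as an ant crosses edge (u, v): pull its pheromone back toward tau_0."""
    tau[u, v] = (1.0 - rho) * tau[u, v] + rho * tau_0


def global_update(tau, best_tour, best_length, alpha=0.1):
    """Evaporate on, and deposit 1 / best_length along, the best tour's edges only.

    best_tour: list of city indices; the closing edge back to the start is included.
    """
    for u, v in zip(best_tour, best_tour[1:] + best_tour[:1]):
        tau[u, v] = (1.0 - alpha) * tau[u, v] + alpha / best_length


tau = np.full((3, 3), 0.5)
local_update(tau, 0, 1)                         # an ant walks edge 0 -> 1
global_update(tau, [0, 1, 2], best_length=2.0)  # reinforces 0->1, 1->2, 2->0
```

Local updates make edges that ants have just used less attractive, pushing the colony to explore other arms, while the global update concentrates pheromone on the best tour found so far.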