<a href="https://colab.research.google.com/github/vijaygwu/classideas/blob/main/BanditThompson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np

class ThompsonSampling:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        # Beta(1, 1) prior for every arm: alpha = 1 + number of rewards observed,
        # beta = 1 + number of pulls that returned no reward.
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)

    def select_arm(self):
        """Return the index of the next arm to pull."""
        # Draw one sample per arm from its Beta posterior and play the arm with the largest draw.
        theta_samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(theta_samples))

    def update(self, chosen_arm, reward):
        """Update the alpha and beta parameters based on the received reward."""
        if reward == 1:
            self.alpha[chosen_arm] += 1
        else:
            self.beta[chosen_arm] += 1

# Example usage:
n_arms = 3
bandit = ThompsonSampling(n_arms)

for _ in range(1000):
    chosen_arm = bandit.select_arm()
    # In a real application you would pull the chosen arm and observe its reward.
    # Here we simulate a Bernoulli reward: arm 2 pays out with probability 0.8,
    # arms 0 and 1 with probability 0.2.
    p_success = 0.8 if chosen_arm == 2 else 0.2
    reward = np.random.choice([0, 1], p=[1 - p_success, p_success])
    bandit.update(chosen_arm, reward)

print(bandit.alpha)
print(bandit.beta)


[  7.   2. 810.]
[  8.   5. 174.]


In the example above:

We initialize alpha and beta to ones, which corresponds to a uniform Beta(1, 1) prior on each arm's probability of success.
To choose an arm, we draw a theta value for every arm from the Beta distribution parameterized by that arm's alpha and beta, then select the arm with the highest sampled value.
After observing a reward, we increment alpha (if reward = 1) or beta (if reward = 0) for the chosen arm. In the run shown, almost all pulls end up on arm 2, the arm with the highest simulated payout probability.
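
As a rough check on the run above, the printed alpha and beta values can be turned into a posterior mean and a pull count per arm. This is only a sketch: the arrays are copied by hand from the output shown earlier, and the names `posterior_mean` and `pulls` are introduced here for illustration. It relies on two facts: the mean of a Beta(alpha, beta) distribution is alpha / (alpha + beta), and every update adds exactly 1 to either alpha or beta, both of which start at 1.

```python
import numpy as np

# Final parameters copied from the printed output of the run above.
alpha = np.array([7.0, 2.0, 810.0])
beta = np.array([8.0, 5.0, 174.0])

# Mean of a Beta(alpha, beta) distribution: the current estimate of each arm's success rate.
posterior_mean = alpha / (alpha + beta)

# Both parameters start at 1 and each pull adds 1 to one of them,
# so the number of pulls per arm is alpha + beta - 2.
pulls = (alpha + beta - 2).astype(int)

for arm in range(len(alpha)):
    print(f"arm {arm}: pulls={pulls[arm]}, posterior mean={posterior_mean[arm]:.3f}")
```

For this particular run, arm 2 accounts for 982 of the 1000 pulls and its posterior mean is about 0.82, close to the simulated success probability of 0.8. The estimates for arms 0 and 1 rest on only 13 and 5 pulls, so they remain imprecise, which is expected: Thompson sampling stops spending pulls on arms that are unlikely to be the best.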