<a href="https://colab.research.google.com/github/vijaygwu/classideas/blob/main/BanditUCB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import numpy as np

class UCB:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.counts = np.zeros(n_arms)  # Number of times each arm has been pulled
        self.values = np.zeros(n_arms)  # Average value (reward) received from each arm

    def select_arm(self):
        """Return the index of the next arm to pull."""
        n_total = self.counts.sum()  # Total number of pulls across all arms
        if n_total == 0:
            return np.random.randint(self.n_arms)

        # Compute the UCB values for all arms
        ucb_values = [self.values[i] + np.sqrt((2 * np.log(n_total)) / self.counts[i])
                      if self.counts[i] > 0 else float('inf') for i in range(self.n_arms)]

        return np.argmax(ucb_values)

    def update(self, chosen_arm, reward):
        """Update the counts and values based on the received reward."""
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        # Running average of the values
        self.values[chosen_arm] = ((n - 1) / n) * value + (1 / n) * reward

# Example usage:
n_arms = 3
bandit = UCB(n_arms)

for _ in range(1000):
    chosen_arm = bandit.select_arm()
    # Here, you'd pull the chosen arm and observe a reward.
    # For this example, simulate a 0/1 reward: arm 2 pays out with
    # probability 0.8, while arms 0 and 1 pay out with probability 0.2.
    p_success = 0.8 if chosen_arm == 2 else 0.2
    reward = np.random.choice([0, 1], p=[1 - p_success, p_success])
    bandit.update(chosen_arm, reward)

print(bandit.values)


[0.30769231 0.16       0.79059829]


In the above implementation:

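The select_arm method computes an upper confidence bound (UCB) for each arm: the arm's average reward plus an exploration bonus of sqrt(2 ln N / n_i), where N is the total number of pulls so far and n_i is the number of times arm i has been pulled. An arm that hasn't been pulled yet gets a UCB of infinity, so every arm is tried at least once before the bounds are compared.
The update method increments the pull count of the selected arm and folds the observed reward into that arm's running average.
The printed values are the estimated average rewards per arm. Because the simulation is not seeded, they vary from run to run; in the run shown, arm 2 (true payout probability 0.8) ends with an estimate of about 0.79.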
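
As a rough illustration of the two points above, the sketch below recomputes the UCB scores and the running average outside the class. The helper names ucb_scores and running_mean, and the values and counts used, are hypothetical and chosen only for this example; they are not part of the notebook.

```python
import numpy as np

def ucb_scores(values, counts):
    """Hypothetical helper: average reward plus sqrt(2 ln N / n_i) per arm.

    Arms that have never been pulled keep a score of +inf.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    n_total = counts.sum()
    scores = np.full(values.shape, np.inf)
    pulled = counts > 0
    if pulled.any():
        bonus = np.sqrt(2 * np.log(n_total) / counts[pulled])
        scores[pulled] = values[pulled] + bonus
    return scores

def running_mean(rewards):
    """Hypothetical helper: apply the same incremental update as UCB.update."""
    mean = 0.0
    for n, reward in enumerate(rewards, start=1):
        mean = ((n - 1) / n) * mean + (1 / n) * reward
    return mean

# Made-up state after 100 pulls: arm 2 looks best on average,
# but arm 0 has been pulled only 5 times.
values = [0.2, 0.2, 0.8]
counts = [5, 15, 80]
scores = ucb_scores(values, counts)
print(scores)
print(int(np.argmax(scores)))

# The incremental update reproduces the ordinary arithmetic mean.
print(running_mean([1, 0, 1, 1]), np.mean([1, 0, 1, 1]))
```

With these numbers the rarely pulled arm 0 gets the largest score despite its low average, because its exploration bonus is large. That is how the rule keeps revisiting uncertain arms.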
