# Multi-Armed Bandit
----

### Concept
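- A multi-armed bandit (MAB) is a decision-making mechanism that is used especially in S&R (Search and Recommendation) systems.
- The core idea of MAB is to balance exploration (trying options whose reward is still uncertain) and exploitation (choosing the option with the best estimated reward so far).
- In this chapter, we study the epsilon-greedy MAB algorithm.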

### Terms
- $ A_t $ : Action. The choice made by the user in the system at time $ t $.
- $ R_t $ : Reward. The result of the action in the system.
- $ Q_t(a) $ : Estimated expected reward of action $ a $ at time $ t $.

$$ Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} $$

- The greedy algorithm always chooses the action with the highest estimated reward, $ A_t = \arg\max_a Q_t(a) $.
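
The sample-average estimate and the greedy choice can be sketched in a few lines. The arm names and reward histories here are hypothetical values chosen only for illustration, and `sample_average` is a helper introduced for this sketch.

```python
def sample_average(rewards):
    """Return the mean of the observed rewards, or 0.0 if the arm was never taken."""
    return sum(rewards) / len(rewards) if rewards else 0.0


# Hypothetical reward history per arm (1 = click, 0 = no click).
history = {"a": [1, 0, 1], "b": [0, 0], "c": [1]}

# Q_t(a): sum of rewards for each arm divided by the number of times it was taken.
q = {arm: sample_average(rewards) for arm, rewards in history.items()}

# Greedy choice: the arm with the highest estimate.
greedy_arm = max(q, key=q.get)
print(q, greedy_arm)
```

Note that arm `c` wins here on the strength of a single observation, which hints at why a purely greedy rule can lock onto an arm too early.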

----
### Model Equation
- A purely greedy algorithm never explores. The epsilon-greedy method addresses this by choosing between greedy and random behavior probabilistically.
- The probability of random behavior is called $ \epsilon $.
- The greedy action (exploitation) is selected with probability $ 1 - \epsilon $, and a random action (exploration) is selected with probability $ \epsilon $. In other words, the best-known option is chosen with probability $ 1 - \epsilon $, and the remaining probability is spent on diversity.
- However, this method also has drawbacks:
    - Depending on the value of $ \epsilon $, some options may not receive enough observations.
    - Even after the optimal option is found, a fraction $ \epsilon $ of the choices is still made at random, which is wasteful from an optimization point of view.
- The code below implements epsilon-greedy.
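
Written as a formula, the selection rule described in this list is roughly the following, assuming the random branch draws uniformly over all actions:

$$ A_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{an action chosen uniformly at random} & \text{with probability } \epsilon \end{cases} $$

Because the random branch can also land on the greedy action, with $ k $ actions the greedy action is chosen with total probability $ 1 - \epsilon + \epsilon / k $.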

In [None]:
import random
import numpy as np


class EpsilonGreedy():
    def __init__(self, epsilon, counts, values):
        self.epsilon = epsilon
        self.counts = counts
        self.values = values

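    def initialize(self, n_arms):
        # Reset the pull counts and the value estimates.
        self.counts = np.zeros(n_arms)
        # Example initial estimates for 4 arms; float dtype keeps update() from truncating to integers.
        self.values = np.array([12, 31, 11, 22], dtype=float)
#         self.values = np.zeros(n_arms)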
    
    def select_arm(self):
        # Exploit (best current estimate) with probability 1 - epsilon, otherwise explore uniformly.
        if random.random() > self.epsilon:
            return np.argmax(self.values)
        else:
            return random.randrange(len(self.values))
        
    def update(self, chosen_arm, reward):
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        value = self.values[chosen_arm]
        new_value = (((n-1) / n) * value) + ((1 / n) * reward)  # update with a running average
        self.values[chosen_arm] = new_value
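
The running-average line in `update` follows from the sample-average definition of the estimate. As a short derivation, using notation introduced here for illustration, let $ R_i $ be the $ i $-th reward observed for one arm and $ Q_n $ the estimate after $ n $ rewards:

$$ Q_n = \frac{1}{n} \sum_{i=1}^{n} R_i = \frac{1}{n} \left( R_n + (n-1) Q_{n-1} \right) = \frac{n-1}{n} Q_{n-1} + \frac{1}{n} R_n = Q_{n-1} + \frac{1}{n} \left( R_n - Q_{n-1} \right) $$

The third expression is the form used in the code, with `value` as $ Q_{n-1} $ and `reward` as $ R_n $. For $ n = 1 $ the weight on the old value is zero, so an arm's initial value only influences selection until that arm is first updated.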
        
    

In [None]:
# The counts and values passed here are placeholders; initialize() overwrites both.
model = EpsilonGreedy(epsilon=0.5, counts=1, values=1)
model.initialize(n_arms=4)

In [None]:
model.select_arm()
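
To see selection and update working together, here is a minimal self-contained sketch that runs epsilon-greedy against simulated Bernoulli arms. The function name `run_epsilon_greedy`, the success probabilities, the step count, and the seed are all assumptions made for illustration. Unlike the cell above, it starts from zero estimates, and it is not meant as a definitive implementation.

```python
import numpy as np


def run_epsilon_greedy(true_probs, epsilon, n_steps, seed=0):
    """Simulate epsilon-greedy on Bernoulli arms; return pull counts and estimates."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    counts = np.zeros(n_arms, dtype=int)
    values = np.zeros(n_arms, dtype=float)

    for _ in range(n_steps):
        # Exploit with probability 1 - epsilon, otherwise explore uniformly.
        if rng.random() > epsilon:
            arm = int(np.argmax(values))
        else:
            arm = int(rng.integers(n_arms))

        # Simulated reward: 1.0 with the arm's (hypothetical) success probability, else 0.0.
        reward = float(rng.random() < true_probs[arm])

        # Incremental form of the running average.
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]

    return counts, values


counts, values = run_epsilon_greedy([0.2, 0.5, 0.8], epsilon=0.1, n_steps=5000)
print("pulls per arm:", counts)
print("estimated values:", values)
```

With a small $ \epsilon $, most pulls should go to the arm with the highest success probability, while the other arms keep receiving a steady trickle of exploratory pulls, which is the trade-off listed under the drawbacks.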