# Multi-armed bandit problem

### Task
**1. Can $Q_{t+1}(a)$ be computed incrementally (given only $Q_t(a)$ and the reward $r_{t+1}$ received for choosing action $a$)?**

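Let $Q_t(a) = \frac{\sum_k r_{k}}{c_t(a)}$, where the sum runs over the steps $k \le t$ at which action $a$ was chosen and $c_t(a)$ is the number of such steps. If action $a$ is chosen at step $t+1$, then $Q_{t+1}(a) =  \frac{\sum_k r_{k} \; + \; r_{t+1}}{c_t(a) + 1} = \frac{Q_{t}(a) \cdot c_t(a) \; + \; r_{t+1}}{c_t(a) + 1}$. So the answer is yes, provided that the counter $c_t(a)$ is stored alongside $Q_t(a)$.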
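
As a quick numerical check of the formula above, the following sketch applies the update one reward at a time and compares the result with the ordinary batch mean. The function name `incremental_mean` and the reward values are made up for illustration and are not part of the original notebook.

```python
def incremental_mean(q, count, reward):
    """Return Q_{t+1}(a) given Q_t(a), c_t(a) and the new reward r_{t+1}."""
    return (q * count + reward) / (count + 1)

rewards = [1.0, 0.0, 2.0, 5.0]  # illustrative rewards for a single action

q, count = 0.0, 0
for r in rewards:
    q = incremental_mean(q, count, r)
    count += 1  # c_t(a) has to be advanced together with Q_t(a)

batch_mean = sum(rewards) / len(rewards)
print(q, batch_mean)  # prints: 2.0 2.0
```

The same update can be written as $Q_{t+1}(a) = Q_t(a) + \frac{r_{t+1} - Q_t(a)}{c_t(a) + 1}$, which is algebraically identical and makes it explicit that each new reward shifts the estimate by a shrinking fraction of the error.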

### First, let's build our model:

import numpy as np

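class MAB_Model:
    """Environment: each action has a hidden mean reward drawn from N(0, 1)."""
    def __init__(self, actions):
        self.actions = actions  # number of available actions (arms)
        self.__reward_means = np.random.normal(size=actions)
    def get_reward(self, action):
        # The reward is sampled from N(mean of the chosen action, 1)
        return np.random.normal(self.__reward_means[action])
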
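class MAB_Strategy:
    """Base class for strategies: keeps a running reward estimate per action."""
    def __init__(self, actions):
        self.actions = actions
        self.__reward_means = np.zeros(actions)  # Q_t(a): current mean-reward estimates
        self.__action_uses = np.zeros(actions)   # c_t(a): how many times each action was used
    def make_step(self): # Maybe should return reward
        pass
    def update_reward(self, action, reward):
        # Incremental mean update: Q_{t+1} = (Q_t * c_t + r_{t+1}) / (c_t + 1)
        self.__reward_means[action] = (self.__reward_means[action] * self.__action_uses[action] + reward)\
        / (self.__action_uses[action] + 1)
        # Count this use so that the next update weights the old mean correctly
        self.__action_uses[action] += 1
    def best_actions(self):
        # Indices of all actions that share the highest estimated mean reward
        best =  np.argwhere(self.__reward_means == np.max(self.__reward_means))
        return best.reshape(best.shape[0])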
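
To illustrate how the model and the strategy are meant to interact, here is a minimal, self-contained sketch that mirrors their logic with plain arrays instead of the classes above. Choosing actions uniformly at random is an assumption made only for this example (the notebook's `make_step` is still a placeholder), and the seed, the number of actions and the number of steps are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, used only for reproducibility

n_actions = 5                             # illustrative number of arms
true_means = rng.normal(size=n_actions)   # hidden mean reward of each arm
estimates = np.zeros(n_actions)           # Q_t(a)
counts = np.zeros(n_actions, dtype=int)   # c_t(a)

for _ in range(20_000):
    action = rng.integers(n_actions)              # assumption: uniform random choice
    reward = rng.normal(loc=true_means[action])   # noisy reward with unit variance
    # Incremental mean update, followed by advancing the counter
    estimates[action] = (estimates[action] * counts[action] + reward) / (counts[action] + 1)
    counts[action] += 1

# All actions tied for the highest estimate, as in best_actions()
best = np.flatnonzero(estimates == estimates.max())
print(np.abs(estimates - true_means).max(), best)
```

With enough pulls per arm the estimates should land close to the hidden means, so the largest absolute error printed here is expected to be small.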