#### The Upper Confidence Bound Algorithm
- Epsilon-greedy and Softmax do not keep track of how much they know about any of the arms available to them. They pay attention only to _how much reward they have gotten from the arms_. This means that they will under-explore options whose initial experiences were not rewarding, even though they do not have enough data to be confident about those arms.
- We can do better by using an algorithm that pays attention to not only what it knows, but also how much it knows.
- UCB does not use randomness at all. Unlike epsilon-greedy or Softmax, it is possible to know exactly how the UCB algorithm will behave in any given situation. This can make it easier to reason about at times.
- UCB keeps track of our confidence in the estimated values of all of the arms.
- UCB does not have any free parameters that you need to configure before you can deploy it. This is a major improvement if you are interested in running it in the wild, because it means that you can start to use UCB without having a clear sense of how you expect the world to behave.
- _**Taken together, the use of an explicit measure of confidence, the absence of randomness, and the absence of configurable parameters make UCB very compelling.**_
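
To make "how much it knows" concrete, here is a minimal sketch of the score that UCB1 assigns to an arm: the estimated value plus an exploration bonus of sqrt(2 ln(t) / n), where t is the total number of pulls and n is the number of pulls of that arm. The helper names `ucb_bonus` and `ucb_score` are illustrative assumptions and are not part of the class defined in this notebook.

```python
import math


def ucb_bonus(total_counts, arm_count):
    """Exploration bonus for an arm pulled `arm_count` times out of `total_counts` pulls."""
    return math.sqrt((2 * math.log(total_counts)) / float(arm_count))


def ucb_score(value, total_counts, arm_count):
    """Estimated value of the arm plus its confidence bonus."""
    return value + ucb_bonus(total_counts, arm_count)


# Two hypothetical arms with the same estimated value but different amounts of data.
total = 100
rarely_pulled = ucb_score(0.5, total, 5)
often_pulled = ucb_score(0.5, total, 95)

# The arm we know less about receives the larger score, so it is tried next.
print(rarely_pulled > often_pulled)
```

Because the bonus depends only on the counts, an arm with a poor early record is still revisited once enough pulls have gone to the other arms.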

In [4]:
import numpy
import math


In [5]:
class UCB1():
    def __init__(self, counts, values ) :
        self.counts = counts
        self.values = values
    
    def initialize(self, n_arms):
        self.counts = [ 0 for col in range(n_arms) ]
        self.values = [ 0.0 for col in range(n_arms) ]
    
    def select_arm ( self ):
        n_arms = len(self.counts)
        
        # Play every arm once first, so that UCB has an estimate (and a nonzero
        # count) for each arm before it applies its confidence-based decision rule.
        for arm in range(n_arms):
            if self.counts[arm] == 0:
                return arm
        
        ucb_values = [ 0.0 for arm in range(n_arms) ]
        total_counts = sum( self.counts )
        for arm in range(n_arms):
            # Exploration bonus: sqrt( 2 * ln(total pulls) / pulls of this arm )
            bonus = math.sqrt( ( 2 * math.log(total_counts) ) / float( self.counts[arm] ) )
            ucb_values[arm] = self.values[arm] + bonus 
        
        return ucb_values.index( max( ucb_values  ) )
            
    
    def update( self, chosen_arm, reward ):
        self.counts[chosen_arm] += 1
        n = self.counts[chosen_arm]
        
        value = self.values[ chosen_arm ]
        new_value = ( (n-1)* value + reward ) / float(n)
        self.values[chosen_arm] = new_value
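
Since UCB uses no randomness, its behaviour on a fixed problem is fully reproducible. The sketch below illustrates this with a compact, function-style re-implementation of the same select/update logic, so that the block runs on its own. It uses the standard UCB1 bonus sqrt(2 ln(t) / n). The name `run_ucb1` and the deterministic `arm_rewards` list are assumptions made for illustration; a real bandit would return noisy rewards.

```python
import math


def run_ucb1(arm_rewards, horizon):
    """Run UCB1 for `horizon` steps against fixed (hypothetical) per-arm rewards."""
    n_arms = len(arm_rewards)
    counts = [0] * n_arms
    values = [0.0] * n_arms
    history = []

    for _ in range(horizon):
        if 0 in counts:
            # Pull every arm once before applying the confidence-based rule.
            arm = counts.index(0)
        else:
            total = sum(counts)
            scores = [
                values[a] + math.sqrt((2 * math.log(total)) / float(counts[a]))
                for a in range(n_arms)
            ]
            arm = scores.index(max(scores))

        reward = arm_rewards[arm]
        counts[arm] += 1
        # Incremental running average of the rewards seen for this arm.
        values[arm] += (reward - values[arm]) / float(counts[arm])
        history.append(arm)

    return counts, values, history


counts, values, history = run_ucb1([0.1, 0.5, 0.9], horizon=200)
print(counts)        # pulls per arm; the best arm should dominate
print(history[:10])  # the first three pulls are arms 0, 1, 2 in order
```

Running the function twice with the same inputs yields the same pull sequence, which is what makes UCB easier to reason about than epsilon-greedy or Softmax.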
        
        
        