## Epsilon Greedy Algorithm

Welcome back! In this lecture we are going to discuss the multi-armed bandit problem and the first arm selection policy that is readily implementable in Python. To recap, a multi-armed bandit is a learning algorithm that allocates a fixed number of resources among a set of competing choices, with the purpose of learning the optimal resource allocation rule over time. It's therefore useful to establish a few parameters and principles of the bandit setting. First of all:

* You're presented with k distinct "arms" to choose from. An arm could define a piece of content for a pick

In [None]:
# We can implement a traditional epsilon-greedy bandit strategy as a single policy function.
import numpy as np

def epsilon_greedy_policy(df, arms, epsilon = 0.2, slate_size = 5, batch_size = 50):
    '''
    Apply one step of the epsilon-greedy policy to generate a slate of recommendations
    (for example, movie recommendations).
    
    The arguments are:
        df: a dataframe. The dataset (history observed so far) to apply the policy to
        arms: a list or an array of every eligible arm, such as an ID, a video, etc.
        epsilon: a float that represents the proportion of timesteps where we explore random arms
        slate_size: an integer representing the number of recommendations to make at each step
        batch_size: an integer; accepted as an argument but not used in the body of this function
    
    '''
    
    # The first step is to use np.random.binomial with a single trial (a Bernoulli draw) to decide
    # whether to explore: it returns 1 with probability epsilon, otherwise 0.
    explore = np.random.binomial(1, epsilon)
    
    # There are two cases in which we explore random recommendations. First, if the value of explore is 1.
    # Second, if the recommender system has no history yet, that is, when we have just started the process.
    
    # So if explore equals 1 or the history dataframe has no rows
    if explore == 1 or df.shape[0] == 0:
        # we use np.random.choice to pick a random set of recommendations.
        # Note that arms is an array of eligible arms to choose from,
        # and at this step we draw slate_size of them without replacement.
        recs = np.random.choice(arms, size = slate_size, replace = False)
        
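    # In all other cases, the recommender system exploits the arms that have performed
    # best historically. We sort the arms by the proportion of people who rated them good and
    # recommend the items with the best performance so far!
    
    else:
        # Compute the mean and the count of the target for every arm ID
        scores = df[['ID', 'target']].groupby('ID').agg({'target': ['mean', 'count']})
        
        # Flatten the two-level column index produced by agg
        scores.columns = ['mean', 'count']
        
        # Keep the arm ID as a regular column as well as the index
        scores['ID'] = scores.index
        
        # Rank the arms from the best mean to the worst
        scores = scores.sort_values('mean', ascending = False)
        
        # Recommend the top slate_size arms
        recs = scores.loc[scores.index[0:slate_size], 'ID'].values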
        
    return recs
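
To see the two branches of the policy in action, here is a small self-contained sketch on a toy history. Everything in it is an assumption made for illustration: the toy data, the helper names `top_arms` and `choose`, and the use of a seeded `np.random.default_rng` generator in place of the global `np.random` functions. Setting epsilon to 0 forces exploitation, setting it to 1 forces exploration, and an empty history triggers exploration regardless of epsilon.

```python
import numpy as np
import pandas as pd

# Hypothetical toy history: one row per observed (arm ID, feedback) pair.
history = pd.DataFrame({
    "ID":     [1, 1, 2, 2, 3, 3, 4, 4, 4],
    "target": [1, 1, 1, 0, 0, 0, 1, 1, 0],
})

def top_arms(history, slate_size):
    """Exploit step: rank arms by mean observed target, best first."""
    mean_reward = history.groupby("ID")["target"].mean()
    ranked = mean_reward.sort_values(ascending=False)
    return ranked.index[:slate_size].to_numpy()

def choose(history, arms, epsilon, slate_size, rng):
    """Explore with probability epsilon, or when there is no history; else exploit."""
    explore = rng.binomial(1, epsilon)  # Bernoulli draw: 1 with probability epsilon
    if explore == 1 or history.empty:
        return rng.choice(arms, size=slate_size, replace=False)
    return top_arms(history, slate_size)

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
arms = np.array([1, 2, 3, 4, 5])

# epsilon = 0: always exploit, so the two arms with the best means are returned.
greedy_slate = choose(history, arms, epsilon=0.0, slate_size=2, rng=rng)

# epsilon = 1: always explore, so two distinct arms are drawn at random.
random_slate = choose(history, arms, epsilon=1.0, slate_size=2, rng=rng)

# Empty history: explore even though epsilon = 0.
cold_start_slate = choose(history.iloc[0:0], arms, epsilon=0.0, slate_size=2, rng=rng)

print(greedy_slate, random_slate, cold_start_slate)
```

In the toy history the per-arm means are 1.0, 0.5, 0.0 and about 0.67 for arms 1 to 4, so the greedy slate contains arms 1 and 4. Arm 5 never appears in the history, which means it can only ever be recommended through exploration.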
