## K-armed bandit for a recommendation system
A recommendation system has to recommend relevant items to the user, and a single item can have n relevant items. A K-armed bandit takes an action based on its environment; in a recommendation system the environment is the user, so the algorithm has to identify the user's needs and then take an action.

In an RL-based recommendation system, good recommendations are rewarded and bad recommendations are punished.

Unlike the robot dirt-cleaning problem, where the robot will not revisit a tile it has already cleaned, in a recommendation system an item that was already visited can be revisited (recommended again).

The user gives feedback (a reward) when they find a recommendation relevant. Instead of drawing rewards from a fixed probability distribution, we rely on a human to give a reward according to how good the recommendation is. The user's feedback is subjective and depends on many external factors, so the reward for any single recommendation cannot be predicted in advance.
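
To experiment without a real user, one option is to stand in a simulated user whose feedback is noisy. The sketch below is only an assumption for illustration: `simulated_feedback` and the 0.8 "like" probability are hypothetical and not part of the notebook. It shows that a single reward is unpredictable, while the average over many recommendations settles near a stable value that the bandit can estimate.

```python
import random

def simulated_feedback(like_probability, rng):
    """Hypothetical stand-in for a user: returns 1 (liked) or 0 (not liked)."""
    return 1 if rng.random() < like_probability else 0

rng = random.Random(0)   # fixed seed so the run is repeatable
like_probability = 0.8   # assumed: the user likes this item about 80% of the time

# Each individual reward is 0 or 1 and cannot be known in advance...
rewards = [simulated_feedback(like_probability, rng) for _ in range(5000)]

# ...but the average over many recommendations is stable.
average_reward = sum(rewards) / len(rewards)
print(f"Average reward over {len(rewards)} recommendations: {average_reward:.3f}")
```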

Initially there is an exploration phase, in which the algorithm recommends random items and the user rewards or punishes each one, so that the algorithm learns which items yield a good reward when recommended.

Here, a recommendation given by the algorithm is considered an arm.

By recommending items in the exploration phase, the algorithm builds up an estimate of which action, taken for a particular item, yields good rewards.

For example, suppose B is relevant to A and C is irrelevant to A. Starting from A, we explore B and C. B gets rewarded more than C, so while exploring, the algorithm learns that A and B are relevant to each other and that C is irrelevant.
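
The A, B, C example can be traced by hand with a few lines of code. This is a minimal sketch with made-up feedback values (the `feedback` list is hypothetical). It uses the same incremental-mean rule as the `update_estimates` method in this notebook.

```python
# Hypothetical feedback collected while exploring from item A:
# 1 = the user found the recommendation relevant, 0 = not relevant.
feedback = [("B", 1), ("C", 0), ("B", 1), ("C", 0), ("B", 0), ("C", 0)]

estimates = {"B": 0.0, "C": 0.0}  # running average reward per item
counts = {"B": 0, "C": 0}         # how often each item was recommended

for item, reward in feedback:
    counts[item] += 1
    # Incremental sample mean: move the estimate toward the new reward
    estimates[item] += (reward - estimates[item]) / counts[item]

best = max(estimates, key=estimates.get)
print(estimates)
print(f"Most relevant to A so far: {best}")
```

With this feedback, B ends with an average reward of 2/3 and C with 0, so B is ranked as the more relevant item.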



In [46]:
import numpy as np
import random

In [48]:
class KArmBanditRecommendation:
    def __init__(self, k, epsilon=0.1):
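        self.k = k  # number of arms (candidate items)
        self.epsilon = epsilon  # probability of exploring instead of exploiting
        self.q_values = np.zeros(k)  # estimated average reward of each arm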
        self.action_counts = np.zeros(k)  # number of times each arm was selected

    def select_action(self):
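        if random.random() < self.epsilon:
            # explore: pick an arm uniformly at random
            action = random.randint(0, self.k - 1)
        else:
            # exploit: pick the arm with the highest estimate
            # (np.argmax returns the first index when several arms tie)
            action = np.argmax(self.q_values)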
        return action

    def update_estimates(self, action, reward):
        # count this selection, then move the estimate toward the
        # observed reward (incremental sample mean)
        self.action_counts[action] += 1

        n = self.action_counts[action]
        self.q_values[action] += (1 / n) * (reward - self.q_values[action])

    def recommend_item(self, items):
        action = self.select_action()
        return items[action]
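
The update in `update_estimates` is the incremental form of the sample mean. As a short derivation, write $Q_n$ for the estimate of one arm after it has been selected $n$ times and $R_i$ for the reward received on its $i$-th selection (this notation is introduced here for illustration, it does not appear in the code):

$$
Q_n = \frac{1}{n}\sum_{i=1}^{n} R_i
    = \frac{1}{n}\Bigl(R_n + (n-1)\,Q_{n-1}\Bigr)
    = Q_{n-1} + \frac{1}{n}\bigl(R_n - Q_{n-1}\bigr)
$$

The last expression matches the line `self.q_values[action] += (1 / n) * (reward - self.q_values[action])`, with `n` read after the count has been incremented. It avoids storing every past reward.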

In [50]:
k_arm_bandit = KArmBanditRecommendation(k=10, epsilon=0.5)

In [51]:
items = [f'Item_{chr(65 + i)}' for i in range(10)]
items

['Item_A',
 'Item_B',
 'Item_C',
 'Item_D',
 'Item_E',
 'Item_F',
 'Item_G',
 'Item_H',
 'Item_I',
 'Item_J']

Feedback: 1 for a good recommendation, 0 for a bad recommendation.\
We assume Item_A and Item_F to be more relevant (good recommendations).

In [52]:
for i in range(40):
    # Recommend an item
    recommended_item = k_arm_bandit.recommend_item(items)
    print(f"Recommended: {recommended_item}")

    # Simulate user feedback (1 for good recommendation, 0 for bad recommendation)
    # For simplicity, we assume some items are more relevant than others
    reward = 1 if recommended_item in ['Item_A', 'Item_F'] else 0  # User rewards Item_A and Item_F

    # Update estimates with the observed reward
    action = items.index(recommended_item)
    k_arm_bandit.update_estimates(action, reward)

Recommended: Item_C
Recommended: Item_G
Recommended: Item_F
Recommended: Item_F
Recommended: Item_A
Recommended: Item_A
Recommended: Item_H
Recommended: Item_A
Recommended: Item_A
Recommended: Item_A
Recommended: Item_A
Recommended: Item_B
Recommended: Item_A
Recommended: Item_A
Recommended: Item_A
Recommended: Item_A
Recommended: Item_D
Recommended: Item_A
Recommended: Item_A
Recommended: Item_A
Recommended: Item_G
Recommended: Item_B
Recommended: Item_A
Recommended: Item_G
Recommended: Item_A
Recommended: Item_A
Recommended: Item_H
Recommended: Item_H
Recommended: Item_H
Recommended: Item_A
Recommended: Item_C
Recommended: Item_A
Recommended: Item_G
Recommended: Item_A
Recommended: Item_C
Recommended: Item_A
Recommended: Item_G
Recommended: Item_A
Recommended: Item_A
Recommended: Item_A


In [53]:
for i, item in enumerate(items):
    print(f"Estimated value for {item}: {k_arm_bandit.q_values[i]}")


Estimated value for Item_A: 1.0
Estimated value for Item_B: 0.0
Estimated value for Item_C: 0.0
Estimated value for Item_D: 0.0
Estimated value for Item_E: 0.0
Estimated value for Item_F: 1.0
Estimated value for Item_G: 0.0
Estimated value for Item_H: 0.0
Estimated value for Item_I: 0.0
Estimated value for Item_J: 0.0
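
Item_A and Item_F both reach an estimate of 1.0, yet the recommendation log is dominated by Item_A. One likely reason is that `np.argmax` returns the first index among equal maxima, so the exploit branch lands on Item_A whenever the two tie. The sketch below shows one possible alternative that breaks ties at random. `argmax_random_tie` is a hypothetical helper, not part of the class above.

```python
import numpy as np

def argmax_random_tie(values, rng):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    values = np.asarray(values)
    best = np.flatnonzero(values == values.max())  # indices of all maxima
    return int(rng.choice(best))

# The estimates printed above: Item_A (index 0) and Item_F (index 5) tie at 1.0
q_values = np.array([1.0, 0, 0, 0, 0, 1.0, 0, 0, 0, 0])
rng = np.random.default_rng(0)

first_max = int(np.argmax(q_values))  # always the first maximum, index 0
picks = {argmax_random_tie(q_values, rng) for _ in range(200)}

print(f"np.argmax picks index {first_max}")
print(f"Random tie-breaking picks indices {sorted(picks)}")
```

Replacing `np.argmax` with such a helper in `select_action` would spread exploitation across equally good items. Whether that is desirable depends on the application.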
