# Another Q-learning tryout on Santa's uncertain bags

We reduce the number of possible states corresponding to 1000 bags to states corresponding to 1 bag. Problem of filling 1000 bags will be translated to the optimal usage of policy and action-value function on varying environment. The last is given by the array of available gifts which decreases when bags are filled.


## States 

A state is characterized by a vector of size `(N_TYPES)`. For example, `s=[1,0,1,0,0,0,0,0,0]`. The initial state is when the null vector or a customly defined vector. Terminal states are defined by state's score. 

How many state there are? There are at most `10^N_TYPES` states.


## Actions

Action is to add a toy to the bag following the list of available toys. For example, action is a integer value corresponding to the toy index.


## Rewards

Action reward can be defined by the score of the bag where a toy has been added.


## Q-learning: Off-Policy Temporal Difference Control

In this algorithm we estimate action-value function $Q(s,a)$ as :
$$
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t,A_t) \right], \, Q(\cal{S}^{+},a)=0
$$

**Algorithm**
<br>
<div style="background-color: #aaaaaa; padding: 10px; width: 75%; border: solid black; border-radius: 5px;">

    Initialize $Q(s, a)$, for all $s \in \cal{S}$, $a \in \cal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$<br>
    Repeat (for each episode):<br>
    &emsp;Initialize $S$<br>
    &emsp;Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)<br>
    &emsp;Repeat (for each step of episode):<br>
    &emsp;&emsp;Take action $A$, observe $R$, $S'$<br>
    &emsp;&emsp;$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \max_{a}Q(S', a) - Q(S,A) \right]$<br>
    &emsp;&emsp;$S \leftarrow S'; \, A \leftarrow A';$<br>
    &emsp;until $S$ is terminal
</div>

In [1]:
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [278]:
import matplotlib.pyplot as plt
%matplotlib inline

In [28]:
from time import time
from copy import deepcopy
import numpy as np
np.random.seed(2017)

from collections import defaultdict
import heapq

import logging
logging.getLogger().setLevel(logging.DEBUG)

In [565]:
import sys
sys.path.append('../common')
from utils import weight3 as weight_fn, weight_by_index
from utils import bag_weight, score, mean_n_sigma, score_stats
from utils import MAX_WEIGHT, AVAILABLE_GIFTS, GIFT_TYPES, N_TYPES, N_BAGS

In [241]:
REJECTED_BAGS_THRESHOLD = 0.015
NEGATIVE_REWARD = -1000
POSITIVE_REWARD = 1000

In [221]:
def step_reward(rejected):    
    return 1.0 if rejected < REJECTED_BAGS_THRESHOLD else -rejected*10

def take_action(state, action):
    new_state = state.copy()
    new_state[action] += 1
    return new_state

def is_available(state, available_gifts, gift_types=GIFT_TYPES):
    for v, gift_type in zip(state, gift_types):
        if available_gifts[gift_type] - v < 0:
            return False
    return True

def update_available_gifts(available_gifts, state, gift_types=GIFT_TYPES):
    for v, gift_type in zip(state, gift_types):
        assert available_gifts[gift_type] - v >= 0, "Found state is not available : {}, {}".format(state, available_gifts)
        available_gifts[gift_type] = available_gifts[gift_type] - v
        
def state_to_str(state):
    return state.tolist().__str__()

def find_value(action, actions_values, return_index=False):
    for i, (v, a) in enumerate(actions_values):
        if action == a:
            if return_index:
                return v, i
            return v
    raise Exception("No action={} in actions_values={}".format(action, actions_values))
    
def has_action(actions_values, action, return_index=False):
    for i, (v, a) in enumerate(actions_values):
        if action == a:
            if return_index:
                return True, i
            return True
    if return_index:
        return False, None
    return False

In [226]:
def get_policy_action(state, action_value_function, epsilon=0.1):
    state_key = state_to_str(state)
    u = np.random.rand()
    b = state_key in action_value_function
    if b and u > epsilon:
        # Get max value action
        actions_values = action_value_function[state_key]
        max_action_value = actions_values[0]
        return max_action_value[1]
    else:
        # Take a random action
        action = np.random.randint(N_TYPES)
        b1 = False
        if b:
            b1 = has_action(action_value_function[state_key], action)
        if not b or not b1:
            # Arbitrary initialization
            # We store values as POSITIVE_REWARD - value to use heapq property that heap[0] is the smallest element
            # In our case this element corresponds to the largest value
            value = POSITIVE_REWARD - np.random.rand()
            heapq.heappush(action_value_function[state_key], [value, action])                     
        return action

In [779]:
def q_learning(goal_weight, 
               available_gifts,
               initial_state=None,
               n_episodes=10, alpha=0.75, gamma=0.95, epsilon=0.1, action_value_function=None):
    
    logging.info("--- Q-learning : goal={}, n_episodes={}".format(goal_weight, n_episodes))
    if action_value_function is None:
        logging.info("-- Reset action_value_function")
        action_value_function = defaultdict(list)
    
    best_state = initial_state
    best_score = 0
    
    for i in range(n_episodes):

        logging.debug("-- Episode : %i" % i)
        
        episode_length = 5**N_TYPES
        state = np.zeros((N_TYPES), dtype=np.uint8) if initial_state is None else initial_state        
        action = get_policy_action(state, action_value_function, epsilon=epsilon)
        state_score, state_score_std, rejected, rejected_std = score_stats((best_state,), count=200)
        #state_score, rejected = score((state,), return_rejected=True)
        score_min = state_score - state_score_std*0.1
        score_max = state_score + state_score_std*0.5
        is_terminal = score_min > goal_weight and is_available(state, available_gifts)
        is_terminal |= rejected + rejected_std*0.25 > 2.0*REJECTED_BAGS_THRESHOLD
        
        logging.debug("Initial state score/action: {}, {}".format(state_score, action))
        
        while not is_terminal:
            
            episode_length -= 1 
            if episode_length < 0:
                logging.warn('Episode length is reached, but state score is still : %f / %f' % (state_score, goal_weight))
                break
            
            current_reward = 0 
            new_state = take_action(state, action)
            state_score, state_score_std, rejected, rejected_std = score_stats((new_state,), count=200)
            score_min = state_score - state_score_std*0.1
            score_max = state_score + state_score_std*0.5            
            rejected += rejected_std*0.25
            
            if not is_available(new_state, available_gifts) or rejected > 2.0*REJECTED_BAGS_THRESHOLD:                
                current_reward = NEGATIVE_REWARD
                is_terminal = True
                logging.debug("--->1 Episode finished with NEGATIVE reward, {}, {}, {}".format(score_min, score_max, rejected))
            elif score_max >= MAX_WEIGHT:
                current_reward = NEGATIVE_REWARD
                is_terminal = True
                logging.debug("--->2 Episode finished with NEGATIVE reward, {}, {}, {}".format(score_min, score_max, rejected))
            elif MAX_WEIGHT > score_min >= goal_weight:
                current_reward = POSITIVE_REWARD
                is_terminal = True
                logging.debug("---> Episode finished with POSITIVE reward")
                if best_score < score_min:
                    best_score = score_min
                    best_state = new_state
            elif score_min < goal_weight:
                current_reward = step_reward(rejected)
            else:
                raise Exception("Unclassified state: {}, score_min={}, score_max={}, rejected={}".format(new_state, score_min, score_max, rejected))

            # logging.debug("New state score, reward, new_state, action : {}, {}, {}, {}".format(state_score, current_reward, new_state, action))                
                
            # Update Q(s,a)
            state_key = state_to_str(state)
            new_state_key = state_to_str(new_state)
            
            actions_values = action_value_function[state_key]
            action_value, action_index = find_value(action, actions_values, return_index=True)
            v = POSITIVE_REWARD - action_value
            # actions_values is a heap with first element being the smallest element
            # We store values in actions_values as POSITIVE_REWARD - Q(s,a)
            nv = POSITIVE_REWARD - actions_values[0][0]
            t = alpha * (current_reward + gamma * nv - v)
            
            action_value_function[state_key][action_index] = [POSITIVE_REWARD - (v + t), action]
            
            state = new_state
            action = get_policy_action(state, action_value_function, epsilon=epsilon)                        
                
    return action_value_function, best_score, best_state

## Single run test

In [780]:
REJECTED_BAGS_THRESHOLD = 0.05
alpha = 0.75
goal_weight = MAX_WEIGHT * alpha
print goal_weight
final_action_value_function = defaultdict(list)
$final_state = np.zeros((N_TYPES), dtype=np.uint8)

37.5


In [816]:
# final_state = np.zeros((N_TYPES), dtype=np.uint8)
final_state = np.array([2, 0, 2, 1, 0, 0, 1, 2, 0])

In [817]:
logging.getLogger().setLevel(logging.INFO)
final_action_value_function, best_score, best_state = q_learning(goal_weight, 
                                                                 AVAILABLE_GIFTS,
                                                                 initial_state=final_state,
                                                                 n_episodes=100, 
                                                                 alpha=0.75, 
                                                                 gamma=0.85, 
                                                                 epsilon=0.3, 
                                                                 action_value_function=final_action_value_function)

INFO:root:--- Q-learning : goal=38.0, n_episodes=100


AttributeError: 'list' object has no attribute 'tolist'

In [804]:
best_score, best_state, score((best_state,), return_rejected=True), 2.0*REJECTED_BAGS_THRESHOLD

(37.788062681555829,
 array([3, 0, 1, 0, 0, 1, 2, 3, 0], dtype=uint8),
 (38.243957462353478, 0.02),
 0.1)

In [777]:
best_state = [3, 0, 1, 0, 0, 1, 2, 3, 0]

In [805]:
best_state, score_stats((best_state,), count=200)

(array([3, 0, 1, 0, 0, 1, 2, 3, 0], dtype=uint8),
 (36.561601141292897,
  10.630658920860142,
  0.065000000000000002,
  0.246525860712421))

In [806]:
# for count in [100, 200, 300]:
#     sc = []
#     sc2 = []
#     for i in range(200):
#         s, r = score((best_state,), return_rejected=True, count=count)
#         sc.append(s)
#         rr.append(r)

#     plt.figure(figsize=(12,4))
#     plt.subplot(131)    
#     plt.plot(sc)
#     plt.subplot(132)
#     plt.plot(rr)

## Action-value function estimation

In [814]:
REJECTED_BAGS_THRESHOLD = 0.05
alpha = 0.76
goal_weight = MAX_WEIGHT * alpha
print goal_weight

filled_bags = np.zeros((N_BAGS, N_TYPES), dtype=np.uint8)
final_action_value_function = defaultdict(list)
available_gifts = deepcopy(AVAILABLE_GIFTS)
bag_index = 0
initial_state = filled_bags[0]
# found_goal_states = []

38.0


In [815]:
logging.getLogger().setLevel(logging.WARN)
n_episodes = 250

last_score_computation = -1
while bag_index < N_BAGS:
    
    print("Filled bags : ", bag_index, "/", N_BAGS)
    
    final_action_value_function, best_score, best_state = q_learning(goal_weight, 
                                                                 available_gifts,
                                                                 initial_state=initial_state,
                                                                 n_episodes=n_episodes, 
                                                                 alpha=0.75, 
                                                                 gamma=0.85, 
                                                                 epsilon=0.25, 
                                                                 action_value_function=final_action_value_function)
    if best_score > 0:
        print("- Got a result : ", best_score, best_state)
        update_available_gifts(available_gifts, best_state, GIFT_TYPES)
        
#         if len(found_goal_states) == 0 or found_goal_states[-1] != result.state:
#             found_goal_states.append(result.state)
        initial_state = best_state
    
        filled_bags[bag_index, :] = best_state[:]
        bag_index += 1
    else:
        print("No best state found")
        
        
    if bag_index > 0 and (bag_index % 20) == 0 and last_score_computation < bag_index:
            s, r = score(filled_bags, return_rejected=True)
            print(">>> Current score: ", s, s * N_BAGS *1.0 / bag_index, "rejected=", r)
            last_score_computation = bag_index

    if bag_index > 0 and (bag_index % 30) == 0 and last_score_computation < bag_index:
        print(">>> Currently available gifts : ", [(k, available_gifts[k]) for k in GIFT_TYPES])
        last_score_computation = bag_index

('Filled bags : ', 0, '/', 1000)
No best state found
('Filled bags : ', 0, '/', 1000)
No best state found
('Filled bags : ', 0, '/', 1000)
('- Got a result : ', 38.528628281654768, array([2, 0, 2, 1, 0, 0, 1, 2, 0], dtype=uint8))
('Filled bags : ', 1, '/', 1000)
No best state found
('Filled bags : ', 1, '/', 1000)
No best state found
('Filled bags : ', 1, '/', 1000)
No best state found
('Filled bags : ', 1, '/', 1000)


KeyboardInterrupt: 

In [314]:
score(filled_bags, return_rejected=True)

(112.7013124911667, 0.17999999999999999)