# Another Q-learning tryout on Santa's uncertain bags

We reduce the number of possible states corresponding to 1000 bags to states corresponding to 1 bag. Problem of filling 1000 bags will be translated to the optimal usage of policy and action-value function on varying environment. The last is given by the array of available gifts which decreases when bags are filled.


## States 

A state is characterized by a vector of size `(N_TYPES)`. For example, `s=[1,0,1,0,0,0,0,0,0]`. The initial state is when the null vector or a customly defined vector. Terminal states are defined by state's score. 

How many state there are? There are at most `10^N_TYPES` states.


## Actions

Action is to add a toy to the bag following the list of available toys. For example, action is a integer value corresponding to the toy index.


## Rewards

Action reward can be defined by the score of the bag where a toy has been added.


## Q-learning: Off-Policy Temporal Difference Control

In this algorithm we estimate action-value function $Q(s,a)$ as :
$$
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t,A_t) \right], \, Q(\cal{S}^{+},a)=0
$$

**Algorithm**
<br>
<div style="background-color: #aaaaaa; padding: 10px; width: 75%; border: solid black; border-radius: 5px;">

    Initialize $Q(s, a)$, for all $s \in \cal{S}$, $a \in \cal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$<br>
    Repeat (for each episode):<br>
    &emsp;Initialize $S$<br>
    &emsp;Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)<br>
    &emsp;Repeat (for each step of episode):<br>
    &emsp;&emsp;Take action $A$, observe $R$, $S'$<br>
    &emsp;&emsp;$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \max_{a}Q(S', a) - Q(S,A) \right]$<br>
    &emsp;&emsp;$S \leftarrow S'; \, A \leftarrow A';$<br>
    &emsp;until $S$ is terminal
</div>

In [1]:
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline



In [3]:
from time import time
from copy import deepcopy
import numpy as np
np.random.seed(2017)

from collections import defaultdict
import heapq

import logging
logging.getLogger().setLevel(logging.DEBUG)

In [4]:
import sys
sys.path.append('../common')
from utils import weight3 as weight_fn, weight_by_index
from utils import bag_weight, score, mean_n_sigma, score_stats
from utils import MAX_WEIGHT, AVAILABLE_GIFTS, GIFT_TYPES, N_TYPES, N_BAGS

In [5]:
REJECTED_BAGS_THRESHOLD = 0.015
NEGATIVE_REWARD = -5000
POSITIVE_REWARD = 1000

In [6]:
def step_reward(rejected):    
    return 1.0 if rejected < REJECTED_BAGS_THRESHOLD else -rejected*10

def take_action(state, action):
    new_state = state.copy()
    new_state[action] += 1
    return new_state

def is_available(state, available_gifts, gift_types=GIFT_TYPES):
    for v, gift_type in zip(state, gift_types):
        if available_gifts[gift_type] - v < 0:
            return False
    return True

def update_available_gifts(available_gifts, state, gift_types=GIFT_TYPES):
    for v, gift_type in zip(state, gift_types):
        assert available_gifts[gift_type] - v >= 0, "Found state is not available : {}, {}".format(state, available_gifts)
        available_gifts[gift_type] = available_gifts[gift_type] - v
        
def state_to_str(state):
    return state.tolist().__str__()

def find_value(action, actions_values, return_index=False):
    for i, (v, a) in enumerate(actions_values):
        if action == a:
            if return_index:
                return v, i
            return v
    raise Exception("No action={} in actions_values={}".format(action, actions_values))
    
def has_action(actions_values, action, return_index=False):
    for i, (v, a) in enumerate(actions_values):
        if action == a:
            if return_index:
                return True, i
            return True
    if return_index:
        return False, None
    return False

In [9]:
NULL_ACTIONS_VALUES = [(POSITIVE_REWARD, None)]

def get_actions_values(state, action_value_function):
    state_key = state_to_str(state)
    actions_values = action_value_function[state_key]
    if len(actions_values) == 0:
        for i in range(N_TYPES):
            va = [POSITIVE_REWARD - np.random.rand(), i]            
            heapq.heappush(action_value_function[state_key], va)        
    return action_value_function[state_key]    

def set_actions_values(state, action_value_function, actions_values):
    state_key = state_to_str(state)
    action_value_function[state_key] = actions_values

def get_policy_action(state, action_value_function, epsilon=0.1):
    u = np.random.rand()
    # Get max value action
    actions_values = get_actions_values(state, action_value_function)
    max_action_value = actions_values[0]
    pr = 1.0 - epsilon + epsilon / N_TYPES
    if u < pr:
        # Greedy
        return max_action_value[1]
    else:
        # Exploring
        if max_action_value[1] is None:
            return None
        actions = list(range(N_TYPES))
        actions.remove(max_action_value[1])
        return actions[np.random.randint(N_TYPES-1)]

In [50]:
def q_learning(goal_weight, 
               available_gifts,
               initial_state=None,
               n_episodes=10, alpha=0.75, gamma=0.95, epsilon=0.1, action_value_function=None):
    
    logging.info("--- Q-learning : goal={}, n_episodes={}".format(goal_weight, n_episodes))
    if action_value_function is None:
        logging.info("-- Reset action_value_function")
        action_value_function = defaultdict(list)
    
    best_state = None
    best_score = 0
    
    def _is_terminal_state(state, available_gifts, goal_weight):
        _is_terminal = False
        _current_reward = 0        
        _state_score, _state_score_std, _rejected, _rejected_std = score_stats((state,), count=200)
        _score_min = _state_score - _state_score_std*0.1
        _score_max = _state_score + _state_score_std*0.5            
        _rejected += _rejected_std*0.25
        if not is_available(state, available_gifts) or _rejected > 2.0*REJECTED_BAGS_THRESHOLD:                
            _current_reward = NEGATIVE_REWARD
            _is_terminal = True
            logging.debug("--->1 Episode finished with NEGATIVE reward, {}, {}, {}".format(_score_min, _score_max, _rejected))                
        elif _score_max >= MAX_WEIGHT:
            _current_reward = NEGATIVE_REWARD
            _is_terminal = True
            logging.debug("--->2 Episode finished with NEGATIVE reward, {}, {}, {}".format(_score_min, _score_max, _rejected))
        elif MAX_WEIGHT > _score_min >= goal_weight:
            _current_reward = POSITIVE_REWARD
            _is_terminal = True
            logging.debug("---> Episode finished with POSITIVE reward")
        elif _score_min < goal_weight:
            _current_reward = step_reward(_rejected)
        else:
            raise Exception("Unclassified state: {}, score_min={}, score_max={}, rejected={}".format(new_state, score_min, score_max, rejected))

        return _is_terminal, _current_reward, _score_min, _rejected
        
    
    for i in range(n_episodes):

        logging.debug("-- Episode : %i" % i)

        state = np.zeros((N_TYPES), dtype=np.uint8) if initial_state is None else initial_state.copy()        
        action = get_policy_action(state, action_value_function, epsilon=epsilon)
        logging.debug("Initial state/action: {}, {}".format(state, action))

        is_terminal, current_reward, score_min, rejected = _is_terminal_state(state, available_gifts, goal_weight)
        if is_terminal:
            logging.debug("Initial state is terminal state. Reward on state: %f" % current_reward)
            state_key = state_to_str(state)
            action_value_function[state_key] = NULL_ACTIONS_VALUES
            if current_reward == POSITIVE_REWARD:
                if best_score < score_min:
                    best_score = score_min
                    best_state = state
            continue

        episode_length = 5**N_TYPES                        
        while not is_terminal:            
            episode_length -= 1 
            if episode_length < 0:
                logging.warn('Episode length is reached, but state score is still : %f / %f' % (state_score, goal_weight))
                break
             
            new_state = take_action(state, action)
            logging.debug("New state score, reward, new_state, action : {}, {}, {} <- {}".format(score_min, current_reward, new_state, action))                

            is_terminal, current_reward, score_min, rejected = _is_terminal_state(new_state, available_gifts, goal_weight)
                    
            if is_terminal:
                set_actions_values(new_state, action_value_function, NULL_ACTIONS_VALUES)
                if current_reward == POSITIVE_REWARD:
                    if best_score < score_min:
                        best_score = score_min
                        best_state = state

    
            # Update Q(s,a)
            actions_values = get_actions_values(state, action_value_function)
            action_value, action_index = find_value(action, actions_values, return_index=True)
            # actions_values is a heap with first element being the smallest element
            # We store values in actions_values as POSITIVE_REWARD - Q(s,a)            
            v = POSITIVE_REWARD - action_value            
            new_actions_values = get_actions_values(new_state, action_value_function)            
            nv = POSITIVE_REWARD - new_actions_values[0][0] 
            t = alpha * (current_reward + gamma * nv - v)
            actions_values[action_index] = [POSITIVE_REWARD - (v + t), action]
            heapq.heapify(actions_values)
                            
            state = new_state
            action = get_policy_action(state, action_value_function, epsilon=epsilon)                        
                
    return action_value_function, best_score, best_state

In [51]:
def fill_one_bag(state, action_value_function):
    epsilon = 0.0
    action = get_policy_action(state, action_value_function, epsilon=epsilon)
    actions_values = get_actions_values(state, action_value_function)
    value = find_value(action, actions_values)
    trajectory = [(state, action, value)]
    print trajectory[-1]
    counter = 5**N_TYPES
    while action is not None:
        state = take_action(state, action)
        action = get_policy_action(state, action_value_function, epsilon=epsilon)
        actions_values = get_actions_values(state, action_value_function)
        value = find_value(action, actions_values)
        trajectory.append((state, action, value))
        print trajectory[-1]

        counter -= 1
        if counter == 5**N_TYPES - 5:
            break
            
    if counter == 0:
        logging.warn("Counter is zero")
    return trajectory[-1][0], trajectory
        

## Single run test

In [60]:
REJECTED_BAGS_THRESHOLD = 0.05
alpha = 0.75
goal_weight = MAX_WEIGHT * alpha
print goal_weight
final_action_value_function = defaultdict(list)
#final_state = np.zeros((N_TYPES), dtype=np.uint8)

37.0


In [61]:
final_state = np.zeros((N_TYPES), dtype=np.uint8)
# final_state = np.array([2, 0, 2, 1, 0, 0, 1, 2, 0])

In [62]:
logging.getLogger().setLevel(logging.INFO)
final_action_value_function, best_score, best_state = q_learning(goal_weight, 
                                                                 AVAILABLE_GIFTS,
                                                                 initial_state=final_state,
                                                                 n_episodes=100, 
                                                                 alpha=0.75, 
                                                                 gamma=0.85, 
                                                                 epsilon=0.3, 
                                                                 action_value_function=final_action_value_function)

In [63]:
if best_state is not None:
    print best_score, best_state, score((best_state,), return_rejected=True), 2.0*REJECTED_BAGS_THRESHOLD

39.2443913607 [9 0 1 0 0 1 2 0 0] (37.107928775979033, 0.01) 0.1


In [24]:
best_state = [3, 0, 1, 0, 0, 1, 2, 3, 0]

In [64]:
best_state, score_stats((best_state,), count=200)

(array([9, 0, 1, 0, 0, 1, 2, 0, 0], dtype=uint8),
 (37.399010038836622, 4.4610715645718377, 0.0, 0.0))

In [65]:
bag, trajectory = fill_one_bag(final_state, final_action_value_function)
print bag, score((bag,), return_rejected=True)
print trajectory

(array([0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 994.4950331003428)
(array([1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 994.6943548518199)
(array([2, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 994.8897801471765)
(array([3, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 995.1150248840314)
(array([4, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 5, 995.416722886052)
(array([4, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8), 0, 995.6587865755291)
[4 0 0 0 0 1 0 0 0] (13.048520116603754, 0.0)
[(array([0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 994.4950331003428), (array([1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 994.6943548518199), (array([2, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 994.8897801471765), (array([3, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 0, 995.1150248840314), (array([4, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 5, 995.416722886052), (array([4, 0, 0, 0, 0, 1, 0, 0, 0], dtype=uint8), 0, 995.6587865755291)]


In [66]:
final_action_value_function

defaultdict(list,
            {'[0, 0, 0, 0, 0, 0, 0, 0, 0]': [[994.4950331003428, 0],
              [997.2778088076916, 8],
              [997.0927140224287, 6],
              [997.6420576990441, 3],
              [999.0460372912326, 4],
              [997.9577202876417, 2],
              [997.5563882230301, 5],
              [997.7004339512996, 1],
              [997.6771130481362, 7]],
             '[0, 0, 0, 0, 0, 0, 0, 0, 1]': [[997.0928030867881, 0],
              [999.0817371502419, 4],
              [999.5386102225391, 2],
              [999.419373731572, 7],
              [999.8800172498273, 1],
              [999.7103477812194, 5],
              [999.6040321514763, 6],
              [999.8147449603016, 3],
              [999.8166095264128, 8]],
             '[0, 0, 0, 0, 0, 0, 0, 1, 0]': [[997.6780391551988, 0],
              [999.6386264040799, 4],
              [999.0214012456644, 5],
              [999.7705905552192, 7],
              [999.9296045360005, 1],
              

In [806]:
# for count in [100, 200, 300]:
#     sc = []
#     sc2 = []
#     for i in range(200):
#         s, r = score((best_state,), return_rejected=True, count=count)
#         sc.append(s)
#         rr.append(r)

#     plt.figure(figsize=(12,4))
#     plt.subplot(131)    
#     plt.plot(sc)
#     plt.subplot(132)
#     plt.plot(rr)

## Action-value function estimation

In [67]:
REJECTED_BAGS_THRESHOLD = 0.05
alpha = 0.75
goal_weight = MAX_WEIGHT * alpha
print goal_weight

filled_bags = np.zeros((N_BAGS, N_TYPES), dtype=np.uint8)
final_action_value_function = defaultdict(list)
available_gifts = deepcopy(AVAILABLE_GIFTS)
bag_index = 0
initial_state = filled_bags[0]
# found_goal_states = []

37.5


In [None]:
logging.getLogger().setLevel(logging.WARN)
n_episodes = 250

last_score_computation = -1
while bag_index < N_BAGS:
    
    print("Filled bags : ", bag_index, "/", N_BAGS)
    
    final_action_value_function, best_score, best_state = q_learning(goal_weight, 
                                                                 available_gifts,
                                                                 initial_state=initial_state,
                                                                 n_episodes=n_episodes, 
                                                                 alpha=0.75, 
                                                                 gamma=0.85, 
                                                                 epsilon=0.25, 
                                                                 action_value_function=final_action_value_function)
    if best_score > 0:
        print("- Got a result : ", best_score, best_state)
        update_available_gifts(available_gifts, best_state, GIFT_TYPES)
        
#         if len(found_goal_states) == 0 or found_goal_states[-1] != result.state:
#             found_goal_states.append(result.state)
        initial_state = best_state
    
        filled_bags[bag_index, :] = best_state[:]
        bag_index += 1
    else:
        print("No best state found")
        
        
    if bag_index > 0 and (bag_index % 20) == 0 and last_score_computation < bag_index:
            s, r = score(filled_bags, return_rejected=True)
            print(">>> Current score: ", s, s * N_BAGS *1.0 / bag_index, "rejected=", r)
            last_score_computation = bag_index

    if bag_index > 0 and (bag_index % 30) == 0 and last_score_computation < bag_index:
        print(">>> Currently available gifts : ", [(k, available_gifts[k]) for k in GIFT_TYPES])
        last_score_computation = bag_index

('Filled bags : ', 0, '/', 1000)
('- Got a result : ', 40.290641443598858, array([11,  0,  0,  0,  0,  2,  0,  1,  0], dtype=uint8))
('Filled bags : ', 1, '/', 1000)


In [314]:
score(filled_bags, return_rejected=True)

(112.7013124911667, 0.17999999999999999)