# Another Q-learning tryout on Santa's uncertain bags

We reduce the state space from states describing all 1000 bags to states describing a single bag. Filling 1000 bags then becomes a matter of reusing the policy and the action-value function in a changing environment. The environment is given by the array of available gifts, which shrinks as bags are filled.


## States 

A state is characterized by a vector of size `(N_TYPES)` whose components count the gifts of each type in the bag. For example, `s=[1,0,1,0,0,0,0,0,0]`. The initial state is either the null vector or a custom vector. Terminal states are defined by the state's score.

How many states are there? If each component takes at most 10 values, there are at most `10^N_TYPES` states.
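
As a small illustration of the state encoding described above, here is a minimal sketch. It assumes nine gift types (matching the length of the example vector) and uses a plain Python list instead of a NumPy array; the variable names are illustrative only.

```python
N_TYPES = 9  # assumption: nine gift types, as in the example vector

# A state counts how many gifts of each type are in the bag.
state = [0] * N_TYPES
state[0] += 1  # add one gift of type 0
state[2] += 1  # add one gift of type 2

# The string form of the vector can serve as a dictionary key for Q(s, .).
state_key = str(state)

# With ten possible values per component, the bound is 10^N_TYPES.
max_states = 10 ** N_TYPES
```

The string key is what makes a dictionary-based action-value table possible without enumerating all states in advance.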


## Actions

An action adds one toy to the bag, chosen from the list of available toys. An action is represented by an integer: the index of the toy type.


## Rewards

The reward of an action can be defined by the score of the bag to which the toy has been added.


## Q-learning: Off-Policy Temporal Difference Control

In this algorithm we estimate the action-value function $Q(s,a)$ as:
$$
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t,A_t) \right], \, Q(\cal{S}^{+},a)=0
$$

**Algorithm**
<br>
<div style="background-color: #aaaaaa; padding: 10px; width: 75%; border: solid black; border-radius: 5px;">

    Initialize $Q(s, a)$, for all $s \in \cal{S}$, $a \in \cal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$<br>
    Repeat (for each episode):<br>
    &emsp;Initialize $S$<br>
    &emsp;Repeat (for each step of episode):<br>
    &emsp;&emsp;Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)<br>
    &emsp;&emsp;Take action $A$, observe $R$, $S'$<br>
    &emsp;&emsp;$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \max_{a}Q(S', a) - Q(S,A) \right]$<br>
    &emsp;&emsp;$S \leftarrow S'$<br>
    &emsp;until $S$ is terminal
</div>
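
To make the update rule concrete, here is a minimal sketch of tabular Q-learning on a hypothetical five-state chain. This is not the bag-filling environment: the chain, its reward of +1 at the last state, and the function name `q_learning_chain` are assumptions chosen only to keep the example self-contained.

```python
import random
from collections import defaultdict


def q_learning_chain(n_states=5, n_episodes=200, alpha=0.5, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy chain with states 0..n_states-1.

    Action 0 moves left (staying at 0 if already there), action 1 moves right.
    Entering the last state gives reward +1 and ends the episode.
    """
    rng = random.Random(seed)
    # Q[s] = [Q(s, left), Q(s, right)]; unseen and terminal states default to 0.
    Q = defaultdict(lambda: [0.0, 0.0])
    terminal = n_states - 1
    for _ in range(n_episodes):
        s = 0
        while s != terminal:
            # Epsilon-greedy behaviour policy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                best = max(Q[s])
                a = rng.choice([x for x in (0, 1) if Q[s][x] == best])
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == terminal else 0.0
            # Off-policy target: max over the next state's action values.
            target = r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


Q = q_learning_chain()
```

After training, moving right should have the larger value in every non-terminal state, and the terminal state keeps $Q = 0$ because it is never updated.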

In [1]:
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [2]:
import matplotlib.pyplot as plt
%matplotlib inline



In [3]:
from time import time
from copy import deepcopy
import numpy as np
np.random.seed(2017)

from collections import defaultdict
import heapq

import logging
logging.getLogger().setLevel(logging.DEBUG)

In [4]:
import sys
sys.path.append('../common')
from utils import weight3 as weight_fn, weight_by_index
from utils import bag_weight, score, mean_n_sigma, score_stats
from utils import MAX_WEIGHT, AVAILABLE_GIFTS, GIFT_TYPES, N_TYPES, N_BAGS

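Compute the maximal number of gifts of each type in a terminal state. For gift type $i$ with random weight $X_i$, let $Y = N_i \cdot X_i$ and solve:

$$ \mathbb{E}[Y] - \text{Var}[Y]^{1/2} = \text{max_bag_weight}$$
or 
$$ N_i = \left\lceil \frac{\text{max_bag_weight}}{\mathbb{E}[X_i] - \text{Var}[X_i]^{1/2}} \right\rceil$$

The cell below uses the 5th percentile of sampled weights as the light-weight estimate in the denominator, and a fixed value for type 4.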
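
As a worked instance of the formula, here is a minimal sketch. The capacity of 50 is an assumption (it is consistent with `goal_weight = 37.0` for a factor of 0.74 later in the notebook), and the mean and standard deviation are hypothetical values, not the actual gift distributions.

```python
import math

MAX_BAG_WEIGHT = 50.0  # assumption: bag capacity


def max_gift_count(mean, std, max_bag_weight=MAX_BAG_WEIGHT):
    """N = ceil(max_bag_weight / (E[X] - sqrt(Var[X]))) for one gift type."""
    light_weight = mean - std
    if light_weight <= 0:
        raise ValueError("mean - std must be positive for the bound to be finite")
    return int(math.ceil(max_bag_weight / light_weight))


# Hypothetical gift type with mean weight 5 and standard deviation 2:
# 50 / (5 - 2) = 16.67, rounded up to 17.
n_example = max_gift_count(mean=5.0, std=2.0)
```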

In [5]:
LIMIT_STATE = np.zeros((N_TYPES), dtype=np.uint8)
indices = list(range(N_TYPES))
indices.remove(4)
for i in indices:
    weights = [weight_by_index(i) for j in range(50000)]    
    v = np.percentile(weights, 5)
    LIMIT_STATE[i] = int(np.ceil(MAX_WEIGHT/v))
    
v = 24.0741557491
LIMIT_STATE[4] = int(np.ceil(MAX_WEIGHT/v))
    
LIMIT_STATE

array([ 34,  14,   8, 218,   3,  26, 184,  30,  28], dtype=uint8)

Upper bound on the number of states:

In [6]:
np.prod(LIMIT_STATE)

10007950417920

In [7]:
fixed_weights = {}
fixed_weights['ball'] = 1.99876912083
fixed_weights['bike'] = 20.0021364556
fixed_weights['blocks'] = 11.6630321858
fixed_weights['book'] = 2.00086596571
fixed_weights['coal'] = 23.7866257713
fixed_weights['doll'] = 4.9993625282
fixed_weights['gloves'] = 1.40310067709
fixed_weights['horse'] = 4.99527064522
fixed_weights['train'] = 10.0234458084
fixed_weights

{'ball': 1.99876912083,
 'bike': 20.0021364556,
 'blocks': 11.6630321858,
 'book': 2.00086596571,
 'coal': 23.7866257713,
 'doll': 4.9993625282,
 'gloves': 1.40310067709,
 'horse': 4.99527064522,
 'train': 10.0234458084}

In [8]:
N_TRIALS = 10000
GIFT_WEIGHTS = np.zeros((N_TRIALS, N_TYPES))
for index in range(N_TYPES):
    GIFT_WEIGHTS[:, index] = [weight_by_index(index) for i in range(N_TRIALS)]
    
def compute_score(state):
    s = np.sum(GIFT_WEIGHTS * state, axis=1)
    mask = s < MAX_WEIGHT
    rejected = (N_TRIALS - np.sum(mask))*1.0 / N_TRIALS
    score = np.sum(s[mask]) * 1.0 / N_TRIALS
    return score, rejected

In [9]:
REJECTED_BAGS_THRESHOLD = 0.015
NEGATIVE_REWARD = -5000
POSITIVE_REWARD = 1000
STEP_POSITIVE_REWARD = 5.0

In [119]:
def step_reward(rejected, state):    
    r = STEP_POSITIVE_REWARD if rejected < REJECTED_BAGS_THRESHOLD else -rejected*10
    r += np.sum(state)**2
    return r 

def take_action(state, action):
    if action is None:
        return state
    new_state = state.copy()
    new_state[action] += 1
    return new_state

def is_available(state, available_gifts, gift_types=GIFT_TYPES):
    for v, gift_type in zip(state, gift_types):
        if available_gifts[gift_type] - v < 0:
            return False
    return True

def update_available_gifts(available_gifts, state, gift_types=GIFT_TYPES):
    for v, gift_type in zip(state, gift_types):
        assert available_gifts[gift_type] - v >= 0, "Found state is not available : {}, {}".format(state, available_gifts)
        available_gifts[gift_type] = available_gifts[gift_type] - v
        
def state_to_str(state):
    return state.tolist().__str__()

def find_value(action, actions_values, return_index=False):
    for i, (v, a) in enumerate(actions_values):
        if action == a:
            if return_index:
                return v, i
            return v
    raise Exception("No action={} in actions_values={}".format(action, actions_values))
    
def has_action(actions_values, action, return_index=False):
    for i, (v, a) in enumerate(actions_values):
        if action == a:
            if return_index:
                return True, i
            return True
    if return_index:
        return False, None
    return False

In [129]:
NULL_ACTIONS_VALUES = [(POSITIVE_REWARD, None)]

def bag_fix_weight(state):
    out = 0
    for i, c in enumerate(state):
        out += fixed_weights[GIFT_TYPES[i]] * c
    return out

def is_too_heavy(state):
    b1 = (LIMIT_STATE - state < 0).any()
    b2 = bag_fix_weight(state) > MAX_WEIGHT
    return b1 or b2

def get_actions_values(state, action_value_function):
    state_key = state_to_str(state)
    
#     if is_too_heavy(state):
#         action_value_function[state_key]=deepcopy(NULL_ACTIONS_VALUES)

    actions_values = action_value_function[state_key]
    if len(actions_values) == 0:
#         logging.debug("get_actions_values : Initialize random values")
#         for i in range(N_TYPES):
#             va = [POSITIVE_REWARD - np.random.rand(), i]            
#             heapq.heappush(action_value_function[state_key], va)        
        logging.debug("get_actions_values : Initialize values")
        index = np.random.randint(N_TYPES)
        indices = list(range(N_TYPES))
        indices.remove(index)
        va = [POSITIVE_REWARD - 1, index]            
        heapq.heappush(action_value_function[state_key], va)        
        for i in indices:
            va = [POSITIVE_REWARD, i]            
            heapq.heappush(action_value_function[state_key], va)        
    return action_value_function[state_key]    

def set_actions_values(state, action_value_function, actions_values):
    state_key = state_to_str(state)
    action_value_function[state_key] = deepcopy(actions_values)

def get_policy_action(state, action_value_function, available_gifts, epsilon=0.1):    
    u = np.random.rand()
    # Get max value action
    actions_values = get_actions_values(state, action_value_function)
    max_action_value = actions_values[0]
    pr = 1.0 - epsilon + epsilon / N_TYPES
    if u <= pr:
        print "C1"
        # Greedy
        action = max_action_value[1]        
        new_state = take_action(state, action)
        count = 1
        while not is_available(new_state, available_gifts) and count < len(actions_values):
            print "1 count :", count, action, new_state
            max_action_value = actions_values[count]
            action = max_action_value[1]        
            new_state = take_action(state, action)
            count += 1            
        # Return the last candidate if it is feasible, otherwise no action
        return action if is_available(new_state, available_gifts) else None
    else:
        print "C2"        
        # Exploring
        if max_action_value[1] is None:
            return None
        actions = list(range(N_TYPES))
        actions.remove(max_action_value[1])
        
        action = actions[np.random.randint(N_TYPES-1)]
        new_state = take_action(state, action)
        count = 1
        while not is_available(new_state, available_gifts) and count < N_TYPES:
            print "2 count :", count, action
            action = actions[np.random.randint(N_TYPES-1)]
            new_state = take_action(state, action)
            count += 1            
        # Return the last sampled action if it is feasible, otherwise no action
        return action if is_available(new_state, available_gifts) else None
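
The selection logic above can be condensed into a simpler variant. The sketch below is not the notebook's `get_policy_action`: it assumes the action values are held in a plain dictionary, explores uniformly over all feasible actions (including the greedy one), and the name `epsilon_greedy` is hypothetical.

```python
import random


def epsilon_greedy(q_values, allowed, epsilon=0.1, rng=random):
    """Pick an action among `allowed` using an epsilon-greedy rule.

    q_values: dict mapping action -> estimated Q value.
    allowed:  iterable of actions that are currently feasible.
    Returns None when no feasible action exists.
    """
    candidates = [a for a in allowed if a in q_values]
    if not candidates:
        return None
    if rng.random() < epsilon:
        # Explore: uniform choice over feasible actions.
        return rng.choice(candidates)
    # Exploit: feasible action with the highest estimated value.
    return max(candidates, key=lambda a: q_values[a])
```

Filtering by feasibility before choosing avoids the retry loops and makes the "no action available" case explicit.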

In [132]:
actions_values = get_actions_values(np.array([2, 0, 0, 0, 0, 0, 0, 0, 0]), final_action_value_function)
max_action_value = actions_values[0]
count = 1
not is_available(np.array([2, 0, 0, 0, 0, 0, 0, 0, 0]), available_gifts), count < len(actions_values)

(True, False)

In [137]:
a = get_policy_action(np.array([1, 1, 0, 0, 0, 0, 0, 0, 0]), final_action_value_function, available_gifts)
print a

C1
6


In [113]:
def q_learning(goal_weight, 
               available_gifts,
               initial_state=None,
               n_episodes=10, alpha=0.75, gamma=0.95, epsilon=0.1, action_value_function=None):
    
    logging.info("--- Q-learning : goal={}, n_episodes={}".format(goal_weight, n_episodes))
    if action_value_function is None:
        logging.info("-- Reset action_value_function")
        action_value_function = defaultdict(list)
    
    best_state = None
    best_score = 0
    best_rejected = 0.0
    
    def _is_terminal_state(state, goal_weight):
        _is_terminal = False
        _current_reward = 0
        
        _state_score, _rejected = compute_score(state)
        if _rejected > 10.0*REJECTED_BAGS_THRESHOLD:                
            _current_reward = NEGATIVE_REWARD
            _is_terminal = True
            logging.debug("--->1 Episode finished with NEGATIVE reward, {}, {}".format(_state_score, _rejected))                
        elif _state_score >= MAX_WEIGHT:
            _current_reward = NEGATIVE_REWARD
            _is_terminal = True
            logging.debug("--->2 Episode finished with NEGATIVE reward, {}, {}".format(_state_score, _rejected))
        elif MAX_WEIGHT > _state_score >= goal_weight:
            _current_reward = POSITIVE_REWARD
            _is_terminal = True
            logging.debug("---> Episode finished with POSITIVE reward")
        elif _state_score < goal_weight:
            _current_reward = step_reward(_rejected, state)
        else:
            raise Exception("Unclassified state: {}, score={}, rejected={}".format(state, _state_score, _rejected))

        return _is_terminal, _current_reward, _state_score, _rejected
        
    
    for i in range(n_episodes):

        logging.debug("-- Episode : %i" % i)

        state = np.zeros((N_TYPES), dtype=np.uint8) if initial_state is None else initial_state.copy()        
        action = get_policy_action(state, action_value_function, available_gifts, epsilon=epsilon)
        logging.debug("Initial state/action: {}, {}".format(state, action))

        is_terminal, current_reward, score_min, rejected = _is_terminal_state(state, goal_weight)
        if is_terminal:
            logging.debug("Initial state is terminal state. Reward on state: %f" % current_reward)
            state_key = state_to_str(state)
            action_value_function[state_key] = deepcopy(NULL_ACTIONS_VALUES)
            if current_reward == POSITIVE_REWARD:
                if best_score < score_min:
                    best_score = score_min
                    best_state = state
                    best_rejected = rejected
            continue

        episode_length = 5**N_TYPES                        
        while not is_terminal:            
            episode_length -= 1 
            if episode_length < 0:
                logging.warning('Episode length is reached, but state score is still : %f / %f' % (score_min, goal_weight))
                break
             
            new_state = take_action(state, action)
            is_terminal, current_reward, score_min, rejected = _is_terminal_state(new_state, goal_weight)
            logging.debug("New state score, reward, new_state, action : {}, {}, {} <- {}".format(score_min, current_reward, new_state, action))                

#             print("\n --- New state score, reward, new_state, action : {}, {}, {} <- {}".format(score_min, current_reward, new_state, action))                
#             print("get_actions_values(state, action_value_function): ", get_actions_values(state, action_value_function))
#             print("get_actions_values(new_state, action_value_function): ", get_actions_values(new_state, action_value_function))
            
            if is_terminal:
#                 print "--- Terminal state is found ---"
                set_actions_values(new_state, action_value_function, NULL_ACTIONS_VALUES)
                if current_reward == POSITIVE_REWARD:
                    if best_score < score_min:
                        best_score = score_min
                        best_state = new_state
                        best_rejected = rejected

    
            # Update Q(s,a)
            actions_values = get_actions_values(state, action_value_function)
#             print("actions_values: ", actions_values) 
            action_value, action_index = find_value(action, actions_values, return_index=True)
#             print("action_value: ", action_value, action_index)
            # actions_values is a heap with first element being the smallest element
            # We store values in actions_values as POSITIVE_REWARD - Q(s,a)            
            v = POSITIVE_REWARD - action_value      
#             print "v: ", v
            new_actions_values = get_actions_values(new_state, action_value_function)            
#             print("new_actions_values: ", new_actions_values) 
            nv = POSITIVE_REWARD - new_actions_values[0][0]
#             print("nv : ", nv)
            t = alpha * (current_reward + gamma * nv - v)
            actions_values[action_index] = [POSITIVE_REWARD - (v + t), action]
#             print("-> actions_values: ", actions_values)
            heapq.heapify(actions_values)
#             print("--> actions_values: ", actions_values)
                            
            state = new_state
            action = get_policy_action(state, action_value_function, available_gifts, epsilon=epsilon)                        
                
    return action_value_function, best_score, best_state, best_rejected
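
The update step above stores `POSITIVE_REWARD - Q(s,a)` in a heap so that the greedy action sits at index 0 of a min-heap. The following minimal sketch isolates that trick; the helper names `make_heap`, `best_action` and `update` are hypothetical and do not appear in the notebook.

```python
import heapq

POSITIVE_REWARD = 1000  # same constant as in the notebook


def make_heap(q_by_action):
    """Build a min-heap of [POSITIVE_REWARD - Q, action] entries."""
    heap = [[POSITIVE_REWARD - q, a] for a, q in q_by_action.items()]
    heapq.heapify(heap)
    return heap


def best_action(heap):
    """Return (action, Q) for the entry with the largest Q value."""
    shifted, action = heap[0]
    return action, POSITIVE_REWARD - shifted


def update(heap, action, new_q):
    """Overwrite the value of one action and restore the heap invariant."""
    for entry in heap:
        if entry[1] == action:
            entry[0] = POSITIVE_REWARD - new_q
            break
    heapq.heapify(heap)


heap = make_heap({0: 5.0, 1: 12.0, 2: -3.0})
first = best_action(heap)    # action 1 has the largest Q
update(heap, 2, 20.0)
second = best_action(heap)   # action 2 now has the largest Q
```

Only `heap[0]` is guaranteed to be the best entry; the remaining positions of a heap are not in sorted order.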

In [36]:
def fill_one_bag(state, action_value_function, available_gifts=AVAILABLE_GIFTS):
    # Greedy rollout (epsilon=0) of the learned policy from the given state
    epsilon = 0.0
    action = get_policy_action(state, action_value_function, available_gifts, epsilon=epsilon)
    actions_values = get_actions_values(state, action_value_function)
    value = find_value(action, actions_values)
    trajectory = [(state, action, value)]
#     print "1 fill_one_bag: ", trajectory[-1]
    counter = 100
    while action is not None:
        state = take_action(state, action)
        s, r = compute_score(state)
        if r > 10 * REJECTED_BAGS_THRESHOLD:
            break        
        action = get_policy_action(state, action_value_function, available_gifts, epsilon=epsilon)
        actions_values = get_actions_values(state, action_value_function)
        value = find_value(action, actions_values)
        trajectory.append((state, action, value))
        
#         print "fill_one_bag: >>", trajectory[-1]

        counter -= 1
        if counter == 0:
            break
            
    if counter == 0:
        logging.warning("Counter is zero")
    return trajectory[-1][0], trajectory
        

## Single run test

In [38]:
REJECTED_BAGS_THRESHOLD = 0.05
alpha = 0.74
goal_weight = MAX_WEIGHT * alpha
print goal_weight
final_action_value_function = defaultdict(list)
#final_state = np.zeros((N_TYPES), dtype=np.uint8)

37.0


In [39]:
final_state = np.zeros((N_TYPES), dtype=np.uint8)
# final_state = np.array([2, 0, 2, 1, 0, 0, 1, 2, 0])

In [60]:
logging.getLogger().setLevel(logging.INFO)
final_action_value_function, best_score, best_state, best_rejected = q_learning(goal_weight, 
                                                                 AVAILABLE_GIFTS,
                                                                 initial_state=final_state,
                                                                 n_episodes=500, 
                                                                 alpha=0.75, 
                                                                 gamma=0.95, 
                                                                 epsilon=0.25, 
                                                                 action_value_function=final_action_value_function)

In [61]:
if best_state is not None:
    print best_score, best_state, best_rejected, score((best_state,), return_rejected=True)

37.1193011245 [7 0 1 1 0 0 1 2 0] 0.0388 (38.036851093448504, 0.01)


In [None]:
# best_state = np.array([0, 0, 0, 0, 0, 1, 1, 2, 1])

In [None]:
best_state, score_stats((best_state,), count=200), compute_score(best_state)

In [None]:
LIMIT_STATE

In [62]:
bag, trajectory = fill_one_bag(final_state, final_action_value_function)
print bag, score((bag,), return_rejected=True)
print trajectory

[7 0 1 1 0 0 1 2 0] (38.006247944341141, 0.02)
[(array([0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 6, 388.07991695118631), (array([0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=uint8), 2, 361.13675468545921), (array([0, 0, 1, 0, 0, 0, 1, 0, 0], dtype=uint8), 7, 332.77553124785175), (array([0, 0, 1, 0, 0, 0, 1, 1, 0], dtype=uint8), 0, 302.92161183984388), (array([1, 0, 1, 0, 0, 0, 1, 1, 0], dtype=uint8), 0, 271.49643351562509), (array([2, 0, 1, 0, 0, 0, 1, 1, 0], dtype=uint8), 7, 238.41729843750011), (array([2, 0, 1, 0, 0, 0, 1, 2, 0], dtype=uint8), 3, 203.59715625000013), (array([2, 0, 1, 1, 0, 0, 1, 2, 0], dtype=uint8), 0, 166.94437500000015), (array([3, 0, 1, 1, 0, 0, 1, 2, 0], dtype=uint8), 0, 128.36250000000007), (array([4, 0, 1, 1, 0, 0, 1, 2, 0], dtype=uint8), 0, 87.75), (array([5, 0, 1, 1, 0, 0, 1, 2, 0], dtype=uint8), 0, 45.0), (array([6, 0, 1, 1, 0, 0, 1, 2, 0], dtype=uint8), 0, 0.0), (array([7, 0, 1, 1, 0, 0, 1, 2, 0], dtype=uint8), None, 1000)]


In [None]:
len(final_action_value_function)

In [25]:
final_action_value_function

defaultdict(list,
            {'[0, 0, 0, 0, 0, 0, 0, 0, 0]': [[973.42129064503445, 0],
              [991.77671875, 3],
              [986.7825302734375, 6],
              [995.134375, 1],
              [995.5375, 4],
              [988.417474609375, 5],
              [995.5375, 2],
              [995.5375, 7],
              [1000, 8]],
             '[0, 0, 0, 0, 0, 0, 0, 1, 0]': [[995.5375, 1],
              [1000, 0],
              [999, 6],
              [1000, 2],
              [1000, 3],
              [1000, 4],
              [1000, 5],
              [1000, 7],
              [1000, 8]],
             '[0, 0, 0, 0, 0, 0, 1, 0, 0]': [[990.443935546875, 5],
              [1000, 0],
              [1000, 1],
              [1000, 2],
              [1000, 3],
              [1000, 4],
              [1000, 6],
              [1000, 7],
              [1000, 8]],
             '[0, 0, 0, 0, 0, 1, 0, 0, 0]': [[990.248671875, 5],
              [1000, 0],
              [1000, 1],
              [1

In [None]:
get_actions_values(np.array([11,  0,  0,  0,  0,  0,  1,  1,  0]), final_action_value_function)

In [None]:
# for count in [100, 200, 300]:
#     sc = []
#     rr = []
#     for i in range(200):
#         s, r = score((best_state,), return_rejected=True, count=count)
#         sc.append(s)
#         rr.append(r)

#     plt.figure(figsize=(12,4))
#     plt.subplot(131)    
#     plt.plot(sc)
#     plt.subplot(132)
#     plt.plot(rr)

## Action-value function estimation

In [80]:
REJECTED_BAGS_THRESHOLD = 0.05
alpha = 0.73
goal_weight = MAX_WEIGHT * alpha
print goal_weight

filled_bags = np.zeros((N_BAGS, N_TYPES), dtype=np.uint8)
final_action_value_function = defaultdict(list)
available_gifts = deepcopy(AVAILABLE_GIFTS)
bag_index = 0
found_goal_states = []

36.5


In [87]:
logging.getLogger().setLevel(logging.WARN)
n_episodes = 500

last_score_computation = -1
limit_fails = 20
while bag_index < N_BAGS:
    
    print("Found goal bags : ", bag_index, "/", N_BAGS)
    
    final_action_value_function, best_score, best_state, best_rejected = q_learning(goal_weight, 
                                                                 available_gifts,
                                                                 n_episodes=n_episodes, 
                                                                 alpha=0.75, 
                                                                 gamma=0.95, 
                                                                 epsilon=0.25, 
                                                                 action_value_function=final_action_value_function)
    if best_score > 0:
        print("- Got a result : ", best_score, best_state, best_rejected)
        update_available_gifts(available_gifts, best_state, GIFT_TYPES)
        
        if len(found_goal_states) == 0 or (found_goal_states[-1] != best_state).any():
            s, r = score((best_state,), return_rejected=True)
            found_goal_states.append(tuple(best_state.tolist()))

        filled_bags[bag_index, :] = best_state
        bag_index += 1
        
        limit_fails = 20
    else:
        print("No best state found")
        limit_fails -= 1
        
        
    if bag_index > 0 and (bag_index % 20) == 0 and last_score_computation < bag_index:
        s, r = score(filled_bags, return_rejected=True)
        print(">>> Current score: ", s, s * N_BAGS *1.0 / bag_index, "rejected=", r)
        last_score_computation = bag_index

    if bag_index > 0 and (bag_index % 30) == 0 and last_score_computation < bag_index:
        print(">>> Currently available gifts : ", [(k, available_gifts[k]) for k in GIFT_TYPES])
        last_score_computation = bag_index
        
    if limit_fails == 0:
        break

('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)
No best state found
('Found goal bags : ', 154, '/', 1000)


KeyboardInterrupt: 

In [103]:
goal_states_set = set(found_goal_states)
# print len(goal_states_set), goal_states_set
for s in goal_states_set:
    print compute_score(s), score((s,)), np.sum(s), s

(38.095119379939753, 0.068699999999999997) 39.9020672141 13 (9, 0, 1, 1, 0, 1, 0, 1, 0)
(37.648917892090012, 0.054100000000000002) 39.0520460559 13 (8, 0, 1, 0, 0, 1, 2, 1, 0)
(37.652930952391415, 0.019900000000000001) 39.0795947806 13 (10, 0, 1, 1, 0, 1, 0, 0, 0)
(36.860008345751922, 0.024) 36.9942234001 10 (5, 0, 1, 0, 0, 1, 1, 2, 0)
(37.21313118245547, 0.0246) 38.0633084521 12 (7, 0, 1, 0, 0, 1, 2, 1, 0)
(37.994752340646137, 0.059400000000000001) 40.2544859026 13 (8, 0, 1, 1, 0, 1, 1, 1, 0)
(38.086447938161811, 0.031099999999999999) 38.3438299552 12 (8, 0, 1, 1, 0, 1, 0, 1, 0)
(37.305573908206263, 0.053400000000000003) 38.6453509823 11 (6, 0, 1, 0, 0, 1, 1, 2, 0)
(38.581749387272772, 0.0591) 39.3041167162 13 (10, 0, 1, 0, 0, 1, 0, 1, 0)
(36.554818716494374, 0.0562) 36.9726439577 11 (5, 0, 1, 0, 0, 1, 2, 2, 0)
(36.894183567669636, 0.053800000000000001) 37.4729994437 11 (9, 0, 1, 0, 0, 0, 0, 0, 1)
(36.881457162385722, 0.035299999999999998) 37.3395409241 10 (5, 0, 1, 1, 0, 1, 0, 2, 0)


In [86]:
len(final_action_value_function), available_gifts

(50113,
 {'ball': 1,
  'bike': 500,
  'blocks': 846,
  'book': 1110,
  'coal': 166,
  'doll': 854,
  'gloves': 114,
  'horse': 803,
  'train': 999})

## Bag filling with estimated action-value function

In [88]:
final_state = np.zeros((N_TYPES), dtype=np.uint8)
bag, trajectory = fill_one_bag(final_state, final_action_value_function)
print bag, score((bag,), return_rejected=True)
print trajectory

[1 0 1 2 0 3 8 1 0] (26.792676700170855, 0.40999999999999998)
[(array([0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8), 3, 954.34947339290102), (array([0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=uint8), 7, 957.21000966574309), (array([0, 0, 0, 1, 0, 0, 0, 1, 0], dtype=uint8), 3, 960.22295164802063), (array([0, 0, 0, 2, 0, 0, 0, 1, 0], dtype=uint8), 6, 963.39580116421484), (array([0, 0, 0, 2, 0, 0, 1, 1, 0], dtype=uint8), 6, 966.74168746643977), (array([0, 0, 0, 2, 0, 0, 2, 1, 0], dtype=uint8), 5, 970.25972420855157), (array([0, 0, 0, 2, 0, 1, 2, 1, 0], dtype=uint8), 6, 973.95904034798252), (array([0, 0, 0, 2, 0, 1, 3, 1, 0], dtype=uint8), 6, 977.85280389231968), (array([0, 0, 0, 2, 0, 1, 4, 1, 0], dtype=uint8), 6, 981.95152122382638), (array([0, 0, 0, 2, 0, 1, 5, 1, 0], dtype=uint8), 6, 986.26481763694107), (array([0, 0, 0, 2, 0, 1, 6, 1, 0], dtype=uint8), 6, 990.8052251233006), (array([0, 0, 0, 2, 0, 1, 7, 1, 0], dtype=uint8), 6, 995.58892668741657), (array([0, 0, 0, 2, 0, 1, 8, 1, 0], dtype=uint8), 