# Q-learning tryouts on Santa's uncertain bags

## States 

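A state is a matrix of size `(N_BAGS, N_TYPES)` whose entry `s[i, j]` counts the toys of type `j` in bag `i`. For example, `s[0,:]=[1,0,1,0,0,0,0,0,0]` means that the first bag holds one toy of type 0 and one toy of type 2. The initial state is either the null matrix or a custom, user-defined matrix. Terminal states are defined by the state's score.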
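
As a minimal sketch of this representation, the snippet below builds a null state and fills in the first bag from the example above. The sizes `N_BAGS = 3` and `N_TYPES = 9` are illustrative assumptions; the notebook imports the real values from `utils`.

```python
import numpy as np

# Illustrative sizes (assumptions); the real values come from utils
N_BAGS, N_TYPES = 3, 9

# Null initial state: no toy in any bag
state = np.zeros((N_BAGS, N_TYPES), dtype=np.uint8)

# Bag 0 holds one toy of type 0 and one toy of type 2
state[0, :] = [1, 0, 1, 0, 0, 0, 0, 0, 0]

# Number of toys in each bag
toys_per_bag = state.sum(axis=1)
```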

How many states are there? There are at most `N_BAGS * 9^10` states.


## Actions

An action adds one toy, taken from the list of available toys, to a bag.
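
One way to make this concrete is to encode an action as a `(bag_index, toy_type)` pair and to keep a counter of the remaining toys per type. This encoding and the names `apply_action` and `available` are assumptions made for illustration, not definitions taken from `utils`.

```python
import numpy as np

def apply_action(state, available, action):
    """Return (new_state, new_available) after adding one toy to a bag.

    `action` is a hypothetical (bag_index, toy_type) pair.
    """
    bag_index, toy_type = action
    if available[toy_type] <= 0:
        raise ValueError("no toy of type %d left" % toy_type)
    # Work on copies so the caller's arrays are left untouched
    new_state = state.copy()
    new_available = available.copy()
    new_state[bag_index, toy_type] += 1
    new_available[toy_type] -= 1
    return new_state, new_available

state = np.zeros((2, 3), dtype=np.uint8)   # 2 bags, 3 toy types (toy sizes)
available = np.array([1, 2, 0])            # remaining toys per type
state, available = apply_action(state, available, (1, 0))
```

Returning copies keeps every visited state intact, which is convenient when states are later used as keys of a value table.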


## Rewards

The reward of an action can be defined as the score of the bag to which the toy was added.
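
A minimal sketch of such a reward follows. It assumes fixed toy weights and a scoring rule where a bag scores its total weight unless it exceeds the capacity, in which case it scores zero. `TOY_WEIGHTS`, `bag_score`, `action_reward` and the value of `MAX_WEIGHT` are hypothetical; the notebook's own `score` and `bag_weight` come from `utils` and may behave differently.

```python
import numpy as np

MAX_WEIGHT = 50.0                          # assumed bag capacity
TOY_WEIGHTS = np.array([2.0, 10.0, 25.0])  # hypothetical fixed weight per toy type

def bag_score(bag):
    """Total weight of one bag, or 0 if it exceeds MAX_WEIGHT (assumed rule)."""
    weight = float(np.dot(bag, TOY_WEIGHTS))
    return weight if weight <= MAX_WEIGHT else 0.0

def action_reward(new_state, bag_index):
    """Reward of an action: score of the bag that just received a toy."""
    return bag_score(new_state[bag_index])

state = np.array([[1, 0, 1],    # 2 + 25 = 27, within the limit
                  [0, 1, 2]])   # 10 + 50 = 60, over the limit
```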


## Q-learning: Off-Policy Temporal Difference Control

In this algorithm, we estimate the action-value function $Q(s,a)$ with the update:
$$
Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t,A_t) \right], \, Q(\cal{S}^{+},a)=0
$$
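
To see the update at work on numbers, here is a small sketch of a single step. The function name `q_update` and the numeric values are illustrative assumptions.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    """One Q-learning update; pass an empty list when S' is terminal."""
    best_next = max(next_q_values) if len(next_q_values) > 0 else 0.0
    td_target = reward + gamma * best_next
    return q_sa + alpha * (td_target - q_sa)

# Q(S,A) = 2.0, R = 1.0, Q(S', .) = [0.5, 3.0], alpha = 0.5, gamma = 0.9
# target = 1.0 + 0.9 * 3.0 = 3.7, so the estimate moves halfway from 2.0 to 3.7
new_q = q_update(2.0, 1.0, [0.5, 3.0], alpha=0.5, gamma=0.9)

# Terminal next state: the target reduces to the reward alone
terminal_q = q_update(2.0, 1.0, [], alpha=0.5, gamma=0.9)
```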

**Algorithm**
<br>
<div style="background-color: #aaaaaa; padding: 10px; width: 75%; border: solid black; border-radius: 5px;">

    Initialize $Q(s, a)$, for all $s \in \cal{S}$, $a \in \cal{A}(s)$, arbitrarily, and $Q(\text{terminal-state}, \cdot) = 0$<br>
    Repeat (for each episode):<br>
    &emsp;Initialize $S$<br>
    &emsp;Repeat (for each step of episode):<br>
    &emsp;&emsp;Choose $A$ from $S$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)<br>
    &emsp;&emsp;Take action $A$, observe $R$, $S'$<br>
    &emsp;&emsp;$Q(S,A) \leftarrow Q(S,A) + \alpha \left[ R + \gamma \max_{a}Q(S', a) - Q(S,A) \right]$<br>
    &emsp;&emsp;$S \leftarrow S'$<br>
    &emsp;until $S$ is terminal
</div>
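
The loop can be sketched on a deliberately tiny problem. The chain environment below (states `0..4`, actions left and right, reward 1 on reaching the last state) is a hypothetical stand-in chosen only to keep the example self-contained; it is not the bag-filling task, and `step` and `q_learning_chain` are illustrative names.

```python
import numpy as np

N_STATES, TERMINAL = 5, 4        # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (-1, +1)               # move left / move right

def step(state, action_index):
    """Deterministic transition; reward 1 only when the terminal state is reached."""
    next_state = min(max(state + ACTIONS[action_index], 0), TERMINAL)
    return next_state, (1.0 if next_state == TERMINAL else 0.0)

def q_learning_chain(n_episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=2016):
    rng = np.random.RandomState(seed)
    Q = np.zeros((N_STATES, len(ACTIONS)))   # Q(terminal, .) is never updated, stays 0
    for _ in range(n_episodes):
        s = 0
        while s != TERMINAL:
            # epsilon-greedy choice derived from Q, ties broken at random
            if rng.rand() < epsilon:
                a = rng.randint(len(ACTIONS))
            else:
                best = np.flatnonzero(Q[s] == Q[s].max())
                a = int(rng.choice(best))
            s_next, r = step(s, a)
            # off-policy target: greedy value of the next state
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q

Q = q_learning_chain()
greedy_policy = np.argmax(Q[:TERMINAL], axis=1)   # best action in each non-terminal state
```

Breaking ties at random matters here: with an all-zero initial table, a plain `argmax` would always pick the first action and the agent would rarely reach the rewarding state.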

In [1]:
# https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [18]:
from time import time
from copy import deepcopy
import numpy as np
np.random.seed(2016)

import logging
logging.getLogger().setLevel(logging.DEBUG)

In [4]:
import sys
sys.path.append('../common')
from utils import weight3 as weight_fn, weight_by_index
from utils import bag_weight, score
from utils import MAX_WEIGHT, AVAILABLE_GIFTS, GIFT_TYPES, N_TYPES, N_BAGS

In [16]:
initial_state = np.zeros((N_BAGS, N_TYPES), dtype=np.uint8)
alpha = 0.72
goal_weight = MAX_WEIGHT * N_BAGS * alpha

print(goal_weight)

36000.0


In [17]:
score(initial_state)

0.0

In [23]:
str(initial_state)

'[[0 0 0 ..., 0 0 0]\n [0 0 0 ..., 0 0 0]\n [0 0 0 ..., 0 0 0]\n ..., \n [0 0 0 ..., 0 0 0]\n [0 0 0 ..., 0 0 0]\n [0 0 0 ..., 0 0 0]]'

In [27]:
from collections import defaultdict

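def state_to_str(state):
    # str(state) abbreviates large arrays with "...", so distinct states
    # could share one key; the raw bytes give an unambiguous, hashable key
    # (for arrays of the same shape and dtype).
    return state.tobytes()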
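
Note that NumPy abbreviates large arrays when converting them to a string (the `...` in the output above), so two different states can end up with the same string key. The sketch below illustrates the collision on an illustrative 1000 x 9 matrix and compares it with `tobytes()`, one possible collision-free key for arrays of equal shape and dtype.

```python
import numpy as np

a = np.zeros((1000, 9), dtype=np.uint8)
b = a.copy()
b[500, 4] = 1                      # differs from a only in the middle of the matrix

# With default print options the middle rows and columns are elided
same_str = str(a) == str(b)

# The raw buffer keeps every entry
same_bytes = a.tobytes() == b.tobytes()
```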

In [None]:
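def inv_cdf(cdf, u):
    """Inverse-CDF sampling: return the index i such that cdf[i-1] <= u < cdf[i]."""
    # side='right' yields exactly that index; the clamp covers the case where
    # rounding leaves cdf[-1] slightly below 1 and u falls above it.
    return min(int(np.searchsorted(cdf, u, side='right')), len(cdf) - 1)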

def get_policy_action(state, policy_dict, return_index=False):
    # ACTIONS (the list of possible actions) is not defined in this notebook yet
    action_probas = policy_dict[state_to_str(state)]
    cdf = np.cumsum(action_probas)
    # Draw one action index according to the policy's probabilities
    index = inv_cdf(cdf, np.random.rand())
    if return_index:
        return ACTIONS[index], index
    return ACTIONS[index]
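
The sampling step can be tried in isolation. The sketch below builds an epsilon-soft probability vector from some made-up action values and draws indices by inverse-CDF lookup; `epsilon_soft` and `sample_index` are illustrative names, and the uniform draws are fixed by hand so that the result is reproducible.

```python
import numpy as np

def epsilon_soft(q_values, epsilon):
    """Action probabilities: epsilon spread uniformly, the rest on the greedy action."""
    n = len(q_values)
    probas = np.full(n, epsilon / n)
    probas[int(np.argmax(q_values))] += 1.0 - epsilon
    return probas

def sample_index(probas, u):
    """Inverse-CDF sampling: first index whose cumulative probability exceeds u."""
    cdf = np.cumsum(probas)
    return min(int(np.searchsorted(cdf, u, side='right')), len(cdf) - 1)

# Greedy action is index 1, so it receives 1 - epsilon + epsilon / 3
probas = epsilon_soft(np.array([0.1, 0.7, 0.2]), epsilon=0.3)

# Hand-picked uniform draws, one falling in each probability interval
picks = [sample_index(probas, u) for u in (0.05, 0.5, 0.95)]
```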

In [None]:
def q_learning(goal_weight,
               available_gifts,
               initial_state=None,
               n_episodes=10, alpha=0.75, gamma=0.7, epsilon=0.001, action_value_function=None):

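    # Policy keyed by state_to_str(state). Unseen states start from a uniform
    # distribution over ACTIONS (an assumption; ACTIONS is not defined here yet).
    policy_dict = defaultdict(lambda: np.ones(len(ACTIONS)) / len(ACTIONS))

    for i in range(n_episodes):

        episode_length = 1000
        # Copy so that an episode never mutates the caller's initial state
        state = np.zeros((N_BAGS, N_TYPES)) if initial_state is None else initial_state.copy()

        action, action_index = get_policy_action(state, policy_dict, return_index=True)

        state_score = score(state)
        # Keep adding toys until the goal weight is reached
        while state_score < goal_weight:

            episode_length -= 1
            if episode_length < 0:
                logging.warning('Episode length is reached, but state score is still : %f / %f' % (state_score, goal_weight))
                break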
            
            # NOTE: the transition and reward logic below still comes from a
            # grid-world example where a state is an (x, y) cell. take_action,
            # on_grid, on_bridge, in_pit, at_end_position, REWARDS and the
            # array-valued `policy` are not defined in this notebook, so this
            # part has to be adapted to bag states before it can run.
            x, y = state

            current_reward = 0
            nx, ny = take_action(state, action)
            if not on_grid(nx, ny):
                nx, ny = x, y
                current_reward = REWARDS[2]
            elif on_bridge(nx, ny) and not at_end_position(nx, ny):
                current_reward = REWARDS[2]
            elif in_pit(nx, ny):
                current_reward = REWARDS[0]
            elif at_end_position(nx, ny):
                current_reward = REWARDS[1]

            new_state = (nx, ny)
            new_action, new_action_index = get_policy_action(new_state, policy, return_index=True)

            # Update Q(s,a)
            v = action_value_function[y, x, action_index]
            nv = np.max(action_value_function[ny, nx, :])
            action_value_function[y, x, action_index] += alpha * (current_reward + gamma * nv - v)

            # Update policy from Q(s,a) using epsilon-soft strategy
            n_actions = action_value_function.shape[-1]
            action_star_index = np.argmax(action_value_function[y, x, :])
            policy[y, x, :] = epsilon / n_actions
            policy[y, x, action_star_index] = 1.0 - epsilon + epsilon / n_actions

            state, action, action_index = new_state, new_action, new_action_index
            state_score = score(state)

    return policy, action_value_function
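
For comparison, here is a minimal sketch of how the same loop could look when states are bag matrices. Everything specific in it is an assumption made for illustration: the tiny problem size, the fixed `TOY_WEIGHTS`, the overflow rule in `bag_score`, the `(bag, toy type)` action encoding, and the dictionary-based value table keyed by `state.tobytes()`. The local `score` is a stand-in and not the function from `utils`.

```python
from collections import defaultdict
import numpy as np

# Hypothetical toy problem (assumptions): 2 bags, 2 toy types with fixed weights
N_BAGS, N_TYPES, MAX_WEIGHT = 2, 2, 10.0
TOY_WEIGHTS = np.array([4.0, 7.0])
ACTIONS = [(b, t) for b in range(N_BAGS) for t in range(N_TYPES)]  # (bag, toy type)

def bag_score(bag):
    """Weight of the bag, or 0 if it exceeds MAX_WEIGHT (assumed scoring rule)."""
    weight = float(np.dot(bag, TOY_WEIGHTS))
    return weight if weight <= MAX_WEIGHT else 0.0

def score(state):
    return sum(bag_score(bag) for bag in state)

def legal_actions(left):
    """Indices of actions whose toy type is still available."""
    return [i for i, (_, toy) in enumerate(ACTIONS) if left[toy] > 0]

def q_learning_bags(goal_weight, available_gifts, n_episodes=300,
                    alpha=0.5, gamma=0.9, epsilon=0.1, seed=2016):
    rng = np.random.RandomState(seed)
    Q = defaultdict(lambda: np.zeros(len(ACTIONS)))  # state key -> value per action
    for _ in range(n_episodes):
        state = np.zeros((N_BAGS, N_TYPES), dtype=np.uint8)
        left = np.array(available_gifts)
        done = score(state) >= goal_weight or not legal_actions(left)
        while not done:
            key, legal = state.tobytes(), legal_actions(left)
            if rng.rand() < epsilon:                      # explore
                a = legal[rng.randint(len(legal))]
            else:                                         # exploit
                a = max(legal, key=lambda i: Q[key][i])
            bag, toy = ACTIONS[a]
            state = state.copy()
            state[bag, toy] += 1
            left[toy] -= 1
            reward = bag_score(state[bag])                # score of the bag just filled
            next_legal = legal_actions(left)
            done = score(state) >= goal_weight or not next_legal
            best_next = 0.0 if done else max(Q[state.tobytes()][i] for i in next_legal)
            Q[key][a] += alpha * (reward + gamma * best_next - Q[key][a])
    return Q

Q = q_learning_bags(goal_weight=15.0, available_gifts=[2, 2])
initial_key = np.zeros((N_BAGS, N_TYPES), dtype=np.uint8).tobytes()
```

Because the number of remaining toys follows from the state and the initial stock, the state matrix alone is enough as a key in this sketch. A dictionary avoids allocating a table over the full state space, which would be far too large for the real problem.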