In this competition, our task is to assign a day to each of 5000 families. If we change the assigned day of one family, we get a different score. We can treat this as a `Markov decision process (MDP)` and frame the problem like this:

* S is the state: the assigned day of each family
* A is the action: how to reassign the day of one family
* R is the reward: how much the score decreases

Both the state after the transition $S'$ and the reward $R$ depend only on the current state $S$ and the action $A$.  
Now we'd like to know which action $A_S$ in the current state gets the maximum $R$.
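
To make the framing concrete, here is a minimal sketch on a toy problem. Everything in it (`toy_people`, `toy_first_choice`, `toy_score`, `toy_step`, and the cost of 10 per person) is a made-up stand-in for illustration, not the competition data or its cost function.

```python
# Toy stand-in for the real problem: 4 families, days numbered 0..2.
# All names and numbers here are hypothetical, not the competition data.
toy_people = [2, 3, 4, 5]          # family sizes
toy_first_choice = [0, 1, 1, 2]    # preferred day of each family

def toy_score(state):
    """Hypothetical cost: 10 per person for every family not on its preferred day."""
    return sum(10 * n
               for n, day, want in zip(toy_people, state, toy_first_choice)
               if day != want)

def toy_step(state, family_id, new_day):
    """Apply the action (family_id, new_day) and return (next_state, reward)."""
    next_state = list(state)
    next_state[family_id] = new_day
    # The reward is how much the score decreases, so it is positive for an improvement.
    reward = toy_score(state) - toy_score(next_state)
    return next_state, reward

state = [0, 0, 1, 2]               # family 1 is not on its preferred day
next_state, reward = toy_step(state, family_id=1, new_day=1)
print(next_state, reward)
```

The next state and the reward are computed from the current state and the action alone, which is the Markov property assumed here.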

In this notebook, I'll try a Q-learning approach, simply because I wanted to do something new at the end of the year. Thanks to Google, Wikipedia, Kaggle and our great community! Any comments are welcome (I don't really understand the method yet). Have a happy new year!

In [None]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm as tqdm
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('../input/santa-workshop-tour-2019/family_data.csv')
print(df.shape)
df.head()

## Score function

The cost function below follows the starter notebook: https://www.kaggle.com/inversion/santa-s-2019-starter-notebook

In [None]:
family_size_dict = df[['n_people']].to_dict()['n_people']

cols = [f'choice_{i}' for i in range(10)]
choice_dict = df[cols].to_dict()

N_DAYS = 100
MAX_OCCUPANCY = 300
MIN_OCCUPANCY = 125

# from 100 to 1
days = list(range(N_DAYS,0,-1))

def cost_function(prediction):

    penalty = 0

    # We'll use this to count the number of people scheduled each day
    daily_occupancy = {k:0 for k in days}
    
    # Looping over each family; d is the day for each family f
    for f, d in enumerate(prediction):

        # Using our lookup dictionaries to make simpler variable names
        n = family_size_dict[f]
        choice_0 = choice_dict['choice_0'][f]
        choice_1 = choice_dict['choice_1'][f]
        choice_2 = choice_dict['choice_2'][f]
        choice_3 = choice_dict['choice_3'][f]
        choice_4 = choice_dict['choice_4'][f]
        choice_5 = choice_dict['choice_5'][f]
        choice_6 = choice_dict['choice_6'][f]
        choice_7 = choice_dict['choice_7'][f]
        choice_8 = choice_dict['choice_8'][f]
        choice_9 = choice_dict['choice_9'][f]

        # add the family member count to the daily occupancy
        daily_occupancy[d] += n

        # Calculate the penalty for not getting top preference
        if d == choice_0:
            penalty += 0
        elif d == choice_1:
            penalty += 50
        elif d == choice_2:
            penalty += 50 + 9 * n
        elif d == choice_3:
            penalty += 100 + 9 * n
        elif d == choice_4:
            penalty += 200 + 9 * n
        elif d == choice_5:
            penalty += 200 + 18 * n
        elif d == choice_6:
            penalty += 300 + 18 * n
        elif d == choice_7:
            penalty += 300 + 36 * n
        elif d == choice_8:
            penalty += 400 + 36 * n
        elif d == choice_9:
            penalty += 500 + 36 * n + 199 * n
        else:
            penalty += 500 + 36 * n + 398 * n

    # for each date, check total occupancy
    #  (using soft constraints instead of hard constraints)
    for _, v in daily_occupancy.items():
        if (v > MAX_OCCUPANCY) or (v < MIN_OCCUPANCY):
            penalty += 100000000

    # Calculate the accounting cost
    # The first day (day 100) is treated special
    accounting_cost = (daily_occupancy[days[0]]-125.0) / 400.0 * daily_occupancy[days[0]]**(0.5)
    # using the max function because the soft constraints might allow occupancy to dip below 125
    accounting_cost = max(0, accounting_cost)
    
    # Loop over the rest of the days, keeping track of previous count
    yesterday_count = daily_occupancy[days[0]]
    for day in days[1:]:
        today_count = daily_occupancy[day]
        diff = abs(today_count - yesterday_count)
        accounting_cost += max(0, (daily_occupancy[day]-125.0) / 400.0 * daily_occupancy[day]**(0.5 + diff / 50.0))
        yesterday_count = today_count

    penalty += accounting_cost

    return penalty

submission = pd.read_csv('../input/santa-workshop-tour-2019/sample_submission.csv')
cost_function(submission['assigned_day'])

In [None]:
ary_n_people = df.n_people.values
ary_n_people

In [None]:
ary_choices = df.iloc[:, 1:-1].values-1
ary_choices

## data

For simplicity, I'll use a binary matrix to represent the internal state. `data` has one row per family and one column per workshop day; an entry is 1 if the family is assigned to that day and 0 otherwise.

As a starting point, I filled it with the assigned days from sample_submission.
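
As a small illustration of this representation, the sketch below builds a one-hot matrix for a made-up example of 4 families and 3 days, recovers the assigned days with `argmax`, and gets the daily occupancy from a matrix product. The names `toy_days`, `toy_people` and `one_hot` are hypothetical and only serve this example.

```python
import numpy as np

# Hypothetical toy data: 4 families, 3 days (days are 1-indexed as in the submission).
toy_days = np.array([3, 1, 3, 2])     # assigned day of each family
toy_people = np.array([4, 2, 5, 3])   # size of each family
n_days = 3

# One row per family, one column per day, 1 where the family is assigned.
one_hot = np.zeros((len(toy_days), n_days))
one_hot[np.arange(len(toy_days)), toy_days - 1] = 1

# argmax along each row gives back the 0-indexed day; add 1 for the 1-indexed day.
recovered = np.argmax(one_hot, axis=1) + 1

# Number of people per day: family sizes weighted by the assignment matrix.
occupancy = toy_people @ one_hot

print(recovered)
print(occupancy)
```

One reason this layout is convenient: moving a family only means rewriting one row, and occupancy per day is a single vector-matrix product.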

In [None]:
N_FAMILY = len(df)
N_DAYS = 100
data = np.zeros((N_FAMILY, N_DAYS))
for i, d in enumerate(submission['assigned_day']):
    data[i, d-1]=1
# for i, d in enumerate(ary_choices[:,0]):
#     data[i, d-1]=1
data

### State

This feels like feature engineering. You could add other state features, such as family size, whether occupancy is below 125 ($N<125$), or the difference from the previous day's occupancy $N_{d-1}$.

$S_{family\_id,day}$

|key|meaning|min|max|
|--|--|--|--|
|choice_order|rank of the given day among this family's choices|0 (choice_0)|10 (other)|
|occupancy_flag0|is the total number of people on the choice_0 day over 300? $N_{choice\_0}>300$|0|1|
|occupancy_flag1|is the total number of people on the choice_1 day over 300? $N_{choice\_1}>300$|0|1|
|occupancy_flag2|is the total number of people on the choice_2 day over 300? $N_{choice\_2}>300$|0|1|
|occupancy_flag3|is the total number of people on the choice_3 day over 300? $N_{choice\_3}>300$|0|1|
|occupancy_flag4|is the total number of people on the choice_4 day over 300? $N_{choice\_4}>300$|0|1|
|occupancy_flag5|is the total number of people on the choice_5 day over 300? $N_{choice\_5}>300$|0|1|
|occupancy_flag6|is the total number of people on the choice_6 day over 300? $N_{choice\_6}>300$|0|1|
|occupancy_flag7|is the total number of people on the choice_7 day over 300? $N_{choice\_7}>300$|0|1|
|occupancy_flag8|is the total number of people on the choice_8 day over 300? $N_{choice\_8}>300$|0|1|
|occupancy_flag9|is the total number of people on the choice_9 day over 300? $N_{choice\_9}>300$|0|1|

In [None]:
N_STATE = 11 * 2**10
N_STATE

In [None]:
def get_state(data, family_id, day):
    choice_order = np.where(ary_choices[family_id] == day)[0]
    if choice_order.shape[0] > 0:
        choice_order = choice_order[0]
    else:
        choice_order = 10
    occupancy_flag = np.zeros(10)
    for i, choiced_day in enumerate(ary_choices[family_id]):
        if (data[:,choiced_day] * ary_n_people).sum() > 300:
            occupancy_flag[i] = 1
    return choice_order, occupancy_flag

get_state(data, 0, np.argmax(data[0]))

In [None]:
def get_state_row(data, family_id):
    day = np.argmax(data[family_id])
    choice_order, occupancy_flag = get_state(data, family_id, day)
    res = choice_order + 11 * np.sum(occupancy_flag * np.array([512,256,128,64,32,16,8,4,2,1]))
    return res.astype(int)

get_state_row(data, 0)
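
The state index packs `choice_order` (0 to 10) and the ten occupancy flags into one integer. The sketch below re-implements that packing in plain Python and adds an inverse, `decode_state`, to show that the mapping is one-to-one and stays below `N_STATE`. `encode_state`, `decode_state` and `WEIGHTS` are names introduced here for illustration; they are not part of the notebook.

```python
# Bit weights as in get_state_row: the flag for choice_0 is the most significant bit.
WEIGHTS = [512, 256, 128, 64, 32, 16, 8, 4, 2, 1]

def encode_state(choice_order, occupancy_flags):
    """Pack choice_order (0..10) and ten 0/1 flags into one row index."""
    flag_code = sum(w * f for w, f in zip(WEIGHTS, occupancy_flags))
    return choice_order + 11 * flag_code

def decode_state(state_row):
    """Hypothetical inverse of encode_state, useful when inspecting q_table rows."""
    flag_code, choice_order = divmod(state_row, 11)
    flags = [(flag_code >> shift) & 1 for shift in range(9, -1, -1)]
    return choice_order, flags

flags = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
row = encode_state(3, flags)
print(row, decode_state(row))

# The largest possible index is one less than 11 * 2**10.
print(encode_state(10, [1] * 10))
```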

### Action
$A_{family\_id,day}$

|value|meaning|
|--|--|
|0-9|move to choice_0 through choice_9|
|10|do not move|
|11|move to the day with the fewest people|

In [None]:
N_ACTION = 12

In [None]:
def get_action(next_state_row, episode, q_table):
    epsilon = 0.5 * (0.99** episode)
    if epsilon <= np.random.uniform(0, 1): 
        next_action = np.argmax(q_table[next_state_row])
    else:
        next_action = np.random.choice(range(N_ACTION))
    return next_action
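
To get a feel for the exploration schedule in `get_action`, this small sketch evaluates $\epsilon = 0.5 \times 0.99^{episode}$ at a few episodes. `epsilon_at` is a helper name introduced here for illustration only.

```python
def epsilon_at(episode, start=0.5, decay=0.99):
    """Probability of picking a random action at the given episode."""
    return start * decay ** episode

for episode in (0, 100, 500, 1000):
    print(episode, epsilon_at(episode))
```

Under this schedule, random actions become very rare after roughly the first thousand episodes, so most of a 10,000-episode run is nearly greedy.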

## Reward

In [None]:
def get_score(data):
    pred = 1+np.argmax(data, axis=1)
    return round(cost_function(pred))

get_score(data)

In [None]:
def update_data(data, family_id, action):
    if action<10:
        # move to choice_0 - choice_9
        new_day = ary_choices[family_id, action]
    elif action == 10:
        # no action
        return data
    elif action == 11:
        # move to the day with the fewest people
        new_day = np.argmin([np.sum(data[:,i]*ary_n_people) for i in range(N_DAYS)])
    new_data = data.copy()
    new_row = np.zeros(N_DAYS)
    new_row[new_day] = 1
    new_data[family_id] = new_row
    return new_data

get_score(update_data(data, 0, 11))

## Q table

`q_table` is a matrix of expected rewards: rows are states, columns are actions.

In the Q-learning approach, `q_table` is the model. It is updated with reward feedback, and the next action is selected with `argmax(q_table[state])`.
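
For reference, the standard tabular Q-learning update can be written as follows, where $\alpha$ is the learning rate and $\gamma$ is the discount factor (`update_qtable` in this notebook uses $\alpha = 0.5$ and $\gamma = 0.99$):

$$Q(S, A) \leftarrow (1-\alpha)\,Q(S, A) + \alpha\left(R + \gamma \max_{A'} Q(S', A')\right)$$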

In [None]:
def make_qtable(*size):
    return np.random.uniform(low = -1, high = 1, size = size)

make_qtable(N_STATE, N_ACTION).shape

In [None]:
def update_qtable(q_table, state, action, reward, next_state):
    gamma = 0.99
    alpha = 0.50
    next_max = max(q_table[next_state])
    q_table[state, action] = (1-alpha)*q_table[state, action] +\
    alpha * (reward + gamma * next_max)
    return q_table
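
As a quick sanity check of the update above, here is one update worked through with made-up numbers. `q_update` is a hypothetical scalar helper that applies the same formula to plain floats instead of a table.

```python
def q_update(q_old, reward, next_max, alpha=0.50, gamma=0.99):
    """One tabular Q-learning update on scalars, same formula as update_qtable."""
    return (1 - alpha) * q_old + alpha * (reward + gamma * next_max)

# Old estimate 0.2, clipped reward 1, best value in the next state 0.8.
new_q = q_update(q_old=0.2, reward=1, next_max=0.8)
print(new_q)
```

With `alpha = 0.5`, the new value is the average of the old estimate and the target `reward + gamma * next_max`.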

## learn

* reward clipping `np.clip(reward, -1, 1)`
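
Because `get_score` returns a rounded, integer-valued score, clipping the score difference to [-1, 1] effectively keeps only its sign. A minimal plain-Python sketch of this effect follows; `clip_reward` is a hypothetical helper which, for scalars, is assumed to behave like `np.clip(x, -1, 1)`.

```python
def clip_reward(score_delta, low=-1, high=1):
    """Clamp a scalar to [low, high]; a scalar stand-in for np.clip."""
    return max(low, min(high, score_delta))

# Hypothetical score differences (old score minus new score).
score_deltas = [100_000_000, 57, 0, -3, -250]
clipped = [clip_reward(d) for d in score_deltas]
print(clipped)
```

A move that removes a 100,000,000 occupancy penalty and a move that saves a few points therefore produce the same reward of 1.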

In [None]:
def learn(data, q_table=None, n_loop=10_000, step=10):
    if not hasattr(q_table, 'shape'):
        q_table = make_qtable(N_STATE, N_ACTION)
    history = []
    best_data = data.copy()
    best_score = get_score(data)
    score = best_score
    reward = 0   
    with tqdm(total=n_loop) as pbar:
        for episode in range(n_loop):
            new_data = best_data.copy()
            state = get_state_row(new_data, 0)
            action = np.argmax(q_table[state])

            for family_id in np.random.choice(range(N_FAMILY), step):
                next_state = get_state_row(new_data, family_id)
                q_table = update_qtable(q_table, state, action, reward, next_state)
                action = get_action(next_state, episode, q_table)
                state = next_state
                new_data = update_data(new_data, family_id, action)

            new_score = get_score(new_data)
            reward = np.clip(score - new_score, -1, 1)
            score = new_score

            if best_score > new_score:
                best_data = new_data.copy()
                best_score = new_score
            history.append(best_score)
            pbar.set_description(f'best_score={best_score:,}')
            pbar.update()
    plt.plot(history)
    plt.show()
    return best_data, q_table

In [None]:
best_data, q_table = learn(data, n_loop=30_000, step=5)

The score looks like it has plateaued, so let's search for the best `step` parameter.

In [None]:
for i in [3,5,10,15,20,30,50,100]:
    print(f'step={i}')
    _, _ = learn(best_data, n_loop=100, step=i)

## 2nd learn

* step=3
* n_loop=10_000
* 100 rounds of learning, resetting q_table each round

In [None]:
for i in range(100):
    best_data, _ = learn(best_data, step=3)

## submission

In [None]:
pred = np.argmax(best_data, axis=1)+1
pred

In [None]:
cost_function(pred)

In [None]:
submission['assigned_day'] = pred
submission.to_csv('submission.csv', index=False)
submission.head()