<center><h1> Markov Decision Process: A Q-learning approach </h1></center>

<center><h2>The Markov property states that,
“The future is independent of the past given the present.”
Once the current state is known, the history of information encountered so far may be thrown away; that state is a sufficient statistic that gives us the same characterization of the future as if we had the entire history.</h2></center>

### Components of an MDP:


S: set of states
<br>
A: set of actions
<br>
R: reward function
<br>
P: transition probability function
<br>
γ: discount for future rewards
    
<p>In mathematical terms, a state $S_t$ has the Markov property if and only if<br>
$P[S_{t+1} \mid S_t] = P[S_{t+1} \mid S_1, \ldots, S_t]$.</p>
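To make the role of the discount γ from the component list concrete, here is a minimal sketch of a discounted return, where the reward received k steps in the future is weighted by γ^k. The function name `discounted_return` and the sample reward sequence are illustrative assumptions, not part of the agent defined later.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    total = 0.0
    # Walk backwards so each step folds in the already-discounted future sum.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three consecutive wins (+1 each) under two illustrative discount values.
print(discounted_return([1, 1, 1], gamma=0.3))
print(discounted_return([1, 1, 1], gamma=0.9))
```

A small γ makes the agent care mostly about the immediate reward, while a γ close to 1 gives later rewards almost the same weight as the current one.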

### Approach:

<p>An MDP goes beyond the multi-armed bandit approach, which ignores the state space and only estimates "arm weights" from rewards. The reward from a single turn is short-lived, however, so on its own it says little about which strategy the player should optimize for. By introducing a state space, we also take the actions of the previous round into account.</p>
<br>
<p>In this approach, each state is a tuple of the player's action and the opponent's action in the previous round, which gives 3 × 3 = 9 states. We can therefore initialize our Q-table (the policy) as a 2D matrix of shape (9, 3). Each element of the first dimension is one scenario in which the player took a particular action and the opponent took a particular action. Each element of the second dimension is one of the actions the player can take next.</p>
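The state encoding and a single table update can be sketched as follows. This is a minimal, pure-Python illustration under stated assumptions: the helper names `encode_state`, `decode_state` and `q_update` are hypothetical, actions are numbered 0 = rock, 1 = paper, 2 = scissors, and the update is the standard tabular Q-learning rule $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,(r + \gamma \max_{a'} Q(s',a'))$ with learning rate $\alpha$ and discount $\gamma$. The numeric values are chosen only for illustration.

```python
N_ACTIONS = 3  # assumed numbering: 0 = rock, 1 = paper, 2 = scissors


def encode_state(my_action, opponent_action):
    """Map a (my_action, opponent_action) pair to a row index in 0..8."""
    return my_action * N_ACTIONS + opponent_action


def decode_state(state_id):
    """Inverse of encode_state: recover (my_action, opponent_action)."""
    return divmod(state_id, N_ACTIONS)


def q_update(q_table, state, action, reward, next_state, lr, discount):
    """Apply one tabular Q-learning step in place."""
    target = reward + discount * max(q_table[next_state])
    q_table[state][action] = (1 - lr) * q_table[state][action] + lr * target


# 9 states x 3 actions; the comprehension gives every state its own row.
q_table = [[0.0] * N_ACTIONS for _ in range(N_ACTIONS * N_ACTIONS)]

# Hypothetical transition: the last round was (rock, scissors); we then
# played paper against rock and won, so the reward is +1.
state = encode_state(0, 2)
next_state = encode_state(1, 0)
q_update(q_table, state, action=1, reward=1, next_state=next_state,
         lr=0.7, discount=0.3)
print(state, q_table[state])
```

Only the entry for row 2 (rock, scissors) and action 1 (paper) changes; every other entry of the table stays at zero.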



### First we install the library
Let's go!

In [None]:
!pip install kaggle-environments
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from kaggle_environments import make, evaluate
import keras
import collections
import sys
import os

In [None]:
%%writefile mdp.py
import pandas as pd
import random
import numpy as np
from kaggle_environments.envs.rps.utils import get_score

# Util method for getting the state in the q table
def get_state(action, op_action):
    return action * 3 + op_action
# Current action
cur_action = 0
# Epsilon: Exploration rate
eps = 0.1
# History
history = []
# Q_table (Policies) Shape: (9, 3)
# Build each row separately so that updating one state does not change the others
policies = [[0.0] * 3 for _ in range(9)]
#Learning rate
lr = 0.7
# Discount rate for q_table
discount_rate = 0.3
# Epsilon decay rate
decay_rate = 0.9

def update_q_table(op_action):
    global policies
    global discount_rate
    global lr
    global history
    # Reward for the action we just played against the opponent's reply
    reward = get_score(cur_action, op_action)
    # Two recorded rounds are needed to form a (previous state, new state) pair
    if len(history) > 1:
        previous_state_id = get_state(history[-2][0], history[-2][1])
        state_id = get_state(cur_action, op_action)
        # Q-learning update: blend the old estimate with the bootstrapped target
        policies[previous_state_id][cur_action] = policies[previous_state_id][cur_action] * (1 - lr) \
        + lr * (reward + discount_rate * np.max(policies[state_id]))

def mdp(observation, configuration):
    global cur_action
    global eps
    global history
    global policies
    if observation.step > 0:
        # Record the finished round, then learn from it
        history.append([cur_action, observation.lastOpponentAction])
        update_q_table(observation.lastOpponentAction)

    if np.random.random() < eps:
        # Explore: play a random action and shrink the exploration rate
        cur_action = random.randint(0, 2)
        eps *= decay_rate
    elif observation.step > 0:
        # Exploit: best known action for the state formed by the last round
        state_id = get_state(cur_action, observation.lastOpponentAction)
        cur_action = int(np.argmax(policies[state_id]))
    else:
        # No previous round yet, so start with a random action
        cur_action = random.randint(0, 2)
    return cur_action

### Evaluation Result
<p>Although the approach is not perfect, it can beat most static strategies.</p>
<p>This version can be seen as a loose variant of the reactionary agent, so the counter_reactionary bot cannot counter it.</p>
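To illustrate why a table-based learner tends to beat a static strategy, here is a small self-contained simulation. It is a sketch under stated assumptions, not the Kaggle environment: `score` is a hypothetical stand-in for the library's scoring helper, `play_against_static` is an illustrative name, the opponent always repeats one move, and the exploration rate is kept fixed for simplicity.

```python
import random


def score(my_action, opponent_action):
    """+1 for a win, 0 for a tie, -1 for a loss (0 = rock, 1 = paper, 2 = scissors)."""
    diff = (my_action - opponent_action) % 3
    if diff == 1:
        return 1
    if diff == 2:
        return -1
    return 0


def play_against_static(opponent_action, steps=1000, lr=0.7, discount=0.3,
                        eps=0.1, seed=0):
    """Run an epsilon-greedy tabular Q-learner against an opponent that never changes its move."""
    rng = random.Random(seed)
    q_table = [[0.0] * 3 for _ in range(9)]
    state, total = None, 0
    for _ in range(steps):
        if state is None or rng.random() < eps:
            action = rng.randrange(3)      # first move or exploration
        else:
            row = q_table[state]
            action = row.index(max(row))   # exploit the best known action
        reward = score(action, opponent_action)
        next_state = action * 3 + opponent_action
        if state is not None:
            # Same Q-learning update form as in the agent above
            target = reward + discount * max(q_table[next_state])
            q_table[state][action] = (1 - lr) * q_table[state][action] + lr * target
        state, total = next_state, total + reward
    return total


for move, name in enumerate(["rock", "paper", "scissors"]):
    print(name, play_against_static(move))
```

Against an opponent that never changes its move, only three of the nine states are ever visited, so a handful of exploratory moves is enough for the table to lock onto the winning reply.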

In [None]:
env = make("rps", configuration={"episodeSteps": 1000}, debug=True)
env.reset()
env.run(["mdp.py", "statistical"])
env.render(mode="ipython", width=400, height=400)