# Model-Free Control
In the last lesson we saw how to evaluate a given policy. We will now improve the policy gradually in order to find the best one.

Since we plan to act without a model of the MDP, we cannot select the best action from the state value function alone: the greedy improvement step needs the rewards and transition probabilities to compute the following max:
$$ \pi'(s) = \text{argmax}_{a \in \mathcal A}\left(\mathcal R_s^a + \sum_{s'} \mathcal P_{ss'}^a V(s')\right) $$

However, acting greedily with respect to the **action value function** is possible without knowledge of the MDP:

$$ \pi'(s) = \text{argmax}_{a \in \mathcal A}Q(s,a)$$

Model-free control is therefore based on the action value function only.
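To make this concrete, here is a minimal sketch of greedy action selection from a Q table. The table values and the helper name `greedy_action` are made up for illustration; the point is only that no reward or transition model appears anywhere.

```python
# Hypothetical Q table: one dict of action -> value per state.
# The numbers are illustrative, not learned values.
q_table = {
    "C1": {"Study": 1.0, "Facebook": -0.5},
    "C2": {"Study": 2.0, "Sleep": 0.0},
}

def greedy_action(q_table, state):
    """Return the action with the highest Q value in `state`."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)

print(greedy_action(q_table, "C1"))
```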

In our discrete MDP, to control our agent we now store and estimate return values for state-action pairs. This gives the famous Q table, which contains a Q value for every state and every action available in that state.


## MC or TD control

The Q values in the Q table can be updated towards either an MC target (the full return of the episode) or a TD target (a bootstrapped estimate) for the action value function.
With a TD(0) or TD($\lambda$) target, the algorithm is called SARSA or SARSA($\lambda$).
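As a rough sketch of the TD(0) case, one SARSA update could look like the following. The function name `sarsa_update`, the `done` flag, and the default `alpha` and `gamma` are assumptions made for this example.

```python
def sarsa_update(q_table, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.9, done=False):
    """One SARSA(0) step: move Q(s, a) towards r + gamma * Q(s', a')."""
    if done:
        # No bootstrapping from a terminal transition.
        target = reward
    else:
        # Bootstrap from the action actually chosen in the next state.
        target = reward + gamma * q_table[next_state][next_action]
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]

# Tiny hypothetical table to show the call.
q = {0: {"a": 0.0}, 1: {"b": 1.0}}
print(sarsa_update(q, 0, "a", reward=-2, next_state=1, next_action="b"))
```

The name comes from the tuple used in the update: state, action, reward, next state, next action.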

Using an approximation of the Q table, we can act greedily with respect to it to improve our average return. To balance exploration and exploitation, we improve the policy by acting in an $\epsilon$-greedy manner with a decaying $\epsilon$.
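A minimal sketch of $\epsilon$-greedy selection with a decaying $\epsilon$ is given below. The helper names `epsilon_greedy` and `decay` are hypothetical, and the linear schedule with a floor is only one possible choice.

```python
import random

def epsilon_greedy(action_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.choice(list(action_values))
    return max(action_values, key=action_values.get)

def decay(epsilon, step=0.01, floor=0.1):
    """Linear decay per episode, never going below `floor` (assumed schedule)."""
    return max(floor, epsilon - step)

epsilon = 0.9
for episode in range(3):
    action = epsilon_greedy({"Study": 1.0, "Facebook": 0.2}, epsilon)
    epsilon = decay(epsilon)
```

Early episodes are mostly exploratory; as $\epsilon$ shrinks, the agent relies more and more on its current Q estimates.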

We illustrate the Q-learning algorithm, the off-policy variant that bootstraps from the best action in the next state, on the Student MDP:

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import random
from tqdm import tqdm

class MarkovStudentDecisionProcess():
    """
    Markov Decision Process of the Student graph presented in the lesson

    """
    def __init__(self):

        # fewer states than before
        self.titles=["C1","C2","C3","FB","Sleep"]
        self.state=0
        # transition probabilities depend on the action taken by the agent
        self.transition = [{"Study":[0,1,0,0,0],'Facebook':[0,0,0,1,0]},
                           {'Study':[0,0,1,0,0],'Sleep':[0,0,0,0,1]},
                           {'Study':[0,0,0,0,1],'Pub':[0.2,0.4,0.4,0,0]},
                           {'Facebook':[0,0,0,1,0],"Quit":[1,0,0,0,0]},
                           {"Sleep":[0,0,0,0,1]}
                            ]
        #the reward depends on the state and action taken by the agent
        self.reward=[{"Study":-2,'Facebook':-1},
                   {'Study':-2,'Sleep':0},
                   {'Study':10,'Pub':1},
                   {'Facebook':-1,"Quit":0},
                   {"Sleep":0}]


    # step now depends on the action taken by the agent
    def step(self,action):
        reward = self.reward[self.state][action]
        self.state = np.random.choice(range(5),p=self.transition[self.state][action])
        # the episode ends once the agent takes the "Sleep" action
        finished=(action == "Sleep")
        return self.state,reward,finished
    #no history stored here 
    def reboot(self):
        self.state = 0

class AgentRandom():
    def __init__(self):
        """
        The random agent has no will: it just picks random actions
        """
        self.actions = [["Study",'Facebook'],
                       ['Study','Sleep'],
                       ['Study','Pub'],
                       ['Facebook',"Quit"],
                       ["Sleep"]]

    #selecting action randomly                   
    def select_action(self,state):
        return np.random.choice(self.actions[state])

    #virtual update
    def update(self,**kwargs):
        pass


class QlearnerAgent(AgentRandom):
    """
    Epsilon-greedy agent using a Q table (stored as one dictionary per state, since the possible actions depend on the state).
    It inherits from the random agent.
    """

    def __init__(self,epsilon,gamma):
        AgentRandom.__init__(self)
        # Q table as a list of dictionaries, one per state
        self.q_table=[dict([(action,0) for action in actions]) for actions in self.actions]
        #epsilon for the percentage of random actions
        self.epsilon=epsilon
        #discount
        self.gamma = gamma
        
    def select_action(self,state):
        # select the action in an epsilon-greedy fashion
        if random.random()<self.epsilon:
            return AgentRandom.select_action(self,state)
        else:
            return max(self.q_table[state].items(),key=lambda x: x[1])[0]
        
        
    def update(self,state,action,reward,new_state):
        # Q-learning update: bootstrap from the best action in the new state
        q_max=max(self.q_table[new_state].items(),key=lambda x: x[1])[1]
        # 0.1 is the (hard-coded) learning rate
        self.q_table[state][action] += 0.1*(reward + self.gamma*q_max-self.q_table[state][action])

def run_multiple_episodes(number):
    gamma=0.9
    epsilon=0.9
    environement = MarkovStudentDecisionProcess()
    agent = QlearnerAgent(epsilon,gamma)
    for trial in tqdm(range(number)):
        finished=False
        while not finished:
            state = environement.state
            action = agent.select_action(state)
            new_state,reward,finished = environement.step(action)
            agent.update(state,action,reward,new_state)
        # linearly decay epsilon after each episode, down to a floor of 0.1
        agent.epsilon = max(0.1,agent.epsilon-0.01)
        environement.reboot()
    print(dict(zip(environement.titles,agent.q_table)))


In [5]:
run_multiple_episodes(4000)

100%|██████████| 4000/4000 [00:01<00:00, 2319.25it/s]

{'C1': {'Study': 4.299999999999988, 'Facebook': 2.4829998227279715}, 'C2': {'Study': 6.999999999999991, 'Sleep': 0.0}, 'C3': {'Study': 9.999999999999993, 'Pub': 7.741687482301987}, 'FB': {'Facebook': 0.277818151707509, 'Quit': 3.8699999910769614}, 'Sleep': {'Sleep': 0.0}}
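As a quick sanity check, the printed values along the Study path (rounded) agree with the Q-learning backup $Q(s,a) = r + \gamma \max_{a'} Q(s',a')$, using $\gamma = 0.9$ and the rewards and deterministic transitions defined in the code:

$$
\begin{aligned}
Q(C3,\text{Study}) &= 10 + 0.9 \times 0 = 10 \\
Q(C2,\text{Study}) &= -2 + 0.9 \times 10 = 7 \\
Q(C1,\text{Study}) &= -2 + 0.9 \times 7 = 4.3 \\
Q(FB,\text{Quit}) &= 0 + 0.9 \times 4.3 = 3.87 \\
Q(C1,\text{Facebook}) &= -1 + 0.9 \times 3.87 = 2.483
\end{aligned}
$$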





If we test the agent with this Q table on the student process with a greedy policy, we obtain a very serious student: it chooses Study in C1, C2 and C3 :-)
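The greedy test can be sketched without re-running the training. The snippet below hard-codes a rounded copy of the Q table printed above and only those transitions and rewards from the environment definition that are deterministic. The helper `greedy_rollout` and its `max_steps` guard are assumptions for this example.

```python
titles = ["C1", "C2", "C3", "FB", "Sleep"]

# Rounded copy of the Q table printed above, indexed by state number.
q_table = [
    {"Study": 4.3, "Facebook": 2.48},
    {"Study": 7.0, "Sleep": 0.0},
    {"Study": 10.0, "Pub": 7.74},
    {"Facebook": 0.28, "Quit": 3.87},
    {"Sleep": 0.0},
]

# Deterministic successors and rewards copied from the environment
# definition, restricted to the actions the greedy policy selects.
next_state = {(0, "Study"): 1, (1, "Study"): 2, (2, "Study"): 4,
              (3, "Quit"): 0, (4, "Sleep"): 4}
rewards = {(0, "Study"): -2, (1, "Study"): -2, (2, "Study"): 10,
           (3, "Quit"): 0, (4, "Sleep"): 0}

def greedy_rollout(start=0, max_steps=10):
    """Follow the greedy policy until the agent sleeps (or max_steps)."""
    state, total, path = start, 0, []
    for _ in range(max_steps):
        action = max(q_table[state], key=q_table[state].get)
        path.append((titles[state], action))
        total += rewards[(state, action)]
        state = next_state[(state, action)]
        if action == "Sleep":
            break
    return path, total

path, total = greedy_rollout()
print(path, total)
```

Starting from C1, the rollout studies three times and then sleeps, which is the "serious student" behaviour described above.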