# 3 | Policy Improvement: The Q-learning Algorithm

Having evaluated our policy $\pi$, how can we go about obtaining a better one? This question is the heart of *policy improvement*, perhaps the fundamental concept of RL. Recall that when we performed policy evaluation, we obtained the value of taking every action in every state. We can therefore improve the policy by picking, in each state, the action our current estimate rates highest: so-called *greedy* action selection. Once we've obtained a new policy, we can evaluate it as before. For a finite MDP with exact evaluation, continually iterating between policy evaluation and policy improvement in this way is guaranteed to reach the optimal policy $\pi^*$, according to the policy improvement theorem. Q-learning interleaves the two steps: after every transition it nudges its action-value estimate toward the reward plus the discounted value of the greedy action in the next state.
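As a minimal sketch of greedy policy improvement, the snippet below derives a policy from a small action-value table. The state and action names are borrowed from the student MDP used in this notebook, but the table layout (`Q[state][action]`) and the numeric values are assumptions made purely for illustration.

```python
# Hypothetical action-value table: Q[state][action] -> estimated return.
# The numbers are made up for illustration, not learned values.
Q = {
    "Class 1": {"Study": 1.5, "Go on Facebook": -2.0},
    "Facebook": {"Close Facebook": -1.0, "Keep scrolling": -3.0},
}


def greedy_policy(Q):
    """Policy improvement: in every state, pick the highest-valued action."""
    return {state: max(actions, key=actions.get) for state, actions in Q.items()}


policy = greedy_policy(Q)
print(policy)
```

With these made-up values the greedy policy studies in Class 1 and closes Facebook; changing the estimates changes the policy, which is why evaluation and improvement have to alternate.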

In [1]:
from mdp import StudentMDP
mdp = StudentMDP(verbose=True)

In [2]:
from agent import QLearningAgent
agent = QLearningAgent(mdp, epsilon=1.0, alpha=0.2, gamma=0.9)
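The `agent` module is not shown here, so the following is only a sketch of what a tabular Q-learning agent with the same hyperparameters might look like. The class name `TabularQAgent`, the `actions` callable, and the `(state, action)` keyed table are all assumptions; the real `QLearningAgent` may be organised differently.

```python
import random
from collections import defaultdict


class TabularQAgent:
    """Hypothetical stand-in, NOT the QLearningAgent from the agent module."""

    def __init__(self, actions, epsilon=1.0, alpha=0.2, gamma=0.9, seed=0):
        self.actions = actions        # assumed: callable mapping state -> list of actions
        self.epsilon = epsilon        # probability of taking a random action
        self.alpha = alpha            # learning rate
        self.gamma = gamma            # discount factor
        self.Q = defaultdict(float)   # Q[(state, action)], initialised to 0
        self.rng = random.Random(seed)

    def act(self, state):
        """Epsilon-greedy action selection."""
        actions = self.actions(state)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(actions)
        return max(actions, key=lambda a: self.Q[(state, a)])

    def learn(self, state, action, reward, next_state, done):
        """Move Q(s, a) a step of size alpha toward the one-step target."""
        if done:
            target = reward
        else:
            best_next = max(self.Q[(next_state, a)] for a in self.actions(next_state))
            target = reward + self.gamma * best_next
        self.Q[(state, action)] += self.alpha * (target - self.Q[(state, action)])


# Usage on a single transition taken from the episode trace in this notebook.
available = {"Class 3": ["Study"], "Pass": ["Fall asleep"]}
agent = TabularQAgent(lambda s: available[s])
agent.learn("Class 3", "Study", 10.0, "Pass", done=False)
print(agent.Q[("Class 3", "Study")])
```

Starting from an all-zero table, this single update moves the estimate for studying in Class 3 from 0 to `alpha * reward`, since the next state's values are still zero.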

In [3]:
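mdp.ep = 0  # episode counter; assumed to be incremented by the environment
while mdp.ep < 50:
    state = mdp.reset()
    done = False
    while not done:
        # Choose an action, then take one step in the environment
        action = agent.act(state)
        next_state, reward, done, info = mdp.step(action)
        # Update the action-value estimate from this transition
        agent.learn(state, action, reward, next_state, done)
        state = next_state

    # Inspect the current estimates after each episode
    print("Value function:")
    print(agent.Q)
    print("Policy:")
    print(agent.policy)

    # Decay exploration so the agent acts more greedily over time
    agent.epsilon *= 0.95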

| Time  | State    | Action         | Reward | Next state | Done  |
|-------|----------|----------------|--------|------------|-------|
| 0     | Class 1  | Go on Facebook | -1.0   | Facebook   | False |
| 1     | Facebook | Close Facebook | -2.0   | Class 1    | False |
| 2     | Class 1  | Go on Facebook | -1.0   | Facebook   | False |
| 3     | Facebook | Close Facebook | -2.0   | Class 1    | False |
| 4     | Class 1  | Go on Facebook | -1.0   | Facebook   | False |
| 5     | Facebook | Keep scrolling | -1.0   | Facebook   | False |
| 6     | Facebook | Close Facebook | -2.0   | Class 1    | False |
| 7     | Class 1  | Go on Facebook | -1.0   | Facebook   | False |
| 8     | Facebook | Close Facebook | -2.0   | Class 1    | False |
| 9     | Class 1  | Study          | -2.0   | Class 2    | False |
| 10    | Class 2  | Study          | -2.0   | Class 3    | False |
| 11    | Class 3  | Study          | 10.0   | Pass       | False |
| 12    | Pass     | Fall asleep    |  0.0   | A

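Try re-running with $\gamma=0$ and different $\epsilon$ values. With $\gamma=0$ the agent values only the immediate reward, while $\epsilon$ controls how often it takes a random action instead of the greedy one.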
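To get a feel for these two knobs before re-running the notebook, the sketch below computes the one-step target for a single transition under two discount factors, and the exploration rate left after 50 episodes of the 0.95 decay used above. The helper names `td_target` and `epsilon_after`, and the next-state value of 5.0, are assumptions for illustration only.

```python
def td_target(reward, max_next_q, gamma):
    """One-step Q-learning target: immediate reward plus discounted best next value."""
    return reward + gamma * max_next_q


def epsilon_after(episodes, start=1.0, decay=0.95):
    """Exploration rate after multiplying by `decay` once per episode."""
    return start * decay ** episodes


# Transition from the trace: Class 1 --Study--> Class 2, reward -2.0.
# max_next_q = 5.0 is a made-up estimate of Class 2's best action value.
far_sighted = td_target(-2.0, 5.0, gamma=0.9)
myopic = td_target(-2.0, 5.0, gamma=0.0)
print(far_sighted, myopic)

print(round(epsilon_after(50), 3))
```

With $\gamma=0$ the target collapses to the immediate reward of -2.0, so studying looks no better than its short-term cost. The decay schedule leaves $\epsilon$ at roughly 0.08 after 50 episodes, so late episodes are mostly greedy.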