## Markov Chains

### Basics
- A Markov chain is a network of states, with transitions that move the system from one state to another.
- From any given state there may be several states (say, five) that you could end up in next, and each of them has a probability of being the one you land in. These probabilities sum to 1.
- This probability is called the transition probability: the probability of transitioning from State A to State B.
- Importantly, the probability of moving to a given state depends only on your current state and is completely independent of your past. This property is called memorylessness.
- For example, the probability of moving to State B given that you are in State A might be 0.3. This value is fixed and does not change, regardless of which states you have visited in the past.
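
The points above can be sketched in a few lines of Python. This is only an illustration: the three states, every probability other than the 0.3 from State A to State B, and the helper names `step` and `walk` are assumptions made up for the example.

```python
import random

# Hypothetical three-state chain. Each row lists P(next state | current state)
# and sums to 1; only the A -> B value of 0.3 comes from the notes above.
TRANSITIONS = {
    "A": {"A": 0.2, "B": 0.3, "C": 0.5},
    "B": {"A": 0.6, "B": 0.1, "C": 0.3},
    "C": {"A": 0.4, "B": 0.4, "C": 0.2},
}

def step(state, rng):
    """Sample the next state using only the current state (memorylessness)."""
    next_states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]

def walk(start, n_steps, seed=0):
    """Return the list of states visited, beginning with `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path
```

Note that `step` never looks at the earlier entries of `path`, which is exactly what memorylessness means.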


### Stationary distribution
- By running the Markov chain continuously, moving from one state to the next, we can record how often we land on each state.
- An interesting property is that the proportion of time we spend in each state settles to a constant, the stationary distribution, given enough steps.
- For well-behaved chains (every state can be reached from every other state, and the chain is not locked into a fixed cycle) this happens regardless of where you start: because of the transition probabilities, we eventually converge on a constant proportion of time spent in each state.
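
One way to see this convergence is to push a starting distribution through the transition matrix repeatedly until it stops changing. The sketch below assumes a made-up three-state chain with all-positive probabilities; the matrix `P` and the function name `stationary_distribution` are hypothetical.

```python
# Hypothetical three-state chain (each row sums to 1). Every entry is
# positive, so every state can be reached from every other in one step.
P = [
    [0.2, 0.3, 0.5],
    [0.6, 0.1, 0.3],
    [0.4, 0.4, 0.2],
]

def stationary_distribution(P, start, tol=1e-12, max_iter=10_000):
    """Repeatedly apply dist <- dist * P until the largest change is below tol."""
    dist = list(start)
    n = len(P)
    for _ in range(max_iter):
        new = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    return dist

# Two different starting points: always in the first state, or always in the last.
from_first = stationary_distribution(P, [1.0, 0.0, 0.0])
from_last = stationary_distribution(P, [0.0, 0.0, 1.0])
```

Both starting points should end up at (numerically) the same distribution, which is the "regardless of where you start" property described above.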

## Markov Decision Processes

### Basics

- Markov decision processes are very similar to Markov chains. The difference is that, to transition from one state to the next, the agent needs to take an action, and it is the combination of state and action that determines the transition probabilities to the possible next states.
- A further addition is that there is a reward at every step. The key to Markov decision processes is to optimise for the reward.
- We know from Markov chains that we can calculate the stationary distribution from the transition probabilities. If we know how often we would land on each state, we can then calculate the expected reward we would get.
- Another addition is the policy. A policy is just an internal rule that decides which action to take in each state. In a Markov decision process there is a choice to be made among a number of possible actions, and the combination of action and state then gives the transition probabilities.
- Ideally, you want to optimise your policy so that you maximise utility, which is the expected total reward, i.e. the average you would receive if you ran your policy many times.
- In order to optimise your policy, you need a way to calculate the reward attained by following it. This reward can be calculated via a recursive function.
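
To make the roles of state, action and policy concrete, here is a minimal sketch of how an MDP could be stored. The layout of `TRANSITIONS` and the helper `expected_immediate_reward` are assumptions for illustration; the "stay" numbers mirror the dice game used later in these notes, while the "quit" action and its reward of 10 are invented.

```python
# Hypothetical MDP: (state, action) -> list of (probability, next_state, reward).
TRANSITIONS = {
    ("in", "stay"): [(1 / 3, "end", 4), (2 / 3, "in", 4)],
    ("in", "quit"): [(1.0, "end", 10)],  # invented action, for illustration only
}

def policy(state):
    """A policy maps each state to the action to take there (here: always stay)."""
    return "stay"

def expected_immediate_reward(state, policy):
    """Expected reward of the next single step when following `policy` from `state`."""
    action = policy(state)
    outcomes = TRANSITIONS[(state, action)]
    return sum(prob * reward for prob, _next_state, reward in outcomes)
```

Swapping in a different policy function changes which row of `TRANSITIONS` is used, and therefore both the transition probabilities and the rewards that follow.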

### Calculating reward

In [None]:
V_{\pi}(s)=\left\{
    \begin{array}{ll}
        0, & \text{if } s \text{ is an end state} \\
        \sum_{s'} T(s,a,s')\left[R(s, a, s') + \gamma V_{\pi}(s')\right], & \text{otherwise, with } a = \pi(s)
    \end{array}
\right.

This can be written more simply in code as a recursive function:

In [None]:
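def get_possible_next_states(state, action):
    # Placeholder: should return the outcomes reachable from `state` under
    # `action`. Each outcome is assumed to expose `.name`, `.prob` (the
    # transition probability) and `.reward`.
    return []

def calc_reward(state, action, discount_factor=1):
    # An end state earns nothing further.
    if state == "end":
        return 0
    total_reward = 0
    for next_state in get_possible_next_states(state, action):
        # Value of everything after this step, discounted by one step.
        # This plain recursion only terminates when no state can be revisited.
        future_states_reward = discount_factor * calc_reward(next_state.name, action, discount_factor)
        total_reward += next_state.prob * (next_state.reward + future_states_reward)
    return total_reward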
            

This can also be done iteratively:

In [None]:
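def get_next_state(state, action):
    # Placeholder: should return the state that follows `state` under `action`,
    # or None once the end state has been reached.
    return None

def calc_reward(state, action, threshold, discount_factor=1):
    # Add up rewards along one path, stopping at the end state or once a
    # step changes the running total by less than `threshold`.
    curr_reward, weight = 0, 1
    while state is not None:
        prev_reward = curr_reward
        curr_reward += weight * state.reward
        if abs(curr_reward - prev_reward) < threshold:
            break
        weight *= discount_factor
        state = get_next_state(state, action)
    return curr_reward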
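
For comparison, here is a self-contained sketch of the usual iterative approach: start with a value of 0 for every state and repeatedly apply the update from the formula above until the values stop changing. The two-state MDP in `POLICY_TRANSITIONS`, the function name `evaluate_policy`, and the discount of 0.9 are all assumptions chosen for the example.

```python
# Hypothetical MDP under a fixed policy: state -> list of
# (probability, next_state, reward) for the action the policy picks there.
POLICY_TRANSITIONS = {
    "s0": [(0.5, "s1", 1.0), (0.5, "end", 0.0)],
    "s1": [(0.8, "s0", 2.0), (0.2, "end", 5.0)],
}

def evaluate_policy(transitions, discount_factor=0.9, threshold=1e-9):
    """Sweep V(s) <- sum T * (R + gamma * V(s')) until the largest change is small."""
    values = {state: 0.0 for state in transitions}
    values["end"] = 0.0  # the end state is worth 0 by definition
    while True:
        new_values = dict(values)
        biggest_change = 0.0
        for state, outcomes in transitions.items():
            new_values[state] = sum(
                prob * (reward + discount_factor * values[next_state])
                for prob, next_state, reward in outcomes
            )
            biggest_change = max(biggest_change, abs(new_values[state] - values[state]))
        values = new_values
        if biggest_change < threshold:
            return values

values = evaluate_policy(POLICY_TRANSITIONS)
```

Unlike the plain recursion, this version copes with states that can be revisited, because each sweep only looks one step ahead and reuses the previous sweep's estimates.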

#### Example from dice game

In [None]:
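def calc_dice_reward(prev_iter_reward=0, threshold=1e-6):
    # Each round pays 4; with probability 1/3 the game ends, otherwise it continues.
    end_state_reward = 0
    in_state_reward = 1/3 * (4 + end_state_reward) + 2/3 * (4 + prev_iter_reward)
    # Stop once another iteration no longer changes the estimate.
    if abs(in_state_reward - prev_iter_reward) < threshold:
        return in_state_reward
    return calc_dice_reward(in_state_reward, threshold)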
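
As a check on the iteration, the value it settles on can also be worked out by hand. Writing V for the value of being in the game (a symbol introduced here for convenience) and using the same numbers as the code, the update has a fixed point:

```latex
\begin{aligned}
V &= \tfrac{1}{3}\,(4 + 0) + \tfrac{2}{3}\,(4 + V) \\
V &= 4 + \tfrac{2}{3}\,V \\
\tfrac{1}{3}\,V &= 4 \\
V &= 12
\end{aligned}
```

So repeated updates starting from 0 should approach 12, assuming no discounting (a discount factor of 1).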