# Implementing On-policy MC control

Now, let's learn how to implement the MC control method with an epsilon-greedy policy for playing the blackjack game; that is, we will see how we can use the MC control method to
find the optimal policy in the blackjack game.

First, let's import the necessary libraries:

In [1]:
import gym
import pandas as pd
from collections import defaultdict
import random

Create a blackjack environment:

In [2]:
env = gym.make('Blackjack-v1')
env.reset()

((18, 5, False), {})

Initialize the dictionary for storing the Q values:

In [3]:
Q = defaultdict(float)

Initialize the dictionary for storing the total return of the state-action pair:

In [4]:
total_return = defaultdict(float)

Initialize the dictionary for storing the count of the number of times a state-action pair is
visited:

In [5]:
N = defaultdict(int)

In [6]:
#Initialize a dictionary to store the win rate for a state-action pair
W = defaultdict(float)

In [7]:
#Initialize a dictionary to store the number of times the return was positive
P = defaultdict(int)

## Define the epsilon-greedy policy

We learned that we select actions based on the epsilon-greedy policy, so we define a
function called `epsilon_greedy_policy` which takes the state and the Q values as inputs
and returns the action to be performed in the given state:

In [8]:
def epsilon_greedy_policy(state,Q):

    #set the epsilon value to 0.5
    epsilon = 0.5

    #sample a random value from the uniform distribution, if the sampled value is less than
    #epsilon then we select a random action else we select the best action which has maximum Q
    #value as shown below

    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])
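
As a quick sanity check on the function above, here is a minimal sketch that does not depend on gym. The name `epsilon_greedy_action`, the `n_actions` and `rng` parameters, and the Q values in `Q_demo` are assumptions made for illustration only. With two actions and epsilon = 0.5, the greedy action should be picked about 1 - 0.5 + 0.5/2 = 75% of the time, because the random branch can also land on it.

```python
import random
from collections import defaultdict

def epsilon_greedy_action(state, Q, n_actions, epsilon=0.5, rng=random):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    # max() returns the first maximal element, so ties go to the lowest action index.
    return max(range(n_actions), key=lambda a: Q[(state, a)])

# Hypothetical Q values for one state: action 1 looks better than action 0.
Q_demo = defaultdict(float)
Q_demo[((12, 10, False), 0)] = -0.58
Q_demo[((12, 10, False), 1)] = -0.45

rng = random.Random(0)  # seeded generator so the experiment is repeatable
picks = [
    epsilon_greedy_action((12, 10, False), Q_demo, 2, epsilon=0.5, rng=rng)
    for _ in range(10_000)
]
greedy_share = picks.count(1) / len(picks)
print(f"greedy action chosen in {greedy_share:.1%} of draws")
```

Note that with all Q values still at their initial 0.0, the greedy branch always returns action 0 because of the tie-breaking rule.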

## Generating an episode

Now, let's generate an episode using the epsilon-greedy policy. We define a function called
`generate_episode` which takes the Q value as an input and returns the episode.

First, let's set the number of time steps:

In [9]:
num_timesteps = 100

In [10]:
def generate_episode(Q):

    #initialize a list for storing the episode
    episode = []

    #initialize the state using the reset function
    state = env.reset()[0]

    #then for each time step
    for t in range(num_timesteps):

        #select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy(state,Q)

        #perform the selected action; env.step returns
        #(observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, info = env.step(action)

        #store the state, action, reward in the episode list
        episode.append((state, action, reward))

        #if the episode has ended then break the loop else update the next state to the current
        #state
        if terminated or truncated:
            break

        state = next_state

        #print(episode)

    return episode
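
To see the rollout loop in isolation, the sketch below runs it against a tiny stand-in environment. `CoinFlipEnv` is hypothetical and not part of gym; it only mimics the conventions the notebook relies on, namely that `reset()` returns `(observation, info)` and `step()` returns `(observation, reward, terminated, truncated, info)`. This version of `generate_episode` also takes the environment and a policy function as arguments, which is an assumption that differs from the notebook's signature.

```python
import random

class CoinFlipEnv:
    """Hypothetical stand-in for a Gym-style environment (not part of gym)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.total = 0

    def reset(self):
        self.total = 0
        return self.total, {}

    def step(self, action):
        if action == 0:  # "stick": the episode ends and the total decides the reward
            reward = 1.0 if self.total >= 2 else -1.0
            return self.total, reward, True, False, {}
        self.total += self.rng.choice([1, 2])  # "hit": add 1 or 2
        if self.total > 3:  # bust
            return self.total, -1.0, True, False, {}
        return self.total, 0.0, False, False, {}

def generate_episode(env, policy, max_steps=100):
    """Roll out one episode and return a list of (state, action, reward)."""
    episode = []
    state, _info = env.reset()
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, terminated, truncated, _info = env.step(action)
        episode.append((state, action, reward))
        if terminated or truncated:
            break
        state = next_state
    return episode

# Hit while the total is below 2, otherwise stick.
demo_episode = generate_episode(CoinFlipEnv(seed=1), lambda s: 1 if s < 2 else 0)
print(demo_episode)
```

As in blackjack, only the last step of the episode carries a non-zero reward.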

## Computing the optimal policy

Now, let's learn how to compute the optimal policy. First, let's set the number of iterations, that is, the number of episodes, we want to generate:

In [11]:
num_iterations = 500000

We learned that in the on-policy control method, we are not given any policy as an
input. So, we start from an arbitrary policy in the first iteration and improve it
iteratively by computing the Q value. Since we extract the policy from the Q function, we don't
have to define the policy explicitly: as the Q value improves, the policy also improves
implicitly. That is, in the first iteration we generate an episode by extracting the
epsilon-greedy policy from the initialized Q function. Over a series of iterations, the Q
estimates move toward the optimal Q function, and so the extracted policy moves toward the optimal policy.
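
The averaging behind this improvement can also be written as an incremental update, Q ← Q + (R − Q) / N, which gives the same result as dividing the total return by the visit count. The following is a minimal sketch; the helper name `update_q`, the state-action pair, and the four returns are made up for illustration.

```python
from collections import defaultdict

Q_inc = defaultdict(float)
N_inc = defaultdict(int)

def update_q(Q, N, state_action, R):
    """Incremental average: same result as total_return / N, without storing total_return."""
    N[state_action] += 1
    Q[state_action] += (R - Q[state_action]) / N[state_action]

pair = ((14, 10, False), 0)        # hypothetical state-action pair
for R in [-1.0, 1.0, -1.0, -1.0]:  # hypothetical returns from four episodes
    update_q(Q_inc, N_inc, pair, R)

print(Q_inc[pair], N_inc[pair])    # mean of the four returns and the visit count
```

The notebook keeps a separate `total_return` dictionary instead; both forms compute the sample mean of the observed returns.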

In [12]:
#for each iteration
for i in range(num_iterations):

    #so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)

    #get all the state-action pairs in the episode
    all_state_action_pairs = [(s, a) for (s,a,r) in episode]

    #store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s,a,r) in episode]

    #for each step in the episode
    for t, (state, action, reward) in enumerate(episode):

        #if the state-action pair is occurring for the first time in the episode
        if (state, action) not in all_state_action_pairs[0:t]:

            #compute the return R of the state-action pair as the sum of rewards
            R = sum(rewards[t:])

            #update the no of times the reward was positive
            if R > 0:
                P[(state,action)] += 1

            #update total return of the state-action pair
            total_return[(state,action)] = total_return[(state,action)] + R

            #update the number of times the state-action pair is visited
            N[(state, action)] += 1

            #compute the Q value by just taking the average
            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]

            #compute the win rate by dividing the number of times the return was positive by the number of times the state-action pair was visited
            W[(state,action)] = P[(state,action)] / N[(state, action)]
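
The loop above recomputes `sum(rewards[t:])` and slices the list of pairs at every step. Blackjack episodes are only a few steps long, so this hardly matters here, but the same first-visit returns can be obtained in a single backward pass. This is a sketch under assumptions: the helper name `first_visit_returns` and the `gamma` discount parameter are not in the notebook, which uses undiscounted returns (`gamma = 1`).

```python
def first_visit_returns(episode, gamma=1.0):
    """Return {(state, action): G}, where G is the return from the pair's first visit."""
    G = 0.0
    returns = {}
    # Walk backwards so G accumulates the (discounted) reward-to-go in one pass.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        # Earlier occurrences overwrite later ones, leaving the first visit's return.
        returns[(state, action)] = G
    return returns

# Hypothetical episode: hit twice, then stick and win.
demo = [
    ((12, 10, False), 1, 0.0),
    ((15, 10, False), 1, 0.0),
    ((19, 10, False), 0, 1.0),
]
demo_returns = first_visit_returns(demo)
print(demo_returns)
```

Each pair in `demo_returns` could then be fed to the same `total_return`, `N`, `Q`, and `W` updates used in the loop.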





In [13]:
#Convert the no of times the reward was positive dictionary to a pandas dataframe:
df_p = pd.DataFrame(P.items(),columns=['state_action pair','no_of_positive_rewards'])

In [14]:
#Convert the no of times the state-action pair is visited dictionary to a pandas dataframe:
df_n = pd.DataFrame(N.items(),columns=['state_action pair','no_of_visits'])

In [15]:
#Convert the Win Rate dictionary to a pandas dataframe:
df_w = pd.DataFrame(W.items(),columns=['state_action pair','win_rate'])
df_w

Unnamed: 0,state_action pair,win_rate
0,"((21, 10, True), 1)",0.378378
1,"((14, 10, False), 0)",0.213160
2,"((18, 10, False), 0)",0.319671
3,"((19, 8, True), 0)",0.723140
4,"((15, 10, True), 1)",0.303805
...,...,...
555,"((5, 3, False), 1)",0.267176
556,"((4, 4, False), 0)",0.410405
557,"((16, 5, True), 1)",0.442177
558,"((17, 3, True), 1)",0.407895


Thus, on every iteration, the Q value improves and so does the policy.
After all the iterations, we can look at the Q value of each state-action pair in a pandas
data frame for more clarity.

First, let's convert the Q value dictionary to a pandas data frame:

In [16]:
df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])

Let's look at the first few rows of the data frame:

In [17]:
df.head(10)

Unnamed: 0,state_action pair,value
0,"((14, 10, False), 0)",-0.57368
1,"((14, 10, False), 1)",-0.59248
2,"((21, 10, True), 1)",-0.165481
3,"((18, 10, False), 0)",-0.247129
4,"((19, 8, True), 0)",0.592975
5,"((15, 10, True), 1)",-0.320649
6,"((15, 10, False), 0)",-0.574354
7,"((12, 10, False), 0)",-0.585154
8,"((13, 2, False), 0)",-0.292178
9,"((13, 2, False), 1)",-0.449246


As we can observe, we have the Q values for all the state-action pairs. Now we can extract
the policy by selecting the action which has the maximum Q value in each state.

To learn more about how to select an action based on these Q values, see the section on implementing on-policy control in the book.
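
As a rough illustration of that extraction step, the sketch below builds a greedy policy from a Q dictionary keyed by `(state, action)`. The helper name `extract_greedy_policy` is an assumption, as is the choice to treat an action that was never visited in a state as having value negative infinity. The Q values are copied from the first rows of the data frame above; in the blackjack environment action 0 is stick and action 1 is hit.

```python
def extract_greedy_policy(Q, n_actions=2):
    """Map each state that appears in Q to the action with the highest Q value."""
    states = {state for (state, _action) in Q}
    return {
        # Actions never visited in a state are treated as -inf (an assumption).
        s: max(range(n_actions), key=lambda a: Q.get((s, a), float("-inf")))
        for s in states
    }

Q_table = {
    ((14, 10, False), 0): -0.57368,
    ((14, 10, False), 1): -0.59248,
    ((13, 2, False), 0): -0.292178,
    ((13, 2, False), 1): -0.449246,
    ((15, 10, True), 1): -0.320649,
}
policy = extract_greedy_policy(Q_table)
for state in sorted(policy):
    print(state, "->", "hit" if policy[state] == 1 else "stick")
```

The two Q values for `(14, 10, False)` are close, so the preferred action in such states may well change between runs.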

In [18]:
df2 = df.merge(df_p, on = 'state_action pair', how='left')
df3 = df2.merge(df_n, on = 'state_action pair', how='left')
df4 = df3.merge(df_w, on = 'state_action pair', how='left')
df4

Unnamed: 0,state_action pair,value,no_of_positive_rewards,no_of_visits,win_rate
0,"((14, 10, False), 0)",-0.573680,2015,9453,0.213160
1,"((14, 10, False), 1)",-0.592480,1287,7261,0.177248
2,"((21, 10, True), 1)",-0.165481,798,2109,0.378378
3,"((18, 10, False), 0)",-0.247129,3424,10711,0.319671
4,"((19, 8, True), 0)",0.592975,350,484,0.723140
...,...,...,...,...,...
555,"((13, 7, True), 0)",-0.606838,23,117,0.196581
556,"((16, 6, True), 1)",0.054830,194,383,0.506527
557,"((4, 2, False), 1)",-0.307692,25,78,0.320513
558,"((4, 7, False), 1)",-0.328767,46,146,0.315068
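
The `value` and `win_rate` columns are closely related. Assuming every episode's return is -1, 0, or +1 (which should hold for the default Blackjack-v1 rewards, but not if the environment is created with a 1.5 payout for naturals), the Q value is the win rate minus the loss rate, so the loss and draw rates can be recovered from the two columns. The helper name `outcome_rates` is made up for this sketch; the numbers are copied from the merged table above.

```python
def outcome_rates(value, win_rate):
    """Split a Q value and win rate into (loss_rate, draw_rate), assuming returns in {-1, 0, +1}."""
    loss_rate = win_rate - value            # from value = win_rate - loss_rate
    draw_rate = 1.0 - win_rate - loss_rate  # the three rates sum to 1
    return loss_rate, draw_rate

# ((14, 10, False), 0): value -0.573680, win_rate 0.213160
stick_on_14 = outcome_rates(-0.573680, 0.213160)
# ((18, 10, False), 0): value -0.247129, win_rate 0.319671
stick_on_18 = outcome_rates(-0.247129, 0.319671)

print("stick on 14: loss %.4f, draw %.4f" % stick_on_14)
print("stick on 18: loss %.4f, draw %.4f" % stick_on_18)
```

Sticking on 14 comes out with essentially no draws, while sticking on 18 draws in roughly 11% of visits.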


In [19]:
df5=df4.sort_values("state_action pair")
print(df5.to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
110    ((4, 1, False), 0) -0.766234                       9            77  0.116883
111    ((4, 1, False), 1) -0.618750                      26           160  0.162500
515    ((4, 2, False), 0) -0.285714                      55           154  0.357143
557    ((4, 2, False), 1) -0.307692                      25            78  0.320513
552    ((4, 3, False), 0) -0.248555                      65           173  0.375723
553    ((4, 3, False), 1) -0.363636                      17            55  0.309091
559    ((4, 4, False), 0) -0.179191                      71           173  0.410405
550    ((4, 4, False), 1) -0.275862                      21            58  0.362069
536    ((4, 5, False), 0) -0.347826                      30            92  0.326087
537    ((4, 5, False), 1) -0.264000                      45           125  0.360000
501    ((4, 6, False), 0) -0.341176                      28            85  0

In [20]:
#Find the state action pair with the lowest win rate
print(df5[df5['win_rate'] == df5['win_rate'].min()])

        state_action pair  value  no_of_positive_rewards  no_of_visits  \
442   ((21, 1, False), 1)   -1.0                       0           516   
430   ((21, 2, False), 1)   -1.0                       0           332   
154   ((21, 3, False), 1)   -1.0                       0           281   
252   ((21, 4, False), 1)   -1.0                       0           311   
288   ((21, 5, False), 1)   -1.0                       0           339   
157   ((21, 6, False), 1)   -1.0                       0           315   
284   ((21, 7, False), 1)   -1.0                       0           470   
170   ((21, 8, False), 1)   -1.0                       0           418   
403   ((21, 9, False), 1)   -1.0                       0           411   
41   ((21, 10, False), 1)   -1.0                       0          1610   

     win_rate  
442       0.0  
430       0.0  
154       0.0  
252       0.0  
288       0.0  
157       0.0  
284       0.0  
170       0.0  
403       0.0  
41        0.0  


In [21]:
#Find the state action pair with the highest win rate
print(df5[df5['win_rate'] == df5['win_rate'].max()])

      state_action pair     value  no_of_positive_rewards  no_of_visits  \
182  ((21, 9, True), 0)  0.991145                    1567          1581   

     win_rate  
182  0.991145  


In [22]:
#Find the state action pair with positive rewards
print(df5[df5['value'] >0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
258    ((9, 6, False), 1)  0.040586                     444           887  0.500564
294   ((10, 4, False), 1)  0.005364                     616          1305  0.472031
239   ((10, 5, False), 1)  0.065925                     630          1259  0.500397
45    ((10, 6, False), 1)  0.056543                     625          1238  0.504847
334   ((10, 7, False), 1)  0.016047                     645          1371  0.470460
204   ((10, 8, False), 1)  0.006971                     611          1291  0.473277
34    ((11, 3, False), 1)  0.021025                     732          1522  0.480946
399   ((11, 4, False), 1)  0.058073                     782          1567  0.499043
43    ((11, 5, False), 1)  0.003600                     659          1389  0.474442
25    ((11, 6, False), 1)  0.076305                     767          1494  0.513387
208   ((11, 7, False), 1)  0.037483                     723          1494  0

In [23]:
#Find the state action pair with negative rewards
print(df5[df5['value'] <0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
110    ((4, 1, False), 0) -0.766234                       9            77  0.116883
111    ((4, 1, False), 1) -0.618750                      26           160  0.162500
515    ((4, 2, False), 0) -0.285714                      55           154  0.357143
557    ((4, 2, False), 1) -0.307692                      25            78  0.320513
552    ((4, 3, False), 0) -0.248555                      65           173  0.375723
553    ((4, 3, False), 1) -0.363636                      17            55  0.309091
559    ((4, 4, False), 0) -0.179191                      71           173  0.410405
550    ((4, 4, False), 1) -0.275862                      21            58  0.362069
536    ((4, 5, False), 0) -0.347826                      30            92  0.326087
537    ((4, 5, False), 1) -0.264000                      45           125  0.360000
501    ((4, 6, False), 0) -0.341176                      28            85  0

In [24]:
#Find the state action pair with 0 rewards
print(df5[df5['value'] ==0.].to_string())

      state_action pair  value  no_of_positive_rewards  no_of_visits  win_rate
451  ((13, 5, True), 0)    0.0                      86           172       0.5


## Trying with different number of iterations

Next, I am going to calculate and compare the Q value of each state-action pair when the number of iterations is 7 million, 5 million, and 3 million.
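
Each of the runs below repeats the whole training cell with a different `num_iterations`. One way to avoid the repetition is to wrap the training loop in a function that builds fresh dictionaries on every call. This is a sketch, not the notebook's code: `run_mc_control` is a hypothetical name, and `fixed_episode` is a stand-in for the notebook's `generate_episode(Q)` so that the example runs without gym.

```python
from collections import defaultdict

def run_mc_control(generate_episode, num_iterations):
    """First-visit MC averaging; returns fresh Q, N, and win-rate dictionaries."""
    Q, N = defaultdict(float), defaultdict(int)
    P, W = defaultdict(int), defaultdict(float)
    for _ in range(num_iterations):
        episode = generate_episode(Q)
        # Backward pass: the value left for each pair is its first-visit return.
        G = 0.0
        first_visit = {}
        for state, action, reward in reversed(episode):
            G += reward
            first_visit[(state, action)] = G
        for pair, R in first_visit.items():
            N[pair] += 1
            P[pair] += R > 0                    # count positive returns
            Q[pair] += (R - Q[pair]) / N[pair]  # running average of returns
            W[pair] = P[pair] / N[pair]
    return Q, N, W

def fixed_episode(Q):
    # Hypothetical stand-in for generate_episode(Q): always hit once, then stick and win.
    return [((12, 10, False), 1, 0.0), ((19, 10, False), 0, 1.0)]

Q_demo, N_demo, W_demo = run_mc_control(fixed_episode, 3)
print(dict(Q_demo), dict(N_demo), dict(W_demo))
```

With the notebook's own `generate_episode`, the three experiments would then reduce to calls such as `run_mc_control(generate_episode, 7000000)`, each starting from empty dictionaries.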

## 7 Million iterations

In [25]:
# 7 mil
env = gym.make('Blackjack-v1')
env.reset()
Q = defaultdict(float)
total_return = defaultdict(float)
N = defaultdict(int)
W = defaultdict(float)
P = defaultdict(int)

def epsilon_greedy_policy(state,Q):

    #set the epsilon value to 0.5
    epsilon = 0.5

    #sample a random value from the uniform distribution, if the sampled value is less than
    #epsilon then we select a random action else we select the best action which has maximum Q
    #value as shown below

    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])
    
num_timesteps = 100

def generate_episode(Q):

    #initialize a list for storing the episode
    episode = []

    #initialize the state using the reset function
    state = env.reset()[0]

    #then for each time step
    for t in range(num_timesteps):

        #select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy(state,Q)

        #perform the selected action; env.step returns
        #(observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, info = env.step(action)

        #store the state, action, reward in the episode list
        episode.append((state, action, reward))

        #if the episode has ended then break the loop else update the next state to the current
        #state
        if terminated or truncated:
            break

        state = next_state

        #print(episode)

    return episode

num_iterations = 7000000

#for each iteration
for i in range(num_iterations):

    #so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)

    #get all the state-action pairs in the episode
    all_state_action_pairs = [(s, a) for (s,a,r) in episode]

    #store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s,a,r) in episode]

    #for each step in the episode
    for t, (state, action, reward) in enumerate(episode):

        #if the state-action pair is occurring for the first time in the episode
        if (state, action) not in all_state_action_pairs[0:t]:

            #compute the return R of the state-action pair as the sum of rewards
            R = sum(rewards[t:])

            #update the no of times the reward was positive
            if R > 0:
                P[(state,action)] += 1

            #update total return of the state-action pair
            total_return[(state,action)] = total_return[(state,action)] + R

            #update the number of times the state-action pair is visited
            N[(state, action)] += 1

            #compute the Q value by just taking the average
            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]

            #compute the win rate by dividing the number of times the return was positive by the number of times the state-action pair was visited
            W[(state,action)] = P[(state,action)] / N[(state, action)]

#Convert the no of times the reward was positive dictionary to a pandas dataframe:
df_p = pd.DataFrame(P.items(),columns=['state_action pair','no_of_positive_rewards'])

#Convert the no of times the state-action pair is visited dictionary to a pandas dataframe:
df_n = pd.DataFrame(N.items(),columns=['state_action pair','no_of_visits'])

#Convert the Win Rate dictionary to a pandas dataframe:
df_w = pd.DataFrame(W.items(),columns=['state_action pair','win_rate'])
df_w



Unnamed: 0,state_action pair,win_rate
0,"((17, 5, True), 0)",0.417219
1,"((20, 9, False), 1)",0.055375
2,"((12, 10, False), 0)",0.212271
3,"((10, 5, False), 0)",0.408062
4,"((13, 10, False), 0)",0.212514
...,...,...
555,"((15, 2, True), 1)",0.430066
556,"((4, 6, False), 0)",0.408642
557,"((4, 6, False), 1)",0.420146
558,"((19, 5, True), 1)",0.473979


In [26]:
df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])
df.head(10)

Unnamed: 0,state_action pair,value
0,"((17, 5, True), 0)",-0.040149
1,"((17, 5, True), 1)",-0.021658
2,"((20, 9, False), 1)",-0.886326
3,"((12, 10, False), 0)",-0.575457
4,"((10, 5, False), 0)",-0.183876
5,"((13, 10, False), 0)",-0.574972
6,"((19, 8, False), 0)",0.597076
7,"((14, 5, False), 0)",-0.174669
8,"((14, 5, False), 1)",-0.468921
9,"((5, 5, False), 0)",-0.154135


In [27]:
df2 = df.merge(df_p, on = 'state_action pair', how='left')
df3 = df2.merge(df_n, on = 'state_action pair', how='left')
df4 = df3.merge(df_w, on = 'state_action pair', how='left')
df4

Unnamed: 0,state_action pair,value,no_of_positive_rewards,no_of_visits,win_rate
0,"((17, 5, True), 0)",-0.040149,1008,2416,0.417219
1,"((17, 5, True), 1)",-0.021658,2826,6187,0.456764
2,"((20, 9, False), 1)",-0.886326,1042,18817,0.055375
3,"((12, 10, False), 0)",-0.575457,12600,59358,0.212271
4,"((10, 5, False), 0)",-0.183876,2470,6053,0.408062
...,...,...,...,...,...
555,"((12, 5, True), 0)",-0.182609,329,805,0.408696
556,"((15, 1, True), 0)",-0.773050,208,1833,0.113475
557,"((14, 8, True), 0)",-0.480565,441,1698,0.259717
558,"((4, 6, False), 0)",-0.182716,331,810,0.408642


In [28]:
df5=df4.sort_values("state_action pair")
print(df5.to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
396    ((4, 1, False), 0) -0.754386                      98           798  0.122807
545    ((4, 1, False), 1) -0.617537                     382          2395  0.159499
187    ((4, 2, False), 0) -0.287879                     329           924  0.356061
188    ((4, 2, False), 1) -0.202632                     877          2280  0.384649
477    ((4, 3, False), 0) -0.219029                     394          1009  0.390486
478    ((4, 3, False), 1) -0.208969                     832          2163  0.384651
448    ((4, 4, False), 0) -0.260960                     354           958  0.369520
377    ((4, 4, False), 1) -0.200174                     891          2293  0.388574
293    ((4, 5, False), 0) -0.164886                     899          2153  0.417557
294    ((4, 5, False), 1) -0.199199                     387           999  0.387387
558    ((4, 6, False), 0) -0.182716                     331           810  0

In [29]:
#Find the state action pair with the lowest win rate
print(df5[df5['win_rate'] == df5['win_rate'].min()])

        state_action pair  value  no_of_positive_rewards  no_of_visits  \
400   ((21, 1, False), 1)   -1.0                       0          7760   
297   ((21, 2, False), 1)   -1.0                       0          4586   
398   ((21, 3, False), 1)   -1.0                       0          4642   
172   ((21, 4, False), 1)   -1.0                       0          4502   
432   ((21, 5, False), 1)   -1.0                       0          4539   
543   ((21, 6, False), 1)   -1.0                       0          4448   
256   ((21, 7, False), 1)   -1.0                       0          6468   
325   ((21, 8, False), 1)   -1.0                       0          6467   
329   ((21, 9, False), 1)   -1.0                       0          6219   
102  ((21, 10, False), 1)   -1.0                       0         23568   

     win_rate  
400       0.0  
297       0.0  
398       0.0  
172       0.0  
432       0.0  
543       0.0  
256       0.0  
325       0.0  
329       0.0  
102       0.0  


In [30]:
#Find the state action pair with the highest win rate
print(df5[df5['win_rate'] == df5['win_rate'].max()])

      state_action pair     value  no_of_positive_rewards  no_of_visits  \
277  ((21, 9, True), 0)  0.990831                   22153         22358   

     win_rate  
277  0.990831  


In [31]:
#Find the state action pair with positive rewards
seven_mil_ap_pos_df = df5[df5['value'] >0.]
print(df5[df5['value'] >0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
97    ((10, 4, False), 1)  0.025549                    8907         18474  0.482137
279   ((10, 5, False), 1)  0.015624                    8606         17985  0.478510
419   ((10, 6, False), 1)  0.055695                    9394         18745  0.501147
399   ((11, 3, False), 1)  0.014682                   10425         21864  0.476811
340   ((11, 4, False), 1)  0.046888                   10861         21946  0.494897
404   ((11, 5, False), 1)  0.053683                   10687         21422  0.498880
336   ((11, 6, False), 1)  0.078617                   11271         22018  0.511899
331   ((11, 7, False), 1)  0.002604                   10432         22272  0.468391
554    ((12, 3, True), 1)  0.002587                    1098          2319  0.473480
544    ((12, 4, True), 1)  0.015612                    1163          2434  0.477814
550    ((12, 5, True), 1)  0.048456                    1164          2332  0

In [32]:
#Find the state action pair with negative rewards
seven_mil_ap_neg_df = df5[df5['value'] <0.]
print(df5[df5['value'] <0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
396    ((4, 1, False), 0) -0.754386                      98           798  0.122807
545    ((4, 1, False), 1) -0.617537                     382          2395  0.159499
187    ((4, 2, False), 0) -0.287879                     329           924  0.356061
188    ((4, 2, False), 1) -0.202632                     877          2280  0.384649
477    ((4, 3, False), 0) -0.219029                     394          1009  0.390486
478    ((4, 3, False), 1) -0.208969                     832          2163  0.384651
448    ((4, 4, False), 0) -0.260960                     354           958  0.369520
377    ((4, 4, False), 1) -0.200174                     891          2293  0.388574
293    ((4, 5, False), 0) -0.164886                     899          2153  0.417557
294    ((4, 5, False), 1) -0.199199                     387           999  0.387387
558    ((4, 6, False), 0) -0.182716                     331           810  0

In [33]:
#Find the state action pair with 0 rewards
print(df5[df5['value'] ==0.].to_string())

Empty DataFrame
Columns: [state_action pair, value, no_of_positive_rewards, no_of_visits, win_rate]
Index: []


## 5 Million iterations

In [34]:
# 5 mil
env = gym.make('Blackjack-v1')
env.reset()
Q = defaultdict(float)
total_return = defaultdict(float)
N = defaultdict(int)
W = defaultdict(float)
P = defaultdict(int)

def epsilon_greedy_policy(state,Q):

    #set the epsilon value to 0.5
    epsilon = 0.5

    #sample a random value from the uniform distribution, if the sampled value is less than
    #epsilon then we select a random action else we select the best action which has maximum Q
    #value as shown below

    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])
    
num_timesteps = 100

def generate_episode(Q):

    #initialize a list for storing the episode
    episode = []

    #initialize the state using the reset function
    state = env.reset()[0]

    #then for each time step
    for t in range(num_timesteps):

        #select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy(state,Q)

        #perform the selected action; env.step returns
        #(observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, info = env.step(action)

        #store the state, action, reward in the episode list
        episode.append((state, action, reward))

        #if the episode has ended then break the loop else update the next state to the current
        #state
        if terminated or truncated:
            break

        state = next_state

        #print(episode)

    return episode

num_iterations = 5000000

#for each iteration
for i in range(num_iterations):

    #so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)

    #get all the state-action pairs in the episode
    all_state_action_pairs = [(s, a) for (s,a,r) in episode]

    #store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s,a,r) in episode]

    #for each step in the episode
    for t, (state, action, reward) in enumerate(episode):

        #if the state-action pair is occurring for the first time in the episode
        if (state, action) not in all_state_action_pairs[0:t]:

            #compute the return R of the state-action pair as the sum of rewards
            R = sum(rewards[t:])

            #update the no of times the reward was positive
            if R > 0:
                P[(state,action)] += 1

            #update total return of the state-action pair
            total_return[(state,action)] = total_return[(state,action)] + R

            #update the number of times the state-action pair is visited
            N[(state, action)] += 1

            #compute the Q value by just taking the average
            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]

            #compute the win rate by dividing the number of times the return was positive by the number of times the state-action pair was visited
            W[(state,action)] = P[(state,action)] / N[(state, action)]

#Convert the no of times the reward was positive dictionary to a pandas dataframe:
df_p = pd.DataFrame(P.items(),columns=['state_action pair','no_of_positive_rewards'])

#Convert the no of times the state-action pair is visited dictionary to a pandas dataframe:
df_n = pd.DataFrame(N.items(),columns=['state_action pair','no_of_visits'])

#Convert the Win Rate dictionary to a pandas dataframe:
df_w = pd.DataFrame(W.items(),columns=['state_action pair','win_rate'])
df_w



Unnamed: 0,state_action pair,win_rate
0,"((16, 5, False), 1)",0.206190
1,"((21, 5, False), 0)",0.890502
2,"((10, 1, False), 0)",0.114504
3,"((9, 9, False), 0)",0.217610
4,"((7, 5, False), 0)",0.418988
...,...,...
555,"((12, 2, True), 1)",0.466747
556,"((5, 5, False), 1)",0.405566
557,"((4, 4, False), 1)",0.358923
558,"((16, 6, True), 1)",0.474752


In [35]:
df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])
df.head(10)

Unnamed: 0,state_action pair,value
0,"((21, 5, False), 0)",0.890502
1,"((21, 5, False), 1)",-1.0
2,"((16, 5, False), 1)",-0.551145
3,"((10, 1, False), 0)",-0.770992
4,"((10, 1, False), 1)",-0.390402
5,"((9, 9, False), 0)",-0.564781
6,"((9, 9, False), 1)",-0.240794
7,"((7, 5, False), 0)",-0.162024
8,"((17, 5, True), 0)",-0.052912
9,"((17, 6, False), 1)",-0.628648


In [36]:
df2 = df.merge(df_p, on = 'state_action pair', how='left')
df3 = df2.merge(df_n, on = 'state_action pair', how='left')
df4 = df3.merge(df_w, on = 'state_action pair', how='left')
df4

Unnamed: 0,state_action pair,value,no_of_positive_rewards,no_of_visits,win_rate
0,"((21, 5, False), 0)",0.890502,8710,9781,0.890502
1,"((21, 5, False), 1)",-1.000000,0,3298,0.000000
2,"((16, 5, False), 1)",-0.551145,1792,8691,0.206190
3,"((10, 1, False), 0)",-0.770992,510,4454,0.114504
4,"((10, 1, False), 1)",-0.390402,3457,13440,0.257217
...,...,...,...,...,...
555,"((20, 2, True), 0)",0.642308,4156,5460,0.761172
556,"((4, 3, False), 0)",-0.221591,411,1056,0.389205
557,"((4, 2, False), 1)",-0.260691,611,1707,0.357938
558,"((12, 4, True), 0)",-0.224223,237,611,0.387889


In [37]:
df5=df4.sort_values("state_action pair")
print(df5.to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
252    ((4, 1, False), 0) -0.745520                      71           558  0.127240
495    ((4, 1, False), 1) -0.628713                     261          1616  0.161510
549    ((4, 2, False), 0) -0.373134                     189           603  0.313433
557    ((4, 2, False), 1) -0.260691                     611          1707  0.357938
556    ((4, 3, False), 0) -0.221591                     411          1056  0.389205
489    ((4, 3, False), 1) -0.226016                     462          1230  0.375610
543    ((4, 4, False), 0) -0.214967                     535          1363  0.392517
551    ((4, 4, False), 1) -0.247258                     360          1003  0.358923
559    ((4, 5, False), 0) -0.182010                     582          1423  0.408995
497    ((4, 5, False), 1) -0.174603                     356           882  0.403628
42     ((4, 6, False), 0) -0.137830                     294           682  0

In [38]:
#Find the state action pair with the lowest win rate
print(df5[df5['win_rate'] == df5['win_rate'].min()])

        state_action pair  value  no_of_positive_rewards  no_of_visits  \
375   ((21, 1, False), 1)   -1.0                       0          5670   
281   ((21, 2, False), 1)   -1.0                       0          3278   
261   ((21, 3, False), 1)   -1.0                       0          3230   
56    ((21, 4, False), 1)   -1.0                       0          3324   
1     ((21, 5, False), 1)   -1.0                       0          3298   
328   ((21, 6, False), 1)   -1.0                       0          3143   
524   ((21, 7, False), 1)   -1.0                       0          4715   
304   ((21, 8, False), 1)   -1.0                       0          4519   
458   ((21, 9, False), 1)   -1.0                       0          4451   
283  ((21, 10, False), 1)   -1.0                       0         16624   

     win_rate  
375       0.0  
281       0.0  
261       0.0  
56        0.0  
1         0.0  
328       0.0  
524       0.0  
304       0.0  
458       0.0  
283       0.0  


In [39]:
#Find the state action pair with the highest win rate
print(df5[df5['win_rate'] == df5['win_rate'].max()])

      state_action pair     value  no_of_positive_rewards  no_of_visits  \
266  ((21, 8, True), 0)  0.990477                   16017         16171   

     win_rate  
266  0.990477  


In [40]:
#Find the state action pair with positive rewards
five_mil_ap_pos_df = df5[df5['value'] >0.]
print(df5[df5['value'] >0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
195    ((9, 6, False), 1)  0.003434                    4971         10484  0.474151
292   ((10, 4, False), 1)  0.015581                    6411         13478  0.475664
336   ((10, 5, False), 1)  0.044127                    6427         13076  0.491511
165   ((10, 6, False), 1)  0.061636                    6525         13028  0.500844
170   ((10, 7, False), 1)  0.002301                    6290         13473  0.466860
295   ((11, 3, False), 1)  0.006990                    7366         15594  0.472361
341   ((11, 4, False), 1)  0.029270                    7528         15579  0.483215
78    ((11, 5, False), 1)  0.052931                    7717         15492  0.498128
334   ((11, 6, False), 1)  0.086588                    7870         15233  0.516642
255   ((11, 7, False), 1)  0.007993                    7441         15763  0.472055
536    ((12, 3, True), 1)  0.008173                     817          1713  0

In [41]:
#Find the state-action pairs with a negative Q value
five_mil_ap_neg_df = df5[df5['value'] <0.]
print(df5[df5['value'] <0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
252    ((4, 1, False), 0) -0.745520                      71           558  0.127240
495    ((4, 1, False), 1) -0.628713                     261          1616  0.161510
549    ((4, 2, False), 0) -0.373134                     189           603  0.313433
557    ((4, 2, False), 1) -0.260691                     611          1707  0.357938
556    ((4, 3, False), 0) -0.221591                     411          1056  0.389205
489    ((4, 3, False), 1) -0.226016                     462          1230  0.375610
543    ((4, 4, False), 0) -0.214967                     535          1363  0.392517
551    ((4, 4, False), 1) -0.247258                     360          1003  0.358923
559    ((4, 5, False), 0) -0.182010                     582          1423  0.408995
497    ((4, 5, False), 1) -0.174603                     356           882  0.403628
42     ((4, 6, False), 0) -0.137830                     294           682  0

In [42]:
#Find the state-action pairs whose Q value is exactly 0
print(df5[df5['value'] ==0.].to_string())

Empty DataFrame
Columns: [state_action pair, value, no_of_positive_rewards, no_of_visits, win_rate]
Index: []


## 3 Million iterations

In [43]:
# 3 mil
env = gym.make('Blackjack-v1')
env.reset()
Q = defaultdict(float)
total_return = defaultdict(float)
N = defaultdict(int)
W = defaultdict(float)
P = defaultdict(int)

def epsilon_greedy_policy(state,Q):

    #set the epsilon value to 0.5
    epsilon = 0.5

    #sample a random value from the uniform distribution, if the sampled value is less than
    #epsilon then we select a random action else we select the best action which has maximum Q
    #value as shown below

    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])
    
num_timesteps = 100

def generate_episode(Q):

    #initialize a list for storing the episode
    episode = []

    #initialize the state using the reset function
    state = env.reset()[0]

    #then for each time step
    for t in range(num_timesteps):

        #select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy(state,Q)

        #perform the selected action and store the next state information
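        #(env.step returns observation, reward, terminated, truncated, info in gym 0.26 and later)
        next_state, reward, terminated, truncated, info = env.step(action)

        #store the state, action, reward in the episode list
        episode.append((state, action, reward))

        #if the episode has ended (terminated or truncated) then break the loop else update the next
        #state to the current state
        if terminated or truncated:
            break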

        state = next_state

        #print(episode)

    return episode

num_iterations = 3000000

#for each iteration
for i in range(num_iterations):

    #so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)

    #get all the state-action pairs in the episode
    all_state_action_pairs = [(s, a) for (s,a,r) in episode]

    #store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s,a,r) in episode]

    #for each step in the episode
    for t, (state, action, reward) in enumerate(episode):

        #if the state-action pair is occurring for the first time in the episode
        if (state, action) not in all_state_action_pairs[0:t]:

            #compute the return R of the state-action pair as the sum of rewards
            R = sum(rewards[t:])

            #update the no of times the reward was positive
            if R > 0:
                P[(state,action)] += 1

            #update total return of the state-action pair
            total_return[(state,action)] = total_return[(state,action)] + R

            #update the number of times the state-action pair is visited
            N[(state, action)] += 1

            #compute the Q value by just taking the average
            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]

            #compute the win rate: the number of times the return was positive divided by the number of visits
            W[(state,action)] = P[(state,action)] / N[(state, action)]

#Convert the no of times the reward was positive dictionary to a pandas dataframe:
df_p = pd.DataFrame(P.items(),columns=['state_action pair','no_of_positive_rewards'])

#Convert the no of times the state-action pair is visited dictionary to a pandas dataframe:
df_n = pd.DataFrame(N.items(),columns=['state_action pair','no_of_visits'])

#Convert the Win Rate dictionary to a pandas dataframe:
df_w = pd.DataFrame(W.items(),columns=['state_action pair','win_rate'])
df_w


        state_action pair  win_rate
0    ((10, 10, False), 0)  0.214763
1     ((13, 10, True), 1)  0.313139
2    ((13, 10, False), 0)  0.207224
3     ((17, 4, False), 0)  0.391964
4     ((16, 2, False), 0)  0.354280
..                    ...       ...
555    ((19, 5, True), 1)  0.488615
556    ((4, 1, False), 1)  0.150588
557    ((15, 3, True), 1)  0.432868
558    ((12, 5, True), 1)  0.495495


In [44]:
df = pd.DataFrame(Q.items(),columns=['state_action pair','value'])
df.head(10)

      state_action pair     value
0  ((10, 10, False), 0) -0.570474
1  ((13, 10, False), 0) -0.585551
2  ((13, 10, False), 1) -0.561429
3   ((13, 10, True), 1) -0.299002
4   ((17, 4, False), 0) -0.087333
5   ((16, 2, False), 0) -0.291440
6   ((16, 2, False), 1) -0.577708
7   ((19, 4, False), 0)  0.435298
8   ((19, 4, False), 1) -0.798946
9   ((5, 10, False), 0) -0.573324


In [45]:
df2 = df.merge(df_p, on = 'state_action pair', how='left')
df3 = df2.merge(df_n, on = 'state_action pair', how='left')
df4 = df3.merge(df_w, on = 'state_action pair', how='left')
df4

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
0    ((10, 10, False), 0) -0.570474                    2281         10621  0.214763
1    ((13, 10, False), 0) -0.585551                    5295         25552  0.207224
2    ((13, 10, False), 1) -0.561429                   14798         76503  0.193430
3     ((13, 10, True), 1) -0.299002                    2636          8418  0.313139
4     ((17, 4, False), 0) -0.087333                    6068         15481  0.391964
..                    ...       ...                     ...           ...       ...
555    ((4, 4, False), 1) -0.198124                     332           853  0.389215
556    ((4, 5, False), 1) -0.168874                     242           604  0.400662
557    ((4, 3, False), 0) -0.231373                     392          1020  0.384314
558    ((4, 1, False), 0) -0.743590                      65           507  0.128205


In [46]:
df5=df4.sort_values("state_action pair")
print(df5.to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
558    ((4, 1, False), 0) -0.743590                      65           507  0.128205
559    ((4, 1, False), 1) -0.642353                     128           850  0.150588
506    ((4, 2, False), 0) -0.296928                     206           586  0.351536
507    ((4, 2, False), 1) -0.229829                     301           818  0.367971
557    ((4, 3, False), 0) -0.231373                     392          1020  0.384314
121    ((4, 3, False), 1) -0.291667                     122           360  0.338889
554    ((4, 4, False), 0) -0.208981                     229           579  0.395509
555    ((4, 4, False), 1) -0.198124                     332           853  0.389215
316    ((4, 5, False), 0) -0.146067                     342           801  0.426966
556    ((4, 5, False), 1) -0.168874                     242           604  0.400662
224    ((4, 6, False), 0) -0.206497                     171           431  0

In [47]:
#Find the state action pair with the lowest win rate
print(df5[df5['win_rate'] == df5['win_rate'].min()])

        state_action pair  value  no_of_positive_rewards  no_of_visits  \
496   ((21, 1, False), 1)   -1.0                       0          3242   
277   ((21, 2, False), 1)   -1.0                       0          2021   
113   ((21, 3, False), 1)   -1.0                       0          1985   
203   ((21, 4, False), 1)   -1.0                       0          1935   
470   ((21, 5, False), 1)   -1.0                       0          1912   
390   ((21, 6, False), 1)   -1.0                       0          1910   
45    ((21, 7, False), 1)   -1.0                       0          2844   
219   ((21, 8, False), 1)   -1.0                       0          2624   
337   ((21, 9, False), 1)   -1.0                       0          2790   
175  ((21, 10, False), 1)   -1.0                       0         10062   

     win_rate  
496       0.0  
277       0.0  
113       0.0  
203       0.0  
470       0.0  
390       0.0  
45        0.0  
219       0.0  
337       0.0  
175       0.0  


In [48]:
#Find the state action pair with the highest win rate
print(df5[df5['win_rate'] == df5['win_rate'].max()])

      state_action pair     value  no_of_positive_rewards  no_of_visits  \
270  ((21, 9, True), 0)  0.991102                    9579          9665   

     win_rate  
270  0.991102  


In [49]:
#Find the state-action pairs with a positive Q value
three_mil_ap_pos_df = df5[df5['value'] >0.]
print(df5[df5['value'] >0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
144    ((9, 6, False), 1)  0.005373                    3179          6700  0.474478
214   ((10, 4, False), 1)  0.031477                    3885          7974  0.487208
205   ((10, 5, False), 1)  0.034145                    3791          7761  0.488468
67    ((10, 6, False), 1)  0.068700                    4022          7933  0.506996
235   ((11, 3, False), 1)  0.013693                    4407          9202  0.478918
456   ((11, 4, False), 1)  0.025652                    4540          9395  0.483236
48    ((11, 5, False), 1)  0.051664                    4661          9407  0.495482
295   ((11, 6, False), 1)  0.088332                    4832          9385  0.514864
75    ((11, 7, False), 1)  0.023457                    4541          9464  0.479818
534    ((12, 4, True), 1)  0.030978                     500          1033  0.484027
530    ((12, 5, True), 1)  0.037037                     495           999  0

In [50]:
#Find the state-action pairs with a negative Q value
three_mil_ap_neg_df = df5[df5['value'] <0.]
print(df5[df5['value'] <0.].to_string())

        state_action pair     value  no_of_positive_rewards  no_of_visits  win_rate
558    ((4, 1, False), 0) -0.743590                      65           507  0.128205
559    ((4, 1, False), 1) -0.642353                     128           850  0.150588
506    ((4, 2, False), 0) -0.296928                     206           586  0.351536
507    ((4, 2, False), 1) -0.229829                     301           818  0.367971
557    ((4, 3, False), 0) -0.231373                     392          1020  0.384314
121    ((4, 3, False), 1) -0.291667                     122           360  0.338889
554    ((4, 4, False), 0) -0.208981                     229           579  0.395509
555    ((4, 4, False), 1) -0.198124                     332           853  0.389215
316    ((4, 5, False), 0) -0.146067                     342           801  0.426966
556    ((4, 5, False), 1) -0.168874                     242           604  0.400662
224    ((4, 6, False), 0) -0.206497                     171           431  0

In [51]:
#Find the state-action pairs whose Q value is exactly 0
print(df5[df5['value'] ==0.].to_string())

      state_action pair  value  no_of_positive_rewards  no_of_visits  win_rate
499  ((13, 4, True), 1)    0.0                     983          2074  0.473963
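
An exact comparison with `0.` only matches a pair whose positive and negative returns cancel out perfectly, which is why the 5 million run returned an empty DataFrame while this run found one row. If the goal is to find pairs whose value is close to zero, a tolerance-based filter is one option. The sketch below uses three rows copied from the outputs above; the tolerance `tol` is a hypothetical choice, not a value from the notebook:

```python
import numpy as np
import pandas as pd

# Three rows copied from the 3 million iteration outputs above
sample = pd.DataFrame({
    'state_action pair': [((13, 4, True), 1), ((9, 6, False), 1), ((4, 1, False), 0)],
    'value': [0.0, 0.005373, -0.743590],
})

tol = 0.01  # hypothetical tolerance; tune it to the noise level of the estimates

# keep the rows whose value lies within +/- tol of zero
near_zero = sample[np.isclose(sample['value'], 0.0, atol=tol)]
print(near_zero.to_string())
```

With this tolerance the first two rows are kept and the clearly negative pair is dropped.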


## Compare all 3 runs (7, 5 and 3 million iterations) on the same index

#### Positive Rewards

In [57]:
# Rename value columns for clarity
seven_mil_ap_pos_df = seven_mil_ap_pos_df.rename(columns={'value': 'value_seven_mil'})
five_mil_ap_pos_df = five_mil_ap_pos_df.rename(columns={'value': 'value_five_mil'})
three_mil_ap_pos_df = three_mil_ap_pos_df.rename(columns={'value': 'value_three_mil'})

# Merge seven_mil with five_mil
merged_df = seven_mil_ap_pos_df[['state_action pair', 'value_seven_mil']].merge(
    five_mil_ap_pos_df[['state_action pair', 'value_five_mil']], 
    on='state_action pair', 
    how='left'
)

# Merge the resulting merged_df with three_mil
final_df = merged_df.merge(
    three_mil_ap_pos_df[['state_action pair', 'value_three_mil']], 
    on='state_action pair', 
    how='left'
)

# Optional: Set 'state_action pair' as the index
final_df.set_index('state_action pair', inplace=True)

final_df = final_df.rename(columns={'value_seven_mil':'7_mil_pos_value','value_five_mil':'5_mil_pos_value','value_three_mil':'3_mil_pos_value'})
print(final_df.to_string())

                      7_mil_pos_value  5_mil_pos_value  3_mil_pos_value
state_action pair                                                      
((10, 4, False), 1)          0.025549         0.015581         0.031477
((10, 5, False), 1)          0.015624         0.044127         0.034145
((10, 6, False), 1)          0.055695         0.061636         0.068700
((11, 3, False), 1)          0.014682         0.006990         0.013693
((11, 4, False), 1)          0.046888         0.029270         0.025652
((11, 5, False), 1)          0.053683         0.052931         0.051664
((11, 6, False), 1)          0.078617         0.086588         0.088332
((11, 7, False), 1)          0.002604         0.007993         0.023457
((12, 3, True), 1)           0.002587         0.008173              NaN
((12, 4, True), 1)           0.015612         0.039696         0.030978
((12, 5, True), 1)           0.048456         0.037124         0.037037
((12, 6, True), 1)           0.086057         0.109113         0
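
Because both merges are left joins that start from the 7 million frame, a pair that is positive only in the smaller runs is dropped. For example, `((9, 6, False), 1)` is positive in the 5 and 3 million outputs but does not appear in the table above. If every pair should be kept, an outer merge is one alternative. The sketch below uses small stand-in frames (`seven`, `five`, `three` are hypothetical names) with values copied from the outputs above:

```python
from functools import reduce
import pandas as pd

key = 'state_action pair'

# Small stand-in frames; the values are copied from the outputs above
seven = pd.DataFrame({key: [((10, 4, False), 1), ((12, 3, True), 1)],
                      '7_mil_pos_value': [0.025549, 0.002587]})
five = pd.DataFrame({key: [((9, 6, False), 1), ((10, 4, False), 1), ((12, 3, True), 1)],
                     '5_mil_pos_value': [0.003434, 0.015581, 0.008173]})
three = pd.DataFrame({key: [((9, 6, False), 1), ((10, 4, False), 1)],
                      '3_mil_pos_value': [0.005373, 0.031477]})

# An outer merge keeps a pair if it is present in any of the three frames
all_runs = reduce(lambda left, right: left.merge(right, on=key, how='outer'),
                  [seven, five, three])
print(all_runs.to_string())
```

Pairs missing from a run show up as `NaN` in that run's column instead of disappearing from the comparison.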

#### Negative Rewards

In [58]:
# Rename value columns for clarity
seven_mil_ap_neg_df = seven_mil_ap_neg_df.rename(columns={'value': 'value_seven_mil'})
five_mil_ap_neg_df = five_mil_ap_neg_df.rename(columns={'value': 'value_five_mil'})
three_mil_ap_neg_df = three_mil_ap_neg_df.rename(columns={'value': 'value_three_mil'})

# Merge seven_mil with five_mil
merged_neg_df = seven_mil_ap_neg_df[['state_action pair', 'value_seven_mil']].merge(
    five_mil_ap_neg_df[['state_action pair', 'value_five_mil']], 
    on='state_action pair', 
    how='left'
)

# Merge the resulting merged_neg_df with three_mil
final_neg_df = merged_neg_df.merge(
    three_mil_ap_neg_df[['state_action pair', 'value_three_mil']], 
    on='state_action pair', 
    how='left'
)

# Rename the value columns so that each one identifies its run
final_neg_df = final_neg_df.rename(columns={'value_seven_mil':'7_mil_neg_value','value_five_mil':'5_mil_neg_value','value_three_mil':'3_mil_neg_value'})

# Optional: Set 'state_action pair' as the index
final_neg_df.set_index('state_action pair', inplace=True)

print(final_neg_df.to_string())

                      7_mil_neg_value  5_mil_neg_value  3_mil_neg_value
state_action pair                                                      
((4, 1, False), 0)          -0.754386        -0.745520        -0.743590
((4, 1, False), 1)          -0.617537        -0.628713        -0.642353
((4, 2, False), 0)          -0.287879        -0.373134        -0.296928
((4, 2, False), 1)          -0.202632        -0.260691        -0.229829
((4, 3, False), 0)          -0.219029        -0.221591        -0.231373
((4, 3, False), 1)          -0.208969        -0.226016        -0.291667
((4, 4, False), 0)          -0.260960        -0.214967        -0.208981
((4, 4, False), 1)          -0.200174        -0.247258        -0.198124
((4, 5, False), 0)          -0.164886        -0.182010        -0.146067
((4, 5, False), 1)          -0.199199        -0.174603        -0.168874
((4, 6, False), 0)          -0.182716        -0.137830        -0.206497
((4, 6, False), 1)          -0.133018        -0.124290        -0

#### Join both DataFrames to compare

In [62]:
combined_df = final_df.join(final_neg_df, how='outer')
print(combined_df.to_string())

                      7_mil_pos_value  5_mil_pos_value  3_mil_pos_value  7_mil_neg_value  5_mil_neg_value  3_mil_neg_value
state_action pair                                                                                                         
((4, 1, False), 0)                NaN              NaN              NaN        -0.754386        -0.745520        -0.743590
((4, 1, False), 1)                NaN              NaN              NaN        -0.617537        -0.628713        -0.642353
((4, 2, False), 0)                NaN              NaN              NaN        -0.287879        -0.373134        -0.296928
((4, 2, False), 1)                NaN              NaN              NaN        -0.202632        -0.260691        -0.229829
((4, 3, False), 0)                NaN              NaN              NaN        -0.219029        -0.221591        -0.231373
((4, 3, False), 1)                NaN              NaN              NaN        -0.208969        -0.226016        -0.291667
((4, 4, False), 
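
One way to read a comparison like this is to measure how far the three estimates for the same pair are from each other. The sketch below computes the spread (maximum minus minimum) per pair for three rows copied from the negative table above. The pair labels are written as strings for brevity, and `spread` is a hypothetical name:

```python
import pandas as pd

# Three rows copied from the negative-value comparison above
neg = pd.DataFrame(
    {
        '7_mil_neg_value': [-0.754386, -0.287879, -0.219029],
        '5_mil_neg_value': [-0.745520, -0.373134, -0.221591],
        '3_mil_neg_value': [-0.743590, -0.296928, -0.231373],
    },
    index=['((4, 1, False), 0)', '((4, 2, False), 0)', '((4, 3, False), 0)'],
)

# spread of the estimates across the three runs, largest first
spread = (neg.max(axis=1) - neg.min(axis=1)).sort_values(ascending=False)
print(spread.to_string())
```

A large spread suggests the estimate for that pair has not settled yet. These pairs have only a few hundred to a couple of thousand visits in the tables above.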

## Render Blackjack environment

In [None]:
# render_mode="human" opens a window so the game can be watched; it is used here for display only.
env = gym.make('Blackjack-v1', render_mode="human")
env.reset()

In [None]:
total_return = defaultdict(float)
Q = defaultdict(float)
N = defaultdict(int)
W = defaultdict(float)
P = defaultdict(int)
num_iterations = 500
num_timesteps = 100

In [None]:
def epsilon_greedy_policy(state,Q):

    #set the epsilon value to 0.5
    epsilon = 0.5

    #sample a random value from the uniform distribution, if the sampled value is less than
    #epsilon then we select a random action else we select the best action which has maximum Q
    #value as shown below

    if random.uniform(0,1) < epsilon:
        return env.action_space.sample()
    else:
        return max(list(range(env.action_space.n)), key = lambda x: Q[(state,x)])

In [None]:
def generate_episode(Q):

    #initialize a list for storing the episode
    episode = []

    #initialize the state using the reset function
    state = env.reset()[0]

    #then for each time step
    for t in range(num_timesteps):

        #select the action according to the epsilon-greedy policy
        action = epsilon_greedy_policy(state,Q)

        #perform the selected action and store the next state information
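        #(env.step returns observation, reward, terminated, truncated, info in gym 0.26 and later)
        next_state, reward, terminated, truncated, info = env.step(action)

        #store the state, action, reward in the episode list
        episode.append((state, action, reward))

        #if the episode has ended (terminated or truncated) then break the loop else update the next
        #state to the current state
        if terminated or truncated:
            break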

        state = next_state

        #print(episode)

    return episode


In [None]:
#for each iteration
for i in range(num_iterations):

    #so, here we pass our initialized Q function to generate an episode
    episode = generate_episode(Q)

    #get all the state-action pairs in the episode
    all_state_action_pairs = [(s, a) for (s,a,r) in episode]

    #store all the rewards obtained in the episode in the rewards list
    rewards = [r for (s,a,r) in episode]

    #for each step in the episode
    for t, (state, action, reward) in enumerate(episode):

        #if the state-action pair is occurring for the first time in the episode
        if (state, action) not in all_state_action_pairs[0:t]:

            #compute the return R of the state-action pair as the sum of rewards
            R = sum(rewards[t:])

            #update the no of times the reward was positive
            if R > 0:
                P[(state,action)] += 1

            #update total return of the state-action pair
            total_return[(state,action)] = total_return[(state,action)] + R

            #update the number of times the state-action pair is visited
            N[(state, action)] += 1

            #compute the Q value by just taking the average
            Q[(state,action)] = total_return[(state, action)] / N[(state, action)]

            #compute the win rate: the number of times the return was positive divided by the number of visits
            W[(state,action)] = P[(state,action)] / N[(state, action)]
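
The loop above keeps `total_return` only to compute an average. As a minimal sketch of an equivalent form, the same average can be maintained with the running-mean update `Q += (R - Q) / N`, and the first-visit check can use a set instead of slicing the list of pairs. The function name `first_visit_update` and the two hand-written episodes are hypothetical and only illustrate the arithmetic:

```python
from collections import defaultdict

def first_visit_update(episode, Q, N):
    """Update Q and N in place from one episode using first-visit MC.

    Uses the running-mean form Q += (R - Q) / N, which yields the same
    average as total_return / N without storing total_return.
    """
    rewards = [r for (_, _, r) in episode]
    seen = set()  # state-action pairs already counted in this episode
    for t, (state, action, _) in enumerate(episode):
        if (state, action) in seen:
            continue
        seen.add((state, action))
        R = sum(rewards[t:])  # undiscounted return from step t onwards
        N[(state, action)] += 1
        Q[(state, action)] += (R - Q[(state, action)]) / N[(state, action)]

Q, N = defaultdict(float), defaultdict(int)

# Two hand-written (hypothetical) episodes: hit, then stand and win; stand and lose
first_visit_update([((12, 5, False), 1, 0.0), ((18, 5, False), 0, 1.0)], Q, N)
first_visit_update([((18, 5, False), 0, -1.0)], Q, N)
print(Q[((18, 5, False), 0)], N[((18, 5, False), 0)])
```

After one win and one loss, the pair `((18, 5, False), 0)` has a Q value of 0.0 over 2 visits, which matches what `total_return / N` would give.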