• The game is played with an infinite deck of cards (i.e. cards are sampled
with replacement)
• Each draw from the deck results in a value between 1 and 10 (uniformly
distributed) with a colour of red (probability 1/3) or black (probability
2/3).
• There are no aces or picture (face) cards in this game
• At the start of the game both the player and the dealer draw one black
card (fully observed)
• Each turn the player may either stick or hit
• If the player hits then she draws another card from the deck
• If the player sticks she receives no further cards
• The values of the player’s cards are added (black cards) or subtracted (red
cards)
• If the player’s sum exceeds 21, or becomes less than 1, then she “goes
bust” and loses the game (reward -1)
• If the player sticks then the dealer starts taking turns. The dealer always
sticks on any sum of 17 or greater, and hits otherwise. If the dealer goes
bust, then the player wins; otherwise the outcome, win (reward +1),
lose (reward -1), or draw (reward 0), is decided by who has the larger sum.
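
The card rules above can be sketched as three small helpers. This is only an illustration under stated assumptions: it uses Python's standard random module, and the names draw_card, apply_card and is_bust are hypothetical, not part of the assignment.

```python
import random

def draw_card(rng=random):
    """Sample one card: value uniform in 1..10, red with probability 1/3."""
    value = rng.randint(1, 10)  # random.randint is inclusive on both ends
    colour = "red" if rng.random() < 1 / 3 else "black"
    return value, colour

def apply_card(total, card):
    """Black cards add to the running sum, red cards subtract from it."""
    value, colour = card
    return total + value if colour == "black" else total - value

def is_bust(total):
    """A sum above 21 or below 1 is bust."""
    return total > 21 or total < 1
```

For example, apply_card(10, (4, "red")) gives 6, and is_bust(0) is True because the sum has dropped below 1.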



You should write an environment that implements the game Easy21. Specifically, write a function, named step, which takes as input a state s (dealer’s first
card 1–10 and the player’s sum 1–21), and an action a (hit or stick), and returns
a sample of the next state s′ (which may be terminal if the game is finished)
and reward r. We will be using this environment for model-free reinforcement
learning, and you should not explicitly represent the transition matrix for the
MDP. There is no discounting (γ = 1). You should treat the dealer’s moves as
part of the environment, i.e. calling step with a stick action will play out the
dealer’s cards and return the final reward and terminal state.

In [3]:
import numpy as np

In [1]:
def step(s,a): ## returns (next state, reward); next state is "terminal" or a (dealer card, player sum) tuple
    dCard = s[0]
    mySum = s[1]
    if a == "hit":
        ## draw a card: value 1-10, red (subtract) with probability 1/3, otherwise black (add)
        p = np.random.uniform(0,1)
        mySum += np.random.randint(1,11)*(-1 if p <=1/3 else 1)
        if mySum > 21 or mySum < 1:
            return ("terminal",-1) ## player goes bust
        else:
            return ((dCard,mySum),0)
    else:
        ## player sticks: play out the dealer, who hits on any sum below 17
        dSum = dCard
        while 1 <= dSum < 17:
            p = np.random.uniform(0,1)
            dSum += np.random.randint(1,11)*(-1 if p <=1/3 else 1)
        if dSum < 1 or dSum > 21 or mySum > dSum: ## dealer bust, or player has the larger sum
            return ("terminal",1)
        elif mySum < dSum:
            return ("terminal",-1)
        else:
            return ("terminal",0)
    

In [4]:
s = (5,7)
a = "hit"
newS,rew = step(s,a)
print(newS)
print(rew)

if newS != "terminal": ## the hit may already have ended the game
    print(step(newS,"stick"))


(5, 15)
0
('terminal', -1)


In [None]:
Apply Monte-Carlo control to Easy21. Initialise the value function to zero. Use
a time-varying scalar step-size of α_t = 1/N(s_t, a_t) and an eps-greedy exploration
strategy with eps = N0/(N0 + N(s_t)), where N0 = 100 is a constant, N(s) is
the number of times that state s has been visited, and N(s, a) is the number
of times that action a has been selected from state s. Feel free to choose an
alternative value for N0, if it helps produce better results. Plot the optimal
value function V∗(s) = max_a Q∗(s, a) using similar axes to the following figure
taken from Sutton and Barto’s Blackjack example.
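
The two schedules described above can be written as plain functions. This is a minimal sketch, not the notebook's implementation; the names epsilon, step_size and mc_update are hypothetical.

```python
def epsilon(n_state, n0=100):
    """Exploration rate eps = N0 / (N0 + N(s))."""
    return n0 / (n0 + n_state)

def step_size(n_state_action):
    """Step size alpha = 1 / N(s, a); needs at least one visit to (s, a)."""
    return 1.0 / n_state_action

def mc_update(q, g, n_state_action):
    """Incremental Monte-Carlo update: Q <- Q + alpha * (G - Q)."""
    return q + step_size(n_state_action) * (g - q)

# An unvisited state is explored with certainty, and exploration decays with visits.
print(epsilon(0))    # 1.0
print(epsilon(100))  # 0.5
```

With alpha = 1/N(s, a) the update is a running average of the observed returns G, so mc_update(0.0, 1.0, 1) gives 1.0 and a second return of -1 then brings the estimate back to 0.0.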

In [102]:

def helper1 (s,actionsInState,N): ## N(s): visits to state s, summed over the actions tried there
    actions = actionsInState[s]
    total = 0
    for a in actions:
        total += N[(s,a)]
    return total

def helper2(s,actionsInState,Q): ## greedy action in state s with respect to Q
    if len(actionsInState[s]) == 0:
        ## no action has been tried in s yet, so pick one uniformly at random
        p = np.random.uniform(0,1)
        return "hit" if p <= 1/2 else "stick"
    bestAction = None
    bestValue = float("-inf")
    actions = actionsInState[s] ## only actions already tried in s have an entry in Q
    for a in actions:
        if Q[(s,a)] > bestValue:
            bestAction = a
            bestValue = Q[(s,a)]
    return bestAction
    

N0 = 100
Q = {} ##maps (s,a) --> Q(s,a)
actionsInState = {} ##maps s --> list of actions done in this state
N = {} ##maps (s,a) --> number of times action a has been taken in state s (i.e. the pair (s,a))
samples = 1000000
V  = {}

for _ in range(samples):
    returns = {} #maps (s,a) --> return for this sample from (s,a) pair
    s = (np.random.randint(1,11),np.random.randint(1,11))
    done = False
    while not done:
         
        if not s in actionsInState:
            actionsInState[s] = []
        eps = N0/(N0+helper1(s,actionsInState,N)) ##wrong, should update after every sample
        a = None                                ## but never mind
        if np.random.uniform(0,1) < 1-eps:
            a = helper2(s,actionsInState,Q)
        else:
            p = np.random.uniform(0,1)
            a = "hit" if p <= 1/2 else "stick"
        
         
        if not a in actionsInState[s]:
            actionsInState[s].append(a)   
        if not (s,a) in returns:
            returns[(s,a)] = 0
        if not (s,a) in N:
            N[(s,a)] = 0
        N[(s,a)] += 1
        if not (s,a) in Q:
            Q[(s,a)]= 0
           
        (news,rew) = step(s,a)
        for (state,action) in returns:
            returns[(state,action)] += rew
        if news == "terminal":
            done = True
        else:
            s = news
    ##now here we should update the Q values
    for (s,a) in returns:
        Q[(s,a)] += 1/N[(s,a)]*(returns[(s,a)]-Q[(s,a)])

for s in actionsInState:
    V[s] = Q[(s,helper2(s,actionsInState,Q))]
    
    
    
    
for s in range(2,22):
    print(str(s) + ": " + str(V[(5,s)]))


2: 0.1189634864546526
3: 0.11641620696214558
4: 0.11750269951899495
5: 0.09351221465758935
6: 0.1071041481919397
7: 0.1070240657405595
8: 0.1099455714992577
9: 0.0970911192448475
10: 0.14348061231018586
11: 0.14538813789326566
12: 0.10298102981029818
13: 0.14208389715832198
14: 0.06277372262773724
15: 0.08069164265129687
16: 0.10668563300142242
17: 0.1880877742946709
18: 0.4649776453055141
19: 0.6222222222222219
20: 0.7722473604826542
21: 0.9666666666666667
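
The cell above prints only the slice for a dealer card of 5, while the assignment asks for a plot over all states. One possible way to draw it is sketched below. It assumes V is a dict keyed by (dealer card, player sum) as in the cell above, and that matplotlib is installed; value_grid and plot_value_function are hypothetical names.

```python
def value_grid(V, default=float("nan")):
    """Return a 10 x 21 nested list: rows are dealer cards 1..10,
    columns are player sums 1..21. Missing states get `default`."""
    return [[V.get((dealer, player), default) for player in range(1, 22)]
            for dealer in range(1, 11)]

def plot_value_function(V):
    """Surface plot of V over (dealer card, player sum).
    numpy and matplotlib are imported lazily and assumed to be installed."""
    import numpy as np
    import matplotlib.pyplot as plt

    grid = np.array(value_grid(V))
    dealer, player = np.meshgrid(np.arange(1, 11), np.arange(1, 22), indexing="ij")
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.plot_surface(dealer, player, grid)
    ax.set_xlabel("Dealer showing")
    ax.set_ylabel("Player sum")
    ax.set_zlabel("V*(s)")
    plt.show()
```

Calling plot_value_function(V) after training would then show the whole surface rather than a single row.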


In [108]:

totalReward = 0 ## total reward of the greedy policy over 100000 games
for i in range(0,100000):
    s = (np.random.randint(1,11),np.random.randint(1,11))
    while True:
        action = helper2(s,actionsInState,Q)
        (newS,rew) = step(s,action)
        if newS == "terminal":
            totalReward += rew
            break
        s = newS
print(totalReward)

4599
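
A total of 4599 over 100000 games is a mean reward of 4599/100000 = 0.04599 per game. To judge how noisy such an estimate is, the per-game rewards could be collected in a list and summarised as below. This is a sketch only: mean_and_stderr is a hypothetical helper, and the evaluation cell above does not store individual rewards.

```python
import math

def mean_and_stderr(rewards):
    """Sample mean and standard error of a list of per-game rewards."""
    n = len(rewards)
    mean = math.fsum(rewards) / n
    # Unbiased sample variance (n - 1 in the denominator).
    var = math.fsum((r - mean) ** 2 for r in rewards) / (n - 1)
    return mean, math.sqrt(var / n)
```

Since every reward is -1, 0 or +1, the variance is at most about 1, so with 100000 games the standard error is at most roughly 1/sqrt(100000), which is about 0.0032.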
