# Assignment #4 - Reinforcement Learning



## <font color="blue"> Tanvi Rasam</font>

# I. Overview

The aim of this assignment is to implement an agent that plays a simplified Rummy game using reinforcement learning. Reinforcement learning is the area of machine learning concerned with how software agents should act in an environment so as to maximize some notion of cumulative reward. It lets an agent determine the ideal behaviour within a specific context automatically, from experience. That behaviour can be learned with Q-learning or SARSA, and here we build an agent with one of these algorithms to play in the Rummy environment.

In this game each player is dealt 3 random cards ranked from 'A' to '5' (the environment code below uses ranks 'A' through '7'). On each turn a player first picks a card, either from the closed deck or from the open pile, depending on the cards in hand, and then selects a card from the hand to drop onto the pile. The goal is to hold 3 cards of the same rank. As soon as one player reaches this goal state, that player melds the cards and the game stops; every other player scores the sum of the cards left in hand. Each game lasts at most 10 rounds (20 turns in the code below); if nobody melds by then, every player scores the sum of the cards in hand. The player with the lowest score wins.

Apart from implementing the RLAgent, we also need to choose good values for the parameters γ, α, and ϵ.
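
The end-of-game scoring rule described above can be sketched in a few lines. This is only an illustration, not the environment's code: the helpers `hand_score` and `winner` and the two sample hands are assumptions made up for the example, and the card values mirror the ranks used in this notebook (ace counts 1, number cards count face value).

```python
# Hypothetical helpers illustrating the scoring rule; not part of the assignment code.
CARD_VALUE = {'A': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7}

def hand_score(ranks):
    """Sum of the card values left in a hand (lower is better)."""
    return sum(CARD_VALUE[r] for r in ranks)

def winner(hands):
    """Name of the player whose remaining hand has the lowest score."""
    return min(hands, key=lambda name: hand_score(hands[name]))

# Made-up hands for two players at the end of a game.
hands = {'agent': ['A', '2', '5'], 'bot': ['3', '4', '5']}
scores = {name: hand_score(ranks) for name, ranks in hands.items()}
print(scores)          # {'agent': 8, 'bot': 12}
print(winner(hands))   # agent
```

A player who has melded holds no cards, so `hand_score([])` is 0 and that player wins.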

## Rummy Environment

Importing Libraries

In [1]:
import random
from functools import reduce
from collections import defaultdict
import numpy as np
from copy import copy
%matplotlib inline

In [2]:
SUIT = ['H','S','D','C']
RANK = ['A', '2', '3', '4', '5','6','7']
RANK_VALUE = {'A': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7, '8': 8, '9': 9, 'T': 10, 'J': 10, 'Q': 10, 'K': 10}

## Card Class Definition
__init__  : Defines the card details such as rank, suit and calculates the rank value


In [3]:
class Card:
    def __init__(self,rank,suit):
        self.rank = rank
        self.suit = suit
        self.rank_to_val = RANK_VALUE[self.rank]
   
    def __str__(self):
        return f'{self.rank}{self.suit}'

    def __repr__(self):
        return f'{self.rank}{self.suit}'

   
    def __eq__(self, other):
        return self.rank == other.rank and self.suit == other.suit

## Deck Class Definition
__shuffle__ : Shuffles the deck in random order

__draw_card__ : Draws a card from the top of the deck

In [4]:
class Deck:
    def __init__(self, packs):
        self.packs = packs
        self.cards = []
        for pack in range(0, packs) :
            for suit in SUIT:
                for rank in RANK:
                    self.cards.append(Card(rank, suit))
   
    def shuffle(self):
        random.shuffle(self.cards)
   
    def draw_card(self):
        card = self.cards[0]
        self.cards.pop(0)
        return card

## Player Class:

### 1.__init__(self,name,stash=list(),isBot=False): 
Initializing stash, name, isBot/dealer points for each player.

### 2. deal_card(self,card):
This method appends the card to the stash and checks that the stash never holds more than one card above the hand size of the game.

### 3. drop_card(self,card):
This method removes the card from the stash and adds that card to the pile.

### 4. meld(self):
This method looks for three or more cards of the same rank in the hand. If it finds them, it moves those cards from the hand to the melded cards array in the game.

### 5. stash_score(self):
This method calculates sum of all the cards in stash according to the rank of each card.

### 6. get_info(self,debug):
This function fetches all the information of the player.
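
The meld check described above amounts to grouping the hand by rank. The following is a minimal sketch of that idea on plain rank strings; `find_melds` and its `meld_size` parameter are hypothetical names used only for this illustration, and the actual `meld` method additionally moves the cards out of the stash.

```python
from collections import defaultdict

def find_melds(ranks, meld_size=3):
    """Return the ranks that appear at least `meld_size` times in a hand."""
    counts = defaultdict(int)
    for rank in ranks:
        counts[rank] += 1
    return [rank for rank, n in counts.items() if n >= meld_size]

print(find_melds(['3', '3', '5', '3']))  # ['3']
print(find_melds(['A', '2', '2', '7']))  # []
```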

In [5]:
class Player:

    """  
        Player class to create a player object.
        eg: player = Player("player1", list(), isBot = False)
        Above declaration will be for your agent.
        All the player names should be unique or else you will get error.
       
    """

    def __init__(self, name, stash=list(), isBot=False, points=0, conn=None):
        self.stash = stash
        self.name = name
        self.game = None
        self.isBot = isBot
        self.points = points
        self.conn = conn


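    def deal_card(self,card):
        try :
            self.stash.append(card)
            # A hand may briefly hold one extra card between picking up and dropping
            if len(self.stash) > self.game.max_card_length + 1:
                raise ValueError('Cannot have more cards than the hand size plus one')
        except ValueError as err:
            print(err.args)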


    def drop_card(self,card):
        self.stash.remove(card)
        self.game.add_pile(card)
        return -1


    def meld(self):
        card_hash = defaultdict(list)
        for card in self.stash:
            card_hash[card.rank].append(card)
        melded_card_ranks = []
        for (card_rank,meld_cards) in card_hash.items():
            if len(meld_cards) >= 3 :
                self.game.meld.append(meld_cards)
                melded_card_ranks.append(card_rank)
                for card in meld_cards:
                    self.stash.remove(card)
       
        for card_rank in melded_card_ranks :
            card_hash.pop(card_rank)
        return len(melded_card_ranks) > 0


    def stash_score(self) :
        score = 0
        for card in self.stash :
            score += RANK_VALUE[card.rank]
        return score


    def get_info(self, debug):
        if debug :
            print(f'Player Name : {self.name} \n Stash Score: {self.stash_score()} \n Stash : {", ".join(str(x) for x in self.stash)}')
        card_ranks = []
        card_suits = []
        pileset = None
        pile = None
        for card in self.stash :
            card_ranks.append(RANK_VALUE[card.rank])
            card_suits.append(card.suit)
        if len(self.game.pile) > 0 :
            return {"Stash Score" : self.stash_score(), "CardSuit":  card_suits, "CardRanks": card_ranks, "PileRank": self.game.pile[-1].rank, "PileSuit":self.game.pile[-1].suit}
        return {"Stash Score" : self.stash_score(), "CardSuit":  card_suits, "CardRanks": card_ranks}

## Game Environment:

### 1. __init__(self, players, max_card_length=5, max_turns=20):
Stores the hand size and the turn limit, then calls reset to set up the deck, pile and players.

### 2. add_pile(self, card):
This method takes a card as argument and first checks the number of cards in the deck. If it is 0, the cards from the pile are moved back to the deck and shuffled. The passed card is then appended to the pile.

### 3. pick_card(self, player, action):
This method lets the player pick up a card from either the pile or the deck, based on the action.

- If action = 0, the player picks a card from the pile.
- If action = 1, the player picks a card from the deck.
- The meld condition is checked after the player picks the card; if it is satisfied, the player has won.
- You can modify the rewards in the return value, but only the values.

### 4. pick_from_pile(self, player):
This method moves the top card of the pile into the player's stash, so the pile shrinks by one card.

### 5. pick_from_deck(self, player):
This method is similar to the one above, but picks the card from the deck.

### 6. get_player(self, player_name):
This function fetches the player object with the given player_name.

### 7. computer_play(self, player):
This method defines the play of the computer/dealer in the following sequence:

- Randomly choose between picking up a card from the deck or from the pile.
- Check the meld condition afterwards.
- If any cards remain in the stash after the meld check, drop a random card from the stash.

### 8. play(self):
This method checks whether the game is over. It returns True when any player's stash is empty (that player has won) or when the number of remaining turns has reached 0, and False otherwise. The card values left in each stash are then what the players score.

### 9. drop_card(self, player, card):
This method drops the given card from the given player's stash and returns the reward associated with it.

- You can modify the reward values that are returned.

### 10. reset(self, players):
This method reinitializes the deck, the pile and the players.

### 11. _update_turn(self):
This method decrements and returns the number of turns left in the game.

In [6]:
class RummyAgent() :
    """
    Simple Rummy Environment
   
    Simple Rummy is a game where you need to make all the cards in your hand same before your opponent does.
    Here you are given 3 cards in your hand/stash to play.
    For the first move you have to pick a card from the deck or from the pile.
    The card in deck would be random but you can see the card from the pile.
    In the next move you will have to drop a card from your hand.
    Your goal is to collect all the cards of the same rank.
    The higher the rank of the card, the more points you lose in the game.
    You need to keep the stash score low. Eg, if you have AH,7S,5D your strategy would be to either find the first pair of the card or remove the highest card in your hand.
    You only have 20 turns to either win the game or collect low scoring cards.
    You can't see other players cards or their stash scores.
   
    Parameters
    ====
    players: Player objects which will play the game.
    max_card_length : Number of cards each player can have
    max_turns: Number of turns in a rummy game
    """

    def __init__(self, players, max_card_length=5, max_turns=20) :
        self.max_card_length = max_card_length
        self.max_turns = max_turns
        self.reset(players)
       
    def update_player_cards(self,players):
        for player in players :
            player = Player(player.name, list(), isBot=player.isBot, points=player.points, conn=player.conn)
            stash = []
            for i in range(self.max_card_length):
                player.stash.append(self.deck.draw_card())
            player.game = self
            self.players.append(player)
        self.pile = [self.deck.draw_card()]

    def add_pile(self,card):
        if len(self.deck.cards) == 0 :
            self.deck.cards.extend(self.pile)
            self.deck.shuffle()
            self.pile = []
        self.pile.append(card)
       
       
    def pick_card(self,player,action):
        if action == 0:
            self.pick_from_pile(player)
        else :
            self.pick_from_deck(player)
        state = [c for c in player.stash]
        if player.meld() :
            return {"reward" : 10, 'state': state}
        else :
            return {"reward" : -1, 'state': state}
#             return -player.stash_score()
       
    def pick_from_pile(self, player):
        card = self.pile[-1]
        self.pile.pop()
        return player.stash.append(card)
     
    def pick_from_deck(self, player):
        return player.stash.append(self.deck.draw_card())
   
    def get_player(self,  player_name):
        return_player = [player for player in self.players if player.name == player_name]
        if len(return_player) != 1:
            print("Invalid Player")
            return None
        else:
            return return_player[0]
   
    def drop_card(self,player,card):
        player.drop_card(card)
        return {"reward" : -1}


    def computer_play(self,player):
        #Gets a card from deck or pile
        if random.randint(0,1) == 1 :
            self.pick_from_pile(player)
        else :
            self.pick_from_deck(player)
           
        #tries to meld if it can
#         if random.randint(0,10) > 5 :
        player.meld()
       
        #removes a card from the stash
        if len(player.stash) != 0:
            card = player.stash[(random.randint(0,len(player.stash) - 1))]
            player.drop_card(card)
       
    def play(self):
        for player in self.players :
            if len(player.stash) == 0 :
                return True
        if self.max_turns <= 0 :
            return True
        return False

    def _update_turn(self):
        self.max_turns -= 1
        return self.max_turns
   
    def reset(self,players,max_turns=20):
        self.players = []
        self.deck = Deck(1)
        self.deck.shuffle()
        self.meld = []
        self.pile = []
        self.max_turns = max_turns
        self.update_player_cards(players)

# III. Methods

- Decide your TD learning approach: SARSA or Q-learning? 
- Decide your function approximator.
- Describe your approach and the reason why you select it.
- Finish epsilon_greedy function and other TODOs. Explain it.
- Explain your codes.

# SARSA and Q-Learning

Temporal difference (TD) learning refers to a class of model-free reinforcement learning methods which learn by bootstrapping from the current estimate of the value function. It blends ideas from the Monte Carlo (MC) method and the Dynamic Programming (DP) method.

SARSA is one of the TD algorithms for control. Its name comes from the quintuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ used in each update: the agent takes one step from one state-action pair to the next and collects the reward R along the way. SARSA is an on-policy method: it learns the action-value function Q of the same policy π that it follows.

Q-learning is an off-policy TD control algorithm that seeks the best action to take in the current state. It is considered off-policy because its update evaluates the greedy policy even while the agent behaves differently, for example by taking random exploratory actions. More specifically, Q-learning seeks to learn a policy that maximizes the total reward. Its target policy is the greedy policy and its behavior policy is the ε-greedy policy, which ensures exploration.

$\alpha$ - the learning rate, set between 0 and 1. Setting it to 0 means that the Q-values are never updated, hence nothing is learned. Setting a high value such as 0.9 means that learning can occur quickly.

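$\gamma$ - the discount factor, also set between 0 and 1. It models the fact that future rewards are worth less than immediate rewards. Mathematically, the discount factor needs to be less than 1 for the discounted return to stay finite in tasks that never terminate; in an episodic game with a bounded number of turns, such as this one, $\gamma = 1$ is also admissible.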
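
The effect of $\gamma$ can be seen on a short reward sequence. The sketch below is an illustration only: `discounted_return` is a hypothetical helper, and the reward sequence is made up using the environment's reward values of -1 for an ordinary step and 10 for a meld.

```python
def discounted_return(rewards, gamma):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the last reward backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [-1, -1, -1, 10]                # three ordinary steps, then a meld
print(discounted_return(rewards, 1.0))    # 7.0
print(discounted_return(rewards, 0.5))    # -0.5
```

With $\gamma = 1$ the late meld reward counts in full; with $\gamma = 0.5$ it is worth only an eighth of its value by the time it is reached, so the same episode looks unattractive from the start state.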

Q-learning and SARSA are both policy control methods which work on evaluating the optimal Q-value for all action-state pairs.

They differ in their update rule:
For SARSA-
$$
    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha ( R_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)) 
$$

For Q-learning-

$$
    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha ( R_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)) 
$$
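
To make the difference between the two update rules concrete, here is a minimal sketch on a made-up Q-table with two states and two actions. The table values, `alpha`, `gamma` and the function names are assumptions chosen for illustration; this is not the agent's actual code.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA: bootstrap from the action actually chosen in the next state."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Q-learning: bootstrap from the best action in the next state."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# Made-up table: rows are states, columns are actions.
Q0 = np.array([[0.0, 0.0],
               [2.0, 5.0]])
alpha, gamma = 0.5, 1.0

# Same transition (s=0, a=0, r=-1, s_next=1); the behavior policy happens to
# pick the non-greedy action 0 in the next state.
Q_sarsa = Q0.copy()
sarsa_update(Q_sarsa, 0, 0, -1, 1, 0, alpha, gamma)

Q_ql = Q0.copy()
q_learning_update(Q_ql, 0, 0, -1, 1, alpha, gamma)

print(Q_sarsa[0, 0])  # 0.5
print(Q_ql[0, 0])     # 2.0
```

The two rules give the same result whenever the next action happens to be the greedy one; they diverge only on exploratory steps, as in this example.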


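Unlike SARSA, where the next action $a_{t+1}$ in the target is the one actually chosen by the current policy, Q-learning builds its target in a greedy fashion by taking the maximum of $Q(s_{t+1}, a)$ over all actions $a$, regardless of which action the agent takes next.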



# TD learning approach: SARSA or Q-learning

I am using Q-learning rather than SARSA because I want my agent to explore while still learning the best possible action in each state under the greedy policy. Q-learning directly learns the optimal policy, whilst SARSA learns a near-optimal policy that accounts for its own exploration. To reach the win state in the minimum number of steps, the optimal policy is needed. SARSA approaches convergence while allowing for possible penalties from exploratory moves, whilst Q-learning ignores them. That makes SARSA more conservative: if there is a risk of a large negative reward close to the optimal path, Q-learning will tend to trigger that reward whilst exploring, whilst SARSA will tend to avoid a dangerous optimal path and only slowly learn to use it as the exploration parameter is reduced. Thus, I choose Q-learning over SARSA.

# Function Approximator

Q-learning can be combined with function approximation. This makes it possible to apply the algorithm to larger problems, even when the state space is continuous. Function approximation may speed up learning in finite problems, due to the fact that the algorithm can generalize earlier experiences to previously unseen states.

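Since the state space of this game is small, I use the simplest representation, a table with one Q-value per state-action pair, and derive the policy greedily from it:
 $$ \pi(s) = \arg\max_a Q^*(s, a) $$
 
That is, in each state the policy picks the action with the largest Q-value in the Q-table.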
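
Greedy selection from one row of a Q-table, with optional ε-greedy exploration on top, can be sketched as follows. The function name `epsilon_greedy`, the sample Q-values and the seeded generator are assumptions for illustration; the agent below implements the same idea with its own method names.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the arg max."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit (first maximum on ties)

rng = np.random.default_rng(0)
q_values = np.array([0.2, 1.5, -0.3, 0.0])        # made-up Q-values for 4 actions

# With epsilon = 0 the choice is purely greedy.
print(epsilon_greedy(q_values, epsilon=0.0, rng=rng))  # 1
```

Setting `epsilon` to 0 recovers the greedy policy $\pi(s)$ above, while a positive value keeps some random exploration.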

## RLAgent for Rummy

In [7]:
class RLAgent:
    """
        Reinforcement Learning Agent Model for training/testing
        with Tabular function approximation

    """

    def __init__(self, env, epsilon=0.2, alpha=0.1, gamma=1):
        self.env = env
#         self.size = env.get_size()
        self.n_a = 2
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        
        self.Q = np.zeros((8, 8, 8, 8, 2, 4)) # for three cards in hand, card on the pile, pick action from pile or deck, which card to drop out of 4

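    def epsilon_greed_pickup(self, s):
        """Epsilon-greedy choice of the pickup action (0 = pile, 1 = deck)."""
        if np.random.uniform() < self.epsilon:
            action = np.random.randint(2)      # explore: random action
        else:
            # exploit: first action with the highest Q-value (drop index fixed at 0)
            action = int(np.argmax(self.Q[s[0], s[1], s[2], s[3], :, 0]))
        return action

    def epsilon_greed_dropdown(self, s):
        """Epsilon-greedy choice of the hand position (0 to 3) of the card to drop."""
        if np.random.uniform() < self.epsilon:
            position = np.random.randint(4)    # explore: random position
        else:
            # exploit: first position with the highest Q-value (pickup index fixed at 0)
            position = int(np.argmax(self.Q[s[0], s[1], s[2], s[3], 0, :]))
        return position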
   
   
    def train(self, **params):
        rummy = self.env        # use the environment given to the agent, not a notebook global

        # parameters
        rewards=[]
        maxiter = 100           # Number of games 
        debug = True
        for j in range(maxiter):
            for player in rummy.players :
                player.points = player.stash_score()

            rummy.reset(rummy.players)
            random.shuffle(rummy.players)
            if debug :
                print(f'**********************************\n\t\tGame Starts : {j}\n***********************************')
            while not rummy.play() :
                rummy._update_turn()
                print(rummy.max_turns)
                for player in rummy.players:
                    if player.isBot :
                        if rummy.play():
                            continue
                        if debug :
                            print(f'{player.name} Plays')
                        rummy.computer_play(player)
                        if debug :
                            player.get_info(debug)
                            if not player.stash :
                                print(f'{player.name} wins the round')
                        print("------------------------------------------------------------",player.stash)

                    else :
                        if rummy.play() :
                            continue
                        if debug :
                            print(f'{player.name} Plays')
                        player_info = player.get_info(debug)

                        s = [c.rank_to_val for c in player.stash] # set the first 3 states
                        s.append(rummy.pile[-1].rank_to_val )
                        if debug :
                            print(f'Card in pile {player_info["PileSuit"]}{player_info["PileRank"]}')
                        
                        a = self.epsilon_greed_pickup(s=s)
                        action_taken = a
                        
                        if debug :
                            print(f'{player.name} takes action {action_taken}')         
                        print(f'=============States=============     {s}')

                        
                        

                        result_1 = rummy.pick_card(player, a)  # also checks for meld
                        r = result_1["reward"]
                        rewards.append(r)
                        print('State after action of pickup from deck or pile',result_1['state'])
                        print(' ')
                                  
                        s2 = [c.rank_to_val  for c in result_1['state']]
                        #s_prime.append(0)
                        a2 = self.epsilon_greed_dropdown(s=s2)
                        print(f'---------------Card will be dropped at position-----------     {a2}')        
                       
                             #   Q table update  for pickup action
                                 
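                        # Q-learning target: bootstrap from the greedy (maximum) drop value of the next state
                        td_target = r + self.gamma * np.max(self.Q[s2[0], s2[1], s2[2], s2[3], 0, :])
                        self.Q[s[0], s[1], s[2], s[3], a, :] += self.alpha * (td_target - self.Q[s[0], s[1], s[2], s[3], a, 0])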
                        s = s2
                        a = a2
                                 
                        
                        #player stash will have no cards if the player has melded them
                        #When you have picked up a card and you have drop it since the remaining cards have been melded.
                        if len(player.stash) == 1:
                            rummy.drop_card(player,player.stash[0])
                            if debug :
                                print(f'{player.name} Wins the round')

                        elif len(player.stash) != 0 :
                            print("-------------Player info before dropping the card--------------")
                            player_info = player.get_info(debug)
#                             
                            action_taken = a
                            card = player.stash[action_taken]
                            if debug :
                                print(f'{player.name} drops card {card}')

                            result_1 = rummy.drop_card(player,card)
                            r = result_1["reward"]
                            rewards.append(r)
                            s2_drop = s
                            a2 = self.epsilon_greed_pickup(s=s2_drop)
                                 
#                            Update the Q table for drop action
                                 
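                            # Q-learning target: bootstrap from the greedy (maximum) pickup value of the next state
                            td_target = r + self.gamma * np.max(self.Q[s2_drop[0], s2_drop[1], s2_drop[2], s2_drop[3], :, 0])
                            self.Q[s[0], s[1], s[2], s[3], :, a] += self.alpha * (td_target - self.Q[s[0], s[1], s[2], s[3], 0, a])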
                        else :
                            if debug :
                                print(f'{player.name} Wins the round')
                            # leave the player loop whether or not debug output is enabled
                            break
                        if debug :
                            print("------------Player info after dropping the card---------------- ")
                            player.get_info(debug)
                            print("------------------------------------------------------------")
#                                   
        return rewards                          
                                 
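    def test(self):
        rummy = self.env        # use the environment given to the agent, not a notebook global

        win_comp=0
        win_agent=0
        turns=[]
                                  
        for game in range(10):   # play 10 evaluation games; j below holds the turns left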
            for player in rummy.players :
                player.points = player.stash_score()

            rummy.reset(rummy.players)
            random.shuffle(rummy.players)
            
            while not rummy.play() :
                j=rummy._update_turn()
                
                for player in rummy.players:
                    if player.isBot :
                        if rummy.play():
                            continue
                        rummy.computer_play(player)
                        if not player.stash :
                                win_comp=win_comp+1
                                
                    else :
                        if rummy.play() :
                            continue
                        
                        s = [c.rank_to_val for c in player.stash] # set the first 3 states
                        s.append(rummy.pile[-1].rank_to_val )
                        
                        a = self.epsilon_greed_pickup(s=s)
                        action_taken = a
                        #trace = np.array(coord_convert(s, self.size))                    
                        

                        result_1 = rummy.pick_card(player, a)  # also checks for meld
                        r = result_1["reward"]
                                  
                        s2 = [c.rank_to_val  for c in result_1['state']]
                        #s_prime.append(0)
                        a2 = self.epsilon_greed_dropdown(s=s2)
                        
                             
                                 
                        s = s2
                        a = a2
                        #trace = np.vstack((trace, coord_convert(s, self.size)))
                        
                        #player stash will have no cards if the player has melded them
                        #When you have picked up a card and you have drop it since the remaining cards have been melded.
                        if len(player.stash) == 1:
                            rummy.drop_card(player,player.stash[0])
                            win_agent=win_agent+1
                            turns.append(j)
                            
                        elif len(player.stash) != 0 :
                              
                            action_taken = a
                            card = player.stash[action_taken]
                            
                            result_1 = rummy.drop_card(player,card)
                            r = result_1["reward"]
                            s2_drop = s
                            a2 = self.epsilon_greed_pickup(s=s2_drop)
                                 
                        else :
                            win_agent=win_agent+1
                            turns.append(j)
                                
           
                                  
        return win_agent,win_comp
        

# Explanation of the Code

•	Init method- It initializes all the parameters and constructs the Q-table with shape 8 X 8 X 8 X 8 X 2 X 4. The first three dimensions are the 3 cards in hand and the fourth is the card on the pile; each has size 8 so that the rank values 1 to 7 can be used directly as indices (index 0 is unused). The dimension of size 2 is the pickup action, i.e. pick up from the pile or from the deck, and the dimension of size 4 is which of the 4 cards in hand is dropped after the pickup.

•	epsilon_greed_pickup- It selects the pickup action for which the Q-value is maximum. Here the card-drop index, i.e. the 6th dimension of Q, is fixed at 0 as we don't need it. With probability ϵ it instead makes a random selection so as to explore the environment.

•	epsilon_greed_dropdown- Acts like the method above but selects the position of the card to drop, so here the pickup-action index is fixed at 0.

•	Train- Here maxiter is set higher than 10 (to 100) so that the agent gets trained well. It denotes the number of games played, while turns are the turns each player gets in a game.

•	At the start of each game, each player is initialized. If the player is a bot, the pickup action and the card to drop are chosen by the predefined method computer_play.

•	For the agent, we first build the state from the rank value of each card in hand and of the card on the pile. This state is given to the epsilon_greed_pickup method to get the action.

•	With this action the agent moves to a new state, which in our case is the same hand plus the newly picked card.

•	Then the Q-table is updated based on the reward received for this action.

•	Then we have 4 cards in hand and need to decide which one to drop.

•	For this we call the epsilon_greed_dropdown method, which returns the position of the card to be dropped.

•	After the drop action, the Q-table is updated again.

•	A player wins as soon as the stash is empty, i.e. the stash score becomes 0.

•	Test- This method is similar to train, except that the Q-values are not updated; they are only used for action selection through the two ϵ-greedy methods. Depending on the combination of states, the action with the maximum value is picked, apart from the random exploratory choices that still occur with probability ϵ.
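
How a hand and the pile card turn into Q-table indices can be sketched as follows. This is an illustration under assumptions: `pickup_state` is a hypothetical helper, the hand is taken from the first game of the log below, and the table is freshly zero-initialized rather than trained.

```python
import numpy as np

RANK_VALUE = {'A': 1, '2': 2, '3': 3, '4': 4, '5': 5, '6': 6, '7': 7}

# Four card slots (values 1..7, index 0 unused), 2 pickup actions, 4 drop positions.
Q = np.zeros((8, 8, 8, 8, 2, 4))

def pickup_state(hand_ranks, pile_rank):
    """State used for the pickup decision: three hand cards plus the pile card."""
    return tuple(RANK_VALUE[r] for r in hand_ranks) + (RANK_VALUE[pile_rank],)

s = pickup_state(['A', '2', 'A'], '7')
print(s)                      # (1, 2, 1, 7)

pickup_values = Q[s][:, 0]    # one value per pickup action (drop index fixed at 0)
drop_values = Q[s][0, :]      # one value per drop position (pickup index fixed at 0)
print(pickup_values.shape, drop_values.shape)   # (2,) (4,)
print(Q.size)                 # 32768
```

Note that the state depends on the order of the cards in hand, so the same set of cards in a different order maps to a different table entry.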




In [8]:
p1 = Player('Tanvi', list())
p2 = Player('comp1', list(), isBot=True)
rummy = RummyAgent([p1,p2], max_card_length=3, max_turns=20)

agent_smith = RLAgent(rummy)
rewards=agent_smith.train()



**********************************
		Game Starts : 0
***********************************
19
comp1 Plays
Player Name : comp1 
 Stash Score: 16 
 Stash : 7D, 6H, 3C
------------------------------------------------------------ [7D, 6H, 3C]
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 4 
 Stash : AH, 2C, AS
Card in pile H7
Tanvi takes action 0
State after action of pickup from deck or pile [AH, 2C, AS, 7H]
 
---------------Card will be dropped at position-----------     2
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 11 
 Stash : AH, 2C, AS, 7H
Tanvi drops card AS
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 10 
 Stash : AH, 2C, 7H
------------------------------------------------------------
18
comp1 Plays
Player Name : comp1 
 Stash Score: 16 
 Stash : 7D, 6H, 3C
------------------------------------------------------------ [7D, 6H, 3C]
Tanvi Plays
Player Name : Tanvi 
 Stash Score

Tanvi Plays
Player Name : Tanvi 
 Stash Score: 7 
 Stash : AS, 3S, 3D
Card in pile D2
Tanvi takes action 0
State after action of pickup from deck or pile [AS, 3S, 3D, 2D]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 9 
 Stash : AS, 3S, 3D, 2D
Tanvi drops card AS
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 8 
 Stash : 3S, 3D, 2D
------------------------------------------------------------
3
comp1 Plays
Player Name : comp1 
 Stash Score: 15 
 Stash : 4D, 5D, 6H
------------------------------------------------------------ [4D, 5D, 6H]
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 8 
 Stash : 3S, 3D, 2D
Card in pile S5
Tanvi takes action 0
State after action of pickup from deck or pile [3S, 3D, 2D, 5S]
 
---------------Card will be dropped at position-----------     2
-------------Player info before dropping the ca

 Stash : AS, 4H, 3S
Card in pile DA
Tanvi takes action 1
State after action of pickup from deck or pile [AS, 4H, 3S, 3D]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 11 
 Stash : AS, 4H, 3S, 3D
Tanvi drops card AS
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 10 
 Stash : 4H, 3S, 3D
------------------------------------------------------------
comp1 Plays
Player Name : comp1 
 Stash Score: 9 
 Stash : 2H, 2C, 5C
------------------------------------------------------------ [2H, 2C, 5C]
16
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 10 
 Stash : 4H, 3S, 3D
Card in pile D5
Tanvi takes action 0
State after action of pickup from deck or pile [4H, 3S, 3D, 5D]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Sc

Tanvi takes action 0
State after action of pickup from deck or pile [7H, 2C, 2H, 3C]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 14 
 Stash : 7H, 2C, 2H, 3C
Tanvi drops card 7H
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 7 
 Stash : 2C, 2H, 3C
------------------------------------------------------------
0
**********************************
		Game Starts : 21
***********************************
19
comp1 Plays
Player Name : comp1 
 Stash Score: 6 
 Stash : 2H, AC, 3H
------------------------------------------------------------ [2H, AC, 3H]
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 8 
 Stash : 2C, 2D, 4C
Card in pile S3
Tanvi takes action 0
State after action of pickup from deck or pile [2C, 2D, 4C, 3S]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping

---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 19 
 Stash : 4C, 4H, 5D, 6S
Tanvi drops card 4C
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 15 
 Stash : 4H, 5D, 6S
------------------------------------------------------------
comp1 Plays
Player Name : comp1 
 Stash Score: 9 
 Stash : 3H, 2S, 4C
------------------------------------------------------------ [3H, 2S, 4C]
12
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 15 
 Stash : 4H, 5D, 6S
Card in pile C5
Tanvi takes action 0
State after action of pickup from deck or pile [4H, 5D, 6S, 5C]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 20 
 Stash : 4H, 5D, 6S, 5C
Tanvi drops card 4H
------------Player info after dropping the card---------------- 
Playe

------------------------------------------------------------ [5C, 2D, AS]
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 10 
 Stash : 7C, 2H, AD
Card in pile H3
Tanvi takes action 0
State after action of pickup from deck or pile [7C, 2H, AD, 3H]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 13 
 Stash : 7C, 2H, AD, 3H
Tanvi drops card 7C
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 6 
 Stash : 2H, AD, 3H
------------------------------------------------------------
12
comp1 Plays
Player Name : comp1 
 Stash Score: 9 
 Stash : 5C, AS, 3D
------------------------------------------------------------ [5C, AS, 3D]
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 6 
 Stash : 2H, AD, 3H
Card in pile D2
Tanvi takes action 0
State after action of pickup from deck or pile [2H, AD, 3H, 2D]
 
---------------Card will be dropped 

 
---------------Card will be dropped at position-----------     1
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 11 
 Stash : AH, 5H, 3C, 2H
Tanvi drops card 5H
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 6 
 Stash : AH, 3C, 2H
------------------------------------------------------------
17
comp1 Plays
Player Name : comp1 
 Stash Score: 10 
 Stash : 5D, 4D, AC
------------------------------------------------------------ [5D, 4D, AC]
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 6 
 Stash : AH, 3C, 2H
Card in pile C6
Tanvi takes action 0
State after action of pickup from deck or pile [AH, 3C, 2H, 6C]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 12 
 Stash : AH, 3C, 2H, 6C
Tanvi drops card AH
------------Player info after dropping the card---------------- 
Play

14
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 9 
 Stash : 7S, AD, AS
Card in pile D2
Tanvi takes action 0
State after action of pickup from deck or pile [7S, AD, AS, 2D]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 11 
 Stash : 7S, AD, AS, 2D
Tanvi drops card 7S
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 4 
 Stash : AD, AS, 2D
------------------------------------------------------------
comp1 Plays
Player Name : comp1 
 Stash Score: 9 
 Stash : AC, AH, 7S
------------------------------------------------------------ [AC, AH, 7S]
13
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 4 
 Stash : AD, AS, 2D
Card in pile D7
Tanvi takes action 1
State after action of pickup from deck or pile [AD, AS, 2D, 3H]
 
---------------Card will be dropped at position-----------     1
-------------Player info before dropping th

 Stash : 
------------------------------------------------------------
**********************************
		Game Starts : 60
***********************************
19
comp1 Plays
Player Name : comp1 
 Stash Score: 13 
 Stash : 4S, 4H, 5H
------------------------------------------------------------ [4S, 4H, 5H]
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 17 
 Stash : 7S, 6C, 4C
Card in pile H6
Tanvi takes action 0
State after action of pickup from deck or pile [7S, 6C, 4C, 6H]
 
---------------Card will be dropped at position-----------     3
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 23 
 Stash : 7S, 6C, 4C, 6H
Tanvi drops card 6H
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 17 
 Stash : 7S, 6C, 4C
------------------------------------------------------------
18
comp1 Plays
Player Name : comp1 
 Stash Score: 12 
 Stash : 4S, 5H, 3H
----------------------------------------------

------------------------------------------------------------ [6S, 6C, 3H]
14
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 12 
 Stash : 3C, 7C, 2S
Card in pile S3
Tanvi takes action 0
State after action of pickup from deck or pile [3C, 7C, 2S, 3S]
 
---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 15 
 Stash : 3C, 7C, 2S, 3S
Tanvi drops card 3C
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 12 
 Stash : 7C, 2S, 3S
------------------------------------------------------------
comp1 Plays
Player Name : comp1 
 Stash Score: 11 
 Stash : 6C, 3H, 2H
------------------------------------------------------------ [6C, 3H, 2H]
13
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 12 
 Stash : 7C, 2S, 3S
Card in pile S6
Tanvi takes action 0
State after action of pickup from deck or pile [7C, 2S, 3S, 6S]
 
---------------Card will be dr

-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 16 
 Stash : 2H, 4D, 3D, 7D
Tanvi drops card 2H
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 14 
 Stash : 4D, 3D, 7D
------------------------------------------------------------
comp1 Plays
Player Name : comp1 
 Stash Score: 14 
 Stash : 5H, 7C, 2H
------------------------------------------------------------ [5H, 7C, 2H]
7
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 14 
 Stash : 4D, 3D, 7D
Card in pile H4
Tanvi takes action 0
State after action of pickup from deck or pile [4D, 3D, 7D, 4H]
 
---------------Card will be dropped at position-----------     1
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 18 
 Stash : 4D, 3D, 7D, 4H
Tanvi drops card 3D
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 15 
 Stash : 4D, 7D, 4H
-----------

------------------------------------------------------------
**********************************
		Game Starts : 84
***********************************
19
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 12 
 Stash : 5S, 3S, 4H
Card in pile D7
Tanvi takes action 1
State after action of pickup from deck or pile [5S, 3S, 4H, 7C]
 
---------------Card will be dropped at position-----------     1
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 19 
 Stash : 5S, 3S, 4H, 7C
Tanvi drops card 3S
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 16 
 Stash : 5S, 4H, 7C
------------------------------------------------------------
comp1 Plays
Player Name : comp1 
 Stash Score: 12 
 Stash : 4S, 6D, 2D
------------------------------------------------------------ [4S, 6D, 2D]
18
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 16 
 Stash : 5S, 4H, 7C
Card in pile S3
Tanvi takes action 0
State after action 

---------------Card will be dropped at position-----------     0
-------------Player info before dropping the card--------------
Player Name : Tanvi 
 Stash Score: 15 
 Stash : 4H, 2C, 2H, 7S
Tanvi drops card 4H
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 11 
 Stash : 2C, 2H, 7S
------------------------------------------------------------
comp1 Plays
Player Name : comp1 
 Stash Score: 14 
 Stash : 6C, 4C, 4H
------------------------------------------------------------ [6C, 4C, 4H]
12
Tanvi Plays
Player Name : Tanvi 
 Stash Score: 11 
 Stash : 2C, 2H, 7S
Card in pile C5
Tanvi takes action 1
State after action of pickup from deck or pile [2C, 2H, 7S, 2S]
 
---------------Card will be dropped at position-----------     0
Tanvi Wins the round
------------Player info after dropping the card---------------- 
Player Name : Tanvi 
 Stash Score: 0 
 Stash : 
------------------------------------------------------------
*********************

In [9]:
win, loss = agent_smith.test()


In [10]:
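# Games that ended with a winner (draws are excluded)
games_resulted = win + loss
# Win percentage over all test games (assumes test() plays 10 games)
winning_accuracy = (win / 10) * 100
# Win percentage over decided games only
winning_accuracy_without_draw = (win / games_resulted) * 100
print(win)
print(loss)
print(winning_accuracy)
print(winning_accuracy_without_draw)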

5
3
50.0
62.5
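
The two percentages printed above can be wrapped in a small helper. This is only a sketch: `win_rates` and its `total_games` parameter are hypothetical names that do not appear in the notebook, and it assumes that any game that is neither a win nor a loss counts as a draw.

```python
def win_rates(win, loss, total_games):
    """Return (win % over all games, win % over decided games).

    Hypothetical helper: `total_games` is assumed to be the number of
    games played during testing; draws are total_games - win - loss.
    """
    decided = win + loss
    # Guard against division by zero when no games were played or decided.
    overall = (win / total_games) * 100 if total_games else 0.0
    without_draw = (win / decided) * 100 if decided else 0.0
    return overall, without_draw


overall, without_draw = win_rates(win=5, loss=3, total_games=10)
print(overall)
print(without_draw)
```

With 5 wins and 3 losses out of 10 games, this gives the same two figures as the cell above, and it avoids a `ZeroDivisionError` when every game ends in a draw.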


# Conclusion

Reinforcement learning (RL) is a machine learning paradigm, distinct from supervised and unsupervised learning, in which an agent takes actions and interacts with an environment so as to maximize its total reward. This assignment helped me understand that in RL we construct a mathematical framework to solve such problems. To find a good policy, we can use value-based methods such as Q-learning, which estimate how good an action is in a particular state, or policy-based methods, which learn directly which action to take in each state without estimating how good the actions are. I learned that training my agent for a larger number of rounds increased its capability to make better decisions. The main challenges were deciding on the states and dimensions of the Q-table and implementing the logic in the proper sequence. This assignment helped me understand the efficiency and scope of an RL agent.
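
As an illustration of the value-based approach, the following is a minimal sketch of a tabular Q-learning update with epsilon-greedy action selection. It is not the agent implemented in this notebook: the dictionary-backed Q-table, the tuple state encoding, the two-action set, the reward of 1.0, and the values of `alpha`, `gamma` and `epsilon` are all assumptions chosen for the example.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, n_actions, epsilon, rng=random):
    """With probability epsilon explore randomly, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [Q[(state, a)] for a in range(n_actions)]
    return values.index(max(values))


def q_learning_update(Q, state, action, reward, next_state,
                      n_actions, alpha, gamma, done=False):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    # A terminal next state contributes no future value.
    best_next = 0.0 if done else max(
        Q[(next_state, a)] for a in range(n_actions))
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


# Hypothetical example: unseen (state, action) pairs default to 0.0.
Q = defaultdict(float)
state = ("2C", "2H", "7S")       # assumed state encoding: the hand
next_state = ("2C", "2H", "2S")  # hand after a winning pickup and drop
new_value = q_learning_update(Q, state, action=1, reward=1.0,
                              next_state=next_state, n_actions=2,
                              alpha=0.5, gamma=0.9, done=True)
greedy_action = epsilon_greedy(Q, state, n_actions=2, epsilon=0.0)
print(new_value, greedy_action)
```

After this single update the rewarded action has the higher Q-value in `state`, so the greedy choice there becomes action 1. Repeating such updates over many games is what lets the table improve with more training rounds.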

# References

Vaibhav Kumar, 'Reinforcement learning: Temporal-Difference, SARSA, Q-Learning & Expected SARSA in python', https://towardsdatascience.com/reinforcement-learning-temporal-difference-sarsa-q-learning-expected-sarsa-on-python-9fecfda7467e

Minwoo Jake Lee, 'Reinforcement Learning', https://nbviewer.jupyter.org/url/webpages.uncc.edu/mlee173/teach/itcs6156/notebooks/notes/Note-ReinforcementLearning.ipynb

Baijayanta Roy, 'Temporal-Difference (TD) Learning', https://towardsdatascience.com/temporal-difference-learning-47b4a7205ca8

Wikipedia, 'Q-learning', https://en.wikipedia.org/wiki/Q-learning#Function_approximation