 The `np.random.choice()` function is a part of the NumPy library and is used to generate a random sample from a given 1-D array or range. Here's an explanation in Jupyter Notebook markdown format:

```markdown
### np.random.choice() Function

The `np.random.choice()` function from the NumPy library is used to generate random samples from a given one-dimensional array or from a range of numbers. This function is quite versatile and allows for various sampling techniques, such as with or without replacement, and it can handle weighted probabilities for each element in the array.

#### Syntax:
```python
numpy.random.choice(a, size=None, replace=True, p=None)
```

- `a`: 1-D array-like or int. If an array, a random sample is generated from its elements. If an int, the random sample is generated as if from `np.arange(a)`.
- `size`: int or tuple of ints, optional. Output shape. If the given shape is, e.g., `(m, n, k)`, then `m * n * k` samples are drawn. Default is `None`, in which case a single value is returned.
- `replace`: boolean, optional. Whether the sample is with or without replacement. Default is `True`, meaning that a value of `a` can be selected multiple times.
- `p`: 1-D array-like, optional. The probabilities associated with each entry in `a`. If not given, the sample assumes a uniform distribution over all entries in `a`.

#### Returns:
- `samples`: single item or ndarray. The generated random samples.

#### Examples:

Generate a uniform random sample from `np.arange(5)` of size 3:

```python
np.random.choice(5, 3)
# Output: array([0, 3, 4]) # random
```

Generate a non-uniform random sample from `np.arange(5)` of size 3 with probabilities:

```python
np.random.choice(5, 3, p=[0.1, 0, 0.3, 0.6, 0])
# Output: array([3, 3, 0]) # random
```

Generate a uniform random sample from `np.arange(5)` of size 3 without replacement:

```python
np.random.choice(5, 3, replace=False)
# Output: array([3, 1, 0]) # random
```

Generate a non-uniform random sample from `np.arange(5)` of size 3 without replacement and with probabilities:

```python
np.random.choice(5, 3, replace=False, p=[0.1, 0, 0.3, 0.6, 0])
# Output: array([2, 3, 0]) # random
```

Any of the above can be repeated with an arbitrary array-like instead of just integers. For example:

```python
aa_milne_arr = ['pooh', 'rabbit', 'piglet', 'Christopher']
np.random.choice(aa_milne_arr, 5, p=[0.5, 0.1, 0.1, 0.3])
# Output: array(['pooh', 'pooh', 'pooh', 'Christopher', 'piglet'], dtype='<U11') # random
```
```

This markdown content can be pasted directly into a Jupyter Notebook cell set to markdown mode to display the formatted explanation and examples. Remember to run the cell to render the markdown.

Source: Conversation with Copilot, 6/9/2024
(1) numpy.random.choice — NumPy v1.26 Manual. https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html.
(2) How to Use Numpy Random Choice - Sharp Sight. https://www.sharpsightlabs.com/blog/numpy-random-choice/.
(3) How is numpy.choice different from numpy.randint?. https://stackoverflow.com/questions/55772716/how-is-numpy-choice-different-from-numpy-randint.
(4) numpy.random.Generator.choice — NumPy v1.26 Manual. https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.choice.html.

In [58]:
import numpy as np
import gym
from collections import defaultdict
import pandas as pd

# Discount factor for future rewards
gamma = 1
# Number of possible actions in the environment
number_actions = 2

# Create the Blackjack environment
env = gym.make("Blackjack-v1")



def argmax_radd(arr):
    temp= np.random.choice(
                np.flatnonzero(
                    arr==np.max(arr)
            )       
            )
    return temp


def generate_episode(policy, env, pi):
    """
    Generate an episode using the given policy and environment.

    Args:
        policy (function): Policy function to determine actions.
        env (gym.Env): Gym environment to interact with.
        pi (dict): Policy table.

    Returns:
        tuple: The generated episode and the length of the episode.
    """
    done = True
    trajectory = []
    
    while True:
        if done:
            # Reset the environment to start a new episode
            St, _ = env.reset()
            Rt = None
            done = False
        else:
            # Take action At and observe the new state and reward
            St, Rt, done, _, _ = env.step(At)
        
        # Determine the action to take based on the policy
        At = policy(St, pi)
        trajectory.append((St, Rt, At, done))
        
        if done:
            break
    
    return trajectory, len(trajectory) - 1

def on_policy_cn_control(env, ep, gamma, eps):
    """
    On-policy first-visit Monte Carlo control algorithm to find the optimal policy.

    Args:
        env (gym.Env): Gym environment to interact with.
        ep (int): Number of episodes to run.
        gamma (float): Discount factor for future rewards.
        eps (float): Epsilon for exploration in epsilon-greedy policy.

    Returns:
        dict: The action-value function Q.
    """
    # Initialize policy table with equal probability for each action
    pi = defaultdict(lambda: 1 / number_actions)
    # Initialize action-value function Q
    Q = defaultdict(float)
    # Initialize the list of returns for each state-action pair
    returns = defaultdict(list)

    def policy(st, pi):
        """
        Epsilon-greedy policy to choose actions.

        Args:
            st (tuple): Current state.
            pi (dict): Policy table.

        Returns:
            int: Chosen action.
        """
        return np.random.choice([0, 1], p=[pi[(st, a)] for a in [0, 1]])

    for _ in range(ep):
        # Generate an episode
        traj, terminal_state_index = generate_episode(env=env, policy=policy, pi=pi)
        G = 0
        for time_step in range(terminal_state_index - 1, -1, -1):
            St, _, _, At = traj[time_step]
            _, Rt_1, _, _ = traj[time_step + 1]
            G = gamma * G + Rt_1
            if (St, At) not in [(traj[i][0], traj[i][2]) for i in range(0, time_step)]:
                returns[(St, At)].append(G)
                Q[(St, At)] = np.mean(returns[(St, At)])
                A_opt= argmax_radd([ Q[(St,a)] for a in range(number_actions)  ])
                for a in range(number_actions):
                    pi[(St,a)] = 1- eps+ eps/number_actions
                else:
                    pi[(St,a)]= eps/number_actions
                    
    return Q,pi

# Run the on-policy control algorithm
Q,pi = on_policy_cn_control(env, ep=500000, gamma=gamma, eps=0.1)

# Print the resulting action-value function
print(Q)
print(pi)


  if not isinstance(terminated, (bool, np.bool8)):


defaultdict(<class 'float'>, {((12, 8, False), False): -0.4907084785133566, ((12, 8, False), 1): 0.0, ((16, 2, False), False): -0.3082590151221404, ((16, 2, False), 1): 0.0, ((14, 9, False), False): -0.5619568715803025, ((14, 9, False), 1): 0.0, ((11, 8, False), False): -0.4706849315068493, ((11, 8, False), 1): 0.0, ((14, 10, False), False): -0.5669154228855722, ((14, 10, False), 1): 0.0, ((18, 10, False), False): -0.26010218300046445, ((18, 10, False), 1): 0.0, ((7, 10, False), False): -0.5743336081341027, ((7, 10, False), 1): 0.0, ((19, 7, False), False): 0.5341488277268094, ((19, 7, False), 1): 0.0, ((17, 1, False), False): -0.650199203187251, ((17, 1, False), 1): 0.0, ((14, 5, False), False): -0.1723916532905297, ((14, 5, False), 1): 0.0, ((9, 5, False), False): -0.13884673748103188, ((9, 5, False), 1): 0.0, ((16, 10, False), False): -0.5896618357487923, ((16, 10, False), 1): 0.0, ((11, 2, False), False): -0.2579582875960483, ((11, 2, False), 1): 0.0, ((12, 10, False), False): -0.5

In [59]:
pi_df = pd.DataFrame(pi.items(),columns=["state_action","probility"])
Q_df= pd.DataFrame(Q.items(),columns=["state-action","policy_value_in_each_state_actoin"])

In [60]:
Q_df

Unnamed: 0,state-action,policy_value_in_each_state_actoin
0,"((12, 8, False), False)",-0.490708
1,"((12, 8, False), 1)",0.000000
2,"((16, 2, False), False)",-0.308259
3,"((16, 2, False), 1)",0.000000
4,"((14, 9, False), False)",-0.561957
...,...,...
555,"((4, 3, False), 1)",0.000000
556,"((21, 6, False), False)",0.835366
557,"((21, 6, False), 1)",0.000000
558,"((21, 7, False), False)",0.782895


In [61]:

pi_df

Unnamed: 0,state_action,probility
0,"((12, 8, False), 0)",0.95
1,"((12, 8, False), 1)",0.05
2,"((16, 2, False), 0)",0.95
3,"((16, 2, False), 1)",0.05
4,"((14, 9, False), 0)",0.95
...,...,...
755,"((31, 6, False), 1)",0.50
756,"((31, 2, False), 0)",0.50
757,"((31, 2, False), 1)",0.50
758,"((31, 4, False), 0)",0.50
