# **Day 3:**
We will be introducing the main topic of this camp: Agent Artificial Intelligence (AI)

![Agent AI](https://lilianweng.github.io/posts/2018-02-19-rl-overview/RL_illustration.png)

## What is Agent Learning?

*"Try something new, add randomness to your actions. Then, compare the result to your expectation. If the result surprises you, maybe exceeded your expectations, then change your parameters to increase taking those actions in the future." ~ Ilya Sutskever*

### **What Does this Even Mean?**

1. "Try something new, add randomness to your actions."

This is like trying a new type of ice cream flavor instead of always sticking to vanilla. You never know, you might find a new favorite! This is what we call 'exploration' in reinforcement learning, where an AI agent tries different actions to see what happens. It's like a robot exploring a new planet.


2. "Then, compare the result to your expectation."

After you've tried the new ice cream flavor, you think about whether it was better, worse, or just as you expected. In reinforcement learning, this is like the AI agent checking the reward it gets after taking an action. If the ice cream was yummy, it's like getting a good reward!


3. "If the result surprises you, maybe exceeded your expectations..."

Sometimes you might be surprised by how much you liked the new flavor. Maybe you expected it to be just okay, but it was actually delicious! In reinforcement learning, this is like getting a higher reward than expected. It's like if the robot found a shiny gem on the planet when it was only expecting to find rocks.


4. "...then change your parameters to increase taking those actions in the future."

If you really liked the new flavor, you might decide to choose it more often in the future. You changed your 'ice cream picking rule' based on the new information. In reinforcement learning, this is the learning step: the AI agent adjusts its policy (which is like its 'rule book' for picking actions) so that actions that gave good rewards get picked more often. Using what it has learned to pick those good actions is called 'exploitation'. It's like the robot deciding to look for shiny gems more often, because it learned that finding gems is better than finding rocks.
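
Here is a tiny sketch of the four steps above in Python. It is not part of CoderSchoolAI: the flavor names, their yumminess scores, and the helper functions `pick_flavor`, `update_expectation`, and `taste_test` are all made up for illustration, and real reinforcement learning agents are more involved than this.

```python
import random


def pick_flavor(expectations, explore_chance, rng):
    """Step 1: sometimes try a random flavor, otherwise pick the best-known one."""
    if rng.random() < explore_chance:
        return rng.choice(list(expectations))
    return max(expectations, key=expectations.get)


def update_expectation(expected, result, step_size=0.5):
    """Steps 3 and 4: a bigger surprise causes a bigger change in what we expect."""
    surprise = result - expected
    return expected + step_size * surprise


def taste_test(true_yumminess, rounds=500, explore_chance=0.2, seed=0):
    rng = random.Random(seed)
    # Before tasting anything, we expect nothing special from any flavor.
    expectations = {flavor: 0.0 for flavor in true_yumminess}
    for _ in range(rounds):
        flavor = pick_flavor(expectations, explore_chance, rng)
        result = true_yumminess[flavor]  # Step 2: see how it actually tasted.
        expectations[flavor] = update_expectation(expectations[flavor], result)
    return expectations


# Made-up scores: how yummy each flavor really is (the agent does not know these).
flavors = {"vanilla": 0.5, "mint": 0.2, "mango": 0.9}
learned = taste_test(flavors)
print(max(learned, key=learned.get))  # The flavor the agent now expects to be best.
```

Because the agent keeps exploring a little, it eventually tastes every flavor and stops sticking with vanilla just because it tried vanilla first.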


In [2]:
import CoderSchoolAI # Imports the entire CoderSchoolAI library!
from CoderSchoolAI import * # Imports all of the CoderSchoolAI library's things! Think of sprinkles and Cake Batter!
from CoderSchoolAI.Environment.CoderSchoolEnvironments.SnakeEnvironment import * # We are going to use a pre-cooked Cake from the Library!

pygame 2.1.0 (SDL 2.0.16, Python 3.9.13)
Hello from the pygame community. https://www.pygame.org/contribute.html



In [4]:
def learn(snake_env, steps=10000, save_file="./QSnakeAgent.pkl", log_interval=1000):
    """Let the snake practice for `steps` updates, then save what it learned."""
    for s in range(1, steps + 1):
        snake_env.update_env()  # One turn of the game loop.
        if s % log_interval == 0:
            print(f"Completed {s} of {steps} training steps")
    snake_env.snake_agent.qlearning.save_q_table(save_file)  # Save the Q-table to a file.


def load(snake_env, steps=10000, save_file="./QSnakeAgent.pkl"):
    """Load a saved Q-table and let the snake play for `steps` updates without exploring."""
    snake_env.snake_agent.qlearning.load_q_table(save_file)
    snake_env.snake_agent.qlearning.epsilon = 0  # epsilon = 0 means no random actions.
    for _ in range(steps):
        snake_env.update_env()  # One turn of the game loop.


snake_env = SnakeEnv(
    target_fps=6,
    height=8,
    width=8,
    cell_size=80,
    is_user_control=False,
    snake_is_q_table=True,
    verbose=True,
    policy_kwargs=dict(
        alpha=0.9,  # Learning rate
        gamma=0.85,  # Discount factor
        epsilon=1,  # Start out exploring all the time
        epsilon_decay=0.999,
    ),
)  # Create a SnakeEnv object!
snake_env.reset()  # Reset the environment!
learn(snake_env, steps=1000000, save_file="./QSnakeAgent.pkl")
while True:  # Keep playing until the game window is closed.
    snake_env.update_env()  # Update the environment in what we call a loop.

Registered: game_state Attribute.
Registered: moving_direction Attribute.
Registered: apple_pos Attribute.
Registered: snake_pos Attribute.
Resetting Snake.
Taking Action: 1
Reward: -0.2795084971874737
Taking Action: 2
Snake Has Intersected with a non-collidable area.
Reward: -1
Resetting Snake.
Taking Action: 3
Reward: -0.5153882032022076
Taking Action: 0
Reward: -0.5659615711335885
Taking Action: 0
Reward: -0.625
Taking Action: 0
Snake Has Intersected with a non-collidable area.
Reward: -1
Resetting Snake.
Taking Action: 2
Snake Has Intersected with a non-collidable area.
Reward: -1
Resetting Snake.
Taking Action: 3
Reward: -0.35355339059327373
Taking Action: 4
Reward: -0.3151650429449553
Taking Action: 1
Reward: -0.2795084971874737
Taking Action: 2
Reward: -0.36443449342783124
Taking Action: 0
Reward: -0.35355339059327373
Taking Action: 3
Reward: -0.2651650429449553
Taking Action: 0
Reward: -0.2795084971874737
Taking Action: 0
Reward: -0.31868871959954903
Taking Action: 3
Reward: -0

SystemExit: 

The snake above learns with an algorithm called Q-Learning. Here are its key ideas:

- **Trial and Error**: 
Q-Learning is all about an agent learning from its actions (trials) and their outcomes (errors). The agent interacts with an environment by taking actions, and it learns from the rewards (positive or negative) it gets.


- **Value of Actions (Q-value)**: 
Q-Learning estimates the value (or "Q-value") of taking a certain action in a certain state. This is an estimate of the total reward that the agent can expect to receive if it takes that action in that state, counting both the immediate reward and the rewards that come later.


- **Exploration vs Exploitation**: 
Q-Learning involves a balance between exploration (trying out new actions to see their outcomes) and exploitation (choosing the action that the agent currently believes to have the best outcome). 
This is often handled using an "*epsilon-greedy*" strategy, where the agent randomly explores with a probability of epsilon, and exploits its current knowledge with a probability of 1 - epsilon. 
In other words, sometimes the agent tries out new actions, and other times it uses the best action it already knows.


- **Learning Rate (Alpha)**: The learning rate alpha determines how much the Q-values are updated at each step. If alpha is 1, the agent completely trusts the new information it gets. If alpha is 0, the agent doesn't learn anything new.


- **Discount Factor (Gamma)**: The discount factor gamma decides how much importance to give to future rewards compared to immediate rewards. If gamma is 0, the agent only cares about immediate rewards. If gamma is close to 1, the agent cares a lot about maximizing future rewards.


- **Q-Table**: 
The Q-values are usually stored in a table called the Q-table. Each entry in the Q-table corresponds to a certain state-action pair. The agent uses this table as a "cheat sheet" to decide what action to take in a given state. 
Think of it as your memory while playing a game: keeping track of what happened before helps you remember what to do next.


- **Update Rule**: At each step, the Q-values are updated using the Q-Learning update rule, which is a kind of Temporal Difference (TD) update. This rule takes a "weighted average" of the old Q-value and a new estimate (the reward just received plus the discounted value of the best next action), with the learning rate alpha as the weight.


- **Convergence**: 
Over time, as the agent explores the environment enough and keeps updating the Q-table, the Q-values settle down (converge). The table then tells the agent the best action to take in each state it has learned about.
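
The ideas in the list above can be put together in a short sketch. This is not how CoderSchoolAI is implemented: the little corridor world, the `step`, `choose_action`, `q_update`, and `train` functions, and the choice to shrink epsilon after every step are all assumptions made for illustration. The values of alpha, gamma, epsilon, and epsilon_decay are borrowed from the `SnakeEnv` call earlier.

```python
import random
from collections import defaultdict

ACTIONS = ["left", "right"]


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])


def q_update(q_table, state, action, reward, next_state, done, alpha, gamma):
    """Weighted average of the old Q-value and the new estimate."""
    best_next = 0.0 if done else max(q_table[(next_state, a)] for a in ACTIONS)
    new_estimate = reward + gamma * best_next  # reward now + discounted future
    old_value = q_table[(state, action)]
    q_table[(state, action)] = (1 - alpha) * old_value + alpha * new_estimate


def step(state, action, goal=4):
    """Hypothetical corridor with cells 0..goal; reaching the goal gives reward 1."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    done = next_state == goal
    return next_state, (1.0 if done else 0.0), done


def train(episodes=300, alpha=0.9, gamma=0.85, epsilon=1.0, epsilon_decay=0.999, seed=0):
    rng = random.Random(seed)
    q_table = defaultdict(float)  # Every unseen (state, action) pair starts at 0.
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q_table, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            q_update(q_table, state, action, reward, next_state, done, alpha, gamma)
            state = next_state
            epsilon *= epsilon_decay  # Assumption: explore a little less after each step.
    return q_table


q_table = train()
# Read the "cheat sheet": the best-known action in each cell of the corridor.
best_actions = [max(ACTIONS, key=lambda a: q_table[(s, a)]) for s in range(4)]
print(best_actions)
```

As a worked example of the update rule: with alpha = 0.9 and gamma = 0.85, a Q-value that starts at 0 and sees a reward of -1 (with nothing learned yet about the next state) becomes 0.1 * 0 + 0.9 * (-1 + 0.85 * 0) = -0.9.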