# RL Basics (Agent–Environment–Reward Loop)
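Goal: identify the agent, environment, state, action, and reward in a reinforcement learning (RL) problem.

Learn the core RL loop with a toy Python example: at each step the agent observes the current state and chooses an action, and the environment responds with a reward and a next state.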

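- **Agent**: The decision maker that chooses actions
- **Environment**: The world the agent interacts with; it responds to each action with a reward and a new state
- **State (s)**: The current situation of the environment, as seen by the agent
- **Action (a)**: A choice the agent makes in a given state
- **Reward (r)**: A numeric feedback signal that tells the agent how good its action was
- **Policy (π)**: The agent's rule for choosing actions, i.e. a mapping from states to actions (possibly probabilistic)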
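The sketch below shows two common ways to represent a policy. It borrows the drinks-dispenser states and actions from this section's toy example. A deterministic policy can be a plain dictionary from state to action. A stochastic policy stores a probability for each action and samples from that distribution. The names `deterministic_policy`, `stochastic_policy`, and `act`, and the probabilities themselves, are hypothetical and chosen only for illustration.

```python
import random

# Deterministic policy: each state maps to exactly one action (hypothetical choices)
deterministic_policy = {
    'User Waiting': 'Ask Selection',
    'Idle': 'Ask Selection',
}

# Stochastic policy: each state maps to a probability per action (hypothetical numbers)
stochastic_policy = {
    'User Waiting': {'Ask Selection': 0.6, 'Dispense Soda': 0.3, 'Dispense Water': 0.1},
    'Idle': {'Ask Selection': 1.0, 'Dispense Soda': 0.0, 'Dispense Water': 0.0},
}


def act(policy, state, rng=random):
    """Return an action for `state` under a deterministic or stochastic policy."""
    choice = policy[state]
    if isinstance(choice, str):  # deterministic: the entry is the action itself
        return choice
    actions = list(choice)
    weights = list(choice.values())
    return rng.choices(actions, weights=weights, k=1)[0]  # sample a ~ pi(a|s)


print(act(deterministic_policy, 'User Waiting'))
print(act(stochastic_policy, 'User Waiting', random.Random(0)))
```

The toy loops in this notebook pick actions with `random.choice(actions)`. That is a stochastic policy that assigns equal probability to every action, whatever the state.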


```python
# Toy RL loop: a drinks dispenser (the agent) choosing actions at random
import random

states = ['Idle', 'User Waiting']
actions = ['Dispense Water', 'Dispense Soda', 'Ask Selection']

# Reward for each (state, action) pair
reward_map = {
    ('User Waiting', 'Ask Selection'): 1,
    ('User Waiting', 'Dispense Soda'): 5,
    ('User Waiting', 'Dispense Water'): 3,
    ('Idle', 'Ask Selection'): 0,
    ('Idle', 'Dispense Soda'): -2,
    ('Idle', 'Dispense Water'): -2,
}

state = 'User Waiting'
for step in range(5):
    action = random.choice(actions)              # random policy: no learning yet
    reward = reward_map.get((state, action), 0)  # environment returns a reward
    print(f"Step {step}: State={state}, Action={action}, Reward={reward}")
    state = random.choice(states)                # environment moves to a random next state
```


Example output (the actions and next states are random, so each run differs):

```
Step 0: State=User Waiting, Action=Ask Selection, Reward=1
Step 1: State=Idle, Action=Ask Selection, Reward=0
Step 2: State=Idle, Action=Dispense Soda, Reward=-2
Step 3: State=User Waiting, Action=Dispense Water, Reward=3
Step 4: State=User Waiting, Action=Ask Selection, Reward=1
```
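Because the toy loop uses a random policy, the reward it collects depends purely on chance. The sketch below shows why the policy matters. It keeps the same `reward_map` and the same random state transitions, and compares the average reward per step of the random policy with a greedy policy that always picks the highest-reward action in each state. Two caveats apply. The helper names are hypothetical. The greedy policy also cheats by reading `reward_map` directly; a real RL agent would have to estimate those values from experience.

```python
import random

states = ['Idle', 'User Waiting']
actions = ['Dispense Water', 'Dispense Soda', 'Ask Selection']
reward_map = {
    ('User Waiting', 'Ask Selection'): 1,
    ('User Waiting', 'Dispense Soda'): 5,
    ('User Waiting', 'Dispense Water'): 3,
    ('Idle', 'Ask Selection'): 0,
    ('Idle', 'Dispense Soda'): -2,
    ('Idle', 'Dispense Water'): -2,
}


def greedy_action(state):
    # Pick the action with the largest immediate reward in this state
    return max(actions, key=lambda a: reward_map.get((state, a), 0))


def random_action(state):
    return random.choice(actions)


def average_reward(policy, steps=10_000, seed=0):
    random.seed(seed)  # fixed seed so the comparison is repeatable
    state = 'User Waiting'
    total = 0
    for _ in range(steps):
        action = policy(state)
        total += reward_map.get((state, action), 0)
        state = random.choice(states)  # same random transitions as the toy loop
    return total / steps


random_avg = average_reward(random_action)
greedy_avg = average_reward(greedy_action)
print(f"Random policy: {random_avg:.2f} reward/step")
print(f"Greedy policy: {greedy_avg:.2f} reward/step")
```

Both states are equally likely after the first step. In expectation, the greedy policy therefore earns about (5 + 0) / 2 = 2.5 per step, while the random policy earns about (3 + (-4/3)) / 2 ≈ 0.83.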


## Map RL Elements to Real-World Scenarios


### Scenario 1: Traffic Light System
- **Agent**: Traffic signal controller
- **Environment**: Road with cars and pedestrians
- **State**: Traffic density (low/medium/high)
- **Action**: Change signal (red/green/yellow)
- **Reward**: Smooth traffic flow (positive), congestion (negative)

```python
import random

# Elements of the traffic-light RL problem
states = ['Low Traffic', 'Medium Traffic', 'High Traffic']
actions = ['Change to Red', 'Change to Green', 'Change to Yellow']
reward_map = {
    ('Low Traffic', 'Change to Green'): 2,
    ('Low Traffic', 'Change to Yellow'): 0,
    ('Low Traffic', 'Change to Red'): -1,
    ('Medium Traffic', 'Change to Green'): 1,
    ('Medium Traffic', 'Change to Yellow'): 0,
    ('Medium Traffic', 'Change to Red'): -1,
    ('High Traffic', 'Change to Green'): -2,
    ('High Traffic', 'Change to Yellow'): -1,
    ('High Traffic', 'Change to Red'): 2,
}

# Simulate the RL loop with a random policy
state = random.choice(states)
print(f"Initial State: {state}")

for step in range(10):
    action = random.choice(actions)
    # Every (state, action) pair is in reward_map, so the -3 fallback is only a safeguard
    reward = reward_map.get((state, action), -3)
    print(f"Step {step}: State={state}, Action={action}, Reward={reward}")

    # Simplified transition: the next traffic level is random and ignores the action
    state = random.choice(states)

print("Simulation finished.")
```

Example output (random, so each run differs):

```
Initial State: High Traffic
Step 0: State=High Traffic, Action=Change to Yellow, Reward=-1
Step 1: State=Medium Traffic, Action=Change to Yellow, Reward=0
Step 2: State=High Traffic, Action=Change to Red, Reward=2
Step 3: State=Medium Traffic, Action=Change to Green, Reward=1
Step 4: State=Low Traffic, Action=Change to Green, Reward=2
Step 5: State=Medium Traffic, Action=Change to Yellow, Reward=0
Step 6: State=Low Traffic, Action=Change to Green, Reward=2
Step 7: State=High Traffic, Action=Change to Yellow, Reward=-1
Step 8: State=Medium Traffic, Action=Change to Green, Reward=1
Step 9: State=Medium Traffic, Action=Change to Red, Reward=-1
Simulation finished.
```
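The traffic loop puts the agent's choice and the environment's response in the same block of code. A common way to separate the two roles is to wrap the environment in a class with `reset()` and `step(action)` methods. The shape below is loosely modeled on the Gymnasium interface, but `TrafficLightEnv`, `run_episode`, and both policies are hypothetical plain-Python sketches, not that library's API. The environment keeps the reward map above unchanged. The sketch also sums the rewards of an episode into a single return, which is the quantity an RL agent ultimately tries to maximize.

```python
import random


class TrafficLightEnv:
    """Hypothetical toy environment for the traffic-light scenario."""

    states = ['Low Traffic', 'Medium Traffic', 'High Traffic']
    actions = ['Change to Red', 'Change to Green', 'Change to Yellow']
    reward_map = {
        ('Low Traffic', 'Change to Green'): 2,
        ('Low Traffic', 'Change to Yellow'): 0,
        ('Low Traffic', 'Change to Red'): -1,
        ('Medium Traffic', 'Change to Green'): 1,
        ('Medium Traffic', 'Change to Yellow'): 0,
        ('Medium Traffic', 'Change to Red'): -1,
        ('High Traffic', 'Change to Green'): -2,
        ('High Traffic', 'Change to Yellow'): -1,
        ('High Traffic', 'Change to Red'): 2,
    }

    def __init__(self, seed=None):
        self.rng = random.Random(seed)  # private RNG so runs can be reproduced
        self.state = None

    def reset(self):
        """Start a new episode in a random traffic state and return it."""
        self.state = self.rng.choice(self.states)
        return self.state

    def step(self, action):
        """Apply an action and return (next_state, reward)."""
        reward = self.reward_map[(self.state, action)]
        self.state = self.rng.choice(self.states)  # simplified: ignores the action
        return self.state, reward


def random_policy(state):
    return random.choice(TrafficLightEnv.actions)


def rule_policy(state):
    # Picks the highest-reward action of reward_map in every state
    return 'Change to Red' if state == 'High Traffic' else 'Change to Green'


def run_episode(env, policy, steps=10):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0
    for _ in range(steps):
        action = policy(state)            # agent: state -> action
        state, reward = env.step(action)  # environment: next state + reward
        total_reward += reward
    return total_reward


print("Random policy return:", run_episode(TrafficLightEnv(seed=1), random_policy))
print("Rule policy return:", run_episode(TrafficLightEnv(seed=1), rule_policy))
```

Under this reward map the rule policy earns at least 1 per step. Its 10-step return therefore always lies between 10 and 20, while the random policy's return can go negative.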


### Scenario 2: Dog Training
- **Agent**: Dog
- **Environment**: Training ground with trainer
- **State**: Command given (sit, run)
- **Action**: Dog's response (sit/run/ignore)
- **Reward**: Biscuit for obeying (positive), no treat for ignoring (0), wrong response (negative)

```python
import random

# Elements of the dog-training RL problem
states = ['Sit Command', 'Run Command']
actions = ['Sit', 'Run', 'Ignore']
reward_map = {
    ('Sit Command', 'Sit'): 1,     # obeyed: biscuit
    ('Sit Command', 'Run'): -1,    # wrong response
    ('Sit Command', 'Ignore'): 0,  # ignored: no treat
    ('Run Command', 'Run'): 1,
    ('Run Command', 'Sit'): -1,
    ('Run Command', 'Ignore'): 0,
}

# Simulate the RL loop with a random (untrained) dog
state = random.choice(states)
print(f"Initial State: {state}")

for step in range(5):
    action = random.choice(actions)
    # Every (state, action) pair is in reward_map, so the -0.5 fallback is only a safeguard
    reward = reward_map.get((state, action), -0.5)
    print(f"Step {step}: State={state}, Action={action}, Reward={reward}")

    # Simplified transition: the trainer's next command is random
    state = random.choice(states)

print("Simulation finished.")
```

Example output (random, so each run differs):

```
Initial State: Run Command
Step 0: State=Run Command, Action=Run, Reward=1
Step 1: State=Sit Command, Action=Sit, Reward=1
Step 2: State=Sit Command, Action=Run, Reward=-1
Step 3: State=Sit Command, Action=Ignore, Reward=0
Step 4: State=Sit Command, Action=Sit, Reward=1
Simulation finished.
```
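The random dog above never improves, because nothing in the loop uses the rewards it receives. The sketch below shows one simple way rewards could shape behavior; the variable and function names are hypothetical. The dog first tries random responses and records the average reward of each (command, action) pair. It then reads off the best response for each command. This is a sample-average estimate of immediate reward, a bandit-style simplification. Full RL algorithms such as Q-learning also account for future rewards.

```python
import random
from collections import defaultdict

states = ['Sit Command', 'Run Command']
actions = ['Sit', 'Run', 'Ignore']
reward_map = {
    ('Sit Command', 'Sit'): 1,
    ('Sit Command', 'Run'): -1,
    ('Sit Command', 'Ignore'): 0,
    ('Run Command', 'Run'): 1,
    ('Run Command', 'Sit'): -1,
    ('Run Command', 'Ignore'): 0,
}

random.seed(0)                 # fixed seed so the run is repeatable
totals = defaultdict(float)    # sum of observed rewards per (state, action)
counts = defaultdict(int)      # number of tries per (state, action)

# Exploration phase: random commands, random responses, record each reward
for _ in range(300):
    state = random.choice(states)
    action = random.choice(actions)
    totals[(state, action)] += reward_map[(state, action)]
    counts[(state, action)] += 1


def estimated_value(state, action):
    """Average observed reward for the pair, or 0.0 if it was never tried."""
    n = counts[(state, action)]
    return totals[(state, action)] / n if n else 0.0


# Learned policy: for each command, the response with the highest average reward
learned_policy = {s: max(actions, key=lambda a: estimated_value(s, a)) for s in states}
print(learned_policy)
```

The rewards here are deterministic, so once a pair has been tried its average equals its value in `reward_map`. After 300 tries the learned policy is therefore Sit for the sit command and Run for the run command.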


### Scenario 3: Warehouse Picking Robot
- **Agent**: Robot
- **Environment**: Warehouse grid
- **State**: Current robot location + item request
- **Action**: Move left/right/up/down or pick item
- **Reward**: Correct pick (positive), wrong pick (negative), delay (negative)


```python
import random

# Warehouse picking robot, heavily simplified: a single task state.
# Robot location and item request are not modeled, so every pick succeeds.
states = ['Picking Item']
actions = ['Move Left', 'Move Right', 'Move Up', 'Move Down', 'Pick Item']
reward_map = {
    ('Picking Item', 'Pick Item'): 10,    # large reward for a successful pick
    ('Picking Item', 'Move Left'): -0.1,  # small cost per move (delay)
    ('Picking Item', 'Move Right'): -0.1,
    ('Picking Item', 'Move Up'): -0.1,
    ('Picking Item', 'Move Down'): -0.1,
}

# Simulate the RL loop with a random policy
state = 'Picking Item'
print(f"Initial State: {state}")

for step in range(7):
    action = random.choice(actions)
    reward = reward_map[(state, action)]  # every (state, action) pair is defined
    if action == 'Pick Item':
        # A real pick would likely end the episode or load a new item request;
        # this simulation just records the reward and continues
        print(f"Step {step}: State={state}, Action={action}, Reward={reward} - Item Picked!")
    else:
        print(f"Step {step}: State={state}, Action={action}, Reward={reward}")

print("Simulation finished.")
```

Example output (random, so each run differs):

```
Initial State: Picking Item
Step 0: State=Picking Item, Action=Move Right, Reward=-0.1
Step 1: State=Picking Item, Action=Move Down, Reward=-0.1
Step 2: State=Picking Item, Action=Move Left, Reward=-0.1
Step 3: State=Picking Item, Action=Move Up, Reward=-0.1
Step 4: State=Picking Item, Action=Move Right, Reward=-0.1
Step 5: State=Picking Item, Action=Move Right, Reward=-0.1
Step 6: State=Picking Item, Action=Pick Item, Reward=10 - Item Picked!
Simulation finished.
```
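The simulation above reduces the state to one label, so the robot's location never matters and every pick succeeds. The sketch below stays closer to the scenario description but rests on several assumptions. It uses a small grid whose 3x3 size, coordinates, `step` helper, and -1 wrong-pick penalty are all hypothetical. The state is the pair (robot position, item position). Each move costs -0.1 as a delay penalty. A pick at the item's cell earns +10 and ends the episode, and a pick anywhere else earns -1.

```python
# Hypothetical grid version of the warehouse picking task
GRID_SIZE = 3
# (dx, dy) per move; y grows downward, like rows on a screen
MOVES = {'Move Left': (-1, 0), 'Move Right': (1, 0), 'Move Up': (0, -1), 'Move Down': (0, 1)}


def step(state, action):
    """Return (next_state, reward, done) for state = (robot_pos, item_pos)."""
    robot, item = state
    if action == 'Pick Item':
        if robot == item:
            return state, 10, True   # correct pick: reward and end of episode
        return state, -1, False      # wrong pick: penalty, episode continues
    dx, dy = MOVES[action]
    x = min(max(robot[0] + dx, 0), GRID_SIZE - 1)  # walls: clamp to the grid
    y = min(max(robot[1] + dy, 0), GRID_SIZE - 1)
    return ((x, y), item), -0.1, False  # each move costs a little (delay)


# Scripted run: robot starts at (0, 0), item requested at (2, 1)
state = ((0, 0), (2, 1))
plan = ['Pick Item', 'Move Right', 'Move Right', 'Move Down', 'Pick Item']
total = 0.0
for action in plan:
    state, reward, done = step(state, action)
    total += reward
    print(f"Robot={state[0]}, Action={action}, Reward={reward}")
    if done:
        break
print(f"Episode return: {total:.1f}")
```

The scripted plan makes one wrong pick (-1), three moves (-0.3), and one correct pick (+10), so its return is 8.7. Ending the episode on a correct pick makes delay costly: each extra move or wrong pick lowers the return.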
