
📚 LAB 8: Reinforcement Learning — Q-Learning
🔸 PART 1: What is Reinforcement Learning (RL)?

    RL = Learning from trial and error.

    An agent interacts with an environment, receives rewards, and learns which actions are best.

    Goal: Maximize cumulative reward.
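
To make "cumulative reward" concrete, here is a minimal sketch that adds up the rewards collected in one episode, optionally weighting later rewards less. The function name `discounted_return` and the sample reward list are illustrative assumptions, not part of the lab; `gamma` plays the role of the discount factor defined in Part 2.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum rewards, weighting the reward at step t by gamma**t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# Hypothetical episode: two ordinary moves (-1 each), then the goal (+10).
episode_rewards = [-1, -1, 10]

print(discounted_return(episode_rewards))              # plain sum of rewards
print(discounted_return(episode_rewards, gamma=0.9))   # later rewards count less
```

With `gamma=1.0` this is the plain sum; with `gamma < 1` a reward that arrives sooner is worth more than the same reward arriving later, which pushes the agent toward shorter paths.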

🔸 PART 2: What is Q-Learning?

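    Q-Learning is a model-free reinforcement learning algorithm: it needs no model of the environment's transitions or rewards.

    It learns a value for every (state, action) pair, from which the best action in any given state can be read off.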

The Q-Value (Quality value) for a (state, action) pair is updated as:
$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$

Where:

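    α = learning rate (how far each update moves Q(s,a) toward the new estimate)

    γ = discount factor (how much future rewards count relative to immediate ones)

    r = reward received for taking action a in state s

    s′ = next state

    a′ = an action available in s′ (the max picks the one with the highest Q-value)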
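
As a worked example of the update rule, the sketch below applies it twice to the same (state, action) pair, with α = 0.5 and γ = 0.9 (the values used in the lab's code). The helper name `q_update` is an assumption made for illustration; the scenario is a move that lands on the goal, where the reward is 10 and every Q-value of the next state is 0.

```python
def q_update(q_sa, reward, next_max, alpha=0.5, gamma=0.9):
    """Return the new Q(s,a) after one Q-learning update."""
    target = reward + gamma * next_max        # r + gamma * max_a' Q(s',a')
    return q_sa + alpha * (target - q_sa)     # move part of the way to the target

# First visit: Q(s,a) starts at 0 and the move reaches the goal.
first = q_update(0.0, reward=10, next_max=0.0)

# Same move again, starting from the value just learned.
second = q_update(first, reward=10, next_max=0.0)

print(first, second)   # 5.0 7.5
```

With α = 0.5 each update closes half of the remaining gap to the target, so repeated visits move the value 0 → 5 → 7.5 → ... toward 10.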

In [1]:
import random

# Define environment: a 2x2 grid; each state is a (row, col) cell
states = [(0,0), (0,1), (1,0), (1,1)]
actions = ['up', 'down', 'left', 'right']

# Initialize Q-table
Q = {}
for s in states:
    Q[s] = {a: 0 for a in actions}

# Hyperparameters
alpha = 0.5      # learning rate
gamma = 0.9      # discount factor
epsilon = 0.3    # exploration rate
episodes = 100

# Reward function: +10 for reaching the goal (1,0), -1 for any other move
def get_reward(state):
    if state == (1,0):
        return 10
    else:
        return -1

# Transition function: move one cell; a move off the grid leaves the state unchanged
def next_state(state, action):
    x, y = state
    if action == 'up':
        x = max(x-1, 0)
    elif action == 'down':
        x = min(x+1, 1)
    elif action == 'left':
        y = max(y-1, 0)
    elif action == 'right':
        y = min(y+1, 1)
    return (x, y)

# Training loop
for ep in range(episodes):
    state = (0,0)  # Start at (0,0)

    while state != (1,0):  # until goal is reached
        if random.uniform(0,1) < epsilon:
            action = random.choice(actions)  # explore
        else:
            action = max(Q[state], key=Q[state].get)  # exploit

        next_s = next_state(state, action)
        reward = get_reward(next_s)

        # Q-learning formula
        old_value = Q[state][action]
        next_max = max(Q[next_s].values())

        Q[state][action] = old_value + alpha * (reward + gamma * next_max - old_value)

        state = next_s  # move to next state

# Show the learned Q-values (exact numbers vary between runs: no random seed is set)
for state in Q:
    print(f"State {state}: {Q[state]}")


State (0, 0): {'up': 7.998962399724405, 'down': 10.0, 'left': 7.999801632337036, 'right': 6.063456863164845}
State (0, 1): {'up': 2.115234375, 'down': -0.75, 'left': 7.974448680877622, 'right': 3.054007625579834}
State (1, 0): {'up': 0, 'down': 0, 'left': 0, 'right': 0}
State (1, 1): {'up': -0.5, 'down': -0.5, 'left': 5.0, 'right': 0}
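
One way to read the table is to extract a greedy policy from it: in each state, take the action with the largest Q-value. The sketch below does this for the Q-values printed above, copied in by hand and rounded to two decimals; the names `goal` and `policy` are assumptions introduced for illustration.

```python
# Q-values copied (rounded) from the printed output of the training run.
Q = {
    (0, 0): {'up': 8.00, 'down': 10.0, 'left': 8.00, 'right': 6.06},
    (0, 1): {'up': 2.12, 'down': -0.75, 'left': 7.97, 'right': 3.05},
    (1, 0): {'up': 0, 'down': 0, 'left': 0, 'right': 0},   # goal, never updated
    (1, 1): {'up': -0.5, 'down': -0.5, 'left': 5.0, 'right': 0},
}
goal = (1, 0)

# Greedy policy: for every non-goal state, pick the action with the highest Q-value.
policy = {s: max(q, key=q.get) for s, q in Q.items() if s != goal}

for s, a in policy.items():
    print(f"State {s}: best action = {a}")
# State (0, 0): best action = down
# State (0, 1): best action = left
# State (1, 1): best action = left
```

From (0,0) and (1,1) the chosen action reaches the goal in one move, and from (0,1) the agent first steps left to (0,0). The goal's own Q-values stay at 0 because each episode ends there and no action is ever taken from it.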
