### Reinforcement Learning
<p><b>Reinforcement Learning (RL)</b> is a type of <b>Machine Learning</b> in which an <b>agent</b> learns to make decisions by <b>interacting with an environment</b>, aiming to maximize some notion of <b>cumulative reward.</b></p>
It is inspired by behavioral psychology: the agent learns by <b>trial and error</b>, much as humans and animals do.

### Why Reinforcement Learning
#### Use Cases Where RL Shines
1. Games (e.g. Ludo, chess, Atari)
2. Robotics (e.g. robot walking and talking)
3. Self-Driving cars
4. Recommender Systems
5. Automated Trading

#### Why Not Just Supervised/Unsupervised Learning?
- Traditional ML
  1. Needs a fixed dataset (labeled, for supervised learning)
  2. One-shot prediction
  3. Passive model
  4. No delayed rewards
- Reinforcement Learning
  1. Learns through interaction
  2. Sequential decision making
  3. Active learning agent (Agentic AI)
  4. Handles delayed feedback (works on rewards)

<b>In RL, the agent learns not just "What is right?" but "What is right over time?".</b>
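
"Right over time" is usually made concrete by adding up the rewards of a whole sequence, with later rewards counted a little less than earlier ones. The sketch below assumes a discount factor `gamma` (not defined in these notes) and a hypothetical helper name `discounted_return`; it is an illustration of the idea, not a fixed recipe.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards over time, weighting the reward at step t by gamma**t.

    gamma is an assumed discount factor between 0 and 1.
    """
    total = 0.0
    # Work backwards so each step is: reward now + gamma * (everything after)
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A reward that arrives two steps late is still worth something now,
# just less than the same reward received immediately.
late = discounted_return([0, 0, 1], gamma=0.9)   # 0.9 * 0.9 * 1
early = discounted_return([1, 0, 0], gamma=0.9)  # 1
print(round(late, 2), round(early, 2))
```

With `gamma` close to 1 the agent cares about the far future; with `gamma` close to 0 it behaves almost like a one-shot predictor.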

#### Elements of Reinforcement Learning
<table>
    <tr>
        <th>Element</th>
        <th>Description</th>
    </tr>
    <tr>
        <th>Agent</th>
        <td>The Learner or Decision Maker</td>
    </tr>
    <tr>
        <th>Environment</th>
        <td>Where the agent operates</td>
    </tr>
    <tr>
        <th>State (S)</th>
        <td>Current Situation the agent is in</td>
    </tr>
    <tr>
        <th>Action (A)</th>
        <td>What the agent can do</td>
    </tr>
    <tr>
        <th>Reward (R)</th>
        <td>Feedback from the environment</td>
    </tr>
    <tr>
        <th>Policy (π)</th>
        <td>Strategy used by the agent to decide actions</td>
    </tr>
    <tr>
        <th>Value (V)</th>
        <td>Expected long-term reward from a state</td>
    </tr>
    <tr>
        <th>Q-Value (Q)</th>
        <td>Expected long-term reward for taking an action in a state</td>
    </tr>
</table>
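
To see how these elements fit together, here is a minimal sketch of one agent-environment loop. The `Corridor` environment, the `run_episode` helper, and both policies are hypothetical names invented for this illustration; they are not part of any library.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, with the goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to the corridor
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1 if done else 0  # reward only when the goal is reached
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """The agent observes a state, the policy picks an action, the environment answers."""
    state = env.reset()
    total_reward = 0
    steps = 0
    while steps < max_steps:
        action = policy(state)                  # Policy: state -> action
        state, reward, done = env.step(action)  # Environment: new state and reward
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


def always_right(state):
    return 1


def random_policy(state):
    return random.choice([-1, 1])


print(run_episode(Corridor(), always_right))  # reaches the goal in 4 steps
```

Swapping `always_right` for `random_policy` keeps the loop unchanged; only the strategy for choosing actions differs, which is exactly the role the policy plays in the table.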

#### Exploration vs. Exploitation Dilemma
<p>This is the <b>core challenge</b> in RL.</p>
<p><b>Exploration: </b>Try new actions to discover better rewards.</p>
<p><b>Exploitation: </b>Use known actions that give high rewards.</p>

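#### Epsilon-Greedy Algorithm
- Exploration -> With a small probability epsilon, typically 0.01 - 0.2, pick a random action.
- Exploitation -> With the remaining probability 1 - epsilon, i.e. 0.8 - 0.99, pick the action with the highest Q-value.
- Reward -> +1 for a win, and -1 or 0 for a loss.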
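
The split between exploring and exploiting can be checked empirically. The sketch below is one possible way to write the selection rule; the function names `epsilon_greedy` and `greedy_fraction` are illustrative, and ties are assumed to resolve to the lowest action index.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, otherwise greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: any action, uniformly
    # exploit: index of the largest Q-value (lowest index wins ties)
    return max(range(len(q_values)), key=lambda a: q_values[a])


def greedy_fraction(q_values, epsilon, trials=10_000, seed=0):
    """Estimate how often the greedy action is chosen over many selections."""
    rng = random.Random(seed)
    best = max(range(len(q_values)), key=lambda a: q_values[a])
    hits = sum(epsilon_greedy(q_values, epsilon, rng) == best for _ in range(trials))
    return hits / trials


# With 2 actions and epsilon = 0.2 the greedy action is expected about
# 1 - 0.2 + 0.2 / 2 = 0.9 of the time, because exploration can also land on it.
print(greedy_fraction([0.1, 0.9], epsilon=0.2))
```

Note that the greedy action is chosen slightly more often than 1 - epsilon, since a random exploratory pick can happen to be the greedy one.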

#### Algorithm
<pre>
if random.random() &lt; epsilon:
    action = random_action()            # explore
else:
    action = action_with_max_Q_Value()  # exploit
</pre>

In [1]:
# Coin toss: 0 = Heads, 1 = Tails
import random

In [2]:
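# Q-values: estimated reward for guessing Heads (index 0) and Tails (index 1)
q = [ 0, 0 ]
# Explore 20% of the time
epsilon = 0.2
# Learning rate: how far each update moves the estimate
alpha = 0.3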

In [38]:
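for i in range(50):
    coins = 0 if random.random() < 0.7 else 1  # biased coin: heads (0) 70% of the time
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit
    if random.random() < epsilon:
        guess = random.randint(0,1)
    else:
        guess = 0 if q[0] > q[1] else 1  # ties go to tails (1)
    reward = 1 if coins==guess else 0
    # Move the estimate for the guessed side a fraction alpha toward the reward
    q[guess] = q[guess] + alpha*(reward-q[guess])
print("Q_Values :", q)
print("Best Case :",'Heads' if q[0] > q[1] else 'Tails')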

Q_Values : [0.2330464609556261, 0.2975221392172274]
Best Case : Tails
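
The update `q[guess] = q[guess] + alpha*(reward-q[guess])` moves the estimate a fraction `alpha` of the way toward the latest reward. The sketch below isolates that rule so its effect is easy to trace by hand; the helper name `update` is illustrative and not part of the notebook.

```python
def update(q_value, reward, alpha=0.3):
    """Move q_value a fraction alpha of the way toward the observed reward."""
    return q_value + alpha * (reward - q_value)


q_value = 0.0
history = []
for reward in [1, 1, 0]:  # two correct guesses, then a wrong one
    q_value = update(q_value, reward)
    history.append(round(q_value, 3))

print(history)  # approximately [0.3, 0.51, 0.357]
```

With `alpha = 0.3` each step keeps only 70% of the old estimate, so the Q-values track the last few rewards rather than the long-run average. That is one plausible reason a short run like the one above can end on Tails even though heads comes up 70% of the time; a smaller `alpha` or more steps would usually give steadier estimates.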
