## <center>CSE 546: Reinforcement Learning</center>
### <center>Prof. Alina Vereshchaka</center>
#### <center>Fall 2025</center>

# Welcome to the Bonus: Firecast RL

## IMPORTS & SETUP

In [1]:
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from firecastrl_env.envs.wildfire_env import WildfireEnv 
from firecastrl_env.wrappers.custom_reward import CustomRewardWrapper

class SafeWildfireWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super().__init__(env)
    
    def observation(self, obs):
        if 'cells' in obs:
            # replacing positive infinity with -1.0 for numerical stability
            obs['cells'] = np.nan_to_num(obs['cells'], posinf=-1.0)
        return obs

## BASELINE (with DEFAULT REWARD)

In [4]:
raw_env = WildfireEnv(env_id=0, render_mode=None) 
env = SafeWildfireWrapper(raw_env)

# PPO Agent (untrained)
model = PPO("MultiInputPolicy", env, verbose=0)

print("Starting Baseline Episode")
obs, _ = env.reset()
done = False
total_reward = 0
info = {}

# running a sim loop
while not done:
    action, _ = model.predict(obs) # Agent(randomly)
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Baseline Finished.")
print(f"Total Reward: {total_reward:.2f}")
print(f"Final Stats - Burnt: {info['cells_burnt']}, Burning: {info['cells_burning']}")
env.close()

Starting Baseline Episode
Baseline Finished.
Total Reward: -1730.83
Final Stats - Burnt: 1664, Burning: 0


## CUSTOM REWARD (to encourage aggressive fire suppression)

In [5]:
def smart_wildfire_reward(env, prev_state, curr_state):
    
    # calculating state changes from previous step
    newly_burnt = curr_state['cells_burnt'] - prev_state['cells_burnt']
    extinguished = curr_state['quenched_cells']
    curr_burning = curr_state['cells_burning']
    
    reward = 0.0
    
    # heavily incentivizing putting out active fires with a large reward +10
    if extinguished > 0:
        reward += (extinguished * 10.0)
        
    # strictly punishing allowing the fire to grow with a penalty of -5
    if newly_burnt > 0:
        reward -= (newly_burnt * 5.0)
        
    # continuous small penalty to force faster completion.
    reward -= 0.05 * curr_burning
    
    # If the agent tries to extinguish but hits nothing that is just waste water then we punish it with a small penalty -2 so like it will teach it to aim before dropping water
    try:
        last_action = env.unwrapped.state.get('last_action')
        if last_action == 4 and extinguished == 0:
            reward -= 2.0 
    except AttributeError:
        # safety catch in case env wrapper hides the state
        pass 

    # and clipping reward to prevent instability during training
    return float(np.clip(reward, -20.0, 20.0))

## TRAIN & EVALUATE SHAPED AGENT

In [6]:
# we are stacking wrappers like Raw Env in Custom Reward and then using that

raw_env_2 = WildfireEnv(env_id=0, render_mode=None)
reward_env = CustomRewardWrapper(raw_env_2, reward_fn=smart_wildfire_reward)
shaped_env = SafeWildfireWrapper(reward_env)

print("Training Shaped Agent")
shaped_model = PPO("MultiInputPolicy", shaped_env, verbose=1)
shaped_model.learn(total_timesteps=25000)

print("\nEvaluation Run (Shaped Reward)")
obs, _ = shaped_env.reset()
done = False
final_reward = 0
final_info = {}

while not done:
    action, _ = shaped_model.predict(obs)
    obs, reward, terminated, truncated, info = shaped_env.step(action)
    final_reward += reward
    final_info = info
    done = terminated or truncated

print(f"Shaped Run Finished")
print(f"Total Custom Reward: {final_reward:.2f}")
print(f"Final Stats -> Burnt: {final_info['cells_burnt']}, Burning: {final_info['cells_burning']}")
shaped_env.close()

Training Shaped Agent
Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 617      |
|    ep_rew_mean     | -6.9e+03 |
| time/              |          |
|    fps             | 17       |
|    iterations      | 1        |
|    time_elapsed    | 114      |
|    total_timesteps | 2048     |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 678          |
|    ep_rew_mean          | -7.36e+03    |
| time/                   |              |
|    fps                  | 16           |
|    iterations           | 2            |
|    time_elapsed         | 244          |
|    total_timesteps      | 4096         |
| train/                  |              |
|    approx_kl            | 0.0074659307 |
|    clip_fraction        | 0.00654      |
|    clip_range           | 

## Analysis

### Discussion & Analysis

**1. Implicit Default Behavior (Baseline)**
The default reward function was too lenient (-1.0 penalty for burnt cells), which failed to create a sense of urgency. This caused the agent to wander aimlessly rather than aggressively attacking the fire.
* **Result:** 1664 Cells Burnt

**2. Reward Shaping Strategy (The Winner)**
I found that a **"Simple but Strict"** reward function worked best:
* **Extinguish Reward (+10):** Made firefighting the top priority.
* **Spread Penalty (-5):** Made allowing the fire to grow unacceptable.
* **Wasted Water Penalty (-2):** This was the most critical optimization. By punishing the agent for dropping water on empty ground, it learned to *aim* and conserve actions, leading to much higher efficiency.

**3. Failed Experiment (Over-Optimization)**
I also attempted a "Super Smart" reward with distance shaping (penalizing hovering) and dynamic scarcity scaling.
* **Result:** ~1700 Cells Burnt (Worse than baseline).
* **Lesson:** Adding too many complex signals overwhelmed the agent during the short training window (25k steps). The agent struggled to balance conflicting priorities. It performed best when the rules were simple, dense, and consistent ("Hit the fire, don't miss").

**4. Final Conclusion**
* **Baseline Burnt:** 1664
* **My Agent Burnt:** 1396
* **Improvement:** My shaped reward successfully reduced the burnt area by **~16%** compared to the random baseline. The combination of high rewards for extinguishing and specific penalties for wasting resources was key to this success.