# Demo: CityLearn

In this demo, we will use [CityLearn](https://www.citylearn.net/index.html) ([Vazquez-Canteli et al., 2019](https://doi.org/10.1145/3360322.3360998)), a Gymnasium environment for control algorithms for building energy coordination and demand response in cities. 

This tutorial is based on the [Climate Change AI Citylearn Tutorial at ICLR 2023](https://www.climatechange.ai/tutorials?search=id%3Acitylearn).

### What is CityLearn?


<img src='https://www.citylearn.net/_images/dr.jpg' height=300></img> \
Source: [CityLearn](https://www.citylearn.net/)

- Open-source Gymnasium environment for control algorithms for building and district energy systems
- Models buildings and districts as environments

Buildings:
- Modeled as a single thermal zone
- Up to five load types: space cooling, space heating, domestic hot water (DHW) heating, electric equipment, and electric vehicle (EV) loads
- Energy storage
- Photovoltaic (PV) system



<img src="https://github.com/intelligent-environments-lab/CityLearn/blob/master/assets/images/environment.jpg?raw=true"  width="1000" alt="An overview of the heating, ventilation and air conditioning systems, energy storage systems, on-site electricity sources and grid interaction in buildings in the CityLearn environment."></img> \
Source: [Nweye et al., 2024](https://doi.org/10.48550/arXiv.2405.03848)


### Learning objectives

- Get familiar with the CityLearn framework
- Try different control algorithms
  - Rule-based control (RBC)
  - Q-learning
  - Deep reinforcement learning (Soft Actor-Critic, SAC)
- Develop an understanding of which tools work for which task

### Dataset

We use the `citylearn_challenge_2022_phase_all` dataset from [The CityLearn Challenge 2022](https://www.aicrowd.com/challenges/neurips-2022-citylearn-challenge), which is included in the CityLearn package. For more detailed information, see the notebook from the [Climate Change AI Citylearn Tutorial at ICLR 2023](https://www.climatechange.ai/tutorials?search=id%3Acitylearn). The dataset consists of 17 single-family homes, each with a 6.4 kWh battery and PV modules. It contains hourly energy consumption data from August 1, 2016 to July 31, 2017. All energy loads in a building are combined into a single load value.

### Control strategy

The goal is to implement a controller that manages energy storage and load shifting in a two-building district. The controller should **minimize electricity cost** by finding a strategy for when to charge or discharge the battery. The *control action* $a$ is a value in the interval $[-1, 1]$ that sets the proportion of the battery capacity to discharge ($a < 0$) or charge ($a > 0$).
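
To make the meaning of the control action concrete, here is a minimal sketch that converts an action into the energy the controller requests from the battery. The helper `requested_energy_kwh` is hypothetical and not part of CityLearn. It also ignores efficiency, power limits, and the current state of charge, all of which the simulator takes into account.

```python
BATTERY_CAPACITY_KWH = 6.4  # battery capacity stated in the Dataset section


def requested_energy_kwh(action: float,
                         capacity_kwh: float = BATTERY_CAPACITY_KWH) -> float:
    """Energy the controller asks to charge (+) or discharge (-), in kWh.

    Simplifying assumption: the request is just action * capacity.
    """
    if not -1.0 <= action <= 1.0:
        raise ValueError("action must lie in [-1, 1]")
    return action * capacity_kwh


for a in (-1.0, -0.25, 0.0, 0.5):
    print(f"a = {a:+.2f} -> {requested_energy_kwh(a):+.2f} kWh")
```

A positive result is a charging request and a negative result is a discharging request.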

Install packages:

In [None]:
! pip install citylearn
! pip install stable-baselines3

In [None]:
import os
from typing import Any
import matplotlib.pyplot as plt
from tqdm import tqdm
import numpy as np
import pandas as pd

from citylearn.agents.base import (
    BaselineAgent,
    Agent as RandomAgent
)
from citylearn.agents.rbc import HourRBC
from citylearn.agents.q_learning import TabularQLearning
from citylearn.citylearn import CityLearnEnv
from citylearn.data import DataSet
from citylearn.reward_function import RewardFunction
from citylearn.wrappers import (
    NormalizedObservationWrapper,
    StableBaselines3Wrapper,
    TabularQLearningWrapper
)

from stable_baselines3 import SAC

from citylearn_utils import *

In [None]:
DATASET_NAME = 'citylearn_challenge_2022_phase_all'

schema = DataSet().get_schema(DATASET_NAME)
root_directory = schema['root_directory']

# We use an environment with two buildings
BUILDINGS = ['Building_1', 'Building_2']

# We use the first week of September from the dataset
SIMULATION_START_TIME_STEP = 745
SIMULATION_END_TIME_STEP = SIMULATION_START_TIME_STEP + 24*7 - 1
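
As a quick sanity check on the simulation window, the sketch below counts the time steps and maps the start and end indices to calendar times. The mapping assumes that row 0 of the hourly data corresponds to the first hour of August 1, 2016. That is an assumption for illustration and is not verified against the files.

```python
from datetime import datetime, timedelta

start_step = 745
end_step = start_step + 24 * 7 - 1

# The end index is inclusive, so the window holds exactly one week of hours.
n_steps = end_step - start_step + 1

# Assumption: row 0 is the first hour of August 1, 2016.
first_row_time = datetime(2016, 8, 1)
window_start = first_row_time + timedelta(hours=start_step)
window_end = first_row_time + timedelta(hours=end_step)

print("time steps in window:", n_steps)
print("window start:", window_start)
print("window end:  ", window_end)
```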

You can list all available datasets in CityLearn:

In [None]:
DataSet().get_dataset_names()

Preview building data

In [None]:
fig, axs = plt.subplots(3, len(BUILDINGS), figsize=(5*len(BUILDINGS), 5), sharex=True, sharey='row')
if len(BUILDINGS) == 1:
    axs = axs.reshape(-1, 1)
for i, building_name in enumerate(BUILDINGS):
    filename = schema['buildings'][building_name]['energy_simulation']
    filepath = os.path.join(root_directory, filename)
    building_data = pd.read_csv(filepath).iloc[SIMULATION_START_TIME_STEP:SIMULATION_END_TIME_STEP+1]

    filename = schema['buildings'][building_name]['pricing']
    filepath = os.path.join(root_directory, filename)
    pricing_data = pd.read_csv(filepath).iloc[SIMULATION_START_TIME_STEP:SIMULATION_END_TIME_STEP+1]

    x = building_data.index
    y1 = building_data['non_shiftable_load']
    y2 = building_data['solar_generation']
    y3 = pricing_data['electricity_pricing']
    axs[0, i].plot(x, y1)
    axs[1, i].plot(x, y2)
    axs[2, i].plot(x, y3)
    axs[2, i].set_xlabel('Time step')
    if i == 0:
        axs[0, i].set_ylabel('Non-shiftable load\n[kWh]', fontsize=9)
        axs[1, i].set_ylabel('Solar generation\n[W/kW]', fontsize=9)
        axs[2, i].set_ylabel('Electricity pricing\n[$/kWh]', fontsize=9)

    axs[0, i].set_title(building_name)
plt.tight_layout()
plt.show()

Weather data

In [None]:
filename = schema['buildings'][BUILDINGS[0]]['weather']
filepath = os.path.join(root_directory, filename)
weather_data = pd.read_csv(filepath).iloc[SIMULATION_START_TIME_STEP:SIMULATION_END_TIME_STEP+1]

columns = [
    'outdoor_dry_bulb_temperature', 'outdoor_relative_humidity',
    'diffuse_solar_irradiance', 'direct_solar_irradiance'
]
titles = [
    'Outdoor dry-bulb\ntemperature [C]', 'Relative humidity\n[%]',
    'Diffuse solar irradiance\n[W/m2]', 'Direct solar irradiance\n[W/m2]'
]
fig, axs = plt.subplots(4, 1, figsize=(5, 6), sharex=True)
x = weather_data.index

for ax, c, t in zip(fig.axes, columns, titles):
    y = weather_data[c]
    ax.plot(x, y)
    # ax.set_xlabel('Time step')
    ax.set_ylabel(t, fontsize=9)
axs[-1].set_xlabel('Time step')
fig.align_ylabels()
plt.tight_layout()
plt.show()

Carbon intensity

In [None]:
filename = schema['buildings'][BUILDINGS[0]]['carbon_intensity']
filepath = os.path.join(root_directory, filename)
carbon_intensity_data = pd.read_csv(filepath).iloc[SIMULATION_START_TIME_STEP:SIMULATION_END_TIME_STEP+1]

fig, ax = plt.subplots(1, 1, figsize=(8, 2))
x = carbon_intensity_data.index
y = carbon_intensity_data['carbon_intensity']
ax.plot(x, y)
ax.set_xlabel('Time step')
ax.set_ylabel('kg_CO2/kWh')
plt.show()

### Evaluation metrics


**Electricity cost**: sum of building-level imported electricity cost, $E_h^{\textrm{building}} \times T_h$ (\$), where $T_h$ is the electricity rate at hour $h$.

$$
    \textrm{cost} = \sum_{h=0}^{n-1}{\textrm{max} \left (0,E_h^{\textrm{building}} \times T_h \right )}
$$
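
The cost formula can be sketched in a few lines of plain Python. The function name and the sample values are hypothetical. Negative consumption stands for net export to the grid, which the `max(0, ...)` term excludes from the cost.

```python
def electricity_cost(net_consumption_kwh, price_per_kwh):
    """Sum of max(0, E_h * T_h) over all hours h."""
    return sum(
        max(0.0, e * t)
        for e, t in zip(net_consumption_kwh, price_per_kwh)
    )


# Hypothetical hourly values for one building.
E = [1.5, -0.5, 2.0, 0.0]     # net electricity consumption [kWh]
T = [0.20, 0.20, 0.50, 0.50]  # electricity rate [$/kWh]

print(f"cost: {electricity_cost(E, T):.2f} $")
```

The carbon emissions KPI below has the same structure, with the carbon intensity $O_h$ in place of the rate $T_h$.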

**Carbon emissions**: sum of building-level carbon emissions (kg<sub>CO<sub>2</sub>e</sub>), $E_h^{\textrm{building}} \times O_h$, where $O_h$ is the carbon intensity (kg<sub>CO<sub>2</sub>e</sub>/kWh) at hour $h$.

$$
    \textrm{carbon emissions} = \sum_{h=0}^{n-1}{\textrm{max} \left (0,E_h^{\textrm{building}} \times O_h \right )}
$$

**Average daily peak**: mean of the daily $E_h^{\textrm{district}}$ peak where $d$ is the day index and $n$ is the total number of days.

$$
    \textrm{average daily peak} = \frac{
        {\sum}_{d=0}^{n - 1} {\textrm{max} \left (E_{24d}^{\textrm{district}}, \dots, E_{24d + 23}^{\textrm{district}} \right)}
    }{n}
$$
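
As an illustration of "mean of the daily peak", the sketch below splits an hourly district load into days, takes each day's maximum, and averages the maxima. The function name and the sample profile are hypothetical, and the sketch assumes the series covers whole days.

```python
def average_daily_peak(district_load_kwh, hours_per_day=24):
    """Mean over days of the maximum hourly district load."""
    if len(district_load_kwh) % hours_per_day != 0:
        raise ValueError("load series must cover whole days")
    days = [
        district_load_kwh[i:i + hours_per_day]
        for i in range(0, len(district_load_kwh), hours_per_day)
    ]
    return sum(max(day) for day in days) / len(days)


# Hypothetical two-day profile with daily peaks of 5.0 and 3.0 kWh.
load = [1.0] * 23 + [5.0] + [2.0] * 23 + [3.0]
print(average_daily_peak(load))
```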

**Ramping**: sum of the absolute differences between consecutive $E_h^{\textrm{district}}$. It represents the smoothness of the district's load profile. Low ramping means the grid load increases gradually, even after self-generation becomes unavailable in the evening and early morning. High ramping means abrupt changes in grid load that may lead to unscheduled strain on grid infrastructure and to blackouts caused by supply deficit.

$$
    \textrm{ramping} = \sum_{h=1}^{n-1}  \lvert E_{h}^{\textrm{district}} - E_{h - 1}^{\textrm{district}} \rvert
$$
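
A small sketch of the ramping KPI, comparing a smooth and a spiky profile. The function name and the profiles are hypothetical. Because the first time step has no predecessor, the sketch starts the differences at the second step.

```python
def ramping(district_load_kwh):
    """Sum of absolute differences between consecutive time steps."""
    return sum(
        abs(current - previous)
        for previous, current in zip(district_load_kwh, district_load_kwh[1:])
    )


smooth = [1.0, 2.0, 3.0, 4.0]  # gradual increase
spiky = [1.0, 4.0, 1.0, 4.0]   # abrupt changes

print("smooth:", ramping(smooth))
print("spiky: ", ramping(spiky))
```

Both profiles span the same range, but the spiky one accumulates three times the ramping.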

**Load factor**: average ratio of the monthly average to the monthly peak $E_{h}^{\textrm{district}}$, where $m$ is the month index, $d$ is the number of hourly time steps in a month, and $n$ is the number of months. Load factor represents the efficiency of electricity consumption and is bounded between 0 (very inefficient) and 1 (highly efficient). The goal is therefore to maximize the load factor or, in the same fashion as the other KPIs, to minimize (1 - load factor).

$$
    \textrm{1 - load factor}  = \Big(
        \sum_{m=0}^{n - 1} 1 - \frac{
            \left (
                \sum_{h=0}^{d - 1} E_{d \cdot m + h}^{\textrm{district}}
            \right ) \div d
        }{
            \textrm{max} \left (E_{d \cdot m}^{\textrm{district}}, \dots, E_{d \cdot m + d - 1}^{\textrm{district}} \right )
    }\Big) \div n
$$
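
The following sketch computes (1 - load factor) for a load series split into equal-length periods. The function name is hypothetical, and the period length of four steps is chosen only to keep the example short. A real month has several hundred hourly steps. The sketch also assumes every period has a positive peak.

```python
def one_minus_load_factor(district_load_kwh, period_length):
    """Mean over periods of 1 - (average load / peak load)."""
    periods = [
        district_load_kwh[i:i + period_length]
        for i in range(0, len(district_load_kwh), period_length)
    ]
    ratios = [1.0 - (sum(p) / len(p)) / max(p) for p in periods]
    return sum(ratios) / len(ratios)


flat = [2.0, 2.0, 2.0, 2.0]   # average equals peak, load factor 1
peaky = [0.0, 0.0, 0.0, 4.0]  # average is a quarter of the peak

print(one_minus_load_factor(flat, 4))
print(one_minus_load_factor(peaky, 4))
print(one_minus_load_factor(flat + peaky, 4))
```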

The KPIs are reported as values normalized to the baseline outcome, in which the buildings are not equipped with batteries (i.e., no control). A KPI below 1.0 therefore makes the case for including the battery or an advanced control approach.

$$
    \textrm{KPI} = \frac{{\textrm{KPI}_{control}}}{\textrm{KPI}_{baseline (no\ battery)}}
$$
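
A minimal sketch of the normalization, using made-up cost values for illustration. The function name is hypothetical.

```python
def normalized_kpi(control_value, baseline_value):
    """Ratio of a KPI under control to the same KPI without a battery."""
    if baseline_value == 0:
        raise ValueError("baseline KPI must be non-zero")
    return control_value / baseline_value


# Hypothetical weekly electricity costs in $.
baseline_cost = 12.0
controlled_cost = 9.0

print(normalized_kpi(controlled_cost, baseline_cost))
```

A value below 1.0, as in this example, means the controller improved on the baseline for that KPI.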

Initialize CityLearn Environment

In [None]:
CENTRAL_AGENT = True
ACTIVE_OBSERVATIONS = ['hour']

In [None]:
env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=ACTIVE_OBSERVATIONS,
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
)

In [None]:
print('Current time step:', env.time_step)
print('environment number of time steps:', env.time_steps)
print('environment uses central agent:', env.central_agent)
print('Number of buildings:', len(env.buildings))

In [None]:
# electrical storage
print('Electrical storage capacity:', {
    b.name: b.electrical_storage.capacity for b in env.buildings
})
print('Electrical storage nominal power:', {
    b.name: b.electrical_storage.nominal_power for b in env.buildings
})
print('Electrical storage loss_coefficient:', {
    b.name: b.electrical_storage.loss_coefficient for b in env.buildings
})
print('Electrical storage soc:', {
    b.name: b.electrical_storage.soc[b.time_step] for b in env.buildings
})
print('Electrical storage efficiency:', {
    b.name: b.electrical_storage.efficiency for b in env.buildings
})
print('Electrical storage electricity consumption:', {
    b.name: b.electrical_storage.electricity_consumption[b.time_step]
    for b in env.buildings
})
print('Electrical storage capacity loss coefficient:', {
    b.name: b.electrical_storage.capacity_loss_coefficient for b in env.buildings
})
print()
# pv
print('PV nominal power:', {
    b.name: b.pv.nominal_power for b in env.buildings
})
print()
# active observations
print('Active observations:', {b.name: b.active_observations for b in env.buildings})
# active actions
print('Active actions:', {b.name: b.active_actions for b in env.buildings})

### Test 1: Baseline controller (no control)
---

In [None]:
baseline_env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=ACTIVE_OBSERVATIONS,
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
)

baseline_model = BaselineAgent(baseline_env)

# always start by resetting the environment
observations, _ = baseline_env.reset()

# step through the environment until terminal
# state is reached i.e., the control episode ends
while not baseline_env.terminated:
    # select actions from the model
    actions = baseline_model.predict(observations)

    # apply selected actions to the environment
    observations, _, _, _, _ = baseline_env.step(actions)

plot_simulation_summary({
    'Baseline': baseline_env,
})

### Test 2: Random control strategy
---

In [None]:
# initialize your environment
random_env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=ACTIVE_OBSERVATIONS,
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
)

# initialize your agent
random_model = RandomAgent(
    random_env
)

# reset your environment
observations, _ = random_env.reset()

# step through your environment
while not random_env.terminated:
    # select actions
    actions = random_model.predict(observations)

    # apply actions to environment step function
    observations, _, _, _, _ = random_env.step(actions)

# display simulation summary figures
plot_simulation_summary({
    'Baseline': baseline_env,
    'Random': random_env,
})

### Test 3: Rule-based control (RBC)
---

Our next step is to implement a rule-based control (RBC) algorithm. We will implement a simple time-based controller: at each hour of the day, the controller performs a fixed action to charge or discharge the battery. Here is an example of a possible strategy:

In [None]:
# define action map
action_map = {
    1: 1/12, # Rule for 1 AM
    2: 1/12, # Rule for 2 AM
    3: 1/12, # Rule for 3 AM
    4: 1/12, # Rule for 4 AM
    5: 1/12, # Rule for 5 AM
    6: 1/12, # Rule for 6 AM
    7: 1/12, # Rule for 7 AM
    8: 1/12, # Rule for 8 AM
    9: 1/12, # Rule for 9 AM
    10: 1/12, # Rule for 10 AM
    11: 1/12, # Rule for 11 AM
    12: 1/12, # Rule for 12 PM
    13: -1/12, # Rule for 1 PM
    14: -1/12, # Rule for 2 PM
    15: -1/12, # Rule for 3 PM
    16: -1/12, # Rule for 4 PM
    17: -1/12, # Rule for 5 PM
    18: -1/12, # Rule for 6 PM
    19: -1/12, # Rule for 7 PM
    20: -1/12, # Rule for 8 PM
    21: -1/12, # Rule for 9 PM
    22: -1/12, # Rule for 10 PM
    23: -1/12, # Rule for 11 PM
    24: -1/12, # Rule for 12 AM
}

# run inference
rbc_env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=ACTIVE_OBSERVATIONS,
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
)
rbc_model = HourRBC(rbc_env, action_map=action_map)
observations, _ = rbc_env.reset()

while not rbc_env.terminated:
    actions = rbc_model.predict(observations)
    observations, _, _, _, _ = rbc_env.step(actions)

# display simulation summary
plot_simulation_summary({
    'Baseline': baseline_env,
    'Random': random_env,
    'RBC': rbc_env,
})

### Exercise: design RBC controller
---
Our RBC is not very good yet - try to find an hourly control strategy that improves the electricity cost and emissions compared to the baseline! Look at the daily load profiles below to decide when to charge or discharge the battery, and think about when energy from the PV is available (look at the weather data).

In [None]:
print('Building-level daily-average load profiles:')
plot_building_load_profiles({'Baseline': baseline_env}, daily_average=True)
plt.show()

In [None]:
# define action map
action_map = {
    1: 0.0,
    2: 0.0,
    3: 0.0,
    4: 0.0,
    5: 0.0,
    6: 0.0,
    7: 0.0,
    8: 0.0,
    9: 0.0,
    10: 0.0,
    11: 0.0,
    12: 0.0,
    13: 0.0,
    14: 0.0,
    15: 0.0,
    16: 0.0,
    17: 0.0,
    18: 0.0,
    19: 0.0,
    20: 0.0,
    21: 0.0,
    22: 0.0,
    23: 0.0,
    24: 0.0,
}

# run inference
rbc_env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=ACTIVE_OBSERVATIONS,
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
)
rbc_model = HourRBC(rbc_env, action_map=action_map)
observations, _ = rbc_env.reset()

while not rbc_env.terminated:
    actions = rbc_model.predict(observations)
    observations, _, _, _, _ = rbc_env.step(actions)

# display simulation summary
plot_simulation_summary({
    'Baseline': baseline_env,
    'Random': random_env,
    'RBC': rbc_env
})

### Test 4: RL controller (Q-learning)
---

#### Reward function

First, let's define our reward function: our goal is to minimize the electricity cost $C$ by storing energy generated by the PV modules and using it when demand and costs are high. For each of the $n$ buildings $i$, we assign a penalty $p_i$ when (i) the battery is charged but not used or (ii) the battery is not charged but PV energy is fed into the grid.


$$
    r = \sum_{i=0}^{n-1} \Big(p_i \times |C_i|\Big)
$$

$$
    p_i = -\left(1 + \textrm{sign}(C_i) \times \textrm{SoC}^{\textrm{battery}}_i\right)
$$


When developing an RL solution, the design of the reward function is extremely important and has a great impact on how and what the agent learns!
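
To see how the penalty term behaves, the sketch below evaluates the reward of a single building for the four corner cases: import or export, with a full or an empty battery. The function names and numbers are hypothetical, and the state of charge is taken as a fraction between 0 and 1.

```python
def sign(x):
    """Return -1, 0 or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def building_reward(net_consumption_kwh, price, soc):
    """Reward p * |C| with p = -(1 + sign(C) * SoC) for one building."""
    cost = net_consumption_kwh * price
    penalty = -(1.0 + sign(cost) * soc)
    return penalty * abs(cost)


cases = {
    "import, full battery":  (1.0, 0.5, 1.0),
    "import, empty battery": (1.0, 0.5, 0.0),
    "export, full battery":  (-1.0, 0.5, 1.0),
    "export, empty battery": (-1.0, 0.5, 0.0),
}
for label, args in cases.items():
    print(f"{label:22s} reward = {building_reward(*args):+.2f}")
```

Importing from the grid while the battery is full is penalized most, and exporting with a full battery is not penalized at all.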

In [None]:
class EnergyCostReward(RewardFunction):
    def __init__(self, env_metadata: dict[str, Any]):
        r"""Initialize EnergyCostReward.

        Parameters
        ----------
        env_metadata: dict[str, Any]:
            General static information about the environment.
        """

        super().__init__(env_metadata)

    def calculate(
        self, observations: list[dict[str, int | float]]
    ) -> list[float]:
        r"""Returns reward for most recent action.

        The reward is designed to minimize electricity cost.
        It is calculated for each building, i and summed to provide the agent
        with a reward that is representative of all n buildings.
        It encourages net-zero energy use by penalizing grid load satisfaction
        when there is energy in the battery as well as penalizing
        net export when the battery is not fully charged through the penalty
        term. There is neither penalty nor reward when the battery
        is fully charged during net export to the grid. Whereas, when the
        battery is charged to capacity and there is net import from the
        grid the penalty is maximized.

        Parameters
        ----------
        observations: list[dict[str, int | float]]
            List of all building observations at current
            :py:attr:`citylearn.citylearn.CityLearnEnv.time_step`
            that are got from calling
            :py:meth:`citylearn.building.Building.observations`.

        Returns
        -------
        reward: list[float]
            Reward for transition to current timestep.
        """

        reward_list = []

        for o, m in zip(observations, self.env_metadata['buildings']):
            cost = o['net_electricity_consumption']*o['electricity_pricing']
            battery_soc = o['electrical_storage_soc']
            penalty = -(1.0 + np.sign(cost)*battery_soc)
            reward = penalty*abs(cost)
            reward_list.append(reward)

        reward = [sum(reward_list)]

        return reward

In [None]:
tql_env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=ACTIVE_OBSERVATIONS, 
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
    reward_function=EnergyCostReward  # add our new reward function
)

#### Discretize action space and state space

The tabular Q-learning algorithm only works for *discrete* action and state spaces, i.e. when there is a finite set of actions (move up/down, for example) and states. In our case, both spaces are *continuous*, so we have to discretize the action and state space.
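
To illustrate what discretization means here, the sketch below maps a continuous action in $[-1, 1]$ to one of 12 equal-width bins and back to the bin centre. Both helpers are hypothetical, and the binning scheme used internally by CityLearn's `TabularQLearningWrapper` may differ.

```python
def action_to_bin(action, n_bins=12, low=-1.0, high=1.0):
    """Index of the equal-width bin that contains the action."""
    clipped = min(max(action, low), high)
    index = int((clipped - low) * n_bins / (high - low))
    return min(index, n_bins - 1)  # the upper bound belongs to the last bin


def bin_to_action(index, n_bins=12, low=-1.0, high=1.0):
    """Continuous action at the centre of the given bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


for a in (-1.0, -0.3, 0.0, 1.0):
    i = action_to_bin(a)
    print(f"action {a:+.2f} -> bin {i:2d} -> centre {bin_to_action(i):+.3f}")
```

The round trip loses information: every action inside a bin is replaced by the same representative value, which is the price of making the problem tabular.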

In [None]:
# define active observations and actions and their bin sizes
observation_bins = {'hour': 24}
action_bins = {'electrical_storage': 12}

# initialize list of bin sizes where each building
# has a dictionary in the list defining its bin sizes
observation_bin_sizes = []
action_bin_sizes = []

for b in tql_env.buildings:
    # add a bin size definition for the buildings
    observation_bin_sizes.append(observation_bins)
    action_bin_sizes.append(action_bins)


# workaround for an error in the current CityLearn version:
# re-attach observation_names to the wrapper
class TabularQLearningWrapperDebug(TabularQLearningWrapper):
    def __init__(self, env, observation_bin_sizes = None, action_bin_sizes = None, default_observation_bin_size = None, default_action_bin_size = None):
        super().__init__(env, observation_bin_sizes, action_bin_sizes, default_observation_bin_size, default_action_bin_size)
        self.observation_names = env.observation_names


observation_names = tql_env.observation_names
tql_env = TabularQLearningWrapperDebug(
    tql_env,
    observation_bin_sizes=observation_bin_sizes,
    action_bin_sizes=action_bin_sizes
)

Run training:

In [None]:
# ----------------- CALCULATE NUMBER OF TRAINING EPISODES -----------------
m = tql_env.observation_space[0].n
n = tql_env.action_space[0].n

tql_episodes = 100

print('Q-Table dimension:', (m, n))
print('Number of episodes to train:', tql_episodes)

# ----------------------- SET MODEL HYPERPARAMETERS -----------------------
tql_kwargs = {
    'epsilon': 1.0,
    'minimum_epsilon': 0.01,
    'epsilon_decay': 0.0001,
    'learning_rate': 0.005,
    'discount_factor': 0.99,
}

# ----------------------- INITIALIZE AND TRAIN MODEL ----------------------
tql_model = TabularQLearning(
    env=tql_env,
    random_seed=None,
    **tql_kwargs
)

print(tql_model.q[0].shape)

for i in tqdm(range(tql_episodes)):
    _ = tql_model.learn()

In [None]:
observations, _ = tql_env.reset()

while not tql_env.unwrapped.terminated:
    actions = tql_model.predict(observations, deterministic=True)
    observations, _, _, _, _ = tql_env.step(actions)

# plot summary and compare with other control results
plot_simulation_summary({
    'Baseline': baseline_env,
    'Random': random_env,
    'RBC': rbc_env,
    'TQL': tql_env
})

#### Why the bad performance?

As stated before, tabular Q-learning is only applicable to discrete state and action spaces. The number of state-action pairs grows exponentially with the number of observation and action variables, and tabular Q-learning explores such large spaces inefficiently.
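
A rough sketch of how quickly a Q-table grows. It assumes a single joint table whose size is the product of all observation bin counts times the product of all action bin counts. This layout is an assumption for illustration, and the helper is hypothetical.

```python
from math import prod


def q_table_entries(observation_bin_counts, action_bin_counts):
    """Number of state-action pairs in one joint Q-table."""
    return prod(observation_bin_counts) * prod(action_bin_counts)


# One shared 'hour' observation (24 bins) and one battery action
# per building (12 bins each) for two buildings.
small = q_table_entries([24], [12, 12])

# Adding a 10-bin battery state of charge observation per building.
larger = q_table_entries([24, 10, 10], [12, 12])

print("hour only:      ", small)
print("hour + two SoCs:", larger)
```

Each additional variable multiplies the table size by its number of bins, so two extra 10-bin observations already make the table 100 times larger.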

### Test 5: Deep Reinforcement Learning Controller (SAC)
---

To overcome the curse of dimensionality, we use an RL algorithm based on neural-network function approximation called *Soft Actor-Critic* (SAC). It consists of two neural networks: the actor network learns which actions to take in a state, and the critic network evaluates those actions by learning the corresponding Q-values of the state-action pairs. For detailed information, see the [original paper](https://proceedings.mlr.press/v80/haarnoja18b) or [here](https://spinningup.openai.com/en/latest/algorithms/sac.html).

In [None]:
sac_env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=ACTIVE_OBSERVATIONS,
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
    reward_function=EnergyCostReward
)

sac_env = NormalizedObservationWrapper(sac_env)
sac_env = StableBaselines3Wrapper(sac_env)
sac_model = SAC(policy='MlpPolicy', env=sac_env)

In [None]:
# ----------------- CALCULATE NUMBER OF TRAINING EPISODES -----------------
sac_episodes = 25
sac_episode_timesteps = sac_env.unwrapped.time_steps - 1
sac_total_timesteps = sac_episodes*sac_episode_timesteps

# ------------------------------- TRAIN MODEL -----------------------------
for i in tqdm(range(sac_episodes)):
    sac_model = sac_model.learn(
        total_timesteps=sac_episode_timesteps,
        reset_num_timesteps=False,
    )

In [None]:
observations, _ = sac_env.reset()
sac_actions_list = []

while not sac_env.unwrapped.terminated:
    actions, _ = sac_model.predict(observations, deterministic=True)
    observations, _, _, _, _ = sac_env.step(actions)
    sac_actions_list.append(actions)

# plot summary and compare with other control results
plot_simulation_summary({
    'Baseline': baseline_env,
    'Random': random_env,
    'RBC': rbc_env,
    'TQL': tql_env,
    'SAC-1': sac_env
})

### Exercise: Improve SAC controller
---

In [None]:
# === Set SAC observation space ===

active_obs = [
    'hour',
    'day_type',
    'solar_generation',
    'net_electricity_consumption',
    'electrical_storage_soc'
    ]

sac2_env = CityLearnEnv(
    DATASET_NAME,
    central_agent=CENTRAL_AGENT,
    buildings=BUILDINGS,
    active_observations=active_obs,
    simulation_start_time_step=SIMULATION_START_TIME_STEP,
    simulation_end_time_step=SIMULATION_END_TIME_STEP,
    reward_function=EnergyCostReward
)

# === Set SAC hyperparameters ===
your_agent_kwargs = {
    'learning_rate': 0.0003,
    'buffer_size': 1000000,
    'learning_starts': 100,
    'batch_size': 256,
    'tau': 0.005,
    'gamma': 0.99,
    'train_freq': 1,
}

sac2_env = NormalizedObservationWrapper(sac2_env)
sac2_env = StableBaselines3Wrapper(sac2_env)
# pass the hyperparameters defined above to the model
sac2_model = SAC(policy='MlpPolicy', env=sac2_env, **your_agent_kwargs)

sac_episodes = 25
sac_episode_timesteps = sac2_env.unwrapped.time_steps - 1
sac_total_timesteps = sac_episodes*sac_episode_timesteps

# ------------------------------- TRAIN MODEL -----------------------------
for i in tqdm(range(sac_episodes)):
    sac2_model = sac2_model.learn(
        total_timesteps=sac_episode_timesteps,
        reset_num_timesteps=False,
    )

observations, _ = sac2_env.reset()
sac_actions_list = []

while not sac2_env.unwrapped.terminated:
    actions, _ = sac2_model.predict(observations, deterministic=True)
    observations, _, _, _, _ = sac2_env.step(actions)
    sac_actions_list.append(actions)


plot_simulation_summary({
    'Baseline': baseline_env,
    'SAC-1': sac_env,
    'SAC-2': sac2_env.unwrapped
})