## Evaluation

### Individual Buildings

To evaluate the performance of a policy $\pi$ on a Bauwerk building $b$, we consider the *expected return* obtained when policy $\pi$ is used to operate building $b$,
$$
    \mathbb{E}_{\pi}\left[\sum_{t=0}^{T}\gamma_b^t R_b(s_t, a_t)\right],
$$
where $R_b$ is the reward function and $\gamma_b$ is the discount factor of building $b$'s *partially observable Markov decision process* (POMDP), $T$ is the final time step of an episode, and $s_t$, $a_t$ are the random variables for the states and actions visited under policy $\pi$. Informally, this value captures the expected cost of using policy $\pi$ as a controller in building $b$.
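
To make the formula concrete, the following is a minimal sketch of how the discounted sum inside the expectation could be computed for a single episode, and how averaging over several episodes gives a Monte Carlo estimate of the expected return. The helper names `discounted_return` and `estimate_expected_return` are hypothetical and not part of Bauwerk, and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * rewards[t] for one episode (hypothetical helper)."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner's scheme).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_expected_return(episodes, gamma):
    """Average the discounted returns of several sampled episodes."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Illustrative reward sequence, not taken from a Bauwerk environment.
print(discounted_return([1.0, 2.0, 3.0], gamma=0.5))  # 1 + 0.5*2 + 0.25*3 = 2.75
```

With `gamma=1.0` the helper reduces to a plain sum of the rewards of an episode.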

Below we estimate this expected return for a random policy by averaging the cumulative reward over a number of sampled episodes.

In [7]:
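import gym
import bauwerk  # needed so that the Bauwerk environments are available to gym.make

NUM_SAMPLES = 10  # number of episodes used for the estimate

env = gym.make("bauwerk/House-v0")
cum_rewards = []

for i in range(NUM_SAMPLES):
    env.reset()  # start every episode from a fresh initial state
    cum_rewards.append(0)
    # Take random actions until the episode ends (capped at 10**6 steps).
    for _ in range(10**6):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        cum_rewards[i] += reward  # rewards are summed without discounting here
        if terminated or truncated:
            break

# Average the per-episode cumulative rewards.
overall_reward = sum(cum_rewards)/NUM_SAMPLES
print(f"Expected reward with random policy (estimated using {NUM_SAMPLES} samples): {overall_reward}")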

Expected reward with random policy (estimated using 10 samples): -8562.73419946369
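
An estimate based on only 10 episodes can be noisy, so it may help to report a standard error alongside the mean. The following is a minimal sketch using only the Python standard library. The helper `mean_and_standard_error` is hypothetical, and the list of returns is a made-up placeholder standing in for per-episode values such as `cum_rewards` above.

```python
import math
import statistics


def mean_and_standard_error(returns):
    """Return the sample mean and its standard error (hypothetical helper)."""
    mean = statistics.fmean(returns)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(returns) / math.sqrt(len(returns))
    return mean, sem


# Placeholder values for illustration only, not actual Bauwerk results.
example_returns = [-8.0, -10.0, -9.0, -11.0]
mean, sem = mean_and_standard_error(example_returns)
print(f"Estimated return: {mean:.2f} +/- {sem:.2f} (standard error)")
```

If the standard error is large relative to the differences between the policies being compared, increasing `NUM_SAMPLES` is a simple way to tighten the estimate.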
