### Group ID: 134
### Group Members Name with Student ID:

| Student Name       | Student ID    | Contribution |
|--------------------|---------------|--------------|
| Chakshu            | 2023aa05280   | 100%         |
| Gali Jahnavi       | 2023aa05684   | 100%         |
| Aashaank Pratap    | 2023aa05023   | 100%         |
| Shivam Sahil       | 2023aa05663   | 100%         |

# Problem Statement

The objective of the problem is to implement an Actor-Critic reinforcement learning algorithm to optimize energy consumption in a building. The agent should learn to adjust the temperature settings dynamically to minimize energy usage while maintaining comfortable indoor conditions.

#### Dataset Details
Dataset: https://archive.ics.uci.edu/dataset/374/appliances+energy+prediction

This dataset contains energy consumption data for a residential building, along with various environmental and operational factors.

Data Dictionary:
* Appliances:       Energy use in Wh
* lights:           Energy use of light fixtures in the house in Wh
* T1 - T9:          Temperatures in various rooms and outside
* RH_1 to RH_9:     Humidity measurements in various rooms and outside
* Visibility:       Visibility in km
* Tdewpoint:       Dew point temperature
* Press_mm_hg:      Pressure in mm Hg
* Windspeed:        Wind speed in m/s

#### Environment Details
**State Space:**
The state space consists of various features from the dataset that impact energy consumption and comfort levels.

* Current Temperature (T1 to T9): Temperatures in various rooms and outside.
* Current Humidity (RH_1 to RH_9): Humidity measurements in different locations.
* Visibility (Visibility): Visibility in km.
* Dew Point (Tdewpoint): Dew point temperature.
* Pressure (Press_mm_hg): Atmospheric pressure in mm Hg.
* Windspeed (Windspeed): Wind speed in m/s.

Total State Vector Dimension: Number of features = 9 (temperature) + 9 (humidity) + 1 (visibility) + 1 (dew point) + 1 (pressure) + 1 (windspeed) = 22 features

**Target Variable:** Appliances (energy consumption in Wh).

**Action Space:**
The action space consists of discrete temperature adjustments:
* Action 0: Decrease temperature by 1°C
* Action 1: Maintain current temperature
* Action 2: Increase temperature by 1°C


- If the action is to decrease the temperature by 1°C, you'll adjust each temperature feature (T1 to T9) down by 1°C.
- If the action is to increase the temperature by 1°C, you'll adjust each temperature feature (T1 to T9) up by 1°C.
- Other features remain unchanged.
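
The adjustment rule above can be sketched as follows. This is a minimal illustration, assuming the state is held as a plain dict keyed by the dataset column names; `apply_action` and `TEMP_FEATURES` are hypothetical names introduced here, not part of the notebook.

```python
TEMP_FEATURES = [f"T{i}" for i in range(1, 10)]  # T1 .. T9


def apply_action(state, action):
    """Return a copy of `state` with T1..T9 shifted by the chosen action.

    Hypothetical helper: action 0 -> -1 °C, action 1 -> 0 °C, action 2 -> +1 °C.
    """
    adjustment = action - 1
    next_state = dict(state)  # copy so the caller's state is not mutated
    for name in TEMP_FEATURES:
        next_state[name] = state[name] + adjustment
    return next_state


# Illustrative state: the nine temperatures from the example plus one humidity value
state = {"T1": 23, "T2": 22, "T3": 21, "T4": 23, "T5": 22,
         "T6": 21, "T7": 24, "T8": 22, "T9": 23, "RH_1": 45.0}

decreased = apply_action(state, 0)
print([decreased[name] for name in TEMP_FEATURES])
print(decreased["RH_1"])  # non-temperature features are left unchanged
```

Returning a copy rather than editing in place keeps the pre-action state available for the reward calculation.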

**Policy (Actor):** A neural network that outputs a probability distribution over possible temperature adjustments.
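
Because the actor outputs a probability distribution, an action is usually drawn from that distribution during training rather than always taking the most likely one. A minimal sketch, assuming the probabilities are available as a plain list; `sample_action` and the example probabilities are illustrative, not taken from the notebook.

```python
import random

ACTIONS = ["decrease", "maintain", "increase"]  # actions 0, 1, 2


def sample_action(action_probs, rng=random):
    """Draw an action index according to the given probabilities (hypothetical helper)."""
    return rng.choices(range(len(action_probs)), weights=action_probs, k=1)[0]


rng = random.Random(0)          # seeded generator so the run is repeatable
probs = [0.2, 0.5, 0.3]         # illustrative actor output for one state
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_action(probs, rng)] += 1

# Empirical frequencies should be close to the policy's probabilities
print({ACTIONS[i]: counts[i] / 10_000 for i in range(3)})
```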

**Value function (Critic):** A neural network that estimates the expected cumulative reward (energy savings) from a given state.
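
One common way to connect the two networks is the one-step temporal-difference (TD) form: the critic's estimates give a target for the current state, and the gap between that target and the current estimate (the advantage) scales the actor's update. The sketch below assumes this one-step form; the `done` flag, the function name, and the numbers are illustrative assumptions.

```python
def td_target_and_advantage(reward, value_current, value_next,
                            discount_factor=0.99, done=False):
    """One-step TD target and advantage (illustrative helper).

    target    = r + gamma * V(s')   (no bootstrap term if the episode ended)
    advantage = target - V(s)
    """
    bootstrap = 0.0 if done else discount_factor * value_next
    target = reward + bootstrap
    advantage = target - value_current
    return target, advantage


# Illustrative numbers only: reward of -6, critic estimates V(s) = -10 and V(s') = -5
target, advantage = td_target_and_advantage(reward=-6.0,
                                            value_current=-10.0,
                                            value_next=-5.0)
print(target, advantage)
```

A positive advantage means the action turned out better than the critic expected, so its probability is pushed up; a negative advantage pushes it down.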

**Reward function:**
The reward function should reflect both overall comfort and energy efficiency across all temperature readings; that is, it should balance minimizing temperature deviations against minimizing energy consumption.

* Calculate the penalty based on the deviation of each temperature from the target temperature and then aggregate these penalties.
* Measure the change in energy consumption before and after applying the RL action.
* Combine the comfort penalty and energy savings to get the final reward.

*Example:*

Target temperature=22°C

Initial Temperatures: T1=23, T2=22, T3=21, T4=23, T5=22, T6=21, T7=24, T8=22, T9=23

Action Taken: Decrease temperature by 1°C for each room

Resulting Temperatures: T1 = 22, T2 = 21, T3 = 20, T4 = 22, T5 = 21, T6 = 20, T7 = 23, T8 = 21, T9 = 22

Energy Consumption: 50 Wh (before RL adjustment) and 48 Wh (after RL adjustment)
* Energy Before (50 Wh): Use the energy consumption from the dataset at the current time step.
* Energy After (48 Wh): Use the energy consumption from the dataset at the next time step (if available).

Consider only temperature features for deviation calculation.

Deviation = abs(Ti − Ttarget)

Deviations = [abs(22−22), abs(21−22), abs(20−22), abs(22−22), abs(21−22), abs(20−22), abs(23−22), abs(21−22), abs(22−22)]

Deviations = [0, 1, 2, 0, 1, 2, 1, 1, 0], Sum of deviations = 8

Energy Savings = Energy Before − Energy After = 50 − 48 = 2 Wh

Reward = −Sum of Deviations + Energy Savings = −8 + 2 = −6
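
The arithmetic of this example can be checked with a few lines of Python. This is a minimal sketch of the reward as described in the bullet points; `compute_reward` is a hypothetical name, and the comfort penalty and the energy savings are weighted equally only because the example does so.

```python
TARGET_TEMPERATURE = 22  # °C, from the example


def compute_reward(resulting_temps, energy_before, energy_after,
                   target=TARGET_TEMPERATURE):
    """Reward = energy savings minus the summed absolute temperature deviations."""
    deviation_sum = sum(abs(t - target) for t in resulting_temps)
    energy_savings = energy_before - energy_after
    return energy_savings - deviation_sum


# Resulting temperatures T1..T9 after decreasing each room by 1 °C
resulting_temps = [22, 21, 20, 22, 21, 20, 23, 21, 22]
reward = compute_reward(resulting_temps, energy_before=50, energy_after=48)
print(reward)  # -6: savings of 2 Wh minus a deviation sum of 8
```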

#### Expected Outcomes
1. Pre-process the dataset to handle any missing values and create training and testing sets.
2. Implement the Actor-Critic algorithm using TensorFlow.
3. Train the model over 500 episodes to minimize energy consumption while maintaining an indoor temperature of 22°C.
4. Plot the total reward obtained in each episode to evaluate the learning progress.
5. Evaluate the trained model on the test set.
6. Provide graphs showing the convergence of the Actor and Critic losses.
7. Plot the learned policy by showing the action probabilities across different state values (e.g., temperature settings).
8. Provide an analysis on a comparison of the energy consumption before and after applying the reinforcement learning algorithm.


#### Code Execution

In [41]:
## Necessary Libraries
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from tensorflow.keras import layers, models
#### Load the dataset
data = pd.read_csv(r'dataset/energydata_complete.csv')
# Display the first few rows of the dataset to inspect it
data.head(), data.info(), data.describe()

Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,...,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,...,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,...,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,...,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,...,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,...,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


In [42]:
# Replace missing values in numeric columns with the column mean
data = data.fillna(data.mean(numeric_only=True))

# Pre-process the dataset to get the features and target and scale them
features = ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9', 
            'RH_1', 'RH_2', 'RH_3', 'RH_4', 'RH_5', 'RH_6', 'RH_7', 'RH_8', 'RH_9', 
            'Visibility', 'Tdewpoint', 'Press_mm_hg', 'Windspeed']
target = ['Appliances']

X = data[features]
y = data[target]

# Normalize them with Standard Scaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the data to training and testing sets (80% for training, 20% for testing)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)

print("Data preprocessing completed.")
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

(date           0
 Appliances     0
 lights         0
 T1             0
 RH_1           0
 T2             0
 RH_2           0
 T3             0
 RH_3           0
 T4             0
 RH_4           0
 T5             0
 RH_5           0
 T6             0
 RH_6           0
 T7             0
 RH_7           0
 T8             0
 RH_8           0
 T9             0
 RH_9           0
 T_out          0
 Press_mm_hg    0
 RH_out         0
 Windspeed      0
 Visibility     0
 Tdewpoint      0
 rv1            0
 rv2            0
 dtype: int64,
 date           datetime64[ns]
 Appliances              int64
 lights                  int64
 T1                    float64
 RH_1                  float64
 T2                    float64
 RH_2                  float64
 T3                    float64
 RH_3                  float64
 T4                    float64
 RH_4                  float64
 T5                    float64
 RH_5                  float64
 T6                    float64
 RH_6                  float6

In [43]:
# Pre-process the dataset: scale the non-temperature features and split into training and testing sets
# T1-T9 are kept in °C so the ±1°C actions and the 22°C target used by the reward stay meaningful
temp_features = [f'T{i}' for i in range(1, 10)]
features_to_normalize = data.columns.drop(['date', 'Appliances', 'lights'] + temp_features)

# Applying Min-Max Scaling
scaler = MinMaxScaler()
data[features_to_normalize] = scaler.fit_transform(data[features_to_normalize])

# Define features and target variable
features = data.drop(['Appliances', 'date'], axis=1)  # Exclude 'date' if it's not used as a feature
target = data['Appliances']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=0)

#### Defining Actor Critic Model using tensorflow (1 M)

In [44]:
### Define Actor Model

data = data.drop(['date', 'Appliances'], axis=1)  # Dropping non-feature columns

# Check the number of features now
state_space = data.shape[1]  # This should be the actual number of features
print(f"State space (number of features): {state_space}")

### Redefine Actor and Critic Models to match the actual state space
def build_actor_model():
    model = models.Sequential([
        layers.Dense(64, activation='relu', input_shape=(state_space,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(3, activation='softmax')  # Assuming 3 actions: Decrease, Maintain, Increase
    ])
    return model

def build_critic_model():
    model = models.Sequential([
        layers.Dense(64, activation='relu', input_shape=(state_space,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)  # Output is the value estimation
    ])
    return model

# Rebuild models with the corrected state space
actor_model = build_actor_model()
critic_model = build_critic_model()

State space (number of features): 27




### Reward Function (0.5 M)

In [45]:
### Calculate Reward Function

def calculate_reward(current_state, next_state):
    target_temperature = 22
    # Extract temperature features (columns 'T1' to 'T9')
    temp_features = ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9']
    # Comfort is judged on the temperatures that result from the action
    next_temps = next_state[temp_features]
    
    # Calculate deviation penalty: sum of absolute differences from target temperature
    deviation_penalty = (next_temps - target_temperature).abs().sum()
    
    # Calculate energy savings: difference in energy use before and after the action
    energy_savings = current_state['Appliances'] - next_state['Appliances']
    
    # Combine the comfort penalty and energy savings to get the final reward
    reward = energy_savings - deviation_penalty

    return reward


#### Environment Simulation (0.5 M)


In [46]:
### Environment Simulation

def simulate_environment(current_state, action, data, index):
    # Adjust temperature based on action
    temp_adjustment = action - 1  # action: 0 (decrease by 1°C), 1 (maintain), 2 (increase by 1°C)
    temp_features = ['T1', 'T2', 'T3', 'T4', 'T5', 'T6', 'T7', 'T8', 'T9']
    
    # Update temperatures
    next_state = current_state.copy()
    next_state[temp_features] += temp_adjustment
    
    # Ensure we don't go out of bounds in the dataset
    if index + 1 < len(data):
        next_index = index + 1
    else:
        next_index = index  # Stay at the last index if we're at the end of the data set
    
    # Get energy consumption before and after the action
    energy_before = current_state['Appliances']
    energy_after = data.iloc[next_index]['Appliances']
    
    # Update the state with new energy usage for continuity in simulation
    next_state['Appliances'] = energy_after
    
    # Calculate the reward
    reward = calculate_reward(current_state, next_state)
    
    # Return the new state and the reward
    return next_state, reward

#### Implementation of Training Function (2 M)

In [49]:
# Train the Actor-Critic models

def update_models(current_state, action, advantage, target, actor_model, critic_model, optimizer_actor, optimizer_critic):
    # Convert current state to a suitable tensor for prediction
    state_tensor = tf.convert_to_tensor(current_state.values.reshape(1, -1), dtype=tf.float32)

    # Update Critic Model
    with tf.GradientTape() as tape:
        tape.watch(critic_model.trainable_variables)
        value = critic_model(state_tensor)
        # Calculate critic loss as mean squared error between target values and predicted values
        loss_critic = tf.keras.losses.MSE(target, value)
    grads = tape.gradient(loss_critic, critic_model.trainable_variables)
    optimizer_critic.apply_gradients(zip(grads, critic_model.trainable_variables))
    
    # Update Actor Model
    with tf.GradientTape() as tape:
        tape.watch(actor_model.trainable_variables)
        action_probs = actor_model(state_tensor)
        action_log_probs = tf.math.log(action_probs[0, action])
        loss_actor = -action_log_probs * advantage  # Negative for gradient ascent
    grads = tape.gradient(loss_actor, actor_model.trainable_variables)
    optimizer_actor.apply_gradients(zip(grads, actor_model.trainable_variables))


def train_function(features, energy, episodes=500):
    discount_factor = 0.99
    optimizer_actor = tf.keras.optimizers.Adam(learning_rate=0.001)
    optimizer_critic = tf.keras.optimizers.Adam(learning_rate=0.001)

    # The environment and reward read 'Appliances', so attach it to every row;
    # the networks only ever see the original feature columns
    feature_cols = list(features.columns)
    env_data = features.assign(Appliances=energy)

    def to_input(state):
        # Shape (1, state_space) float32 array for the actor and critic
        return state[feature_cols].values.reshape(1, -1).astype(np.float32)

    for episode in range(episodes):
        total_reward = 0
        current_state = env_data.iloc[0]  # Reset to initial state at start of each episode
        index = 0
        
        while index < len(env_data) - 1:
            # Sample an action from the actor's distribution (argmax would never explore)
            action_probabilities = actor_model(to_input(current_state)).numpy()[0].astype(np.float64)
            action_probabilities /= action_probabilities.sum()  # guard against float32 rounding
            action = int(np.random.choice(3, p=action_probabilities))
            
            # Simulate the environment with the chosen action
            next_state, reward = simulate_environment(current_state, action, env_data, index)
            total_reward += reward
            
            # Compute target and advantage for updating critic and actor
            value_current = critic_model(to_input(current_state)).numpy()
            value_next = critic_model(to_input(next_state)).numpy()
            target = np.float32(reward) + discount_factor * value_next
            advantage = target - value_current
            
            # Update models (pass only the feature columns so the input matches state_space)
            update_models(current_state[feature_cols], action, advantage, target, actor_model, critic_model, optimizer_actor, optimizer_critic)
            
            # Update state and index
            current_state = next_state
            index += 1
        
        # Print mean reward for episode
        mean_reward = total_reward / index
        print(f'Episode {episode + 1}: Mean Reward = {mean_reward}')

train_function(features, target)

#### Evaluate the performance of the model on test set (0.5 M)

In [None]:
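# Evaluate the model on the test set

def evaluate_model(test_features, test_energy):
    # Attach 'Appliances' so the environment and reward can read the energy use
    feature_cols = list(test_features.columns)
    env_data = test_features.assign(Appliances=test_energy)
    total_reward = 0.0

    for index in range(len(env_data) - 1):
        current_state = env_data.iloc[index]
        # predict the action and simulate the environment accordingly and get the respective next state
        state_input = current_state[feature_cols].values.reshape(1, -1).astype(np.float32)
        action = int(np.argmax(actor_model(state_input).numpy()))  # greedy action at test time
        next_state, reward = simulate_environment(current_state, action, env_data, index)
        # calculate rewards for test set
        total_reward += reward

    return total_reward

# Print the total reward obtained on the test set
# Note: X_test comes from a shuffled split, so consecutive rows are not consecutive in time
print(f"Total reward on test set: {evaluate_model(X_test, y_test)}")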

### Plot the convergence of Actor and Critic losses (1 M)

In [None]:
# Plot the convergence of Actor and Critic losses
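
Per-step actor and critic losses tend to be noisy, so a convergence plot is often easier to read after smoothing. The sketch below assumes the training loop has been extended to append each loss to a list (the notebook does not record losses yet); `moving_average` and the placeholder values are hypothetical.

```python
def moving_average(values, window=50):
    """Trailing moving average; the first few points average over fewer samples."""
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the sample that left the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


# Placeholder numbers standing in for recorded critic losses (not real results)
critic_losses = [4.0, 2.0, 3.0, 1.0]
print(moving_average(critic_losses, window=2))  # [4.0, 3.0, 2.5, 2.0]
```

The smoothed lists can then be drawn with matplotlib, for example one `plt.plot` call per network on a shared axis labelled with the training step.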

### Plot the learned policy - by showing the action probabilities across different state values (1 M)

In [None]:
# Plot the learned policy - by showing the action probabilities across different state values

# From the trained actor model, for each state in training set,
# plot the probability of each action (increasing/decreasing/maintaining) the temperature

#### Conclusion (0.5 M)

In [None]:
# Provide an analysis on a comparison of the energy consumption
# before and after applying the reinforcement learning algorithm.