# Project 11: Building Energy Consumption
*Xinchang Li <br>
November 29, 2020*
## 0. Load Modules and Data

In [None]:
# Load useful modules
import os
import gc
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import tensorflow as tf
import sklearn.metrics
from sklearn.preprocessing import LabelEncoder

In [None]:
# Print all files in the input directory (auto-generated code from Kaggle)
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# Read in the .csv files as Pandas DataFrames
bldg_meta = pd.read_csv('/kaggle/input/ashrae-energy-prediction/building_metadata.csv')

train = pd.read_csv('/kaggle/input/ashrae-energy-prediction/train.csv')
weather_train = pd.read_csv('/kaggle/input/ashrae-energy-prediction/weather_train.csv')

# To conserve RAM, the following files are loaded later before they are used.

# test = pd.read_csv('/kaggle/input/ashrae-energy-prediction/test.csv')
# weather_test = pd.read_csv('/kaggle/input/ashrae-energy-prediction/weather_test.csv')
# sample_sub = pd.read_csv('/kaggle/input/ashrae-energy-prediction/sample_submission.csv')

## 1. Exploratory Data Analysis Recap
The full Exploratory Data Analysis (EDA) is available here: https://www.kaggle.com/xinchangli/cee-498ds-project-11-eda

The [exploratory data analysis (EDA)](https://www.kaggle.com/xinchangli/cee-498ds-project-11-eda) on the ASHRAE Great Energy Predictor III dataset shows that: <br>
1. **Target variable**: hourly `meter_reading` for 4 types of meters (`{0: electricity, 1: chilledwater, 2: steam, 3: hotwater}`) in 1,449 buildings from 16 sites. <br>
2. **Independent variables (features)** from dataset: <br>
       `site_id`, `building_id`, `primary_use`, `square_feet`, `year_built`,
       `floor_count`, `meter`, `air_temperature`,
       `cloud_coverage`, `dew_temperature`, `precip_depth_1_hr`,
       `sea_level_pressure`, `wind_direction`, `wind_speed`
3. **Potential engineered features**: <br>
       `month`, `day_of_week`, `hour`
4. **Missing data in bldg_meta**: <br>
    * primary_use: 	0 (0.0%)
    * year_built: 	774 (53.4%)
    * square_feet: 	0 (0.0%)
    * floor_count: 	1094 (75.5%) <br>
5. **Missing data in weather_train**:
    * air_temperature: 	55 (0.0%)
    * cloud_coverage: 	69173 (49.5%)
    * dew_temperature: 	113 (0.1%)
    * precip_depth_1_hr: 	50289 (36.0%)
    * sea_level_pressure: 	10618 (7.6%)
    * wind_direction: 	6268 (4.5%)
    * wind_speed: 	304 (0.2%) <br>
6. **Missing data in weather_test**:
    * air_temperature: 	104 (0.0%)
    * cloud_coverage: 	140448 (50.7%)
    * dew_temperature: 	327 (0.1%)
    * precip_depth_1_hr: 	95588 (34.5%)
    * sea_level_pressure: 	21265 (7.7%)
    * wind_direction: 	12370 (4.5%)
    * wind_speed: 	460 (0.2%)
<br>
7. **Two outlier buildings (`building_id`)**: 740 and 1099 <br>
8. **Independent variables that are correlated with each other** (correlation coefficient > 0.4):
    * `square_feet` and `floor_count` 
    * `air_temperature` and `dew_temperature`
    * `wind_direction` and `wind_speed`
    

## 2. Model Preparation
### 2.1 Choosing the Model: Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM)
This dataset is in essence a time-series dataset, which is what RNNs are designed to handle. LSTM is one of the most effective and commonly used RNN variants, and it mitigates the vanishing gradient problem of plain RNNs. The advantage of using RNN-LSTM is that, instead of relying on engineered features to encode time information, the model architecture inherently carries this information and learns the relationships between time steps, reducing the number of features needed.

### 2.2 Training Data Preprocessing
#### 2.2.1 Building Metadata
We will first treat the building metadata, as it is used in both training and testing. `year_built` and `floor_count` are the two features containing missing data. Since buildings on one site were likely built around the same time, we use the average `year_built` within a site to impute the missing values. Similarly, buildings with the same `primary_use` may have a similar number of floors, so we use the average `floor_count` of each `primary_use` to impute the missing floor counts.

In [None]:
# Fill in missing year_built with the mean of other buildings in the same site; 
# if none of the buildings in a site has year_built, then fill with the mean of the entire dataset
year_built_gp = bldg_meta.groupby('site_id')['year_built']
bldg_meta['year_built'] = year_built_gp.transform(lambda x: x.fillna(x.mean()))
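# Assign the result back: in-place fillna on a selected column is not reliable in recent pandas versions
bldg_meta['year_built'] = bldg_meta['year_built'].fillna(np.nanmean(bldg_meta['year_built']))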
assert pd.isnull(bldg_meta['year_built']).sum() == 0


# Fill in missing floor_count with the mean of other buildings of the same primary_use; 
# if none of the buildings of a primary use has floor_count, then fill with the mean of the entire dataset
floor_count_gp = bldg_meta.groupby('primary_use')['floor_count']
bldg_meta['floor_count'] = floor_count_gp.transform(lambda x: x.fillna(x.mean()))
bldg_meta['floor_count'] = bldg_meta['floor_count'].fillna(np.nanmean(bldg_meta['floor_count']))
assert pd.isnull(bldg_meta['floor_count']).sum() == 0

# To conserve RAM:
del year_built_gp, floor_count_gp

gc.collect()
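
The two-step imputation above (group mean first, overall mean as a fallback) can be illustrated on a toy table. This is a minimal sketch with made-up values, not the competition data; the `toy` dataframe and its numbers are assumptions for illustration only. As in the cell above, the fallback mean is computed after the group-wise fill, so it includes the imputed values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy metadata: site 2 has no year_built at all
toy = pd.DataFrame({
    "site_id":    [0, 0, 0, 1, 1, 2],
    "year_built": [1990.0, np.nan, 2010.0, 1970.0, np.nan, np.nan],
})

# Step 1: fill gaps with the mean year_built of the same site
site_mean = toy.groupby("site_id")["year_built"].transform("mean")
toy["year_built"] = toy["year_built"].fillna(site_mean)

# Step 2: sites without any year_built fall back to the overall mean
toy["year_built"] = toy["year_built"].fillna(toy["year_built"].mean())

print(toy)
```

Site 0 receives its own mean (2000), site 1 receives 1970, and site 2, which has no data, receives the overall mean of the already-filled column (1988).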

#### 2.2.2 Weather Data
For `weather_train`, some entries appear to be missing entirely, i.e. for some hours in the training period there is not a single weather record. Since an RNN cannot handle NaN values, we need to find the missing hours and add them as rows to `weather_train`, so that they can be imputed together with the other missing values.

In [None]:
weather_train.shape
# this should have 140,544 records (16 x 24 x 366, 2016 is a leap year) -> 771 hours missing

In [None]:
# The following function is inspired by https://www.kaggle.com/aitude/ashrae-missing-weather-data-handling
def add_missing_hours(weather_df):
    import datetime
    time_format = "%Y-%m-%d %H:%M:%S"
    start_date = datetime.datetime.strptime(weather_df['timestamp'].min(),time_format)
    end_date = datetime.datetime.strptime(weather_df['timestamp'].max(),time_format)
    total_hours = int(((end_date - start_date).total_seconds() + 3600) / 3600)
    hours_list = [(end_date - datetime.timedelta(hours=x)).strftime(time_format) for x in range(total_hours)]

    for site_id in range(16):
        site_hours = np.array(weather_df[weather_df['site_id'] == site_id]['timestamp'])
        new_rows = pd.DataFrame(np.setdiff1d(hours_list,site_hours),columns=['timestamp'])
        new_rows['site_id'] = site_id
        weather_df = pd.concat([weather_df,new_rows]).reset_index(drop=True)
    return weather_df

weather_train = add_missing_hours(weather_train)

# Verify the number of entries
weather_train.shape 
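
The expected row count is 16 sites × 24 hours × 366 days = 140,544. As an alternative to the loop above, the same gap filling can be sketched with a full (site, hour) index and `reindex`. This is only a sketch on a made-up toy table; it assumes `timestamp` has already been converted to datetime, which the notebook does only in a later cell.

```python
import pandas as pd

# Hypothetical toy weather table: site 0 is missing 02:00, site 1 is missing 00:00
toy = pd.DataFrame({
    "site_id": [0, 0, 1, 1],
    "timestamp": pd.to_datetime([
        "2016-01-01 00:00:00", "2016-01-01 01:00:00",
        "2016-01-01 01:00:00", "2016-01-01 02:00:00",
    ]),
    "air_temperature": [3.0, 2.5, 10.0, 9.0],
})

# Every hour between the first and last timestamp, for every site
hours = pd.date_range(toy["timestamp"].min(), toy["timestamp"].max(), freq="h")
full_index = pd.MultiIndex.from_product(
    [sorted(toy["site_id"].unique()), hours], names=["site_id", "timestamp"]
)

# Reindexing inserts the missing (site, hour) rows with NaN weather values
filled = (
    toy.set_index(["site_id", "timestamp"])
       .reindex(full_index)
       .reset_index()
)

print(filled.shape)
print(16 * 24 * 366)  # expected rows for 16 sites in a leap year
```

The inserted rows carry NaN in every weather column, so they are picked up by the imputation step that follows.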

We will now deal with the missing data in `weather_train`. Since most of the weather variables show clear seasonality and follow an annual cycle, we impute the missing data for each weather variable with the average of the remaining data in the respective month at the same site. To do that, we first need to convert the `timestamp` column into a datetime format, so that we can later extract `month` from it. 

In [None]:
def convert_to_datetime(df_list):
    for df in df_list:
        df['timestamp'] = pd.to_datetime(df.timestamp)
    return df_list

train, weather_train = convert_to_datetime([train, weather_train])

In [None]:
# Columns to impute
weather_cols = ['air_temperature','cloud_coverage', 'dew_temperature', 'precip_depth_1_hr', 
                'sea_level_pressure', 'wind_direction', 'wind_speed']

def fill_weather_nan(df):
    # Create the month column
    df['month'] = df.timestamp.dt.month
    # Create groupby object
    gp = df.groupby(['site_id', 'month'])
    for col in weather_cols:
        df[col] = gp[col].transform(lambda x: x.fillna(x.mean()))
        # Fall back to the overall mean; assign back instead of relying on in-place fillna
        df[col] = df[col].fillna(np.nanmean(df[col]))
    # Delete the month column after use
    del df['month']
    return df

weather_train = fill_weather_nan(weather_train)

In [None]:
weather_train.info()

Finally, we will merge the dataframes into `train_full`, remove the outlier buildings from the training data, and modify the data types to conserve RAM:

In [None]:
train_full = bldg_meta.merge(train, on='building_id').merge(weather_train, on=['site_id', 'timestamp'])
print(train_full.shape)

assert train_full.shape[0] == train.shape[0]
#test_full = bldg_meta.merge(test, on='building_id').merge(weather_test, on=['site_id', 'timestamp'])

# To release RAM:
del train, weather_train

gc.collect()

In [None]:
# Remove the outliers from the training data
train_full = train_full[(train_full['building_id']!=740)&(train_full['building_id']!=1099)]

In [None]:
# The following function is based on https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
def reduce_mem_usage(df):
    start_mem_usg = df.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in df.columns:
        if df[col].dtype != object and df[col].dtype != 'datetime64[ns]':  # Exclude strings and datetime         
            # Print current column type
            print("******************************")
            print("Column: ",col)
            print("dtype before: ",df[col].dtype)            
            # make variables for Int, max and min
            IsInt = False
            mx = df[col].max()
            mn = df[col].min()
            print("min for this col: ",mn)
            print("max for this col: ",mx)
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(df[col]).all(): 
                NAlist.append(col)
                df[col].fillna(mn-1,inplace=True)  
                   
            # test if column can be converted to an integer
            asint = df[col].fillna(0).astype(np.int64)
            result = (df[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        df[col] = df[col].astype(np.uint8)
                    elif mx < 65535:
                        df[col] = df[col].astype(np.uint16)
                    elif mx < 4294967295:
                        df[col] = df[col].astype(np.uint32)
                    else:
                        df[col] = df[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        df[col] = df[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        df[col] = df[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        df[col] = df[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        df[col] = df[col].astype(np.int64)    
            # Make float datatypes 32 bit
            else:
                df[col] = df[col].astype(np.float32)
            
            # Print new column type
            print("dtype after: ",df[col].dtype)
            print("******************************")
            
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = df.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return df, NAlist

train_full, _ = reduce_mem_usage(train_full)
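
The idea behind `reduce_mem_usage` is to store each column in the smallest dtype that still holds its values. A shorter variant can be sketched with `pd.to_numeric(..., downcast=...)` for integer columns and an explicit cast to `float32` for float columns. This is an illustrative alternative on a made-up toy table, not the function used in this notebook, and unlike that function it does not handle missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with default 64-bit dtypes
toy = pd.DataFrame({
    "building_id": np.array([0, 740, 1448], dtype=np.int64),
    "square_feet": np.array([7432, 98829, 92271], dtype=np.int64),
    "air_temperature": np.array([25.0, 24.4, 22.8], dtype=np.float64),
})
before = toy.memory_usage(index=False).sum()

for col in toy.columns:
    if pd.api.types.is_integer_dtype(toy[col]):
        # Smallest unsigned type for non-negative columns, else smallest signed type
        kind = "unsigned" if toy[col].min() >= 0 else "integer"
        toy[col] = pd.to_numeric(toy[col], downcast=kind)
    elif pd.api.types.is_float_dtype(toy[col]):
        # 32-bit floats halve the memory at the cost of some precision
        toy[col] = toy[col].astype(np.float32)

after = toy.memory_usage(index=False).sum()
print(toy.dtypes)
print(before, "->", after, "bytes")
```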

#### 2.2.3 Categorical Column: `primary_use`
I first tried the one-hot encoding for the 16 primary use types, but it created very sparse data (i.e. every one-hot category column only has a small fraction of ones) and quickly consumed all memory. Hence I chose to use the label encoder from sklearn to convert the categories into integers.

In [None]:
# # Convert primary_use into one-hot columns
# train_full = train_full.join(pd.get_dummies(train_full.primary_use)).drop(columns = 'primary_use')
# test_full = test_full.join(pd.get_dummies(test_full.primary_use)).drop(columns = 'primary_use')

le = LabelEncoder()
train_full["primary_use"] = le.fit_transform(train_full["primary_use"])

In [None]:
train_full.info()
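
To give a rough sense of the memory argument, the sketch below compares one-hot columns with a single integer-coded column on a made-up categorical series. The category names and row count are assumptions, and pandas category codes stand in for sklearn's `LabelEncoder` here.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 16 made-up category names over 160,000 rows
rng = np.random.default_rng(0)
categories = [f"use_{i}" for i in range(16)]
primary_use = pd.Series(rng.choice(categories, size=160_000), name="primary_use")

# One-hot encoding: one column per category
one_hot = pd.get_dummies(primary_use)

# Integer codes: a single small-integer column
codes = primary_use.astype("category").cat.codes

print(one_hot.shape, one_hot.memory_usage(index=False).sum(), "bytes")
print(codes.dtype, codes.memory_usage(index=False), "bytes")
```

With 16 categories, the one-hot version needs 16 one-byte columns where the integer coding needs one, before counting any temporary copies created while joining the new columns onto a large dataframe.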

## 3. Model Training
### 3.1 Creating Training Data Tensors
For RNN, training tensors need to have the following shape:
    `[number of samples, number of timesteps, number of features]`

Each sample needs to have the same shape. However, not every building has records for the whole of 2016. To handle this, I used the same truncating technique as in Class 12, with three major modifications:
1. **Each sample is a building-meter pair**: this solves the problem that not every building has all meter types, and makes the number of timesteps uniform;
2. **Setting a cleaning threshold (`THRES`)**: building-meter pairs with fewer than `THRES` meter readings are discarded;
3. **The start of the record period is truncated**: instead of truncating the time steps exceeding `THRES` from the end, I decided to truncate the start, because, as observed in the EDA, many sites have near-zero meter readings at the start of the training period, which is unlikely to generalize and hence should be discarded.

In [None]:
# Data Cleaning Threshold
THRES = 8000 #7000

In [None]:
# Define feature columns
feat_cols = [#'site_id', 'building_id', 
    'square_feet', 
    'year_built', 
    'floor_count', 
    'meter', 
    'air_temperature',
    'cloud_coverage', 
    'dew_temperature', 
    'precip_depth_1_hr', 
    'sea_level_pressure', 
    'wind_direction', 
    'wind_speed', 
    'primary_use'
]

# The following function is adapted from Module 12 Class 1 Notebook
# Tensor: [number of building-meter pairs, number of timesteps, number of independent variables]
def prepare_train_data(df):
    feat_list = []
    target_list = []
    for bldg_id in df.building_id.unique():
        bldg = df[df.building_id == bldg_id]
        for m in bldg.meter.unique():
            # Each bldg_id-m pair is one sample (1)
            met = bldg[bldg.meter == m]
            # Keep only samples with at least THRES readings (2)
            if len(met.timestamp) >= THRES:
                # Truncating from the start (3)
                feat_list.append(np.array(np.float32(met[feat_cols][-THRES:])))
                target_list.append(np.array(np.float32(met['meter_reading'][-THRES:])))
            del met
        del bldg
    return (np.stack(feat_list), np.array(target_list))

train_x, train_y = prepare_train_data(train_full)

# Check the tensors' shapes
train_x.shape, train_y.shape
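
The effect of the three rules on the tensor shapes can be checked on a toy table. This is a minimal sketch with made-up data; `THRES_TOY` and the `toy` dataframe are hypothetical stand-ins for `THRES` and `train_full`, and only one feature column is used.

```python
import numpy as np
import pandas as pd

THRES_TOY = 4  # hypothetical, small stand-in for THRES

# Hypothetical readings: pair (0, 0) has 6 hours, (0, 1) has 3, (1, 0) has 5
toy = pd.DataFrame({
    "building_id": [0] * 6 + [0] * 3 + [1] * 5,
    "meter": [0] * 6 + [1] * 3 + [0] * 5,
    "air_temperature": np.arange(14, dtype=np.float32),
    "meter_reading": np.arange(14, dtype=np.float32) * 10,
})

feats, targets = [], []
for (bldg_id, m), grp in toy.groupby(["building_id", "meter"]):
    # Discard pairs with fewer than THRES_TOY readings
    if len(grp) >= THRES_TOY:
        # Keep the last THRES_TOY steps, i.e. truncate from the start
        feats.append(grp[["air_temperature"]].to_numpy(dtype=np.float32)[-THRES_TOY:])
        targets.append(grp["meter_reading"].to_numpy(dtype=np.float32)[-THRES_TOY:])

x = np.stack(feats)    # [samples, timesteps, features]
y = np.stack(targets)  # [samples, timesteps]
print(x.shape, y.shape)
```

The pair with only 3 readings is dropped, and the two remaining pairs are cut to their last 4 time steps, so every sample has the same length.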

### 3.2 Transforming Target Variable Space
Because we have many heterogeneous feature variables with values of different orders of magnitude, we use a normalization layer in our model architecture to transform the features to zero mean and unit standard deviation. If we also transform the target variable, projecting its values onto a scale closer to that of the normalized features, the model converges faster (I learned this the hard way...).

I chose to apply a log transformation to all target values plus one, i.e. `log(1 + y)`. This way, zero meter readings can be handled without generating negative infinity, and the transformed data have the same order of magnitude as the normalized feature variables. Moreover, unlike normalization/standardization, this transformation is self-contained, meaning we can transform the testing predictions back without relying on information from the training data.

In [None]:
train_y = np.log1p(train_y)

train_y.mean(), train_y.std()
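
A quick round trip shows why this transformation is self-contained: `np.expm1` inverts `np.log1p` without any statistics from the training data. The readings below are made-up values for illustration.

```python
import numpy as np

# Hypothetical meter readings, including a zero reading
readings = np.array([0.0, 1.0, 99.0, 1e5], dtype=np.float32)

transformed = np.log1p(readings)   # log(1 + y); a zero reading stays at 0
recovered = np.expm1(transformed)  # exp(z) - 1, the inverse transformation

print(transformed)
print(recovered)
```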

Now we can combine the training data into a single dataset. 

In [None]:
train_ds = tf.data.Dataset.from_tensor_slices((train_x, train_y))

### 3.3 Constructing the Model
I built the RNN-LSTM model with the following architecture:
1. **Normalization layer**: to transform the feature variables;
2. **LSTM layer with `return_sequences=True`**: this allows the LSTM to generate one output at each time step;
3. **Dense output layer**.

In [None]:
num_features = len(feat_cols)

# From class 12
model = tf.keras.Sequential()

norm = tf.keras.layers.experimental.preprocessing.Normalization()
norm.adapt(train_x)

# Add normalization layer
model.add(norm)

# Add RNN: LSTM layer
model.add(
    tf.keras.layers.LSTM(units=32, # units is the number of hidden states
                         input_shape = (None, num_features), # None to allow for flexible prediction length
                         dropout = 0.2, # for regularization
                         return_sequences = True) # So we get a prediction for each time step
         ) 

# Add output layer
model.add(tf.keras.layers.Dense(1)) # one unit, because we only want to predict one value per time step


### 3.4 Compiling and Training the Model
We will start with a relatively large learning rate, and monitor the losses with early stopping:

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
              loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])#, 
                       #tf.keras.metrics.MeanSquaredLogarithmicError()])
model.summary()

In [None]:
model.fit(train_ds.shuffle(50).batch(10), 
          epochs=30, 
          callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)])

## 4. Model Tuning
### 4.1 Further Training
We will lower the learning rate and continue training for more epochs. This two-step manual learning rate schedule seems to perform better than a constant learning rate: the decrease in loss slows down and plateaus towards the end of the initial training, which likely suggests that the model has learned as much as it can at that learning rate.


In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])#, 
                       #tf.keras.metrics.MeanSquaredLogarithmicError()])
model.summary()

model.fit(train_ds.shuffle(100).batch(10), 
          epochs=30, 
          callbacks=[tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)])
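
The manual two-step schedule could also be written as a single schedule function. The sketch below is an assumption-laden alternative, not what was run for the submissions: `two_step_lr` and `switch_epoch` are hypothetical names, and a fixed switch epoch differs from the manual approach, where recompiling resets the optimizer state and early stopping decides when each stage ends.

```python
def two_step_lr(epoch, lr=None, first_lr=5e-4, second_lr=1e-4, switch_epoch=30):
    """Return first_lr before switch_epoch (0-indexed) and second_lr afterwards."""
    return first_lr if epoch < switch_epoch else second_lr

# Learning rates seen at a few epochs
for epoch in (0, 29, 30, 59):
    print(epoch, two_step_lr(epoch))

# Possible use with Keras (not run here):
# callback = tf.keras.callbacks.LearningRateScheduler(two_step_lr)
# model.fit(train_ds.shuffle(100).batch(10), epochs=60, callbacks=[callback])
```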

In [None]:
# Run this cell to save the model 
model.save('rnn_model')

### 4.2 Hyperparameter Tuning and Other Adjustments
I have attempted to improve the model performance by adjusting the following elements of the model:
* Model architecture: whether to have dropout or not at the LSTM layer;
* `THRES` value: a higher threshold means fewer samples but more training time steps, and vice versa;
* Hyperparameters: such as learning rate, number of epochs and number of samples to shuffle.

The following table summarizes the changes I made for three of the submissions, as well as the scores. Note that the score is the Root Mean Squared Logarithmic Error (RMSLE), as defined by the competition.

| Submission | Model Architecture  | `THRES` | Learning Rate                                     | Shuffle, Batch | EarlyStopping | Scores (Training, Testing) |
|------------|---------------------|---------|---------------------------------------------------|----------------|---------------|----------------------------|
| 1          | LSTM w/o dropout    | 7,000   | 1e-3 for 14/15 epochs, then 1e-4 for 10/20 epochs | 20, 10         | patience=3    | 1.696, 1.708               |
| 2          | LSTM w/o dropout    | 8,000   | 5e-4 for 25/25 epochs, then 1e-4 for 25/25 epochs | 50, 10         | patience=3    | 1.696, 1.681               |
| 3          | LSTM w/ dropout=0.2 | 8,000   | 5e-4 for 30/30 epochs, then 1e-4 for 30/30 epochs | 100, 10        | patience=3    | 1.651, 1.623               |
<br>

Changes from 1 to 2 mainly test the effect of `THRES`, and changes from 2 to 3 test the effect of dropout. Learning rate schedules and other settings were also adjusted based on observations from other, unsubmitted tries. 
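
For reference, the RMSLE used for the scores is $\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$, where $p_i$ is the predicted and $a_i$ the actual meter reading. Because the model is trained on `log1p`-transformed targets, the RMSE it reports in that space matches the RMSLE of the back-transformed predictions on the same samples. The sketch below checks this on made-up numbers; `rmsle`, `actual` and `log_pred` are hypothetical names and values.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root Mean Squared Logarithmic Error between two non-negative arrays."""
    actual = np.asarray(actual, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    return float(np.sqrt(np.mean((np.log1p(predicted) - np.log1p(actual)) ** 2)))

# Hypothetical readings and hypothetical model outputs in log1p space
actual = np.array([0.0, 10.0, 100.0, 1000.0])
log_pred = np.array([0.1, 2.5, 4.5, 7.0])

# RMSE measured in log1p space, as during training
rmse_log_space = float(np.sqrt(np.mean((log_pred - np.log1p(actual)) ** 2)))

# RMSLE measured after mapping the predictions back to the original space
score = rmsle(actual, np.expm1(log_pred))

print(rmse_log_space, score)
```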

## 5. Model Prediction and Submission
We start by releasing the RAM held by the training data, so that we have enough memory to store and process the testing data.

In [None]:
del train_full, train_x, train_y, train_ds
gc.collect()

Now we are ready to load and preprocess the testing data, following how we treated the training data.

In [None]:
# Load data
test = pd.read_csv('/kaggle/input/ashrae-energy-prediction/test.csv')
weather_test = pd.read_csv('/kaggle/input/ashrae-energy-prediction/weather_test.csv')

In [None]:
# Fill in the missing hours in weather_test to match test
weather_test = add_missing_hours(weather_test)

# Convert timestamp in both dataframes into datetime format
test, weather_test = convert_to_datetime([test, weather_test])

# Fill in missing weather records
weather_test = fill_weather_nan(weather_test)


The following two cells merge `bldg_meta`, `test` and `weather_test` into one dataframe (splitting into two cells due to memory limitation).

In [None]:
test_full = bldg_meta.merge(test, on='building_id')
del bldg_meta
gc.collect()

In [None]:
test_full = test_full.merge(weather_test, on=['site_id', 'timestamp'])
del weather_test

assert test_full.shape[0] == test.shape[0]
print(test_full.shape)

gc.collect()

In [None]:
# Reduce RAM usage
test_full, _ = reduce_mem_usage(test_full)

# Encode primary_use column
test_full["primary_use"] = le.transform(test_full["primary_use"])

gc.collect()

We now have data ready to feed into the model. The first cell below is only used when the session is forced to restart, to load the saved model back into memory.

In [None]:
# model = tf.keras.models.load_model('rnn_model')

# feat_cols = [#'site_id', 'building_id', 
#     'square_feet', 
#     'year_built', 
#     'floor_count', 
#     'meter', 
#     'air_temperature',
#     'cloud_coverage', 
#     'dew_temperature', 
#     'precip_depth_1_hr', 
#     'sea_level_pressure', 
#     'wind_direction', 
#     'wind_speed', 
#     'primary_use'
# ]

Since we constructed the training dataset by separating samples based on `building_id` and meter type, we make predictions the same way, looping through each building and each of its meters. We cannot compile the testing data into a single array because 1) it causes too much memory overhead; and 2) each building may have a different number of time steps to predict.

The prediction results are first converted back into the original data space (by taking the exponential and subtracting 1, the inverse of the log transformation), then stored in the corresponding rows of the newly added `meter_reading` column in the original `test` dataframe. We use the original `test` dataframe because we need to match predictions with `row_id`, as required by the submission.

In [None]:
test['meter_reading'] = np.zeros(test.shape[0], dtype=np.float32)

for bldg_id in test_full.building_id.unique():
    bldg = test_full[test_full.building_id==bldg_id]
    print(str(bldg_id)+', ', end='')
    for m in bldg.meter.unique():
        met = bldg[bldg.meter==m]
        # adding a dim=1 at axis=0 to match the input layer shape
        ts = np.expand_dims(met[feat_cols].values, axis=0) 
        del met
        v = np.float32(np.expm1(model.predict(ts).squeeze()))
        del ts
        test.loc[(test.building_id==bldg_id)&(test.meter==m), 'meter_reading'] = v
        del v
    del bldg
    gc.collect()

test
# -- takes ~35 mins to run.

In [None]:
del test_full
gc.collect()

To prepare the final submission file, we first load the sample submission, keeping only `row_id` to conserve RAM. Then we will merge `test` and `sample_sub` into one dataframe, matching `row_id`. The merged dataframe is saved as a .csv file and submitted to the competition page for testing score evaluation.

In [None]:
sample_sub = pd.read_csv('/kaggle/input/ashrae-energy-prediction/sample_submission.csv',
                         usecols=['row_id'])
print(sample_sub.shape)
sample_sub

In [None]:
sample_sub = sample_sub.merge(test[['row_id', 'meter_reading']], on='row_id')
sample_sub

In [None]:
sample_sub.to_csv('submission.csv', index=False)

## 6. Discussions
In this notebook we explored the use of RNN-LSTM for time-series prediction on the ASHRAE building energy use dataset. The simple LSTM model is fairly effective, largely outperforming the baseline linear regression model (by Benjamin Smakic in our group) but slightly underperforming the LightGBM model (by Zhiyi Yang in our group). The model tuning I attempted improved the performance marginally but steadily. 

Other tuning opportunities I would explore if I had more time include: excluding some correlated features; increasing model complexity by adding one or more layers; and other ways of handling missing data. 

The most significant challenge has been coping with the limited memory resources. A significant amount of time was spent on optimizing memory usage, which was nonetheless worthwhile: it has potentially also improved the speed of training and prediction, and it builds a foundation for dealing with larger data and more complex problems and models in the future. Another challenge was the data preprocessing. The trade-off between the number of samples and the number of timesteps can be viewed as a shortcoming of RNN-LSTM (or rather of my way of handling it), because it forced me to leave part of the information in the raw data out of training. Nevertheless, given the incomplete training data, the performance is quite satisfactory, and I am happy to consider my first time building and training an RNN model a success.

