# Building energy consumption prediction using linear regression
## The main goal of this Notebook is to predict the energy consumption of different buildings using linear regression. 

### First, all packages are imported

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import tensorflow as tf #AI models

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### The following function is used to reduce the RAM usage of the notebook. Initially it was discovered that the notebook would run out of RAM and reset. This function in combination with deleting unwanted variables, choosing the right data to load and adding/combining data carefully solved the issue of RAM usage. The reduce_mem_usage function is taken from this source: https://www.kaggle.com/gemartin/load-data-reduce-memory-usage 

In [None]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        #else:
            #df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
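
To see the effect of downcasting on a small scale, the following sketch builds a toy DataFrame and compares its memory footprint before and after casting. The column values are made up for illustration, and plain `astype` calls stand in for `reduce_mem_usage` so that the example stays self-contained.

```python
import numpy as np
import pandas as pd

# Toy data; the values are illustrative only.
toy = pd.DataFrame({
    "building_id": np.arange(1000, dtype=np.int64),
    "meter": np.tile(np.array([0, 1, 2, 3], dtype=np.int64), 250),
})

# Memory in bytes, ignoring the index.
before = toy.memory_usage(index=False).sum()

# Cast each column to the smallest integer type that can hold its range.
small = toy.astype({"building_id": np.int16, "meter": np.int8})
after = small.memory_usage(index=False).sum()

print(before, after)
```

The same idea applies to float columns, with the caveat that float16 stores far fewer significant digits than float64, so downcast float values may be rounded.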





### Next, all files needed to train the model are loaded. train_data contains the meter_readings, which we want to predict. It also contains the different meter types:
* 0: electricity
* 1: chilled water
* 2: steam
* 3: hot water

### The file building_metadata contains information about each site and all the buildings in each site. Here only the columns that are needed to merge the file with train_data, as well as columns that are believed to have predictive power, are loaded. "primary_use", "square_feet" and "year_built" are believed to have predictive power. Buildings with different primary uses, such as education, office or public service, most likely have different energy consumption patterns. As I discovered in the EDA, most buildings are used for educational purposes (see figure below).

### The weather train data contains the weather conditions at the time of every meter reading. Just like before, only columns thought to have predictive power are loaded to save RAM.

### Please note that I have tried to decrease the RAM usage as much as possible, but I still had to exclude some variables in order to not restart the notebook before results could be obtained.

In [None]:
train_data = reduce_mem_usage(pd.read_csv("../input/ashrae-energy-prediction/train.csv"))
building_metadata = reduce_mem_usage(pd.read_csv("../input/ashrae-energy-prediction/building_metadata.csv", usecols=["site_id", "building_id", "primary_use", "square_feet", "year_built"]))
weather_train_data = reduce_mem_usage(pd.read_csv("../input/ashrae-energy-prediction/weather_train.csv", usecols= ["site_id", "timestamp", "air_temperature"]))



#CHANGE THIS CELL


In [None]:
building_metadata["primary_use"].value_counts().plot.bar(figsize = (14,5), xlabel = "Primary use", ylabel = "Number of buildings", fontsize = 10, rot = 90, title = "Number of buildings per primary use")


### The next step is to merge the three files into one. The "meter" values (0, 1, 2, 3) are changed to string type so that they can be one-hot encoded and used as a categorical feature. Note that there is a lot of code which is not in use. This is code that I have experimented with and want to keep in case I figure out new and better ways of creating my model.

In [None]:
#Merging building_metadata and weather_train in 2 steps
merged_train = train_data.merge(building_metadata, left_on = "building_id", right_on = "building_id", how = "left")
merged_train = merged_train.merge(weather_train_data, left_on = ["site_id", "timestamp"], right_on = ["site_id", "timestamp"], how = "left")

merged_train["meter"] = merged_train["meter"].astype(str)
#merged_train["meter"].replace({0: 1, 1: 2, 2: 3, 3: 4}, inplace=True)
#merged_train.head()
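
The two-step left merge can be illustrated on toy frames. This is a minimal sketch with made-up values; only the column names follow the competition files. It shows why the building metadata has to be merged first (it supplies `site_id`) and what happens when a reading has no matching weather row.

```python
import pandas as pd

# Toy frames; all values are illustrative only.
readings = pd.DataFrame({
    "building_id": [0, 0, 1],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00", "2016-01-01 00:00:00"],
    "meter_reading": [10.0, 12.0, 5.0],
})
buildings = pd.DataFrame({
    "building_id": [0, 1],
    "site_id": [0, 1],
    "square_feet": [1000, 2000],
})
weather = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "air_temperature": [3.5, 4.0],
})

# Step 1: attach building attributes (this adds site_id, needed for step 2).
merged = readings.merge(buildings, on="building_id", how="left")
# Step 2: attach weather by site and timestamp.
merged = merged.merge(weather, on=["site_id", "timestamp"], how="left")

# A left merge keeps every reading; missing weather shows up as NaN.
print(merged)
```

Because both merges use `how="left"`, the number of rows stays equal to the number of meter readings, and gaps in the weather data become NaN values that have to be handled later.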




### In my exploratory data analysis, I discovered that some measurements were suspiciously high. After analyzing each meter type, I found that meter 2 (steam) was responsible for the unusually high values. After this discovery, each site was analyzed, and I found that site 13 was the anomaly. The first graph shows the mean hourly steam readings for site 13. The second graph shows the mean hourly readings for all meters and all sites. The readings in the first graph are clearly much larger, and the shape of the second graph is dictated by site 13.

In [None]:
suspicious_data = merged_train.loc[(merged_train["site_id"]==13) & (merged_train["meter"]=="2")]
# Select the numeric column before averaging so that string columns are not passed to mean()
suspicious_data.groupby(by = "timestamp")[["meter_reading"]].mean().plot(figsize =(15,7), ylabel = "mean meter readings", title = "mean hourly steam readings, site 13 (index 2)")

In [None]:
merged_train.groupby(by = "timestamp")[["meter_reading"]].mean().plot(figsize =(15,7), ylabel = "mean meter readings", title = "mean hourly meter readings (all meters)")

### Now, the suspicious site_id is removed, as it is believed that the model will perform better overall.

In [None]:
# Remove all rows belonging to site 13
merged_train = merged_train.loc[merged_train["site_id"]!=13]
merged_train.groupby(by = "timestamp")[["meter_reading"]].mean().plot(figsize =(15,7), ylabel = "mean hourly meter readings", title = "mean hourly meter readings (all meters)")


#divide the timestamp column in to 4 new columns
#merged_train["timestamp"] = (merged_train["timestamp"].dt.month.astype(str) + merged_train["timestamp"].dt.weekday.astype(str) + merged_train["timestamp"].dt.day.astype(str) + merged_train["timestamp"].dt.hour.astype(str))
#merged_train["week"] = merged_train["timestamp"].dt.weekday.astype(str)
#merged_train["day"] = merged_train["timestamp"].dt.day.astype(str)
#merged_train["hour"] = merged_train["timestamp"].dt.hour.astype(str)

#merged_train.drop(["timestamp"], axis =1, inplace=True)
#merged_test["month"] = merged_test["timestamp"].dt.month
#merged_test["weekday"] = merged_test["timestamp"].dt.weekday
#merged_test["day"] = merged_test["timestamp"].dt.day
#merged_test["hour"] = merged_test["timestamp"].dt.hour

### Now it can be seen that the meter readings are not affected by site 13

### Next, the column with all the timestamps is changed to the datetime datatype, so that the timestamps can be split into month, weekday and hour. Energy consumption can vary depending on the month (empty office buildings during vacation times, empty educational buildings between semesters etc.). It can also vary depending on the weekday, e.g. office/education buildings that are empty on weekends. Finally, energy consumption is most likely lower during the night.

In [None]:
print(merged_train["timestamp"].dtype)
merged_train["timestamp"] = pd.to_datetime(merged_train["timestamp"])
print(merged_train["timestamp"].dtype)

merged_train["month"] = merged_train["timestamp"].dt.month
merged_train["hour"] = merged_train["timestamp"].dt.hour
merged_train["weekday"] = merged_train["timestamp"].dt.weekday
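
As a small illustration of the `.dt` accessors used above, the sketch below extracts the same three features from two hand-picked timestamps (the dates are arbitrary examples). Note that `weekday` counts from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Two arbitrary example timestamps.
ts = pd.to_datetime(pd.Series(["2016-01-01 00:00:00", "2016-07-04 15:00:00"]))

features = pd.DataFrame({
    "month": ts.dt.month,      # 1..12
    "hour": ts.dt.hour,        # 0..23
    "weekday": ts.dt.weekday,  # Monday = 0, Sunday = 6
})

print(features)
```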

### The following functions are taken from class, and are used to convert the dataframe to a tensor dataset.

Source: https://www.kaggle.com/christophertessum/module-9-class-2-airplanes

In [None]:
# Convert a Pandas series to a standardized float32 array.
# Numeric columns are mean-imputed and z-scored; object columns are returned unchanged.
def convert_to_tensor(s):
    dt = s.dtype
    if dt == "float64" or dt == "int64" or dt == "float32" or dt == "float16" or dt == "int32" or dt == "int16" or dt == "int8":
        a = np.asarray(s).astype("float32")
        # Replace NaNs with the mean of the observed values
        a = np.nan_to_num(a, nan=a[~np.isnan(a)].mean())
        return (a - a.mean()) / a.std()
    elif dt == "object":
        return s
    # Any other dtype (e.g. datetime) is skipped
    return None

# A utility method to create a tf.data dataset from a Pandas Dataframe
# Adapted from https://www.tensorflow.org/tutorials/structured_data/feature_columns
def df_to_dataset(dataframe, target_name):
    data_dict = {}
    for col in dataframe.columns:
        t = convert_to_tensor(dataframe[col])
        if col == target_name:
            # Note: the target column is standardized like every other numeric column
            labels = t
        elif t is not None:
            data_dict[col] = t
    ds = tf.data.Dataset.from_tensor_slices((data_dict, labels))
    return ds
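
The numeric branch of `convert_to_tensor` does two things: it fills missing values with the column mean and then standardizes the column to zero mean and unit standard deviation. A minimal sketch of those two steps on a three-element array (the values are made up for illustration):

```python
import numpy as np

# Toy column with one missing value.
values = np.array([1.0, np.nan, 3.0], dtype="float32")

# Step 1: replace NaN with the mean of the observed values (here 2.0).
filled = np.nan_to_num(values, nan=values[~np.isnan(values)].mean())

# Step 2: z-score, i.e. subtract the mean and divide by the standard deviation.
z = (filled - filled.mean()) / filled.std()

print(filled, z)
```

Since `df_to_dataset` applies the same function to the target column, the labels the model sees are standardized as well, which matters when interpreting the predictions.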

### In the following cell, the different columns are prepared to be used in the model. Initially, the following columns were used:
* meter
* weekday
* meter_reading (the value we want to predict)
* square_feet
* year_built
* air_temperature
* hour
* month

### These columns worked during model training, but predicting new values with model.predict(*test data*) did not work when categorical columns were included. While trying to predict with categorical columns, I found out the hard way that the maximum runtime for a Kaggle notebook is 9 hours.

### After this setback, I tried to use only numerical values. 

In [None]:

#Preparing the data
#df_subset = merged_train[["meter", "weekday", "meter_reading", "square_feet", "year_built", "air_temperature", "hour", "month"]]
df_subset = merged_train[["air_temperature", "meter_reading", "square_feet", "year_built", "weekday", "hour", "month"]]
data = df_to_dataset(df_subset, "meter_reading")

#Here the categorical data is one-hot encoded

#primary_use = tf.feature_column.categorical_column_with_vocabulary_list(key = "primary_use", vocabulary_list = df_subset["primary_use"].unique())
#primary_use = tf.feature_column.indicator_column(primary_use)

#meter = tf.feature_column.categorical_column_with_vocabulary_list("meter", df_subset["meter"].unique())
#meter = tf.feature_column.indicator_column(meter)

#timestamp = tf.feature_column.categorical_column_with_vocabulary_list("timestamp", df_subset["timestamp"].unique())
#timestamp = tf.feature_column.indicator_column(timestamp)

#Here numerical columns are added
meter = tf.feature_column.numeric_column("meter")
weekday = tf.feature_column.numeric_column("weekday")
hour = tf.feature_column.numeric_column("hour")
square_feet = tf.feature_column.numeric_column("square_feet")
year_built =tf.feature_column.numeric_column("year_built")
air_temperature = tf.feature_column.numeric_column("air_temperature")
month = tf.feature_column.numeric_column("month")

feature_layer = tf.keras.layers.DenseFeatures([weekday, square_feet, year_built, air_temperature, hour, month])
#feature_layer = tf.keras.layers.DenseFeatures([square_feet, year_built, air_temperature])

### This function for creating a linear regression model is taken from class. I tried to add a kernel regularizer, but it did not seem to make a difference.

Source: https://www.kaggle.com/christophertessum/module-9-class-2-airplanes

In [None]:


def create_model(learning_rate, feature_layer):
    # Sequential model
    model = tf.keras.models.Sequential()

    # Here the feature_layer is added
    model.add(feature_layer)
     
    # Here another layer is added to create a linear regression model
    model.add(tf.keras.layers.Dense(units=1))    
    
    #Optional L2 kernel regularizer
    #model.add(tf.keras.layers.Dense(1, kernel_regularizer=tf.keras.regularizers.L2(0.2)))

    # Construct the layers into a model that TensorFlow can execute.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                loss="mean_squared_error",
                metrics=[tf.keras.metrics.RootMeanSquaredError()])
    return model
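
For reference, the model built by `create_model` (a single `Dense(units=1)` layer with Keras' default linear activation on top of the feature layer) corresponds to ordinary linear regression fitted by minimizing the mean squared error. The symbols below are introduced here for illustration: $x_i$ is the feature vector of row $i$, $y_i$ its target, $w$ and $b$ the layer's weights and bias, and $n$ the number of rows.

```latex
\hat{y}_i = w^\top x_i + b, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

The loss passed to `model.compile` is the MSE, while the reported metric is the RMSE, which is in the same units as the (standardized) target.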



In [None]:
# Hyperparameters
learning_rate = 0.001
epochs = 10
batch_size = 10000

model = create_model(learning_rate, feature_layer)
model.fit(data.batch(batch_size),
                  epochs=epochs, shuffle=True)

# The list of epochs is stored separately from the rest of history.
#epochs = history.epoch

# Isolate the root mean squared error for each epoch.
#hist = pd.DataFrame(history.history)
#rmse = hist["root_mean_squared_error"]

### The following cell deletes unnecessary variables to free up RAM. However, depending on the size of the feature layer used in the model, this might not be necessary.

In [None]:
#Delete unnecessary data
import gc

del feature_layer
del data
del meter
del weekday
#del primary_use
del square_feet
del year_built
del air_temperature

gc.collect()

### The following cell repeats the previous steps, but with the test datasets. The names have been kept to make it easier to copy and paste, since I changed many things to try and get a better score.

In [None]:
train_data = reduce_mem_usage(pd.read_csv("../input/ashrae-energy-prediction/test.csv"))
building_metadata = reduce_mem_usage(pd.read_csv("../input/ashrae-energy-prediction/building_metadata.csv", usecols=["site_id", "building_id", "primary_use", "square_feet", "year_built"]))
weather_train_data = reduce_mem_usage(pd.read_csv("../input/ashrae-energy-prediction/weather_test.csv", usecols= ["site_id", "timestamp", "air_temperature"]))

#Merging building_metadata and weather_test in 2 steps
merged_train = train_data.merge(building_metadata, left_on = "building_id", right_on = "building_id", how = "left")
merged_train = merged_train.merge(weather_train_data, left_on = ["site_id", "timestamp"], right_on = ["site_id", "timestamp"], how = "left")

merged_train["meter"] = merged_train["meter"].replace({0: 1, 1: 2, 2: 3, 3: 4})
#merged_train["meter"] = merged_train["meter"].astype(str)
#merged_train.head()


#merged_train = merged_train.loc[(merged_train["site_id"]!=13) & (merged_train["meter"]!="2")]
#merged_train.groupby(by = "timestamp").mean().filter(["timestamp", "meter_reading"]).plot(figsize =(15,7), ylabel = "mean meter readings", title = "mean hourly chilled water readings (index 1)")
#print(merged_train["timestamp"].dtype)
merged_train["timestamp"] = pd.to_datetime(merged_train["timestamp"])
#print(merged_train["timestamp"].dtype)

merged_train["month"] = merged_train["timestamp"].dt.month
merged_train["hour"] = merged_train["timestamp"].dt.hour
merged_train["weekday"] = merged_train["timestamp"].dt.weekday


#divide the timestamp column in to 4 new columns
#merged_train["timestamp"] = (merged_train["timestamp"].dt.month.astype(str) + merged_train["timestamp"].dt.weekday.astype(str) + merged_train["timestamp"].dt.day.astype(str) + merged_train["timestamp"].dt.hour.astype(str))
#merged_train["week"] = merged_train["timestamp"].dt.weekday.astype(str)
#merged_train["day"] = merged_train["timestamp"].dt.day.astype(str)
#merged_train["hour"] = merged_train["timestamp"].dt.hour.astype(str)

#merged_train.drop(["timestamp"], axis =1, inplace=True)
#merged_test["month"] = merged_test["timestamp"].dt.month
#merged_test["weekday"] = merged_test["timestamp"].dt.weekday
#merged_test["day"] = merged_test["timestamp"].dt.day
#merged_test["hour"] = merged_test["timestamp"].dt.hour


del train_data
del building_metadata
del weather_train_data
gc.collect()

### Option to save the model

In [None]:
#model.save("./")

### The final steps to make a Kaggle submission. Like I stated previously, I could not figure out how to include categorical values in my model.predict function. 

In [None]:
#merged_train["row_id"] =merged_train.index
ids = merged_train["row_id"]
#df_subset = merged_train[["meter", "timestamp", "row_id", "primary_use", "square_feet", "year_built", "air_temperature"]]
#data = df_to_dataset(df_subset, "row_id")
#[["meter", "timestamp", "meter_reading", "primary_use", "square_feet", "year_built", "air_temperature", "hour"]]
merged_train["meter"] = convert_to_tensor(merged_train["meter"])
merged_train["weekday"] = convert_to_tensor(merged_train["weekday"])
#merged_train["primary_use"] = convert_to_tensor(merged_train["primary_use"])
merged_train["square_feet"] = convert_to_tensor(merged_train["square_feet"])
merged_train["year_built"] = convert_to_tensor(merged_train["year_built"])
merged_train["air_temperature"] = convert_to_tensor(merged_train["air_temperature"])
merged_train["hour"] = convert_to_tensor(merged_train["hour"])
merged_train["month"] = convert_to_tensor(merged_train["month"])

In [None]:
predicted_readings = model.predict({"weekday": merged_train.weekday, "square_feet":merged_train.square_feet, "year_built": merged_train.year_built,"air_temperature":merged_train.air_temperature, "hour":merged_train.hour, "month": merged_train.month})
#predicted_readings = model.predict({"air_temperature":merged_train.air_temperature, "square_feet": merged_train.square_feet, "year_built":merged_train.year_built})


In [None]:
# Flatten the (n, 1) prediction array and pair it with the row ids
predict = pd.DataFrame({"row_id": ids, "meter_reading": predicted_readings.ravel()})

predict.head()
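
Because `df_to_dataset` passes the target column through `convert_to_tensor`, the model is fitted to standardized meter readings, so its outputs are in standardized units rather than in the original units. A minimal sketch of mapping such outputs back to the original scale is shown below. It assumes the mean and standard deviation of the training target were stored before the training frame was overwritten; `train_mean`, `train_std` and the prediction values are hypothetical and made up for illustration.

```python
import numpy as np

# Hypothetical statistics of the training target (made-up values).
train_mean = 100.0
train_std = 50.0

# Hypothetical model output of shape (n, 1), in standardized units.
standardized_predictions = np.array([[-1.0], [0.0], [2.0]])

# Invert the z-score: multiply by the standard deviation, then add the mean.
meter_reading = standardized_predictions.ravel() * train_std + train_mean

print(meter_reading)
```

Whether this step improves the score in this notebook has not been tested here; it is only meant to show how the standardization could be undone.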

### Here the submission file is created, which I then submit to the competition manually

In [None]:
predict.to_csv('submission.csv', index=False)

### Final conclusion: The score I obtain is approximately 4.3 no matter what changes I make. Is this the limit for linear regression? Most likely not. Several things in this notebook can be improved. First, the datasets can be cleaned more thoroughly. I removed site_id 13 because its values were suspiciously high. A more in-depth cleaning could locate the exact building_id (or several building_ids) responsible for the data anomaly.
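
One possible way to narrow the anomaly down to individual buildings is to compare the mean reading per `building_id` within the suspicious site. The following is a sketch on made-up numbers, not a result from the actual data:

```python
import pandas as pd

# Toy steam readings for one site; all values are illustrative only.
site13 = pd.DataFrame({
    "building_id": [1, 1, 2, 2, 3, 3],
    "meter_reading": [10.0, 12.0, 9000.0, 11000.0, 8.0, 9.0],
})

# Mean reading per building, largest first.
per_building = (
    site13.groupby("building_id")["meter_reading"]
    .mean()
    .sort_values(ascending=False)
)

# The building at the top of the ranking is the first candidate to inspect.
suspect = per_building.index[0]
print(per_building)
```

Only the buildings at the top of such a ranking would then need to be removed, instead of the whole site.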

### Furthermore, if I can manage to make predictions with a model that includes categorical values, perhaps the score can be improved.
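
One approach that might help with categorical values, sketched here as an assumption rather than something tested in this notebook, is to one-hot encode the column in pandas with a fixed list of categories. The training and test frames then get identical numeric columns, which can be fed to the model like any other numeric feature. The category names and the helper `one_hot` are illustrative only.

```python
import pandas as pd

# Fixed category list shared by train and test (illustrative values).
categories = ["Education", "Office", "Lodging/residential"]

train = pd.DataFrame({"primary_use": ["Education", "Office", "Education"]})
test = pd.DataFrame({"primary_use": ["Office", "Lodging/residential"]})

def one_hot(df):
    # Casting to a categorical dtype with fixed categories guarantees
    # the same output columns even if a category is absent from df.
    col = df["primary_use"].astype(pd.CategoricalDtype(categories))
    return pd.get_dummies(col, prefix="primary_use", dtype="float32")

train_encoded = one_hot(train)
test_encoded = one_hot(test)
print(list(test_encoded.columns))
```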
