<font color=black size=5 face=Arial><center>**Predict Household Electricity Consumption with Machine Learning**</center></font>

<font color=black size=3 face=Arial><center>Joy(Ruoqian) Wang</center></font>
<font color=black size=3 face=Arial><center>July 2022</center></font>

# Introduction

<font color=black size=3 face=Arial>
    Electricity consumption is one of the essential topics in energy systems. It is critical for short-term resource allocation and for long-term planning of new generation capacity. It also serves as a good indicator of technology development and lifestyle changes. </font>

<font color=black size=3 face=Arial>In this project, yearly household electricity consumption is predicted from the Residential Energy Consumption Survey (RECS) conducted in 2009 in the US.  </font>
<font color=black size=3 face=Arial>The dataset can be found at https://www.eia.gov/consumption/residential/data/2020/index.php?view=microdata </font>

<font color=black size=3 face=Arial>The objective of this project is to use characteristics related to energy and electricity usage, such as housing unit attributes, usage patterns, and household demographics, to build a model that helps us understand current consumption and project future trends, and consequently make better decisions about cost and energy efficiency. </font>

<font color=black size=3 face=Arial>Specifically, this goal is pursued through the following steps:</font>  
     <font color=black size=3 face=Arial>1. Get the data ready through data engineering </font>  
     <font color=black size=3 face=Arial>2. Reduce dimensionality of data</font>  
     <font color=black size=3 face=Arial>3. Further selection and engineering of important independent features for electricity consumption modelling</font>   
     <font color=black size=3 face=Arial>4. Model selection : Artificial Neural Networks (ANN) for Regression and fitting </font>  
     <font color=black size=3 face=Arial>5. Model performance evaluation </font>  
     <font color=black size=3 face=Arial>6. Conclusion and discussion </font>

# Data Engineering

## 1. Preparation and data reading

<font color=black size=3 face=Arial> This section also involved reviewing the supporting files provided by RECS: the "Layout file", which contains descriptive labels and formats for each data variable, and the "Variable and response codebook", which contains descriptive labels for variables. </font>

<font color=black size=3 face=Arial> There are 12083 records and 940 columns/variables. A fair number of columns contain no practical information: they are imputation flags that indicate whether missing values for a characteristic were replaced with substituted values. Also, according to the codebook, most columns use '-2' to indicate 'Not Applicable'. </font>

In [None]:
# Load Modules
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load Data
df_2009raw = pd.read_csv('/Joy Wang/Data Science Projects/electricity consumption predict/recs2009_public.csv')
df_layout = pd.read_csv('/Joy Wang/Data Science Projects/electricity consumption predict/public_layout.csv')

In [None]:
# Read Data
df_2009raw.info()
df_2009raw.describe()
df_2009raw.head()

In [None]:
# Extract the column names as a list and as a Series (the Series is used later for string filtering)
list_columns = df_2009raw.columns.tolist()
series_columns = df_2009raw.columns.to_series()
list_columns

## 2. Data cleaning

<font color=black size=3 face=Arial> At the beginning of data cleaning, I went through the NULL values, the N/A values, and the imputation indicators. This dataset does not have any NULL values, but it contains a significant number of '-2' entries, which may or may not stand for NULL, N/A, or some other specific information. To simplify the process given the time constraints, I assumed that every '-2' is an unusable value; keeping that many N/A entries would mislead the model. The imputation flag columns are likewise treated as having no bearing on a household's electricity usage. Therefore, I remove those values and columns from the data.</font>

In [None]:
# Extract column names of the imputation indicators
imputation_flags = series_columns[series_columns.str.startswith('Z')]
print(imputation_flags.index.tolist())

In [None]:
# Checking if there is any null values
df_2009raw.isnull().sum().sum()

<font color=black size=3 face=Arial> If more than 50% of the 12083 records in a particular column are N/A values, then it is identified as a variable that would not contribute to the prediction. </font>

In [None]:
# Count the N/A values ('-2') in each column
Counting_NA = (df_2009raw == -2).sum()

# Find variables where more than half of the records are 'Not Applicable' -- not valuable
NA_morethanhalf = Counting_NA[Counting_NA > len(df_2009raw) / 2].sort_values(ascending=False)
print(NA_morethanhalf)

In [None]:
# Drop columns/variables that would not be valuable for predicting electricity consumption
df_2009clean = df_2009raw.drop(labels=NA_morethanhalf.index.union(imputation_flags.index),axis=1)
print(df_2009clean)

In [None]:
# Check out this cleaned data
df_2009clean.info()
df_2009clean.describe()
df_2009clean.head(5)

## 3. Reduce dimensionality and further data processing

In [None]:
df_layout_clean = df_layout.loc[df_layout['Variable Name'].isin(df_2009clean.columns.to_series())]
print(df_layout_clean)
df_layout_clean.to_csv('/Joy Wang/Data Science Projects/electricity consumption predict/public_layout_clean.csv')

<font color=black size=3 face=Arial> After removing all the non-significant variables, there are 370 variables left in this dataset. It is now easier to review them alongside the layout/description file in Excel, but the number of dimensions still needs to be reduced to something more practical. </font>

<font color=black size=3 face=Arial> Because of the time constraints, I reviewed the 370 variables with their descriptions in Excel and picked 15 variables as features of interest, taking into account demographics/household, weather/climate, and lifestyle/usage patterns. </font>

In [None]:
# Reduce to the target (KWH) plus the 15 selected feature variables
df_2009reduced = df_2009clean[
    ['KWH',
    'BEDROOMS', 'MONEYPY', 'NHSLDMEM', 'TYPEHUQ', 'TOTSQFT', 'TOTHSQFT', 'TOTCSQFT',
    'HDD30YR', 'CDD30YR','TEMPNITE',
    'TOTALBTUSPH', 'TOTALBTUCOL', 'TOTALBTUWTH', 'TOTALBTURFG','TOTALBTUOTH']
    ]
print(df_2009reduced)  
df_2009reduced.to_csv('/Joy Wang/Data Science Projects/electricity consumption predict/recs2009_public_reduced15.csv')

In [None]:
# Check out the variable-reduced data
df_2009reduced.info()
df_2009reduced.head(5)
df_2009reduced.describe()

<font color=black size=3 face=Arial>There are still a few 'Not Applicable'/'-2' records in the dimension-reduced dataset, for example in BEDROOMS and TEMPNITE. There are not many of them, but they would not contribute to the model, so rows with N/A values are removed.</font>

In [None]:
# Remove rows that contain an N/A value ('-2') in any column
has_na = (df_2009reduced == -2).any(axis=1)
df_2009rclean = df_2009reduced[~has_na].copy()
print(df_2009rclean)

In [None]:
# 11460 records 
df_2009rclean.describe()
df_2009rclean.head()

<font color=black size=3 face=Arial> Additionally, I need to make sure only numerical variables go into the model, since most machine learning algorithms require numerical inputs; any categorical attributes in the dataset should therefore be encoded as numerical labels before training. Checking the re-cleaned data shows that, luckily, every column is already in numerical format, so this dataframe is good to go. </font>  

<font color=black size=3 face=Arial> It’s always better to rename the columns and format them to the most readable format which can be understood easily in data interpretation. </font>

In [None]:
# Rename column -- more Readable
colNameDict = {'KWH':'PowerUsage','BEDROOMS':'NumOfBedrooms','MONEYPY':'HouseholdIcome', 
               'NHSLDMEM':'NumOfResidents', 'TYPEHUQ':'TypeOfUnit', 'TOTSQFT':'TotalSQFT', 
               'TOTHSQFT':'HeatSQFT', 'TOTCSQFT':'ColdSQFT','HDD30YR':'Heat30Y', 'CDD30YR':'Cold30Y',
               'TEMPNITE':'NightTemp','TOTALBTUSPH':'HeatUsage', 'TOTALBTUCOL':'ACusage', 
               'TOTALBTUWTH':'WaterheatUsage', 'TOTALBTURFG':'RefriUsage','TOTALBTUOTH':'OthersUsage'}   
df_2009rclean.rename(columns = colNameDict,inplace=True)

# Feature Importance and engineering

<font color=black size=3 face=Arial> Feature Importance refers to techniques that calculate a score for all the input variables of a given model; the scores represent the influence or importance of each variable. A higher score means that the specific variable has a larger effect on the model. This section analyzes correlation coefficients and random forest feature importances, supported by exploratory data visualization. </font>

## 1. Pearson's correlation coefficient

<font color=black size=3 face=Arial> Correlation coefficients are used to measure how strong a relationship is between two variables. There are several types of correlation coefficient, but the most popular is Pearson’s. It is a correlation coefficient commonly used in linear regression. </font>  

<font color=black size=3 face=Arial> Correlation coefficient formulas return a value between -1 and 1, where: </font>  
<font color=black size=3 face=Arial> 1 indicates a strong positive relationship. </font>  
<font color=black size=3 face=Arial>-1 indicates a strong negative relationship. </font>  
<font color=black size=3 face=Arial>A result of zero indicates no relationship at all. </font>  
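
<font color=black size=3 face=Arial> As a rough illustration of the definition, Pearson's coefficient can be computed by hand as the covariance of two variables divided by the product of their standard deviations. The sketch below uses only the standard library; the helper `pearson_r` and the toy values are hypothetical and are not taken from the RECS data. </font>

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y over the product of their spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values (assumption): floor area and yearly consumption
square_feet = [800, 1200, 1500, 2000, 2600]
kwh = [5200, 7100, 8300, 11000, 13900]

r_positive = pearson_r(square_feet, kwh)                 # close to +1
r_negative = pearson_r(square_feet, [-v for v in kwh])   # same magnitude, opposite sign
print(round(r_positive, 3), round(r_negative, 3))
```

<font color=black size=3 face=Arial> Flipping the sign of one variable flips the sign of the coefficient but not its magnitude, which is why both ends of the scale count as strong relationships. </font>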

In [None]:
# check simple pairwise correlation -- linear -- EDA
corre_2009rclean = df_2009rclean.corr()

# Visualization -- heat map
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(corre_2009rclean,vmin=-1, vmax=1, annot=True,cmap='BrBG')
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

In [None]:
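# Just curious (can ignore) -- checking pairwise correlations on all variables
# Restrict to numeric columns, since Pearson correlation is undefined for text columns
all_corr = df_2009clean.select_dtypes(include='number').corr(method='pearson')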
corr_KWH = all_corr['KWH']
sig_corr_KWH = corr_KWH[(corr_KWH >= 0.5) | (corr_KWH <= -0.5)]
sig_corr_KWH

In [None]:
# Just curious (can ignore) 
a = df_2009clean.apply(lambda x: True if -2 in list(x) else False, axis=1)
df_2009rclean2 = df_2009clean.drop(a[a == True].index)
print(df_2009rclean2)

## 2. Random forest regression

<font color=black size=3 face=Arial> Random Forest is a supervised model that combines decision trees with the bagging method. The training dataset is resampled with replacement, a procedure called “bootstrap”, and each resample is used to fit a decision tree that considers only a random subset of the original columns. Finally, for regression, the predictions of the trees are averaged. </font>  
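
<font color=black size=3 face=Arial> The bootstrap-and-average idea can be sketched with very small trees. This is a deliberately simplified illustration and not scikit-learn's algorithm: each "tree" is a single-split stump on one randomly chosen feature, the split threshold is simply the feature mean, and all names and toy values are assumptions. </font>

```python
import random

def fit_stump(rows, targets, feature):
    """Fit a one-split regression tree on one feature (threshold = feature mean)."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
    right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
    overall = sum(targets) / len(targets)
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right) if right else overall
    return feature, threshold, left_mean, right_mean

def predict_stump(stump, row):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if row[feature] <= threshold else right_mean

def fit_bagged_stumps(rows, targets, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap: sample rows with replacement
        feature = rng.randrange(n_features)          # random feature for this stump
        stumps.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], feature))
    return stumps

def predict_bagged(stumps, row):
    """Regression output of the ensemble: the mean of the individual predictions."""
    return sum(predict_stump(s, row) for s in stumps) / len(stumps)

# Hypothetical toy data (assumption): [square feet, bedrooms] -> yearly kWh
rows = [[800, 1], [1000, 2], [1400, 2], [1800, 3], [2400, 4], [3000, 5]]
targets = [5000.0, 6500.0, 8000.0, 10500.0, 13000.0, 16000.0]

stumps = fit_bagged_stumps(rows, targets)
small_home = predict_bagged(stumps, [800, 1])
large_home = predict_bagged(stumps, [3000, 5])
print(small_home, large_home)
```

<font color=black size=3 face=Arial> A real random forest grows much deeper trees and re-draws the candidate features at every split, but the resampling and averaging steps work the same way. </font>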

<font color=black size=3 face=Arial> Each tree of the random forest can calculate the importance of a feature according to its ability to increase the purity of the leaves; for regression, this is the reduction in variance achieved by splits on that feature. The larger the gain in leaf purity, the higher the importance of the feature. This is computed for each tree, averaged across all trees and, finally, normalized so that the importance scores calculated by a Random Forest sum to 1. </font>
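
<font color=black size=3 face=Arial> The averaging and normalization step described above can be sketched in a few lines. The per-tree scores below are made-up impurity decreases, and `forest_importances` is a hypothetical helper rather than a scikit-learn function. </font>

```python
def forest_importances(per_tree_scores):
    """Average each feature's score across trees, then rescale so the scores sum to 1."""
    n_trees = len(per_tree_scores)
    n_features = len(per_tree_scores[0])
    averaged = [sum(tree[j] for tree in per_tree_scores) / n_trees
                for j in range(n_features)]
    total = sum(averaged)
    return [score / total for score in averaged]

feature_names = ["TotalSQFT", "ACusage", "Heat30Y"]
# Hypothetical raw impurity decreases (assumption): one row per tree, one column per feature
per_tree_scores = [
    [4.0, 3.0, 1.0],
    [6.0, 1.0, 1.0],
    [5.0, 2.0, 1.0],
]

importances = forest_importances(per_tree_scores)
for name, score in zip(feature_names, importances):
    print(name, score)
```

<font color=black size=3 face=Arial> Because the scores are rescaled to sum to 1, they are only meaningful relative to each other within one fitted model. </font>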

In [None]:
# import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Select dependent and independent variables
X = df_2009rclean.drop(columns= 'PowerUsage')
Y = df_2009rclean[['PowerUsage']]

In [None]:
# split the data in training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=40, shuffle=True)

<font color=black size=3 face=Arial> The number of trees, the number of features considered at each split, and the other hyperparameters are left at their defaults; they are not tuned, due to time constraints. </font>

In [None]:
forest_model = RandomForestRegressor()

In [None]:
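# Pass the target as a 1-D array; a single-column DataFrame triggers a DataConversionWarning
forest_model.fit(X_train, Y_train.values.ravel())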

In [None]:
# To get feature importances for each variable
importances = forest_model.feature_importances_
importances

In [None]:
# Plotting the importance
sort = importances.argsort()
plt.figure(figsize=(16, 6))
plt.barh(X.columns[sort], importances[sort])
plt.xlabel("Feature Importance")

## 3. Select important features and normalization

<font color=black size=3 face=Arial> Comparing the correlation coefficients with the random forest importances above reveals some inconsistencies between the two. A Random Forest's nonlinear nature can give it a leg up over linear measures, making it the better guide for selection. To simplify the problem, only 10 features are selected for the model: the six top-ranked variables from the random forest regression, plus four more chosen according to their correlation coefficients. </font>  

<font color=black size=3 face=Arial> So in the following step, the remaining five variables are removed from the dataframe before training: 'HouseholdIcome', 'NumOfResidents', 'Cold30Y', 'NightTemp', 'HeatUsage'. Neither feature selection method suggests that they provide useful information for predicting electricity usage. </font>

In [None]:
df_2009_10 = df_2009rclean.drop(['HouseholdIcome','NumOfResidents','Cold30Y','NightTemp', 'HeatUsage'], axis = 1)

<font color=black size=3 face=Arial> In addition, I need to transform all numerical variables onto a common scale. This prevents the variables with large values from dominating the machine learning process. All transformations are implemented using Scikit-Learn. </font>
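
<font color=black size=3 face=Arial> For reference, min-max scaling maps each value x to a + (x - min) * (b - a) / (max - min), where (a, b) is the target range. The sketch below is a plain-Python illustration with hypothetical helper names and toy values. It also shows one common refinement, which is an assumption and not what this notebook does: learning min and max from the training rows only and reusing them for the test rows. </font>

```python
def fit_min_max(values):
    """Learn the scaling parameters from the training values only."""
    return min(values), max(values)

def min_max_scale(values, lo, hi, feature_range=(0, 100)):
    """Map values linearly so that lo -> a and hi -> b, where (a, b) = feature_range."""
    a, b = feature_range
    span = hi - lo
    if span == 0:
        return [float(a) for _ in values]  # constant column: nothing to scale
    return [a + (v - lo) * (b - a) / span for v in values]

# Hypothetical toy values (assumption), e.g. total square footage
train_sqft = [800, 1200, 2000, 2800]
test_sqft = [1000, 3000]

lo, hi = fit_min_max(train_sqft)
scaled_train = min_max_scale(train_sqft, lo, hi)
scaled_test = min_max_scale(test_sqft, lo, hi)
print(scaled_train)
print(scaled_test)
```

<font color=black size=3 face=Arial> Test values that lie outside the training range land outside the 0 to 100 interval, which is expected when the parameters are learned from the training rows alone. </font>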

In [None]:
# Normalization
from sklearn import preprocessing
# min-max normalization
scaler = preprocessing.MinMaxScaler(feature_range=(0, 100))
normalization = scaler.fit_transform(df_2009_10)

In [None]:
min_max_columns = df_2009_10.columns
df_normalized = pd.DataFrame(normalization, columns=min_max_columns)
df_normalized.head()
df_normalized.describe()

In [None]:
# It is not necessary to normalize the dependent variable, so restore its original values.
# .to_numpy() assigns by position, because df_normalized has a fresh 0..n-1 index.
df_normalized['PowerUsage'] = df_2009_10['PowerUsage'].to_numpy()
df_normalized

# Modelling electricity consumption

## 1. Getting prepared of data for training

In [None]:
# Select the target and predictor datasets
P = df_normalized.drop(columns= 'PowerUsage') # Predictors
T = df_normalized[['PowerUsage']] #Target

# Splitting the modeling-ready dataset into the Training set and Test set
P_train, P_test, T_train, T_test = train_test_split(P, T, test_size=0.25, random_state=0)

## 2. Model selection : Artificial Neural Networks (ANN) for Regression and fitting

<font color=black size=3 face=Arial> Artificial Neural Networks are a family of machine learning models, widely used in deep learning, that are loosely modeled on the way neurons work in the human brain. </font>

<font color=black size=3 face=Arial> An Artificial Neural Network consists of an input layer, one or more hidden layers, and an output layer. Each layer consists of a number of neurons, and each neuron in a layer has an associated activation function. The activation function is responsible for introducing non-linearity into the relationship. </font>

<font color=black size=3 face=Arial> The advantages of ANNs, such as their capability to learn complex behaviour and their adaptability, make them widely used for prediction and pattern recognition. They adjust themselves as they learn from initial training, and subsequent runs provide more information about the world. </font>
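
<font color=black size=3 face=Arial> To make the layer structure concrete, here is a minimal forward pass written in plain Python. It mirrors the shape of the network built later in this notebook (10 inputs, three hidden layers of 10 ReLU neurons, one linear output), but the weights are untrained random numbers and every helper name is hypothetical; it is a sketch of the idea, not of how Keras computes it internally. </font>

```python
import random

def relu(z):
    """ReLU activation: passes positive values through and clips negatives to zero."""
    return max(0.0, z)

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: weighted sum plus bias per neuron, then the activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (assumed initialization for illustration)."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
layer_sizes = [(10, 10), (10, 10), (10, 10), (10, 1)]  # three hidden layers + output
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in layer_sizes]

def forward(features):
    activations = features
    for i, (weights, biases) in enumerate(layers):
        is_output = i == len(layers) - 1
        # Hidden layers use ReLU; the output layer stays linear for regression
        activations = dense(activations, weights, biases, None if is_output else relu)
    return activations[0]

sample = [50.0] * 10  # hypothetical row of 10 scaled features
prediction = forward(sample)
print(prediction)
```

<font color=black size=3 face=Arial> Training consists of adjusting the weights and biases so that this output moves closer to the observed consumption; without the ReLU steps the whole stack would collapse into a single linear model. </font>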

In [None]:
# 1. Installing required libraries
#!pip install tensorflow
!pip install keras
from keras.layers import Dense, Activation
from keras.models import Sequential
from sklearn.model_selection import train_test_split

<font color=black size=3 face=Arial> Every hyperparameter plays an important role in machine learning, and in an ANN in particular. Hyperparameters specify how many neurons are in each layer, which technique is used to initialize the weights in the network, which activation function each neuron in a layer uses, how many rows are passed to the network at once, how many times the ANN goes over the training data, and so on. In this exercise, to simplify the process, I assumed that the chosen set of hyperparameters was adequate rather than tuning them. </font>

In [None]:
# 2. Build the model

# Initialising the ANN
model = Sequential()

# Adding the input layer and the first hidden layer
model.add(Dense(units=10, kernel_initializer='normal', activation = 'relu', input_dim = 10))

# Adding the second hidden layer
model.add(Dense(units = 10, kernel_initializer='normal', activation = 'relu'))

# Adding the third hidden layer
model.add(Dense(units = 10, kernel_initializer='normal', activation = 'relu'))

# Adding the output layer

model.add(Dense(units = 1, kernel_initializer='normal'))

#model.add(Dense(1))
# Compiling the ANN
model.compile(optimizer = 'adam', loss = 'mean_squared_error')

# Fitting the ANN to the Training set
model.fit(P_train, T_train, batch_size = 10, epochs = 50, verbose=1)

In [None]:
# 3. Generating Predictions on testing data
T_pred = model.predict(P_test)

In [None]:
# 4. Plotting prediction and real data
plt.figure(figsize=(20, 10))
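# Plot by position: T_test keeps its shuffled original index, which would not line up with T_pred
plt.plot(T_test.values, color = 'red', label = 'Real data')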
plt.plot(T_pred, color = 'blue', label = 'Predicted data')
plt.title('Prediction')
plt.legend()
plt.show()

In [None]:
# 5. Collecting the actual and predicted electricity consumption for the testing data
Prediction = P_test.copy()
Prediction['PowerUsage'] = T_test['PowerUsage']
Prediction['PredictedUsage'] = T_pred.ravel()  # flatten the (n, 1) prediction array
Prediction.head()

## 3. Model performance evaluation

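<font color=black size=3 face=Arial> I use a single accuracy-style metric to evaluate model performance: 100 minus the mean absolute percentage error. More metrics can be applied later; since this is a regression task, error measures such as MAE, RMSE, and R² are more appropriate than a confusion matrix, which applies to classification. As shown, the overall accuracy of the prediction is not ideal (65.59%), meaning the predictions deviate from the actual consumption by about 34% on average. </font>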

In [None]:
# Computing the absolute percent error
APE=100*(abs(Prediction['PowerUsage']-Prediction['PredictedUsage'])/Prediction['PowerUsage'])
Prediction['APE']=APE
 
print('The Accuracy of ANN model is:', 100-np.mean(APE))
Prediction.head()
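
<font color=black size=3 face=Arial> For a regression model, a few complementary error measures can be computed next to the percentage error above. The sketch below is a standard-library illustration on made-up numbers; `regression_metrics` is a hypothetical helper, and the values are not results from this notebook. </font>

```python
import math

def regression_metrics(actual, predicted):
    """Return MAE, RMSE, MAPE (in percent) and R^2 for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                     # unexplained variation
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total variation
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Hypothetical toy values (assumption), in kWh per year
actual = [5000.0, 10000.0, 20000.0]
predicted = [6000.0, 9000.0, 18000.0]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(name, round(value, 3))
```

<font color=black size=3 face=Arial> MAE and RMSE are in the units of the target (kWh), MAPE is relative to each household's consumption, and R² compares the model against always predicting the mean. </font>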

# Conclusion and discussion

<font color=black size=3 face=Arial> This notebook can serve as a template for fitting an ANN regression model on other tabular datasets. It should also be a good example of predicting electricity consumption, but a few things need to be improved: </font>

<font color=black size=3 face=Arial>1. Raw data processing </font>  
<font color=black size=3 face=Arial>A significant number of variables were deleted because I assumed every '-2' is an N/A value and such values made up over 50% of those variables. The deleted columns deserve a closer look, since some of them may be factors that could contribute to the prediction. </font>

<font color=black size=3 face=Arial>2. Feature selection </font>  
<font color=black size=3 face=Arial>Reducing the dimensionality from 370 to 15 variables relied on my subjective judgment, which introduces uncertainty into the machine learning process. Further methods like PCA or LDA could be introduced in future analyses.</font>

<font color=black size=3 face=Arial>3. Hyperparameter tuning  </font>  
<font color=black size=3 face=Arial>For the sake of time, hyperparameter tuning was not applied in this project. However, I would suggest including it in the future, since it can sometimes affect the fit and performance of a model a lot. Hyperparameter selection consists of testing the performance of the model against different combinations of hyperparameters and selecting those that perform best according to a chosen metric and validation method.  </font> 
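
<font color=black size=3 face=Arial> The selection procedure described above amounts to a grid search. The sketch below shows only the search loop: `validation_error` is a hypothetical stand-in with a made-up formula, and in practice it would be replaced by training the model with the given settings and scoring it on a validation split. The candidate values in `search_space` are assumptions as well. </font>

```python
from itertools import product

def validation_error(units, batch_size, epochs):
    """Hypothetical stand-in (assumption): replace with real training plus validation scoring."""
    return abs(units - 32) + abs(batch_size - 16) / 4 + 100 / epochs

# Candidate values to try for each hyperparameter (assumed for illustration)
search_space = {
    "units": [10, 32, 64],
    "batch_size": [10, 16, 32],
    "epochs": [50, 100],
}

best_params, best_error = None, float("inf")
for combo in product(*search_space.values()):
    params = dict(zip(search_space.keys(), combo))
    error = validation_error(**params)
    if error < best_error:  # keep the combination with the lowest validation error
        best_params, best_error = params, error

print(best_params, best_error)
```

<font color=black size=3 face=Arial> The number of combinations grows multiplicatively with each added hyperparameter (18 here), so a random search over the same space is a common alternative when each training run is expensive. </font>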

<font color=black size=3 face=Arial>4. Model selection </font>   
<font color=black size=3 face=Arial>The low prediction accuracy may also indicate that the algorithm I selected (ANN) is not a good fit for the task. I did some research afterwards and noted that ANNs work well when there is a large amount of data. For smaller datasets with fewer than 50K records, classical ML models such as Random Forest, AdaBoost, and XGBoost are usually suggested instead. </font>   

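<font color=black size=3 face=Arial>Additionally, as I mentioned above, more metrics could be applied to evaluate the quality of the model. Sensitivity, specificity, and precision apply to classification; for this regression task the counterparts would be measures such as MAE, RMSE, and R².</font> 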

# Happy Learning!