# Capstone Project
## Modeling Electricity Usage Profiles
### _Conrad Camit_


## Executive Statement

### Background

The goal of clustering is to identify accounts with similar electricity usage profiles and to find opportunities for energy efficiency. 
Clustering accounts by these profiles makes it possible to compare and contrast customers with similar energy usage tendencies. 

Utility providers have traditionally analyzed usage at much higher aggregation levels.  That approach does not necessarily account for customer accounts that look similar in aggregate but differ in their hourly usage throughout the day because of environmental and behavioral differences.

Additionally, the prediction of hourly residential electricity consumption is of great economic interest to electricity providers in the global electricity market, since an accurate prediction of consumption ahead of time is needed to obtain the best prices on the day-ahead market and to avoid purchasing on the more expensive real-time market. 



### Problem Statements

1. Clustering by electricity usage profile
    -  Identify patterns in residential customers' hourly electricity usage and group customers with similar usage profiles.  
    -  Identify customers whose hourly usage is over and above that of their peer accounts during peak demand hours, when electricity prices are the highest (4pm-6pm on summer days). 
    

2. Predict residential electricity hourly usage
    -  The current method of forecasting a customer’s residential electricity usage is error-prone and could be more accurate.  
    -  The goal is to predict individual residential hourly electricity usage a day ahead more accurately than the current benchmark method.
    -  This will keep retail electricity providers from spending more on the real-time market to cover errors in their customers' day-ahead predictions.

    
## Data

-  Retrieved from a data repository provided by Pecan Street, an organization focused on providing access to data on consumer energy consumption behavior for research purposes.
-  Hourly electricity usage for almost 1,000 residential customers across the United States from 2011 through 2017
-  Demographic information including housing square footage, annual income, education level, and number of occupants.
-  Access to hundreds of residential house/apartment electricity customers from Texas, Colorado, and California.
    -  899 customers from Austin, Texas
    -  58 customers from Boulder, Colorado
    -  57 customers from San Diego, California

Figure 1: Map of customers with electricity usage available from Pecan Street
<img src="images/map.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

### Data Storage

Data was retrieved from the data repository and stored in a local Postgres database using Python and the psycopg2 library.  All data rows and columns from the following tables were cloned into a local Postgres database.

1. metadata
    -  This view draws data from multiple tables in the Dataport database to provide general information about each data ID in one location. It shows useful information about the customers, including housing square footage and housing type, which makes it particularly useful to researchers assembling a dataset to answer a research question.
2. electricity_egauge_hours
    -  This table stores all of the electricity data collected by Pecan Street’s eGauge devices, aggregated to hourly intervals. Each row's timestamp marks the beginning of the interval it covers, and all values are average real power over the interval in kilowatts (kW).
3. survey_(2011-2014)_all_participants
    -  This table stores the results of the 2012 Pecan Street general participant survey. Questions on this survey covered topics including demographics, electricity use, and the educational level and income level of the residents.
4. weather
    -  This table contains weather data from Austin, Boulder, and San Diego, where the majority of Pecan Street’s participants are located.






### Exploratory Data Analysis 

The Jupyter notebook code used for performing the exploratory data analysis can be found here:  
-  <a href="capstone-EDA.ipynb" >capstone-EDA.ipynb</a>

#### Data Cleaning
The following steps were taken to clean the data for analysis.  

1. Dataid: this database field was used to identify each individual customer
2. Hourly electricity usage data extraction
    -  Extracted all rows for each customer that had a row for every hour of 2015 (24 hours x 365 days = 8,760 rows).  373 customers had this required amount of electricity usage data
    -  Calculated yearly usage (kWh) for 2015 for each customer by aggregating hourly usage (3,267,480 usage data points)
3. Filled in gaps in hourly electricity usage:
    -  Used linear interpolation to fill usage gaps (readings of zero) of 4 hours or less, since the closest usage values are the most representative of the missing usage
    -  Flagged customer accounts with gaps > 4 hours for removal from the dataset
4. Retrieved the following data from the metadata table and stored it in a DataFrame:
    -  house square footage
    -  building type
    -  city
5. Retrieved the following data from the survey tables and stored it in a DataFrame:
    -  number of adults in household
    -  number of children in household
    -  number of seniors in household
    -  education level
    -  income level
6. Aggregated hourly electricity usage data for each customer:
    -  Day of the week (usage by day of the week, Monday-Sunday)
    -  Season (usage aggregated by season: winter, spring, summer, fall)
    -  This produced 672 data points per customer account, one for each hour of the day per day of the week per season (24 x 7 x 4)
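
The gap-filling and aggregation steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's code: the function names `fill_short_gaps` and `build_profile` are hypothetical, the month-to-season mapping (meteorological seasons) is an assumption, and the sample readings are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: meteorological seasons; the notebook may define seasons differently.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}


def fill_short_gaps(usage, max_gap=4):
    """Linearly interpolate runs of zero readings that are at most max_gap long.

    Returns (filled_values, flagged). flagged is True when a run of zeros is
    longer than max_gap or touches either end of the series, where there is
    no neighbour on one side to interpolate from.
    """
    filled = list(usage)
    flagged = False
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] != 0:
            i += 1
            continue
        # Find the end of this run of zeros: indices i .. j-1.
        j = i
        while j < n and filled[j] == 0:
            j += 1
        gap = j - i
        if gap > max_gap or i == 0 or j == n:
            flagged = True  # leave the gap untouched and flag the account
        else:
            left, right = filled[i - 1], filled[j]
            step = (right - left) / (gap + 1)
            for k in range(gap):
                filled[i + k] = left + step * (k + 1)
        i = j
    return filled, flagged


def build_profile(readings):
    """Average (timestamp, usage) readings per (season, weekday, hour) key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in readings:
        key = (SEASON_BY_MONTH[timestamp.month], timestamp.weekday(), timestamp.hour)
        totals[key] += value
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up data: a flat year of hourly readings for 2015.
start = datetime(2015, 1, 1)
flat_year = [(start + timedelta(hours=h), 1.0) for h in range(8760)]
profile = build_profile(flat_year)  # one entry per (season, weekday, hour)
```

A full year of hourly readings yields 4 seasons x 7 weekdays x 24 hours = 672 keys, which matches the number of data points per account described above.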
    



#### Data Findings
-  Viewing the hourly data for a particular account showed that usage followed a daily pattern: it was often low during the nighttime hours, increased during the day, and peaked in the early afternoon.  Customers with abnormally high yearly usage (more than 2 standard deviations above the mean) were dropped.
<img src="images/electricityusage26.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

-  A histogram of the customers' 2015 yearly usage showed that usage was approximately normally distributed with a slight positive skew.

<img src="images/histogram_yearly_usage.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

- As expected, a customer's yearly usage appeared to be positively correlated with the square footage of the house/apartment.

<img src="images/square_footage_vs_yearly_usage.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />


## Problem 1: Clustering by electricity usage profile

The Jupyter notebook code used for performing the clustering models can be found here:  
-  <a href="capstone-clustering.ipynb" >capstone-clustering.ipynb</a>


### Modeling
#### Model Selection

-  To cluster accounts so that energy profiles in a common cluster are more similar to each other than to those in other clusters, I wanted to use a standard partitioning method.

-  Of the major clustering approaches, partitioning was chosen because partitioning methods are distance-based.

-  This suits the application because the aim is to compare and contrast individual energy profiles, and a distance measure provides a tangible measure of dissimilarity between them.

-  K-means was selected as a standard heuristic partitioning method because it groups energy use profiles by their similarity to a mean profile. This is desirable because mean profiles can be used in energy efficiency comparisons.

-  DBSCAN clustering was also considered, but its results were identical to those of k-means clustering.

#### Choosing the optimal number of clusters
-  Applying the elbow method
    -  Compute the sum of squared errors (SSE) for different numbers of clusters, k
    -  SSE is defined as the sum of the squared distances between each member of a cluster and its centroid
    -  Plotting k against the SSE shows that the error decreases as k gets larger; the "elbow" is the point beyond which adding clusters yields only small improvements
    -  Ten clusters was determined to be a reasonable cutoff based on the cluster analysis.
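
To make the SSE computation concrete, here is a minimal pure-Python sketch of the elbow method. It is not the notebook's code: `kmeans_sse` is a hypothetical helper that runs a basic Lloyd's k-means with a seeded random initialization, and the four two-dimensional points are made up for illustration (the real profiles have many more dimensions).

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, iterations=50, seed=0):
    """Run a basic k-means and return the final sum of squared errors."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    # SSE: squared distance from each point to its closest centroid.
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Made-up data: two tight groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 5)}
for k, sse in sse_by_k.items():
    print(k, sse)
```

On this toy data the SSE drops sharply from k=1 to k=2 and barely changes afterwards, so the elbow sits at k=2. The same reading of the curve applied to the customer profiles is what motivates a cutoff.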
    
#### Normalization of data
-  Because the goal is to compare individual energy profiles, normalization of the data is necessary. 
-  Accounts may mirror each other in energy usage tendencies but not in their absolute kWh throughout the day because of differences in the underlying environment. 
-  Usage data was normalized by dividing each hourly usage value by the customer's total yearly usage.
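
A small sketch of this normalization, assuming each account's usage is held as a plain list of hourly kWh values. The function name and the two sample accounts are hypothetical; they only show that accounts with the same shape but different scale end up with the same normalized profile.

```python
def normalize_by_yearly_total(hourly_kwh):
    """Divide every hourly value by the account's total usage."""
    total = sum(hourly_kwh)
    if total == 0:
        raise ValueError("account has no recorded usage to normalize by")
    return [value / total for value in hourly_kwh]


# Made-up accounts: same usage shape, but the second uses four times as much.
small_home = [1.0, 2.0, 1.0]
large_home = [4.0, 8.0, 4.0]

small_profile = normalize_by_yearly_total(small_home)
large_profile = normalize_by_yearly_total(large_home)
```

Both accounts normalize to the same shares of total usage, so a distance-based method such as k-means would treat them as the same profile.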


<img src="images/kmeans_analysis.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

The graph below shows the overall mean usage of all customer accounts by month and day of the week for 2015.  On average, usage peaks daily in the early afternoon and tends to be much higher in the summer months.  This is expected because the majority of the customers are in the Austin, Texas area, where warm summer temperatures translate to increased air conditioning usage.  
<img src="images/usage_by_season_and_day.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

The primary focus for this problem statement is summer usage, when peak demand prices are the highest.

The cluster centers of the ten clusters of 2015 summer hourly usage are graphed below.  
<img src="images/10clusters-summer.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

Below is the graph of average 2015 summer hourly usage for cluster=0.  The 28 customer accounts in this cluster tended to have lower usage from 9am to 5pm Monday through Friday, which suggests jobs away from home. All are located in Austin, Texas, and their much higher usage in summer compared to other seasons points to air conditioning use in the warmer climate.
<img src="images/cluster0-summer.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

Below is the graph of average 2015 summer hourly usage for cluster=6.  The 27 customer accounts in this cluster tended to have more consistent usage between workdays and weekends.  Most households had children at home, and most houses were located in areas with a more moderate summer climate (most were in Boulder, Colorado).
<img src="cluster6-summer.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

Below is the graph of average 2015 summer hourly usage for cluster=7.  The 6 customer accounts in this cluster had peak usage in the morning hours and low usage during the daytime hours, when the residents appeared to be away. The majority were two-adult households.
<img src="images/cluster7-summer.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

#### Targeting customers with more usage than peers at peak hours of 4pm-6pm daily.

-  Resources with the lowest variable operating costs are always dispatched first, while more expensive generating units—like peaking power plants—are brought online when demand increases. 
-  Customers exceeding average summer peak usage for their profile are targeted
-  Target these customers to reduce load by 50% during peak hours
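
A minimal sketch of how such customers could be selected within one cluster. The names here are hypothetical, and two details are assumptions: each account is represented by 24 average summer hourly usage values, and the 4pm-6pm window corresponds to the hourly intervals starting at 16:00 and 17:00.

```python
PEAK_HOURS = (16, 17)  # intervals starting at 4pm and 5pm (assumed 4pm-6pm window)


def customers_over_peak(profiles, peak_hours=PEAK_HOURS):
    """Return ids of accounts whose peak-hour usage exceeds the cluster average.

    profiles maps a customer id to a list of 24 average hourly usage values
    for the accounts that share one cluster.
    """
    peak_usage = {
        customer_id: sum(hours[h] for h in peak_hours) / len(peak_hours)
        for customer_id, hours in profiles.items()
    }
    cluster_mean = sum(peak_usage.values()) / len(peak_usage)
    return sorted(cid for cid, usage in peak_usage.items() if usage > cluster_mean)


# Made-up cluster of three accounts with flat usage of 1, 2 and 6 kWh per hour.
cluster = {"a": [1.0] * 24, "b": [2.0] * 24, "c": [6.0] * 24}
targets = customers_over_peak(cluster)
```

With a cluster average of 3 kWh during the peak window, only account "c" is selected in this toy example.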

Below is the graph of the average 2015 summer hourly usage for cluster #2 plotted against each of the other accounts in the cluster in grey.  The overall peak demand for summer 2015 for all accounts was between 4pm-6pm.  It is during this time of peak demand that the price for generating power is the greatest.  Customers with usage over and above those accounts with the same profile will be targeted for energy efficiency programs.  
<img src="images/cluster2-summer.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

<img src="images/customers_over_summer_peak.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

Assuming a successful 50% reduction in usage at peak summer hours and an average real-time price of $39.59 during summer peaks:

-  Total dollar recovery for 373 customers is $329.19

<img src="images/dollar_recovery_from_usage_reduction.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

## Problem 2: Forecasting day-ahead hourly electricity usage

The current method of forecasting a customer’s residential electricity usage is error-prone and could be more accurate.  The goal is to predict individual residential hourly electricity usage a day ahead more accurately than the current benchmark method.
Real-time prices are much more volatile than day-ahead prices because of their unpredictable nature, so errors in the day-ahead forecast are costly to cover. 


### Modeling
#### Model Selection
Two models were chosen to compare against the current benchmark method of calculating the day-ahead forecast for residential electricity usage.  The goal is to find a model that provides a substantial improvement in the prediction of day-ahead hourly electricity usage. 

##### ARIMA model

The Jupyter notebook code used for performing the ARIMA forecasting method can be found here: <a href="capstone-ARIMA.ipynb">capstone-ARIMA.ipynb</a>

- An ARIMA model was chosen to forecast day-ahead hourly electricity usage.  
- A time series (TS) model predicts future values based on previous observations. The commonly used Auto-Regressive Integrated Moving Average (ARIMA) model is defined by three parameters: d, the number of times the time series needs to be differenced to make it stationary; p, the auto-regressive order, which is the number of past observations included in the model; and q, the moving average order, which is the number of past white noise error terms included in the model. The advantages of ARIMA include that it requires no domain knowledge and does not depend on other features. 
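
For reference, the three parameters fit together in the standard ARIMA(p, d, q) equation. The notation is textbook notation rather than anything taken from the notebook: L is the lag operator (L y_t = y_{t-1}), y_t is the hourly usage, φ_i and θ_j are the auto-regressive and moving average coefficients, c is a constant, and ε_t is the white noise error term.

```math
\left(1 - \sum_{i=1}^{p} \phi_i L^i\right) (1 - L)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j L^j\right) \varepsilon_t
```

The factor (1 - L)^d applies the differencing d times, the left-hand sum covers the p past observations, and the right-hand sum covers the q past error terms.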

##### Timeseries Forecasting with LSTM 

The Jupyter notebook code used for performing the LSTM forecasting method can be found here: <a href="capstone-LSTM.ipynb">capstone-LSTM.ipynb</a>

- Modeling time series and forecasting with neural networks is a growing trend. The Long Short-Term Memory (LSTM) recurrent neural network architecture is a popular choice when "context", or memory across time, is a desired capability of the model.
- A "stateful" model was used so that the model can "remember" the previous timesteps: instead of being reset after each training batch, the internal state of the neurons is maintained. 

##### Baseline forecasting method

The Jupyter notebook code used for performing the baseline forecasting method can be found here: <a href="capstone-ISO.ipynb">capstone-ISO.ipynb</a>

- This is the current method that electricity providers use to forecast their customers' electricity usage, usually at an aggregated level.  
- The forecast takes hourly averages across the three most recent days with the highest average consumption value, out of a pool of the ten previous days.
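
One way to read this rule is sketched below. It is an interpretation, not the notebook's implementation: the function name is hypothetical, and the sketch assumes the pool is the ten days immediately before the forecast day, ranked by their daily average consumption.

```python
def baseline_forecast(previous_days, pool_size=10, top_n=3):
    """Forecast 24 hourly values from the highest-usage days in a recent pool.

    previous_days is ordered oldest to newest; each day is a list of 24
    hourly usage values.
    """
    pool = previous_days[-pool_size:]
    # Rank the pool by daily average consumption and keep the top_n days.
    top_days = sorted(pool, key=lambda day: sum(day) / len(day), reverse=True)[:top_n]
    # Average the selected days hour by hour.
    return [sum(hour_values) / len(top_days) for hour_values in zip(*top_days)]


# Made-up history: ten flat days with usage 1, 2, ..., 10 kWh per hour.
history = [[float(level)] * 24 for level in range(1, 11)]
forecast = baseline_forecast(history)
```

Here the three highest days are the ones at 8, 9 and 10 kWh per hour, so every forecast hour is their average.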


### Model Forecasting Results

#### Baseline Forecast Prediction

The graph below shows the baseline prediction for the month of April 2015 using previous actual hourly data.

<img src="images/baseline_forecast_prediction.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

The baseline prediction for April 1, 2015 does a reasonably good job of predicting day-ahead usage, but it misses the peak load by almost half.
<img src="images/baseline_april1_prediction.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

<img src="images/mean_squared_error_baseline.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />
#### Baseline Model: Average mean squared error: 0.718466826596

#### ARIMA Forecast Prediction

The graph below shows the ARIMA prediction for the month of April 2015 using previous actual hourly data.

<img src="images/arima_forecast_prediction.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

<img src="images/arima_april1_prediction.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />

<img src="images/mean_squared_error_arima.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />
#### ARIMA Model: Average mean squared error: 0.492931579608

- Roughly a 31% reduction in forecasting prediction error using the ARIMA model over the baseline
<img src="images/mean_squared_error.png" alt="Alt text that describes the graphic" title="Title text" height="400" width="600" />
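
The error metric and the percentage improvement can be computed along these lines. This is a generic sketch with made-up numbers, not the notebook's code, and the function names are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def percent_reduction(baseline_error, model_error):
    """Relative improvement of a model's error over the baseline, in percent."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Made-up hourly values for illustration only.
actual = [1.0, 2.0]
baseline_mse = mean_squared_error(actual, [3.0, 4.0])  # errors of 2 each
model_mse = mean_squared_error(actual, [2.0, 3.0])     # errors of 1 each
improvement = percent_reduction(baseline_mse, model_mse)
```

Because the errors are squared, halving every hourly error in this toy example cuts the MSE by three quarters.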

### Findings

#### Problem 1: Clustering by electricity usage profile

1. Data science techniques such as K-means and DBSCAN clustering can group residential customers' hourly usage into usage profiles based on similar hourly and seasonal usage trends, which makes it possible to target customer accounts for energy efficiency programs.
2. Customers with usage over and above that of accounts with the same usage profile during peak hours were targeted for energy efficiency programs.
3. With a successful 50% reduction in usage at peak summer hours, at an average real-time price of $39.59 during summer peaks, the total dollar recovery from targeting all 373 customers is $329.19.

#### Problem 2: Forecasting day-ahead hourly electricity usage

1. Roughly a 31% reduction in forecasting prediction error using the ARIMA model over the baseline
2. Real-time prices are much more volatile than day-ahead prices because of their unpredictable nature, so accurate day-ahead predictions are valuable in avoiding unnecessary price volatility 

## Future directions
1. Investigate additional models to further improve forecasting projections.
2. Increase the number of accounts available for investigation in order to cluster accounts on more detailed differences.  
3. Use demographic/behavioral information of household to predict future residential usage for new customers.
