# California Energy Usage

## Scenario
You have been hired as a Data Scientist for a large-scale battery manufacturer that is considering expanding its business to customers in California. Your first task is to provide an overview of the state of the California electric grid, with a focus on data-driven analyses. Based on this analysis, you may then be asked to recommend which areas of California to focus on or which types of customers to target with advertisements (e.g. residential vs. commercial and industrial).

Based on a tip from a colleague, you start by looking into publicly available data from the California Independent System Operator (CAISO), which oversees the majority of the California electrical power system.

In [None]:
# make the code compatible with both Python 2 and 3
# (__future__ imports must come before any other statement in the cell)
from __future__ import print_function, division

# load required packages
# - numpy: vectorized math functions
# - matplotlib: plotting
# - pandas: data manipulation
# - seaborn: improved plot formatting and additional plotting functions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# render plots inline in the notebook
%matplotlib inline

## Hourly Demand
You quickly find that CAISO publishes historical hourly load data on its website: http://www.caiso.com/planning/Pages/ReliabilityRequirements/Default.aspx#Historical

A copy of the 2014-2016 hourly load data has been included with this notebook (see "caiso_historical.csv").

**Questions:**
1. What data is included in the file?
2. How is the data organized?
3. Are there any oddities in the data (e.g. corrupted values or inconsistent units)?

*Hint:* With text-based data files, it's usually a good idea to start by inspecting the data with a text editor (Notepad++, Sublime, Atom, etc.).
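
If a text editor is not at hand, the same first look can be taken from Python. The following is a minimal sketch using only the standard library; the helper name `peek` is made up for illustration and is not part of the notebook.

```python
from itertools import islice


def peek(path, n_lines=5):
    """Return the first n_lines of a text file without reading the whole file."""
    with open(path, "r") as f:
        return [line.rstrip("\n") for line in islice(f, n_lines)]


# Example usage (assumes the CSV sits next to the notebook):
# for line in peek("caiso_historical.csv"):
#     print(line)
```

Looking at the header row and the first few records is usually enough to spot the delimiter, the column names and the timestamp format.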

After you get some sense of the data file (variables, organization, etc.), we can move on to loading the data. For that, we'll make use of the pandas package (although you could use the Python stdlib or other packages, e.g., numpy).

In [None]:
# load the hourly demand [MW] data as a pandas DataFrame object
df = pd.read_csv('caiso_historical.csv',delimiter=',',
                 parse_dates=True,   # parse the timestamps
                 index_col=0         # use the timestamps as the index
)

# check how much data we have
print(df.shape)

# verify that the data looks "correct"
print(df.head())

## Quick time-series visualization
Now that we've got the data loaded, let's try a few quick plots.

In [None]:
# pandas has some standard plot functions built-in
df.plot()
plt.show()

**Question:** What are the units of the load data? The data file doesn't list units, but you should be able to guess based on information from the CAISO website: http://www.caiso.com/outlook.html

## Summary statistics
The next obvious thing to try is to look at some summary statistics of the data (e.g. min, mean and max values of each variable).

In [None]:
# get summary statistics of the data
print(df.describe())

**Question:** Do you notice anything in the summary statistics? Alternatively, is there anything "missing" from the summary statistics (i.e. other info you wish you had)?
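
One kind of information that summary statistics only hint at is data quality: missing values, duplicated timestamps and gaps in the time axis. The sketch below shows one possible way to check for these. It runs on a small synthetic frame with made-up values, and the helper name `data_quality_report` is hypothetical; in the notebook you would pass `df` instead.

```python
import numpy as np
import pandas as pd


def data_quality_report(frame, step=pd.Timedelta(hours=1)):
    """Collect a few checks that describe() does not report directly."""
    # every timestamp we would expect if the record had no gaps
    expected = pd.date_range(frame.index.min(), frame.index.max(), freq=step)
    return {
        "missing_values": frame.isna().sum().to_dict(),
        "duplicate_timestamps": int(frame.index.duplicated().sum()),
        "missing_timestamps": len(expected.difference(frame.index)),
    }


# synthetic stand-in for the real data (values are made up)
idx = pd.date_range("2014-01-01", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"CAISO": np.linspace(20000.0, 25000.0, 48)}, index=idx)
toy.iloc[5, 0] = np.nan   # one corrupted reading
toy = toy.drop(idx[10])   # one missing hour

report = data_quality_report(toy)
print(report)
```

Whether the real file contains any such issues is something to check rather than assume; the toy frame only has them because they were inserted on purpose.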

## Effect of time of day
Since we're looking at an electric grid, it's natural to want to understand the effect of time on the variables. Let's start by looking at the hourly load as a function of the hour of the day.

In [None]:
# We could write custom code, but instead we'll leverage some of the
# built-in functionality available in pandas. Specifically, we'll use
# the ``groupby`` function to automatically group the data by hour.

# get the mean and standard deviation of the CAISO total load
# for each hour of the day
load_mean = df['CAISO'].groupby(df.index.hour).mean()
load_std = df['CAISO'].groupby(df.index.hour).std()

# compare load by hour of the day
hours = range(24)
plt.figure(figsize=(8, 8))
plt.errorbar(hours, load_mean, yerr=load_std)
plt.xlabel('Hour of the day')
plt.ylabel('Load [???]')
plt.show()

**Question:** What does looking at data by the hour of the day tell us?
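
To see what grouping by the hour of the day actually computes, it can help to try it on a series small enough to check by hand. The sketch below uses two days of made-up hourly values (not CAISO data): each hour appears twice, so the grouped mean and standard deviation are each computed from two numbers.

```python
import pandas as pd

# two days of synthetic hourly data: value = hour + 10 * (day number - 1)
idx = pd.date_range("2014-01-01", periods=48, freq=pd.Timedelta(hours=1))
values = idx.hour.to_numpy() + 10.0 * (idx.day.to_numpy() - 1)
toy = pd.Series(values, index=idx, name="toy_load")

# one group per hour of the day (0..23), each holding one value per day
grouped = toy.groupby(toy.index.hour)
hourly_mean = grouped.mean()
hourly_std = grouped.std()   # sample standard deviation (ddof=1)

# hour 3 holds the values 3 and 13
print(hourly_mean.loc[3], hourly_std.loc[3])
```

The mean describes the typical load at that hour, while the standard deviation describes how much it varies from day to day, which is what the error bars are meant to show.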

**Task:** Now try the same analysis on the load data for each utility (PGE, SCE, SDGE and VEA) and compare.

In [None]:
# (put your code here)


## Effect of time of week
Now let's look at how the time of the week affects the data (e.g. weekday vs. weekend). We'll aim to make the same plots as before, but this time we'll group by the day of the week.

In [None]:
# get load by the day of the week
load_mean = df['???'].groupby(df.index.dayofweek).mean()
load_std = ???

# plot
days = range(7)
plt.figure(figsize=(???, ???))
plt.errorbar(???)
plt.show()

**Question:** What does looking at data by the day of the week tell us?

**Task:** Now try the same analysis on the load data for each utility (PGE, SCE, SDGE and VEA) and compare.
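
When reading day-of-week results, keep in mind that pandas numbers the days from Monday=0 to Sunday=6. The following sketch, on synthetic data with made-up values, shows one way to attach readable labels and to collapse the seven days into a weekday/weekend comparison. The names `day_names` and `day_type` are illustrative only.

```python
import numpy as np
import pandas as pd

# two weeks of synthetic hourly data, starting on a Monday (2014-01-06)
idx = pd.date_range("2014-01-06", periods=14 * 24, freq=pd.Timedelta(hours=1))
is_weekend = idx.dayofweek >= 5   # Saturday=5, Sunday=6
load = pd.Series(np.where(is_weekend, 80.0, 100.0), index=idx, name="toy_load")

# mean by day of the week, relabelled with day names
day_names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
by_day = load.groupby(load.index.dayofweek).mean()
by_day.index = day_names

# coarser grouping: weekday vs. weekend
day_type = np.where(is_weekend, "weekend", "weekday")
by_type = load.groupby(day_type).mean()

print(by_day)
print(by_type)
```

The same labels can be passed to the plotting call as tick labels so that the x-axis shows day names instead of integers.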

## Advanced visualization
Now that we have some initial understanding of the effects of time on the load data, let's try to make a couple of polished figures. For this, we'll leverage the seaborn package, which provides some advanced features not found in the base matplotlib package.

In [None]:
# violinplots are a nice way to compare distributions
sns.violinplot(data=df, x=df.index.year, y='CAISO')
plt.xlabel('???')
plt.ylabel('CAISO Total Load [MW]')
plt.show()

**Discussion:** What do the violin plots tell us about how the load changes from year to year? What don't the plots tell us?

In [None]:
# let's try using violinplots to look at another time variable: hour of the day
plt.figure(figsize=(12, 4))
sns.violinplot(data=df, x=df.index.hour, y='CAISO', inner=None)
plt.xlabel('???')
plt.ylabel('CAISO Total Load [MW]')

# save figure as a high-res PNG
plt.savefig('caiso_load_hour.png', dpi=200, bbox_inches='tight')

plt.show()

**Discussion:** What can we infer from this figure? What do we still not know? Are there other visualizations that could provide additional useful info?
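
One possible additional view is a two-dimensional summary, for example the mean load for each combination of hour of the day and month, which could then be drawn as a heatmap. The sketch below only builds the table, using a synthetic series with made-up values; in the notebook you would start from `df['CAISO']` instead of `toy`.

```python
import numpy as np
import pandas as pd

# one non-leap year of synthetic hourly data (values are made up)
idx = pd.date_range("2014-01-01", periods=365 * 24, freq=pd.Timedelta(hours=1))
values = 1000.0 + 10.0 * idx.hour.to_numpy() + idx.month.to_numpy()
toy = pd.Series(values, index=idx, name="toy_load")

# mean load for every (hour, month) pair, reshaped to hours x months
table = toy.groupby([toy.index.hour, toy.index.month]).mean().unstack()
table.index.name = "hour"
table.columns.name = "month"

print(table.shape)
# in the notebook, sns.heatmap(table) would draw the table as a heatmap
```

A table like this makes it possible to see seasonal and daily patterns at the same time, which the per-hour violin plot cannot show.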

### Ramp rates
As part of your due diligence for the project, you meet with one of your colleagues who focuses on the policy aspect of battery energy storage. They recommend you take a look at the prevalence and size of the ramp rates (changes in load per time step) in California. A selling point for batteries (compared to other storage technologies) is their ability to respond more rapidly to ramp events. If you can pinpoint areas with large ramps, you can narrow down the search for potential customers.

Since you have hourly load data, you can start by focusing on hourly ramp rates [MW/h].

In [None]:
# we'll use finite differences to get the hourly ramp rates [MW/h]
ramps = df['CAISO'].diff()

# visually inspect the ramp rates
plt.figure(figsize=(10, 6))
ramps.plot()
plt.ylabel('Ramp rate [MW/h]')
plt.show()

**Discussion:** How can we better summarize the ramp rates? Is there a way to check the prevalence of ramps of different magnitudes? Are ramps of different magnitudes equally likely (e.g. 1 MW/h vs. 1000 MW/h)? And when do the ramps occur?
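
As a starting point for this discussion, the sketch below shows a few ways to summarize ramps: quantiles of the ramp magnitude, a histogram of magnitudes, and the mean ramp for each hour of the day. It uses a synthetic load (the square of the hour, repeated daily) so the results can be checked by hand; none of the numbers relate to CAISO. Note that `diff()` only yields MW/h if consecutive rows are exactly one hour apart.

```python
import numpy as np
import pandas as pd

# ten days of synthetic hourly data: load = hour ** 2, repeated every day
idx = pd.date_range("2014-01-01", periods=10 * 24, freq=pd.Timedelta(hours=1))
toy = pd.Series(idx.hour.to_numpy().astype(float) ** 2, index=idx, name="toy_load")

# hourly ramps; the first value has no predecessor and is dropped
ramps = toy.diff().dropna()

# how large are typical and extreme ramps?
magnitude_quantiles = ramps.abs().quantile([0.5, 0.9, 0.99])

# how often do ramps of different magnitudes occur?
counts, bin_edges = np.histogram(ramps.abs(), bins=10)

# when do the ramps occur? mean ramp for each hour of the day
mean_ramp_by_hour = ramps.groupby(ramps.index.hour).mean()

print(magnitude_quantiles)
print(mean_ramp_by_hour.idxmax(), mean_ramp_by_hour.idxmin())
```

For the real data, a histogram with a logarithmic count axis may be worth trying, since large ramps are likely to be much rarer than small ones.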

### Next steps
Can you think of other questions that we should be asking? Which ones can be answered with data? Also, what other types of data could be useful for this topic (e.g. historical temperature or info on the amount of installed solar capacity)?