In [None]:
!pip install causal-impact

### Python Libraries to know
- Pandas: For loading and manipulating data. 
- NumPy: Working with numbers and matrices. 
- Matplotlib (Seaborn too): For plotting and graphing data. Matplotlib will do the basics, and Seaborn can do some more advanced plots. 
- Sklearn (scikit-learn): Has a bunch of different models and useful functions in an easy-to-use format.
- Statsmodels: Many models, time series models and statistical tests. 

### Nice to know (but not necessary)
- PyTorch or TensorFlow: Neural networks and gradient descent optimization. Last year there was text data for both competitions I went to, but I'm under the impression the winners didn't rely heavily on NLP.
- LightGBM or XGBoost: Easy-to-use black-box gradient-boosted tree models. They handle missing values automatically. You can use Shapley values to "interpret" the model.
- PyMC3, PyStan or another PPL: For Bayesian modeling. Very powerful, but has a learning curve.
- CVXPY: Convex optimization. Useful if you are trying to optimize a convex function. You can always use a general optimizer (e.g. particle swarm, Bayesian optimization, grid search) instead.

### Useful Resources:
Kaggle

https://datasetsearch.research.google.com/

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train_df = pd.read_csv('../input/ashrae-energy-prediction/train.csv')
weather_df = pd.read_csv('../input/ashrae-energy-prediction/weather_train.csv')
meta_df = pd.read_csv('../input/ashrae-energy-prediction/building_metadata.csv')

##  Look at what kind of data is in these files

In [None]:
train_df.head()

In [None]:
weather_df.head()
# There are some missing values. We should also eventually ensure that all of the values fall within a reasonable range. 

In [None]:
meta_df.head()
# Missing values as well. 
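To put a number on the missing values rather than eyeballing `head()`, something like the following can help. This is a minimal sketch on a small made-up frame (`demo_weather` is a hypothetical stand-in, not the competition data); the same two calls should work on `weather_df` and `meta_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for weather_df; the values are invented for illustration.
demo_weather = pd.DataFrame({
    'site_id': [0, 0, 1, 1],
    'air_temperature': [25.0, np.nan, 19.4, 21.1],
    'cloud_coverage': [np.nan, np.nan, 2.0, 4.0],
})

# Fraction of missing values per column, largest first.
missing_share = demo_weather.isna().mean().sort_values(ascending=False)
print(missing_share)

# Quick range check: minimum and maximum of each numeric column.
print(demo_weather.describe().loc[['min', 'max']])
```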

## Explore train_df

In [None]:
train_df['meter'].value_counts()

In [None]:
train_df['timestamp'][0] 

## The timestamp is stored as a string rather than a datetime, so let's convert it to the pandas datetime format and then add some additional features!

In [None]:
train_df['timestamp'] = pd.to_datetime(train_df['timestamp'])

train_df['month'] = train_df['timestamp'].dt.month
train_df['weekday'] = train_df['timestamp'].dt.dayofweek
train_df['monthday'] = train_df['timestamp'].dt.day
train_df['hour'] = train_df['timestamp'].dt.hour
train_df['minute'] = train_df['timestamp'].dt.minute

In [None]:
train_df['minute'].unique() # Looks like the data doesn't go down to minute resolution. Let's drop it.

In [None]:
train_df = train_df.drop(['minute'], axis = 1)

## Look at individual buildings.

In [None]:
plt.plot(train_df[train_df['building_id'] == 0]['meter_reading'], alpha = 0.8)
plt.plot(train_df[train_df['building_id'] == 1]['meter_reading'], alpha = 0.8)
plt.plot(train_df[train_df['building_id'] == 2]['meter_reading'], alpha = 0.8)
plt.plot(train_df[train_df['building_id'] == 500]['meter_reading'], alpha = 0.8)

### Let's look at the autocorrelation of these series, and at a lag plot.

In [None]:
pd.plotting.lag_plot(train_df[train_df['building_id'] == 0]['meter_reading'])
plt.plot([0,400],[0,400])
# Look at the 3 clusters. 

In [None]:
pd.plotting.lag_plot(train_df[train_df['building_id'] == 500]['meter_reading'])
plt.plot([0,400],[0,400])

In [None]:
pd.plotting.autocorrelation_plot(train_df[train_df['building_id'] == 500]['meter_reading'])
plt.show()
pd.plotting.autocorrelation_plot(train_df[train_df['building_id'] == 500]['meter_reading'][:300])
plt.show()
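The autocorrelation plot shows every lag at once. If you only care about a few specific lags (say, the same hour on the previous day), `Series.autocorr` returns a single number. The sketch below uses a synthetic hourly series with a daily cycle, not the competition data, so the variable names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series with a 24-hour cycle plus noise (not the real meter data).
rng = np.random.default_rng(0)
hours = np.arange(24 * 14)
demo_reading = pd.Series(
    100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)
)

lag_24 = demo_reading.autocorr(lag=24)  # same hour on the previous day
lag_12 = demo_reading.autocorr(lag=12)  # opposite phase of the daily cycle
print(f"lag 24: {lag_24:.2f}, lag 12: {lag_12:.2f}")
```

A strong positive value at lag 24 and a strong negative value at lag 12 is what a clean daily cycle looks like; real buildings will usually be messier.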

### Some buildings have multiple power meters.

In [None]:
train_df[train_df['meter'] == 2].head()

### The meters are not necessarily consecutive numbers.

In [None]:
train_df[train_df['building_id'] == 745].head()

In [None]:
print(train_df[train_df['building_id'] == 745].meter.unique())
print(train_df[train_df['building_id'] == 1414].meter.unique())
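Rather than checking buildings one at a time, a `groupby` can list the meters for every building in one go. A minimal sketch on a made-up frame (the building ids and meter codes in `demo_train` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for train_df with only the two columns we need.
demo_train = pd.DataFrame({
    'building_id': [1, 1, 1, 2, 2, 3],
    'meter':       [0, 0, 2, 0, 3, 0],
})

# Which meter codes does each building have, and how many distinct ones?
meters_per_building = demo_train.groupby('building_id')['meter'].unique()
n_meters = demo_train.groupby('building_id')['meter'].nunique()

print(meters_per_building)
print(n_meters.value_counts())  # how many buildings have 1, 2, ... meters
```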

### What does the distribution of weather look like?

In [None]:
sns.histplot(weather_df['air_temperature'].dropna(), kde=True)  # distplot is deprecated in recent seaborn releases
plt.show()

### We have 3 dataframes, but they should be merged into one so we can feed it to a model.
### The metadata is joined to the training data on building_id, which also brings in each building's site_id. The weather is then joined on site_id and timestamp.

In [None]:
all_df = pd.merge(train_df, meta_df, on = 'building_id', how = 'left')
all_df.head()

In [None]:
weather_df['timestamp'] = pd.to_datetime(weather_df['timestamp']) # Convert weather to the correct format before merging
all_df = pd.merge(all_df, weather_df, on = ['site_id', 'timestamp'], how = 'left')
all_df['date'] = all_df['timestamp'].dt.date
all_df.head()
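Left merges can silently duplicate rows (if the right-hand key is not unique) or leave rows unmatched. The `validate` and `indicator` arguments of `pd.merge` are one way to catch both. This is a sketch on tiny made-up frames; `demo_train` and `demo_meta` are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

# Hypothetical miniature versions of train_df and meta_df.
demo_train = pd.DataFrame({'building_id': [0, 0, 1], 'meter_reading': [10.0, 12.0, 7.5]})
demo_meta = pd.DataFrame({'building_id': [0, 1], 'site_id': [0, 3]})

merged = pd.merge(
    demo_train, demo_meta,
    on='building_id', how='left',
    validate='many_to_one',  # raises if building_id is duplicated in demo_meta
    indicator=True,          # adds a '_merge' column describing each row's match
)

# A left merge against a unique key should not change the row count.
assert len(merged) == len(demo_train)

# Rows from the left frame that found no match on the right.
unmatched = (merged['_merge'] == 'left_only').sum()
print(f"unmatched rows: {unmatched}")
```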

### Now we have the data in a format we can use, so we now need to think of some problems we can solve with this data. It is a good idea to Google what relevant news stories there are around power consumption, buildings, and weather. Then also look to see if there is any relevant research and papers about the topic. 

### Some ideas I came up with are:
### 1. New York recently came up with a tax on inefficient buildings. How will this tax affect power consumption?
### 2. How will climate change affect power consumption?
### 3. What is the most effective way to transition to renewable power? What combination of solar, wind and batteries would be the most cost-effective and robust to prolonged bad weather?



### Then consider some relevant factors, such as: How difficult will it be to answer the question? What external data can be brought in? Is there any relevant research? How much will the topic impress the judges?

### 1. We would likely need some electricity/building pricing information, or data from before and after the tax/policy change, to be able to solve it. This is an active topic of discussion in NYC, so there are probably many different opinions that can be discussed in a good narrative. There might be difficulties since the data is noisy, and there might not be a visible reaction to changes in electricity prices (risky). Maybe there are some other policies which can be better examined, but if this can be pulled off, it will be a top contender.
### 2. This is very doable, even with just linear regression. Not much external data is required, just some estimates of the effects of climate change. Just create a model which predicts power usage from weather, and adjust the weather to match climate change predictions. Might be less impressive to the judges, but has serious potential to win if done right.
### 3. Solvable, but harder than 2. Would need pricing data on renewable power, and a way to convert weather to power generation. Very impressive if you manage to solve it (a similar idea won us the Championship).

### Let's take a crack at #2 since it is the easiest. 
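The core of idea #2 fits in a few lines: fit usage against temperature, shift the temperatures, and compare predictions. The sketch below uses synthetic data and a straight-line fit, which is a simplification (real energy use often rises at both hot and cold extremes). The 2 degree shift is a made-up scenario, not a climate projection.

```python
import numpy as np

# Synthetic data: usage rises linearly with temperature, plus noise.
rng = np.random.default_rng(1)
temperature = rng.uniform(0, 35, 500)                    # daily mean temperature, deg C
usage = 50 + 1.5 * temperature + rng.normal(0, 3, 500)   # invented power usage

# Fit usage = intercept + slope * temperature.
slope, intercept = np.polyfit(temperature, usage, deg=1)

warming = 2.0  # hypothetical temperature increase, deg C
baseline = intercept + slope * temperature
scenario = intercept + slope * (temperature + warming)

# Average change in predicted usage under the warming scenario.
change = (scenario - baseline).mean()
print(f"slope: {slope:.2f}, predicted change: {change:.2f}")
```

With a purely linear model the predicted change is just `slope * warming`; the comparison becomes more interesting once the model has nonlinear or interaction terms.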

### Causation
Counterfactuals are one of the easiest ways to show causation. The goal of a counterfactual approach is to estimate how **Y** (energy use) would be different had **X** (climate) been something else. There are two common ways of establishing a counterfactual with data. 

- Matching: 
Matching-based approaches try to replicate a controlled experiment. We can match buildings in cold weather to similar buildings in warm weather, and compare the differences in energy use within each matched pair.
https://github.com/benmiroglio/pymatch
- Model Based Counterfactuals:
The idea here is to "learn" a counterfactual with a model. One way to do this is to forecast energy use assuming the temperature stays the same, and then compare that forecast to the real results once the temperature changes. This isn't a great example, since the weather is always changing and the approach works best with point-in-time interventions, but it illustrates how to use it.
https://github.com/tcassou/causal_impact
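To make the matching idea concrete, here is a minimal sketch of nearest-neighbour matching on a single covariate, written in plain NumPy. It does not use the pymatch API (which, as far as I know, matches on propensity scores), and all the data, including the "true effect" of 5, is synthetic and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
square_feet = rng.uniform(10, 100, n)   # building size in thousands of sq ft (synthetic)
warm = rng.random(n) < 0.5              # True if the building is in a warm climate
true_effect = 5.0                       # invented effect of a warm climate on energy use
energy = 2.0 * square_feet + true_effect * warm + rng.normal(0, 1, n)

warm_idx = np.flatnonzero(warm)
cold_idx = np.flatnonzero(~warm)

# For each warm building, find the cold building closest in size (with replacement).
dist = np.abs(square_feet[warm_idx][:, None] - square_feet[cold_idx][None, :])
nearest_cold = cold_idx[dist.argmin(axis=1)]

# Average energy difference within matched pairs estimates the effect of climate.
matched_effect = (energy[warm_idx] - energy[nearest_cold]).mean()
print(f"matched estimate: {matched_effect:.2f} (true effect: {true_effect})")
```

With several covariates you would match on a distance across all of them, or on a propensity score, instead of a single column.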

In [None]:
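# Daily averages. Note: 'meter' is the meter-type code (0-3), not the energy reading stored in 'meter_reading'.
data = all_df.groupby(['date', 'monthday'])[['meter']].mean() \
        .join(all_df.groupby(['date', 'monthday'])['air_temperature'].mean()).sort_values(['date']).reset_index()
data.head()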

In [None]:
data['air_temperature'].iloc[200:500].plot()
plt.plot([264,264],[14,28])

In [None]:
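from causal_impact import CausalImpact
# Weekend indicator, rescaled so it sits in roughly the same range as the daily 'meter' mean.
data['weekend'] = (pd.to_datetime(data.date).dt.dayofweek >= 5).astype(int) / 100 + 0.66
data_time = data[['meter','weekend']].iloc[200:350].rename({'meter':'y','weekend':'x1'}, axis = 1).reset_index(drop = True)

# The intervention is at row 264 of data, which is row 64 after slicing from row 200.
ci = CausalImpact(data_time, 264-200)
ci.run(max_iter=1000)
ci.plot()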

In [None]:
data['meter'].iloc[200:500].plot()
plt.plot([264,264],[0.66,0.68])

In [None]:
from statsmodels.regression.linear_model import OLS
data['constant'] = 1
model = OLS(data['meter'], data[['air_temperature','weekend', 'constant']])
results = model.fit()
results.summary()

In [None]:
plt.plot(results.resid)

In [None]:
sns.histplot(results.resid, kde=True)  # distplot is deprecated in recent seaborn releases
plt.show()
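Residuals from a regression on daily data are often correlated from one day to the next, which makes the OLS standard errors too optimistic. The Durbin-Watson statistic is a quick check: values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation. The sketch below computes it by hand on synthetic data with deliberately autocorrelated noise. statsmodels also reports it in `results.summary()` and exposes it as `statsmodels.stats.stattools.durbin_watson`.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)

# AR(1) noise: each error carries over 80% of the previous one (synthetic).
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + noise

# Ordinary least squares with an explicit constant column.
X = np.column_stack([x, np.ones(n)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef

# Durbin-Watson: sum of squared successive differences over sum of squared residuals.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"Durbin-Watson: {dw:.2f}")
```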