<a href="https://colab.research.google.com/github/syphax/solar-data/blob/main/Clean_GMP_Solar_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook pre-processes raw downloads from https://greenmountainpower.com/account/usage and produces one cleansed file, suitable for further analysis by the `Solar Viz` notebook.

To run this script, you need access to Google Drive, and you need to copy the data from https://github.com/syphax/solar-data/tree/main/data to `/My Drive/Data/Solar` (or edit the path variable in the 2nd code block to point somewhere else).

_TODO: Load the data directly from the GitHub repo._



# Setup

In [1]:
import os
import re

from datetime import datetime

import numpy as np
import pandas as pd

import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# You can of course edit this to taste:

path = '/content/drive/MyDrive/Data/Solar/'

In [3]:
# This will require you to click through a couple windows to 
# give permission to access your GDrive.

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Load Data

This script preps data that was downloaded from [Green Mountain Power's website](https://greenmountainpower.com/account/usage/).

GMP has an excellent UI for reporting usage, and provides downloadable data in 15-minute increments (either CSV or Green Button XML). *Unfortunately* it only supports manual data downloads in chunks of at most 15 days. *Fortunately* it only takes a couple of minutes to download several months of data. It's just slightly too easy to bother automating for a single account.

Fields in the CSV downloads are:
* `ServiceAgreement`: Account info. Format is `Account Holder / Service / Service Acronym / Account Start Date / Account Status`
* `IntervalStart`: Timestamp; format is `yyyy-MM-dd-HH:mm:ss` (24-hour clock)
* `IntervalEnd`: Same, 15 minutes later. Redundant but explicit!
* `Quantity`: Amount of electricity generated or consumed during the interval, depending on the service
* `UnitOfMeasure`: `kWh`. I love that they have an explicit UoM field!
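
Because the timestamp format puts a hyphen between the date and the time, it is probably safest to give pandas an explicit format rather than let it guess. A minimal sketch, assuming 24-hour times; the two sample values are illustrative rather than read from a download:

```python
import pandas as pd

# Illustrative values in the documented layout (not read from a real file).
sample = pd.Series(['2021-11-07-00:00:00', '2022-02-15-23:45:00'])

# An explicit format handles the hyphen between the date and time parts.
parsed = pd.to_datetime(sample, format='%Y-%m-%d-%H:%M:%S')

print(parsed.dt.hour.tolist())
```

The result is timezone-naive, so it still carries whatever local-time quirks the source data has.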


In [4]:
path = '/content/drive/MyDrive/Data/Solar/'

raw_input_files = os.path.join(path, 'UsageData*.csv')
joined_input_file = os.path.join(path, 'full_dataset.csv')


In [5]:
# This should list the data files that you copied from https://github.com/syphax/solar-data/tree/main/data

!ls $path 

full_dataset.csv		 UsageData_2022-01-16_15Days.csv
MA				 UsageData_2022-01-31_1Days.csv
old				 UsageData_2022-02-01_15Days.csv
UsageData_2021-05-23_14Days.csv  UsageData_2022-02-15_14Days.csv
UsageData_2021-06-06_14Days.csv  UsageData_2022-03-01_15Days.csv
UsageData_2021-06-20_14Days.csv  UsageData_2022-03-16_14Days.csv
UsageData_2021-07-04_14Days.csv  UsageData_2022-03-30_14Days.csv
UsageData_2021-07-18_14Days.csv  UsageData_2022-04-13_14Days.csv
UsageData_2021-08-01_14Days.csv  UsageData_2022-04-27_14Days.csv
UsageData_2021-08-15_14Days.csv  UsageData_2022-05-11_14Days.csv
UsageData_2021-08-29_14Days.csv  UsageData_2022-05-25_14Days.csv
UsageData_2021-09-12_14Days.csv  UsageData_2022-06-08_14Days.csv
UsageData_2021-09-26_14Days.csv  UsageData_2022-06-22_14Days.csv
UsageData_2021-10-10_14Days.csv  UsageData_2022-07-06_14Days.csv
UsageData_2021-10-24_14Days.csv  UsageData_2022-07-20_14Days.csv
UsageData_2021-11-07_12Days.csv  UsageData_2022-08-03_14Days.csv
UsageData_2021-11-18_15Day

In [6]:
# This concatenates available data files. We will need to remove possible dupes, and check for completeness.

!cat $raw_input_files > $joined_input_file

In [7]:
df_energy_data_raw = pd.read_csv(joined_input_file)
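
As an alternative to `cat`, the files could be read one at a time and stacked in pandas, which avoids leaving each file's header line in the data. This is only a sketch: `load_usage_files` is a hypothetical helper, not part of this notebook.

```python
import glob

import pandas as pd


def load_usage_files(pattern):
    """Read every CSV matching `pattern` and stack them into one DataFrame.

    Hypothetical helper for illustration only.
    """
    # Date-stamped names sort chronologically, which keeps rows in time order.
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError('No files match {}'.format(pattern))

    # read_csv consumes each file's header, so no stray header rows remain.
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)
```

It would be called with the same wildcard as above, e.g. `load_usage_files(raw_input_files)`. Overlapping downloads would still produce duplicate rows, so the checks below remain necessary.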

# Check and Clean

In [8]:
# What fields are there?

df_energy_data_raw.dtypes

ServiceAgreement    object
IntervalStart       object
IntervalEnd         object
Quantity            object
UnitOfMeasure       object
dtype: object

In [9]:
# Make a clean field for kWh values

df_energy_data_raw['kWh'] = 0

df_energy_data_raw['kWh'] = np.where(df_energy_data_raw['UnitOfMeasure'] == 'kWh', df_energy_data_raw['Quantity'], 0)
df_energy_data_raw['kWh'] = df_energy_data_raw['kWh'].astype(np.float64)
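
`Quantity` most likely loads as `object` because the concatenated file still contains header lines from the individual downloads. One alternative sketch uses `pd.to_numeric` with `errors='coerce'`, which turns non-numeric text into `NaN` instead of raising. The small frame below is made up for illustration.

```python
import pandas as pd

# Made-up rows mimicking the concatenated file, including a leaked header row.
df = pd.DataFrame({
    'Quantity': ['0.06', '0.0', 'Quantity', '1.25'],
    'UnitOfMeasure': ['kWh', 'kWh', 'UnitOfMeasure', 'kWh'],
})

# Non-numeric strings (the header text) become NaN instead of raising.
qty = pd.to_numeric(df['Quantity'], errors='coerce')

# Keep values reported in kWh; anything else becomes 0, as in the cell above.
df['kWh'] = qty.where(df['UnitOfMeasure'] == 'kWh', 0.0)

print(df['kWh'].dtype)
```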

In [10]:
# Quick check of values in df:

df_energy_data_raw.groupby(['ServiceAgreement', 'UnitOfMeasure'], as_index=False).agg(cnt_records=('ServiceAgreement','count'), 
                                                                                      unique_dt=('IntervalStart','nunique'), 
                                                                                      kwh=('kWh','sum'))

Unnamed: 0,ServiceAgreement,UnitOfMeasure,cnt_records,unique_dt,kwh
0,"POOLER, MADELEINE / Interconnected Generation ...",kWh,48964,48664,10967.36
1,"POOLER, MADELEINE / Residential Net Metering /...",kWh,48964,48664,3136.19
2,"POOLER, MADELEINE / Residential Net Metering /...",kWh,48964,48664,9402.37
3,"POOLER, MADELEINE / Residential Water Heater /...",kWh,48964,48664,2683.81
4,ServiceAgreement,UnitOfMeasure,36,1,0.0


## Extract service level code

In [11]:
# Extract service level code

p = re.compile('.*/.*/(.*)/.*/.*')

sl = df_energy_data_raw['ServiceAgreement'].str.extract(p)

df_energy_data_raw['Service'] = sl[0].str.strip()
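
To see what the pattern captures, here is a small check on a placeholder string that follows the five-part `ServiceAgreement` layout (the account holder name is invented). A split-based version is shown next to it as a possible alternative.

```python
import pandas as pd

# Placeholder value following the documented five-part layout.
s = pd.Series([
    'DOE, JANE / Residential Net Metering / NGEN / 06-04-2021 12:00:00AM / Active'
])

# Same pattern as above: capture the third slash-separated part.
code_regex = s.str.extract(r'.*/.*/(.*)/.*/.*')[0].str.strip()

# Split-based alternative: take the third field and trim the spaces.
code_split = s.str.split('/').str[2].str.strip()

print(code_regex.tolist(), code_split.tolist())
```

Both approaches rely on the date part using hyphens rather than slashes, so there are exactly four `/` separators.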


## Correct timestamps

The timestamps in these files appear to be in local time. I discovered this when checking for dupes and finding extra duplicates around, e.g., 1-2 am on Nov 7, 2021, the night DST ended.

To ensure that the timestamps are aligned consistently, we need to adjust for DST. Otherwise, we'll see a funky offset when analyzing production by hour of day.
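
One way to do this adjustment is to localize the naive timestamps and convert them to UTC. The sketch below assumes the local zone is `America/New_York` and that rows are in chronological order, so that pandas can infer which copy of the repeated hour is which. The sample timestamps are made up to span the 2021-11-07 fall-back hour.

```python
import pandas as pd

# Made-up sample in file order: the 01:00-01:45 block appears twice.
raw = [
    '2021-11-07-00:45:00',
    '2021-11-07-01:00:00', '2021-11-07-01:15:00',
    '2021-11-07-01:30:00', '2021-11-07-01:45:00',
    '2021-11-07-01:00:00', '2021-11-07-01:15:00',
    '2021-11-07-01:30:00', '2021-11-07-01:45:00',
    '2021-11-07-02:00:00',
]

naive = pd.to_datetime(raw, format='%Y-%m-%d-%H:%M:%S')

# ambiguous='infer' uses row order to decide which repeated times are
# daylight time (first pass) and which are standard time (second pass).
local = naive.tz_localize('America/New_York', ambiguous='infer')

# In UTC every interval is distinct and evenly spaced.
utc = local.tz_convert('UTC')

print(utc.is_unique)
```

On the real data this would have to be applied per service, after removing file-overlap duplicates, since `'infer'` expects a single pass through the repeated hour.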

In [19]:
# Check for dupes

df_dupe_check = df_energy_data_raw.groupby(['Service', 'IntervalStart'], as_index=False).agg(cnt_dupes=('IntervalStart','count'))

df_dupe_check = df_dupe_check[df_dupe_check['cnt_dupes'] != 1]

df_dupe_records = df_energy_data_raw.merge(df_dupe_check, on=['Service', 'IntervalStart'], how='inner').sort_values(['IntervalStart', 'Service'])

In [23]:
display(df_dupe_records)

Unnamed: 0,ServiceAgreement,IntervalStart,IntervalEnd,Quantity,UnitOfMeasure,kWh,Service,cnt_dupes
60,"POOLER, MADELEINE / Interconnected Generation ...",2021-11-07-00:00:00,2021-11-07-00:15:00,0.0,kWh,0.00,INTC,2
61,"POOLER, MADELEINE / Interconnected Generation ...",2021-11-07-00:00:00,2021-11-07-00:15:00,0.0,kWh,0.00,INTC,2
20,"POOLER, MADELEINE / Residential Net Metering /...",2021-11-07-00:00:00,2021-11-07-00:15:00,0.06,kWh,0.06,N01,2
21,"POOLER, MADELEINE / Residential Net Metering /...",2021-11-07-00:00:00,2021-11-07-00:15:00,0.06,kWh,0.06,N01,2
0,"POOLER, MADELEINE / Residential Net Metering /...",2021-11-07-00:00:00,2021-11-07-00:15:00,0.0,kWh,0.00,NGEN,2
...,...,...,...,...,...,...,...,...
1999,"POOLER, MADELEINE / Residential Net Metering /...",2022-02-15-23:45:00,2022-02-16-00:00:00,0.04,kWh,0.04,N01,2
1806,"POOLER, MADELEINE / Residential Net Metering /...",2022-02-15-23:45:00,2022-02-16-00:00:00,0.0,kWh,0.00,NGEN,2
1807,"POOLER, MADELEINE / Residential Net Metering /...",2022-02-15-23:45:00,2022-02-16-00:00:00,0.0,kWh,0.00,NGEN,2
2190,"POOLER, MADELEINE / Residential Water Heater /...",2022-02-15-23:45:00,2022-02-16-00:00:00,0.0,kWh,0.00,RE03,2


So there are two reasons for dupes: either the downloaded data files overlap, or we have duplicate timestamps during the "fall back" hour at the end of Daylight Saving Time.
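
The two kinds of dupes can be told apart by asking whether a repeated timestamp falls in the ambiguous fall-back hour. A rough sketch on made-up rows, again assuming the `America/New_York` zone; the `dupe_kind` column is a name introduced here for illustration:

```python
import pandas as pd

# Made-up rows: one file-overlap dupe and one fall-back dupe.
df = pd.DataFrame({
    'Service': ['NGEN'] * 5,
    'IntervalStart': [
        '2021-11-06-12:00:00', '2021-11-06-12:00:00',  # overlap between files
        '2021-11-07-01:00:00', '2021-11-07-01:00:00',  # DST fall-back hour
        '2021-11-07-03:00:00',
    ],
    'Quantity': [0.5, 0.5, 0.0, 0.0, 0.0],
})

ts = pd.to_datetime(df['IntervalStart'], format='%Y-%m-%d-%H:%M:%S')

# ambiguous='NaT' yields NaT for wall-clock times that occur twice that day.
in_fallback_hour = ts.dt.tz_localize('America/New_York', ambiguous='NaT').isna()

# Rows that share a Service and timestamp with at least one other row.
repeated = df.duplicated(['Service', 'IntervalStart'], keep=False)

df['dupe_kind'] = 'unique'
df.loc[repeated & ~in_fallback_hour, 'dupe_kind'] = 'file overlap'
df.loc[repeated & in_fallback_hour, 'dupe_kind'] = 'DST fall back'

print(df['dupe_kind'].tolist())
```

If two files overlap on the fall-back hour itself, a timestamp can appear four times, so that case would need a closer look.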

In order to get aligned solar output data, we need to remove the duplicates caused by overlapping files and then adjust the timestamps for DST.

In [13]:
# Remove any errant header rows (header lines from the second and later files
# survive the concatenation because only the first one is read as the header)
df_energy_data_raw = df_energy_data_raw[df_energy_data_raw['ServiceAgreement'] != 'ServiceAgreement']

# To keep only the solar generation service, uncomment the next line:
# df_energy_data_raw = df_energy_data_raw[df_energy_data_raw['Service'] == 'NGEN']

sh0 = df_energy_data_raw.shape

# Drop rows that are identical in every column, e.g. from overlapping downloads.
# Caution: fall-back-hour rows with identical values are collapsed as well.
df_energy_data_raw = df_energy_data_raw.drop_duplicates()

sh1 = df_energy_data_raw.shape

cnt_dupes = sh0[0] - sh1[0]

print("Removed {:,} duplicate entries; {:,} left.".format(cnt_dupes, sh1[0]))

In [None]:
# Drop ServiceAgreement (which contains some PII):

df_energy_data_raw = df_energy_data_raw.drop('ServiceAgreement', axis=1)