<a href="https://colab.research.google.com/github/syphax/solar-data/blob/dev/nb/Clean_GMP_Solar_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro

This notebook pre-processes raw downloads from https://greenmountainpower.com/account/usage and produces one cleansed file, suitable for further analysis by the `Solar Viz` notebook.

To run this script, you need access to Google Drive, and you need to copy the data from https://github.com/syphax/solar-data/tree/main/data to `/My Drive/Data/Solar` (or edit the path variable in the 2nd code block to point somewhere else).

_TODO: Load the data directly from the GitHub repo._



# Setup

In [None]:
import os
import re

from datetime import datetime
import dateutil
import pytz

import numpy as np
import pandas as pd

import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# You can of course edit this to taste:

path = '/content/drive/MyDrive/Data/Solar/'

In [None]:
# This will require you to click through a couple windows to 
# give permission to access your GDrive.

from google.colab import drive
drive.mount('/content/drive')

# Load Data

This script preps data that was downloaded from [Green Mountain Power's website](https://greenmountainpower.com/account/usage/).

GMP has an excellent UI for reporting usage, and provides downloadable data in 15-minute increments (either CSV or Green Button XML). *Unfortunately* it only supports manual data downloads in chunks of at most 15 days. *Fortunately* it only takes a couple of minutes to download several months of data. It's just slightly too easy to bother automating for a single account.

Fields in the CSV downloads are:
* `ServiceAgreement`: Account info. Format is `Account Holder / Service / Service Acronym / Account Start Date / Account Status`
* `IntervalStart`: Timestamp; format is `yyyy-MM-dd-HH:mm:ss` (24-hour clock)
* `IntervalEnd`: Same, 15 minutes later. Redundant but explicit!
* `Quantity`: Amount of electricity generated
* `UnitOfMeasure`: `kWh`. I love that they have an explicit UoM field!


In [None]:
path = '/content/drive/MyDrive/Data/Solar/'

raw_input_files = os.path.join(path, 'UsageData*.csv')
joined_input_file = os.path.join(path, 'full_dataset.csv')


In [None]:
# This should list the data files that you copied from https://github.com/syphax/solar-data/tree/main/data

!ls $path 

In [None]:
# This concatenates available data files. We will need to remove possible dupes, and check for completeness.

!cat $raw_input_files > $joined_input_file

In [None]:
df_energy_data_raw = pd.read_csv(joined_input_file)
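
Because `cat` copies each file's header line into the joined file, any repeated headers end up as data rows. A minimal alternative sketch, assuming the same `UsageData*.csv` naming, reads each file with pandas and stacks the frames. The helper name `load_usage_files` is hypothetical and not part of the original notebook.

```python
import glob
import os

import pandas as pd


def load_usage_files(folder, pattern='UsageData*.csv'):
    """Read every matching CSV and stack the rows into one DataFrame.

    Hypothetical helper: each file is parsed on its own, so its header
    line is consumed as a header instead of showing up as a data row.
    """
    files = sorted(glob.glob(os.path.join(folder, pattern)))
    if not files:
        raise FileNotFoundError(f'No files matching {pattern} in {folder}')
    frames = [pd.read_csv(f) for f in files]
    return pd.concat(frames, ignore_index=True)
```

Calling `load_usage_files(path)` would then stand in for the `cat` and `pd.read_csv` cells. It also avoids relying on shell commands, which may matter when running outside Colab.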

# Check and Clean

In [None]:
# What fields are there?

df_energy_data_raw.dtypes

In [None]:
# Make a clean numeric field for kWh values.
# Rows whose unit is not kWh (such as any header lines repeated by the `cat` step) get 0.

df_energy_data_raw['kWh'] = np.where(df_energy_data_raw['UnitOfMeasure'] == 'kWh', df_energy_data_raw['Quantity'], 0)
df_energy_data_raw['kWh'] = df_energy_data_raw['kWh'].astype(np.float64)
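
If the joined file contains non-numeric entries in `Quantity`, the column is read as text. One way to sketch the same cleanup is `pd.to_numeric` with `errors='coerce'`, which turns non-numeric text into `NaN` rather than raising an error. The sample rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows: the middle one imitates a repeated header line.
df = pd.DataFrame({
    'Quantity': ['0.25', 'Quantity', '1.5'],
    'UnitOfMeasure': ['kWh', 'UnitOfMeasure', 'kWh'],
})

# Non-numeric text becomes NaN instead of raising an error.
qty = pd.to_numeric(df['Quantity'], errors='coerce')

# Keep the value only where the unit really is kWh; everything else is 0.
df['kWh'] = qty.where(df['UnitOfMeasure'] == 'kWh', 0.0).fillna(0.0)
```

The resulting `kWh` column is `float64`, with 0 in the imitation header row.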

In [None]:
# Quick check of values in df:

df_energy_data_raw.groupby(['ServiceAgreement', 'UnitOfMeasure'], as_index=False).agg(cnt_records=('ServiceAgreement','count'), 
                                                                                      unique_dt=('IntervalStart','nunique'), 
                                                                                      kwh=('kWh','sum'))

## Extract service level code

In [None]:
# Extract service level code

p = re.compile('.*/.*/(.*)/.*/.*')

sl = df_energy_data_raw['ServiceAgreement'].str.extract(p)

df_energy_data_raw['Service'] = sl[0].str.strip()

# Drop Service Agreement field, for compactness and because it contains PII
df_energy_data_raw = df_energy_data_raw.drop(['ServiceAgreement'], axis=1)
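
To illustrate what the pattern captures, here is a small sketch on a made-up `ServiceAgreement` value that follows the five-part layout described earlier. The name, service and code in the sample are invented, and the real values may look different.

```python
import pandas as pd

# Made-up value following the documented five-part layout.
s = pd.Series(['Jane Doe / Residential Solar / RS / 2020-01-15 / Active'])

# The single capture group picks out the third slash-separated part.
codes = s.str.extract(r'.*/.*/(.*)/.*/.*')[0].str.strip()

# Splitting on "/" and taking index 2 selects the same part.
codes_split = s.str.split('/').str[2].str.strip()
```

Both approaches return `RS` here. They agree only when the value contains exactly four slashes. If a later field such as the start date contained slashes of its own, the greedy regex would capture a different part, so it is worth spot-checking the extracted codes.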


## Correct timestamps

The timestamps in these files appear to be in local time. I discovered this when checking for dupes and finding extra duplicates between 1 and 2 am on Nov-07-2021, the hour that repeats when DST ends.

To ensure that the timestamps are aligned consistently, we need to adjust for DST. Otherwise, we'll see a funky offset when analyzing production by hour of day.
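
A small sketch of the problem, using three made-up wall-clock times around the 2021 "Fall Back" transition: 01:30 occurs twice that night, so pandas cannot tell which UTC instant it refers to, and `ambiguous='NaT'` marks it as missing.

```python
import pandas as pd

tz_str = 'America/New_York'

# 2021-11-07 01:30 happens twice locally (once in EDT, once in EST);
# the other two wall-clock times are unambiguous.
naive = pd.Series(pd.to_datetime([
    '2021-11-07 00:30:00',
    '2021-11-07 01:30:00',
    '2021-11-07 02:30:00',
]))

# Ambiguous times become NaT; the rest are converted to UTC.
utc = naive.dt.tz_localize(tz_str, ambiguous='NaT').dt.tz_convert('UTC')
```

The first and last values become 04:30 and 07:30 UTC, which are three hours apart even though the wall-clock times differ by only two. That extra hour is the repeated one.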

In [None]:
df_energy_data_raw

In [None]:
# Convert to UTC, marking ambiguous times (which occur during "Fall Back" events) as NaT so they can be dropped below

tz_str = 'America/New_York'

df_energy_data_raw['IntervalStart_utc'] = pd.to_datetime(df_energy_data_raw['IntervalStart'], format="%Y-%m-%d-%H:%M:%S", errors='coerce')
df_energy_data_raw['IntervalStart_utc'] = df_energy_data_raw['IntervalStart_utc'].dt.tz_localize(tz_str, ambiguous='NaT').dt.tz_convert('UTC')

df_energy_data_raw['IntervalEnd_utc'] = pd.to_datetime(df_energy_data_raw['IntervalEnd'], format="%Y-%m-%d-%H:%M:%S", errors='coerce')
df_energy_data_raw['IntervalEnd_utc'] = df_energy_data_raw['IntervalEnd_utc'].dt.tz_localize(tz_str, ambiguous='NaT').dt.tz_convert('UTC')


In [None]:
# Check how many records have ambiguous timestamps.
# This should be only a handful of records.

df_ambiguous_record_summary = df_energy_data_raw[(df_energy_data_raw['IntervalStart_utc'].isnull()) |
                   (df_energy_data_raw['IntervalEnd_utc'].isnull())].groupby('Service').agg(cnt=('Quantity','count'), kwh=('kWh','sum'))

print('Initial:')
display(df_ambiguous_record_summary)

df_energy_data_raw = df_energy_data_raw[(~df_energy_data_raw['IntervalStart_utc'].isnull()) &
                   (~df_energy_data_raw['IntervalEnd_utc'].isnull())]

df_ambiguous_record_summary = df_energy_data_raw[(df_energy_data_raw['IntervalStart_utc'].isnull()) |
                   (df_energy_data_raw['IntervalEnd_utc'].isnull())].groupby('Service').agg(cnt=('Quantity','count'), kwh=('kWh','sum'))

print('After cleansing:')
display(df_ambiguous_record_summary)


## Remove remaining duplicate records

In [None]:
# Check for any remaining dupes

df_dupe_check = df_energy_data_raw.groupby(['Service', 'IntervalStart'], as_index=False).agg(cnt_dupes=('IntervalStart','count'))

df_dupe_check = df_dupe_check[df_dupe_check['cnt_dupes'] != 1]

df_dupe_records = df_energy_data_raw.merge(df_dupe_check, on=['Service', 'IntervalStart'], how='inner').sort_values(['IntervalStart', 'Service'])

display(df_dupe_records)

In [None]:
# Remove duplicates

sh0 = df_energy_data_raw.shape

df_energy_data_raw = df_energy_data_raw.drop_duplicates()

sh1 = df_energy_data_raw.shape

cnt_dupes = sh0[0] - sh1[0]

print("Removed {:,} duplicate entries; {:,} left.".format(cnt_dupes, sh1[0]))
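
The concatenation step notes that the data should also be checked for completeness. A minimal sketch of such a check, assuming one reading is expected every 15 minutes, compares the UTC start times against a full `pd.date_range`. The helper `find_missing_intervals` and the sample data are hypothetical.

```python
import pandas as pd


def find_missing_intervals(starts_utc, freq='15min'):
    """Return the expected interval starts that are absent from starts_utc."""
    starts = pd.DatetimeIndex(starts_utc.dropna().unique())
    expected = pd.date_range(starts.min(), starts.max(), freq=freq)
    return expected.difference(starts)


# Made-up series with the 00:30 reading missing.
sample = pd.Series(pd.to_datetime(
    ['2021-06-01 00:00', '2021-06-01 00:15', '2021-06-01 00:45'], utc=True))
missing = find_missing_intervals(sample)
```

In the notebook this could be applied per service, for example to `IntervalStart_utc` within each `Service` group. The ambiguous "Fall Back" records dropped earlier would be expected to appear as gaps.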

## Final cleanup

In [None]:
# Define timestamps in terms of local time zone:

In [None]:
# Filter and re-order columns