## About this notebook

For our partial CEMS methodology, we use hourly CEMS data from one part of a plant to shape data for another part of the same plant. This approach relies on two assumptions:
1. That all units within a single subplant have a similar hourly operational profile
2. That all subplants within a single plant have a similar hourly operational profile
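
As a minimal sketch of what "shaping" means here, assume a simple proportional allocation (the actual partial CEMS implementation may differ). The hourly profile of a unit that reports to CEMS is normalized to sum to one and used to distribute a known monthly total for a part of the plant that does not report. The function name `shape_monthly_total`, the flat fallback for an all-zero profile, and all values are hypothetical.

```python
import pandas as pd


def shape_monthly_total(hourly_profile: pd.Series, monthly_total: float) -> pd.Series:
    """Distribute monthly_total across hours in proportion to hourly_profile."""
    profile_sum = hourly_profile.sum()
    if profile_sum == 0:
        # Assumption for this sketch: with no signal in the profile, fall back to a flat shape
        return pd.Series(monthly_total / len(hourly_profile), index=hourly_profile.index)
    return hourly_profile / profile_sum * monthly_total


# hypothetical hourly fuel consumption for a unit that reports to CEMS
reporting_unit = pd.Series(
    [10.0, 30.0, 40.0, 20.0],
    index=pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00", "2020-01-01 03:00"]
    ),
)

# hypothetical monthly total for the part of the plant without CEMS data
shaped = shape_monthly_total(reporting_unit, monthly_total=200.0)
print(shaped)
```

The shaped series always sums to the monthly total, so any error introduced by this approach is in the hourly distribution only. That is why the similarity of hourly profiles is the property tested in this notebook.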

We want to test these assumptions by answering:
 - Within a single subplant, how similar are the fuels/nameplate capacities of each unit?
 - Within a single subplant, how similar are the hourly profiles of each unit?
 - Within a single plant, how similar are the hourly profiles of each subplant?

Additional questions include:
 - Does the correlation between units/subplants vary by the fuel type of the plant?
 - Are the units that report to CEMS structurally different from the units at the same plant that do not report to CEMS?


In [1]:
# import packages
import pandas as pd
import os
import plotly.express as px

%reload_ext autoreload
%autoreload 2

# Tell Python where to look for modules.
import sys
sys.path.append('../../../open-grid-emissions/src/')

import download_data
import load_data
from column_checks import get_dtypes
from filepaths import *
import impute_hourly_profiles
import data_cleaning
import output_data
import emissions
import validation
import gross_to_net_generation
import eia930

year = 2020
path_prefix = f"{year}/"

## How similar are units within each subplant?

In [2]:
# load unit-level CEMS data
eia923_allocated, primary_fuel_table = data_cleaning.clean_eia923(year, False)
cems = data_cleaning.clean_cems(year, False, primary_fuel_table)



    Checking that there are no missing energy source codes associated with non-zero fuel consumption...  OK
    Checking that fuel and emissions values are positive...  OK
 
Missing factors for FC prime movers are currently expected
      prime_mover_code energy_source_code boiler_bottom_type boiler_firing_type
20403               FC                LFG                NaN                NaN
13248               FC                 NG                NaN                NaN
19938               FC                OBG                NaN                NaN
 
 
 

In [13]:
# identify subplants with multiple units
multi_unit_subplants = cems[["plant_id_eia", "subplant_id", "unitid"]].drop_duplicates()
multi_unit_subplants = multi_unit_subplants.loc[
    multi_unit_subplants.duplicated(subset=["plant_id_eia", "subplant_id"]),
    ["plant_id_eia", "subplant_id"],
].drop_duplicates()

# only keep subplants with multiple units
multi_unit_subplants = multi_unit_subplants.merge(
    cems, how="inner", on=["plant_id_eia", "subplant_id"]
)[["plant_id_eia", "subplant_id", "report_date", "datetime_utc", "unitid", "fuel_consumed_mmbtu"]]

# create a numeric unit ID that restarts at 0 within each subplant
multi_unit_subplants["unitid_num"] = multi_unit_subplants.groupby(["plant_id_eia", "subplant_id", "unitid"]).ngroup()
multi_unit_subplants["unitid_num"] = multi_unit_subplants["unitid_num"] - (
    multi_unit_subplants.groupby(["plant_id_eia", "subplant_id"])["unitid_num"].transform("min")
)
multi_unit_subplants

|  | plant_id_eia | subplant_id |
|---|---|---|
| 24145 | 3 | 4 |
| 41713 | 3 | 5 |
| 221928 | 533 | 1 |
| 331009 | 7710 | 0 |
| 347856 | 7710 | 1 |
| ... | ... | ... |
| 27569584 | 55284 | 4 |
| 27585712 | 55284 | 5 |
| 27624520 | 3935 | 3 |
| 27635536 | 6264 | 1 |
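
The filtering and renumbering steps in the cell above can be illustrated on a toy table. This is a sketch with made-up plant, subplant, and unit IDs. It uses `transform("nunique")` to find multi-unit subplants, which is a different route to the same result as the `duplicated()` approach used in the notebook, and then applies the same `ngroup()`-minus-group-minimum trick to number units from 0 within each subplant.

```python
import pandas as pd

# hypothetical unit inventory: plant 1 has a two-unit subplant and a one-unit subplant,
# plant 2 has a single three-unit subplant
toy = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2, 2, 2],
        "subplant_id": [0, 0, 1, 0, 0, 0],
        "unitid": ["A", "B", "C", "X", "Y", "Z"],
    }
)

# count distinct units per subplant and keep only subplants with more than one unit
n_units = toy.groupby(["plant_id_eia", "subplant_id"])["unitid"].transform("nunique")
multi = toy[n_units > 1].copy()

# ngroup() numbers every (plant, subplant, unit) combination across the whole table;
# subtracting the minimum within each subplant makes the numbering restart at 0
multi["unitid_num"] = multi.groupby(["plant_id_eia", "subplant_id", "unitid"]).ngroup()
multi["unitid_num"] -= multi.groupby(["plant_id_eia", "subplant_id"])["unitid_num"].transform("min")

print(multi)
```

A numeric ID that restarts at 0 keeps the pivoted table narrow: the number of columns equals the largest number of units in any one subplant rather than the number of distinct unit names in the whole dataset.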


In [55]:
# calculate the pairwise correlation between units for each subplant-month
corr = (
    multi_unit_subplants.pivot(
        index=["plant_id_eia", "subplant_id", "report_date", "datetime_utc"],
        columns="unitid_num", values="fuel_consumed_mmbtu",
    )
    .groupby(["plant_id_eia", "subplant_id", "report_date"])
    .corr()
    .dropna(how="all", axis=0)
    .stack()
)
# drop self-correlations (a unit paired with itself)
corr = corr[corr.index.get_level_values(3) != corr.index.get_level_values(4)]

# remove duplicate pairwise correlations
corr.index = corr.index.droplevel(4)
corr = corr.reset_index().drop_duplicates(subset=["plant_id_eia", "subplant_id", "report_date", 0])

corr = corr.round(2)
corr

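|  | plant_id_eia | subplant_id | report_date | unitid_num | 0 |
|---|---|---|---|---|---|
| 0 | 3 | 4 | 2020-01-01 | 0 | 0.95 |
| 2 | 3 | 4 | 2020-02-01 | 0 | 0.40 |
| 4 | 3 | 4 | 2020-03-01 | 0 | 0.87 |
| 6 | 3 | 4 | 2020-04-01 | 0 | -0.02 |
| 8 | 3 | 4 | 2020-05-01 | 0 | -0.01 |
| ... | ... | ... | ... | ... | ... |
| 25736 | 61028 | 0 | 2020-08-01 | 0 | -0.14 |
| 25738 | 61028 | 0 | 2020-09-01 | 0 | 0.95 |
| 25740 | 61028 | 0 | 2020-10-01 | 0 | -0.01 |
| 25742 | 61028 | 0 | 2020-11-01 | 0 | 0.99 |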
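
The cell above removes mirrored pairs by dropping rows whose correlation value repeats within a subplant-month. One possible alternative, sketched here on made-up data, is to keep only the strict upper triangle of the correlation matrix, so that each unit pair appears exactly once even when two different pairs happen to have the same correlation. This is an illustration, not the method used in the notebook, and the series are hypothetical.

```python
import numpy as np
import pandas as pd

# hypothetical hourly fuel consumption for three units in one subplant
wide = pd.DataFrame(
    {
        0: [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],     # unit 0
        1: [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],   # unit 1 moves with unit 0
        2: [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],     # unit 2 moves against both
    }
)

corr_matrix = wide.corr()

# True above the diagonal only: excludes self-correlations and mirrored pairs
upper = np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1)

# masked cells become NaN; dropna() makes the result independent of stack()'s NaN handling
pairs = corr_matrix.where(upper).stack().dropna()

print(pairs)
```

In this toy case the pairs (0, 2) and (1, 2) have the same correlation, so deduplicating on the value alone would keep only one of them, while the triangle mask keeps both.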


In [72]:
corr.groupby(["plant_id_eia","subplant_id"])[0].mean().mean()

0.6718558124488478
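
The summary statistic above is a mean of per-subplant means, which gives every subplant equal weight regardless of how many unit pairs or months it contributes. The toy example below, with made-up values and a hypothetical `corr` column name, shows how this can differ from a pooled mean over all rows.

```python
import pandas as pd

# hypothetical correlations: three rows for one subplant, one row for another
toy_corr = pd.DataFrame(
    {
        "plant_id_eia": [1, 1, 1, 2],
        "subplant_id": [0, 0, 0, 0],
        "corr": [0.9, 0.9, 0.9, 0.1],
    }
)

# each subplant counts once: average of 0.9 and 0.1
mean_of_means = toy_corr.groupby(["plant_id_eia", "subplant_id"])["corr"].mean().mean()

# each row counts once: subplants with more rows carry more weight
pooled_mean = toy_corr["corr"].mean()

print(round(mean_of_means, 2), round(pooled_mean, 2))
```

Which weighting is appropriate depends on the question being asked. The mean of means describes a typical subplant, while the pooled mean describes a typical unit pair in a typical month.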

In [73]:
px.histogram(corr, x=0,  histnorm="percent", nbins=20, width=600).update_xaxes(dtick=0.25, range=[-1,1])

In [69]:
data_to_plot = multi_unit_subplants[multi_unit_subplants["plant_id_eia"] == 612]

px.line(data_to_plot, x="datetime_utc", y="fuel_consumed_mmbtu", color="unitid")

### Do certain types of subplants have greater correlations?

## How similar are subplants within a single plant?

In [76]:
# aggregate cems data to subplant level
cems_sub = cems.groupby(["plant_id_eia","subplant_id","report_date","datetime_utc"])["fuel_consumed_mmbtu"].sum().reset_index()

# identify plants with multiple subplants
multi_subplant_plants = cems[["plant_id_eia","subplant_id"]].drop_duplicates()
multi_subplant_plants = multi_subplant_plants.loc[multi_subplant_plants.duplicated(subset=["plant_id_eia"]), ["plant_id_eia"]].drop_duplicates()

# only keep plants with multiple subplants
multi_subplant_plants = multi_subplant_plants.merge(cems_sub, how="inner", on=["plant_id_eia"])[["plant_id_eia","subplant_id","report_date","datetime_utc","fuel_consumed_mmbtu"]]


In [80]:
# calculate the pairwise correlation between subplants for each plant-month
corr = (
    multi_subplant_plants.pivot(
        index=["plant_id_eia", "report_date", "datetime_utc"],
        columns="subplant_id", values="fuel_consumed_mmbtu",
    )
    .groupby(["plant_id_eia", "report_date"])
    .corr()
    .dropna(how="all", axis=0)
    .stack()
)
# drop self-correlations (a subplant paired with itself)
corr = corr[corr.index.get_level_values(2) != corr.index.get_level_values(3)]

# remove duplicate pairwise correlations
corr.index = corr.index.droplevel(3)
corr = corr.reset_index().drop_duplicates(subset=["plant_id_eia", "report_date", 0])

corr = corr.round(2)
corr

|  | plant_id_eia | report_date | subplant_id | 0 |
|---|---|---|---|---|
| 0 | 3 | 2020-01-01 | 0 | 0.71 |
| 1 | 3 | 2020-01-01 | 0 | -0.04 |
| 2 | 3 | 2020-01-01 | 0 | -0.10 |
| 3 | 3 | 2020-01-01 | 0 | 0.16 |
| 4 | 3 | 2020-01-01 | 0 | 0.65 |
| ... | ... | ... | ... | ... |
| 78766 | 61242 | 2020-04-01 | 0 | 0.43 |
| 78768 | 61242 | 2020-05-01 | 0 | 0.47 |
| 78770 | 61242 | 2020-06-01 | 0 | 0.39 |
| 78772 | 61242 | 2020-07-01 | 0 | 0.57 |


In [97]:
corr.groupby(["plant_id_eia","subplant_id"])[0].mean().mean()

0.392552191053462

In [83]:
px.histogram(corr, x=0,  histnorm="percent", nbins=10, width=600).update_xaxes(dtick=0.25, range=[-1,1])