# Hourly eGRID Data Pipeline Overview

## Data Used
EPA Continuous Emissions Monitoring System (CEMS) data
 - What is it: Measured hourly gross generation, fuel consumption, and emissions data for emitting power generation units > 25 MW
 - How we use it: Primary source for hourly emissions and generation data

EIA Form 923
 - What is it: Reported monthly net generation and fuel consumption data for power generators > 1 MW
 - How we use it: To convert gross generation data from CEMS to net generation, and to calculate emissions that are not reported to CEMS

EIA Form 860
 - What is it: Inventory of all generators and plants and their static characteristics
 - How we use it: To transform and aggregate the data reported in CEMS and EIA-923 based on plant and generator characteristics

EPA-EIA Power Sector Data Crosswalk
 - What is it: Maps EPA plant IDs and unit IDs to EIA plant IDs and generator IDs
 - How we use it: To match data between CEMS and EIA-923

EIA Form 930 / Hourly Electric Grid Monitor
 - What is it: Reported hourly net generation by fuel category, demand, and interchange for each Balancing Area in the U.S. 
 - How we use it: To assign an hourly profile to the monthly generation and fuel data reported in EIA-923

EPA eGRID database
 - What is it: Reports annual-level generation and emissions statistics at the plant and BA level 
 - How we use it: To validate our outputs

## Process
1. Download data, including CEMS (via PUDL), EIA Forms 860 and 923 (via PUDL), EPA-EIA Power Sector Data Crosswalk, EIA-930 data
2. Clean monthly generation and fuel data from EIA-923
    - Allocate monthly net generation and fuel consumption data reported for each plant prime mover to each plant generator
    - Calculate monthly emissions for each generator based on its fuel consumption and fuel source
    - Remove data for non-grid-connected plants and plants in Puerto Rico
    - Assign a primary fuel type and balancing authority location to each generator
3. Clean hourly generation, fuel, and emissions data from CEMS
    - Remove data for non-grid-connected plants, plants in Puerto Rico, and certain steam-only units
    - Assign a monthly "report_date" to each hourly observation based on the date of the local timestamp (this allows us to match the data to EIA-923 report dates)
    - Assign a fuel type to each unit
    - Fill in missing hourly emissions data using the assigned fuel type and reported hourly fuel consumption data
    - Remove all observations for each unit-month when no operation is reported for that unit in that month
    - Allocate hourly data for combined heat and power plants between electricity generation and steam production
    - Remove data for units for which we are unable to fill missing emissions data
4. Crosswalk the CEMS data to the EIA-923 data to identify the generator-months for which no hourly data are reported in CEMS
    - Use the EPA-EIA Power Sector Data Crosswalk
    - Assign subplant groupings to data
5. Convert hourly gross generation in CEMS to hourly net generation
    - Aggregate CEMS gross generation to the monthly level to match the monthly-reported net generation
    - Apply several methodologies to calculate gross-to-net (GTN) generation conversion factors
    - Apply GTN factors to convert hourly gross generation to hourly net generation
6. Assign static plant characteristics to CEMS and EIA data to allow for data aggregation and matching with EIA-930
    - Assign generator and plant-level primary fuel
    - Assign Balancing Authority and State to each plant
    - Assign fuel categories to each plant that match EIA-930 categories
7. Clean and reconcile EIA-930 data
    - Fix timezone/timestamp issues with raw 930 data
    - Perform physics-based reconciliation so that data satisfies conservation of energy equations
8. Calculate residual net generation profiles for each BA-fuel category by comparing EIA-930 and CEMS hourly net generation data
9. Assign monthly EIA-923 data an hourly profile based on the residual net generation profile
10. Concatenate the shaped hourly EIA-923 data to the hourly CEMS data
11. Run validation checks on processed data
12. Aggregate the hourly data to the BA level and output
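
Step 3 assigns each hourly CEMS observation a monthly `report_date` based on its local timestamp. The snippet below is only a minimal sketch of that idea, not the code used in `data_cleaning`. The example timestamps and the `America/Chicago` timezone are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical hourly observations stamped in UTC
hourly = pd.DataFrame({
    "operating_datetime_utc": pd.to_datetime(
        ["2020-02-01 04:00:00", "2020-02-01 06:00:00"], utc=True
    )
})

# Assumed local timezone for the plant (illustrative only)
local_tz = "America/Chicago"

# Convert to local time, then truncate to the first day of the local month
local_time = hourly["operating_datetime_utc"].dt.tz_convert(local_tz)
hourly["report_date"] = pd.to_datetime(local_time.dt.strftime("%Y-%m-01"))

print(hourly)
```

The first observation falls on January 31 in local time even though its UTC date is February 1, so it is matched to the January EIA-923 report.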


## Outputs
 - Processed hourly subplant-level data
 - Aggregated hourly data for each BA (total emissions, total generation, generated carbon intensity)

## Output Validation Checks
 - Aggregate data to annual level and compare with published eGRID results
 - Check that aggregated heat rates and emissions rates by fuel type are within reasonable ranges for each BA
 - Plant-level checks for anomalous data
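
As a rough illustration of the heat rate check listed above, the sketch below computes a heat rate for each BA-fuel group and flags values outside a plausible range. The data and the bounds `LOWER` and `UPPER` are hypothetical; the thresholds actually used in `validation` may differ.

```python
import pandas as pd

# Hypothetical data already aggregated by BA and fuel type
agg = pd.DataFrame({
    "ba_code": ["CISO", "CISO", "MISO"],
    "fuel_category": ["natural_gas", "coal", "coal"],
    "fuel_consumed_for_electricity_mmbtu": [8000.0, 10500.0, 500.0],
    "net_generation_mwh": [1000.0, 1000.0, 1000.0],
})

# Heat rate in MMBtu per MWh
agg["heat_rate_mmbtu_per_mwh"] = (
    agg["fuel_consumed_for_electricity_mmbtu"] / agg["net_generation_mwh"]
)

# Assumed plausibility bounds (illustrative, not taken from the pipeline)
LOWER, UPPER = 5.0, 20.0
agg["heat_rate_flag"] = ~agg["heat_rate_mmbtu_per_mwh"].between(LOWER, UPPER)

print(agg[["ba_code", "fuel_category", "heat_rate_mmbtu_per_mwh", "heat_rate_flag"]])
```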

In [1]:
# import packages
import numpy as np
import pandas as pd
import plotly.express as px
from IPython.display import display

In [142]:
%reload_ext autoreload
%autoreload 2

# Tell python where to look for modules. 
import sys
sys.path.append('../../hourly-egrid/')

# import local modules
import src.data_cleaning as data_cleaning
import src.gross_to_net_generation as gross_to_net_generation
import src.load_data as load_data
import src.validation as validation

# Specify the year for analysis
This data pipeline can only be run for a single year at a time.

In [3]:
year = 2020

# 1. Download data
Checks whether all necessary files are already downloaded and downloads any that are missing.

In [4]:
############### PUDL Database ######################

load_data.download_pudl_data(zenodo_url = 'https://zenodo.org/record/6349861/files/pudl-v0.6.0-2022-03-12.tgz')

################# eGRID data #########################

# the 2019 and 2020 data appear to be hosted on different urls
egrid_files_to_download = ['https://www.epa.gov/sites/default/files/2021-02/egrid2019_data.xlsx', 
                           'https://www.epa.gov/system/files/documents/2022-01/egrid2020_data.xlsx']

load_data.download_egrid_files(egrid_files_to_download)

############# EIA-930 data #####################

load_data.download_eia930_data(years_to_download=[year])
load_data.download_chalendar_files()

########## Power Sector Data Crosswalk #############
# NOTE: Check for new releases at https://github.com/USEPA/camd-eia-crosswalk

load_data.download_epa_psdc(psdc_url='https://github.com/USEPA/camd-eia-crosswalk/releases/download/v0.2.1/epa_eia_crosswalk.csv')


PUDL data already downloaded
egrid2019_data.xlsx already downloaded
egrid2020_data.xlsx already downloaded
2020_Jan_Jun data already downloaded
2020_Jul_Dec data already downloaded
EBA_elec.csv already downloaded
EBA_raw.csv already downloaded
epa_eia_crosswalk.csv already downloaded


# 2. Clean EIA-923 Generation and Fuel Data at the Monthly Level

In [5]:
eia923_allocated, primary_fuel_table = data_cleaning.clean_eia923(year)

# Add primary fuel data to each generator
eia923_allocated = eia923_allocated.merge(primary_fuel_table, how='left', on=['plant_id_eia','generator_id'])

eia923_allocated

Removing 0 plants that are not grid-connected
Removing 0 plants located in the following states: ['PR']


Unnamed: 0,report_date,plant_id_eia,generator_id,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,energy_source_code,plant_primary_fuel
0,2020-01-01,1,1,,,,,,DFO,WND
1,2020-01-01,1,2,,,,,,DFO,WND
2,2020-01-01,1,3,,,,,,DFO,WND
3,2020-01-01,1,5,,,,,,DFO,WND
4,2020-01-01,1,WT1,15.1585,133.0,133.0,0.0,0.0,WND,WND
...,...,...,...,...,...,...,...,...,...,...
286693,2020-12-01,64816,GEN1,,,,,,NG,NG
286694,2020-12-01,64816,GEN2,,,,,,NG,NG
286695,2020-12-01,64816,GEN3,,,,,,NG,NG
286696,2020-12-01,64836,CATAL,,,,,,SUN,SUN


In [144]:
# perform checks on allocated data
# fuel consumption and co2 emissions should be positive
negative_test = validation.test_for_negative_values(eia923_allocated, ['fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted'])

# if net generation is positive, fuel consumption should be non zero
missing_fuel_test = validation.test_for_missing_fuel(eia923_allocated, 'net_generation_mwh')

# fuel consumed for electricity should not exceed total fuel consumed
chp_allocation_test = validation.test_chp_allocation(eia923_allocated)

# check for missing co2 data
missing_co2_test = validation.test_for_missing_co2(eia923_allocated)

# check for generators with no data
missing_data_test = validation.test_for_missing_data(eia923_allocated, ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted'])

# check for missing energy source code
missing_esc_test = validation.test_for_missing_energy_source_code(eia923_allocated)

# check heat rates
validation.test_for_outlier_heat_rates(eia923_allocated)

Heat Rate Test


# 3. Clean Hourly Data from CEMS

In [7]:
# NOTE: all of the functions in this section could be run by calling clean_cems()
cems = data_cleaning.clean_cems(year)

Removing 45 plants that are not grid-connected
Removing 0 plants located in the following states: ['PR']
Removing 3 units that only produce steam and do not report to EIA




removing 7325217 observations from cems for unit-months where no data reported
Unable to calculate emissions for the following plants_units: []


In [8]:
cems.head(5)

Unnamed: 0,plant_id_eia,unitid,cems_id,operating_datetime_utc,operating_time_hours,gross_load_mw,gross_generation_mwh,steam_load_1000_lbs,fuel_consumed_mmbtu,co2_mass_tons,co2_mass_measurement_code,plant_id_epa,unit_id_epa,report_date,energy_source_code,fuel_consumed_for_electricity_mmbtu,co2_mass_tons_adjusted
0,3,1,3_1,2020-01-01 06:00:00+00:00,0.0,0.0,0.0,0.0,0.0,0.0,,3,1,2020-01-01,NG,0.0,0.0
1,3,1,3_1,2020-01-01 07:00:00+00:00,0.0,0.0,0.0,0.0,0.0,0.0,,3,1,2020-01-01,NG,0.0,0.0
2,3,1,3_1,2020-01-01 08:00:00+00:00,0.0,0.0,0.0,0.0,0.0,0.0,,3,1,2020-01-01,NG,0.0,0.0
3,3,1,3_1,2020-01-01 09:00:00+00:00,0.0,0.0,0.0,0.0,0.0,0.0,,3,1,2020-01-01,NG,0.0,0.0
4,3,1,3_1,2020-01-01 10:00:00+00:00,0.0,0.0,0.0,0.0,0.0,0.0,,3,1,2020-01-01,NG,0.0,0.0


In [145]:
# fuel consumption and co2 emissions should be positive
cems_negative_test = validation.test_for_negative_values(cems, ['fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted', 'gross_generation_mwh'])

# if gross generation is positive, fuel consumption should be non zero
cems_missing_fuel_test = validation.test_for_missing_fuel(cems,'gross_generation_mwh')

# fuel consumed for electricity should not exceed total fuel consumed
cems_chp_allocation_test = validation.test_chp_allocation(cems)

# check for missing co2 data
cems_missing_co2_test = validation.test_for_missing_co2(cems)

# check for missing energy source code
cems_missing_esc_test = validation.test_for_missing_energy_source_code(cems)



# 4. Crosswalk CEMS and EIA data

In [10]:
# flag any generator-months for which we already have cems data
# TODO: Ensure correct identification when there is a m:1 or m:m mapping between units and generators
#eia923_allocated = eia923_allocated.drop(columns=['data_source'])
eia923_allocated = data_cleaning.identify_emissions_data_source(cems, eia923_allocated, year)

# what percent of generators are in CEMS vs not
(eia923_allocated.groupby('data_source').sum() / eia923_allocated.groupby('data_source').sum().sum(axis=0)).round(3)

Unnamed: 0_level_0,plant_id_eia,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons,co2_mass_tons_adjusted
data_source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
cems,0.12,0.573,0.514,0.535,0.816,0.897
eia_only,0.88,0.427,0.486,0.465,0.184,0.103


# 5. Convert CEMS Hourly Gross Generation to Hourly Net Generation

In [11]:
# convert hourly gross generation to net generation
# TODO: Implement methodological hierarchy
#cems = cems.drop(columns=['net_generation_mwh','gross_to_net_ratio','net_gen_method'])
cems = data_cleaning.convert_gross_to_net_generation(cems, eia923_allocated)

# for generators where there is heat input but no gross generation reported, impute hourly net generation based on reported EIA values
# TODO: Need to match data on unit level rather than plant level
cems = data_cleaning.impute_missing_hourly_net_generation(cems, eia923_allocated)

# what percent of net generation was allocated using each method?
cems.groupby('net_gen_method', dropna=False).sum()['net_generation_mwh'] / cems['net_generation_mwh'].sum()

net_gen_method
annual_regression                0.001728
imputed_from_fuel_consumption    0.000928
monthly_ratio                    0.994740
net_equals_gross                 0.002605
Name: net_generation_mwh, dtype: float32
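
Almost all net generation above is assigned using the `monthly_ratio` method. The sketch below shows one way such a ratio could be computed and applied: divide the monthly EIA-923 net generation by the monthly sum of CEMS gross generation, then scale each hour by that ratio. This is an assumption about the general approach, not a copy of `convert_gross_to_net_generation`, and all values are made up.

```python
import pandas as pd

# Hypothetical hourly gross generation for one unit (CEMS-like)
cems_hourly = pd.DataFrame({
    "report_date": pd.to_datetime(["2020-01-01"] * 3 + ["2020-02-01"] * 2),
    "gross_generation_mwh": [100.0, 120.0, 80.0, 50.0, 150.0],
})

# Hypothetical monthly net generation for the same unit (EIA-923-like)
eia_monthly = pd.DataFrame({
    "report_date": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "net_generation_mwh": [285.0, 180.0],
})

# Aggregate gross generation to the monthly level and compute the ratio
monthly_gross = cems_hourly.groupby("report_date", as_index=False)["gross_generation_mwh"].sum()
gtn = monthly_gross.merge(eia_monthly, on="report_date", how="left")
gtn["gross_to_net_ratio"] = gtn["net_generation_mwh"] / gtn["gross_generation_mwh"]

# Apply each month's ratio to every hour in that month
cems_hourly = cems_hourly.merge(
    gtn[["report_date", "gross_to_net_ratio"]], on="report_date", how="left"
)
cems_hourly["net_generation_mwh"] = (
    cems_hourly["gross_generation_mwh"] * cems_hourly["gross_to_net_ratio"]
)

print(cems_hourly)
```

By construction, the hourly net generation in each month sums back to the monthly value reported to EIA.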

# 6. Assign static characteristics to CEMS and EIA data to aid in aggregation

In [12]:
# assign a BA code and state code to each plant
eia923_allocated = data_cleaning.assign_ba_code_to_plant(eia923_allocated, year)

# assign a fuel category to each plant based on what is most likely to match with the category used in EIA-930
# TODO: Add two different fuel categories (one for 930, one that is more specific)
eia923_allocated = data_cleaning.assign_fuel_category_to_ESC(eia923_allocated, fuel_category_name='fuel_group_eia930', esc_column='plant_primary_fuel')

In [13]:
# add information that we need to aggregate the data and match to eia930
#cems = cems.drop(columns=['ba_code','ba_code_physical','state'])
#cems = cems.drop(columns=['distribution_flag'])
#cems = cems.drop(columns=['plant_primary_fuel'])

cems = data_cleaning.assign_ba_code_to_plant(cems, year)

# add a flag about whether the plant is distribution connected
cems = data_cleaning.identify_distribution_connected_plants(cems, year, voltage_threshold_kv=60)

# add a plant primary fuel and a fuel category for eia930
cems = cems.merge(primary_fuel_table.drop_duplicates(subset='plant_id_eia')[['plant_id_eia','plant_primary_fuel']], how='left', on='plant_id_eia')
cems = data_cleaning.assign_fuel_category_to_ESC(cems, fuel_category_name='fuel_group_eia930', esc_column='plant_primary_fuel')
cems = cems.rename(columns={'fuel_category':'fuel_category_eia930'})

In [14]:
# add a subplant id to each dataset
subplant_crosswalk = pd.read_csv('../data/output/subplant_crosswalk.csv')
cems = cems.merge(subplant_crosswalk[['plant_id_eia','unitid','subplant_id']], how='left', on=['plant_id_eia','unitid'])
eia923_allocated = eia923_allocated.merge(subplant_crosswalk[['plant_id_eia','generator_id','subplant_id']], how='left', on=['plant_id_eia','generator_id'])

In [15]:
# test to make sure that there is a complete subplant mapping
cems_missing_subplant_test = cems[cems['subplant_id'].isna()]
if not cems_missing_subplant_test.empty:
    print(f'Warning: There are {len(cems_missing_subplant_test)} CEMS records without a subplant ID. See `cems_missing_subplant_test` for details')
    

eia_missing_subplant_test = eia923_allocated[eia923_allocated['subplant_id'].isna()]
if not eia_missing_subplant_test.empty:
    print(f'Warning: There are {len(eia_missing_subplant_test)} EIA-923 records without a subplant ID. See `eia_missing_subplant_test` for details')
    



In [92]:
columns_to_export = ['plant_id_eia', 'operating_datetime_utc', 'gross_generation_mwh', 'net_generation_mwh', 'fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu', 'co2_mass_tons','co2_mass_tons_adjusted']
cems_output = cems[columns_to_export].groupby(['plant_id_eia','operating_datetime_utc']).sum().reset_index()
cems_output.to_csv('../data/final_outputs/plant_files/hourly_plant_data_from_cems.csv', index=False)

In [95]:
# export plant frame
plant_static_columns = ['plant_id_eia','plant_primary_fuel', 'fuel_category_eia930', 'ba_code', 'ba_code_physical', 'state', 'distribution_flag']
plant_frame = cems[plant_static_columns].drop_duplicates(subset='plant_id_eia')
plant_frame.to_csv('../data/final_outputs/plant_files/plant_static_attributes.csv', index=False)

# 7. Clean and Reconcile EIA-930 data
Load raw EIA-930 data, fix timestamp issues, perform physics-based reconciliation

Currently implemented in `notebooks/930_lag` and the `gridemissions` repository

Output: `data/eia930/chalendar/EBA_adjusted_elec.csv`

In [None]:
# not yet implemented here

# 8. Calculate Residual Net Generation Profile

Currently implemented in `notebooks/clean_930_compare_residual`
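
For orientation, here is a minimal sketch of what a residual calculation could look like: subtract the CEMS hourly net generation from the EIA-930 hourly net generation for the same BA, fuel category, and hour. The column name `net_generation_mwh_930`, the example values, and the choice to clip negative residuals to zero are all assumptions; the referenced notebook may handle these cases differently.

```python
import pandas as pd

hours = pd.to_datetime(
    ["2020-01-01 06:00", "2020-01-01 07:00", "2020-01-01 08:00"], utc=True
)
keys = ["ba_code", "fuel_category", "datetime_utc"]

# Hypothetical EIA-930 net generation for one BA-fuel
eia930 = pd.DataFrame({
    "ba_code": "AEC",
    "fuel_category": "coal",
    "datetime_utc": hours,
    "net_generation_mwh_930": [100.0, 90.0, 40.0],
})

# Hypothetical CEMS net generation aggregated to the same BA-fuel
cems_agg = pd.DataFrame({
    "ba_code": "AEC",
    "fuel_category": "coal",
    "datetime_utc": hours,
    "net_generation_mwh": [60.0, 50.0, 55.0],
})

merged = eia930.merge(cems_agg, on=keys, how="left")

# Residual = generation seen by EIA-930 that is not accounted for in CEMS
# (clipping at zero is an assumption made for this sketch)
merged["residual"] = (
    merged["net_generation_mwh_930"] - merged["net_generation_mwh"].fillna(0)
).clip(lower=0)

print(merged[keys + ["residual"]])
```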

In [21]:
# save CEMS data as CSV to be used to calculate residual profile
# NOTE: Once this code is integrated into the pipeline, this will not be necessary
# filter the data to export only the data/columns that we need for the residual calculation
cems_for_export = cems.loc[cems[['co2_mass_tons_adjusted','gross_generation_mwh','fuel_consumed_for_electricity_mmbtu']].sum(axis=1) > 0, ['ba_code', 'ba_code_physical',
       'energy_source_code','plant_primary_fuel','fuel_category_eia930','distribution_flag',
       'plant_id_eia', 'unitid', 'operating_datetime_utc',
       'net_generation_mwh']]

cems_for_export.to_csv(f'../data/output/cems_{year}_for_residual.csv', index=False)

In [None]:
# not yet implemented here

# 9. Assign hourly profile to monthly data


In [16]:
# create a separate dataframe containing only the generators for which we do not have CEMS data
monthly_eia_data_to_distribute = eia923_allocated[(eia923_allocated['data_source'] == 'eia_only') & ~(eia923_allocated['fuel_consumed_mmbtu'].isna())]

In [18]:
# load profile data and format for use in the pipeline
#hourly_profiles = eia930.load_chalendar_for_pipeline('../data/eia930/chalendar/EBA_adjusted_elec.csv', year)
hourly_profiles = pd.read_csv('../data/output/residual_profiles.csv', parse_dates=['report_date'])
hourly_profiles

Unnamed: 0,ba_code,fuel_category,datetime_utc,datetime_local,report_date,residual_scaled
0,AEC,coal,2020-01-01 06:00:00+00:00,2020-01-01 00:00:00-06:00,2020-01-01,16.263592
1,AEC,coal,2020-01-01 07:00:00+00:00,2020-01-01 01:00:00-06:00,2020-01-01,79.554775
2,AEC,coal,2020-01-01 08:00:00+00:00,2020-01-01 02:00:00-06:00,2020-01-01,80.880073
3,AEC,coal,2020-01-01 09:00:00+00:00,2020-01-01 03:00:00-06:00,2020-01-01,82.018135
4,AEC,coal,2020-01-01 10:00:00+00:00,2020-01-01 04:00:00-06:00,2020-01-01,82.967691
...,...,...,...,...,...,...
2608843,CISO,biomass,2021-01-01 03:00:00+00:00,2020-12-31 19:00:00-08:00,2020-12-01,561.905635
2608844,CISO,biomass,2021-01-01 04:00:00+00:00,2020-12-31 20:00:00-08:00,2020-12-01,555.685760
2608845,CISO,biomass,2021-01-01 05:00:00+00:00,2020-12-31 21:00:00-08:00,2020-12-01,550.856234
2608846,CISO,biomass,2021-01-01 06:00:00+00:00,2020-12-31 22:00:00-08:00,2020-12-01,554.061442


In [20]:
# for fuel categories that exist in the EIA-923 data but not in EIA-930, create flat profiles to add to the hourly profiles from 930
# TODO: Identify for which BA-fuels a flat profile was created
# TODO: Is there a better assumption than flat?
ba_list = list(monthly_eia_data_to_distribute['ba_code'].dropna().unique())

# create an hourly datetime series in local time for each ba/fuel type
hourly_profiles_to_add = []

# for each ba
for ba in ba_list:
    # get a list of fuel categories that exist in that BA
    ba_fuel_list = list(monthly_eia_data_to_distribute.loc[monthly_eia_data_to_distribute['ba_code'] == ba,'fuel_category'].unique())
    for fuel in ba_fuel_list:
        # if there is no data for that fuel type in the eia930 data, create a flat profile
        if len(hourly_profiles[(hourly_profiles['ba_code'] == ba) & (hourly_profiles['fuel_category'] == fuel)]) == 0:
            print(f'Adding flat profile for {ba} {fuel}')
            # create a dataframe
            df_temp = pd.DataFrame(index=pd.date_range(start=f'{year-1}-12-31 00:00:00', end=f'{year+1}-01-01 23:00:00', freq='H', tz='UTC', name='datetime_utc'),
                                    columns=['ba_code','fuel_category']).reset_index()                  
            df_temp['ba_code'] = ba
            df_temp['fuel_category'] = fuel
            # use the same column name as the residual profiles ('residual_scaled') so that
            # the flat profile is picked up when the monthly data is distributed
            df_temp['residual_scaled'] = 1.0
            df_temp['datetime_local'] = df_temp['datetime_utc'].dt.tz_convert(data_cleaning.ba_timezone(ba=ba, type='local'))
            # create a report date column
            df_temp['report_date'] = df_temp['datetime_local'].astype(str).str[:7]
            df_temp['report_date'] = pd.to_datetime(df_temp['report_date'])
            hourly_profiles_to_add.append(df_temp)

hourly_profiles_to_add = pd.concat(hourly_profiles_to_add, axis=0, ignore_index=True)

Adding flat profile for CEA natural_gas
Adding flat profile for CEA hydro
Adding flat profile for CEA wind
Adding flat profile for TEPC other
Adding flat profile for BANC petroleum
Adding flat profile for WALC petroleum
Adding flat profile for PACE petroleum
Adding flat profile for SPA petroleum
Adding flat profile for SPA natural_gas
Adding flat profile for SPA wind
Adding flat profile for SPA solar
Adding flat profile for BPAT coal
Adding flat profile for PSCO other
Adding flat profile for WACM petroleum
Adding flat profile for WACM other
Adding flat profile for WAUW wind
Adding flat profile for AVA petroleum
Adding flat profile for FPL hydro
Adding flat profile for FPL coal
Adding flat profile for SCL other
Adding flat profile for SCL natural_gas
Adding flat profile for GVL solar
Adding flat profile for FMPP nan
Adding flat profile for SCEG petroleum
Adding flat profile for SEPA solar
Adding flat profile for SEPA other
Adding flat profile for HECO petroleum
Adding flat profile for H

In [21]:
# concat the flat profiles to the hourly profiles
hourly_profiles = pd.concat([hourly_profiles,hourly_profiles_to_add], axis=0)

In [22]:
hourly_eia_data = data_cleaning.distribute_monthly_eia_data_to_hourly(monthly_eia_data_to_distribute, hourly_profiles, 'residual_scaled')
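
The sketch below illustrates one plausible way to shape a monthly total with an hourly profile: normalize the profile so it sums to one within each BA-fuel-month, then multiply by the monthly value. It is an assumption about the general technique, not the body of `distribute_monthly_eia_data_to_hourly`, and the data are hypothetical.

```python
import pandas as pd

keys = ["ba_code", "fuel_category", "report_date"]

# Hypothetical hourly profile for one BA-fuel-month (four hours for brevity)
profile = pd.DataFrame({
    "ba_code": "AEC",
    "fuel_category": "coal",
    "report_date": pd.to_datetime(["2020-01-01"] * 4),
    "datetime_utc": pd.to_datetime(
        ["2020-01-01 06:00", "2020-01-01 07:00", "2020-01-01 08:00", "2020-01-01 09:00"],
        utc=True,
    ),
    "residual_scaled": [10.0, 30.0, 40.0, 20.0],
})

# Hypothetical monthly total to distribute
monthly = pd.DataFrame({
    "ba_code": ["AEC"],
    "fuel_category": ["coal"],
    "report_date": pd.to_datetime(["2020-01-01"]),
    "net_generation_mwh": [200.0],
})

# Each hour's share of the month's profile
profile["profile_share"] = (
    profile["residual_scaled"] / profile.groupby(keys)["residual_scaled"].transform("sum")
)

# Spread the monthly total across hours in proportion to the profile
hourly = profile.merge(monthly, on=keys, how="inner")
hourly["net_generation_mwh"] = hourly["net_generation_mwh"] * hourly["profile_share"]

print(hourly[["datetime_utc", "net_generation_mwh"]])
```

With a flat profile (every hour equal to 1.0), the same calculation spreads the monthly total evenly across all hours in the month.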

In [98]:
columns_for_output = ['ba_code', 'fuel_category', 'datetime_utc', 'net_generation_mwh', 'fuel_consumed_mmbtu', 'fuel_consumed_for_electricity_mmbtu', 'co2_mass_tons', 'co2_mass_tons_adjusted']
hourly_eia_data[columns_for_output].to_csv('../data/final_outputs/plant_files/hourly_data_distributed_from_eia.csv', index=False)

# 10. Aggregate CEMS data to BA-fuel and combine with hourly shaped EIA data


In [47]:
# aggregate cems data by BA and fuel type
cems_ba_fuel = cems.groupby(['ba_code','fuel_category_eia930','operating_datetime_utc']).sum()[['gross_generation_mwh','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
cems_ba_fuel['data_source'] = 'CEMS'

# rename the datetime and fuel category columns to match the hourly EIA data
cems_ba_fuel = cems_ba_fuel.rename(columns={'operating_datetime_utc':'datetime_utc','fuel_category_eia930':'fuel_category'})


In [48]:
combined_data = pd.concat([cems_ba_fuel, hourly_eia_data.drop(columns=['datetime_local','report_date'])], axis=0)
# only keep relevant columns
combined_data = combined_data[['ba_code','fuel_category','datetime_utc','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted','data_source']]
combined_data

Unnamed: 0,ba_code,fuel_category,datetime_utc,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,data_source
0,AEC,coal,2020-03-01 06:00:00+00:00,0.0,0.0,0.0,0.0,0.0,CEMS
1,AEC,coal,2020-03-01 07:00:00+00:00,0.0,0.0,0.0,0.0,0.0,CEMS
2,AEC,coal,2020-03-01 08:00:00+00:00,0.0,0.0,0.0,0.0,0.0,CEMS
3,AEC,coal,2020-03-01 09:00:00+00:00,0.0,0.0,0.0,0.0,0.0,CEMS
4,AEC,coal,2020-03-01 10:00:00+00:00,0.0,0.0,0.0,0.0,0.0,CEMS
...,...,...,...,...,...,...,...,...,...
3059275,SEC,other,2021-01-01 19:00:00+00:00,,,,,,EIA
3059276,SEC,other,2021-01-01 20:00:00+00:00,,,,,,EIA
3059277,SEC,other,2021-01-01 21:00:00+00:00,,,,,,EIA
3059278,SEC,other,2021-01-01 22:00:00+00:00,,,,,,EIA


## Export hourly data to final CSVs (one for each BA)

In [79]:
for ba in list(combined_data.ba_code.unique()):

    # filter the data for a single BA
    ba_table = combined_data[combined_data['ba_code'] == ba].drop(columns='ba_code')

    # convert the datetime_utc column back to a datetime
    ba_table['datetime_utc'] = pd.to_datetime(ba_table['datetime_utc'], utc=True)

    # combine the data from CEMS and EIA for each fuel-hour
    ba_table = ba_table.groupby(['fuel_category', 'datetime_utc']).sum().reset_index()

    # calculate a total for the BA
    ba_total = ba_table.groupby(['datetime_utc']).sum()[['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
    ba_total['fuel_category'] = 'total'

    # concat the totals to the fuel-specific totals
    ba_table = pd.concat([ba_table,ba_total], axis=0, ignore_index=True)

    # calculate a generated emission rate
    # 0/0 (no emissions and no generation) becomes 0; a positive value divided by zero generation is left as NaN
    ba_table['generated_co2_rate_lb_per_mwh'] = (ba_table['co2_mass_tons'] * 2000 / ba_table['net_generation_mwh']).fillna(0).replace(np.inf, np.nan)
    ba_table['adjusted_generated_co2_rate_lb_per_mwh'] = (ba_table['co2_mass_tons_adjusted'] * 2000 / ba_table['net_generation_mwh']).fillna(0).replace(np.inf, np.nan)

    ba_table = ba_table.pivot(index='datetime_utc', columns='fuel_category')

    # round all values to one decimal place
    ba_table = ba_table.round(1)

    # flatten the multilevel column into a single column name like data_fuelname
    ba_table.columns = ['_'.join(col) for col in ba_table.columns.values]

    # export to a csv
    ba_table.to_csv(f'../data/final_outputs/ba_files/{ba}.csv')
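
To make the emission rate calculation in the export loop concrete, the small example below applies the same expression to three hypothetical rows: a normal hour, an hour with no emissions and no generation, and an hour with emissions but zero net generation. The factor of 2000 converts short tons to pounds.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: normal hour, idle hour, emissions with zero net generation
example = pd.DataFrame({
    "co2_mass_tons": [500.0, 0.0, 5.0],
    "net_generation_mwh": [1000.0, 0.0, 0.0],
})

# Same expression as in the export loop: tons * 2000 lb/ton / MWh
rate = (
    example["co2_mass_tons"] * 2000 / example["net_generation_mwh"]
).fillna(0).replace(np.inf, np.nan)

print(rate)
```

The first row works out to 500 × 2000 / 1000 = 1000 lb/MWh. The idle hour (0/0) is reported as 0, and the third row is left as NaN because a rate is undefined when there is no generation.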

In [69]:
def graph_hourly_data_by_fuel_category(hourly_data, ba, column_name, fuel_category_name, plot_type):
    fuel_color = {'natural_gas':'sienna',
              'coal':'black',
              'nuclear':'red',
              'biomass':'green',
              'geothermal':'orange',
              'wind':'blue',
              'solar':'gold',
              'petroleum':'purple',
              'hydro':'skyblue',
              'other':'lightgrey',
              'waste':'pink'}

    fuel_order = ['nuclear','geothermal','hydro','other','coal','biomass','petroleum','waste','solar','wind','natural_gas']

    data_to_graph = hourly_data[hourly_data['ba_code'] == ba]
    #data_to_graph = data_to_graph.groupby(['datetime_utc','fuel_category','data_source']).sum().reset_index()

    if plot_type == 'area':
        plot = px.area(data_to_graph, 
                x='datetime_utc', 
                y=column_name, 
                color=fuel_category_name, 
                color_discrete_map=fuel_color, 
                facet_col='data_source', 
                template='plotly_white',
                title=f'Hourly {column_name} data for {ba} by fuel type',
                category_orders={fuel_category_name:fuel_order}).update_traces(line={'width':0})
    elif plot_type == 'line':
        # unlike the area plot, keep the default line width so that the lines remain visible
        plot = px.line(data_to_graph, 
                x='datetime_utc', 
                y=column_name, 
                color=fuel_category_name, 
                color_discrete_map=fuel_color, 
                facet_col='data_source', 
                template='plotly_white',
                title=f'Hourly {column_name} data for {ba} by fuel type',
                category_orders={fuel_category_name:fuel_order})
    else:
        raise ValueError(f"plot_type must be 'area' or 'line', got {plot_type!r}")
    
    return plot
        

In [None]:
ba = 'CISO'

graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='net_generation_mwh', fuel_category_name='fuel_category', plot_type='area')

In [None]:
graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='co2_mass_tons', fuel_category_name='fuel_category', plot_type='area')

In [None]:
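# NOTE: combined_data as built above does not include a 'co2_rate_lb_per_mwh_generated' column; it must be calculated before this cell is run
graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='co2_rate_lb_per_mwh_generated', fuel_category_name='fuel_category', plot_type='line')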

# Data Quality Metrics

### Compare monthly totals for each plant from each data source

In [106]:
# for plants with data reported in CEMS, measure how far the CEMS values deviate from the values reported to EIA
cems_plant_monthly = cems.groupby(['plant_id_eia','subplant_id','report_date'], dropna=False).sum()[['gross_generation_mwh','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
gf_plant_monthly = eia923_allocated.groupby(['plant_id_eia','subplant_id','report_date'], dropna=False).sum().reset_index()
compare_cems_eia = gf_plant_monthly.merge(cems_plant_monthly, how='inner', on=['plant_id_eia','subplant_id','report_date'], suffixes=("_eia",'_cems'))


for column in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']:
    compare_cems_eia[f'{column}_pctdiff'] = ((compare_cems_eia[f'{column}_cems'].replace(0,0.1) - compare_cems_eia[f'{column}_eia'].replace(0,0.1)) / compare_cems_eia[f'{column}_eia'].replace(0,0.1)).round(3)

compare_cems_eia = compare_cems_eia.set_index(['plant_id_eia','subplant_id','report_date'])
compare_cems_eia = compare_cems_eia.reindex(sorted(compare_cems_eia.columns), axis=1)

In [113]:
# identify where there are differences between reported CEMS and EIA values for the same subplant-month
compare_cems_eia[~compare_cems_eia.filter(like='pctdiff').sum(axis=1).between(-0.05,0.05)]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,co2_mass_tons_adjusted_cems,co2_mass_tons_adjusted_eia,co2_mass_tons_adjusted_pctdiff,co2_mass_tons_cems,co2_mass_tons_eia,co2_mass_tons_pctdiff,fuel_consumed_for_electricity_mmbtu_cems,fuel_consumed_for_electricity_mmbtu_eia,fuel_consumed_for_electricity_mmbtu_pctdiff,fuel_consumed_mmbtu_cems,fuel_consumed_mmbtu_eia,fuel_consumed_mmbtu_pctdiff,gross_generation_mwh,net_generation_mwh_cems,net_generation_mwh_eia,net_generation_mwh_pctdiff
plant_id_eia,subplant_id,report_date,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
3,0.0,2020-01-01,2832.987499,0.00000,28328.875,2832.987499,0.00000,28328.875,47664.050781,0.00,476639.508,47664.050781,0.00,476639.508,3310.000000,3139.284912,0.000000,31391.849
3,0.0,2020-02-01,1829.956300,437.16042,3.186,1829.956300,437.16042,3.186,30769.275391,7480.50,3.113,30769.275391,7480.50,3.113,2375.000000,2315.075195,1758.034399,0.317
3,1.0,2020-01-01,1684.845575,0.00000,16847.456,1684.845575,0.00000,16847.456,28352.875000,0.00,283527.750,28352.875000,0.00,283527.750,1390.250000,1318.546997,0.000000,13184.470
3,2.0,2020-01-01,38.568750,0.00000,384.687,38.568750,0.00000,384.687,377.487488,0.00,3773.875,377.487488,0.00,3773.875,0.000000,0.000000,0.000000,0.000
3,2.0,2020-04-01,59086.781306,145578.94587,-0.594,59086.781306,145578.94587,-0.594,575902.312500,1472555.25,-0.609,575902.312500,1472555.25,-0.609,61442.000000,59161.578125,57613.831717,0.027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
61242,,2020-08-01,8324.379800,5545.95600,0.501,8324.379800,5545.95600,0.501,140059.281250,94900.00,0.476,140059.281250,94900.00,0.476,15776.870117,8143.190918,8143.191000,-0.000
61242,,2020-09-01,3650.107021,4576.43640,-0.202,3650.107021,4576.43640,-0.202,61413.335938,78310.00,-0.216,61413.335938,78310.00,-0.216,7094.189941,3359.802490,6719.605000,-0.500
61242,,2020-10-01,5151.623887,3957.55680,0.302,5151.623887,3957.55680,0.302,86687.671875,67720.00,0.280,86687.671875,67720.00,0.280,10034.290039,2905.441406,5810.883000,-0.500
61242,,2020-11-01,2159.491985,3099.89136,-0.303,2159.491985,3099.89136,-0.303,36337.781250,53044.00,-0.315,36337.781250,53044.00,-0.315,4248.689941,2275.763428,4551.527000,-0.500
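
The very large `pctdiff` values in the table above come from the `.replace(0, 0.1)` step in the comparison cell, which substitutes 0.1 for zero values so that the division is defined. The worked example below reproduces the `co2_mass_tons_pctdiff` for plant 3, subplant 0.0, January 2020 using plain Python; the helper name `pct_diff` is hypothetical.

```python
def pct_diff(cems_value: float, eia_value: float) -> float:
    """Fractional difference of CEMS relative to EIA, with zeros replaced by 0.1."""
    cems_adj = 0.1 if cems_value == 0 else cems_value
    eia_adj = 0.1 if eia_value == 0 else eia_value
    return round((cems_adj - eia_adj) / eia_adj, 3)

# Values from the first row of the comparison table
co2_cems = 2832.987499
co2_eia = 0.0

result = pct_diff(co2_cems, co2_eia)
print(result)
```

Because the EIA value is zero, the denominator becomes 0.1 and the result is (2832.987499 - 0.1) / 0.1, or roughly 28328.875. Values of this size indicate a zero in one source rather than a meaningful percentage.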


## Compare Our Results to eGRID

In [99]:
# Aggregate total calculated values
###################################

# Aggregate cems and eia data by plant id, then combine
cems_plant_annual = cems.groupby(['ba_code','state','plant_id_eia'], dropna=False).sum()[['net_generation_mwh','fuel_consumed_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
eia_plant_annual = monthly_eia_data_to_distribute.groupby(['ba_code','state','plant_id_eia'], dropna=False).sum()[['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
plant_annual_total = pd.concat([cems_plant_annual,eia_plant_annual], axis=0)
# group any plants that have records from both datasets
plant_annual_total = plant_annual_total.groupby(['ba_code','state','plant_id_eia'], dropna=False).sum().reset_index()

# add an eGRID plant ID
plant_annual_total = validation.add_egrid_plant_id(plant_annual_total, from_id='eia', to_id='egrid')

# Load the eGRID plant table
egrid_plant = load_data.load_egrid_plant_file(year)
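
The aggregate-then-combine step above can be illustrated on toy data. This is a minimal sketch, not the pipeline's own code: `cems_toy` and `eia_toy` are hypothetical stand-ins for `cems` and `monthly_eia_data_to_distribute`, and all values are invented.

```python
import pandas as pd

# Hypothetical toy inputs; values are invented for illustration only
cems_toy = pd.DataFrame({
    "ba_code": ["CISO", "CISO"],
    "plant_id_eia": [1, 2],
    "net_generation_mwh": [100.0, 50.0],
})
eia_toy = pd.DataFrame({
    "ba_code": ["CISO", None],
    "plant_id_eia": [2, 3],
    "net_generation_mwh": [25.0, 10.0],
})

keys = ["ba_code", "plant_id_eia"]
# Stack the two sources, then sum again so that a plant present in both
# sources (plant 2 here) ends up with a single row.
# dropna=False keeps plants whose BA code is missing (plant 3 here).
combined = (
    pd.concat([cems_toy, eia_toy], axis=0)
    .groupby(keys, dropna=False)["net_generation_mwh"]
    .sum()
    .reset_index()
)
print(combined)
```

Without `dropna=False`, the plant with no BA code would silently drop out of the totals.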

### Identify plants in eGRID missing from our results

These plants seem to have retired before 2020

In [100]:
# identify any plants that are in eGRID but not in our totals
plant_not_in_calc = list(set(egrid_plant['plant_id_eia'].unique()) - set(plant_annual_total['plant_id_eia'].unique()))

# Which plants are included in eGRID but are missing from our calculations?
# (filter on plant_id_eia, the same ID that was used to build plant_not_in_calc)
missing_from_calc = egrid_plant[egrid_plant['plant_id_eia'].isin(plant_not_in_calc)]

# see if any of these plants are retired
generators_eia860 = load_data.load_pudl_table('generators_eia860', year=year)
missing_from_calc.merge(generators_eia860.groupby('plant_id_eia')['retirement_date'].unique().reset_index(), how='left', on='plant_id_eia')

Unnamed: 0,ba_code,state,plant_id_egrid,plant_name,energy_source_code,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,plant_id_eia,retirement_date
0,CISO,CA,328,Borel,WAT,-204.0,,,0.0,0.0,328,[2017-09-01]
1,MISO,IA,7922,Brooklyn City North Plant,DFO,42.0,,,,,7922,[2011-09-01]
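
As an alternative to set differences, an outer merge with `indicator=True` reports both directions of the mismatch at once. The sketch below uses hypothetical toy frames (`egrid_toy`, `calc_toy`) with invented plant IDs; it is not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy plant lists; IDs are invented
egrid_toy = pd.DataFrame({"plant_id_eia": [1, 2, 3]})
calc_toy = pd.DataFrame({"plant_id_eia": [2, 3, 4]})

# The _merge column records whether each ID came from the left table only,
# the right table only, or both
both = egrid_toy.merge(calc_toy, how="outer", on="plant_id_eia", indicator=True)

only_in_egrid = both.loc[both["_merge"] == "left_only", "plant_id_eia"].tolist()
only_in_calc = both.loc[both["_merge"] == "right_only", "plant_id_eia"].tolist()
print(only_in_egrid, only_in_calc)
```

Because both lists come from the same key column, this pattern makes it harder to mix up `plant_id_eia` and `plant_id_egrid` between the two steps.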


### Identify plants in our calculations that are missing from eGRID

In [103]:
# Which plants are in our calculations, but are missing from eGRID?
plants_not_in_egrid = list(set(plant_annual_total['plant_id_egrid'].unique()) - set(egrid_plant['plant_id_egrid'].unique()))

plant_names = load_data.load_pudl_table('plants_entity_eia')[['plant_id_eia','plant_name_eia','sector_name_eia']]
missing_from_egrid = plant_annual_total[plant_annual_total['plant_id_egrid'].isin(plants_not_in_egrid)].merge(plant_names, how='left', on='plant_id_eia')

missing_from_egrid

Unnamed: 0,ba_code,state,plant_id_eia,net_generation_mwh,fuel_consumed_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,fuel_consumed_for_electricity_mmbtu,plant_id_egrid,plant_name_eia,sector_name_eia
0,AECI,MO,2127,0.0120,0.012,0.000980,0.000980,0.012,2127,Gallatin (MO),Electric Utility
1,AECI,MO,2141,0.0120,0.012,0.000980,0.000980,0.012,2141,Macon,Electric Utility
2,AECI,MO,2146,0.0120,0.012,0.000980,0.000980,0.012,2146,Monroe (MO),Electric Utility
3,AECI,MO,8110,0.0240,0.024,0.001681,0.001681,0.024,8110,Stanberry,Electric Utility
4,AECI,MO,56126,0.0120,0.012,0.000980,0.000980,0.012,56126,Sub 2 Generating Station,Electric Utility
...,...,...,...,...,...,...,...,...,...,...,...
359,,AK,56147,0.0120,0.012,0.000980,0.000980,0.012,56147,Viking,Electric Utility
360,,AK,60250,0.0120,441.000,36.012060,36.012060,441.000,60250,Swampy Acres Microgrid,Electric Utility
361,,HI,63280,0.0120,0.012,0.000000,0.000000,0.012,63280,"AES Kekaha Solar, LLC Hybrid",IPP Non-CHP
362,,,52106,0.0000,2379781.250,139074.420255,703.304943,0.000,52106,Philadelphia Refinery,Industrial CHP


In [105]:
# which of the plants missing from eGRID have non-trivial fuel consumption (> 1 MMBtu)?
missing_from_egrid[missing_from_egrid['fuel_consumed_mmbtu'] > 1]

Unnamed: 0,ba_code,state,plant_id_eia,net_generation_mwh,fuel_consumed_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,fuel_consumed_for_electricity_mmbtu,plant_id_egrid,plant_name_eia,sector_name_eia
68,CISO,CA,57714,0.012,1.011,0.05908284,0.059083,1.011,57714,City of Palo Alto,Electric Utility
79,CISO,CA,61464,0.012,1276.0,104.1982,104.19816,1276.0,61464,"CoreSite Real Estate 2972 Stender, L.P.",IPP Non-CHP
80,CISO,CA,61474,0.012,223.0,18.21018,18.21018,223.0,61474,CoreSite Real Estate 55 S. Market Street,Commercial Non-CHP
81,CISO,CA,61475,0.012,1270.0,103.7082,103.7082,1270.0,61475,"CoreSite Real Estate 1656 McCarthy, L.P.",Commercial Non-CHP
82,CISO,CA,61476,0.012,584.0,47.68944,47.68944,584.0,61476,"CoreSite Real Estate 2901 Coronado, L.P.",Commercial Non-CHP
83,CISO,CA,61482,0.012,2249.0,183.6533,183.65334,2249.0,61482,"CoreSite Real Estate 3032 Coronado, L.P.",Commercial Non-CHP
91,CPLE,NC,54316,0.024,16737.0,981.8023,0.001681,0.024,54316,Southport,Industrial CHP
97,DUK,NC,54363,0.012,204.0,16.65864,16.65864,204.0,54363,PPG Industries Inc Shelby NC W,Industrial Non-CHP
106,FMPP,FL,7997,0.012,3589.0,293.0777,293.07774,3589.0,7997,Winston,Electric Utility
146,MISO,IA,1146,0.012,6.011,0.4908583,0.490858,6.011,1146,Harlan,Electric Utility


### Compare whether totals for plants with EPA IDs that differ from EIA IDs match

In [106]:
# identify where there is a single egrid plant id for multiple eia plant ids
double_ids = plant_annual_total[plant_annual_total['plant_id_egrid'].duplicated(keep=False)]
double_ids = double_ids.groupby('plant_id_egrid').sum()['net_generation_mwh'].reset_index() # focus on net generation for now
# merge the egrid data
double_ids = double_ids.merge(egrid_plant[['plant_id_egrid','net_generation_mwh']], how='left', on='plant_id_egrid', suffixes=('_calc','_egrid'))
double_ids['percent_diff'] = ((double_ids['net_generation_mwh_calc'] - double_ids['net_generation_mwh_egrid']) / double_ids['net_generation_mwh_egrid']).round(3)
double_ids

Unnamed: 0,plant_id_egrid,net_generation_mwh_calc,net_generation_mwh_egrid,percent_diff
0,562,297109.5,295534.003,0.005
1,762,24598.11,13132.0,0.873
2,1416,6970409.0,3497026.0,0.993
3,2709,5829410.0,5829409.996,0.0
4,3612,7233673.0,4045388.003,0.788
5,4076,113748.6,110539.0,0.029
6,10474,670411.0,670410.997,0.0
7,55306,16977010.0,8487498.0,1.0
8,55375,10894150.0,5447419.084,1.0
9,55481,14552670.0,7276336.0,1.0
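
To make the `percent_diff` column easier to read, the sketch below recomputes it for two rows copied from the output above and adds a ratio column (`calc_to_egrid_ratio` is a name introduced here for illustration). A `percent_diff` near 1.0 means the calculated total is roughly twice the eGRID value.

```python
import pandas as pd

# Values copied from two rows of the output above (plants 1416 and 2709)
double_ids_toy = pd.DataFrame({
    "plant_id_egrid": [1416, 2709],
    "net_generation_mwh_calc": [6970409.0, 5829410.0],
    "net_generation_mwh_egrid": [3497026.0, 5829409.996],
})

calc = double_ids_toy["net_generation_mwh_calc"]
egrid = double_ids_toy["net_generation_mwh_egrid"]

# Same formula as the cell above: (calc - egrid) / egrid
double_ids_toy["percent_diff"] = ((calc - egrid) / egrid).round(3)
# Ratio form of the same comparison; 2.0 would mean exactly double
double_ids_toy["calc_to_egrid_ratio"] = (calc / egrid).round(3)
print(double_ids_toy)
```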


### Identify plants where our BA assignment does not match eGRID

In [107]:
ba_code_match = egrid_plant.set_index('plant_id_eia')[['plant_name','ba_code']].merge(plant_annual_total.set_index('plant_id_eia')[['ba_code']], how='inner', left_index=True, right_index=True, suffixes=("_egrid",'_calc'))

# plants with missing ba code
ba_code_match[(ba_code_match['ba_code_calc'].isna()) & ~(ba_code_match['ba_code_egrid'].isna())]

plant_id_eia,plant_name,ba_code_egrid,ba_code_calc


In [108]:
# plants whose BA code differs from eGRID (this includes plants where eGRID reports no BA code)
ba_code_match[(ba_code_match['ba_code_calc'] != ba_code_match['ba_code_egrid']) & ~(ba_code_match['ba_code_calc'].isna())]

plant_id_eia,plant_name,ba_code_egrid,ba_code_calc
55369,Hamakua Energy Plant,,HECO
56447,Hawi Wind Farm,,HECO
769,Kanoelehua,,HECO
8083,Keahole,,HECO
56378,Pakini Nui Wind Farm,,HECO
7130,Puna,,HECO
52028,Puna Geothermal Venture I,,HECO
771,Puueo,,HECO
772,W H Hill,,HECO
54827,Wailuku River Hydroelectric,,HECO
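
Every row shown above has a blank `ba_code_egrid`. That is expected from the filter: in pandas, comparing a string to a missing value with `!=` evaluates to `True`. A minimal sketch on toy data (plant IDs 1 and 2 and their BA codes are invented) separates the two cases:

```python
import numpy as np
import pandas as pd

# Toy data: the first row mirrors the output above, the other two are invented
ba_toy = pd.DataFrame(
    {
        "ba_code_egrid": [np.nan, "PJM", "MISO"],
        "ba_code_calc": ["HECO", "PJM", "SWPP"],
    },
    index=pd.Index([55369, 1, 2], name="plant_id_eia"),
)

# NaN != "HECO" is True, so rows with a missing eGRID code count as "different"
differs = ba_toy["ba_code_calc"] != ba_toy["ba_code_egrid"]

# eGRID has no BA code at all for this plant
egrid_missing = differs & ba_toy["ba_code_egrid"].isna()
# Both sources report a BA code, and the codes disagree
true_conflict = differs & ba_toy["ba_code_egrid"].notna() & ba_toy["ba_code_calc"].notna()

print(ba_toy[egrid_missing])
print(ba_toy[true_conflict])
```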


## Identify where eGRID might be missing data
eGRID appears to be missing fuel consumption data for quite a few plants, so here we compare its plant-level totals against EIA-923.

In [110]:
pudl_out = load_data.initialize_pudl_out(year)

# load the EIA generator fuel data
IDX_PM_ESC = ["report_date", "plant_id_eia", "energy_source_code", "prime_mover_code"]
gf = (pudl_out.gf_eia923().loc[:,IDX_PM_ESC + ["net_generation_mwh","fuel_consumed_mmbtu","fuel_consumed_for_electricity_mmbtu",],])

# add egrid plant ids
egrid_crosswalk = pd.read_csv('../data/egrid/egrid_static_tables/2020/table_C5_crosswalk_of_EIA_ID_to_EPA_ID.csv')
eia_to_egrid_id = dict(zip(list(egrid_crosswalk['plant_id_eia']), list(egrid_crosswalk['plant_id_egrid'])))
gf['plant_id_egrid'] = gf['plant_id_eia']
gf['plant_id_egrid'].update(gf['plant_id_egrid'].map(eia_to_egrid_id))

# calculate an annual total for each plant
gf_total = gf.groupby(['plant_id_egrid']).sum().reset_index()

# choose a metric to compare
metric = 'fuel_consumed_mmbtu'

egrid_eia_comparison = egrid_plant[['plant_id_egrid','plant_name','ba_code','energy_source_code',metric]].merge(gf_total[['plant_id_egrid',metric]], how='left', on='plant_id_egrid', suffixes=('_egrid','_eia923')).round(0)

egrid_eia_comparison['difference'] = egrid_eia_comparison[f'{metric}_egrid'] - egrid_eia_comparison[f'{metric}_eia923']
egrid_eia_comparison['percent_difference'] = (egrid_eia_comparison[f'{metric}_egrid'] - egrid_eia_comparison[f'{metric}_eia923']) / egrid_eia_comparison[f'{metric}_eia923']
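
The ID translation above copies `plant_id_eia` and then overwrites the entries that appear in the crosswalk. One alternative that avoids updating a column in place is `map` followed by `fillna`. The sketch below assumes a hypothetical crosswalk (`crosswalk_toy`) and invented IDs, and it assumes the ID column has no missing values so the final integer cast is safe.

```python
import pandas as pd

# Hypothetical toy data: only plant 20 has a different eGRID ID
gf_toy = pd.DataFrame({"plant_id_eia": [10, 20, 30]})
crosswalk_toy = {20: 99}

# map() returns NaN for IDs that are not in the crosswalk;
# fillna() then falls back to the original EIA ID for those rows
gf_toy["plant_id_egrid"] = (
    gf_toy["plant_id_eia"]
    .map(crosswalk_toy)
    .fillna(gf_toy["plant_id_eia"])
    .astype(int)  # assumes no missing plant IDs
)
print(gf_toy)
```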

In [111]:
# egrid seems to be missing fuel consumption data for most nuclear power plants
missing_nuclear = egrid_eia_comparison[egrid_eia_comparison['energy_source_code'] == 'NUC']
missing_nuclear.sum()

plant_id_egrid                                                           263588
plant_name                    Browns FerryJoseph M FarleyArkansas Nuclear On...
ba_code                       TVASOCOMISOSRPCISOISNEFPLFPLSOCOSOCOMISOPJMPJM...
energy_source_code            NUCNUCNUCNUCNUCNUCNUCNUCNUCNUCNUCNUCNUCNUCNUCN...
fuel_consumed_mmbtu_egrid                                            43749478.0
fuel_consumed_mmbtu_eia923                                         8296982588.0
difference                                                        -8253233110.0
percent_difference                                                   -57.753049
dtype: object

In [112]:
# where is egrid missing data?
egrid_eia_comparison[(egrid_eia_comparison['percent_difference'] < -0.01) & (egrid_eia_comparison['energy_source_code'] != 'NUC')]

Unnamed: 0,plant_id_egrid,plant_name,ba_code,energy_source_code,fuel_consumed_mmbtu_egrid,fuel_consumed_mmbtu_eia923,difference,percent_difference
145,55409,Calhoun Energy Center,SOCO,NG,422671.0,439140.0,-16469.0,-0.037503
147,56,Charles R Lowman,AEC,BIT,4043809.0,4163218.0,-119409.0,-0.028682
148,47,Colbert,TVA,NG,53425.0,55313.0,-1888.0,-0.034133
151,55292,Decatur Energy Center,TVA,NG,11008538.0,11454823.0,-446285.0,-0.038960
161,10,Greene County,SOCO,NG,11641750.0,11782651.0,-140901.0,-0.011958
...,...,...,...,...,...,...,...,...
9963,3943,Fort Martin Power Station,PJM,BIT,50443730.0,53566447.0,-3122717.0,-0.058296
9966,10151,Grant Town Power Plant,PJM,WC,8121428.0,8517719.0,-396291.0,-0.046525
9967,3944,Harrison Power Station,PJM,RC,104654598.0,110030419.0,-5375821.0,-0.048858
9986,6004,Pleasants Power Station,PJM,RC,51133698.0,53296481.0,-2162783.0,-0.040580


In [113]:
# how many tons of CO2 does this missing fuel consumption account for?
# group by fuel code
missing_emissions = egrid_eia_comparison[(egrid_eia_comparison['percent_difference'] < -0.01) & (egrid_eia_comparison['energy_source_code'] != 'NUC')].groupby('energy_source_code').sum().reset_index()

# get emission factors
emission_factors = load_data.load_emission_factors()[['energy_source_code', 'co2_tons_per_mmbtu']]
missing_emissions = missing_emissions.merge(emission_factors, how='left', on='energy_source_code')
missing_emissions['co2_mass_tons'] = missing_emissions['difference'] * missing_emissions['co2_tons_per_mmbtu']
missing_emissions.sum()

energy_source_code            BFGBITBLQCOGDFOLIGNGOBGOGOTHPCPRGRCRFOSGCSUBSU...
plant_id_egrid                                                         14587515
fuel_consumed_mmbtu_egrid                                          8511611934.0
fuel_consumed_mmbtu_eia923                                         9145585836.0
difference                                                         -633973902.0
percent_difference                                                   -62.320487
co2_tons_per_mmbtu                                                      1.66837
co2_mass_tons                                                   -47871070.84939
dtype: object
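
In the summed output above, only the `co2_mass_tons` total is meaningful; summing `plant_id_egrid`, `percent_difference`, or `co2_tons_per_mmbtu` across fuels has no physical interpretation. The sketch below shows the intended arithmetic on toy data. The fuel gaps and emission factors are placeholders invented for illustration, not real emission factors.

```python
import pandas as pd

# Hypothetical fuel gaps by energy source (eGRID minus EIA-923, in MMBtu)
fuel_gap_toy = pd.DataFrame({
    "energy_source_code": ["NG", "BIT"],
    "difference": [-1000.0, -200.0],
})
# Placeholder emission factors, NOT real values
factors_toy = pd.DataFrame({
    "energy_source_code": ["NG", "BIT"],
    "co2_tons_per_mmbtu": [0.05, 0.1],
})

gap = fuel_gap_toy.merge(factors_toy, how="left", on="energy_source_code")
# CO2 implied by each fuel's gap = fuel difference * that fuel's emission factor
gap["co2_mass_tons"] = gap["difference"] * gap["co2_tons_per_mmbtu"]

# Sum only the column for which a total makes sense
total_co2_gap = gap["co2_mass_tons"].sum()
print(total_co2_gap)
```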

## Plant Metric

In [115]:
# standardize column names and index so that the two dfs can be divided
calculated_to_compare = plant_annual_total.groupby('plant_id_egrid').sum().drop(columns=['plant_id_eia'])

# drop the plants that have no data in eGRID
plants_with_no_data_in_egrid = list(egrid_plant[egrid_plant[['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].sum(axis=1) == 0]['plant_id_egrid'])
egrid_plant = egrid_plant[~egrid_plant['plant_id_egrid'].isin(plants_with_no_data_in_egrid)]

egrid_to_compare = egrid_plant.set_index(['plant_id_egrid']).drop(columns=['ba_code','state','plant_name','plant_id_eia'])

# divide calculated value by egrid value
compared = calculated_to_compare.div(egrid_to_compare).merge(egrid_plant[['plant_id_egrid','plant_name','ba_code', 'state']], how='left', left_index=True, right_on='plant_id_egrid').set_index('plant_id_egrid')
compared['plant_name'] = compared['plant_name'].fillna('unknown')

# create a dataframe that merges the two sources of data together
compared_merged = calculated_to_compare.merge(egrid_to_compare, how='outer', on='plant_id_egrid', suffixes=('_calc','_egrid'))

# for each column, set the ratio to 1 (an exact match) if both values are zero (the ratio is only NaN because of 0/0)
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    # identify plants with zero values for both
    plant_ids = list(compared_merged[(compared_merged[f'{col}_calc'] == 0) & (compared_merged[f'{col}_egrid'] == 0)].index)
    compared.loc[compared.index.isin(plant_ids), col] = 1

# for each column, categorize the data based on how far it is off from egrid
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    # add a new column
    compared[f'{col}_status'] = pd.cut(x=compared[col], 
                                       bins=[-999999999,0,0.5,0.9,0.99,0.9999,1,1.0001,1.01,1.1,1.5,999999999], 
                                       labels=['negative','<50%','+/-50%','+/-10%','+/-1%','!exact','!exact','+/-1%','+/-10%','+/-50%','>50%'], 
                                       ordered=False)
    # replace any missing values with 'missing' (astype(str) turns NaN into the string 'nan')
    compared[f'{col}_status'] = compared[f'{col}_status'].astype(str)
    compared[f'{col}_status'] = compared[f'{col}_status'].replace('nan','missing')
    compared.loc[(compared.index.isin(plants_not_in_egrid)),f'{col}_status'] = 'not_in_egrid'

# identify which plants are missing from egrid vs calculated values
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    # identify plants that are missing in egrid
    plants_missing_egrid = list(compared_merged[(compared_merged[f'{col}_calc'] > 0) & (compared_merged[f'{col}_egrid'].isna())].index)
    compared.loc[compared.index.isin(plants_missing_egrid), f'{col}_status'] = 'missing_in_egrid'
    # identify plants that are missing from our calculations
    plants_missing_calc = list(compared_merged[(compared_merged[f'{col}_calc'].isna()) & (compared_merged[f'{col}_egrid'] > 0)].index)
    compared.loc[compared.index.isin(plants_missing_calc), f'{col}_status'] = 'missing_in_calc'
    # identify where our calculations are missing a zero value
    plants_missing_zero_calc = list(compared_merged[(compared_merged[f'{col}_calc'].isna()) & (compared_merged[f'{col}_egrid'] == 0)].index)
    compared.loc[compared.index.isin(plants_missing_zero_calc), f'{col}_status'] = 'missing_zero_in_calc'
    # identify where egrid has a missing value instead of a zero
    plants_missing_zero_egrid = list(compared_merged[(compared_merged[f'{col}_calc'] == 0) & (compared_merged[f'{col}_egrid'].isna())].index)
    compared.loc[compared.index.isin(plants_missing_zero_egrid), f'{col}_status'] = 'missing_zero_in_egrid'
    # identify where egrid has a zero value where we have a positive value
    plants_incorrect_zero_egrid = list(compared_merged[(compared_merged[f'{col}_calc'] > 0) & (compared_merged[f'{col}_egrid'] == 0)].index)
    compared.loc[compared.index.isin(plants_incorrect_zero_egrid), f'{col}_status'] = '>50%'

# create a dataframe that counts how many plants are in each category
comparison_count = []
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    count = compared.groupby(f'{col}_status', dropna=False).count()['plant_name'].rename(col)
    count.index = count.index.rename('status')
    comparison_count.append(count)

comparison_count = pd.concat(comparison_count, axis=1).fillna(0).astype(int)
comparison_count = pd.concat([comparison_count, pd.DataFrame(comparison_count.sum().rename('Total')).T], axis=0)   
comparison_count

Unnamed: 0,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons_adjusted,co2_mass_tons
!exact,9279,8462,7858,7369,8269
+/-1%,151,309,95,290,359
+/-10%,66,316,211,514,450
+/-50%,76,250,374,512,320
<50%,31,14,809,44,12
>50%,414,683,293,1185,511
missing,1,2,2,1,1
missing_in_calc,1,0,0,0,0
missing_in_egrid,364,383,381,317,317
missing_zero_in_egrid,1,0,2,177,177
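
The status categories in the table come from binning the calculated-to-eGRID ratio with `pd.cut`. The sketch below applies the same bins and labels to a handful of invented ratios. Note that the intervals are closed on the right, so a ratio of exactly 0 falls into the `negative` bin.

```python
import numpy as np
import pandas as pd

# Same bins and labels as the cell above; the ratios themselves are invented
bins = [-999999999, 0, 0.5, 0.9, 0.99, 0.9999, 1, 1.0001, 1.01, 1.1, 1.5, 999999999]
labels = ['negative', '<50%', '+/-50%', '+/-10%', '+/-1%', '!exact',
          '!exact', '+/-1%', '+/-10%', '+/-50%', '>50%']

ratios = pd.Series([-1.0, 0.0, 0.3, 0.95, 1.0, 1.005, 2.0, np.nan])

# ordered=False is required because some labels appear twice
status = pd.cut(ratios, bins=bins, labels=labels, ordered=False)
# Convert to plain objects so that missing ratios can be labeled explicitly
status = status.astype(object).fillna('missing')

print(pd.DataFrame({'ratio': ratios, 'status': status}))
```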


In [None]:
print(comparison_count.to_markdown())

## BA Metric

In [136]:
# load egrid BA totals
"""egrid_ba = pd.read_excel(f'../data/egrid/egrid{year}_data.xlsx', sheet_name=f'BA{str(year)[-2:]}', header=1, usecols=['BANAME','BACODE','BAHTIANT','BANGENAN','BACO2AN'])
# rename the columns
egrid_ba = egrid_ba.rename(columns={'BANAME':'ba_name',
                                    'BACODE':'ba_code',
                                    'BAHTIANT':'fuel_consumed_mmbtu',
                                    'BANGENAN':'net_generation_mwh',
                                    'BACO2AN':'co2_mass_tons'})"""

data_columns = ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']

#aggregate the plant data up to the BA level
egrid_ba = egrid_plant.groupby(['ba_code']).sum()[data_columns].reset_index()

# divide our calculation by the BA totals from eGRID
# if there are 0 values, replace with 0.1, so that div by zero doesn't return missing value
ba_metric = plant_annual_total.groupby('ba_code', dropna=False).sum().drop(columns='plant_id_eia').replace(0,0.1).div(egrid_ba.set_index('ba_code').replace(0,0.1)).sort_values(by='co2_mass_tons').round(3)

total = pd.DataFrame(plant_annual_total[data_columns].sum().div(egrid_ba[data_columns].sum()).rename('Total')).T

# calculate the difference in the number of plants in each region
plant_count = (plant_annual_total.groupby('ba_code', dropna=False).count()['plant_id_egrid'] - egrid_plant.groupby('ba_code', dropna=False).count()['plant_id_egrid']).rename('num_plants')
ba_metric = ba_metric.merge(plant_count, how='left', left_index=True, right_index=True).drop(columns=['plant_id_egrid']).sort_index()

ba_metric = pd.concat([ba_metric, total], axis=0).round(2) 

ba_metric = ba_metric[data_columns + ['num_plants']]

columns_to_check = ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons']

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(ba_metric[~(ba_metric[columns_to_check] == 1).all(axis=1)])

Unnamed: 0,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,num_plants
AEC,2.04,1.96,0.03,1.9,1.92,0.0
AECI,1.35,1.3,0.08,1.22,1.22,6.0
AVA,1.22,1.14,0.8,1.21,2.47,1.0
AVRN,1.4,1.36,0.63,1.98,1.99,0.0
AZPS,1.15,1.21,0.22,1.16,1.19,1.0
BANC,1.74,1.73,0.19,1.86,2.0,2.0
BPAT,1.07,1.19,0.97,1.28,1.55,8.0
CISO,1.28,1.36,0.77,1.39,1.77,72.0
CPLE,1.1,2.43,1.52,1.34,1.15,8.0
CSTO,2.0,2.0,0.0,2.0,2.0,0.0
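
The BA metric is an index-aligned division of our totals by the eGRID totals, with zeros replaced by 0.1 so that 0/0 yields 1 rather than NaN. A minimal sketch with invented BA codes and values shows the three cases: a normal ratio, a zero on both sides, and a BA missing from one table.

```python
import pandas as pd

# Hypothetical BA-level totals; codes and values are invented
calc_ba_toy = pd.DataFrame(
    {"co2_mass_tons": [200.0, 0.0, 5.0]},
    index=pd.Index(["AAA", "BBB", "CCC"], name="ba_code"),
)
egrid_ba_toy = pd.DataFrame(
    {"co2_mass_tons": [100.0, 0.0]},
    index=pd.Index(["AAA", "BBB"], name="ba_code"),
)

# Replace zeros on both sides, then divide; rows are matched on ba_code
ratio = calc_ba_toy.replace(0, 0.1).div(egrid_ba_toy.replace(0, 0.1)).round(3)
print(ratio)
# AAA -> 2.0 (calculated value is double eGRID)
# BBB -> 1.0 (both zero, treated as a match)
# CCC -> NaN (BA is absent from the eGRID table)
```

One side effect of the replacement is that a true zero compared against a small non-zero value produces a finite ratio instead of 0 or infinity, so ratios for BAs with very small totals should be read with care.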


In [None]:
# how much co2 comes from CEMS vs EIA
data_source_by_ba = pd.pivot_table(eia923_allocated, values='co2_mass_tons', index='ba_code', columns='data_source', dropna=False, aggfunc=np.sum).replace(0,0.001)
data_source_by_ba = data_source_by_ba.div(data_source_by_ba.sum(axis=1), axis=0)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(data_source_by_ba.round(3).fillna(0).sort_values(by='cems'))
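
The pivot-and-normalize pattern above converts CO2 totals into each data source's share per BA. The sketch below reproduces it on toy data; the BA codes, values, and the `eia` source label are invented for illustration (only `cems` is confirmed as a column name by the sort above).

```python
import pandas as pd

# Hypothetical allocated data; all values are invented
alloc_toy = pd.DataFrame({
    "ba_code": ["AAA", "AAA", "BBB"],
    "data_source": ["cems", "eia", "eia"],
    "co2_mass_tons": [75.0, 25.0, 10.0],
})

# One row per BA, one column per data source
by_source = pd.pivot_table(
    alloc_toy, values="co2_mass_tons", index="ba_code",
    columns="data_source", aggfunc="sum",
)
# Divide each row by its row total to get shares that sum to 1
shares = by_source.div(by_source.sum(axis=1), axis=0).fillna(0)
print(shares.sort_values(by="cems"))
```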

In [None]:
print(ba_metric.to_markdown())

## Explore specific plants

### Notes

Net generation < 50%:
 - 2617 has negative net generation
 - 10444, 59395: some plants are hybrid fossil / solar plants, but the primary energy source code is getting listed as PV for certain generators, which is causing an issue

Net generation > 50%:
 - 335 has two extra units (CT1 and CT2) that report to CEMS but not to EIA. This is likely correct
 - 7288 reports DFO to CEMS but NG to EIA; some generators are probably not reporting
 - 60698 is double counting generation from CEMS because of the 90% heat threshold in a month; this might be fixed once we tackle heat input
 - 58256 is a solar/hybrid plant, and in generators_entity_eia the battery portion is associated with a separate generator ID (should only be 1)


Other:
 - plant 3754 has heat input in cems and eia that don't match
 - 2401 has generation in both cems and eia
 - 50933 might have allocation issue (doesn't appear in CEMS)

To check
 - 1404 reports generation to CEMS in December, but not to EIA. This is likely correct
 - plant 2504 has three units (120, 121, 122) that don't appear in EIA, and in CEMS only report steam. 

Fuel > 50%
 - Plant 3116 reports much more heat input to CEMS during ozone months than to EIA

BA Totals
 - TEPC and SRP are off because the Gila River Generator is shared between SRP and TEPC, and eGRID reports all generation from this project as belonging to TEPC


In [None]:
# examine specific plants in a category
value = 'fuel_consumed_mmbtu'
status = '>50%'

#compared_merged.loc[64877,:]

compared[compared[f'{value}_status'] == status].sort_values(by=value)

In [None]:
compared[compared[f'{value}_status'] == status].sort_values(by=value).sample(10)

In [None]:
plant_to_explore = 58380

In [None]:
egrid_plant[egrid_plant['plant_id_eia'] == plant_to_explore]

In [None]:
plant_annual_total[plant_annual_total['plant_id_eia'] == plant_to_explore]

In [None]:
eia923_allocated[eia923_allocated['plant_id_eia'] == plant_to_explore].sum()

In [None]:
cems_unit_monthly = cems.groupby(['plant_id_eia','unitid','report_date']).sum().reset_index()


In [None]:
cems_unit_monthly[cems_unit_monthly['plant_id_eia'] == plant_to_explore].sum()

In [None]:
cems[(cems['plant_id_eia'] == plant_to_explore) & (cems['operating_time_hours'] > 0)]

In [None]:
cems_plant_annual[cems_plant_annual['plant_id_eia'] == plant_to_explore]

In [None]:
# there are some plants that report heat input and co2 in CEMS, but are missing net generation data
# TODO: we should maybe try and fill net generation data using EIA-923?
cems_plant_annual[cems_plant_annual['net_generation_mwh'] == 0]

## Explore BA Matching
### Notes
TEPC and SRP are off because the Gila River Generator is shared between SRP and TEPC, and eGRID reports all generation from this project as belonging to TEPC

In [None]:
ba_plant_match = plant_annual_total[plant_annual_total['ba_code'].isna()].merge(egrid_plant[egrid_plant['ba_code'].isna()], how='left', on='plant_id_egrid', suffixes=("_calc",'_egrid'))
ba_plant_match[ba_plant_match['net_generation_mwh_calc'].round(0) != ba_plant_match['net_generation_mwh_egrid'].round(0)]

In [None]:
egrid_plant[egrid_plant['ba_code'] == 'CPLE']

In [None]:
compare_plants_in_ba = egrid_plant.merge(plant_annual_total, how='outer', on='plant_id_egrid', suffixes=('_egrid','_calc'))

In [None]:
ba = 'CPLE'
metric = 'fuel_consumed_mmbtu'

compare_plants_in_ba[((compare_plants_in_ba['ba_code_egrid'] == ba) | (compare_plants_in_ba['ba_code_calc'] == ba)) & (compare_plants_in_ba[f'{metric}_egrid'].round(0) != compare_plants_in_ba[f'{metric}_calc'].round(0))]

In [None]:
plant_annual_total[plant_annual_total['ba_code'] == 'AMPL']