# Hourly eGRID Data Pipeline Overview

## Data Used
EPA Continuous Emissions Monitoring System (CEMS) data
 - What is it: Measured hourly gross generation, fuel consumption, and emissions data for emitting power generation units > 25 MW
 - How we use it: Primary source for hourly emissions and generation data

EIA Form 923
 - What is it: Reported monthly net generation and fuel consumption data for power generators > 1 MW
 - How we use it: To convert gross generation data from CEMS to net generation, and to calculate emissions that are not reported to CEMS

EIA Form 860
 - What is it: Inventory of all generators and plants and their static characteristics
 - How we use it: To transform and aggregate the data reported in CEMS and EIA-923 based on plant and generator characteristics

EPA-EIA Power Sector Data Crosswalk
 - What is it: Maps EPA plant IDs and unit IDs to EIA plant IDs and generator IDs
 - How we use it: To match data between CEMS and EIA-923

EIA Form 930 / Hourly Electric Grid Monitor
 - What is it: Reported hourly net generation by fuel category, demand, and interchange for each Balancing Area in the U.S. 
 - How we use it: To assign an hourly profile to the monthly generation and fuel data reported in EIA-923

EPA eGRID database
 - What is it: Reports annual-level generation and emissions statistics at the plant and BA level 
 - How we use it: To validate our outputs

## Process
1. Download data, including CEMS (via PUDL), EIA Forms 860 and 923 (via PUDL), the EPA-EIA Power Sector Data Crosswalk, and EIA-930 data
2. Identify subplants and gross-to-net generation factors using multiple years of historical data.
    - Using Power Sector Data Crosswalk, identify distinct subplant clusters of EPA units and EIA generators in each plant
    - Using multiple years of generation data from CEMS and EIA-923, run linear regressions of net generation on gross generation at the subplant and plant level
    - Calculate simple monthly ratios between gross and net generation at the subplant and plant level
3. Clean monthly generation and fuel data from EIA-923
    - Allocate monthly net generation and fuel consumption data reported for each plant prime mover to each plant generator
    - Calculate monthly emissions for each generator based on its fuel consumption and fuel source
    - Remove data for non grid-connected plants and plants in Puerto Rico
    - Assign a primary fuel type and balancing authority location to each generator
4. Clean hourly generation, fuel, and emissions data from CEMS
    - Remove data for non grid-connected plants, plants in Puerto Rico, and certain steam-only units
    - Assign a monthly "report_date" to each hourly observation based on the date of the local timestamp (this allows us to match the data to EIA-923 report dates)
    - Assign a fuel type to each unit
    - Fill in missing hourly emissions data using the assigned fuel type and reported hourly fuel consumption data
    - Remove all observations for each unit-month when no operation is reported for that unit in that month
    - Allocate hourly data for combined heat and power plants between electricity generation and steam production
    - Remove data for units for which we are unable to fill missing emissions data
5. Convert hourly gross generation in CEMS to hourly net generation
    - Aggregate CEMS gross generation to the monthly level to match with monthly-reported net generation
    - Apply several methodologies to calculate gross-to-net (GTN) generation conversion factors
    - Apply GTN factors to convert hourly gross generation to hourly net generation
6. Crosswalk the CEMS data to the EIA-923 data to identify for which generator-months there is no hourly data reported in CEMS
    - Use the EPA-EIA Power Sector Data Crosswalk
    - Assign subplant groupings to data
7. Assign static plant characteristics to CEMS and EIA data to allow for data aggregation and matching with EIA-930
    - Assign generator and plant-level primary fuel
    - Assign Balancing Authority and State to each plant
    - Assign fuel categories to each plant that match EIA-930 categories
8. Clean and reconcile EIA-930 data
    - Fix timezone/timestamp issues with raw 930 data
    - Perform physics-based reconciliation so that data satisfies conservation of energy equations
9. Calculate residual net generation profiles for each BA-fuel category by comparing EIA-930 and CEMS hourly net generation data
10. Assign monthly EIA-923 data an hourly profile based on the residual net generation profile
11. Concatenate the shaped hourly EIA-923 data to the hourly CEMS data
12. Run validation checks on processed data
13. Aggregate the hourly data to the BA level and output
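
One way to read the subplant identification in step 2 is as a connected-components problem: units and generators that are linked to each other (directly or indirectly) in the crosswalk end up in the same cluster. The block below is a minimal sketch of that idea only. The function name `cluster_subplants` and the sample IDs are hypothetical, and this is not necessarily how `gross_to_net_generation` implements it.

```python
from collections import defaultdict

def cluster_subplants(pairs):
    """Group (unit_id, generator_id) crosswalk pairs into connected clusters."""
    parent = {}

    def find(node):
        # register unseen nodes as their own root, then walk up to the root
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # tag each ID with its type so a unit and a generator with the same ID never collide
    for unit, gen in pairs:
        union(('unit', unit), ('gen', gen))

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(cluster) for cluster in clusters.values())

# hypothetical crosswalk rows for a single plant
example_pairs = [('U1', 'G1'), ('U2', 'G1'), ('U3', 'G2')]
subplants = cluster_subplants(example_pairs)
```

In this toy example, units `U1` and `U2` both feed generator `G1`, so they form one cluster, while `U3` and `G2` form a second.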


## Outputs
 - Processed hourly subplant-level data
 - Aggregated hourly data for each BA (total emissions, total generation, generated carbon intensity)

## Output Validation Checks
 - Aggregate data to annual level and compare with published eGRID results
 - Check that aggregated heat rates and emissions rates by fuel type are within reasonable ranges for each BA
 - Plant-level checks for anomalous data

In [1]:
# import packages
import numpy as np
import pandas as pd
import plotly.express as px
from IPython.display import display

In [2]:
%reload_ext autoreload
%autoreload 2

# Tell python where to look for modules. 
import sys
sys.path.append('../../hourly-egrid/')

# import local modules
import src.data_cleaning as data_cleaning
import src.gross_to_net_generation as gross_to_net_generation
import src.load_data as load_data
import src.validation as validation

# Specify the year for analysis
This data pipeline can only be run for a single year at a time.

In [3]:
year = 2020

# 1. Download data
Checks whether all necessary files are already downloaded, and downloads any that are missing.

In [9]:
############### PUDL Database ######################

load_data.download_pudl_data(zenodo_url = 'https://zenodo.org/record/6349861/files/pudl-v0.6.0-2022-03-12.tgz')

################# eGRID data #########################

# the 2019 and 2020 data appear to be hosted on different urls
egrid_files_to_download = ['https://www.epa.gov/sites/default/files/2021-02/egrid2019_data.xlsx', 
                           'https://www.epa.gov/system/files/documents/2022-01/egrid2020_data.xlsx']

load_data.download_egrid_files(egrid_files_to_download)

############# EIA-930 data #####################

load_data.download_eia930_data(years_to_download=[year])
load_data.download_chalendar_files()

########## Power Sector Data Crosswalk #############
# NOTE: Check for new releases at https://github.com/USEPA/camd-eia-crosswalk

load_data.download_epa_psdc(psdc_url='https://github.com/USEPA/camd-eia-crosswalk/releases/download/v0.2.1/epa_eia_crosswalk.csv')


PUDL data already downloaded
egrid2019_data.xlsx already downloaded
egrid2020_data.xlsx already downloaded
2020_Jan_Jun data already downloaded
2020_Jul_Dec data already downloaded
EBA_elec.csv already downloaded
EBA_raw.csv already downloaded
epa_eia_crosswalk.csv already downloaded


# 2. Identify subplants and gross-to-net ratios
This step is not yet integrated into the main data pipeline. It requires loading and cleaning multiple years of EIA-923 and CEMS data; we recommend using at least the three most recent years. Because not everyone will have the processing power to run this step, we could also include its outputs in the data folder.
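
As a rough illustration of the two kinds of conversion factor this step produces, the sketch below fits a linear regression of net generation on gross generation and also computes a simple ratio. The monthly values are made up, and the real calculation in `gross_to_net_generation` works on multi-year CEMS and EIA-923 data and may differ in its details.

```python
import numpy as np

# hypothetical monthly totals for one subplant (MWh)
gross_mwh = np.array([100.0, 200.0, 300.0, 400.0])
net_mwh = np.array([90.0, 185.0, 280.0, 375.0])

# ordinary least squares fit of net generation on gross generation
slope, intercept = np.polyfit(gross_mwh, net_mwh, deg=1)

# coefficient of determination (r2) for the fit
predicted = slope * gross_mwh + intercept
ss_res = np.sum((net_mwh - predicted) ** 2)
ss_tot = np.sum((net_mwh - net_mwh.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# simple ratio alternative: total net generation divided by total gross generation
simple_ratio = net_mwh.sum() / gross_mwh.sum()
```

The regression captures a fixed house load through the intercept, which a single ratio cannot do.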

In [5]:
# specify the number of years of data to load to calculate gross to net regressions
# this will load the x years ending at the `year` specified above
number_of_years = 5

# NOTE: This only needs to be run once, as outputs are saved to csv files
gross_to_net_generation.identify_subplants_and_gtn_conversions(year, number_of_years)

loading 2016 CEMS data
loading 2017 CEMS data
loading 2018 CEMS data
loading 2019 CEMS data
loading 2020 CEMS data
Allocating EIA-923 generation data


# 3. Clean EIA-923 Generation and Fuel Data at the Monthly Level

In [73]:
eia923_allocated, primary_fuel_table = data_cleaning.clean_eia923(year)

# Add primary fuel data to each generator
eia923_allocated = eia923_allocated.merge(primary_fuel_table, how='left', on=['plant_id_eia','generator_id'])

eia923_allocated



Removing 0 plants that are not grid-connected
Removing 0 plants located in the following states: ['PR']


Unnamed: 0,report_date,plant_id_eia,generator_id,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,subplant_id,prime_mover_code,energy_source_code,plant_primary_fuel
0,2020-01-01,1,1,,,,,,,IC,DFO,WND
1,2020-01-01,1,2,,,,,,,IC,DFO,WND
2,2020-01-01,1,3,,,,,,,IC,DFO,WND
3,2020-01-01,1,5,,,,,,,IC,DFO,WND
4,2020-01-01,1,WT1,15.2,133.0,133.0,0.0,0.0,,WT,WND,WND
...,...,...,...,...,...,...,...,...,...,...,...,...
283172,2020-12-01,64816,GEN1,,,,,,,IC,NG,
283173,2020-12-01,64816,GEN2,,,,,,,IC,NG,
283174,2020-12-01,64816,GEN3,,,,,,,IC,NG,
283175,2020-12-01,64836,CATAL,,,,,,,PV,SUN,


### Run Validation Tests on Cleaned EIA-923 data
Notes:
- When net generation is positive but no fuel consumption is reported, this could be due to several reasons:
    - The generator uses a clean fuel, like WAT, and reports 0 mmbtu per physical unit fuel consumed
    - The generator is part of a combined cycle unit and all of the fuel is reported on the other cycle
- Records where all data is zero could just mean that the generator didn't operate in that month
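
To make the two cases in these notes concrete, here is a small sketch of how such records could be flagged with pandas. It uses a made-up three-row dataframe and is not the implementation behind `validation.test_for_missing_fuel` or `validation.test_for_zero_data`.

```python
import pandas as pd

# hypothetical generator-month records
example_df = pd.DataFrame({
    'generator_id': ['A', 'B', 'C'],
    'net_generation_mwh': [10.0, 5.0, 0.0],
    'fuel_consumed_mmbtu': [100.0, 0.0, 0.0],
})

# positive net generation but no fuel consumption reported
missing_fuel = example_df[(example_df['net_generation_mwh'] > 0)
                          & (example_df['fuel_consumed_mmbtu'].fillna(0) == 0)]

# all data columns equal to zero (the generator may simply not have operated)
data_columns = ['net_generation_mwh', 'fuel_consumed_mmbtu']
all_zero = example_df[(example_df[data_columns] == 0).all(axis=1)]
```

Here generator `B` would be flagged for missing fuel, and generator `C` would be flagged as an all-zero record.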

In [74]:
# perform checks on allocated data
# fuel consumption and co2 emissions should be positive
negative_test = validation.test_for_negative_values(eia923_allocated, ['fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted'])

# if net generation is positive, fuel consumption should be non zero
missing_fuel_test = validation.test_for_missing_fuel(eia923_allocated, 'net_generation_mwh')

# fuel consumed for electricity should be less than fuel consumed
chp_allocation_test = validation.test_chp_allocation(eia923_allocated)

# check for missing co2 data
missing_co2_test = validation.test_for_missing_co2(eia923_allocated)

# check for generators with no data
missing_data_test = validation.test_for_missing_data(eia923_allocated, ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted'])

# check for generators with all data = 0
zero_data_test = validation.test_for_zero_data(eia923_allocated, ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted'])

# check for missing energy source code
missing_esc_test = validation.test_for_missing_energy_source_code(eia923_allocated)

# check heat rates
validation.test_for_outlier_heat_rates(eia923_allocated)

Heat Rate Test


In [32]:
# identify missing prime mover code issues
missing_data_test[missing_data_test['prime_mover_code'].isna()]

Unnamed: 0,report_date,plant_id_eia,generator_id,net_generation_mwh,fuel_consumed_mmbtu,fuel_consumed_for_electricity_mmbtu,co2_mass_tons,co2_mass_tons_adjusted,subplant_id,prime_mover_code,energy_source_code,plant_primary_fuel
2462,2020-01-01,1391,1A,,,,,,0.0,,OG,
2463,2020-01-01,1391,2A,,,,,,1.0,,OG,
2464,2020-01-01,1391,3A,,,,,,2.0,,OG,
2465,2020-01-01,1391,4A,,,,,,3.0,,OG,
2466,2020-01-01,1391,5A,,,,,,,,OG,
...,...,...,...,...,...,...,...,...,...,...,...,...
277617,2020-12-01,57854,1PT1,,,,,,,,WAT,WAT
277618,2020-12-01,57854,1PT2,,,,,,,,WAT,WAT
277619,2020-12-01,57854,1PT3,,,,,,,,WAT,WAT
278185,2020-12-01,58207,G9,,,,,,,,NG,NG


# 4. Clean Hourly Data from CEMS

In [57]:
cems = data_cleaning.clean_cems(year)

Removing 45 plants that are not grid-connected
Removing 0 plants located in the following states: ['PR']
Removing 3 units that only produce steam and do not report to EIA




removing 7325217 observations from cems for unit-months where no data reported
Unable to calculate emissions for the following plants_units: []


In [58]:
# fuel consumption and co2 emissions should be positive
cems_negative_test = validation.test_for_negative_values(cems, ['fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted', 'gross_generation_mwh'])

# if gross generation is positive, fuel consumption should be non zero
cems_missing_fuel_test = validation.test_for_missing_fuel(cems,'gross_generation_mwh')

# fuel consumed for electricity should be less than fuel consumed
cems_chp_allocation_test = validation.test_chp_allocation(cems)

# check for missing co2 data
cems_missing_co2_test = validation.test_for_missing_co2(cems)

# check for missing energy source code
cems_missing_esc_test = validation.test_for_missing_energy_source_code(cems)



In [59]:
cems_missing_esc_test.cems_id.unique()

array(['54096_X026', '54096_Z007', '60698_D1', '60698_D7', '335_CT1',
       '335_CT2', '7242_**1', '50240_5', '55088_GT2100', '55088_GT3100',
       '7277_17', '10244_B002', '10244_B003', '50397_36', '50397_39',
       '50481_253-26', '61241_CT1', '61241_CT2', '60925_CT1', '60925_CT2',
       '3982_1', '3982_2', '4125_8', '55641_CT-03', '55641_CT-04'],
      dtype=object)

# 5. Convert CEMS Hourly Gross Generation to Hourly Net Generation
Hierarchy
1. Use the regression value if the regression has a good r2
2. If there is not a good regression, use the monthly ratio unless the ratio is outside of normal bounds (negative, >>1, missing)

Where there are outliers (e.g. gross generation is very different from net generation), which number do we trust?
 - If the EIA data was reported monthly and not distributed, maybe trust EIA (monthly ratio).
 - Otherwise, trust the general regression.
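
The hierarchy above can be sketched as a small selection function. This is an illustration only: the function name, the r2 threshold of 0.9, and the ratio bounds of 0 to 1.5 are hypothetical placeholders rather than the values used in `convert_gross_to_net_generation`, and the real pipeline also distinguishes subplant-level from plant-level factors.

```python
import math

def choose_gtn_method(regression_r2, monthly_ratio, r2_threshold=0.9, ratio_bounds=(0.0, 1.5)):
    """Pick a gross-to-net method following the regression > ratio > assumed value hierarchy."""

    def is_missing(value):
        return value is None or math.isnan(value)

    # 1. prefer the regression when it fits well (threshold is an assumption)
    if not is_missing(regression_r2) and regression_r2 >= r2_threshold:
        return 'regression'

    # 2. otherwise use the monthly ratio, unless it is missing, negative, or much larger than 1
    lower, upper = ratio_bounds
    if not is_missing(monthly_ratio) and lower <= monthly_ratio <= upper:
        return 'ratio'

    # 3. fall back to an assumed value when neither is usable
    return 'assumed_value'
```

For example, `choose_gtn_method(0.4, 0.92)` falls through to the ratio, while a negative ratio with a poor regression falls through to the assumed value.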


In [60]:
# merge the data back into cems and calculate net generation
cems = data_cleaning.convert_gross_to_net_generation(cems)

# for generators where there is heat input but no gross generation reported, impute hourly net generation based on reported EIA values
# TODO: Need to match data on unit level rather than plant level
# cems = data_cleaning.impute_missing_hourly_net_generation(cems, eia923_allocated)

# what percent of net generation was allocated using each method?
cems.groupby('gtn_method', dropna=False)['net_generation_mwh'].sum() / cems['net_generation_mwh'].sum()

gtn_method
assumed_value          0.033989
plant_ratio            0.006415
plant_regression       0.118712
subplant_ratio         0.057691
subplant_regression    0.783194
Name: net_generation_mwh, dtype: float64

# 6. Crosswalk CEMS and EIA data

The goal of this step is to identify for which subplant-months we have hourly data reported in CEMS (which will be considered the default source of data), and for which subplant-months we only have monthly data from EIA-923. 

Based on the data we have available, at this point we assign each subplant-month one of three potential data sources for the hourly data:
1. `cems`: For subplant-months for which we have hourly CEMS data for all CEMS units that make up that subplant, we will use the hourly values reported in CEMS. (TODO: Add a validation check for the net generation and fuel consumption totals.)
2. `partial_cems`: For subplant-months for which we have hourly CEMS data for only some of the CEMS units that make up a subplant, we will use the reported EIA-923 values to scale the partial hourly CEMS data from the units that do report, so that it matches the total value for the entire subplant. This step also calculates a partial-subplant scaling factor for each data column (e.g. net generation, fuel consumption) by comparing the total monthly CEMS data to the monthly EIA-923 data.
3. `eia`: For subplant-months for which no hourly data is reported in CEMS, we will attempt to use EIA-930 data to assign an hourly profile to the monthly EIA-923 data.
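
The three-way assignment can be sketched in plain Python as follows. Both function names are hypothetical, and the direction of the scaling factor (EIA total divided by CEMS total) is an assumption about how the partial data would be scaled up; `identify_hourly_data_source` may work differently.

```python
def classify_hourly_data_source(subplant_units, units_with_cems_data):
    """Return 'cems', 'partial_cems', or 'eia' for a single subplant-month."""
    units = set(subplant_units)
    reporting = units & set(units_with_cems_data)
    if units and reporting == units:
        return 'cems'          # every unit in the subplant reports hourly data
    if reporting:
        return 'partial_cems'  # only some units report hourly data
    return 'eia'               # no hourly data, so fall back to monthly EIA-923

def partial_cems_scaling_factor(eia_monthly_total, cems_monthly_total):
    # assumption: hourly CEMS values are multiplied by this factor so that
    # their monthly sum matches the EIA-923 total for the whole subplant
    return eia_monthly_total / cems_monthly_total

source = classify_hourly_data_source(['U1', 'U2'], ['U1'])
factor = partial_cems_scaling_factor(eia_monthly_total=120.0, cems_monthly_total=80.0)
```

In this example the subplant is `partial_cems`, and the reported hourly values would be scaled up by 1.5.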


In [131]:
eia923_allocated = data_cleaning.identify_hourly_data_source(eia923_allocated, cems, year)

# what percent of emissions is reported in CEMS vs EIA
# NOTE: This does not include emissions only reported by CEMS, so the % may be higher
(eia923_allocated.groupby('hourly_data_source')['co2_mass_tons_adjusted'].sum() / eia923_allocated.groupby('hourly_data_source')['co2_mass_tons_adjusted'].sum().sum(axis=0)).round(3)

hourly_data_source
cems            0.892
eia             0.105
partial_cems    0.003
Name: co2_mass_tons_adjusted, dtype: float64

# 7. Assign static characteristics to CEMS and EIA data to aid in aggregation

In [132]:
# assign a BA code and state code to each plant
eia923_allocated = data_cleaning.assign_ba_code_to_plant(eia923_allocated, year)

# assign a fuel category to each plant based on what is most likely to match with the category used in EIA-930
# TODO: Add two different fuel categories (one for 930, one that is more specific)
eia923_allocated = data_cleaning.assign_fuel_category_to_ESC(eia923_allocated, fuel_category_name='fuel_group_eia930', esc_column='plant_primary_fuel')

In [None]:
# add information that we need to aggregate the data and match to eia930
#cems = cems.drop(columns=['ba_code','ba_code_physical','state'])
#cems = cems.drop(columns=['distribution_flag'])
#cems = cems.drop(columns=['plant_primary_fuel'])

cems = data_cleaning.assign_ba_code_to_plant(cems, year)

# add a flag about whether the plant is distribution connected
cems = data_cleaning.identify_distribution_connected_plants(cems, year, voltage_threshold_kv=60)

# add a plant primary fuel and a fuel category for eia930
cems = cems.merge(primary_fuel_table.drop_duplicates(subset='plant_id_eia')[['plant_id_eia','plant_primary_fuel']], how='left', on='plant_id_eia')
cems = data_cleaning.assign_fuel_category_to_ESC(cems, fuel_category_name='fuel_group_eia930', esc_column='plant_primary_fuel')
cems = cems.rename(columns={'fuel_category':'fuel_category_eia930'})

In [None]:
# add a subplant id to each dataset
subplant_crosswalk = pd.read_csv('../data/output/subplant_crosswalk.csv')
cems = cems.merge(subplant_crosswalk[['plant_id_eia','unitid','subplant_id']], how='left', on=['plant_id_eia','unitid'])
eia923_allocated = eia923_allocated.merge(subplant_crosswalk[['plant_id_eia','generator_id','subplant_id']], how='left', on=['plant_id_eia','generator_id'])

In [None]:
# test to make sure that there is a complete subplant mapping
cems_missing_subplant_test = cems[cems['subplant_id'].isna()]
if not cems_missing_subplant_test.empty:
    print(f'Warning: There are {len(cems_missing_subplant_test)} CEMS records without a subplant ID. See `cems_missing_subplant_test` for details')
    

eia_missing_subplant_test = eia923_allocated[eia923_allocated['subplant_id'].isna()]
if not eia_missing_subplant_test.empty:
    print(f'Warning: There are {len(eia_missing_subplant_test)} EIA-923 records without a subplant ID. See `eia_missing_subplant_test` for details')
    

In [None]:
columns_to_export = ['plant_id_eia', 'operating_datetime_utc', 'gross_generation_mwh', 'net_generation_mwh', 'fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu', 'co2_mass_tons','co2_mass_tons_adjusted']
cems_output = cems[columns_to_export].groupby(['plant_id_eia','operating_datetime_utc']).sum().reset_index()
cems_output.to_csv('../data/final_outputs/plant_files/hourly_plant_data_from_cems.csv', index=False)

In [None]:
# export plant frame
plant_static_columns = ['plant_id_eia','plant_primary_fuel', 'fuel_category_eia930', 'ba_code', 'ba_code_physical', 'state', 'distribution_flag']
plant_frame = cems[plant_static_columns].drop_duplicates(subset='plant_id_eia')
plant_frame.to_csv('../data/final_outputs/plant_files/plant_static_attributes.csv', index=False)

# 8. Clean and Reconcile EIA-930 data
Load raw EIA-930 data, fix timestamp issues, perform physics-based reconciliation

Currently implemented in `notebooks/930_lag` and the `gridemissions` repository

Output: `data/eia930/chalendar/EBA_adjusted_elec.csv`

In [None]:
# not yet implemented here

# 9. Calculate Residual Net Generation Profile

Currently implemented in `notebooks/clean_930_compare_residual`
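
Conceptually, the residual profile for a BA-fuel category is the hourly net generation reported in EIA-930 minus the hourly net generation already accounted for by CEMS. The sketch below shows that subtraction on made-up data. Clipping negative residuals to zero is an assumption made here for illustration, not necessarily what the `clean_930_compare_residual` notebook does.

```python
import pandas as pd

hours = pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00', '2020-01-01 02:00'], utc=True)

# hypothetical hourly net generation (MWh) for one BA-fuel category
eia930_net_generation = pd.Series([500.0, 520.0, 480.0], index=hours)
cems_net_generation = pd.Series([400.0, 450.0, 490.0], index=hours)

# residual = generation reported to EIA-930 that is not accounted for by CEMS
residual_profile = eia930_net_generation - cems_net_generation

# assumption: a negative residual is treated as zero
residual_profile = residual_profile.clip(lower=0)
```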

In [None]:
# save CEMS data as CSV to be used to calculate residual profile
# NOTE: Once this code is integrated into the pipeline, this will not be necessary
# filter the data to export only the data/columns that we need for the residual calculation
cems_for_export = cems.loc[cems[['co2_mass_tons_adjusted','gross_generation_mwh','fuel_consumed_for_electricity_mmbtu']].sum(axis=1) > 0, ['ba_code', 'ba_code_physical',
       'energy_source_code','plant_primary_fuel','fuel_category_eia930','distribution_flag',
       'plant_id_eia', 'unitid', 'operating_datetime_utc',
       'net_generation_mwh']]

cems_for_export.to_csv(f'../data/output/cems_{year}_for_residual.csv', index=False)

In [None]:
# not yet implemented here

# 10. Assign hourly profile to monthly data
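
The core idea of this step is to spread each monthly total across the hours of that month in proportion to an hourly profile. A minimal sketch with a made-up four-hour "month" is shown below; the column names `profile` and `share` are placeholders, and `distribute_monthly_eia_data_to_hourly` handles many more details (multiple BAs, fuels, and data columns).

```python
import pandas as pd

# hypothetical hourly profile for a single BA-fuel category and report month
example_profile = pd.DataFrame({
    'report_date': ['2020-01-01'] * 4,
    'hour': [0, 1, 2, 3],
    'profile': [1.0, 3.0, 4.0, 2.0],
})
monthly_net_generation_mwh = 50.0

# each hour's share of the monthly profile total
monthly_profile_total = example_profile.groupby('report_date')['profile'].transform('sum')
example_profile['share'] = example_profile['profile'] / monthly_profile_total

# distribute the monthly value so that the hourly values sum back to the monthly total
example_profile['net_generation_mwh'] = example_profile['share'] * monthly_net_generation_mwh
```

A flat profile (all hours equal to 1.0) is the special case where every hour receives the same share.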


In [None]:
# create a separate dataframe containing only the generators for which we do not have CEMS data
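# NOTE: step 6 stores the source in the `hourly_data_source` column, where `eia` marks subplant-months with no hourly CEMS data
monthly_eia_data_to_distribute = eia923_allocated[(eia923_allocated['hourly_data_source'] == 'eia') & ~(eia923_allocated['fuel_consumed_mmbtu'].isna())]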

In [None]:
# load profile data and format for use in the pipeline
#hourly_profiles = eia930.load_chalendar_for_pipeline('../data/eia930/chalendar/EBA_adjusted_elec.csv', year)
hourly_profiles = pd.read_csv('../data/output/residual_profiles.csv', parse_dates=['report_date'])
hourly_profiles

In [None]:
def create_flat_profile(year, ba):
    # build an hourly UTC index that is padded by one day on each side of the year
    df_temp = pd.DataFrame(index=pd.date_range(start=f'{year-1}-12-31 00:00:00', end=f'{year+1}-01-01 23:00:00', freq='H', tz='UTC', name='datetime_utc'),
                                columns=['ba_code','fuel_category']).reset_index()

    df_temp['net_generation_mwh_930'] = 1.0
    # convert to the local timezone of the BA (passed in explicitly rather than read from a global variable)
    df_temp['datetime_local'] = df_temp['datetime_utc'].dt.tz_convert(data_cleaning.ba_timezone(ba=ba, type='local'))
    # create a report date column
    df_temp['report_date'] = df_temp['datetime_local'].astype(str).str[:7]
    df_temp['report_date'] = pd.to_datetime(df_temp['report_date'])

    return df_temp

In [None]:
# for fuel categories that exist in the EIA-923 data but not in EIA-930, create flat profiles to add to the hourly profiles from 930
# TODO: Identify for which BA-fuels a flat profile was created
# TODO: Is there a better assumption than flat?
ba_list = list(monthly_eia_data_to_distribute['ba_code'].dropna().unique())

# create an hourly datetime series in local time for each ba/fuel type
hourly_profiles_to_add = []

# for each ba
for ba in ba_list:
    # get a list of fuels categories that exist in that BA
    ba_fuel_list = list(monthly_eia_data_to_distribute.loc[monthly_eia_data_to_distribute['ba_code'] == ba,'fuel_category'].unique())
    for fuel in ba_fuel_list:
        # if there is no data for that fuel type in the eia930 data, create a flat profile
        if len(hourly_profiles[(hourly_profiles['ba_code'] == ba) & (hourly_profiles['fuel_category'] == fuel)]) == 0:
            print(f'Adding flat profile for {ba} {fuel}')
            df_temp = create_flat_profile(year, ba)
            df_temp['ba_code'] = ba
            df_temp['fuel_category'] = fuel
            hourly_profiles_to_add.append(df_temp)

# pd.concat raises an error on an empty list, so fall back to an empty dataframe if no flat profiles were needed
hourly_profiles_to_add = pd.concat(hourly_profiles_to_add, axis=0, ignore_index=True) if hourly_profiles_to_add else pd.DataFrame()

In [None]:
# concat the flat profiles to the hourly profiles
hourly_profiles = pd.concat([hourly_profiles,hourly_profiles_to_add], axis=0)

In [None]:
hourly_eia_data = data_cleaning.distribute_monthly_eia_data_to_hourly(monthly_eia_data_to_distribute, hourly_profiles, 'residual_scaled')

In [None]:
columns_for_output = ['ba_code', 'fuel_category', 'datetime_utc', 'net_generation_mwh', 'fuel_consumed_mmbtu', 'fuel_consumed_for_electricity_mmbtu', 'co2_mass_tons', 'co2_mass_tons_adjusted']
hourly_eia_data[columns_for_output].to_csv('../data/final_outputs/plant_files/hourly_data_distributed_from_eia.csv', index=False)

# 11. Aggregate CEMS data to BA-fuel and combine with hourly shaped EIA data


In [None]:
# aggregate cems data by BA and fuel type
cems_ba_fuel = cems.groupby(['ba_code','fuel_category_eia930','operating_datetime_utc']).sum()[['gross_generation_mwh','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
cems_ba_fuel['data_source'] = 'CEMS'

# rename the datetime_utc column
cems_ba_fuel = cems_ba_fuel.rename(columns={'operating_datetime_utc':'datetime_utc','fuel_category_eia930':'fuel_category'})


In [None]:
combined_data = pd.concat([cems_ba_fuel, hourly_eia_data.drop(columns=['datetime_local','report_date'])], axis=0)
# only keep relevant columns
combined_data = combined_data[['ba_code','fuel_category','datetime_utc','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted','data_source']]
combined_data

## Export hourly data to final CSVs (one for each BA)

In [None]:
for ba in list(combined_data.ba_code.unique()):

    # filter the data for a single BA
    ba_table = combined_data[combined_data['ba_code'] == ba].drop(columns='ba_code')

    # convert the datetime_utc column back to a datetime
    ba_table['datetime_utc'] = pd.to_datetime(ba_table['datetime_utc'], utc=True)

    # combine the data from CEMS and EIA for each fuel-hour
    ba_table = ba_table.groupby(['fuel_category', 'datetime_utc']).sum().reset_index()

    # calculate a total for the BA
    ba_total = ba_table.groupby(['datetime_utc']).sum()[['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
    ba_total['fuel_category'] = 'total'

    # concat the totals to the fuel-specific totals
    ba_table = pd.concat([ba_table,ba_total], axis=0, ignore_index=True)

    # calculate a generated emission rate
    # 0/0 (no emissions and no generation) is set to 0; a non-zero value divided by zero generation (inf) is set to missing
    ba_table['generated_co2_rate_lb_per_mwh'] = (ba_table['co2_mass_tons'] * 2000 / ba_table['net_generation_mwh']).fillna(0).replace(np.inf, np.nan)
    ba_table['adjusted_generated_co2_rate_lb_per_mwh'] = (ba_table['co2_mass_tons_adjusted'] * 2000 / ba_table['net_generation_mwh']).fillna(0).replace(np.inf, np.nan)

    ba_table = ba_table.pivot(index='datetime_utc', columns='fuel_category')

    # round all values to one decimal place
    ba_table = ba_table.round(1)

    # flatten the multilevel column into a single column name like data_fuelname
    ba_table.columns = ['_'.join(col) for col in ba_table.columns.values]

    # export to a csv
    ba_table.to_csv(f'../data/final_outputs/ba_files/{ba}.csv')

In [None]:
def graph_hourly_data_by_fuel_category(hourly_data, ba, column_name, fuel_category_name, plot_type):
    fuel_color = {'natural_gas':'sienna',
              'coal':'black',
              'nuclear':'red',
              'biomass':'green',
              'geothermal':'orange',
              'wind':'blue',
              'solar':'gold',
              'petroleum':'purple',
              'hydro':'skyblue',
              'other':'lightgrey',
              'waste':'pink'}

    fuel_order = ['nuclear','geothermal','hydro','other','coal','biomass','petroleum','waste','solar','wind','natural_gas']

    data_to_graph = hourly_data[hourly_data['ba_code'] == ba]
    #data_to_graph = data_to_graph.groupby(['datetime_utc','fuel_category','data_source']).sum().reset_index()

    if plot_type == 'area':
        plot = px.area(data_to_graph, 
                x='datetime_utc', 
                y=column_name, 
                color=fuel_category_name, 
                color_discrete_map=fuel_color, 
                facet_col='data_source', 
                template='plotly_white',
                title=f'Hourly {column_name} data for {ba} by fuel type',
                category_orders={fuel_category_name:fuel_order}).update_traces(line={'width':0})
    elif plot_type == 'line':
        # NOTE: unlike the area plot, the line width is left at its default so that the lines are visible
        plot = px.line(data_to_graph, 
                x='datetime_utc', 
                y=column_name, 
                color=fuel_category_name, 
                color_discrete_map=fuel_color, 
                facet_col='data_source', 
                template='plotly_white',
                title=f'Hourly {column_name} data for {ba} by fuel type',
                category_orders={fuel_category_name:fuel_order})
    else:
        raise ValueError(f"plot_type must be 'area' or 'line', got {plot_type!r}")

    return plot


In [None]:
ba = 'CISO'

graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='net_generation_mwh', fuel_category_name='fuel_category', plot_type='area')

In [None]:
graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='co2_mass_tons', fuel_category_name='fuel_category', plot_type='area')

In [None]:
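# combined_data has no emission rate column, so calculate the generated CO2 rate (lb/MWh) before plotting
graph_hourly_data_by_fuel_category(hourly_data=combined_data.assign(generated_co2_rate_lb_per_mwh=lambda df: (df['co2_mass_tons'] * 2000 / df['net_generation_mwh']).replace([np.inf, -np.inf], np.nan)), ba=ba, column_name='generated_co2_rate_lb_per_mwh', fuel_category_name='fuel_category', plot_type='line')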

# Data Quality Metrics

### Compare monthly totals for each plant from each data source

In [None]:
# for plants where there is data reported in cems, see how off it is from data reported in eia
cems_plant_monthly = cems.groupby(['plant_id_eia','subplant_id','report_date'], dropna=False).sum()[['gross_generation_mwh','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
gf_plant_monthly = eia923_allocated.groupby(['plant_id_eia','subplant_id','report_date'], dropna=False).sum().reset_index()
compare_cems_eia = gf_plant_monthly.merge(cems_plant_monthly, how='inner', on=['plant_id_eia','subplant_id','report_date'], suffixes=("_eia",'_cems'))


for column in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']:
    compare_cems_eia[f'{column}_pctdiff'] = ((compare_cems_eia[f'{column}_cems'].replace(0,0.1) - compare_cems_eia[f'{column}_eia'].replace(0,0.1)) / compare_cems_eia[f'{column}_eia'].replace(0,0.1)).round(3)

compare_cems_eia = compare_cems_eia.set_index(['plant_id_eia','subplant_id','report_date'])
compare_cems_eia = compare_cems_eia.reindex(sorted(compare_cems_eia.columns), axis=1)

In [None]:
# identify where there are differences between reported CEMS and EIA values for the same subplant-month
compare_cems_eia[~compare_cems_eia.filter(like='pctdiff').sum(axis=1).between(-0.05,0.05)]

## Compare Our Results to eGRID

In [None]:
# Aggregate total calculated values
###################################

# Aggregate cems and eia data by plant id, then combine
cems_plant_annual = cems.groupby(['ba_code','state','plant_id_eia'], dropna=False).sum()[['net_generation_mwh','fuel_consumed_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
eia_plant_annual = monthly_eia_data_to_distribute.groupby(['ba_code','state','plant_id_eia'], dropna=False).sum()[['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].reset_index()
plant_annual_total = pd.concat([cems_plant_annual,eia_plant_annual], axis=0)
# group any plants that have records from both datasets
plant_annual_total = plant_annual_total.groupby(['ba_code','state','plant_id_eia'], dropna=False).sum().reset_index()

# add a egrid id
plant_annual_total = validation.add_egrid_plant_id(plant_annual_total, from_id='eia', to_id='egrid')

# Load the eGRID plant table
egrid_plant = load_data.load_egrid_plant_file(year)

### Identify plants in eGRID missing from our results

These plants seem to have retired before 2020

In [None]:
# identify any plants that are in egrid but not in our totals
# NOTE: compare on the eGRID plant ID, since that is the ID used to filter `egrid_plant` below
plant_not_in_calc = list(set(egrid_plant['plant_id_egrid'].unique()) - set(plant_annual_total['plant_id_egrid'].unique()))

# Which plants are included in eGRID but are missing from our calculations?
missing_from_calc = egrid_plant[egrid_plant['plant_id_egrid'].isin(plant_not_in_calc)]

# see if any of these plants are retired
generators_eia860 = load_data.load_pudl_table('generators_eia860', year=year)
missing_from_calc.merge(generators_eia860.groupby('plant_id_eia')['retirement_date'].unique().reset_index(), how='left', on='plant_id_eia')

### Identify plants in our calculations that are missing from eGRID

In [None]:
# Which plants are in our calculations, but are missing from eGRID?
plants_not_in_egrid = list(set(plant_annual_total['plant_id_egrid'].unique()) - set(egrid_plant['plant_id_egrid'].unique()))

plant_names = load_data.load_pudl_table('plants_entity_eia')[['plant_id_eia','plant_name_eia','sector_name_eia']]
missing_from_egrid = plant_annual_total[plant_annual_total['plant_id_egrid'].isin(plants_not_in_egrid)].merge(plant_names, how='left', on='plant_id_eia')

missing_from_egrid

In [None]:
# how many of the plants missing from egrid have non-zero data
missing_from_egrid[missing_from_egrid['fuel_consumed_mmbtu'] > 1]

### Compare whether totals for plants with EPA IDs that differ from EIA IDs match

In [None]:
# identify where there is a single egrid plant id for multiple eia plant ids
double_ids = plant_annual_total[plant_annual_total['plant_id_egrid'].duplicated(keep=False)]
double_ids = double_ids.groupby('plant_id_egrid')['net_generation_mwh'].sum().reset_index() # focus on net generation for now
# merge the egrid data
double_ids = double_ids.merge(egrid_plant[['plant_id_egrid','net_generation_mwh']], how='left', on='plant_id_egrid', suffixes=('_calc','_egrid'))
double_ids['percent_diff'] = ((double_ids['net_generation_mwh_calc'] - double_ids['net_generation_mwh_egrid']) / double_ids['net_generation_mwh_egrid']).round(3)
double_ids

### Identify plants where our BA assignment does not match eGRID

In [None]:
ba_code_match = egrid_plant.set_index('plant_id_eia')[['plant_name','ba_code']].merge(plant_annual_total.set_index('plant_id_eia')[['ba_code']], how='inner', left_index=True, right_index=True, suffixes=("_egrid",'_calc'))

# plants with missing ba code
ba_code_match[(ba_code_match['ba_code_calc'].isna()) & ~(ba_code_match['ba_code_egrid'].isna())]

In [None]:
# plants with incorrect ba code
ba_code_match[(ba_code_match['ba_code_calc'] != ba_code_match['ba_code_egrid']) & ~(ba_code_match['ba_code_calc'].isna())]

## Identify where eGRID might be missing data
It seems that eGRID is missing fuel consumption data for quite a few generators.
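
As a sketch of how such gaps can be flagged, the toy example below compares annual fuel consumption from two sources and keeps the plants where the eGRID value is more than 1% below the EIA-923 value. The plant IDs and values are invented for illustration; only the column names mirror those used in this notebook.

```python
import pandas as pd

# Invented annual fuel consumption (mmbtu) for three plants
egrid = pd.DataFrame({"plant_id_egrid": [1, 2, 3], "fuel_consumed_mmbtu": [100.0, 50.0, 0.0]})
eia923 = pd.DataFrame({"plant_id_egrid": [1, 2, 3], "fuel_consumed_mmbtu": [100.0, 80.0, 40.0]})

comparison = egrid.merge(eia923, how="left", on="plant_id_egrid", suffixes=("_egrid", "_eia923"))
comparison["difference"] = comparison["fuel_consumed_mmbtu_egrid"] - comparison["fuel_consumed_mmbtu_eia923"]
comparison["percent_difference"] = comparison["difference"] / comparison["fuel_consumed_mmbtu_eia923"]

# A negative percent difference means eGRID reports less fuel than EIA-923
possibly_missing = comparison[comparison["percent_difference"] < -0.01]
print(possibly_missing)
```

Because the difference is taken as eGRID minus EIA-923, any totals built from `difference` for these plants will be negative.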

In [None]:
pudl_out = load_data.initialize_pudl_out(year)

# load the EIA generator fuel data
IDX_PM_ESC = ["report_date", "plant_id_eia", "energy_source_code", "prime_mover_code"]
gf = (pudl_out.gf_eia923().loc[:,IDX_PM_ESC + ["net_generation_mwh","fuel_consumed_mmbtu","fuel_consumed_for_electricity_mmbtu",],])

# add egrid plant ids
egrid_crosswalk = pd.read_csv('../data/egrid/egrid_static_tables/2020/table_C5_crosswalk_of_EIA_ID_to_EPA_ID.csv')
eia_to_egrid_id = dict(zip(list(egrid_crosswalk['plant_id_eia']), list(egrid_crosswalk['plant_id_egrid'])))
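# start from the EIA plant id, then overwrite it wherever the crosswalk lists a different eGRID id
# (assign the result directly: calling .update() on a selected column may not modify gf under pandas copy-on-write)
gf['plant_id_egrid'] = gf['plant_id_eia'].map(eia_to_egrid_id).fillna(gf['plant_id_eia']).astype(gf['plant_id_eia'].dtype)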

# calculate an annual total for each plant
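# sum only the numeric columns (report_date and the code columns cannot be summed)
gf_total = gf.groupby(['plant_id_egrid']).sum(numeric_only=True).reset_index()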

# choose a metric to compare
metric = 'fuel_consumed_mmbtu'

egrid_eia_comparison = egrid_plant[['plant_id_egrid','plant_name','ba_code','energy_source_code',metric]].merge(gf_total[['plant_id_egrid',metric]], how='left', on='plant_id_egrid', suffixes=('_egrid','_eia923')).round(0)

egrid_eia_comparison['difference'] = egrid_eia_comparison[f'{metric}_egrid'] - egrid_eia_comparison[f'{metric}_eia923']
egrid_eia_comparison['percent_difference'] = (egrid_eia_comparison[f'{metric}_egrid'] - egrid_eia_comparison[f'{metric}_eia923']) / egrid_eia_comparison[f'{metric}_eia923']

In [None]:
# egrid seems to be missing fuel consumption data for most nuclear power plants
missing_nuclear = egrid_eia_comparison[egrid_eia_comparison['energy_source_code'] == 'NUC']
missing_nuclear.sum()

In [None]:
# where is egrid missing data?
egrid_eia_comparison[(egrid_eia_comparison['percent_difference'] < -0.01) & (egrid_eia_comparison['energy_source_code'] != 'NUC')]

In [None]:
# how much emissions does this account for?
# group by fuel code
missing_emissions = egrid_eia_comparison[(egrid_eia_comparison['percent_difference'] < -0.01) & (egrid_eia_comparison['energy_source_code'] != 'NUC')].groupby('energy_source_code').sum().reset_index()

# get emission factors
emission_factors = load_data.load_emission_factors()[['energy_source_code', 'co2_tons_per_mmbtu']]
missing_emissions = missing_emissions.merge(emission_factors, how='left', on='energy_source_code')
missing_emissions['co2_mass_tons'] = missing_emissions['difference'] * missing_emissions['co2_tons_per_mmbtu']
missing_emissions.sum()

## Plant Metric

In [None]:
# standardize column names and index so that the two dfs can be divided
calculated_to_compare = plant_annual_total.groupby('plant_id_egrid').sum().drop(columns=['plant_id_eia'])

# drop the plants that have no data in eGRID
plants_with_no_data_in_egrid = list(egrid_plant[egrid_plant[['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']].sum(axis=1) == 0]['plant_id_egrid'])
egrid_plant = egrid_plant[~egrid_plant['plant_id_egrid'].isin(plants_with_no_data_in_egrid)]

egrid_to_compare = egrid_plant.set_index(['plant_id_egrid']).drop(columns=['ba_code','state','plant_name','plant_id_eia'])

# divide calculated value by egrid value
compared = calculated_to_compare.div(egrid_to_compare).merge(egrid_plant[['plant_id_egrid','plant_name','ba_code', 'state']], how='left', left_index=True, right_on='plant_id_egrid').set_index('plant_id_egrid')
compared['plant_name'] = compared['plant_name'].fillna('unknown')

# create a dataframe that merges the two sources of data together
compared_merged = calculated_to_compare.merge(egrid_to_compare, how='outer', on='plant_id_egrid', suffixes=('_calc','_egrid'))

# for each column, set the ratio to 1 (a match) if both values are zero (the ratio is only nan b/c of divide by zero)
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    # identify plants with zero values for both
    plant_ids = list(compared_merged[(compared_merged[f'{col}_calc'] == 0) & (compared_merged[f'{col}_egrid'] == 0)].index)
    compared.loc[compared.index.isin(plant_ids), col] = 1

# for each column, categorize the data based on how far it is off from egrid
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    # add a new column
    compared[f'{col}_status'] = pd.cut(x=compared[col], 
                                       bins=[-999999999,0,0.5,0.9,0.99,0.9999,1,1.0001,1.01,1.1,1.5,999999999], 
                                       labels=['negative','<50%','+/-50%','+/-10%','+/-1%','!exact','!exact','+/-1%','+/-10%','+/-50%','>50%'], 
                                       ordered=False)
    # replace any missing values with missing
    compared[f'{col}_status'] = compared[f'{col}_status'].astype(str)  
    compared[f'{col}_status'] = compared[f'{col}_status'].fillna('missing')
    compared[f'{col}_status'] = compared[f'{col}_status'].replace('nan','missing')
    compared.loc[(compared.index.isin(plants_not_in_egrid)),f'{col}_status'] = 'not_in_egrid'

# identify which plants are missing from egrid vs calculated values
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    # identify plants that are missing in egrid
    plants_missing_egrid = list(compared_merged[(compared_merged[f'{col}_calc'] > 0) & (compared_merged[f'{col}_egrid'].isna())].index)
    compared.loc[compared.index.isin(plants_missing_egrid), f'{col}_status'] = 'missing_in_egrid'
    # identify plants that are missing from our calculations
    plants_missing_calc = list(compared_merged[(compared_merged[f'{col}_calc'].isna()) & (compared_merged[f'{col}_egrid'] > 0)].index)
    compared.loc[compared.index.isin(plants_missing_calc), f'{col}_status'] = 'missing_in_calc'
    # identify where our calculations are missing a zero value
    plants_missing_zero_calc = list(compared_merged[(compared_merged[f'{col}_calc'].isna()) & (compared_merged[f'{col}_egrid'] == 0)].index)
    compared.loc[compared.index.isin(plants_missing_zero_calc), f'{col}_status'] = 'missing_zero_in_calc'
    # identify where egrid has a missing value instead of a zero
    plants_missing_zero_egrid = list(compared_merged[(compared_merged[f'{col}_calc'] == 0) & (compared_merged[f'{col}_egrid'].isna())].index)
    compared.loc[compared.index.isin(plants_missing_zero_egrid), f'{col}_status'] = 'missing_zero_in_egrid'
    # identify where egrid has a zero value where we have a positive value
    plants_incorrect_zero_egrid = list(compared_merged[(compared_merged[f'{col}_calc'] > 0) & (compared_merged[f'{col}_egrid'] == 0)].index)
    compared.loc[compared.index.isin(plants_incorrect_zero_egrid), f'{col}_status'] = '>50%'

# create a dataframe that counts how many plants are in each category
comparison_count = []
for col in ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons_adjusted','co2_mass_tons']:
    count = compared.groupby(f'{col}_status', dropna=False).count()['plant_name'].rename(col)
    count.index = count.index.rename('status')
    comparison_count.append(count)

comparison_count = pd.concat(comparison_count, axis=1).fillna(0).astype(int)
comparison_count = pd.concat([comparison_count, pd.DataFrame(comparison_count.sum().rename('Total')).T], axis=0)   
comparison_count

In [None]:
print(comparison_count.to_markdown())

## BA Metric

In [None]:
# load egrid BA totals
"""egrid_ba = pd.read_excel(f'../data/egrid/egrid{year}_data.xlsx', sheet_name=f'BA{str(year)[-2:]}', header=1, usecols=['BANAME','BACODE','BAHTIANT','BANGENAN','BACO2AN'])
# rename the columns
egrid_ba = egrid_ba.rename(columns={'BANAME':'ba_name',
                                    'BACODE':'ba_code',
                                    'BAHTIANT':'fuel_consumed_mmbtu',
                                    'BANGENAN':'net_generation_mwh',
                                    'BACO2AN':'co2_mass_tons'})"""

data_columns = ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons','co2_mass_tons_adjusted']

#aggregate the plant data up to the BA level
egrid_ba = egrid_plant.groupby(['ba_code']).sum()[data_columns].reset_index()

# divide our calculation by the BA totals from eGRID
# if there are 0 values, replace with 0.1, so that div by zero doesn't return missing value
ba_metric = plant_annual_total.groupby('ba_code', dropna=False).sum().drop(columns='plant_id_eia').replace(0,0.1).div(egrid_ba.set_index('ba_code').replace(0,0.1)).sort_values(by='co2_mass_tons').round(3)

total = pd.DataFrame(plant_annual_total[data_columns].sum().div(egrid_ba[data_columns].sum()).rename('Total')).T

# calculate the difference in the number of plants in each region
plant_count = (plant_annual_total.groupby('ba_code', dropna=False).count()['plant_id_egrid'] - egrid_plant.groupby('ba_code', dropna=False).count()['plant_id_egrid']).rename('num_plants')
ba_metric = ba_metric.merge(plant_count, how='left', left_index=True, right_index=True).drop(columns=['plant_id_egrid']).sort_index()

ba_metric = pd.concat([ba_metric, total], axis=0).round(2) 

ba_metric = ba_metric[data_columns + ['num_plants']]

columns_to_check = ['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_tons']

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(ba_metric[~(ba_metric[columns_to_check] == 1).all(axis=1)])

In [None]:
# how much co2 comes from CEMS vs EIA
data_source_by_ba = pd.pivot_table(eia923_allocated, values='co2_mass_tons', index='ba_code', columns='data_source', dropna=False, aggfunc='sum').replace(0,0.001)
data_source_by_ba = data_source_by_ba.div(data_source_by_ba.sum(axis=1), axis=0)
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(data_source_by_ba.round(3).fillna(0).sort_values(by='cems'))

In [None]:
print(ba_metric.to_markdown())

## Explore specific plants

### Notes

Net generation < 50%:
 - 2617 has negative net generation
 - 10444, 59395: some plants are hybrid fossil / solar plants, but the primary energy source code is getting listed as PV for certain generators, which is causing an issue

Net generation > 50%:
 - 335 has two extra units (CT1 and CT2) that report to CEMS but not EIA. This is likely correct
 - 7288 reports DFO to CEMS, but NG to EIA; there are probably some generators not reporting
 - 60698 is double counting generation from CEMS because of the 90% heat threshold in a month; this might be fixed once we tackle heat input
 - 58256 is a solar/hybrid plant, and in generators_entity_eia the battery portion is associated with a separate generator id (there should only be 1)


Other:
 - plant 3754 has heat input in CEMS and EIA that don't match
 - 2401 has generation in both CEMS and EIA
 - 50933 might have an allocation issue (doesn't appear in CEMS)

To check
 - 1404 reports generation to CEMS in December, but not to EIA. This is likely correct
 - plant 2504 has three units (120, 121, 122) that don't appear in EIA, and in CEMS only report steam. 

Fuel > 50%
 - Plant 3116 reports much more heat input to CEMS during ozone months than to EIA

BA Totals
 - TEPC and SRP are off because the Gila River Generator is shared between SRP and TEPC, and eGRID reports all generation from this project as belonging to TEPC
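
The status labels referenced in these notes (such as `<50%` and `>50%`) come from binning the ratio of the calculated value to the eGRID value. The sketch below applies the same bin edges and labels as the Plant Metric cell to a few invented ratios. Note that `pd.cut` accepts repeated labels only when `ordered=False`. The way missing ratios are relabeled here (adding a category and filling) is an alternative chosen for the sketch, not the notebook's own approach.

```python
import pandas as pd

# Invented ratios of calculated value / eGRID value
ratios = pd.Series([-0.2, 0.3, 0.95, 1.0, 1.005, 2.0, float("nan")])

status = pd.cut(
    x=ratios,
    bins=[-999999999, 0, 0.5, 0.9, 0.99, 0.9999, 1, 1.0001, 1.01, 1.1, 1.5, 999999999],
    labels=['negative', '<50%', '+/-50%', '+/-10%', '+/-1%', '!exact',
            '!exact', '+/-1%', '+/-10%', '+/-50%', '>50%'],
    ordered=False,  # required because labels repeat on either side of 1
)

# Ratios that could not be computed stay NaN after pd.cut; relabel them as 'missing'
status = status.cat.add_categories('missing').fillna('missing')
print(status.tolist())
```

The bins are right-inclusive, so a ratio of exactly 1 lands in the `(0.9999, 1]` bin and is labeled `!exact`.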


In [None]:
# examine specific plants in a category
value = 'fuel_consumed_mmbtu'
status = '>50%'

#compared_merged.loc[64877,:]

compared[compared[f'{value}_status'] == status].sort_values(by=value)

In [None]:
compared[compared[f'{value}_status'] == status].sort_values(by=value).sample(10)

In [None]:
plant_to_explore = 58380

In [None]:
egrid_plant[egrid_plant['plant_id_eia'] == plant_to_explore]

In [None]:
plant_annual_total[plant_annual_total['plant_id_eia'] == plant_to_explore]

In [None]:
eia923_allocated[eia923_allocated['plant_id_eia'] == plant_to_explore].sum()

In [None]:
cems_unit_monthly = cems.groupby(['plant_id_eia','unitid','report_date']).sum().reset_index()


In [None]:
cems_unit_monthly[cems_unit_monthly['plant_id_eia'] == plant_to_explore].sum()

In [None]:
cems[(cems['plant_id_eia'] == plant_to_explore) & (cems['operating_time_hours'] > 0)]

In [None]:
cems_plant_annual[cems_plant_annual['plant_id_eia'] == plant_to_explore]

In [None]:
# there are some plants that report heat input and co2 in CEMS, but are missing net generation data
# TODO: we should maybe try and fill net generation data using EIA-923?
cems_plant_annual[cems_plant_annual['net_generation_mwh'] == 0]

## Explore BA Matching
### Notes
TEPC and SRP are off because the Gila River Generator is shared between SRP and TEPC, and eGRID reports all generation from this project as belonging to TEPC
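
A toy example shows why assigning a shared plant to a single BA skews both BA ratios even when the overall total matches. The BA codes are reused from the note above, but the generation values and the even split of the shared plant are hypothetical.

```python
import pandas as pd

# Hypothetical net generation (MWh). Each BA has 100 MWh of its own generation,
# plus a shared plant that produces 100 MWh in total.
calc = pd.Series({"SRP": 150.0, "TEPC": 150.0})       # shared plant split evenly
reference = pd.Series({"SRP": 100.0, "TEPC": 200.0})  # shared plant assigned entirely to TEPC

# Ratio of calculated to reference values, per BA and overall
ba_ratio = calc.div(reference)
total_ratio = calc.sum() / reference.sum()

print(ba_ratio.to_dict(), total_ratio)
```

In this sketch one BA reads high and the other low while the total ratio is exactly 1, which is the pattern to look for when a mismatch comes from BA assignment and not from missing data.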

In [None]:
ba_plant_match = plant_annual_total[plant_annual_total['ba_code'].isna()].merge(egrid_plant[egrid_plant['ba_code'].isna()], how='left', on='plant_id_egrid', suffixes=("_calc",'_egrid'))
ba_plant_match[ba_plant_match['net_generation_mwh_calc'].round(0) != ba_plant_match['net_generation_mwh_egrid'].round(0)]

In [None]:
egrid_plant[egrid_plant['ba_code'] == 'CPLE']

In [None]:
compare_plants_in_ba = egrid_plant.merge(plant_annual_total, how='outer', on='plant_id_egrid', suffixes=('_egrid','_calc'))

In [None]:
ba = 'CPLE'
metric = 'fuel_consumed_mmbtu'

compare_plants_in_ba[((compare_plants_in_ba['ba_code_egrid'] == ba) | (compare_plants_in_ba['ba_code_calc'] == ba)) & (compare_plants_in_ba[f'{metric}_egrid'].round(0) != compare_plants_in_ba[f'{metric}_calc'].round(0))]

In [None]:
plant_annual_total[plant_annual_total['ba_code'] == 'AMPL']