# Hourly eGRID Data Pipeline Overview

## Data Used
EPA Continuous Emissions Monitoring System (CEMS) data
 - What is it: Measured hourly gross generation, fuel consumption, and emissions data for emitting power generation units > 25 MW
 - How we use it: Primary source for hourly emissions and generation data

EIA Form 923
 - What is it: Reported monthly net generation and fuel consumption data for power generators > 1 MW
 - How we use it: To convert gross generation data from CEMS to net generation, and to calculate emissions that are not reported to CEMS

EIA Form 860
 - What is it: Inventory of all generators and plants and their static characteristics
 - How we use it: To transform and aggregate the data reported in CEMS and EIA-923 based on plant and generator characteristics

EPA-EIA Power Sector Data Crosswalk
 - What is it: Maps EPA plant IDs and unit IDs to EIA plant IDs and generator IDs
 - How we use it: To match data between CEMS and EIA-923

EIA Form 930 / Hourly Electric Grid Monitor
 - What is it: Reported hourly net generation by fuel category, demand, and interchange for each Balancing Area in the U.S. 
 - How we use it: To assign an hourly profile to the monthly generation and fuel data reported in EIA-923

EPA eGRID database
 - What is it: Reports annual-level generation and emissions statistics at the plant and BA level 
 - How we use it: To validate our outputs

## Process
1. Download data, including CEMS (via PUDL), EIA Forms 860 and 923 (via PUDL), EPA-EIA Power Sector Data Crosswalk, EIA-930 data
2. Identify subplants and gross-to-net generation factors using multiple years of historical data.
    - Using Power Sector Data Crosswalk, identify distinct subplant clusters of EPA units and EIA generators in each plant
    - Using multiple years of generation data from CEMS and EIA-923, run linear regressions of net generation on gross generation at the subplant and plant level
    - Calculate simple monthly ratios between gross and net generation at the subplant and plant level
3. Clean monthly generation and fuel data from EIA-923
    - Allocate monthly net generation and fuel consumption data reported for each plant prime mover to each plant generator
    - Calculate monthly emissions for each generator based on its fuel consumption and fuel source
    - Remove data for non grid-connected plants and plants in Puerto Rico
    - Assign a primary fuel type and balancing authority location to each generator
4. Clean hourly generation, fuel, and emissions data from CEMS
    - Remove data for non grid-connected plants, plants in Puerto Rico, and certain steam-only units
    - Assign a monthly "report_date" to each hourly observation based on the date of the local timestamp (this allows us to match the data to EIA-923 report dates)
    - Assign a fuel type to each unit
    - Fill in missing hourly emissions data using the assigned fuel type and reported hourly fuel consumption data
    - Remove all observations for each unit-month when no operation is reported for that unit in that month
    - Allocate hourly data for combined heat and power plants between electricity generation and steam production
    - Remove data for units for which we are unable to fill missing emissions data
5. Convert hourly gross generation in CEMS to hourly net generation
    - Aggregate CEMS gross generation to the monthly level to match with monthly-reported net generation
    - Apply several methodologies to calculate gross-to-net (GTN) generation conversion factors
    - Apply GTN factors to convert hourly gross generation to hourly net generation
6. Crosswalk the CEMS data to the EIA-923 data to identify for which generator-months there is no hourly data reported in CEMS
    - Use the EPA-EIA Power Sector Data Crosswalk
    - Assign subplant groupings to data
7. Assign static plant characteristics to CEMS and EIA data to allow for data aggregation and matching with EIA-930
    - Assign generator and plant-level primary fuel
    - Assign Balancing Authority and State to each plant
    - Assign fuel categories to each plant that match EIA-930 categories
8. Clean and reconcile EIA-930 data
    - Fix timezone/timestamp issues with raw 930 data
    - Perform physics-based reconciliation so that data satisfies conservation of energy equations
9. Calculate residual net generation profiles for each BA-fuel category by comparing EIA-930 and CEMS hourly net generation data
10. Assign monthly EIA-923 data an hourly profile based on the residual net generation profile
11. Concatenate the shaped hourly EIA-923 data to the hourly CEMS data
12. Run validation checks on processed data
13. Aggregate the hourly data to the BA level and output
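
The `report_date` assignment described in step 4 can be illustrated with a minimal sketch. This is not the code in `data_cleaning`; the function name `assign_report_date` and the example timezone are assumptions made for illustration. The point it shows is that the month is taken from the local timestamp, so an hour that falls in February in UTC can still belong to the January report month.

```python
import pandas as pd


def assign_report_date(df, timezone, datetime_column="operating_datetime_utc"):
    """Add a monthly report_date based on the local (not UTC) timestamp.

    Hypothetical helper for illustration; not part of the pipeline modules.
    """
    df = df.copy()
    # convert the UTC timestamps to local time
    local_time = df[datetime_column].dt.tz_convert(timezone)
    # drop the timezone, then truncate to the first day of the local month
    df["report_date"] = (
        local_time.dt.tz_localize(None).dt.to_period("M").dt.to_timestamp()
    )
    return df


hourly = pd.DataFrame(
    {
        "operating_datetime_utc": pd.to_datetime(
            ["2020-02-01 03:00", "2020-02-01 09:00"], utc=True
        ),
        "gross_generation_mwh": [10.0, 12.0],
    }
)
# 03:00 UTC on Feb 1 is still Jan 31 in Pacific time
hourly = assign_report_date(hourly, "America/Los_Angeles")
```

Matching on the local month keeps the hourly CEMS totals comparable with the monthly values reported in EIA-923.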


## Outputs
 - Processed hourly subplant-level data
 - Aggregated hourly data for each BA (total emissions, total generation, generated carbon intensity)

## Output Validation Checks
 - Aggregate data to annual level and compare with published eGRID results
 - Check that aggregated heat rates and emissions rates by fuel type are within reasonable ranges for each BA
 - Plant-level checks for anomalous data
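
As a rough illustration of the first check, the sketch below sums hourly values to annual totals and computes the fractional difference from a reference table. The function name `compare_annual_totals`, the column names, and the toy numbers are assumptions for illustration only; the actual checks live in `src.validation` and may differ.

```python
import pandas as pd


def compare_annual_totals(hourly, reference, key="plant_id_eia", column="co2_mass_lb"):
    """Sum hourly data to annual totals and compare against reference values.

    Hypothetical helper: `reference` stands in for a table of published
    eGRID annual values with one row per plant.
    """
    annual = hourly.groupby(key, as_index=False)[column].sum()
    merged = annual.merge(
        reference, on=key, how="outer", suffixes=("_calc", "_egrid"), validate="1:1"
    )
    # fractional difference relative to the reference value
    merged["pct_diff"] = (
        merged[f"{column}_calc"] - merged[f"{column}_egrid"]
    ) / merged[f"{column}_egrid"]
    return merged


hourly_example = pd.DataFrame(
    {"plant_id_eia": [1, 1, 2], "co2_mass_lb": [100.0, 110.0, 50.0]}
)
reference_example = pd.DataFrame(
    {"plant_id_eia": [1, 2], "co2_mass_lb": [200.0, 50.0]}
)
comparison = compare_annual_totals(hourly_example, reference_example).set_index(
    "plant_id_eia"
)
```

Using an outer merge keeps plants that appear in only one of the two tables, which is itself worth flagging during validation.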

In [None]:
# import packages
import numpy as np
import pandas as pd
import plotly.express as px
from IPython.display import display

In [None]:
%reload_ext autoreload
%autoreload 2

# Tell python where to look for modules. 
import sys
sys.path.append('../../hourly-egrid/')

# import local modules
import src.data_cleaning as data_cleaning
import src.gross_to_net_generation as gross_to_net_generation
import src.load_data as load_data
import src.validation as validation

# Specify the year for analysis
This data pipeline can only be run for a single year at a time.

In [None]:
year = 2020

# 1. Download data
Checks whether all necessary files are already downloaded and, if not, downloads them.

In [None]:
############### PUDL Database ######################

load_data.download_pudl_data(zenodo_url = 'https://zenodo.org/record/6349861/files/pudl-v0.6.0-2022-03-12.tgz')
load_data.download_updated_pudl_database(download=False)

################# eGRID data #########################

# the 2019 and 2020 data appear to be hosted on different urls
egrid_files_to_download = ['https://www.epa.gov/sites/default/files/2021-02/egrid2019_data.xlsx', 
                           'https://www.epa.gov/system/files/documents/2022-01/egrid2020_data.xlsx']

load_data.download_egrid_files(egrid_files_to_download)

############# EIA-930 data #####################

load_data.download_eia930_data(years_to_download=[year])
load_data.download_chalendar_files()

########## Power Sector Data Crosswalk #############
# NOTE: Check for new releases at https://github.com/USEPA/camd-eia-crosswalk

load_data.download_epa_psdc(psdc_url='https://github.com/USEPA/camd-eia-crosswalk/releases/download/v0.2.1/epa_eia_crosswalk.csv')


# 2. Identify subplants and gross-to-net ratios
This step is not yet integrated into the main data pipeline. It requires loading and cleaning EIA-923 and CEMS data for multiple years; we recommend using at least the three most recent years of data. Because not everyone may have the processing power to run this step, we could also include its outputs in the data folder.
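
To make the regression idea concrete, here is a minimal sketch of fitting net generation against gross generation for a single subplant. It is an assumption-laden illustration, not the implementation in `gross_to_net_generation`: the function name `fit_gross_to_net` and the synthetic data are hypothetical, and a plain least-squares line via `numpy.polyfit` stands in for whatever regression the module actually runs.

```python
import numpy as np
import pandas as pd


def fit_gross_to_net(monthly: pd.DataFrame) -> dict:
    """Regress monthly net generation on monthly gross generation (least squares)."""
    gross = monthly["gross_generation_mwh"].to_numpy(dtype=float)
    net = monthly["net_generation_mwh"].to_numpy(dtype=float)
    # fit net = slope * gross + intercept
    slope, intercept = np.polyfit(gross, net, deg=1)
    predicted = slope * gross + intercept
    # coefficient of determination, used to judge the quality of the fit
    ss_res = np.sum((net - predicted) ** 2)
    ss_tot = np.sum((net - net.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"slope": slope, "intercept": intercept, "r2": r2}


# synthetic subplant where net generation is 95% of gross minus a fixed 2 MWh
monthly = pd.DataFrame({"gross_generation_mwh": [100.0, 200.0, 300.0, 400.0]})
monthly["net_generation_mwh"] = 0.95 * monthly["gross_generation_mwh"] - 2.0
fit = fit_gross_to_net(monthly)
```

With more years of data there are more monthly points per subplant, which is why several years are loaded for this step.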

In [None]:
# specify the number of years of data to load to calculate gross to net regressions
# this will load the x years ending at the `year` specified above
number_of_years = 5

# NOTE: This only needs to be run once, as outputs are saved to csv files
gross_to_net_generation.identify_subplants_and_gtn_conversions(year, number_of_years)

# 3. Clean EIA-923 Generation and Fuel Data at the Monthly Level

In [None]:
eia923_allocated, primary_fuel_table = data_cleaning.clean_eia923(year)

# Add primary fuel data to each generator
eia923_allocated = eia923_allocated.merge(primary_fuel_table, how='left', on=['plant_id_eia','generator_id'], validate="m:1")

eia923_allocated

# 4. Clean Hourly Data from CEMS

In [None]:
cems = data_cleaning.clean_cems(year)

# 5. Convert CEMS Hourly Gross Generation to Hourly Net Generation
Hierarchy:
1. Use the regression value if the regression has a good r2
2. If there is not a good regression, use the monthly ratio unless the ratio is outside of normal bounds (negative, >>1, missing)

Where there are outliers (e.g. gross generation is very different from net generation), which number do we trust?
 - If the EIA data was reported monthly and not distributed, maybe trust EIA (monthly ratio).
 - Otherwise, trust the general regression
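
The hierarchy above could be expressed roughly as follows. This is only a sketch: the function name `choose_gtn_method`, the input column names `r2` and `monthly_ratio`, and the thresholds (an r2 of at least 0.9, a ratio between 0 and 1.2) are placeholder assumptions, not values taken from `convert_gross_to_net_generation`.

```python
import numpy as np
import pandas as pd


def choose_gtn_method(df, r2_threshold=0.9, ratio_bounds=(0.0, 1.2)):
    """Pick a gross-to-net method per row: regression first, then monthly ratio."""
    df = df.copy()
    # comparisons with NaN evaluate to False, so missing values fall through
    good_regression = df["r2"] >= r2_threshold
    good_ratio = df["monthly_ratio"].between(*ratio_bounds)
    # np.select takes the first condition that is True for each row
    df["gtn_method"] = np.select(
        [good_regression, good_ratio],
        ["regression", "monthly_ratio"],
        default="none",
    )
    return df


candidates = pd.DataFrame(
    {
        "r2": [0.95, 0.50, 0.50, np.nan],
        "monthly_ratio": [0.90, 0.93, -0.20, np.nan],
    }
)
candidates = choose_gtn_method(candidates)
```

Rows labelled `none` would still need a fallback, which is the open question raised about outliers.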


In [None]:
# merge the data back into cems and calculate net generation
#cems = cems.drop(columns=['gtn_method','net_generation_mwh'])
cems = data_cleaning.convert_gross_to_net_generation(cems)

# for generators where there is heat input but no gross generation reported, impute hourly net generation based on reported EIA values
# TODO: Need to match data on unit level rather than plant level
# cems = data_cleaning.impute_missing_hourly_net_generation(cems, eia923_allocated)

# what percent of net generation was allocated using each method?
cems.groupby('gtn_method', dropna=False)['net_generation_mwh'].sum() / cems['net_generation_mwh'].sum()

# 6. Crosswalk CEMS and EIA data

The goal of this step is to identify for which subplant-months we have hourly data reported in CEMS (which will be considered the default source of data), and for which subplant-months we only have monthly data from EIA-923. 

Based on the data we have available, at this point we assign each subplant-month one of three potential data sources for the hourly data:
1. `cems`: For subplant-months for which we have hourly CEMS data for all CEMS units that make up that subplant, we will use the hourly values reported in CEMS. (Add a validation check for the net generation and fuel consumption totals)
2. `partial_cems`: For subplant-months for which we have hourly CEMS data for only some of the CEMS units that make up a subplant, we will use the reported EIA-923 values to scale the partial hourly CEMS data from the reporting units to match the total value for the entire subplant. This requires calculating a partial subplant scaling factor for each data column (e.g. net generation, fuel consumption) by comparing the total monthly CEMS data to the monthly EIA-923 data.
3. `eia`: For subplant-months for which no hourly data is reported in CEMS, we will attempt to use EIA-930 data to assign an hourly profile to the monthly EIA-923 data
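
A minimal sketch of this three-way classification is shown below. It assumes a table with one row per subplant-month and two hypothetical count columns, `cems_units_total` and `cems_units_reporting`; the real logic in `identify_hourly_data_source` may derive the categories differently.

```python
import numpy as np
import pandas as pd


def classify_hourly_data_source(df):
    """Label each subplant-month as 'cems', 'partial_cems', or 'eia'."""
    df = df.copy()
    has_cems_units = df["cems_units_total"] > 0
    all_reporting = has_cems_units & (
        df["cems_units_reporting"] >= df["cems_units_total"]
    )
    some_reporting = df["cems_units_reporting"] > 0
    # the first matching condition wins, so full coverage takes priority
    df["hourly_data_source"] = np.select(
        [all_reporting, some_reporting],
        ["cems", "partial_cems"],
        default="eia",
    )
    return df


subplant_months = pd.DataFrame(
    {
        "subplant_id": [1, 2, 3, 4],
        "cems_units_total": [2, 2, 2, 0],
        "cems_units_reporting": [2, 1, 0, 0],
    }
)
subplant_months = classify_hourly_data_source(subplant_months)
```

Subplants with no CEMS units at all fall into `eia`, the same as subplants whose CEMS units reported nothing that month.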


In [None]:
eia923_allocated = data_cleaning.identify_hourly_data_source(eia923_allocated, cems, year)

# what percent of emissions is reported in CEMS vs EIA
# NOTE: This does not include emissions only reported by CEMS, so the % may be higher
co2_by_source = eia923_allocated.groupby('hourly_data_source')['co2_mass_lb_adjusted'].sum()
(co2_by_source / co2_by_source.sum()).round(3)

## Export intermediate data files

In [None]:
cems.to_csv(f'../data/outputs/cems_{year}.csv', index=False)

In [None]:
eia923_allocated.to_csv(f'../data/outputs/eia923_allocated_{year}.csv', index=False)

# 7. Assign static characteristics to CEMS and EIA data to aid in aggregation

In [None]:
# assign a BA code and state code to each plant
eia923_allocated = data_cleaning.assign_ba_code_to_plant(eia923_allocated, year)

# assign a fuel category to each plant based on what is most likely to match with the category used in EIA-930
# TODO: Add two different fuel categories (one for 930, one that is more specific)
eia923_allocated = data_cleaning.assign_fuel_category_to_ESC(eia923_allocated, fuel_category_name='fuel_group_eia930', esc_column='plant_primary_fuel')

In [None]:
# add information that we need to aggregate the data and match to eia930
#cems = cems.drop(columns=['ba_code','ba_code_physical','state'])
#cems = cems.drop(columns=['distribution_flag'])
#cems = cems.drop(columns=['plant_primary_fuel'])

cems = data_cleaning.assign_ba_code_to_plant(cems, year)

# add a flag about whether the plant is distribution connected
cems = data_cleaning.identify_distribution_connected_plants(cems, year, voltage_threshold_kv=60)

# add a plant primary fuel and a fuel category for eia930
cems = cems.merge(primary_fuel_table.drop_duplicates(subset='plant_id_eia')[['plant_id_eia','plant_primary_fuel']], how='left', on='plant_id_eia')
cems = data_cleaning.assign_fuel_category_to_ESC(cems, fuel_category_name='fuel_group_eia930', esc_column='plant_primary_fuel')
cems = cems.rename(columns={'fuel_category':'fuel_category_eia930'})

In [None]:
# export plant frame
plant_static_columns = ['plant_id_eia','plant_primary_fuel', 'fuel_category_eia930', 'ba_code', 'ba_code_physical', 'state', 'distribution_flag']
plant_frame = eia923_allocated[plant_static_columns].drop_duplicates(subset='plant_id_eia')
plant_frame.to_csv('../data/final_outputs/plant_files/plant_static_attributes.csv', index=False)

### Export plant-level hourly CEMS data
We might want to move this step elsewhere in the pipeline.

In [None]:
columns_to_export = ['plant_id_eia', 'operating_datetime_utc', 'gross_generation_mwh', 'net_generation_mwh', 'fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu', 'co2_mass_lb','co2_mass_lb_adjusted']
cems_output = cems[columns_to_export].groupby(['plant_id_eia','operating_datetime_utc']).sum().reset_index()
cems_output.to_csv('../data/final_outputs/plant_files/hourly_plant_data_from_cems.csv', index=False)

In [None]:
# export plant frame
plant_static_columns = ['plant_id_eia','plant_primary_fuel', 'fuel_category_eia930', 'ba_code', 'ba_code_physical', 'state', 'distribution_flag']
plant_frame = cems[plant_static_columns].drop_duplicates(subset='plant_id_eia')
plant_frame.to_csv('../data/final_outputs/plant_files/plant_static_attributes.csv', index=False)

# 8. Clean and Reconcile EIA-930 data
Load raw EIA-930 data, fix timestamp issues, and perform physics-based reconciliation.

Currently implemented in `notebooks/930_lag` and the `gridemissions` repository

Output: `data/outputs/EBA_adjusted_elec.csv`

In [None]:
# not yet implemented here

# 9. Calculate Residual Net Generation Profile

Currently implemented in `notebooks/clean_930_compare_residual`
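
The basic idea of the residual can be sketched as EIA-930 net generation minus aggregated CEMS net generation for each BA, fuel category, and hour. The function below is a simplified illustration, not the code in `notebooks/clean_930_compare_residual`; the function name `calculate_residual` and the choice to clip negative residuals to zero are assumptions.

```python
import pandas as pd


def calculate_residual(eia930, cems, keys=("ba_code", "fuel_category", "datetime_utc")):
    """Subtract aggregated CEMS net generation from EIA-930 net generation."""
    keys = list(keys)
    # sum CEMS unit-level data to the same BA-fuel-hour grain as EIA-930
    cems_agg = cems.groupby(keys, as_index=False)["net_generation_mwh"].sum()
    merged = eia930.merge(cems_agg, on=keys, how="left", suffixes=("_930", "_cems"))
    # hours with no CEMS data contribute zero CEMS generation
    merged["net_generation_mwh_cems"] = merged["net_generation_mwh_cems"].fillna(0)
    # ASSUMPTION: negative residuals are clipped to zero
    merged["residual_mwh"] = (
        merged["net_generation_mwh_930"] - merged["net_generation_mwh_cems"]
    ).clip(lower=0)
    return merged


hours = pd.to_datetime(
    ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00"], utc=True
)
eia930_example = pd.DataFrame(
    {
        "ba_code": "CISO",
        "fuel_category": "natural_gas",
        "datetime_utc": hours,
        "net_generation_mwh": [100.0, 80.0, 50.0],
    }
)
cems_example = pd.DataFrame(
    {
        "ba_code": "CISO",
        "fuel_category": "natural_gas",
        "datetime_utc": [hours[0], hours[0], hours[1]],
        "net_generation_mwh": [30.0, 20.0, 90.0],
    }
)
residual = calculate_residual(eia930_example, cems_example)
```

The second hour shows why some handling of negative values is needed: CEMS can report more generation than EIA-930 for the same BA-fuel-hour.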

In [None]:
# save CEMS data as CSV to be used to calculate residual profile
# NOTE: Once this code is integrated into the pipeline, this will not be necessary
# filter the data to export only the data/columns that we need for the residual calculation
cems_for_export = cems.loc[cems[['co2_mass_lb_adjusted','gross_generation_mwh','fuel_consumed_for_electricity_mmbtu']].sum(axis=1) > 0, ['ba_code', 'ba_code_physical',
       'energy_source_code','plant_primary_fuel','fuel_category_eia930','distribution_flag',
       'plant_id_eia', 'unitid', 'operating_datetime_utc',
       'net_generation_mwh']]

cems_for_export.to_csv(f'../data/outputs/cems_{year}_for_residual.csv', index=False)

In [None]:
# not yet implemented here

# 10. Assign hourly profile to monthly data
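
Conceptually, shaping a monthly total into hourly values means multiplying it by each hour's share of the monthly profile. The sketch below illustrates that arithmetic under assumed column names (`profile`, `report_date`); it is not the implementation of `distribute_monthly_eia_data_to_hourly`, and the function name `shape_monthly_to_hourly` is hypothetical.

```python
import pandas as pd


def shape_monthly_to_hourly(monthly, profile, value_column="net_generation_mwh"):
    """Distribute monthly totals to hours in proportion to an hourly profile."""
    keys = ["ba_code", "fuel_category", "report_date"]
    profile = profile.copy()
    # each hour's share of the profile total for its BA-fuel-month
    profile["profile_share"] = profile["profile"] / profile.groupby(keys)[
        "profile"
    ].transform("sum")
    hourly = profile.merge(monthly[keys + [value_column]], on=keys, how="inner")
    # the hourly values sum back to the monthly total
    hourly[value_column] = hourly[value_column] * hourly["profile_share"]
    return hourly


monthly_example = pd.DataFrame(
    {
        "ba_code": ["CISO"],
        "fuel_category": ["solar"],
        "report_date": [pd.Timestamp("2020-01-01")],
        "net_generation_mwh": [100.0],
    }
)
profile_example = pd.DataFrame(
    {
        "ba_code": "CISO",
        "fuel_category": "solar",
        "report_date": pd.Timestamp("2020-01-01"),
        "datetime_utc": pd.to_datetime(
            ["2020-01-01 08:00", "2020-01-01 09:00"], utc=True
        ),
        "profile": [1.0, 3.0],
    }
)
shaped = shape_monthly_to_hourly(monthly_example, profile_example)
```

A flat profile is the special case where every hour has the same share, so the monthly total is spread evenly.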


In [None]:
eia923_allocated[eia923_allocated['hourly_data_source'] == 'partial_cems']

In [None]:
# create a separate dataframe containing only the generators for which we do not have CEMS data
monthly_eia_data_to_distribute = eia923_allocated[(eia923_allocated['hourly_data_source'] == 'eia') & ~(eia923_allocated['fuel_consumed_mmbtu'].isna())]

In [None]:
# load profile data and format for use in the pipeline
#hourly_profiles = eia930.load_chalendar_for_pipeline('../data/downloads/outputs/EBA_adjusted_elec.csv', year)
hourly_profiles = pd.read_csv('../data/outputs/residual_profiles.csv', parse_dates=['report_date'])
hourly_profiles

In [None]:
def create_flat_profile(year, ba):
    # create an hourly UTC index padded by one day on each side of the year
    df_temp = pd.DataFrame(index=pd.date_range(start=f'{year-1}-12-31 00:00:00', end=f'{year+1}-01-01 23:00:00', freq='H', tz='UTC', name='datetime_utc'),
                                columns=['ba_code','fuel_category']).reset_index()

    # a constant value in every hour gives a flat profile
    df_temp['net_generation_mwh_930'] = 1.0
    # convert the UTC timestamps to the local timezone of the BA
    df_temp['datetime_local'] = df_temp['datetime_utc'].dt.tz_convert(data_cleaning.ba_timezone(ba=ba, type='local'))
    # create a report date column
    df_temp['report_date'] = df_temp['datetime_local'].astype(str).str[:7]
    df_temp['report_date'] = pd.to_datetime(df_temp['report_date'])

    return df_temp

In [None]:
# for fuel categories that exist in the EIA-923 data but not in EIA-930, create flat profiles to add to the hourly profiles from 930
# TODO: Identify for which BA-fuels a flat profile was created
# TODO: Is there a better assumption than flat?
ba_list = list(monthly_eia_data_to_distribute['ba_code'].dropna().unique())

# create an hourly datetime series in local time for each ba/fuel type
hourly_profiles_to_add = []

# for each ba
for ba in ba_list:
    # get a list of fuel categories that exist in that BA
    ba_fuel_list = list(monthly_eia_data_to_distribute.loc[monthly_eia_data_to_distribute['ba_code'] == ba,'fuel_category'].unique())
    for fuel in ba_fuel_list:
        # if there is no data for that fuel type in the eia930 data, create a flat profile
        if len(hourly_profiles[(hourly_profiles['ba_code'] == ba) & (hourly_profiles['fuel_category'] == fuel)]) == 0:
            print(f'Adding flat profile for {ba} {fuel}')
            df_temp = create_flat_profile(year, ba)
            df_temp['ba_code'] = ba
            df_temp['fuel_category'] = fuel
            hourly_profiles_to_add.append(df_temp)

# pd.concat raises a ValueError on an empty list, so only concat if any flat profiles were created
if len(hourly_profiles_to_add) > 0:
    hourly_profiles_to_add = pd.concat(hourly_profiles_to_add, axis=0, ignore_index=True)
else:
    hourly_profiles_to_add = pd.DataFrame()

In [None]:
# concat the flat profiles to the hourly profiles
hourly_profiles = pd.concat([hourly_profiles,hourly_profiles_to_add], axis=0)

In [None]:
hourly_eia_data = data_cleaning.distribute_monthly_eia_data_to_hourly(monthly_eia_data_to_distribute, hourly_profiles, 'residual_scaled')

In [None]:
columns_for_output = ['ba_code', 'fuel_category', 'datetime_utc', 'net_generation_mwh', 'fuel_consumed_mmbtu', 'fuel_consumed_for_electricity_mmbtu', 'co2_mass_lb', 'co2_mass_lb_adjusted']
hourly_eia_data[columns_for_output].to_csv('../data/final_outputs/plant_files/hourly_data_distributed_from_eia.csv', index=False)

# 11. Aggregate CEMS data to BA-fuel and combine with hourly shaped EIA data


In [None]:
# aggregate cems data by BA and fuel type
cems_ba_fuel = cems.groupby(['ba_code','fuel_category_eia930','operating_datetime_utc'])[['gross_generation_mwh','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_lb','co2_mass_lb_adjusted']].sum().reset_index()
cems_ba_fuel['data_source'] = 'CEMS'

# rename the datetime_utc column
cems_ba_fuel = cems_ba_fuel.rename(columns={'operating_datetime_utc':'datetime_utc','fuel_category_eia930':'fuel_category'})


In [None]:
combined_data = pd.concat([cems_ba_fuel, hourly_eia_data.drop(columns=['datetime_local','report_date'])], axis=0)
# only keep relevant columns
combined_data = combined_data[['ba_code','fuel_category','datetime_utc','net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_lb','co2_mass_lb_adjusted','data_source']]
combined_data

## Export hourly data to final CSVs (one for each BA)

In [None]:
for ba in list(combined_data.ba_code.unique()):

    # filter the data for a single BA
    ba_table = combined_data[combined_data['ba_code'] == ba].drop(columns='ba_code')

    # convert the datetime_utc column back to a datetime
    ba_table['datetime_utc'] = pd.to_datetime(ba_table['datetime_utc'], utc=True)

    # combine the data from CEMS and EIA for each fuel-hour
    ba_table = ba_table.groupby(['fuel_category', 'datetime_utc']).sum().reset_index()

    # calculate a total for the BA
    ba_total = ba_table.groupby(['datetime_utc'])[['net_generation_mwh','fuel_consumed_mmbtu','fuel_consumed_for_electricity_mmbtu','co2_mass_lb','co2_mass_lb_adjusted']].sum().reset_index()
    ba_total['fuel_category'] = 'total'

    # concat the totals to the fuel-specific totals
    ba_table = pd.concat([ba_table,ba_total], axis=0, ignore_index=True)

    # calculate a generated emission rate
    # NOTE: the factor of 2000 (lb per short ton) is only correct if the co2 mass columns hold tons rather than lb; check the units against the column names
    ba_table['generated_co2_rate_lb_per_mwh'] = (ba_table['co2_mass_lb'] * 2000 / ba_table['net_generation_mwh']).fillna(0).replace(np.inf, np.nan)
    ba_table['adjusted_generated_co2_rate_lb_per_mwh'] = (ba_table['co2_mass_lb_adjusted'] * 2000 / ba_table['net_generation_mwh']).fillna(0).replace(np.inf, np.nan)

    ba_table = ba_table.pivot(index='datetime_utc', columns='fuel_category')

    # round all values to one decimal place
    ba_table = ba_table.round(1)

    # flatten the multilevel column into a single column name like data_fuelname
    ba_table.columns = ['_'.join(col) for col in ba_table.columns.values]

    # export to a csv
    ba_table.to_csv(f'../data/final_outputs/ba_files/{ba}.csv')

In [None]:
def graph_hourly_data_by_fuel_category(hourly_data, ba, column_name, fuel_category_name, plot_type):
    fuel_color = {'natural_gas':'sienna',
              'coal':'black',
              'nuclear':'red',
              'biomass':'green',
              'geothermal':'orange',
              'wind':'blue',
              'solar':'gold',
              'petroleum':'purple',
              'hydro':'skyblue',
              'other':'lightgrey',
              'waste':'pink'}

    fuel_order = ['nuclear','geothermal','hydro','other','coal','biomass','petroleum','waste','solar','wind','natural_gas']

    data_to_graph = hourly_data[hourly_data['ba_code'] == ba]
    #data_to_graph = data_to_graph.groupby(['datetime_utc','fuel_category','data_source']).sum().reset_index()

    if plot_type == 'area':
        plot = px.area(data_to_graph, 
                x='datetime_utc', 
                y=column_name, 
                color=fuel_category_name, 
                color_discrete_map=fuel_color, 
                facet_col='data_source', 
                template='plotly_white',
                title=f'Hourly {column_name} data for {ba} by fuel type',
                category_orders={fuel_category_name:fuel_order}).update_traces(line={'width':0})
    elif plot_type == 'line':
        # keep the default line width here: a width of 0 would make the lines invisible
        plot = px.line(data_to_graph, 
                x='datetime_utc', 
                y=column_name, 
                color=fuel_category_name, 
                color_discrete_map=fuel_color, 
                facet_col='data_source', 
                template='plotly_white',
                title=f'Hourly {column_name} data for {ba} by fuel type',
                category_orders={fuel_category_name:fuel_order})
    else:
        raise ValueError(f"plot_type must be 'area' or 'line', got '{plot_type}'")

    return plot
        

In [None]:
ba = 'CISO'

graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='net_generation_mwh', fuel_category_name='fuel_category', plot_type='area')

In [None]:
graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='co2_mass_lb', fuel_category_name='fuel_category', plot_type='area')

In [None]:
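# combined_data does not contain an emission rate column, so calculate one before plotting
# (this uses the same formula as the BA export loop above)
combined_data['co2_rate_lb_per_mwh_generated'] = (combined_data['co2_mass_lb'] * 2000 / combined_data['net_generation_mwh']).replace([np.inf, -np.inf], np.nan)

graph_hourly_data_by_fuel_category(hourly_data=combined_data, ba=ba, column_name='co2_rate_lb_per_mwh_generated', fuel_category_name='fuel_category', plot_type='line')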