Add hourly data for all individual plants #246

Merged
merged 8 commits into from
Dec 21, 2022

grgmiller
Collaborator

@grgmiller grgmiller commented Oct 19, 2022

This PR is meant to address #241

NOTE: I'm requesting to merge into the pudl_update branch, not directly into development.

Updates in this PR:

  • Adds code to export hourly shaped data for each individual plant, instead of aggregating the eia_only data to the fleet level. To get around the memory issue, we shape and export plant data one region (BA or state) at a time, so that we never hold hourly data for every plant in the country in memory at once. A new step 14 in the pipeline shapes and exports this data but does not retain it for the rest of the pipeline. After this export, the pipeline continues as it did previously, aggregating the data to the fleet level before shaping. This shaped fleet-level data is used for the rest of the pipeline (calculating and exporting power sector and consumed emissions data), since individual plant data is not necessary for those steps.
  • Adds a new command-line argument, --shape_individual_plants, that lets the user choose whether to export individual plant data or to skip that step and export the aggregated fleet data as before. Shaping individual plant data will be the new default behavior.
  • When exporting individual plant data, the pipeline shapes and exports data for only one subset of plants at a time, based on the location of the plant. The default behavior is to split up the data by balancing authority (BA). Since BAs vary widely in size, the largest individual files are still quite large (MISO is 1.4 GB), while some of the smaller BAs are less than 1 MB. An alternative that would distribute file sizes more evenly is to export the data by state, although this could make the data harder to work with downstream if you want all of the plants in a certain BA. The function includes an option to use state as the export grouping, but ba_code is the default.
  • The output files from this process still take up a large amount of disk space (about 8 GB total), so to limit their size I restricted the data we export:
    • The default behavior is to export data only in us_units and not metric. Exporting both would double the disk space used by these outputs (from 8 to 16 GB). The thought here is that researchers or others using this type of data can easily do the conversions themselves.
    • I also limited the emissions columns that are exported. We have five pollutants and four ways of potentially presenting these data (raw, for electricity, adjusted, and for electricity adjusted), which would create 20 columns and significantly increase the size of the files. For this output, I limited it to 11 emissions columns: the raw emissions data for each pollutant, the electricity-only emissions for each pollutant, and the adjusted electricity-only CO2 emissions. These seem to be the most relevant outputs for plant-level data uses, and this limits duplicate data (e.g., CH4 doesn't actually get adjusted for biomass, so ch4_mass_adjusted = ch4_mass).
  • Cleanup: Removes the --gtn_years argument, since it is no longer relevant after improving the gross-to-net calculations and separating them from the subplant ID identification process. The identify_subplants() function retains the number_of_years parameter, but it now simply defaults to 5.
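
For readers who want a concrete picture of the region-by-region export described above, here is a minimal, standard-library-only sketch. It is not the code in this PR. Only --shape_individual_plants, ba_code, ch4_mass, and ch4_mass_adjusted come from the description; the other pollutant names, the `_for_electricity` column suffixes, the helper names (`emission_columns`, `shape_plant`, `export_by_region`, `build_parser`), the on/off form of the flag, and the proportional shaping rule are all assumptions made for illustration.

```python
import argparse
import csv
from collections import defaultdict
from pathlib import Path

# Only co2 and ch4 are named in the PR text; the other names are placeholders.
POLLUTANTS = ["co2", "ch4", "n2o", "nox", "so2"]


def emission_columns(pollutants=POLLUTANTS):
    """Raw and electricity-only columns per pollutant, plus adjusted CO2."""
    raw = [f"{p}_mass" for p in pollutants]
    for_electricity = [f"{p}_mass_for_electricity" for p in pollutants]
    return raw + for_electricity + ["co2_mass_for_electricity_adjusted"]


def shape_plant(plant, profile, columns):
    """Placeholder shaping: spread each plant total over hours by a profile."""
    total = sum(profile)
    for hour, weight in enumerate(profile):
        row = {"plant_id": plant["plant_id"], "hour": hour}
        for col in columns:
            row[col] = plant[col] * weight / total
        yield row


def export_by_region(plants, profile, out_dir, group_col="ba_code"):
    """Shape and write one region at a time; hourly rows are never pooled."""
    columns = emission_columns()
    by_region = defaultdict(list)
    for plant in plants:  # plant-level totals are small enough to group up front
        by_region[plant[group_col]].append(plant)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for region, region_plants in sorted(by_region.items()):
        path = out_dir / f"{region}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["plant_id", "hour", *columns])
            writer.writeheader()
            for plant in region_plants:
                # Hourly rows stream straight to disk instead of accumulating.
                writer.writerows(shape_plant(plant, profile, columns))
        written.append(path)
    return written


def build_parser():
    parser = argparse.ArgumentParser()
    # Flag name is from the PR; the --no-... off switch is an assumption.
    parser.add_argument(
        "--shape_individual_plants",
        action=argparse.BooleanOptionalAction,
        default=True,
    )
    return parser
```

Passing `group_col="state"` would switch the grouping to states, mirroring the option mentioned above, provided each plant record carries a `state` key.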

@grgmiller grgmiller changed the base branch from development to pudl_update December 16, 2022 23:53
src/data_pipeline.py Outdated
@grgmiller grgmiller merged commit c270af7 into pudl_update Dec 21, 2022
@grgmiller grgmiller deleted the greg/hourly_plant_data branch December 21, 2022 21:21
2 participants