
Access OGE outputs from Amazon S3 #338

Merged
merged 5 commits into from Feb 1, 2024
Conversation

@rouille (Collaborator) commented Jan 31, 2024

Purpose

Allow users to read OGE output data from Amazon S3. This is particularly useful when we want to use OGE outputs in a separate project. Closes CAR-3681

What the code is doing

Create a function that sets the OGE data store. It looks for an OGE_DATA_STORE environment variable. If the variable does not exist, the data store defaults to local, and data is written to/read from the open_grid_emissions_data folder located in the user's $HOME (current behavior).
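A minimal sketch of what such a function could look like. This is illustrative only, not the actual oge.filepaths implementation; the bucket URL and local folder match the usage examples below, and the "1"/"2" aliases follow the error message in the diff.

```python
import os
from pathlib import Path


def data_folder(rel: str = "") -> str:
    """Return the OGE data store root, optionally joined with a relative path.

    The store is chosen via the OGE_DATA_STORE environment variable:
      unset, "local", or "1" -> ~/open_grid_emissions_data/ (current behavior)
      "s3" or "2"            -> s3://open-grid-emissions/open_grid_emissions_data/
    """
    store = os.getenv("OGE_DATA_STORE", "local")
    if store in ("s3", "2"):
        root = "s3://open-grid-emissions/open_grid_emissions_data/"
    elif store in ("local", "1"):
        root = f"{Path.home()}/open_grid_emissions_data/"
    else:
        raise OSError(
            "Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'"
        )
    return root + rel
```

Because the environment variable is read at call time, downstream code can opt in to S3 simply by exporting OGE_DATA_STORE before resolving any paths.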

Testing

The feature was tested by setting the OGE_DATA_STORE environment variable to s3 in a project that imports the oge package. A file stored on Amazon S3 was then successfully loaded using pandas' read_csv function.
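In a downstream project, that read could look like the following sketch. The output file name is hypothetical; the bucket and prefix follow the data_folder() output shown below. The actual pandas.read_csv call needs network access and the s3fs dependency, so it is shown commented out.

```python
import os

# Opt in to the S3 store before resolving any oge paths.
os.environ["OGE_DATA_STORE"] = "s3"

# Hypothetical output file under the OGE bucket/prefix.
url = "s3://open-grid-emissions/open_grid_emissions_data/" + "outputs/2020/plant_data.csv"

# pandas delegates s3:// URLs to s3fs, so once s3fs is installed this works:
# import pandas as pd
# df = pd.read_csv(url)
```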

Where to look

  • Pipfile and environment.yml, where I added the s3fs dependency, which pandas needs to read files located on Amazon S3. Note that I did not verify that it works with conda. @grgmiller, can you generate a conda environment, install the dependencies, and try to load a CSV file from S3 using pandas' read_csv?
  • README, where documentation was added for users who want to import oge into their project to fetch OGE output data without first running the pipeline.
  • the oge.filepaths module, where the feature is implemented.
  • the data_pipeline script, where we ensure that OGE_DATA_STORE is not set to s3.

Usage Example/Visuals

Setting the OGE_DATA_STORE environment variable

(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov  1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ["OGE_DATA_STORE"] = "s3"
>>> from oge.filepaths import data_folder
>>> data_folder()
's3://open-grid-emissions/open_grid_emissions_data/'
>>> 

Not setting it:

(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov  1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from oge.filepaths import data_folder
>>> data_folder()
'/Users/brdo/open_grid_emissions_data/'
>>> 

Trying to run the pipeline with OGE_DATA_STORE set to s3 (or its numeric alias 2, as below) raises an OSError:

(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ export OGE_DATA_STORE=2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ echo $OGE_DATA_STORE 
2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python src/oge/data_pipeline.py --year 2020
Traceback (most recent call last):
  File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 642, in <module>
    main(sys.argv[1:])
  File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 73, in main
    raise OSError("Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'")
OSError: Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'

Review estimate

15min

Future work

N/A

Checklist

  • Update the documentation to reflect changes made in this PR
  • Format all updated python files using black
  • Clear outputs from all notebooks modified
  • Add docstrings and type hints to any new functions created

@grgmiller (Collaborator) left a comment

Looks good.

@@ -69,6 +69,11 @@ def print_args(args: argparse.Namespace, logger):

def main(args):
    """Runs the OGE data pipeline."""
    if os.getenv("OGE_DATA_STORE") in ["s3", "2"]:
        raise OSError(
            "Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'"
Collaborator:

I like that we've prevented someone from running the data pipeline when the store is set to s3. However, someone could still run part of the pipeline (for example, in a notebook), so I just want to make sure it is only possible to read from S3, never to write or update data there. Maybe this is a setting on the bucket itself, but what happens if you try to run an output_data command when the store is set to s3?
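One possible application-level answer to this question is a guard at the top of every write helper. This is purely a sketch: the function name guard_local_store and its wiring are hypothetical, and a read-only bucket policy on the AWS side would still be the stronger guarantee.

```python
import os


def guard_local_store() -> None:
    """Raise if the data store is S3; call this before any write to the data folder."""
    if os.getenv("OGE_DATA_STORE") in ("s3", "2"):
        raise OSError(
            "The S3 data store is read-only. Set OGE_DATA_STORE to 'local' to write outputs."
        )
```

Each output_data-style helper would call guard_local_store() before writing, so partial pipeline runs in a notebook fail fast instead of mutating the bucket.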

@grgmiller (Collaborator) commented:

Closes CAR-3681

@grgmiller (Collaborator) left a comment

Looks good

@grgmiller grgmiller merged commit 976b4f7 into development Feb 1, 2024
2 checks passed
@grgmiller grgmiller deleted the ben/store branch February 1, 2024 21:44