
Access OGE outputs from Amazon S3 #338

Merged
merged 5 commits into from Feb 1, 2024
Conversation

@rouille (Collaborator) commented Jan 31, 2024

Purpose

Allow users to read OGE output data from Amazon S3. This is particularly useful when we want to use OGE outputs in a separate project. Closes CAR-3681

What the code is doing

Create a function that sets the OGE data store. It looks for an OGE_DATA_STORE environment variable. If the variable does not exist, the data store defaults to local, and data is written to/read from the open_grid_emissions_data folder located in the user's $HOME (current behavior).
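A minimal sketch of what such a function could look like. This is illustrative only, not the actual oge.filepaths implementation; the bucket URL and local folder match the usage examples below, and the "1"/"2" aliases follow the error message in the diff.

```python
import os
from pathlib import Path


def data_folder(rel: str = "") -> str:
    """Return the OGE data store root, optionally joined with a relative path.

    The store is chosen via the OGE_DATA_STORE environment variable:
      unset, "local", or "1" -> ~/open_grid_emissions_data/ (current behavior)
      "s3" or "2"            -> s3://open-grid-emissions/open_grid_emissions_data/
    """
    store = os.getenv("OGE_DATA_STORE", "local")
    if store in ("s3", "2"):
        root = "s3://open-grid-emissions/open_grid_emissions_data/"
    elif store in ("local", "1"):
        root = f"{Path.home()}/open_grid_emissions_data/"
    else:
        raise OSError(
            "Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'"
        )
    return root + rel
```

Because the environment variable is read at call time, downstream code can opt in to S3 simply by exporting OGE_DATA_STORE before resolving any paths.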

Testing

The feature was tested by setting the OGE_DATA_STORE environment variable to s3 in a project that imports the oge package. A file stored on Amazon S3 was then successfully loaded using pandas' read_csv function.
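In a downstream project, that read could look like the following sketch. The output file name is hypothetical; the bucket and prefix follow the data_folder() output shown below. The actual pandas.read_csv call needs network access and the s3fs dependency, so it is shown commented out.

```python
import os

# Opt in to the S3 store before resolving any oge paths.
os.environ["OGE_DATA_STORE"] = "s3"

# Hypothetical output file under the OGE bucket/prefix.
url = "s3://open-grid-emissions/open_grid_emissions_data/" + "outputs/2020/plant_data.csv"

# pandas delegates s3:// URLs to s3fs, so once s3fs is installed this works:
# import pandas as pd
# df = pd.read_csv(url)
```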

Where to look

  • Pipfile and environment.yml, where I added the s3fs dependency, which pandas needs to read files located on Amazon S3. Note that I did not verify that it works with conda. @grgmiller, can you generate a conda environment, install the dependencies, and try to load a CSV file from S3 using pandas' read_csv?
  • README, where documentation was added for users who want to import oge into their project to fetch OGE output data without first running the pipeline.
  • the oge.filepaths module, where the feature is implemented.
  • the data_pipeline script, where we ensure that OGE_DATA_STORE is not set to s3.

Usage Example/Visuals

Setting the OGE_DATA_STORE environment variable

(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov  1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.environ["OGE_DATA_STORE"] = "s3"
>>> from oge.filepaths import data_folder
>>> data_folder()
's3://open-grid-emissions/open_grid_emissions_data/'
>>> 

Not setting it:

(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python
Python 3.11.2 (main, Nov  1 2023, 11:27:45) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from oge.filepaths import data_folder
>>> data_folder()
'/Users/brdo/open_grid_emissions_data/'
>>> 

Trying to run the pipeline with OGE_DATA_STORE set to s3 (or its numeric alias 2, as below) raises an OSError:

(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ export OGE_DATA_STORE=2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ echo $OGE_DATA_STORE 
2
(open-grid-emissions) [~/Singularity/open-grid-emissions] (ben/store) brdo$ python src/oge/data_pipeline.py --year 2020
Traceback (most recent call last):
  File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 642, in <module>
    main(sys.argv[1:])
  File "/Users/brdo/Singularity/open-grid-emissions/src/oge/data_pipeline.py", line 73, in main
    raise OSError("Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'")
OSError: Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'

Review estimate

15min

Future work

N/A

Checklist

  • Update the documentation to reflect changes made in this PR
  • Format all updated python files using black
  • Clear outputs from all notebooks modified
  • Add docstrings and type hints to any new functions created

@grgmiller (Collaborator) left a comment

Looks good.

@@ -69,6 +69,11 @@ def print_args(args: argparse.Namespace, logger):

def main(args):
    """Runs the OGE data pipeline."""
    if os.getenv("OGE_DATA_STORE") in ["s3", "2"]:
        raise OSError(
            "Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'"
Collaborator:

I like that we've prevented someone from running the data pipeline when the store is set to s3. However, someone could still run part of the pipeline (for example, in a notebook), so I just want to make sure it is only possible to read from S3, never to write or update data there. Maybe this is a setting on the bucket itself, but what happens if you try to run an output_data command when the store is set to s3?
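One possible application-level answer to this question is a guard at the top of every write helper. This is purely a sketch: the function name guard_local_store and its wiring are hypothetical, and a read-only bucket policy on the AWS side would still be the stronger guarantee.

```python
import os


def guard_local_store() -> None:
    """Raise if the data store is S3; call this before any write to the data folder."""
    if os.getenv("OGE_DATA_STORE") in ("s3", "2"):
        raise OSError(
            "The S3 data store is read-only. Set OGE_DATA_STORE to 'local' to write outputs."
        )
```

Each output_data-style helper would call guard_local_store() before writing, so partial pipeline runs in a notebook fail fast instead of mutating the bucket.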

@grgmiller (Collaborator) commented:

Closes CAR-3681

@grgmiller (Collaborator) left a comment

Looks good

@grgmiller grgmiller merged commit 976b4f7 into development Feb 1, 2024
2 checks passed
@grgmiller grgmiller deleted the ben/store branch February 1, 2024 21:44