Access OGE outputs from Amazon S3 #338
Conversation
Looks good.
```diff
@@ -69,6 +69,11 @@ def print_args(args: argparse.Namespace, logger):


 def main(args):
+    """Runs the OGE data pipeline."""
+    if os.getenv("OGE_DATA_STORE") in ["s3", "2"]:
+        raise OSError(
+            "Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'"
+        )
```
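Standalone, the guard behaves like this (a sketch that reproduces the check from the diff above outside the pipeline; `check_data_store` is an invented name for illustration):

```python
import os


def check_data_store() -> None:
    """Reproduces the guard added to main() in the diff above."""
    if os.getenv("OGE_DATA_STORE") in ["s3", "2"]:
        raise OSError(
            "Invalid OGE_DATA_STORE environment variable. Should be 'local' or '1'"
        )


# With the store set to s3, the pipeline refuses to start
os.environ["OGE_DATA_STORE"] = "s3"
try:
    check_data_store()
except OSError as exc:
    print(exc)
```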
I like that we've prevented someone from running the data pipeline if the store is set to s3. However, it is possible that someone could still run part of the pipeline (for example, in a notebook), and I just want to make sure that it is not possible for someone to write/update data on S3, only to read from it. Maybe this is a setting in the bucket itself, but what happens if you try to run an output_data command when the store is set to s3?
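One way to address the read-only concern raised here, sketched as a hypothetical helper (not part of this PR; `assert_store_writable` is an invented name), is to fail fast before any write whenever the store is `s3`. A bucket-level policy denying `s3:PutObject` would be a stronger guarantee on the S3 side.

```python
import os


def assert_store_writable() -> None:
    """Hypothetical guard an output_data-style function could call before writing.

    Refuses writes when OGE_DATA_STORE selects S3, keeping the bucket
    read-only from the library's side.
    """
    if os.getenv("OGE_DATA_STORE") in ["s3", "2"]:
        raise OSError(
            "OGE data store is set to s3, which is read-only; writes are not allowed"
        )
```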
Closes CAR-3681
Looks good
Purpose
Allow users to read OGE output data from Amazon S3. This is particularly useful when we want to use OGE outputs in a separate project. Closes CAR-3681
What the code is doing
Creates a function that sets the OGE data store. It looks for an `OGE_DATA_STORE` environment variable. If the variable does not exist, the data store is set to local, and data is written to/read from the `open_grid_emissions_data` folder located in the user's `$HOME` (current behavior).
Testing
The feature has been tested by setting the `OGE_DATA_STORE` environment variable to `s3` in a project importing the `oge` package. A file stored on Amazon S3 was then successfully loaded using the pandas `read_csv` function.
Where to look
- `s3fs` dependency, as it is needed by pandas to read files located on Amazon S3. Note that I did not make sure that it works for conda. @grgmiller, can you try to generate a conda environment, install the dependencies, and try to load a CSV file on S3 using pandas' `read_csv`?
- `oge`: users can import it in their project to fetch OGE data outputs without first running the pipeline.
- `oge.filepaths` module, where the feature is implemented.
- `data_pipeline` script, where we ensure that `OGE_DATA_STORE` is not set to `s3`.
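Pulling these pieces together, the store selection described above might look roughly like this (a sketch under assumptions: the real code lives in `oge.filepaths`, the function names are invented, and the bucket name is a placeholder):

```python
import os
from pathlib import Path


def data_store() -> str:
    """Return "s3" or "local" based on the OGE_DATA_STORE environment variable."""
    return "s3" if os.getenv("OGE_DATA_STORE") in ("s3", "2") else "local"


def data_path(rel_path: str) -> str:
    """Resolve a file path inside the OGE data folder, locally or on S3."""
    if data_store() == "s3":
        # With s3fs installed, pandas can read s3:// URLs directly.
        # The bucket name below is a placeholder, not the real one.
        return f"s3://open-grid-emissions-data/{rel_path}"
    return str(Path.home() / "open_grid_emissions_data" / rel_path)
```

A downstream project could then call, e.g., `pd.read_csv(data_path("outputs/plant_data.csv"))` (hypothetical file name) without caring where the data lives.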
Usage Example/Visuals
Setting the `OGE_DATA_STORE` environment variable:

Not setting it:

Trying to run the pipeline with `OGE_DATA_STORE` set to `s3` raises an `OSError`:
Review estimate
15min
Future work
N/A
Checklist
black