Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add summary function and command #92

Merged
merged 11 commits into from Jun 7, 2021
Merged

Add summary function and command #92

merged 11 commits into from Jun 7, 2021

Conversation

amontanez24
Copy link
Contributor

No description provided.

amontanez24 and others added 3 commits May 26, 2021 12:41
* sdgym-gretel: adding gretel synthesizer

* pr comments and changes discussed in OH

* getting rid of error messages

* moving static method out

* Cureate dependencies to avoid conflicts

Co-authored-by: Carles Sala <carles@pythiac.com>
@amontanez24 amontanez24 marked this pull request as draft June 3, 2021 22:03
@amontanez24 amontanez24 requested a review from csala June 3, 2021 22:03
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few suggestions to improve the current implementation and interface

sdgym/summary.py Outdated
@@ -150,3 +153,54 @@ def errors_summary(data):
all_errors[synthesizer] = errors.fillna(0).astype(int)

return all_errors


def make_summary_spreadsheet(file_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make a few changes to this function signature:

  • I would rename file_path to raw_csv_path (or results_csv_path)
  • I would get the output path as an input argument that would default to None. If an output path is given, use it, otherwise write the output as re.sub('.csv$', '.xlsx', raw_csv_path)
  • I would make the baseline a module-level dictionary, with data modalities as keys and baseline lists as values and also add a baselines argument which would default to None (meaning use the default one we just defined).

At the implementation level, I would make a sub-function called _make_summary which would receive:

  • A name (which will be the data modality)
  • The data already filtered by data modality
  • The baselines list (list, not dict!)
  • The writer

The overall process would be something similar to this:

def _add_summary(data, modality, baselines, writer):
    # Compute the summaries on the data and add the corresponding
    # sheets to the writer, using `{Sheet name} ({modality})` as names

def make_summary_spreadsheet(raw_csv_path, output_path=None, baselines=None):
    # Preprocess the data
    # Create and configure the writer
    for modality, subset in data.groupby('modality'):
        modality_baselines = baselines[modality]
        _add_summary(subset, modality, modality_baselines, writer)

    writer.save()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amontanez24 we should also add support to read from and write to S3 buckets, following an approach similar to the collect command.
This means we need to add aws_key and aws_secret arguments, and that we need to use the s3 module functions to read and write files (we may need to change something there to write the xlsx file?)

sdgym/summary.py Outdated
@@ -3,6 +3,8 @@
import numpy as np
import pandas as pd

from sdgym.results import add_sheet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the results.py module is not being used anywhere else (which I think is not), I would copy whatever is necessary here and remove the module.

@amontanez24 amontanez24 marked this pull request as ready for review June 4, 2021 21:36
@amontanez24 amontanez24 requested a review from csala June 4, 2021 21:36
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment about how to decide whether to write to S3 or not, and how to do it.

sdgym/summary.py Outdated
baselines = baselines or MODALITY_BASELINES
output_path = output_path or re.sub('.csv$', '.xlsx', results_csv_path)
output = io.BytesIO()
writer = pd.ExcelWriter(output) if aws_key and aws_secret else pd.ExcelWriter(output_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aws_key and aws_secret cannot be used to decide whether we are writing in S3 or not, since an S3 path can be passed without keys to let boto3 use the system-wide credentials.

To keep things simple, I would just not distinguish between a local or an S3 path here and always write to the BytesIO and pass it down to write_file, which will decide whether to write to a local or remote file based on the given path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, I didn't know that. This is nice though since it will be cleaner in the end

@amontanez24 amontanez24 changed the base branch from summary-improvements to master June 7, 2021 17:22
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go! Thanks @amontanez24 !

@csala csala merged commit 7c7b7a9 into master Jun 7, 2021
@csala csala deleted the sdgym-summary branch June 7, 2021 18:45
@katxiao katxiao added this to the 0.4.0 milestone Jun 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants