Add summary function and command #92
Conversation
* sdgym-gretel: adding gretel synthesizer
* pr comments and changes discussed in OH
* getting rid of error messages
* moving static method out
* Curate dependencies to avoid conflicts

Co-authored-by: Carles Sala <carles@pythiac.com>
I added a few suggestions to improve the current implementation and interface.
sdgym/summary.py
```diff
@@ -150,3 +153,54 @@ def errors_summary(data):
         all_errors[synthesizer] = errors.fillna(0).astype(int)

     return all_errors


+def make_summary_spreadsheet(file_path):
```
I would make a few changes to this function signature:

- I would rename `file_path` to `raw_csv_path` (or `results_csv_path`).
- I would get the output path as an input argument that would default to `None`. If an output path is given, use it; otherwise, write the output as `re.sub('.csv$', '.xlsx', raw_csv_path)`.
- I would make the baselines a module-level dictionary, with data modalities as keys and baseline lists as values, and also add a `baselines` argument which would default to `None` (meaning use the default one we just defined); see the sketch after this list.
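For illustration, a minimal sketch of such a module-level dictionary, using the `MODALITY_BASELINES` name that appears in a later diff; the modality keys and baseline synthesizer names here are illustrative assumptions, not SDGym's actual values:

```python
# Sketch of the suggested module-level defaults. The modality keys and
# baseline names below are illustrative assumptions only.
MODALITY_BASELINES = {
    'single-table': ['Identity', 'Uniform', 'Independent'],
    'multi-table': ['Identity', 'Uniform', 'Independent'],
    'timeseries': ['Identity', 'Uniform'],
}
```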
At the implementation level, I would make a sub-function called `_add_summary` which would receive:

- A `name` (which will be the data modality)
- The `data` already filtered by data modality
- The `baselines` list (list, not dict!)
- The `writer`

The overall process would be something similar to this:
```python
def _add_summary(data, modality, baselines, writer):
    # Compute the summaries on the data and add the corresponding
    # sheets to the writer, using `{Sheet name} ({modality})` as names.
    ...


def make_summary_spreadsheet(raw_csv_path, output_path=None, baselines=None):
    # Preprocess the data.
    # Create and configure the writer.
    for modality, subset in data.groupby('modality'):
        modality_baselines = baselines[modality]
        _add_summary(subset, modality, modality_baselines, writer)
    writer.save()
```
@amontanez24 we should also add support to read from and write to S3 buckets, following an approach similar to the `collect` command.

This means we need to add `aws_key` and `aws_secret` arguments, and that we need to use the `s3` module functions to read and write files (we may need to change something there to write the xlsx file?).
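A minimal sketch of how the extended signature might look, assuming the argument names suggested in this thread; this mirrors the pattern of the `collect` command rather than its actual implementation:

```python
# Sketch only: the earlier suggested signature extended with the S3
# credential arguments. Defaults of None allow boto3 to fall back to
# the system-wide credentials.
def make_summary_spreadsheet(results_csv_path, output_path=None,
                             baselines=None, aws_key=None, aws_secret=None):
    ...
```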
sdgym/summary.py
```diff
@@ -3,6 +3,8 @@
 import numpy as np
 import pandas as pd

+from sdgym.results import add_sheet
```
If the `results.py` module is not being used anywhere else (which I think it is not), I would copy whatever is necessary here and remove the module.
Comment about how to decide whether to write to S3 or not, and how to do it.
sdgym/summary.py
```python
baselines = baselines or MODALITY_BASELINES
output_path = output_path or re.sub('.csv$', '.xlsx', results_csv_path)
output = io.BytesIO()
writer = pd.ExcelWriter(output) if aws_key and aws_secret else pd.ExcelWriter(output_path)
```
`aws_key` and `aws_secret` cannot be used to decide whether we are writing to S3 or not, since an S3 path can be passed without keys to let `boto3` use the system-wide credentials.

To keep things simple, I would just not distinguish between a local and an S3 path here: always write to the `BytesIO` and pass it down to `write_file`, which will decide whether to write to a local or remote file based on the given path.
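As a minimal sketch of that flow (the `sdgym.s3` import path and the exact `write_file` signature are assumptions here, not confirmed by this thread):

```python
import io

import pandas as pd

from sdgym.s3 import write_file  # import path assumed for illustration

output_path = 'results.xlsx'   # could equally be an s3://bucket/key path
aws_key = aws_secret = None    # None lets boto3 use system-wide credentials

# Render the workbook into an in-memory buffer, regardless of the
# final destination.
output = io.BytesIO()
writer = pd.ExcelWriter(output)
summary = pd.DataFrame({'synthesizer': ['A'], 'wins': [1]})  # placeholder data
summary.to_excel(writer, sheet_name='Summary (single-table)')
writer.save()

# write_file then decides, based on the given path, whether the bytes
# go to a local file or to an S3 object (signature assumed here).
write_file(output.getvalue(), output_path, aws_key, aws_secret)
```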
I see, I didn't know that. This is nice, though, since it will be cleaner in the end.
Force-pushed from 2f73a59 to 07a6988.
Good to go! Thanks @amontanez24!