Add metadata and description of variables to output files #401

Open
JuliaKukulies opened this issue Feb 1, 2024 · 2 comments

Comments

@JuliaKukulies
Member

As part of the xarray transition, we should add metadata and variable descriptions to the output files that are created with tobac. Part of it can be left to the user (e.g. the user-specific bulk statistics), but for projects like MCSMIP, where tobac data is shared and published, it would be helpful to open the files and see how we define each variable (e.g., what we currently only have listed here).

@JuliaKukulies JuliaKukulies added this to the Xarray Transition milestone Feb 1, 2024
@freemansw1
Member

Entirely agreed. This is a key component of being good citizens of FAIR principles (https://www.go-fair.org/fair-principles/). #354 doesn't necessarily get us all the way there for that; our feature detection output will still be a Pandas DataFrame at the moment, which has frustratingly limited metadata options.

We have a couple options for resolving that issue; we could simply output xarray if users input xarray rather than iris data. The issue there is that our users likely don't have a workflow set up for that xarray data (but they would have to opt into using xarray by changing their workflow anyway). We could also make it an option, and decide down the road whether to disable or make pandas non-default for output.

After #354, but before 1.6.0 releases, I think we should make sure that we have an xarray output option with the appropriate metadata. Perhaps that would be a good topic for the tobathon next week. How we implement it (default or an option) would be a good discussion; I think there are reasonable points on both sides.

Longer-term, we should have options (I think there's another issue for this) to output/combine into a single file, although that gets challenging with how large segmentation output can get.

@JuliaKukulies
Member Author

> We have a couple options for resolving that issue; we could simply output xarray if users input xarray rather than iris data. The issue there is that our users likely don't have a workflow set up for that xarray data (but they would have to opt into using xarray by changing their workflow anyway). We could also make it an option, and decide down the road whether to disable or make pandas non-default for output.

I think outputting xarray is the way to go because, as you say, users have to change their workflow for the xarray transition anyhow. And yes, it is frustrating that pandas dataframes have such limited options for metadata. A question that I think we have not discussed extensively is whether we only want to switch from iris to xarray or also replace all internal pandas dataframe operations. Pandas dataframes still have some very useful functionality, so maybe it would make sense to output even the features as xarray but keep pandas internally? I am not sure about this.
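To make the "pandas internally, xarray on output" idea concrete, here is a minimal sketch, not tobac's actual implementation. The helper `features_to_dataset`, the `VARIABLE_ATTRS` dictionary, and the description strings are all hypothetical; only the column names follow the usual feature dataframe layout.

```python
import pandas as pd
import xarray as xr

# Hypothetical variable descriptions; the wording and units are illustrative
# assumptions, not tobac's official definitions.
VARIABLE_ATTRS = {
    "frame": {"long_name": "index of the time step in the input data"},
    "hdim_1": {
        "long_name": "feature position along the first horizontal dimension",
        "units": "grid points",
    },
    "hdim_2": {
        "long_name": "feature position along the second horizontal dimension",
        "units": "grid points",
    },
    "threshold_value": {"long_name": "threshold at which the feature was detected"},
}


def features_to_dataset(features, variable_attrs, global_attrs=None):
    """Convert a feature dataframe to an xarray Dataset and attach metadata.

    Hypothetical helper: the dataframe is only converted at output time, so
    any internal pandas operations would be unaffected.
    """
    # Use the feature ID as the dimension of the resulting Dataset.
    ds = xr.Dataset.from_dataframe(features.set_index("feature"))
    # Attach per-variable descriptions where the variable is present.
    for name, attrs in variable_attrs.items():
        if name in ds:
            ds[name].attrs.update(attrs)
    # Attach file-level (global) metadata.
    ds.attrs.update(global_attrs or {})
    return ds


# Small made-up feature table for demonstration.
features = pd.DataFrame(
    {
        "feature": [1, 2, 3],
        "frame": [0, 0, 1],
        "hdim_1": [10.5, 42.0, 11.2],
        "hdim_2": [7.1, 30.3, 8.0],
        "threshold_value": [1.0, 2.0, 1.0],
    }
)
ds = features_to_dataset(features, VARIABLE_ATTRS, {"title": "example feature output"})
```

A dataset built this way could be written with `ds.to_netcdf(...)`, which stores the attributes alongside the data, and converted back with `ds.to_dataframe()` for users who want to stay in pandas.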

> After #354, but before 1.6.0 releases, I think we should make sure that we have an xarray output option with the appropriate metadata. Perhaps that would be a good topic for the tobathon next week. How we implement it (default or an option) would be a good discussion; I think there are reasonable points on both sides.

Good idea, I also thought that this is something we could take up at the tobathon since it would be useful to get input from users who are not currently developers.

> Longer-term, we should have options (I think there's another issue for this) to output/combine into a single file, although that gets challenging with how large segmentation output can get.

Do you mean something like our tobac.utils.combine_feature_dataframes functionality, but more internal, so that users can input a list of files/dataframes for tracking and output them all into a single file?
