Have attributes of training dataset in the repository #266

Open
merveenoyan opened this issue Jan 16, 2023 · 10 comments

Comments

@merveenoyan
Collaborator

The widget is cool and everything, but it's hard to see all the unique values of categorical variables, which variables are categorical, or the range of continuous columns. A couple of possible solutions:

  • Have attributes in config or README file
  • Have these in a separate file.

Ping @skops-dev/maintainers

@BenjaminBossan
Collaborator

I agree it would be useful to have this information.

Some questions I would have:

  1. How would this information be collected? I don't think it's feasible to automatically derive it from the training data. Even if it's a pandas df, there is still room for ambiguity. Therefore, it sounds like the user would have to indicate the information.
  2. What are all the different types that can exist? Categorical, ordinal, cardinal. How about time (at what resolution)? Text? Images? I don't think there is an agreed-upon standard for all feature types.
  3. Is there a standard of how to represent these types? It would be good if we didn't have to invent something new.

Of course, we don't have to have everything right from the start, but we should have an idea of what this addition would entail. And to me, it looks like it's far from trivial.

@adrinjalali
Member

I think it'd make sense to have this in the README as a part of the model card. We could have some method to generate as much info as we can from a given input dataframe, for example.
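
A minimal sketch of what such a method might look like, not an existing skops API. The function name `summarize_columns`, the `overrides` parameter, and the output schema are all assumptions made up for illustration. It takes plain `{column name: list of values}` data (with pandas, `df.to_dict("list")` would produce that shape) and guesses a type per column. Because the guess is ambiguous for things like integer-coded categories, the user can override it:

```python
from numbers import Real


def summarize_columns(columns, overrides=None):
    """Guess a per-column summary from {column name: list of values}.

    `overrides` (hypothetical) lets the user force a type, "categorical" or
    "continuous", for columns where the automatic guess would be wrong.
    """
    overrides = overrides or {}
    info = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        # Treat a column as numeric only if every value is a real number
        # (bool is excluded because it is a subclass of int).
        numeric = bool(present) and all(
            isinstance(v, Real) and not isinstance(v, bool) for v in present
        )
        kind = overrides.get(name, "continuous" if numeric else "categorical")
        if kind == "continuous":
            info[name] = {"type": kind, "min": min(present), "max": max(present)}
        else:
            info[name] = {"type": kind, "values": sorted({str(v) for v in present})}
    return info


columns = {
    "age": [22, 38, 26, 35],
    "sex": ["male", "female", "female", "male"],
    "pclass": [3, 1, 3, 1],  # integer-coded category, guessed wrong by default
}
data_info = summarize_columns(columns, overrides={"pclass": "categorical"})
```

Without the override, `pclass` would be reported as continuous, which is the kind of ambiguity raised earlier in the thread.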

@BenjaminBossan
Collaborator

I think the reason why Merve wanted to have them in the config.json or a separate file is that this information could be used to improve the UI on the Hub. E.g. in the inference widget, if we know the distinct values of a categorical feature, the widget could let the user choose the value from a list. If this information is added to the README, it would be more difficult to extract.
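
To illustrate why a machine-readable format helps, here is a rough sketch of how a consumer could turn such metadata into form fields. Everything here is hypothetical: the `widget_fields` function, the metadata schema, and the field descriptions are invented for illustration and do not reflect how the Hub widget actually works.

```python
def widget_fields(data_info):
    """Map per-column metadata to hypothetical widget field descriptions."""
    fields = []
    for name, spec in data_info.items():
        if spec["type"] == "categorical":
            # Known distinct values can be offered as a dropdown.
            fields.append({"name": name, "input": "select", "options": spec["values"]})
        else:
            # Continuous columns get a numeric input bounded by the observed range.
            fields.append(
                {"name": name, "input": "number", "min": spec["min"], "max": spec["max"]}
            )
    return fields


# Made-up metadata in an assumed schema.
data_info = {
    "sex": {"type": "categorical", "values": ["female", "male"]},
    "age": {"type": "continuous", "min": 22, "max": 38},
}
fields = widget_fields(data_info)
```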

@adrinjalali
Member

I see. For that purpose, I'm happy for it to live in a data-info.yml/json kind of file. We probably don't want to make the config file too large, I guess?
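
A small sketch of the separate-file idea, assuming the JSON variant. The helper `write_data_info`, the exact filename `data-info.json`, and the schema of the dictionary are assumptions for illustration, not something skops provides:

```python
import json
import tempfile
from pathlib import Path


def write_data_info(data_info, repo_dir, filename="data-info.json"):
    """Write the column summary next to the model files and return its path."""
    path = Path(repo_dir) / filename
    path.write_text(json.dumps(data_info, indent=2, sort_keys=True), encoding="utf-8")
    return path


# Made-up summary in an assumed schema.
data_info = {
    "age": {"type": "continuous", "min": 22, "max": 38},
    "sex": {"type": "categorical", "values": ["female", "male"]},
}

# A temporary directory stands in for the local copy of the model repository.
with tempfile.TemporaryDirectory() as repo_dir:
    path = write_data_info(data_info, repo_dir)
    loaded = json.loads(path.read_text(encoding="utf-8"))
    size_in_bytes = path.stat().st_size  # useful for keeping an eye on file size
```

Keeping this out of the config file means the config stays small regardless of how many distinct values get recorded.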

@merveenoyan
Collaborator Author

@adrinjalali I agree.

@lazarust
Contributor

lazarust commented Sep 3, 2023

@merveenoyan I'm happy to take this if it still needs to be done!

@lazarust
Contributor

lazarust commented Sep 8, 2023

@BenjaminBossan I'm happy to take this one but had a few thoughts/questions:

  1. When should the file be generated?
  2. Is there a list of data types that we want to support initially? You mentioned a couple above and I agree it would be pretty hard to have all of them since there isn't an agreed-upon standard.

@BenjaminBossan
Collaborator

Thanks for taking an interest in the issue. I think there is no definite answer to your question. The initial motivation is to know in advance what options exist for categorical data to improve the widget, but I think Adrin made a good point about file size, which can easily get large if we just record all distinct values, so some kind of compromise would need to be found.
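
One possible shape for such a compromise, purely as a sketch: cap the number of recorded distinct values and flag when the list was cut off. The function name, the `max_categories` parameter, and its default of 20 are arbitrary assumptions, not an agreed design.

```python
def summarize_categories(values, max_categories=20):
    """Record at most `max_categories` distinct values and flag truncation."""
    distinct = sorted({str(v) for v in values if v is not None})
    return {
        "type": "categorical",
        "n_distinct": len(distinct),
        "values": distinct[:max_categories],
        "truncated": len(distinct) > max_categories,
    }


small = summarize_categories(["male", "female", "male"])
# A high-cardinality column, e.g. IDs, would otherwise bloat the file.
large = summarize_categories([f"id-{i:04d}" for i in range(1000)], max_categories=5)
```

A consumer could then fall back to a free-text input whenever `truncated` is true.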

Also, for this feature to make sense, we would need to do work on the widget side as well, for which there is currently no capacity AFAIK, so I would rather not work on this feature right now.

@lazarust
Contributor

lazarust commented Sep 8, 2023

@BenjaminBossan Sounds good! Is there another issue I could help out with?

@BenjaminBossan
Collaborator

If this is something you're willing to jump into, I think we have some room to improve the skops.io persistence format. For instance, support for more external libraries could be added, like scikeras (#388) or skorch :)


4 participants