Add example to Cookbook to export training data for spaCy v3 #420

Closed
dvsrepo opened this issue Oct 5, 2021 · 8 comments · Fixed by #1642 or #1691
Labels: good first issue, status: help wanted, type: documentation

Comments

@dvsrepo
Member

dvsrepo commented Oct 5, 2021

Similar to the following example:

https://rubrix.readthedocs.io/en/stable/guides/cookbook.html#Training

Add an example that transforms a Rubrix dataset into spaCy's DocBin format and saves it to disk, so that it can be used for training with the spacy train command.
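For reference, here is a minimal sketch of what such an export could look like. It is not the cookbook example itself: the record layout (a text plus `(label, start_char, end_char)` tuples), the sample data, and the helper names `records_to_docbin` and `export_for_spacy` are all assumptions made for illustration. Only the spaCy v3 calls (`spacy.blank`, `Doc.char_span`, `DocBin.add`, `DocBin.to_disk`) are real API.

```python
from typing import Iterable, List, Tuple

# Assumed record layout for this sketch: (text, [(label, start_char, end_char), ...]).
Annotation = Tuple[str, int, int]
Record = Tuple[str, List[Annotation]]

# Illustrative sample data, not taken from a real Rubrix dataset.
RECORDS: List[Record] = [
    ("Paris is the capital of France.", [("LOC", 0, 5), ("LOC", 24, 30)]),
]


def records_to_docbin(records: Iterable[Record], lang: str = "en"):
    """Build a spaCy DocBin holding one Doc per annotated record."""
    # Imported lazily so that spaCy is only required when the export runs.
    import spacy
    from spacy.tokens import DocBin

    nlp = spacy.blank(lang)  # tokenizer only, no trained pipeline needed
    doc_bin = DocBin()
    for text, annotations in records:
        doc = nlp.make_doc(text)
        spans = []
        for label, start, end in annotations:
            # char_span returns None when the offsets do not align with token boundaries.
            span = doc.char_span(start, end, label=label)
            if span is not None:
                spans.append(span)
        doc.ents = spans  # raises if spans overlap; this sketch assumes they do not
        doc_bin.add(doc)
    return doc_bin


def export_for_spacy(records: Iterable[Record], path: str, lang: str = "en") -> None:
    """Save the records to disk in the binary format read by `spacy train`."""
    records_to_docbin(records, lang).to_disk(path)
```

Calling `export_for_spacy(RECORDS, "./train.spacy")` writes a file that can then be passed to training, for example with `python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy`, where `config.cfg` and `dev.spacy` are placeholders for your own files.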

@dvsrepo dvsrepo added the type: documentation, good first issue, and status: help wanted labels Oct 5, 2021
@ignacioct
Contributor

Hi guys! I thought I might give some issues a try and go back to good old Rubrix. It also seemed reasonable to start with some documentation 😁.

I have several questions regarding this one.

  • Is the idea to extend the current Training section to include the export, or to create a new section below it?
  • Is the dataset meant to be exported after training? If so, I suppose the process is more or less the same as in the Training section, plus the export-and-save example (that's also why I was asking whether I'm meant to extend the current example or create a new one).
  • If a new section is added, should it be placed after Training?

I hope everything's going fine in sunny, warm Spain 💛.

@dvsrepo
Member Author

dvsrepo commented Apr 28, 2022

Hi @ignacioct !!

Thanks so much for coming back!

I think it would be better to extend the prepare_for_training method, which currently supports HF datasets, so that it also supports spaCy. @dcfidalgo knows the details, so it might be good to organize what's needed to tackle this. We can then change this issue's title and description.

@dcfidalgo
Contributor

Hi @ignacioct ! great to hear from you!

Yes, I think it makes more sense to extend the DatasetFor*.prepare_for_training method.
The idea is that you have a DatasetForTokenClassification, for example, and calling DatasetForTokenClassification.prepare_for_training(framework="spacy") returns a spaCy DocBin that you can use for training with spaCy. We still need to discuss the exact naming of the argument.

I think we should start by implementing the extension for one task only, and show how to use it in the documentation.
And then continue with the other tasks. Not sure if text or token classification will be easier to start with, I have to investigate a bit.
Hmm, not sure how much time you can spare. Let me know if you would be interested in working on both the code base and the documentation, just the documentation, or just the code base ... maybe we could even have a quick call. Just let me know what you prefer!
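To make the proposal concrete, here is a rough sketch of how a framework argument could select the output format. `TokenClassificationDataset` and `Record` are hypothetical stand-ins, not Rubrix classes, and the `lang` parameter, the `"transformers"` default, and the `(label, start_char, end_char)` annotation layout are assumptions, since the argument naming is still open. Plain dicts stand in for a Hugging Face dataset so the sketch does not need that library.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

# Assumed annotation layout for this sketch: (label, start_char, end_char).
Entity = Tuple[str, int, int]


@dataclass
class Record:
    text: str
    annotation: List[Entity] = field(default_factory=list)


@dataclass
class TokenClassificationDataset:
    """Hypothetical stand-in for DatasetForTokenClassification."""

    records: List[Record]

    def prepare_for_training(self, framework: str = "transformers", lang: str = "en") -> Any:
        """Return the records in the format expected by the chosen framework."""
        if framework == "transformers":
            # Plain dicts stand in for a Hugging Face dataset here.
            return [{"text": r.text, "entities": r.annotation} for r in self.records]
        if framework == "spacy":
            return self._to_docbin(lang)
        raise ValueError(f"Unsupported framework: {framework!r}")

    def _to_docbin(self, lang: str):
        # spaCy is imported lazily so that it stays an optional dependency.
        import spacy
        from spacy.tokens import DocBin

        nlp = spacy.blank(lang)
        doc_bin = DocBin()
        for record in self.records:
            doc = nlp.make_doc(record.text)
            spans = [
                doc.char_span(start, end, label=label)
                for label, start, end in record.annotation
            ]
            # Drop spans that do not align with token boundaries (char_span gave None).
            doc.ents = [span for span in spans if span is not None]
            doc_bin.add(doc)
        return doc_bin


# Illustrative data only.
dataset = TokenClassificationDataset(
    [Record("Paris is the capital of France.", [("LOC", 0, 5), ("LOC", 24, 30)])]
)
```

With this shape, `dataset.prepare_for_training(framework="spacy").to_disk("./train.spacy")` would produce a file usable for spaCy training, while an unknown framework name fails early with a `ValueError`.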

@dvsrepo
Member Author

dvsrepo commented Apr 28, 2022

Yes! I would say that NER is by far the most widely used task, so we should focus on that one first.

@ignacioct
Contributor

Hi! I've looked at that method a little bit, and I think this is the way to go. I will need to look further into spaCy's DocBins, but that should not be a problem. Regarding my time, I will be using some spare time that I have, so as long as it's not critical I can take both code and documentation. And yeah, we can communicate via Slack or this discussion (or the forums, however you guys usually proceed), and we could have a brief talk. Whichever method steals the least amount of time from you guys 😊.

@dcfidalgo
Contributor

Hey, just left you a message on the Rubrix slack channel, I think this will be the easiest way to get started.

@frascuchon frascuchon added this to Backlog in Release via automation Jun 14, 2022
@ignacioct
Contributor

Hi! I've added a PR with a first draft of what the method would look like. It is, of course, a first version, and further coding and testing must be done, but I want to be sure we are on the same page about the direction, @dcfidalgo @dvsrepo.

@frascuchon frascuchon moved this from Backlog to In progress in Release Jul 21, 2022
@frascuchon frascuchon added this to the v0.17.0 milestone Jul 28, 2022
Release automation moved this from In progress to Waiting Release Jul 31, 2022
@frascuchon frascuchon moved this from Waiting Release to Ready to Release QA in Release Aug 18, 2022
frascuchon added a commit that referenced this issue Aug 18, 2022
@dvsrepo dvsrepo reopened this Aug 20, 2022
@dvsrepo
Member Author

dvsrepo commented Aug 20, 2022

After reviewing this, I've improved the cookbook description and removed the cell output.

Once this change is merged, this is ready to go in 0.17.0.

frascuchon pushed a commit that referenced this issue Aug 22, 2022
@frascuchon frascuchon linked a pull request Aug 22, 2022 that will close this issue
@frascuchon frascuchon moved this from Ready to Release QA to Waiting Release in Release Aug 22, 2022
@frascuchon frascuchon moved this from Waiting Release to Ready to Release QA in Release Aug 22, 2022
frascuchon added a commit that referenced this issue Aug 22, 2022
@frascuchon frascuchon moved this from Ready to Release QA to Approved Release QA in Release Aug 22, 2022
Projects
No open projects
Release
Approved Release QA
4 participants