
feat: Implement 'prepare_for_training' for text classification datasets #1209

Merged
merged 6 commits into master from feat/prepare_for_training on Mar 2, 2022

Conversation

dcfidalgo
Contributor

@dcfidalgo dcfidalgo commented Feb 28, 2022

This PR implements the prepare_for_training method for the DatasetForTextClassification. I modified the first tutorial to show the usage: https://rubrix.readthedocs.io/en/feat-prepare_for_training/tutorials/01-labeling-finetuning.html
I don't think I will manage to implement the prepare_for_training for all tasks before the release.

@dvsrepo @frascuchon For the TokenClassification task, should we rely on spaCy utilities to make the conversion from spans to NER tags, or should we provide our own utilities? To me it would seem a bit strange if we required spaCy to be installed to prepare the dataset for training with transformers ... @dvsrepo Do you know of any other library that provides utilities for converting spans to NER tags?
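For context, prepare_for_training turns annotated records into training rows, returning one column per input key (as the commit messages note). A minimal standalone sketch of that conversion, using a hypothetical record shape rather than the real Rubrix classes:

```python
# Illustrative sketch only: the record fields ("inputs", "annotation")
# mimic Rubrix's text classification records, but this is not the
# actual library API.

def prepare_for_training(records, label2id):
    """Keep only annotated records and map string labels to integer ids."""
    rows = []
    for rec in records:
        if rec.get("annotation") is None:
            continue  # unannotated records carry no training signal
        row = dict(rec["inputs"])  # one column per input key
        row["label"] = label2id[rec["annotation"]]
        rows.append(row)
    return rows

records = [
    {"inputs": {"text": "great product"}, "annotation": "positive"},
    {"inputs": {"text": "meh"}, "annotation": None},  # skipped
    {"inputs": {"text": "broke after a day"}, "annotation": "negative"},
]
label2id = {"positive": 0, "negative": 1}
rows = prepare_for_training(records, label2id)
# rows now holds two dicts with "text" and integer "label" columns,
# ready to feed into a transformers fine-tuning pipeline.
```

The returned list of dicts maps directly onto a Hugging Face datasets.Dataset via Dataset.from_list, which is the shape a transformers Trainer expects.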

@dcfidalgo dcfidalgo self-assigned this Feb 28, 2022
@dcfidalgo dcfidalgo added this to In progress in Release via automation Feb 28, 2022
@codecov

codecov bot commented Feb 28, 2022

Codecov Report

Merging #1209 (e4c0bc8) into master (1ba663f) will increase coverage by 0.04%.
The diff coverage is 97.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1209      +/-   ##
==========================================
+ Coverage   95.23%   95.28%   +0.04%     
==========================================
  Files         124      124              
  Lines        5143     5155      +12     
==========================================
+ Hits         4898     4912      +14     
+ Misses        245      243       -2     
Flag: pytest, Coverage Δ: 95.28% <97.61%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/rubrix/client/datasets.py 98.03% <97.61%> (-0.23%) ⬇️
src/rubrix/client/sdk/users/api.py 100.00% <0.00%> (ø)
src/rubrix/client/rubrix_client.py 94.04% <0.00%> (+1.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ba663f...e4c0bc8.

Two review threads on src/rubrix/client/datasets.py (outdated, resolved)
@dcfidalgo dcfidalgo marked this pull request as ready for review March 1, 2022 16:01
@dcfidalgo dcfidalgo requested a review from dvsrepo March 1, 2022 16:24
@dvsrepo
Member

dvsrepo commented Mar 1, 2022

> @dvsrepo @frascuchon For the TokenClassification task, should we rely on spaCy utilities to make the conversion from spans to NER tags, or should we provide our own utilities? To me it would seem a bit strange if we required spaCy to be installed to prepare the dataset for training with transformers ... @dvsrepo Do you know of any other library that provides utilities for converting spans to NER tags?

I agree we shouldn't rely on spaCy. Let me think of alternatives

Release automation moved this from In progress to Review Mar 1, 2022
@frascuchon
Member

> Do you know of any other library that provides utilities for converting spans to NER tags?

The server part computes tags at token level. You could use that for this purpose. If I remember correctly, the computed tags are in IOB format. It could be a beginning.
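The token-level span-to-IOB conversion discussed here can be done without spaCy. A rough sketch, assuming tokens come with character offsets (the helper name and signature are illustrative, not the actual server code):

```python
# Illustrative sketch: convert character-level entity spans into
# token-level IOB tags, given each token's (start, end) character offsets.
# Not the real Rubrix server implementation.

def spans_to_iob(token_offsets, spans):
    """token_offsets: list of (start, end) per token;
    spans: list of (start, end, label) entity annotations."""
    tags = ["O"] * len(token_offsets)
    for start, end, label in spans:
        inside = False  # becomes True after the first token of the span
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start >= start and tok_end <= end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

# "Rubrix is great" tokenized with character offsets
offsets = [(0, 6), (7, 9), (10, 15)]
spans = [(0, 6, "PRODUCT")]
tags = spans_to_iob(offsets, spans)  # → ["B-PRODUCT", "O", "O"]
```

This only handles the clean case where span boundaries align with token boundaries; misaligned spans would need a policy (clip, expand, or skip) that the sketch leaves out.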

@dvsrepo
Member

dvsrepo commented Mar 1, 2022

> The server part computes tags at token level. You could use that for this purpose. If I remember correctly, the computed tags are in IOB format. It could be a beginning.

Definitely, that's more than enough!

@dcfidalgo dcfidalgo merged commit 0878709 into master Mar 2, 2022
@dcfidalgo dcfidalgo deleted the feat/prepare_for_training branch March 2, 2022 15:09
@dcfidalgo
Contributor Author

> The server part computes tags at token level. You could use that for this purpose. If I remember correctly, the computed tags are in IOB format. It could be a beginning.

Nice, totally forgot about this. Hm, should I copy the logic to the client part? Shall we require the user to upload/load the records before calling prepare_for_training? Or is there an endpoint where I can send the records and get back the metrics/IOB tags?

@frascuchon
Member

Not sure about the user flow here. We can discuss tomorrow in a call @dcfidalgo

@frascuchon frascuchon moved this from Review to Ready to DEV QA in Release Mar 3, 2022
@frascuchon frascuchon moved this from Ready to DEV QA to Approved DEV QA in Release Mar 3, 2022
@frascuchon frascuchon moved this from Approved DEV QA to Ready to Release QA in Release Mar 3, 2022
frascuchon pushed a commit that referenced this pull request Mar 3, 2022
…ts (#1209)

* feat: add prepare for training

* test: add test

* docs: adapt tutorial

* chore: return column for each input key

* test: adapt tests

* docs: fix small typo

(cherry picked from commit 0878709)
@dcfidalgo dcfidalgo moved this from Ready to Release QA to Approved Release QA in Release Mar 3, 2022
frascuchon pushed a commit that referenced this pull request Mar 4, 2022
…ts (#1209)

* feat: add prepare for training

* test: add test

* docs: adapt tutorial

* chore: return column for each input key

* test: adapt tests

* docs: fix small typo

(cherry picked from commit 0878709)
3 participants