
feat: Implement 'prepare_for_training' for text classification datasets #1209

Merged
merged 6 commits into master from feat/prepare_for_training on Mar 2, 2022

Conversation

dcfidalgo
Contributor

@dcfidalgo dcfidalgo commented Feb 28, 2022

This PR implements the prepare_for_training method for the DatasetForTextClassification. I modified the first tutorial to show the usage: https://rubrix.readthedocs.io/en/feat-prepare_for_training/tutorials/01-labeling-finetuning.html
I don't think I will manage to implement the prepare_for_training for all tasks before the release.

@dvsrepo @frascuchon For the TokenClassification task, should we rely on spaCy utilities to make the conversion from spans to NER tags, or should we provide our own utilities? To me it would seem a bit strange if we required spaCy to be installed to prepare the dataset for training with transformers ... @dvsrepo Do you know of any other library that provides utilities for converting spans to NER tags?
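For context, prepare_for_training turns annotated records into training rows, returning one column per input key (as the commit messages note). A minimal standalone sketch of that conversion, using a hypothetical record shape rather than the real Rubrix classes:

```python
# Illustrative sketch only: the record fields ("inputs", "annotation")
# mimic Rubrix's text classification records, but this is not the
# actual library API.

def prepare_for_training(records, label2id):
    """Keep only annotated records and map string labels to integer ids."""
    rows = []
    for rec in records:
        if rec.get("annotation") is None:
            continue  # unannotated records carry no training signal
        row = dict(rec["inputs"])  # one column per input key
        row["label"] = label2id[rec["annotation"]]
        rows.append(row)
    return rows

records = [
    {"inputs": {"text": "great product"}, "annotation": "positive"},
    {"inputs": {"text": "meh"}, "annotation": None},  # skipped
    {"inputs": {"text": "broke after a day"}, "annotation": "negative"},
]
label2id = {"positive": 0, "negative": 1}
rows = prepare_for_training(records, label2id)
# rows now holds two dicts with "text" and integer "label" columns,
# ready to feed into a transformers fine-tuning pipeline.
```

The returned list of dicts maps directly onto a Hugging Face datasets.Dataset via Dataset.from_list, which is the shape a transformers Trainer expects.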

@dcfidalgo dcfidalgo self-assigned this Feb 28, 2022
@dcfidalgo dcfidalgo added this to In progress in Release via automation Feb 28, 2022
@codecov

codecov bot commented Feb 28, 2022

Codecov Report

Merging #1209 (e4c0bc8) into master (1ba663f) will increase coverage by 0.04%.
The diff coverage is 97.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1209      +/-   ##
==========================================
+ Coverage   95.23%   95.28%   +0.04%     
==========================================
  Files         124      124              
  Lines        5143     5155      +12     
==========================================
+ Hits         4898     4912      +14     
+ Misses        245      243       -2     
Flag: pytest, Coverage Δ: 95.28% <97.61%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/rubrix/client/datasets.py 98.03% <97.61%> (-0.23%) ⬇️
src/rubrix/client/sdk/users/api.py 100.00% <0.00%> (ø)
src/rubrix/client/rubrix_client.py 94.04% <0.00%> (+1.26%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ba663f...e4c0bc8.

Two review threads on src/rubrix/client/datasets.py (outdated, resolved)
@dcfidalgo dcfidalgo marked this pull request as ready for review March 1, 2022 16:01
@dcfidalgo dcfidalgo requested a review from dvsrepo March 1, 2022 16:24
@dvsrepo
Member

dvsrepo commented Mar 1, 2022

> @dvsrepo @frascuchon For the TokenClassification task, should we rely on spaCy utilities to make the conversion from spans to NER tags, or should we provide our own utilities? To me it would seem a bit strange if we required spaCy to be installed to prepare the dataset for training with transformers ... @dvsrepo Do you know of any other library that provides utilities for converting spans to NER tags?

I agree we shouldn't rely on spaCy. Let me think of alternatives

Release automation moved this from In progress to Review Mar 1, 2022
@frascuchon
Member

> Do you know of any other library that provides utilities for converting spans to NER tags?

The server part computes tags at token level. You could use that for this purpose. If I remember correctly, the computed tags are in IOB format. It could be a beginning.
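The token-level span-to-IOB conversion discussed here can be done without spaCy. A rough sketch, assuming tokens come with character offsets (the helper name and signature are illustrative, not the actual server code):

```python
# Illustrative sketch: convert character-level entity spans into
# token-level IOB tags, given each token's (start, end) character offsets.
# Not the real Rubrix server implementation.

def spans_to_iob(token_offsets, spans):
    """token_offsets: list of (start, end) per token;
    spans: list of (start, end, label) entity annotations."""
    tags = ["O"] * len(token_offsets)
    for start, end, label in spans:
        inside = False  # becomes True after the first token of the span
        for i, (tok_start, tok_end) in enumerate(token_offsets):
            if tok_start >= start and tok_end <= end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

# "Rubrix is great" tokenized with character offsets
offsets = [(0, 6), (7, 9), (10, 15)]
spans = [(0, 6, "PRODUCT")]
tags = spans_to_iob(offsets, spans)  # → ["B-PRODUCT", "O", "O"]
```

This only handles the clean case where span boundaries align with token boundaries; misaligned spans would need a policy (clip, expand, or skip) that the sketch leaves out.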

@dvsrepo
Member

dvsrepo commented Mar 1, 2022

> The server part computes tags at token level. You could use that for this purpose. If I remember correctly, the computed tags are in IOB format. It could be a beginning.

Definitely, that's more than enough!

@dcfidalgo dcfidalgo merged commit 0878709 into master Mar 2, 2022
@dcfidalgo dcfidalgo deleted the feat/prepare_for_training branch March 2, 2022 15:09
@dcfidalgo
Contributor Author

> The server part computes tags at token level. You could use that for this purpose. If I remember correctly, the computed tags are in IOB format. It could be a beginning.

Nice, totally forgot about this. Hm, should I copy the logic to the client part? Shall we require the user to upload/load the records before calling prepare_for_training? Or is there an endpoint where I can send the records and get back the metrics/IOB tags?

@frascuchon
Member

Not sure about the user flow here. We can discuss tomorrow in a call @dcfidalgo

@frascuchon frascuchon moved this from Review to Ready to DEV QA in Release Mar 3, 2022
@frascuchon frascuchon moved this from Ready to DEV QA to Approved DEV QA in Release Mar 3, 2022
@frascuchon frascuchon moved this from Approved DEV QA to Ready to Release QA in Release Mar 3, 2022
frascuchon pushed a commit that referenced this pull request Mar 3, 2022
…ts (#1209)

* feat: add prepare for training

* test: add test

* docs: adapt tutorial

* chore: return column for each input key

* test: adapt tests

* docs: fix small typo

(cherry picked from commit 0878709)
@dcfidalgo dcfidalgo moved this from Ready to Release QA to Approved Release QA in Release Mar 3, 2022
frascuchon pushed a commit that referenced this pull request Mar 4, 2022
…ts (#1209)

* feat: add prepare for training

* test: add test

* docs: adapt tutorial

* chore: return column for each input key

* test: adapt tests

* docs: fix small typo

(cherry picked from commit 0878709)
3 participants