[Active Learning] Small text tutorial hits errors when running on local #1693

Closed
frascuchon opened this issue Aug 22, 2022 · 8 comments · Fixed by #1726 or #1744
Labels
type: documentation Improvements or additions to documentation
Milestone

Comments

@frascuchon
Member

The small-text tutorial should be reviewed, since some odd errors occur when running it on a local machine with some customizations (setting a different initial batch size, no CUDA installed, ...).

@frascuchon frascuchon added type: bug Indicates an unexpected problem or unintended behavior type: documentation Improvements or additions to documentation labels Aug 22, 2022
@frascuchon frascuchon added this to Backlog in Release via automation Aug 22, 2022
@frascuchon frascuchon moved this from Backlog to Planified in Release Sep 7, 2022
@frascuchon frascuchon removed this from Planified in Release Sep 13, 2022
@frascuchon frascuchon added this to the v0.18.0 milestone Sep 13, 2022
@chschroeder
Contributor

Just spotted this. Which errors exactly? Is this something within the scope of the tutorial, or can it be improved upstream in small-text?

@frascuchon
Member Author

frascuchon commented Sep 21, 2022

Hi @chschroeder, sorry for the late response, and thank you for asking.

I recently ran the tutorial in a clean Python environment and found several problems:

  • Disabling the CUDA configuration works fine, but querying new records takes far too long. Not sure if we can improve it somehow.
  • The dataset feature names have changed. The label-coarse field is now called coarse_label. We can handle this easily.
  • For the initial batch (batch_id=0), labeling all records with the same label raises the error:
    Updating with batch_id 0 ...
    Traceback (most recent call last):
    File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/rubrix/listeners/listener.py", line 219, in __run_action__
    return self.action(*args, *action_args, **kwargs)
    File "/var/folders/8f/mt_m87_d19q3zcnyr6dmf0pw0000gn/T/ipykernel_85718/568527003.py", line 30, in active_learning_loop
    active_learner.initialize_data(indices, y)
    File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 151, in initialize_data
    self._retrain(indices_validation=indices_validation)
    File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 390, in _retrain
    self._clf.fit(dataset)
    File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 330, in fit
    return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
    File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 340, in _fit_main
    raise ValueError('Conflicting information about the number of classes: '
    ValueError: Conflicting information about the number of classes: expected: 6, encountered: 4
    This situation does not occur for subsequent batches (batch_id >= 1).
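The renamed label field mentioned in the list above could be handled by looking up whichever name the loaded dataset actually exposes. This is a minimal sketch: `resolve_label_field` is a hypothetical helper (not part of the tutorial or of any library), and plain lists stand in for the dataset's column names.

```python
def resolve_label_field(column_names, candidates=("coarse_label", "label-coarse")):
    """Return the first candidate field name that is present in column_names."""
    for name in candidates:
        if name in column_names:
            return name
    raise KeyError(f"None of {candidates} found in {sorted(column_names)}")


# Plain lists stand in for the column names of the old and new dataset versions.
print(resolve_label_field(["text", "coarse_label", "fine_label"]))  # coarse_label
print(resolve_label_field(["text", "label-coarse", "label-fine"]))  # label-coarse
```

Listing the new name first means the helper prefers the current schema and only falls back to the old one.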

I really don't know what the behaviour should be or how we could avoid it. Any ideas? @dvsrepo @dcfidalgo @chschroeder

@chschroeder
Contributor

No worries, I missed that issue here as well. Feel free to ping me in such cases as you did now :).

  • Disabling the CUDA configuration works fine, but querying new records takes far too long. Not sure if we can improve it somehow.

    Running transformers on the CPU (at least when training is involved) is usually so time-consuming that you don't want to do it.

    Suggestion: Fall back to a very small transformer model such as bert-medium or bert-tiny if CUDA is not available. I would still print a warning telling users that this is intended to run on a GPU (maybe provide a Colab link?) and is now running in a CPU-only fallback mode, which is slow (and yields worse results if a smaller model is used). (Check once whether the classification results are still acceptable after that.) If it's still too slow after these changes, you can try lowering the number of epochs (by setting num_epochs in the factory's kwargs argument).
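The fallback suggested above could look roughly like this. It is only a sketch under stated assumptions: both model names and both epoch counts are illustrative placeholders, and `cuda_is_available` and `choose_training_config` are hypothetical helpers. The returned values would still have to be passed to the classifier factory used in the tutorial.

```python
import warnings

# Assumed model names for illustration; substitute the ones your tutorial uses.
GPU_MODEL = "bert-base-uncased"
CPU_FALLBACK_MODEL = "google/bert_uncased_L-2_H-128_A-2"  # a very small BERT checkpoint


def cuda_is_available():
    """Return True if PyTorch is installed and reports a usable CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


def choose_training_config(cuda_available):
    """Pick a model name and epoch count depending on CUDA availability."""
    if cuda_available:
        return {"model_name": GPU_MODEL, "num_epochs": 10}  # epoch count is an assumption
    warnings.warn(
        "CUDA is not available: this tutorial is intended to run on a GPU. "
        "Falling back to a smaller model in CPU-only mode, which is slow "
        "and may yield worse results.",
        RuntimeWarning,
    )
    return {"model_name": CPU_FALLBACK_MODEL, "num_epochs": 3}  # fewer epochs on CPU


config = choose_training_config(cuda_is_available())
print(config["model_name"], config["num_epochs"])
```

Keeping the decision in one function makes it easy to check that classification quality is still acceptable with the fallback settings.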

  • For the initial batch (batch_id=0), labeling all records with the same label raises the error:
    [...]

    First: This problem is caused by a safety check. The intention behind that check was that every class must occur at least once (i.e. the number of encountered classes must match the number of classes of the model). In this case the check was a good thing, since it told you that the initialization could be better. In reality, this might not be achievable every time, especially in multi-label scenarios. Therefore I removed this check in the current dev version of small-text.
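To illustrate the kind of check described above, here is a minimal stand-alone sketch. It is not small-text's actual implementation; `check_all_classes_present` is a hypothetical function that only mirrors the error message from the traceback.

```python
def check_all_classes_present(labels, num_classes):
    """Raise ValueError unless the number of distinct labels equals num_classes."""
    encountered = len(set(labels))
    if encountered != num_classes:
        raise ValueError(
            "Conflicting information about the number of classes: "
            f"expected: {num_classes}, encountered: {encountered}"
        )


initial_labels = [0, 2, 2, 3, 5, 0]  # only 4 of the 6 classes occur
try:
    check_all_classes_present(initial_labels, num_classes=6)
except ValueError as err:
    print(err)
# Conflicting information about the number of classes: expected: 6, encountered: 4
```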

    For now, you can switch from random initialization to balanced random initialization:

    from small_text.initialization import random_initialization
    [...]
    # Randomly draw an initial subset from the data pool
    initial_indices = random_initialization(dataset, NUM_SAMPLES)
    

    -->

    from small_text.initialization import random_initialization_balanced
    [...]
    # Randomly draw a *class-balanced* initial subset from the data pool
    initial_indices = random_initialization_balanced(dataset, NUM_SAMPLES)
    

    random_initialization_balanced provides an initial set where the label distribution is balanced over the classes (or as close to balanced as possible). This also means that every class occurs at least once if your initial set size is greater than or equal to the number of classes.
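The idea behind a class-balanced draw can be sketched in plain Python. This is an illustration of the concept only, not the code of random_initialization_balanced; `balanced_initial_indices` is a hypothetical function that takes a list of integer labels.

```python
import random
from collections import defaultdict


def balanced_initial_indices(labels, n_samples, seed=None):
    """Draw n_samples indices, cycling over the classes so each is covered as evenly as possible."""
    if n_samples > len(labels):
        raise ValueError("n_samples must not exceed the number of labels")
    rng = random.Random(seed)

    # Group the indices by class and shuffle each group.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    for indices in by_class.values():
        rng.shuffle(indices)

    # Round-robin over the classes until enough indices are drawn.
    chosen = []
    classes = sorted(by_class)
    while len(chosen) < n_samples:
        for label in classes:
            if by_class[label] and len(chosen) < n_samples:
                chosen.append(by_class[label].pop())
    return chosen


labels = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 5]
indices = balanced_initial_indices(labels, n_samples=6, seed=42)
print(sorted(labels[i] for i in indices))  # [0, 1, 2, 3, 4, 5]
```

With six samples and six classes, the round-robin draw picks exactly one example per class, whatever the seed.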

    With small-text 1.1.0 this error will not be raised anymore.

@frascuchon
Member Author

Thanks a lot @chschroeder

I've tried using the balanced random initialization, but it hits the following error (INITIAL_SAMPLES=20):

RuntimeError                              Traceback (most recent call last)
Cell In [6], line 12
      9 NUM_SAMPLES = 5
     11 # Randomly draw an initial subset from the data pool
---> 12 initial_indices = random_initialization_balanced(dataset, INITIAL_SAMPLES)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/initialization/strategies.py:84, in random_initialization_balanced(y, n_samples)
     82     raise NotImplementedError()
     83 else:
---> 84     return balanced_sampling(y, n_samples=n_samples)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/data/sampling.py:120, in balanced_sampling(y, n_samples)
    117     y = np.array(y)
    119 # num classes according to the labels
--> 120 num_classes = np.max(y) + 1
    121 # num classes encountered
    122 num_classes_present = len(np.unique(y))

File <__array_function__ internals>:180, in amax(*args, **kwargs)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:2793, in amax(a, axis, out, keepdims, initial, where)
   2677 @array_function_dispatch(_amax_dispatcher)
   2678 def amax(a, axis=None, out=None, keepdims=np._NoValue, initial=np._NoValue,
   2679          where=np._NoValue):
   2680     """
   2681     Return the maximum of an array or maximum along an axis.
   2682 
   (...)
   2791     5
   2792     """
-> 2793     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
   2794                           keepdims=keepdims, initial=initial, where=where)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     83         else:
     84             return reduction(axis=axis, out=out, **passkwargs)
---> 86 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

@chschroeder
Contributor

Oh, my bad :). This initialization method takes the labels as arguments (dataset -> dataset.y). I just replaced the function name and forgot to adapt this part.

initial_indices = random_initialization_balanced(dataset.y, NUM_SAMPLES)

@frascuchon
Member Author

Great!

I've changed some code in the tutorial (the training set was initialized with LABEL_UNLABELLED values, so it was not possible to use its labels with the random_initialization_balanced function).

Now it's working but, for the initial annotated batch, passing only one label (all records in batch annotated with the same value) hits a similar error:

Updating with batch_id 0 ...
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/rubrix/listeners/listener.py", line 219, in __run_action__
    return self.action(*args, *action_args, **kwargs)
  File "/var/folders/8f/mt_m87_d19q3zcnyr6dmf0pw0000gn/T/ipykernel_11659/568527003.py", line 30, in active_learning_loop
    active_learner.initialize_data(indices, y)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 151, in initialize_data
    self._retrain(indices_validation=indices_validation)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 390, in _retrain
    self._clf.fit(dataset)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 330, in fit
    return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 340, in _fit_main
    raise ValueError('Conflicting information about the number of classes: '
ValueError: Conflicting information about the number of classes: expected: 6, encountered: 1

Any idea to handle this?

@chschroeder
Contributor

Hm, in this case the required number of classes is of course impossible to find. Using small-text==1.1.0 would make the symptoms go away. (How urgent is this fix? I could prioritize the release depending on that.)

However, this might also give a false sense of security. Under these circumstances the model has not seen all classes, so the uncertainties may be suboptimal, and the queried examples may not be as useful as they otherwise could be.

In my own active learning settings with few labels, I always argue that it is reasonable to require 1-2 examples per class from the user before starting the active learning loop. I know that this is not applicable to use cases with thousands of labels. There are so-called cold start approaches which try to handle this setting, but as far as I know there is no single best approach, and each method brings its own advantages and disadvantages.

In the end, this is a decision for your rubrix workflows as well. Possible solutions (non-exhaustive): a) Do nothing (once the error is fixed) and trust the user. b) Show a warning. c) Show a warning and advise better settings dynamically (e.g., request the user to provide examples for all classes or recommend another query strategy). d) Describe the problem and give recommendations in the documentation (it might also be my responsibility to do this in the small-text docs, regardless of what you decide on the workflows here).
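Options b) and c) could be sketched as a small check that runs before the first training step. This is an assumption-laden illustration: `warn_on_missing_classes` and the class names are hypothetical and not part of rubrix or small-text.

```python
import warnings


def warn_on_missing_classes(labels, class_names):
    """Warn if some classes have no labeled example; return the missing class names."""
    seen = set(labels)
    missing = [name for idx, name in enumerate(class_names) if idx not in seen]
    if missing:
        warnings.warn(
            f"The initial batch has no examples for: {', '.join(missing)}. "
            "Consider labeling at least one example per class before "
            "starting the active learning loop.",
            UserWarning,
        )
    return missing


# Hypothetical class names; every record in the batch was labeled as class 0.
class_names = ["class_0", "class_1", "class_2", "class_3", "class_4", "class_5"]
missing = warn_on_missing_classes([0, 0, 0, 0, 0], class_names)
print(missing)
```

Returning the missing names lets the caller decide whether to continue, to ask for more annotations, or to recommend another query strategy.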

@frascuchon
Member Author

Hi @chschroeder

Thanks a lot for your responses. It's not an urgent problem. We'll just wait until the new small-text version is released.

Anyway, it is good practice to include a balanced annotated dataset for the initial batches.

@frascuchon frascuchon removed the type: bug Indicates an unexpected problem or unintended behavior label Sep 27, 2022
frascuchon added a commit that referenced this issue Sep 28, 2022
* docs: fixing the active learning tutorial with `small-text`

* docs: using a tiny model

* docs: Change tutorial title

* docs: Change active learning title in card

(cherry picked from commit f4f2289)

Closes #1693
@frascuchon frascuchon linked a pull request Oct 3, 2022 that will close this issue
@frascuchon frascuchon reopened this Oct 3, 2022
frascuchon pushed a commit that referenced this issue Oct 14, 2022
frascuchon added a commit that referenced this issue Oct 24, 2022
* initial restructuring of the docs #1752

* renamed files

* added new mock-up colors for blog cards #1752

* initial restructure getting started

* edded rough re-structure guides/deepdives

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* initial working version for 1.x demo-testers

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated python client info

* reviews ui references

* fix path to conf.py in rtd config

* move makes to _source

* delete makes from docs

* adds client and labeling module reference

* initial restructuring of the docs #1752

* renamed files

* added new mock-up colors for blog cards #1752

* initial restructure getting started

* edded rough re-structure guides/deepdives

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* initial working version for 1.x demo-testers

* updated python client info

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reviews ui references

* fix path to conf.py in rtd config

* move makes to _source

* delete makes from docs

* adds client and labeling module reference

* fix typo readme

* restructuring, intro texts, and others

* restructure and gitignore _build

* fix tutorial links for readthedocs deployments

* docs: raise small-text version to 1.1.0 and adapt tutorial (#1744)

Refs: #1693
(cherry picked from commit 9afc735)

* Fix some tutorials

* Fix more tutorials

* updated docs
- allow for a redirect tutorials and deepdives
- use old github and slack name

* updated faulty task redirect docs

* updated stargazers button

* updated docs:
- feature redirections
- updated images
- added terminology

* continued replacing thumbnails

* docs: enable notfound url prefix

* adds custom Github Stargazers button

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* updated github star number format

* docs: Trying to generate properly prefixes for not found page

* Include the link to source code for telemetry

* using the (future) main branch for telemetry source code ref

* updated docs
- tutorial images
- thumbnails
- card colors
- references

* updated docs:
- added library

* Adding section to migration guides

* updated docs
- wrapped up library references

* updated docs:
- resolved skweak library error

Co-authored-by: david <david.m.berenstein@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: dvsrepo <daniel@recogn.ai>
Co-authored-by: Christopher Schröder <6340318+chschroeder@users.noreply.github.com>
Co-authored-by: leire <leire@recogn.ai>
2 participants