[Active Learning] Small text tutorial hits errors when running on local #1693

frascuchon · 2022-08-22T14:39:45Z

The small-text tutorial should be reviewed, since some unexpected errors occur when running it on a local machine with some customizations (setting a different initial batch size, no CUDA installed...).

chschroeder · 2022-09-15T10:19:23Z

Just spotted this. Which errors exactly? Is this something in the scope of the tutorial, or can this be improved upstream in small-text?

frascuchon · 2022-09-21T15:13:07Z

Hi @chschroeder, sorry for the late response and thank you for asking.

I've recently launched the tutorial from a clean Python environment and I found several problems:

Disabling the CUDA configuration works fine, but querying new records takes far too much time. I'm not sure if we can improve it somehow.
The dataset feature names have changed: the label-coarse field is now called coarse_label. We can handle this easily.
For the initial batch (batch_id=0), labeling all records with the same label raises the error:
Updating with batch_id 0 ...
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/rubrix/listeners/listener.py", line 219, in __run_action__
    return self.action(*args, *action_args, **kwargs)
  File "/var/folders/8f/mt_m87_d19q3zcnyr6dmf0pw0000gn/T/ipykernel_85718/568527003.py", line 30, in active_learning_loop
    active_learner.initialize_data(indices, y)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 151, in initialize_data
    self._retrain(indices_validation=indices_validation)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 390, in _retrain
    self._clf.fit(dataset)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 330, in fit
    return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 340, in _fit_main
    raise ValueError('Conflicting information about the number of classes: '
ValueError: Conflicting information about the number of classes: expected: 6, encountered: 4
This situation does not occur for subsequent batches (batch_id >= 1).

I really don't know what the behaviour should be or how we could avoid it. Any ideas? @dvsrepo @dcfidalgo @chschroeder
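For the renamed label field mentioned above, one way to stay compatible with both dataset versions is to look the field up by name. This is only a minimal sketch: the helper `resolve_label_field` and its default candidate order are hypothetical, and the column names are the two quoted in this comment.

```python
def resolve_label_field(column_names, candidates=("coarse_label", "label-coarse")):
    """Return the first candidate label field that exists in column_names.

    `candidates` lists the new name first and the old name second, so
    newer dataset versions are preferred when both are present.
    """
    for name in candidates:
        if name in column_names:
            return name
    raise KeyError(
        f"None of the expected label fields {candidates} found in {list(column_names)}"
    )


# Works with either naming scheme.
new_style = resolve_label_field(["text", "coarse_label", "fine_label"])
old_style = resolve_label_field(["text", "label-coarse", "label-fine"])
```

The returned name can then be used wherever the tutorial currently hard-codes the label column.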

chschroeder · 2022-09-21T15:48:06Z

No worries, I missed that issue here as well. Feel free to ping me in such cases as you did now :).

> Disabling the CUDA configuration works fine, but querying new records takes really too much time. Not sure if we can improve it somehow.

Running transformers on the CPU (at least when training is involved) is usually so time-consuming that you don't want to do that.

Suggestion: Fall back to using a very small transformer model such as bert-medium or bert-tiny if CUDA is not available. I would still print a warning which tells users that this is intended to run on a GPU (maybe provide a Colab link?) and is now running in a CPU-only fallback mode, which is slow (and yields worse results if a smaller model is used). (Check once whether the classification results are still acceptable after that.) If it's still too slow after these changes, you can try lowering the number of epochs (by setting num_epochs in the factory's kwargs argument).
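
The fallback suggested above could look roughly like the following. This is a sketch under assumptions, not the tutorial's actual code: the function `choose_training_setup`, the two model identifiers, the `device` key, and the default of two epochs are all illustrative choices; only `num_epochs` is taken from the suggestion itself.

```python
import warnings

# Hypothetical model identifiers, chosen for illustration only.
GPU_MODEL = "bert-base-uncased"
CPU_FALLBACK_MODEL = "prajjwal1/bert-tiny"


def choose_training_setup(cuda_available, cpu_num_epochs=2):
    """Return (model_name, classifier_kwargs) for the detected hardware."""
    if cuda_available:
        return GPU_MODEL, {"device": "cuda"}

    # No GPU: warn loudly, then switch to a tiny model and fewer epochs.
    warnings.warn(
        "CUDA is not available. This tutorial is intended to run on a GPU; "
        "falling back to a CPU-only mode with a much smaller model, which is "
        "slow and will likely yield worse results.",
        UserWarning,
        stacklevel=2,
    )
    return CPU_FALLBACK_MODEL, {"device": "cpu", "num_epochs": cpu_num_epochs}
```

In the tutorial, `cuda_available` would typically come from `torch.cuda.is_available()`, and the returned dictionary would be passed on as the factory's kwargs argument.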
> For the initial batch (batch_id=0), labeling all records with the same label raises the error:
> [...]

First: This problem is caused by a safety check. The intention behind that check was that every class must occur at least once (i.e. the number of encountered classes must match the number of classes of the model). In this case the check was a good thing, since it told you that the initialization could be better. In reality, this might not be achievable every time, especially in multi-label scenarios. Therefore I removed this check in the current dev version of small-text.

For now, you can switch from random initialization to balanced random initialization:
```
from small_text.initialization import random_initialization
[...]
# Randomly draw an initial subset from the data pool
initial_indices = random_initialization(dataset, NUM_SAMPLES)
```
-->
```
from small_text.initialization import random_initialization_balanced
[...]
# Randomly draw a *class-balanced* initial subset from the data pool
initial_indices = random_initialization_balanced(dataset, NUM_SAMPLES)
```
random_initialization_balanced provides an initial set where the label distribution is balanced over the classes (or as close to balanced as possible). This also means that every class occurs at least once if your initial set size is greater than or equal to the number of classes.
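
To illustrate the idea of a class-balanced initial draw, here is a minimal pure-Python sketch. It is not small-text's implementation; the function `balanced_initial_indices` and its round-robin strategy are assumptions made for illustration, and every class that should be covered must be present in `labels`.

```python
import random
from collections import defaultdict


def balanced_initial_indices(labels, n_samples, seed=None):
    """Pick n_samples indices so that classes are represented as evenly as possible."""
    if n_samples > len(labels):
        raise ValueError("n_samples must not exceed the number of labels")

    rng = random.Random(seed)

    # Group the indices by class and shuffle each group.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)

    # Round-robin over the classes: one index per class per pass.
    selected = []
    while len(selected) < n_samples:
        for label in sorted(by_class):
            if by_class[label] and len(selected) < n_samples:
                selected.append(by_class[label].pop())
    return selected


labels = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 5, 5]
initial = balanced_initial_indices(labels, n_samples=6, seed=0)
```

With six classes and six samples, the first round-robin pass takes exactly one index per class, which is the property that avoids the class-count error.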

With small-text 1.1.0 this error will not be raised anymore.

frascuchon · 2022-09-22T07:58:13Z

Thanks a lot @chschroeder

I've tried using the balanced random initialization, but it hits the following error (INITIAL_SAMPLES=20):

RuntimeError                              Traceback (most recent call last)
Cell In [6], line 12
      9 NUM_SAMPLES = 5
     11 # Randomly draw an initial subset from the data pool
---> 12 initial_indices = random_initialization_balanced(dataset, INITIAL_SAMPLES)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/initialization/strategies.py:84, in random_initialization_balanced(y, n_samples)
     82     raise NotImplementedError()
     83 else:
---> 84     return balanced_sampling(y, n_samples=n_samples)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/data/sampling.py:120, in balanced_sampling(y, n_samples)
    117     y = np.array(y)
    119 # num classes according to the labels
--> 120 num_classes = np.max(y) + 1
    121 # num classes encountered
    122 num_classes_present = len(np.unique(y))

File <__array_function__ internals>:180, in amax(*args, **kwargs)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:2793, in amax(a, axis, out, keepdims, initial, where)
   2677 @array_function_dispatch(_amax_dispatcher)
   2678 def amax(a, axis=None, out=None, keepdims=np._NoValue, initial=np._NoValue,
   2679          where=np._NoValue):
   2680     """
   2681     Return the maximum of an array or maximum along an axis.
   2682 
   (...)
   2791     5
   2792     """
-> 2793     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
   2794                           keepdims=keepdims, initial=initial, where=where)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     83         else:
     84             return reduction(axis=axis, out=out, **passkwargs)
---> 86 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

chschroeder · 2022-09-22T08:21:12Z

Oh, my bad :). This initialization method takes the labels as arguments (dataset -> dataset.y). I just replaced the function name and forgot to adapt this part.

initial_indices = random_initialization_balanced(dataset.y, NUM_SAMPLES)

frascuchon · 2022-09-22T09:22:28Z

Great!

I've changed some code from the tutorial (the training set was initialized with LABEL_UNLABELLED values, so it was not possible to use the labels with the random_initialization_balanced function).

Now it's working but, for the initial annotated batch, passing only one label (all records in batch annotated with the same value) hits a similar error:

Updating with batch_id 0 ...
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/rubrix/listeners/listener.py", line 219, in __run_action__
    return self.action(*args, *action_args, **kwargs)
  File "/var/folders/8f/mt_m87_d19q3zcnyr6dmf0pw0000gn/T/ipykernel_11659/568527003.py", line 30, in active_learning_loop
    active_learner.initialize_data(indices, y)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 151, in initialize_data
    self._retrain(indices_validation=indices_validation)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 390, in _retrain
    self._clf.fit(dataset)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 330, in fit
    return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 340, in _fit_main
    raise ValueError('Conflicting information about the number of classes: '
ValueError: Conflicting information about the number of classes: expected: 6, encountered: 1

Any idea how to handle this?

chschroeder · 2022-09-22T11:42:35Z

Hm, in this case the required number of classes is of course impossible to find. Using small-text==1.1.0 would make the symptoms go away. (How urgent is this fix? I could prioritize the release depending on that.)

However, this might also give a false sense of security. Under these circumstances the model has not seen all classes, so the uncertainties may be suboptimal, and thus the queried examples may not be as useful as they otherwise could be.

In my own active learning settings with few labels I always argue that it is reasonable to require 1-2 examples per class from the user before starting the active learning loop. I know that this is not applicable to use cases with thousands of labels. There are so-called cold start approaches which try to handle this setting, but as far as I know there is no single best approach, and each method brings its own advantages and disadvantages.

In the end, this is a decision for your rubrix workflows as well. Possible solutions (non-exhaustive): a) Do nothing (once the error is fixed) and trust the user. b) Show a warning. c) Show a warning and advise better settings dynamically (e.g., request the user to provide examples for all classes or recommend another query strategy). d) Describe the problem and give recommendations in the documentation (Might also be my responsibility to do this in the small-text docs regardless of what you decide on the workflows here).
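
Options b) and c) could be sketched as a small check that runs before the first training step. This is an illustration only: the helper `check_class_coverage` is hypothetical, and it assumes integer class ids in the range 0 to num_classes - 1.

```python
import warnings


def check_class_coverage(labels, num_classes):
    """Warn if the labeled batch misses classes; return the missing class ids."""
    present = set(labels)
    missing = [c for c in range(num_classes) if c not in present]
    if missing:
        warnings.warn(
            f"The initial batch covers {num_classes - len(missing)} of "
            f"{num_classes} classes (missing class ids: {missing}). "
            "Consider labeling at least one example per class before "
            "starting the active learning loop.",
            UserWarning,
            stacklevel=2,
        )
    return missing


# All records annotated with the same label, as in the report above.
missing = check_class_coverage([2, 2, 2, 2], num_classes=6)
```

Returning the missing ids (rather than only warning) leaves the caller free to decide whether to continue, ask for more annotations, or recommend another query strategy.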

frascuchon · 2022-09-22T12:56:40Z

Hi @chschroeder

Thanks a lot for your responses. It's not an urgent problem. We'll just wait until the new small-text version is released.

Anyway, it should be good practice to include a balanced annotated dataset for the initial batches.

* docs: fixing the active learning tutorial with `small-text`
* docs: using a tiny model
* docs: Change tutorial title
* docs: Change active learning title in card

(cherry picked from commit f4f2289)

Closes #1693

Refs: #1693 and #1726


frascuchon added the type: bug and type: documentation labels Aug 22, 2022

frascuchon added this to Backlog in Release via automation Aug 22, 2022

frascuchon moved this from Backlog to Planified in Release Sep 7, 2022

frascuchon removed this from Planified in Release Sep 13, 2022

frascuchon added this to the v0.18.0 milestone Sep 13, 2022

frascuchon mentioned this issue Sep 22, 2022

docs: fixing the active learning tutorial with small-text #1726

Merged

frascuchon closed this as completed in #1726 Sep 27, 2022

frascuchon removed the type: bug label Sep 27, 2022

chschroeder mentioned this issue Oct 2, 2022

docs: raise small-text version to 1.1.0 and adapt tutorial #1744

Merged

frascuchon linked a pull request Oct 3, 2022 that will close this issue

docs: raise small-text version to 1.1.0 and adapt tutorial #1744

Merged

frascuchon reopened this Oct 3, 2022

frascuchon closed this as completed in #1744 Oct 5, 2022

frascuchon pushed a commit that referenced this issue Oct 5, 2022

docs: raise small-text version to 1.1.0 and adapt tutorial (#1744)

9afc735

Refs: #1693 and #1726

frascuchon pushed a commit that referenced this issue Oct 5, 2022

docs: raise small-text version to 1.1.0 and adapt tutorial (#1744)

82ff59e

Refs: #1693 and #1726 (cherry picked from commit 9afc735)

frascuchon pushed a commit that referenced this issue Oct 5, 2022

docs: raise small-text version to 1.1.0 and adapt tutorial (#1744)

81a899b

Refs: #1693 and #1726 (cherry picked from commit 9afc735)

frascuchon pushed a commit that referenced this issue Oct 5, 2022

docs: raise small-text version to 1.1.0 and adapt tutorial (#1744)

16f19b7

Refs: #1693 (cherry picked from commit 9afc735)

frascuchon pushed a commit that referenced this issue Oct 14, 2022

docs: raise small-text version to 1.1.0 and adapt tutorial (#1744)

c00fce3

Refs: #1693 (cherry picked from commit 9afc735)
