[Active Learning] Small text tutorial hits errors when running on local #1693
Comments
Just spotted this. Which errors exactly? Is this something in the scope of tutorial or can this be improved upstream in small-text? |
Hi @chschroeder, sorry for the late response, and thank you for asking. I recently ran the tutorial from a clean Python environment and found several problems:
I really don't know what the expected behaviour should be or how we could avoid it. Any ideas? @dvsrepo @dcfidalgo @chschroeder |
No worries, I missed that issue here as well. Feel free to ping me in such cases as you did now :).
|
Thanks a lot @chschroeder. I've tried using the balanced random initialization, but it is hitting the following error ( RuntimeError Traceback (most recent call last)
Cell In [6], line 12
9 NUM_SAMPLES = 5
11 # Randomly draw an initial subset from the data pool
---> 12 initial_indices = random_initialization_balanced(dataset, INITIAL_SAMPLES)
File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/initialization/strategies.py:84, in random_initialization_balanced(y, n_samples)
82 raise NotImplementedError()
83 else:
---> 84 return balanced_sampling(y, n_samples=n_samples)
File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/data/sampling.py:120, in balanced_sampling(y, n_samples)
117 y = np.array(y)
119 # num classes according to the labels
--> 120 num_classes = np.max(y) + 1
121 # num classes encountered
122 num_classes_present = len(np.unique(y))
File <__array_function__ internals>:180, in amax(*args, **kwargs)
File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:2793, in amax(a, axis, out, keepdims, initial, where)
2677 @array_function_dispatch(_amax_dispatcher)
2678 def amax(a, axis=None, out=None, keepdims=np._NoValue, initial=np._NoValue,
2679 where=np._NoValue):
2680 """
2681 Return the maximum of an array or maximum along an axis.
2682
(...)
2791 5
2792 """
-> 2793 return _wrapreduction(a, np.maximum, 'max', axis, None, out,
2794 keepdims=keepdims, initial=initial, where=where)
File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
83 else:
84 return reduction(axis=axis, out=out, **passkwargs)
---> 86 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
RuntimeError: Boolean value of Tensor with more than one value is ambiguous |
Oh, my bad :). This initialization method takes the labels as arguments (
|
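For readers following along: the traceback above shows `random_initialization_balanced(y, n_samples)` forwarding its first argument to `np.max`, so it expects a plain array of integer labels rather than the whole dataset object. The sketch below only illustrates the idea of balanced sampling over a label list using the standard library. `balanced_initial_indices` is a hypothetical helper written for this note, not small-text's implementation, and the label values are made up.

```python
import random
from collections import defaultdict


def balanced_initial_indices(labels, n_samples, seed=None):
    """Hypothetical helper: pick n_samples indices spread evenly over classes.

    `labels` is a plain sequence of integer class ids (one per record),
    NOT a dataset object holding tensors.
    """
    rng = random.Random(seed)

    # Group record indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class pool so the draw within a class is random.
    pools = []
    for label in sorted(by_class):
        pool = list(by_class[label])
        rng.shuffle(pool)
        pools.append(pool)

    # Round-robin over the classes until enough indices are drawn
    # or every pool is exhausted.
    chosen = []
    while len(chosen) < n_samples and any(pools):
        for pool in pools:
            if pool and len(chosen) < n_samples:
                chosen.append(pool.pop())
    return chosen


# Made-up labels for three classes.
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
indices = balanced_initial_indices(labels, n_samples=6, seed=42)
```

With three classes and six samples, the round-robin draw yields two indices per class; the same idea applies when the labels come from the tutorial's training set.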
Great! I've changed some code from the tutorial (the training set was initialized with Now it's working, but for the initial annotated batch, passing only one label (all records in the batch annotated with the same value) hits a similar error: Updating with batch_id 0 ...
Traceback (most recent call last):
File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/rubrix/listeners/listener.py", line 219, in __run_action__
return self.action(*args, *action_args, **kwargs)
File "/var/folders/8f/mt_m87_d19q3zcnyr6dmf0pw0000gn/T/ipykernel_11659/568527003.py", line 30, in active_learning_loop
active_learner.initialize_data(indices, y)
File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 151, in initialize_data
self._retrain(indices_validation=indices_validation)
File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 390, in _retrain
self._clf.fit(dataset)
File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 330, in fit
return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 340, in _fit_main
raise ValueError('Conflicting information about the number of classes: '
ValueError: Conflicting information about the number of classes: expected: 6, encountered: 1 Any idea how to handle this? |
Hm, in this case the required number of classes is of course impossible to find. Using However, this might also give a false sense of security. Under these circumstances the model has not seen all classes, so the uncertainties may be suboptimal, and the queried examples may not be as useful as they otherwise could be.
In my own active learning settings with few labels I always argue that it is reasonable to require 1-2 examples per class from the user before starting the active learning loop. I know that this is not applicable to use cases with thousands of labels. There are so-called cold start approaches which try to handle this setting, but as far as I know there is no single best approach, and each method brings its own advantages and disadvantages. In the end, this is a decision for your rubrix workflows.
Possible solutions (non-exhaustive):
a) Do nothing (once the error is fixed) and trust the user.
b) Show a warning.
c) Show a warning and advise better settings dynamically (e.g., request the user to provide examples for all classes or recommend another query strategy).
d) Describe the problem and give recommendations in the documentation (it might also be my responsibility to do this in the small-text docs regardless of what you decide on the workflows here). |
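As a rough illustration of options b) and c), a check like the following could run on the first annotated batch before the labels are handed to the active learner. This is only a sketch under assumptions: `missing_classes` and `check_initial_batch` are hypothetical helpers, not part of rubrix or small-text, and the labels are assumed to be integer class ids in `range(num_classes)`.

```python
import warnings


def missing_classes(labels, num_classes):
    """Return the class ids in range(num_classes) that never occur in labels."""
    return sorted(set(range(num_classes)) - set(labels))


def check_initial_batch(labels, num_classes):
    """Warn if the initial batch does not cover every class.

    Returns True when all classes are present, False otherwise, so the
    caller can decide whether to start the loop or ask for more annotations.
    """
    missing = missing_classes(labels, num_classes)
    if missing:
        warnings.warn(
            f"Initial batch covers {num_classes - len(missing)} of "
            f"{num_classes} classes; no examples for classes {missing}. "
            "Consider annotating at least one example per class first.",
            UserWarning,
        )
    return not missing


# The situation from the traceback above: six expected classes,
# but every record in the first batch received the same label.
batch_is_complete = check_initial_batch([3, 3, 3, 3, 3], num_classes=6)
```

Whether an incomplete batch should block the loop or merely warn is the workflow decision discussed above; the return value leaves that choice to the caller.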
Hi @chschroeder, thanks a lot for your responses. It's not an urgent problem. We'll just wait until the new small-text version is released. In any case, it would be good practice to include a balanced annotated dataset for the initial batches. |
* initial restructuring of the docs #1752 * renamed files * added new mock-up colors for blog cards #1752 * initial restructure getting started * edded rough re-structure guides/deepdives * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * initial working version for 1.x demo-testers * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated python client info * reviews ui references * fix path to conf.py in rtd config * move makes to _source * delete makes from docs * adds client and labeling module reference * initial restructuring of the docs #1752 * renamed files * added new mock-up colors for blog cards #1752 * initial restructure getting started * edded rough re-structure guides/deepdives * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * initial working version for 1.x demo-testers * updated python client info * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * reviews ui references * fix path to conf.py in rtd config * move makes to _source * delete makes from docs * adds client and labeling module reference * fix typo readme * restructuring, intro texts, and others * restructure and gitignore _build * fix tutorial links for readthedocs deployments * docs: raise small-text version to 1.1.0 and adapt tutorial (#1744) Refs: #1693 (cherry picked from commit 9afc735) * Fix some tutorials * Fix more tutorials * updated docs - allow for a redirect tutorials and deepdives - use old github and slack name * updated faulty task redirect docs * updated stargazers button * updated docs: - feature redirections - updated images - added terminology * continued replacing thumbnails * docs: enable notfound url prefix * adds custom Github Stargazers button * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * 
updated github star number format * docs: Trying to generate properly prefixes for not found page * Include the link to source code for telemetry * using the (future) main branch for telemetry source code ref * updated docs - tutorial images - thumbnails - card colors - references * updated docs: - added library * Adding section to migration guides * updated docs - wrapped up library references * updated docs: - resolved skweak library error Co-authored-by: david <david.m.berenstein@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: dvsrepo <daniel@recogn.ai> Co-authored-by: Christopher Schröder <6340318+chschroeder@users.noreply.github.com> Co-authored-by: leire <leire@recogn.ai>
The small-text tutorial should be reviewed, since some odd errors are hit when running it on a local machine with some customizations (setting a different initial batch size, no CUDA installed, ...).