Using active learning on an already trained model #30
Comments
Hi @etiennekintzler! Two very valid questions that need to be included in the documentation.

**1. Bypassing**

Unfortunately, this is awkward with the current API (but will be changed with version 2.0.0). A solution is shown in #10. Let me know if this does not work for you. I will also add this to the docs eventually.

**2. Creating an unlabeled dataset**

For multi-label datasets: if you create your dataset using `TransformersDataset.from_arrays()`, then you just pass an empty list of labels (i.e., a `csr_matrix` which does not have any entries).

```python
from small_text import TransformersDataset, list_to_csr

texts = ['this is my document', 'yet another document']
num_classes = ...    # omitted
tokenizer = ...      # omitted
target_labels = ...  # omitted

y = list_to_csr([[], []], shape=(2, num_classes))
dataset = TransformersDataset.from_arrays(texts, y, tokenizer, target_labels=target_labels)
```

For single-label datasets:

```python
from small_text import LABEL_UNLABELED
```
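To make the "empty `csr_matrix`" idea above concrete, here is a small, library-independent sketch using only scipy. An unlabeled multi-label target is just a sparse matrix with the right shape and zero stored entries; the dimensions below are hypothetical, and the small-text calls themselves are only those quoted in the comment above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical setup: 2 unlabeled documents, 4 possible classes.
num_docs, num_classes = 2, 4

# An "unlabeled" multi-label target matrix: correct shape, zero stored entries.
# This mirrors what list_to_csr([[], []], shape=(2, num_classes)) builds.
y_unlabeled = csr_matrix((num_docs, num_classes), dtype=np.int64)

print(y_unlabeled.shape)  # (2, 4)
print(y_unlabeled.nnz)    # 0 -> no label assignments stored
```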
Thank you for your fast and detailed answer @chschroeder! This is not really the answer you'd expect, but given that uncertainty-based query strategies like breaking ties and least confidence are both simple to implement and work well enough empirically (cf. your paper *Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers*), I ended up writing my own:

```python
import numpy as np

def get_bt_from_probas(probas_mat: np.ndarray, num_samples: int = 5):
    # indices of the two most probable classes per row (ascending sort)
    argsort_mat = np.argsort(probas_mat, axis=1)
    k2k1_mat = argsort_mat[:, -2:]
    # breaking-ties score: margin between the top two probabilities
    scores = np.array([p[k1] - p[k2] for (p, (k2, k1)) in zip(probas_mat, k2k1_mat)])
    # smallest margins = most uncertain samples
    indices = np.argsort(scores)[:num_samples]
    return indices
```

I could have used the implementation of query strategies in https://github.com/webis-de/small-text/blob/v1.3.0/small_text/query_strategies/strategies.py, but it seems tightly coupled to the dataset and classifier, while I just needed something to be applied on the model probabilities (I get that for other methods like expected gradient length you'd need the model as well).

Feel free to close the ticket if you want! I'll be watching the project and would be happy to try it again when 2.0 is out.
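As an aside, the per-row loop in the snippet above can be fully vectorized with `np.partition`. A minimal sketch in pure NumPy, independent of small-text; the function name and the example probabilities are illustrative, not from the library:

```python
import numpy as np

def breaking_ties_indices(probas: np.ndarray, num_samples: int = 5) -> np.ndarray:
    """Rows with the smallest margin between the two largest probabilities."""
    # np.partition places the two largest values of each row in the last two columns
    top2 = np.partition(probas, -2, axis=1)[:, -2:]
    margins = top2[:, -1] - top2[:, -2]  # best minus second best, always >= 0
    return np.argsort(margins)[:num_samples]

probas = np.array([
    [0.50, 0.40, 0.10],  # margin 0.10
    [0.80, 0.10, 0.10],  # margin 0.70
    [0.40, 0.35, 0.25],  # margin 0.05 -> most uncertain
])
print(breaking_ties_indices(probas, num_samples=2))  # [2 0]
```

This avoids both the Python-level loop and a full sort of every row, since `np.partition` only guarantees the two largest entries end up in place.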
Thanks for the feedback! Yes, you can of course extract individual parts, but then you lose the benefits of the interface. Nevertheless, I agree that functions like this should be separated from the classes in small-text (as was done with the coreset strategies, for example); then an import would have sufficed. Adding this to the list of tasks.
Hello :)

I am trying to use the library on a transformer model that is already trained. For that reason I don't need to use the `initialize_data` method, since the model is already trained; however, it seems to be necessary before using the `query` method (otherwise it throws an error).

To be more specific, let's say I have an object `model` (a multi-label model from Hugging Face) trained on data `text_train` and `labels_train`. Then I have `text_test` data for which no labels are available. I would like to use active learning to select the best samples (based on a given query strategy) in `text_test` to be labelled by my users. How could I use the library to do so?

Thank you in advance for your help!