Using active learning on already trained model #30

Closed
etiennekintzler opened this issue Mar 9, 2023 · 3 comments

Comments

@etiennekintzler

Hello :)

I am trying to use the library with a transformer model that is already trained. Since the model is already trained, I should not need the initialize_data method; however, it seems to be required before the query method can be used (otherwise it throws an error).

To be more specific, let's say I have an object model (a multi-label model from Hugging Face) trained on data text_train and labels_train. I also have text_test data for which no labels are available. I would like to use active learning to select the best samples in text_test (based on a given query strategy) to be labelled by my users. How could I use the library to do so?

Thank you in advance for your help!

@chschroeder
Contributor

Hi @etiennekintzler! These are two very valid questions that should be covered in the documentation.

1. Bypassing initialize_data

Unfortunately, this is awkward with the current API (but will be changed with version 2.0.0).

A solution is shown in #10. Let me know if this does not work for you. I will also add this to the docs eventually.

2. Creating an unlabeled dataset

For multi-label datasets:

If you create your dataset using TransformersDataset.from_arrays(), you just pass an empty set of labels (i.e., a csr_matrix without any entries).

from small_text import TransformersDataset, list_to_csr

texts = ['this is my document', 'yet another document']

num_classes = ... # omitted
tokenizer = ... # omitted
target_labels = ... # omitted

y = list_to_csr([[], []], shape=(2, num_classes))

dataset = TransformersDataset.from_arrays(texts, y, tokenizer, target_labels=target_labels)

For single-label datasets:
A label of -1 means "unlabeled" (accessible through the constant LABEL_UNLABELED):

from small_text import LABEL_UNLABELED
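
A minimal sketch of an unlabeled single-label dataset, mirroring the multi-label example above (tokenizer and target_labels are placeholders again; this assumes from_arrays also accepts a plain integer label array for single-label data):

import numpy as np
from small_text import LABEL_UNLABELED, TransformersDataset

texts = ['this is my document', 'yet another document']

tokenizer = ... # omitted
target_labels = ... # omitted

# every document is marked as unlabeled via the -1 sentinel
y = np.array([LABEL_UNLABELED, LABEL_UNLABELED])

dataset = TransformersDataset.from_arrays(texts, y, tokenizer, target_labels=target_labels)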

@etiennekintzler
Author

etiennekintzler commented Mar 11, 2023

Thank you for your fast and detailed answer @chschroeder!

This is not really the answer you'd expect, but given that uncertainty-based query strategies like breaking ties and least confidence are simple to implement and work well enough empirically (cf. your paper Revisiting Uncertainty-based Query Strategies for Active Learning with Transformers), I decided to just write a simple function for the breaking-ties query strategy that can be applied directly to the model probabilities:

import numpy as np

def get_bt_from_probas(probas_mat: np.ndarray, num_samples: int = 5):
    # per row, the indices of the two highest-probability classes (ascending argsort)
    argsort_mat = np.argsort(probas_mat, axis=1)
    k2k1_mat = argsort_mat[:, -2:]
    # breaking-ties score: margin between the best and second-best class probability
    scores = np.array([p[k1] - p[k2] for (p, (k2, k1)) in zip(probas_mat, k2k1_mat)])
    # select the samples with the smallest margins
    indices = np.argsort(scores)[:num_samples]
    return indices
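
For illustration, applying it to a small made-up probability matrix (the numbers are arbitrary and only meant to show the margin-based ordering):

probas = np.array([[0.10, 0.50, 0.40],
                   [0.05, 0.90, 0.05],
                   [0.34, 0.33, 0.33]])

get_bt_from_probas(probas, num_samples=2)
# -> array([2, 0]): the two rows with the smallest gap between the top two probabilities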

I could have used the query strategy implementations in https://github.com/webis-de/small-text/blob/v1.3.0/small_text/query_strategies/strategies.py, but they seem tightly coupled to the dataset and classifier, while I just needed something that can be applied to the model probabilities (I get that for other methods, like expected gradient length, you would need the model as well).

Feel free to close the ticket if you want! I'll be watching the project and would be happy to try it again when 2.0 is out.

@chschroeder
Contributor

Thanks for the feedback! Yes, you can of course extract individual parts, but then you lose the benefits of the interface. Nevertheless, regarding small-text: functions like this should be separated from the classes (as was done with the core-set strategies, for example); then an import would have sufficed. I'm adding this to the list of tasks.
