
Setting up a PoolBasedActiveLearner without initialization #10

Open · HannahKirk opened this issue Feb 2, 2022 · 11 comments
Labels: documentation (Improvements or additions to documentation)

@HannahKirk commented Feb 2, 2022

Hi,
I am training a transformers model in a separate script over a pre-defined training set. I want to then use this classifier to query examples from the unlabelled pool. I can load the trained model from pre-trained pytorch model files or from PoolBasedActiveLearner.load('test-model/active_leaner.pkl').

However, I then don't want to initialise this model as it has already been trained on a portion of the labelled data. Is it possible to still query over data i.e. learner.query() without running the initialization step learner.initialize_data(x_indices_train, y_train, x_indices_validation=val_indices)?

Alternatively, is it possible to still run this initialisation step but without running any training, i.e. just ignoring all indices for initialisation, or setting the number of initialisation examples to zero with x_indices_initial = random_initialization(y_train, n_samples=0)?

Really appreciate your help on this one!

Thanks :)

@chschroeder (Contributor)

Hi,

PoolBasedActiveLearner.save()/load() just writes/reads the object as it is. So after loading a previously saved active learner, you can continue with your previously initialized model as if nothing happened in between.

The other option you mentioned is also possible, for example if your training data changes. In this case you can call learner.initialize_data(..., retrain=False), where retrain=False omits the training step. In your case, however, it sounds as if you should not need this additional step.
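
A minimal sketch of both options (reusing the file path and the x_indices_train/y_train/val_indices variables from the original post; untested, and argument names may differ between small-text versions):

```python
from small_text.active_learner import PoolBasedActiveLearner

# Option 1: resume a previously saved active learner; no re-initialization needed.
learner = PoolBasedActiveLearner.load('test-model/active_leaner.pkl')
queried_indices = learner.query(num_samples=10)

# Option 2: re-initialize with (possibly changed) labeled data, but skip the training step.
learner.initialize_data(x_indices_train, y_train,
                        x_indices_validation=val_indices,
                        retrain=False)
```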

@HannahKirk (Author) commented Feb 3, 2022

Thank you! retrain=False was what I was looking for, as I'm also changing the data that the active learner has access to, so I will have to initialise a new learner object.

However, I am still encountering issues. I am loading a pre-trained model from file, then initialising an active learner (with no new initialisation data, as this model has already been trained on some training data). Then, as you suggested, I'm setting retrain=False:

```python
from small_text.active_learner import PoolBasedActiveLearner
from small_text.initialization import random_initialization
from small_text.query_strategies import RandomSampling
from small_text.integrations.transformers import (TransformerModelArguments,
                                                  TransformerBasedClassificationFactory)

transformer_model = TransformerModelArguments('path/to/pretrained-model', tokenizer='path/to/pretrained/tokenizer')

clf_factory = TransformerBasedClassificationFactory(transformer_model,
                                                    num_classes,
                                                    kwargs=dict({'device': 'cuda',
                                                                 'mini_batch_size': 32,
                                                                 'early_stopping_no_improvement': -1
                                                                }))

def init_pretrained_learner(clf_factory, query_strat, pool):
    active_learner_pool = PoolBasedActiveLearner(clf_factory, query_strat, pool)
    # select no examples for initialisation, as the learner is already trained
    x_indices_initial = random_initialization(pool, n_samples=0)
    y_initial = pool.y[x_indices_initial]

    active_learner_pool.initialize_data(x_indices_initial, y_initial, retrain=False)

    return active_learner_pool

active_learner_pool = init_pretrained_learner(clf_factory, RandomSampling(), pool)
```

However, when I then try to access the classifier, e.g. by running:

```python
embeddings, proba = active_learner_pool.classifier.embed(pool, return_proba=True)
```

I get the following error: AttributeError: 'NoneType' object has no attribute 'embed', so it seems the classifier hasn't been initialised.

Could you suggest how to proceed?

@HannahKirk (Author)

Hi @chschroeder, do you have any update on this error?

Thanks! :)

@chschroeder (Contributor)

Hi @HannahKirk, sorry, I completely missed your last edit. But now I think I have understood what your intention is:

  1. You have a pretrained huggingface model, not a serialized previously trained active learner.
  2. You want to use this model in combination with the active learner but without any further training.

This is probably not possible right now without some hassle, but it might be a valid use case to support in the future.

For now, you can try the following:

```python
transformer_model = TransformerModelArguments('path/to/pretrained-model', tokenizer='path/to/pretrained/tokenizer')

clf_factory = TransformerBasedClassificationFactory(transformer_model,
                                                    num_classes,
                                                    kwargs=dict({'device': 'cuda',
                                                                 'mini_batch_size': 32,
                                                                 'early_stopping_no_improvement': -1
                                                                }))

active_learner_pool = PoolBasedActiveLearner(clf_factory, RandomSampling(), pool)
# initialize the classifier
active_learner_pool._clf = clf_factory.new()
# initialize the underlying model
active_learner_pool._clf.initialize_transformer(active_learner_pool._clf.cache_dir)
active_learner_pool._clf.num_classes = 123  # TODO: set number of classes
```

This manually initializes some objects which would otherwise be set up in active_learner._retrain() and classifier.fit(). This might work already but I would not rule out that I have missed something on the classifier side.

@HannahKirk (Author)

Thanks @chschroeder. I think even with the suggested changes, there are still some problems. When I try to run embeddings, proba = active_learner_pool.classifier.embed(pool, return_proba=True), I now get the error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)

Perhaps the pool has not been moved to the device?

Thanks :)

@chschroeder (Contributor) commented Feb 10, 2022

Yes, this error makes sense: we need to prepare the classifier as if fit() had been called. The error is caused by the model still being on the CPU.

Append the following line to the code above:

```python
active_learner_pool._clf.model = active_learner_pool._clf.model.to(active_learner_pool._clf.device)
```

Edit: Fixed device reference.

@HannahKirk (Author)

Thanks @chschroeder, one remaining point. What does the self refer to here?

@chschroeder (Contributor)

That was a copy/paste remnant, sorry :).

Fixed it above. What I am doing here is just copying the code from the classifier.

@HannahKirk (Author) commented Feb 14, 2022

Hi @chschroeder, I have followed these instructions in the code:


```python
def initialise_learner(pt_model_path, n_classes, pool):
    tf_model = TransformerModelArguments(pt_model_path)
    n_epochs = 0
    clf_factory = TransformerBasedClassificationFactory(tf_model,
                                                        n_classes,
                                                        n_epochs,
                                                        kwargs=dict({'device': 'cuda', 'early_stopping_no_improvement': -1}))
    active_learner_pool = PoolBasedActiveLearner(clf_factory, RandomSampling(), pool)
    # initialise classifier
    active_learner_pool._clf = clf_factory.new()
    # initialize the underlying model
    active_learner_pool._clf.initialize_transformer(active_learner_pool._clf.cache_dir)
    active_learner_pool._clf.num_classes = n_classes
    active_learner_pool._clf.model.to(active_learner_pool._clf.device)
    return active_learner_pool
```

However, when I now try to use this model for querying with selected_indices = trained_learner.query(num_samples=10), I still get an error: LearnerNotInitializedException.

How should I proceed? I thought the pre-trained model would now have been initialised with the learner, so that the learner could be used to query the pool. I can initialise the data in a hacky way by selecting 0 samples with:

```python
# initialise data
x_indices_initial = random_initialization(pool, n_samples=0)
y_initial = pool.y[x_indices_initial]
active_learner_pool.initialize_data(x_indices_initial, y_initial, retrain=False)
```

But there may be a better solution.

Thanks :)

P.S. The full source of the exception is:

```
/usr/local/lib/python3.7/dist-packages/small_text/active_learner.py in query(self, num_samples, x, query_strategy_kwargs)
    167         """
    168         if self._label_to_position is None:
--> 169             raise LearnerNotInitializedException()
    170
    171         size = list_length(self.x_train)
```

@chschroeder (Contributor) commented Feb 14, 2022

You were almost there :). Sorry, I had no time to try this myself last week, and this distant trial and error takes a bit longer. But now I have tried it myself.

I used your function and changed this:

```python
# initialise classifier
active_learner_pool._clf = clf_factory.new()
```

to this:

```python
# initialise classifier
active_learner_pool._clf = clf_factory.new()
active_learner_pool._x_index_to_position = dict()
```

This got me far enough that I could use active_learner.query(...) and active_learner.classifier.predict(...).

On second thought, I am now questioning what we even get from the PoolBasedActiveLearner at this point. Just using the classifier and the query strategy directly would likely result in better code.
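
A rough sketch of that direct approach (untested; the unlabeled/labeled index handling and the positional query() arguments are assumptions and may differ between small-text versions):

```python
import numpy as np

from small_text.query_strategies import RandomSampling

# Build and initialize the classifier directly, as in the snippets above.
clf = clf_factory.new()
clf.initialize_transformer(clf.cache_dir)
clf.num_classes = n_classes
clf.model = clf.model.to(clf.device)

# Query the pool without a PoolBasedActiveLearner.
query_strategy = RandomSampling()
indices_unlabeled = np.arange(len(pool))   # nothing has been labeled in this session
indices_labeled = np.array([], dtype=int)
y_labeled = np.array([], dtype=int)

queried = query_strategy.query(clf, pool, indices_unlabeled, indices_labeled, y_labeled, n=10)
```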

chschroeder added the "documentation" label (Improvements or additions to documentation) Jun 15, 2022
chschroeder added this to the small-text-1.2.0 milestone Dec 26, 2022
chschroeder removed this from the small-text-1.2.0 milestone Feb 11, 2023
@chschroeder (Contributor)

Hi, just to give an update: this issue has not been forgotten. I would call your use case "pre-initialized" or "externally initialized". Similar problems regarding the API exist for cold start active learning, for which we now have a notebook.

Both of these use cases are difficult to realize without breaking the current API. The solution I would prefer will likely generalize the initialization mechanism, but this will have to wait until the next major version, 2.0.0.

chschroeder added this to the small-text-2.0.0 milestone Feb 26, 2023
chschroeder added a commit that referenced this issue Feb 11, 2024