initialize_active_learner error #6

Closed
neel17 opened this issue Nov 23, 2021 · 8 comments


neel17 commented Nov 23, 2021

I am trying to initialize an active learner for text classification using a transformer. The classification model needs to be trained on 11014 classes, and my dataset is highly imbalanced. In initialize_active_learner(active_learner, y_train) I have used:

def initialize_active_learner(active_learner, y_train):

    x_indices_initial = random_initialization(y_train)
    #random_initialization_stratified(y_train, n_samples=11015)
    #random_initialization_balanced(y_train)
    
    y_initial = np.array([y_train[i] for i in x_indices_initial])

    active_learner.initialize_data(x_indices_initial, y_initial)

    return x_indices_initial

But I always get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-d0348c5b7547> in <module>
      1 # Active learner
      2 active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, x_train)
----> 3 labeled_indices = initialize_active_learner(active_learner, y_train)
      4 #

<ipython-input-22-ed58e0714c48> in initialize_active_learner(active_learner, y_train)
     17     y_initial = np.array([y_train[i] for i in x_indices_initial])
     18 
---> 19     active_learner.initialize_data(x_indices_initial, y_initial)
     20 
     21     return x_indices_initial

~/.local/lib/python3.7/site-packages/small_text/active_learner.py in initialize_data(self, x_indices_initial, y_initial, x_indices_ignored, x_indices_validation, retrain)
    139 
    140         if retrain:
--> 141             self._retrain(x_indices_validation=x_indices_validation)
    142 
    143     def query(self, num_samples=10, x=None, query_strategy_kwargs=None):

~/.local/lib/python3.7/site-packages/small_text/active_learner.py in _retrain(self, x_indices_validation)
    380 
    381         if x_indices_validation is None:
--> 382             self._clf.fit(x)
    383         else:
    384             indices = np.arange(self.x_indices_labeled.shape[0])

~/.local/lib/python3.7/site-packages/small_text/integrations/transformers/classifiers/classification.py in fit(self, train_set, validation_set, optimizer, scheduler)
    332         self.class_weights_ = self.initialize_class_weights(sub_train)
    333 
--> 334         return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
    335 
    336     def initialize_class_weights(self, sub_train):

~/.local/lib/python3.7/site-packages/small_text/integrations/transformers/classifiers/classification.py in _fit_main(self, sub_train, sub_valid, optimizer, scheduler)
    351                 raise ValueError('Conflicting information about the number of classes: '
    352                                  'expected: {}, encountered: {}'.format(self.num_classes,
--> 353                                                                         np.max(y) + 1))
    354 
    355             self.initialize_transformer(self.cache_dir)

ValueError: Conflicting information about the number of classes: expected: 11014, encountered: 8530

Please help here.

Thanks in advance


neel17 commented Nov 23, 2021

I have also tried stratified initialization:

def initialize_active_learner(active_learner, y_train):

    x_indices_initial = random_initialization_stratified(y_train, n_samples=11015)
    #random_initialization_balanced(y_train)
    y_initial = np.array([y_train[i] for i in x_indices_initial])

    active_learner.initialize_data(x_indices_initial, y_initial)

    return x_indices_initial

The error is the same, but this time the active learner was initialized and perform_active_learning started:

Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:small_text.integrations.transformers.classifiers.classification:Overridering scheduler since optimizer in kwargs needs to be passed in combination with scheduler

Early stopping after 6 epochs
Iteration #0 (12015 samples)
Train accuracy: 0.00
Test accuracy: 0.00
---
Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
WARNING:small_text.integrations.transformers.classifiers.classification:Overridering scheduler since optimizer in kwargs needs to be passed in combination with scheduler

Early stopping after 6 epochs
Iteration #1 (13015 samples)
Train accuracy: 0.00
Test accuracy: 0.00
---
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-23-e3d172fd9815> in <module>()
      1 #!g1.1
      2 try:
----> 3     perform_active_learning(active_learner, x_train, labeled_indices, x_test)
      4 
      5 except PoolExhaustedException:

4 frames
/usr/local/lib/python3.7/dist-packages/small_text/integrations/transformers/classifiers/classification.py in _fit_main(self, sub_train, sub_valid, optimizer, scheduler)
    351                 raise ValueError('Conflicting information about the number of classes: '
    352                                  'expected: {}, encountered: {}'.format(self.num_classes,
--> 353                                                                         np.max(y) + 1))
    354 
    355             self.initialize_transformer(self.cache_dir)

ValueError: Conflicting information about the number of classes: expected: 11014, encountered: 11013

@chschroeder

Hi,

regarding your first post:

  1. The error here (ValueError: Conflicting information about the number of classes: expected: 11014, encountered: 8530) indicates that y_initial does not contain all labels. The number of labels must match what you have set as num_classes in your classifier. Therefore, you cannot rely on random initialization here, especially if your dataset is highly imbalanced (which you probably figured out yourself).

(For the future: With such a high number of labels, it might not be practical to require so many initialization samples. I will think about what can be done about this.)

Regarding your second post:

  1. Here, the initialization worked and the error is raised later. This should never happen. Did you modify your dataset between the initialization and the active learning loop? Can you show the full code for this? I will have a look at the stratified initialization as soon as I find some time. This is the only point of failure I can imagine.

You could try the class-balanced initialization in the meantime.
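
For example, something along these lines should work (untested sketch; the import path can differ between small-text versions, and it assumes y_train is a numpy array containing all num_classes labels):

import numpy as np

# import path may differ between small-text versions
from small_text.initialization import random_initialization_balanced

def initialize_active_learner(active_learner, y_train, num_classes):
    # draw a class-balanced initial sample instead of a purely random one
    x_indices_initial = random_initialization_balanced(y_train, n_samples=2 * num_classes)
    y_initial = y_train[x_indices_initial]

    # sanity check: every one of the num_classes labels must be present,
    # otherwise the classifier's consistency check raises the ValueError above
    assert np.unique(y_initial).shape[0] == num_classes

    active_learner.initialize_data(x_indices_initial, y_initial)
    return x_indices_initial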


neel17 commented Nov 23, 2021

@chschroeder Thanks for the reply.

I have tried all the methods of initialization:

  • random_initialization(y_train)
  • random_initialization_stratified(y_train, n_samples=11015); I also tried the exact number of classes, i.e. 11014
  • random_initialization_balanced(y_train)

All of them fail.


neel17 commented Nov 24, 2021

Here is my code

# Dataset


def get_transformers_dataset(tokenizer, data, labels, max_length=60):

    data_out = []

    for i, doc in enumerate(data):
        encoded_dict = tokenizer.encode_plus(
            doc,
            add_special_tokens=True,
            padding='max_length',
            max_length=max_length,
            return_attention_mask=True,
            return_tensors='pt',
            truncation='longest_first'
        )

        data_out.append((encoded_dict['input_ids'], encoded_dict['attention_mask'], labels[i]))

    return TransformersDataset(data_out)

#train = get_transformers_dataset(tokenizer, df_train['text'], df_train['label'])
#test = get_transformers_dataset(tokenizer, df_test['text'], df_test['label'])

x_train = get_transformers_dataset(tokenizer, df_train['text'], df_train['label'])
y_train = df_train['label'].values
x_test = get_transformers_dataset(tokenizer, df_test['text'], df_test['label'])

def initialize_active_learner(active_learner, y_train):

    x_indices_initial = random_initialization_balanced(y_train, n_samples=10)
    #random_initialization(y_train)
    #random_initialization_stratified(y_train, n_samples=11015)
    #random_initialization_balanced(y_train)
    
    y_initial = np.array([y_train[i] for i in x_indices_initial])

    active_learner.initialize_data(x_indices_initial, y_initial)

    return x_indices_initial


def perform_active_learning(active_learner, train, labeled_indices, test):
    # Perform 20 iterations of active learning...
    for i in range(20):
        # ...where each iteration consists of labelling 1000 samples
        q_indices = active_learner.query(num_samples=1000)

        # Simulate user interaction here. Replace this for real-world usage.
        y = train.y[q_indices]

        # Return the labels for the current query to the active learner.
        active_learner.update(y)

        labeled_indices = np.concatenate([q_indices, labeled_indices])

        print('Iteration #{:d} ({} samples)'.format(i, len(labeled_indices)))
        evaluate(active_learner, train[labeled_indices], test)

transformer_model = TransformerModelArguments(transformer_model_name)
clf_factory = TransformerBasedClassificationFactory(transformer_model, 
                                                    num_classes, 
                                                    kwargs=dict({'device': 'cuda', 
                                                                 'mini_batch_size': 32#,
                                                                 #'early_stopping_no_improvement': -1
                                                                }))

query_strategy = RandomSampling()

# Active learner
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, x_train)
labeled_indices = initialize_active_learner(active_learner, y_train)


try:
    perform_active_learning(active_learner, x_train, labeled_indices, x_test)

except PoolExhaustedException:
    print('Error! Not enough samples left to handle the query.')
    
except EmptyPoolException:
    print('Error! No more samples left. (Unlabeled pool is empty)')


@chschroeder

@neel17 Okay, if none of them seem to work, could it be that there are labels that appear only once? If so, we can get splits that don't include every label. Maybe the initialization functions should check for plausibility in this case.

Can you have a look at your class distribution? You can use the following snippet to obtain an array that contains, at each index, the number of samples for the corresponding class:

def get_class_histogram(y, num_classes):
    ind, counts = np.unique(y, return_counts=True)
    ind_set = set(ind)

    histogram = np.zeros(num_classes)
    for i, c in zip(ind, counts):
        if i in ind_set:
            histogram[i] = c

    return histogram.astype(int)

In your case: hist = get_class_histogram(y, 11014), where for y you use the train set's labels, e.g. train.y.
Then (hist <= 1).astype(int).sum() should give you the number of underrepresented labels (code untested).
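
Putting the two steps together (untested; assumes train.y is a flat numpy array of integer labels):

hist = get_class_histogram(train.y, 11014)

print((hist <= 1).astype(int).sum())  # classes with at most one sample
print((hist == 0).astype(int).sum())  # classes with no samples at all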


neel17 commented Nov 25, 2021

@chschroeder Thanks for the reply. I have checked for underrepresented labels as you suggested; I don't have any label with fewer than 30 samples, actually.

(two screenshots of the class-distribution check were attached here)


chschroeder commented Nov 25, 2021

I looked into this briefly, I can take a closer look this weekend.

It seems we have two problems:

  1. random_initialization_stratified can lead to not all classes being sampled (due to rounding). I think this is correct behavior, but unexpected. Maybe I should think about a parameter that tries to include at least one of each class.

  2. Your second try had a similar error, but this one occurred during training, not during initialization. By default, if no validation set is given, 10% of the currently labeled training data is used. The check which raises the exception is only very superficial and tests whether the label with the maximum index is present, otherwise the exception would have been raised even earlier. You could try increasing your query size to at least 2*11014. If you then also configure the validation set size with validation_set_size=0.5, it should use 50% of your training data for validation and will not throw an error anymore (as sketched below).
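
A sketch of this quick fix, adapted from your posted factory setup (untested; validation_set_size is passed through the classifier kwargs):

clf_factory = TransformerBasedClassificationFactory(
    transformer_model,
    num_classes,
    kwargs=dict({'device': 'cuda',
                 'mini_batch_size': 32,
                 # use 50% of the labeled data for validation
                 'validation_set_size': 0.5}))

# query at least two samples per class so that the 50% validation split
# can still contain every label
q_indices = active_learner.query(num_samples=2 * 11014)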

This is only a quick fix. I think this check should be optional but I want to think about this before taking action. I might remove this check or add an option to disable it.

Thanks for your patience. Having this many classes is a new use case, to be honest, but I think we can find a solution here (and improve small-text at the same time).

@chschroeder

@neel17: Okay, I think the only actual problem here is the class check, which results in ValueError: Conflicting information about the number of classes: expected: 11014, encountered: 8530.

This was required when the number of classes was obtained implicitly in earlier versions of the code, but can now be removed or at least changed to be a warning.

If you need a quick fix, you can override the _fit_main method and remove the following lines:

if self.num_classes != np.max(y) + 1:
    raise ValueError('Conflicting information about the number of classes: '
                     'expected: {}, encountered: {}'.format(self.num_classes,
                                                            np.max(y) + 1))

I will remove this check in the next version.
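
For reference, the scaffolding for such an override could look like this (only a skeleton; the method body has to be copied from the _fit_main of your installed release, minus the check above, and you would also need a factory that instantiates this subclass instead of TransformerBasedClassification):

from small_text.integrations.transformers.classifiers.classification import (
    TransformerBasedClassification)

class PatchedTransformerBasedClassification(TransformerBasedClassification):

    def _fit_main(self, sub_train, sub_valid, optimizer, scheduler):
        # paste the body of TransformerBasedClassification._fit_main from
        # your installed version here, with the num_classes check removed
        ...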
