Train on data without entities #139

Open
AnnaKholkina opened this issue Jul 2, 2024 · 1 comment


Hi. I want to fine-tune a model on a dataset in which some examples contain no entities, so that the model produces fewer false positives. I tried adding examples like this to the dataset:
{'tokenized_text': ['In', 'this', 'year', '.'], 'ner': []},
This raises an error:

Traceback (most recent call last):
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/finetune-balanced-remove-short-orgs-empty-ner.py", line 59, in <module>
    trainer.train(num_epochs=25)
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/trainer.py", line 213, in train
    total_loss = self.model(batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/gliner/model.py", line 141, in forward
    logits_label = scores.view(-1, num_classes)
RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0] because the unspecified dimension size -1 can be any value and is ambiguous

I also tried this format:
{'tokenized_text': ['In', 'this', 'year', '.'], 'ner': [[]]},
This raises a different error:

Traceback (most recent call last):
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/finetune-balanced-remove-short-orgs-empty-ner.py", line 59, in <module>
    trainer.train(num_epochs=25)
  File "/home/jovyan/work/dev/ner/gliner/GLiNER/examples/finetuning/trainer.py", line 208, in train
    for batch_idx, batch in progress_bar:
  File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 464, in __iter__
    next_batch = next(dataloader_iter)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 83, in <lambda>
    return DataLoader(data, collate_fn=lambda x: self.collate_fn(x, entity_types), **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 67, in collate_fn
    class_to_ids, id_to_classes = self.batch_generate_class_mappings(batch_list)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 42, in batch_generate_class_mappings
    negs = self.get_negatives(batch_list, 100)
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 34, in get_negatives
    types = set([el[-1] for el in b['ner']])
  File "/usr/local/lib/python3.10/dist-packages/gliner/modules/data.py", line 34, in <listcomp>
    types = set([el[-1] for el in b['ner']])
IndexError: list index out of range

Is there any way to fix this?

AnnaKholkina changed the title from "Train with balance data" to "Train on data without entities" Jul 2, 2024

urchade (Owner) commented Jul 3, 2024

You cannot train the model without any entity types. The model needs entity types to compute the matching scores.

If the list of named entities is empty, you can pre-define the list of labels under the key "label":

{'tokenized_text': ['In', 'this', 'year', '.'], 'ner': [], 'label': ["person", "org"]}
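
A minimal sketch of how this suggestion could be applied to a whole dataset before training. The helper name `add_fallback_labels`, the second sample sentence, and the `[start, end, label]` span layout are assumptions made for illustration; only the "label" key and the empty 'ner' list come from the reply above. Whether a given GLiNER version reads the "label" key has not been checked here.

```python
from typing import Any, Dict, List


def add_fallback_labels(
    examples: List[Dict[str, Any]], fallback_labels: List[str]
) -> List[Dict[str, Any]]:
    """Return copies of the examples; entity-free ones get a 'label' list.

    Hypothetical helper, not part of GLiNER.
    """
    patched = []
    for example in examples:
        example = dict(example)  # shallow copy so the input list is untouched
        # Drop empty spans, e.g. the inner [] of 'ner': [[]], which caused
        # the IndexError on el[-1] in the second traceback.
        example["ner"] = [span for span in example.get("ner", []) if span]
        # Only examples without entities need predefined labels.
        if not example["ner"] and "label" not in example:
            example["label"] = list(fallback_labels)
        patched.append(example)
    return patched


# The second example is made up; spans are assumed to be [start, end, label].
train_data = [
    {"tokenized_text": ["In", "this", "year", "."], "ner": []},
    {
        "tokenized_text": ["Anna", "works", "at", "Acme", "."],
        "ner": [[0, 0, "person"], [3, 3, "org"]],
    },
]

patched = add_fallback_labels(train_data, ["person", "org"])
print(patched[0])
```

Examples that already contain entities are left unchanged, so the fallback list only affects the entity-free negatives.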
