
During finetuning, classes_to_id is not correct #32

Open
Ahmedn1 opened this issue Mar 20, 2024 · 6 comments

@Ahmedn1

Ahmedn1 commented Mar 20, 2024

all_types_i = list(x['classes_to_id'][i].keys())

This line raises a KeyError because it uses a numeric index to look up keys that are strings.
This happens when entity_types are provided to create_dataloader.
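For context, a minimal sketch of the failure as I understand it (the type names are assumed): when entity_types is provided, x['classes_to_id'] ends up as a single dict keyed by type strings, so indexing it with the batch position i fails immediately.

# Hypothetical reproduction; "person"/"location" are made-up type names.
classes_to_id = {"person": 1, "location": 2}  # built from entity_types

i = 0  # i is a batch position, not an entity type
all_types_i = list(classes_to_id[i].keys())  # KeyError: 0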

@urchade
Owner

urchade commented Mar 20, 2024

I am not sure I understand.

Is it for training or inference?

@Ahmedn1
Author

Ahmedn1 commented Mar 20, 2024

@urchade Training. I am following this notebook, with the only difference being that I provide entity_types to the training script, like so:

train_loader = model.create_dataloader(train_data, batch_size=config.train_batch_size, shuffle=True, entity_types=config.entity_types)

@urchade
Owner

urchade commented Mar 20, 2024

Are the entity types in the correct format? It should be a list of strings.

Actually, I do not suggest setting entity types during training; the model generalizes better without a fixed type set.

@Ahmedn1
Author

Ahmedn1 commented Mar 20, 2024

@urchade Yes, it is a list of strings.

@tcourat

tcourat commented Apr 18, 2024

I have the same issue when specifying entity_types in create_dataloader.

The issue is here:

When entity_types is set to None, class_to_ids is indeed a LIST of dictionaries (one for each sentence in the batch):

class_to_ids = []

But when providing entity_types, it is a single dictionary:

class_to_ids = {k: v for v, k in enumerate(entity_types, start=1)}
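For example (assumed values), entity_types = ["person", "location"] would give class_to_ids = {"person": 1, "location": 2}, one dict shared by the whole batch rather than one dict per sentence.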

So a quick fix, when using entity_types, is to handle both cases by removing this line:

all_types_i = list(x['classes_to_id'][i].keys())

and adding instead:

if isinstance(x["classes_to_id"], list):
    all_types_i = list(x["classes_to_id"][i].keys())
elif isinstance(x["classes_to_id"], dict):
    all_types_i = list(x["classes_to_id"].keys())

@urchade
Owner

urchade commented Apr 18, 2024

So, you want to fix the labels during training, for supervised fine-tuning?

The solution for this is to add a "label" key to each training sample (i.e. in addition to "tokenized_text" and "ner"). You can do it as follows:

for i in range(len(train)):
    train[i]["label"] = labels

@Ahmedn1 @tcourat
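A minimal sketch of that suggestion, assuming the names train_data and config.entity_types from the snippets earlier in this thread (the fixed label set is attached to every sample, and entity_types is then not passed to create_dataloader):

# Assumed names: train_data and config.entity_types come from the
# fine-tuning setup shown above in this thread.
labels = config.entity_types  # the fixed label set, e.g. a list of strings

# Attach the fixed label set to every training sample,
# alongside "tokenized_text" and "ner".
for sample in train_data:
    sample["label"] = labels

# Then build the dataloader without passing entity_types.
train_loader = model.create_dataloader(
    train_data, batch_size=config.train_batch_size, shuffle=True
)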
