
During finetuning, classes_to_id is not correct #32

Open
Ahmedn1 opened this issue Mar 20, 2024 · 6 comments

@Ahmedn1

Ahmedn1 commented Mar 20, 2024

all_types_i = list(x['classes_to_id'][i].keys())

This line raises a KeyError because it uses a numeric index to look up keys that are strings.
This happens when entity_types are provided to create_dataloader.
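For context, a minimal sketch of the failure as I understand it (the type names are assumed): when entity_types is provided, x['classes_to_id'] ends up as a single dict keyed by type strings, so indexing it with the batch position i fails immediately.

# Hypothetical reproduction; "person"/"location" are made-up type names.
classes_to_id = {"person": 1, "location": 2}  # built from entity_types

i = 0  # i is a batch position, not an entity type
all_types_i = list(classes_to_id[i].keys())  # KeyError: 0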

@urchade
Owner

urchade commented Mar 20, 2024

I am not sure I understand.

Is it for training or inference?

@Ahmedn1
Author

Ahmedn1 commented Mar 20, 2024

@urchade Training. I am following this notebook, with the only difference being that I provide entity_types to the training script, like so:

train_loader = model.create_dataloader(train_data, batch_size=config.train_batch_size, shuffle=True, entity_types=config.entity_types)

@urchade
Owner

urchade commented Mar 20, 2024

Are the entity types in the correct format? It should be a list of strings.

Actually, I do not suggest setting entity types during training; the model generalizes better without a fixed type set.

@Ahmedn1
Author

Ahmedn1 commented Mar 20, 2024

@urchade Yes, it is a list of strings.

@tcourat

tcourat commented Apr 18, 2024

I have the same issue when specifying entity_types in create_dataloader.

The issue is here:

When entity_types is set to None, class_to_ids is indeed a LIST of dictionaries (one for each sentence in the batch):

class_to_ids = []

But when providing entity_types, it is a single dictionary:

class_to_ids = {k: v for v, k in enumerate(entity_types, start=1)}
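For example (assumed values), entity_types = ["person", "location"] would give class_to_ids = {"person": 1, "location": 2}, one dict shared by the whole batch rather than one dict per sentence.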

So a quick fix, when using entity_types, is to handle both cases by removing this line:

all_types_i = list(x['classes_to_id'][i].keys())

and adding instead:

if isinstance(x["classes_to_id"], list):
    all_types_i = list(x["classes_to_id"][i].keys())
elif isinstance(x["classes_to_id"], dict):
    all_types_i = list(x["classes_to_id"].keys())

@urchade
Owner

urchade commented Apr 18, 2024

So, you want to fix the labels during training, for supervised fine-tuning?

The solution for this is to add a "label" key to each training sample (i.e. in addition to "tokenized_text" and "ner"). You can do it as follows:

for i in range(len(train)):
    train[i]["label"] = labels

@Ahmedn1 @tcourat
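A minimal sketch of that suggestion, assuming the names train_data and config.entity_types from the snippets earlier in this thread (the fixed label set is attached to every sample, and entity_types is then not passed to create_dataloader):

# Assumed names: train_data and config.entity_types come from the
# fine-tuning setup shown above in this thread.
labels = config.entity_types  # the fixed label set, e.g. a list of strings

# Attach the fixed label set to every training sample,
# alongside "tokenized_text" and "ner".
for sample in train_data:
    sample["label"] = labels

# Then build the dataloader without passing entity_types.
train_loader = model.create_dataloader(
    train_data, batch_size=config.train_batch_size, shuffle=True
)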
