
How to choose unlabelled data #26

Closed
Punchwes opened this issue Mar 30, 2021 · 6 comments

@Punchwes

Hi @timoschick, thanks very much for your work. I have a question about how you choose the unlabelled data for each task.

In the paper you say

Similarly, we construct the set D of unlabeled examples by selecting 10000 examples per label and removing all labels

Taking agnews as an example, I assume this means you take 40,000 examples from the training set in total (it has 4 classes), i.e. 10,000 examples per class. However, in your code it seems you are not following the 10,000-examples-per-label scheme: you just shuffle the data and pick the first 40,000 examples.

I am a little bit confused about this; any clarification would be helpful.

@timoschick
Owner

Hi @Punchwes, there are two options for limiting the number of unlabeled examples:

  1. You can specify --unlabeled_examples <k> for some natural number <k>, e.g. --unlabeled_examples 40000. If you do so, the entire set of unlabeled examples is shuffled and the first 40,000 examples in the shuffled dataset are chosen. Of course, this does not guarantee an equal number of examples for each label.

  2. You can specify --unlabeled_examples <k> --split_examples_evenly for some natural number <k> as above. In this case, if your dataset has <n> labels, then for each label the first <k>/<n> examples found in the (unshuffled) unlabeled dataset are chosen.

For our experiments on AG's News, we chose the second option (that is, --unlabeled_examples 40000 --split_examples_evenly). If you wanted to combine both options (shuffle the dataset and select the same number of examples for each label), you'd have to implement this yourself, but it should not require more than one or two lines of code.
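
To illustrate the difference between the two options, here is a minimal, self-contained sketch of the two selection strategies over a hypothetical list of (text, label) pairs (not the actual PET code):

    import random
    from collections import defaultdict

    def select_shuffled(examples, k, seed=42):
        """Option 1: shuffle everything, then keep the first k examples (labels may end up imbalanced)."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        return examples[:k]

    def select_evenly(examples, k, labels):
        """Option 2: keep the first k/n examples per label from the unshuffled data."""
        per_label = k // len(labels)
        counts = defaultdict(int)
        selected = []
        for text, label in examples:
            if counts[label] < per_label:
                counts[label] += 1
                selected.append((text, label))
        return selected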

I hope this answers your question!

@Punchwes
Author

Hi @timoschick, thanks for your quick reply.

I think the method you describe in the paper corresponds to the second option. What confuses me is that, in the code, it seems --split_examples_evenly never applies to unlabeled data.

As this assertion in tasks.py shows:

    assert (not set_type == UNLABELED_SET) or (num_examples is not None), \
        "For unlabeled data, 'num_examples_per_label' is not allowed"

and in the example loading part in cli.py:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

no num_examples_per_label parameter is passed when loading unlabeled_data. This is why I am confused: it seems that the first option is always used for unlabeled data. The even-split logic in cli.py only covers the train and test examples:

    if args.split_examples_evenly:
        train_ex_per_label = eq_div(args.train_examples, len(args.label_list)) if args.train_examples != -1 else -1
        test_ex_per_label = eq_div(args.test_examples, len(args.label_list)) if args.test_examples != -1 else -1
        train_ex, test_ex = None, None

and, as far as I can see, unlabeled data is not involved in the split_examples_evenly part.

Or perhaps I missed something in the code where --split_examples_evenly can be applied to unlabeled data.

@timoschick
Owner

Oh right, my mistake, you are absolutely correct!
For our AG's News results, we used an older version of the code (the corresponding file can still be found here). Back then, examples were always split evenly across all labels, so option (1) from my previous comment was not possible and option (2) was the default. When I wrote the current version of PET, I explicitly removed the num_examples_per_label option for unlabeled data, because in a real-world setting you of course do not have labels for your unlabeled data, so this seemed like a sensible choice at the time. But it also means that with the current version of PET, option (2) from my previous comment is no longer possible. So you'd have to either

  1. modify the code by removing the assertion and applying the if args.split_examples_evenly: [...] code block also to unlabeled examples or
  2. write a script that extracts the first 10,000 examples for each label and writes them to a separate file, and then use this separate file as input (a sketch of such a script follows below).
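
A minimal sketch of option (2), assuming the AG's News CSV layout of one "label","title","body" row per example and hypothetical file paths (adjust both to your setup):

    import csv
    from collections import defaultdict

    SOURCE_FILE = "data/agnews/train.csv"      # full training set (hypothetical path)
    TARGET_FILE = "data/agnews/unlabeled.csv"  # file to be used as the unlabeled set
    EXAMPLES_PER_LABEL = 10000

    counts = defaultdict(int)
    selected = []

    with open(SOURCE_FILE, encoding="utf-8") as f:
        for row in csv.reader(f):
            label = row[0]
            # keep the first EXAMPLES_PER_LABEL rows encountered for each label
            if counts[label] < EXAMPLES_PER_LABEL:
                counts[label] += 1
                selected.append(row)

    with open(TARGET_FILE, "w", encoding="utf-8", newline="") as f:
        csv.writer(f).writerows(selected)

    print(dict(counts))  # sanity check: should be 10000 per label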

Sorry for the confusion!

@Punchwes
Author

Thanks very much for this clarification; it is very helpful, and it makes sense to remove the option for unlabelled data.

One last question I have is about seed. You mentioned in the paper that:

each model is trained three times using different seeds and average results are reported

After checking the code, it seems that the seed parameter passed on the command line (args.seed) is not used to choose data examples:

    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex, num_examples_per_label=train_ex_per_label)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex, num_examples_per_label=test_ex_per_label)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET, num_examples=args.unlabeled_examples)

and the seed in the load_examples function is fixed at 42:

    def load_examples(task, data_dir: str, set_type: str, *_, num_examples: int = None,
                      num_examples_per_label: int = None, seed: int = 42) -> List[InputExample]:

So I wonder: when you run the model 3 times with different seeds, do you also change the seed in load_examples() manually?

@timoschick
Owner

For our experiments in Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference, we use the same set of examples for all three runs. The different seeds only affect the initialization of model parameters (for regular supervised training), dropout and the shuffling of training examples (i.e., the order in which they are presented to the model), which happens here.
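
If you did want the selected examples to vary across runs as well, one option (a hypothetical modification of the cli.py calls quoted above, not what we did for the paper) would be to forward args.seed explicitly, since load_examples accepts a seed keyword argument:

    # Hypothetical change to cli.py: forward the CLI seed so that the
    # example selection itself also varies across runs.
    train_data = load_examples(
        args.task_name, args.data_dir, TRAIN_SET, num_examples=train_ex,
        num_examples_per_label=train_ex_per_label, seed=args.seed)
    eval_data = load_examples(
        args.task_name, args.data_dir, eval_set, num_examples=test_ex,
        num_examples_per_label=test_ex_per_label, seed=args.seed)
    unlabeled_data = load_examples(
        args.task_name, args.data_dir, UNLABELED_SET,
        num_examples=args.unlabeled_examples, seed=args.seed)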

If you're interested in how different sets of training examples affect performance, you might find Table 6 in this paper useful.

@Punchwes
Author

Thanks very much!
