meaning of DEV_FILE_NAME #7

Closed
chris-aeviator opened this issue Oct 11, 2020 · 4 comments

chris-aeviator commented Oct 11, 2020

Thanks for sharing this repo. Looking at the /examples dir, you split your dataset (labeled data?) into

  • DEV_FILE_NAME
  • TRAIN_FILE_NAME
  • TEST_FILE_NAME

& further

  • UNLABELED_FILE_NAME

Two questions arise:
a) How do you split the labeled data (distribution, e.g. are you splitting the 32 training examples from FewGLUE equally across DEV / TRAIN / TEST?)
b) Will UNLABELED be automatically predicted, and how is the result stored?

timoschick (Owner) commented:

Hi @chris-aeviator,

a) The labeled (training) data is not split at all. In the case of FewGLUE, this means that TRAIN_FILE_NAME should point to a file containing all 32 examples, whereas DEV_FILE_NAME and TEST_FILE_NAME should point to files containing the original dev/test examples. Note that the dev examples are not used during training or for hyperparameter optimization at all; just like the test examples, they are only used for evaluation. If you have no dev examples, you can simply make get_dev_examples(self, data_dir: str) return an empty list.
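
For illustration, a minimal sketch of such a processor. The MyTaskProcessor name and the exact import paths are assumptions, not something from this repo's examples, so adjust them to your version of the code:

```python
# Hypothetical processor sketch: return an empty dev set so the dev split
# is never used. DataProcessor / InputExample are PET's types; the import
# paths below are assumptions and may differ in your checkout.
from typing import List

from pet.tasks import DataProcessor
from pet.utils import InputExample


class MyTaskProcessor(DataProcessor):
    # get_train_examples, get_test_examples, get_labels, etc. are omitted
    # here; a real task processor must still implement them.

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        return []  # no dev examples available
```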

b) Yes, but only for the individual models and not for the final distilled classifier. If you need predictions for the unlabeled data, you can simply set TEST_FILE_NAME = UNLABELED_FILE_NAME. The result is then stored in a file predictions.jsonl, where each line has the form {"idx": <IDX>, "label": "<LABEL>"}; <IDX> is the index of the example in the test file and <LABEL> is the predicted label.
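
Given that line format, the file can be read back with a few lines of Python (a sketch; use whatever output path you configured):

```python
# Read predictions.jsonl: one JSON object per line, in the format
# {"idx": <IDX>, "label": "<LABEL>"} described above.
import json

with open("predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f]

for p in predictions[:5]:
    print(p["idx"], p["label"])  # test-file index and predicted label
```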

timoschick self-assigned this Oct 12, 2020
timoschick (Owner) commented:

I'm closing this issue for now. Feel free to reopen it if you have further questions.

aidahalitaj commented:

Hi @timoschick,

I am running PET for a custom task with --model_type bert. In the --data_dir I have 4 files: train.csv, test.csv, dev.csv, unlabeled.csv.

In the shell script, I have:
--do_train
--do_eval

Now in the output, I always get the predictions.jsonl file. UNLABELED_FILE_NAME is set to "unlabeled.csv", so it does not point to any of the other datasets. However, I thought the predictions file contained the model's predictions for dev.csv. I tested this with a different number of samples in each of the train/test/dev/unlabeled files, and the number of rows in predictions.jsonl matched that of the dev set. Does the predictions file (located in the final folder) show the predictions for dev.csv by default?
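
This is roughly the row-count check I used (a sketch; it assumes the CSVs have no header row and the file names are as in my --data_dir):

```python
# Sketch of the row-count check described above: compare the number of
# predictions against each split's size to see which split they match.
n_pred = sum(1 for _ in open("predictions.jsonl"))
print("predictions:", n_pred)

for split in ["train.csv", "test.csv", "dev.csv", "unlabeled.csv"]:
    n_rows = sum(1 for _ in open(split))
    print(split, n_rows, "<-- matches predictions" if n_rows == n_pred else "")
```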


aidahalitaj commented:

More info on what I said earlier...

@timoschick I ran two similar experiments on the same dataset (varying the unlabeled sample size):

Experiment A settings:

  • Balanced Dataset
  • train (50 samples per class)
  • test (150 samples per class)
  • dev (150 samples per class)
  • unlabeled (10 samples per class)

The predictions.jsonl file has 300 predicted labels in total.
Experiment A's predictions.jsonl contains predicted labels (300 samples) from only one class.

Experiment B settings:

  • Balanced Dataset
  • train (50 samples per class)
  • test (150 samples per class)
  • dev (150 samples per class)
  • unlabeled (100 samples per class)

The predictions.jsonl file again has 300 predicted labels in total.
Experiment B's predictions.jsonl contains predicted labels from both classes.

My task is a binary classification problem, but I don't understand the role of the unlabeled data in this case and why it impacts the results.
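
For reference, this is how the per-class label distribution in predictions.jsonl can be tallied (a sketch, using the line format stated earlier in this thread):

```python
# Tally predicted label counts per class in predictions.jsonl
# (line format {"idx": ..., "label": ...} as stated earlier in this thread).
import json
from collections import Counter

with open("predictions.jsonl") as f:
    counts = Counter(json.loads(line)["label"] for line in f)

print(counts)  # e.g. a single key with count 300 means only one predicted class
```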
