meaning of DEV_FILE_NAME #7

Closed
chris-aeviator opened this issue Oct 11, 2020 · 4 comments

chris-aeviator commented Oct 11, 2020

Thanks for sharing this repo. Looking at the /examples dir, you split your dataset (labeled data?) into

  • DEV_FILE_NAME
  • TRAIN_FILE_NAME
  • TEST_FILE_NAME

& further

  • UNLABELED_FILE_NAME

Two questions arise:
a) How do you split the labeled data (distribution, e.g. are you splitting the 32 training examples from FewGLUE equally across DEV / TRAIN / TEST?)
b) Will UNLABELED be automatically predicted, and how is the result stored?

timoschick (Owner) commented:

Hi @chris-aeviator,

a) The labeled (training) data is not split at all. In the case of FewGLUE, this means that TRAIN_FILE_NAME should point to a file containing all 32 examples, whereas DEV_FILE_NAME and TEST_FILE_NAME should point to files containing the original dev/test examples. Note that the dev examples are not used during training or for hyperparameter optimization at all; just like the test examples, they are only used for evaluation. If you have no dev examples, you can simply make get_dev_examples(self, data_dir: str) return an empty list.
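
For illustration, a minimal sketch of such a processor. The MyTaskProcessor name and the exact import paths are assumptions, not something from this repo's examples, so adjust them to your version of the code:

```python
# Hypothetical processor sketch: return an empty dev set so the dev split
# is never used. DataProcessor / InputExample are PET's types; the import
# paths below are assumptions and may differ in your checkout.
from typing import List

from pet.tasks import DataProcessor
from pet.utils import InputExample


class MyTaskProcessor(DataProcessor):
    # get_train_examples, get_test_examples, get_labels, etc. are omitted
    # here; a real task processor must still implement them.

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        return []  # no dev examples available
```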

b) Yes, but only for the individual models and not for the final distilled classifier. If you need predictions for the unlabeled data, you can simply set TEST_FILE_NAME = UNLABELED_FILE_NAME. The result is then stored in a file predictions.jsonl, where each line has the form {"idx": <IDX>, "label": "<LABEL>"}; <IDX> is the index of the example in the test file and <LABEL> is the predicted label.
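
Given that line format, the file can be read back with a few lines of Python (a sketch; use whatever output path you configured):

```python
# Read predictions.jsonl: one JSON object per line, in the format
# {"idx": <IDX>, "label": "<LABEL>"} described above.
import json

with open("predictions.jsonl") as f:
    predictions = [json.loads(line) for line in f]

for p in predictions[:5]:
    print(p["idx"], p["label"])  # test-file index and predicted label
```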

timoschick self-assigned this Oct 12, 2020
timoschick (Owner) commented:

I'm closing this issue for now. Feel free to reopen it if you have further questions.

aidahalitaj commented:

Hi @timoschick,

I am running PET for a custom task with --model_type bert. In the --data_dir I have 4 files: train.csv, test.csv, dev.csv, unlabeled.csv.

In the shell script, I have:
--do_train
--do_eval

Now in the output, I always get the predictions.jsonl file. UNLABELED_FILE_NAME is set to "unlabeled.csv", so it does not point to any of the other datasets. However, I thought the predictions file contained the model's predictions for dev.csv. I tested this with a different number of samples in each of the train/test/dev/unlabeled files, and the number of rows in predictions.jsonl matched that of the dev set. Does the predictions file (located in the final folder) show the predictions for dev.csv by default?
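
This is roughly the row-count check I used (a sketch; it assumes the CSVs have no header row and the file names are as in my --data_dir):

```python
# Sketch of the row-count check described above: compare the number of
# predictions against each split's size to see which split they match.
n_pred = sum(1 for _ in open("predictions.jsonl"))
print("predictions:", n_pred)

for split in ["train.csv", "test.csv", "dev.csv", "unlabeled.csv"]:
    n_rows = sum(1 for _ in open(split))
    print(split, n_rows, "<-- matches predictions" if n_rows == n_pred else "")
```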


aidahalitaj commented:

More info on what I said earlier...

@timoschick I ran two similar experiments on the same dataset (varying the unlabeled sample size):

Experiment A settings:

  • Balanced Dataset
  • train (50 samples per class)
  • test (150 samples per class)
  • dev (150 samples per class)
  • unlabeled (10 samples per class)

The predictions.jsonl file has 300 predicted labels in total.
Experiment A's predictions.jsonl contains predicted labels (300 samples) from only one class.

Experiment B settings:

  • Balanced Dataset
  • train (50 samples per class)
  • test (150 samples per class)
  • dev (150 samples per class)
  • unlabeled (100 samples per class)

The predictions.jsonl file again has 300 predicted labels in total.
Experiment B's predictions.jsonl contains predicted labels from both classes.

My task is a binary classification problem, but I don't understand the role of the unlabeled data in this case and why it impacts the results.
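
For reference, this is how the per-class label distribution in predictions.jsonl can be tallied (a sketch, using the line format stated earlier in this thread):

```python
# Tally predicted label counts per class in predictions.jsonl
# (line format {"idx": ..., "label": ...} as stated earlier in this thread).
import json
from collections import Counter

with open("predictions.jsonl") as f:
    counts = Counter(json.loads(line)["label"] for line in f)

print(counts)  # e.g. a single key with count 300 means only one predicted class
```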
