
Annotating an unlabeled set #1

Closed
Mahhos opened this issue Sep 10, 2020 · 6 comments

Comments

@Mahhos commented Sep 10, 2020

Hi. Thanks for the great repo. I have a question regarding PET training and annotating an unlabeled set (the examples from D mentioned in the paper). I assume this is done using the command in the PET Training and Evaluation section of the repo. However, I am not sure where to put the unlabeled set and where to get the predicted labels from. Would you please let me know how we should get the predicted labels for the unlabeled set? Thank you.

@timoschick (Owner)

Hi, this is a bit difficult to do in the current version because PET expects the unlabeled examples to be in the same file as the labeled examples (this will be fixed in the next version, which will hopefully be released in ~2 weeks). What you can do until then is the following:

  1. replace line 154 in run_training.py (all_train_data = load_examples(args.task_name, args.data_dir, args.lm_train_examples_per_label, evaluate=False)) with some custom function that loads your unlabeled set, something like all_train_data = load_unlabeled_examples() (see the sketch after this list).
  2. when you run run_training.py, set the --save_train_logits flag. This will produce a file called logits.txt in the specified output directory that, for each unlabeled example in all_train_data, contains the logits for all labels.
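A minimal sketch of such a loader, assuming PET's InputExample class from utils.py (the constructor signature may differ between versions) and a made-up one-column CSV of texts; load_unlabeled_examples and the file path are placeholders:

import csv

from utils import InputExample  # PET's example class; adjust the import to your version

def load_unlabeled_examples(path='unlabeled.csv'):
    """Load unlabeled examples from a one-column CSV (hypothetical layout)."""
    examples = []
    with open(path, encoding='utf-8') as fh:
        for idx, row in enumerate(csv.reader(fh)):
            # There is no gold label, so the label field is simply left unset.
            examples.append(InputExample(guid=str(idx), text_a=row[0]))
    return examples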

For example, if your TaskProcessor's get_labels() function returns ["good", "bad"] and all_train_data = [ex0, ex1, ex2], then the model's logits for "bad" given ex2 correspond to the second number in the third line of logits.txt.
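To recover a hard prediction per example from that file, a few lines of plain Python suffice, assuming the layout just described (one whitespace-separated row per example, one column per label in get_labels() order; the file name and label list are taken from the example above):

labels = ["good", "bad"]  # whatever your TaskProcessor's get_labels() returns

# Note: depending on the version, the file may start with a marker row
# (see the -1 row discussed later in this thread); skip it if present.
with open('logits.txt') as fh:
    for i, line in enumerate(fh.read().splitlines()):
        logits = [float(x) for x in line.split()]
        # The column with the highest logit is the model's preferred label.
        print(f'example {i}: {labels[logits.index(max(logits))]}')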

@Mahhos (Author) commented Sep 11, 2020

Thanks for the response. In the current version (without changing line 154 in run_training.py), may I put my unlabeled samples in the training file (at the end of the file, for example) and set --save_train_logits to get the predicted labels?
If not, should I put my unlabeled data in a separate CSV file and define a new version of load_examples() and get_dev_examples()/get_train_examples() to read my unlabeled data?

@timoschick (Owner)

should I put my unlabeled data in a separate CSV file and define a new version of load_examples() and get_dev_examples()/get_train_examples() to read my unlabeled data?

That would be the safest way, so I'd recommend doing it like that!
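A hypothetical sketch of that approach, assuming PET's DataProcessor base class in tasks.py and the InputExample class in utils.py (the class names, required methods, and CSV layout are all assumptions that may need adjusting to your version):

import csv

from tasks import DataProcessor        # adjust imports to your PET version
from utils import InputExample

class UnlabeledCsvProcessor(DataProcessor):
    """Reads unlabeled texts from their own CSV file (hypothetical layout)."""

    def get_train_examples(self, data_dir):
        examples = []
        with open(f'{data_dir}/unlabeled.csv', encoding='utf-8') as fh:
            for idx, row in enumerate(csv.reader(fh)):
                examples.append(InputExample(guid=f'unlabeled-{idx}', text_a=row[0]))
        return examples

    def get_dev_examples(self, data_dir):
        return []  # no labeled dev data for the unlabeled set

    def get_labels(self):
        return ["good", "bad"]  # your task's label set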

@Mahhos (Author) commented Sep 13, 2020

Thanks. I have another question regarding the verbalizer. I am designing a custom PVP. How can I make sure that the language model fills the <MASK> with exactly my tokens?

For example, for the Yelp task, how did you know that the language model would predict exactly ["terrible"], ["bad"], ["okay"], ["good"], ["great"] and not some synonyms of these words?

@timoschick (Owner)

If your verbalizer uses only the words terrible, bad, okay, good and great, then PET simply ignores the probabilities assigned to all other words. Let's assume the model's predictions are (in that order):

horrible # 0.30
awful    # 0.20
terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

PET basically removes all words that are not used by the verbalizer, resulting in the following reduced list:

terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

So PET would assign the label corresponding to terrible to this example, even though terrible is not the word the language model itself would have predicted first.
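In code, that filtering step amounts to the following self-contained sketch (toy probabilities copied from the example above, not real model output):

# Model's full distribution over the vocabulary (toy numbers from above).
predictions = {
    "horrible": 0.30, "awful": 0.20, "terrible": 0.20,
    "bad": 0.10, "okay": 0.02, "good": 0.01, "great": 0.01,
}
verbalizer = ["terrible", "bad", "okay", "good", "great"]

# Drop every word the verbalizer does not use, then take the argmax.
reduced = {word: predictions[word] for word in verbalizer}
print(max(reduced, key=reduced.get))  # terrible, although 'horrible' scores higher overall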

@chris-aeviator commented Oct 11, 2020

@timoschick

If I have labels 0 = 'bad' & 1 = 'good', I'll get an unlabeled_logits.txt with the first row being -1 and then a row for each row in my unlabeled.csv file.

Is it correct that I then apply softmax to each row to get a prediction for the first label "bad" (corresponding to the first "column" in the logits file) and "good" (the second "column")?

Example logits:

-1
0.21161096000000001 0.3217776633333334
1.6751958333333334  -1.45424471

EDIT:

Ended up writing a conversion script (since I'm using an Airflow pipeline for the job anyway) that writes me a prediction file with probabilities from the logits:

import torch
import pandas as pd

logits_file = '/tmp/unlabeled_logits.txt'
results = []
with open(logits_file, 'r') as fh:
    lines = fh.read().splitlines()
# The first row is a "-1" marker, not logits for an example, so skip it.
for line in lines[1:]:
    example_logits = torch.tensor([float(x) for x in line.split()])
    # Softmax over the label dimension turns the logits into probabilities.
    results.append(torch.softmax(example_logits, dim=0).numpy())
df = pd.DataFrame(results)
df.to_csv('/out/predictions.csv')

The output is a probability for my label bad (first column) and good (second column):

0.9937028288841248,0.006297166459262371
