
Annotating an unlabeled set #1

Closed
Mahhos opened this issue Sep 10, 2020 · 6 comments

Comments

@Mahhos commented Sep 10, 2020

Hi. Thanks for the great repo. I have a question regarding PET training and annotating an unlabeled set (the examples from D mentioned in the paper). I assume this is done using the command in the PET Training and Evaluation section of the repo. However, I am not sure where to put the unlabeled set and where to get the predicted labels from. Would you please let me know how we should get the predicted labels for the unlabeled set? Thank you.

@timoschick (Owner)

Hi, this is a bit difficult to do in the current version because PET expects the unlabeled examples to be in the same file as the labeled examples (this will be fixed in the next version, which will hopefully be released in ~2 weeks). What you can do until then is the following:

  1. replace line 154 in run_training.py (all_train_data = load_examples(args.task_name, args.data_dir, args.lm_train_examples_per_label, evaluate=False)) with some custom function that loads your unlabeled set, something like all_train_data = load_unlabeled_examples() (see the sketch after this list).
  2. when you run run_training.py, set the --save_train_logits flag. This will produce a file called logits.txt in the specified output directory that, for each unlabeled example in all_train_data, contains the logits for all labels.
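A minimal sketch of such a loader, assuming PET's InputExample class from utils.py (the constructor signature may differ between versions) and a made-up one-column CSV of texts; load_unlabeled_examples and the file path are placeholders:

import csv

from utils import InputExample  # PET's example class; adjust the import to your version

def load_unlabeled_examples(path='unlabeled.csv'):
    """Load unlabeled examples from a one-column CSV (hypothetical layout)."""
    examples = []
    with open(path, encoding='utf-8') as fh:
        for idx, row in enumerate(csv.reader(fh)):
            # There is no gold label, so the label field is simply left unset.
            examples.append(InputExample(guid=str(idx), text_a=row[0]))
    return examples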

For example, if your TaskProcessor's get_labels() function returns ["good", "bad"] and all_train_data = [ex0, ex1, ex2], then the model's logits for "bad" given ex2 correspond to the second number in the third line of logits.txt.
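To recover a hard prediction per example from that file, a few lines of plain Python suffice, assuming the layout just described (one whitespace-separated row per example, one column per label in get_labels() order; the file name and label list are taken from the example above):

labels = ["good", "bad"]  # whatever your TaskProcessor's get_labels() returns

# Note: depending on the version, the file may start with a marker row
# (see the -1 row discussed later in this thread); skip it if present.
with open('logits.txt') as fh:
    for i, line in enumerate(fh.read().splitlines()):
        logits = [float(x) for x in line.split()]
        # The column with the highest logit is the model's preferred label.
        print(f'example {i}: {labels[logits.index(max(logits))]}')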

@Mahhos (Author) commented Sep 11, 2020

Thanks for the response. In the current version (without changing line 154 in run_training.py), may I put my unlabeled samples in the training file (at the end of the file, for example) and set --save_train_logits to get the predicted labels?
If not, should I put my unlabeled data in a separate CSV file and define a new version of load_examples() and get_dev_examples()/get_train_examples() to read my unlabeled data?

@timoschick (Owner)

should I put my unlabeled data in a separate CSV file and define a new version of load_examples() and get_dev_examples()/get_train_examples() to read my unlabeled data?

That would be the safest way, so I'd recommend doing it like that!
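A hypothetical sketch of that approach, assuming PET's DataProcessor base class in tasks.py and the InputExample class in utils.py (the class names, required methods, and CSV layout are all assumptions that may need adjusting to your version):

import csv

from tasks import DataProcessor        # adjust imports to your PET version
from utils import InputExample

class UnlabeledCsvProcessor(DataProcessor):
    """Reads unlabeled texts from their own CSV file (hypothetical layout)."""

    def get_train_examples(self, data_dir):
        examples = []
        with open(f'{data_dir}/unlabeled.csv', encoding='utf-8') as fh:
            for idx, row in enumerate(csv.reader(fh)):
                examples.append(InputExample(guid=f'unlabeled-{idx}', text_a=row[0]))
        return examples

    def get_dev_examples(self, data_dir):
        return []  # no labeled dev data for the unlabeled set

    def get_labels(self):
        return ["good", "bad"]  # your task's label set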

@Mahhos (Author) commented Sep 13, 2020

Thanks. I have another question regarding the verbalizer. I am designing a custom PVP. How can I make sure that the language model fills the <MASK> with exactly my tokens?

For example, for the Yelp task, how did you know that the language model would predict exactly ["terrible"], ["bad"], ["okay"], ["good"], ["great"] and not some synonyms of these words?

@timoschick (Owner)

If your verbalizer uses only the words terrible, bad, okay, good and great, then PET simply ignores the probabilities assigned to all other words. Let's assume the model's predictions are (in that order):

horrible # 0.30
awful    # 0.20
terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

PET basically removes all words that are not used by the verbalizer, resulting in the following reduced list:

terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

So PET would assign the label corresponding to terrible to this example, even though terrible is not the word the language model itself would have predicted first.
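In code, that filtering step amounts to the following self-contained sketch (toy probabilities copied from the example above, not real model output):

# Model's full distribution over the vocabulary (toy numbers from above).
predictions = {
    "horrible": 0.30, "awful": 0.20, "terrible": 0.20,
    "bad": 0.10, "okay": 0.02, "good": 0.01, "great": 0.01,
}
verbalizer = ["terrible", "bad", "okay", "good", "great"]

# Drop every word the verbalizer does not use, then take the argmax.
reduced = {word: predictions[word] for word in verbalizer}
print(max(reduced, key=reduced.get))  # terrible, although 'horrible' scores higher overall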

@chris-aeviator commented Oct 11, 2020

@timoschick

If I have labels 0 = 'bad' & 1 = 'good', I'll get an unlabeled_logits.txt with the first row being -1 and then a row for each row in my unlabeled.csv file.

Is it correct that I then apply softmax to each row to get a prediction for the first label "bad" (corresponding to the first "column" in the logits file) and "good" (the second "column")?

Example logits:

-1
0.21161096000000001 0.3217776633333334
1.6751958333333334  -1.45424471

EDIT:

Ended up writing a conversion script (since I'm using an Airflow pipeline for the job anyway) that writes me a prediction file with probabilities from the logits:

import torch
import pandas as pd

logits_file = '/tmp/unlabeled_logits.txt'
results = []
with open(logits_file, 'r') as fh:
    lines = fh.read().splitlines()
# The first row is a "-1" marker, not logits for an example, so skip it.
for line in lines[1:]:
    example_logits = torch.tensor([float(x) for x in line.split()])
    # Softmax over the label dimension turns the logits into probabilities.
    results.append(torch.softmax(example_logits, dim=0).numpy())
df = pd.DataFrame(results)
df.to_csv('/out/predictions.csv')

The output is a probability for my label bad (first column) and good (second column):

0.9937028288841248,0.006297166459262371
