
Problem using personalized task #14

Closed
JohnPFL opened this issue Nov 4, 2020 · 6 comments
JohnPFL commented Nov 4, 2020

Hello, @timoschick.
First of all, I want to compliment you on the great work you did with PET.
I think this is amazing and I can't wait to try it to solve some of my data problems. I am quite new to transformers, so I'm probably doing something terribly naive.
I created my personalized task using:
```python
import csv
import os
from typing import List

from pet.tasks import DataProcessor, PROCESSORS
from pet.utils import InputExample


class MyTaskDataProcessor(DataProcessor):
    """
    Example for a data processor.
    """

    # Set this to the name of the task
    TASK_NAME = "illegal-detection"

    # Set this to the name of the file containing the train examples
    TRAIN_FILE_NAME = "train.csv"

    # Set this to the name of the file containing the dev examples
    DEV_FILE_NAME = "dev.csv"

    # Set this to the name of the file containing the test examples
    TEST_FILE_NAME = "test.csv"

    # Set this to the name of the file containing the unlabeled examples
    UNLABELED_FILE_NAME = "unlabeled.csv"

    # Set this to a list of all labels in the train + test data
    LABELS = ["0", "1"]

    # Set this to the column of the train/test csv files containing the input's text a
    TEXT_A_COLUMN = 0

    # Set this to the column of the train/test csv files containing the input's text b or to -1 if there is no text b
    TEXT_B_COLUMN = -1

    # Set this to the column of the train/test csv files containing the input's gold label
    LABEL_COLUMN = 1

    def get_train_examples(self, data_dir: str) -> List[InputExample]:
        """
        This method loads train examples from a file with name `TRAIN_FILE_NAME` in the given directory.
        :param data_dir: the directory in which the training data can be found
        :return: a list of train examples
        """
        return self._create_examples(os.path.join(data_dir, MyTaskDataProcessor.TRAIN_FILE_NAME), "train")

    def get_dev_examples(self, data_dir: str) -> List[InputExample]:
        """
        This method loads dev examples from a file with name `DEV_FILE_NAME` in the given directory.
        :param data_dir: the directory in which the dev data can be found
        :return: a list of dev examples
        """
        return self._create_examples(os.path.join(data_dir, MyTaskDataProcessor.DEV_FILE_NAME), "dev")

    def get_test_examples(self, data_dir) -> List[InputExample]:
        """
        This method loads test examples from a file with name `TEST_FILE_NAME` in the given directory.
        :param data_dir: the directory in which the test data can be found
        :return: a list of test examples
        """
        return self._create_examples(os.path.join(data_dir, MyTaskDataProcessor.TEST_FILE_NAME), "test")

    def get_unlabeled_examples(self, data_dir) -> List[InputExample]:
        """
        This method loads unlabeled examples from a file with name `UNLABELED_FILE_NAME` in the given directory.
        :param data_dir: the directory in which the unlabeled data can be found
        :return: a list of unlabeled examples
        """
        return self._create_examples(os.path.join(data_dir, MyTaskDataProcessor.UNLABELED_FILE_NAME), "unlabeled")

    def get_labels(self) -> List[str]:
        """This method returns all possible labels for the task."""
        return MyTaskDataProcessor.LABELS

    def _create_examples(self, path, set_type, max_examples=-1, skip_first=0):
        """Creates examples for the training and dev sets."""
        examples = []

        with open(path) as f:
            reader = csv.reader(f, delimiter=',')
            for idx, row in enumerate(reader):
                if idx < skip_first:  # skip header rows
                    continue
                guid = "%s-%s" % (set_type, idx)
                label = row[MyTaskDataProcessor.LABEL_COLUMN]
                text_a = row[MyTaskDataProcessor.TEXT_A_COLUMN]
                text_b = row[MyTaskDataProcessor.TEXT_B_COLUMN] if MyTaskDataProcessor.TEXT_B_COLUMN >= 0 else None
                example = InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label)
                examples.append(example)
                if 0 < max_examples <= len(examples):
                    break

        return examples


PROCESSORS[MyTaskDataProcessor.TASK_NAME] = MyTaskDataProcessor
```

Ok, now I want to use it to solve my problem, so I'm running this command:

```
python cli.py --method pet --data_dir .../comments_class/code/pet-master/pet --model_type roberta --model_name_or_path roberta --task_name --output_dir ...comments_class/data/output --pattern_ids 0 1 --do_train --do_eval
```
The problem is: how do I tell cli.py about the new task I've created?
Sorry if I'm being too naive. I think this is an easy one, and maybe the answer will be useful for future noobs too.
Thanks again for your work!

@JohnPFL
Author

JohnPFL commented Nov 5, 2020

Ok, I solved this by putting the example classes directly in the cli.py file! Now I'm struggling to set the personalized pattern ids, which I guess I have to take from the personalized PVP.

@timoschick
Owner

Hi @JohnPFL ,

as you already know, PET requires a DataProcessor and a PVP.
What you need to do is to register the DataProcessor in the PROCESSORS dictionary (as you have done) and to register the PVP in the PVPS dictionary, e.g.:

```python
PROCESSORS['illegal-detection'] = MyTaskDataProcessor
PVPS['illegal-detection'] = MyPVP
```

You can then call PET with `python cli.py --task_name illegal-detection ...` and it will automatically load the corresponding DataProcessor and PVP.

With regards to the pattern ids, you may want to take a look at https://github.com/timoschick/pet/blob/master/examples/custom_task_pvp.py . In your `get_parts(...)` and `verbalize(...)` functions, you can define PVPs for an arbitrary number of pattern ids (starting from 0), as shown in this example file. If, for example, you define 2 patterns (with ids 0 and 1), as in the file linked above, you can then simply add `--pattern_ids 0 1` when you call cli.py.
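Concretely, the pattern-id dispatch might look like the following standalone sketch. Note this is illustrative, not the library's exact API: the real class subclasses `pet.pvp.PVP`, wraps parts with `self.shortenable(...)`, and uses the tokenizer's real mask token rather than the `"<mask>"` placeholder used here; `MyTaskPVP` and the pattern wordings are made up for this task.

```python
from collections import namedtuple

# Minimal stand-in for pet.utils.InputExample, so the sketch runs on its own.
InputExample = namedtuple("InputExample", ["text_a"])

class MyTaskPVP:
    # Map each label returned by the processor's get_labels() to verbalizations.
    VERBALIZER = {"0": ["No"], "1": ["Yes"]}

    def __init__(self, pattern_id: int):
        self.pattern_id = pattern_id

    def get_parts(self, example):
        """Return (parts_a, parts_b): a cloze formulation with a mask slot,
        chosen according to self.pattern_id."""
        text = example.text_a
        if self.pattern_id == 0:
            return [text, ". Is this illegal?", "<mask>"], []
        elif self.pattern_id == 1:
            return ["Illegal?", "<mask>", ".", text], []
        raise ValueError("No pattern implemented for id %d" % self.pattern_id)

    def verbalize(self, label):
        """Map a task label to the token(s) the language model should predict."""
        return MyTaskPVP.VERBALIZER[label]

# Each pattern id yields a different cloze formulation of the same example:
ex = InputExample(text_a="some user comment")
for pid in (0, 1):
    print(MyTaskPVP(pid).get_parts(ex))
```

With two patterns defined this way, `--pattern_ids 0 1` trains one ensemble per pattern.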

@JohnPFL
Author

JohnPFL commented Nov 6, 2020

Thank you @timoschick for the answer!
I guess I'm done with the setup, but now I am getting this AssertionError.
Do you know what I'm doing wrong?

The cmd is:

```
python cli.py --method pet --data_dir /comments_class/code/pet-master/pet --model_type roberta --model_name_or_path roberta-base --task_name illegal-detection --output_dir /comments_class/data/output/pet/results/TRUE --overwrite_output_dir --pattern_ids 0 1 --do_train --do_eval
```

```
2020-11-06 10:10:34,359 - INFO - cli - Parameters: Namespace(adam_epsilon=1e-08, alpha=0.9999, cache_dir='', data_dir='comments_class/code/pet-master/pet', decoding_strategy='default', do_eval=True, do_train=True, eval_set='dev', ipet_generations=3, ipet_logits_percentage=0.25, ipet_n_most_likely=-1, ipet_scale_factor=5, learning_rate=1e-05, lm_training=False, logging_steps=50, max_grad_norm=1.0, method='pet', model_name_or_path='roberta-base', model_type='roberta', no_cuda=False, no_distillation=False, output_dir='/comments_class/data/output/pet/results/TRUE', overwrite_output_dir=True, pattern_ids=[0, 1], pet_gradient_accumulation_steps=1, pet_max_seq_length=256, pet_max_steps=-1, pet_num_train_epochs=3, pet_per_gpu_eval_batch_size=8, pet_per_gpu_train_batch_size=4, pet_per_gpu_unlabeled_batch_size=4, pet_repetitions=3, priming=False, reduction='wmean', sc_gradient_accumulation_steps=1, sc_max_seq_length=256, sc_max_steps=-1, sc_num_train_epochs=3, sc_per_gpu_eval_batch_size=8, sc_per_gpu_train_batch_size=4, sc_per_gpu_unlabeled_batch_size=4, sc_repetitions=1, seed=42, split_examples_evenly=False, task_name='illegal-detection', temperature=2, test_examples=-1, train_examples=-1, unlabeled_examples=-1, verbalizer_file=None, warmup_steps=0, weight_decay=0.01, wrapper_type='mlm')
2020-11-06 10:10:34,360 - INFO - tasks - Creating features from dataset file at comments_class/code/pet-master/pet (num_examples=-1, set_type=train)
2020-11-06 10:10:34,361 - INFO - tasks - Returning 380 train examples with label dist.: [('label', 1), ('1', 179), ('0', 200)]
2020-11-06 10:10:34,361 - INFO - tasks - Creating features from dataset file at /comments_class/code/pet-master/pet (num_examples=-1, set_type=dev)
2020-11-06 10:10:34,361 - INFO - tasks - Returning 41 dev examples with label dist.: [('label', 1), ('0', 13), ('1', 27)]
2020-11-06 10:10:34,361 - INFO - tasks - Creating features from dataset file at /comments_class/code/pet-master/pet (num_examples=-1, set_type=unlabeled)
2020-11-06 10:10:34,386 - INFO - tasks - Returning 10001 unlabeled examples with label dist.: [('0', 10001)]
2020-11-06 10:10:34,386 - WARNING - modeling - Path /comments_class/data/output/pet/results/TRUE/p0-i0 already exists, skipping it...
2020-11-06 10:10:34,386 - WARNING - modeling - Path /comments_class/data/output/pet/results/TRUE/p0-i1 already exists, skipping it...
2020-11-06 10:10:34,386 - WARNING - modeling - Path /comments_class/data/output/pet/results/TRUE/p0-i2 already exists, skipping it...
2020-11-06 10:10:34,386 - WARNING - modeling - Path /comments_class/data/output/pet/results/TRUE/p1-i0 already exists, skipping it...
2020-11-06 10:10:34,386 - WARNING - modeling - Path /comments_class/data/output/pet/results/TRUE/p1-i1 already exists, skipping it...
2020-11-06 10:10:34,387 - WARNING - modeling - Path /comments_class/data/output/pet/results/TRUE/p1-i2 already exists, skipping it...
2020-11-06 10:10:34,387 - INFO - modeling - === OVERALL RESULTS ===
2020-11-06 10:10:34,387 - INFO - modeling - Found the following 6 subdirectories: ['p0-i0', 'p0-i1', 'p0-i2', 'p1-i0', 'p1-i1', 'p1-i2']
2020-11-06 10:10:34,387 - WARNING - modeling - Skipping subdir 'p0-i0' because 'results.txt' or 'logits.txt' not found
2020-11-06 10:10:34,387 - WARNING - modeling - Skipping subdir 'p0-i1' because 'results.txt' or 'logits.txt' not found
2020-11-06 10:10:34,387 - WARNING - modeling - Skipping subdir 'p0-i2' because 'results.txt' or 'logits.txt' not found
2020-11-06 10:10:34,387 - WARNING - modeling - Skipping subdir 'p1-i0' because 'results.txt' or 'logits.txt' not found
2020-11-06 10:10:34,387 - WARNING - modeling - Skipping subdir 'p1-i1' because 'results.txt' or 'logits.txt' not found
2020-11-06 10:10:34,387 - WARNING - modeling - Skipping subdir 'p1-i2' because 'results.txt' or 'logits.txt' not found
Traceback (most recent call last):
  File "cli.py", line 285, in <module>
    main()
  File "cli.py", line 266, in main
    no_distillation=args.no_distillation, seed=args.seed)
  File "comments_class/code/pet_master_new_try/pet/modeling.py", line 256, in train_pet
    merge_logits(output_dir, logits_file, reduction)
  File "/comments_class/code/pet_master_new_try/pet/modeling.py", line 575, in merge_logits
    merged_loglist = merge_logits_lists(all_logits_lists, reduction=reduction)
  File "/comments_class/code/pet_master_new_try/pet/modeling.py", line 590, in merge_logits_lists
    assert len(set(len(ll.logits) for ll in logits_lists)) == 1
AssertionError
```

@timoschick
Owner

When you run PET, what happens internally is that for each pattern, three models are trained and used to annotate unlabeled examples; these annotations are written to a file called `logits.txt`, so for each pattern $P you will have three files `p$P-i0/logits.txt`, `p$P-i1/logits.txt`, `p$P-i2/logits.txt`. The annotations are then merged to train a single model. This AssertionError arises because at least one of your `logits.txt` files contains a different number of logits than the others. Perhaps you have restarted training multiple times without first deleting the target directory, so some partial results are stored in one or more of the `logits.txt` files? (There are also some other lines indicating that training has been restarted multiple times without clearing the target directory first, e.g. the line `WARNING - modeling - Path /comments_class/data/output/pet/results/TRUE/p1-i2 already exists, skipping it...`)
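The failing assertion boils down to a length-consistency check across the per-model logit lists. A standalone illustration (with a hypothetical `LogitsList` stand-in for pet's class of the same name):

```python
from collections import namedtuple

# Stand-in for pet's LogitsList: one entry per p$P-i$I/logits.txt file.
LogitsList = namedtuple("LogitsList", ["logits"])

def lengths_consistent(logits_lists):
    # Every list must annotate the same number of unlabeled examples;
    # otherwise the lists cannot be merged element-wise.
    return len(set(len(ll.logits) for ll in logits_lists)) == 1

# Six complete runs, each annotating the same 100 unlabeled examples:
complete = [LogitsList(logits=[[0.1, 0.9]] * 100) for _ in range(6)]

# One stale, partial run left over from an interrupted earlier training:
partial = complete[:5] + [LogitsList(logits=[[0.1, 0.9]] * 40)]

print(lengths_consistent(complete))  # True
print(lengths_consistent(partial))   # False -> the AssertionError above
```

Deleting the output directory before restarting removes the stale partial files, which is why clearing it resolves the error.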

@JohnPFL
Author

JohnPFL commented Nov 10, 2020

Perfect, that completely solved the error; it was as you said.
Now I've got this KeyError:
```
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2020-11-10 13:00:17,642 - INFO - wrapper - Writing example 0
Traceback (most recent call last):
  File "cli.py", line 285, in <module>
    main()
  File "cli.py", line 266, in main
    no_distillation=args.no_distillation, seed=args.seed)
  File "/comments_class/code/pet_master_new_try/pet/modeling.py", line 249, in train_pet
    save_unlabeled_logits=not no_distillation, seed=seed)
  File "/comments_class/code/pet_master_new_try/pet/modeling.py", line 355, in train_pet_ensemble
    unlabeled_data=unlabeled_data))
  File "/comments_class/code/pet_master_new_try/pet/modeling.py", line 434, in train_single_model
    results_dict['train_set_before_training'] = evaluate(model, train_data, eval_config)['scores']['acc']
  File "/comments_class/code/pet_master_new_try/pet/modeling.py", line 490, in evaluate
    n_gpu=config.n_gpu, decoding_strategy=config.decoding_strategy, priming=config.priming)
  File "/comments_class/code/pet_master_new_try/pet/wrapper.py", line 352, in eval
    eval_dataset = self._generate_dataset(eval_data, priming=priming)
  File "/comments_class/code/pet_master_new_try/pet/wrapper.py", line 399, in _generate_dataset
    features = self._convert_examples_to_features(data, labelled=labelled, priming=priming)
  File "/comments_class/code/pet_master_new_try/pet/wrapper.py", line 424, in _convert_examples_to_features
    input_features = self.preprocessor.get_input_features(example, labelled=labelled, priming=priming)
  File "/comments_class/code/pet_master_new_try/pet/preprocessor.py", line 83, in get_input_features
    label = self.label_map[example.label] if example.label is not None else -100
KeyError: 'label'
```

@timoschick
Owner

The label_map is initialized like this:

```python
self.label_map = {label: i for i, label in enumerate(self.wrapper.config.label_list)}
```

where `self.wrapper.config.label_list` is the list of labels that your TaskProcessor's `get_labels()` method returns. You are getting this error because one of your training examples has the label 'label', but this label is not one of the labels defined in your TaskProcessor. (Judging from the `[('label', 1), ...]` entries in the label distributions of your earlier log, this is the csv header row being read as a training example.)
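A standalone sketch of the lookup that fails, using the label list from the processor above:

```python
# label_map is built from get_labels(), so any example whose label is not in
# that list raises a KeyError during feature conversion.
label_list = ["0", "1"]  # what MyTaskDataProcessor.get_labels() returns
label_map = {label: i for i, label in enumerate(label_list)}

print(label_map["1"])       # a real label maps to an index: 1
try:
    label_map["label"]      # the csv header cell, read as if it were a label
except KeyError as e:
    print("KeyError:", e)   # KeyError: 'label'

# Fix: don't let the header row become an example, e.g. skip the first csv row
# when creating examples, or delete the header line from the data files.
```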
