
Questions about training data preprocessing #74

Closed

liuqi6777 opened this issue Jan 12, 2023 · 1 comment

Comments

@liuqi6777

Hi!

I noticed that "attention_mask" is ignored when preprocessing the training data, as shown in the following code from src/tevatron/data.py:

from typing import List

from torch.utils.data import Dataset


class TrainDataset(Dataset):
    def create_one_example(self, text_encoding: List[int], is_query=False):
        # Prepare a single query or passage; padding is disabled and
        # no attention mask is returned at this stage.
        item = self.tok.prepare_for_model(
            text_encoding,
            truncation='only_first',
            max_length=self.data_args.q_max_len if is_query else self.data_args.p_max_len,
            padding=False,
            return_attention_mask=False,
            return_token_type_ids=False,
        )
        return item
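
For reference, a minimal repro (my own sketch; the model name and example text are placeholders, not from the repo) shows that the item returned by prepare_for_model carries only input_ids under these settings:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tok.encode("dense retrieval", add_special_tokens=False)
item = tok.prepare_for_model(
    ids,
    truncation='only_first',
    max_length=32,
    padding=False,
    return_attention_mask=False,
    return_token_type_ids=False,
)
print(item.keys())  # dict_keys(['input_ids']); no attention_mask yet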

I also found that some other dense retrieval codebases don't do this, so I would like to ask: what is the reason for designing the code this way?

Thanks for your answer :)

@liuqi6777
Author

I just found that the QPCollator adds the "attention_mask" later, when the examples are batched.
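
In other words, padding and the attention mask are deferred to batch collation. A minimal sketch of that mechanism (my own illustration, assuming a BERT tokenizer; it uses the tokenizer's generic pad method rather than the actual QPCollator code):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Unpadded examples, like those produced by create_one_example above
features = [
    {"input_ids": tok.encode("short query")},
    {"input_ids": tok.encode("a somewhat longer passage about dense retrieval")},
]

# Padding to the longest example in the batch creates the attention_mask
batch = tok.pad(features, padding=True, return_tensors="pt")
print(batch.keys())  # input_ids, attention_mask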
