Customized torchtext.data.Dataset takes much time to generate dataset #858

Closed
xdwang0726 opened this issue Jun 29, 2020 · 18 comments

@xdwang0726

❓ Questions and Help

Description
I wrote a customized data.Dataset for multi-label classification. When I processed the data, I found that generating the train and test sets with the customized dataset is very slow (about 1.5 s per example). I am wondering whether this is normal or whether something is wrong with my customized dataset.

The customized data.Dataset for multi-label classification is as follows:

from torchtext import data
from tqdm import tqdm


class TextMultiLabelDataset(data.Dataset):
    # NOTE: the n_labels argument is an assumed fix so that the dummy test
    # labels have a defined length (the original code left it undefined in test mode).
    def __init__(self, text, text_field, label_field, lbls=None, n_labels=None, **kwargs):
        # torchtext Field objects
        fields = [('text', text_field), ('label', label_field)]

        # Test mode: no labels provided, so all-zero label vectors are generated.
        is_test = lbls is None
        if not is_test:
            n_labels = len(lbls[0])  # length of one binary label vector

        examples = []
        for i, txt in enumerate(tqdm(text)):
            if not is_test:
                l = lbls[i]
            else:
                l = [0.0] * n_labels

            examples.append(data.Example.fromlist([txt, l], fields))

        super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)
where text is a list of document strings (each wrapped in a single-element list), and lbls is a list of binary label vectors (total number of labels ~20,000).

examples of text:

[["There are few factors more important to the mechanisms of evolution than stress. The stress response has formed as a result of natural selection..."], ["A 46-year-old female patient presenting with unspecific lower back pain, diffuse abdominal pain, and slightly elevated body temperature"], ...]

examples of lbls:

[[1 1 1 1 0 0 0 1 0 ...], [1 0 1 0 1 1 1 1 ...], ...]
@zhangguanheng66
Contributor

We are switching to a new dataset abstraction. Please take a look at the text classification datasets here. It should work for multi-label problems with a minor change to the _csv_iterator func.
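
For illustration only, a rough sketch of the kind of change meant here, assuming the CSV stores a binary label vector in its first n_labels columns and the text in the remaining columns (the column layout, the n_labels argument, and the name _multilabel_csv_iterator are assumptions, not part of torchtext):

import io

from torchtext.data.utils import get_tokenizer, ngrams_iterator
from torchtext.utils import unicode_csv_reader

def _multilabel_csv_iterator(data_path, ngrams, n_labels, yield_cls=False):
    # Variant of _csv_iterator: instead of a single class id in column 0,
    # the first n_labels columns hold a binary label vector.
    tokenizer = get_tokenizer("basic_english")
    with io.open(data_path, encoding="utf8") as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            labels = [float(x) for x in row[:n_labels]]
            tokens = tokenizer(' '.join(row[n_labels:]))
            if yield_cls:
                yield labels, ngrams_iterator(tokens, ngrams)
            else:
                yield ngrams_iterator(tokens, ngrams)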

@xdwang0726
Author

xdwang0726 commented Jun 29, 2020

Hi @zhangguanheng66, thank you! I have checked the new dataset abstraction and tried it with my data. I didn't find much difference: with the old abstraction it processes 1.35 items/s, and that increases to 2.78 items/s with the new one. This is really problematic when dealing with a large dataset (my dataset contains ~14,000,000 examples, so it would take ~58 days).

@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 29, 2020

@xdwang0726 Thanks. Could you share a simple code snippet with us? I can run some benchmark cases on my side. It shouldn't take that long, since we also have some similarly lengthy datasets. You can create a PR and share the link here so I can test it.

@xdwang0726
Author

Thank you! Here's my code:

from torchtext import data

class TextMultiLabelDataset(data.Dataset):
    def __init__(self, df, text_field, label_field, txt_col, lbl_cols, **kwargs):
        # torchtext Field objects
        fields = [('text', text_field), ('label', label_field)]

        # Test mode: the label columns are absent from the DataFrame.
        is_test = lbl_cols[0] not in df.columns
        n_labels = len(lbl_cols)

        examples = []
        for idx, row in df.iterrows():
            if not is_test:
                lbls = [row[l] for l in lbl_cols]
            else:
                lbls = [0.0] * n_labels

            txt = str(row[txt_col])
            examples.append(data.Example.fromlist([txt, lbls], fields))

        super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)

Sample df (a DataFrame that contains the text and the binary labels):

text                                 label1           label2          label3
"There are few factors....."          1                 0                 1
"A 46-year-old female...."            0                 1                 1    

txt_col and lbl_cols are the column names in the DataFrame.

I am not sure whether the above description is clear; if anything else is needed, please let me know. Thank you!
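
For illustration only, a minimal sketch of how the class above could be driven with the legacy torchtext API (the Field settings, the file name train.csv, and the batch size are assumptions, not from the thread):

import pandas as pd
import torch
from torchtext import data

# Placeholder Field setup; tokenizer and options are illustrative only.
TEXT = data.Field(sequential=True, tokenize=str.split, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False, batch_first=True, dtype=torch.float)

df = pd.read_csv("train.csv")  # assumed CSV with a text column and binary label columns
lbl_cols = ["label1", "label2", "label3"]

train_ds = TextMultiLabelDataset(df, TEXT, LABEL, txt_col="text", lbl_cols=lbl_cols)
TEXT.build_vocab(train_ds)

train_iter = data.BucketIterator(train_ds, batch_size=32, sort_key=lambda ex: len(ex.text))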

@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 29, 2020

Do you mind sharing the code that uses the new dataset abstraction?

@xdwang0726
Author

xdwang0726 commented Jun 29, 2020

I am not sure whether I understand correctly. torchtext.data.Dataset inherits from torch.utils.data.Dataset, and my code inherits from torchtext.data.Dataset. Are you expecting a class that inherits directly from torch.utils.data.Dataset? Thanks!

@zhangguanheng66
Contributor

In the new dataset abstraction, TextClassificationDataset inherits from torch.utils.data.Dataset directly and we don't use Field anymore. https://github.com/pytorch/text/blob/master/torchtext/datasets/text_classification.py
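
To make the contrast concrete, here is a minimal sketch (not the actual torchtext code) of what a Field-free multi-label dataset can look like in the new style, assuming the text has already been numericalized with a Vocab; the class name and the (label, token ids) item layout are illustrative:

import torch
from torch.utils.data import Dataset

class MultiLabelTextDataset(Dataset):
    """Minimal Field-free dataset: each item is a (label_vector, token_ids) pair."""

    def __init__(self, vocab, data):
        # data: list of (label_vector tensor, token_ids tensor) pairs,
        # built beforehand from the raw text with `vocab` and a tokenizer.
        self._vocab = vocab
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]

    def get_vocab(self):
        return self._vocab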

@xdwang0726
Author

xdwang0726 commented Jul 7, 2020

Thank you! I have modified the above-mentioned Python file to make it fit multi-label classification. I am wondering whether there is some sample code available that shows how to use it, for example, how to build the vocab using pre-trained embeddings and how to get train_iter and test_iter. Also, the current documentation is not up to date; it would be better to make clear that the abstraction of torchtext.data.Dataset has changed, and it would be a great help if new examples were provided. Thank you!

@xdwang0726
Author

Never mind, I found the examples here

@zhangguanheng66
Contributor

That one might not be the best resource (it's based on the legacy code). If you are using the new dataset abstraction, you can take a look at the text classification example here. See the train.py file, which shows how to use DataLoader.
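
Roughly in the spirit of that example, batching with DataLoader could look like the sketch below (the collate function is a hedged adaptation for the multi-label case, not the example's actual generate_batch; it assumes items shaped like the (label_vector, token_ids) pairs above and an nn.EmbeddingBag-style model):

import torch
from torch.utils.data import DataLoader

def collate_multilabel(batch):
    # batch: list of (label_vector, token_ids) pairs.
    labels = torch.stack([torch.as_tensor(lbl, dtype=torch.float) for lbl, _ in batch])
    texts = [tok for _, tok in batch]
    # Concatenate token ids and keep per-example offsets, as nn.EmbeddingBag expects.
    offsets = torch.tensor([0] + [len(t) for t in texts[:-1]]).cumsum(dim=0)
    text = torch.cat(texts)
    return text, offsets, labels

# train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
#                           collate_fn=collate_multilabel)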

@xdwang0726
Author

It helps, thank you! One last question: in the previous code, the Field had a build_vocab function that allowed loading pre-trained word2vec vectors using the vocab created from the training set, i.e. TEXT.build_vocab(train, vectors=vectors). Is there any function that does a similar thing, or is nn.Embedding.from_pretrained the right way? Thank you!

@zhangguanheng66
Contributor

These are our new pre-trained word vectors (FastText and GloVe) (link).
cc @Nayef211

@xdwang0726
Author

Thank you!

@xdwang0726
Author

Now that we have the pre-trained vectors, I am wondering how to align them with the vocabulary built from the training set and how to load them into the model.
Before (TEXT being the relevant Field in the previous abstraction),
align them with the vocabulary:

TEXT.build_vocab(train, vectors=pretrained_embeddings)

load them into the model:

model.embedding.weight.data.copy_(TEXT.vocab.vectors)

Now that the Field has been removed in the new abstraction, I am wondering how to do this without a Field. Thank you!

@zhangguanheng66
Contributor

In that case, you don't need to load the vectors into model.embedding. Instead, just convert a list of tokens into a tensor by calling the Vectors' __getitem__ func here and send the tensor to the model.
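
For illustration, using the classic torchtext.vocab.GloVe (the new vectors expose a similar per-token lookup; the dimension and the token list below are placeholders):

import torch
from torchtext.vocab import GloVe

vectors = GloVe(name="6B", dim=100)  # downloads and caches the pre-trained vectors

tokens = ["a", "46", "year", "old", "female", "patient"]
# One row per token via the vectors' __getitem__; unknown tokens map to zero vectors.
embedded = torch.stack([vectors[tok] for tok in tokens])  # shape: (len(tokens), 100)

# `embedded` can then be fed to the model directly, instead of copying
# the vectors into an nn.Embedding layer first.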

@xdwang0726
Author

Hi, I found that in the torchtext repo only the text classification tasks use the new dataset abstraction; for the other tasks, Fields still exist in the dataset setup (referred from here). If I want to use torchtext to create a dataset for a summarization task with BERT, which resource is better to refer to, the translation dataset? Thank you!

@zhangguanheng66
Contributor

I just merged the BERT pipeline under the example folder. #767

@xdwang0726
Author

It really helps! Thank you!
