Customized torchtext.data.Dataset takes much time to generate dataset #858

Closed
xdwang0726 opened this issue Jun 29, 2020 · 18 comments

@xdwang0726

❓ Questions and Help

Description
I wrote a customized data.Dataset for multi-label classification. When I processed the data, I found that generating the train and test sets with the customized dataset is very slow (about 1.5 s per example). I am wondering whether this is normal or whether something is wrong with my customized dataset.

The customized data.Dataset for multi-label classification is as follows:

from torchtext import data
from tqdm import tqdm


class TextMultiLabelDataset(data.Dataset):
    # NOTE: the n_labels argument is an assumed fix so that the dummy test
    # labels have a defined length (the original code left it undefined in test mode).
    def __init__(self, text, text_field, label_field, lbls=None, n_labels=None, **kwargs):
        # torchtext Field objects
        fields = [('text', text_field), ('label', label_field)]

        # Test mode: no labels provided, so all-zero label vectors are generated.
        is_test = lbls is None
        if not is_test:
            n_labels = len(lbls[0])  # length of one binary label vector

        examples = []
        for i, txt in enumerate(tqdm(text)):
            if not is_test:
                l = lbls[i]
            else:
                l = [0.0] * n_labels

            examples.append(data.Example.fromlist([txt, l], fields))

        super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)
where text is a list of document strings (each wrapped in a single-element list), and lbls is a list of binary label vectors (total number of labels ~20,000).

examples of text:

[["There are few factors more important to the mechanisms of evolution than stress. The stress response has formed as a result of natural selection..."], ["A 46-year-old female patient presenting with unspecific lower back pain, diffuse abdominal pain, and slightly elevated body temperature"], ...]

examples of lbls:

[[1 1 1 1 0 0 0 1 0 ...], [1 0 1 0 1 1 1 1 ...], ...]
@zhangguanheng66
Contributor

We are switching to a new dataset abstraction. Please take a look at the text classification datasets here. It should work for multi-label problems with a minor change to the _csv_iterator func.
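
For illustration only, a rough sketch of the kind of change meant here, assuming the CSV stores a binary label vector in its first n_labels columns and the text in the remaining columns (the column layout, the n_labels argument, and the name _multilabel_csv_iterator are assumptions, not part of torchtext):

import io

from torchtext.data.utils import get_tokenizer, ngrams_iterator
from torchtext.utils import unicode_csv_reader

def _multilabel_csv_iterator(data_path, ngrams, n_labels, yield_cls=False):
    # Variant of _csv_iterator: instead of a single class id in column 0,
    # the first n_labels columns hold a binary label vector.
    tokenizer = get_tokenizer("basic_english")
    with io.open(data_path, encoding="utf8") as f:
        reader = unicode_csv_reader(f)
        for row in reader:
            labels = [float(x) for x in row[:n_labels]]
            tokens = tokenizer(' '.join(row[n_labels:]))
            if yield_cls:
                yield labels, ngrams_iterator(tokens, ngrams)
            else:
                yield ngrams_iterator(tokens, ngrams)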

@xdwang0726
Author

xdwang0726 commented Jun 29, 2020

Hi @zhangguanheng66, thank you! I have checked the new dataset abstraction and tried it with my data. I didn't find much difference: with the old abstraction it processes 1.35 items/s, and that increases to 2.78 items/s with the new one. This is really problematic when dealing with a large dataset (my dataset contains ~14,000,000 examples, so it would take ~58 days).

@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 29, 2020

@xdwang0726 Thanks. Could you share a simple code snippet with us? I can run some benchmark cases on my side. It shouldn't take that long, since we also have some similarly lengthy datasets. You can create a PR and share the link here so I can test it.

@xdwang0726
Author

Thank you! Here's my code:

from torchtext import data

class TextMultiLabelDataset(data.Dataset):
    def __init__(self, df, text_field, label_field, txt_col, lbl_cols, **kwargs):
        # torchtext Field objects
        fields = [('text', text_field), ('label', label_field)]

        # Test mode: the label columns are absent from the DataFrame.
        is_test = lbl_cols[0] not in df.columns
        n_labels = len(lbl_cols)

        examples = []
        for idx, row in df.iterrows():
            if not is_test:
                lbls = [row[l] for l in lbl_cols]
            else:
                lbls = [0.0] * n_labels

            txt = str(row[txt_col])
            examples.append(data.Example.fromlist([txt, lbls], fields))

        super(TextMultiLabelDataset, self).__init__(examples, fields, **kwargs)

Sample df (a DataFrame that contains the text and the binary labels):

text                                 label1           label2          label3
"There are few factors....."          1                 0                 1
"A 46-year-old female...."            0                 1                 1    

txt_col and lbl_cols are the column names in the DataFrame.

I am not sure whether the above description is clear; if anything else is needed, please let me know. Thank you!
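
For illustration only, a minimal sketch of how the class above could be driven with the legacy torchtext API (the Field settings, the file name train.csv, and the batch size are assumptions, not from the thread):

import pandas as pd
import torch
from torchtext import data

# Placeholder Field setup; tokenizer and options are illustrative only.
TEXT = data.Field(sequential=True, tokenize=str.split, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False, batch_first=True, dtype=torch.float)

df = pd.read_csv("train.csv")  # assumed CSV with a text column and binary label columns
lbl_cols = ["label1", "label2", "label3"]

train_ds = TextMultiLabelDataset(df, TEXT, LABEL, txt_col="text", lbl_cols=lbl_cols)
TEXT.build_vocab(train_ds)

train_iter = data.BucketIterator(train_ds, batch_size=32, sort_key=lambda ex: len(ex.text))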

@zhangguanheng66
Contributor

zhangguanheng66 commented Jun 29, 2020

Do you mind sharing the code that uses the new dataset abstraction?

@xdwang0726
Author

xdwang0726 commented Jun 29, 2020

I am not sure whether I understand correctly. torchtext.data.Dataset inherits from torch.utils.data.Dataset, and my code inherits from torchtext.data.Dataset. Are you expecting a class that inherits directly from torch.utils.data.Dataset? Thanks!

@zhangguanheng66
Contributor

In the new dataset abstraction, TextClassificationDataset inherits from torch.utils.data.Dataset directly and we don't use Field anymore. https://github.com/pytorch/text/blob/master/torchtext/datasets/text_classification.py
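
To make the contrast concrete, here is a minimal sketch (not the actual torchtext code) of what a Field-free multi-label dataset can look like in the new style, assuming the text has already been numericalized with a Vocab; the class name and the (label, token ids) item layout are illustrative:

import torch
from torch.utils.data import Dataset

class MultiLabelTextDataset(Dataset):
    """Minimal Field-free dataset: each item is a (label_vector, token_ids) pair."""

    def __init__(self, vocab, data):
        # data: list of (label_vector tensor, token_ids tensor) pairs,
        # built beforehand from the raw text with `vocab` and a tokenizer.
        self._vocab = vocab
        self._data = data

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]

    def get_vocab(self):
        return self._vocab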

@xdwang0726
Author

xdwang0726 commented Jul 7, 2020

Thank you! I have modified the above-mentioned Python file to make it fit multi-label classification. I am wondering whether there is some sample code available that shows how to use it, for example, how to build the vocab using pre-trained embeddings and how to get train_iter and test_iter. Also, the current documentation is not up to date; it would be better to make clear that the abstraction of torchtext.data.Dataset has changed, and it would be a great help if new examples were provided. Thank you!

@xdwang0726
Author

Never mind, I found the examples here

@zhangguanheng66
Contributor

That one might not be the best resource (it's based on the legacy code). If you are using the new dataset abstraction, you can take a look at the text classification example here. See the train.py file, which shows how to use DataLoader.
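
Roughly in the spirit of that example, batching with DataLoader could look like the sketch below (the collate function is a hedged adaptation for the multi-label case, not the example's actual generate_batch; it assumes items shaped like the (label_vector, token_ids) pairs above and an nn.EmbeddingBag-style model):

import torch
from torch.utils.data import DataLoader

def collate_multilabel(batch):
    # batch: list of (label_vector, token_ids) pairs.
    labels = torch.stack([torch.as_tensor(lbl, dtype=torch.float) for lbl, _ in batch])
    texts = [tok for _, tok in batch]
    # Concatenate token ids and keep per-example offsets, as nn.EmbeddingBag expects.
    offsets = torch.tensor([0] + [len(t) for t in texts[:-1]]).cumsum(dim=0)
    text = torch.cat(texts)
    return text, offsets, labels

# train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
#                           collate_fn=collate_multilabel)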

@xdwang0726
Author

It helps, thank you! One last question: in the previous code, the Field had a build_vocab function that allowed loading pre-trained word2vec vectors using the vocab created from the training set, i.e. TEXT.build_vocab(train, vectors=vectors). Is there any function that does a similar thing, or is nn.Embedding.from_pretrained the right way? Thank you!

@zhangguanheng66
Contributor

These are our new pre-trained word vectors (FastText and GloVe) (link).
cc @Nayef211

@xdwang0726
Author

Thank you!

@xdwang0726
Author

Now that we have the pre-trained vectors, I am wondering how to align them with the vocabulary built from the training set and how to load them into the model.
Before (TEXT being the relevant Field in the previous abstraction),
align them with the vocabulary:

TEXT.build_vocab(train, vectors=pretrained_embeddings)

load them into the model:

model.embedding.weight.data.copy_(TEXT.vocab.vectors)

Now that the Field has been removed in the new abstraction, I am wondering how to do this without a Field. Thank you!

@zhangguanheng66
Contributor

In that case, you don't need to load the vectors into model.embedding. Instead, just convert a list of tokens into a tensor by calling the Vectors' __getitem__ func here and send the tensor to the model.
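
For illustration, using the classic torchtext.vocab.GloVe (the new vectors expose a similar per-token lookup; the dimension and the token list below are placeholders):

import torch
from torchtext.vocab import GloVe

vectors = GloVe(name="6B", dim=100)  # downloads and caches the pre-trained vectors

tokens = ["a", "46", "year", "old", "female", "patient"]
# One row per token via the vectors' __getitem__; unknown tokens map to zero vectors.
embedded = torch.stack([vectors[tok] for tok in tokens])  # shape: (len(tokens), 100)

# `embedded` can then be fed to the model directly, instead of copying
# the vectors into an nn.Embedding layer first.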

@xdwang0726
Author

Hi, I found that in the torchtext repo only the text classification tasks use the new dataset abstraction; for the other tasks, Fields still exist in the dataset setup (referred from here). If I want to use torchtext to create a dataset for a summarization task with BERT, which resource is better to refer to, the translation dataset? Thank you!

@zhangguanheng66
Contributor

I just merged the BERT pipeline under the example folder. #767

@xdwang0726
Author

It really helps! Thank you!
