Customize torchtext.data.Dataset takes much time to generate dataset #858
Comments
We are switching to a new dataset abstraction. Please take a look at the text classification datasets here. It should work for multilabel problems with a minor change in …
Hi @zhangguanheng66, thank you! I have checked the new dataset abstraction and tried it with my data. I didn't find much difference: the old abstraction processes 1.35 items/s, and that only increases to 2.78 items/s with the new one. This is really problematic when dealing with large datasets (my dataset contains ~14,000,000 examples, so it seems it will take me ~58 days).
@xdwang0726 Thanks. Could you share a simple code snippet? I can run some benchmark cases on my side. It shouldn't take that long, as we also have some similarly lengthy datasets. You can create a PR and share the link here so I can test it.
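For throughput numbers like the 1.35 items/s mentioned above, a minimal benchmark sketch like the following can be used (the processing step here is just whitespace tokenization, standing in for whatever the real pipeline does per example):

```python
import time

def throughput(process_example, examples):
    """Measure processing speed in items/s for a per-example pipeline step."""
    start = time.perf_counter()
    for ex in examples:
        process_example(ex)
    elapsed = time.perf_counter() - start
    return len(examples) / elapsed

# Trivial stand-in processing step: whitespace tokenization.
examples = ["some text to process"] * 1000
rate = throughput(lambda ex: ex.split(), examples)
print(f"{rate:.2f} items/s")
```

Timing the per-example step in isolation like this helps separate preprocessing cost from dataset-construction overhead.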
Thank you! Here's my code:
sample df (a data frame containing text and labels in binary form)
txt_col and lbl_cols are the column names in the dataframe. I am not sure whether the above description is clear; if anything else is needed, please let me know. Thank you!
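Since the actual snippet is elided, here is a hedged sketch of the shape the setup above describes: one text column plus several binary label columns, converted into (text, multi-hot label) pairs. The `df` here is a plain dict of columns standing in for the dataframe; `txt_col` and `lbl_cols` match the names in the comment:

```python
# Minimal sketch, assuming `df` is a mapping of column name -> list of
# values, with one text column and several binary label columns.
def rows_to_examples(df, txt_col, lbl_cols):
    texts = df[txt_col]
    examples = []
    for i, text in enumerate(texts):
        labels = [df[c][i] for c in lbl_cols]  # multi-hot label vector
        examples.append((text, labels))
    return examples

df = {
    "text": ["good movie", "bad plot"],
    "action": [1, 0],
    "drama": [0, 1],
}
examples = rows_to_examples(df, "text", ["action", "drama"])
# examples[0] == ("good movie", [1, 0])
```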
Do you mind sharing the code with the new dataset abstraction?
I am not sure whether I understand correctly. The …
In the new dataset abstraction, …
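The maintainer's description of the new abstraction is truncated here, but the general pattern it follows is: iterate over raw text to count tokens, build a vocab (string to index), then numericalize each example. A dependency-free sketch of that pattern (the `<unk>`/`<pad>` specials and whitespace tokenizer are illustrative assumptions, not the library's exact API):

```python
from collections import Counter

def build_vocab(texts, tokenizer=str.split, specials=("<unk>", "<pad>")):
    """Count tokens in raw text and map each token to an integer index."""
    counter = Counter()
    for text in texts:
        counter.update(tokenizer(text))
    itos = list(specials) + [tok for tok, _ in counter.most_common()]
    return {tok: i for i, tok in enumerate(itos)}

def numericalize(text, vocab, tokenizer=str.split, unk="<unk>"):
    """Convert raw text to a list of vocab indices, mapping OOV to <unk>."""
    return [vocab.get(tok, vocab[unk]) for tok in tokenizer(text)]

vocab = build_vocab(["hello world", "hello there"])
ids = numericalize("hello unseen", vocab)
```

The key difference from the field-based legacy API is that tokenization and numericalization are explicit steps under the user's control rather than hidden inside a Field object.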
Thank you! I have modified the above-mentioned Python file to make it fit multi-label classification. I am wondering whether there is some sample code showing how to use it, for example, how to build a vocab using pre-trained embeddings and how to get …
Never mind, I found the examples here |
That one might not be the best resource (it's based on the legacy code). If you are using the new dataset abstraction, you can take a look at this text classification example here. See …
It helps, thank you! One last question: in the previous code, the field has a …
These are our new pre-trained word vectors (FastText and GloVe) (link).
Thank you! |
Once we have the pretrained vectors, I am wondering how to align them with the vocabulary built from the training set, and how to load them into the model?
load to the model:
Now that the field has been removed in the new abstraction, I am wondering how to do these things without a field? Thank you!
In that case, you don't need to load vectors into …
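The alignment step asked about above can be sketched without any field object: build a matrix with one row per vocab index, filling in the pretrained vector where the token is known and leaving zeros otherwise. This is a minimal illustration, not torchtext's API; the resulting matrix could then be copied into an embedding layer, e.g. with `embedding.weight.data.copy_(torch.tensor(matrix))`:

```python
def align_vectors(vocab, pretrained, dim):
    """Build an embedding matrix ordered by vocab index.

    vocab: token -> index; pretrained: token -> vector of length dim.
    Tokens absent from the pretrained table get zero vectors.
    """
    matrix = [[0.0] * dim for _ in range(len(vocab))]
    for tok, idx in vocab.items():
        if tok in pretrained:
            matrix[idx] = list(pretrained[tok])
    return matrix

vocab = {"<unk>": 0, "hello": 1, "world": 2}
pretrained = {"hello": [0.1, 0.2], "there": [0.3, 0.4]}
matrix = align_vectors(vocab, pretrained, dim=2)
# matrix[1] gets "hello"'s vector; matrix[2] stays all zeros ("world" is OOV)
```

Zero-initializing OOV rows is one common choice; small random vectors are another.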
Hi, I found that in the torchtext repo only the text classification tasks use the new dataset abstraction; for the other tasks, fields still exist in the dataset settings (referred from here). I am wondering, if I want to use torchtext to create a dataset for a summarization task with BERT, which resource is better to refer to, the translation dataset? Thank you!
I just merged the BERT pipeline under the example folder. #767 |
It really helps! Thank you! |
❓ Questions and Help
Description
I wrote a customized data.Dataset for multilabel classification. When I processed the data, I found that it is very slow to generate the train and test sets using the customized dataset (about 1.5 s per example). I am wondering whether this is normal or something is wrong with my customized dataset.
The customized data.Dataset for multilabel classification is as follows:
examples of text:
examples of lbls:
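The text and label examples themselves are elided above, but for multilabel classification the labels are typically encoded as multi-hot vectors. A minimal sketch of that encoding (the label names and `label_to_index` mapping are illustrative, not from the original post):

```python
def multi_hot(labels, label_to_index):
    """Encode a list of label names as a binary (multi-hot) vector."""
    vec = [0] * len(label_to_index)
    for lbl in labels:
        vec[label_to_index[lbl]] = 1
    return vec

label_to_index = {"sports": 0, "politics": 1, "tech": 2}
vec = multi_hot(["sports", "tech"], label_to_index)
# vec == [1, 0, 1]
```

A vector like this can serve directly as the target for a per-label binary loss such as binary cross-entropy.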