Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sequence Labeling Dataset #157

Merged
merged 31 commits into from
Dec 23, 2017
Merged

Sequence Labeling Dataset #157

merged 31 commits into from
Dec 23, 2017

Conversation

sivareddyg
Copy link
Contributor

Very handy for working with sequences tagged with labels. Once this request is accepted, I will send a pull request of a sequence labeling task to pytorch/examples

@jekbradbury
Copy link
Contributor

Thanks so much, this is really useful!

I believe you don't need to provide a custom splits method; the standard Dataset.splits should work fine here. Also, we're using root as the name for the directory like .data and path as the name for the full dataset path, so your keyword args should be updated to agree with that.

@sivareddyg
Copy link
Contributor Author

@jekbradbury thanks! I have incorporated your suggestion. Travis CL tests pass.

@jekbradbury
Copy link
Contributor

What I mean is that you can delete the entire splits method because the parent class already has exactly the same implementation.

@sivareddyg
Copy link
Contributor Author

That makes sense. I removed it. Thanks!

@sivareddyg
Copy link
Contributor Author

@jekbradbury do you want me to do anything else for this to be merged? Feel free. The only difference with other datasets is I introduced load_default_dataset function. I can get rid of it but I thought this function is more transparent than loading default dataset using splits.

I plan to push sequence labeling task to pytorch/examples and would appreciate if this can be merged. Thanks!

@jekbradbury
Copy link
Contributor

Sorry, one last thing: I think the clearest approach would be to make SequenceLabelingDataset (or maybe call it TaggingDataset) an abstract class that doesn't have a "default" dataset, and then add a subclass UniversalDependenciesPOS that provides URLs/dirname attributes and a splits method with the default filenames that uses super to call Dataset.splits, along the lines of what we do for TranslationDatasets. I can make these changes and merge tomorrow if you want.

@sivareddyg
Copy link
Contributor Author

No worries, I can do this. Thanks!

@sivareddyg
Copy link
Contributor Author

I would prefer SequenceTagging or SequenceLabeling instead of just Tagging. What do you suggest?

@jekbradbury
Copy link
Contributor

I likeSequenceTaggingDataset

@jekbradbury jekbradbury merged commit 7a2e442 into pytorch:master Dec 23, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants