Migrates PoS tagging infrastructure to a torch.utils.data.Dataset and implements PUNCT augmentation #1303

Merged
merged 3 commits into from Oct 25, 2023


AngledLuffa
Collaborator

To keep the PoS tagging data easy to augment in the long term (so far, the only augmentation is punctuation removal on some sentences), and to make data loading more efficient and compatible with current PyTorch semantics, we rename models.pos.data.DataLoader to models.pos.data.Dataset. The new class has a to_loader() function that returns a torch.utils.data.DataLoader.
The data in this loader can be augmented dynamically per epoch, because the PyTorch DataLoader fetches items lazily, only when they are accessed: __getitem__ can apply any augmentation with a certain chance each time it is called.
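
As a rough illustration of the pattern described above, here is a minimal sketch of a map-style Dataset whose __getitem__ randomly drops sentence-final punctuation and whose to_loader() wraps it in a torch.utils.data.DataLoader. The class name TaggedSentences, the augment_prob and seed parameters, and the (word, tag) tuple format are assumptions made for this example; they are not the actual models.pos.data.Dataset code.

```python
import random

from torch.utils.data import DataLoader, Dataset

FINAL_PUNCT = {".", "!", "?"}


class TaggedSentences(Dataset):
    """Hypothetical stand-in for a PoS dataset; not the real stanza class."""

    def __init__(self, sentences, augment_prob=0.0, seed=None):
        # Each sentence is a list of (word, tag) tuples (assumed format).
        self.sentences = sentences
        self.augment_prob = augment_prob
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        ends_in_punct = bool(sentence) and sentence[-1][0] in FINAL_PUNCT
        # The coin flip happens on every access, so each epoch can see a
        # different mix of augmented and unaugmented sentences.
        if ends_in_punct and self.rng.random() < self.augment_prob:
            return sentence[:-1]
        return sentence

    def to_loader(self, **kwargs):
        # collate_fn=list keeps each batch as a plain list of sentences;
        # a real implementation would pad and build tensors here.
        return DataLoader(self, collate_fn=list, **kwargs)
```

Because the DataLoader only calls __getitem__ while it is being iterated, looping over ds.to_loader(batch_size=2, shuffle=True) once per epoch re-rolls the augmentation each time without rebuilding the dataset.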

Debugged & squashed version of #1296

Jemoka and others added 3 commits October 24, 2023 19:01
… implements PUNCT augmentation

batch_size now counts sentences instead of words, since the torch DataLoader fetches one sentence per __getitem__ call.
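
A small sketch of what sentence-based batching means in practice: with a torch DataLoader, batch_size fixes the number of items (sentences) per batch, so the number of words per batch varies with sentence length. The toy sentences below are made up for the example.

```python
from torch.utils.data import DataLoader

# Toy data: each item is one sentence (a list of words).
sentences = [
    ["A", "very", "long", "sentence", "here", "."],
    ["Short", "."],
    ["Hi"],
]

# A plain list works as a map-style dataset; collate_fn=list avoids
# tensor conversion so the batches stay as lists of sentences.
loader = DataLoader(sentences, batch_size=2, collate_fn=list)

sentences_per_batch = [len(batch) for batch in loader]
words_per_batch = [sum(len(s) for s in batch) for batch in loader]

print(sentences_per_batch)  # [2, 1]
print(words_per_batch)      # [8, 1]
```

A batch_size value that was originally chosen as a word budget would therefore likely need to be revisited after this change.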

A few tests are added to make sure the augmentation happens 0% of the time, 100% of the time, or a reasonable fraction of the time, depending on the configured chance.
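
The kind of check described above could look roughly like the following sketch. The helper names maybe_strip_final_punct and count_augmented, the trial count, and the tolerance band are all assumptions for illustration; they are not the tests added in this PR.

```python
import random


def maybe_strip_final_punct(words, prob, rng):
    """Drop a sentence-final punctuation token with probability `prob`."""
    if words and words[-1] in {".", "!", "?"} and rng.random() < prob:
        return words[:-1]
    return words


def count_augmented(prob, trials=1000, seed=1234):
    """Count how many of `trials` calls actually removed the punctuation."""
    rng = random.Random(seed)
    sentence = ["This", "is", "a", "test", "."]
    return sum(
        len(maybe_strip_final_punct(sentence, prob, rng)) < len(sentence)
        for _ in range(trials)
    )


def test_augmentation_rates():
    assert count_augmented(0.0) == 0        # never augmented
    assert count_augmented(1.0) == 1000     # always augmented
    # For an intermediate chance, only assert a loose band around the
    # expected count so the test does not depend on one exact random draw.
    assert 300 < count_augmented(0.5) < 700


test_augmentation_rates()
```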
…__. This cuts down on the number of copies needed to make the collated batch
@AngledLuffa AngledLuffa merged commit d9fc52d into dev Oct 25, 2023
1 check passed
@AngledLuffa AngledLuffa deleted the pos_dataloader_5 branch October 25, 2023 07:35