Migrates PoS tagging infrastructure to a torch.utils.data.Dataset and implements PUNCT augmentation #1303

AngledLuffa · 2023-10-25T07:34:50Z

To make it easy in the long term to apply different kinds of augmentation to the PoS tagging data (so far, punctuation removal on some sentences), and to make the DataLoader more efficient and compatible with current PyTorch semantics, we rename models.pos.data.DataLoader to models.pos.data.Dataset, which has a to_loader() function that returns a torch.utils.data.DataLoader.
The data in this loader can be augmented dynamically per epoch (by having __getitem__ apply an augmentation with a certain probability each time it is called), because the PyTorch DataLoader fetches data lazily, only when it is accessed.
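
A minimal sketch of the pattern described above, not the code from this PR: `ToyTaggingDataset`, `PUNCT_ID`, `PAD_ID`, and `augment_prob` are hypothetical names, and dropping only a sentence-final punctuation token is an assumption about what "punctuation removal" means here.

```python
import random

import torch
from torch.utils.data import DataLoader, Dataset

PAD_ID = 0    # hypothetical padding id
PUNCT_ID = 3  # hypothetical tag id standing in for PUNCT


class ToyTaggingDataset(Dataset):
    """Illustrative stand-in for models.pos.data.Dataset (not the real class)."""

    def __init__(self, sentences, augment_prob=0.0, seed=None):
        # Each sentence is a list of (word_id, tag_id) pairs.
        self.sentences = sentences
        self.augment_prob = augment_prob
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        sentence = self.sentences[idx]
        # The augmentation decision is re-rolled on every access, so each
        # epoch can see a different version of the same sentence.
        if (len(sentence) > 1
                and sentence[-1][1] == PUNCT_ID
                and self.rng.random() < self.augment_prob):
            sentence = sentence[:-1]
        words = torch.tensor([w for w, _ in sentence], dtype=torch.long)
        tags = torch.tensor([t for _, t in sentence], dtype=torch.long)
        return words, tags

    @staticmethod
    def _collate(batch):
        # Pad the variable-length sentences of one batch to a common length.
        words, tags = zip(*batch)
        pad = torch.nn.utils.rnn.pad_sequence
        return (pad(words, batch_first=True, padding_value=PAD_ID),
                pad(tags, batch_first=True, padding_value=PAD_ID))

    def to_loader(self, **kwargs):
        # Items are fetched lazily through __getitem__ while iterating.
        return DataLoader(self, collate_fn=self._collate, **kwargs)


data = [[(5, 1), (6, 2), (7, PUNCT_ID)], [(8, 1), (9, PUNCT_ID)]]
loader = ToyTaggingDataset(data, augment_prob=0.5, seed=0).to_loader(batch_size=2)
for words, tags in loader:
    pass  # each pass over the loader re-applies the random augmentation
```

Because the augmentation lives in `__getitem__`, nothing has to be rebuilt between epochs; iterating the loader again is enough to draw a fresh set of augmented sentences.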

Debugged & squashed version of #1296

… implements PUNCT augmentation

To make it easy in the long term to apply different kinds of augmentation to the PoS tagging data (so far, punctuation removal on some sentences), and to make the DataLoader more efficient and compatible with current PyTorch semantics, we rename models.pos.data.DataLoader to models.pos.data.Dataset, which has a to_loader() function that returns a torch.utils.data.DataLoader.

The data in this loader can be augmented dynamically per epoch (by having __getitem__ apply an augmentation with a certain probability each time it is called), because the PyTorch DataLoader fetches data lazily, only when it is accessed.

batch_size is updated from counting words to counting sentences, since the torch DataLoader counts one sentence per __getitem__.

A few tests are added to make sure the augmentation happens 0%, 100%, or a reasonable number of times.
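
The three cases named in the commit message (never, always, a reasonable number of times) could be checked roughly as follows. This is a sketch in plain Python, not the tests added in this PR; `maybe_strip_final_punct` and `count_augmented` are hypothetical helpers, and the bounds for the 30% case are an arbitrary wide tolerance.

```python
import random

PUNCT = "PUNCT"


def maybe_strip_final_punct(sentence, prob, rng):
    """Drop a sentence-final PUNCT token with probability `prob` (hypothetical helper)."""
    if len(sentence) > 1 and sentence[-1][1] == PUNCT and rng.random() < prob:
        return sentence[:-1]
    return sentence


def count_augmented(sentences, prob, seed=1234):
    """Count how many sentences came back shorter than they went in."""
    rng = random.Random(seed)
    return sum(len(maybe_strip_final_punct(s, prob, rng)) < len(s) for s in sentences)


sentences = [[("Hello", "INTJ"), ("world", "NOUN"), ("!", PUNCT)]] * 1000

# 0%: augmentation must never fire.
assert count_augmented(sentences, 0.0) == 0
# 100%: augmentation must always fire.
assert count_augmented(sentences, 1.0) == 1000
# In between: the count should be near prob * N, within a generous margin.
assert 200 < count_augmented(sentences, 0.3) < 400
```

The wide margin in the last check keeps the test from failing on an unlucky seed while still catching an augmentation that is silently stuck at 0% or 100%.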

Jemoka and others added 3 commits October 24, 2023 19:01

Push the tensor creation a bit earlier, from _collate_fn to __getitem__. This cuts down on the number of copies needed to make the collated batch

d35c498
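
As an illustration of the general idea in this commit, not the actual change: if `__getitem__` already returns tensors, the collate function only needs to build the padded batch instead of first converting Python lists. `TensorPerItem` and `collate` below are hypothetical names, and padding with 0 is an assumption.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class TensorPerItem(Dataset):
    """Toy dataset that builds each item's tensor in __getitem__ (hypothetical)."""

    def __init__(self, sentences):
        self.sentences = sentences  # lists of integer word ids

    def __len__(self):
        return len(self.sentences)

    def __getitem__(self, idx):
        # The tensor is created once here, per item.
        return torch.tensor(self.sentences[idx], dtype=torch.long)


def collate(batch):
    # `batch` is already a list of 1-D tensors, so the only new tensors
    # are the lengths and the padded batch itself.
    lengths = torch.tensor([len(x) for x in batch])
    padded = torch.nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=0)
    return padded, lengths


loader = DataLoader(TensorPerItem([[4, 5, 6], [7, 8], [9]]),
                    batch_size=2, collate_fn=collate)
batches = list(loader)
```

Here `batch_size=2` groups two sentences per batch, which matches the sentence-based batch size described in the PR.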

These defaults work well for the new POS dataloader

94103c5

AngledLuffa merged commit d9fc52d into dev Oct 25, 2023
1 check passed

AngledLuffa deleted the pos_dataloader_5 branch October 25, 2023 07:35

AngledLuffa mentioned this pull request Oct 26, 2023

Unknown words can still result in punct tag at end of sentence #1000

Open
