Adding batch-size to dataset_splitter #120

Merged
ngreenwald merged 3 commits into master from pad_split on Oct 5, 2020

ngreenwald
Collaborator

The current dataset splitter returns exactly the number of images that each split ratio specifies. To keep the batch size constant during training, however, any split smaller than the batch size should have its images duplicated until the split reaches the batch size.

This PR adds an optional min_size parameter that applies to all splits. If a split produces fewer than min_size images, its images are duplicated until the split contains min_size images.
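
The padding behavior described above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR: the helper name pad_split_to_min_size, its signature, and the choice to cycle through the existing images in order are all assumptions, and the arrays are assumed to be NumPy arrays with the image index on the first axis.

```python
import numpy as np


def pad_split_to_min_size(X, y, min_size):
    """Duplicate images in a split until it holds at least min_size images.

    Hypothetical helper for illustration; not the function added in this PR.
    """
    n = X.shape[0]
    if n == 0:
        raise ValueError("cannot pad an empty split")
    if n >= min_size:
        # Split is already large enough; return it unchanged
        return X, y
    # Cycle through the existing indices (0, 1, ..., n-1, 0, 1, ...)
    # until min_size indices have been collected
    idx = np.arange(min_size) % n
    return X[idx], y[idx]


# A split of 3 images padded up to a batch size of 8
X = np.arange(3 * 4 * 4).reshape(3, 4, 4, 1)
y = np.arange(3)
X_pad, y_pad = pad_split_to_min_size(X, y, min_size=8)
print(X_pad.shape, y_pad.tolist())
```

Cycling through the indices keeps the duplicates spread as evenly as possible across the original images; sampling with replacement would be another reasonable choice.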

Contributor

willgraf left a comment

Looks good, just a small suggestion.

splits = [0.001, 0.3, 1]
ds = DatasetSplitter(splits=splits, seed=0)
split_dict = ds.split(train_dict=data_dict)
print(split_dict['0.001']['X'])

Remove the print statement.

Suggested change
- print(split_dict['0.001']['X'])

ngreenwald merged commit 7cbb95a into master Oct 5, 2020
ngreenwald deleted the pad_split branch October 5, 2020 21:10