
Conversation


@ghost ghost commented Jul 31, 2018

Handling the case where the number of samples is not a multiple of batch_size, avoiding wasting samples

Checklist

  • I've tested that my changes are compatible with the latest version of TensorFlow.
  • I've read the Contribution Guidelines
  • I've updated the documentation if necessary.

Motivation and Context

Description

Handling the case where the number of samples is not a multiple of batch_size, avoiding wasting samples
@DEKHTIARJonathan
Member

Please update the Changelog with your changes.

@DEKHTIARJonathan
Member

DEKHTIARJonathan commented Jul 31, 2018

I have tested this PR and it works. However, I would like your opinion before merging. @zsdonghao @luomai @lgarithm

Let's consider this simple script:

import numpy as np
import tensorlayer as tl

data = np.random.random((1050, 100))
y = np.random.random((1050, ))

i = 0
total_data = 0

for batch in tl.iterate.minibatches(data, y, batch_size=100, shuffle=True):
    print("Batch ID: %d - Batch Size: %d" % (i, batch[0].shape[0]))
    i += 1

    total_data += batch[0].shape[0]

print("Total Data: %d" % total_data)

Output with current TensorLayer

Batch ID: 0 - Batch Size: 100
Batch ID: 1 - Batch Size: 100
Batch ID: 2 - Batch Size: 100
Batch ID: 3 - Batch Size: 100
Batch ID: 4 - Batch Size: 100
Batch ID: 5 - Batch Size: 100
Batch ID: 6 - Batch Size: 100
Batch ID: 7 - Batch Size: 100
Batch ID: 8 - Batch Size: 100
Batch ID: 9 - Batch Size: 100
Total Data: 1000

Output with this PR work

Batch ID: 0 - Batch Size: 100
Batch ID: 1 - Batch Size: 100
Batch ID: 2 - Batch Size: 100
Batch ID: 3 - Batch Size: 100
Batch ID: 4 - Batch Size: 100
Batch ID: 5 - Batch Size: 100
Batch ID: 6 - Batch Size: 100
Batch ID: 7 - Batch Size: 100
Batch ID: 8 - Batch Size: 100
Batch ID: 9 - Batch Size: 100
Batch ID: 10 - Batch Size: 50
Total Data: 1050
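
For reference, the behaviour this PR produces can be sketched with a minimal iterator (this is an illustrative reconstruction, not the actual TensorLayer source; the real `tl.iterate.minibatches` has a different signature and more checks):

```python
import numpy as np

def minibatches_with_remainder(inputs, targets, batch_size, shuffle=False):
    """Yield (inputs, targets) minibatches, including a final partial batch.

    Sketch only: reconstructs the behaviour described in this PR, where the
    trailing samples are yielded as a smaller batch instead of being dropped.
    """
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    if shuffle:
        np.random.shuffle(indices)
    # range(0, n, batch_size) naturally covers the trailing remainder:
    # the last slice is simply shorter than batch_size.
    for start in range(0, len(inputs), batch_size):
        excerpt = indices[start:start + batch_size]
        yield inputs[excerpt], targets[excerpt]
```

With 1050 samples and `batch_size=100`, this yields ten batches of 100 and one final batch of 50, matching the output above.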

My question

Usually people don't really care about losing a small number of samples (the dataset is very large), and the dataset should be shuffled at the beginning of each epoch.
Is it actually a good thing to enforce a smaller batch at the end (potentially of size 1) when the number of samples is not a multiple of batch_size?

I believe, though I might be wrong, that the version currently in TensorLayer is more robust than the one proposed in this PR, and closer to standard practice in Deep Learning. But I genuinely have doubts ...

I'm actually puzzled by this situation. What do you think is best?

@zsdonghao
Member

Returning a different batch size may lead to errors when users set a fixed batch size in the input placeholder (many people do that),
so I think we should add an argument (e.g. is_dynamic_batch_size) to the iterate API, and set it to False by default.
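
The suggested flag could look roughly like this (a hedged sketch, not the merged code; the actual signature and defaults in `iterate.py` may differ):

```python
import numpy as np

def minibatches(inputs, targets, batch_size, shuffle=False,
                is_dynamic_batch_size=False):
    """Sketch of the iterate API with the proposed flag.

    is_dynamic_batch_size=False (default) keeps the current behaviour:
    the trailing remainder is dropped, so every batch has a fixed size
    and works with fixed-size input placeholders.
    is_dynamic_batch_size=True yields the remainder as a smaller batch.
    """
    assert len(inputs) == len(targets)
    indices = np.arange(len(inputs))
    if shuffle:
        np.random.shuffle(indices)
    if is_dynamic_batch_size:
        end = len(inputs)                                # keep partial batch
    else:
        end = (len(inputs) // batch_size) * batch_size   # drop remainder
    for start in range(0, end, batch_size):
        excerpt = indices[start:start + batch_size]
        yield inputs[excerpt], targets[excerpt]
```

Defaulting to False preserves backward compatibility for users who rely on a constant batch shape.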

@DEKHTIARJonathan
Member

@aloooha can you apply the changes suggested by @zsdonghao?

Jonathan DEKHTIAR and others added 2 commits July 31, 2018 16:45
add an argument 'is_dynamic_batch_size' in iterate API.
@ghost
Author

ghost commented Aug 1, 2018

@DEKHTIARJonathan I have changed the code and added an optional argument 'is_dynamic_batch_size'.

@DEKHTIARJonathan DEKHTIARJonathan merged commit 31e4c8e into tensorlayer:master Aug 1, 2018
luomai pushed a commit that referenced this pull request Nov 21, 2018
* Update iterate.py

Handling the case where the number of samples is not a multiple of batch_size, avoiding wasting samples

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update iterate.py

add an argument 'is_dynamic_batch_size' in iterate API.

* Update iterate.py
3 participants