
Unexpected steps_per_epoch behavior in model.fit #64076

Closed
varshad18 opened this issue Mar 20, 2024 · 8 comments
Assignees
Labels
comp:apis (Highlevel API related issues) · stale (to be closed automatically if no activity) · stat:awaiting response (awaiting response from author) · TF 2.15 (issues related to 2.15.x) · type:bug

Comments

@varshad18
Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

2.15.0

Custom code

Yes

OS platform and distribution

Windows 11 x64

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

According to the documentation for the Args of the fit method:

  • steps_per_epoch can be an Integer or None.
  • The default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined.

So when I run model.fit without steps_per_epoch assigned, and without validation_steps, the model trains through all epochs without error, but it uses a very small number of steps per epoch.

In my case:

  • train size: 714
  • valid size: 89
  • batch size: 4
  • steps_per_epoch should be 178 (train_size // batch_size)
  • validation_steps should be 22 (valid_size // batch_size)

But it trains only for 23 steps each epoch.

Questions

  1. Why does it choose 23?
  2. How does it use the data batches, as in does it select the first 23 batches only for every epoch or does it shuffle randomly?
  3. Is all the data being trained?
  4. Why do I run into an error when I set steps_per_epoch and validation_steps explicitly?
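One plausible answer to the first question (my assumption, not confirmed in this thread): when x is a list of NumPy arrays and batch_size is not passed to fit, Keras falls back to its documented default of batch_size=32, which yields exactly 23 steps for 714 samples. A quick arithmetic sketch:

```python
import math

def fit_steps(num_samples, batch_size=None):
    # Sketch of how keras.Model.fit counts steps for NumPy-array inputs:
    # when batch_size is not supplied, fit defaults to 32.
    bs = batch_size if batch_size is not None else 32
    return math.ceil(num_samples / bs)

print(fit_steps(714))     # batch_size omitted -> 23 steps, matching the log
print(fit_steps(714, 4))  # explicit batch_size=4 -> 179 steps
```

If this is what happened, passing batch_size=4 to fit (rather than steps_per_epoch) would make the displayed step count match the expected 178–179 range.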

Standalone code to reproduce the issue

epochs = 300
BATCH_SIZE = 4
train_size = trainImageTotalData
valid_size = validationImageTotalData
steps_per_epoch = (train_size // BATCH_SIZE) - 1
validation_steps = (valid_size // BATCH_SIZE) - 1
hist = model.fit(
    # images
    x=[trainNumericData, trainImagesSBData, trainImagesCBData, trainImagesWBData, trainImagesHBData, trainImagesLLData, trainImagesLBData, trainImagesUpLeftABData, trainImagesUpRightABData, trainImagesALeftLData, trainImagesARightLData],
    # severity target regression value
    y=trainAllRegressionData,
    epochs=epochs,
    # steps_per_epoch=steps_per_epoch,
    validation_data=([validationNumericData, validationImagesSBData, validationImagesCBData, validationImagesWBData, validationImagesHBData, validationImagesLLData, validationImagesLBData, validationImagesUpLeftABData, validationImagesUpRightABData, validationImagesALeftLData, validationImagesARightLData], validationAllRegressionData),
    # validation_steps=validation_steps,
    # callbacks for both checkpoints and TensorBoard
    callbacks=[mc, tensorboard_callback],
).history

Relevant log output

train size 714
valid size 89
batch size 4
steps_per_epoch =177
validation_steps =21
Steps per epoch for training is 177
[INFO] training model...
Epoch 1/10
23/23 [==============================] - ETA: 0s - loss: 5.9347
Epoch 1: val_loss improved from inf to 5.02212, saving model to /content/drive/MyDrive/FashionBody/Regression/TrainingRun/Run4/checkpoint-01-5.02.tf
23/23 [==============================] - 585s 8s/step - loss: 5.9347 - val_loss: 5.0221
Epoch 2/10
23/23 [==============================] - ETA: 0s - loss: 4.7514
Epoch 2: val_loss improved from 5.02212 to 5.01143, saving model to /content/drive/MyDrive/FashionBody/Regression/TrainingRun/Run4/checkpoint-02-5.01.tf
@google-ml-butler google-ml-butler bot added the type:bug Bug label Mar 20, 2024
@sushreebarsa sushreebarsa added comp:apis Highlevel API related issues TF 2.15 For issues related to 2.15.x labels Mar 21, 2024
@sushreebarsa
Contributor

@varshad18 Could you please double-check your calculation for steps_per_epoch? Kindly ensure it accounts for the total number of samples in your dataset and the batch size.
To expedite troubleshooting, please provide a code snippet that reproduces the issue reported here. Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Mar 21, 2024
@NBCBM

NBCBM commented Mar 21, 2024

epochs = 300
BATCH_SIZE = 4
train_size = trainImageTotalData
valid_size = validationImageTotalData

# Calculate steps per epoch and validation steps
steps_per_epoch = train_size // BATCH_SIZE
validation_steps = valid_size // BATCH_SIZE

# Optionally, you can adjust steps_per_epoch and validation_steps based on
# whether your dataset is shuffled. If your dataset is shuffled during
# training, you might want to leave steps_per_epoch as None and let fit
# determine the number of steps from the dataset size and batch size.
hist = model.fit(
    x=[trainNumericData, trainImagesSBData, trainImagesCBData, trainImagesWBData, trainImagesHBData, trainImagesLLData, trainImagesLBData, trainImagesUpLeftABData, trainImagesUpRightABData, trainImagesALeftLData, trainImagesARightLData],
    y=trainAllRegressionData,
    epochs=epochs,
    validation_data=([validationNumericData, validationImagesSBData, validationImagesCBData, validationImagesWBData, validationImagesHBData, validationImagesLLData, validationImagesLBData, validationImagesUpLeftABData, validationImagesUpRightABData, validationImagesALeftLData, validationImagesARightLData], validationAllRegressionData),
    steps_per_epoch=steps_per_epoch,    # set steps_per_epoch
    validation_steps=validation_steps,  # set validation_steps
    callbacks=[mc, tensorboard_callback],
).history

# Instead of manually setting steps_per_epoch and validation_steps, you can
# let fit determine them automatically from the dataset size and batch size:
hist = model.fit(
    x=[trainNumericData, trainImagesSBData, trainImagesCBData, trainImagesWBData, trainImagesHBData, trainImagesLLData, trainImagesLBData, trainImagesUpLeftABData, trainImagesUpRightABData, trainImagesALeftLData, trainImagesARightLData],
    y=trainAllRegressionData,
    epochs=epochs,
    validation_data=([validationNumericData, validationImagesSBData, validationImagesCBData, validationImagesWBData, validationImagesHBData, validationImagesLLData, validationImagesLBData, validationImagesUpLeftABData, validationImagesUpRightABData, validationImagesALeftLData, validationImagesARightLData], validationAllRegressionData),
    callbacks=[mc, tensorboard_callback],
).history

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 21, 2024
@sushreebarsa
Contributor

@varshad18 Could you please share the complete code in a notebook or gist to replicate the issue reported here?
Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Mar 27, 2024
@varshad18
Author

@sushreebarsa I double-checked my calculation for steps_per_epoch and tried using the following formula:

BATCH_SIZE = 4
train_size = trainImageTotalData
valid_size = validationImageTotalData

print("train size " + str(train_size))
print("valid size " + str(valid_size))
print("batch size " + str(BATCH_SIZE))
steps_per_epoch = (train_size / BATCH_SIZE)
validation_steps = (valid_size / BATCH_SIZE)
print("steps_per_epoch ="+str(steps_per_epoch))
print("validation_steps ="+str(validation_steps))

steps_per_epoch = math.ceil(steps_per_epoch)
#steps_per_epoch=steps_per_epoch-1
validation_steps = math.ceil(validation_steps)
#validation_steps=validation_steps-1
print("steps_per_epoch ="+str(steps_per_epoch))
print("validation_steps ="+str(validation_steps))

train size 714
valid size 89
batch size 4
steps_per_epoch =178.5
validation_steps =22.25
steps_per_epoch =179
validation_steps =23

This worked for me; it trains for all 179 steps with no errors. But the more common approach is simply to exclude the last incomplete batch from training during an epoch, and when I try that by passing (steps_per_epoch - 1) I get the following error:

KeyError: 'Failed to format this callback filepath: "/content/drive/MyDrive/FashionBody/Regression/TrainingRun/300Run2.0/checkpoint-{epoch:02d}-{val_loss:.2f}.tf". Reason: 'val_loss''

Is it okay to train 179 steps according to my train size? Or am I doing something wrong?
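For reference, the two rounding conventions give different step counts for this dataset (a sketch of the arithmetic only; only the sizes are taken from the thread):

```python
import math

train_size, batch_size = 714, 4

floor_steps = train_size // batch_size           # drop the incomplete last batch
ceil_steps = math.ceil(train_size / batch_size)  # include the partial batch

print(floor_steps)  # 178 full batches (2 samples left over)
print(ceil_steps)   # 179 steps (the final batch holds only 2 samples)
```

Separately, the KeyError above usually means val_loss was absent from that epoch's logs when ModelCheckpoint tried to format the checkpoint filename, which typically happens when the validation pass does not complete for that epoch; that is worth checking independently of the rounding choice.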

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 27, 2024
@sushreebarsa
Contributor

@varshad18 Could you please confirm whether you are still using Keras 2? If so, please migrate to Keras 3 and follow the documentation here. Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Apr 3, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Apr 11, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

