
Unexpected steps_per_epoch behavior in model.fit #64076

Closed
varshad18 opened this issue Mar 20, 2024 · 8 comments
Assignees
Labels
comp:apis (Highlevel API related issues) · stale (to be closed automatically if no activity) · stat:awaiting response (awaiting response from author) · TF 2.15 (issues related to 2.15.x) · type:bug

Comments

@varshad18
Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

2.15.0

Custom code

Yes

OS platform and distribution

Windows 11 x64

Mobile device

No response

Python version

3.10.12

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

No response

GPU model and memory

No response

Current behavior?

According to the documentation for the Args of the fit method:

  • steps_per_epoch can be an Integer or None.
  • The default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined.

So when I run model.fit without steps_per_epoch assigned, and without validation_steps, the model trains through all epochs without error, but it uses a very small number of steps per epoch.

In my case:

  • train size: 714
  • valid size: 89
  • batch size: 4
  • steps_per_epoch should be 178 (train_size // batch_size)
  • validation_steps should be 22 (valid_size // batch_size)

But it trains only for 23 steps each epoch.

Questions

  1. Why does it choose 23?
  2. How does it use the data batches, as in does it select the first 23 batches only for every epoch or does it shuffle randomly?
  3. Is all the data being trained?
  4. Why do I run into an error when I set steps_per_epoch and validation_steps explicitly?
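One plausible answer to the first question (my assumption, not confirmed in this thread): when x is a list of NumPy arrays and batch_size is not passed to fit, Keras falls back to its documented default of batch_size=32, which yields exactly 23 steps for 714 samples. A quick arithmetic sketch:

```python
import math

def fit_steps(num_samples, batch_size=None):
    # Sketch of how keras.Model.fit counts steps for NumPy-array inputs:
    # when batch_size is not supplied, fit defaults to 32.
    bs = batch_size if batch_size is not None else 32
    return math.ceil(num_samples / bs)

print(fit_steps(714))     # batch_size omitted -> 23 steps, matching the log
print(fit_steps(714, 4))  # explicit batch_size=4 -> 179 steps
```

If this is what happened, passing batch_size=4 to fit (rather than steps_per_epoch) would make the displayed step count match the expected 178–179 range.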

Standalone code to reproduce the issue

epochs = 300
BATCH_SIZE = 4
train_size = trainImageTotalData
valid_size = validationImageTotalData
steps_per_epoch = (train_size // BATCH_SIZE) - 1
validation_steps = (valid_size // BATCH_SIZE) - 1
hist = model.fit(
    # images
    x=[trainNumericData, trainImagesSBData, trainImagesCBData, trainImagesWBData, trainImagesHBData, trainImagesLLData, trainImagesLBData, trainImagesUpLeftABData, trainImagesUpRightABData, trainImagesALeftLData, trainImagesARightLData],
    # severity target regression value
    y=trainAllRegressionData,
    epochs=epochs,
    # steps_per_epoch=steps_per_epoch,
    validation_data=([validationNumericData, validationImagesSBData, validationImagesCBData, validationImagesWBData, validationImagesHBData, validationImagesLLData, validationImagesLBData, validationImagesUpLeftABData, validationImagesUpRightABData, validationImagesALeftLData, validationImagesARightLData], validationAllRegressionData),
    # validation_steps=validation_steps,
    # callbacks for both checkpoints and TensorBoard
    callbacks=[mc, tensorboard_callback],
).history

Relevant log output

train size 714
valid size 89
batch size 4
steps_per_epoch =177
validation_steps =21
Steps per epoch for training is 177
[INFO] training model...
Epoch 1/10
23/23 [==============================] - ETA: 0s - loss: 5.9347
Epoch 1: val_loss improved from inf to 5.02212, saving model to /content/drive/MyDrive/FashionBody/Regression/TrainingRun/Run4/checkpoint-01-5.02.tf
23/23 [==============================] - 585s 8s/step - loss: 5.9347 - val_loss: 5.0221
Epoch 2/10
23/23 [==============================] - ETA: 0s - loss: 4.7514
Epoch 2: val_loss improved from 5.02212 to 5.01143, saving model to /content/drive/MyDrive/FashionBody/Regression/TrainingRun/Run4/checkpoint-02-5.01.tf
@google-ml-butler google-ml-butler bot added the type:bug Bug label Mar 20, 2024
@sushreebarsa sushreebarsa added comp:apis Highlevel API related issues TF 2.15 For issues related to 2.15.x labels Mar 21, 2024
@sushreebarsa
Contributor

@varshad18 Could you please double-check your calculation for steps_per_epoch? Kindly ensure it accounts for the total number of samples in your dataset and the batch size.
To expedite troubleshooting, please provide a code snippet that reproduces the issue reported here. Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Mar 21, 2024
@NBCBM

NBCBM commented Mar 21, 2024

epochs = 300
BATCH_SIZE = 4
train_size = trainImageTotalData
valid_size = validationImageTotalData

# Calculate steps per epoch and validation steps
steps_per_epoch = train_size // BATCH_SIZE
validation_steps = valid_size // BATCH_SIZE

# Optionally, you can adjust steps_per_epoch and validation_steps based on
# whether your dataset is shuffled. If your dataset is shuffled during
# training, you might want to leave steps_per_epoch as None and let fit
# determine the number of steps from the dataset size and batch size.
hist = model.fit(
    x=[trainNumericData, trainImagesSBData, trainImagesCBData, trainImagesWBData, trainImagesHBData, trainImagesLLData, trainImagesLBData, trainImagesUpLeftABData, trainImagesUpRightABData, trainImagesALeftLData, trainImagesARightLData],
    y=trainAllRegressionData,
    epochs=epochs,
    validation_data=([validationNumericData, validationImagesSBData, validationImagesCBData, validationImagesWBData, validationImagesHBData, validationImagesLLData, validationImagesLBData, validationImagesUpLeftABData, validationImagesUpRightABData, validationImagesALeftLData, validationImagesARightLData], validationAllRegressionData),
    steps_per_epoch=steps_per_epoch,    # set steps_per_epoch
    validation_steps=validation_steps,  # set validation_steps
    callbacks=[mc, tensorboard_callback],
).history

# Instead of manually setting steps_per_epoch and validation_steps, you can
# let fit determine them automatically from the dataset size and batch size:
hist = model.fit(
    x=[trainNumericData, trainImagesSBData, trainImagesCBData, trainImagesWBData, trainImagesHBData, trainImagesLLData, trainImagesLBData, trainImagesUpLeftABData, trainImagesUpRightABData, trainImagesALeftLData, trainImagesARightLData],
    y=trainAllRegressionData,
    epochs=epochs,
    validation_data=([validationNumericData, validationImagesSBData, validationImagesCBData, validationImagesWBData, validationImagesHBData, validationImagesLLData, validationImagesLBData, validationImagesUpLeftABData, validationImagesUpRightABData, validationImagesALeftLData, validationImagesARightLData], validationAllRegressionData),
    callbacks=[mc, tensorboard_callback],
).history

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 21, 2024
@sushreebarsa
Contributor

@varshad18 Could you please share the complete code in a notebook or gist to replicate the issue reported here?
Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Mar 27, 2024
@varshad18
Author

@sushreebarsa I double-checked my calculation for steps_per_epoch and tried using the following formula:

BATCH_SIZE = 4
train_size = trainImageTotalData
valid_size = validationImageTotalData

print("train size " + str(train_size))
print("valid size " + str(valid_size))
print("batch size " + str(BATCH_SIZE))
steps_per_epoch = (train_size / BATCH_SIZE)
validation_steps = (valid_size / BATCH_SIZE)
print("steps_per_epoch ="+str(steps_per_epoch))
print("validation_steps ="+str(validation_steps))

steps_per_epoch = math.ceil(steps_per_epoch)
#steps_per_epoch=steps_per_epoch-1
validation_steps = math.ceil(validation_steps)
#validation_steps=validation_steps-1
print("steps_per_epoch ="+str(steps_per_epoch))
print("validation_steps ="+str(validation_steps))

train size 714
valid size 89
batch size 4
steps_per_epoch =178.5
validation_steps =22.25
steps_per_epoch =179
validation_steps =23

This worked for me; it trains for all 179 steps with no errors. But the more common approach is simply to exclude the last incomplete batch from training during an epoch, and when I try that by passing (steps_per_epoch - 1) I get the following error:

KeyError: 'Failed to format this callback filepath: "/content/drive/MyDrive/FashionBody/Regression/TrainingRun/300Run2.0/checkpoint-{epoch:02d}-{val_loss:.2f}.tf". Reason: 'val_loss''

Is it okay to train 179 steps according to my train size? Or am I doing something wrong?
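For reference, the two rounding conventions give different step counts for this dataset (a sketch of the arithmetic only; only the sizes are taken from the thread):

```python
import math

train_size, batch_size = 714, 4

floor_steps = train_size // batch_size           # drop the incomplete last batch
ceil_steps = math.ceil(train_size / batch_size)  # include the partial batch

print(floor_steps)  # 178 full batches (2 samples left over)
print(ceil_steps)   # 179 steps (the final batch holds only 2 samples)
```

Separately, the KeyError above usually means val_loss was absent from that epoch's logs when ModelCheckpoint tried to format the checkpoint filename, which typically happens when the validation pass does not complete for that epoch; that is worth checking independently of the rounding choice.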

@google-ml-butler google-ml-butler bot removed the stat:awaiting response Status - Awaiting response from author label Mar 27, 2024
@sushreebarsa
Contributor

@varshad18 Could you please confirm whether you are still using Keras 2? If so, please migrate to Keras 3 and follow the documentation here. Thank you!

@sushreebarsa sushreebarsa added the stat:awaiting response Status - Awaiting response from author label Apr 3, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

@github-actions github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Apr 11, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

