memory leak in tf.keras.Model.predict #44711
https://stackoverflow.com/questions/64199384/tf-keras-model-predict-results-in-memory-leak
Seems to be an issue with tf.keras.Model.predict.
@plooney In your case, when I added one line to your code, the code stopped crashing. Please check the gist here. Thanks!
Please read 1, 2, 3, and 4. These resources should help you further. Please close the issue if this resolved it for you. Thanks!
@jvishnuvardhan Predict-in-a-loop is quite a recurrent issue; I triaged two such tickets just a few weeks ago.
@bhack Good point. I think updating one of the docs (tutorials/guides) would help resolve this kind of issue. Thanks!
@jvishnuvardhan Yes, we need to find a popular entry point in the docs, if any internal team member has stats about docs page views.
Also, a more specific "entry level" warning (instead of the generic function-retracing one) could be very useful for newcomers.
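A minimal sketch of the pattern such a docs warning would point newcomers at (the model and shapes here are illustrative, not from this thread): for repeated small-batch inference, call the model directly instead of invoking model.predict() inside a Python loop.

```python
import numpy as np
import tensorflow as tf

# Toy model purely for illustration.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])

for _ in range(100):
    x = np.random.rand(1, 4).astype(np.float32)
    # y = model.predict(x)         # grows memory when called repeatedly in a loop
    y = model(x, training=False)   # direct call: no per-call setup/teardown
```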
@jvishnuvardhan thanks for the clear explanation. If this were calling a The first two clues that suggest it are:
Investigating a little further you can find the So I agree that this is leaking memory somewhere. But I've confirmed that it's not the @tomerk, you're pretty familiar with this code, do you have any ideas?
I've just modified the code from the original Stack Overflow post for Colab:
It doesn't OOM with memory_profiler running because If you add a So that points towards an issue with cyclical garbage not getting cleaned up fast enough.
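If cyclical garbage really is the culprit, one mitigation (a sketch, not the fix the thread settled on) is to force a collection periodically inside the inference loop so cycles are freed before they pile up:

```python
import gc


def run_inference_loop(n_steps: int, collect_every: int = 100) -> int:
    """Run an inference loop, forcing GC periodically; returns collection count."""
    collections = 0
    for step in range(n_steps):
        # ... y = model.predict(x) would go here ...
        if step % collect_every == 0:
            gc.collect()  # break reference cycles before they accumulate
            collections += 1
    return collections
```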
Yes, it was one of the suspects.
Also at step
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
So? I understand a GPU OOM can happen simply because GPU memory is too small, but that is not the same as this leakage problem. The bug I hit is that after 7 hours of running model.predict() in a loop, there is a GPU OOM.
My attempts: 2.4.1: leak. How does the problem occur for me?
How did I solve it?
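For reference, one commonly suggested workaround in this situation (an assumption on my part, since the commenter's exact fix is elided above) is to reset the Keras backend state and collect garbage between batches of predict() calls. Note that clear_session() invalidates existing models, so reload or rebuild the model afterwards.

```python
import gc

import tensorflow as tf

# Reset Keras backend state (frees graphs/layers accumulated so far),
# then collect any remaining cyclic garbage. The model must be
# rebuilt or reloaded after this call.
tf.keras.backend.clear_session()
gc.collect()
```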
Lol, I've been fighting memory-leak problems in multiple TensorFlow services in PROD for years and have implemented various things, like watchers that check memory usage to gracefully restart our workers before they OOM-crash mid-job, and adding Just yesterday I finally discovered this issue. Maybe one can keep future users like me from wasting so much time/energy on this by adjusting "What's the difference between
🙂
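The "memory watcher" idea mentioned above could look something like this minimal standard-library sketch (names and the threshold are illustrative; a real deployment would hook should_restart() into the worker's job loop):

```python
import resource
import sys

MAX_RSS_MB = 4096  # restart threshold; tune per worker


def rss_mb() -> float:
    """Peak resident set size of this process in MiB.

    Linux reports ru_maxrss in KiB, macOS in bytes.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


def should_restart() -> bool:
    """True once the worker has grown past the restart threshold."""
    return rss_mb() > MAX_RSS_MB
```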
In case this helps: if the dataset fits in memory, then the following functions can replace the call to model.predict():

```python
from collections.abc import Iterator

import numpy as np
import tensorflow as tf


def generate_batches(
    x: np.ndarray | tf.Tensor, batch_size: int = 32
) -> Iterator[np.ndarray | tf.Tensor]:
    """Generate batches of test data for inference.

    Args:
        x (np.ndarray | tf.Tensor):
            Full test data set.
        batch_size (int, default=32):
            Batch size.

    Yields:
        np.ndarray | tf.Tensor:
            Batches of test data.
    """
    for index in range(0, x.shape[0], batch_size):
        yield x[index : index + batch_size]


def predict(
    model: tf.keras.Model,
    x: np.ndarray | tf.Tensor,
    batch_size: int = 32,
) -> np.ndarray:
    """Predict using generated batches of test data.

    - Used instead of model.predict() due to memory leaks.
    - https://github.com/tensorflow/tensorflow/issues/44711

    Args:
        model (tf.keras.Model):
            The model to use for prediction.
        x (np.ndarray | tf.Tensor):
            Full test data set.
        batch_size (int, default=32):
            Batch size.

    Returns:
        np.ndarray:
            Predictions on the test data.
    """
    y_batches = []
    for x_batch in generate_batches(x=x, batch_size=batch_size):
        y_batch = model(x_batch, training=False).numpy()
        y_batches.append(y_batch)
    return np.concatenate(y_batches)


# instead of
# y_pred = model.predict(x_test)
# use
y_pred = predict(model=model, x=x_test, batch_size=32)
```

Else, if the dataset does not fit in memory, then consider using tf.data.Dataset:

```python
AUTOTUNE = tf.data.AUTOTUNE
RANDOM_SEED = 42  # project-wide seed


def create_tf_dataset(
    data_split: str,
    x: np.ndarray,
    y: np.ndarray,
    batch_size: int,
    use_mixed_precision: bool,
) -> tf.data.Dataset:
    """Create a TensorFlow dataset.

    - Cache train data before shuffling for performance (consider full dataset size).
    - Shuffle train data to increase accuracy (not needed for validation or test data).
    - Batch train data after shuffling for unique batches at each epoch.
    - Cache test data after batching as batches can be the same between epochs.
    - End the pipeline with prefetching for performance.

    Args:
        data_split (str):
            The data split to create the dataset for.
            Supported are "train", "validation", and "test".
        x (np.ndarray):
            The feature data.
        y (np.ndarray):
            The target data.
        batch_size (int):
            The batch size.
        use_mixed_precision (bool):
            Whether to use mixed precision.

    Raises:
        ValueError: If the data split is not supported.

    Returns:
        tf.data.Dataset:
            The TensorFlow dataset.
    """
    if data_split not in {"train", "validation", "test"}:
        raise ValueError(f"Invalid data split: {data_split}")
    if use_mixed_precision:
        tf.keras.mixed_precision.set_global_policy("mixed_float16")
        x = x.astype(np.float16)
        y = y.astype(np.float16)
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    if data_split == "train":
        ds = ds.cache()
        tf.keras.utils.set_random_seed(RANDOM_SEED)
        ds = ds.shuffle(x.shape[0], seed=RANDOM_SEED)  # buffer = number of samples
        ds = ds.batch(batch_size)
    else:
        ds = ds.batch(batch_size)
        ds = ds.cache()
    ds = ds.prefetch(AUTOTUNE)
    return ds


# need to do this call separately on a machine with enough memory
ds_test = create_tf_dataset(
    data_split="test",
    x=x_test,
    y=y_test,
    batch_size=32,
    use_mixed_precision=True,
)
# then use it
y_pred = model.predict(ds_test)
```
@lukeconibear worked for me, thank you!
I had the same issue with .predict: running on about 50,000 inputs over several hours, I saw a leak of around 0.35 GB. I traced the leak back to the .predict method. I tried replacing it with the direct call method, which solved the memory leak but was slower by about 50%.
This is inefficient and can lead to memory leaks. See https://keras.io/api/models/model_training_apis/#predict-method and tensorflow/tensorflow#44711. The issue even leads to crashes in the test suite on GitHub for Keras 3.0 (maybe also because of the TensorFlow version used).
The idea is to avoid calling predict(), which is known not to be designed for small arrays and leads to memory leaks when used in loops. See https://keras.io/api/models/model_training_apis/#predict-method and tensorflow/tensorflow#44711. Use it in the wrapper instead of predict().
Switching to a tf.function-wrapped call worked for me:

```python
# model: tf.keras.Model
# x: np.ndarray
graph = tf.function(model)
# When processing large data, it is necessary to add logic to divide
# the data into small batches for processing.
result = graph(tf.convert_to_tensor(x))
```
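The batching logic mentioned in the comment above could be sketched like this (the model, shapes, and helper name are illustrative assumptions, not from the thread):

```python
import numpy as np
import tensorflow as tf

# Toy model purely for illustration.
model = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(2)])
graph = tf.function(model)  # compile once, reuse across batches


def batched_infer(x: np.ndarray, batch_size: int = 256) -> np.ndarray:
    """Feed the compiled graph in small batches and concatenate the results."""
    outs = [
        graph(tf.convert_to_tensor(x[i : i + batch_size])).numpy()
        for i in range(0, len(x), batch_size)
    ]
    return np.concatenate(outs)


y = batched_infer(np.random.rand(1000, 8).astype(np.float32))
```

Note that a partial final batch has a different shape and will trigger one extra trace of the tf.function; that is normally fine for two fixed shapes.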