Loading model from local in RNN prediction is slower than from HDFS due to page fault #16856

kdmxen · 2018-02-08T07:49:10Z

We have trained an RNN model and use it for prediction. We feed it data and measure QPS during prediction. We find that once CPU usage goes above 30%, QPS stays at around 900+ and no longer increases linearly with CPU usage. But if we put the model in HDFS, QPS can reach 2400+.

Our system information:

    OS:  RedHat 7.2
    CPU:  2 * 16 core * 2 thread
    Memory: 512G in 1 node

In the local model case, we used a performance tool to trace function call time and found that nearly 20% of the time is spent in page faults, which lead to spin_lock. In the HDFS case, page faults account for less than 1% of the time.

Our performance results are listed below:
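For anyone trying to reproduce the comparison, here is a minimal sketch of counting page faults around a block of work using Python's standard `resource` module (Unix only). It is not the tool used above; `count_page_faults` and `touch_fresh_memory` are hypothetical helpers, and `touch_fresh_memory` is only a stand-in for a real prediction call.

```python
import resource
from contextlib import contextmanager


@contextmanager
def count_page_faults(result):
    """Store the minor/major page faults of this process incurred inside the block."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    try:
        yield result
    finally:
        after = resource.getrusage(resource.RUSAGE_SELF)
        # ru_minflt: faults served without disk I/O; ru_majflt: faults that needed I/O.
        result["minor"] = after.ru_minflt - before.ru_minflt
        result["major"] = after.ru_majflt - before.ru_majflt


def touch_fresh_memory(num_bytes, page_size=4096):
    """Hypothetical stand-in for a prediction call: allocate a buffer and write one byte per page."""
    buf = bytearray(num_bytes)
    for offset in range(0, num_bytes, page_size):
        buf[offset] = 1
    return len(buf)


faults = {}
with count_page_faults(faults):
    touch_fresh_memory(64 * 1024 * 1024)
print(faults)
```

Wrapping the same prediction loop once with the locally loaded model and once with the HDFS-loaded model would give comparable counters. The counts are process-wide, so faults from other threads running during the block are included.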

model loading from local:

model loading from hdfs:

We checked the source code (both Eigen and TensorFlow) again and again and could not find any suspicious code that would lead to page faults. We tested loading other models (wide and deep, CNN), and the page faults did not happen. For the RNN model, we modified the code to use the HDFS file system instead of the POSIX file system, and the page faults did not happen either. We added log output to every function in core/platform/posix/posix_file_system.cc. The log is only printed during model loading, not during the prediction process.

Can anyone help us track down this problem? Thank you!

tensorflowbutler · 2018-02-08T19:17:02Z

Thank you for your post. We noticed you have not filled out the following fields in the issue template. Could you update them if they are relevant in your case, or leave them as N/A? Thanks.
Have I written custom code
OS Platform and Distribution
TensorFlow installed from
TensorFlow version
Bazel version
CUDA/cuDNN version
GPU model and memory
Exact command to reproduce

tensorflowbutler · 2018-02-23T14:00:02Z

Nagging Awaiting Response: It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

tensorflowbutler · 2018-03-10T13:11:02Z

Nagging Awaiting Response: It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

tensorflowbutler · 2018-03-25T12:33:17Z

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

drpngx · 2018-04-03T18:19:14Z

@jhseu any comment?

tensorflowbutler · 2018-04-18T12:36:23Z

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue?

jhseu · 2018-04-18T20:39:04Z

Fixes are welcome.

weberxie · 2018-07-12T14:18:07Z

@kdmxen could you provide a demo to trigger this problem?

github-actions · 2023-03-28T02:02:24Z

This issue is stale because it has been open for 180 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions · 2024-03-28T01:48:53Z

This issue was closed because it has been inactive for 1 year.

google-ml-butler · 2024-03-28T01:48:55Z

Are you satisfied with the resolution of your issue?
Yes
No

tensorflowbutler added the stat:awaiting response Status - Awaiting response from author label Feb 8, 2018

tensorflowbutler assigned drpngx Apr 3, 2018

drpngx assigned jhseu and unassigned drpngx Apr 3, 2018

drpngx added type:bug Bug stat:community support Status - Community Support labels Apr 3, 2018

drpngx unassigned jhseu Apr 3, 2018

jhseu added stat:contribution welcome Status - Contributions welcome and removed stat:awaiting response Status - Awaiting response from author labels Apr 18, 2018

github-actions bot added the stale This label marks the issue/pr stale - to be closed automatically if no activity label Mar 28, 2023

github-actions bot closed this as completed Mar 28, 2024
