Skip to content

Loading model from local in RNN prediction is slower than from HDFS due to page fault  #16856

@kdmxen

Description

@kdmxen

We have trained a RNN model and use it to predict. We feed some data and calculate QPS in prediction. We find that when CPU usage is above than 30%, the QPS always stayed in 900+. And not increasing linearly by CPU usage. But if we put the model in HDFS, The QPS can reach 2400+.

Our system infomation:

    OS:  RedHat 7.2
    CPU:  2 * 16 core * 2 thread
    Memory: 512G in 1 node

In local model case, we use performance tool to trace function call time and find nearly 20% time hanged in page fault which lead to spin_lock. Those page fault occurs less than 1% in hdfs situation.

Our performance result listed as below:

model loading from local:
local

model loading from hdfs:
hdfs

We check the source code (both eigen and tensorflow ) again and again and can not find any suspectable code which lead to page fault. we test loading model (wide and deep, cnn), page fault not happened. In RNN model we modify the code use HDFS file sytem instead of posix file system. page fault not happend too. we print log in every function in core/platform/posix/posix_file_system.cc. The log is only displayed in model loading, not occurs in prediction process.

Is anyone can help us to find out this problem? Thank you!

Metadata

Metadata

Assignees

No one assigned

    Labels

    staleThis label marks the issue/pr stale - to be closed automatically if no activitystat:community supportStatus - Community Supportstat:contribution welcomeStatus - Contributions welcometype:bugBug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions