We have trained a RNN model and use it to predict. We feed some data and calculate QPS in prediction. We find that when CPU usage is above than 30%, the QPS always stayed in 900+. And not increasing linearly by CPU usage. But if we put the model in HDFS, The QPS can reach 2400+.
Our system infomation:
OS: RedHat 7.2
CPU: 2 * 16 core * 2 thread
Memory: 512G in 1 node
In local model case, we use performance tool to trace function call time and find nearly 20% time hanged in page fault which lead to spin_lock. Those page fault occurs less than 1% in hdfs situation.
Our performance result listed as below:
model loading from local:

model loading from hdfs:

We check the source code (both eigen and tensorflow ) again and again and can not find any suspectable code which lead to page fault. we test loading model (wide and deep, cnn), page fault not happened. In RNN model we modify the code use HDFS file sytem instead of posix file system. page fault not happend too. we print log in every function in core/platform/posix/posix_file_system.cc. The log is only displayed in model loading, not occurs in prediction process.
Is anyone can help us to find out this problem? Thank you!
We have trained a RNN model and use it to predict. We feed some data and calculate QPS in prediction. We find that when CPU usage is above than 30%, the QPS always stayed in 900+. And not increasing linearly by CPU usage. But if we put the model in HDFS, The QPS can reach 2400+.
Our system infomation:
In local model case, we use performance tool to trace function call time and find nearly 20% time hanged in page fault which lead to spin_lock. Those page fault occurs less than 1% in hdfs situation.
Our performance result listed as below:
model loading from local:

model loading from hdfs:

We check the source code (both eigen and tensorflow ) again and again and can not find any suspectable code which lead to page fault. we test loading model (wide and deep, cnn), page fault not happened. In RNN model we modify the code use HDFS file sytem instead of posix file system. page fault not happend too. we print log in every function in core/platform/posix/posix_file_system.cc. The log is only displayed in model loading, not occurs in prediction process.
Is anyone can help us to find out this problem? Thank you!