
Why does mobileNet.predict() hit a performance cliff after a certain number of runs? #145

Closed
nsthorat opened this issue Apr 7, 2018 · 10 comments
Labels
comp:core type:bug Something isn't working

@nsthorat
Contributor

nsthorat commented Apr 7, 2018

From @samwyi on April 5, 2018 22:21

TensorFlow.js version: 0.6.0

TensorFlow.js Core version: 0.6.0

Browser version: Chrome 65.0.3325.181 (Official Build) (64-bit)

Describe the problem or feature request

mobileNet.predict() hits a performance cliff after a certain number of runs.
On my MacBook Pro, the average over 100 runs is ~11ms, while the average over 200 runs climbs to ~46ms. A similar issue happens on my Android device. I wonder what causes the slowdown? Is there any way to avoid it? Thanks.

Code to reproduce the bug / link to feature request:

Change the cat.onload() code in tfjs-converter/demo/index.js to call mobileNet.predict() multiple times in a loop:

console.time('Subsequent predictions');
let result;
for (let i = 0; i < 200; i++) {
  result = mobileNet.predict(pixels);
}
console.timeEnd('Subsequent predictions');

Copied from original issue: tensorflow/tfjs-core#925

@nsthorat
Contributor Author

nsthorat commented Apr 7, 2018

Try await tf.nextFrame() between each prediction (you may have to wrap your code in a function marked as async).

async function run() {
  console.time('Subsequent predictions');
  let result;
  for (let i = 0; i < 200; i++) {
    result = mobileNet.predict(pixels);
    await tf.nextFrame();
  }
  console.timeEnd('Subsequent predictions');
}
run();

@samwyi

samwyi commented Apr 7, 2018

@nsthorat Adding "await tf.nextFrame()" really makes the performance numbers consistent on my laptop!

But on smartphones, the performance cliff still exists. After profiling with the Chrome performance dev tool, I noticed that after every 20-30 predictions there is a long wait of 30-60 seconds (not ms!), with the GPU shown as busy (solid green). Since I put "await tf.nextFrame()" between each prediction, there should be at most one inference running on the GPU at a time, so I wonder what is causing the long wait? Any ideas?

I tested on two Android phones; both have a similar issue.

@nsthorat
Contributor Author

nsthorat commented Apr 7, 2018

Ah, apologies: you probably also have a memory leak and need to dispose the result of each predict. Try this code:

async function run() {
  console.time('Subsequent predictions');
  let result;
  for (let i = 0; i < 200; i++) {
    result = mobileNet.predict(pixels);
    await tf.nextFrame();
    result.dispose();  // Release the result tensor's memory
  }
  console.timeEnd('Subsequent predictions');
}
run();
run();

Check out our section on memory in this tutorial: https://js.tensorflow.org/tutorials/core-concepts.html
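As an alternative to calling dispose() by hand, tf.tidy() can manage this automatically. A minimal sketch, assuming the same mobileNet and pixels variables as in the snippet above:

```javascript
// tf.tidy() disposes every intermediate tensor created inside its callback
// once the callback returns, so no manual dispose() calls are needed.
async function run() {
  console.time('Subsequent predictions');
  for (let i = 0; i < 200; i++) {
    tf.tidy(() => {
      const result = mobileNet.predict(pixels);
      // Use `result` here (e.g. read its values); it is disposed
      // automatically when this callback returns.
    });
    await tf.nextFrame();
  }
  console.timeEnd('Subsequent predictions');
}
run();
```

Note that a tensor returned from the tidy callback is kept alive, so return `result` from the callback if you need it outside the loop.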

@samwyi

samwyi commented Apr 9, 2018

Tried calling result.dispose() after each prediction. On my Android phone (Huawei BLN-L24, Android 7, GPU: Mali-T830MP2), it really eliminated the 60-second waits between groups of 20-30 predictions, BUT at the cost of a ~2-second wait for each result.dispose(). I'm surprised to see that releasing GPU memory takes much longer than running the model itself ;-) Is this an issue in TensorFlow.js or in the GPU driver?

@nsthorat
Contributor Author

Interesting, it really shouldn't take 2 seconds for a dispose(). Our dispose() actually just marks memory for reuse; it doesn't free the memory itself.

One thing to test: console.log tf.memory().numTensors at each tick and make sure the number of tensors isn't increasing.
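Spelled out, a sketch of that check (note that in the tf.js API, numTensors is a property of the object returned by tf.memory(), not a method):

```javascript
// Log the live-tensor count each iteration; if the count keeps climbing,
// something allocated by predict() is never being disposed.
async function run() {
  for (let i = 0; i < 200; i++) {
    const result = mobileNet.predict(pixels);
    result.dispose();
    console.log('live tensors after tick', i, ':', tf.memory().numTensors);
    await tf.nextFrame();
  }
}
run();
```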

@samwyi

samwyi commented Apr 10, 2018

tf.memory().numTensors increases by 4 after each mobileNet.predict() call. Calling dispose() doesn't seem to help :(
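A minimal probe for that leak (a sketch, assuming the same tf namespace, mobileNet, and pixels variables as in the snippets above):

```javascript
// Measure the tensor-count delta across one predict()/dispose() pair.
// A delta above zero means predict() allocates internal tensors that
// disposing the output alone does not release.
const before = tf.memory().numTensors;
const result = mobileNet.predict(pixels);
result.dispose();
const delta = tf.memory().numTensors - before;
console.log('tensors leaked per predict():', delta);
```

Intermediate tensors like these are exactly what wrapping the predict() call in tf.tidy() is meant to clean up.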

easadler pushed a commit to easadler/tfjs that referenced this issue Apr 12, 2018
* vectorize min/max/logsumexp/nan shaders

* vectorize reduce sum
@davidsoergel davidsoergel added type:bug Something isn't working comp:core labels May 10, 2018
@dsmilkov
Contributor

dsmilkov commented Jun 6, 2018

Hi Samwyi,

Can you share some simple code to reproduce this? That way we can take a closer look, especially at the number of tensors going up by 4 after each mobileNet.predict().

@RadEdje

RadEdje commented Jul 6, 2018

Hello, I just wanted to ask whether the performance cliff has been figured out on Android 5.0?
I built a proof-of-concept web app that uses the video camera of a phone, or the webcam of a desktop/laptop, to detect/recognize radiographic findings.

There are 2 versions of the web app:

https://radhorizon.com/SITES/RadLense/
(this uses tf.js version 0.10.0)

and

https://radhorizon.com/SITES/RadLense/index3.html
(this uses the latest TensorFlow.js, ver 0.11.7)

Both work on the latest Firefox, Opera, and Chrome on desktop, as well as the latest Chrome on Android 8.0.

The latest 0.11.7 is blazing fast compared to ver 0.10.0, I must say, but both versions seem to have the same problem: they only work on Android 8.0, not on Android 5.0.

I've tried numerous ways to debug and look for the cause of the problem:

No errors show up in the error log, so it's not a JavaScript issue.
The webcam runs, so the phone is detecting the camera and putting its data in a video element.
The AI/ML model loads properly, since I rigged the app to stop at the splash screen if it doesn't.

I had to manually insert console.log("check here"); statements to see which part of the app was stalling, since no actual errors were showing up in the console. That is how I narrowed it down to model.predict(), and how I found this thread. I tried the various solutions above, but it still does not seem to work. I'm just hoping to find out whether anyone has gotten tf.js to run on Android 5.0, or whether I should just wait for everyone to end up on Android 8.0. Thanks.

@nsthorat
Contributor Author

Hi, can you rerun this with the latest version? Thanks!

@nsthorat
Contributor Author

Closing this out due to inactivity.

nsthorat pushed a commit that referenced this issue Aug 19, 2019