
Improve performance #36

Closed
alancnet opened this issue Jun 28, 2017 · 17 comments

@alancnet
Contributor

@wagenaartje I am opening this issue as a discussion regarding performance optimization. In Endless repeat 'no more nodes left to remove!', you had said:

(PS: i'm working on speeding up evolution drastically using webworkers)

If you recall, I emailed you earlier regarding my efforts to widely distribute the evolution process, so it may be a good idea to collaborate.

I've been experimenting with Azure Functions and AWS Lambda, which both allow you to execute JavaScript code. An implementation I am experimenting with would turn the fitness function into a Promise, and that promise can be resolved by any means, including web workers, child processes, or API calls.

The problem is, once one thing is a promise, everything must be a promise. Currently, evolve() and all its dependents are synchronous, so implementing that change will drastically affect the interface. Despite that, I strongly believe Promises are the best practice for this scenario. It's a widely accepted pattern that works seamlessly with libraries like Bluebird and RxJS, leaving the actual implementation up to the user.
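To make the idea concrete, here is a minimal sketch of what a Promise-based evaluation step could look like. The names (`evaluateGeneration`, the genome shape) are hypothetical illustrations, not Neataptic's actual API:

```javascript
// Hypothetical sketch: score a generation with a Promise-returning
// fitness function. The fitness may resolve via a web worker, a child
// process, or a remote API call -- this loop doesn't care which.
const evaluateGeneration = (genomes, fitness) =>
  Promise.all(genomes.map(genome =>
    Promise.resolve(fitness(genome)).then(score => ({ genome, score }))
  ))

// Usage: a trivial fitness function wrapped in a Promise.
const fitness = genome => Promise.resolve(genome.weights.length)
evaluateGeneration([{ weights: [1, 2] }, { weights: [3] }], fitness)
  .then(scored => console.log(scored.map(s => s.score)))
```

Because `fitness` only has to return something Promise-like, the caller can swap in web workers, child processes, or API calls without touching the loop.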

@wagenaartje
Owner

I do recall! Didn't know that was you.

I have been doing some tests with webworkers. And I agree, I think using Promises is the way to go, both for performance and to keep the UI from freezing in the browser. However, I'm not completely new to that part of JavaScript; at the moment I'm trying my best to implement it on some toy problems.

[..] so it may be a good idea to collaborate. [..]
[..] An implementation I am experimenting with would turn the fitness function into a Promise [..]

Definitely a good idea. I'm doing some more tests on webworkers, and I'm planning to have a small example of the evolution process that works faster using webworkers (very similar to what you're doing: I'm trying to run the fitness function in parallel for all genomes).

Once (and if) we get something basic to work, I'll create a separate branch.

@wagenaartje
Owner

wagenaartje commented Jun 29, 2017

So I've managed to get parallel genome fitness calculation working. The following examples are tested on 500 digits from the MNIST dataset. I use neat.popsize webworkers (this is just a small test; improvements coming) and I convert the dataset to a Float64Array, which I can send as a buffer. Results:

Open console!

But I can't seem to understand why using a buffer makes it slower.

Update: wow! I managed to get it working:

* Webworkers (preprocess + buffer): 1400ms!!! that's more than 2x faster!

PS: the code is very badly written, but it's an example


So now I'm going to do some more testing; instead of using 1 webworker per network, I want to try 1 webworker per n networks.
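For reference, flattening a fixed-shape dataset into a single Float64Array (so its underlying buffer can be posted to a worker as a transferable) could look like this sketch. The record layout here is an assumption for illustration, not the code used in the test above:

```javascript
// Sketch: pack a dataset of fixed-size input/output pairs into one
// Float64Array. The flat buffer can then be sent to a web worker as a
// transferable, avoiding a structured clone of the whole dataset.
// Layout assumption: [in0..inN, out0..outM] per sample, back to back.
const packDataset = (set, inputSize, outputSize) => {
  const stride = inputSize + outputSize
  const flat = new Float64Array(set.length * stride)
  set.forEach((sample, i) => {
    flat.set(sample.input, i * stride)
    flat.set(sample.output, i * stride + inputSize)
  })
  return flat
}

// Inverse: rebuild the {input, output} records from the flat array.
const unpackDataset = (flat, inputSize, outputSize) => {
  const stride = inputSize + outputSize
  const set = []
  for (let i = 0; i < flat.length; i += stride) {
    set.push({
      input: Array.from(flat.subarray(i, i + inputSize)),
      output: Array.from(flat.subarray(i + inputSize, i + stride))
    })
  }
  return set
}
```

In the browser you would then call `worker.postMessage(flat.buffer, [flat.buffer])` to transfer the buffer rather than copy it.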

@alancnet
Contributor Author

So a method I've used for parallelizing work like this is to use a queue and maintain a count of jobs in flight against a parallelism parameter. It would look something like this:

const processTasks = (tasks, parallelism) => new Promise((resolve, reject) => {
  var results = Array(tasks.length)
  var inFlight = 0 // number of tasks currently running
  var index = 0 // next task to start
  const doQueue = () => {
    // Every task has started and finished: resolve with ordered results.
    if (inFlight === 0 && index === tasks.length) {
      resolve(results)
    }
    // Start another task only if we're below the parallelism limit.
    if (inFlight < (parallelism || 1)) {
      if (index < tasks.length) {
        inFlight++
        const myIndex = index++
        const task = tasks[myIndex]
        // Accept deferred promises, plain functions, promises, and values.
        const promise = Promise.resolve(
          typeof task === 'function'
          ? task()
          : task
        )
        promise.then(result => {
          results[myIndex] = result // keep results in task order
          inFlight--
          doQueue() // a slot freed up; start the next task
        }, reject)
        doQueue() // recurse synchronously to fill the remaining slots
      }
    }
  }
  doQueue()
})

doQueue is designed to run at any point in the lifecycle of the job queue and to resolve the overall promise at the end. This is essentially Promise.all(...), but it takes deferred promises and fork-joins them.

It's almost never wise to lean toward either extreme (running only one task at a time, or starting all tasks at once). Usually you want to limit your jobs in flight to the number of cores in your machine, or a small multiple of that. This is particularly useful when jobs are being distributed over a network: perhaps I can make 1-6 API calls in parallel without hitting rate limits or DoSing my API server. Setting parallelism lets me tune that.

For example, the code above could be used like this:

const tasks = [
  () => Promise.resolve('Hello'), // <-- Deferred promise
  () => 'World', // <-- Regular function
  Promise.resolve('Foo'), // <-- Regular promise
  'Bar', // <-- Primitive
  () => new Promise((resolve, reject) =>
    setTimeout(() => resolve(1), 1000)), // <-- Slow running task
  () => new Promise((resolve, reject) =>
    setTimeout(() => resolve(2), 1000)),
  () => new Promise((resolve, reject) =>
    setTimeout(() => resolve(3), 1000)),
  () => new Promise((resolve, reject) =>
    setTimeout(() => resolve(4), 1000)),
  () => new Promise((resolve, reject) =>
    setTimeout(() => resolve(5), 1000)),
  () => new Promise((resolve, reject) =>
    setTimeout(() => resolve(6), 1000))
]

processTasks(tasks, 5).then(
  results => console.log(results.join(' ')),
  console.error
)

which will output Hello World Foo Bar 1 2 3 4 5 6 and complete in about 2 seconds: with parallelism 5, five of the six 1-second timers run in the first wave and the last one in a second wave.

@alancnet
Contributor Author

PS: You'll notice my samples are in StandardJS. I'm a recent convert. Take a look 😄 IMHO it makes coding JavaScript much more expressive and pleasant.

@wagenaartje
Owner

wagenaartje commented Jul 1, 2017

Using a queue is the way to go. Thanks for the tip on the number of threads to keep running. The way you implemented it looks very neat. So task outputs are ordered, but how do you make task B run after task A when task B requires task A's result?

And I'm not sure if I'm right, but it seems like your code is still running in one thread, right? So technically, the tasks are still running sequentially. But I believe the code just serves as an example, so...

I'll take a look at StandardJS to see if it's any good 👍

@wagenaartje
Owner

wagenaartje commented Jul 1, 2017

So I just checked whether my above worker implementation worked, but sadly I had made a typo, so I have not yet found a way to run evaluation faster in parallel. I can't get it to run any faster than this.

If you have any working examples with the built-in Neat instance that would be great, it would lay a foundation for me (and maybe others) to build on.

I'm also updating the code to be more in line with (semi!) StandardJS.

@alancnet
Contributor Author

alancnet commented Jul 3, 2017

True, the sample code I showed is still running on one thread. It's up to the deferred promise to make parallelism happen -- for instance, if the deferred promise called an API. The coordinator would not need to know the implementation details. The function I demonstrated would be for a single step in the process, so there would be no cross-dependencies between tasks -- for instance, you would use it to run cost/fitness analysis against a generation. The assumption is that the tasks are pure, idempotent functions.

As for your fiddle, I think the problem is that you're spending a lot of CPU time on serialization. standalone() has to create the function as a string, then the webworker needs to parse and compile that function. The rest of what the webworker actually does is minuscule compared to that. network.test() does not need serialization.

If the network could be serialized as pure data, perhaps as a single Float64Array, and the worker were able to use that data directly, then you could reuse webworkers without having to recompile JavaScript, and/or you could chunk networks together, sending 4-8 at a time to a single worker.
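As a toy illustration of the "network as pure data" idea (this is not Neataptic's serialization format), a feedforward pass can be computed from nothing but layer sizes and a flat weight array, which a worker could receive once as a buffer and reuse across calls:

```javascript
// Toy sketch, NOT Neataptic's format: run a feedforward network whose
// weights (including one bias per neuron) live in a flat Float64Array.
// Layout assumption: for each layer, weights neuron-by-neuron
// (incoming weights first, then the bias), all layers back to back.
const activateFlat = (layerSizes, weights, input) => {
  let values = input
  let w = 0 // read cursor into the flat weight array
  for (let l = 1; l < layerSizes.length; l++) {
    const next = new Array(layerSizes[l])
    for (let n = 0; n < layerSizes[l]; n++) {
      let sum = 0
      for (let i = 0; i < values.length; i++) sum += values[i] * weights[w++]
      sum += weights[w++] // bias
      next[n] = 1 / (1 + Math.exp(-sum)) // logistic activation
    }
    values = next
  }
  return values
}
```

A worker could be sent `weights.buffer` once and then activate many samples without re-parsing any JavaScript.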

The implementation I'm considering (which does not work against the current Neataptic) farms out the test functionality to web services. The reason is that it would need to run the neural network in a simulation that could take seconds or minutes to complete. On a single computer, it could take hours or months to evolve enough generations, or seconds if I farmed it out to compute clusters :) For instance, if I wanted to run the neural network on a self-driving-car simulation with multiple obstacles and situations the car needs to handle, it would take serious CPU time for that simulation to run. The CPU time spent on serialization, even with standalone(), would pale in comparison to those computations, so it would be worth it to farm those out.

@alancnet
Contributor Author

alancnet commented Jul 3, 2017

Also, here is your fiddle modified to use deferred promises; switch PARALLELISM between 1 and a higher value to see how parallelism actually affects the same volume of work.

https://jsfiddle.net/yr7Lukju/5/

@wagenaartje
Owner

wagenaartje commented Jul 3, 2017

That's what I was thinking as well. I have managed to serialize the network into a single Float64Array, and I'm using just 4 workers, to which I send a new network once the previous one is done. The speed went from 2500ms to 1400ms.

I have only got the serialization working for feedforward networks so far, but I think I'll have it done for recurrent networks soon as well. It looks promising.

So basically, the Neat instance code should be written to work completely on Promises for that implementation?

Update: I seem to have it working now. It evaluates more than 3x faster (for any type of network). So right now I'm thinking about how I can create a network.evolve function that works completely on async functions and promises. I have created a new branch, async. I will push some commits there sometime this week.
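Sending n networks at a time to a worker, as discussed above, comes down to a small chunking helper; a minimal sketch:

```javascript
// Sketch: split a population into chunks so each worker evaluates
// several genomes per message instead of one, amortizing the
// per-message serialization overhead.
const chunk = (items, size) => {
  const chunks = []
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size))
  }
  return chunks
}

// e.g. with 4 workers: chunk(population, Math.ceil(population.length / 4))
```

Each chunk can then be serialized into one buffer and posted to a single worker.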

@D-Nice
Contributor

D-Nice commented Jul 15, 2017

I wonder if any parts of this could end up being leveraged as well https://github.com/MaiaVictor/WebMonkeys

@alancnet
Contributor Author

@D-Nice I have a lot to learn! Wow.. I had no idea that project existed!

@D-Nice
Contributor

D-Nice commented Jul 23, 2017

@wagenaartje The latest implementation with async appears broken to me. The scores seem to be getting jumbled up between various genomes.

Anyway, I tried making my own implementation to speed things up by having the fitness function run in parallel. I used https://github.com/parallel-js/parallel.js/tree/master and have been able to achieve a ~6-10x speedup on Node.js. Previously, a set where a generation took 300 seconds now takes 45 seconds, with 9 threads running (only 75% CPU utilization; I could run 12 for 100%).

@wagenaartje
Owner

wagenaartje commented Jul 23, 2017

@D-Nice Hmm... the way workers are set up should guarantee that genomes get their own score.

So you are running the fitness function over multiple genomes at the same time? I haven't heard of that library before; I'll check it out. And do you see the same improvements in the browser? If you have some kind of generic code that can be run from the evolve function or such, I'd be very happy to see it (or a PR!).

Edit: I read through it, and parallel.js seems interesting. On the other hand, it seems like it is just an API for workers. It looks very good, especially because it works for both Node and the browser. The only thing I'm seeing (if I'm correct) is that data is not sent as buffers, which slows down communication considerably.

@D-Nice
Contributor

D-Nice commented Jul 23, 2017

@wagenaartje Haven't tried on browser, but here's an example of how the implementation looks on my end: https://gist.github.com/D-Nice/9f332010ad9a22279c6f2a026be64a33

The XOR dataset is a pretty terrible example for it, though; to get the benefits you want a large dataset. I'm running datasets of >100k samples with complex fitness functions, and there I saw vast improvements in speed. The parallel library does seem to have its issues: the threads apparently need to be spooled up first before running in parallel, so there's already a significant loss if you're running low populations.

@wagenaartje
Owner

@D-Nice I think I have to say you're right; something seems wrong with the scores when using the multithreading options. I'm on it now.

@wagenaartje
Owner

I'm also noticing that Node performance is way worse than Chrome performance. I'm talking about at least 5x slower.

@wagenaartje
Owner

@D-Nice The issue that

The scores seem to be getting jumbled up between various genomes.

should be fixed as of v1.3.6.
