Massive Memory Leaks, 100 loops #466

Closed
rauls opened this Issue Jul 4, 2012 · 17 comments

Projects

None yet

4 participants

@rauls
rauls commented Jul 4, 2012

I basically did a test of 100 loops decoding a 10kb html page, that was simple, and only contained a table and 10 ROWS, with 38 data CELLS.
There is NO CSS, there is NO class names, just plain 1990s html.

Am I doing something wrong? Should I free something? I do need to call this decoder on 1000s of pages, but it barely gets to 500 until the process dies.

node-v0.8 memory results printed every 10 loops:
{ rss: 23052288, heapTotal: 16414464, heapUsed: 11052720 }
{ rss: 44204032, heapTotal: 37286144, heapUsed: 20598328 }
{ rss: 65110016, heapTotal: 57159168, heapUsed: 36111544 }
{ rss: 71761920, heapTotal: 64382720, heapUsed: 35709832 }
{ rss: 78684160, heapTotal: 70574336, heapUsed: 38881408 }
{ rss: 83673088, heapTotal: 74702080, heapUsed: 54306104 }
{ rss: 90898432, heapTotal: 80893696, heapUsed: 57606448 }
{ rss: 97910784, heapTotal: 89149184, heapUsed: 60812848 }
{ rss: 104034304, heapTotal: 94308864, heapUsed: 62109328 }
{ rss: 109240320, heapTotal: 98436608, heapUsed: 78288200 }

Node-v0.6.11 printed every 10 loops;
{ rss: 25526272, heapTotal: 17585280, heapUsed: 10782256 }
{ rss: 40357888, heapTotal: 32112768, heapUsed: 17500296 }
{ rss: 61333504, heapTotal: 53142208, heapUsed: 27617224 }
{ rss: 68038656, heapTotal: 59846400, heapUsed: 30513056 }
{ rss: 75681792, heapTotal: 67359424, heapUsed: 34469968 }
{ rss: 79060992, heapTotal: 70815104, heapUsed: 50902336 }
{ rss: 85581824, heapTotal: 77246912, heapUsed: 55379912 }
{ rss: 92008448, heapTotal: 83784896, heapUsed: 57894512 }
{ rss: 98906112, heapTotal: 90541824, heapUsed: 62839816 }
{ rss: 105295872, heapTotal: 96762560, heapUsed: 67425680 }

The code for my loop as follows;

for( var i=0; i< 100; i++ )
{
    jsdom.env( {
        html: htmldata, 
        scripts: [ "../javascript/jquery-min.js" ],             // jQuery JavaScript Library v1.6.2

        done: function(errors,win) {
            var $ = win.$;
            DEBUG( "  Process_New_Data html TR size is " + $("tr").length );
            var values = [];
            $("tr").each(function() 
            {
                $.each(this.cells, function()
                {
                    var v = $(this).text();
                    values.push(v);
                } );
            } );

            var found_link = 0;
            $("a").each(function() 
            {
                if( $(this).html() == '&gt;' ) {
                    found_link = 1;
                    var link = $(this).attr('href');
                    DEBUG("  Getting next new data ... " + link );
                }
            } );
        }
    } );

    if( i % 10 == 0 ) {
        console.log(util.inspect(process.memoryUsage()));
    }
}
@tmpvar
Owner
tmpvar commented Jul 4, 2012

Use window.close()
On Jul 3, 2012 6:18 PM, "Raul" <
reply@reply.github.com>
wrote:

I basically did a test of 100 loops decoding a 10kb html page, that was
simple, and only contained a table and 10 ROWS, with 38 data CELLS.
There is NO CSS, there is NO class names, just plain 1990s html.

Am I doing something wrong? Should I free something? I do need to call
this decoder on 1000s of pages, but it barely gets to 500 until the process
dies.

node-v0.8 memory results printed every 10 loops:
{ rss: 23052288, heapTotal: 16414464, heapUsed: 11052720 }
{ rss: 44204032, heapTotal: 37286144, heapUsed: 20598328 }
{ rss: 65110016, heapTotal: 57159168, heapUsed: 36111544 }
{ rss: 71761920, heapTotal: 64382720, heapUsed: 35709832 }
{ rss: 78684160, heapTotal: 70574336, heapUsed: 38881408 }
{ rss: 83673088, heapTotal: 74702080, heapUsed: 54306104 }
{ rss: 90898432, heapTotal: 80893696, heapUsed: 57606448 }
{ rss: 97910784, heapTotal: 89149184, heapUsed: 60812848 }
{ rss: 104034304, heapTotal: 94308864, heapUsed: 62109328 }
{ rss: 109240320, heapTotal: 98436608, heapUsed: 78288200 }

Node-v0.6.11 printed every 10 loops;
{ rss: 25526272, heapTotal: 17585280, heapUsed: 10782256 }
{ rss: 40357888, heapTotal: 32112768, heapUsed: 17500296 }
{ rss: 61333504, heapTotal: 53142208, heapUsed: 27617224 }
{ rss: 68038656, heapTotal: 59846400, heapUsed: 30513056 }
{ rss: 75681792, heapTotal: 67359424, heapUsed: 34469968 }
{ rss: 79060992, heapTotal: 70815104, heapUsed: 50902336 }
{ rss: 85581824, heapTotal: 77246912, heapUsed: 55379912 }
{ rss: 92008448, heapTotal: 83784896, heapUsed: 57894512 }
{ rss: 98906112, heapTotal: 90541824, heapUsed: 62839816 }
{ rss: 105295872, heapTotal: 96762560, heapUsed: 67425680 }

The code for my loop as follows;

    for( var i=0; i< 100; i++ )
    {
            jsdom.env( {
                    html: htmldata,
                    scripts: [ "../javascript/jquery-min.js" ],
  // jQuery JavaScript Library v1.6.2

                    done: function(errors,win) {
                            var $ = win.$;
                            DEBUG( "  Process_New_Data html TR size is

" + $("tr").length );
var values = [];
$("tr").each(function()
{
$.each(this.cells, function()
{
var v = $(this).text();
values.push(v);
} );
} );

                            var found_link = 0;
                            $("a").each(function()
                            {
                                    if( $(this).html() == '&gt;' ) {
                                            found_link = 1;
                                            var link =

$(this).attr('href');
DEBUG(" Getting next new
data ... " + link );
}
} );
}
} );

            if( i % 10 == 0 ) {
                    console.log(util.inspect(process.memoryUsage()));
            }
    }

Reply to this email directly or view it on GitHub:
#466

@rauls
rauls commented Jul 4, 2012

I made a simpler test that didnt traverse any objects and added win.close();

Same result;

{ rss: 25419776, heapTotal: 17704000, heapUsed: 10796816 }
{ rss: 40214528, heapTotal: 31979392, heapUsed: 17536560 }
{ rss: 61341696, heapTotal: 53268928, heapUsed: 27619088 }
{ rss: 67948544, heapTotal: 59786176, heapUsed: 30482688 }
{ rss: 75673600, heapTotal: 67323584, heapUsed: 34412904 }
{ rss: 79204352, heapTotal: 71063744, heapUsed: 50881760 }
{ rss: 85643264, heapTotal: 77308608, heapUsed: 55390688 }
{ rss: 92000256, heapTotal: 83858816, heapUsed: 57790184 }
{ rss: 98746368, heapTotal: 90339392, heapUsed: 62643632 }
{ rss: 105230336, heapTotal: 96803968, heapUsed: 67217488 }

@rauls
rauls commented Jul 4, 2012

OK I modified the example/ender/run.js to proove it as a standalone run.

------------------- code ---------------------

var jsdom  = require("../../lib/jsdom");
function doit()
{
 jsdom.env( '<html><body></body></html>', ['ender.js'], function(errors, window) {
    if (errors) {
        console.error(errors);
        return;
    }
  //window.$('body').append("<div class='testing'>Hello World, It works!</div>");
  //console.log(window.$(".testing").text());
  window.close();
 });
}

util = require('util');

for( i=0;i<10000;i++)
{
  doit();
  if( i%100 == 0 ) console.log(util.inspect(process.memoryUsage()));
}

------------------- code ---------------------

Results:
{ rss: 18395136, heapTotal: 10875840, heapUsed: 6138512 }
{ rss: 36093952, heapTotal: 28490432, heapUsed: 21137736 }
{ rss: 56926208, heapTotal: 51323456, heapUsed: 33535336 }
{ rss: 74182656, heapTotal: 79052736, heapUsed: 50018992 }
{ rss: 92946432, heapTotal: 89408704, heapUsed: 64089840 }
{ rss: 113168384, heapTotal: 107690624, heapUsed: 74172520 }
{ rss: 130473984, heapTotal: 125315200, heapUsed: 88372280 }
{ rss: 142360576, heapTotal: 135028160, heapUsed: 108268288 }
{ rss: 155648000, heapTotal: 148476672, heapUsed: 118765288 }
{ rss: 173125632, heapTotal: 165990720, heapUsed: 133376736 }
{ rss: 190173184, heapTotal: 183248960, heapUsed: 147333880 }
{ rss: 203616256, heapTotal: 196447744, heapUsed: 156220424 }
{ rss: 220995584, heapTotal: 213912320, heapUsed: 170633816 }
{ rss: 237731840, heapTotal: 230651904, heapUsed: 184606168 }
{ rss: 247508992, heapTotal: 240397376, heapUsed: 204498064 }
{ rss: 264613888, heapTotal: 257577920, heapUsed: 218453800 }
{ rss: 280772608, heapTotal: 273869120, heapUsed: 225358728 }
{ rss: 296161280, heapTotal: 289066816, heapUsed: 237952200 }
{ rss: 313290752, heapTotal: 306599168, heapUsed: 251917568 }
{ rss: 322723840, heapTotal: 315800064, heapUsed: 271804048 }

@rauls
rauls commented Jul 24, 2012

Any one any one? Beuller?

Memory still leaks, even with window.close();

@tmpvar
Owner
tmpvar commented Jul 24, 2012

weird, I thought we squashed this a long time ago. Perhaps we've regressed :(

can you provide your node version and jsdom version?

Also, as a workaround, you could set up a master -> slaves configuration. e.g. process 10 pages per process before shutting it down and spawning a new one. You may want to consider using https://github.com/MatthewMueller/cheerio if all you are doing is scraping..

@rauls
rauls commented Jul 24, 2012

As stated at the very top, I ran this on two nodes.

node-v0.8, and node-v0.6.11, using JSDOM "version": "0.2.14",
that came via NPM

@rauls
rauls commented Jul 24, 2012

I updated jsdom, and I still get the same leaks.
My code is as the above 15 line example.
Even if I change it to print the memory heaps for every single loop, there still are losses per loop (about 45kb)

@rauls
rauls commented Aug 5, 2012

Can someone look at this, why not use nodetime http://nodetime.com/ to profile the memory usage.
If no one will in 24 days, I will.

@tmpvar
Owner
tmpvar commented Aug 5, 2012

please do

@rauls
rauls commented Aug 6, 2012

Made a slight change, and tested with nodetime.com and without nodetime.com

I changed my loop to happen only 500 times, and added it to a timer.
As you can see memory is freed, ok this time.

But when run for 1500 times, every 3 seconds, it crashes with too many open files error.
Run loop for 1000, every 3 seconds and its ok.

It seams as though if you use/alloc too many objects/second, it can die fast. Node needs time to free heap.

require('nodetime').profile();
var jsdom  = require("../lib/jsdom");
function doit()
{
 jsdom.env( '<html><body></body></html>', ['test.js'], function(errors, window) {
    if (errors) {
        console.error("Got errors : "+errors);
        return;
    }
  //window.$('body').append("<div class='testing'>Hello World, It works!</div>");
  //console.log(window.$(".testing").text());
  window.close();
 });
}

util = require('util');

function starttest()
{
 console.log("Start...");
 for( i=0;i<500;i++)
 {
  doit();
  if( i%100 == 0 ) console.log(util.inspect(process.memoryUsage()));
 }
 console.log("End.");
}


console.log("30 secs till start...");
setInterval( starttest, 30*1000 );

Start...
{ rss: 22106112, heapTotal: 18389376, heapUsed: 7256004 }
{ rss: 30420992, heapTotal: 25564928, heapUsed: 16414360 }
{ rss: 38797312, heapTotal: 35853312, heapUsed: 21972532 }
{ rss: 47087616, heapTotal: 40931712, heapUsed: 35498932 }
{ rss: 63852544, heapTotal: 56417920, heapUsed: 35581428 }
End.
Start...
{ rss: 22048768, heapTotal: 17373696, heapUsed: 7563864 }
{ rss: 30511104, heapTotal: 24549248, heapUsed: 18146440 }
{ rss: 38412288, heapTotal: 30643328, heapUsed: 22855576 }
{ rss: 46436352, heapTotal: 40931712, heapUsed: 35570236 }
{ rss: 52305920, heapTotal: 46010112, heapUsed: 40108404 }
End.
Start...
{ rss: 22155264, heapTotal: 18389376, heapUsed: 7289456 }
{ rss: 30826496, heapTotal: 25564928, heapUsed: 16913412 }
{ rss: 38289408, heapTotal: 34837632, heapUsed: 24475900 }
{ rss: 47710208, heapTotal: 40919552, heapUsed: 37838328 }
{ rss: 55517184, heapTotal: 48029312, heapUsed: 37064812 }
End.
Start...
{ rss: 22925312, heapTotal: 18389376, heapUsed: 7354268 }
{ rss: 31576064, heapTotal: 25564928, heapUsed: 16970632 }
{ rss: 38916096, heapTotal: 34837632, heapUsed: 24448732 }
{ rss: 48324608, heapTotal: 41935232, heapUsed: 37697156 }
{ rss: 55869440, heapTotal: 48029312, heapUsed: 37374700 }
End.
Start...
{ rss: 22925312, heapTotal: 18389376, heapUsed: 7354804 }
{ rss: 31514624, heapTotal: 25564928, heapUsed: 16972168 }
{ rss: 38952960, heapTotal: 34837632, heapUsed: 24001384 }
{ rss: 48324608, heapTotal: 41935232, heapUsed: 36890564 }
{ rss: 55787520, heapTotal: 48029312, heapUsed: 37325820 }
End.
Start...
{ rss: 22925312, heapTotal: 18389376, heapUsed: 7355032 }
{ rss: 31305728, heapTotal: 24549248, heapUsed: 16860208 }
{ rss: 39395328, heapTotal: 35853312, heapUsed: 23356792 }
{ rss: 47751168, heapTotal: 40931712, heapUsed: 37070376 }
{ rss: 56004608, heapTotal: 48029312, heapUsed: 35721128 }
End.

@domenic
Collaborator
domenic commented Oct 6, 2012

@rauls would you still consider this a bug? It sounds more related to an overzealous usage pattern than any fault of jsdom.

@brianmcd
Collaborator
brianmcd commented Oct 6, 2012

@rauls FWIW, your test is creating 500 DOM instances before closing the first one, since window.close() doesn't fire until the callbacks execute. That explains why the heap kept growing in your initial tests, and why memory is freed in between setInterval calls. I would guess that you're running out of file descriptors when you try to create 1500 DOM instances and it reads test.js for each one (if that's what the second parameter to jsdom.env does - I haven't used that method much).

Edit: so @domenic, I don't think this is a JSDOM bug.

@rauls
rauls commented Oct 7, 2012
  1. this is not overzealous usage, since my real life example was going to a remote URL, the same URL, but with different args each time, and to recode the pages returned table results. So this page would return only 10 rows of HTML, with a NEXT button, so my code would fine the next button URL, and follow it forever. Now this can go on for 1000s of URLs before it dies out of memory, since the garbage collector/free-er runs slow or cannot keep up with all the free objects.
  2. BrainMCD, In my example doit() has a window.close() for every URL hit. The 500 is the total overall loop max, with only print outs every 100. But look at the code output each 500 block is in its own block lasting less that the 30 second interval, so we never have two 500 block timers going of at the same time, so there should never be >500 outstanding FDs. Each 500 block loop executes in less than 20 seconds, with 10 seconds gap.
    This was the simplest example I could do for you guys based on your other test examples.

In my real life code, the code that follows the URLs forever until end, it processed 1 URL, got the result, then triggered another URL read/process, and repeat. I mean how can I do it any better, except for adding purposful sleep delays to do nothing for 5 seconds between each URL ? I want to process each URL and BODY as fast as possible, then only once that page is done, does it go and get the next page, never two pages at once.

@domenic
Collaborator
domenic commented Oct 7, 2012

since the garbage collector/free-er runs slow or cannot keep up with all the free objects

This raises some red flags for me. Garbage collectors simply don't work this way: if there is not enough memory available, then they will collect anything that is garbage. If there is no garbage to collect, e.g. because you have set up your system to hold references to all those objects, then they will throw out-of-memory. That is why I suspect you are not actually seeing a leak.

Furthermore, you mentioned running out of file descriptors. Are you aware of ulimit, and how on most systems the maximum number of open file descriptors is by default very low? One thread says 266 for Mac OS 10.6, for example.

@domenic
Collaborator
domenic commented Oct 7, 2012

You can also try testing by running node with --expose-gc and then inserting gc() calls to collect garbage manually.

@brianmcd
Collaborator
brianmcd commented Oct 8, 2012

@rauls Your code is creating 500 window instances before the first one is closed.

for( i=0;i<500;i++)
 {
  doit();
  if( i%100 == 0 ) console.log(util.inspect(process.memoryUsage()));
}

This loop will call doit() 500 times without yielding the CPU back to the event loop.

function doit()
{
 jsdom.env( '<html><body></body></html>', ['ender.js'], function(errors, window) {
    if (errors) {
        console.error(errors);
        return;
    }
  //window.$('body').append("<div class='testing'>Hello World, It works!</div>");
  //console.log(window.$(".testing").text());
  window.close();
 });
}

doit() calls jsdom.env, which is an asynchronous function that allocates a DOM and then calls the callback when it's loaded. The callback is always going to be called on a different tick of the event loop, and since you don't yield the CPU before you're done allocating 500 windows, none of the window.close() invocations will happen until after all of the browsers are created.

If you instrument your example to add console.log("allocating a window"); as the first line in doit(), and then console.log("closing a window"); right before the window.close() call, you'll see that all windows are allocated before any are closed.

Here is a cleaned up example that doesn't instantiate multiple windows at once:

var jsdom  = require('jsdom');
var util = require('util');

var count = 0;
function doit() {
  console.log('allocating a window');
  jsdom.env('<html><body></body></html>', ['test.js'], function(errors, window) {
    if (errors) {
      console.error("Got errors : "+errors);
      return;
    }
    console.log('calling close');
    window.close();
    if (++count < 500) {
      if (count % 100 == 0) {
        console.log(util.inspect(process.memoryUsage()));
      }   
      process.nextTick(doit);
    }   
  }); 
}

doit();
@rauls
rauls commented Oct 9, 2012

Domenic, I am aware of ulimit, and have set it to 300000 files, 1024 max use procs etc.. any way, quite large values. Machine is RH, 64bit/2Core, 2gig ram + 2gig swap.

Ok that explains part of it, well, at least for this example which was small to demonstrate. My program at hand, that takes one page, then follows one link at a time, and keeps going, did a lot of console.log() calls printing info before each request, before each jsdom() call, after it processed it another log(), basically, at least 20 log() calls per web page.

In this full program case, it lasts a lot longer, but it does take about 45-50mins to die, because it does have time to background process closes() , but I guess its not doing it often enough or its too slow, so the pace of creating new jsdom() calls is faster than they are freed(). My code isnt triggering 500 background jsdoms() either, but basically a simple http.request() with a end() handler, that then triggers jsdom(), and once its done, it then triggers another http.request(), so in theory there is only ever 1 page being scrapped at one time.

My program keeps a record of where it is up to, so the lame way around it , is to just restart it, and it continues, but it still dies after 4gig+ of ram is used.

Having to force yields() or sleeps() just takes me back to OS7 MacOS programming yieldProcessor(), we have a full threadable linux kernel here, the VM engine should do its magic underneath, no need for any 1995 style yields(), maybe the VM engine should better monitor whats going on, and spend a bit more time freeing stuff.
Even something so simple as a trigger on 1GIG usage, suspend that hungry process for 2 seconds, then free as much crap as possible, resume().

Maybe jsdom should keep track of how much crap is outstanding or in heap, and perhaps issue a few yields itself, or tell the VM to collect. You dont want the user to rely on quirks of console.log() to help reduce memory.

Hope this helps everyone.

@domenic domenic closed this Oct 10, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment