
(Sort of) memory leak with entry objects #2398

Closed
michael-e opened this issue Apr 5, 2015 · 54 comments

@michael-e
Member

I am importing 5000 entries using the PHP CLI. The script works with Symphony and is supposed to build the core of a web-based (frontend) importer later, so I care about memory consumption.

The initial memory consumption of the script is around 47 MB. (This is caused by parsing a rather big Excel XML worksheet, so that is ok.)

Then I start saving entries to Symphony basically using the following code:

foreach ($data as $index => $fields) {
    $entry = EntryManager::create();
    $entry->set('section_id', $section_id);
    $entry->set('author_id',  $author_id);
    if (!empty($fields['id'])) {
        $entry->set('id', $fields['id']);
    }

    // Initial checks to see if the Entry is ok
    if (__ENTRY_FIELD_ERROR__ == $entry->checkPostData($fields, $errors)) {
        foreach ($errors as $field_id => $message) {
            echo 'ERROR  : entry ' . ($index + 1) . ' field ' . $field_id . ': ' . $message . "\n";
        }
    // Secondary checks, this will actually process the data
    } elseif (__ENTRY_OK__ != $entry->setDataFromPost($fields, $errors)) {
        foreach ($errors as $field_id => $message) {
            echo 'ERROR  : entry ' . ($index + 1) . ' field ' . $field_id . ': ' . $message . "\n";
        }
    } else {
        // attempt to save
        if (!$entry->commit()) {
            echo 'ERROR  : entry ' . ($index + 1) . "\n";
        } else {
            Symphony::ExtensionManager()->notifyMembers('EntryPostEdit', '/publish/edit/', array(
                'section' => SectionManager::fetch($section_id),
                'entry' => $entry,
                'fields' => $fields
            ));
            echo 'Success: entry ' . ($index + 1) . "\n";
        }
    }

    unset($entry);
    unset($fields);

    echo number_format(memory_get_peak_usage(true), 0, '.', ',') . " bytes\n";
}

Memory consumption stays at the initial level of 47 MB for around 150 entries, then starts growing slowly. It reaches 437 MB (!) for entry number 5000.

I did some more tests:

  • Using $entry->checkPostData and $entry->setDataFromPost only (i.e. without $entry->commit()), memory still grows, but much slower. It reaches 55 MB for entry number 5000.
  • Using only $entry->commit() and the extension manager notification (i.e. without $entry->checkPostData and $entry->setDataFromPost), it reaches 239 MB.

(Isn't this strange? If I add these figures, this is still much lower than using everything at the same time. I really double-checked these figures!)

I even tried to sleep(1) the script after batches of 100 entries, giving the garbage collector a lot of time to do its job — to no avail.

Do you have any ideas what is happening here?

@michael-e
Member Author

Ah, and BTW, if I don't do anything with the entry object, just create it, like so:

foreach ($data as $index => $fields) {
    $entry = EntryManager::create();
    $entry->set('section_id', $section_id);
    $entry->set('author_id',  $author_id);
    if (!empty($fields['id'])) {
        $entry->set('id', $fields['id']);
    }

    echo 'Success: entry ' . ($index + 1) . "\n";

    echo number_format(memory_get_peak_usage(true), 0, '.', ',') . " bytes\n";
}

the memory consumption does not increase at all.

@brendo
Member

brendo commented Apr 6, 2015

This will be largely dependent on what fields, and how many, are in the section. It's possible that a particular field is doing something expensive in the check or the delegate.

Are you able to narrow this down at all?

@michael-e
Member Author

I have only one custom field (one that is similar to the Reflection Field), and I already removed that to no avail. The rest of the fields are Input, Selectbox, SBL and Checkbox. The section has 32 fields.

It's hard for me to narrow this down. Honestly, at the moment I don't understand why the memory consumption grows at all. What exactly remains in memory?

I will try.

@michael-e
Member Author

The following functions do not increase memory consumption:

  • $entry->checkPostData

The following functions increase memory consumption:

  • $entry->setDataFromPost
  • $entry->commit
  • Symphony::ExtensionManager()->notifyMembers

I also verified that the size of the entry arrays (i.e. the number of elements) has an influence. If I remove unneeded keys from the arrays, things are a bit better.

To prevent the phenomenon altogether, I will probably have to spawn child processes for saving the entries, right?

@michael-e
Member Author

spawn child processes

Or wouldn't that help either?

@brendo
Member

brendo commented Apr 6, 2015

I have only one custom field (one that is similar to the Reflection Field)

Keep in mind that anything with Reflection-like capabilities is inherently expensive. This is because the field must build the Entry object before it can function. This would be my top suspect at the moment. Is it possible to duplicate the section, remove this field, and see if the trend is the same?

The following functions increase memory consumption:

  • $entry->setDataFromPost
  • $entry->commit
  • Symphony::ExtensionManager()->notifyMembers

At the moment it's a process of elimination, I would think, so is it possible to narrow down which of these three causes the memory use to grow? Tracking the delegate call should be simpler because we just have to look at the extensions that subscribe to that delegate.

To prevent the phenomenon altogether, I will probably have to spawn child processes for saving the entries, right?

Unsure, it's very hard to say without knowing exactly where the memory leak is occurring, and this in itself could be tricky to nut out. I've not had to search for a memory leak in PHP before so I'm unsure where to start. A tool like MacGDBP may be able to shed some light.

Alternatively, perhaps gc_collect_cycles in the foreach may reduce the memory usage? In any case, this is a mask for the underlying problem!
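A self-contained sketch of the gc_collect_cycles() idea, in plain PHP with no Symphony code. The loop below only fabricates reference cycles to show the collector reclaiming them on demand; in the import loop, the call would go at the end of each batch:

```php
<?php
// Objects that reference each other form cycles that plain refcounting
// cannot free; an explicit collection pass reclaims them.
gc_enable();

for ($i = 0; $i < 1000; $i++) {
    $a = new stdClass();
    $b = new stdClass();
    $a->other = $b;   // create a reference cycle
    $b->other = $a;
    unset($a, $b);    // refcounts stay above zero because of the cycle
}

// In the import loop this call would sit at the end of each batch of entries.
$collected = gc_collect_cycles();
echo $collected . "\n";
```

As noted above, though, this only masks a leak caused by cycles; it cannot free memory that is still genuinely referenced.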

@michael-e
Member Author

This would be my top suspect at the moment. Is it possible to duplicate the section and remove this field and see if the trend is the same?

I have already done that, and it didn't change anything. (My field is also a lot simpler than the Reflection Field. No XSLT processing, just concatenating some strings. It's only similar in how it "registers".)

is it possible to narrow down which of these three causes the memory use to grow?

Any of them.

Alternatively, perhaps gc_collect_cycles in the foreach may reduce the memory usage?

I also tried that, and it doesn't help at all.

I am just coding a child process and will watch memory consumption then. If it doesn't help, I will have to investigate more.

@michael-e
Member Author

Spawning child processes in the CLI, using something like this:

shell_exec('php ' . EXTENSIONS . '/some/path/cli.backgroundprocess.php ' . escapeshellarg(serialize($fields)));

solves the memory issue, but it is terribly slow (around ten times slower). This may be caused by the fact that every child process has to load Symphony. :-)

I will have to dig deeper.

@brendo
Member

brendo commented Apr 6, 2015

is it possible to narrow down which of these three causes the memory use to grow?
Any of them.

So regardless of which is commented out, the memory growth is consistent and at the end of 5000 entries the value is the same?

I don't suppose you're able to package any of this up into an ensemble so I can aid in debugging? As I mentioned before, the leak really could be anywhere. It can be in the EntryManager, the SectionManager, FieldManager, ExtensionManager, inside the actual Field itself or any one of the Entry classes!

@brendo
Member

brendo commented Apr 6, 2015

SectionManager::fetch($section_id)

Are you able to pull this out of your loop? If this was defined as $section, it would save the repeated calls to fetch the Section from the Manager (even though it's meant to be cached, perhaps there's a leak here?)

@michael-e
Member Author

So regardless of which is commented out, the memory growth is consistent

No, as mentioned in an earlier comment, every single function has an impact, but the problem gets worse if I use them all.

I don't suppose you're able to package any of this up into an ensemble so I can aid in debugging?

Difficult. I would have to think about it. There are legal issues as well (with the data).

Here are some new figures:

Using spawned processes is a lot faster, of course, when the processes are pushed to the background (i.e. you stop waiting for them) by appending > /dev/null & echo $!, for example. However, this will push the server's processors (I mean all cores) to the limit.

Spawned, background:

  • no memory issues
  • 391 seconds for 5000 entries
  • all cores at 100 percent during execution

All in one process (initial setup):

  • memory consumption high
  • 173 seconds for 5000 entries
  • processor cores at around 30 percent (mean)
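The "spawned, background" variant can be sketched in isolation. Here `sleep 1` stands in for the real `cli.backgroundprocess.php` call; the redirection plus `& echo $!` detach the child and hand back its PID, so the parent does not block on the child's runtime:

```php
<?php
// Self-contained sketch of the background-spawn pattern from the comments
// above. "sleep 1" is a placeholder for the real per-entry import script.
$cmd = 'sleep 1';

// "> /dev/null 2>&1 &" detaches the child; "echo $!" returns its PID
// immediately, so shell_exec() does not wait for the child to finish.
$pid = (int) shell_exec($cmd . ' > /dev/null 2>&1 & echo $!');
echo $pid . "\n";
```

Note that firing thousands of children this way is exactly what saturates all cores, as the figures above show.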

SectionManager::fetch($section_id)

Are you able to pull this out of your loop?

I will try this now!

@michael-e
Member Author

Including SectionManager::fetch($section_id) in the loop:

  • 436,469,760 bytes used
  • 169.45167779922 s

Removing SectionManager::fetch($section_id) from the loop:

  • 436,469,760 bytes used
  • 166.0119240284 s

Hmmm, no significant difference. :-(

@michael-e
Member Author

Removing a single function, Symphony::ExtensionManager()->notifyMembers:

  • 368,050,176 bytes used
  • 153.42075800896 s

Removing a single function, $entry->setDataFromPost:

  • 238,551,040 bytes used
  • 80.510625839233 s

Removing both, Symphony::ExtensionManager()->notifyMembers and $entry->setDataFromPost:

  • 170,655,744 bytes used
  • 66.98449587822 s

Removing three functions, Symphony::ExtensionManager()->notifyMembers, $entry->setDataFromPost and $entry->commit():

  • 47,710,208 bytes used
  • 16.244827032089 s

Also removing $entry->checkPostData makes no difference in memory consumption:

  • 47,710,208 bytes used
  • 11.713372945786 s

@michael-e
Member Author

As you see, there are three functions that "leak" memory.

@brendo
Member

brendo commented Apr 6, 2015

Difficult. I would have to think about it. There are legal issues as well (with the data).

Let me know what you come up with. Because of the scope of where the leak could be originating from, the best way for me to help is to be able to replicate the issue on my machine and then start debugging as well. If the problem is not data specific, we should be able to recreate this using 'fake data', but I'm keen to replicate your exact setup as closely as possible.

Otherwise, I can spin up a dummy script that creates fake data for a section that contains all of the core fields and share that ensemble with you, essentially a test case. We can use this as base to attempt to first replicate and then resolve the problem. If we can't replicate with this test case, then we'll have to start adding fields to bridge the gap between the simple test and the actual test until we find the breaking point.

I assume the environment is running PHP 5.6.7?

@michael-e
Member Author

The whole thingie is running on a virtual server, I don't even have a local installation. On the server I am running PHP 5.4.39.

Debugging would probably be easier on localhost, right?

I will come back to this a bit later, since I have a deadline for initial data import. Since all these Excel files have 5000 rows, I could do it with the current script. (It might become a serious issue later, when people try bigger files.)

One more interesting test:

Using setDataFromPost in dry-run mode ($entry->setDataFromPost($fields, $errors, true)) I also see a reduced memory footprint:

  • 365,953,024 bytes used
  • 148.83119392395 s

Without dry-run mode:

  • 438,829,056 bytes used
  • 174.13854598999 s

(BTW, dry-run seems to be the only way to successfully set a creation_date for the entry. I haven't managed to set the modification_date at all.)

@michael-e
Member Author

My gut feeling says it's the database (MySQL) class. Symphony has a static Database object, and doesn't that mean that the "magic" MySQL::__destruct function is useless inside my script?
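The hypothesis above can be illustrated without Symphony. In this made-up sketch (class names are not Symphony's), an object held in a static property lives until the script exits, so its __destruct() never runs mid-import and anything it accumulates stays in memory for the whole run:

```php
<?php
// Hypothetical stand-in for a database class with an ever-growing query log.
class Connection
{
    public $queryLog = array();

    public function __destruct()
    {
        // Only runs at script shutdown, because the static reference below
        // keeps the instance alive for the entire import.
        echo "destructed with " . count($this->queryLog) . " logged queries\n";
    }
}

// Hypothetical stand-in for a framework holding a static Database object.
class App
{
    private static $db = null;

    public static function db()
    {
        if (self::$db === null) {
            self::$db = new Connection();
        }
        return self::$db;
    }
}

for ($i = 0; $i < 5000; $i++) {
    App::db()->queryLog[] = 'SELECT ' . $i;  // grows for the entire run
}

echo count(App::db()->queryLog) . "\n";
```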

@michael-e
Member Author

(Excuse me if I am completely wrong with the assumption above. As you know, my PHP skills are pretty limited.)

@nitriques
Member

Michael, I think your assumption is right. The MySQL instance is destroyed only when the php script exits.

@michael-e
Member Author

OK, I think I can live with the memory consumption. For 10,000 entries the memory footprint is less than 800 MB, which is still OK for me. The good thing is that the script is pretty fast (more than 30 entries per second, with 33 fields in the section).

I will tell the client that 10,000 entries is the importer's limit.

For anybody who is interested, this is the final code:

// …
// first step: building an array of entries… (keys must match the field handles)
// …

// second step: plug 'em into Symphony
foreach ($data as $index => $fields) {
    $entry = EntryManager::create();
    $entry->set('author_id',  $author_id);
    $entry->set('section_id', $section_id);
    $entry->set('creation_date', DateTimeObj::get('c', $fields['created']));
    $entry->set('creation_date_gmt', DateTimeObj::getGMT('c', $fields['created']));

    if (!empty($fields['id'])) {
        // We must check if the provided entry ID exists
        $existing = EntryManager::fetch($fields['id']);
        $existing = !empty($existing) ? $existing[0] : null; // avoid a notice when no entry matches
        if (is_object($existing)) {
            $entry->set('id', $fields['id']);
            $entry->set('creation_date', $existing->get('creation_date'));
        }
    }

    if (Entry::__ENTRY_FIELD_ERROR__ == $entry->checkPostData($fields, $errors)) {
        foreach ($errors as $field_id => $message) {
            echo 'ERROR  : entry ' . ($index + 1) . ' field ' . $field_id . ': ' . $message . "\n";
        }
        $count_errors ++;
    } else if (Entry::__ENTRY_OK__ != $entry->setDataFromPost($fields, $errors, true)) {
        foreach ($errors as $field_id => $message) {
            echo 'ERROR  : entry ' . ($index + 1) . ' field ' . $field_id . ': ' . $message . "\n";
        }
        $count_errors ++;
    } else if (!$entry->commit()) {
        echo 'ERROR  : entry ' . ($index + 1) . "\n";
        $count_errors ++;
    } else {
        Symphony::ExtensionManager()->notifyMembers('EntryPostEdit', '/publish/edit/', array(
            'section' => SectionManager::fetch($section_id),
            'entry' => $entry,
            'fields' => $fields
        ));
        echo 'Success: entry ' . ($index + 1) . "\n";
    }

    unset($errors);
}

@brendo: If you also tend to think that the memory thingie is "normal behaviour", feel free to close this issue.

@brendo
Member

brendo commented Apr 7, 2015

Using setDataFromPost in dry-run mode ($entry->setDataFromPost($fields, $errors, true)) I also see a reduced memory footprint:

Whoa, be careful with this. It will cause some fields to skip the actual saving work (for example, the Upload field).

(BTW, dry-run seems to be the only way to successfully set a creation_date for the entry. I haven't managed to set the modification_date at all.)

Sounds like a bug

@brendo: If you also tend to think that the memory thingie is "normal behaviour", feel free to close this issue.

While I'd expect the memory to grow when importing, it is a bit curious that it's growing at a consistent rate instead of being absorbed back by garbage collection. The MySQL class does keep a static log of all the queries that are run, so there will be some memory growth as that log gets larger, but I wouldn't expect it to be as large as it is!

@michael-e
Member Author

Regarding the bug with creation_date and modification_date, I will open a separate issue as soon as I have a test setup. I will try and build an ensemble that I can send to @brendo to test this side issue plus the main issue.

@michael-e
Member Author

OK, I will send a very simple test scenario (built on a vanilla installation of Symphony) to @brendo. Anybody else who is interested? @nitriques?

@jonmifsud
Member

I think I noticed something similar to Michael; I didn't bother much as I am happy running a one-time import locally and then uploading the db to the prod servers.

My scenario is slightly different, as I needed two DB connections (to get data from the old system), plus some strange import logic (so I didn't bother testing whether I was causing the leak or not). Unfortunately, due to the data sensitivity, I cannot share either. In terms of fields imported I've got standard stuff: select boxes, text inputs and the odd textarea.


@michael-e
Member Author

In the test scenario that I have sent to @brendo there are Input fields exclusively. Nevertheless the memory phenomenon can be seen. So it has nothing to do with any special fields, that's for sure.

@brendo
Member

brendo commented Apr 7, 2015

Yep, just for transparency's sake, I was able to run @michael-e's test case and my environment produced a slightly better result, but I still consider it high:

  • Memory peak usage: 364M
  • Execution time: 128s

I'll be looking at using XDebug and the aforementioned MacGDBP to try and find the pain points.

@brendo
Member

brendo commented Apr 7, 2015

Well, that was fast. On @michael-e's hunch, I immediately commented out the MySQL logging and the memory growth disappeared.

Importing the 5000 entries resulted in:

Memory peak usage: 4M

@brendo
Member

brendo commented Apr 7, 2015

So the question now is, how should we approach solving this?

  • Offer an option to turn off query logging?
  • Just keep a rolling log of 20 last queries?
  • Don't log at all?
  • Do nothing?

The purpose of the logged queries at the moment is for the Profiler, and also so that a backtrace can be displayed if an error occurs. I'm leaning towards a rolling window of queries approach; although it has its potential downsides, I'm not sure we'll actually realise any of them.
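For illustration, a generic sketch of the rolling-window option: a bounded buffer that keeps only the last N queries, so the log cannot grow without bound. This is deliberately not the actual Symphony MySQL class; names here are made up:

```php
<?php
// Hypothetical bounded query log: capped at the last $capacity queries.
class RollingQueryLog
{
    private $entries = array();
    private $capacity;

    public function __construct($capacity = 20)
    {
        $this->capacity = $capacity;
    }

    public function add($query)
    {
        $this->entries[] = $query;
        if (count($this->entries) > $this->capacity) {
            array_shift($this->entries);  // drop the oldest query
        }
    }

    public function all()
    {
        return $this->entries;
    }
}

$log = new RollingQueryLog(20);
for ($i = 1; $i <= 5000; $i++) {
    $log->add('SELECT ' . $i);
}
echo count($log->all()) . "\n";
```

The trade-off is the one named above: an error backtrace and the Profiler would only ever see the last N queries.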

@michael-e
Member Author

Wow, that was fast.

I wouldn't like to remove the logging. Any chance to turn it off when running a background script (like mine)? Do you have an idea?

@michael-e
Member Author

Something like Symphony::Database->disableLogging? :-)

@michael-e
Member Author

I tested a little "hack", and it seems to work. I added another private static variable to the MySQL class:

private static $_logging = true;

Then added a method to disable logging:

public static function disableLogging()
{
    self::$_logging = false;
}

and wrapped the code mentioned by @brendo in a condition:

if (self::$_logging === true) {
//…
}

In my importer script, I added:

Symphony::Database()->disableLogging();

Tested on PHP 5.3.28, the maximum memory consumption is 19 MB (compared to nearly 400 MB with logging enabled). At the same time, profiling database queries works normally for frontend pages (which must be one of the wonders of object-oriented programming; have I already mentioned that I am rather bad at programming?).
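For readers who want to see the pattern in isolation, here is a self-contained mock of the toggle: a static flag guards the logging path, mirroring the hack from this comment. This is deliberately not the real Symphony MySQL class:

```php
<?php
// Mock of the logging toggle; only the pattern, not Symphony's actual class.
class MockMySQL
{
    private static $_logging = true;
    private static $_log = array();

    public static function disableLogging()
    {
        self::$_logging = false;
    }

    public static function query($sql)
    {
        if (self::$_logging === true) {
            self::$_log[] = $sql;  // this is the growth observed in the thread
        }
        // ... the actual query execution would happen here ...
    }

    public static function logSize()
    {
        return count(self::$_log);
    }
}

MockMySQL::query('SELECT 1');   // logged
MockMySQL::disableLogging();
for ($i = 0; $i < 5000; $i++) {
    MockMySQL::query('SELECT ' . $i);  // no longer logged
}
echo MockMySQL::logSize() . "\n";
```

Because the flag is static, disabling it in one script does not affect other requests, which have their own PHP process; this matches the observation that frontend profiling kept working.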

@michael-e
Member Author

Bingo, with the above change to the MySQL class I performed a successful test import with 50,000 (test) datasets!

  • Memory peak usage: 131M
  • Execution time: 1,605s

(Memory usage should be even lower with newer PHP versions.)

@nitriques
Member

Wow, that's great! The disableLogging thing is nice! I would love to have that on a production server...

Maybe consider adding a config setting?

@michael-e
Member Author

Do you think a config setting would have a significant impact on a production site? For standard Symphony pages, there shouldn't be more than a few thousand queries or so, so IMHO the memory consumption for logging would be rather small, wouldn't it?

@nitriques
Member

so IMHO the memory consumption for logging would be rather small, wouldn't it?

Seeing how much memory logs can consume, I would love to be able to not generate them in production. I do have sites where the number of SQL queries is >> 1000.

@michael-e
Member Author

I am not against it. Just wanted to know… :-)

@nitriques
Member

:)

@brendo
Member

brendo commented Apr 8, 2015

Something like Symphony::Database->disableLogging? :-)

Yep that'll work :)

@michael-e
Member Author

@brendo:

  1. Should I send a PR as described above? Or would you like to implement this?
  2. Shall we add this to the Symphony 2.6 branch as well? (If not, no prob — I can add it to my client's private branch).
  3. Do you think that an additional config option (as proposed by @nitriques) is a valid improvement?

@brendo
Member

brendo commented Apr 8, 2015

Yes to all of the above. Happy for you to submit the PR if you have time.
I'm hoping to do this tomorrow evening otherwise :)

@michael-e
Member Author

I will try to do at least number 1. I am not sure about the config option, because it should be added to the updater as well.

Shall I submit against integration, and you will then cherry-pick the commit into 2.6.x?

FYI: Success! I did a "real-life" import (i.e. with production data) using the fix in the MySQL class:

  • small virtual server (3 cores, 2GB RAM)
  • parsed 9 Excel XML worksheet files (each: 40 columns, 5000 data rows)
  • merged everything (45000 associative arrays made of field/value pairs) into one array; memory usage at that point: 230 MB
  • imported into Symphony section (32 fields)
  • overall time: 1387s (which means more than 30 entries/second)
  • memory peak usage: 330 MB — given the initial value for building the array, the Symphony part was only 100 MB

I am very happy that you figured it out, @brendo!

@jensscherbl
Member

Shall I submit against integration, and you will then cherry-pick the commit into 2.6.x?

Is this a bugfix or a new feature? PRs should be submitted against 2.6.x for bugfixes, and against integration for new features. integration is our development branch for the upcoming 3.0.0 release, so if it's a new feature that should be available for 2.x.x, we should talk about a possible 2.7.x release.

@michael-e
Member Author

Well, it is both. It fixes an issue with memory consumption by adding a new feature (to switch off the query logger).

So what shall I do?

@jensscherbl
Member

It fixes an issue with memory consumption by adding a new feature

integration then. Or 2.7.x, but @brendo was against a 2.7.x release and wanted to jump directly to 3.x.x.

@michael-e
Member Author

I am a bit lost with the PDO rewrite of the database stuff, which is in the integration branch. Actually it already has a parameter called _log_enabled, but it gets reset in the query function in these lines. Why do we do that? If we don't want the q function to log anything, why don't we just remove the logging there?

(Anyway, I could send my basic fix to the 2.6 branch if @brendo agrees. We'll need different implementations for 2.6 and integration/3.0.)

@brendo
Member

brendo commented Apr 22, 2015

(Anyway, I could send my basic fix to the 2.6 branch if @brendo agrees. We'll need different implementations for 2.6 and integration/3.0.)

Sorry for missing this. Yes, please submit a basic fix for 2.6.x, and then we can figure something else out for integration. The PDO work still needs a little refinement.

Why do we do that? If we don't want the q function to log anything, why don't we just remove the logging there?

I don't know, it looks like a mistake, perhaps from debugging the feature. Those lines should simply call q, which has its own check whether to log the query or not.

michael-e added a commit to michael-e/symphony-2 that referenced this issue Apr 22, 2015
@michael-e
Member Author

@brendo: Feel free to do it differently—you know much better than me.

@brendo
Member

brendo commented Apr 22, 2015

This is good, I'll add two additional methods, enableLogging and isLoggingEnabled to complete the API.

@michael-e
Member Author

Ah, I understand, thanks!

brendo added a commit that referenced this issue Apr 22, 2015
Add possibility to disable query logging. RE: #2398
@brendo
Member

brendo commented Apr 22, 2015

Done, the Database class has been updated with the enableLogging and isLoggingEnabled functions. I've also made Symphony aware of a new config option, query_logging. You can set this in the Database section of your configuration with the value on or off to enable/disable logging site wide without having to programmatically set this value. The downside of this is your ?profile will no longer include the query log, instead it'll just have the count.
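For completeness, the site-wide setting might look like the fragment below. The option name query_logging and its on/off values come from this comment; the exact array layout of config.php is an assumption, not checked against the Symphony source:

```php
// Database section of config.php (layout assumed for illustration):
'database' => array(
    // ... host, user, password, etc. ...
    'query_logging' => 'off',  // 'on' restores the default behaviour
),
```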

@brendo brendo closed this as completed Apr 22, 2015
@brendo brendo added this to the 2.6.2 milestone Apr 22, 2015
@nitriques
Member

Thanks @brendo really cool feature!

@michael-e
Member Author

Works like a charm. It's nice that you can disable the query cache as well, so you can check how important this cache is to your site (and if you should check/optimize the MySQL settings…). Thanks, @brendo!

@nitriques
Member

@michael-e Indeed!!
