improve Bson/MongoDB performance #633

Closed
MartinNowak opened this Issue Apr 24, 2014 · 15 comments

MartinNowak (Contributor) commented Apr 24, 2014

I'm trying to convert a MySQL table to MongoDB, and it turns out that GC usage is a huge bottleneck. It should be fairly simple to optimize the MongoDB Message construction and toBsonData using an OutBuffer interface. ScopeBuffer might shine here.
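Roughly the kind of thing I have in mind — a sketch only, with a hand-rolled append helper and a fixed stack buffer standing in for whatever buffer abstraction ends up being used:

```d
import std.bitmanip : nativeToLittleEndian;

// Sketch: append one BSON int32 element ("name" : value) into a caller-provided
// buffer instead of GC-allocating a fresh array per element/document.
size_t appendBsonInt32(ubyte[] buf, size_t len, string name, int value)
{
    buf[len++] = 0x10;                               // BSON element type: int32
    foreach (ch; name) buf[len++] = cast(ubyte) ch;  // element name...
    buf[len++] = 0;                                  // ...as a null-terminated cstring
    const le = nativeToLittleEndian(value);          // 4 payload bytes, little endian
    buf[len .. len + 4] = le[];
    len += 4;
    return len;
}

void main()
{
    ubyte[256] storage;                  // stack storage, no GC allocation
    size_t len = 0;
    len = appendBsonInt32(storage[], len, "answer", 42);
    assert(len == 1 + "answer".length + 1 + 4);
}
```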

etcimon (Contributor) commented Apr 24, 2014

Just a heads up in case you haven't heard of it: it should be much easier to convert it to JSONB than BSON using the row_to_json SQL function ;) You could then keep using SQL. It would also be very nice to see the vibe.d MongoDB tools become compatible with Postgres; I may write that up eventually (which, by the way, doesn't say anything against finding a great use for ScopeBuffer here).

etcimon (Contributor) commented Apr 24, 2014

There's a better explanation of row_to_json here. So: re-import a .sql dump from MySQL into PostgreSQL, then create a dump of the JSON data from Postgres (one JSON string per line), and move that into MongoDB with something like `mongoimport -d test -c shops data.json`.

MartinNowak (Contributor) commented Apr 24, 2014

> It should be much easier to convert it to JSONB than BSON using the row_to_json SQL function, just a heads up if you haven't heard of it ;)

That's nice, I'll try to find something similar for MySQL.

Still, the mongo driver could be made much faster with some simple memory management improvements.

etcimon (Contributor) commented Apr 24, 2014

> That's nice, I'll try to find something similar for MySQL.

Yes, you can use mysqldump to create a .sql file and load it into a Postgres install with `psql -h hostname -d databasename -U username -f file.sql`. I don't think MySQL has these amazing features! heh

> Still, the mongo driver could be made much faster with some simple memory management improvements.

In fact, I believe everything in vibe.d was designed with the network as the bottleneck, so a lot of objects need to be adjusted for manual allocation or maybe even scoped allocation (as with ScopeBuffer!). Personally, I'm building a different toolset to circumvent JSON wherever possible, and I'm using as many pre-built SQL queries as possible (see the SQL Maestro query builder!). But when I need quick development times, MongoDB and JSON are a great tool (as long as they're not used for performance; NoSQL is slow).

s-ludwig (Member) commented Apr 24, 2014

There are actually quite a few places with unneeded allocations in the MongoDB driver, and some of them require deeper changes, such as directly deserializing the BSON replies instead of first reading them into memory. I'll have a look at this (I've wanted to do that for a while anyway).
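To illustrate the idea (just a sketch; the element handling is drastically simplified and this is not the driver's actual code):

```d
import std.bitmanip : littleEndianToNative;

struct Item { int count; }

// Sketch: read an int32 element directly out of a raw BSON reply slice into
// the target struct, without building an intermediate Bson document first.
Item deserializeItem(const(ubyte)[] reply)
{
    size_t i = 1;                          // byte 0 is the element type (0x10 = int32)
    while (reply[i] != 0) i++;             // skip the element name (cstring)
    i++;                                   // skip the name's null terminator
    ubyte[4] raw;
    raw[] = reply[i .. i + 4];             // the 4-byte little-endian payload
    return Item(littleEndianToNative!int(raw));
}

void main()
{
    // A single BSON int32 element: type 0x10, name "n", value 42
    const(ubyte)[] reply = [0x10, 0x6E, 0x00, 0x2A, 0x00, 0x00, 0x00];
    assert(deserializeItem(reply).count == 42);
}
```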

s-ludwig added a commit that referenced this issue Apr 25, 2014

Fourth round of eliminating allocations in the MongoDB driver. See #633.
MongoCollection.find[One] now accepts an explicit return type instead of always returning Bson. This makes it possible to avoid all memory allocations as long as the return type doesn't contain indirections (e.g. strings or arrays).
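A minimal usage sketch of what this enables; the Item struct, the collection name, and the exact call syntax are illustrative assumptions, not taken from the driver's documentation:

```d
import vibe.db.mongo.mongo;

// Hypothetical result type with only POD fields, so deserializing a query
// result into it should not need any GC allocation.
struct Item
{
    int count;
    double price;
}

void example()
{
    // Requires a running MongoDB instance.
    auto client = connectMongoDB("127.0.0.1");
    auto coll = client.getCollection("test.items");

    // Old style: always materializes a Bson document (allocates).
    Bson doc = coll.findOne(["count": 42]);

    // New style: explicit return type without indirections, so the result
    // itself needs no allocation (per the commit message above).
    auto item = coll.findOne!Item(["count": 42]);
}
```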
s-ludwig (Member) commented Apr 25, 2014

Should be in pretty good shape now. There is still the MongoCursorData class, which gets instantiated for every query, and query results that contain strings or other kinds of non-POD data still need to allocate. This could be mitigated using a custom allocator. But even without that, the improvement should be pretty drastic, considering how many allocations have been eliminated (I haven't done an actual benchmark yet).

MartinNowak (Contributor) commented Apr 25, 2014

Just wow, thanks a lot.

> But even without that, the improvement should be pretty drastic, considering how many allocations have been eliminated (I haven't done an actual benchmark yet).

I'll provide some feedback once I get back to this.

s-ludwig (Member) commented Apr 25, 2014

An ad-hoc find benchmark with a simple query expression, querying a single item containing a string and some numeric fields, repeated 10M times.

Discarding the results (query API overhead):

  • old version: 0.28 Mqueries/s
  • new version (Bson result): 4.55 Mqueries/s
  • new version (struct result): 4.66 Mqueries/s

Reading back the results:

  • old version: 7.9 kqueries/s
  • new version (Bson result): 8.2 kqueries/s
  • new version (struct result): 8.1 kqueries/s

Seems like MongoDB is the clear bottleneck here. Next step would be to try with a larger result count per query.
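For reference, roughly how such an ad-hoc throughput measurement could look (a sketch; runQuery is just a placeholder for the actual findOne call used above):

```d
import std.datetime.stopwatch : StopWatch, AutoStart;
import std.stdio : writefln;

// Placeholder for the actual query; in the real benchmark this would be a
// MongoCollection.findOne call, either discarding or reading back the result.
void runQuery() {}

void main()
{
    enum n = 10_000_000;
    auto sw = StopWatch(AutoStart.yes);
    foreach (i; 0 .. n)
        runQuery();
    sw.stop();
    const secs = sw.peek.total!"msecs" / 1000.0;
    writefln("%.2f Mqueries/s", n / secs / 1e6);
}
```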

etcimon (Contributor) commented Apr 25, 2014

Simple queries per second are a really bad indicator of transactional database performance. As in SQL, bulk inserts/selects should be the way to go, and they should always be kept in mind when running the same command in a loop.

I had to learn this the hard way when importing 1 million CSV rows in phpMyAdmin; the import script would do it one line at a time, taking 10 minutes rather than 10 seconds!
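For illustration, a sketch of the difference (whether and how the driver batches an array passed to insert like this is an assumption here, not something taken from this thread):

```d
import vibe.db.mongo.mongo;

// Hypothetical row type for the migrated table.
struct Row { string name; double price; }

void migrate(Row[] rows)
{
    auto coll = connectMongoDB("127.0.0.1").getCollection("test.shops");

    // Slow: one round trip per row when looping over single inserts.
    // foreach (r; rows) coll.insert(r);

    // Preferred: hand the whole batch to the driver in one call.
    coll.insert(rows);
}
```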

s-ludwig (Member) commented Apr 25, 2014

Well, what makes a good or bad indicator depends highly on the application. But the point here was not to measure the DB performance, but to set it in relation to the API/client overhead. I also didn't want to measure insertion performance, because insertion isn't really affected much by the allocation issue. The test here is definitely measuring an extreme, but that was kind of the point. As I've said, testing with more documents per query would be the next step.

etcimon (Contributor) commented Apr 25, 2014

I don't think vibe.d was ever a bottleneck for Mongo anyway, but I was saying that to point out that it'll be much faster to move from MySQL to MongoDB through vibe.d with bulk inserts (where the interesting benchmark would be entries/sec rather than req/s) - like you said, yes.

etcimon (Contributor) commented Apr 25, 2014

I'd be curious to see these kqueries/sec with (10?) concurrent connections, though, if that weren't the case ;)

s-ludwig (Member) commented Apr 25, 2014

10 parallel tasks (same thread) get about 4 times the throughput for the same kind of benchmark - no big difference between the old and new code. BTW, for around 1000 results per query and a single task, I think I got about 38 kdocuments/s. But I'd better stop the benchmarking now; there is more important stuff waiting to be done ;)
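Roughly how the concurrent variant could look with vibe.d tasks (a sketch; runQuery stands in for the actual query, and the exact task API usage is an assumption):

```d
import vibe.core.core : runTask;
import vibe.core.task : Task;

// Placeholder for the actual MongoDB query issued by each task.
void runQuery() {}

void benchmarkConcurrent(int tasks, int queriesPerTask)
{
    Task[] handles;
    foreach (t; 0 .. tasks)
        handles ~= runTask({
            foreach (i; 0 .. queriesPerTask)
                runQuery();                 // yields on I/O, so tasks interleave
        });
    foreach (h; handles)
        h.join();                           // wait for all tasks to finish
}
```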

etcimon (Contributor) commented Apr 25, 2014

Nice, these benchmarks are very helpful. Thanks!

MartinNowak (Contributor) commented Jul 17, 2014

I think we can close this issue. The low-hanging memory waste has been addressed and further improvements aren't necessary right now.
