Performance Issue #53

Closed
B0rner opened this issue Nov 27, 2013 · 9 comments

B0rner commented Nov 27, 2013

Hi,
I'm running some tests indexing a MongoDB collection into Solr 4.4. The MongoDB collection contains 10,000,000 documents, and importing them takes a long time: mongo-connector pushes about 120,000 documents per hour to Solr, so a full import will take about 83 hours.
Using the data import handler with the same documents from MySQL, the whole import takes 3 hours.
I think one of the main reasons for the slow import is that each document is imported individually and followed by its own commit, which is performance overkill.
So the question is: isn't it possible to push a set of documents to Solr followed by one commit for all of them? Or is there an "initial sync" feature that is based on a MongoDB query rather than the oplog?
There is a second problem: because the oplog is a capped collection, there can be more documents in MongoDB than in the oplog. How can I index the documents that are not in the oplog (anymore)?

ghost assigned llvtt Nov 29, 2013

llvtt commented Dec 2, 2013

Hi @B0rner,

Upserts using the Solr DocManager do commit on every upsert, which could very well be excessive, but this is the default in pysolr's add method. From what I've read in the Solr documentation, commits are by default "hard commits," which flush to disk. I don't know whether "soft commits" are supported in Solr 4.x, given the warning exclamation point next to "Solr4.0" in the docs. I'd be willing to look into this further this week. At the very least, it seems as if there should be a feature to add documents to Solr without committing, and then commit documents every X number of upserts or when there is no other activity happening. Does this sound reasonable to you?
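
For illustration, a rough sketch of that idea in Python (this is not mongo-connector's actual code; the function name and the batch size are made up):

from pysolr import Solr

solr = Solr("http://localhost:8983/solr")
COMMIT_EVERY = 1000  # illustrative value

def upsert_all(docs):
    # Add documents without committing, then issue one hard commit
    # per COMMIT_EVERY upserts instead of one per document.
    for i, doc in enumerate(docs, start=1):
        solr.add([doc], commit=False)
        if i % COMMIT_EVERY == 0:
            solr.commit()
    solr.commit()  # flush whatever is left over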

As for your second question, you can do an "initial sync" of all your documents from MongoDB into Solr by truncating the oplog progress file (called "config.txt" by default). After you restart mongo-connector, it will replicate all the documents from your targeted namespace over to Solr, regardless of their presence in the oplog. Note that each document from the collection being dumped is inserted into Solr using the Solr DocManager's update method, which is where you've identified a possible bottleneck to performance with overcommitting.
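
In Python, for example, truncating the progress file is a one-liner (assuming the default file name and the directory mongo-connector was started from):

# Emptying the oplog progress file (default name "config.txt") forces a
# full collection dump on the next mongo-connector start.
open("config.txt", "w").close()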

If you need a quick solution to your problem, you might try changing the call to pysolr's add method to use commit=False, if only for the "initial sync" part of the process. I'm not sure when or whether uncommitted documents make it to disk in Solr, so you may need to commit these upserts manually:

from pysolr import Solr

# Connect to Solr and issue a manual (hard) commit for any pending updates.
connection = Solr("http://localhost:8983/solr")
connection.commit()

Hope this helps. I'll look into a more performant way to handle commits in Solr soon.


llvtt commented Dec 9, 2013

Update about better commit practices on Solr:

It looks as if the commit behavior can be configured in the solrconfig.xml file that comes with Solr. There's an autoCommit option, available since Solr 1.2, that lets the user specify that documents should be committed after every X ms or after every Y documents needing to be written to disk. Since version 4.0-alpha, there's even an autoSoftCommit option (see https://issues.apache.org/jira/browse/SOLR-2193). Since the user already has a lot of power over commit behavior in Solr, I don't think it's necessary for mongo-connector to make any commits in solr_doc_manager.py. Instead, there should be a blurb in the README informing the user that mongo-connector makes no commits in Solr, with a link to the relevant Solr documentation about solrconfig.xml (given above) for how to configure commit behavior.
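
For illustration, an excerpt along these lines inside the <updateHandler> section of solrconfig.xml (the interval values here are examples only):

<!-- Hard commit every 15 seconds or every 10,000 pending documents,
     without opening a new searcher. -->
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>10000</maxDocs>
  <openSearcher>false</openSearcher>
</autoCommit>
<!-- Soft commit (visibility without a full flush to disk) every second;
     requires Solr >= 4.0. -->
<autoSoftCommit>
  <maxTime>1000</maxTime>
</autoSoftCommit>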

@B0rner, does this make sense to you?

As a side note, it looks as if elastic_doc_manager.py suffers from exactly the same problem being described here about the Solr DocManager, i.e., the DocManager "refreshes" after every upsert, even though this "refresh" behavior is also in the run_auto_commit method. I'll have to do some more research to determine what should be done in ES.


llvtt commented Dec 10, 2013

An update about better commit practices on Elasticsearch:

According to the ES "refresh" documentation, refreshes are by default already scheduled "periodically" (according to that page, periodically = every second). Flushes are handled automatically in ES. If mongo-connector were to assert control over refreshes in ES (it currently refreshes after every update/insert), there is the notion of a refresh interval that refreshes an index every X seconds. It could be worth switching off periodic refreshes during collection dumps to improve write throughput, then resetting the refresh interval to whatever it was previously for that index. I think that either leveraging the refresh interval or removing the manual refreshes from elastic_doc_manager.py and letting ES do its default periodic refreshing would be more performant than the current behavior.
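
A rough sketch of the collection-dump idea against the ES REST settings API (the index name and host are illustrative; this is not mongo-connector code):

import json
import requests

settings_url = "http://localhost:9200/my_index/_settings"

# Disable periodic refreshes for the duration of a collection dump...
requests.put(settings_url, data=json.dumps({"index": {"refresh_interval": "-1"}}))

# ... bulk-load documents here ...

# ...then restore the ES default of refreshing once per second.
requests.put(settings_url, data=json.dumps({"index": {"refresh_interval": "1s"}}))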

Another side note:

I'm noticing that the auto_commit parameter in the constructor of both the Solr DocManager and the ES DocManager is not configurable by the user of mongo-connector. It seems this feature was originally meant to let the user choose whether mongo-connector should automatically refresh/commit on each upsert (auto_commit=True) or let the underlying configuration take care of this (auto_commit=False). Since both Solr and ES have functionality for "commit within X amount of time," the auto_commit parameter could instead be a number X specifying how long operations may hang around before being committed. For DocManagers that don't support committing within X amount of time, X != 0 could mean a commit on each upsert. Either way, this parameter should probably be exposed as a command-line parameter in connector.py, and both the Solr and ES DocManagers can take advantage of it.
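
A hedged sketch of what that could look like for Solr, assuming a pysolr version that supports the commitWithin argument (it maps to Solr's own commitWithin update parameter); the function and interval semantics are illustrative, not the actual DocManager API:

from pysolr import Solr

solr = Solr("http://localhost:8983/solr")

def upsert(doc, auto_commit_interval=None):
    if auto_commit_interval is None:
        # Leave commit timing entirely to solrconfig.xml.
        solr.add([doc], commit=False)
    elif auto_commit_interval == 0:
        # Commit on every upsert (the current behavior).
        solr.add([doc], commit=True)
    else:
        # Ask Solr to commit within N milliseconds of this update.
        solr.add([doc], commit=False, commitWithin=str(auto_commit_interval))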


B0rner commented Dec 18, 2013

I'd be willing to look into this further this week. At the very least, it seems as if there should be a feature to add documents to Solr without committing, and then commit documents every X number of upserts or when there is no other activity happening. Does this sound reasonable to you?

Yes, I think this would be very useful for bigger databases. But of course, there is no need to re-implement an existing feature like Solr's autocommit, as you wrote.

As for your second question, you can do an "initial sync" of all your documents from MongoDB into Solr by truncating the oplog progress file (called "config.txt" by default).

This means truncating the oplog progress file results in something like dumping the output of a mongo "find" command to Solr?
This is good to know.

Instead, there should be a blurb in the README informing the user that mongo-connector makes no commits in Solr and provide a link to the relevent Solr documentation about solrconfig.xml (given above) for how to configure commit behavior.
@B0rner, does this make sense to you?

For my case, that makes sense. I think the goal is not to develop an autocommit feature equal to the Solr built-in one. On the other hand, there is a wide range of users with different needs, so I don't know whether this is a solution that helps most people.
For me, mongo-connector is two tools in one, even though I can use it for only one job:
1.) It's a tool to migrate data from MongoDB to Solr, and
2.) it's a tool to establish a link between Solr and MongoDB to sync all newly incoming docs to Solr over time.
For the first scenario (migration tool), I think no commit is necessary after every update. In fact, no commit is even necessary at the end of the initial sync, because if I build a migration script around mongo-connector, my own script can trigger the commit after mongo-connector has finished indexing.

For the second use case (Solr gets an initial sync and then waits for new documents), the commit after each document is useful as long as there are not too many incoming documents (which depends on the environment). For systems with few new updates per hour, the autocommit value is probably set too high.

I will try your workaround. In that case, the new documents should still become visible in Solr because of Solr's autocommit feature. Right?


B0rner commented Dec 18, 2013

Update: I have implemented your workaround by changing the Solr doc manager to set commit=False. The initial import works faster: with commit=True the update process needs 0.05 - 0.08 seconds per document, and now it needs 0.04 seconds. This means the update is 25-50% faster, so I will need 40-50 hours to index 10,000,000 docs (83 hours with commit=True).
But it's still much slower than the data import handler indexing the same data from MySQL (3 hours). There are probably other reasons, too, why the update takes so long, such as establishing an HTTP connection for every document.


llvtt commented Dec 20, 2013

@B0rner,

The Solr DocManager uses the pysolr library to connect to Solr. Looking at the source, pysolr uses a requests.Session object to manage connections to Solr, and these sessions already take advantage of keep-alive, so I don't think the bottleneck here is in establishing new connections. After running some cursory tests, it seems that one of the biggest performance killers is the fact that upsert() only inserts or updates a single document at a time, whereas the Solr API is capable of doing batch operations. From these tests, it looks like batch upsert could be up to 30x faster than serial :). Having a batch insert/upsert method in the DocManager API is definitely worthwhile. I'll make another issue for this feature.
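
A rough sketch of the batch-upsert idea (the names and batch size are illustrative, not the eventual DocManager API): one HTTP request per batch instead of one per document, and a single commit at the end.

from itertools import islice
from pysolr import Solr

solr = Solr("http://localhost:8983/solr")

def bulk_upsert(docs, batch_size=1000):
    # Send documents to Solr in batches rather than one at a time.
    docs = iter(docs)
    while True:
        batch = list(islice(docs, batch_size))
        if not batch:
            break
        solr.add(batch, commit=False)  # batched add, no per-document commit
    solr.commit()  # a single commit at the end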

edit: new issue is #56


B0rner commented Jan 10, 2014

From these tests, it looks like batch upsert could be up to 30x faster than serial :)

more than this! ;-)

Sorry, but I have now developed my own importer, mainly because my skills in Python are not so good. So I wrote some lines in PHP that fit my needs. It works much like the Solr data import handler: the "Mongo Solr Importer".

The first version was able to push 2,500 docs per second from MongoDB to Solr, which is 75x faster than mongo-connector. After adding multithreading, the script was able to import 6,700 docs per second into Solr, a factor of 200. Thus the duration of the import is reduced from 83 hours to 20 minutes.
I know that your tool works at a much finer granularity and can handle changes to documents, while my script can do only one thing: a full import. But it does that very fast.
So maybe my final solution will be a combination of both: run the initial import with my PHP script and handle new docs with your mongo-connector.

By the way. You can find the "Mongo Solr Importer" Tool here:

https://github.com/5missions/mongoSolrImporter

B0rner


llvtt commented Jan 16, 2014

The bulk_upsert method is now available and should make collection dumps much faster (as of commit 7e48f55). Working on better commit behavior in #68.


llvtt commented Feb 13, 2014

Better commit behavior closed with #68 in 447a80f

llvtt closed this as completed Feb 13, 2014