Performance Issue #53
Hi,
I'm running some tests to index a MongoDB database into Solr 4.4. The database contains 10,000,000 documents, and importing them takes a long time: mongo-connector pushes about 120,000 documents per hour to Solr, so a full import will need 83 hours.
Using the data import handler with the same documents from MySQL, the whole import needs 3 hours.
I think one of the main reasons for this slow import is that each document is imported individually and followed by its own commit, which is performance overkill.
So the question is: isn't it possible to push a set of documents to Solr followed by one commit for all of the docs? Or isn't there an "initial sync" feature that is based on a MongoDB query rather than the oplog?
There is a second problem: because the oplog is a capped collection, there can be more documents in MongoDB than in the oplog. How can I index the documents that are no longer in the oplog?
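To illustrate the batching idea the question asks about, here is a minimal sketch (not mongo-connector's actual code) of pushing documents to Solr in batches with a single commit at the end; the URL, batch size, and the `flatten_doc()` helper are assumptions for illustration:

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/collection1")  # assumed URL
BATCH_SIZE = 1000  # assumed batch size

def flatten_doc(doc):
    """Hypothetical helper: map a MongoDB document to Solr fields."""
    fields = dict(doc)
    fields["_id"] = str(fields["_id"])  # Solr needs a plain-string unique key
    return fields

def bulk_index(cursor):
    """Index documents from a MongoDB cursor, committing once at the end."""
    batch = []
    for doc in cursor:
        batch.append(flatten_doc(doc))
        if len(batch) >= BATCH_SIZE:
            solr.add(batch, commit=False)  # no per-batch commit
            batch = []
    if batch:
        solr.add(batch, commit=False)
    solr.commit()  # one commit for the whole import
```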
Hi @B0rner, Upserts using the Solr DocManager do commit on every upsert, which could very well be excessive, but that is the default in pysolr's `add()` method. As for your second question, you can do an "initial sync" of all your documents from MongoDB into Solr by truncating the oplog progress file (called "config.txt" by default). After you restart mongo-connector, it will replicate all the documents from your targeted namespace over to Solr, regardless of their presence in the oplog. Note that each document from the collection being dumped is inserted into Solr using the Solr DocManager's `upsert()` method, so it still commits once per document. If you need a quick solution to your problem, you might try changing the call to pysolr's `add()` to pass `commit=False`.
Hope this helps. I'll look into a more performant way to handle commits in Solr soon.
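As a minimal sketch of that quick fix (the exact code in solr_doc_manager.py may differ; this only shows the shape of the change):

```python
# Inside the Solr DocManager, roughly:
def upsert(self, doc):
    """Insert or update a single document in Solr."""
    # Before: self.solr.add([doc], commit=True)  # commits on every upsert
    self.solr.add([doc], commit=False)  # defer visibility to Solr's autocommit
```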
Update about better commit practices on Solr: it looks as if the commit behavior can be configured on the Solr side, in `solrconfig.xml`. @B0rner, does this make sense to you? As a side note, it looks as if …
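For example, autocommit can be enabled in `solrconfig.xml` (the values here are illustrative, not recommendations):

```xml
<!-- solrconfig.xml: let Solr commit on its own schedule -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>            <!-- hard commit after this many pending docs -->
    <maxTime>60000</maxTime>            <!-- ...or after 60s, whichever comes first -->
    <openSearcher>false</openSearcher>  <!-- don't reopen a searcher on each hard commit -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- soft commit for near-real-time visibility -->
  </autoSoftCommit>
</updateHandler>
```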
An update about better commit practices on Elasticsearch: according to the ES "refresh" documentation, refreshes are by default already scheduled "periodically" (according to that page, periodically = every second). Flushes are handled automatically in ES. If mongo-connector were to assert control over refreshes on ES (it currently refreshes after every update/insert), there is the notion of a refresh interval that refreshes an index every X seconds. It could be worth switching off periodic refreshes during collection dumps to improve write throughput and then resetting the refresh interval to whatever it was previously for that index. I think that either leveraging the refresh interval or taking the manual refreshes out of elastic_doc_manager.py and letting ES do its default periodic refreshing would be more performant than the current behavior. Another side note: I'm noticing that the …
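A sketch of that refresh-interval approach, assuming the official `elasticsearch` Python client; the index name and the `bulk_dump()` helper are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
index = "my_index"  # placeholder index name

# Remember the current interval so it can be restored afterwards
# (fall back to the ES default of "1s" if it was never set explicitly).
settings = es.indices.get_settings(index=index)
old_interval = settings[index]["settings"]["index"].get("refresh_interval", "1s")

# Switch off periodic refreshes for the duration of the dump.
es.indices.put_settings(index=index, body={"index": {"refresh_interval": "-1"}})
try:
    bulk_dump(es, index)  # hypothetical: stream the MongoDB collection into ES
finally:
    # Restore the previous interval and refresh once so new docs become visible.
    es.indices.put_settings(index=index, body={"index": {"refresh_interval": old_interval}})
    es.indices.refresh(index=index)
```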
Yes, I think this would be very useful on bigger databases. But of course, there is no need to develop an existing feature again, like Solr's autocommit, as you wrote.
This means: truncating the oplog progress file results in something like a full dump of the MongoDB collection into Solr, right?
For my case that makes sense. I don't think the goal should be to develop an autocommit feature equal to Solr's built-in one. On the other hand, there is a wide range of users with different needs, so I don't know whether this is a solution that helps most people. For the second use case of this tool (Solr has finished the initial sync and waits for new documents), the commit after each document is useful, as long as there are not too many incoming documents (which depends on the different environments out there). For systems with few updates per hour, the autocommit interval is probably too long. I will try your workaround (see the sketch below); in that case, the new documents should still become visible in Solr because of Solr's built-in autocommit feature. Right?
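For reference, the re-sync part of that workaround amounts to something like this (assuming the default progress-file name and that mongo-connector is stopped first):

```python
# Stop mongo-connector, truncate its oplog progress file ("config.txt"
# by default), then restart it; the next run performs a full dump of the
# targeted namespace instead of resuming from the saved oplog timestamp.
open("config.txt", "w").close()
```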
Update: I have implemented your workaround by changing the Solr doc manager to set `commit=False`.
The Solr DocManager uses the …
edit: new issue is #56
more than this! ;-) Sorry, but I have developed my own importer now, mainly because my Python skills are not so good. So I wrote some lines in PHP, which fit my needs. It works much like the Solr data import handler. The first version of the "Mongo Solr Importer" was able to push 2,500 docs per second from MongoDB to Solr, which is 75x faster than mongo-connector. After adding multithreading, the script was able to import 6,700 docs per second, a factor of 200. Thus the duration of the import is reduced from 83 hours to 20 minutes. By the way, you can find the "Mongo Solr Importer" tool here: https://github.com/5missions/mongoSolrImporter
B0rner