
Suggestion: Initial sync #47

Closed
calexandre opened this issue Dec 18, 2012 · 24 comments

@calexandre

Hey Richard,
I would like to suggest some sort of initial sync functionality (optional).

Something like: when you create the river via the PUT API, expose some additional options specifying how the user would like to perform the initial sync.

This would be a "one time" operation. I don't even know if it is possible...

The main issue is that not everything is on the oplog, especially for really large and stale collections...
So it would be nice to implement a set of options that would allow the user to tell the river to pull all data from Mongo (much like a GetAll operation).

Of course we could discuss different strategies for pulling the data, such as:

  1. GetAll (easy, but cumbersome for large collections)
  2. via MongoDump, MongoExport or BsonDump
  3. Others..?

It would be nice to support different import strategies, much like plugins for this river.
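
For illustration, the river creation call could then look something like this (the "initial_import" block and its "strategy" option are hypothetical, sketched here only to make the idea concrete; the rest follows the river's existing settings format):

  curl -XPUT 'localhost:9200/_river/mongodb/_meta' -d '{
    "type": "mongodb",
    "mongodb": { "db": "mydb", "collection": "mycollection" },
    "index": { "name": "myindex", "type": "mytype" },
    "initial_import": { "strategy": "getall" }
  }'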

Keep up the good work :)

@xma
Contributor

xma commented Dec 18, 2012

Hello,

I agree; if that could be exposed via an API or any other way (config file, etc.), that would be really great.

For stale collections (at least in my use case) I've made a patch:
#45

I don't like the idea of running personally forked code, suited only to my use case, in production.

Regards,

@greatwitenorth

Yes, I'd also love to see this feature implemented. After working with the mysql river (which slurps in the table initially) I thought I was doing something wrong when my collection wasn't being slurped. If there is no plan to implement this, it might be worth mentioning in the wiki.

@medcl

medcl commented Mar 21, 2013

+1. Once we have changed the mapping, we need to clean the old index and "re-pull" the data from MongoDB into Elasticsearch. Hope to see this feature.

@mzafer

mzafer commented Apr 24, 2013

+1. Would love to see this feature supported

@subratbasnet

+1 would love this too!

@bitmorse

bitmorse commented May 8, 2013

+1 for this!

@subratbasnet

A good workaround for this would be to simply do a BULK-UPDATE on the collection after the mongo rivers are set up. I use this for millions of records and it works great.

@enrique-fernandez-polo

+1 Very useful!

@yvesx

yvesx commented May 17, 2013

+1

@yvesx

yvesx commented May 17, 2013

@subratbasnet:
what do you mean by a BULK-UPDATE? If A is the large stale collection, do you mean setting up another empty collection B and doing this:

  db.collectionA.find().forEach(function (i) {
    i.ts_imported = new Date();
    db.collectionB.insert(i);
  });

Then setting up the river on collectionB?

@subratbasnet

@yvesx:

What I meant was: when you have a stale collection in Mongo, first you would set up the river for that collection. This will NOT automatically start moving the data from the stale collection to Elasticsearch.

To trigger that, you could simply perform a bulk update on the collection with a condition that matches all the records. For example, in my case, I simply change the "updated" field inside all the documents in my collection, and this triggers the river and it moves the affected documents to Elasticsearch.
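
In mongo shell terms, a minimal sketch of this workaround (the collection name and the "updated" field are just examples) might be:

  // touch every document so each one lands in the oplog and is picked up by the river
  db.mycollection.update(
    { },                               // condition matching all records
    { $set: { updated: new Date() } }, // any field change will do
    false,                             // upsert: no
    true                               // multi: update every matching document
  );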

@yvesx

yvesx commented May 17, 2013

I see! This is a clever idea.

@benmccann
Collaborator

Updating every document in MongoDB to get them to appear in the oplog and be copied by the river is very clever. Unfortunately, I believe this will make the initial import much slower. Writes to MongoDB are much slower for me than writes to ElasticSearch (because MongoDB stores data less efficiently than ES and because MongoDB has its unfortunate DB-level lock). Do you think it would work if we copied over all documents from MongoDB and then iterated over the oplog? I think that's what this issue is requesting and it doesn't sound much more difficult than what we have today. A nice optimization would be to read the latest oplog timestamp, import the collection, then import the oplog only from the start timestamp.

@richardwilly98
Owner

There are a few challenges with your suggestion:

  • A MongoDB collection does not have an out-of-the-box timestamp field, so the query to read the collection will need to be defined in the river settings.
  • The collection will need to be locked during the initial import, maybe using [1]. Is that acceptable?
  • There is already an initial timestamp setting, but it would need to be dynamically calculated based on the end of the initial import.

[1] - http://docs.mongodb.org/manual/reference/method/db.fsyncLock/
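
For reference, [1] would be used roughly like this from the mongo shell:

  db.fsyncLock();    // flush pending writes and block new writes
  // ... run the initial import while the data is frozen ...
  db.fsyncUnlock();  // resume writes once the import completes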

@benmccann
Collaborator

Thanks for the feedback. Could you clarify? I'm not sure that those things are true. E.g. why would the collection need to be locked? Yes, copying without locking could result in an inconsistent state, but then once the oplog is applied wouldn't that fix it?

@richardwilly98
Owner

The main reason for #47 is to synchronize data that is not available in oplog.rs.
So I am thinking of the following scenario:

  1. The collection has been created and populated before the replica set has been set up (at this point the replica set does not exist yet).
  2. Run the initial import.
  3. Once the initial import is completed, set up the replica set.
  4. The river will then import data from oplog.rs.
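
A mongo shell sketch of that ordering (step 2 stands in for the river's own import, and the collection name is an example):

  // 1. collection created and populated on a standalone mongod (no replica set yet)
  // 2. run the initial import: copy db.mycollection into Elasticsearch
  // 3. once the import completes, enable replication so oplog.rs exists:
  rs.initiate();
  // 4. the river tails local.oplog.rs from this point forward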

In step 1 we will need to ensure no new data is imported into the collection.

If you import without locking you could get inconsistent state / data, and there is no guarantee that it will be fixed when processing the oplog.

@benmccann
Collaborator

What if we just make it so that you can only run the initial import on a replica set?

@nfx

nfx commented Sep 24, 2013

+1

@richardwilly98
Owner

I have posted the question here [1]. Let's see what MongoDB experts will answer...

[1] - https://groups.google.com/forum/#!topic/mongodb-user/sOKlhD_E2ns

@benmccann
Collaborator

Response from William Zola at 10gen for @richardwilly98's question:

The way that MongoDB does initial sync internally is:
 - Record the latest timestamp (call it time 'T') from the oplog
 - Copy all of the documents from the collection
 - Apply all of the operations in the oplog starting from time 'T'

You could use the same strategy in your plugin.
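
A rough mongo shell sketch of that strategy (the Elasticsearch side is reduced to comments, and the collection name is just an example):

  // 1. record the latest timestamp 'T' from the oplog
  var T = db.getSiblingDB('local').oplog.rs
            .find().sort({ $natural: -1 }).limit(1).next().ts;

  // 2. copy all of the documents from the collection
  db.mycollection.find().forEach(function (doc) {
    // index 'doc' into Elasticsearch here
  });

  // 3. apply all of the operations in the oplog starting from time 'T'
  db.getSiblingDB('local').oplog.rs.find({ ts: { $gte: T } }).forEach(function (op) {
    // translate 'op' into the corresponding Elasticsearch operation here
  });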

richardwilly98 added a commit that referenced this issue Sep 27, 2013
- TODO: Initial import with GridFS still needs to be optimized
- Use DBObject directly instead of Map
- Clean up the logic with GridFS enabled
- New unit test for initial import with GridFS
- Reduce unit-test wait time from 6 sec to 2 sec
- Script filter is not used anymore when GridFS enabled
@benmccann
Collaborator

@richardwilly98 Awesome! I'm really excited about the ability to do an initial import!

One thing I'm not very sure about is how to handle the river being stopped and then started again during the initial import. We have to restart the initial import in that case. Should we drop the index and start the initial import again? I'm hesitant to drop an index though. Maybe we should just stop the river from doing anything and post a warning to the admin UI and logs that the index needs to be dropped?

@richardwilly98
Owner

@benmccann I agree dropping the index is not a good option.

We need a flag to indicate the initial import is in progress. If the flag is not cleared and the timestamp is null, then stop the river and send a warning as you suggested.
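
A minimal sketch of that startup check (all names here are illustrative, not the river's actual API):

  // decide at river startup whether it is safe to proceed
  function safeToStart(status) {
    if (status.initialImportInProgress && status.lastOplogTimestamp === null) {
      // a previous initial import was interrupted part-way through:
      // stop and warn rather than re-import into a half-built index
      print('WARN: initial import incomplete; drop the index and restart the river');
      return false;
    }
    return true;
  }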

@cggaurav

+1

@benmccann
Collaborator

@cggaurav This is already implemented and released.

This issue should probably be closed.
