
Too much indexed docs - drop database #133

Closed
lukaszpy opened this issue Sep 24, 2013 · 13 comments

@lukaszpy

my versions:

  • elasticsearch 0.90.2
    plugins:
    -- mapper attachments 1.4.0
    -- river jdbc 2.2.1 (driver: postgresql-9.2-1002.jdbc4)
    -- elasticsearch-river-mongodb-1.6.11

Problem:
When I create an index for the PostgreSQL database, everything works fine; in the head plugin for ES I see:
structure - name of index
size: 1mb (1mb)
docs: 3587 (3587)
But when I create an index on the MongoDB database I get:
type - index name
size: 642.6kb (642.6kb)
docs: 10495 (10495)

The docs field shows the wrong number of documents, because my database contains only 3936 docs. This problem exists for every index on MongoDB: the count of indexed docs does not match the count of docs in the database.

I'm creating the index with (this is the Windows version of the command, hence the escaped quotes):
curl -XPUT "http://localhost:9200/_river/body/_meta" -d "{ \"type\": \"mongodb\", \"mongodb\": { \"servers\": [{ \"host\": \"localhost\", \"port\": 27017 }], \"options\": { \"secondary_read_preference\": true }, \"credentials\": [{ \"db\": \"fis-bps\", \"user\": \"guest\", \"password\": \"guest\" }], \"db\": \"fis-bps\", \"collection\": \"body\", \"gridfs\": false }, \"index\": { \"name\": \"body\", \"throttle_size\": 2000 } }"
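
For readability, the same river definition pretty-printed (the body of the _meta document only, without the Windows quoting):

```
{
  "type": "mongodb",
  "mongodb": {
    "servers": [{ "host": "localhost", "port": 27017 }],
    "options": { "secondary_read_preference": true },
    "credentials": [{ "db": "fis-bps", "user": "guest", "password": "guest" }],
    "db": "fis-bps",
    "collection": "body",
    "gridfs": false
  },
  "index": { "name": "body", "throttle_size": 2000 }
}
```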

This problem only exists on Windows; on an Ubuntu system the problem doesn't occur.

I noticed one more thing: when I dump my DB, remove all data files (from the data directory for both primary and secondary), then create the databases, create the index, and restore the database from the dump, I get the correct count of indexed docs.

It looks like Elasticsearch reads deeper into MongoDB than the collections themselves: a normal drop and recreate of the DB still leaves some data behind that Elasticsearch uses when building the index.

@richardwilly98
Owner

The river gets the data from oplog.rs, not directly from the collection.

Did you by any chance drop the collection?
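
A minimal way to inspect what the river actually reads, assuming a replica set (oplog.rs only exists in the local database of replica-set members); the namespace filter here follows the db/collection names from the command above:

```
use local
// newest oplog entries for the indexed namespace
db.oplog.rs.find({ ns: "fis-bps.body" }).sort({ $natural: -1 }).limit(5)
```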

@lukaszpy
Author

I dropped the oplog.rs collection on the PRIMARY but not on the secondary (replica). Is that a mistake?

@richardwilly98
Owner

In that case you should use the options/drop_collection parameter (for more details see [1]).

[1] - https://github.com/richardwilly98/elasticsearch-river-mongodb/wiki#configuration
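
A minimal sketch of a river definition with that flag set (Unix-style quoting; the db, collection, and index names follow the example earlier in this thread):

```
curl -XPUT "http://localhost:9200/_river/body/_meta" -d '{
  "type": "mongodb",
  "mongodb": {
    "db": "fis-bps",
    "collection": "body",
    "options": { "drop_collection": true }
  },
  "index": { "name": "body" }
}'
```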

@lukaszpy
Author

OK, but I'm not sure, so correct me if I'm wrong: the drop_collection flag doesn't cover dropping the whole DB, right?

So if I drop the whole DB, recreate it, and then restore the data, I should end up with more indexed docs than exist in my DB? Because the collection itself is never dropped (the whole DB is dropped), ES will treat the restored docs as new ones when reading the oplog.

I just checked that case on my Windows workstation. I think it's a bug.

@richardwilly98
Owner

options/drop_collection will work with a collection drop, but probably not with a database drop.

@lukaszpy
Author

So I think it's a bug and should be corrected, because it leaves the index in an inconsistent state.

@richardwilly98
Owner

Can you please clarify which MongoDB commands you use to drop the database or the collection? I believe dropping a database or a collection does not usually happen in a production environment.

@lukaszpy
Author

To drop the db:

  1. use test-db
  2. db.dropDatabase()

To drop a collection:

  1. use test-db
  2. db.getCollection("test-collection").drop()

(getCollection is needed here because the hyphen in the name breaks the db.test-collection shorthand.)
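
Both commands are recorded in the oplog as command entries ("op": "c"); no per-document delete entries are written for the dropped data. A quick way to see those entries, assuming a replica set:

```
use local
// newest command-type oplog entries (collection drops, dropDatabase, ...)
db.oplog.rs.find({ op: "c" }).sort({ $natural: -1 }).limit(5)
```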

@lukaszpy
Author

I believe this bug could occur in a production environment. For example, we have machine1 and machine2 (running the same application; the databases are duplicated too, each machine has its own MongoDB). For some reason we want to move the data from machine1 to machine2. We connect to machine1 and make a dump of the database. Then we go to machine2, drop the whole DB, and restore the dump made on machine1.
The index state on machine2 will be inconsistent, because ES will get old data from the oplog (even though the collections are empty) plus the new data from the restore. A sketch of the scenario is below.
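
The same scenario with the standard tools (paths are illustrative):

```
# on machine1: dump the database
mongodump --db fis-bps --out /tmp/dump

# on machine2: drop the database in the mongo shell...
#   use fis-bps
#   db.dropDatabase()
# ...then restore the dump
mongorestore --db fis-bps /tmp/dump/fis-bps
```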

@richardwilly98
Owner

@lukaszpy

I will create a new feature request to support drop_database. For reference, this is how the dropDatabase operation shows up in the oplog:

```
{
        "ts" : Timestamp(1380107544, 1),
        "h" : NumberLong("4469577380503976492"),
        "v" : 2,
        "op" : "c",
        "ns" : "mydb97.$cmd",
        "o" : {
                "dropDatabase" : 1
        }
}
```

@richardwilly98
Owner

@lukaszpy I will postpone this feature to release 1.7.2.

  • The coming release 1.7.1 uses a different technique for the initial import, reading the collection data directly (see Suggestion: Initial sync #47).
  • That could be a good workaround for this issue: before restoring the data on machine2, drop the index and the river in ES, then recreate the river once the restore has completed (see the sketch after this list).

Please provide feedback.
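
A sketch of that workaround in curl (river and index names follow the example earlier in this thread):

```
# before the restore on machine2: drop the river and the index
curl -XDELETE "http://localhost:9200/_river/body"
curl -XDELETE "http://localhost:9200/body"

# ...restore the MongoDB dump on machine2...

# after the restore completes: recreate the river with the same _meta document as before
curl -XPUT "http://localhost:9200/_river/body/_meta" -d "{ ... }"
```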

@mahnunchik

Is this way like in mongodb? After large node downtime.

@richardwilly98
Owner

@mahnunchik can you please clarify?

richardwilly98 added a commit that referenced this issue Nov 3, 2013
- with ```options/drop_collection``` the river will also track the ```dropDatabase``` operation