spaces_reader appears to be reading duplicate records unless using 1 worker or interval set to 1s #756

Open
ciorg opened this issue Jul 14, 2021 · 3 comments
Labels: bug, priority:high

Comments

@ciorg
Member

ciorg commented Jul 14, 2021

The spaces_reader returns too many records unless the job uses 1 worker and 1 slicer. Reducing the interval to 1s, regardless of the number of workers or slicers, returns close to the correct number of records, but the count is still slightly too high.

All tests used the same control data set of 6.95M records.

Tests were run with elasticsearch-asset version 2.6.2, node-12 on dataeng3, teraslice version 0.76.1.

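| workers | slicers | interval | docs returned (M) |
|---------|---------|----------|-------------------|
| 20      | 10      | auto     | 8.38              |
| 20      | 1       | auto     | 7.81              |
| 1       | 1       | auto     | 6.95              |
| 20      | 10      | 1s       | 6.97              |
| 20      | 10      | 1m       | 8.19              |
| 20      | 10      | 1hr      | 8.48              |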
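
To make the size of the overcount easier to compare across runs, here is a minimal Python sketch that computes the excess over the 6.95M control count for each row of the table above. The counts are copied from the table; the function name `overcount` is only illustrative and not part of teraslice or the asset.

```python
# Control count and observed counts (millions of docs), copied from the table above.
EXPECTED_M = 6.95

runs = [
    # (workers, slicers, interval, docs_returned_millions)
    (20, 10, "auto", 8.38),
    (20, 1, "auto", 7.81),
    (1, 1, "auto", 6.95),
    (20, 10, "1s", 6.97),
    (20, 10, "1m", 8.19),
    (20, 10, "1hr", 8.48),
]


def overcount(docs_m, expected_m=EXPECTED_M):
    """Return (extra docs in millions, extra docs as a percentage of expected)."""
    extra = docs_m - expected_m
    return round(extra, 2), round(100 * extra / expected_m, 1)


for workers, slicers, interval, docs_m in runs:
    extra, pct = overcount(docs_m)
    print(f"{workers:>2} workers, {slicers:>2} slicers, interval {interval:<4}: "
          f"+{extra:.2f}M ({pct:.1f}%)")
```

By this arithmetic the 20 worker, 10 slicer, auto interval run is roughly 20.6% over the control count, while the 1s interval run is about 0.3% over.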

A job with 20 workers, 10 slicers, and interval auto that also deduped the records produced a count of 6.95M, so the reader appears to be picking up duplicate records.
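
For anyone wanting to repeat the dedup check offline on a sample of the output, the following is a rough Python sketch, not the job that was actually run. It assumes each record carries a unique identifier in a field called `_key`; that field name and the three sample records are hypothetical and only there to show the shape of the check.

```python
from collections import Counter


def duplicate_summary(records, key_field="_key"):
    """Count total, unique, and duplicated records by a key field.

    key_field is an assumed unique identifier; adjust it to whatever
    field uniquely identifies a record in the index being read.
    """
    counts = Counter(record[key_field] for record in records)
    total = sum(counts.values())
    unique = len(counts)
    return {"total": total, "unique": unique, "duplicates": total - unique}


# Made-up sample data: record "b" appears twice, as it might if two
# slices both returned it.
sample = [
    {"_key": "a", "date": "2021-05-01T00:00:00.000Z"},
    {"_key": "b", "date": "2021-05-01T00:00:59.999Z"},
    {"_key": "b", "date": "2021-05-01T00:00:59.999Z"},
]

summary = duplicate_summary(sample)
print(summary)
```

If the reader is returning duplicates, `total` would match the inflated count and `unique` would match the 6.95M control count.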

ciorg added the bug label Jul 14, 2021
@ciorg
Member Author

ciorg commented Jul 15, 2021

Here's the basic job config for the job running on de3:

{
    "name": "temp#spaces_reader",
    "lifecycle": "once",
    "workers": 20,
    "slicers": 10,
    "assets": [
        "elasticsearch:2.6.2"
    ],
    "operations": [
        {
            "_op": "spaces_reader",
            "interval": "1m",
            "endpoint": "ENDPOINT/api/v2",
            "index": "INDEX_NAME",
            "token": "TOKEN",
            "size": 100000,
            "date_field_name": "date",
            "start": "2021-05-01T00:00:00.000Z",
            "end": "2021-05-01T01:00:00.000Z",
            "query": "_exists_:date"
        },
        {
            "_op": "noop"
        }
    ]
}
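
To reproduce the full set of runs from the results table, one option is to generate a job config per combination from the base config above. This is a sketch only: the placeholders (`ENDPOINT`, `INDEX_NAME`, `TOKEN`) are kept as in the config, the helper name `build_job` is made up for illustration, and submitting the jobs to teraslice is left out.

```python
import copy
import json

# Base job config, copied from the config above (placeholders left as-is).
BASE_JOB = {
    "name": "temp#spaces_reader",
    "lifecycle": "once",
    "workers": 20,
    "slicers": 10,
    "assets": ["elasticsearch:2.6.2"],
    "operations": [
        {
            "_op": "spaces_reader",
            "interval": "1m",
            "endpoint": "ENDPOINT/api/v2",
            "index": "INDEX_NAME",
            "token": "TOKEN",
            "size": 100000,
            "date_field_name": "date",
            "start": "2021-05-01T00:00:00.000Z",
            "end": "2021-05-01T01:00:00.000Z",
            "query": "_exists_:date",
        },
        {"_op": "noop"},
    ],
}

# (workers, slicers, interval) combinations from the results table.
TEST_MATRIX = [
    (20, 10, "auto"),
    (20, 1, "auto"),
    (1, 1, "auto"),
    (20, 10, "1s"),
    (20, 10, "1m"),
    (20, 10, "1hr"),
]


def build_job(workers, slicers, interval, base=BASE_JOB):
    """Return a copy of the base job with the three varied settings applied."""
    job = copy.deepcopy(base)
    job["workers"] = workers
    job["slicers"] = slicers
    job["operations"][0]["interval"] = interval
    return job


jobs = [build_job(*row) for row in TEST_MATRIX]
print(json.dumps(jobs[0], indent=4))
```

Each generated config differs from the base only in `workers`, `slicers`, and the reader's `interval`, so any difference in the returned count can be attributed to those settings.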

@peterdemartini
Contributor

Can you confirm this with the latest asset bundle, 2.7.2?

@ciorg
Member Author

ciorg commented Jul 16, 2021

Tried it with 2.7.2 and got the same result: 8.4M records when I expect 6.95M.

{
    "name": "temp#spaces_reader",
    "lifecycle": "once",
    "workers": 20,
    "slicers": 10,
    "assets": [
        "elasticsearch:2.7.2"
    ],
    "operations": [
        {
            "_op": "spaces_reader",
            "interval": "auto",
            "endpoint": "ENDPOINT",
            "index": "INDEX",
            "token": "TOKEN",
            "size": 100000,
            "date_field_name": "date",
            "start": "2021-05-01T00:00:00.000Z",
            "end": "2021-05-01T01:00:00.000Z",
            "query": "_exists_:date"
        },
        {
            "_op": "noop"
        }
    ]
}
