
--batch-size 1 doesn't seem to commit for every item #364

Closed
simonw opened this issue Jan 6, 2022 · 16 comments
Labels: bug (Something isn't working)


simonw commented Jan 6, 2022

I'm trying this, but it doesn't seem to write anything to the database file until I hit CTRL+C:

heroku logs --app=simonwillisonblog --tail | grep 'measure#nginx.service' | \
  sqlite-utils insert /tmp/herokutail.db log - --import re --convert "$(cat <<EOD
    r = re.compile(r'([^\s=]+)=(?:"(.*?)"|(\S+))')
    pairs = {}
    for key, value1, value2 in r.findall(line):
        pairs[key] = value1 or value2
    return pairs
EOD
)" --lines --batch-size 1
simonw added the bug label Jan 6, 2022

simonw commented Jan 8, 2022

It would be easier to test this if I had a utility for streaming out a file one line at a time.

There are a few recipes for this in https://superuser.com/questions/526242/cat-file-to-terminal-at-particular-speed-of-lines-per-second - but I'm going to build a quick stream-delay tool.
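
A minimal sketch of what such a stream-delay tool could look like (an illustration only, not the actual tool - the filename stream_delay.py is hypothetical):

import sys
import time

# Re-emit stdin one line at a time, flushing after every line so the
# downstream consumer sees records as they arrive rather than in one burst
for line in sys.stdin:
    sys.stdout.write(line)
    sys.stdout.flush()
    time.sleep(0.1)  # roughly ten lines per second

Run as: cat some.log | python stream_delay.py | sqlite-utils insert ...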


simonw commented Jan 8, 2022

(That Heroku example doesn't record the timestamp, which limits its usefulness)


simonw commented Jan 8, 2022

I added a print statement after the for query, params in queries_and_params line and confirmed that something in the code is waiting until 16 records are available to be inserted before executing the inserts, even with --batch-size 1.

simonw changed the title from "Confirm that streaming lines to standard input works correctly" to "--batch-size 1 does not commit for every item" Jan 8, 2022

simonw commented Jan 8, 2022

I'm suspicious that the chunks() utility function may not be working correctly:

In [10]: [list(d) for d in list(chunks('abc', 5))]
Out[10]: [['a'], ['b'], ['c']]

In [11]: [list(d) for d in list(chunks('abcdefghi', 5))]
Out[11]: [['a'], ['b'], ['c'], ['d'], ['e'], ['f'], ['g'], ['h'], ['i']]

In [12]: [list(d) for d in list(chunks('abcdefghi', 3))]
Out[12]: [['a'], ['b'], ['c'], ['d'], ['e'], ['f'], ['g'], ['h'], ['i']]


simonw commented Jan 8, 2022

No, chunks() seems to work OK in the test I just added.
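
In hindsight, the confusing REPL output above is explained by laziness rather than a bug. Assuming chunks() is a generator that hands out sub-iterators over one shared iterator - roughly like this sketch - wrapping the outer generator in list() before consuming any chunk starves every chunk down to its first element:

import itertools

def chunks(sequence, size):
    # Lazy chunking: every chunk draws from the same shared iterator
    iterator = iter(sequence)
    for item in iterator:
        yield itertools.chain([item], itertools.islice(iterator, size - 1))

# Materializing the outer generator first consumes every item as a
# chunk head, leaving nothing for the inner islice() calls:
print([list(c) for c in list(chunks("abcdefghi", 3))])
# [['a'], ['b'], ['c'], ['d'], ['e'], ['f'], ['g'], ['h'], ['i']]

# Consuming each chunk before requesting the next works as intended:
print([list(c) for c in chunks("abcdefghi", 3)])
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]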

simonw added a commit that referenced this issue Jan 8, 2022

simonw commented Jan 9, 2022

I'm having trouble figuring out the best way to write a unit test for this. Filed a relevant feature request for Click here:


simonw commented Jan 9, 2022

Possible way of running the test: add this to sqlite_utils/cli.py:

if __name__ == "__main__":
    cli()

Now the tool can be run using python -m sqlite_utils.cli --help

Then in the test use subprocess to call sys.executable (the path to the current Python interpreter) and pass it -m sqlite_utils.cli to run the script!


simonw commented Jan 9, 2022

I can now write tests that look like this:

sqlite-utils/tests/test_cli.py

Lines 2024 to 2030 in 539f5cc

def test_python_dash_m():
    "Tool can be run using python -m sqlite_utils"
    result = subprocess.run(
        [sys.executable, "-m", "sqlite_utils", "--help"], capture_output=True
    )
    assert result.returncode == 0
    assert b"Commands for interacting with a SQLite database" in result.stdout

Which means I can write a test that exercises this bug.


simonw commented Jan 9, 2022

This is strange. The following:

>>> import subprocess
>>> p = subprocess.Popen(["sqlite-utils", "insert", "/tmp/stream.db", "stream", "-", "--nl"], stdin=subprocess.PIPE)
>>> p.stdin.write(b'\n'.join(b'{"id": %s}' % str(i).encode("utf-8") for i in range(1000)))
11889
>>> # At this point /tmp/stream.db is still 0 bytes - but if I then run this:
>>> p.stdin.close()
>>> # /tmp/stream.db is now 20K and contains the written data

No wait, mystery solved - p.stdin is buffered in the parent process, so if I add p.stdin.flush() instead of p.stdin.close() the file suddenly jumps up in size.


simonw commented Jan 9, 2022

Calling p.stdin.close() and then p.wait() terminates the subprocess.


simonw commented Jan 10, 2022

I think this test is right:

def test_insert_streaming_batch_size_1(db_path):
    # https://github.com/simonw/sqlite-utils/issues/364
    # Streaming with --batch-size 1 should commit on each record
    # Can't use CliRunner().invoke() here because we need to
    # run assertions in between writing to process stdin
    proc = subprocess.Popen(
        [
            sys.executable,
            "-m",
            "sqlite_utils",
            "insert",
            db_path,
            "rows",
            "-",
            "--nl",
            "--batch-size",
            "1",
        ],
        stdin=subprocess.PIPE,
    )
    proc.stdin.write(b'{"name": "Azi"}')
    proc.stdin.flush()
    assert list(Database(db_path)["rows"].rows) == [{"name": "Azi"}]
    proc.stdin.write(b'{"name": "Suna"}')
    proc.stdin.flush()
    assert list(Database(db_path)["rows"].rows) == [{"name": "Azi"}, {"name": "Suna"}]
    proc.stdin.close()
    proc.wait()


simonw commented Jan 10, 2022

After a bunch of debugging with print() statements it's clear that the problem isn't with when things are committed or the size of the batches - it's that the data sent to standard input is all being processed in one go, not a line at a time.

I think that's because it is being buffered by this:

encoding = encoding or "utf-8-sig"
buffered = io.BufferedReader(file, buffer_size=4096)
decoded = io.TextIOWrapper(buffered, encoding=encoding)
tracker = None
if csv or tsv:
    if sniff:
        # Read first 2048 bytes and use that to detect
        first_bytes = buffered.peek(2048)
        dialect = csv_std.Sniffer().sniff(first_bytes.decode(encoding, "ignore"))
    else:
        dialect = "excel-tab" if tsv else "excel"
with file_progress(decoded, silent=silent) as decoded:

The buffering is there so that we can sniff the first few bytes to detect if it's a CSV file - added in 99ff0a2 for #230. So maybe for non-CSV inputs we should disable buffering?
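
A hypothetical sketch of that suggestion - introduce the peek-able buffer only when CSV sniffing needs it, so line-oriented input is decoded as soon as it arrives (the open_decoded/needs_sniff names are invented, and this is not necessarily the fix that eventually landed):

import io

def open_decoded(file, encoding="utf-8-sig", needs_sniff=False):
    buffered = None
    if needs_sniff:
        # CSV dialect sniffing requires peek(), hence the BufferedReader
        buffered = io.BufferedReader(file, buffer_size=4096)
        file = buffered
    # For --nl and --lines input we skip the extra 4096-byte buffer so
    # each line is surfaced as soon as the producer flushes it
    return io.TextIOWrapper(file, encoding=encoding), buffered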

simonw changed the title from "--batch-size 1 does not commit for every item" to "--batch-size 1 doesn't seem to commit for every item" Jan 10, 2022
simonw closed this as completed in cfb3f12 Jan 10, 2022
simonw added a commit that referenced this issue Jan 10, 2022
simonw reopened this Jan 10, 2022

simonw commented Jan 10, 2022

Urgh, tests are still failing intermittently - for example:

        time.sleep(0.4)
>       assert list(Database(db_path)["rows"].rows) == [{"name": "Azi"}]
E       AssertionError: assert [] == [{'name': 'Azi'}]
E         Right contains one more item: {'name': 'Azi'}
E         Full diff:
E         - [{'name': 'Azi'}]
E         + []

I'm going to change this code to keep on trying up to 10 seconds - that should get the tests to pass faster on most machines.

simonw added a commit that referenced this issue Jan 10, 2022

simonw commented Jan 10, 2022

Bit nasty but it might work:

    def try_until(expected):
        tries = 0
        while True:
            rows = list(Database(db_path)["rows"].rows)
            if rows == expected:
                return
            tries += 1
            if tries > 10:
                assert False, "Expected {}, got {}".format(expected, rows)
            time.sleep(tries * 0.1)

    try_until([{"name": "Azi"}])
    proc.stdin.write(b'{"name": "Suna"}\n')
    proc.stdin.flush()
    try_until([{"name": "Azi"}, {"name": "Suna"}])


simonw commented Jan 10, 2022

That did the trick.

simonw closed this as completed Jan 10, 2022
simonw added this to the 3.21 milestone Jan 10, 2022
simonw added a commit that referenced this issue Jan 11, 2022
simonw added a commit that referenced this issue Jan 11, 2022