Progress bar for sqlite-utils insert #173

Closed
simonw opened this issue Sep 23, 2020 · 6 comments
Labels
cli-tool enhancement New feature or request

Comments

@simonw
Owner

simonw commented Sep 23, 2020

It would be nice if sqlite-utils insert had a progress bar, for when it's churning through huge CSV files.

@simonw simonw added the enhancement New feature or request label Sep 23, 2020
@simonw
Owner Author

simonw commented Sep 23, 2020

This can only work when it's reading from a file, not when it's reading from standard input.

@simonw
Owner Author

simonw commented Sep 24, 2020

I know how to build this for CSV and TSV - I can read them via a file wrapper that counts how many bytes it has seen.

Not sure how to do it for JSON though. Maybe I could provide it just for newline-delimited JSON? Again I can measure progress based on how many bytes have been read.
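A minimal sketch of the byte-counting wrapper idea for CSV, assuming the file is opened in binary mode and is UTF-8 encoded. The class name `CountingLines` and the in-memory sample data are hypothetical, chosen only for illustration:

```python
import csv
import io


class CountingLines:
    """Yield decoded lines from a binary file while counting the bytes consumed."""

    def __init__(self, binary_file, encoding="utf-8"):
        self._file = binary_file
        self._encoding = encoding
        self.bytes_seen = 0

    def __iter__(self):
        for raw_line in self._file:
            # Count raw bytes so the total is comparable to the file size on disk
            self.bytes_seen += len(raw_line)
            yield raw_line.decode(self._encoding)


# Demo with an in-memory "file"; a real run would use open(path, "rb")
data = b"id,name\n1,Cleo\n2,Pancakes\n"
counter = CountingLines(io.BytesIO(data))
reader = csv.reader(counter)
headers = next(reader)
rows = []
for row in reader:
    rows.append(dict(zip(headers, row)))
    # Fraction a progress bar could display after each row
    fraction_done = counter.bytes_seen / len(data)
```

Counting bytes rather than decoded characters keeps the running total on the same scale as the size of the file, which matters for multi-byte encodings.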

@simonw
Owner Author

simonw commented Sep 24, 2020

I'm using a click.File() at the moment:

click.argument("json_file", type=click.File(), required=True),

I'll need to change that to be something that I can easily measure progress through. Also I should change its name - json_file is a bad name when it sometimes handles csv or tsv instead.

It looks like the argument provided by click.File doesn't provide a way to read the size of the file, so I need to switch that out for a file path instead. https://click.palletsprojects.com/en/7.x/api/#click.Path
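Once the command receives a path instead of an open file, the total size can be looked up before reading. A rough sketch, assuming `-` is the conventional marker for standard input; the helper name `expected_total_bytes` is hypothetical:

```python
import os


def expected_total_bytes(path):
    """Return the file size in bytes, or None when no total can be known."""
    if path == "-":
        # Standard input has no size up front, so no progress bar is possible
        return None
    return os.path.getsize(path)
```

A caller could show the progress bar only when this returns a number and fall back to plain reading otherwise.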

@simonw
Owner Author

simonw commented Sep 24, 2020

Relevant code:

if csv or tsv:
    dialect = "excel-tab" if tsv else "excel"
    reader = csv_std.reader(json_file, dialect=dialect)
    headers = next(reader)
    docs = (dict(zip(headers, row)) for row in reader)
elif nl:
    docs = (json.loads(line) for line in json_file)
else:
    docs = json.load(json_file)
    if isinstance(docs, dict):
        docs = [docs]

Changing that to track progress through NL-JSON, CSV and TSV shouldn't be too hard.
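For the newline-delimited JSON branch, one possible shape is a generator that reports the size of each line as it is consumed. This is only a sketch: `iter_nl_json` and the `on_progress` callback are hypothetical names, and the file is assumed to be opened in binary mode:

```python
import io
import json


def iter_nl_json(binary_file, on_progress):
    """Yield one parsed document per line, reporting bytes consumed per line."""
    for raw_line in binary_file:
        on_progress(len(raw_line))
        if raw_line.strip():  # skip blank lines
            yield json.loads(raw_line)


# Demo: collect the per-line byte counts a progress bar would receive
sample = b'{"id": 1}\n{"id": 2}\n'
updates = []
docs = list(iter_nl_json(io.BytesIO(sample), updates.append))
```

The callback receives increments rather than a running total, which matches how most progress bar APIs expect to be advanced.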

@simonw
Owner Author

simonw commented Oct 22, 2020

I could use ijson to provide a progress bar for JSON arrays too. I'd prefer to keep that as an optional dependency though, since sqlite-utils is a library dependency for many other projects and it would be using ijson purely for the CLI component.

Here's how to iterate through a list of objects being read from a file:

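import ijson

# Stream each element of the top-level JSON array without loading the whole file
with open("/tmp/list.json", "rb") as fp:
    for obj in ijson.items(fp, "item"):
        ...  # process each object here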
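Keeping ijson optional could look something like the following sketch, which falls back to the standard library when ijson is not installed. The function name `iter_json_array` is hypothetical, and the input is assumed to be a top-level JSON array read from a binary file:

```python
import io
import json

try:
    import ijson  # optional: only needed for streaming large arrays
except ImportError:
    ijson = None


def iter_json_array(binary_file):
    """Iterate over the items of a top-level JSON array."""
    if ijson is not None:
        # Streams items one at a time, so progress can be tracked while reading
        return ijson.items(binary_file, "item")
    # Fallback: loads the whole document into memory first
    return iter(json.load(binary_file))


# Demo with in-memory data
items = list(iter_json_array(io.BytesIO(b'[{"a": 1}, {"a": 2}]')))
```

With the fallback path the whole file is parsed before the first item is available, so a byte-based progress bar would only be meaningful when ijson is present.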

@Florents-Tselai

> I know how to build this for CSV and TSV - I can read them via a file wrapper that counts how many bytes it has seen.
>
> Not sure how to do it for JSON though. Maybe I could provide it just for newline-delimited JSON? Again I can measure progress based on how many bytes have been read.

I was thinking about this while inserting a stream of ~40M line-delimited JSON docs. Wouldn't a --total-expected flag work?

That's how tqdm does it.
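A sketch of how a caller-supplied total could drive progress by record count instead of bytes, which would also work for standard input. The `--total-expected` flag is only the proposal above, not an existing option, and `with_record_progress` and `report` are hypothetical names:

```python
def with_record_progress(docs, total_expected, report):
    """Yield each document, reporting the fraction of the expected total seen."""
    for count, doc in enumerate(docs, start=1):
        # Cap at 1.0 in case the stream holds more records than expected
        report(min(count / total_expected, 1.0))
        yield doc


# Demo: four records with an expected total of four
fractions = []
inserted = list(
    with_record_progress(iter([{"n": 1}, {"n": 2}, {"n": 3}, {"n": 4}]), 4, fractions.append)
)
```

The trade-off is that the caller has to know the record count ahead of time, for example from a prior `wc -l` on line-delimited input.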
