Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

--sniff option for sniffing delimiters #230

Closed
simonw opened this issue Feb 14, 2021 · 8 comments
Closed

--sniff option for sniffing delimiters #230

simonw opened this issue Feb 14, 2021 · 8 comments
Labels
cli-tool enhancement New feature or request

Comments

@simonw
Copy link
Owner

simonw commented Feb 14, 2021

I just spotted that csv.Sniffer in the Python standard library has a .has_header(sample) method which detects if the first row appears to be a header or not, which is interesting. https://docs.python.org/3/library/csv.html#csv.Sniffer

Originally posted by @simonw in #228 (comment)

@simonw simonw added the enhancement New feature or request label Feb 14, 2021
@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

Running this could take any CSV (or TSV) file and automatically detect the delimiter. If no header row is detected it could add unknown1,unknown2 headers:

sqlite-utils insert db.db data file.csv --sniff

(Using --sniff would imply --csv)

This could be called --sniffer instead but I like --sniff better.

@simonw simonw changed the title Idea: --csv --sniff option for sniffing delimiters and presence of a header --sniff option for sniffing delimiters and presence of a header Feb 14, 2021
@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

The challenge here is how to read the first 2048 bytes and then reset the incoming file.

The Python docs example looks like this:

with open('example.csv', newline='') as csvfile:
    dialect = csv.Sniffer().sniff(csvfile.read(1024))
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)

Here's the relevant code in sqlite-utils:

if csv or tsv:
dialect = "excel-tab" if tsv else "excel"
with file_progress(json_file, silent=silent) as json_file:
csv_reader_args = {"dialect": dialect}
if delimiter:
csv_reader_args["delimiter"] = delimiter
if quotechar:
csv_reader_args["quotechar"] = quotechar
reader = csv_std.reader(json_file, **csv_reader_args)

The challenge is going to be having the --sniff option work with the progress bar. Here's how file_progress() works:

@contextlib.contextmanager
def file_progress(file, silent=False, **kwargs):
if silent or file.raw.fileno() == 0: # 0 = stdin
yield file
else:
file_length = os.path.getsize(file.raw.name)
with click.progressbar(length=file_length, **kwargs) as bar:
yield UpdateWrapper(file, bar.update)

If file.raw is stdin can I do the equivalent of csvfile.seek(0) on it?

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

No, you can't .seek(0) on stdin:

  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py", line 678, in insert_upsert_implementation
    json_file.raw.seek(0)
OSError: [Errno 29] Illegal seek

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

Types involved:

(Pdb) type(json_file.raw)
<class '_io.FileIO'>
(Pdb) type(json_file)
<class 'encodings.utf_8.StreamReader'>

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

Maybe I shouldn't be using StreamReader at all - https://www.python.org/dev/peps/pep-0400/ suggests that it should be deprecated in favour of io.TextIOWrapper. I'm using StreamReader due to this line:

encoding = encoding or "utf-8"
json_file = codecs.getreader(encoding)(json_file)

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

There are two code paths here that matter:

  • For a regular file, can read the first 2048 bytes, then .seek(0) before continuing. That's easy.
  • stdin is harder. I need to read and buffer the first 2048 bytes, then pass an object to csv.reader() which will replay that chunk and then play the rest of stdin.

I'm a bit stuck on the second one. Ideally I could use something like itertools.chain() but I can't find an alternative for file-like objects.

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

I think I've got it. I can use io.BufferedReader() to get an object I can run .peek(2048) on, then wrap THAT in io.TextIOWrapper:

    encoding = encoding or "utf-8"
    buffered = io.BufferedReader(json_file, buffer_size=4096)
    decoded = io.TextIOWrapper(buffered, encoding=encoding, line_buffering=True)
    if pk and len(pk) == 1:
        pk = pk[0]
    if csv or tsv:
        if sniff:
            # Read first 2048 bytes and use that to detect
            first_bytes = buffered.peek(2048)
            print('first_bytes', first_bytes)

@simonw
Copy link
Owner Author

simonw commented Feb 14, 2021

@simonw simonw closed this as completed in 99ff0a2 Feb 14, 2021
@simonw simonw changed the title --sniff option for sniffing delimiters and presence of a header --sniff option for sniffing delimiters Feb 14, 2021
simonw added a commit that referenced this issue Feb 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cli-tool enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

1 participant