
sqlite-utils memory should handle TSV and JSON in addition to CSV #279

Closed
simonw opened this issue Jun 18, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

simonw commented Jun 18, 2021

  • Use sniffing to detect whether the input is CSV or TSV (if :tsv or :csv was not specified), including the delimiter

Follow-on from #272
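Delimiter sniffing like this is available in the stdlib via csv.Sniffer. A minimal sketch (not the sqlite-utils implementation), restricted to comma and tab so detection only has to choose between CSV and TSV:

```python
import csv

def detect_dialect(sample: str):
    # Restrict candidate delimiters to comma and tab so the sniffer
    # only has to decide between CSV and TSV
    return csv.Sniffer().sniff(sample, delimiters=",\t")

dialect = detect_dialect("id\tname\n1\tCleo\n2\tPancakes\n")
# dialect.delimiter is "\t" for this sample
```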

@simonw simonw added the enhancement New feature or request label Jun 18, 2021
@simonw
Copy link
Owner Author

simonw commented Jun 18, 2021

To detect JSON, check to see if the stream starts with [ or { - maybe do something more sophisticated than that.

simonw commented Jun 18, 2021

Detecting valid JSON is tricky - just because a stream starts with [ or { doesn't mean the entire stream is valid JSON. You need to parse the entire stream to determine that for sure.

One way to solve this would be with a custom state machine. Another would be to use the ijson streaming parser. Annoyingly, ijson throws the same exception class for different kinds of invalid JSON, but e.args[0] on that exception includes human-readable text about the error: if it's anything other than parse error: premature EOF then it probably means the JSON was invalid.
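To illustrate why the prefix check alone is not enough (stdlib-only; this is not the ijson approach described above):

```python
import json

def looks_like_json(data: bytes) -> bool:
    # Cheap prefix check: JSON documents of interest start with [ or {
    return data.lstrip()[:1] in (b"[", b"{")

def is_valid_json(data: bytes) -> bool:
    # Definitive check: parse the entire stream
    try:
        json.loads(data)
        return True
    except ValueError:
        return False

truncated = b'[{"id": 1}, {"id": 2'
# looks_like_json(truncated) is True, but is_valid_json(truncated) is False
```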

simonw commented Jun 18, 2021

Or... since I'm not using a streaming JSON parser at the moment, if I think something is JSON I can load the entire thing into memory to validate it.

I still need to detect newline-delimited JSON. For that I can consume the first line of the input to see if it's a valid JSON object, then maybe sniff the second line too?

This does mean that if the input is a single line of GIANT JSON it will all be consumed into memory at once, but that's going to happen anyway.

So I need a function which, given a file pointer, consumes from it, detects the type, then returns that type AND a file pointer to the beginning of the file again. I can use io.BufferedReader for this.
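A sketch of that detection function, assuming a buffered binary stream (the name and the 2048-byte peek window are illustrative, not the final sqlite-utils code):

```python
import io
import json

def detect_type(fp: io.BufferedReader) -> str:
    # peek() returns buffered bytes without advancing the stream,
    # so the caller can still read from the beginning afterwards
    first_bytes = fp.peek(2048).strip()
    if first_bytes[:1] not in (b"[", b"{"):
        return "csv"
    lines = first_bytes.split(b"\n")
    try:
        first = json.loads(lines[0])
        # A complete JSON object on the first line followed by another
        # parseable line suggests newline-delimited JSON
        if isinstance(first, dict) and len(lines) > 1 and lines[1].strip():
            json.loads(lines[1])
            return "json-nl"
    except ValueError:
        pass
    return "json"
```

If the peeked window ends mid-line, the second json.loads fails and the input falls back to plain JSON, which is safe but imprecise; a real detector would need to account for that.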

simonw commented Jun 18, 2021

def detect_format(fp):
    # ...
    return "csv", fp, dialect
    # or
    return "json", fp, parsed_data
    # or
    return "json-nl", fp, docs

The mixed return types here are ugly. In all of these cases what we really want is to return a generator of {...} objects. So maybe it returns that instead.

def filepointer_to_documents(fp):
    # ...
    yield from documents

I can refactor sqlite-utils insert to use this new code too.
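A hedged sketch of that unified shape (the function name follows the comment above; the body is illustrative, not what shipped):

```python
import csv
import io
import json

def filepointer_to_documents(fp, format="csv"):
    # Whatever the input format, always yield plain dicts
    if format == "json":
        decoded = json.load(fp)
        if isinstance(decoded, dict):
            decoded = [decoded]
        yield from decoded
    else:
        text_fp = io.TextIOWrapper(fp, encoding="utf-8")
        yield from csv.DictReader(text_fp)
```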

simonw commented Jun 18, 2021

So maybe this is a function which can either be told the format or, if none is provided, it detects one for itself.

def rows_from_file(fp, format=None):
    # ...
    yield from rows

simonw commented Jun 19, 2021

I tried writing this function with type hints, but eventually gave up:

def rows_from_file(
    fp: BinaryIO,
    format: Optional[Format] = None,
    dialect: Optional[Type[csv.Dialect]] = None,
    encoding: Optional[str] = None,
) -> Generator[dict, None, None]:
    if format == Format.JSON:
        decoded = json.load(fp)
        if isinstance(decoded, dict):
            decoded = [decoded]
        if not isinstance(decoded, list):
            raise RowsFromFileBadJSON("JSON must be a list or a dictionary")
        yield from decoded
    elif format == Format.CSV:
        decoded_fp = io.TextIOWrapper(fp, encoding=encoding or "utf-8-sig")
        yield from csv.DictReader(decoded_fp, dialect=dialect or csv.excel)
    elif format == Format.TSV:
        yield from rows_from_file(
            fp, format=Format.CSV, dialect=csv.excel_tab, encoding=encoding
        )
    elif format is None:
        # Detect the format, then call this recursively
        buffered = io.BufferedReader(fp, buffer_size=4096)
        first_bytes = buffered.peek(2048).strip()
        if first_bytes[:1] in (b"[", b"{"):
            # TODO: Detect newline-JSON
            yield from rows_from_file(fp, format=Format.JSON)
        else:
            dialect = csv.Sniffer().sniff(first_bytes.decode(encoding, "ignore"))
            yield from rows_from_file(
                fp, format=Format.CSV, dialect=dialect, encoding=encoding
            )
    else:
        raise RowsFromFileError("Bad format")

mypy said:

sqlite_utils/utils.py:157: error: Argument 1 to "BufferedReader" has incompatible type "BinaryIO"; expected "RawIOBase"
sqlite_utils/utils.py:163: error: Argument 1 to "decode" of "bytes" has incompatible type "Optional[str]"; expected "str"
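Both errors have straightforward workarounds. A sketch, assuming a cast is acceptable here (not necessarily what landed in sqlite-utils):

```python
import csv
import io
from typing import BinaryIO, Optional, cast

def sniff_dialect(fp: BinaryIO, encoding: Optional[str] = None):
    # mypy: BufferedReader is typed to take a RawIOBase, so cast explicitly
    buffered = io.BufferedReader(cast(io.RawIOBase, fp), buffer_size=4096)
    first_bytes = buffered.peek(2048).strip()
    # mypy: bytes.decode() needs a str, so collapse the Optional first
    return csv.Sniffer().sniff(first_bytes.decode(encoding or "utf-8-sig"))
```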

simonw commented Jun 19, 2021

Got this working:

% curl 'https://api.github.com/repos/simonw/datasette/issues' | sqlite-utils memory - 'select id from stdin' 
