
sqlite-utils memory should handle TSV and JSON in addition to CSV #279

Closed
simonw opened this issue Jun 18, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

simonw commented Jun 18, 2021

  • Use sniffing to detect whether the input is CSV or TSV (if :tsv or :csv was not specified), including the delimiter

Follow-on from #272
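Delimiter sniffing like this is available in the stdlib via csv.Sniffer. A minimal sketch (not the sqlite-utils implementation), restricted to comma and tab so detection only has to choose between CSV and TSV:

```python
import csv

def detect_dialect(sample: str):
    # Restrict candidate delimiters to comma and tab so the sniffer
    # only has to decide between CSV and TSV
    return csv.Sniffer().sniff(sample, delimiters=",\t")

dialect = detect_dialect("id\tname\n1\tCleo\n2\tPancakes\n")
# dialect.delimiter is "\t" for this sample
```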

@simonw simonw added the enhancement New feature or request label Jun 18, 2021
@simonw
Copy link
Owner Author

simonw commented Jun 18, 2021

To detect JSON, check to see if the stream starts with [ or { - maybe do something more sophisticated than that.

simonw commented Jun 18, 2021

Detecting valid JSON is tricky - just because a stream starts with [ or { doesn't mean the entire stream is valid JSON. You need to parse the entire stream to determine that for sure.

One way to solve this would be with a custom state machine. Another would be to use the ijson streaming parser. Annoyingly, ijson throws the same exception class for different kinds of invalid JSON, but e.args[0] on that exception includes human-readable text about the error: if it's anything other than parse error: premature EOF then it probably means the JSON was invalid.
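To illustrate why the prefix check alone is not enough (stdlib-only; this is not the ijson approach described above):

```python
import json

def looks_like_json(data: bytes) -> bool:
    # Cheap prefix check: JSON documents of interest start with [ or {
    return data.lstrip()[:1] in (b"[", b"{")

def is_valid_json(data: bytes) -> bool:
    # Definitive check: parse the entire stream
    try:
        json.loads(data)
        return True
    except ValueError:
        return False

truncated = b'[{"id": 1}, {"id": 2'
# looks_like_json(truncated) is True, but is_valid_json(truncated) is False
```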

simonw commented Jun 18, 2021

Or... since I'm not using a streaming JSON parser at the moment, if I think something is JSON I can load the entire thing into memory to validate it.

I still need to detect newline-delimited JSON. For that I can consume the first line of the input to see if it's a valid JSON object, then maybe sniff the second line too?

This does mean that if the input is a single line of GIANT JSON it will all be consumed into memory at once, but that's going to happen anyway.

So I need a function which, given a file pointer, consumes from it, detects the type, then returns that type AND a file pointer to the beginning of the file again. I can use io.BufferedReader for this.
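A sketch of that detection function, assuming a buffered binary stream (the name and the 2048-byte peek window are illustrative, not the final sqlite-utils code):

```python
import io
import json

def detect_type(fp: io.BufferedReader) -> str:
    # peek() returns buffered bytes without advancing the stream,
    # so the caller can still read from the beginning afterwards
    first_bytes = fp.peek(2048).strip()
    if first_bytes[:1] not in (b"[", b"{"):
        return "csv"
    lines = first_bytes.split(b"\n")
    try:
        first = json.loads(lines[0])
        # A complete JSON object on the first line followed by another
        # parseable line suggests newline-delimited JSON
        if isinstance(first, dict) and len(lines) > 1 and lines[1].strip():
            json.loads(lines[1])
            return "json-nl"
    except ValueError:
        pass
    return "json"
```

If the peeked window ends mid-line, the second json.loads fails and the input falls back to plain JSON, which is safe but imprecise; a real detector would need to account for that.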

simonw commented Jun 18, 2021

def detect_format(fp):
    # ...
    return "csv", fp, dialect
    # or
    return "json", fp, parsed_data
    # or
    return "json-nl", fp, docs

The mixed return types here are ugly. In all of these cases what we really want is to return a generator of {...} objects. So maybe it returns that instead.

def filepointer_to_documents(fp):
    # ...
    yield from documents

I can refactor sqlite-utils insert to use this new code too.
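A hedged sketch of that unified shape (the function name follows the comment above; the body is illustrative, not what shipped):

```python
import csv
import io
import json

def filepointer_to_documents(fp, format="csv"):
    # Whatever the input format, always yield plain dicts
    if format == "json":
        decoded = json.load(fp)
        if isinstance(decoded, dict):
            decoded = [decoded]
        yield from decoded
    else:
        text_fp = io.TextIOWrapper(fp, encoding="utf-8")
        yield from csv.DictReader(text_fp)
```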

simonw commented Jun 18, 2021

So maybe this is a function which can either be told the format or, if none is provided, it detects one for itself.

def rows_from_file(fp, format=None):
    # ...
    yield from rows

simonw commented Jun 19, 2021

I tried writing this function with type hints, but eventually gave up:

def rows_from_file(
    fp: BinaryIO,
    format: Optional[Format] = None,
    dialect: Optional[Type[csv.Dialect]] = None,
    encoding: Optional[str] = None,
) -> Generator[dict, None, None]:
    if format == Format.JSON:
        decoded = json.load(fp)
        if isinstance(decoded, dict):
            decoded = [decoded]
        if not isinstance(decoded, list):
            raise RowsFromFileBadJSON("JSON must be a list or a dictionary")
        yield from decoded
    elif format == Format.CSV:
        decoded_fp = io.TextIOWrapper(fp, encoding=encoding or "utf-8-sig")
        yield from csv.DictReader(decoded_fp, dialect=dialect or csv.excel)
    elif format == Format.TSV:
        yield from rows_from_file(
            fp, format=Format.CSV, dialect=csv.excel_tab, encoding=encoding
        )
    elif format is None:
        # Detect the format, then call this recursively
        buffered = io.BufferedReader(fp, buffer_size=4096)
        first_bytes = buffered.peek(2048).strip()
        if first_bytes[:1] in (b"[", b"{"):
            # TODO: Detect newline-JSON
            yield from rows_from_file(fp, format=Format.JSON)
        else:
            dialect = csv.Sniffer().sniff(first_bytes.decode(encoding, "ignore"))
            yield from rows_from_file(
                fp, format=Format.CSV, dialect=dialect, encoding=encoding
            )
    else:
        raise RowsFromFileError("Bad format")

mypy said:

sqlite_utils/utils.py:157: error: Argument 1 to "BufferedReader" has incompatible type "BinaryIO"; expected "RawIOBase"
sqlite_utils/utils.py:163: error: Argument 1 to "decode" of "bytes" has incompatible type "Optional[str]"; expected "str"
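Both errors have straightforward workarounds. A sketch, assuming a cast is acceptable here (not necessarily what landed in sqlite-utils):

```python
import csv
import io
from typing import BinaryIO, Optional, cast

def sniff_dialect(fp: BinaryIO, encoding: Optional[str] = None):
    # mypy: BufferedReader is typed to take a RawIOBase, so cast explicitly
    buffered = io.BufferedReader(cast(io.RawIOBase, fp), buffer_size=4096)
    first_bytes = buffered.peek(2048).strip()
    # mypy: bytes.decode() needs a str, so collapse the Optional first
    return csv.Sniffer().sniff(first_bytes.decode(encoding or "utf-8-sig"))
```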

simonw commented Jun 19, 2021

Got this working:

% curl 'https://api.github.com/repos/simonw/datasette/issues' | sqlite-utils memory - 'select id from stdin' 
