Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

insert fails on JSONL with whitespace #417

Closed
blaine opened this issue Mar 21, 2022 · 3 comments
Closed

insert fails on JSONL with whitespace #417

blaine opened this issue Mar 21, 2022 · 3 comments

Comments

@blaine
Copy link

blaine commented Mar 21, 2022

Any JSON that is newline-delimited and has whitespace (newlines) between the start of a JSON object and an attribute fails due to a parse error.

e.g. given the valid JSONL:

  "attribute": "value"
}
{
  "attribute": "value2"
}

I would expect that sqlite-utils insert --nl my.db mytable file.jsonl would properly import the data into mytable. However, the following error is thrown instead:

json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)

It makes sense that since the file is intended to be newline separated, the thing being parsed is "{" (which obviously fails), however the default newline-separated output of jq isn't compact. Using jq -c avoids this problem, but the fix is unintuitive and undocumented.

Proposed solutions:

  1. Default to a "loose" newline-separated parse; this could be implemented internally as [the equivalent of] a jq -c filter ahead of the insert step.
  2. Catch the JSONDecodeError (or pre-empt it in the case of a record === "{\n") and give the user a "it looks like your json isn't actually newline-delimited; try running it through jq -c instead" error message.

It might just have been too early in the morning when I was playing with this, but running pipes of data through sqlite-utils without the 'knack' of it led to some false starts.

@simonw
Copy link
Owner

simonw commented Mar 21, 2022

I've not really thought about standards as much here as I should. It looks like there are two competing specs for newline-delimited JSON!

http://ndjson.org/ is the one I've been using in sqlite-utils - and https://github.com/ndjson/ndjson-spec#31-serialization says:

The JSON texts MUST NOT contain newlines or carriage returns.

https://jsonlines.org/ is the other one. It is slightly less clear, but it does say this:

  1. Each Line is a Valid JSON Value

The most common values will be objects or arrays, but any JSON value is permitted.

My interpretation of both of these is that newlines in the middle of a JSON object shouldn't be allowed.

So what's jq doing here? It looks to me like that jq format is its own thing - it's not actually compatible with either of those two loose specs described above.

The jq docs seem to call this "whitespace-separated JSON": https://stedolan.github.io/jq/manual/v1.6/#Invokingjq

The thing I like about newline-delimited JSON is that it's really trivial to parse - loop through each line, run it through json.loads() and that's it. No need to try and unwrap JSON objects that might span multiple lines.

Unless someone has written a robust Python implementation of a jq-compatible whitespace-separated JSON parser, I'm inclined to leave this as is. I'd be fine adding some documentation that helps point people towards jq -c though.

@blaine
Copy link
Author

blaine commented Mar 21, 2022

That makes sense; just a little hint that points folks towards doing the right thing might be helpful!

fwiw, the reason I was using jq in the first place was just a quick way to extract one attribute from an actual JSON array. When I initially imported it, I got a table with a bunch of embedded JSON values, rather than a native table, because each array entry had two attributes, one with the data I actually wanted. Not sure how common a use-case this is, though (and easily fixed, aside from the jq weirdness!)

@simonw simonw closed this as completed in 95522ad Mar 25, 2022
@simonw
Copy link
Owner

simonw commented Mar 25, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants