
Add support for newline-delimited JSON #13

Merged 4 commits on May 16, 2021
Conversation

eyeseast
Contributor

☝️ does that. Closes #12

A couple caveats:

  • Features are streamed in with a generator, since this is intended for large datasets. That means we can't look ahead at the first 100 features.
  • We can't auto-detect feature IDs using ndjson. I'm ok with that. You can still pass --pk=id and get the same thing.
  • We can't use the first 100 features to build the initial table. Again, I think it's fine. I note in the README how to grab a subset using Fiona.
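For context, the streaming approach described above can be sketched as a generator that yields one feature per line. This is a minimal illustration of the newline-delimited GeoJSON pattern, not the PR's actual code; `stream_features` is a hypothetical helper name:

```python
import json


def stream_features(fp):
    # Newline-delimited GeoJSON: one Feature object per line.
    # Yield each feature as a dict without reading the whole file
    # into memory, which is the point for large datasets.
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)
```

Because this is a generator, downstream code sees each feature exactly once, which is why look-ahead (e.g. at the first 100 features) needs extra handling.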

I have one test here, but I'm very much open to suggestions for more.

@eyeseast eyeseast requested a review from simonw February 13, 2020 02:32
@simonw
Owner

simonw commented Feb 15, 2020

There's a pattern for peeking ahead in sqlite-utils here:

https://github.com/simonw/sqlite-utils/blob/e8b2b7383bd94659d3b7a857a1414328bc48bc19/sqlite_utils/db.py#L993-L1004

You can use itertools.islice to pull out the first 100 items (and turn them into a list with list()), then use itertools.chain(that_list, original_iterator) to loop through the first 100 items followed by the rest of the iterator.
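The `islice`/`chain` idea above can be sketched like this; `peek` is a hypothetical helper name, not the sqlite-utils function:

```python
import itertools


def peek(iterator, n=100):
    # Materialize the first n items, then return both the sample and
    # an iterator that replays the sample before the remainder, so
    # the caller can inspect the head without losing any items.
    first = list(itertools.islice(iterator, n))
    return first, itertools.chain(first, iterator)
```

The sample can be inspected (for example, to guess column types or a primary key) while the chained iterator still yields every item exactly once.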

@simonw
Owner

simonw commented Feb 15, 2020

This looks great - the tests look robust enough to me.

@eyeseast
Contributor Author

Cool. I'll see if I can get the peek-ahead to work in a reasonable way. I was thinking about islice, but wasn't totally sure where to apply it with yield_records also happening.

@eyeseast
Contributor Author

I got feature.id working by sampling the stream of features coming in, and then chaining that sample back into the original stream.

Getting it to work with processed features to guess column types starts to feel precarious, because lists and generators are going to operate differently. It might ultimately be easier to do features = iter(features) at the top, so it's always a one-way stream and everything operates the same, but I'm not sure that's worth it. I think collecting a subset into a feature collection, like I describe in the readme, actually feels a little easier and more deliberate.
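The sample-then-chain approach described above, including the `features = iter(features)` normalization so lists and generators behave the same, could look roughly like this. This is a sketch, not the merged code; `detect_pk` is a hypothetical helper name:

```python
import itertools


def detect_pk(features, sample_size=100):
    # Normalize to an iterator up front so a list and a generator
    # behave identically (a one-way stream either way).
    features = iter(features)
    # Pull a sample off the stream to check whether every feature
    # carries a top-level "id" we could use as a primary key.
    sample = list(itertools.islice(features, sample_size))
    has_ids = bool(sample) and all("id" in f for f in sample)
    # Chain the sample back so the caller still sees the full stream.
    return has_ids, itertools.chain(sample, features)
```

Without the `iter()` call at the top, passing a plain list would hand `islice` and `chain` two independent iterators over the same data and the sampled features would be yielded twice, which is exactly the list-versus-generator mismatch described above.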

simonw added a commit that referenced this pull request May 16, 2021
@simonw simonw merged commit 13c4e5a into simonw:master May 16, 2021
simonw added a commit that referenced this pull request May 17, 2021