
CSV inference fixes #562

Merged
mildbyte merged 4 commits into master from bugfix/csv-inference-fixes on Nov 8, 2021


Contributor

@mildbyte mildbyte commented Nov 8, 2021

Make CSV parsing/inference more robust so that it doesn't crash on the examples from https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html. Some of those files are malformed; for them, it will still output a table, with rows padded or truncated as needed.

Add `skipinitialspace=True` to the CSV reader. This works around CSV files that use
leading spaces in headers/fields and makes inference more tolerant (`"  2"` gets
inferred as a number instead of a string). For example:

col1,col2,col3
   1,   2,  aa
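
A minimal sketch of what the flag does, using only the standard library `csv` module on the sample above (the variable names are illustrative and not taken from this PR's code):

```python
import csv
import io

# The sample from the description: fields are padded with leading spaces.
raw = "col1,col2,col3\n   1,   2,  aa\n"

# Default behaviour: leading spaces are kept as part of each field.
default_rows = list(csv.reader(io.StringIO(raw)))

# With skipinitialspace=True, whitespace at the start of each field is dropped.
trimmed_rows = list(csv.reader(io.StringIO(raw), skipinitialspace=True))

print(default_rows[1])
print(trimmed_rows[1])
```

Only leading whitespace is affected; trailing spaces in a field are left alone.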

Only treat actual JSON objects as the JSON datatype (we had some false positives
where `"42"` was parsed as JSON by us when it really shouldn't be).
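
To illustrate the false positive: `json.loads` happily accepts bare scalars, so "it parses" is not a useful test on its own. A rough sketch of a stricter check follows; `looks_like_json_object` is a hypothetical helper, not the function in this PR, and treating only `dict` results (not arrays) as JSON is an assumption here.

```python
import json


def looks_like_json_object(value: str) -> bool:
    """Return True only if `value` parses as a JSON object (a dict).

    Hypothetical helper for illustration; scalars such as 42 or "42"
    are valid JSON documents but are rejected here.
    """
    try:
        parsed = json.loads(value)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return isinstance(parsed, dict)


for candidate in ['{"a": 1}', "42", '"42"', "not json"]:
    print(repr(candidate), looks_like_json_object(candidate))
```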

Ignore empty rows at inference/query time.
This is a tradeoff: it means we will silently ignore errors in malformed/weird
CSV files and may return a bunch of varchar columns because some data is
shifted around. However, this is still better than flat out erroring, since it
gives the user some feedback and lets them change the parameters or work out
how to fix their file.
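
A rough sketch of the tolerant behaviour described above, assuming rows are padded with `None` and truncated to the header width. `read_tolerant` is a hypothetical name and the exact padding value is an assumption, not necessarily what the code in this PR does.

```python
import csv
import io


def read_tolerant(text: str):
    """Parse CSV text, skipping empty rows and normalising row width.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = next(reader)
    width = len(header)
    rows = []
    for row in reader:
        # Blank lines come back as [] (or as all-empty fields); skip them.
        if not any(field.strip() for field in row):
            continue
        # Truncate long rows, then pad short ones up to the header width.
        row = row[:width]
        row = row + [None] * (width - len(row))
        rows.append(row)
    return header, rows


header, rows = read_tolerant("a,b,c\n1,2,3\n\n4,5\n6,7,8,9\n")
print(header)
print(rows)
```

In this sketch the blank line is dropped, the short row gains a trailing `None`, and the extra field on the long row is discarded, so the result is always rectangular even when the input is not.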
@mildbyte mildbyte merged commit 10966c3 into master Nov 8, 2021
@mildbyte mildbyte deleted the bugfix/csv-inference-fixes branch November 8, 2021 16:24
mildbyte added a commit that referenced this pull request Nov 17, 2021
  * Splitfile speedups (#567)
  * Various query speedups (#563, #561)
  * More robust CSV querying (#562)

Full set of changes: [`v0.2.17...v0.2.18`](v0.2.17...v0.2.18)