Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sgr commit fails on tables with just one jsonb column #594

Open
ltrgoddard opened this issue Dec 22, 2021 · 2 comments
Open

sgr commit fails on tables with just one jsonb column #594

ltrgoddard opened this issue Dec 22, 2021 · 2 comments

Comments

@ltrgoddard
Copy link

Hi folks,

I noticed that sgr fails when trying to commit a table with the following structure—uk_company_data_test.psc(data jsonb). The output below shows three attempts to commit the table—first with just the jsonb column, second with an all-null integer column called id added, and third with the first few rows of id filled in. This last attempt seemed succesful, but when it was pushed to the DDN, it became apparent that only the rows with values for id had in fact been committed/pushed (rows with null were skipped completely).

I'm guessing this is something to do with sorting? Not a big deal as it's generally good practice to have an ID column in a JSONB table anyway, but wanted to flag!

louis@quwain ~ % sgr init uk_company_data_test
Initialized empty repository uk_company_data_test
louis@quwain ~ % sgr commit uk_company_data_test
Committing uk_company_data_test...
Processing table companies
Processing table psc
error: psycopg2.errors.SyntaxError: syntax error at or near ")"
error: LINE 1: ..._row_nums AS (SELECT(ROW_NUMBER() OVER (ORDER BY ()::text) -...
error:                                                              ^

louis@quwain ~ % sgr commit uk_company_data_test
Committing uk_company_data_test...
Processing table companies
Object o058e47b8dc5a66b16a81c8c75279982bacf62dc3bfc309b1615c31c1894dbf already exists, skipping
Processing table psc
error: TypeError: reduce() of empty sequence with no initial value

louis@quwain ~ % sgr commit uk_company_data_test
Committing uk_company_data_test...
Processing table companies
Object o058e47b8dc5a66b16a81c8c75279982bacf62dc3bfc309b1615c31c1894dbf already exists, skipping
Processing table psc
Committed uk_company_data_test as 1c5e6165edb6.
@mildbyte
Copy link
Contributor

Thanks for flagging! You've hit a corner case with sgr: it uses the primary key to partition tables and track overwrites when versioning. If the key doesn't exist, it assumes a surrogate PK made out of the whole row (as text), but ignores some certain types that wouldn't make sense as a PK (like jsonb, geodata, ...).

In your first example, it tried doing that and failed because there were no columns that could be a PK or a surrogate PK, which isn't handled at all and so it crashed. In the second example, it used id::text as the PK and all rows that were NULLs were treated as overwriting each other, so only one NULL effectively ended up in the dataset.

This no-PK case is often handler pretty poorly so we're considering dropping it entirely and adding an autogenerated PK (like a hash of the whole row / row number like you did) automatically in sgr commit and actually storing it.

One other solution would be to "normalize" the JSON into a table like psc(kind varchar, name varchar, links.self varchar, address.country varchar, address.locality varchar, ...). This might also mean that the dataset will be physically smaller (leverage columnar storage and compress better) and result in faster scans (since we won't need to read the entire JSON doc if you're only aggregating over e.g. name). You can do it either at data load time, after the PG table load but before the sgr commit or on Splitgraph Cloud itself (we have a small example of how to do it with dbt at https://github.com/splitgraph/dbt-transform-example)

@ltrgoddard
Copy link
Author

That makes sense—thanks for explaining the logic behind it. I did think after I'd loaded this particular data set that jsonb is a poor choice for getting the best out of cstore_fdw. I'll do a cut-down, normalised version for this particular project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants