Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer unexpectedly large in GeoPackage #45178

Closed
1 of 2 tasks
tschmetzer opened this issue Sep 21, 2021 · 9 comments
Closed
1 of 2 tasks

Integer unexpectedly large in GeoPackage #45178

tschmetzer opened this issue Sep 21, 2021 · 9 comments
Labels
Data Provider Related to specific vector, raster or mesh data providers

Comments

@tschmetzer
Copy link
Contributor

tschmetzer commented Sep 21, 2021

What is the bug or the crash?

Common C++ types use 4 bytes for integer and 8 bytes for double.
Though expecting the file sizes to decrease after converting a double field to integer and removing the double field the converted file grows even larger.
The used file type is .gpkg. Haven't tried with any other file type.

image
File sizes in File manager:
image

Steps to reproduce the issue

When trying to reproduce this a second time I got a different but still larger file size of 860 kB. Yet the original resulting "corrupted" file SMOD_POP_KG_DNI500_melilla-points_int.gpkg (1.208kB) is contained in the archive. SMOD_POP_KG_DNI500_melilla_points.zip

  1. Load file SMOD_POP_KG_DNI500_melilla-points.gpkg from SMOD_POP_KG_DNI500_melilla_points.zip
  2. Create a copy/save as a different file (in my case I called it SMOD_POP_KG_DNI500_melilla-points_int.gpkg)
  3. Create a new integer field using the Field Calculator on the layer SMOD_POP_KG_DNI500_melilla-points
    image
  4. Delete the field "DNI_1" and check file sizes in File Manager.

Feature count and contents look exactly the same
image

Versions

QGIS version
3.20.2-Odense
QGIS code revision
9f59a15
Qt version
5.15.2
Python version
3.9.5
GDAL/OGR version
3.3.1
PROJ version
8.1.0
EPSG Registry database version
v10.027 (2021-06-17)
GEOS version
3.9.1-CAPI-1.14.2
SQLite version
3.35.2
PDAL version
2.3.0
PostgreSQL client version
13.0
SpatiaLite version
5.0.1
QWT version
6.1.3
QScintilla2 version
2.11.5
OS version
Windows 10 Version 2009

Active Python plugins
GroupStats
QuickOSM
db_manager
MetaSearch
processing

Supported QGIS version

  • I'm running a supported QGIS version according to the roadmap.

New profile

  • I tried with a new QGIS profile

Additional context

No response

@tschmetzer tschmetzer added the Bug Either a bug report, or a bug fix. Let's hope for the latter! label Sep 21, 2021
@m-kuhn
Copy link
Member

m-kuhn commented Sep 22, 2021

Can you try running vacuum on that database?
https://www.sqlite.org/lang_vacuum.html

@m-kuhn m-kuhn added the Feedback Waiting on the submitter for answers label Sep 22, 2021
@tschmetzer
Copy link
Contributor Author

Thanks for pointing at that interesting tool coming along with the information that the DB can get fragmented. Didn't help much though as the vacuumed file with integer values (SMOD_POP_KG_DNI500_melilla-points_int_vacuumed.gpkg) is sill larger than original one with double values (SMOD_POP_KG_DNI500_melilla-points.gpkg) even when vacuumed (SMOD_POP_KG_DNI500_melilla-points_vacuumed.gpkg):

image

@gioman gioman added Data Provider Related to specific vector, raster or mesh data providers and removed Feedback Waiting on the submitter for answers labels Sep 22, 2021
@m-kuhn
Copy link
Member

m-kuhn commented Sep 22, 2021

Can you do a bit more analysis?
Basically the first step is to make sure it's QGIS (and not SQLITE) being the problem here.

Interesting would be

  • to look into those tools: https://www.sqlite.org/dbstat.html
  • to convert the result back to double in the end and see if that changes the size again
  • to use the db manager (or an external sqlite tool) to convert the columns and check the effects

@m-kuhn m-kuhn added the Feedback Waiting on the submitter for answers label Sep 22, 2021
@agiudiceandrea
Copy link
Contributor

agiudiceandrea commented Sep 22, 2021

I cannot replicate the issue using QGIS 3.16.11 old (v1/EXE) OSGeo4W (SQLite 3.29.0) on Windows 7 64 bit.
The following steps leads to the same exact file size (as expected):

  • copy SMOD_POP_KG_DNI500_melilla-points.gpkg (778240 bytes) to SMOD_POP_KG_DNI500_melilla-points_int.gpkg using the Windows file manager
  • add SMOD_POP_KG_DNI500_melilla-points_int.gpkg in QGIS
  • use the Field Calculator to add a DNI_2 field of type "Whole number (integer)" with the expression "DNI_1"
  • remove the field DNI_1 and close the editing
  • add a new GeoPackage connection in QGIS Browser to SMOD_POP_KG_DNI500_melilla-points_int.gpkg and perform the "Compact Database (VACUUM)" task
  • remove the connection and remove the layer from QGIS
  • SMOD_POP_KG_DNI500_melilla-points_int.gpkg has the same file size (778240 bytes) as SMOD_POP_KG_DNI500_melilla-points.gpkg

@agiudiceandrea
Copy link
Contributor

agiudiceandrea commented Sep 23, 2021

I cannot replicate the issue also using QGIS 3.20.3 new (v2/MSI) OSGeo4W (SQLite 3.35.2) on Windows 10 64 bit.

@tschmetzer
Copy link
Contributor Author

tschmetzer commented Sep 23, 2021

Referring to my previous statement:

image

Thank you @agiudiceandrea for your replication efforts!
I don't know why but today I cannot replicate this either. I had used the sqlite3 binary to vacuum the file SMOD_POP_KG_DNI500_melilla-points_int.gpkg but today it results in 760KB as well as when using the GeoPackage connection in QGIS Browser to vacuum the file.

What I did achieve on two independant machines (QGIS 3.20.1 on both) was to blow the file to 776KB (794.624 Bytes) when manually reversing DNI_2 to DNI_1. Meaning:

  • add SMOD_POP_KG_DNI500_melilla-points_int.gpkg (containing the DNI_2 field) to QGIS. You can use your self-created file or my example file
  • use the Field Calculator to add a DNI_1 field of type "Decimal number (real)" with the expression "DNI_2"
  • remove the field DNI_2 and close the editing
  • add a new GeoPackage connection in QGIS Browser to SMOD_POP_KG_DNI500_melilla-points_int.gpkg and perform the "Compact Database (VACUUM)" task
  • remove the connection and remove the layer from QGIS
  • Check the file size in a Windows Explorer

Though my intention was actually not to proof an error in reversability but this is probably something that should be looked at as well. I tried to do some investigation as proposed by @m-kuhn using https://www.sqlite.org/dbstat.html

The following tables marked manually with (*) have had changes in data:

sqlite> .table

gpkg_contents
gpkg_extensions
gpkg_geometry_columns
gpkg_ogr_contents
gpkg_spatial_ref_sys
gpkg_tile_matrix
gpkg_tile_matrix_set
rtree_smod-pop-dni500_geom
*rtree_smod-pop-dni500_geom_node
*rtree_smod-pop-dni500_geom_parent
*rtree_smod-pop-dni500_geom_rowid
smod-pop-dni500

SMOD_POP_KG_DNI500_melilla-points_int.gpkg after reversing DNI_2 to DNI_1:

sqlite> .databases
main: J:\SBC-GIS\melilla-int-bug\SMOD_POP_KG_DNI500_melilla-points_int.gpkg r/w
sqlite> select pgsize,payload,unused,sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
4096|0|3904|93.5595703125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
81920|30534|5276|93.5595703125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_parent";
4096|940|2142|47.705078125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_node";
299008|264880|31768|89.3755351027397

original SMOD_POP_KG_DNI500_melilla-points.gpkg:

sqlite> .databases
main: J:\SBC-GIS\melilla-int-bug\SMOD_POP_KG_DNI500_melilla-points.gpkg r/w
sqlite> select pgsize,payload,unused,sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
4096|0|3904|91.2548828125
sqlite> select sum(pgsize),payload,unused,sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
81920|0|3904|91.2548828125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_rowid";
81920|28646|7164|91.2548828125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_parent";
4096|889|2315|43.4814453125
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="rtree_smod-pop-dni500_geom_node";
282624|250096|30387|89.2482591711957
sqlite> select sum(pgsize),sum(payload),sum(unused),sum(pgsize-unused)*100.0/sum(pgsize) from dbstat where name="smod-pop-dni500";
327680|274555|5935|98.1887817382813

Unfortunately I am not proficient enough with databases and even less with sqlite to dig down and compare the datasets that cause the trouble. Any hints how to dig deeper?

Expanding and returning to my initial topic: Even if file sizes don't grow after converting double->int and subsequent vacuuming I still don't get why using smaller field types like int (4 bytes) result in the same file sizes as doubles (8 bytes).

@agiudiceandrea
Copy link
Contributor

agiudiceandrea commented Sep 23, 2021

Even if file sizes don't grow after converting double->int and subsequent vacuuming I still don't get why using smaller field types like int (4 bytes) result in the same file sizes as doubles (8 bytes).

Datatypes in SQLite are handled dynamically and they differ from common C++ static types: "In SQLite, the datatype of a value is associated with the value itself, not with its container." (See https://www.sqlite.org/datatype3.html)

AFAIK, this means that the numeric integer value 6500 will be stored as a 2-bytes integer ("a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value") in both an INTEGER type and a REAL type column. But the numeric floating point value 6500.123 will be stored as a 8-byte IEEE floating point ("a floating point value, stored as an 8-byte IEEE floating point number") in both a REAL type and a INTEGER type column.
Consider also that "As an internal optimization, small floating point values with no fractional component and stored in columns with REAL affinity are written to disk as integers in order to take up less space and are automatically converted back into floating point as the value is read out. This optimization is completely invisible at the SQL level and can only be detected by examining the raw bits of the database file."

@tschmetzer
Copy link
Contributor Author

tschmetzer commented Sep 24, 2021

Thank you so much @agiudiceandrea for pointing to https://www.sqlite.org/datatype3.html! That explains my main question.

What a pity I couldn't replicate the original issue of a double->int converted and vacuumed DB file being larger than the original one so far.

The file size increase after a conversion double->int and then reverting int->double is still unclear and could reveal some bad code.

@rouault
Copy link
Contributor

rouault commented Sep 25, 2021

The file size increase after a conversion double->int and then reverting int->double is still unclear and could reveal some bad code.

yes, I can also reproduce this. The compacted size goes from 778240 to 794624. Looking at the output of "echo .dump | sqlite3 the.db > the.db.txt" before original and the one after the 2 conversions, one can see that the order into which tables are stored in the file is not the same, and some differences in the spatial index. However at high level, the spatial index seems to be OK. I guess one must accept this as side effects of the compaction mechanism in SQLite3.

Closing as not a bug

@rouault rouault closed this as completed Sep 25, 2021
@rouault rouault removed Feedback Waiting on the submitter for answers Bug Either a bug report, or a bug fix. Let's hope for the latter! labels Sep 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Data Provider Related to specific vector, raster or mesh data providers
Projects
None yet
Development

No branches or pull requests

5 participants