Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some versions of QGIS can't parse mixed-dtype-columns in GeoJSON #51911

Open
2 tasks done
joooeey opened this issue Feb 17, 2023 · 12 comments
Open
2 tasks done

Some versions of QGIS can't parse mixed-dtype-columns in GeoJSON #51911

joooeey opened this issue Feb 17, 2023 · 12 comments
Labels
Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers Regression Something which used to work, but doesn't anymore

Comments

@joooeey
Copy link

joooeey commented Feb 17, 2023

What is the bug or the crash?

GeoJSON allows to put any JSON value (string, number, object, array, true, false, null) as the value of a property. If a property with the same name appears in multiple GeoJSON features, GDAL maps that onto a column. Depending on the version of GDAL (I guess) and what dtype is encountered first,

  • either the column only preserves JSON objects and null values but converts strings, numbers, arrays, true and false to NULL. (Looses almost all information)
  • or coerces everything to strings. (Looses only type information)

In detail, I have seen two cases how QGIS deals with mixed types of JSON values:

  1. GDAL version is 3.6.2 AND the first value is NOT a string
  • According to Layer/Properties/Fields, the Type name of the field becomes JSON.
  • This is incorrectly mapped to the Type QVariantMap which can only represent JSON object but not any other type of JSON value (i.e. string, number, array, true, false, null).
  • Consequently, only JSON objects and null values are correctly read into the attribute table. Arrays show as empty. Strings, numbers, true, false and null are mapped to NULL.
  1. GDAL version is 3.0.4 OR the first value is a string
  • According to Layer/Properties/Fields, the Type name of the field becomes String.
  • This is mapped to QString.
  • Consequently, all types of JSON values are coerced to a string representation.

Steps to reproduce the issue

Create the following file with a text editor:

test.geojson

{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": 42}
      },
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": "astring"}
      },
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
      "properties": {"prop0": {"nested": 75}}
      }
    ]
  }

and load it in QGIS. Look at the Attribute table. On the newest QGIS version on Windows the values 42 and "astring" are lost and replaced by NULL. On a fresh install of QGIS on Linux Mint 20.1, all values are present but converted to strings.

Edit the GeoJSON to have a string in the first feature, close and reopen, and everything is converted to string on Windows too.

Versions

NULL values on Windows & QGIS 3.28:

QGIS version
3.28.3-Firenze
QGIS code revision
c12bcb2
Qt version
5.15.3
Python version
3.9.5
GDAL/OGR version
3.6.2
PROJ version
9.1.1
EPSG Registry database version
v10.076 (2022-08-31)
GEOS version
3.11.1-CAPI-1.17.1
SQLite version
3.39.4
PDAL version
2.4.3
PostgreSQL client version
unknown
SpatiaLite version
5.0.1
QWT version
6.1.6
QScintilla2 version
2.13.1
OS version
Windows 10 Version 2009

Active Python plugins
db_manager
0.1.20
grassprovider
2.12.99
MetaSearch
0.3.6
processing
2.12.99
sagaprovider
2.12.99

NULL values on Windows & QGIS 3.22

QGIS version
3.22.16-Białowieża
QGIS code revision
6f08e4d
Qt version
5.15.3
Python version
3.9.5
GDAL/OGR version
3.6.2
PROJ version
9.1.1
EPSG Registry database version
v10.076 (2022-08-31)
GEOS version
3.11.1-CAPI-1.17.1
SQLite version
3.39.4
PDAL version
2.4.3
PostgreSQL client version
14.3
SpatiaLite version
5.0.1
QWT version
6.1.6
QScintilla2 version
2.13.1
OS version
Windows 10 Version 2009

Active Python plugins
db_manager
0.1.20
grassprovider
2.12.99
MetaSearch
0.3.5
processing
2.12.99
sagaprovider
2.12.99

string values on Linux Mint and QGIS 3.26

<style type="text/css"> p, li { white-space: pre-wrap; } </style>
QGIS version 3.26.3-Buenos Aires QGIS code revision 65e4edf
Qt version 5.12.8
Python version 3.8.10
GDAL/OGR version 3.0.4
PROJ version 6.3.1
EPSG Registry database version v9.8.6 (2020-01-22)
Compiled against GEOS 3.8.0-CAPI-1.13.1 Running against GEOS 3.8.0-CAPI-1.13.1
SQLite version 3.31.1
PDAL version 2.0.1
PostgreSQL client version 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1)
SpatiaLite version 4.3.0a
QWT version 6.1.4
QScintilla2 version 2.11.2
OS version Linux Mint 20.1
       
Active Python plugins
QPackage 1.5
quick_map_services 0.19.26
mapswipetool_plugin 1.2
qconsolidate3 0.2.0
openlayers_plugin 2.0.0
qgis2web 3.16.0
qgis-maptiler-plugin 2.0.0
QuickOSM 2.0.0
grassprovider 2.12.99
db_manager 0.1.20
sagaprovider 2.12.99
processing 2.12.99
MetaSearch 0.3.6
QGIS version 3.26.3-Buenos Aires QGIS code revision [65e4edf](https://github.com/qgis/QGIS/commit/65e4edfdad) Qt version 5.12.8 Python version 3.8.10 GDAL/OGR version 3.0.4 PROJ version 6.3.1 EPSG Registry database version v9.8.6 (2020-01-22) Compiled against GEOS 3.8.0-CAPI-1.13.1 Running against GEOS 3.8.0-CAPI-1.13.1 SQLite version 3.31.1 PDAL version 2.0.1 PostgreSQL client version 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1) SpatiaLite version 4.3.0a QWT version 6.1.4 QScintilla2 version 2.11.2 OS version Linux Mint 20.1

Active Python plugins
QPackage
1.5
quick_map_services
0.19.26
mapswipetool_plugin
1.2
qconsolidate3
0.2.0
openlayers_plugin
2.0.0
qgis2web
3.16.0
qgis-maptiler-plugin
2.0.0
QuickOSM
2.0.0
grassprovider
2.12.99
db_manager
0.1.20
sagaprovider
2.12.99
processing
2.12.99
MetaSearch
0.3.6

Supported QGIS version

  • I'm running a supported QGIS version according to the roadmap.

New profile

  • I tried with a new QGIS profile

Additional context

No response

@joooeey joooeey added the Bug Either a bug report, or a bug fix. Let's hope for the latter! label Feb 17, 2023
@agiudiceandrea agiudiceandrea added the Data Provider Related to specific vector, raster or mesh data providers label Feb 19, 2023
@elpaso elpaso self-assigned this Feb 21, 2023
@joooeey
Copy link
Author

joooeey commented Feb 21, 2023

I think I found the commits that led to this unfortunate situation:

  • In December 2018, QGIS was updated to improve Geopackage handling: 8f5a732
  • In February 2022, GDAL was updated to map mixed-dtype fields to JSON in place of string: OSGeo/gdal@43b1e7a

@elpaso
Copy link
Contributor

elpaso commented Feb 22, 2023

@joooeey you have found the source of the issue but I don't know how to fix it: QGIS data model does not allow for mixed type fields so I guess the best we can do is to fall back on String in case of mixed type.

Is that the expected behavior?

@joooeey
Copy link
Author

joooeey commented Feb 22, 2023

@elpaso better than loosing data. One would have to take a look how to deal with other source formats though, especially GeoPackage.

@agiudiceandrea agiudiceandrea added the Regression Something which used to work, but doesn't anymore label Feb 22, 2023
@elpaso
Copy link
Contributor

elpaso commented Feb 23, 2023

I don't see any other solution than scanning the attributes whose type is not a string and loop through all the features to determine if all the values for that attribute are compatible with type, fallback to string if not.

@rouault am I missing something here?

Here is a preliminary test for the issue:

    def testGeoJsonMixedType(self):
        """Test issue GH #51911 Some versions of QGIS can't parse mixed-dtype-columns in GeoJSON"""

        temp_dir = QTemporaryDir()
        temp_path = temp_dir.path()
        json_path = os.path.join(temp_path, 'test.json')

        data = {
            'type': 'FeatureCollection',
            'features': [
                { "type": "Feature",
                    "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
                    "properties": {"prop0": "astring"}
                },
                { "type": "Feature",
                    "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
                    "properties": {"prop0": 42}
                },
                { "type": "Feature",
                    "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
                    "properties": {"prop0": {"nested": 75}}
                }
            ]
        }

        with open(json_path, 'w+') as f:
            f.write(json.dumps(data))

        vl = QgsVectorLayer(json_path, 'vl')
        self.assertTrue(vl.isValid())
        self.assertEqual(vl.fields()[0].type(), QVariant.String)

        # Swap string and int
        f = data['features'][0]
        data['features'][0] = data['features'][1]
        data['features'][1] = f

        with open(json_path, 'w+') as f:
            f.write(json.dumps(data))

        vl = QgsVectorLayer(json_path, 'vl')
        self.assertTrue(vl.isValid())
        self.assertEqual(vl.fields()[0].type(), QVariant.String)

        # Swap obj and int
        f = data['features'][0]
        data['features'][0] = data['features'][2]
        data['features'][2] = f

        with open(json_path, 'w+') as f:
            f.write(json.dumps(data))

        vl = QgsVectorLayer(json_path, 'vl')
        self.assertTrue(vl.isValid())
        self.assertEqual(vl.fields()[0].type(), QVariant.String)



@rouault
Copy link
Contributor

rouault commented Feb 23, 2023

@rouault am I missing something here?

Recent GDAL versions will report a type=OFTString + subtype=OFSTJSON, but QGIS (actually QJsonDocument::fromJson( )) doesn't like when the string content reported by GDAL isn't a JSON object ({} type of construct) or a JSON array. GDAL can also report true, false, numeric values or (unquoted) strings depending on the content of the JSON feature.

I had started an attempt to improve things with the following, but then it appears that QGIS editor widgets don't like non-JSON objects as JSON content

diff --git a/src/core/qgsogrutils.cpp b/src/core/qgsogrutils.cpp
index 7eb146b89e..0e09232bd0 100644
--- a/src/core/qgsogrutils.cpp
+++ b/src/core/qgsogrutils.cpp
@@ -596,11 +596,36 @@ QVariant QgsOgrUtils::getOgrFeatureAttribute( OGRFeatureH ogrFet, const QgsField
       case QVariant::Map:
       {
         //it has to be JSON
-        //it's null if no json format
+        QString strVal;
         if ( encoding )
-          value = QJsonDocument::fromJson( encoding->toUnicode( OGR_F_GetFieldAsString( ogrFet, attIndex ) ).toUtf8() ).toVariant();
+          strVal = encoding->toUnicode( OGR_F_GetFieldAsString( ogrFet, attIndex ) );
         else
-          value = QJsonDocument::fromJson( QString::fromUtf8( OGR_F_GetFieldAsString( ogrFet, attIndex ) ).toUtf8() ).toVariant();
+          strVal = QString::fromUtf8( OGR_F_GetFieldAsString( ogrFet, attIndex ) );
+        if ( strVal == QLatin1String( "true" ) )
+        {
+          value = QVariant( true );
+        }
+        else if ( strVal == QLatin1String( "false" ) )
+        {
+          value = QVariant( true );
+        }
+        else if ( strVal.size() > 0 && ( strVal[0] == '{' || strVal[0] == '[' ) )
+        {
+          value = QJsonDocument::fromJson( strVal.toUtf8() ).toVariant();
+        }
+        else
+        {
+          bool ok;
+          double v = strVal.toDouble( &ok );
+          if ( ok )
+          {
+            value = QVariant( v );
+          }
+          else
+          {
+            value = QVariant( strVal );
+          }
+        }
         break;
       }
       default:

Feel free to take over from this initial attempt if that makes sense, but it looks like it doesn't.

Or perhaps the solution for QGIS would be an open option to add in the GDAL GeoJSON driver to ask it to fallback to the old behaviour ? But you would also get the same issue when opening a GeoPackage converted with ogr2ogr from the above test.geojson:

$ ogr2ogr test.gpkg test.geojson 
$ ogrinfo test.gpkg -al -q

Layer name: test
OGRFeature(test):1
  prop0 (String(JSON)) = 42
  POINT (102.0 0.5)

OGRFeature(test):2
  prop0 (String(JSON)) = astring
  POINT (102.0 0.5)

OGRFeature(test):3
  prop0 (String(JSON)) = { "nested": 75 }
  POINT (102.0 0.5)

Looping through all features doesn't scale well.

Another thing I had imagined was for QGIS to map 42 as { "unnamed": 42 } and astring as { "unnamed": "astring" } . Not ideal, but that doesn't involve an initial scan. Would probably require writing paths to strip that artificial key

@elpaso
Copy link
Contributor

elpaso commented Feb 24, 2023

@rouault please correct me if I'm wrong: GDAL knows that there is a mixed (or a complex, meaning a json object) type because it scanned the file during first pass (it isn't clear to me if GDAL by default ingests the first 100 rows or if it scans the whole stream), if the property does NOT contains simple values (numeric or strings or bool) all of the same type then OFTJSON is set and the type is a string.

QGIS at this point knows that there is a mixed (or complex) type if OFTJSON is set, there is no guarantee though that GDAL scanned the whole file so we might end up with a wrong type if the mixed (or complex) type appears later in the stream.

Now, QGIS wrongly assumes that any OFTJSON field is a simple kvp dictionary and this is probably the biggest issue, introduced to handle a very particular and limited use case.

The sad truth is that in this scenario, a field value could be anything from a null to a complex JSON object, the only type that could handle all of that is a QVariant UserType and QJsonValue.

I've been thinking at the possible solutions on QGIS side (because GDAL is doing nothing wrong here) but I'm a bit in the dark:

  1. set the QgsField type as generic QVariant UserType is not possible because it basically means unknown type and all QGIS code assumes everywhere that a field has a type, it would also break all current JSON field widget implementations that assumes JSON when field type is QVariant::Map
  2. set the field type as String seems the obvious way but it means to revert the kwp QVariantMap implementation leaving us with plain text editing widgets for that use case
  3. even if we invent a new field flag the tells QGIS when a QgsField of type String contains JSON we still would have problems handling form widgets when a field value is null or we are adding a new feature: we should implement complex logic to handle the different types and structures that JSON allows.
  4. fixing this issue at this point during the bugfixing effort it seems a bit out of scope given the amount of work that could be involved

Any thought?

CC @nyalldawson @m-kuhn @andreasneumann , I'd like to hear your opinion on this topic.

@joooeey
Copy link
Author

joooeey commented Feb 24, 2023

Why a new field type would be useful

From my narrow user perspective, I think it's best to take the leap and extend QGIS' object model with a new field type that's a union of all existing field types. My use case is the following:

I have a set of sensor measurements that a sensor delivers in hexadecimal coding. I pre-process these into JSON where the result can be a JSON number representing the sensor reading or a JSON string representing one of several error codes (e.g. "ReadError", "OutOfRange", "TransmissionError", etc.). These are then assigned to geographic points (because the sensor moves). We chose GeoJSON as interface to communicate intermediate results with consortium partners involved in the development effort.

For development, it would be helpful if this data can be visualized in QGIS. Example:

I load one of my GeoJSONs into QGIS. One column contains JSON numbers and JSON strings. Ideally, I'd like to be able to get graduated layer styling of the numbers, with strings mapped to one or more different colors. With the older version of GDAL I can at least get categorical layer styling. With the newer version I can't because the whole column is converted to NULLs.

The styling example serves to illustrate what a major development effort it would be to add a new catch-all field. This brings me a bit of-trail: Do nested JSON objects currently work well in QGIS? As far as I understand it, QVariantMap itself should have no problem with it.

The case for sticking with string

I would caution against rushing to implement complex logic to fix this bug and just default to string for now. My hunch is to make an exception for GeoPackage to maintain backwards compatibility but I'm not familiar with GeoPackage at all. Specifically I don't know if GDAL only outputs JSON objects when it says the type is OFTJSON (coming from GeoPackage). For backwards compatibility, one would also have to take a look at other formats where GDAL outputs OFTJSON.

because GDAL is doing nothing wrong here

This is not true. As I said in the lead, GDAL just defaults to string if the first value is a string. GDAL still has to catch up on doing things right. So sticking with string would give more consistent results. Coming back to my data examples: The current implementation gives a string column if the data looks like "ReadError", 1.2, 3.4" but a map column if the data looks like 1.2, "ReadError", 3.4 (You can try switching the sequence in the example given at the very top). Hence, output from the same sensor and the same processing chain is sometimes lost in QGIS (all NULLs) and sometimes readable, only depending on whether the data stream starts with an "error" value.

I think it makes sense to revert to conversion to string quickly. Here's why: The change in GDAL from January 5 (release date of GDAL 3.6.2, see my first comment for the link) broke backwards compatibility for handling GeoJSON in QGIS. If we limit the time in which QGIS converts JSON objects (7 weeks now), few people will start relying on it.

@elpaso
Copy link
Contributor

elpaso commented Feb 24, 2023

This is not true. As I said in the lead, GDAL just defaults to string if the first value is a string. GDAL still has to catch up on doing things right. So sticking with string would give more consistent results. Coming back to my data examples: The current implementation gives a string column if the data looks like "ReadError", 1.2, 3.4" but a map column if the data looks like 1.2, "ReadError", 3.4 (You can try switching the sequence in the example given at the very top).

You are right, I've filed a ticket OSGeo/gdal#7313

@elpaso
Copy link
Contributor

elpaso commented Feb 24, 2023

Ok, there may be another solution for the QGIS side of the problem: GDAL reports the type name as "JSON" which gives QGIS a possibility to use a proper widget while keeping the field internal type as a string.

elpaso added a commit to elpaso/QGIS that referenced this issue Feb 24, 2023
elpaso added a commit to elpaso/QGIS that referenced this issue Feb 24, 2023
elpaso added a commit to elpaso/QGIS that referenced this issue Feb 24, 2023
elpaso added a commit to elpaso/QGIS that referenced this issue Feb 24, 2023
@rouault
Copy link
Contributor

rouault commented Feb 24, 2023

because it scanned the file during first pass (it isn't clear to me if GDAL by default ingests the first 100 rows or if it scans the whole stream),

it scans the whole file, both in the mode where it fully json decode it, or it uses a streaming parser, at least as far as I remember (the GeoJSON driver is, non intuitively, more complex than one would expect)

@elpaso
Copy link
Contributor

elpaso commented Feb 28, 2023

I'm giving up for now, sorry.

I have looked extensively to possible solutions but the more I dig into QGIS code the more I see that all people who worked to improve JSON support in QGIS assumed that the only JSON representation QGIS supports is JSON dictionaries (and to a lesser extent arrays), no literals and absolutely no mixed types.

The solution to me is to set the field type to string, flag it as JSON using the typeName and set the default editor for JSON columns to plain text line edit and convert whatever is entered in the text edit box to JSON.

But, this means that a lot of core QGIS classes need to be changed to handle JSON literals and not only JSON maps ad arrays. Additionally, some providers (like postgres) can handle JSON natively making things more complicated.

Probably the most impacting change code-wise following this approach will be that an attribute value for a string field "marked" as JSON could be of any QVariant type which can be represented in JSON (literals, arrays, objects etc.), converting back and forth to strings when displaying and reading/storing from the data source.

Anyway, this is not a trivial change, it could impact existing workflows and it is way out of budget for a pre-release bugfixing effort.

Maybe this could be handled by a grant proposal, I'd be happy to submit one if we all agree that we want to support literals and mixed types stored in JSON.

A preliminary partial implementation can be found here: https://github.com/elpaso/QGIS/tree/bugfix-gh51911-json-mixed-type-2 there is still a problem with GDAL turning all properties into strings when syncing to disk but that's probably a GDAL issue.

@elpaso elpaso removed their assignment Feb 28, 2023
@Djedouas
Copy link
Member

Djedouas commented Mar 5, 2024

Hi,

I came across this issue, and I think this could be related to what I'm working on.

With an OGC API - Features stream, fields are loaded as String on linux, and JSON on Windows.

One of the JSON fields is all NULL on Windows, but I can see on linux that there seems to be no mixed-dtype. It is a JSON array type I guess.

My question is the following: is it related to this issue here, or is it something totally different?

Thanks.

Linux

image

Windows

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers Regression Something which used to work, but doesn't anymore
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants