WFS / GML parse issue, but QGIS loads GML as file fine? #45017

Closed
2 tasks done
rduivenvoorde opened this issue Sep 10, 2021 · 4 comments
Labels
Bug Either a bug report, or a bug fix. Let's hope for the latter! Feedback Waiting on the submitter for answers WFS data provider

Comments

@rduivenvoorde
Contributor

rduivenvoorde commented Sep 10, 2021

What is the bug or the crash?

I have a GeoServer WFS, and QGIS fails to show its features.
BUT: if I replay the same request via curl and download the GML, QGIS loads that file fine.

QGIS requests 1 feature several times, BUT each time reports that the response is not 'well-formed':
Retrying request https://myserver/wfs?SERVICE=WFS&REQUEST=GetFeature&VERSION=2.0.0&TYPENAMES=regelink:polygons&COUNT=1&SRSNAME=urn:ogc:def:crs:EPSG::28992: 3/3
2021-09-10T13:28:36     WARNING    Error when parsing GetFeature response : Error: not well-formed (invalid token) on line 1, column 3809
If I use curl to retrieve it, QGIS loads it fine.
Position 3809 falls exactly on the colon (":") in the following string: regelink:220_1_hsi
See
b.zip (one feature)
and
c.zip (more features)

Note: the attribute names in this data start with a digit (they come from a PostGIS db) <= I'm aware this causes trouble
Note 2: not sure whether k:220 could be read as some UTF code or so?

Steps to reproduce the issue

  • have a GeoServer at hand (I downloaded 19.2 just now)
  • create a store from this geopackage polygons.zip
    these are 1200 parcels in the Netherlands (EPSG:28992) with attribute names starting with a digit
  • publish the layer 'polygons'
  • open it as a WFS layer in QGIS
  • see that QGIS is requesting 1 feature, failing to parse the response and retrying:
Retrying request http://localhost/geoserver/wfs?SERVICE=WFS&REQUEST=GetFeature&VERSION=2.0.0&TYPENAMES=test:polygons&STARTINDEX=0&COUNT=1000000&SRSNAME=urn:ogc:def:crs:EPSG::28992&BBOX=199565.33582072978606448,504735.82987997209420428,200661.82303728291299194,505453.42622588382801041,urn:ogc:def:crs:EPSG::28992: 3/3
2021-09-10T14:01:47     WARNING    Error when parsing GetFeature response : Error: not well-formed (invalid token) on line 1, column 3551

Versions

3.16 -> master

QGIS version 3.21.0-Master QGIS code revision 4e0d0f6
Qt version 5.15.2
Python version 3.9.7
GDAL/OGR version 3.2.2
PROJ version 7.2.1
EPSG Registry database version v10.008 (2020-12-16)
GEOS version 3.9.1-CAPI-1.14.2
SQLite version 3.36.0
PostgreSQL client version 13.4 (Debian 13.4-3)
SpatiaLite version 5.0.1
QWT version 6.1.4
QScintilla2 version 2.11.6
OS version Debian GNU/Linux bookworm/sid
       
This copy of QGIS writes debugging output.
       
Active Python plugins QuickWKT, nominatim_locator_filter, NITK_RS-GIS_17, pdokservicesplugin, plugin_reloader, QuickOSM, HelloWorldPlugin, HCMGIS, GeoCoding, simplesvg, orientation, sagaprovider, processing, grassprovider

Supported QGIS version

  • I'm running a supported QGIS version according to the roadmap.

New profile

  • I tried with a new QGIS profile

Additional context

No response

@rduivenvoorde rduivenvoorde added the Bug Either a bug report, or a bug fix. Let's hope for the latter! label Sep 10, 2021
@rduivenvoorde
Contributor Author

Trying to debug this myself, I set a breakpoint in the parser where the error is returned
(qgsgml.cpp line 454)...

I grabbed the output ( output.txt ).
The error points to position 3555, but could it have something to do with the underscore '_' (0x5f) in the attribute column names?

		[3553]	't' 	116    	0x74	char
		[3554]	':' 	58    	0x3a	char
		[3555]	'2' 	50    	0x32	char
		[3556]	'2' 	50    	0x32	char
		[3557]	'0' 	48    	0x30	char
		[3558]	'_' 	95    	0x5f	char
		[3559]	'1' 	49    	0x31	char
		[3560]	'_' 	95    	0x5f	char
		[3561]	'h' 	104    	0x68	char
		[3562]	's' 	115    	0x73	char
		[3563]	'i' 	105    	0x69	char
		[3564]	'>' 	62    	0x3e	char

@rduivenvoorde
Contributor Author

rduivenvoorde commented Sep 11, 2021

I was told that the xmltodict module is also Expat-based (if I am correct, the parsing in QGIS is done via the Expat XML library), so I tried:

import xmltodict
with open('./c.gml') as fd:
    doc = xmltodict.parse(fd.read())

but that runs fine, no parse issue?

@rouault
Contributor

rouault commented Sep 13, 2021

This is a GeoServer bug, not a QGIS one. GeoServer should either refuse to expose such a layer directly or rename the attributes whose names start with a digit. The name of an XML element must be a valid QName (https://en.wikipedia.org/wiki/QName), which implies that the unqualified (local) part must not start with a digit.
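
As a rough illustration of the QName rule above, here is a minimal sketch of a pre-flight check for attribute names. It is a deliberately simplified, ASCII-only approximation (the real NCName production also allows many non-ASCII characters), and the function name `is_valid_qname_ascii` is made up for this example; it is not part of QGIS, GeoServer or any XML library.

```python
import re

# Simplified ASCII-only approximation of an NCName: the first character must
# be a letter or underscore; later ones may also be digits, '.', or '-'.
# Assumption: the full XML rule also permits many non-ASCII characters.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def is_valid_qname_ascii(name: str) -> bool:
    """Return True if `name` looks like `local` or `prefix:local`."""
    prefix, sep, local = name.rpartition(":")
    # With a colon present, both the prefix and the local part are checked.
    parts = [prefix, local] if sep else [local]
    return all(_NCNAME_ASCII.match(part) is not None for part in parts)


for candidate in ("regelink:identifica", "regelink:220_1_hsi", "regelink:_220_1_hsi"):
    print(candidate, is_valid_qname_ascii(candidate))
```

Under this check, `regelink:220_1_hsi` is rejected because its local part starts with a digit, while a renamed attribute such as `_220_1_hsi` would pass.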

libxml2 rejects b.gml:

$ xmllint --noout b.gml
b.gml:1: namespace error : Failed to parse QName 'regelink:'
link:id><regelink:identifica>1.93100000001959E14</regelink:identifica><regelink:

So does the OGR GML driver when forced to use Xerces-C:

$ GML_PARSER=XERCES ogrinfo b.gml -al -q

Layer name: polygons
ERROR 1: XML Parsing Error: invalid element name 'regelink:' at line 1, column 3810

Similarly if using the OGR GMLAS driver:

$ ogrinfo GMLAS:b.gml
ERROR 1: /vsicurl_streaming/https://geoserver-regelink.webgispublisher.nl/wfs?service=WFS&version=2.0.0&request=DescribeFeatureType&typeName=regelink%3Apolygons:12:104 invalid element name '220_1_hsi'

Here Xerces-C rejects the DescribeFeatureType response directly (the GMLAS driver is fully schema-aware).

So your question is a good one: why does the OGR GML driver, in its default Expat mode, accept the file, while QGIS's QgsGmlStreamingParser, which also uses Expat, emits a "not well-formed (invalid token)" error?
I found that the difference is that the OGR GML driver uses Expat in namespace-unaware mode (namespaces of XML elements are discarded by the parser), whereas QGIS uses it in namespace-aware mode.

And that can be easily seen when using the Python Expat bindings:

$ python
>>> import xml.parsers.expat
>>> parser = xml.parsers.expat.ParserCreate()
>>> parser.ParseFile(open('b.gml', 'rb'))
1
>>> parser = xml.parsers.expat.ParserCreate(namespace_separator='?')
>>> parser.ParseFile(open('b.gml', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3809
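
For anyone without the attached files, the same contrast can probably be reproduced with a tiny inline document. This is only a sketch: the `DOC` bytes below are a made-up stand-in for b.gml (an element whose local name starts with a digit), and `parses` is a hypothetical helper name.

```python
import xml.parsers.expat

# Hypothetical stand-in for b.gml: the local name "1a" starts with a digit,
# like "220_1_hsi" in the real response.
DOC = b'<r:root xmlns:r="urn:x"><r:1a>v</r:1a></r:root>'


def parses(data: bytes, namespace_aware: bool) -> bool:
    """Return True if Expat accepts `data` in the requested mode."""
    # Passing a namespace separator switches Expat to namespace-aware mode;
    # None keeps the default namespace-unaware mode.
    separator = " " if namespace_aware else None
    parser = xml.parsers.expat.ParserCreate(namespace_separator=separator)
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False


print("namespace-unaware:", parses(DOC, namespace_aware=False))
print("namespace-aware:  ", parses(DOC, namespace_aware=True))
```

A fresh parser is created for each call because an Expat parser object cannot be reused after it has finished or failed.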

I'd say that Expat not rejecting the file in namespace-unaware mode could be considered a bug (not sure whether it is intended; perhaps running in that mode means that people expect laxer checks...)

I don't think we should try to do something on the QGIS side about this. If we wanted to, that would mean switching the parsing to namespace-unaware mode, which could add potential fragility.

@rouault rouault removed the Bug Either a bug report, or a bug fix. Let's hope for the latter! label Sep 13, 2021
@gioman gioman added Feedback Waiting on the submitter for answers Bug Either a bug report, or a bug fix. Let's hope for the latter! labels Sep 13, 2021
@rduivenvoorde
Contributor Author

rduivenvoorde commented Sep 14, 2021

Thanks @rouault for your research and explanation.

I created an issue at geoserver: https://osgeo-org.atlassian.net/jira/software/c/projects/GEOS/issues/GEOS-10231

The fact that QGIS parses the same output in different ways: I'm not really happy with that, it's not very consistent. BUT the current behaviour at least makes QGIS a little forgiving (in the case of the file, at least)...

But I wonder whether QGIS could give some more useful info to the average user. A lot of people are not aware of the Log Messages panel, or are just not able to check it.

The parser's warnings actually point to ':' or 'regelink:', which are fine... it is the next characters that are the problem, and that tricked me too.
Would it help if we showed the text around (in the above example) column 3809 in the error message? And maybe suggested some 'common' XML errors: ... uh... like: mwa, never mind.
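
To make the "show the text around the column" idea concrete, here is a rough Python sketch using the Expat bindings. It is not how QGIS builds its messages; `explain_parse_error` and its `radius` parameter are hypothetical names, and it assumes the reported offset can be used as an index into the raw bytes of the offending line (true for ASCII input, not checked for multi-byte encodings).

```python
import xml.parsers.expat


def explain_parse_error(data: bytes, radius: int = 20):
    """Return the Expat error text plus nearby input, or None if data parses."""
    # A namespace separator enables namespace-aware parsing.
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError as err:
        lines = data.splitlines()
        # err.lineno is 1-based; guard against an error past the last line.
        line = lines[err.lineno - 1] if err.lineno <= len(lines) else b""
        start = max(0, err.offset - radius)
        snippet = line[start:err.offset + radius].decode("utf-8", errors="replace")
        return f"{err} near: {snippet!r}"
    return None


# Made-up sample with a local name that starts with a digit.
sample = b'<r:root xmlns:r="urn:x"><r:1a>v</r:1a></r:root>'
print(explain_parse_error(sample))
```

Showing a window of text on both sides of the reported position means the offending name is visible even when the column itself lands on a harmless-looking character such as the colon.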

Should I close this one?


3 participants