Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type check to file writer and memory layer #37039

Conversation

elpaso
Copy link
Contributor

@elpaso elpaso commented Jun 8, 2020

Fixes #36715

Started as a simple fix and down into the rabbit hole of provider inconsistencies ...

This monster PR is now about making the memory provider behave like (some? many? who knows?) other providers within respect to type checking and commit changes.

  1. add type checking (relaxed) to prevent incompatible values being stored (and possibly being lost when converting to a real storage), i.e. the original issue
  2. implement a roll-back in the memory provider addFeatures and changeAttributeValues so that when using the editing buffer an error in commitChanges will leave the storage unchanged as advertised by the method signature
  3. various smaller patches and fixes that have been brought into light after the type checking

From the commit message:

Long story short: calling provider's addFeatures
is implemented for some providers in a way that
will roll back all changes on errors, leaving
the backend storage unchanged.

Adding a QgsFeatureSink flag to control this
behavior allows certain providers to support
partial feature addition.

The issue comes from QgsVectorDataProvider::commitChanges
that is documented to leave the provider unchanged (roll
back) on any error, giving the client code the possibility
to fix errors (in the editing buffer) and re-commit.

Without a full rollback implementation in the memory
provider and after the type check introduction in this
PR we ended up with situations like this:

vl = ... an empty memory layer
self.assertTrue(vl.addFeatures([valid, invalid]))
self.assertFalse(vl.commitChanges())
self.assertEqual(vl.featureCount(), 1) <--- fails!
We actually had 3 features from vl.getFeatures():
[valid, invalid, valid] (the first from the provider
the second and third from the editing buffer).

On the other hand, QgsFeatureSink would probably assume
that addFeatures will allow partial additions.

@elpaso elpaso added Data Provider Related to specific vector, raster or mesh data providers Bug Either a bug report, or a bug fix. Let's hope for the latter! labels Jun 8, 2020
@github-actions github-actions bot added this to the 3.14.0 milestone Jun 8, 2020
@m-kuhn
Copy link
Member

m-kuhn commented Jun 8, 2020

Before continuing the review, do we really want this? What do we win?

Besides the memory provider, there are also sqlite and likely other providers that support generic values. I know that "we" GIS people normally tend to think in well defined data structures, but I can imagine that we are preventing legit workflows.

@elpaso
Copy link
Contributor Author

elpaso commented Jun 8, 2020

Before continuing the review, do we really want this?

@m-kuhn legit question that I've asked myself as well and my answer is yes.

What I'm really fixing here is the use case where a feature containing an attribute value that is not compatible is silently saved into a destination datasource without any error or warning, this leads to data loss and/or corruption and it is something that cannot be accepted in any serious data processing workflow.

What do we win?

That is the wrong question: I would ask "what do we loose if we don't do that?"

What I'm really fixing here is the use case where a feature containing an attribute value that is not compatible with a destination datasource is silently discarded or - worse - saved with a wrong value without any error or warning, this leads to data loss and/or corruption and it is something that cannot be accepted in any serious data processing workflow.

Besides the memory provider, there are also sqlite and likely other providers that support generic values. I know that "we" GIS people normally tend to think in well defined data structures, but I can imagine that we are preventing legit workflows.

I'm not trying to prevent saving (say) a datetime into a string (this is ok and tested here: https://github.com/qgis/QGIS/pull/37039/files#diff-947ce79ae5af26c9e67b3690e7106485R719), I'm dealing with impossible conversions that provoke data loss or corruption (for example a float with fractional parts to a integer).

In any event, this should not normally happen: we are dealing with misconfigurations or wrong programming from plugins.

@m-kuhn
Copy link
Member

m-kuhn commented Jun 8, 2020

I fully agree on the problem and that the writer should transparently raise warnings/errors in such situations.

That shouldn't affect the memory layer code though?

@elpaso
Copy link
Contributor Author

elpaso commented Jun 8, 2020

I fully agree on the problem and that the writer should transparently raise warnings/errors in such situations.

That shouldn't affect the memory layer code though?

Correct: I'm fixing two separate issues here:

  1. the memory provider does not raise/throw and error on commit when the value is not compatible
  2. vector file writer does not raise/throw and error when writing an incompatile/wrong value

they are related but not the same:

  1. make the memory provider behave like (most) other providers (i.e. for example postgis) because when you try to commit and incompatible type they usually choke and abort the whole transaction, I agree sqlite-based are more tolerant than others but it does not really matter if the conversion cannot be performed.
  2. make sure the vector file writer raises/throws an error when it cannot convert an attribute value without loosing precision (or worse: loosing ALL the information and write a completely wrong value - see the overflow test case)

@m-kuhn
Copy link
Member

m-kuhn commented Jun 8, 2020

Is that the right level to solve this problem though?

  • This will put a hard stop on any usage other than on a "fixed data model"
  • This could result in quite some overhead (many conversions for each attribute of each feature). I think individual providers are in a much better situation to know what they actually do support.

@elpaso
Copy link
Contributor Author

elpaso commented Jun 8, 2020

Is that the right level to solve this problem though?

I don't know, any better idea?

referring to 1. and 2. above:

* This will put a hard stop on any usage other than on a "fixed data model"
  1. I don't see why... this doesn't change the situation for uncommitted features, you can still store a QVariant.Date into a QVariant.Bool field if you like (for any provider), the only thing that has changed is for memory provider that now checks for types compatibility as all other providers do (directly or indirectly relying on the back-end), note that there even was a TODO: https://github.com/qgis/QGIS/pull/37039/files#diff-d0d7b2822bcecd769c9255be2b059e14L384

Is your intention to use the memory provider as a generic QVariant loose typed storage?

* This could result in quite some overhead (many conversions for each attribute of each feature). I think individual providers are in a much better situation to know what they actually do support.
  1. QgsVectorFileWriter uses OGR/GDAL ... so it's one provider only and from the failing tests before this patch I would not say that it is was in any better situation.

For the overhead, yes, that's the price we pay for robustness.

@m-kuhn
Copy link
Member

m-kuhn commented Jun 8, 2020

  1. Memory layer type safety

Is your intention to use the memory provider as a generic QVariant loose typed storage?

Not at all right now, however ...

I can imagine that we are preventing legit workflows

... so, what do we win with this?

  1. type check in vector file writer

For the overhead, yes, that's the price we pay for robustness.?

I don't know, any better idea?

Stick to what we have right now and let individual providers fix up things, one by one?
Or at least make it optional?

I might not have the complete picture of the situation, but it looks like it's solving an edge case with a sledgehammer.

tests/src/python/test_qgsvectorlayerutils.py Outdated Show resolved Hide resolved
tests/src/python/test_qgsvectorlayerutils.py Outdated Show resolved Hide resolved
tests/src/python/test_qgsvectorlayerutils.py Outdated Show resolved Hide resolved
tests/src/python/test_qgsvectorlayerutils.py Outdated Show resolved Hide resolved
tests/src/python/test_qgsvectorlayerutils.py Outdated Show resolved Hide resolved
tests/src/python/test_qgsvectorlayerutils.py Outdated Show resolved Hide resolved
@nyalldawson
Copy link
Collaborator

In general I agree with @elpaso's thoughts that we should not allow storage of e.g. a completely non-numeric text string "abc" in a numeric field. I disagree about blocking double storage in int fields though (or similarly, datetime storage in date fields, date storage in datetime fields, etc). I think there's legitimate times we want to allow this (not least of which would be that using the field calculator should not force users to wrap an expression like "my_field" / 2 in a to_int function)

TBH - it seems like a big driver for this is making sure that 3rd party plugins do things correctly. And I wonder if we couldn't better achieve the same results in a more developer-friendly way by throwing AttributeError exceptions at the time values are being stored in a feature.

E.g.

   fields = QgsFields()
   fields.append(QgsField('int_field', QVariant.Int))

   feature = QgsFeature(fields)
   # should raise an AttributeError exception, ,e.g. "AttributeError: the specified value 'abc' is not compatible with the field type (QVariant.Int)"
   feature['int_field'] = 'abc'

   fields = QgsFields()
   fields.append(QgsField('str_field', QVariant.String))
   fields.append(QgsField('int_field', QVariant.Int))
   
   attributes = ['abc', 'def']
   # should raise an AttributeError exception: "AttributeError: the specified value 'def' is not compatible with the field type for 'int_field' (QVariant.Int)"
   features.setAttributes(attributes)

This would ultimately allow these issues to be caught right at the location of the error, making things much easier for PyQGIS devs to debug and fix.

(On the same note, for 4.0 we should consider raising exceptions instead of returning False to addFeatures if the addFeatures operation fails!)

@elpaso
Copy link
Contributor Author

elpaso commented Jun 9, 2020

In general I agree with @elpaso's thoughts that we should not allow storage of e.g. a completely non-numeric text string "abc" in a numeric field. I disagree about blocking double storage in int fields though (or similarly, datetime storage in date fields, date storage in datetime fields, etc). I think there's legitimate times we want to allow this (not least of which would be that using the field calculator should not force users to wrap an expression like "my_field" / 2 in a to_int function)

TBH - it seems like a big driver for this is making sure that 3rd party plugins do things correctly. And I wonder if we couldn't better achieve the same results in a more developer-friendly way by throwing AttributeError exceptions at the time values are being stored in a feature.

E.g.

   fields = QgsFields()
   fields.append(QgsField('int_field', QVariant.Int))

   feature = QgsFeature(fields)
   # should raise an AttributeError exception, ,e.g. "AttributeError: the specified value 'abc' is not compatible with the field type (QVariant.Int)"
   feature['int_field'] = 'abc'

   fields = QgsFields()
   fields.append(QgsField('str_field', QVariant.String))
   fields.append(QgsField('int_field', QVariant.Int))
   
   attributes = ['abc', 'def']
   # should raise an AttributeError exception: "AttributeError: the specified value 'def' is not compatible with the field type for 'int_field' (QVariant.Int)"
   features.setAttributes(attributes)

This would ultimately allow these issues to be caught right at the location of the error, making things much easier for PyQGIS devs to debug and fix.

(On the same note, for 4.0 we should consider raising exceptions instead of returning False to addFeatures if the addFeatures operation fails!)

Big +1 for exception raising, if only SIP was more friendly with exception handling :\ let's see how shiboken will deal with that.

@elpaso
Copy link
Contributor Author

elpaso commented Jun 9, 2020

@alexbruy would you please have a look to the failing processing tests? I checked one of the issues (QgsServiceAreaFromLayerAlgorithm) with a debugger and it seems that the new type checks for memory layer that I'm adding in this PR are bringing into the light some underlying issues in the algorithm, I'm not very familiar with that particular algorithm but I think that in case of includeBounds is adds the wrong number of attributes for the "upper" feature.

@elpaso
Copy link
Contributor Author

elpaso commented Jun 9, 2020

@alexbruy it Travis is drunk as usual at this time of the day, you can have a look to this previous build: https://travis-ci.org/github/qgis/QGIS/builds/696353705#L1238

@alexbruy
Copy link
Contributor

alexbruy commented Jun 9, 2020

@elpaso fixed in #37074.

@elpaso
Copy link
Contributor Author

elpaso commented Jun 10, 2020

@alexbruy thank you!

@elpaso elpaso force-pushed the bugfix-gh36715-field-validation-file-writer-and-memory branch from cb8aebc to fc44318 Compare June 10, 2020 07:55
@elpaso
Copy link
Contributor Author

elpaso commented Jun 10, 2020

@alexbruy there are still some failures but they seems just rounding errors, mind having a look?
https://travis-ci.org/github/qgis/QGIS/builds/696752836

If they are rounding errors what would you suggest to do?

Is there a way to make the test more tolerant to rounding errors or should I update the reference files (in that case I would appreciate some instructions about how to do that for processing tests) ?

@elpaso
Copy link
Contributor Author

elpaso commented Jun 10, 2020

@nyalldawson @m-kuhn: ok, no sledgehammer, I'm now using the QgsField convertCompatible implementation to check for type compatibility, this means that narrow casting and loss of precision is allowed.

I'm still uncomfortable with a processing toolchain that silently accepts double -> int conversions but as long as it is clearly documented I think we can live with that.

For those use cases that require stricter checking it might be worthwhile to add a global settings to activate a stricter type checking but that's out of scope.

(I'm closing the open comments as resolved because they do not apply anymore).

@alexbruy
Copy link
Contributor

@alexbruy there are still some failures but they seems just rounding errors, mind having a look?
https://travis-ci.org/github/qgis/QGIS/builds/696752836

Oh, in my setup local test runs without errors. Seems these are rounding errors probably related to some differences in the Travis and local setup. Tried to reduce precision a bit in #37097, let's see if it helps

src/core/providers/memory/qgsmemoryprovider.cpp Outdated Show resolved Hide resolved
src/core/qgsvectorfilewriter.cpp Outdated Show resolved Hide resolved
src/core/providers/memory/qgsmemoryprovider.cpp Outdated Show resolved Hide resolved
@alexbruy
Copy link
Contributor

@elpaso test fixed with #37097.

@elpaso elpaso force-pushed the bugfix-gh36715-field-validation-file-writer-and-memory branch from 4446e88 to 0b5ffb0 Compare June 11, 2020 10:35
@elpaso
Copy link
Contributor Author

elpaso commented Jun 12, 2020

@alexbruy here is another bunch of rounding errors, can you please have a look if they are real issues or the test can be adapted?

https://travis-ci.org/github/qgis/QGIS/builds/697190572

@elpaso
Copy link
Contributor Author

elpaso commented Jun 15, 2020

@alexbruy nevermind, I'm fixing it myself.

@alexbruy
Copy link
Contributor

@elpaso sorry, was away for several days and missed all emails/notifications.

@elpaso
Copy link
Contributor Author

elpaso commented Jun 15, 2020

@alexbruy no problem! Your first patch pointed me to the right direction!

elpaso and others added 14 commits June 17, 2020 12:25
Fixes qgis#36715

Adds a method to check for QVariant conversions, also
check for integral type narrowing so that for example
floating point 123.45 does not get down casted to integer
without raising an error.
Co-authored-by: Matthias Kuhn <matthias@opengis.ch>
Long story short: calling provider's addFeatures
is implemented for some providers in a way that
will roll back all changes on errors, leaving
the backend storage unchanged.

Adding a QgsFeatureSink flag to control this
behavior allows certain providers to support
partial feature addition.

The issue comes from QgsVectorDataProvider::commitChanges
that is documented to leave the provider unchanged (roll
back) on any error, giving the client code the possibility
to fix errors (in the editing buffer) and re-commit.

Without a full rollback implementation in the memory
provider and after the type check introduction in this
PR we ended up with situations like this:

vl = ... an empty memory layer
self.assertTrue(vl.addFeatures([valid, invalid]))
self.assertFalse(vl.commitChanges())
self.assertEqual(vl.featureCount(), 1)  <--- fails!
We actually had 3 features from vl.getFeatures():
[valid, invalid, valid] (the first from the provider
the second and third from the editing buffer).

On the other hand, QgsFeatureSink would probably assume
that addFeatures will allow partial additions.

BTW: This is for sure the longest commit message I've ever
     written.
@elpaso elpaso force-pushed the bugfix-gh36715-field-validation-file-writer-and-memory branch from c0c35a5 to a733f07 Compare June 17, 2020 11:44
@elpaso
Copy link
Contributor Author

elpaso commented Jun 17, 2020

@m-kuhn @nyalldawson are you cool to merge now or we'd better wait?

@nyalldawson
Copy link
Collaborator

Let's delay -- it's not a regression, and release is only a couple of days away..

@nyalldawson nyalldawson added the NOT FOR MERGE Don't merge! label Jun 17, 2020
@nyalldawson nyalldawson merged commit 672d6ec into qgis:master Jun 19, 2020
@nyalldawson nyalldawson removed the NOT FOR MERGE Don't merge! label Jun 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

It is possible to store text into a numeric field, while it shouldn't be, using the Date/Time edit widget
4 participants