Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CSV: numeric header gets a trailing underscore + 1 #27465

Closed
qgib opened this issue Aug 17, 2018 · 13 comments
Closed

CSV: numeric header gets a trailing underscore + 1 #27465

qgib opened this issue Aug 17, 2018 · 13 comments
Labels
Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers

Comments

@qgib
Copy link
Contributor

qgib commented Aug 17, 2018

Author Name: Tobias Wendorff (Tobias Wendorff)
Original Redmine Issue: 19638
Affected QGIS version: 3.2.1
Redmine category:data_provider/delimited_text_


When loading a CSV / delimited text with a header like this "id;1;2;3;4", the numeric header gets a trailing underscore and a "1" => "id;1_1;2_1;3_1;4_1". So the fieldnames are broken.
Expected behavior: "id;1;2;3;4"

All versions >= 3.1 are affected.

@qgib
Copy link
Contributor Author

qgib commented Aug 19, 2018

Author Name: Giovanni Manghi (@gioman)


On 2.18/LTR the same happens?


  • status_id was changed from Open to Feedback
  • crashes_corrupts_data was changed from 1 to 0
  • regression was changed from 1 to 0

@qgib
Copy link
Contributor Author

qgib commented Aug 19, 2018

Author Name: Tobias Wendorff (Tobias Wendorff)


Giovanni Manghi wrote:

On 2.18/LTR the same happens?

I don't know, I'm not using 2.x any longer. But In my eyes, a change from "1" to "1_1" is data corruption. Data shouldn't be modified on import, especially not data coming from plaintext.

@qgib
Copy link
Contributor Author

qgib commented Aug 19, 2018

Author Name: Giovanni Manghi (@gioman)


Tobias Wendorff wrote:

Giovanni Manghi wrote:

On 2.18/LTR the same happens?

I don't know, I'm not using 2.x any longer. But In my eyes, a change from "1" to "1_1" is data corruption. Data shouldn't be modified on import, especially not data coming from plaintext.

is not data corruption because your original dataset is not corrupted. If you want this to be tagged as "regression" you should check if on 2.18 works as expected.


  • status_id was changed from Feedback to Open

@qgib
Copy link
Contributor Author

qgib commented Aug 19, 2018

Author Name: Tobias Wendorff (Tobias Wendorff)


Giovanni Manghi wrote:

is not data corruption because your original dataset is not corrupted.

The header is part of the dataset. If you export it back to CSV, you'll get "1_1" instead of "1". Please don't modify ANY data if not needed.

If you want this to be tagged as "regression" you should check if on 2.18 works as expected.

I thought "regression" would only apply to current main version: 3.0, 3.1, 3.2 etc. Sorry for that.

@qgib
Copy link
Contributor Author

qgib commented Aug 20, 2018

Author Name: Giovanni Manghi (@gioman)


Tobias Wendorff wrote:

Giovanni Manghi wrote:

is not data corruption because your original dataset is not corrupted.

The header is part of the dataset. If you export it back to CSV, you'll get "1_1" instead of "1". Please don't modify ANY data if not needed.

still the original dataset is not modified/lost. For data corruption we usually mean unrecoverable data loss. Here the issue in different: a wrong result from the manipulation of a dataset (that is unchanged).

I thought "regression" would only apply to current main version: 3.0, 3.1, 3.2 etc. Sorry for that.

Regression is anything that stopped to work and that used to work in any previous release. LTR releases are usually used as reference for comparison (at lease is what I do).

@qgib
Copy link
Contributor Author

qgib commented Aug 20, 2018

Author Name: Andrea Giudiceandrea (@agiudiceandrea)


I've tested the issue (a csv with the header "id;1;2;3;4") with the following QGIS version:

  • 2.18.19 behaves as reported by Tobias Wendorff for 3.2.1: imported layer field names "id;1_1;2_1;3_1;4_1"
  • 1.9.0 alpha (bb0b978) and 1.8.0 behave as Tobias think it should be: imported layer field names "id;1;2;3;4"

If the csv header is e.g. "id;2;3;4;5" the issue does not occur.

I suppose this issue/feature was introduced with 5e4f4f7 by Chris Crook and is present since QGIS 2.0

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L37-L38

// field_ is optional in following regexp to simplify QgsDelimitedTextFile::fieldNumber()
, mDefaultFieldRegexp( "^(?:field_)?(\d+)$", Qt::CaseInsensitive )

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L418-L424

// If the name looks like a default field name (field_##), then it is
// valid if the number matches its column number..
else if ( mDefaultFieldRegexp.indexIn( name ) == 0 )
{
  int col = mDefaultFieldRegexp.capturedTexts().at( 1 ).toInt();
  nameOk = col == fieldNo;
}

It seems the issue is caused by the fact that "field_" is optional in mDefaultFieldRegexp, so "mDefaultFieldRegexp.indexIn( name ) == 0" is true if column name began with "field_" OR if it contains numerical digits only. In the latter case, if the number doesn't match its column number, then "_1" is appended as a suffix to the column name.

We can fix the issue letting ""field_" not optional in mDefaultFieldRegexp:

mDefaultFieldRegexp( "^(?:field_)(\d+)$", Qt::CaseInsensitive )

but I don't know if there was a valid reason for not doing it then (in the comment it's written that the reason is "to simplify QgsDelimitedTextFile::fieldNumber()", although I can not find this "fieldNumber()")

See also #21249 that seems not really fixed.

@qgib
Copy link
Contributor Author

qgib commented Aug 20, 2018

Author Name: Tobias Wendorff (Tobias Wendorff)


Giovanni Manghi wrote:

Regression is anything that stopped to work and that used to work in any previous release. LTR releases are usually used as reference for comparison (at lease is what I do).

Thanks! Please add this to a FAQ. I'll install LTR in parallel for the next reports (I've still some on my list).

@qgib
Copy link
Contributor Author

qgib commented Aug 20, 2018

Author Name: Giovanni Manghi (@gioman)


Andrea Giudiceandrea wrote:

I've tested the issue (a csv with the header "id;1;2;3;4") with the following QGIS version:

  • 2.18.19 behaves as reported by Tobias Wendorff for 3.2.1: imported layer field names "id;1_1;2_1;3_1;4_1"
  • 1.9.0 alpha (bb0b978) and 1.8.0 behave as Tobias think it should be: imported layer field names "id;1;2;3;4"

If the csv header is e.g. "id;2;3;4;5" the issue does not occur.

I suppose this issue/feature was introduced with 5e4f4f7 by Chris Crook and is present since QGIS 2.0

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L37-L38

// field_ is optional in following regexp to simplify QgsDelimitedTextFile::fieldNumber()
, mDefaultFieldRegexp( "^(?:field_)?(\d+)$", Qt::CaseInsensitive )

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L418-L424

// If the name looks like a default field name (field_##), then it is
// valid if the number matches its column number..
else if ( mDefaultFieldRegexp.indexIn( name ) == 0 )
{
  int col = mDefaultFieldRegexp.capturedTexts().at( 1 ).toInt();
  nameOk = col == fieldNo;
}

It seems the issue is caused by the fact that "field_" is optional in mDefaultFieldRegexp, so "mDefaultFieldRegexp.indexIn( name ) == 0" is true if column name began with "field_" OR if it contains numerical digits only. In the latter case, if the number doesn't match its column number, then "_1" is appended as a suffix to the column name.

We can fix the issue letting ""field_" not optional in mDefaultFieldRegexp:

mDefaultFieldRegexp( "^(?:field_)(\d+)$", Qt::CaseInsensitive )

but I don't know if there was a valid reason for not doing it then (in the comment it's written that the reason is "to simplify QgsDelimitedTextFile::fieldNumber()", although I can not find this "fieldNumber()")

Hi, thanks for this comments. You should really comment among the lines on Github, this way there is a much bigger chance a developer will notice the comment and read your suggestions. Even better if you could come up with a patch (also submitted on github). Thanks!

@qgib
Copy link
Contributor Author

qgib commented Aug 21, 2018

Author Name: Andrea Giudiceandrea (@agiudiceandrea)


PR submitted #7671

@qgib
Copy link
Contributor Author

qgib commented Aug 21, 2018

Author Name: Giovanni Manghi (@gioman)


Andrea Giudiceandrea wrote:

PR submitted #7671

nice, thanks!


  • pull_request_patch_supplied was changed from 0 to 1

@qgib
Copy link
Contributor Author

qgib commented Aug 27, 2018

Author Name: Andrea Giudiceandrea (@agiudiceandrea)


Applied in changeset 379652d.


  • status_id was changed from Open to Closed
  • done_ratio was changed from 0 to 100

@qgib qgib closed this as completed Aug 27, 2018
@qgib
Copy link
Contributor Author

qgib commented Aug 27, 2018

Author Name: Tobias Wendorff (Tobias Wendorff)


Andrea Giudiceandrea wrote:

Applied in changeset 379652d.

Thanks to all.

@qgib
Copy link
Contributor Author

qgib commented Aug 29, 2018

Author Name: Giovanni Manghi (@gioman)


  • resolution was changed from to fixed/implemented

@qgib qgib added Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers labels May 25, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers
Projects
None yet
Development

No branches or pull requests

1 participant