CSV: numeric header gets a trailing underscore + 1 #27465

qgib · 2018-08-17T06:15:36Z

Author Name: Tobias Wendorff (Tobias Wendorff)
Original Redmine Issue: 19638
Affected QGIS version: 3.2.1
Redmine category:data_provider/delimited_text_

When loading a CSV / delimited text with a header like this "id;1;2;3;4", the numeric header gets a trailing underscore and a "1" => "id;1_1;2_1;3_1;4_1". So the fieldnames are broken.
Expected behavior: "id;1;2;3;4"

All versions >= 3.1 are affected.

qgib · 2018-08-19T09:16:58Z

Author Name: Giovanni Manghi (@gioman)

On 2.18/LTR the same happens?

status_id was changed from Open to Feedback
crashes_corrupts_data was changed from 1 to 0
regression was changed from 1 to 0

qgib · 2018-08-19T09:27:42Z

Author Name: Tobias Wendorff (Tobias Wendorff)

Giovanni Manghi wrote:

On 2.18/LTR the same happens?

I don't know, I'm not using 2.x any longer. But In my eyes, a change from "1" to "1_1" is data corruption. Data shouldn't be modified on import, especially not data coming from plaintext.

qgib · 2018-08-19T09:29:08Z

Author Name: Giovanni Manghi (@gioman)

Tobias Wendorff wrote:

Giovanni Manghi wrote:

On 2.18/LTR the same happens?

I don't know, I'm not using 2.x any longer. But In my eyes, a change from "1" to "1_1" is data corruption. Data shouldn't be modified on import, especially not data coming from plaintext.

is not data corruption because your original dataset is not corrupted. If you want this to be tagged as "regression" you should check if on 2.18 works as expected.

status_id was changed from Feedback to Open

qgib · 2018-08-19T09:34:41Z

Author Name: Tobias Wendorff (Tobias Wendorff)

Giovanni Manghi wrote:

is not data corruption because your original dataset is not corrupted.

The header is part of the dataset. If you export it back to CSV, you'll get "1_1" instead of "1". Please don't modify ANY data if not needed.

If you want this to be tagged as "regression" you should check if on 2.18 works as expected.

I thought "regression" would only apply to current main version: 3.0, 3.1, 3.2 etc. Sorry for that.

qgib · 2018-08-20T09:04:05Z

Author Name: Giovanni Manghi (@gioman)

Tobias Wendorff wrote:

Giovanni Manghi wrote:

is not data corruption because your original dataset is not corrupted.

The header is part of the dataset. If you export it back to CSV, you'll get "1_1" instead of "1". Please don't modify ANY data if not needed.

still the original dataset is not modified/lost. For data corruption we usually mean unrecoverable data loss. Here the issue in different: a wrong result from the manipulation of a dataset (that is unchanged).

I thought "regression" would only apply to current main version: 3.0, 3.1, 3.2 etc. Sorry for that.

Regression is anything that stopped to work and that used to work in any previous release. LTR releases are usually used as reference for comparison (at lease is what I do).

qgib · 2018-08-20T14:02:14Z

Author Name: Andrea Giudiceandrea (@agiudiceandrea)

I've tested the issue (a csv with the header "id;1;2;3;4") with the following QGIS version:

2.18.19 behaves as reported by Tobias Wendorff for 3.2.1: imported layer field names "id;1_1;2_1;3_1;4_1"
1.9.0 alpha (bb0b978) and 1.8.0 behave as Tobias think it should be: imported layer field names "id;1;2;3;4"

If the csv header is e.g. "id;2;3;4;5" the issue does not occur.

I suppose this issue/feature was introduced with 5e4f4f7 by Chris Crook and is present since QGIS 2.0

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L37-L38

// field_ is optional in following regexp to simplify QgsDelimitedTextFile::fieldNumber()
, mDefaultFieldRegexp( "^(?:field_)?(\d+)$", Qt::CaseInsensitive )

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L418-L424

// If the name looks like a default field name (field_##), then it is
// valid if the number matches its column number..
else if ( mDefaultFieldRegexp.indexIn( name ) == 0 )
{
  int col = mDefaultFieldRegexp.capturedTexts().at( 1 ).toInt();
  nameOk = col == fieldNo;
}

It seems the issue is caused by the fact that "field_" is optional in mDefaultFieldRegexp, so "mDefaultFieldRegexp.indexIn( name ) == 0" is true if column name began with "field_" OR if it contains numerical digits only. In the latter case, if the number doesn't match its column number, then "_1" is appended as a suffix to the column name.

We can fix the issue letting ""field_" not optional in mDefaultFieldRegexp:

mDefaultFieldRegexp( "^(?:field_)(\d+)$", Qt::CaseInsensitive )

but I don't know if there was a valid reason for not doing it then (in the comment it's written that the reason is "to simplify QgsDelimitedTextFile::fieldNumber()", although I can not find this "fieldNumber()")

See also #21249 that seems not really fixed.

qgib · 2018-08-20T14:47:36Z

Author Name: Tobias Wendorff (Tobias Wendorff)

Giovanni Manghi wrote:

Regression is anything that stopped to work and that used to work in any previous release. LTR releases are usually used as reference for comparison (at lease is what I do).

Thanks! Please add this to a FAQ. I'll install LTR in parallel for the next reports (I've still some on my list).

qgib · 2018-08-20T16:31:48Z

Author Name: Giovanni Manghi (@gioman)

Andrea Giudiceandrea wrote:

I've tested the issue (a csv with the header "id;1;2;3;4") with the following QGIS version:

2.18.19 behaves as reported by Tobias Wendorff for 3.2.1: imported layer field names "id;1_1;2_1;3_1;4_1"

1.9.0 alpha (bb0b978) and 1.8.0 behave as Tobias think it should be: imported layer field names "id;1;2;3;4"

If the csv header is e.g. "id;2;3;4;5" the issue does not occur.

I suppose this issue/feature was introduced with 5e4f4f7 by Chris Crook and is present since QGIS 2.0

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L37-L38

// field_ is optional in following regexp to simplify QgsDelimitedTextFile::fieldNumber()
, mDefaultFieldRegexp( "^(?:field_)?(\d+)$", Qt::CaseInsensitive )

https://github.com/qgis/QGIS/blob/master/src/providers/delimitedtext/qgsdelimitedtextfile.cpp#L418-L424
// If the name looks like a default field name (field_##), then it is
// valid if the number matches its column number..
else if ( mDefaultFieldRegexp.indexIn( name ) == 0 )
{
  int col = mDefaultFieldRegexp.capturedTexts().at( 1 ).toInt();
  nameOk = col == fieldNo;
}
It seems the issue is caused by the fact that "field_" is optional in mDefaultFieldRegexp, so "mDefaultFieldRegexp.indexIn( name ) == 0" is true if column name began with "field_" OR if it contains numerical digits only. In the latter case, if the number doesn't match its column number, then "_1" is appended as a suffix to the column name.

We can fix the issue letting ""field_" not optional in mDefaultFieldRegexp:

mDefaultFieldRegexp( "^(?:field_)(\d+)$", Qt::CaseInsensitive )

but I don't know if there was a valid reason for not doing it then (in the comment it's written that the reason is "to simplify QgsDelimitedTextFile::fieldNumber()", although I can not find this "fieldNumber()")

Hi, thanks for this comments. You should really comment among the lines on Github, this way there is a much bigger chance a developer will notice the comment and read your suggestions. Even better if you could come up with a patch (also submitted on github). Thanks!

qgib · 2018-08-21T10:40:11Z

Author Name: Andrea Giudiceandrea (@agiudiceandrea)

PR submitted #7671

qgib · 2018-08-21T17:27:47Z

Author Name: Giovanni Manghi (@gioman)

Andrea Giudiceandrea wrote:

PR submitted #7671

nice, thanks!

pull_request_patch_supplied was changed from 0 to 1

qgib · 2018-08-27T01:55:00Z

Author Name: Andrea Giudiceandrea (@agiudiceandrea)

Applied in changeset 379652d.

status_id was changed from Open to Closed
done_ratio was changed from 0 to 100

qgib · 2018-08-27T05:27:01Z

Author Name: Tobias Wendorff (Tobias Wendorff)

Andrea Giudiceandrea wrote:

Applied in changeset 379652d.

Thanks to all.

qgib · 2018-08-29T09:38:45Z

Author Name: Giovanni Manghi (@gioman)

resolution was changed from to fixed/implemented

qgib closed this as completed Aug 27, 2018

qgib added Bug Either a bug report, or a bug fix. Let's hope for the latter! Data Provider Related to specific vector, raster or mesh data providers labels May 25, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV: numeric header gets a trailing underscore + 1 #27465

CSV: numeric header gets a trailing underscore + 1 #27465

qgib commented Aug 17, 2018

qgib commented Aug 19, 2018

qgib commented Aug 19, 2018

qgib commented Aug 19, 2018

qgib commented Aug 19, 2018

qgib commented Aug 20, 2018

qgib commented Aug 20, 2018 •

edited

Loading

qgib commented Aug 20, 2018

qgib commented Aug 20, 2018

qgib commented Aug 21, 2018

qgib commented Aug 21, 2018

qgib commented Aug 27, 2018

qgib commented Aug 27, 2018

qgib commented Aug 29, 2018

CSV: numeric header gets a trailing underscore + 1 #27465

CSV: numeric header gets a trailing underscore + 1 #27465

Comments

qgib commented Aug 17, 2018

qgib commented Aug 19, 2018

qgib commented Aug 19, 2018

qgib commented Aug 19, 2018

qgib commented Aug 19, 2018

qgib commented Aug 20, 2018

qgib commented Aug 20, 2018 • edited Loading

qgib commented Aug 20, 2018

qgib commented Aug 20, 2018

qgib commented Aug 21, 2018

qgib commented Aug 21, 2018

qgib commented Aug 27, 2018

qgib commented Aug 27, 2018

qgib commented Aug 29, 2018

qgib commented Aug 20, 2018 •

edited

Loading