Values not loaded correctly when reading CSV as strings. #1280

marty1885 · 2018-12-06T05:02:07Z

I found that values are not read correctly when reading a CSV file as strings using load_csv().
For example, given data.csv:

aaaa, bbbb, cccc
dd d, ee e, gg g

and reading it with:

std::ifstream in("data.csv");
xt::xarray<std::string> data = xt::load_csv<std::string>(in);
std::cout << data << std::endl;

However, xtensor only loads the part of each cell before the first space character, where it should be reading the entirety of the cell. It prints:

{{aaaa, bbbb, cccc},
 {  dd,   ee,   gg}}

I found this bug when reading a series of datetimes stored in a CSV file into xtensor.
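The truncation described above is what one would expect if each cell is converted through stream extraction, since `operator>>` into a `std::string` skips leading whitespace and stops at the next whitespace character. The sketch below illustrates that behaviour and a string specialization that returns the cell untouched. The `sketch` namespace, `lexical_cast` as written here, and `extract_with_stream` are illustrative assumptions, not xtensor's actual code.

```cpp
#include <sstream>
#include <string>

namespace sketch
{
    // Generic conversion through stream extraction (assumed behaviour).
    // Works for numbers, but for std::string targets operator>> stops
    // at the first whitespace character after the token starts.
    template <class T>
    T lexical_cast(const std::string& cell)
    {
        std::istringstream iss(cell);
        T value{};
        iss >> value;
        return value;
    }

    // Specialization for std::string: return the cell text unchanged,
    // so embedded spaces survive.
    template <>
    inline std::string lexical_cast<std::string>(const std::string& cell)
    {
        return cell;
    }

    // Same logic as the generic template, kept under a separate
    // (hypothetical) name so the truncation can still be observed.
    inline std::string extract_with_stream(const std::string& cell)
    {
        std::istringstream iss(cell);
        std::string value;
        iss >> value;
        return value;
    }
}
```

With this sketch, `sketch::extract_with_stream("dd d")` yields `dd`, while `sketch::lexical_cast<std::string>("dd d")` yields `dd d`.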


SylvainCorlay · 2018-12-06T07:42:50Z

Thanks for the report, and the fix! This looks good to me!

marty1885 · 2018-12-06T11:31:27Z

Thanks!
I found another problem. xtensor implements a simple CSV parser that ignores some common CSV features, such as commas or newlines inside cells, which leads to incorrect parsing results. A full CSV parser, however, would be slower than the current implementation.

Do you think it would be an upgrade compared to the current one? I'm more than happy to write one for xtensor.

JohanMabille · 2018-12-06T12:22:47Z

We can have both living together, with an additional tag / option argument in the load_csv method to choose which one to use. That would require some refactoring in xcsv.hpp though. This way the user can choose the fast implementation if she knows her CSV file contains no commas or newlines inside cells. This should be carefully documented.
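One way to picture the tag argument is overload dispatch on empty tag types. The sketch below is only an illustration of that idea: `fast_parser_tag`, `quoted_parser_tag`, `load_rows`, and `row` are hypothetical names, the functions return plain vectors of strings rather than an xtensor container, and the quote-aware path handles only commas inside quotes (no newlines or escaped quotes).

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

namespace sketch
{
    // Hypothetical tags selecting the parsing strategy.
    struct fast_parser_tag {};
    struct quoted_parser_tag {};

    using row = std::vector<std::string>;

    // Fast path: split every line on commas, with no quote handling.
    // A trailing empty cell (a line ending in a comma) is not produced.
    inline std::vector<row> load_rows(std::istream& in, fast_parser_tag)
    {
        std::vector<row> rows;
        std::string line;
        while (std::getline(in, line))
        {
            row r;
            std::string cell;
            std::istringstream ls(line);
            while (std::getline(ls, cell, ','))
            {
                r.push_back(cell);
            }
            rows.push_back(r);
        }
        return rows;
    }

    // Slower path: a comma inside double quotes does not end the cell.
    // The quote characters themselves are dropped from the cell text.
    inline std::vector<row> load_rows(std::istream& in, quoted_parser_tag)
    {
        std::vector<row> rows;
        std::string line;
        while (std::getline(in, line))
        {
            row r;
            std::string cell;
            bool quoted = false;
            for (char c : line)
            {
                if (c == '"')
                {
                    quoted = !quoted;
                }
                else if (c == ',' && !quoted)
                {
                    r.push_back(cell);
                    cell.clear();
                }
                else
                {
                    cell += c;
                }
            }
            r.push_back(cell);
            rows.push_back(r);
        }
        return rows;
    }

    // Omitting the tag keeps the fast behaviour as the default.
    inline std::vector<row> load_rows(std::istream& in)
    {
        return load_rows(in, fast_parser_tag{});
    }
}
```

A caller would then write `sketch::load_rows(in, sketch::quoted_parser_tag{})` to opt into the slower path, while existing calls without a tag keep the fast behaviour.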

SylvainCorlay · 2018-12-06T12:24:23Z

@marty1885 that would be awesome!

I am not too opinionated about how this should be done. I have looked at pandas's csv parser which has tons of options, and depending on the complexity, we may want to have these extra features behind an option, or in a different API...

marty1885 · 2018-12-06T13:08:34Z

Well. I guess I'll code up a prototype and see how things go after that.

marty1885 · 2018-12-09T14:09:28Z

I have written a nasty parser that properly handles quotes, as well as commas and spaces inside quotes. Currently it:

- Does not handle CRLF documents well
- Handles multi-line cells and cells containing commas, as long as the content is surrounded by double quotes
- Treats two consecutive double quotes as a literal quote character
- Is a total mess
  - To save on memory, there is no staging in the parsing process, so all the logic is crammed together.
  - Are the memory savings and the slight performance improvement worth it?
- Is far from RFC compliant
  - Ignores leading/trailing white space (although common parsers do ignore leading/trailing spaces)
  - Spaces outside of quotes are allowed
  - Multiple quoted regions are allowed in the same cell (e.g. 1,"222" "333", 4)
  - The list goes on...

Do we need the parser to be RFC compliant? Is it worth it to trade some performance for cleaner code? Also, should the function provide an option for the delimiter(s)?

commit: bbc6369, function: read_csv()

marty1885 mentioned this issue Dec 6, 2018

add string specialization to lexical_cast #1281

Merged

marty1885 mentioned this issue Dec 9, 2018

WIP: New CSV parser #1287

Open
