Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Values not loaded correctly when reading CSV as strings. #1280

Open
marty1885 opened this issue Dec 6, 2018 · 6 comments
Open

Values not loaded correctly when reading CSV as strings. #1280

marty1885 opened this issue Dec 6, 2018 · 6 comments

Comments

@marty1885
Copy link
Contributor

I found that values are not read correctly when reading CSV as strings using load_csv()
For example:
data.csv

aaaa, bbbb, cccc
dd d, ee e, gg g

and read it with.

std::ifstream in("data.csv");
xt::xarray<std::string> data = xt::load_csv<std::string>(in);
std::cout << data << std::endl;

However xtensor only loads the part before the first space character. Where it should be reading the entity of the cell. It prints

{{aaaa, bbbb, cccc},
 {  dd,   ee,   gg}}

I found this bug when reading a series of datetime stored in CSV into xtensor.

@SylvainCorlay
Copy link
Member

Thanks for the report, and the fix! This looks good to me!

@marty1885
Copy link
Contributor Author

marty1885 commented Dec 6, 2018

Thanks!
I found another problem. xtensor implemented a simple CSV parser that ignores some common properties like comma in cells, new line in cells, etc... Which will cause improper parsing results.But a full CSV parser but will be slower than the current implementation.

Do you think that it will be an upgrade comparing to the current one? I'm more than happy to write one for xtensor.

@JohanMabille
Copy link
Member

We can have both living together, and an additional tag / option argument in the load_csv method to choose which one to use. That would require some refactoring in xcsv.hpp though. This way the user can choose the fast implementation if she knows her csv file does not contain comma or new line in cells. This should be carefully documented.

@SylvainCorlay
Copy link
Member

@marty1885 that would be awesome!

I am not too opinionated about how this should be done. I have looked at pandas's csv parser which has tons of options, and depending on the complexity, we may want to have these extra features behind an option, or in a different API...

@marty1885
Copy link
Contributor Author

Well. I guess I'll code up a prototype and see how things go after that.

@marty1885
Copy link
Contributor Author

I have written a nasty parser that handles quotes and comma/space in quotes properly. Currently:

  • Does not handle CRLF documents well
  • Handles multi line cells/cells including commas as long as the content is surrounded by double quotes
  • Treats two consecutive double quote as a quote character.
  • Is a total mess
    • To save on memory, there is no staging in the parsing process. So all the logic is crammed together.
    • Is saving some memory and the slight performance improvement worth it?
  • Far from RFC compliant
    • Ignores heading/tailing white space. (Although common parser do ignore tailing/heading spaces)
    • Spaces outside of quotes are allowed
    • Multiple quote regions is allowed in the same cell (Ex: 1,"222" "333", 4)
    • The list goes on...

Do we need the parser to be RFC compliant? Is it worth it to trade some performance for cleaner code? Also, should the function provide an option for the delimiter(s)?

commit: bbc6369, function: read_csv()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants