Data corruption and crash with very wide columns? #64

Closed
oschonrock opened this issue Nov 22, 2019 · 6 comments


oschonrock commented Nov 22, 2019

Large file with 2000 columns of ~20-char-wide doubles, generated like this:

  const int cols_n = 2000;
  std::string big_filename = "MedDataset.txt";
  std::ofstream ofstream(big_filename);
  if (!ofstream.is_open()) {
    std::cerr << "failed to open " << big_filename << '\n';
    exit(1);
  }

  std::random_device rd;
  std::mt19937 gen{rd()};
  std::uniform_real_distribution<double> dist{0, 1};

  ofstream << std::setprecision(16);
  for (int r = 0; r < 1000; r++) {
    for (int c = 0; c < cols_n; c++) {
      double num = dist(gen);
      ofstream << num;
      if (c != cols_n -1) ofstream << ',';
    }
    ofstream << "\n";

  }
  ofstream.close();

parsing like this:

  CSVReader reader("MedDataset.txt");

  {
    std::vector<double> r;
    for (CSVRow& row: reader) { // Input iterator
      for (CSVField& field: row) {
        r.push_back(field.get<double>());
      }
      // use vector...
      r.clear();
    }
  }

Getting this error during parsing:

terminate called after throwing an instance of 'std::runtime_error' 
  what():  Not a number.

If I reduce cols_n from 2000 to 1800 it runs just fine.

I have visually inspected the file and I am not seeing any weird characters; it is all programmatically produced.

It feels like there is some sort of "buffer overflow" due to the very large row (roughly 32 kB?). It is 100% reproducible for me even though the values of the fields are random.

clang++ -O2  -std=c++17   ...    -lpthread
clang++ --version
clang version 8.0.0-3 (tags/RELEASE_800/final)
Target: x86_64-pc-linux-gnu

oschonrock commented Nov 22, 2019

Refined code to reproduce:

#include "csv.hpp"  // single header version

int main() {

  using namespace csv;

  const int cols_n = 5000;
  const int rows_n = 2;
  std::string filename = "WideDataset.txt";
  std::ofstream ofstream(filename);
  if (!ofstream.is_open()) {
    std::cerr << "failed to open " << filename << '\n';
    exit(1);
  }

  std::random_device rd;
  std::mt19937 gen{1}; // {rd()};
  std::uniform_real_distribution<double> dist{0, 1};

  ofstream << std::setprecision(16);
  for (int r = 0; r < rows_n; r++) {
    for (int c = 0; c < cols_n; c++) {
      double num = dist(gen);
      ofstream << num;
      if (c != cols_n -1) ofstream << ',';
    }
    ofstream << "\n";

  }
  ofstream.close();

  Matrix m;
  CSVReader reader(filename);

  {
    std::vector<double> r;
    for (CSVRow& row: reader) { // Input iterator
      for (CSVField& field: row) {
        std::cout << field.get<double>() << "\n";
        // std::cout << field << "\n";
      }
    }
  }
}

I did see some segfaults as well, but mostly throws of "not a number". Here is the valgrind output:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Not a number.
==708== 
==708== Process terminating with default action of signal 6 (SIGABRT)
==708==    at 0x4C1FED7: raise (raise.c:51)
==708==    by 0x4C01534: abort (abort.c:79)
==708==    by 0x492B671: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x49372B5: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4937300: std::terminate() (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4937534: __cxa_throw (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.26)
==708==    by 0x4082E7: double csv::CSVField::get<double>() (csv.hpp:3801)
==708==    by 0x406451: main (corr.cpp:122)

But if I use std::cout << field << "\n"; (i.e. print the CSVField instance without converting to double), then it runs without errors.
Inspecting the output of the CSVField printing where it goes wrong:

...
<CSVField> 0.3118567593128918
<CSVField> 0.24509591655933560.8756073010782790.9602685581329761...

There are no "commas", and all the values are run together. At this point I began to doubt my "sample CSV generating code". But, I checked, and here is the equivalent place in the CSV file (I changed to a constant random seed to get consistency):

...0.3118567593128918,0.2450959165593356,0.875607301078279,0.9602685581329761,...

which looks all fine...

So the error is being thrown (correctly!) at the get<double>() stage because something has gone wrong at the parsing stage: the separators are "lost" and all the fields are being run together into extremely long "rows".

Hope that helps. Will investigate further if I get time.


vincentlaucsb commented Nov 22, 2019

This is really interesting, and I'm 99% sure I know why this is happening.

Basically, this CSV parser stores CSV data as a giant std::string with an accompanying std::vector<unsigned short> of indices where each field starts. If you have a CSV row like
a,b,c,de,fg

then it'll be stored (roughly) as

data = "abcdefg"
fields = {0, 1, 2, 3, 5}

When you use CSVFields, you're basically creating a string_view over the larger string using the indices stored in the vector of unsigned shorts. I used to naively store data in std::vector<std::string>s, but this was not very fast. If you want to get into the nitty gritty then look at:

I think what is happening is that the length of your rows (in terms of characters) is beyond the range of unsigned short which is 2^16. I believe this will be fixed if you replace all instances of unsigned short with unsigned int in the parser.
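
Here is a minimal standalone sketch of that failure mode (the same idea of one big buffer plus 16-bit start offsets, not the parser's actual internals, and all names are illustrative):

#include <iostream>
#include <string>
#include <string_view>
#include <vector>

int main() {
  // One big buffer plus 16-bit start offsets, as described above.
  std::string data;
  std::vector<unsigned short> starts;

  const int field_width = 20;
  for (int i = 0; i < 4000; ++i) {   // 4000 * 20 = 80,000 chars in one "row"
    // silently wraps modulo 65,536 once data.size() exceeds 65,535
    starts.push_back(static_cast<unsigned short>(data.size()));
    data += std::string(field_width, static_cast<char>('a' + i % 26));
  }

  // Field 3500 really starts at offset 70,000, but the stored offset wrapped
  // to 70,000 - 65,536 = 4,464, so the view lands in the middle of field 223.
  const int i = 3500;
  std::string_view view(data.data() + starts[i], field_width);
  std::cout << "stored start: " << starts[i]
            << ", true start: " << i * field_width << "\n"
            << "content seen: " << view << "\n";  // a slice of fields 223/224, not field 3500
}

Once data.size() passes 65,535, every subsequent stored offset silently wraps and points somewhere earlier in the buffer, which would produce exactly this kind of garbage when the views are materialised.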

I don't really want to do this because it might hurt performance for general use cases. On the other hand, I didn't actually expect somebody to craft rows with more than 60,000 characters so I might have to find a compromise.


oschonrock commented Nov 22, 2019

I think what is happening is that the length of your rows (in terms of characters) is beyond the range of unsigned short which is 2^16. I believe this will be fixed if you replace all instances of unsigned short with unsigned int in the parser.

I don't really want to do this because it might hurt performance for general use cases.

Yup, that makes a lot of sense. I agree 60,000-character rows are extreme, but at the very least we should throw if the row is too long, rather than corrupt memory, return bad data and potentially segfault... basically undefined behaviour?

I can clone a copy down and experiment.

Are you sure that using unsigned short is beneficial from a performance POV? I have tried this many times (i.e. changed int to short) and measurement usually shows that it is a slowdown, because modern CPUs are optimised to deal with 32-bit ints, not 16-bit ones. The only time it's a performance benefit is when you are storing literally millions of such ints/shorts, so that the total memory involved is significant and we hit cache or memory-bus bandwidth bottlenecks. Is that the case here? Surely we are dealing with one row at a time, so we are dealing with thousands of ints or shorts (one for each field), not millions? The general consensus in discussions on this topic seems to be: "int is the same or faster unless memory / cache usage / bandwidth is the issue".
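
Back-of-the-envelope (my own numbers, not measured against this library): even a 10,000-field row only needs a tiny index vector either way, so the width of the index type alone shouldn't be the bottleneck for a single row:

#include <cstdio>

int main() {
  // Per-row index storage for 10,000 fields, assuming typical x86-64 sizes:
  // sizeof(unsigned short) == 2, sizeof(unsigned int) == 4.
  const long fields = 10000;
  std::printf("unsigned short: %ld bytes\n", fields * 2);  // 20,000 bytes = 20 kB
  std::printf("unsigned int:   %ld bytes\n", fields * 4);  // 40,000 bytes = 40 kB
  // Both fit comfortably in a typical L2 cache, so per-row cache pressure
  // from the index type is unlikely to dominate parsing cost.
}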


oschonrock commented Nov 22, 2019

progress report:

  1. I grep'd for and (quite blindly) replaced all instances of unsigned short with unsigned
  2. I had to back out these 3 because they broke the tests:
tests/test_csv_field.cpp:64:    unsigned char, unsigned short, unsigned int, unsigned long long,
tests/test_csv_field.cpp:80:    unsigned char, unsigned short, unsigned int, unsigned long long int,
tests/test_csv_field.cpp:97:    unsigned char, unsigned short, unsigned int, unsigned long long int) {
  3. Result? The behaviour "changed", i.e. it worked for 5000 columns and 2 rows, but as soon as I changed to 10,000 cols and 1,000 rows (my target test case), it broke again.

So "close" .. but not quite...

What do you think about making such a change from 16bit unsigned shorts to 32bit unsigned ints permanently if we can prove it doesn't harm performance?

@vincentlaucsb
Owner

I would be fine with that. I just don't want to sacrifice performance in general for extreme cases.

There is also another minor tweak that might help expand the maximum row size, which I can implement when I have time. As for the failing tests, I know why they're failing (overzealous grepping) and it's not a big deal to fix them.

@vincentlaucsb
Owner

In #80, I replaced unsigned short with size_t, which maps to unsigned __int64 on 64-bit MSVC. No noticeable performance impacts.
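
For reference, rerunning the small wrap sketch from earlier with std::size_t offsets (an illustration of the idea, not the actual #80 diff) keeps the indices exact even for an 80,000-character row:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
  std::string data;
  std::vector<std::size_t> starts;   // 64-bit on common 64-bit targets
  for (int i = 0; i < 4000; ++i) {   // same 80,000-char "row" as before
    starts.push_back(data.size());   // no narrowing conversion, no wrap
    data += std::string(20, 'x');
  }
  std::cout << "field 3500 starts at " << starts[3500] << "\n";  // prints 70000
}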
