Data corruption and crash with very wide columns? #64
Comments
refined code, to reproduce:

```cpp
#include <cstdlib>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <random>
#include <string>
#include <vector>

#include "csv.hpp" // single header version

int main() {
    using namespace csv;

    const int cols_n = 5000;
    const int rows_n = 2;

    // Write a CSV file with very wide rows of random doubles.
    std::string filename = "WideDataset.txt";
    std::ofstream ofstream(filename);
    if (!ofstream.is_open()) {
        std::cerr << "failed to open " << filename << '\n';
        exit(1);
    }
    std::random_device rd;
    std::mt19937 gen{1}; // {rd()}; fixed seed for consistency
    std::uniform_real_distribution<double> dist{0, 1};
    ofstream << std::setprecision(16);
    for (int r = 0; r < rows_n; r++) {
        for (int c = 0; c < cols_n; c++) {
            double num = dist(gen);
            ofstream << num;
            if (c != cols_n - 1) ofstream << ',';
        }
        ofstream << "\n";
    }
    ofstream.close();

    // Read it back.
    // Matrix m; // (type not defined in this snippet)
    CSVReader reader(filename);
    {
        std::vector<double> r;
        for (CSVRow& row : reader) { // input iterator
            for (CSVField& field : row) {
                std::cout << field.get<double>() << "\n";
                // std::cout << field << "\n";
            }
        }
    }
}
```

I did see some segfaults as well, but mostly throws of "not a number". Here is the valgrind output:
but if I use the
There are no commas, and all the values are run together. At this point I began to doubt my sample CSV-generating code. But I checked, and here is the equivalent place in the CSV file (I changed to a constant random seed to get consistency):
which looks all fine... So: the error is being thrown (correctly!) at Hope that helps. Will investigate further if I get time.
This is really interesting. I'm 99% sure why this is happening. Basically, this CSV parser stores CSV data as a giant then it'll be stored (roughly) as
When you use CSVFields, you're basically creating a string_view over the larger string using the indices stored in the vector of unsigned shorts. I naively store data in
I think what is happening is that the length of your rows (in terms of characters) is beyond the range of I don't really want to do this because it might hurt performance for general use cases. On the other hand, I didn't actually expect somebody to craft rows with more than 60,000 characters, so I might have to find a compromise.
Yup, that makes a lot of sense. I agree that 60,000-character rows are extreme, but at least we should throw if the row is too long rather than corrupt memory, return bad data, and potentially segfault... basically undefined behaviour? I can clone a copy down and experiment. Are you sure that using
progress report:
So "close"... but not quite. What do you think about making such a change from 16-bit
I would be fine with that. I just don't want to sacrifice performance in general for extreme cases. There is also another minor tweak that might help expand the size of rows, which I can implement when I have time. As for the failing tests, I know why they're failing (overzealous grepping) and it's not a big deal to fix them.
In #80, I replaced
Large file with 2000 x ~20-char-wide `double` columns, generated like this:
parsing like this:
getting this error during parsing:
If I reduce `cols_n = 2000` to 1800, it runs just fine. I have visually inspected the file and am not seeing any weird characters; all programmatically produced.
It feels like there is some sort of "buffer overflow" due to the very large row, roughly 32 KB? 100% reproducible for me, even though the values of the fields are random.