
[FEA] Implement cuIO fuzz tests #6001

Closed · 7 tasks done
vuule opened this issue Aug 17, 2020 · 14 comments · Fixed by #6571
Labels: cuIO (cuIO issue), feature request (New feature or request), Python (Affects Python cuDF API), tests (Unit testing for project)

Comments

@vuule
Contributor

vuule commented Aug 17, 2020

Depends on #6000

Python tests:

To get started, tests can vary only the data profile and keep the reader/writer parameters at their defaults.


@vuule vuule added feature request New feature or request Needs Triage Need team to review and classify labels Aug 17, 2020
@github-actions github-actions bot added this to Needs prioritizing in Feature Planning Aug 17, 2020
@harrism harrism added cuIO cuIO issue tests Unit testing for project labels Aug 17, 2020
@harrism harrism added this to Issue-Needs prioritizing in v0.16 Release via automation Aug 17, 2020
@vuule vuule added tech debt and removed Needs Triage Need team to review and classify labels Aug 18, 2020
@vuule
Contributor Author

vuule commented Aug 18, 2020

In this design we have the data generators implemented only in C++, including the parameter serialization to/from an XML file. The tests implemented in Python/Java are supposed to use the generator API through interop. The idea is to avoid duplicating the serialization code and get the best performance when generating inputs. @kkraus14 does this sound viable, at least for Python?

@vuule
Contributor Author

vuule commented Aug 18, 2020

@nvdbaranec you suggested that there needs to be a way to replay failing tests in a C++ environment to facilitate debugging. The design here should allow that, since the data generators are deterministic and the data profile and seed can be saved and passed to a C++ API to regenerate the same input.
Do you have any feedback/concerns about the approach?

@nvdbaranec
Contributor

That seems fine to me. It would be particularly useful if any failure message also included a quick blurb about how to reproduce. Something like:

"Test XYZ failed :
[test output]

Test was run with these parameters [5236783, 34217, etc]. To reproduce, call ABC() with these values"

@vuule
Contributor Author

vuule commented Aug 18, 2020

Yeah, I neglected the call parameters in the diagram and the description here because we can get started with random data + defaults. You're right, the API and its parameters should definitely be included.
In the ideal case we would save the call parameters in a machine readable format and have a utility that executes the test based on all of this input.

@nvdbaranec
Contributor

Random comment on the kinds of things that seem to dig up bugs:

  • Nulls and empties, especially in combination.

  • Tweakable format meta-parameters such as page size and row group size in Parquet. Things that don't necessarily affect the format itself but squeeze the data into interestingly small or large blobs (see the pyarrow sketch after this list). The classic example from the Parquet reader was page sizes sufficiently small to cause entire nested rows to neither begin nor end within a single page of a row group. I also strongly suspect we will see some (fixable) performance issues with very large page sizes.

  • Very large and very small files.

  • Tiny or otherwise fiendishly constructed reads. Reading 1 row out of millions in the middle of a file. Reading 2 rows that deliberately span units of parallelization (eg, 1 block per page in Parquet).

  • Super-compressible vs terribly-compressible data. Billions of rows all containing the same value from a dictionary. That kind of thing.

  • Things that deliberately break the reader, like files that will cause out-of-memory issues. The idea is to figure out how gracefully things fail, and how we might produce more useful output when they do.

  • Files with huge numbers of columns vs huge numbers of rows.

@nvdbaranec
Contributor

"In the ideal case we would save the call parameters in a machine readable format and have a utility that executes the test based on all of this input."

I am a big fan of being able to reproduce the testing outside of the testing framework itself. I like working in my own little (debuggable) separate test app, so providing simple code hooks that can be called from such an app would be great.

@vuule
Contributor Author

vuule commented Aug 18, 2020

Out of the cases you listed, I see a potential problem with one:

Reading 1 row out of millions in the middle of a file. Reading 2 rows that deliberately span units of parallelization (eg, 1 block per page in Parquet).

For this type of testing (en masse) we have to rely on the reference implementation to validate the output. And I don't think the other Parquet implementations support row selection (outside of row group selection). I guess in this case we can crop the reference output before comparison.

I think we will hit all of the listed corner cases eventually with random parameter sampling as long as we set up the data profile parameters' ranges correctly.

@nvdbaranec
Contributor

nvdbaranec commented Aug 18, 2020

True. If this really saturates the parameter space (implying long run times at scale, as you mentioned), it'll probably catch the interesting stuff. I would reiterate the point to make sure that we parameterize things that aren't specifically the data itself (e.g., page sizes).

@nvdbaranec
Contributor

Thought of another thing. This is probably not super relevant right now, but if we ever did overlapped reads with decoding (that is, read/decompress blob N+1 while decoding blob N) having tests that use both memory mapped files and true spinning-disk files might yield some race condition stuff.

@nvdbaranec
Contributor

Item : One of the tests I added to the python code for the parquet reader went long enough that it occasionally was causing the build machines to time out. The conclusion was that in the future we should be doing this kind of exhaustive/stress testing in the cpp tests instead of python, which makes sense.

As such, it would be pretty handy if whatever reference reader/writer code you use for this new test framework was also readily available in the normal cudf test framework.

@kkraus14
Collaborator

Item : One of the tests I added to the python code for the parquet reader went long enough that it occasionally was causing the build machines to time out. The conclusion was that in the future we should be doing this kind of exhaustive/stress testing in the cpp tests instead of python, which makes sense.

Just a note that the 5 minute timeout is specifically defined for the Python unit tests and isn't a limitation of the CI framework / machine.

v0.17 Release automation moved this from Issue-P0 to Done Nov 12, 2020
galipremsagar added a commit that referenced this issue Nov 12, 2020
Partially resolves: #6001, #6260

This PR:

  • Adds support for ORC fuzz workers (both reader and writer)
  • Utilizes pyorc to write a pandas DataFrame to ORC files (see the sketch below)
  • Adds varying test parameter combinations for cudf.read_orc and df.to_orc