
Implement hypothesis-based tests for linear models #4952

Conversation

@csadorf csadorf (Contributor) commented Oct 26, 2022

Closes #4943.

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Oct 26, 2022
@csadorf csadorf changed the title [WIP] Implement first prototype for hypothesis-based linear model testing. [WIP] Implement hypothesis-based tests for linear models Oct 26, 2022
@csadorf csadorf added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function labels Oct 26, 2022
@csadorf csadorf force-pushed the fea-hypothesis-based-testing-for-cpu-gpu-comparison branch from 80ee4e8 to b4c3155 Compare October 26, 2022 17:01
@csadorf csadorf force-pushed the fea-hypothesis-based-testing-for-cpu-gpu-comparison branch from b4c3155 to fe6d204 Compare October 26, 2022 17:07
@csadorf csadorf (Contributor Author) commented Oct 26, 2022

Issues I am currently observing:

  1. The test runtime appears to be highly variable unless the input size and the maximum number of examples are severely restricted (see the sketch after this comment).
  2. The cuml estimator converts dtypes and will sometimes fail with an error about a lossy conversion (I failed to copy the traceback, but it should be relatively easy to reproduce).

I will run some benchmarks, especially to address point one.
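
As an illustration of point one (and the dtype issue in point two), here is a minimal, hypothetical sketch of how the example count and dataset size could be bounded with Hypothesis settings and a constrained strategy. The strategy and test names below are made up and are not the code added by this PR:

```python
# Illustrative sketch only; strategy and test names are hypothetical.
import numpy as np
from hypothesis import given, settings, strategies as st
from hypothesis.extra.numpy import arrays


@st.composite
def small_regression_datasets(draw, dtype=np.float32):
    # Keep the problem small so a single example runs in milliseconds.
    n_samples = draw(st.integers(min_value=2, max_value=100))
    n_features = draw(st.integers(min_value=1, max_value=10))
    # Draw data directly in the estimator's expected dtype so no lossy
    # conversion is triggered inside the estimator.
    elements = st.floats(-1e3, 1e3, allow_nan=False, width=32)
    X = draw(arrays(dtype, (n_samples, n_features), elements=elements))
    y = draw(arrays(dtype, (n_samples,), elements=elements))
    return X, y


@settings(max_examples=20, deadline=None)  # bound the number of generated examples
@given(dataset=small_regression_datasets())
def test_linear_regression_smoke(dataset):
    X, y = dataset
    # Placeholder for fitting the cuML estimator and comparing to a reference.
    assert X.shape[0] == y.shape[0]
```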

@csadorf csadorf requested a review from wphicks October 26, 2022 17:18
@wphicks wphicks (Contributor) left a comment

Looking great! Love the general approach and the details of dataset generation.

python/cuml/tests/test_linear_model.py (outdated review thread)
python/cuml/tests/test_strategies.py (outdated review thread)
python/cuml/testing/strategies.py (outdated review thread)
python/cuml/testing/strategies.py (outdated review thread)
@csadorf csadorf (Contributor Author) commented Oct 27, 2022

Dumping some statistics for benchmarking:

cuml/tests/test_linear_model.py::test_linear_regression_model_default:

  - during generate phase (2.26 seconds):
    - Typical runtimes: 2-101 ms, ~ 49% in data generation
    - 36 passing examples, 1 failing examples, 29 invalid examples
    - Found 1 distinct error in this phase

  - during shrink phase (36.36 seconds):
    - Typical runtimes: 2-161 ms, ~ 58% in data generation
    - 130 passing examples, 298 failing examples, 612 invalid examples
    - Tried 1040 shrinks of which 285 were successful

  - Highest target score: 66931.9  (label='')
  - Stopped because nothing left to do

This was run with the array_equal assertion disabled so that failures would not interfere with the test execution.

@wphicks wphicks (Contributor) commented Oct 27, 2022

Those benchmark times don't worry me too much. If the shrink phase takes a little while, that's fine. It means that Hypothesis has found something, and 30 seconds of CI time is a small price to pay to get us a really good reproducer to work with. We can continue to assess the test time as we roll this out more generally and see if we need to add e.g. additional pytest configuration options to control the impact (for both dev and CI).
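
One hypothetical shape such a configuration knob could take (not something this PR adds): Hypothesis settings profiles registered in a conftest.py, so CI and local development can trade thoroughness for runtime.

```python
# Hypothetical conftest.py sketch: register Hypothesis profiles so the number of
# examples (and therefore test time) can be tuned per environment.
import os

from hypothesis import settings

# A thorough profile for nightly runs or local deep testing.
settings.register_profile("nightly", max_examples=200, deadline=None)
# A lean profile to keep PR CI time bounded.
settings.register_profile("ci", max_examples=20, deadline=None)

# Select via the HYPOTHESIS_PROFILE environment variable, defaulting to "ci";
# the Hypothesis pytest plugin's --hypothesis-profile flag would work as well.
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "ci"))
```

Running with HYPOTHESIS_PROFILE=nightly would then opt into the more thorough profile locally.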

Comment on lines 42 to 54
def array_difference(a, b, with_sign=True):
    """
    Utility function to compute the sum of absolute differences between 2 arrays.
    """
    # Convert device or host array inputs to NumPy arrays for comparison.
    a = to_nparray(a)
    b = to_nparray(b)

    # Two empty arrays are considered identical.
    if len(a) == 0 and len(b) == 0:
        return 0

    # Optionally ignore the sign of the values before comparing.
    if not with_sign:
        a, b = np.abs(a), np.abs(b)
    return np.sum(np.abs(a - b))
Member

A common issue we've had is that it is sometimes very hard to diagnose the error or the magnitude of errors when tests fail; adding some printing on failure here might be highly beneficial.

Contributor Author

This function does not directly fail tests; it only computes the difference between arrays a and b, similar to array_equal(). I use it as a target function to steer Hypothesis towards larger errors (a sketch of this pattern follows this thread).

Contributor Author

Implemented in #4973 .
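
For illustration, here is a minimal sketch of the target-function pattern described above. The test and variable names are made up, and plain NumPy conversions stand in for the actual CPU and GPU results; hypothesis.target() reports the observed difference so that example search and shrinking are steered toward inputs that maximize the discrepancy.

```python
# Hypothetical sketch of steering Hypothesis toward inputs that maximize the
# CPU/GPU discrepancy, mirroring what array_difference computes.
import numpy as np
from hypothesis import given, target, strategies as st


@given(st.lists(st.floats(-1e6, 1e6, allow_nan=False), min_size=1, max_size=50))
def test_discrepancy_is_small(values):
    reference = np.asarray(values)                 # stand-in for the CPU result
    result = np.asarray(values, dtype=np.float32)  # stand-in for the GPU result
    difference = float(np.sum(np.abs(reference - result)))
    # Report the difference as a target score; Hypothesis will prefer examples
    # that increase it, surfacing worst-case inputs.
    target(difference, label="array_difference")
    assert np.allclose(reference, result, rtol=1e-4, atol=1e-4)
```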

python/cuml/tests/test_linear_model.py (outdated review thread)
@csadorf csadorf (Contributor Author) commented Oct 28, 2022

Here is an example of the output for an actual failure case.

python/cuml/tests/test_linear_model.py (outdated review thread)
@csadorf csadorf (Contributor Author) commented Oct 28, 2022

@wphicks I guess we can't merge this without also addressing the failures? Should I create a longer-running feature branch?

@csadorf csadorf changed the title [WIP] Implement hypothesis-based tests for linear models Implement hypothesis-based tests for linear models Oct 28, 2022
@csadorf csadorf marked this pull request as ready for review October 28, 2022 17:08
@csadorf csadorf marked this pull request as ready for review November 4, 2022 19:20
@wphicks wphicks (Contributor) left a comment

Besides my one xfail comment, this looks great! Really like the current form of the dataset generation strategy.

At this point, I'd love to get this in ASAP so that we can go ahead and start applying this to the CPU/GPU algorithms that we're pulling in with this release.
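
For context, a hedged sketch of how a known failure could be marked with pytest.mark.xfail so it does not block CI; the test name and reason text are hypothetical and not taken from this PR.

```python
# Hypothetical sketch: mark a Hypothesis-based test as an expected failure while
# a known CPU/GPU discrepancy is investigated, without blocking CI.
import pytest
from hypothesis import given, strategies as st


@pytest.mark.xfail(
    reason="known discrepancy between cuML and the reference implementation",
    strict=False,  # still passes if the discrepancy does not reproduce
)
@given(st.integers(min_value=2, max_value=100))
def test_known_discrepancy(n_samples):
    assert n_samples >= 2  # placeholder for the real cuML-vs-reference comparison
```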

python/cuml/tests/test_linear_model.py (outdated review thread)
@caryr35 caryr35 added this to PR-WIP in v22.12 Release via automation Nov 7, 2022
@caryr35 caryr35 moved this from PR-WIP to PR-Needs review in v22.12 Release Nov 7, 2022
@csadorf csadorf requested a review from wphicks November 7, 2022 16:53
@wphicks wphicks (Contributor) left a comment

Love the update. LGTM!

@wphicks wphicks (Contributor) commented Nov 7, 2022

@dantegd Any final thoughts on this or do we feel good about merging?

@csadorf csadorf (Contributor Author) commented Nov 8, 2022

@wphicks One of the tests appeared to be flaky with respect to the efficiency of example generation. I've suppressed the health check for that particular test, but we will have to monitor whether those tests become flaky in general and consider suppressing these health checks globally to avoid surprises later on.

@wphicks wphicks (Contributor) commented Nov 8, 2022

Yep, the health checks can become an issue when the dataset generation relies on a lot of assumptions. I'd say we shouldn't be afraid to suppress them, but we should address the underlying issue in the generation strategy itself if it begins to impact CI time.
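
A minimal sketch of suppressing health checks for a single test rather than globally; which checks are involved here is an assumption (e.g. too_slow or filter_too_much), and the test below is hypothetical.

```python
# Hypothetical sketch: suppress specific Hypothesis health checks for one test
# only, instead of disabling them globally.
from hypothesis import HealthCheck, given, settings, strategies as st


@settings(
    # Tolerate slow or heavily filtered example generation for this test alone;
    # which checks actually need suppressing depends on the failing health check.
    suppress_health_check=[HealthCheck.too_slow, HealthCheck.filter_too_much],
    deadline=None,
)
@given(st.integers(min_value=2, max_value=1000))
def test_with_suppressed_health_checks(n_samples):
    assert n_samples >= 2  # placeholder for the real dataset-based check
```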

@wphicks wphicks (Contributor) commented Nov 9, 2022

@gpucibot merge

@wphicks wphicks (Contributor) commented Nov 9, 2022

@dantegd Could you dismiss your review when you get a moment? (Or re-review if you're still thinking this one over)

@cjnolet cjnolet dismissed dantegd’s stale review November 10, 2022 15:30

Review items taken care of.

v22.12 Release automation moved this from PR-Needs review to PR-Reviewer approved Nov 10, 2022
@cjnolet cjnolet (Member) commented Nov 10, 2022

rerun tests

@cjnolet cjnolet (Member) commented Nov 10, 2022

Looks like we got a conda timeout error in one of the gpu test builds.

@csadorf csadorf (Contributor Author) commented Nov 11, 2022

rerun tests

@codecov-commenter

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.12@0beb45f).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.12    #4952   +/-   ##
===============================================
  Coverage                ?   79.38%           
===============================================
  Files                   ?      184           
  Lines                   ?    11698           
  Branches                ?        0           
===============================================
  Hits                    ?     9287           
  Misses                  ?     2411           
  Partials                ?        0           
Flag       Coverage Δ
dask       45.93% <0.00%> (?)
non-dask   68.92% <0.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report at Codecov.

@rapids-bot rapids-bot bot merged commit df2fbbe into rapidsai:branch-22.12 Nov 11, 2022
v22.12 Release automation moved this from PR-Reviewer approved to Done Nov 11, 2022
@csadorf csadorf deleted the fea-hypothesis-based-testing-for-cpu-gpu-comparison branch November 14, 2022 10:11
jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
Labels
Cython / Python (Cython or Python issue), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

[FEA] Add utility for Hypothesis-based comparisons of CPU and GPU algorithm implementations
5 participants