
bug: test_visualization.py::test_visualization_calibration_1_vs_all_output_saved fails with IndexError in seaborn regplot() function #620

Closed
jimthompson5802 opened this issue Jan 27, 2020 · 17 comments · Fixed by #626
Labels
bug Something isn't working

Comments

@jimthompson5802
Collaborator

jimthompson5802 commented Jan 27, 2020

Describe the bug
Unit test

tests/
  integration_tests/
    test_visualization.py::test_visualization_calibration_1_vs_all_output_saved

fails with an IndexError exception in the Travis CI run and locally for me in a Docker container:

   for command, viz_pattern in zip(commands, vis_patterns):
            result = subprocess.run(command)
            figure_cnt = glob.glob(viz_pattern)
    
>           assert 0 == result.returncode
E           AssertionError: assert 0 == 1
E            +  where 1 = CompletedProcess(args=['python', '-m', 'ludwig.visualize', '--visualization', 'calibration_1_vs_all', '--metrics', 'ac...robabilities.npy', '--model_names', 'Model1', 'Model2', '--top_k', '6', '-od', 'results/experiment_run'], returncode=1).returncode
tests/integration_tests/test_visualization.py:1581: AssertionError
----------------------------- Captured stderr call -----------------------------
Traceback (most recent call last):
  File "/opt/python/3.6.7/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/python/3.6.7/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/travis/build/uber/ludwig/ludwig/visualize.py", line 3265, in <module>
    cli(sys.argv[1:])
  File "/home/travis/build/uber/ludwig/ludwig/visualize.py", line 3260, in cli
    vis_func(**vars(args))
  File "/home/travis/build/uber/ludwig/ludwig/visualize.py", line 623, in calibration_1_vs_all_cli
    calibration_1_vs_all(probabilities_per_model, gt, **kwargs)
  File "/home/travis/build/uber/ludwig/ludwig/visualize.py", line 2654, in calibration_1_vs_all
    filename=filename
  File "/home/travis/build/uber/ludwig/ludwig/utils/visualization_utils.py", line 856, in calibration_plot
    algorithm_names) else '')
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/seaborn/regression.py", line 810, in regplot
    x_jitter, y_jitter, color, label)
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/seaborn/regression.py", line 114, in __init__
    self.dropna("x", "y", "units", "x_partial", "y_partial")
  File "/home/travis/virtualenv/python3.6.7/lib/python3.6/site-packages/seaborn/regression.py", line 66, in dropna
    setattr(self, var, val[not_na])
IndexError: too many indices for array

To Reproduce
Steps to reproduce the behavior:
Any PR submitted after 24 Jan 2020 should exhibit the problem.

Expected behavior
Unit test should not fail

Screenshots
See above

Environment (please complete the following information):
The failure occurs in two environments. The first is my local development environment:

  • OS: MacOS 10.15.2
  • Docker Desktop for Mac 2.2.0.0
  • Python version: 3.6.8
  • Ludwig version: 2.1

The failure also occurs in the Travis CI environment.

Additional context
After staring at this issue for a few days, it appears the error may be related to the version of the seaborn package. When the unit test is run using seaborn 0.9.0, the test completes successfully. However, when seaborn 0.10.0 is installed, the test fails with the cited error message. I've confirmed this observation through local testing, i.e.,

  • seaborn 0.9.0 => unit test completes successfully
  • seaborn 0.10.0 => unit test fails

Note: unit test also fails with seaborn 0.9.1

seaborn 0.10.0 was released on PyPi on 2020-01-24. seaborn 0.9.0 was released 2018-07-16. Prior to mid-January 2020, I did not encounter any issues in running the pytest suite of tests.

The seaborn 0.10.0 release notes indicate changes were made to the regplot() function, which is the one failing in Ludwig's unit test. I can't tell whether the issue we are seeing is due to a breaking change in the seaborn 0.10.0 API or a bug in the 0.10.0 code base. More analysis is required.

Since the requirements file used to build the Travis CI environment specifies seaborn>=0.7, pip install pulls the most recent version of seaborn at or above 0.7, which now results in installing seaborn 0.10.0. The implication is that any PR submitted from this point on will fail the cited unit test.

A possible short-term workaround for the Travis CI failures is to update requirements_viz.txt, replacing the current specification seaborn>=0.7 with seaborn>=0.7,<=0.9.0. If this is acceptable, I'll submit a PR to effect the change.

@msaisumanth
Contributor

I doubt there'll be any repercussions from changing to seaborn <= 0.9.0, so it should be okay. @w4nderlust

@w4nderlust
Collaborator

I think it is worth fixing the specific function call and testing against seaborn 0.10; we don't want to stay hooked to older versions.

@w4nderlust w4nderlust added the bug Something isn't working label Jan 28, 2020
@jimthompson5802
Collaborator Author

@msaisumanth @w4nderlust
Initially I did try to figure out how to fix the function call, but I got lost trying to understand some of the required internal data structures. I'm still willing to try to fix the function call to support seaborn 0.10.0; I may need some help with questions about the data structures for the 1_vs_all plot.

While I work on the fix to support seaborn 0.10.0, I'll submit the PR to cap the version of seaborn installed for Travis CI at 0.9.0. This way folks should get a clean run when submitting a PR in the near term.

Once the seaborn 0.10.0 issue is fixed, a follow-up PR will back out the '<=0.9.0' cap on the seaborn version.

@w4nderlust
Collaborator

Makes sense @jimthompson5802 , thank you!

@w4nderlust
Collaborator

I actually added the cap myself. I'll keep the issue open anyway, because I prefer to adapt to whatever API change seaborn made in 0.10.0 as a more sustainable solution moving forward.

@w4nderlust w4nderlust reopened this Jan 28, 2020
@jimthompson5802
Collaborator Author

OK, I'll focus on the long-term fix to support seaborn 0.10.0.

@w4nderlust
Collaborator

I've been working on this a bit, made some discoveries, and pushed some code: 997c6bc
The first discovery is that the number of classes was wrong, so I fixed it.
The second discovery is that there was no way to tell the plot function to draw a line, so now I check how many points are actually there and set the order accordingly.
Third discovery: calibration_curve returns lists of whatever length it wants even if you specify a specific number of bins, so now I add a (0, 0) point at the beginning so that there's at least one point.
Fourth discovery: the API version of the test passes, but the CLI version does not, so there's something fishy there.
I wasn't able to fix the problem anyway; it goes deep into seaborn's linear algebra functions and, to be honest, it's pretty hard to figure out because of a pretty arcane error:

ValueError: On entry to DLASCL parameter number 4 had an illegal value

It is probably related to the fact that there are NaNs in the matrix to invert, but I'm not 100% sure how NaNs end up there in the CLI case and not in the API case.
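
For readers following along, here is a rough sketch of the second and third adjustments described above. This is not the exact content of 997c6bc; the variable names and the placeholder data are assumptions used only for illustration.

import numpy as np
import seaborn as sns

# assumed calibration points for one class (placeholder data, not from the test)
mean_predicted_values_i = np.array([0.45, 0.70, 0.90])
fraction_positives_i = np.array([0.40, 0.75, 0.95])

# prepend a (0, 0) point so there is always at least one point on the curve
x = np.concatenate([[0.0], mean_predicted_values_i])
y = np.concatenate([[0.0], fraction_positives_i])

# cap the polynomial order by the number of points actually available
order = min(3, len(x) - 1)
sns.regplot(x=x, y=y, order=order, fit_reg=True)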

@jimthompson5802
Collaborator Author

I pulled down your commit 997c6bc and can reproduce the symptoms. Right now I'm focusing on understanding the data flows into seaborn's regplot() function.

@jimthompson5802
Collaborator Author

@w4nderlust I instrumented the function calibration_1_vs_all() in ludwig/visualize.py.

Re: your observation that the API test succeeds but the CLI version fails with the error ValueError: On entry to DLASCL parameter number 4 had an illegal value:

From what I can tell, the API and CLI tests create the same synthetic data set.

In the case of the CLI test, the probabilities passed to the calibration_1_vs_all() function are:
[screenshot: probabilities_vis_cli_fails]

In the case of the API test, the probabilities passed to calibration_1_vs_all() are:
[screenshot: probabilities_vis_api_works]

I was expecting the structures of the two probabilities matrices to be similar, but they seem to be different. What should the structure of the probabilities matrix be? Since the API version works, I'm thinking the first two columns contain the predicted probability for that particular class, and the 3rd column could be the probability of the predicted class.

Does this observation point to a direction to resolve this issue?

@w4nderlust
Collaborator

w4nderlust commented Jan 30, 2020

The structure is the same, although the distributions are very different. In the second case there's something weird, as they don't look like probabilities, i.e. the rows don't sum up to 1... But that's the one that works, surprisingly enough.
One potential weak point is calibration_curve: sometimes it returns vectors of size 21 as requested (the number of bins) and sometimes it doesn't, so maybe that function is doing something weird. Also, it has a normalize parameter that, if turned on, gives some errors. Maybe this is a good starting point for investigation.
Finally, it seems like even with the modified version of seaborn there's still one failing test case. Will look into it.
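
As a small illustration of that weak point (assuming the calibration_curve in question is sklearn.calibration.calibration_curve, and using made-up data rather than the test's synthetic data set): empty bins are simply dropped, so the returned arrays can be shorter than the requested number of bins.

import numpy as np
from sklearn.calibration import calibration_curve

# made-up labels and probabilities that cluster into just a few bins
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.08, 0.10, 0.82, 0.85, 0.90])

fraction_positives, mean_predicted_values = calibration_curve(y_true, y_prob, n_bins=21)
print(len(fraction_positives), len(mean_predicted_values))  # far fewer than 21: empty bins are dropped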

@jimthompson5802
Collaborator Author

I understand each row in the probabilities matrix should sum to one. Although the CLI version fails, it conforms to this requirement (within rounding). The API version, which does not fail, does not conform to this requirement.
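
A quick way to check that property on any probabilities array (a trivial sketch; probabilities here is just a placeholder standing in for the matrix passed to calibration_1_vs_all()):

import numpy as np

# placeholder matrix standing in for the probabilities passed to calibration_1_vs_all()
probabilities = np.array([[0.1, 0.7, 0.2],
                          [0.3, 0.3, 0.4]])

# every row of a proper probability matrix should sum to 1 (within rounding)
print(np.allclose(probabilities.sum(axis=1), 1.0))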

From what I can see, the sample training data created for both the API and CLI versions defines a two-category output response variable:

output_features = [category_feature(vocab_size=2, reduce_input='sum')]

Excerpt from description.json:

       "output_features": [
            {
                "dependencies": [],
                "embedding_size": 5,
                "idx2str": [
                    "pvfdrgcrcl",
                    "ICznO"
                ],

        "preprocessing": {
            "category": {
                "fill_value": "<UNK>",
                "lowercase": false,
                "missing_value_strategy": "fill_with_const",
                "most_common": 10000
            },

I'm now trying to understand what probabilities are represented in each of the three columns. If the output variable has two valid values, is the third probability (column 0) for the <UNK> missing value?
[screenshot: probabilities_vis_cli_fails]

Assuming the three probabilities are as I described above, is the expected output the following: one set of charts for each of the three columns in the probability matrix?
[screenshot: expected set of calibration charts]

@jimthompson5802
Collaborator Author

@w4nderlust I have an explanation for the issue you pointed out re: the API test:

In the second case there's something weird, as they don't look like probabilities, i.e. the rows don't sum up to 1... But that's the one that works, surprisingly enough.

I believe the issue is due to an error in the Experiment class in test_visualization_api.py in this statement (line 85)

 self.probability = self.test_stats_full[0].iloc[:, 2:].values

[screenshot: Experiment_code_collect_probabilities]

This is the self.test_stats_full[0] data structure. The .iloc[:, 2:] slice refers to the last three columns in the data structure.
[screenshot: test_set_probabilities]

For this particular case, I believe the correct specification should have been .iloc[:, 1:4], which refers to the 2nd, 3rd, and 4th columns in the data structure. From eyeballing the numbers, those three numbers add up to 1, within rounding.

Unless I hear otherwise, I'll make the change to refer to the correct columns. I'll make sure the modification handles any number of probabilities and is not hard-coded to three columns.
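
A sketch of how the generalized fix could look (the assumed column layout, prediction first and then one probability column per class, is my reading of the screenshot above; extract_class_probabilities is a hypothetical helper, not existing Ludwig code):

import pandas as pd

def extract_class_probabilities(test_stats_df: pd.DataFrame, num_classes: int):
    # skip column 0 (the prediction) and take the per-class probability
    # columns 1 .. num_classes, instead of the original iloc[:, 2:]
    return test_stats_df.iloc[:, 1:num_classes + 1].values

For the two valid categories plus <UNK>, num_classes would be 3, reproducing the .iloc[:, 1:4] slice above.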

@jimthompson5802
Collaborator Author

After making the change described above, the probabilities for the API test case now have the correct structure, i.e., the rows sum to one.
[screenshot: corrected probabilities matrix]

With this change, the API unit test now fails just like the CLI unit test.

I think this is progress because there is now a consistent error. :-)

@jimthompson5802
Collaborator Author

@w4nderlust I may have found the root cause of this issue. In a nutshell, the issue occurs when generating calibration plots for the <UNK> class. If this class is bypassed, then both the API and CLI 1_vs_all calibration test plots complete successfully.

From what I can see, the plot points for the <UNK> class consist only of (0, 0). I think this causes the issues in generating the plots.

After eliminating the <UNK> class, here is the pytest log with seaborn 0.10.0 installed:

root@9464369566f1:/opt/project# pip list | grep seaborn
seaborn              0.10.0
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
root@9464369566f1:/opt/project# pytest tests
============================= test session starts ==============================
platform linux -- Python 3.6.8, pytest-5.3.4, py-1.8.1, pluggy-0.13.1
rootdir: /opt/project
collected 77 items

tests/integration_tests/test_api.py .                                    [  1%]
tests/integration_tests/test_experiment.py ...................           [ 25%]
tests/integration_tests/test_server.py .                                 [ 27%]
tests/integration_tests/test_visualization.py .......................... [ 61%]
                                                                         [ 61%]
tests/integration_tests/test_visualization_api.py ...................... [ 89%]
                                                                         [ 89%]
tests/ludwig/models/modules/test_encoder.py ....                         [ 94%]
tests/ludwig/utils/test_data_utils.py .                                  [ 96%]
tests/ludwig/utils/test_image_utils.py ..                                [ 98%]
tests/ludwig/utils/test_normalization.py .                               [100%]

<<<deleted warning messages>>>
================ 77 passed, 7498 warnings in 357.47s (0:05:57) =================

Is it necessary to generate the <UNK> calibration plots?

If we eliminate the <UNK> class, then I have to ask: what do we do about the probabilities associated with it?

Let me know if the fix is reasonable. If it is, I'll submit the PR for review.

@jimthompson5802
Collaborator Author

For completeness, I tested the recommended fix with seaborn 0.9.0 as well. All tests pass.

root@87cea02dd905:/opt/project# pip list | grep seaborn
seaborn              0.9.0
WARNING: You are using pip version 19.3.1; however, version 20.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
root@87cea02dd905:/opt/project# pytest tests
============================= test session starts ==============================
platform linux -- Python 3.6.8, pytest-5.3.4, py-1.8.1, pluggy-0.13.1
rootdir: /opt/project
collected 77 items

tests/integration_tests/test_api.py .                                    [  1%]
tests/integration_tests/test_experiment.py ...................           [ 25%]
tests/integration_tests/test_server.py .                                 [ 27%]
tests/integration_tests/test_visualization.py .......................... [ 61%]
                                                                         [ 61%]
tests/integration_tests/test_visualization_api.py ...................... [ 89%]
                                                                         [ 89%]
tests/ludwig/models/modules/test_encoder.py ....                         [ 94%]
tests/ludwig/utils/test_data_utils.py .                                  [ 96%]
tests/ludwig/utils/test_image_utils.py ..                                [ 98%]
tests/ludwig/utils/test_normalization.py .                               [100%]

<<<< deleted warning lines>>>>

================ 77 passed, 7446 warnings in 387.05s (0:06:27) =================


@jimthompson5802
Collaborator Author

@w4nderlust I peeled back another layer of the onion. I now understand why the <UNK> calibration plot fails. The issue is related to how the regression polynomial's order is calculated. For <UNK>, only two points are passed to seaborn's regplot() function. Currently, the regression polynomial's order is calculated as

order = min(3, len(mean_predicted_values[i] - 1))

This results in order = 2, which means fitting a quadratic to two points, and the regplot() function fails.

OTOH if order is calculated as follows:

order = min(3, len(mean_predicted_values[i]) - 1)

This results in order = 1. With this setting, regplot() completes. This method results in the following (a short illustration of why the parenthesis placement matters follows the list):

  • 2 points, order=1
  • 3 points, order=2
  • 4 or more points, order=3
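
Here is the illustration referenced above, with made-up two-point data for the <UNK> class. Because NumPy broadcasts the subtraction, len(arr - 1) equals len(arr), so the misplaced parenthesis silently requests one order too many:

import numpy as np

# two calibration points for the <UNK> class (placeholder values)
mean_predicted_values_i = np.array([0.0, 0.6])

# buggy: the `- 1` is applied element-wise inside len(), so the length is unchanged
buggy_order = min(3, len(mean_predicted_values_i - 1))    # min(3, 2) == 2, a quadratic on 2 points
# fixed: subtract 1 from the length itself
fixed_order = min(3, len(mean_predicted_values_i) - 1)    # min(3, 1) == 1, a straight line

print(buggy_order, fixed_order)  # 2 1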

If it makes sense to print the calibration plot for the <UNK> class, it is now possible. I just need some guidance on the approach to take.

For the record, here are the plots generated using the modified order calculation:

[plots: calibration_1_vs_all_0.png, calibration_1_vs_all_1.png, calibration_1_vs_all_2.png]

@w4nderlust
Collaborator

w4nderlust commented Feb 1, 2020

This results in order = 1. With this setting, regplot() completes. This method results in the following:

  • 2 points, order=1
  • 3 points, order=2
  • 4 or more points, order=3

Yes, this is exactly the intended behavior. You spotted the bug of the -1 being inside len() when it should have been outside. Great job!
