
Change POD default rtol and fix analyze_pickle demo for numpy master #1012

Merged
sdrave merged 7 commits into master from fix_analyze_pickle_numpy on Jul 16, 2020

Conversation

sdrave
Member

@sdrave commented Jul 13, 2020

Currently, in analyze_pickle, set_xlim/set_ylim gets passed a 1-d array of length 1 for one of the limits, causing an error in matplotlib with the current numpy master branch. This is fixed by this PR.
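
For context, here is a minimal illustration of the failure mode, not the actual demo code; the array name and limit values are made up:

import numpy as np
import matplotlib.pyplot as plt

errors = np.array([1e-3])        # length-1 array, e.g. the result of a reduction
fig, ax = plt.subplots()

# Passing the array directly, e.g. ax.set_ylim(errors, 1.0), relies on matplotlib
# converting the size-1 array to a scalar, which errors out with current numpy master.
# Extracting the scalar explicitly avoids the problem:
ax.set_ylim(errors.item(), 1.0)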

@sdrave added the pr:fix (Fixes a bug) label Jul 13, 2020
@sdrave added this to the 2020.1 milestone Jul 13, 2020
@sdrave requested a review from renefritze July 13, 2020 08:22
@sdrave mentioned this pull request Jul 13, 2020
@renefritze force-pushed the fix_analyze_pickle_numpy branch from 1fc790c to 7253cda on July 13, 2020 21:46
@renefritze
Member

Looking into https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/369546 and https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/369541
Second one looks like a pip error where it ignores the download url if a #egg[full] is appended and instead installs from pypi, hence picking up the pyqt dep from our last release which is not in our current mirror any more 😮

@sdrave
Member Author

sdrave commented Jul 14, 2020

If you can't quickly resolve this issue, maybe it would be better to disable the numpy git build for now? It's blocking everything else, and does not seem critical for the next release.

@renefritze
Member

If you can't quickly resolve this issue, maybe it would be better to disable the numpy git build for now? It's blocking everything else, and does not seem critical for the next release.

The numpy git build is fine.

With the last commit, the second issue I linked should be gone. Hopefully I'll get the first one too in the next few hours (iterating Docker builds is sadly somewhat slow).

@renefritze
Member

Looking into https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/369546 and https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/369541
Second one looks like a pip error where it ignores the download url if a #egg[full] is appended and instead installs from pypi, hence picking up the pyqt dep from our last release which is not in our current mirror any more

These problems are now out of the way. Once this PR is merged, all open PRs should rebase on master, @pymor/pymor-devs

@github-actions bot force-pushed the fix_analyze_pickle_numpy branch from e8a78ad to 2e50560 on July 14, 2020 13:18
@renefritze self-assigned this Jul 14, 2020
@sdrave
Member Author

sdrave commented Jul 14, 2020

@renefritze, I have just fixed the build failure for test_project_array. Will push a fix into this branch soon.

@codecov

codecov bot commented Jul 14, 2020

Codecov Report

Merging #1012 into master will decrease coverage by 0.00%.
The diff coverage is n/a.

Impacted Files Coverage Δ
src/pymor/vectorarrays/numpy.py 84.31% <0.00%> (-0.25%) ⬇️

@sdrave
Member Author

sdrave commented Jul 14, 2020

Here, again, is the build failure I cannot reproduce (even with Docker).

@renefritze
Member

Here, again, is the build failure I cannot reproduce (even with Docker).

Saw it and restarted. Not the time or place to block this PR for that.

@sdrave force-pushed the fix_analyze_pickle_numpy branch from a88b3dd to d15cf3b on July 14, 2020 15:44
@renefritze
Member

Or maybe this is the time after all since restarts didn't make it go away.
There are https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/371196#L1663, https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/371212 and https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/371195, which seem directly related to your last commit, @sdrave?

I tried to replicate the other one you linked earlier. I could not. AFAICT I had eliminated sources of explicit randomness.

@sdrave
Member Author

sdrave commented Jul 14, 2020

I just pushed another change giving test_project_array some more headroom.

Regarding test_pod: whenever it pops up, I try to reproduce it using reproduce_failure, but cannot. Judging from the string representation, the generated data seems identical, so I really have no clue how to debug this.

@sdrave force-pushed the fix_analyze_pickle_numpy branch from d15cf3b to 50bd56b on July 14, 2020 16:15
@renefritze
Member

Did you perchance notice a pattern regarding which runner the pipelines fail on, @sdrave?

@renefritze
Member

I've now managed to get a local failure on the pod test. Can you try to @reproduce_failure('5.19.2', b'AAABAWAAAAEfUXZ/k/oABXYfUZOAAAA=') please?

local failure means on my machine. And then:

Ok, I think this confirms my suspicion: ZIV Runner cannot reproduce: https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/371606
Our private machine does reproduce: https://zivgitlab.uni-muenster.de/pymor/pymor/-/jobs/371607
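
For anyone who wants to replay this locally, this is roughly how the decorator printed by Hypothesis gets applied (just a sketch; the module path and the abbreviated test signature are taken from the traceback quoted further down):

# in src/pymortests/algorithms/pod.py, directly above the test's existing decorators
from hypothesis import reproduce_failure

@reproduce_failure('5.19.2', b'AAABAWAAAAEfUXZ/k/oABXYfUZOAAAA=')
# ... the existing @given/@settings decorators of test_pod stay as they are ...
def test_pod(vector_array, method):  # signature abbreviated
    ...

Then running py.test src/pymortests/algorithms/pod.py -k test_pod replays exactly this example.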

@renefritze
Member

So in method_of_snapshots, the eigenvalues output by scipy.linalg.eigh in this case for me are [ 4.64099021e+01 8.05439078e-14 5.21804822e-15 5.10702591e-15, ..., -1.3766765505351941e-14]
With a computed tol of 7.425584340909056e-14 I get two entries in above_tol
The returned POD is [[-9.79432271e-01-2.01773207e-01j], [-1.65309757e-08-3.46335582e-09j]] with SVALS = [6.81248135e+00 2.83802586e-07]
and in the following gram_schmidt the last scaling happens with norm = 0.0, hence the inf and nan.
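
For readers following the numbers: the quoted tol is consistent with a cut-off of rtol**2 times the largest eigenvalue (an assumption on my part that matches the values above; plain NumPy, not the pyMOR code):

import numpy as np

evals = np.array([4.64099021e+01, 8.05439078e-14, 5.21804822e-15, 5.10702591e-15])
rtol = 4e-8                      # the default rtol discussed below

tol = rtol**2 * evals.max()      # ~7.4256e-14, matching the tol quoted above
above_tol = evals > tol
print(tol, above_tol.sum())      # 2 entries above tol, hence 2 POD modes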

@renefritze
Member

So there's definitely something weird if not nondeterministic happening in eigh atm.
Plus I guess gram_schmidt also should never scale if the norm is zero.

@renefritze
Member

I've tried all the applicable LAPACK drivers for eigh now. 'ev', 'evr' and the default 'evd' all result in two modes. 'evx' segfaults.
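
For reference, this is roughly how the drivers can be compared; the driver keyword of scipy.linalg.eigh exists in recent SciPy releases, and the random matrix below is only a stand-in for the Gramian of the failing example:

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 22))      # stand-in snapshot data
G = A.conj().dot(A.T)                 # Gramian, as formed in method_of_snapshots

for driver in ('ev', 'evd', 'evr', 'evx'):   # 'evx' is the one that segfaulted here
    evals = eigh(G, eigvals_only=True, driver=driver)
    print(driver, np.sort(evals)[::-1][:3])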

@sdrave what CPU you got in your machine?

@renefritze
Member

So there's definitely something weird if not nondeterministic happening in eigh atm.

Unless that result is actually the same for everyone?

@sdrave
Member Author

sdrave commented Jul 15, 2020

So in method_of_snapshots, the eigenvalues output by scipy.linalg.eigh in this case for me are [ 4.64099021e+01 8.05439078e-14 5.21804822e-15 5.10702591e-15, ..., -1.3766765505351941e-14]
With a computed tol of 7.425584340909056e-14 I get two entries in above_tol

My first eigenvalues are 4.64099021e+01 and 6.28570143e-14, and the computed tolerance is 7.425584340909018e-14 (so the first EV is actually already slightly different), so the second EV is dropped.

So in my case the test succeeds, but barely. It seems like the default rtol of 4e-8 is maybe too small to get rid of all the numerical garbage. So a simple solution could be to increase it slightly, to 1e-7, say. I'm not sure, however, what accuracy of eigh can be expected in the first place. @pmli, @lbalicki, do you know?
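
To make the proposed change concrete, here is the same back-of-the-envelope cut-off (rtol**2 times the largest eigenvalue, as assumed above) for both machines and both tolerances:

lambda_max = 4.64099021e+01
second_ev = {'renefritze': 8.05439078e-14, 'sdrave': 6.28570143e-14}

for rtol in (4e-8, 1e-7):
    tol = rtol**2 * lambda_max
    print(rtol, tol, {who: ev > tol for who, ev in second_ev.items()})
# rtol=4e-8 -> tol ~ 7.43e-14: the 8.05e-14 eigenvalue survives, 6.29e-14 is dropped
# rtol=1e-7 -> tol ~ 4.64e-13: both garbage eigenvalues are dropped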

So there's definitely something weird if not nondeterministic happening in eigh atm.
Plus I guess gram_schmidt also should never scale if the norm is zero.

I think that's OK. You should never set atol = rtol = 0 except when you want to be sure that nothing gets dropped, not even vectors that are (numerically) zero. If you do get zeros, then producing NaNs seems like valid behavior.
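
A trivial NumPy illustration of why a zero norm produces the inf and nan seen above (not pyMOR code):

import numpy as np

v = np.zeros(3)
norm = np.linalg.norm(v)   # 0.0
print(1.0 / norm)          # RuntimeWarning: divide by zero -> inf
print(v / norm)            # RuntimeWarning: invalid value  -> [nan nan nan]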

@sdrave what CPU you got in your machine?

That might be a possible explanation: I'm on a Core i5 760, very old stuff. The newest SIMD extension it supports seems to be SSE 4.2. So it is very well possible that a different code path is selected in eigh. I will check again on my laptop.

@renefritze
Member

I've tested on a Xeon E5-1630 v3 and a Ryzen Pro 3700U; both fail. Could be AVX vs. SSE then. http://www.cpu-world.com/Compare_CPUs/AMD_YM370BC4T4MFG,Intel_BV80605001908AN,Intel_CM8064401614501/

@sdrave
Member Author

sdrave commented Jul 15, 2020

I can reproduce the failure on my Core i5-3320M, which has AVX (v1?)

@renefritze
Member

renefritze commented Jul 15, 2020

Tried it with the adjusted tolerance, got

@reproduce_failure('5.19.2', b'AAAFAV0AAAl8HnP9gHkADChY4UIIAAE=')
pytest_1        | Falsifying example: test_pod(
pytest_1        |     vector_array=NumpyVectorArray(
pytest_1        |         [[0.00797201 0.00797201 0.00797201 ... 0.00797201 0.00797201 0.00797201]
pytest_1        |          [0.00797201 0.00797201 0.00797201 ... 0.00797201 0.00797201 0.00797201]
pytest_1        |          [0.00797201 0.00797201 0.00797201 ... 0.00797201 0.00797201 0.00797201]
pytest_1        |          ...
pytest_1        |          [0.00797201 0.00797201 0.00797201 ... 0.00797201 0.00797201 0.00797201]
pytest_1        |          [0.00797201 0.00797201 0.00797201 ... 0.00797201 0.00797201 0.00797201]
pytest_1        |          [0.00797201 0.00797201 0.00797201 ... 0.00797201 0.00797201 0.00797201]],
pytest_1        |         NumpyVectorSpace(22)), method='method_of_snapshots',
pytest_1        | )
pytest_1        | Traceback (most recent call last):
pytest_1        |   File "/pymor/src/pymortests/algorithms/pod.py", line 33, in test_pod
pytest_1        |     U, s = pod(A, method=method, orth_tol=orth_tol)
pytest_1        |   File "/pymor/src/pymor/core/defaults.py", line 236, in defaults_wrapper
pytest_1        |     return decorated_function(**kwargs)
pytest_1        |   File "/pymor/src/pymor/algorithms/pod.py", line 90, in pod
pytest_1        |     gram_schmidt(POD, product=product, atol=0., rtol=0., copy=False)
pytest_1        |   File "/pymor/src/pymor/core/defaults.py", line 236, in defaults_wrapper
pytest_1        |     return decorated_function(**kwargs)
pytest_1        |   File "/pymor/src/pymor/algorithms/gram_schmidt.py", line 118, in gram_schmidt
pytest_1        |     raise AccuracyError(f"result not orthogonal (max err={err})")
pytest_1        | pymor.core.exceptions.AccuracyError: result not orthogonal (max err=inf)
pytest_1        | 

Edit: that's different input

@sdrave
Member Author

sdrave commented Jul 15, 2020

Now things get even stranger. With this decorator I get an entirely different VectorArray:

ListVectorArray('??', NumpyListVectorSpace(5))

even with Docker. (I run py.test pod.py, but I would assume that this should not matter.)

@renefritze
Member

Now things get even stranger. With this decorator I get an entirely different VectorArray:

Sorry, I might have had multiple failures and copied the wrong one.

@sdrave
Member Author

sdrave commented Jul 15, 2020

Now things get even stranger. With this decorator I get an entirely different VectorArray:

Sorry, I might have had multiple failures and copied the wrong one.

Ok. For the above hash, I don't see a test failure (on my laptop). Also I'm surprised about the error. Are you sure you have changed the right tolerance? (rtol of pod not of method_of_snapshots)

@pmli
Member

pmli commented Jul 15, 2020

The first few eigenvalues I get on my machine are 4.64099021e+01 and 1.03243091e-13, so the rtol should then be at least 1/2.120190e+07 ≈ 4.72e-08...

Right now, I can't find the error analysis results for symmetric eigenvalue problem solvers, but I expect rtol would have to depend on the size of the problem. E.g., the relative tolerance for computing the matrix rank of a matrix M in NumPy (and Matlab) is max(M.shape) * eps.
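
For comparison, the NumPy rank tolerance mentioned above, written out (this mirrors how numpy.linalg.matrix_rank chooses its default tolerance; applying the same idea to the eigenvalue cut-off here is only a suggestion):

import numpy as np

def default_rank_tol(M, svals):
    # default tolerance of np.linalg.matrix_rank:
    # largest singular value * max(M.shape) * machine epsilon
    return svals.max() * max(M.shape) * np.finfo(M.dtype).eps

M = np.random.default_rng(0).standard_normal((22, 22))
svals = np.linalg.svd(M, compute_uv=False)
print(default_rank_tol(M, svals), np.linalg.matrix_rank(M))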

Maybe for the first failure, an easy fix would be to limit the number of modes, e.g.,

modes = min(len(A), A.dim, modes if modes else np.inf)

I'm not sure about the dimensions of the second failure @renefritze found.

@pmli
Member

pmli commented Jul 15, 2020

BTW, if someone wants to get the data for the first failure, it is in A.zip, which contains A.npy with a NumPy array that can be loaded with A = np.load('A.npy').

@renefritze
Member

Now things get even stranger. With this decorator I get an entirely different VectorArray:

Sorry, I might have had multiple failures and copied the wrong one.

Ok. For the above hash, I don't see a test failure (on my laptop). Also I'm surprised about the error. Are you sure you have changed the right tolerance? (rtol of pod not of method_of_snapshots)

Yeah, that was not clear. If I change the rtol in pod there's no failure, even with a huge set of examples.

@pmli
Member

pmli commented Jul 15, 2020

The first few eigenvalues I get on my machine are 4.64099021e+01 and 1.03243091e-13, so the rtol should then be at least 1/2.120190e+07 ≈ 4.72e-08...

I should correct this. I got this when using A.conj() @ A.T. If I use A.conj().dot(A.T), as done by NumpyVectorArray, then I get a very similar result to what @renefritze got (4.64099021e+01 and 8.05867871e-14). Interestingly, using @ gives an exactly Hermitian matrix, while dot does not (but its Hermitian part is exactly equal to what @ gives).
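
In case someone wants to check this on the attached data, a small sketch (assuming A.npy holds the snapshot array with one snapshot per row, as in the expressions above):

import numpy as np

A = np.load('A.npy')                    # data from the attached A.zip
G_at = A.conj() @ A.T                   # via the @ operator
G_dot = A.conj().dot(A.T)               # via .dot, as NumpyVectorArray does

print(np.array_equal(G_at, G_at.conj().T))                  # exactly Hermitian
print(np.array_equal(G_dot, G_dot.conj().T))                # here reportedly not exact
print(np.array_equal(G_at, (G_dot + G_dot.conj().T) / 2))   # Hermitian parts agree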

@renefritze
Member

I'm testing the new tolerance here

@lbalicki
Member

The test also fails for me and I get the two largest eigenvalues 1.03243145e-13 and 4.64099021e+01. I found that symmetric eigenvalue problems in LAPACK are discussed here.

It seems like the condition number of the matrix whose eigenvalues are computed in the failing test is very high. As far as I understand we can't really expect accurate results for this kind of matrix, since error bounds depend on the condition number.
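
For reference, the usual LAPACK-style bound for the symmetric/Hermitian eigenproblem has the form (with p(n) a modestly growing function of the matrix size):

\[
  |\hat{\lambda}_i - \lambda_i| \le p(n)\,\varepsilon\,\|A\|_2
                                  = p(n)\,\varepsilon\,\max_j |\lambda_j|,
\]

so eigenvalues on the order of eps * max_j |lambda_j| ~ 1e-14 here carry no relative accuracy at all, which fits the garbage values around 1e-13 seen above.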

I think changing the default tolerance in pod seems reasonable.

@renefritze
Member

renefritze commented Jul 15, 2020

I'm testing the new tolerance here

That went through OK and since we all seem to be OK with adjusting it, I cherry-picked the tol change to this PR.

@renefritze
Member

Any objections to merging this now, @pymor/pymor-devs ?

@sdrave changed the title from "Fix analyze_pickle demo for numpy master" to "Change POD default rtol and fix analyze_pickle demo for numpy master" Jul 16, 2020
@sdrave added the pr:change (Change in existing functionality) label Jul 16, 2020
@sdrave
Member Author

sdrave commented Jul 16, 2020

I have changed the default rtol for method_of_snapshots accordingly. I'm fine with merging; however, we should continue the discussion regarding the 'right' default tolerance somewhere else.
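
For anyone who needs the old behavior, pyMOR defaults can also be overridden at runtime via pymor.core.defaults.set_defaults; the exact defaults key below is an assumption (it follows the usual 'module.function.parameter' naming) and may need adjusting:

from pymor.core.defaults import set_defaults

# hypothetical key, following pyMOR's 'module.function.parameter' convention
set_defaults({'pymor.algorithms.pod.pod.rtol': 4e-8})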

@sdrave merged commit 3204424 into master Jul 16, 2020
@sdrave deleted the fix_analyze_pickle_numpy branch July 16, 2020 09:12