
[MRG+1] OPTICS Change quick_scan floating point comparison to isclose #11929

Merged: 24 commits merged into scikit-learn:master from adrinjalali:optics32_pr on Sep 16, 2018

Conversation
@adrinjalali (Member) commented Aug 28, 2018

See #11857, #11916, #11878, and MacPython/scikit-learn-wheels#7

The part of the fix that I believe is clearly valid is the change to the comparison between two floats.

Other than that, scaling up the locations of the points and reducing the number of points in each group fixes the test issue while still keeping the test valid (IMO).

Does this look okay @jnothman, @espg ?

@jnothman (Member) left a comment

Yes, I'm happy with this

@adrinjalali changed the title from "Fix quick_scan and reachability test for OPTICS" to "[MRG+1] Fix quick_scan and reachability test for OPTICS" on Aug 29, 2018

@adrinjalali force-pushed the adrinjalali:optics32_pr branch from 7fbb955 to 2684ce3 on Aug 29, 2018

@@ -190,226 +189,65 @@ def test_reach_dists():
# Expected values, matches 'RD' results from:
# http://chemometria.us.edu.pl/download/optics.py

@lesteve (Member) commented Sep 1, 2018

Have you made sure the results indeed match with this implementation?

@adrinjalali (Author, Member) commented Sep 1, 2018

@lesteve I actually had not; somehow I missed this comment. But I did now, and the results are not the same, not even close actually. I also checked whether the values I'm replacing match the output of that code, and they don't match either. So at some point in time, either somebody forgot to check against the code provided at that URL, or the code at that URL has changed and no longer matches what it used to be. @espg any idea?

@espg commented Sep 4, 2018

I'll check for the IPython notebook where I generated the values; I think I remember having to sort the output by the ordering indexes to get them to match...

C4 = [-2, 3] + .3 * rng.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * rng.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * rng.randn(n_points_per_cluster, 2)
n_points_per_cluster = 50

@qinhanmin2014 (Member) commented Sep 3, 2018

Why did you change the test here? I'm fine with reducing n_points_per_cluster, but .8 * rng.randn(n_points_per_cluster, 2) -> 8 * rng.randn(n_points_per_cluster, 2) seems strange. It's hard to form a cluster with your version.

@adrinjalali (Author, Member) commented Sep 3, 2018

The issue was somewhat related to floating point precision, which is why we had the idea to scale up the data. But you're right: to scale the data I also need to scale the centers. I'll do so.
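A minimal sketch of what scaling both the spread and the center could look like (an editorial illustration, not the final test code; the factor of 100 and the reuse of C4 from the excerpt above are assumptions):

import numpy as np

rng = np.random.RandomState(0)
n_points_per_cluster = 50
scale = 100  # hypothetical common factor applied to center and spread

# Scaling only the spread (.3 -> 30) stretches the cluster without moving its
# center, so clusters start to overlap; scaling both keeps the same geometry
# in larger units:
C4 = scale * np.array([-2, 3]) + scale * 0.3 * rng.randn(n_points_per_cluster, 2)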

@qinhanmin2014 (Member) commented Sep 3, 2018

How about keeping the original test? Or is it that the original test cannot detect your problem, but the new test can?

@qinhanmin2014 (Member) commented Sep 3, 2018

Also, I don't think it's good to get the expected result from our own implementation. Where does your expected result come from?

@adrinjalali (Author, Member) commented Sep 3, 2018

The original test fails on 32 bit systems.

@adrinjalali (Author, Member) commented Sep 3, 2018

The related conversation happened after this comment: #11857 (comment)

@qinhanmin2014 (Member) commented Sep 3, 2018

> The original test fails on 32 bit systems.

I assumed this PR aims to solve that problem; it seems I'm wrong? So is it impossible to pass the original test on a 32-bit system? If so, is it possible to only reduce n_points_per_cluster? If not, then I think it's acceptable to scale up the data, as long as you get the expected result from another implementation (e.g., R).

@adrinjalali (Author, Member) commented Sep 3, 2018

I could do that, but are you opposed to changing the == to np.isclose when comparing floats?

@qinhanmin2014 (Member) commented Sep 3, 2018

I can't find any detailed information about the code from http://chemometria.us.edu.pl/download/optics.py. Is it cited or mentioned elsewhere? Maybe it's better to compare with R?

@qinhanmin2014 (Member) commented Sep 3, 2018

> are you opposed to changing the == to np.isclose

No, we've encountered similar problems (e.g., #11551).

It seems that the old test and the new test pass on master and in this PR (correct me if I'm wrong). So I'd prefer a +1/-1 PR unless you provide me with your reason to change the test.

But there's another problem here. It seems that the performance of OPTICS suffers if we use np.isclose in the Cython code. I'm not sure what the solution is (maybe some lower-level function).

%timeit test_reach_dists()  # using ==
# 1 loop, best of 3: 1.11 s per loop
%timeit test_reach_dists()  # using np.isclose
# 1 loop, best of 3: 38 s per loop
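A rough editorial sketch of where that slowdown likely comes from (an assumption based on how np.isclose works, not profiling from this PR): np.isclose is a NumPy array function, so calling it on two scalars inside the quick_scan loop pays array-wrapping and dispatch overhead on every call, while a plain scalar comparison does not.

import math
import timeit
import numpy as np

a, b = 1.0, 1.0 + 1e-13  # roughly the magnitude of difference seen on 32-bit

# Per-call cost of each scalar comparison (illustrative micro-benchmark):
print(timeit.timeit(lambda: a == b, number=100000))
print(timeit.timeit(lambda: math.isclose(a, b, rel_tol=1e-9), number=100000))
print(timeit.timeit(lambda: np.isclose(a, b, rtol=1e-9), number=100000))

A scalar formula in the spirit of math.isclose (PEP 485) avoids that per-call overhead, which is the direction the PR takes with the Cython helper discussed below.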
@adrinjalali (Author, Member) commented Sep 3, 2018

The tests pass because we don't build/test for 32-bit systems on this repo. That's why @jnothman sent the quick fix in #11916.

The issue is not fixed yet on master.

On top of that, the current test on master fails on a 32-bit arch even with the np.isclose proposed here. That's why we're changing the test.

I'm trying to figure out a faster way than np.isclose, unless you have a solution in mind.

@adrinjalali (Author, Member) commented Sep 3, 2018

Regarding the np.isclose performance issue: it seems that if we replace it with a Cython version of what's proposed in PEP 485, we don't pay any performance penalty.

I've put that Cython version in this PR, since the standard-library math.isclose is only available from Python 3.5.

@qinhanmin2014 (Member) commented Sep 4, 2018

> Brian Clowers' code is a port of Michal Daszykowski's MATLAB code; see here and here. Michal Daszykowski has two publications that reference or use his OPTICS implementation, and his implementation has its own DOI (DOI: 10.13140/RG.2.1.3998.3843).

Thanks @espg (though I don't have time to read them now); I'm now much more confident about using that implementation.

@jnothman (Member) commented Sep 4, 2018

@adrinjalali (Author, Member) commented Sep 4, 2018

This is what I can add:

  1. Issue related to 32-bit systems
    The issue starts here:
        if len(unproc) > 0:
            dists = pairwise_distances(P, np.take(X, unproc, axis=0),
                                       self.metric, n_jobs=None).ravel()

            rdists = np.maximum(dists, self.core_distances_[point_index])
            new_reach = np.minimum(np.take(self.reachability_, unproc), rdists)
            self.reachability_[unproc] = new_reach

        # Checks to see if everything is already processed;
        # if so, return control to main loop
        if unproc.size > 0:
            # Define return order based on reachability distance
            return(unproc[quick_scan(np.take(self.reachability_, unproc),
                                     dists)])

Comparing the values between the 32-bit and 64-bit archs, it's new_reach that differs between the two (on the order of 1e-13). As a result, quick_scan returns a different index, and the points are then processed in a different order. This is why one of the reachability values ends up significantly different on the two archs. IMO, if we're testing the reachability part of the algorithm, we should also check ordering_; if we do that, we see that there's a significant difference between the two archs, and the test would fail as it is now, no matter what tolerance we set for comparing reachability_. (A small sketch of this order flip is at the end of this comment.)

That means the test fails on a 32-bit arch because of a precision issue, but that precision issue occurs at a point that makes the algorithm follow different paths on the two archs; i.e., the reachability_ values differ because the algorithm processes the points in a different order, not because of a 32-bit floating point error directly in reachability_.

  2. Performance issue
    This PR's isclose implementation in _optics_inner.pyx:
cdef isclose(double a, double b, double rel_tol=1e-09, double abs_tol=0.0):
    return abs(a-b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)

does NOT impose any computation penalty, i.e.:

%timeit test_reach_dists()  # with ==
# 736 ms ± 48 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit test_reach_dists()  # with isclose as implemented in this PR
# 735 ms ± 20.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. Different proposed solutions
    3.1. Add a tolerance to all comparisons
    The first solution I came up with, which isn't a good idea, would be a change like this (f50e778?diff=unified):
    for i from 0 <= i < n:
-       if rdists[i] < rdist:
+       if rdists[i] + 1e-7 < rdist:
            rdist = rdists[i]
            dist = dists[i]
            idx = i
-       if rdists[i] == rdist:
-           if dists[i] < dist:
+       if np.abs(rdists[i] - rdist) < 1e-7:
+           if dists[i] + 1e-7 < dist:
                dist = dists[i]
                idx = i

The above proposal would make the test pass on both archs without any change to the test. But as @jnothman also pointed out (#11857 (comment)):

> I'm not entirely convinced that it's the right fix. It will make it consistent across platforms, but can easily produce a different order from expected.

3.2. Upscale the space of the test by a factor of 100
The idea was to multiply the data by 100, for example, and hope that the distances become large enough for the two archs to give the same results. But it didn't work.

3.3. Reduce the number of points in each cluster
The data is only 2-dimensional, so it may be that some of the randomly generated points are really close to each other. Reducing the number of points in each cluster to 50 makes both archs give the same results (including ordering_). So yes, we don't have to upscale the data if we reduce the number of points to 50. I hadn't checked this scenario earlier due to time pressure; I tested it today and it works.

  4. Summary
  • There seems to be consensus that using isclose is better than == for comparing floats, and since this implementation doesn't lower the performance of the algorithm, it seems reasonable to keep it.
  • Reducing the number of points in each group to 50 makes the algorithm produce the same results on both archs (with or without isclose), while keeping the test reasonable enough to be a valid test.

I really prefer a test that passes regardless of the arch, rather than having two tests. That would mean the only thing left is to get the reachability values from a different implementation rather than our own.

Does it seem reasonable, @espg, @jnothman, @qinhanmin2014?
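A minimal editorial sketch of the order flip described in point 1 above (the values are invented to illustrate the mechanism and are not taken from the test data):

import numpy as np

def quick_scan_index(rdists, dists):
    # Simplified Python rendering of the quick_scan selection rule: pick the
    # smallest reachability distance, breaking ties by the raw distance.
    rdist, dist, idx = np.inf, np.inf, -1
    for i in range(len(rdists)):
        if rdists[i] < rdist:
            rdist, dist, idx = rdists[i], dists[i], i
        elif rdists[i] == rdist and dists[i] < dist:
            dist, idx = dists[i], i
    return idx

dists = np.array([0.7, 0.9])
rd_64bit = np.array([0.5, 0.5])          # exact tie: tie-break on dists picks index 0
rd_32bit = np.array([0.5 + 1e-13, 0.5])  # tiny rounding difference: index 1 wins

print(quick_scan_index(rd_64bit, dists))  # 0
print(quick_scan_index(rd_32bit, dists))  # 1

Once the selected index differs, the traversal order diverges, which is also why replacing == with isclose alone is not enough to make the original test pass on 32-bit.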

@espg commented Sep 4, 2018

I'm fine with using 50 points per cluster if it fixes all of our problems; the only reason 250 was used was to match the unit test to the documentation example. The 'chemometria' OPTICS implementation that I used for the test array is here, although note that it depends on hcluster, which prevents it from being called in the unit test itself. If you'd like me to, I can open a PR that amends the testing array to use 50 points per cluster.

@jnothman (Member) left a comment

I'm very happy with that analysis and the proposal to reduce the number of samples per cluster.

@@ -5,6 +5,9 @@ cimport cython
ctypedef np.float64_t DTYPE_t
ctypedef np.int_t DTYPE

cdef isclose(double a, double b, double rel_tol=1e-09, double abs_tol=0.0):

@jnothman (Member) commented Sep 5, 2018

Use cdef inline rather than relying on the C compiler to work it out.
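A sketch of what that suggestion could look like (an editorial illustration; the bint return type is an assumption, not necessarily the signature that ended up in the PR):

# Marking the helper inline lets Cython emit it as an inline C function, so
# the per-comparison call overhead disappears inside the quick_scan loop.
cdef inline bint isclose(double a, double b,
                         double rel_tol=1e-09, double abs_tol=0.0):
    return abs(a - b) <= max(rel_tol * max(abs(a), abs(b)), abs_tol)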

@qinhanmin2014 (Member) commented Sep 5, 2018

Thanks @adrinjalali and @espg
(1) +1 to replace == with cdef inline isclose, though I still wonder if there's a better way to do such a simple thing than defining a function.
(2) +1 to reduce the number of samples to 50 since it will make it easier for us to construct a test. I'm still not persuaded to change the definition of C1-C6; or at least, @adrinjalali, you should scale up the centers to keep the test reasonable.
(3) +1 to check reachability_ and ordering_, but that's not urgent so it can be done in a separate PR.
(4) @adrinjalali Please make sure that you're generating the expected result using the referenced script.

adrinjalali added some commits Sep 5, 2018

@adrinjalali (Author, Member) commented Sep 5, 2018

@espg I've reverted the test to what's on master, since I couldn't generate the expected values with your script (I checked the ordering issue and a possible missing sqrt, and they still don't match).

Could you please send a PR with 50 points in each group? No need to change C1 to C6 definitions.

@qinhanmin2014 (Member) left a comment

LGTM, ping @jnothman

@adrinjalali changed the title from "[MRG+1] Fix quick_scan and reachability test for OPTICS" to "[MRG+1] OPTICS Change quick_scan floating point comparison to isclose" on Sep 5, 2018

@espg commented Sep 6, 2018

@adrinjalali Sure, working on it now.

@espg commented Sep 7, 2018

I'm running into weirdness: the two implementations give identical results when using n=250 per group, but divergent answers when using n=50. Not sure what to do here; apparently the implementations only agree on datasets above a certain size. So, using n=50 will cause the comparison to fail. Using 250 will make them agree, but then the numerical instability will cause the 32-bit comparisons to diverge.

I like this test quite a lot, as it directly verifies that the reachability values aren't changing or drifting... but I'm at a loss with the other OPTICS code and why it only agrees on large datasets. We could change to 50 points per cluster and use the output from our OPTICS at merge time; that would fix the 32-bit issue and let us know if we accidentally introduce changes that alter the core numeric output of the algorithm. But we would only be looking internally, not checking against another independent implementation. Are we comfortable with that at this point?

@jnothman (Member) commented Sep 7, 2018

@jnothman (Member) commented Sep 7, 2018

@espg commented Sep 10, 2018

min_samples doesn't seem to change things much; setting different values still leaves disagreement between the implementations at lower densities. There is an offset between the implementations: the Brian Clowers code doesn't include the point under consideration in the density estimation, while ours does. In practice, this means the outputs are identical in the 250-points-per-cluster case if we set min_samples = 10 for our implementation and min_samples = 9 for the Clowers code. This might explain why we don't get identical results at the lower point density, although I'd have to think a bit more on why...
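A small editorial sketch of that offset (illustrative code using NearestNeighbors, not taken from either implementation): when the query point is part of the fitted data, its nearest neighbour is itself at distance 0, so a neighbourhood of size k that counts the point matches a neighbourhood of size k-1 that excludes it.

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.randn(20, 2)

nn = NearestNeighbors().fit(X)
# Querying the fitted points themselves: column 0 is each point's distance
# to itself, which is always 0.
dist, _ = nn.kneighbors(X, n_neighbors=11)
print(np.allclose(dist[:, 0], 0.0))  # True

# "min_samples = 10" counting the query point looks at column index 9,
# while "min_samples = 10" excluding it looks at column index 10:
core_including_self = dist[:, 9]
core_excluding_self = dist[:, 10]
print(np.allclose(core_including_self, core_excluding_self))  # generally False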

@jnothman (Member) commented Sep 11, 2018

@espg commented Sep 12, 2018

@jnothman It took me a bit to digest your earlier comment, but I think I finally have a solution for this. See #12054

@espg commented Sep 12, 2018

@jnothman our choice of whether or not min_samples includes the query point is tied to maintaining compatibility with the DBSCAN module, which includes the query point. Having both include the query point ensures that extraction using extract_dbscan gives identical or near-identical results.

@jnothman (Member) commented Sep 13, 2018

@adrinjalali (Author, Member) commented Sep 15, 2018

Anything left here I should do? I guess it can be merged.

@jnothman jnothman merged commit a79d44e into scikit-learn:master Sep 16, 2018

7 checks passed:
  • ci/circleci: deploy (Your tests passed on CircleCI!)
  • ci/circleci: python2 (Your tests passed on CircleCI!)
  • ci/circleci: python3 (Your tests passed on CircleCI!)
  • codecov/patch (Coverage not affected when comparing 41178cd...e6cef7c)
  • codecov/project (92.81% (-0.09%) compared to 41178cd)
  • continuous-integration/appveyor/pr (AppVeyor build succeeded)
  • continuous-integration/travis-ci/pr (The Travis CI build passed)
@jnothman (Member) commented Sep 16, 2018

Thanks @adrinjalali

@adrinjalali deleted the adrinjalali:optics32_pr branch on Sep 16, 2018
