
Vectorize cdf, ppf #201

Merged: 7 commits merged into master on Jan 22, 2021

Conversation

@k15z (Member) commented Dec 22, 2020

Resolves #200. I'm on vacation from FB (happy holidays @csala!), so I thought I'd take a look at this. I've been wanting to make this faster forever; this PR makes the Gaussian KDE ~8x faster.

Here's some code for benchmarking it:

import numpy as np
from time import time
from copulas.univariate.gaussian_kde import GaussianKDE

X_train = np.random.uniform(size=1000)
X_test = np.random.uniform(size=1000)

model = GaussianKDE()
model.fit(X_train)

# Current (loop-based) ppf.
start = time()
y_slow = model.percent_point(X_test)
print(time() - start)

# Vectorized ppf added in this PR.
start = time()
y_fast = model.percent_point_fast(X_test)
print(time() - start)

# The two implementations should agree up to numerical precision.
print(np.abs(y_fast - y_slow).max())

The vectorized version of the ppf takes 0.332 seconds, while the current version takes 2.898 seconds.

@k15z requested review from @fealho and @csala on December 22, 2020 at 19:01
@csala (Contributor) left a comment

Hey! Thanks for the proposal @k15z !

The idea looks good, and I agree that vectorizing seems to be the way to go here, but we will need to work a bit more on it to make it robust, since the current implementation seems to fail in some edge cases.

Here's an example of such a failure (when passing an array with a single element):

In [1]: import numpy as np
   ...: from copulas.univariate import GaussianKDE
   ...: from copulas.datasets import sample_univariate_bimodal
   ...: 
   ...: data = sample_univariate_bimodal()
   ...: model = GaussianKDE()
   ...: model.fit(data)
   ...: model.percent_point_fast(np.array([0.5]))
   ...: 
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-2ed68b5e8edf> in <module>
      6 model = GaussianKDE()
      7 model.fit(data)
----> 8 model.percent_point_fast(np.array([0.5]))

~/Projects/MIT/Copulas/copulas/univariate/gaussian_kde.py in percent_point_fast(self, U)
    265         X[is_one] = float("inf")
    266         X[is_zero] = float("-inf")
--> 267         X[is_valid] = newton(_f, np.zeros(U[is_valid].shape) + (lower+upper)/2.0)
    268 
    269         return X

~/.virtualenvs/Copulas/lib/python3.8/site-packages/scipy/optimize/zeros.py in newton(func, x0, fprime, args, tol, maxiter, fprime2, x1, rtol, full_output, disp)
    338                             " Failed to converge after %d iterations, value is %s."
    339                             % (itr + 1, p1))
--> 340                         raise RuntimeError(msg)
    341                     warnings.warn(msg, RuntimeWarning)
    342                 p = (p1 + p0) / 2.0

RuntimeError: Tolerance of [-80.25251] reached. Failed to converge after 3 iterations, value is [34.01875655].

I also added some more comments inline.

(Inline review comments on copulas/univariate/gaussian_kde.py and tests/large_scale_evaluation.py, since resolved.)
@codecov-io commented Dec 24, 2020

Codecov Report

Merging #201 (93ea85d) into master (47bf787) will increase coverage by 0.19%.
The diff coverage is 94.93%.


@@            Coverage Diff             @@
##           master     #201      +/-   ##
==========================================
+ Coverage   89.77%   89.97%   +0.19%     
==========================================
  Files          26       27       +1     
  Lines        1585     1646      +61     
==========================================
+ Hits         1423     1481      +58     
- Misses        162      165       +3     
Impacted Files                        Coverage Δ
copulas/univariate/gaussian_kde.py    98.57% <93.75%> (-0.04%) ⬇️
copulas/optimize/__init__.py          95.23% <95.23%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@k15z (Member, Author) commented Dec 24, 2020

Good point re: the stability issues. Updated to use bisect and chandrupatla, both of which are equal to or better than brentq in terms of robustness (both are more stable than newton). Here are the running times (in seconds):

Original: 30.22
Bisect: 5.47
Chandrupatla: 4.72

Up to you whether you prefer the simple bisect method or Chandrupatla's algorithm... the latter is slightly faster but significantly more complex (and may be less robust, although I'm not aware of any issues).
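For reference, here's a minimal sketch of the vectorized bisection idea (not the exact code in this PR; the function name, signature, and bracket handling are illustrative):

import numpy as np

def vectorized_bisect(f, lower, upper, maxiter=100, tol=1e-8):
    # Element-wise bisection: f(lower) and f(upper) must bracket a root
    # (have opposite signs) for every element.
    lower = np.asarray(lower, dtype=float).copy()
    upper = np.asarray(upper, dtype=float).copy()
    f_lower = f(lower)
    for _ in range(maxiter):
        mid = (lower + upper) / 2.0
        f_mid = f(mid)
        # Where f(mid) shares the sign of f(lower), the root lies in
        # [mid, upper]; otherwise it lies in [lower, mid].
        same_sign = np.sign(f_mid) == np.sign(f_lower)
        lower = np.where(same_sign, mid, lower)
        f_lower = np.where(same_sign, f_mid, f_lower)
        upper = np.where(same_sign, upper, mid)
        if np.max(upper - lower) < tol:
            break
    return (lower + upper) / 2.0

Inverting the cdf then becomes a single batched call, e.g. vectorized_bisect(lambda X: model.cumulative_distribution(X) - U, lo, hi) with lo and hi chosen to bracket the fitted support, so the whole batch converges together instead of looping per element.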

(Further inline review comments on copulas/univariate/gaussian_kde.py and tests/unit/univariate/test_gaussian_kde.py.)
@csala (Contributor) commented Jan 6, 2021

> Good point re: the stability issues. Updated to use bisect and chandrupatla, both of which are equal to or better than brentq in terms of robustness (both are more stable than newton). Here are the running times (in seconds):
>
> Original: 30.22
> Bisect: 5.47
> Chandrupatla: 4.72
>
> Up to you whether you prefer the simple bisect method or Chandrupatla's algorithm... the latter is slightly faster but significantly more complex (and may be less robust, although I'm not aware of any issues).

I think both the bisect and chandrupatla options are good, but I'm getting slightly different times:

In [1]: import numpy as np
   ...: from copulas.univariate import GaussianKDE
   ...: from copulas.datasets import sample_univariate_bimodal
   ...: 
   ...: data = sample_univariate_bimodal()
   ...: model = GaussianKDE()
   ...: model.fit(data)
   ...: cdf = model.cumulative_distribution(data)

In [2]: %timeit model.percent_point_slow(cdf[0:10])
34.3 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit model.percent_point(cdf[0:10], method='bisect')
20 ms ± 83.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit model.percent_point(cdf[0:10], method='chandrupatla')
9.23 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit model.percent_point_slow(cdf)
3.35 s ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit model.percent_point(cdf, method='bisect')
2.53 s ± 77.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [7]: %timeit model.percent_point(cdf, method='chandrupatla')
1.15 s ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit model.percent_point_slow(np.concatenate([cdf] * 10))
33.8 s ± 97.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit model.percent_point(np.concatenate([cdf] * 10), method='bisect')
12.9 s ± 48.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit model.percent_point(np.concatenate([cdf] * 10), method='chandrupatla')
5.84 s ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It seems like bisect takes about twice as long as chandrupatla, and chandrupatla takes roughly 1/3 to 1/6 of the time of the original approach.

On the other hand, both bisect and chandrupatla produce the same results as the original method, which is also the right output (it successfully inverts the cdf):

In [11]: np.allclose(data, model.percent_point_slow(cdf))
Out[11]: True

In [12]: np.allclose(data, model.percent_point(cdf, method='bisect'))
Out[12]: True

In [13]: np.allclose(data, model.percent_point(cdf, method='chandrupatla'))
Out[13]: True

Altogether, I think we could keep chandrupatla as the default option but also keep bisect as an alternative.
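As a rough sketch of what that dispatch could look like (the copulas.optimize function names, the _params['dataset'] access, and the bracket padding are assumptions for illustration, not the actual PR code):

import numpy as np
from copulas.optimize import bisect, chandrupatla  # hypothetical names in the new module

def percent_point(self, U, method='chandrupatla'):
    # Invert the cdf: for each u in U, solve cumulative_distribution(X) = u.
    optimizer = {'bisect': bisect, 'chandrupatla': chandrupatla}[method]
    # Illustrative bracket around the fitted support; real bounds would
    # be derived from the fitted dataset and bandwidth.
    dataset = np.asarray(self._params['dataset'])
    lower = np.full(U.shape, dataset.min() - 1.0)
    upper = np.full(U.shape, dataset.max() + 1.0)
    return optimizer(lambda X: self.cumulative_distribution(X) - U, lower, upper)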

Before we call it done I would add a few changes:

  1. Can we move the bisect and chandrupatla functions to a copulas.optimize module?
  2. Maybe we could simplify the code a bit and remove unused options, like return_iter, and use longer variable names, etc., to improve readability.
  3. We should add a few numerical tests to validate that the outputs of both the chandrupatla and bisect functions are right, and also numerical tests that validate that percent_point successfully inverts cumulative_distribution (this could actually be added for all the distributions?). See the sketch after this list.
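For point 3, such a round-trip test could look something like this (a sketch; the test name and tolerance are illustrative):

import numpy as np
from copulas.univariate import GaussianKDE
from copulas.datasets import sample_univariate_bimodal

def test_percent_point_inverts_cumulative_distribution():
    data = sample_univariate_bimodal()
    model = GaussianKDE()
    model.fit(data)
    cdf = model.cumulative_distribution(data)
    for method in ('bisect', 'chandrupatla'):
        # percent_point should recover the data from its own cdf values.
        np.testing.assert_allclose(
            model.percent_point(cdf, method=method), data, atol=1e-6)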

@csala (Contributor) commented Jan 6, 2021

Also, the build errors should be fixed after merging #203 to master and then merging master back into this branch.

@csala csala merged commit e70f689 into master Jan 22, 2021
@csala csala deleted the vectorize-kde branch January 22, 2021 10:28
Closes: Make gaussian_kde faster (#200)