Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

asPointwise with n_jobs>1 does not progress #19

Closed
JesseCresswell opened this issue Feb 19, 2024 · 12 comments
Closed

asPointwise with n_jobs>1 does not progress #19

JesseCresswell opened this issue Feb 19, 2024 · 12 comments

Comments

@JesseCresswell
Copy link
Contributor

(This was tested after fixing another issue locally #17)

def asPointwise(data, class_instance, precomputed_knn=None, n_neighbors=100, n_jobs=1):

asPointwise works as intended when n_jobs=1. However, in my experience when n_jobs>1 the async call does not complete.

I can recommend using the joblib library instead of multiprocessing, for example:

from joblib import delayed, Parallel
if n_jobs > 1:
    with Parallel(n_jobs=n_jobs) as parallel:
        def fit_estimator(class_instance, data, idx):
            return class_instance.fit(data[idx, :]).dimension_
        results = parallel(
            delayed(fit_estimator)(
                class_instance, data, idx
            ) for idx in knn
        )
    return np.array(results)
@j-bac
Copy link
Collaborator

j-bac commented Feb 20, 2024

Thanks, really appreciate you bringing up these issues and possible fix.

Would you have an example to reproduce this ? For example this completes with no problem for me:

import numpy as np
import skdim
X = np.random.random((1000, 10))
skdim.id.lPCA().fit_pw(X,n_jobs=4).dimension_pw_

edit: I managed to reproduce the error using asPointwise directly rather than fit_pw. I don't know why the latter works while the former gets stuck like this. If I don't find a solution with multiprocessing I'll try to implement your suggestion with joblib

@JesseCresswell
Copy link
Contributor Author

Oh I was about to explain more but I see your edit. I was specifically using the ESS estimator which does not have a built-in fit_pw function, and asPointwise appeared to be the intended method for pointwise estimation of local intrinsic dimension.

import numpy as np
import skdim
import multiprocessing as mp

# Re-defining this function to fix bug in Issue 17
def asPointwise(data, class_instance, precomputed_knn=None, n_neighbors=100, n_jobs=1):
    """Use a global estimator as a pointwise one by creating kNN neighborhoods"""
    if precomputed_knn is not None:
        knn = precomputed_knn
    else:
        _, knn = skdim.get_nn(data, k=n_neighbors, n_jobs=n_jobs)

    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            # Asynchronously apply the `fit` function to each data point and collect the results
            results = [pool.apply_async(class_instance.fit, (data[i, :],)) for i in knn]
            # Retrieve the computed dimensions
        return np.array([r.get().dimension_ for r in results])
    else:
        return np.array([class_instance.fit(data[i, :]).dimension_ for i in knn])


X = np.random.random((100, 10))

# estimator = skdim.id.lPCA() # GlobalEstimator
estimator = skdim.id.ESS() # LocalEstimator

# n_jobs = 1 # single process
n_jobs = 2 # multi process

if isinstance(estimator, skdim._commonfuncs.LocalEstimator):
    lid = asPointwise(X, estimator, n_neighbors=10, n_jobs=n_jobs)
elif isinstance(estimator, skdim._commonfuncs.GlobalEstimator):
    lid = estimator.fit_pw(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_

print(lid)

This code completes quickly for LPCA with n_jobs=1 or 2, and for [ESS, MADA, MLE, MOM, TLE] with n_jobs=1.
It completely hangs for [ESS, MADA, MLE, MOM, TLE] with n_jobs=2. These are the GlobalEstimator classes.

@JesseCresswell
Copy link
Contributor Author

Thanks for looking into this, and for the great library!

I had a question about terminology. Estimators like LPCA inherit from GlobalEstimator, and have the fit_pw function for pointwise (local) dimension estimation. Estimators like ESS inherit from LocalEstimator, and do not have the fit_pw function implemented, only fit which is for the whole dataset.

Based on these descriptions, it seems like the class names of GlobalEstimator and LocalEstimator are reversed. Local estimators should be the ones that operate pointwise, while global estimators only act on the whole dataset by default. Did you have a different interpretation?

@j-bac
Copy link
Collaborator

j-bac commented Feb 22, 2024

You can use ESS pointwise like so:

import numpy as np
import skdim
X = np.random.random((1000, 10))
skdim.id.ESS().fit(X,n_jobs=4).dimension_pw_

So you can do this in your code:

if isinstance(estimator, skdim._commonfuncs.LocalEstimator):
    lid =  estimator.fit(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_
elif isinstance(estimator, skdim._commonfuncs.GlobalEstimator):
    lid = estimator.fit_pw(X, n_neighbors=10, n_jobs=n_jobs).dimension_pw_

print(lid)

LocalEstimator class is used for estimators that already require computation of ID estimates for each point neighborhood to provide a global ID estimate. So in principle there is no need for a .fit_pw method and .dimension_pw_ is already returned

For lPCA I made an exception - most people are used to running PCA on the entire dataset to estimate ID, so this is the default when using .fit() to prevent confusion. Accordingly it inherits from GlobalEstimator whose .fit method runs an estimator on the entire dataset.

However paper references associated with this class rather use PCA locally and aggregate estimates to obtain global ID. Indeed ID estimation using PCA on an entire dataset will fail for non-linear data. So arguably, PCA is more of a local ID estimator that tries to obtain ID of the manifold tangent space. Hence the name lPCA (local PCA) for the class to point this out while keeping default use global.

Overall I am trying to keep users from inadequately using estimators. .fit() always expects to receive an entire (possibly non-linear) dataset as input whether the estimator inherits from LocalEstimator or GlobalEstimator. This is not the case in original R ESS implementation where ESS expects a single neighborhood as input. In Python this would be equivalent to calling ESS().fit_once

I agree the terminology is confusing so suggestions are very welcome.

@JesseCresswell
Copy link
Contributor Author

JesseCresswell commented Feb 24, 2024

I see, that makes much more sense and is simpler. I understand why Local and Global are used now. What threw me off was first seeing lPCA as a "global" estimator, since I think of using it in the local way, but it happens to be the exception to your naming scheme.

Here's a suggestion then - in the README you use DANCo as an example of a global estimator (fine), and lPCA as an example of a local estimator. Then when I look up their implementation I see they both inherit from GlobalEstimator. That's unexpected! Why not replace lPCA with ESS or another LocalEstimator in the first example that most users will see?

@j-bac
Copy link
Collaborator

j-bac commented Mar 5, 2024

I'll put this one my to do list: the lPCA exception should probably be removed and the docs provide a clear example with a lPCA.fit_once method to apply PCA "as usual" a single time on the whole dataset

@JesseCresswell
Copy link
Contributor Author

Can you give me access rights to make branches and push? I will make a PR switching this function joblib, and then another for adding parallelism internally for estimators like ESS.

@j-bac
Copy link
Collaborator

j-bac commented Mar 7, 2024

Maybe I need to request for you to be added to the scikit-learn-contrib org. I don't see an option to give you access rights myself

@j-bac
Copy link
Collaborator

j-bac commented Mar 8, 2024

Actually you should be able to PR even without being member

@JesseCresswell
Copy link
Contributor Author

No, I have no option to make a PR or even a new branch on this repo. I tried pushing a branch directly but I get a permissions error.

@j-bac
Copy link
Collaborator

j-bac commented Mar 10, 2024

Could you try again following the steps at https://github.com/firstcontributions/first-contributions ?

@JesseCresswell
Copy link
Contributor Author

Closed by PR #23

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants