
ARM wheels #12

Merged
wangyiqiu merged 11 commits into wangyiqiu:master from anivegesana:sklearn-wrapper
Dec 1, 2022

Conversation

@anivegesana
Contributor

@anivegesana anivegesana commented Nov 30, 2022

This PR builds the wheels for the ARM architecture and makes some minor rearrangements in the README to bring the parts that talk about Python into the same section. It also adds an autogenerated _version.py file containing the wheel's version, which can be updated automatically from the Git commit hash or tag.
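
Once installed, the version is then available the usual way (a quick sketch; the attribute is re-exported from the generated _version.py):

import dbscan
print(dbscan.__version__)  # e.g. '0.0.10', derived from the Git tag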

I did not compile the Windows ARM wheels because numpy doesn't ship Windows ARM wheels. The Linux ARM wheels are only compiled on master to save build time; a single Linux ARM build takes 20 minutes. The only ARM build that runs routinely is the Mac one, since M1/M2 Macs are the latest generation of Macs.

@anivegesana
Contributor Author

anivegesana commented Nov 30, 2022

I think the Mac builds still take too long, so it is probably best to drop Mac x86 from non-master branch builds. I do not want to rebuild this before we decide on the subset of builds that should run each time, but here are the changes that would save time.

Drop Mac x86 on debug builds
diff --git a/.github/workflows/build_wheels.yml b/.github/workflows/build_wheels.yml
index e476f70..096c096 100644
--- a/.github/workflows/build_wheels.yml
+++ b/.github/workflows/build_wheels.yml
@@ -19,8 +19,6 @@ jobs:
 
     steps:
       - uses: actions/checkout@v3
-        with:
-          fetch-depth: 0
 
       - name: Build
         run: |
@@ -46,19 +44,22 @@ jobs:
           fetch-depth: 0
 
       - name: Set up QEMU
-        if: runner.os == 'Linux'
+        if: runner.os == 'Linux' && (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v'))
         uses: docker/setup-qemu-action@v2
         with:
           platforms: all
 
       - name: Build wheels (development)
-        if: github.ref != 'refs/heads/master'
+        if: github.ref != 'refs/heads/master' && !startsWith(github.ref, 'refs/tags/v')
         uses: pypa/cibuildwheel@v2.11.2
+        env:
+          CIBW_ARCHS_MACOS: "arm64"
 
       - name: Build wheels (production)
-        if: github.ref == 'refs/heads/master'
+        if: github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v')
         uses: pypa/cibuildwheel@v2.11.2
         env:
+          CIBW_ARCHS_MACOS: "x86_64 arm64"
           CIBW_ARCHS_LINUX: "auto aarch64"
 
       - uses: actions/upload-artifact@v3
@@ -68,6 +69,7 @@ jobs:
   build_sdist:
     name: Build source distribution
     runs-on: ubuntu-latest
+    if: github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v')
     steps:
       - uses: actions/checkout@v3
         with:
diff --git a/pyproject.toml b/pyproject.toml
index 0024880..b82ca36 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -53,8 +53,10 @@ write_to = "pythonmodule/_version.py"
 build = "cp36-* cp38-macosx_arm64" # pp39-* cp39-win_arm64
 skip = "*-win32 *_i686 *-musllinux_*"
 
-[tool.cibuildwheel.macos]
-archs = ["x86_64", "arm64"]
+# We will only build x86_64 when merged into master to save on compilation
+# time.
+# [tool.cibuildwheel.macos]
+# archs = ["x86_64", "arm64"]
 
 # We will only build aarch64 when merged into master to save on compilation
 # time.
diff --git a/pythonmodule/__init__.py b/pythonmodule/__init__.py
index ec09bcc..5ddd39f 100644
--- a/pythonmodule/__init__.py
+++ b/pythonmodule/__init__.py
@@ -4,7 +4,7 @@ from ._dbscan import *
 from . import _dbscan
 __all__ = tuple(v for v in dir(_dbscan) if v.startswith('_'))
 try:
-    from ._version import __version__
+    from ._version import version as __version__
     __all__ += ('__version__',)
 except:
     pass

Which do you think is better: having a separate branch for staging the more time-intensive builds, or working mainly on other branches and merging into master once things work on the easier-to-compile subset (4 minutes per build, amounting to about 1 hour of free credits)?

I can change the destination of this PR to master if you think this is good.

@wangyiqiu
Owner

wangyiqiu commented Nov 30, 2022

Hi @anivegesana good suggestion.

Yes, I agree: let's work mainly on other branches and run the time-intensive builds on master. You can change the destination of this PR to the master branch.

By the way, @anivegesana are you interested in becoming a maintainer for the PyPI repository of this project? If yes, let me know your PyPI username.

@anivegesana anivegesana changed the base branch from 0.0.10-dev to master December 1, 2022 04:50
@anivegesana
Contributor Author

anivegesana commented Dec 1, 2022

Hey @wangyiqiu, I created a PyPI username. It matches my GitHub username: @anivegesana. I would be happy to be a maintainer on dbscan, but I don't know how much maintenance this project will require. Using the Python limited ABI together with CI reduces the effort needed to publish to PyPI: the wheels are built automatically and, unlike Cython extensions, are portable across Python versions, so we do not need to periodically recompile the package for users of new Python versions. We compile dbscan once and it is ready forever.
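
For context, roughly what makes this work, sketched as a hypothetical setup.py fragment (the module name and source path here are illustrative, not necessarily the project's actual layout):

from setuptools import setup, Extension

setup(
    ext_modules=[
        Extension(
            "dbscan._dbscan",                   # illustrative module name
            sources=["src/dbscanmodule.cpp"],   # illustrative path
            # Restrict to the stable ABI of CPython >= 3.6, so one wheel
            # works on all later CPython versions.
            define_macros=[("Py_LIMITED_API", "0x03060000")],
            py_limited_api=True,
        )
    ],
)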

The only real maintenance would be adding features or fixing bugs. I will read the paper and get accustomed to the dbscan folder itself. (I have only looked at the outward-facing Python package up until now.) I will also follow the issues and open pull requests to try to resolve them.

Thank you!

@wangyiqiu wangyiqiu merged commit 46e6028 into wangyiqiu:master Dec 1, 2022
@wangyiqiu
Owner

Great, thanks @anivegesana -- I have sent you an invite on PyPI.

As you may already know, the code stems from one of the methods in this paper, and the goal of this repository is to enable more people to use the code in real applications. I really like your work in making the code more accessible by modifying the API, streamlining the build, and improving portability and compatibility. These changes are very important for making a bigger impact.

I think the short-term goal is to make v0.0.10 work well on most platforms/environments without bugs like the ones you mentioned. Looking forward, it may also be interesting to support dimensionality beyond 20 from a software standpoint. Potential solutions include falling back to other existing DBSCAN implementations when the dimensionality is high, or writing our own fast DBSCAN code for higher-dimensional data.

@anivegesana
Contributor Author

anivegesana commented Dec 4, 2022

@wangyiqiu Maybe it would be a good idea to look into manually triggering builds on master. There is a rather limited budget of free minutes, and it would be irritating not to be able to build things if the minutes run out. In fact, maybe we should figure that out before publishing v0.0.10?

@wangyiqiu
Owner

That's a good idea, though I have not figured out how to do that. For now, I will upgrade to Pro if the minutes run out.

@anivegesana
Contributor Author

I can also look into other CIs like CircleCI and Cirrus CI. If worst comes to worst, we can just disable macOS builds on master and development branches, and I could periodically build them on my MacBook. macOS builds cost the most because Apple makes it hard to reverse-engineer their products and create virtual machines, so the builds run on physical Mac hardware. This additional cost would be the same across all CIs; the only difference would be how much free credit each gives per month. Building locally would be inconvenient, but it would work in the short term.

@anivegesana
Contributor Author

anivegesana commented Dec 10, 2022

@wangyiqiu Considering that we don't seem to use that many builds a month, I don't think it will be a big concern for now. I can look into it a bit later. You can go ahead and push the release.

We can stick to the standard C++ library and get away with testing development branches only on Linux, because we would then expect the code to work on every system that supports C++. If we choose to do that, we should first rewrite example.py as a test case to ensure that the wheels work properly before publishing.

@wangyiqiu
Owner

Sounds good, I agree.

I just pushed a branch called 0.0.11-dev. I thought of a few items that we can include in this update (including the one you mentioned):

  1. Support float32, int32 and/or int64 in the Python API, where the basic idea is to cast non-double input into doubles. I tried to work with this function call in dbscanmodule.cpp but did not get it to work. I tried instead to call a PyArray_CheckFromAny function followed by some type casts, but the data seems to be garbled at the beginning. Are you familiar with how this works? Also, would it be easier to handle the data type conversion in Python rather than C++?
  2. Update example.py like you mentioned. Let's test most/all of the wheels before publishing 0.0.11.
  3. Update the running time diagram in case the performance characteristics changed since 0.0.10.
  4. Use our current README as the description for the PyPI repository.

Any thoughts?

@anivegesana
Contributor Author

anivegesana commented Dec 11, 2022

Hi Yiqiu, thanks for the update! (pun intended)

  1. I could be wrong, but PyArray_FROMANY creates a copy of the original array object if the dtype doesn't match. It isn't explicitly stated in the documentation for that function, but there is an example for the similar np.asarray. In addition, I believe that the example still works when using int64 or the other types you mentioned:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import dbscan

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

X_labels, X_core = dbscan.DBSCAN(X, eps=0.3, min_samples=10)

Y = (X * 100000).astype(np.int64)
Y_labels, Y_core = dbscan.DBSCAN(Y, eps=30000, min_samples=10)
assert (X_labels == Y_labels).all() and (X_core == Y_core).all()

  2. Well, I believe that example.py is still useful for people to understand how to use the library. I think we should have a separate file test_python_module.py that contains an almost exact copy of example.py but can be run without matplotlib; see the sketch after this list.
  3. Yep. It is a good idea to retime everything and make sure that the figure is accurate and truthful and that nothing got slower. If it did get slower, however, it is not likely due to significant overhead from the limited API, because that should only add a couple of extra function calls, which is many times smaller than the number of operations in the DBSCAN code itself. If there is a big timing difference, we should investigate whether we introduced a bug between versions.
  4. It seems that I didn't read the fine print on something. setuptools doesn't yet support all of the fields in pyproject.toml, even though they are published by the same people, so I shouldn't have moved all of the fields from setup.py to pyproject.toml. I will put the information in both places so that setuptools is happy. We can publish a v0.0.10.1 to fix that problem, or just try to fix the following issue and make a v0.0.11 release.
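
A minimal sketch of what that test file might look like (it assumes the same synthetic data as example.py; the function name and the cluster-count assertion are illustrative):

# test_python_module.py -- an almost exact copy of example.py, minus matplotlib
import dbscan
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

def test_three_blobs():
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers,
                                cluster_std=0.4, random_state=0)
    X = StandardScaler().fit_transform(X)
    labels, core_samples = dbscan.DBSCAN(X, eps=0.3, min_samples=10)
    # With these parameters the three blobs should come out as three
    # clusters; noise points are labeled -1.
    assert len(set(labels) - {-1}) == 3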

A bug that I noticed but was unable to track down: after running the DBSCAN function, CPU utilization remains high even after the function terminates. It also happened in 0.0.9, so I think it is probably somewhere in the C++ library itself, likely in pbbs. If you have any ideas, feel free to investigate, because I don't really know where to start.

@anivegesana
Contributor Author

anivegesana commented Dec 11, 2022

Fixed the problems mentioned above (except for the CPU usage bug). The Python test cases are run during the build phase of cibuildwheel. I manually verified that the wheel's METADATA contains the README information, and added a test case that checks the wheel's metadata.
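
For reference, the metadata check can look something like this (a sketch, assuming the distribution is installed under the name dbscan and the long description content type was declared as markdown):

from importlib.metadata import metadata  # Python 3.8+

meta = metadata("dbscan")
# The long description (the README) is carried as the message payload.
assert meta["Description-Content-Type"] == "text/markdown"
assert "DBSCAN" in meta.get_payload()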

Here are the changes: #13

Looking into the CPU usage issue. Let's see if I can tackle that one too. The unnecessary busy wait is on line 283 of scheduler.h. It looks pretty hacky:

std::this_thread::sleep_for(std::chrono::nanoseconds(num_deques * 100));

Why did they not use condition_variables? That would let the OS handle the waiting instead of re-awakening the thread every num_deques * 100 nanoseconds only for it to fall asleep again. Surprisingly enough, pbbs's successor parlaylib still seems to keep this ugly line of code: https://github.com/cmuparlay/parlaylib/blob/master/include/parlay/scheduler.h#L266

I need to find some time to mess around with it to make sure I am not adding any deadlocks, so maybe I should create the PR now for the fixes above and slowly work out how to fix this up later.

@anivegesana
Contributor Author

anivegesana commented Dec 13, 2022

I have a solution here: https://github.com/anivegesana/dbscan-python/tree/atomic_wait

I am trying to use condition_variable so that we don't have to include a backport of the C++20 synchronization primitives (std::atomic_wait). I don't think it can be done without semaphores or shared mutexes (any synchronization primitive that allows multiple threads to access one resource at the same time and wait until that resource is available), both of which came into the standard later, so we might have to include this repo: https://github.com/ogiroux/atomic_wait. I don't know whether you would rather include a copy of it directly in this repo (like pbbs) or have the build system download it during compilation (like gtest). Any thoughts/knowledge/experience you can lend?

@anivegesana anivegesana deleted the sklearn-wrapper branch December 13, 2022 02:23
@wangyiqiu
Owner

Thanks! @anivegesana I'm still looking into it and will get back to you asap.

@anivegesana
Contributor Author

The other option is to drop older platforms and upgrade to C++20. Right now, I use defaults for a lot of the cibuildwheel settings, so for most of the platforms the wheels are compiled on Python 3.6. Changing the settings to compile on 3.8 (so that the versions of GCC match on Linux) wouldn't change anything for the end user, since we never promised to support versions older than that. The problem is with dropping older OS versions (like macOS Catalina and Linux distributions with GCC < 12). We could do it, but I do not know whether people are installing this package on outdated OSes, and it might be a good idea to leave this small buffer for a year or two in case they are.

@anivegesana
Contributor Author

anivegesana commented Dec 16, 2022

I think it might be possible to just stick with C++17 and still remove the busy waiting in the scheduler. This pattern seems to do the same thing as the C++20 atomic_wait, but I have only tried it out on macOS Big Sur and Ubuntu.

#ifdef __cpp_lib_atomic_wait
  std::atomic<int> deque_size;
#else
  // Pre-C++20 fallback: emulate std::atomic's wait/notify interface with a
  // condition_variable_any (requires <atomic> and <condition_variable>).
  struct : std::atomic<int> {
    std::condition_variable_any cv{};

    inline void notify_one() {
      cv.notify_one();
    }
    inline void notify_all() {
      cv.notify_all();
    }
    inline void wait(int val) {
      // condition_variable_any works with any BasicLockable type; there is
      // no state to protect here, so a shared no-op mutex suffices.
      static struct {
        void lock() {}
        void unlock() {}
      } mutex{}; // a dummy mutex

      bool waited = false;
      // Only block if the value still matches, mirroring std::atomic::wait;
      // the predicate makes the thread sleep until exactly one notification.
      if (load() == val)
        cv.wait(mutex, [&]() {
          bool old_waited = waited;
          waited = true;
          return old_waited;
        });
    }
  } deque_size;
#endif

I am trying to verify that this performs the same operation as atomic_wait; if so, that would save the headache of vendoring the atomic_wait backport. I am also now keeping track of the number of jobs available, because that seems to work better (anivegesana@70f2a45). I still do not know whether this is equivalent or even works; I will look into verifying that.

@wangyiqiu
Owner

Hi @anivegesana thanks for finding this issue!
I tested the Python module, and indeed the scheduler does not stop after the DBSCAN computation.

The parlay/pbbs scheduler is designed for C++ programs, such that a unique copy of the scheduler is created globally, and the scheduler is only destroyed at program exit. The problem you identified points to the high utilization of the CPU between the end of the DBSCAN function call and the end of the Python session, and this is because the scheduler is still expecting more work (and the threads are spinning).

The solution you proposed is a good one in the sense that it keeps the scheduler active in the background, but potentially without causing high CPU utilization before the next DBSCAN call. The downside, like you said, is that it may require comprehensive testing of the scheduler, which can be quite a lot of work.

I proposed a quick fix (9a27ef4) by adding a start and stop function for the scheduler, and calling them at the correct places. Essentially, a new copy of the scheduler is responsible for each DBSCAN function call, rather than one scheduler being active during the entire Python session. The downside is that we need to start and stop the scheduler at the correct places in the code, and maybe there is some scheduler initialization overhead.

For the work you have in progress (using condition variables and atomic wait), I think it is great and we can keep it for future releases. I also encourage you to submit a PR to parlaylib to help them fix the same scheduler issue. Let me know what you think.

@anivegesana
Contributor Author

anivegesana commented Jan 3, 2023

Happy new year, @wangyiqiu!

Sounds good. It would be easier to just add a temporary fix and create an issue on the parlaylib repo. That repo probably has more comprehensive tests that would be able to tell if the fix works. Eventually, when they get a general solution, we can upgrade dbscan to use the latest parlaylib version.

@wangyiqiu
Owner

Sounds good, @anivegesana. Happy new year to you, too!

@anivegesana
Contributor Author

Created the issue at cmuparlay/parlaylib#33. I will test my solution against their test cases and create a PR there after I get it to pass. I think the workaround 9a27ef4 that you pushed yesterday is good enough for this package, if you want to push another release.

@wangyiqiu
Owner

Thanks, it looks good.

I published v0.0.11, and it seems to work well. Before publishing it, I encountered some errors that seem to come from the Python test configuration: https://github.com/wangyiqiu/dbscan-python/actions/runs/3858728146
For now I have temporarily disabled the Python tests; I will look into it soon.

I also created a v0.0.12 branch for the next version: https://github.com/wangyiqiu/dbscan-python/tree/0.0.12-dev

@anivegesana
Contributor Author

anivegesana commented Jan 6, 2023

Hmm. That looks odd. I guess the Python tests weren't being run on non-master branches, so I didn't notice that I removed the "py36" extra that the test cases use.

CIBW_TEST_EXTRAS: "test36"

The extras are specified in setup.py. I was supposed to add another extra: 'py36': ['scikit-learn', 'matplotlib', 'pytest'].
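
For reference, a minimal sketch of the relevant setup.py fragment (the extra name is the part that has to match whatever CIBW_TEST_EXTRAS requests; everything else is elided):

from setuptools import setup

setup(
    # ...name, version, ext_modules, etc. elided...
    extras_require={
        'py36': ['scikit-learn', 'matplotlib', 'pytest'],
    },
)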

I will look into it soon and verify that this was the issue.

I will also see if I can get the images to show in PyPI.

@anivegesana
Contributor Author

anivegesana commented Jan 8, 2023

Somewhat related to the issue from yesterday, it might be a good idea to swap Option 1 and Option 2 in the README. Putting the C++ instructions first might make it seem like there is some benefit to rebuilding the pure C++ version over using the prebuilt Python one. Let me know if you think this is a good idea.

Also, I noticed that the large paragraph on how to build the Python module from scratch seems a bit Ubuntu-specific and contains information that isn't relevant anymore now that the build is automated.

To build from scratch for testing, this command seems sufficient (with the current directory set to the repo's root):

pip install -e .

Finally, a small thing that you are probably already doing right: are you running both the C++ and sklearn versions with multiple threads? In case you aren't, there is an n_jobs parameter in the sklearn version that needs to be set to -1 to turn on multithreading:

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
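
For instance (a sketch, assuming X is the standardized array from the earlier example):

from sklearn.cluster import DBSCAN

# n_jobs=-1 spreads the neighbor search across all available cores.
labels = DBSCAN(eps=0.3, min_samples=10, n_jobs=-1).fit_predict(X)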

@wangyiqiu
Owner

wangyiqiu commented Jan 9, 2023

Hi @anivegesana
Thanks for the suggestions!

Yes, let's swap Options 1 and 2. I agree we should steer users toward the prebuilt Python package. Like you suggested, let's also replace the Ubuntu-specific description with pip install -e ..

I had actually been testing against the sequential sklearn, since I did not know it could also run in parallel! I will rerun the tests to compare against the parallel sklearn and update the figure/numbers.

@anivegesana
Contributor Author

Maybe having four columns (sequential and parallel for each) would be helpful. I don't know whether the bars would be legible if the two parallel ones are similar in magnitude. Do you think using a log scale would make it clearer?
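
Something like this, perhaps (a sketch with placeholder timings, just to show the four columns and the log-scale axis):

import matplotlib.pyplot as plt

methods = ['dbscan seq', 'dbscan par', 'sklearn seq', 'sklearn par']
seconds = [40.0, 4.0, 300.0, 30.0]  # placeholder numbers, not measurements
plt.bar(methods, seconds)
plt.yscale('log')  # keeps bars of different magnitudes legible
plt.ylabel('running time (s)')
plt.show()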

@wangyiqiu
Owner

Sounds good, I will make four columns. Some of the bars can be quite similar. I will use the log scale when that happens.
