ARM wheels#12
… into 0.0.10-dev
|
I think Mac still takes too long. It is best to drop Mac x86 from non-master branch builds. I do not want to rebuild this before deciding on the subset of builds that should run each time, but here are the changes that can be made to save time. Drop Mac x86 on debug builds:

diff --git a/.github/workflows/build_wheels.yml b/.github/workflows/build_wheels.yml
index e476f70..096c096 100644
--- a/.github/workflows/build_wheels.yml
+++ b/.github/workflows/build_wheels.yml
@@ -19,8 +19,6 @@ jobs:
steps:
- uses: actions/checkout@v3
- with:
- fetch-depth: 0
- name: Build
run: |
@@ -46,19 +44,22 @@ jobs:
fetch-depth: 0
- name: Set up QEMU
- if: runner.os == 'Linux'
+ if: runner.os == 'Linux' && (github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v'))
uses: docker/setup-qemu-action@v2
with:
platforms: all
- name: Build wheels (development)
- if: github.ref != 'refs/heads/master'
+ if: github.ref != 'refs/heads/master' && !startsWith(github.ref, 'refs/tags/v')
uses: pypa/cibuildwheel@v2.11.2
+ env:
+ CIBW_ARCHS_MACOS: "arm64"
- name: Build wheels (production)
- if: github.ref == 'refs/heads/master'
+ if: github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v')
uses: pypa/cibuildwheel@v2.11.2
env:
+ CIBW_ARCHS_MACOS: "x86_64 arm64"
CIBW_ARCHS_LINUX: "auto aarch64"
- uses: actions/upload-artifact@v3
@@ -68,6 +69,7 @@ jobs:
build_sdist:
name: Build source distribution
runs-on: ubuntu-latest
+ if: github.ref == 'refs/heads/master' || startsWith(github.ref, 'refs/tags/v')
steps:
- uses: actions/checkout@v3
with:
diff --git a/pyproject.toml b/pyproject.toml
index 0024880..b82ca36 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -53,8 +53,10 @@ write_to = "pythonmodule/_version.py"
build = "cp36-* cp38-macosx_arm64" # pp39-* cp39-win_arm64
skip = "*-win32 *_i686 *-musllinux_*"
-[tool.cibuildwheel.macos]
-archs = ["x86_64", "arm64"]
+# We will only build x86_64 when merged into master to save on compilation
+# time.
+# [tool.cibuildwheel.macos]
+# archs = ["x86_64", "arm64"]
# We will only build aarch64 when merged into master to save on compilation
# time.
diff --git a/pythonmodule/__init__.py b/pythonmodule/__init__.py
index ec09bcc..5ddd39f 100644
--- a/pythonmodule/__init__.py
+++ b/pythonmodule/__init__.py
@@ -4,7 +4,7 @@ from ._dbscan import *
from . import _dbscan
__all__ = tuple(v for v in dir(_dbscan) if v.startswith('_'))
try:
- from ._version import __version__
+ from ._version import version as __version__
__all__ += ('__version__',)
except:
pass

What do you think is better: having a separate branch for staging the more time-intensive builds, or working mainly on other branches and merging into master once things work on the easier-to-compile subset (4 minutes per build, amounting to 1 hour of free credits)? I can change the destination of this PR to master if you think this is good. |
|
Hi @anivegesana, good suggestion. Yes, I agree: let's work mainly on other branches and put the time-intensive builds on master. You could change the destination of this PR to the master branch. By the way, @anivegesana, are you interested in becoming a maintainer for the PyPI repository of this project? If yes, let me know your PyPI username. |
|
Hey @wangyiqiu, I created a PyPI username. It matches my GitHub username: @anivegesana. I would be happy to be a maintainer on dbscan, but I don't know how much maintenance this project will require. Using the Python limited ABI and the CI reduces the effort required to publish to PyPI: the wheels are built automatically and, unlike Cython extensions, are portable across Python versions, so we do not need to periodically recompile or update the package for users of new Python versions. We compile dbscan once and it is ready forever. The only real maintenance would be adding features or fixing bugs. I will read the paper and get accustomed to the dbscan folder itself (I have only looked at the outward-facing Python package up until now). I will also follow the issues and open pull requests to try to resolve them. Thank you! |
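[Editor's note] Some context on the limited-ABI point: wheels tagged `abi3` are built against CPython's stable ABI, so one wheel covers all newer CPython versions. A minimal sketch of how an extension opts in with setuptools; the module and source names below are assumptions for illustration, not this package's actual build configuration:

```python
from setuptools import Extension

# Hypothetical extension definition. Compiling with Py_LIMITED_API set to
# the 3.6 ABI produces a single abi3 wheel usable on CPython >= 3.6.
ext = Extension(
    "pythonmodule._dbscan",            # assumed module name
    sources=["src/dbscanmodule.cpp"],  # assumed source layout
    py_limited_api=True,
    define_macros=[("Py_LIMITED_API", "0x03060000")],
)
```

With this, cibuildwheel only needs to build one wheel per platform rather than one per Python minor version.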
|
Great, thanks @anivegesana -- I have sent you an invite on PyPI. As you may already know, the code stems from one of the methods in this paper, and the goal of this repository is to enable more people to use the code in real applications. I really like your work in making the code more accessible by modifying the API, streamlining the build, and improving portability and compatibility. These are very important for making a bigger impact. I think the short-term goal is to make v0.0.10 work well on most platforms / environments without bugs like the ones you have mentioned. Looking forward, it may also be interesting to support dimensionality beyond 20 from a software standpoint. Potential solutions include falling back to other existing DBSCAN implementations when the dimensionality is high, or writing our own fast DBSCAN code for higher-dimensional data. |
|
@wangyiqiu Maybe it would be a good idea for us to look into manually triggering builds on master. There is a rather limited budget of free minutes, and it would be irritating not to be able to build things if they run out. In fact, maybe we should figure that out before publishing v0.0.10? |
|
That's a good idea, though I have not figured out how to do that yet. For the moment, I will upgrade to Pro if the minutes run out. |
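[Editor's note] For reference, GitHub Actions supports manually triggered runs via the `workflow_dispatch` event. A sketch of what the trigger section of build_wheels.yml could look like (the branch names are assumed; the repository's actual trigger configuration is not shown in this thread):

```yaml
# Hypothetical trigger block for build_wheels.yml.
on:
  push:
    branches: [master]
  pull_request:
  workflow_dispatch:   # adds a "Run workflow" button under the Actions tab
```

With `workflow_dispatch` in place, the expensive master builds could be started by hand only when needed instead of on every push.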
|
I can also look into other CIs like Circle CI and Cirrus CI. Worst comes to worst, we can just disable macOS builds on master and development, and I could periodically build them on my MacBook. macOS builds cost the most because Apple makes it hard to reverse-engineer their products and create virtual machines, so the builds are done on physical Mac devices. This additional cost would be the same across all CIs; the only difference would be how much free credit they give per month. It would be inconvenient, but it would work in the short term. |
|
@wangyiqiu Considering that we don't seem to use that many builds a month, I don't think it will be a big concern for now. I can look into it a bit later; you can go ahead and push the release. If we use only the standard C++ libraries, we can get away with testing development branches only on Linux, because we would expect the code to work on every system that supports C++. If we choose to do that, we should first rewrite example.py as a test case, to ensure that the wheels work properly before publishing. |
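[Editor's note] A sketch of what rewriting example.py into a test could look like: cluster a tiny hand-built dataset and assert on the labels. The brute-force reference below is illustrative only (stdlib, quadratic); the real test would call dbscan.DBSCAN on the make_blobs data from example.py instead.

```python
import math

def dbscan_ref(points, eps, min_samples):
    """Tiny brute-force DBSCAN reference: returns a cluster id per point,
    or -1 for noise. O(n^2); only meant for small test inputs."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None or len(neighbors[i]) < min_samples:
            continue  # already assigned, or not a core point
        cluster += 1
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] is None:
                labels[j] = cluster
                if len(neighbors[j]) >= min_samples:
                    stack.extend(neighbors[j])  # expand from core points
    return [-1 if lab is None else lab for lab in labels]

# Two tight clusters and one far-away outlier
pts = [(0, 0), (0.1, 0), (0, 0.1),
       (5, 5), (5.1, 5), (5, 5.1),
       (100, 100)]
labels = dbscan_ref(pts, eps=0.5, min_samples=3)
assert labels == [0, 0, 0, 1, 1, 1, -1]
```

In the real test, one would assert that dbscan.DBSCAN agrees with such a reference (or with sklearn) up to a relabeling of cluster ids.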
|
Sounds good, I agree. I just pushed a branch called 0.0.11-dev. I thought of a few items that we can include in this update (including the one you mentioned):
Any thoughts? |
|
Hi Yiqiu, thanks for the update! (pun intended)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import dbscan
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers,
cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)
X_labels, X_core = dbscan.DBSCAN(X, eps=0.3, min_samples=10)
Y = (X * 100000).astype(np.int64)
Y_labels, Y_core = dbscan.DBSCAN(Y, eps=30000, min_samples=10)
assert (X_labels == Y_labels).all() and (X_core == Y_core).all()
A bug that I noticed but was unable to track down was that after running the |
|
Fixed the problems mentioned above (except for the CPU usage bug). The Python test cases are run during the build phase of cibuildwheel. I manually verified that the METADATA of the wheel does have the README information, and added a test case to check the metadata of the wheel. Here are the changes: #13

Looking into the CPU usage issue; let's see if I can tackle that one too. The unnecessary busy wait is on line 283 of scheduler.h. It looks pretty hacky:

std::this_thread::sleep_for(std::chrono::nanoseconds(num_deques * 100));

Why did they not use

I need to find some time to mess around with it to make sure I am not adding any deadlocks, so maybe I should create the PR now for the fixes above and slowly work out how to fix this up later. |
|
I have a solution here: https://github.com/anivegesana/dbscan-python/tree/atomic_wait

I am trying to use |
|
Thanks! @anivegesana I'm still looking into it and will get back to you asap. |
|
The other option is to drop older platforms and upgrade to C++20. Right now, I use defaults for a lot of the cibuildwheel settings. For most platforms, the wheels are compiled on Python 3.6. Changing the settings to compile on 3.8 (so that the versions of GCC match on Linux) wouldn't change anything for the end user, since we never promised to support versions older than that. The problem is with dropping older OS versions (like macOS Catalina and Linuxes with GCC < 12). We could do it, but I do not know whether people are installing this package on outdated OSes, and it might be a good idea to leave this small buffer for a year or two in case they are. |
|
I think it might be possible to just stick with C++17 and still remove the busy waiting in the scheduler. This pattern seems to do the same thing as the C++20

#ifdef __cpp_lib_atomic_wait
std::atomic<int> deque_size;
#else
struct : std::atomic<int> {
std::condition_variable_any cv{};
inline void notify_one() {
cv.notify_one();
}
inline void notify_all() {
cv.notify_all();
}
inline void wait(int val) {
static struct {
void lock() {}
void unlock() {}
} mutex{}; // a dummy mutex
bool waited = false;
if (load() == val)
cv.wait(mutex, [&]() {
bool old_waited = waited;
waited = true;
return old_waited;
});
}
} deque_size;
#endif

I am trying to see if I can verify that this performs the same operation as |
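[Editor's note] For intuition, the same wait/notify idea — block until the value changes instead of sleeping and polling — can be sketched in Python with threading.Condition. This is only an illustration of the semantics, not part of the scheduler:

```python
import threading

class WaitableValue:
    """Sketch of atomic-wait semantics: wait(old) blocks while the value
    still equals old; set() publishes a new value and wakes waiters."""

    def __init__(self, value=0):
        self._value = value
        self._cond = threading.Condition()

    def set(self, value):
        with self._cond:
            self._value = value
            self._cond.notify_all()  # like std::atomic::notify_all()

    def wait(self, old):
        with self._cond:
            # The predicate loop absorbs spurious wakeups, playing the
            # same role as the `waited` flag in the C++ sketch above.
            self._cond.wait_for(lambda: self._value != old)
            return self._value

v = WaitableValue(0)
results = []
t = threading.Thread(target=lambda: results.append(v.wait(0)))
t.start()
v.set(42)  # wakes the waiter instead of letting it poll in a sleep loop
t.join()
# results == [42]
```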
|
Hi @anivegesana thanks for finding this issue! The parlay/pbbs scheduler is designed for C++ programs, such that a unique copy of the scheduler is created globally and only destroyed at program exit. The problem you identified points to the high CPU utilization between the end of the DBSCAN function call and the end of the Python session; this is because the scheduler is still expecting more work (and the threads are spinning).

The solution you proposed is a good one in the sense that it keeps the scheduler active in the background, but potentially without causing high CPU utilization before the next DBSCAN call. The downside, like you said, is that it may require some comprehensive testing of the scheduler, which can be quite a lot of work.

I proposed a quick fix (9a27ef4) by adding a start and a stop function for the scheduler and calling them at the correct places. Essentially, a new copy of the scheduler is responsible for each DBSCAN function call, rather than one scheduler being active during the entire Python session. The downside is that we need to start and stop the scheduler at the correct places in the code, and there may be some scheduler initialization overhead.

For the work you have in progress (using a condition variable and atomic wait), I think it is great and we can keep it for a future release. I also encourage you to submit a PR to parlaylib to help them fix the same scheduler issue. Let me know what you think. |
|
Happy new year, @wangyiqiu! Sounds good. It would be easier to just add a temporary fix and create an issue on the parlaylib repo. That repo probably has more comprehensive tests that would be able to tell if the fix works. Eventually, when they get a general solution, we can upgrade dbscan to use the latest parlaylib version. |
|
Sounds good @anivegesana Happy new year to you, too! |
|
Created the issue at cmuparlay/parlaylib#33. I will test my solution against their test cases and create a PR there after I get it to pass. I think the workaround 9a27ef4 that you pushed yesterday is good enough for this package, if you want to push another release. |
|
Thanks, it looks good. I published v0.0.11, and it seems to work well. Before publishing it, I encountered some errors which seem to be coming from the Python test configuration: https://github.com/wangyiqiu/dbscan-python/actions/runs/3858728146 I also created a v0.0.12 branch for the next version: https://github.com/wangyiqiu/dbscan-python/tree/0.0.12-dev |
|
Hmm, that looks odd. I guess the Python tests weren't being run on non-master branches, so I didn't notice that I removed the "py36" extra that the test cases use.
The extras are specified in setup.py. I was supposed to add another extra

I will look into it soon and verify whether that was the issue. I will also see if I can get the images to show on PyPI. |
|
Somewhat related to the issue from yesterday, it might be a good idea to swap Option 1 and Option 2 in the README. Putting the C++ instructions first might make it seem like there is some benefit to rebuilding the pure C++ version over using the prebuilt Python one. Let me know if you think this is a good idea.

Also, I noticed that the large paragraph on how to build the Python module from scratch is a bit Ubuntu-specific and has some information that isn't relevant anymore, since the build is now automated. To build from scratch for testing, this command seems sufficient (with the current directory set to the repo's root).

Finally, a small thing that you are probably already doing right: are you calling both the C++ and sklearn versions with multiple threads? In case you aren't, there is an

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html |
|
Hi @anivegesana Yes, let's swap Option 1 and 2. I agree we should let users use the prebuilt Python package. Like you suggested, let's also swap the Ubuntu-specific description with

I had actually been testing against the sequential sklearn, since I did not know it could also run in parallel! I will rerun the tests to compare against the parallel sklearn and update the figure / numbers. |
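[Editor's note] For the rerun, a small stdlib-only timing harness can keep the sequential and parallel measurements consistent. The workloads below are stand-ins; in the real comparison they would be, e.g., dbscan.DBSCAN(X, eps=0.3, min_samples=10) and sklearn.cluster.DBSCAN(eps=0.3, min_samples=10, n_jobs=-1).fit(X):

```python
import time

def bench(fn, repeat=3):
    """Best-of-`repeat` wall-clock time of fn(), in seconds."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in workloads for illustration only.
t_slow = bench(lambda: sum(i * i for i in range(200_000)))
t_fast = bench(lambda: sum(i * i for i in range(2_000)))
print(f"slow: {t_slow:.4f}s  fast: {t_fast:.4f}s")
```

Taking the best of several runs reduces noise from other processes; the same dataset should be reused across all four configurations.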
|
Maybe having four columns (sequential and parallel for each) could be helpful. I don't know if the bars would be legible if the two parallel ones are too similar in order of magnitude. Do you think using a log scale would make it clearer? |
|
Sounds good, I will make four columns. Some of the bars can be quite similar. I will use the log scale when that happens. |
This PR builds the wheels for the ARM architecture and makes some minor rearrangements in the README to bring the parts that talk about Python into the same section. It adds an autogenerated _version.py file which will contain the version of the wheel and can be automatically updated from the Git commit hash or tag.

I did not compile the Windows ARM wheels because numpy doesn't ship Windows ARM wheels. The Linux ARM wheels are only compiled on master to save build time; a single Linux ARM build takes 20 minutes. The only ARM build that runs routinely is the Mac one, since M1/M2 Macs are the latest generation of Macs.