
BLD: switch to Cython>=3.0,<3.1 #4575

Merged
merged 9 commits into yt-project:main from tst_cython_3.0.0rc2 on Aug 1, 2023

Conversation

neutrinoceros
Member

We've been continuously testing Cython 3.0 pre-releases, but only on Linux and only for a subset of the test suite. This PR is an experiment to check for portability issues by running all tests with the current release candidate.

@neutrinoceros neutrinoceros added the 'do not merge' (this change should not be merged until later) label Jul 13, 2023
@neutrinoceros
Member Author

Builds seem to go through on all platforms; however, tests running on Jenkins appear to be much slower. They normally take about 50 min, and it looks like they're about to time out at 2 h (with no signs of hanging). I'll run them again for good measure, but this is likely a sign of performance regressions.

@matthewturk
Member

Probably another set of changed defaults, right?

@neutrinoceros
Member Author

That is my expectation. It might be a bit harder for me to track those down as it's always been a pain to replicate nose tests locally, but I'll give it a try.

@neutrinoceros
Member Author

Currently chasing GIL acquisition in yt/utilities/lib/bounding_volume_hierarchy.pyx, slowly iterating with Cython annotations, and using scalene to profile the following script:

import numpy as np

import yt
from yt.utilities.lib.bounding_volume_hierarchy import BVH, test_ray_trace
from yt.visualization.volume_rendering.api import Camera, Scene


def get_rays(camera):
    normal_vector = camera.unit_vectors[2].d
    W = np.array([8.0, 8.0])
    N = np.array([800, 800])
    dx = W / N

    x_points = np.linspace((-N[0] / 2 + 0.5) * dx[0], (N[0] / 2 - 0.5) * dx[0], N[0])
    y_points = np.linspace((-N[1] / 2 + 0.5) * dx[1], (N[1] / 2 - 0.5) * dx[1], N[1])

    X, Y = np.meshgrid(x_points, y_points)
    result = np.dot(camera.unit_vectors[0:2].T, [X.ravel(), Y.ravel()])
    vec_origins = camera.scene.arr(result.T, "unitary") + camera.position
    return np.array(vec_origins), np.array(normal_vector)


fn = "MOOSE_sample_data/out.e-s010"
ds = yt.load(fn)
vertices = ds.index.meshes[0].connectivity_coords
indices = ds.index.meshes[0].connectivity_indices - 1

ad = ds.all_data()
field_data = ad["connect1", "diffused"]

bvh = BVH(vertices, indices, field_data)

sc = Scene()
cam = Camera(sc)
cam.set_position(np.array([8.0, 8.0, 8.0]))
cam.focus = np.array([0.0, 0.0, 0.0])
origins, direction = get_rays(cam)

image = np.empty(800 * 800, np.float64)
test_ray_trace(image, origins, direction, bvh)  # this is the bottleneck
image = image.reshape((800, 800))

Even though all methods of the BVH class are already marked as nogil, I see a lot of GIL-specific code in the generated C. So far, I haven't pinned down where that's coming from.
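
For illustration, here is a minimal sketch (toy code, not from yt, and hedged on my reading of Cython 3.0's new default exception handling) of the mechanism that typically produces this: a cdef function without an explicit exception specification can now propagate Python exceptions, so every call site checks for a pending error after the call, and inside nogil code that check re-acquires the GIL. Marking the function noexcept restores the 0.29.x behaviour.

# toy_nogil.pyx -- illustrative only, not part of yt
cdef double kernel_legacy_style(double q) noexcept nogil:
    # 'noexcept' tells Cython 3.0 this function cannot raise, so callers
    # skip the error check and no GIL acquisition is generated
    return 1.0 - q if q < 1.0 else 0.0

cdef double kernel_new_default(double q) nogil:
    # without 'noexcept', Cython 3.0 assumes this may raise; the generated
    # C checks for a pending exception after each call, which is where the
    # PyGILState_Ensure / take_gil time comes from in nogil sections
    return 1.0 - q if q < 1.0 else 0.0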

@@ -1,3 +1,4 @@
# cython: legacy_implicit_noexcept=True
Member Author


This generates a bunch of warnings, so it shouldn't be considered stable, but it's useful for quickly slapping noexcept on all functions in the module and reducing the difference between the old cythonized C code and the new one. I should also point out that it's not enough to solve the performance issue seen in the example I'm currently working on.
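
For context, a sketch (made-up function names, not yt sources) of what the directive is shorthand for: it makes every cdef function that doesn't declare its own exception behaviour implicitly noexcept, as in Cython 0.29.x, which is why it's a quick but warning-heavy way to approximate the old defaults.

# cython: legacy_implicit_noexcept=True
# sketch.pyx -- illustrative only

cdef double clamp(double x) nogil:
    # implicitly noexcept under the module-level directive above
    return 0.0 if x < 0.0 else x

cdef double clamp_explicit(double x) noexcept nogil:
    # the explicit, warning-free spelling to migrate to eventually
    return 0.0 if x < 0.0 else x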

@neutrinoceros
Member Author

Cython 3.0 final is being deployed to PyPI. I'll need to update this PR.

@neutrinoceros neutrinoceros changed the title from "EXP: try Cython 3.0.0rc2" to "EXP: try Cython 3.0" Jul 17, 2023
@neutrinoceros
Member Author

In my latest commit, I tried forcing the legacy behaviour (setting noexcept globally), but tests have been running for two and a half hours already, so it seems clear that something else is affecting performance. I'm going to let @matthewturk take over this experiment for now.

@matthewturk
Member

I used a sampling profiler and came up with:

Showing nodes accounting for 8.32s, 31.27% of 26.61s total
Dropped 455 nodes (cum <= 0.13s)
Showing top 20 nodes out of 107
      flat  flat%   sum%        cum   cum%
         0     0%     0%     25.73s 96.69%  __clone3
         0     0%     0%     25.73s 96.69%  start_thread
         0     0%     0%     25.48s 95.75%  gomp_thread_start
         0     0%     0%     25.45s 95.64%  __pyx_f_2yt_9utilities_3lib_25bounding_volume_hierarchy_cast_rays
         0     0%     0%     25.38s 95.38%  __pyx_f_2yt_9utilities_3lib_25bounding_volume_hierarchy_3BVH_intersect
     0.07s  0.26%  0.26%     25.30s 95.08%  __pyx_f_2yt_9utilities_3lib_25bounding_volume_hierarchy_3BVH__recursive_intersect
     0.19s  0.71%  0.98%     25.22s 94.78%  __pyx_f_2yt_9utilities_3lib_10primitives_ray_triangle_intersect
     0.06s  0.23%  1.20%     14.47s 54.38%  PyGILState_Ensure
     0.04s  0.15%  1.35%     14.34s 53.89%  PyEval_RestoreThread
     0.64s  2.41%  3.76%     13.98s 52.54%  take_gil
     0.08s   0.3%  4.06%     10.21s 38.37%  PyCOND_TIMEDWAIT (inline)
     0.04s  0.15%  4.21%      9.80s 36.83%  ___pthread_cond_timedwait64
     0.65s  2.44%  6.65%      9.76s 36.68%  __pthread_cond_wait_common (inline)
     0.05s  0.19%  6.84%      8.04s 30.21%  _PyThreadState_DeleteCurrent.localalias
     0.01s 0.038%  6.88%      7.13s 26.79%  _PyEval_ReleaseLock (inline)
     0.14s  0.53%  7.40%      7.13s 26.79%  drop_gil
     5.15s 19.35% 26.76%      5.16s 19.39%  futex_wait (inline)
     0.72s  2.71% 29.46%      5.02s 18.87%  __GI___lll_lock_wait
     0.22s  0.83% 30.29%      4.48s 16.84%  __GI___pthread_mutex_unlock_usercnt (partial-inline)
     0.26s  0.98% 31.27%      4.29s 16.12%  ___pthread_cond_signal

This suggests the combination of OpenMP and the GIL is causing it.
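
To make the suspected pattern concrete (hypothetical names, not the actual ray-tracing code): a prange loop runs with the GIL released, but if the cdef helper it calls is one Cython 3.0 considers able to raise, each OpenMP thread keeps re-acquiring the GIL for exception checks, which would show up exactly as take_gil / drop_gil dominating the samples. With noexcept on the helper, the loop stays GIL-free.

# prange_sketch.pyx -- illustrative only; compile with OpenMP enabled
from cython.parallel import prange

cdef double intersect(double t) nogil:
    # no 'noexcept': callers in nogil/prange code pay for exception checks
    return t * t

cdef double intersect_noexcept(double t) noexcept nogil:
    # no error flag to check, so the parallel loop never touches the GIL
    return t * t

def cast_all(double[::1] ts):
    cdef Py_ssize_t i
    cdef double acc = 0.0
    # 'acc +=' inside prange makes it an OpenMP reduction variable
    for i in prange(ts.shape[0], nogil=True):
        acc += intersect_noexcept(ts[i])
    return acc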

@matthewturk
Member

So I think the OpenMP might have been a red herring. I looked at the code and it was using cimport'd functions that were both inline and not declared noexcept; I've added noexcept to those.

@matthewturk
Member

There might also be a corner case of pxd-defined functions not getting the global noexcept...
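
If so, a defensive fix would be to spell it out on the declaration itself; a hypothetical .pxd-style sketch (not the actual yt sources) of what that looks like:

# geometry_helpers.pxd -- hypothetical sketch, not a yt file
cdef inline double dot3(const double* a, const double* b) noexcept nogil:
    # explicit 'noexcept' so the behaviour does not depend on directives
    # set in whichever .pyx cimports this helper
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]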

@matthewturk
Member

How do these testing times look?

@neutrinoceros
Member Author

Still longer than the baseline, but it's too soon to conclude (still running)

@matthewturk
Member

I will take a look at a few more places. Any ideas on narrowing it down would be helpful, but I can take a first pass.

@neutrinoceros
Member Author

Inspecting previous runs' runtimes and sorting tests by duration might provide a couple of ideas.

@neutrinoceros
Member Author

For what it's worth, I can confirm that your commits resolved the one case I was goose-chasing, so we're on the right track!

@matthewturk
Member

Looks like the next two to go after are yt.frontends.gadget.tests and yt.utilities.lib.tests.

@matthewturk
Member

GAMER also needs a look, which makes sense since we cythonized some of its stuff.

@matthewturk
Member

I've pushed a change that covers the SPH kernel functions, which live in a .pxd file, as well as some functions that get called from within Python functions in the GAMER fields.

@matthewturk
Member

So it looks like, with my latest change, it still timed out at 4 hours. Am I reading this right?

https://tests.yt-project.org/job/yt_py38_git/7124/testReport

So the pixelized projection values tests are still quite slow. I'll take another look today.

@matthewturk
Member

I believe there are several top-level functions in pixelization_routines.pyx that are def, not cdef, which may also be considerably slower now. I am not sure whether we should move them into cdef functions or whether there is a noexcept context manager we could use. But moving the kernel interpolation into noexcept should speed things up considerably.
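
As a sketch of the structure being described (toy names, not the real pixelization_routines.pyx code): the Python-facing entry point can stay a def function, as long as the per-pixel kernel work it dispatches to is a cdef helper marked noexcept nogil, so the hot loop avoids Cython 3.0's new per-call exception checks.

# pixelize_sketch.pyx -- illustrative only
import numpy as np

cdef inline double sph_kernel(double q) noexcept nogil:
    # toy stand-in for the real SPH kernel interpolation
    return (1.0 - q) * (1.0 - q) if q < 1.0 else 0.0

def pixelize(double[::1] qs):
    # def entry point: called from Python, but the loop below releases the
    # GIL and only calls noexcept helpers, so it stays cheap
    cdef Py_ssize_t i
    cdef double[::1] out = np.empty(qs.shape[0], dtype=np.float64)
    with nogil:
        for i in range(qs.shape[0]):
            out[i] = sph_kernel(qs[i])
    return np.asarray(out)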

@neutrinoceros
Member Author

So it looks like, with my latest change, it still timed out at 4 hours. Am I reading this right?

Yes.

I believe that there are several top-level functions in the file pixelization_routines.pyx that are def, not cdef, which may also be considerably slower now

I don't see anything about that in the migration guide, so if true, that might be an actual regression?

@matthewturk
Member

I don't see anything about that in the migration guide, so if true, that might be an actual regression?

Sorry, I meant that they are def but still call a bunch of noexcept functions, so I'm just not sure. Let's see how the current round of tests goes.

@neutrinoceros
Member Author

The process seems to be hanging (or maybe the cookbook recipes are so slow that one of them is taking forever, but it's not clear which one, since the order they run in isn't deterministic). I'm going to trigger a new job with a deterministic test order, so if it hangs again, at least I'll know where.

@neutrinoceros
Member Author

I'm splitting out the known-useful bits into smaller PRs so we can keep making progress on this one while keeping the diff reasonable.

@matthewturk
Member

Let's not do that. I am unwilling to use or recommend Cython 3 until the performance has been addressed, so this should be a single PR that does the upgrade and fixes the performance.

@neutrinoceros
Member Author

I've rebased the branch to account for conflicts and removed my clearly-useless commits.

@neutrinoceros neutrinoceros changed the title from "EXP: try Cython 3.0" to "BLD: switch to Cython>=3.0,<3.1" Jul 27, 2023
@neutrinoceros neutrinoceros added the 'infrastructure' (related to CI, versioning, websites, organizational issues, etc.) and 'dependencies' (pull requests that update a dependency file) labels and removed the 'do not merge' label Jul 27, 2023
@neutrinoceros
Member Author

neutrinoceros commented Jul 27, 2023

Thanks to @Xarthisius, I think we have now fully restored performance, so this is ready to go. Also, this now closes #4355.

@neutrinoceros neutrinoceros linked an issue Jul 27, 2023 that may be closed by this pull request
@neutrinoceros neutrinoceros marked this pull request as ready for review July 27, 2023 21:40
@neutrinoceros neutrinoceros added this to the 4.3.0 milestone Jul 28, 2023
@neutrinoceros neutrinoceros added the 'build' (related to the build process) label and removed the 'infrastructure' label Jul 31, 2023
@neutrinoceros
Member Author

@Xarthisius, can you merge this, please?

@Xarthisius Xarthisius merged commit 192a187 into yt-project:main Aug 1, 2023
11 of 12 checks passed
@neutrinoceros neutrinoceros deleted the tst_cython_3.0.0rc2 branch August 1, 2023 14:49