Speed up import scanning by 400-500%#142
Speed up import scanning by 400-500%#142duzumaki wants to merge 2 commits intopython-grimp:masterfrom
Conversation
a86ed83 to
9ac45a1
Compare
| def test_syntax_error_terminates_executor_pool(): | ||
| with pytest.raises(BrokenProcessPool): |
There was a problem hiding this comment.
I guess this is one disadvantage of concurrency in python. using the .map() method in the executor results in the inability to capture errors raised from individual tasks
i've also tried using executor.submit(), with my own chunking logic in order to try capture the exception as well but any exception that isn't on the top level function get_imports_by_module, won't propagate to the context object:
with ProcessPoolExecutor() as executor:
you only get this generic
BrokenProcessPool('A process in the process pool was terminated abruptly while the future was running or pending.') error
one way around this might be to change the raises in the tasks to returns, then check the result accordingly
There was a problem hiding this comment.
It would be a shame to lose that exception - returning any syntax errors (they could even be exception objects) sounds like it is at least worth a try.
src/grimp/application/usecases.py
Outdated
| ) | ||
| imports_by_module[module] = direct_imports | ||
| with ProcessPoolExecutor() as executor: | ||
| chunk_size = ceil(len(found_package.module_files) / executor._max_workers) or 1 |
There was a problem hiding this comment.
Would be helpful to have a comment here explaining our reasoning for the chunk size.
|
Very exciting! Have left a couple of initial comments, also the tests are failing which would be good to sort out. |
cddb258 to
914d301
Compare
914d301 to
cecca20
Compare
| import_scanner: AbstractImportScanner, | ||
| exclude_type_checking_imports: bool, | ||
| cache: caching.Cache, | ||
| ): |
There was a problem hiding this comment.
Missing type annotation on return value.
|
I checked out this branch and installed it locally. It dramatically slowed down graph building time on a large test repo (0.6 seconds -> 35 seconds). On a Mac M2. I verified that my local installation of the Seems like something is wrong. When I set |
|
I'm going to close this branch because we are close to having a Rust implementation of the graph, and so I think the next step will be to move to Rust for building the graph. It'll be easier to parallelise there because we can free ourselves from the GIL and do multithreading. But thanks for your efforts, much appreciated! |
Tested on a large mono repo (35,000 modules) using a mac m1 (8 cores). The modules are divided amongst the cores on the system it runs.
so each core is doing an even share of ast walking
profiled using
Cprofile,pstats(to read the profile dump) andsnakeviz(for the visualisation)Current(~37s):

after parallelisation(~8s):
