Conversation
There was a problem hiding this comment.
Summary of Changes
Hello @lispandfound, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant performance improvement to the SRF file parsing by rewriting the existing Cython-based parser in Rust. The new Rust parser leverages highly optimized libraries for numerical parsing, resulting in a substantial speedup and a more robust, maintainable, and memory-safe codebase. It also serves as a demonstration of integrating Rust extensions into a Python project.
Highlights
- Performance Enhancement: I've replaced the existing Cython SRF parser with a new Rust implementation, achieving a significant speedup in SRF file parsing. For instance, an 18GB SRF file that previously took nearly 5 minutes to parse now completes in approximately 30 seconds, representing up to a 7x performance improvement.
- Technology Migration & Code Quality: The core SRF parsing logic has been migrated from Cython to Rust. This transition addresses Cython's limitations regarding third-party library integration and performance interpretability, resulting in a more maintainable, type-safe, and memory-safe codebase.
- Simplified SRF Data Model: The new parser primarily focuses on the
slip1component. This simplifies the internal representation of SRF slip data by effectively deprecating explicitslip2andslip3components within theSrfFileobject and streamlining related parsing and writing logic. - Rust-Python Integration: This pull request serves as a practical demonstration of integrating Rust extensions into a Python project. It showcases the setup and build process using
pyo3for Python bindings andsetuptools-rustfor seamless distribution, including dependency management viaCargo.toml.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request replaces the Cython-based SRF parser with a new implementation in Rust, achieving a significant performance boost. The use of mmap and lexical-core is a great choice for this task. The code is well-structured and the PR description is very clear and helpful. The review includes suggestions to further improve performance, robustness, and maintainability.
|
Made some refactors suggested by Gemini (although what it suggested made things even worse so I had to undo them and write my own solution). Basically there were too many copies floating around in my code. Now the copying has been reduced I can parse the 18GB alpine SRF on my own desktop in about 40s. Can't test on Hypocentre right now because it is being hammered atm but I imagine it also faster than before with better memory usage. |
|
UPDATE: doesn't matter because scipy will "helpfully" upgrade all the arrays in question to 64-bit (signed!) integers, forcing a copy to occur and more memory usage. The best I can do is match the scipy integer format and use |
a1b85f8
|
Sorry @claudio525 one further change, the SRF writer now also uses rust. Writing takes just about as much time as reading (roughly 40 seconds on hypocentre), which suggests that the bottleneck is probably still (de)serialising a lot of floats. The SRFs produced by |
There was a problem hiding this comment.
Pull Request Overview
This PR replaces the existing Cython-based SRF parser with a Rust implementation, integrates it into the Python package, and updates tests and build configuration accordingly.
- Remove
srf_reader.pyxand Cython build, add Rust-basedsrf_parsercrate for faster parsing. - Update
read_srf/write_srfinsrf.pyto call the new Rust functions and drop multi-component slip support. - Adjust tests to reflect only a single slip component and added a windowed slip-value check.
Reviewed Changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_srf.py | Modified tests to drop slip2/3 assertions and added end-window check |
| source_modelling/srf_reader.pyx | Removed old Cython-based SRF reader entirely |
| source_modelling/srf_parser/src/lib.rs | Introduced Rust parser (parse_srf, write_srf_points, etc.) |
| source_modelling/srf_parser/Cargo.toml | Defined the Rust crate with dependencies |
| source_modelling/srf.py | Switched to Rust parser calls, removed slip2/3 handling |
| setup.py | Replaced Cython extension with setuptools-rust extension |
| pyproject.toml | Updated build-system requirements for Rust |
Comments suppressed due to low confidence (2)
source_modelling/srf_parser/src/lib.rs:112
- [nitpick] The variable
_nt2actually holds the slip2 float value (and_slip2holds the count); consider renaming them toslip2_valandnt2respectively for clarity.
let _nt2 = parse_value::<f32>(data, &mut index)?;
source_modelling/srf.py:442
- There are currently no tests targeting the new
write_srf_pointsorwrite_srffunctionality; consider adding tests to verify the output SRF format and data correctness.
srf_parser.write_srf_points(
The current SRF parser is fast-ish but not fast enough. Additionally being written in Cython makes it a little annoying to include third party libraries that might make it faster. This PR rewrites the SRF parser in rust and pulls in the lexical_core crate to get 7x speedup in SRF parsing performance with the addition of a more maintainable codebase, type-safety, and memory-safety.
Challenges with Cython
The main challenges with Cython programs are:
lexical-core = "1.0.5"to theCargo.tomlfile in the root of the rust package. The crates registry contains thousands of such packages.srf_parser.pyxI had to comment the code here to make it clear that one was assigning a value in python (read: slow!) and one was assigning a value in C. This can be a nightmare for trying to write performant extensions because a single line of python in a hot loop can completely trash your performance and the only way to know is with profiling or reading transpiled cython C output. The rust option makes it slightly harder to call Python code but with the huge benefit of making it very clear what is fast rust and what is slow Python.
3. All the problems with C, but worse. Cython is essentially a worse version of C. It retains the all the memory footguns of C but adds weird syntax, odd gotchas, and C89 style variable declarations. Rust is a performant, modern, memory-safe language that makes it easier to write the code once and forget about it.
How much faster did the parser get?
The main issue with the C code is that it spends 99% of its time on lines that look like
fscanf("%f", &latitude). C's implementation offscanfis quite slow, and the somewhat fasterstrtodrequires a lot of ceremony to make work properly. Additionally, the arrays storing all this information unfortunately can't be allocated just once because they don't know their size ahead of time due to quirks of the SRF format. My SRF generation stages were taking an hour to run reading and writing SRFs multiple times over in the intermediate stages. This would essentially mean days of turn around time for alpine fault simulations. To make this faster, the rust version of this code uses the incredibly fast lexical core crate. This crate uses SIMD instructions to process multiple characters in parallel which provides an enormous speedup. How big? Well the old code reads my 18GB 100malpine_hope_1.srffile in nearly 5 minutes and the new code takes just 30 secondsThe effect is more pronounced on newer machines with larger caches and better SIMD instruction sets, my home desktop has a two-year-old 12th gen mid range i5 and reads a 5gb SRF in only 8 seconds.
Structure of a Rust binding
I intend this PR to be a demo of how to write such rust bindings. It's at least as easy as Cython to setup, and much easier if you want to add a third-party library. The steps I went through are as follows:
source_modellingdirectory.srf_parser. Change into this directory and runcargo init --lib. The--libflag tells cargo to make a library project rather than an executable (the default). If you don't have rust installed you can very easily do so withrustup. I have globally installed rust on Hypocentre for researchers. It is also bundled with popular distributions like Ubuntu because many of their own tools are written in rust these days.Cargo.tomlto compile a Python-compatible library extension. This is as simple as changing the crate type. Add a line like thisand you're done!
4. Add the
pyo3andnumpycrates. You can do this withcargo add. Edit theCargo.tomlto add theextension-modulefeature.~/source_modelling/source_modelling/srf_parser $ cargo add pyo3 numpyand in the cargo file
src/lib.rsand start writing your rust code! This part requires knowing rust, which can be a bit of a hurdle but it isn't that hard to learn (no harder than learning to write Cython imo). The only things different you need to know is pyo3 specific stuff. To make a rust function callable from python you add#[pyfunction]to the top and add thepyargument which is a copy of the python interpreter:Then you can add it to your module at the bottom
pyproject.tomlto bundle everything together. We use setuptools mainly so this is a touch more challenging than normal but it is still pretty simple. You just addsetuptools-rustto the build system requirementsand then add a line to the
setup.pyto instruct python to build the moment when youpip installI added
native=Trueto compile the module with the fastest possible native instruction set when the module is built. Generally you shouldn't do this I just did it to make surelexical_coregets to use SIMD instructions.7. Pip install and go! You can import your module in python like an ordinary one
Now you can distribute this to researchers. Like a cython extension, they don't need to know anything about the distribution. If the wheel is prebuilt they don't even need to have rust installed. Unfortunately we use git packages so they need rust to compile this wheel themselves. I've installed rust on Hypocentre with an automatic update option turned on so they don't have to think about it there.