-
Notifications
You must be signed in to change notification settings - Fork 17
Rust importscanner #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rust importscanner #229
Conversation
CodSpeed Instrumentation Performance ReportMerging #229 will degrade performances by 18.12%Comparing Summary
Benchmarks breakdown
|
ddac1be to
3972bc7
Compare
e83fbcc to
bf838a1
Compare
Much of the Rust code was generated by an LLM, possibly it could be simplified. The Python tests are adapted so some of the same tests can be run on the Python FakeFileSystem and the Rust-based FakeBasicFileSystem.
This means we don't need to pickle the FileSystem - making it possible to use the Rust-based file system classes while doing multiprocessing.
RealBasicFileSystem is renamed to PyRealBasicFileSystem, likewise with FakeBasicFileSystem. These then wrap inner Rust structs. We do this so we can box a file system in a Rust-based ImportScanner class, and then interact with it polymorphically. (See an upcoming commit.)
bf838a1 to
80b2234
Compare
LilyFirefly
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't review before this was merged! I hope these suggestions are helpful anyway:
|
|
||
| #[getter] | ||
| fn sep(&self) -> String { | ||
| "/".to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth interning this with intern! for efficiency if it's called a lot from Python.
| let sep = self.sep(); | ||
| components | ||
| .into_iter() | ||
| .map(|c| c.trim_end_matches(&sep).to_string()) | ||
| .collect::<Vec<String>>() | ||
| .join(&sep) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the performance is important, it might make sense to define const SEP = "/" as a module constant to avoid creating a String every time join is called. This could still be referred to from self.sep() for Python code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder if we can avoid the Vec and/or the String in the collect?
| return ("".to_string(), "".to_string()); | ||
| } | ||
|
|
||
| let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The explicit reference here should be unnecessary because "" is already a &str:
| let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split) | |
| let tail = components.last().unwrap_or(""); // Last component, or empty if components is empty (shouldn't happen from split) |
| let components: Vec<&str> = file_name.split('/').collect(); | ||
|
|
||
| if components.is_empty() { | ||
| return ("".to_string(), "".to_string()); | ||
| } | ||
|
|
||
| let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split) | ||
|
|
||
| let head_components = &components[..components.len() - 1]; // All components except the last |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
split returns a DoubleEndedIterator, so it can be iterated from both ends:
| let components: Vec<&str> = file_name.split('/').collect(); | |
| if components.is_empty() { | |
| return ("".to_string(), "".to_string()); | |
| } | |
| let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split) | |
| let head_components = &components[..components.len() - 1]; // All components except the last | |
| let mut components = file_name.split('/'); | |
| let tail = match components.next_back() { | |
| Some(tail) => tail, | |
| None => return ("".to_string(), "".to_string()); | |
| }; | |
| let head_components: Vec<&str> = components.collect(); |
| fn read(&self, file_name: &str) -> PyResult<String> { | ||
| match self.contents.get(file_name) { | ||
| Some(file_name) => Ok(file_name.clone()), | ||
| None => Err(PyFileNotFoundError::new_err("")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd include the file_name in the error message:
| None => Err(PyFileNotFoundError::new_err("")), | |
| None => Err(PyFileNotFoundError::new_err(format!("No such file: {file_name}"))), |
|
|
||
| #[getter] | ||
| fn sep(&self) -> String { | ||
| std::path::MAIN_SEPARATOR.to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider intern! here too.
| for component in components { | ||
| path.push(component); | ||
| } | ||
| path.to_str().unwrap().to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use .expect() instead of .unwrap() - this allows leaving an explanatory message for why it's safe to unwrap:
| path.to_str().unwrap().to_string() | |
| path.to_str().expect("Path components are valid unicode").to_string() |
| })?; | ||
|
|
||
| let s = String::from_utf8_lossy(&bytes); | ||
| let encoding_re = Regex::new(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)").unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably worth seeing if this can be moved to module scope, so it's only compiled once no matter how often read is called. It might be necessary to use LazyCell.
| #[allow(unused_variables)] | ||
| #[new] | ||
| #[pyo3(signature = (file_system, found_packages, include_external_packages=false))] | ||
| fn new( | ||
| py: Python, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can remove the #[allow(unused_variables)] if you rename py to _py:
| #[allow(unused_variables)] | |
| #[new] | |
| #[pyo3(signature = (file_system, found_packages, include_external_packages=false))] | |
| fn new( | |
| py: Python, | |
| #[new] | |
| #[pyo3(signature = (file_system, found_packages, include_external_packages=false))] | |
| fn new( | |
| _py: Python, |
Or maybe you don't need py in the signature at all?
| #[allow(unused_variables)] | |
| #[new] | |
| #[pyo3(signature = (file_system, found_packages, include_external_packages=false))] | |
| fn new( | |
| py: Python, | |
| #[new] | |
| #[pyo3(signature = (file_system, found_packages, include_external_packages=false))] | |
| fn new( |
| match parse_result { | ||
| Err(GrimpError::ParseError { | ||
| line_number, text, .. | ||
| }) => { | ||
| // TODO: define SourceSyntaxError using pyo3. | ||
| let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap(); | ||
| let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap(); | ||
| let exception = py_exception_class | ||
| .call1((module_filename, line_number, text)) | ||
| .unwrap(); | ||
| return Err(PyErr::from_value(exception)); | ||
| } | ||
| Err(e) => { | ||
| return Err(e.into()); | ||
| } | ||
| _ => (), | ||
| } | ||
| let imported_objects = parse_result.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid needing to call parse_result.unwrap() by assigning the match:
| match parse_result { | |
| Err(GrimpError::ParseError { | |
| line_number, text, .. | |
| }) => { | |
| // TODO: define SourceSyntaxError using pyo3. | |
| let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap(); | |
| let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap(); | |
| let exception = py_exception_class | |
| .call1((module_filename, line_number, text)) | |
| .unwrap(); | |
| return Err(PyErr::from_value(exception)); | |
| } | |
| Err(e) => { | |
| return Err(e.into()); | |
| } | |
| _ => (), | |
| } | |
| let imported_objects = parse_result.unwrap(); | |
| let imported_objects = match parse_result { | |
| Err(GrimpError::ParseError { | |
| line_number, text, .. | |
| }) => { | |
| // TODO: define SourceSyntaxError using pyo3. | |
| let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap(); | |
| let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap(); | |
| let exception = py_exception_class | |
| .call1((module_filename, line_number, text)) | |
| .unwrap(); | |
| return Err(PyErr::from_value(exception)); | |
| } | |
| Err(e) => return Err(e.into()), | |
| Ok(imported_objects) => imported_objects, | |
| }; |
|
Thanks for the post-merge review @LilyAcorn! These all seem sensible. If you feel like doing a PR for them at some point be my guest! |
Moves the ImportScanner class to Rust.
In order to do this we also need to define Rust-based swappable fake/real file systems to be passed in, so we can still unit test via Python. To reduce the amount of work involved, I've defined a narrower interface for the filesystem that only implements what is needed by
ImportScanner. We can broaden it later when we come to do the same with caching / module finding.There's some distinctly smelly Rust code that loads the
ModuleandDirectImportdataclasses from Python, rather than defining them in Rust - all in the interests of trying to limit how many changes I needed to make to keep the tests passing.Parallelism next steps
The
ImportScannerclass currently requires the GIL, which means we still have a bit of work to do before we can move to multithreading. I think the best thing to do is abandon the ports-and-adapters approach forImportScanner(we only have one of them anyway) and instead make a function that we can unit test from Python, that does them in bulk along the lines of #222. We should keep the unit tests of import scanner but just adapt them to call the bulk function instead.The bulk function could at first be in Python, but then we could push that down to Rust. That would allow us to turn the
ImportScannerinto a pure Rust class that doesn't need the GIL, and do any mapping to Python classes / exceptions in a wrapper function in Python.Performance regressions
Codspeed consistently reports a slowdown with this change. I'm a bit confused as to why this change would make a difference, e.g. to
get_shortest_path, as this should only affect the building of the graph.In practice, run on a large repository it doesn't seem to noticeably change things so I think I'm okay with this merging, given that it will pave the way for a speed up soon.