Collaborator

@seddonym seddonym commented Jun 23, 2025

Moves the ImportScanner class to Rust.

In order to do this we also need to define Rust-based swappable fake/real file systems to be passed in, so we can still unit test via Python. To reduce the amount of work involved, I've defined a narrower interface for the filesystem that only implements what is needed by ImportScanner. We can broaden it later when we come to do the same with caching / module finding.
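As a rough illustration of the swappable file system idea, here is a minimal sketch using only the standard library. The trait name `BasicFileSystem`, the `RealFs` / `FakeFs` structs and the `Scanner` consumer are all hypothetical stand-ins, not the types in this PR, and the pyo3 bindings are left out entirely.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;

/// Narrow file system interface: only what a scanner needs.
/// (Hypothetical name and methods, for illustration only.)
pub trait BasicFileSystem {
    fn sep(&self) -> String;
    fn read(&self, file_name: &str) -> io::Result<String>;
}

/// Real implementation, backed by the operating system.
pub struct RealFs;

impl BasicFileSystem for RealFs {
    fn sep(&self) -> String {
        std::path::MAIN_SEPARATOR.to_string()
    }

    fn read(&self, file_name: &str) -> io::Result<String> {
        fs::read_to_string(file_name)
    }
}

/// In-memory fake for unit tests: file name -> contents.
pub struct FakeFs {
    pub contents: HashMap<String, String>,
}

impl BasicFileSystem for FakeFs {
    fn sep(&self) -> String {
        "/".to_string()
    }

    fn read(&self, file_name: &str) -> io::Result<String> {
        self.contents.get(file_name).cloned().ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, format!("No such file: {file_name}"))
        })
    }
}

/// A consumer that holds either implementation behind a boxed trait object.
pub struct Scanner {
    file_system: Box<dyn BasicFileSystem>,
}

impl Scanner {
    pub fn new(file_system: Box<dyn BasicFileSystem>) -> Self {
        Scanner { file_system }
    }

    /// Toy operation: count the lines in a file read through the file system.
    pub fn line_count(&self, file_name: &str) -> io::Result<usize> {
        Ok(self.file_system.read(file_name)?.lines().count())
    }
}
```

A test would construct `Scanner::new(Box::new(FakeFs { contents }))`, while production code would pass `Box::new(RealFs)`; the scanner itself never needs to know which one it was given.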

There's some distinctly smelly Rust code that loads the Module and DirectImport dataclasses from Python, rather than defining them in Rust - all in the interests of trying to limit how many changes I needed to make to keep the tests passing.

Parallelism next steps

The ImportScanner class currently requires the GIL, which means we still have a bit of work to do before we can move to multithreading. I think the best thing to do is abandon the ports-and-adapters approach for ImportScanner (we only have one of them anyway) and instead write a function, unit-testable from Python, that scans modules in bulk along the lines of #222. We should keep the ImportScanner unit tests but adapt them to call the bulk function instead.

The bulk function could at first be in Python, but then we could push that down to Rust. That would allow us to turn the ImportScanner into a pure Rust class that doesn't need the GIL, and do any mapping to Python classes / exceptions in a wrapper function in Python.
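To make the bulk idea concrete, here is a speculative sketch of what a GIL-free bulk function might look like once it lives in pure Rust. Everything in it is an assumption: `scan_one` is a toy stand-in for real import parsing, and `scan_bulk` and its `workers` parameter are hypothetical names rather than anything proposed in #222.

```rust
use std::thread;

/// Hypothetical per-file scan: collects the names on `import x` lines.
/// Real import parsing is far more involved; this is only a stand-in.
fn scan_one(source: &str) -> Vec<String> {
    source
        .lines()
        .filter_map(|line| line.trim().strip_prefix("import "))
        .map(|name| name.trim().to_string())
        .collect()
}

/// Scan many sources in bulk, one scoped thread per chunk.
/// Results come back in the same order as the input.
pub fn scan_bulk(sources: &[String], workers: usize) -> Vec<Vec<String>> {
    // Guard against zero workers and empty input: `chunks(0)` would panic.
    let chunk_size = sources.len().div_ceil(workers.max(1)).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = sources
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|s| scan_one(s)).collect::<Vec<_>>())
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("scanner thread panicked"))
            .collect()
    })
}
```

Because nothing here touches Python objects, a thin wrapper could convert the plain Rust results into Python classes and exceptions afterwards, which is the split described above.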

Performance regressions

Codspeed consistently reports a slowdown with this change. I'm a bit confused as to why this change would make a difference, e.g. to get_shortest_path, as this should only affect the building of the graph.

In practice, when run on a large repository, it doesn't seem to noticeably change things, so I think I'm okay with merging this, given that it will pave the way for a speed-up soon.


codspeed-hq bot commented Jun 23, 2025

CodSpeed Instrumentation Performance Report

Merging #229 will degrade performance by 18.12%

Comparing rust-importscanner (80b2234) with master (a62d0b7)

Summary

❌ 3 (👁 3) regressions
✅ 19 untouched benchmarks

Benchmarks breakdown

| Benchmark | BASE | HEAD | Change |
|---|---|---|---|
| 👁 test_deep_layers_large_graph_kept | 16.3 ms | 19.9 ms | -18.12% |
| 👁 test_no_chain | 1.1 ms | 1.2 ms | -11.16% |
| 👁 test_no_chains | 1.1 ms | 1.2 ms | -11.14% |

@seddonym seddonym mentioned this pull request Jun 24, 2025
@seddonym seddonym force-pushed the rust-importscanner branch 2 times, most recently from ddac1be to 3972bc7 Compare June 25, 2025 17:08
@seddonym seddonym force-pushed the rust-importscanner branch 4 times, most recently from e83fbcc to bf838a1 Compare June 27, 2025 16:25
seddonym added 14 commits June 27, 2025 17:38
Much of the Rust code was generated by an LLM, so it could possibly be
simplified.

The Python tests are adapted so some of the same tests can be run on the
Python FakeFileSystem and the Rust-based FakeBasicFileSystem.
This means we don't need to pickle the FileSystem - making it possible
to use the Rust-based file system classes while doing multiprocessing.
RealBasicFileSystem is renamed to PyRealBasicFileSystem, likewise with
FakeBasicFileSystem. These then wrap inner Rust structs. We do this so
we can box a file system in a Rust-based ImportScanner class, and then
interact with it polymorphically. (See an upcoming commit.)
@seddonym seddonym force-pushed the rust-importscanner branch from bf838a1 to 80b2234 Compare June 27, 2025 16:39
@seddonym seddonym marked this pull request as ready for review June 27, 2025 16:46
@seddonym seddonym merged commit a1a0527 into master Jul 14, 2025
18 checks passed
@seddonym seddonym deleted the rust-importscanner branch July 14, 2025 16:13
@seddonym seddonym mentioned this pull request Jul 16, 2025
Contributor

@LilyFirefly LilyFirefly left a comment

Sorry I didn't review before this was merged! I hope these suggestions are helpful anyway:


#[getter]
fn sep(&self) -> String {
    "/".to_string()

It might be worth interning this with intern! for efficiency if it's called a lot from Python.

Comment on lines +42 to +47
let sep = self.sep();
components
    .into_iter()
    .map(|c| c.trim_end_matches(&sep).to_string())
    .collect::<Vec<String>>()
    .join(&sep)

If the performance is important, it might make sense to define const SEP: &str = "/" as a module constant to avoid creating a String every time join is called. This could still be referred to from self.sep() for Python code.

I also wonder if we can avoid the Vec and/or the String in the collect?
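One way the two suggestions above could combine, as a sketch rather than a drop-in replacement: a module-level `SEP` constant plus a loop that writes directly into one output `String`, so there is no intermediate `Vec<String>` and no per-component `String`. The free function `join` taking a slice is an assumption; the method in the PR has its own signature.

```rust
/// Module-level separator, so no `String` is built just to hold "/".
const SEP: &str = "/";

/// Join path components, trimming trailing separators from each one.
/// Writes straight into a single output `String`.
pub fn join(components: &[&str]) -> String {
    let mut joined = String::new();
    for (index, component) in components.iter().enumerate() {
        if index > 0 {
            joined.push_str(SEP);
        }
        joined.push_str(component.trim_end_matches(SEP));
    }
    joined
}
```

Whether this is measurably faster than the `map` / `collect` / `join` chain would need benchmarking; the point is only that the allocations are avoidable.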

        return ("".to_string(), "".to_string());
    }

    let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)

The explicit reference is needed as written, because last() on a Vec<&str> returns an Option<&&str>. Adding .copied() turns that into an Option<&str>, so a plain "" works and tail becomes a &str:

Suggested change
- let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)
+ let tail = components.last().copied().unwrap_or(""); // Last component, or empty if components is empty (shouldn't happen from split)

Comment on lines +51 to +59
let components: Vec<&str> = file_name.split('/').collect();

if components.is_empty() {
    return ("".to_string(), "".to_string());
}

let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)

let head_components = &components[..components.len() - 1]; // All components except the last

split returns a DoubleEndedIterator, so it can be iterated from both ends:

Suggested change
- let components: Vec<&str> = file_name.split('/').collect();
- if components.is_empty() {
-     return ("".to_string(), "".to_string());
- }
- let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)
- let head_components = &components[..components.len() - 1]; // All components except the last
+ let mut components = file_name.split('/');
+ let tail = match components.next_back() {
+     Some(tail) => tail,
+     None => return ("".to_string(), "".to_string()),
+ };
+ let head_components: Vec<&str> = components.collect();
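For reference, here is the double-ended approach as a complete function. This is a sketch: the free function `split` and the way the head is reassembled with `join("/")` are assumptions, since the rest of the original method is not shown in this thread.

```rust
/// Split a path into (head, tail) around the last '/'.
pub fn split(file_name: &str) -> (String, String) {
    let mut components = file_name.split('/');
    // `str::split` always yields at least one item, so `None` is defensive only.
    let tail = match components.next_back() {
        Some(tail) => tail,
        None => return (String::new(), String::new()),
    };
    // Whatever is left in the iterator is everything before the tail.
    let head_components: Vec<&str> = components.collect();
    (head_components.join("/"), tail.to_string())
}
```

Taking the tail with `next_back()` means the iterator never has to be collected just to look at its last element.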

fn read(&self, file_name: &str) -> PyResult<String> {
    match self.contents.get(file_name) {
        Some(file_name) => Ok(file_name.clone()),
        None => Err(PyFileNotFoundError::new_err("")),

I'd include the file_name in the error message:

Suggested change
- None => Err(PyFileNotFoundError::new_err("")),
+ None => Err(PyFileNotFoundError::new_err(format!("No such file: {file_name}"))),


#[getter]
fn sep(&self) -> String {
    std::path::MAIN_SEPARATOR.to_string()

Consider intern! here too.

for component in components {
    path.push(component);
}
path.to_str().unwrap().to_string()

I'd use .expect() instead of .unwrap() - this allows leaving an explanatory message for why it's safe to unwrap:

Suggested change
- path.to_str().unwrap().to_string()
+ path.to_str().expect("Path components are valid unicode").to_string()
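A standalone version of the same pattern, as a small sketch. The function name `join_os_path` and the slice parameter are hypothetical; the reasoning in the `expect` message holds here because every component arrives as a `&str`, which is always valid UTF-8.

```rust
use std::path::PathBuf;

/// Join components using the platform's path separator.
pub fn join_os_path(components: &[&str]) -> String {
    let mut path = PathBuf::new();
    for component in components {
        path.push(component);
    }
    // The message documents why this cannot fail, and appears in the panic
    // output if that assumption is ever broken.
    path.to_str()
        .expect("path built only from &str components is valid UTF-8")
        .to_string()
}
```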

})?;

let s = String::from_utf8_lossy(&bytes);
let encoding_re = Regex::new(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)").unwrap();

It's probably worth seeing if this can be moved to module scope, so it's only compiled once no matter how often read is called. That would likely need std::sync::LazyLock rather than LazyCell, because LazyCell is not Sync and so can't be used in a static.
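The compile-once pattern could look roughly like the sketch below. To keep it free of external crates, `CompiledPattern` is a made-up stand-in for `regex::Regex` that does a much cruder match than the real encoding regex; `BUILD_COUNT` exists only to show that the initialiser runs a single time. It assumes a toolchain where `std::sync::LazyLock` is stable (Rust 1.80 or later).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many times the pattern is built (demonstration only).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for a compiled regex; real code would hold a `regex::Regex`.
pub struct CompiledPattern {
    marker: &'static str,
}

impl CompiledPattern {
    fn compile(marker: &'static str) -> Self {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        CompiledPattern { marker }
    }

    /// Return the first token after the marker on the first line, if any.
    pub fn find_encoding<'a>(&self, source: &'a str) -> Option<&'a str> {
        let first_line = source.lines().next()?;
        let (_, rest) = first_line.split_once(self.marker)?;
        rest.split_whitespace().next()
    }
}

/// Module-scope pattern, initialised lazily on first use and then shared.
static ENCODING_PATTERN: LazyLock<CompiledPattern> =
    LazyLock::new(|| CompiledPattern::compile("coding:"));

pub fn detect_encoding(source: &str) -> Option<&str> {
    ENCODING_PATTERN.find_encoding(source)
}
```

With the real regex, the static would be `LazyLock<Regex>` and the closure would call `Regex::new(...)`, so the `unwrap()` on compilation runs once instead of on every `read`.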

Comment on lines +29 to +33
#[allow(unused_variables)]
#[new]
#[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
fn new(
    py: Python,

I think you can remove the #[allow(unused_variables)] if you rename py to _py:

Suggested change
- #[allow(unused_variables)]
- #[new]
- #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
- fn new(
-     py: Python,
+ #[new]
+ #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
+ fn new(
+     _py: Python,

Or maybe you don't need py in the signature at all?

Suggested change
- #[allow(unused_variables)]
- #[new]
- #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
- fn new(
-     py: Python,
+ #[new]
+ #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
+ fn new(

Comment on lines +74 to +91
match parse_result {
    Err(GrimpError::ParseError {
        line_number, text, ..
    }) => {
        // TODO: define SourceSyntaxError using pyo3.
        let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap();
        let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap();
        let exception = py_exception_class
            .call1((module_filename, line_number, text))
            .unwrap();
        return Err(PyErr::from_value(exception));
    }
    Err(e) => {
        return Err(e.into());
    }
    _ => (),
}
let imported_objects = parse_result.unwrap();

You can avoid needing to call parse_result.unwrap() by assigning the match:

Suggested change
- match parse_result {
-     Err(GrimpError::ParseError {
-         line_number, text, ..
-     }) => {
-         // TODO: define SourceSyntaxError using pyo3.
-         let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap();
-         let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap();
-         let exception = py_exception_class
-             .call1((module_filename, line_number, text))
-             .unwrap();
-         return Err(PyErr::from_value(exception));
-     }
-     Err(e) => {
-         return Err(e.into());
-     }
-     _ => (),
- }
- let imported_objects = parse_result.unwrap();
+ let imported_objects = match parse_result {
+     Err(GrimpError::ParseError {
+         line_number, text, ..
+     }) => {
+         // TODO: define SourceSyntaxError using pyo3.
+         let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap();
+         let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap();
+         let exception = py_exception_class
+             .call1((module_filename, line_number, text))
+             .unwrap();
+         return Err(PyErr::from_value(exception));
+     }
+     Err(e) => return Err(e.into()),
+     Ok(imported_objects) => imported_objects,
+ };
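The same shape in a self-contained form, as a sketch without pyo3. `ScanError`, `parse` and `scan` are hypothetical stand-ins for `GrimpError`, the real parser and the scanning method; only the control flow mirrors the suggestion.

```rust
/// Hypothetical error type standing in for the crate's own error enum.
#[derive(Debug, PartialEq)]
pub enum ScanError {
    Parse { line_number: usize, text: String },
    Other(String),
}

/// Toy parser: accepts only `import x` lines.
fn parse(source: &str) -> Result<Vec<String>, ScanError> {
    if source.is_empty() {
        return Err(ScanError::Other("empty source".to_string()));
    }
    let mut imported = Vec::new();
    for (index, line) in source.lines().enumerate() {
        match line.strip_prefix("import ") {
            Some(name) => imported.push(name.trim().to_string()),
            None => {
                return Err(ScanError::Parse {
                    line_number: index + 1,
                    text: line.to_string(),
                })
            }
        }
    }
    Ok(imported)
}

/// Every error arm returns early, so the match itself evaluates to the
/// `Ok` payload and no later `unwrap()` is needed.
pub fn scan(source: &str) -> Result<usize, String> {
    let imported = match parse(source) {
        Err(ScanError::Parse { line_number, text }) => {
            return Err(format!("syntax error on line {line_number}: {text}"));
        }
        Err(ScanError::Other(message)) => return Err(message),
        Ok(imported) => imported,
    };
    Ok(imported.len())
}
```

Binding the `Ok` value in the match also makes the compiler check that every variant is handled, which the `_ => ()` arm in the original hides.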

Collaborator Author

seddonym commented Aug 6, 2025

Thanks for the post-merge review @LilyAcorn! These all seem sensible. If you feel like doing a PR for them at some point be my guest!


3 participants