Rust importscanner #229

seddonym · 2025-06-23T07:43:16Z

Moves the ImportScanner class to Rust.

In order to do this we also need to define Rust-based swappable fake/real file systems to be passed in, so we can still unit test via Python. To reduce the amount of work involved, I've defined a narrower interface for the filesystem that only implements what is needed by ImportScanner. We can broaden it later when we come to do the same with caching / module finding.

There's some distinctly smelly Rust code that loads the Module and DirectImport dataclasses from Python, rather than defining them in Rust - all in the interests of trying to limit how many changes I needed to make to keep the tests passing.

Parallelism next steps

The ImportScanner class currently requires the GIL, which means we still have a bit of work to do before we can move to multithreading. I think the best thing to do is to abandon the ports-and-adapters approach for ImportScanner (we only have one of them anyway) and instead write a function, unit-testable from Python, that scans modules in bulk along the lines of #222. We should keep the ImportScanner unit tests but adapt them to call the bulk function instead.

The bulk function could at first be in Python, but then we could push that down to Rust. That would allow us to turn the ImportScanner into a pure Rust class that doesn't need the GIL, and do any mapping to Python classes / exceptions in a wrapper function in Python.
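As a rough illustration of where this could end up, here is a minimal sketch of a bulk scanning function in plain Rust. Everything in it is hypothetical: `scan_imports` is a naive stand-in for the real scanner, and `scan_in_bulk`, its signature and the `workers` parameter are assumptions, not the design from #222. The point is only that once the scanner touches no Python objects, the work can be split across threads without the GIL.

```rust
use std::thread;

/// Hypothetical stand-in for the scanner: collects module names from lines
/// of the form `import a.b.c`. A real scanner would parse the source properly.
fn scan_imports(source: &str) -> Vec<String> {
    source
        .lines()
        .filter_map(|line| line.trim().strip_prefix("import "))
        .map(|name| name.trim().to_string())
        .collect()
}

/// Hypothetical bulk function: scans every (module name, source) pair,
/// splitting the work across up to `workers` threads. Input order is kept.
fn scan_in_bulk(modules: &[(String, String)], workers: usize) -> Vec<(String, Vec<String>)> {
    // Guard against zero workers and against a zero chunk size.
    let chunk_size = modules.len().div_ceil(workers.max(1)).max(1);
    thread::scope(|scope| {
        // Spawn all threads first, then join them in order.
        let handles: Vec<_> = modules
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|(name, source)| (name.clone(), scan_imports(source)))
                        .collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("scanner thread panicked"))
            .collect::<Vec<_>>()
    })
}
```

Mapping the results to Python classes and exceptions would then happen in a thin wrapper, as described above.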

Performance regressions

Codspeed consistently reports a slowdown with this change. I'm a bit confused as to why this change would make a difference, e.g. to get_shortest_path, as this should only affect the building of the graph.

In practice, when run on a large repository, it doesn't seem to noticeably change things, so I think I'm okay with merging this, given that it will pave the way for a speed-up soon.

codspeed-hq · 2025-06-23T07:49:29Z

CodSpeed Instrumentation Performance Report

Merging #229 will degrade performances by 18.12%

Comparing rust-importscanner (80b2234) with master (a62d0b7)

Summary

❌ 3 (👁 3) regressions
✅ 19 untouched benchmarks

Benchmarks breakdown

	Benchmark	`BASE`	`HEAD`	Change
👁	`test_deep_layers_large_graph_kept`	16.3 ms	19.9 ms	-18.12%
👁	`test_no_chain`	1.1 ms	1.2 ms	-11.16%
👁	`test_no_chains`	1.1 ms	1.2 ms	-11.14%

Much of the Rust code was generated by an LLM, so it could possibly be simplified. The Python tests are adapted so some of the same tests can be run on the Python FakeFileSystem and the Rust-based FakeBasicFileSystem.

This means we don't need to pickle the FileSystem - making it possible to use the Rust-based file system classes while doing multiprocessing.

RealBasicFileSystem is renamed to PyRealBasicFileSystem, likewise with FakeBasicFileSystem. These then wrap inner Rust structs. We do this so we can box a file system in a Rust-based ImportScanner class, and then interact with it polymorphically. (See an upcoming commit.)
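A minimal sketch of the boxing idea, without any pyo3 involved. The trait name, its single `read` method and the `Scanner` struct are assumptions made for illustration; the real inner structs and their interface may look different.

```rust
use std::collections::HashMap;

/// Hypothetical narrow interface; the real trait's name and methods may differ.
trait FileSystem {
    fn read(&self, file_name: &str) -> Option<String>;
}

/// In-memory implementation, in the spirit of the fake used by the tests.
struct FakeFileSystem {
    contents: HashMap<String, String>,
}

impl FileSystem for FakeFileSystem {
    fn read(&self, file_name: &str) -> Option<String> {
        self.contents.get(file_name).cloned()
    }
}

/// Implementation that reads from disk via the standard library.
struct RealFileSystem;

impl FileSystem for RealFileSystem {
    fn read(&self, file_name: &str) -> Option<String> {
        std::fs::read_to_string(file_name).ok()
    }
}

/// The scanner owns a boxed trait object, so it works the same way
/// whichever implementation it was given.
struct Scanner {
    file_system: Box<dyn FileSystem>,
}

impl Scanner {
    /// Toy operation that only demonstrates the polymorphic call.
    fn source_length(&self, file_name: &str) -> Option<usize> {
        self.file_system.read(file_name).map(|source| source.len())
    }
}
```

The Python-facing wrapper classes would then each hold one of the inner structs and hand a boxed copy to the scanner.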

LilyFirefly

Sorry I didn't review before this was merged! I hope these suggestions are helpful anyway:

LilyFirefly · 2025-08-01T13:56:24Z

rust/src/filesystem.rs

+
+    #[getter]
+    fn sep(&self) -> String {
+        "/".to_string()


It might be worth interning this with intern! for efficiency if it's called a lot from Python.

LilyFirefly · 2025-08-01T14:01:24Z

rust/src/filesystem.rs

+        let sep = self.sep();
+        components
+            .into_iter()
+            .map(|c| c.trim_end_matches(&sep).to_string())
+            .collect::<Vec<String>>()
+            .join(&sep)


If the performance is important, it might make sense to define const SEP = "/" as a module constant to avoid creating a String every time join is called. This could still be referred to from self.sep() for Python code.

I also wonder if we can avoid the Vec and/or the String in the collect?
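For what it's worth, here is one way that could look. This is a sketch as a free function rather than the actual method, and the generic signature is an assumption. It uses a module-level `SEP` constant and writes into a single output `String`, so there is no intermediate `Vec<String>`.

```rust
/// Module-level separator, so no `String` is allocated just to name it.
const SEP: &str = "/";

/// Joins path components with `SEP`, stripping trailing separators from
/// each component and writing straight into one output `String`.
fn join<'a>(components: impl IntoIterator<Item = &'a str>) -> String {
    let mut joined = String::new();
    for (index, component) in components.into_iter().enumerate() {
        if index > 0 {
            joined.push_str(SEP);
        }
        joined.push_str(component.trim_end_matches(SEP));
    }
    joined
}
```

The `sep` getter exposed to Python could still return `SEP.to_string()`.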

LilyFirefly · 2025-08-01T14:04:11Z

rust/src/filesystem.rs

+            return ("".to_string(), "".to_string());
+        }
+
+        let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)


The explicit reference here should be unnecessary because "" is already a &str:

Suggested change

-        let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)
+        let tail = components.last().unwrap_or(""); // Last component, or empty if components is empty (shouldn't happen from split)

LilyFirefly · 2025-08-01T14:13:05Z

rust/src/filesystem.rs

+        let components: Vec<&str> = file_name.split('/').collect();
+
+        if components.is_empty() {
+            return ("".to_string(), "".to_string());
+        }
+
+        let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)
+
+        let head_components = &components[..components.len() - 1]; // All components except the last


split returns a DoubleEndedIterator, so it can be iterated from both ends:

Suggested change

-        let components: Vec<&str> = file_name.split('/').collect();
-
-        if components.is_empty() {
-            return ("".to_string(), "".to_string());
-        }
-
-        let tail = components.last().unwrap_or(&""); // Last component, or empty if components is empty (shouldn't happen from split)
-
-        let head_components = &components[..components.len() - 1]; // All components except the last
+        let mut components = file_name.split('/');
+        let tail = match components.next_back() {
+            Some(tail) => tail,
+            None => return ("".to_string(), "".to_string()),
+        };
+        let head_components: Vec<&str> = components.collect();
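Put together as a standalone function, the idea looks roughly like the sketch below. How the head is reassembled afterwards is an assumption (a plain join on `/`), since the rest of the method isn't shown in this hunk. Note that `next_back()` on the split iterator yields `Option<&str>` directly, so no `&""` default is involved.

```rust
/// Splits a path into (head, tail) around the last '/'.
/// The head is rebuilt with a plain join, which is an assumption here.
fn split(file_name: &str) -> (String, String) {
    let mut components = file_name.split('/');
    // `split` always yields at least one item, so `None` is only a formality.
    let tail = match components.next_back() {
        Some(tail) => tail,
        None => return (String::new(), String::new()),
    };
    // Whatever remains in the iterator is everything except the last component.
    let head_components: Vec<&str> = components.collect();
    (head_components.join("/"), tail.to_string())
}
```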

LilyFirefly · 2025-08-01T14:17:05Z

rust/src/filesystem.rs

+    fn read(&self, file_name: &str) -> PyResult<String> {
+        match self.contents.get(file_name) {
+            Some(file_name) => Ok(file_name.clone()),
+            None => Err(PyFileNotFoundError::new_err("")),


I'd include the file_name in the error message:

Suggested change

-            None => Err(PyFileNotFoundError::new_err("")),
+            None => Err(PyFileNotFoundError::new_err(format!("No such file: {file_name}"))),
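The same pattern without pyo3, as a sketch: `std::io::Error` with `ErrorKind::NotFound` stands in for `PyFileNotFoundError`, and `InMemoryFiles` is a hypothetical name for the fake file system struct. The match binding is also renamed to `contents` so it no longer shadows the `file_name` parameter.

```rust
use std::collections::HashMap;
use std::io;

/// Hypothetical stand-in for the fake file system struct.
struct InMemoryFiles {
    contents: HashMap<String, String>,
}

impl InMemoryFiles {
    /// Returns the stored contents, or a NotFound error naming the missing file.
    fn read(&self, file_name: &str) -> io::Result<String> {
        match self.contents.get(file_name) {
            Some(contents) => Ok(contents.clone()),
            None => Err(io::Error::new(
                io::ErrorKind::NotFound,
                format!("No such file: {file_name}"),
            )),
        }
    }
}
```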

LilyFirefly · 2025-08-01T14:33:51Z

rust/src/filesystem.rs

+
+    #[getter]
+    fn sep(&self) -> String {
+        std::path::MAIN_SEPARATOR.to_string()


Consider intern! here too.

LilyFirefly · 2025-08-01T14:37:35Z

rust/src/filesystem.rs

+        for component in components {
+            path.push(component);
+        }
+        path.to_str().unwrap().to_string()


I'd use .expect() instead of .unwrap() - this allows leaving an explanatory message for why it's safe to unwrap:

Suggested change

-        path.to_str().unwrap().to_string()
+        path.to_str().expect("Path components are valid unicode").to_string()

LilyFirefly · 2025-08-01T14:47:57Z

rust/src/filesystem.rs

+        })?;
+
+        let s = String::from_utf8_lossy(&bytes);
+        let encoding_re = Regex::new(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)").unwrap();


It's probably worth seeing if this can be moved to module scope, so it's only compiled once no matter how often read is called. It might be necessary to use LazyCell.
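A sketch of the compile-once pattern using only the standard library. For a `static`, the thread-safe `std::sync::LazyLock` is the type that fits, since `LazyCell` is not `Sync`. To keep the example free of external crates, `EncodingPattern` and `compile_encoding_pattern` are hypothetical stand-ins for the `Regex` and the `Regex::new(...)` call, and the matching logic is deliberately much cruder than the real pattern.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how often the pattern is built, purely to show it happens once.
static COMPILE_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a compiled regex.
struct EncodingPattern {
    markers: [&'static str; 2],
}

/// Stand-in for the expensive `Regex::new(...)` call.
fn compile_encoding_pattern() -> EncodingPattern {
    COMPILE_COUNT.fetch_add(1, Ordering::SeqCst);
    EncodingPattern {
        markers: ["coding:", "coding="],
    }
}

/// Module-scope value, initialised on first use and shared afterwards.
static ENCODING_PATTERN: LazyLock<EncodingPattern> = LazyLock::new(compile_encoding_pattern);

/// Crude check for an encoding declaration in a comment line.
fn declares_encoding(line: &str) -> bool {
    let line = line.trim_start();
    line.starts_with('#') && ENCODING_PATTERN.markers.iter().any(|marker| line.contains(marker))
}
```

With the regex crate, the static would hold a `Regex` instead and `read` would just dereference it.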

LilyFirefly · 2025-08-01T15:00:52Z

rust/src/import_scanning.rs

+    #[allow(unused_variables)]
+    #[new]
+    #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
+    fn new(
+        py: Python,


I think you can remove the #[allow(unused_variables)] if you rename py to _py:

Suggested change

-    #[allow(unused_variables)]
-    #[new]
-    #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
-    fn new(
-        py: Python,
+    #[new]
+    #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
+    fn new(
+        _py: Python,

Or maybe you don't need py in the signature at all?

Suggested change

-    #[allow(unused_variables)]
-    #[new]
-    #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
-    fn new(
-        py: Python,
+    #[new]
+    #[pyo3(signature = (file_system, found_packages, include_external_packages=false))]
+    fn new(

LilyFirefly · 2025-08-01T15:04:22Z

rust/src/import_scanning.rs

+        match parse_result {
+            Err(GrimpError::ParseError {
+                line_number, text, ..
+            }) => {
+                // TODO: define SourceSyntaxError using pyo3.
+                let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap();
+                let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap();
+                let exception = py_exception_class
+                    .call1((module_filename, line_number, text))
+                    .unwrap();
+                return Err(PyErr::from_value(exception));
+            }
+            Err(e) => {
+                return Err(e.into());
+            }
+            _ => (),
+        }
+        let imported_objects = parse_result.unwrap();


You can avoid needing to call parse_result.unwrap() by assigning the match:

Suggested change

-        match parse_result {
-            Err(GrimpError::ParseError {
-                line_number, text, ..
-            }) => {
-                // TODO: define SourceSyntaxError using pyo3.
-                let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap();
-                let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap();
-                let exception = py_exception_class
-                    .call1((module_filename, line_number, text))
-                    .unwrap();
-                return Err(PyErr::from_value(exception));
-            }
-            Err(e) => {
-                return Err(e.into());
-            }
-            _ => (),
-        }
-        let imported_objects = parse_result.unwrap();
+        let imported_objects = match parse_result {
+            Err(GrimpError::ParseError {
+                line_number, text, ..
+            }) => {
+                // TODO: define SourceSyntaxError using pyo3.
+                let exceptions_pymodule = PyModule::import(py, "grimp.exceptions").unwrap();
+                let py_exception_class = exceptions_pymodule.getattr("SourceSyntaxError").unwrap();
+                let exception = py_exception_class
+                    .call1((module_filename, line_number, text))
+                    .unwrap();
+                return Err(PyErr::from_value(exception));
+            }
+            Err(e) => return Err(e.into()),
+            Ok(imported_objects) => imported_objects,
+        };
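The same shape in a self-contained form, for anyone who wants to try it outside the codebase. `ScanError`, `parse_imports` and `count_imports` are hypothetical stand-ins for `GrimpError`, the real parser and the surrounding method; plain `String` errors replace the Python exceptions.

```rust
/// Hypothetical stand-in for the crate's error type.
#[derive(Debug)]
enum ScanError {
    Parse { line_number: usize, text: String },
    Other(String),
}

/// Hypothetical parser: every non-empty line must be `import <name>`.
fn parse_imports(source: &str) -> Result<Vec<String>, ScanError> {
    if source.contains('\0') {
        return Err(ScanError::Other("source contains a NUL byte".to_string()));
    }
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| !line.trim().is_empty())
        .map(|(index, line)| match line.trim().strip_prefix("import ") {
            Some(name) => Ok(name.to_string()),
            None => Err(ScanError::Parse {
                line_number: index + 1,
                text: line.to_string(),
            }),
        })
        .collect()
}

/// The error arms return early; the `Ok` arm yields the value, so the
/// match can be assigned directly and no later `unwrap()` is needed.
fn count_imports(source: &str) -> Result<usize, String> {
    let imported_objects = match parse_imports(source) {
        Err(ScanError::Parse { line_number, text }) => {
            return Err(format!("syntax error on line {line_number}: {text}"));
        }
        Err(e) => return Err(format!("{e:?}")),
        Ok(imported_objects) => imported_objects,
    };
    Ok(imported_objects.len())
}
```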

seddonym · 2025-08-06T14:23:04Z

Thanks for the post-merge review @LilyAcorn! These all seem sensible. If you feel like doing a PR for them at some point be my guest!

seddonym mentioned this pull request Jun 24, 2025

Rust Import Scanner #221

Closed

seddonym force-pushed the rust-importscanner branch 2 times, most recently from ddac1be to 3972bc7 Compare June 25, 2025 17:08

Add uv.lock to .gitignore

9ba14b1

seddonym force-pushed the rust-importscanner branch 4 times, most recently from e83fbcc to bf838a1 Compare June 27, 2025 16:25

seddonym added 14 commits June 27, 2025 17:38

Run cargo clippy --fix with latest rust version

023ff25

Define BasicFileSystem protocol

afb76a8

Implement FakeBasicFileSystem

979e7fa

Use FakeBasicFileSystem in unit test

75c6e9d

Pass BasicFileSystem to ImportScanner from use case

9639232

Implement RealBasicFileSystem

1e5d688

Instantiate ImportScanner from within _scan_chunk

1475c55

Handle include_external_packages=None

6c57ea1

Use RealBasicFileSystem in prod

0676640

Make AbstractImportScanner a Protocol

a9f6210

Define some Rust equivalents for module finding data structures

3bbefa5

Define ImportScanner in Rust

92299c5

Use Rust-based ImportScanner in production

80b2234

seddonym force-pushed the rust-importscanner branch from bf838a1 to 80b2234 Compare June 27, 2025 16:39

seddonym marked this pull request as ready for review June 27, 2025 16:46

seddonym merged commit a1a0527 into master Jul 14, 2025
18 checks passed

seddonym deleted the rust-importscanner branch July 14, 2025 16:13

seddonym mentioned this pull request Jul 16, 2025

Closed layers #227

Merged

LilyFirefly reviewed Aug 1, 2025

LilyFirefly mentioned this pull request Sep 5, 2025

Rust importscanner review fixes #245

Merged
