
Optimization: Check file timestamps #127

Merged
merged 8 commits into from
Nov 30, 2022
Conversation

tkellogg
Owner

@tkellogg tkellogg commented Nov 27, 2022

The current version performs a `git commit` on every repository every 5 seconds. If nothing changed, it simply has no effect. This strategy is eating up sometimes 10-20% of my CPU (on a poor dev machine), due to the massive amount of hashing needed to commit so many repositories.

This change creates an optimization by tracking file-change timestamps and comparing them against the last commit timestamp. To do this, we have to traverse all files in each repository, which `git commit` is doing anyway. The big difference is that we only need to read the inode (directory) entries to get the file timestamps, so it should be significantly faster.
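Roughly, the idea can be sketched like this (a std-only sketch, not dura's actual code; the function name and the hard-coded `.git` skip are my own illustration): walk each repository's tree reading only metadata, and report whether anything is newer than the last commit.

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

/// Returns true if any file under `root` was modified after `last_commit`.
/// Only directory entries and metadata are read; no file contents are
/// hashed, which is the point of the optimization described above.
fn changed_since(root: &Path, last_commit: SystemTime) -> std::io::Result<bool> {
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        // Skip the .git directory; a coarse stand-in for real ignore rules.
        if entry.file_name() == ".git" {
            continue;
        }
        let meta = entry.metadata()?;
        if meta.is_dir() {
            if changed_since(&entry.path(), last_commit)? {
                return Ok(true);
            }
        } else if meta.modified()? > last_commit {
            return Ok(true);
        }
    }
    Ok(false)
}
```

If this returns false, the `git commit` can be skipped entirely; a false positive just means one no-op commit attempt, so the check can afford to be approximate.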

Considered Design: notify

It has been suggested to use the `notify` crate to be notified of file changes. The trouble here was a global limit on the number of file descriptors that can be watched, so you can't watch all files. I considered some machine learning approaches to guessing which files should be watched, but abandoned the idea because every strategy I could think of involved a fallback to exactly what I'm doing here.

Considered Design: SQLite

Use SQLite to remember timestamps of every file. This was needlessly complex because I only need to remember one timestamp: the most recent one. More importantly, the last dura commit already stores it, so SQLite wasn't adding any value.

Results

On the laptop where I was getting ~10% CPU utilization, it's now less than 0.1% (it doesn't even register). There's a stark drop in the Activity Monitor graph after I installed the updated version.

[screenshot: Activity Monitor CPU graph showing the drop]

@JakeStanger JakeStanger (Collaborator) left a comment

Looks like a major improvement without adding much complexity at all! Code is all good apart from a few little bits: one `unwrap()` I think should definitely be changed, but the rest are minor points.

src/poll_guard.rs (3 review threads, resolved)
tests/poll_guard_test.rs (1 review thread, resolved)
@tkellogg tkellogg (Owner, Author) commented

I tried this on another computer and I'm getting even higher CPU utilization — 15-25% — it's possible I just installed it wrong. Looking into it...

@tkellogg tkellogg (Owner, Author) commented

UPDATE: I found the performance issue. Over the weekend, after I produced that build that performed well on my old laptop, I started adding unit tests while having a few drinks. As expected, it seemed like a very good idea to call Repository::is_path_ignored(path) on every path in the hot path. I mean, you don't want to commit too often, right?

Well... it turns out there are a lot of kernel calls in `is_path_ignored`. Here's a flamegraph of the slow dura:

[flamegraph: the slow dura build, with the is_path_ignored calls]

versus here's dura without the `is_path_ignored` calls. It's a less accurate filter, but still dramatically decreases the calls to `Repository::commit()`:

[flamegraph: dura without the is_path_ignored calls]
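The "less accurate filter" trade-off can be sketched as a cheap name-based check used during traversal in place of per-file `Repository::is_path_ignored` calls (which is where the kernel-call cost was). The directory names beyond `.git` here are hypothetical examples, not dura's actual list:

```rust
use std::path::Path;

/// A cheap, less accurate substitute for `Repository::is_path_ignored`.
/// It only inspects the path's final component, so it costs no kernel
/// calls. False negatives are fine: an un-skipped ignored file just
/// triggers a `git commit` attempt that turns out to be a no-op.
fn cheap_skip(path: &Path) -> bool {
    matches!(
        path.file_name().and_then(|n| n.to_str()),
        Some(".git") | Some("target") | Some("node_modules")
    )
}
```

The design choice is to keep the hot path purely in-memory and let `git commit` itself remain the authoritative ignore filter.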

I inexplicably have some failing tests now. I think the other PR must have somehow broken them. Working...

@JakeStanger JakeStanger (Collaborator) left a comment
All lgtm, give me a heads up when the final tests are sorted and I'll have another check over :)

@tkellogg tkellogg merged commit 4dc1029 into master Nov 30, 2022
@tkellogg tkellogg deleted the duralite branch November 30, 2022 01:46
@tkellogg tkellogg (Owner, Author) commented

Merged, because this is definitely better than it was before. I'm going to make a new PR for more improvements.
