Optimize large (~500K to 1M+) number of paths #7

Open · jdb8 opened this issue Jan 5, 2018 · 2 comments

jdb8 commented Jan 5, 2018

Hi there! I just stumbled upon this project today and it seems like exactly what I need to replace my very slow SSHFS setup, so thank you very much for all your work!

I've been playing around with it and setting things up in my environment, where (similarly to your examples) I have a folder containing various code projects, and each of these code projects contains quite a few files. After seeing things work in each project folder individually, my plan was to effectively 'mount' all of them at once and let mirror handle the syncing. I started off with an initial rsync which pulled in most of the data.
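(For reference, that initial seed copy was along these lines; the host and paths here are placeholders, not my actual setup:)

```sh
# One-time seed copy of the remote code tree before mirror takes over
# the ongoing sync. Host and paths are placeholders.
rsync -az your-server:code/ ~/code/
```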

What I'm finding is that for a folder with all my projects (mirror reports that the server has 219717 paths), syncing appears to only work one way: my client can make a change and have it reflected on the server, but not the other way round. If I restart the client or server then things do get back in sync during the initial sync that occurs.

So I'm wondering if this is related to the inotify limits that you mention in the readme. Unfortunately I'm in an environment on the server where I can't change those limits. Interestingly though, watchman itself seems to detect the changes that mirror isn't responding to: I set up a trivial trigger to echo files that are changed, and I see them in the watchman log. I'm unsure if there's a way to access more verbose logs from mirror, so at this point I'm at a bit of a dead end. I took a look at some of the source code but couldn't work out where to start without access to a debugger, and my experience debugging java code is a little lacking :(
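For reference, a sanity check along those lines can be set up with watchman's CLI; the watch root here is a placeholder:

```sh
# Register a trivial trigger: watchman appends the changed file names as
# arguments to `echo`, and the output shows up in watchman's own log.
watchman -- trigger ~/code echo-changes '*' -- echo

# Remove the trigger again once done:
watchman trigger-del ~/code echo-changes
```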

My workaround for now will likely be to spawn individual clients for each of my project folders as required, as that seems to avoid this problem. But if there is a way to have the single code/ folder picked up from one client, it would make managing those processes a little easier for sure.

stephenh (Owner) commented Jan 5, 2018

Hey, thanks for filing the note!

I've had mirror work with ~80k files (iirc), and 200k is in the same ballpark, but both are getting up into the range where mirror can start taking longer (generally/hopefully only on the initial sync) and using more memory (iirc I typically run with ~2-4gb of RAM for that many files).

> my client can make a change and have it reflected on the server, but not the other way round

That is interesting. I'm not really sure what may cause that, but your guess about the inotify limits being too low on the server side would probably explain it. I'm not sure what the failure mode for going past the inotify limit is, e.g. whether the kernel throws some sort of error that watchman could then report back to mirror ("btw, we're not actually getting notifications for this").
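(For what it's worth, going past the limit typically surfaces as ENOSPC from inotify_add_watch; the current limits can at least be read without root:)

```sh
# Read the server's inotify limits (no root needed):
cat /proc/sys/fs/inotify/max_user_watches
cat /proc/sys/fs/inotify/max_user_instances

# Raising them requires root, which isn't an option here, but for reference:
#   sudo sysctl fs.inotify.max_user_watches=524288
```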

Oh, but right, you said watchman itself is working fine. So it must be something within mirror.

That is curious. You could try --enable-log-file and --debug some-path and see what that says, where some-path is one of the directories that is not getting synced (might be hard to do if it's a different directory each time that isn't working).
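Something along these lines; the rest of the client invocation is elided since it depends on your setup, and `code/some-project` is just a placeholder:

```sh
# Add the logging/debug flags to your usual client invocation;
# "..." stands in for the normal host/root arguments.
mirror client ... --enable-log-file --debug code/some-project
```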

You could also run top, and if the JVM process is taking ~100% of CPU, then it has very likely just run out of memory, so try increasing the heap with -Xmx. But it sounds like the client is handling that memory usage just fine, and the memory usage on the client and server should be basically the same.
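A rough sketch of both checks; the jar name and launch style below are assumptions, so adjust to however mirror is actually started in your environment:

```sh
# Is the mirror JVM pegged near 100% CPU (often a sign of GC thrashing)?
top -p "$(pgrep -f mirror | head -1)"

# If so, try a bigger heap, e.g. when launching the jar directly:
java -Xmx4g -jar mirror.jar server
```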

> My workaround for now will likely be to spawn individual clients for each of my project folders as required

Yeah, that makes sense. I've thought of potentially building this into mirror, where it assumes the 1st level of directories is a large number of projects, some "active" and some "inactive", and it only fully syncs the "active" projects (so for the inactive projects, it can avoid both using up inotify watches and JVM heap for the tree of paths/mod times).

That would add some more bookkeeping internally, which would be doable, but I haven't figured out the best way to determine which projects are active. The easiest thing would probably be to wait until a write happens. Ideally it could also watch for reads, but those don't go through inotify and would require something like FUSE to tell which directories the client is accessing.

Anyway, not sure I have anything too helpful; try the debug flags and check the heap and let me know if that works.

jdb8 (Author) commented Jan 12, 2018

It's possibly related to #9 (on the server side), which I'd rather look at instead because it's affecting even small folders (so definitely not the inotify limits this time).

Thanks for the quick response and detailed writeup! I think having some higher-level management of folders would be really cool in the long-term. I already wrote some small bash scripts around starting up individual mirror processes in the background and creating lockfiles etc. to ensure no clashes - would be great if that kind of stuff could eventually be integrated into the main tool (or I could release a higher-level tool which manages the various mirror processes).
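A minimal sketch of that kind of wrapper, assuming flock(1) is available and with the per-project mirror arguments elided:

```sh
#!/bin/bash
# One mirror client per project directory; flock ensures that running this
# script twice never spawns a second client for the same project.
for project in "$HOME"/code/*/; do
  name="$(basename "$project")"
  # -n: skip immediately if this project's lock is already held
  flock -n "/tmp/mirror-$name.lock" \
    mirror client ... &   # "..." = your usual per-project host/root flags
done
wait
```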

I think if you can help me out with #8 then I'd be able to spend some time digging into this + issues like #9 and hopefully work out what's going on.

stephenh changed the title from "Large number of paths prevent desktop -> laptop syncing" to "Optimize large (~500K to 1M+) number of paths" on Apr 24, 2018