Indexing algorithm #161

Open
jbd opened this issue Dec 27, 2016 · 26 comments · May be fixed by #279

Comments

@jbd (Contributor) commented Dec 27, 2016

Hello,

This is more a suggestion than an issue.

Duc's indexing is already quite fast, but you might be interested in the filesystem crawling algorithm of robinhood (also written in C), if you don't already know the project:

To go beyond the performance of classical scanning tools, robinhood implements a multi-threaded version of depth-first traversal[4]. To parallelize the scan, the namespace traversal is split into individual tasks that consist in reading single directories. A pool of worker threads performs these tasks following a depth-first strategy (as illustrated on figure 3).

from the paper Taking back control of HPC file systems with Robinhood Policy Engine. See here for more.

If the storage can handle parallel requests nicely, that's a huge win. Right now I don't have any metrics comparing the crawling performance of robinhood and duc, but I'm using robinhood on a petabyte filer over NFSv3 with nearly 1 billion files on it, and the crawling performance scales almost linearly up to 8 threads. Beyond that, I can't tell whether the filer or the MySQL database that holds the results becomes the bottleneck.
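
A minimal sketch of that worker-pool, depth-first traversal, assuming POSIX threads (this is illustrative, not robinhood's or duc's actual code; the stack size, thread count, and termination scheme are assumptions): a shared LIFO stack holds directory paths, and each worker pops one, reads it, and pushes any subdirectories back as new tasks.

/* Hypothetical worker-pool depth-first traversal sketch. */
#define _GNU_SOURCE             /* for asprintf() */
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS 8              /* worker pool size (assumption) */

static char *stack[1 << 20];    /* pending directory tasks */
static int top = 0;             /* next free slot */
static int active = 0;          /* workers currently reading a directory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

static void push(char *path)
{
    pthread_mutex_lock(&lock);
    stack[top++] = path;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

static void *worker(void *arg)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (top == 0 && active > 0)
            pthread_cond_wait(&cond, &lock);
        if (top == 0) {                 /* no tasks left, no producers left */
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
            return NULL;
        }
        char *path = stack[--top];      /* LIFO pop = depth-first order */
        active++;
        pthread_mutex_unlock(&lock);

        DIR *d = opendir(path);
        if (d) {
            struct dirent *e;
            while ((e = readdir(d))) {
                /* d_type can be DT_UNKNOWN on some filesystems;
                 * a real crawler would fall back to lstat() */
                if (e->d_type != DT_DIR) continue;
                if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                    continue;
                char *sub;
                if (asprintf(&sub, "%s/%s", path, e->d_name) >= 0)
                    push(sub);          /* becomes a new task for the pool */
            }
            closedir(d);
        }
        free(path);

        pthread_mutex_lock(&lock);
        active--;
        if (top == 0 && active == 0)
            pthread_cond_broadcast(&cond);  /* wake idle workers to exit */
        pthread_mutex_unlock(&lock);
    }
}

int main(int argc, char **argv)
{
    pthread_t t[NTHREADS];
    push(strdup(argc > 1 ? argv[1] : "."));
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

The LIFO pop is what keeps the traversal depth-first, which bounds the pending-task stack compared to a breadth-first queue.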

jbd changed the title from indexing algorithm to Indexing algorithm on Dec 27, 2016
@zevv (Owner) commented Dec 28, 2016 via email

@l8gravely (Collaborator) commented Dec 28, 2016 via email

@jbd (Contributor, Author) commented Dec 28, 2016

zevv> I've looked at this too, and I think they have a crucial flaw, which is using MySQL as the backend

I tend to think like you, but I'm not a database expert. If you want to run something like that on a huge file system, you'll have to build a solid database backend. The people behind robinhood seem to be quite involved and active on this matter (you can have a look at their mailing list), and the program is used in production on multi-petabyte storage systems. But again, I think that duc hits a sweet spot by using a KV backend. And my suggestion was only about the crawling algorithm =)

zevv> As for the parallel code, it will help up until the backend (NFS on NetApp in my case, mostly) hits its limits, I fear.

Yes, no magic here. But the speedup obtained while hammering/breaking the storage system could be noticeable.

@zevv (Owner) commented Dec 28, 2016 via email

@l8gravely (Collaborator) commented Dec 28, 2016 via email

@l8gravely (Collaborator) commented Dec 29, 2016 via email

@zevv (Owner) commented Dec 29, 2016 via email

@l8gravely (Collaborator) commented Dec 29, 2016 via email

@jbd (Contributor, Author) commented Dec 29, 2016

zevv> I haven't looked in all the details about Robinhood, but maybe they store a lot more data than Duc

You can have a rough idea of what is stored in robinhood database by quickly looking at the rbh-find command options.

l8gravely> We should move from tokyocabinet to lmdb, and I'd also suggest that we pull lmdb into the duc source tree as well.

FWIW, I fully agree =)
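
For what it's worth, LMDB keeps writes quite simple; here is a minimal write sketch against the documented C API (the database path, map size, and key/value contents are placeholders, not duc's actual schema):

#include <lmdb.h>
#include <string.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, val;

    mdb_env_create(&env);
    mdb_env_set_mapsize(env, 1UL << 30);              /* 1 GiB map */
    mdb_env_open(env, "duc.db", MDB_NOSUBDIR, 0664);  /* single-file DB */

    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    key.mv_data = "dir:/data";                /* placeholder key */
    key.mv_size = strlen(key.mv_data);
    val.mv_data = "serialized-dir-entry";     /* placeholder value */
    val.mv_size = strlen(val.mv_data);
    mdb_put(txn, dbi, &key, &val, 0);

    mdb_txn_commit(txn);
    mdb_env_close(env);
    return 0;
}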

@zevv (Owner) commented Dec 29, 2016 via email

@jbd (Contributor, Author) commented Dec 30, 2016

I lost track of an interesting paper, but I found it again: On Distributed File Tree Walk of Parallel File Systems.

@jbd (Contributor, Author) commented Dec 30, 2016

Out of curiosity, I've tested what looks like an implementation of the previous paper: dwalk.c from the mpifileutils project, which uses libdftw underneath. It uses MPI to handle parallelism, but I'm only using the local machine. The goal is only to see the real benefit of a distributed file tree walk.

See also http://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial5.pdf

To test on a CentOS 7 system:

$ sudo yum install openmpi openmpi-devel libattr-devel
$ source /etc/profile.d/modules.sh
$ module load mpi/openmpi-x86_64
$ module list
Currently Loaded Modulefiles:
  1) mpi/openmpi-x86_64
$ git clone https://github.com/hpc/mpifileutils/
$ cd mpifileutils
$ ./buildme_dependencies # requires internet connection
$ ./buildme 

The yum openmpi package seems to have some problems, but it works:

./install/bin/dwalk 
localhost.31857hfi_wait_for_device: The /dev/hfi1_0 device failed to appear after 15.0 seconds: Connection timed out

Usage: dwalk [options] <path> ...

Options:
  -i, --input <file>  - read list from file
  -o, --output <file> - write processed list to file
  -l, --lite          - walk file system without stat
  -s, --sort <fields> - sort output by comma-delimited fields
  -p, --print         - print files to screen
  -v, --verbose       - verbose output
  -h, --help          - print usage

Fields: name,user,group,uid,gid,atime,mtime,ctime,size

Let's test (as root, to access my no-root-squash NFS share, on an 8-core/16 GB virtual machine). I ran each test 5 times and took the best score, because the filer could be under heavy use at any point in time. dwalk performs a stat for each entry.

The folder I'm indexing:

Items: 764324
  Directories: 51247
  Files: 713077
  Links: 0
Data: 0.992 TB (1.459 MB per file)

# module load mpi/openmpi-x86_64

One MPI process:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 1 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 71.579805 seconds (10677.927944 files/sec)

Two:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 2 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 50.769305 seconds (15054.844655 files/sec)

Four:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 4 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 32.848336 seconds (23268.271489 files/sec)

Eight:

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 8 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 21.685690 seconds (35245.546718 files/sec)

And for fun, 16 =)

# echo 3 | tee /proc/sys/vm/drop_caches
# mpirun -np 16 --allow-run-as-root ./install/bin/dwalk -v /data
[snip]
Walked 764324 files in 18.811476 seconds (40630.729880 files/sec)

I don't know where the bottleneck is (NFS, the local machine, or the remote storage), but the gain is substantial. It would be interesting to have a dry-run option for the duc index command that skips the db_put call in index.c, to compare the results.
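
For what it's worth, such a dry-run mode could be as small as a guard around the write; a hypothetical sketch, where opt_dry_run and the db_put() call-site arguments are illustrative, not duc's actual code:

/* hypothetical --dry-run guard around the database write in index.c */
if (!opt_dry_run)
    db_put(db, key, key_len, val, val_len);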

@zevv (Owner) commented Dec 30, 2016 via email

@jbd (Contributor, Author) commented Dec 30, 2016

zevv> Order matters here.

Indeed, that was something I didn't have in mind.

@jbd (Contributor, Author) commented Dec 30, 2016

It would be interesting to have a dry-run option for the duc index command that skips the db_put call in index.c, to compare the results.

I don't know if you're interested, but I've tried to implement a --dry-run option for the index command (#166). Given your last remark, maybe it is not as useful as I thought... =)

@zevv (Owner) commented Dec 30, 2016 via email

@jbd (Contributor, Author) commented Dec 30, 2016

Grrr, I've lost my answer.

zevv> Test branch for you

Thank you for merging my pull request. Let's hope I didn't introduce a subtle bug in the process =)

Here is the result (best of five runs) on the same dataset:

# echo 3 | tee /proc/sys/vm/drop_caches;
# rm -f /root/.duc.db* 
# time ./duc-lmdb index --dry-run -x -p /data
real    0m53.694s
user    0m1.114s
sys     0m11.823s

A bit faster! That gives us a rough idea.

zevv> (Do you know if the MPI test app actually calls fstat() for each file?)

Since dwalk has a "--lite - walk file system without stat" option, I hope the default behaviour is to perform a stat for each entry. An strace -f -e stat,lstat confirms it.

Quickly looking at the code, the bayer_flist_walk_paths function enqueues walk_stat_process callbacks that issue an lstat call in the end. bayer_flist_walk_paths also does other work regarding users and groups (but I don't know whether my previous runs fell inside this code path).

@zevv (Owner) commented Dec 30, 2016 via email

@zevv (Owner) commented Jan 1, 2017

(Some ramblings in case I leave this for a year and forget about it - this happens to me far too often.)

OK, I've spent some time thinking and prototyping, but there's one nasty problem in the way:

At this time duc recursively chdir()s itself through the file system, so it only needs relative file names for opendir(). This breaks with multiple threads, since there is only one CWD (current working directory) shared by all threads. The proper solution is to use openat() and fdopendir(), which allow file names to be referenced relative to a parent file descriptor (see the sketch after the list below). The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for Windows a long time ago, but I really would like to keep MacOS on the list of supported operating systems.

Two solutions come to mind:

  • Use processes instead of threads. Actually not that bad, although it will complicate the communication in the worker pool and add some overhead (pipe reads/writes are system calls).

  • Have duc always keep track of absolute path names and pass those to opendir(). This adds a bit of string-handling overhead, but is not that bad.
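
A minimal sketch of the openat()/fdopendir() approach mentioned above (not duc code; error handling trimmed): each level opens its directory relative to the parent's file descriptor, so no thread ever depends on the process-wide CWD.

#include <dirent.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void walk(int parent_fd, const char *name)
{
    int fd = openat(parent_fd, name, O_RDONLY | O_DIRECTORY);
    if (fd == -1) return;

    DIR *d = fdopendir(fd);     /* the DIR stream takes ownership of fd */
    if (!d) { close(fd); return; }

    struct dirent *e;
    while ((e = readdir(d))) {
        if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
            continue;
        struct stat st;
        /* stat relative to this directory's fd: no CWD involved */
        if (fstatat(fd, e->d_name, &st, AT_SYMLINK_NOFOLLOW) == -1)
            continue;
        if (S_ISDIR(st.st_mode))
            walk(fd, e->d_name);
    }
    closedir(d);                /* also closes fd */
}

int main(int argc, char **argv)
{
    walk(AT_FDCWD, argc > 1 ? argv[1] : ".");
    return 0;
}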

@stuartthebruce commented Jan 1, 2017 via email

@l8gravely (Collaborator) commented Jan 1, 2017 via email

@zevv (Owner) commented Jan 2, 2017 via email

@msm595 commented Jan 19, 2019

The proper solution for this is to use openat() and fdopendir() which allow referencing of file names relative to a parent file descriptor. The only problem is that for some reason this is not supported on MacOS X and Windows. I've stopped caring for windows a long time ago, but I really would like to keep MacOS in the list of supported operating systems.

As of MacOS 10.10 (Darwin 14.0), which was released at the end of 2014, openat() is supported. You can see the full list of supported syscalls for this version here: https://opensource.apple.com/source/xnu/xnu-2782.1.97/bsd/kern/syscalls.master.auto.html
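
(A build could still probe for this at configure time; a minimal configure-style sketch, an assumption rather than duc's actual build machinery: the probe compiles and links only where openat() exists.)

/* configure-time probe for openat() availability */
#include <fcntl.h>

int main(void)
{
    return openat(AT_FDCWD, ".", O_RDONLY) < 0;
}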

@zevv (Owner) commented Jan 19, 2019 via email

@muhammadharis commented

I'm currently working on a worker-pool-based implementation of the DFS algorithm described in this paper:

https://www.lrde.epita.fr/~bleton/doc/parallel-depth-first-search.pdf

If anyone is interested in a proof of concept, I can provide a Python script that minimally illustrates the filesystem crawling algorithm. If the maintainers have opinions on this crawling algorithm, please let me know -- I'm open to suggestions.

@l8gravely (Collaborator) commented Jul 20, 2021 via email
