Utility for finding and pruning old files in huge directory trees.

tesujimath/filebutler

Filebutler

Filebutler is a utility for managing old files in huge directory trees. Huge means of the order of 100 million files. The motivation is that find is far too slow on directory trees with this many files. Even working with filelists generated offline by find can be rather slow for interactive use. Filebutler improves on this by structuring filelists by age, by size, by owner, and by dataset (see below).

As of version 0.23.0, it also builds a reverse index of symlinks for all filesets, to support querying by symlink target, that is, what symlinks point into a given directory tree.

Installation

Filebutler is on PyPI, so it may be installed using pip, preferably in a virtualenv.

$ pip install filebutler

Alternatively, filebutler is also on conda-forge.

$ conda install filebutler

Installing filebutler from RPM is now deprecated, and the spec file has therefore been removed from the repo.

Example use for old file pruning

$ filebutler
fb: ls
gypsy-scratch filelist /mirror/gypsy/z202/scratch/all-scratch-files cached on 2017-12-22
infernal-scratch filelist /mirror/infernal/z302/scratch/all-scratch-files cached on 2017-12-22
integrity-scratch filelist /mirror/integrity/z102/scratch/all-scratch-files cached on 2017-12-22
ivy-scratch filelist /mirror/ivy/z101/scratch/all-scratch-files cached on 2017-12-22
scratch union gypsy-scratch infernal-scratch integrity-scratch ivy-scratch
old-scratch filter scratch older:2015-12-23
my-old-scratch filter old-scratch owner:will
very-old-scratch filter old-scratch older:2009-10-05
big-old-scratch filter old-scratch size:+1G

fb: time info -s scratch
total 139.8T in  62306443 files
   0 -   1k     10G in  32652282 files
  1k -  10k     55G in  18798049 files
 10k - 100k    196G in   6735931 files
100k -   1M   1012G in   3096409 files
  1M -  10M    2.5T in    698040 files
 10M - 100M    7.3T in    218382 files
100M -   1G   28.0T in     90412 files
  1G -  10G   32.4T in     14142 files
 10G - 100G   58.6T in      2733 files
100G +         9.7T in        63 files
time: real 0.00s, user 0.00s, sys 0.00s

fb: time info -u old-scratch
total 34.2T in 7138319 files
captainjack 9.8T in 122775 files
will 6.3T in 197499 files
barbossa 3.8T in 121977 files
elizabeth 3.5T in 1686 files
bootstrap 3.3T in 2532542 files
calypso 1.2T in 551088 files
norrington 1.1T in 41961 files
<snip>
time: real 8.22s, user 3.00s, sys 1.26s

fb: time info -u very-old-scratch
total 12G in 52759 files
barbossa 11G in 47262 files
salazar 127M in 296 files
gibbs 28M in 5101 files
will 27M in 58 files
time: real 0.06s, user 0.04s, sys 0.01s

fb: print very-old-scratch
-rw-r--r-- 2005-02-20 743k /dataset/blackpearl/scratch/maker_AR501/GeneMark.ES/mtx/human_00_43.mtx  will:crew
-rw-r--r-- 2005-02-20 743k /dataset/blackpearl/scratch/maker_AR501/GeneMark.ES/mtx/human_43_49.mtx  will:crew
-rw-r--r-- 2005-02-20 621k /dataset/blackpearl/scratch/maker_AR501/GeneMark.ES/mtx/wheat.mtx        will:crew
-rw-r--r-- 2005-02-20 621k /dataset/blackpearl/scratch/maker_AR501/GeneMark.ES/mtx/barley.mtx       will:crew
-rw-r--r-- 2005-02-20 621k /dataset/blackpearl/scratch/maker_AR501/GeneMark.ES/mtx/corn.mtx         will:crew
<snip>

fb: delete very-old-scratch

Example use for symlink queries

Tab completion is available on path names, to complete on any subdirectory which contains a symlink target.

$ filebutler
fb: symlinks -r /dataset/bio                    <TAB>
fb: symlinks -r /dataset/bioinformatics_dev/    <TAB>
/dataset/bioinformatics_dev/active/
/dataset/bioinformatics_dev/active/data_prism/data_prism.py
/dataset/bioinformatics_dev/active/data_prism/kmer_prism.py
/dataset/bioinformatics_dev/active/data_prism/taxonomy_prism.py
/dataset/bioinformatics_dev/scratch/

This means there are symlink targets (that is, things pointed at by a symlink somewhere on the system) in all these directory trees. Let's drill down a bit more:

fb: symlinks -r /dataset/bioinformatics_dev/s           <TAB>
fb: symlinks -r /dataset/bioinformatics_dev/scratch/    <TAB>
/dataset/bioinformatics_dev/scratch/bcl2fastq/
/dataset/bioinformatics_dev/scratch/gnu/
/dataset/bioinformatics_dev/scratch/localdata_backup/
/dataset/bioinformatics_dev/scratch/metabat/
/dataset/bioinformatics_dev/scratch/octopus/
/dataset/bioinformatics_dev/scratch/summer_projects_2018/
/dataset/bioinformatics_dev/scratch/summer_projects_2018/fraserm/project/
/dataset/bioinformatics_dev/scratch/summer_projects_2018/fraserm/stacks/
/dataset/bioinformatics_dev/scratch/summer_projects_2018/fraserm/uneak_SQ0690/
/dataset/bioinformatics_dev/scratch/tardis/
/dataset/bioinformatics_dev/scratch/topcons2/

fb: symlinks -r /dataset/bioinformatics_dev/scratch/to           <TAB>
fb: symlinks -r /dataset/bioinformatics_dev/scratch/topcons2/    <ENTER>
/dataset/bioinformatics_dev/scratch/topcons2/TOPCONS2/topcons2_webserver/predictors/Philius/runPhilius.sh.mod <- /dataset/bioinformatics_dev/scratch/topcons2/TOPCONS2/topcons2_webserver/predictors/Philius/runPhilius.sh
/dataset/bioinformatics_dev/scratch/topcons2/bin/modhmm <- /dataset/bioinformatics_dev/scratch/topcons2/TOPCONS2/topcons2_webserver/predictors/modhmm
/dataset/bioinformatics_dev/scratch/topcons2/bin/modhmm <- /dataset/bioinformatics_dev/scratch/topcons2/TOPCONS2/topcons2_webserver/predictors/spoctopus/modhmm
/dataset/bioinformatics_dev/scratch/topcons2/data/big/chrisp/topcons2_webserver/database <- /dataset/bioinformatics_dev/scratch/topcons2/TOPCONS2/topcons2_webserver/database
fb:

Symlinks are printed in reversed order, with the target first, so that the output sorts by destination rather than by source. If you only want to know about a single target, omit the -r (recursive) option.
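The idea of a reverse symlink index can be sketched in a few lines of Python. This is an illustration only, not filebutler's implementation: the function names, the in-memory dict, and the example paths are all hypothetical, and the real index is built as part of the on-disk cache.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_reverse_index(links: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each symlink target to the sorted list of symlinks pointing at it.

    `links` yields (link_path, target_path) pairs.
    """
    index = defaultdict(list)
    for link, target in links:
        index[target].append(link)
    return {target: sorted(sources) for target, sources in index.items()}


def symlinks_into(index: Dict[str, List[str]], path: str,
                  recursive: bool = False) -> List[Tuple[str, str]]:
    """Return (target, link) pairs for `path`, or for everything under it."""
    exact = path.rstrip("/")
    prefix = exact + "/"
    hits = []
    for target in sorted(index):  # sorted by target, as in the output above
        if target == exact or (recursive and target.startswith(prefix)):
            hits.extend((target, link) for link in index[target])
    return hits


# Hypothetical data: two symlinks share one target.
index = build_reverse_index([
    ("/apps/predictors/modhmm", "/tools/bin/modhmm"),
    ("/apps/predictors/spoctopus/modhmm", "/tools/bin/modhmm"),
    ("/apps/database", "/tools/data/database"),
])
for target, link in symlinks_into(index, "/tools/", recursive=True):
    print(f"{target} <- {link}")
```

Keying the index by target is what makes the "what points into this tree" question cheap. The linear scan over sorted targets here is only for brevity.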

Concepts

Filebutler's main concept is the fileset. The supported fileset types are as follows.

Type          Description
find.gnu.out  List of files generated offline by GNU find -ls
find          List of files found by filebutler walking the filesystem
union         Union of specified filesets
filter        Any existing fileset filtered by age, owner, file size, etc.

Filebutler generates on-disk structured caches for both the find.gnu.out and find filesets, as described in the next section. It is these caches which enable it to process queries over millions of files in reasonable time.

Cache Structure

Filebutler's main value is that it processes very large filelists and structures them in its cache by age, dataset, and owner. Interactive queries are greatly accelerated by scanning only those portions of the overall filelist that match the filter criteria.

The top-level cache structuring is by week number, that is YYYYWW, where WW is the ISO week number. This results in 52 or 53 directories per year. On the author's fileservers (half a petabyte or so, with 100 million files dating back over 20 years), there are around 1000 such directories across the filesets.
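As a rough illustration of the week key, the following Python sketch derives YYYYWW from a date using the standard library's ISO calendar. It assumes that YYYY is the ISO year (which can differ from the calendar year in the first and last days of a year) and that timestamps are interpreted in UTC. Neither detail is stated above, and the function names are hypothetical.

```python
from datetime import date, datetime, timezone


def week_key(d: date) -> str:
    """Return a YYYYWW key built from the ISO year and ISO week number of d."""
    iso_year, iso_week, _weekday = d.isocalendar()
    return f"{iso_year:04d}{iso_week:02d}"


def week_key_from_mtime(mtime: float) -> str:
    """Same key, starting from a POSIX modification timestamp (UTC assumed)."""
    return week_key(datetime.fromtimestamp(mtime, tz=timezone.utc).date())


print(week_key(date(2017, 12, 22)))  # 201751
print(week_key(date(2021, 1, 1)))    # 202053, since this day falls in ISO year 2020
```

Because the key is fixed-width, plain string comparison orders the directories chronologically.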

The second-level cache structuring is by dataset. A dataset is a way of grouping files. The simplest definition would be a top-level directory on a fileserver, but filebutler supports arbitrary dataset definitions by means of regular expression matching on pathnames. Datasets are optional.
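A minimal Python sketch of regex-based dataset assignment follows. The pattern and the function name are hypothetical, chosen to match paths like those in the example output above. The actual syntax for defining datasets is documented on the manpage.

```python
import re
from typing import Optional

# Hypothetical rule: the dataset is the directory directly under /dataset.
DATASET_RE = re.compile(r"^/dataset/(?P<dataset>[^/]+)/")


def dataset_of(path: str) -> Optional[str]:
    """Return the dataset name for a path, or None if no rule matches."""
    m = DATASET_RE.match(path)
    return m.group("dataset") if m else None


print(dataset_of("/dataset/blackpearl/scratch/maker_AR501/GeneMark.ES/mtx/wheat.mtx"))
print(dataset_of("/home/will/notes.txt"))
```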

The third-level cache structuring is by owner. Most users are interested only in files they own, so scanning only their portions of the filelists greatly accelerates their queries.

At the bottom level of the cache structure, within each owner directory, filebutler uses two files: filelist and info. The former is simply a list of the files last modified in that week, in that dataset, owned by that user, with their attributes. The latter is summary information of the number of these files and their total size.
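To show why this layout speeds up queries, here is a small in-memory model in Python. It is a sketch under stated assumptions, not filebutler's code: a dict keyed by (week, dataset, owner) stands in for the directory tree, each value plays the role of an info file, and all names are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple

# Key: (week "YYYYWW", dataset, owner). Value: [file count, total bytes].
# This dict is an in-memory stand-in for the per-owner info files.
Cache = DefaultDict[Tuple[str, str, str], List[int]]


def new_cache() -> Cache:
    return defaultdict(lambda: [0, 0])


def add_file(cache: Cache, week: str, dataset: str, owner: str, size: int) -> None:
    """Record one file in the summary for its (week, dataset, owner) slot."""
    entry = cache[(week, dataset, owner)]
    entry[0] += 1
    entry[1] += size


def summarise(cache: Cache, owner: Optional[str] = None,
              before_week: Optional[str] = None) -> Tuple[int, int]:
    """Sum counts and sizes over only the slots that match the filter."""
    nfiles = nbytes = 0
    for (week, _dataset, slot_owner), (count, size) in cache.items():
        if owner is not None and slot_owner != owner:
            continue  # skip other users' slots entirely
        if before_week is not None and week >= before_week:
            continue  # fixed-width YYYYWW strings compare chronologically
        nfiles += count
        nbytes += size
    return nfiles, nbytes


cache = new_cache()
add_file(cache, "200507", "blackpearl", "will", 743_000)
add_file(cache, "200507", "blackpearl", "will", 621_000)
add_file(cache, "201751", "blackpearl", "barbossa", 10)
print(summarise(cache, owner="will", before_week="201001"))
```

A summary query of this kind never opens a filelist. It touches only the small per-slot summaries, and only for the slots the filter selects.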

Features

Filebutler has two main uses:

  • unprivileged users who want (or have been asked) to be good citizens and delete old files they no longer need
  • privileged users who delete such users' files automatically and email warnings in advance

The manpage lists the commands available for both types of user.

Unprivileged Users

Unprivileged users need to select a set of files, check that these are in fact unwanted, and delete them.

Existing filesets may be refined, by defining new filters on them, for example:

fb: ls
fb: print very-old-scratch
fb: print very-old-scratch ! -path *important*
fb: fileset unimportant filter very-old-scratch ! -path *important*
fb: info very-old-scratch
total 12G in 52759 files
fb: info unimportant
total 12G in 52433 files
fb: delete unimportant

Privileged Users

It is expected that privileged users will install cron jobs to enforce file deletion policies. Warning emails may be sent to the owners of files in selected filesets. For example:

fb: fileset warn-old-scratch filter scratch -mtime +730
fb: fileset delete-old-scratch filter scratch -mtime +737
fb: send-emails warn-old-scratch deletion-warning
fb: delete delete-old-scratch

See the next section for the configuration required to support warning emails.

Configuration

/etc/filebutlerrc

The main configuration is simply a command file, which sets attributes and defines filesets. The command set for the startup file is identical to the interactive command set.

Usually, startup commands are read from both /etc/filebutlerrc and ~/.filebutlerrc. This may be overridden using the command line argument --config, in which case neither of the default configuration files is read.

See the example filebutlerrc file.

The commands and attributes available are defined on the manpage.
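The choice of startup files described above amounts to the following rule, sketched here in Python. The function name is hypothetical, and the order in which the two default files are read is an assumption.

```python
import os
from typing import List, Optional


def startup_files(config: Optional[str] = None) -> List[str]:
    """Return the command files to read at startup.

    If --config was given, only that file is used; otherwise both
    default locations are used (system file first, by assumption).
    """
    if config is not None:
        return [config]
    return ["/etc/filebutlerrc", os.path.expanduser("~/.filebutlerrc")]


print(startup_files())
print(startup_files("/tmp/test.filebutlerrc"))
```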

Email Templates

The attribute templatedir defines the location of the directory containing email templates. For example, to send emails using the deletion-warning template, that directory must contain both the subject and body files, called respectively deletion-warning.subject and deletion-warning.body.

See the example subject and body templates.
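The naming convention can be expressed as a short Python sketch. The helper names and the example directory are hypothetical. Only the .subject and .body suffixes come from the description above.

```python
from pathlib import Path
from typing import Tuple


def template_paths(templatedir: str, name: str) -> Tuple[Path, Path]:
    """Return the (subject, body) file paths for a named email template."""
    base = Path(templatedir)
    return base / f"{name}.subject", base / f"{name}.body"


def load_template(templatedir: str, name: str) -> Tuple[str, str]:
    """Read both files, failing clearly if either one is missing."""
    subject_path, body_path = template_paths(templatedir, name)
    missing = [str(p) for p in (subject_path, body_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError("missing template files: " + ", ".join(missing))
    return subject_path.read_text().strip(), body_path.read_text()


# Hypothetical template directory.
for path in template_paths("/etc/filebutler/templates", "deletion-warning"):
    print(path)
```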

Ignore Paths

Certain files can be flagged to be ignored by filebutler. This is done by means of a list of Python-style regular expressions in the file named by the attribute ignorepathsfrom. Any file matching one of these regular expressions will be ignored.

Note that paths are ignored when the filebutler cache is generated, whether filebutler is scanning the actual filesystem or reading the output file list of find -ls. If desired, different ignore files may be used for different filesets, by setting the attribute just before the line defining the fileset.

See the example ignore paths file.
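A minimal Python sketch of this kind of filtering is shown below. It assumes one regular expression per line and that a pattern may match anywhere in the path (re.search). Whether filebutler anchors patterns at the start of the path is not stated here, and the function names and sample patterns are hypothetical.

```python
import re
from typing import Iterable, List, Pattern


def load_ignore_patterns(lines: Iterable[str]) -> List[Pattern[str]]:
    """Compile one regular expression per non-blank line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def is_ignored(path: str, patterns: List[Pattern[str]]) -> bool:
    """True if any pattern matches the path.

    Assumption: a match anywhere in the path counts (re.search).
    """
    return any(p.search(path) for p in patterns)


# Hypothetical ignore rules: snapshot directories and editor backup files.
patterns = load_ignore_patterns([r"/\.snapshot/", r"~$", ""])
print(is_ignored("/dataset/blackpearl/.snapshot/hourly.0/file", patterns))
print(is_ignored("/dataset/blackpearl/scratch/results.txt", patterns))
```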

Cron

Cron jobs are recommended for regenerating the caches overnight, deleting old files, and sending warning emails.

For example:

0 5 * * * filebutler -c update-cache --batch
0 7 * * 1 filebutler -c 'send-emails warn-old-scratch; delete delete-old-scratch' --batch

See the manpage for further details.