New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: allow symlinkinkg and hardlinking files instead of just copying #409

Open
wants to merge 4 commits into
base: master
from

Conversation

Projects
None yet
5 participants
@anarcat
Copy link
Contributor

anarcat commented Nov 13, 2018

Description

This allows users to manage collections of large WARC files without duplicating space. Hardlinks are used instead of symlinks to reflect the original mechanism, where the file is copied (so it can be safely removed from the source). If we used symlinks, we would break that expectation which could lead to data loss.

Inversely, hardlinks can lead to data loss as well. For example, pywb could somehow edit the file, which would modify the original as well. But we assume here pywb does not modify the file, and each side of the hardlink can have their own permissions to ensure this (or not) as well.

Closes: #408

Types of changes

  • Replay fix (fixes a replay specific issue)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added or updated tests to cover my changes.
  • All new and existing tests passed.

anarcat added some commits Nov 13, 2018

support linking files in place instead of copying
This allows users to manage collections of large WARC files without
duplicating space. Hardlinks are used instead of symlinks to reflect
the original mechanism, where the file is copied (so it can be safely
removed from the source). If we used symlinks, we would break that
expectation which could lead to data loss.

Inversely, hardlinks can lead to data loss as well. For example, pywb
could somehow edit the file, which would modify the original as
well. But we assume here pywb does not modify the file, and each side
of the hardlink can have their own permissions to ensure this (or not)
as well.

Closes: #408

@anarcat anarcat force-pushed the anarcat:hardlink branch from 7e80dba to 011640a Nov 13, 2018

@anarcat

This comment has been minimized.

Copy link
Contributor

anarcat commented Nov 13, 2018

This is WIP because I haven't worked on the docs or tests yet, as I want feedback on the idea first. Furthermore, tests fail here but that's unrelated to the patch here: they've been failing on master since at least 08b0ac8

Advice on which docs to update and insights on the test suite would be very much welcome as well.

@despens

This comment has been minimized.

Copy link

despens commented Nov 13, 2018

Hey, if your WARC files are already present somewhere else in a structured way, you might have success with configuring multiple archive paths, see https://pywb.readthedocs.io/en/latest/manual/configuring.html#archive-paths

@anarcat

This comment has been minimized.

Copy link
Contributor

anarcat commented Nov 13, 2018

The problem with that approach is that this expects a certain layout in the filesystem. Right now WARC files are stored like this:

archive/example.com-YYYY-MM-DD-HASH/fooo-NNNN.warc.gz

Where:

  • YYYY-MM-DD is the date of the crawl
  • HASH is a unique hash
  • NNNN is an incrementing number (e.g. 0000 for the first WARC file, 0001 for the second one, files are split on 5GB boundaries)

This layout is commonly used by grab-site and archivebot and other crawlers. It does not match the expectations of pywb, including custom archive paths, which still look in collection/<coll name>/example.warc.gz.

So I don't think that's a sufficient solution to my problem.

@JustAnotherArchivist

This comment has been minimized.

Copy link

JustAnotherArchivist commented Nov 13, 2018

I can't speak for the technical implementation (although it looks good to me), but I'd definitely appreciate anything that allows using existing WARCs – whatever file structure they may be in – in pywb without having to make any copies. Collections of WARCs can be huge, and storing them twice is wasteful and in some cases not even possible due to space constraints.

Of course, it would be possible to create a separate directory with the expected structure, make hardlinks or symlinks in there, and then use that as another archive path. But from a usability perspective, it would be much better in my opinion if pywb/wb-manager simply had an option that can take care of that directly (like the one proposed in this PR) instead of having to do that manually or with helper scripts.

@N0taN3rd

This comment has been minimized.

Copy link
Member

N0taN3rd commented Nov 15, 2018

@anarcat thanks for the PR.

I believe the appropriate docs section that would need updating is docs/manual/usage.rst: Using Existing Web Archive Collections

Still thinking about how to test this because we will also need to support this functionality on Windows.

It may also be a good idea to add moving WARCs to the expected locations.
Would also be useful for CDX per #410.

/cc @ikreymer

@ikreymer

This comment has been minimized.

Copy link
Member

ikreymer commented Nov 15, 2018

Thanks for suggesting this, I agree that this should be supported, but not sure that symlinking/hardlinking is the way to go. As @N0taN3rd, this would complicate Windows support and would potentially make the setup more brittle.

pywb is close to supporting what you want with external paths, but unfortunately its not automated.

It seems like the best option would be to support a per-collection overrides, say overrides.yaml which can set a list of one or more paths to directories that contain WARC files, ex:

collections/external-data/overrides.yaml:

archive_paths:
  - /path/to/warcs/
  - /path/to/more/warcs/
  - /path/to/some-warcs/warcfile.warc.gz

Then, instead of the local archives directory, pywb will look for works in those directories instead.
This can also work with auto-indexing, so any time a new WARC is added to those directories, it can be indexed automatically. pywb can already check and index subdirectories, so you can use whatever structure you'd like in the external directory.

Of course, the overrides.yaml could also be managed with something like:

wb-manager add-external-path <coll> <path>
wb-manager remove-external-path <coll> <path>

If you wanted to add only a specific WARC file in a directory instead of all WARC, that too can be supported by specifying the file path instead of a directory (eg. /path/to/some-warcs/warcfile.warc.gz)

@JustAnotherArchivist

This comment has been minimized.

Copy link

JustAnotherArchivist commented Nov 15, 2018

Where would the CDX files be stored with that setup?

Regarding Windows support, is that a hard requirement or could that feature simply be unavailable if not supported by the OS, the Python version, or the configuration (only admins can create symlinks on Windows according to the Python docs)?

@ikreymer

This comment has been minimized.

Copy link
Member

ikreymer commented Nov 15, 2018

Where would the CDX files be stored with that setup?

By default, still in the indexes directory, although can provide an override for that also, if needed.

Regarding Windows support, is that a hard requirement or could that feature simply be unavailable if not supported by the OS, the Python version, or the configuration (only admins can create symlinks on Windows according to the Python docs)?

Ideally, the external directory feature would be available on all platforms.

@anarcat

This comment has been minimized.

Copy link
Contributor

anarcat commented Nov 21, 2018

what I hear from the various comments so far is this:

  • LGTM, some fix like this is necessary
  • this can be done by hand
  • add support for moving files (os.move, presumably, although that fails if we cross FS boundaries)
  • external paths are close to answering that requirement, but would require a new config file, maybe modifiable with a new set of commands
  • concerns about Windows support
  • pointers to which docs need updating (usage.rst, thanks!)

I understand where you are coming from: multi-platform compatibility is important, and there are existing features which might fit this requirement.

however, i would argue that "a bird in the mouth is better than two in the bush": I have a working patch to workaround a real scalability issue with pywb, right now. it might not work that effectively on Windows, but I want to point out that both os.link and os.symlink are actually supported on Windows, at least since Vista. so i don't think it's as much a blocker as people would tend to believe

the proposals to automate editing of the YAML file seem to be an entirely different approach, one that would require much more changes to the documentation and seem to me like feature creep. i just want to copy files lightly in the archive, not redesign how the entire YAML configuration system works! :) if this is the approach you want to take, I'm not sure I can help since I would need to dive again deeper in the internals of pywb, which might mean this would never be done at all. ;)

so to move this forward, I would propose that we keep on following the approach I proposed here. this would mean adding tests for the functionality and documentation. i would be happy to push that forward, if the proposal is accepted, otherwise I'm afraid I won't be able to provide a solution to #408 myself going forward.

have a nice day!

PS: the travis test failure here does not seem related to the patch, you might want to look into that... i'll re-trigger the build to see if it works better now.

@anarcat anarcat closed this Nov 21, 2018

@anarcat anarcat reopened this Nov 21, 2018

@codecov

This comment has been minimized.

Copy link

codecov bot commented Nov 21, 2018

Codecov Report

Merging #409 into master will decrease coverage by 0.15%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #409      +/-   ##
==========================================
- Coverage   88.04%   87.89%   -0.16%     
==========================================
  Files          59       59              
  Lines        7227     7235       +8     
  Branches     1286     1288       +2     
==========================================
- Hits         6363     6359       -4     
- Misses        570      575       +5     
- Partials      294      301       +7
Impacted Files Coverage Δ
pywb/manager/manager.py 97.58% <66.66%> (-1.35%) ⬇️
pywb/apps/static_handler.py 90% <0%> (-2.5%) ⬇️
pywb/warcserver/index/aggregator.py 89.72% <0%> (-1.98%) ⬇️
pywb/recorder/multifilewarcwriter.py 77.84% <0%> (-1.14%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1b151b7...011640a. Read the comment docs.

@N0taN3rd N0taN3rd requested review from N0taN3rd and ikreymer Nov 30, 2018

@N0taN3rd
Copy link
Member

N0taN3rd left a comment

I'm +1 to adding this.

@ikreymer

This comment has been minimized.

Copy link
Member

ikreymer commented Dec 21, 2018

hi @anarcat apologies for the delay -- was away on vacation and neglected to respond earlier!

I reread your comment and thought about it more, and have considered our current work. Since we don't have the bandwidth to implement the config-based approach now, and it could still be done at a later time, I think you're right and we should add this solution, even if it is not cross-platform as it will help your (and possibly others') use cases. This solution is a simple change to the wb-manager while the config option would be a much more extensive change, as you've mentioned.

To proceed, could you add:

  • tests (and mark them with something like @pytest.mark.skipif(sys.platform == 'win32', reason="does not run on windows") to skip windows)
    Existing tests for manager are mostly in ./tests/test_auto_colls.py and ./tests/test_redirects.py is another example. A test module similar to those (but simpler probably!) should be good.

  • docs in usage.rst, and include some tradeoffs between using the hardlink vs symlink vs copy approaches.

And we'll try to merge it in for next release! Thanks again!

(Yes, the travis-ci issue is/was unrelated, we're looking at that)

@anarcat

This comment has been minimized.

Copy link
Contributor

anarcat commented Dec 21, 2018

awesome! i'm not sure I'll have time to do this before the next year (and I'm don't mind at all if someone else beats me to it), but hopefully I'll be able to come back to this soon-ish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment