Proposal: A general API for PyPI mirroring tools #548
Comments
I'm a bit curious about why you need to download them yourself, because I feel like bandersnatch is primarily a mirror; if you are going to separate concerns, bandersnatch should remain a mirror and the rest of the functionality should be grouped and abstracted for its specific use case. In my mind, the overall solution I'm imagining looks something like this, where the in-memory cache, the storage backend, and the metadata store (i.e. the database) can be swapped out via a plugin system. Currently the storage backend is being abstracted a bit to allow for more generic implementations (like S3-based file storage). Ideally, we could move away from so much filesystem interaction.

Using a queue like redis? Or, I mean, aren't all the kids using kafka for everything these days? Surely you should be deploying kafka somewhere in there just to fit in. (That's sarcasm.)

If you only care about downloading metadata, depending on what specific information you need, I have some extra tooling built for this already over at https://requirementslib.readthedocs.io/en/master/requirementslib.models.metadata.html#requirementslib.models.metadata.get_package

Anyhow, I'd argue you could make completely different libraries here, but I'm on board with splitting up the mirror functionality to start with, since I agree that metadata and actual data are totally independent. Like you, I need to put the metadata into a database (and would love to have that logic shared as well, but suspect your approach is implemented already). Unlike you, I do still need the files.
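The layered design described above (a pluggable storage backend and metadata store) could be sketched roughly like this. This is a hypothetical illustration, not bandersnatch's actual code; all class and method names (`StorageBackend`, `MetadataStore`, `FilesystemBackend`, `write_file`, etc.) are invented for the example, and the "filesystem" backend uses an in-memory dict so the sketch stays self-contained:

```python
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Hypothetical pluggable storage layer (local disk, S3, ...)."""

    @abstractmethod
    def write_file(self, path: str, contents: bytes) -> None: ...

    @abstractmethod
    def read_file(self, path: str) -> bytes: ...


class MetadataStore(ABC):
    """Hypothetical pluggable metadata store (e.g. a database)."""

    @abstractmethod
    def save_package(self, name: str, metadata: dict) -> None: ...


class FilesystemBackend(StorageBackend):
    """Stand-in implementation; a real one would do disk or S3 I/O."""

    def __init__(self) -> None:
        self._files: dict = {}  # maps path -> file contents

    def write_file(self, path: str, contents: bytes) -> None:
        self._files[path] = contents

    def read_file(self, path: str) -> bytes:
        return self._files[path]
```

With an interface like this, swapping S3 for the local filesystem is a matter of registering a different `StorageBackend` subclass, which is the plugin-system idea the comment alludes to.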
Well, it's not just that we want the metadata; it's that we want to fetch as little metadata as possible, in a way that is compliant and "polite". We'd also like to be able to support many of the filtering and version-limiting features that Bandersnatch supports, and possibly to configure Pulp Python using an unmodified Bandersnatch config file.
So this is the bit I deleted earlier because it's not really that relevant to the issue. There are many reasons, but the primary one is that one of the main features of Pulp is "lazy sync", which is basically an on-demand caching behavior: Pulp will ingest all of the metadata and create the desired mirror without actually downloading any of the packages until a client asks Pulp for them. It's a trade-off of a small amount of robustness in exchange for saving a huge amount of disk space and bandwidth on packages that will never actually be used. But because we support this feature, the architecture assumes that Pulp performs the actual package download itself (regardless of whether it does so immediately during the initial mirror, or on an individual per-package basis later).
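The lazy-sync behavior described above can be sketched in a few lines. This is a toy illustration of the idea, not Pulp's implementation; the names (`LazyMirror`, `sync_metadata`, `get_package`) are hypothetical, and the downloader is injected as a callable so the example stays offline:

```python
from typing import Callable, Dict


class LazyMirror:
    """Sketch of "lazy sync": ingest metadata now, fetch files on demand.

    `fetch` stands in for an HTTP download of the package URL.
    """

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._metadata: Dict[str, dict] = {}
        self._blobs: Dict[str, bytes] = {}

    def sync_metadata(self, name: str, metadata: dict) -> None:
        # Cheap step performed for every package during the mirror pass.
        self._metadata[name] = metadata

    def get_package(self, name: str) -> bytes:
        # Expensive step deferred until a client actually asks, and
        # performed at most once per package.
        if name not in self._blobs:
            url = self._metadata[name]["url"]
            self._blobs[name] = self._fetch(url)
        return self._blobs[name]
```

The key point for the architecture discussion is that the mirror, not the metadata library, owns the download step: `sync_metadata()` never triggers network traffic for package files.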
So just to be clear, I don't think any such drastic changes in architecture are necessary; just shifting the place where it happens would suffice :) If you want to talk about long-term architecture, that would be an interesting discussion to have, but I think that would be a separate issue.
A few points from me:
But I'm happy to review and work the library into the shape you need. tl;dr - We have pretty good test and CI coverage, so as long as you don't break that I'll be pretty accepting :) I'll try to get #463 done maybe this weekend so we can get coverage back up and remove all the unneeded code from verify.py. I use it as a POC for aiohttp over requests and haven't merged its master / mirror / other operations back into the correct class.
What else is planned here? Can we maybe spell out the remaining work so there is a clear endpoint for this issue?
Let me throw in my requirements: I have a few use cases where we maintain mostly a mirror of metadata and selectively download the actual files only if needed. There I want to run a daily or hourly process to update that system of mine, and I was about to start extracting the bits I needed from bandersnatch. Enhancing the code here instead would be a much better option. I feel my use case is not much different from Pulp's, so I am game to help. As @dralley pointed out, it would be much better for me to call bandersnatch as a library rather than reimplementing it or forking a metadata-focused subset that would be tied to the legacy xmlrpc API.
@pombredanne Here's what we're working on at the moment in our PR: `bandersnatch/src/bandersnatch/mirror.py`, lines 33 to 178 at commit 37cabc7.
As an API user, the methods that a subclasser would need to implement are:
Do you think that would be sufficient for your purposes? Of course technically anything can be overridden, but it would be great if we could define a clear interface that allows all the behavior to be customized without doing such things. We would definitely love your feedback. There are probably still plenty of things that could be improved, especially around …
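The "methods a subclasser would need to implement" pattern described above can be illustrated with a template-method sketch. Everything here is hypothetical (the real interface lives in the PR); a minimal base class defines the sync flow and leaves hooks abstract, and a consumer like Pulp overrides just those hooks:

```python
class Package:
    """Toy stand-in for a package with its metadata."""

    def __init__(self, name: str, metadata: dict) -> None:
        self.name = name
        self.metadata = metadata


class Mirror:
    """Hypothetical base class: drives the sync, delegates the details."""

    def synchronize(self, packages) -> None:
        for package in packages:
            self.process_package(package)
        self.finalize_sync()

    # --- hooks a subclasser would implement ---

    def process_package(self, package: Package) -> None:
        raise NotImplementedError

    def finalize_sync(self) -> None:
        raise NotImplementedError


class MetadataOnlyMirror(Mirror):
    """A consumer that records metadata and never touches the filesystem."""

    def __init__(self) -> None:
        self.records = {}  # maps package name -> metadata dict

    def process_package(self, package: Package) -> None:
        self.records[package.name] = package.metadata

    def finalize_sync(self) -> None:
        pass  # e.g. commit a database transaction
```

The point of a clear hook set like this is exactly what the comment asks for: behavior can be customized without overriding internals that were never meant to be part of the interface.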
Link to the PR #591 |
@dralley re:
Thank you ++... let me play with that |
Can we call this done and release? Please reopen if not, and please open those cleanup issues you were talking about :D |
Proposal
Bandersnatch is the top tool for mirroring PyPI. It's well tested in production, PEP-compliant, and generally respectful of PyPI's resources. It would be excellent if Bandersnatch could be refactored to provide a common library for all mirroring tools interacting with PyPI.
Benefits to Bandersnatch:
Benefits to the Python ecosystem:
What we need
I write this as a contributor to one of these other mirroring tools (although we do more "repo management" than just mirroring). Here is the functionality we would need from Bandersnatch (including things that might already be doable):
Current Problems
Problem: Separation of concerns - the communication with PyPI isn't very separated from internal Bandersnatch-specific behavior.
a) It looks like the primary entry point, Mirror.synchronize(), unavoidably tries to touch the filesystem in various ways that are needed by Bandersnatch but not for someone calling Bandersnatch as a library, such as:
b) It doesn't seem that there is a way to ask Bandersnatch to abstain from downloading the package files (or JSON files? unclear on that one) and saving them to disk.
We need to handle downloading the package files ourselves; the only things we need are the JSON metadata (as a dict or string) and the package download URL.
[deleted tangent about why we need it]
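For the metadata-and-URL requirement above, the relevant data is what PyPI's JSON API returns (`GET https://pypi.org/pypi/<project>/json`, with top-level `info` and `urls` keys). A minimal sketch of extracting just those two things, assuming the response body is already in hand (the helper name `extract_package_info` and the trimmed sample payload in the test are invented for this example):

```python
import json


def extract_package_info(json_text: str):
    """Return (metadata dict, list of download URLs) from a PyPI
    JSON API response body.

    This is all a library caller in the position described above
    would need: no filesystem writes, no package downloads.
    """
    data = json.loads(json_text)
    metadata = data["info"]                 # project metadata dict
    urls = [entry["url"] for entry in data.get("urls", [])]
    return metadata, urls
```

A caller could then feed the URLs to its own downloader (or, in Pulp's lazy-sync case, defer the download entirely).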
Solution
Separate the concerns a little bit better. I think the following suggestions would be the minimum necessary changes:
Split the "Mirror" class in half: create a new class that handles everything up to and including "determine which packages to sync", while the existing class handles the logic of persisting to disk, error handling, and cleanup.
Inside the Package class, move the code which handles the metadata download from Package.sync() into a new method called request_metadata() or something, so that it can be called separately.
Remove the need for Package to hold a reference to Mirror (several changes to the flow of errors are needed here).
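The second suggestion above, splitting metadata retrieval out of `Package.sync()`, could look roughly like this. The method name `request_metadata()` comes from the proposal itself; everything else (the injected `fetch_json` callable, the `storage` object, the payload shape) is hypothetical scaffolding for the sketch:

```python
from typing import Optional


class Package:
    """Sketch of the proposed split: metadata retrieval is its own step.

    `fetch_json` stands in for the HTTP call to PyPI's JSON API.
    """

    def __init__(self, name: str, fetch_json) -> None:
        self.name = name
        self._fetch_json = fetch_json
        self.metadata: Optional[dict] = None

    def request_metadata(self) -> dict:
        # Proposed new method: fetch and cache the metadata, nothing else.
        # A library caller (e.g. Pulp) can stop after this step.
        if self.metadata is None:
            self.metadata = self._fetch_json(self.name)
        return self.metadata

    def sync(self, storage) -> None:
        # The existing sync() now builds on request_metadata(), so the
        # download-and-persist behavior is cleanly separable from it.
        metadata = self.request_metadata()
        for entry in metadata.get("urls", []):
            storage.save(self.name, entry["url"])
```

Note that `sync()` takes the storage object as a parameter rather than reaching back through a `Mirror` reference, which is the direction the third suggestion points in.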