Proposal: A general API for PyPI mirroring tools #548

Closed
dralley opened this issue Jun 3, 2020 · 9 comments

@dralley
Contributor

dralley commented Jun 3, 2020

Proposal

Bandersnatch is the top tool for mirroring PyPI. It's well tested in production, PEP-compliant, and generally respectful of PyPI's resources. It would be excellent if Bandersnatch could be refactored to provide a common library for all mirroring tools interacting with PyPI.

Benefits to Bandersnatch:

  • More testing
  • More eyes on the code
  • More stakeholders willing to fix bugs and make improvements in Bandersnatch, which will trickle down to their own projects

Benefits to the Python ecosystem:

  • Easier to change Warehouse APIs (e.g. remove XML-RPC, add new JSON APIs) if Bandersnatch is the only consumer of those APIs (and other mirroring tools just use Bandersnatch)

What we need

I write this as a contributor to one of these other mirroring tools (although we do "repo management" rather than just mirroring). Here is the functionality we would need from Bandersnatch, including things that might already be doable; a rough usage sketch follows the list:

  • The ability to parse Bandersnatch config files and get the configuration data back in a form that we can save as settings in our database, and re-constitute later for use by the mirroring stage
  • The ability to filter PyPI packages/releases using the filters Bandersnatch provides
  • The following metadata about each package that passes through the filters:
    • filename
    • packagetype
    • name
    • version
    • metadata_version
    • summary
    • description
    • keywords
    • home_page
    • download_url
    • author
    • author_email
    • maintainer
    • maintainer_email
    • license
    • requires_python
    • project_url
    • platform
    • supported_platform
    • requires_dist
    • provides_dist
    • obsoletes_dist
    • requires_external
    • classifiers
  • The download URL to the actual file in the PyPI datastore
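
Roughly, the kind of usage I'm imagining for the first two points (based on my reading of the current internals -- names like BandersnatchConfig and LoadedFilters may change between versions, so treat this as an illustration rather than a stable API):

# Illustration only; assumes Bandersnatch's current internal names.
from bandersnatch.configuration import BandersnatchConfig
from bandersnatch.filter import LoadedFilters

# Parse an unmodified bandersnatch.conf. The underlying object is a plain
# ConfigParser, so its sections/values can be saved to our database and
# reconstituted later for the mirroring stage.
config = BandersnatchConfig("bandersnatch.conf").config
settings = {section: dict(config[section]) for section in config.sections()}

# Run the enabled project-level filter plugins against a package name
# (the plugins read their settings from the config loaded above).
filters = LoadedFilters(load_all=True)

def package_allowed(name: str) -> bool:
    plugins = filters.filter_project_plugins()
    return all(plugin.filter({"info": {"name": name}}) for plugin in plugins)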

Current Problems

Problem: Separation of concerns - the communication with PyPI isn't well separated from internal Bandersnatch-specific behavior.

a) It looks like the primary entry point, Mirror.synchronize(), unavoidably tries to touch the filesystem in various ways that are needed by Bandersnatch but not for someone calling Bandersnatch as a library, such as:

  • "todolist"
  • "last_modified"
  • "statusfile"

b) It doesn't seem that there is a way to ask Bandersnatch to abstain from downloading the package files (or JSON files? unclear on that one) and saving them to disk.

We need to handle downloading the package files ourselves; the only things we need are the JSON metadata (as a dict or string) and the package download URL.
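
For reference, all of the metadata listed above plus the download URL is available from PyPI's JSON API without fetching a single package file; a minimal aiohttp sketch of the part we would handle ourselves:

import aiohttp

async def fetch_project_metadata(name: str) -> dict:
    # GET https://pypi.org/pypi/<name>/json (the Warehouse JSON API).
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://pypi.org/pypi/{name}/json") as resp:
            resp.raise_for_status()
            data = await resp.json()
    # data["info"] carries the per-project fields (name, version, summary, ...);
    # data["releases"][version] lists the files, each with "filename",
    # "packagetype" and the download "url".
    return data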

[deleted tangent about why we need it]

Solution

Separate the concerns a little bit better. I think the following suggestions would be the minimum necessary changes:

Split the "Mirror" class in half, create a new one which handles everything up to and including "determine which packages to sync", while the existing class handles the logic behind persisting it to disk, error handling, and cleaning up.

Inside the Package class, move the code which handles the metadata download from Package.sync() into a new method called request_metadata() or something, so that it can be called separately.

Remove the need for Package to hold a reference to Mirror; this requires several changes to how errors flow.
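
To make that concrete, roughly the shape I have in mind (illustration only; the names below are placeholders, not a proposed final API):

import aiohttp

class Mirror:
    """Decides *what* to sync: talks to PyPI and applies the filter plugins."""

    async def determine_packages_to_sync(self) -> None:
        raise NotImplementedError

class FilesystemMirror(Mirror):
    """The existing behavior: persists packages/JSON to disk, maintains the
    todolist / last_modified / statusfile, and cleans up on errors."""

class Package:
    """No back-reference to a Mirror; metadata download is its own call."""

    def __init__(self, name: str, serial: int = 0) -> None:
        self.name = name
        self.serial = serial
        self.metadata: dict = {}

    async def request_metadata(self, session: aiohttp.ClientSession) -> None:
        # Fetch the project's JSON metadata; raise on failure instead of
        # reporting errors back through a Mirror reference.
        async with session.get(f"https://pypi.org/pypi/{self.name}/json") as resp:
            resp.raise_for_status()
            self.metadata = await resp.json()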

@techalchemy
Member

We need to handle downloading the package files ourselves; the only things we need are the JSON metadata (as a dict or string) and the package download URL.

[deleted tangent about why we need it]

I'm a bit curious about why you need to download them yourself, because I feel like bandersnatch is primarily a mirror, and if you are going to separate concerns, bandersnatch should remain a mirror while the rest of the functionality gets grouped and abstracted for its specific use case.

In my mind, the overall solution that I am imagining looks something like this (diagram omitted): an in-memory cache, a storage backend, and a metadata store (i.e. the database), each of which can be swapped out via a plugin system.

Currently the storage backend is getting abstracted a bit to allow for more generic implementations (like S3-based file storage).
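
Very roughly, that abstraction looks like this (illustrative sketch only; the real interface has more methods and different names):

from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Swappable persistence layer: local filesystem, S3, swift, ..."""

    @abstractmethod
    def write_file(self, path: str, contents: bytes) -> None: ...

    @abstractmethod
    def read_file(self, path: str) -> bytes: ...

    @abstractmethod
    def exists(self, path: str) -> bool: ...

# e.g. an S3-based implementation would satisfy the same interface and be
# selected via configuration rather than hard-coded filesystem calls.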

Ideally, we could move away from so much filesystem interaction, using a queue like redis... or, I mean, aren't all the kids using kafka for everything these days? Surely you should be deploying kafka somewhere in there just to fit in. (That's sarcasm.)

If you only care about downloading metadata, then depending on what specific information you need, I have some extra tooling built for this already over at https://requirementslib.readthedocs.io/en/master/requirementslib.models.metadata.html#requirementslib.models.metadata.get_package (get_package(<pkgname>) and get_package_version(<pkgname>, <pkgversion>)) -- but these are designed for the use case of retrieving dependency metadata and written with Python 2 compatibility, so they are not asyncio-capable.
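
For example (call shapes as documented at that link; the package name and version below are just examples):

# Based on the two helpers named above -- illustration of the call shape only.
from requirementslib.models.metadata import get_package, get_package_version

pkg = get_package("requests")                        # project-level metadata
release = get_package_version("requests", "2.23.0")  # metadata for one release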

Anyhow, I'd argue you could make completely different libraries here, but I'm on board with splitting up the mirror functionality to start off with, since I agree that metadata and actual data are totally independent. Like you, I need to put the metadata into a database (and would love to have that logic shared as well, but I suspect your approach is already implemented). Unlike you, I do still need the files.

@dralley
Contributor Author

dralley commented Jun 4, 2020

If you only care about downloading metadata,

Well, it's not just that we want the metadata, it's that we want to fetch as little metadata as possible, in a way that is compliant and "polite". We'd also like to support many of the filtering and version-limiting features that Bandersnatch supports, and possibly be able to configure Pulp Python using an unmodified Bandersnatch config file.

I'm a bit curious about why you need to download them yourself

So this is the bit I deleted earlier because it's not really that relevant to the issue. There are many reasons, but the main one is that one of the primary features of Pulp is "lazy sync", which is basically on-demand caching. Pulp will ingest all of the metadata and create the desired mirror without actually downloading any of the packages until a client asks Pulp for them. It's a trade-off of a small amount of robustness in exchange for saving a huge amount of disk space and bandwidth on packages that will never actually be used.

But because we support this feature, the architecture assumes that Pulp performs the actual package download itself (whether it does so immediately during the initial mirror, or on an individual per-package basis later).
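
In toy form, the flow looks something like this (made-up names and in-memory storage, not Pulp's actual code):

import urllib.request
from typing import Dict

metadata_db: Dict[str, dict] = {}     # name -> {"url": ..., "filename": ...}
artifact_cache: Dict[str, bytes] = {}

def sync(remote_metadata: Dict[str, dict]) -> None:
    # The sync step ingests metadata only; no package files are fetched here.
    metadata_db.update(remote_metadata)

def serve(name: str) -> bytes:
    # The package is downloaded the first time a client actually asks for it.
    if name not in artifact_cache:
        with urllib.request.urlopen(metadata_db[name]["url"]) as resp:
            artifact_cache[name] = resp.read()
    return artifact_cache[name]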

Ideally, we could move away from so much filesystem interaction.

So just to be clear, I don't think any such drastic changes in architecture are necessary; just shifting the place where it happens would suffice :)

If you want to talk about long-term architecture that would be an interesting discussion to have, but I think that would be a separate issue.

@cooperlees
Contributor

Few points from me:

  • All for the separation (especially if it will all be as clean as your first PR)
  • I agree with @techalchemy - We should be able to abstract the URL finding and downloading enough to make it friendly to you
  • Agree with moving the code so you don't have any filesystem deps
    • This also helps with my idea for the POC simple webserver I'm slowly working on (that might one day become an extra part of bandersnatch), which generates the simple HTML on the fly from the JSON files and caches it in a KV store of choice - e.g. redis
    • Can find that here: https://github.com/cooperlees/banderfront
  • Filtering plugins need a lot of work (mainly more testing)
    • Normalization of names (see the sketch below)
      • Needed both when reading names from the config into a list and when comparing against what XML-RPC returns (which is not normalized)
    • Interoperability testing is not there - e.g. using whitelist + platform filters + regex filters together (totally not tested)
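
For reference, the PEP 503 normalization both sides would need to go through before comparing:

import re

def canonicalize_name(name: str) -> str:
    # PEP 503: runs of "-", "_" and "." collapse to "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()

canonicalize_name("Foo.Bar_baz")  # -> "foo-bar-baz"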

But happy to review and work the library into the shape you need.

tl;dr - We have pretty good test and CI coverage - so as long as you don't break that I'll be pretty accepting :)

I'll try to get #463 done maybe this weekend so we can get coverage back up and remove all the unneeded code from verify.py - I use it as a POC for aiohttp over requests and haven't merged its master / mirror / other operations back into the correct class.

@cooperlees
Contributor

What else is planned here? Can we maybe spell out the remaining work so there is an endpoint for this issue, if possible?

@pombredanne

pombredanne commented Jul 20, 2020

Let me throw in my requirements: I have a few use cases where we maintain mostly a mirror of metadata and will selectively download actual files only if needed. There I want to run a daily or hourly process to update that system of mine, and I was about to start extracting the bits I needed from bandersnatch. Enhancing the code here instead would be a much better option. I feel my use case is not much different from Pulp's, so I am game to help. As @dralley pointed out, it would be much better for me to call bandersnatch as a library rather than reimplementing or forking a metadata-focused subset that would be tied to the legacy XML-RPC API.

@dralley
Contributor Author

dralley commented Jul 29, 2020

@pombredanne Here's what we're working on at the moment in our PR

class Mirror:

    synced_serial: Optional[int] = 0  # The last serial we have consistently synced to.
    target_serial: Optional[int] = None  # What is the serial we are trying to reach?
    errors = False
    packages_to_sync: Dict[str, Union[int, str]] = {}

    # Stop soon after meeting an error. Continue without updating the
    # mirror's serial if false.
    stop_on_error = False

    # We are required to leave a 'last changed' timestamp. I'd rather err
    # on the side of giving a timestamp that is too old so we keep track
    # of it when starting to sync.
    now = None

    def __init__(
        self, master: Master, stop_on_error: bool = False, workers: int = 3,
    ):
        self.master = master
        self.filters = LoadedFilters(load_all=True)
        self.stop_on_error = stop_on_error
        self.workers = workers
        if self.workers > 10:
            raise ValueError("Downloading with more than 10 workers is not allowed")

        # Lets record and report back the changes we do each run
        # Format: dict['pkg_name'] = [set(removed), Set[added]]
        # Class Instance variable so each package can add their changes
        self.altered_packages: Dict[str, Set[str]] = {}

    async def synchronize(
        self, specific_packages: Optional[List[str]] = None
    ) -> Dict[str, Set[str]]:
        logger.info(f"Syncing with {self.master.url}.")
        self.now = datetime.datetime.utcnow()
        # Lets ensure we get a new dict each run
        # - others importing may not reset this like our main.py
        self.altered_packages = {}

        if specific_packages is None:
            # Changelog-based synchronization
            await self.determine_packages_to_sync()
        else:
            # Synchronize specific packages. This method doesn't update the statusfile
            # Pass serial number 0 to bypass the stale serial check in Package class
            SERIAL_DONT_CARE = 0
            self.packages_to_sync = {
                utils.bandersnatch_safe_name(name): SERIAL_DONT_CARE
                for name in specific_packages
            }

        await self.sync_packages()
        self.finalize_sync()
        return self.altered_packages

    def _filter_packages(self) -> None:
        """
        Run the package filtering plugins and remove any packages from the
        packages_to_sync that match any filters.
        - Logging of action will be done within the check_match methods
        """
        global LOG_PLUGINS

        filter_plugins = self.filters.filter_project_plugins()
        if not filter_plugins:
            if LOG_PLUGINS:
                logger.info("No project filters are enabled. Skipping filtering")
                LOG_PLUGINS = False
            return

        # Make a copy of self.packages_to_sync keys
        # as we may delete packages during iteration
        packages = list(self.packages_to_sync.keys())
        for package_name in packages:
            if not all(
                plugin.filter({"info": {"name": package_name}})
                for plugin in filter_plugins
                if plugin
            ):
                if package_name not in self.packages_to_sync:
                    logger.debug(f"{package_name} not found in packages to sync")
                else:
                    del self.packages_to_sync[package_name]

    async def determine_packages_to_sync(self) -> None:
        """
        Update the self.packages_to_sync to contain packages that need to be
        synced.
        """
        raise NotImplementedError()

    async def package_syncer(self, idx: int) -> None:
        logger.debug(f"Package syncer {idx} started for duty")
        while True:
            try:
                package = self.package_queue.get_nowait()
                await package.update_metadata(self.master, attempts=3)
                await self.process_package(package)
            except asyncio.QueueEmpty:
                logger.debug(f"Package syncer {idx} emptied queue")
                break
            except PackageNotFound:
                return  # I think this should be a continue
            except Exception:
                logger.exception(
                    f"Error syncing package: {package.name}@{package.serial}"
                )
                self.errors = True

            if self.errors and self.stop_on_error:
                logger.error("Exiting early after error.")
                sys.exit(1)

    async def process_package(self, package: Package) -> None:
        raise NotImplementedError()

    async def sync_packages(self) -> None:
        try:
            self.package_queue: asyncio.Queue = asyncio.Queue()
            # Sorting the packages alphabetically makes it more predictable:
            # easier to debug and easier to follow in the logs.
            for name in sorted(self.packages_to_sync):
                serial = int(self.packages_to_sync[name])
                await self.package_queue.put(Package(name, serial=serial))

            sync_coros: List[Awaitable] = [
                self.package_syncer(idx) for idx in range(self.workers)
            ]
            try:
                await asyncio.gather(*sync_coros)
            except KeyboardInterrupt:
                # Setting self.errors to True to ensure we don't save Serial
                # and thus save to disk that we've had a successful sync
                self.errors = True
                logger.info(
                    "Cancelling, all downloads are forcibly stopped, data may be "
                    + "corrupted. Serial will not be saved to disk. "
                    + "Next sync will start from previous serial"
                )
        except (ValueError, TypeError):
            # This is for when self.packages_to_sync isn't of type Dict[str, int]
            self.errors = True

    def finalize_sync(self) -> None:
        raise NotImplementedError()

As an API user, the methods that a subclasser would need to implement are:

  • determine_packages_to_sync() # init the structures that define the packages to be synced
  • process_package(package) # called for each Package object (do things like saving the json and downloading packages -- or not)
  • finalize_sync() # called once at the end of the sync (do things like write index file -- or not)

Do you think that would be sufficient for your purposes? Of course, technically anything can be overridden, but it would be great if we could define a clear interface that allows all the behavior to be customized without resorting to that.
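
For example, a metadata-only consumer like us would subclass roughly like this (save_to_database and load_target_packages are placeholders for the consumer's own logic, and I'm assuming the JSON metadata fetched by update_metadata() is exposed on the package object):

class MetadataOnlyMirror(Mirror):
    async def determine_packages_to_sync(self) -> None:
        # e.g. from the XML-RPC changelog, or a fixed list from our database
        self.packages_to_sync = dict(load_target_packages())  # name -> serial

    async def process_package(self, package: Package) -> None:
        # Store the metadata and file URLs; never download the files themselves.
        save_to_database(package.name, package.metadata)

    def finalize_sync(self) -> None:
        # Nothing to persist to disk -- no simple/ index, no statusfile.
        pass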

We would definitely love your feedback. There are probably still plenty of things that could be improved, especially around determine_packages_to_sync().

@dralley
Contributor Author

dralley commented Jul 29, 2020

Link to the PR #591

@pombredanne

@dralley re:

Here's what we're working on at the moment in our PR

Thank you ++... let me play with that

@cooperlees
Contributor

Can we call this done and release? Please reopen if not, and please open those cleanup issues you were talking about :D
