Proposal: A general API for PyPI mirroring tools #548

Closed
dralley opened this issue Jun 3, 2020 · 9 comments

@dralley
Contributor

dralley commented Jun 3, 2020

Proposal

Bandersnatch is the top tool for mirroring PyPI. It's well tested in production, PEP-compliant, and generally respectful of PyPI's resources. It would be excellent if Bandersnatch could be refactored to provide a common library for all mirroring tools interacting with PyPI.

Benefits to Bandersnatch:

  • More testing
  • More eyes on the code
  • More stakeholders willing to fix bugs and make improvements in Bandersnatch, which will trickle down to their own projects

Benefits to the Python ecosystem:

  • Easier to change Warehouse APIs (e.g. remove XML-RPC, add new JSON APIs) if Bandersnatch is the only consumer of those APIs (and other mirroring tools just use Bandersnatch)

What we need

I write this as a contributor to one of these other mirroring tools (although we do "repo management" rather than just mirroring). Here is the functionality we would need from Bandersnatch, including things that might already be doable; a rough usage sketch follows the list:

  • The ability to parse Bandersnatch config files and get the configuration data back in a form that we can save as settings in our database, and re-constitute later for use by the mirroring stage
  • The ability to filter PyPI packages/releases using the filters Bandersnatch provides
  • The following metadata about each package that passes through the filters:
    • filename
    • packagetype
    • name
    • version
    • metadata_version
    • summary
    • description
    • keywords
    • home_page
    • download_url
    • author
    • author_email
    • maintainer
    • maintainer_email
    • license
    • requires_python
    • project_url
    • platform
    • supported_platform
    • requires_dist
    • provides_dist
    • obsoletes_dist
    • requires_external
    • classifiers
  • The download URL to the actual file in the PyPI datastore
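
Roughly, the kind of usage I'm imagining for the first two points (based on my reading of the current internals -- names like BandersnatchConfig and LoadedFilters may change between versions, so treat this as an illustration rather than a stable API):

# Illustration only; assumes Bandersnatch's current internal names.
from bandersnatch.configuration import BandersnatchConfig
from bandersnatch.filter import LoadedFilters

# Parse an unmodified bandersnatch.conf. The underlying object is a plain
# ConfigParser, so its sections/values can be saved to our database and
# reconstituted later for the mirroring stage.
config = BandersnatchConfig("bandersnatch.conf").config
settings = {section: dict(config[section]) for section in config.sections()}

# Run the enabled project-level filter plugins against a package name
# (the plugins read their settings from the config loaded above).
filters = LoadedFilters(load_all=True)

def package_allowed(name: str) -> bool:
    plugins = filters.filter_project_plugins()
    return all(plugin.filter({"info": {"name": name}}) for plugin in plugins)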

Current Problems

Problem: Separation of concerns - the communication with PyPI isn't well separated from internal Bandersnatch-specific behavior.

a) It looks like the primary entry point, Mirror.synchronize(), unavoidably tries to touch the filesystem in various ways that are needed by Bandersnatch but not for someone calling Bandersnatch as a library, such as:

  • "todolist"
  • "last_modified"
  • "statusfile"

b) It doesn't seem that there is a way to ask Bandersnatch to abstain from downloading the package files (or JSON files? unclear on that one) and saving them to disk.

We need to handle downloading the package files ourselves; the only things we need are the JSON metadata (as a dict or string) and the package download URL.
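
For reference, all of the metadata listed above plus the download URL is available from PyPI's JSON API without fetching a single package file; a minimal aiohttp sketch of the part we would handle ourselves:

import aiohttp

async def fetch_project_metadata(name: str) -> dict:
    # GET https://pypi.org/pypi/<name>/json (the Warehouse JSON API).
    async with aiohttp.ClientSession() as session:
        async with session.get(f"https://pypi.org/pypi/{name}/json") as resp:
            resp.raise_for_status()
            data = await resp.json()
    # data["info"] carries the per-project fields (name, version, summary, ...);
    # data["releases"][version] lists the files, each with "filename",
    # "packagetype" and the download "url".
    return data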

[deleted tangent about why we need it]

Solution

Separate the concerns a little bit better. I think the following suggestions would be the minimum necessary changes:

Split the "Mirror" class in half, create a new one which handles everything up to and including "determine which packages to sync", while the existing class handles the logic behind persisting it to disk, error handling, and cleaning up.

Inside the Package class, move the code which handles the metadata download from Package.sync() into a new method called request_metadata() or something, so that it can be called separately.

Remove the need for Package to hold a reference to Mirror; this requires several changes to how errors flow.
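
To make that concrete, roughly the shape I have in mind (illustration only; the names below are placeholders, not a proposed final API):

import aiohttp

class Mirror:
    """Decides *what* to sync: talks to PyPI and applies the filter plugins."""

    async def determine_packages_to_sync(self) -> None:
        raise NotImplementedError

class FilesystemMirror(Mirror):
    """The existing behavior: persists packages/JSON to disk, maintains the
    todolist / last_modified / statusfile, and cleans up on errors."""

class Package:
    """No back-reference to a Mirror; metadata download is its own call."""

    def __init__(self, name: str, serial: int = 0) -> None:
        self.name = name
        self.serial = serial
        self.metadata: dict = {}

    async def request_metadata(self, session: aiohttp.ClientSession) -> None:
        # Fetch the project's JSON metadata; raise on failure instead of
        # reporting errors back through a Mirror reference.
        async with session.get(f"https://pypi.org/pypi/{self.name}/json") as resp:
            resp.raise_for_status()
            self.metadata = await resp.json()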

@techalchemy
Member

We need to handle downloading the package files ourselves; the only things we need are the JSON metadata (as a dict or string) and the package download URL.

[deleted tangent about why we need it]

I'm a bit curious about why you need to download them yourself, because I feel like bandersnatch is primarily a mirror, and if you are going to separate concerns, bandersnatch should remain a mirror while the rest of the functionality gets grouped and abstracted for its specific use case.

In my mind, the overall solution that I am imagining looks something like this (diagram omitted): an in-memory cache, a storage backend, and a metadata store (i.e. the database), each of which can be swapped out via a plugin system.

Currently the storage backend is getting abstracted a bit to allow for more generic implementations (like S3-based file storage).
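
Very roughly, that abstraction looks like this (illustrative sketch only; the real interface has more methods and different names):

from abc import ABC, abstractmethod

class StorageBackend(ABC):
    """Swappable persistence layer: local filesystem, S3, swift, ..."""

    @abstractmethod
    def write_file(self, path: str, contents: bytes) -> None: ...

    @abstractmethod
    def read_file(self, path: str) -> bytes: ...

    @abstractmethod
    def exists(self, path: str) -> bool: ...

# e.g. an S3-based implementation would satisfy the same interface and be
# selected via configuration rather than hard-coded filesystem calls.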

Ideally, we could move away from so much filesystem interaction, using a queue like redis... or, I mean, aren't all the kids using kafka for everything these days? Surely you should be deploying kafka somewhere in there just to fit in. (That's sarcasm.)

If you only care about downloading metadata, then depending on what specific information you need, I have some extra tooling built for this already over at https://requirementslib.readthedocs.io/en/master/requirementslib.models.metadata.html#requirementslib.models.metadata.get_package (get_package(<pkgname>) and get_package_version(<pkgname>, <pkgversion>)) -- but these are designed for the use case of retrieving dependency metadata and written with Python 2 compatibility, so they are not asyncio-capable.
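
For example (call shapes as documented at that link; the package name and version below are just examples):

# Based on the two helpers named above -- illustration of the call shape only.
from requirementslib.models.metadata import get_package, get_package_version

pkg = get_package("requests")                        # project-level metadata
release = get_package_version("requests", "2.23.0")  # metadata for one release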

Anyhow, I'd argue you could make completely different libraries here, but I'm on board with splitting up the mirror functionality to start off with, since I agree that metadata and actual data are totally independent. Like you, I need to put the metadata into a database (and would love to have that logic shared as well, but I suspect your approach is already implemented). Unlike you, I do still need the files.

@dralley
Contributor Author

dralley commented Jun 4, 2020

If you only care about downloading metadata,

Well, it's not just that we want the metadata, it's that we want to fetch as little metadata as possible, in a way that is compliant and "polite". We'd also like to support many of the filtering and version-limiting features that Bandersnatch supports, and possibly be able to configure Pulp Python using an unmodified Bandersnatch config file.

I'm a bit curious about why you need to download them yourself

So this is the bit I deleted earlier because it's not really that relevant to the issue. There are many reasons, but the main one is that one of the primary features of Pulp is "lazy sync", which is basically on-demand caching. Pulp will ingest all of the metadata and create the desired mirror without actually downloading any of the packages until a client asks Pulp for them. It's a trade-off of a small amount of robustness in exchange for saving a huge amount of disk space and bandwidth on packages that will never actually be used.

But because we support this feature, the architecture assumes that Pulp performs the actual package download itself (whether it does so immediately during the initial mirror, or on an individual per-package basis later).
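
In toy form, the flow looks something like this (made-up names and in-memory storage, not Pulp's actual code):

import urllib.request
from typing import Dict

metadata_db: Dict[str, dict] = {}     # name -> {"url": ..., "filename": ...}
artifact_cache: Dict[str, bytes] = {}

def sync(remote_metadata: Dict[str, dict]) -> None:
    # The sync step ingests metadata only; no package files are fetched here.
    metadata_db.update(remote_metadata)

def serve(name: str) -> bytes:
    # The package is downloaded the first time a client actually asks for it.
    if name not in artifact_cache:
        with urllib.request.urlopen(metadata_db[name]["url"]) as resp:
            artifact_cache[name] = resp.read()
    return artifact_cache[name]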

Ideally, we could move away from so much filesystem interaction.

So just to be clear, I don't think any such drastic changes in architecture are necessary; just shifting the place where it happens would suffice :)

If you want to talk about long-term architecture that would be an interesting discussion to have, but I think that would be a separate issue.

@cooperlees
Contributor

Few points from me:

  • All for the separation (especially if it will all be as clean as your first PR)
  • I agree with @techalchemy - We should be able to abstract the URL finding and downloading enough to make it friendly to you
  • Agree with moving the code so you don't have any filesystem deps
    • This also helps with my idea for the POC simple webserver I'm slowly working on (that might one day become an extra part of bandersnatch), which generates the simple HTML on the fly from the JSON files and caches it in a KV store of choice - e.g. redis
    • Can find that here: https://github.com/cooperlees/banderfront
  • Filtering plugins need a lot of work (mainly more testing)
    • Normalization of names (see the sketch below)
      • Needed both when reading names from the config into a list and when comparing against what XML-RPC returns (which is not normalized)
    • Interoperability testing is not there - e.g. using whitelist + platform filters + regex filters together (totally not tested)
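
For reference, the PEP 503 normalization both sides would need to go through before comparing:

import re

def canonicalize_name(name: str) -> str:
    # PEP 503: runs of "-", "_" and "." collapse to "-", then lowercase.
    return re.sub(r"[-_.]+", "-", name).lower()

canonicalize_name("Foo.Bar_baz")  # -> "foo-bar-baz"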

But happy to review and work the library into the shape you need.

tl;dr - We have pretty good test and CI coverage - so as long as you don't break that I'll be pretty accepting :)

I'll try to get #463 done maybe this weekend so we can get coverage back up and remove all the unneeded code from verify.py - I use it as a POC for aiohttp over requests and haven't merged its master / mirror / other operations back into the correct class.

@cooperlees
Contributor

What else is planned here? Can we maybe spell out the remaining work so there is an endpoint for this issue, if possible?

@pombredanne

pombredanne commented Jul 20, 2020

Let me throw in my requirements: I have a few use cases where we maintain mostly a mirror of metadata and will selectively download actual files only if needed. There I want to run a daily or hourly process to update that system of mine, and I was about to start extracting the bits I needed from bandersnatch. Enhancing the code here instead would be a much better option. I feel my use case is not much different from Pulp's, so I am game to help. As @dralley pointed out, it would be much better for me to call bandersnatch as a library rather than reimplementing or forking a metadata-focused subset that would be tied to the legacy XML-RPC API.

@dralley
Contributor Author

dralley commented Jul 29, 2020

@pombredanne Here's what we're working on at the moment in our PR

class Mirror:

    synced_serial: Optional[int] = 0  # The last serial we have consistently synced to.
    target_serial: Optional[int] = None  # What is the serial we are trying to reach?
    errors = False
    packages_to_sync: Dict[str, Union[int, str]] = {}

    # Stop soon after meeting an error. Continue without updating the
    # mirror's serial if false.
    stop_on_error = False

    # We are required to leave a 'last changed' timestamp. I'd rather err
    # on the side of giving a timestamp that is too old so we keep track
    # of it when starting to sync.
    now = None

    def __init__(
        self, master: Master, stop_on_error: bool = False, workers: int = 3,
    ):
        self.master = master
        self.filters = LoadedFilters(load_all=True)
        self.stop_on_error = stop_on_error
        self.workers = workers
        if self.workers > 10:
            raise ValueError("Downloading with more than 10 workers is not allowed")

        # Lets record and report back the changes we do each run
        # Format: dict['pkg_name'] = [set(removed), Set[added]]
        # Class Instance variable so each package can add their changes
        self.altered_packages: Dict[str, Set[str]] = {}

    async def synchronize(
        self, specific_packages: Optional[List[str]] = None
    ) -> Dict[str, Set[str]]:
        logger.info(f"Syncing with {self.master.url}.")
        self.now = datetime.datetime.utcnow()
        # Lets ensure we get a new dict each run
        # - others importing may not reset this like our main.py
        self.altered_packages = {}

        if specific_packages is None:
            # Changelog-based synchronization
            await self.determine_packages_to_sync()
        else:
            # Synchronize specific packages. This method doesn't update the statusfile
            # Pass serial number 0 to bypass the stale serial check in Package class
            SERIAL_DONT_CARE = 0
            self.packages_to_sync = {
                utils.bandersnatch_safe_name(name): SERIAL_DONT_CARE
                for name in specific_packages
            }

        await self.sync_packages()
        self.finalize_sync()
        return self.altered_packages

    def _filter_packages(self) -> None:
        """
        Run the package filtering plugins and remove any packages from the
        packages_to_sync that match any filters.
        - Logging of action will be done within the check_match methods
        """
        global LOG_PLUGINS

        filter_plugins = self.filters.filter_project_plugins()
        if not filter_plugins:
            if LOG_PLUGINS:
                logger.info("No project filters are enabled. Skipping filtering")
                LOG_PLUGINS = False
            return

        # Make a copy of self.packages_to_sync keys
        # as we may delete packages during iteration
        packages = list(self.packages_to_sync.keys())
        for package_name in packages:
            if not all(
                plugin.filter({"info": {"name": package_name}})
                for plugin in filter_plugins
                if plugin
            ):
                if package_name not in self.packages_to_sync:
                    logger.debug(f"{package_name} not found in packages to sync")
                else:
                    del self.packages_to_sync[package_name]

    async def determine_packages_to_sync(self) -> None:
        """
        Update the self.packages_to_sync to contain packages that need to be
        synced.
        """
        raise NotImplementedError()

    async def package_syncer(self, idx: int) -> None:
        logger.debug(f"Package syncer {idx} started for duty")
        while True:
            try:
                package = self.package_queue.get_nowait()
                await package.update_metadata(self.master, attempts=3)
                await self.process_package(package)
            except asyncio.QueueEmpty:
                logger.debug(f"Package syncer {idx} emptied queue")
                break
            except PackageNotFound:
                return  # I think this should be a continue
            except Exception:
                logger.exception(
                    f"Error syncing package: {package.name}@{package.serial}"
                )
                self.errors = True

            if self.errors and self.stop_on_error:
                logger.error("Exiting early after error.")
                sys.exit(1)

    async def process_package(self, package: Package) -> None:
        raise NotImplementedError()

    async def sync_packages(self) -> None:
        try:
            self.package_queue: asyncio.Queue = asyncio.Queue()
            # Sorting the packages alphabetically makes it more predictable:
            # easier to debug and easier to follow in the logs.
            for name in sorted(self.packages_to_sync):
                serial = int(self.packages_to_sync[name])
                await self.package_queue.put(Package(name, serial=serial))

            sync_coros: List[Awaitable] = [
                self.package_syncer(idx) for idx in range(self.workers)
            ]
            try:
                await asyncio.gather(*sync_coros)
            except KeyboardInterrupt:
                # Setting self.errors to True to ensure we don't save Serial
                # and thus save to disk that we've had a successful sync
                self.errors = True
                logger.info(
                    "Cancelling, all downloads are forcibly stopped, data may be "
                    + "corrupted. Serial will not be saved to disk. "
                    + "Next sync will start from previous serial"
                )
        except (ValueError, TypeError):
            # This is for when self.packages_to_sync isn't of type Dict[str, int]
            self.errors = True

    def finalize_sync(self) -> None:
        raise NotImplementedError()

As an API user, the methods that a subclasser would need to implement are:

  • determine_packages_to_sync() # init the structures that define the packages to be synced
  • process_package(package) # called for each Package object (do things like saving the json and downloading packages -- or not)
  • finalize_sync() # called once at the end of the sync (do things like write index file -- or not)

Do you think that would be sufficient for your purposes? Of course, technically anything can be overridden, but it would be great if we could define a clear interface that allows all the behavior to be customized without resorting to that.
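
For example, a metadata-only consumer like us would subclass roughly like this (save_to_database and load_target_packages are placeholders for the consumer's own logic, and I'm assuming the JSON metadata fetched by update_metadata() is exposed on the package object):

class MetadataOnlyMirror(Mirror):
    async def determine_packages_to_sync(self) -> None:
        # e.g. from the XML-RPC changelog, or a fixed list from our database
        self.packages_to_sync = dict(load_target_packages())  # name -> serial

    async def process_package(self, package: Package) -> None:
        # Store the metadata and file URLs; never download the files themselves.
        save_to_database(package.name, package.metadata)

    def finalize_sync(self) -> None:
        # Nothing to persist to disk -- no simple/ index, no statusfile.
        pass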

We would definitely love your feedback. There are probably still plenty of things that could be improved, especially around determine_packages_to_sync().

@dralley
Contributor Author

dralley commented Jul 29, 2020

Link to the PR #591

@pombredanne

@dralley re:

Here's what we're working on at the moment in our PR

Thank you ++... let me play with that

@cooperlees
Contributor

Can we call this done and release? Please reopen if not, and please open those cleanup issues you were talking about :D
