
Revamping PyPI Development: Build Infrastructure and Tooling #12157

Open
dstufft opened this issue Aug 30, 2022 · 9 comments
Labels: developer experience, needs discussion

Comments

@dstufft
Member

dstufft commented Aug 30, 2022

Something I've been thinking about lately is the developer experience of working on PyPI (not just Warehouse, but Warehouse and all the related services and libraries).

Background / Current State

One of the big things that marred people's ability to contribute to legacy PyPI was that getting a local setup running was extremely finicky: it involved things like having to comment out certain pieces of code (or getting access to a running S3 bucket) and getting a lot of local things set up (specific Python versions, databases, etc).

Early on in Warehouse's history we identified this as a major problem and decided to solve it. We have largely settled on a broad assumption that make, docker, docker-compose, and the standard set of unix coreutils (sort, awk, etc) are the things we can depend on existing on the developer's computer, and we shunt everything else into a docker container 1.

We've also made the decision to split out certain parts of Warehouse's functionality into libraries that we distribute (readme-renderer, trove-classifiers, maybe pypitoken), additional services that we run alongside Warehouse (inspector, camo, linehaul, conveyor, pypi-docs-proxy, maybe forklift), and related supporting "misc" repositories (pypi-infra). Generally speaking, each of those repositories is operated as a largely independent project within its own repo (its own release cadence, issues, etc) 2.

To manage dependencies, we've settled on using pip-compile to create lock files that have been generated with valid hashes. We have multiple of them to break our dependencies down into broad categories (so that we don't have to install test dependencies into production, for instance), and each is generated independently of the others 3. To manage dependency updates, we rely on Dependabot to create PRs for dependency bumps that we then run through CI as pull requests.
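
As a rough sketch of what that looks like in practice (file names and exact flags are illustrative, not necessarily what the repository's tooling uses), each category's `.in` file is compiled on its own into a hashed lock file, and installs then pull in only the categories they need:

```sh
# Each category is resolved independently, with hashes baked into the output.
pip-compile --generate-hashes --output-file=requirements/main.txt  requirements/main.in
pip-compile --generate-hashes --output-file=requirements/tests.txt requirements/tests.in

# Production-ish installs then only use the categories they actually need.
pip install --require-hashes -r requirements/main.txt
```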

To manage any sort of admin tooling or similar that we may want to write, we generally have a few options available to us:

  1. Write a command under python -m warehouse, and manually invoke it with direct access to our running kubernetes cluster.
  2. Write an admin UI in our web interface for it.
  3. Write a one off script and run it such that it has access to any internal services it needs (direct access to the PostgreSQL database, Fastly API keys, etc).

That largely covers the bulk of Warehouse itself; the other missing piece for Warehouse is our front end JavaScript and CSS. That is managed entirely separately using Gulp, Webpack, and a bunch of JS libraries. This has been more or less untouched and effectively unmaintained, because changes to it have a tendency to break things and our front end test coverage is poor, so it's largely just sitting there with minimal changes made to the build infrastructure.

For managing CI, we're currently using GitHub Actions, which have been carefully constructed to directly run the underlying commands that, in local development, would normally be invoked inside of a docker container 4.

The final piece of the puzzle is that deployments are managed by Cabotage, which orchestrates building a Dockerfile at the root of a repository, manages configuration for the deployed versions of an application, combines that into Kubernetes primitives to deploy the built images into Kubernetes using the processes defined in the repo's Procfile, and finally updates statuses in various locations (GitHub Deployments API, Slack, etc).

Current Problems

1. Speed

One of the biggest problems with the current state is that iterating on PyPI is very slow anytime you have to interact with the build tooling. Something that should be relatively fast, like running black against our code base, takes just under 0.3 seconds when run natively on my desktop 5, but running make reformat takes over 5 seconds, assuming that I already have the docker containers built and make serve running in another shell 6.

If I don't have make serve running in another shell it takes somewhere around 45 seconds to run and it leaves a number of docker containers running in the background after the command has finished.

If someone doesn't already have the docker containers built (or if they've for some reason caused their cache to be invalid, say by switching between branches with different dependencies), then you can add around 2m30s to build the local containers, plus possibly another couple of minutes to download images (on my Gigabit internet connection).

Now, to some extent some of the "bootstrapping" speed problems are unavoidable: docker is still one of the best ways to get a bunch of related services like PostgreSQL, Redis, etc onto a random contributor's computer without a large and complicated set of setup steps, and caching of build steps alleviates some of these problems.

However, the way docker fundamentally works makes this caching less useful than it otherwise could be for speeding up repeat runs. The biggest problem is that docker caching is fundamentally a linear chain of mutations, each taking a single input and producing a single output, so anytime some part of the chain is invalidated, everything coming after it has to be regenerated, whether it actually needed to be or not.

Most well-made modern build tools (and even a lot of older tools, like Make itself) support the idea of a dependency graph in which a target can take multiple inputs (and sometimes produce multiple outputs, though it's common to only support one output), and can then be more intelligent about (re)building only the parts of the dependency graph that actually need it 7.
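
As a toy illustration of that property (not Warehouse's actual Makefile; the tool names are placeholders), Make only re-runs a rule when one of its declared inputs is newer than its output, so unrelated targets are left alone:

```make
# Toy example: two outputs with independent inputs. Touching src/styles.scss
# rebuilds only static/styles.css; static/bundle.js is untouched.
# (Recipe lines must be indented with a tab.)
static/styles.css: src/styles.scss
	sassc src/styles.scss static/styles.css

static/bundle.js: src/index.js src/util.js
	esbuild src/index.js --bundle --outfile=static/bundle.js
```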

Taking all of the "bootstrapping" speed issues off the table for a minute, even in the very best case the nature of starting up a docker container for any command we run essentially means that nothing we run can return in <5s, which is just enough time to be frustrating, even when the underlying command itself is fast.

2. Mounted Volumes Issues

One of the ways we make developing Warehouse not intolerably slow is that once our application image has been built and is running, we run that container with the developer's checkout of Warehouse mounted over top of the path that the build process typically installs Warehouse into. We then run Warehouse such that it monitors those files for changes and automatically reloads itself whenever those files change, creating a reasonably fast feedback loop for changes that can be handled this way.

However, mounting a host volume in this way brings with it some issues.

The largest issue is that in the typical setup the docker daemon runs as root but the developer does not, so any files created within the docker containers are owned by root and the user doesn't have permission on them. The most obvious place people run into this is with the generated static files, but it also affects things like auto-generated migration files, any files that make reformat or make translations needs to format or generate, and the generated docs produced by make docs.

Other issues stem from things like the reloader not being well protected from invalid code causing it to fully exit: if you save a file with syntax errors, the web container will crash completely 8, requiring you to Ctrl+C the running make serve and then restart it, which, depending on what you've changed, may or may not trigger a whole rebuild of the docker container and another multi-minute delay 9.

Another issue that stems from this is that, due to the way node works, if the developer has a node_modules directory for their host system in their repository and that has been mounted into the docker container, then node/npm will blindly use whatever is in there, even if it contains binaries for a completely different system. To work around this, we don't actually mount the entire repository into the docker container, but instead mount a bunch of sub-paths within the repository.
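
A rough sketch of what that sub-path mounting looks like in a docker-compose service definition (the service name and container paths are illustrative, not Warehouse's actual file):

```yaml
services:
  web:
    build: .
    volumes:
      # Mount only selected sub-paths so build outputs inside the container
      # (node_modules, compiled static files, etc.) aren't shadowed by
      # whatever happens to exist on the host.
      - ./warehouse:/opt/warehouse/src/warehouse
      - ./tests:/opt/warehouse/src/tests
      - ./dev:/opt/warehouse/src/dev
```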

This can cause confusion for people if they've changed anything 10 in a location that hasn't been host mounted, since the version inside the container might not match what's on the host; for example, settings inside a top-level file, or a new sub-path that simply hasn't been added to the mount list yet.

3. Dependency Management is Error Prone and Manual

The way pip-compile works (and really, all or most of Python packaging) is that dependency resolution is a "one shot" process, and invoking pip or pip-compile multiple times does not guarantee that the end state will be valid.

The way we use pip-compile, with multiple requirements/*.in files that each get compiled individually into separate requirements/*.txt files, means that overlapping dependency constraints between different files are not taken into account, so you may end up with the same dependency pinned to multiple different versions in different requirements/*.txt files, which fundamentally cannot work, and currently we have to manually resolve these cases when they come up.
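
As a concrete, hypothetical illustration of how the independent compiles can disagree:

```sh
# Hypothetical: both lock files pull in "attrs" transitively, but because each
# .in file was resolved on its own, they pin different versions.
grep '^attrs==' requirements/main.txt requirements/tests.txt
# requirements/main.txt:attrs==22.1.0 ...
# requirements/tests.txt:attrs==21.4.0 ...
# Both files get installed into the same environment, so the conflict has to
# be reconciled by hand.
```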

In addition, while pip supports taking multiple requirements files as input, it doesn't support having some of the requirements pinned using hashes and some not, so installing the requirements files that are not generated using pip-compile has to happen in a separate step, which, as mentioned above, means the end state may not be a valid set of packages. We will "fail fast" in this case because after we've finished installing everything we run pip check, which iterates through all of the dependencies and verifies that all of the dependency constraints are valid 11, but again, actually resolving things to prevent this from happening is a manual process.

To ensure that our generated requirements/*.txt files have been regenerated after a dependency has been added or removed, we have a make deps job which essentially just runs pip-compile on all of the requirements/*.in files to a temporary location and then compares that to the current version to look for changes. However, because the output can change if someone uploads or deletes a file 12 or publishes a new release 13, this job parses the requirements.txt to extract the names and makes sure there are no extra or missing names.

Unfortunately, this check is particularly bad for a few reasons:

  1. Since it's invoking pip-compile to generate a new lock file, it will pull in files or versions released since the last time versions were bumped. It attempts to control against that, but if a new version of one of our dependencies adds or removes a transitive dependency, the check fails despite there being no changes in the Warehouse repository, which means it starts failing on every pull request or deploy until someone manually fixes it.
  2. To mitigate (1), the make deps job strips the ==VERSION when parsing the requirements.txt file and only compares names. However, that means the value of this job is limited, because it's not actually making sure that our requirements.txt files contain what pip-compile thinks they should, even on pull requests that actually modify our dependencies in some way.

This means that this check ends up creating "emergency" dependency management tasks to unbreak CI at effectively random intervals, while the mitigations to reduce that end up drastically reducing the value of the check, allowing invalid things to pass (and thus requiring manual intervention by humans to fix them) 14.

Likewise, Dependabot is also falling short here for us.

Dependabot itself does not understand pip-compile at all; it just treats each individual requirements/*.txt file as an independent artifact and blindly bumps versions regardless of what is inside our requirements/*.in files, or even, I believe, regardless of what the dependency metadata within the packages themselves says. On top of that, the same problem with pip-compile only resolving one set of files at a time applies here as well: it will gladly bump versions in different files to different versions, even though for our purposes they ultimately need to exist in the same installed environment and thus need to be the same 15.

Dependabot also does not support grouping multiple dependency updates together, which means that every single dependency update has to be merged individually, or someone has to manually combine all of the pull requests (which can maybe be done automatically with a merge, but not always, if there are merge conflicts), which makes attempting to keep up with all of the Dependabot PRs take a large amount of time 16. This is particularly problematic for dependencies with tight integration with each other, like the various boto tools, which may not even be valid to update one at a time; but even when it is, every update requires updating multiple dependencies.

4. Multiple Repositories Increase Overhead and Cause Bit-rot

In general I'm a fan of multiple repositories, treating individual projects as their own unit, as I think that ends up providing an overall better experience for everyone and requires less "build infrastructure/tooling maintenance" for each of those individual repositories, since most OSS tooling assumes a 1:1 mapping of repository to project, and the alternative, monorepos, generally become unwieldy and require more effort to keep performant.

However, that being said, the poly repo strategy has its own problems, particularly when viewed under the lens of "the primary purpose of this collection of repositories is to be combined into a singular thing", as is the case with Warehouse + related repositories coming together to form "PyPI".

Some of the forms of overhead include:

  1. Every repository ends up having its own issue tracker, but new contributors, or even just users, are not familiar with which repository "owns" which functionality, so they'll oftentimes just dump their issues into whatever repository was the easiest to find. This puts maintainers in a position where they either have to redirect those users to a different repository 17 or let the issue sit in the "wrong" repository, and now the true set of issues is spread over multiple repositories, making it harder to actually find things.
    • Even when people do everything correctly, oftentimes issues span multiple repositories; for example, readme-renderer might have an issue to add some functionality, while Warehouse needs an issue to integrate that functionality.
  2. Making changes to PyPI can often require making changes across multiple repositories, which makes it harder to evaluate the full context of a proposed change because it ends up involving several repositories, and so several pull requests plus mentally thinking through the interactions; for example, a code change that relies on changes in our VCL to work.
    • We also don't have good tooling support to have Warehouse use a non-released version of one of our own dependencies, which means that even the process of developing such a change can be harder and longer: you have to first get the change into the dependency, get that dependency released, and only then start working on the Warehouse change, OR manually muck around in the requirements files or add your own host mounts for editable installs or something along those lines.
  3. Having these projects be treated as independent leads to drift between the teams involved in each of them. For instance, trove-classifiers has 2/3 of the current admin team able to release it, while readme-renderer has a different 2/3 of the current admin team able to release it.
  4. Having these projects be treated as independent means each one's build tooling and infrastructure will evolve independently of the others unless we put in a concerted effort to keep them in sync. For example, readme-renderer uses tox to manage environments/tasks while trove-classifiers uses a basic hand-made Makefile.
  5. These projects are typically intimately tied to Warehouse's development, but there's no mechanism to ensure that any particular change doesn't break Warehouse.
    • While this is true for any dependency that we have, given that these interrelated dependencies are effectively developed as part of PyPI, we're giving up efficiency by not being able to catch problems prior to a release going out.

The other big problem comes in the form of bit rot. We have a service (pypi-docs-proxy) that we have deployed (I think) that hasn't had any commits since 2017, and I can't even remember where or how it's deployed, what version of Python it's deployed against (is it still even a supported version?), what versions it's using, etc. We have another service (conveyor) that hasn't had a commit since 2021, which appears to be deployed using Cabotage (yay, something matching Warehouse).

With all of these lesser-deployed things, the projects themselves don't require a ton of changes, but things that aren't regularly built and maintained tend to bit rot, to the point where attempting to pick them up again in the future means first re-learning whatever the dev practices of that repository were, and almost certainly figuring out what has broken in the meantime and needs to be fixed, before you can even get into the position of adding new changes or updating dependencies or whatever.

5. Needlessly re-running tests

In development you're hopefully regularly running tests; however, doing so by default will run the entire test suite, which takes several minutes 18, and the bulk of that time is typically spent running tests where whatever you've changed couldn't possibly affect the outcome.

You can filter this down by passing arguments into py.test to select only a subset of tests, however that is a particularly manual process and it requires either intuitively knowing all of the tests that may be affected by your change OR accepting that you may be breaking random other tests, so at some point you'll need to run the full test suite again 19.
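
For example, the manual narrowing today looks something like this (the paths and selection expression are illustrative):

```sh
# Run only the tests you believe are relevant to your change...
python -m pytest tests/unit/packaging/ -k "metadata"

# ...but nothing verifies that belief, so at some point you still pay for a
# full run to make sure nothing else broke.
python -m pytest
```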

Some more modern tooling that is able to model the dependency graph of your code as a whole across the entire project can actually figure this out for you, and only run the subset of tests that depend on the things that you've changed, automatically skipping tests that don't depend on changed code and thus can't possibly be affected 20.

6. Host system and outside state "leaks" into the build artifacts

We've done a pretty good job of isolating Warehouse from the host system it's running on and from random state changes coming from the outside world.

However, we haven't completely eliminated them. For instance, we fetch docker images from Docker Hub by tag instead of by hash, which means that as the docker image gets updated on Docker Hub the output of the build system can change without any related change within the Warehouse repository.

Even within our own repository, our Dockerfile does things like blindly apt-get install packages, which affects the outcome of building the project, and within our CI we run completely independently of docker, directly on the GitHub Actions "host", pulling in any random state that might come from that.
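
A hedged sketch of what pinning those inputs could look like in a Dockerfile (the digest and package version below are placeholders, not real values):

```dockerfile
# Pin the base image by digest rather than by a mutable tag.
FROM python:3.10-slim@sha256:<digest>

# Pin apt packages to explicit versions so a new upstream release can't
# silently change the build output.
RUN apt-get update \
    && apt-get install -y --no-install-recommends libpq-dev=<pinned-version> \
    && rm -rf /var/lib/apt/lists/*
```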

That might all sound somewhat pedantic, but it actually is an important consideration, which generally falls under the idea of "hermetic builds". A hermetic build is a build that has little to no dependency on the host itself, and instead pulls in everything it needs, either by including it in the repository (e.g. what Google does) or by using some form of external reference that is pinned to an exact, unchanging version (e.g. what most people do) 21.

When you have a hermetic build, that enables some pretty compelling features like:

  1. When all the inputs to the build system are controlled by the build system and the output is semantically reproducible 22, your build system can cache anything: because it knows all of the inputs (since there are no implicit inputs via the host), if the inputs haven't changed then the output hasn't either, and it can just re-use things.
  2. When your build output doesn't depend on the host that you're running on (other than in ways controlled by the platform triplet, like windows x64, linux x32, etc), then you can share the cache between machines that have a matching triplet, which means repeat CI runs can cache large portions of their output, only building what has actually changed, and you can make those cached artifacts available in a read-only way so that random developers can download and reuse them instead of rebuilding them 23.
    • Another way to think about this is that your local device doesn't have to be the thing that does the building: you can have a high-powered build server in the cloud and a low-powered device locally, and behind the scenes it uses the build server to build instead of your local device.

It also means that you're insulated from random changes in the world affecting your build such that a broken package released to apt or whatever cannot possibly break you without a corresponding change to your repository.

7. Poor integration with Editors

Right now, attempting to open Warehouse in a modern editor that has integration with things like mypy produces a big blob of useless warnings that look something like this:

(Screenshot, 2022-08-30: editor showing a wall of unresolved-import warnings.)

All of those warnings roughly boil down to Import "suchandsuch" could not be resolved., which comes from the fact that we never actually install or expose the dependencies that we're working with on the host system in a way that an editor can introspect; everything is hidden inside of a docker container.

Unfortunately, there's not really a way around this other than installing everything into an environment on the host system as well and pointing your editor at that, but doing that in our current setup means that we again start to depend on things like "having the right version of Python installed" on the host system, even if it's only for a little development shim to make editors happy.

To try and accommodate that, you can see that we have a .python-version file at the root of the repository to let people sort of set it up themselves and hopefully let some editors decide what version of Python to use. However, that file itself depends on having something like pyenv or asdf to interpret it (outside of manually setting things up) which is yet more tooling that developers have to install.
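
Today the workaround is roughly this kind of manual, host-side setup (a sketch; it assumes pyenv shims are active and that the hashed requirements files install cleanly together on the host):

```sh
# Install the interpreter version named in .python-version (requires pyenv).
pyenv install --skip-existing "$(cat .python-version)"

# Create an environment purely for editor introspection.
python -m venv .venv
.venv/bin/pip install -r requirements/main.txt -r requirements/tests.txt

# Then point the editor's Python interpreter at .venv/bin/python.
```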

8. Multiple "interfaces" for developers

In theory this one doesn't actually exist: everything we expect developers to invoke is wrapped inside a make target, which means that make <foo> becomes our build infra/tooling interface; it's maybe not the best or most feature-filled interface, but it is a consistent one 24.

In reality though, a number of the problems with the way our tooling currently works lead people to want to work around invoking make <something> and instead call the underlying thing directly, something we can see even within the project itself in the fact that our CI doesn't invoke make <something>, but instead calls the underlying commands.

9. Lack of support for "Internal" Code

This partially goes to the admin interface question, but also just in general.

Right now there's not really a good way to have code in the Warehouse repository that isn't part of Warehouse, which means that if we want any sort of internal utility we either have to incorporate it as part of Warehouse or spin it off into its own repository, which oftentimes doesn't make much sense if it's closely tied to Warehouse, and comes along with all of the aforementioned problems with polyrepos.

This makes things like one-off commands a lot harder to handle: if we bake them into Warehouse itself they're forced to comply with things like our 100% test coverage requirement and possibly other things in the future (MyPy coverage?) unless we carefully carve out a section of Warehouse itself, with that automated testing providing little to no long term value (and possibly even negative value).

Instead, what often happens is that scripts like this get developed completely outside the code repo and just get manually copied around until they're eventually run by hand (typically on a jumpbox inside of our VPC). A great example of this kind of utility would be code that lets an admin purge a URL from our CDN, where writing a web interface for it is way more effort; a script like that lets us restore something close to the simplicity we had with curl -XPURGE ... before we added auth to purging.

This also extends to code that needs to be shared between multiple services but that we don't want to make generally available as a published library (though in that case we would want to keep the test coverage requirements), or even utilities used as part of the build tooling (see everything inside of bin/* for instance).

10. Lacking(ish) support for developing on Windows

We've made the decision that the things we depend on from the developer machine are effectively:

  • make
  • docker
  • docker-compose
  • Unix core utils (rm, mv, awk, etc).

This effectively locks development into something resembling a Unix-like environment, though not entirely, because all of those tools do have versions you can get for Windows, though it's unclear whether things would actually work using them. Windows users do have another option in WSL, which is essentially a way to run Linux inside of your Windows install.

I've verified that WSL2 does in fact generally work fine for developing Warehouse on Windows; however, it's kind of like having a Linux VM in that you're effectively in a Linux environment, so unless you're familiar with that and have it set up, the experience is likely going to be kind of miserable.

11. Probably More Things

Who knows! I'd love to hear if there are problems with developing / maintaining Warehouse that fall under build tooling / infrastructure that people feel like they've just been begrudgingly living with and that aren't covered by the above.

Ideas for Fixing

We probably can't fix absolutely everything, but hey, maybe we can make some things better! I'll have a follow-up post later with some ideas for how we could improve things here, but I wanted to get this up because it's already very long and I wanted to see if people had their own things to add.

Footnotes

  1. Originally we also assumed Python was installed on the host system, and we could install local tools like black into a virtual environment managed by our Makefile, but we have since moved those so that they run inside of a docker container as well.

  2. This isn't 100% true, some of them don't really get meaningful activity and sometimes issues get opened up on Warehouse and we just fix them in the correct repo without worrying about it, but in the general abstract case, this is true.

  3. This also isn't 100% true, we have some dependencies, particularly dev dependencies, where we only have a requirements/*.txt file with loosely constrained dependencies with no hashes.

  4. My memory of this is that it was largely done for a combination of speed (not running things inside of docker is faster than spinning up a docker container and running things inside of it) and docker-in-docker problems back when our CI was still running on Travis CI.

  5. Where "natively" is actually running black installed via pipx running inside of WSL2 on my Windows machines.

  6. For reasons that I assume are oversight, even simple commands like make reformat spin up PostgreSQL, Redis, and Elasticsearch.

  7. Of course, the ability to do this is only as good as the information that the build tool has, the more fine grained the dependency graph is, the smaller the "blast radius" of a particular change is.

  8. To make matters worse, oftentimes the output indicating that the web container crashed quickly scrolls off screen, as some of the containers are more robust against crashing from syntax errors like this, so often you don't really see it until you try to load a URL from the web app and get an error.

  9. In theory this means that the docker container needed to be rebuilt and the one you had running previously was no longer valid, but in practice cache invalidation for these containers has a lot of noise, and often is a false positive.

  10. Explicitly or implicitly through changing branches or something.

  11. I could be wrong, but I think that pip doesn't record what extras are supposed to be present, so I believe that pip check will ignore any constraints that are pulled in via an extra that we've selected.

  12. pip-compile lists all the files for the pinned version, even ones that we'll never use, because it doesn't know that we won't use them, which means that if the list of files changes because a new wheel was uploaded, or an old one was removed or something, a simple diff would fail.

  13. We're effectively only checking the names of the pins, not the versions they're pinned to.

  14. Invalid things should somewhat be caught by the pip check that we manually run... however, that's only going to check against the dependency metadata that those packages provide; it won't check against any constraints that we've introduced inside of our requirements/*.in files.

  15. There are other options besides Dependabot like Renovate that actually do understand pip-compile, but they still have the same fundamental problem that we have with pip-compile in that they'll do the resolution for each pair of .in -> .txt files independently.

  16. We make this worse on ourselves by requiring that branches be up to date with the HEAD of main before they can be merged, which we do because we auto-deploy from main and, without a merge queue, it's the only way to actually have tests run against what the merged state of that PR would be. This however means merging, say, 10 Dependabot PRs involves merging one, updating the next one with main, waiting for tests to pass, merging, then repeating, which can take an hour or longer of pure toil.

  17. This can oftentimes come across as pedantic or hostile, particularly when the same set of people are managing all of the involved projects, because it creates the appearance that you've just created make-work for that end user because they happened to approach you "wrong".

  18. The fact our test suite takes several minutes is its own problem, but this is about the build tooling and infrastructure!

  19. Never mind the fact that selecting tests like this breaks anything that looks at the exit code of make tests, because we require 100% test coverage, which can't exist when you're only running a subset of tests without any knowledge of which lines of code you expect those tests to cover.

  20. Unfortunately most of these tools don't handle dependency cycles at all, of which the Warehouse code base has a non-trivial amount, so to actually gain benefit here in a way that doesn't treat the entirety of Warehouse as one dependency would require refactoring the code itself, which, if we go with one of those tools, we should probably do, but that would likely be a longer term effort.

  21. When dealing with hermetic builds, you can pretty easily get into a rabbit hole where you eventually end up building your own C compiler from source, because different versions of a C compiler can produce different outcomes. In practice most people try to draw a line somewhere between "building a C compiler from source" and "YOLO fetch everything from HEAD on every build", based on how likely it is that different versions of that "thing" will actually meaningfully affect the outcome of your build.

  22. This concept of "reproducible builds" is different from the common one talked about in packaging, which is that the build process will produce byte-for-byte identical output. It's great if it does that, but what this actually cares about is that the semantics of the output are equivalent; e.g. if it produces two wheels that install identically but the wheel file itself has some form of randomness in it that doesn't affect anything, then that's fine.

  23. Whether this is a good or a bad thing depends on how fast downloading those cached objects are versus rebuilding the project from scratch, so this particular thing might not be super useful for developers, but it could be very useful for CI.

  24. Ideally this extends to multiple different languages too! Python is great, but one of its major features is that you can glue in C or Rust or whatever code, yet we currently can't do that without spinning up a whole new project OR adding another build system OR fundamentally changing how we build Warehouse. This isn't a hypothetical problem either: we currently need Python AND JavaScript to develop Warehouse, and we end up having two entirely separate build systems for Python and JavaScript and just paper over it with make and docker.

@dstufft added the needs discussion and developer experience labels Aug 30, 2022
@dstufft
Member Author

dstufft commented Aug 30, 2022

tldr

@dstufft
Member Author

dstufft commented Aug 30, 2022

So my thoughts on trying to improve the above things:

I'm going to roughly distill things down into three general root causes:

  1. Problems caused by having PyPI effectively span multiple repositories
  2. Problems caused by our build system lacking features that would manage these things.
  3. Problems caused by our own tooling lacking features or having problems.

1. Problems caused by having PyPI effectively span multiple repositories

The main problems caused by this from my list above are:

  • (4) Multiple Repositories Increase Overhead and Cause Bit-rot
  • (9) Lack of support for "Internal" Code

This basically comes down to trade-offs, and normally I'm on the side of multi/poly repos and think that the problems that come along with monorepos outweigh the benefits. However, I've been thinking about this a lot for PyPI, and I actually think that here we have a really strong case that things would be better if we switched to a monorepo.

Just to spitball things a bit, here's a rough sketch of an idea for that:

All of the services, libraries, and related repositories that go into producing the entirety of what we would consider "PyPI" that are under our control should all be folded into a singular repository. This would include services like Warehouse, Linehaul, Conveyor, Inspector, libraries like readme_renderer and trove-classifiers, and misc supporting things like the terraform in pypi-infra. It would not include anything that isn't roughly developed primarily for use in or with PyPI by the PyPI team, so things like packaging would not be expected to be included.

We would primarily use a code structure that looks like

.
├── external
├── infra
├── libs
│  ├── internal
│  │  └── hypothetical-feature-flags-lib
│  └── shared
│     ├── readme-renderer
│     └── trove-classifiers
└── services
   ├── conveyor
   ├── linehaul
   └── warehouse

Within this code structure:

  • The external/ directory houses any external code (or any references to external code such as requirements.txt files) that we've imported or referenced into the repository.
  • The infra/ directory contains the "base" terraform infrastructure files (but not any modules or service-specific stuff, other than what it imports from each service's own infra/ directory).
  • The libs/internal directory contains any internal libraries that exist for us to share among our services, but that would not end up being published to PyPI.
  • The libs/shared directory contains any libraries that we've developed for use in or with PyPI that we release to PyPI as well for others to use.
  • The services directory contains any services that we've got, the primary one being Warehouse itself of course.

The general rules would be:

  1. Nothing can import from a service (other than that specific service itself)
  2. Nothing can import from the top level infra directory.
  3. Everything can import from externals.
  4. Everything can import from a shared lib.
  5. Services and other internal libs can import from internal libs.
  6. Infra can import from services/*/infra.

Inside each services directory would essentially be a "project root", which would have things like (but not limited to):

  • An infra/ folder that contains any related terraform configuration or modules for this service (which the top-level infra/ imports)
  • A src/ directory containing the source code for that service.
  • A tests/ directory containing the tests for that service.
  • Any deployment files that Cabotage needs to deploy this service (Dockerfile and Procfile still?).

The basic idea being that the services/$NAME directory is roughly equivalent to the current service-specific repositories, other than things that we lift to the top level for uniformity (for example, I'd imagine black configuration to be top level).

The same sort of idea is true for the two libs directories, with external/ being the odd one out in being more free-form, since it will have a lot more one-off things in it, which will hopefully mostly be references to external things rather than fully imported external things.

We would then shut down all other repositories and all issues, pull requests, deployments, etc and funnel everything through this one repository.

This would have a lot of benefits for us:

  • We eliminate the overhead of tracking changes that span multiple repositories.
  • We eliminate 1 the ability for people to accidentally send issues to the wrong tracker.
  • Development of PyPI always uses the HEAD versions of any library that is developed as part of this monorepo so you never have to jump through hoops to test changes for our own libraries inside one of our services.
  • All of our related services use the same build system, and that's maintained in one place for everything, so we reduce the ability for things to bit rot and remove the need to remember how each individual thing is developed/deployed/etc.
  • By default there's only one version of anything referenced/included in external/ 2, which means that we lash all of our related services together: upgrading Python for one will upgrade Python for all of them in one atomic unit, which will help prevent services slowly accumulating tons of out-of-date packages, possibly with security issues 3.
  • We directly add support for "internal" code with the libs/internal directory.

We could also add some other directories if we can think of good uses for them, like a cmd/ directory that just exists to make little one off commands or something. Really any sort of categorization we find useful we can add here.

2. Problems caused by our build system lacking features that would manage these things.

This is where the bulk of the issues ultimately fall, I think, with things like:

  • (1) Speed
  • (2) Mounted Volumes Issues
  • (5) Needlessly re-running tests
  • (6) Host system and outside state "leaks" into the build artifacts
  • (7) Poor integration with Editors
  • (8) Multiple "interfaces" for developers
  • (10) Lacking(ish) support for developing on Windows

Ultimately these all boil down to the fact that our build system is currently a combination of docker, docker-compose, make, shell scripts, pip, npm, and gulp, so fixing them will require replacing some or all of those.

I don't have an exact idea of which system to pick, but I think ultimately the right thing to do here is switch to one of the more modern build tools like Bazel, Pants, Please, etc.

They all work somewhat differently and have their own pros/cons over each other and I don't want to get into trying to select a specific one right now, but in general these tools:

  • Support some level of Hermetic builds
  • Support fine grained dependency tracking to allow only (re)building or running tests for the things that actually need it.
  • Support packaging up arbitrary commands/tasks to provide a unified interface for doing all the things in a repository without having to wrap it inside of a docker container.
  • Support running things natively on all popular operating systems as well as invoking docker containers (or being run inside of a docker container), allowing the same interface / definitions to be used in all situations without having to deal with mounted volumes or spinning up docker containers to do everything inside of.
  • Are almost always designed with making speed and correctness a top priority.

These systems often don't, by default, provide any better editor integration than our current system does. However, since things can run outside of a docker container while still pulling in requirements through the build system rather than through the host, it's a lot more tractable to write editor integration that just installs everything to some well-known path that editors can be configured to use as the environment.

The main downsides to switching to something like this are:

  • Almost all of them involve significant investment in setting things up before you get any benefit from them (and in many cases, before it produces something functional at all) as most of the benefits tend to require fully converting a project into the system OR treating it as an external dependency.
  • docker-compose expects locally built images to come from a Dockerfile-based build, but these systems typically want to produce the docker image themselves without the use of a Dockerfile. You can work around this though:
    a. Make a stub Dockerfile that just invokes the new build tool to do the building.
    b. Use docker-compose only for things that you aren't building locally (e.g. third party services) and run the services from the repository natively using the build tool.
    c. Use the build tool to produce images, and have docker-compose use each image as if it were a third party image, which requires having the image built before starting docker-compose, but which you can write a wrapper around or provide some integration for to make it more pleasant to deal with.

3. Problems caused by our own tooling lacking features or having problems.

The main problem caused by this is:

  • (3) Dependency Management is Error Prone and Manual

For this I think the answer is that we just need to rethink how we manage our dependencies. What I would suggest is something like this:

We have multiple requirements/*.in files that specify our top-level constraints for various purposes, and this is the only valid way to specify dependencies (e.g. no more bare requirements/foo.txt). We then pass all of these requirements/*.in files into pip-compile as input in a single pass and produce a single requirements/constraints.txt.

This requirements/constraints.txt contains the entire set of possible dependencies along with pinned versions and hashes, but since we don't know which of them we actually want to install based on that file alone, we don't actually pip install from that file; instead we use it as a pip constraints file and use the original individual requirements/*.in files as our requirements files.
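
A sketch of the proposed workflow (file names are illustrative; note that a later comment in this thread points out a wrinkle with pip and hashed constraints files):

```sh
# Resolve every category in a single pass into one fully pinned set.
pip-compile --generate-hashes \
    --output-file=requirements/constraints.txt \
    requirements/*.in

# Install only the subsets you need, constrained to that single resolved set.
pip install -r requirements/main.in  -c requirements/constraints.txt
pip install -r requirements/tests.in -c requirements/constraints.txt
```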

We then disable Dependabot and create our own cron job (a GitHub Action?) that, on some schedule, recompiles the whole suite of requirements/*.in files into the singular requirements/constraints.txt and submits that as a single pull request.
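
A rough sketch of such a scheduled workflow, assuming a third-party action along the lines of peter-evans/create-pull-request for the PR step (action versions, schedule, and file names are illustrative):

```yaml
# .github/workflows/update-deps.yml (sketch)
name: Recompile dependency pins
on:
  schedule:
    - cron: "0 6 * * 1"   # weekly
  workflow_dispatch: {}

jobs:
  recompile:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version-file: .python-version
      - run: pip install pip-tools
      - run: pip-compile --generate-hashes -o requirements/constraints.txt requirements/*.in
      - uses: peter-evans/create-pull-request@v6
        with:
          title: "Recompile dependency pins"
          branch: deps/recompile
```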

We skip the dependency check unless requirements/* has changed, and remove the attempts to mitigate things by parsing the output and only partially diffing the data; we just do a basic diff.

This ends up solving a few of our biggest problems with dependency handling:

  • We no longer resolve our dependencies in multiple smaller batches so we know we actually get a correctly resolved final set of versions with one version per dependency.
  • We still support installing a subset of dependencies by passing the .in files as requirements files and using the constraints file.
  • Since the full set of dependencies is resolved and is being used as a constraints file, we can install different subsets in different pip install commands and still get the same, valid, outcome.
  • We stop trying to validate dependencies on every PR, because fundamentally that is going to be flaky with the way these tools work (without more investment in actual "upgrade" workflows rather than a "re-resolve the entire set" workflow), which removes the ability for these dependency check jobs to cause unrelated PRs to start failing.
  • We can make the dependency check jobs much stricter and better check what they're intended to check.
  • We get all of our dependency updates wrapped up in a single Pull Request on our own schedule.

It does have a few downsides that I can think of:

  • If any of the grouped dependency updates fails, it blocks all of the remaining dependency updates until we fix the failures OR blacklist that version (by editing requirements/*.in with != or <).
  • We can't do any sort of fine grained grouping, everything is just grouped together and anything else must be done manually.
  • We won't get security alerts, change logs, etc that Dependabot gives us, just pull requests that update our pins.

Footnotes

  1. Mostly? I think we'd still keep the pypi-support issue tracker.

  2. I'm sure we'd need the ability to have other versions sometimes, but those should be special cases.

  3. This "lashing together" can also be considered a con, since it means upgrading a dependency in Warehouse is blocked on also upgrading that dependency in Conveyor. Ultimately it's a trade-off: if it's all together, it's typically not super hard for a small set of services to upgrade it all together, and it acts as a forcing function to make sure that we actually do it.

@dstufft
Member Author

dstufft commented Aug 30, 2022

Of the three ideas, the third one is largely independent; we can implement it no matter what else we do, since it relates entirely to our own tooling and we'd be implementing it ourselves.

The first two ideas kind of go hand in hand.

You can of course do a monorepo without one of the more modern build tools, and you can use one of the more modern build tools without bothering with a monorepo, but doing so is going even further off the beaten path than doing them together, and a lot of the benefits of the two things interplay with each other in positive ways.

That being said, we obviously could do a monorepo without changing our build tool stack, using a similar directory layout, and it would work; it's just likely to make a lot of the problems that the build tool change tackles somewhat worse.

Likewise, we could switch to one of the mentioned build tools without the monorepo, but most of the upfront investment that you're required to do either way exists to make the dependency tracking work correctly, which has a lot less utility (though not zero!) when you're still doing polyrepos, and what utility you retain you can likely get from more off-the-shelf tools for Python/pytest/etc already.

@dstufft
Member Author

dstufft commented Aug 30, 2022

Also, none of the above really goes too deep into specifics, figures out what a migration plan would look like, or even exists as a proof of concept for people to try.

Mostly I'm trying to generate some discussion about these ideas and see how other people feel! If these sound like directions people would be interested in exploring, then I'm going to continue working on fleshing out more solid proposals and a proof of concept for all of this. If people see these ideas and just immediately get hives, then maybe we can figure out a different way of fixing these issues!

@pradyunsg
Contributor

FWIW, Dependabot does support pip-compile natively: https://github.com/dependabot/dependabot-core/blob/main/python/lib/dependabot/python/file_updater/pip_compile_file_updater.rb (https://github.blog/changelog/2021-01-19-dependabot-pip-compile-5-5-0-support/). However, it only supports going from a single .in to a single .txt (https://github.com/dependabot/dependabot-core/blob/70805187fb63ff1f012446bd32a3d22a6220cc43/python/lib/dependabot/python/file_updater/pip_compile_file_updater.rb#L458).

If I understand things correctly, having a single requirements/constraints.in that does -r ... for the other "scoped" files, and compiling it into a requirements/constraints.txt, would satisfy the workflow you're seeking for dependency management, while still being compatible with Dependabot.
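
Concretely, that shim file could look something like this (the scoped file names are illustrative):

```
# requirements/constraints.in (sketch): a single "root" input that pulls in
# every scoped file, so Dependabot's one-.in-to-one-.txt support still applies.
-r main.in
-r deploy.in
-r tests.in
-r lint.in
```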

@dstufft
Member Author

dstufft commented Sep 2, 2022

Hmm, guess they just don't document that at all then

@dstufft
Member Author

dstufft commented Sep 2, 2022

Although I did find out that the workflow I proposed doesn't actually work with pip, see https://discuss.python.org/t/constraints-files-with-hashes/18612/

@miketheman
Member

Whoa, this is quite the brain dump! I read through it once, but I'm certain I missed some stuff, so apologies if I overlook some specifics.

As someone who's self-onboarded last year to develop this codebase, I've got some learnings to share! Sorry this isn't nearly as well-structured as @dstufft 's.

"Getting started" was mostly correct for when I started out, and had a fair amount of tweaks and twiddles to get things working. Some examples: #10393 #10816

I found that working with a container-based stack was quite comfortable for me, and I even went as far as to explore how to make the entire stack work in a cloud-based development environment on Gitpod, to help alleviate some of the bootstrapping/building one would have to undertake before making a change.
I made some decent progress, but the true measure of this kind of effort comes when the GitHub org signs up for an (open source) Gitpod account and makes use of the Prebuild functionality, so you essentially get a unique workspace that has already built any base images for you, and your time-to-write-code is greatly reduced.

On the topic of Docker layer caching: the folks over at https://depot.dev/ have come up with something that works pretty fast, and they're also interested in supporting OSS, so that might be something to pursue as well if Gitpod prebuilding isn't the direction to follow. (I didn't spend a ton of time on GitHub Codespaces, as they felt too new at the time and underpowered compared to the 16 CPU, 64GB RAM that I get for free by default from Gitpod, far outweighing my own laptop's power.)

Providing a preset devcontainer.json file would likely also solve for 7. Poor integration with Editors, as the code referenced is no longer seen on the "host"; instead the editor is directed to reach into the container that has it. For those of us that like PyCharm Professional, there's a similar approach.
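
For reference, a minimal sketch of such a devcontainer.json pointing the editor at the existing compose stack (the service name and paths are illustrative):

```jsonc
// .devcontainer/devcontainer.json (sketch)
{
  "name": "warehouse",
  "dockerComposeFile": ["../docker-compose.yml"],
  "service": "web",
  "workspaceFolder": "/opt/warehouse/src",
  "customizations": {
    "vscode": {
      "extensions": ["ms-python.python"]
    }
  }
}
```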

Developing on Gitpod (a remote scenario), even while using a local editor, solves a lot of the developing-on-Windows concerns. There are also no more host-level artifacts that can leak in.

On the poly/monorepo side - I think this does have some negative tradeoffs that haven't been expressed fully, notably:

  • All maintainers would have to have access to the same repo, which reduces the potential for soft-onboarding, trust-building kinds of work. It also makes the issue tracker more general, and a bit more difficult to prune through to find the actual parts you want. Yes, we can label our way out, but is an issue tracker with 400+ issues attractive to newcomers? (I know it is to me, but I'm a bit weird that way...)
  • My understanding is that today Warehouse has a CI/CD approach that deploys on merge to main; with multiple libraries all merging to main, will this add more overhead and complexity to the CI/CD system?

I haven't given the hermetic builds a ton of thought yet, but I agree that using Docker image hashes is far more "pure". Adding version specifiers to Dockerfile apt commands is another good idea, and those will have to follow the upstream distro (buster?) versions available.

Anyhow, that's some of the brain items that came out of this so far.

@woodruffw
Member

The volume permissions issue with make reformat just bit me, with #15590: I'm on a Linux host for once, so running make reformat caused my editor to freak out and refuse to write to files 🙂

(I assume this doesn't affect macOS hosts for development because the permissions are all YOLO'd through a VM.)
