refactoring rtd-build #6
I basically rewrote the complete rtd-build CLI.
Everything that is implemented is well tested. Tests can be run with
To run a quick sample of the builder, do:
```sh
cd integration_tests/minimal_project/
rtd-build
```
This will output the rendered docs in
Currently we support the following config options in readthedocs.yml:
```yaml
name: docs      # required, arbitrary
type: sphinx    # required, only sphinx is supported at the moment
base: docs-en/  # the base directory of the source files
```
The idea with multiple configs is that you can choose which one to run -- we wouldn't support outputting multiple sets of docs for each run. You could have a setting on RTD.org that points at the specific YAML file you want to build. The main thing to support is repos that have multiple sets of docs in them, not one set of docs splintered into multiple readthedocs.yml files.
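For illustration, a repository carrying two sets of docs might declare something like this (the exact layout and names here are hypothetical):

```yaml
# Hypothetical example: one readthedocs.yml describing two sets of docs.
# A setting on RTD.org (or a CLI flag) would pick exactly one per run.
- name: user-guide
  type: sphinx
  base: docs-en/

- name: api-reference
  type: sphinx
  base: docs-api/
```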
I'm curious how you're planning to integrate this into RTD? My understanding is that we were going to have readthedocs-build work downstream of the VCS & virtualenv code, and not handle that set of code. With the current implementation, we're going to have to totally replace the RTD.org builder code to support this, or else we'll have 2 concurrently running sets of code to do all the VCS/Venv/etc code.
Ok, sorry if I missed the general task. I wrote up the specs in #5 in order to clear things up, and got positive feedback, which is why I thought I was on the right track with this :)
The spec says at the top:
and that's what I pursued with the refactor. Let's talk about how we should proceed there.
I'll try to describe the plan for integrating with readthedocs.org with the steps that should be performed from zero to hosted docs:
All rtd-build steps can happen in an isolated Docker environment in order to encapsulate the commands and prevent any malicious code from breaking out. The parts of the build system that are left in readthedocs.org do not execute any files or config from the user repository.
That all sounds great. I think we should delete the VCS code from this repo then. Looking more closely at this PR, I see it was mostly touched in rename operations, but seeing it in the changeset made me wary.
One large concern that I have is making sure that we can capture output for every command properly, and display them to users. Historically, our build output has been quite awful, and the work Anthony did with rtfd/readthedocs.org#1563 fixes a lot of that. I want to make sure that we don't lose that visibility with this work.
The other one is that Docker doesn't have a great way to handle long-running commands and IPC between the parent and child commands. We've solved this by wrapping every command in a specific docker environment, instead of a long-running set of commands. So we need to make sure that for RTD use, we can get access to the specific commands via library calls, and can run those in a specific environment.
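To make that concrete, here is a minimal sketch of the "one container per command" pattern, using the docker-py client; the image name and command are placeholders:

```python
import docker

client = docker.from_env()

def run_build_step(command):
    """Run a single build command in a fresh, short-lived container.

    Each step gets its own container, so its output and exit status
    are captured per command instead of once for one long-running
    process.
    """
    # "readthedocs/build" is a placeholder image name.
    return client.containers.run(
        "readthedocs/build",
        command,
        remove=True,  # throw the container away afterwards
    )

print(run_build_step("rtd-build").decode())
```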
Along the lines of not repeating past mistakes, I think that the "system site packages" option is a mistake in our design and should be removed. We should also consider whether there are other options and approaches worth pursuing in this work. One of the other big ones that I'd love to solve is the build process depending on layout.html, which breaks a common Sphinx paradigm: rtfd/readthedocs.org#152 -- It's a tricky issue in general, but if we can find another way to force content into the pages, we should. I think doing this in an extension that the builder forces, where we iterate over every doctree and add nodes directly into the Docutils doctree, is likely the "purest" solution.
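A rough sketch of that extension idea; the doctree-resolved event is real Sphinx API, while the injected node is a made-up example:

```python
from docutils import nodes

def inject_rtd_nodes(app, doctree, docname):
    """Append RTD content to every page's doctree at resolve time."""
    container = nodes.paragraph()
    # The injected markup here is purely illustrative.
    container += nodes.raw("", '<div class="rtd-injected"></div>', format="html")
    doctree += container

def setup(app):
    # The builder would force-load this extension for every project,
    # instead of depending on the theme's layout.html.
    app.connect("doctree-resolved", inject_rtd_nodes)
    return {"version": "0.1"}
```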
Similar things with the doc_builder subsystem. If we aren't using that in this refactor, it should be removed as well. These are the main things I copied over from RTD, and they have changed significantly since, so I was worried about them living in parallel.
Looking at the scope of this work, it seems quite reasonable -- but I want to make sure we are progressing properly and without a bunch of old baggage. We have also talked a lot about whether we want to shell out to
Yes, the changed UI looks very nice and is so much more helpful. The plan for nice command output would have been that the rtd-build tool prints the command it runs under the hood and its output, prefixed with the command name, like:
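Purely illustrative:

```
$ rtd-build
[virtualenv] New python executable in .venv/bin/python
[pip] Successfully installed sphinx
[sphinx-build] Running Sphinx
[sphinx-build] build succeeded.
```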
But Anthony's UI is of course much nicer. We could make the rtd-build output
This could then be used on the rtfd.org build page to show the nice UI. The
docker & IPC
What is the difference between one long-running build command and a few nearly
I'm not sure yet how we could make the individual subcommands exposable to a
injecting custom html in the built
I would suggest a solution that is completely unrelated to Sphinx. Iterate
This step could either live in rtd-build or in readthedocs.org itself as an
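A minimal sketch of that idea; the paths and the injected snippet are made up:

```python
import os

def inject_html(build_dir, snippet):
    """Insert `snippet` before </body> in every built HTML file.

    This only touches the rendered output, so it works the same for
    every theme and is completely unrelated to Sphinx internals.
    """
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8") as fh:
                html = fh.read()
            with open(path, "w", encoding="utf-8") as fh:
                fh.write(html.replace("</body>", snippet + "</body>"))

inject_html("_build/html", '<script src="/static/rtd.js"></script>')
```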
legacy vcs code
Regarding running sphinx by importing it: I'm a clear -1 on using Sphinx as a library. We should run it via its command line interface. The main reason is that every user is using Sphinx via the CLI. If we use it as a library, we introduce a lot of potential to use it differently than our users do, which results in compatibility issues. For example, we might miss updates to how Sphinx calls the internal APIs when executing the CLI client.
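Concretely, that means invoking Sphinx exactly the way users do (the directories here are placeholders):

```python
import subprocess

# Shell out to the same CLI entry point every Sphinx user runs.
subprocess.check_call([
    "sphinx-build",
    "-b", "html",   # HTML builder
    "docs/",        # source directory (placeholder)
    "_build/html",  # output directory (placeholder)
])
```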
The main issue with capturing command output that way is that it precludes the ability to do any kind of real-time output in the future. If we run each command separately, we can stream the output from that specific command, whereas returning it as JSON means that we effectively buffer all output until the end of the run.
docker & IPC
Same general issue here. It isn't a technical issue of having a long-running process, it's that we want to minimize the logic that is running inside of docker. With rtd-build abstracting a lot of that, it makes it less of an issue.
True. It feels like a dirty solution, but I guess it is the only way to make sure it works everywhere -- however there are still subsets of information that require the Jinja template context, and must be rendered within the actual build process. Things like the page name, specifically. So we can't totally work around this, though I guess we could derive it from the outputted filenames.
Here are some of my thoughts on all of this. I feel like we are turning a lightweight module into a kitchen sink module here, so it might be worth stepping back first to see what we're building.
In its two forms, this module could be either:
This module is more strictly to ease the efforts of building documentation, both locally and in our environment. It doesn't need to know about how to build in docker build environments, because that is an RTD implementation detail -- though knowing how to build in docker containers is not a drawback necessarily. This tool also doesn't need to know how to report build commands or interact with RTD at all. It simply knows how to build documentation through sphinx/etc, and eases this process. It is the most lightweight wrapper we can make that pushes interaction with sphinx into a tool that provides end users with more reproducible builds.
It would execute sphinx/etc as a module, as this is an important next step in cleaning up how the sphinx build is implemented. It would shorten build times, as @ericholscher mentioned, which we've been running into more often with
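For reference, invoking Sphinx as a module looks roughly like this (all directories are placeholders):

```python
from sphinx.application import Sphinx

# Build in-process, avoiding the interpreter start-up cost of shelling
# out to sphinx-build for every invocation.
app = Sphinx(
    srcdir="docs",                 # source directory
    confdir="docs",                # directory containing conf.py
    outdir="_build/html",
    doctreedir="_build/doctrees",
    buildername="html",
)
app.build()
```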
RTD would execute this as a cli command as part of the build process, treating output and error handling as it normally would.
Virtualenv support could be built in, or could just as easily be provided by manual tooling. All we would be trying to enforce by building this into the tool is fresh virtualenvs -- a use case that a tool like Tox solves far better.
This work seems to be leaning towards moving all build functionality into a separate module. This could provide a purer virtualenv for building locally, and theoretically reduce support burden around virtualenvs and python package issues. It would require duplicating and refactoring a lot more code, however. Additionally, there will be some forking of builder code and a duplication of effort until we can unify the two systems again.
There would be benefits to moving building code out of RTD, but I'm not sure they outweigh the work that would have to go in at this point, or if this implementation would be any more useful to users than the implementation above.
So, it's worth discussing the following points where RTD and this module would interact:
My thoughts on implementation details:
Anyways, just my thoughts, happy to have input on all of this.
using sphinx python api
I'm sold; the performance argument was not obvious to me. +1 on this.
two scenarios described by @agjohnson
I want to summarize the "Module A" and "Module B" sections in my own words to see if I understand it correctly:
Module A, rtd-build is solely a wrapper around sphinx, injecting RTD-specific sphinx behaviour. It only uses the parts of readthedocs.yml which directly influence how sphinx is called internally. All other config in readthedocs.yml (like pip requirements) would be handled on rtfd.org.
Module B, rtd-build is a replication of the pipeline we currently have on rtfd.org, including VCS logic, virtualenv handling, running commands inside docker.
My understanding of rtd-build is somewhere in between those scenarios. It should be a function to turn the project sourcefiles into built documentation artifacts. In other words:
It therefore abstracts away what happens during this, but nothing more. It's the only tool that knows about the
So in order to detect the logic that needs to go into
In order to fulfill these things, we require virtualenv handling inside
So my vision is also a very lightweight wrapper around Sphinx + virtualenv handling. The only reason virtualenv is in there is that we need to support autodoc. Once autoapi, which doesn't use dynamic imports, takes over, we can completely drop the virtualenv handling and are left with what @agjohnson suggested in Module A.
So if I'm getting this right, with IPC you mean that the command output is transferred from inside docker to a monitoring system outside. I don't understand yet what the drawback of a longer-running docker command is. You say that you lose all output of the failing command. That would mean that you only see the output of succeeding commands, and the benefit of splitting it up into multiple commands is that you will know which command has failed, but not how it failed (as you don't have the output).
Regardless of my (probably weird and incorrect) understanding of this, I have a suggestion for monitoring the command output. @agjohnson said:
If we do something like this, it definitely should live outside of rtd-build. It could be implemented as a wrapper command, so that the actual command that is run inside docker is:
Then rtd-track-output will record stdout/stderr of
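A minimal sketch of what such a wrapper could look like; the name rtd-track-output and the JSON record format are only proposals:

```python
import json
import subprocess
import sys

def track(argv):
    """Run `argv`, capture its output, and emit one JSON record.

    Sketch of the proposed rtd-track-output wrapper.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    json.dump({
        "command": argv,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }, sys.stdout)
    return proc.returncode

if __name__ == "__main__":
    sys.exit(track(sys.argv[1:]))
```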
Sorry if I'm a bit novice on the docker side. I always thought
Reading your comments about IPC, I wonder if I'm understanding correctly how docker would be used here.
I assume the following things for use with rtd-build:
Is this correct?
Regarding the Docker usage workflow, your assumption is about right. This used to execute a long-running command and report back via JSON over STDOUT, but now it runs a noop command that dies after X seconds, and commands are executed in background processes using Docker's
Relying on a long running command definitely means the container needs socket communication to constantly ship updates. When I say we don't get failure, I mean a failure at the container level -- if the command in the container is OOM killed for instance -- will result in no output. So we can't get around using sockets. The current execution scheme makes this easy because Docker's api abstracts that and gives metadata on the inner command failures without having to prepare the container for it.
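With a recent docker-py, that scheme looks roughly like this (image and commands are placeholders):

```python
import docker

client = docker.from_env()

# A lightweight keep-alive container; each build step is then executed
# in it as a background process via the Docker API.
container = client.containers.run(
    "readthedocs/build",  # placeholder image
    "sleep 600",          # noop command that dies after X seconds
    detach=True,
)
try:
    # exec_run hands back the inner command's exit code and output
    # without any socket plumbing inside the container itself.
    exit_code, output = container.exec_run("rtd-build")
    print(exit_code, output.decode())
finally:
    container.remove(force=True)
```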
We can use STDIN/STDOUT, but this is a lazy socket, and one that is fragile. We would have to handle streaming and (unbuffered?) serialized socket data, not to mention killing all spurious output to ensure nothing will inject into STDOUT and ruin our output. This introduces complexity to debugging and setup. We could add socket communication to this, but now we are building host and client socket communication on top of this, not to mention orchestrating docker to use unix sockets or opening up tcp ports.
I am -1 on long running commands inside containers not because it doesn't work, but because it bloats the docker container and the container's responsibilities for seemingly no benefit over background executed commands. All this work above only gives us an approximation of feature parity with what we get with background processes and the Docker API.
In terms of what this tool is, I think what you describe for this tool is mostly in line with what I picture. I need to think more on this one. If we make this into a thin wrapper around sphinx, with no knowledge or care of docker/its environment, there are some places where the integration gets cumbersome. I still think this is the direction we need to go, but I need to come back to this and map out those few points.
I don't think virtualenvs need to be part of the tool, still. We can certainly still manage dependencies and even simplify dependency installation, but I feel like virtualenv management is bloat to this tool that users might not be expecting. Also, though in our ideal world no one is executing code to inspect docstrings, I don't think we'll ever get 100% away from executing code or having virtualenvs, so this will always likely live somewhere.
Anyways, I'll come back to this shortly.
It has been a while since we talked, but as far as I can see from the last posts, we have three main points to discuss:

1. whether we include virtualenv logic inside rtd-build
2. how we are monitoring the build logs
3. whether the build can be split into multiple smaller commands
1. virtualenv logic
I would like to keep the virtualenv logic inside rtd-build. Otherwise it's going to be really difficult to reproduce builds on the user's computer that fail on readthedocs.org. I also talked to @ericholscher about this, and he seems to be in line with my view that it's not a big burden to manage a virtualenv with rtd-build. The code will be quite minimal, and I don't see a risk that it will become a maintainability problem, as the domain of the code is very narrow.
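To give a sense of scale, here is a sketch of roughly all the virtualenv code rtd-build would need (paths are made up, Unix layout assumed):

```python
import os
import subprocess

def build_venv(path, requirements=None):
    """Create a fresh virtualenv and install the docs' dependencies."""
    subprocess.check_call(["virtualenv", path])
    pip = os.path.join(path, "bin", "pip")  # Unix layout assumed
    subprocess.check_call([pip, "install", "sphinx"])
    if requirements:
        subprocess.check_call([pip, "install", "-r", requirements])

build_venv(".rtd-venv", requirements="docs/requirements.txt")
```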
2. monitoring output
As you describe it, it seems to be necessary to communicate with a process outside of docker for the stdout monitoring. As I expressed in a previous comment, I prefer having a command that can wrap the rtd-build calls like:
In that case I'm ok with putting the monitoring logic into rtd-build itself.
3. multiple small commands
@ericholscher explained to me in person again why it's preferable to use multiple small commands to make the build instead of using one command. He mentioned mainly that it's otherwise not practical to limit resources on the individual steps of the build. That makes sense to me and I want to support having multiple commands.
Therefore I propose that running the
should be equivalent to
That way the user can produce a build locally in the simplest way possible with
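To illustrate the shape of that equivalence (the subcommand names here are purely hypothetical):

```sh
# One-shot build, what an author runs locally:
rtd-build

# ...equivalent to the individual steps, which readthedocs.org would
# run one by one to limit resources per step (names hypothetical):
rtd-build setup-venv
rtd-build install-requirements
rtd-build html
```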
So my main worry here is that we're building too much into this tool for the sake of refactoring. We're focusing more on what this tool means for our implementation than on what it means for authors. This tool was supposed to be a tool for authoring, but I don't feel like we're giving authors a reason to use this.
So what problems do our users have?
There's probably some more obvious questions there.
Either way, my thoughts on this are that if it isn't of use to authors, then we shouldn't be complicating this tool with it. I feel the issues around dependencies and reproducible builds can't be solved by virtualenv and dependency management. Without the need for virtualenv management, there is no need to make this tool execute commands via subprocess/docker. I think communicating back to RTD is also superfluous, and building any sort of IPC between docker would be solidly out of scope.
This greatly reduces the scope of the tool, and hopefully allows us to focus more on the author experience, rather than refactoring.
In my mind, this tool:
I think exposing the standard
This would feel very natural for authors perhaps?
Also, it's worth mentioning that there are many better ways for authors to test in a pristine environment -- tox is one, CI tooling is another. We're not going to depend on these tools, but maybe our answer for Python developers that need to test their docs in a pristine environment is to use one of these methods.
Eventually, there are some users that we'll want to guide towards using CI tooling anyways, which aligns with our idea of eventually being able to ship built documentation with this tool.
@agjohnson ok, this all sounds reasonable now.
The virtualenv handling is then something that is configured in the YAML file but is not handled by the rtd-build tool. That will require us to parse the YAML file in the build system on readthedocs.org, but we can of course use the rtd-build source as a library for this. No big deal here, we just need to keep it in mind.
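A sketch of the readthedocs.org side, assuming plain PyYAML until rtd-build exposes a proper loader; the config key is hypothetical:

```python
import yaml

# readthedocs.org reads the same readthedocs.yml that rtd-build
# consumes, e.g. to pull out the settings it handles itself
# (assuming the single-mapping layout shown earlier in the thread).
with open("readthedocs.yml") as fh:
    config = yaml.safe_load(fh)

# "requirements" is a hypothetical key for the pip requirements file.
requirements = config.get("requirements")
```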