Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Build: Allow multiple PDFs #10438

Open
wants to merge 26 commits into
base: main
Choose a base branch
from
Open

Build: Allow multiple PDFs #10438

wants to merge 26 commits into from

Conversation

benjaoming
Copy link
Contributor

@benjaoming benjaoming commented Jun 14, 2023

Reviewing notes

This is intended to be a very limited change to our builders, whereby we can enable the Feature flag for a couple of projects and have PDFs build with the new method.

The filename for single-file projects is left intact, which should mean that we can switch on the feature for a couple of live-projects as well, and the generated PDFs should be served through the current URLs and APIs.

Would be great with some after-thoughts on using the ImportedFile model.

TODO

Fixes #10424
Fixes #2045

  • Add feature flag
  • Allow multiple PDF artifacts
  • Filter only PDFs in rclone sync
  • Store PDF filenames for later listing (if feasible)
  • Tests
    • Test new rclone filtering
    • Test Sphinx builder (that it moves files correctly etc)
    • Test celery task and trigger trigger_sync_downloadable_artifacts
    • Test all new queryset methods
  • Another round of Q&A

APIs will be added in a separate PR.

Other aspects covered

  • File extensions are restricted for pdf, htmlzip and epub.

@benjaoming benjaoming marked this pull request as ready for review June 26, 2023 15:43
@benjaoming benjaoming requested a review from a team as a code owner June 26, 2023 15:43
@benjaoming benjaoming requested a review from stsewd June 26, 2023 15:43
@benjaoming benjaoming changed the title Build: Allow multiple PDFs (WIP) Build: Allow multiple PDFs Jun 26, 2023
@benjaoming benjaoming requested a review from humitos June 26, 2023 15:43
@benjaoming
Copy link
Contributor Author

Will be following up with test case updates and some additional manual testing (both with and without the feature flag).

@agjohnson agjohnson linked an issue Jul 5, 2023 that may be closed by this pull request
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I need to do a deeper review and local QA still, tho.

In the meanwhile, I have some questions to understand what's the status of the PR and what is the final goal:

I think these questions are important since they will tell us how we are going to move forward with this PR. These decisions will make the future integration of the backend easier with the front end.

Also, if projects generating multiple PDF files are not told those cannot be easily accessed by the users, it looks broken to me and it's something I'd prefer to not expose to users yet.

Comment on lines +91 to +96
# Isolate temporary files in the _readthedocs/ folder
# Used for new feature ENABLE_MULTIPLE_PDFS
self.absolute_host_tmp_root = os.path.join(
self.project.checkout_path(self.version.slug),
"_readthedocs/tmp",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we need a temporary directory we should mktemp --directory as we are doing in other parts of this code. I prefer to keep _readthedocs/ clean since it's a directory where we output files we are going to serve.

@@ -132,13 +132,13 @@ def _check_suspicious_path(self, path):
def _rclone(self):
raise NotImplementedError

def rclone_sync_directory(self, source, destination):
def rclone_sync_directory(self, source, destination, **sync_kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we pass filter_extensions here directly instead of a dictionary without knowing exactly what it has inside?


This method mostly exists so we have a pattern that is test-friend (can be mocked).
"""
tex_files = glob(os.path.join(self.absolute_host_output_dir, f"*.{extension}"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line could be combined with _get_epub_files_generated() and make it generic.

"""
tex_files = glob(os.path.join(self.absolute_host_output_dir, f"*.{extension}"))
if not tex_files:
raise BuildUserError("No *.{extension} files were found.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build error string should be an attribute like BuildUserError.NO_ARTIFACT_FILES_FOUND or similar.


# Run LaTeX -> PDF conversions
success = self._build_latexmk(self.project_path)

self._post_build()
if self.project.has_feature(Feature.ENABLE_MULTIPLE_PDFS):
self._post_build_multiple()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
self._post_build_multiple()
self._post_build_multiple_pdf()

Comment on lines +649 to +657
# There is only 1 PDF file. We will call it project_slug.pdf
# This is the old behavior.
if len(pdf_files) == 1:
os.rename(
os.path.join(self.absolute_host_output_dir, pdf_files[0]),
os.path.join(
self.absolute_host_output_dir, f"{self.project.slug}.pdf"
),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand why we would like to keep "the old behavior here". I think we will always want "the new behavior" for projects using this feature. However, I'm not sure to understand what would be the filename of the resulting PDF in the new behavior --but it should be the name the user has defined in the docs/conf.py (see my other comment about -jobname).

Comment on lines +722 to +730
# We cannot use '*' in commands sent to the host machine, the asterisk gets escaped.
# So we opt for iterating from outside the container
pdf_file_names = []
for fname in pdf_files:
shutil.move(
fname,
os.path.join(self.absolute_host_tmp_root, os.path.basename(fname)),
)
pdf_file_names.append(os.path.basename(fname))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could probably use self.run(..., escape=False) or similar if we want to avoid this.

Comment on lines +50 to +57
# Map media types to their know extensions.
# This is used for validating and uploading artifacts.
MEDIA_TYPES_EXTENSIONS = {
MEDIA_TYPE_PDF: ("pdf",),
MEDIA_TYPE_EPUB: ("epub",),
MEDIA_TYPE_HTMLZIP: ("zip",),
MEDIA_TYPE_JSON: ("json",),
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using a tuple in the value here because we want to have multiple extensions per each media type?

Comment on lines +60 to +65
@app.task(queue="web")
def sync_downloadable_artifacts(
version_pk, commit, build, artifacts_found_for_download
):
"""
Create ImportedFile objects for downloadable files.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this task is part of the search tasks. I understand it's not related with search itself, but more with the build process, in my opinion. So, I would probably move it to projects.tasks.utils or similar.

Comment on lines +88 to +96
ImportedFile.objects.create(
name=name,
project=version.project,
version=version,
path=fpath,
commit=commit,
build=build,
ignore=True,
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only "creating new objects here", but don't we need to delete the old ImportedFile for this particular version. Otherwise, wouldn't be exposing "old filenames" to the user?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build and API: Backend changes to handle multiple PDFs Multiple pdf files for single project
2 participants