New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce queries to storage to serve 404 pages #6845
Conversation
BTW, I saw that mkdocs projects can be |
Code changes look good. I think this can still be a little more refactored to serve */index.html
directly from the ServeBase.get
view instead of passing through the 404 handler, since we already know that the version is HTMLDir.
I haven't took a look at the tests, yet.
readthedocs/proxito/views/serve.py
Outdated
doc_type = ( | ||
Version.objects.filter(project=final_project, slug=version_slug) | ||
.values_list('documentation_type', flat=True) | ||
.first() | ||
) | ||
is_htmldir_type = doc_type in (SPHINX_HTMLDIR, MKDOCS) | ||
if is_htmldir_type: | ||
for tryfile in ('index.html', 'README.html'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we don't do this in the ServeBase.get
instead?
I mean, if we already know that this project is_htmldir_type=True
, there is no reason to fail the first request. We can just serve the */index.html
in that case.
Then, we can handle only the README.html
as an special case in the 404 handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't do that, we serve files other than html. Like foo/bar.img
, we'll try to serve foo/bar.img/index.html
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, aren't we already doing this in the ServeBase.get
at
readthedocs.org/readthedocs/proxito/views/serve.py
Lines 127 to 132 in 0e3df50
# Handle our backend storage not supporting directory indexes, | |
# so we need to append index.html when appropriate. | |
if path[-1] == '/': | |
# We need to add the index.html before ``storage.url`` since the | |
# Signature and Expire time is calculated per file. | |
path += 'index.html' |
We are adding index.html
if the URL ends with /
, so it seems there is no need to try the same again in the 404 handler, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are only adding the index file if path ends with /
, the 404 tries to serve the index from paths not ending with /
readthedocs/proxito/views/serve.py
Outdated
.first() | ||
) | ||
is_htmldir_type = doc_type in (SPHINX_HTMLDIR, MKDOCS) | ||
if is_htmldir_type: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this check is correct. We have users who want to serve README.html as their index, regardless of their doctype, I don't think we can skip that check.
I'd like to remove the README check all together though, but I don't think this is the way to do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I don't think we can skip the index.html
check here either. It doesn't depend on htmldir either:
For example we want https://docs.readthedocs.io/en/latest
to redirect to https://docs.readthedocs.io/en/latest/
, and it doesn't matter what doctype it is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we did that here
readthedocs.org/readthedocs/proxito/views/serve.py
Lines 125 to 126 in 33049cc
# If ``filename`` is empty, serve from ``/`` | |
path = f'{storage_path}/{filename}' |
But looks like we send them to the 404 handler
readthedocs.org/readthedocs/proxito/urls.py
Lines 122 to 133 in 33049cc
# Hack /en/latest so it redirects properly | |
# We don't want to serve the docs here, | |
# because it's at a different level of serving so relative links break. | |
url( | |
( | |
r'^(?:projects/(?P<subproject_slug>{project_slug})/)?' | |
r'(?P<lang_slug>{lang_slug})/' | |
r'(?P<version_slug>{version_slug})$'.format(**pattern_opts) | |
), | |
fast_404, | |
name='docs_detail_directory_indexing', | |
), |
Is there a reason why we send them to the 404 view and not just redirect them with /
? (same for subprojects, translations, single page, etc)
But still I can see cases where users want to serve an index without caring about the doctype. That could be, users using a html
doctype and having a directory with an index...
I guess we could at least assume 404.html
for html
types and 404/index.html
for htmldir
types when serving a custom 404 page? that will give us one check less
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to do a query to cloud storage to know if the file exists before we redirect to /
, and we don't want to do any queries in the normal proxito handler to make it really fast. The point of the 404 handler is to be slower and handle the queries.
I guess we could at least assume 404.html for html types and 404/index.html for htmldir types when serving a custom 404 page? that will give us one check less
Yea, that seems easy enough. Though I could see users hacking in a top-level 404.htmleven with a
dirhtmlbuilder. We should probably keep stats, but I assume we almost never serve
404/index.html`, and could consider removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://www.mkdocs.org/user-guide/deploying-your-docs/#404-pages
Looks like mkdocs always outputs a 404.html
file, not a 404/inde.html
file. And github pages expects a 404.html
https://help.github.com/en/github/working-with-github-pages/creating-a-custom-404-page-for-your-github-pages-site
Sphinx with htmldir
is the only one that outputs a 404/index.html
file.
Ok, I just reverted this PR to only check for This reduces 2 calls for all builders tha aren't |
This should still save a storage query on most requests, though it adds 1-2 DB queries, but still worth doing I think
This makes use of
version.documentation_type
to stop guessing if for404/index.html
vs404.html
and guessing when to test fordir/index.html
ordir/README.html
Now we have 2 queries to the db, but I guess that's better than having 2 calls to storage :)