Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce queries to storage to serve 404 pages #6845

Merged
merged 6 commits into from Apr 8, 2020

Conversation

stsewd
Copy link
Member

@stsewd stsewd commented Apr 1, 2020

This makes use of version.documentation_type to stop guessing if for 404/index.html vs 404.html and guessing when to test for dir/index.html or dir/README.html

  • Worst case for html projects: 2 calls to storage (before it was 6)
  • Worst case for dirhtml projects: 4 calls to storage (before it was 6)
  • Best case for html projects: 1 call to storage (before it was 3)
  • Best case for dirhtml projects: 3 calls to storage (before it was 6)

Now we have 2 queries to the db, but I guess that's better than having 2 calls to storage :)

@stsewd
Copy link
Member Author

stsewd commented Apr 1, 2020

BTW, I saw that mkdocs projects can be html projects too https://www.mkdocs.org/user-guide/configuration/#use_directory_urls. I'm going to add that type in another PR, no changes on this PR or the config file are required for that.

@stsewd stsewd requested a review from Apr 1, 2020
Copy link
Member

@humitos humitos left a comment

Code changes look good. I think this can still be a little more refactored to serve */index.html directly from the ServeBase.get view instead of passing through the 404 handler, since we already know that the version is HTMLDir.

I haven't took a look at the tests, yet.

readthedocs/proxito/views/serve.py Outdated Show resolved Hide resolved
readthedocs/proxito/views/serve.py Outdated Show resolved Hide resolved
readthedocs/proxito/views/serve.py Outdated Show resolved Hide resolved
readthedocs/proxito/views/serve.py Outdated Show resolved Hide resolved
doc_type = (
Version.objects.filter(project=final_project, slug=version_slug)
.values_list('documentation_type', flat=True)
.first()
)
is_htmldir_type = doc_type in (SPHINX_HTMLDIR, MKDOCS)
if is_htmldir_type:
for tryfile in ('index.html', 'README.html'):
Copy link
Member

@humitos humitos Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we don't do this in the ServeBase.get instead?

I mean, if we already know that this project is_htmldir_type=True, there is no reason to fail the first request. We can just serve the */index.html in that case.

Then, we can handle only the README.html as an special case in the 404 handler.

Copy link
Member Author

@stsewd stsewd Apr 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't do that, we serve files other than html. Like foo/bar.img, we'll try to serve foo/bar.img/index.html.

Copy link
Member

@humitos humitos Apr 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, aren't we already doing this in the ServeBase.get at

# Handle our backend storage not supporting directory indexes,
# so we need to append index.html when appropriate.
if path[-1] == '/':
# We need to add the index.html before ``storage.url`` since the
# Signature and Expire time is calculated per file.
path += 'index.html'
?

We are adding index.html if the URL ends with /, so it seems there is no need to try the same again in the 404 handler, right?

Copy link
Member Author

@stsewd stsewd Apr 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are only adding the index file if path ends with /, the 404 tries to serve the index from paths not ending with /

.first()
)
is_htmldir_type = doc_type in (SPHINX_HTMLDIR, MKDOCS)
if is_htmldir_type:
Copy link
Member

@ericholscher ericholscher Apr 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this check is correct. We have users who want to serve README.html as their index, regardless of their doctype, I don't think we can skip that check.

I'd like to remove the README check all together though, but I don't think this is the way to do it.

Copy link
Member

@ericholscher ericholscher Apr 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I don't think we can skip the index.html check here either. It doesn't depend on htmldir either:

For example we want https://docs.readthedocs.io/en/latest to redirect to https://docs.readthedocs.io/en/latest/, and it doesn't matter what doctype it is.

Copy link
Member Author

@stsewd stsewd Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we did that here

# If ``filename`` is empty, serve from ``/``
path = f'{storage_path}/{filename}'

But looks like we send them to the 404 handler

# Hack /en/latest so it redirects properly
# We don't want to serve the docs here,
# because it's at a different level of serving so relative links break.
url(
(
r'^(?:projects/(?P<subproject_slug>{project_slug})/)?'
r'(?P<lang_slug>{lang_slug})/'
r'(?P<version_slug>{version_slug})$'.format(**pattern_opts)
),
fast_404,
name='docs_detail_directory_indexing',
),

Is there a reason why we send them to the 404 view and not just redirect them with /? (same for subprojects, translations, single page, etc)

But still I can see cases where users want to serve an index without caring about the doctype. That could be, users using a html doctype and having a directory with an index...

I guess we could at least assume 404.html for html types and 404/index.html for htmldir types when serving a custom 404 page? that will give us one check less

Copy link
Member

@ericholscher ericholscher Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to do a query to cloud storage to know if the file exists before we redirect to /, and we don't want to do any queries in the normal proxito handler to make it really fast. The point of the 404 handler is to be slower and handle the queries.

I guess we could at least assume 404.html for html types and 404/index.html for htmldir types when serving a custom 404 page? that will give us one check less

Yea, that seems easy enough. Though I could see users hacking in a top-level 404.htmleven with adirhtmlbuilder. We should probably keep stats, but I assume we almost never serve404/index.html`, and could consider removing it.

Copy link
Member Author

@stsewd stsewd Apr 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.mkdocs.org/user-guide/deploying-your-docs/#404-pages

Looks like mkdocs always outputs a 404.html file, not a 404/inde.html file. And github pages expects a 404.html https://help.github.com/en/github/working-with-github-pages/creating-a-custom-404-page-for-your-github-pages-site

Sphinx with htmldir is the only one that outputs a 404/index.html file.

@stsewd
Copy link
Member Author

stsewd commented Apr 7, 2020

Ok, I just reverted this PR to only check for 404.html using the doctype information, this is only check for 404/inde.html on sphinx htmldir projects, the rest of the builders (even mkdocs) always output a 404.html file.

This reduces 2 calls for all builders tha aren't sphinx htmldir.

@stsewd stsewd requested a review from Apr 7, 2020
Copy link
Member

@ericholscher ericholscher left a comment

This should still save a storage query on most requests, though it adds 1-2 DB queries, but still worth doing I think 👍

@ericholscher ericholscher merged commit ad72dc6 into master Apr 8, 2020
2 checks passed
@ericholscher ericholscher deleted the reduce-storage-queries branch Apr 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants