Search: return relative URLs #7376

Merged: 1 commit, merged on Aug 13, 2020
docs/server-side-search.rst: 6 changes (4 additions, 2 deletions)
@@ -93,7 +93,8 @@ This is ``https://docs.readthedocs.io/_/api/v2/search`` for the ``docs`` project
 :>json string project: The project slug
 :>json string version: The version slug
 :>json string title: The title of the page
-:>json string link: An absolute URL to the resulting page
+:>json string domain: Canonical domain of the resulting page
+:>json string path: Path to the resulting page
 :>json object highlights: An object containing a list of substrings with matching terms.
     Note that the text is HTML escaped with the matching terms inside a <span> tag.
 :>json object blocks:
@@ -140,7 +141,8 @@ This is ``https://docs.readthedocs.io/_/api/v2/search`` for the ``docs`` project
 "project": "docs",
 "version": "latest",
 "title": "Server Side Search",
-"link": "https://docs.readthedocs.io/en/latest/server-side-search.html",
+"domain": "https://docs.readthedocs.io",
+"path": "/en/latest/server-side-search.html",
 "highlights": {
     "title": [
         "<span>Server</span> <span>Side</span> <span>Search</span>"
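With this change, API consumers receive the URL in two pieces and join them themselves. A minimal client-side sketch under that assumption; the endpoint is the one documented above, while the exact query parameters and the use of `requests` are illustrative rather than part of this diff:

    import requests

    # Hypothetical query against the endpoint documented above.
    resp = requests.get(
        "https://docs.readthedocs.io/_/api/v2/search",
        params={"project": "docs", "version": "latest", "q": "server side search"},
    )

    for result in resp.json()["results"]:
        # "domain" and "path" replace the old absolute "link" field,
        # so the absolute URL is simply their concatenation.
        absolute_url = result["domain"] + result["path"]
        print(result["title"], absolute_url)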
readthedocs/core/static-src/core/js/doc-embed/search.js: 6 changes (3 additions, 3 deletions)
@@ -69,7 +69,7 @@ function attach_elastic_search_query_sphinx(data) {
     title = xss(result.highlights.title[0]);
 }

-var link = result.link + "?highlight=" + $.urlencode(query);
+var link = result.path + "?highlight=" + $.urlencode(query);

 var item = $('<a>', {'href': link});

@@ -291,7 +291,7 @@ function attach_elastic_search_query_mkdocs(data) {

 var item = $('<article>');
 item.append(
-    $('<h3>').append($('<a>', {'href': result.link, 'text': result.title}))
+    $('<h3>').append($('<a>', {'href': result.path, 'text': result.title}))
 );

 if (result.project !== project) {
@@ -303,7 +303,7 @@ function attach_elastic_search_query_mkdocs(data) {
 var section = blocks[j];

 if (section.type === 'section') {
-    var section_link = result.link + '#' + section.id;
+    var section_link = result.path + '#' + section.id;
     var section_title = section.title;
     var section_content = section.content;
     if (section_content.length > MAX_SUBSTRING_LIMIT) {
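The embedded search client now builds links from the relative `result.path` plus a `highlight` query string. For readers who do not follow the jQuery code, the same URL assembly can be sketched in Python; the values are illustrative and the standard-library encoder is not identical to the `$.urlencode` plugin used above:

    from urllib.parse import urlencode

    # Illustrative values mirroring what the embedded search receives.
    path = "/en/latest/server-side-search.html"
    query = "server side search"

    # The in-doc search now links to the relative path plus a highlight query,
    # instead of the previous absolute result.link.
    link = path + "?" + urlencode({"highlight": query})
    # -> "/en/latest/server-side-search.html?highlight=server+side+search"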
readthedocs/core/static/core/js/readthedocs-doc-embed.js: 2 changes (1 addition, 1 deletion)

Large diffs are not rendered by default.

readthedocs/search/serializers.py: 22 changes (18 additions, 4 deletions)
@@ -10,6 +10,7 @@
 import re
 from functools import namedtuple
 from operator import attrgetter
+from urllib.parse import urlparse

 from django.shortcuts import get_object_or_404
 from rest_framework import serializers
@@ -58,11 +59,26 @@ class PageSearchSerializer(serializers.Serializer):
     project = serializers.CharField()
     version = serializers.CharField()
     title = serializers.CharField()
-    link = serializers.SerializerMethodField()
+    path = serializers.SerializerMethodField()
+    domain = serializers.SerializerMethodField()
     highlights = PageHighlightSerializer(source='meta.highlight', default=dict)
     blocks = serializers.SerializerMethodField()

-    def get_link(self, obj):
+    def get_domain(self, obj):
Member: This definitely seems like it's adding a bunch of queries in both of these functions to do the full resolve and then ignore parts of it (e.g. we don't care about subprojects for the domain). Is there a reason not to just call the resolver's resolve_path and resolve_domain directly here?

Member (author): The result from _get_full_path is cached, and we already pass project_data into the context, so this won't generate any extra queries.

Member (author):

    def get_serializer_context(self):
        context = super().get_serializer_context()
        context['projects_data'] = self._get_all_projects_data()
        return context

Member (author): Calling resolve_path and resolve_domain here will generate extra queries.

Member: It seems like projects_data is only used in this one place, so I don't understand why we're caching it prior to calling this code. Seems like we could just remove all the pre-setting and only set it here when we actually use it?

Member (author): The serializer only knows about one object, not about all of them. But the caller of this class has the list of all objects that the serializer is going to use, so it can retrieve all the data in one query.

Member: Sure, but you're doing queries for every Domain for every subproject with this approach, instead of querying the doctype for every Version, which will lead to the same number of queries?

Member (author): Ok, I see what you mean; yeah, that can be optimized to query the domain only once.

Member: I don't think we need to worry too much about making this super efficient -- I'm actually saying we should make the code simpler rather than try to make it super fast. We don't do that many searches, so having simple code is probably better. I guess it might matter for projects with a lot of subprojects.

Member: I also think we need to think a bit more deeply about how to make this stuff faster at the resolver level, rather than trying to optimize specific areas. We've done this a few times, and really the solution should be "calling the resolver is always fast".

+        full_path = self._get_full_path(obj)
+        if full_path:
+            parsed = urlparse(full_path)
+            return f'{parsed.scheme}://{parsed.netloc}'
+        return None
+
+    def get_path(self, obj):
+        full_path = self._get_full_path(obj)
+        if full_path:
+            parsed = urlparse(full_path)
+            return parsed.path
+        return None
+
+    def _get_full_path(self, obj):
         """
         Get the page link.

@@ -71,8 +87,6 @@ def get_link(self, obj):
         If the result is fetched from the database,
         it's cached into ``project_data``.
         """
-        # TODO: return a relative URL when this is called from the indoc search.
-
         # First try to build the URL from the context.
         project_data = self.context.get('projects_data', {}).get(obj.project)
         if project_data:
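To make the split concrete, here is a small standalone sketch of what `get_domain` and `get_path` return for a typical absolute docs URL; the example value is illustrative, and only `urllib.parse.urlparse` comes from the diff above:

    from urllib.parse import urlparse

    # Illustrative absolute URL of the kind _get_full_path resolves.
    full_path = "https://docs.readthedocs.io/en/latest/server-side-search.html"

    parsed = urlparse(full_path)
    domain = f"{parsed.scheme}://{parsed.netloc}"  # "https://docs.readthedocs.io"
    path = parsed.path                             # "/en/latest/server-side-search.html"

    # The two new fields recombine into the old absolute link.
    assert domain + path == full_path

As the review thread notes, the absolute URL itself comes from `_get_full_path`, which reads project data the view preloads into the serializer context, so splitting it this way adds no extra queries.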
readthedocs/search/tests/test_api.py: 40 changes (23 additions, 17 deletions)
@@ -35,6 +35,11 @@ def setup_method(self, method):
         # installed
         self.url = reverse('search_api')

+    @pytest.fixture(autouse=True)
+    def setup_settings(self, settings):
+        settings.PUBLIC_DOMAIN = 'readthedocs.io'
+        settings.USE_SUBDOMAIN = True
+
     def get_search(self, api_client, search_params):
         return api_client.get(self.url, search_params)

@@ -260,7 +265,8 @@ def test_doc_search_subprojects(self, api_client, all_projects):
         assert first_result['project'] == subproject.slug
         # Check the link is the subproject document link
         document_link = subproject.get_docs_url(version_slug=version.slug)
-        assert document_link in first_result['link']
+        link = first_result['domain'] + first_result['path']
+        assert document_link in link

     def test_doc_search_unexisting_project(self, api_client):
         project = 'notfound'
@@ -360,7 +366,7 @@ def test_search_correct_link_for_normal_page_html_projects(self, api_client, doc

         result = resp.data['results'][0]
         assert result['project'] == project.slug
-        assert result['link'].endswith('en/latest/support.html')
+        assert result['path'] == '/en/latest/support.html'

     @pytest.mark.parametrize('doctype', [SPHINX, SPHINX_SINGLEHTML, MKDOCS_HTML])
     def test_search_correct_link_for_index_page_html_projects(self, api_client, doctype):
@@ -378,7 +384,7 @@ def test_search_correct_link_for_index_page_html_projects(self, api_client, doct

         result = resp.data['results'][0]
         assert result['project'] == project.slug
-        assert result['link'].endswith('en/latest/index.html')
+        assert result['path'] == '/en/latest/index.html'

     @pytest.mark.parametrize('doctype', [SPHINX, SPHINX_SINGLEHTML, MKDOCS_HTML])
     def test_search_correct_link_for_index_page_subdirectory_html_projects(self, api_client, doctype):
@@ -396,7 +402,7 @@ def test_search_correct_link_for_index_page_subdirectory_html_projects(self, api

         result = resp.data['results'][0]
         assert result['project'] == project.slug
-        assert result['link'].endswith('en/latest/guides/index.html')
+        assert result['path'] == '/en/latest/guides/index.html'

     @pytest.mark.parametrize('doctype', [SPHINX_HTMLDIR, MKDOCS])
     def test_search_correct_link_for_normal_page_htmldir_projects(self, api_client, doctype):
@@ -414,7 +420,7 @@ def test_search_correct_link_for_normal_page_htmldir_projects(self, api_client,

         result = resp.data['results'][0]
         assert result['project'] == project.slug
-        assert result['link'].endswith('en/latest/support.html')
+        assert result['path'] == '/en/latest/support.html'

     @pytest.mark.parametrize('doctype', [SPHINX_HTMLDIR, MKDOCS])
     def test_search_correct_link_for_index_page_htmldir_projects(self, api_client, doctype):
@@ -432,7 +438,7 @@ def test_search_correct_link_for_index_page_htmldir_projects(self, api_client, d

         result = resp.data['results'][0]
         assert result['project'] == project.slug
-        assert result['link'].endswith('en/latest/')
+        assert result['path'] == '/en/latest/'

     @pytest.mark.parametrize('doctype', [SPHINX_HTMLDIR, MKDOCS])
     def test_search_correct_link_for_index_page_subdirectory_htmldir_projects(self, api_client, doctype):
@@ -450,7 +456,7 @@ def test_search_correct_link_for_index_page_subdirectory_htmldir_projects(self,

         result = resp.data['results'][0]
         assert result['project'] == project.slug
-        assert result['link'].endswith('en/latest/guides/')
+        assert result['path'] == '/en/latest/guides/'

     def test_search_advanced_query_detection(self, api_client):
         project = Project.objects.get(slug='docs')
@@ -519,8 +525,8 @@ def test_search_custom_ranking(self, api_client):

         results = resp.data['results']
         assert len(results) == 2
-        assert results[0]['link'].endswith('/en/latest/index.html')
-        assert results[1]['link'].endswith('/en/latest/guides/index.html')
+        assert results[0]['path'] == '/en/latest/index.html'
+        assert results[1]['path'] == '/en/latest/guides/index.html'

         # Query with a higher rank over guides/index.html
         page_guides.rank = 5
@@ -537,8 +543,8 @@ def test_search_custom_ranking(self, api_client):

         results = resp.data['results']
         assert len(results) == 2
-        assert results[0]['link'].endswith('/en/latest/guides/index.html')
-        assert results[1]['link'].endswith('/en/latest/index.html')
+        assert results[0]['path'] == '/en/latest/guides/index.html'
+        assert results[1]['path'] == '/en/latest/index.html'

         # Query with a lower rank over index.html
         page_index.rank = -2
@@ -558,8 +564,8 @@ def test_search_custom_ranking(self, api_client):

         results = resp.data['results']
         assert len(results) == 2
-        assert results[0]['link'].endswith('/en/latest/guides/index.html')
-        assert results[1]['link'].endswith('/en/latest/index.html')
+        assert results[0]['path'] == '/en/latest/guides/index.html'
+        assert results[1]['path'] == '/en/latest/index.html'

         # Query with a lower rank over index.html
         page_index.rank = 3
@@ -579,8 +585,8 @@ def test_search_custom_ranking(self, api_client):

         results = resp.data['results']
         assert len(results) == 2
-        assert results[0]['link'].endswith('/en/latest/guides/index.html')
-        assert results[1]['link'].endswith('/en/latest/index.html')
+        assert results[0]['path'] == '/en/latest/guides/index.html'
+        assert results[1]['path'] == '/en/latest/index.html'

         # Query with a same rank over guides/index.html and index.html
         page_index.rank = -10
@@ -600,8 +606,8 @@ def test_search_custom_ranking(self, api_client):

         results = resp.data['results']
         assert len(results) == 2
-        assert results[0]['link'].endswith('/en/latest/index.html')
-        assert results[1]['link'].endswith('/en/latest/guides/index.html')
+        assert results[0]['path'] == '/en/latest/index.html'
+        assert results[1]['path'] == '/en/latest/guides/index.html'


 class TestDocumentSearch(BaseTestDocumentSearch):
readthedocs/search/tests/test_proxied_api.py: 4 changes (0 additions, 4 deletions)
@@ -10,10 +10,6 @@ class TestProxiedSearchAPI(BaseTestDocumentSearch):
     # This project slug needs to exist in the ``all_projects`` fixture.
     host = 'docs.readthedocs.io'

-    @pytest.fixture(autouse=True)
-    def setup_settings(self, settings):
-        settings.PUBLIC_DOMAIN = 'readthedocs.io'
-
     def get_search(self, api_client, search_params):
         # TODO: remove once the api is stable
         search_params['new-api'] = 'true'
readthedocs/search/tests/test_views.py: 3 changes (2 additions, 1 deletion)
@@ -281,7 +281,8 @@ def test_file_search_exact_match(self, client, project):
         # only one project
         assert len(results) == 1
         assert results[0]['project'] == 'kuma'
-        assert results[0]['link'] == 'http://readthedocs.org/docs/kuma/en/latest/documentation.html'
+        assert results[0]['domain'] == 'http://readthedocs.org'
+        assert results[0]['path'] == '/docs/kuma/en/latest/documentation.html'

         blocks = results[0]['blocks']
         assert len(blocks) == 1
readthedocs/templates/search/elastic_search.html: 6 changes (3 additions, 3 deletions)
@@ -183,14 +183,14 @@ <h3>
 {% endfor %}

 {% elif result.type == 'page' %}
-<a href="{{ result.link }}?highlight={{ query }}">
+<a href="{{ result.domain }}{{ result.path}}?highlight={{ query }}">
 {{ result.project }} - {% if result.highlights.title %} {{ result.highlights.title.0|safe }} {% else %} {{ result.title }} {% endif %}
 </a>

 {% for block in result.blocks %}
 {% if block.type == 'domain' %}
 <p>
-<a href="{{ result.link }}?highlight={{ query }}#{{ block.id }}">
+<a href="{{ result.domain }}{{ result.path }}?highlight={{ query }}#{{ block.id }}">
 {% if block.highlights.name %}
 {% with domain_name=block.highlights.name %}
 [{{ block.role }}]: {{ domain_name.0|safe }}
@@ -214,7 +214,7 @@

 {% elif block.type == 'section' %}
 <p>
-<a href="{{ result.link }}?highlight={{ query }}#{{ block.id }}">
+<a href="{{ result.domain }}{{ result.path }}?highlight={{ query }}#{{ block.id }}">
 {% if block.highlights.title %}
 {% with section_title=block.highlights.title %}
 {{ section_title.0|safe }}