Add section linking for the search result #5829

Member

@dojutsu-user dojutsu-user commented Jun 19, 2019

This is a WIP.

Related PR in readthedocs-sphinx-search -- readthedocs/readthedocs-sphinx-search#19

Member

@ericholscher ericholscher commented Jun 19, 2019

Yea, I think we likely need to have an approach where sections take the place of the existing content -- we don't want to index the same data twice. I believe ES has the ability to do this reasonably well.

Member Author

@dojutsu-user dojutsu-user commented Jun 19, 2019

@ericholscher

> Yea, I think we likely need to have an approach where sections take the place of the existing content -- we don't want to index the same data twice

That means we will remove the headers field completely and index each section as a separate document... right?

> I believe ES has the ability to do this reasonably well.

Which feature of ES are you talking about?

Member Author

@dojutsu-user dojutsu-user commented Jun 19, 2019

I would like to continue the work in this PR only, if that's okay?

Member

@ericholscher ericholscher commented Jun 19, 2019

> That means we will remove the headers field completely and index each section as a separate document... right?

Perhaps. We still need to know what the headers are, but could group them with the section. I have reached out to my friend at Elastic to ask his opinion on this. I will update here with the approach he suggests.

Member Author

@dojutsu-user dojutsu-user commented Jun 21, 2019

@ericholscher

> We still need to know what the headers are

I went ahead and tried to index each section as a separate document in ES. I think headers are mostly titles of the page... right?
So the documents had these fields: project, version, title, section_title, section_id, section_content, and the other required fields. It works well with the full page search UI. I haven't tested it with the search results page, but I think it will work.
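
A minimal sketch of that one-document-per-section shape, for illustration only. The output field names come from the comment above; the input `page` dictionary layout and the `section_documents` helper are assumptions, not the code pushed to this PR.

```python
def section_documents(page):
    """Flatten one parsed page into one search document per section.

    `page` is a hypothetical dict with `project`, `version`, `title`
    and a list of `sections`, each having `id`, `title` and `content`.
    """
    return [
        {
            "project": page["project"],
            "version": page["version"],
            "title": page["title"],
            "section_title": section["title"],
            "section_id": section["id"],
            "section_content": section["content"],
        }
        for section in page["sections"]
    ]


# Made-up example page, not real indexed data.
example_page = {
    "project": "docs",
    "version": "latest",
    "title": "Read the Docs Public API",
    "sections": [
        {"id": "a-basic-api-client", "title": "A basic API client", "content": "You can use Slumber"},
        {"id": "api-endpoints", "title": "API Endpoints", "content": "Feel free to use cURL"},
    ],
}

documents = section_documents(example_page)
```

Each resulting document repeats the page-level fields, which is the duplication trade-off raised earlier in the thread.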

Member

@ericholscher ericholscher commented Jun 21, 2019

@dojutsu-user Great. That is how we used to do it, and it worked ok. So that might be a path forward 👍

Member

@ericholscher ericholscher commented Jun 21, 2019

@dojutsu-user if you could push up the code here or somewhere else, I can try and take a look.

Member Author

@dojutsu-user dojutsu-user commented Jun 21, 2019

@ericholscher
Pushing the code... just 1 min.

Member Author

@dojutsu-user dojutsu-user commented Jun 21, 2019

@ericholscher
Pushed the code.
Many tests are going to fail for this.

Edit: It might not be very clean and robust because I was just testing things out.

Member

@ericholscher ericholscher commented Jun 21, 2019

Makes sense. I think this is a good way to test at least. If we can provide good results with this, we can definitely move to this approach in the short term. I want to think a bit more about how to combine this, along with the SphinxDomain objects that I want to return in search as well.

Member Author

@dojutsu-user dojutsu-user commented Jun 21, 2019

@ericholscher
I think results from Sphinx domains are not shown to the user right now, right?

> SphinxDomain objects that I want to return in search as well.

Where do we want to show these results?

  • Full page search ui
  • In the search results page
  • Both

Assuming that we want to show Sphinx domain results in both of these places, we need to make an API endpoint which returns results from both indexes, something like AllSearch but with only two indexes.

One other thing that we need to discuss is how we will show this to users:

  • Either we give the user a choice of which search results to see (facets)
    • In this case, for the full page search UI, we can have a dropdown or something similar next to the input field, and it will look good. But I can't think of how we would give the user this choice on the search results page.
  • Or we show results from both indexes.
    • We don't have to think much about the UI/UX in this case: as we will be showing results from both indexes, everything will remain the same apart from the extra results.

In both cases: I hadn't considered that we would be showing Sphinx domain results in the full page search UI, so the extension is not prepared for that yet. But it will be.

cc: @davidfischer

Member

@ericholscher ericholscher commented Jun 24, 2019

Yea, I'd love to return SphinxDomain results to users in the same API results. I believe the approach we've discussed with Nested queries can work for both sections and sphinx_domains on the Page document. We should test this and see how it works.
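
A rough sketch of what such a request body could look like, built as a plain dictionary in the Elasticsearch query DSL. It assumes `sections` and `domains` are mapped as nested fields on the Page document; `domains.name` and the helper names are hypothetical and only illustrate the idea.

```python
import json


def nested_clause(path, fields, query_text):
    """Build one nested clause; `inner_hits` asks ES to return the matching sub-objects."""
    return {
        "nested": {
            "path": path,
            "query": {"multi_match": {"query": query_text, "fields": fields}},
            "inner_hits": {},
        }
    }


def build_page_query(query_text):
    """Match on the page title, on nested sections, or on nested domains."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"multi_match": {"query": query_text, "fields": ["title"]}},
                    nested_clause("sections", ["sections.title", "sections.content"], query_text),
                    # `domains.name` is an assumed field name, not taken from the PR.
                    nested_clause("domains", ["domains.name"], query_text),
                ]
            }
        }
    }


body = build_page_query("sponsors")
print(json.dumps(body, indent=2))
```

With this shape each page is indexed once, and the matching sections and domain objects come back under `inner_hits` on the page hit.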

@@ -446,9 +446,6 @@ def USE_PROMOS(self): # noqa
'settings': {
'number_of_shards': 2,
'number_of_replicas': 0,
"index": {
"sort.field": ["project", "version"]
Member Author

@dojutsu-user dojutsu-user Jun 24, 2019

Member

@ericholscher ericholscher commented Jun 24, 2019

Did this work in testing?

Member Author

@dojutsu-user dojutsu-user commented Jun 25, 2019

@ericholscher
No, not in the way we want.
We need to discuss it a little more.

Member Author

@dojutsu-user dojutsu-user commented Jun 25, 2019

Turns out that I was wrong earlier and it is working nicely.
Here is the sample result for query sponsors (ignore the value of link)
https://pastebin.com/ZuZq4cfp
What are your thoughts on this approach?
@ericholscher @davidfischer

Member

@ericholscher ericholscher commented Jun 25, 2019

This looks great. I think we'll need to index some more data in order to generate the links. We need to know the id attribute of the H2 section so we can properly generate a link to it.

Member Author

@dojutsu-user dojutsu-user commented Jun 25, 2019

I didn't follow.
I said to ignore the value of link because I have rtd-test as a subproject of template, so the link is not simple.
For generating links, we can use full_path (once #5821 is closed) and append #section-id to it to get the link.
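
As a small illustration of that idea, here is a sketch of joining a page path with a section id. The `section_link` helper and the example path are assumptions; the real `full_path` value depends on how #5821 lands.

```python
def section_link(full_path, section_id):
    """Append a section anchor to a page path, replacing any existing fragment."""
    base = full_path.split("#", 1)[0]
    return f"{base}#{section_id}"


# Made-up path and section id, only to show the resulting shape.
link = section_link("/en/latest/sponsors.html", "sponsors")
print(link)
```

This still needs the section's id attribute to be indexed, which is the point raised in the previous comment.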

Member

@ericholscher ericholscher left a comment

Latest changes look good 👍

@@ -25,6 +25,8 @@

{% block content %}

{% trans "100" as MAX_SUBSTRING_LIMIT %}
Member

@ericholscher ericholscher Jul 11, 2019

I don't think we need trans here, I think we can use with: https://docs.djangoproject.com/en/2.2/ref/templates/builtins/#with

domains = inner_hits.domains or []
all_results = itertools.chain(sections, domains)

sorted_results = [
Member Author

@dojutsu-user dojutsu-user Jul 11, 2019

@stsewd
Here, if I use a generator expression, test_search_works_with_title_query and test_search_works_with_sections_query fail.
I can't find the reason, though. For now, I have changed it to a list comprehension.

Member

@stsewd stsewd Jul 11, 2019

So, I wasn't able to run the test because there is an import error. My guess is that by the time the generator gets evaluated, the inner_hits object has changed. You can confirm this by copying inner_hits.sections and inner_hits.domains before assigning them.

Also, I'd just leave the list comprehension, since we don't know when the generator gets evaluated by Django REST framework.
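
To illustrate the guess above (it is only a guess, not a confirmed diagnosis of the failing tests), here is a standalone sketch of how a lazily evaluated generator can see a different state than a list built up front. The sample data is made up.

```python
import itertools

# Made-up stand-ins for inner_hits.sections and inner_hits.domains.
sections = [{"title": "Install"}, {"title": "Usage"}]
domains = [{"name": "py:function"}]

# The generator does no work until something iterates over it.
lazy = (item for item in itertools.chain(sections, domains))
# The list comprehension reads everything immediately.
eager = [item for item in itertools.chain(sections, domains)]

# Simulate the source being mutated before the results are consumed.
sections.clear()

lazy_results = list(lazy)

print(len(eager))         # 3
print(len(lazy_results))  # 1
```

The list keeps the three items it captured at creation time, while the generator only sees what is left when it is finally consumed.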

Member Author

@dojutsu-user dojutsu-user Jul 12, 2019

I will add comments there to avoid any confusion in the future.

fields = ['title^10', 'headers^5', 'content']

_outer_fields = ['title^4']
_section_fields = ['sections.title^3', 'sections.content']
Member Author

@dojutsu-user dojutsu-user Jul 11, 2019

Added the boosters.
They are working fine.
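
For readers unfamiliar with the `field^boost` notation in those lists, here is a small sketch that splits each entry into a field name and its boost. Fields without a caret get Elasticsearch's default boost of 1. The `parse_boost` helper is hypothetical and not part of the PR.

```python
def parse_boost(field_spec):
    """Split 'name^boost' into (name, boost); a missing boost means 1.0."""
    name, caret, boost = field_spec.partition("^")
    return name, float(boost) if caret else 1.0


outer_fields = ["title^4"]
section_fields = ["sections.title^3", "sections.content"]

weights = dict(parse_boost(spec) for spec in outer_fields + section_fields)
print(weights)
```

Read this way, a match in the page title counts four times as much as a match in section content, and a match in a section title three times as much.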

assert res['project'] == 'docs'

# def test_doc_search_filter_by_version(self, api_client, project):
# """Test Doc search result are filtered according to version"""
Member Author

@dojutsu-user dojutsu-user Jul 11, 2019

Commented out the other tests.
I will update them.

@ericholscher ericholscher changed the base branch from master to gsoc-19-indoc-search Jul 12, 2019
Member

@ericholscher ericholscher left a comment

Looks like a good direction. I haven't given it a full review quite yet, since it looks like there was a good amount of refactoring?

@@ -23,3 +23,6 @@ def test_h2_parsing(self):
'You can use Slumber'
))
self.assertEqual(data['title'], 'Read the Docs Public API')

for section in data['sections']:
    self.assertFalse('\n' in section['content'])
Member

@ericholscher ericholscher Jul 12, 2019

This could probably use a comment. Likely it should also test for a length before doing this, otherwise this check could be running on 0 sections.
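
One way to address this, sketched as a standalone helper rather than the actual test method: assert that there is at least one section before looping. The helper name and the sample data are assumptions.

```python
def assert_sections_have_no_newlines(data):
    """Check parsed section content is single-line, and that sections exist at all."""
    sections = data["sections"]
    # Guard first: with zero sections the loop below would pass without checking anything.
    assert len(sections) > 0, "expected at least one parsed section"
    for section in sections:
        # Section content is expected to have its newlines stripped during parsing.
        assert "\n" not in section["content"]


# Made-up parsed data for illustration.
sample = {"sections": [{"content": "You can use Slumber"}]}
assert_sections_have_no_newlines(sample)
```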

# assert data[0]['project'] == subproject.slug
# # Check the link is the subproject document link
# document_link = subproject.get_docs_url(version_slug=version.slug)
# assert document_link in data[0]['link']
Member

@ericholscher ericholscher Jul 12, 2019

Why are these all commented out?

Member Author

@dojutsu-user dojutsu-user Jul 12, 2019

I haven't worked on them yet.

elif data_type.startswith('sections'):

    # generates query from section title
    if data_type.endswith('title'):
Member

@ericholscher ericholscher Jul 12, 2019

Why are we using endswith and startswith here, instead of just string checking?

Member Author

@dojutsu-user dojutsu-user Jul 12, 2019

Sounds more pythonic.
I will update the PR.
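
A sketch of what plain string checks could look like. The `data_type` values `sections.title` and `sections.content` are assumed from the field names used elsewhere in this PR, and `section_query_kind` is a hypothetical helper, not the updated code.

```python
def section_query_kind(data_type):
    """Map a data_type string to the kind of query to generate, using exact matches."""
    if data_type == "sections.title":
        return "section title"
    if data_type == "sections.content":
        return "section content"
    # Exact comparison rejects values that prefix/suffix checks would let through.
    raise ValueError(f"unknown data_type: {data_type!r}")


print(section_query_kind("sections.title"))
```

A value such as `sections_subtitle` passes both `startswith('sections')` and `endswith('title')`, but is rejected here.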

</a>
</li>
{% endfor %}
{% with "100" as MAX_SUBSTRING_LIMIT %}
Member

@ericholscher ericholscher Jul 12, 2019

Why did this file change so much?

Member Author

@dojutsu-user dojutsu-user Jul 12, 2019

Every line is indented one more level.

Member Author

@dojutsu-user dojutsu-user Jul 12, 2019

Also I have corrected the indentation of the whole file.

dojutsu-user added a commit to dojutsu-user/readthedocs.org that referenced this issue Jul 12, 2019
@dojutsu-user dojutsu-user added this to In progress in In-doc search UI via automation Jul 12, 2019
@ericholscher ericholscher merged commit d526249 into readthedocs:gsoc-19-indoc-search Jul 12, 2019
Member

@ericholscher ericholscher commented Jul 12, 2019

👍 Merged into the feature branch as a base

@dojutsu-user dojutsu-user deleted the search-section-linking branch Jul 12, 2019
@dojutsu-user dojutsu-user moved this from In progress to Done in In-doc search UI Jul 12, 2019