Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index more domain data into elasticsearch #5979

Merged
merged 36 commits into from Aug 27, 2019

Conversation

@dojutsu-user
Copy link
Member

@dojutsu-user dojutsu-user commented Jul 23, 2019

WIP

Related PR in readthedocs-sphinx-search -- readthedocs/readthedocs-sphinx-search#42

Copy link
Member

@ericholscher ericholscher left a comment

This looks like a good approach. I'd like to see some tests and a bit more detail in what all we're parsing for.

Loading

readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
Loading
readthedocs/search/parse_json.py Show resolved Hide resolved
Loading
readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
Loading
@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Jul 24, 2019

@ericholscher
I have updated the PR.
Also, this PR also index sphinx domain signature content (Inside the <dd> tag).
I am not sure if it is needed or not.

I found some domains which don't have proper html strucutre.
Example: https://2.python-requests.org/en/master/api/#requests.cookies.RequestsCookieJar.popitem
In this.
Content inside <dt> tag: popitem() → (k, v), remove and return some (key, value) pair
Content inside <dd> tag: as a 2-tuple; but raise KeyError if D is empty.

So there's no proper domain signature and no proper domain doc strings.
What should we do in this case?

Loading

@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Jul 24, 2019

@ericholscher

I'd like to see some tests

I will add the tests, once we are sure of the working.

Loading

@dojutsu-user dojutsu-user requested a review from ericholscher Jul 24, 2019
Copy link
Member

@ericholscher ericholscher left a comment

This looks good. I haven't run it yet, so we might hit some bugs mapping from real world HTML, but the structure looks great.

Loading

readthedocs/search/parse_json.py Show resolved Hide resolved
Loading
readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
Loading
* remove  and  from docsearch API and start using .
* remove unnecessary fields from the PageDocument to index only relevant data into elasticsearch.
* Add  to search fields in faceted_Search.py.
* Don't get  from parse_json -- it is not required.
* Update  to show domains docstrings in results in  page.
@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Jul 29, 2019

@ericholscher
Turns out that not every sphinx domain is inside <dl> tag.
For example: http://docs.celeryproject.org/en/latest/userguide/configuration.html#beat-scheduler
Therefore, the docstrings for these sphinx domains are not getting indexed. However these docstrings will be there in sections.content.

Loading

@dojutsu-user dojutsu-user requested a review from ericholscher Jul 29, 2019
@ericholscher
Copy link
Member

@ericholscher ericholscher commented Jul 29, 2019

Loading

@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Jul 29, 2019

@ericholscher
They are getting indexed as sphinx domains.

Screenshot from 2019-07-30 02-26-41

Screenshot from 2019-07-30 02-29-32

I believe, this sphinx domain is coming from this --
Screenshot from 2019-07-30 13-48-10

Loading

Copy link
Member

@ericholscher ericholscher left a comment

I have some questions on this. I'm guessing the removed data is because we aren't displaying it in the search results anymore? I think we still likely want to display the type and name, no?

Loading

Loading
readthedocs/search/documents.py Outdated Show resolved Hide resolved
Loading
'anchor': fields.KeywordField(),

# For showing in the search result
'type_display': fields.TextField(),
'doc_display': fields.TextField(),
Copy link
Member

@ericholscher ericholscher Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this?

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need doc_display -- because the main title of each result contains the doc_display.
But I have undoed the change for type_display -- we can make use of this.

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😕
I can't seem to find a good way to display every information.
type_display looks unncessary when we are displaying role_name.

Loading

Copy link
Member

@ericholscher ericholscher Aug 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, sure. They are showing the same information. I believe the type_display is meant for human readable output, and role_name is not, no?

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both are quite readable:
py:function
std:confval
py:class

Loading

Copy link
Member

@ericholscher ericholscher Aug 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think configuration value over confval is much better for non-technical users. We should definitely try to use the display value, that's the whole reason it exists if for this use.

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ericholscher
I realised (while testing) that some sphinx domains don't have values for type_display.
Should we still show for those who have? or just keep showing role_name?

Loading


{% if inner_hit.source.display_name|length >= 1 %}
({{ inner_hit.source.role_name }}) {{ inner_hit.source.display_name}}
{% with "100" as MAX_SUBSTRING_LIMIT %}
Copy link
Member

@ericholscher ericholscher Jul 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm feeling the pain of maintaining multiple sets of result HTML. We should consider trying to get the site search using the search as you type code, hopefully with minimal changes from the theme code. That way we don't have 3 places to change HTML.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... maintaining multiple sets of html is definitely a pain.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

I think that's quite a big refactor and we can target that after this PR?

Loading

@ericholscher
Copy link
Member

@ericholscher ericholscher commented Aug 19, 2019

This looks like a great approach. Let's see if we can get it deployed this week after we fix up tests. 👍

Loading

@dojutsu-user dojutsu-user requested a review from ericholscher Aug 21, 2019
@dojutsu-user dojutsu-user mentioned this pull request Aug 26, 2019
Copy link
Member

@ericholscher ericholscher left a comment

Going to go ahead and merge this now that we have 3.7.3 out. Next deploy will ship this and require a reindex 👍

Loading

@ericholscher ericholscher merged commit 00ab116 into readthedocs:master Aug 27, 2019
2 checks passed
Loading
@dojutsu-user dojutsu-user deleted the index-more-domain-data branch Aug 28, 2019
@dojutsu-user dojutsu-user moved this from In progress to Done in In-doc search UI Aug 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants