Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index more domain data into elasticsearch #5979

Merged
merged 36 commits into from Aug 27, 2019

Conversation

@dojutsu-user
Copy link
Member

commented Jul 23, 2019

@dojutsu-user dojutsu-user added this to In progress in In-doc search UI via automation Jul 23, 2019

@ericholscher
Copy link
Member

left a comment

This looks like a good approach. I'd like to see some tests and a bit more detail in what all we're parsing for.

readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
readthedocs/search/parse_json.py Show resolved Hide resolved
readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
@dojutsu-user

This comment has been minimized.

Copy link
Member Author

commented Jul 24, 2019

@ericholscher
I have updated the PR.
Also, this PR also index sphinx domain signature content (Inside the <dd> tag).
I am not sure if it is needed or not.

I found some domains which don't have proper html strucutre.
Example: https://2.python-requests.org/en/master/api/#requests.cookies.RequestsCookieJar.popitem
In this.
Content inside <dt> tag: popitem() → (k, v), remove and return some (key, value) pair
Content inside <dd> tag: as a 2-tuple; but raise KeyError if D is empty.

So there's no proper domain signature and no proper domain doc strings.
What should we do in this case?

@dojutsu-user

This comment has been minimized.

Copy link
Member Author

commented Jul 24, 2019

@ericholscher

I'd like to see some tests

I will add the tests, once we are sure of the working.

@dojutsu-user dojutsu-user requested a review from ericholscher Jul 24, 2019

@ericholscher
Copy link
Member

left a comment

This looks good. I haven't run it yet, so we might hit some bugs mapping from real world HTML, but the structure looks great.

readthedocs/search/parse_json.py Show resolved Hide resolved
readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
Show domains docstrings
* remove  and  from docsearch API and start using .
* remove unnecessary fields from the PageDocument to index only relevant data into elasticsearch.
* Add  to search fields in faceted_Search.py.
* Don't get  from parse_json -- it is not required.
* Update  to show domains docstrings in results in  page.
@dojutsu-user

This comment has been minimized.

Copy link
Member Author

commented Jul 29, 2019

@ericholscher
Turns out that not every sphinx domain is inside <dl> tag.
For example: http://docs.celeryproject.org/en/latest/userguide/configuration.html#beat-scheduler
Therefore, the docstrings for these sphinx domains are not getting indexed. However these docstrings will be there in sections.content.

@dojutsu-user dojutsu-user requested a review from ericholscher Jul 29, 2019

@ericholscher

This comment has been minimized.

@dojutsu-user

This comment has been minimized.

Copy link
Member Author

commented Jul 29, 2019

@ericholscher
They are getting indexed as sphinx domains.

Screenshot from 2019-07-30 02-26-41

Screenshot from 2019-07-30 02-29-32

I believe, this sphinx domain is coming from this --
Screenshot from 2019-07-30 13-48-10

@ericholscher
Copy link
Member

left a comment

I have some questions on this. I'm guessing the removed data is because we aren't displaying it in the search results anymore? I think we still likely want to display the type and name, no?

readthedocs/search/documents.py Outdated Show resolved Hide resolved
'anchor': fields.KeywordField(),

# For showing in the search result
'type_display': fields.TextField(),
'doc_display': fields.TextField(),

This comment has been minimized.

Copy link
@ericholscher

ericholscher Jul 31, 2019

Member

Why are we removing this?

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Aug 5, 2019

Author Member

We don't need doc_display -- because the main title of each result contains the doc_display.
But I have undoed the change for type_display -- we can make use of this.

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Aug 5, 2019

Author Member

😕
I can't seem to find a good way to display every information.
type_display looks unncessary when we are displaying role_name.

This comment has been minimized.

Copy link
@ericholscher

ericholscher Aug 19, 2019

Member

Well, sure. They are showing the same information. I believe the type_display is meant for human readable output, and role_name is not, no?

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Aug 19, 2019

Author Member

Both are quite readable:
py:function
std:confval
py:class

This comment has been minimized.

Copy link
@ericholscher

ericholscher Aug 19, 2019

Member

I think configuration value over confval is much better for non-technical users. We should definitely try to use the display value, that's the whole reason it exists if for this use.

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Aug 21, 2019

Author Member

@ericholscher
I realised (while testing) that some sphinx domains don't have values for type_display.
Should we still show for those who have? or just keep showing role_name?


{% if inner_hit.source.display_name|length >= 1 %}
({{ inner_hit.source.role_name }}) {{ inner_hit.source.display_name}}
{% with "100" as MAX_SUBSTRING_LIMIT %}

This comment has been minimized.

Copy link
@ericholscher

ericholscher Jul 31, 2019

Member

I'm feeling the pain of maintaining multiple sets of result HTML. We should consider trying to get the site search using the search as you type code, hopefully with minimal changes from the theme code. That way we don't have 3 places to change HTML.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Aug 5, 2019

Author Member

Yes... maintaining multiple sets of html is definitely a pain.

The goal will be to ship search as you type across all of RTD so we only have to maintain one set of results, but baby steps :)

I think that's quite a big refactor and we can target that after this PR?

@ericholscher

This comment has been minimized.

Copy link
Member

commented Aug 19, 2019

This looks like a great approach. Let's see if we can get it deployed this week after we fix up tests. 👍

@dojutsu-user dojutsu-user requested a review from ericholscher Aug 21, 2019

@dojutsu-user dojutsu-user referenced this pull request Aug 26, 2019
@ericholscher
Copy link
Member

left a comment

Going to go ahead and merge this now that we have 3.7.3 out. Next deploy will ship this and require a reindex 👍

@ericholscher ericholscher merged commit 00ab116 into readthedocs:master Aug 27, 2019

2 checks passed

continuous-documentation/read-the-docs Read the Docs build succeeded!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@dojutsu-user dojutsu-user deleted the dojutsu-user:index-more-domain-data branch Aug 28, 2019

@dojutsu-user dojutsu-user moved this from In progress to Done in In-doc search UI Aug 28, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.