
Refactor search code #5197

Merged
merged 52 commits into from Feb 6, 2019

Conversation

@ericholscher (Member) commented Jan 29, 2019

This PR has gotten pretty large, so I think the easiest thing to do is actually to just check it out and read the code in readthedocs/search. I'd really like more eyes from @rtfd/core on this, since we're all now responsible for maintaining this code :)

This does a number of things:

  • Removes the simple_search endpoint, so that we only have one entry point for search
  • Re-adds the search signals that we removed in the refactor; these are required for the .com
  • A few small UI/UX cleanups to make search results nicer
  • Some optimizations that reduce the size of the ES results that we get back from the server
  • Moves all ES updates/deletes to Celery, and totally removes the default django-elasticsearch-dsl signals
    • This allows us to remove the custom logic we needed to remove invalid HTMLFiles, and removes the entire RTDDocType class

Closes #5167 #5168

@ericholscher ericholscher requested a review from safwanrahman Jan 29, 2019
@@ -12,8 +12,8 @@
@pytest.mark.search
class TestDocumentSearch(object):

def __init__(self):
# This reverse needs to be inside the ``__init__`` method because from
def setUp(self):
@ericholscher (Member, Author) Jan 29, 2019

search/tests/test_api.py:13
  /Users/eric/projects/readthedocs.org/readthedocs/search/tests/test_api.py:13: PytestWarning: cannot collect test class 'TestDocumentSearch' because it has a __init__ constructor

@safwanrahman (Member) commented Jan 30, 2019

This needs a careful review. I will review it tonight.
@ericholscher Can you explain more about removing simple_search? I think we can override the search method to pass the signal.

@ericholscher (Member, Author) commented Jan 30, 2019

@ericholscher Can you explain more about removing simple_search? I think we can override the search method to pass the signal.

We had two different entry points for search, which meant we had to repeat logic in a bunch of places. Why do we need simple_search instead of just always using faceted search?

@safwanrahman (Member) commented Jan 30, 2019

We had two different entry points for search, which meant we had to repeat logic in a bunch of places.

It's actually necessary. One is the search in the API, which does not need aggregated data. But the project search and file search do need the aggregated data and the counts.

Why do we need simple_search instead of just always using faceted search?

We need simple search so we don't overwhelm the API search endpoint. Aggregated queries on a large dataset are considerably slower than non-aggregated queries, so we should not run an aggregated query when it's not needed.
If we use faceted search all the time, it will just make the query slower and may affect our Elasticsearch cluster. If we can keep the query simple and fast, we can implement search-as-you-type, suggestions, and other things without breaking the user experience.
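To illustrate the trade-off being discussed, here is a sketch (the field names are illustrative, not RTD's actual mapping) of the request bodies a plain search and a faceted search would send to Elasticsearch. The faceted variant carries an extra `aggs` section that ES must compute across the whole matching set:

```python
# Sketch of the two request bodies, as Python dicts. "multi_match" and
# "aggs"/"terms" are real Elasticsearch query-DSL constructs; the field
# names here are made up for illustration.
simple_query = {
    "query": {
        "multi_match": {"query": "install", "fields": ["title", "content"]},
    },
}

# The faceted variant is the same query plus aggregations, which is the
# extra work being discussed for large datasets.
faceted_query = {
    **simple_query,
    "aggs": {
        "project": {"terms": {"field": "project"}},
        "version": {"terms": {"field": "version"}},
    },
}
```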

@ericholscher (Member, Author) commented Jan 30, 2019

We need simple search so we don't overwhelm the API search endpoint. Aggregated queries on a large dataset are considerably slower than non-aggregated queries, so we should not run an aggregated query when it's not needed.

I think we can ship this for now, and simplify if we have issues. It seems like having two totally different code paths for search is more complexity than value to me. This makes things much simpler, and allows us to keep all the logic in one place.

@ericholscher ericholscher changed the title from "Reactor search code" to "Refactor search code" Jan 31, 2019
kwargs = {}
kwargs['projects_list'] = [p.slug for p in self.get_all_projects()]
kwargs['versions_list'] = self.request.query_params.get('version')
user = ''
@humitos (Member) Feb 5, 2019

I think it's a better pattern to default to AnonymousUser instead. Then, anywhere this is used, all the methods are still available and return the proper values.

I think you can pass self.request.user directly without checking anything.

@agjohnson (Contributor) Feb 5, 2019

I agree here, unless there is some significance to ES handling the user as an empty string.

@ericholscher (Member, Author) Feb 5, 2019

Nope, was just the old way we were doing it. Fixed it now.
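The pattern suggested above is essentially a null object: Django's AnonymousUser exposes the same attributes as a real user (e.g. is_authenticated, which is False for it), so call sites never special-case an empty string. A minimal stand-in to show the idea (the classes and function below are illustrative, not RTD's code):

```python
class FakeAnonymousUser:
    # Mimics django.contrib.auth.models.AnonymousUser for this sketch:
    # the real class also reports is_authenticated as False.
    is_authenticated = False

class FakeUser:
    is_authenticated = True

def visible_versions(user, public, private):
    # Because both user types expose is_authenticated, this works for
    # everyone -- no `if user == ''` guard needed at any call site.
    return public + private if user.is_authenticated else public
```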

@agjohnson (Contributor) left a comment

I haven't quite grokked the original changes before this refactor, so be warned that I'm not super effective reviewing this. I noted a couple of JS changes -- I did try to add a method of bubbling DEBUG up to our search JavaScript, but went over my time limit without success, so we can revisit adding debug to our local output. It's fairly easy to tell when the Sphinx index is being used in production at the moment anyway.

@@ -32,6 +32,7 @@ function attach_elastic_search_query(data) {
var total_count = data.count || 0;

if (hit_list.length) {
console.debug('Read the Docs search got a result. Showing results.')
@agjohnson (Contributor) Feb 5, 2019

We should drop debug/log statements like this for prod. You can tell if the Sphinx indexes are used because the search results will come back empty -- or easier, you'll see a flood of requests for Sphinx's index files.

I do think this is helpful though. An addition to our JS could be to log when DEBUG = True, but I took a really quick swing at this and hit issues. We'd most likely need to pass it through our footer. I'd say let's remove these statements and find a method of exposing DEBUG to our JS.

contents.html(content_text);
contents.find('em').addClass('highlighted');
list_item.append(contents);
for (index in highlight.content) {
@agjohnson (Contributor) Feb 5, 2019

So, a JS quirk here: for ... in isn't actually for iterables, it's for enumerable properties on an object. It's better to use the old for (var i = 0; ...) approach for arrays:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...in

There is for ... of, which loops over iterables, but browser support is still new:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/for...of#Browser_compatibility

@@ -97,6 +105,7 @@ function attach_elastic_search_query(data) {
},
complete: function (resp, status_code) {
if (status_code !== 'success' || resp.responseJSON.count === 0) {
console.debug('Read the Docs search failed, skipping loading search content.')
@agjohnson (Contributor) Feb 5, 2019

Another debug statement

@@ -127,14 +127,16 @@ <h3>{% blocktrans with query=query|default:"" %}Results for {{ query }}{% endblo
{% if result.name %}

{# Project #}
<a href="{{ result.url }}">{{ result.name }}</a>
<a href="{{ result.url }}">{{ result.name }} (<em>{{ result.slug }}</em>)</a>
@agjohnson (Contributor) Feb 5, 2019

If this is the notation for subproject results, perhaps we should mention "(from project {{ slug }})" -- or "(from project {{ name }})" would be even better. I don't have an example of how this looks with subproject search in-doc right now, though. If in-doc we omit "from project", we can omit it here too.

@ericholscher (Member, Author) Feb 5, 2019

It's mostly because projects can have different names and slugs, which can be confusing. E.g. in prod we have something like 5000 projects with the name "Docs" or similar, so this helps make it explicit.

@humitos (Member) Feb 5, 2019

I remember we had a bug that made all the results say (from project blah), but I'm not sure what the mistake was. I think it was a problem with JS and === instead of ==. It may be worth checking the history in case it's related to this.

@humitos (Member) left a comment

I left some comments.

I don't understand this code completely, though.

I tested this branch locally (some commits "ago") and it did work properly: generating the index via the management command, indexing after docs are built, searching and getting results (I had to add the port to CORS).

class Meta(object):
model = HTMLFile
fields = ('commit',)
ignore_signals = settings.ES_PAGE_IGNORE_SIGNALS
@humitos (Member) Feb 5, 2019

search.rst should be updated accordingly. This setting is not used anymore.

from readthedocs.search.documents import PageDocument, ProjectDocument
from readthedocs.search.signals import before_file_search, before_project_search

log = logging.getLogger(__name__)


class RTDFacetedSearch(FacetedSearch):
@humitos (Member) Feb 5, 2019

Nevermind. Got confused by the comment.


# need to search for both 'and' and 'or' operations
# the score of and should be higher as it satisfies both or and and
for operator in ['and', 'or']:
@humitos (Member) Feb 5, 2019

Just in case: we were using AND and OR before; not sure if it affects anything.
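For reference, in the Elasticsearch match-query DSL the operator option is lowercase ("and"/"or"); the uppercase AND/OR belongs to the Lucene query_string syntax, so the two spellings are not interchangeable. A sketch of the body the loop in the diff would build (field name illustrative):

```python
def match_query(text, operator):
    # The ES "match" query accepts operator="and" / "or" (lowercase).
    assert operator in ('and', 'or')
    return {"match": {"content": {"query": text, "operator": operator}}}

# One query per operator, as the commented loop describes; the "and"
# variant scores higher because a hit satisfying "and" satisfies "or" too.
queries = [match_query("read the docs", op) for op in ('and', 'or')]
```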


def elastic_project_search(request, project_slug):
"""Use elastic search to search in a project."""
queryset = Project.objects.protected(request.user)
@humitos (Member) Feb 5, 2019

Why is .protected used here instead of .public? If it's because of .com, shouldn't it be a combination of .public + .for_user?

@ericholscher (Member, Author) Feb 5, 2019

This is what all the dashboard views in projects.views.public do.


the other part is responsible for querying the Index to show the proper results to users.
We use the `django-elasticsearch-dsl`_ package mostly to keep the search working.

* One part is responsible for **indexing** the documents and projects (`documents.py`)
@humitos (Member) Feb 6, 2019

I think you want to use double backticks (``) here instead of single ones.

ericholscher added 3 commits Feb 6, 2019
…rch.

Also support passing a version_slug to get_project_list_or_404 in order to filter by version privacy instead of Project.
@humitos (Member) left a comment

Latest changes look good.

I left some nitpick comments to consider.

@@ -21,19 +21,10 @@ def __init__(self, user, **kwargs):
but is used on the .com
"""
self.user = user
if 'filter_user' in kwargs:
self.filter_user = kwargs.pop('filter_user')
@humitos (Member) Feb 6, 2019

nitpick: this can be written as

self.filter_user = kwargs.pop('filter_user', None)

to avoid the if.
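The nitpick works because dict.pop takes a default as its second argument, so the attribute is always set without a membership check. A tiny self-contained demo (the function name is illustrative):

```python
def init_search(user, **kwargs):
    # One line instead of the `if 'filter_user' in kwargs:` guard;
    # filter_user is always defined, defaulting to None, and the key
    # is removed from kwargs either way.
    filter_user = kwargs.pop('filter_user', None)
    return filter_user, kwargs
```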

subprojects = Project.objects.filter(superprojects__parent_id=main_project.id)
for project in list(subprojects) + [main_project]:
version = Version.objects.public(user).filter(project__slug=project.slug, slug=version_slug)
if version.count():
@humitos (Member) Feb 6, 2019

nitpick: .exists() is better for this use case.

for project in list(subprojects) + [main_project]:
version = Version.objects.public(user).filter(project__slug=project.slug, slug=version_slug)
if version.count():
project_list.append(version[0].project)
@humitos (Member) Feb 6, 2019

nitpick: I think using .first() is the Django-way for this.
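The two nitpicks above can even collapse into a single call, since .first() returns the first row or None. The fake queryset below is just a stand-in for Django's, whose exists() issues a cheap EXISTS query and whose first() fetches at most one row:

```python
from collections import namedtuple

Version = namedtuple('Version', 'project')

class FakeQuerySet:
    # Stand-in for Django's QuerySet with the two methods discussed.
    def __init__(self, rows):
        self._rows = list(rows)
    def exists(self):             # Django: SELECT EXISTS, no row data
        return bool(self._rows)
    def first(self):              # Django: first row, or None if empty
        return self._rows[0] if self._rows else None

def append_project(version_qs, project_list):
    # Replaces `if qs.count(): project_list.append(qs[0].project)`
    # with one fetch instead of a count plus an index lookup.
    version = version_qs.first()
    if version is not None:
        project_list.append(version.project)
```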

project_list = []
main_project = get_object_or_404(Project.objects.all(), slug=project_slug)
subprojects = Project.objects.filter(superprojects__parent_id=main_project.id)
for project in list(subprojects) + [main_project]:
@humitos (Member) Feb 6, 2019

Not sure, but I'm thinking that this for could be replaced by a query itself, like:

versions = Version.objects.public(user)
    .filter(project__in=projects, slug=version_slug)
    .values_list('id', flat=True)

projects = Project.objects.filter(versions__id__in=versions)

Use it if you consider it clearer.

"""
# Support private projects with public versions
project_list = []
main_project = get_object_or_404(Project.objects.all(), slug=project_slug)
@humitos (Member) Feb 6, 2019

nitpick: no need to .objects.all(), just get_object_or_404(Project, slug=project_slug) works

@@ -62,7 +62,7 @@ def get_queryset(self):
# Validate all the required params are there
self.validate_query_params()
query = self.request.query_params.get('q', '')
kwargs = {}
kwargs = {'filter_user': False}
@humitos (Member) Feb 6, 2019

nitpick: I'd like to come up with a better name for this. I'm thinking filter_by_user, which is a little more explicit.

What we want to communicate here is "filter versions by the user's permissions", I suppose, but I haven't found a good name for that :(

kwargs = {
'using': using or cls._doc_type.using,
'index': index or cls._doc_type.index,
'doc_types': [cls],
@safwanrahman (Member) Feb 6, 2019

@ericholscher I think we can pass the doc_type from here to avoid the lazy import.

@ericholscher (Member, Author) Feb 6, 2019

Will take a peek at that in a refactor, I think it's OK for now.

kwargs = {
'using': using or cls._doc_type.using,
'index': index or cls._doc_type.index,
'doc_types': [cls],
@safwanrahman (Member) Feb 6, 2019

Same here: pass the doc_types in order to avoid the lazy import.

@ericholscher ericholscher merged commit 05b7c3f into master Feb 6, 2019
1 of 2 checks passed
Search update automation moved this from In progress to Done Feb 6, 2019
@delete-merged-branch delete-merged-branch bot deleted the readd-search-signals branch Feb 6, 2019
@safwanrahman (Member) commented Feb 6, 2019

I have run the management command with CELERY_ALWAYS_EAGER=False and it raises the following error.

[06/Feb/2019 16:41:36] celery.app.trace:249[1564]: ERROR Task readthedocs.search.tasks.index_objects_to_es[676db212-aaf4-4fee-830c-5bd6f173b3cc] raised unexpected: RequestError(400, 'illegal_argument_exception', {'error': {'root_cause': [{'type': 'illegal_argument_exception', 'reason': "Alias [project_index] has more than one indices associated with it [[project_index_20190206163659, project_index_20190206163146]], can't execute a single index op"}], 'type': 'illegal_argument_exception', 'reason': "Alias [project_index] has more than one indices associated with it [[project_index_20190206163659, project_index_20190206163146]], can't execute a single index op"}, 'status': 400})
Traceback (most recent call last):
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
    return self.run(*args, **kwargs)
  File "/Users/safwan/readthedocs/readthedocs/search/tasks.py", line 36, in index_objects_to_es
    doc_obj.update(queryset.iterator())
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/django_elasticsearch_dsl/documents.py", line 231, in update
    self._get_actions(object_list, action), **kwargs
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/django_elasticsearch_dsl/documents.py", line 191, in bulk
    return bulk(client=self.connection, actions=actions, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 257, in bulk
    for ok, item in streaming_bulk(client, actions, **kwargs):
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 192, in streaming_bulk
    raise_on_error, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 99, in _process_bulk_chunk
    raise e
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/helpers/__init__.py", line 95, in _process_bulk_chunk
    resp = client.bulk('\n'.join(bulk_actions) + '\n', **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/client/utils.py", line 76, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 1150, in bulk
    headers={'content-type': 'application/x-ndjson'})
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/transport.py", line 314, in perform_request
    status, headers_response, data = connection.perform_request(method, url, params, body, headers=headers, ignore=ignore, timeout=timeout)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/connection/http_urllib3.py", line 180, in perform_request
    self._raise_error(response.status, raw_data)
  File "/Users/safwan/.virtualenvs/readthedocs/lib/python3.6/site-packages/elasticsearch/connection/base.py", line 125, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.RequestError: TransportError(400, 'illegal_argument_exception', "Alias [project_index] has more than one indices associated with it [[project_index_20190206163659, project_index_20190206163146]], can't execute a single index op")

I was expecting this error, because this is trying to index projects and documents via the index alias. If the index alias has more than one index associated with it, it will raise an error.
In the current code in master, the new index name is passed to the task, and the task indexes into the new index using that name. But this functionality gets broken by this PR.

This only works when running with CELERY_ALWAYS_EAGER=True, as the task runs in the same process as the management command.
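The failure mode described here reduces to a simple rule: writing through an alias is only unambiguous while the alias points at exactly one concrete index, which is why the task needs the concrete index name. A tiny model of Elasticsearch's behaviour (the function is a sketch, not the ES client API):

```python
def resolve_write_index(aliases, name):
    # Mirrors Elasticsearch's rule: an index/bulk write through an alias
    # fails when the alias maps to more than one concrete index.
    targets = aliases.get(name, [name])
    if len(targets) > 1:
        raise ValueError(
            "Alias [%s] has more than one indices associated with it, "
            "can't execute a single index op" % name
        )
    return targets[0]

# The situation from the traceback: two timestamped indices behind one alias.
aliases = {
    "project_index": [
        "project_index_20190206163659",
        "project_index_20190206163146",
    ],
}
```

Passing the concrete timestamped name (as the code in master does) resolves cleanly; passing the alias raises, matching the traceback above.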
