
[Fix #4333] Implement asynchronous search reindex functionality using celery #4368

Merged (15 commits) on Jul 31, 2018

Conversation

@safwanrahman (Member) commented Jul 13, 2018

The management command reindex_elasticsearch has been rewritten from scratch using Celery tasks.
The idea is taken from @robhudson's blog post and is heavily inspired by the code of mozilla/zamboni.

Some methods of django-elasticsearch-dsl need to be overridden in order to support zero-downtime rebuilds (django-es/django-elasticsearch-dsl#75). I am hoping to send a PR upstream.

This fixes #4333

@ericholscher @rtfd/core r?
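For reviewers, here is a minimal sketch of the chain/chord rebuild flow this PR describes. The task names mirror the ones discussed below, but the bodies are placeholders rather than the actual implementation:

```python
from celery import Celery, chain, chord

app = Celery('reindex_sketch')  # stand-in Celery app for illustration


def chunked(ids, chunk_size):
    """Yield successive chunk_size-sized slices of ids."""
    for i in range(0, len(ids), chunk_size):
        yield ids[i:i + chunk_size]


@app.task
def create_new_es_index(app_label, model_name, new_index_name):
    """Create a fresh index for the rebuild (placeholder body)."""


@app.task
def index_objects_to_es(app_label, model_name, index_name, objects_id):
    """Bulk-index the given object ids into index_name (placeholder body)."""


@app.task
def switch_es_index(app_label, model_name, index_name, new_index_name):
    """Point the alias at the new index and drop the old one (placeholder body)."""


def run_reindex(app_label, model_name, index_name, instance_ids, chunk_size=500):
    new_index_name = '{}_new'.format(index_name)
    # Header of the chord: index every chunk of ids in parallel.
    index_tasks = [
        index_objects_to_es.si(app_label, model_name, new_index_name, ids)
        for ids in chunked(instance_ids, chunk_size)
    ]
    # Chain: create the new index, run the chord, and switch the alias only
    # after every chunk has been indexed (zero-downtime rebuild).
    workflow = chain(
        create_new_es_index.si(app_label, model_name, new_index_name),
        chord(
            index_tasks,
            switch_es_index.si(app_label, model_name, index_name, new_index_name),
        ),
    )
    workflow.apply_async()
```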

@safwanrahman safwanrahman added this to Backlog in Search update via automation Jul 13, 2018
@safwanrahman safwanrahman moved this from Backlog to In progress in Search update Jul 13, 2018
@safwanrahman (Member Author)

I have tested with about 10K files on a single Celery host with 4 workers. The indexing finished within 20 seconds.

@ericholscher (Member) left a comment


I like the approach of using the chain/chord to handle indexing and index creation. Does this work locally with CELERY_ALWAYS_EAGER (I believe Rob mentioned that in his blog post), or will we just use the existing DED indexing locally?

I'm a little concerned about the complexity of this approach. Is there a reason we're using chunks here when we already have a domain object to chunk on which is the Version? This feels like extra work to do and code to maintain, when we already have an existing way to think about this problem.

This approach also doesn't use the same code path for indexing as our application code, so now we have two different ways of indexing files, which doesn't seem great.

It really feels like all we need is (a rough sketch follows below):

  • A management command that iterates over versions and sends them to be indexed
  • A celery task that takes a version and indexes all the files in that version, which is called both in production as well as in reindexing.

I'm happy to talk about this more. There are likely some design decisions that you made that I don't understand :)
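For illustration, a minimal sketch of that shape, assuming hypothetical task and model lookups (the app/model labels, FK name, and queue are assumptions, not code from this PR):

```python
from celery import Celery
from django.apps import apps
from django.core.management.base import BaseCommand

app = Celery('version_index_sketch')  # stand-in Celery app for illustration


@app.task(queue='web')
def index_version(version_pk):
    """Index every HTML file belonging to one version.

    The same task would run after a production build and during a full
    reindex, so both paths share one code path.
    """
    HTMLFile = apps.get_model('projects.HTMLFile')  # assumed app/model label
    files = HTMLFile.objects.filter(version_id=version_pk)  # assumed FK name
    # ... bulk-index `files` into Elasticsearch here ...


class Command(BaseCommand):
    help = 'Queue one indexing task per version.'

    def handle(self, *args, **options):
        Version = apps.get_model('builds.Version')  # assumed app/model label
        for version_pk in Version.objects.values_list('pk', flat=True).iterator():
            index_version.delay(version_pk)
```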

print("unsuccessful", name)


def fetch(url=None, all_projects=[]):
Member

all_projects should be defined outside the function; it's confusing as defined here.
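For context, the usual way to avoid the mutable-default pitfall being pointed out here, as a generic Python sketch rather than code from this PR:

```python
def fetch(url=None, all_projects=None):
    # A mutable default like [] is created once and shared across calls;
    # defaulting to None and building the list inside avoids that surprise.
    if all_projects is None:
        all_projects = []
    ...
```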

Member Author

Oops! This file was pushed mistakenly. I had written it for local fetching purposes. Removing it.

@staticmethod
def _get_models(args):
    for model_name in args:
        yield apps.get_model(model_name)
Member

This doesn't need to be broken out into its own function. It just increases complexity for little value.

"The format is <app_label>.<model_name>")
)

def handle(self, *args, **options):
Member

Needs a docstring showing how to run it.
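For example, something along these lines would do (the exact flag name and invocations are illustrative):

```python
def handle(self, *args, **options):
    """Entry point for the reindex_elasticsearch management command.

    Usage (model labels follow the <app_label>.<model_name> format):

        python manage.py reindex_elasticsearch
        python manage.py reindex_elasticsearch --models projects.HTMLFile
    """
```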

),
}

def _get_actions(self, object_list, action, index_name=None):
Member

Why are we overriding this? Needs a docstring.
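A docstring here could capture both answers, for example (wording is a suggestion, not the PR's text):

```python
def _get_actions(self, object_list, action, index_name=None):
    """Build the Elasticsearch bulk actions for object_list.

    Overridden from django-elasticsearch-dsl so callers can target an
    explicit index_name (e.g. a freshly created index during a rebuild)
    instead of the document's default index, which is what allows a
    zero-downtime reindex.
    """
```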


for document in documents:
    if str(document) == document_class:
        return document
Member

What is the logic here for? Will there be multiple documents for a model ever? Also needs a docstring :)

Member Author

Yes, there can be multiple documents for a model:

class FooDocument(DocType):
    ...
    class Meta:
        model = Bar

class NewFooDocument(DocType):
    ...
    class Meta:
        model = Bar

def _run_reindex_tasks(self, models):
    for doc in registry.get_documents(models):
        qs = doc().get_queryset()
        instance_ids = list(qs.values_list('id', flat=True))
Member

Is this not going to use a ton of memory? This is creating the entire list of objects in memory all at once, instead of streaming them.

@safwanrahman (Member Author), Jul 16, 2018

It's not actually creating a list of objects, it's just a list of integers. I think integers take a much smaller amount of memory.

Member

I'm pretty worried this is going to be both slow & perhaps incredibly memory intensive in production when we have 2 million objects in memory here. I'd like to find another approach that streams the data if possible, but we can try this for now and see what happens.

@safwanrahman (Member Author), Jul 27, 2018

@ericholscher I think we can try this now and check if anything goes wrong.
I have tested with 2 million integers; it takes about 16 MB of RAM, so I think it's not a big issue.

Member Author

@ericholscher I have optimized the memory usage in 143ce7f
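For reference, one way to keep memory flat is to build fixed-size id chunks from an iterator instead of materialising the whole list at once; a sketch (not necessarily what 143ce7f does):

```python
def chunked_ids(queryset, chunk_size=500):
    """Yield lists of primary keys without holding every id in memory at once."""
    chunk = []
    for pk in queryset.values_list('id', flat=True).iterator():
        chunk.append(pk)
        if len(chunk) >= chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk
```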



@app.task(queue='web')
def switch_es_index_task(app_label, model_name, index_name, new_index_name):
Member

We don't put "task" at the end of our task names.

@safwanrahman (Member Author), Jul 16, 2018

So what suffix do we put to indicate that it's a task? I thought it's clearer to understand from the code that it's a task, so that it does not get called directly.

Member

We don't use any suffix in our code. We should be keeping the same coding standards here as elsewhere. If we want to change this, we need to change it everywhere; otherwise it's even more confusing having half of our tasks end in "task".

Member Author

Yeah. Fixed it in the latest commit.

@ericholscher (Member) commented Jul 16, 2018

Another thought -- should this be implemented as a contribution to DED with a Celery backend, instead of our own custom logic? It sounds like we might be able to do it with Celery via a setting: https://github.com/sabricot/django-elasticsearch-dsl#elasticsearch_dsl_signal_processor

We could also perhaps add a celery flag to the existing indexer command, so that we don't have to maintain our own set of code around indexing.
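For reference, the setting being linked to looks like this; the Celery-backed processor shown commented out is hypothetical (it did not exist in DED at the time) and is exactly what such a contribution would add:

```python
# settings.py
# Default processor shipped with django-elasticsearch-dsl:
ELASTICSEARCH_DSL_SIGNAL_PROCESSOR = (
    'django_elasticsearch_dsl.signals.RealTimeSignalProcessor'
)

# A Celery-backed contribution would be swapped in the same way, e.g.:
# ELASTICSEARCH_DSL_SIGNAL_PROCESSOR = (
#     'django_elasticsearch_dsl.signals.CelerySignalProcessor'  # hypothetical
# )
```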

@safwanrahman (Member Author)

Another thought -- should this be implemented as a contribution to DED with a Celery backend, instead of our own custom logic? It sounds like we might be able to do it with Celery via a setting:
https://github.com/sabricot/django-elasticsearch-dsl#elasticsearch_dsl_signal_processor

Sure, I will port it to DED, but I wonder whether we can use the signal processor, as that is called while saving/updating the objects.

We could also perhaps add a celery flag to the existing indexer command, so that we don't have to maintain our own set of code around indexing.

That's really true, but it needs some time to get reviewed and merged into master. I will open a PR there soon, but to get our deployment out soon, we can keep it here until then.

@ericholscher (Member)

That's really true, but it needs some time to get reviewed and merged into master. I will open a PR there soon, but to get our deployment out soon, we can keep it here until then.

We can always deploy a forked version while we wait for it to get accepted. It's true, though, that we don't want to be waiting for review & merge from them, but perhaps it will be quick.

@safwanrahman (Member Author) commented Jul 16, 2018

I'm a little concerned about the complexity of this approach. Is there a reason we're using chunks here when we already have a domain object to chunk on which is the Version? This feels like extra work to do and code to maintain, when we already have an existing way to think about this problem.

This is a general-purpose management command for indexing all types of documents, like Project/HTMLFile, not only HTMLFile. So the only way I had to make chunks was by number.

This approach also doesn't use the same code path for indexing as our application code, so now we have two different ways of indexing files, which doesn't seem great.

Yeah, that's true. This management command actually reindexes all the documents. On the other side, DED catches the signal when a new file is created and indexes it into Elasticsearch.

It really feels like all we need is:
* A management command that iterates over versions, and sends them to be indexed
* A celery task that takes a version and indexes all the files in that version, which is called both in production as well as in reindexing.

As mentioned above, the management command is general purpose, so something specifically for HTMLFile would be an extra implementation that may not be available when we port it to DED.

@safwanrahman (Member Author)

Does this work locally with CELERY_ALWAYS_EAGER (I believe Rob mentioned that in his blog post), or will we just use the existing DED indexing locally?

Yes, it works with CELERY_ALWAYS_EAGER. Maybe it was broken in the past, but it has since been fixed on the Celery end.
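For local development this is just the standard eager setting:

```python
# settings/dev.py (illustrative location)
CELERY_ALWAYS_EAGER = True  # run tasks synchronously in-process, no broker needed
```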

@ericholscher (Member)

Yeah, that's true. This management command actually reindexes all the documents. On the other side, DED catches the signal when a new file is created and indexes it into Elasticsearch.

Right, but that doesn't scale properly right now. It currently sends an HTTP request per file saved during a project build. We need to batch it, which could use similar logic to this.

@safwanrahman (Member Author)

Right, but that doesn't scale properly right now. It currently sends an HTTP request per file saved during a project build. We need to batch it, which could use similar logic to this.

I understand, but how do we get the files in a batch? Is there any signal that is sent in batch?

@ericholscher (Member) commented Jul 17, 2018

We should just update the code to keep track of the files that are changed, as it works now. We can use the existing files_changed signal, or implement something that just does it in that function natively.
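A rough sketch of that batched shape, with a stand-in signal and task (names and arguments are illustrative, not the existing files_changed signature):

```python
import django.dispatch
from celery import Celery

app = Celery('batch_index_sketch')  # stand-in Celery app for illustration

# Stand-in for the existing files_changed signal mentioned above.
files_changed = django.dispatch.Signal()


@app.task(queue='web')
def bulk_index_files(version_pk, paths):
    """Index all changed files of a version in one bulk request."""
    # ... look up the file objects for `paths` and bulk-index them ...


@django.dispatch.receiver(files_changed)
def on_files_changed(sender, version, changed_files, **kwargs):
    # One task per build carrying the whole batch, instead of one HTTP
    # request per saved file.
    bulk_index_files.delay(version_pk=version.pk, paths=list(changed_files))
```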

@safwanrahman (Member Author)

Do we need this to be another Celery task? It's already running inside of Celery, so I'm not sure we need to delay it here. It could lead to weird race conditions if it got queued up for a long time.

Good catch @ericholscher. I will fix it to run synchronously. Thanks

@safwanrahman (Member Author)

@ericholscher I have fixed the issues you mentioned and fixed the tests. I also added a comment in #4264 (comment) for testing it in the backlog. r?

@safwanrahman (Member Author)

@ericholscher I have also fixed #4409 with 612cfb8, so there will not be any automatic indexing to Elasticsearch in local development.
r?

queryset = doc().get_queryset()
# Get latest object from the queryset
latest_object = queryset.latest('modified_date')
latest_object_time = latest_object.modified_date
Member

This should just use the current time, not query the entire queryset for the time. This will likely be quite a slow query in prod.
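i.e. record the timestamp up front, roughly:

```python
from django.utils import timezone

# Record the start time directly instead of querying the newest object.
index_time = timezone.now()
```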

Member Author

Yeah! I was also thinking about it.

@safwanrahman (Member Author)

@ericholscher Fixed the timestamp issue. r?

@ericholscher (Member)

This looks good. 👍

@ericholscher ericholscher merged commit 463f9e2 into readthedocs:search_upgrade Jul 31, 2018
Search update automation moved this from In progress to Done Jul 31, 2018