[Fix #4333] Implement asynchronous search reindex functionality using celery #4368

Merged
merged 15 commits into rtfd:search_upgrade on Jul 31, 2018

Conversation

3 participants
@safwanrahman
Member

safwanrahman commented Jul 13, 2018

The management command reindex_elasticsearch has been rewritten from scratch using celery tasks.
The idea is taken from @robhudson's blog post and heavily inspired by the code of mozilla/zamboni.

I needed to override some methods of django-elasticsearch-dsl in order to support a zero-downtime rebuild (sabricot/django-elasticsearch-dsl#75). I am hoping to send a PR upstream.
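As a rough illustration of the chunk-and-fan-out pattern this PR borrows from mozilla/zamboni (the helper below and the task names in the comment, `index_objects` and `switch_alias`, are hypothetical stand-ins, not the actual names in this PR):

```python
def chunk_ids(ids, chunk_size):
    """Split a flat list of primary keys into fixed-size batches,
    so each batch can be indexed by its own celery task."""
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]

# Sketch of the fan-out (hypothetical task names):
#   chord(index_objects.si(app_label, model_name, batch, new_index)
#         for batch in chunk_ids(pks, 500))(switch_alias.si(new_index))
# All batch tasks run in parallel; the chord callback then points the
# read alias at the freshly built index for a zero-downtime swap.
```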

This fixes #4333

@ericholscher @rtfd/core r?

@safwanrahman safwanrahman requested a review from ericholscher Jul 13, 2018

@safwanrahman safwanrahman added this to Backlog in Search update via automation Jul 13, 2018

@safwanrahman safwanrahman moved this from Backlog to In progress in Search update Jul 13, 2018

@safwanrahman

Member

safwanrahman commented Jul 16, 2018

I have tested with about 10K files on a single celery host with 4 workers. The indexing finished within 20 seconds.

@ericholscher

I like the approach of using the chain/chord to handle indexing and index creation. Does this work locally with CELERY_ALWAYS_EAGER (I believe Rob mentioned that in his blog post), or will we just use the existing DED indexing locally?

I'm a little concerned about the complexity of this approach. Is there a reason we're using chunks here when we already have a domain object to chunk on which is the Version? This feels like extra work to do and code to maintain, when we already have an existing way to think about this problem.

This approach also doesn't use the same code path for indexing as our application code, so now we have two different ways of indexing files, which doesn't seem great.

It really feels like all we need is:

  • A management command that iterates over versions, and sends them to be indexed
  • A celery task that takes a version and indexes all the files in that version, which is called both in production as well as in reindexing.

I'm happy to talk about this more. There are likely some design decisions that you made that I don't understand :)
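Purely as illustration, the two pieces described in the bullets above might be sketched like this (all names are hypothetical, not actual Read the Docs code):

```python
def reindex_all_versions(versions, enqueue):
    """Management-command core: iterate over versions and queue one
    indexing task per version (in production, enqueue would be
    something like index_version.delay)."""
    for version in versions:
        enqueue(version)

def index_version(version):
    """Celery-task core: index every file belonging to one version.
    Stubbed here to return markers instead of hitting Elasticsearch."""
    return [f"indexed:{path}" for path in version["files"]]
```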

"The format is <app_label>.<model_name>")
)
def handle(self, *args, **options):

@ericholscher

ericholscher Jul 16, 2018

Member

Needs a docstring showing how to run it.

),
}
def _get_actions(self, object_list, action, index_name=None):

@ericholscher

ericholscher Jul 16, 2018

Member

Why are we overriding this? Needs a docstring.

for document in documents:
if str(document) == document_class:
return document

@ericholscher

ericholscher Jul 16, 2018

Member

What is the logic here for? Will there be multiple documents for a model ever? Also needs a docstring :)

@safwanrahman

safwanrahman Jul 16, 2018

Member

Yes, there can be multiple documents for a model:

class FooDocument(DocType):
    ...
    class Meta:
        model = Bar

class NewFooDocument(DocType):
    ...
    class Meta:
        model = Bar
@ericholscher

Member

ericholscher commented Jul 16, 2018

Another thought -- should this be implemented as a contribution to DED with a Celery backend, instead of our own custom logic? It sounds like we might be able to do it with Celery via a setting: https://github.com/sabricot/django-elasticsearch-dsl#elasticsearch_dsl_signal_processor

We could also perhaps add a celery flag to the existing indexer command? So that we don't have to maintain our own set of code around indexing.

@safwanrahman

Member

safwanrahman commented Jul 16, 2018

Another thought -- should this be implemented as a contribution to DED with a Celery backend, instead of our own custom logic? It sounds like we might be able to do it with Celery via a setting:
https://github.com/sabricot/django-elasticsearch-dsl#elasticsearch_dsl_signal_processor

Sure, I will port it to DED, but I wonder whether we can use the signal processor, as it is called while saving/updating the objects.

We could also perhaps add a celery flag to the existing indexer command? So that we don't have to maintain our own set of code around indexing.

That's true, but it will take some time to get it reviewed and merged into master. I will open a PR there soon, but to deploy sooner we can keep it here until then.

@ericholscher

Member

ericholscher commented Jul 16, 2018

That's true, but it will take some time to get it reviewed and merged into master. I will open a PR there soon, but to deploy sooner we can keep it here until then.

We can always deploy a forked version while we wait for it to get accepted. It's true that we don't want to be stuck waiting on their review & merge, but perhaps it will be quick.

@safwanrahman

Member

safwanrahman commented Jul 16, 2018

I'm a little concerned about the complexity of this approach. Is there a reason we're using chunks here when we already have a domain object to chunk on which is the Version? This feels like extra work to do and code to maintain, when we already have an existing way to think about this problem.

This is a general-purpose management command for indexing all types of documents, like Project and HTMLFile, not only HTMLFile. So chunking by number was the only way available to me.

This approach also doesn't use the same code path for indexing as our application code, so now we have two different ways of indexing files, which doesn't seem great.

Yeah, that's true. This management command reindexes all the documents. On the other side, DED catches the signal when a new file is created and indexes it into Elasticsearch.

It really feels like all we need is:
* A management command that iterates over versions, and sends them to be indexed
* A celery task that takes a version and indexes all the files in that version, which is called both in production as well as in reindexing.

As mentioned above, the management command is general purpose, so something specifically for HTMLFile would be extra implementation that may not be available when we port it to DED.

@safwanrahman

Member

safwanrahman commented Jul 16, 2018

Does this work locally with CELERY_ALWAYS_EAGER (I believe Rob mentioned that in his blog post), or will we just use the existing DED indexing locally?

Yes, it works with CELERY_ALWAYS_EAGER. Maybe it was broken in the past, but it seems to have been fixed on the celery end.
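For context, eager mode is a standard Celery setting that runs tasks synchronously in-process, so no broker or worker is needed for local development. A minimal settings sketch (using the old-style config names current at the time; Celery 4+ renames these to task_always_eager and task_eager_propagates):

```python
# Django settings sketch for local development.
CELERY_ALWAYS_EAGER = True  # tasks execute inline, no broker/worker needed
CELERY_EAGER_PROPAGATES_EXCEPTIONS = True  # surface task errors immediately
```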

@ericholscher

Member

ericholscher commented Jul 17, 2018

Yeah, that's true. This management command reindexes all the documents. On the other side, DED catches the signal when a new file is created and indexes it into Elasticsearch.

Right, but that doesn't scale properly right now. It currently sends an HTTP request per file saved during a project build. We need to batch it, which could use logic similar to this.
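The batching being discussed would ultimately feed Elasticsearch's bulk API. A minimal sketch of shaping per-file documents into bulk actions (the real client call would be elasticsearch.helpers.bulk, not invoked here; the function name is invented):

```python
def build_bulk_actions(index_name, docs):
    """Convert plain document dicts into the action format consumed by
    elasticsearch.helpers.bulk() -- one HTTP request for many files,
    instead of one request per saved file."""
    return [
        {"_index": index_name, "_id": doc["id"], "_source": doc}
        for doc in docs
    ]
```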

@safwanrahman

Member

safwanrahman commented Jul 17, 2018

Right, but that doesn't scale properly right now. It currently sends an HTTP request per file saved during a project build. We need to batch it, which could use logic similar to this.

I understand, but how do we get the files in batches? Is there any signal that is sent in batch?

@ericholscher

Member

ericholscher commented Jul 17, 2018

We should just update the code to keep track of the files that are changed as the build runs. We can use the existing files_changed signal, or implement something that does it natively in that function.
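The "keep track of changed files, then act once" idea can be sketched in plain Python (in the real codebase this would hang off the existing files_changed signal; FileChangeTracker is an invented name for illustration only):

```python
class FileChangeTracker:
    """Accumulate changed file paths during a build and flush them
    in one batch, instead of one request per file."""

    def __init__(self, on_flush):
        self.changed = []
        self.on_flush = on_flush  # e.g. a bulk-index call

    def record(self, path):
        """Called whenever a file is written during the build."""
        self.changed.append(path)

    def flush(self):
        """Emit a single batched call at the end of the build."""
        if self.changed:
            self.on_flush(list(self.changed))
            self.changed.clear()
```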

@ericholscher

Member

ericholscher commented on readthedocs/projects/signals.py in ce4abaf Jul 19, 2018

Is there a reason to use signals here? I think we can just put the logic directly in the build process? These feel too specific to be useful as signals, unless there is some other reason we need them?

Member

safwanrahman replied Jul 19, 2018

I was also thinking about putting the logic there. Later I thought that in the future there may be cases where the HTMLFiles are created from other places, like pushing documentation directly from CI. In that case, we would need to have the logic there as well.
Moreover, I would prefer to keep the Search logic separate from the build logic, so it is easier to split out the search application in the future.

I do not have a strong preference for keeping the signal. If you think we should put the logic there, I can implement it that way.

@ericholscher

Member

ericholscher commented on readthedocs/search/signals.py in ce4abaf Jul 19, 2018

Do we need this to be another Celery task? It's already running inside of Celery, so I'm not sure we need to delay it here. It could lead to weird race conditions if it got queued up for a long time.

@safwanrahman

Member

safwanrahman commented Jul 19, 2018

Do we need this to be another Celery task? It's already running inside of Celery, so I'm not sure we need to delay it here. It could lead to weird race conditions if it got queued up for a long time.

Good catch @ericholscher. I will change it to run synchronously. Thanks!

@safwanrahman safwanrahman referenced this pull request Jul 19, 2018

Open

Adding Test for new search prototype #4264

0 of 8 tasks complete
@safwanrahman

Member

safwanrahman commented Jul 19, 2018

@ericholscher I have fixed the issues you mentioned and fixed the tests. I also added a comment in #4264 (comment) about testing it in the backlog. r?

@agjohnson agjohnson added this to the Search improvements milestone Jul 20, 2018

@safwanrahman safwanrahman self-assigned this Jul 27, 2018

@safwanrahman

Member

safwanrahman commented Jul 27, 2018

@ericholscher I have also fixed #4409 with 612cfb8, so there will not be any automatic indexing into Elasticsearch in local development.
r?

@safwanrahman

Member

safwanrahman commented Jul 30, 2018

@ericholscher Fixed the timestamp issue. r?

@ericholscher

Member

ericholscher commented Jul 31, 2018

This looks good. 👍

@ericholscher ericholscher merged commit 463f9e2 into rtfd:search_upgrade Jul 31, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

Search update automation moved this from In progress to Done Jul 31, 2018
