Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug that caused search objects not to delete #5487

Merged
merged 3 commits into from
Mar 20, 2019

Conversation

ericholscher
Copy link
Member

@ericholscher ericholscher commented Mar 18, 2019

The comment mostly explains it,
but the delete objects code does another query:

https://github.com/rtfd/readthedocs.org/blob/3340364802a8cdcc5f4145fbde2a5dd4f7087771/readthedocs/search/tasks.py#L59-L62

This should now fix search deletion.

The comment mostly explains it,
but the delete objects code does another query:

https://github.com/rtfd/readthedocs.org/blob/3340364802a8cdcc5f4145fbde2a5dd4f7087771/readthedocs/search/tasks.py#L59-L62

This should now fix search deletion.
@ericholscher ericholscher changed the title Release 3.4.0 Fix bug that caused search objects not to delete Mar 18, 2019
@ericholscher ericholscher requested a review from a team March 18, 2019 22:06
Copy link
Member

@humitos humitos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that this will fix the issue. If functions listening to bulk_post_delete are not executed immediately, delete_queryset.delete will be executed and we will have the same that the one that we are trying to fix here, aren't we?

@ericholscher
Copy link
Member Author

Django signals are processed synchronously, so it should work. I agree we should move to having this logic be handled all in the same task, instead of through signals, so we can make it clearer the ordering, but this should work.

I also wish we had good tests for this, but I've never found a good way to add tests to the build process, and with search too. I think with a refactor we could at least mock things reasonably, but that's a bigger PR I think.

@safwanrahman
Copy link
Member

@ericholscher I had tested and it was working. Whats the issue now?

@ericholscher
Copy link
Member Author

@safwanrahman in prod pages aren't being deleted from the index.

@safwanrahman
Copy link
Member

@ericholscher
Copy link
Member Author

No, the issue is as noted above, the queryset is being deleted before we query for the objects to remove them from the search index :)

@safwanrahman
Copy link
Member

@ericholscher Can you check the autosync settings in Production? ELASTICSEARCH_DSL_AUTO_REFRESH key in settings?

@safwanrahman
Copy link
Member

safwanrahman commented Mar 19, 2019

No, the issue is as noted above, the queryset is being deleted before we query for the objects to remove them from the search index :)

Because of we were deleting the object first, we were including it into the list
https://github.com/rtfd/readthedocs.org/blob/700b83df78417a095718ec8b641ac1750560b817/readthedocs/projects/tasks.py#L1326
So though the object is deleted from db, the signal can get it from the cached object thats in the memory.

@ericholscher
Copy link
Member Author

We query it again where I linked in the description of this PR .

@safwanrahman
Copy link
Member

safwanrahman commented Mar 19, 2019

We query it again where I linked in the description of this PR .

I think we are not querying it. Oh. I did not notice it. I am checking it

@ericholscher
Copy link
Member Author

ericholscher commented Mar 19, 2019

@safwanrahman please take the time to read what I'm saying. We query it in the delete_objects_in_es that is called at the bottom of that function, as I already linked in the description of this PR.

@safwanrahman
Copy link
Member

@ericholscher Sorry for the misunderstanding. Do there any way to remove the DB query from there? Do you have any idea?

@humitos
Copy link
Member

humitos commented Mar 19, 2019

Django signals are processed synchronously, so it should work

You are right. I was thinking on celery tasks.

The PR looks correct to me.

@@ -1324,17 +1324,13 @@ def _manage_imported_files(version, path, commit):
)
# Keep the objects into memory to send it to signal
instance_list = list(delete_queryset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing list() doesn't make to hit the db already?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we can probably remove this if we want, since the objects should exist now.

@safwanrahman
Copy link
Member

@ericholscher I have checked the git log and it seems like we were running registry.delete(instance_list) in the past and it was removed with the current approach in 144ed0e. In current approach, another query to DB is needed. But in previous approach, DB query was not needed.
Do you remember why it was changed?

I think the ES objects should be removed after the object has been removed from the DB so that in case there are some error removing the object from DB, the object does not get removed from elasticsearch.

@ericholscher
Copy link
Member Author

I just copied the logic from the index_html_file function above. If we can just pass an instance list, we should remove that logic from both places.

@safwanrahman
Copy link
Member

I have gone through the code, and it seems like the task delete_objects_in_es is not get called from anywhere asynchronously. It was used to delete projects asynchronously, but later removed in #5262

@safwanrahman
Copy link
Member

I just copied the logic from the index_html_file function above. If we can just pass an instance list, we should remove that logic from both places.

Yeah. We can simplify the index_html_file. We had the logic there because we wanted to use a generic task function index_objects_to_es, which needed the serialized data. If you agree, we can remove the extra logic in index_html_file.

@ericholscher
Copy link
Member Author

I think keeping the 2 tasks similar in structure is probably more important than the order in which we delete objects from the DB or ES. Arguably it's more important to delete them from ES and the DB, since they will be deleted from the DB again on a later run of the code, but ES won't automatically delete a file if it gets missed.

We should probably update the ES delete logic to be the same, and depend on commit, so it will properly delete old files if we hit an exception during one run of the code.


# Delete ImportedFiles from previous versions
(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All HTMLFile are ImportedFile, but not all ImportedFile are HTMLFile. So we should delete the ImportedFile also. The HTMLFile was deleted above for dispatching signals as HTMLFile is a proxy model.

@safwanrahman
Copy link
Member

I think keeping the 2 tasks similar in structure is probably more important than the order in which we delete objects from the DB or ES.

If you would like to have both task in same structure, I think its better to pass the list in registry.delete() and registry.update() from the signal. We can use the index_objects_to_es for indexing asynchronously when we needed (like from management command, project indexing, etc).

We should probably update the ES delete logic to be the same, and depend on commit, so it will properly delete old files if we hit an exception during one run of the code.

Yeah. That maybe better idea in long run. But we were trying to map HTMLFile with the ES index object. So if you think from that side, its better to have a tie with HTMLFile DB object.

@ericholscher
Copy link
Member Author

Yeah. That maybe better idea in long run. But we were trying to map HTMLFile with the ES index object. So if you think from that side, its better to have a tie with HTMLFile DB object.

But right now there is no way to handle the case where we delete an HTMLFile object, but the ES index doesn't get updated, which feels like a much bigger issue.

@safwanrahman
Copy link
Member

safwanrahman commented Mar 19, 2019

But right now there is no way to handle the case where we delete an HTMLFile object, but the ES index doesn't get updated, which feels like a much bigger issue.

We can run a cronjob with celery-beat which will check every night whether any object got missed form ES or anything is non available in DB, but available in ES and fix according to it. Or we can automatically run reindex every night which will be much better as we do have zero down time indexing so no user will be impacted and the data will be consistent. Which one do you prefer?

@ericholscher
Copy link
Member Author

I think we should make the code do the correct logic, not add a bunch of operational overhead to our servers :)

@safwanrahman
Copy link
Member

I agree! Our current code do correct logic as most of the time. But its not possible to do it always from the code end if the ES get degraded and we get timed out from Elasticsearch.
I think its better to have ES stale data rathar than having missing data.
Because of that, I have prosed about cronjob for data consistancy. If thats not possible, I think the current approach is good enough.

@ericholscher
Copy link
Member Author

I've made a bunch of improvements to this logic in #5290, so I will merge this for a quick fix, and that will be the larger fix around removing commits via ES.

@ericholscher ericholscher merged commit 3699b4e into master Mar 20, 2019
@delete-merged-branch delete-merged-branch bot deleted the search-delete-objects branch March 20, 2019 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants