Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Record search queries smartly #6088

Merged

Conversation

@dojutsu-user
Copy link
Member

@dojutsu-user dojutsu-user commented Aug 21, 2019

This PR refactor the recording mechanism for better support of "search as you type" feature.
Also, It makes all the queries lowercase so as to have no difference between words like Build and build

@dojutsu-user dojutsu-user changed the title Record queries smartly Record search queries smartly Aug 21, 2019
@dojutsu-user dojutsu-user added this to In progress in In-doc search UI via automation Aug 21, 2019
@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Aug 21, 2019

Currently, it reduces the total partial queries -- not all but better than storing every partial query.

Loading

project=project,
version=version,
query=query,
)
obj.created = time
Copy link
Member

@stsewd stsewd Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be an auto field?

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes... but the actual time is when the query was made.
Celery task are not executed immediately so the automated time will be wrong.

Loading

Copy link
Member

@humitos humitos Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that's probably OK to use the auto field for this.

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loading

Copy link
Member

@ericholscher ericholscher Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be querying the modifed field, and using that -- it will automatically update whenever it is saved: https://github.com/django-extensions/django-extensions/blob/master/django_extensions/db/models.py#L19 -- we should likely use that, instead of trying to change the created date.

Loading

@stsewd
Copy link
Member

@stsewd stsewd commented Aug 21, 2019

Also, what about just saving the query if it is greater than 3 characters?

Loading

@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Aug 21, 2019

@stsewd

Also, what about just saving the query if it is greater than 3 characters?

This will fail in many cases, like: readt, readthe, readthedo (for readthedocs)

Loading

@stsewd
Copy link
Member

@stsewd stsewd commented Aug 21, 2019

Why would that fail?

We are start executing the same logic from readt. I'm not saying to remove the logic from this PR

Loading

@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Aug 21, 2019

@stsewd
Ohhh... Okay
I get it.
Yeahh... that's a nice improvement.
I will update the PR.

Loading

@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Aug 21, 2019

This is not ready for a review. It is too far from being OK.
Results are not good currently.

Loading

readthedocs/search/tasks.py Outdated Show resolved Hide resolved
Loading
version__slug=version_slug,
query__startswith=query,
created__gte=before_10_sec,
).order_by('-created')
Copy link
Contributor

@davidfischer davidfischer Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder about the performance characteristics of this. I don't believe created is indexed although project and version are. This might be worth testing on the live system before merging. This might be ok but I'm not 100% confident without just timing a similar query in prod.

Loading

readthedocs/search/tasks.py Outdated Show resolved Hide resolved
Loading

# record the search query with a celery task
tasks.record_search_query.delay(
project_slug,
version_slug,
query,
total_results,
time,
Copy link
Contributor

@davidfischer davidfischer Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on how tasks are serialized, this may not work exactly as planned. I know celery supports Pickle and JSON and I'm not sure which one we are using. Datetimes are not JSON serializable by default although Django does have a way to do it. However, I'm not sure they come back as timezone aware datetimes after they do. Did you verify this?

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During local development, celery uses json serializer.

CELERY_RESULT_SERIALIZER = 'json'

(Docs: https://docs.celeryproject.org/en/latest/userguide/configuration.html?highlight=CELERY_RESULT_SERIALIZER#new-lowercase-settings)

they come back as timezone aware datetimes after they do

I am not sure how to verify this.
I did print the timezone.now() and the SQL query -- that looks pretty similar.

timezone.now(): 2019-08-23 08:45:42.841161+00:00
SQL Query:

SELECT "search_searchquery"."id", "search_searchquery"."created", "search_searchquery"."modified", "search_searchquery"."project_id", "search_searchquery"."version_id", "search_searchquery"."query"
FROM "search_searchquery"
INNER JOIN "projects_project"
ON ("search_searchquery"."project_id" = "projects_project"."id")
INNER JOIN "builds_version"
ON ("search_searchquery"."version_id" = "builds_version"."id")
WHERE ("projects_project"."slug" = rtd AND "builds_version"."slug" = latest AND "search_searchquery"."query" LIKE read% ESCAPE '\' AND "search_searchquery"."created" >= 2019-08-23 08:45:32.841161)
ORDER BY "search_searchquery"."created" DESC

Loading

return

# don't record query with zero results.
if not total_results:
Copy link
Contributor

@davidfischer davidfischer Aug 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this block isn't at the top of the function? I guess you could update the time on another query even if this one returns no results. However, that probably warrants a comment or something.

Loading

Copy link
Member Author

@dojutsu-user dojutsu-user Aug 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason this block isn't at the top of the function?

Yes. I wanted to update the timing and query even if no results are there. For example: readthed will return no results but I wanted to update the timing and query if there is a previous query of readth.

I guess you could update the time on another query even if this one returns no results. However, that probably warrants a comment or something.

I am not sure if I understood.

Loading

@ericholscher
Copy link
Member

@ericholscher ericholscher commented Aug 27, 2019

@dojutsu-user what is the status here? Is it working locally, or still needs more work?

Loading

@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Aug 28, 2019

@ericholscher
I would really like to have some help here.
The logic and everything seems correct but I it is not working as expected.

Loading

ericholscher
Copy link
Member

ericholscher commented on 9200121 Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should likely be setting an updated time, not created, no?

Loading

dojutsu-user
Copy link
Member

dojutsu-user commented on 9200121 Aug 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite get it. Did you mean modified_time?

Loading

Copy link
Member

@ericholscher ericholscher left a comment

Think this looks good, but we should definitely be updating the modified time, not the created time :)

Loading

project=project,
version=version,
query=query,
)
obj.created = time
Copy link
Member

@ericholscher ericholscher Sep 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be querying the modifed field, and using that -- it will automatically update whenever it is saved: https://github.com/django-extensions/django-extensions/blob/master/django_extensions/db/models.py#L19 -- we should likely use that, instead of trying to change the created date.

Loading

@ericholscher
Copy link
Member

@ericholscher ericholscher commented Sep 5, 2019

Gonna merge this so it goes out on the next deploy, but would like to do some small updates to this over time.

Loading

@ericholscher ericholscher merged commit a1f3b05 into readthedocs:master Sep 5, 2019
2 checks passed
Loading
@dojutsu-user dojutsu-user deleted the recording-query-smartly branch Sep 6, 2019
@dojutsu-user dojutsu-user moved this from In progress to Done in In-doc search UI Sep 6, 2019
@humitos
Copy link
Member

@humitos humitos commented Sep 9, 2019

Think this looks good, but we should definitely be updating the modified time, not the created time :)

@dojutsu-user can you create an issue to track this?

Loading

@dojutsu-user
Copy link
Member Author

@dojutsu-user dojutsu-user commented Sep 9, 2019

@humitos
Sure 👍

Loading

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants