Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize json parsing #6160

Merged
merged 8 commits into from Oct 9, 2019

Conversation

@dojutsu-user
Copy link
Member

commented Sep 10, 2019

Related #6130

This PR removes the PyQuery and lxml and instead uses the selectolax which is much faster than lxml (benchmark). Parsing logic is exactly same with some small improvements.

Results:
Before

$ time python manage.py reindex_elasticsearch
real	0m24.272s
user	0m20.457s
sys	0m0.337s

After

$ time python manage.py reindex_elasticsearch
real	0m14.907s
user	0m10.274s
sys	0m0.336s
@dojutsu-user dojutsu-user requested review from ericholscher and readthedocs/core Sep 10, 2019
@dojutsu-user dojutsu-user referenced this pull request Sep 10, 2019
Copy link
Contributor

left a comment

There's two pieces here:

  • Swapping out pyquery/lxml for selectolax/modest
  • Swapping out the stdlib json for orjson

In both cases, we are looking to swap to more performant libraries. One thing I'd like to see if where is this performance benefit coming from. Was JSON parsing the slow part and orjson is making a huge difference or was HTML parsing the slow part and selectolax is improving things overall?

My biggest concern about this is that selectolax has a single developer even though it does appear to be actively developed. The modest engine has a bit more developer backing. Orjson I've used in the past where JSON parsing speed was really important and it worked reasonably well.

readthedocs/search/parse_json.py Outdated Show resolved Hide resolved
@dojutsu-user

This comment has been minimized.

Copy link
Member Author

commented Sep 11, 2019

@davidfischer
I think these results might give some insights

# with `json` and `selectolax`
real	0m20.142s
user	0m14.789s
sys	0m0.405s

# with `orjson` and `selectolax`
real	0m17.934s
user	0m13.506s
sys	0m0.412s

# with `json` and `pyquery`
real	0m31.838s
user	0m25.688s
sys	0m0.414s

# with `orjson` and `pyquery`
real	0m26.549s
user	0m21.444s
sys	0m0.314s

I believe both orjson and selectolax are responsible for performance benefits.

Copy link
Member

left a comment

I do not have too much experience with this code. Although, it seems a good change and I think that it's easy to revert if we find out that it does not behave better in the next deploy (reindex) than the current implementation.

My only concern about this PR is that it seems that we are changing some small logic on it as well (I'm not sure, though). If that's the case, I'd recommend to make this changes in different PR.

Also, if we care too much about having different implementations, this could be refactored to use a class with a specific API so we can swap from one to another. I suppose it's not worth the effort at this point, though.

try:
body('.toctree-wrapper').remove()
nodes_to_be_removed += body.css('.toctree-wrapper') + body.css('.contents.local.topic')

This comment has been minimized.

Copy link
@humitos

humitos Sep 11, 2019

Member

It seems there is a logic change here. Are we removing an extra node here?

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Sep 11, 2019

Author Member

Yes, we are removing extra node here.
Earlier, I thought that all the Table of Contents are under a div having class .toctree-wrapper but latest I realised that pages which have Table of Contents describing the contents in the same page (like: SEO Guide) are having different css class -- contents local topic.
So I included that also.

break

text = parse_content(next_p.text(), remove_first_line=True)
text = parse_content(next_p.text(), remove_first_line=False)

This comment has been minimized.

Copy link
@humitos

humitos Sep 11, 2019

Member

Also here. It seems we are changing the logic.

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Sep 11, 2019

Author Member

Yes, when working on it again, I find that few lines were getting removed from indexing because of this.

This comment has been minimized.

Copy link
@dojutsu-user

dojutsu-user Sep 11, 2019

Author Member

Should I target these things in another PR?

This comment has been minimized.

Copy link
@humitos

humitos Sep 11, 2019

Member

I would say "yes".

If we first apply the logic changes and after that we apply the migration, we can easily come back to the previous (pyquery, etc) implementation with just one revert if we ever want.

@rushter

This comment has been minimized.

Copy link

commented Sep 26, 2019

Hey, I'm the author of selectolax.

I will be glad to help you or add new features for your particular need.
Just ping me if you will have any questions or ideas on how to speed up your particular use-case even more by introducing new features in selectolax.

@ericholscher

This comment has been minimized.

Copy link
Member

commented Sep 30, 2019

I'd like to move this forward. I think I'm OK with merging it with the logic changes, since it's already done. In the future we should try and separate these though.

Copy link
Member

left a comment

👍 on getting this in. Having faster indexing makes everything better.

@dojutsu-user

This comment has been minimized.

Copy link
Member Author

commented Sep 30, 2019

Thanks @rushter for your help. ❤️

@ericholscher ericholscher merged commit 37cabe3 into readthedocs:master Oct 9, 2019
2 checks passed
2 checks passed
continuous-documentation/read-the-docs Read the Docs build succeeded!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@dojutsu-user dojutsu-user deleted the dojutsu-user:optimize-json-parsing branch Oct 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.