
perf(localSearch): add concurrency pooling, cleanup logic, improve performance #3374

Merged: 5 commits into vuejs:main, Dec 30, 2023

Conversation

@zhangyx1998 (Contributor) commented Dec 23, 2023

fixes #3377

Highlights

This PR enables the local search plugin to work in true parallel. While keeping its API backward compatible, it now allows asynchronous user-provided callbacks for both HTML generation and section splitting (which is very expensive).

In addition, by refactoring the task-dispatching logic, the plugin no longer accumulates content for all pages and locales before indexing. Therefore, this change benefits all local-search users out of the box.
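To make the dispatching change concrete, here is an illustrative contrast (not the plugin's actual code; `render` and `index` are hypothetical stand-ins) between accumulating all rendered pages before indexing and dispatching each page's indexing task as soon as its HTML is ready:

```typescript
type Render = (path: string) => Promise<string>
type Index = (path: string, html: string) => Promise<void>

// Before: every page is rendered and held in memory, then indexed in a
// second pass.
async function accumulateThenIndex(paths: string[], render: Render, index: Index) {
  const htmls = await Promise.all(paths.map(render))
  for (let i = 0; i < paths.length; i++) await index(paths[i], htmls[i])
}

// After: each page flows straight from render to index; nothing is
// accumulated, and work on independent pages overlaps.
async function dispatchPerPage(paths: string[], render: Render, index: Index) {
  await Promise.all(paths.map(async (p) => index(p, await render(p))))
}
```

The second shape is what lets a concurrency pool keep all cores busy instead of waiting for the slowest page in each phase.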

What's next

I've implemented multiprocessing that works on top of this PR. That implementation is not yet included here because (1) I am not sure whether adding a new package (JSDOM) is acceptable, and (2) I do not know whether people would agree with the idea of multiprocessing.

As you can see in this commit, the new miniSearch.splitIntoSections() config option allows an AsyncGenerator as its return value, and the AsyncGenerator can be used to proxy a worker thread using the JS event model.
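A minimal sketch of how an AsyncGenerator-shaped splitter decouples the consumer from where sections come from (the `Section` shape and the splitting logic here are illustrative, not the plugin's actual implementation):

```typescript
interface Section {
  anchor: string
  titles: string[]
  text: string
}

// Hypothetical splitter: yields sections one by one as they become
// available. In a worker-backed setup, each `yield` would await a
// message event from the worker instead of computing inline.
async function* splitIntoSections(html: string): AsyncGenerator<Section> {
  const parts = html.split('<h2>')
  for (const part of parts.slice(1)) {
    const title = part.slice(0, part.indexOf('</h2>'))
    yield { anchor: title.toLowerCase(), titles: [title], text: part }
  }
}

// The consumer just `for await`s; it cannot tell whether sections are
// produced locally or proxied from a worker thread.
async function collect(html: string): Promise<Section[]> {
  const out: Section[] = []
  for await (const s of splitIntoSections(html)) out.push(s)
  return out
}
```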

One important reason I chose the local-search plugin to start experimenting with is data isolation. The splitting task only needs the HTML string, so it is easy to strip out of the build context. In my project, the IPC overhead is reduced to a minimum since the main process only sends a path, not the entire content.

However, if done correctly, many other stages can also be refactored to run in parallel. For example, renderPage() looks very promising: it currently takes relatively long to complete, especially for large volumes of content. If we can make that part parallel, I believe almost everyone would see a noticeable speedup.

The speedup is crazy

Both builds were performed on an Apple M1 Max, building this project.

Before: 16008.00 seconds (4.4 hours)

  vitepress v1.0.0-rc.32

⠹ building client + server bundles...
(!) Some chunks are larger than 500 kB after minification. Consider:
- Using dynamic import() to code-split the application
- Use build.rollupOptions.output.manualChunks to improve chunking: https://rollupjs.org/configuration-options/#output-manualchunks
- Adjust chunk size limit for this warning via build.chunkSizeWarningLimit.
/Users/Yuxuan/Lab/xorg-doc/docs/index.md
✓ building client + server bundles...
✓ rendering pages...
build complete in 16008.00s.

After: 248.53 seconds (about 4 minutes)

  vitepress v1.0.0-rc.32

🔍️ Indexing files for search...
✅ Indexing finished...
⠹ building client + server bundles...
(!) Some chunks are larger than 500 kB after minification. Consider:
- Using dynamic import() to code-split the application
- Use build.rollupOptions.output.manualChunks to improve chunking: https://rollupjs.org/configuration-options/#output-manualchunks
- Adjust chunk size limit for this warning via build.chunkSizeWarningLimit.
🔍️ Indexing files for search...
✅ Indexing finished...
✓ building client + server bundles...
✓ rendering pages...
build complete in 248.53s.

Problem to be fixed

The default section splitting function splitPageIntoSections() uses two regular expressions to detect and strip headers from HTML sources. However, it assumes that each <h*> element has an <a> element as its first child. This is only true for headings generated from pure markdown sources. Markdown allows embedded HTML elements (vitepress even makes them reactive), and users would expect their own headers to be detected normally, even when they do not follow the RegExp patterns.
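A simplified illustration of the failure mode (this is not the plugin's exact regex, just a pattern with the same first-child-anchor assumption):

```typescript
// Assumes every heading wraps an <a> anchor as its first child, as
// markdown-generated headings do.
const headingWithAnchor = /<h(\d)[^>]*>\s*<a[^>]*>.*?<\/a>(.*?)<\/h\1>/

// Markdown-generated heading: matches.
const fromMarkdown = '<h2 id="setup"><a href="#setup">#</a> Setup</h2>'

// Hand-written heading embedded in markdown: silently skipped, so the
// section never makes it into the search index.
const handWritten = '<h2 id="setup">Setup</h2>'
```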

In addition, the original splitter used a sparse array to track the current header hierarchy, which fails in some very simple cases. For example:

.       <!-- expected | actual   -->
#   A1  <!-- A1       | A1       -->
##  A2  <!-- A1 A2    | A1 A2    -->
#   B1  <!-- B1       | B1 A2    -->
### B3  <!-- B1 B3    | B1 A2 B3 -->

All these problems have been fixed in the worker implementation in my project. I can spend some time porting them back (with or without multiprocessing).

@zhangyx1998 (Contributor, Author) commented:

Force pushed to keep up with main branch (use pMap instead).
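For readers unfamiliar with p-map: it runs a promise-returning mapper over an iterable with a bounded number of tasks in flight, which is the pooling behavior this PR relies on. A minimal sketch of that idea (illustrative only; the PR uses the actual p-map package):

```typescript
// Run at most `concurrency` mappers at once, preserving result order.
async function pool<T, R>(
  items: T[],
  mapper: (item: T) => Promise<R>,
  concurrency: number
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  // Each worker repeatedly claims the next unclaimed index. The claim is
  // synchronous between awaits, so no two workers get the same index.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++
      results[i] = await mapper(items[i])
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, worker)
  )
  return results
}
```

With something like `await pool(pages, indexPage, 8)`, heavy per-page splitting overlaps instead of running strictly one page at a time.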

@brc-dd brc-dd self-assigned this Dec 27, 2023
types/default-theme.d.ts — review comment (outdated, resolved)
@brc-dd brc-dd changed the title refactor(localSearch): add concurrency pooling, cleanup logic, improve performance perf(localSearch): add concurrency pooling, cleanup logic, improve performance Dec 30, 2023
@brc-dd (Member) left a comment:

LGTM, I'll merge this after testing once with unocss docs or something.

@zhangyx1998 (Contributor, Author) commented:

> LGTM, I'll merge this after testing once with unocss docs or something.

Sure, I will create another PR to port the JSDOM-based section splitter back after this one gets merged. After that, all users should see the performance benefit as well as improved indexing robustness.

@brc-dd brc-dd merged commit ac5881e into vuejs:main Dec 30, 2023
7 checks passed
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 7, 2024
Successfully merging this pull request may close these issues.

[local search] indexing too slow to be usable for large sites