perf: lazy-load fasttext quality model in context_scraping.utils #55
Open
obchain wants to merge 1 commit into
Conversation
`opendeepsearch.context_scraping.utils` used to call `fasttext.load_model(hf_hub_download(...))` at module import time, which pulled `model.bin` (~hundreds of MB) from HuggingFace for every consumer of the package — including users who never enable `filter_content` and therefore never reach `predict_educational_value`. Move the download/load behind a `_get_model()` lazy accessor cached in a module-level `_model` singleton so the artifact is only fetched the first time `predict_educational_value` is actually called. Also drops `import opendeepsearch` time from "blocks on HuggingFace" to "plain Python import" on machines with no model cache, which avoids the surprise download reported in sentient-agi#32 when only the Jina cloud reranker is configured.
What
Stop loading the `kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2` model at import time in `opendeepsearch.context_scraping.utils`. Move the download / `fasttext.load_model` behind a `_get_model()` accessor that caches into a module-level `_model` singleton, called only inside `predict_educational_value`.

Closes #56
Why
The `clean_html` / `get_wikipedia_content` helpers in that module are imported on the hot path of `WebScraper` and (transitively) of `OpenDeepSearchAgent`. Today, `import opendeepsearch` triggers `model = fasttext.load_model(hf_hub_download(...))` at module scope, which:

- downloads `model.bin` (~hundreds of MB) for every consumer of the package, including users who never enable `filter_content` and never reach `predict_educational_value`;
- breaks `import opendeepsearch` entirely on machines with no network or no write access to the HF cache.

How
In `src/opendeepsearch/context_scraping/utils.py`:

- replace the module-scope `model = fasttext.load_model(...)` with `_model: Optional[...] = None` plus a `_get_model()` accessor that loads-and-caches on first call (see the sketch after this list);
- swap the `model.predict(...)` inside `predict_educational_value` for `_get_model().predict(...)`.
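A minimal sketch of the pattern, reconstructed from this description rather than copied from the diff — the repo id and `model.bin` filename come from the PR text, while the `k=-1` argument and the annotation type are assumptions:

```python
from typing import Optional

import fasttext
from huggingface_hub import hf_hub_download

# Module-level singleton; stays None until the quality filter is first used.
_model: Optional["fasttext.FastText._FastText"] = None


def _get_model() -> "fasttext.FastText._FastText":
    """Download (if needed) and load the classifier on first call, then reuse it."""
    global _model
    if _model is None:
        _model = fasttext.load_model(
            hf_hub_download(
                repo_id="kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2",
                filename="model.bin",
            )
        )
    return _model


def predict_educational_value(text_list):
    # The only call-site change: the old module-level `model` becomes the
    # lazy accessor. The rest of the original scoring logic is unchanged
    # and elided here.
    predictions = _get_model().predict(text_list, k=-1)
    ...
```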
No public API changes — `predict_educational_value`, `clean_markdown_links`, `filter_quality_content`, `get_wikipedia_content` keep their signatures and behaviour. The download still happens the first time the quality filter runs, just not on `import`.

Testing
- `python3 -m py_compile src/opendeepsearch/context_scraping/utils.py` — clean.
- `grep -n 'model\.predict\|fasttext\.load_model' src/opendeepsearch/context_scraping/utils.py` — only the call inside `_get_model` and the `_get_model()` call inside `predict_educational_value` remain; no module-scope `model` name is left.
- `python3 -c "from opendeepsearch.context_scraping import utils"` no longer triggers a HuggingFace request; `utils.predict_educational_value(["sample"])` still works and lazily caches the model on first call (verified by patching `_get_model` and asserting it is called exactly once across repeated invocations).
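A pytest sketch of an equivalent caching check — it exercises the accessor directly rather than patching `_get_model` as described above, and it assumes `utils` imports `fasttext` and `hf_hub_download` at module level (names beyond those in the PR text are assumptions):

```python
from unittest.mock import MagicMock, patch

from opendeepsearch.context_scraping import utils


def test_get_model_downloads_and_loads_only_once(monkeypatch):
    monkeypatch.setattr(utils, "_model", None)  # reset the cached singleton
    fake_model = MagicMock(name="fasttext_model")
    # Assumes `from huggingface_hub import hf_hub_download` and
    # `import fasttext` at the top of utils.py.
    with patch.object(utils, "hf_hub_download", return_value="/tmp/model.bin") as dl, \
         patch.object(utils.fasttext, "load_model", return_value=fake_model) as load:
        first = utils._get_model()
        second = utils._get_model()
    assert first is second is fake_model  # same cached instance both times
    assert dl.call_count == 1   # one HuggingFace fetch
    assert load.call_count == 1  # one fasttext load
```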