feat: add support to stopword presets #2008
Orca Security Scan Summary
| Status | Check | Issues by priority | |
|---|---|---|---|
| Secrets | View in Orca |
Pull request overview
Adds client-side support for Weaviate 1.37+ “stopword presets”, enabling collection-level user-defined stopword lists and per-property selection via text_analyzer.stopword_preset, with appropriate version gating and config parsing/serialization.
Changes:
- Extend inverted index config models (create/update + read) to include `stopwordPresets`/`stopword_presets`.
- Extend `TextAnalyzerConfig` to support `stopword_preset` (built-in enum or user-defined string) and update JSON parsing to treat stopwordPreset-only analyzers as meaningful.
- Add version gates for stopword presets in collection create/update flows and add unit + integration test coverage.
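The parsing change described above (keeping an analyzer whose only setting is `stopwordPreset`) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `parse_text_analyzer` is a hypothetical helper, and the rule "any non-null setting makes the analyzer meaningful" is an assumption about the intended behavior.

```python
from typing import Any, Dict, Optional


def parse_text_analyzer(prop: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: return a property's analyzer settings, or None.

    Assumption: an analyzer counts as meaningful as soon as any of its
    settings is non-null, so one that only carries stopwordPreset is kept.
    """
    raw = prop.get("textAnalyzer") or {}
    settings = {key: val for key, val in raw.items() if val is not None}
    return settings or None


# A stopwordPreset-only analyzer is kept rather than dropped.
title_en = parse_text_analyzer(
    {"name": "title_en", "textAnalyzer": {"stopwordPreset": "en"}}
)
# A property without analyzer settings still parses to None.
plain = parse_text_analyzer({"name": "plain"})
```

The two calls mirror the `title_en` and `plain` assertions that appear in the integration test under review.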
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| weaviate/collections/config/executor.py | Adds version gate to reject inverted_index_config.stopwordPresets on Weaviate < 1.37.0 during config update. |
| weaviate/collections/collections/executor.py | Adds version gate for stopword presets on create; updates text analyzer gate message. |
| weaviate/collections/classes/config.py | Adds stopword presets to inverted index config models; adds stopword_preset to TextAnalyzerConfig; wires through Configure/Reconfigure helpers. |
| weaviate/collections/classes/config_methods.py | Parses stopwordPresets into config objects; ensures stopwordPreset-only text analyzers are not dropped. |
| weaviate/collections/classes/config_base.py | Allows dict-valued fields to be merged into existing schema during updates. |
| test/collection/test_config.py | Adds unit tests for stopword preset serialization and update merge behavior. |
| test/collection/test_config_methods.py | Adds tests for parsing stopword preset fields from schema JSON. |
| integration/test_collection_config.py | Adds integration coverage for stopword presets behavior, updates, and version gating (plus additional analyzer behavior assertions). |
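The version gates listed in the table could look roughly like the sketch below. It is an illustration under stated assumptions: `parse_version`, `check_stopword_presets_supported`, and `UnsupportedFeatureError` are hypothetical names, and the config is treated as a plain dict rather than the client's real model classes. Only the `stopwordPresets` key and the 1.37.0 threshold come from the PR summary.

```python
from typing import Any, Dict, Optional, Tuple

# Minimum server version for stopword presets, per the PR summary.
MIN_STOPWORD_PRESETS_VERSION = (1, 37, 0)


class UnsupportedFeatureError(Exception):
    """Hypothetical error type for features the server is too old for."""


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse '1.37.0', 'v1.37.0' or '1.37.0-rc.1' into a comparable tuple."""
    core = version.lstrip("v").split("-", 1)[0]  # drop prefix and pre-release tag
    parts = [int(part) for part in core.split(".")]
    parts += [0] * (3 - len(parts))  # pad '1.37' to (1, 37, 0)
    return (parts[0], parts[1], parts[2])


def check_stopword_presets_supported(
    server_version: str, inverted_index_config: Optional[Dict[str, Any]]
) -> None:
    """Raise if stopwordPresets is set but the server predates 1.37.0."""
    if not inverted_index_config:
        return
    if inverted_index_config.get("stopwordPresets") is None:
        return  # feature not used, so nothing to gate
    if parse_version(server_version) < MIN_STOPWORD_PRESETS_VERSION:
        raise UnsupportedFeatureError(
            f"stopword presets require Weaviate >= 1.37.0, got {server_version}"
        )
```

Note that this sketch treats a pre-release such as `1.37.0-rc.1` as `1.37.0`; whether the real gate does the same is not stated in the PR.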
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## feat/ascii-fold #2008 +/- ##
==================================================
Coverage ? 87.59%
==================================================
Files ? 280
Lines ? 22018
Branches ? 0
==================================================
Hits ? 19286
Misses ? 2732
Partials ? 0
    assert title_en.text_analyzer.stopword_preset == "en"
    assert plain.text_analyzer is None

    collection.data.insert(
Same as in the other PR: can you test only client-specific things here? E.g. set the settings, fetch the config, and do a round trip.
            ),
        ],
    )
One test that I think is missing is a round trip:
- create a collection with stopwords
- fetch the config (and assert correctness)
- change the collection name in the config
- recreate the collection from those settings and fetch the new config to compare before and after

This makes sure that all the de/serialization is correct.
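The four steps above can be sketched as a small helper. This is only an outline under assumptions: `assert_config_roundtrip` is a hypothetical name, `create` and `fetch` are placeholders for whatever client calls create a collection from a config and export it again, and the dict layout (`class`, `invertedIndexConfig.stopwordPresets` as a name-to-word-list mapping, `textAnalyzer.stopwordPreset`) is an assumed wire shape. An in-memory store stands in for the server so the sketch runs on its own.

```python
from copy import deepcopy
from typing import Any, Callable, Dict

Schema = Dict[str, Any]


def assert_config_roundtrip(
    create: Callable[[Schema], None],
    fetch: Callable[[str], Schema],
    schema: Schema,
    new_name: str,
) -> None:
    """Create, fetch, rename, recreate, then compare the two fetched configs."""
    # 1. Create the collection with stopword settings.
    create(schema)

    # 2. Fetch the config and check that the stopword presets came back intact.
    first = fetch(schema["class"])
    expected = schema["invertedIndexConfig"]["stopwordPresets"]
    assert first["invertedIndexConfig"]["stopwordPresets"] == expected

    # 3. Change only the collection name.
    renamed = deepcopy(first)
    renamed["class"] = new_name

    # 4. Recreate from the fetched config; apart from the name, nothing may differ.
    create(renamed)
    second = fetch(new_name)
    assert second == renamed


# In-memory stand-in for the server (assumption, for illustration only).
store: Dict[str, Schema] = {}


def fake_create(schema: Schema) -> None:
    store[schema["class"]] = deepcopy(schema)


def fake_fetch(name: str) -> Schema:
    return deepcopy(store[name])


schema: Schema = {
    "class": "Articles",
    "invertedIndexConfig": {"stopwordPresets": {"my_preset": ["foo", "bar"]}},
    "properties": [
        {"name": "title_en", "textAnalyzer": {"stopwordPreset": "en"}},
        {"name": "title_custom", "textAnalyzer": {"stopwordPreset": "my_preset"}},
    ],
}
assert_config_roundtrip(fake_create, fake_fetch, schema, "ArticlesCopy")
```

In the real integration test the two callables would wrap the client, and the step-2 assertion checks only the stopword fields because a server may add defaults to the rest of the config.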
    # Pydantic preserves the StopwordsPreset enum instance through model_dump,
    # but the wire format must be a plain string. Coerce at construction time.
    if isinstance(v, StopwordsPreset):
        return v.value
We usually use a _to_dict method; see nestedProperties for an example.
E.g. in class Property, add

    if self.textAnalyzer is not None:
        ret_dict["textAnalyzer"] = self.textAnalyzer._to_dict()

and then add a _to_dict method to textAnalyzer that does whatever transformation we need. I think it will work without any changes to textAnalyzer, as the base class already handles enums:

    class _ConfigCreateModel(BaseModel):
        model_config = ConfigDict(strict=True)

        def _to_dict(self) -> Dict[str, Any]:
            ret = cast(dict, self.model_dump(exclude_none=True))
            for key, val in ret.items():
                if isinstance(val, Enum):
                    ret[key] = val.value
            return ret
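To illustrate the suggestion, here is a small self-contained sketch of the pattern. It is not the client's code: stdlib dataclasses stand in for the Pydantic base class quoted above so the example runs without dependencies, and the enum members and field lists are assumptions. The point it shows is that the base `_to_dict` only converts top-level enum values, so `Property` has to delegate to the nested analyzer's own `_to_dict`.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import Any, Dict, Optional, Union


class StopwordsPreset(str, Enum):
    # Assumed members; only "en" appears in the PR's tests.
    EN = "en"
    NONE = "none"


class _ConfigCreateModel:
    """Stdlib stand-in for the Pydantic base class quoted above."""

    def _to_dict(self) -> Dict[str, Any]:
        ret: Dict[str, Any] = {}
        for field in fields(self):  # subclasses are dataclasses
            val = getattr(self, field.name)
            if val is None:
                continue  # mirrors model_dump(exclude_none=True)
            # Only top-level enum values are converted here.
            ret[field.name] = val.value if isinstance(val, Enum) else val
        return ret


@dataclass
class TextAnalyzerConfig(_ConfigCreateModel):
    # Built-in enum or the name of a user-defined preset.
    stopwordPreset: Optional[Union[StopwordsPreset, str]] = None


@dataclass
class Property(_ConfigCreateModel):
    name: str
    textAnalyzer: Optional[TextAnalyzerConfig] = None

    def _to_dict(self) -> Dict[str, Any]:
        ret_dict = super()._to_dict()
        # Delegate to the nested model so its enum is converted as well.
        if self.textAnalyzer is not None:
            ret_dict["textAnalyzer"] = self.textAnalyzer._to_dict()
        return ret_dict


builtin = Property(
    name="title_en",
    textAnalyzer=TextAnalyzerConfig(stopwordPreset=StopwordsPreset.EN),
)._to_dict()
custom = Property(
    name="title_custom",
    textAnalyzer=TextAnalyzerConfig(stopwordPreset="my_preset"),
)._to_dict()
```

With this shape the enum-to-string coercion happens at serialization time rather than at construction time, which appears to be what the review comment is asking for.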
No description provided.