feat(search): support searchNormalize across non-Latin characters and intra-word punctuation by gnbm · Pull Request #466 · sa-si-dev/virtual-select

gnbm · 2026-04-24T23:29:52Z

Issue number: resolves #279

What is the current behavior?

The normalizeString utility used the regex /[^\w]/g to strip non-word characters after NFD decomposition.
In JavaScript, \w matches only [A-Za-z0-9_], so every character from a non-Latin script (Greek, Cyrillic, Chinese, Japanese, Korean, Arabic, Thai, …) was treated as non-word and stripped during normalization. As a result, searchNormalize: true was broken for those scripts: labels normalized to empty strings, so nothing could be matched.
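
To make the failure concrete, here is a minimal sketch of the old pipeline. The function name legacyNormalize is hypothetical, and the real utility may do more (case handling, for example); only the NFD step and the /[^\w]/g strip come from the description above.

```javascript
// Hypothetical reconstruction of the pre-PR behavior (illustrative name).
function legacyNormalize(text) {
  return text
    .normalize('NFD')        // split accented letters into base letter + combining mark
    .replace(/[^\w]/g, '');  // \w is ASCII-only, so non-Latin letters are removed as well
}

const legacyResults = {
  latin: legacyNormalize('Crème brûlée'), // 'Cremebrulee'
  greek: legacyNormalize('Ένα'),          // ''
  cyrillic: legacyNormalize('Ёжик'),      // ''
  chinese: legacyNormalize('北京'),        // ''
};
console.log(legacyResults);
```

Latin text survives because its base letters are ASCII after decomposition; the other labels collapse to the empty string.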

What is the new behavior?

normalizeString now performs a Unicode-aware two-pass strip after NFD decomposition:

1. /\p{M}/gu: strips Unicode combining marks (category M). Removes diacritics across all scripts while preserving the underlying letters/ideographs.
2. /[^\p{L}\p{N}_]/gu: strips characters that are not Letters, Numbers, or underscore. Restores the punctuation- and whitespace-insensitive behavior the original /[^\w]/g provided for ASCII content (e.g. co-op still matches coop), in a script-aware way.

Both regexes are defined at module scope so they are compiled once instead of per call.
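
The two passes described above could look roughly like this. It is a sketch, not the actual source of src/utils/utils.js: the function name normalizeStringSketch is hypothetical, the constant names follow the PR text, and case folding is assumed to happen elsewhere in the search code.

```javascript
// Module-scope constants, so each RegExp object is created once.
const COMBINING_MARKS_REGEX = /\p{M}/gu;          // any Unicode combining mark
const NON_WORD_CHARS_REGEX = /[^\p{L}\p{N}_]/gu;  // anything that is not a letter, number, or "_"

// Hypothetical stand-in for Utils.normalizeString.
function normalizeStringSketch(text) {
  return String(text)
    .normalize('NFD')                    // decompose: "É" -> "E" + U+0301
    .replace(COMBINING_MARKS_REGEX, '')  // pass 1: drop the combining marks
    .replace(NON_WORD_CHARS_REGEX, '');  // pass 2: drop punctuation and whitespace
}

console.log(normalizeStringSketch('Ένα'));      // 'Ενα'
console.log(normalizeStringSketch('Việt Nam')); // 'VietNam'
console.log(normalizeStringSketch('co-op'));    // 'coop'
console.log(normalizeStringSketch('北京'));      // '北京'
```

Because String.prototype.replace resets lastIndex for global regexes, sharing the two constants across calls is safe.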

Language coverage

searchNormalize: true now works correctly for a single dropdown containing options across many writing systems:

| Script | Example | Search input | Status |
| --- | --- | --- | --- |
| Latin (French/Spanish) | `Crème brûlée`, `Niño` | `creme`, `nino` | ✅ matches |
| German (ä, ö, ü) | `München`, `Mädchen`, `Köln` | `Munchen`, `Madchen`, `Koln` | ✅ matches |
| German `ß` | `Größe` | `Grosse` | ⚠️ does not match: `ß` is an atomic letter (no NFD decomposition) |
| Norwegian `å` | `Ålesund` | `Alesund` | ✅ matches |
| Norwegian `ø`, `æ` | `Bjørn`, `Tromsø` | `Bjorn`, `Tromso` | ⚠️ does not match: `ø` and `æ` are atomic letters |
| Swedish (å, ä, ö) | `Göteborg`, `Malmö` | `Goteborg`, `Malmo` | ✅ matches |
| Finnish (ä, ö) | `Jyväskylä`, `Hämeenlinna` | `Jyvaskyla`, `Hameenlinna` | ✅ matches |
| Greek | `Ένα` | `Ενα` | ✅ matches |
| Cyrillic | `Ёжик`, `Йогурт` | `Ежик`, `Иогурт` | ✅ matches |
| Vietnamese | `Việt Nam`, `Hà Nội` | `Viet Nam`, `Ha Noi` | ✅ matches |
| Arabic (tashkeel) | `مُرَحَّباً` | `مرحبا` | ✅ matches |
| Korean (Hangul) | `서울`, `한국어` | exact text | ✅ matches (NFD-symmetric on both sides) |
| Chinese | `北京`, `你好` | exact text | ✅ matches (no combining marks; previously broken) |
| Japanese kanji & katakana | `東京`, `カタカナ` | exact text | ✅ matches (previously broken) |
| Intra-word punctuation | `co-op`, `e-mail` | `coop`, `email` | ✅ matches |

Scripts that rely on combining marks (Thai vowel signs, Devanagari matras, Japanese hiragana voicing marks like dakuten/handakuten) are NOT fully preserved because every Unicode combining mark is stripped (e.g. สวัสดี → สวสด, が → か). This enables fuzzy matching but loses some semantic precision. Use searchNormalize: false if exact-match behavior is required for those scripts.
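
The loss described above can be reproduced with a few lines. This sketch isolates the mark-stripping step only; the helper name stripMarks is hypothetical and is not part of the library.

```javascript
// Hypothetical helper: NFD-decompose, then remove every combining mark.
const stripMarks = (text) => text.normalize('NFD').replace(/\p{M}/gu, '');

const thai = stripMarks('สวัสดี');   // the two vowel signs are category M and are removed
const hiragana = stripMarks('が');  // U+304C decomposes to U+304B + U+3099 (voicing mark)

console.log(thai);      // 'สวสด'
console.log(hiragana);  // 'か'
```

After normalization が and か are indistinguishable, which is why searchNormalize: false is the safer choice when that distinction matters.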

Atomic letters (ø, æ, ß, etc.) are not decomposable under NFD and are therefore preserved literally — typing the ASCII fallback (Bjorn, Grosse) will not match the original (Bjørn, Größe). This is a Unicode-level limitation, not a regex limitation.
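
Whether a letter is affected depends only on whether Unicode defines a canonical decomposition for it. A small check, using a hypothetical helper name, illustrates the difference between å/ä (decomposable) and ø/æ/ß (atomic):

```javascript
// True when NFD splits the character into a base letter plus combining mark(s).
const decomposesUnderNfd = (ch) => ch.normalize('NFD') !== ch;

const letters = {
  aRing: '\u00E5',      // å
  aDiaeresis: '\u00E4', // ä
  oSlash: '\u00F8',     // ø
  ae: '\u00E6',         // æ
  sharpS: '\u00DF',     // ß
};

const decomposition = Object.fromEntries(
  Object.entries(letters).map(([name, ch]) => [name, decomposesUnderNfd(ch)]),
);
console.log(decomposition);
// { aRing: true, aDiaeresis: true, oSlash: false, ae: false, sharpS: false }
```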

Punctuation/whitespace are now stripped from normalized values (matching the original ASCII behavior of /[^\w]/g). Search remains symmetric — both labels and the search query go through the same pipeline — so multi-word labels like Việt Nam still match Viet Nam, and labels with intra-word punctuation like co-op now correctly match coop again.

Performance

Both COMBINING_MARKS_REGEX and NON_WORD_CHARS_REGEX are defined at module scope instead of being re-created inside normalizeString() on every call.
Build toolchains (e.g. Babel targeting ES5/ES2015) transpile /\p{M}/gu into a ~2 KB character-class regex. Re-compiling that pattern on each keystroke during search was unnecessarily expensive.
Hoisting to module scope means each regex is compiled once. Benchmarked against 10,000 calls with the actual transpiled pattern: ~29% faster vs. the original in-function regex.
The added second .replace() pass is O(n) on already-short strings and adds no measurable overhead in real workloads.
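
The hoisting change can be sketched as follows. The function names and the timing helper are hypothetical; the ~29% figure above comes from the PR's own benchmark against the transpiled pattern and is not reproduced here. With native property escapes the difference may be much smaller, so treat any numbers this prints as indicative only.

```javascript
// Hoisted: both RegExp objects are created once, when the module is evaluated.
const COMBINING_MARKS_REGEX = /\p{M}/gu;
const NON_WORD_CHARS_REGEX = /[^\p{L}\p{N}_]/gu;

function normalizeHoisted(text) {
  return text
    .normalize('NFD')
    .replace(COMBINING_MARKS_REGEX, '')
    .replace(NON_WORD_CHARS_REGEX, '');
}

// In-function: the regex literals are evaluated again on every call.
function normalizeInline(text) {
  return text
    .normalize('NFD')
    .replace(/\p{M}/gu, '')
    .replace(/[^\p{L}\p{N}_]/gu, '');
}

// Rough timing helper; results depend on engine, build target, and input.
function timeIt(fn, input, iterations) {
  const start = performance.now();
  for (let i = 0; i < iterations; i += 1) fn(input);
  return performance.now() - start;
}

const sample = 'Crème brûlée';
const hoistedMs = timeIt(normalizeHoisted, sample, 10000);
const inlineMs = timeIt(normalizeInline, sample, 10000);
console.log({ hoistedMs, inlineMs });
```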

Documentation and examples

Added a unified Multi-language search normalize section that demonstrates a single dropdown spanning Latin (French/Spanish/German/Norwegian/Swedish/Finnish), Greek, Cyrillic, Vietnamese, Chinese, Japanese, Korean, Arabic, and Thai — both with searchNormalize: true and searchNormalize: false variants for direct comparison.
Added tags variant (showValueAsTags) and popup variant (popupDropboxBreakpoint) sub-sections under the same multi-language data set, each with both searchNormalize: true and false dropdowns.
Each language entry in the live demo dropdown now has 5–10 representative examples for manual testing.
Added intra-word punctuation entries (co-op, e-mail) so the regression behavior is visible in the live demo.
Updated docs/examples.md, docs/assets/script.js, and the table of contents.
Expanded JSDoc on Utils.normalizeString to call out the limitation that combining-mark-dependent scripts (Thai, Devanagari, hiragana voicing) are NOT fully preserved by the normalization pipeline.

Tests

Added Cypress describe blocks against the unified multi-language dropdowns:

Multi-language search with searchNormalize: true (~25 specs) covers all listed scripts including positive cases (e.g. Munchen → München, Goteborg → Göteborg, Jyvaskyla → Jyväskylä, Ежик → Ёжик, Viet Nam → Việt Nam, مرحبا → مُرَحَّباً) and the documented atomic-letter limitations (Grosse does NOT match Größe; Bjorn does NOT match Bjørn).
Multi-language search with searchNormalize: false verifies exact matches succeed (Greek/Cyrillic/Chinese/Japanese/Korean exact text) and that accent-stripped queries correctly find no options across all scripts.
Intra-word punctuation regression coverage: positive specs verify coop → co-op and email → e-mail under searchNormalize: true; a negative spec verifies coop finds nothing under searchNormalize: false. These guard the second .replace() pass against silent regressions.
Multi-language tags variant with searchNormalize: true / false — exercise multi-select with showValueAsTags, including diacritic-insensitive search, tag rendering, and tag removal.
Multi-language popup variant with searchNormalize: true / false — exercise popup mode (popupDropboxBreakpoint) with the same multi-language data.
Latin diacritics regression suite (brulee → brûlée, cafe → café, nino → niño) preserved.

Does this introduce a breaking change?

Yes
No

Behavior for ASCII inputs is preserved vs. the original /[^\w]/g implementation (punctuation- and whitespace-insensitive search). The fix expands correctness to non-Latin scripts; it does not narrow any previously working case.

Validations

Ran regression scenarios in the documentation using the branch - ✅
Ran automated tests - ✅

Replace the previous NON_WORD_REGEX with a COMBINING_MARKS_REGEX (\u0300-\u036f) so normalizeString only removes Unicode combining diacritical marks after NFD normalization. This preserves valid characters (letters, digits, punctuation) instead of stripping all non-word characters.

Replace the normalization regex to strip Unicode combining marks (\u0300-\u036f) so searchNormalize correctly handles Greek and Cyrillic diacritics (e.g. Ένα, ё, й). Update the minified build accordingly. Add example initializations for Greek and Cyrillic selects in docs/assets/script.js and add Cypress E2E tests (cypress/e2e/examples.cy.ts) that verify search behavior with searchNormalize true/false and a regression check for Latin diacritics. Also update docs/examples.md to reflect the new examples.

Regenerate distribution artifacts: update dist/virtual-select.js, dist/virtual-select.min.js, and dist-archive/virtual-select-1.1.5.min.js. This updates the built/minified output to include recent changes from the source (no source code logic changes in this commit).

Bump multiple dev dependencies (Babel toolchain, babel-loader, css-loader, autoprefixer, cypress, cypress-real-events, sass, sass-loader, stylelint, webpack, webpack-cli, filemanager-webpack-plugin, postcss-loader, ts-api-utils/TypeScript, etc.). package-lock.json regenerated to lock the updated versions.

Copilot

Pull request overview

This PR updates the search normalization logic to support Greek and Cyrillic text when searchNormalize: true, and adds docs/examples + Cypress coverage to validate the behavior.

Changes:

Update normalizeString() to strip Unicode combining marks after NFD normalization (instead of stripping non-ASCII “non-word” chars).
Add documentation examples for Greek/Cyrillic normalization and wire them into the docs demo script.
Add Cypress E2E coverage for Greek/Cyrillic normalization and Latin-diacritics regression checks.

Reviewed changes

Copilot reviewed 8 out of 13 changed files in this pull request and generated 5 comments.

Show a summary per file

| File | Description |
| --- | --- |
| src/utils/utils.js | Updates the normalization regex used by `searchNormalize`. |
| package.json | Updates devDependencies (build/test tooling) and adds `ts-api-utils`. |
| docs/examples.md | Adds a new Greek/Cyrillic “searchNormalize” example section. |
| docs/assets/virtual-select.js | Updates built docs asset to reflect new normalization logic. |
| docs/assets/script.js | Initializes new Greek/Cyrillic example selects in the docs demo page. |
| dist/virtual-select.js | Updates distributed (unminified) build with new normalization logic. |
| dist/virtual-select.min.js | Updates distributed minified build with new normalization logic. |
| cypress/e2e/examples.cy.ts | Adds E2E tests for Greek/Cyrillic `searchNormalize` and Latin regression. |
| .github/PULL_REQUEST_TEMPLATE.md | Adds a PR template for future contributions. |
| .claude/settings.local.json | Adds Claude tooling permissions config. |


Add two examples to docs/examples.md demonstrating VirtualSelect configured with searchNormalize: false for Greek and Cyrillic option sets. These examples show search enabled with option descriptions while preserving original character forms, complementing the existing normalized-search examples.

Replace the explicit range /[\u0300-\u036f]/g with the Unicode property escape /\p{M}/gu to strip combining marks. This broadens matching to all Unicode combining marks (not just U+0300–U+036F) while preserving NFD normalization. Note: requires RegExp Unicode property escape support (ES2018+).
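
A short comparison shows what the broader pattern picks up. The inputs are chosen for illustration and the helper name strip is hypothetical; the two regexes are the ones named in the commit message above.

```javascript
// Narrow range from the earlier commit: the Combining Diacritical Marks block only.
const NARROW_MARKS_REGEX = /[\u0300-\u036f]/g;
// Property escape from this commit: every code point in general category M.
const ALL_MARKS_REGEX = /\p{M}/gu;

const strip = (text, regex) => text.normalize('NFD').replace(regex, '');

// U+304C (hiragana GA) decomposes to U+304B (KA) + U+3099 (combining voiced sound mark).
const hiraganaGa = '\u304C';
// Arabic MEEM followed by U+064F (DAMMA), a tashkeel mark outside U+0300-U+036F.
const arabicMeemDamma = '\u0645\u064F';

const comparison = {
  narrowHiragana: strip(hiraganaGa, NARROW_MARKS_REGEX).length,      // 2: mark survives
  broadHiragana: strip(hiraganaGa, ALL_MARKS_REGEX).length,          // 1: mark removed
  narrowArabic: strip(arabicMeemDamma, NARROW_MARKS_REGEX).length,   // 2: mark survives
  broadArabic: strip(arabicMeemDamma, ALL_MARKS_REGEX).length,       // 1: mark removed
};
console.log(comparison);
```

Greek tonos (U+0301) and the Cyrillic diaeresis and breve (U+0308, U+0306) sit inside the narrow range, which is why the earlier commit already handled those two scripts.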

Regenerate built/minified bundles for Virtual Select. Updated dist/virtual-select.js, dist/virtual-select.min.js, dist-archive/virtual-select-1.1.5.min.js and the corresponding docs/assets copies so the committed distribution and documentation assets are in sync with the latest build.

Copilot

Pull request overview

Copilot reviewed 7 out of 12 changed files in this pull request and generated 4 comments.


Copilot

Pull request overview

Copilot reviewed 7 out of 12 changed files in this pull request and generated 2 comments.


Move COMBINING_MARKS_REGEX out of Utils.normalizeString and declare it at the top of src/utils/utils.js so the regex isn't recreated on each call. No functional change; normalizeString now uses the shared, precompiled constant for better clarity and minor performance improvement.

Copilot

Pull request overview

Copilot reviewed 7 out of 12 changed files in this pull request and generated 5 comments.


Replace focused Greek/Cyrillic test suite with a broader multi-language search normalize suite in cypress/e2e/examples.cy.ts (renamed IDs/sections and added many language cases and negative tests). Rebuild/minify output and documentation assets were updated accordingly: dist/, dist-archive/ and docs/assets/* and docs/examples.md reflect the changes. These updates expand coverage for search normalization behavior and sync compiled artifacts and docs with the new test/content changes.

Add multi-language search demos and end-to-end tests to cover search normalization behavior across different scripts. Changes include:
- cypress/e2e/examples.cy.ts: Add Cypress tests for multi-language variants (tags and popup) with searchNormalize true/false, validating matches for diacritics, Cyrillic, and CJK inputs and tag behavior.
- docs/assets/script.js: Initialize new VirtualSelect instances for the added demo elements (#multi-language-tags-search-select, #multi-language-tags-search-no-normalize-select, #multi-language-popup-search-select, #multi-language-popup-search-no-normalize-select).
- docs/examples.md: Add documentation and example initialization snippets for the new multi-language tags and popup demos, and move/update the note about Thai and Japanese combining marks.

These additions ensure consistent behavior is demonstrated and tested for diacritic-insensitive vs exact matching across multiple scripts.

Modify src/utils/utils.js (utility functions updated) and regenerate distribution and documentation bundles. Updated files include dist/virtual-select.js, dist/virtual-select.min.js, dist-archive/virtual-select-1.1.5.min.js, docs/assets/virtual-select.js, and docs/assets/virtual-select.min.js to incorporate the utils changes and produce updated minified/non-minified artifacts.

Copilot

Pull request overview

Copilot reviewed 8 out of 13 changed files in this pull request and generated 2 comments.


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 8 out of 13 changed files in this pull request and generated 3 comments.


- Add NON_WORD_CHARS_REGEX (/[^\p{L}\p{N}_]/gu) at module scope and chain it after combining-mark stripping in Utils.normalizeString. Restores the punctuation-/whitespace-insensitive matching the original /[^\w]/g provided for ASCII (e.g. "co-op" matches "coop") in a Unicode-aware way while keeping the non-Latin script fix.
- Expand JSDoc to document the full contract, including the limitation that scripts relying on combining marks (Thai vowel signs, Devanagari matras, hiragana/katakana voicing) are not fully preserved.
- Add Cypress regression specs covering co-op/coop and e-mail/email matching under searchNormalize: true, plus a negative spec under searchNormalize: false to lock in the symmetric behavior.
- Add co-op and e-mail entries to the multi-language demo dataset so the behavior is exercised in the live docs as well.

Copilot

Pull request overview

Copilot reviewed 8 out of 13 changed files in this pull request and generated 4 comments.


+    "@babel/core": "^7.29.0",
+    "@babel/preset-env": "^7.29.2",
+    "autoprefixer": "^10.5.0",
+    "babel-loader": "^10.1.1",
+    "css-loader": "^7.1.4",
+    "cypress": "^15.14.1",
+    "cypress-real-events": "^1.15.0",
    "docsify-cli": "^4.4.4",
    "eslint": "^8.57.1",
    "eslint-config-airbnb-base": "^15.0.0",
    "eslint-import-resolver-webpack": "^0.13.10",
    "eslint-plugin-import": "^2.32.0",
    "eslint-plugin-sonarjs": "^3.0.4",
-    "filemanager-webpack-plugin": "^9.0.1",
+    "filemanager-webpack-plugin": "^10.0.1",
    "mini-css-extract-plugin": "^2.9.4",
    "popover-plugin": "^1.0.13",
-    "postcss-loader": "^8.1.1",
-    "sass": "^1.90.0",
-    "sass-loader": "^16.0.5",
-    "stylelint": "^16.23.1",
+    "postcss-loader": "^8.2.1",
+    "sass": "^1.99.0",
+    "sass-loader": "^16.0.7",
+    "stylelint": "^16.26.1",
    "stylelint-config-sass-guidelines": "^12.1.0",
-    "typescript": "^5.9.2",
+    "ts-api-utils": "^2.5.0",
+    "typescript": "^5.9.3",
    "unminified-webpack-plugin": "^3.0.0",
-    "webpack": "^5.101.3",
-    "webpack-cli": "^6.0.1"
+    "webpack": "^5.106.2",
+    "webpack-cli": "^7.0.2"


Restores docs/assets/external/vue.css to the master version. The single-line blockquote font-weight delta (600 -> 400) was unrelated to the search-normalization work and was flagged in PR review.

The Multi-language search normalize section in docs/examples.md only described diacritic stripping. Updated to call out that punctuation and whitespace are also stripped under searchNormalize: true, with explicit examples (co-op -> coop, Foo Bar -> FooBar) and a note that users who need exact word-boundary or punctuation matching should keep searchNormalize: false.

NON_WORD_CHARS_REGEX (/[^\p{L}\p{N}_]/gu) already removes everything that COMBINING_MARKS_REGEX (/\p{M}/gu) matched, since combining marks are not letters, numbers, or underscore. Collapses the two .replace() passes into one and updates the JSDoc to describe the actual behavior. No functional change to search results.
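
The equivalence claimed in this commit can be spot-checked with a sketch like the one below. The function names and sample strings are illustrative; the argument itself rests on Unicode general categories being disjoint, so a code point in M can never also be in L or N, and underscore is not in M.

```javascript
// Two-pass version from the earlier commits.
const twoPass = (text) => text
  .normalize('NFD')
  .replace(/\p{M}/gu, '')
  .replace(/[^\p{L}\p{N}_]/gu, '');

// Single-pass version: combining marks are already "not letter, number, or underscore".
const onePass = (text) => text
  .normalize('NFD')
  .replace(/[^\p{L}\p{N}_]/gu, '');

const samples = ['Crème brûlée', 'Việt Nam', 'Ёжик', 'Ένα', 'が', 'co-op', 'Mars-2024', '!@#'];
const allEqual = samples.every((sample) => twoPass(sample) === onePass(sample));
console.log(allEqual); // true
```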

Without an explicit target, @babel/preset-env transpiled \p{L}, \p{N}, and \p{M} regex property escapes into multi-kilobyte expanded codepoint ranges, bloating the production bundle. Adding a browserslist that excludes IE11 and dead browsers tells preset-env to keep these constructs native, since every targeted browser supports them.

The lockfile is committed (and must be, for npm ci and reproducible installs), so keeping it listed in .gitignore was contradictory and hid lockfile drift from git status. Removing the entry; the file remains tracked.

Picks up the simplified normalizeString and the new browserslist target. Net effect on the production bundle:
- dist/virtual-select.min.js: 112 KB -> 82 KB (-26%)
- dist/virtual-select.js: 193 KB -> 142 KB (-26%)

Both reductions come from preset-env keeping \p{L}/\p{N} regex property escapes native instead of expanding them to long codepoint ranges.

Adds Cypress coverage for behaviors that were documented but not asserted:
- Whitespace folding: "FooBar" matches "Foo Bar"; "VietNam" matches "Việt Nam".
- Symmetric punctuation: search containing punctuation matches a label without it ("walk-through" finds "walkthrough").
- Numbers preserved (\p{N}): "Mars2024" and "Mars 2024" both find "Mars-2024".
- Leading/trailing whitespace in the search input still resolves to the right option (" creme " finds "Crème brûlée").
- Pure-punctuation search ("!@#") normalizes to "" and currently matches every label; documenting this so any future fix is intentional.

Adds the necessary test data entries to multiLanguageOptions in docs/assets/script.js.
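
The edge cases listed in this commit can be modeled outside Cypress with a small sketch. It assumes, without confirmation from the source, that matching is a case-insensitive substring test on normalized values; the names normalize and matches are hypothetical.

```javascript
// Assumed normalization: NFD, strip non-letter/number/underscore, lowercase.
const normalize = (text) => text
  .normalize('NFD')
  .replace(/[^\p{L}\p{N}_]/gu, '')
  .toLowerCase();

// Assumed matching rule: normalized label contains normalized query.
const matches = (label, query) => normalize(label).includes(normalize(query));

console.log(matches('Foo Bar', 'FooBar'));            // true: whitespace folding
console.log(matches('walkthrough', 'walk-through'));  // true: punctuation in the query
console.log(matches('Mars-2024', 'Mars 2024'));       // true: digits are preserved
console.log(matches('Crème brûlée', ' creme '));      // true: surrounding whitespace ignored
console.log(normalize('!@#') === '');                 // true: nothing left to search for
console.log(matches('Crème brûlée', '!@#'));          // true: '' is a substring of everything
```

The last line is the behavior the commit flags: an empty normalized query matches every label under a plain substring test, so a guard for empty queries would be needed to change it.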

Copilot

Pull request overview

Copilot reviewed 8 out of 16 changed files in this pull request and generated 1 comment.

+    "not ie 11",
+    "not op_mini all",
+    "not dead"
+  ],


gnbm added 6 commits April 24, 2026 17:17

Create settings.local.json

4514a44

Create PULL_REQUEST_TEMPLATE.md

aa6afcf

gnbm added the enhancement New feature or request label Apr 24, 2026

gnbm requested review from Copilot and sa-si-dev April 24, 2026 23:29

Copilot started reviewing on behalf of gnbm April 24, 2026 23:30 View session

Copilot AI reviewed Apr 24, 2026


Comment thread package.json

Comment thread package.json

Comment thread .claude/settings.local.json Outdated

Comment thread src/utils/utils.js Outdated

Comment thread docs/examples.md

gnbm marked this pull request as draft April 25, 2026 00:29

gnbm added 3 commits April 25, 2026 16:14

Delete settings.local.json

db429d1

gnbm requested a review from Copilot April 25, 2026 16:09

Copilot started reviewing on behalf of gnbm April 25, 2026 16:09 View session

Copilot AI reviewed Apr 25, 2026


Comment thread src/utils/utils.js Outdated

Comment thread src/utils/utils.js Outdated

Comment thread cypress/e2e/examples.cy.ts Outdated

Comment thread package.json

gnbm requested a review from Copilot April 25, 2026 16:16

Copilot started reviewing on behalf of gnbm April 25, 2026 16:16 View session

Copilot AI reviewed Apr 25, 2026


Comment thread src/utils/utils.js Outdated

Comment thread package.json

gnbm requested a review from Copilot April 25, 2026 21:16

Copilot started reviewing on behalf of gnbm April 25, 2026 21:17 View session

Copilot AI reviewed Apr 25, 2026


Comment thread cypress/e2e/examples.cy.ts

Comment thread cypress/e2e/examples.cy.ts Outdated

Comment thread cypress/e2e/examples.cy.ts Outdated

Comment thread package.json

Comment thread docs/assets/virtual-select.js Outdated

gnbm added 2 commits April 25, 2026 22:59

gnbm changed the title ~~feature(search): support searchNormalize for Greek and Cyrillic characters~~ feat(search): support searchNormalize across non-Latin scripts and intra-word punctuation May 2, 2026

gnbm requested a review from Copilot May 2, 2026 10:57

Copilot started reviewing on behalf of gnbm May 2, 2026 10:57 View session

Copilot AI reviewed May 2, 2026


Comment thread docs/examples.md Outdated

Comment thread src/utils/utils.js Outdated

gnbm and others added 3 commits May 2, 2026 14:54

Update examples.md

05d473d

Potential fix for pull request finding

d7910e9

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

71d2934

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

gnbm requested a review from Copilot May 2, 2026 13:59

Copilot started reviewing on behalf of gnbm May 2, 2026 13:59 View session

Copilot AI reviewed May 2, 2026


Comment thread src/utils/utils.js Outdated

Comment thread .github/PULL_REQUEST_TEMPLATE.md

Comment thread src/utils/utils.js Outdated

gnbm added 2 commits May 2, 2026 15:12

Rebuild dist and docs assets

a626e39

gnbm requested a review from Copilot May 2, 2026 14:20

Copilot started reviewing on behalf of gnbm May 2, 2026 14:20 View session

Copilot AI reviewed May 2, 2026


gnbm added 3 commits May 2, 2026 15:37

Remove package-lock.json and update .gitignore

cdb0f75

Create package-lock.json

89b7412

Update .gitignore

e8669c9

gnbm marked this pull request as ready for review May 2, 2026 14:38

gnbm changed the title ~~feat(search): support searchNormalize across non-Latin scripts and intra-word punctuation~~ feat(search): support searchNormalize across non-Latin characters and intra-word punctuation May 8, 2026

gnbm added 6 commits May 9, 2026 09:30

Revert accidental vue.css change

5f41536


Untrack package-lock.json from .gitignore

4578c9d


gnbm requested a review from Copilot May 9, 2026 08:33

Copilot started reviewing on behalf of gnbm May 9, 2026 08:33 View session

Copilot AI reviewed May 9, 2026


Comment thread package.json

2 participants
