Skip to content

Make sure that email validators don't match unintentional non-ASCII local parts #1075

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

tats-u
Copy link
Contributor

@tats-u tats-u commented Mar 10, 2025

Follow-up of #1068

EMAIL_REGEX has the same problem as RFC_EMAIL_REGEX.

Also added a test case to email and rfcEmail. The latter has already been assured to be passed thanks to #1068.

Copy link

vercel bot commented Mar 10, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
valibot ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jun 26, 2025 10:11pm

@cyyynthia

This comment was marked as resolved.

@tats-u
Copy link
Contributor Author

tats-u commented Jun 19, 2025

\w matches the following characters with iu flags:

017F; C; 0073; # LATIN SMALL LETTER LONG S
212A; C; 006B; # KELVIN SIGN

I doubt you know this fact. Internationalized emails are not considered by the current regex even now. This is an unintended behavior and should be fixed.

This is also a discussion in the WHATWG HTML tracker (which directly concerns the rfcEmail rule since it's using the HTML spec as its source): whatwg/html#4562

Discuss it only in WHATWG, then Valibot will change only RFC_EMAIL_REGEX. EMAIL_REGEX isn't concerned with the WHATWG spec.

I'll also note that the current regex used by the email validator does not account for valid characters such as '

EMAIL_REGEX has lower quality and is less practical than RFC_EMAIL_REGEX. EMAIL_REGEX is discouraged in my opinion. It exists only for backward compatibility. It should be changed in v2. Where did it taken from?

@tats-u tats-u changed the title Make sure that email validators don't match non-ASCII local parts Make sure that email validators don't match non-ASCII local parts unintentionally Jun 19, 2025
@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Jun 19, 2025
@tats-u tats-u changed the title Make sure that email validators don't match non-ASCII local parts unintentionally Make sure that email validators don't match unintentional non-ASCII local parts Jun 19, 2025
@cyyynthia

This comment was marked as resolved.

@fabian-hiller fabian-hiller self-assigned this Jun 19, 2025
@fabian-hiller fabian-hiller added the fix A smaller enhancement or bug fix label Jun 19, 2025
@fabian-hiller fabian-hiller added this to the v1.2 milestone Jun 19, 2025
Copy link

pkg-pr-new bot commented Jun 19, 2025

Open in StackBlitz

npm i https://pkg.pr.new/valibot@1075

commit: 339b4ec

@tats-u
Copy link
Contributor Author

tats-u commented Jun 19, 2025

In the first place what "internationalized email addresses" match that regex?
I doubt there are ones using non-ASCII characters other than K and ſ.

SMTPUTF8 is not supported by all MTAs. Who uses non-ASCII characters as a local part?

@tats-u
Copy link
Contributor Author

tats-u commented Jun 20, 2025

Have you seen https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Regular_expressions/Character_class_escape#w? What "accented letters" match \w?

@cyyynthia
Copy link

Oh damn it, I mixed up regex flavors again... 😖 You're right here, JS' unicode flag doesn't do anything useful here; for some reason I was thinking with PCRE regex behavior in mind... 🙃

I marked my original comment as resolved, sorry for the confusion! ❤️

I'll open a separate issue for internationalized email support, as I still think it's an important thing to address nonetheless

@tats-u
Copy link
Contributor Author

tats-u commented Jun 22, 2025

You should confirm detailed definitions of character classes in each language unless you are used to them.

e.g. C#: https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#word-character-w

@fabian-hiller
Copy link
Owner

Just as a quick clarification... there is still interest in merging this PR? If so, I will review in soon.

@fabian-hiller
Copy link
Owner

I took a quick look and am not sure if I want to merge it. email is intentional "stricter" and disallows many uncommon things. Maybe it it better to leave this as is and recommend rfcEmail instead.

@tats-u
Copy link
Contributor Author

tats-u commented Jun 27, 2025

there is still interest in merging this PR?

Sure

disallows many uncommon things

This PR also disallows uncommon cases. I don't think you assume or know \w with iu matches the Kelvin symbol and the small long S.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix A smaller enhancement or bug fix size:XS This PR changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants