Validating internationalized mail addresses in <input type="email"> #4562

Open
jrlevine opened this issue Apr 24, 2019 · 129 comments
Labels
addition/proposal (New features or enhancements) · i18n-tracker (Group bringing to attention of Internationalization, or tracked by i18n but not needing response) · needs implementer interest (Moving the issue forward requires implementers to express interest) · topic: forms

Comments

@jrlevine
Copy link

jrlevine commented Apr 24, 2019

This is more or less the same issue as https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 but I think it's worth another look since a lot of things have changed.

The issue is that the e-mail address validation pattern in sec 4.10.5.1.5 only accepts ASCII addresses, not EAI addresses. Since last time, large hosted mail systems including Gmail, Hotmail/Outlook, Yahoo/AOL (soon if not yet), and Coremail handle EAI mail. On smaller systems, Postfix and Exim have EAI support that is enabled by a configuration flag.

On the other side, writing a Javascript pattern to validate EAI addresses has gotten a lot easier, since JS now has Unicode character class patterns like /(\p{L}|\p{N})+/u, which matches a string of Unicode letters and digits.
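For illustration only, here is a rough sketch of the kind of check those property escapes make possible. It is not a proposal for spec text, and the exact character repertoire to allow is precisely what this issue is about; the name and classes below are just placeholders.

```js
// Very rough sketch: "something@something", where both sides may contain
// Unicode letters, digits, and combining marks plus a few ASCII specials.
// Intentionally loose; it only shows that /u-mode property escapes make
// such checks straightforward to write in JavaScript.
const eaiLike = /^[\p{L}\p{N}\p{M}.!#$%&'*+\/=?^_`{|}~-]+@[\p{L}\p{N}\p{M}.-]+$/u;

eaiLike.test("пример@бориса.рф");  // true
eaiLike.test("user@example.com");  // true
eaiLike.test("not an address");    // false
```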

Last time around the consensus seemed to be that EAI input fields should be marked as unicode or eai or the like, since it will be a while before all mail systems handle EAI.

For the avoidance of doubt, when I say EAI, I mean both Unicode local parts and Unicode domain names, since that's what EAI mail systems handle. There is no benefit to translating IDNs to A-labels (the ones with punycode) since that's all handled deep inside the mail system.

@josepharhar
Copy link
Contributor

Hey, coming here from this chrome bug.
If I understand correctly, this means that we would send the email user@ß.com to the server as user@ß.com instead of the punycoded version user@ss.com like we do today, and we would also allow ß@ß.com to pass validation and send it as ß@ß.com.
After reading the concern in this comment, I have a hard time believing that we wouldn't break some servers somewhere. Just because mail servers tend to accept more unicode doesn't mean that every mail server everywhere does now, right?

@rutek
Copy link

rutek commented Jun 6, 2020

@josepharhar I agree that some servers can break (old ones, but e.g. in Poland the most popular e-mail providers are ... not working as they should), but please remember that we are still talking about client-side e-mail field validation.

RFC 6532 was not supported for a long time in much software (e.g. Thunderbird does really strange things when it receives non-encoded UTF-8 mail compliant with RFC 6532 - it's still open in Bugzilla), but up-to-date mail servers allow creating such accounts and sending such mail (Postfix has had support for it since ~2015). It's a complex problem, as e.g. delivery of UTF-8 mail to an old mailbox can lead to some problems, but what else can we do other than progressively upgrade the technologies in use to support it? :)

Anyway, I don't think that it's the browser's responsibility to "protect the backend from problematic e-mail addresses", so if the RFC allows it and up-to-date software supports it, we should allow it.

@jrlevine
Copy link
Author

jrlevine commented Jun 6, 2020

It's more complex than that, and it's not about ß which is an odd special case.

EAI (internationalized) mail can handle addresses like пример@Бориса.РФ. While the domain part can turn into ASCII A-labels xn--80abvxkh.xn--p1ai (sometimes called punycode), the mailbox cannot, and only an EAI mail system can handle that address.
Common MTAs like postfix and exim have EAI support but it's not turned on by default, and there is no way a browser can tell what kind of MTA a remote server has or how it is configured. That's why we need a new input type="eaimail" that accepts EAI addresses, which web sites can use if their MTA handles EAI.

The treatment of ß has nothing to do with this. The obsolete IDN2003 and current IDN2008 internationalized domain names are almost the same but one of the few differences is that 2003 normalizes (not punycodes) ß to ss while 2008 makes it a valid character. An address with an ASCII mailbox like user@ß.com could turn into user@xn--zca.com but ß@ß.com is EAI only. This turns out to matter because there are German domain names with ß in them that your browser cannot reach if it uses the obsolete rules. See my page https://fuß.standcore.com to see what your browser does.
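If you would rather poke at this from script than from the address bar, the URL parser's host handling exposes the same difference. A sketch; it assumes an engine whose URL parser applies UTS 46, and the result depends on whether the obsolete transitional mapping is in effect.

```js
// With non-transitional (IDNA2008-compatible) mapping, ß is a valid label
// character and the host is Punycode-encoded as-is:
new URL("https://fuß.example/").hostname;  // "xn--fu-hia.example"

// Under the obsolete transitional (IDNA2003-style) mapping, ß is first
// folded to "ss", so the same input would come out as "fuss.example".
// Either way this only covers the domain part; an EAI local part such as
// пример@... has no ASCII form at all.
```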

@klensin
Copy link

klensin commented Jun 7, 2020

A few tiny additions and clarifications to John Levine's note (we do not disagree about the situation in any important way; the issues are just a bit more complex, with potentially broader implications, than one might infer from his message, and they may call part of his suggestion into question). In particular, "eaimail" or something like it may be the wrong solution to the problem and may dig us in even deeper. For those who lack the time or inclination to read a fairly long analysis and explanation, skip to the last paragraph.

First, while his explanation of the difficulty with ß is correct, it is perhaps useful to also note that the ß -> ss transformation is often brought about by the improper or premature application of NFKC, which may have been the source of the recent dust-up about phishing attacks using Mathematical special characters. In the latter case, IDNA2008 imposes a requirement on "lookup applications" (including browsers) to check for and reject such things, but they obviously cannot do so if the characters the IDNA interface sees are already transformed to something valid. The current version of Charmod-norm discusses, and recommends against, general application of compatibility mappings. It is perhaps also worth noting that UTS #46 is still recommending the use of NFKC (as part of NFKC_Casefold and its associated tables; see Section 5 of that document) but also calls out the problem of reaching some IDNA2008-conformant domain names if the IDNA2003 rules are followed. Because, from observation, some (perhaps many or most) browsers look to UTS #46 for authority in interpreting domain names in, e.g., URLs, while most or all SMTPUTF8 implementations (incorrectly, but commonly, known as "EAI") are strictly conformant to IDNA2008, the differences between the two introduce additional complications.

John mentions that a browser cannot tell what MTA and configuration a remote server might have, but it is even worse than that. In general, the browser is unlikely to know very much about the precise capabilities of the local MTA or Message Submission Agent (MSA) unless those functions are actually built into the browser. The web page designer is even less likely to know and is in big trouble if different browsers behave differently. If the browser does not know, or cannot be configured to know, the distinction between an input type="email" and one of "eaimail" (which I hope would be called something else, perhaps "i18nemail") would not be as useful as his message implies.

Thinking about these issues in terms of what mail systems do with the addresses may miss an important issue. In many cases, web pages are trying to accept and validate something that looks like an email address but is not headed immediately into a mail system. Instead, it is destined for insertion into a database or comparison with something already there, validation by some other process entirely, or is actually an email address (or something that looks like one) used as a personal identifier such as a user ID. For the latter case, conversion of the part of the string following the "@" via the Punycode algorithm may not produce a useful result whether IDNA2008, IDNA2003, or UTS #46 rules are used. I would think it would be dumb, but if someone wanted to use 3!!!@#$%^&.ØØØ as a user ID and some system wants to allow that, we should probably stay out of their way (perhaps by insisting they use a type that does not imply an email address). However, the other side of that example is probably relevant to the discussion. The operator or administrator of a mail server, or the administrator of a system that uses email addresses as IDs, gets to pick the addresses they will allow. Especially in the ID case, if they use a set of rules narrower than what RFC 5321 allows (and that are allowed in addresses on many mail systems), then they open themselves up to many frustrations and complaints from users whose email addresses are valid according to the standards and work perfectly well on most of the Internet but are rejected by their systems. Internationalized addresses open up a different problem. As an example, I don't know how many mail servers identified by domains subsidiary to the 公益 TLD have allowed registration of local parts in Tamil or Syriac scripts, but I suspect that "zero" wouldn't be a bad guess. Someone designing a web site for users in China might know that and, for the best quality user experience, might want to reject or produce messages about non-Chinese local parts for that domain or perhaps even for any Chinese-script and China-based TLD. Similar rules might be applied in other places to tie the syntax of the local part to the script of the TLD but, for example in countries where multiple scripts are in use and "official", such rules might be a disaster. And, because almost anyone can set up an email server and there are clearly people on the Internet who prioritize being clever or cute or exhibiting a maximum of their freedom of expression over what others might consider sensible or rational, most of us who have been around email for many years have seen some truly bizarre (but valid) local parts of all-ASCII addresses and see no reason to believe we won't see even worse excesses as the Internet becomes increasingly internationalized.

This leads me to a conclusion that is a bit different from when this was discussed at length over a year ago. As we have seen when web sites reject legitimate ASCII local parts because people somehow got it into their heads that most non-alphanumeric characters were forbidden or were stand-ins for something else and, more broadly, because it is generally impossible to know what a remote MTA with email accounts on it will allow in those accounts, trying to validate email addresses by syntax alone is hard and may not be productive. When one starts considering email addresses (or things that look like them) that contain non-ASCII characters, things get much more difficult. IDNA2008, IDNA2003, and UTS#46 (in either profile) each have slightly different ideas about what they consider valid. Whatever any of them allow is going to be a superset of what any sensible domain or mail administrator will allow in practice. In general, a browser does not know what conventions back-end systems or a mail system at the far end of the Internet are following, much less whether they will be doing the same thing next month. So my suggestion would be that input type="email" be interpreted and tested only as "sort of looks like an all-ASCII email address", that a new input type="i18nmail" be introduced as "looks like 'email' but with some non-ASCII characters strewn around", and that the notion of validating beyond those really general rules be left to the back-end systems, the remote "delivery" MTAs, and so on. In addition, to the extent to which one cares about the quality of the user experience, it may be time to start redesigning the APIs associated with various libraries and interfaces so that they can report back real information about why putative email addresses didn't work for them, more precise than "failed" or "invalid address".

good luck to us all,
john

@nicowilliams
Copy link

FYI, new installs of Postfix get EAI enabled by default.

My take is that a new input type is not required. An attribute by which to reject EAI is fair (e.g., because the site's MTAs don't support EAI on outbound).

@jrlevine
Copy link
Author

s/reject/accept/ and I agree

@nicowilliams
Copy link

Validation on the front-end creates more ways to lose rather than more ways to win, and doesn't really protect the backend from vulnerabilities.

So I'm just not very keen on the browser doing much validation here. If the site operator has (or does not have) a limitation as to outbound email, I'm fine with stating it, but I'm also fine with allowing whatever, and making it the backend's job (or any scripts' on the page) to do any validation.

My take is that the default should be permissive. This should be how it is in general. Consider what happens otherwise. You might have a page and site that can handle EAI just fine but a developer forgot to update their email inputs on their pages to say so: now you have a latent bug to be found by the first user who tries to enter an internationalized address. This might mean losing user engagement, and you might never find out because why would the users tell you? But, really, why do we need the input to do so much validation? The input has to be plausibly an email address -- a subset of RFC5322, mailbox-part@domain.part is plenty good enough for 99.999% of users, and there is no good validation to apply to the mailbox part. This is how users get upset that they can't have jane+ietf@janedoe.example. We should stop that kind of foot self-shooting.

@vdukhovni
Copy link

The user should be able to enter an email address verbatim, with no second-guessing by input forms. If that address is known a priori to be unworkable by the server's backend system, it can be rejected with an appropriate error message on the initial POST. Otherwise, if the address vaguely resembles mailbox syntax, it should be accepted and used verbatim. It may not be deliverable, but that's also true of many addresses that are syntactically boring: john.smith@example.com may bounce while виктор1βετα@духовный.org may well be deliverable...

@masinter
Copy link

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines
The value attribute, if specified and not empty, must have a value that is a single valid e-mail address.

The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value.
This should be retained if not expanded (other whitespace?). NFC shouldn't be necessary for user-typed data, but wouldn't hurt.
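For reference, the sanitization step quoted above is small enough to show in full. A literal reading of it (assuming "ASCII whitespace" means tab, LF, FF, CR, and space, as defined elsewhere in the spec) is roughly:

```js
// Sketch of the quoted value sanitization algorithm:
// strip newlines, then trim leading and trailing ASCII whitespace.
function sanitizeEmailValue(value) {
  return value
    .replace(/[\r\n]/g, "")                       // strip newlines
    .replace(/^[\t\n\f\r ]+|[\t\n\f\r ]+$/g, ""); // trim ASCII whitespace
}

sanitizeEmailValue("  user@example.com\n");  // "user@example.com"
```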

@jrlevine
Copy link
Author

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines
The value attribute, if specified and not empty, must have a value that is a single valid e-mail address.

The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value.
This should be retained if not expanded (other whitespace?). NFC shouldn't be necessary for user-typed data, but wouldn't hurt.

Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses.

@vdukhovni
Copy link

https://html.spec.whatwg.org/multipage/input.html#e-mail-state-(type=email) defines
The value attribute, if specified and not empty, must have a value that is a single valid e-mail address.
The value sanitization algorithm is as follows: Strip newlines from the value, then strip leading and trailing ASCII whitespace from the value.
This should be retained if not expanded (other whitespace?). NFC shouldn't be necessary for user-typed data, but wouldn't hurt.

Keep reading and in another paragraph or two you'll find the Javascript pattern they tell you to use to validate e-mail addresses.

The PCRE pattern behind the link is rather busted. It fails to properly validate dot-atoms, allowing multiple consecutive periods in unquoted local-parts (invalid addresses), while disallowing quoted local-parts (valid addresses). EAI aside, this sort of fuzzy approximation of the actual requirements is harmful.
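To make that concrete, here is the pattern as it appears in the linked section at the time of writing (check it against the current spec text), along with the two failure modes just described:

```js
// The spec's "valid e-mail address" pattern (a willful violation of RFC 5322):
const specEmail = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;

specEmail.test("a..b@example.com");         // true  - consecutive dots accepted
specEmail.test('"john smith"@example.com'); // false - valid quoted local-part rejected
specEmail.test("пример@бориса.рф");         // false - and no EAI at all
```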

@klensin
Copy link

klensin commented Jul 7, 2020

Hi. Maybe it would be helpful to back up a little bit and look at this from the perspective of a fairly common use case. Suppose I have a web site that sets up or uses user accounts and that I've decided to use email addresses as user IDs (there are lots of reasons why that isn't a good idea, but the horse has left the barn and vanished over the horizon). Now, while it would probably not be a good practice, there is no inherent requirement that my system ever send email to that address -- it can be, as far as I'm concerned, just a funny-looking user ID. On the other hand, if I tell a user who has been successfully using a particular email address for a long time that their address is invalid, I am going to have one very annoyed user on my hands. If I am operating in an environment in which "user" is spelled "customer", and I don't have a better reason for rejecting that address than "W3C and WHATWG said it was ok to reject it", I may also end up with various sales types, managers, and executives in my face.

The fact that the email address is being used as a user ID probably answers another question. Suppose the user registers with an email address using native Unicode characters in both the local part and the domain part. Now suppose they come back a few weeks later and try to sign in using the same local part but a domain part that contains A-labels. Should the two be considered to match? Remembering that this is a user ID that has the syntax of an email address, not something that is going to be used exclusively in an email context, I'd say that is a business decision and not something HTML (or browsers, or similar tools) should get into the middle of. There is one exception. One of the key differences between IDNA2003 and IDNA2008 is that, in the latter, U-labels and A-labels are guaranteed to be duals of each other. If the browser or the back-end database system are stuck in IDNA2003 or most interpretations of UTS #46, then the fact that multiple source labels can map to a single punycode-encoded form opens the door to a variety of attacks, and anyone deciding that the two are interchangeable in that environment had best be quite careful about what user names they allow and how they are treated.

It may also be a reasonable business decision in some cases for a site to say "we don't accept non-ASCII email addresses as user IDs / account identifiers" or even "we accept addresses that use these characters, or characters from a particular set of scripts, and not others". But nothing in the HTML rules about the valid syntax for email addresses should be in the middle of that decision.

Beyond that, as others have suggested, one just can't know whether an email address is valid without somehow asking the server that hosts the relevant mailbox (or its front end). It may not be possible to ask that question in real time and, even if it is, doing so is likely to require significantly more time (user-visible delay) than browser implementers have typically wanted to invest. So let's stick to syntax.

That scenario by itself argues strongly for what I think John, Nico, and others are suggesting: the only validation HTML should be performing on something that is claimed to be an email address is conformity to the syntax restrictions in RFC 6531. Could one be even more liberal than that? Yes, but why bother.

@aphillips
Copy link
Contributor

I was actioned by the W3C I18N WG with replying to this thread with a sense of the group.

Generally, we concur with @klensin's comment just above ⬆️.

We think that type=email should accept non-ASCII addresses, the better to permit adoption of EAI and IDNA. One reason for the low adoption of these is the barriers to using them across the Web/Internet. Removing these types of artificial barriers will not only encourage adoption, but will also support those users who are already using such addresses.

Users of this feature in HTML expect that the input value follow the structural requirements of an email address but don't expect the value to be validated as an actual valid address. At best this amounts to ensuring that there is an @ sign and maybe some other structure that can be found with a regex. Users who want to impose an ASCII restriction or do additional validation are free to do so and mostly have to do this anyway. In our opinion, HTML would thus be best off providing minimal validation. User agents can use type=email as a hint for additional features (such as prompting the user with their own email address or providing access to the user's address book), but this is outside the realm of HTML itself.

@annevk
Copy link
Member

annevk commented Jul 15, 2020

I played with this a bit and it seems the current state is rather subpar, though that also leaves more room for changes. Example input: x@ñ. Firefox submits as-is (percent-encoded). Chrome submits x@xn--ida. Safari rejects (asks me to enter an email address). If you use ñ before the @ all reject (as expected).

One thing that would help here is a precise definition of the validation browsers would be expected to perform if we changed the current definition as well as tests for that. I can't really commit for Mozilla though if we can make this a bit more concrete I'd be happy to advocate for change.

@nicowilliams
Copy link

@aphillips @annevk just about the only thing worth validating here is the RHS of the @ -- everything else should be left to either the backend (which does or does not support internationalized mailbox names) or the MXes ultimately identified by the RHS of the @, or any MTAs in the path (which might not support internationalized mailbox names, but damn it, should).

What is the most minimal mailbox validation? Certainly: that it's not empty. Validating that the mailbox is not some garbage like just ASCII periods, and so on, might help, but getting that right is probably difficult.

So that's my advice: validate that the given address is of some RFC 5322 form that is ultimately of the shape ${lhs}@${rhs}, that the RHS is a domain name (supporting U-labels, because this is a UI element, as well as A-labels), and that the LHS is not empty; keep any further LHS validation to the utter minimum, in particular not rejecting non-ASCII Unicode.
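A minimal sketch of that shape of check, leaning on the platform's existing host parser for the RHS (the function name is illustrative, and the check is deliberately loose rather than a proposal for normative text):

```js
// Split on the last "@", require a non-empty LHS, and let the URL host
// parser judge the RHS: it already understands both U-labels and A-labels.
// Deliberately loose; anything the host parser tolerates gets through.
function plausibleAddress(addr) {
  const at = addr.lastIndexOf("@");
  if (at < 1 || at === addr.length - 1) return false;  // empty LHS or RHS
  const rhs = addr.slice(at + 1);
  try {
    return new URL("https://" + rhs + "/").hostname.length > 0;
  } catch {
    return false;  // RHS is not parseable as a host at all
  }
}

plausibleAddress("виктор1βετα@духовный.org");   // true
plausibleAddress("jane+ietf@janedoe.example");  // true
plausibleAddress("no-at-sign");                 // false
```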

@klensin
Copy link

klensin commented Jul 15, 2020

@annevk, I think your examples actually point out the problem. In order: it would be rare, but not impossible (details on request, but I want to keep this relatively short), to see % on the RHS of the "@", and % is prohibited by the syntax in RFC 5321, but I'd generally recommend against the use of percent-encoding in any part of email addresses. Pushing a domain-part through Punycode is prohibited by IDNA unless the labels it contains are validated to be U-labels. I can't tell from your example but if, e.g., the domain-part of the mailbox was \u1D7AA\u1D7C2 then it should be rejected, not encoded with punycode: doing otherwise invites errors down the line, errors for which the user gets obscure and/or misleading messages.

The problem is that email addresses with non-ASCII characters in the local-part and/or domain part are now valid and increasing numbers of people who can use them for email are expecting to use them through web interfaces.
Keeping in mind that a browser cannot ever fully "validate" an email address (something that would require knowing that the mailbox xyz@example.com exists but abc@example.com does not) I suggest:

(1) If a mailbox consists of a string of between 1 and 64 octets, an "@", and at least 2 and up to 255 more octets, treat it as acceptable and move on, understanding that all sorts of things may apply additional restrictions in actual email handling.

(2) In addition, if you wanted to and the domain-part contained non-ASCII characters, you could verify that any labels were valid IDNA2008 U-labels and reject the name if they were not ("invalid domain name in email address:" would be a much better message than "invalid email address") AND, optionally iff the local-part was entirely ASCII, convert those U-labels to A-labels. The SMTPUTF8 ("EAI") specs strongly recommend against making that conversion if the local-part is all-ASCII. When the local part is all-ASCII, the conversion will allow some valid cases to go through but, over time, it seems likely that those cases will become, percentage-wise, less frequent, so whether it is worth the effort is somewhat questionable.

FWIW, the above was written in parallel with @nicowilliams's comment rather than after studying it, but his recommendation and mine are not significantly different except for that one marginal case of an ASCII local-part and a non-ASCII (but IDNA2008-valid) domain part.
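If it helps, suggestion (1) above really is just octet counting; a sketch (counting UTF-8 octets, since the 64/255 limits are octet limits rather than character limits):

```js
// "δοκιμή" is 6 characters but 12 UTF-8 octets, so count octets, not chars.
const octets = (s) => new TextEncoder().encode(s).length;

const localPartOk  = (local)  => octets(local)  >= 1 && octets(local)  <= 64;
const domainPartOk = (domain) => octets(domain) >= 2 && octets(domain) <= 255;
```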

@klensin
Copy link

klensin commented Jul 15, 2020

I should have added, as @vdukhovni more or less points out, that if one is going to try to validate the syntax of the local-part (even all-ASCII local-parts), it is important to actually get it right. As he shows, getting it right is a moderately complicated process, perhaps best left to email systems that are doing those checks anyway (which is what @nicowilliams and I essentially suggest above). But, if one is going to try to do it, it should be done right, because halfway attempts (fuzzy approximations) are harmful, letting through some local-parts with invalid syntax and prohibiting some valid ones.

@annevk
Copy link
Member

annevk commented Jul 16, 2020

@klensin I'm not sure what you're trying to convince me of. I was offering to help. (Percent-encoding is just part of the MIME type form submission uses by default, it's immaterial. Chrome's Punycode handling is what is encouraged by HTML today. That browsers do incompatible things suggests it might be possible to change the current handling.)

@aphillips
Copy link
Contributor

@annevk I drew an action item (during a part of I18N's meeting when @klensin was not available) to propose changes and I'd appreciate your thoughts on how to approach this. Looking at the current text, I guess a question is whether we should attempt to preserve the current behavior for ASCII email addresses (or their LHS/RHS parts) while simultaneously allowing labels that use non-ASCII Unicode? I18N WG participants seem to agree that we don't want to get into deep validation of the address's validity and would limit ourselves to "structurally valid" addresses.

@annevk
Copy link
Member

annevk commented Jul 16, 2020

Right, e.g., at a minimum we should probably require that the string contains a @ and no surrogates. But currently we also prohibit various types of ASCII labels, e.g., quoted ones, and allowing those to now go through might not be great either.

@jrlevine
Copy link
Author

It certainly has to be valid Unicode (e.g., no unpaired UTF-16 surrogates, no invalid UTF-8 bytes), and follow the rules like no unpaired quotes. Restricting it more than that is not likely to help.
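The "valid Unicode" part of that is cheap to check from script (a sketch; String.prototype.isWellFormed is recent, and the regex fallback relies on /u mode treating a lone surrogate as a code point of its own):

```js
// Reject strings that contain unpaired UTF-16 surrogates.
function isValidUnicode(s) {
  return typeof s.isWellFormed === "function"
    ? s.isWellFormed()               // ES2024
    : !/\p{Surrogate}/u.test(s);     // only lone surrogates match here
}

isValidUnicode("ß@ß.example");        // true
isValidUnicode("a@b\uD800.example");  // false (lone high surrogate)
```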

@masinter
Copy link

Even if people are just using things that look like email addresses for purposes other than sending email, do you really want to allow unnormalized Unicode or leading or trailing white space in the LHS?
For sites that use email addresses as user IDs, changing HTML validation to allow entry of different sequences that are visually identical opens up new security concerns.

@nicowilliams
Copy link

@masinter Absolutely this must allow unnormalized Unicode, because users cannot be counted on to produce normalized Unicode. Regarding whitespace, trimming it is fine. I don't think there are any security concerns regarding client-side validation -- if there is a site where relaxing client-side validation of email addresses creates a security concern, then the site is already vulnerable.

@aphillips
Copy link
Contributor

I18N discussed this earlier today (2024-05-02) in our teleconference (note that @klensin participated in that call). Our group position (and mine personally) is in rough agreement with John's comment. HTML is not a mail agent and the input type=email is not always used for something email-specific. Browsers aren't really in a position to do that much validation of the local part (they can do more with the domain name, obviously). With that in mind, not long ago in a WHATNOT call I took the action to update #5799 (which is meant to fix this issue). Is it possible that we could focus on what changes are needful there?

@hsivonen
Copy link
Member

hsivonen commented May 6, 2024

There is a useful (and very common) analogy to the Vietnamese situation as described in the posting above. RFCs 5321 and 5322 are very explicit that local parts of email addresses are case sensitive.

I believe this is not a useful analogy. Case and normalization form are substantially different issues.

Specifically:

  1. When people read text that has been rendered visually, they can see the difference in case but cannot (in the absence of rendering bugs) see normalization differences.
  2. It is customary for text input methods to provide users with explicit control over case (the shift key) but text input methods do not provide users control over normalization. (In the Vietnamese case, the difference is between different input methods—not control provided within a given input method.)
  3. Spec-wise, Unicode keeps case-folding as a more situational operation that's also tailorable while Unicode seeks to keep the interpretation identical regardless of normalization differences ("Canonical equivalence is the criterion used to determine whether two character sequences are considered identical for interpretation.") and normalization itself as stable and untailorable.

On the other hand, if some mail administrator with mailboxes in Vietnamese wants to make it difficult for someone who does not have a locally-normal Vietnamese input device to send mail to a particular mailbox and decides to only allow the non-NFC form, the standards deliberately do not tell them they cannot do that and we shouldn't either.

I disagree that specs should deliberately enable addressing failure depending on the incidental characteristics of the input method.

HTML might end up doing that incidentally as part of historically not normalizing (except in the domain part of input type=email!), but deliberately enabling such addressing failure seems like a very bad reason not to make input type=email normalize the local part to NFC.

In particular, if input type=email ended up normalizing the local part to NFC, for Vietnamese this would be a no-op for the most popular method of text input and would make addresses written with the less popular method appear like they were written with the more popular method. I could understand an argument that HTML should not take on complexity that should belong in the implementation of keyboard layouts, but I find it bizarre to take the position that specs should specifically seek to enable mailboxes that are only reachable via the less popular method of Vietnamese text input.
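To make the Vietnamese case concrete (a sketch; the address is made up, only the code points matter):

```js
// Two ways of typing the Vietnamese letter ế (e with circumflex and acute):
const composed   = "t\u1EBFn@example.vn";         // ế as one code point (NFC)
const decomposed = "te\u0302\u0301n@example.vn";  // e + combining circumflex + acute

composed === decomposed;                   // false: the bits differ
composed === decomposed.normalize("NFC");  // true:  canonically equivalent
```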

To be clear, this is not just about the NFC "restriction" or about case sensitivity. The same reasoning applies to quoted strings and dot-string. Whatever is done with HTML type=email, it should not get in the way of using legitimate email addresses even if we, or even the relevant RFCs, find those addresses or associated practices, distasteful.

This can be read as arguing that input type=email should not reject leading, trailing, or consecutive dots in the local part, but I get the feeling from the rest of your comment that this might not be the interpretation that you intended.

Upthread I listed specific implementation-relevant questions. To make progress, I think it would be useful to see which ones of those have consensus and what specific disagreements there are on those specific questions. From what you say, it's unclear to me how you are answering "Is it the job of input type=email to reject leading, trailing, and consecutive dots in the local part?"

With that in mind, not long ago in a WHATNOT call I took the action to update #5799 (which is meant to fix this issue). Is it possible that we could focus on what changes are needful there?

I pointed from there to my list of questions upthread here.

@annevk
Copy link
Member

annevk commented May 6, 2024

Here's an attempt at answering the questions raised in #4562 (comment):

  1. Seems reasonable to allow, should be consistent with IPv4.
  2. Seems reasonable to allow, should be consistent with IPv6.
  3. Yes.
  4. ToASCII.
  5. I think ideally this follows the host parser. STD3 was wrong for underscore in particular for URLs.
  6. Yes seems reasonable.
  7. Yes seems reasonable.
  8. Yes seems reasonable.
  9. I think ideally this follows the host parser.
  10. User-entered value seems good.
  11. Normalized/to-be-submitted value seems good.

@hsivonen
Copy link
Member

hsivonen commented May 6, 2024

Here's an attempt at answering the questions raised in #4562 (comment):

Thanks!

Do you have an opinion on whether input type=email should try to constrain the non-ASCII repertoire in the local part beyond restricting to scalar values (i.e. rejecting unpaired surrogates) and do you have an opinion on whether input type=email should normalize the local part to NFC?

@annevk
Copy link
Member

annevk commented May 6, 2024

I think you have made a compelling case for NFC normalization so I'd be inclined to go with that. Not sure about other non-ASCII restrictions beyond scalar values.

@jrlevine
Copy link
Author

jrlevine commented May 6, 2024

Klensin and the rest of the people who wrote the RFCs did not say to normalize local parts, and I can assure you they were aware of the possibility. Unless you are sure you know better than they do how every EAI mail system works, the answer is clearly no.

@klensin
Copy link

klensin commented May 6, 2024

Let me say what may be the same as what John Levine said from a slightly different perspective. The relevant email RFCs allow non-normalized strings as local parts and even allow the same mail-receiving system to have, e.g., string1@example.com and string2@example.com (where string2=NFC(string1)) and even allow them to point to separate mailboxes. Consequently, applying NFC in HTML processing risks making legitimate, working, email addresses inaccessible and may even create the security risk of having messages delivered to the wrong address.

So, as he said, clearly no.

@collinanderson
Copy link

collinanderson commented May 6, 2024

Hot take: If specs for email specifically allow non-NFC forms, then it seems to me that's a security issue waiting to happen. I'm not saying I know better than them, but to me as non-spec writer, non-unicode expert nobody, it really seems to be a security issue in the email spec. The email specs lean too far on backward compatibility in my opinion. All unicode control characters are allowed? I can end the local part with a right-to-left control code? I can end the local part with a zero-width-joiner and emoji control codes? Instead of defining a good secure universal definition of an email address, the email spec writers are just going to leave it up to the individual email providers to do whatever they want?

I realize the html spec is not the place to fix these issues. I'm just venting my somewhat irrelevant opinion here. I'll send my thoughts to the UASG people.

@jrlevine
Copy link
Author

jrlevine commented May 6, 2024

Hot take: If specs for email specifically allow non-NFC forms, then it seems to me that's a security issue waiting to happen.

RFC 6531 was published 12 years ago, and its predecessor RFC 5336 was published 16 years ago, so we have been waiting for quite a while.

Mail systems have complete latitude about what they do with the local parts you hand them. In a lot of cases it would make sense to normalize the input to NFC before comparing it to local addresses. Or maybe NFD depending on how the internal character processing works. Or even NFKC or NFKD. But to point out something we've said a dozen times, YOU DON'T KNOW. So don't guess, just pass along whatever the user enters.

@jrlevine
Copy link
Author

jrlevine commented May 6, 2024

What shall be submitted? (I think the least-risky answer is to say the UTS 46 ToASCII form with the above flags. Chrome already submits the ToASCII form and Safari only allows ASCII to be submitted.

That's unfortunate. Postfix, which is the most widely used open source MTA, does not automatically make A-label and U-label versions of domains equivalent. I expect that most mail systems are configured to do so, but they don't have to and once again, YOU DON'T KNOW. So please do not imagine you are cleverer than the people who wrote the specs, and just pass along anything that complies with the RFC, which is IDNA2008 in this case.

By the way, a lot of implementations screw up handling of German ß. Point your browser at https://fuß.standcore.com/ to see how it does.

@collinanderson
Copy link

collinanderson commented May 6, 2024

Mail systems have complete latitude about what they do with the local parts you hand them.

Right, and in my opinion this is the fundamental problem with the current email specs and the likely reason why there's been very little adoption in the last 12 years. In my opinion the specs are too lax and give mail systems too much latitude. The "YOU DON'T KNOW" is a problem that the email specs should be solving so we can actually have a sane universal definition of a standards-compliant email address. There's a reason why IDNA 2008 is more restrictive than IDNA 2003. It's fine for newer specs to be more restrictive than past specs. Again, I realize html is not the place to be creating those restrictions, but it seems to me at minimum NFC and Identifier_Status=Allowed are pretty common sense to require for a "standards-compliant" email address.

@collinanderson
Copy link

collinanderson commented May 6, 2024

To the html people: I agree submitting the ToASCII form is probably wrong for an ideal internationalized email field and browsers probably call ToASCII only for backward compatibility. input type="url" doesn't do that, right? I'd suggest being consistent with url field.

@vdukhovni
Copy link

That's unfortunate. Postfix, which is the most widely used open source MTA, does not automatically make A-label and U-label versions of domains equivalent. I expect that most mail systems are configured to do so, but they don't have to and once again, YOU DON'T KNOW. So please do not imagine you are cleverer than the people who wrote the specs, and just pass along anything that complies with the RFC, which is IDNA2008 in this case.

Indeed any equivalence is up to the administrator to implement by installing suitable lookup table key/value pairs (for virtual domains) or just listing both domains in mydestination (for domains whose localpart mailboxes map to Unix logins or system-wide Sendmail-compatible aliases).

And no equivalence can be 100% complete via Postfix alone, because delivery may be delegated to non-Postfix delivery agents (LMTP, or "pipe" commands), where Postfix can't know which address forms the delivery agent supports.

Leaving the input form as-is facilitates onward relaying in a form that is plausibly most likely to be understood by the ultimate MDA.

@jrlevine
Copy link
Author

jrlevine commented May 7, 2024

Mail systems have complete latitude about what they do with the local parts you hand them.

Right, and in my opinion this is the fundamental problem with the current email specs ...

To point out what should be obvious, we're not offering advice on how to run mail systems. People already have mail addresses that they got somewhere else, and if we have aesthetic objections to those addresses, too bad. The only relevant question for what goes in an email address box is whether the string is a plausible address. Not a great address, not a pretty address, just whether it's an address that a mail system might have assigned them. If so, then we should accept it.

To be further obvious, most addresses that are syntactically valid are not actual addresses. There are over 3 quadrillion possible addresses of the form xxxxxxxxxx@gmail.com where xxxxxxxxxx is a string of 10 letters or digits, of which perhaps a billion are actual addresses. The only way to verify that an address is valid is to write to it and see if someone responds. The only sensible thing to do is to allow all addresses that meet the RFC syntax rules, because there will always have to be later checks to see if they're valid, typically sending a confirmation message with a link.

@klensin
Copy link

klensin commented May 7, 2024

Collin,
As others have at least hinted, the option of a more restrictive syntax for email local-parts, and the tradeoffs between a more restrictive syntax and allowing for special circumstances, have been considered (usually carefully) in multiple IETF WGs over the last 25 years or longer. That is for both ASCII and, more recently, non-ASCII addresses. While a very restrictive syntax definitely has its appeal for the reasons you give and others, the conclusion every time has been to allow extreme flexibility and then to recommend that sites establishing conventions for mailbox names adopt much more conservative rules.

What has brought us to that conclusion each time has been some set of edge cases. Each time, it would be nice to ban the particular case globally. The earliest, and perhaps most obvious, example involves quoting: the first example I can remember involved systems on the early ARPANET whose "user names", and hence mailboxes, were of the form "GroupNumber UserNumber" -- without some way to represent the space(s) that separated the two, there were big problems, because there was no guarantee that simply mashing the two together would produce something unambiguous. Worse, some operating systems and applications would figure out various ways to un-quote strings because they thought they knew what they meant (hence the rather complex, potentially redundant, quoting rules in RFC 821 and its successors). And then there are characters, still in ASCII, that have special meanings in some mail environments, are ordinary characters in others, and are prohibited in still more. All of "!%$&+-_." have been important examples. In general only the delivery MTA and the system hosting the mailboxes know. Trying to guess what is or is not valid, or what might have special meanings, just invites trouble.

As we moved past ASCII, things got more complicated, not because we could not make up restrictive rules but because almost any we could come up with caused problems with some reasonable (to them) set of conventions. I note that every single example in this thread has been Latin script -- characters of the variety sometimes called "decorated Latin" or even "decorated ASCII". Well, too bad, but, for historical reasons, Latin script is easy... and so are Greek and Cyrillic as long as one is careful about shared character graphemes among them. Can one make a simple global rule like "no local-part strings with mixed scripts"? Nope: many systems have come up with reasons to do that. And that is still near the top of the slippery slope, with lots of sliding opportunities I have not covered.

An issue that probably should be quite separate further complicates the situation for the "domain-part" of an email address. Most, if not all, browsers follow UTS#46 as the standard for IDNs. However, most, if not all, mail transport systems follow IDNA2008. For selected cases, they are not compatible no matter what options are chosen for the former.

Bottom line remains the same: only systems that actually host mailboxes and determine what strings they will assign or allow can determine what is, or is not, a valid mailbox.

@hsivonen
Copy link
Member

hsivonen commented May 7, 2024

Consequently, applying NFC in HTML processing risks making legitimate, working, email addresses inaccessible and may even create the security risk of having messages delivered to the wrong address.

It seems to me that it wouldn't be HTML creating a security risk if adhering to a "SHOULD" from RFC 6532 results in delivery to the wrong mailbox.

It's worth noting that if HTML doesn't normalize, the same security risk would be actualized by the user seeing the email address rendered as visual text on a screen or in print and writing what they see in input type=email (as opposed to copying and pasting from a digital source).

I think addresses that create a security risk when seen and written by users are entirely unsuited for email addresses that are communicated to humans, and I believe input type=email is for the kind of email addresses that can be communicated to a human and that a human can enter into a form.

Normalizing to NFC is very minimal compared to what https://www.unicode.org/reports/tr39/#Email_Security_Profiles says. It says "The goal is to flag addresses that are structurally unsound or contain unexpected detritus." That is, the Unicode document covering Unicode security issues characterizes addresses whose local part isn't in NFKC (among other criteria) as "structurally unsound".

What shall be submitted? (I think the least-risky answer is to say the UTS 46 ToASCII form with the above flags. Chrome already submits the ToASCII form and Safari only allows ASCII to be submitted.

That's unfortunate. Postfix, which is the most widely used open source MTA, does not automatically make A-label and U-label versions of domains equivalent. I expect that most mail systems are configured to do so, but they don't have to and once again, YOU DON'T KNOW.

Do you consider a piece of software that knows that U-labels are a possibility but that doesn't treat the corresponding A-labels and U-labels as equivalent as meeting the goals of Universal Acceptance?

So please do not imagine you are cleverer than the people who wrote the specs, and just pass along anything that complies with the RFC, which is IDNA2008 in this case.

I said earlier that I think requiring browsers to carry extra data to implement IDNA 2008 restrictions for input type=email when browsers use UTS 46 in other places would be excessive. Considering that UTS 46 in non-transitional mode, which browsers already use for domains in URLs, accepts and doesn't change the resolved address of domains that IDNA 2008 accepts, I think using UTS 46 in non-transitional mode is consistent with your position of asking input type=email to rather under-reject than over-reject inputs.

To the html people: I agree submitting the ToASCII form is probably wrong for an ideal internationalized email field and browsers probably call ToASCII only for backward compatibility.

Consider these six cases:

  1. ascii@non-punycode-ascii
  2. ascii@punycode
  3. ascii@unicode
  4. unicode@non-punycode-ascii
  5. unicode@punycode
  6. unicode@unicode

Cases 4, 5, and 6 currently cannot be submitted if entered into input type=email in any of the three major Web engines.

Case 3 currently cannot be submitted in WebKit.

Currently, Web sites that use input type=email always receive the domain in ASCII form (non-punycode-ascii or punycode depending on the domain) from Blink & WebKit and may receive it in ASCII form from Gecko.

I think submitting the ToASCII form of the domain is the least risky approach and the shortest path to interop between Web engines, because to do otherwise would be asking the engine with the largest market share and the closest to current spec behavior to start treating inputs that can be submitted today (inputs of case 3) in a different way in submission. This logically involves Web compat (site breakage) risk, and asking Blink to take such risk for case 3 seems counter-productive to the goal of enabling cases 4, 5, and 6. (Asking the domain handling to differ depending on the local part seems like asking for trouble. I do think we should ask Blink to change its input type=email to the non-transitional mode of UTS 46, though, but that affects 5 input characters instead of affecting all non-ASCII.)

Notably, submitting the ToASCII form makes case 3 work with sites that don't specifically handle IDNA. On the other hand, sites whose back end is internationalized email-aware can transform the Punycode form to the Unicode form if they want to show the Unicode form or treat the Unicode form as canonical.
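For what it's worth, a back end that prefers the Unicode form can recover it from the submitted ASCII form with off-the-shelf conversions; a Node.js sketch (assuming a recent Node whose UTS 46 processing is non-transitional, and reusing the fuß example from earlier in the thread):

```js
const { domainToASCII, domainToUnicode } = require("node:url");

domainToUnicode("xn--fu-hia.example");  // "fuß.example"
domainToASCII("fuß.example");           // "xn--fu-hia.example"
```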

(I think HTML shouldn't cater to the Postfix brokenness described above or to other setups that allow Unicode forms of domains but don't treat the corresponding Unicode and ASCII forms as equivalent. AFAICT, such Postfix configurations would already fail e.g. with Thunderbird senders when the submission SMTP server doesn't negotiate SMTPUTF8 with Thunderbird.)

Most, if not all, browsers follow UTS#46 as the standard for IDNs. However, most, if not all, mail transport systems follow IDNA2008. For selected cases, they are not compatible no matter what options are chosen for the former.

Can you give an example of a domain that IDNA 2008 accepts but that doesn't result in the same Punycode to be passed to DNS with UTS 46 in non-transitional mode?

@vdukhovni
Copy link

An issue that probably should be quite separate further complicates the situation for the "domain-part" of an email address. Most, if not all, browsers follow UTS#46 as the standard for IDNs. However, most, if not all, mail transport systems follow IDNA2008. For selected cases, they are not compatible no matter what options are chosen for the former.

I don't know about "most" mail systems, but FWIW, Postfix uses "libicu" for IDN support, without transitional processing, which is largely a UTS#46 superset of IDNA2008. I am not aware of a ubiquitous library that implements IDNA2008. :-(

@vdukhovni
Copy link

That's unfortunate. Postfix, which is the most widely used open source MTA, does not automatically make A-label and U-label versions of domains equivalent. I expect that most mail systems are configured to do so, but they don't have to and once again, YOU DON'T KNOW.

Do you consider a piece of software that knows that U-labels are a possibility but that doesn't treat the corresponding A-labels and U-labels as equivalent as meeting the goals of Universal Acceptance?

Yes, because it is up to the system administrator to decide which address forms are supported and deliverable. This may mean support for either or both of the A-label and U-label variants of the address domain part.

Trying each lookup twice is not attractive (especially when multiple aliases are chained), and normalisation to a form that isn't what came in can hamper downstream deliverability.

Postfix supports non-ASCII addresses, but does not attempt any built-in normalisation to either A-label or U-label form. DNS lookups (MX, A, AAAA, ...) of course use the A-label form.

@hsivonen
Copy link
Member

hsivonen commented May 7, 2024

An issue that probably should be quite separate further complicates the situation for the "domain-part" of an email address. Most, if not all, browsers follow UTS#46 as the standard for IDNs. However, most, if not all, mail transport systems follow IDNA2008. For selected cases, they are not compatible no matter what options are chosen for the former.

I don't know about "most" mail systems, but FWIW, Postfix uses "libicu" for IDN support, without transitional processing, which is largely a UTS#46 superset of IDNA2008. I am not aware of a ubiquitous library that implements IDNA2008. :-(

While not part of the transport system per se, Thunderbird also uses UTS 46 in non-transitional mode when it needs to turn a domain name into an ASCII-only form (notably, when the submission SMTP server does not negotiate SMTPUTF8 but the email address to send to has been entered as ascii@unicode).

That's unfortunate. Postfix, which is the most widely used open source MTA, does not automatically make A-label and U-label versions of domains equivalent. I expect that most mail systems are configured to do so, but they don't have to and once again, YOU DON'T KNOW.

Do you consider a piece of software that knows that U-labels are a possibility but that doesn't treat the corresponding A-labels and U-labels as equivalent as meeting the goals of Universal Acceptance?

Yes, because it is up to the system administrator to decide which address forms are supported and deliverable.

Much of the disconnect in this discussion arises from differing views on whether HTML should facilitate downstream processing that gives semantic differences to bit-wise differences that Unicode-wise are not supposed to have semantic significance. Whether HTML should cater to local parts that Unicode considers canonically equivalent being delivered to potentially different mailboxes. Whether HTML should cater to software giving a semantic difference to different domain representations that would map the same way under (non-transitional) ToASCII.

Let's take a step back: What's the goal here? Is it to enable people around the world to use their native script in email addressing? Or is it to cater to email server configurations that create semantic differences where there Unicode-wise (I'm counting UTS 46 under "Unicode-wise" here) are not supposed to be any?

I think it's worthwhile to change input type=email (in the HTML spec and in browser engines) to enable people around the world to use their native script in email addressing. I think it's not worthwhile to change input type=email (in the HTML spec or in browser engines) to cater to email server configurations giving distinct treatment to addresses that are canonically equivalent in the Unicode sense or that are equivalent after applying UTS 46 non-transitional processing to the domain.

@gene-hightower
Copy link

gene-hightower commented May 7, 2024 via email

@jrlevine
Copy link
Author

jrlevine commented May 7, 2024

Do you consider a piece of software that knows that U-labels are a possibility but that doesn't treat the corresponding A-labels and U-labels as equivalent as meeting the goals of Universal Acceptance?

We don't have to guess. I tested a dozen mail systems for the UASG and Postfix passed with flying colors. Read all about it: https://uasg.tech/download/uasg-030-evaluation-of-eai-support-in-email-software-and-services-report-en/

People who actually write mail software and run mail systems have told you over and over why it is a bad idea to imagine that mail addresses follow any pattern beyond what the RFCs require. If you insist on NFC or whatever, you will reject some real mail addresses, with no benefit to anyone. But if you insist you know better than everyone else, there's not much we can do about it.

@annevk
Copy link
Member

annevk commented May 7, 2024

There's certainly a lot of argument by authority, but not a lot of engagement with the identified issues and questions put forward.

@jrlevine
Copy link
Author

jrlevine commented May 7, 2024

There's certainly a lot of argument by authority, but not a lot of engagement with the identified issues and questions put forward.

Viktor is one of the people who maintain Postfix. Klensin and I have been working on mail standards for decades, in his case many more decades. I have done work for the UASG, and talk with people at large mail systems at M3AAWG and other places. Can you say more about why our experience is irrelevant here?

@annevk
Copy link
Member

annevk commented May 7, 2024

I'm not saying your experience is irrelevant, but it's not a great conversation when one side asks questions and the other side is essentially saying "trust us".

E.g., you claim Postfix is perfect but above there's also an undisputed claim it uses UTS46 and not IDNA2008 (and still nobody addressed the question for which inputs those might be different). So I guess that was not tested then?

@hsivonen
Copy link
Member

hsivonen commented May 7, 2024

Do you consider a piece of software that knows that U-labels are a possibility but that doesn't treat the corresponding A-labels and U-labels as equivalent as meeting the goals of Universal Acceptance?

We don't have to guess. I tested a dozen mail systems for the UASG and Postfix passed with flying colors. Read all about it: https://uasg.tech/download/uasg-030-evaluation-of-eai-support-in-email-software-and-services-report-en/

The document says:
"We were surprised to find that Postfix does not automatically treat U-label and A-label
versions of domains the same. For example, 用户2@xn--fqr621h.services.net and 用户2@
后缀.services.net are not considered equivalent. It is not hard to configure them so they are
effectively equivalent, delivering to the same mailbox and authenticating the same to the
MSA, but it could be a trap for the unwary."

I interpret "trap" as not meeting the goals.

I'm not saying your experience is irrelevant, but it's not a great conversation when one side asks questions and the other side is essentially saying "trust us".

I've tried to avoid engaging on the "who knows better" topic, but:

More to the point, it's not just saying "trust us" instead of engaging with the specific implementation-relevant questions raised but saying "trust us over the Unicode spec writers on Unicode matters".

What I've said about NFC is very mild (and, when it comes to e.g. non-ASCII control characters, possibly insufficient) compared to https://www.unicode.org/reports/tr39/#Email_Security_Profiles .

@jrlevine
Author

jrlevine commented May 7, 2024

The document says:
"We were surprised to find that Postfix does not automatically treat U-label and A-label
versions of domains the same. For example, 用户2@xn--fqr621h.services.net and 用户2@
后缀.services.net are not considered equivalent. It is not hard to configure them so they are
effectively equivalent, delivering to the same mailbox and authenticating the same to the
MSA, but it could be a trap for the unwary."

I interpret "trap" as not meeting the goals.

Ah, thanks for explaining to me what I meant when I wrote that.

In any event, it's your choice. If you want to do something useful and allow people to enter their actual addresses, you'll allow whatever the RFCs allow. If instead you only allow addresses that meet your aesthetic preferences, then anyone whose address isn't one of those simply doesn't get to fill out a web form today. That seems bizarre to us, but we haven't had much luck getting that point through.

@gene-hightower

This comment was marked as duplicate.

@klensin

klensin commented May 7, 2024

Hmm. I think we are successfully talking past each other. To illustrate from "the other side"

I'm not saying your experience is irrelevant, but it's not a great conversation when one side asks questions and the other side is essentially saying "trust us".

What I have written and seen others write/say is not "trust us" but things closer to "there are well-established, written, email specs, written as IETF Standards Track RFCs, and you should believe them rather than trying to make up your own rules". The only thing I've seen that comes close to "trust us" is when we have tried to assure you that, in the process of developing, and periodically reviewing and updating, those standards, proposals that would have resulted in much more restricted syntax have been carefully considered and then rejected.

What I've seen instead is a good deal of "not listening" on the part of those who think that their ideas of what email syntax should be ought to prevail, in HTML and elsewhere, over what those standards, and assorted actual implementations, establish as the actual practice. So, if you and some of your colleagues are hearing "us" say "trust us" rather than "don't try to make rules more restrictive than the actual email standards and practices based on them", and "we" are seeing signs of what we actually do say not getting through, then the discussion is doomed to go around in circles... which seems, to me at least, to be a good description of where we are.

I think that, without any hint of "trust us", Gene's comment above:

"At this time, I would say that web forms rejecting perfectly valid email addresses is more of an impediment to adoption of Internationalized Email than any problems with the email standards."

is just right.

E.g., you claim Postfix is perfect but above there's also an undisputed claim it uses UTS46 and not IDNA2008 (and still nobody addressed the question for which inputs those might be different). So I guess that was not tested then?

I haven't made any claims about the perfection of Postfix. The only claims I've heard others make are about its being a popular open source implementation that conforms to the standards.

As far as UTS46 and IDNA2008 are concerned, first remember that they have nothing to do with the discussion above, which, at least AFAICT, has been about the local-part of an email address and not the domain part. The differences in the status and interpretation of various inputs are well documented, with much of that documentation being part of UTS#46 itself. To state this as neutrally as I can, most of the differences exist because UTS#46 tried to maintain a higher degree of compatibility with IDNA2003, while in many areas the consensus in the IETF about how to make the DNS work better, together with some rather explicit agreements with DNS registries that have needs for specific scripts, concluded that allowing certain incompatibilities was much wiser for both the short and long term. Some of the most glaring differences result from the use of Case Folding in UTS#46 (and IDNA2003); to use an example that came up in this discussion, whether "ß" is a letter or just a funny way to write "ss". And so on -- more examples on request, but I doubt they will accomplish anything.

@hsivonen
Member

hsivonen commented May 8, 2024

The document says:
"We were surprised to find that Postfix does not automatically treat U-label and A-label
versions of domains the same. For example, 用户2@xn--fqr621h.services.net and 用户2@
后缀.services.net are not considered equivalent. It is not hard to configure them so they are
effectively equivalent, delivering to the same mailbox and authenticating the same to the
MSA, but it could be a trap for the unwary."
I interpret "trap" as not meeting the goals.

Ah, thanks for explaining to me what I meant when I wrote that.

I wasn't telling you what you meant. I was saying what I understood as a reader. To me as a reader, the quoted bit looks like it is calling out a Universal Acceptance-relevant problem in Postfix's behavior. If there was a communication failure, I think that giving me a direct answer instead of pointing me to an external document might have avoided it.

In particular, I think it's evident in this discussion that reading what discussion participants have written elsewhere isn't predictive of specific opinions here. For example, as a reader (again, not claiming author intent) of RFC 5198, I wouldn't have predicted this position (which I find anti-persuasive): "On the other hand, if some mail administrator with mailboxes in Vietnamese wants to make it difficult for someone who does not have a locally-normal Vietnamese input device to send mail to a particular mailbox and decides to only allow the non-NFC form, the standards deliberately do not tell them they cannot do that and we shouldn't either."
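
(An illustration, not from the quoted comment: the NFC and non-NFC forms of the same visible Vietnamese text are different code point sequences, which is what "only allow the non-NFC form" would hinge on.)

```js
// Illustration only: the same visible text in composed (NFC) and decomposed (NFD) form.
const composed   = "việt".normalize("NFC"); // precomposed letters such as U+1EC7
const decomposed = "việt".normalize("NFD"); // base letters plus combining marks

console.log(composed === decomposed);                       // false: different code point sequences
console.log([...composed].length, [...decomposed].length);  // NFC is shorter (4 vs 6 code points here)
console.log(decomposed.normalize("NFC") === composed);      // true: NFC removes the distinction
```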

Hmm. I think we are successfully talking past each other. To illustrate from "the other side"

As far as I can tell, you (plural) are asking for an HTML spec change and Web engine implementation changes. The issue has been open for years without success at getting the changes. Now that you've had the attention of folks who can change the spec or an implementation, it seems to me you haven't been using that attention effectively.

It seems to me that no matter how wrong you think we are, it would be more productive to reply to specific questions instead of talking past them.

I think that, without any hint of "trust us", Gene's comment above:

"At this time, I would say that web forms rejecting perfectly valid email addresses is more of an impediment to adoption of Internationalized Email than any problems with the email standards."

is just right.

I suggest that you focus on the impediments. However, I've seen statements that to me have looked like requests not just to expand the value space that input type=email accepts or submits but also to restrict that value space relative to what it is now. (Specifically, dot specifics in the local part or IDNA 2008 in the domain part.)

If being unable to submit Unicode in the local part is an impediment, I suggest focusing on expanding the value space but treating dot specifics as a distinct point out of scope here.
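
(A rough sketch of what "expanding the value space" could look like; this is an assumption about the shape, not a spec proposal. It widens an HTML-style pattern's local-part and label classes with non-ASCII code points, in the spirit of RFC 6531 adding UTF8-non-ascii to atext, and deliberately says nothing about dots, normalization, or how label length limits should apply to non-ASCII labels.)

```js
// Sketch only: ASCII atext-ish characters plus any non-ASCII code point (U+0080 and up).
const localPart = "[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~\\u{80}-\\u{10FFFF}-]+";
const label =
  "[a-zA-Z0-9\\u{80}-\\u{10FFFF}]" +
  "(?:[a-zA-Z0-9\\u{80}-\\u{10FFFF}-]{0,61}[a-zA-Z0-9\\u{80}-\\u{10FFFF}])?";
const eaiEmail = new RegExp(`^${localPart}@${label}(?:\\.${label})*$`, "u");

console.log(eaiEmail.test("user@example.com"));   // true: ASCII addresses still pass
console.log(eaiEmail.test("用户@后缀.example"));   // true: EAI addresses pass too
console.log(eaiEmail.test("user@@example.com"));  // false
```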

If Chrome turning "ß" in input into "ss" in output is an impediment, I suggest focusing on specifying the non-transitional mode of UTS 46 rather than suggesting either IDNA 2008 or dropping ToASCII altogether.
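
(For the "ß" case specifically, an illustration using Node.js's url.domainToASCII, which follows the URL Standard; the exact output depends on the Node version and is noted as an assumption in the comments.)

```js
// Illustration only. Transitional processing maps "ß" to "ss", collapsing the
// domain to plain ASCII; non-transitional processing keeps "ß" as its own letter
// and Punycode-encodes it, so the two modes submit different domains.
const { domainToASCII } = require("node:url");

console.log(domainToASCII("faß.de"));
// Non-transitional (what recent releases do): "xn--fa-hia.de"
// Transitional (older behavior):              "fass.de"
```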

If Safari not allowing Unicode-form domain input is an impediment, I suggest changing the spec to be more obviously algorithmic to make it clearer that the spec intends to enable the input of the Unicode (or mixed) form despite the logical value space of the domain part being ASCII.
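
(One possible shape for "more obviously algorithmic", sketched here with a hypothetical helper name; this is not the spec's text. Split at the last "@", leave the local part as typed, and run only the domain part through domain-to-ASCII, so Unicode input is accepted while the submitted domain stays ASCII.)

```js
// Sketch only: accept Unicode (or mixed) domain input, submit the ASCII form.
const { domainToASCII } = require("node:url");

function normalizeEmailForSubmission(input) {
  const at = input.lastIndexOf("@");
  if (at < 1) return null;                         // no "@" or empty local part
  const local = input.slice(0, at);                // left untouched in this sketch
  const asciiDomain = domainToASCII(input.slice(at + 1));
  return asciiDomain === "" ? null : `${local}@${asciiDomain}`;
}

console.log(normalizeEmailForSubmission("用户@后缀.example"));
// The local part stays as typed; the domain part comes back in xn-- (ASCII) form.
```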

On the other hand, if submitting the ToASCII form, as Chrome does and the current spec requires, is seen as an impediment, it seems to me the impediment is elsewhere (in software that fails to treat the Unicode and ASCII forms equivalently).

Furthermore, to maximize success of removing impediments, it makes sense to make the kind of spec changes that to a browser implementor don't look risky to be the first one to ship. If you ask for changes that would make implementors prefer that someone else go first, success is less likely.

Some of the most glaring differences result from the use of Case Folding in UTS#46 (and IDNA2003), with, using an example that came up in this discussion, whether "ß" is a letter or just a funny way to write "ss". And so on -- more examples on request, but I doubt they will accomplish anything.

First, that's not an example of UTS 46 in non-transitional mode differing from IDNA 2008. Second, are more examples really available on request? I have already requested twice, and Anne has pointed out that I've requested. Yet there's been talking past each other instead of showing examples.

I reiterate my request: can you tell me what kinds of inputs there are that IDNA 2008 accepts but that UTS 46 in non-transitional mode either rejects or maps to a different ASCII form than IDNA 2008 does?

As a UTS 46 implementor, my current understanding is that there are none, but if there are some, it would be useful for me to know.

@arnt

arnt commented May 13, 2024

That Postfix code is indeed easy to misconfigure: I wrote the code and later misconfigured it myself. Sigh. That said, it's not the kind of bug that you can fix easily once you've seen a user make a mistake. When I wrote the code I thought that Postfix had to choose between three different suboptimal behaviours, and I chose the least harmful possibility. The details are off-topic in this thread.

I've also done a thorough study of the domains in the production DNS and probably have at least half of the real-world examples. They're boring. Most of the real-world domains that are treated differently between any pair of IDNA2003, IDNA2008 and UTS46(x/y) look like tests to my eyes. I should be able to dig up the list if anyone really wants it, but be warned, it's splendid material for a bikeshed discussion. My advice, if anyone were to listen, would be to consider IDNA2008 and UTS46 as practically interchangeable and go work on something important.

I agree with John Klensin's comment above. We have large problems here, and we should IMO not allow ourselves to be distracted by fascinating exceptions and edge cases. (Don't misunderstand me: I love edge cases and bikeshedding discussions, but we shouldn't engage in that kind of thing every time the opportunity arises, even if I'm as guilty of it as anyone.)
