"valid email address" doesn't allow IDNs… #538

Closed
chaals opened this Issue Jul 24, 2016 · 16 comments

Collaborator

chaals commented Jul 24, 2016

The definition of a valid email address only allows for ASCII letters, digits, and hyphens in domain names, and only ASCII printable characters before the @ sign. This doesn't match reality, which now includes IDNs.

Should we insist on putting punycode everywhere, or should we try to hide that from users and specify how to make internationalised email addresses?
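
For reference, the ASCII-only rule under discussion can be sketched with the regular expression that the HTML specification gives as an equivalent of its "valid email address" definition. This is only an illustration: the helper name `is_html_valid_email` is made up, and the `^`/`$` anchors of the published pattern are replaced here by `fullmatch`. It shows an IDN domain failing the check while an A-label (Punycode-style) domain passes.

```python
import re

# Pattern from the HTML specification's "valid email address" definition
# (ASCII only). The anchors are omitted because fullmatch() is used below.
HTML_EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)


def is_html_valid_email(address: str) -> bool:
    """Hypothetical helper: True if the address matches the HTML rule."""
    return HTML_EMAIL_RE.fullmatch(address) is not None


if __name__ == "__main__":
    candidates = [
        "chaals@example.org",             # plain ASCII
        "chaals@яндекс.рф",               # IDN written as the user would type it
        "chaals@xn--d1acpjx3f.xn--p1ai",  # a domain made of A-labels
    ]
    for candidate in candidates:
        print(candidate, is_html_valid_email(candidate))
```

Note that the pattern itself places no requirement on a dot in the domain and allows characters such as "+" in the local part, so rejections of those usually come from stricter rules layered on top.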

klensin commented Jul 24, 2016

Chaals, as is usual with i18n issues, things are always a little more complicated. First of all, while "printable [ASCII] characters before the @ sign" is probably good advice for email delivery system developers, it is not what RFC 5321 says/allows in an email local-part. "I have this perfectly good email address that your HTML or web form won't accept" has been the cause of a very large number of problems and even some loss of market share by organizations that have tried to use their interpretations of the narrower HTML/web definition. Second, there are now a set of standards that allow non-ASCII characters in local parts. Those standards are being rapidly implemented and deployed in some parts of the world, so focusing only on domain names ignores a major part of the problem, one that will become more significant over time. To the extent possible, those standards prohibit the use of Punycode-encoded strings in domain labels. The reasons are that they look ugly, are very susceptible to spoofing and similar attacks because users tend to tune such non-memorable strings out, and because the use of anything other than what IDNA2008 calls U-labels creates a mess (or opportunities if one is an attacker) for all sorts of confusion.

Of course, as soon as one starts talking about non-ASCII characters in email addresses, it becomes necessary to talk about the URI/IRI boundary, issues with mailto: (which also interact with valid email addresses being discouraged in many web applications), and so on.

Whatever HTML does with this, I suggest that "just use punycode in the domain part" would almost certainly be the wrong answer.

duerst commented Jul 25, 2016

I agree using punycode in the domain part is the wrong answer. Also, the left hand side (LHS) doesn't use punycode at all, so "putting punycode everywhere" is a total non-starter.

klensin commented Jul 25, 2016

Yes. Thanks Martin, I had forgotten to mention the "punycode is not allowed on the LHS and generally won't work" part of the problem.

@travisleithead travisleithead added this to the HTML 5.2 WD 4 milestone Oct 24, 2016

@chaals chaals added the needs tests label Nov 18, 2016

klensin commented Nov 26, 2016

Anyone building such tests should pay careful attention to local-part (left of "@") validity even in the all-ASCII case. As partially pointed out in the discussion at https://www.w3.org/Bugs/Public/show_bug.cgi?id=15489 (thanks, Martin), a rather large number of web-related implementations claim that email addresses with, e.g., "+", "=", "/", etc. are invalid. It is hard for users to identify just where in the system the problems occur, but it appears there are issues in browsers, in web forms, and in database accesses once the strings appear to be accepted.

The same issues may arise with IDNs and SMTPUTF8 addresses as well -- it is even more frustrating to have such an address apparently accepted for email or UserID use and then have messages lost or logins rejected because the backend systems have not been consistently updated than it is to have them rejected in the first place.
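
To make the all-ASCII local-part point concrete, here is a small sketch that compares an overly narrow pattern with the unquoted "dot-atom" characters that RFC 5322 permits. Both the narrow pattern and the helper names are illustrative assumptions, not taken from any particular implementation, and quoted-string local parts are deliberately left out.

```python
import re

# RFC 5322 "atext": letters, digits and these printable specials.
ATEXT = r"A-Za-z0-9!#$%&'*+/=?^_`{|}~-"

# dot-atom: runs of atext separated by single dots (no leading/trailing dot).
DOT_ATOM_RE = re.compile(rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*")

# Assumed example of an over-strict rule of the kind seen in some web forms.
NARROW_RE = re.compile(r"[A-Za-z0-9._]+")


def local_part_ok(local: str) -> bool:
    """True if the unquoted local part is a valid dot-atom."""
    return DOT_ATOM_RE.fullmatch(local) is not None


def narrow_rule_ok(local: str) -> bool:
    """True if the local part passes the assumed over-strict rule."""
    return NARROW_RE.fullmatch(local) is not None


if __name__ == "__main__":
    for local in ["user+tag", "a=b", "dept/user", ".leading"]:
        print(local, local_part_ok(local), narrow_rule_ok(local))
```

A test suite of the kind discussed here would probably want cases from both columns: local parts that the standards allow but narrow rules reject, and local parts that both should reject.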

@LJWatson LJWatson modified the milestones: HTML 5.2 WD 4, HTML 5.2 WD 5 Jan 17, 2017

@chaals chaals modified the milestones: When we can., HTML 5.2 WD 5 Feb 13, 2017

@chaals chaals self-assigned this Feb 13, 2017

Collaborator

chaals commented Feb 22, 2017

punycode is only used in the right-hand-side, and since it is ASCII, it matches the current definition.

One incremental step that might make sense is to add an editorial note for implementation, similar to what is there for numbers and dates, so it is clear to browser developers that they can implement something useful for addresses - right now I think we're in a position where we are waiting to see implementation in browsers catch up with implementation in the rest of the stack, as well as waiting to see more email systems handle fully-internationalised addresses...

klensin commented Feb 22, 2017

Chaals, yes, but the state of adoption of fully-internationalized email addresses is a bit of a chicken-and-egg problem. Email providers are reluctant to allow registration and use of such addresses until there is full support for them and those who accept email addresses (on web sites and elsewhere) are disinclined to support them until they are widely registered and in use. That makes any "we are waiting to see more implementation (and/or deployment) somewhere else" part of a barrier to such deployment.

Certainly ACE forms (aka "Punycode") should be accepted as domain parts of email addresses. However, users who want (and those who register) domains in local scripts shouldn't be expected to type, much less remember, those forms -- forcing that defeats the whole purpose of allowing non-ASCII names, especially when one remembers that the A-label form for a mostly-ASCII string containing a decorated Latin character or two is much easier to remember than that form for other scripts, especially non-European ones.

If we are trying to serve users well, the rules for the email addresses that are accepted in various web contexts should be identical to those that are acceptable for email. There should be no situations in which a browser or other web interface says "that email address is invalid" when it is perfectly valid in the email system and, ideally, no situations in which the web accepts an email address as valid that standards-conforming email systems consider to be invalid syntax.

john

Member

cynthia commented Feb 22, 2017

I gave this idea a bit of thought over the last couple of days - my thoughts have been shared with some other folks in the group over an informal channel. Yes, this is a valid problem. Writing this down in a document is easy - getting the implementations and validation rules right across all implementations - not as easy.

For the time being, web applications that absolutely need support for this can do this in userland with a text form instead of an email form, and implement the validation at the application level. This seems like a more natural direction with modern extensible web principles, or at least that's how I see it.
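
As a rough illustration of the userland approach described above, an application that collects addresses through a plain text field might run a lenient check like the one below. This is a sketch under assumptions: `check_address` is a hypothetical name, the rules are intentionally minimal, and Python's built-in `idna` codec implements IDNA 2003 rather than IDNA2008, so a production system would likely need a different library.

```python
def check_address(text: str) -> bool:
    """Hypothetical application-level check for a plain text field.

    Deliberately lenient: it only rejects input that cannot be split into a
    non-empty local part and a domain whose labels can be IDNA-encoded.
    Non-ASCII characters in the local part are passed through untouched.
    """
    if any(ch.isspace() for ch in text):
        return False
    # Split on the last "@" so the domain is everything after it.
    local, at, domain = text.rpartition("@")
    if not at or not local or not domain:
        return False
    try:
        # Built-in codec: IDNA 2003 only (an assumption of this sketch).
        domain.encode("idna")
    except UnicodeError:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["chaals@яндекс.рф", "n@ai", "no-at-sign", "user@example..com"]:
        print(candidate, check_address(candidate))
```

The check leaves final judgement to the mail system, so the application still needs to handle delivery failures for addresses that pass it.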

Collaborator

chaals commented Feb 23, 2017

@klensin yes, there is indeed a chicken-and-egg problem.

My "incremental" proposal is probably not clear. What I would like to see, at least as an initial step, is for browsers to provide an interface for email similar to the way they send requests for IDNs - so that I can type chaals@яндекс.рф and see that, even though the value of the control is a punycode, and so acceptable, value.

If you read the spec as is, there isn't much support for that at the moment, although it isn't explicitly forbidden. By comparison, there is explicit wording to support that kind of approach for e.g. dates.

This would make it clearer that there is no reason not to allow users to write internationalised content that can be converted, which might be helpful in getting user agents to start implementing something more modern than ASCII-only email.
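
The display-versus-value split proposed above could look roughly like the following sketch. The function names are hypothetical, only the domain is converted (the local part is left alone, since Punycode does not apply there), and the built-in `idna` codec used here is IDNA 2003, which may differ from what a browser would actually implement.

```python
def to_ascii_value(displayed: str) -> str:
    """Hypothetical: map what the user typed to a value with an ASCII domain."""
    local, at, domain = displayed.rpartition("@")
    if not at:
        raise ValueError("no @ sign in address")
    # Convert each domain label to its A-label (xn--...) form where needed.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


def to_display(value: str) -> str:
    """Hypothetical: map a stored ASCII-domain value back to what users see."""
    local, at, domain = value.rpartition("@")
    if not at:
        raise ValueError("no @ sign in address")
    unicode_domain = domain.encode("ascii").decode("idna")
    return f"{local}@{unicode_domain}"


if __name__ == "__main__":
    typed = "chaals@яндекс.рф"
    value = to_ascii_value(typed)
    print(value)              # ASCII-only form held as the control's value
    print(to_display(value))  # form shown back to the user
```

An address with a non-ASCII local part would still fail the current ASCII-only definition after this conversion, which is why the local part is a separate question.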

Member

cynthia commented Feb 23, 2017

The incremental proposal above has a risk of requiring consistency across browsers: if any browser treats the value as it is (as an IDN), then signing up from one browser and logging in from another browser may (and most likely will) not work.

This would effectively require a monkey patch layer to assure consistency across browsers - we've gone through that more times than we are comfortable with in the past, which is why I suggested shimming it inside a generic text field and having the application deal with the consistency.

Collaborator

chaals commented Feb 25, 2017

I'm not sure this requires a new monkey patch layer. Using a text field to allow this loses the useful functionality of looking for an email address, but already exposes users to the problems of patchy implementation. In practice, browsers aren't consistent in doing "new" things anyway, and in this case the email infrastructure beyond the browser is also changing.

If browsers implemented native support for e.g. converting to punycode on the fly as validation, and collecting and recognising internationalised email addresses, then applications could start to polyfill or switch as they deemed appropriate.

In any event, telling browsers not to implement anything for non-ASCII email seems futile - markets where it is important will start to diverge anyway, and we'll still have the problem without the potential benefits of working out how to do this right :S

klensin commented Feb 25, 2017

chaals, I think that is just the point (and suggestion) I was trying to make, only you were more clear. From a slightly different perspective, it is reasonable for browsers to make the exact checks and conversions (if needed) required by the base standards or to simply accept everything, pass whatever strings they get across whatever interfaces they have to mail systems, and insist that those systems do their jobs and that interfaces are well enough designed to make good error reporting feasible and convenient. Telling them to not support a feature that is believed to be vitally important in a number of countries is a good way to encourage either ad hoc solutions that may turn out to be non-interoperable or to turn browser feature sets into a political matter.

Two implementation hints (entirely up to you folks as to whether they belong in this document):

(i) As explained in detail in RFC 6055, conversion from native forms to ACE (A-label) form (i.e., with Punycode) should generally be deferred until as close to DNS lookup as possible. There are private identifier systems out there that are supported (and, I gather, required) by widely-deployed systems and their vendors that use normal Unicode encodings for non-ASCII characters, not Punycode encoding.

(ii) Because "%" is, historically, a character with important mail routing semantics in many email systems, any system that deals with non-ASCII characters by using "%NN"-style encoding, such as those that do IRI <-> URI conversions, needs to be exceptionally careful about conversions to and from that form. Because that email routing usage is much less common today than it was a decade or two ago, the need for applications that might do the conversions to pay careful attention may be less obvious, but no less important, than it was historically.
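
Hint (ii) can be illustrated with a short sketch of what goes wrong when a local part containing a literal "%" is percent-decoded more often than it was encoded. The helper names are hypothetical; the calls are the standard `urllib.parse.quote` and `unquote`.

```python
from urllib.parse import quote, unquote


def encode_local_part(local: str) -> str:
    """Hypothetical: percent-encode a local part for use in a URI."""
    # safe="" so that "%" itself (and "/") is escaped rather than passed through.
    return quote(local, safe="")


def decode_local_part(encoded: str) -> str:
    """Hypothetical: undo encode_local_part, decoding exactly once."""
    return unquote(encoded)


if __name__ == "__main__":
    literal = "user%41host"        # "%" is part of the address itself
    encoded = encode_local_part(literal)
    print(encoded)                 # the literal "%" is escaped as %25
    print(decode_local_part(encoded) == literal)
    # Decoding a second time silently turns "%41" into "A".
    print(unquote(decode_local_part(encoded)))
```

The same care applies to non-ASCII local parts, which are encoded here as percent-escaped UTF-8 bytes and only survive if every layer encodes and decodes the same number of times.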

Collaborator

chaals commented Mar 30, 2017

@klensin wrote

Second, there are now a set of standards that allow non-ASCII characters in local parts. Those standards are being rapidly implemented and deployed in some parts of the world, so focusing only on domain names ignores a major part of the problem, one that will become more significant over time.

and @duerst seconded it, quite rightly.

We're scoping this issue to the IDN part, and I raised #845 to deal with the local part. (I should rename that to be clearer).

I will propose something for the IDN part as soon as I can.

@chaals chaals modified the milestones: HTML 5.2 WD 7, When it's ready Apr 3, 2017

Collaborator

chaals commented Apr 3, 2017

This was discussed at the face-to-face meeting on 29 March, leading to the conclusion to propose something as above…

klensin commented Apr 9, 2017

@chaals: I read through that thread. In addition to India starting to push full non-ASCII addresses (LHS included) and Russia starting to, I'm told it is being pushed very hard in China and there is at least one fully-functional implementation there, probably several. Because non-ASCII email addresses are being rolled out one country and writing system at a time (as the WG that developed the standards predicted), your trying to decide whether the facilities are deployed enough is not only error-prone but may be considered insulting in areas where there are already millions of such addresses deployed.

I really think that having a web application or form block a perfectly valid email address is a bad idea -- unlike, e.g., the fairly passive DNS, the email system is capable of protecting itself against bad addresses and even sending a non-ASCII address to a host that cannot accept it -- and the effect of such blocking is likely to be a lot of annoyed users and, especially in places where these addresses are believed to be important, pressure to ignore whatever standards W3C manages to set. If that does anyone any good, I can't figure out who it would be.

eligrey commented Apr 14, 2017

Related: Don't forget that dotless email addresses are real and valid. n@ai is the email address of some guy named Ian who runs the AI TLD.

chaals pushed a commit that referenced this issue Apr 18, 2017

@r12a r12a referenced this issue in w3c/i18n-activity May 25, 2017

Open

"valid email address" doesn't allow IDNs… #417

@cynthia cynthia closed this in #881 Jun 14, 2017

cynthia added a commit that referenced this issue Jun 14, 2017

Clarify the charset constraints on email addresses (#881)
* Clarify the constraints on email addresses

Fix #538
See also #845

* Remove class="impl"

See #178 

Should cover the entire chapter