Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input #115

Open
hsivonen opened this issue Jun 19, 2017 · 26 comments

@hsivonen
Member

commented Jun 19, 2017

Encodings other than ISO-2022-JP have the property that if you concatenate two outputs from a conforming encoder and decode them together, you get the same result as when decoding them separately and then concatenating.

ISO-2022-JP lacks this property. The encoder makes an effort in this direction by emitting a transition to the ASCII state at the end of its output, but if the next segment being concatenated starts with a transition to a non-ASCII state, the concatenation contains zero ASCII bytes between two escapes.

Is there a strong reason for treating a transition to the ASCII state immediately followed by another escape as non-conforming? Or put the other way, what purpose does the transition to the ASCII state at the end of encode serve if not achieving the above-mentioned concatenation property that other encodings have?

This is relevant to RFC 2047 header decoding.
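
The concatenation problem described above can be reproduced with a short sketch. It assumes Python's built-in `iso2022_jp` codec as a stand-in for a conforming encoder; that codec is not the Encoding Standard's encoder, but it also returns to the ASCII state at the end of its output. The sample strings are chosen for illustration only.

```python
# Two independently encoded segments (sample text chosen for illustration).
first = "プロ".encode("iso2022_jp")
second = "グラム".encode("iso2022_jp")

# Each segment ends with ESC ( B, the transition back to the ASCII state.
assert first.endswith(b"\x1b(B")
# The second segment starts with ESC $ B, a transition to JIS X 0208.
assert second.startswith(b"\x1b$B")

joined = first + second

# The concatenation has zero bytes between the two escape sequences.
print(b"\x1b(B\x1b$B" in joined)    # True

# Python's own decoder is lenient about the adjacent escapes, so it
# returns the text without inserting U+FFFD.
print(joined.decode("iso2022_jp"))  # プログラム
```

A decoder that treats the `ESC ( B` `ESC $ B` pair as an error would instead produce U+FFFD at the join, which is the behavior this issue questions.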

@hsivonen hsivonen changed the title Concatenating two ISO-2022-JP outputs from a conforming encoding doesn't result in conforming input Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input Jun 19, 2017

@vyv03354

Collaborator

commented Jun 19, 2017

It is my understanding that the reason is to prevent XSS attacks. Consider "<\u001b(B\u001b$Bscript" for example.

@hsivonen

Member Author

commented Jul 3, 2017

It is my understanding that the reason is to prevent XSS attacks. Consider "<\u001b(B\u001b$Bscript" for example.

Why is that worth protecting against if we can't protect against "<\x1b(Js\x1b(Bcript"?

That is, if we can't generate U+FFFD for all of these, is it worth generating it for any of these?

  • Escape immediately followed by another escape.
  • Transition from the ASCII state to the ASCII state.
  • Useless transitions between the ASCII state and the Roman state.

The last one seems the hardest to prevent without potentially breaking some legitimate inputs.
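
The ASCII/Roman case above can be illustrated with a minimal sketch, again assuming Python's built-in `iso2022_jp` codec rather than any browser's decoder. The byte string is the one quoted in this comment; the "filter" is just a hypothetical substring check.

```python
# The byte sequence from the comment: "<", switch to Roman, "s",
# switch back to ASCII, "cript".
payload = b"<\x1b(Js\x1b(Bcript"

# A hypothetical byte-level filter looking for "<script" finds nothing,
# because the escape sequences split the tag name.
print(b"<script" in payload)  # False

# After decoding, the escape sequences are gone and the tag is intact.
# No U+FFFD is produced, since every state contained at least one byte.
decoded = payload.decode("iso2022_jp")
print(decoded)                # <script
```

This is why rejecting only the "escape immediately followed by escape" pattern does not, by itself, stop escape sequences from hiding markup.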

@annevk

Member

commented Apr 25, 2018

Does anyone wish to drive a change proposal here? Or should we just accept this and maybe document the issues a bit further?

cc @jungshik

@annevk annevk added the normative label Apr 25, 2018

annevk added a commit that referenced this issue Aug 30, 2018

ISO-2022-JP encoder: document an oddity
At this point it does not seem worth it to require further implementation changes and risk compatibility issues, so instead document the quirk.

Closes #115.
@annevk

Member

commented Aug 30, 2018

I decided to simply document this quirk in #155. If there are any concerns with that approach let me know.

@annevk annevk closed this in #155 Sep 2, 2018

annevk added a commit that referenced this issue Sep 2, 2018

ISO-2022-JP encoder: document an oddity
At this point it does not seem worth it to require further implementation changes and risk compatibility issues, so instead document the quirk.

Closes #115.
@hsivonen

Member Author

commented Nov 12, 2018

I still have trouble seeing the value of generating an error when an escape is followed by another escape when we don't generate errors for the other two cases I mentioned on July 3 2017.

@annevk

Member

commented Nov 12, 2018

FWIW, if you want to reopen this and remove that error, that's fine by me. It'd be good to know what other implementations do at this point, as changing such details is never fun for anyone involved.

@hsivonen

Member Author

commented Nov 13, 2018

I noticed that Unicode Security Considerations say that what the Encoding Standard requires for ISO-2022-JP should be done. If we reopen this, we should ask Mark Davis and Michel Suignard for their rationale.

@hsivonen

Member Author

commented Nov 19, 2018

Regarding the previous Gecko bug from a week ago, the reporter provides more info in a Thunderbird bug.

@hsivonen

Member Author

commented Nov 19, 2018

It'd be good to know what other implementations do at this point

Edge and IE don't generate a REPLACEMENT CHARACTER. Firefox, Chrome and Safari do. (With the caveat that my Mac is stuck on El Capitan, so I couldn't test the latest Safari.)

@annevk

Member

commented Nov 21, 2018

It seems worth resolving the Unicode Security Considerations question relative to the observation in #115 (comment). I wonder if @srl295 or @kenlunde could help us with that.

To restate, why is an escape followed by an escape considered problematic, whereas going from ASCII to ASCII, or uselessly going between ASCII and Roman, is not considered problematic?

Also taking into account that an escape followed by an escape is typical in email as per https://bugzilla.mozilla.org/show_bug.cgi?id=1374149#c5.

@annevk annevk reopened this Nov 21, 2018

@kenlunde

commented Nov 21, 2018

@annevk By "escape followed by an escape," you are referring to ISO 2022 escape sequences, not individual "escape" (U+001B) characters, right? If so, back in the day when ISO 2022 encoding was common for email clients, I often encountered data that included redundant escape sequences, meaning no-op escape sequences that would shift back to ASCII, then immediately shift back into JIS X 0208 with no intervening characters. I also recall building in support to handle such no-op escape sequences when converting to other encodings, such as EUC-JP or Shift-JIS, or to fix the ISO 2022 data by removing them. I am not sure whether that helps.

@kenlunde

commented Nov 21, 2018

The following paragraph is from the bottom of page 583 of CJKV Information Processing, Second Edition:

Besides simple code conversion, it is also very important to be able to detect the escape
sequences used in ISO-2022-JP encoding. Escape sequences signal the software when to
change modes. Good software should also keep track of the current n-byte-per-character
mode so that redundant escape sequences can be ignored (and absorbed). Remember that
Shift-JIS encoding does not use escape sequences, so you will have to make sure that they
are not written to the resulting output file.
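
The mode tracking described in the quoted paragraph might look like the following sketch. It is an assumption-laden illustration, not the Encoding Standard's decoder and not code from the book: the function name `strip_noop_escapes` is hypothetical, only the four escape sequences of RFC 1468 are recognised, and the input is assumed to be otherwise well-formed ISO-2022-JP.

```python
import re

ASCII = b"\x1b(B"
# ASCII, JIS X 0201 Roman, JIS X 0208-1978 and JIS X 0208-1983.
ESCAPE = re.compile(rb"(\x1b(?:\(B|\(J|\$@|\$B))")

def strip_noop_escapes(data: bytes) -> bytes:
    """Drop escape sequences that do not change the mode of any content."""
    out = bytearray()
    current = ASCII  # mode of the last escape actually written
    target = ASCII   # mode requested by the most recent escape read
    # With a capturing group, split() alternates content and escapes.
    for i, piece in enumerate(ESCAPE.split(data)):
        if i % 2 == 1:
            target = piece           # remember the mode, write nothing yet
        elif piece:
            if target != current:    # only emit an escape when needed
                out += target
                current = target
            out += piece
    if target != current:            # e.g. the final return to ASCII
        out += target
    return bytes(out)

raw = b" \x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B"
print(strip_noop_escapes(raw))
# b' \x1b$B%W%m%0%i%`\x1b(B'
```

Deferring each escape until content actually arrives absorbs both zero-length states and transitions into the mode that is already active.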

@annevk

Member

commented Nov 21, 2018

Right, the no-op escape sequences are the question here. https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input suggests those need to emit a U+FFFD and this is currently what the Encoding Standard does. However, that is incompatible with scenarios occurring in email as you note and also doesn't consider other potentially problematic scenarios such as an escape sequence for ASCII, some ASCII code points, and then another escape sequence for ASCII. Or that, but then between ASCII and Roman.

So what we'd like to know is how much consideration we should pay to that security consideration (which seems incomplete if a real danger) relative to the need for email clients to handle such no-op escape sequences.

@annevk

Member

commented Nov 21, 2018

(Also, thank you very much for the timely reply!)

@kenlunde

commented Nov 21, 2018

@annevk In my opinion, enough time has passed that anything that is Unicode-encoded should have no realistic or real-world need to convert back to legacy encodings, particularly because almost all legacy encodings cannot handle a large number of characters that can be represented via Unicode. With that said, I think that nuking the no-op escape sequences is the right approach. There's no meaningful reason to leave behind a "trail" of sorts that something that is not really text data was in the data stream.

@hsivonen

Member Author

commented Nov 22, 2018

@EarlgreyPicard

commented Nov 23, 2018

I am only one of the Thunderbird users in Japan. Please allow me to comment here.

In Japan, many users still send e-mail with ISO-2022-JP encoding.
Many Japanese Windows users still use Shift_JIS rather than Unicode when dealing with Japanese text. UTF-8 or UTF-16 is used only when it is necessary to handle characters not included in Shift_JIS.
However, sending 8-bit Shift_JIS text directly via email caused problems.
For that reason, ISO-2022-JP, which can represent characters using only 7 bits, was chosen for sending Japanese email.
This old method is still in use today.

Thunderbird was recently updated automatically to version 60. Since then, there have been reports of U+FFFD being inserted into the displayed subject of some emails.
For example, it was posted to MozillaZine.jp, and a bug report was filed at bugzilla.mozilla.org.
However, that bug report was marked RESOLVED because Firefox (and Thunderbird) conforms to the spec. That is why I came here.

Although there is no concrete report, I suspect that the email software producing the non-conforming encoding is an old version of Outlook.
It is easy to say, "Email software that does not comply with the spec is bad", but the users of such software are by no means few. I think that we should not ignore this fact.

Regular users are not interested in the meaning of U+FFFD. They will simply decide that it is a bug in Thunderbird. And they will downgrade Thunderbird to version 52 or choose Microsoft products.

I hope that discussion will be conducted from the user's point of view.

@kenlunde

commented Nov 23, 2018

@EarlgreyPicard In other words, you agree with my suggestion to simply nuke the no-op escape sequences.

@EarlgreyPicard

commented Nov 23, 2018

I do not know what "simply nuke the no-op escape sequences" means, but I hope that Thunderbird will not insert U+FFFD into the result of decoding the following email subject.

Subject: =?ISO-2022-JP?B?IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI=?=

Good:
プログラム

Bad:
プロ�グラム
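
To see why this subject triggers the problem, the encoded word can be unpacked with Python's standard `email.header` module. This is only an inspection sketch; it says nothing about how Thunderbird itself parses headers.

```python
from email.header import decode_header

subject = "=?ISO-2022-JP?B?IBskQiVXJW0bKEIbJEIlMCVpJWAbKEI=?="

# decode_header() undoes the RFC 2047 base64 layer and returns the raw
# bytes together with the declared charset; it does not decode the text.
[(raw, charset)] = decode_header(subject)
print(charset.lower())  # iso-2022-jp
print(raw)
# b' \x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B'

# ESC ( B (to ASCII) is immediately followed by ESC $ B (to JIS X 0208),
# exactly between the bytes for プロ and the bytes for グラム.
print(raw.count(b"\x1b(B\x1b$B"))  # 1
```

The adjacent pair sits where the Bad rendering shows U+FFFD, so the error comes from the zero-length ASCII state rather than from any malformed character data.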

@kenlunde

commented Nov 23, 2018

@EarlgreyPicard Sorry. To clarify, I meant 1) to remove no-op escape sequences; and 2) to not emit U+FFFD in their presence.

@EarlgreyPicard

commented Nov 23, 2018

@kenlunde I understand. Thank you. I agree.

@EarlgreyPicard

commented Dec 9, 2018

@hsivonen Is there progress on the mailing list?
Japanese Thunderbird users are still waiting for the problem to be solved.

@hsivonen

Member Author

commented Dec 10, 2018

Is there progress on the mailing list?

No. I pinged the mailing list again.

@hsivonen

Member Author

commented Dec 11, 2018

To address Mark Davis' request for formal feedback I started writing something. While doing so, I came up with this:


Generate U+FFFD if:

  • A state transition was made such that the previous state had no content and the previous state was not the ASCII state. (I.e. stop generating U+FFFD if the zero-length state is the ASCII state.)
  • A state transition to the ASCII state was preceded by the Roman state and the next byte was not 0x5C, 0x7E or the end of the stream.
  • A state transition to the Roman state was made and the next byte was not 0x5C, 0x7E, 0x1B or the end of the stream. (0x1B is on this list to avoid a case where both this rule and the first rule would apply at the same time resulting in two U+FFFDs.)

This would actually uphold the security properties that UTR 36 tries to uphold but fails to and this would avoid the unwanted U+FFFD generation reported in the email cases.
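
One literal reading of the three rules above can be sketched as a function that counts how many U+FFFDs the escape sequences in a byte stream would trigger. This is an illustration under assumptions, not a proposed spec algorithm: the name `count_escape_errors` is hypothetical, the rules are applied independently (so rules 1 and 2 may both fire on one transition), and errors inside character data are not considered.

```python
import re

ASCII = b"\x1b(B"
ROMAN = b"\x1b(J"
# ASCII, Roman, katakana, JIS X 0208-1978 and JIS X 0208-1983.
ESCAPE = re.compile(rb"(\x1b(?:\(B|\(J|\(I|\$@|\$B))")

def count_escape_errors(data: bytes) -> int:
    """Count the U+FFFDs that the three rules would generate."""
    pieces = ESCAPE.split(data)  # content, escape, content, escape, ...
    state = ASCII                # a stream starts in the ASCII state
    errors = 0
    for i in range(1, len(pieces), 2):
        new_state = pieces[i]
        following = pieces[i + 1]
        if following:
            next_byte = following[0]
        elif i + 2 < len(pieces):
            next_byte = 0x1B     # another escape sequence follows directly
        else:
            next_byte = None     # end of the stream
        # Rule 1: the previous state was empty and was not ASCII.
        if not pieces[i - 1] and state != ASCII:
            errors += 1
        # Rule 2: Roman -> ASCII without a byte that needs ASCII.
        if new_state == ASCII and state == ROMAN \
                and next_byte not in (0x5C, 0x7E, None):
            errors += 1
        # Rule 3: -> Roman without a byte that needs Roman.
        if new_state == ROMAN and next_byte not in (0x5C, 0x7E, 0x1B, None):
            errors += 1
        state = new_state
    return errors

# Email-style empty ASCII state between two JIS X 0208 runs: accepted.
print(count_escape_errors(b"\x1b$B%W%m\x1b(B\x1b$B%0%i%`\x1b(B"))  # 0
# Useless ASCII/Roman switching around "s": rejected twice.
print(count_escape_errors(b"<\x1b(Js\x1b(Bcript"))                  # 2
```

Under this reading the email pattern passes while the ASCII/Roman obfuscation is flagged, which matches the two goals stated for the rules.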

The key question is if imposing the requirement that ASCII to Roman and vice versa transitions can only happen when logically necessary and then at the last possible moment is feasible given the behavior of encoders out there.

Does anyone want to volunteer to research this by checking the behavior of existing encoders or by searching archives of old Japanese email for the relevant byte patterns?

@hsivonen

Member Author

commented Dec 17, 2018

I'm thinking of submitting a suggestion to go for non-committal middle ground for now. Thoughts?
