Perls prior to v5.8 did not work well with UTF-8. This presents problems for the CP1252 character encoding, as some of its code points, when converted to Unicode, require UTF-8 to represent. This patch instead uses ASCII approximations for them. Prior to this patch, they were left alone, which would show up as C1 control characters. On Perls that do work with UTF-8, the code points are still properly converted to their Unicode equivalents.
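The fallback described above can be sketched as follows. This is an illustration in Python, not the Perl code from the patch; the table `ASCII_APPROX`, the function name `cp1252_to_ascii_fallback`, the particular stand-ins chosen, and the use of "?" for unlisted bytes are all assumptions made for the example.

```python
# Hypothetical table: a few CP1252 bytes in the 0x80-0x9F range and ASCII
# stand-ins for them. The approximations chosen by the real patch may differ.
ASCII_APPROX = {
    0x85: "...",  # horizontal ellipsis
    0x91: "'",    # left single quotation mark
    0x92: "'",    # right single quotation mark
    0x93: '"',    # left double quotation mark
    0x94: '"',    # right double quotation mark
    0x96: "-",    # en dash
    0x97: "--",   # em dash
}


def cp1252_to_ascii_fallback(raw):
    """Replace CP1252-specific bytes with ASCII approximations.

    Bytes outside 0x80-0x9F are passed through as their Latin-1 characters.
    Bytes in that range with no table entry become "?" (an assumption made
    for this sketch only).
    """
    out = []
    for b in raw:
        if 0x80 <= b <= 0x9F:
            out.append(ASCII_APPROX.get(b, "?"))
        else:
            out.append(chr(b))
    return "".join(out)


print(cp1252_to_ascii_fallback(b"it\x92s a test \x97 really"))
```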
These would otherwise call an undefined function. Just return instead of doing that, which leads to incorrect results, but it's better than crashing.
I had trouble applying the patch to make CP1252 the default; in trying to work around that manually, I inadvertently overwrote this recent change. Given that I had trouble, I should have tested before submitting the patch, and hopefully will learn my lesson from this.
The :^ascii: should be part of a bracketed character class (as in [[:^ascii:]]). I missed this in code review. There is code in regcomp.c to warn on something like this, but it didn't get triggered; I'll look into that. And I didn't add a test for this. It's not critical if such characters don't get dropped.
When there is no =encoding line and the file isn't UTF-8, the encoding is now presumed to be CP 1252 instead of Latin1. This was discussed in pod-people starting with <412A27EC-6EE5-4A06-8CBA-5128E7CE3741@justatheory.com>. I will submit a patch to perlpodspec once this is out.
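A minimal sketch of the guessing rule described above, written in Python for illustration rather than taken from the module. The function name `guess_encoding` is hypothetical, and validating the whole input as UTF-8 is a simplification assumed here; the actual code may inspect only part of the text.

```python
def guess_encoding(raw):
    """Guess the encoding of POD text that has no =encoding line.

    Simplified rule assumed for this sketch: if the bytes are well-formed
    UTF-8, call it UTF-8; otherwise presume CP1252.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "cp1252"
    return "utf-8"


print(guess_encoding("caf\u00e9".encode("utf-8")))  # well-formed UTF-8
print(guess_encoding(b"caf\xe9"))                   # lone 0xE9 is not valid UTF-8
```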
One test has been failing because it was testing that illegal UTF-8 was considered to be UTF-8. This commit fixes that. The other test is made a TODO. It is passed genuinely ambiguous text that could be either CP1252 or UTF-8. This commit makes the text passed more plausible than it was previously. The fact that it was hard to get a plausible example gives me hope that real-world examples will be quite unlikely to be guessed wrong. The first byte must be between C2 and DF; otherwise it would start a 3-byte (or longer) sequence in UTF-8, and it would be even harder to find a likely CP1252 equivalent sequence. That means that the first byte is one of 1) an uppercase accented character, 2) the multiplication sign, or 3) the German sharp s 'ß'. The second byte is in the range 80 to 9F. Most of these in CP1252 are various punctuation characters or symbols such as a dagger. These are mostly unlikely to immediately follow an uppercase letter, multiplication sign, or the sharp s. One that could is a right single quote used as an apostrophe in English. But there are no accents in English except in borrowed words. Since it must be a capital, it's likely the whole word is in caps, like in a heading. I came up with what looks like "JOSÉ'S" in CP1252, which looks like legal UTF-8 as well.
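The ambiguity can be seen by decoding the same bytes both ways. This is an illustrative Python sketch, not the test from the commit; the byte string is built here on the assumption that the apostrophe is the CP1252 right single quote (0x92) following É (0xC9), as the message describes.

```python
# "JOSÉ'S" in CP1252, with É = 0xC9 and the right single quote = 0x92.
raw = b"JOS\xc9\x92S"

# Read as CP1252: É followed by U+2019 RIGHT SINGLE QUOTATION MARK.
as_cp1252 = raw.decode("cp1252")

# Read as UTF-8: 0xC9 0x92 is a well-formed two-byte sequence, so the
# decode succeeds and yields the single character U+0252 instead.
as_utf8 = raw.decode("utf-8")

# The lead byte is in C2..DF and the trail byte in 80..9F, as described.
assert 0xC2 <= raw[3] <= 0xDF and 0x80 <= raw[4] <= 0x9F

print(ascii(as_cp1252))
print(ascii(as_utf8))
```

Because both decodes succeed, a guesser that only checks for well-formed UTF-8 cannot tell the two readings apart, which is why the test is marked TODO.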
This commit takes two identical regular expression patterns and makes them into a single qr//. It also rewrites the combined pattern so that it is platform-independent on sufficiently modern Perls. I think the pattern is wrong to exclude the digit '9', but I don't have time now to develop the expertise to delve into it, so am leaving it as-is. I compiled the two versions under -Dr (one using hard-coded characters, and the other using [:posix:] classes) to verify that the new one generates the exact same code points as the original on ASCII platforms.
This whole thing probably should be fixed to not call 'diff' at all, but for now, there is no real need for the '-u' option to diff, and some platforms don't have that option, so just remove it.
When no =encoding line is present, the encoding is checked to see if it is UTF-8, and if not, currently ISO 8859-1 is chosen instead. This wasn't working well on EBCDIC platforms prior to this commit. It is planned to change things so that CP 1252 is chosen instead of 8859-1, and this code will have to be revised to handle that, but in case that doesn't work out, this commit can be fallen back to.
This same code is repeated in multiple places. I chose not to consolidate it. The comments indicate that it was known it would work only on ASCII, but since v5.8 it has been possible to make it work easily on non-ASCII platforms as well, using the translation functions available starting in that release.
For Perls starting in v5.8, this allows BOM detection on all platforms.
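For readers unfamiliar with BOM detection, here is a rough sketch in Python. It is not the module's code: the function name `detect_bom` and the set of BOMs checked (UTF-8 and the two UTF-16 byte orders) are assumptions for the example. The point is that the check compares raw bytes, so it does not depend on the platform's native character set.

```python
import codecs

# BOM byte sequences paired with encoding names. Which BOMs the real code
# recognizes is an assumption of this sketch.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(raw):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


print(detect_bom(b"\xef\xbb\xbf=pod"))
print(detect_bom(b"=pod"))
```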
The No-Break Space and Soft Hyphen are used in 6 modules. This generalizes so they can be handled fully on non-ASCII platforms. A recent patch had already fixed this for one area of code, but it turns out that they are used in more than one place. In most of those places, they were handled somewhat gracefully for non-ASCII platforms, but this patch makes them work completely correctly. I used global scalar variables in the base module to store what the native characters are for these code points, as the calculation of what they should be is not obvious, and so should be done in a single place. An unlikely pitfall is that these scalars are not read-only; I suppose a subroutine could be used instead, but I thought that this was adequate.
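The idea of computing the native characters once and sharing them can be sketched like this. It is an illustration in Python, not the Perl code; the helper `native_byte` and the constant names are hypothetical, and Python's `cp037` codec stands in for one EBCDIC code page.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE
SHY = "\u00ad"   # SOFT HYPHEN


def native_byte(ch, codec):
    """Return the single byte that represents `ch` in the given code page."""
    encoded = ch.encode(codec)
    assert len(encoded) == 1, "expected a single-byte code page"
    return encoded[0]


# Computed once, in one place, then reused wherever these characters are
# needed. The values differ between an ASCII-based and an EBCDIC code page.
for codec in ("latin-1", "cp037"):
    print(codec, hex(native_byte(NBSP, codec)), hex(native_byte(SHY, codec)))
```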
These tests fail on EBCDIC platforms because the expected sort order is hard-coded. This introduces a helper .pl file which contains two functions to make the sort order come out ASCII (hence to the expected value) no matter what the current platform's character set is.
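Why the hard-coded order fails, and what an ASCII-order sort looks like, can be sketched as follows. This is illustrative Python rather than the helper .pl file; the function names `native_sort` and `ascii_sort` are hypothetical, and the `cp037` codec stands in for an EBCDIC platform.

```python
WORDS = ["apple", "Banana", "1st", "cherry"]


def native_sort(words, codec="cp037"):
    """Sort by the bytes of the platform's native encoding."""
    return sorted(words, key=lambda w: w.encode(codec))


def ascii_sort(words):
    """Sort by ASCII byte values, whatever the native character set is."""
    return sorted(words, key=lambda w: w.encode("ascii"))


# In EBCDIC, lowercase letters sort before uppercase letters, and both sort
# before digits, so the native order differs from the ASCII order.
print(native_sort(WORDS))
print(ascii_sort(WORDS))
```

Sorting on a translated key leaves the strings themselves untouched, so only the comparison changes.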