Subject: [perl #58182] partial: Add uni \s,\w matching

This commit causes regex sequences \b, \s, and \w (and complements) to match in the latin1 range in the scope of feature 'unicode_strings' or with the /u regex modifier. It uses the previously unused flags field in the respective regnodes to indicate the type of matching, and in regexec.c, uses that to decide which of the handy.h macros to use, native or Latin1. I chose this for now rather than create new nodes for each type of match. An earlier version of this patch did that, and in every case the switch case: statements were adjacent, offering no performance advantage. If regexec were modified to use in-line functions or more macros for various short section of it, then it would be faster to have new nodes rather than using the flags field. But, using that field simplified things, as this change flies under the radar in a number of places where it would not if separate nodes were used.
rjbs · Oct 15, 2010 · a12cf05 · a12cf05
1 parent bdc22dd
commit a12cf05
Show file tree

Hide file tree

Showing 9 changed files with 420 additions and 83 deletions.
diff --git a/lib/feature/unicode_strings.t b/lib/feature/unicode_strings.t
@@ -7,9 +7,10 @@ BEGIN {
     require './test.pl';
 }
 
-plan(13312);    # Determined by experimentation
+plan(20736);    # Determined by experimentation
 
-# Test the upper/lower/title case mappings for all characters 0-255.
+# In this section, test the upper/lower/title case mappings for all characters
+# 0-255.
 
 # First compute the case mappings without resorting to the functions we're
 # testing.
@@ -140,3 +141,122 @@ for my  $prefix (\%empty, \%posix, \%cyrillic, \%latin1) {
         }
     }
 }
+
+# In this section test that \w, \s, and \b work correctly.  These are the only
+# character classes affected by this pragma.
+
+# Boolean if w[$i] is a \w character
+my @w = (0) x 256;
+@w[0x30 .. 0x39] = (1) x 10;     # 0-9
+@w[0x41 .. 0x5a] = (1) x 26;     # A-Z
+@w[0x61 .. 0x7a] = (1) x 26;     # a-z
+$w[0x5F] = 1;                    # _
+$w[0xAA] = 1;                    # FEMININE ORDINAL INDICATOR
+$w[0xB5] = 1;                    # MICRO SIGN
+$w[0xBA] = 1;                    # MASCULINE ORDINAL INDICATOR
+@w[0xC0 .. 0xD6] = (1) x 23;     # various
+@w[0xD8 .. 0xF6] = (1) x 31;     # various
+@w[0xF8 .. 0xFF] = (1) x 8;      # various
+
+# Boolean if s[$i] is a \s character
+my @s = (0) x 256;
+$s[0x09] = 1;   # Tab
+$s[0x0A] = 1;   # LF
+$s[0x0C] = 1;   # FF
+$s[0x0D] = 1;   # CR
+$s[0x20] = 1;   # SPACE
+$s[0x85] = 1;   # NEL
+$s[0xA0] = 1;   # NO BREAK SPACE
+
+for my $i (0 .. 255) {
+    my $char = chr($i);
+    my $hex_i = sprintf "%02X", $i;
+    foreach my $which (\@s, \@w) {
+        my $basic_name;
+        if ($which == \@s) {
+            $basic_name = 's';
+        } else {
+            $basic_name = 'w'
+        }
+
+        # Test \w \W \s \S
+        foreach my $complement (0, 1) {
+            my $name = '\\' . (($complement) ? uc($basic_name) : $basic_name);
+
+            # in and out of [...]
+            foreach my $charclass (0, 1) {
+
+                # And like [^...] or just plain [...]
+                foreach my $complement_class (0, 1) {
+                    next if ! $charclass && $complement_class;
+
+                    # Start with the boolean as to if the character is in the
+                    # class, and then complement as needed.
+                    my $expect_success = $which->[$i];
+                    $expect_success = ! $expect_success if $complement;
+                    $expect_success = ! $expect_success if $complement_class;
+
+                    my $test = $name;
+                    $test = "^$test" if $complement_class;
+                    $test = "[$test]" if $charclass;
+                    $test = "^$test\$";
+
+                    use feature 'unicode_strings';
+                    my $prefix = "in uni8bit; Verify chr(0x$hex_i)";
+                    if ($expect_success) {
+                        like($char, qr/$test/, display("$prefix =~ qr/$test/"));
+                    } else {
+                        unlike($char, qr/$test/, display("$prefix !~ qr/$test/"));
+                    }
+
+                    no feature 'unicode_strings';
+                    $prefix = "no uni8bit; Verify chr(0x$hex_i)";
+
+                    # With the legacy, nothing above 128 should be in the
+                    # class
+                    if ($i >= 128) {
+                        $expect_success = 0;
+                        $expect_success = ! $expect_success if $complement;
+                        $expect_success = ! $expect_success if $complement_class;
+                    }
+                    if ($expect_success) {
+                        like($char, qr/$test/, display("$prefix =~ qr/$test/"));
+                    } else {
+                        unlike($char, qr/$test/, display("$prefix !~ qr/$test/"));
+                    }
+                }
+            }
+        }
+    }
+
+    # Similarly for \b and \B.
+    foreach my $complement (0, 1) {
+        my $name = '\\' . (($complement) ? 'B' : 'b');
+        my $expect_success = ! $w[$i];  # \b is complement of \w
+        $expect_success = ! $expect_success if $complement;
+
+        my $string = "a$char";
+        my $test = "(^a$name\\x{$hex_i}\$)";
+
+        use feature 'unicode_strings';
+        my $prefix = "in uni8bit; Verify $string";
+        if ($expect_success) {
+            like($string, qr/$test/, display("$prefix =~ qr/$test/"));
+        } else {
+            unlike($string, qr/$test/, display("$prefix !~ qr/$test/"));
+        }
+
+        no feature 'unicode_strings';
+        $prefix = "no uni8bit; Verify $string";
+        if ($i >= 128) {
+            $expect_success = 1;
+            $expect_success = ! $expect_success if $complement;
+        }
+        if ($expect_success) {
+            like($string, qr/$test/, display("$prefix =~ qr/$test/"));
+            like($string, qr/$test/, display("$prefix =~ qr/$test/"));
+        } else {
+            unlike($string, qr/$test/, display("$prefix !~ qr/$test/"));
+        }
+    }
+}
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
@@ -57,11 +57,24 @@ These modifiers are currently only available within a C<(?...)> construct.
 The C<"l"> modifier says to compile the regular expression as if it were
 in the scope of C<use locale>, even if it is not.
 
-The C<"u"> modifier currently does nothing.
+The C<"u"> modifier says to compile the regular expression as if it were
+in the scope of a C<use feature "unicode_strings"> pragma.
 
-The C<"d"> modifier is used in the scope of C<use locale> to compile the
-regular expression as if it were not in that scope.
-See L<perlre/(?dlupimsx-imsx)>.
+The C<"d"> modifier is used to override any C<use locale> and
+C<use feature "unicode_strings"> pragmas that are in effect at the time
+of compiling the regular expression.
+
+See just below and L<perlre/(?dlupimsx-imsx)>.
+
+=head2 C<use feature "unicode_strings"> now applies to some regex matching
+
+Another chunk of the L<perlunicode/The "Unicode Bug"> is fixed in this
+release.  Now, regular expressions compiled within the scope of the
+"unicode_strings" feature will match the same whether or not the target
+string is encoded in utf8, with regard to C<\s>, C<\w>, C<\b>, and their
+complements.  Work is underway to add the C<[[:posix:]]> character
+classes and case sensitive matching to the control of this feature, but
+was not complete in time for this dot release.
 
 =head2 C<\N{...}> now handles Unicode named character sequences
 

diff --git a/pod/perlre.pod b/pod/perlre.pod
@@ -646,9 +646,23 @@ L<setlocale() function|perllocale/The setlocale function>.
 This modifier is automatically set if the regular expression is compiled
 within the scope of a C<"use locale"> pragma.
 
-C<"u"> has no effect currently.  It is automatically set if the regular
-expression is compiled within the scope of a
-L<C<"use feature 'unicode_strings">|feature> pragma.
+C<"u"> means to use Unicode semantics when pattern matching.  It is
+automatically set if the regular expression is compiled within the scope
+of a L<C<"use feature 'unicode_strings">|feature> pragma (and isn't
+also in the scope of L<C<"use locale">|locale> nor
+L<C<"use bytes">|bytes> pragmas.  It is not fully implemented at the
+time of this writing, but work is being done to complete the job.  On
+EBCDIC platforms this currently has no effect, but on ASCII platforms,
+it effectively turns them into Latin-1 platforms.  That is, the ASCII
+characters remain as ASCII characters (since ASCII is a subset of
+Latin-1), but the non-ASCII code points are treated as Latin-1
+characters.  Right now, this only applies to the C<"\b">, C<"\s">, and
+C<"\w"> pattern matching operators, plus their complements.  For
+example, when this option is not on, C<"\w"> matches precisely
+C<[A-Za-z0-9_]> (on a non-utf8 string).  When the option is on, it
+matches not just those, but all the Latin-1 word characters (such as an
+"n" with a tilde).  It thus matches exactly the same set of code points
+from 0 to 255 as it would if the string were encoded in utf8.
 
 C<"d"> means to use the traditional Perl pattern matching behavior.
 This is dualistic (hence the name C<"d">, which also could stand for

diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
@@ -682,7 +682,8 @@ nor EBCDIC, they match the ASCII defaults (0 to 9 for C<\d>; 52 letters,
 A regular expression is marked for Unicode semantics if it is encoded in
 utf8 (usually as a result of including a literal character whose code
 point is above 255), or if it contains a C<\N{U+...}> or C<\N{I<name>}>
-construct.
+construct, or (starting in Perl 5.14) if it was compiled in the scope of a
+C<S<use feature "unicode_strings">> pragma.
 
 The differences in behavior between locale and non-locale semantics
 can affect any character whose code point is 255 or less.  The
@@ -693,6 +694,11 @@ L<perlunicode/The "Unicode Bug">.
 
 For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
 or the POSIX character classes, and use the Unicode properties instead.
+That way you can control whether you want matching of just characters in
+the ASCII character set, or any Unicode characters.
+C<S<use feature "unicode_strings">> will allow seamless Unicode behavior
+no matter what the internal encodings are, but won't allow restricting
+to just the ASCII characters.
 
 =head4 Examples
 

diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
@@ -1509,17 +1509,20 @@ ASCII range (except in a locale), along with Perl's desire to add Unicode
 support seamlessly.  The result wasn't seamless: these characters were
 orphaned.
 
-Work is being done to correct this, but only some of it was complete in time
-for the 5.12 release.  What has been finished is the important part of the case
+Work is being done to correct this, but only some of it is complete.
+What has been finished is the matching of C<\b>, C<\s>, C<\w> and their
+complements in regular expressions, and the important part of the case
 changing component.  Due to concerns, and some evidence, that older code might
 have come to rely on the existing behavior, the new behavior must be explicitly
 enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
 no new syntax is involved.
 
 See L<perlfunc/lc> for details on how this pragma works in combination with
-various others for casing.  Even though the pragma only affects casing
-operations in the 5.12 release, it is planned to have it affect all the
-problematic behaviors in later releases: you can't have one without them all.
+various others for casing.
+
+Even though the implementation is incomplete, it is planned to have this
+pragma affect all the problematic behaviors in later releases: you can't
+have one without them all.
 
 In the meantime, a workaround is to always call utf8::upgrade($string), or to
 use the standard module L<Encode>.   Also, a scalar that has any characters

diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
@@ -155,8 +155,7 @@ have furnished your own casing functions to override the default, these will
 not be called unless the UTF8 flag is on)
 
 This remains a problem for the regular expression constructs
-C<\s>, C<\w>, C<\S>, C<\W>, C</.../i>, C<(?i:...)>,
-and C</[[:posix:]]/>.
+C</.../i>, C<(?i:...)>, and C</[[:posix:]]/>.
 
 To force Unicode semantics, you can upgrade the internal representation to
 by doing C<utf8::upgrade($string)>. This can be used