Regex truncation problem (garbled characters) #1135

LawssssCat · 2022-09-05T04:26:00Z

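I wanted a regex that keeps the first 50 characters, but it keeps the first 50 bytes instead. The last Chinese character (3 bytes in UTF-8) gets cut in half and shows up as garbage.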

/^(.{50}).+$/$1 ……《略》/

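Appending /u should make the match Unicode (UTF-8) aware, but I don't know how to add it in the configuration. Simply appending it at the end has no effect: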

/^(.{50}).+$/$1 ……《略》/u
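The symptom can be reproduced outside Rime. The following is a minimal Python sketch, not Rime's actual code path: it assumes the engine applies the pattern to raw UTF-8 bytes, as the report suggests. The helper names `truncate_bytes` and `truncate_chars` are made up for illustration.

```python
import re

def truncate_bytes(text, n):
    """Mimic a byte-oriented engine: apply ^(.{n}).+$ to the UTF-8 bytes."""
    data = text.encode("utf-8")
    m = re.match(rb"^(.{%d}).+$" % n, data, re.DOTALL)
    return m.group(1) if m else data

def truncate_chars(text, n):
    """Same pattern on a str, where . matches one code point."""
    m = re.match(r"^(.{%d}).+$" % n, text, re.DOTALL)
    return m.group(1) if m else text

sample = "汉" * 60                      # 60 characters, 180 bytes in UTF-8
cut = truncate_bytes(sample, 50)
print(len(cut))                         # 50 bytes: 16 whole characters plus 2 stray bytes
print(cut.decode("utf-8", errors="replace"))  # the tail no longer decodes cleanly
print(len(truncate_chars(sample, 50)))  # 50 characters
```

Since 50 is not a multiple of 3, the byte-level cut always lands inside a character when the text is all 3-byte characters.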


shewer · 2022-09-05T13:55:33Z

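-- Add utf8.sub() directly. Usage mirrors string.sub(str, si, ei),
-- except that si and ei count UTF-8 characters rather than bytes.
function utf8.sub(str, si, ei)
  -- Byte position where character i starts; out-of-range indices are
  -- clamped to #str + 1 (non-negative i) or 1 (negative i).
  local function index(ustr, i)
    return i >= 0 and (utf8.offset(ustr, i) or ustr:len() + 1)
      or (utf8.offset(ustr, i) or 1)
  end

  local u_si = index(str, si)
  ei = ei or utf8.len(str)
  local u_ei
  if ei == -1 then
    u_ei = str:len() -- -1 means "through the last character"
  else
    -- The last byte of character ei is the byte before character ei + 1.
    u_ei = index(str, ei + 1) - 1
  end
  return str:sub(u_si, u_ei)
end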
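For readers who do not use Lua, the idea behind this helper can be sketched in Python: walk the bytes, count every byte that is not a continuation byte (10xxxxxx) as the start of a character, and slice at the resulting byte offset. This is only an illustration of the technique, not a port of the function above; `char_offset` and `utf8_prefix` are hypothetical names, and the input is assumed to be valid UTF-8.

```python
def char_offset(data, n):
    """0-based byte offset where the n-th character (0-based) starts.

    Returns len(data) when data holds n characters or fewer.
    Assumes data is valid UTF-8.
    """
    count = 0
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # not 10xxxxxx, so a character starts here
            if count == n:
                return pos
            count += 1
    return len(data)

def utf8_prefix(data, n):
    """First n characters of UTF-8 encoded data, never splitting a character."""
    return data[:char_offset(data, n)]

print(utf8_prefix("a汉b字".encode("utf-8"), 2).decode("utf-8"))  # a汉
```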

Ace-Who · 2022-09-05T15:37:29Z

You can replace the . in the pattern with (?:[\0-\x7F\xC2-\xFD][\x80-\xBF]*).
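As a rough check of what this pattern does, here is a Python sketch that runs it over raw bytes. It matches one lead byte followed by any number of continuation bytes, so each match is one whole character. The names `UTF8_CHAR`, `split_chars` and `first_chars` are illustrative, and Python's `re` module stands in for whatever regex engine Rime uses.

```python
import re

# One lead byte (ASCII or 0xC2-0xFD) followed by its continuation bytes.
UTF8_CHAR = rb"(?:[\0-\x7F\xC2-\xFD][\x80-\xBF]*)"

def split_chars(data):
    """Split UTF-8 bytes into one bytes object per character."""
    return re.findall(UTF8_CHAR, data)

def first_chars(data, n):
    """First n characters of data; the whole input if it is shorter."""
    pattern = b"^(" + UTF8_CHAR + b"{" + str(n).encode("ascii") + b"})"
    m = re.match(pattern, data)
    return m.group(1) if m else data

data = "a汉😀".encode("utf-8")
print([c.decode("utf-8") for c in split_chars(data)])            # ['a', '汉', '😀']
print(first_chars("汉字输入法".encode("utf-8"), 2).decode("utf-8"))  # 汉字
```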

oniondelta · 2022-09-05T17:38:01Z

I tested both approaches, shewer's and Ace-Who's, and both work 👍🏻

LawssssCat · 2022-09-06T17:47:28Z

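> You can replace the . in the pattern with (?:[\0-\x7F\xC2-\xFD][\x80-\xBF]*).

Thanks, that fixed most cases, but I still see some garbled output.

My guess was that the pattern does not cover all of UTF-8, so I read up on the UTF-8 encoding rules.

I found a more complete regex: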
(?:[\x01-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xff][\x80-\xbf]{3})

Encoding table

| Code point (binary) | UTF-16 (binary) | Byte 1 (binary) | Byte 2 (binary) | Byte 3 (binary) | Byte 4 (binary) |
| --- | --- | --- | --- | --- | --- |
| 00000000 0xxxxxxx | 00000000 0xxxxxxx | 0xxxxxxx | | | |
| 00000yyy yyxxxxxx | 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
| zzzzyyyy yyxxxxxx | zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
| uuuuu zzzzyyyy yyxxxxxx | 110110ww wwzzzzyy 110111yy yyxxxxxx | 11110uuu (where uuuuu = wwww + 1) | 10uuzzzz | 10yyyyyy | 10xxxxxx |
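To make the bit layout concrete, here is a small Python sketch that packs a code point into bytes by following the four rows of the table. `encode_utf8` is a hypothetical helper written for this illustration; real code would simply call `str.encode("utf-8")`. It does not reject surrogate code points, which a strict encoder would.

```python
def encode_utf8(cp):
    """Encode one code point by hand, following the bit layout in the table."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                 # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

print(encode_utf8(ord("帕")).hex())    # e5b895
```

Every byte after the first starts with the bits 10, which is why a pattern can tell a lead byte from a continuation byte without decoding anything.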

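Binary to hexadecimal reference

| Binary | Hexadecimal |
| --- | --- |
| 0000 0000 ~ 0111 1111 | 00 ~ 7f |
| 1000 0000 ~ 1011 1111 | 80 ~ bf |
| 1100 0000 ~ 1101 1111 | c0 ~ df |
| 1110 0000 ~ 1110 1111 | e0 ~ ef |
| 1111 0000 ~ 1111 1111 | f0 ~ ff |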
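Read together with the encoding table, these byte ranges give the length of a sequence from its first byte alone. A Python sketch of that lookup follows. The function names are illustrative, and the ranges follow the regex posted in this comment rather than the stricter set that valid UTF-8 actually uses.

```python
def sequence_length(lead):
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if lead <= 0x7F:
        return 1          # 0xxxxxxx: ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2          # 110yyyyy
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110zzzz
    if 0xF0 <= lead <= 0xFF:
        return 4          # 11110uuu (valid UTF-8 only uses f0 ~ f4 here)
    raise ValueError("0x%02x is a continuation byte, not a lead byte" % lead)

def split_by_lead(data):
    """Split UTF-8 bytes into characters using only the lead-byte ranges."""
    chars, pos = [], 0
    while pos < len(data):
        n = sequence_length(data[pos])
        chars.append(data[pos:pos + n])
        pos += n
    return chars

print([c.decode("utf-8") for c in split_by_lead("a帕😀".encode("utf-8"))])
# ['a', '帕', '😀']
```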

Ace-Who · 2022-09-07T02:07:11Z

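The 「帕」 example is probably not a problem with the UTF-8 character pattern. The .+$ in the expression has to match at least one byte, so the final character 「帕」 gets cut to supply it. If the second . is also replaced with the UTF-8 character pattern, it should work.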
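This backtracking effect can be reproduced with Python's bytes-mode `re`, used here as a stand-in for Rime's engine (an assumption; the two may differ in details). The helper `xform` is hypothetical and only imitates the substitution. When the input has exactly as many characters as the count, a trailing `.+` steals the last continuation byte, while a trailing character pattern simply fails to match and leaves the text alone.

```python
import re

UTF8_CHAR = rb"(?:[\0-\x7F\xC2-\xFD][\x80-\xBF]*)"
SUFFIX = " ……《略》".encode("utf-8")

def xform(data, n, tail):
    """Replace ^(n characters)<tail>$ with the n characters plus SUFFIX."""
    count = b"{" + str(n).encode("ascii") + b"}"
    pattern = b"^(" + UTF8_CHAR + count + b")" + tail + b"$"
    return re.sub(pattern, lambda m: m.group(1) + SUFFIX, data, flags=re.DOTALL)

text = "汉字帕".encode("utf-8")             # exactly 3 characters, 9 bytes

broken = xform(text, 3, rb".+")
print(broken == text[:-1] + SUFFIX)        # True: 帕 lost its last byte

fixed = xform(text, 3, UTF8_CHAR + b"+")
print(fixed == text)                       # True: no match, text left intact

longer = "汉字帕子".encode("utf-8")
print(xform(longer, 3, UTF8_CHAR + b"+").decode("utf-8"))  # 汉字帕 ……《略》
```

The character pattern cannot start on a continuation byte, so the engine has no way to split 「帕」 between the capture group and the tail.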

oniondelta · 2022-09-07T04:11:30Z

Still produces garbled output:

    - xform/^((?:[\0-\x7F\xC2-\xFD][\x80-\xBF]*){30}).+$/$1 ……《略》/

No garbled output found so far:

    - xform/^((?:[\0-\x7F\xC2-\xFD][\x80-\xBF]*){30})(?:[\0-\x7F\xC2-\xFD][\x80-\xBF]*)+$/$1 ……《略》/

LawssssCat closed this as completed Sep 6, 2022
