Commit

Add invisible separator / comma to the list of things treated as spaces. One half of #1281 - although this doesn't address the crash, unfortunately
AngledLuffa committed Jul 2, 2022
1 parent 40fee82 commit 7c84960
Showing 3 changed files with 69,089 additions and 68,997 deletions.
3 changes: 2 additions & 1 deletion src/edu/stanford/nlp/process/LexCommon.tokens
@@ -1,5 +1,6 @@
 /* \u3000 is ideographic space; \u205F is medium math space */
-SPACE = [ \t\u00A0\u2000-\u200A\u202F\u20F5\u3000]
+/* \u2063 is an invisible separator */
+SPACE = [ \t\u00A0\u2000-\u200A\u202F\u2063\u20F5\u3000]
 SPACES = {SPACE}+
 NEWLINE = \r|\r?\n|\u2028|\u2029|\u000B|\u000C|\u0085
 SPACENL = ({SPACE}|{NEWLINE})
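
To illustrate the effect of the change above, here is a minimal sketch that transcribes the old and new `SPACE` character classes into `java.util.regex` patterns and checks U+2063 against both. This is not how the lexer itself is built (the real lexer is generated by JFlex from these macros); the class name `SpaceClassDemo` and the helper `isSpaces` are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

public class SpaceClassDemo {
    // Java-regex transcription of the SPACE macro before this commit, repeated one or more times.
    static final Pattern OLD_SPACES =
        Pattern.compile("[ \\t\\u00A0\\u2000-\\u200A\\u202F\\u20F5\\u3000]+");

    // Same class with U+2063 (invisible separator) added, mirroring the new macro.
    static final Pattern NEW_SPACES =
        Pattern.compile("[ \\t\\u00A0\\u2000-\\u200A\\u202F\\u2063\\u20F5\\u3000]+");

    // Hypothetical helper: true if the whole string is a run of characters from the class.
    static boolean isSpaces(Pattern spaces, String s) {
        return spaces.matcher(s).matches();
    }

    public static void main(String[] args) {
        String invisibleSeparator = "\u2063";
        System.out.println(isSpaces(OLD_SPACES, invisibleSeparator)); // prints false
        System.out.println(isSpaces(NEW_SPACES, invisibleSeparator)); // prints true
    }
}
```

Under the old class a lone U+2063 is not treated as whitespace; under the new class it is matched like any other space character.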
1 change: 1 addition & 0 deletions src/edu/stanford/nlp/process/PTBLexer.flex
@@ -589,6 +589,7 @@ SPLET = &[aeiouAEIOU](acute|grave|uml);

%include LexCommon.tokens

/* SPACE, SPACENL, etc are in LexCommon.tokens */
SPACENLS = {SPACENL}+
/* These next ones are useful to get a fixed length trailing context. */
SPACENL_ONE_CHAR = [ \t\u00A0\u2000-\u200A\u202F\u3000\r\n\u2028\u2029\u000B\u000C\u0085]
