Incorporate RBBI tokenizer for v4.0 #2216

Merged: 85 commits merged into master from dev-tokenize4 on Mar 30, 2023
Conversation

@koheiw (Collaborator) commented Mar 19, 2023

I drafted tokenize_word4() based on #2165. I want to make it the main tokenizer of quanteda v4.0. It is also available via tokens(what = "word4").

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
tokenize_word4("a well-known website http://example.com #hashtag @username", 
               split_hyphens = FALSE, split_tags = FALSE)
#> [[1]]
#>  [1] "a"                  " "                  "well-known"        
#>  [4] " "                  "website"            " "                 
#>  [7] "http://example.com" " "                  "#hashtag"          
#> [10] " "                  "@username"
tokenize_word4("a well-known website http://example.com #hashtag @username", 
               split_hyphens = TRUE, split_tags = TRUE)
#> [[1]]
#>  [1] "a"                  " "                  "well"              
#>  [4] "-"                  "known"              " "                 
#>  [7] "website"            " "                  "http://example.com"
#> [10] " "                  "#"                  "hashtag"           
#> [13] " "                  "@username"

tokenize_word4("Qu'est-ce que c'est?", split_elisions = FALSE)
#> [[1]]
#> [1] "Qu'est-ce" " "         "que"       " "         "c'est"     "?"
tokenize_word4("Qu'est-ce que c'est?", split_elisions = TRUE)
#> [[1]]
#> [1] "Qu'"    "est-ce" " "      "que"    " "      "c'"     "est"    "?"

@odelmarcelle, I could not understand why you added these rules. Can you explain?

### Protect variant selector & whitespace with diacritical marks
$Variant = [\uFE00-\uFE0F];
$Diacritical = [\p{whitespace}][\u0300-\u036F];
# Rules
($ALetterPlus | $Hebrew_Letter) $Variant ($ALetterPlus | $Hebrew_Letter);
($ALetterPlus | $Hebrew_Letter) $Diacritical ($ALetterPlus | $Hebrew_Letter);

I am hoping split_elisions = TRUE makes it easier to analyze French texts. Don't you think "Qu'est-ce" should be tokenized to "Qu" "'" "est-ce" (with the apostrophe as a separate token)?

@odelmarcelle (Collaborator) commented Mar 19, 2023

Nice work, I'm glad you had the chance to look at it!

For your first question, I added this rule to align behaviour with https://github.com/quanteda/quanteda/blob/master/R/tokenizers.R#L45-L48, which already implied that those characters should be protected in the ICU tokenizer.

I couldn't identify the difference between adding and not adding these rules, so I did not use them. I just gave it a try today, and I think you could drop the regex replacement at https://github.com/quanteda/quanteda/blob/dev-tokenize4/R/tokenizers.R#L242-L244:

  • It drops the variation selector for emoji, which is not needed (seems to be already protected through the 'word' ICU rule).
  • It does not break whitespace followed by a diacritic mark (which can be equivalently implemented with an ICU rule).

See the following examples

require(quanteda)
#> Le chargement a nécessité le package : quanteda
#> Package version: 3.2.5
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
rules_extended <- c(
  data_breakrules,
  list(
    variant =
      r"---(### Protect variant selector & whitespace with diacritical marks
    $Variant = [\uFE00-\uFE0F];
    $Diacritical = [\p{whitespace}][\u0300-\u036F];
    # Rules
    ($ALetterPlus | $Hebrew_Letter) $Variant ($ALetterPlus | $Hebrew_Letter);
    ($ALetterPlus | $Hebrew_Letter) $Diacritical ($ALetterPlus | $Hebrew_Letter);)---"
  ))


txt <- "i \u2764\ufe0f you \u2764\ufe0f\ufe0f\u2764" 
print(txt)
#> [1] "i ❤️ you ❤️️❤"
tokens(txt)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "i"   "❤"   "you" "❤"   "❤"
tokenize_word4(txt)
#> [[1]]
#> [1] "i"   " "   "❤"   " "   "you" " "   "❤"   "❤"
tokenize_custom(txt, data_breakrules)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"
tokenize_custom(txt, rules_extended)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"


txt <- "übër u\u0308be\u0308r \u0308ubër"
print(txt)
#> [1] "übër übër ̈ubër"
tokens(txt)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "übër"     "übërubër"
tokenize_word4(txt)
#> [[1]]
#> [1] "übër"     " "        "übërubër"
tokenize_custom(txt, data_breakrules)
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"
tokenize_custom(txt, rules_extended)
#> [[1]]
#> [1] "übër"      " "         "übër ̈ubër"

It might be possible to fine-tune the diacritic ICU rule to retain \u0308ubër as a single token, but I am not sure there are a lot of use cases.

For your second question, "qu" will always be used with an elision to replace "qui" or "que". I'm not sure there is any use in separating the contracted word from the elision.

@odelmarcelle (Collaborator) commented Mar 19, 2023

After giving it a second thought, it is probably better to tokenize "Qu'est-ce" into "Qu" "'" "est-ce" (mainly to avoid having "Qu'" and "Qu’" as two different tokens). This can be done by adjusting the elision rule to:

$Elision = ([lLmMtTnNsSjJdDcC]|([jJ][u][s]|[qQ][u][o][i]|[lL][o][r][s]|[pP][u][i][s]|[qQ][u][e][l])?[qQ][u]);
$Apostrophe = [\u0027\u2019];
^$Elision / $Apostrophe;

I also added |[qQ][u][e][l] to the elision part so that "Quelqu'un" is tokenized into "Quelqu" "'" "un".
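
For illustration, a minimal sketch (untested) of how the adjusted rule could be tried with the dev-branch helpers used earlier in this thread; the element name "elision" and the way it is combined with data_breakrules are assumptions, and the final comment only restates the intended tokenization described above:

rules_elision <- c(
  data_breakrules,
  list(
    elision =
      r"---(### Split elisions before the apostrophe
    $Elision = ([lLmMtTnNsSjJdDcC]|([jJ][u][s]|[qQ][u][o][i]|[lL][o][r][s]|[pP][u][i][s]|[qQ][u][e][l])?[qQ][u]);
    $Apostrophe = [\u0027\u2019];
    ^$Elision / $Apostrophe;)---"
  ))

tokenize_custom("Qu'est-ce que c'est? Quelqu'un?", rules_elision)
# intended result: "Qu" "'" "est-ce" ... "c" "'" "est" ... "Quelqu" "'" "un" ...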

@koheiw (Collaborator, Author) commented Mar 19, 2023

@odelmarcelle, thank you for the comments. I will investigate the variant selector issue a bit more.

You are welcome to edit this branch directly (I sent you an invite). In this way, your contribution will be registered properly.

@koheiw (Collaborator, Author) commented Mar 19, 2023

You are the expert on elisions here because the others do not know French. In English, extracting "aren" from "aren't" does not make sense, but extracting "he" from "he'll" does.

@koheiw (Collaborator, Author) commented Mar 20, 2023

I removed the regex replacement at https://github.com/quanteda/quanteda/blob/dev-tokenize4/R/tokenizers.R#L242-L244.

require(stringi)
#> Loading required package: stringi
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <- "i \u2764\ufe0f you \u2764\ufe0f\ufe0f\u2764"
tokenize_custom(txt, data_breakrules_word)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"
tokenize_word4(txt)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"
stri_split_boundaries(txt, type = "word")
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"

txt2 <- "übër u\u0308be\u0308r \u0308ubër"
tokenize_custom(txt2, data_breakrules_word)
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"
tokenize_word4(txt2)
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"
stri_split_boundaries(txt2, type = "word")
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"

A better way to deal with the stray diacritical marks is post-tokenization cleaning. We can add tokens(remove_control = TRUE) for this. @kbenoit?

tokens(txt2, what = "word4") %>% 
    tokens_remove("^[\\p{Z}\\p{M}\\p{C}]+$", valuetype = "regex")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "übër" "übër" "ubër"

@koheiw (Collaborator, Author) commented Mar 20, 2023

The only problem with the new tokenizer seems to be the handling of @, for which the rules have changed in the ICU library.

── Failure ('test-tokens-word4.R:948:5'): split_tags works ─────────────────────
as.list(tokens(txt1, what = "word", split_tags = TRUE)) not identical to list(d1 = c("@", "quanteda", "@", "koheiw7", "@", "QUANTEDA_INITIATIVE")).
Component “d1”: Lengths (3, 6) differ (string compare on first 3)
Component “d1”: 3 string mismatches
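
For reference, a hedged reconstruction of that failure (the input txt1 is assumed from the expected tokens in the test, and the comments only restate what the failure message reports):

txt1 <- "@quanteda @koheiw7 @QUANTEDA_INITIATIVE"  # assumed test input
as.list(tokens(txt1, what = "word", split_tags = TRUE))
# expected by the test: "@" "quanteda" "@" "koheiw7" "@" "QUANTEDA_INITIATIVE" (6 tokens)
# observed: 3 tokens, i.e. the tags are presumably no longer being split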

@koheiw (Collaborator, Author) commented Mar 20, 2023

A better way to deal with the stray diacritical marks is post-tokenization cleaning. We can add tokens(remove_control = TRUE) for this.

I solved this in 7f90f4e.

@koheiw (Collaborator, Author) commented Mar 29, 2023

I like putting RBBI rules in the environment. Thanks.

breakrules_get() sounds odd to me. How about breakrules() (like stopwords()) for getting and breakrules<- for setting values to customize?

@kbenoit (Collaborator) commented Mar 29, 2023

I like putting RBBI rules in the environment. Thanks.

breakrules_get() sounds odd to me. How about breakrules() (like stopwords()) for getting and breakrules<- for setting values to customize?

Even better. I think we should keep these internal for v3 and see how they develop. Only the most advanced users are likely to make use of them. I was thinking more that this, plus the update script, provides a way for us to update the default rules easily.
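
As a rough illustration of what such an update script might do (the URL and file layout are assumptions about the ICU source tree, not the actual script):

# sketch: fetch the default word break rules from the ICU repository
url <- "https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/word.txt"
rules_base <- paste(readLines(url, encoding = "UTF-8", warn = FALSE), collapse = "\n")
# rules_base could then be stored as the package's default "base" break rules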

@kbenoit (Collaborator) commented Mar 29, 2023

I like putting RBBI rules in the environment. Thanks.
breakrules_get() sounds odd to me. How about breakrules() (like stopwords()) for getting and breakrules<- for setting values to customize?

Even better. I think we should keep these internal for v3 and see how they develop. Only the most advanced users are likely to make use of them. I was thinking more that this, plus the update script, provides a way for us to update the default rules easily.

Actually I started implementing this now, then realised we cannot use the same function for assignment and retrieval, because x cannot be passed as the thing to be modified. So better to have breakrules_get() and breakrules_set(). (Later we can add validators, no need for now however - this is still mainly internal.)
See https://r-pkgs.org/data.html#sec-data-state
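
To make the constraint concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual quanteda implementation) of the environment-backed getter/setter pattern described there; a replacement function breakrules<- would need an ordinary object x to modify, which package-level state cannot supply:

# state lives in a package-level environment, not in a user-visible object
.rules_env <- new.env(parent = emptyenv())
.rules_env$breakrules <- list(base = "...default RBBI rules...", custom = NULL)

breakrules_get <- function(what = "base") {
    .rules_env$breakrules[[what]]
}

breakrules_set <- function(value, what = "custom") {
    .rules_env$breakrules[[what]] <- value
    invisible(value)
}

breakrules_set("xyz")
breakrules_get("custom")
# [1] "xyz"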

@koheiw koheiw changed the base branch from v4 to master March 29, 2023 23:25
@koheiw (Collaborator, Author) commented Mar 29, 2023

Can't we modify the package-level environment at all?

@kbenoit (Collaborator) commented Mar 29, 2023

Can't we modify the package-level environment at all?

Only inside functions - so it cannot be passed as the x in an assignment function.

I just fixed these, but now I see you updated it in 6525670.

@koheiw (Collaborator, Author) commented Mar 29, 2023

Let's discuss how to customize rules on a branch for v4. I still don't like _set() and _get() functions.

Otherwise, word4 is ready. I fixed the issues with the segmentation of tags and of Japanese text.

- change aaa.R to use the reset function
@kbenoit (Collaborator) commented Mar 29, 2023

Take a look at how I implemented them in 21248f3, keeping in mind that these are internal for v3 and could be improved for v4, with validators for instance. But we are limited in how we define these, given the use of environments.

@kbenoit (Collaborator) commented Mar 30, 2023

Take a look at how I implemented them in 21248f3, keeping in mind that these are internal for v3 and could be improved for v4, with validators for instance. But we are limited in how we define these, given the use of environments.

You might need to fix this slightly because I pushed it right before running off to a seminar.

@kbenoit (Collaborator) commented Mar 30, 2023

I think it will pass now, but I am still getting different results for this test locally.

> tokens("゛ん゙", what = "word4")
Tokens consisting of 1 document.
text1 :
[1] "゛ん゙"

> tokens("゛ん゙", what = "word3")
Tokens consisting of 1 document.
text1 :
[1] "゛ん゙"

Details:

R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(quanteda)
Package version: 3.2.5.9000
Unicode version: 14.0
ICU version: 71.1
Parallel computing: 10 of 10 threads used.
See https://quanteda.io for tutorials and examples.

@koheiw (Collaborator, Author) commented Mar 30, 2023

The rules used by stri_split_boundaries(type = "word") are a bit different from breakrules() because it is downloaded from the ICU repository directly. What is the result if you run stri_split_boundaries("゛ん゙", type = breakrules()$base)?

@koheiw (Collaborator, Author) commented Mar 30, 2023

We don't need to finish breakrules() here, but I think it should behave like quanteda_options().

# get
> quanteda_options("tokens_tokenizer_word")
[1] "word3"

# set 
> quanteda_options("tokens_tokenizer_word" = "word4")
# get
> breakrules("base")
[1] "#\n# Copyright (C) 2016 and later: Unicode, Inc. and others.\n# License & terms of use: http://www.unicode.org/copyright.html\n# Copyright (C) 2002-2016, International Business Machines Corporation\n# and others. All Rights Reserved.\n#\n# file:  word.txt\n#\n# ICU Word Break Rules\n#

# set
breakrules("custom" = "xyz") 

@kbenoit (Collaborator) commented Mar 30, 2023

library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

stringi::stri_split_boundaries("゛ん゙", type = breakrules_get()$base)
#> [[1]]
#> [1] "゛ん゙"

packageVersion("stringi")
#> [1] '1.7.12'

Created on 2023-03-30 with reprex v2.0.2

@kbenoit (Collaborator) commented Mar 30, 2023

We don't need to finish breakrules() here, but I think it should behave like quanteda_options().

# get
> quanteda_options("tokens_tokenizer_word")
[1] "word3"

# set 
> quanteda_options("tokens_tokenizer_word" = "word4")
# get
> breakrules("base")
[1] "#\n# Copyright (C) 2016 and later: Unicode, Inc. and others.\n# License & terms of use: http://www.unicode.org/copyright.html\n# Copyright (C) 2002-2016, International Business Machines Corporation\n# and others. All Rights Reserved.\n#\n# file:  word.txt\n#\n# ICU Word Break Rules\n#

# set
breakrules("custom" = "xyz") 

I'd be happy with that.

They could even be quanteda options!

Should we merge this now and put the change on the (short-term) to-do list?

@koheiw (Collaborator, Author) commented Mar 30, 2023

I have no idea why it works differently on your Mac. It is a problem in stringi rather than in our tokenizers. I am not worried about this unusual Japanese string, so we can merge this PR.


> stringi::stri_split_boundaries("゛ん゙", type = breakrules()$base)
[[1]]
[1] "゛"  "ん゙"

> stringi::stri_split_boundaries("゛ん゙", type = "word")
[[1]]
[1] "゛"  "ん゙"

> packageVersion("stringi")
[1] ‘1.7.12’

@kbenoit kbenoit merged commit b6dea7b into master Mar 30, 2023
6 checks passed
@kbenoit (Collaborator) commented Mar 30, 2023

I was just about to post this as a stringi issue, and noticed that it worked as expected. Now this is really weird:

# just to show Unicode info
library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

packageVersion("stringi")
#> [1] '1.7.12'
stringi::stri_split_boundaries("゛ん゙")
#> [[1]]
#> [1] "゛"  "ん゙"

sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] quanteda_3.2.5.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10        rstudioapi_0.14    knitr_1.42         magrittr_2.0.3    
#>  [5] stopwords_2.3      R.cache_0.16.0     lattice_0.20-45    rlang_1.1.0       
#>  [9] fastmatch_1.1-3    fastmap_1.1.1      styler_1.9.1       tools_4.2.3       
#> [13] grid_4.2.3         xfun_0.37          R.oo_1.25.0        cli_3.6.1         
#> [17] withr_2.5.0        htmltools_0.5.5    RcppParallel_5.1.7 yaml_2.3.7        
#> [21] digest_0.6.31      lifecycle_1.0.3    Matrix_1.5-3       purrr_1.0.1       
#> [25] vctrs_0.6.1        R.utils_2.12.2     fs_1.6.1           glue_1.6.2        
#> [29] evaluate_0.20      rmarkdown_2.21     reprex_2.0.2       stringi_1.7.12    
#> [33] compiler_4.2.3     R.methodsS3_1.8.2

Created on 2023-03-30 with reprex v2.0.2

But then

library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

tokens("゛ん゙")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "゛ん゙"

Created on 2023-03-30 with reprex v2.0.2

Which suggests it's quanteda; maybe something is happening post-split, in how the re-joining of special split characters is handled?

@kbenoit kbenoit deleted the dev-tokenize4 branch April 12, 2023 01:26