Incorporate RBBI tokenizer for v4.0 #2216

Merged: 85 commits merged into master from dev-tokenize4 on Mar 30, 2023
Conversation

@koheiw (Collaborator) commented Mar 19, 2023

I drafted tokenize_word4() based on #2165. I want to make it the main tokenizer of quanteda v4.0. It is also available via tokens(what = "word4").

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
tokenize_word4("a well-known website http://example.com #hashtag @username", 
               split_hyphens = FALSE, split_tags = FALSE)
#> [[1]]
#>  [1] "a"                  " "                  "well-known"        
#>  [4] " "                  "website"            " "                 
#>  [7] "http://example.com" " "                  "#hashtag"          
#> [10] " "                  "@username"
tokenize_word4("a well-known website http://example.com #hashtag @username", 
               split_hyphens = TRUE, split_tags = TRUE)
#> [[1]]
#>  [1] "a"                  " "                  "well"              
#>  [4] "-"                  "known"              " "                 
#>  [7] "website"            " "                  "http://example.com"
#> [10] " "                  "#"                  "hashtag"           
#> [13] " "                  "@username"

tokenize_word4("Qu'est-ce que c'est?", split_elisions = FALSE)
#> [[1]]
#> [1] "Qu'est-ce" " "         "que"       " "         "c'est"     "?"
tokenize_word4("Qu'est-ce que c'est?", split_elisions = TRUE)
#> [[1]]
#> [1] "Qu'"    "est-ce" " "      "que"    " "      "c'"     "est"    "?"

@odelmarcelle, I could not understand why you added these rules. Can you explain?

### Protect variant selector & whitespace with diacritical marks
$Variant = [\uFE00-\uFE0F];
$Diacritical = [\p{whitespace}][\u0300-\u036F];
# Rules
($ALetterPlus | $Hebrew_Letter) $Variant ($ALetterPlus | $Hebrew_Letter);
($ALetterPlus | $Hebrew_Letter) $Diacritical ($ALetterPlus | $Hebrew_Letter);

I am hoping split_elisions = TRUE makes it easier to analyze French texts. Don't you think "Qu'est-ce" should be tokenized to "Qu" "'" "est-ce" (with the apostrophe as a separate token)?

@odelmarcelle (Collaborator) commented Mar 19, 2023

Nice work, I'm glad you had the chance to look at it!

For your first question, I added this rule to align behaviour with https://github.com/quanteda/quanteda/blob/master/R/tokenizers.R#L45-L48, which already implied that those characters should be protected in the ICU tokenizer.

I couldn't identify the difference between adding and not adding these rules, so I did not use them. I just gave it a try today, and I think you could drop the regex replacement at https://github.com/quanteda/quanteda/blob/dev-tokenize4/R/tokenizers.R#L242-L244:

  • It drops the variation selector for emoji, which is not needed (seems to be already protected through the 'word' ICU rule).
  • It does not break whitespace followed by a diacritic mark (which can be equivalently implemented with an ICU rule).

See the following examples

require(quanteda)
#> Le chargement a nécessité le package : quanteda
#> Package version: 3.2.5
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
rules_extended <- c(
  data_breakrules,
  list(
    variant =
      r"---(### Protect variant selector & whitespace with diacritical marks
    $Variant = [\uFE00-\uFE0F];
    $Diacritical = [\p{whitespace}][\u0300-\u036F];
    # Rules
    ($ALetterPlus | $Hebrew_Letter) $Variant ($ALetterPlus | $Hebrew_Letter);
    ($ALetterPlus | $Hebrew_Letter) $Diacritical ($ALetterPlus | $Hebrew_Letter);)---"
  ))


txt <- "i \u2764\ufe0f you \u2764\ufe0f\ufe0f\u2764" 
print(txt)
#> [1] "i ❤️ you ❤️️❤"
tokens(txt)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "i"   "❤"   "you" "❤"   "❤"
tokenize_word4(txt)
#> [[1]]
#> [1] "i"   " "   "❤"   " "   "you" " "   "❤"   "❤"
tokenize_custom(txt, data_breakrules)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"
tokenize_custom(txt, rules_extended)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"


txt <- "übër u\u0308be\u0308r \u0308ubër"
print(txt)
#> [1] "übër übër ̈ubër"
tokens(txt)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "übër"     "übërubër"
tokenize_word4(txt)
#> [[1]]
#> [1] "übër"     " "        "übërubër"
tokenize_custom(txt, data_breakrules)
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"
tokenize_custom(txt, rules_extended)
#> [[1]]
#> [1] "übër"      " "         "übër ̈ubër"

It might be possible to fine-tune the diacritic ICU rule to retain \u0308ubër as a single token, but I am not sure there are a lot of use cases.

For your second question, "qu" will always be used with an elision to replace "qui" or "que". I'm not sure there is any use in separating the contracted word from the elision.

@odelmarcelle (Collaborator) commented Mar 19, 2023

After giving it a second thought, it is probably better to tokenize "Qu'est-ce" into "Qu" "'" "est-ce" (mainly to avoid having "Qu'" and "Qu’" as two different tokens). This can be done by adjusting the elision rule to:

$Elision = ([lLmMtTnNsSjJdDcC]|([jJ][u][s]|[qQ][u][o][i]|[lL][o][r][s]|[pP][u][i][s]|[qQ][u][e][l])?[qQ][u]);
$Apostrophe = [\u0027\u2019];
^$Elision / $Apostrophe;

I also added |[qQ][u][e][l] to the elision part so that "Quelqu'un" is tokenized into "Quelqu" "'" "un".
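
For illustration, a minimal sketch (untested) of how the adjusted rule could be tried with the dev-branch helpers used earlier in this thread; the element name "elision" and the way it is combined with data_breakrules are assumptions, and the final comment only restates the intended tokenization described above:

rules_elision <- c(
  data_breakrules,
  list(
    elision =
      r"---(### Split elisions before the apostrophe
    $Elision = ([lLmMtTnNsSjJdDcC]|([jJ][u][s]|[qQ][u][o][i]|[lL][o][r][s]|[pP][u][i][s]|[qQ][u][e][l])?[qQ][u]);
    $Apostrophe = [\u0027\u2019];
    ^$Elision / $Apostrophe;)---"
  ))

tokenize_custom("Qu'est-ce que c'est? Quelqu'un?", rules_elision)
# intended result: "Qu" "'" "est-ce" ... "c" "'" "est" ... "Quelqu" "'" "un" ...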

@koheiw (Collaborator, Author) commented Mar 19, 2023

@odelmarcelle, thank you for the comments. I will investigate the variant selector issue a bit more.

You are welcome to edit this branch directly (I sent you an invite). In this way, your contribution will be registered properly.

@koheiw (Collaborator, Author) commented Mar 19, 2023

You are the expert on elisions here because the others do not know French. In English, extracting "aren" from "aren't" does not make sense, but extracting "he" from "he'll" does.

@koheiw (Collaborator, Author) commented Mar 20, 2023

I removed the regex replacement at https://github.com/quanteda/quanteda/blob/dev-tokenize4/R/tokenizers.R#L242-L244.

require(stringi)
#> Loading required package: stringi
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <- "i \u2764\ufe0f you \u2764\ufe0f\ufe0f\u2764"
tokenize_custom(txt, data_breakrules_word)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"
tokenize_word4(txt)
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"
stri_split_boundaries(txt, type = "word")
#> [[1]]
#> [1] "i"   " "   "❤️"   " "   "you" " "   "❤️️"   "❤"

txt2 <- "übër u\u0308be\u0308r \u0308ubër"
tokenize_custom(txt2, data_breakrules_word)
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"
tokenize_word4(txt2)
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"
stri_split_boundaries(txt2, type = "word")
#> [[1]]
#> [1] "übër" " "    "übër" " ̈"    "ubër"

A better way to deal with the stray diacritical marks is post-tokenization cleaning. We can add tokens(remove_control = TRUE) for this. @kbenoit?

tokens(txt2, what = "word4") %>% 
    tokens_remove("^[\\p{Z}\\p{M}\\p{C}]+$", valuetype = "regex")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "übër" "übër" "ubër"

@koheiw (Collaborator, Author) commented Mar 20, 2023

The only problem with the new tokenizer seems to be the handling of @, for which the rules have changed in the ICU library.

── Failure ('test-tokens-word4.R:948:5'): split_tags works ─────────────────────
as.list(tokens(txt1, what = "word", split_tags = TRUE)) not identical to list(d1 = c("@", "quanteda", "@", "koheiw7", "@", "QUANTEDA_INITIATIVE")).
Component “d1”: Lengths (3, 6) differ (string compare on first 3)
Component “d1”: 3 string mismatches
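
For reference, a hedged reconstruction of that failure (the input txt1 is assumed from the expected tokens in the test, and the comments only restate what the failure message reports):

txt1 <- "@quanteda @koheiw7 @QUANTEDA_INITIATIVE"  # assumed test input
as.list(tokens(txt1, what = "word", split_tags = TRUE))
# expected by the test: "@" "quanteda" "@" "koheiw7" "@" "QUANTEDA_INITIATIVE" (6 tokens)
# observed: 3 tokens, i.e. the tags are presumably no longer being split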

@koheiw (Collaborator, Author) commented Mar 20, 2023

A better way to deal with the stray diacritical marks is post-tokenization cleaning. We can add tokens(remove_control = TRUE) for this.

I solved this in 7f90f4e.

@koheiw (Collaborator, Author) commented Mar 29, 2023

I like putting RBBI rules in the environment. Thanks.

breakrules_get() sounds odd to me. How about breakrules() (like stopwords()) for getting and breakrules<- for setting values to customize?

@kbenoit (Collaborator) commented Mar 29, 2023

I like putting RBBI rules in the environment. Thanks.

breakrules_get() sounds odd to me. How about breakrules() (like stopwords()) for getting and breakrules<- for setting values to customize?

Even better. I think we should keep these internal for v3 and see how they develop. Only the most advanced users are likely to make use of them. I was thinking more that this, plus the update script, provides a way for us to update the default rules easily.
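
As a rough illustration of what such an update script might do (the URL and file layout are assumptions about the ICU source tree, not the actual script):

# sketch: fetch the default word break rules from the ICU repository
url <- "https://raw.githubusercontent.com/unicode-org/icu/main/icu4c/source/data/brkitr/rules/word.txt"
rules_base <- paste(readLines(url, encoding = "UTF-8", warn = FALSE), collapse = "\n")
# rules_base could then be stored as the package's default "base" break rules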

@kbenoit (Collaborator) commented Mar 29, 2023

I like putting RBBI rules in the environment. Thanks.
breakrules_get() sounds odd to me. How about breakrules() (like stopwords()) for getting and breakrules<- for setting values to customize?

Even better. I think we should keep these internal for v3 and see how they develop. Only the most advanced users are likely to make use of them. I was thinking more that this, plus the update script, provides a way for us to update the default rules easily.

Actually I started implementing this now, then realised we cannot use the same function for assignment and retrieval, because x cannot be passed as the thing to be modified. So better to have breakrules_get() and breakrules_set(). (Later we can add validators, no need for now however - this is still mainly internal.)
See https://r-pkgs.org/data.html#sec-data-state
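
To make the constraint concrete, here is a minimal, self-contained sketch (hypothetical names, not the actual quanteda implementation) of the environment-backed getter/setter pattern described there; a replacement function breakrules<- would need an ordinary object x to modify, which package-level state cannot supply:

# state lives in a package-level environment, not in a user-visible object
.rules_env <- new.env(parent = emptyenv())
.rules_env$breakrules <- list(base = "...default RBBI rules...", custom = NULL)

breakrules_get <- function(what = "base") {
    .rules_env$breakrules[[what]]
}

breakrules_set <- function(value, what = "custom") {
    .rules_env$breakrules[[what]] <- value
    invisible(value)
}

breakrules_set("xyz")
breakrules_get("custom")
# [1] "xyz"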

@koheiw koheiw changed the base branch from v4 to master March 29, 2023 23:25
@koheiw (Collaborator, Author) commented Mar 29, 2023

Can't we modify the package-level environment at all?

@kbenoit (Collaborator) commented Mar 29, 2023

Can't we modify the package-level environment at all?

Only inside functions - so it cannot be passed as the x in an assignment function.

I just fixed these, but now I see you updated it in 6525670.

@koheiw (Collaborator, Author) commented Mar 29, 2023

Let's discuss how to customize rules on a branch for v4. I still don't like _set() and _get() functions.

Otherwise, word4 is ready. I fixed the issues with the segmentation of tags and of Japanese text.

- change aaa.R to use the reset function
@kbenoit (Collaborator) commented Mar 29, 2023

Take a look at how I implemented them in 21248f3, keeping in mind that these are internal for v3 and could be improved for v4, with validators for instance. But we are limited in how we define these, given the use of environments.

@kbenoit (Collaborator) commented Mar 30, 2023

Take a look at how I implemented them in 21248f3, keeping in mind that these are internal for v3 and could be improved for v4, with validators for instance. But we are limited in how we define these, given the use of environments.

You might need to fix this slightly because I pushed it right before running off to a seminar.

@kbenoit (Collaborator) commented Mar 30, 2023

I think it will pass now, but I am still getting different results for this test locally.

> tokens("゛ん゙", what = "word4")
Tokens consisting of 1 document.
text1 :
[1] "゛ん゙"

> tokens("゛ん゙", what = "word3")
Tokens consisting of 1 document.
text1 :
[1] "゛ん゙"

Details:

R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: aarch64-apple-darwin20 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(quanteda)
Package version: 3.2.5.9000
Unicode version: 14.0
ICU version: 71.1
Parallel computing: 10 of 10 threads used.
See https://quanteda.io for tutorials and examples.

@koheiw (Collaborator, Author) commented Mar 30, 2023

The rules used by stri_split_boundaries(type = "word") are a bit different from breakrules() because it is downloaded from the ICU repository directly. What is the result if you run stri_split_boundaries("゛ん゙", type = breakrules()$base)?

@koheiw (Collaborator, Author) commented Mar 30, 2023

We don't need to finish breakrules() here, but I think it should behave like quanteda_options().

# get
> quanteda_options("tokens_tokenizer_word")
[1] "word3"

# set 
> quanteda_options("tokens_tokenizer_word" = "word4")
# get
> breakrules("base")
[1] "#\n# Copyright (C) 2016 and later: Unicode, Inc. and others.\n# License & terms of use: http://www.unicode.org/copyright.html\n# Copyright (C) 2002-2016, International Business Machines Corporation\n# and others. All Rights Reserved.\n#\n# file:  word.txt\n#\n# ICU Word Break Rules\n#

# set
breakrules("custom" = "xyz") 

@kbenoit (Collaborator) commented Mar 30, 2023

library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

stringi::stri_split_boundaries("゛ん゙", type = breakrules_get()$base)
#> [[1]]
#> [1] "゛ん゙"

packageVersion("stringi")
#> [1] '1.7.12'

Created on 2023-03-30 with reprex v2.0.2

@kbenoit (Collaborator) commented Mar 30, 2023

We don't need to finish breakrules() here, but I think it should behave like quanteda_options().

# get
> quanteda_options("tokens_tokenizer_word")
[1] "word3"

# set 
> quanteda_options("tokens_tokenizer_word" = "word4")
# get
> breakrules("base")
[1] "#\n# Copyright (C) 2016 and later: Unicode, Inc. and others.\n# License & terms of use: http://www.unicode.org/copyright.html\n# Copyright (C) 2002-2016, International Business Machines Corporation\n# and others. All Rights Reserved.\n#\n# file:  word.txt\n#\n# ICU Word Break Rules\n#

# set
breakrules("custom" = "xyz") 

I'd be happy with that.

They could even be quanteda options!

Should we merge this now and put the change on the (short-term) to-do list?

@koheiw (Collaborator, Author) commented Mar 30, 2023

I have no idea why it works differently on your Mac. It is a problem in stringi rather than in our tokenizers. I am not worried about this unusual Japanese string, so we can merge this PR.


> stringi::stri_split_boundaries("゛ん゙", type = breakrules()$base)
[[1]]
[1] "゛"  "ん゙"

> stringi::stri_split_boundaries("゛ん゙", type = "word")
[[1]]
[1] "゛"  "ん゙"

> packageVersion("stringi")
[1] ‘1.7.12’

@kbenoit kbenoit merged commit b6dea7b into master Mar 30, 2023
6 checks passed
@kbenoit (Collaborator) commented Mar 30, 2023

I was just about to post this as a stringi issue, and noticed that it worked as expected. Now this is really weird:

# just to show Unicode info
library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

packageVersion("stringi")
#> [1] '1.7.12'
stringi::stri_split_boundaries("゛ん゙")
#> [[1]]
#> [1] "゛"  "ん゙"

sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.2.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] quanteda_3.2.5.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10        rstudioapi_0.14    knitr_1.42         magrittr_2.0.3    
#>  [5] stopwords_2.3      R.cache_0.16.0     lattice_0.20-45    rlang_1.1.0       
#>  [9] fastmatch_1.1-3    fastmap_1.1.1      styler_1.9.1       tools_4.2.3       
#> [13] grid_4.2.3         xfun_0.37          R.oo_1.25.0        cli_3.6.1         
#> [17] withr_2.5.0        htmltools_0.5.5    RcppParallel_5.1.7 yaml_2.3.7        
#> [21] digest_0.6.31      lifecycle_1.0.3    Matrix_1.5-3       purrr_1.0.1       
#> [25] vctrs_0.6.1        R.utils_2.12.2     fs_1.6.1           glue_1.6.2        
#> [29] evaluate_0.20      rmarkdown_2.21     reprex_2.0.2       stringi_1.7.12    
#> [33] compiler_4.2.3     R.methodsS3_1.8.2

Created on 2023-03-30 with reprex v2.0.2

But then

library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

tokens("゛ん゙")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "゛ん゙"

Created on 2023-03-30 with reprex v2.0.2

Which suggests it's quanteda; maybe something is happening post-split, in how the re-joining of special split characters is handled?

@kbenoit kbenoit deleted the dev-tokenize4 branch April 12, 2023 01:26