# Incorporate RBBI tokenizer for v4.0 #2216

## Conversation
I drafted `tokenize_word4()` based on #2165. I want to make it the main tokenizer of quanteda v4.0. It is also made available via `tokens(what = "word4")`.

@odelmarcelle, I could not understand why you added these rules. Can you explain?

I am hoping `split_elisions = TRUE` makes it easier to analyze French texts. Don't you think "Qu'est-ce" should be tokenized to "Qu" "'" "est-ce" (with the apostrophe as a separate token)?

---
Nice work, I'm glad you had the chance to look at it! For your first question, I added this rule to align behaviour with https://github.com/quanteda/quanteda/blob/master/R/tokenizers.R#L45-L48, which already implied that those characters should be protected in the ICU tokenizer. I couldn't identify any difference between adding and not adding these rules, so I did not use them. I gave it a try today, and I think you could drop the regex replacement at https://github.com/quanteda/quanteda/blob/dev-tokenize4/R/tokenizers.R#L242-L244. See the following examples:

```r
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 13.0
#> ICU version: 69.1
#> Parallel computing: 12 of 12 threads used.
#> See https://quanteda.io for tutorials and examples.
rules_extended <- c(
  data_breakrules,
  list(
    variant = r"---(### Protect variant selector & whitespace with diacritical marks
$Variant = [\uFE00-\uFE0F];
$Diacritical = [\p{whitespace}][\u0300-\u036F];
# Rules
($ALetterPlus | $Hebrew_Letter) $Variant ($ALetterPlus | $Hebrew_Letter);
($ALetterPlus | $Hebrew_Letter) $Diacritical ($ALetterPlus | $Hebrew_Letter);)---"
  )
)
txt <- "i \u2764\ufe0f you \u2764\ufe0f\ufe0f\u2764"
print(txt)
#> [1] "i ❤️ you ❤️️❤"
tokens(txt)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "i" "❤" "you" "❤" "❤"
tokenize_word4(txt)
#> [[1]]
#> [1] "i" " " "❤" " " "you" " " "❤" "❤"
tokenize_custom(txt, data_breakrules)
#> [[1]]
#> [1] "i" " " "❤️" " " "you" " " "❤️️" "❤"
tokenize_custom(txt, rules_extended)
#> [[1]]
#> [1] "i" " " "❤️" " " "you" " " "❤️️" "❤"
txt <- "übër u\u0308be\u0308r \u0308ubër"
print(txt)
#> [1] "übër übër ̈ubër"
tokens(txt)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "übër" "übërubër"
tokenize_word4(txt)
#> [[1]]
#> [1] "übër" " " "übërubër"
tokenize_custom(txt, data_breakrules)
#> [[1]]
#> [1] "übër" " " "übër" " ̈" "ubër"
tokenize_custom(txt, rules_extended)
#> [[1]]
#> [1] "übër" " " "übër ̈ubër" Might be possible to fine-tune the diacritic ICU rule to retain For your second question, "qu" will always be used with an elision to replace "qui" or "que". I'm not sure there is any use in separating the contracted word from the elision. |
---

After giving it a second thought, it is probably better to tokenize "Qu'est-ce" into "Qu" "'" "est-ce", mainly to avoid having "Qu'" and "Qu’" as two different tokens. This can be done by adjusting the elision rule.
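For example, both apostrophe variants occur in real French text, so keeping the apostrophe attached doubles the token types for the same word:

```r
# ASCII apostrophe (U+0027) vs. typographic apostrophe (U+2019)
c("Qu'est-ce", "Qu\u2019est-ce")
#> [1] "Qu'est-ce" "Qu’est-ce"
# attached: "Qu'" and "Qu’" are two different types;
# split: both yield "Qu" plus an apostrophe token
```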
I also added …

---
@odelmarcelle, thank you for the comments. I will investigate the variant selector issue a bit more. You are welcome to edit this branch directly (I sent you an invite); that way, your contribution will be registered properly.

---
You are the expert on elisions here, because the rest of us do not know French. In English, extracting "aren" from "aren't" does not make sense, but extracting "he" from "he'll" does.
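For reference, the stock ICU rules already keep apostrophe contractions together (UAX #29 treats a single quote between letters as word-internal), so any splitting of elisions has to come from our own rules or from post-processing:

```r
# default ICU word boundaries keep English contractions as single tokens
stringi::stri_split_boundaries("he'll aren't", type = "word",
                               skip_word_none = TRUE)
#> [[1]]
#> [1] "he'll"  "aren't"
```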
---

I removed https://github.com/quanteda/quanteda/blob/dev-tokenize4/R/tokenizers.R#L242-L244.

```r
require(stringi)
#> Loading required package: stringi
require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.2.5
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- "i \u2764\ufe0f you \u2764\ufe0f\ufe0f\u2764"
tokenize_custom(txt, data_breakrules_word)
#> [[1]]
#> [1] "i" " " "❤️" " " "you" " " "❤️️" "❤"
tokenize_word4(txt)
#> [[1]]
#> [1] "i" " " "❤️" " " "you" " " "❤️️" "❤"
stri_split_boundaries(txt, type = "word")
#> [[1]]
#> [1] "i" " " "❤️" " " "you" " " "❤️️" "❤"
txt2 <- "übër u\u0308be\u0308r \u0308ubër"
tokenize_custom(txt2, data_breakrules_word)
#> [[1]]
#> [1] "übër" " " "übër" " ̈" "ubër"
tokenize_word4(txt2)
#> [[1]]
#> [1] "übër" " " "übër" " ̈" "ubër"
stri_split_boundaries(txt2, type = "word")
#> [[1]]
#> [1] "übër" " " "übër" " ̈" "ubër" The better way to deal with the stray diacritical marks is post-tokenization cleaning. We can add tokens(txt2, what = "word4") %>%
tokens_remove("^[\\p{Z}\\p{M}\\p{C}]+$", valuetype = "regex")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "übër" "übër" "ubër" |
The only problem with the new tokenizer seems to be the handling of "@", for which the rules have changed in the ICU library.
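A quick way to check what the installed ICU rules do with "@" (output omitted here, since it differs across ICU versions):

```r
stringi::stri_split_boundaries("email @quanteda today", type = "word")
```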
---

I solved this in 7f90f4e.

---
I like putting RBBI rules in the environment. Thanks.

---
Even better. I think we should keep these internal for v3 and see how they develop; only the most advanced users are likely to make use of them. I was thinking more that this, plus the update script, gives us a way to update the default rules easily.

---
Actually, I started implementing this now, then realised we cannot use the same function for assignment and retrieval, because …

---
Can't we modify the package-level environment at all?

---
Only inside functions, so it cannot be passed as the … I just fixed these, but now I see you updated it in 6525670.
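For reference, the standard pattern behind this constraint (a minimal sketch, not quanteda's actual code): the package-level environment is created once at load time, and any later changes to its contents have to happen inside functions:

```r
# created once, when the package is loaded (hypothetical name)
.rbbi_env <- new.env(parent = emptyenv())

# contents can be changed at run time, but only from inside a function
set_breakrules <- function(rules) {
  assign("rules", rules, envir = .rbbi_env)
  invisible(NULL)
}

get_breakrules <- function() {
  get("rules", envir = .rbbi_env)
}
```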
---

Let's discuss how to customize rules on a branch for v4. I still don't like … Otherwise, word4 is ready. I fixed the issues in the segmentation of tags and Japanese.

---
> - change aaa.R to use the reset function
Take a look at how I implemented them in 21248f3, keeping in mind that these are internal for v3 and could be improved for v4, with validators for instance. But we are limited in how we define these, given the use of environments.

---
You might need to fix this slightly, because I pushed it right before running off to a seminar.

---
I think it will pass now, but I am still getting different results on this test locally:

```r
> tokens("゛ん゙", what = "word4")
Tokens consisting of 1 document.
text1 :
[1] "゛ん゙"
> tokens("゛ん゙", what = "word3")
Tokens consisting of 1 document.
text1 :
[1] "゛ん゙"
```

Details:

```
R version 4.2.3 (2023-03-15) -- "Shortstop Beagle"
Platform: aarch64-apple-darwin20 (64-bit)
> library(quanteda)
Package version: 3.2.5.9000
Unicode version: 14.0
ICU version: 71.1
Parallel computing: 10 of 10 threads used.
See https://quanteda.io for tutorials and examples.
```

---
The rules …

---
We don't need to finish …

```r
# get
> quanteda_options("tokens_tokenizer_word")
[1] "word3"

# set
> quanteda_options("tokens_tokenizer_word" = "word4")

# get
> breakrules("base")
[1] "#\n# Copyright (C) 2016 and later: Unicode, Inc. and others.\n# License & terms of use: http://www.unicode.org/copyright.html\n# Copyright (C) 2002-2016, International Business Machines Corporation\n# and others. All Rights Reserved.\n#\n# file: word.txt\n#\n# ICU Word Break Rules\n#

# set
> breakrules("custom" = "xyz")
```
library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
stringi::stri_split_boundaries("゛ん゙", type = breakrules_get()$base)
#> [[1]]
#> [1] "゛ん゙"
packageVersion("stringi")
#> [1] '1.7.12'
```

Created on 2023-03-30 with reprex v2.0.2

---
I'd be happy with that. They could even be quanteda options! Should we merge this now and put the change on the (short-term) to-do list?

---
I have no idea why it works differently on your Mac; it is a problem in stringi rather than in our tokenizers. I am not worried about this unusual Japanese string, so we can merge this PR.

---
I was just about to post this as a stringi issue, and noticed that it worked as expected. Now this is really weird:

```r
# just to show Unicode info
library("quanteda")
library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
packageVersion("stringi")
#> [1] '1.7.12'
stringi::stri_split_boundaries("゛ん゙")
#> [[1]]
#> [1] "゛" "ん゙"
sessionInfo()
#> R version 4.2.3 (2023-03-15)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Ventura 13.2.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] quanteda_3.2.5.9000
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.10 rstudioapi_0.14 knitr_1.42 magrittr_2.0.3
#> [5] stopwords_2.3 R.cache_0.16.0 lattice_0.20-45 rlang_1.1.0
#> [9] fastmatch_1.1-3 fastmap_1.1.1 styler_1.9.1 tools_4.2.3
#> [13] grid_4.2.3 xfun_0.37 R.oo_1.25.0 cli_3.6.1
#> [17] withr_2.5.0 htmltools_0.5.5 RcppParallel_5.1.7 yaml_2.3.7
#> [21] digest_0.6.31 lifecycle_1.0.3 Matrix_1.5-3 purrr_1.0.1
#> [25] vctrs_0.6.1 R.utils_2.12.2 fs_1.6.1 glue_1.6.2
#> [29] evaluate_0.20 rmarkdown_2.21 reprex_2.0.2 stringi_1.7.12
#> [33] compiler_4.2.3     R.methodsS3_1.8.2
```

Created on 2023-03-30 with reprex v2.0.2

But then:

```r
library("quanteda")
#> Package version: 3.2.5.9000
#> Unicode version: 14.0
#> ICU version: 71.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.
tokens("゛ん゙")
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "゛ん゙" Created on 2023-03-30 with reprex v2.0.2 Which suggests it's quanteda, maybe something might be happening post-split in handling the re-joining for special split characters? |