Upgrading tokens_replace() to keep tokens and keys together #2324
Using `tokens_compound()` together with `tokens_replace()`:

```r
require(quanteda)

txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)
dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

tokens_replace2 <- function(x, pattern, replacement = NULL,
                            concatenator = "_", add_key = FALSE) {
    x <- tokens_compound(x, pattern, concatenator = concatenator, join = FALSE)
    fixed <- unlist(object2fixed(pattern, types(x), concatenator = concatenator))
    if (add_key) {
        rep <- paste0(fixed, "/", names(fixed))
    } else {
        rep <- names(fixed)
    }
    tokens_replace(x, fixed, rep)
}

tokens_replace2(toks, dict2, concatenator = " ")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"       "Countries" "is"        "bordered"  "by"        "the"
#>  [7] "Oceans"    "and"       "the"       "Oceans"
#>
#> d2 :
#>  [1] "The"       "Supreme"   "Court"     "of"        "the"       "Countries"
#>  [7] "is"        "seldom"    "in"        "a"         "united"    "state"

tokens_replace2(toks, dict2, concatenator = " ", add_key = TRUE)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"
#>  [5] "by"                      "the"
#>  [7] "Atlantic Ocean/Oceans"   "and"
#>  [9] "the"                     "Pacific Ocean/Oceans"
#>
#> d2 :
#>  [1] "The"                     "Supreme"
#>  [3] "Court"                   "of"
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"
#>  [9] "in"                      "a"
#> [11] "united"                  "state"
```

Created on 2023-12-08 with reprex v2.0.2
What if we just added an argument to `tokens_lookup()`? For instance:

```r
tokens_lookup(toks, dict, keep_tokens = TRUE, concatenator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"
#>  [5] "by"                      "the"
#>  [7] "Atlantic Ocean/Oceans"   "and"
#>  [9] "the"                     "Pacific Ocean/Oceans"
#>
#> d2 :
#>  [1] "The"                     "Supreme"
#>  [3] "Court"                   "of"
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"
#>  [9] "in"                      "a"
#> [11] "united"                  "state"
```
I wanted to simplify this:

```r
require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.

txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)
dict <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                        Oceans = c("* Ocean")), tolower = FALSE)

tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States/Countries"
#>  [3] "is"                      "bordered"
#>  [5] "by"                      "the"
#>  [7] "Atlantic_Ocean/Oceans"   "and"
#>  [9] "the"                     "Pacific_Ocean/Oceans"
#>
#> d2 :
#>  [1] "The"                     "Supreme"
#>  [3] "Court"                   "of"
#>  [5] "the"                     "United_States/Countries"
#>  [7] "is"                      "seldom"
#>  [9] "in"                      "a"
#> [11] "united"                  "state"

tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "+")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States+Countries"
#>  [3] "is"                      "bordered"
#>  [5] "by"                      "the"
#>  [7] "Atlantic_Ocean+Oceans"   "and"
#>  [9] "the"                     "Pacific_Ocean+Oceans"
#>
#> d2 :
#>  [1] "The"                     "Supreme"
#>  [3] "Court"                   "of"
#>  [5] "the"                     "United_States+Countries"
#>  [7] "is"                      "seldom"
#>  [9] "in"                      "a"
#> [11] "united"                  "state"
```

The concatenator for phrases is taken from the meta field of the tokens object, so users should specify the concatenator upstream. For this, I prefer adding lines 164 to 167 in c597cb0.
I can see good reasons not to force the same concatenator. What if we wanted to concatenate tokens that had a POS tag denoted by a "/" separator? This would be something like "capital/ADJ_gains/NOUN_tax/NOUN". Also, it's conceivable that we would later have functions that append other info to a token; what if we call this …?
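As a sketch of that scenario (the tokens and the phrase below are hypothetical, not part of any proposed API change): tokens that already carry "/"-separated POS tags can still be compounded with "_", which is why the tag separator and the phrase concatenator need to stay independent:

```r
require(quanteda)

# Hypothetical POS-tagged tokens, with "/" joining token and tag
toks_pos <- as.tokens(list(d1 = c("capital/ADJ", "gains/NOUN", "tax/NOUN")))

# Compounding with "_" should leave the "/" tag separator intact,
# producing a compound like "capital/ADJ_gains/NOUN_tax/NOUN"
tokens_compound(toks_pos, phrase("capital/ADJ gains/NOUN tax/NOUN"),
                concatenator = "_")
```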
We sometimes keep track of tokens matched to dictionary patterns, but it is not easy (see #2063). `tokens_replace()` can be used to add keys to original tokens (e.g. "United States/Countries"), but only with fixed patterns. To use `tokens_replace()`, we need to know all the matches beforehand, which is not possible when patterns are globs.
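A minimal sketch of that limitation (illustrative only): with a known fixed match the replacement string can be built directly, but a glob such as "* States" has no single replacement until all matches are known:

```r
require(quanteda)

toks <- tokens("The United States is bordered by the Atlantic Ocean.")
toks <- tokens_compound(toks, phrase("United States"), concatenator = "_")

# Fixed pattern: the key can be appended because the match is known in advance
tokens_replace(toks, "United_States", "United_States/Countries")

# A glob pattern like "* States" cannot work this way, because the
# replacement string would have to differ for every matched token
```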