
Upgrading tokens_replace() to keep tokens and keys together #2324

Closed
koheiw opened this issue Dec 7, 2023 · 4 comments

koheiw commented Dec 7, 2023

We sometimes want to keep track of the tokens matched to dictionary patterns, but it is not easy (see #2063). tokens_replace() can be used to add keys to the original tokens (e.g. "United States/Countries"), but only with fixed patterns.

require(quanteda)
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict <- dictionary(list(Countries = c("United States", "Federal Republic of Germany"),
                        Oceans = c("Atlantic Ocean", "Pacific Ocean")), tolower = FALSE)

tokens_lookup(toks, dict)
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "Countries" "Cceans"    "Cceans"   
#> 
#> d2 :
#> [1] "Countries"

To use tokens_replace(), we need to know all the matches beforehand, which is not possible when the patterns are glob patterns.

# fixed dictionary

pat <- unlist(dict, use.names = FALSE)
rep <- paste0(pat, "/", rep(names(dict), lengths(dict)))
tokens_replace(toks, phrase(pat), rep)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Cceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Cceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

# glob dictionary

dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

pat2 <- unlist(dict2, use.names = FALSE)
rep2 <- paste0(pat2, "/", rep(names(dict2), lengths(dict2)))
tokens_replace(toks, phrase(pat2), rep2)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                "* States/Countries" "is"                
#>  [4] "bordered"           "by"                 "the"               
#>  [7] "* Ocean/Cceans"     "and"                "the"               
#> [10] "* Ocean/Cceans"    
#> 
#> d2 :
#>  [1] "The"                "Supreme"            "Court"             
#>  [4] "of"                 "the"                "* States/Countries"
#>  [7] "is"                 "seldom"             "in"                
#> [10] "a"                  "united"             "state"

koheiw commented Dec 8, 2023

Using tokens_compound(), we can achieve this without changing the C++ code! I did not like how tokens_replace() works with a dictionary, but this is useful behavior. We could also substitute tokens_lookup(exclusive = FALSE) with this.

require(quanteda)
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

tokens_replace2 <- function(x, pattern, replacement = NULL,
                            concatenator = "_", add_key = FALSE) {
    
    # compound matched (multi-word) patterns into single tokens
    x <- tokens_compound(x, pattern, concatenator = concatenator, join = FALSE)
    # expand the (possibly glob) patterns to the fixed types actually matched
    fixed <- unlist(object2fixed(pattern, types(x), concatenator = concatenator))
    if (add_key) {
        # keep the matched token and append the dictionary key
        rep <- paste0(fixed, "/", names(fixed))
    } else {
        # replace the matched token with the dictionary key only
        rep <- names(fixed)
    }
    tokens_replace(x, fixed, rep)
}

tokens_replace2(toks, dict2, concatenator = " ")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"       "Countries" "is"        "bordered"  "by"        "the"      
#>  [7] "Oceans"    "and"       "the"       "Oceans"   
#> 
#> d2 :
#>  [1] "The"       "Supreme"   "Court"     "of"        "the"       "Countries"
#>  [7] "is"        "seldom"    "in"        "a"         "united"    "state"
tokens_replace2(toks, dict2, concatenator = " ", add_key = TRUE)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

Created on 2023-12-08 with reprex v2.0.2


kbenoit commented Dec 13, 2023

What if we just added an argument to tokens_lookup() that kept the token matched and appended the dictionary key?
This could allow lookups to function as a way of annotating tokens more generally, and keep it all within one function.

For instance:

tokens_lookup(toks, dict, keep_tokens = TRUE, concatenator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

koheiw added a commit that referenced this issue Dec 14, 2023

koheiw commented Dec 14, 2023

I wanted to simplify tokens_lookup(), but it seems easiest to do it there. I tentatively added append and separator. If we want to add only one argument, separator = NULL could mean do not append.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic_Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific_Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United_States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "+")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States+Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic_Ocean+Oceans"   "and"                    
#>  [9] "the"                     "Pacific_Ocean+Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United_States+Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

The concatenator for phrases is taken from the meta field of the tokens object. This means users should specify the concatenator upstream. For this, I would prefer adding concatenator = "_" to tokens() so that tokens_compound(), tokens_ngrams(), and tokens_lookup() use the same concatenator.
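
A sketch of that idea, assuming a concatenator argument were added to tokens() (it is the proposal here, not an existing argument):

toks <- tokens(txt, remove_punct = TRUE, concatenator = "_")  # hypothetical argument
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")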

quanteda/R/tokens_lookup.R

Lines 164 to 167 in c597cb0

if (append) {
    fixed <- sapply(ids, function(x, y) paste(type[x], collapse = y),
                    field_object(attrs, "concatenator"))
    key <- paste0(fixed, separator, names(fixed))


kbenoit commented Dec 14, 2023

I can see good reasons not to force the same concatenator. What if we wanted to concatenate tokens that had a POS tag denoted by a "/" separator? This would be something like "capital/ADJ_gains/NOUN_tax/NOUN".
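
A sketch of how such POS-tagged tokens could arise (assuming spacyr is installed and initialised, and assuming include_pos is the relevant as.tokens() argument):

require(spacyr)
# as.tokens() for a spacyr parse joins each token and its tag with concatenator = "/"
toks_pos <- as.tokens(spacy_parse("A capital gains tax."), include_pos = "pos", concatenator = "/")
# compounding then joins the tagged tokens with "_", e.g. "capital/ADJ_gains/NOUN_tax/NOUN"
tokens_compound(toks_pos, phrase("capital/* gains/* tax/*"), concatenator = "_")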

We use concatenator in tokens_compound() and in as.tokens.spacyr_parsed(). Since we are concatenating the dictionary key, I think we should use that argument name instead of separator.

Also, it's conceivable that we would later have functions that append other info to a token; what if we call this append_key rather than append?
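
Under that naming, the call from the previous comment would look something like this (a sketch with the proposed argument names, which are not yet implemented):

tokens_lookup(toks, dict, exclusive = FALSE, append_key = TRUE, concatenator = "/")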

kbenoit pushed a commit that referenced this issue Dec 15, 2023
kbenoit added a commit that referenced this issue Dec 15, 2023
koheiw closed this as completed Jan 3, 2024