
Upgrading tokens_replace() to keep tokens and keys together #2324

Closed
koheiw opened this issue Dec 7, 2023 · 4 comments

koheiw commented Dec 7, 2023

We sometimes want to keep track of the tokens matched to dictionary patterns, but it is not easy (see #2063). tokens_replace() can be used to add keys to the original tokens (e.g. "United States/Countries"), but only with fixed patterns.

require(quanteda)
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict <- dictionary(list(Countries = c("United States", "Federal Republic of Germany"),
                        Oceans = c("Atlantic Ocean", "Pacific Ocean")), tolower = FALSE)

tokens_lookup(toks, dict)
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "Countries" "Cceans"    "Cceans"   
#> 
#> d2 :
#> [1] "Countries"

To use tokens_replace(), we need to know all the matches beforehand, which is not possible when the patterns are glob patterns.

# fixed dictionary

pat <- unlist(dict, use.names = FALSE)
rep <- paste0(pat, "/", rep(names(dict), lengths(dict)))
tokens_replace(toks, phrase(pat), rep)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Cceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Cceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

# glob dictionary

dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

pat2 <- unlist(dict2, use.names = FALSE)
rep2 <- paste0(pat2, "/", rep(names(dict2), lengths(dict2)))
tokens_replace(toks, phrase(pat2), rep2)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                "* States/Countries" "is"                
#>  [4] "bordered"           "by"                 "the"               
#>  [7] "* Ocean/Cceans"     "and"                "the"               
#> [10] "* Ocean/Cceans"    
#> 
#> d2 :
#>  [1] "The"                "Supreme"            "Court"             
#>  [4] "of"                 "the"                "* States/Countries"
#>  [7] "is"                 "seldom"             "in"                
#> [10] "a"                  "united"             "state"

koheiw commented Dec 8, 2023

Using tokens_compound(), we can achieve this without changing the C++ code! I did not like how tokens_replace() works with a dictionary, but this is useful behavior. We could also substitute tokens_lookup(exclusive = FALSE) with this.

require(quanteda)
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict2 <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

tokens_replace2 <- function(x, pattern, replacement = NULL,
                            concatenator = "_", add_key = FALSE) {
    
    # compound matched (multi-word) patterns into single tokens
    x <- tokens_compound(x, pattern, concatenator = concatenator, join = FALSE)
    # expand the (possibly glob) patterns to the fixed types actually matched
    fixed <- unlist(object2fixed(pattern, types(x), concatenator = concatenator))
    if (add_key) {
        # keep the matched token and append the dictionary key
        rep <- paste0(fixed, "/", names(fixed))
    } else {
        # replace the matched token with the dictionary key only
        rep <- names(fixed)
    }
    tokens_replace(x, fixed, rep)
}

tokens_replace2(toks, dict2, concatenator = " ")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"       "Countries" "is"        "bordered"  "by"        "the"      
#>  [7] "Oceans"    "and"       "the"       "Oceans"   
#> 
#> d2 :
#>  [1] "The"       "Supreme"   "Court"     "of"        "the"       "Countries"
#>  [7] "is"        "seldom"    "in"        "a"         "united"    "state"
tokens_replace2(toks, dict2, concatenator = " ", add_key = TRUE)
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

Created on 2023-12-08 with reprex v2.0.2


kbenoit commented Dec 13, 2023

What if we just added an argument to tokens_lookup() that kept the token matched and appended the dictionary key?
This could allow lookups to function as a way of annotating tokens more generally, and keep it all within one function.

For instance:

tokens_lookup(toks, dict, keep_tokens = TRUE, concatenator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

koheiw added a commit that referenced this issue Dec 14, 2023

koheiw commented Dec 14, 2023

I wanted to simplify tokens_lookup(), but it seems easiest to do it there. I tentatively added append and separator. If we want to add only one argument, separator = NULL could mean do not append.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 4.0.0
#> Unicode version: 15.1
#> ICU version: 74.1
#> Parallel computing: 16 of 16 threads used.
#> See https://quanteda.io for tutorials and examples.
txt <- c(d1 = "The United States is bordered by the Atlantic Ocean and the Pacific Ocean.",
         d2 = "The Supreme Court of the United States is seldom in a united state.")
toks <- tokens(txt, remove_punct = TRUE)

dict <- dictionary(list(Countries = c("* States", "Federal Republic of *"),
                         Oceans = c("* Ocean")), tolower = FALSE)

tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States/Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic_Ocean/Oceans"   "and"                    
#>  [9] "the"                     "Pacific_Ocean/Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United_States/Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "+")
#> Tokens consisting of 2 documents.
#> d1 :
#>  [1] "The"                     "United_States+Countries"
#>  [3] "is"                      "bordered"               
#>  [5] "by"                      "the"                    
#>  [7] "Atlantic_Ocean+Oceans"   "and"                    
#>  [9] "the"                     "Pacific_Ocean+Oceans"   
#> 
#> d2 :
#>  [1] "The"                     "Supreme"                
#>  [3] "Court"                   "of"                     
#>  [5] "the"                     "United_States+Countries"
#>  [7] "is"                      "seldom"                 
#>  [9] "in"                      "a"                      
#> [11] "united"                  "state"

The concatenator for phrases is taken from the meta field of the tokens object. This means users should specify the concatenator upstream. For this, I would prefer adding concatenator = "_" to tokens() so that tokens_compound(), tokens_ngrams(), and tokens_lookup() use the same concatenator.
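
A sketch of that idea, assuming a concatenator argument were added to tokens() (it is the proposal here, not an existing argument):

toks <- tokens(txt, remove_punct = TRUE, concatenator = "_")  # hypothetical argument
tokens_lookup(toks, dict, exclusive = FALSE, append = TRUE, separator = "/")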

quanteda/R/tokens_lookup.R

Lines 164 to 167 in c597cb0

if (append) {
    fixed <- sapply(ids, function(x, y) paste(type[x], collapse = y),
                    field_object(attrs, "concatenator"))
    key <- paste0(fixed, separator, names(fixed))


kbenoit commented Dec 14, 2023

I can see good reasons not to force the same concatenator. What if we wanted to concatenate tokens that had a POS tag denoted by a "/" separator? This would be something like "capital/ADJ_gains/NOUN_tax/NOUN".
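
A sketch of how such POS-tagged tokens could arise (assuming spacyr is installed and initialised, and assuming include_pos is the relevant as.tokens() argument):

require(spacyr)
# as.tokens() for a spacyr parse joins each token and its tag with concatenator = "/"
toks_pos <- as.tokens(spacy_parse("A capital gains tax."), include_pos = "pos", concatenator = "/")
# compounding then joins the tagged tokens with "_", e.g. "capital/ADJ_gains/NOUN_tax/NOUN"
tokens_compound(toks_pos, phrase("capital/* gains/* tax/*"), concatenator = "_")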

We use concatenator in tokens_compound() and in as.tokens.spacyr_parsed(). Since we are concatenating the dictionary key, I think we should use that argument name instead of separator.

Also, it's conceivable that we would later have functions that append other info to a token; what if we call this append_key rather than append?
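
Under that naming, the call from the previous comment would look something like this (a sketch with the proposed argument names, which are not yet implemented):

tokens_lookup(toks, dict, exclusive = FALSE, append_key = TRUE, concatenator = "/")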

kbenoit pushed a commit that referenced this issue Dec 15, 2023
kbenoit added a commit that referenced this issue Dec 15, 2023
koheiw closed this as completed Jan 3, 2024