Need (internal?) tokens_split() #1500

kbenoit · 2018-11-19T23:02:26Z

This would resolve the inefficiency noted here, but with a general solution. I started writing this based on the (internal) tokens_segment(), but then went for a quicker fix, since I only needed a solution for internal hyphens in

quanteda/R/tokens.R

Lines 221 to 225 in 0dc5b34

    
    if (remove_hyphens && any(stri_detect_regex(types(x), "^.+-.+$"))) {
        x <- lapply(as.list(x), function(y)
            as.character(tokens(as.character(y), remove_hyphens = TRUE))) %>%
            as.tokens()
    }

Proposal:

tokens_split(x, pattern, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, keep_pattern = FALSE)

which would work in the opposite direction to tokens_compound(), with otherwise similar behaviour.
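To make the intended behaviour concrete, here is a rough base-R sketch of the splitting step. It is not the quanteda implementation: `split_tokens_sketch()` and its arguments `separator` and `keep_separator` are hypothetical names, it works on a plain named list of character vectors rather than a tokens object, and it only handles a fixed separator that sits between two non-empty parts.

```r
# Hypothetical helper, not part of quanteda: split each token on a fixed
# separator, optionally keeping the separator as its own token.
split_tokens_sketch <- function(x, separator = "-", keep_separator = FALSE) {
    split_one <- function(tok) {
        parts <- strsplit(tok, separator, fixed = TRUE)[[1]]
        parts <- parts[nzchar(parts)]
        # Leave the token untouched unless the separator is internal
        if (length(parts) < 2) return(tok)
        if (!keep_separator) return(parts)
        # Interleave the separator between the parts
        out <- character(2 * length(parts) - 1)
        out[seq(1, length(out), by = 2)] <- parts
        out[seq(2, length(out), by = 2)] <- separator
        out
    }
    lapply(x, function(toks) {
        as.character(unlist(lapply(toks, split_one), use.names = FALSE))
    })
}

toks <- list(d1 = c("a", "self-aware", "machine"),
             d2 = c("state-of-the-art"))
split_tokens_sketch(toks)
split_tokens_sketch(toks, keep_separator = TRUE)
```

Unlike the workaround quoted above, this does not re-tokenize every document, but it still loops over tokens in R, so it only illustrates the semantics and not the efficiency a real tokens_split() would need.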


kbenoit · 2018-11-20T10:09:17Z

BTW, I implemented a hyphen-specific version for dfm features here, probably about as efficiently as it's possible to do in straight R.

But it's easier for a dfm than for tokens because (a) we can use column indexing on matrices and (b) feature order does not matter for a dfm.
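The two points above can be illustrated with a small base-R sketch on an ordinary count matrix. This is not the linked implementation: `split_hyphen_features()` is a hypothetical name, a dense matrix stands in for a dfm, and only features with an internal hyphen are assumed to need splitting.

```r
# Hypothetical helper, not part of quanteda: replace each hyphenated feature
# column with one column per component, then sum columns that share a name.
split_hyphen_features <- function(m) {
    feats <- colnames(m)
    hyph <- grepl("^.+-.+$", feats)
    if (!any(hyph)) return(m)
    parts <- strsplit(feats[hyph], "-", fixed = TRUE)
    parts <- lapply(parts, function(p) p[nzchar(p)])
    # (a) column indexing: repeat each hyphenated column once per component
    src <- rep(which(hyph), lengths(parts))
    expanded <- m[, src, drop = FALSE]
    colnames(expanded) <- unlist(parts)
    combined <- cbind(m[, !hyph, drop = FALSE], expanded)
    # (b) feature order is irrelevant, so duplicates can simply be summed
    t(rowsum(t(combined), group = colnames(combined), reorder = FALSE))
}

m <- matrix(c(1, 0, 2, 1, 0, 3), nrow = 2,
            dimnames = list(c("d1", "d2"), c("self", "self-aware", "aware")))
split_hyphen_features(m)
```

For tokens neither shortcut applies: the components have to be inserted at the right position inside each document, which is why a dedicated tokens_split() is being proposed.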

kbenoit · 2018-11-20T10:10:56Z

To Do: Reimplement the inefficient solution to #1498 once tokens_split() is working.

kbenoit · 2018-12-27T08:34:21Z

Solved in #1520 via tokens_chunk().

This was referenced Nov 19, 2018

tokens.tokens(x, remove_hyphens = TRUE) does not split hyphenated word components #1498

Closed

tokens(x, remove_punct) too aggressive? #1445

Closed

koheiw mentioned this issue Nov 24, 2018

Dev tokens split #1502

Merged

kbenoit added this to the v1.4 milestone Dec 18, 2018

kbenoit closed this as completed Dec 27, 2018
