tokens.tokens(x, remove_hyphens = TRUE) does not split hyphenated word components #1498

@kbenoit

Describe the bug

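tokens.tokens(x, remove_hyphens = TRUE) does not split hyphenated words when x is already a tokens object. The hyphen is replaced by a space and the components stay in a single token, unlike the same call on a character vector.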

> txt <- "Auto-immune system."
> tokens(txt, remove_hyphens = TRUE)
tokens from 1 document.
text1 :
[1] "Auto"   "-"      "immune" "system" "."     

> tokens(txt, remove_hyphens = FALSE)
tokens from 1 document.
text1 :
[1] "Auto-immune" "system"      "."          

> tokens(txt, remove_hyphens = FALSE) %>% tokens(remove_hyphens = TRUE)
tokens from 1 document.
text1 :
[1] "Auto immune" "system"      "."  

Expected behavior

> tokens(txt, remove_hyphens = FALSE) %>% tokens(remove_hyphens = TRUE)
tokens from 1 document.
text1 :
[1] "Auto"   "-"      "immune" "system" "."
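
To make the expected splitting concrete, here is a minimal base-R sketch that operates on a plain character vector of tokens. The helper name split_hyphens is hypothetical and is not part of quanteda; it only illustrates the desired result (each hyphen becomes its own token) and says nothing about how tokens.tokens() should implement it internally.

```r
# Hypothetical helper (not a quanteda function): split each token on
# hyphens, keeping every hyphen as a separate token.
split_hyphens <- function(toks) {
  # Surround hyphens with spaces so they become standalone pieces
  padded <- gsub("-", " - ", toks, fixed = TRUE)
  # Split on single spaces and flatten into one character vector
  parts <- unlist(strsplit(padded, " ", fixed = TRUE))
  # Drop empty strings left by leading, trailing, or doubled hyphens
  parts[nzchar(parts)]
}

split_hyphens(c("Auto-immune", "system", "."))
# [1] "Auto"   "-"      "immune" "system" "."
```

Applied to the tokens from the remove_hyphens = FALSE call, this yields the same five tokens shown under the expected behavior.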

System information

> packageVersion("quanteda")
[1] ‘1.3.14’
