`remove_punct` behaves differently:
```r
tokens("This, a test.", what = "fasterword", remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "This," "a"     "test."
```
and Twitter characters:
```r
tokens("@kenbenoit #quanteda", what = "fasterword",
       remove_twitter = TRUE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "@kenbenoit" "#quanteda"

tokens("@kenbenoit #quanteda", what = "fasterword",
       remove_twitter = FALSE, remove_punct = TRUE)
## tokens from 1 document.
## text1 :
## [1] "@kenbenoit" "#quanteda"
```
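For comparison, `remove_twitter = TRUE` is documented to strip the `@` and `#` marks. A base-R sketch of the expected result (a plain regex stand-in, not quanteda's actual implementation):

```r
# Expected effect of remove_twitter = TRUE on these tokens, sketched with a
# simple regex: drop a leading @ or # mark from each token.
toks <- c("@kenbenoit", "#quanteda")
gsub("^[@#]", "", toks)
## [1] "kenbenoit" "quanteda"
```

With `fasterword`, neither setting changes the output above, which is the bug being reported.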
`remove_separators` behaves differently:
```r
txt <- "one two\nthree\tfour"

# unexpected
tokens(txt, what = "fasterword", remove_separators = FALSE)
## tokens from 1 document.
## text1 :
## [1] "one"              "two\nthree\tfour"

# correct
tokens(txt, what = "fasterword", remove_separators = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"   "two"   "three" "four"

# correct
tokens(txt, what = "word", remove_separators = FALSE)
## tokens from 1 document.
## text1 :
## [1] "one"   " "     "two"   "\n"    "three" "\t"    "four"

# correct
tokens(txt, what = "word", remove_separators = TRUE)
## tokens from 1 document.
## text1 :
## [1] "one"   "two"   "three" "four"
```
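Since `fasterword` tokenization is essentially a split on whitespace, the `remove_separators = TRUE` result above can be reproduced in base R (a sketch of the expected behaviour, not the package code, which splits on a broader Unicode whitespace class):

```r
txt <- "one two\nthree\tfour"
# Split on any run of space, tab, or newline characters
strsplit(txt, "[ \t\n]+")[[1]]
## [1] "one"   "two"   "three" "four"
```

The unexpected `remove_separators = FALSE` output suggests `fasterword` was splitting only on spaces, leaving `\n` and `\t` inside tokens.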
`remove_url = TRUE` is ok:
```r
tokens("https://quanteda.io is our website", what = "fasterword",
       remove_url = TRUE)
## tokens from 1 document.
## text1 :
## [1] "is"      "our"     "website"

tokens("https://quanteda.io is our website", what = "fasterword",
       remove_url = FALSE)
## tokens from 1 document.
## text1 :
## [1] "https://quanteda.io" "is"                  "our"                 "website"
```
`remove_numbers` is ok:
```r
tokens("99 red balloons 4ever", what = "fasterword", remove_numbers = TRUE) %>%
    as.character()
## [1] "red"      "balloons" "4ever"

tokens("99 red balloons 4ever", what = "fasterword", remove_numbers = FALSE) %>%
    as.character()
## [1] "99"       "red"      "balloons" "4ever"
```
`remove_symbols` is ok:
```r
tokens("2 < 3 = TRUE #logic", what = "fasterword", remove_symbols = TRUE) %>%
    as.character()
## [1] "2"      "3"      "TRUE"   "#logic"

tokens("2 < 3 = TRUE #logic", what = "fasterword", remove_symbols = FALSE) %>%
    as.character()
## [1] "2"      "<"      "3"      "="      "TRUE"   "#logic"
```
`remove_hyphens` is ok:
```r
txt <- "Jacob-Rees - second-rate"

identical(
    tokens(txt, what = "word", remove_hyphens = FALSE) %>% as.character(),
    tokens(txt, what = "fasterword", remove_hyphens = FALSE) %>% as.character()
)
## [1] TRUE

identical(
    tokens(txt, what = "word", remove_hyphens = TRUE) %>% as.character(),
    tokens(txt, what = "fasterword", remove_hyphens = TRUE) %>% as.character()
)
## [1] TRUE
```
This seems to be caused by a change in #1420
(quanteda/R/tokens.R, lines 670 to 674 in bedca07)
Address #1447 (commit 8e206f9):

- Revert change in ad359e6.
- Fix handling of new line markers.
- Update documentation as well.
- Correct tests.
Merge pull request #1459 from quanteda/issue-1447 (commit 73d63a1)
#1459 solved the separators issue, now that just leaves the punctuation and Twitter differences.
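Until the punctuation difference is fixed, one possible workaround is to post-process the `fasterword` tokens in base R, stripping punctuation only at token boundaries (a hedged sketch on a plain character vector, not a quanteda feature; note it also strips `@` and `#`, so it does not preserve Twitter characters):

```r
toks <- c("This,", "a", "test.")
# Strip punctuation at the start or end of each token only, so that
# interior hyphens as in "second-rate" are left intact
gsub("^[[:punct:]]+|[[:punct:]]+$", "", toks)
## [1] "This" "a"    "test"
```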