How to remove only the first element of documents using tokens_select() #1475
Comments
Probably by adding an argument to tokens_select(), similar to how stringi implements these, rather than adding tokens_select_first() / tokens_select_last() / tokens_select_all(). This way we keep consistency with existing quanteda grammar, and leave the default to match current behaviour, so we have no compatibility issues. So we would have

tokens_select(x, position = c("all", "first", "last"), ...)

where for this usage you would call:

tokens_remove(x, pattern = "RT", position = "first")
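As an aside, until such a position argument exists, the position = "first" behaviour can be approximated by round-tripping through a list with the existing as.list() / as.tokens() functions. This is only a workaround sketch, not the proposed API:

```r
library(quanteda)

toks <- tokens("RT @koheiw7 : people should not watch RT")

# drop "RT" only when it occupies the first position of a document;
# the trailing acronym "RT" is left untouched
lst <- as.list(toks)
lst <- lapply(lst, function(x) if (length(x) > 0 && x[1] == "RT") x[-1] else x)
toks2 <- as.tokens(lst)
toks2
```

This scales poorly compared with doing the check in C++, but it shows the intended semantics.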
On 28 Oct 2018, at 05:36, Kohei Watanabe wrote:
The first "RT" indicates that this text is a retweet, but the last is an acronym (Russia Today). What should I do if I only want to remove the first "RT"?
tokens("RT @koheiw7: people should not watch RT")
tokens from 1 document.
text1 :
[1] "RT" "@koheiw7" ":" "people" "should" "not" "watch" "RT"
Probably I cannot do this using quanteda functions 😞 Restricting the positions of pattern matching is easy in the C++ code, but what should the command look like?
> tokens("people should not watch RT")
tokens from 1 document.
text1 :
[1] "people" "should" "not"    "watch"  "RT"

That's true, so what we would need then is some rule to remove "RT" based on either its absolute position in the tokens object, or some form of regex-type identifier for the sequence (like "^RT", but where the "^" means the beginning of the document). Of course this could also be done using regex operations on the text itself, pre-tokens, or using some careful application of existing functions.
To make it concrete, consider a news text with a dateline and by-line.

txt <- "PARIS, Dec. 2 (AP) -- A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament."
toks <- tokens(txt, remove_punct = TRUE)
toks
# tokens from 1 document.
# text1 :
#  [1] "PARIS"      "Dec"        "2"          "AP"         "A"          "powerful"   "fighting"   "navy"       "to"         "back"
# [11] "up"         "her"        "extensive"  "Rhineland"  "defenses"   "is"         "being"      "built"      "by"         "France"
# [21] "in"         "reply"      "to"         "German"     "rearmament"

If I want to remove any token at the beginning:

tokens_select(toks, "*", selection = "remove", position = 1)

If I want to remove news agency names within the 1st-10th tokens:

tokens_select(toks, c("AP", "AFP"), selection = "remove", position = 1:10)
# or
tokens_select(toks, c("AP", "AFP"), selection = "remove", position = c(1, 10)) # easier to implement

We can allow indexing from the end of the document by negative values, like Python, but I am not sure when we will do this. I have to do this kind of cleaning in the corpus with regex now, but would love to do it in tokens!
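The Python-style negative indexing mentioned above could be resolved with a small helper before the C++ matching runs. A minimal base-R sketch of the intended mapping (the name resolve_pos is hypothetical, not an existing quanteda function):

```r
# hypothetical helper: map possibly-negative positions to absolute
# 1-based indices for a document of n tokens (-1 = last, -2 = second-last)
resolve_pos <- function(pos, n) {
  ifelse(pos < 0, n + pos + 1, pos)
}

resolve_pos(c(1, 10), n = 25)   # positive positions pass through: 1 10
resolve_pos(c(-1, -2), n = 25)  # counted from the end: 25 24
```

Each document would resolve its own positions against its own ntoken() value, so the same call works for documents of different lengths.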
OK, that makes sense, but just so I'm clear: here I like the first approach (e.g. position = 1:10). The absolute position method as you outline above has multiple uses. Worth thinking about whether we would want to split these functions into something like:

tokens_substring(mytokens, first = 1, last = 10) %>%
    tokens_select(pattern = c("AP", "AFP"))

although this works for selection but not as well for removal, unless there were some sort of replacement as well. Worth considering a separate function for this too. Also worth thinking about the relative position and whether or how that could also be implemented.
There is stringi::stri_sub(str, from = 1L, to = -1L, length), which takes vectors for from and to. I prefer that style to

tokens_select(toks, c("AP", "AFP"), selection = "remove", position = 1:10)

because it allows

tokens_select(toks, c("AP", "AFP"), selection = "remove", from = 1L, to = -1L)
# or
tokens_select(toks, c("AP", "AFP"), selection = "remove", from = NULL, to = NULL)
tokens_select(toks, c("AP", "AFP"), selection = "remove", from = c(1, 10, 4), to = ntoken(toks))

We can use it in combination with kwic():

kw <- kwic(toks, "something")
kw <- kw[!duplicated(kw$docname), ]
tokens_select(toks, c("AP", "AFP"), selection = "remove", from = kw$from, to = kw$to)

Specifying positions by vectors is more flexible.
I was about to upgrade tokens_select() to accept

tokens_select(toks, from = 1, to = 10)

but it would be better if we make a separate function for this.
OK, updated thoughts:
library("quanteda")
## Package version: 1.9.9000
txt <- c(
"PARIS, Dec. 2 (AP) -- A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament.",
"LONDON, Dec. 8 (Reuters) -- An amazing new application for tokens_segment() has been applied to avoid creating a new function."
)
toks <- tokens(txt)
tokens_segment(toks, phrase("- -"), pattern_position = "after") %>%
tokens_subset(subset = !grepl("\\.1$", docnames(.)))
## tokens from 2 documents.
## text1.2 :
## [1] "A" "powerful" "fighting" "navy" "to"
## [6] "back" "up" "her" "extensive" "Rhineland"
## [11] "defenses" "is" "being" "built" "by"
## [16] "France" "in" "reply" "to" "German"
## [21] "rearmament" "."
##
## text2.2 :
## [1] "An" "amazing" "new" "application"
## [5] "for" "tokens_segment" "(" ")"
## [9] "has" "been" "applied" "to"
## [13] "avoid" "creating" "a" "new"
## [17] "function"      "."
It is not only about taking the first or last N elements. How about this?

tokens_range(toks, from = 1, to = 10)
If it is extracting specific tokens, then:

tokens_extract(toks, from = 1, to = 10)

This function can be developed in the future to support different extraction criteria (that we do not know yet).
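Neither tokens_range() nor tokens_extract() exists at this point; purely for illustration, the from = 1, to = 10 extraction can be approximated today with a list round-trip:

```r
library(quanteda)

txt <- "PARIS, Dec. 2 (AP) -- A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament."
toks <- tokens(txt, remove_punct = TRUE)

# keep only positions 1..10 of each document, clamped to document length
lst <- lapply(as.list(toks), function(x) x[seq_len(min(10, length(x)))])
toks10 <- as.tokens(lst)
ntoken(toks10)
```

A native implementation would of course avoid materializing the list and do the slicing in C++.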
I think I like the variation you suggested above, that we could add two new arguments to tokens_select():

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)

(where the defaults would match current behaviour). Then we would not need a separate function.
This is a way to avoid the naming problem.

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)

But I wonder whether pattern = "*" is efficient on large objects.
OK, how about this: add startpos and endpos to tokens_select(), without implementing the full from/to vectorized version yet. That allows us to keep the new behaviour in tokens_select() rather than adding new functions.
I think this is a good direction.
Agreed. But let's leave it without a default in the function signature, treat it as matching anything when startpos/endpos are used, and catch this with internal checks, rather than putting some more complicated conditional or default logic in the signature.
I tested with a relatively large tokens object, and confirmed that pattern = "*" is fast:

> ndoc(toks)
[1] 1117011
> length(types(toks))
[1] 210288
>
> system.time(
+   tokens_select(toks, "*")
+ )
   user  system elapsed
  3.509   0.007   1.389
> system.time(
+   tokens_remove(toks, "*")
+ )
   user  system elapsed
  3.479   0.000   1.385

However, I don't want users to do

> tokens_select(toks, phrase("* *"))
Error: cannot allocate vector of size 329.5 Gb

because it creates 210288 ^ 2 patterns internally. As long as it is a single "*", performance is fine, so it does not hurt to encourage users to use

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)
Cool, then let's block the pure wildcard phrase patterns like phrase("* *").
This is the most flexible version, with an interesting interaction between pattern and position. Basically, the positions change the area where the functions scan for patterns.

> toks <- tokens(c("This is a sentence.", "This is a second sentence."),
+               remove_punct = TRUE)
> tokens_select(toks, c("is", "a", "this"), startpos = 2)
tokens from 2 documents.
text1 :
[1] "is" "a"
text2 :
[1] "is" "a"
> tokens_select(toks, c("is", "a", "this"), padding = TRUE, startpos = 2)
tokens from 2 documents.
text1 :
[1] "" "is" "a" ""
text2 :
[1] "" "is" "a" "" ""
>
> tokens_remove(toks, c("is", "a", "this"), startpos = 2)
tokens from 2 documents.
text1 :
[1] "This" "sentence"
text2 :
[1] "This" "second" "sentence"
> tokens_remove(toks, c("is", "a", "this"), padding = TRUE, startpos = 2)
tokens from 2 documents.
text1 :
[1] "This" "" "" "sentence"
text2 :
[1] "This" "" "" "second" "sentence"
The ending position works too.

> tokens_select(toks, "*", startpos = 1, endpos = 2)
tokens from 2 documents.
text1 :
[1] "This" "is"
text2 :
[1] "This" "is"
> tokens_select(toks, "*", startpos = 1, endpos = -2)
tokens from 2 documents.
text1 :
[1] "This" "is" "a"
text2 :
[1] "This" "is" "a" "second"