How to remove only the first element of documents using tokens_select() #1475

Closed
koheiw opened this issue Oct 28, 2018 · 19 comments


koheiw commented Oct 28, 2018

The first "RT" indicate that this text is a retweet but the last is an acronym (Russia Today). What should I do if I only want to remove the first "RT"?

> tokens("RT @koheiw7: people should not watch RT")
tokens from 1 document.
text1 :
[1] "RT"      "@koheiw7" ":"       "people"  "should"  "not"     "watch"   "RT"     

Probably I cannot do this using quanteda functions 😞 Restricting the positions of pattern matching is easy in the C++ code, but how should the command look?

@koheiw koheiw changed the title from "How to remove only first element of documents using tokens_select()" to "How to remove only the first element of documents using tokens_select()" Oct 28, 2018

kbenoit commented Oct 28, 2018 via email


koheiw commented Oct 28, 2018

first and last would be useful, but it can still remove the acronym, because in a text like the one below the acronym "RT" is itself the first match:

> tokens("people should not watch RT")
tokens from 1 document.
text1 :
[1] "people"  "should"  "not"     "watch"   "RT"    

stringi has stri_startswith and stri_endswith for this, so position = c("all", "start", "end")?
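For reference, a quick demo of the stringi anchored matching mentioned above, using the example text from the original post:

> stringi::stri_startswith_fixed("RT @koheiw7: people should not watch RT", "RT")
[1] TRUE
> stringi::stri_endswith_fixed("RT @koheiw7: people should not watch RT", "RT")
[1] TRUE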


kbenoit commented Oct 29, 2018

That's true, so what we would need then is some rule to remove RT based either on its absolute position in the tokens object, or on some form of regex-type identifier for the sequence (like "^RT", but where the "^" means the beginning of the document).

Of course this could also be done using regex operations on the text itself, pre-tokens, or using some careful application of corpus_segment().
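For instance, a minimal sketch of the pre-token regex route, assuming the retweet marker is always a leading "RT" followed by whitespace (that pattern is an assumption about the input, not a general rule):

library("quanteda")
library("stringi")

txt <- "RT @koheiw7: people should not watch RT"
# strip a leading retweet marker before tokenizing, so the trailing acronym survives
tokens(stri_replace_first_regex(txt, "^RT\\s+", ""))
## tokens from 1 document.
## text1 :
## [1] "@koheiw7" ":"        "people"   "should"   "not"      "watch"    "RT"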


koheiw commented Nov 1, 2018

To make position more general and flexible, the input should be a numeric vector specifying positions of matches.

Consider a news text with dateline and by-line.

txt <- "PARIS, Dec. 2 (AP) -- A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament."
toks <- tokens(txt, remove_punct = TRUE)
toks

# tokens from 1 document.
# text1 :
#  [1] "PARIS"      "Dec"        "2"          "AP"         "A"          "powerful"   "fighting"   "navy"       "to"         "back"      
# [11] "up"         "her"        "extensive"  "Rhineland"  "defenses"   "is"         "being"      "built"      "by"         "France"    
# [21] "in"         "reply"      "to"         "German"     "rearmament"

If I want to remove any tokens in the beginning:

tokens_select(toks, "*", selection = "remove", position = 1)

If I want to remove news agency names from among the 1st to 10th tokens:

tokens_select(toks, c("AP", "AFP"), selection = "remove", position = 1:10)
# or 
tokens_select(toks, c("AP", "AFP"), selection = "remove", position = c(1, 10)) # easier to implement

We could allow indexing from the end of the document with negative values, as in Python, but I am not sure when we would do this.

I have to do this kind of cleaning on the corpus with regex now, but I would love to do it on tokens!
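For illustration, this is the kind of regex cleaning currently needed, assuming the dateline always ends with "--" (an assumption about this particular news format):

library("stringi")
# strip everything up to and including the "--" dateline delimiter
stri_replace_first_regex(txt, "^.*?--\\s*", "")
# [1] "A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament."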


kbenoit commented Nov 1, 2018

OK, that makes sense, but just so I'm clear: here position refers to the absolute token position (or a range of them) within which the selection would take place. In my suggestion above, it was to remove the first, last, or all occurrences of a token, so it was relative rather than absolute position. Absolute position removal is more powerful but might miss some of the use cases for relative position.

I like the first approach (e.g. position = 1:10) rather than the position = c(startpos, endpos) since the former is more flexible, as it allows any vector of positions.

The absolute position method as you outline above has multiple uses:

  • removal of specific token patterns within a limited range of token positions, as demonstrated above.
  • selection of specific token patterns within a limited range of positions
  • if pattern = "*", then it acts as a tokens_substring() similar to characters in base::substring().

Worth thinking about whether we would want to split these functions into something like:

tokens_substring(mytokens, first = 1, last = 10) %>%
    tokens_select(pattern = c("AP", "AFP"))

although this works for selection but not as well for removal, unless there were some sort of replacement as well. Worth considering a separate tokens_substring() for additional functionality and as an additional route to the worker functions that would implement the proposed changes to tokens_select().

Also worth thinking about the relative position and whether or how that could also be implemented.


koheiw commented Nov 6, 2018

There is

stringi::stri_sub(str, from = 1L, to = -1L, length)

which takes a vector for from and to to allow the positions to differ between documents. This would be more useful than

tokens_select(toks, c("AP", "AFP"), selection = "remove", position = 1:10)

because a position vector works only when target elements are in the same positions across documents (taking a list of position vectors is possible but let's not do this).

tokens_select(toks, c("AP", "AFP"), selection = "remove", from = 1L, to = -1L)
# or 
tokens_select(toks, c("AP", "AFP"), selection = "remove", from = NULL, to = NULL)

tokens_select(toks, c("AP", "AFP"), selection = "remove", from = c(1, 10, 4), to = ntoken(toks))

We can use it in combination with kwic():

kw <- kwic(toks, "something")
kw <- kw[!duplicated(kw$docname),]
tokens_select(toks, c("AP", "AFP"), selection = "remove", from = kw$from, to =  kw$to)

Specifying positions by from and to is actually easier to implement and more efficient in C++ than position.
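A rough R illustration of the reasoning (both helper functions are hypothetical, for illustration only): a from/to pair needs just two comparisons per candidate position, while a position vector needs a membership lookup:

# hypothetical range check: two comparisons per candidate position
in_range <- function(pos, from, to) pos >= from & pos <= to
# hypothetical position-vector check: membership lookup over the whole vector
in_positions <- function(pos, positions) pos %in% positions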

@kbenoit kbenoit added this to the v2.0 milestone Dec 18, 2018

koheiw commented Dec 7, 2019

I was about to upgrade tokens_select(), but I am not sure it is the best function to gain from and to, because positions are unrelated to patterns. For example, if I want to extract only the first 10 tokens:

tokens_select(toks, from = 1, to = 10)

It would be better if we made a separate function like tokens_trim(). In this case, dfm_trim() selects features based on frequency, while tokens_trim() would select tokens based on absolute and relative positions. @kbenoit what do you think?


kbenoit commented Dec 8, 2019

OK, updated thoughts:

  1. trim is already different for a corpus object (see corpus_trim()), so stretching it a bit here is not a violation of the virgin perfectness of our trim function, but rather an extension of being, um, flexible about how it is used. So I consider it an okay candidate for a name and, come to think of it, better than introducing a new name such as tokens_cut(). However...

  2. This is really like a tokens_head() or tokens_tail() function - except that we want a negative selection (to remove the head or tail). Inner positions could involve a combination of these two, although this would require a very regular set of positions across documents. Does such a case exist?

  3. Maybe this can be solved using an existing function?

library("quanteda")
## Package version: 1.9.9000

txt <- c(
  "PARIS, Dec. 2 (AP) -- A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament.",
  "LONDON, Dec. 8 (Reuters) -- An amazing new application for tokens_segment() has been applied to avoid creating a new function."
)
toks <- tokens(txt)

tokens_segment(toks, phrase("- -"), pattern_position = "after") %>%
  tokens_subset(subset = !grepl("\\.1$", docnames(.)))
## tokens from 2 documents.
## text1.2 :
##  [1] "A"          "powerful"   "fighting"   "navy"       "to"        
##  [6] "back"       "up"         "her"        "extensive"  "Rhineland" 
## [11] "defenses"   "is"         "being"      "built"      "by"        
## [16] "France"     "in"         "reply"      "to"         "German"    
## [21] "rearmament" "."         
## 
## text2.2 :
##  [1] "An"             "amazing"        "new"            "application"   
##  [5] "for"            "tokens_segment" "("              ")"             
##  [9] "has"            "been"           "applied"        "to"            
## [13] "avoid"          "creating"       "a"              "new"           
## [17] "function"       "."


koheiw commented Dec 8, 2019

It is not only about taking the first or last N elements. How about this?

tokens_range(toks, from = 1, to = 10)


koheiw commented Dec 8, 2019

If it is extracting specific tokens, then

tokens_extract(toks, from = 1, to = 10)

This function can be developed in the future to support different extraction criteria (that we do not know yet).


kbenoit commented Dec 8, 2019

I think I like the variation you suggested above, where we would add two new arguments to tokens_select():

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)

(where endpos = -1L might be the default, behaving similarly to the to argument in stringi::stri_sub()), since this allows maximum flexibility for combining positional selection with pattern-based selection or removal?

Then we would not need tokens_extract(), since this is just a combination of a select-keep by pattern and position.
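For instance, with those arguments, the tokens_extract() behaviour above would simply be (startpos/endpos as proposed, not yet implemented at this point):

tokens_select(toks, "*", selection = "keep", startpos = 1, endpos = 10)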


koheiw commented Dec 8, 2019

This is a way to avoid the naming problem.

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)

But pattern = "*" is really tricky to deal with, as it requires a lot of special handling in C++ in the current design. For this reason, I prefer a separate new function. If we are going to add this to tokens_select(), it should be pattern = NULL.


kbenoit commented Dec 8, 2019

OK, how about this: without implementing the full "*" match, we can say that

  • if pattern is empty and start/endpos are specified, it selects all tokens in the range
  • if there is a pattern that is not "*" and start/endpos are specified, then it first selects the range and then selects the pattern from within the range
  • "*" is simply disallowed if start/endpos are unspecified.

That allows us to keep the new functionality in tokens_select() without getting hung up on an efficient implementation of the "*" match, which is not solved in existing code.


koheiw commented Dec 9, 2019

I think this is a good direction. window is enabled when pattern is specified, while start/endpos applies when pattern is not specified. In this case, pattern needs to be NULL, because pattern = "" is reserved for selection/removal of padding 🤔
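For comparison, a quick demo of the existing window behaviour (the example text is made up):

> toks <- tokens("This is a sentence.", remove_punct = TRUE)
> tokens_select(toks, "a", window = 1)
tokens from 1 document.
text1 :
[1] "is"       "a"        "sentence"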


kbenoit commented Dec 9, 2019

Agreed. But let's leave it without a default in the function signature, and treat it as an any-token match when start/endpos are used, with internal checks to catch this, rather than putting some more complicated conditional or pattern = NULL in the signature (which we use nowhere else). We can explain this in the Details section for tokens_select().
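A minimal sketch of that internal check, as it might look inside tokens_select() (illustrative only, not the actual implementation):

# treat a missing pattern as an any-token match when positions are given
if (missing(pattern)) {
    if (missing(startpos) && missing(endpos))
        stop("pattern must be supplied when startpos/endpos are not used")
    pattern <- "*"  # match any token within the positional range
}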


koheiw commented Dec 9, 2019

I tested with a relatively large tokens object, and confirmed that "*" does not harm anything.

> ndoc(toks)
[1] 1117011
> length(types(toks))
[1] 210288
> 
> system.time(
+     tokens_select(toks, "*")
+ )
   user  system elapsed 
  3.509   0.007   1.389 
> system.time(
+     tokens_remove(toks, "*")
+ )
   user  system elapsed 
  3.479   0.000   1.385

However, I don't want users to do

> tokens_select(toks, phrase("* *"))
 Error: cannot allocate vector of size 329.5 Gb 

because it creates 210288 ^ 2 (about 44 billion) patterns internally, which at 8 bytes each is roughly the 329.5 Gb the error reports.

As long as "*" does not encourage users to use phrase("* *"), I am happy with

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)


kbenoit commented Dec 9, 2019

Cool, then let's block the pure "*" wildcard match from phrase() and from expressions that translate into phrase-like constructions such as list(c("*", "*")), and then we can just use the wildcard syntax for the positional expressions. And it's all totally consistent!

koheiw added a commit that referenced this issue Dec 9, 2019

koheiw commented Dec 9, 2019

This is the most flexible version, with an interesting interaction between pattern and position. Basically, the positions change the area where the functions scan for patterns.

> toks <- tokens(c("This is a sentence.", "This is a second sentence."),
+                  remove_punct = TRUE)
> tokens_select(toks, c("is", "a", "this"), startpos = 2)
tokens from 2 documents.
text1 :
[1] "is" "a" 

text2 :
[1] "is" "a" 

> tokens_select(toks, c("is", "a", "this"), padding = TRUE, startpos = 2)
tokens from 2 documents.
text1 :
[1] ""   "is" "a"  ""  

text2 :
[1] ""   "is" "a"  ""   ""  

> 
> tokens_remove(toks, c("is", "a", "this"), startpos = 2)
tokens from 2 documents.
text1 :
[1] "This"     "sentence"

text2 :
[1] "This"     "second"   "sentence"

> tokens_remove(toks, c("is", "a", "this"), padding = TRUE, startpos = 2)
tokens from 2 documents.
text1 :
[1] "This"     ""         ""         "sentence"

text2 :
[1] "This"     ""         ""         "second"   "sentence"


koheiw commented Dec 9, 2019

The ending position works too:

> tokens_select(toks, "*", startpos = 1, endpos = 2)
tokens from 2 documents.
text1 :
[1] "This" "is"  

text2 :
[1] "This" "is"  

> tokens_select(toks, "*", startpos = 1, endpos = -2)
tokens from 2 documents.
text1 :
[1] "This" "is"   "a"   

text2 :
[1] "This"   "is"     "a"      "second"

@koheiw koheiw mentioned this issue Dec 10, 2019
@kbenoit kbenoit closed this as completed Dec 10, 2019
kbenoit added a commit that referenced this issue Dec 10, 2019