How to remove only the first element of documents using tokens_select() #1475

Closed
koheiw opened this issue Oct 28, 2018 · 19 comments


koheiw commented Oct 28, 2018

The first "RT" indicate that this text is a retweet but the last is an acronym (Russia Today). What should I do if I only want to remove the first "RT"?

> tokens("RT @koheiw7: people should not watch RT")
tokens from 1 document.
text1 :
[1] "RT"      "@koheiw7" ":"       "people"  "should"  "not"     "watch"   "RT"     

Probably I cannot do this using quanteda functions 😞 Restricting the positions of pattern matching is easy in the C++ code, but how should the command look?

@koheiw koheiw changed the title from "How to remove only first element of documents using tokens_select()" to "How to remove only the first element of documents using tokens_select()" Oct 28, 2018

kbenoit commented Oct 28, 2018 via email


koheiw commented Oct 28, 2018

first and last would be useful, but it can still remove the acronym, because in a text like the one below the acronym "RT" is itself the first match:

> tokens("people should not watch RT")
tokens from 1 document.
text1 :
[1] "people"  "should"  "not"     "watch"   "RT"    

stringi has stri_startswith and stri_endswith for this, so position = c("all", "start", "end")?
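For reference, a quick demo of the stringi anchored matching mentioned above, using the example text from the original post:

> stringi::stri_startswith_fixed("RT @koheiw7: people should not watch RT", "RT")
[1] TRUE
> stringi::stri_endswith_fixed("RT @koheiw7: people should not watch RT", "RT")
[1] TRUE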


kbenoit commented Oct 29, 2018

That's true, so what we would need then is some rule to remove RT based either on its absolute position in the tokens object, or on some form of regex-type identifier for the sequence (like "^RT", but where the "^" means the beginning of the document).

Of course this could also be done using regex operations on the text itself, pre-tokens, or using some careful application of corpus_segment().
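For instance, a minimal sketch of the pre-token regex route, assuming the retweet marker is always a leading "RT" followed by whitespace (that pattern is an assumption about the input, not a general rule):

library("quanteda")
library("stringi")

txt <- "RT @koheiw7: people should not watch RT"
# strip a leading retweet marker before tokenizing, so the trailing acronym survives
tokens(stri_replace_first_regex(txt, "^RT\\s+", ""))
## tokens from 1 document.
## text1 :
## [1] "@koheiw7" ":"        "people"   "should"   "not"      "watch"    "RT"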


koheiw commented Nov 1, 2018

To make position more general and flexible, the input should be a numeric vector specifying positions of matches.

Consider a news text with dateline and by-line.

txt <- "PARIS, Dec. 2 (AP) -- A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament."
toks <- tokens(txt, remove_punct = TRUE)
toks

# tokens from 1 document.
# text1 :
#  [1] "PARIS"      "Dec"        "2"          "AP"         "A"          "powerful"   "fighting"   "navy"       "to"         "back"      
# [11] "up"         "her"        "extensive"  "Rhineland"  "defenses"   "is"         "being"      "built"      "by"         "France"    
# [21] "in"         "reply"      "to"         "German"     "rearmament"

If I want to remove any tokens in the beginning:

tokens_select(toks, "*", selection = "remove", position = 1)

If I want to remove news agency names from among the 1st to 10th tokens:

tokens_select(toks, c("AP", "AFP"), selection = "remove", position = 1:10)
# or 
tokens_select(toks, c("AP", "AFP"), selection = "remove", position = c(1, 10)) # easier to implement

We could allow indexing from the end of the document with negative values, as in Python, but I am not sure when we would do this.

I have to do this kind of cleaning on the corpus with regex now, but I would love to do it on tokens!
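For illustration, this is the kind of regex cleaning currently needed, assuming the dateline always ends with "--" (an assumption about this particular news format):

library("stringi")
# strip everything up to and including the "--" dateline delimiter
stri_replace_first_regex(txt, "^.*?--\\s*", "")
# [1] "A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament."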


kbenoit commented Nov 1, 2018

OK, that makes sense, but just so I'm clear: here position refers to the absolute token position (or a range of them) within which the selection would take place. In my suggestion above, it was to remove the first, last, or all occurrences of a token, so it was relative rather than absolute position. Absolute position removal is more powerful but might miss some of the use cases for relative position.

I like the first approach (e.g. position = 1:10) rather than the position = c(startpos, endpos) since the former is more flexible, as it allows any vector of positions.

The absolute position method as you outline above has multiple uses:

  • removal of specific token patterns within a limited range of token positions, as demonstrated above.
  • selection of specific token patterns within a limited range of positions
  • if pattern = "*", then it acts as a tokens_substring() similar to characters in base::substring().

Worth thinking about whether we would want to split these functions into something like:

tokens_substring(mytokens, first = 1, last = 10) %>%
    tokens_select(pattern = c("AP", "AFP"))

although this works for selection but not as well for removal, unless there were some sort of replacement as well. Worth considering a separate tokens_substring() for additional functionality and as an additional route to the worker functions that would implement the proposed changes to tokens_select().

Also worth thinking about the relative position and whether or how that could also be implemented.


koheiw commented Nov 6, 2018

There is

stringi::stri_sub(str, from = 1L, to = -1L, length)

which takes a vector for from and to to allow the positions to differ between documents. This would be more useful than

tokens_select(toks, c("AP", "AFP"), selection = "remove", position = 1:10)

because a position vector works only when target elements are in the same positions across documents (taking a list of position vectors is possible but let's not do this).

tokens_select(toks, c("AP", "AFP"), selection = "remove", from = 1L, to = -1L)
# or 
tokens_select(toks, c("AP", "AFP"), selection = "remove", from = NULL, to = NULL)

tokens_select(toks, c("AP", "AFP"), selection = "remove", from = c(1, 10, 4), to = ntoken(toks))

We can use it in combination with kwic():

kw <- kwic(toks, "something")
kw <- kw[!duplicated(kw$docname),]
tokens_select(toks, c("AP", "AFP"), selection = "remove", from = kw$from, to =  kw$to)

Specifying positions by from and to is actually easier to implement and more efficient in C++ than position.
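A rough R illustration of the reasoning (both helper functions are hypothetical, for illustration only): a from/to pair needs just two comparisons per candidate position, while a position vector needs a membership lookup:

# hypothetical range check: two comparisons per candidate position
in_range <- function(pos, from, to) pos >= from & pos <= to
# hypothetical position-vector check: membership lookup over the whole vector
in_positions <- function(pos, positions) pos %in% positions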

@kbenoit kbenoit added this to the v2.0 milestone Dec 18, 2018

koheiw commented Dec 7, 2019

I was about to upgrade tokens_select(), but I am not sure it is the best function to gain from and to, because positions are unrelated to patterns. For example, if I want to extract only the first 10 tokens:

tokens_select(toks, from = 1, to = 10)

It would be better if we made a separate function like tokens_trim(). In this case, dfm_trim() selects features based on frequency, while tokens_trim() would select tokens based on absolute and relative positions. @kbenoit what do you think?


kbenoit commented Dec 8, 2019

OK, updated thoughts:

  1. trim is already different for a corpus object (see corpus_trim()), so stretching it a bit here is not a violation of the virgin perfectness of our trim function, but rather an extension of being, um, flexible about how it is used. So I consider it an okay candidate for a name and, come to think of it, better than introducing a new name such as tokens_cut(). However...

  2. This is really like a tokens_head() or tokens_tail() function - except that we want a negative selection (to remove the head or tail). Inner positions could involve a combination of these two, although this would require a very regular set of positions across documents. Does such a case exist?

  3. Maybe this can be solved using an existing function?

library("quanteda")
## Package version: 1.9.9000

txt <- c(
  "PARIS, Dec. 2 (AP) -- A powerful fighting navy to back up her extensive Rhineland defenses is being built by France in reply to German rearmament.",
  "LONDON, Dec. 8 (Reuters) -- An amazing new application for tokens_segment() has been applied to avoid creating a new function."
)
toks <- tokens(txt)

tokens_segment(toks, phrase("- -"), pattern_position = "after") %>%
  tokens_subset(subset = !grepl("\\.1$", docnames(.)))
## tokens from 2 documents.
## text1.2 :
##  [1] "A"          "powerful"   "fighting"   "navy"       "to"        
##  [6] "back"       "up"         "her"        "extensive"  "Rhineland" 
## [11] "defenses"   "is"         "being"      "built"      "by"        
## [16] "France"     "in"         "reply"      "to"         "German"    
## [21] "rearmament" "."         
## 
## text2.2 :
##  [1] "An"             "amazing"        "new"            "application"   
##  [5] "for"            "tokens_segment" "("              ")"             
##  [9] "has"            "been"           "applied"        "to"            
## [13] "avoid"          "creating"       "a"              "new"           
## [17] "function"       "."


koheiw commented Dec 8, 2019

It is not only about taking the first or last N elements. How about this?

tokens_range(toks, from = 1, to = 10)


koheiw commented Dec 8, 2019

If it is extracting specific tokens, then

tokens_extract(toks, from = 1, to = 10)

This function can be developed in the future to support different extraction criteria (that we do not know yet).


kbenoit commented Dec 8, 2019

I think I like the variation you suggested above, where we would add two new arguments to tokens_select():

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)

(where endpos = -1L might be the default, behaving similarly to the to argument in stringi::stri_sub()), since this allows maximum flexibility for combining positional selection with pattern-based selection or removal?

Then we would not need tokens_extract(), since this is just a combination of a select-keep by pattern and position.
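For instance, with those arguments, the tokens_extract() behaviour above would simply be (startpos/endpos as proposed, not yet implemented at this point):

tokens_select(toks, "*", selection = "keep", startpos = 1, endpos = 10)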


koheiw commented Dec 8, 2019

This is a way to avoid the naming problem.

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)

But pattern = "*" is really tricky to deal with, as it requires a lot of special handling in C++ in the current design. For this reason, I prefer a separate new function. If we are going to add this to tokens_select(), it should be pattern = NULL.


kbenoit commented Dec 8, 2019

OK, how about this: without implementing the full "*" match, we can say that

  • if pattern is empty and start/endpos are specified, it selects all tokens in the range
  • if there is a pattern that is not "*" and start/endpos are specified, then it first selects the range and then selects the pattern from within the range
  • "*" is simply disallowed if start/endpos are unspecified.

That allows us to keep the new functionality in tokens_select() without getting hung up on an efficient implementation of the "*" match, which is not solved in existing code.


koheiw commented Dec 9, 2019

I think this is a good direction. window is enabled when pattern is specified, while start/endpos applies when pattern is not specified. In this case, pattern needs to be NULL, because pattern = "" is reserved for selection/removal of padding 🤔
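For comparison, a quick demo of the existing window behaviour (the example text is made up):

> toks <- tokens("This is a sentence.", remove_punct = TRUE)
> tokens_select(toks, "a", window = 1)
tokens from 1 document.
text1 :
[1] "is"       "a"        "sentence"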


kbenoit commented Dec 9, 2019

Agreed. But let's leave it without a default in the function signature, and treat it as an any-token match when start/endpos are used, with internal checks to catch this, rather than putting some more complicated conditional or pattern = NULL in the signature (which we use nowhere else). We can explain this in the Details section for tokens_select().
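A minimal sketch of that internal check, as it might look inside tokens_select() (illustrative only, not the actual implementation):

# treat a missing pattern as an any-token match when positions are given
if (missing(pattern)) {
    if (missing(startpos) && missing(endpos))
        stop("pattern must be supplied when startpos/endpos are not used")
    pattern <- "*"  # match any token within the positional range
}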


koheiw commented Dec 9, 2019

I tested with a relatively large tokens object, and confirmed that "*" does not harm anything.

> ndoc(toks)
[1] 1117011
> length(types(toks))
[1] 210288
> 
> system.time(
+     tokens_select(toks, "*")
+ )
   user  system elapsed 
  3.509   0.007   1.389 
> system.time(
+     tokens_remove(toks, "*")
+ )
   user  system elapsed 
  3.479   0.000   1.385

However, I don't want users to do

> tokens_select(toks, phrase("* *"))
 Error: cannot allocate vector of size 329.5 Gb 

because it creates 210288 ^ 2 (about 44 billion) patterns internally, which at 8 bytes each is roughly the 329.5 Gb the error reports.

As long as "*" does not encourage users to use phrase("* *"), I am happy with

tokens_select(toks, "*", selection = "remove", startpos = 1, endpos = 10)


kbenoit commented Dec 9, 2019

Cool, then let's block the pure "*" wildcard match from phrase() and from expressions that translate into phrase-like constructions such as list(c("*", "*")), and then we can just use the wildcard syntax for the positional expressions. And it's all totally consistent!

koheiw added a commit that referenced this issue Dec 9, 2019

koheiw commented Dec 9, 2019

This is the most flexible version, with an interesting interaction between pattern and position. Basically, the positions change the area where the functions scan for patterns.

> toks <- tokens(c("This is a sentence.", "This is a second sentence."),
+                  remove_punct = TRUE)
> tokens_select(toks, c("is", "a", "this"), startpos = 2)
tokens from 2 documents.
text1 :
[1] "is" "a" 

text2 :
[1] "is" "a" 

> tokens_select(toks, c("is", "a", "this"), padding = TRUE, startpos = 2)
tokens from 2 documents.
text1 :
[1] ""   "is" "a"  ""  

text2 :
[1] ""   "is" "a"  ""   ""  

> 
> tokens_remove(toks, c("is", "a", "this"), startpos = 2)
tokens from 2 documents.
text1 :
[1] "This"     "sentence"

text2 :
[1] "This"     "second"   "sentence"

> tokens_remove(toks, c("is", "a", "this"), padding = TRUE, startpos = 2)
tokens from 2 documents.
text1 :
[1] "This"     ""         ""         "sentence"

text2 :
[1] "This"     ""         ""         "second"   "sentence"


koheiw commented Dec 9, 2019

The ending position works too:

> tokens_select(toks, "*", startpos = 1, endpos = 2)
tokens from 2 documents.
text1 :
[1] "This" "is"  

text2 :
[1] "This" "is"  

> tokens_select(toks, "*", startpos = 1, endpos = -2)
tokens from 2 documents.
text1 :
[1] "This" "is"   "a"   

text2 :
[1] "This"   "is"     "a"      "second"

@koheiw koheiw mentioned this issue Dec 10, 2019
@kbenoit kbenoit closed this as completed Dec 10, 2019
kbenoit added a commit that referenced this issue Dec 10, 2019