textstat_keyness should raise error if target argument is the whole corpus #1482

jiongweilua · 2018-11-07T16:40:08Z

Describe the bug

The textstat_keyness function can raise an error to warn the user if the target argument entered by the user is all the documents in the dfm. Alternatively, we can have a bug handler that defaults the target to the first document in the dfm.

Reproducible code

> library(quanteda)
> dfm_inaugural <- dfm(data_corpus_inaugural)
> keyness_inaugural <- quanteda::textstat_keyness(dfm_inaugural)
> keyness_inaugural %>% head()
       feature     chi2            p n_target
1    immutable 70.48924 0.000000e+00        2
2  impressions 70.48924 0.000000e+00        2
3 providential 70.48924 0.000000e+00        2
4        which 64.39580 9.992007e-16       36
5           my 56.69373 5.095924e-14       22
6         your 41.69436 1.067163e-10        9
  n_reference
1           1
2           1
3           1
4         970
5         473
6         114
> quanteda::textstat_keyness(dfm_inaugural, target = docnames(dfm_inaugural))
Error in x[2, ] : Subscript out of bounds

To test out what was happening under the hood, I ran the parts of the code necessary to generate the groupings and temp in the texstat_keyness function and printed some lines:

shortened_textstat(dfm_inaugural, target = docnames(dfm_inaugural))
[1] "The target input into the function is "
 [1] "1789-Washington" "1793-Washington"
 [3] "1797-Adams"      "1801-Jefferson" 
 [5] "1805-Jefferson"  "1809-Madison"   
 [7] "1813-Madison"    "1817-Monroe"    
 [9] "1821-Monroe"     "1825-Adams"     
[11] "1829-Jackson"    "1833-Jackson"   
[13] "1837-VanBuren"   "1841-Harrison"  
[15] "1845-Polk"       "1849-Taylor"    
[17] "1853-Pierce"     "1857-Buchanan"  
[19] "1861-Lincoln"    "1865-Lincoln"   
[21] "1869-Grant"      "1873-Grant"     
[23] "1877-Hayes"      "1881-Garfield"  
[25] "1885-Cleveland"  "1889-Harrison"  
[27] "1893-Cleveland"  "1897-McKinley"  
[29] "1901-McKinley"   "1905-Roosevelt" 
[31] "1909-Taft"       "1913-Wilson"    
[33] "1917-Wilson"     "1921-Harding"   
[35] "1925-Coolidge"   "1929-Hoover"    
[37] "1933-Roosevelt"  "1937-Roosevelt" 
[39] "1941-Roosevelt"  "1945-Roosevelt" 
[41] "1949-Truman"     "1953-Eisenhower"
[43] "1957-Eisenhower" "1961-Kennedy"   
[45] "1965-Johnson"    "1969-Nixon"     
[47] "1973-Nixon"      "1977-Carter"    
[49] "1981-Reagan"     "1985-Reagan"    
[51] "1989-Bush"       "1993-Clinton"   
[53] "1997-Clinton"    "2001-Bush"      
[55] "2005-Bush"       "2009-Obama"     
[57] "2013-Obama"      "2017-Trump"     
[1] "The boolean version of the target is"
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[10] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[28] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[55] TRUE TRUE TRUE TRUE
[1] "The Grouping is"
 [1] target target target target target target
 [7] target target target target target target
[13] target target target target target target
[19] target target target target target target
[25] target target target target target target
[31] target target target target target target
[37] target target target target target target
[43] target target target target target target
[49] target target target target target target
[55] target target target target
Levels: target reference
Document-feature matrix of: 1 document, 9,357 features (0% sparse).

Potential Fixes

There could possibly be a line that defaults the target document to the first document.

# use original docnames only when there are two documents
  if (ndoc(x) == 2) {
    label <- docnames(x)[order(target, decreasing = TRUE)]
  } else {
    if (sum(target) == 1 && !is.null(docnames(x)[target])) {
      label <- c(docnames(x)[target], "reference")
    } else {
      if (sum(target) == length(target)) {
        target[2:length(target)] = FALSE
        target[1] = TRUE
      }
      label <- c("target", "reference")
    }
  }

Or we could raise an error

# use original docnames only when there are two documents
  if (ndoc(x) == 2) {
    label <- docnames(x)[order(target, decreasing = TRUE)]
  } else {
    if (sum(target) == 1 && !is.null(docnames(x)[target])) {
      label <- c(docnames(x)[target], "reference")
    } else {
      if (sum(target) == length(target)) {
        stop('Target documents cannot be the whole corpus')
      }
      label <- c("target", "reference")
    }
  }

The text was updated successfully, but these errors were encountered:

kbenoit · 2018-11-08T12:46:51Z

Thanks @jiongweilua, this is a good find. How about I let you try fixing it as a PR. Here's the steps.

Fork the master.
Create a branch for the fix.
Add this to test-textstat_keyness.R:

# works for one document
textstat_keyness(d, target = docnames(d)[1])

library("testthat")

test_that("keyness works correctly for default, single, and multiple targets") {
    d <- corpus(c(d1 = "a a a a a b b b c c c c ",
          d2 = "a b c d d d d e f g",
          d3 = "a b c d d e e e e f g")) %>%
        dfm()

    # default target is first document
    expect_identical(
        textstat_keyness(d),
        textstat_keyness(d, target = docnames(d)[1])
    )
    
    # for explicit first target
    expect_identical(
        as.integer(textstat_keyness(d, target = docnames(d)[1])[1, "n_target"]),
        5L
    )
    
    # for two documents as targets
    expect_equivalent(
        textstat_keyness(d, target = docnames(d)[1:2])[1, c("n_target", "n_reference")],
        data.frame(n_target = 6, n_reference = 1)
    )
    
    # for all documents as targets
    expect_error(
        textstat_keyness(d, target = docnames(d)[1:3]),
        "number of target documents must be < ndoc"
    )
}

Implement an error check for the condition in textstat_keyness.dfm(), and return the error through stop("number of target documents must be < ndoc"), so that new, last test passes.
Issue a pull request.

Happy to help if you need it!

jiongweilua mentioned this issue Nov 10, 2018

Checks for textstat_keyness working correctly on default, single and multiple targets. #1486

Merged

kbenoit closed this as completed in #1486 Nov 12, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

textstat_keyness should raise error if target argument is the whole corpus #1482

textstat_keyness should raise error if target argument is the whole corpus #1482

jiongweilua commented Nov 7, 2018

kbenoit commented Nov 8, 2018 •

edited

textstat_keyness should raise error if target argument is the whole corpus #1482

textstat_keyness should raise error if target argument is the whole corpus #1482

Comments

jiongweilua commented Nov 7, 2018

Describe the bug

Reproducible code

Potential Fixes

kbenoit commented Nov 8, 2018 • edited

kbenoit commented Nov 8, 2018 •

edited