Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

textstat_keyness should raise error if target argument is the whole corpus #1482

Closed
jiongweilua opened this issue Nov 7, 2018 · 1 comment

Comments

@jiongweilua
Copy link
Collaborator

Describe the bug

The textstat_keyness function can raise an error to warn the user if the target argument entered by the user is all the documents in the dfm. Alternatively, we can have a bug handler that defaults the target to the first document in the dfm.

Reproducible code

> library(quanteda)
> dfm_inaugural <- dfm(data_corpus_inaugural)
> keyness_inaugural <- quanteda::textstat_keyness(dfm_inaugural)
> keyness_inaugural %>% head()
       feature     chi2            p n_target
1    immutable 70.48924 0.000000e+00        2
2  impressions 70.48924 0.000000e+00        2
3 providential 70.48924 0.000000e+00        2
4        which 64.39580 9.992007e-16       36
5           my 56.69373 5.095924e-14       22
6         your 41.69436 1.067163e-10        9
  n_reference
1           1
2           1
3           1
4         970
5         473
6         114
> quanteda::textstat_keyness(dfm_inaugural, target = docnames(dfm_inaugural))
Error in x[2, ] : Subscript out of bounds

To test out what was happening under the hood, I ran the parts of the code necessary to generate the groupings and temp in the texstat_keyness function and printed some lines:

shortened_textstat(dfm_inaugural, target = docnames(dfm_inaugural))
[1] "The target input into the function is "
 [1] "1789-Washington" "1793-Washington"
 [3] "1797-Adams"      "1801-Jefferson" 
 [5] "1805-Jefferson"  "1809-Madison"   
 [7] "1813-Madison"    "1817-Monroe"    
 [9] "1821-Monroe"     "1825-Adams"     
[11] "1829-Jackson"    "1833-Jackson"   
[13] "1837-VanBuren"   "1841-Harrison"  
[15] "1845-Polk"       "1849-Taylor"    
[17] "1853-Pierce"     "1857-Buchanan"  
[19] "1861-Lincoln"    "1865-Lincoln"   
[21] "1869-Grant"      "1873-Grant"     
[23] "1877-Hayes"      "1881-Garfield"  
[25] "1885-Cleveland"  "1889-Harrison"  
[27] "1893-Cleveland"  "1897-McKinley"  
[29] "1901-McKinley"   "1905-Roosevelt" 
[31] "1909-Taft"       "1913-Wilson"    
[33] "1917-Wilson"     "1921-Harding"   
[35] "1925-Coolidge"   "1929-Hoover"    
[37] "1933-Roosevelt"  "1937-Roosevelt" 
[39] "1941-Roosevelt"  "1945-Roosevelt" 
[41] "1949-Truman"     "1953-Eisenhower"
[43] "1957-Eisenhower" "1961-Kennedy"   
[45] "1965-Johnson"    "1969-Nixon"     
[47] "1973-Nixon"      "1977-Carter"    
[49] "1981-Reagan"     "1985-Reagan"    
[51] "1989-Bush"       "1993-Clinton"   
[53] "1997-Clinton"    "2001-Bush"      
[55] "2005-Bush"       "2009-Obama"     
[57] "2013-Obama"      "2017-Trump"     
[1] "The boolean version of the target is"
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[10] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[28] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[37] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[55] TRUE TRUE TRUE TRUE
[1] "The Grouping is"
 [1] target target target target target target
 [7] target target target target target target
[13] target target target target target target
[19] target target target target target target
[25] target target target target target target
[31] target target target target target target
[37] target target target target target target
[43] target target target target target target
[49] target target target target target target
[55] target target target target
Levels: target reference
Document-feature matrix of: 1 document, 9,357 features (0% sparse).

Potential Fixes

There could possibly be a line that defaults the target document to the first document.

# use original docnames only when there are two documents
  if (ndoc(x) == 2) {
    label <- docnames(x)[order(target, decreasing = TRUE)]
  } else {
    if (sum(target) == 1 && !is.null(docnames(x)[target])) {
      label <- c(docnames(x)[target], "reference")
    } else {
      if (sum(target) == length(target)) {
        target[2:length(target)] = FALSE
        target[1] = TRUE
      }
      label <- c("target", "reference")
    }
  }

Or we could raise an error

# use original docnames only when there are two documents
  if (ndoc(x) == 2) {
    label <- docnames(x)[order(target, decreasing = TRUE)]
  } else {
    if (sum(target) == 1 && !is.null(docnames(x)[target])) {
      label <- c(docnames(x)[target], "reference")
    } else {
      if (sum(target) == length(target)) {
        stop('Target documents cannot be the whole corpus')
      }
      label <- c("target", "reference")
    }
  }
@kbenoit
Copy link
Collaborator

kbenoit commented Nov 8, 2018

Thanks @jiongweilua, this is a good find. How about I let you try fixing it as a PR. Here's the steps.

  1. Fork the master.
  2. Create a branch for the fix.
  3. Add this to test-textstat_keyness.R:
# works for one document
textstat_keyness(d, target = docnames(d)[1])

library("testthat")

test_that("keyness works correctly for default, single, and multiple targets") {
    d <- corpus(c(d1 = "a a a a a b b b c c c c ",
          d2 = "a b c d d d d e f g",
          d3 = "a b c d d e e e e f g")) %>%
        dfm()

    # default target is first document
    expect_identical(
        textstat_keyness(d),
        textstat_keyness(d, target = docnames(d)[1])
    )
    
    # for explicit first target
    expect_identical(
        as.integer(textstat_keyness(d, target = docnames(d)[1])[1, "n_target"]),
        5L
    )
    
    # for two documents as targets
    expect_equivalent(
        textstat_keyness(d, target = docnames(d)[1:2])[1, c("n_target", "n_reference")],
        data.frame(n_target = 6, n_reference = 1)
    )
    
    # for all documents as targets
    expect_error(
        textstat_keyness(d, target = docnames(d)[1:3]),
        "number of target documents must be < ndoc"
    )
}
  1. Implement an error check for the condition in textstat_keyness.dfm(), and return the error through stop("number of target documents must be < ndoc"), so that new, last test passes.
  2. Issue a pull request.

Happy to help if you need it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants