How to left join docvars with those in an existing corpus #7

Open
conjugateprior opened this issue Feb 27, 2017 · 17 comments

@conjugateprior

This works

> str(inaugCorpus) # but deprecated
List of 4
 $ documents:'data.frame':	58 obs. of  1 variable:
  ..$ texts: chr [1:58] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the o"| __truncated__ "When it was first perceived, in early times, that no middle course for America remained between unlimited submission to a forei"| __truncated__ "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the first executive office of our country, I avail mysel"| __truncated__ ...
 $ metadata :'data.frame':	58 obs. of  1 variable:
  ..$ Year: num [1:58] 1789 1793 1797 1801 1805 ...
 $ settings :'data.frame':	58 obs. of  1 variable:
  ..$ President: chr [1:58] "Washington" "Washington" "Adams" "Jefferson" ...
 $ tokens   :'data.frame':	58 obs. of  1 variable:
  ..$ FirstName: chr [1:58] "George" "George" "John" "Thomas" ...
 - attr(*, "class")= chr [1:2] "corpus" "list"

but this doesn't

> str(corpus(data_char_inaugural))
Error in `[[.corpus`(object, 1L) : 
  cannot index docvars this way because none exist

apparently because there are no docvars

> str(corpus(data_char_inaugural, docvars = docvars(inaugCorpus)))
List of 4
 $ documents:'data.frame':	58 obs. of  1 variable:
  ..$ texts: chr [1:58] "Fellow-Citizens of the Senate and of the House of Representatives:\n\nAmong the vicissitudes incident to life no event could ha"| __truncated__ "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the o"| __truncated__ "When it was first perceived, in early times, that no middle course for America remained between unlimited submission to a forei"| __truncated__ "Friends and Fellow Citizens:\n\nCalled upon to undertake the duties of the first executive office of our country, I avail mysel"| __truncated__ ...
 $ metadata :'data.frame':	58 obs. of  1 variable:
  ..$ Year: num [1:58] 1789 1793 1797 1801 1805 ...
 $ settings :'data.frame':	58 obs. of  1 variable:
  ..$ President: chr [1:58] "Washington" "Washington" "Adams" "Jefferson" ...
 $ tokens   :'data.frame':	58 obs. of  1 variable:
  ..$ FirstName: chr [1:58] "George" "George" "John" "Thomas" ...
 - attr(*, "class")= chr [1:2] "corpus" "list"

Seems like it should be possible to make a docvar-free corpus though.

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readtext_0.2.9000 quanteda_0.9.9-24

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9         lattice_0.20-34     deldir_0.1-12      
 [4] png_0.1-7           class_7.3-14        gtools_3.5.0       
 [7] digest_0.6.12       foreach_1.4.3       V8_1.2             
[10] R6_2.2.0            plyr_1.8.4          tmap_1.8-1         
[13] stats4_3.3.2        coda_0.19-1         e1071_1.6-8        
[16] httr_1.2.1          spdep_0.6-9         curl_2.3           
[19] data.table_1.10.0   gdata_2.17.0        geosphere_1.5-5    
[22] raster_2.5-8        gmodels_2.16.2      R.utils_2.5.0      
[25] R.oo_1.21.0         Matrix_1.2-7.1      splines_3.3.2      
[28] webshot_0.4.0       rgdal_1.2-5         htmlwidgets_0.8    
[31] RCurl_1.95-4.8      munsell_0.4.3       rmapshaper_0.1.0   
[34] tmaptools_1.2       rgeos_0.3-22        htmltools_0.3.5    
[37] codetools_0.2-15    mapview_1.2.0       XML_3.98-1.5       
[40] viridisLite_0.1.3   MASS_7.3-45         bitops_1.0-6       
[43] R.methodsS3_1.7.1   grid_3.3.2          nlme_3.1-128       
[46] jsonlite_1.2        satellite_0.2.0     magrittr_1.5       
[49] scales_0.4.1        RcppParallel_4.3.20 KernSmooth_2.23-15 
[52] stringi_1.1.2       LearnBayes_2.15     leaflet_1.0.1      
[55] sp_1.2-4            ca_0.64             latticeExtra_0.6-28
[58] boot_1.3-18         fastmatch_1.1-0     osmar_1.1-7        
[61] RColorBrewer_1.1-2  iterators_1.0.8     tools_3.3.2        
[64] gdalUtils_2.0.1.7   dichromat_2.0-0     colorspace_1.3-2   
[67] classInt_0.1-23    
@conjugateprior conjugateprior changed the title str does not work for for corpus objects without docvars str does not work for for corpus objects without docvars Feb 27, 2017
@kbenoit
Contributor

kbenoit commented Feb 27, 2017

Thanks. More generally (and basically):

str(corpus("this is my single document"))
## Error in `[[.corpus`(object, 1L) : 
##  cannot index docvars this way because none exist 

@kbenoit
Contributor

kbenoit commented Feb 27, 2017

But keep in mind this, from ?corpus:

A warning on accessing corpus elements

A corpus currently consists of an S3 specially classed list of elements, but you should not access these elements directly. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).

😉

@kbenoit
Contributor

kbenoit commented Feb 27, 2017

@conjugateprior Refresh with the latest GitHub version and try it now.

@conjugateprior
Author

conjugateprior commented Feb 28, 2017

FYI, I was calling str in the first place so that I could sketch out a corpus_merge_docvars function, something I have now needed several times and hacked around. (Something like tmaptools::append_data.) If you're planning such a function, let me know and I won't duplicate the work.

In the meantime I'll wait until the innards settle down.

@kbenoit
Contributor

kbenoit commented Feb 28, 2017

Have you seen the + and c methods for the corpus class? Might be what you are after.

corpus1 <- corpus_subset(data_corpus_inaugural, President == "Bush")
corpus2 <- corpus_subset(data_corpus_inaugural, President == "Clinton")
docvars(corpus2, "newvar") <- "Added to Clinton"
corpus3 <- corpus_subset(data_corpus_inaugural, President == "Obama")
docvars(corpus3, "newvar") <- "Added to Obama"

docvars(c(corpus1, corpus2, corpus3))
##              Year President FirstName           newvar
## 1989-Bush    1989      Bush    George             <NA>
## 2001-Bush    2001      Bush George W.             <NA>
## 2005-Bush    2005      Bush George W.             <NA>
## 1993-Clinton 1993   Clinton      Bill Added to Clinton
## 1997-Clinton 1997   Clinton      Bill Added to Clinton
## 2009-Obama   2009     Obama    Barack   Added to Obama
## 2013-Obama   2013     Obama    Barack   Added to Obama

docvars(corpus2 + corpus3)
##              Year President FirstName           newvar
## 1993-Clinton 1993   Clinton      Bill Added to Clinton
## 1997-Clinton 1997   Clinton      Bill Added to Clinton
## 2009-Obama   2009     Obama    Barack   Added to Obama
## 2013-Obama   2013     Obama    Barack   Added to Obama

If not, consider a PR that operates using accessor functions (try methods(class = "corpus") for a list), or just describe what you are looking for and we could add it.

@conjugateprior
Author

Definitely not + or c.

As with the SpatialPolygonsDataFrame function I linked to above, the idea is to take hand-constructed document metadata in a data.frame, which may be incomplete or overcomplete, and (left) join it to a corpus object via a key that is a corpus docvar on the left side and a regular data.frame column on the right side.

Currently it seems one must have the external metadata go in column by column and hope it lines up with the exact ordering of documents in the corpus. This has bitten me several times already. Hence the desire for a merge-like function rather than a cbind-like function to do that.

@kbenoit
Contributor

kbenoit commented Feb 28, 2017

Well, we could modify + for signature corpus, data.frame so that it performs a left join automatically based on the docname as a key. But first let me make sure I have understood.

You want the following:

> docvars(data_corpus_irishbudget2010)
                                      year debate number      foren     name party
2010_BUDGET_01_Brian_Lenihan_FF       2010 BUDGET     01      Brian  Lenihan    FF
2010_BUDGET_02_Richard_Bruton_FG      2010 BUDGET     02    Richard   Bruton    FG
2010_BUDGET_03_Joan_Burton_LAB        2010 BUDGET     03       Joan   Burton   LAB
2010_BUDGET_04_Arthur_Morgan_SF       2010 BUDGET     04     Arthur   Morgan    SF
2010_BUDGET_05_Brian_Cowen_FF         2010 BUDGET     05      Brian    Cowen    FF
2010_BUDGET_06_Enda_Kenny_FG          2010 BUDGET     06       Enda    Kenny    FG
2010_BUDGET_07_Kieran_ODonnell_FG     2010 BUDGET     07     Kieran ODonnell    FG
2010_BUDGET_08_Eamon_Gilmore_LAB      2010 BUDGET     08      Eamon  Gilmore   LAB
2010_BUDGET_09_Michael_Higgins_LAB    2010 BUDGET     09    Michael  Higgins   LAB
2010_BUDGET_10_Ruairi_Quinn_LAB       2010 BUDGET     10     Ruairi    Quinn   LAB
2010_BUDGET_11_John_Gormley_Green     2010 BUDGET     11       John  Gormley Green
2010_BUDGET_12_Eamon_Ryan_Green       2010 BUDGET     12      Eamon     Ryan Green
2010_BUDGET_13_Ciaran_Cuffe_Green     2010 BUDGET     13     Ciaran    Cuffe Green
2010_BUDGET_14_Caoimhghin_OCaolain_SF 2010 BUDGET     14 Caoimhghin OCaolain    SF

> (df_tomerge <- data.frame(minister = c(1, 1), row.names = c("2010_BUDGET_01_Brian_Lenihan_FF", "2010_BUDGET_11_John_Gormley_Green")))
                                  minister
2010_BUDGET_01_Brian_Lenihan_FF          1
2010_BUDGET_11_John_Gormley_Green        1

## MERGE COMMAND

## RESULT:
                                      year debate number      foren     name party minister
2010_BUDGET_01_Brian_Lenihan_FF       2010 BUDGET     01      Brian  Lenihan    FF        1
2010_BUDGET_02_Richard_Bruton_FG      2010 BUDGET     02    Richard   Bruton    FG       NA
2010_BUDGET_03_Joan_Burton_LAB        2010 BUDGET     03       Joan   Burton   LAB       NA
2010_BUDGET_04_Arthur_Morgan_SF       2010 BUDGET     04     Arthur   Morgan    SF       NA
2010_BUDGET_05_Brian_Cowen_FF         2010 BUDGET     05      Brian    Cowen    FF       NA
2010_BUDGET_06_Enda_Kenny_FG          2010 BUDGET     06       Enda    Kenny    FG       NA
2010_BUDGET_07_Kieran_ODonnell_FG     2010 BUDGET     07     Kieran ODonnell    FG       NA
2010_BUDGET_08_Eamon_Gilmore_LAB      2010 BUDGET     08      Eamon  Gilmore   LAB       NA
2010_BUDGET_09_Michael_Higgins_LAB    2010 BUDGET     09    Michael  Higgins   LAB       NA
2010_BUDGET_10_Ruairi_Quinn_LAB       2010 BUDGET     10     Ruairi    Quinn   LAB       NA
2010_BUDGET_11_John_Gormley_Green     2010 BUDGET     11       John  Gormley Green        1
2010_BUDGET_12_Eamon_Ryan_Green       2010 BUDGET     12      Eamon     Ryan Green       NA
2010_BUDGET_13_Ciaran_Cuffe_Green     2010 BUDGET     13     Ciaran    Cuffe Green       NA
2010_BUDGET_14_Caoimhghin_OCaolain_SF 2010 BUDGET     14 Caoimhghin OCaolain    SF       NA
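
For illustration only, the MERGE COMMAND placeholder above could be sketched in base R roughly as follows. This is a minimal sketch, not a quanteda function: left_join_by_rownames is a hypothetical helper, and dv is a plain data.frame standing in for the docvars, with row names playing the role of docnames (first three documents only).

```r
# Stand-in for the docvars: a plain data.frame whose row names act as docnames.
dv <- data.frame(
  year  = c(2010, 2010, 2010),
  party = c("FF", "FG", "LAB"),
  row.names = c("2010_BUDGET_01_Brian_Lenihan_FF",
                "2010_BUDGET_02_Richard_Bruton_FG",
                "2010_BUDGET_03_Joan_Burton_LAB"),
  stringsAsFactors = FALSE
)

# External metadata covering only one of the documents.
df_tomerge <- data.frame(
  minister = 1,
  row.names = "2010_BUDGET_01_Brian_Lenihan_FF"
)

# Hypothetical helper: left join `new` onto `dv`, keyed on row names.
# Rows and order of `dv` are preserved; unmatched documents get NA.
# Assumes `new` shares no column names with `dv`.
left_join_by_rownames <- function(dv, new) {
  idx <- match(rownames(dv), rownames(new))  # NA where there is no match
  added <- new[idx, , drop = FALSE]          # NA index yields an all-NA row
  rownames(added) <- rownames(dv)
  cbind(dv, added)
}

merged <- left_join_by_rownames(dv, df_tomerge)
print(merged)
```

Because the alignment goes through match() rather than position, the order of rows in df_tomerge does not matter, and rows that match no document are simply dropped.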

@conjugateprior
Author

Yes, that would do it.

Two small caveats.

  1. Seems awkward to be required to key on rownames, but that's a minor thing. I guess it ensures they're unique :-)
  2. Making + non-commutative looks like trouble, unless you're thinking of type-distinguished (corpus, data.frame) and (data.frame, corpus) implementations of it.

@kbenoit
Contributor

kbenoit commented Mar 1, 2017

OK, thinking about options for syntax:

  1. It could qualify for the corpus_something() grammar since it takes a corpus as the main argument, and returns a modified corpus. Something like:

    corpus_joinvars(thecorpus, newdocvars_data.frame, by = NULL)

    where the default is to join by docnames() (and row.names for the data.frame), but can be set in the same way that dplyr::left_join() works.

  2. Since it sets docvars for a corpus (through a left join), it might be more appropriate to be a variant of the docvars() command. For instance:

    docvars(thecorpus, merge_source = newdocvars_data.frame, by = NULL)

    or maybe some clever adaptation of the docvars<-() replacement function?
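
To make the by argument concrete, here is a rough base R sketch of the semantics described in option 1. Everything in it is an assumption for illustration: joinvars_sketch is a hypothetical name, it works on plain data.frames rather than a corpus, by is limited to a single column name, and same-named columns are overwritten.

```r
# Hypothetical sketch of the `by` semantics; not part of quanteda.
# by = NULL keys on row names (standing in for docnames); otherwise `by`
# names one column present in both data.frames.
joinvars_sketch <- function(dv, new, by = NULL) {
  if (is.null(by)) {
    key_left  <- rownames(dv)
    key_right <- rownames(new)
    add_cols  <- names(new)
  } else {
    key_left  <- dv[[by]]
    key_right <- new[[by]]
    add_cols  <- setdiff(names(new), by)
  }
  if (anyDuplicated(key_right)) {
    stop("join key is not unique in the new docvars")
  }
  idx <- match(key_left, key_right)  # NA where a document has no match
  for (col in add_cols) {
    dv[[col]] <- new[[col]][idx]     # overwrites a same-named column
  }
  dv
}

dv <- data.frame(name  = c("Lenihan", "Bruton", "Gormley"),
                 party = c("FF", "FG", "Green"),
                 stringsAsFactors = FALSE)
new <- data.frame(name     = c("Gormley", "Lenihan", "Nobody"),
                  minister = c(1, 1, 0),
                  stringsAsFactors = FALSE)

out <- joinvars_sketch(dv, new, by = "name")
print(out)
```

The "Nobody" row has no matching document and is ignored, which is what makes this a left join; the number and order of documents never change.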

@kbenoit kbenoit changed the title str does not work for for corpus objects without docvars How to left join docvars with those in an existing corpus Mar 1, 2017
@kbenoit
Contributor

kbenoit commented Mar 1, 2017

How about this:

  • corpus1 + corpus2: adds documents and smart-cbinds docvars, where ndoc() of the result = ndoc(corpus1) + ndoc(corpus2). (This is the current + functionality for corpus objects.)
  • corpus + data.frame: performs the left join described above for matching documents, and does not affect ndoc().

Using S4 methods with multiple dispatch will allow us to distinguish these two methods (even with S3 objects). Order from chaos.

@kbenoit kbenoit reopened this Mar 1, 2017
@conjugateprior
Author

conjugateprior commented Mar 8, 2017

Four questions and proposed answers for the semantics of + with corpus 'corp' and data.frame 'newdocvars'.

  1. Are matches determined exclusively by (keyed on) the rownames of corpus and data.frame?
  2. Does + left join, ignoring docvars for which there is no corpus document?
  3. Does + overwrite values or create a new renamed variable for the intersection of colnames(docvars(corp)) and colnames(newdocvars)?
  4. Does + maintain docvars(corp) variable classes? (factor is the only hard case)

Proposal:

  1. Yes (since you've decided to do this elsewhere)
  2. Yes
  3. Yes. Specifically, option iii of the following:
    i. making a new docvar column with an adjusted name (messy and prevents building up docvar values in stages or partially)
    ii. only overwriting when the old corpus docvar has an NA in place (silent and confusing)
    iii. overwriting all elements of the old corpus docvar (simplest semantics, allows building up docvar values in stages or partially)
    iv. overwriting all elements of the old corpus docvar unless newdocvar value is NA in which case keeping the old corpus var (asymmetric semantics but otherwise like iii)
  4. Yes. The cases to consider are all combinations of numeric, character, and factor types for old corpus docvars and newdocvars. Abbreviate them N, C, and F, so <N,F> is an originally numeric corpus docvar meeting a factor in newdocvars that shares its name.
    1. <N,N> overwrite as above, keep class N
    2. <N,F> complain and stop
    3. <N,C> complain and stop
    4. <C,N> complain and stop
    5. <C,F> convert F to C and overwrite as above, keeping class C
    6. <C,C> overwrite as above, keep class C
    7. <F,N> complain and stop
    8. <F,F> convert both to character, overwrite as above, convert result to F (possibly creating and destroying labels)
    9. <F,C> convert F to C, overwrite as above, convert result to F (possibly creating and destroying labels)

Some discussion of the semantics of factor conversion would be useful.
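
As a starting point for that discussion, the nine cases could be sketched in base R like this. It is a minimal sketch under stated assumptions: overwrite_docvar is a hypothetical helper, new is assumed to be already aligned to the corpus documents (same length as old), and the overwrite follows option iii, so every element of old is replaced.

```r
# Hypothetical helper: overwrite docvar `old` with `new`, keeping the class
# of `old` where the pair is compatible and stopping otherwise.
overwrite_docvar <- function(old, new) {
  # Classify a vector as N (numeric), C (character), or F (factor).
  kind <- function(x) {
    if (is.factor(x)) "F"
    else if (is.numeric(x)) "N"
    else if (is.character(x)) "C"
    else stop("unsupported docvar class: ", class(x)[1])
  }
  case <- paste0(kind(old), kind(new))
  switch(case,
    NN = new,                        # <N,N>: keep class N
    CC = new,                        # <C,C>: keep class C
    CF = as.character(new),          # <C,F>: convert F to C
    FF = ,                           # <F,F>: falls through to FC
    FC = factor(as.character(new)),  # <F,C>: levels rebuilt from new values
    stop("incompatible docvar classes <", kind(old), ",", kind(new), ">")
  )
}

overwrite_docvar(c(1, 2), c(3, 4))                             # stays numeric
overwrite_docvar(c("a", "b"), factor(c("x", "y")))             # stays character
overwrite_docvar(factor(c("a", "b")), c("x", "y"))             # stays factor
res <- tryCatch(overwrite_docvar(c(1, 2), c("x", "y")),
                error = function(e) conditionMessage(e))       # complain and stop
```

Note that in the two factor cases the result's levels come entirely from the new values, which is where labels can be created and destroyed.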

@conjugateprior
Author

Second suggestion: All this goes into an augmented docvars command instead: docvars(corp) <- newdocvars. All the same questions would need answering for this, so it seems to be an orthogonal question.

@conjugateprior
Author

@kbenoit Thoughts on these semantics or should I assume they're fine and send a PR?

@kbenoit
Contributor

kbenoit commented Mar 13, 2017

Insofar as I understood it fully, let's implement your answers to the scheme above. I'd say that the docvars class should be the left side, i.e. the existing variable, and if this is not compatible in the ways you list, then complain and stop.

You mention a PR - great if you code this!

@kbenoit
Contributor

kbenoit commented Feb 6, 2018

Update: The solution to this could be part of quanteda/quanteda#1214. It could also be solved by the idea of creating a quanteda.dplyr extension package as described in quanteda/quanteda#1171, quanteda/quanteda#529.

@kbenoit kbenoit transferred this issue from quanteda/quanteda Jul 20, 2020
@kbenoit
Contributor

kbenoit commented Jul 20, 2020

@conjugateprior with the new package this should be pretty easy to implement now. I'm adding it to the list.

@mpazpiroz

@kbenoit I was wondering if there is a solution to the question in this thread? I have been unsuccessful in trying to add external variables to a corpus object. Thanks!
