
summary() should not print anything #1007

Closed
koheiw opened this issue Oct 5, 2017 · 19 comments

koheiw commented Oct 5, 2017

Just like the older version of head.dfm(), summary.textmodel_wordfish_fitted() prints the model parameters. It should only return values, just as summary.lm() does.

library("quanteda")
ieWF <- dfm(data_corpus_irishbudget2010, removePunct = TRUE) %>%
    textmodel_wordfish(dir = c(6, 5))
textplot_scale1d(ieWF)
out <- summary(ieWF)
# Call:
#   textmodel_wordfish.dfm(x = ., dir = c(6, 5))
# 
# Estimated document positions:
#   theta         SE       lower       upper
# 2010_BUDGET_01_Brian_Lenihan_FF        1.8209535 0.02032345  1.78111954  1.86078748
# 2010_BUDGET_02_Richard_Bruton_FG      -0.5932779 0.02818836 -0.64852708 -0.53802872
# 2010_BUDGET_03_Joan_Burton_LAB        -1.1136753 0.01540256 -1.14386429 -1.08348625
# 2010_BUDGET_04_Arthur_Morgan_SF       -0.1219325 0.02846319 -0.17772032 -0.06614462
# 2010_BUDGET_05_Brian_Cowen_FF          1.7724224 0.02364097  1.72608615  1.81875874
# 2010_BUDGET_06_Enda_Kenny_FG          -0.7145784 0.02650254 -0.76652337 -0.66263342
# 2010_BUDGET_07_Kieran_ODonnell_FG     -0.4844821 0.04171476 -0.56624299 -0.40272114
# 2010_BUDGET_08_Eamon_Gilmore_LAB      -0.5616713 0.02967351 -0.61983141 -0.50351124
# 2010_BUDGET_09_Michael_Higgins_LAB    -0.9703106 0.03850541 -1.04578121 -0.89484000
# 2010_BUDGET_10_Ruairi_Quinn_LAB       -0.9589229 0.03892373 -1.03521343 -0.88263242
# 2010_BUDGET_11_John_Gormley_Green      1.1807220 0.07221463  1.03918133  1.32226270
# 2010_BUDGET_12_Eamon_Ryan_Green        0.1866457 0.06294119  0.06328093  0.31001039
# 2010_BUDGET_13_Ciaran_Cuffe_Green      0.7421895 0.07245436  0.60017891  0.88420001
# 2010_BUDGET_14_Caoimhghin_OCaolain_SF -0.1840821 0.03666256 -0.25594076 -0.11222351

summary.textmodel_wordscores_fitted() is also wrong in the same way.

@koheiw koheiw added the bug label Oct 5, 2017
@koheiw koheiw added this to the v1.0 milestone Oct 5, 2017

kbenoit commented Oct 16, 2017

Should work like summary.lm().

Checklist:

  • summary.textmodel_wordfish_fitted
  • summary.textmodel_wordscores_fitted
  • summary.textmodel_wordscores_predicted
  • summary.textmodel_wordshoal_fitted

@kbenoit kbenoit modified the milestones: v1.0, Last pre-1.0 update Nov 5, 2017
@kbenoit kbenoit assigned koheiw and unassigned kbenoit Nov 6, 2017
@kbenoit kbenoit modified the milestones: CRAN pre-1.0 update, v1.0 Nov 10, 2017
@kbenoit kbenoit assigned kbenoit and unassigned koheiw Nov 16, 2017

koheiw commented Jan 5, 2018

I started working on the summary method for textmodel_wordscores(). Why does it use S4? To my mind, there is no reason to use S4 for text models, and it actually makes the code unnecessarily complex.


koheiw commented Jan 5, 2018

Fixing summary methods forces me to tidy up text models.

As long as we use the generic predict(), the text models' output should be consistent with that of other functions like lm(), but currently they are very different. predict.lm() returns only a named vector unless se.fit = TRUE; even then, its output is a plain list. From the lm() examples:

> predict(lm.D9)

    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18 
5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661 4.661 4.661 4.661 4.661 4.661 
   19    20 
4.661 4.661 
> predict(lm.D9, se.fit = TRUE)

$fit
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18 
5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661 4.661 4.661 4.661 4.661 4.661 
   19    20 
4.661 4.661 

$se.fit
 [1] 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177
[11] 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177

$df
[1] 18

$residual.scale
[1] 0.6963895

predict.textmodel_wordscores() always returns a data.frame. From the textmodel_wordscores examples:

> predict(ws)

Predicted textmodel of type: wordscores

   textscore LBG se   ci lo   ci hi
R1   -1.3179 0.0067 -1.3311 -1.3048
R2   -0.7396 0.0114 -0.7620 -0.7172
R3    0.0000 0.0120 -0.0235  0.0235
R4    0.7396 0.0114  0.7172  0.7620
R5    1.3179 0.0067  1.3048  1.3311
V1   -0.4481 0.0119 -0.4714 -0.4247

If we make predict.textmodel_*() similar to predict.lm(), we can remove summary.textmodel_*_predicted and print.textmodel_*_predicted to make the code simpler (we probably do not need the textmodel_*_predicted class at all). Removing those methods nevertheless does not cause users any inconvenience, because they can extract parameters with summary.textmodel_*_fitted() and view them nicely with print.textmodel_summary() (a new class that I am adding).


kbenoit commented Jan 5, 2018

That sounds great. This will make the methods consistent with other predict methods, which is what I had intended all along. We can use the same sorts of arguments as in predict.lm(), including se.fit, interval, and level, to produce the desired results as a list.

But it would be good to have a summary.textmodel_fitted() method too. (Not all predicted objects have one.)


koheiw commented Jan 5, 2018

OK. I will change textmodel_wordscores in this direction, and use it as a template for the other models.


kbenoit commented Jan 5, 2018

Sounds good. I can review that before we proceed to the others.


kbenoit commented Jan 5, 2018

On the S4 question above, the main reason was inheritance. But if we revert the predicted objects to a single list-style structure, the inheritance arguments may not be as compelling.

The move may very well break some users' old code, however. Accessing the slots of an S4 object is something we would discourage as a matter of principle, but for some functions this has been the only way to extract the desired quantities (at least until #108 is resolved). I'm happy to go with whatever works most consistently with other predict methods here, and for the fitted variants.

Suggest we call the fitted variant classes the same name as the textmodel_*() function. So for wordscores, the fitted model would be of class textmodel_wordscores. The predicted object can be fitted_textmodel, so that we can write a nicer print method for it. (textmodel_fitted is not really consistent, so better to avoid that word order.)


koheiw commented Jan 5, 2018

I designed summary objects as a list of small classed objects:
https://github.com/kbenoit/quanteda/blob/f778a0e876f4e7887ca9d29af29c417ae864faca/R/textmodel_wordscores.R#L295-L309
With that, we can implement print.summary really simply:
https://github.com/kbenoit/quanteda/blob/692229745399611375ecec5f067b42cefd21accd/R/textmodel-methods.R#L15-L21
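In outline, the idea is something like this (a hypothetical simplification of the linked code, not the actual implementation):

```r
# Sketch: the summary object is a list of small, already-classed components
# (call, document statistics, word scores), so printing just walks the list.
print.summary.textmodel <- function(x, ...) {
    for (component in x) {
        print(component)
        cat("\n")
    }
    invisible(x)
}
```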

> summary(ws)
Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA))

Reference Document Statistics:
(reference scores and feature count statistics)

 Document Score Total Min Max Mean Median
       R1 -1.50  1000   0 158   27      0
       R2 -0.75  1000   0 158   27      0
       R3  0.00  1000   0 158   27      0
       R4  0.75  1000   0 158   27      0
       R5  1.50  1000   0 158   27      0
       V1    NA  1000   0 158   27      0

Word Scores:
(showing first 30 features)

    A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q     R     S     T 
-1.50 -1.50 -1.50 -1.50 -1.50 -1.48 -1.48 -1.45 -1.41 -1.32 -1.18 -1.04 -0.88 -0.75 -0.62 -0.45 -0.30 -0.13  0.00  0.13 
    U     V     W     X     Y     Z    ZA    ZB    ZC    ZD 
 0.30  0.45  0.62  0.75  0.88  1.04  1.18  1.32  1.41  1.45 

> ws
Fitted wordscores model:
Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA))

Reference Documents and Reference Scores:

 Document Score
       R1 -1.50
       R2 -0.75
       R3  0.00
       R4  0.75
       R5  1.50
       V1    NA

I removed prediction objects, and predictions are either a named vector or a list:

> predict(ws)
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

> predict(ws, se.fit = TRUE)
$textscore_raw
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

$textscore_raw_se
[1] 0.006699613 0.011433605 0.012005250 0.011433605 0.006699613 0.011897667

$textscore_raw_lo
[1] -1.33106234 -0.76196925 -0.02352986  0.71715035  1.30480034 -0.47137808

$textscore_raw_hi
[1] -1.30480034 -0.71715035  0.02352986  0.76196925  1.33106234 -0.42474008

I am not sure if we should include the CI, because users can calculate it from the SE.
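For example, a user could recompute a 95% CI from the returned SEs under a normal approximation (a sketch using the fit and SE values shown above):

```r
# Values copied from the predict(ws, se.fit = TRUE) output above
fit <- c(R1 = -1.317931, R2 = -0.7395598, R3 = 0,
         R4 = 0.7395598, R5 = 1.317931, V1 = -0.4480591)
se  <- c(0.006699613, 0.011433605, 0.012005250,
         0.011433605, 0.006699613, 0.011897667)
z <- qnorm(0.975)  # ~1.96 for a 95% interval
ci <- cbind(lwr = fit - z * se, upr = fit + z * se)
ci["R1", ]  # reproduces textscore_raw_lo / _hi for R1 above
```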


kbenoit commented Jan 6, 2018

Let's say the following, which is consistent with what you have outlined above (and largely just summarizes what you have been proposing):

fitted

A (fitted) textmodel_wordscores object should have the following methods (it already has most of these):

  • summary. Produces a summary.textmodel object.
  • print
  • coef
  • confint. Even though this makes no sense for fitted wordscores, for consistency we can implement it to return zero-width intervals for now for the fitted object. Other models will have confidence intervals for the fitted objects (e.g. textmodel_wordfish).

The first three are already implemented, and just need OO tweaking.

summary

A summary.textmodel object will have a print method.

predicted

predict.textmodel_wordscores (and most others) would have the following signature:

## S3 method for class 'textmodel_wordscores'
predict(object, newdata, se.fit = FALSE, 
        interval = c("none", "confidence"), level = 0.95, 
        rescaling = c("none", "lbg", "mv"), ...)

(we remove the verbose argument)

Return: a (predicted) predict.textmodel object will be a named vector, matrix, or list, depending on the arguments, similar to how predict.lm() works:

x <- rnorm(5)
y <- x + rnorm(5)

predict(lm(y ~ x))
#          1          2          3          4          5 
# -0.8375490 -0.8598590 -1.0842958 -0.2708943 -1.5676246 

predict(lm(y ~ x), interval = "confidence")
#          fit       lwr       upr
# 1 -0.8375490 -2.216833 0.5417345
# 2 -0.8598590 -2.226422 0.5067040
# 3 -1.0842958 -2.530503 0.3619111
# 4 -0.2708943 -2.772334 2.2305456
# 5 -1.5676246 -4.043157 0.9079075
class(predict(lm(y ~ x), interval = "confidence"))
# [1] "matrix"

predict(lm(y ~ x), se.fit = TRUE)
# $fit
#          1          2          3          4          5 
# -0.8375490 -0.8598590 -1.0842958 -0.2708943 -1.5676246 
# 
# $se.fit
# [1] 0.4334036 0.4294065 0.4544324 0.7860116 0.7778708
# 
# $df
# [1] 3
# 
# $residual.scale
# [1] 0.949114

Perhaps not an ideal design, but it's an established model.

In addition to the above, we would also class the predicted objects beyond their base class, so that we could add a print method.


koheiw commented Jan 6, 2018

It's more similar to predict.lm now.

> predict(ws)
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

> predict(ws, se.fit = T)
$fit
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

$se
[1] 0.006699613 0.011433605 0.012005250 0.011433605 0.006699613 0.011897667

> predict(ws, se.fit = T, rescaling = 'lbg', interval = 'confidence')
           fit         lwr        upr
R1 -1.58967683 -1.60567795 -1.5736757
R2 -0.88488724 -0.91219485 -0.8575796
R3  0.01632248 -0.01235043  0.0449954
R4  0.91753220  0.89022458  0.9448398
R5  1.62232179  1.60632067  1.6383229
V1 -0.52967149 -0.55808746 -0.5012555


koheiw commented Jan 6, 2018

Should coef return word scores, which are also in the summary output?

    A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q     R     S     T 
-1.50 -1.50 -1.50 -1.50 -1.50 -1.48 -1.48 -1.45 -1.41 -1.32 -1.18 -1.04 -0.88 -0.75 -0.62 -0.45 -0.30 -0.13  0.00  0.13 
    U     V     W     X     Y     Z    ZA    ZB    ZC    ZD 
 0.30  0.45  0.62  0.75  0.88  1.04  1.18  1.32  1.41  1.45

Coefficients are returned in a matrix by coef() on a summary.lm object, but a named vector seems better here, because each word has only one coefficient in our text models. The only exception seems to be textmodel_wordfish.


kbenoit commented Jan 6, 2018

Yes. Great stuff on the new formatting.


koheiw commented Jan 6, 2018

coef() would look like this if its output were a matrix:

> coef(ws)
     Estimate
A  -1.5000000
B  -1.5000000
C  -1.5000000
D  -1.5000000
E  -1.5000000
F  -1.4812500
G  -1.4809322
H  -1.4519231
I  -1.4083333
J  -1.3232984
K  -1.1846154
L  -1.0369898
M  -0.8805970
N  -0.7500000
O  -0.6194030
P  -0.4507576
Q  -0.2992424
R  -0.1305970
S   0.0000000
T   0.1305970
U   0.2992424
V   0.4507576
W   0.6194030
X   0.7500000
Y   0.8805970
Z   1.0369898
ZA  1.1846154
ZB  1.3232984
ZC  1.4083333
ZD  1.4519231
ZE  1.4809322
ZF  1.4812500
ZG  1.5000000
ZH  1.5000000
ZI  1.5000000
ZJ  1.5000000
ZK  1.5000000


koheiw commented Jan 6, 2018

That is coef.textmodel_wordscores, but we could make a coef.textmodel if we combined all the coefficients into a list in the fitted textmodel objects. That would involve a frightening structural change to the S4 objects, though.


kbenoit commented Jan 6, 2018

coef.textmodel can return a named vector if the textmodel has just one set of coefficients, or a matrix if (similar to lm) it has multiple sets of coefficients for the same dimension (features, in the wordscores case). For wordfish, a list is probably the natural output, since the coefficients have different lengths depending on whether they are document or feature coefficients.
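A sketch of that convention (hypothetical; the field names here are assumptions, not the actual quanteda internals):

```r
# One set of coefficients: return a named numeric vector
coef.textmodel_wordscores <- function(object, ...) {
    object$wordscores  # assumed: scores named by feature
}

# Different-length sets (document vs. feature parameters): return a list
coef.textmodel_wordfish <- function(object, ...) {
    list(documents = object$theta,  # document positions
         features  = object$beta)   # feature marginal effects
}
```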


koheiw commented Jan 6, 2018

Let's specify the detail as we move onto other functions. Shall we tackle Wordfish next?


kbenoit commented Jan 6, 2018

Agreed, the natural choices for output formats will become more apparent once we have done a few more models. CA and wordfish will be very similar to one another. I can do NB and affinity if you tackle the others.


koheiw commented Jan 6, 2018

After working on textmodel_wordfish, I think it is best to make the output of coef() always a named list (even if there is only one set of parameters). I am not yet sure what print.textmodel_* should print, but I believe its output should be minimal.


kbenoit commented Jan 7, 2018

Suggest the following:

the textmodel_*() methods can produce S4 class objects as they currently do, but the remaining methods will be S3, and remain as similar to the lm/glm family as possible.

For wordscores and wordfish:

| function | output class |
| --- | --- |
| textmodel_wordscores() | textmodel_wordscores |
| predict.textmodel_wordscores() | predict.textmodel_wordscores |
| print.textmodel_wordscores() | produces minimal screen output |
| summary.textmodel_wordscores() | summary.textmodel_wordscores |
| print.summary.textmodel_wordscores() | produces detailed screen output |
| textmodel_wordfish() | textmodel_wordfish |
| print.textmodel_wordfish() | produces minimal screen output |
| summary.textmodel_wordfish() | summary.textmodel_wordfish |
| print.summary.textmodel_wordfish() | produces detailed screen output |
| confint.textmodel_wordfish() | dimnamed matrix as with confint.lm() |
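The confint.textmodel_wordfish() entry could be implemented roughly like this (a sketch only; the $theta, $se, and $docs fields are assumed names, not the actual object structure):

```r
# Sketch: build a dimnamed matrix of normal-approximation confidence
# intervals, mirroring the shape of confint.lm() output.
confint.textmodel_wordfish <- function(object, parm, level = 0.95, ...) {
    z <- qnorm(1 - (1 - level) / 2)
    out <- cbind(object$theta - z * object$se,
                 object$theta + z * object$se)
    dimnames(out) <- list(object$docs,
                          paste0(format(100 * c((1 - level) / 2,
                                                1 - (1 - level) / 2)), " %"))
    out
}
```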

kbenoit added a commit that referenced this issue Jan 10, 2018
@kbenoit kbenoit mentioned this issue Jan 10, 2018
@kbenoit kbenoit closed this as completed Jan 12, 2018