
summary() should not print anything #1007

Closed
koheiw opened this issue Oct 5, 2017 · 19 comments

koheiw commented Oct 5, 2017

Just like the older version of head.dfm(), summary.textmodel_wordfish_fitted() prints the model parameters. It should only return values, just as summary.lm() does.

library("quanteda")
ieWF <- dfm(data_corpus_irishbudget2010, removePunct = TRUE) %>%
    textmodel_wordfish(dir = c(6, 5))
textplot_scale1d(ieWF)
out <- summary(ieWF)
# Call:
#   textmodel_wordfish.dfm(x = ., dir = c(6, 5))
# 
# Estimated document positions:
#   theta         SE       lower       upper
# 2010_BUDGET_01_Brian_Lenihan_FF        1.8209535 0.02032345  1.78111954  1.86078748
# 2010_BUDGET_02_Richard_Bruton_FG      -0.5932779 0.02818836 -0.64852708 -0.53802872
# 2010_BUDGET_03_Joan_Burton_LAB        -1.1136753 0.01540256 -1.14386429 -1.08348625
# 2010_BUDGET_04_Arthur_Morgan_SF       -0.1219325 0.02846319 -0.17772032 -0.06614462
# 2010_BUDGET_05_Brian_Cowen_FF          1.7724224 0.02364097  1.72608615  1.81875874
# 2010_BUDGET_06_Enda_Kenny_FG          -0.7145784 0.02650254 -0.76652337 -0.66263342
# 2010_BUDGET_07_Kieran_ODonnell_FG     -0.4844821 0.04171476 -0.56624299 -0.40272114
# 2010_BUDGET_08_Eamon_Gilmore_LAB      -0.5616713 0.02967351 -0.61983141 -0.50351124
# 2010_BUDGET_09_Michael_Higgins_LAB    -0.9703106 0.03850541 -1.04578121 -0.89484000
# 2010_BUDGET_10_Ruairi_Quinn_LAB       -0.9589229 0.03892373 -1.03521343 -0.88263242
# 2010_BUDGET_11_John_Gormley_Green      1.1807220 0.07221463  1.03918133  1.32226270
# 2010_BUDGET_12_Eamon_Ryan_Green        0.1866457 0.06294119  0.06328093  0.31001039
# 2010_BUDGET_13_Ciaran_Cuffe_Green      0.7421895 0.07245436  0.60017891  0.88420001
# 2010_BUDGET_14_Caoimhghin_OCaolain_SF -0.1840821 0.03666256 -0.25594076 -0.11222351

summary.textmodel_wordscores_fitted() is also wrong in the same way.

@koheiw koheiw added the bug label Oct 5, 2017
@koheiw koheiw added this to the v1.0 milestone Oct 5, 2017

kbenoit commented Oct 16, 2017

Should work like summary.lm().

Checklist:

  • summary.textmodel_wordfish_fitted
  • summary.textmodel_wordscores_fitted
  • summary.textmodel_wordscores_predicted
  • summary.textmodel_wordshoal_fitted

@kbenoit kbenoit modified the milestones: v1.0, Last pre-1.0 update Nov 5, 2017
@kbenoit kbenoit assigned koheiw and unassigned kbenoit Nov 6, 2017
@kbenoit kbenoit modified the milestones: CRAN pre-1.0 update, v1.0 Nov 10, 2017
@kbenoit kbenoit assigned kbenoit and unassigned koheiw Nov 16, 2017

koheiw commented Jan 5, 2018

I started working on the summary method for textmodel_wordscores(). Why does it use S4? To my mind, there is no reason to use S4 for text models, and it actually makes the code unnecessarily complex.


koheiw commented Jan 5, 2018

Fixing summary methods forces me to tidy up text models.

As long as we use the generic predict(), the text models' output should be consistent with that of other functions like lm(), but currently they are very different. predict.lm() returns only a named vector unless se.fit = TRUE; even then, its output is a plain list. From the lm() examples:

> predict(lm.D9)

    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18 
5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661 4.661 4.661 4.661 4.661 4.661 
   19    20 
4.661 4.661 
> predict(lm.D9, se.fit = TRUE)

$fit
    1     2     3     4     5     6     7     8     9    10    11    12    13    14    15    16    17    18 
5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 5.032 4.661 4.661 4.661 4.661 4.661 4.661 4.661 4.661 
   19    20 
4.661 4.661 

$se.fit
 [1] 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177
[11] 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177 0.2202177

$df
[1] 18

$residual.scale
[1] 0.6963895

predict.textmodel_wordscores() always returns a data.frame. From the textmodel_wordscores examples:

> predict(ws)

Predicted textmodel of type: wordscores

   textscore LBG se   ci lo   ci hi
R1   -1.3179 0.0067 -1.3311 -1.3048
R2   -0.7396 0.0114 -0.7620 -0.7172
R3    0.0000 0.0120 -0.0235  0.0235
R4    0.7396 0.0114  0.7172  0.7620
R5    1.3179 0.0067  1.3048  1.3311
V1   -0.4481 0.0119 -0.4714 -0.4247

If we make predict.textmodel_*() similar to predict.lm(), we can remove summary.textmodel_*_predicted and print.textmodel_*_predicted to make the code simpler (we probably do not need the textmodel_*_predicted class at all). Removing those methods nevertheless does not cause users any inconvenience, because they can extract parameters with summary.textmodel_*_fitted() and view them nicely with print.textmodel_summary() (a new class that I am adding).


kbenoit commented Jan 5, 2018

That sounds great. This will make the methods consistent with other predict methods, which is what I had intended all along. We can use the same sorts of arguments as in predict.lm(), including se.fit, interval, and level, to produce the desired results as a list.

But it would be good to have a summary.textmodel_fitted() method too. (Not all predicted objects have one.)


koheiw commented Jan 5, 2018

OK. I will change textmodel_wordscores in this direction, and use it as a template for the other models.


kbenoit commented Jan 5, 2018

Sounds good. I can review that before we proceed to the others.


kbenoit commented Jan 5, 2018

On the S4 question above, the main reason was inheritance. But if we revert the predicted objects to a single list-style structure, the inheritance arguments may not be as compelling.

The move may very well break some users' old code, however. Accessing the slots of an S4 object is something we would discourage as a matter of principle, but for some functions this has been the only way to extract the desired quantities (at least until #108 is resolved). I'm happy to go with whatever works most consistently with other predict methods here, and for the fitted variants.

Suggest we call the fitted variant classes the same name as the textmodel_*() function. So for wordscores, the fitted model would be of class textmodel_wordscores. The predicted object can be fitted_textmodel, so that we can write a nicer print method for it. (textmodel_fitted is not really consistent, so better to avoid that word order.)


koheiw commented Jan 5, 2018

I designed summary objects as a list of small classed objects:
https://github.com/kbenoit/quanteda/blob/f778a0e876f4e7887ca9d29af29c417ae864faca/R/textmodel_wordscores.R#L295-L309
With that, we can implement print.summary really simply:
https://github.com/kbenoit/quanteda/blob/692229745399611375ecec5f067b42cefd21accd/R/textmodel-methods.R#L15-L21
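In outline, the idea is something like this (a hypothetical simplification of the linked code, not the actual implementation):

```r
# Sketch: the summary object is a list of small, already-classed components
# (call, document statistics, word scores), so printing just walks the list.
print.summary.textmodel <- function(x, ...) {
    for (component in x) {
        print(component)
        cat("\n")
    }
    invisible(x)
}
```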

> summary(ws)
Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA))

Reference Document Statistics:
(reference scores and feature count statistics)

 Document Score Total Min Max Mean Median
       R1 -1.50  1000   0 158   27      0
       R2 -0.75  1000   0 158   27      0
       R3  0.00  1000   0 158   27      0
       R4  0.75  1000   0 158   27      0
       R5  1.50  1000   0 158   27      0
       V1    NA  1000   0 158   27      0

Word Scores:
(showing first 30 features)

    A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q     R     S     T 
-1.50 -1.50 -1.50 -1.50 -1.50 -1.48 -1.48 -1.45 -1.41 -1.32 -1.18 -1.04 -0.88 -0.75 -0.62 -0.45 -0.30 -0.13  0.00  0.13 
    U     V     W     X     Y     Z    ZA    ZB    ZC    ZD 
 0.30  0.45  0.62  0.75  0.88  1.04  1.18  1.32  1.41  1.45 

> ws
Fitted wordscores model:
Call:
textmodel_wordscores.dfm(x = data_dfm_lbgexample, y = c(seq(-1.5, 
    1.5, 0.75), NA))

Reference Documents and Reference Scores:

 Document Score
       R1 -1.50
       R2 -0.75
       R3  0.00
       R4  0.75
       R5  1.50
       V1    NA

I removed prediction objects, and predictions are either a named vector or a list:

> predict(ws)
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

> predict(ws, se.fit = TRUE)
$textscore_raw
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

$textscore_raw_se
[1] 0.006699613 0.011433605 0.012005250 0.011433605 0.006699613 0.011897667

$textscore_raw_lo
[1] -1.33106234 -0.76196925 -0.02352986  0.71715035  1.30480034 -0.47137808

$textscore_raw_hi
[1] -1.30480034 -0.71715035  0.02352986  0.76196925  1.33106234 -0.42474008

I am not sure if we should include the CI, because users can calculate it from the SE.
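For example, a user could recompute a 95% CI from the returned SEs under a normal approximation (a sketch using the fit and SE values shown above):

```r
# Values copied from the predict(ws, se.fit = TRUE) output above
fit <- c(R1 = -1.317931, R2 = -0.7395598, R3 = 0,
         R4 = 0.7395598, R5 = 1.317931, V1 = -0.4480591)
se  <- c(0.006699613, 0.011433605, 0.012005250,
         0.011433605, 0.006699613, 0.011897667)
z <- qnorm(0.975)  # ~1.96 for a 95% interval
ci <- cbind(lwr = fit - z * se, upr = fit + z * se)
ci["R1", ]  # reproduces textscore_raw_lo / _hi for R1 above
```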


kbenoit commented Jan 6, 2018

Let's say the following, which is consistent with what you have outlined above (and largely just summarizes what you have been proposing):

fitted

A (fitted) textmodel_wordscores object should have the following methods (it already has most of these):

  • summary. Produces a summary.textmodel object.
  • print
  • coef
  • confint. Even though this makes no sense for fitted wordscores, for consistency we can implement it to return zero-width intervals for now for the fitted object. Other models will have confidence intervals for the fitted objects (e.g. textmodel_wordfish).

The first three are already implemented, and just need OO tweaking.

summary

A summary.textmodel object will have a print method.

predicted

predict.textmodel_wordscores (and most others) would have the following signature:

## S3 method for class 'textmodel_wordscores'
predict(object, newdata, se.fit = FALSE, 
        interval = c("none", "confidence"), level = 0.95, 
        rescaling = c("none", "lbg", "mv"), ...)

(we remove the verbose argument)

Return: a (predicted) predict.textmodel object will be a named vector, matrix, or list, depending on the arguments, similar to how predict.lm() works:

x <- rnorm(5)
y <- x + rnorm(5)

predict(lm(y ~ x))
#          1          2          3          4          5 
# -0.8375490 -0.8598590 -1.0842958 -0.2708943 -1.5676246 

predict(lm(y ~ x), interval = "confidence")
#          fit       lwr       upr
# 1 -0.8375490 -2.216833 0.5417345
# 2 -0.8598590 -2.226422 0.5067040
# 3 -1.0842958 -2.530503 0.3619111
# 4 -0.2708943 -2.772334 2.2305456
# 5 -1.5676246 -4.043157 0.9079075
class(predict(lm(y ~ x), interval = "confidence"))
# [1] "matrix"

predict(lm(y ~ x), se.fit = TRUE)
# $fit
#          1          2          3          4          5 
# -0.8375490 -0.8598590 -1.0842958 -0.2708943 -1.5676246 
# 
# $se.fit
# [1] 0.4334036 0.4294065 0.4544324 0.7860116 0.7778708
# 
# $df
# [1] 3
# 
# $residual.scale
# [1] 0.949114

Perhaps not an ideal design, but it's an established model.

In addition to the above, we would also class the predicted objects beyond their base class, so that we could add a print method.


koheiw commented Jan 6, 2018

It's more similar to predict.lm now.

> predict(ws)
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

> predict(ws, se.fit = T)
$fit
           R1            R2            R3            R4            R5            V1 
-1.317931e+00 -7.395598e-01 -8.673617e-18  7.395598e-01  1.317931e+00 -4.480591e-01 

$se
[1] 0.006699613 0.011433605 0.012005250 0.011433605 0.006699613 0.011897667

> predict(ws, se.fit = T, rescaling = 'lbg', interval = 'confidence')
           fit         lwr        upr
R1 -1.58967683 -1.60567795 -1.5736757
R2 -0.88488724 -0.91219485 -0.8575796
R3  0.01632248 -0.01235043  0.0449954
R4  0.91753220  0.89022458  0.9448398
R5  1.62232179  1.60632067  1.6383229
V1 -0.52967149 -0.55808746 -0.5012555


koheiw commented Jan 6, 2018

Should coef return word scores, which are also in the summary output?

    A     B     C     D     E     F     G     H     I     J     K     L     M     N     O     P     Q     R     S     T 
-1.50 -1.50 -1.50 -1.50 -1.50 -1.48 -1.48 -1.45 -1.41 -1.32 -1.18 -1.04 -0.88 -0.75 -0.62 -0.45 -0.30 -0.13  0.00  0.13 
    U     V     W     X     Y     Z    ZA    ZB    ZC    ZD 
 0.30  0.45  0.62  0.75  0.88  1.04  1.18  1.32  1.41  1.45

Coefficients are returned in a matrix by coef() on a summary.lm object, but a named vector seems better here, because each word has only one coefficient in our text models. The only exception seems to be textmodel_wordfish.


kbenoit commented Jan 6, 2018

Yes. Great stuff on the new formatting.


koheiw commented Jan 6, 2018

coef() would look like this if its output were a matrix:

> coef(ws)
     Estimate
A  -1.5000000
B  -1.5000000
C  -1.5000000
D  -1.5000000
E  -1.5000000
F  -1.4812500
G  -1.4809322
H  -1.4519231
I  -1.4083333
J  -1.3232984
K  -1.1846154
L  -1.0369898
M  -0.8805970
N  -0.7500000
O  -0.6194030
P  -0.4507576
Q  -0.2992424
R  -0.1305970
S   0.0000000
T   0.1305970
U   0.2992424
V   0.4507576
W   0.6194030
X   0.7500000
Y   0.8805970
Z   1.0369898
ZA  1.1846154
ZB  1.3232984
ZC  1.4083333
ZD  1.4519231
ZE  1.4809322
ZF  1.4812500
ZG  1.5000000
ZH  1.5000000
ZI  1.5000000
ZJ  1.5000000
ZK  1.5000000


koheiw commented Jan 6, 2018

That is coef.textmodel_wordscores, but we could make a coef.textmodel if we combined all the coefficients into a list in the fitted textmodel objects. That would involve a frightening structural change to the S4 objects, though.


kbenoit commented Jan 6, 2018

coef.textmodel can return a named vector if the textmodel has just one set of coefficients, or a matrix if (similar to lm) it has multiple sets of coefficients for the same dimension (features, in the wordscores case). For wordfish, a list is probably the natural output, since the coefficients have different lengths depending on whether they are document or feature coefficients.
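A sketch of that convention (hypothetical; the field names here are assumptions, not the actual quanteda internals):

```r
# One set of coefficients: return a named numeric vector
coef.textmodel_wordscores <- function(object, ...) {
    object$wordscores  # assumed: scores named by feature
}

# Different-length sets (document vs. feature parameters): return a list
coef.textmodel_wordfish <- function(object, ...) {
    list(documents = object$theta,  # document positions
         features  = object$beta)   # feature marginal effects
}
```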


koheiw commented Jan 6, 2018

Let's specify the detail as we move onto other functions. Shall we tackle Wordfish next?


kbenoit commented Jan 6, 2018

Agreed, the natural choices for output formats will become more apparent once we have done a few more models. CA and wordfish will be very similar to one another. I can do NB and affinity if you tackle the others.


koheiw commented Jan 6, 2018

After working on textmodel_wordfish, I think it is best to make the output of coef() always a named list (even if there is only one set of parameters). I am not yet sure what print.textmodel_* should print, but I believe its output should be minimal.


kbenoit commented Jan 7, 2018

Suggest the following:

the textmodel_*() methods can produce S4 class objects as they currently do, but the remaining methods will be S3, and remain as similar to the lm/glm family as possible.

For wordscores and wordfish:

| function | output class |
| --- | --- |
| textmodel_wordscores() | textmodel_wordscores |
| predict.textmodel_wordscores() | predict.textmodel_wordscores |
| print.textmodel_wordscores() | produces minimal screen output |
| summary.textmodel_wordscores() | summary.textmodel_wordscores |
| print.summary.textmodel_wordscores() | produces detailed screen output |
| textmodel_wordfish() | textmodel_wordfish |
| print.textmodel_wordfish() | produces minimal screen output |
| summary.textmodel_wordfish() | summary.textmodel_wordfish |
| print.summary.textmodel_wordfish() | produces detailed screen output |
| confint.textmodel_wordfish() | dimnamed matrix as with confint.lm() |
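The confint.textmodel_wordfish() entry could be implemented roughly like this (a sketch only; the $theta, $se, and $docs fields are assumed names, not the actual object structure):

```r
# Sketch: build a dimnamed matrix of normal-approximation confidence
# intervals, mirroring the shape of confint.lm() output.
confint.textmodel_wordfish <- function(object, parm, level = 0.95, ...) {
    z <- qnorm(1 - (1 - level) / 2)
    out <- cbind(object$theta - z * object$se,
                 object$theta + z * object$se)
    dimnames(out) <- list(object$docs,
                          paste0(format(100 * c((1 - level) / 2,
                                                1 - (1 - level) / 2)), " %"))
    out
}
```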

kbenoit added a commit that referenced this issue Jan 10, 2018
@kbenoit kbenoit mentioned this issue Jan 10, 2018
@kbenoit kbenoit closed this as completed Jan 12, 2018