Add padding to dfm_select() to record the number of deleted tokens #2152

koheiw · 2021-11-26T03:26:19Z

We know how many tokens were deleted here thanks to tokens_remove(padding = TRUE): the count is recorded under the padding feature "" in the first column.

require(quanteda)
toks <- tokens(data_corpus_inaugural)
dfm(tokens_remove(toks, stopwords(), padding = TRUE))
#> Document-feature matrix of: 59 documents, 9,302 features (92.64% sparse) and 4 docvars.
#>                  features
#> docs                   fellow-citizens senate house representatives : among
#>   1789-Washington  778               1      1     2               2 1     1
#>   1793-Washington   73               0      0     0               0 1     0
#>   1797-Adams      1248               3      1     0               2 0     4
#>   1801-Jefferson   913               2      0     0               0 1     1
#>   1805-Jefferson  1155               0      0     0               0 0     7
#>   1809-Madison     649               1      0     0               0 0     0
#>                  features
#> docs              vicissitudes incident life
#>   1789-Washington            1        1    1
#>   1793-Washington            0        0    0
#>   1797-Adams                 0        0    2
#>   1801-Jefferson             0        0    1
#>   1805-Jefferson             0        0    2
#>   1809-Madison               0        0    1
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,292 more features ]
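As a sanity check, the padding count can be compared against the drop in token counts. This is a minimal sketch on a toy example: the two-document input and the explicit pattern are made up for illustration, and it assumes dfm() keeps the "" padding feature, as in the output above.

```r
library(quanteda)

# Toy input, not taken from the inaugural corpus
toks <- tokens(c(d1 = "the cat sat on the mat", d2 = "a dog barked"))
pat <- c("the", "on", "a")  # explicit pattern used instead of stopwords()

# Number of tokens removed per document, computed from token counts
n_removed <- ntoken(toks) - ntoken(tokens_remove(toks, pat))

# Count stored under the padding feature "" of the DFM
dfmt_pad <- dfm(tokens_remove(toks, pat, padding = TRUE))
n_pad <- rowSums(dfmt_pad[, featnames(dfmt_pad) == ""])

# Both routes should report the same number of deleted tokens
stopifnot(all(as.numeric(n_pad) == as.numeric(n_removed)))
```

Selecting the column with featnames(dfmt_pad) == "" rather than by position avoids relying on the padding feature always being first.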

But we don't know how many tokens were deleted if we use dfm_remove().

dfmt <- dfm(toks)
dfm_remove(dfmt, stopwords())
#> Document-feature matrix of: 59 documents, 9,301 features (92.65% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens senate house representatives : among
#>   1789-Washington               1      1     2               2 1     1
#>   1793-Washington               0      0     0               0 1     0
#>   1797-Adams                    3      1     0               2 0     4
#>   1801-Jefferson                2      0     0               0 1     1
#>   1805-Jefferson                0      0     0               0 0     7
#>   1809-Madison                  1      0     0               0 0     0
#>                  features
#> docs              vicissitudes incident life event
#>   1789-Washington            1        1    1     2
#>   1793-Washington            0        0    0     0
#>   1797-Adams                 0        0    2     0
#>   1801-Jefferson             0        0    1     0
#>   1805-Jefferson             0        0    2     0
#>   1809-Madison               0        0    1     0
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,291 more features ]

To keep pipelines for DFMs and tokens consistent, let's add dfm_remove(padding = TRUE). "feat1" in the DFM below should be named "" to indicate that it is padding.

cbind(rowSums(dfm_select(dfmt, stopwords())), dfm_remove(dfmt, stopwords()))
#> Document-feature matrix of: 59 documents, 9,302 features (92.64% sparse) and 0 docvars.
#>                  features
#> docs              feat1 fellow-citizens senate house representatives : among
#>   1789-Washington   778               1      1     2               2 1     1
#>   1793-Washington    73               0      0     0               0 1     0
#>   1797-Adams       1248               3      1     0               2 0     4
#>   1801-Jefferson    913               2      0     0               0 1     1
#>   1805-Jefferson   1155               0      0     0               0 0     7
#>   1809-Madison      649               1      0     0               0 0     0
#>                  features
#> docs              vicissitudes incident life
#>   1789-Washington            1        1    1
#>   1793-Washington            0        0    0
#>   1797-Adams                 0        0    2
#>   1801-Jefferson             0        0    1
#>   1805-Jefferson             0        0    2
#>   1809-Madison               0        0    1
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,292 more features ]

Created on 2021-11-26 by the reprex package (v2.0.1)
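With the proposed argument in place, the two routes could be compared directly. The sketch below assumes a quanteda version that includes the change merged in #2154 and that dfm_remove() accepts padding = TRUE as proposed here. The toy input and the pad_count() helper are hypothetical, for illustration only.

```r
library(quanteda)

# Toy input and pattern, made up for illustration
toks <- tokens(c(d1 = "the cat sat on the mat", d2 = "a dog barked"))
pat <- c("the", "on", "a")

# Hypothetical helper: per-document count stored under the padding feature ""
pad_count <- function(x) as.numeric(rowSums(x[, featnames(x) == ""]))

# Route 1: remove from tokens with padding, then build the DFM
via_tokens <- dfm(tokens_remove(toks, pat, padding = TRUE))

# Route 2: build the DFM first, then remove features with padding
# (assumes the padding argument proposed in this issue is available)
via_dfm <- dfm_remove(dfm(toks), pat, padding = TRUE)

# The number of deleted tokens should match between the two routes
stopifnot(all(pad_count(via_tokens) == pad_count(via_dfm)))
```

Only the padding counts are compared, since the order of the remaining features is not guaranteed to be the same in both routes.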

koheiw self-assigned this Nov 26, 2021

koheiw mentioned this issue Nov 26, 2021

Adds padding to dfm_select() #2154

Merged

kbenoit added a commit that referenced this issue Nov 26, 2021

Update news for #2152

b8bd77a

koheiw closed this as completed Nov 26, 2021
