Add padding to dfm_select() to record the number of deleted tokens #2152

Closed
koheiw opened this issue Nov 26, 2021 · 0 comments

koheiw commented Nov 26, 2021

We know how many tokens were deleted here thanks to tokens_remove(padding = TRUE): the first feature column, whose name is the empty string, holds the count of pads per document.

require(quanteda)
toks <- tokens(data_corpus_inaugural)
dfm(tokens_remove(toks, stopwords(), padding = TRUE))
#> Document-feature matrix of: 59 documents, 9,302 features (92.64% sparse) and 4 docvars.
#>                  features
#> docs                   fellow-citizens senate house representatives : among
#>   1789-Washington  778               1      1     2               2 1     1
#>   1793-Washington   73               0      0     0               0 1     0
#>   1797-Adams      1248               3      1     0               2 0     4
#>   1801-Jefferson   913               2      0     0               0 1     1
#>   1805-Jefferson  1155               0      0     0               0 0     7
#>   1809-Madison     649               1      0     0               0 0     0
#>                  features
#> docs              vicissitudes incident life
#>   1789-Washington            1        1    1
#>   1793-Washington            0        0    0
#>   1797-Adams                 0        0    2
#>   1801-Jefferson             0        0    1
#>   1805-Jefferson             0        0    2
#>   1809-Madison               0        0    1
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,292 more features ]

But we don't know how many tokens were deleted if we use dfm_remove().

dfmt <- dfm(toks)
dfm_remove(dfmt, stopwords())
#> Document-feature matrix of: 59 documents, 9,301 features (92.65% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens senate house representatives : among
#>   1789-Washington               1      1     2               2 1     1
#>   1793-Washington               0      0     0               0 1     0
#>   1797-Adams                    3      1     0               2 0     4
#>   1801-Jefferson                2      0     0               0 1     1
#>   1805-Jefferson                0      0     0               0 0     7
#>   1809-Madison                  1      0     0               0 0     0
#>                  features
#> docs              vicissitudes incident life event
#>   1789-Washington            1        1    1     2
#>   1793-Washington            0        0    0     0
#>   1797-Adams                 0        0    2     0
#>   1801-Jefferson             0        0    1     0
#>   1805-Jefferson             0        0    2     0
#>   1809-Madison               0        0    1     0
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,291 more features ]

To keep pipelines for DFMs and tokens consistent, let's add dfm_remove(padding = TRUE). The "feat1" column in the DFM below should be named "" to indicate that it is padding.

cbind(rowSums(dfm_select(dfmt, stopwords())), dfm_remove(dfmt, stopwords()))
#> Document-feature matrix of: 59 documents, 9,302 features (92.64% sparse) and 0 docvars.
#>                  features
#> docs              feat1 fellow-citizens senate house representatives : among
#>   1789-Washington   778               1      1     2               2 1     1
#>   1793-Washington    73               0      0     0               0 1     0
#>   1797-Adams       1248               3      1     0               2 0     4
#>   1801-Jefferson    913               2      0     0               0 1     1
#>   1805-Jefferson   1155               0      0     0               0 0     7
#>   1809-Madison      649               1      0     0               0 0     0
#>                  features
#> docs              vicissitudes incident life
#>   1789-Washington            1        1    1
#>   1793-Washington            0        0    0
#>   1797-Adams                 0        0    2
#>   1801-Jefferson             0        0    1
#>   1805-Jefferson             0        0    2
#>   1809-Madison               0        0    1
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,292 more features ]

Created on 2021-11-26 by the reprex package (v2.0.1)
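
Until such an argument exists, the per-document count of removed tokens can also be recovered by comparing row totals before and after removal. This is a minimal sketch that only uses calls already shown above; the names n_before, n_after and n_removed are introduced here for illustration and are not part of quanteda.

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural)
dfmt <- dfm(toks)

# Total token count per document before and after removing stopwords
n_before <- rowSums(dfmt)
n_after  <- rowSums(dfm_remove(dfmt, stopwords()))

# The difference is the number of tokens dropped from each document,
# i.e. the value a padding column would be expected to hold
n_removed <- n_before - n_after
head(n_removed)
```

The result should match rowSums(dfm_select(dfmt, stopwords())), since every token is either kept or removed.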

@koheiw koheiw self-assigned this Nov 26, 2021
kbenoit added a commit that referenced this issue Nov 26, 2021
@koheiw koheiw closed this as completed Nov 26, 2021