what = not passed through to tokens() by dfm() #1121

Closed
cschwem2er opened this issue Dec 6, 2017 · 1 comment
@cschwem2er

Hi,
the following code should produce a dfm with characters as features, but the resulting dfm contains words as features instead:

library(quanteda)
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")

q_corp <- corpus(txt)
q_dfm <- dfm(q_corp, what = 'character')
head(q_dfm)

Document-feature matrix of: 5 documents, 6 features (60% sparse).
5 x 6 sparse Matrix of class "dfm"
    features
docs chinese beijing shanghai macao tokyo japan
  d1       2       1        0     0     0     0
  d2       2       0        1     0     0     0
  d3       1       0        0     1     0     0
  d4       1       0        0     0     1     1
  d5       3       0        0     0     1     1

It works when tokens() is called as a separate first step:

q_tokens <- tokens(q_corp, what = 'character')
q_dfm <- dfm(q_tokens)
head(q_dfm)

Document-feature matrix of: 5 documents, 6 features (0% sparse).
5 x 6 sparse Matrix of class "dfm"
    features
docs c h i n e s
  d1 2 2 4 3 5 2
  d2 2 4 3 3 4 3
  d3 2 1 1 1 2 1
  d4 1 1 1 2 2 1
  d5 3 3 3 4 6 3
@kbenoit
Collaborator

kbenoit commented Dec 7, 2017

True, somehow the what argument is not being passed through to tokens() from dfm().

Simpler examples:

> tokens("this is a test", what = "character")
tokens from 1 document.
text1 :
 [1] "t" "h" "i" "s" "i" "s" "a" "t" "e" "s" "t"
> dfm("this is a test", what = "character")
Document-feature matrix of: 1 document, 4 features (0% sparse).
1 x 4 sparse Matrix of class "dfm"
       features
docs    this is a test
  text1    1  1 1    1

> tokens("This is a test. Second sentence", what = "sentence")
tokens from 1 document.
text1 :
[1] "This is a test." "Second sentence"

> dfm("This is a test. Second sentence", what = "sentence")
Document-feature matrix of: 1 document, 7 features (0% sparse).
1 x 7 sparse Matrix of class "dfm"
       features
docs    this is a test . second sentence
  text1    1  1 1    1 1      1        1
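
For illustration only, here is a minimal base-R sketch of one way an argument can get lost like this. The functions toy_tokens(), toy_dfm_dropping() and toy_dfm_forwarding() are hypothetical names made up for this example; they are not quanteda's code, and this is not a claim about the actual cause of the bug. The point is only that a wrapper which accepts ... but does not forward it will silently ignore what = without raising an error.

```r
# Hypothetical toy tokenizer (NOT quanteda's tokens()):
# splits a single string into words or into single characters.
toy_tokens <- function(x, what = c("word", "character")) {
  what <- match.arg(what)
  if (what == "character") {
    # remove whitespace, then split into individual characters
    strsplit(gsub("\\s+", "", x), "")[[1]]
  } else {
    # split on runs of whitespace
    strsplit(x, "\\s+")[[1]]
  }
}

# Hypothetical wrapper that accepts ... but never forwards it,
# so `what` is swallowed and the default ("word") is used.
toy_dfm_dropping <- function(x, ...) {
  table(toy_tokens(x))
}

# Hypothetical wrapper that forwards ... to the tokenizer,
# so `what` takes effect.
toy_dfm_forwarding <- function(x, ...) {
  table(toy_tokens(x, ...))
}

# Feature names are words here, even though what = "character" was given
names(toy_dfm_dropping("this is a test", what = "character"))

# Feature names are single characters here
names(toy_dfm_forwarding("this is a test", what = "character"))
```

With the dropping wrapper no warning is issued, which matches the symptom above: the call succeeds but returns word features.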

kbenoit changed the title from "bug for dfm and character tokenization" to "what = not passed through to tokens() by dfm()" Dec 7, 2017
kbenoit added the bug label Dec 7, 2017
kbenoit added this to the v1.0 milestone Dec 7, 2017
koheiw mentioned this issue Dec 7, 2017
kbenoit closed this as completed Dec 8, 2017