what = not passed through to tokens() by dfm() #1121

cschwem2er · 2017-12-06T20:03:39Z

Hi,
the following code should produce a dfm with characters as features, but it contains words as features instead:

library(quanteda)
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")

q_corp <- corpus(txt)
q_dfm <- dfm(q_corp, what = 'character')
head(q_dfm)

Document-feature matrix of: 5 documents, 6 features (60% sparse).
5 x 6 sparse Matrix of class "dfm"
    features
docs chinese beijing shanghai macao tokyo japan
  d1       2       1        0     0     0     0
  d2       2       0        1     0     0     0
  d3       1       0        0     1     0     0
  d4       1       0        0     0     1     1
  d5       3       0        0     0     1     1

It works when calling tokens() as a separate first step:

q_tokens <- tokens(q_corp, what = 'character')
q_dfm <- dfm(q_tokens)
head(q_dfm)

Document-feature matrix of: 5 documents, 6 features (0% sparse).
5 x 6 sparse Matrix of class "dfm"
    features
docs c h i n e s
  d1 2 2 4 3 5 2
  d2 2 4 3 3 4 3
  d3 2 1 1 1 2 1
  d4 1 1 1 2 2 1
  d5 3 3 3 4 6 3

kbenoit · 2017-12-07T10:18:45Z

True, somehow the what argument is not being passed through to tokens() from dfm().

Simpler examples:

> tokens("this is a test", what = "character")
tokens from 1 document.
text1 :
 [1] "t" "h" "i" "s" "i" "s" "a" "t" "e" "s" "t"
> dfm("this is a test", what = "character")
Document-feature matrix of: 1 document, 4 features (0% sparse).
1 x 4 sparse Matrix of class "dfm"
       features
docs    this is a test
  text1    1  1 1    1

> tokens("This is a test. Second sentence", what = "sentence")
tokens from 1 document.
text1 :
[1] "This is a test." "Second sentence"

> dfm("This is a test. Second sentence", what = "sentence")
Document-feature matrix of: 1 document, 7 features (0% sparse).
1 x 7 sparse Matrix of class "dfm"
       features
docs    this is a test . second sentence
  text1    1  1 1    1 1      1        1
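
For readers unfamiliar with how an argument can get lost between a wrapper and the function it calls, here is a minimal base R sketch of one way the symptom above can arise. The thread does not identify the actual cause, and the toy_tokens(), toy_dfm_broken() and toy_dfm_fixed() functions are hypothetical stand-ins, not quanteda's code. The assumption illustrated is simply that a wrapper accepts `...` but does not forward it.

```r
# Hypothetical stand-ins for illustration only; NOT quanteda's implementation.

# A toy tokenizer with a `what` argument, loosely modelled on tokens().
toy_tokens <- function(x, what = c("word", "character")) {
  what <- match.arg(what)
  if (what == "character") {
    # Drop whitespace, then split into single characters.
    strsplit(gsub("\\s+", "", x), "")[[1]]
  } else {
    # Split on runs of whitespace.
    strsplit(x, "\\s+")[[1]]
  }
}

# Broken wrapper: accepts `...` but never hands it on,
# so `what` is silently ignored and the default ("word") is used.
toy_dfm_broken <- function(x, ...) {
  table(toy_tokens(x))
}

# Fixed wrapper: forwards `...` to the tokenizer.
toy_dfm_fixed <- function(x, ...) {
  table(toy_tokens(x, ...))
}

# Feature names from each wrapper for the same call.
broken_features <- names(toy_dfm_broken("this is a test", what = "character"))
fixed_features  <- names(toy_dfm_fixed("this is a test", what = "character"))

print(broken_features)  # word features
print(fixed_features)   # single-character features
```

Note that the broken wrapper raises no error or warning, because unused arguments captured by `...` are dropped silently. That matches the behaviour reported here, where dfm() returned word features without complaint.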

kbenoit changed the title ~~bug for dfm and character tokenization~~ what = not passed through to tokens() by dfm() Dec 7, 2017

kbenoit assigned koheiw Dec 7, 2017

kbenoit added the bug label Dec 7, 2017

kbenoit added this to the v1.0 milestone Dec 7, 2017

koheiw mentioned this issue Dec 7, 2017

Issue 1121 #1127

Merged

kbenoit closed this as completed Dec 8, 2017

koheiw mentioned this issue Jan 19, 2018

Inconsistent behavior of remove_url in dfm.corpus() and tokens() #1203

Closed
