toLower in dfm not working #123

koheiw · 2016-04-21T08:54:29Z

token <- tokenize('AAA bbb CC aaa')
mx1 <- dfm(token)
colnames(mx1) # "AAA" "bbb" "CC"  "aaa"
mx2 <- dfm(token, toLower=TRUE)
colnames(mx2) # "AAA" "bbb" "CC"  "aaa"

> print(sessionInfo())
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] digest_0.6.9   Matrix_1.2-4   quanteda_0.9.4

loaded via a namespace (and not attached):
[1] parallel_3.2.5   tools_3.2.5      Rcpp_0.12.4      stringi_1.0-1    grid_3.2.5       data.table_1.9.6
[7] chron_2.3-47     lattice_0.20-33  ca_0.64

It would also be great if I could merge the case-sensitive columns of a dfm object with a command like toLower(mx).

kbenoit · 2016-04-21T09:57:29Z

Correct... Although toLower is an argument in dfm.tokenizedTexts(), I think we intended it not to be, since users who tokenise in a separate step before constructing a dfm usually want the greatest control. In other words, you would need to call dfm(toLower(token)) in your example. But I agree that we should either implement it in the function code or remove it from the function arguments.
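To illustrate why lowercasing the tokens before counting gives the merged result, here is a minimal base R sketch. It does not use quanteda at all; strsplit() and table() stand in for tokenize() and dfm() purely for illustration, so it only approximates what dfm(toLower(token)) does.

```r
# Base R stand-in for tokenising the example text (not quanteda's tokenize())
tokens <- strsplit("AAA bbb CC aaa", " ", fixed = TRUE)[[1]]

# Counting the raw tokens keeps "AAA" and "aaa" as separate features
counts_cs <- table(tokens)
length(counts_cs)     # 4 distinct features

# Lowercasing first folds the case variants into one feature
counts_lc <- table(tolower(tokens))
length(counts_lc)     # 3 distinct features
counts_lc[["aaa"]]    # 2
```

The point is only the ordering: case folding has to happen on the tokens, before they are tabulated into features.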

On the merging of case-sensitive columns, how about a function to generically reindex the dimensions of a dfm, that would consolidate equivalent columns and rows? I ran into this issue the other day when working with cbind() methods. Right now it just issues a warning, which does not solve the problem:

> cbind(dfm("aaa BBB", verbose = FALSE), dfm("bbb CCC", verbose = FALSE))
Document-feature matrix of: 1 document, 4 features.
1 x 4 sparse Matrix of class "dfmSparse"
      aaa bbb bbb ccc
text1   1   1   1   1
Warning message:
In cbind(deparse.level, ...) :
  cbinding dfms with overlapping features will result in duplicated features

Note: you need to update the package (from GitHub).

> packageVersion("quanteda")
[1] ‘0.9.5.19’

koheiw · 2016-04-21T15:20:20Z

Thank you. dfm(toLower(token)) is absolutely fine.

A dfm merge/reindexing function would be a great addition. It would be like:

rownames(mx1) <- toLower(rownames(mx1))
mx1 <- merge(mx1)

or

mx3 <- merge(mx1, mx2)

kbenoit · 2016-04-21T19:52:18Z

See ?compress in v0.9.5-20. It is not fast on large objects (although the code is nicely parsimonious), but I am working on it; see also http://stackoverflow.com/questions/36778166/how-to-combine-columns-with-identical-names-in-a-large-sparse-matrix.

koheiw · 2016-04-22T00:33:53Z

Thank you for the quick change. I have been trying an approach like the one below. It could be efficient, but it does not work yet, because the numbers are coerced to binary values when I update matrix C.


require(Matrix)

A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2, 1, 2), j = c(1:6, c(7, 7)), x = c(1,2,3,1,2,3,4,4), 
                  giveCsparse = FALSE,
                  dimnames = list(paste0("r", 1:2), letters[c(1,2,3,1,2,3,4)]))
A

cnm <- colnames(A)
dup <- duplicated(cnm)
loc <- data.frame(dupc=ave(cnm==cnm, cnm, FUN=cumsum), ucol=as.integer(as.factor(cnm)), ocol=1:length(cnm))

C <- sparseMatrix(dims = c(2,sum(!dup)), giveCsparse = FALSE,
                  i={}, j={}, dimnames=list(rownames(A), colnames(A)[!dup]))

for(c in unique(loc$dupc)){
  l <- loc[loc$dupc==c,]
  #print(c)
  #print(A[,l$ocol])
  C[,l$ucol] <- C[,l$ucol] + A[,l$ocol]
}
C
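A likely explanation for the coercion, offered as a sketch rather than a confirmed diagnosis: sparseMatrix() called without an x argument returns a pattern (0/1) matrix, so an accumulator built that way has nowhere to store counts. The names P and N below are illustrative only.

```r
library(Matrix)

# No x supplied: a pattern matrix that records only which cells are non-zero
P <- sparseMatrix(i = 1, j = 1, dims = c(2, 4))
is(P, "nsparseMatrix")   # TRUE

# A numeric sparse matrix of zeros can hold counts
N <- Matrix(0, nrow = 2, ncol = 4, sparse = TRUE)
is(N, "dsparseMatrix")   # TRUE
```

Under that assumption, starting C from a numeric sparse matrix instead of a pattern matrix should avoid the loss of values, although the column-by-column loop may still be slow on large objects.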

kbenoit · 2016-04-22T01:15:45Z

Why not give the test case in my StackOverflow question a try?

I think the efficiency will come from combining (rewriting) the x values after reindexing the j and Dimnames[[2]] of the dgTMatrix.

Ken
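One way the reindexing idea could be sketched with the Matrix package alone: take the (i, j, x) triplets, remap each j to the position of its column name among the unique names, and rebuild the matrix. sparseMatrix() adds the x values of duplicated (i, j) pairs by default, which performs the merge. The helper name merge_dup_cols() is hypothetical; it is not part of Matrix or quanteda and is not how compress() is implemented.

```r
library(Matrix)

# Toy matrix with duplicated column names, mirroring the example in this thread
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2, 1, 2),
                  j = c(1:6, 7, 7),
                  x = c(1, 2, 3, 1, 2, 3, 4, 4),
                  dimnames = list(paste0("r", 1:2),
                                  letters[c(1, 2, 3, 1, 2, 3, 4)]))

# Hypothetical helper: merge columns that share a name by summing them
merge_dup_cols <- function(m) {
  trip <- summary(m)                          # data frame of (i, j, x) triplets
  ucol <- unique(colnames(m))                 # target column names
  newj <- match(colnames(m), ucol)[trip$j]    # old column index -> merged index
  # Duplicated (i, j) pairs have their x values added when the matrix is rebuilt
  sparseMatrix(i = trip$i, j = newj, x = trip$x,
               dims = c(nrow(m), length(ucol)),
               dimnames = list(rownames(m), ucol))
}

C <- merge_dup_cols(A)
C
```

This avoids looping over columns, but it has not been benchmarked here against compress() on large objects.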

koheiw · 2016-04-22T10:59:05Z

Your function on Stack Overflow uses too much memory and is also very slow, as you expected. I gave up on altering the dfm for now and took a different approach: I decided to clean the tokens fully before making the dfm, by creating a fast token removal function. This approach worked, so the problem is solved. Thank you very much anyway.

kbenoit · 2016-04-22T12:03:01Z

OK - I’d be very interested to see your “fast token removal solution”. Is it something that could be of general use? Right now feature selection generally takes place after the dfm is created, but this might not be the most efficient approach in many cases.

kbenoit changed the title ~~toLower in dfm does not seem to be working~~ toLower in dfm not working Apr 21, 2016

kbenoit added a commit that referenced this issue Apr 21, 2016

Add compress() per issue #123

8ff11fa

kbenoit mentioned this issue Apr 22, 2016

Faster token scanning #125

Closed

kbenoit added a commit that referenced this issue Apr 23, 2016

Faster compress(), solves #123

f0bc306

kbenoit closed this as completed May 7, 2016
