toLower in dfm not working #123

Closed
koheiw opened this issue Apr 21, 2016 · 7 comments

koheiw commented Apr 21, 2016

token <- tokenize('AAA bbb CC aaa')
mx1 <- dfm(token)
colnames(mx1) # "AAA" "bbb" "CC"  "aaa"
mx2 <- dfm(token, toLower=TRUE)
colnames(mx2) # "AAA" "bbb" "CC"  "aaa"

> print(sessionInfo())
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] digest_0.6.9   Matrix_1.2-4   quanteda_0.9.4

loaded via a namespace (and not attached):
[1] parallel_3.2.5   tools_3.2.5      Rcpp_0.12.4      stringi_1.0-1    grid_3.2.5       data.table_1.9.6
[7] chron_2.3-47     lattice_0.20-33  ca_0.64         

It would also be great if I could merge the case-sensitive columns of a dfm object with a command such as toLower(mx).

kbenoit commented Apr 21, 2016

Correct. Although toLower is an argument of dfm.tokenizedTexts(), I think we intended it not to be, since users who tokenise in a separate step before constructing a dfm usually want the greatest control. In other words, you would need to call dfm(toLower(token)) in your example. But I agree that we should either implement it in the function code or remove it from the function arguments.

On the merging of case-sensitive columns, how about a function to generically reindex the dimensions of a dfm, that would consolidate equivalent columns and rows? I ran into this issue the other day when working with cbind() methods. Right now it just issues a warning, which does not solve the problem:

> cbind(dfm("aaa BBB", verbose = FALSE), dfm("bbb CCC", verbose = FALSE))
Document-feature matrix of: 1 document, 4 features.
1 x 4 sparse Matrix of class "dfmSparse"
      aaa bbb bbb ccc
text1   1   1   1   1
Warning message:
In cbind(deparse.level, ...) :
  cbinding dfms with overlapping features will result in duplicated features

Note: you need to update the package (from GitHub).

> packageVersion("quanteda")
[1] ‘0.9.5.19’

kbenoit changed the title from "toLower in dfm does not seem to be working" to "toLower in dfm not working" Apr 21, 2016

koheiw commented Apr 21, 2016

Thank you. dfm(toLower(token)) is absolutely fine.

A dfm merge/reindexing function would be a great addition. It would look like:

colnames(mx1) <- toLower(colnames(mx1))
mx1 <- merge(mx1)

or

mx3 <- merge(mx1, mx2)
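What such a merge would do can be illustrated without quanteda. The following is only a sketch on a plain dense base-R matrix standing in for a dfm; `merge_cols_by_name` is a hypothetical helper, not an existing quanteda function, and this dense approach would not scale to a large sparse dfm.

```r
# Hypothetical helper: sum the columns of a dense matrix that share a name.
merge_cols_by_name <- function(m) {
  # rowsum() adds up rows by group, so transpose, group by column name,
  # then transpose back. Groups come out in sorted order.
  t(rowsum(t(m), group = colnames(m)))
}

# Stand-in for the dfm of 'AAA bbb CC aaa': one document, four features
mx <- matrix(1, nrow = 1, ncol = 4,
             dimnames = list("text1", c("AAA", "bbb", "CC", "aaa")))

colnames(mx) <- tolower(colnames(mx))  # fold case, which creates duplicate names
merged <- merge_cols_by_name(mx)
merged  # columns aaa, bbb, cc with counts 2, 1, 1
```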

kbenoit added a commit that referenced this issue Apr 21, 2016

kbenoit commented Apr 21, 2016

See ?compress in v0.9.5-20. The code is nicely parsimonious but not fast on large objects; I am working on it. See also http://stackoverflow.com/questions/36778166/how-to-combine-columns-with-identical-names-in-a-large-sparse-matrix.

koheiw commented Apr 22, 2016

Thank you for the quick change. I have been trying something like this. It could be efficient, but it does not work yet, because the numbers are coerced into binary values when matrix C is updated.


require(Matrix)

A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2, 1, 2), j = c(1:6, c(7, 7)), x = c(1,2,3,1,2,3,4,4), 
                  giveCsparse = FALSE,
                  dimnames = list(paste0("r", 1:2), letters[c(1,2,3,1,2,3,4)]))
A

cnm <- colnames(A)
dup <- duplicated(cnm)
loc <- data.frame(dupc=ave(cnm==cnm, cnm, FUN=cumsum), ucol=as.integer(as.factor(cnm)), ocol=1:length(cnm))

C <- sparseMatrix(dims = c(2,sum(!dup)), giveCsparse = FALSE,
                  i={}, j={}, dimnames=list(rownames(A), colnames(A)[!dup]))

for(c in unique(loc$dupc)){
  l <- loc[loc$dupc==c,]
  #print(c)
  #print(A[,l$ocol])
  C[,l$ucol] <- C[,l$ucol] + A[,l$ocol]
}
C
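A likely cause of the coercion, offered as an assumption rather than a confirmed diagnosis: `sparseMatrix()` called without `x` returns a pattern matrix, which records only where the non-zero entries are and not their values. The sketch below contrasts that with an empty matrix built from numeric zeros using `Matrix()`; the object names are illustrative only.

```r
library(Matrix)

# No `x` supplied: the result is a pattern matrix (positions only, no values)
C_pattern <- sparseMatrix(i = integer(0), j = integer(0), dims = c(2, 4))

# Built from numeric zeros: a numeric sparse matrix that can hold counts
C_numeric <- Matrix(0, nrow = 2, ncol = 4, sparse = TRUE, doDiag = FALSE)

is(C_pattern, "nsparseMatrix")  # TRUE
is(C_numeric, "dsparseMatrix")  # TRUE

# Adding values into the numeric matrix keeps them as numbers
C_numeric[, 1] <- C_numeric[, 1] + c(3, 5)
C_numeric[, 1]
```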

kbenoit commented Apr 22, 2016

Why not give the test case a try on my StackOverflow question?

I think the efficiency will come from combining the x values after reindexing the j and Dimnames[[2]] slots of the dgTMatrix.

Ken
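The reindexing idea can be sketched roughly as follows. This is one possible reading of the suggestion, not the code of `compress` or of the Stack Overflow answer: `combine_dup_cols` is a hypothetical helper, and it assumes a numeric sparse matrix (one with an `x` slot) that has column names. It relies on `sparseMatrix()` summing the `x` values of repeated (i, j) pairs.

```r
library(Matrix)

# Hypothetical helper (not part of Matrix or quanteda):
# sum the columns of a numeric sparse matrix that share a name.
combine_dup_cols <- function(m) {
  stopifnot(!is.null(colnames(m)))
  trip <- as(m, "TsparseMatrix")    # triplet form; slots i and j are 0-based
  ucol <- unique(colnames(m))       # target columns, in first-occurrence order
  newj <- match(colnames(m), ucol)  # old column index -> new column index
  # sparseMatrix() adds up the x values of repeated (i, j) pairs
  sparseMatrix(i = trip@i + 1L,
               j = newj[trip@j + 1L],
               x = trip@x,
               dims = c(nrow(m), length(ucol)),
               dimnames = list(rownames(m), ucol))
}

# Same test matrix as in the comment above, built in the default column format
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2, 1, 2), j = c(1:6, 7, 7),
                  x = c(1, 2, 3, 1, 2, 3, 4, 4),
                  dimnames = list(paste0("r", 1:2),
                                  letters[c(1, 2, 3, 1, 2, 3, 4)]))
res <- combine_dup_cols(A)
res  # columns a, b, c, d; both rows hold 1, 2, 3, 4
```

Because only the column indices are remapped and the matrix is rebuilt once, no column-by-column loop is needed; whether this is fast enough on a large dfm would need to be measured.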

koheiw commented Apr 22, 2016

Your function on Stack Overflow uses too much memory and is also very slow, as you expected. I gave up on altering the dfm for now and took a different approach: I decided to tokenize perfectly before making the dfm, by creating a fast token removal function. This approach worked, so the problem is solved. Thank you very much anyway.

kbenoit commented Apr 22, 2016

OK - I’d be very interested to see your “fast token removal solution”. Is it something that could be of general use? Right now feature selection generally takes place after the dfm is created, but this might not be the most efficient approach in many cases.

kbenoit added a commit that referenced this issue Apr 23, 2016
kbenoit closed this as completed May 7, 2016