toLower in dfm not working #123
Comments
Correct... Although, on the merging of case-sensitive columns, how about a function that generically reindexes the dimensions of a dfm and consolidates equivalent columns and rows? I ran into this issue the other day when working with cbind() methods. Right now it just issues a warning, which does not solve the problem:
Note: you need to update the package (from GitHub).
Thank you. dfm(toLower(token)) is absolutely fine. A dfm merge/reindexing function would be a great addition. It would be like:
or
See |
Thank you for the quick change. I have been trying it like this. It could be efficient, but it does not work yet, because the numbers are coerced to binary values when I update matrix C....
Why not give the test case in my StackOverflow question a try? I think the efficiency will come from combining and rewriting the x values after reindexing the j and Dimnames[[2]] of the dgTMatrix.
Ken
On 21 Apr 2016, at 20:33, koheiw <notifications@github.com> wrote:
Thank you for the quick change. I have been trying it like this. It could be efficient, but it does not work yet, because the numbers are coerced to binary values when I update matrix C....
set up a (triplet) sparseMatrix
A <- sparseMatrix(i = c(1, 2, 1, 2, 1, 2, 1, 2), j = c(1:6, c(7, 7)), x = c(1,2,3,1,2,3,4,4),
cnm <- colnames(A)
C <- sparseMatrix(dims = c(2,sum(!dup)), giveCsparse = FALSE,
for(c in unique(loc$dupc)){
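The reindexing idea discussed in this exchange (remap the column index j, shrink Dimnames[[2]], and combine the x values of columns that collapse together) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the thread or from quanteda: `merge_columns_lower()` is a hypothetical helper name, the input is assumed to be a numeric sparse matrix from the Matrix package with column names, and instead of editing the slots in place it rebuilds the matrix with `sparseMatrix()`, which sums the x values of repeated (i, j) pairs.

```r
library(Matrix)

# Hypothetical helper (not part of quanteda): merge columns whose names are
# identical after lowercasing, summing their counts.
merge_columns_lower <- function(m) {
  # Triplet form exposes the slots i and j (both 0-based) and x
  m <- as(m, "TsparseMatrix")
  old_names <- colnames(m)
  new_names <- unique(tolower(old_names))
  # Position of each original column within the consolidated column set
  col_map <- match(tolower(old_names), new_names)
  # sparseMatrix() adds up x for duplicated (i, j) pairs, so columns that
  # map to the same new index are merged by summation
  sparseMatrix(
    i = m@i + 1L,
    j = col_map[m@j + 1L],
    x = m@x,
    dims = c(nrow(m), length(new_names)),
    dimnames = list(rownames(m), new_names)
  )
}

# Small made-up example: "The" and "the" should collapse into one column
A <- sparseMatrix(
  i = c(1, 2, 1, 2, 1, 2),
  j = c(1, 1, 2, 2, 3, 3),
  x = c(1, 2, 3, 1, 2, 3),
  dimnames = list(c("doc1", "doc2"), c("The", "the", "cat"))
)
B <- merge_columns_lower(A)
```

Because x is passed explicitly, the result stays numeric rather than becoming a pattern (binary) matrix, which is what `sparseMatrix()` returns when x is omitted.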
Your function on StackOverflow uses too much memory and is also very slow, as you expected. I gave up on altering the dfm for now and took a different approach. I decided to tokenize perfectly before making the dfm by creating a fast token removal function. This approach worked, so the problem is solved. Thank you very much anyway.
OK - I’d be very interested to see your “fast token removal solution”. Is it something that could be general-purpose? Right now feature selection generally takes place after the dfm is created, but this might not be the most efficient approach in many cases.
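The token removal function itself is not shown in this thread. As a rough base-R illustration of the general approach (clean and lowercase the tokens first, then build the count matrix), something like the following could work. The helper name `remove_tokens()`, the sample tokens, and the drop list are all made up for this sketch, and no claim is made about its speed.

```r
# Hypothetical helper: drop unwanted tokens from a list of character vectors
# (one vector per document) before any matrix is built.
remove_tokens <- function(tokens, drop, ignore_case = TRUE) {
  if (ignore_case) drop <- tolower(drop)
  lapply(tokens, function(tok) {
    key <- if (ignore_case) tolower(tok) else tok
    tok[!key %in% drop]
  })
}

# Made-up example documents
tokens <- list(
  doc1 = c("The", "cat", "sat", "on", "the", "mat"),
  doc2 = c("A", "dog", "sat")
)

# Remove unwanted tokens, then lowercase what is left
cleaned <- remove_tokens(tokens, drop = c("the", "a", "on"))
cleaned <- lapply(cleaned, tolower)

# Tabulate into a small dense document-by-feature count matrix
features <- sort(unique(unlist(cleaned)))
counts <- t(sapply(cleaned, function(tok) table(factor(tok, levels = features))))
```

Since the tokens are already lowercased and filtered, no case-variant columns ever reach the matrix, so nothing needs merging afterwards.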
It would also be great if I could merge case-sensitive columns of a dfm object with a command like toLower(mx).