Sparse Matrix in caret #474

Closed
farbodr opened this issue Aug 21, 2016 · 11 comments

Comments

@farbodr

farbodr commented Aug 21, 2016

Is it possible to use a sparse matrix with caret? All the code that I've seen in caret uses as.matrix(x), which converts it to a dense matrix (as was pointed out to me on Stack Overflow). This defeats the purpose of using a sparse matrix.
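
A minimal sketch of the concern, assuming only the Matrix package; the dimensions and density are arbitrary illustrative values, not taken from any caret code. It shows that as.matrix() stores every zero explicitly, so the dense copy is much larger than the sparse original.

```r
library(Matrix)

set.seed(1)
# Illustrative sparse matrix: 1000 x 1000 with about 1% non-zero entries
x_sparse <- rsparsematrix(nrow = 1000, ncol = 1000, density = 0.01)

# as.matrix() materializes every cell, including the zeros
x_dense <- as.matrix(x_sparse)

sparse_bytes <- as.numeric(object.size(x_sparse))
dense_bytes  <- as.numeric(object.size(x_dense))

# Compare the memory footprint of the two representations
print(c(sparse = sparse_bytes, dense = dense_bytes))
```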

TIA,
FR

@topepo
Owner

topepo commented Aug 22, 2016

All the code that I've seen in caret uses as.matrix(x) which converts it to a dense matrix

train.default doesn't do that. There are likely models that require a non-sparse representation of the data, but I'm pretty sure that train leaves it alone. Most (if not all) of the required conversion code lives in the model code and runs only when the model requires it (AFAIK).

Some of that model code was written a decade ago; if a model now allows for a sparse matrix, let me know or do a PR.

Thanks,

Max

@farbodr
Author

farbodr commented Aug 22, 2016

Thanks Max. One that comes to mind is ranger. I think it can take a sparse matrix for x, but from what I can tell from the fit method in caret, it gets converted to a data frame. I'm not quite sure, but I think nnet can also accept a matrix and it gets converted to a data frame in caret as well.

XGBoost and glmnet are the ones that I'm using right now, and both support sparse matrices.
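
As a rough illustration of the sparse support mentioned above, here is a sketch that fits glmnet directly on a dgCMatrix without calling as.matrix(). It assumes the Matrix and glmnet packages are installed; the simulated data and sizes are made up for the example, and it calls glmnet directly rather than going through caret.

```r
library(Matrix)

if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  # Simulated sparse predictors (dgCMatrix) and a numeric outcome
  x <- rsparsematrix(nrow = 100, ncol = 20, density = 0.1)
  y <- rnorm(100)

  # glmnet accepts the sparse matrix as-is; no dense conversion is needed
  fit <- glmnet::glmnet(x, y)
  print(class(fit))
}
```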

FR

@farbodr
Author

farbodr commented Aug 22, 2016

I take that back; I don't think ranger supports a sparse matrix.

@topepo
Owner

topepo commented Aug 22, 2016

The xgboost methods already convert the data to a sparse format using xgb.DMatrix. I added code to avoid re-converting.

In glmnet, I avoided the conversion too.

Test away!

@farbodr
Author

farbodr commented Aug 22, 2016

Thank you for the speedy response. I noticed that line 56 in glmnet.R may not be correct. It says 'x <- as.matri(x)' and should be 'x <- as.matrix(x)'.

FR

@topepo
Owner

topepo commented Aug 23, 2016

Thanks

topepo added a commit that referenced this issue Aug 23, 2016
@farbodr
Author

farbodr commented Aug 23, 2016

I think one more change is needed:

if(!(class(x)[1] %in% c("matrix", "sparseMatrix")))
                      x <- as.matrix(x)

should be

if(!(class(x)[1] %in% c("Matrix", "sparseMatrix")))
                      x <- as.matrix(x)

@topepo
Owner

topepo commented Sep 13, 2016

I'm not sure. ?glmnet has for x:

input matrix, of dimension nobs x nvars; each row is an observation vector. Can be in sparse matrix format (inherit from class "sparseMatrix" as in package Matrix; not yet available for family="cox")

It needs to be a matrix of some sort, and I interpret the above to mean either "matrix" or "sparseMatrix".
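
One point that may help with the check discussed above: for a concrete sparse object from the Matrix package, class(x)[1] is the concrete class (for example "dgCMatrix"), while "sparseMatrix" is a virtual class that the object inherits from. The sketch below, assuming only the Matrix and methods packages, tests inheritance instead of comparing the first class string. The helper name is_matrix_like is hypothetical and is not part of caret or glmnet.

```r
library(Matrix)
library(methods)

# A small sparse matrix built from (row, column, value) triplets
x_sparse <- sparseMatrix(i = c(1, 3), j = c(2, 1), x = c(1.5, 2), dims = c(3, 2))
x_dense  <- as.matrix(x_sparse)

# The first class element is the concrete class, not "sparseMatrix"
print(class(x_sparse)[1])

# Hypothetical helper: TRUE for base matrices and for anything
# inheriting from the virtual class "sparseMatrix"
is_matrix_like <- function(x) {
  is.matrix(x) || is(x, "sparseMatrix")
}

print(is_matrix_like(x_sparse))
print(is_matrix_like(x_dense))
print(is_matrix_like(data.frame(a = 1)))
```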

@schistyakov
Contributor

I am trying to use FeatureHashing (https://github.com/wush978/FeatureHashing) with xgboost in caret. FeatureHashing generates a sparse matrix of class dgCMatrix. With native xgboost everything works well, but caret raises a memory exception.

I've tried to fix it in my local branch:

The first issue is at the line xgb.DMatrix(x, label = y, missing = NA), because native xgboost defaults to missing = 0. What was the reason for setting missing to NA instead of the default 0?

I've changed the code to set missing to 0 if class(x)[1] == "dgCMatrix".

The second problem: if x is a dgCMatrix, then it is converted to a dense matrix:

if(class(x)[1] != "xgb.DMatrix") 
    x <- as.matrix(x)

I've replaced it with:

if(!(class(x)[1] %in% c("xgb.DMatrix", "dgCMatrix")))
    x <- as.matrix(x)

Now it works, but I get a warning from the main train.default function:

Warning train.default(x = train_x, y = as.factor(ifelse(train_y == 1,  :
    The training data could not be converted to a data frame for saving

Do you have any idea what's wrong?
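
The guard described in the comment above can be sketched outside caret as follows. This assumes the Matrix and xgboost packages are installed; the simulated data are invented for the example, and the missing argument is deliberately left at the package default so the sketch takes no position on the NA versus 0 question.

```r
library(Matrix)

if (requireNamespace("xgboost", quietly = TRUE)) {
  set.seed(1)
  # Simulated sparse predictors (dgCMatrix) and a binary label
  x <- rsparsematrix(nrow = 50, ncol = 10, density = 0.2)
  y <- rbinom(50, size = 1, prob = 0.5)

  # Only densify inputs that are neither an xgb.DMatrix nor a dgCMatrix
  if (!(class(x)[1] %in% c("xgb.DMatrix", "dgCMatrix"))) {
    x <- as.matrix(x)
  }

  # The sparse matrix is passed to xgb.DMatrix without a dense copy
  dtrain <- xgboost::xgb.DMatrix(data = x, label = y)
  print(dim(dtrain))
}
```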

@topepo
Owner

topepo commented Apr 13, 2017

Sorry for the delay. I think that it is the returnData option in trainControl. Try setting that to FALSE and see if the problem still persists.

Also, I just fixed a few things in the devel version related to sparse matrices, so try reinstalling that too. There are some inherent limitations to some of these sparse classes. A lot of existing code needs a data frame or matrix (say, for preprocessing or subsampling), and there are no methods for these sparse matrix classes to accommodate them.
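
A minimal sketch of the returnData suggestion, assuming caret is installed. The resampling settings (method = "cv", number = 5) are placeholder choices for the example and are not taken from this thread.

```r
if (requireNamespace("caret", quietly = TRUE)) {
  # returnData = FALSE tells train() not to keep a copy of the training
  # data in the fitted object, so no data frame conversion is attempted
  ctrl <- caret::trainControl(method = "cv", number = 5, returnData = FALSE)
  print(ctrl$returnData)
}
```

The resulting object would then be passed to train() through its trControl argument.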

@prokopyev

Setting the returnData option in trainControl to FALSE resolves this issue for me.
