Issue when making large csr matrices #9180

Open
swzCuroverse opened this issue Aug 26, 2018 · 14 comments

Comments

@swzCuroverse


I am having an issue when making a large csr matrix. I know this has been an issue before, so I made sure to use the latest release (I am running from a Docker container). I am working with large data, and when I attempt to make my final csr matrix, the Docker container goes down. I tried making the matrix in steps and checked that nothing was wrong with it (nothing seems to be), and it builds correctly in coo format. It is just the csr that is the issue. Because I am using the Docker container on a particular system that runs workflows in a cluster environment, I can't tell whether it is a memory issue or a segmentation fault due to a code issue. I am on a node with 400GB of RAM, and at the moment of the crash it is using ~70GB, but usage is only sampled every few seconds, so a brief memory spike could be the cause and go unrecorded.

Since the code involves data that is highly restricted, I can make a proxy dataset and send it to you for debugging purposes. I am not sure of the best way to do so. The code itself is very simple at the point of the crash, but without the data it will not be a reproduction step as required.

I know this is a strange-ish request, so any help with this issue would be very, very helpful. I am stuck until then.

Some things to note:
The matrix is in uint16, which at some point had some bug reports, but I believe I had the same issue with doubles. The current code breaks it into pieces because I was trying to debug and see whether it was part of the matrix giving me issues and/or the csr format.

Reproducing code example:

    import numpy as np
    from scipy.sparse import hstack
    from sklearn.preprocessing import OneHotEncoder

    # tiledata: the 2-D input array (restricted data, not shown)
    enc = OneHotEncoder(sparse=True, dtype=np.uint16)
    m, n = tiledata.shape

    # 1-hot encoding tiled data (note: I broke it apart for debugging this issue, not necessary)
    halfmatrix = n // 2
    Xtrain = enc.fit_transform(tiledata[:, 0:n - halfmatrix])
    Xtrain2 = enc.fit_transform(tiledata[:, n - halfmatrix:n])

    del tiledata

    # This is fine in coo but crashes in csr
    Xtrain = hstack([Xtrain, Xtrain2])

Error message:

<<Full error message, if any (starting from line Traceback: ...)>>

Scipy/Numpy/Python version information:

<<Output from 'import sys, scipy, numpy; print(scipy.__version__, numpy.__version__, sys.version_info)'>>

('1.1.0', '1.15.1', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))

@ilayn
Member

ilayn commented Aug 26, 2018

Can you try to replicate this locally using random data? Without the information about the nature of the crash, I don't know how we can help other than a lucky coincidence of someone else having a similar problem and knowing the reason.
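For what it's worth, a random stand-in for the restricted data could look roughly like the sketch below. It is only an assumption about the structure of the real input: `random_one_hot` is a hypothetical helper that builds a one-hot style matrix directly with SciPy (it does not use `OneHotEncoder`), and the sizes are tiny so it runs anywhere. The parameters `m`, `n`, and `k` would need scaling up to approach the failing case.

```python
import numpy as np
from scipy.sparse import coo_matrix, hstack


def random_one_hot(m, n, k, seed=0):
    """Return an (m, n*k) CSR matrix with one nonzero per (row, feature)."""
    rng = np.random.RandomState(seed)
    codes = rng.randint(0, k, size=(m, n))       # hypothetical category codes
    rows = np.repeat(np.arange(m), n)            # row index of every entry
    cols = (np.arange(n) * k + codes).ravel()    # column block j holds feature j
    data = np.ones(m * n, dtype=np.uint16)
    return coo_matrix((data, (rows, cols)), shape=(m, n * k)).tocsr()


# Tiny sizes for illustration only.
left = random_one_hot(m=20, n=30, k=4, seed=1)
right = random_one_hot(m=20, n=30, k=4, seed=2)

# Request coo explicitly so the conversion step is isolated from the stacking.
stacked = hstack([left, right], format="coo")
as_csr = stacked.tocsr()
print(stacked.shape, stacked.nnz, as_csr.format)
```

Keeping the stacking and the `tocsr()` call on separate lines makes it easier to see which of the two steps fails once the sizes grow.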

@swzCuroverse
Author

Unfortunately I need a large machine to do so. I could try to spin up an Amazon image, though. I can't do it on my local machine due to memory issues. I only see this issue on larger matrices.

@swzCuroverse
Author

If I send data and/or code that generates a random matrix that crashes, can you run it locally?

@ilayn
Member

ilayn commented Aug 26, 2018

Unfortunately I don't have such a machine, and I doubt any of us has.

You will have to find a way to check what is happening in your Docker container, for example by sshing in or looking at the journalctl output. We can only investigate the SciPy-related part. I would start with the logging options of the Docker containers.

@swzCuroverse
Author

Hmm, so how do you debug issues that only occur with large data? I can narrow it down to a matrix, but you will need a big-memory machine to reproduce it.

@ilayn
Member

ilayn commented Aug 26, 2018

The first step should be identifying the problem, to make sure that this is not a Docker container problem. If it turns out to be a SciPy issue, then the container surely logs something about it, whether a segfault, a MemoryError, etc.

As I mentioned above, I would first seek support for logging the failure in a Docker container, in the Docker docs or on Stack Overflow. The next step would be narrowing it down to SciPy, if that turns out to be the case. We don't even know what the problem is, so that's pretty much all we can help with.
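On the Python side, one possible way to tell a segfault from a MemoryError in a batch job is sketched below. It assumes the standard-library `faulthandler` module, which ships with Python 3.3 and later (on Python 2.7 it would have to come from a separately installed backport). The log path, the logger name, and the `run_logged` helper are all hypothetical names chosen for the example.

```python
import faulthandler
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("csr-debug")

# Hypothetical log location; the file must stay open while faulthandler is active.
crash_log_path = os.path.join(tempfile.gettempdir(), "csr_crash_traceback.log")
crash_log = open(crash_log_path, "w")

# On a fatal signal such as SIGSEGV, write the Python traceback to the file.
faulthandler.enable(file=crash_log, all_threads=True)


def run_logged(step_name, func):
    """Run func(), logging a MemoryError before re-raising it."""
    log.info("starting %s", step_name)
    try:
        result = func()
    except MemoryError:
        log.exception("MemoryError during %s", step_name)
        raise
    log.info("finished %s", step_name)
    return result


# Usage with a trivial stand-in for the real conversion step.
value = run_logged("toy step", lambda: sum(range(10)))
```

If the process is killed from outside (for example by the kernel's out-of-memory killer), neither mechanism gets a chance to write anything, so an empty log after a crash is itself a hint toward memory.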

@pv
Member

pv commented Aug 26, 2018

Is hstack numpy.hstack or scipy.sparse.hstack?
What are Xtrain.shape, Xtrain2.shape, Xtrain.nnz, Xtrain2.nnz?
What is Xtrain.dtype, Xtrain2.dtype, Xtrain.format, Xtrain2.format?
What is the index dtype? (.indptr.dtype for csr, and .row.dtype for coo?)
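A small sketch of how these properties could be collected in one go, shown on a toy matrix rather than the real `Xtrain`. The `describe` helper is a hypothetical name; the attributes it reads (`shape`, `nnz`, `dtype`, `format`, `indptr`, `indices`, `row`, `col`) are the standard ones on `scipy.sparse` csr/csc/coo matrices.

```python
import numpy as np
from scipy.sparse import coo_matrix


def describe(name, mat):
    """Collect shape, nnz, dtype, format and index dtypes of a sparse matrix."""
    info = {
        "name": name,
        "shape": mat.shape,
        "nnz": mat.nnz,
        "dtype": str(mat.dtype),
        "format": mat.format,
    }
    if mat.format in ("csr", "csc"):
        info["index dtypes"] = (str(mat.indptr.dtype), str(mat.indices.dtype))
    elif mat.format == "coo":
        info["index dtypes"] = (str(mat.row.dtype), str(mat.col.dtype))
    return info


# Toy matrix standing in for Xtrain.
toy = coo_matrix(
    (np.ones(3, dtype=np.uint16), ([0, 1, 2], [2, 0, 1])), shape=(3, 4)
)
print(describe("toy (coo)", toy))
print(describe("toy (csr)", toy.tocsr()))
```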

@swzCuroverse
Author

swzCuroverse commented Aug 26, 2018

The default format for the one-hot encoder is csr, so Xtrain and Xtrain2 are both csr. Both are made by the encoder (see code), which is set to sparse and uint16. I am using scipy's hstack. I can get the dimensions shortly and will get the index dtype for you as well.

@pv
Member

pv commented Aug 26, 2018

If you can log in to the instance, you can add import pdb; pdb.set_trace() before the problematic line, and then use the s command repeatedly to step into subroutines and find out precisely at what point it fails.

@swzCuroverse
Author

Unfortunately, it is running in batch mode on a compute platform. I can try to spin up an instance and debug interactively but it means setting it up outside the platform.

@swzCuroverse
Author

swzCuroverse commented Aug 27, 2018

@pv

The following are the values you asked about:

After the stacking:
Xtrain shape --- (3672, 5373156)
nnz for Xtrain -- 2656820520
row dtype -- int32
column dtype -- int32

I narrowed the crash down: it occurs when I convert the coo to csr, or when I do the stack specifying csr format.

Is there a problem where I need my indices to be in int64 for csr and they aren't for some reason?
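The arithmetic behind that question can be checked directly from the numbers reported above. The sketch below only compares them against the int32 range; it does not establish which index dtype SciPy actually picked in the failing code path. The one structural fact it relies on is that the last entry of a CSR `indptr` array equals `nnz`.

```python
import numpy as np

nnz = 2656820520                  # nnz reported above
n_rows, n_cols = 3672, 5373156    # shape reported above

int32_max = int(np.iinfo(np.int32).max)

# COO stores one row index and one column index per nonzero.
coo_indices_fit = (n_rows - 1 <= int32_max) and (n_cols - 1 <= int32_max)

# CSR stores cumulative nonzero counts in indptr, so indptr[-1] == nnz.
csr_indptr_fits = nnz <= int32_max

# What the final indptr entry would become if it were forced into int32.
wrapped = int(np.array([nnz], dtype=np.int64).astype(np.int32)[0])

print("COO indices fit in int32:", coo_indices_fit)
print("CSR indptr fits in int32:", csr_indptr_fits)
print("nnz wrapped to int32:", wrapped)
```

So int32 is enough for the coo row and column arrays at this shape, while a csr `indptr` for this many nonzeros needs a 64-bit type.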

I will get an image up on EC2 or Azure this week and see if I can interactively debug. It is just nearly impossible using the batch jobs.

@pv
Member

pv commented Aug 27, 2018 via email

@swzCuroverse
Author

Xtrain and Xtrain2 are csr matrices to begin with. The only reason I was doing it in parts was to try to catch the issue. I have a 400GB machine so 100GB of overhead shouldn't be a problem.

@swzCuroverse
Author

swzCuroverse commented Aug 27, 2018

So, the plot thickens. Converting using tocsr crashes but the following does not....

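    from scipy.sparse import csr_matrix, hstack

    # hstack returns a coo matrix here
    Xtrain = hstack([Xtrain, Xtrain2])
    shapeX = Xtrain.shape
    data = Xtrain.data
    # widen the coo indices from int32 to int64
    rXtrain = Xtrain.row.astype('int64')
    cXtrain = Xtrain.col.astype('int64')

    del Xtrain

    newX = csr_matrix((data, (rXtrain, cXtrain)), shape=shapeX)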
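A toy-sized sanity check of this construction, as a sketch: it confirms that building the csr from widened indices yields the same matrix as `tocsr()` on a small input. The arrays here are made up for illustration, and at this size both paths work, so it says nothing about why `tocsr()` crashes on the large matrix.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

# Toy COO matrix with int32 indices and no duplicate entries.
rows = np.array([0, 0, 1, 2], dtype=np.int32)
cols = np.array([0, 3, 1, 2], dtype=np.int32)
vals = np.array([1, 1, 1, 1], dtype=np.uint16)
coo = coo_matrix((vals, (rows, cols)), shape=(3, 4))

# Same construction as in the comment above: widen the indices, then build CSR.
rows64 = coo.row.astype("int64")
cols64 = coo.col.astype("int64")
new_csr = csr_matrix((coo.data, (rows64, cols64)), shape=coo.shape)

# Compare against the direct conversion via dense arrays (fine at toy size).
reference = coo.tocsr()
same = np.array_equal(new_csr.toarray(), reference.toarray())
print(new_csr.format, new_csr.shape, new_csr.nnz, same)
```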
