Issue when making large csr matrices #9180

swzCuroverse · 2018-08-26T11:32:16Z

I am having an issue when making a large csr matrix. I know this has been an issue before, so I made sure to use the latest release (I am running in a Docker container). I am working with large data, and when I attempt to make my final csr matrix, the Docker container goes down. I tried making the matrix in steps and checked that nothing was wrong with it (nothing seems to be), and it builds correctly in coo format. It is just the csr that is the issue. Because I am using the Docker container on a particular system which runs workflows in a cluster environment, I can't tell whether it is an issue with memory or a segmentation fault due to a code issue. I am on a node with 400GB of RAM, and at the moment of the crash it is using ~70GB, but usage is only tracked every few seconds, so a short memory spike could be the cause and go unrecorded.

Since the code involves data that is highly restricted, I can create a proxy and send it to you for debugging purposes. I am not sure of the best way to do so. The code itself is very simple at the point of the crash, but it will not be a reproduction step as required without the data.

I know this is a strange-ish request, so any help with this issue would be very, very helpful. I am stuck until then.

Some things to note:
The matrix is in uint16, which at some point had some bug reports, but I believe I had the same issue in doubles. The current code breaks it into pieces because I was trying to debug and see whether part of the matrix was giving me issues and/or the csr format.

Reproducing code example:

    enc = OneHotEncoder(sparse=True, dtype=np.uint16)
    m,n = tiledata.shape

    # 1-hot encoding tiled data (note: I broke it apart for debugging this issue, not necessary)
    halfmatrix = int(float(n)/2)
    Xtrain = enc.fit_transform(tiledata[:,0:n-halfmatrix])
    Xtrain2 = enc.fit_transform(tiledata[:,n-halfmatrix:n])

    del tiledata

    # This is fine in coo but crashes in csr
    Xtrain = hstack([Xtrain,Xtrain2])

Error message:

Scipy/Numpy/Python version information:

('1.1.0', '1.15.1', sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0))

ilayn · 2018-08-26T12:20:47Z

Can you try to replicate this locally using random data? Without the information about the nature of the crash, I don't know how we can help other than a lucky coincidence of someone else having a similar problem and knowing the reason.

swzCuroverse · 2018-08-26T12:23:58Z

Unfortunately I need a large machine to do so, I could try to spin up an Amazon image though. I can't do it on my local machine due to memory issues. I only see this issue on larger matrices.

swzCuroverse · 2018-08-26T12:25:06Z

If I send data and/or code that generates a random matrix that crashes can you run it locally?

ilayn · 2018-08-26T13:35:57Z

Unfortunately I don't have such a machine, and I doubt any of us has.

You will have to find a way to check what is happening in your Docker container, for example by sshing in or looking at the journalctl output. We can only investigate the SciPy-related part. I would start with the logging options of the Docker containers.

swzCuroverse · 2018-08-26T13:45:11Z

Hmm, so how do you debug issues that only relate to large data? I can narrow it down to a matrix, but you will need a large-memory machine to reproduce it.

ilayn · 2018-08-26T13:52:30Z

The first step should be identifying the problem, to make sure that this is not a Docker container problem. If it turns out to be a SciPy issue, then the container surely logs something about it, whether a segfault, a MemoryError, etc.

As I mentioned above, I would first seek support for logging the failure in a Docker container, in the Docker pages or on Stack Overflow. The next step would be narrowing it down to SciPy, if that turns out to be the case. We don't even know what the problem is, hence that's pretty much all we can help with.

pv · 2018-08-26T14:12:32Z

Is hstack numpy.hstack or scipy.sparse.hstack?
What are Xtrain.shape, Xtrain2.shape, Xtrain.nnz, Xtrain2.nnz?
What is Xtrain.dtype, Xtrain2.dtype, Xtrain.format, Xtrain2.format?
What is the index dtype? (.indptr.dtype for csr, and .row.dtype for coo?)
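
The values asked about above can be read directly off a SciPy sparse matrix. The following is a minimal sketch, assuming SciPy's `coo`/`csr`/`csc` formats; the helper name `describe` and the small `demo` matrix are hypothetical stand-ins, not part of the reporter's code.

```python
import numpy as np
from scipy import sparse


def describe(name, m):
    """Collect format, shape, nnz, value dtype and index dtype of a sparse matrix."""
    info = {
        "name": name,
        "format": m.format,
        "shape": m.shape,
        "nnz": m.nnz,
        "dtype": m.dtype,
    }
    if m.format == "coo":
        # COO stores explicit row/col coordinate arrays
        info["index_dtype"] = m.row.dtype
    elif m.format in ("csr", "csc"):
        # CSR/CSC store a compressed pointer array plus an indices array
        info["index_dtype"] = m.indptr.dtype
    return info


# Hypothetical small stand-in for Xtrain / Xtrain2
demo = sparse.csr_matrix(np.eye(4, dtype=np.uint16))
print(describe("demo", demo))
```

Calling `describe("Xtrain", Xtrain)` and `describe("Xtrain2", Xtrain2)` just before the `hstack` line would log everything requested, which also works in a batch job.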

swzCuroverse · 2018-08-26T14:28:00Z

The default output format for the one-hot encoder is csr, so Xtrain and Xtrain2 are both csr. Both are made by the encoder (see code), which is set to sparse and uint16. I am using scipy's hstack. I can get the dimensions shortly and will get the index dtype for you as well.

pv · 2018-08-26T14:33:29Z

If you can log in to the instance, you can add `import pdb; pdb.set_trace()` before the problematic line, and then use the `s` command repeatedly to step into subroutines and find out precisely at what point it fails.

swzCuroverse · 2018-08-26T14:35:55Z

Unfortunately, it is running in batch mode on a compute platform. I can try to spin up an instance and debug interactively but it means setting it up outside the platform.

swzCuroverse · 2018-08-27T00:08:15Z

@pv

The following are the values you asked about:

After the stacking:
Xtrain shape --- (3672, 5373156)
nnz for Xtrain -- 2656820520
row dtype -- int32
column dtype -- int32

I narrowed the crash down: it occurs when I convert the coo to csr, or when I do the stack specifying csr format.

Is there a problem where I need my indices to be in int64 for csr and they aren't for some reason?
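
As a quick arithmetic check on the numbers reported above: the row and column coordinates fit in int32, but nnz itself does not. Since the last entry of a CSR `indptr` array equals nnz, the CSR form of this matrix cannot use int32 index arrays. A small sketch, using only the shape and nnz quoted in this comment:

```python
import numpy as np

# Values reported above for the stacked matrix
n_rows, n_cols = 3672, 5373156
nnz = 2656820520

int32_max = np.iinfo(np.int32).max  # largest value an int32 can hold

# COO row/col arrays only store coordinates, which are bounded by the shape
fits_coordinates = max(n_rows, n_cols) <= int32_max

# CSR indptr stores running counts of stored entries, ending at nnz
fits_indptr = nnz <= int32_max

print("coordinates fit in int32:", fits_coordinates)
print("indptr fits in int32:", fits_indptr)
```

This does not show where the crash comes from; it only indicates that an int32-to-int64 upcast has to happen somewhere on the path from coo to csr.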

I will get an image up on EC2 or Azure this week and see if I can interactively debug. It is just nearly impossible using the batch jobs.

pv · 2018-08-27T00:53:44Z

Does `a = Xtrain.tocsr(); b = Xtrain.tocsr()` also crash? There shouldn't be a problem with int32 indices, as they will be upcast in the operation. However, with these matrix sizes, the two coo->csr conversions require ~100GB additional memory allocation, and passing csr into hstack probably ends up with csr->coo->csr conversions.
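
The memory cost of such conversions can be approximated from the reported nnz. The following is a rough sketch, assuming uint16 data and int64 indices after upcasting. The helper `sparse_bytes` is hypothetical, it ignores temporary buffers (so real peak usage may be higher), and it is not necessarily how the ~100GB figure above was derived.

```python
def sparse_bytes(nnz, n_rows, data_itemsize, index_itemsize):
    """Approximate storage in bytes for one COO and one CSR copy of a matrix."""
    # COO: one data value plus a row and a column index per stored entry
    coo = nnz * (data_itemsize + 2 * index_itemsize)
    # CSR: one data value plus a column index per stored entry,
    # and an indptr array with n_rows + 1 entries
    csr = nnz * (data_itemsize + index_itemsize) + (n_rows + 1) * index_itemsize
    return coo, csr


GIB = 1024 ** 3
coo_bytes, csr_bytes = sparse_bytes(
    nnz=2656820520,    # reported nnz
    n_rows=3672,       # reported number of rows
    data_itemsize=2,   # uint16
    index_itemsize=8,  # int64 (assumed after upcasting)
)
print("COO copy: %.1f GiB" % (coo_bytes / GIB))
print("CSR copy: %.1f GiB" % (csr_bytes / GIB))
```

Each extra copy held alive during a csr->coo->csr round trip adds on top of the input matrices, which is why several conversions in a row can add up quickly.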

swzCuroverse · 2018-08-27T10:45:35Z

Xtrain and Xtrain2 are csr matrices to begin with. The only reason I was doing it in parts was to try to catch the issue. I have a 400GB machine so 100GB of overhead shouldn't be a problem.

swzCuroverse · 2018-08-27T14:27:23Z

So, the plot thickens. Converting using tocsr crashes but the following does not....

    Xtrain = hstack([Xtrain,Xtrain2])
    shapeX = Xtrain.shape
    data = Xtrain.data
    rXtrain = Xtrain.row.astype('int64')
    cXtrain = Xtrain.col.astype('int64')

    del Xtrain

    newX = csr_matrix((data, (rXtrain, cXtrain)), shape=shapeX)
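
For anyone wanting to try the workaround above at a small scale, the following sketch builds a CSR matrix from explicitly int64-cast COO coordinates and checks that it holds the same values as a direct `tocsr()` conversion. The tiny matrices `a` and `b` are made-up stand-ins; this cannot reproduce the crash, which only appears at large nnz.

```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix, hstack

# Hypothetical small uint16 blocks standing in for Xtrain and Xtrain2
a = coo_matrix(np.array([[1, 0, 2], [0, 3, 0]], dtype=np.uint16))
b = coo_matrix(np.array([[0, 4], [5, 0]], dtype=np.uint16))

# Request COO explicitly so that .row and .col are available
stacked = hstack([a, b], format="coo")

# Same steps as the workaround: cast coordinates to int64, then build CSR
rows = stacked.row.astype(np.int64)
cols = stacked.col.astype(np.int64)
rebuilt = csr_matrix((stacked.data, (rows, cols)), shape=stacked.shape)

# Direct conversion, which is the path that crashes on the large matrix
direct = stacked.tocsr()

same = np.array_equal(rebuilt.toarray(), direct.toarray())
print("same values:", same)
```

Note that SciPy may choose a smaller index dtype for the result when the indices fit, so the int64 cast here only mirrors the workaround's steps rather than guaranteeing int64 storage.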

ilayn added the scipy.sparse label Aug 26, 2018
