
Out of memory error on Power 8 IBM machine #6

Closed
sheltongeosx opened this issue Sep 24, 2020 · 14 comments
@sheltongeosx

@dingwentao, @MauricioAP, @jiemeng-total

Hi Dingwen,
Thank you very much for letting me know that the cuda version has been released!
Here are a couple of issues/questions:

   1. It is still not practical for our users, since per the doc there is no standalone decompression yet.
   2. The compression operation generates the following four files. Which one should be used to compute the compression ratio (I assume .b16.h)?
          .b16.outlier, .b16meta, .b16.dh, and .b16.cHcb
   3. Please provide an option for the user to choose where to output the compressed data.
   4. There seems to be a problem running on IBM Power 8 machines. The environment and error messages are as follows:
      OS:  Red Hat Enterprise Linux Server 7.4 (Maipo)
      compiler: gcc/7.3.0
      CUDA: 10.1
      GPU:  NVIDIA Tesla P100, memory: 16 GB
      
      **Commands/error for small data size:**
      $ cusz -z -f32 -m r2r -e 0.0001 -i  mytestinput_38.dat  -3 1601 5850 38
      [info] datum:           mytestinput_38.dat (1423609200 bytes) of type f32
      [info] quant.capacity:  1024
      [info] input eb:        0.0001 x 10^(0) = 0.0001
      [info] eb change:       0.0001 (input eb) x 1405.54 (rng) = 0.140554 (relative-to-value-range, r2r mode)
      [dbg]  exponent = 0.000 (base10) (or) -13.288 (base2)
      [dbg]  uint16_t to represent quant. code, uint32_t internal Huffman bitstream
      [dbg]  original len: 355902300, m the padded: 18866, mxm: 355925956

      [info] Commencing compression...
      [info] nnz.outlier:     355902297       (100%)
      [info] entropy:         0
      terminate called after throwing an instance of 'thrust::system::system_error'
       what():  radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function
       Aborted

      **Command for large data size:**
     $ cusz -z -f32 -m r2r -e 0.0001 -i  mytestinput_150.dat  -3 1601 5850 150
     See error messages in the attached file

cusz_errmsg.txt

Best,
Shelton

@dingwentao
Contributor

dingwentao commented Sep 24, 2020

@jtian0 @codyjrivera @sheltongeosx

Hi Shelton,

Thanks for testing. For your questions:

  1. Our next release, v0.2, will decouple compression and decompression. We aim to release it in early October.
  2. Please calculate the compression ratio based on all four files, as mentioned in the limitations section of the README. Our next release, v0.2, will introduce an archive format that bundles all four files. Again, we aim to release v0.2 in early October.
  3. Yes, we will add an option to specify the location of the compressed/decompressed data.
  4. We're looking into the bug you reported.

For now, please test cuSZ to understand the compression quality and throughput on your datasets. We are actively developing these features, which will be released in October. Thanks for your understanding about the delay.
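For point 2, a compression-ratio calculation over multiple output artifacts can be sketched as follows; the helper function and the example artifact sizes are hypothetical illustrations, not part of cuSZ:

```python
def compression_ratio(original_bytes, artifact_sizes):
    # Ratio = original size / total size of ALL compressed artifacts,
    # since every artifact is needed for decompression.
    return original_bytes / sum(artifact_sizes)

# Hypothetical byte counts for the four artifacts
# (.b16.outlier, .b16meta, .b16.dh, .b16.cHcb), against the
# 1423609200-byte datum reported in the log above:
ratio = compression_ratio(1_423_609_200, [50_000_000, 1_000_000, 300_000_000, 4_000])
```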

Hi @codyjrivera ,

It seems we have a bug in radix_sort function for Power8+P100 machine. Do you have any idea on this?

Thanks,
Dingwen

@jtian0 jtian0 added the bug Something isn't working label Sep 24, 2020
@jtian0
Collaborator

jtian0 commented Sep 25, 2020

@codyjrivera @dingwentao This seems to be a known bug in the Thrust shipped with CUDA 10: NVIDIA/thrust#936 (CUDA 10 + P2000 w/ sm61). Note that the P100 is sm60.

@dingwentao
Contributor

dingwentao commented Sep 27, 2020

Hi @sheltongeosx @MauricioAP @jiemeng-total

We have released cuSZ v0.1.1. Please check it out. The major changes include:

  • Compression and decompression have been decoupled
  • Use the --opath option to specify the compressed and decompressed file path

For the reported bug, as pointed out by @jtian0, it seems to come from Thrust with CUDA 10 on the P100. We are testing cuSZ with CUDA 9 on the P100 to see if that resolves the issue.

Thanks,
Dingwen

@sheltongeosx
Author

@dingwentao, @MauricioAP, @jiemeng-total

Hi Dingwen, thank you very much for your updates!

Here is what I experienced running the new code:

1.  Decompression:
     It did produce a decompressed .szx file, but it encountered a segmentation fault. The attached file is the error log.

2.  Compression:
    We observed that the compression operation produces 5 output files with a total size greater than the input data size, which is of course not what it should be. Here is my test case:
           input data: 
                mytestinput_150_case3.dat            5619510000

           compressed data: 
                 mytestinput_150_case3.dat.canon     2304
                 mytestinput_150_case3.dat.hbyte     849097724
                 mytestinput_150_case3.dat.hmeta     43902432
                 mytestinput_150_case3.dat.outlier   6959323260 
                 mytestinput_150_case3.dat.yamp      240        
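The expansion above can be confirmed with simple arithmetic on the sizes quoted in the listing (no cuSZ internals assumed):

```python
input_bytes = 5_619_510_000  # mytestinput_150_case3.dat

# Byte counts of the five compression artifacts, from the listing above
artifacts = {
    "canon":   2_304,
    "hbyte":   849_097_724,
    "hmeta":   43_902_432,
    "outlier": 6_959_323_260,
    "yamp":    240,
}

total = sum(artifacts.values())  # total "compressed" output in bytes
ratio = input_bytes / total      # < 1 means the output is larger than the input
```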

cusz_new.txt

Best,
Shelton

@dingwentao
Contributor

dingwentao commented Oct 5, 2020

Hi Shelton,

Thanks for your follow-up. Yes, we noticed this issue when testing with GCC 6. Are you using GCC 6 with cuSZ? If so, can you try GCC 7.3? As mentioned in the README, cuSZ requires GCC 7.3+. We'll soon investigate what triggers the cuSZ free() error with GCC 6.

Thanks!
Dingwen

@sheltongeosx
Author

Hi Dingwen,

I am using gcc/7.3.0 and cuda/10.1.

Shelton

@dingwentao
Contributor

Hi Shelton,

Thanks for the information.

Yes, you're right. It doesn't make any sense that *.outlier is larger than *.dat, since *.outlier stores only the unpredictable data, which should be a small portion of the original. Would you be willing to share the input data (if it's not sensitive) and your compression configuration to help us investigate the bug? Please contact me directly via dingwen.tao@wsu.edu. Thanks!

Best,
Dingwen

@codyjrivera
Collaborator

Hi @sheltongeosx @MauricioAP @jiemeng-total,

We have released cuSZ v0.1.3. Please check it out. The two main updates are as follows:

  • cuSZ now supports the NVIDIA Pascal microarchitecture.
  • All compressed files are archived into a single *.sz file.

The bug that broke cuSZ on Pascal GPUs turned out to be unrelated to Thrust (which we first suspected), and we fixed our code accordingly. I have tested the new release on a few different Pascal GPUs with CUDA 10.1 and CUDA 11.0 (see README.md for more details), though both test systems are x86.

I hope this new release fixes the issue you brought up in your first message.

Best,
Cody

@dingwentao
Contributor

Hi @sheltongeosx @MauricioAP @jiemeng-total,

The -gzip option has been added. If you only want to test cuSZ's performance on the GPU, please ignore this option. If you want to test whether cuSZ can achieve a compression ratio similar to CPU SZ, please enable it.

Thanks,
Dingwen

@sheltongeosx
Author

@dingwentao @MauricioAP @jiemeng-total

Hi Dingwen,
Just got time to test Parihaka_PSTM_far_stack.sgy with the latest commit (6c69f15, 2020-10-29). Here is what we have seen:

    1. Running without -gzip: -r2r -e=0.000001
        The .sz file produced is larger than Parihaka_PSTM_far_stack.sgy, so we consider this case not applicable for us.
    2. Running without -gzip: -r2r -e=0.00001
        Compression ratio = 3.61, but it took 35.026 seconds to run, which is still too long for this data size.
    3. Running with -gzip:
        It took very long to finish.

Hope this is helpful.

Best,
Shelton

@jtian0 jtian0 closed this as completed Nov 2, 2020
@jtian0 jtian0 reopened this Nov 2, 2020
@jtian0
Collaborator

jtian0 commented Nov 2, 2020

@sheltongeosx
Hi Shelton,
The reasonable CR = 3.61 but 35-second runtime may be because the optimal number of concurrent CUDA threads for Huffman encoding is not found automatically, and because the default settings are generally tuned for small datasets. Let me see if I can push a patch quickly.

@jtian0 jtian0 self-assigned this Nov 2, 2020
@dingwentao
Contributor

dingwentao commented Nov 6, 2020

Hi @sheltongeosx @MauricioAP @jiemeng-total,

Please check out our last commit 81248dd. Below is the output tested on V100 GPU.

Compression with breakdown time:

> time ./bin/cusz -f32 -m r2r -e 1e-4 -i ~/Parihaka_PSTM_far_stack.f32 -D parihaka -z

[info] datum:		/path/to/Parihaka_PSTM_far_stack.f32 (4850339584 bytes) of type f32
[dbg]  original len:	1212584896 (padding: 34823)
[dbg]  Time loading data:	1.69805s
[info] quant.cap:	1024	input eb:	0.0001
[dbg]  Time inspecting data range:	0.0119826s
[info] eb change:	(input eb) x 12342.2 (rng) = 1.23422 (relative-to-range)
[dbg]  2-byte quant type, 4-byte internal Huff type

[info] Commencing compression...
[info] nnz.outlier:	16607	(0.00136955%)
[dbg]  Optimal Huffman deflating chunksize	32768
[info] entropy:		3.85809
[dbg]  Huffman enc:	#chunk=37006, chunksze=32768 => 212269553 4-byte words/6792051551 bits
[dbg]  Time writing Huff. binary:	0.730165s
[info] Compression finished, saved Huffman encoded quant.code.
[dbg]  Time tar'ing	1.1637s
[info] Written to:	/path/to//Parihaka_PSTM_far_stack.f32.sz
./bin/cusz -f32 -m r2r -e 1e-4 -i ~/Parihaka_PSTM_far_stack.f32 -D parihaka -  1.24s user 2.80s system 70% cpu 5.687 total

So, 5.687 s in total: 3.592 s for loading data, writing the Huffman binary, and tar'ing, and 2.095 s for the "real" cuSZ. Since that 2.095 s includes a one-time CUDA host-malloc cost (~1 second), performance will improve considerably once the cuSZ library API is available. We're actively working on cuSZ API and library support.
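The breakdown can be checked against the logged timings (the three component times come from the [dbg] lines in the output above; the total is the shell's `time` wall clock):

```python
load  = 1.69805   # [dbg] Time loading data
huff  = 0.730165  # [dbg] Time writing Huff. binary
tar   = 1.1637    # [dbg] Time tar'ing
total = 5.687     # wall-clock total from `time`

io_side = load + huff + tar      # ~3.592 s of I/O and packaging
kernel_side = total - io_side    # ~2.095 s for the "real" cuSZ work
```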

Verified decompression:

> ./bin/cusz -i ~/Parihaka_PSTM_far_stack.f32.sz -x --origin ~/Parihaka_PSTM_far_stack.f32 --skip write.x

[info] Commencing decompression...
[info] Huffman decoding into quant.code.
[info] Extracted outlier from CSR format.
[info] Decompression finished.

[info] Huffman metadata of chunking and reverse codebook size (in bytes): 594400
[info] Huffman coded output size: 849078212
[info] To compare with the original datum

[info] Verification start ---------------------
| min.val             -6893.359375
| max.val             5448.8828125
| val.rng             12342.2421875
| max.err.abs.val     1.2342529296875
| max.err.abs.idx     86493
| max.err.vs.rng      0.00010000232623352093317
| max.pw.rel.err      1
| PSNR                84.882560160759624068
| NRMSE               5.6999624146827599324E-05
| correl.coeff        0.99999687790253855013
| comp.ratio.w/o.gzip 5.706653
[info] Verification end -----------------------

[info] Decompressed file is written to /path/to/Parihaka_PSTM_far_stack.f32.szx.
[info] Please use compressed data (*.sz) to calculate final comp ratio (w/ gzip).
[info] Skipped writing unzipped to filesystem.
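As a sanity check, the verification report is internally consistent: PSNR = -20·log10(NRMSE) when NRMSE is normalized by the value range, and max.err.vs.rng = max.err.abs.val / val.rng. A quick check using the figures copied from the report above:

```python
import math

val_rng     = 12342.2421875            # val.rng
max_abs_err = 1.2342529296875          # max.err.abs.val
nrmse       = 5.6999624146827599324e-05  # NRMSE

psnr = -20 * math.log10(nrmse)         # ~84.8826 dB, matching the reported PSNR
err_vs_rng = max_abs_err / val_rng     # ~1.00002e-4, matching max.err.vs.rng
```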

Please let us know if this will work on your P100 and Parihaka dataset. Thanks for your help!

Best,
Dingwen

@sheltongeosx
Author

sheltongeosx commented Nov 6, 2020

@dingwentao @MauricioAP @jiemeng-total,

Hi Dingwen,
First of all, it looks like this version works on IBM Power8. The total runtime for this case is around 10 seconds, a big improvement in runtime. Here is the printout:

 /home/sma/src/cuSZ_81248dd/build/bin/cusz  -z -f32 -m r2r -e 0.00001 -i /home/sma/data/cuSZ/Parihaka_PSTM_far_stack.H@  -3 1168 1126 922 --opath /home/sma/data/cuSZ/IBM_far_stack_comp_nogzip_0.00001 
[info] datum:           /home/sma/data/cuSZ/Parihaka_PSTM_far_stack.H@ (4850339584 bytes) of type f32
[dbg]  original len:    1212584896 (padding: 34823)
[dbg]  Time loading data:       2.17847s
[info] quant.cap:       1024    input eb:       1e-05
[dbg]  Time inspecting data range:      0.0169115s
[info] eb change:       (input eb) x 1216.85 (rng) = 0.0121685 (relative-to-range)
[dbg]  2-byte quant type, 4-byte internal Huff type

[info] Commencing compression...
[info] nnz.outlier:     474066415       (39.0955%)
[dbg]  Optimal Huffman deflating chunksize      65536
[info] entropy:         4.71345
[dbg]  Huffman enc:     #chunk=18503, chunksze=65536 => 259772045 4-byte words/8312416913 bits
[dbg]  Time writing Huff. binary:       0.220607s
[info] Compression finished, saved Huffman encoded quant.code.
[dbg]  Time tar'ing     4.23726s
[info] Written to:      /home/sma/data/cuSZ/IBM_far_stack_comp_nogzip_0.00001/Parihaka_PSTM_far_stack.H@.sz

But the issue is that the data has hardly been compressed:

4832071680 Nov  6 08:41 Parihaka_PSTM_far_stack.H@.sz
4850339584 Nov  6 09:15 Parihaka_PSTM_far_stack.H@.szx

It is even worse for -e=0.000001:

8401100800 Nov  6 08:42 Parihaka_PSTM_far_stack.H@.sz
4850339584 Nov  6 09:15 Parihaka_PSTM_far_stack.H@.szx
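In ratio terms, using the sizes listed above (the .szx decompressed file matches the 4850339584-byte original input), the 1e-5 run barely compresses and the 1e-6 run actually expands:

```python
original_bytes = 4_850_339_584  # Parihaka_PSTM_far_stack.H@ input size

ratio_1e5 = original_bytes / 4_832_071_680  # ~1.004 at e = 0.00001
ratio_1e6 = original_bytes / 8_401_100_800  # ~0.577 at e = 0.000001: output larger than input
```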

Hope you are able to reproduce the issues.

Best
Shelton

@dingwentao
Copy link
Contributor

dingwentao commented Nov 6, 2020

Hi @sheltongeosx,

Thanks for your quick feedback.

Yes, based on our observations, a relative error bound of 1e-4 on this dataset can only provide a compression ratio of about 5, so worse ratios with 1e-5 or 1e-6 are to be expected. I'm not sure if you've tested the data with CPU SZ; based on our preliminary experiments, this dataset is very hard to compress with SZ as well. Since cuSZ is a variant of SZ focused on improving compression speed, I recommend you work with our SZ team (szlossycompressor@gmail.com) to improve the compression algorithm for a higher ratio first. cuSZ can then incorporate that algorithm.

Again, thanks a lot for helping us solve several issues for large and hard-to-compress data.

Best,
Dingwen
