
Out of memory error on Power 8 IBM machine #6

Closed
sheltongeosx opened this issue Sep 24, 2020 · 14 comments
@sheltongeosx

@dingwentao, @MauricioAP, @jiemeng-total

Hi Dingwen,
Thank you very much for letting me know that the cuda version has been released!
Here are a couple of issues/questions:

   1. It is still not practical for our users, since per the doc there is no standalone decompression yet.
   2. The compression operation generates the following four files. Which one should be used to compute the compression ratio (I assume .b16.h)?
          .b16.outlier, .b16meta, .b16.dh, and .b16.cHcb
   3. Please provide an option for the user to choose where to output the compressed data.
   4. There seems to be a problem running on IBM Power 8 machines. The environment and error messages are as follows:
      OS:  Red Hat Enterprise Linux Server 7.4 (Maipo)
      compiler: gcc/7.3.0
      CUDA: 10.1
      GPU:  NVIDIA Tesla P100, memory: 16 GB
      
      **Commands/error for small data size:**
      $ cusz -z -f32 -m r2r -e 0.0001 -i  mytestinput_38.dat  -3 1601 5850 38
      [info] datum:           mytestinput_38.dat (1423609200 bytes) of type f32
      [info] quant.capacity:  1024
      [info] input eb:        0.0001 x 10^(0) = 0.0001
      [info] eb change:       0.0001 (input eb) x 1405.54 (rng) = 0.140554 (relative-to-value-range, r2r mode)
      [dbg]  exponent = 0.000 (base10) (or) -13.288 (base2)
      [dbg]  uint16_t to represent quant. code, uint32_t internal Huffman bitstream
      [dbg]  original len: 355902300, m the padded: 18866, mxm: 355925956

      [info] Commencing compression...
      [info] nnz.outlier:     355902297       (100%)
      [info] entropy:         0
      terminate called after throwing an instance of 'thrust::system::system_error'
       what():  radix_sort: failed on 1st step: cudaErrorInvalidDeviceFunction: invalid device function
       Aborted

      **Command for large data size:**
     $ cusz -z -f32 -m r2r -e 0.0001 -i  mytestinput_150.dat  -3 1601 5850 150
     See error messages in the attached file

cusz_errmsg.txt

Best,
Shelton

@dingwentao
Contributor

dingwentao commented Sep 24, 2020

@jtian0 @codyjrivera @sheltongeosx

Hi Shelton,

Thanks for testing. For your questions:

  1. Our next release, v0.2, will decouple compression and decompression. We aim to release it in early October.
  2. Please calculate the compression ratio based on all four files, as mentioned in the limitations section of the README. Our next release, v0.2, will introduce an archive format that bundles all four files. Again, we aim to release v0.2 in early October.
  3. Yes, we will add an option to specify the location of the compressed/decompressed data.
  4. We're looking into the bug you reported.

For now, please test cuSZ to understand the compression quality and throughput on your datasets. We are actively developing these features, which will be released in October. Thanks for your understanding about the delay.
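For point 2, a compression-ratio calculation over multiple output artifacts can be sketched as follows; the helper function and the example artifact sizes are hypothetical illustrations, not part of cuSZ:

```python
def compression_ratio(original_bytes, artifact_sizes):
    # Ratio = original size / total size of ALL compressed artifacts,
    # since every artifact is needed for decompression.
    return original_bytes / sum(artifact_sizes)

# Hypothetical byte counts for the four artifacts
# (.b16.outlier, .b16meta, .b16.dh, .b16.cHcb), against the
# 1423609200-byte datum reported in the log above:
ratio = compression_ratio(1_423_609_200, [50_000_000, 1_000_000, 300_000_000, 4_000])
```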

Hi @codyjrivera ,

It seems we have a bug in radix_sort function for Power8+P100 machine. Do you have any idea on this?

Thanks,
Dingwen

@jtian0 jtian0 added the bug Something isn't working label Sep 24, 2020
@jtian0
Collaborator

jtian0 commented Sep 25, 2020

@codyjrivera @dingwentao This seems to be a known bug in the Thrust shipped with CUDA 10: NVIDIA/thrust#936 (CUDA 10 + P2000 w/ sm61). Note that the P100 is sm60.

@dingwentao
Contributor

dingwentao commented Sep 27, 2020

Hi @sheltongeosx @MauricioAP @jiemeng-total

We have released cuSZ v0.1.1. Please check it out. The major changes include:

  • Compression and decompression have been decoupled
  • Use the --opath option to specify the compressed and decompressed file path

For the reported bug, as pointed out by @jtian0, it seems to come from Thrust with CUDA 10 on the P100. We are testing cuSZ with CUDA 9 on the P100 to see if that resolves the issue.

Thanks,
Dingwen

@sheltongeosx
Author

@dingwentao, @MauricioAP, @jiemeng-total

Hi Dingwen, thank you very much for your updates!

Here is what I experienced running the new code:

1.  Decompression:
     It did produce a decompressed .szx file, but it encountered a segmentation fault. The attached file is the error log.

2.  Compression:
    We observed that the compression operation produces 5 output files with a total size greater than the input data size, which is of course not what it should be. Here is my test case:
           input data: 
                mytestinput_150_case3.dat            5619510000

           compressed data: 
                 mytestinput_150_case3.dat.canon     2304
                 mytestinput_150_case3.dat.hbyte     849097724
                 mytestinput_150_case3.dat.hmeta     43902432
                 mytestinput_150_case3.dat.outlier   6959323260 
                 mytestinput_150_case3.dat.yamp      240        
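The expansion above can be confirmed with simple arithmetic on the sizes quoted in the listing (no cuSZ internals assumed):

```python
input_bytes = 5_619_510_000  # mytestinput_150_case3.dat

# Byte counts of the five compression artifacts, from the listing above
artifacts = {
    "canon":   2_304,
    "hbyte":   849_097_724,
    "hmeta":   43_902_432,
    "outlier": 6_959_323_260,
    "yamp":    240,
}

total = sum(artifacts.values())  # total "compressed" output in bytes
ratio = input_bytes / total      # < 1 means the output is larger than the input
```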

cusz_new.txt

Best,
Shelton

@dingwentao
Contributor

dingwentao commented Oct 5, 2020

Hi Shelton,

Thanks for your follow-up. Yes, we noticed this issue when testing with GCC 6. Are you using GCC 6 with cuSZ? If so, can you try GCC 7.3? As mentioned in the README, cuSZ requires GCC 7.3+. We'll soon investigate what triggers the cuSZ free() error with GCC 6.

Thanks!
Dingwen

@sheltongeosx
Author

Hi Dingwen,

I am using gcc/7.3.0 and cuda/10.1.

Shelton

@dingwentao
Contributor

Hi Shelton,

Thanks for the information.

Yes, you're right. It doesn't make any sense that *.outlier is larger than *.dat, since *.outlier stores only the unpredictable data, which should be a small portion of the original. Would you be willing to share the input data (if it's not sensitive) and your compression configuration to help us investigate the bug? Please contact me directly via dingwen.tao@wsu.edu. Thanks!

Best,
Dingwen

@codyjrivera
Collaborator

Hi @sheltongeosx @MauricioAP @jiemeng-total,

We have released cuSZ v0.1.3. Please check it out. The two main updates are as follows:

  • cuSZ now supports the NVIDIA Pascal microarchitecture.
  • All compressed files are archived into a single *.sz file.

The bug that broke cuSZ on Pascal GPUs turned out to be unrelated to Thrust (which we first suspected), and we fixed our code accordingly. I have tested the new release on a few different Pascal GPUs with CUDA 10.1 and CUDA 11.0 (see README.md for more details), though both test systems are x86.

I hope this new release fixes the issue you brought up in your first message.

Best,
Cody

@dingwentao
Contributor

Hi @sheltongeosx @MauricioAP @jiemeng-total,

The -gzip option has been added. If you only want to test cuSZ's performance on the GPU, please ignore this option. If you want to test whether cuSZ can achieve a compression ratio similar to CPU SZ, please enable it.

Thanks,
Dingwen

@sheltongeosx
Author

@dingwentao @MauricioAP @jiemeng-total

Hi Dingwen,
Just got time to test Parihaka_PSTM_far_stack.sgy with the latest commit (6c69f15, 2020-10-29). Here is what we have seen:

    1. Running without -gzip: -r2r -e=0.000001
        The .sz file produced is larger than Parihaka_PSTM_far_stack.sgy, so we consider this case not applicable for us.
    2. Running without -gzip: -r2r -e=0.00001
        Compression ratio = 3.61, but it took 35.026 seconds to run, which is still too long for this data size.
    3. Running with -gzip:
        It took very long to finish.

Hope this is helpful.

Best,
Shelton

@jtian0 jtian0 closed this as completed Nov 2, 2020
@jtian0 jtian0 reopened this Nov 2, 2020
@jtian0
Collaborator

jtian0 commented Nov 2, 2020

@sheltongeosx
Hi Shelton,
The reasonable CR = 3.61 but 35-second runtime may be because the optimal number of concurrent CUDA threads for Huffman encoding is not found automatically, and because the default settings are generally tuned for small datasets. Let me see if I can push a patch quickly.

@jtian0 jtian0 self-assigned this Nov 2, 2020
@dingwentao
Contributor

dingwentao commented Nov 6, 2020

Hi @sheltongeosx @MauricioAP @jiemeng-total,

Please check out our last commit 81248dd. Below is the output tested on V100 GPU.

Compression with breakdown time:

> time ./bin/cusz -f32 -m r2r -e 1e-4 -i ~/Parihaka_PSTM_far_stack.f32 -D parihaka -z

[info] datum:		/path/to/Parihaka_PSTM_far_stack.f32 (4850339584 bytes) of type f32
[dbg]  original len:	1212584896 (padding: 34823)
[dbg]  Time loading data:	1.69805s
[info] quant.cap:	1024	input eb:	0.0001
[dbg]  Time inspecting data range:	0.0119826s
[info] eb change:	(input eb) x 12342.2 (rng) = 1.23422 (relative-to-range)
[dbg]  2-byte quant type, 4-byte internal Huff type

[info] Commencing compression...
[info] nnz.outlier:	16607	(0.00136955%)
[dbg]  Optimal Huffman deflating chunksize	32768
[info] entropy:		3.85809
[dbg]  Huffman enc:	#chunk=37006, chunksze=32768 => 212269553 4-byte words/6792051551 bits
[dbg]  Time writing Huff. binary:	0.730165s
[info] Compression finished, saved Huffman encoded quant.code.
[dbg]  Time tar'ing	1.1637s
[info] Written to:	/path/to//Parihaka_PSTM_far_stack.f32.sz
./bin/cusz -f32 -m r2r -e 1e-4 -i ~/Parihaka_PSTM_far_stack.f32 -D parihaka -  1.24s user 2.80s system 70% cpu 5.687 total

So, 5.687 s in total: 3.592 s for loading data, writing the Huffman binary, and tar'ing, and 2.095 s for the "real" cuSZ. Since that 2.095 s includes a one-time CUDA host-malloc cost (~1 second), performance will improve considerably once the cuSZ library API is available. We're actively working on cuSZ API and library support.
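The breakdown can be checked against the logged timings (the three component times come from the [dbg] lines in the output above; the total is the shell's `time` wall clock):

```python
load  = 1.69805   # [dbg] Time loading data
huff  = 0.730165  # [dbg] Time writing Huff. binary
tar   = 1.1637    # [dbg] Time tar'ing
total = 5.687     # wall-clock total from `time`

io_side = load + huff + tar      # ~3.592 s of I/O and packaging
kernel_side = total - io_side    # ~2.095 s for the "real" cuSZ work
```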

Verified decompression:

> ./bin/cusz -i ~/Parihaka_PSTM_far_stack.f32.sz -x --origin ~/Parihaka_PSTM_far_stack.f32 --skip write.x

[info] Commencing decompression...
[info] Huffman decoding into quant.code.
[info] Extracted outlier from CSR format.
[info] Decompression finished.

[info] Huffman metadata of chunking and reverse codebook size (in bytes): 594400
[info] Huffman coded output size: 849078212
[info] To compare with the original datum

[info] Verification start ---------------------
| min.val             -6893.359375
| max.val             5448.8828125
| val.rng             12342.2421875
| max.err.abs.val     1.2342529296875
| max.err.abs.idx     86493
| max.err.vs.rng      0.00010000232623352093317
| max.pw.rel.err      1
| PSNR                84.882560160759624068
| NRMSE               5.6999624146827599324E-05
| correl.coeff        0.99999687790253855013
| comp.ratio.w/o.gzip 5.706653
[info] Verification end -----------------------

[info] Decompressed file is written to /path/to/Parihaka_PSTM_far_stack.f32.szx.
[info] Please use compressed data (*.sz) to calculate final comp ratio (w/ gzip).
[info] Skipped writing unzipped to filesystem.
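As a sanity check, the verification report is internally consistent: PSNR = -20·log10(NRMSE) when NRMSE is normalized by the value range, and max.err.vs.rng = max.err.abs.val / val.rng. A quick check using the figures copied from the report above:

```python
import math

val_rng     = 12342.2421875            # val.rng
max_abs_err = 1.2342529296875          # max.err.abs.val
nrmse       = 5.6999624146827599324e-05  # NRMSE

psnr = -20 * math.log10(nrmse)         # ~84.8826 dB, matching the reported PSNR
err_vs_rng = max_abs_err / val_rng     # ~1.00002e-4, matching max.err.vs.rng
```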

Please let us know if this will work on your P100 and Parihaka dataset. Thanks for your help!

Best,
Dingwen

@sheltongeosx
Author

sheltongeosx commented Nov 6, 2020

@dingwentao @MauricioAP @jiemeng-total,

Hi Dingwen,
First of all, it looks like this version works on IBM Power8. The total runtime for this case is around 10 seconds, a big improvement in runtime. Here is the printout:

 /home/sma/src/cuSZ_81248dd/build/bin/cusz  -z -f32 -m r2r -e 0.00001 -i /home/sma/data/cuSZ/Parihaka_PSTM_far_stack.H@  -3 1168 1126 922 --opath /home/sma/data/cuSZ/IBM_far_stack_comp_nogzip_0.00001 
[info] datum:           /home/sma/data/cuSZ/Parihaka_PSTM_far_stack.H@ (4850339584 bytes) of type f32
[dbg]  original len:    1212584896 (padding: 34823)
[dbg]  Time loading data:       2.17847s
[info] quant.cap:       1024    input eb:       1e-05
[dbg]  Time inspecting data range:      0.0169115s
[info] eb change:       (input eb) x 1216.85 (rng) = 0.0121685 (relative-to-range)
[dbg]  2-byte quant type, 4-byte internal Huff type

[info] Commencing compression...
[info] nnz.outlier:     474066415       (39.0955%)
[dbg]  Optimal Huffman deflating chunksize      65536
[info] entropy:         4.71345
[dbg]  Huffman enc:     #chunk=18503, chunksze=65536 => 259772045 4-byte words/8312416913 bits
[dbg]  Time writing Huff. binary:       0.220607s
[info] Compression finished, saved Huffman encoded quant.code.
[dbg]  Time tar'ing     4.23726s
[info] Written to:      /home/sma/data/cuSZ/IBM_far_stack_comp_nogzip_0.00001/Parihaka_PSTM_far_stack.H@.sz

But the issue is that the data has hardly been compressed:

4832071680 Nov  6 08:41 Parihaka_PSTM_far_stack.H@.sz
4850339584 Nov  6 09:15 Parihaka_PSTM_far_stack.H@.szx

It is even worse for -e=0.000001:

8401100800 Nov  6 08:42 Parihaka_PSTM_far_stack.H@.sz
4850339584 Nov  6 09:15 Parihaka_PSTM_far_stack.H@.szx
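In ratio terms, using the sizes listed above (the .szx decompressed file matches the 4850339584-byte original input), the 1e-5 run barely compresses and the 1e-6 run actually expands:

```python
original_bytes = 4_850_339_584  # Parihaka_PSTM_far_stack.H@ input size

ratio_1e5 = original_bytes / 4_832_071_680  # ~1.004 at e = 0.00001
ratio_1e6 = original_bytes / 8_401_100_800  # ~0.577 at e = 0.000001: output larger than input
```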

Hope you are able to reproduce the issues.

Best
Shelton

@dingwentao
Copy link
Contributor

dingwentao commented Nov 6, 2020

Hi @sheltongeosx,

Thanks for your quick feedback.

Yes, based on our observations, a relative error bound of 1e-4 on this dataset can only provide a compression ratio of about 5, so worse ratios with 1e-5 or 1e-6 are to be expected. I'm not sure if you've tested the data with CPU SZ; based on our preliminary experiments, this dataset is very hard to compress with SZ as well. Since cuSZ is a variant of SZ focused on improving compression speed, I recommend you work with our SZ team (szlossycompressor@gmail.com) to improve the compression algorithm for a higher ratio first. cuSZ can then incorporate that algorithm.

Again, thanks a lot for helping us solve several issues for large and hard-to-compress data.

Best,
Dingwen
