Unbound memory allocation on system with six gpus #47

Closed
daniel-j-h opened this issue Sep 20, 2019 · 10 comments

@daniel-j-h

Hey there - thanks for this amazing monitoring tool! 🙇

Here's an issue I'm hitting: when running on a 6-GPU system, nvtop allocates memory until it is killed by the Linux OOM killer. It looks like there is an overflow somewhere leading to unbounded memory allocation (at a rate of multiple GBs per second).

Another data point: this behavior stops happening when I run in a small terminal (e.g. 80x24) or in a tmux split pane, which indicates it has something to do with the live utilization plots.

When running a debug build and sending a SIGHUP signal during the memory allocation, I get backtraces pointing to draw_plots as the problem, e.g.

(gdb) bt
#0  0x00005555555c11c8 in nvtop_line_plot (win=0x0, num_data=3200171704, data=0x7ff9f9354800, min=0, max=100, num_plots=4, legend=0x7fffffffe000) at /home/djh/nvtop/src/plot.c:61
#1  0x00005555555a99db in draw_plots (interface=0x6110000002c0) at /home/djh/nvtop/src/interface.c:1604
#2  0x00005555555a9ce1 in draw_gpu_info_ncurses (dev_info=0x61a000000c80, interface=0x6110000002c0) at /home/djh/nvtop/src/interface.c:1625
#3  0x0000555555594192 in main (argc=1, argv=0x7fffffffe4a8) at /home/djh/nvtop/src/nvtop.c:270

Hope that helps, let me know if you need more information.

@Syllo
Owner

Syllo commented Sep 21, 2019

Hi,

It seems that there is a problem with the window initialization code. Your backtrace shows a NULL window pointer.
What I think is happening is that this cascades into an unsigned underflow in initialize_gpu_mem_plot, followed by a malloc that over-commits an insanely huge buffer.
Finally, the OS starts allocating the real pages as the nvtop_line_plot code accesses the buffer, and 💣

Could you please provide me the gdb output after:
break interface.c:439
run
print *plot_positions

At which terminal size does it break?

@daniel-j-h
Author

Here's the full-screen terminal size that runs into this issue:

$ stty size
49 190

and a tmux split pane in which it works is of size 23 190.

The gdb output:

(gdb) print *plot_positions
$1 = {posX = 0, posY = 11, sizeX = 189, sizeY = 12}

Thank you! 🙇

@Syllo
Owner

Syllo commented Sep 22, 2019

Could you please confirm that the patch 8b56210 on the dev branch fixes your problem?

@daniel-j-h
Author

Wonderful, the dev branch fixes the problem! 🎉 Thank you for this quick fix!

It shows me two plots at the top and a third one below (out of six GPUs).

@Syllo
Owner

Syllo commented Oct 1, 2019

You are welcome.
The fix is now part of master.

@Syllo Syllo closed this as completed Oct 1, 2019
@lyu

lyu commented Mar 1, 2020

Hi,

Sorry for commenting on a closed issue, but I am still hitting the same problem the OP faced, this time on a system with 8 GPUs. When I reduce the size of the terminal, nvtop works correctly and shows 4 plots, each displaying 2 GPUs.

I am building nvtop from the master branch.

@Syllo Syllo reopened this Mar 2, 2020
@Syllo
Owner

Syllo commented Mar 2, 2020

Hello,

Can you please provide the location of the error in the same way Daniel did, the size of your terminal, and the output of the debugger for the following commands:

break interface_layout_selection.c:174
print num_plot_stacks
print plot_per_row
print *plot_types

To generate a debug build you have to specify -DCMAKE_BUILD_TYPE=Debug while running cmake.

Thanks

@lyu

lyu commented Mar 3, 2020

tput cols: 142
tput lines: 75
print num_plot_stacks: 3
print plot_per_row: 1
print *plot_types: plot_gpu_duo

Address sanitizer backtrace:

=================================================================
==32111==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x2ab2b8fcf7e0 at pc 0x000000440867 bp 0x7fffffffcb30 sp 0x7fffffffcb28
READ of size 8 at 0x2ab2b8fcf7e0 thread T0
    #0 0x440866 in nvtop_line_plot /dev/shm/nvtop/src/plot.c:29
    #1 0x421f1a in draw_plots /dev/shm/nvtop/src/interface.c:1604
    #2 0x4222bb in draw_gpu_info_ncurses /dev/shm/nvtop/src/interface.c:1625
    #3 0x4059a5 in main /dev/shm/nvtop/src/nvtop.c:270
    #4 0x2aaaac994504 in __libc_start_main (/lib64/libc.so.6+0x22504)
    #5 0x4048a8  (/gpfs/home/USER_NAME/.local/bin/nvtop+0x4048a8)

0x2ab2b8fcf7e0 is located 16 bytes to the right of 34359738320-byte region [0x2aaab8fcf800,0x2ab2b8fcf7d0)
allocated by thread T0 here:
    #0 0x2aaaaadd8cb8 in __interceptor_calloc ../../../../gcc-9.2.0/libsanitizer/asan/asan_malloc_linux.cc:153
    #1 0x4083d8 in initialize_gpu_mem_plot /dev/shm/nvtop/src/interface.c:364
    #2 0x4091da in alloc_plot_window /dev/shm/nvtop/src/interface.c:409
    #3 0x4098c6 in initialize_all_windows /dev/shm/nvtop/src/interface.c:439
    #4 0x40c1a6 in initialize_curses /dev/shm/nvtop/src/interface.c:564
    #5 0x4057ed in main /dev/shm/nvtop/src/nvtop.c:249
    #6 0x2aaaac994504 in __libc_start_main (/lib64/libc.so.6+0x22504)

SUMMARY: AddressSanitizer: heap-buffer-overflow /dev/shm/nvtop/src/plot.c:29 in nvtop_line_plot
Shadow bytes around the buggy address:
  0x0556d71f1ea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0556d71f1eb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0556d71f1ec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0556d71f1ed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x0556d71f1ee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0556d71f1ef0: 00 00 00 00 00 00 00 00 00 00 fa fa[fa]fa fa fa
  0x0556d71f1f00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0556d71f1f10: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0556d71f1f20: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0556d71f1f30: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x0556d71f1f40: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==32111==ABORTING

@Syllo
Owner

Syllo commented Aug 22, 2020

@lyu I think that the patch 71b7f96 should fix this problem.
Could you please tell me if it solves the problem on your system?

@Syllo Syllo closed this as completed Aug 28, 2020
@lyu

lyu commented Aug 30, 2020

@Syllo The problem is gone, thank you!
