Segmentation fault while using node_exporter 1.14.3.0 or 1.12.1.0 #16

Open
yashnagar opened this issue May 23, 2022 · 10 comments

Comments

@yashnagar

On AIX 7.2 TL5 we see a segmentation fault with 1.12.1.0. We upgraded to the latest 1.14.3.0, but the behavior is the same.

[1]+ Segmentation fault (core dumped) /usr/local/bin/node_exporter_aix -acMdiPf -p 10051

The LPAR has many disks and 6 fiber adapters and is quite busy. Can someone help?

@yashnagar
Author

I noticed that just before the segmentation fault we see this error:
Error calling perfstat_diskpath: Invalid argument

Note that we are using Veritas DMP for managing SAN paths.

@thorhs
Owner

thorhs commented May 23, 2022

Hi,

Could you see if you can find the location of the segfault? You may see line numbers in the 'errpt -a' output, or you could enable core dumps and take a look using a debugger like gdb. I can give you more details if needed, and if you are able to share the core dump I could take a look myself.

Did this work for you before an upgrade to 1.12.1.0, or is this a new deployment?

Does the number of disks/adapters change frequently?

Does this happen at startup of the exporter, or does it run for some time before it happens?

Thor

@yashnagar
Author

The core file gets generated under the root (/) directory.

root@or1xx003[/]# ls -l core
-rw------- 1 root system 4110233 May 23 06:15 core

The behavior is the same with the 1.12 and 1.14 node exporter agents.

root@or1xx003[/]# lslpp -l|grep -i node
node_exporter_aix.rte 1.14.3.0 COMMITTED prometheus node_exporter for

The number of disks and adapters does not change frequently; they remain pretty static. Whether I run the node exporter with or without arguments, it crashes within 1-2 minutes of starting, leaving the error "Error calling perfstat_diskpath: Invalid argument" on the console. I can share the core file; let me know where I can upload it for you.

The majority of our systems run Veritas Cluster, Veritas Volume Manager and Veritas Dynamic Multi-Pathing (VxDMP), and that is where this issue is observed.

Thank you for a quick reply

Regards
Yash

@thorhs
Owner

thorhs commented May 24, 2022

Could you please gzip the core and attach it to this case?

@yashnagar
Author

I noticed some text in the core stating that root login is disabled. We have disabled direct root login on all AIX systems. After enabling direct root login in sshd_config, I see the following in errpt.

root@or1xxx[/]# errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
A924A5FC 0527004122 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED
A924A5FC 0527003722 P S SYSPROC SOFTWARE PROGRAM ABNORMALLY TERMINATED

root@or1xxx[/]# errpt -a -j A924A5FC

LABEL: CORE_DUMP
IDENTIFIER: A924A5FC

Date/Time: Fri May 27 00:41:08 2022
Sequence Number: 1837132
Machine Id: 00CB2F274C00
Node Id: or1sxxxx
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

    Recommended Actions
    CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

    Recommended Actions
    RERUN THE APPLICATION PROGRAM
    IF PROBLEM PERSISTS THEN DO THE FOLLOWING
    CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
11
USER'S PROCESS ID:
32833812
FILE SYSTEM SERIAL NUMBER
1
INODE NUMBER
2
CORE FILE NAME
//core
PROGRAM NAME
node_exporter_aix
STACK EXECUTION DISABLED
0
COME FROM ADDRESS REGISTER
perfstat_ 13C

PROCESSOR ID
hw_fru_id: 2
hw_cpu_id: 35

ADDITIONAL INFORMATION
_Z12gathe 128
_Z12gathe 80

Symptom Data
REPORTABLE
1
INTERNAL ERROR
0
SYMPTOM CODE
PCSS/SPI2 FLDS/node_expo SIG/11 FLDS/Z12gathe VALU/128 FLDS/perfstat

LABEL: CORE_DUMP
IDENTIFIER: A924A5FC

Date/Time: Fri May 27 00:37:08 2022
Sequence Number: 1837131
Machine Id: 00CB2F274C00
Node Id: or1xxx
Class: S
Type: PERM
WPAR: Global
Resource Name: SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

    Recommended Actions
    CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

    Recommended Actions
    RERUN THE APPLICATION PROGRAM
    IF PROBLEM PERSISTS THEN DO THE FOLLOWING
    CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
11
USER'S PROCESS ID:
33095964
FILE SYSTEM SERIAL NUMBER
1
INODE NUMBER
2
CORE FILE NAME
//core
PROGRAM NAME
node_exporter_aix
STACK EXECUTION DISABLED
0
COME FROM ADDRESS REGISTER
perfstat_ 13C

PROCESSOR ID
hw_fru_id: 2
hw_cpu_id: 35

ADDITIONAL INFORMATION
_Z12gathe 128
_Z12gathe 80

Symptom Data
REPORTABLE
1
INTERNAL ERROR
0
SYMPTOM CODE
PCSS/SPI2 FLDS/node_expo SIG/11 FLDS/Z12gathe VALU/128 FLDS/perfstat

But the core dump still happens immediately after running it:
/usr/local/bin/node_exporter_aix -acMdiPf -p 10051 &

I have a couple of questions:

  1. Can we enable debug on node_exporter or enable logging to a file?
  2. Is there any way we can skip the diskpath module?

At the moment, I am not able to share the core dump due to company policy.

Thanks
Yash

@thorhs
Owner

thorhs commented May 30, 2022

  1. Unfortunately there is no additional debugging available in node_exporter_aix.
  2. If you run 'node_exporter_aix -h', you will see the available modules. You should be able to include only the modules you want. (exclude -D).

Do you have access to gdb on AIX? If so, could you copy the core and the node_exporter_aix binary to a server with gdb available and run 'gdb node_exporter_aix core'.

When in gdb, please run 'where' to get a stack trace from the core file.

You could also try to run node_exporter_aix with one module enabled at a time, to try to zero in on where the issue is.
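
One way to do the one-module-at-a-time check is a small shell loop. The sketch below is only an illustration: the install path, port and 120-second window are assumptions based on the commands earlier in this thread, the flag letters should be taken from the 'node_exporter_aix -h' output on your system, and it assumes that a crash terminates the process the script started within that window (for example because a scrape arrives in the meantime).

```shell
#!/bin/sh
# Sketch: start the exporter with one module flag at a time and report
# which flags make it exit within a short observation window.
# EXPORTER, PORT and WAIT are assumptions; adjust them for your system.
EXPORTER=${EXPORTER:-/usr/local/bin/node_exporter_aix}
PORT=${PORT:-10051}
WAIT=${WAIT:-120}

# probe_flag FLAG: run the exporter with only -FLAG enabled, wait WAIT
# seconds, then check whether the process is still alive.
probe_flag() {
    "$EXPORTER" "-$1" -p "$PORT" >/dev/null 2>&1 &
    pid=$!
    sleep "$WAIT"
    if kill -0 "$pid" 2>/dev/null; then
        # Still alive: stop it so the next flag can reuse the port.
        kill "$pid" 2>/dev/null || true
        wait "$pid" 2>/dev/null || true
        echo "$1: still running after ${WAIT}s"
    else
        # Already gone: collect and report its exit status.
        status=0
        wait "$pid" 2>/dev/null || status=$?
        echo "$1: exited with status $status"
    fi
}

# probe_all FLAG...: probe each flag letter in turn.
probe_all() {
    for flag in "$@"; do
        probe_flag "$flag"
    done
}

# Example, with flag letters copied from the command lines in this thread:
# probe_all c M d i P f
```

A flag that is reported as exited, rather than still running, is the first place to look; the core file written at that time should belong to that module.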

Also, what version of AIX are you running where it is crashing? (oslevel -s)

@yashnagar
Author

I have installed gdb and got the following. I ran gdb on the other cores as well; they all state that the program terminated with signal SIGSEGV. Let me know if you need more details from the core.

root@or1xxx001# /opt/freeware/bin/gdb aix_node_exporter core
GNU gdb (GDB) 10.2
....
Core was generated by `node_exporter_aix'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x10118328 in ?? ()
(gdb) where
#0 0x10118328 in ?? ()
#1 0x10118280 in ?? ()
#2 0x10029078 in ?? ()
#3 0x1002ae04 in ?? ()
#4 0x1042bdcc in ?? ()
#5 0x10409d34 in ?? ()
#6 0x103ea18c in ?? ()
#7 0x103c0044 in ?? ()
#8 0x103ad424 in ?? ()
#9 0x103b7b9c in ?? ()
#10 0x103b78ec in ?? ()
#11 0x103b762c in ?? ()
#12 0x103b736c in ?? ()
#13 0x103b70b8 in ?? ()
#14 0x103b6de8 in ?? ()
#15 0x103b45fc in ?? ()
#16 0x102fd74c in ?? ()
#17 0x102f7df8 in ?? ()
#18 0x102f7318 in ?? ()
#19 0x1033d858 in ?? ()
#20 0x102ec250 in ?? ()
#21 0x100294b0 in ?? ()
#22 0x1002a674 in ?? ()
#23 0x10029ccc in ?? ()
#24 0x1002d01c in ?? ()
#25 0x1002cf70 in ?? ()
#26 0x1002cec0 in ?? ()
#27 0x102f3534 in ?? ()
#28 0xd0579fc8 in ?? ()
#29 0x00000000 in ?? ()

I recently deployed 1.12.1.0 across all our AIX LPARs. Do you recommend upgrading to 1.14.3.0?

If I exclude (D) and (d) from the /usr/local/bin/node_exporter_aix command line, the agent doesn't crash, but I end up losing data related to disk queues, disk timers, etc.

Now it seems I have to run this agent with two different sets of command-line arguments, e.g.:

AIX LPARs without VxDMP
/usr/local/bin/node_exporter_aix -p

AIX LPARs with VxDMP
/usr/local/bin/node_exporter_aix -cCAMmiabPf -p
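
The two variants above could be wrapped in one start script so that the same package can be deployed everywhere. This is only a rough sketch: treating a host as a VxDMP host when the `vxdmpadm` command is on PATH is an assumption, and the `DMP_PROBE` variable and `exporter_flags` function are names made up for this example. Swap in whatever check fits your environment.

```shell
#!/bin/sh
# Sketch: pick the module flags for node_exporter_aix per host.
# Assumption: a host uses VxDMP if the probe command (default: vxdmpadm)
# is found on PATH. Replace the probe with a check that suits your site.
DMP_PROBE=${DMP_PROBE:-vxdmpadm}

# exporter_flags: print the module flag string for this host.
exporter_flags() {
    if command -v "$DMP_PROBE" >/dev/null 2>&1; then
        # Flag set from this thread that leaves out the D and d modules.
        echo "-cCAMmiabPf"
    else
        # No module flags, as in the "without VxDMP" command above.
        echo ""
    fi
}

# Example (port 10051 is the one used earlier in this thread):
# /usr/local/bin/node_exporter_aix $(exporter_flags) -p 10051
```

The command substitution is left unquoted on purpose so that an empty result adds no argument at all.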

Let me know if you need more info from core.

Thank you for all the help!

Yash

@thorhs
Owner

thorhs commented Jun 10, 2022

Hmmm, interesting. I was expecting the names of the functions to be displayed.

Could you please try to run this version? It will output some debugging data that could help me locate the issue.

Also, try to execute the exporter under gdb:
gdb --args node_exporter_aix

then type run to execute.

If it crashes, run 'where' and maybe 'list' as well.

I have attached the debug build of the exporter in this comment.

node_exporter_aix_debug.zip

@SkyMoCo

SkyMoCo commented Jan 4, 2023

We are having a similar issue, but with "stock" AIX file systems and a large number of disks.

This is the program output.
Number of diskpath records: 0
Error calling perfstat_diskpath: Invalid argument
Number of memory_page records: 4
.... lines deleted by me
Number of disk records: 520
Segmentation fault

Under gdb it doesn't tell us much more...
gdb --args node_exporter_aix_debug -p 9200
GNU gdb (GDB) 10.2
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "powerpc64-ibm-aix7.1.0.0".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
https://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from node_exporter_aix_debug...
(gdb) show args
Argument list to give program being debugged when it is started is "-p 9200".
(gdb) set follow-fork-mode child
(gdb) run
Starting program: /var/tmp/node_exporter_aix_debug -p 9200
[New Thread 1]
Node exporter for AIX version 1.12.1.0 listening on port 9200
[New Thread 258]
[Attaching after Thread 258 fork to child process 21103002]
[New inferior 2 (process 21103002)]
[Detaching after fork from parent process 12386564]
[Inferior 1 (process 12386564) detached]

Thread 2.1 received signal SIGTRAP, Trace/breakpoint trap.
[Switching to process 21103002]
0x10000100 in ?? ()
(gdb) where
#0 0x10000100 in ?? ()
#1 0xdeadbeef in ?? ()
(gdb) list
27 main.cpp: A file or directory in the path name does not exist..

@lbsivahari

I found that this is caused by a memory leak. I added calloc (dynamic memory allocation), and after that the issue was fixed on my AIX servers. Please refer to pull requests #33, #34 and #35.
