Standup Troubleshooting [gitlab] #43

Open
jmwample opened this issue May 27, 2020 · 4 comments

@jmwample

jmwample commented May 27, 2020

This will be a series of posts discussing the issues I ran into while standing up the stations windyheron and artemis.

original submission: 1/21/2020

@jmwample

PF_RING submodule incompatibility

Currently the tapdance server has its git submodule tracking:

[submodule "PF_RING"]
	path = PF_RING
	url = https://github.com/ntop/PF_RING.git
	branch = 6.4.0-stable

This is the stable version from 2017. Conjure, on the other hand, tracks 7.4.0-stable by default. The compilation process is identical and will succeed. HOWEVER, if you have the kernel modules for i40e ethernet devices from 6.4.0-stable and you try to run a program compiled against the 7.4.0-stable libraries, it will crash.

Symptom

The sign that you are experiencing this issue is a floating point exception (signal 8) that crashes the detector.

Using public key: a1cb97be697c5ed5aefd78ffa4db7e68101024603511e40a89951bc158807177
PF_RING Tapdance shutting down...
PF_RING Tapdance done shutting down!
Starting process 0...
Core 6: PID 17346, lcore 9
Starting process 1...
Core 7: PID 17347, lcore 10
Starting process 2...
Core 8: PID 17348, lcore 11
Starting process 3...
Core 9: PID 17349, lcore 12
Starting process 4...
Core 10: PID 17350, lcore 13
Starting process 5...
Core 11: PID 17351, lcore 14
...child proc 0 killed by signal 8 -- Coredump created
...child proc 1 killed by signal 8 -- Coredump created
...child proc 2 killed by signal 8 -- Coredump created
...child proc 3 killed by signal 8 -- Coredump created
...child proc 4 killed by signal 8 -- Coredump created
...child proc 5 killed by signal 8 -- Coredump created
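
To check an existing detector log for this symptom, a minimal sketch along these lines may help. The helper name count_sigfpe_children is hypothetical, and reading the log through journalctl assumes the detector runs as a systemd unit that logs to the journal.

```shell
# Count detector child processes reported as killed by signal 8 (SIGFPE).
# Reads log text on stdin and prints the number of matching lines.
# count_sigfpe_children is a hypothetical helper name, not part of tapdance or conjure.
count_sigfpe_children() {
    # grep -c prints 0 and exits non-zero when nothing matches; "|| true" hides that status
    grep -Ec 'killed by signal 8([^0-9]|$)' || true
}

# Example use (assumes the detector logs to the systemd journal):
#   journalctl -u detector.service | count_sigfpe_children
```

A non-zero count together with mismatched PF_RING versions points at the incompatibility described here.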

Solution

The current solution is to downgrade the conjure station to the 6.4.0-stable pf_ring tag. There is currently an issue open to update tapdance to support 7.4.0-stable (#42).

Downgrading the pf_ring libraries can be done in one of two ways.

EITHER:

  • Downgrade the tracked submodule in a local commit and re-pull the PF_RING git submodule.

OR

  • Remove the PF_RING module altogether and change the Makefile to use -lpcap and -lpfring, which will look for the libraries already installed on the system.
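
The first option can be sketched roughly as follows. This is an illustration under assumptions, not a tested procedure: pin_pfring_branch is a hypothetical helper name, and the commented commands assume you are at the top of the conjure checkout with the submodule registered under the name PF_RING.

```shell
# Pin the PF_RING submodule to a given branch by editing .gitmodules.
# pin_pfring_branch is a hypothetical helper name.
pin_pfring_branch() {
    gitmodules_file="$1"   # path to the .gitmodules file
    branch="$2"            # branch to track, e.g. 6.4.0-stable
    git config -f "$gitmodules_file" submodule.PF_RING.branch "$branch"
}

# Example use from the top of the checkout (assumed layout):
#   pin_pfring_branch .gitmodules 6.4.0-stable
#   git submodule update --init --remote PF_RING   # re-pull at the pinned branch
#   git add .gitmodules PF_RING
#   git commit -m "Pin PF_RING to 6.4.0-stable"
```

After the submodule is re-pulled, PF_RING and the station need to be rebuilt so the userland libraries match the installed kernel modules.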

original comment: 1/21/2020

@jmwample

Here is the diff of the Makefile that worked for compiling on artemis (solution 2).

$ git diff Makefile
diff --git a/Makefile b/Makefile
index 7651143..0fa6aa4 100644
--- a/Makefile
+++ b/Makefile
@@ -7,8 +7,8 @@ PFRINGDIR=./PF_RING/
 PFRING_LIBS=${PFRINGDIR}/userland/lib/libpfring.a ${PFRINGDIR}/userland/libpcap/libpcap.a
 RUST_LIB=./target/release/librust_dark_decoy.a
 TD_LIB=./libtapdance/libtapdance.a
-LIBS=${PFRING_LIBS} ${RUST_LIB} ${TD_LIB} -L/usr/local/lib -lzmq -lcrypto -lpthread -lrt -lgmp -ldl -lm
-CFLAGS = -Wall -DENABLE_BPF -DHAVE_PF_RING -DHAVE_PF_RING_ZC -DTAPDANCE_USE_PF_RING_ZERO_COPY -I${PFRINGDIR}/userland/lib/ -I${PFRINGDIR}/kernel -O2 # -g
+LIBS= ${RUST_LIB} ${TD_LIB} -L/usr/local/lib -lzmq -lcrypto -lpthread -lrt -lgmp -ldl -lm -lpfring -lpcap
+CFLAGS = -Wall -DENABLE_BPF -DHAVE_PF_RING -DHAVE_PF_RING_ZC -DTAPDANCE_USE_PF_RING_ZERO_COPY -O2 # -g
 PROTO_RS_PATH=src/signalling.rs
 

original comment: 1/21/2020

@jmwample

Held Packages

Symptom

When attempting to install libzmq3-dev, apt reports that the proper version of
libzmq5 is not installed, saying:

... you have held broken packages

This message can mean that a package was marked as "held" in apt so that it would not be upgraded.
To see a list of these packages you can run the following command:

sudo apt-mark showhold

This should list all packages held on the system. If none are listed (as was the case during the windyheron setup), the next step is to look for broken apt sources.

Solution

In this case there was an extraneous apt source in /etc/apt/sources.list which tied libzmq
to an old (broken) source. [Unfortunately I did not save a copy of it before removing it.]

Removing the zmq-specific source and allowing apt to use the default repositories worked great.
Another cause of this error could be a default apt source targeting a release that does not match your installed distribution, but that was not the case this time.
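
The search for a stray source described above can be sketched as a small helper. The name find_apt_sources is hypothetical, and the pattern zmq is only a guess at what the offending entry contained, since the original line was not saved.

```shell
# Print active (non-comment) apt source lines that mention a pattern.
# find_apt_sources is a hypothetical helper name.
# Usage: find_apt_sources PATTERN FILE...
find_apt_sources() {
    pattern="$1"
    shift
    # -s: skip unreadable or missing files, -H: prefix each match with its file name,
    # -i: ignore case. The second grep drops lines that are commented out.
    grep -sHi -- "$pattern" "$@" | grep -v ':[[:space:]]*#'
}

# Example use:
#   find_apt_sources zmq /etc/apt/sources.list /etc/apt/sources.list.d/*.list
#   apt-cache policy libzmq5   # shows which source each candidate version comes from
```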

Tried and Failed

While the apt package was broken I attempted to install libzmq from source. This worked for conjure,
but did NOT work for tapdance and resulted in downtime while I attempted to recover the installation.

original comment: 1/21/2020

@jmwample

Zbalance Teardown

When restarting zbalance_ipc, you must first stop the programs consuming data from the queues (detector.service); otherwise the teardown leaves things in a strange state.

Symptoms

If you find yourself in this strange state, zbalance_ipc will complain about huge pages when you try to restart it.

Solution

If you are running zbalance on its own when this happens, choose a new cluster id (-c [CLUSTER_ID]) and restart zbalance_ipc.

If you are running the zbalance service from Tapdance, you will need to change TD_CLUSTER_ID to something new in /opt/tapdance/config and then restart zbalance.service, which automatically sources the tapdance config.
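
For the Tapdance service case, the recovery might look like the sketch below. It assumes /opt/tapdance/config holds a plain shell-style line of the form TD_CLUSTER_ID=<value> (it is sourced, so this seems likely, but the exact format is not confirmed here) and that the services are managed with systemctl. The helper name set_cluster_id and the id 99 are illustrative only.

```shell
# Rewrite the TD_CLUSTER_ID assignment in a shell-style config file.
# set_cluster_id is a hypothetical helper name.
# Usage: set_cluster_id FILE NEW_ID
set_cluster_id() {
    config_file="$1"
    new_id="$2"
    tmp_file=$(mktemp) || return 1
    # Assumes a line of the form TD_CLUSTER_ID=<value> with no leading "export"
    if sed "s/^TD_CLUSTER_ID=.*/TD_CLUSTER_ID=${new_id}/" "$config_file" > "$tmp_file"; then
        # Write back in place so the file keeps its owner and mode
        cat "$tmp_file" > "$config_file"
    fi
    rm -f "$tmp_file"
}

# Example sequence, run as root (99 is an arbitrary example id):
#   systemctl stop detector.service       # stop the queue consumers first
#   set_cluster_id /opt/tapdance/config 99
#   systemctl restart zbalance.service
#   systemctl start detector.service
```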

Note: This has only been tested on PF_RING 6.4.0-stable.

original comment: 1/21/2020
