
gopacket sniffer #317

Merged
merged 8 commits into from Aug 6, 2015

Conversation

peterbourgon
Contributor

This PR adds packet sniffing to the probe's bag of tricks. Here's how it works:

  • We sample via a duty cycle, e.g. 1s on, 5s off.
  • When off, we still count (but don't decode) packets; this is necessary to compute a proper sample rate.
  • Currently gopacket/pcap only; we could think about adding e.g. pf_ring.

Misc notes from the implementation:

  • This removes the AggregateMetadata type, as (it turns out) we can merge EdgeMetadatas directly just fine.
  • There is no real concept of bytes ingress/egress on an edge, as an edge is directed, so those separate values have been removed. If we later need a concept of ingress/egress, it should be derived as part of the rendering process.
  • We can't cross-compile anymore :( so I had to remove the Darwin test from Circle.

To review, start with the Sniffer and work your way out, being sure to look at how it's wired up in probe.main. (It's modeled as a Reporter.)

We still need to wire this into the (revised) details pane, but I wanted to get this work reviewed and merged first.
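To illustrate the "modeled as a Reporter" shape: the sniffer satisfies a reporting interface and each call drains what the capture loop accumulated. The interface and types below are stand-ins, not Scope's real Reporter signature:

```go
package main

import "fmt"

// reporter is an assumed stand-in for the probe's Reporter interface;
// the real signature lives in the probe package and may differ.
type reporter interface {
	Report() map[string]int // stand-in for a report.Report
}

// sniffer satisfies reporter: each Report call hands back whatever the
// capture loop accumulated since the last report, then resets.
type sniffer struct {
	counts map[string]int
}

func (s *sniffer) Report() map[string]int {
	out := s.counts
	s.counts = map[string]int{}
	return out
}

func main() {
	var r reporter = &sniffer{counts: map[string]int{"edge": 3}}
	fmt.Println(r.Report()) // map[edge:3]
	fmt.Println(r.Report()) // map[]
}
```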

@peterbourgon peterbourgon changed the title [WIP] gopacket sniffer gopacket sniffer Jul 23, 2015
@peterbourgon
Contributor Author

@tomwilkie PTAL and let me know what you think! Maybe also interesting to @paulbellamy

edgeKey             = report.MakeEdgeID(scopedLocal, scopedRemote)
localAddressNodeID  = report.MakeAddressNodeID(r.hostID, c.LocalAddress.String())
remoteAddressNodeID = report.MakeAddressNodeID(r.hostID, c.RemoteAddress.String())
adjacencyID         = report.MakeAdjacencyID(localAddressNodeID)


@paulbellamy
Contributor

I'm not quite seeing why EdgeMetadata.* all have to be pointers? Is it so they can be nil, or for mutability?

@peterbourgon
Contributor Author

They need maybe semantics, because the zero value may be meaningful.
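A cut-down illustration of the maybe semantics (not the real Scope EdgeMetadata type): nil means "never observed", while a pointer to 0 means "observed, and the count was zero" — a plain uint64 couldn't distinguish the two:

```go
package main

import "fmt"

// edgeMetadata is an illustrative stand-in, not Scope's actual type.
type edgeMetadata struct {
	PacketCount *uint64 // nil = absent; non-nil 0 = observed as zero
}

// merge folds other into e, treating nil fields as absent.
func (e *edgeMetadata) merge(other edgeMetadata) {
	if other.PacketCount == nil {
		return // nothing observed on the other side
	}
	if e.PacketCount == nil {
		v := *other.PacketCount
		e.PacketCount = &v
		return
	}
	*e.PacketCount += *other.PacketCount
}

func main() {
	zero, three := uint64(0), uint64(3)
	a := edgeMetadata{PacketCount: &zero}  // observed: zero packets
	b := edgeMetadata{}                    // never observed
	c := edgeMetadata{PacketCount: &three} // observed: three packets

	a.merge(b)
	a.merge(c)
	fmt.Println(*a.PacketCount) // 3
}
```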

@peterbourgon
Contributor Author

I'm seeing very weird behavior when testing the details pane. Still tracking down some bugs.

for {
	select {
	case p := <-packets:
		s.Merge(p, rpt)


@tomwilkie
Contributor

What's the performance impact of this change? It's not enabled by default; will we ever be able to do this?

I've voiced concerns previously that we should get this kind of information (bandwidth) from the router / FDP and not through scope packet capture.

In the JSON representation, we want the Sampling data to be distinct.
Another implicit invariant in the data model is that edges are always of the
form (local -> remote). That is, the source of an edge must always be a node
that originates from within Scope's domain of visibility. This was evidenced by
the presence of ingress and egress fields in edge/aggregate metadata.

When building the sniffer, I accidentally and incorrectly violated this
invariant, by constructing distinct edges for (local -> remote) and (remote ->
local), and collapsing ingress and egress byte counts to a single scalar. I
experienced a variety of subtle undefined behavior as a result. See #339.

This change reverts to the old, correct methodology. Consequently the sniffer
needs to be able to find out which side of the sniffed packet is local v.
remote, and to do that it needs access to local networks. I moved the
discovery from the probe/host package into probe/main.go.

As part of that work I discovered that package report also maintains its own,
independent "cache" of local networks. Except it contains only the (optional)
Docker bridge network, if it's been populated by the probe, and it's only used
by the report.Make{Endpoint,Address}NodeID constructors to scope local
addresses. Normally, scoping happens during rendering, and only for pseudo
nodes -- see the current LeafMap Render localNetworks. This is pretty convoluted
and should either be made consistent or heavily commented.
Also, 1 packet may be counted in N topologies, so you can't rely on the
sum of all packet counts across topologies having any relation to the
sampling data.
@peterbourgon
Contributor Author

@tomwilkie In my performance tests, adding packet capture at 100% sample rate, via -capture=true and -capture.off=0, takes weighted moving average CPU usage from ~9% to ~11%. I don't think this is significant, but I'm probably not stressing the probe with enough traffic. It would be great to get more data here. Would you be willing to try it on your instances?

Sadly, decreasing the sample rate through the duty cycle seems to have no effect, or, in the worst case, increases CPU usage. With -capture.on=3s -capture.off=3s (50% effective sample rate) I see CPU usage at ~15%. With -capture.on=1s -capture.off=5s I see CPU usage at ~12%. Two strategies to make this better:

  1. Abandon the duty-cycle approach, make a decision with each packet
  2. Try pfring and/or netfilter, both provided in gopacket

The first would make the code cleaner (and sample rates more accurate), but I think only the second stands a real chance of improving performance. That said, I don't think it's a pressing concern. I would propose to do it in a later PR.

@paulbellamy
Contributor

Looks fine, but @tomwilkie knows more about the performance characteristics of this.
In general, any reason to use pcap instead of pf_ring? pcap is lossless, but that doesn't really bother us here...

@tomwilkie
Contributor

Sorry for delay, been beavering away on SAAS.

re: performance; I don't think customers will be too interested in those figures, I'm afraid. They'll want an answer to the question "If I run this on my machine, how much does it slow down my application?"

To measure this you could set up a pair of GCE VMs, run iperf between them, measure bandwidth, and then repeat with this enabled. Right now I imagine the performance impact isn't huge (as weave is already doing pcap, so the top-line figure is low). I would measure without weave to get a better idea. I worry that once we introduce FDP, running scope with packet capture will be like reverting to weave legacy performance (we're talking tens of MB/s vs hundreds of MB/s).

Given the above, I can't see how this approach will ever be satisfactorily turned on by default. If we want bandwidth info, I'd rather bake the stats gathering into FDP, where it can be done in-kernel with minimal overhead.

@peterbourgon
Contributor Author

In general, any reason to use pcap instead of pf_ring?

No, just easier to get running. pf_ring is totally possible.

re: performance; I don't think customers will be too interested in those figures, I'm afraid...

To be clear the performance overhead of turning on packet capture in this branch is 2% by my (admittedly very basic) tests. Is that what you're calling unsatisfactory?

@tomwilkie
Contributor

How are you measuring that? 1% of what?

On Wednesday, 5 August 2015, Peter Bourgon notifications@github.com wrote:

To be clear the performance overhead of turning on packet capture in this
branch is 1%. Is that what you're calling unsatisfactory?



@peterbourgon
Contributor Author

Adding packet capture at 100% sample rate … takes … CPU usage from ~9% to ~11%

So 2% overall penalty, and ~15% relative penalty — though that's near as makes no difference statistical noise until we get a bit more rigorous. The point is it's not a 2x or 10x penalty, and so is good enough for continued development (IMO).

@tomwilkie
Contributor

"re: performance; I don't think customers will be too interested in those figures, I'm afraid. They'll want an answer to the question "If I run this on my machine, how much does it slow down my application?""

I don't think users care how much CPU scope uses; they care what the effect is on their application. Claiming 1% performance overhead could be misleading: they might think an app serving 1k QPS will only lose 10 QPS, whereas I suspect the reality is very different.

@peterbourgon
Contributor Author

I'm not sure how to respond to that. It's a lot of hypotheticals. Are you 👍 to merge this PR, do you prefer I close it without merging, or is there a third option?

@peterbourgon
Contributor Author

Alright, I'll merge this in the present disabled state, just so it doesn't rot. We'll leave it on ice until we decide what to do with it. We can rip it out later if it turns out we're not going to follow through with packet capture.

peterbourgon added a commit that referenced this pull request Aug 6, 2015
@peterbourgon peterbourgon merged commit 234b9ce into master Aug 6, 2015
@peterbourgon peterbourgon deleted the sniff branch August 6, 2015 14:21