
gopacket sniffer #317

Merged
merged 8 commits into from Aug 6, 2015

Conversation

peterbourgon
Contributor

This PR adds packet sniffing to the probe's bag of tricks. Here's how it works:

  • We sample via a duty cycle, e.g. 1s on, 5s off.
  • When off, we still count (but don't decode) packets; this is necessary to compute a proper sample rate.
  • Currently gopacket/pcap only; we could think about adding e.g. pf_ring.

Misc notes from the implementation:

  • This removes the AggregateMetadata type, as (it turns out) we can merge EdgeMetadatas directly just fine.
  • There is no real concept of bytes ingress/egress on an edge, as an edge is directed, so those separate values have been removed. If we later need a concept of ingress/egress, it should be derived as part of the rendering process.
  • We can't cross-compile anymore :( so I had to remove the Darwin test from Circle.

To review, start with the Sniffer and work your way out, being sure to look at how it's wired up in probe.main. (It's modeled as a Reporter.)

We still need to wire this into the (revised) details pane, but I wanted to get this work reviewed and merged first.
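To illustrate the "modeled as a Reporter" shape: the sniffer satisfies a reporting interface and each call drains what the capture loop accumulated. The interface and types below are stand-ins, not Scope's real Reporter signature:

```go
package main

import "fmt"

// reporter is an assumed stand-in for the probe's Reporter interface;
// the real signature lives in the probe package and may differ.
type reporter interface {
	Report() map[string]int // stand-in for a report.Report
}

// sniffer satisfies reporter: each Report call hands back whatever the
// capture loop accumulated since the last report, then resets.
type sniffer struct {
	counts map[string]int
}

func (s *sniffer) Report() map[string]int {
	out := s.counts
	s.counts = map[string]int{}
	return out
}

func main() {
	var r reporter = &sniffer{counts: map[string]int{"edge": 3}}
	fmt.Println(r.Report()) // map[edge:3]
	fmt.Println(r.Report()) // map[]
}
```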

@peterbourgon peterbourgon changed the title [WIP] gopacket sniffer gopacket sniffer Jul 23, 2015
@peterbourgon
Contributor Author

@tomwilkie PTAL and let me know what you think! Maybe also interesting to @paulbellamy

edgeKey             = report.MakeEdgeID(scopedLocal, scopedRemote)
localAddressNodeID  = report.MakeAddressNodeID(r.hostID, c.LocalAddress.String())
remoteAddressNodeID = report.MakeAddressNodeID(r.hostID, c.RemoteAddress.String())
adjacencyID         = report.MakeAdjacencyID(localAddressNodeID)


@paulbellamy
Contributor

I'm not quite seeing why EdgeMetadata.* all have to be pointers? Is it so they can be nil, or for mutability?

@peterbourgon
Contributor Author

They need maybe semantics, because the zero value may be meaningful.
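A cut-down illustration of the maybe semantics (not the real Scope EdgeMetadata type): nil means "never observed", while a pointer to 0 means "observed, and the count was zero" — a plain uint64 couldn't distinguish the two:

```go
package main

import "fmt"

// edgeMetadata is an illustrative stand-in, not Scope's actual type.
type edgeMetadata struct {
	PacketCount *uint64 // nil = absent; non-nil 0 = observed as zero
}

// merge folds other into e, treating nil fields as absent.
func (e *edgeMetadata) merge(other edgeMetadata) {
	if other.PacketCount == nil {
		return // nothing observed on the other side
	}
	if e.PacketCount == nil {
		v := *other.PacketCount
		e.PacketCount = &v
		return
	}
	*e.PacketCount += *other.PacketCount
}

func main() {
	zero, three := uint64(0), uint64(3)
	a := edgeMetadata{PacketCount: &zero}  // observed: zero packets
	b := edgeMetadata{}                    // never observed
	c := edgeMetadata{PacketCount: &three} // observed: three packets

	a.merge(b)
	a.merge(c)
	fmt.Println(*a.PacketCount) // 3
}
```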

@peterbourgon
Contributor Author

I'm seeing very weird behavior when testing the details pane. Still tracking down some bugs.

for {
	select {
	case p := <-packets:
		s.Merge(p, rpt)


@tomwilkie
Contributor

What's the performance impact of this change? It's not enabled by default; will we ever be able to do this?

I've voiced concerns previously that we should get this kind of information (bandwidth) from the router / FDP and not through scope packet capture.

In the JSON representation, we want the Sampling data to be distinct.
Another implicit invariant in the data model is that edges are always of the
form (local -> remote). That is, the source of an edge must always be a node
that originates from within Scope's domain of visibility. This was evidenced by
the presence of ingress and egress fields in edge/aggregate metadata.

When building the sniffer, I accidentally and incorrectly violated this
invariant, by constructing distinct edges for (local -> remote) and (remote ->
local), and collapsing ingress and egress byte counts to a single scalar. I
experienced a variety of subtle undefined behavior as a result. See #339.

This change reverts to the old, correct methodology. Consequently the sniffer
needs to be able to find out which side of the sniffed packet is local v.
remote, and to do that it needs access to local networks. I moved the
discovery from the probe/host package into probe/main.go.

As part of that work I discovered that package report also maintains its own,
independent "cache" of local networks. Except it contains only the (optional)
Docker bridge network, if it's been populated by the probe, and it's only used
by the report.Make{Endpoint,Address}NodeID constructors to scope local
addresses. Normally, scoping happens during rendering, and only for pseudo
nodes -- see the current LeafMap Render localNetworks. This is pretty convoluted
and should either be made consistent or heavily commented.
Also, 1 packet may be counted in N topologies, so you can't rely on the
sum of all packet counts across topologies having any relation to the
sampling data.
@peterbourgon
Contributor Author

@tomwilkie In my performance tests, adding packet capture at 100% sample rate, via -capture=true and -capture.off=0, takes weighted moving average CPU usage from ~9% to ~11%. I don't think this is significant, but I'm probably not stressing the probe with enough traffic. It would be great to get more data here. Would you be willing to try it on your instances?

Sadly, decreasing the sample rate through the duty cycle seems to have no effect, or, in the worst case, increases CPU usage. With -capture.on=3s -capture.off=3s (50% effective sample rate) I see CPU usage at ~15%. With -capture.on=1s -capture.off=5s I see CPU usage at ~12%. Two strategies to make this better:

  1. Abandon the duty-cycle approach, make a decision with each packet
  2. Try pfring and/or netfilter, both provided in gopacket

The first would make the code cleaner (and sample rates more accurate), but I think only the second stands a real chance of improving performance. That said, I don't think it's a pressing concern. I would propose to do it in a later PR.

@paulbellamy
Contributor

Looks fine, but @tomwilkie knows more about the performance characteristics of this.
In general, any reason to use pcap instead of pf_ring? pcap is lossless, but that doesn't really bother us here...

@tomwilkie
Contributor

Sorry for delay, been beavering away on SAAS.

re: performance; I don't think customers will be too interested in those figures, I'm afraid. They'll want an answer to the question "If I run this on my machine, how much does it slow down my application?"

To measure this you could set up a pair of GCE VMs, run iperf between them, measure bandwidth, and then repeat with this enabled. Right now I imagine the performance impact isn't huge (as weave is already doing pcap, so the top-line figure is low). I would measure without weave to get a better idea. I worry that once we introduce FDP, running scope with packet capture will be like reverting to weave legacy performance (we're talking tens of MB/s vs hundreds of MB/s).

Given the above, I can't see how this approach will ever be satisfactorily turned on by default. If we want bandwidth info, I'd rather bake the stats gathering into FDP, where it can be done in-kernel with minimal overhead.

@peterbourgon
Contributor Author

In general, any reason to use pcap instead of pf_ring?

No, just easier to get running. pf_ring is totally possible.

re: performance; I don't think customers will be too interested in those figures, I'm afraid...

To be clear the performance overhead of turning on packet capture in this branch is 2% by my (admittedly very basic) tests. Is that what you're calling unsatisfactory?

@tomwilkie
Contributor

How are you measuring that? 1% of what?

On Wednesday, 5 August 2015, Peter Bourgon notifications@github.com wrote:

To be clear the performance overhead of turning on packet capture in this
branch is 1%. Is that what you're calling unsatisfactory?



@peterbourgon
Contributor Author

Adding packet capture at 100% sample rate … takes … CPU usage from ~9% to ~11%

So 2% overall penalty, and ~15% relative penalty — though that's near as makes no difference statistical noise until we get a bit more rigorous. The point is it's not a 2x or 10x penalty, and so is good enough for continued development (IMO).

@tomwilkie
Contributor

"re: performance; I don't think customers will be too interested in those figures, I'm afraid. They'll want an answer to the question "If I run this on my machine, how much does it slow down my application?""

I don't think users care how much CPU scope uses; they care what the effect is on their application. Claiming 1% performance overhead could be misleading: they might think an app serving 1k QPS will only lose 10 QPS, whereas I suspect the reality is very different.

@peterbourgon
Contributor Author

I'm not sure how to respond to that. It's a lot of hypotheticals. Are you 👍 to merge this PR, do you prefer I close it without merging, or is there a third option?

@peterbourgon
Contributor Author

Alright, I'll merge this in the present disabled state, just so it doesn't rot. We'll leave it on ice until we decide what to do with it. We can rip it out later if it turns out we're not going to follow through with packet capture.

peterbourgon added a commit that referenced this pull request Aug 6, 2015
@peterbourgon peterbourgon merged commit 234b9ce into master Aug 6, 2015
@peterbourgon peterbourgon deleted the sniff branch August 6, 2015 14:21