gopacket sniffer #317
Conversation
Force-pushed 4f24ebb to 9522b45
@tomwilkie PTAL and let me know what you think! Maybe also interesting to @paulbellamy

	edgeKey             = report.MakeEdgeID(scopedLocal, scopedRemote)
	localAddressNodeID  = report.MakeAddressNodeID(r.hostID, c.LocalAddress.String())
	remoteAddressNodeID = report.MakeAddressNodeID(r.hostID, c.RemoteAddress.String())
	adjacencyID         = report.MakeAdjacencyID(localAddressNodeID)
I'm not quite seeing why the EdgeMetadata.* fields all have to be pointers. Is it so they can be nil, or for mutability?
They need maybe semantics, because the zero value may be meaningful.
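A minimal sketch of what "maybe semantics" via pointer fields looks like in practice. The field names and Merge behavior here are illustrative, not Scope's actual EdgeMetadata; the point is that nil ("not measured") and zero ("measured, zero bytes") stay distinguishable:

```go
package main

import "fmt"

// EdgeMetadata is a stand-in for a metadata struct whose counters need
// maybe semantics: nil means "not measured", 0 means "measured, zero".
type EdgeMetadata struct {
	EgressBytes  *uint64
	IngressBytes *uint64
}

// Merge sums counters, treating nil as absent rather than zero.
func (e EdgeMetadata) Merge(other EdgeMetadata) EdgeMetadata {
	e.EgressBytes = addMaybe(e.EgressBytes, other.EgressBytes)
	e.IngressBytes = addMaybe(e.IngressBytes, other.IngressBytes)
	return e
}

func addMaybe(a, b *uint64) *uint64 {
	switch {
	case a == nil:
		return b
	case b == nil:
		return a
	default:
		sum := *a + *b
		return &sum
	}
}

func main() {
	x := uint64(100)
	a := EdgeMetadata{EgressBytes: &x}
	b := EdgeMetadata{} // nothing measured on this side
	merged := a.Merge(b)
	fmt.Println(*merged.EgressBytes, merged.IngressBytes == nil) // 100 true
}
```

With plain (non-pointer) fields, merging would be unable to tell "no data" apart from a genuine zero-byte measurement.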
I'm seeing very weird behavior when testing the details pane. Still tracking down some bugs.
	for {
		select {
		case p := <-packets:
			s.Merge(p, rpt)
		}
	}
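Completed into a runnable form, the loop above looks roughly like this. Packet, Report, and the Merge signature are stand-ins for Scope's real types, and the done channel is a hypothetical shutdown signal not shown in the diff:

```go
package main

import "fmt"

// Packet and Report are illustrative stand-ins for the probe's types.
type Packet struct{ Bytes int }
type Report struct{ TotalBytes int }
type Sniffer struct{}

// Merge folds one captured packet into the report under construction.
func (s *Sniffer) Merge(p Packet, rpt *Report) { rpt.TotalBytes += p.Bytes }

// loop consumes packets until told to stop.
func (s *Sniffer) loop(packets <-chan Packet, done <-chan struct{}, rpt *Report) {
	for {
		select {
		case p := <-packets:
			s.Merge(p, rpt)
		case <-done:
			return
		}
	}
}

func main() {
	packets := make(chan Packet)
	done := make(chan struct{})
	finished := make(chan struct{})
	rpt := &Report{}
	s := &Sniffer{}
	go func() {
		s.loop(packets, done, rpt)
		close(finished)
	}()
	packets <- Packet{Bytes: 64}
	packets <- Packet{Bytes: 128}
	close(done)
	<-finished // wait for the loop to exit before reading the report
	fmt.Println(rpt.TotalBytes) // 192
}
```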
What's the performance impact of this change? It's not enabled by default - will we ever be able to do this? I've voiced concerns previously that we should get this kind of information (bandwidth) from the router / FDP and not through Scope packet capture.
Force-pushed 14fbc26 to 0a2c0c3
In the JSON representation, we want the Sampling data to be distinct.
Another implicit invariant in the data model is that edges are always of the form (local -> remote). That is, the source of an edge must always be a node that originates from within Scope's domain of visibility. This was evident from the presence of ingress and egress fields in edge/aggregate metadata. When building the sniffer, I accidentally and incorrectly violated this invariant by constructing distinct edges for (local -> remote) and (remote -> local), and collapsing ingress and egress byte counts to a single scalar. I experienced a variety of subtle undefined behavior as a result. See #339. This change reverts to the old, correct methodology.

Consequently, the sniffer needs to be able to find out which side of the sniffed packet is local vs. remote, and to do that it needs access to local networks. I moved the discovery from the probe/host package into probe/main.go. As part of that work I discovered that package report also maintains its own, independent "cache" of local networks. Except it contains only the (optional) Docker bridge network, if it's been populated by the probe, and it's only used by the report.Make{Endpoint,Address}NodeID constructors to scope local addresses. Normally, scoping happens during rendering, and only for pseudo nodes -- see the current LeafMap Render localNetworks. This is pretty convoluted and should either be made consistent or heavily commented.
Also, 1 packet may be counted in N topologies, so you can't rely on the sum of all packet counts across topologies having any relation to the sampling data.
@tomwilkie In my performance tests, adding packet capture at a 100% sample rate, via -capture=true and -capture.off=0, takes weighted moving average CPU usage from ~9% to ~11%. I don't think this is significant, but I'm probably not stressing the probe with enough traffic. It would be great to get more data here. Would you be willing to try it on your instances?

Sadly, decreasing the sample rate through the duty-cycle seems to have no effect, or, in the worst case, to increase CPU usage. With -capture.on=3s -capture.off=3s (50% effective sample rate) I see CPU usage at ~15%. With -capture.on=1s -capture.off=5s I see CPU usage at ~12%. Two strategies to make this better:
The first would make the code cleaner (and sample rates more accurate), but I think only the second stands a real chance of improving performance. That said, I don't think it's a pressing concern. I would propose to do it in a later PR.
Looks fine, but @tomwilkie knows more about the performance characteristics of this.
Sorry for the delay, been beavering away on SAAS.

re: performance; I don't think customers will be too interested in those figures, I'm afraid. They'll want an answer to the question "If I run this on my machine, how much does it slow down my application?" To measure this you could set up a pair of GCE VMs, run iperf between them, measure bandwidth, and then repeat with this enabled. Right now I imagine the performance impact isn't huge (as weave is already doing pcap, so the top-line figure is low). I would measure without weave to get a better idea. I worry that once we introduce FDP, running Scope with packet capture will be like reverting to weave legacy performance (we're talking 10s of MB/s vs 100s of MB/s).

Given the above, I can't see how this approach will ever be satisfactorily turned on by default. If we want bandwidth info, I'd rather bake the stats gathering into FDP, where it can be done in kernel with minimal overhead.
No, just easier to get running. pf_ring is totally possible.
To be clear, the performance overhead of turning on packet capture in this branch is 2% by my (admittedly very basic) tests. Is that what you're calling unsatisfactory?
How are you measuring that? 1% of what?

On Wednesday, 5 August 2015, Peter Bourgon notifications@github.com wrote:
So 2% overall penalty, and ~15% relative penalty — though that's near as makes no difference statistical noise until we get a bit more rigorous. The point is it's not a 2x or 10x penalty, and so is good enough for continued development (IMO).
> re: performance; I don't think customers will be too interested in those figures, I'm afraid. They'll want an answer to the question "If I run this on my machine, how much does it slow down my application?"

I don't think users care how much CPU Scope uses; they care what the effect is on their application. Claiming 1% performance overhead could be misleading: they might think an app serving 1k QPS will only lose 10 QPS, whereas I suspect the reality is very different.
I'm not sure how to respond to that. It's a lot of hypotheticals. Are you 👍 to merge this PR, do you prefer I close it without merging, or is there a third option?
Alright, I'll merge this in the present disabled state, just so it doesn't rot. We'll leave it on ice until we decide what to do with it. We can rip it out later if it turns out we're not going to follow through with packet capture. |
This PR adds packet sniffing to the probe's bag of tricks. Here's how it works:
Misc notes from the implementation:
There is no real concept of bytes ingress/egress on an edge, as an edge is directed, so those separate values have been removed. If there is to be a concept of ingress/egress, it should be derived as part of the rendering process.

To review, start with the Sniffer and work your way out, being sure to look at how it's wired up in probe.main. (It's modeled as a Reporter.)
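The render-time derivation hinted at above could look something like this. The edge type and byte map are illustrative, not Scope's data model: store only directed byte counts, and compute a node's ingress/egress when rendering:

```go
package main

import "fmt"

// edge is a hypothetical directed edge between two node IDs.
type edge struct{ from, to string }

// inOut derives ingress/egress for one node from directed edge byte
// counts, rather than storing them on the edges themselves.
func inOut(node string, bytes map[edge]int) (ingress, egress int) {
	for e, b := range bytes {
		if e.to == node {
			ingress += b // bytes arriving at the node
		}
		if e.from == node {
			egress += b // bytes leaving the node
		}
	}
	return ingress, egress
}

func main() {
	bytes := map[edge]int{
		{"a", "b"}: 100,
		{"b", "a"}: 40,
	}
	in, out := inOut("b", bytes)
	fmt.Println(in, out) // 100 40
}
```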
We still need to wire this into the (revised) details pane, but I wanted to get this work reviewed and merged first.