New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
automatically expose pprof+prometheus endpoint. #677
Conversation
TODO:
|
pkg/sidecar/docker_reactor.go
Outdated
// Allow traffic from the host machine (acting as the gateway). | ||
var ( | ||
gwIPs = make(map[string]struct{}) | ||
gwRoutes []netlink.Route | ||
) | ||
for _, route := range controlRoutes { | ||
if _, ok := gwIPs[route.Gw.String()]; ok { | ||
continue | ||
} | ||
nlroutes, err := netlinkHandle.RouteGet(route.Gw) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to resolve route: %w", err) | ||
} | ||
gwIPs[route.Gw.String()] = struct{}{} | ||
gwRoutes = append(gwRoutes, nlroutes...) | ||
} | ||
|
||
controlRoutes = append(controlRoutes, gwRoutes...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the key. In kubernetes we don't need this, because kubernetes has native port-forwarding out of the box. In our case, we connect our test containers to the control and data networks. The former is tightly restricted, and the latter is internal and connects only the containers participating in the test. For containers to successfully expose ports to the host, besides publishing those ports, we need to ALLOW traffic from the gateway (host).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is actually added here?
I think this would be just a route to the single host, the route to the host, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by the way, I see this route table on plan containers:
docker exec tg-example-prometheus2-f3de125372b3-single-1 ip route
default via 192.18.0.1 dev eth0
16.0.0.0/16 dev eth1 proto kernel scope link src 16.0.0.2
192.18.0.0/16 dev eth0 proto kernel scope link src 192.18.0.6
they can ping google
docker exec tg-example-prometheus2-f3de125372b3-single-1 ping google.com
PING google.com (172.217.1.142) 56(84) bytes of data.
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=1 ttl=52 time=97.6 ms
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=2 ttl=52 time=104 ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for comparison... Without this change, from the current master,
the route table looks like this:
docker exec tg-example-prometheus2-19979292b320-single-1 ip route
16.1.0.0/16 dev eth1 proto kernel scope link src 16.1.0.3
192.18.0.3 dev eth0 src 192.18.0.7
192.18.0.4 dev eth0 src 192.18.0.7
and there is no default route, so no place to send packets outside of known networks.
docker exec tg-example-prometheus2-19979292b320-single-1 ping -c 2 google.com
ping: google.com: Temporary failure in name resolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for spotting this. Definitely unintended. I've fixed this logic in eaa06cd. Can you take it out for a spin again? Works for me.
- The host can access ports exposed by the container.
- The container cannot access public IPs.
Sometimes the flush could end up sending a nil bytes slice to TgWriter, and this would make it as a nil progress payload on the wire.
pkg/sidecar/docker_reactor.go
Outdated
// Allow traffic from the host machine (acting as the gateway). | ||
var ( | ||
gwIPs = make(map[string]struct{}) | ||
gwRoutes []netlink.Route | ||
) | ||
for _, route := range controlRoutes { | ||
if _, ok := gwIPs[route.Gw.String()]; ok { | ||
continue | ||
} | ||
nlroutes, err := netlinkHandle.RouteGet(route.Gw) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to resolve route: %w", err) | ||
} | ||
gwIPs[route.Gw.String()] = struct{}{} | ||
gwRoutes = append(gwRoutes, nlroutes...) | ||
} | ||
|
||
controlRoutes = append(controlRoutes, gwRoutes...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is actually added here?
I think this would be just a route to the single host, the route to the host, right?
pkg/sidecar/docker_reactor.go
Outdated
// Allow traffic from the host machine (acting as the gateway). | ||
var ( | ||
gwIPs = make(map[string]struct{}) | ||
gwRoutes []netlink.Route | ||
) | ||
for _, route := range controlRoutes { | ||
if _, ok := gwIPs[route.Gw.String()]; ok { | ||
continue | ||
} | ||
nlroutes, err := netlinkHandle.RouteGet(route.Gw) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to resolve route: %w", err) | ||
} | ||
gwIPs[route.Gw.String()] = struct{}{} | ||
gwRoutes = append(gwRoutes, nlroutes...) | ||
} | ||
|
||
controlRoutes = append(controlRoutes, gwRoutes...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by the way, I see this route table on plan containers:
docker exec tg-example-prometheus2-f3de125372b3-single-1 ip route
default via 192.18.0.1 dev eth0
16.0.0.0/16 dev eth1 proto kernel scope link src 16.0.0.2
192.18.0.0/16 dev eth0 proto kernel scope link src 192.18.0.6
they can ping google
docker exec tg-example-prometheus2-f3de125372b3-single-1 ping google.com
PING google.com (172.217.1.142) 56(84) bytes of data.
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=1 ttl=52 time=97.6 ms
64 bytes from dfw28s23-in-f14.1e100.net (172.217.1.142): icmp_seq=2 ttl=52 time=104 ms
pkg/sidecar/docker_reactor.go
Outdated
// Allow traffic from the host machine (acting as the gateway). | ||
var ( | ||
gwIPs = make(map[string]struct{}) | ||
gwRoutes []netlink.Route | ||
) | ||
for _, route := range controlRoutes { | ||
if _, ok := gwIPs[route.Gw.String()]; ok { | ||
continue | ||
} | ||
nlroutes, err := netlinkHandle.RouteGet(route.Gw) | ||
if err != nil { | ||
return nil, fmt.Errorf("failed to resolve route: %w", err) | ||
} | ||
gwIPs[route.Gw.String()] = struct{}{} | ||
gwRoutes = append(gwRoutes, nlroutes...) | ||
} | ||
|
||
controlRoutes = append(controlRoutes, gwRoutes...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for comparison... Without this change, from the current master,
the route table looks like this:
docker exec tg-example-prometheus2-19979292b320-single-1 ip route
16.1.0.0/16 dev eth1 proto kernel scope link src 16.1.0.3
192.18.0.3 dev eth0 src 192.18.0.7
192.18.0.4 dev eth0 src 192.18.0.7
and there is no default route, so no place to send packets outside of known networks.
docker exec tg-example-prometheus2-19979292b320-single-1 ping -c 2 google.com
ping: google.com: Temporary failure in name resolution
Re-requested review from @coryschwartz. I've rectified the logic that whitelists the host's IP addr. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Appears to work, I can hit http://<plan_ip>:6060/metrics and it responds with output,
The plan can't ping outside the network. LGTM
switch { | ||
case len(pub) == 0: | ||
logging.S().Warnw("failed to discover gateway/host address; no routes to public IPs", "container_id", container.ID) | ||
case pub[0].Gw == nil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this change, I have the following routing table in the plans:
docker exec tg-example-prometheus2-a0bf3767448d-single-9 ip r
16.2.0.0/16 dev eth1 proto kernel scope link src 16.2.0.3
192.18.0.1 dev eth0 src 192.18.0.6
192.18.0.3 dev eth0 src 192.18.0.6
192.18.0.4 dev eth0 src 192.18.0.6
Very nice, no default gateway.
and the outside is un-pingable, also very nice.
docker exec tg-example-prometheus2-a0bf3767448d-single-9 ping 1.1.1.1
connect: Network is unreachable
// Sidecar doesn't whitelist traffic to public addresses, but it special-cases | ||
// traffic between the container and the host, so that pprof, metrics and other | ||
// ports can be exposed to the Docker host. | ||
var PublicAddr = net.ParseIP("1.1.1.1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the IP address of Cloudfliar DNS. I don't have a problem with that, just pointing that out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's ok. This will not perform any lookup of any kind, it just matches against the kernel routing tables.
@@ -44,6 +44,10 @@ type Error struct { | |||
} | |||
|
|||
func (tgw *TgWriter) Write(p []byte) (n int, err error) { | |||
if p == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this prevent errors from popping up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It suppresses empty writes from making it onto the wire, which I found can happen when Flush is called. This was leading to the intermittent bug with the describe
command.
With this change, test plans using
runtime.Invoke()
will automatically benefit from pprof and prometheus HTTP exports. Fixes #673.Works in all runneres (
local:exec
,local:docker
,cluster:k8s
(via k8s port forwarding).Fixes #698.