Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Unix domain sockets as -a listen addresses and backend addresses #2371

Closed
wants to merge 33 commits into from

Conversation

slimhazard
Copy link
Contributor

@slimhazard slimhazard commented Jul 16, 2017

This is a partial implementation of the ideas proposed in VIP17.

In VIP17 but not implemented yet:

  • UDS addresses for varnishadm and the CLI (-T and -M args).
  • param uds_path as a search path for relative paths in backend definitions.
    • At the moment, all paths for a UDS are required to begin with a /.
  • Obtaining credentials of the peer over the UDS (uid, gid) and making them available via VMOD std.
  • Named listen addresses (@Dridi will be working on that separately).

Update: At the end of VIP17 (section "Implementation"), I've written down some proposals for how these features could be implemented -- -T and -M, uds_path, and additions to VMOD std.

The big picture: -a addresses can be specified as:

-a /path/to/uds[,[PROXY,][user=<user>,][group=<group>,][mode=<mode>]]

If you specify any of user, group and/or mode (mode in octal), then the path will be created with those permissions. (The sub-args can appear in any order.)

A backend definition may have:

backend foo { .path = "/path/to/uds"; }

.path XOR .host must be specified for a backend.

You can also use -b /path/to/uds to define a backend on the command line.

This code has been tested on:

  • Linux 3.16 (Debian jessie 3.16.43-2+deb8u2) and 4.9 (Debian jessie 8.9, 4.9.30-2+deb9u2~bpo8+1)
  • FreeBSD 11.0-RELEASE-p9
  • SunOS 5.11 (joyent_20160317T000621Z)

There are quite a few details and issues worth considering for a review. So as to break it up into small pieces, I'll add more comments about those issues (and point out some interesting quirks about UDSen that I discovered along the way).

Update: Branches with the fork rebased to Varnish 5.2.0 and 5.2.1 live here:

RPMs for el7 for the two forks live at: https://packagecloud.io/uplex/varnish-uds

@slimhazard
Copy link
Contributor Author

sockaddr_un storage management, VSAs and session workspace

A VSA now looks like this:

struct suckaddr {
        unsigned                        magic;
#define SUCKADDR_MAGIC                  0x4b1e9335
        sa_family_t                     sa_family;
        union {
                struct sockaddr_in              sa4;
                struct sockaddr_in6             sa6;
                const struct sockaddr_un        *sau;
        }                               sa;
};

So the UDS member is a pointer to sockaddr_un and the creator of the VSA must pass in the pointer. This has the consequence that the typeless sockaddr, which had been used to get the sa_family, cannot be a member of the union, so sa_family is pulled out as a separate field. VSAs don't get any larger -- on my machine they're 40 bytes.

VSA_Build and VSA_Malloc now have an optional argument that may be NULL or a pointer to a sockaddr_un, which must be passed in by the caller for a UDS. Neither of these allocate space for sockaddr_un, but VSA_Clone clones a whole VSA, including space for a sockaddr_un if it's PF_UNIX.

When Varnish starts, mgt_acceptor.c allocates space for sockaddr_un for any UDS listen addresses, and the VSAs in the pool of struct listen_sock point to them. As before, these are handed off to the child via heritage.socks, which lives for the lifetime of Varnish and never has to be destroyed. So it just dies, along with the sockaddr_un storage, when the management process exits.

When a session is created for a connection accepted via UDS, its VSA is duplicated as the LOCAL session attribute, but it points back to the sockaddr_un in the heritage.socks pool. The other three session attributes all point to LOCAL. A UDS will never have different addresses for LOCAL and REMOTE (it's the same path), and this implementation is not supporting UDS addresses via PROXY (for reasons I'll explain below), so CLIENT and SERVER are also the same.

So we actually use less session workspace for UDSen -- one VSA for LOCAL, pointing to the sockaddr_un in heritage.socks is all we need. The other three session attributes for addresses point to it, and we use no session workspace for the IP and port strings.

@slimhazard
Copy link
Contributor Author

slimhazard commented Jul 16, 2017

Backends and TCP pools

struct backend and struct tcp_pool now each have these fields, in addition to the VSA fields for IPv4 and IPv6 VSAs:

        struct suckaddr         *uds;
        void                    *uds_sockaddr;

The second pointer points to suckaddr_un, which the backend and TCP pool each allocate, and later destroy. IMO this does not significantly increase the storage burden, because each backend and TCP pool has up to two sockaddrs, as was the case before with IPv4 and IPv6 (other than the fact that UDS sockaddrs are larger).

struct vrt_backend now has a field char *path. VCC does not have to do the dance of resolving IP addresses and the binary VSA, so it defers the creation of the VSA for a UDS, and of the storage for sockaddr_un, to VRT_new_backend. The VCC compiler checks the path -- it must exist and be a socket, otherwise Varnish won't be able to connect to it.

VMOD debug has been extended to support dynamic creation of backends that listen at UDSen via VRT_new_backend, and VTC tests for that have been added.

Edit: uds_sockaddr had been const void * in a previous version, but --enable-developer-warnings and flexelint pointed out that the pointer cannot be const, since for backends it must be possible to free the pointer. Later commits in the PR fix that.

@slimhazard
Copy link
Contributor Author

slimhazard commented Jul 16, 2017

VTCP_*name

These are called about 30 times in the code, and are conceived for IPs/ports, so it's a chore to get them right for UDS. The solution here is:

  • VTCP_ADDRBUFSIZE is large enough to hold a UDS path.
  • If it's a UDS, the path gets placed into the addr field, and "-" goes into port.

But that by itself doesn't solve all of the problems, because getsockname and getpeername may return an unnamed UDS, meaning that there's garbage in the sockaddr_un.sun_path field, depending on whether the socket is bound or connected. If it's bound, then getpeername (used by VTCP_hisname) returns an unnamed UDS, and if connected, then getsockname (used by VTCP_myname) returns an unnamed UDS.

So the code has to decide, on a case-by-case basis, whether we're dealing with a UDS, and whether it's bound or connected, and do the right thing. That in turn requires knowing that it's a UDS, which isn't always possible (the VTCP_*names take a socket fd, not a VSA).

@slimhazard
Copy link
Contributor Author

slimhazard commented Jul 16, 2017

VSL outputs

The output formats are consistent with formats used for IP addresses -- there is always the same number of non-whitespace positions in the output. But some of those positions may be - when the listen/backend address is UDS:

  • ReqStart /path/to/uds -
    • For IPs: ReqStart <ipaddr> <port>
  • SessOpen - - </path/to/uds or named -a arg> - - <timestamp> <fd>
    • IP: SessOpen <local ip> <local port> <-a addr> <remote ip> <remote port> <timestamp> <fd>
  • BackendStart /path/to/uds -
    • IP: BackendStart <backend addr> <backend port>
  • BackendOpen <fd> <backend display name> /path/to/uds - - - -
    • IP: BackendOpen <fd> <backend display name> <remote ip> <remote port> <local ip> <local port>

ReqStart and BackendStart are used by varnishncsa for the %h and %r formatters. There's a test (u00008.vtc) to check that these come out right -- %h produces the path, and %r comes out right for reasons I'll explain below.

@slimhazard
Copy link
Contributor Author

slimhazard commented Jul 16, 2017

-a path creation and deletion

bind creates the UDS path, and fails with EADDRINUSE if it already exists. This implementation calls unlink before bind, ignores ENOENT and leaves the path in the file system on exit.

Precedents:

  • haproxy does the same thing.
  • nginx does the opposite -- it unlinks UDSen that it created on exit, and if the path is there at startup, then bind fails and nginx fails to start.

I'm a bit of wary of deleting an existing path at startup -- someone might think they're using it for something, and Varnish comes along with root permissions and deletes it. But if we were to do it the nginx way, then:

  • We'd have to install handlers for SIGTERM, SIGINT and atexit in mgt to clean up the paths on exit.
  • We might still fail to clean up the paths if Varnish is stopped irregularly.
  • systemd love: when systemctl stop or restart are called, systemd sends SIGQUIT to stop the process. So we'd have to consider a handler for SIGQUIT as well.

That last is the reason why nginx fails to restart when you use systemctl -- it doesn't have a SIGQUIT handler, doesn't clean up the UDSen, and then on systemctl start or restart nginx fails to start if it has to bind to a UDS that it didn't remove on exit.

@slimhazard
Copy link
Contributor Author

X-Forwarded-For and backend Host headers

This implementation leaves X-F-F unchanged if the client connection came over a UDS. Maybe we should add something like localhost. I wouldn't add the path, we don't need to give out information about the file system layout.

Precedents: with haproxy and nginx, you get whatever you get when you have a configuration like set_header X-Forwarded-For $x-f-f-variable. I haven't checked yet what that looks like exactly for a UDS.

In backend definitions thus far, if .hosthdr is not set, then Host in backend requests gets set to whatever was in .host for the backend. But for UDS backends, you must have .path and you MAY NOT set .host in the definition. So in this implementation, Host gets set to localhost in requests to a UDS backend if the .hosthdr was not set.

Maybe we should require .hosthdr for UDS backend definitions. If not, we should stongly recommend setting it in the documentation.

Because of the fact that Host gets set to localhost in this situation, the varnishncsa %r formatter renders a request line with http://localhost/url, so it turns out reasonably.

@slimhazard
Copy link
Contributor Author

slimhazard commented Jul 16, 2017

vtc

To test Varnish with a UDS listen address, call it with -arg "-a /path/to/uds". The implicit -a argument with port 0 (bind to any open port) is not opened in this case.

The -listen and -connect commands have been adapted to work with UDSen -- if the string begins with /, then a UDS address is used. For server -listen "/path/to/uds", the umask is temporarily set to 0 before bind, to avoid permission issues accessing the socket. -vcl+backend detects which servers are listening at UDS, and constructs appropriate backend definitions.

The easy way to do this is to use ${tmpdir} as the location of the socket, so the new tests look something like this:

server s1 -listen "${tmpdir}/s1.sock" { ... }

varnish v1 -arg "-a ${tmpdir}/v1.sock" -vcl+backend { ... }

client c1 -connect "${tmpdir}/v1.sock" { ... }

If a server is listening at UDS, then both ${s1_addr} and ${s1_sock} have the path, and ${s1_port} is "-". When a Varnish instance listens at UDS, ${v1_addr} has the UDS path, ${v1_port} is "-", and ${v1_sock} is /path/to/uds -.

If a client or server is using a UDS address, then the vtc variables remote.ip and remote.port are NULL (match <undef>). A new variable remote.path has been added, which holds a UDS path or NULL (matching <undef>) if an IP address is in use.

However, I noticed while doing that that remote.ip and remote.port are not used in any of our VTC tests. Maybe we should remove them, and not introduce remote.path.

@slimhazard
Copy link
Contributor Author

slimhazard commented Jul 16, 2017

Socket options

The table of socket options in cache_acceptor.c now has a flag iponly. When this flag is true, the option is not set for a UDS socket.

Socket options not set for UDS are currently exactly the same as those with level is IPPROTO_TCP, meaning that TCP_NODELAY and the TCP_KEEP* options are not set.

@slimhazard
Copy link
Contributor Author

PROXY protocol

The PR has code to send UDS addresses to a backend for PROXYv2, but there are no tests for that yet.

At this point I don't think we should support receiving UDS addresses via PROXY:

  • No one seems to have missed it so far.
  • UDS is not supported in PROXYv1.

We could theoretically support receiving UDS addresses via PROXYv2, but I can't think of any place they could go besides session workspace, with all of the sockaddr_un storage requirements that we wanted to avoid.

IMO this should be something for a future iteration.

@slimhazard
Copy link
Contributor Author

HTTP/2

There is one test for HTTP/2 over UDS addresses -- t02010.vtc, which does exactly what t020000.vtc does, except with UDS.

It seems that the socket address shouldn't matter to the transport protocol. But I'm afraid that I don't really grok HTTP/2 or the vtc commands for it, so I am unaware of any surprises that may turn up after all. Someone who knows HTTP/2 better than I do should put some thought to that.

@slimhazard
Copy link
Contributor Author

Backend_health

May now emit a U, where 4 or 6 appears for IP, for backend probe over UDS.

@slimhazard
Copy link
Contributor Author

slimhazard commented Jul 16, 2017

*.ip, std.ip() and std.port()

All of the *.ip variables, including beresp.backend.ip, are NULL when the listen or backend address is a UDS. There are now VCL variables local.path (for a listen address) and beresp.backend.path, which hold the path if it's UDS, NULL otherwise.

I didn't see the point in remote., client. or server.path. For a UDS, the remote address is exactly the same as the local address (the same path). If we're not supporting UDSen over the PROXY protocol, there's also no point in client.path or server.path, and introducing them as new VCL variables might create the false impression that we are supporting UDSen over PROXY.

Because of the fact that the *.ip variables might be NULL, the assumption that a VCL_IP variable is never NULL, for example in VMOD functions, is no longer true. For example, the fallback in std.ip() could be NULL, so the idiom std.ip("<string>", client.ip) is no longer guaranteed to return an IP address. That could become a problem for third-party VMODs, if they always assert not-NULL for VCL_IP.

I updated std.ip() to return NULL if called with a string that begins with /, and not to throw assertions if the fallback if NULL. We should consider invoking VCL failure in one or both of these situations.

std.port() returns 0 if the IP is NULL (that didn't have to be changed). If std.port() is passed in a string that begins with a /, then it is not implicitly converted to an IP address (see below about getaddrinfo).

The potential NULL valiue for the *.ip variables also has the strange consequence that:

x.ip == x.ip

is always false for every x. That is, all of the *.ip variables are unequal to themselves if set to NULL due to a UDS (because IP comparison always evaluates to false if one or both of the comparands is NULL).

@slimhazard
Copy link
Contributor Author

ACLs

UDS addresses don't work in any way with ACLs. They can't be listed in an ACL, and matching a *.ip variable against an ACL, which is always NULL for a UDS, always fails for a UDS.

@slimhazard
Copy link
Contributor Author

getaddrinfo

On Linux, AF_UNIX addresses are not supported for getaddrinfo, but they are on FreeBSD. That is, if the string provided happens to match a path that is a socket, then FreeBSD getaddrinfo will return a sockaddr-as-sockaddr_un for it.

In a few places in the code, getaddrinfo is called with the intent of converting a string to an IP address, for example in the IP resolver called by VCC or in VRT_IP_string. Since the intent is to create or return an IP type (a VCL_IP variable or a VSA for IPv4 or IPv6), the code needs to make sure it doesn't get to getaddrinfo if the string might be interpreted as a UDS path. I believe that I've done this; at any rate we'll have to keep an eye on it in the future.

@slimhazard
Copy link
Contributor Author

Weirdnesses

  • std.set_ip_tos() invokes VCL failure if the listen address is a UDS (socket option IP_TOS is level IPPROTO_IP and is not compatible with UDS).
  • Test b00046.vtc uses send_urgent to send TCP OOB data, to verify that it "doesn't cause ill effects". OOB data is not supported for UDS (at least on Linux). There's no way to write a test about it, since send_urgent just calls send(2) with MSG_OOB, which fails for a UDS and causes the test to fail. I guess this would be an "ill effect". Is that a problem? (What exactly is the test trying to prove?)

@bsdphk
Copy link
Contributor

bsdphk commented Aug 8, 2017

I tried this on FreeBSD and sent geoff email with the results.

@nigoroll
Copy link
Member

nigoroll commented Sep 4, 2017

Consensus from bugwash: merge after next release, re-think the VCL interface

@bsdphk
Copy link
Contributor

bsdphk commented Sep 6, 2017

Just to detail the discussion yesterday: The main worries is if the VCL side of this is where we want to go, and the plan is to start working on this right after 5.2 is out.

… path.

On FreeBSD, getaddrinfo(3) may convert a path to an AF_UNIX sockaddr.
Note that as a consequence of the *.ip variables returning NULL for
UDS, the fallback argument to std.ip() may now be NULL.
Otherwise, since client.ip is NULL for a UDS, unset client.identity
would always return NULL, and hence would be the same for every
client. That would undermine the purposes it's meant for, such as
the hash director.
These are NULL (match <undef>) for UDSen.
NULL for an IP address (matches <undef>), the path for a UDS.
There's currently only one left, e826 in cache_acceptor.c, due to
assigning void * from VSA_Get_Sockaddr to struct sockaddr_un *.
This affects the log output for BackendStart.
@Dridi
Copy link
Member

Dridi commented Feb 27, 2018

Shouldn't we close this one?

@Dridi Dridi closed this Feb 27, 2018
@slimhazard
Copy link
Contributor Author

I was keeping it open for reference, since it's not entirely OBE yet (the backend part is still pending), but yes it was going to be closed eventually. We can of course look back at a closed PR, same difference.

@Dridi
Copy link
Member

Dridi commented Feb 27, 2018

As I see it, the backend side is happening in #2579. I just went through some triage frenzy so if anything should be kept open, don't mind me and reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants