Add sampling to pping #13
Conversation
Rewrite/regroup/reorder some points for the General pping section. Also add some new points, add some additional comments to existing points, and check in the "Skip pure ACKs" as completed. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add a document outlining my thoughts for how to implement sampling. Intended both as a basis for discussion, as well as being a form of documentation. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Immediate thoughts: The "keep track of the last seen identifier and sample the first one seen" was what immediately came to mind for me as well, so that makes sense! As for the possibility of "missing" an event, one option could be to sample in a "window". I.e., when you figure out that it's time to do a new sample, you save the timestamps of the next (say) 3 identifiers. And when the replies come back you also rate limit the output, so you'll end up only outputting one of those three samples when they come back (since they should be close together). As for sampling frequency, I'd say start with the simple thing: a fixed interval that defaults to something sensible but which can be overridden at the command line. And then add dynamic RTT-based sampling as the next thing :)
Add sections on per-flow state, graceful degradation and some implementation considerations. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Thanks for your comments @tohojo. I've also been thinking about allowing timestamping of a few consecutive packets, however I didn't think about also rate-limiting the output, so that will probably be a good addition to make as well. Starting simple with the sampling frequency sounds like a good choice. Overall, I will probably start off with something pretty basic, and then add more of the features I discuss afterwards (as you approve of the basics :)). But I still wanted to outline the overall requirements for this whole sampling thing early on, so that I don't continue building on that basic thing, only to discover after a while that it has some fatal flaw and I have to start all over (guess there's still a risk for that, but can at least try to minimize it somewhat).
I was thinking a bit about the sampling as well: head sampling (i.e., sampling some consecutive packets at the beginning of a flow), or a time-based window sampling (this one is more intuitive and easier to override from the command line). About the output I have not thought much; maybe I missed something about the benefit of it.
See also Kathie's visualisation tool for pping: https://github.com/pollere/ppviz - haven't played with it yet, but may be worth keeping in mind... |
Yeah, saw that visualization tool from the email conversation you passed along. Haven't tried it out yet either, but it might be useful to keep the same "machine readable" format as her pping does, so one can use that tool with both pping implementations (or at least have it as an option; we might want to output some additional information, like protocol, in the future).
Add a per-flow rate limit, limiting how often new timestamp entries can be created. As part of this, add per-flow state keeping track of when the last timestamp was created and the last seen identifier for each flow. Additionally, remove the timestamp entry as soon as the RTT is calculated, as the last seen identifier is used to find the first unique value instead. Furthermore, remove the packet_timestamp struct and only use __u64 as timestamp, as the used member is no longer needed. This initial commit lacks cleanup of flow state, user configuration of the rate limit, a mechanism to handle bursts, etc. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
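To make the commit's idea concrete, here is a rough sketch of what the per-flow state and rate-limit check could look like. The struct layout, names, and the rate-limit constant are illustrative assumptions based on the commit message, not the actual code:

```c
#include <linux/types.h>
#include <stdbool.h>

#define RATE_LIMIT_NS 100000000UL /* hypothetical: 100 ms between entries */

/* Hypothetical per-flow state, following the commit text */
struct flow_state {
	__u64 last_timestamp; /* when a timestamp entry was last created */
	__u32 last_id;        /* last seen identifier for this flow */
	__u32 reserved;
};

/* Returns true if a new timestamp entry may be created for this flow */
static bool allow_new_timestamp(struct flow_state *f_state, __u64 now)
{
	if (now - f_state->last_timestamp < RATE_LIMIT_NS)
		return false;
	f_state->last_timestamp = now;
	return true;
}
```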
Have pushed a very early implementation of the per-flow rate limitation now. This still lacks many key components that I will add later (see commit message). However, first I plan to update the documentation with some of the outcomes from today's discussion. Also, before going further I think I should attempt the fix @tohojo previously suggested to always load the TC program with libbpf. If I've understood it correctly, the idea is to let the current userspace program (which loads and attaches the XDP program) also load and pin the TC program using libbpf. Then it simply uses tc to attach the pinned BPF program. This way, it should have libbpf support regardless of whether iproute has libbpf support or not, and I won't have to deal with the dual case of iproute with/without libbpf support anymore (which in this commit leads to two definitions of the flow_state map).
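A sketch of what that load-and-pin flow could look like with libbpf; the object file name, program title, and pin path below are hypothetical:

```c
#include <bpf/libbpf.h>

/* Load the tc BPF program with libbpf and pin it, so that tc only needs
 * to attach the pinned program. All names here are illustrative. */
int load_and_pin_tc_prog(void)
{
	struct bpf_object *obj;
	struct bpf_program *prog;

	obj = bpf_object__open_file("pping_kern_tc.o", NULL);
	if (libbpf_get_error(obj))
		return -1;
	if (bpf_object__load(obj))
		return -1;

	prog = bpf_object__find_program_by_title(obj, "pping_egress");
	if (!prog)
		return -1;

	/* tc can then attach it with:
	 * tc filter add dev <iface> egress bpf direct-action \
	 *    pinned /sys/fs/bpf/pping/pping_egress */
	return bpf_program__pin(prog, "/sys/fs/bpf/pping/pping_egress");
}
```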
Update SAMPLING_DESIGN.md, partly based on discussions during the meeting with Red Hat on 2021-03-01. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
pping/pping_kern_tc.c
Outdated
	.nh = { .pos = pctx.data },
};
struct flow_state *f_state;
struct flow_state new_state = { 0 }; // Rarely-ish used, but can't really dynamically allocate memory or?
No, you can't. But allocating this on the heap would be inefficient anyway since it's such a small value :)
pping/pping_kern_tc.c
Outdated
		    BPF_NOEXIST);
	f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
	if (!f_state)
		goto end;
The convention in the kernel is generally to use 'out' for these kinds of 'end of function' labels. You're being consistent with 'end', though, and obviously I haven't noticed before, but just thought I'd mention it since I did notice this time ;)
Ok, good to know. Have fixed it in the latest commit.
pping/pping_kern_tc.c
Outdated
#define BURST_DURATION \
	4000000 // 4ms, duration for when it may burst packet timestamps
#define BURST_SIZE \
	3 // Number of packets it may create timestamps for in a burst
Maybe don't define these until you're actually using them?
True, just got a bit ahead of myself.
pping/pping_kern_tc.c
Outdated
 __uint(type, BPF_MAP_TYPE_HASH);
 __uint(key_size, sizeof(struct packet_id));
-__uint(value_size, sizeof(struct packet_timestamp));
+__uint(value_size, sizeof(__u64));
Oh, BTW, when you're using BTF-defined maps instead of value_size and key_size, you can use __type(key, struct packet_id) and __type(value, __u64) (IIRC that's the syntax), which will also encode the actual types into the BTF information that goes into the kernel with the map, which makes it possible for bpftool to print out the types, for instance...
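For illustration, a BTF-defined map using the suggested `__type()` macros could look like this (map name and size are placeholders):

```c
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, struct packet_id); /* type info ends up in the map's BTF */
	__type(value, __u64);
	__uint(max_entries, 16384);    /* placeholder size */
} packet_ts SEC(".maps");
```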
Right, I remember seeing that somewhere (possibly in the traffic-pacing-edt example), but forgot it when writing it myself. Have addressed it for all maps except the perf-buffer one, which didn't seem too happy about it (guess __u32 might not be the correct type, just the right size).
Load and pin the tc-bpf program in pping.c using libbpf, and only attach the pinned program using iproute. That way, one can use features that are not supported by the old iproute loader, even if iproute does not have libbpf support. To support this change, extend bpf_egress_loader with an option to load a pinned program. Additionally, remove the configure script and parts of the Makefile that are no longer needed. Furthermore, remove multiple definitions of the ts_start map, and place a single definition in pping_helpers.h, which is included by both BPF programs. Also, some minor fixes based on Toke's review. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
pping/pping.c
Outdated
/* return mkdir(path, 0700) ? -errno : 0; */
/* } */
/* return S_ISDIR(st.st_mode) ? 0 : -EEXIST; */
/* } */
Just remove code instead of commenting it out - it can always be resurrected from the git history...
Oops, forgot to clean that up. I often temporarily comment out some code while working on it, but missed removing it before the commit in this case.
pping/pping.c
Outdated
 * Could use bpf_obj__unpin_maps(obj, PINNED_DIR) if it only tried
 * unpinning pinned maps. But as it also attempts (and fails) to unpin
 * maps that aren't pinned, will instead manually unpin the one pinned
 * map for now.
Should be possible to call it with a NULL second argument, in which case it will only try to unpin the maps that are marked as pinned (so as long as it's the same bpf_object you originally pinned the maps on it should just work).
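That would reduce the cleanup to something like this (a sketch, assuming `obj` is the same `bpf_object` the maps were pinned from):

```c
/* With a NULL path, libbpf only unpins the maps it knows are pinned */
int err = bpf_object__unpin_maps(obj, NULL);

if (err)
	fprintf(stderr, "Failed unpinning maps: %s\n", strerror(-err));
```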
Completely missed that possibility. In that case that would be a much nicer solution.
Remove some commented-out code. Also use bpf_object__unpin_maps instead of manually unpinning the ts_start map. Additionally, change map_ipv4_to_ipv6 to use a clearer implementation (which now also works for tc due to always using libbpf to load the program). Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add parsing of TCP FIN/RST to determine if the connection is being closed, and if so delete state directly from the BPF programs. Only enabled on the tc program, as the verifier is unhappy about it on the XDP side for some reason. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Hi @tohojo, have pushed some new commits that attempt to add cleanup of the flow state map. The first commit adds the periodic userspace cleanup, which seems to work. The second commit contains an initial implementation of the BPF side cleanup, which attempts to delete flow state if the packet parsing detects that the flow is being closed (FIN or RST flags). I couldn't get it to work on the XDP side however, as the verifier started complaining then. I'm still not very good at figuring out what the verifier finds problematic, and being tired/frustrated probably doesn't help. When the verifier also started to complain simply from adding a bpf_printk statement that didn't access packet data at all, I decided to give up for the day. So if you have the time and energy, it would be great if you could take a look at it (I have commented out the section I have problems with in `pping_kern_xdp.c`, as well as the printk line in `pping_helpers.h/parse_tcp_identifier()`).
Simon Sundberg <notifications@github.com> writes:
> Hi @tohojo,
> Have pushed some new commits that attempt to add cleanup of the flow
> state map. The first commit adds the periodic userspace cleanup, which
> seems to work. The second commit contains an initial implementation of
> the BPF side cleanup, which attempts to delete flow state if the packet
> parsing detects that the flow is being closed (FIN or RST flags).
> I couldn't get it to work on the XDP side however, as the verifier
> started complaining then. I'm still not very good at figuring out what
> the verifier finds problematic, and being tired/frustrated probably
> doesn't help. When the verifier also started to complain simply from
> adding a bpf_printk statement that didn't access packet data at all I
> decided to give up for the day. So if you have the time and energy, it
> would be great if you could take a look at it (I have commented out
> the section I have problems with in `pping_kern_xdp.c`, as well as the
> printk line in `pping_helpers.h/parse_tcp_identifier()`).
Hmm, that's a really good question! I can see why you're frustrated...
As far as I can tell, the immediate cause of the trouble is that
including the bpf_printk() statement causes the compiler to allocate
registers differently, which means that some of the later code uses the
r7 register instead of the r1 register to store the packet offsets they
are loading from. And this means that the verifier suddenly thinks that
those packet reads are out of bounds.
Not sure if this is a verifier bug, or a compiler bug, or both. Or if
there is some other subtle bug lurking somewhere that just gets
triggered by this. But at least the above is an explanation :)
The bpf_map_delete() by itself seems to work fine for me if I leave only
the bpf_printk commented out...
Thanks for the answer, makes it a bit clearer what's going on at least, even if it doesn't really make it any clearer for me how to fix it.
Interesting, as I get a different error when I try to run with that (without the `bpf_printk`). What I find confusing (other than it suddenly complaining at the code that's parsing the TCP options) is that the specific
The verifier might have rejected the XDP program due to opt_size being loaded from memory, see https://blog.path.net/ebpf-xdp-and-network-security. Add a check of opt_size to attempt to convince the verifier that it's not a negative value or anything else crazy. This instead leads to the verifier thinking the program is too large (over 1m instructions). Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Hi again @tohojo. So my attempt at a fix was to add a little extra check to try and convince the verifier that opt_size can't contain a negative value (and limited the upper bound to 34, as that's the largest legal size for a TCP option as far as I can tell; simply putting it at 0xff would cause the compiler to skip the check as it's already a u8). But then I run into a new problem of the program suddenly being too large (over 1 million instructions). I find this a bit strange, as I don't feel like my program should have that many paths? Think I might somehow have gotten stuck in an infinite loop here, as just after pushing this commit I noticed that if I put the loop back to being unrolled the verifier accepts the program again. So if you have the time, could you check if this fix also causes the verifier to reject the program as being too big for you as well? And if so, any idea why the not-unrolled loop would become an infinite loop (or at least much larger than the unrolled version)? Finally, any idea how I should solve this whole mess?
Add a check to ensure the verifier that opt_size is positive in case it's been read in from the stack. Also enable (uncomment) the flow-state cleanup from the XDP program, as the added check avoids verifier rejection. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
With the last commit (e04284a) I've added a check (or rather a bitwise AND) that seems to work for convincing the verifier that `opt_size` can't be negative. Adding this does however cause the "infinite loop" problem we've talked about (at least for me), so until we've figured that out I've temporarily reverted to unrolling the loop. Will later on squash some of these commits to avoid this whole detour, but figured I'll leave them for now while they're still somewhat relevant to our discussion. Also, as a note to myself, I must figure out what the
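The masking trick mentioned here would look roughly like the following inside the option-parsing loop. The mask value is an assumption for illustration, not necessarily the one used in e04284a:

```c
/* Sketch: an AND gives the verifier a hard range for opt_size that
 * survives even if the value is spilled to the stack, where a plain
 * comparison apparently may not. The 0x3f mask is illustrative. */
opt_size = *(pos + 1);
opt_size &= 0x3f; /* verifier now knows opt_size is in [0, 63] */
if (opt_size < 2) /* malformed option */
	return -1;
```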
I can envision an interesting preface about your friendship with the verifier in your thesis :)
abrunstrom ***@***.***> writes:
> I can envision an interesting preface about your friendship with the
> verifier in your thesis :)
Heh, yeah. I've been beating my head against it for a couple of hours
now; think I'm just going to give up for the time being :/
One thing I did notice while doing this, though, is that I don't think
your code correctly handles the case where opt_size is 0 after you read
it from the packet (or at least it is only doing so incidentally by
exhausting the loop)?
Yeah, you are right about that. While `opt_size` _should_ never be 0, you can of course never know what you get in a packet. So as you say, my code right now would just loop over the same place until it's exhausted the loop. Guess the correct way to handle it would instead be to just terminate the loop immediately.
Simon Sundberg ***@***.***> writes:
> Yeah, you are right about that. While `opt_size` _should_ never be 0,
> you can of course never know what you get in a packet. So as you say,
> my code right now should just loop over the same place until it's
> exhausted the loop. Guess the correct way to handle it would instead
> be to just terminate the loop immediately.
Yup, that works :)
Add a check that opt_size is at least 2 in pping_helpers.h/parse_tcp_ts, otherwise terminate the loop unsuccessfully. Only check the lower bound of opt_size; the upper bound will be checked in the first step of the next loop iteration anyway. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
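Putting this thread together, the resulting option-parsing loop presumably looks roughly like the sketch below. It is reconstructed from the diffs and commit messages in this PR, not the verbatim code; the signature and the `TCP_MAX_OPTIONS` bound are assumptions:

```c
#include <linux/types.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define TCP_MAX_OPTIONS 10 /* illustrative loop bound */

/* Find the TCP timestamp option and extract TSval/TSecr (sketch) */
static __always_inline int parse_tcp_ts(struct tcphdr *tcph, void *data_end,
					__u32 *tsval, __u32 *tsecr)
{
	int len = tcph->doff << 2;
	void *opt_end = (void *)tcph + len;
	__u8 *pos = (__u8 *)(tcph + 1); /* options follow the fixed header */
	__u8 opt, opt_size;
	int i;

	if ((void *)(tcph + 1) > data_end || len <= (int)sizeof(*tcph))
		return -1;

	for (i = 0; i < TCP_MAX_OPTIONS; i++) {
		if ((void *)(pos + 1) > opt_end || (void *)(pos + 1) > data_end)
			return -1;
		opt = *pos;
		if (opt == 0) /* end-of-option-list */
			return -1;
		if (opt == 1) { /* NOP, single byte */
			pos++;
			continue;
		}

		if ((void *)(pos + 2) > opt_end || (void *)(pos + 2) > data_end)
			return -1;
		opt_size = *(pos + 1);
		if (opt_size < 2) /* malformed option: terminate immediately */
			return -1;

		if (opt == 8 && opt_size == 10) { /* timestamp option */
			if ((void *)(pos + 10) > opt_end ||
			    (void *)(pos + 10) > data_end)
				return -1;
			*tsval = bpf_ntohl(*(__u32 *)(pos + 2));
			*tsecr = bpf_ntohl(*(__u32 *)(pos + 6));
			return 0;
		}
		pos += opt_size;
	}
	return -1;
}
```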
tohojo left a comment
A few nits - also, your main() function is getting quite long, may be worth thinking about splitting it. For the BPF code, now that you're compiling and loading everything with libbpf, you could move the two functions into the same source file and avoid some code duplication (and the need for the pping_helpers.h file).
pping/pping.c
Outdated
		duration % NS_PER_SECOND);
#endif
cleanup:
	if (key)
It's perfectly fine to call free() on a NULL value, so you can skip these checks...
pping/pping.c
Outdated
 * Periodically cleans out entries from both the packet timestamp map and the
 * flow state map. Maybe better to split up the cleaning of the maps into two
 * separate threads instead, to better utilize multi-threading and allow for
 * maps to be cleaned up at different intervals?
I very much doubt the map cleaning is going to be a bottleneck, so adding threading is just likely to complicate things. If you do want to optimise this in the future, the right way would be to use the in-kernel map iterator functionality... :)
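For reference, a single-threaded cleanup pass over one of the maps could look like this from userspace. The value layout (a raw `__u64` timestamp), the `struct packet_id` key type from the PR, and the timeout constant are assumptions:

```c
#include <string.h>
#include <bpf/bpf.h>

#define TIMEOUT_NS (10ULL * 1000000000) /* illustrative: 10 s */

/* One cleanup pass over the packet timestamp map (sketch). Note that
 * deleting the key we advance from can make a hash map iteration
 * restart from the beginning; acceptable for a periodic cleanup. */
static void clean_packet_ts_map(int map_fd, __u64 now)
{
	struct packet_id key, next_key;
	__u64 ts;

	memset(&key, 0, sizeof(key));
	while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) {
		if (bpf_map_lookup_elem(map_fd, &next_key, &ts) == 0 &&
		    now - ts > TIMEOUT_NS)
			bpf_map_delete_elem(map_fd, &next_key);
		key = next_key;
	}
}
```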
 if (opt == 8 && opt_size == 10) {
-	if (pos + opt_size > opt_end ||
-	    pos + opt_size > data_end)
+	if (pos + 10 > opt_end || pos + 10 > data_end)
I think that if you add a check before the loop for opt_end <= data_end, you can avoid these double checks everywhere (i.e., just check against opt_end)...
I think I tried that before but the verifier wasn't convinced. That said, it was quite some time ago and I have changed some things around since then, so I can try that again and see if the verifier is smart enough this time.
Had another try at this, but still can't seem to convince the verifier that if opt_end <= data_end, then anything < opt_end must also be < data_end. So I think the double checks will have to stay for now.
Not sure if it's just that the verifier is too stupid, or if opt_end gets placed on the stack and the verifier then forgets about its limitations.
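For reference, the attempted simplification was along these lines (a sketch of the idea, not code from the PR):

```c
void *opt_end = (void *)tcph + (tcph->doff << 2);

if (opt_end > data_end) /* checked once, before the loop */
	return -1;

/* ...in the hope that each access inside the loop then only needs
 *     if ((void *)(pos + opt_size) > opt_end)
 * but the verifier loses the opt_end <= data_end relation (possibly
 * when opt_end is spilled to the stack), so the double checks stay. */
```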
Merge the pping_kern_tc.c, pping_kern_xdp.c and pping_helpers.h into the single file pping_kern.c. Do not change any of the BPF code, except renaming the map ts_start to packet_ts. To handle both BPF programs kept in a single ELF file, change the loading mechanism to extract and attach both the tc and XDP programs from it. Also refactor the main method into several smaller functions to reduce its size. Finally, add the --force (-f) and --cleanup-interval (-c) options to the argument parsing, and improve the parsing of the --rate-limit (-r) option. NOTE: The verifier rejects the program in its current state as too large (over 1 million instructions). Setting TCP_MAX_OPTIONS in pping_kern.c to 5 (or less) solves this. Unsure at the moment what causes the verifier to think the program is so large, as the code in pping_kern.c is identical to the one from the three files it was merged from. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Hi @tohojo, I think my latest commit addresses most of your points. There is however one major problem with it: the verifier rejects it due to hitting the 1 million instruction limit. I find this very confusing, as the code in pping_kern.c is identical to the code from the three files it was merged from (and setting TCP_MAX_OPTIONS to 5 or less makes it pass).
Will also need to update the documentation a bit to reflect the changes from this commit, as incorrect documentation is worse than no documentation at all. That said, I want to hold off on it for a little while in case we need to do some other restructuring to solve this "infinite loop" (or whatever it is that causes the verifier to think the program is so large) problem.
AFAICT what happens is that the compiler shuffles the code around in a way that includes a backwards jump even though the loop gets unrolled. And so the verifier still needs to trace all the jumps around, can't convince itself that the program exits, and ends up exiting due to the instruction limit. I think we may be at the point where we need to post this on the upstream list and see if someone there knows how the verifier can be made smarter, or if there's a way to convince the compiler to emit code that passes. Do you think you could produce a minimal self-contained example suitable for posting upstream? :)
Makes sense I guess, there are indeed some backward jumps in the bytecode. That said, those backjumps also exist for the 5-loop version, so I guess it just gets too complex for the verifier to track the jumps after 6 loops? It also works fine for the XDP program (if I basically just comment out all the code in the tc program), which should hit the same backward jumps, so not sure what differentiates them that's relevant here.
I can try, but as I'm not sure exactly what's triggering the problem I don't really know how to recreate it in a minimal fashion. I guess some of the key ingredients are two programs in the same file sharing a common function with a loop, but there's also some secret sauce which I don't really know what it is. Is there any way to get the verifier output even when it passes a program? Would be interesting to know how many instructions it exhausts on the XDP program, as well as when only using 5 loops. See if those are anywhere near 1 million (based on the much shorter loading times I suspect not).
Simon Sundberg ***@***.***> writes:
> > AFAICT what happens is that the compiler shuffles the code around in
> > a way that includes a backwards jump even though the loop gets
> > unrolled. And so the verifier still needs to trace all the jumps
> > around, can't convince itself that the program exits, and ends up
> > exiting due to the instruction limit.
>
> Makes sense I guess, there are indeed some backward jumps in the
> bytecode. That said, those backjumps also exist for the 5-loop
> version, so I guess it just gets too complex for the verifier to track
> the jumps after 6 loops? It also works fine for the XDP program (if I
> basically just comment out all the code in the tc program), which
> should hit the same backward jumps, so not sure what differentiates
> them that's relevant here.

Erm, no idea? I guess the context around the call to that function trips
the verifier for some reason?

> > I think we may be at the point where we need to post this on the
> > upstream list and see if someone there knows how the verifier can be
> > made smarter, or if there's a way to convince the compiler to emit
> > code that passes. Do you think you could produce a minimal
> > self-contained example suitable for posting upstream? :)
>
> I can try, but as I'm not sure exactly what's triggering the problem I
> don't really know how to recreate it in a minimal fashion. I guess
> some of the key ingredients are two programs in the same file sharing
> a common function with a loop, but there's also some secret sauce
> which I don't really know what it is.

"Minimal" doesn't have to mean "small". It should be as small as
possible, but no smaller. And it should be self-contained so you can put
it into a single email, which I guess means copying over the helpers
from parsing_helpers.h. So I'd start by doing this, and then delete as
much code as you can while still hitting the bug.

> Is there any way to get the verifier output even when it passes a
> program? Would be interesting to know how many instructions it
> exhausts on the XDP program, as well as when only using 5 loops. See
> if those are anywhere near 1 million (based on the much shorter
> loading times I suspect not).

Sure, the verifier will spit out the log in any case if requested to do
so. The trouble is that libbpf doesn't really expose this in a good way,
so you may have to hack the loading function. IIRC, it's enough to set
log_buf_size to a non-zero value at the top of load_program() in libbpf,
and set log_level in struct bpf_prog_load_attr (which means loading with
bpf_object__load_xattr() or bpf_prog_load_xattr()).
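Concretely, the log-enabled load described above would be something like this (a sketch against the libbpf API of that era; `bpf_object__load_xattr()` has since been deprecated):

```c
struct bpf_object_load_attr load_attr = {
	.obj = obj,     /* bpf_object opened earlier, e.g. with pin_root_path */
	.log_level = 1, /* ask the kernel to emit the verifier log */
};
int err = bpf_object__load_xattr(&load_attr);
```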
Hmm, might toy around a bit with that later, as I guess that means I need to set the map pin-location in some different manner. For now though I simply put some of my bpftrace knowledge to use to fetch it from the verifier's `bpf_verifier_env` struct using a kprobe and kretprobe for `do_check()`.
Simon Sundberg ***@***.***> writes:
> > Sure, the verifier will spit out the log in any case if requested to
> > do so. The trouble is that libbpf doesn't really expose this in a
> > good way, so you may have to hack the loading function. IIRC, it's
> > enough to set log_buf_size to a non-zero value at the top of
> > load_program() in libbpf, and set log_level in struct
> > bpf_prog_load_attr (which means loading with
> > bpf_object__load_xattr() or bpf_prog_load_xattr()).
>
> Hmm, might toy around a bit with that later, as I guess that means I
> need to set the map pin-location in some different manner.

No, you can still open with open_opts() setting pin_root_path, and then
load with load_xattr... bpf_object__load() is literally just:

int bpf_object__load(struct bpf_object *obj)
{
	struct bpf_object_load_attr attr = {
		.obj = obj,
	};

	return bpf_object__load_xattr(&attr);
}

> For now though I simply put some of my bpftrace knowledge to use to
> fetch it from the verifier's `bpf_verifier_env` struct using a kprobe
> and kretprobe for `do_check()`.

That works too, I guess ;)
But for just getting the processed instructions (which was mainly what I was after), my bpftrace script seems to work fine as well. Anyways, seems like even before I merged the two programs into a single file I wasn't too far away from hitting the verifier limit, as if I comment out the entire XDP program (so the file only contains a single program) it still uses about 700k instructions to verify the tc program. So this is not really a new issue, in the sense that I would have hit it before as well if I had just increased my loop count to say 13-15. When both the XDP and tc programs are present, it uses about 500k instructions to verify the XDP program and then hits the 1 million limit for the tc program. And it's quite interesting how commenting out different parts of the code makes a difference to the number of instructions the verifier needs to process. For example, I've noted that if I remove the "switch" logic on saddr and daddr depending on if it's egress or ingress, then the XDP and tc programs will use about the same amount of instructions (whereas with the switch logic the tc program needs about twice the number of instructions processed). If I drop the IPv6 branch it falls down to just 8k for the XDP program and 16k for the tc program.
Simon Sundberg ***@***.***> writes:
> > No, you can still open with open_opts() setting pin_root_path, and
> > then load with load_xattr... bpf_object__load() is literally just:
>
> But for just getting the processed instructions (which was mainly what
> I was after), my bpftrace script seems to work fine as well.

Ah, you just wanted the number of processed instructions? Sure, bpftrace
will probably work better for that :)

> Anyways, seems like even before I merged the two programs into a
> single file I wasn't too far away from hitting the verifier limit, as
> if I comment out the entire XDP program (so the file only contains a
> single program) it still uses about 700k instructions to verify the tc
> program. So this is not really a new issue, in the sense that I would
> have hit it before as well if I had just increased my loop count to
> say 13-15. When both the XDP and tc programs are present, it uses
> about 500k instructions to verify the XDP program and then hits the 1m
> limit for the tc program. And it's quite interesting how commenting
> out different parts of the code makes a difference to the number of
> instructions the verifier needs to process.

Clever! :)

> For example, I've noted that if I remove the "switch" logic on saddr
> and daddr depending on if it's egress or ingress, then the XDP and tc
> programs will use about the same amount of instructions (whereas with
> the switch logic the tc program needs about twice the number of
> instructions processed). If I drop the IPv6 branch it falls down to
> just 8k for the XDP program and 16k for the tc program.

Ah! I guess you're running into some kind of combinatorial explosion
with the loop in skip_ip6hdrext(), then. You could try commenting that
out? Another thing to play with is removing the __always_inline to see
if you can get each function to actually end up as separate functions in
the BPF code. That *should* make the verifier look at each function in
isolation, which might also help :)
Yep, I do indeed seem to be hitting some interesting combinatorial effect. Commenting out or reducing the loop count for `skip_ip6hdrext()` does indeed help.
Unfortunately removing the `__always_inline` didn't seem to do the trick. I have pushed a kinda minimal working example (or not working example I guess) to [the `Create_minimal_example` branch](https://github.com/simosund/bpf-examples/tree/Create_minimal_example/pping_mwe). The question is what to do now? Should we still try to report something upstream, and in that case what? That having multiple (unrolled, non-nested) loops seems to cause the verifier to exhaust a very large amount of processed instructions? And how do we solve it in the meantime? Splitting it up into two separate programs would probably work for now, but based on my testing that still causes the verifier to process a rather large amount of instructions, so I may still run into the same problem later on as I continue to add features to pping. Guess another option would be to lower the loop counts somewhat, as I think it's quite rare that packets would need to exhaust all of them anyway.
noinline does exist. You could try googling for the magical incantation for clang and trying that out?
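For the record, the incantation is just the standard attribute; a sketch of how it might be applied (the signature is illustrative):

```c
/* Keep the helper as a separate BPF-to-BPF function instead of having
 * clang inline it, so the verifier can analyze it in isolation */
static __attribute__((noinline)) int parse_tcp_ts(struct tcphdr *tcph,
						  void *data_end, __u32 *tsval,
						  __u32 *tsecr);
```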
I've now tried `noinline` on the `skip_ip6hdrext()` and `parse_tcp_ts()` functions (the ones containing the loops), and it did help a little bit. But it was not enough to get under the 1 million instruction limit for the tc program (based on the instructions processed for the XDP program I'm guessing it would need about 1.14 million instructions). Could potentially try and apply it to other methods and see then, but not sure if it's a good idea for me to start telling the compiler to stop inlining a bunch of functions (that it would normally like to inline).
Also tried a bit of the opposite, that is telling the compiler to inline everything (`__always_inline`). When the programs are kept in separate files it does inline everything, but when the programs are kept in the same file it calls `parse_packet_identifier()` as an actual function. So was hoping telling the compiler to simply inline it again would get me down below the 1 million limit, but it seems like even when using `__always_inline` the compiler will still generate BPF bytecode where it's called like a function. I've compiled some of the options I've tried and the resulting number of instructions the verifier processed [here](https://github.com/simosund/bpf-examples/tree/Create_minimal_example/pping_mwe) if it is of interest.
After some discussion with my supervisors we decided it would be good if I didn't get stuck on this too long, so thinking of implementing a bit of a quickfix for now and we can eventually come back to this in the future. I'm planning on just decreasing the maximum number of IPv6 header extensions parsed (they are apparently not very common). Some other options would be to split it up into two files again (largely undoing the last commit) or decreasing the maximum number of parsed TCP options, but think those come with more drawbacks. But if you have a different opinion I'm all ears.
Regarding reporting it upstream, I'm not sure if there's anything in particular to report anymore (that the verifier doesn't like multiple loops?). Another less formal alternative could be to just post it on the eBPF Slack channel and see if people have any good ideas on how to get around this.
Simon Sundberg ***@***.***> writes:
> > noinline does exist. You could try googling for the magical
> > incantation for clang and trying that out?
>
> I've now tried `noinline` on the `skip_ip6hdrext()` and
> `parse_tcp_ts()` functions (the ones containing the loops), and it did
> help a little bit. But it was not enough to get under the 1 million
> instruction limit for the tc program (based on the instructions
> processed for the XDP program I'm guessing it would need about 1.14
> million instructions). Could potentially try and apply it to other
> methods and see then, but not sure if it's a good idea for me to start
> telling the compiler to stop inlining a bunch of functions (that it
> would normally like to inline).

Ah, too bad - was hoping that would be a quick fix ;)

> Also tried a bit of the opposite, that is telling the compiler to
> inline everything (`__always_inline`). When the programs are kept in
> separate files it does inline everything, but when the programs are
> kept in the same file it calls `parse_packet_identifier()` as an
> actual function. So was hoping telling the compiler to simply inline
> it again would get me down below the 1 million limit, but it seems
> like even when using `__always_inline` the compiler will still
> generate BPF bytecode where it's called like a function. I've compiled
> some of the options I've tried and the resulting number of
> instructions the verifier processed
> [here](https://github.com/simosund/bpf-examples/tree/Create_minimal_example/pping_mwe)
> if it is of interest.

Wow, nice writeup! :)
Adding the kernel and compiler versions you're testing this on is the
only thing I can see that's missing for this to be a useful reference
for asking people; see below.

> After some discussion with my supervisors we decided it would be good
> if I didn't get stuck on this too long, so thinking of implementing a
> bit of a quickfix for now and we can eventually come back to this in
> the future. I'm planning on just decreasing the maximum number of IPv6
> header extensions parsed (they are apparently not very common). Some
> other options would be to split it up into two files again (largely
> undoing the last commit) or decreasing the maximum number of parsed
> TCP options, but think those come with more drawbacks. But if you have
> a different opinion I'm all ears.

Totally agreed, don't get held up on this; just lower one or both loop
counts and carry on!

> Regarding reporting it upstream, I'm not sure if there's anything in
> particular to report anymore (that the verifier doesn't like multiple
> loops?). Another less formal alternative could be to just post it on
> the eBPF Slack channel and see if people have any good ideas on how to
> get around this.

Given that you've already created such a comprehensive writeup, I
definitely think you should at least ask upstream! You could do this
either on Slack or on the mailing list; or both? On Slack you can
probably get away with just putting in the link above, but on the
mailing list you probably need to do a little bit of a writeup. You
could just edit the existing README file a bit and turn it into an
email, then send that to ***@***.*** and
***@***.***
I guess posting the link on Slack is easiest, so try that first, then
send the email afterwards if you don't get any replies? :)
Nice summary Simon! I agree, send it on the Slack channel and let's move on to the output :)
Reduce IPV6_EXT_MAX_CHAIN to 3 to avoid hitting the verifier limit of processing 1 million instructions. This results in fewer loops in parsing_helpers.h/skip_ip6hdrext, which simplifies the verifier analysis. IPv6 extension headers do not appear to be that common, so this is unlikely to cause a considerable limitation. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
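For context, the loop in question is roughly the following (a simplified sketch based on the xdp-tutorial parsing helpers, not the exact file). Each unrolled iteration multiplies the number of paths the verifier must walk, so lowering the bound directly cuts the processed-instruction count:

```c
#include <linux/in.h>
#include <linux/ipv6.h>
#include <bpf/bpf_helpers.h>

#define IPV6_EXT_MAX_CHAIN 3 /* reduced by this commit */

struct hdr_cursor {
	void *pos; /* current parsing position */
};

/* Skip IPv6 extension headers, returning the upper-layer protocol (sketch) */
static __always_inline int skip_ip6hdrext(struct hdr_cursor *nh,
					  void *data_end, __u8 next_hdr_type)
{
	int i;

	for (i = 0; i < IPV6_EXT_MAX_CHAIN; i++) {
		struct ipv6_opt_hdr *hdr = nh->pos;

		if ((void *)(hdr + 1) > data_end)
			return -1;

		switch (next_hdr_type) {
		case IPPROTO_HOPOPTS:
		case IPPROTO_DSTOPTS:
		case IPPROTO_ROUTING:
		case IPPROTO_MH:
			nh->pos = (char *)hdr + (hdr->hdrlen + 1) * 8;
			next_hdr_type = hdr->nexthdr;
			break;
		case IPPROTO_AH:
			nh->pos = (char *)hdr + (hdr->hdrlen + 2) * 4;
			next_hdr_type = hdr->nexthdr;
			break;
		case IPPROTO_FRAGMENT:
			nh->pos = (char *)hdr + 8;
			next_hdr_type = hdr->nexthdr;
			break;
		default:
			return next_hdr_type; /* not an extension header */
		}
	}
	return -1;
}
```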
Update documentation to reflect the current state of pping (after merging pping_kern_tc and pping_kern_xdp into a single file). Also add another point to the TODO list that has been discussed at a previous meeting. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Hi again @tohojo. I've now reduced the number of IPv6 extensions parsed, so that the program loads again. So if you could have a final look at it we could perhaps get it merged in. I think I've addressed all of your requests (minus the double checks for opt_end and data_end, which I can't seem to manage without), but there could of course be something more to do as well. Now that we load both the BPF programs from the same object file, I guess we don't need to pin the maps anymore? So guess I could also remove the code related to that (not that much, since libbpf takes care of most of it).
Yeah, good point about the pinning! I agree, let's get rid of it. We may need to bring it back later for when you enable background monitoring (in that case there won't necessarily have to be a long-running process at all if a cron job can just ping the maps), but for now it's just complicating stuff and incurring a dependency on bpffs... Couldn't spot anything else to complain about, so I'm fine with merging that once you've removed the pinning :)
When both BPF programs are kept in the same file, no longer need to pin the maps in order to share them between the programs. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Have now removed the code related to pinning the maps (as stated, not many lines). That said, we don't really get rid of the dependency on bpffs, as we still pin the tc program (so we don't need to rely on tc loading the program, only attaching it).
Right, but libbpf will hopefully gain support for attaching the tc progs soon, so that can go away as well :)
After some thinking and a bit of discussion with @abrunstrom and @sferlin, I've found that there are a number of potential issues that need to be solved in order to get sampling to work well. Before I start actually implementing the sampling, I would like to get some feedback on whether my plan for how to do the sampling makes sense. Therefore I've started documenting some of the challenges and how I plan to solve them. And instead of sending a bunch of mails back and forth, I figured we could simply have that discussion in this PR instead.
So this PR doesn't contain any code changes (yet), only a markdown document (SAMPLING_DESIGN.md) which I hope can serve as the basis for some discussion. I still have some sections left which I plan to fill in tomorrow, and then we can maybe discuss some of these issues at the next meeting.