-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mark JIT code as writeable / executable depending on the situation #5032
Conversation
With this change and a reversion of 119626d, |
Yikes. I'll keep investigating |
The problem might be that it's not fine-grained enough. I read in an online thread that the running time of mprotect is O(n_pages). You could try changing |
I need to find an OpenBSD machine to test on. This change seems to have no impact on boot time either on macOS or Ubuntu (AMD). The test I used is this: require "benchmark/ips"
Benchmark.ips do |x|
x.report("startup") do
system "./miniruby --yjit -e 0"
end
end Results on macOS: Master:
This branch:
Ubuntu master:
This branch:
|
I'm not sure this change would slow anything down on other operating systems. The 0.25s (no yjit) to 3.59s (yjit) may just be the overhead of using yjit on OpenBSD. You cannot test on OpenBSD without this change, because before the change, yjit doesn't run on OpenBSD. |
I mean, I want to test @maximecb's theory that |
I don't see any reason why YJIT would just be slow on OpenBSD specifically, besides mprotect. It's still an x86 CPU. I think mprotect might be slow when you try to apply it to many pages at once but would be faster if touching fewer pages. Interesting if it's still fast on macOS and Ubuntu when doing mprotect on large memory regions at once 🤔 |
That's what I was thinking. But maybe it's just one big mapping and the OS is able to do something more efficient? 🤷🏻♀️ |
fec7f83
to
a6a4154
Compare
@tenderlove if you have time, do you think you could make a microbenchmark for mprotect on BSD? Say, allocate some large N number of pages. Then, in a a loop, mark the whole thing executable, then writable but not executable. And benchmark this with N=1 and with N=10000. We could benchmark this on both BSD and macOS, and it would help us understand if the number of pages matters or not. If you don't have time maybe I can write the code and you can run it. |
Ya, I'll do that today. Looks like I've got access to a BSD machine now. |
1038c48
to
be1b12b
Compare
I made a repository here that has a benchmark for mprotect. I tried to make the allocation similar to what we do in YJIT. The benchmark makes allocations in multiples of On the CI server, the output is like this:
It seems like setting the protection on the region is pretty fast regardless of the size. Even with a 409MB region we're still seeing 145772 iterations per second. As @jeremyevans said, we've never been able to run YJIT on BSD before, so we don't know that |
Hmmm. Could avoid doing the memset to see if it has an impact? |
6942349
to
4179ea8
Compare
I was able to reproduce a slow down in boot time:
Seems like the boot time issues is the
|
Specifically this memset call. I left the mprotect calls. The change I applied is this: diff --git a/yjit_asm.c b/yjit_asm.c
index b5e3bd12c7..6a33d3f3fa 100644
--- a/yjit_asm.c
+++ b/yjit_asm.c
@@ -218,7 +218,7 @@ uint8_t *alloc_exec_mem(uint32_t mem_size)
// Fill the executable memory with INT3 (0xCC) so that
// executing uninitialized memory will fault
cb_mark_writeable(cb);
- memset(mem_block, 0xCC, mem_size);
+ //memset(mem_block, 0xCC, mem_size);
cb_mark_executable(cb);
return mem_block; |
Very surprising. Does it take this long if you have a dummy program that just does memset and mprotect? :O |
77179be
to
47114c2
Compare
I found these tests were timing out in CI. I tracked down the culprit, and it was unfortunately the mprotect calls. The initial benchmarks showed that Of course if you The numbers I got back showed essentially a linear growth relationship between the number of pages mapped and the time it takes to call I didn't want to do book keeping to keep track of every page we've written to, so @jhawthorn and I came up with a scheme. We changed the function that writes to the JIT buffer. It checks the codeblock to see if the last write was on the same page. If the last write was on the same page, just write to the code block, otherwise unlock the page then write to the code block. Since it only keeps track of the "last written" page, we just mark the whole JIT buffer as executable (relying on the OS to figure out which pages changed protection). This strategy seems to bring boot time down a bunch. It's still slower than "no JIT", but faster than it was. Before the change
After the change
No JIT
|
78c0ca9
to
c066f84
Compare
I think this is ready for review. I was able to get it running on the BSD CI server and my results are like this:
Seems like around ~40ms overhead was added versus the interpreter. We should test to see how this impacts benchmarks too though. Final thought: should we try to conditionally enable this on platforms that require these protections? That way platforms that don't care don't take the speed hit? Thoughts @maximecb @jeremyevans ? |
I think it should be enabled everywhere by default, with maybe a flag if you want to switch to the less-secure-but-faster mode. It's important to understand that the main reason for this change is it improves security (memory that is both writable and executable makes exploitation much easier). That it is also more portable to operating systems that enforce the restriction is a bonus. |
Would be curious how this affects startup time and railsbench performance on Linux. We have some easy to run benchmarks in |
If I read the numbers above correctly it's 0.58vs vs 0.14s so that would be a 440ms difference, or ~4.14x slower for startup and loading RubyGems. |
@eregon It looks like 0.58vs vs 0.14s is the difference between yjit and no yjit, not the difference between yjit with this patch and yjit without this patch. I think the consideration when deciding whether to enable this protection by default when using yjit is going to depend on the difference between yjit with this patch and yjit without this patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clever approach to let the OS figure out which pages need to change protection. If it's fast enough then we can probably enable it for all systems. It would be nice if we could avoid introducing a new option for this.
+1 for benchmarking on Linux based systems to measure perf degradation. Maybe we want to add a benchmark for ruby --disable-gems -e 'require "rubygems"'
to https://github.com/Shopify/yjit-bench
If necessary, we could do bookkeeping for our pages. It would also allow us to fill with int3
on first write which helps boot perf. One of my machine spends >100ms on boot memsetting 256MiB with 0xcc
which isn't idea.
} | ||
|
||
if (ocb) { | ||
cb_mark_all_executable(ocb); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This callback is invoked per iseq, so it's going to make this large mprotect call O(ISEQ_COUNT)
times. That seems like it'd be bad for compaction perf. Maybe we need to touch gc.c
so we make everything executable when exiting the GC....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not 100% sure what to do about this, but I feel like we can postpone fixing this situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes we can postpone since we'll no longer have global cb/ocb next year with the code pages.
yjit_iface.c
Outdated
cb_mark_position_writeable(cb, offset_to_value); | ||
|
||
// Object could cross a page boundary, so unlock there as well | ||
cb_mark_position_writeable(cb, offset_to_value + SIZEOF_VALUE); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is an off-by-one error. If it so happens that we are writing exactly the last SIZEOF_VALUE
bytes of a page, this changes protection on two pages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, I thought so too (wrt the off by one), but I was worried about fixing the build first. Unlocking too much memory shouldn't make anything fail. I'll fix the OB1!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two system calls instead of one though? Can have performance implications?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maximecb As long as the VALUE doesn't span multiple pages, we'll only do one system call. I linked to it in another comment, but as long as offset_to_value + SIZEOF_VALUE - 1
falls on the same page as the previous call we'll only do one system call. This second call is in the off chance that the pointer happens to span two pages (and in that case we would need two system calls).
75e4369
to
dfb721f
Compare
Here are the perf results from my AMD machine on this branch:
Here it is from the master branch
Looks like |
Looks pretty good. Can you also measure the boot time on Linux? Otherwise, are we ready to merge? |
// Keep track of the current aligned write position. | ||
// Used for changing protection when writing to the JIT buffer | ||
uint32_t current_aligned_write_pos; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the aligned write position? What does it mean???
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable is similar to the write_pos
variable but it's aligned on multiples of the OS page size. That way when we write to the buffer, we can compare and only call the mprotect
system call when we're on a "new" page. It allows us to amortize the cost of an mprotect
system call.
I can add all of ^^^ as a comment if it helps (the same info is in the commit message for this code)
Here's boot time on my Linux machine. Without the patch:
With the patch:
It seems extremely close on Linux.
I think so! 😄 |
Sorry, could you try after |
Sure. Here it is with this PR:
Here it is without the PR:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Aaron for completing this and for being patient with Alan and I's questioning, producing benchmark results.
I'm sufficiently convinced that the impact on Linux platforms is minimal.
Some platforms don't want memory to be marked as writeable and executable at the same time. When we write to the code block, we calculate the OS page that the buffer position maps to. Then we call `mprotect` to allow writes on that particular page. As an optimization, we cache the "last written" aligned page which allows us to amortize the cost of the `mprotect` call. In other words, sequential writes to the same page will only call `mprotect` on the page once. When we're done writing, we call `mprotect` on the entire JIT buffer. This means we don't need to keep track of which pages were marked as writeable, we let the OS take care of that. Co-authored-by: John Hawthorn <john@hawthorn.email>
If YJIT isn't enabled, or hasn't finished booting, cb / ocb could be null. This commit just checks to make sure they're available before marking as executable Co-Authored-By: Maxime Chevalier-Boisvert <maxime.chevalierboisvert@shopify.com> Co-Authored-By: Kevin Newton <kddnewton@gmail.com>
dfb721f
to
efae920
Compare
@jeremyevans I merged this PR so I've reverted 119626d |
Awesome, thank you so much. I tested and the startup impact of
|
excellent! |
Some platforms don't want memory to be marked as writeable and
executable at the same time. This commit introduces two functions for
marking code blocks as either writeable or executable depending on the
situation. When we need to write to memory, we call
cb_mark_writeable
and when we're done writing to memory, call
cb_mark_executable
This is an initial stab at implementing memory protection for JIT code