Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bogoMips 0 causes a halt/freeze #74

Closed
misutoneko opened this issue Jul 31, 2021 · 7 comments
Closed

bogoMips 0 causes a halt/freeze #74

misutoneko opened this issue Jul 31, 2021 · 7 comments

Comments

@misutoneko
Copy link

misutoneko commented Jul 31, 2021

Hi!
Oh man what an awesome project. Your blog posts were nice too :D
Quite a lot of questions spinning around in my head now, but mainly just wanted to drop in to say "hi, saw what you did there, i approve" :D

I do have a small bug to report though. I did some testing in UniPCemu and found out that bogoMips goes to zero for some reason
(EDIT: looks like it's caused by emulation slowness...__bogo_loops reaches value of 8, so that would do it).

This causes Tilck to freeze at kb8042 kernel module init, because the nop loop counter (ecx) will flip over to max value.
Forcing loops_per_us value to 1 enables Tilck to boot all the way to the shell prompt.

@vvaltchev
Copy link
Owner

vvaltchev commented Jul 31, 2021

Hi, nice to meet you!
Thank you for your kind words and for appreciating my work.
I'm glad to hear you're using Tilck to test a new emulator, that's awesome.

Now, back to technical stuff. I've took a look at the conversation on vogons.org (great place, btw).
I'd like to start by explaining how i8042_io_wait() works: for years, it was just an empty inline function because hypervisors don't need any I/O delays: they directly intercept the IN/OUT instructions and emulate the behavior expected from the device, blocking the guest system. On real hardware machines, because of the way 8042 is implemented/emulated today, io_wait() is almost never needed too, but if you wanna be really sure your code will work reliably, it's mandatory to follow the specification and do proper IO waits of 1 microsecond. Now, because wasting CPU cycles is pointless while running on a hypervisor, I added that if(in_hypervisor()), as an optimization. It looks like that on UniPCemu the hypervisor bit in CPUID[1].ecx is simply not set.

Still, even if that bit was set, having __bogo_loops = 0 is bad, because loops_per_us will be 0 as well and delay_us() won't work properly in the general case. Now, straight to the point: how that could happen? Well, that's pretty weird and it should never happen. How the whole thing works:

  1. During boot, in init_timer() a special temporary IRQ0 handler is registered: measure_bogomips_irq_handler().
  2. After that, a kernel worker thread runs do_bogomips_loop()
  3. In that function ctx->started is set.
  4. In the meanwhile all calls to measure_bogomips_irq_handler() do nothing because of the !ctx->started
  5. The first that handler founds ctx->started, resets __bogo_loops and set pass_start = true.
  6. From now, on each timer tick (IRQ 0), ctx->ticks++ will be incremented. At the same time, the CPU spends all the rest of its time iterating in asm_do_bogomips_loop, which increments __bogo_loops once every BOGOMIPS_CONST (10,000) iterations.
  7. After a fixed amount of ticks, determined by MEASURE_BOGOMIPS_TICKS, we decide that's enough so the extra IRQ handler is uninstalled, we calculate loops_per_us and set __bogo_loops = -1: this makes the asm_do_bogomips_loop assembly function to end.
  8. delay_us() as you know uses loops_per_us under the hood

Now, given that, in order to __bogo_loops to remain 0 one of two things have to happen:

  1. We should never get an IRQ 0 again. If that's the case, we'll be stuck in do_bogomips_loop() and never see the Tilck bogoMips: message on the screen. In my understanding that's not the case.
  2. The emulator is so slow that in 1/10th of the second (MEASURE_BOGOMIPS_TICKS) we couldn't finish to perform 10,000 BOGOMIPS_CONST iterations. So, everything seems to go well but we end up with loops_per_us = 0.

In my understanding of the comments on vogons.org, you do see the Tilck bogoMips: message on the screen, so I believe we're in the case 2: the emulator is too slow. In other words, the emulator performs less than 10k iterations in 1/10th of the second, or less than 100k iterations in 1 second. So, the problem could be fixed by optimizing the emulator: generally pure software emulators (without dynamic translation) have no more than 1,000x slowdown. Starting from a realistic value of 2,000 - 3,000 bogoMips on QEMU + KVM on modern machines, a pure software emulator should show a value >= 2.

Now, given your EDIT comment, it looks like you've been able to reach __bogo_loops = 8. That means that the just stated theory of case 2 (less than 100k / second) is not quite correct. What happens is that the emulator performs less than 1 million iterations / second (which is better, of course).

loops_per_tick = 8 * 10,000 / 10 = 8,000 (with TIMER_HZ=100) 
loops_per_us = 8,000 / (1,000,000 / 100) = 0.8 = 0 (truncated)

And loops_per_us == 0 because of truncation. After that, asm_nop_loop() looks like hanging because it ends after reaching 0, but the first instruction is a SUB, so we get to -1 (= 2^32-1) and that's a lot.

So, it looks like the slowdown of the emulator (compared to physical hardware) is more than 3,000x. That's a lot, but it's not surprising. Are you compiling with it -O3? (Btw, what bogoMIPS value Tilck shows when it's run on QEMU with KVM ?)

Anyway, a fix could be to set:

loops_per_tick = 
   MAX(1, __bogo_loops) * BOGOMIPS_CONST/MEASURE_BOGOMIPS_TICKS;

In the measure_bogomips_irq_handler() function. If that turns out to be a good solution for your case, would like to make a pull-request with the fix?

@vvaltchev
Copy link
Owner

vvaltchev commented Jul 31, 2021

An alternative solution might be to:

  1. Rename loops_per_us to loops_per_ns
  2. Initialize loops_per_us = 5 (instead of 5,000)
  3. In measure_bogomips_irq_handler(): loops_per_ns = loops_per_tick / (1000 / TIMER_HZ);
  4. In do_bogomips_loop(): printk("Tilck bogoMips: %u.%03u\n", loops_per_ns/1000, loops_per_ns % 1000);
  5. In delay_us() do:
   if (loops_per_ns >= 1000)
      asm_nop_loop(us * loops_per_ns / 1000);
  1. Optionally, introduce a delay_ns() function

With such a solution, you should see something like:

Tilck bogoMips: 0.800

When running Tilck on the emulator.

@misutoneko
Copy link
Author

OK thx, I'll check those.
Actually the one in the first post won't work unless you use 10 instead of 1 (otherwise it'll still get truncated to 0).
It'd probably be better to use MAX() on loops_per_us, to avoid introducing an extra constant which may cause breakage later on.
(Unless there's some other reason to do it this way - Can't see any, but...)
As for PR - In fact I wasn't planning on doing any PRs when I first joined GitHub, but I may get around to it, eventually...

The emulation speed may be partly because of my lack of understanding with UniPCemu configuration.
The hardware on the host isn't the fastest either lol (AMD FX-series)...
qemu-system-x86_64 (4.2.1) with kvm gives me bogoMips of 1719 and yes, UniPCemu does use -O3 by default.
But to really gain speed the interpretion code would need an overhaul I guess (plus use some dirty tricks to cut corners whenever possible :)

@vvaltchev
Copy link
Owner

vvaltchev commented Aug 1, 2021

Ah yeah, you're right. My first suggestion won't work with 1, and 10 is bad because depends on TIMER_HZ and related.
The problem is that we need the following to hold: loops_per_tick >= (1000000 / TIMER_HZ), so the correct fix should be:

loops_per_tick =
   MAX(1000000 / TIMER_HZ, __bogo_loops * BOGOMIPS_CONST/MEASURE_BOGOMIPS_TICKS);

But that's pretty ugly, I know. The problem with using MAX on just loops_per_us is that it will create an inconsistency with loops_per_tick. Another super-simple solution is to just fix delay_us() this way:

void delay_us(u32 us)
{
   FASTCALL void asm_nop_loop(u32 iters);
   u32 loops = us * loops_per_us;

   if (loops > 0)
      asm_nop_loop(loops);
}

Anyway, I won't insist at all for you to make a pull-request, as it's super easy to fix it myself. Just, it seemed to me that you spent a considerable amount of time playing with this project & debugging it, so I thought you would be happy to put your name on the commit with the fix :-)

@misutoneko
Copy link
Author

The one in your second post did the trick. Here it is in patch form:
Tilck_timer_ns_patch.diff.gz

That's a good point about attribution, and for bigger patches it's a good idea for maintenance's sake. Especially with larger projects with many contributors...
My logic is that the authors will always be more fluent with their own code than me, and the patches will mostly be quite small since I mostly only do testing/bug-hunting.

vvaltchev added a commit that referenced this issue Aug 1, 2021
As discovered from #74, it is
possible on some emulators to have a vCPU be able to do less than
1 million iterations. When that happens `loops_per_us` is 0 and delay_us()
appears to hang because it starts counting back from 2^31 - 1.

The fix
--------

Since in such cases the vCPU does < 1 million iters / second, we should
not do ANY work in delay_us(u32 us), when us <= 1. But when `us` is big
enough we certainly are expected to burn some cycles in order to time to
pass.

In order to avoid loosing precision, in delay_us() we're going to spin for
`us * loops_per_us` only when loops_per_us >= 10, otherwise we're going
to spin for `(us * loops_per_ms) / 1000`. For example:

	loops_per_ms = 800
	loops_per_us = 0

If us <= 1, we'll return immediately. But if us == 5, we'll spin:

	(5 * 800) / 1000 = 4 times

The same logic is used when loops_per_us is too small (e.g. 5):

	loops_per_ms = 5923
	loops_per_us = 5

With the basic formula `us * loops_per_us`, with `us` == 4, we'll have
to spin for 4 * 5 = 20 iterations, while with the more precise one:

	(4 * 5923) / 1000 = 23

We'll have to spin a bit more.
@vvaltchev
Copy link
Owner

Thanks for the patch!
Unfortunately, at the end I had to modify it heavily mostly because some of my suggestions were not completely correct. For example, the variable loops_per_ns is calculated correctly but represents the loops per millisecond, not nanosecond. Also, after some serious thinking I realized that it's not enough to just check for loops > 0 in delay_us(). That would mean that on slower machines (emulators), delay_us() won't do any work, no matter the value of us. In the patch I've prepared, I believe I've handled all the cases :-)

vvaltchev added a commit that referenced this issue Aug 1, 2021
As discovered from #74, it is
possible on some emulators to have a vCPU able to do less than 1 million
iterations per second. When that happens, `loops_per_us` is 0 and delay_us()
appears to hang because it starts counting back from 2^31 - 1.

The fix
--------

Since in such cases the vCPU does < 1 million iters / second, we should
not do ANY work in delay_us(u32 us), when us <= 1. But when `us` is big
enough we certainly are expected to burn some cycles in order to time to
pass.

In order to avoid loosing precision, in delay_us() we're going to spin for
`us * loops_per_us` only when loops_per_us >= 10, otherwise we're going
to spin for `(us * loops_per_ms) / 1000`. For example:

	loops_per_ms = 800
	loops_per_us = 0

If us <= 1, we'll return immediately. But if us == 5, we'll spin:

	(5 * 800) / 1000 = 4 times

The same logic is used when loops_per_us is too small (e.g. 5):

	loops_per_ms = 5923
	loops_per_us = 5

With the basic formula `us * loops_per_us`, with `us` == 4, we'll have
to spin for 4 * 5 = 20 iterations, while with the more precise one:

	(4 * 5923) / 1000 = 23

We'll have to spin a bit more.
@vvaltchev
Copy link
Owner

I've merged the patch. Let me know if something still doesn't work.
Thanks for reporting this issue, man! 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants