Technical details

Sebastian Macke edited this page Sep 28, 2016 · 83 revisions

Specifications of the emulated hardware:

  • 32-Bit OR1000 Emulator with MMU, TICK counter and PIC (OR1K Specification)
  • 32 MB RAM (alterable)
  • UART 16550 connected to a terminal
  • UART 16550 not connected
  • OCFB Framebuffer 640x400 16bpp with LPC32xx touchscreen
  • virtio device with support for the 9p filesystem
  • ATA connected to a 64 kB hard drive
  • Opencore-keyboard controller
  • Ethernet MAC controller
  • Audio controller
  • Real time clock
  • Interrupt controller
  • Linux Terminal Emulator
  • Linux 3.16, Busybox and much much more

Mapping of RAM and devices:

Memory                   IRQ  Device
0x00000000 - 0x01F00000   -   31 MB Random Access Memory (alterable)
0x90000000 - 0x90000006   2   UART 16550 connected to the terminal and keyboard
0x91000000 - 0x91001000   8   Opencore VGA/LCD 2.0 core frame buffer
0x92000000 - 0x92001000   4   Ethernet MAC controller
0x93000000 - 0x93000100   9   LPC32xx touchscreen controller
0x94000000 - 0x94000100   5   Opencore keyboard controller
0x96000000 - 0x96000006   3   second UART 16550
0x97000000 - 0x97001000   6   virtio device for the 9p filesystem
0x98000000 - 0x98000400   7   audio controller
0x99000000 - 0x99001000  10   LPC32xx real time clock
0x9A000000 - 0x9A001000   1   OMPIC interrupt controller
0x9B000000 - 0x9B000004   -   Dummy timer
0x9C000000 - 0x9C001000  11   virtio device
0x9D000000 - 0x9D001000  12   virtio device
0x9E000000 - 0x9E001000  15   ATA controller

Talks about jor1k at conferences

Javascript Optimizations

Big vs. Little Endian

The emulated machine is big endian, but the typed arrays of Javascript use the byte order of the host, which is little endian on most platforms. Most memory accesses are aligned 32-bit accesses. Therefore, after the image is loaded, every 32-bit word is byte-swapped once, and the addresses of 8-bit and 16-bit accesses are XORed with 3 and 2, respectively.
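
The following is a minimal sketch of this scheme, not the actual jor1k code. The names swap32, loadImage, readWord, readHalf and readByte are hypothetical, and the sketch assumes a little-endian host.

```javascript
// Sketch only: the helper names below are hypothetical, not jor1k's API.
// Assumes a little-endian host, which is what typed arrays expose on
// almost all current platforms.
const buffer = new ArrayBuffer(8);
const ram8 = new Uint8Array(buffer);
const ram16 = new Uint16Array(buffer);
const ram32 = new Int32Array(buffer);

// Reverse the four bytes of a 32-bit word.
function swap32(x) {
  return (
    ((x & 0xFF) << 24) |
    ((x & 0xFF00) << 8) |
    ((x >>> 8) & 0xFF00) |
    ((x >>> 24) & 0xFF)
  ) | 0;
}

// Copy a big-endian image into RAM and swap every 32-bit word once.
function loadImage(bytes) {
  ram8.set(bytes);
  for (let i = 0; i < ram32.length; i++) {
    ram32[i] = swap32(ram32[i]);
  }
}

// Aligned 32-bit accesses need no further work after the swap.
const readWord = (addr) => ram32[addr >> 2];
// 16-bit accesses XOR the address with 2, 8-bit accesses with 3.
const readHalf = (addr) => ram16[(addr ^ 2) >> 1];
const readByte = (addr) => ram8[addr ^ 3];

loadImage([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]);
console.log(readWord(0).toString(16), readHalf(2).toString(16), readByte(1).toString(16));
```

The one-time swap makes the common case (aligned 32-bit access) a plain array read, while the rarer byte and halfword accesses pay only for one XOR.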

Web workers

Most of the code runs in its own thread by means of the web worker API. Message passing is used to communicate between the worker and the objects related to the graphical user interface.

Transform between unsigned and signed integers

Javascript variables normally do not have predefined types. To cast a number to an unsigned or a signed 32-bit integer, one can use (number >>> 0) or (number >> 0), respectively. Both operations keep the 32-bit pattern of "number" and change only how it is interpreted.
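
As a small illustration of the two casts (the variable names are chosen for this example only):

```javascript
// The same 32-bit pattern, interpreted as signed and as unsigned.
const raw = 0xFFFFFFFF;
const asSigned = raw >> 0;         // -1
const asUnsigned = asSigned >>> 0; // 4294967295

// Wrapping a result back into the signed 32-bit range after an addition.
const wrapped = (0x7FFFFFFF + 1) >> 0; // -2147483648

console.log(asSigned, asUnsigned, wrapped);
```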

Speed of unsigned vs. signed variables

In Javascript every number is supposed to be a double-precision floating-point number. However, the Javascript compiler optimizes the code and tries to figure out whether an integer is also appropriate. Unfortunately, support for fast unsigned integers is still missing in some compilers, so unsigned values are transformed into doubles. The code is therefore written to avoid unsigned integer arithmetic as far as possible.

Sign extend

Sometimes a number must be sign extended. This is done efficiently by the expression (number << x) >> x, where x is 32 minus the width of the value in bits. To sign extend a signed 8-bit value to a signed 32-bit value, the expression is ((number << 24) >> 24).
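
A short sketch of the shift trick, wrapped in a hypothetical helper named signExtend (the emulator itself may simply inline the expression):

```javascript
// Sign extend the lowest `bits` bits of `value` to a signed 32-bit integer.
// `signExtend` is a name chosen for this example.
function signExtend(value, bits) {
  const shift = 32 - bits;
  // The left shift moves the sign bit to bit 31,
  // the arithmetic right shift replicates it downwards.
  return (value << shift) >> shift;
}

console.log(signExtend(0xFF, 8));    // -1
console.log(signExtend(0x7F, 8));    // 127
console.log(signExtend(0x8000, 16)); // -32768
```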

Neglecting Flags

The carry flag and the overflow flag are not used by the gcc compiler, so they are ignored in this emulation. The code that supports these flags can be uncommented for better compatibility at the cost of speed.

Instruction MMU

Most of the time the whole instruction fetch is done very efficiently with the code:

if ((checkpc^this.pc) >> 11) {
...
}
ins = int32mem[(currenttlb ^ this.pc)];

The important trick is to check first whether the current page is still valid and, if this is the case, just to XOR the program counter with the cached TLB value. The fast TLB lookup for data accesses is implemented in a similar way.
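
The page check and the XOR can be sketched as follows. This is an illustration under stated assumptions, not the emulator's implementation: the program counter is assumed to be a word index, a page is assumed to span 2^11 words, and createFetcher and translatePage are hypothetical names standing in for the real slow path.

```javascript
// Assumption: pc counts 32-bit words and a page holds 2^11 words.
const PAGE_SHIFT = 11;
const PAGE_MASK = (1 << PAGE_SHIFT) - 1;

// translatePage (hypothetical) maps a virtual page number to a physical one.
function createFetcher(int32mem, translatePage) {
  let checkpc = -1;   // virtual base of the cached page; -1 forces a first miss
  let currenttlb = 0; // physical page base XOR virtual page base

  return function fetch(pc) {
    // Non-zero only if pc lies outside the cached page.
    if ((checkpc ^ pc) >> PAGE_SHIFT) {
      const virtualBase = pc & ~PAGE_MASK;
      const physicalBase = translatePage(pc >>> PAGE_SHIFT) << PAGE_SHIFT;
      checkpc = virtualBase;
      currenttlb = physicalBase ^ virtualBase;
    }
    // XOR cancels the virtual page bits and leaves physical base plus offset.
    return int32mem[currenttlb ^ pc];
  };
}

// Example: virtual page 3 is mapped to physical page 1, everything else to 0.
const mem = new Int32Array(4096);
mem[2048 + 5] = 1234;
let slowPathCalls = 0;
const fetch = createFetcher(mem, (page) => {
  slowPathCalls++;
  return page === 3 ? 1 : 0;
});
console.log(fetch(3 * 2048 + 5), slowPathCalls);
```

As long as execution stays within one page, each fetch costs one XOR, one shift, one comparison and one array read.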

Hardware TLB Refill Hack

The TLB refill is done in Javascript. Unfortunately, this makes the emulator dependent on the Linux kernel, because the refill needs the pointer to the kernel's internal translation table.

Fastpath

The fastest path for one instruction through the code is given by

for(;;) {
    if (ppc == fence) {
        ....
    }
    ins = int32ram[ppc >> 2];
    ppc = ppc + 4;

    switch ((ins >> 26)&0x3F) {
        ....
    }
}

The idea here is that the virtual pc is computed only when needed by translating ppc (physical pc) back to the virtual pc address. The variable fence is used to break out of the fast path when ppc reaches a jump or the end of the current page.
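
A reduced sketch of this loop is shown below. Only ppc, fence and int32ram come from the code above. The function name runFastPath and the parameters virtualBase and physicalBase are assumptions for this example, and the instruction decoder is replaced by a counter.

```javascript
// Run instructions until ppc reaches the fence, then recover the virtual pc.
// virtualBase and physicalBase (hypothetical) are the start addresses of the
// current page in virtual and physical memory.
function runFastPath(int32ram, ppc, fence, virtualBase, physicalBase) {
  let executed = 0;
  for (;;) {
    if (ppc === fence) {
      // Slow path: only here is the virtual pc actually needed.
      const pc = (virtualBase + (ppc - physicalBase)) >>> 0;
      return { pc, executed };
    }
    const ins = int32ram[ppc >> 2];
    ppc = ppc + 4;

    switch ((ins >> 26) & 0x3F) {
      default:
        executed++; // stand-in for the real instruction handlers
    }
  }
}

const result = runFastPath(new Int32Array(16), 8, 24, 0xC0000000, 0);
console.log(result.executed, result.pc.toString(16));
```

Keeping only the physical pc in the hot loop removes one address translation per instruction; the translation back to the virtual pc happens once per break-out.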

Idle State of the CPU

When the system goes idle, the operating system sends a sleep or halt signal. In this case the CPU should wait until the next interrupt occurs. We can use the setTimeout() method of Javascript to accomplish this. The usual tick under Linux is <= 10 ms. Unfortunately, with the overhead of the web browsers and their Javascript engines, 10 ms is often not sufficient to keep the host processor usage below 1%. Therefore the Linux kernel was compiled with a tick every 20 ms (50 ticks per second). Usually this is not a problem as long as you don't use time-critical applications like video players. The responsiveness of the system, for example when typing on the keyboard, is not affected.

Execute ping pong

While a worker thread is executing code, it is not responsive to arriving messages. The worker thread must go idle to process the message queue. A setTimeout call with 0 ms does not work here. In order to run the CPU at full speed, a message ping pong is performed every 5-10 ms: the worker sends an "execute" signal to the master, and the master thereupon sends its own "execute" signal back to the worker. By doing this, we keep the worker responsive while using the worker thread efficiently.
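
The worker side of this ping pong could look roughly like the sketch below. The function createExecuteHandler, the message shape { command: "execute" } and the injected clock are assumptions made so that the example runs outside a browser; they are not taken from the jor1k sources.

```javascript
// Build the worker-side message handler (hypothetical helper).
//   cpu          - object with a step() method
//   postToMaster - function that sends a message to the master
//   now          - clock function returning milliseconds
//   sliceMs      - length of one time slice, e.g. 5-10 ms
function createExecuteHandler(cpu, postToMaster, now, sliceMs) {
  return function onMessage(message) {
    if (message.command !== "execute") return;
    const deadline = now() + sliceMs;
    do {
      cpu.step();
    } while (now() < deadline);
    // Returning lets the worker drain its message queue;
    // the master's reply starts the next slice.
    postToMaster({ command: "execute" });
  };
}

// Demonstration with a fake clock that advances 1 ms per step.
let fakeTime = 0;
const cpu = { steps: 0, step() { this.steps++; fakeTime += 1; } };
const sent = [];
const handler = createExecuteHandler(cpu, (m) => sent.push(m), () => fakeTime, 5);
handler({ command: "execute" });
console.log(cpu.steps, sent);
```

In a real worker, the handler would be called from the worker's onmessage callback with the event's data, postToMaster would be postMessage, and the master would answer every "execute" message with its own "execute" message.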

Filesystem virtio/9p

The most advanced feature of jor1k is the filesystem, which is fully implemented in Javascript. The 9p/virtio implementation of Linux is used as the interface. The complete filesystem layout is loaded at the beginning in the form of an XML file (https://github.com/s-macke/jor1k-sysroot). When files are opened, they are downloaded from the repository. Compression reduces the overall loading time. This implementation is much faster than an NFS filesystem or an on-demand block device implementation because of the significantly reduced overhead. In the future, dependencies between files (like library dependencies) could be implemented to further reduce the loading time. This feature also enables us to work with the filesystem directly within Javascript, for example to upload and download files or complete archives.

Overall Speed Dependence observed by Testing Different Web Browsers

The first time Linux booted on the emulator, Chrome was the fastest web browser (0.5-1 MIPS). After more and more optimizations were implemented, Firefox was a little bit faster than Google Chrome (5 MIPS). When IE10 became compatible with my code, it was the fastest (10 MIPS). After the worker thread was implemented, Firefox 22 took the lead, being 3 times faster than the other browsers (33 MIPS). For some reason this advantage was lost with Firefox 23-24 (4-9 MIPS). Instead, Chrome took over this position with version 29 (30-60 MIPS). In Firefox the asm.js version of the CPU seems to reach 30-100 MIPS. At this moment, changing one line of code in the Step() function could reduce or increase the speed by a factor of 3. The reason for these speed oscillations is the tremendous complexity of today's JIT compilers and their black-box behavior, which makes it almost impossible to write really fast code.

Browser                Run on               Core                               Benchmark  MIPS
Chrome 29              Core i7-2600 3.4GHz  normal CPU                         fbdemo V1  45
Chrome 35              Core i7-2600 3.4GHz  normal CPU                         fbdemo V1  51
Chrome 30-34           Core i7-2600 3.4GHz  asm.js V1                          fbdemo V1  53
Chrome 35              Core i7-2600 3.4GHz  asm.js V1                          fbdemo V1  55
Firefox 22             Core i7-2600 3.4GHz  normal CPU                         fbdemo V1  33
Firefox 24-28          Core i7-2600 3.4GHz  normal CPU                         fbdemo V1  7
Firefox 29-30          Core i7-2600 3.4GHz  normal CPU                         fbdemo V1  67
Firefox 24-30          Core i7-2600 3.4GHz  asm.js V1                          fbdemo V1  74
IE 10                  Core i7-2600 3.4GHz  asm.js V1                          fbdemo V1  22
IE 11                  Core i7-2600 3.4GHz  asm.js V1                          fbdemo V1  51
Firefox 31             Core i7 4770 3.4GHz  asm.js V1                          fbdemo V1  200
---------------------  -------------------  ---------------------------------  ---------  ----
Firefox 32             Core i7-2600 3.4GHz  asm.js V2                          fbdemo V2  75.5
Firefox 32             Core i7-2600 3.4GHz  asm.js V2 (without asm statement)  fbdemo V2  58.1
Chrome 37              Core i7-2600 3.4GHz  asm.js V2                          fbdemo V2  60.7
IE 11                  Core i7-2600 3.4GHz  asm.js V2                          fbdemo V2  68.3
Safari on iPad air     Apple A7             asm.js V2                          fbdemo V2  81.0
Samsung Galaxy S5      Exynos 5 Octa 5422   asm.js V2                          fbdemo V2  18.1
Chrome 38 64-Bit       Core i7-2600 3.4GHz  asm.js V2                          fbdemo V2  63
Chrome 38 64-Bit       Core i7-2600 3.4GHz  asm.js V2                          fbdemo V2  63
Firefox 32 on Surface  Core i5-4200U        asm.js V2                          fbdemo V2  179
Firefox 33             Celeron G1820        asm.js V2                          fbdemo V2  180
Firefox 36             Core i7 4770 3.4GHz  asm.js V2                          fbdemo V2  246

The overall speed is equivalent to a Pentium 90.