
Feature request: Add a floating point co-processor #147

Open
MJoergen opened this issue Sep 25, 2020 · 13 comments

@MJoergen
Collaborator

Motivation

The motivation for this issue is the demo program c/test_programs/demo_sprite_balls.c. In this program a large number of balls move about and collide, and calculating the new velocities using integer-only arithmetic leads to round-off errors and/or overflow.

Proposal

I propose a new I/O device with the following register map

| Addr | Description |
|------|-------------|
| 00 | dword address (only bits 12-0 are used) |
| 01 | dword data, bits 15-0 |
| 02 | dword data, bits 31-16 |
| 03 | Command and Status Register (CSR) |

The CSR register has the following interpretation

| Bits | Description |
|------|-------------|
| 15 | Go/Busy |
| 14 | Error |
| 13 | Reserved |
| 12-0 | PC |

This I/O device acts as a co-processor with an internal virtual memory consisting of 8192 dwords (a dword is two 16-bit words, i.e. a 32-bit value). The CPU can access this virtual memory by using addresses 00 - 02 in the above register map.

Each 32-bit dword can contain either a floating point value (in the standard IEEE-754 format), or a special 32-bit co-processor instruction, see below.

The intention is that the co-processor has enough storage to contain all the floating point values needed by a running program. For instance, the demo program c/test_programs/demo_sprite_balls.c mentioned above would use 5 floating point values (pos_x, pos_y, vel_x, vel_y, radius) for each ball. With 50 balls this is a total of 250 dwords. This leaves plenty of room for the co-processor instructions.
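As a sketch, writing one 32-bit dword into the co-processor's virtual memory through the proposed register map could look like this from C. The base address, the register names, and the `mmio[]` array (a stand-in for the real memory-mapped registers) are all illustrative assumptions, not part of the proposal:

```c
#include <stdint.h>

/* Stand-in for the four memory-mapped registers of the proposal:
 * [0] dword address, [1] data bits 15-0, [2] data bits 31-16, [3] CSR.
 * On real hardware these would be volatile MMIO locations. */
static uint16_t mmio[4];

/* Write one 32-bit dword into the co-processor's virtual memory. */
static void fpu_write_dword(uint16_t addr, uint32_t value)
{
    mmio[0] = (uint16_t)(addr & 0x1FFFu);    /* only bits 12-0 are used */
    mmio[1] = (uint16_t)(value & 0xFFFFu);   /* low word  */
    mmio[2] = (uint16_t)(value >> 16);       /* high word */
}
```

For example, `fpu_write_dword(0x002A, 0x3F800000)` would store the IEEE-754 encoding of 1.0f at virtual address 0x2A.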

Usage

The demo program c/test_programs/demo_sprite_balls.c could be modified to use this new co-processor. In this modified version the CPU first writes the initial values to the co-processor, and then writes a sequence of co-processor instructions (i.e. a "program") to it.

During normal operation the CPU will send a single GO command via the CSR register. This sets the co-processor's Program Counter (PC), sets the co-processor's BUSY flag, and the co-processor starts executing instructions. After a short while, the co-processor stops execution and clears the BUSY flag. The CPU can now read the result (e.g. the new sprite coordinates) from the register map.

An important point here is that there is no need to copy data back and forth between the CPU and the co-processor during normal operation. All the data needed by the co-processor lives entirely inside the co-processor. Going back to the demo program c/test_programs/demo_sprite_balls.c the function update() will be replaced entirely by a single write to the co-processor CSR register. Furthermore, the data structure t_ball balls[NUM_SPRITES] will be completely removed.
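A minimal sketch of this per-frame flow, assuming the CSR layout above (bit 15 = Go/Busy, bit 14 = Error, bits 12-0 = PC); the array and helper names are invented for illustration, and a plain array stands in for the real MMIO registers:

```c
#include <stdint.h>

/* Stand-in for the memory-mapped registers; index 3 is the CSR. */
static uint16_t mmio[4];

/* Start the co-processor: set Go/Busy (bit 15) and the start PC. */
static void fpu_go(uint16_t start_pc)
{
    mmio[3] = (uint16_t)(0x8000u | (start_pc & 0x1FFFu));
}

static int fpu_busy(void)  { return (mmio[3] & 0x8000u) != 0; }
static int fpu_error(void) { return (mmio[3] & 0x4000u) != 0; }

/* In real code the CPU would now spin: while (fpu_busy()) { }
 * and then read the results back through registers 00-02. */
```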

Co-processor instruction

Everything in the co-processor is 32-bit wide, including the instruction. The instruction format is as follows:

| Bits 31-26 | Bits 25-13 | Bits 12-0 |
|------------|------------|-----------|
| Opcode | Destination address | Source address |

This allows for 64 different opcodes and two 13-bit operand addresses.

The co-processor does not contain any internal registers, so all operands are loaded/stored directly in the virtual memory.

The following opcodes are required in the basic implementation:

| Value | Opcode | Description |
|-------|--------|-------------|
| 0 | STOP | Halts the co-processor and clears the BUSY bit |
| 1 | ADD | Destination += Source |
| 2 | SUB | Destination -= Source |
| 3 | MUL | Destination *= Source |
| 4 | DIV | Destination /= Source |
| 5 | INT | Destination = (int) Source |
| 6 | ABS | Destination = abs(Source) |
| 7 | SGN | Destination = sign(Source) |
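To make the encoding concrete, here is a hedged helper for building instruction dwords from the layout above (opcode in bits 31-26, destination address in bits 25-13, source address in bits 12-0); the function name and enum are illustrative only:

```c
#include <stdint.h>

/* Opcode values from the proposal's table. */
enum { OP_STOP = 0, OP_ADD = 1, OP_SUB = 2, OP_MUL = 3,
       OP_DIV  = 4, OP_INT = 5, OP_ABS = 6, OP_SGN = 7 };

/* Pack one 32-bit co-processor instruction. */
static uint32_t encode(uint32_t opcode, uint32_t dst, uint32_t src)
{
    return ((opcode & 0x3Fu)   << 26)   /* bits 31-26: opcode              */
         | ((dst    & 0x1FFFu) << 13)   /* bits 25-13: destination address */
         |  (src    & 0x1FFFu);         /* bits 12-0 : source address      */
}
```

For instance, `encode(OP_ADD, 0, 1)` yields the dword "add the value at address 1 into the value at address 0".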

The Error bit in the CSR register is set when a floating point error occurs. This could be one of the following:

  • An attempt to divide by zero
  • An overflow
  • Conversion to integer on an out-of-range value

Resource considerations

  • The 8192 dwords can be implemented in the FPGA using 8 BRAMs.
  • The co-processor will operate at the same clock frequency as the CPU, just to simplify the design.
  • Each co-processor instruction will take several clock cycles, because the two operands must first be read from internal memory, and the destination value must be written back. And of course, the actual calculation takes several clock cycles too.

This feature requires no changes to the compiler, but will probably need a simple assembler. Using IEEE-754 floating point format makes testing this feature easier.

TBD: To be really useful, this co-processor should implement some possibility of conditional execution.

@bernd-ulmann
Collaborator

That is a pretty involved design. :-) Quite cool!

Please don't get me wrong, but I would like to mention a classic floating point processor which has a pretty simple interface and has proven to be quite usable:

http://www.bitsavers.org/pdf/dec/pdp11/1145/EK-FP11-MM-003_FP11-B_Floating-Point_Processor_Maintenance_Manual_1974.pdf

This would have the advantage of being very simple and not being more or less a stand-alone-processor. :-)

@sy2002
Owner

sy2002 commented Sep 25, 2020

I like Michael's approach a lot, mainly for the following reason:

True parallel processing: the CPU can continue to run while the FPU (since it has its own RAM) executes its own programs.

And it does feel like OpenCL on a GPU 😄

But I have to admit that I have not yet looked at the PDF that Bernd suggested.

@MJoergen About VBCC: given we had some kind of an FPU, it would be cool if we found a solution, working together with Volker, so that you can seamlessly work with floats in C - "just like on any other machine". And if the programmer wants to use the turbo-boosted acceleration properties of your architecture - the "OpenCL" - then they would write some specific code that utilizes a library we would provide. So the programmer would have a choice: slightly slower, non-parallelized "standard floats in C" - or - turbo-boosted programs running in parallel to the CPU. Both would be possible...

@bernd-ulmann
Collaborator

We could use the remaining unused QNICE opcode as some kind of "breakout" instruction, i.e. something that tells the QNICE CPU to do nothing while another device reads further words from memory and does the actual instruction decoding. That way we could integrate an FPU pretty seamlessly into the overall system architecture.

An external device such as an FPP would wait for a signal line from QNICE denoting that QNICE has detected such a reserved instruction. It would then increment the address from which this instruction was read (I can latch the address lines when the signal mentioned above is sent) and take over the bus to start reading/decoding instructions.

We could employ a signalling scheme similar to that used in interrupt handling, i.e. two lines:

  • One from QNICE to the FPP or other devices signalling that a "breakout" instruction was found. This line might be pulled low.
  • The external device (FPP) would reply by pulling low another line which will cause QNICE to wait until it is pulled high again.
  • The external device then does whatever it pleases to do and finally gives back control to QNICE. :-)

Thus we would not have true parallel processing, but we could extend QNICE with a multitude of external devices using the reserved instruction opcode. We could also use the remaining 12 bits of this "trigger instruction" to denote which external device should take over control.

What do you think?

@MJoergen
Collaborator Author

My intention with my initial proposal was to make something that is simple to implement in hardware (and in the emulator) AND requires no changes to the CPU or the compiler. On the downside, it requires specially crafted "co-processor assembly code" to make use of it. I see no way to get the compiler to support this co-processor. In theory we could write a separate (simple) compiler for it that accepts programs written in a limited subset of C. Initially such a compiler won't be necessary, but it could be a nice long-term goal.

Bernd's suggestion where the external co-processor takes over the system bus does remove the need for a separate co-processor memory. Instead, the co-processor can access main memory directly. I'm assuming here that the CPU is outputting its Program Counter to the Address Bus so that the co-processor can latch this value and read its instructions from that point in the program. And I'm assuming that the co-processor "returns" a new PC back to the CPU, so it knows where to resume execution. I.e. the "special co-processor instruction" is followed directly by machine code readable by the co-processor, and after that machine code there follows "ordinary" QNICE instructions again.

Bernd's suggestion does give the option of having the C-compiler generate code for the co-processor. And until the C-compiler has been updated, we could live with writing "inline assembly" for the co-processor.

Does the current C-compiler support inline assembly using QNICE instructions?

The difference between the two architectures is quite small, as I see it. The internal part of the co-processor will be the same; only the interface to the CPU is different.

I think Bernd's design is more flexible. Losing the ability to do parallel processing is not much of a problem, I think. Often the main CPU won't have anything useful to do before the result is returned anyway.

@sy2002
Owner

sy2002 commented Sep 25, 2020

I see no way to get the compiler to support this co-processor.

Well, I think on the contrary that this should not be too hard. As far as I remember (it has been a while), when you write

float c = a * b;

VBCC emits function calls such as (pseudo-code)

push parameter_A 
push parameter_B
ABRA __fmul
<result in R8/R9 will be put to C>

So "the only thing" we need to do is to provide the implementation of __fmul, the same way as we did it for example here, where we use the EAE in C for multiplications:

https://github.com/sy2002/QNICE-FPGA/blob/master/c/qnice/vclib/machines/qnice/libsrc/_lmul.s
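For illustration only, here is a hedged C model of the contract such a __fmul support routine has to fulfil: take two IEEE-754 single-precision bit patterns and return the bit pattern of their product. The real implementation would be QNICE assembly like the linked _lmul.s, and the exact register convention is defined by VBCC; nothing below is its actual API. This sketch assumes `float` is 32-bit IEEE-754 on the host:

```c
#include <stdint.h>
#include <string.h>

/* Multiply two floats given as raw IEEE-754 single-precision bit
 * patterns (the form in which a co-processor or support routine
 * would see them) and return the result's bit pattern. */
static uint32_t fmul_bits(uint32_t a_bits, uint32_t b_bits)
{
    float a, b, r;
    uint32_t r_bits;
    memcpy(&a, &a_bits, sizeof a);   /* reinterpret bits as float */
    memcpy(&b, &b_bits, sizeof b);
    r = a * b;
    memcpy(&r_bits, &r, sizeof r);   /* back to raw bits */
    return r_bits;
}
```

For example, multiplying the encodings of 2.0f (0x40000000) and 3.0f (0x40400000) yields the encoding of 6.0f (0x40C00000).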

BTW: This is what I called "the homework" in the email to Volker, where @MJoergen was CC.

@MJoergen
Collaborator Author

I just thought of something. Bernd's proposal implies that the co-processor needs TWO clock cycles to access each floating point value, simply because the memory system in QNICE is 16-bit, so it takes two reads (or two writes) to transfer a single 32-bit value. Each floating point instruction would therefore spend 6 clock cycles just moving data back and forth (reading two operands and writing one result), PLUS whatever is needed for the actual calculation.

In contrast, the proposal mentioned at the start of this issue gives the co-processor its own 32-bit memory, and therefore it can transfer an operand/result in just one cycle. So this approach will save 3 clock cycles for each floating point instruction. That was actually the motivation for the initial idea: To avoid spending too much time moving data back and forth.

@MJoergen
Collaborator Author

MJoergen commented Sep 26, 2020

One additional note. The co-processor could be stack-based. Sort-of emulating the RPN notation from the HP calculators. We could also borrow some ideas from this project: https://github.com/AcheronVM/acheronvm. I especially like the "sliding window" part.

@bernd-ulmann
Collaborator

Using our "spare" instruction does not necessarily force the FPP to spend two cycles on every FP access; this instruction does not invalidate your idea of an FPP with lots of internal registers/storage. An FP number stack is a nice :-) idea, and we would be in good company with Intel etc. :-)

Why not combine both ideas? My idea of using the spare opcode and your FPP with many internal registers. We could simplify the instruction even further:

The spare opcode still has 12 unused operand bits. We could simply feed those to the FPP, which would give it a 12-bit instruction set - plenty, given a stack architecture.
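A sketch of the bit plumbing behind this idea, assuming QNICE's 16-bit instructions carry the opcode in the top four bits; the concrete value of the spare opcode (`BREAKOUT_OP`) and the helper names are placeholders for illustration:

```c
#include <stdint.h>

#define BREAKOUT_OP 0xFu   /* hypothetical: the one unused QNICE opcode */

/* Build a "breakout" instruction whose 12 operand bits carry an
 * FPP instruction verbatim. */
static uint16_t make_breakout(uint16_t fpp_insn)
{
    return (uint16_t)((BREAKOUT_OP << 12) | (fpp_insn & 0x0FFFu));
}

/* Extract the 12 bits that would be fed to the FPP. */
static uint16_t fpp_field(uint16_t qnice_insn)
{
    return (uint16_t)(qnice_insn & 0x0FFFu);
}
```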

What do you think?

@MJoergen
Collaborator Author

Interesting. I will think about it the next few days until our meeting next week.

@sy2002
Owner

sy2002 commented Sep 26, 2020

And just to make our meeting really interesting: I do like Michael's original idea a lot, because it resembles how today's high-performance computers use GPUs as FPUs with their own language like OpenCL. :-) And by being like that, it introduces true parallel execution inside the co-processor.

@bernd-ulmann
Collaborator

Just my two cents' worth: what I fear is that the proposed FPP will show "creeping featurism". It will start simple, then grow and outgrow the underlying QNICE processor, and in the end we might find that what we created is a 32-bit floating point processor that has turned into a stand-alone machine.

Do we want this? Thinking of Mirko's remark that he loves seeing QNICE becoming a great retro game machine, I doubt it. That's why I initially proposed a very simplistic approach like the FPP used in early PDP-11 systems. If we aim for utmost performance, then QNICE is not the right underlying architecture - we might then go down the RISC-V route or something like that, but definitely not work on a 16-bit processor.

The QNICE architecture has, thanks to you, Mirko and Michael, matured from a pipe dream of mine into a real machine with a real and very impressive software stack. Nevertheless, the system is still simple enough that a good student can actually understand every bit of it and its associated software. The more highly complex features we add, the more cluttered the overall architecture will become and we will create something that might be very cool performance wise but no longer cool for educational or recreational purposes.

The problem is that from your perspective there is nothing too complicated in a computer. :-) But from the perspective of a younger person, QNICE already is on the verge of being comprehensible.

Not that I want to kill any dreams here, but I personally would opt for a simple FPP which just executes simple instructions on an internal stack of registers (eight or 16 should be more than sufficient). Even if we spend several cycles transferring values, it would still be much faster than a software implementation.

Speaking of which, we should also think about a software library for floating point operations, what do you think?

Have a great weekend! I am now off to today's lecture (hardware systems, by the way :-) ). :-)

@sy2002
Owner

sy2002 commented Sep 26, 2020

Do we want this? Thinking of Mirko's remark that he loves seeing QNICE becoming a great retro game machine I doubt it.

Yeah, good argument; you're right and I am convinced: this is meant to be a recreational retro toy and not a high-performance computing platform ;-) So let's discuss a still-good but simple approach on Tuesday evening that satisfies this philosophy.

Speaking of which, we should also think about a software library for floating point operations, what do you think?

Absolutely. We do need that in Monitor and we do need it inside VBCC (see my #147 (comment))

@MJoergen
Collaborator Author

Thank you, Bernd, for pointing out what I had missed: the goal of keeping the complexity low. Feature creep is a real "threat" to a project like this, and adding an FPP would greatly increase the complexity.

The initial aim of this issue was to have the ability to perform FP calculations. This can easily be achieved with a software-only solution.

So it seems to me a much better solution to enable FP support in the compiler and the library. That will give the added accuracy needed in the sprite ball demo program. When that is implemented, we can evaluate the performance and then discuss what - if anything - to do about it.

@MJoergen MJoergen added V2.0 and removed V1.7 labels Sep 29, 2020