Fixed links | Added code blocks | Added section links
abejgonzalez committed Nov 20, 2018
1 parent 9f35e6c commit 9b807c9
Showing 19 changed files with 138 additions and 158 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -39,6 +39,7 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
+'sphinx.ext.autosectionlabel'
]

# Add any paths that contain templates here, relative to this directory.
70 changes: 28 additions & 42 deletions docs/sections/BranchPrediction/Backing-Predictor.rst
@@ -10,15 +10,16 @@ are not able to learn very complicated or long history patterns).

To capture more branches and more complicated branching behaviors, BOOM
provides support for a “Backing Predictor", or BPD (see
-:numref:`backing-predictor-unit`.
+:numref:`backing-predictor-unit`).


The BPD’s goal is to provide very high accuracy in a (hopefully) dense
area. To make this possible, the BPD will not make a prediction until
the *fetch packet* has been decoded and the branch targets computed
directly from the instructions themselves. This saves on needing to
-store the *PC tags* and *branch targets* within the BPD.
+store the *PC tags* and *branch targets* within the BPD [7]_.

-The BPD is accessed in parallel with the instruction cache access (See
+The BPD is accessed in parallel with the instruction cache access (see
:numref:`Fetch-Unit`). This allows the BPD to be stored in sequential
memory (i.e., SRAM instead of flip-flops). With some clever
architecting, the BPD can be stored in single-ported SRAM to achieve the
@@ -94,21 +95,21 @@ info packet". This “info packet" is stored in a “branch re-order buffer"
(BROB) until commit time. [11]_ Once all of the instructions
corresponding to the “info packet" are committed, the “info packet" is
sent to the BPD (along with the eventual outcome of the branches) and the
-BPD is updated. Section [sec:brob] covers the BROB, which handles the
+BPD is updated. :ref:`The Branch Reorder Buffer (BROB)` covers the BROB, which handles the
snapshot information needed to update the predictor during
-*Commit*. Section [sec:bpd-rename] covers the BPD Rename
+*Commit*. :ref:`Rename Snapshot State` covers the BPD Rename
Snapshots, which handles the snapshot information needed to update the
predictor during a misspeculation in the *Execute* stage.
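
A rough behavioral sketch of this round trip (plain Scala with
hypothetical names such as ``BpdInfoPacket``; BOOM's actual Chisel
structures differ) might look like:

.. code-block:: scala

   // Hypothetical sketch: a prediction's "info packet" is allocated at
   // prediction time, held in the BROB until commit, then handed back to
   // the BPD together with the resolved branch outcome.
   import scala.collection.mutable.Queue

   case class BpdInfoPacket(history: Long, index: Int)

   class Brob {
     private val inflight = Queue[BpdInfoPacket]()

     def allocate(info: BpdInfoPacket): Unit = inflight.enqueue(info)

     // At commit, return the oldest packet (plus its outcome) to the BPD.
     def commit(taken: Boolean)(updateBpd: (BpdInfoPacket, Boolean) => Unit): Unit =
       updateBpd(inflight.dequeue(), taken)
   }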

Managing the Global History Register
------------------------------------

The *global history register* is an important piece of a branch
-predictor. It contains the outcomes of the previous $N$ branches (where
-$N$ is the size of the global history register). [12]_
+predictor. It contains the outcomes of the previous :math:`N` branches (where
+:math:`N` is the size of the global history register). [12]_

-When fetching branch $i$, it is important that the direction of the
-previous $i-N$ branches is available so an accurate prediction can be
+When fetching branch :math:`i`, it is important that the direction of the
+previous :math:`i-N` branches is available so an accurate prediction can be
made. Waiting until the *Commit* stage to update the global history
register would be too late (dozens of branches would be inflight and not
reflected!). Therefore, the global history register must be updated
@@ -136,7 +137,7 @@ any sort of pipeline flush event.
The Branch Reorder Buffer (BROB)
--------------------------------

-The Reorder Buffer (see Chapter [chapter:rob]) maintains a record of
+The Reorder Buffer (see :ref:`The Reorder Buffer (ROB) and the Dispatch Stage`) maintains a record of
all inflight instructions. Likewise, the Branch Reorder Buffer (BROB)
maintains a record of all inflight branch predictions. These two
structures are decoupled as BROB entries are *incredibly* expensive
@@ -217,7 +218,7 @@ abstract class can be found in :numref:`backing-predictor-unit` labeled “predi
Global History
^^^^^^^^^^^^^^

-As discussed in Section [sec:ghistory], global history is a vital
+As discussed in :ref:`Managing the Global History Register`, global history is a vital
piece of any branch predictor. As such, it is handled by the abstract
BranchPredictor class. Any branch predictor extending the abstract
BranchPredictor class gets access to global history without having to
@@ -226,16 +227,16 @@ handle snapshotting, updating, and bypassing.
Very Long Global History (VLHR)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Some branch predictors (see Section [sec:tage]) require access to
+Some branch predictors (see :ref:`The TAGE Predictor`) require access to
incredibly long histories – over a thousand bits. Global history is
speculatively updated after each prediction and must be snapshotted and
reset if a misprediction was made. Snapshotting a thousand bits is
untenable. Instead, VLHR is implemented as a circular buffer with a
speculative head pointer and a commit head pointer. As a prediction is
-made, the prediction is written down at $VLHR[spec\_head]$ and the
+made, the prediction is written down at :math:`VLHR[spec\_head]` and the
speculative head pointer is incremented and snapshotted. When a branch
-mispredicts, the head pointer is reset to $snapshot+1$ and the correct
-direction is written to $VLHR[snapshot]$. In this manner, each snapshot
+mispredicts, the head pointer is reset to :math:`snapshot+1` and the correct
+direction is written to :math:`VLHR[snapshot]`. In this manner, each snapshot
is on the order of 10 bits, not 1000 bits.
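
A behavioral sketch of this scheme (plain Scala, not BOOM's Chisel; the
commit head pointer is omitted for brevity):

.. code-block:: scala

   // Sketch of the VLHR circular buffer: predictions are written at the
   // speculative head, and the snapshot is just the old head pointer.
   class Vlhr(nBits: Int) {
     private val buf = Array.fill(nBits)(false)
     private var specHead = 0 // advanced speculatively on every prediction

     def predict(taken: Boolean): Int = {
       val snapshot = specHead // ~10 bits, versus snapshotting all of buf
       buf(specHead) = taken
       specHead = (specHead + 1) % nBits
       snapshot
     }

     // On a misprediction, write the true outcome at the snapshot and
     // reset the head pointer to snapshot + 1.
     def repair(snapshot: Int, taken: Boolean): Unit = {
       buf(snapshot) = taken
       specHead = (snapshot + 1) % nBits
     }
   }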

Operating System-aware Global Histories
@@ -371,16 +372,16 @@ there is no tag match). The table with the longest history making a
prediction wins.

On a misprediction, TAGE attempts to allocate a new entry. It will only
-overwrite an entry that is “not useful” ($ubits == 0$).
+overwrite an entry that is “not useful” (:math:`ubits == 0`).
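
As a sketch of the selection rule (plain Scala; each table is assumed to
report an optional prediction, ordered from shortest history to longest):

.. code-block:: scala

   // The longest-history table that hits (tag match) provides the
   // prediction; None models a tag miss, `default` the fallback predictor.
   def tagePredict(tableHits: Seq[Option[Boolean]], default: Boolean): Boolean =
     tableHits.flatten.lastOption.getOrElse(default)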

TAGE Global History and the Circular Shift Registers (CSRs) [15]_
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each TAGE table has associated with it its own global history (and each
table has geometrically more history than the last table). As the
histories become incredibly long (and thus too expensive to snapshot
directly), TAGE uses the Very Long Global History Register (VLHR) as
-described in Section [sec:vlhr]. The histories contain many more bits
+described in :ref:`Very Long Global History (VLHR)`. The histories contain many more bits
of history than can be used to index a TAGE table; therefore, the
history must be “folded” to fit. A table with 1024 entries uses 10 bits
to index the table. Therefore, if the table uses 20 bits of global
@@ -390,10 +391,12 @@ bits of history.
Instead of attempting to dynamically fold a very long history register
every cycle, the VLHR can be stored in a circular shift register (CSR).
The history is stored already folded and only the new history bit and
-the oldest history bit need to be provided to perform an update. Code
-[code:tage-csr] shows an example of how a CSR works.
+the oldest history bit need to be provided to perform an update.
+:numref:`tage-csr` shows an example of how a CSR works.

-::
+.. _tage-csr:
+.. code-block:: none
+   :caption: The circular shift register. When a new branch outcome is added, the register is shifted (and wrapped around). The new outcome is added and the oldest bit in the history is “evicted”.
Example:
A 12 bit value (0b_0111_1001_1111) folded onto a 5 bit CSR becomes
@@ -411,16 +414,12 @@ the oldest history bit need to be provided to perform an update. Code
(c[4] ^ h[ 0] generates the new c[0]).
(c[1] ^ h[12] generates the new c[2]).
-Code Caption: The circular shift register. When a new branch outcome is added, the register
-is shifted (and wrapped around). The new outcome is added and the oldest bit in the
-history is “evicted”.

Each table must maintain *three* CSRs. The first CSR is used for
-computing the index hash and has a size $n=log(num\_table\_entries)$. As
+computing the index hash and has a size :math:`n=log(num\_table\_entries)`. As
a CSR contains the folded history, any periodic history pattern matching
the length of the CSR will XOR to all zeroes (potentially quite common).
For this reason, there are two CSRs for computing the tag hash, one of
-width $n$ and the other of width $n-1$.
+width :math:`n` and the other of width :math:`n-1`.
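
A behavioral sketch of one such folded-history CSR (plain Scala; the
update rule matches the worked example above, with the evicted history
bit folding in at position ``histLen % csrLen``):

.. code-block:: scala

   // One folded-history CSR: histLen bits of global history folded onto a
   // csrLen-bit register. Only the newest and oldest history bits are
   // needed per update.
   class FoldedHistory(histLen: Int, csrLen: Int) {
     private var csr = 0
     private val evictPos = histLen % csrLen // where the old bit folds in

     def update(newBit: Int, oldBit: Int): Unit = {
       var c = (csr << 1) | newBit // shift in the newest outcome
       c ^= oldBit << evictPos     // "evict" the oldest outcome
       c ^= c >> csrLen            // wrap the shifted-out bit around to bit 0
       csr = c & ((1 << csrLen) - 1)
     }

     def value: Int = csr
   }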

For every prediction, all three CSRs (for every table) must be
snapshotted and reset if a branch misprediction occurs. Another three
@@ -478,24 +477,11 @@ take?". This is very useful for both torture-testing BOOM and for
providing a worst-case performance baseline for comparing branch
predictors.

-.. [6] Each BTB entry corresponds to a single *Fetch PC*, but it is
-   helping to predict across an entire *fetch packet*. However, the
-   BTB entry can only store meta-data and target-data on a single
-   control-flow instruction. While there are certainly pathological
-   cases that can harm performance with this design, the assumption is
-   that there is a correlation between which branch in a *fetch
-   packet* is the dominating branch relative to the *Fetch PC*,
-   and - at least for narrow fetch designs - evaluations of this design
-   have shown it is very complexity-friendly with no noticeable loss in
-   performance. Some other designs instead choose to provide a whole
-   bank of BTBs for each possible instruction in the *fetch
-   packet*.
.. [7] It’s the *PC tag* storage and *branch target* storage that
makes the BTB within the NLP so expensive.
-.. [8] JAL instructions jump to a $PC+Immediate$ location, whereas
-   JALR instructions jump to a $PC+Register[rs1]+Immediate$ location.
+.. [8] JAL instructions jump to a :math:`PC+Immediate` location, whereas
+   JALR instructions jump to a :math:`PC+Register[rs1]+Immediate` location.
.. [9] Redirecting the Fetch Unit in the *Fetch2 Stage* for JAL
   instructions is trivial, as the instruction can be decoded and its
31 changes: 19 additions & 12 deletions docs/sections/BranchPrediction/Configurations.rst
@@ -1,38 +1,45 @@

Branch Prediction Configurations
---------------------------------
+================================

There are a number of parameters provided to govern the branch
prediction in BOOM.

-### GShare Configuration Options
+GShare Configuration Options
+----------------------------

-#### Global History Length
+Global History Length
+~~~~~~~~~~~~~~~~~~~~~

How long of a history should be tracked? The length of the global
-history sets the size of the branch predictor. An $n$-bit history pairs
-with a $2^n$ entry two-bit counter table.
+history sets the size of the branch predictor. An :math:`n`-bit history pairs
+with a :math:`2^n` entry two-bit counter table.
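
As a hedged illustration of this pairing (a minimal plain-Scala sketch
with illustrative names, not BOOM's implementation):

.. code-block:: scala

   // gshare: an n-bit global history XORed with PC bits indexes a
   // 2^n-entry table of 2-bit saturating counters.
   class GShareSketch(n: Int) {
     private val counters = Array.fill(1 << n)(2) // 2 = weakly taken
     private var ghist = 0

     private def index(pc: Long): Int =
       (((pc >>> 2) ^ ghist) & ((1 << n) - 1)).toInt

     def predict(pc: Long): Boolean = counters(index(pc)) >= 2

     def update(pc: Long, taken: Boolean): Unit = {
       val i = index(pc)
       counters(i) =
         if (taken) (counters(i) + 1) min 3
         else       (counters(i) - 1) max 0
       ghist = ((ghist << 1) | (if (taken) 1 else 0)) & ((1 << n) - 1)
     }
   }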

-### TAGE Configurations
+TAGE Configurations
+-----------------------

-#### Number of TAGE Tables
+Number of TAGE Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~

How many TAGE tables should be used?

-#### TAGE Table Sizes
+TAGE Table Sizes
+~~~~~~~~~~~~~~~~~~~~~~~~

What size should each TAGE table be?

-#### TAGE Table History Lengths
+TAGE Table History Lengths
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

How long should the global history be for each table? This should be a
geometrically increasing value for each table.

-#### TAGE Table Tag Sizes
+TAGE Table Tag Sizes
+~~~~~~~~~~~~~~~~~~~~~~~~~

What size should each tag be?

-#### TAGE Table U-bit Size
+TAGE Table U-bit Size
+~~~~~~~~~~~~~~~~~~~~~~~~~~

How many bits should be used to describe the usefulness of an entry?

15 changes: 0 additions & 15 deletions docs/sections/BranchPrediction/Rocket-NLP-Predictor.rst
@@ -117,21 +117,6 @@ packet* which of the many possible branches will be the dominating
branch that redirects the PC. For this reason, we use a given branch’s
*Fetch PC* rather than its own PC in the BTB tag match. [6]_

-.. [1] Unfortunately, the terminology in the literature gets a bit
-   muddled here in what to call different types and levels of branch
-   predictor. I have seen “micro-BTB" versus “BTB", “NLP" versus “BHT",
-   and “cache-line predictor" versus “overriding predictor". Although
-   the Rocket code calls its own predictor the “BTB", I have chosen to
-   refer to it in documentation as the “next-line predictor", to denote
-   that it is a combinational predictor that provides single-cycle
-   predictions for fetching “the next line", and the Rocket BTB
-   encompasses far more complexity than just a “branch target buffer"
-   structure. Likewise, I have chosen the name “backing predictor" as I
-   believe it is the most accurate name, while simultaneously avoiding
-   being overly descriptive of the internal design (is it a simple BHT?
-   Is it tagged? Does it override the NLP?). But in short, I am open
-   to better names!
.. [2] In reality, only the very lowest bits must be saved, as the
higher-order bits will be the same.
19 changes: 18 additions & 1 deletion docs/sections/BranchPrediction/index.rst
@@ -12,7 +12,7 @@ these predictions.

BOOM uses two levels of branch prediction: a single-cycle “next-line
predictor" (NLP) and a slower but more complex “backing predictor"
-(BPD).
+(BPD) [1]_.

.. toctree::
:maxdepth: 2
@@ -21,3 +21,20 @@ predictor" (NLP) and a slower but more complex “backing predictor"
Rocket-NLP-Predictor
Backing-Predictor
Configurations

+.. [1] Unfortunately, the terminology in the literature gets a bit
+   muddled here in what to call different types and levels of branch
+   predictor. I have seen “micro-BTB" versus “BTB", “NLP" versus “BHT",
+   and “cache-line predictor" versus “overriding predictor". Although
+   the Rocket code calls its own predictor the “BTB", I have chosen to
+   refer to it in documentation as the “next-line predictor", to denote
+   that it is a combinational predictor that provides single-cycle
+   predictions for fetching “the next line", and the Rocket BTB
+   encompasses far more complexity than just a “branch target buffer"
+   structure. Likewise, I have chosen the name “backing predictor" as I
+   believe it is the most accurate name, while simultaneously avoiding
+   being overly descriptive of the internal design (is it a simple BHT?
+   Is it tagged? Does it override the NLP?). But in short, I am open
+   to better names!
3 changes: 0 additions & 3 deletions docs/sections/Decode/decode.rst
@@ -4,6 +4,3 @@ The Decode Stage
The decode stage takes instructions from the fetch buffer, decodes them,
and allocates the necessary resources as required by each instruction.
The decode stage will stall as needed if not all resources are available.

-The Decode Table
-----------------
28 changes: 13 additions & 15 deletions docs/sections/Execute/execute.rst
@@ -72,15 +72,15 @@ efficiently.

For this reason, BOOM uses an abstract Functional Unit class to “wrap"
expert-written, low-level functional units from the Rocket repository
-(see Section [sec:rocket]). However, the expert-written functional units
+(see :ref:`The Rocket-chip Repository Layout`). However, the expert-written functional units
created for the Rocket in-order processor make assumptions about
in-order issue and commit points (namely, that once an instruction has
been dispatched to them it will never need to be killed). These
assumptions break down for BOOM.

However, instead of re-writing or forking the functional units, BOOM
-provides an abstract Functional Unit class (see Fig
-[fig:abstract-functional-unit]) that “wraps" the lower-level functional
+provides an abstract Functional Unit class (see :numref:`abstract-fu`)
+that “wraps" the lower-level functional
units with the parameterized auto-generated support code needed to make
them work within BOOM. The request and response ports are abstracted,
allowing Functional Units to provide a unified, interchangeable
@@ -98,8 +98,7 @@ the micro-op within the expert-written functional unit. If a micro-op is
misspeculated, its response is de-asserted as it exits the functional
unit.

-An example pipelined functional unit is shown in Fig
-[fig:abstract-functional-unit].
+An example pipelined functional unit is shown in :numref:`abstract-fu`.
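
A hypothetical sketch of that unified interface (a plain Scala trait with
illustrative names, not BOOM's Chisel code):

.. code-block:: scala

   case class MicroOp(pc: Long, brMask: Long)
   case class FuResponse(uop: MicroOp, data: Long)

   // The wrapper exposes the same request/response shape for every unit,
   // plus a kill mechanism for misspeculated micro-ops.
   trait FunctionalUnit {
     def request(uop: MicroOp): Boolean   // true if the uop was accepted
     def response(): Option[FuResponse]   // None until a result is valid
     def kill(mispredictMask: Long): Unit // de-assert killed uops' responses
   }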

Un-pipelined Functional Units
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -172,7 +171,7 @@ and fence operations.
BOOM (currently) only supports having one LSU (and thus can only send
one load or store per cycle to memory). [2]_

-See Chapter [sec:lsu] for more details on the LSU.
+See :ref:`The Load/Store Unit (LSU)` for more details on the LSU.

Floating Point Units
--------------------
@@ -187,7 +186,7 @@ Floating Point Units
support).

The low-level floating point units used by BOOM come from the Rocket
-processor (https://github.com/ucb-bar/rocket) and hardfloat
+processor (https://github.com/freechipsproject/rocket-chip) and hardfloat
(https://github.com/ucb-bar/berkeley-hardfloat) repositories. Figure
[fig:functional-unit-fpu] shows the class hierarchy of the FPU.

@@ -198,7 +197,7 @@ Floating Point Divide and Square-root Unit
------------------------------------------

BOOM fully supports floating point divide and square-root operations
-using a single “FDiv/Sqrt" (or fdiv for short). BOOM accomplishes this by
+using a single **FDiv/Sqrt** (or **fdiv** for short). BOOM accomplishes this by
instantiating a double-precision unit from the hardfloat repository. The
unit comes with the following features/constraints:
@@ -218,7 +217,7 @@ double-precision (and then the output downscaled). [4]_

Although the fdiv unit is unpipelined, it does not fit cleanly into the
Pipelined/Unpipelined abstraction used by the other functional units
-(Fig [fig:functional-unit-hierarchy]). This is because the unit provides
+(see :numref:`fu-hierarchy`). This is because the unit provides
an unstable FIFO interface: although the fdiv unit may provide a *ready*
signal on Cycle :math:`i`, there is no guarantee that it will continue
to be *ready* on Cycle :math:`i+1`, even if no operations are enqueued.
@@ -238,7 +237,11 @@ BOOM provides flexibility in specifying the issue width and the mix of
functional units in the execution pipeline. Code [code:exe\_units] shows
how to instantiate an execution pipeline in BOOM.

-::
+
+
+.. _parameterization-exe-unit:
+.. code-block:: scala
+   :caption: Instantiating the Execution Pipeline (in dpath.scala). Adding execution units is as simple as instantiating another ExecutionUnit module and adding it to the exe units ArrayBuffer.
val exe_units = ArrayBuffer[ExecutionUnit]()
@@ -259,11 +262,6 @@ how to instantiate an execution pipeline in BOOM.
exe_units += Module(new MemExeUnit())
}
-Code Caption: Instantiating the Execution Pipeline (in dpath.scala).
-Adding execution units is as simple as instantiating another
-ExecutionUnit module and adding it to the exe units
-ArrayBuffer.

Additional parameterization, regarding things like the latency of the FP
units, can be found within the Configuration settings (configs.scala).

2 changes: 1 addition & 1 deletion docs/sections/InstructionFetch/FetchStage.rst
@@ -14,7 +14,7 @@ cycle where to fetch the next instructions using a “next-line predictor"
(NLP). If a misprediction is detected in BOOM’s backend, or BOOM’s own
predictor wants to redirect the pipeline in a different direction, a
request is sent to the Front-End and it begins fetching along a new
-instruction path. See Chapter [chapter:bpd] for more information on
+instruction path. See :ref:`Branch Prediction` for more information on
how branch prediction fits into the Fetch Unit’s pipeline.

Since superscalar fetch is supported, the *Front-end* returns a
