Fixed links | Added code blocks | Added section links
abejgonzalez committed Nov 20, 2018
1 parent 9f35e6c commit 9b807c9
Showing 19 changed files with 138 additions and 158 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -39,6 +39,7 @@
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
+'sphinx.ext.autosectionlabel'
]

# Add any paths that contain templates here, relative to this directory.
70 changes: 28 additions & 42 deletions docs/sections/BranchPrediction/Backing-Predictor.rst
@@ -10,15 +10,16 @@ are not able to learn very complicated or long history patterns).

To capture more branches and more complicated branching behaviors, BOOM
provides support for a “Backing Predictor", or BPD (see
-:numref:`backing-predictor-unit`.
+:numref:`backing-predictor-unit`).


The BPD’s goal is to provide very high accuracy in a (hopefully) dense
area. To make this possible, the BPD will not make a prediction until
the *fetch packet* has been decoded and the branch targets computed
directly from the instructions themselves. This saves on needing to
-store the *PC tags* and *branch targets* within the BPD.
+store the *PC tags* and *branch targets* within the BPD [7]_.

-The BPD is accessed in parallel with the instruction cache access (See
+The BPD is accessed in parallel with the instruction cache access (see
:numref:`Fetch-Unit`). This allows the BPD to be stored in sequential
memory (i.e., SRAM instead of flip-flops). With some clever
architecting, the BPD can be stored in single-ported SRAM to achieve the
@@ -94,21 +95,21 @@ info packet". This “info packet" is stored in a “branch re-order buffer"
(BROB) until commit time. [11]_ Once all of the instructions
corresponding to the “info packet" are committed, the “info packet" is
sent to the BPD (along with the eventual outcome of the branches) and the
-BPD is updated. Section [sec:brob] covers the BROB, which handles the
+BPD is updated. :ref:`The Branch Reorder Buffer (BROB)` covers the BROB, which handles the
snapshot information needed to update the predictor during
-*Commit*. Section [sec:bpd-rename] covers the BPD Rename
+*Commit*. :ref:`Rename Snapshot State` covers the BPD Rename
Snapshots, which handles the snapshot information needed to update the
predictor during a misspeculation in the *Execute* stage.
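
A rough behavioral sketch of this round trip (plain Scala with
hypothetical names such as ``BpdInfoPacket``; BOOM's actual Chisel
structures differ) might look like:

.. code-block:: scala

   // Hypothetical sketch: a prediction's "info packet" is allocated at
   // prediction time, held in the BROB until commit, then handed back to
   // the BPD together with the resolved branch outcome.
   import scala.collection.mutable.Queue

   case class BpdInfoPacket(history: Long, index: Int)

   class Brob {
     private val inflight = Queue[BpdInfoPacket]()

     def allocate(info: BpdInfoPacket): Unit = inflight.enqueue(info)

     // At commit, return the oldest packet (plus its outcome) to the BPD.
     def commit(taken: Boolean)(updateBpd: (BpdInfoPacket, Boolean) => Unit): Unit =
       updateBpd(inflight.dequeue(), taken)
   }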

Managing the Global History Register
------------------------------------

The *global history register* is an important piece of a branch
-predictor. It contains the outcomes of the previous $N$ branches (where
-$N$ is the size of the global history register). [12]_
+predictor. It contains the outcomes of the previous :math:`N` branches (where
+:math:`N` is the size of the global history register). [12]_

-When fetching branch $i$, it is important that the direction of the
-previous $i-N$ branches is available so an accurate prediction can be
+When fetching branch :math:`i`, it is important that the direction of the
+previous :math:`i-N` branches is available so an accurate prediction can be
made. Waiting until the *Commit* stage to update the global history
register would be too late (dozens of branches would be inflight and not
reflected!). Therefore, the global history register must be updated
@@ -136,7 +137,7 @@ any sort of pipeline flush event.
The Branch Reorder Buffer (BROB)
--------------------------------

-The Reorder Buffer (see Chapter [chapter:rob]) maintains a record of
+The Reorder Buffer (see :ref:`The Reorder Buffer (ROB) and the Dispatch Stage`) maintains a record of
all inflight instructions. Likewise, the Branch Reorder Buffer (BROB)
maintains a record of all inflight branch predictions. These two
structures are decoupled as BROB entries are *incredibly* expensive
@@ -217,7 +218,7 @@ abstract class can be found in :numref:`backing-predictor-unit` labeled “predi
Global History
^^^^^^^^^^^^^^

-As discussed in Section [sec:ghistory], global history is a vital
+As discussed in :ref:`Managing the Global History Register`, global history is a vital
piece of any branch predictor. As such, it is handled by the abstract
BranchPredictor class. Any branch predictor extending the abstract
BranchPredictor class gets access to global history without having to
@@ -226,16 +227,16 @@ handle snapshotting, updating, and bypassing.
Very Long Global History (VLHR)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-Some branch predictors (see Section [sec:tage]) require access to
+Some branch predictors (see :ref:`The TAGE Predictor`) require access to
incredibly long histories – over a thousand bits. Global history is
speculatively updated after each prediction and must be snapshotted and
reset if a misprediction was made. Snapshotting a thousand bits is
untenable. Instead, VLHR is implemented as a circular buffer with a
speculative head pointer and a commit head pointer. As a prediction is
-made, the prediction is written down at $VLHR[spec\_head]$ and the
+made, the prediction is written down at :math:`VLHR[spec\_head]` and the
speculative head pointer is incremented and snapshotted. When a branch
-mispredicts, the head pointer is reset to $snapshot+1$ and the correct
-direction is written to $VLHR[snapshot]$. In this manner, each snapshot
+mispredicts, the head pointer is reset to :math:`snapshot+1` and the correct
+direction is written to :math:`VLHR[snapshot]`. In this manner, each snapshot
is on the order of 10 bits, not 1000 bits.
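
A behavioral sketch of this scheme (plain Scala, not BOOM's Chisel; the
commit head pointer is omitted for brevity):

.. code-block:: scala

   // Sketch of the VLHR circular buffer: predictions are written at the
   // speculative head, and the snapshot is just the old head pointer.
   class Vlhr(nBits: Int) {
     private val buf = Array.fill(nBits)(false)
     private var specHead = 0 // advanced speculatively on every prediction

     def predict(taken: Boolean): Int = {
       val snapshot = specHead // ~10 bits, versus snapshotting all of buf
       buf(specHead) = taken
       specHead = (specHead + 1) % nBits
       snapshot
     }

     // On a misprediction, write the true outcome at the snapshot and
     // reset the head pointer to snapshot + 1.
     def repair(snapshot: Int, taken: Boolean): Unit = {
       buf(snapshot) = taken
       specHead = (snapshot + 1) % nBits
     }
   }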

Operating System-aware Global Histories
@@ -371,16 +372,16 @@ there is no tag match). The table with the longest history making a
prediction wins.

On a misprediction, TAGE attempts to allocate a new entry. It will only
-overwrite an entry that is “not useful” ($ubits == 0$).
+overwrite an entry that is “not useful” (:math:`ubits == 0`).
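
As a sketch of the selection rule (plain Scala; each table is assumed to
report an optional prediction, ordered from shortest history to longest):

.. code-block:: scala

   // The longest-history table that hits (tag match) provides the
   // prediction; None models a tag miss, `default` the fallback predictor.
   def tagePredict(tableHits: Seq[Option[Boolean]], default: Boolean): Boolean =
     tableHits.flatten.lastOption.getOrElse(default)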

TAGE Global History and the Circular Shift Registers (CSRs) [15]_
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each TAGE table has associated with it its own global history (and each
table has geometrically more history than the last table). As the
histories become incredibly long (and thus too expensive to snapshot
directly), TAGE uses the Very Long Global History Register (VLHR) as
-described in Section [sec:vlhr]. The histories contain many more bits
+described in :ref:`Very Long Global History (VLHR)`. The histories contain many more bits
of history than can be used to index a TAGE table; therefore, the
history must be “folded” to fit. A table with 1024 entries uses 10 bits
to index the table. Therefore, if the table uses 20 bits of global
@@ -390,10 +391,12 @@ bits of history.
Instead of attempting to dynamically fold a very long history register
every cycle, the VLHR can be stored in a circular shift register (CSR).
The history is stored already folded and only the new history bit and
-the oldest history bit need to be provided to perform an update. Code
-[code:tage-csr] shows an example of how a CSR works.
+the oldest history bit need to be provided to perform an update.
+:numref:`tage-csr` shows an example of how a CSR works.

-::
+.. _tage-csr:
+.. code-block:: none
+   :caption: The circular shift register. When a new branch outcome is added, the register is shifted (and wrapped around). The new outcome is added and the oldest bit in the history is “evicted”.
Example:
A 12 bit value (0b_0111_1001_1111) folded onto a 5 bit CSR becomes
@@ -411,16 +414,12 @@ the oldest history bit need to be provided to perform an update. Code
(c[4] ^ h[ 0] generates the new c[0]).
(c[1] ^ h[12] generates the new c[2]).
-Code Caption: The circular shift register. When a new branch outcome is added, the register
-is shifted (and wrapped around). The new outcome is added and the oldest bit in the
-history is “evicted”.

Each table must maintain *three* CSRs. The first CSR is used for
-computing the index hash and has a size $n=log(num\_table\_entries)$. As
+computing the index hash and has a size :math:`n=log(num\_table\_entries)`. As
a CSR contains the folded history, any periodic history pattern matching
the length of the CSR will XOR to all zeroes (potentially quite common).
For this reason, there are two CSRs for computing the tag hash, one of
-width $n$ and the other of width $n-1$.
+width :math:`n` and the other of width :math:`n-1`.
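
A behavioral sketch of one such folded-history CSR (plain Scala; the
update rule matches the worked example above, with the evicted history
bit folding in at position ``histLen % csrLen``):

.. code-block:: scala

   // One folded-history CSR: histLen bits of global history folded onto a
   // csrLen-bit register. Only the newest and oldest history bits are
   // needed per update.
   class FoldedHistory(histLen: Int, csrLen: Int) {
     private var csr = 0
     private val evictPos = histLen % csrLen // where the old bit folds in

     def update(newBit: Int, oldBit: Int): Unit = {
       var c = (csr << 1) | newBit // shift in the newest outcome
       c ^= oldBit << evictPos     // "evict" the oldest outcome
       c ^= c >> csrLen            // wrap the shifted-out bit around to bit 0
       csr = c & ((1 << csrLen) - 1)
     }

     def value: Int = csr
   }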

For every prediction, all three CSRs (for every table) must be
snapshotted and reset if a branch misprediction occurs. Another three
@@ -478,24 +477,11 @@ take?". This is very useful for both torture-testing BOOM and for
providing a worst-case performance baseline for comparing branch
predictors.

-.. [6] Each BTB entry corresponds to a single *Fetch PC*, but it is
-   helping to predict across an entire *fetch packet*. However, the
-   BTB entry can only store meta-data and target-data on a single
-   control-flow instruction. While there are certainly pathological
-   cases that can harm performance with this design, the assumption is
-   that there is a correlation between which branch in a *fetch
-   packet* is the dominating branch relative to the *Fetch PC*,
-   and - at least for narrow fetch designs - evaluations of this design
-   have shown it is very complexity-friendly with no noticeable loss in
-   performance. Some other designs instead choose to provide a whole
-   bank of BTBs for each possible instruction in the *fetch
-   packet*.
.. [7] It’s the *PC tag* storage and *branch target* storage that
makes the BTB within the NLP so expensive.
-.. [8] JAL instructions jump to a $PC+Immediate$ location, whereas
-   JALR instructions jump to a $PC+Register[rs1]+Immediate$ location.
+.. [8] JAL instructions jump to a :math:`PC+Immediate` location, whereas
+   JALR instructions jump to a :math:`PC+Register[rs1]+Immediate` location.
.. [9] Redirecting the Fetch Unit in the *Fetch2 Stage* for JAL
   instructions is trivial, as the instruction can be decoded and its
31 changes: 19 additions & 12 deletions docs/sections/BranchPrediction/Configurations.rst
@@ -1,38 +1,45 @@

Branch Prediction Configurations
---------------------------------
+================================

There are a number of parameters provided to govern the branch
prediction in BOOM.

-### GShare Configuration Options
+GShare Configuration Options
+----------------------------

-#### Global History Length
+Global History Length
+~~~~~~~~~~~~~~~~~~~~~

How long of a history should be tracked? The length of the global
-history sets the size of the branch predictor. An $n$-bit history pairs
-with a $2^n$ entry two-bit counter table.
+history sets the size of the branch predictor. An :math:`n`-bit history pairs
+with a :math:`2^n` entry two-bit counter table.
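
As a hedged illustration of this pairing (a minimal plain-Scala sketch
with illustrative names, not BOOM's implementation):

.. code-block:: scala

   // gshare: an n-bit global history XORed with PC bits indexes a
   // 2^n-entry table of 2-bit saturating counters.
   class GShareSketch(n: Int) {
     private val counters = Array.fill(1 << n)(2) // 2 = weakly taken
     private var ghist = 0

     private def index(pc: Long): Int =
       (((pc >>> 2) ^ ghist) & ((1 << n) - 1)).toInt

     def predict(pc: Long): Boolean = counters(index(pc)) >= 2

     def update(pc: Long, taken: Boolean): Unit = {
       val i = index(pc)
       counters(i) =
         if (taken) (counters(i) + 1) min 3
         else       (counters(i) - 1) max 0
       ghist = ((ghist << 1) | (if (taken) 1 else 0)) & ((1 << n) - 1)
     }
   }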

-### TAGE Configurations
+TAGE Configurations
+-----------------------

-#### Number of TAGE Tables
+Number of TAGE Tables
+~~~~~~~~~~~~~~~~~~~~~~~~~~

How many TAGE tables should be used?

-#### TAGE Table Sizes
+TAGE Table Sizes
+~~~~~~~~~~~~~~~~~~~~~~~~

What size should each TAGE table be?

-#### TAGE Table History Lengths
+TAGE Table History Lengths
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

How long should the global history be for each table? This should be a
geometrically increasing value for each table.

-#### TAGE Table Tag Sizes
+TAGE Table Tag Sizes
+~~~~~~~~~~~~~~~~~~~~~~~~~

What size should each tag be?

-#### TAGE Table U-bit Size
+TAGE Table U-bit Size
+~~~~~~~~~~~~~~~~~~~~~~~~~~

How many bits should be used to describe the usefulness of an entry?

15 changes: 0 additions & 15 deletions docs/sections/BranchPrediction/Rocket-NLP-Predictor.rst
@@ -117,21 +117,6 @@ packet* which of the many possible branches will be the dominating
branch that redirects the PC. For this reason, we use a given branch’s
*Fetch PC* rather than its own PC in the BTB tag match. [6]_

-.. [1] Unfortunately, the terminology in the literature gets a bit
-   muddled here in what to call different types and levels of branch
-   predictor. I have seen “micro-BTB" versus “BTB", “NLP" versus “BHT",
-   and “cache-line predictor" versus “overriding predictor". Although
-   the Rocket code calls its own predictor the “BTB", I have chosen to
-   refer to it in documentation as the “next-line predictor", to denote
-   that it is a combinational predictor that provides single-cycle
-   predictions for fetching “the next line", and the Rocket BTB
-   encompasses far more complexity than just a “branch target buffer"
-   structure. Likewise, I have chosen the name “backing predictor" as I
-   believe it is the most accurate name, while simultaneously avoiding
-   being overly descriptive of the internal design (is it a simple BHT?
-   Is it tagged? Does it override the NLP?). But in short, I am open
-   to better names!
.. [2] In reality, only the very lowest bits must be saved, as the
higher-order bits will be the same.
19 changes: 18 additions & 1 deletion docs/sections/BranchPrediction/index.rst
@@ -12,7 +12,7 @@ these predictions.

BOOM uses two levels of branch prediction: a single-cycle “next-line
predictor" (NLP) and a slower but more complex “backing predictor"
-(BPD).
+(BPD) [1]_.

.. toctree::
:maxdepth: 2
@@ -21,3 +21,20 @@ predictor" (NLP) and a slower but more complex “backing predictor"
Rocket-NLP-Predictor
Backing-Predictor
Configurations

+.. [1] Unfortunately, the terminology in the literature gets a bit
+   muddled here in what to call different types and levels of branch
+   predictor. I have seen “micro-BTB" versus “BTB", “NLP" versus “BHT",
+   and “cache-line predictor" versus “overriding predictor". Although
+   the Rocket code calls its own predictor the “BTB", I have chosen to
+   refer to it in documentation as the “next-line predictor", to denote
+   that it is a combinational predictor that provides single-cycle
+   predictions for fetching “the next line", and the Rocket BTB
+   encompasses far more complexity than just a “branch target buffer"
+   structure. Likewise, I have chosen the name “backing predictor" as I
+   believe it is the most accurate name, while simultaneously avoiding
+   being overly descriptive of the internal design (is it a simple BHT?
+   Is it tagged? Does it override the NLP?). But in short, I am open
+   to better names!
3 changes: 0 additions & 3 deletions docs/sections/Decode/decode.rst
@@ -4,6 +4,3 @@ The Decode Stage
The decode stage takes instructions from the fetch buffer, decodes them,
and allocates the necessary resources as required by each instruction.
The decode stage will stall as needed if not all resources are available.

-The Decode Table
-----------------
28 changes: 13 additions & 15 deletions docs/sections/Execute/execute.rst
@@ -72,15 +72,15 @@ efficiently.

For this reason, BOOM uses an abstract Functional Unit class to “wrap"
expert-written, low-level functional units from the Rocket repository
-(see Section [sec:rocket]). However, the expert-written functional units
+(see :ref:`The Rocket-chip Repository Layout`). However, the expert-written functional units
created for the Rocket in-order processor make assumptions about
in-order issue and commit points (namely, that once an instruction has
been dispatched to them it will never need to be killed). These
assumptions break down for BOOM.

However, instead of re-writing or forking the functional units, BOOM
-provides an abstract Functional Unit class (see Fig
-[fig:abstract-functional-unit]) that “wraps" the lower-level functional
+provides an abstract Functional Unit class (see :numref:`abstract-fu`)
+that “wraps" the lower-level functional
units with the parameterized auto-generated support code needed to make
them work within BOOM. The request and response ports are abstracted,
allowing Functional Units to provide a unified, interchangeable
@@ -98,8 +98,7 @@ the micro-op within the expert-written functional unit. If a micro-op is
misspeculated, its response is de-asserted as it exits the functional
unit.

-An example pipelined functional unit is shown in Fig
-[fig:abstract-functional-unit].
+An example pipelined functional unit is shown in :numref:`abstract-fu`.
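
A hypothetical sketch of that unified interface (a plain Scala trait with
illustrative names, not BOOM's Chisel code):

.. code-block:: scala

   case class MicroOp(pc: Long, brMask: Long)
   case class FuResponse(uop: MicroOp, data: Long)

   // The wrapper exposes the same request/response shape for every unit,
   // plus a kill mechanism for misspeculated micro-ops.
   trait FunctionalUnit {
     def request(uop: MicroOp): Boolean   // true if the uop was accepted
     def response(): Option[FuResponse]   // None until a result is valid
     def kill(mispredictMask: Long): Unit // de-assert killed uops' responses
   }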

Un-pipelined Functional Units
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -172,7 +171,7 @@ and fence operations.
BOOM (currently) only supports having one LSU (and thus can only send
one load or store per cycle to memory). [2]_

-See Chapter [sec:lsu] for more details on the LSU.
+See :ref:`The Load/Store Unit (LSU)` for more details on the LSU.

Floating Point Units
--------------------
@@ -187,7 +186,7 @@ Floating Point Units
support).

The low-level floating point units used by BOOM come from the Rocket
-processor (https://github.com/ucb-bar/rocket) and hardfloat
+processor (https://github.com/freechipsproject/rocket-chip) and hardfloat
(https://github.com/ucb-bar/berkeley-hardfloat) repositories. Figure
[fig:functional-unit-fpu] shows the class hierarchy of the FPU.

@@ -198,7 +197,7 @@ Floating Point Divide and Square-root Unit
------------------------------------------

BOOM fully supports floating point divide and square-root operations
-using a single “FDiv/Sqrt" (or fdiv for short). BOOM accomplishes this by
+using a single **FDiv/Sqrt** (or **fdiv** for short). BOOM accomplishes this by
instantiating a double-precision unit from the hardfloat repository. The
unit comes with the following features/constraints:
@@ -218,7 +217,7 @@ double-precision (and then the output downscaled). [4]_

Although the fdiv unit is unpipelined, it does not fit cleanly into the
Pipelined/Unpipelined abstraction used by the other functional units
-(Fig [fig:functional-unit-hierarchy]). This is because the unit provides
+(see :numref:`fu-hierarchy`). This is because the unit provides
an unstable FIFO interface: although the fdiv unit may provide a *ready*
signal on Cycle :math:`i`, there is no guarantee that it will continue
to be *ready* on Cycle :math:`i+1`, even if no operations are enqueued.
@@ -238,7 +237,11 @@ BOOM provides flexibility in specifying the issue width and the mix of
functional units in the execution pipeline. Code [code:exe\_units] shows
how to instantiate an execution pipeline in BOOM.

-::
+
+
+.. _parameterization-exe-unit:
+.. code-block:: scala
+   :caption: Instantiating the Execution Pipeline (in dpath.scala). Adding execution units is as simple as instantiating another ExecutionUnit module and adding it to the exe units ArrayBuffer.
val exe_units = ArrayBuffer[ExecutionUnit]()
@@ -259,11 +262,6 @@ how to instantiate an execution pipeline in BOOM.
exe_units += Module(new MemExeUnit())
}
-Code Caption: Instantiating the Execution Pipeline (in dpath.scala).
-Adding execution units is as simple as instantiating another
-ExecutionUnit module and adding it to the exe units
-ArrayBuffer.

Additional parameterization, regarding things like the latency of the FP
units, can be found within the Configuration settings (configs.scala).

2 changes: 1 addition & 1 deletion docs/sections/InstructionFetch/FetchStage.rst
@@ -14,7 +14,7 @@ cycle where to fetch the next instructions using a “next-line predictor"
(NLP). If a misprediction is detected in BOOM’s backend, or BOOM’s own
predictor wants to redirect the pipeline in a different direction, a
request is sent to the Front-End and it begins fetching along a new
-instruction path. See Chapter [chapter:bpd] for more information on
+instruction path. See :ref:`Branch Prediction` for more information on
how branch prediction fits into the Fetch Unit’s pipeline.

Since superscalar fetch is supported, the *Front-end* returns a
