# Unified Processing Engine
#### Documentation | Version: 0.6.2 | Updated 2018.7.31
---

## Introduction

The processing engine (PE) is at the heart of many hardware-accelerator architectures. This PE is a unified implementation that can be topologically configured to support various Deep Neural Network (DNN) accelerator architectures. Depending on its configuration, it supports inner products with varying amounts of spatial and temporal parallelism, nonlinear activation functions, scalar addition, scalar multiplication, and max functions. Many existing architectures and dataflows, such as Eyeriss, FlexFlow, and ShiDianNao, are special cases of this PE.

## Setup

In [1]:
val path = System.getProperty("user.dir") + "/source/load-ivy.sc"
interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))

path: String = """
C:\Users\RyanL\OneDrive\Research\SEAL\processing-engine/source/load-ivy.sc
"""

In [2]:
import chisel3._
import chisel3.util._
import chisel3.iotesters.{ChiselFlatSpec, Driver, PeekPokeTester}

import scala.math.pow

## Register File

##### Purpose
The Register File (RF) buffers weights and activations and serves as a scratch pad for the ALU. These are termed the Weight, Activation, and Scratch RFs. The Weight and Activation RFs interface between the External-PE Array and the IPU. The Scratch RF sits between the ALU and the NLU, with an interface to the External-PE Array.

##### I/O
* Each RF has two buses in and two buses out: one of each for internal-PE usage and one of each for inter-PE communication. These are termed *Internal Write*, *External Write*, *Internal Read*, and *External Read*, where 'Internal' denotes internal-PE and 'External' denotes inter-PE.
* There are two separate control buses: one for the Internal Read/Write buses and one for the External Read/Write buses.
* Each control bus consists of a write-enable vector (one bit per input), a read-enable vector (one bit per output), a write address for each input, a read address for each output, and a bypass select signal for each output (see the Bypassing section).

##### Bypassing
Bypasses cross between internal and external I/O: the bus for inter-PE writing can be bypassed to the bus for internal-PE reading, and the bus for internal-PE writing can be bypassed to the bus for inter-PE reading. Since the input and output buses can be different sizes, *each* output has its own select signal that must pick one input of the opposite bus.
* Hard Bypassing is disallowed to prevent combinational loops.
* For Firm Bypassing, the select signal *must* always be one-hot; otherwise behavior is undefined. The RF's internal memory is limited to one SInt register per input and read/write addressing is unsupported, so the RF acts as a pipeline register. Additionally, Firm Bypassing is disallowed when an I/O bus lacks its corresponding pair. *E.g.* if there is an Internal Read, there must be an External Write, *etc.*
* For Soft Bypassing, the select signal indicates both whether to bypass or read and which input to bypass from. When the select signal is all false, the RF acts as specified for None Bypassing.
* For None Bypassing, the bypass select signal is deactivated and each of the read/write signals can operate completely independently. It is up to the user to avoid read/write race conditions.
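
The read-side behavior described for Soft Bypassing can be sketched as a plain-Scala software model. This is only an illustration under stated assumptions, not the Chisel hardware: `RfReadDemo`, `readPortModel`, and all of its parameters are hypothetical names, and the model takes the first asserted select bit, which coincides with a one-hot select.

```scala
object RfReadDemo {
    // Software-only model of a single RF read port (hypothetical, not the hardware).
    def readPortModel(
            rEnable: Boolean,
            bpSel: Seq[Boolean],  // one select bit per input of the opposite bus
            bpRegs: Seq[Int],     // registered copies of the opposite bus's writes
            memory: Seq[Int],     // addressable RF contents
            rAddr: Int): Int = {
        if (!rEnable) 0                       // a disabled port reads zero
        else bpSel.indexOf(true) match {
            case -1 => memory(rAddr)          // all-false select: behaves like None Bypassing
            case i  => bpRegs(i)              // otherwise bypass from the selected input
        }
    }

    val memory = Seq(10, 20, 30, 40)
    val bpRegs = Seq(7, 9)

    val plainRead  = readPortModel(true,  Seq(false, false), bpRegs, memory, 2)
    val bypassRead = readPortModel(true,  Seq(false, true),  bpRegs, memory, 2)
    val disabled   = readPortModel(false, Seq(true,  false), bpRegs, memory, 2)
}
```

In this sketch `plainRead` is 30 (the memory word at address 2), `bypassRead` is 9 (the second bypass register), and `disabled` is 0.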

In [3]:
class PartialRFConfig(
        val numInputs: Int,
        val numOutputs: Int,
        val numCrossInputs: Int,
        val addrWidth: Int,
        val bpSoft: Boolean,
        val bpFirm: Boolean)

class PartialRFControl(c: PartialRFConfig) extends Bundle {
    
    // Parameterized Bundles need cloneType, as in the other control bundles
    override def cloneType = (new PartialRFControl(c)).asInstanceOf[this.type]
    
    val wEnable = Vec(c.numInputs, Bool())
    val rEnable = Vec(c.numOutputs, Bool())
    val wAddr = if (!c.bpFirm) Some(Vec(c.numInputs, UInt(c.addrWidth.W))) else None
    val rAddr = if (!c.bpFirm) Some(Vec(c.numOutputs, UInt(c.addrWidth.W))) else None
    // Each output can select which input of the opposite bus to bypass from
    val bpSel = if (c.bpSoft || c.bpFirm) Some(Vec(c.numOutputs, Vec(c.numCrossInputs, Bool()))) else None
}

class RFConfig(
        val numIntInputs: Int,
        val numExtInputs: Int,
        val numIntOutputs: Int,
        val numExtOutputs: Int,
        val addrWidth: Int,
        val dataWidth: Int,
        val bpType: String) {
    
    val bpNone = (bpType == "None")
    val bpSoft = (bpType == "Soft")
    val bpFirm = (bpType == "Firm")
    
    require(bpNone || bpSoft || bpFirm, "Invalid Bypass type.\n")
    require(numIntInputs > 0 || numExtInputs > 0, "Must have at least one input.\n")
    require(numIntOutputs > 0 || numExtOutputs > 0, "Must have at least one output.\n")
    require(dataWidth > 0, "Data bitwidth must be at least one.\n") 
    if (bpFirm) { require(addrWidth == 0, "Address width must be 0 when Firm Bypassing.\n") }
    
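    // numCrossInputs is the number of *inputs* on the opposite bus, because
    // each output's bpSel picks among the opposite bus's bypass registers
    val intConfig = new PartialRFConfig(
        numIntInputs, numIntOutputs, numExtInputs, addrWidth, bpSoft, bpFirm)
    
    val extConfig = new PartialRFConfig(
        numExtInputs, numExtOutputs, numIntInputs, addrWidth, bpSoft, bpFirm)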
}

class RFControl(c: RFConfig) extends Bundle {
    
    override def cloneType = (new RFControl(c)).asInstanceOf[this.type]
    
    val internal = if (c.numIntInputs > 0 || c.numIntOutputs > 0)
        Some(new PartialRFControl(c.intConfig)) else None
    val external = if (c.numExtInputs > 0 || c.numExtOutputs > 0)
        Some(new PartialRFControl(c.extConfig)) else None
}

class RF(c: RFConfig) extends Module {
    
    val io = IO(new Bundle {
        val control = Input(new RFControl(c))
        val wInternal = Input(Vec(c.numIntInputs, SInt(c.dataWidth.W))) 
        val wExternal = Input(Vec(c.numExtInputs, SInt(c.dataWidth.W)))
        val rInternal = Output(Vec(c.numIntOutputs, SInt(c.dataWidth.W)))
        val rExternal = Output(Vec(c.numExtOutputs, SInt(c.dataWidth.W)))
    })
    
    val dataRegister = if (!c.bpFirm) 
        Some(RegInit(Vec.fill(pow(2, c.addrWidth).toInt){0.S(c.dataWidth.W)})) else None
    
    // Need to bypass through a register to prevent combinational loops
    val bpAny = c.bpSoft || c.bpFirm
    val bpRegisterInt = if (bpAny && c.numIntInputs > 0)
        Some(RegInit(Vec.fill(c.numIntInputs){0.S(c.dataWidth.W)})) else None
    val bpRegisterExt = if (bpAny && c.numExtInputs > 0)
        Some(RegInit(Vec.fill(c.numExtInputs){0.S(c.dataWidth.W)})) else None
    
    for (i <- 0 until c.numIntInputs) {
        when (io.control.internal.get.wEnable(i)) {
            if (!c.bpFirm) { dataRegister.get(io.control.internal.get.wAddr.get(i)) := io.wInternal(i) }
            if (bpRegisterInt.isDefined) { bpRegisterInt.get(i) := io.wInternal(i) }
        }
    }
    
    for (i <- 0 until c.numExtInputs) {
        when (io.control.external.get.wEnable(i)) {
            if (!c.bpFirm) { dataRegister.get(io.control.external.get.wAddr.get(i)) := io.wExternal(i) }
            if (bpRegisterExt.isDefined) { bpRegisterExt.get(i) := io.wExternal(i) }
        }
    }
    
    for (i <- 0 until c.numIntOutputs) {
        when (io.control.internal.get.rEnable(i)) {
            if (c.bpFirm) {
                io.rInternal(i) := PriorityMux(io.control.internal.get.bpSel.get(i), bpRegisterExt.get)
            } else if (c.bpSoft) {
                when (io.control.internal.get.bpSel.get(i).contains(true.B)) {
                    // External write bypasses to Internal read
                    io.rInternal(i) := PriorityMux(io.control.internal.get.bpSel.get(i), bpRegisterExt.get)
                } .otherwise {
                    io.rInternal(i) := dataRegister.get(io.control.internal.get.rAddr.get(i))
                }
            } else {
                io.rInternal(i) := dataRegister.get(io.control.internal.get.rAddr.get(i))
            }
        } .otherwise {
            io.rInternal(i) := 0.S
        }
    }
    
    for (i <- 0 until c.numExtOutputs) {
        when (io.control.external.get.rEnable(i)) {
            if (c.bpFirm) {
                io.rExternal(i) := PriorityMux(io.control.external.get.bpSel.get(i), bpRegisterInt.get)
            } else if (c.bpSoft) {
                when (io.control.external.get.bpSel.get(i).contains(true.B)) {
                    // Internal write bypasses to External read
                    io.rExternal(i) := PriorityMux(io.control.external.get.bpSel.get(i), bpRegisterInt.get)
                } .otherwise {
                    io.rExternal(i) := dataRegister.get(io.control.external.get.rAddr.get(i))
                }
            } else {
                io.rExternal(i) := dataRegister.get(io.control.external.get.rAddr.get(i))
            }
        } .otherwise {
            io.rExternal(i) := 0.S
        }
    }
}

defined class PartialRFConfig
defined class PartialRFControl
defined class RFConfig
defined class RFControl
defined class RF

## Inner Product Unit

##### Purpose
The Inner Product Unit (IPU) performs an inner product between two vectors of a configurable length. For SIMD support, this should be configured to be wide. For stationary dataflows, such as Eyeriss, note that setting the width to 1 is equivalent to a scalar multiplication. The IPU sits between the Weight/Activation RFs and the ALU.

##### I/O
* The IPU takes two vector inputs of equal length.
* Control is only necessary during a Firm Bypass; it selects which index of both of the inputs to bypass.
* The primary output is the inner product of the two vectors, but during a Firm Bypass there are two extra outputs: one weight and one activation.

##### Bypassing
* None Bypassing simply outputs the inner product of the two vectors.
* Firm Bypassing adds two output lines: one for a single weight and one for a single activation. Additionally, there is a select signal to select which index of the inputs to bypass.
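
The two bypass modes can be summarized with a small software model in plain Scala. It is a sketch, not the hardware: `IpuDemo` and `ipuModel` are hypothetical names, the model assumes the output bitwidth is large enough that nothing overflows, and it requires a one-hot select.

```scala
object IpuDemo {
    // Software-only sketch of the IPU outputs (hypothetical, not the Chisel module).
    // bpSel = None models None Bypassing; Some(select) models Firm Bypassing.
    def ipuModel(weights: Seq[Int], actvtns: Seq[Int], bpSel: Option[Seq[Boolean]])
            : (Int, Option[(Int, Int)]) = {
        require(weights.length == actvtns.length, "Inputs must have equal length.")
        val innerProd = (weights zip actvtns).map { case (w, a) => w * a }.sum
        // Firm Bypassing forwards the weight/activation pair at the selected index
        val bypass = bpSel.map { sel =>
            require(sel.count(b => b) == 1, "This model assumes a one-hot select.")
            val i = sel.indexOf(true)
            (weights(i), actvtns(i))
        }
        (innerProd, bypass)
    }

    val none = ipuModel(Seq(1, 2, 3), Seq(4, 5, 6), None)
    val firm = ipuModel(Seq(1, 2, 3), Seq(4, 5, 6), Some(Seq(false, true, false)))
}
```

Both calls give an inner product of 32; only `firm` also carries the bypassed pair `(2, 5)`.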

##### Future Plans
* Hard Bypassing should remove the inner product line and keep the bypass lines.
* The IPU should support FMA.
* For much more complex PE designs, multiple outputs could be desirable.

##### Definition

In [4]:
class IPUConfig(val width: Int, val inBitWidth: Int, val outBitWidth: Int, val bpType: String) {
    
    require(width >= 1, "Width must be at least one.\n")
    require(List("None", "Firm").contains(bpType), "Bypass must be \"None\" or \"Firm\".\n")
    require(inBitWidth > 0 && outBitWidth > 0, "Data bitwidth must be greater than 0\n")
    
    val bpFirm = (bpType == "Firm")
}

class IPUOutput(outBitWidth: Int, bp: Boolean) extends Bundle {
    
    override def cloneType = (new IPUOutput(outBitWidth, bp)).asInstanceOf[this.type]
    
    val innerProd = SInt(outBitWidth.W)
    // Extending the bitwidths for consistency
    val bpWeight = if (bp) Some(SInt(outBitWidth.W)) else None
    val bpActvtn = if (bp) Some(SInt(outBitWidth.W)) else None
}


class IPU(c: IPUConfig) extends Module {

    val io = IO(new Bundle {
        val bpSel = if (c.bpFirm) Some(Input(Vec(c.width, Bool()))) else None
        val weightIn = Input(Vec(c.width, SInt(c.inBitWidth.W)))
        val actvtnIn = Input(Vec(c.width, SInt(c.inBitWidth.W)))
        val out = Output(new IPUOutput(c.outBitWidth, c.bpFirm))
    })
    
    private class PMult extends Module {
        val io = IO(new Bundle {
            val weightVec = Input(Vec(c.width, SInt(c.inBitWidth.W)))
            val actvtnVec = Input(Vec(c.width, SInt(c.inBitWidth.W)))
            val pairwiseProd = Output(Vec(c.width, SInt(c.outBitWidth.W)))
        })
        io.pairwiseProd := (io.weightVec zip io.actvtnVec).map { case(a, b) => a * b }
    }
    
    private class SumTree extends Module {
        val io = IO(new Bundle {
            val inVec = Input(Vec(c.width, SInt(c.outBitWidth.W)))
            val sum = Output(SInt(c.outBitWidth.W))
        })
        
        // Recursively builds a balanced reduction tree (an adder tree here)
        private def adjReduce[A](xs: List[A], op: (A, A) => A): A = xs match {
            case List(single) => single
            case default => {
                val grouped = default.grouped(2).toList
                val result = for (g <- grouped) yield { g match {
                    case List(a, b) => op(a, b)
                    case List(x) => x
                }}
                adjReduce(result, op)
            }
        }
        
        io.sum := adjReduce(io.inVec.toList, (x: SInt, y: SInt) => x + y)
    }
    
    private val pMult = Module(new PMult)
    pMult.io.weightVec := io.weightIn
    pMult.io.actvtnVec := io.actvtnIn
    
    private val sumTree = Module(new SumTree)
    sumTree.io.inVec := pMult.io.pairwiseProd
    
    io.out.innerProd := sumTree.io.sum
    
    if (c.bpFirm) {
        io.out.bpWeight.get := PriorityMux(io.bpSel.get, io.weightIn)
        io.out.bpActvtn.get := PriorityMux(io.bpSel.get, io.actvtnIn)
    }
}

defined class IPUConfig
defined class IPUOutput
defined class IPU
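
The tree that `adjReduce` builds can be inspected without Chisel by reducing strings instead of signals. The sketch below repeats the same pairwise grouping in plain Scala so that it runs on its own; `TreeShapeDemo` and the example values are illustrative assumptions.

```scala
object TreeShapeDemo {
    // Same pairwise grouping as adjReduce in SumTree, restated so the sketch is standalone.
    def adjReduce[A](xs: List[A], op: (A, A) => A): A = {
        require(xs.nonEmpty, "Cannot reduce an empty list.")
        xs match {
            case List(single) => single
            case _ =>
                val next = xs.grouped(2).map {
                    case List(a, b) => op(a, b)  // combine adjacent pairs
                    case List(x)    => x         // an odd element passes to the next level
                }.toList
                adjReduce(next, op)
        }
    }

    // Reducing names with a parenthesizing operator exposes the tree structure
    val shape = adjReduce[String](List("p0", "p1", "p2", "p3", "p4"), (x, y) => s"($x+$y)")
    // Reducing numbers gives the ordinary sum
    val sum = adjReduce[Int](List(1, 2, 3, 4, 5), _ + _)
}
```

`shape` evaluates to `(((p0+p1)+(p2+p3))+p4)`, so five products pass through three adder levels rather than the four a left-to-right chain would need, and `sum` is 15.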

## ALU

##### Purpose
The ALU supports four functions: Addition, Maximum, Accumulation, and Identity. Accumulation and Addition enable Convolution layers, while Maximum enables Max-Pooling layers. *It is important to note the operands of each operation*. Accumulation acts on the inner product from the IPU and the feedback from the Scratch RF. Addition and Maximum operate on the bypassed weights and activations. Identity acts on the inner product alone. The ALU sits between the IPU and the Scratch RF.

##### I/O
* The ALU has two inputs: one bus from the IPU and one feedback line from the Scratch RF.
* The bus from the IPU contains the inner product and may or may not contain a weight and activation bypassed from the Weight and Activation RFs. It is enforced that these bypass lines exist if the ALU is set to support addition or maximum.
* It outputs a single signal to the Scratch RF.
* Control consists of a bundle of enable signals: one for each supported function. The enables are interpreted with priority in the following order: Identity, Addition, Maximum, Accumulation. If all of the enables are low, the output will read zero.
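
The priority order among the enables can be sketched as a plain-Scala model. `AluDemo`, `Enables`, and `aluModel` are hypothetical names used only for illustration, and the model ignores bitwidth truncation.

```scala
object AluDemo {
    // Hypothetical software stand-in for the ALU's enable bundle
    case class Enables(idn: Boolean = false, add: Boolean = false,
                       max: Boolean = false, acc: Boolean = false)

    // Mirrors the documented priority: Identity, Addition, Maximum, Accumulation
    def aluModel(en: Enables, innerProd: Int, bpWeight: Int, bpActvtn: Int, rf: Int): Int =
        if (en.idn) innerProd                          // Identity: inner product alone
        else if (en.add) bpWeight + bpActvtn           // Addition: bypassed weight + activation
        else if (en.max) math.max(bpWeight, bpActvtn)  // Maximum of the bypassed pair
        else if (en.acc) innerProd + rf                // Accumulation with Scratch RF feedback
        else 0                                         // all enables low: output zero

    val accumulate = aluModel(Enables(acc = true), 12, 3, 5, 30)
    val addWins    = aluModel(Enables(add = true, acc = true), 12, 3, 5, 30)
    val maximum    = aluModel(Enables(max = true), 12, 3, 5, 30)
    val allLow     = aluModel(Enables(), 12, 3, 5, 30)
}
```

With these inputs `accumulate` is 42, `addWins` is 8 (Addition outranks Accumulation), `maximum` is 5, and `allLow` is 0.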

##### Bypassing
As of right now, bypassing is only supported via the Identity operation. However, there are plans to implement it via a select signal during Soft and Firm Bypassing.

##### Definition

In [5]:
class ALUConfig(val dataWidth: Int, val funcs: List[String]) {
    
    require(funcs.length > 0, "Must support at least one function.")
    for(x <- funcs) { 
        require(List("Identity", "Add", "Max", "Accumulate").contains(x), "Unsupported function.")
    }
    
    val idnSupp = funcs.contains("Identity")
    val addSupp = funcs.contains("Add")
    val maxSupp = funcs.contains("Max")
    val accSupp = funcs.contains("Accumulate")
    val addBypassIn = addSupp || maxSupp
}

class ALUFSel(c: ALUConfig) extends Bundle {
    
    override def cloneType = (new ALUFSel(c)).asInstanceOf[this.type]
    
    // Priority is given from top to bottom
    val idnEnable = if (c.idnSupp) Some(Bool()) else None
    val addEnable = if (c.addSupp) Some(Bool()) else None
    val maxEnable = if (c.maxSupp) Some(Bool()) else None
    val accEnable = if (c.accSupp) Some(Bool()) else None
}

class ALU(c: ALUConfig) extends Module {
 
    val io = IO(new Bundle {
        val fSel = Input(new ALUFSel(c))
        val ipu = Input(new IPUOutput(c.dataWidth, c.addBypassIn))
        val rf = if (c.accSupp) Some(Input(SInt(c.dataWidth.W))) else None
        val out = Output(SInt(c.dataWidth.W))
    })
    
    // The inner getOrElse calls are logically unnecessary,
    // but Chisel can't infer that.
    when (io.fSel.idnEnable.getOrElse(false.B)) {
        io.out := io.ipu.innerProd
    } .elsewhen (io.fSel.addEnable.getOrElse(false.B)) {
        io.out := io.ipu.bpWeight.getOrElse(0.S) + io.ipu.bpActvtn.getOrElse(0.S)
    } .elsewhen (io.fSel.maxEnable.getOrElse(false.B)) {
        when (io.ipu.bpWeight.getOrElse(0.S) > io.ipu.bpActvtn.getOrElse(0.S)) {
            io.out := io.ipu.bpWeight.getOrElse(0.S)
        } .otherwise {
            io.out := io.ipu.bpActvtn.getOrElse(0.S)
        }
    } .elsewhen (io.fSel.accEnable.getOrElse(false.B)) {
        io.out := io.ipu.innerProd + io.rf.getOrElse(0.S)
    } .otherwise {
        io.out := 0.S
    }
}

defined class ALUConfig
defined class ALUFSel
defined class ALU

## Nonlinear Unit

##### Purpose
The Nonlinear Unit (NLU) performs the nonlinear activation functions present in neural networks. Currently it supports the Identity function and ReLU (configured as `"ReLu"`), but there are plans to implement sigmoid and tanh using LUTs and linear interpolation. The NLU sits between the Scratch RF and the inter-PE network.

##### I/O
* The NLU has a single input and a single output, each with an independent bitwidth.
* Control is similar to the ALU: a control bus carries one enable per supported function. If multiple enables are active, the priority is as follows: Identity, then ReLU (then tanh and sinh, once implemented).
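
A minimal plain-Scala model of this priority is sketched below. `NluDemo` and `nluModel` are hypothetical names, and the model assumes the output is wide enough to hold the input, so no truncation is modeled.

```scala
object NluDemo {
    // Software-only sketch of the NLU function select (hypothetical, not the hardware)
    def nluModel(idEnable: Boolean, reluEnable: Boolean, in: Int): Int =
        if (idEnable) in                      // Identity has the highest priority
        else if (reluEnable) math.max(in, 0)  // ReLU clamps negative inputs to zero
        else 0                                // no enable set: output zero

    val identityWins = nluModel(true,  true,  -4)
    val reluNegative = nluModel(false, true,  -4)
    val reluPositive = nluModel(false, true,  6)
    val allLow       = nluModel(false, false, 6)
}
```

Here `identityWins` is -4 because Identity outranks ReLU, `reluNegative` is 0, `reluPositive` is 6, and `allLow` is 0.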

##### Bypassing
Bypassing is only supported via the Identity function and with direct reads to the Scratch RF. There are no plans to implement other Bypass functions, as this seems sufficient.

##### Definition

In [6]:
class NLUConfig(val inBitWidth: Int, val outBitWidth: Int, val funcs: List[String]) {
    
    for(x <- funcs) {
        require(List("Identity", "ReLu").contains(x), "Unsupported Function")
    }
    
    val idSupp = funcs.contains("Identity")
    val reluSupp = funcs.contains("ReLu")
    val tanhSupp = false //funcs.contains("tanh")
    val sinhSupp = false //funcs.contains("sinh")
}

class NLUFSel(c: NLUConfig) extends Bundle {
    
    override def cloneType = (new NLUFSel(c)).asInstanceOf[this.type]
    
    val idEnable = if (c.idSupp) Some(Bool()) else None
    val reluEnable = if (c.reluSupp) Some(Bool()) else None
    val tanhEnable = if (c.tanhSupp) Some(Bool()) else None 
    val sinhEnable = if (c.sinhSupp) Some(Bool()) else None
}

class NLU(c: NLUConfig) extends Module {
    
    val io = IO(new Bundle {
        val fSel = Input(new NLUFSel(c))
        val in = Input(SInt(c.inBitWidth.W))
        val out = Output(SInt(c.outBitWidth.W))
    })
    
    when (io.fSel.idEnable.getOrElse(false.B)) {
        io.out := io.in
    } .elsewhen (io.fSel.reluEnable.getOrElse(false.B)) {
        when (io.in > 0.S) {
            io.out := io.in
        } .otherwise {
            io.out := 0.S
        }
    } .elsewhen (io.fSel.tanhEnable.getOrElse(false.B)) {
        // TODO
        io.out := 0.S
    } .elsewhen (io.fSel.sinhEnable.getOrElse(false.B)) {
        // TODO
        io.out := 0.S
    } .otherwise {
        io.out := 0.S
    }
}

defined class NLUConfig
defined class NLUFSel
defined class NLU

## Control

For maximum configurability, control is implemented via a State Machine (Moore) and Decoder. The state machine must have its states and state transitions configured. The Decoder must have its input-output map configured.

### State Machine

##### Purpose
The State Machine encodes what state the PE should be in and what state to go to, given an external control signal. It is the interface between external control and the decoder.

##### I/O
* The State Machine takes one input: control from outside the PE.
* It also has one output: state information for the decoder.

##### Configuring
Configuring the state machine requires three things: the number of possible states, the number of possible input signals, and the map from each (state, input signal) pair to the next state. This map is implemented as a function. To work with Chisel, the function must return a Wire that is connected to the appropriate UInt literal.

##### Example

In [8]:
def exampleStateMap(state: UInt, control: UInt, c: StateMachineConfig): UInt = {
    
    val nextState = Wire(UInt(c.stateWidth.W))
    
    when (state === 0.U & control === 0.U) { nextState := 0.U }
    .elsewhen (state === 0.U & control === 1.U) { nextState := 1.U }
    .elsewhen (state === 1.U & control === 0.U) { nextState := 0.U }
    .elsewhen (state === 1.U & control === 1.U) { nextState := 1.U }
    .otherwise { nextState := 0.U }
    
    nextState
}

defined function exampleStateMap
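
The transitions encoded by `exampleStateMap` can be traced in plain Scala. The helper below is a hypothetical software mirror of the map (the names `StateTraceDemo`, `nextState`, `controls`, and `trace` are not part of the design), and it assumes the state register resets to 0, as in the definition that follows.

```scala
object StateTraceDemo {
    // Plain-Scala mirror of exampleStateMap; the hardware does not use this
    def nextState(state: Int, control: Int): Int = (state, control) match {
        case (0, 0) => 0
        case (0, 1) => 1
        case (1, 0) => 0
        case (1, 1) => 1
        case _      => 0  // matches the .otherwise branch
    }

    // scanLeft starts from the reset value 0, so the first element is the reset state
    val controls = List(1, 1, 0, 1)
    val trace = controls.scanLeft(0)(nextState)
}
```

`trace` evaluates to `List(0, 1, 1, 0, 1)`: after reset, the state simply follows the control bit from the previous cycle.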

##### Definition

In [7]:
class StateMachineConfig(
        val numStates: Int, 
        val numCtrlSigs: Int, 
        val stateMap: (UInt, UInt, StateMachineConfig) => UInt) {
    
    val stateWidth = log2Up(numStates)
    val ctrlWidth = log2Up(numCtrlSigs)
}

class StateMachine(c: StateMachineConfig) extends Module {
    
    val io = IO(new Bundle {
        val control = Input(UInt(c.ctrlWidth.W))
        val out = Output(UInt(c.stateWidth.W))
    })
    
    val register = RegInit(0.U(c.stateWidth.W))
    register := c.stateMap(register, io.control, c)
    
    io.out := register
}

defined class StateMachineConfig
defined class StateMachine

### Decoder

##### Purpose
The decoder acts as is typical in processors: it converts the state information into all the control signals associated with the appropriate state. It acts as the interface between the State Machine and every other module in the PE. 

##### I/O
* The decoder has a single input from the State Machine.
* It also has a control bus output going to all the memories, *viz.* the Weight RF, Activation RF, and Scratch RF. This bus controls Read/Write Addressing and Enabling, as well as Bypassing.
* It has another control bus output to all Processing Units, *viz.* IPU, ALU, NLU. This bus controls Function Selecting and Bypassing.

##### Configuring
Since the Decoder needs to know almost everything about the PE, it takes the full PE configuration. The decoder's logic is configured by the set of functions whose names start with "decode"; there is one for each module. Each function must generate a Wire of the appropriate bus type, connect each of its signals to the desired literal, and return it. For example, decodeWeightRF must generate a Wire\[RFControl\] and set the signals that exist.

##### Example
In the works.
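
In the meantime, the shape of a decode function can be sketched in plain Scala as a map from state to control values. This is a software stand-in only: `DecodeDemo`, `NluSel`, and the state assignment (0 = idle, 1 = ReLU, 2 = pass-through) are assumptions made for illustration, and a real decode function would return a Chisel Wire rather than a case class.

```scala
object DecodeDemo {
    // Hypothetical control record standing in for a Wire of NLUFSel
    case class NluSel(idEnable: Boolean, reluEnable: Boolean)

    // Assumed state assignment: 0 = idle, 1 = ReLU, 2 = pass-through
    def decodeNLU(state: Int): NluSel = state match {
        case 1 => NluSel(idEnable = false, reluEnable = true)
        case 2 => NluSel(idEnable = true,  reluEnable = false)
        case _ => NluSel(idEnable = false, reluEnable = false)  // idle and unknown states
    }

    val idle = decodeNLU(0)
    val relu = decodeNLU(1)
}
```

Every signal in the bundle receives a value in every state, which is the same discipline the Chisel version needs so that no control wire is left unconnected.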

In [9]:
class PEConfig(
        val weightRFConfig: RFConfig,
        val actvtnRFConfig: RFConfig,
        val scratchRFConfig: RFConfig,
        val ipuConfig: IPUConfig,
        val aluConfig: ALUConfig,
        val nluConfig: NLUConfig,
        val smConfig: StateMachineConfig,
        val decodeWeightRF: (UInt, RFConfig) => Data,
        val decodeActvtnRF: (UInt, RFConfig) => Data,
        val decodeScratchRF: (UInt, RFConfig) => Data,
        val decodeIPU: (UInt, IPUConfig) => Data,
        val decodeALU: (UInt, ALUConfig) => Data,
        val decodeNLU: (UInt, NLUConfig) => Data) {

    require(ipuConfig.width == weightRFConfig.numIntOutputs, 
        "IPU input width not equal to Weight RF Internal Output width.\n")
    require(ipuConfig.width == actvtnRFConfig.numIntOutputs,
        "IPU input width not equal to Activation RF Internal Output width.\n")
    
    if(ipuConfig.bpFirm) {
        require(aluConfig.addSupp || aluConfig.maxSupp,
            "Incompatible ALU and IPU Configurations")
    }
}

class MemoryControl(c: PEConfig) extends Bundle {
    
    override def cloneType = (new MemoryControl(c)).asInstanceOf[this.type]
    
    val weightRF = new RFControl(c.weightRFConfig)
    val actvtnRF = new RFControl(c.actvtnRFConfig)
    val scratchRF = new RFControl(c.scratchRFConfig)
}

class ProcessControl(c: PEConfig) extends Bundle {
    
    override def cloneType = (new ProcessControl(c)).asInstanceOf[this.type]
    
    val aluFSel = Output(new ALUFSel(c.aluConfig))
    val nluFSel = Output(new NLUFSel(c.nluConfig))
    
    val ipuBpSel = if (c.ipuConfig.bpFirm) Some(Output(Vec(c.ipuConfig.width, Bool()))) else None
}

class Decoder(c: PEConfig) extends Module {
    
    val io = IO(new Bundle {
        val state = Input(UInt(c.smConfig.stateWidth.W))
        val mem = Output(new MemoryControl(c))
        val proc = Output(new ProcessControl(c))
    })
    
    io.mem.weightRF <> c.decodeWeightRF(io.state, c.weightRFConfig)
    io.mem.actvtnRF <> c.decodeActvtnRF(io.state, c.actvtnRFConfig)
    io.mem.scratchRF <> c.decodeScratchRF(io.state, c.scratchRFConfig)
    
    if (c.ipuConfig.bpFirm) { 
        io.proc.ipuBpSel.get := c.decodeIPU(io.state, c.ipuConfig)
    }
    
    io.proc.aluFSel <> c.decodeALU(io.state, c.aluConfig)
    io.proc.nluFSel <> c.decodeNLU(io.state, c.nluConfig)
}

defined class PEConfig
defined class MemoryControl
defined class ProcessControl
defined class Decoder

## PE

In [18]:
class PE(c: PEConfig) extends Module {
    
    val cw = c.weightRFConfig
    val ca = c.actvtnRFConfig
    val cs = c.scratchRFConfig
    
    val io = IO(new Bundle {
        val stateCtrl = Input(UInt(c.smConfig.ctrlWidth.W))
        val toWeightRF = Input(Vec(cw.numExtInputs, SInt(cw.dataWidth.W))) 
        val toActvtnRF = Input(Vec(ca.numExtInputs, SInt(ca.dataWidth.W)))
        val toScratchRF = Input(Vec(cs.numExtInputs, SInt(cs.dataWidth.W)))
        val fromWeightRF = Output(Vec(cw.numExtOutputs, SInt(cw.dataWidth.W)))
        val fromActvtnRF = Output(Vec(ca.numExtOutputs, SInt(ca.dataWidth.W)))
        val fromScratchRF = Output(Vec(cs.numExtOutputs, SInt(cs.dataWidth.W)))
        val totalOutput = Output(SInt(c.nluConfig.outBitWidth.W))
    })
    
    val stateMachine = Module(new StateMachine(c.smConfig))
    stateMachine.io.control := io.stateCtrl
    
    val decoder = Module(new Decoder(c))
    decoder.io.state := stateMachine.io.out
    
    val weightRF = Module(new RF(cw))
    weightRF.io.control <> decoder.io.mem.weightRF
    weightRF.io.wExternal := io.toWeightRF
    io.fromWeightRF := weightRF.io.rExternal
    
    val actvtnRF = Module(new RF(ca))
    actvtnRF.io.control <> decoder.io.mem.actvtnRF
    actvtnRF.io.wExternal := io.toActvtnRF
    io.fromActvtnRF := actvtnRF.io.rExternal
       
    val ipu = Module(new IPU(c.ipuConfig))
    if (ipu.io.bpSel.isDefined) { 
        ipu.io.bpSel.get := decoder.io.proc.ipuBpSel.get 
    }
    ipu.io.weightIn := weightRF.io.rInternal
    ipu.io.actvtnIn := actvtnRF.io.rInternal

    val alu = Module(new ALU(c.aluConfig))
    alu.io.fSel <> decoder.io.proc.aluFSel
    alu.io.ipu <> ipu.io.out
    
    val scratchRF = Module(new RF(cs))
    scratchRF.io.control <> decoder.io.mem.scratchRF
    scratchRF.io.wExternal := io.toScratchRF
    scratchRF.io.wInternal(0) := alu.io.out // TODO: Add Req. for this
    io.fromScratchRF := scratchRF.io.rExternal
    if(alu.io.rf.isDefined) { 
        alu.io.rf.get := scratchRF.io.rInternal(0) // TODO: Add Req. for this
    }
    
    val nlu = Module(new NLU(c.nluConfig))
    nlu.io.fSel <> decoder.io.proc.nluFSel
    nlu.io.in := scratchRF.io.rInternal(0) // TODO: Add Req. for this
    
    io.totalOutput := nlu.io.out
}

defined class PE

## Future Plans

Basing the PE on bypasses (i.e. subtractive modifications to a preconfigured topology) is limiting. A constructive approach would be much more general. This should be accomplished through a DSL that makes fuller use of Scala's functional programming capabilities.