# Unified Processing Engine
#### Documentation | Version: 0.6.1 | Updated 2018.7.30
---

## Introduction

The processing engine (PE) is at the heart of many hardware-accelerator architectures. This PE is a unified implementation that can be topologically configured to support various Deep Neural Network (DNN) accelerator architectures. Based on its configuration, it supports inner products with varying amounts of spatial and temporal parallelism, nonlinear activation functions, scalar addition, scalar multiplication, and max functions. Many existing architectures and dataflows, such as Eyeriss, FlexFlow, and ShiDianNao, are special cases of this PE.

## Setup

In [1]:
val path = System.getProperty("user.dir") + "/source/load-ivy.sc"
interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))

path: String = """
C:\Users\RyanL\OneDrive\Research\SEAL\processing-engine/source/load-ivy.sc
"""

In [2]:
import chisel3._
import chisel3.util._
import chisel3.iotesters.{ChiselFlatSpec, Driver, PeekPokeTester}

import scala.math.pow


## Register File

##### Purpose
The Register File (RF) is used as a buffer for weights and activations, as well as a scratch pad for the ALU.

##### I/O
* Each RF has two buses in and two buses out: one for internal-PE usage and the other for inter-PE communication. These are termed *Internal Write*, *External Write*, *Internal Read*, and *External Read*&mdash;where 'Internal' denotes internal-PE and 'External' denotes inter-PE.
* There are two separate control buses: one for the Internal Read/Write buses and one for the External Read/Write buses.
* Each of these control buses consists of a one-hot write enable, a one-hot read enable, a write address for each input, a read address for each output, and a select signal for each output for bypassing (see the Bypassing section).

##### Bypassing
The bypass behavior is slightly tricky: the bus for inter-PE writing can be bypassed to the bus for internal-PE reading and the bus for internal-PE writing can be bypassed to the bus for inter-PE reading. In other words, bypasses cross between internal and external I/O. Since the input and output buses can be different sizes, *each* output signal has a select signal that must pick an input signal (of the appropriate bus). 
* Hard Bypassing is disallowed to prevent combinational loops.
* For Firm Bypassing, this select signal *must* always be one-hot; otherwise behavior is undefined. The RF's internal memory is limited to one SInt register per input and read/write addressing is unsupported, so the RF acts as a pipelining register. Additionally, Firm Bypassing is disallowed when an I/O bus does not have its corresponding pair. *E.g.* if there is an Internal Read, there must be an External Write, *etc.*
* For Soft Bypassing, this signal indicates both whether to bypass or read and, when bypassing, which input to bypass from. When the select signal is all false, the RF acts as specified for a None Bypass.
* For None Bypassing, the bypass select signal is not present and each of the read/write signals can operate completely independently. It is up to the user to avoid read/write race conditions.
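The read-side behavior described above can be sketched as a plain-Scala software model (not Chisel). This is a minimal sketch under stated assumptions: the function name `softBypassRead` and its parameters are hypothetical, and it models a single read port under Soft Bypassing, where the lowest-index asserted select bit wins (as `PriorityMux` does in the RF code) and a disabled read returns zero.

```scala
// Software model (not Chisel) of one RF read port under Soft Bypassing.
// rEnable:    read enable for this output
// bpSel:      select bits for this output, one per input of the opposite bus
// bypassRegs: values currently held in the bypass registers of the opposite bus
// stored:     value the addressed register would return on a normal read
def softBypassRead(rEnable: Boolean, bpSel: Seq[Boolean], bypassRegs: Seq[Int], stored: Int): Int = {
  require(bpSel.length == bypassRegs.length, "one select bit per bypass source")
  if (!rEnable) 0 // a disabled read port drives zero
  else bpSel.indexOf(true) match {
    case -1 => stored        // all-false select: behave like a None Bypass read
    case i  => bypassRegs(i) // lowest-index asserted bit picks the bypass source
  }
}

val bypassed = softBypassRead(rEnable = true, Seq(false, true), Seq(10, 20), stored = 99)
val normal   = softBypassRead(rEnable = true, Seq(false, false), Seq(10, 20), stored = 99)
```

Here `bypassed` takes the value from the second bypass register, while `normal` falls back to the stored value because no select bit is set.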

In [3]:
class PartialRFConfig(
        val numInputs: Int,
        val numOutputs: Int,
        val numCrossInputs: Int,
        val addrWidth: Int,
        val bpSoft: Boolean,
        val bpFirm: Boolean)

class PartialRFControl(c: PartialRFConfig) extends Bundle {
    
    override def cloneType = (new PartialRFControl(c)).asInstanceOf[this.type]
    
    val wEnable = Vec(c.numInputs, Bool())
    val rEnable = Vec(c.numOutputs, Bool())
    val wAddr = if (!c.bpFirm) Some(Vec(c.numInputs, UInt(c.addrWidth.W))) else None
    val rAddr = if (!c.bpFirm) Some(Vec(c.numOutputs, UInt(c.addrWidth.W))) else None
    // Each output can select which input of the opposite bus to bypass from
    val bpSel = if (c.bpSoft || c.bpFirm) Some(Vec(c.numOutputs, Vec(c.numCrossInputs, Bool()))) else None
}

class RFConfig(
        val numIntInputs: Int,
        val numExtInputs: Int,
        val numIntOutputs: Int,
        val numExtOutputs: Int,
        val addrWidth: Int,
        val dataWidth: Int,
        val bpType: String) {
    
    val bpNone = (bpType == "None")
    val bpSoft = (bpType == "Soft")
    val bpFirm = (bpType == "Firm")
    
    require(bpNone || bpSoft || bpFirm, "Invalid Bypass type.\n")
    require(numIntInputs > 0 || numExtInputs > 0, "Must have at least one input.\n")
    require(numIntOutputs > 0 || numExtOutputs > 0, "Must have at least one output.\n")
    require(dataWidth > 0, "Data bitwidth must be at least one.\n") 
    if (bpFirm) { require(addrWidth == 0, "Address width must be 0 when Firm Bypassing.\n") }
    
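    // numCrossInputs is the input count of the opposite bus, since bypasses cross
    // between internal and external I/O (each bpSel picks among those inputs)
    val intConfig = new PartialRFConfig(
        numIntInputs, numIntOutputs, numExtInputs, addrWidth, bpSoft, bpFirm)
    
    val extConfig = new PartialRFConfig(
        numExtInputs, numExtOutputs, numIntInputs, addrWidth, bpSoft, bpFirm)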
}

class RFControl(c: RFConfig) extends Bundle {
    
    override def cloneType = (new RFControl(c)).asInstanceOf[this.type]
    
    val internal = if (c.numIntInputs > 0 || c.numIntOutputs > 0)
        Some(new PartialRFControl(c.intConfig)) else None
    val external = if (c.numExtInputs > 0 || c.numExtOutputs > 0)
        Some(new PartialRFControl(c.extConfig)) else None
}

class RF(c: RFConfig) extends Module {
    
    val io = IO(new Bundle {
        val control = Input(new RFControl(c))
        val wInternal = Input(Vec(c.numIntInputs, SInt(c.dataWidth.W))) 
        val wExternal = Input(Vec(c.numExtInputs, SInt(c.dataWidth.W)))
        val rInternal = Output(Vec(c.numIntOutputs, SInt(c.dataWidth.W)))
        val rExternal = Output(Vec(c.numExtOutputs, SInt(c.dataWidth.W)))
    })
    
    val dataRegister = if (!c.bpFirm) 
        Some(RegInit(Vec.fill(pow(2, c.addrWidth).toInt){0.S(c.dataWidth.W)})) else None
    
    // Need to bypass through a register to prevent combinational loops
    val bpAny = c.bpSoft || c.bpFirm
    val bpRegisterInt = if (bpAny && c.numIntInputs > 0)
        Some(RegInit(Vec.fill(c.numIntInputs){0.S(c.dataWidth.W)})) else None
    val bpRegisterExt = if (bpAny && c.numExtInputs > 0)
        Some(RegInit(Vec.fill(c.numExtInputs){0.S(c.dataWidth.W)})) else None
    
    for (i <- 0 until c.numIntInputs) {
        when (io.control.internal.get.wEnable(i)) {
            if (!c.bpFirm) { dataRegister.get(io.control.internal.get.wAddr.get(i)) := io.wInternal(i) }
            if (bpRegisterInt.isDefined) { bpRegisterInt.get(i) := io.wInternal(i) }
        }
    }
    
    for (i <- 0 until c.numExtInputs) {
        when (io.control.external.get.wEnable(i)) {
            if (!c.bpFirm) { dataRegister.get(io.control.external.get.wAddr.get(i)) := io.wExternal(i) }
            if (bpRegisterExt.isDefined) { bpRegisterExt.get(i) := io.wExternal(i) }
        }
    }
    
    for (i <- 0 until c.numIntOutputs) {
        when (io.control.internal.get.rEnable(i)) {
            if (c.bpFirm) {
                io.rInternal(i) := PriorityMux(io.control.internal.get.bpSel.get(i), bpRegisterExt.get)
            } else if (c.bpSoft) {
                when (io.control.internal.get.bpSel.get(i).contains(true.B)) {
                    // External write bypasses to Internal read
                    io.rInternal(i) := PriorityMux(io.control.internal.get.bpSel.get(i), bpRegisterExt.get)
                } .otherwise {
                    io.rInternal(i) := dataRegister.get(io.control.internal.get.rAddr.get(i))
                }
            } else {
                io.rInternal(i) := dataRegister.get(io.control.internal.get.rAddr.get(i))
            }
        } .otherwise {
            io.rInternal(i) := 0.S
        }
    }
    
    for (i <- 0 until c.numExtOutputs) {
        when (io.control.external.get.rEnable(i)) {
            if (c.bpFirm) {
                io.rExternal(i) := PriorityMux(io.control.external.get.bpSel.get(i), bpRegisterInt.get)
            } else if (c.bpSoft) {
                when (io.control.external.get.bpSel.get(i).contains(true.B)) {
                    // Internal write bypasses to External read
                    io.rExternal(i) := PriorityMux(io.control.external.get.bpSel.get(i), bpRegisterInt.get)
                } .otherwise {
                    io.rExternal(i) := dataRegister.get(io.control.external.get.rAddr.get(i))
                }
            } else {
                io.rExternal(i) := dataRegister.get(io.control.external.get.rAddr.get(i))
            }
        } .otherwise {
            io.rExternal(i) := 0.S
        }
    }
}

defined [32mclass[39m [36mRFInputs[39m
defined [32mclass[39m [36mRFOutputs[39m
defined [32mclass[39m [36mRF[39m

## Inner Product Unit

The Inner Product Unit (IPU) performs an inner product between two vectors of a configurable length. For SIMD support, this should be configured to be wide. For traditional PEs, such as Eyeriss, note that setting the width to 1 is equivalent to a scalar multiplication.

Bypass options are None or Firm. For Firm, only one pair of weights and activations may be bypassed at a time, so the IPU may interface with only one ALU. Thus, if the bypass type is Firm, a bypass selection signal selects which input pair to feed to the bypass outputs. This may be generalized to interface with multiple TUs, but that is not currently supported.

There are plans to implement both Soft and Hard bypassing. There are also plans to make using FMA modules an option.

The IPU contains two embedded modules: a parallel multiplier and a reduction tree.

### Parallel Multiplier

The parallel multiplier simply takes two vector inputs of width n and outputs the element-wise multiplication. There are plans to make the multiplier type configurable, *e.g.* serial, combinational, pipelined *etc*.
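As a rough illustration of the element-wise behavior, here is a plain-Scala software model (not Chisel). The names `wrapSigned` and `parallelMultiply` are hypothetical. The sketch assumes that each product is truncated to the configured bit width when it is assigned to an output of that width, which is how the model wraps the results; treat that as an assumption to check against the generated hardware.

```scala
// Wrap a value to a signed two's-complement integer of the given bit width.
def wrapSigned(x: Long, bitWidth: Int): Long = {
  require(bitWidth >= 1 && bitWidth < 63, "bit width out of range for this model")
  val low = x & ((1L << bitWidth) - 1)           // keep the low bitWidth bits
  if ((low >> (bitWidth - 1)) == 1L) low - (1L << bitWidth) else low
}

// Element-wise product of two equal-length vectors, each result wrapped to bitWidth bits.
def parallelMultiply(weight: Seq[Long], actvtn: Seq[Long], bitWidth: Int): Seq[Long] = {
  require(weight.length == actvtn.length, "inputs must have the same number of elements")
  weight.zip(actvtn).map { case (w, a) => wrapSigned(w * a, bitWidth) }
}

val prods = parallelMultiply(Seq(2L, -3L, 100L), Seq(4L, 5L, 3L), bitWidth = 8)
```

The last pair (100 and 3) overflows 8 bits, so the modeled output differs from the mathematical product; the first two pairs fit and are unchanged.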

In [5]:
class PMultConfiguration(val numPairs: Int, val bitWidth: Int) {
    require(numPairs >= 1, "Must have at least one pair of multiplicands.")
    require(bitWidth >= 1, "Bitwidth must be at least one.")
}

class PMultInput(numPairs: Int, bitWidth: Int) extends Bundle {
    
    override def cloneType = (new PMultInput(numPairs, bitWidth)).asInstanceOf[this.type]
    
    val weight = Vec(numPairs, SInt(bitWidth.W))
    val actvtn = Vec(numPairs, SInt(bitWidth.W))
}

class PMult(config: PMultConfiguration) extends Module {
    
    val np = config.numPairs
    val bw = config.bitWidth
    
    val io = IO(new Bundle {
        val in = Input(new PMultInput(np, bw))
        val prod = Output(Vec(np, SInt(bw.W)))
    })
    
    io.prod := (io.in.weight zip io.in.actvtn).map { case(a, b) => a * b }
}

defined [32mfunction[39m [36mcheckParamsPMult[39m
defined [32mclass[39m [36mPMultInputs[39m
defined [32mclass[39m [36mPMult[39m

### Additive Reduction Tree

The additive reduction tree sums the elements of the n-dim input vector using ceil(log2(n)) layers of 2-input adders. Note that this is a plain sum, which equals the L1 norm only when every element is non-negative.

There are plans to pipeline this (as well as everything else...).
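To make the layer count concrete, the following plain-Scala sketch (not Chisel) reduces a list pairwise and tracks the adder depth alongside the running sum. The helper name `pairwiseReduce` is hypothetical and the code is only a software model of the tree shape.

```scala
// Pairwise reduction: combine adjacent elements level by level until one value remains.
def pairwiseReduce[A](xs: List[A])(op: (A, A) => A): A = xs match {
  case Nil           => throw new IllegalArgumentException("cannot reduce an empty list")
  case single :: Nil => single
  case _ =>
    val next = xs.grouped(2).map {
      case List(a, b) => op(a, b) // a full pair goes through one adder
      case List(a)    => a        // an odd element passes through to the next level
    }.toList
    pairwiseReduce(next)(op)
}

// Carry (partial sum, adder depth) through the tree to observe how many layers are used.
val inputs = List(1, 2, 3, 4, 5)
val (total, depth) = pairwiseReduce(inputs.map(x => (x, 0))) { (l, r) =>
  (l._1 + r._1, math.max(l._2, r._2) + 1)
}
```

For these five inputs the tree needs three adder layers, which matches ceil(log2(5)).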

In [6]:
class AdditiveRTConfiguration(val numAddends: Int, val bitWidth: Int) {
    require(numAddends >= 1, "Number of addends must be at least one.")
    require(bitWidth >= 1, "Bitwidth must be at least one.")
}

// Recursively builds a balanced binary reduction tree by combining adjacent pairs
def adjReduce[A](xs: List[A], op: (A, A) => A): A = xs match {
    case Nil => throw new IllegalArgumentException
    case List(single) => single
    case default => {
        val grouped = default.grouped(2).toList
        val result = for (g <- grouped) yield {
            g match {
                case List(a, b) => op(a, b)
                case List(x) => x
            }
        }
        adjReduce(result, op)
    }
}

class AdditiveRT(config: AdditiveRTConfiguration) extends Module {

    val na = config.numAddends
    val bw = config.bitWidth
    
    val io = IO(new Bundle {
        val in  = Input(Vec(na, SInt(bw.W)))
        val sum = Output(SInt(bw.W))
    })
    
    io.sum := adjReduce(io.in.toList, (x: SInt, y: SInt) => x + y)
}

defined [32mfunction[39m [36mnonassocPairwiseReduce[39m
defined [32mfunction[39m [36mcheckParamsAdditiveRT[39m
defined [32mclass[39m [36mAdditiveRT[39m

### Putting them Together

Together, the parallel multiplier and additive reduction tree perform an n-dim inner product. They are simply connected output to input. During use of the PE, the bypass type of the IPU is dictated by the functionality of the ALU.
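A plain-Scala software model (not Chisel) of the combined unit may help when checking expected values. The name `ipuModel` is hypothetical, overflow is deliberately not modeled, and the optional select vector stands in for the Firm bypass selection, where the lowest-index asserted bit picks the weight/activation pair to forward.

```scala
// Software model (not Chisel): inner product plus the optional Firm bypass pair.
// bpSel = None models bypass type "None"; Some(sel) models "Firm".
def ipuModel(weight: Seq[Int], actvtn: Seq[Int],
             bpSel: Option[Seq[Boolean]] = None): (Int, Option[(Int, Int)]) = {
  require(weight.nonEmpty && weight.length == actvtn.length, "inputs must be non-empty and equal length")
  val innerProd = weight.zip(actvtn).map { case (w, a) => w * a }.sum
  val bypass = bpSel.map { sel =>
    require(sel.length == weight.length && sel.contains(true), "select must cover every pair and be asserted")
    val i = sel.indexOf(true) // lowest-index asserted bit wins
    (weight(i), actvtn(i))
  }
  (innerProd, bypass)
}

val noBypass   = ipuModel(Seq(1, 2, 3), Seq(4, 5, 6))
val firmBypass = ipuModel(Seq(1, 2, 3), Seq(4, 5, 6), Some(Seq(false, true, false)))
val scalarCase = ipuModel(Seq(7), Seq(6)) // width 1 reduces to a scalar multiplication
```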

In [7]:
class IPUConfig(val width: Int, val bitWidth: Int, val bpType: String) {
    
    private val bypssError = "Bypass must be \"None\" or \"Firm\""
    private val widthError = "Width must be at least one"
    private val bitWdError = "Data bitwidth must be at least one"
    
    val supportedBp = List("None", "Firm")
    
    require(width >= 1, widthError)
    require(supportedBp.contains(bpType), bypssError)
    require(bitWidth >= 1, bitWdError)
    
    // Child configurations use the classes defined in the cells above
    val childPMultConfig = new PMultConfiguration(width, bitWidth)
    val childARTreeConfig = new AdditiveRTConfiguration(width, bitWidth)
    
    val bpFirm = (bpType == "Firm")
}

class IPUOutput(c: IPUConfig) extends Bundle {
    
    override def cloneType = (new IPUOutput(c)).asInstanceOf[this.type]
    
    val innerProd = Output(SInt(c.bitWidth.W))
    val bpWeight = if (c.bpFirm) Some(SInt(c.bitWidth.W)) else None
    val bpActvtn = if (c.bpFirm) Some(SInt(c.bitWidth.W)) else None
}


class IPU(c: IPUConfig) extends Module {
    
    val cPMConfig = c.childPMultConfig
    val cARTConfig = c.childARTreeConfig
    
    val io = IO(new Bundle {
        // PMultInput takes the number of pairs and the bit width, not a config object
        val dataIn = Input(new PMultInput(c.width, c.bitWidth))
        val dataOut = Output(new IPUOutput(c))
        val bpSel = if (c.bpFirm) Some(Input(Vec(c.width, Bool()))) else None
    })
    
    val pMult = Module(new PMult(cPMConfig))
    pMult.io.in <> io.dataIn
    
    val aRTree = Module(new AdditiveRT(cARTConfig))
    aRTree.io.in := pMult.io.prod
    
    io.dataOut.innerProd := aRTree.io.sum
    
    if (c.bpFirm) {
        io.dataOut.bpWeight.get := PriorityMux(io.bpSel.get, io.dataIn.weight)
        io.dataOut.bpActvtn.get := PriorityMux(io.bpSel.get, io.dataIn.actvtn)
    }
}

defined [32mfunction[39m [36mcheckParamsIPU[39m
defined [32mclass[39m [36mIPUInputs[39m
defined [32mclass[39m [36mIPUOutputs[39m
defined [32mclass[39m [36mIPU[39m

## ALU

The ALU is configurable based on the functions it should support and the data bitwidth. The minimal ALU (Identity only) simply connects the inner-product input directly to the output. Accumulate adds the local register file output to the inner product from the IPU. Add and Max perform their respective operations on the two bypasses from the IPU.

Selecting the operation to perform is done with a one-hot select signal. The order is always Identity, Add, Max, then Accumulate, but the exact indices change depending on which functions are supported.

There are plans to add bypass behavior. Right now, it is supported with an identity operation.
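Because the select indices shift with the configured function list, it can help to compute the one-hot vector rather than hard-code it. The following plain-Scala sketch (not Chisel) does that; the names `aluOrder` and `aluSelect` are hypothetical helpers and are not part of the ALU code.

```scala
// Fixed ordering of ALU functions; unsupported ones are simply skipped.
val aluOrder = List("Identity", "Add", "Max", "Accumulate")

// One-hot select vector for `func`, given the functions the ALU was configured with.
def aluSelect(funcs: List[String], func: String): List[Boolean] = {
  val supported = aluOrder.filter(f => funcs.contains(f))
  require(supported.contains(func), s"$func is not supported by this ALU")
  supported.map(_ == func)
}

// With only Identity and Accumulate configured, Accumulate sits at index 1, not 3.
val accSelSmall = aluSelect(List("Identity", "Accumulate"), "Accumulate")
val accSelFull  = aluSelect(List("Identity", "Add", "Max", "Accumulate"), "Accumulate")
```

The order in which functions are listed in the configuration does not matter in this model; only the fixed Identity, Add, Max, Accumulate ordering determines the index.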

In [8]:
class ALUConfig(val dataWidth: Int, val funcs: List[String]) {
    val identityError = "ALU functions must explicitly include Identity."
    val functionError = "Unsupported Function"
    val supportedFuncs = List("Identity", "Add", "Max", "Accumulate")
    
    require(funcs.contains("Identity"), identityError)
    for(x <- funcs) { require(supportedFuncs.contains(x), functionError) }
    
    val addSupp = funcs.contains("Add")
    val maxSupp = funcs.contains("Max")
    val accSupp = funcs.contains("Accumulate")
    val addBypassIn = addSupp || maxSupp
    val numFuncs = funcs.length
}

class ALUInput(c: ALUConfig) extends Bundle {
    
    override def cloneType = (new ALUInput(c)).asInstanceOf[this.type]
    
    val innerProd = Input(SInt(c.dataWidth.W))
    val funcSel = Input(Vec(c.numFuncs, Bool()))
    
    val weightBp = if(c.addBypassIn) Some(Input(SInt(c.dataWidth.W))) else None
    val actvtnBp = if(c.addBypassIn) Some(Input(SInt(c.dataWidth.W))) else None
    val rfFeedback = if(c.accSupp) Some(Input(SInt(c.dataWidth.W))) else None
}

class ALU(c: ALUConfig) extends Module {
 
    val io = IO(new Bundle {
        val in = new ALUInput(c)
        val out = Output(SInt(c.dataWidth.W))
    })
    
    val idnOut = Some(Wire(SInt(c.dataWidth.W)))
    val addOut = if(c.addSupp) Some(Wire(SInt(c.dataWidth.W))) else None
    val maxOut = if(c.maxSupp) Some(Wire(SInt(c.dataWidth.W))) else None
    val accOut = if(c.accSupp) Some(Wire(SInt(c.dataWidth.W))) else None
    
    idnOut.get := io.in.innerProd
    
    if (c.addSupp) { addOut.get := io.in.weightBp.get + io.in.actvtnBp.get }
    if (c.accSupp) { accOut.get := io.in.innerProd + io.in.rfFeedback.get }
    if (c.maxSupp) {
        when (io.in.weightBp.get > io.in.actvtnBp.get) {
            maxOut.get := io.in.weightBp.get
        } .otherwise {
            maxOut.get := io.in.actvtnBp.get
        }
    }
    
    val inters = (idnOut :: addOut :: maxOut :: accOut :: Nil) filter ( _.isDefined ) map ( _.get )
    io.out := PriorityMux(io.in.funcSel, inters)
}

defined [32mfunction[39m [36mcheckparamsALU[39m
defined [32mclass[39m [36mALUInputs[39m
defined [32mclass[39m [36mALU[39m

## Feedback Register File

The feedback register file is just a pRF of size 1 as implemented earlier. Its bypass behavior is key to implementing architectures such as FlexFlow. A Hard bypass will connect the input from the NoC to the NLU/ALU and the input from the ALU to the NoC.

## Nonlinear Unit

The Nonlinear Unit (NLU) performs the nonlinear activation functions present in neural networks. Currently, it supports the Identity function and ReLu, but there are plans to implement sigmoid and tanh using LUTs and linear interpolation.
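For reference when writing tests, the two supported functions can be modeled in plain Scala (not Chisel) as follows. The name `nluModel` is hypothetical; the function names match the strings used in the NLU configuration.

```scala
// Software model (not Chisel) of the two currently supported NLU functions.
def nluModel(func: String, data: Int): Int = func match {
  case "Identity" => data
  case "ReLu"     => math.max(data, 0) // negative inputs are clamped to zero
  case other      => throw new IllegalArgumentException(s"Unsupported function: $other")
}

val clamped = nluModel("ReLu", -5)
val passed  = nluModel("ReLu", 7)
```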

In [9]:
class NLUConfig(val dataWidth: Int, val funcs: List[String]) {
    
    val supportedFuncs = List("Identity", "ReLu")
    val identityError = "NLU functions must explicitly include Identity."
    val functionError = "Unsupported Function"
    
    require(funcs.contains("Identity"), identityError)
    for(x <- funcs)(require(supportedFuncs.contains(x), functionError))
    
    val reluSupp = funcs.contains("ReLu")
    val numFuncs = funcs.length
}

class NLUInputs(c: NLUConfig) extends Bundle {
    
    override def cloneType = (new NLUInputs(c)).asInstanceOf[this.type]
    
    val data = SInt(c.dataWidth.W)
    val fSel = Vec(c.numFuncs, Bool())
}

class NLU(c: NLUConfig) extends Module {
    
    val io = IO(new Bundle {
        val in  = Input(new NLUInputs(c))
        val out = Output(SInt(c.dataWidth.W))
    })
    
    val idRes   = Some(Wire(SInt(c.dataWidth.W)))
    val reluRes = if(c.reluSupp) Some(Wire(SInt(c.dataWidth.W))) else None
    
    idRes.get := io.in.data
    
    if (c.reluSupp) {
        when (io.in.data > 0.S) {
            reluRes.get := io.in.data
        } .otherwise {
            reluRes.get := 0.S
        }
    }
    
    val inters = (idRes :: reluRes :: Nil) filter ( _.isDefined ) map ( _.get )
    io.out := PriorityMux(io.in.fSel, inters)
}

defined [32mfunction[39m [36mcheckparamsNLU[39m
defined [32mclass[39m [36mNonlinearUnit[39m

## Control

Control is implemented via a State Machine (Moore) and Decoder. The state machine must have its states and state transitions configured. The Decoder must have its input-output map configured.

### State Machine

The state machine takes a transition function as a parameter and constructs the appropriate hardware implementation. The State Machine is configured via the "stateMap" parameter: a function from the current state and current control input to the next state.
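The Moore-machine behavior can be sketched in plain Scala (not Chisel) with a transition table and a step-by-step trace. Everything in this sketch is illustrative: the table `transitions`, the three states (idle, load, compute), and the helper `run` are hypothetical and do not come from the PE code. The only property carried over is that the state register resets to 0.

```scala
// Hypothetical transition table: (current state, control input) -> next state.
val transitions: Map[(Int, Int), Int] = Map(
  (0, 0) -> 0, (0, 1) -> 1, // idle: wait for control = 1
  (1, 0) -> 2, (1, 1) -> 2, // load: always proceed to compute
  (2, 0) -> 0, (2, 1) -> 2  // compute: stay while control = 1
)

// Returns the state visible in each cycle, starting from state 0 (the reset value).
def run(controls: Seq[Int], table: Map[(Int, Int), Int]): Seq[Int] =
  controls.scanLeft(0)((state, ctrl) => table((state, ctrl)))

val trace = run(Seq(0, 1, 0, 1, 0), transitions)
```

Since this is a Moore machine, the output in each cycle depends only on the state in `trace`, not directly on the control input of that cycle.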

In [10]:
class StateMachineConfig(
        val numStates: Int, 
        val numCtrlSigs: Int, 
        val stateMap: (UInt, UInt, StateMachineConfig) => UInt) {
    
    val stateWidth = log2Up(numStates)
    val ctrlWidth = log2Up(numCtrlSigs)
}

class StateMachine(c: StateMachineConfig) extends Module {
    
    val stateWidth: Int = log2Up(c.numStates)
    
    val io = IO(new Bundle {
        val control = Input (UInt(c.ctrlWidth.W ))
        val out     = Output(UInt(c.stateWidth.W))
    })
    
    val register = RegInit(0.U(c.stateWidth.W))
    register := c.stateMap(register, io.control, c)
    io.out := register
}

defined [32mclass[39m [36mStateMachine[39m

### Decoder

The decoder acts as is typical in processors: it converts the state information into all the control signals associated with the appropriate state. It must be configured via its decode function parameters ("decodeWeightPRF", "decodeALU", *etc.*), each of which maps the state to the control signals of one unit.
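Conceptually, a decode function is a lookup from the state to a record of control values. The plain-Scala sketch below (not Chisel) shows that shape with a deliberately reduced, hypothetical set of signals; the `Controls` case class, its fields, and the three states are assumptions made for illustration only.

```scala
// Hypothetical, reduced set of control signals for illustration.
case class Controls(weightWriteEn: Boolean, actvtnWriteEn: Boolean, aluFuncSel: List[Boolean])

// Maps each state to the control values that should be driven in that state.
// aluFuncSel is one-hot over (Identity, Accumulate) in this example.
def decode(state: Int): Controls = state match {
  case 0 => Controls(weightWriteEn = false, actvtnWriteEn = false, aluFuncSel = List(true, false)) // idle
  case 1 => Controls(weightWriteEn = true, actvtnWriteEn = true, aluFuncSel = List(true, false))   // load
  case 2 => Controls(weightWriteEn = false, actvtnWriteEn = false, aluFuncSel = List(false, true)) // accumulate
  case s => throw new IllegalArgumentException(s"Unknown state $s")
}

val loadControls = decode(1)
```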

In [0]:
// Uses the RFConfig and RFControl classes defined in the Register File section
class DecoderConfig(
        val weightPRFConfig: RFConfig,
        val actvtnPRFConfig: RFConfig,
        val intrnlPRFConfig: RFConfig,
        val ipuConfig: IPUConfig,
        val aluConfig: ALUConfig,
        val nluConfig: NLUConfig,
        val smConfig: StateMachineConfig,
        val decodeWeightPRF: (UInt, RFConfig) => Data,
        val decodeActvtnPRF: (UInt, RFConfig) => Data,
        val decodeIntrnlPRF: (UInt, RFConfig) => Data,
        val decodeIPU: (UInt, IPUConfig) => Data,
        val decodeALU: (UInt, ALUConfig) => Data,
        val decodeNLU: (UInt, NLUConfig) => Data)

class MemoryControl(c: DecoderConfig) extends Bundle {
    
    override def cloneType = (new MemoryControl(c)).asInstanceOf[this.type]
    
    val weightPRF = Output(new RFControl(c.weightPRFConfig))
    val actvtnPRF = Output(new RFControl(c.actvtnPRFConfig))
    val intrnlPRF = Output(new RFControl(c.intrnlPRFConfig))
}

class ProcessControl(c: DecoderConfig) extends Bundle {
    
    override def cloneType = (new ProcessControl(c)).asInstanceOf[this.type]
    
    val aluFSel = Output(Vec(c.aluConfig.numFuncs, Bool()))
    val nluFSel = Output(Vec(c.nluConfig.numFuncs, Bool()))
    
    val bpFirm = c.ipuConfig.bpFirm
    
    // One select bit per IPU input pair, matching the width of the IPU's bpSel input
    val ipuBpSel = if (bpFirm) Some(Output(Vec(c.ipuConfig.width, Bool()))) else None
}

class Decoder(c: DecoderConfig) extends Module {
    
    val io = IO(new Bundle {
        val state = Input(UInt(c.smConfig.stateWidth.W))
        val mem = Output(new MemoryControl(c))
        val proc = Output(new ProcessControl(c))
    })
    
    io.mem.weightPRF <> c.decodeWeightPRF(io.state, c.weightPRFConfig)
    io.mem.actvtnPRF <> c.decodeActvtnPRF(io.state, c.actvtnPRFConfig)
    io.mem.intrnlPRF <> c.decodeIntrnlPRF(io.state, c.intrnlPRFConfig)
    
    if (c.ipuConfig.bpFirm) { 
        io.proc.ipuBpSel.get := c.decodeIPU(io.state, c.ipuConfig)
    }
    
    io.proc.aluFSel := c.decodeALU(io.state, c.aluConfig)
    io.proc.nluFSel := c.decodeNLU(io.state, c.nluConfig)
}


## Putting it all Together

In [18]:
class nPE(stateMap: Map[(UInt, UInt), UInt], extrnl_ctrl_width: Int, // State Machine
          decode: (UInt, String) => Data, RFports: Int, weightRFBP: String, actvtnRFBP: String, datawidth: Int, addrwidth: Int,
          aluFuncs: List[String], nluFuncs: List[String], intrnlRFBP: String
         ) extends Module {
    
    val io = IO(new Bundle {
        val extrnl_ctrl   = Input (SInt(extrnl_ctrl_width.W))
        val weightRF_in   = Input (Vec(RFports, SInt(datawidth.W)))
        val actvtnRF_in   = Input (Vec(RFports, SInt(datawidth.W)))
        val intrnlRF_in   = Input (SInt(datawidth.W))
        val weightRF_2NoC = Output(Vec(RFports, SInt(datawidth.W)))
        val actvtnRF_2NoC = Output(Vec(RFports, SInt(datawidth.W)))
        val intrnlRF_2NoC = Output(SInt(datawidth.W))
        val output        = Output(SInt(datawidth.W))
    })
    
    // Submodules must be wrapped in Module(...) to be instantiated as hardware
    val stateMachine = Module(new StateMachine(stateMap, extrnl_ctrl_width))
    stateMachine.io.control := io.extrnl_ctrl
    
    val decoder = Module(new Decoder(decode, log2Up(stateMap.size), 
                              RFports, datawidth, addrwidth, aluFuncs, nluFuncs))
    decoder.io.state := stateMachine.io.state
    
    
    // Weight RF
    val weightRF = Module(new pRF(RFports, weightRFBP, datawidth, addrwidth))
    
    // Mandatory Control
    weightRF.io.write_en    := decoder.io.weightRF_wen
    weightRF.io.read_en     := decoder.io.weightRF_ren
    weightRF.io.waddr       := decoder.io.weightRF_waddr
    weightRF.io.raddr_int   := decoder.io.weightRF_raddr_int
    weightRF.io.raddr_ext   := decoder.io.weightRF_raddr_ext
    
    // Optional Control
    if ( weightRF.io.bp_slct.isDefined ) { weightRF.io.bp_slct.get := decoder.io.weightRF_bp_slct_get }
    
    // Data Inputs
    weightRF.io.wdata := io.weightRF_in
    
    // Optional Outputs
    io.weightRF_2NoC  := weightRF.io.rdata_ext
    
    // Activation RF
    val actvtnRF = Module(new pRF(RFports, actvtnRFBP, datawidth, addrwidth))
    
    // Mandatory Control
    actvtnRF.io.write_en    := decoder.io.actvtnRF_wen
    actvtnRF.io.read_en     := decoder.io.actvtnRF_ren
    actvtnRF.io.waddr       := decoder.io.actvtnRF_waddr
    actvtnRF.io.raddr_int   := decoder.io.actvtnRF_raddr_int
    actvtnRF.io.raddr_ext   := decoder.io.actvtnRF_raddr_ext
    
    // Optional Control
    if ( actvtnRF.io.bp_slct.isDefined ) { actvtnRF.io.bp_slct.get := decoder.io.actvtnRF_bp_slct_get }
    
    // Data Inputs (the activation RF is fed from the activation input, not the weight input)
    actvtnRF.io.wdata := io.actvtnRF_in
    
    // Optional Outputs
    io.actvtnRF_2NoC     := actvtnRF.io.rdata_ext
       
    val ipuBP = if(aluFuncs.contains("Add") || aluFuncs.contains("Max")) "Firm" else "None" 
    val ipu   = Module(new IPU(RFports, ipuBP, datawidth))
    if (ipu.io.sel.isDefined) { ipu.io.sel.get := decoder.io.ipu_sel_get }
    ipu.io.in1 := weightRF.io.rdata_int
    ipu.io.in2 := actvtnRF.io.rdata_int
    
    val alu = Module(new ALU(aluFuncs, datawidth))
    alu.io.func_slct := decoder.io.alu_func_slct
    alu.io.innr_prod := ipu.io.out
    if(alu.io.weight_bp.isDefined) alu.io.weight_bp.get := ipu.io.bp1.get
    if(alu.io.actvtn_bp.isDefined) alu.io.actvtn_bp.get := ipu.io.bp2.get
    
    val intrnlRF = Module(new pRF(1, intrnlRFBP, datawidth, addrwidth))
    intrnlRF.io.write_en  := decoder.io.intrnlRF_write_en
    intrnlRF.io.read_en   := decoder.io.intrnlRF_read_en
    intrnlRF.io.waddr     := decoder.io.intrnlRF_waddr
    intrnlRF.io.raddr_int := decoder.io.intrnlRF_raddr_int
    intrnlRF.io.raddr_ext := decoder.io.intrnlRF_raddr_ext
    if (intrnlRF.io.bp_slct.isDefined) { intrnlRF.io.bp_slct.get := decoder.io.intrnlRF_bp_slct_get }
    intrnlRF.io.wdata := Mux(decoder.io.intrnlRF_wdata_slct, alu.io.output, io.intrnlRF_in)
    io.intrnlRF_2NoC := intrnlRF.io.rdata_ext
    if(alu.io.rf_feedbk.isDefined) alu.io.rf_feedbk.get := intrnlRF.io.rdata_int
    
    val nlu = Module(new NonlinearUnit(nluFuncs, datawidth))
    nlu.io.fslct := decoder.io.nlu_func_slct
    nlu.io.input     := intrnlRF.io.rdata_int
    io.output        := nlu.io.outpt
    
    // Woot woot
}

defined [32mclass[39m [36mnPE[39m

## Future Plans

Basing the nPE on bypasses (i.e. subtractive modifications to a preconfigured topology) is limiting. A constructive approach would be much more general. This should be accomplished through a DSL that makes fuller use of Scala's functional programming capabilities.