# nPE: A Configurable Processing Engine
#### Documentation
---

## Introduction

The nPE is a highly configurable processing engine that can be topologically configured to support various DNN accelerator architectures. It supports inner products with varying amounts of spatial and temporal parallelism, nonlinear activation functions, scalar addition, scalar multiplication, and max functions. It covers many existing architectures and dataflows, such as Eyeriss, FlexFlow, and ShiDianNao. Its generality is based on the notion of soft, firm, and hard bypasses.
- Soft bypasses select at run time, via a control signal and a multiplexer, whether the module is bypassed.
- Firm bypasses add a bypass line as an output.
- Hard bypasses hardwire the input and output together.

The details of bypass behavior change depending on the specific module.
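As a rough illustration of the three bypass kinds, the following plain-Scala sketch (not Chisel) models a single stage. The names `Bypass`, `StageOut`, and `evaluate` are hypothetical and do not appear in the nPE source. `stage` stands in for whatever the module normally computes.

```scala
// Behavioural sketch of the bypass kinds; all names here are illustrative assumptions.
sealed trait Bypass
case object NoBypass extends Bypass
case object Soft extends Bypass
case object Firm extends Bypass
case object Hard extends Bypass

// One evaluation: the main output plus an optional extra bypass line.
final case class StageOut(out: Int, bypassLine: Option[Int])

def evaluate(kind: Bypass, in: Int, select: Boolean)(stage: Int => Int): StageOut =
  kind match {
    case NoBypass => StageOut(stage(in), None)                       // always compute
    case Soft     => StageOut(if (select) in else stage(in), None)   // mux chosen by a control signal
    case Firm     => StageOut(stage(in), Some(in))                   // input also exposed as an extra output
    case Hard     => StageOut(in, None)                              // input wired straight to output
  }

// Example: a stage that doubles its input.
val softBypassed = evaluate(Soft, 3, select = true)(_ * 2)
val firmResult   = evaluate(Firm, 3, select = false)(_ * 2)
```

In this model only the Soft case looks at `select`; the other kinds are fixed when the hardware is generated.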

Its microarchitecture consists of two multiport register files, an n-dimensional inner product unit, an RF-with-ALU combo, and a nonlinear processing unit. All three register files are accessible from an external NoC, and all arithmetic modules support an "Identity" operation.

## Setup

In [10]:
val path = System.getProperty("user.dir") + "/source/load-ivy.sc"
interp.load.module(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(path)))

Compiling Main.sc


path: String = """
C:\Users\RyanL\OneDrive\Research\SEAL\nPE/source/load-ivy.sc
"""

In [11]:
import chisel3._
import chisel3.util._
import chisel3.iotesters.{ChiselFlatSpec, Driver, PeekPokeTester}

import scala.math.pow

## Parallel Register File

The Parallel Register File (pRF) is a register file that is configurable in its number of input/output ports, register count, bypass type, and bitwidth. Bypass options are None, Soft, or Hard. A bypass forwards in[k] to out[k], where k is the port index. Without loss of generality, the bypass option is applied to all ports. To mix bypass options, simply use several pRFs in parallel.

### Single Register File

The Single Register File acts as expected: it writes data in when write enable is high and reads data out when read enable is high. It can read and write in parallel. It has two read ports: one for internal PE reading and one for external PE communication. The number of registers is set by the address bitwidth: there are 2^addrwidth registers.

In [12]:
class RF (datawidth: Int, addrwidth: Int) extends Module {
  
    val io = IO(new Bundle {
        val write_en  = Input (Bool())
        val read_en   = Input (Bool())
        val waddr     = Input (UInt(addrwidth.W))
        val wdata     = Input (SInt(datawidth.W))
        val raddr_int = Input (UInt(addrwidth.W))
        val raddr_ext = Input (UInt(addrwidth.W))
        val rdata_int = Output(SInt(datawidth.W))
        val rdata_ext = Output(SInt(datawidth.W))
    })
    
    // 2^addrwidth registers, each datawidth bits wide and reset to zero
    val registers  = RegInit(VecInit(Seq.fill(pow(2, addrwidth).toInt) { 0.S(datawidth.W) }))
    
    when(io.write_en) {
        registers(io.waddr) := io.wdata
    }
    
    when(io.read_en) {
        io.rdata_int := registers(io.raddr_int)
        io.rdata_ext := registers(io.raddr_ext)
    } .otherwise {
        io.rdata_int := 0.S
        io.rdata_ext := 0.S
    }
}

defined class RF
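A software reference model can help when checking the RF in simulation. The sketch below is plain Scala rather than Chisel. `RFModel` and `step` are hypothetical names, and data values are modelled as unbounded `Int`s rather than `datawidth`-bit values. It assumes that reads see the register contents from before the current cycle's write, which matches reading from registers combinationally.

```scala
// Plain-Scala reference model of the single RF (illustrative, not part of the design).
final class RFModel(addrwidth: Int) {
  private val regs = Array.fill(1 << addrwidth)(0) // 2^addrwidth registers, reset to zero

  // Models one clock cycle. Returns (rdata_int, rdata_ext) as seen during the cycle,
  // then applies the write, which becomes visible from the next cycle on.
  def step(writeEn: Boolean, readEn: Boolean, waddr: Int, wdata: Int,
           raddrInt: Int, raddrExt: Int): (Int, Int) = {
    val out = if (readEn) (regs(raddrInt), regs(raddrExt)) else (0, 0)
    if (writeEn) regs(waddr) = wdata
    out
  }
}

val rfModel = new RFModel(addrwidth = 2)
// Write 42 to address 1 while reading it: the old value (0) is returned.
val sameCycle = rfModel.step(writeEn = true, readEn = true, waddr = 1, wdata = 42, raddrInt = 1, raddrExt = 0)
// One cycle later the written value is visible on the internal read port.
val nextCycle = rfModel.step(writeEn = false, readEn = true, waddr = 0, wdata = 0, raddrInt = 1, raddrExt = 0)
```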

### Putting them Together

The single RFs come together in parallel to form the pRF. The number of ports is the number of single RFs. Thus, there are all of the control signals associated with a single RF, but with a bus width equal to the port count. If the bypass type is Soft, then each RF has an independent bypass control signal.

In [13]:
class pRF(ports: Int, bypass: String, datawidth: Int, addrwidth: Int) extends Module {
    
    require(List("None", "Soft", "Hard").contains(bypass))
    
    val io = IO(new Bundle {
        val write_en  = Input (Vec(ports, Bool()))
        val read_en   = Input (Vec(ports, Bool()))
        val waddr     = Input (Vec(ports, UInt(addrwidth.W)))
        val wdata     = Input (Vec(ports, SInt(datawidth.W)))
        val raddr_int = Input (Vec(ports, UInt(addrwidth.W)))
        val raddr_ext = Input (Vec(ports, UInt(addrwidth.W)))
        val rdata_int = Output(Vec(ports, SInt(datawidth.W)))
        val rdata_ext = Output(Vec(ports, SInt(datawidth.W)))
        val bp_slct   = if (bypass == "Soft") Some(Input(Vec(ports, Bool()))) else None
    })
    
    if(bypass == "None" || bypass == "Soft") {
        
        // Hardware submodules must be wrapped in Module(...)
        val rf = Seq.fill(ports){ Module(new RF(datawidth, addrwidth)) }
        
        rf.zipWithIndex.foreach{ case (x: RF, i: Int) => {
            
            x.io.write_en  := io.write_en(i)
            x.io.read_en   := io.read_en(i)
            x.io.waddr     := io.waddr(i)
            x.io.wdata     := io.wdata(i)
            x.io.raddr_int := io.raddr_int(i)
            x.io.raddr_ext := io.raddr_ext(i)
            
            // bp_slct only exists for Soft; with None the RF output is always used
            val bypassed = io.bp_slct.map(_(i)).getOrElse(false.B)
            when (bypassed) {
                // Bypass selected: forward in[i] straight to out[i]
                io.rdata_int(i) := io.wdata(i)
                io.rdata_ext(i) := io.wdata(i)
            } .otherwise {
                io.rdata_int(i) := x.io.rdata_int
                io.rdata_ext(i) := x.io.rdata_ext
            }
        }}
        
    } else if(bypass == "Hard") {
        // Hard bypass: inputs are wired directly to outputs, no RF is instantiated
        io.rdata_int := io.wdata
        io.rdata_ext := io.wdata
    }
}

defined class pRF

## Inner Product Unit

The Inner Product Unit (IPU) performs an inner product between two vectors of a configurable length. For SIMD support, this should be configured to be wide. For traditional PEs, such as Eyeriss, note that setting the width to 1 is equivalent to a scalar multiplication.

Bypass options are None or Firm. With Firm, only one pair of weights and activations may be bypassed at a time, so that the IPU interfaces with only one ALU. Thus, if the bypass type is Firm, a bypass selection signal chooses which input pair is fed to the bypass outputs. This could be generalized to interface with multiple TUs, but that is not supported at this time.

There are plans to implement both Soft and Hard bypassing. There are also plans to make using FMA modules an option.

The IPU contains two embedded modules: a parallel multiplier and a reduction tree.

### Parallel Multiplier

The parallel multiplier simply takes two vector inputs of width n and outputs the element-wise multiplication. There are plans to make the multiplier type configurable, *e.g.* serial, combinational, pipelined *etc*.

In [14]:
class pMultiplier(width: Int, bitwidth: Int) extends Module {
    
    require(width >= 1, "Width must be at least one.")
    require(bitwidth >= 1, "Bitwidth must be at least one.")
    
    val io = IO(new Bundle {
        val in1 = Input (Vec(width, SInt(bitwidth.W)))
        val in2 = Input (Vec(width, SInt(bitwidth.W)))
        val out = Output(Vec(width, SInt(bitwidth.W)))
    })
    
    // Element-wise product; each result is connected to a bitwidth-wide output port
    io.out.zip(io.in1 zip io.in2).foreach { case (o, (a, b)) => o := a * b }
}

defined class pMultiplier

### Additive Reduction Tree

The additive reduction tree outputs the sum of the elements of the n-dim input vector using ceil(log2(n)) layers of 2-input adders. Note that this is a plain signed sum, not the L1 norm, since no absolute values are taken.

There are plans to pipeline this (as well as everything else...).

In [15]:
// Recursively creates a balanced syntax tree
def nonassocPairwiseReduce[A](xs: List[A], op: (A, A) => A): A = {
  xs match {
    case Nil => throw new IllegalArgumentException
    case List(singleElem) => singleElem
    case sthElse => {
      val grouped = sthElse.grouped(2).toList
      val pairwiseOpd = for (g <- grouped) yield {
        g match {
          case List(a, b) => op(a, b)
          case List(x) => x
        }
      }
      nonassocPairwiseReduce(pairwiseOpd, op)
    }
  }
}


class AdditiveRT(width: Int, bitwidth: Int) extends Module {

    require(width >= 1, "Width must be at least one.")
    require(bitwidth >= 1, "Bitwidth must be at least one.")
    
    val io = IO(new Bundle {
        val in  = Input (Vec(width, SInt(bitwidth.W)))
        val out = Output(SInt(bitwidth.W))
    })
    
    io.out := nonassocPairwiseReduce(io.in.toList, (x: SInt, y: SInt) => x + y)
}

defined function nonassocPairwiseReduce
defined class AdditiveRT
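To see the shape of the tree that the pairwise reduction builds, the same scheme can be run on strings instead of hardware signals. The sketch below restates the reduction in plain Scala so that it stands alone. `pairwiseReduce` and `adderLayers` are illustrative names, not part of the nPE source.

```scala
// Pairwise reduction: combine neighbours layer by layer until one element remains.
def pairwiseReduce[A](xs: List[A])(op: (A, A) => A): A = xs match {
  case Nil           => throw new IllegalArgumentException("empty input")
  case single :: Nil => single
  case _ =>
    val next = xs.grouped(2).map {
      case List(a, b) => op(a, b)
      case rest       => rest.head // an odd element passes through to the next layer
    }.toList
    pairwiseReduce(next)(op)
}

// Number of adder layers needed for n inputs, i.e. ceil(log2(n)).
def adderLayers(n: Int): Int = if (n <= 1) 0 else 1 + adderLayers((n + 1) / 2)

// Using string concatenation as the operator exposes the tree structure.
val shape = pairwiseReduce(List("a", "b", "c", "d", "e"))((x, y) => s"($x+$y)")
// shape == "(((a+b)+(c+d))+e)"
val total = pairwiseReduce(List(1, 2, 3, 4, 5))(_ + _)
```

For five inputs the odd element is carried through two layers and only added in the third, which is why the depth is `adderLayers(5) == 3`.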

### Putting them Together

Together, the parallel multiplier and additive reduction tree perform an n-dim inner product. They are simply connected output to input. During use of the PE, the bypass type of the IPU is dictated by the functionality of the ALU.

In [7]:
def checkparamsIPU(width: Int, bypass: String, bitwidth: Int) {
    require(width >= 1, "Width must be at least one.")
    require(List("None", "Firm").contains(bypass), "Bypass must be \"None\" or \"Firm\"")
    require(bitwidth >= 0, "Data bitwidth must be non-negative")
}


class IPU(width: Int, bypass: String, bitwidth: Int) extends Module {
    
    checkparamsIPU(width, bypass, bitwidth)
    
    val io = IO(new Bundle {
        val in1 = Input(Vec(width, SInt(bitwidth.W)))
        val in2 = Input(Vec(width, SInt(bitwidth.W)))
        // Outputs are signed to match the SInt inputs and the ALU ports
        val out = Output(SInt(bitwidth.W))
        val sel = if(bypass == "Firm") Some(Input(Vec(width, Bool()))) else None
        val bp1 = if(bypass == "Firm") Some(Output(SInt(bitwidth.W)))  else None
        val bp2 = if(bypass == "Firm") Some(Output(SInt(bitwidth.W)))  else None
    })
    
    // Hardware submodules must be wrapped in Module(...)
    val pM = Module(new pMultiplier(width, bitwidth))
    pM.io.in1 := io.in1
    pM.io.in2 := io.in2
    
    val aRT = Module(new AdditiveRT(width, bitwidth))
    aRT.io.in := pM.io.out
    
    io.out := aRT.io.out
    
    if (bypass == "Firm") {
        io.bp1.get := PriorityMux(io.sel.get, io.in1)
        io.bp2.get := PriorityMux(io.sel.get, io.in2)
    }
}

defined function checkparamsIPU
defined class IPU
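Because the IPU ports above all share one bitwidth, large operands can overflow. The following plain-Scala sketch gives a full-precision reference for the inner product, a sufficient output width, and a wrap-around helper. The names `innerProduct`, `fullPrecisionWidth`, and `wrapSigned` are hypothetical, and the wrap-around behaviour is an assumption about how narrowing to the port width acts, not something the source states.

```scala
// Reference inner product in unbounded precision.
def innerProduct(w: Seq[BigInt], a: Seq[BigInt]): BigInt = {
  require(w.length == a.length, "vectors must have equal length")
  w.zip(a).map { case (x, y) => x * y }.sum
}

// A width that is sufficient for a signed inner product of n operands of b bits each:
// 2*b bits per product plus ceil(log2(n)) bits of growth for the sum.
def fullPrecisionWidth(n: Int, b: Int): Int = {
  require(n >= 1 && b >= 1)
  val growth = 32 - Integer.numberOfLeadingZeros(n - 1) // ceil(log2(n)), 0 for n == 1
  2 * b + growth
}

// Two's-complement wrap of v to the given number of bits (assumed truncation model).
def wrapSigned(v: BigInt, bits: Int): BigInt = {
  val modulus = BigInt(1) << bits
  val r = v.mod(modulus) // always non-negative
  if (r >= (modulus >> 1)) r - modulus else r
}

val exact   = innerProduct(Seq[BigInt](1, 2, 3), Seq[BigInt](4, 5, 6)) // 1*4 + 2*5 + 3*6
val wrapped = wrapSigned(BigInt(100) * 100, 8)                        // 10000 does not fit in 8 bits
```

Comparing `exact` against the wrapped value is one way to decide whether a given `bitwidth` is wide enough for the intended data range.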

## ALU

The ALU is configurable based on the functions it should support and the data bitwidth. The minimum ALU simply connects the input and the output directly. Accumulate adds the local register file output to the inner product from the IPU. Add and Max perform their respective operations on the two bypass outputs of the IPU.

The operation to perform is selected with a one-hot select signal. The order is always Identity, Add, Max, then Accumulate, but the exact indices change depending on which functions are supported.

There are plans to add bypass behavior. Right now, it is supported with an identity operation.
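Since the select indices shift with the set of supported functions, a small helper can compute them. This is a plain-Scala sketch. `canonicalOrder`, `selectIndex`, and `oneHot` are hypothetical names, and the sketch assumes `funcs` contains no duplicates.

```scala
// Fixed ordering of ALU functions as described in the text.
val canonicalOrder = List("Identity", "Add", "Max", "Accumulate")

// Position of each supported function in the one-hot select vector.
def selectIndex(funcs: List[String]): Map[String, Int] =
  canonicalOrder.filter(f => funcs.contains(f)).zipWithIndex.toMap

// One-hot select vector that picks function f.
def oneHot(funcs: List[String], f: String): List[Boolean] = {
  val indices = selectIndex(funcs)
  require(indices.contains(f), s"$f is not supported by this ALU")
  List.tabulate(indices.size)(_ == indices(f))
}

// With only Identity and Accumulate supported, Accumulate moves from index 3 to index 1.
val accSelect = oneHot(List("Accumulate", "Identity"), "Accumulate")
```

The order in which `funcs` is written does not matter here; only the canonical order determines the index.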

In [16]:
def checkparamsALU(funcs: List[String], datawidth: Int) {
    require(funcs.contains("Identity"), "ALU functions must explicitly include Identity.")
    val supportedFuncs = List("Identity", "Add", "Max", "Accumulate")
    for(x <- funcs)(require(supportedFuncs.contains(x), "Unsupported Function"))
}

class ALU(funcs: List[String], datawidth: Int) extends Module {
    
    checkparamsALU(funcs, datawidth)
 
    val io = IO(new Bundle {
        val innr_prod = Input(SInt(datawidth.W))
        val func_slct = Input(Vec(funcs.length, Bool()))
        val output    = Output(SInt(datawidth.W))
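        // The bypass inputs exist only if Add or Max is among the supported functions
        val weight_bp = if(funcs.contains("Add") || funcs.contains("Max")) Some(Input(SInt(datawidth.W))) else None
        val actvtn_bp = if(funcs.contains("Add") || funcs.contains("Max")) Some(Input(SInt(datawidth.W))) else None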
        val rf_feedbk = if(funcs.contains("Accumulate"))       Some(Input(SInt(datawidth.W))) else None
    })
    
    val idnOut = Some(Wire(SInt(datawidth.W)))
    val addOut = if(funcs.contains("Add"))        Some(Wire(SInt(datawidth.W))) else None
    val maxOut = if(funcs.contains("Max"))        Some(Wire(SInt(datawidth.W))) else None
    val accOut = if(funcs.contains("Accumulate")) Some(Wire(SInt(datawidth.W))) else None
    
    idnOut.get := io.innr_prod
    
    if (funcs.contains("Add")       ) { addOut.get := io.weight_bp.get + io.actvtn_bp.get }
    if (funcs.contains("Accumulate")) { accOut.get := io.innr_prod + io.rf_feedbk.get }
    if (funcs.contains("Max")       ) {
        // Compare the weight bypass against the activation bypass
        when (io.weight_bp.get > io.actvtn_bp.get) {
            maxOut.get := io.weight_bp.get
        } .otherwise {
            maxOut.get := io.actvtn_bp.get
        }
    }
    
    val inters = (idnOut:: addOut :: maxOut :: accOut :: Nil) filter ( _.isDefined ) map ( _.get )
    io.output := PriorityMux(io.func_slct, inters)
}

defined function checkparamsALU
defined class ALU

## Feedback Register File

The feedback register file is just a pRF with a single port, as implemented earlier. Its bypass behavior is key to implementing architectures such as FlexFlow. A Hard bypass will connect the input from the NoC to the NLU/ALU and the input from the ALU to the NoC.

## Nonlinear Unit

The Nonlinear Unit (NLU) performs the nonlinear activation functions present in neural networks. As of right now, it supports the Identity function and ReLU, but there are plans to implement sigmoid and tanh using LUTs and linear interpolation.

In [17]:
def checkparamsNLU(funcs: List[String], datawidth: Int) {
    require(funcs.contains("Identity"), "NLU functions must explicitly include Identity.")
    val supportedFuncs = List("Identity", "ReLu")
    for(x <- funcs)(require(supportedFuncs.contains(x), "Unsupported Function"))
}

class NonlinearUnit(funcs: List[String], datawidth: Int) extends Module {
    
    checkparamsNLU(funcs, datawidth)
    
    val io = IO(new Bundle {
        val input = Input(SInt(datawidth.W))
        val fslct = Input(Vec(funcs.length, Bool()))
        val outpt = Output(SInt(datawidth.W))
    })
    
    val idntOut = Some(Wire(SInt(datawidth.W)))
    val reluOut = if(funcs.contains("ReLu")) Some(Wire(SInt(datawidth.W))) else None
    
    idntOut.get := io.input
    if (funcs.contains("ReLu")) {
        when (io.input > 0.S) {
            reluOut.get := io.input
        } .otherwise {
            reluOut.get := 0.S
        }
    }
    
    val inters = (idntOut :: reluOut :: Nil) filter ( _.isDefined ) map ( _.get )
    io.outpt := PriorityMux(io.fslct, inters)
}

defined function checkparamsNLU
defined class NonlinearUnit
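The planned LUT-plus-interpolation approach can be prototyped in software before committing to a hardware layout. The sketch below is plain Scala using `Double`s, so it ignores fixed-point quantization. `relu`, `SigmoidLut`, and the table parameters are illustrative assumptions and do not describe the eventual NLU implementation.

```scala
// ReLU as the NLU applies it: pass positive inputs, clamp everything else to zero.
def relu(x: Int): Int = if (x > 0) x else 0

// Hypothetical LUT-based sigmoid: `entries` evenly spaced samples on [-range, range],
// with linear interpolation between neighbouring samples and clamping outside the range.
final class SigmoidLut(entries: Int, range: Double) {
  require(entries >= 2 && range > 0)
  private val step  = 2 * range / (entries - 1)
  private val table = Array.tabulate(entries)(i => 1.0 / (1.0 + math.exp(-(-range + i * step))))

  def apply(x: Double): Double = {
    val clamped = math.max(-range, math.min(range, x))
    val pos     = (clamped + range) / step          // fractional table position
    val lo      = math.min(pos.toInt, entries - 2)  // index of the lower sample
    val frac    = pos - lo
    table(lo) + frac * (table(lo + 1) - table(lo))  // linear interpolation
  }
}

val sigmoidLut = new SigmoidLut(entries = 17, range = 8.0)
val atZero     = sigmoidLut(0.0) // falls exactly on a table sample
val atHalf     = sigmoidLut(0.5) // interpolated between the samples at 0 and 1
```

The table size and input range trade accuracy against storage, so they would likely become generator parameters alongside `datawidth`.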

## Control Unit

For now, it won't be programmable post-synthesis, but the idea will be to make it import a state machine and execute that. Implementation to come.

### Moore Machine

The Control unit will take a Moore Machine as a parameter in order to support various dataflows.

In [None]:
class ControlUnit(mooreMachine: Map[Int, Int], decoder: Map[Int, ]
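Since the control unit is not implemented yet, the following plain-Scala sketch only illustrates the idea of passing a Moore machine as a pair of maps. Everything in it is an assumption: the `Control` fields, the `MooreMachine` class, and the simplification that transitions depend on the current state alone (no external inputs).

```scala
// Hypothetical control word; the fields are illustrative only.
final case class Control(writeEn: Boolean, readEn: Boolean, aluSelect: Int)

// Input-free Moore machine: `next` gives the successor of each state and
// `decode` gives the outputs, which depend on the current state only.
final class MooreMachine(next: Map[Int, Int], decode: Map[Int, Control], start: Int) {
  require(next.keySet == decode.keySet, "every state needs a successor and an output")
  require(next.contains(start) && next.values.forall(next.contains), "transitions must stay within known states")

  // Control words emitted over the first `cycles` clock cycles.
  def run(cycles: Int): List[Control] =
    Iterator.iterate(start)(next).take(cycles).map(decode).toList
}

// Example schedule: load (0) -> compute (1) -> store (2) -> load (0) -> ...
val load    = Control(writeEn = true,  readEn = false, aluSelect = 0)
val compute = Control(writeEn = false, readEn = true,  aluSelect = 1)
val store   = Control(writeEn = true,  readEn = true,  aluSelect = 0)

val machine = new MooreMachine(
  next   = Map(0 -> 1, 1 -> 2, 2 -> 0),
  decode = Map(0 -> load, 1 -> compute, 2 -> store),
  start  = 0
)
val trace = machine.run(4)
```

Different dataflows would then correspond to different `next` and `decode` tables, while the surrounding hardware stays unchanged.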

## Putting it all Together

In [None]:
class nPE(ports: Int, weightRFBP: String, actvtnRFBP: String, datawidth: Int, addrwidth: Int,
         aluFuncs: List[String], nluFuncs: List[String], intrnlRFBP: String ) extends Module {
    
    val io = IO(new Bundle {
        val weightRF_in   = Input (Vec(ports, SInt(datawidth.W)))
        val actvtnRF_in   = Input (Vec(ports, SInt(datawidth.W)))
        val intrnlRF_in   = Input (SInt(datawidth.W))
        val weightRF_2NoC = Output(Vec(ports, SInt(datawidth.W)))
        val actvtnRF_2NoC = Output(Vec(ports, SInt(datawidth.W)))
        val intrnlRF_2NoC = Output(SInt(datawidth.W))
        val output        = Output(SInt(datawidth.W))
    })
    
    // val control = new Control(Param_State_Machine)
    
    // Module
    // mandatory control
    // optional control
    // mandatory inputs
    // optional inputs
    // pe outputs
    
    val weightRF = Module(new pRF(ports, weightRFBP, datawidth, addrwidth))
    // weightRF.io.write_en    := from Control
    // weightRF.io.read_en     := from Control
    // weightRF.io.waddr       := from Control
    // weightRF.io.raddr_int   := from Control
    // weightRF.io.raddr_ext   := from Control
    // if ( weightRF.io.bp_slct.isDefined ) { weightRF.io.bp_slct.get := from Control }
    weightRF.io.wdata := io.weightRF_in
    io.weightRF_2NoC  := weightRF.io.rdata_ext
    
    val actvtnRF = Module(new pRF(ports, actvtnRFBP, datawidth, addrwidth))
    // actvtnRF.io.write_en    := from Control
    // actvtnRF.io.read_en     := from Control
    // actvtnRF.io.waddr       := from Control
    // actvtnRF.io.raddr_int   := from Control
    // actvtnRF.io.raddr_ext   := from Control
    // if ( actvtnRF.io.bp_slct.isDefined ) { actvtnRF.io.bp_slct.get := from Control }
    actvtnRF.io.wdata := io.actvtnRF_in
    io.actvtnRF_2NoC  := actvtnRF.io.rdata_ext
       
    // The IPU needs a Firm bypass only if the ALU consumes the bypassed operands
    val ipuBP = if(aluFuncs.contains("Add") || aluFuncs.contains("Max")) "Firm" else "None" 
    val ipu   = Module(new IPU(ports, ipuBP, datawidth))
    // if (ipu.io.sel.isDefined) { ipu.io.sel.get := from Control }
    ipu.io.in1 := weightRF.io.rdata_int
    ipu.io.in2 := actvtnRF.io.rdata_int
    
    val alu = Module(new ALU(aluFuncs, datawidth))
    // alu.io.func_slct := from Control
    alu.io.innr_prod := ipu.io.out
    if(alu.io.weight_bp.isDefined) alu.io.weight_bp.get := ipu.io.bp1.get
    if(alu.io.actvtn_bp.isDefined) alu.io.actvtn_bp.get := ipu.io.bp2.get
    
    val intrnlRF = Module(new pRF(1, intrnlRFBP, datawidth, addrwidth))
    // intrnlRF.io.write_en  := from Control
    // intrnlRF.io.read_en   := from Control
    // intrnlRF.io.waddr     := from Control
    // intrnlRF.io.raddr_int := from Control
    // intrnlRF.io.raddr_ext := from Control
    // if (intrnlRF.io.bp_slct.isDefined) { intrnlRF.io.bp_slct.get := from Control }
    // intrnlRF.io.wdata(0) := Mux(from Control, alu.io.output, io.intrnlRF_in)
    // The internal pRF has a single port, so index 0 selects it
    io.intrnlRF_2NoC := intrnlRF.io.rdata_ext(0)
    if(alu.io.rf_feedbk.isDefined) alu.io.rf_feedbk.get := intrnlRF.io.rdata_int(0)
    
    val nlu = Module(new NonlinearUnit(nluFuncs, datawidth))
    // nlu.io.fslct := from Control
    nlu.io.input     := intrnlRF.io.rdata_int(0)
    io.output        := nlu.io.outpt
    
    // Woot woot
}