Release v2.0.
gaomy3832 committed Mar 1, 2019
2 parents 5ac1bdc + fd4ce1a commit 5ba0fb1
Showing 49 changed files with 8,498 additions and 138 deletions.
66 changes: 65 additions & 1 deletion CHANGELOG.md
@@ -1,7 +1,71 @@
List of major changes and improvements
======================================

## [Unreleased]
## [v1.6 -- v2.0] -- 2019-03-01

### Added

- Hardware models.

- Access forwarding.

- Buffer sharing scheme.
- Use `BufShrScheme` class to represent and calculate NoC transfers.

- Software models.

- Add `SchedulingConstraint` class to specify loop blocking and partitioning
constraints.
- Add lazily updated rules that allow refining constraints with previous
scheduling results at runtime (see the constraint sketch after this list).
- Add subclass `SchedulingConstraintLayerPipeline` for layer pipelining
constraints.

- Add `InterLayerPipeline`.
- Layers are organized into `PipelineSegment` instances, which are
simultaneously mapped onto the resources both spatially and temporally.
- Each layer in the segment has a 3-tuple scheduling index consisting of the
segment index, spatial index, and temporal index (see the index sketch after
this list).
- Each layer in the segment has its resource allocation and scheduling
constraint.
- Use `PipelineSegmentTiming` to capture the timing relation of layers in
the segment.
- Specify maximum allowed execution time overhead due to layer pipelining
in `Option`.
- Specify maximum pipelining degree for layer pipelining in `Option`.

- Add layer pipelining optimizations.
- Ofmap forwarding: alternate layer loop ordering.
- Ifmap forwarding: sharing the same inputs fetched from memory among
multiple regions.
- Support model weight pinning when there is no resource time-multiplexing.
- Allow disabling optimizations for layer pipelining to fall back to basic
pipelining techniques.
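
A minimal sketch of the lazily updated rule idea (illustrative only; the
class, field, and method names below are assumptions, not the actual
`SchedulingConstraint` API):

```python
# Illustrative sketch of a lazily updated scheduling constraint; the real
# SchedulingConstraint class may use different names and fields.
class LazyConstraint(object):
    def __init__(self, topbat=0, update_rule=None):
        self.topbat = topbat          # required top-level batch blocking factor (0 = free)
        self._update_rule = update_rule

    def update_by_prev(self, prev_results):
        """Lazily refine the constraint from previously scheduled layers."""
        if self._update_rule is not None:
            self._update_rule(self, prev_results)

    def is_valid_topbat(self, topbat):
        return self.topbat == 0 or topbat == self.topbat

# Example rule: later layers in a segment reuse the first layer's top-level
# batch blocking factor.
def match_first_topbat(cstr, prev_results):
    if prev_results:
        cstr.topbat = prev_results[0].topbat

cstr = LazyConstraint(update_rule=match_first_topbat)
```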

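The per-layer 3-tuple scheduling index can be pictured as follows (an
illustrative sketch, not the actual `PipelineSegment` data structure; the
segment contents and layer names are made up):

```python
from collections import namedtuple

# Illustrative 3-tuple scheduling index: which segment the layer belongs to,
# which spatial subset of the segment it maps to, and its temporal order
# within that spatial subset.
SchedIndex = namedtuple('SchedIndex', ['seg_idx', 'sp_idx', 'tm_idx'])

seg_idx = 3  # example position of this segment in the overall schedule
# A segment is a tuple of spatial subsets; layers in one spatial subset share
# the same resource allocation over time.
segment = (('conv3a',), ('conv3b', 'pool3'))

sched_indices = {}
for sp_idx, spatial_subset in enumerate(segment):
    for tm_idx, layer in enumerate(spatial_subset):
        sched_indices[layer] = SchedIndex(seg_idx, sp_idx, tm_idx)

# conv3a -> (3, 0, 0); conv3b -> (3, 1, 0); pool3 -> (3, 1, 1)
```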

### Changed

- Hardware models.

- Allow data source/destination regions in `Resource` to be non-DATA type.

- Allow `NodeRegion` to be folded along the w dimension in a zig-zag manner.

- Software models.

- `LoopBlockingScheme` supports access forwarding and buffer sharing.

- `LoopBlockingScheme` supports remote node buffers as data regions (non-data
type data regions).

- The unit number-of-hops calculation in `partition` supports access
forwarding and buffer sharing.

- `DataLayout` supports closest-first forwarding data transfers for access
forwarding and buffer sharing (see the sketch below).

- Refactor `NNDataflow` and `NNDataflowScheme` to incorporate inter-layer
pipelining.
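
A rough illustration of the closest-first idea (a simplified sketch that
assumes Manhattan hop distance on a 2D mesh; not the actual `DataLayout`
logic):

```python
# Simplified closest-first forwarding: fetch data from the nearest node that
# already holds a copy, measured in Manhattan hops on a 2D mesh.
def hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def closest_source(dst, sources):
    """Pick the closest holder of the data as the forwarding source for dst."""
    return min(sources, key=lambda src: hops(dst, src))

# Nodes (1, 0) and (3, 2) hold copies; node (2, 2) fetches from (3, 2), which
# is 1 hop away, rather than from (1, 0), which is 3 hops away.
assert closest_source((2, 2), [(1, 0), (3, 2)]) == (3, 2)
```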


## [v1.5 -- v1.6] -- 2018-01-31
48 changes: 39 additions & 9 deletions README.rst
@@ -9,7 +9,7 @@ Neural Network Dataflow Scheduling

This Python tool allows you to explore the energy-efficient dataflow scheduling
for neural networks (NNs), including array mapping, loop blocking and
reordering, and parallel partitioning.
reordering, and (coarse-grained) parallel processing within and across layers.

For hardware, we assume an Eyeriss-style NN accelerator [Chen16]_, i.e., a 2D
array of processing elements (PEs) with a local register file in each PE, and a
@@ -26,18 +26,27 @@ In software, we decouple the dataflow scheduling into three subproblems:
convolutions by blocking and reordering the nested loops. We support
exhaustive search over all blocking and reordering schemes [Yang16]_, and
analytical bypass solvers [Gao17]_.
- Partitioning, which partitions the NN computations for parallel processing.
We support batch partitioning, fmap partitioning, output partitioning, input
partitioning, and the combination between them (hybrid) [Gao17]_. We use
layer-wise greedy beam search.

See the details in our ASPLOS'17 paper [Gao17]_.
- Parallel processing, which partitions the NN computations across multiple
tiled engines. We support both intra-layer and inter-layer parallelism. For
intra-layer parallelism, we support batch partitioning, fmap partitioning,
output partitioning, input partitioning, and their combinations (hybrid)
[Gao17]_. We also explore various dataflow optimizations including access
forwarding and buffer sharing [Gao19]_. We use exhaustive search within each
layer. For inter-layer parallelism, we support spatial pipelining (inter-layer
pipelining) and temporal pipelining (time multiplexing without writing back
intermediate data), as well as their optimized scheduling [Gao19]_. We use
layer-wise greedy beam search across layers (sketched below).

See the details in our ASPLOS'17 [Gao17]_ and ASPLOS'19 [Gao19]_ papers.

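For intuition, the layer-wise greedy beam search across layers can be pictured
as the following simplified sketch (illustrative only; the actual search in
``nn_dataflow`` differs in how per-layer candidates are generated and costed)::

    def beam_search(layers, schedule_layer, beam_width=4):
        """Keep only the cheapest few partial schedules while adding layers."""
        beam = [(0.0, [])]  # each candidate: (total cost, per-layer schedules)
        for layer in layers:
            expanded = []
            for cost, partial in beam:
                # schedule_layer yields (incremental cost, layer schedule)
                # options for this layer given the partial schedule so far.
                for inc_cost, sched in schedule_layer(layer, partial):
                    expanded.append((cost + inc_cost, partial + [sched]))
            beam = sorted(expanded, key=lambda c: c[0])[:beam_width]
        return beam[0] if beam else None
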
If you use this tool in your work, we kindly request that you reference our
paper(s) below, and send us a citation of your work.

- Gao et al., "TETRIS: Scalable and Efficient Neural Network Acceleration with
3D Memory", in ASPLOS, April 2017 [Gao17]_.
3D Memory", in ASPLOS, April 2017.

- Gao et al., "TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN
Accelerators", in ASPLOS. April 2019.


Install
@@ -102,6 +111,20 @@ Other options include:
layers, and output partitioning for FC layers.
- ``--batch-partitioning`` and ``--ifmap-partitioning``: whether the hybrid
partitioning also explores batch and input partitioning.
- ``--enable-access-forwarding``: access forwarding, where the nodes fetch
disjoint subsets of data and forward them to other nodes. See [Gao19]_.
- ``--enable-gbuf-sharing``: buffer sharing, where the global buffer capacity is
shared across nodes through the NoC. See [Gao19]_.
- ``--enable-save-writeback``: allow eliding the intermediate data writeback to
memory when switching between layers if it is possible to store the entire
data set in on-chip buffers.
- ``--interlayer-partition``: whether to use inter-layer pipelining to
partition resources across multiple layers and process them simultaneously.
- ``--layer-pipeline-time-overhead``, ``--layer-pipeline-max-degree``:
constrain the configuration space of inter-layer pipelining by specifying
the maximum execution time overhead or the maximum pipelining degree.
- ``--disable-interlayer-opt``: disable optimizations and only allow basic
inter-layer pipelining.
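
As a rough illustration, the pipelining-related options above might be combined
as follows (a sketch only: the numeric values are placeholders, and the entry
script and its required positional arguments are omitted)::

    # Illustrative combination of the inter-layer pipelining options; the
    # values are examples, not recommendations.
    pipeline_flags = [
        "--interlayer-partition",                 # pipeline layers on partitioned resources
        "--layer-pipeline-time-overhead", "0.2",  # example maximum execution time overhead
        "--layer-pipeline-max-degree", "8",       # example maximum pipelining degree
        "--enable-access-forwarding",             # fetch disjoint data and forward over the NoC
        "--enable-gbuf-sharing",                  # share global buffer capacity over the NoC
        "--enable-save-writeback",                # elide intermediate writeback when data fit on-chip
    ]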


Code Structure
@@ -115,7 +138,10 @@ Code Structure
- Array mapping: ``map_strategy``.
- Loop blocking and reordering: ``loop_blocking``,
``loop_blocking_scheme``, ``loop_blocking_solver``.
- Partitioning: ``partition``, ``partition_scheme``.
- Intra-layer partitioning: ``partition``, ``partition_scheme``,
``buf_shr_scheme``.
- Inter-layer pipelining: ``inter_layer_pipeline``,
``pipeline_segment``.
- Network and layer: ``network``, ``layer``.
- ``nns``: example NN definitions.
- ``tests``: unit tests.
@@ -156,6 +182,10 @@ with the Board of Trustees of Leland Stanford Junior University.
References
----------

.. [Gao19] Gao, Yang, Pu, Horowitz, and Kozyrakis, `TANGRAM: Optimized
Coarse-Grained Dataflow for Scalable NN Accelerators
<//dl.acm.org/citation.cfm?id=3297858.3304014>`__, in ASPLOS, April 2019.
.. [Gao17] Gao, Pu, Yang, Horowitz, and Kozyrakis, `TETRIS: Scalable and
Efficient Neural Network Acceleration with 3D Memory
<//dl.acm.org/citation.cfm?id=3037697.3037702>`__, in ASPLOS, April 2017.
2 changes: 1 addition & 1 deletion nn_dataflow/__init__.py
@@ -13,5 +13,5 @@
program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
"""

__version__ = '1.6'
__version__ = '2.0'

6 changes: 6 additions & 0 deletions nn_dataflow/core/__init__.py
@@ -20,11 +20,13 @@
from . import loop_enum as LoopEnum
from . import mem_hier_enum as MemHierEnum
from . import parallel_enum as ParallelEnum
from .buf_shr_scheme import BufShrScheme
from .cost import Cost
from .data_dim_loops import DataDimLoops
from .data_layout import DataLayout
from .fmap_range import FmapPosition, FmapRange, FmapRangeMap
from .int_range import IntRange
from .inter_layer_pipeline import InterLayerPipeline
from .layer import Layer, InputLayer, ConvLayer, FCLayer, \
LocalRegionLayer, PoolingLayer, EltwiseLayer
from .loop_blocking_scheme import LoopBlockingScheme
@@ -36,8 +38,12 @@
from .option import Option
from .partition_scheme import PartitionScheme
from .phy_dim2 import PhyDim2
from .pipeline_segment import PipelineSegment
from .pipeline_segment_timing import PipelineSegmentTiming
from .resource import Resource
from .scheduling import SchedulingCondition, SchedulingResult, Scheduling
from .scheduling_constraint import SchedulingConstraint, \
SchedulingConstraintLayerPipeline

from .nn_dataflow import NNDataflow
