diff --git a/CHANGELOG.md b/CHANGELOG.md
index ad14e46..f48ccba 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,7 +1,71 @@
 List of major changes and improvements
 ======================================
 
-## [Unreleased]
+## [v1.6 -- v2.0] -- 2018-03-01
+
+### Added
+
+- Hardware models.
+
+  - Access forwarding.
+
+  - Buffer sharing scheme.
+    - Use `BufShrScheme` class to represent and calculate NoC transfers.
+
+- Software models.
+
+  - Add `SchedulingConstraint` class to specify loop blocking and partitioning
+    constraints.
+    - Add lazily updated rules to allow refining constraints with previous
+      scheduling results at runtime.
+    - Add subclass `SchedulingConstraintLayerPipeline` for layer pipelining
+      constraints.
+
+  - Add `InterLayerPipeline`.
+    - Layers are organized into `PipelineSegment`s, which are simultaneously
+      mapped onto the resource both spatially and temporally.
+    - Each layer in the segment has a 3-tuple scheduling index including
+      segment index, spatial index, and temporal index.
+    - Each layer in the segment has its resource allocation and scheduling
+      constraint.
+    - Use `PipelineSegmentTiming` to capture the timing relation of layers in
+      the segment.
+    - Specify maximum allowed execution time overhead due to layer pipelining
+      in `Option`.
+    - Specify maximum pipelining degree for layer pipelining in `Option`.
+
+  - Add layer pipelining optimizations.
+    - Ofmap forwarding: alternate layer loop ordering.
+    - Ifmap forwarding: sharing the same inputs from memory to multiple
+      regions.
+    - Support model weight pinning when there is no resource time-multiplexing.
+    - Allow disabling optimizations for layer pipelining to fall back to basic
+      pipelining techniques.
+
+
+### Changed
+
+- Hardware models.
+
+  - Allow data source/destination regions in `Resource` to be non-DATA type.
+
+  - Allow `NodeRegion` to be folded along the w dimension in a zig-zag manner.
+
+- Software models.
+
+  - `LoopBlockingScheme` supports access forwarding and buffer sharing.
+
+  - `LoopBlockingScheme` supports remote node buffers as data regions (non-data
+    type data regions).
+
+  - `partition` unit number-of-hops calculation supports access forwarding and
+    buffer sharing.
+
+  - `DataLayout` supports closest-first forwarding data transfer for access
+    forwarding and buffer sharing.
+
+  - Refactor `NNDataflow` and `NNDataflowScheme` to incorporate inter-layer
+    pipelining.
 
 ## [v1.5 -- v1.6] -- 2018-01-31
 
diff --git a/README.rst b/README.rst
index bd9f990..3ce8b7a 100644
--- a/README.rst
+++ b/README.rst
@@ -9,7 +9,7 @@ Neural Network Dataflow Scheduling
 
 This Python tool allows you to explore the energy-efficient dataflow
 scheduling for neural networks (NNs), including array mapping, loop blocking and
-reordering, and parallel partitioning.
+reordering, and (coarse-grained) parallel processing within and across layers.
 
 For hardware, we assume an Eyeriss-style NN accelerator [Chen16]_, i.e., a 2D
 array of processing elements (PEs) with a local register file in each PE, and a
@@ -26,18 +26,27 @@ In software, we decouple the dataflow scheduling into three subproblems:
   convolutions by blocking and reordering the nested loops. We support
   exhaustive search over all blocking and reordering schemes [Yang16]_, and
   analytical bypass solvers [Gao17]_.
-- Partitioning, which partitions the NN computations for parallel processing.
-  We support batch partitioning, fmap partitioning, output partitioning, input
-  partitioning, and the combination between them (hybrid) [Gao17]_. We use
-  layer-wise greedy beam search.
-
-See the details in our ASPLOS'17 paper [Gao17]_.
+- Parallel processing, which partitions the NN computations across the multiple
+  tiled engines. We support both intra-layer and inter-layer parallelism. For
+  intra-layer, we support batch partitioning, fmap partitioning, output
+  partitioning, input partitioning, and the combination between them (hybrid)
+  [Gao17]_. We also explore various dataflow optimizations including access
+  forwarding and buffer sharing [Gao19]_. We use exhaustive search within each
+  layer. For inter-layer, we support spatial pipelining (inter-layer
+  pipelining) and temporal pipelining (time multiplexing without writing back
+  intermediate data) as well as their optimized scheduling [Gao19]_. We use
+  layer-wise greedy beam search across layers.
+
+See the details in our ASPLOS'17 [Gao17]_ and ASPLOS'19 [Gao19]_ papers.
 
 If you use this tool in your work, we kindly request that you reference our
 paper(s) below, and send us a citation of your work.
 
 - Gao et al., "TETRIS: Scalable and Efficient Neural Network Acceleration with
-  3D Memory", in ASPLOS, April 2017 [Gao17]_.
+  3D Memory", in ASPLOS, April 2017.
+
+- Gao et al., "TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN
+  Accelerators", in ASPLOS, April 2019.
 
 
 Install
@@ -102,6 +111,20 @@ Other options include:
   layers, and output partitioning for FC layers.
 - ``--batch-partitioning`` and ``--ifmap-partitioning``: whether the hybrid
   partitioning also explores batch and input partitioning.
+- ``--enable-access-forwarding``: access forwarding, where the nodes fetch
+  disjoint subsets of data and forward them to other nodes. See [Gao19]_.
+- ``--enable-gbuf-sharing``: buffer sharing, where the global buffer capacity
+  is shared across nodes through the NoC. See [Gao19]_.
+- ``--enable-save-writeback``: allow eliding the intermediate data writeback to
+  memory when switching between layers if it is possible to store the entire
+  data set in on-chip buffers.
+- ``--interlayer-partition``: whether to use inter-layer pipelining to
+  partition resources across multiple layers and process them simultaneously.
+- ``--layer-pipeline-time-overhead``, ``--layer-pipeline-max-degree``:
+  constrain the configuration space of inter-layer pipelining, by specifying
+  the maximum execution time overhead, or the maximum pipelining degree.
+- ``--disable-interlayer-opt``: disable optimizations and only allow basic
+  inter-layer pipelining.
 
 
 Code Structure
@@ -115,7 +138,10 @@ Code Structure
   - Array mapping: ``map_strategy``.
   - Loop blocking and reordering: ``loop_blocking``, ``loop_blocking_scheme``,
     ``loop_blocking_solver``.
-  - Partitioning: ``partition``, ``partition_scheme``.
+  - Intra-layer partitioning: ``partition``, ``partition_scheme``,
+    ``buf_shr_scheme``.
+  - Inter-layer pipelining: ``inter_layer_pipeline``,
+    ``pipeline_segment``.
   - Network and layer: ``network``, ``layer``.
 - ``nns``: example NN definitions.
 - ``tests``: unit tests.
@@ -156,6 +182,10 @@ with the Board of Trustees of Leland Stanford Junior University.
 References
 ----------
 
+.. [Gao19] Gao, Yang, Pu, Horowitz, and Kozyrakis, `TANGRAM: Optimized
+   Coarse-Grained Dataflow for Scalable NN Accelerators
+   `__, in ASPLOS. April, 2019.
+
 .. [Gao17] Gao, Pu, Yang, Horowitz, and Kozyrakis, `TETRIS: Scalable and
    Efficient Neural Network Acceleration with 3D Memory
    `__, in ASPLOS. April, 2017.
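Reviewer note: the flags above map onto `Option` fields that this diff reads elsewhere (e.g., `options.hw_gbuf_sharing` in `loop_blocking_scheme.py`, `options.layer_pipeline_time_ovhd` and `options.layer_pipeline_opt` in the scheduling code). A minimal sketch of setting them programmatically, assuming `Option` accepts these fields as keyword arguments (its definition is not part of this diff):

    # Sketch only: field names are inferred from their usages in this diff;
    # the keyword-argument constructor is an assumption.
    from nn_dataflow.core import Option

    opts = Option(hw_access_forwarding=False,    # --enable-access-forwarding
                  hw_gbuf_sharing=True,          # --enable-gbuf-sharing
                  hw_gbuf_save_writeback=True,   # --enable-save-writeback
                  partition_interlayer=True,     # --interlayer-partition
                  layer_pipeline_time_ovhd=0.1,  # --layer-pipeline-time-overhead
                  layer_pipeline_max_degree=8,   # --layer-pipeline-max-degree
                  layer_pipeline_opt=True)       # omit --disable-interlayer-opt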
diff --git a/nn_dataflow/__init__.py b/nn_dataflow/__init__.py
index 5fc5ef0..8257e4e 100644
--- a/nn_dataflow/__init__.py
+++ b/nn_dataflow/__init__.py
@@ -13,5 +13,5 @@
 program. If not, see .
 """
 
-__version__ = '1.6'
+__version__ = '2.0'
 
diff --git a/nn_dataflow/core/__init__.py b/nn_dataflow/core/__init__.py
index 0fe9784..8a8e178 100644
--- a/nn_dataflow/core/__init__.py
+++ b/nn_dataflow/core/__init__.py
@@ -20,11 +20,13 @@
 from . import loop_enum as LoopEnum
 from . import mem_hier_enum as MemHierEnum
 from . import parallel_enum as ParallelEnum
+from .buf_shr_scheme import BufShrScheme
 from .cost import Cost
 from .data_dim_loops import DataDimLoops
 from .data_layout import DataLayout
 from .fmap_range import FmapPosition, FmapRange, FmapRangeMap
 from .int_range import IntRange
+from .inter_layer_pipeline import InterLayerPipeline
 from .layer import Layer, InputLayer, ConvLayer, FCLayer, \
     LocalRegionLayer, PoolingLayer, EltwiseLayer
 from .loop_blocking_scheme import LoopBlockingScheme
@@ -36,8 +38,12 @@
 from .option import Option
 from .partition_scheme import PartitionScheme
 from .phy_dim2 import PhyDim2
+from .pipeline_segment import PipelineSegment
+from .pipeline_segment_timing import PipelineSegmentTiming
 from .resource import Resource
 from .scheduling import SchedulingCondition, SchedulingResult, Scheduling
+from .scheduling_constraint import SchedulingConstraint, \
+    SchedulingConstraintLayerPipeline
 
 from .nn_dataflow import NNDataflow
 
diff --git a/nn_dataflow/core/buf_shr_scheme.py b/nn_dataflow/core/buf_shr_scheme.py
new file mode 100644
index 0000000..d496d9b
--- /dev/null
+++ b/nn_dataflow/core/buf_shr_scheme.py
@@ -0,0 +1,364 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see .
+"""
+
+import math
+
+from . import data_category_enum as de
+from . import loop_enum as le
+from . import parallel_enum as pe
+from .. import util
+from .layer import ConvLayer
+from .phy_dim2 import PhyDim2
+
+class BufShrScheme(object):
+    '''
+    The buffer sharing scheme.
+    '''
+
+    def __init__(self, node_region, part, data_loops=None):
+        '''
+        `node_region` is the node region in which the buffer sharing takes
+        place.
+
+        `part` is the PartitionScheme instance that determines the buffer
+        sharing scheme.
+
+        `data_loops` is a DataDimLoops instance that determines the
+        relationship between DataCategoryEnum and ParallelEnum. Default is for
+        ConvLayer.
+        '''
+
+        if any(pd > nrd for pd, nrd in zip(part.dim(), node_region.dim)):
+            raise ValueError('BufShrScheme: partitioning scheme does not fit '
+                             'in the node region')
+
+        if data_loops is None:
+            data_loops = ConvLayer.data_loops()
+
+        # Get node group corresponding to each LoopEnum, and the distance
+        # between neighbors in that node group.
+        lpe_dims = [PhyDim2(1, 1)] * le.NUM
+        lpe_nbr_dists = [PhyDim2(float('nan'), float('nan'))] * le.NUM
+
+        # le.BAT corresponds to pe.OFMP and pe.BATP.
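+        # (Nodes that differ only in their OFMP/BATP position use the same
+        # filters, so the le.BAT node group derived below is the group across
+        # which the FIL data category is shared. E.g., adjacent 2x2 OFMP and
+        # 2x1 BATP partitions combine into a 4x2 le.BAT node group.)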
+        idx_ofmp = part.order.index(pe.OFMP)
+        idx_batp = part.order.index(pe.BATP)
+        dim_ofmp = part.dim(pe.OFMP)
+        dim_batp = part.dim(pe.BATP)
+        # If only one of OFMP and BATP exists, use that one.
+        if dim_ofmp.size() == 1:
+            lpe_dims[le.BAT] = dim_batp
+            lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(node_region,
+                                                            pe.BATP)
+        elif dim_batp.size() == 1:
+            lpe_dims[le.BAT] = dim_ofmp
+            lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(node_region,
+                                                            pe.OFMP)
+        else:
+            # If both exist ...
+            if abs(idx_ofmp - idx_batp) == 1:
+                # ... and are adjacent in the partitioning hierarchy, use
+                # both.
+                lpe_dims[le.BAT] = dim_batp * dim_ofmp
+                # Neighbor distance is the smaller one.
+                nbr_dist_ofmp = part.part_neighbor_dist(node_region, pe.OFMP)
+                nbr_dist_batp = part.part_neighbor_dist(node_region, pe.BATP)
+                lpe_nbr_dists[le.BAT] = PhyDim2(*[min(d1, d2) for d1, d2
+                                                  in zip(nbr_dist_ofmp,
+                                                         nbr_dist_batp)])
+            else:
+                # ... but are not adjacent, use the bottom one (with
+                # smaller distance).
+                if idx_ofmp > idx_batp:
+                    lpe_dims[le.BAT] = dim_ofmp
+                    lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(
+                        node_region, pe.OFMP)
+                else:
+                    lpe_dims[le.BAT] = dim_batp
+                    lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(
+                        node_region, pe.BATP)
+
+        # le.OFM corresponds to pe.OUTP.
+        lpe_dims[le.OFM] = part.dim(pe.OUTP)
+        lpe_nbr_dists[le.OFM] = part.part_neighbor_dist(node_region, pe.OUTP)
+
+        # le.IFM corresponds to pe.INPP.
+        lpe_dims[le.IFM] = part.dim(pe.INPP)
+        lpe_nbr_dists[le.IFM] = part.part_neighbor_dist(node_region, pe.INPP)
+
+        # Dimension of the node group.
+        self.dims = []
+        # Distance between the neighbors in the node group.
+        self.nbr_dists = []
+
+        # The nodes corresponding to the LoopEnum unrelated to the data
+        # category will fetch the same data, i.e., sharing the data.
+        for dce in range(de.NUM):
+            lpe = (data_loops[dce].drop(range(le.NUM)) + [None])[0]
+            if lpe is None:
+                self.dims.append(PhyDim2(1, 1))
+                self.nbr_dists.append(PhyDim2(float('inf'), float('inf')))
+            else:
+                self.dims.append(lpe_dims[lpe])
+                self.nbr_dists.append(lpe_nbr_dists[lpe])
+
+        # Check extraordinary neighbor distance.
+        assert all(all((not math.isnan(nd)) and (not math.isinf(nd) or d == 1)
+                       for d, nd in zip(dim, nbr_dist))
+                   for dim, nbr_dist in zip(self.dims, self.nbr_dists))
+
+        self.node_region = node_region
+        self.part = part
+        self.data_loops = data_loops
+
+        # Cache for nhops_rotate_all().
+        self.nhops_cache = {}
+
+    def dim(self, dce):
+        ''' Get the buffer sharing node group dimensions. '''
+        return self.dims[dce]
+
+    def size(self, dce):
+        ''' Get the buffer sharing node group size. '''
+        return self.dims[dce].size()
+
+    def nhops_rotate_all(self, dce, subgrp_size, rotation_unit_cnt=None):
+        '''
+        Number of hops for a rotation operation of an entire round.
+
+        The number of hops is relative to the total unique data size. E.g.,
+        when the data are in N nodes and each node has 1/M data, if all the
+        data have been transferred by 1 hop, the number of hops is N / M.
+
+        The data are spread in N nodes, where N is the group size. Each node
+        holds 1/M data, where M is given by `subgrp_size`. M is rounded up to
+        a factor of N, M' >= M, and every M' nodes form a subgroup. There are
+        N//M' == N//M subgroups. If M' == M, there are no redundant data in
+        the nodes of a subgroup.
+
+        Rotation means the following operation: nodes exchange their data with
+        the minimum number of hops, until every node has seen all the data.
+
+        How to rotate:
+
+        Each subgroup rotates its data independently. A subgroup is typically
+        2D. We chain the nodes in a snaking fashion with a priority dimension.
+        E.g., if the priority dimension is H (the 1st one), then the node chain
+        is (0,0), (1,0), ..., (H-1,0), (H-1,1), (H-2,1), ..., (0,1), (0,2),
+        ..., i.e., first go along H to the end, then turn to W and go one hop
+        to the next H, then turn and go along H, etc.. The priority dimension
+        is chosen to minimize the overall rotation hops.
+
+        We store data in the chained M' nodes of a subgroup as follows, where
+        the index is the i-th 1/M chunk:
+
+        M-1, M-2, ..., 1, 0, | M-1, M-2, ..., 2M-M'
+
+        The first M nodes circulate their data in a loop. In addition, the
+        (M-1)-th node also sends its data to the M-th node. The last M'-M
+        nodes sequentially send data to the right side, and the last node does
+        not send data.
+
+        So in the next step:
+
+        0, M-1, ..., 2, 1, | 0, M-1, ..., 2M-M'+1
+
+        And so on until the last step:
+
+        M-2, M-3, ..., 0, M-1, | M-2, M-3, ..., 2M-M'-1
+
+        Overall, each node except for the last one sends its 1/M data to the
+        right neighbor at each of the M-1 steps. And the (M-1)-th node also
+        sends its 1/M data to the 0-th node.
+
+        Note that we do not restore the initial state after one rotation round
+        (missing one step). Even in the case of multiple rotation rounds, this
+        is OK, as the node does not care about which piece of shared data it
+        starts with, as long as each node sees all data at the end.
+
+        Typically rotation ends after rotating M - 1 node buffers, i.e.,
+        skipping 1 step. When a rotation unit occupies more than one node
+        buffer, i.e., the rotation unit count is less than M, the rotation
+        ends earlier, when the last rotation unit hits the beginning of the
+        first node buffer. E.g., for M = 4 and unit count 3, the last unit
+        initially starts at 2/3 of the 3rd node, so we only rotate 2 + 2/3 =
+        8/3 node buffers, i.e., skipping 4 - 8/3 = 4/3 steps.
+
+        If the rotation unit count is not given (None), assume it is no less
+        than M, i.e., equal to M.
+        '''
+
+        # Check cache.
+        cache_key = (dce, subgrp_size, rotation_unit_cnt)
+        res = self.nhops_cache.get(cache_key, None)
+        if res is not None:
+            return res
+
+        subgrp_dim, idx_pr = self._subgrp_dim(dce, subgrp_size)
+
+        if rotation_unit_cnt is None:
+            rotation_unit_cnt = subgrp_size
+
+        # 1. Send to right neighbor.
+        # If H < W, rotate along H dimension, i.e., go along H to the end,
+        # then turn to W and go one hop to the next H, then turn and go along
+        # H, ...
+        d_pr = subgrp_dim[idx_pr]
+        d_npr = subgrp_dim[1 - idx_pr]
+        # Per-step nhops = (H-1) * W * Dh + (W-1) * Dw
+        n_pr = (d_pr - 1) * d_npr
+        n_npr = d_npr - 1
+        nhops_nbr = self._nhops_with_neighbor_dist(
+            dce,
+            PhyDim2(*[tpl[1] for tpl
+                      in sorted([(idx_pr, n_pr), (1 - idx_pr, n_npr)])]))
+
+        # 2. (M-1)-th node loops back to the 0-th node.
+        # Position of the (M-1)-th node.
+        coord = self._coordinate(subgrp_size - 1, subgrp_dim, idx_pr)
+        # Per-step nhops = distance back to the 0-th node.
+        nhops_lpbk = self._nhops_with_neighbor_dist(dce, coord)
+
+        skipped_steps = max(1, 1. * subgrp_size / rotation_unit_cnt)
+        assert 1 <= skipped_steps <= subgrp_size
+
+        # All steps; normalize; all subgroups.
+        nhops = (nhops_nbr + nhops_lpbk) \
+                * (subgrp_size - skipped_steps) \
+                * (1. / subgrp_size) \
+                * (self.size(dce) // subgrp_size)
+        assert not math.isinf(nhops) and not math.isnan(nhops)
+
+        # Update cache.
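+        # (The result depends only on the cache key components, so per-
+        # instance memoization is safe.)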
+ assert cache_key not in self.nhops_cache + self.nhops_cache[cache_key] = nhops + + return nhops + + def nhops_wide_fetch_once(self, dce, subgrp_size, fetch_width): + ''' + Number of hops for one wide fetch operation. + + The number of hops is relative to the total unique data size. E.g., + when the data are in N nodes and each node has 1/M data, if all the + data have been transferred by 1 hop, the number of hops is N / M. + + The data in the subgroup are spread in M' nodes, where M' rounds up M, + given by `subgrp_size`, to a factor of the group size N. Each node + holds 1/M data. See the rotation function about how the data are + distributed. + + Wide fetch means the following operation: a node needs to access W/M > + 1/M data without rotation, where W is given by `fetch_width`. + + The ceil(W) nodes that will feed the data are those on the upstream + (senders) of the rotation chain to this node. + + The returned number of hops is the sum across all nodes in the group. + Since it is relative to the total unique data size, and not relative to + the fetch data size (fetch width), it is normalized by the fetch width. + The number of hops for all nodes to get (W - 1) / W data from their (W + - 1) upstream nodes is equal to the number of hops for (W - 1) rotation + steps. + ''' + if fetch_width <= 1: + return 0 + elif fetch_width > subgrp_size: + raise ValueError('BufShrScheme: fetch width is larger than ' + 'subgroup size. {} vs. {}.' + .format(fetch_width, subgrp_size)) + + nhops_rot_perstep = self.nhops_rotate_all(dce, subgrp_size) \ + / (subgrp_size - 1) + + ceil_width = math.ceil(fetch_width - 1e-6) + # Total steps = 0 + 1 + 2 + ... + (cw - 1) - (cw - 1) * (cw - w) + total_steps = (ceil_width - 1) * ceil_width / 2 \ + - (ceil_width - 1) * (ceil_width - fetch_width) + + return nhops_rot_perstep * total_steps / fetch_width + + def _subgrp_dim(self, dce, subgrp_size): + ''' + Decide the subgroup dimensions and the priority dimension index. + Priority dimension is the one along which rotation happens. + ''' + # Round up subgroup size to a factor of the group size. + true_subgrp_size = subgrp_size + size = self.size(dce) + while size % true_subgrp_size: + true_subgrp_size += 1 + if true_subgrp_size > size: + raise ValueError('BufShrScheme: subgroup is larger than group. ' + '{} vs. {}.'.format(subgrp_size, size)) + + dim = self.dim(dce) + nbr_dist = self.nbr_dists[dce] + + # The dimension with smaller/larger distance. + idx_sm = 0 if nbr_dist[0] <= nbr_dist[1] else 1 + idx_lg = 1 - idx_sm + dim_sm = dim[idx_sm] + + # The smaller-distance dimension is the priority dimension. + idx_pr = idx_sm + + tpl = [1] * 2 + + # We try to use as much as possible from the smaller-distance dimension + # to the subgroup. Figure out the maximum factor. + for f, _ in util.factorize(dim_sm, 2): + if f > tpl[idx_sm] and true_subgrp_size % f == 0: + tpl[idx_sm] = f + + tpl[idx_lg] = true_subgrp_size // tpl[idx_sm] + + subgrp_dim = PhyDim2(*tpl) + assert subgrp_dim.size() == true_subgrp_size + + return subgrp_dim, idx_pr + + @staticmethod + def _coordinate(index, dim, idx_pr): + ''' + The coordinate of a node with sequential index `index` in the 2D nodes + with dimensions `dim`. The index increases first along the priority + dimension given by `idx_pr` as the dimension index. Return a PhyDim2 + relative coordinate in the subgroup without scaling by the neighbor + distance. 
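+
+        E.g., with `dim` PhyDim2(3, 2) and `idx_pr` 0, indices 0--5 map to
+        (0,0), (1,0), (2,0), (2,1), (1,1), (0,1), i.e., down the first column
+        and back up the second.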
+ ''' + dim_pr, dim_npr = dim if idx_pr == 0 else reversed(dim) + coord_npr, coord_pr = divmod(index, dim_pr) + assert coord_npr < dim_npr and coord_pr < dim_pr + # We go backward in the odd H, i.e., snaking. + if coord_npr % 2 == 1: + coord_pr = dim_pr - 1 - coord_pr + coord = PhyDim2(coord_pr, coord_npr) if idx_pr == 0 \ + else PhyDim2(coord_npr, coord_pr) + return coord + + def _nhops_with_neighbor_dist(self, dce, coord): + ''' + Get the number of hops from (0, 0) to `coord` of the subgroup of data + category `dce`, by scaling by the neighbor distance. + ''' + dist = [c * d if c else 0 for c, d in zip(coord, self.nbr_dists[dce])] + assert not any(math.isinf(d) or math.isnan(d) for d in dist) + return PhyDim2(*dist).hop_dist(PhyDim2(0, 0)) + + def __repr__(self): + return '{}({})'.format( + self.__class__.__name__, + ', '.join([ + 'part={}'.format(repr(self.part)), + 'data_loops={}'.format(repr(self.data_loops))])) + diff --git a/nn_dataflow/core/data_layout.py b/nn_dataflow/core/data_layout.py index 1fb8293..3832705 100644 --- a/nn_dataflow/core/data_layout.py +++ b/nn_dataflow/core/data_layout.py @@ -14,6 +14,7 @@ """ from collections import namedtuple +import itertools from .fmap_range import FmapPosition, FmapRange, FmapRangeMap from .node_region import NodeRegion @@ -84,12 +85,22 @@ def fmap_range_map(self): return frmap - def nhops_to(self, fmap_range, *dest_list): + def nhops_to(self, fmap_range, *dest_list, **kwargs): ''' Get the total number of hops to transfer the FmapRange `fmap_range` to destinations `dest_list` given as a list of absolute coordinates. + + If `forwarding` is True, the data can be forwarded between destinations + rather than all from the source. ''' - nhops = 0 + forwarding = kwargs.pop('forwarding', False) + if kwargs: + raise ValueError('DataLayout: method nhops_to() got an unexpected ' + 'keyword argument: {}.' + .format(kwargs.popitem()[0])) + + # The number of hops to transfer data to each destination individually. + nhops_list = [0] * len(dest_list) for frng, region, part in zip(self.frngs, self.regions, self.parts): @@ -102,8 +113,31 @@ def nhops_to(self, fmap_range, *dest_list): pfrng = part.fmap_range(frng, pidx) size = fmap_range.overlap_size(pfrng) - hop_dist_list = [d.hop_dist(psrc) for d in dest_list] - nhops += size * sum(hop_dist_list) + nhops_list = [n + size * d.hop_dist(psrc) + for n, d in zip(nhops_list, dest_list)] + + if forwarding: + # The number of hops to the first node and its coordinate. + nhops, coord = min(zip(nhops_list, dest_list)) + + # Size of all data. + total_size = self.complete_fmap_range().overlap_size(fmap_range) + + # Data can be forwarded from all sources to any destination. + src_set = {coord} + dst_set = set(dest_list) - src_set + + while dst_set: + # Each forward step, get the min-distance pair of source and + # destination. + src, dst = min(itertools.product(src_set, dst_set), + key=lambda (s, d): d.hop_dist(s)) + dst_set.remove(dst) + src_set.add(dst) + nhops += total_size * dst.hop_dist(src) + + else: + nhops = sum(nhops_list) return nhops diff --git a/nn_dataflow/core/inter_layer_pipeline.py b/nn_dataflow/core/inter_layer_pipeline.py new file mode 100644 index 0000000..2281ddb --- /dev/null +++ b/nn_dataflow/core/inter_layer_pipeline.py @@ -0,0 +1,356 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. 
+ +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import itertools + +from .layer import ConvLayer +from .network import Network +from .pipeline_segment import PipelineSegment +from .resource import Resource + +class InterLayerPipeline(object): + ''' + Inter-layer pipeline. + ''' + + def __init__(self, network, batch_size, resource, max_util_drop=0.05): + if not isinstance(network, Network): + raise TypeError('InterLayerPipeline: network must be ' + 'a Network instance.') + if not isinstance(resource, Resource): + raise TypeError('InterLayerPipeline: resource must be ' + 'a Resource instance.') + if not 0 <= max_util_drop <= 1: + raise ValueError('InterLayerPipeline: max_util_drop must be ' + 'between [0, 1].') + + self.network = network + self.batch_size = batch_size + self.resource = resource + self.max_util_drop = max_util_drop + + self._calc_sched_dag() + + # Vertices starting from which we have generated the segments. + self.seg_vertex_done = set() + + def ordered_layer_list(self): + ''' + Get a list of the layers in their topological order in the scheduling + DAG. + ''' + return list(sum(self.dag_vertex_list, tuple())) + + def gen_segment(self, options): + ''' + Generate all valid inter-layer pipelining segments. + ''' + + kwargs = {'network': self.network, + 'batch_size': self.batch_size, + 'resource': self.resource, + 'max_util_drop': self.max_util_drop, + 'with_opt': options.layer_pipeline_opt, + } + + # No pipelining, each layer sequentially occupies the whole resource. + for layer in self.network: + seg = ((layer,),) + segment = PipelineSegment(seg, **kwargs) + assert segment.valid + yield segment + + # Pipelining. + for vseg in self._gen_vseg(): + + if len(vseg) > options.layer_pipeline_max_degree: + continue + + if len(vseg) == 1 and len(self.dag_vertex_list[vseg[0]]) == 1: + # An individual layer, already returned in no-pipelining case. + continue + + # Use set to eliminate duplicates. + seg_cands = set() + + if options.partition_interlayer: + # Spatial pipelining. + seg = tuple(self.dag_vertex_list[vidx] for vidx in vseg) + seg_cands.add(seg) + + if options.hw_gbuf_save_writeback: + # Temporal pipelining. + # Reduce the spatial dimension. + seg = (tuple(itertools.chain.from_iterable( + self.dag_vertex_list[vidx] for vidx in vseg)),) + seg_cands.add(seg) + + # Determine segment allocation. + for seg in seg_cands: + segment = PipelineSegment(seg, **kwargs) + if segment.valid: + yield segment + + def _gen_vseg(self, vertex_idx=0, done=None): + ''' + Generate vertex segments starting from vertex `vertex_idx`. Yield a + tuple of the vertices in the segment. + + `done` is a set of vertices which have already been scheduled and the + output is already in memory. + + Rules: + + 1. If a vertex does not share any dependencies with the current + segment, i.e., none of its previous vertices is in the current segment + or among the previous vertices of the current segment, we do not add it + to the segment, because there is no benefit to co-locate them. + + 2. If a vertex has multiple previous vertices, at most one of them + can be in the same segment as this vertex, because the output data + availability timing of multiple previous vertices may not match. + + 3. 
If a vertex has multiple next vertices, either all or at most one of
+        them can be NOT in the same segment as this vertex, because only
+        including a small subset saves little data write-back to memory.
+        '''
+
+        vseg = tuple()
+
+        if not done:
+            done = set()
+            # Reset.
+            self.seg_vertex_done = set()
+
+        if self.dag_input_vertex not in done:
+            # Input layer is always in memory.
+            done.add(self.dag_input_vertex)
+
+        # The frontier is the vertex considered to be added to the current
+        # segment.
+        for frontier in range(vertex_idx, len(self.dag_vertex_list)):
+
+            # Check whether the frontier can be added to the current segment.
+
+            frontier_prevs = self.dag_prev_dict[frontier]
+
+            # Whether the frontier shares dependencies with the current
+            # segment, if the segment is not empty.
+            share_deps = not vseg or not frontier_prevs.isdisjoint(
+                set.union(set(vseg), *[self.dag_prev_dict[i] for i in vseg]))
+
+            # Whether multiple previous vertices are in the current segment.
+            multi_prevs = len(frontier_prevs.intersection(vseg)) > 1
+
+            if not share_deps or multi_prevs:
+                # Not sharing any dependencies (rule 1), or multiple previous
+                # vertices in the current segment (rule 2).
+
+                # Make sure the current segment is not empty.
+                assert vseg
+                # Do not extend the segment any more. Note that the current
+                # segment has already been yielded, as well as the recursion,
+                # in the last iteration.
+                break
+
+            # Extend the segment.
+            vseg += (frontier,)
+
+            # Check whether the segment is valid.
+
+            for idx in vseg:
+                nexts = self.dag_next_dict[idx]
+
+                # Either all of the next vertices are outside the segment, or
+                # at most one is outside (rule 3).
+                if not nexts.isdisjoint(vseg) \
+                        and len(nexts.difference(vseg)) > 1:
+                    # The segment is invalid. Need to add more vertices.
+                    break
+            else:
+                # The segment is valid.
+                yield vseg
+
+                # Skip if already done.
+                if frontier + 1 in self.seg_vertex_done:
+                    continue
+
+                # Recursion.
+                for tpl in self._gen_vseg(frontier + 1, done.union(vseg)):
+                    yield tpl
+
+        assert vertex_idx not in self.seg_vertex_done
+        self.seg_vertex_done.add(vertex_idx)
+
+    def _calc_sched_dag(self):
+        '''
+        Build the scheduling DAG of the network. We merge layers with no
+        filters into their last previous layer, so a DAG vertex can contain
+        one or more layers.
+
+        We order and index the DAG vertices in their depth-first topological
+        order. This will also be the order to schedule the layers.
+
+        Also establish two dicts for the previous and next vertices of each
+        DAG vertex.
+
+        In summary, the attributes initialized include: `dag_input_vertex`,
+        `dag_vertex_list`, `dag_vertex_dict`, `dag_prev_dict`,
+        `dag_next_dict`.
+        '''
+
+        # Vertex of the input layer.
+        self.dag_input_vertex = -1
+
+        # The DAG vertex set. Each vertex is a merged layer tuple, represented
+        # by the layer names. Use a list type to make modification easier.
+        dag_vertex_set = []
+
+        for layer_name in self.network:
+            layer = self.network[layer_name]
+
+            if isinstance(layer, ConvLayer):
+                dag_vertex_set.append((layer_name,))
+
+            else:
+                prevs = set(self.network.prevs(layer_name))
+                assert prevs
+
+                # Find a vertex to merge into: the vertex must contain exactly
+                # one previous layer, as its last layer, because a non-last
+                # previous layer will not have its data available to be used
+                # for this layer. Also, that previous layer can only have this
+                # one next layer, because its data will be overwritten by this
+                # layer locally.
+
+                # Check vertices in the reversed order.
+                for idx in reversed(range(len(dag_vertex_set))):
+                    vhead = dag_vertex_set[idx][:-1]
+                    vtail = dag_vertex_set[idx][-1]
+                    if prevs.isdisjoint(vhead) and vtail in prevs \
+                            and len(self.network.nexts(vtail)) == 1:
+                        dag_vertex_set[idx] += (layer_name,)
+                        break
+                else:
+                    # No valid vertex to merge.
+                    dag_vertex_set.append((layer_name,))
+
+        assert sum(len(v) for v in dag_vertex_set) == len(self.network)
+
+        # The DAG vertex list in the topological order.
+        self.dag_vertex_list = self._topological_order(dag_vertex_set)
+
+        # Make a dictionary from layer name to DAG vertex index.
+        self.dag_vertex_dict = {}
+
+        for vidx, v in enumerate(self.dag_vertex_list):
+            for layer_name in v:
+                assert layer_name not in self.dag_vertex_dict
+                self.dag_vertex_dict[layer_name] = vidx
+
+        # Add the input layer.
+        self.dag_vertex_dict[self.network.INPUT_LAYER_KEY] = \
+            self.dag_input_vertex
+        # Add the external layers.
+        for ext_layer in self.network.ext_layers():
+            self.dag_vertex_dict[ext_layer] = self.dag_input_vertex
+
+        # The previous and next relationship of the DAG vertices.
+        self.dag_prev_dict = dict((vidx, set()) for vidx
+                                  in range(len(self.dag_vertex_list)))
+        self.dag_next_dict = dict((vidx, set()) for vidx
+                                  in range(len(self.dag_vertex_list)))
+
+        for layer_name in self.network:
+            vidx = self.dag_vertex_dict[layer_name]
+
+            # Previous layers.
+            for p in self.network.prevs(layer_name):
+                pvidx = self.dag_vertex_dict[p] \
+                        if p and p not in self.network.ext_layers() \
+                        else self.dag_input_vertex
+                if pvidx != vidx:
+                    self.dag_prev_dict[vidx].add(pvidx)
+
+            # Next layers.
+            for n in self.network.nexts(layer_name):
+                if not n:
+                    continue
+                nvidx = self.dag_vertex_dict[n]
+                if nvidx != vidx:
+                    self.dag_next_dict[vidx].add(nvidx)
+
+        # Add next layers of the input layer.
+        self.dag_next_dict[self.dag_input_vertex] = set()
+        for vidx in self.dag_prev_dict:
+            if self.dag_input_vertex in self.dag_prev_dict[vidx]:
+                self.dag_next_dict[self.dag_input_vertex].add(vidx)
+
+    def _topological_order(self, dag_vertex_set):
+        '''
+        Order the DAG vertices in topological order using DFS.
+
+        Specifically, the backtrace order of the depth-first search is the
+        inverse of the topological order. See
+        https://en.wikipedia.org/wiki/Topological_sorting#Depth-first_search
+        '''
+
+        # The visited layers in the DFS order.
+        visited = []
+        # The unseen pending layers.
+        unseen = set(dag_vertex_set)
+        # The layers that have been seen, but not visited due to unvisited
+        # previous layers.
+        seen = set()
+
+        def _dfs(vertex):
+            assert vertex not in seen
+            if vertex in visited:
+                return
+
+            unseen.discard(vertex)
+            seen.add(vertex)
+
+            nexts = []
+            for l in vertex:
+                for n in self.network.nexts(l):
+                    if n and n not in vertex and n not in nexts:
+                        nexts.append(n)
+
+            # Visit next layers in the reversed order, so that the reversed
+            # visit order matches the original order.
+            next_vertices = []
+            for n in reversed(nexts):
+                for nv in unseen:
+                    if n in nv:
+                        next_vertices.append(nv)
+
+            for nv in next_vertices:
+                _dfs(nv)
+
+            visited.append(vertex)
+            seen.remove(vertex)
+
+        # Start from the first layers.
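+        # (`firsts()` may return multiple first layers, e.g., when several
+        # branches read the network input directly, so the DFS below can have
+        # more than one root.)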
+ start_vertices = [] + for l in reversed(self.network.firsts()): + for v in unseen: + if l in v: + start_vertices.append(v) + for v in start_vertices: + _dfs(v) + assert not unseen + assert not seen + + return list(reversed(visited)) + diff --git a/nn_dataflow/core/loop_blocking.py b/nn_dataflow/core/loop_blocking.py index 0c49da7..561d5bb 100644 --- a/nn_dataflow/core/loop_blocking.py +++ b/nn_dataflow/core/loop_blocking.py @@ -20,6 +20,7 @@ from . import loop_blocking_solver from . import loop_enum as le from .. import util +from .buf_shr_scheme import BufShrScheme from .layer import ConvLayer from .loop_blocking_scheme import LoopBlockingScheme @@ -110,7 +111,7 @@ def _loop_blocking_cmp_key(options, cost): def _gen_loopblocking_perprocess( - nested_loop_desc, resource, cost, options, + nested_loop_desc, resource, bufshr, constraint, cost, options, gen_tifm, gen_tofm, gen_tbat, gen_ords): def _gen_bl_ts(): @@ -120,9 +121,8 @@ def _gen_bl_ts(): Transpose LoopEnum-major to BL-major. ''' gen_lp_ts = [None] * le.NUM - gen_lp_ts[le.IFM] = gen_tifm - gen_lp_ts[le.OFM] = gen_tofm - gen_lp_ts[le.BAT] = gen_tbat + gen_lp_ts[le.IFM], gen_lp_ts[le.OFM], gen_lp_ts[le.BAT] = \ + constraint.filter_gen_ts(gen_tifm, gen_tofm, gen_tbat) for lp_ts in itertools.product(*gen_lp_ts): bl_ts = tuple(zip(*lp_ts)) yield bl_ts @@ -133,19 +133,27 @@ def _sweep(): for bl_ts, bl_ords in itertools.product(_gen_bl_ts(), gen_ords): if is_conv_loops and skip_conv(bl_ts, bl_ords): continue + if not constraint.is_valid_top_bl(bl_ts[0], bl_ords[0]): + continue lbs = LoopBlockingScheme( - nested_loop_desc, bl_ts, bl_ords, resource, options) + nested_loop_desc, bl_ts, bl_ords, resource, bufshr, + options) yield lbs return heapq.nsmallest(options.ntops, _sweep(), key=_loop_blocking_cmp_key(options, cost)) -def gen_loopblocking(nested_loop_desc, resource, cost, options): +def gen_loopblocking(nested_loop_desc, resource, part, constraint, cost, + options): ''' Generator for loop blocking. ''' + # Buffer sharing scheme. + bufshr = BufShrScheme(resource.proc_region, part, + nested_loop_desc.data_loops) + # Solver only works for CONV layer. if options.sw_solve_loopblocking \ and nested_loop_desc.data_loops == ConvLayer.data_loops(): @@ -153,8 +161,9 @@ def gen_loopblocking(nested_loop_desc, resource, cost, options): for bl_ts, bl_ords in gen(nested_loop_desc, resource, options): lbs = LoopBlockingScheme(nested_loop_desc, bl_ts, bl_ords, - resource, options) - yield lbs + resource, bufshr, options) + if constraint.is_valid_top_bl(lbs.bl_ts[0], lbs.bl_ords[0]): + yield lbs return ## Exhaustive search. @@ -199,8 +208,8 @@ def retrieve_result_st(): list_ords = list(gen_ords) for tifm, tofm in itertools.product(gen_tifm, gen_tofm): r = apply_func(_gen_loopblocking_perprocess, - (nested_loop_desc, resource, cost, options, - [tifm], [tofm], list_tbat, list_ords)) + (nested_loop_desc, resource, bufshr, constraint, cost, + options, [tifm], [tofm], list_tbat, list_ords)) results.append(r) for lbs in heapq.nsmallest(options.ntops, retrieve_func, diff --git a/nn_dataflow/core/loop_blocking_scheme.py b/nn_dataflow/core/loop_blocking_scheme.py index 3b5d90b..221e5f2 100644 --- a/nn_dataflow/core/loop_blocking_scheme.py +++ b/nn_dataflow/core/loop_blocking_scheme.py @@ -19,6 +19,7 @@ from . import data_category_enum as de from . import loop_enum as le from . import mem_hier_enum as me +from .node_region import NodeRegion from .. 
import util class LoopBlockingScheme(object): @@ -37,7 +38,7 @@ class BL(object): # pylint: disable=too-few-public-methods REGF = 1 NUM = 2 - def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, + def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, bufshr, options): ''' Given blocking factors `bl_ts` and the loop orders `bl_ords`, construct @@ -69,6 +70,9 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, `bl_ords` indicate the loop orders of all levels, indexed by BL. Each entry is a permutation tuple indexed by LoopEnum and gives the positions of the loops at this level. Smaller number means inner loop. + + `bufshr` is a BufShrScheme instance, indicating the buffer sharing + scheme. ''' # pylint: disable=invalid-name @@ -76,6 +80,9 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, # Loop structure. self.nld = nested_loop_desc + # Cache values. + self.total_access_gbuf = [self.nld.total_access_at_of(me.GBUF, dce) + for dce in range(de.NUM)] # Check lengths and values. assert len(bl_ts) == BL.NUM + 1, \ @@ -102,6 +109,9 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, # Need to define time for invalid scheme. self.time = float('inf') + # Buffer sharing initialization. + self._init_bufshr(bufshr, options) + # Buffer data size for one unit. self.unit_size = [tuple() for _ in range(BL.NUM)] self.unit_size[BL.GBUF] = self.nld.usize_gbuf @@ -129,6 +139,34 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, # Data fetch calculation. self._set_fetch() + # Check resource data src/dst region. + self.src_is_dram = (resource.src_data_region.type == NodeRegion.DRAM) + self.dst_is_dram = (resource.dst_data_region.type == NodeRegion.DRAM) + + # Check resource for filter pinning. + self.filter_pinned = False + if resource.no_time_mux: + if all(self.bl_ts[0][lpe] == 1 for lpe + in self.nld.data_loops[de.FIL].loops()): + self.filter_pinned = True + self.fetch[0][de.FIL] = 0 + + # If data regions are not DRAM, can only access once, no spilling. + if not self.src_is_dram: + if self.fetch[BL.GBUF][de.IFM] > 1: + self.valid = False + return + if resource.src_data_region == resource.proc_region: + # Force to store in gbuf. + self.stored_in_gbuf[de.IFM] = True + if not self.dst_is_dram: + if self.fetch[BL.GBUF][de.OFM] > 1: + self.valid = False + return + if resource.dst_data_region == resource.proc_region: + # Force to store in gbuf. + self.stored_in_gbuf[de.OFM] = True + # Now with the fetch times, we can calculate the actual # `stored_in_gbuf` values. # Only store in gbuf if having reuse. @@ -163,6 +201,20 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, self.dram_time = float('nan') self.access = [[float('nan')] * de.NUM for _ in range(me.NUM)] + # NoC access due to buffer sharing. + self.noc_access = [0.] * de.NUM + self.bufshr_rotation_access = [0.] * de.NUM + self.bufshr_wide_fetch_access = [0.] * de.NUM + + # Buffer sharing. + self._set_bufshr(resource, bufshr, options) + + # Access forwarding. + self._set_accfwd(bufshr, options) + + # Remote gbuf access. + self.remote_gbuf_access = [0.] * de.NUM + def is_valid(self): ''' Whether is a valid scheme. 
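Reviewer note: the rotation-hop arithmetic that `BufShrScheme.nhops_rotate_all()` implements, and that `LoopBlockingScheme` consumes above, can be reproduced in a few lines. The sketch below is illustrative only, under simplifying assumptions (unit neighbor distances, a single subgroup spanning the whole group, and rotation unit count equal to the subgroup size); it is not the library API:

    # Snake-chain rotation hops for an H x W subgroup with unit node spacing.
    def snake_chain(h, w):
        # Chain the nodes along priority dimension H, zig-zagging across W:
        # (0,0), (1,0), ..., (H-1,0), (H-1,1), ..., (0,1), (0,2), ...
        return [(hh if ww % 2 == 0 else h - 1 - hh, ww)
                for ww in range(w) for hh in range(h)]

    def hop_dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def rotate_all_nhops(h, w):
        chain = snake_chain(h, w)
        # One step: every node sends its 1/M chunk to the next chain node,
        # and the last node loops back to the first.
        per_step = sum(hop_dist(chain[i], chain[i + 1])
                       for i in range(len(chain) - 1))
        per_step += hop_dist(chain[-1], chain[0])
        m = h * w
        # M - 1 steps per round, normalized to the unique data size.
        return per_step * (m - 1.) / m

    print(rotate_all_nhops(3, 2))  # ((3-1)*2 + (2-1) + 1) * 5/6 = 5.0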
@@ -179,6 +231,7 @@ def data_size(self, blvl, dce=None): size = self.unit_cnt[blvl][dce] * self.unit_size[blvl][dce] if blvl == self.BL.GBUF: size *= 1 if self.stored_in_gbuf[dce] else 0 + size = util.idivc(size, self.bufshr_subgrp_size[dce]) return size @@ -209,6 +262,18 @@ def get_top_level_fetch(self): return self.fetch[self.BL.GBUF] + def get_noc_access(self): + ''' + Get the NoC accesses of each data category. + ''' + if not self.is_valid(): + return None + + if not self.finalized_stats: + self._calc_stats() + + return self.noc_access + def get_access_cost(self, cost): ''' Get the data access cost of loop blocking. @@ -220,6 +285,7 @@ def get_access_cost(self, cost): self._calc_stats() acc_cost = sum(c * sum(a) for c, a in zip(cost.mem_hier, self.access)) + acc_cost += cost.mem_hier_at(me.GBUF) * sum(self.remote_gbuf_access) return acc_cost @@ -248,8 +314,18 @@ def gen_index(self): bl_idxgen_list.append(self._gen_index_single_level(t_x, order_x)) bl_cnt_list.append(cnt_x) + # Buffer sharing. + t_x = self.bufshr_bs_t + order_x = self.bufshr_bs_ord + cnt_x = [x // b for x, b + in zip(self._bl_tp(slice(bl_gbuf + 1, None)), + self.bufshr_bs_t)] + bl_idxgen_list.append(self._gen_index_single_level(t_x, order_x)) + bl_cnt_list.append(cnt_x) + # Between GBUF and REGF. - t_x = self.bl_ts[bl_regf] + t_x = [x // b for x, b + in zip(self.bl_ts[bl_regf], self.bufshr_bs_t)] order_x = self.bl_ords[bl_regf] cnt_x = self._bl_tp(slice(bl_regf + 1, None)) bl_idxgen_list.append(self._gen_index_single_level(t_x, order_x)) @@ -412,8 +488,27 @@ def _calc_stats(self): else self.nld.total_access_at_of(me.GBUF, dce)) * self.fetch[self.BL.GBUF][dce] * self.num_nodes + / self.accfwd_reduction[dce] for dce in range(de.NUM)] + # NoC access. + self.bufshr_rotation_access = self._calc_bufshr_rotation_access( + self.bufshr_rot_fetch) + self.bufshr_wide_fetch_access = self._calc_bufshr_widefetch_access( + self.bufshr_wide_fetch) + self.noc_access = [a1 + a2 for a1, a2 + in zip(self.bufshr_rotation_access, + self.bufshr_wide_fetch_access)] + + if not self.src_is_dram: + self.remote_gbuf_access[de.IFM] += self.access[me.DRAM][de.IFM] + self.access[me.DRAM][de.IFM] = 0 + if not self.dst_is_dram: + self.remote_gbuf_access[de.OFM] += self.access[me.DRAM][de.OFM] + self.access[me.DRAM][de.OFM] = 0 + if self.filter_pinned: + assert self.access[me.DRAM][de.FIL] == 0 + # DRAM access time. self.dram_time = int(math.ceil(sum(self.access[me.DRAM]) / self.dram_bandwidth)) @@ -484,3 +579,458 @@ def _gen_index_single_level(t_x, order_x): # in LoopEnum order. yield tuple(idx[rev_order[lpe]] for lpe in range(le.NUM)) + def _set_accfwd(self, bufshr, options): + ''' + Set access forwarding (AF). + ''' + assert self.is_valid() and not self.finalized_stats + + # DRAM access reduction due to AF. This is the average reduction. Each + # node does not need to fetch exactly 1/N data. + self.accfwd_reduction = [1] * de.NUM + + if not options.hw_access_forwarding and not options.hw_gbuf_sharing: + return + + # If n nodes share the data, each node fetches 1/n of the data. + for dce in range(de.NUM): + self.accfwd_reduction[dce] = bufshr.size(dce) + + def _init_bufshr(self, bufshr, options): + ''' + Initialize buffer sharing (BS). + + Must be called before any buffered data size check. + ''' + assert not hasattr(self, "unit_cnt") + + # Total BS nodes + self.bufshr_grp_size = tuple(bufshr.size(dce) if options.hw_gbuf_sharing + else 1 for dce in range(de.NUM)) + # BS subgroup sizes. 
+        # The initial values are conservative, i.e., assuming the maximum
+        # shared capacity across nodes.
+        # They can be decreased later, but never increased.
+        self.bufshr_subgrp_size = self.bufshr_grp_size
+
+        # Additional BS level between DRAM and GBUF, split out from GBUF level.
+        self.bufshr_bs_t = (1,) * le.NUM
+        self.bufshr_bs_ord = tuple(range(le.NUM))
+
+        # NoC fetch due to rotation.
+        # The fetch times mean the number of hops each data element
+        # (considering all replicas) traverses over the entire nested loops.
+        # The total number of hops of all data over all nodes will be this
+        # value multiplied by the size of the unique data (without replicas).
+        self.bufshr_rot_fetch = [0.] * de.NUM
+        # Rotation round counts.
+        self.bufshr_rot_round_cnt = [0] * de.NUM
+        # Rotation unit counts.
+        self.bufshr_rot_unit_cnt = [1] * de.NUM
+
+        # NoC fetch due to wide fetch. Meaning similar to `bufshr_rot_fetch`.
+        self.bufshr_wide_fetch = [0.] * de.NUM
+        # Wide fetch widths.
+        self.bufshr_wide_fetch_width = [0.] * de.NUM
+
+    def _set_bufshr(self, resource, bufshr, options):
+        '''
+        Set buffer sharing (BS).
+
+        The GBUF level loops, i.e., ti/to/tb[1], decide the order and ranges
+        of the access to data buffered in GBUF, which could spread across
+        multiple nodes.
+
+        - Seq-acc and non-seq-acc data categories.
+
+        Depending on the loop structure, some data categories, whose related
+        loops are not adjacent but are split by the other unrelated loops,
+        have a non-perfectly-sequential access pattern, as the inner
+        dimensions will be accessed multiple times (due to the middle
+        unrelated loops) before switching to the next outer dimension. We
+        call these non-seq-acc data categories.
+
+        E.g., with a CONV layer, OFM is non-seq-acc with the following loop
+        structure:
+
+        for o
+          for i
+            for b
+
+        If there are < 3 non-trivial loops, there is no non-seq data category.
+
+        - Rotation unit.
+
+        The rotation unit for each data category is defined as the shifting
+        size for each rotation step. For seq-acc data categories, the rotation
+        unit is a single REGF unit. For a non-seq-acc data category, the
+        rotation unit is the product of all inner dimension sizes that are not
+        adjacent to the outermost dimension, i.e., we only rotate after all
+        the multiple accesses to the inner dimensions are done.
+
+        - Rotation round.
+
+        Given the definition of rotation unit above, the number of rotation
+        rounds is the product of all unrelated loop blocking factors above the
+        outermost dimension loop of this data category.
+
+        E.g., with the above loops, IFM (i, b) rotates `to` rounds, FIL (i, o)
+        rotates once, and OFM (o, b) rotates only once.
+
+        - Wide fetch.
+
+        The rotation unit size does not affect the NoC access of rotation
+        rounds, but there may be remote accesses without rotation, called wide
+        fetches, if the rotation unit does not fit in a single node GBUF.
+
+        - BS schemes.
+
+        When exploring the BS schemes, we keep the total accesses to DRAM,
+        GBUF, and REGF unchanged, i.e., the previously calculated fetch times
+        are still valid. This is guaranteed by fixing some innermost loops in
+        the GBUF level.
+
+        The other un-fixed loops (we call them flexible loops) can be
+        reordered or further blocked into an additional BS level between the
+        GBUF and DRAM levels. This additional level can help reduce NoC
+        accesses by splitting the data accesses into across-node and
+        within-node parts, and using up the data within a node before
+        switching to the next node.
+ + E.g., the above loop structure can become: + + for i-across-node + for o + for i-within-node + for b + + This optimization reduces IFM (i, b) rotation rounds from `to` to 1, + and increases OFM (o, b) rotation rounds from 1 to `i-across-node`, + i.e., subgroup size of IFM; it does not change FIL (i, o) rotation + rounds. + ''' + assert self.is_valid() and not self.finalized_stats + + if not options.hw_gbuf_sharing: + assert all(gs == 1 for gs in self.bufshr_grp_size) + return + + bl = self.BL.GBUF + blp1 = bl + 1 + + # If bypass GBUF, set subgroup size to 1. + self.bufshr_subgrp_size = tuple(sgs if self.data_size(bl, dce) else 1 + for dce, sgs + in enumerate(self.bufshr_subgrp_size)) + + if all(sgs == 1 for sgs in self.bufshr_subgrp_size): + return + + ## Loop structure. + + # The blocking factors and loop order that are related to BS. + t_x = self.bl_ts[blp1] + ord_x = self.bl_ords[blp1] + + # Non-trivial loops. + nt_loops = set(lpe for lpe in range(le.NUM) if t_x[lpe] > 1) + + # To keep fetch times to all hierarchies unchanged, we fix some loops + # without further blocking them in BS. See _set_fetch(), the + # (unrelated) loops inside the innermost non-trivial dim loop does not + # contribute to the fetch times, so we fix these loops for all data + # categories. + o_inntdim_loop = max( + (self._innt_dim_loop(dce, t_x, ord_x) for dce in range(de.NUM)), + key=lambda lpe: (ord_x[lpe] if lpe is not None else -1)) + # A tuple in the order of outer to inner, i.e., sort by inverse order. + fixed_loops = tuple(sorted( + (lpe for lpe in nt_loops if ord_x[lpe] < ord_x[o_inntdim_loop]), + key=lambda lpe: ord_x[lpe], + reverse=True)) + + # The loops that can be further blocked without affecting the fetch + # times to all hierarchies. + flex_loops = nt_loops.difference(fixed_loops) + + ## Subgroup size candidates. + + def _min_subgrp_size(*dce_list): + ''' + Get the minimum BS subgroup size, but not changing the current + subgroup size. Minimize in the order of the given `dce_list`. + ''' + # No duplication. + assert len(dce_list) == len(set(dce_list)) + + # Free capacity in each node's GBUF. + free_cap = resource.size_gbuf - self.data_size(bl) + + sgs_list = list(self.bufshr_subgrp_size) + + for dce in dce_list: + # Skip no sharing case. + if sgs_list[dce] <= 1: + continue + + cur_dsz = self.data_size(bl, dce) + tot_dsz = cur_dsz * self.bufshr_subgrp_size[dce] + assert cur_dsz > 0 and tot_dsz > 0 + + # min. sgs + # s.t. tot_dsz / sgs <= free_cap + cur_dsz. + for sgs in range(sgs_list[dce], 0, -1): + if self.bufshr_grp_size[dce] % sgs != 0: + # Require subgroup size to be a factor of the group + # size. + continue + if util.idivc(tot_dsz, sgs) <= free_cap + cur_dsz: + sgs_list[dce] = sgs + else: + break + + # Reduce free capacity. + free_cap -= util.idivc(tot_dsz, sgs_list[dce]) - cur_dsz + assert free_cap >= 0 + + return tuple(sgs_list) + + # Original subgroup size. + subgrp_size_cands = [self.bufshr_subgrp_size] + # Reduce subgroup size if data can fit in fewer nodes. Consider all + # orders about which data first shrink. + subgrp_size_cands += set(_min_subgrp_size(*dce_list) for dce_list + in itertools.permutations(range(de.NUM))) + + ## Sweep all BS schemes. + + def _sweep_bufshr(): + for subgrp_size in subgrp_size_cands: + + # `flex_loops` can be further blocked in BS, while others + # cannot (set to 1). 
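+                # (E.g., for the docstring example, a flexible ifmap loop of
+                # factor 4 may be factorized into t_bs = 2 across-node
+                # iterations at the BS level times 2 within-node iterations
+                # left at the GBUF level.)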
+ t_bs_tot = [t_x[lpe] if lpe in flex_loops else 1 + for lpe in range(le.NUM)] + + for t_bs_frac in itertools.product( + *[util.factorize(t, 2) for t in t_bs_tot]): + t_bs = tuple(t[0] for t in t_bs_frac) + + loops_bs_trivial = tuple(lpe for lpe in flex_loops + if t_bs[lpe] == 1) + + for loops_bs_nontrivial, loops_bot in itertools.product( + itertools.permutations([lpe for lpe in flex_loops + if t_bs[lpe] > 1]), + itertools.permutations(flex_loops)): + + loops_bs = loops_bs_trivial + loops_bs_nontrivial + + yield subgrp_size, t_bs, loops_bs, loops_bot + + ## BS NoC fetch times. + + dim_loops = [self.nld.data_loops[dce].loops() for dce in range(de.NUM)] + + def _is_dim_loop(lpe, dce, _dim_loops=dim_loops): + return lpe in _dim_loops[dce] + + def _calc_bufshr_fetch(subgrp_size, t_bs, loops_bs, loops_bot): + ''' + Calculate the BS scheme NoC fetch times. Return rotation fetch, + wide fetch, and other statistics. + + `subgrp_size` is the BS subgroup size for each data category. + + `t_bs` is the blocking factors indexed by LoopEnum for the + additional BS level between DRAM and GBUF, i.e., above `blp1`. They + are fractorized from `t_x`. Only those in `flex_loops` can have + non-1 values. + + `loops_bs` and `loops_bot` are ordered tuples of `flex_loops` from + outer to inner, for the additional BS level and the original GBUF + level (at the bottom) respectively. + ''' + assert set(loops_bs) == set(loops_bot) == flex_loops + assert all(b <= x for b, x in zip(t_bs, t_x)) + assert all(t_bs[lpe] == 1 or lpe in flex_loops + for lpe in range(le.NUM)) + + # Make a list of tuples (LoopEnum, blocking factor)`, each + # corresponds to a non-trivial loop in the additional BS level and + # the original GBUF level, ordered from outer to inner. + lp_t_list = [] + # Additional BS level. + lp_t_list += [(lpe, t_bs[lpe]) + for lpe in loops_bs if t_bs[lpe] > 1] + # GBUF level flex loops. + lp_t_list += [(lpe, util.idivc(t_x[lpe], t_bs[lpe])) + for lpe in loops_bot if t_x[lpe] > t_bs[lpe]] + # GBUF level fixed loops. + lp_t_list += [(lpe, t_x[lpe]) for lpe in fixed_loops] + # Check. + assert all(tpl[1] > 1 for tpl in lp_t_list) + + # Total rotation rounds (over all GBUF filling). + rot_rnd_cnts = [] + # Number of rotation units. + rot_unit_cnts = [] + # Wide fetch widths. + wide_fetch_widths = [] + + # Rotation NoC fetch times. + rot_fetch = [] + # Wide fetch NoC fetch times. + wide_fetch = [] + + for dce in range(de.NUM): + + buf_fetch = self.fetch[blp1][dce] + mem_fetch = self.fetch[blp1-1][dce] + + # Index of the outermost dim loop in `lp_t_list`. None if all + # dim loops are trivial. + idx_odlp = next((i for i, tpl in enumerate(lp_t_list) + if _is_dim_loop(tpl[0], dce)), + None) + + # Rotation rounds. + rotrnds = 1 + if idx_odlp is None or subgrp_size[dce] == 1: + # No rotation. + rotrnds = 0 + elif idx_odlp is not None: + # All unrelated loop factors above the outermost dim loop. + # At DRAM level. + rotrnds *= util.prod(self.nld.data_loops[dce] + .drop(self._bl_tp(slice(blp1)))) + # At GBUF level. + rotrnds *= util.prod(tpl[1] for tpl + in itertools.islice(lp_t_list, + idx_odlp)) + assert ((buf_fetch + 1) // 2 if dce == de.OFM + else buf_fetch) % rotrnds == 0 + assert rotrnds % ((mem_fetch + 1) // 2 if dce == de.OFM + else mem_fetch) == 0 + # Optimization: after fetching data into GBUF, if the data only + # rotate a single time before being replaced, we do not need to + # store them after this single use. 
So instead we can stream + # each rotation unit to all the nodes, and replace it by the + # next rotation unit one by one. This is already supported as + # the data will be broadcast to all nodes regardless of who + # stores it (see partition). + if rotrnds == ((mem_fetch + 1) // 2 if dce == de.OFM + else mem_fetch): + rotrnds = 0 + rot_rnd_cnts.append(rotrnds) + + # Number of rotation units. + rotunits = 1 + # All dimension sizes of the outermost adjacent dim loops. + if idx_odlp is not None: + rotunits = util.prod(tpl[1] for tpl + in itertools.takewhile( + lambda tpl, dce_=dce: + _is_dim_loop(tpl[0], dce_), + itertools.islice(lp_t_list, + idx_odlp, None))) + rot_unit_cnts.append(rotunits) + + # Wide fetch width. + wf_width = 1. * subgrp_size[dce] / rotunits + wide_fetch_widths.append(wf_width) + + # Wide fetch times. + wf_per_bufacc = bufshr.nhops_wide_fetch_once( + dce, subgrp_size[dce], wf_width) + # Use REGF filling (GBUF fetch). + # The last wide fetch before rotation can be combined with the + # rotation steps. + if dce == de.OFM: + # For OFM, if we do multiple wide fetch per rotation step, + # the last one has both read and write. If there is only + # one wide fetch per rotation step, it only has write. + if buf_fetch > 2 * rotrnds - 1: + comb_wf_fetch = 2 * rotrnds + else: + assert buf_fetch == 2 * rotrnds - 1 + comb_wf_fetch = 2 * rotrnds - 1 + else: + comb_wf_fetch = rotrnds + # Since we do not rotate the last step, when wide fetch is + # non-0 (i.e., the last rotation unit is larger than one node + # buffer size), the wide fetch of the last unit has no rotation + # to combine with. + comb_wf_fetch *= 1. * (rotunits - 1) / rotunits + wf = wf_per_bufacc * (buf_fetch - comb_wf_fetch) + assert wf > -1e-4 + wide_fetch.append(wf) + + # Rotation fetch times. + rf_per_rot = bufshr.nhops_rotate_all( + dce, subgrp_size[dce], rotunits) + rf = rf_per_rot * rotrnds + rot_fetch.append(rf) + + return rot_fetch, wide_fetch, \ + rot_rnd_cnts, rot_unit_cnts, wide_fetch_widths + + ## Search for the best BS scheme. + + def _key_func(tuple_): + rot_fetch, wide_fetch = _calc_bufshr_fetch(*tuple_)[:2] + return sum(self._calc_bufshr_rotation_access(rot_fetch)) \ + + sum(self._calc_bufshr_widefetch_access(wide_fetch)) + subgrp_size, t_bs, loops_bs, loops_bot = \ + min(_sweep_bufshr(), key=_key_func) + + # Subgroup size. + self.bufshr_subgrp_size = subgrp_size + + # Loop blocking factors and order. + new_ord = [-1] * le.NUM + ord_idx = 0 + for lpe in reversed(loops_bot + fixed_loops): + new_ord[lpe] = ord_idx + ord_idx += 1 + for lpe in range(le.NUM): + if new_ord[lpe] < 0: + new_ord[lpe] = ord_idx + ord_idx += 1 + self.bl_ords[blp1] = tuple(new_ord) + + # Additional BS level. + new_ord_bs = [-1] * le.NUM + ord_idx = 0 + for lpe in reversed(loops_bs): + if t_bs[lpe] > 1: + new_ord_bs[lpe] = ord_idx + ord_idx += 1 + for lpe in range(le.NUM): + if new_ord_bs[lpe] < 0: + new_ord_bs[lpe] = ord_idx + ord_idx += 1 + self.bufshr_bs_t = tuple(t_bs) + self.bufshr_bs_ord = tuple(new_ord_bs) + + # Set stats. + self.bufshr_rot_fetch, self.bufshr_wide_fetch, \ + self.bufshr_rot_round_cnt, self.bufshr_rot_unit_cnt, \ + self.bufshr_wide_fetch_width = \ + _calc_bufshr_fetch(subgrp_size, t_bs, loops_bs, loops_bot) + + def _calc_bufshr_rotation_access(self, bufshr_rot_fetch): + ''' Calculate the BS rotation NoC accesses, over all nodes. ''' + # All-node access needs to multiply number of groups. 
+        return [self.total_access_gbuf[dce]
+                * bufshr_rot_fetch[dce]
+                * (self.num_nodes // self.bufshr_grp_size[dce])
+                for dce in range(de.NUM)]
+
+    def _calc_bufshr_widefetch_access(self, bufshr_wide_fetch):
+        ''' Calculate the BS wide fetch NoC accesses, over all nodes. '''
+        # All-node access needs to multiply number of groups.
+        return [self.total_access_gbuf[dce]
+                * bufshr_wide_fetch[dce]
+                * (self.num_nodes // self.bufshr_grp_size[dce])
+                for dce in range(de.NUM)]
+
diff --git a/nn_dataflow/core/nn_dataflow.py b/nn_dataflow/core/nn_dataflow.py
index 4d3ec98..d489455 100644
--- a/nn_dataflow/core/nn_dataflow.py
+++ b/nn_dataflow/core/nn_dataflow.py
@@ -13,6 +13,7 @@
 program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
 """
 
+from collections import defaultdict
 import itertools
 import sys
 
@@ -20,6 +21,7 @@
 from .cost import Cost
 from .data_layout import DataLayout
 from .fmap_range import FmapPosition, FmapRange
+from .inter_layer_pipeline import InterLayerPipeline
 from .map_strategy import MapStrategy
 from .network import Network
 from .nn_dataflow_scheme import NNDataflowScheme
@@ -63,6 +65,16 @@ def __init__(self, network, batch_size, resource, cost, map_strategy):
             layer2sched[layer] = sched
             self.layer_sched_dict[layer_name] = sched
 
+        # Inter-layer pipelining.
+        self.ilp = InterLayerPipeline(self.network, self.batch_size,
+                                      self.resource)
+        self.ordered_layer_list = self.ilp.ordered_layer_list()
+
+        # NNDataflowScheme tops.
+        # The top schemes are organized by their ending layers, and are kept
+        # extended until reaching the end of the network.
+        self.nndf_tops = {}
+
         # Default compare key function.
         self.cmp_key = lambda nndf: (nndf.total_cost, nndf.total_time)
 
@@ -78,22 +90,52 @@ def schedule_search(self, options):
         else:
             assert options.opt_goal == 'e'
 
+        # Group the segments by their ending layers.
+        segments = defaultdict(list)
+        for seg in self.ilp.gen_segment(options):
+            if seg not in segments[seg[-1][-1]]:
+                segments[seg[-1][-1]].append(seg)
+
         # Clear and reset.
-        nndf_tops = []
+        self.nndf_tops = {}
 
         # Initial input layout.
+        self.nndf_tops[None] = []
         for input_layout, ext_layout_dict in self._gen_input_layout(options):
             nndf = NNDataflowScheme(self.network, input_layout,
                                     ext_layout_dict)
-            nndf_tops.append(nndf)
+            self.nndf_tops[None].append(nndf)
 
         # Schedule layers.
-        for layer_name in self.network:
+        for layer_name in self.ordered_layer_list:
             if options.verbose:
                 sys.stderr.write('-> {}\n'.format(layer_name))
                 sys.stderr.flush()
-            nndf_tops = self._layer_schedule_search(
-                layer_name, nndf_tops, options)
+
+            # The top schemes ending with the current layer.
+            tops = []
+
+            # The segments ending with the current layer. Use them to extend
+            # the current top schemes.
+            for seg in segments[layer_name]:
+                if options.verbose:
+                    sys.stderr.write(' - {}\n'.format(seg.seg))
+                    sys.stderr.flush()
+                tops += self._segment_schedule_search(seg, options)
+
+            # Always pick and keep top n.
+            tops = sorted(tops, key=self.cmp_key)[:options.ntops]
+
+            # Add to the top list.
+            assert layer_name not in self.nndf_tops
+            self.nndf_tops[layer_name] = tops
+
+        # Final top schemes.
+        nndf_tops = self.nndf_tops.get(self.ordered_layer_list[-1], [])
+        if not nndf_tops:
+            sys.stderr.write('No valid schedule found for {}.\n'
+                             .format(self.network.net_name))
+        for nndf in nndf_tops:
+            assert len(nndf) == len(self.network)
 
         # Cache stats.
         cache_hits = 0
@@ -109,12 +151,100 @@
 
         return nndf_tops, (cache_hits, cache_misses)
 
-    def _layer_schedule_search(self, layer_name, prev_nndf_tops, options):
+    def _segment_schedule_search(self, segment, options):
+        '''
+        Schedule the given PipelineSegment `segment`.
+
+        Return new top NNDataflowScheme instances that include this segment.
+        Will NOT update the `nndf_tops` attribute.
+        '''
+        # We take the top schemes that end with the latest previous layer as
+        # the initial state.
+        first_layer_idx = self.ordered_layer_list.index(segment[0][0])
+        if first_layer_idx == 0:
+            prev_nndf_tops = self.nndf_tops[None]
+        else:
+            prev_nndf_tops = self.nndf_tops.get(
+                self.ordered_layer_list[first_layer_idx - 1], [])
+        if not prev_nndf_tops:
+            return []
+
+        # New top schemes.
+        nndf_tops = []
+
+        # Allocation.
+        allocation = segment.allocation()
+
+        # Forwarding data regions. Map a spatial index to the forwarding
+        # region.
+        fwd_data_region_dict = {}
+        for sh_list in segment.ifm_fwd_dict.values():
+            # A list of spatial indices that share the same ifmaps.
+            r = allocation[sh_list[0].sp_idx][sh_list[0].tm_idx].proc_region
+            for idx in sh_list[1:]:
+                fwd_data_region_dict[idx] = r
+        for fwd_src, fwd_dst_list in segment.ofm_fwd_dict.items():
+            # Ofmaps forwarded to neighbors.
+            r = allocation[fwd_src.sp_idx][fwd_src.tm_idx].proc_region
+            for idx in fwd_dst_list:
+                fwd_data_region_dict[idx] = r
+
+        # Max allowed time overhead for segment timing.
+        max_time_ovhd = options.layer_pipeline_time_ovhd
+
+        # Cost hint Pareto-optimal frontier.
+        frontier = set()
+
+        # Explore constraints.
+        for constraint, hints in segment.gen_constraint(max_time_ovhd):
+
+            # Filter out off-frontier constraints.
+            if any(all(h >= fh for h, fh in zip(hints, fhints))
+                   for fhints in frontier):
+                continue
+
+            # Start from the previous top schemes.
+            curr_nndf_tops = prev_nndf_tops
+
+            # Spatial scheduling.
+            for sp_idx, (ltpl, rtpl, ctpl) \
+                    in enumerate(zip(segment, allocation, constraint)):
+
+                # Temporal scheduling.
+                for tm_idx, (layer, resource, cstr) \
+                        in enumerate(zip(ltpl, rtpl, ctpl)):
+
+                    curr_nndf_tops = self._layer_schedule_search(
+                        layer, resource, cstr, sp_idx, tm_idx,
+                        fwd_data_region_dict.get((sp_idx, tm_idx)),
+                        curr_nndf_tops, options)
+
+            # Filter by time limit.
+            seg_nndf_tops = [nndf for nndf in curr_nndf_tops
+                             if all(timing.time_overhead <= max_time_ovhd
+                                    for timing in nndf.segment_timing_list)]
+
+            # Add to frontier.
+            if seg_nndf_tops:
+                frontier.add(hints)
+
+            nndf_tops += seg_nndf_tops
+
+        # Always pick and keep top n.
+        return sorted(nndf_tops, key=self.cmp_key)[:options.ntops]
+
+    def _layer_schedule_search(self, layer_name, resource, constraint,
+                               spatial_idx, temporal_idx, fwd_data_region,
+                               prev_nndf_tops, options):
         '''
         Schedule the given layer under the given previous top NNDataflowScheme
         instances in `prev_nndf_tops`.
 
-        Return new top NNDataflowScheme instances that include this layer.
+        `spatial_idx` and `temporal_idx` give the spatial and temporal
+        scheduling indices in the segment. The segment index is inferred from
+        the previous top schemes.
+
+        Return new top NNDataflowScheme instances that include this layer. Will
+        NOT update the `nndf_tops` attribute.
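+
+        For example, the layer scheduled second in time (temporal index 1) on
+        the first spatial subregion (spatial index 0) of segment 2 gets
+        sched_seq == (2, 0, 1).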
         '''
         nndf_tops = []
 
@@ -124,8 +254,27 @@ def _layer_schedule_search(self, layer_name, prev_nndf_tops, options):
 
             ifmap_layout = prev_nndf.fmap_layout(self.network.prevs(layer_name))
 
-            condition = SchedulingCondition(resource=self.resource,
-                                            ifmap_layout=ifmap_layout)
+            if fwd_data_region is not None:
+                # Remap source data regions to the forwarding region.
+                ifmap_layout = DataLayout(
+                    frngs=ifmap_layout.frngs,
+                    regions=(fwd_data_region,) * len(ifmap_layout.frngs),
+                    parts=tuple(p.projection(fwd_data_region, appl2frng=True)
+                                for p in ifmap_layout.parts))
+
+            segment_idx = prev_nndf.last_seg_idx
+            if spatial_idx == 0 and temporal_idx == 0:
+                # New segment.
+                segment_idx += 1
+
+            sched_seq = (segment_idx, spatial_idx, temporal_idx)
+
+            constraint.update_by_prev(prev_nndf)
+
+            condition = SchedulingCondition(resource=resource,
+                                            constraint=constraint,
+                                            ifmap_layout=ifmap_layout,
+                                            sched_seq=sched_seq)
 
             try:
                 sched_tops = layer_sched.schedule_search(condition, options)
diff --git a/nn_dataflow/core/nn_dataflow_scheme.py b/nn_dataflow/core/nn_dataflow_scheme.py
index 2e027e4..7eba77a 100644
--- a/nn_dataflow/core/nn_dataflow_scheme.py
+++ b/nn_dataflow/core/nn_dataflow_scheme.py
@@ -19,6 +19,7 @@
 from .. import util
 from .data_layout import DataLayout
 from .network import Network
+from .pipeline_segment_timing import PipelineSegmentTiming
 from .scheduling import SchedulingResult
 
 class NNDataflowScheme(MutableMapping):
@@ -53,8 +54,16 @@ def __init__(self, network, input_layout, ext_layout_dict=None):
 
         self.res_dict = OrderedDict()
 
-        self.total_cost = 0
-        self.total_time = 0
+        # Naive sum of all layer cost.
+        self.sum_cost = 0
+        self.sum_static_cost = 0
+        # Naive sum of all layer time, used to adjust cost.
+        self.sum_time = 0
+
+        # A list of segment schedule timing information.
+        self.segment_timing_list = []
+
+        self.last_seg_idx = -1
 
     def __getitem__(self, layer_name):
         ''' Get the SchedulingResult of a scheduled layer. '''
@@ -84,8 +93,23 @@ def __setitem__(self, layer_name, sched_result):
 
         self.res_dict[layer_name] = sched_result
 
-        self.total_cost += sched_result.total_cost
-        self.total_time += sched_result.total_time
+        self.sum_cost += sched_result.total_cost
+        self.sum_static_cost += sched_result.scheme['cost_static']
+        self.sum_time += sched_result.total_time
+
+        seg_idx = sched_result.sched_seq[0]
+        if seg_idx == self.last_seg_idx + 1:
+            self.segment_timing_list.append(
+                PipelineSegmentTiming(self.network, seg_idx))
+            self.last_seg_idx += 1
+        elif seg_idx == self.last_seg_idx:
+            pass
+        else:
+            raise ValueError('NNDataflowScheme: segment index is invalid. '
+                             'segment {} follows {}.'
+                             .format(seg_idx, self.last_seg_idx))
+        assert len(self.segment_timing_list) - 1 == self.last_seg_idx
+        self.segment_timing_list[-1].add(layer_name, sched_result)
 
     def __delitem__(self, layer_name):
         ''' Not legal to call. '''
@@ -129,6 +153,25 @@ def _ofmap_layout(layer_name):
 
         return DataLayout.concat(*[_ofmap_layout(l) for l in layers])
 
+    @property
+    def total_cost(self):
+        ''' Get the total cost. '''
+        if self.sum_time == 0:
+            return self.sum_cost
+        overcounted_static_cost = (self.sum_static_cost
+                                   * (1 - 1. * self.total_time / self.sum_time))
+        return self.sum_cost - overcounted_static_cost
+
+    @property
+    def total_time(self):
+        ''' Get the total time. '''
+        # Special case, when the entire network fits in one segment. No
+        # pipeline filling/draining delay.
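+        # Otherwise, sum over all segments; each segment time already
+        # includes its own pipeline filling and draining.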
+        if len(self.segment_timing_list) == 1 \
+                and self.__len__() == len(self.network):
+            return self.segment_timing_list[0].critical_time
+        return sum(t.time for t in self.segment_timing_list)
+
     @property
     def total_ops(self):
         ''' Get the total ops. '''
@@ -147,6 +190,16 @@ def total_noc_hops(self):
         ''' Get the total NoC hops. '''
         return sum(sr.total_noc_hops for sr in self.values())
 
+    def segment_time_list(self):
+        ''' Get the time for each segment. '''
+        return [t.time for t in self.segment_timing_list]
+
+    def segment_dram_time_list(self):
+        '''
+        Get the time for each segment on DRAM access.
+        '''
+        return [t.dram_time for t in self.segment_timing_list]
+
     def perlayer_stats(self, stats_name):
         '''
         Get a dict of per-layer stats. Valid stats must be a static method.
diff --git a/nn_dataflow/core/node_region.py b/nn_dataflow/core/node_region.py
index 06baec8..2d3b3df 100644
--- a/nn_dataflow/core/node_region.py
+++ b/nn_dataflow/core/node_region.py
@@ -16,12 +16,15 @@
 import itertools
 from collections import namedtuple
 
+from .. import util
 from .phy_dim2 import PhyDim2
 
 NODE_REGION_LIST = ['dim',
                     'origin',
                     'dist',
                     'type',
+                    'wtot',
+                    'wbeg',
                    ]
 
 class NodeRegion(namedtuple('NodeRegion', NODE_REGION_LIST)):
@@ -31,6 +34,26 @@ class NodeRegion(namedtuple('NodeRegion', NODE_REGION_LIST)):
     The `type` attribute specifies the region type, which could be `PROC` for
     computation processing nodes or `DRAM` for off-chip data storage nodes.
 
+    The node region can be optionally folded along the w dimension in a
+    zig-zag manner. The folding scheme is defined by (wtot, wbeg). `wtot` is
+    always positive, representing the number of nodes between two turns (total
+    width). `wbeg` is the number of nodes before reaching the first turning
+    boundary, with its sign representing the direction. E.g.,
+
+                  ...
+    ******************
+              ********
+              | wbeg |
+
+    or
+
+                  ...
+    ******************
+    *********
+    | -wbeg |
+
+    With folded region, `origin` points to the first node.
+
     NOTE: we cannot overload __contains__ and __iter__ as a node container,
     because the base namedtuple already defines them.
     '''
@@ -46,6 +69,12 @@ def __new__(cls, *args, **kwargs):
         kwargs2 = kwargs.copy()
         if len(args) <= NODE_REGION_LIST.index('dist'):
             kwargs2.setdefault('dist', PhyDim2(1, 1))
+        if len(args) <= NODE_REGION_LIST.index('wtot'):
+            # Default to dim.w but we haven't checked dim yet. Replace later.
+            kwargs2.setdefault('wtot', None)
+        if len(args) <= NODE_REGION_LIST.index('wbeg'):
+            # Default to wtot. Also replace later.
+            kwargs2.setdefault('wbeg', None)
 
         ntp = super(NodeRegion, cls).__new__(cls, *args, **kwargs2)
 
@@ -59,6 +88,19 @@ def __new__(cls, *args, **kwargs):
         if ntp.type not in range(cls.NUM):
             raise ValueError('NodeRegion: type must be a valid type enum.')
 
+        if ntp.wtot is None:
+            ntp = ntp._replace(wtot=ntp.dim.w)
+        if ntp.wbeg is None:
+            ntp = ntp._replace(wbeg=ntp.wtot)
+
+        if not isinstance(ntp.wtot, int):
+            raise TypeError('NodeRegion: wtot must be an int.')
+        if not isinstance(ntp.wbeg, int):
+            raise TypeError('NodeRegion: wbeg must be an int.')
+
+        if not (0 < abs(ntp.wbeg) <= ntp.wtot) and ntp.dim.size() > 0:
+            raise ValueError('NodeRegion: |wbeg| must be in (0, wtot].')
+
         return ntp
 
     def contains_node(self, coordinate):
@@ -79,6 +121,86 @@ def rel2abs(self, rel_coordinate):
             raise ValueError('NodeRegion: relative coordinate {} is not in '
                              'node region {}.'.format(rel_coordinate, self))
 
-        abs_coordinate = self.origin + rel_coordinate * self.dist
+        # Add a starting offset to start from the boundary before the first
+        # node, then modulo wtot to get the delta h and w to this boundary
+        # point.
+        h, w = divmod(rel_coordinate.w + self.wtot - abs(self.wbeg), self.wtot)
+        # Direction for w, changing every time h increments.
+        direction = (-1 if self.wbeg < 0 else 1) * (-1 if h % 2 else 1)
+        # Make w relative to the left boundary.
+        w = w if direction > 0 else self.wtot - 1 - w
+
+        abs_coordinate = self.origin \
+                + PhyDim2(h=h * self.dim.h + rel_coordinate.h,
+                          w=w - (self.wtot - self.wbeg if self.wbeg > 0
+                                 else -self.wbeg - 1)) \
+                * self.dist
 
         return abs_coordinate
 
+    def allocate(self, request_list):
+        '''
+        Allocate node subregions spatially within the node region according to
+        the given `request_list`, which is a list of the numbers of nodes
+        requested.
+
+        Return a list of NodeRegion instances, whose origins are absolute
+        offsets (not relative to the origin of self). The allocation fails if
+        and only if the total number of nodes requested is larger than the
+        number of nodes in the region, in which case an empty list is
+        returned.
+
+        The strategy is to allocate stripe-wise in a zig-zag order, allowing
+        for folding in width. We first determine a stripe height as the
+        greatest common divisor of the requested numbers of nodes. Then we
+        allocate each request as (stripe height, request size / stripe height)
+        to fill in the stripe, and move to the next stripe after the current
+        one is filled. If the width of a request is larger than the remaining
+        width of the current stripe, we use up the remaining width, and fold
+        the request width to the next stripe.
+        '''
+
+        if sum(request_list) > self.dim.size():
+            return []
+
+        hstrp = util.gcd(self.dim.h, *request_list)
+        subregions = []
+
+        wtot = self.dim.w
+        ofs_h, ofs_w = 0, 0
+        move_right = True
+
+        for req in request_list:
+
+            # Subregion.
+            assert req % hstrp == 0
+            width = req // hstrp
+
+            subdim = PhyDim2(hstrp, width)
+            if move_right:
+                origin = PhyDim2(ofs_h, ofs_w)
+                wbeg = min(wtot - ofs_w, width)
+                assert wbeg > 0
+            else:
+                origin = PhyDim2(ofs_h, self.dim.w - ofs_w - 1)
+                wbeg = -min(wtot - ofs_w, width)
+                assert wbeg < 0
+
+            subregions.append(NodeRegion(dim=subdim,
+                                         origin=self.origin \
+                                                 + origin * self.dist,
+                                         dist=self.dist,
+                                         type=self.type,
+                                         wtot=wtot,
+                                         wbeg=wbeg))
+
+            # Move the offset.
+            ofs_w += width
+            while ofs_w >= self.dim.w:
+                # Overflow, fold to the next stripe.
+                ofs_w -= self.dim.w
+                ofs_h += hstrp
+                move_right = not move_right
+
+            # Must not have moved outside the region.
+            assert ofs_h + hstrp <= self.dim.h or ofs_w == 0
+
+        return subregions
+
diff --git a/nn_dataflow/core/option.py b/nn_dataflow/core/option.py
index 451f044..968bf72 100644
--- a/nn_dataflow/core/option.py
+++ b/nn_dataflow/core/option.py
@@ -19,9 +19,16 @@
 OPTION_LIST = ['sw_gbuf_bypass',
                'sw_solve_loopblocking',
+               'hw_access_forwarding',
+               'hw_gbuf_sharing',
+               'hw_gbuf_save_writeback',
                'partition_hybrid',
                'partition_batch',
                'partition_ifmaps',
+               'partition_interlayer',
+               'layer_pipeline_time_ovhd',
+               'layer_pipeline_max_degree',
+               'layer_pipeline_opt',
                'opt_goal',
                'ntops',
                'nprocesses',
@@ -55,9 +62,16 @@ def __new__(cls, *args, **kwargs):
         kwdict.setdefault('sw_gbuf_bypass', (False,) * de.NUM)
         kwdict.setdefault('sw_solve_loopblocking', False)
+        kwdict.setdefault('hw_access_forwarding', False)
+        kwdict.setdefault('hw_gbuf_sharing', False)
+        kwdict.setdefault('hw_gbuf_save_writeback', False)
         kwdict.setdefault('partition_hybrid', False)
         kwdict.setdefault('partition_batch', False)
         kwdict.setdefault('partition_ifmaps', False)
+        kwdict.setdefault('partition_interlayer', False)
+        kwdict.setdefault('layer_pipeline_time_ovhd', float('inf'))
+        kwdict.setdefault('layer_pipeline_max_degree', float('inf'))
+        kwdict.setdefault('layer_pipeline_opt', True)
         kwdict.setdefault('opt_goal', 'e')
         kwdict.setdefault('ntops', 1)
         kwdict.setdefault('nprocesses', 1)
@@ -73,10 +87,38 @@ def __new__(cls, *args, **kwargs):
             raise ValueError('Option: sw_gbuf_bypass must have length {}'
                              .format(de.NUM))
 
+        if ntp.sw_solve_loopblocking and ntp.hw_gbuf_sharing:
+            raise ValueError('Option: sw_solve_loopblocking and '
+                             'hw_gbuf_sharing cannot be simultaneously '
+                             'enabled.')
+
+        if ntp.hw_access_forwarding and ntp.hw_gbuf_sharing:
+            raise ValueError('Option: hw_access_forwarding is implied by '
+                             'hw_gbuf_sharing, thus cannot be both enabled.')
+
+        if ntp.sw_solve_loopblocking and ntp.hw_gbuf_save_writeback:
+            raise ValueError('Option: sw_solve_loopblocking and '
+                             'hw_gbuf_save_writeback cannot be simultaneously '
+                             'enabled.')
+
         if ntp.partition_ifmaps and not ntp.partition_hybrid:
             raise ValueError('Option: partition_ifmaps requires '
                              'partition_hybrid to be set.')
 
+        if not isinstance(ntp.layer_pipeline_time_ovhd, (int, float)):
+            raise TypeError('Option: layer_pipeline_time_ovhd must be a '
+                            'number.')
+        if ntp.layer_pipeline_time_ovhd < 0:
+            raise ValueError('Option: layer_pipeline_time_ovhd must be '
+                             'non-negative.')
+
+        if not isinstance(ntp.layer_pipeline_max_degree, (int, float)):
+            raise TypeError('Option: layer_pipeline_max_degree must be a '
+                            'number.')
+        if ntp.layer_pipeline_max_degree < 0:
+            raise ValueError('Option: layer_pipeline_max_degree must be '
+                             'non-negative.')
+
         if ntp.opt_goal not in ['e', 'd', 'ed']:
             raise ValueError('Option: opt_goal is invalid, must be one of '
                              '\'e\', \'d\', and \'ed\'.')
diff --git a/nn_dataflow/core/partition.py b/nn_dataflow/core/partition.py
index 893599f..6ec2535 100644
--- a/nn_dataflow/core/partition.py
+++ b/nn_dataflow/core/partition.py
@@ -256,8 +256,6 @@ def unit_nhops_to_proc_region(layer, batch_size, region, part,
     category.
     '''
 
-    del options
-
     # FmapRange --> list of node coordinates processing this data.
     fil_dict = {}
     ofm_dict = {}
@@ -285,23 +283,29 @@ def unit_nhops_to_proc_region(layer, batch_size, region, part,
     ifm_dict = util.HashableDict.fromdict(ifm_dict, valfunc=tuple)
     ofm_dict = util.HashableDict.fromdict(ofm_dict, valfunc=tuple)
 
+    # When using access forwarding, each piece of data is only fetched by the
+    # closest node, and then forwarded to ALL nodes that process it, regardless
+    # of which nodes initially store it. In this way, the access forwarding
+    # nhops are independent of the buffer sharing scheme.
+    fwd = options.hw_access_forwarding or options.hw_gbuf_sharing
+
     nhops = [0] * de.NUM
 
-    nhops[de.FIL] = _unit_nhops_to_fil(layer, filter_nodes, fil_dict)
+    nhops[de.FIL] = _unit_nhops_to_fil(layer, filter_nodes, fil_dict, fwd)
 
-    nhops[de.IFM] = _unit_nhops_to_ifm(ifmap_layout, ifm_dict)
+    nhops[de.IFM] = _unit_nhops_to_ifm(ifmap_layout, ifm_dict, fwd)
 
     if ofmap_layout.parts == (part,) and ofmap_layout.regions == (region,):
         # Ofmaps are stored locally, no data transfer.
         pass
     else:
-        nhops[de.OFM] = _unit_nhops_to_ofm(ofmap_layout, ofm_dict)
+        nhops[de.OFM] = _unit_nhops_to_ofm(ofmap_layout, ofm_dict, fwd)
 
     return nhops
 
 @fastcache.clru_cache(maxsize=1024)
-def _unit_nhops_to_fil(layer, filter_nodes, fil_dict):
+def _unit_nhops_to_fil(layer, filter_nodes, fil_dict, fwd=False):
     '''
     Get the total number of hops to transfer filter data.
@@ -312,16 +316,31 @@ def _unit_nhops_to_fil(layer, filter_nodes, fil_dict):
     for filrng, coord_list in fil_dict.items():
         fil_size = filrng[0].size() * filrng[1].size() * layer.filter_size()
 
-        # Min hops to each processing node across all filter source nodes.
-        min_hops = [min(coord.hop_dist(c) for c in filter_nodes)
-                    for coord in coord_list]
-        nhops += fil_size * sum(min_hops)
+        if fwd:
+            # Data can be forwarded from all sources to any destination.
+            src_set = set(filter_nodes)
+            dst_set = set(coord_list)
+
+            while dst_set:
+                # In each forward step, get the min-distance pair of source
+                # and destination.
+                src, dst = min(itertools.product(src_set, dst_set),
+                               key=lambda (s, d): d.hop_dist(s))
+                dst_set.remove(dst)
+                src_set.add(dst)
+                nhops += fil_size * dst.hop_dist(src)
+
+        else:
+            # Min hops to each processing node across all filter source nodes.
+            min_hops = [min(coord.hop_dist(c) for c in filter_nodes)
+                        for coord in coord_list]
+            nhops += fil_size * sum(min_hops)
 
     return nhops
 
 @fastcache.clru_cache(maxsize=1024)
-def _unit_nhops_to_ifm(ifmap_layout, ifm_dict):
+def _unit_nhops_to_ifm(ifmap_layout, ifm_dict, fwd=False):
     '''
     Get the total number of hops to transfer ifmap data.
@@ -330,13 +349,13 @@ def _unit_nhops_to_ifm(ifmap_layout, ifm_dict):
     nhops = 0
 
     for ifrng, coord_list in ifm_dict.items():
-        nhops += ifmap_layout.nhops_to(ifrng, *coord_list)
+        nhops += ifmap_layout.nhops_to(ifrng, *coord_list, forwarding=fwd)
 
     return nhops
 
 @fastcache.clru_cache(maxsize=1024)
-def _unit_nhops_to_ofm(ofmap_layout, ofm_dict):
+def _unit_nhops_to_ofm(ofmap_layout, ofm_dict, fwd=False):
     '''
     Get the total number of hops to transfer ofmap data.
@@ -350,16 +369,28 @@ def _unit_nhops_to_ofm(ofmap_layout, ofm_dict):
         # its buffer and start on it. Other nodes start on zero and send the
         # results to that node to accumulate there.
 
-        # Use the mid node.
-        mid_idx = len(coord_list) // 2
-        for idx, coord in enumerate(coord_list):
-            if idx == mid_idx:
-                # The mid node. Fetch from memory.
-                nhops += ofmap_layout.nhops_to(ofrng, coord)
-            else:
-                # Others. Send to the mid node (one way).
-                dist = coord.hop_dist(coord_list[mid_idx])
-                nhops += util.idivc(ofrng.size() * dist, 2)
+        if fwd:
+            # Use the closest processing node.
+            nhops_read = min(ofmap_layout.nhops_to(ofrng, c)
+                             for c in coord_list)
+            # Accumulation follows the reversed optimal forwarding tree.
+            nhops_accum = ofmap_layout.nhops_to(ofrng, *coord_list,
+                                                forwarding=True)
+            # The path between the mid node and memory is in both, and
+            # accumulation is one-way.
+            nhops += util.idivc(nhops_read + nhops_accum, 2)
+
+        else:
+            # Use the middle node.
+            mid_idx = len(coord_list) // 2
+            for idx, coord in enumerate(coord_list):
+                if idx == mid_idx:
+                    # The mid node. Fetch from memory.
+                    nhops += ofmap_layout.nhops_to(ofrng, coord)
+                else:
+                    # Others. Send to the mid node (one way).
+                    dist = coord.hop_dist(coord_list[mid_idx])
+                    nhops += util.idivc(ofrng.size() * dist, 2)
 
     return nhops
 
diff --git a/nn_dataflow/core/partition_scheme.py b/nn_dataflow/core/partition_scheme.py
index b00a8b4..1735850 100644
--- a/nn_dataflow/core/partition_scheme.py
+++ b/nn_dataflow/core/partition_scheme.py
@@ -173,6 +173,41 @@ def part_layer(self, layer, batch_size):
 
         return p_layer, p_batch_size, p_occ
 
+    def part_neighbor_dist(self, node_region, pae):
+        '''
+        Get the 2D distance between nearest neighbor nodes with the given
+        parallelism in the given node region.
+
+        The returned neighbor distance is a PhyDim2 instance, each dimension
+        of which is the hop distance to the neighbor on that logical
+        dimension.
+        '''
+        if pae not in range(pe.NUM):
+            return PhyDim2(float('nan'), float('nan'))
+
+        hdist = []
+        wdist = []
+
+        for pidx in self.gen_pidx():
+            coord = self.coordinate(node_region, pidx)
+            # On logical h dimension.
+            if pidx[pae].h > 0:
+                pidx_ph = [pidx[p] - PhyDim2(h=1, w=0) if p == pae
+                           else pidx[p] for p in range(pe.NUM)]
+                coord_ph = self.coordinate(node_region, pidx_ph)
+                hdist.append(coord.hop_dist(coord_ph))
+            # On logical w dimension.
+            if pidx[pae].w > 0:
+                pidx_pw = [pidx[p] - PhyDim2(h=0, w=1) if p == pae
+                           else pidx[p] for p in range(pe.NUM)]
+                coord_pw = self.coordinate(node_region, pidx_pw)
+                wdist.append(coord.hop_dist(coord_pw))
+
+        # Average.
+        hd = 1. * sum(hdist) / len(hdist) if hdist else float('inf')
+        wd = 1. * sum(wdist) / len(wdist) if wdist else float('inf')
+
+        return PhyDim2(h=hd, w=wd)
+
     def projection(self, region, appl2frng=False):
         '''
         Get the projection of the partitioning scheme onto a new NodeRegion
diff --git a/nn_dataflow/core/pipeline_segment.py b/nn_dataflow/core/pipeline_segment.py
new file mode 100644
index 0000000..86741cf
--- /dev/null
+++ b/nn_dataflow/core/pipeline_segment.py
@@ -0,0 +1,970 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+"""
+
+from collections import namedtuple, OrderedDict, Counter
+import itertools
+
+from sympy import symbols
+from sympy import Basic as symbasic
+from sympy import Eq as symeq
+from sympy.core.containers import Tuple as symtuple
+from sympy.functions.elementary.piecewise import Piecewise as sympiecewise
+
+from .. import util
+from .layer import ConvLayer
+from .network import Network
+from .resource import Resource
+from .scheduling_constraint import SchedulingConstraintLayerPipeline as Cstr
+
+class PipelineSegment(object):
+    '''
+    Inter-layer pipeline segment.
+
+    A segment is a two-level layer hierarchy, where the first level is
+    spatially scheduled and the second level is temporally scheduled.
+    '''
+
+    # pylint: disable=too-many-instance-attributes
+
+    # Scheduling index in the segment, as a tuple of spatial and temporal
+    # scheduling indices.
+    SchedIndex = namedtuple('SchedIndex', ['sp_idx', 'tm_idx'])
+
+    def __init__(self, seg, network, batch_size, resource, max_util_drop=0.05,
+                 with_opt=True):
+        if not isinstance(seg, tuple):
+            raise TypeError('PipelineSegment: seg must be a tuple.')
+        for ltpl in seg:
+            if not isinstance(ltpl, tuple):
+                raise TypeError('PipelineSegment: seg must be a tuple '
+                                'of sub-tuples.')
+
+        if not isinstance(network, Network):
+            raise TypeError('PipelineSegment: network must be '
+                            'a Network instance.')
+        if not isinstance(resource, Resource):
+            raise TypeError('PipelineSegment: resource must be '
+                            'a Resource instance.')
+
+        self.seg = seg
+        self.network = network
+        self.batch_size = batch_size
+        self.resource = resource
+        self.max_util_drop = max_util_drop
+        self.with_opt = with_opt
+
+        self.valid = self._init_deps()
+        if not self.valid:
+            return
+
+        # Resource allocation.
+        self.valid = self._alloc_resource(max_util_drop=max_util_drop)
+        if not self.valid:
+            return
+
+        # Scheduling constraints.
+        self.valid = self._init_sym_cstrs()
+        if not self.valid:
+            return
+
+    def allocation(self):
+        '''
+        Get the resource allocation, as a tuple of sub-tuples corresponding to
+        the layers in the segment.
+        '''
+        if not self.valid:
+            return None
+        return self.alloc
+
+    def gen_constraint(self, max_time_overhead=float('inf')):
+        '''
+        Generate scheduling constraints for the segment, as a tuple of
+        sub-tuples of SchedulingConstraint instances, corresponding to the
+        layers in the segment.
+
+        Yield the segment constraint tuple, and hints for pruning.
+
+        Pruning hints are the top-level loop blocking factors. Smaller hints
+        indicate better (lower) cost, and larger hints indicate better segment
+        timing (with lower time overhead). Constraints with smaller hints are
+        generated before those with larger hints. So if a constraint results
+        in a valid scheduling, the later ones whose hints are all larger than
+        its hints can be pruned.
+        '''
+        syms = self.cstr_symvals.keys()
+        vals = self.cstr_symvals.values()
+        assert syms and vals
+
+        # Sort from small to large.
+        # This is not a strict ordering, but we guarantee that if all values
+        # in hint A are larger than the corresponding values in hint B, A will
+        # be generated after B.
+        vals = [sorted(v) for v in vals]
+
+        if self.cstr_topbat_idx is not None:
+            # Tovhd = (1 + 1/to + 1 + 1/to + ...) / tb
+            #     >= (1 + 1 + ...) / tb = num_sp_fbs / tb
+            min_topbat = 1. * self.cstr_num_sp_fbs / max_time_overhead
+            pos = self.cstr_topbat_idx
+            vals[pos] = [t for t in vals[pos] if t >= min_topbat]
+
+        for valp in itertools.product(*vals):
+
+            constraint = tuple()
+
+            for atpl in self._subs_symargs(self.cstr_symargs, zip(syms, valp)):
+                ctpl = tuple()
+                for a in atpl:
+                    # Construct kwargs, adjust the types of the values.
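+                    # E.g., a == {'topbat': 2, 'fbofm': True} constructs
+                    # Cstr(topbat=2, fbifm=False, topifm=0, fbofm=True,
+                    # update_dict=None); topofm is left unconstrained since
+                    # the ofmaps are fully buffered.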
+                    kwargs = {}
+                    kwargs['topbat'] = int(a.get('topbat', 0))
+                    kwargs['fbifm'] = bool(a.get('fbifm', False))
+                    if not kwargs['fbifm']:
+                        kwargs['topifm'] = int(a.get('topifm', 0))
+                    kwargs['fbofm'] = bool(a.get('fbofm', False))
+                    if not kwargs['fbofm']:
+                        kwargs['topofm'] = int(a.get('topofm', 0))
+                    kwargs['update_dict'] = a.get('update_dict')
+
+                    c = Cstr(**kwargs)
+                    ctpl += (c,)
+                constraint += (ctpl,)
+
+            if None in valp:
+                assert len(valp) == 1
+                hints = (1,)
+            else:
+                hints = tuple(valp)
+
+            yield constraint, hints
+
+    def __getitem__(self, index):
+        return self.seg[index]
+
+    def __iter__(self):
+        return self.seg.__iter__()
+
+    def __len__(self):
+        return len(self.seg)
+
+    def __eq__(self, other):
+        if isinstance(other, self.__class__):
+            # pylint: disable=protected-access
+            return self._key_attrs() == other._key_attrs()
+        return NotImplemented
+
+    def __ne__(self, other):
+        return not self == other
+
+    def __hash__(self):
+        return hash(tuple(self._key_attrs()))
+
+    def __repr__(self):
+        return '{}({})'.format(
+            self.__class__.__name__,
+            ', '.join([
+                'seg={}'.format(repr(self.seg)),
+                'network={}'.format(repr(self.network)),
+                'batch_size={}'.format(repr(self.batch_size)),
+                'resource={}'.format(repr(self.resource)),
+                'max_util_drop={}'.format(repr(self.max_util_drop)),
+                'with_opt={}'.format(repr(self.with_opt))]))
+
+    def _key_attrs(self):
+        ''' Used for comparison. '''
+        return (self.seg, self.network, self.batch_size, self.resource,
+                self.max_util_drop, self.with_opt)
+
+    def _init_deps(self):
+        '''
+        Initialize the dependency relationship of the layers in the segment,
+        as a mapping of the scheduling indices, and check validity. Return
+        whether the segment is valid to schedule.
+
+        We categorize dependencies into 3 categories:
+        - local: with the same spatial index but different temporal indices;
+        - neighbor: with different spatial indices but in the same segment;
+        - memory: in different segments, from/to memory.
+
+        The values of the src/dst dicts are tuples of indices of the neighbor
+        dependencies. A layer can have at most one neighbor source (which must
+        be a last temporally scheduled layer), but may have multiple neighbor
+        destinations (which could be temporally scheduled in the middle).
+        Also, all layers with the same spatial index can have at most one
+        neighbor source.
+
+        The special index `None` means a memory dependency, i.e., from/to
+        memory. Memory sources and neighbor sources must be mutually
+        exclusive, in order to correctly set the src data regions; memory
+        destinations and neighbor destinations can co-exist.
+
+        Local dependencies are omitted, as by default each layer has its
+        immediately previous layer as the local source and its immediately
+        next layer as the local destination.
+
+        Construct an ifmap forwarding dict for shared memory source data. It
+        maps previous layer name tuples, to a list of scheduling indices of
+        all layers in this segment that share these exact previous layers. The
+        first in the list is responsible for fetching the previous layer data
+        and forwarding them to the others. We allow shared memory source data
+        between two layers only when both layers have only memory dependencies
+        (so their temporal indices must be 0), and their previous layers are
+        exactly the same.
+
+        Construct an ofmap forwarding dict for multiple destinations of both
+        on-chip and off-chip.
+        It maps the scheduling index of a layer in this segment that has both
+        memory and neighbor/local destinations (so it needs to store its
+        ofmaps back to memory), to a list of scheduling indices of all layers
+        in this segment that accept its ofmaps as ifmaps. Neighbor
+        dependencies are only between the last temporal one and the first
+        temporal ones; local dependencies are only between adjacent temporal
+        ones.
+        '''
+
+        self.src_dict = [[None for _ in ltpl] for ltpl in self.seg]
+        self.dst_dict = [[None for _ in ltpl] for ltpl in self.seg]
+
+        self.ifm_fwd_dict = {}
+        self.ofm_fwd_dict = {}
+
+        # Mapping from layer to spatial/temporal indices in the segment.
+        layer2idx = {l: PipelineSegment.SchedIndex(sp_idx, tm_idx)
+                     for sp_idx, ltpl in enumerate(self.seg)
+                     for tm_idx, l in enumerate(ltpl)}
+
+        # Mapping from previous layer tuple to layer.
+        prevs2layer = {}
+
+        for sp_idx, ltpl in enumerate(self.seg):
+
+            single_nbr_src = None
+
+            for tm_idx, l in enumerate(ltpl):
+
+                assert layer2idx[l] == (sp_idx, tm_idx)
+
+                # Sources.
+                src = tuple()
+
+                prevs = self.network.prevs(l)
+                assert all(p not in layer2idx or layer2idx[p] < layer2idx[l]
+                           for p in prevs)
+                mem_src = [p for p in prevs if p not in layer2idx]
+                lcl_src = [p for p in prevs if p not in mem_src
+                           and layer2idx[p].sp_idx == sp_idx]
+                nbr_src = [p for p in prevs if p not in mem_src + lcl_src]
+
+                # Ensure that the single local source is the immediately
+                # previous layer. Checked at the destination, so here are
+                # assertions.
+                if not lcl_src:
+                    assert tm_idx == 0
+                else:
+                    assert len(lcl_src) == 1 \
+                            and layer2idx[lcl_src[0]].tm_idx == tm_idx - 1
+
+                # Mutually exclusive.
+                if mem_src and nbr_src:
+                    # We now allow each spatial scheduling (vertex) to have
+                    # both memory source and neighbor source when generating
+                    # segments. But each single layer cannot have both;
+                    # otherwise there would be multiple source data regions.
+                    return False
+
+                if mem_src:
+                    # Memory source.
+                    src += (None,)
+                if nbr_src:
+                    # Neighbor source.
+                    # The single neighbor source must be the last temporally
+                    # scheduled.
+                    assert len(nbr_src) == 1
+                    prev_idx = layer2idx[nbr_src[0]]
+                    assert prev_idx.tm_idx == len(self.seg[prev_idx.sp_idx]) - 1
+                    # Single neighbor source across this spatial scheduling.
+                    if single_nbr_src is not None:
+                        return False
+                    single_nbr_src = prev_idx
+                    src += (prev_idx,)
+
+                # Shared memory source.
+                if mem_src and not lcl_src:
+                    assert not nbr_src
+                    assert tm_idx == 0
+                    if prevs in prevs2layer:
+                        fet_idx = layer2idx[prevs2layer[prevs]]
+                        self.ifm_fwd_dict.setdefault(prevs, [fet_idx]).append(
+                            layer2idx[l])
+                    else:
+                        prevs2layer[prevs] = l
+
+                # Destinations.
+                dst = tuple()
+
+                nexts = self.network.nexts(l)
+                assert all(n not in layer2idx or layer2idx[n] > layer2idx[l]
+                           for n in nexts)
+                mem_dst = [n for n in nexts if n not in layer2idx]
+                lcl_dst = [n for n in nexts if n not in mem_dst
+                           and layer2idx[n].sp_idx == sp_idx]
+                nbr_dst = [n for n in nexts if n not in mem_dst + lcl_dst]
+
+                # Ensure that the single local destination is the immediately
+                # next layer.
+                if not lcl_dst:
+                    if tm_idx != len(ltpl) - 1:
+                        # Does not utilize local data; sub-optimal.
+                        return False
+                else:
+                    if len(lcl_dst) != 1 \
+                            or layer2idx[lcl_dst[0]].tm_idx != tm_idx + 1:
+                        # Local data will not be available if not adjacent.
+                        return False
+
+                # Mutually exclusive.
+                # Now they can co-exist.
+                # assert not mem_dst or not nbr_dst
+                if mem_dst and nbr_dst:
+                    assert tm_idx == len(ltpl) - 1
+                    self.ofm_fwd_dict[layer2idx[l]] = [layer2idx[n]
+                                                       for n in nbr_dst]
+                if mem_dst and lcl_dst:
+                    assert not nbr_dst
+                    self.ofm_fwd_dict[layer2idx[l]] = [layer2idx[lcl_dst[0]]]
+
+                if mem_dst:
+                    # Memory destination.
+                    dst += (None,)
+                if nbr_dst:
+                    # Neighbor destinations.
+                    # This layer is the last temporally scheduled.
+                    assert tm_idx == len(ltpl) - 1
+                    dst += tuple(layer2idx[n] for n in nbr_dst)
+
+                # Basic pipelining requires a linear structure (on-chip).
+                if not self.with_opt:
+                    if len(nbr_src) + len(lcl_src) > 1 \
+                            or len(nbr_dst) + len(lcl_dst) > 1 \
+                            or ((sp_idx, tm_idx) != (0, 0)
+                                and not nbr_src and not lcl_src):
+                        return False
+
+                self.src_dict[sp_idx][tm_idx] = src
+                self.dst_dict[sp_idx][tm_idx] = dst
+
+        return True
+
+    def _alloc_resource(self, max_util_drop=0.05):
+        '''
+        Decide the resource allocation. Return whether the allocation
+        succeeds.
+
+        `max_util_drop` specifies the maximum utilization drop due to
+        mismatched throughput between layers.
+        '''
+
+        self.alloc = tuple()
+
+        # Allocate processing subregions.
+        subregions = self._alloc_proc(max_util_drop=max_util_drop)
+        if not subregions:
+            return False
+
+        no_time_mux = len(self.network) == sum(len(ltpl) for ltpl in self.seg)
+        # All layers that have model filters must be spatially scheduled.
+        if no_time_mux:
+            for ltpl in self.seg:
+                if len([l for l in ltpl
+                        if isinstance(self.network[l], ConvLayer)]) > 1:
+                    no_time_mux = False
+                    break
+
+        for sp_idx, ltpl in enumerate(self.seg):
+
+            # Resource for the subregion.
+            rtpl = tuple()
+
+            for tm_idx, _ in enumerate(ltpl):
+
+                # Processing region.
+                proc_region = subregions[sp_idx]
+
+                # Data source.
+                src = self.src_dict[sp_idx][tm_idx]
+                if None in src:
+                    # Data source is memory.
+                    assert src == (None,)
+                    src_data_region = self.resource.src_data_region
+                    for sh_idx_list in self.ifm_fwd_dict.values():
+                        # Find shared memory source to use forwarding.
+                        if (sp_idx, tm_idx) in sh_idx_list[1:]:
+                            src_data_region = subregions[sh_idx_list[0].sp_idx]
+                            break
+                elif src:
+                    # Data source is a neighbor.
+                    assert len(src) == 1
+                    src_data_region = subregions[src[0].sp_idx]
+                else:
+                    # Data source is all local.
+                    src_data_region = proc_region
+
+                # Data destination.
+                dst = self.dst_dict[sp_idx][tm_idx]
+                if None in dst:
+                    # Data destination is memory.
+                    # assert dst == (None,)
+                    # Now we can have both memory and neighbor destinations.
+                    # If they co-exist, we need to store the data locally and
+                    # also store them back to memory. In this case the dst
+                    # data region is set to memory.
+                    dst_data_region = self.resource.dst_data_region
+                elif dst:
+                    # Data destinations are neighbors.
+                    # Put data in local. The next layers will fetch.
+                    dst_data_region = proc_region
+                else:
+                    # Data destination is all local.
+                    dst_data_region = proc_region
+
+                # Make resource.
+                # Note that DRAM bandwidth is not split here. We optimistically
+                # assume each layer can use the full DRAM bandwidth at
+                # different times. We adjust this assumption when calculating
+                # the segment timing.
+                rtpl += (self.resource._replace(
+                    proc_region=proc_region,
+                    src_data_region=src_data_region,
+                    dst_data_region=dst_data_region,
+                    no_time_mux=no_time_mux),)
+
+            assert len(rtpl) == len(ltpl)
+            self.alloc += (rtpl,)
+        assert len(self.alloc) == len(self.seg)
+
+        return True
+
+    def _alloc_proc(self, max_util_drop=0.05):
+        '''
+        Allocate processing subregions for the segment.
+
+        Return a list of processing subregions corresponding to the
+        first-level (spatially scheduled) layers in the segment. Return None
+        if the allocation fails.
+
+        `max_util_drop` specifies the maximum utilization drop due to
+        mismatched throughput between layers.
+        '''
+
+        # Spatial allocation.
+        proc_region = self.resource.proc_region
+        dim_nodes = proc_region.dim
+        total_nodes = dim_nodes.size()
+
+        # Number of operations of each spatial allocation.
+        ops = [sum(self.network[l].total_ops() for l in ltpl)
+               for ltpl in self.seg]
+
+        # Enforce a common factor among the numbers of nodes allocated to all
+        # vertices in the segment. Such a common factor is likely to be the
+        # common height of the vertex node regions.
+        common_factor_list = [cf for cf, _ in util.factorize(dim_nodes.h, 2)]
+
+        for cf in sorted(common_factor_list, reverse=True):
+            # Pick the largest common factor within the utilization constraint.
+
+            # The number of nodes of each vertex should be proportional to the
+            # number of ops of the vertex.
+            nodes_raw = [o * 1. / sum(ops) * total_nodes for o in ops]
+
+            # Round to the common factor multiples.
+            assert total_nodes % cf == 0
+            nodes = [max(1, int(round(nr / cf))) * cf for nr in nodes_raw]
+            # Fix the margin.
+            while sum(nodes) != total_nodes:
+                diff = [n - nr for n, nr in zip(nodes, nodes_raw)]
+                if sum(nodes) > total_nodes:
+                    # Decrease the nodes for the vertex with the maximum
+                    # positive difference.
+                    idx, _ = max(enumerate(diff), key=lambda tpl: tpl[1])
+                    nodes[idx] -= cf
+                else:
+                    # Increase the nodes for the vertex with the minimum
+                    # negative difference.
+                    idx, _ = min(enumerate(diff), key=lambda tpl: tpl[1])
+                    nodes[idx] += cf
+
+            if 0 in nodes:
+                continue
+
+            # Utilization.
+            time = max(o * 1. / n for o, n in zip(ops, nodes))
+            utilization = sum(ops) / time / sum(nodes)
+            assert utilization < 1 + 1e-6
+
+            if utilization >= 1 - max_util_drop:
+                # Found.
+                break
+
+        else:
+            # Not found.
+            return None
+
+        # Allocate in the processing region according to the number of nodes.
+        subregions = proc_region.allocate(nodes)
+        assert subregions
+        assert len(subregions) == len(self.seg)
+        if len(subregions) == 1:
+            assert subregions[0] == proc_region
+
+        return subregions
+
+    def _init_sym_cstrs(self):
+        '''
+        Initialize the symbolic scheduling constraints for the layers in the
+        segment, by constructing a nested list of dicts `cstr_symargs` whose
+        values can be symbolic expressions for the keyword arguments of the
+        layers in the segment, and a dict `cstr_symvals` mapping each symbol
+        to its possible numerical values.
+
+        Rules for constraints.
+
+        - Top BAT loop factor.
+
+        With a single layer, there is no constraint on the top BAT loop
+        factor. Otherwise all layers must share the same factor, namely
+        `topbat_shr`.
+
+        - Fmap forwarding and fully buffering.
+
+        Only CONV layers require fully buffering fmaps. Local-region layers
+        process data in a streaming manner.
+
+        Each CONV layer, and all local-region layers immediately following it
+        within the same spatial scheduling, are made into a group G.
+
+        (initial) If G is both the first spatial and the first temporal
+        scheduling with a CONV layer, it can choose whether to fully buffer
+        ofmaps or not. This is a configuration to explore, namely
+        `fbofm_init`. We decide its value by choosing the one that gives the
+        fewer fully buffered inter-spatial pairs on the critical forwarding
+        path, and the smaller maximum fully buffered data size.
+
+        (within-group) Within G, the CONV layer, and all local-region layers,
+        should use the same top OFM factors (IFM factors are automatically
+        determined by OFM factors in local-region layers), unless the CONV
+        ofmaps need to be fully buffered, in which case, the CONV layer and
+        the last layer in G fully buffer ofmaps (top OFM factor is 1), and the
+        other layers still use the same top OFM factors but can be different
+        from 1.
+
+        (inter-temporal) If G has a source from G' in the same spatial
+        scheduling (which must be immediately before G), G should fully buffer
+        ifmaps, and G' should fully buffer ofmaps.
+
+        (inter-spatial) If G has a source from G' in another spatial
+        scheduling (where the source must be the last temporal scheduling in
+        G' and that spatial scheduling),
+        (a) if G' already fully buffers ofmaps, make G fully buffer ifmaps.
+        (b) otherwise, make G fully buffer ofmaps (do not require G' to fully
+        buffer ifmaps; leave it to other rules, e.g. inter-temporal, to
+        decide); forward data between G' and G, by matching their top O/IFM
+        factors (biasing this case for smaller pipeline filling delay).
+        Notice the destination can be: (1) the leading CONV layer, whose top
+        IFM factor is constrained; (2) a local-region layer, where we
+        constrain the top OFM factors of this group (except when otherwise
+        constrained by fully buffering ofmaps).
+        '''
+        # pylint: disable=too-many-branches
+
+        # Symbolic variables mapping to numerical values.
+        symvals = dict()
+
+        # Top BAT loop factor.
+        topbat = symbols('topbat_shr', integer=True)
+        symvals[topbat] = [t for t, _ in util.factorize(self.batch_size, 2)]
+
+        # Whether the initial CONV layer fully buffers ofmaps.
+        fbofm_init = symbols('fbofm_init')
+        symvals[fbofm_init] = [False, True]
+
+        def _layer_topofm_vals(layer_name):
+            layer = self.network[layer_name]
+            # We require that the total ofmap size takes at least 5% of the
+            # gbuf capacity of a single node, to avoid too fine blocking.
+            tmax = layer.total_ofmap_size(self.batch_size) \
+                    / (0.05 * self.resource.size_gbuf)
+            vals = [t for t, _ in util.factorize(layer.nofm, 2)
+                    if t <= tmax or t == 1]
+            assert vals
+            return vals
+
+        def _layer_topifm_vals(layer_name):
+            layer = self.network[layer_name]
+            # We require that the total ifmap size takes at least 5% of the
+            # gbuf capacity of a single node, to avoid too fine blocking.
+            tmax = layer.total_ifmap_size(self.batch_size) \
+                    / (0.05 * self.resource.size_gbuf)
+            vals = [t for t, _ in util.factorize(layer.nifm, 2)
+                    if t <= tmax or t == 1]
+            assert vals
+            return vals
+
+        # Layer constraint kwargs.
+        symargs = [[{'topbat': topbat} for _ in ltpl] for ltpl in self.seg]
+
+        # Candidates for the critical forwarding path between spatial
+        # scheduling.
+        sp_crit_path_cands = set()
+        sp_crit_path_cands.add((0,))  # init with the first spatial.
+
+        # The last CONV layer index.
+        last_conv = PipelineSegment.SchedIndex(-1, 0)
+
+        # Whether the current group needs to fully buffer ofmaps. Delayed
+        # apply to the last layer in the group.
+        curr_fbofm = False
+
+        for sp_idx, ltpl in enumerate(self.seg):
+
+            # Initial topofm, in case of a non-CONV starting layer.
+            curr_topofm = symbols('topofm_{}_s'.format(sp_idx), integer=True)
+            symvals[curr_topofm] = _layer_topofm_vals(ltpl[0])
+
+            for tm_idx, l in enumerate(ltpl):
+
+                layer = self.network[l]
+                curr_sa = symargs[sp_idx][tm_idx]
+
+                # Neighbor source dependency.
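+                # A neighbor source is the last temporally scheduled layer of
+                # an earlier spatial scheduling (see _init_deps).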
+                nsrc_sa = None
+                src_deps = self.src_dict[sp_idx][tm_idx]
+                if any(s is not None for s in src_deps):
+                    assert len(src_deps) == 1
+                    nbr_src = src_deps[0]
+                    assert nbr_src.sp_idx < sp_idx
+                    nsrc_sa = symargs[nbr_src.sp_idx][nbr_src.tm_idx]
+                    assert nsrc_sa  # not empty, used to test nbr src exists.
+                    # Set critical path candidates.
+                    new_cands = set()
+                    for cand in sp_crit_path_cands:
+                        if cand[-1] == nbr_src.sp_idx:
+                            new_cands.add(cand + (sp_idx,))
+                    sp_crit_path_cands |= new_cands
+
+                if isinstance(layer, ConvLayer):
+                    # Conv layer.
+
+                    # The last group may require to fully buffer ofmaps.
+                    # Delayed apply to the immediately previous layer.
+                    if curr_fbofm is not False:
+                        assert last_conv >= (0, 0)
+                        if last_conv.sp_idx == sp_idx:
+                            assert tm_idx > 0
+                            lsrc_sa = symargs[sp_idx][tm_idx - 1]
+                        else:
+                            lsrc_sa = symargs[last_conv.sp_idx][-1]
+                        lsrc_sa['fbofm'] = curr_fbofm
+                    # Reset.
+                    curr_fbofm = False
+
+                    # New topofm for a new group.
+                    curr_topofm = symbols('topofm_{}_{}'.format(sp_idx, tm_idx),
+                                          integer=True)
+                    symvals[curr_topofm] = _layer_topofm_vals(l)
+
+                    # Set topofm.
+                    curr_sa['topofm'] = curr_topofm
+
+                    if sp_idx == last_conv.sp_idx:
+                        # Rule inter-temporal.
+                        assert tm_idx > 0
+                        # Make this group fully buffer ifmaps.
+                        curr_sa['fbifm'] = True
+                        # Make the last group fully buffer ofmaps.
+                        last_sa = symargs[sp_idx][last_conv.tm_idx]
+                        lsrc_sa = symargs[sp_idx][tm_idx - 1]
+                        last_sa['fbofm'] = True
+                        lsrc_sa['fbofm'] = True
+
+                    elif nsrc_sa:
+                        # Rule inter-spatial.
+                        # We only look at this rule when the inter-temporal
+                        # rule does not apply and the ifmaps of this group are
+                        # not yet required to be fully buffered.
+                        if not self.with_opt:
+                            # Basic pipelining requires fully buffering all
+                            # pairs of neighbor src/dst.
+                            nsrc_sa['fbofm'] = True
+                        nsrc_fbofm = nsrc_sa.get('fbofm', False)
+                        # (a): if the source already fully buffers ofmaps.
+                        # Make this group fully buffer ifmaps.
+                        curr_sa['fbifm'] = symeq(nsrc_fbofm, True)
+                        # (b)-(1): otherwise.
+                        # Make this group fully buffer ofmaps.
+                        curr_sa['fbofm'] = symeq(nsrc_fbofm, False)
+                        curr_fbofm = symeq(nsrc_fbofm, False)  # delayed apply.
+                        # Match top OFM/IFM factors.
+                        curr_sa['topifm'] = sympiecewise(
+                            (nsrc_sa['topofm'], symeq(nsrc_fbofm, False)),
+                            (curr_sa.get('topifm', 0), True))
+
+                    elif last_conv < (0, 0):
+                        # The first CONV layer.
+                        # Rule initial.
+                        curr_sa['fbofm'] = fbofm_init
+                        curr_fbofm = fbofm_init
+
+                    last_conv = PipelineSegment.SchedIndex(sp_idx, tm_idx)
+
+                else:
+                    # Non-Conv layer.
+
+                    if nsrc_sa:
+                        # Rule inter-spatial, (b)-(2).
+                        nsrc_fbofm = nsrc_sa.get('fbofm', False)
+                        curr_topofm = sympiecewise(
+                            (nsrc_sa['topofm'], symeq(nsrc_fbofm, False)),
+                            (curr_topofm, True))
+                        # Also backtrace this group.
+                        for bt_idx in range(last_conv.tm_idx, tm_idx):
+                            symargs[sp_idx][bt_idx]['topofm'] = curr_topofm
+
+                    # Rule within-group.
+                    curr_sa['topofm'] = curr_topofm
+
+                # If this layer has no on-chip destinations, cancel the
+                # requirement to fully buffer ofmaps.
+                if all(d is None for d in self.dst_dict[sp_idx][tm_idx]) \
+                        and tm_idx == len(ltpl) - 1:
+                    curr_sa.pop('fbofm', False)
+
+        # Simplify.
+        self._simplify_symargs(symargs, symvals)
+
+        # Get the critical forwarding path between spatial scheduling.
+        # The critical path has the longest forwarding chain.
+        sp_crit_path = max(sp_crit_path_cands, key=len)
+
+        # Check the maximum fully-buffering size, and decide fbofm_init.
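+        # A candidate value is discarded if, for any spatial scheduling, the
+        # fully buffered fmaps exceed the total gbuf capacity of its
+        # subregion; the inner loop then breaks and skips the for-else below.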
+        opt_val = None
+        opt_key = (float('inf'),) * 2  # (num of fb pairs, max fb size)
+        num_sp_fbs = 0
+        for val in symvals.get(fbofm_init, [False]):
+            subs_symargs = self._subs_symargs(symargs, fbofm_init, val)
+            maxsz = 0
+            numfb = 0
+            for sp_idx, (ltpl, atpl) in enumerate(zip(self.seg, subs_symargs)):
+                ms = max(itertools.chain(
+                    ((self.network[l].total_ofmap_size() if a.get('fbofm')
+                      else 0)
+                     + (self.network[l].total_ifmap_size() if a.get('fbifm')
+                        else 0)
+                     for l, a in zip(ltpl, atpl)),
+                    [0]))  # safe max with default.
+                if ms > self.alloc[sp_idx][0].proc_region.dim.size() \
+                        * self.alloc[sp_idx][0].size_gbuf:
+                    break
+                maxsz = max(maxsz, ms)
+                if sp_idx in sp_crit_path and atpl[-1].get('fbofm', False):
+                    numfb += 1
+            else:
+                key = (numfb, maxsz)
+                if key < opt_key:
+                    opt_val, opt_key = val, key
+                    num_sp_fbs = numfb
+        if opt_val is None:
+            return False
+        # Use the optimal value.
+        symvals[fbofm_init] = [opt_val]
+        self._simplify_symargs(symargs, symvals)
+
+        # Shared memory sources must have the same topifm.
+        for sh_idx_list in self.ifm_fwd_dict.values():
+            assert len(sh_idx_list) > 1
+            fet_sp_idx = sh_idx_list[0].sp_idx
+            sh_symarg_list = [symargs[idx.sp_idx][0] for idx in sh_idx_list]
+
+            # Must have no constraint on ifmaps access from memory.
+            assert all(not sa.get('fbifm', False) and not sa.get('topifm', 0)
+                       for sa in sh_symarg_list)
+
+            # Cannot constrain both topifm and topofm.
+            if any(sa.get('fbofm', False) or sa.get('topofm', 0)
+                   for sa in sh_symarg_list):
+                sh_kwargs = {'fbifm': True}
+            else:
+                topifm = symbols('topifm_{}'.format(fet_sp_idx), integer=True)
+                symvals[topifm] = _layer_topifm_vals(self.seg[fet_sp_idx][0])
+                sh_kwargs = {'topifm': topifm}
+
+            # Set constraints.
+            for sa in sh_symarg_list:
+                sa.update(sh_kwargs)
+
+        # Simplify.
+        self._simplify_symargs(symargs, symvals)
+
+        # Turn constraints into lazily updated rules.
+        self._lazify_topofm_symargs(symargs, symvals)
+        # Cannot simplify any more as update_dict is not sympifi-able.
+
+        # Sort the symbol dict.
+        symvals = OrderedDict(sorted(((s, symvals[s]) for s in symvals),
+                                     key=lambda item: str(item[0])))
+
+        if not symvals:
+            # Must add a dummy symbol so iterative substitution can happen.
+            symvals[symbols('_dummy')] = [None]
+
+        self.cstr_symargs = symargs
+        self.cstr_symvals = symvals
+        self.cstr_num_sp_fbs = num_sp_fbs
+        try:
+            self.cstr_topbat_idx = list(symvals.keys()).index(topbat)
+        except ValueError:
+            self.cstr_topbat_idx = None
+
+        return True
+
+    @staticmethod
+    def _simplify_symargs_one_pass(symargs, symvals):
+        '''
+        Simplify symargs and symvals in-place:
+        - If fbi/ofm is False, then remove it.
+        - If fbi/ofm is True, then remove topi/ofm.
+        - If a symbol can take only one value, then substitute it.
+        - If a symbol only occurs once, then remove its constraint.
+
+        Return whether the symargs and symvals are already simplified.
+        '''
+        for a in itertools.chain.from_iterable(symargs):
+            is_fbifm = a.get('fbifm')
+            is_fbofm = a.get('fbofm')
+            # pylint: disable=singleton-comparison
+            # lhs may be symbolic, see
+            # docs.sympy.org/latest/modules/logic.html#sympy.logic.boolalg.BooleanTrue
+            if is_fbifm == True:
+                a.pop('topifm', 0)
+            if is_fbifm == False:
+                a.pop('fbifm', False)
+            if is_fbofm == True:
+                a.pop('topofm', 0)
+            if is_fbofm == False:
+                a.pop('fbofm', False)
+
+        subs_dict = {}
+
+        # Possible values for symbols.
+        subs_dict.update(
+            (s, symvals[s][0]) for s in symvals if len(symvals[s]) == 1)
+
+        # Count the occurrence of symbols in all args (values).
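+        # A symbol that occurs in at most one arg couples no two layers, so
+        # its constraint can be dropped: top factors fall back to 0 (i.e.,
+        # unconstrained), and fully-buffering flags fall back to False.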
+        symcnts = Counter(
+            s for a in itertools.chain.from_iterable(symargs)
+            for val in a.values() for s in symtuple(val).free_symbols)
+        assert set(symcnts.keys()).issubset(symvals.keys())
+        subs_dict.update((s, None)
+                         for s in set(symvals.keys()) - set(symcnts.keys()))
+        subs_dict.update((s, 0 if str(s).startswith('top') else False)
+                         for s in symcnts if symcnts[s] <= 1)
+
+        # Substitute symbols and remove them from the symbol dict.
+        for a in itertools.chain.from_iterable(symargs):
+            for k in a:
+                a[k] = symtuple(a[k]).subs(subs_dict)[0]
+        for s in subs_dict:
+            del symvals[s]
+
+        return not subs_dict
+
+    def _simplify_symargs(self, symargs, symvals):
+        ''' Simplify symargs and symvals in-place iteratively. '''
+        while not self._simplify_symargs_one_pass(symargs, symvals):
+            pass
+        used_syms = symtuple(
+            *[symtuple(*a.values())
+              for a in itertools.chain.from_iterable(symargs)]).free_symbols
+        assert set(used_syms) == set(symvals.keys())
+        assert all(val for val in symvals.values())
+
+    @staticmethod
+    def _subs_symargs(symargs, *subs_args):
+        '''
+        Substitute symbols. The additional arguments are passed to subs().
+
+        Return a new substituted copy without modifying the original one.
+        '''
+        # sympify=False is necessary because there may be str in the values.
+        return [[dict((k, symtuple(a[k], sympify=False).subs(*subs_args)[0])
+                      for k in a) for a in atpl] for atpl in symargs]
+
+    class TopOfmUpdateLambda(symbasic):
+        ''' A sympifi-able lambda function to lazily update topofm. '''
+        def __new__(cls, *args):
+            return super(PipelineSegment.TopOfmUpdateLambda, cls).__new__(cls)
+        def __call__(self, arg_s, arg_r):
+            setattr(arg_s, 'topofm', arg_r.scheme['to'][0])
+
+    def _lazify_topofm_symargs(self, symargs, symvals):
+        '''
+        Turn qualified topofm constraints into lazily updated rules.
+
+        If a symbol is only used as the topofm constraint by a single CONV
+        layer and some local-region layers, we can turn it into a lazily
+        updated rule.
+        '''
+        sym2conv = {}  # symbol --> the only CONV layer using it.
+        sym2lrs = {}   # symbol --> list of local-region layers using it.
+        unqual_syms = set()  # symbols used by two or more CONV layers.
+        for l, a in zip(itertools.chain.from_iterable(self.seg),
+                        itertools.chain.from_iterable(symargs)):
+            layer = self.network[l]
+            if isinstance(layer, ConvLayer):
+                topofm = a.get('topofm', 0)
+                topifm = a.get('topifm', 0)
+                for s in symtuple(topofm, topifm).free_symbols:
+                    if s not in unqual_syms:
+                        if s in sym2conv:
+                            # If a symbol is used in two CONV layers, it
+                            # cannot be lazily updated.
+                            del sym2conv[s]
+                            sym2lrs.pop(s, [])
+                            unqual_syms.add(s)
+                        elif topofm == s:
+                            assert s not in sym2lrs
+                            sym2conv[s] = l
+            else:
+                topofm = a.get('topofm', 0)
+                if topofm in sym2conv:
+                    sym2lrs.setdefault(topofm, []).append(l)
+        assert 0 not in sym2conv and 0 not in sym2lrs
+
+        syms = sym2conv.keys()  # symbols to be lazily updated.
+        lr2conv = {}  # local-region layer to the CONV layer constraining it.
+        for s in syms:
+            for lr in sym2lrs.get(s, []):
+                lr2conv[lr] = sym2conv[s]
+        lconvs = set(lr2conv.values())  # CONV layers whose topofm is removed.
+
+        for l, a in zip(itertools.chain.from_iterable(self.seg),
+                        itertools.chain.from_iterable(symargs)):
+            if l in lconvs:
+                # Remove the CONV topofm.
+                assert sym2conv[a['topofm']] == l
+                del a['topofm']
+            elif l in lr2conv:
+                # Link the local-region layer to the CONV layer.
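+                # Its topofm will then be set lazily from the CONV layer's
+                # scheduling result through update_dict (see
+                # TopOfmUpdateLambda above).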
+                lconv = lr2conv[l]
+                assert sym2conv[a['topofm']] == lconv
+                del a['topofm']
+                a['update_dict'] = {
+                    lconv: PipelineSegment.TopOfmUpdateLambda()}
+
+        for s in syms:
+            del symvals[s]
+
diff --git a/nn_dataflow/core/pipeline_segment_timing.py b/nn_dataflow/core/pipeline_segment_timing.py
new file mode 100644
index 0000000..c1e4d07
--- /dev/null
+++ b/nn_dataflow/core/pipeline_segment_timing.py
@@ -0,0 +1,233 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+"""
+
+from collections import namedtuple, OrderedDict
+
+from . import loop_enum as le
+from .loop_blocking_scheme import LoopBlockingScheme
+from .layer import ConvLayer
+from .network import Network
+
+class PipelineSegmentTiming(object):
+    ''' Timing information of a pipeline segment. '''
+
+    # Each layer timing info is a tuple:
+    # - time: the total time.
+    # - node_time: the total time on node processing.
+    # - dram_time: the total time on DRAM access.
+    # - num_nodes: the number of processing nodes.
+    # - ngrp: the OFM group number.
+    # - ts_xb: when to start.
+    # - td_xb: when the first BAT group of this and all prev layers is done.
+    # Time is stored multiplied by the lazily updated BAT group number (_xb).
+    # Notice that (td - ts) may be greater than (time), because fused layers
+    # can have an earlier start time, but done time is sequentially
+    # accumulated.
+    LayerTiming = namedtuple('LayerTiming', ['time', 'node_time', 'dram_time',
+                                             'num_nodes', 'ngrp',
+                                             'ts_xb', 'td_xb'])
+
+    def __init__(self, network, seg_idx):
+
+        if not isinstance(network, Network):
+            raise TypeError('PipelineSegmentTiming: network must be a '
+                            'Network instance.')
+        self.network = network
+
+        # Scheduling sequence number.
+        self.seg_idx = seg_idx
+        self.last_sched_seq = None
+
+        # Time properties.
+        # The time on DRAM accesses.
+        self.dram_time = 0
+        # The time on node processing.
+        self.node_time = 0
+        # The critical (longest) spatial scheduling time.
+        self.critical_time = 0
+
+        # Mapping from layer name to spatial and temporal indices.
+        self.layer2idx = OrderedDict()
+
+        # The number of groups of which BAT are sequentially processed, i.e.,
+        # the degree of batch pipelining, shared by all layers in the segment.
+        # Lazily updated.
+        self.bat_ngrp = None
+
+        # Timing of each layer, indexed by spatial and temporal indices.
+        self.timing_list = []
+
+    @property
+    def time(self):
+        ''' The total time of the end-to-end segment processing. '''
+        return max(self.node_time, self.dram_time)
+
+    @property
+    def time_overhead(self):
+        '''
+        The time overhead as a percentage, to process the layers in the
+        segment compared to processing the layers individually.
+        '''
+        total_num_nodes = sum(tlist[0].num_nodes
+                              for tlist in self.timing_list)
+        # Sum up the max of scaled node time and DRAM time.
+        time_indv = sum(max(1. * timing.node_time * timing.num_nodes
* timing.node_time * timing.num_nodes + / total_num_nodes, + timing.dram_time) + for tlist in self.timing_list + for timing in tlist) + return (self.time - time_indv) / time_indv + + def add(self, layer_name, sched_result): + ''' Add the SchedulingResult of a new layer. ''' + + sched_seq = sched_result.sched_seq + + if sched_seq[0] != self.seg_idx: + raise ValueError('PipelineSegmentTiming: sched_seq {} does not ' + 'belong to segment {}.' + .format(sched_seq, self.seg_idx)) + + if sched_seq == self._sched_seq_incr(1): + # New spatial scheduling. + self.timing_list.append([]) + elif sched_seq == self._sched_seq_incr(2): + # New temporal scheduling. + pass + else: + raise ValueError('PipelineSegmentTiming: sched_seq {} cannot ' + 'follow {}' + .format(sched_seq, self.last_sched_seq)) + self.last_sched_seq = sched_seq + + if layer_name in self.layer2idx: + raise ValueError('PipelineSegmentTiming: layer {} already in ' + 'segment, old sched_seq {}, new sched_seq {}.' + .format(layer_name, self.layer2idx[layer_name], + sched_seq[1:])) + self.layer2idx[layer_name] = sched_seq[1:] + + # Add layer timing. + + timing = self._make_layer_timing(layer_name, sched_result) + assert not self.timing_list[-1] \ + or timing.num_nodes == self.timing_list[-1][-1].num_nodes + self.timing_list[-1].append(timing) + assert self.last_sched_seq[1] + 1 == len(self.timing_list) + assert self.last_sched_seq[2] + 1 == len(self.timing_list[-1]) + + # Update time. + + # Critical time, as the longest of all spatial scheduling. + assert all(sum(timing.time for timing in tlist) + <= tlist[-1].td_xb - tlist[0].ts_xb + for tlist in self.timing_list) + self.critical_time = max(tlist[-1].td_xb - tlist[0].ts_xb + for tlist in self.timing_list) + + # DRAM time. + # Each layer DRAM time is calculated using the layer accesses and the + # maximum bandwidth. Accumulating the accesses is accumulating the + # time. + self.dram_time += sched_result.total_dram_time + + # Node time, as the max of end time of the last BAT group. + # The interval between BAT groups is determined by the critical time of + # one BAT group. + self.node_time = max((tlist[-1].td_xb + + self.critical_time * (self.bat_ngrp - 1)) + // self.bat_ngrp + for tlist in self.timing_list) + assert self.node_time >= self.critical_time + + def _sched_seq_incr(self, pos): + ''' Get the next sched seq incremented at the given position. ''' + if not self.last_sched_seq: + return (self.seg_idx, 0, 0) + assert len(self.last_sched_seq) == 3 + return self.last_sched_seq[:pos] + (self.last_sched_seq[pos] + 1,) \ + + (0,) * (2 - pos) + + def _make_layer_timing(self, layer_name, sched_result): + ''' Construct and return the layer timing. ''' + # Top-level ordered loops, from outermost to innermost. + ord_loops = LoopBlockingScheme.ordered_loops( + sched_result.scheme['tvals'][0], sched_result.scheme['orders'][0]) + + # Top loop blocking factors. + top_ts = [1] * le.NUM + if ord_loops and ord_loops[0][0] == le.BAT: + top_ts[le.BAT] = ord_loops.pop(0)[1] + if ord_loops: + lpe, t = ord_loops.pop(0) + assert lpe == le.IFM or lpe == le.OFM + top_ts[lpe] = t + + # Lazily update BAT group number. + if not self.bat_ngrp: + self.bat_ngrp = top_ts[le.BAT] + elif self.bat_ngrp != top_ts[le.BAT]: + # Unmatched. + self.bat_ngrp = 1 + + # IFM/OFM group number. + ifm_ngrp, ofm_ngrp = top_ts[le.IFM], top_ts[le.OFM] + + # Time on node processing and DRAM access. + node_time = sched_result.total_node_time + dram_time = sched_result.total_dram_time + # Number of nodes. 
+ num_nodes = sched_result.num_nodes + + # Calculate timing. + sp_idx, tm_idx = self.layer2idx[layer_name] + is_conv = isinstance(self.network[layer_name], ConvLayer) + time = sched_result.total_time + ts_xb = 0 + td_xb = 0 + for p in self.network.prevs(layer_name): + if p not in self.layer2idx: + # Off-chip source. + continue + # On-chip source. + p_sp_idx, p_tm_idx = self.layer2idx[p] + p_timing = self.timing_list[p_sp_idx][p_tm_idx] + if p_sp_idx == sp_idx: + assert p_tm_idx == tm_idx - 1 + # Same spatial scheduling. + if not is_conv and ofm_ngrp == p_timing.ngrp: + # Fused. + start = p_timing.ts_xb + p_timing.time // p_timing.ngrp + else: + # Not fused. + start = p_timing.td_xb + # Also constrain the done time. + td_xb = p_timing.td_xb + time + else: + assert p_sp_idx < sp_idx + assert p_tm_idx == len(self.timing_list[p_sp_idx]) - 1 + # Previous spatial scheduling. + if (ifm_ngrp if is_conv else ofm_ngrp) == p_timing.ngrp: + # I/OFM group forwarding. + start = p_timing.ts_xb + p_timing.time // p_timing.ngrp + else: + # All I/OFM double buffering. + start = p_timing.td_xb + ts_xb = max(ts_xb, start) + td_xb = max(td_xb, ts_xb + time) + + return PipelineSegmentTiming.LayerTiming( + time=time, node_time=node_time, dram_time=dram_time, + num_nodes=num_nodes, ngrp=ofm_ngrp, ts_xb=ts_xb, td_xb=td_xb) + diff --git a/nn_dataflow/core/resource.py b/nn_dataflow/core/resource.py index bcad270..73c4fd7 100644 --- a/nn_dataflow/core/resource.py +++ b/nn_dataflow/core/resource.py @@ -28,6 +28,7 @@ 'size_regf', 'array_bus_width', 'dram_bandwidth', + 'no_time_mux', ] class Resource(namedtuple('Resource', RESOURCE_LIST)): @@ -79,5 +80,8 @@ def __new__(cls, *args, **kwargs): if ntp.dram_bandwidth <= 0: raise ValueError('Resource: dram_bandwidth must be positive.') + if not isinstance(ntp.no_time_mux, bool): + raise TypeError('Resource: no_time_mux must be boolean') + return ntp diff --git a/nn_dataflow/core/scheduling.py b/nn_dataflow/core/scheduling.py index 1cf9b70..0f1398b 100644 --- a/nn_dataflow/core/scheduling.py +++ b/nn_dataflow/core/scheduling.py @@ -20,6 +20,7 @@ from . import data_category_enum as de from . import loop_blocking from . import loop_enum as le +from . import mem_hier_enum as me from . import partition from .. import util from .cost import Cost @@ -28,10 +29,13 @@ from .layer import Layer from .map_strategy import MapStrategy from .resource import Resource +from .scheduling_constraint import SchedulingConstraint class SchedulingCondition(namedtuple('SchedulingCondition', ['resource', + 'constraint', 'ifmap_layout', + 'sched_seq', ])): ''' Layer scheduling condition. @@ -43,9 +47,17 @@ def __new__(cls, *args, **kwargs): if not isinstance(ntp.resource, Resource): raise TypeError('SchedulingCondition: resource must be ' 'a Resource instance.') + if not isinstance(ntp.constraint, SchedulingConstraint): + raise TypeError('SchedulingCondition: constraint must be ' + 'a SchedulingConstraint instance.') if not isinstance(ntp.ifmap_layout, DataLayout): raise TypeError('SchedulingCondition: ifmap_layout must be ' 'a DataLayout instance.') + if not isinstance(ntp.sched_seq, tuple): + raise TypeError('SchedulingCondition: sched_seq must be a tuple.') + if len(ntp.sched_seq) != 3: + raise ValueError('SchedulingCondition: sched_seq must have ' + '(segment, spatial, temporal) 3 indices.') return ntp @@ -53,6 +65,7 @@ def __new__(cls, *args, **kwargs): class SchedulingResult(namedtuple('SchedulingResult', ['scheme', 'ofmap_layout', + 'sched_seq', ])): ''' Layer scheduling result. 
@@ -67,6 +80,11 @@ def __new__(cls, *args, **kwargs): if not isinstance(ntp.ofmap_layout, DataLayout): raise TypeError('SchedulingResult: ofmap_layout must be ' 'a DataLayout instance.') + if not isinstance(ntp.sched_seq, tuple): + raise TypeError('SchedulingResult: sched_seq must be a tuple.') + if len(ntp.sched_seq) != 3: + raise ValueError('SchedulingResult: sched_seq must have ' + '(segment, spatial, temporal) 3 indices.') return ntp @@ -103,7 +121,9 @@ def total_ops(self): @property def total_accesses(self): ''' Get the total accesses at all memory hierarchies as a list. ''' - return [sum(acc) for acc in self.scheme['access']] + accesses = [sum(acc) for acc in self.scheme['access']] + accesses[me.GBUF] += sum(self.scheme['remote_gbuf_access']) + return accesses @property def total_noc_hops(self): @@ -160,7 +180,8 @@ def schedule_search(self, condition, options): # Ifmap layout. ifmap_layout = condition.ifmap_layout - if not ifmap_layout.is_in(resource.src_data_region): + # Ifmap should be from the source data region or local. + if not ifmap_layout.is_in(resource.src_data_region, proc_region): raise ValueError('Scheduling: ifmap layout is not contained in ' 'source data region.') ifrng = ifmap_layout.complete_fmap_range() @@ -180,7 +201,7 @@ def schedule_search(self, condition, options): guaranteed=True): # Explore single-node schedules. lbs_tops = list(self.schedule_search_per_node( - part, resource, options)) + part, resource, condition.constraint, options)) if not lbs_tops: continue @@ -201,7 +222,8 @@ def schedule_search(self, condition, options): filter_nodes, ifmap_layout, ofmap_layout, options) # Make scheduling result. - tops += [self._get_result(lbs, part, ofmap_layout, unit_nhops) + tops += [self._get_result(lbs, part, ofmap_layout, + condition.sched_seq, unit_nhops) for lbs in lbs_tops] # Pick the top n. @@ -231,7 +253,7 @@ def cache_stats(self): return (info.hits, info.misses) @fastcache.clru_cache(maxsize=1024) - def schedule_search_per_node(self, part, resource, options): + def schedule_search_per_node(self, part, resource, constraint, options): ''' Search the best mapping strategies and loop blocking schemes for a single node after partitioning. Return the top LoopBlockingScheme @@ -252,14 +274,15 @@ def schedule_search_per_node(self, part, resource, options): # Explore loop blocking schemes. for lbs in loop_blocking.gen_loopblocking( - nested_loop_desc, resource, self.cost, options): + nested_loop_desc, resource, part, constraint, self.cost, + options): if lbs.is_valid(): lbs_tops.append(lbs) return lbs_tops - def _get_result(self, lbs, part, ofmap_layout, unit_nhops): + def _get_result(self, lbs, part, ofmap_layout, sched_seq, unit_nhops): ''' Make the schedule result from loop blocking and partitioning. ''' @@ -268,8 +291,13 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): # Cost components. cost_access = lbs.get_access_cost(self.cost) - total_nhops = [unh * f for unh, f - in zip(unit_nhops, lbs.get_top_level_fetch())] + # Inter-node data forwarding/rotation hops. + node_nhops = lbs.get_noc_access() + # Memory access hops. + mem_nhops = [unh * f for unh, f + in zip(unit_nhops, lbs.get_top_level_fetch())] + # Total hops = inter-node hops + memory hops. 
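+        # (Illustrative arithmetic, with made-up numbers: for one data
+        # category, unit_nhops = 10 and a top-level fetch count of 4 give
+        # mem_nhops = 40; with node_nhops = 15 from forwarding/rotation,
+        # total_nhops = 15 + 40 = 55.)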
+ total_nhops = [nnh + mnh for nnh, mnh in zip(node_nhops, mem_nhops)] cost_noc = self.cost.noc_hop * sum(total_nhops) cost_op = self.cost.mac_op * lbs.ops @@ -283,6 +311,7 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): scheme['time'] = lbs.time scheme['ops'] = lbs.ops scheme['num_nodes'] = lbs.num_nodes + scheme['is_dram'] = (lbs.src_is_dram, lbs.dst_is_dram) scheme['cost_op'] = cost_op scheme['cost_access'] = cost_access scheme['cost_noc'] = cost_noc @@ -291,6 +320,7 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): scheme['bus_time'] = lbs.bus_time scheme['dram_time'] = lbs.dram_time scheme['access'] = lbs.get_access() + scheme['remote_gbuf_access'] = lbs.remote_gbuf_access scheme['total_nhops'] = total_nhops scheme['fetch'] = lbs.fetch @@ -305,10 +335,23 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): for bl in range(lbs.BL.NUM)] scheme['unit_size'] = lbs.unit_size scheme['unit_cnt'] = lbs.unit_cnt + scheme['accfwd_reduction'] = lbs.accfwd_reduction + scheme['bufshr_grp_size'] = lbs.bufshr_grp_size + scheme['bufshr_subgrp_size'] = lbs.bufshr_subgrp_size + scheme['bufshr_bs_t'] = lbs.bufshr_bs_t + scheme['bufshr_bs_ord'] = lbs.bufshr_bs_ord + scheme['bufshr_rot_fetch'] = lbs.bufshr_rot_fetch + scheme['bufshr_rot_round_cnt'] = lbs.bufshr_rot_round_cnt + scheme['bufshr_rot_unit_cnt'] = lbs.bufshr_rot_unit_cnt + scheme['bufshr_wide_fetch'] = lbs.bufshr_wide_fetch + scheme['bufshr_wide_fetch_width'] = lbs.bufshr_wide_fetch_width # Partitioning. scheme['part'] = part + scheme['mem_nhops'] = mem_nhops + scheme['node_nhops'] = node_nhops scheme['unit_nhops'] = unit_nhops - return SchedulingResult(scheme=scheme, ofmap_layout=ofmap_layout) + return SchedulingResult(scheme=scheme, ofmap_layout=ofmap_layout, + sched_seq=sched_seq) diff --git a/nn_dataflow/core/scheduling_constraint.py b/nn_dataflow/core/scheduling_constraint.py new file mode 100644 index 0000000..3f8f6fe --- /dev/null +++ b/nn_dataflow/core/scheduling_constraint.py @@ -0,0 +1,190 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import numbers + +from . import loop_enum as le +from .. import util +from .loop_blocking_scheme import LoopBlockingScheme + +class SchedulingConstraint(util.ContentHashClass): + ''' + Layer scheduling constraint, which constrains top loop blocking factors. + ''' + + def __init__(self, topbat=0, topifm=0, topofm=0, update_dict=None): + ''' + `topbat`, `topifm`, `topofm` specify the top-level loop blocking + factors. + + `update_dict` specifies lazily updated rules to refine the constraint + with previous scheduling results. It should be a mapping, from previous + layer name to a function which takes two arguments: self, and the + SchedulingResult instance of that layer. 
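+
+        For illustration only (a sketch; 'prev' is a hypothetical layer
+        name):
+
+            SchedulingConstraint(
+                topbat=4,
+                update_dict={'prev': lambda self, res: setattr(
+                    self, 'topofm', res.scheme['to'][0])})
+
+        fixes the top BAT blocking factor to 4, and lazily sets `topofm` from
+        the scheduling result of layer 'prev' once that result is available.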
+        '''
+        if any(n < 0 or not isinstance(n, numbers.Integral)
+               for n in [topbat, topifm, topofm]):
+            raise ValueError('SchedulingConstraint: '
+                             'constrained factors must be non-negative '
+                             'integers.')
+
+        if not update_dict:
+            update_dict = {}
+        if not isinstance(update_dict, dict):
+            raise TypeError('SchedulingConstraint: '
+                            'update_dict must be a dict instance.')
+        update_dict = util.HashableDict.fromdict(update_dict)
+        for val in update_dict.values():
+            if not callable(val):
+                raise TypeError('SchedulingConstraint: '
+                                'values in update_dict must be callable.')
+
+        self.topbat = topbat
+        self.topifm = topifm
+        self.topofm = topofm
+        self.update_dict = update_dict
+
+    def is_valid_top_bl(self, top_bl_t, top_bl_ord):
+        '''
+        Whether the given `top_bl_t` and `top_bl_ord` are valid under the
+        constraint.
+        '''
+        if self.update_dict:
+            raise ValueError('SchedulingConstraint: update_dict is not empty, '
+                             'rules have not been updated.')
+
+        if self.topbat and self.topbat != top_bl_t[le.BAT]:
+            return False
+        if self.topifm and self.topifm != top_bl_t[le.IFM]:
+            return False
+        if self.topofm and self.topofm != top_bl_t[le.OFM]:
+            return False
+
+        del top_bl_ord
+
+        return True
+
+    def is_valid_part(self, part):
+        '''
+        Whether the given `part` is valid under the constraint.
+        '''
+        # pylint: disable=unused-argument
+        if self.update_dict:
+            raise ValueError('SchedulingConstraint: update_dict is not empty, '
+                             'rules have not been updated.')
+
+        return True
+
+    def filter_gen_ts(self, gen_tifm, gen_tofm, gen_tbat):
+        ''' Get the filtered generators for loop blocking factors. '''
+        return self._filter_gen(gen_tifm, self.topifm), \
+               self._filter_gen(gen_tofm, self.topofm), \
+               self._filter_gen(gen_tbat, self.topbat)
+
+    def update_by_prev(self, prev_results):
+        '''
+        Use the rules specified by `update_dict` to update the constraint,
+        based on the previous layer scheduling results `prev_results`, a
+        mapping from previous layer name to SchedulingResult instance.
+        '''
+        for layer_name in self.update_dict:
+            self.update_dict[layer_name](self, prev_results[layer_name])
+        self.update_dict = util.HashableDict()  # Clear the updated rules.
+
+    @staticmethod
+    def _filter_gen(gen, topt=0):
+        ''' Get a new generator which filters the top factor. '''
+        for tpl in gen:
+            if topt == 0 or tpl[0] == topt:
+                yield tpl
+
+    def __repr__(self):
+        return '{}({})'.format(
+            self.__class__.__name__,
+            ', '.join(['{}={}'.format(k, repr(v))
+                       for k, v in self.__dict__.items()]))
+
+
+class SchedulingConstraintLayerPipeline(SchedulingConstraint):
+    '''
+    Layer scheduling constraint for inter-layer pipelining.
+
+    The constraint includes:
+    - topbat: top BAT loop blocking factor, which decides the number of groups
+      for batch pipelining. It must match across all layers in a pipeline
+      segment.
+    - topifm/topofm: top IFM/OFM loop blocking factors, which decide the
+      number of groups for fmap data forwarding between adjacent spatially
+      scheduled layers in a pipeline segment. They must match between
+      forwarding source/destination layers.
+    - fbifm/fbofm: whether to fully buffer the fmap data of the layer on-chip.
+      This is the baseline double-buffering between pipelined layers.
+
+    For loop orders, the BAT loop must be the outermost for batch pipelining.
+    The loop associated with the forwarded data (IFM or OFM) must follow as
+    the second outermost. If a data category (IFM or OFM) is fully buffered,
+    the corresponding loop is a trivial loop, which can be placed anywhere.
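+
+    For illustration only, a sketch of the intended use:
+
+        SchedulingConstraintLayerPipeline(topbat=2, fbofm=True)
+
+    requires the top BAT factor to be 2 (two batch pipelining groups), forces
+    topofm = 1 (OFM fully buffered), and accepts a scheme only if the BAT
+    loop is the outermost non-trivial top-level loop.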
+ ''' + + def __init__(self, topbat=0, topifm=0, topofm=0, fbifm=False, fbofm=False, + update_dict=None): + + if fbifm: + # Fully-buffered IFM <=> topifm = 1. + if topifm != 0 and topifm != 1: + raise ValueError('SchedulingConstraintLayerPipeline: ' + 'fully-buffered IFM implies topifm = 1.') + topifm = 1 + + if fbofm: + # Fully-buffered OFM <=> topofm = 1. + if topofm != 0 and topofm != 1: + raise ValueError('SchedulingConstraintLayerPipeline: ' + 'fully-buffered OFM implies topofm = 1.') + topofm = 1 + + if topifm > 1 and topofm > 1: + raise ValueError('SchedulingConstraintLayerPipeline: ' + 'impossible to have both topifm and topofm > 1, ' + 'at least one of IFM and OFM must be a trivial ' + 'loop (= 1) or not constrained (= 0).') + + super(SchedulingConstraintLayerPipeline, self).__init__( + topbat=topbat, topifm=topifm, topofm=topofm, + update_dict=update_dict) + + def is_valid_top_bl(self, top_bl_t, top_bl_ord): + + if not super(SchedulingConstraintLayerPipeline, self).is_valid_top_bl( + top_bl_t, top_bl_ord): + return False + + # Loop orders. + # Ordered loops from outer to inner. + ord_lpe = LoopBlockingScheme.ordered_loops(top_bl_t, top_bl_ord, + lpe_only=True) + if self.topbat > 1: + if ord_lpe.pop(0) != le.BAT: + return False + # topifm and topofm cannot trigger together. + if self.topifm > 1: + if ord_lpe.pop(0) != le.IFM: + return False + if self.topofm > 1: + if ord_lpe.pop(0) != le.OFM: + return False + + return True + diff --git a/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py b/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py index 559bbf6..82b72b6 100644 --- a/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py +++ b/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py @@ -18,9 +18,10 @@ import StringIO from nn_dataflow.core import Cost -from nn_dataflow.core import InputLayer, FCLayer +from nn_dataflow.core import InputLayer, ConvLayer, FCLayer from nn_dataflow.core import MapStrategy, MapStrategyEyeriss from nn_dataflow.core import MemHierEnum as me +from nn_dataflow.core import Network from nn_dataflow.core import NodeRegion from nn_dataflow.core import NNDataflow from nn_dataflow.core import Option @@ -37,6 +38,25 @@ def setUp(self): self.alex_net = import_network('alex_net') self.vgg_net = import_network('vgg_net') + net = Network('simple') + net.set_input_layer(InputLayer(4, 2)) + net.add('1', ConvLayer(4, 4, 2, 1)) + net.add('2', ConvLayer(4, 4, 2, 1)) + # Two more layers to avoid single-segment case. + net.add('a1', ConvLayer(4, 1, 1, 1, strd=2)) + net.add('a2', ConvLayer(1, 1, 1, 1)) + self.simple_net = net + + net = Network('complex') + net.set_input_layer(InputLayer(8, 8)) + net.add('1', ConvLayer(8, 8, 8, 1)) + net.add('2a', ConvLayer(8, 8, 8, 1), prevs=('1',)) + net.add('3a', ConvLayer(8, 8, 8, 1)) + net.add('2b', ConvLayer(8, 8, 8, 1), prevs=('1',)) + net.add('3b', ConvLayer(8, 8, 8, 1)) + net.add('4', ConvLayer(16, 8, 8, 1), prevs=('3a', '3b')) + self.complex_net = net + self.map_strategy = MapStrategyEyeriss self.resource = Resource(proc_region=NodeRegion(origin=PhyDim2(0, 0), @@ -56,6 +76,7 @@ def setUp(self): size_regf=512 // 2, # 512 B array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) self.cost = Cost(mac_op=1, @@ -127,6 +148,144 @@ def test_verbose(self): for layer in network: self.assertIn(layer, stderr_value) + def test_pipelining(self): + ''' Pipelining. 
''' + network = self.alex_net + batch_size = 1 + + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True) + nnd = NNDataflow(network, batch_size, self.resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fast_forward_infeasible(self): + ''' Enter fast forward due to infeasible constraint. ''' + network = self.simple_net + batch_size = 1 + + # Very small gbuf size. Small fmap tpart is infeasible. + resource = self.resource._replace( + dim_array=PhyDim2(2, 2), + size_gbuf=16) + + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + # No pipelining is feasible. + for dtfl in tops: + self.assertTupleEqual(dtfl['1'].sched_seq, (0, 0, 0)) + self.assertTupleEqual(dtfl['2'].sched_seq, (1, 0, 0)) + + def test_fast_forward_found(self): + ''' Enter fast forward due to early found. ''' + network = self.simple_net + batch_size = 1 + + # No time overhead limit. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=float('inf')) + nnd = NNDataflow(network, batch_size, self.resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fast_forward_crit_time(self): + ''' Enter fast forward due to long critical time. ''' + network = self.simple_net + batch_size = 1 + + # Multiple nodes for spatial pipelining. + resource = self.resource._replace( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + dim_array=PhyDim2(1, 1), + ) + + # Very strict time overhead limit. + # At large fmap tpart, utilization decreases and critical time would + # increase. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=1e-3) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fast_forward_frontier(self): + ''' Enter fast forward due to off-frontier. ''' + network = self.simple_net + batch_size = 16 + + # Multiple nodes for spatial pipelining. + resource = self.resource._replace( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + dim_array=PhyDim2(2, 2), + ) + + # No time overhead limit. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=float('inf')) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fmap_fwd(self): + ''' + Fmap forward with shared mem sources or both on/off-chip destinations. + ''' + network = self.complex_net + batch_size = 16 + + # Multiple nodes for spatial pipelining. + resource = self.resource._replace( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + ) + + # No time overhead limit. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=float('inf')) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_sched_instance_sharing(self): + ''' Scheduling instance sharing between layers. 
''' + network = self.alex_net + batch_size = 1 + + nnd = NNDataflow(network, batch_size, self.resource, self.cost, + self.map_strategy) + + self.assertIs(nnd.layer_sched_dict['conv1_a'], + nnd.layer_sched_dict['conv1_b']) + self.assertIs(nnd.layer_sched_dict['conv2_a'], + nnd.layer_sched_dict['conv2_b']) + self.assertIs(nnd.layer_sched_dict['pool1_a'], + nnd.layer_sched_dict['pool1_b']) + def test_opt_goal(self): ''' Optimization goal. ''' network = self.alex_net @@ -206,22 +365,23 @@ def test_no_valid_dataflow(self): # Very small REGF. self.resource = Resource(proc_region=NodeRegion(origin=PhyDim2(0, 0), - dim=PhyDim2(1, 1), + dim=PhyDim2(4, 4), type=NodeRegion.PROC), dram_region=NodeRegion( origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), type=NodeRegion.DRAM), src_data_region=NodeRegion( - origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), + origin=PhyDim2(0, 0), dim=PhyDim2(4, 4), type=NodeRegion.DRAM), dst_data_region=NodeRegion( - origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), + origin=PhyDim2(0, 0), dim=PhyDim2(4, 4), type=NodeRegion.DRAM), dim_array=PhyDim2(16, 16), size_gbuf=128 * 1024 // 2, # 128 kB size_regf=2, array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) nnd = NNDataflow(self.alex_net, 4, self.resource, self.cost, @@ -230,6 +390,13 @@ def test_no_valid_dataflow(self): self.assertFalse(tops) + # With inter-layer pipelining. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True) + tops, _ = nnd.schedule_search(options) + + self.assertFalse(tops) + def test_scheduling_failure(self): ''' Layer scheduling failure. ''' network = self.alex_net @@ -346,6 +513,7 @@ def test_eyeriss_isscc16(self): size_regf=261, # 225 + 12 + 24 array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) cost = Cost(mac_op=2e-12, @@ -442,6 +610,7 @@ def test_eyeriss_asplos17(self): size_regf=1024 // 2, # 1 kB array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) cost = Cost(mac_op=2e-12, @@ -474,6 +643,7 @@ def test_eyeriss_asplos17(self): size_regf=512 // 2, # 512 B array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) cost = Cost(mac_op=2e-12, diff --git a/nn_dataflow/tests/dataflow_test/test_scheduling.py b/nn_dataflow/tests/dataflow_test/test_scheduling.py index 392188f..eca82d8 100644 --- a/nn_dataflow/tests/dataflow_test/test_scheduling.py +++ b/nn_dataflow/tests/dataflow_test/test_scheduling.py @@ -28,6 +28,7 @@ from nn_dataflow.core import Resource from nn_dataflow.core import Scheduling from nn_dataflow.core import SchedulingCondition, SchedulingResult +from nn_dataflow.core import SchedulingConstraint class TestScheduling(unittest.TestCase): ''' Tests for Scheduling module. 
''' @@ -44,6 +45,9 @@ def setUp(self): self.cost = Cost(mac_op=1, mem_hier=(200, 6, 2, 1), noc_hop=50, idl_unit=50) + self.none_cstr = SchedulingConstraint() + self.cstr = SchedulingConstraint(topofm=1, topbat=self.batch_size) + self.resource = Resource( proc_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(4, 4), type=NodeRegion.PROC), @@ -54,7 +58,8 @@ def setUp(self): dst_data_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(4, 1), type=NodeRegion.DRAM), dim_array=PhyDim2(16, 16), size_gbuf=65536, size_regf=64, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) self.options = Option(partition_hybrid=True, partition_batch=True, partition_ifmaps=True, ntops=10) @@ -74,6 +79,8 @@ def setUp(self): parts=(part.projection(self.resource.src_data_region, appl2frng=True),)) + self.sched_seq = (2, 0, 1) + def test_valid_args(self): ''' Valid arguments for constructor. ''' schd = Scheduling(self.layers['BASE'], self.batch_size, self.cost, @@ -116,7 +123,9 @@ def test_schedule_search(self): MapStrategyEyeriss) condition = SchedulingCondition(resource=self.resource, - ifmap_layout=ifmap_layout) + constraint=self.cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) res = schd.schedule_search(condition, self.options) @@ -142,11 +151,19 @@ def test_schedule_search(self): self.assertEqual(r.num_nodes, self.resource.proc_region.dim.size()) + # Constraint. + for r in res: + self.assertEqual(r.scheme['to'][0], 1) + # Ofmap layout. for r in res: self.assertEqual(r.ofmap_layout.complete_fmap_range().size(), layer.total_ofmap_size(self.batch_size)) + # Sequence number. + for r in res: + self.assertTupleEqual(r.sched_seq, condition.sched_seq) + def test_schedule_search_ilayout(self): ''' Invalid ifmap_layout. ''' layer = self.layers['BASE'] @@ -157,9 +174,11 @@ def test_schedule_search_ilayout(self): # Shift ifmap out of memory region. condition = SchedulingCondition( resource=self.resource, + constraint=self.none_cstr, ifmap_layout=self.ifmap_layouts['BASE']._replace( regions=tuple(r._replace(origin=PhyDim2(-10, -10)) - for r in self.ifmap_layouts['BASE'].regions))) + for r in self.ifmap_layouts['BASE'].regions)), + sched_seq=self.sched_seq) with self.assertRaisesRegexp(ValueError, 'Scheduling: .*ifmap.*'): _ = schd.schedule_search(condition, self.options) @@ -167,7 +186,9 @@ def test_schedule_search_ilayout(self): # Not match layer. 
condition = SchedulingCondition( resource=self.resource, - ifmap_layout=self.ifmap_layouts['POOL']) + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layouts['POOL'], + sched_seq=self.sched_seq) with self.assertRaisesRegexp(ValueError, 'Scheduling: .*ifmap.*'): _ = schd.schedule_search(condition, self.options) @@ -182,7 +203,9 @@ def test_schedule_search_nolbs(self): condition = SchedulingCondition( resource=self.resource._replace(size_regf=0), - ifmap_layout=ifmap_layout) + constraint=self.none_cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) res = schd.schedule_search(condition, self.options) @@ -203,7 +226,9 @@ def test_pernode_sched_cache(self): self.assertTupleEqual(schd.cache_stats(), (0, 0)) condition = SchedulingCondition(resource=self.resource, - ifmap_layout=ifmap_layout) + constraint=self.cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) Scheduling.schedule_search.cache_clear() _ = schd.schedule_search(condition, self.options) @@ -232,7 +257,9 @@ def test_pernode_sched_cache_key(self): MapStrategyEyeriss) condition = SchedulingCondition(resource=self.resource, - ifmap_layout=ifmap_layout) + constraint=self.cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) _ = schd.schedule_search(condition, self.options) @@ -241,6 +268,7 @@ def test_pernode_sched_cache_key(self): # Make another instance. rsrc = Resource(**self.resource._asdict()) + cstr = self.cstr opts = Option(**self.options._asdict()) self.assertNotEqual(id(rsrc), id(self.resource)) self.assertNotEqual(id(opts), id(self.options)) @@ -248,7 +276,7 @@ def test_pernode_sched_cache_key(self): part = PartitionScheme(order=(pe.BATP, pe.INPP, pe.OUTP, pe.OFMP), pdims=((2, 4), (2, 1), (1, 1), (1, 1))) - _ = schd.schedule_search_per_node(part, rsrc, opts) + _ = schd.schedule_search_per_node(part, rsrc, cstr, opts) h2, m2 = schd.cache_stats() self.assertEqual(h2, h + 1) diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py index 9539f92..045b58e 100644 --- a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py @@ -149,11 +149,34 @@ def test_gen_loopblocking_byp_sol(self): self.assertLessEqual(cnt, 8) + def test_gen_loopblocking_cstr(self): + ''' gen_loopblocking with constraint. ''' + + for lbs in self._gen_loopblocking(rsrckey='LG', cstr=self.cstr): + + self.assertTrue(self.cstr.is_valid_top_bl(lbs.bl_ts[0], + lbs.bl_ords[0])) + + def test_gen_loopblocking_cstr_sol(self): + ''' gen_loopblocking using bypass solvers with constraint. ''' + + cnt1 = len(list(self._gen_loopblocking(optkey='BYPSOL'))) + + lbs_list = list(self._gen_loopblocking(optkey='BYPSOL', cstr=self.cstr)) + self.assertTrue(all( + self.cstr.is_valid_top_bl(lbs.bl_ts[0], lbs.bl_ords[0]) + for lbs in lbs_list)) + cnt2 = len(lbs_list) + + self.assertLessEqual(cnt2, cnt1) + def _gen_loopblocking(self, wlkey='BASE', rsrckey='BASE', - optkey='BASE', skip_invalid=False): + optkey='BASE', cstr=None, skip_invalid=False): ''' gen_loopblocking trampoline. 
''' + if cstr is None: + cstr = self.none_cstr for lbs in loop_blocking.gen_loopblocking( - self.nld[wlkey], self.resource[rsrckey], + self.nld[wlkey], self.resource[rsrckey], self.part, cstr, self.cost, self.options[optkey]): if not skip_invalid or lbs.is_valid(): yield lbs diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py index b3e5fe7..fc15ba9 100644 --- a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py @@ -14,8 +14,11 @@ """ import itertools +import math import unittest +from nn_dataflow.core import partition +from nn_dataflow.core import BufShrScheme from nn_dataflow.core import ConvLayer, PoolingLayer from nn_dataflow.core import Cost from nn_dataflow.core import DataDimLoops @@ -27,12 +30,16 @@ from nn_dataflow.core import NestedLoopDesc from nn_dataflow.core import NodeRegion from nn_dataflow.core import Option +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme from nn_dataflow.core import PhyDim2 from nn_dataflow.core import Resource +from nn_dataflow.core import SchedulingConstraint from nn_dataflow import util class TestLoopBlockingFixture(unittest.TestCase): ''' Base fixture class for LoopBlocking tests. ''' + # pylint: disable=too-many-instance-attributes def setUp(self): @@ -41,6 +48,7 @@ def setUp(self): self.layer['BASE'] = ConvLayer(12, 10, 28, 3) self.layer['LGFIL'] = ConvLayer(2, 4, 28, 20) self.layer['POOL'] = PoolingLayer(32, 28, 2) + self.layer['PAR'] = ConvLayer(24, 36, 56, 3) self.batch_size = 4 # Resource. @@ -55,19 +63,60 @@ def setUp(self): proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=dim_array, size_gbuf=65536, size_regf=64, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) # Larger resource with sufficient capacity, to make all schemes valid. self.resource['LG'] = Resource( proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) # Small resource. self.resource['SM'] = Resource( proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=dim_array, size_gbuf=4096, size_regf=16, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + # Multi-node parallel resource. + self.resource['PAR'] = Resource( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(4, 2), + type=NodeRegion.PROC), + dram_region=data_region, + src_data_region=data_region, dst_data_region=data_region, + dim_array=dim_array, size_gbuf=25000, size_regf=64, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + # Resource with no data regions. 
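+        # (These src/dst data regions are PROC type rather than DRAM type,
+        # exercising the newly allowed non-data-type data regions in
+        # Resource.)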
+ proc_data_region = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(1, 1), + type=NodeRegion.PROC) + self.resource['SRCNOTDATA'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=proc_data_region, dst_data_region=data_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + self.resource['DSTNOTDATA'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=data_region, dst_data_region=proc_data_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + self.resource['DATALOCAL'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=proc_region, dst_data_region=proc_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + # Filter pinning. + self.resource['FILPIN'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=data_region, dst_data_region=data_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=True) # Nested loop description after mapping. self.nld = {} @@ -114,6 +163,12 @@ def setUp(self): le.BAT)), unit_ops=1, unit_time=1) + # Fake partition scheme. + self.part = PartitionScheme(range(pe.NUM), ((1, 1),) * pe.NUM) + + # Fake buffer sharing scheme. + self.bufshr = BufShrScheme(proc_region, self.part) + # Options. self.options = {} # Basic. @@ -128,6 +183,20 @@ def setUp(self): self.options['BYPSOL'] = Option(sw_gbuf_bypass=(True,) * 3, sw_solve_loopblocking=True, ntops=2 ** 30) + # Access forwarding. + self.options['ACCFWD'] = Option(hw_access_forwarding=True, + ntops=2 ** 30) + # Buffer sharing. + self.options['BUFSHR'] = Option(hw_gbuf_sharing=True, + ntops=2 ** 30) + # Buffer sharing with bypassing. + self.options['BUFSHR-BYP'] = Option(sw_gbuf_bypass=(True,) * 3, + hw_gbuf_sharing=True, + ntops=2 ** 30) + + # Constraint. + self.none_cstr = SchedulingConstraint() + self.cstr = SchedulingConstraint(topifm=1, topbat=1) # Cost. self.cost = Cost(mac_op=1, mem_hier=(200, 6, 2, 1), @@ -140,7 +209,7 @@ def _lbs(self, bl_ts, bl_ords=None, wlkey='BASE', rsrckey='BASE', bl_ords = (tuple(range(le.NUM)), tuple(range(le.NUM))) \ if not bl_ords else bl_ords return LoopBlockingScheme(self.nld[wlkey], bl_ts, bl_ords, - self.resource[rsrckey], + self.resource[rsrckey], self.bufshr, self.options[optkey]) def _gen_loopblocking_all(self, wlkey='BASE'): @@ -196,6 +265,94 @@ def _make_bl_ts(self, ti_part, to_part, tb_part, wlkey='BASE'): lp_ts[le.BAT] = tb return tuple(zip(*lp_ts)) + def _part_nld(self, part, layerkey='PAR'): + ''' Make a partitioned NestedLoopDesc and its partition occupation. ''' + p_layer, p_batch_size, p_occ = part.part_layer(self.layer[layerkey], + self.batch_size) + p_nld = next(MapStrategyEyeriss(p_layer, p_batch_size, p_occ, + self.resource['PAR'].dim_array) + .gen_nested_loop_desc()) + return p_nld + + def _gen_all_partition(self, layerkey='PAR'): + ''' + Generate PartitionScheme. 
+ ''' + options = Option(partition_hybrid=True, + partition_batch=True, + partition_ifmaps=True, + ntops=2 ** 30) + + for part in partition.gen_partition( + self.layer[layerkey], self.batch_size, + self.resource['PAR'].proc_region.dim, options): + yield part + + def _total_part_size(self, part, layerkey='PAR'): + ''' Get the total partitioned data size. ''' + layer = self.layer[layerkey] + + nifm = util.idivc(layer.nifm, part.size(pe.INPP)) * part.size(pe.INPP) + nofm = util.idivc(layer.nofm, part.size(pe.OUTP)) * part.size(pe.OUTP) + hofm = util.idivc(layer.hofm, part.dim(pe.OFMP).h) * part.dim(pe.OFMP).h + wofm = util.idivc(layer.wofm, part.dim(pe.OFMP).w) * part.dim(pe.OFMP).w + batch_size = util.idivc(self.batch_size, part.size(pe.BATP)) \ + * part.size(pe.BATP) + + full_layer = ConvLayer(nifm, nofm, (hofm, wofm), + (layer.hfil, layer.wfil), + (layer.htrd, layer.wtrd)) + filter_size = full_layer.total_filter_size() + ifmap_size = full_layer.total_ifmap_size(batch_size) + ofmap_size = full_layer.total_ofmap_size(batch_size) + + self.assertGreaterEqual(filter_size, layer.total_filter_size()) + self.assertLess(filter_size, layer.total_filter_size() * 1.2 * 1.2) + self.assertGreaterEqual(ofmap_size, + layer.total_ofmap_size(self.batch_size)) + self.assertLess(ofmap_size, + layer.total_ofmap_size(self.batch_size) + * 1.2 * 1.2 * 1.2) + self.assertGreaterEqual(ifmap_size, + layer.total_ifmap_size(self.batch_size)) + + return filter_size, ifmap_size, ofmap_size + + def _bufshr_params(self, lbs): + ''' + Get buffer sharing parameters. + + Return subgroup sizes, rotation unit counts. + + Finally, a list of ordered loops as a tuple of LoopEnum and blocking + factor ordered from outermost to innermost excluding trivial loops. + ''' + # GBUF level. + blp1 = lbs.BL.GBUF + 1 + t_x = lbs.bl_ts[blp1] + ord_x = lbs.bl_ords[blp1] + # BS level. + t_bs = lbs.bufshr_bs_t + ord_bs = lbs.bufshr_bs_ord + + self.assertTrue(all(x % b == 0 for x, b in zip(t_x, t_bs))) + + subgrp_size = lbs.bufshr_subgrp_size + rot_unit_cnt = lbs.bufshr_rot_unit_cnt + + # Loops as a tuple of LoopEnum and blocking factor, ordered from + # outermost to innermost, excluding trivial loops. + lp_t_list = sorted([(lpe, t_bs[lpe]) + for lpe in range(le.NUM) if t_bs[lpe] > 1], + key=lambda tpl: ord_bs[tpl[0]], + reverse=True) \ + + sorted([(lpe, t_x[lpe] / t_bs[lpe]) + for lpe in range(le.NUM) if t_x[lpe] > t_bs[lpe]], + key=lambda tpl: ord_x[tpl[0]], + reverse=True) + + return subgrp_size, rot_unit_cnt, lp_t_list + class _SimBuffer(object): ''' A data buffer model for simulation. ''' @@ -222,6 +379,9 @@ def __init__(self, dce, buf_cnt_pr, unit_size, bypass=False): # E.g., (c0, c1). self.buf_cnt_pr = buf_cnt_pr + # Range index cache. + self.ridx_pr_cache = {} + def access_size(self): ''' Get access size. ''' return self.access * self.unit_size @@ -239,8 +399,7 @@ def do_access(self, idx_pr, cnt_pr, read=1, write=0): return cnt_pr # Range index. - ridx_pr = tuple(idx // buf_cnt for idx, buf_cnt - in zip(idx_pr, self.buf_cnt_pr)) + ridx_pr = self._range_idx_pr(idx_pr) # Access. self.access += util.prod(cnt_pr) * (read + write) @@ -253,10 +412,308 @@ def do_access(self, idx_pr, cnt_pr, read=1, write=0): self.data = ridx_pr return self.buf_cnt_pr - def _sim_access_conv(self, lbs): + def _range_idx_pr(self, idx_pr): + ''' Get the range index of all dimensions. 
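+            For example (illustrative values): with buf_cnt_pr = (4, 2),
+            index (9, 3) falls in range (9 // 4, 3 // 2) = (2, 1).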
''' + ridx_pr = self.ridx_pr_cache.get(idx_pr, None) + if ridx_pr is None: + ridx_pr = tuple(idx // buf_cnt for idx, buf_cnt + in zip(idx_pr, self.buf_cnt_pr)) + self.ridx_pr_cache[idx_pr] = ridx_pr + return ridx_pr + + class _SimBufferSharing(_SimBuffer): + ''' A data buffer model with buffer sharing. ''' + + def __init__(self, dce, buf_cnt_pr, unit_size, + subgrp_size, rot_unit_cnt, lp_t_list, dim_loops, + bypass=False): + + # pylint: disable=protected-access + self.base = super(TestLoopBlockingFixture._SimBufferSharing, self) + + self.base.__init__(dce, buf_cnt_pr, unit_size, bypass=bypass) + + # Number of rotation steps, of each range. + self.rot_step_cnt = {} + # Rotation accesses, in unit counts (* unit size). + self.rot_access = 0 + # Wide fetch accesses, in unit counts (* unit size). + self.wf_access = 0 + + # Rotation rounds per load of a range. If only rotate a single + # round per data load, the rotation is unnecessary. + self.rot_rnd_cnt_per_load = None + + if self.bypass: + return + + # Subrange. + # A list in the accessing order of subrange indexes, i.e., the + # ranges of the next level; and the unit counts in one subrange. + self.subrng_list, self.subrng_cnt_pr = \ + self._init_sub_range(lp_t_list, dim_loops) + # Subrange index to the position in the list. + self.subrng_idx_dict = \ + dict((sr, i) for i, sr in enumerate(self.subrng_list)) + # Number of subranges. + self.subrng_num = len(self.subrng_list) + + # Local buffer. + self.buf_num = subgrp_size + # Number of subranges in each buffer. + self.buf_subrng_num = 1. * self.subrng_num / self.buf_num + + # The location centroid of each subrange, i.e., buffer index + # weighted by fraction. + self.buf_subrng_centroid = [] + cur_buf_cap = self.buf_subrng_num + cur_buf_idx = 0 + for _ in range(self.subrng_num): + centroid = 0 + rem_frac = 1. + while rem_frac > 0.: + if cur_buf_cap >= rem_frac: + # Fits in the current buffer. + centroid += cur_buf_idx * rem_frac + cur_buf_cap -= rem_frac + rem_frac = 0. + break + else: + # Partially fits. + centroid += cur_buf_idx * cur_buf_cap + rem_frac -= cur_buf_cap + cur_buf_cap = self.buf_subrng_num + cur_buf_idx += 1 + self.buf_subrng_centroid.append(centroid) + + # Rotation unit. + # Rotation step happens when moving to the new rotation unit. + assert self.subrng_num % rot_unit_cnt == 0 + self.rot_unit_size = self.subrng_num // rot_unit_cnt + # Steps per rotation round. + self.rot_steps_per_round = 1 + while (self.rot_steps_per_round * self.rot_unit_size + + self.buf_subrng_num < self.subrng_num + and (self.rot_steps_per_round + 1) * self.rot_unit_size + < self.subrng_num): + self.rot_steps_per_round += 1 + + # The rotation unit currently worked on. + self.cur_rot_unit = 0 + # Rotation steps of the current load of the current range. + self.cur_rot_step_cnt = 0 + + # Last wide fetch subrange index. + self.last_wf_subrng_idx = 0 + # Amount of sequential wide fetch, can be combined with rotation. + self.seq_wf_acc = 0 + # Total saved (combined with rotation) wide fetch access. + self.saved_wf_access = 0 + + # Subrange index cache. + self.sridx_pr_cache = {} + + def rotation_rounds(self): + ''' Get number of rotation rounds. ''' + + # Ensure all ranges have the same rotation steps. 
+ steps_list = self.rot_step_cnt.values() + if not steps_list: + return 0 + assert all(s == steps_list[0] for s in steps_list) + steps = steps_list[0] + if steps == 0: + return 0 + + assert steps % self.rot_steps_per_round == 0 + + if self.rot_rnd_cnt_per_load == 1: + return 0 + return steps // self.rot_steps_per_round + + def rotation_access_size(self): + ''' Get total rotation access size. ''' + if self.rot_rnd_cnt_per_load == 1: + return 0 + return self.rot_access * self.unit_size + + def wide_fetch_access_size(self): + ''' Get total wide fetch access size. ''' + if self.rot_rnd_cnt_per_load == 1: + return (self.wf_access + self.saved_wf_access) * self.unit_size + return self.wf_access * self.unit_size + + def do_access(self, idx_pr, cnt_pr, read=1, write=0): + + ret = self.base.do_access(idx_pr, cnt_pr, read=read, write=write) + + if self.bypass: + # Bypass, skip buffer sharing. + return ret + + # Range index. + ridx_pr = self._range_idx_pr(idx_pr) + + if any(ret): + # Miss in the shared buffer and load new range. Reset. + self.cur_rot_unit = 0 + self.rot_step_cnt.setdefault(ridx_pr, 0) + + if self.cur_rot_step_cnt == 0: + # Initial fetch, no replaced data yet. + assert self.rot_rnd_cnt_per_load is None + else: + rot_rnd_cnt_per_load, rem_ = divmod( + self.cur_rot_step_cnt, self.rot_steps_per_round) + assert rem_ == 0 + assert self.rot_rnd_cnt_per_load is None \ + or self.rot_rnd_cnt_per_load == rot_rnd_cnt_per_load + self.rot_rnd_cnt_per_load = rot_rnd_cnt_per_load + self.cur_rot_step_cnt = 0 + + assert all(cnt <= subrng_cnt for cnt, subrng_cnt + in zip(cnt_pr, self.subrng_cnt_pr)) + + # Subrange index. + sridx_pr = self._subrange_idx_pr(idx_pr) + + # Rotation unit index. + ru_idx = self._subrng_rot_unit_idx(sridx_pr) + + if ru_idx != self.cur_rot_unit: + # Move to next rotation unit. + + if (self.cur_rot_unit + 1) * self.rot_unit_size \ + >= self.subrng_num: + # The current rotation unit is the last one. Start a new + # rotation round. + # Do not rotate back to the initial state. Instead start + # from the current state. + self.cur_rot_unit = 0 + + self.last_wf_subrng_idx = 0 + self.seq_wf_acc = 0 + + elif self.cur_rot_unit * self.rot_unit_size \ + + self.buf_subrng_num >= self.subrng_num: + # The last rotation unit is already local. No more rotation. + self.cur_rot_unit += 1 + + else: + # Rotate by one rotation unit, but not exceeding the end. + offset = min(self.rot_unit_size, + self.subrng_num + - self.cur_rot_unit * self.rot_unit_size + - self.buf_subrng_num) + assert offset > 0 + + # All subranges shift by the above offset. + acc_ = (1. * offset / self.buf_subrng_num) * self.subrng_num + self.rot_access += util.prod(self.subrng_cnt_pr) * acc_ + self.cur_rot_unit += 1 + + # One rotation step. + self.rot_step_cnt[ridx_pr] += 1 + self.cur_rot_step_cnt += 1 + + # Combine wide fetch with rotation. + self.wf_access -= self.seq_wf_acc + self.saved_wf_access += self.seq_wf_acc + self.seq_wf_acc = 0 + + assert ru_idx == self.cur_rot_unit + + # Buffer index of which has this subrange. + buf_idx = self._subrng_buf_idx(sridx_pr) + + # Wide fetch from possibly remote buffer. + wf_acc = util.prod(cnt_pr) * (read + write) * buf_idx + self.wf_access += wf_acc + + # Record amount of sequential wide fetch. + subrng_idx = self.subrng_idx_dict[sridx_pr] + if subrng_idx >= self.last_wf_subrng_idx: + self.seq_wf_acc += wf_acc + else: + self.seq_wf_acc = wf_acc + self.last_wf_subrng_idx = subrng_idx + + return ret + + def _subrange_idx_pr(self, idx_pr): + ''' Get the subrange index of all dimensions. 
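+            For example (illustrative values): with buf_cnt_pr = (8,) and
+            subrng_cnt_pr = (2,), index (13,) maps to ((13 % 8) // 2,) = (2,).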
''' + sridx_pr = self.sridx_pr_cache.get(idx_pr, None) + if sridx_pr is None: + sridx_pr = tuple((idx % buf_cnt) // subrng_cnt + for idx, buf_cnt, subrng_cnt + in zip(idx_pr, self.buf_cnt_pr, + self.subrng_cnt_pr)) + self.sridx_pr_cache[idx_pr] = sridx_pr + return sridx_pr + + def _subrng_rot_unit_idx(self, sridx_pr): + ''' Get the rotation unit index of the subrange. ''' + return self.subrng_idx_dict[sridx_pr] // self.rot_unit_size + + def _subrng_buf_idx(self, sridx_pr): + ''' Get the buffer index of which currently has the subrange. ''' + subrng_idx = self.subrng_idx_dict[sridx_pr] + + # Start from the current rotation unit. + subrng_idx -= self.cur_rot_unit * self.rot_unit_size + subrng_idx %= self.subrng_num + + return self.buf_subrng_centroid[subrng_idx] + + def _init_sub_range(self, lp_t_list, dim_loops): + + assert len(dim_loops) == 2 + + subrng_list = [(0, 0)] + subrng_sz_pr = [1, 1] + + # From inner to outer. + for lpe, t in reversed(lp_t_list): + # The data dimension index of this loop. + try: + d = dim_loops.index(lpe) + except ValueError: + # This loop is not related to the data, skip. + assert lpe not in dim_loops + continue + + # Size of this dimension of current loop body, i.e., all inner + # loops. + s = subrng_sz_pr[d] + + # Make the new subrange list, by looping over the current loop + # body with the current loop factor, and updating this + # dimension. + new_subrng_list = [] + for i in range(t): + new_subrng_list += [tuple(i_ + i * s if d_ == d else i_ + for d_, i_ in enumerate(sr)) + for sr in subrng_list] + subrng_list = new_subrng_list + + # Update size of this dimension. + subrng_sz_pr[d] *= t + + # Check. + assert len(set(subrng_list)) == len(subrng_list) + assert len(subrng_list) == util.prod(subrng_sz_pr) + + subrng_cnt_pr = tuple(buf_cnt // subrng_sz for buf_cnt, subrng_sz + in zip(self.buf_cnt_pr, subrng_sz_pr)) + + return subrng_list, subrng_cnt_pr + + def _sim_access_conv(self, lbs, get_bufshr=False): ''' Get data access by actually simulating and generating loops for CONV layer. + + If `get_bufshr` is True, also return bufshr stats. ''' self.assertTrue(lbs.is_valid(), '_sim_access_conv: invalid lbs.') @@ -264,6 +721,9 @@ def _sim_access_conv(self, lbs): lpts = zip(*lbs.bl_ts) + subgrp_size, rot_unit_cnt, lp_t_list = self._bufshr_params(lbs) + data_loops = lbs.nld.data_loops + # Get buffered unit counts at each level. dram_buf_cnt_pr_list = [tuple(util.prod(lpts[lpe]) for lpe in data_loops[dce].loops()) @@ -285,10 +745,11 @@ def _sim_access_conv(self, lbs): ) gbufs = [None] * de.NUM for dce, buf_cnt_pr in enumerate(gbuf_buf_cnt_pr_list): - gbufs[dce] = self._SimBuffer(dce, buf_cnt_pr, - lbs.nld.unit_access[me.GBUF][dce], - bypass=(not lbs.stored_in_gbuf[dce]), - ) + gbufs[dce] = self._SimBufferSharing( + dce, buf_cnt_pr, lbs.nld.unit_access[me.GBUF][dce], + subgrp_size[dce], rot_unit_cnt[dce], lp_t_list, + data_loops[dce].loops(), + bypass=(not lbs.stored_in_gbuf[dce])) regfs = [None] * de.NUM for dce, buf_cnt_pr in enumerate(regf_buf_cnt_pr_list): regfs[dce] = self._SimBuffer(dce, buf_cnt_pr, @@ -334,8 +795,151 @@ def _sim_access_conv(self, lbs): dram_access = [drams[dce].access_size() for dce in range(de.NUM)] gbuf_access = [gbufs[dce].access_size() for dce in range(de.NUM)] + + # Sum over all nodes. + dram_access = [a * lbs.num_nodes // r for a, r + in zip(dram_access, lbs.accfwd_reduction)] + gbuf_access = [a * lbs.num_nodes for a in gbuf_access] + + # Buffer sharing. 
+ if get_bufshr: + rotation_access = [gbufs[dce].rotation_access_size() + * (lbs.num_nodes // subgrp_size[dce]) + for dce in range(de.NUM)] + wide_fetch_access = [gbufs[dce].wide_fetch_access_size() + * (lbs.num_nodes // subgrp_size[dce]) + for dce in range(de.NUM)] + rotation_rounds = [gbufs[dce].rotation_rounds() + for dce in range(de.NUM)] + + return dram_access, gbuf_access, \ + (rotation_access, wide_fetch_access, rotation_rounds) + + else: + for dce in range(de.NUM): + self.assertAlmostEqual(gbufs[dce].rotation_access_size(), 0, + msg='_sim_access_conv: non-0 ' + 'rotation access with no bufshr.') + self.assertAlmostEqual(gbufs[dce].wide_fetch_access_size(), 0, + msg='_sim_access_conv: non-0 ' + 'wide fetch access with no bufshr.') + self.assertEqual(gbufs[dce].rotation_rounds(), 0, + msg='_sim_access_conv: non-0 ' + 'rotation rounds with no bufshr.') + return dram_access, gbuf_access + def _average_neighbor_nhops(self, bufshr, subgrp_size): + ''' Get the average neighbor number of hops. ''' + + avg_nbr_nhops = [] + + for dce in range(de.NUM): + # pylint: disable=protected-access + + subgrp_dim, idx_pr = bufshr._subgrp_dim(dce, subgrp_size[dce]) + nbr_dist = bufshr.nbr_dists[dce] + + d_pr = subgrp_dim[idx_pr] + d_npr = subgrp_dim[1 - idx_pr] + n_pr = (d_pr - 1) * d_npr + n_npr = d_npr - 1 + nhops_nbr = bufshr._nhops_with_neighbor_dist( + dce, + PhyDim2(*[tpl[1] for tpl + in sorted([(idx_pr, n_pr), (1 - idx_pr, n_npr)])])) + + nhops_nbr /= 1. * subgrp_size[dce] + + coord = bufshr._coordinate(subgrp_size[dce] - 1, subgrp_dim, idx_pr) + nhops_lpbk = bufshr._nhops_with_neighbor_dist(dce, coord) + + nhops_lpbk /= 1. * subgrp_size[dce] + + nhops = nhops_nbr + nhops_lpbk + + if subgrp_size[dce] <= 1: + self.assertAlmostEqual(nhops, 0) + elif subgrp_dim.size() == subgrp_size[dce]: + self.assertTrue(min(nbr_dist) <= nhops + <= max(nbr_dist) + + 1. * sum(subgrp_dim) / subgrp_dim.size(), + '_average_neighbor_nhops: {}: ' + 'subgrp_size {}, subgrp_dim {}, idx_pr {}, ' + 'nbr_dist {}, nhops {} = {} + {}' + .format(dce, subgrp_size[dce], subgrp_dim, + idx_pr, nbr_dist, + nhops, nhops_nbr, nhops_lpbk)) + + assert not math.isnan(nhops) and not math.isinf(nhops) + avg_nbr_nhops.append(nhops) + + return avg_nbr_nhops + + def _verify_bufshr_stats(self, dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, test_name): + ''' Verify the buffer sharing stats returned by access simulation. ''' + + rotation_access, wide_fetch_access, rotation_rounds = bufshr_stats + + avg_nbr_nhops = self._average_neighbor_nhops(bufshr, + lbs.bufshr_subgrp_size) + + # Mem hierarchy. + access = lbs.get_access() + + self.assertListEqual(access[me.DRAM], dram_access, + 'test_access: DRAM: ' + 'model {} vs. sim {}.' + .format(access[me.DRAM], dram_access)) + self.assertListEqual(access[me.GBUF], gbuf_access, + 'test_access: GBUF: ' + 'model {} vs. sim {}.' + .format(access[me.GBUF], gbuf_access)) + self.assertListEqual(access[me.REGF], + [lbs.ops, lbs.ops, lbs.ops * 2]) + + # NoC. + noc_access = lbs.get_noc_access() + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_rotation_access[dce] + + lbs.bufshr_wide_fetch_access[dce], + noc_access[dce]) + + for dce in range(de.NUM): + if lbs.bufshr_subgrp_size[dce] <= 1: + self.assertAlmostEqual(noc_access[dce], 0) + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_rot_round_cnt[dce], + rotation_rounds[dce], + msg=('{}: mismatch rotation round count ' + 'at {}:\nmodel: {}; sim: {}.' 
+ .format(test_name, dce, + lbs.bufshr_rot_round_cnt, + rotation_rounds))) + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_rotation_access[dce], + rotation_access[dce] * avg_nbr_nhops[dce], + msg=('{}: mismatch NoC rotation access ' + 'at {}:\nmodel: {}; sim: {} x {}.' + .format(test_name, dce, + lbs.bufshr_rotation_access, + rotation_access, + avg_nbr_nhops))) + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_wide_fetch_access[dce], + wide_fetch_access[dce] * avg_nbr_nhops[dce], + msg=('{}: mismatch NoC wide fetch access ' + 'at {}:\nmodel: {}; sim: {} x {}.' + .format(test_name, dce, + lbs.bufshr_wide_fetch_access, + wide_fetch_access, + avg_nbr_nhops))) + def _regularized_scheme(self, bl_ts, bl_ords): ''' Get the regularized scheme which will not be skipped. ''' diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_partition.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_partition.py new file mode 100644 index 0000000..1f8f1e5 --- /dev/null +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_partition.py @@ -0,0 +1,412 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +from nn_dataflow.core import BufShrScheme +from nn_dataflow.core import DataCategoryEnum as de +from nn_dataflow.core import loop_blocking +from nn_dataflow.core import LoopBlockingScheme +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow import util + +from . import TestLoopBlockingFixture + +class TestLoopBlockingPartition(TestLoopBlockingFixture): + ''' Tests for LoopBlocking module with partitioning. ''' + + def setUp(self): + + super(TestLoopBlockingPartition, self).setUp() + + # LoopBlockingScheme records stats of all nodes. + self.total_ops = self.layer['PAR'].total_ops(self.batch_size) + + self.par_proc_region = self.resource['PAR'].proc_region + + def test_accfwd(self): + ''' Scheme using accfwd. ''' + + for part in self._gen_all_partition(): + + p_nld = self._part_nld(part) + + filter_size, ifmap_size, ofmap_size = self._total_part_size(part) + + bufshr = BufShrScheme(self.par_proc_region, part) + + # Filter may still have redundant fetch. + fil_fetch = part.size(pe.BATP, pe.OFMP) // bufshr.size(de.FIL) + + for lbs in loop_blocking.gen_loopblocking( + p_nld, self.resource['PAR'], part, self.none_cstr, + self.cost, self.options['ACCFWD']): + if not lbs.is_valid(): + continue + + # Ops. + self.assertAlmostEqual(lbs.ops, self.total_ops) + + # Access forwarding reduction. + accfwd_red = lbs.accfwd_reduction + self.assertEqual(accfwd_red[de.FIL], + part.size(pe.BATP, pe.OFMP) // fil_fetch) + self.assertEqual(accfwd_red[de.OFM], part.size(pe.INPP)) + self.assertEqual(accfwd_red[de.IFM], part.size(pe.OUTP)) + + # Top fetch and access. 
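+                # (Illustrative check, with made-up numbers: each top-level
+                # fetch reads the whole partitioned data size once, so a FIL
+                # fetch count of 2 with redundancy factor fil_fetch = 2 gives
+                # top_access[de.FIL] = 2 * filter_size * 2.)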
+                top_fetch = lbs.fetch[0]
+                top_access = lbs.access[0]
+                self.assertAlmostEqual(top_access[de.FIL],
+                                       top_fetch[de.FIL] * filter_size
+                                       * fil_fetch)
+                self.assertAlmostEqual(top_access[de.OFM],
+                                       top_fetch[de.OFM] * ofmap_size)
+                self.assertGreaterEqual(top_access[de.IFM],
+                                        top_fetch[de.IFM] * ifmap_size)
+
+    def test_bufshr(self):
+        ''' Scheme using bufshr. '''
+
+        for part in self._gen_all_partition():
+
+            p_nld = self._part_nld(part)
+
+            bufshr = BufShrScheme(self.par_proc_region, part)
+
+            # Filter may still have redundant fetch.
+            fil_fetch = part.size(pe.BATP, pe.OFMP) // bufshr.size(de.FIL)
+
+            for optkey in ['BUFSHR', 'BUFSHR-BYP']:
+
+                for lbs in loop_blocking.gen_loopblocking(
+                        p_nld, self.resource['PAR'], part, self.none_cstr,
+                        self.cost, self.options[optkey]):
+                    if not lbs.is_valid():
+                        continue
+
+                    # Ops.
+                    self.assertAlmostEqual(lbs.ops, self.total_ops)
+
+                    # Buffer sharing uses access forwarding reduction.
+                    accfwd_red = lbs.accfwd_reduction
+                    self.assertEqual(accfwd_red[de.FIL],
+                                     part.size(pe.BATP, pe.OFMP) // fil_fetch)
+                    self.assertEqual(accfwd_red[de.OFM], part.size(pe.INPP))
+                    self.assertEqual(accfwd_red[de.IFM], part.size(pe.OUTP))
+
+                    # Buffer sharing group size.
+                    bufshr_grp_size = lbs.bufshr_grp_size
+                    self.assertSequenceEqual(bufshr_grp_size, accfwd_red)
+
+                    # Buffer sharing subgroup size.
+                    bufshr_subgrp_size = lbs.bufshr_subgrp_size
+                    self.assertTrue(all(subgrp <= grp for subgrp, grp
+                                        in zip(bufshr_subgrp_size,
+                                               bufshr_grp_size)))
+
+    def test_bufshr_access(self):
+        ''' Access of scheme using bufshr. '''
+
+        for part in self._gen_all_partition():
+
+            p_nld = self._part_nld(part)
+
+            bufshr = BufShrScheme(self.par_proc_region, part)
+
+            for lbs in loop_blocking.gen_loopblocking(
+                    p_nld, self.resource['PAR'], part, self.none_cstr,
+                    self.cost, self.options['BUFSHR']):
+                if not lbs.is_valid():
+                    continue
+
+                # Skip those without bufshr.
+                if all(sgs <= 1 for sgs in lbs.bufshr_subgrp_size):
+                    continue
+
+                # Sim.
+                dram_access, gbuf_access, bufshr_stats = \
+                        self._sim_access_conv(lbs, get_bufshr=True)
+
+                self._verify_bufshr_stats(dram_access, gbuf_access,
+                                          bufshr_stats, lbs, bufshr,
+                                          'test_bufshr_access')
+
+    def test_bufshr_access_byp(self):
+        ''' Access of scheme using bufshr with bypassing. '''
+
+        for part in self._gen_all_partition():
+
+            p_nld = self._part_nld(part)
+
+            bufshr = BufShrScheme(self.par_proc_region, part)
+
+            for lbs in loop_blocking.gen_loopblocking(
+                    p_nld, self.resource['PAR'], part, self.none_cstr,
+                    self.cost, self.options['BUFSHR-BYP']):
+                if not lbs.is_valid():
+                    continue
+
+                # Skip those without bufshr.
+                if all(sgs <= 1 for sgs in lbs.bufshr_subgrp_size):
+                    continue
+                # Skip those without bypassing.
+                if all(lbs.stored_in_gbuf):
+                    continue
+
+                # Sim.
+                dram_access, gbuf_access, bufshr_stats = \
+                        self._sim_access_conv(lbs, get_bufshr=True)
+
+                self._verify_bufshr_stats(dram_access, gbuf_access,
+                                          bufshr_stats, lbs, bufshr,
+                                          'test_bufshr_access_byp')
+
+    def test_bufshr_rotation_example(self):
+        ''' Example scheme using bufshr with rotation. '''
+
+        # Make a PartitionScheme that allows bufshr for all data categories.
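+        # (pdims is indexed by ParallelEnum, one (h, w) partitioning factor
+        # pair per parallelism dimension.)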
+ part = PartitionScheme(order=range(pe.NUM), + pdims=((2, 1), (1, 2), (1, 1), (2, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertTrue(all(bufshr.size(dce) > 1 for dce in range(de.NUM)), + 'test_bufshr_rotation_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size {}' + .format(part, + [bufshr.size(dce) for dce in range(de.NUM)])) + + # Make a LoopBlockingScheme that uses bufshr for all data categories. + p_nld = self._part_nld(part) + bl_ts = ((util.idivc(p_nld.loopcnt[le.IFM], 6), + util.idivc(p_nld.loopcnt[le.OFM], 9), + util.idivc(p_nld.loopcnt[le.BAT], 2)), + (3, 3, 2), + (2, 3, 1)) + bl_ords = (tuple(range(le.NUM)), tuple(range(le.NUM))) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, self.resource['PAR'], + bufshr, self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertTrue(all(sgs > 1 for sgs in lbs.bufshr_subgrp_size) + and all(t > 1 for t in bl_ts[0]), + 'test_bufshr_rotation_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, top factors {}, bufshr subgrp size {}' + .format((bl_ts, bl_ords), bl_ts[0], + lbs.bufshr_subgrp_size)) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, 'test_bufshr_rotation_example') + + def test_bufshr_skip_rot_example(self): + ''' Example scheme using bufshr that skips the single rotation. ''' + + # Make a PartitionScheme that allows bufshr for IFM. + part = PartitionScheme(order=range(pe.NUM), + pdims=((2, 2), (1, 1), (2, 1), (1, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertEqual(bufshr.size(de.IFM), 4, + 'test_bufshr_skip_rot_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size for {} {}.' + .format(part, de.IFM, bufshr.size(de.IFM))) + + # Make a LoopBlockingScheme that has a single rotation for IFM. + p_nld = self._part_nld(part) + bl_ts = ((util.idivc(p_nld.loopcnt[le.IFM], 3), + util.idivc(p_nld.loopcnt[le.OFM], 3), + util.idivc(p_nld.loopcnt[le.BAT], 2)), + (1, 1, 2), + (3, 3, 1)) + bl_ords = (tuple(range(le.NUM)), tuple(range(le.NUM))) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, self.resource['PAR'], + bufshr, self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertEqual(lbs.bufshr_subgrp_size[de.IFM], 4, + 'test_bufshr_skip_rot_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr subgrp size for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_subgrp_size[de.IFM])) + self.assertGreater(lbs.bufshr_wide_fetch_width[de.IFM], 1, + 'test_bufshr_skip_rot_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr wide fetch width for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_wide_fetch_width[de.IFM])) + self.assertEqual(lbs.bufshr_rot_round_cnt[de.IFM], 0, + 'test_bufshr_skip_rot_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr rotation rounds for {} {}' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_rot_round_cnt[de.IFM])) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, + 'test_bufshr_skip_rot_example') + + def test_bufshr_wide_fetch_example(self): + ''' Example scheme using bufshr with wide fetch. 
''' + + # Make a PartitionScheme that allows bufshr for IFM. + part = PartitionScheme(order=range(pe.NUM), + pdims=((2, 2), (1, 1), (2, 1), (1, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertEqual(bufshr.size(de.IFM), 4, + 'test_bufshr_wide_fetch_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size for {} {}.' + .format(part, de.IFM, bufshr.size(de.IFM))) + + for t1, t2 in [((3, 3, 1), (1, 1, 2)), + ((1, 3, 2), (3, 1, 1))]: + # Make a LoopBlockingScheme that has wide fetch for IFM. + p_nld = self._part_nld(part) + bl_ts = (tuple(util.idivc(p_nld.loopcnt[lpe], t1[lpe] * t2[lpe]) + for lpe in range(le.NUM)), + t1, t2) + # At GBUF level, from inner to outer: le.BAT, le.IFM, le.OFM. + bl_ords = (tuple(range(le.NUM)), (1, 2, 0)) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, + self.resource['PAR'], bufshr, + self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertEqual(lbs.bufshr_subgrp_size[de.IFM], 4, + 'test_bufshr_wide_fetch_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr subgrp size for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_subgrp_size[de.IFM])) + self.assertGreater(lbs.bufshr_wide_fetch_width[de.IFM], 1, + 'test_bufshr_wide_fetch_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr wide fetch width for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_wide_fetch_width[de.IFM])) + self.assertGreater(lbs.bufshr_rot_round_cnt[de.IFM], 0, + 'test_bufshr_wide_fetch_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr rotation rounds for {} {}' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_rot_round_cnt[de.IFM])) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, + 'test_bufshr_wide_fetch_example') + + def test_bufshr_multisubgrp_example(self): + ''' Example scheme using bufshr with multiple subgroups in a group. ''' + + # Make a PartitionScheme that allows bufshr for IFM. + part = PartitionScheme(order=list(reversed(range(pe.NUM))), + pdims=((2, 2), (1, 1), (2, 1), (1, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertEqual(bufshr.size(de.IFM), 4, + 'test_bufshr_multisubgrp_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size for {} {}.' + .format(part, de.IFM, bufshr.size(de.IFM))) + + # Make a LoopBlockingScheme that has multi subgroups per group for IFM. + p_nld = self._part_nld(part) + bl_ts = ((util.idivc(p_nld.loopcnt[le.IFM], 1), + util.idivc(p_nld.loopcnt[le.OFM], 3), + util.idivc(p_nld.loopcnt[le.BAT], 2)), + (1, 3, 2), + (1, 1, 1)) + # At GBUF level, from inner to outer: le.BAT, le.OFM, le.IFM. 
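+        # (Each bl_ords entry is indexed by LoopEnum and gives that loop's
+        # position, with 0 being the innermost.)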
+ bl_ords = (tuple(range(le.NUM)), (2, 1, 0)) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, self.resource['PAR'], + bufshr, self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertGreater(lbs.bufshr_grp_size[de.IFM], + lbs.bufshr_subgrp_size[de.IFM], + 'test_bufshr_multisubgrp_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr grp size {}, bufshr subgrp size {}' + .format((bl_ts, bl_ords), lbs.bufshr_grp_size, + lbs.bufshr_subgrp_size)) + self.assertGreater(lbs.bufshr_rot_round_cnt[de.IFM], 0, + 'test_bufshr_multisubgrp_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr rotation rounds for {} {}' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_rot_round_cnt[de.IFM])) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, + 'test_bufshr_multisubgrp_example') + + def test_bufshr_get_noc_access(self): + ''' get_noc_access of scheme using bufshr. ''' + + for part in self._gen_all_partition(): + + p_nld = self._part_nld(part) + + for lbs in loop_blocking.gen_loopblocking( + p_nld, self.resource['PAR'], part, self.none_cstr, + self.cost, self.options['BUFSHR']): + + noc_access = lbs.get_noc_access() + + if not lbs.is_valid(): + self.assertIsNone(noc_access) + + else: + for dce in range(de.NUM): + self.assertAlmostEqual( + lbs.bufshr_rotation_access[dce] + + lbs.bufshr_wide_fetch_access[dce], + noc_access[dce]) + + def test_bufshr_localregionlayer(self): + ''' Scheme using bufshr for LocalRegionLayer. ''' + + for part in self._gen_all_partition(layerkey='POOL'): + + p_nld = self._part_nld(part, layerkey='POOL') + + for lbs in loop_blocking.gen_loopblocking( + p_nld, self.resource['PAR'], part, self.none_cstr, + self.cost, self.options['BUFSHR']): + if not lbs.is_valid(): + continue + + self.assertTrue(all(gs == 1 for gs in lbs.bufshr_grp_size), + 'test_bufshr_localregionlayer: ' + 'non-1 bufshr group size {}, part {}' + .format(lbs.bufshr_grp_size, part)) + diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py index eafa94a..d215874 100644 --- a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py @@ -381,3 +381,98 @@ def test_ordered_loops(self): self.assertListEqual(list(reversed(rev_loops)), ord_loops) self.assertListEqual([tpl[0] for tpl in ord_loops], ord_lpes) + def test_data_region_fetch(self): + ''' PROC type data regions. ''' + + # Multiple fetches with normal DATA regions. + bl_ts = self._make_bl_ts((0, 1, 1), (0, 1, 1), (0, 1, 1)) + bl_ords = [[0] * le.NUM for _ in range(2)] + bl_ords[0][le.IFM] = 1 + bl_ords[0][le.OFM] = 2 + bl_ords[0][le.BAT] = 0 + bl_ords[1] = range(le.NUM) + lbs_norm = self._lbs(bl_ts, bl_ords) + self.assertTrue(lbs_norm.is_valid()) + self.assertGreater(lbs_norm.fetch[0][de.IFM], 1) + self.assertGreater(lbs_norm.fetch[0][de.OFM], 1) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='SRCNOTDATA') + self.assertFalse(lbs.is_valid()) + lbs = self._lbs(bl_ts, bl_ords, rsrckey='DSTNOTDATA') + self.assertFalse(lbs.is_valid()) + + # Single top-level fetch. 
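+        # (Top-level blocking factors of 1 mean each data category is read
+        # from the top memory level only once.)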
+ bl_ts = self._make_bl_ts((1, 0, 1), (1, 0, 1), (1, 0, 1)) + lbs_norm = self._lbs(bl_ts, rsrckey='LG') + + lbs = self._lbs(bl_ts, rsrckey='SRCNOTDATA') + self.assertTrue(lbs.is_valid()) + self.assertLess(lbs.get_access_cost(self.cost), + lbs_norm.get_access_cost(self.cost)) + self.assertAlmostEqual(lbs_norm.get_access_cost(self.cost) + - lbs.get_access_cost(self.cost), + lbs.remote_gbuf_access[de.IFM] + * (self.cost.mem_hier_at(me.DRAM) + - self.cost.mem_hier_at(me.GBUF))) + self.assertAlmostEqual(lbs.access[me.DRAM][de.FIL], + lbs_norm.access[me.DRAM][de.FIL]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.IFM], 0) + self.assertAlmostEqual(lbs.access[me.DRAM][de.OFM], + lbs_norm.access[me.DRAM][de.OFM]) + self.assertAlmostEqual(lbs.access[me.GBUF][de.IFM], + lbs_norm.access[me.GBUF][de.IFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.IFM], + lbs_norm.access[me.DRAM][de.IFM]) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='DSTNOTDATA') + self.assertTrue(lbs.is_valid()) + self.assertLess(lbs.get_access_cost(self.cost), + lbs_norm.get_access_cost(self.cost)) + self.assertAlmostEqual(lbs_norm.get_access_cost(self.cost) + - lbs.get_access_cost(self.cost), + lbs.remote_gbuf_access[de.OFM] + * (self.cost.mem_hier_at(me.DRAM) + - self.cost.mem_hier_at(me.GBUF))) + self.assertAlmostEqual(lbs.access[me.DRAM][de.FIL], + lbs_norm.access[me.DRAM][de.FIL]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.IFM], + lbs_norm.access[me.DRAM][de.IFM]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.OFM], 0) + self.assertAlmostEqual(lbs.access[me.GBUF][de.OFM], + lbs_norm.access[me.GBUF][de.OFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.OFM], + lbs_norm.access[me.DRAM][de.OFM]) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='DATALOCAL') + self.assertTrue(lbs.is_valid()) + self.assertLess(lbs.get_access_cost(self.cost), + lbs_norm.get_access_cost(self.cost)) + self.assertAlmostEqual(lbs.access[me.DRAM][de.FIL], + lbs_norm.access[me.DRAM][de.FIL]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.IFM], 0) + self.assertAlmostEqual(lbs.access[me.DRAM][de.OFM], 0) + self.assertAlmostEqual(lbs.access[me.GBUF][de.IFM], + lbs_norm.access[me.GBUF][de.IFM]) + self.assertAlmostEqual(lbs.access[me.GBUF][de.OFM], + lbs_norm.access[me.GBUF][de.OFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.IFM], + lbs_norm.access[me.DRAM][de.IFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.OFM], + lbs_norm.access[me.DRAM][de.OFM]) + + def test_fil_pinning(self): + ''' Filter pinning. 
''' + + bl_ts = self._make_bl_ts((1, 0, 1), (1, 0, 1), (0, 1, 1)) + bl_ords = [range(le.NUM) for _ in range(2)] + + lbs_norm = self._lbs(bl_ts, bl_ords) + self.assertTrue(lbs_norm.is_valid()) + self.assertGreater(lbs_norm.fetch[0][de.FIL], 0) + self.assertGreater(lbs_norm.get_access()[0][de.FIL], 0) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='FILPIN') + self.assertTrue(lbs.is_valid()) + self.assertEqual(lbs.fetch[0][de.FIL], 0) + self.assertEqual(lbs.get_access()[0][de.FIL], 0) + diff --git a/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py b/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py index c448d45..f6c2458 100644 --- a/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py +++ b/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py @@ -66,5 +66,6 @@ def setUp(self): proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=PhyDim2(12, 14), size_gbuf=108*1024, size_regf=520, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) diff --git a/nn_dataflow/tests/partition_test/test_partition_fixture.py b/nn_dataflow/tests/partition_test/test_partition_fixture.py index 28bab0d..90ce832 100644 --- a/nn_dataflow/tests/partition_test/test_partition_fixture.py +++ b/nn_dataflow/tests/partition_test/test_partition_fixture.py @@ -69,6 +69,16 @@ def setUp(self): partition_batch=True, partition_ifmaps=False, **optdict) + self.options['ACCFWD'] = Option(partition_hybrid=True, + partition_batch=True, + partition_ifmaps=True, + hw_access_forwarding=True, + **optdict) + self.options['BUFSHR'] = Option(partition_hybrid=True, + partition_batch=True, + partition_ifmaps=True, + hw_gbuf_sharing=True, + **optdict) def _gen_partition(self, wlkey='BASE', dnkey='BASE', optkey='BASE', guaranteed=False): diff --git a/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py b/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py index 8b5c8b1..7f01e5b 100644 --- a/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py +++ b/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py @@ -316,6 +316,66 @@ def test_ofmap_local(self): self.assertEqual(nhops[de.OFM], 0) + def test_use_fwd(self): + ''' Use access forwarding. 
'''
+        layer = self.layers['BASE']
+
+        part = PartitionScheme(order=(pe.BATP, pe.INPP, pe.OUTP, pe.OFMP),
+                               pdims=((2, 1), (2, 4), (1, 2), (2, 1)))
+
+        nr = NodeRegion(origin=PhyDim2(0, 0), dim=part.dim(),
+                        type=NodeRegion.PROC)
+
+        far_dist = 1000
+
+        ilayout = self._make_data_layout(
+            layer.nifm, layer.hifm, layer.wifm, PhyDim2(-far_dist, 0),
+            (1, 1), (1, 1), PhyDim2(1, 1))
+
+        olayout = self._make_data_layout(
+            layer.nofm, layer.hofm, layer.wofm, PhyDim2(0, -far_dist),
+            (1, 1), (1, 1), PhyDim2(1, 1))
+
+        filter_nodes = frozenset([PhyDim2(far_dist, 0), PhyDim2(0, far_dist)])
+
+        nhops_base = partition.unit_nhops_to_proc_region(
+            layer, self.batch_size, nr, part,
+            filter_nodes, ilayout, olayout, self.options['BASE'])
+        nhops_accfwd = partition.unit_nhops_to_proc_region(
+            layer, self.batch_size, nr, part,
+            filter_nodes, ilayout, olayout, self.options['ACCFWD'])
+        nhops_bufshr = partition.unit_nhops_to_proc_region(
+            layer, self.batch_size, nr, part,
+            filter_nodes, ilayout, olayout, self.options['BUFSHR'])
+
+        for dce in range(de.NUM):
+            self.assertEqual(nhops_accfwd[dce], nhops_bufshr[dce])
+
+        # In the basic access scheme, FIL and IFM are independently fetched,
+        # resulting in repeated remote fetches. OFM is merged locally and only
+        # stored back remotely once.
+        self.assertGreater(nhops_base[de.FIL],
+                           layer.total_filter_size() * far_dist
+                           * part.size(pe.BATP) * part.size(pe.OFMP) * 0.8)
+        self.assertGreater(nhops_base[de.IFM],
+                           layer.total_ifmap_size(self.batch_size) * far_dist
+                           * part.size(pe.OUTP) * 0.8)
+
+        p_layer, p_batch_size, _ = part.part_layer(layer, self.batch_size)
+        # With forwarding, each data category is remotely fetched only once.
+        self.assertLess(nhops_accfwd[de.FIL],
+                        p_layer.total_filter_size()
+                        * part.size(pe.INPP, pe.OUTP)
+                        * (far_dist + nr.dim.size()))
+        self.assertLess(nhops_accfwd[de.IFM],
+                        p_layer.total_ifmap_size(p_batch_size)
+                        * part.size(pe.INPP, pe.OFMP, pe.BATP)
+                        * (far_dist + nr.dim.size()))
+        self.assertLess(nhops_accfwd[de.OFM],
+                        p_layer.total_ofmap_size(p_batch_size)
+                        * part.size(pe.OUTP, pe.OFMP, pe.BATP)
+                        * (far_dist + nr.dim.size()))
+
     def _make_data_layout(self, nfm, hfm, wfm, origin, bdim, ndim, dims):
         ''' Make a DataLayout instance. '''
 
         frng = FmapRange((0,) * 4, (self.batch_size, nfm, hfm, wfm))
diff --git a/nn_dataflow/tests/pipeline_test/__init__.py b/nn_dataflow/tests/pipeline_test/__init__.py
new file mode 100644
index 0000000..204e01c
--- /dev/null
+++ b/nn_dataflow/tests/pipeline_test/__init__.py
@@ -0,0 +1,17 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+""" + +from .test_pipeline_fixture import TestPipelineFixture + diff --git a/nn_dataflow/tests/pipeline_test/test_inter_layer_pipeline.py b/nn_dataflow/tests/pipeline_test/test_inter_layer_pipeline.py new file mode 100644 index 0000000..fac452b --- /dev/null +++ b/nn_dataflow/tests/pipeline_test/test_inter_layer_pipeline.py @@ -0,0 +1,497 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import re + +from nn_dataflow.core import InputLayer, ConvLayer, FCLayer, PoolingLayer +from nn_dataflow.core import InterLayerPipeline +from nn_dataflow.core import Network +from nn_dataflow.core import Option +from nn_dataflow.core import PhyDim2 +from nn_dataflow.core import PipelineSegment + +from . import TestPipelineFixture + +class TestInterLayerPipeline(TestPipelineFixture): + ''' Tests for InterLayerPipeline. ''' + + def test_valid_args(self): + ''' Valid arguments. ''' + ilp = InterLayerPipeline(self.net['net1'], self.batch_size, + self.resource, max_util_drop=0.1) + self.assertIs(ilp.network, self.net['net1']) + self.assertEqual(ilp.batch_size, self.batch_size) + self.assertIs(ilp.resource, self.resource) + self.assertEqual(ilp.max_util_drop, 0.1) + + def test_invalid_network(self): + ''' Invalid network. ''' + with self.assertRaisesRegexp(TypeError, + 'InterLayerPipeline: .*network.*'): + _ = InterLayerPipeline(self.net['net1'].input_layer(), + self.batch_size, self.resource) + + def test_invalid_resource(self): + ''' Invalid resource. ''' + with self.assertRaisesRegexp(TypeError, + 'InterLayerPipeline: .*resource.*'): + _ = InterLayerPipeline(self.net['net1'], self.batch_size, + PhyDim2(1, 1)) + + def test_invalid_max_util_drop(self): + ''' Invalid max_util_drop. ''' + with self.assertRaisesRegexp(ValueError, + 'InterLayerPipeline: .*max_util_drop.*'): + _ = InterLayerPipeline(self.net['net1'], self.batch_size, + self.resource, max_util_drop=1.1) + + with self.assertRaisesRegexp(ValueError, + 'InterLayerPipeline: .*max_util_drop.*'): + _ = InterLayerPipeline(self.net['net1'], self.batch_size, + self.resource, max_util_drop=-0.1) + + def test_topological_order(self): + ''' Topological order. ''' + for net in self.net.values(): + + if not net.net_name.startswith('net'): + continue + + ilp = self._make_ilp(net) + + for layer in net: + vidx = ilp.dag_vertex_dict[layer] + + self.assertIn(layer, ilp.dag_vertex_list[vidx]) + + # Layer is named by topological order. + self.assertTrue(layer.startswith(str(vidx))) + + # Disjoint union. + vs_list = [set(v) for v in ilp.dag_vertex_list] + + for idx, vs in enumerate(vs_list): + for vs2 in vs_list[:idx]: + self.assertTrue(vs.isdisjoint(vs2)) + self.assertSetEqual(set.union(*vs_list), set(net)) + + def test_vertex_no_merge_lr(self): + ''' LocalRegionLayer has no previous layer to merge with. 
''' + net = Network('tmp_net') + net.set_input_layer(InputLayer(30, 1)) + net.add('0', PoolingLayer(30, 1, 1)) + net.add('1', FCLayer(30, 40)) + net.add('1p', PoolingLayer(40, 1, 1)) + + ilp = self._make_ilp(net) + + for layer in net: + vidx = ilp.dag_vertex_dict[layer] + + self.assertIn(layer, ilp.dag_vertex_list[vidx]) + + # Layer is named by topological order. + self.assertTrue(layer.startswith(str(vidx))) + + def test_prev(self): + ''' Previous relationship. ''' + for net in self.net.values(): + + ilp = self._make_ilp(net) + + for vidx, prevs in ilp.dag_prev_dict.items(): + + # Previous layers of the current vertex. + prev_layers = set() + v = ilp.dag_vertex_list[vidx] + for l in v: + prev_layers.update(net.prevs(l)) + prev_layers.difference_update(v) + + for pvidx in prevs: + + # Previous vertices should be ordered before this vertex. + self.assertLess(pvidx, vidx) + + # Previous vertex should have at least one previous layer. + if pvidx < 0: + self.assertTrue( + None in prev_layers + or not prev_layers.isdisjoint(net.ext_layers())) + else: + pv = ilp.dag_vertex_list[pvidx] + self.assertFalse(prev_layers.isdisjoint(pv)) + + def test_next(self): + ''' Next relationship. ''' + for net in self.net.values(): + + ilp = self._make_ilp(net) + + for vidx, nexts in ilp.dag_next_dict.items(): + + # Next layers of the current vertex. + next_layers = set() + if vidx < 0: + # Go through all layers and add those with input layer as + # previous. + for l in net: + prevs = set(net.prevs(l)) + if None in prevs \ + or not prevs.isdisjoint(net.ext_layers()): + next_layers.add(l) + else: + v = ilp.dag_vertex_list[vidx] + for l in v: + next_layers.update(net.nexts(l)) + next_layers.difference_update(v) + + for nvidx in nexts: + + # Next vertices should be ordered after this vertex. + self.assertGreater(nvidx, vidx) + + # Next vertex should have at least one next layer. + nv = ilp.dag_vertex_list[nvidx] + self.assertFalse(next_layers.isdisjoint(nv)) + + def test_match_prev_next(self): + ''' Previous and next relationships match. ''' + for net in self.net.values(): + + ilp = self._make_ilp(net) + + for vidx, prevs in ilp.dag_prev_dict.items(): + for pvidx in prevs: + self.assertIn(vidx, ilp.dag_next_dict[pvidx]) + + for vidx, nexts in ilp.dag_next_dict.items(): + for nvidx in nexts: + self.assertIn(vidx, ilp.dag_prev_dict[nvidx]) + + def test_gen_vseg(self): + ''' _gen_vseg. ''' + # pylint: disable=protected-access + + # Simple case. + ilp = self._make_ilp(self.net['net1']) + num = len(ilp.dag_vertex_list) + self.assertEqual(len(list(ilp._gen_vseg())), + (num + 1) * num // 2) + + # Linear case. + # Number of different vsegs of n = 1 + ... + n + ilp = self._make_ilp(self.net['net2']) + num = len(ilp.dag_vertex_list) + self.assertEqual(len(list(ilp._gen_vseg())), + (num + 1) * num // 2) + + # Fork case. + ilp = self._make_ilp(self.net['net4']) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), 39) + # Case with one of multiple previous vertices on-chip. + self.assertIn((9, 10), vseg_list) + self.assertIn((13, 14), vseg_list) + # Case with only one next vertex off-chip. + self.assertIn((7, 8), vseg_list) + self.assertNotIn((4, 5, 6), vseg_list) + + # Multiple first layers. + self.assertGreater(len(self.net['net3'].firsts()), 1) + ilp = self._make_ilp(self.net['net3']) + vseg_list = list(ilp._gen_vseg()) + self.assertIn((0,), vseg_list) + self.assertIn((1,), vseg_list) + + # Verify rules. 
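+        # (net5 stresses the vseg formation rules; the comments below name
+        # the rule each case exercises.)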
+ ilp = self._make_ilp(self.net['net5']) + vseg_list = list(ilp._gen_vseg()) + # Layers with no shared dependencies. + self.assertNotIn((2, 3, 4), vseg_list) + self.assertNotIn((8, 9), vseg_list) + # Multiple previous layers. + self.assertNotIn((5, 6, 7), vseg_list) + self.assertNotIn((8, 9, 10), vseg_list) + self.assertNotIn((10, 11, 12), vseg_list) + # Multiple next layers. + self.assertNotIn((0, 1, 2, 3), vseg_list) + self.assertIn((3, 4), vseg_list) + self.assertIn((3, 4, 5), vseg_list) + self.assertIn((10, 11), vseg_list) + + # No duplicate. + for net in self.net.values(): + ilp = self._make_ilp(net) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), len(set(vseg_list))) + + # Real networks. + ilp = self._make_ilp(self.net['zfnet']) + self.assertEqual(len(ilp.dag_vertex_list), 8) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), 36) + + ilp = self._make_ilp(self.net['vgg_net']) + self.assertEqual(len(ilp.dag_vertex_list), 16) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), 136) + + # Large networks with forks. + for net_name in ['googlenet', 'resnet152']: + net = self.net[net_name] + + ilp = self._make_ilp(net) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), len(set(vseg_list))) + + # The number of different vsegs is between one and eight times of + # the number of layers. + self.assertGreater(len(vseg_list), len(net)) + self.assertLessEqual(len(vseg_list), len(net) * 8) + + def test_gen_vseg_twice(self): + ''' _gen_vseg twice. ''' + # pylint: disable=protected-access + for net_name in self.net: + if not net_name.startswith('net'): + continue + + net = self.net[net_name] + ilp = self._make_ilp(net) + + vseg_list_1 = list(ilp._gen_vseg()) + vseg_list_2 = list(ilp._gen_vseg()) + + self.assertListEqual(vseg_list_1, vseg_list_2) + + def test_ordered_layer_list(self): + ''' Get ordered_layer_list. ''' + + # https://stackoverflow.com/a/4836734/5277823 + nat_key = lambda key: tuple(int(c) if c.isdigit() else c.lower() + for c in re.split('([0-9]+)', key)) + + for net_name in ['net1', 'net2', 'net3', 'net4', 'net5']: + net = self.net[net_name] + ilp = self._make_ilp(net) + ord_list = ilp.ordered_layer_list() + + # In natural order. + self.assertTrue(all(nat_key(l1) < nat_key(l2) for l1, l2 + in zip(ord_list, ord_list[1:]))) + + def test_gen_segment(self): + ''' gen_segment(). ''' + for net_name in self.net: + net = self.net[net_name] + ilp = self._make_ilp(net) + + # No pipelining. + options = Option() + segs_n_lst = list(ilp.gen_segment(options)) + segs_n = set(segs_n_lst) + self.assertEqual(len(segs_n_lst), len(segs_n)) + for seg in segs_n: + self.assertEqual(len(seg), 1) + self.assertEqual(len(seg[0]), 1) + self.assertIn(seg[0][0], net) + + # Spatial pipelining. + options = Option(partition_interlayer=True) + segs_sp_lst = list(ilp.gen_segment(options)) + segs_sp = set(segs_sp_lst) + self.assertEqual(len(segs_sp_lst), len(segs_sp)) + for seg in segs_sp: + for ltpl in seg: + self.assertLessEqual(sum(1 for l in ltpl + if isinstance(l, ConvLayer)), + 1) + self.assertTrue(segs_sp.issuperset(segs_n)) + + # Temporal pipelining. + options = Option(hw_gbuf_save_writeback=True) + segs_tp_lst = list(ilp.gen_segment(options)) + segs_tp = set(segs_tp_lst) + self.assertEqual(len(segs_tp_lst), len(segs_tp)) + for seg in segs_tp: + self.assertEqual(len(seg), 1) + self.assertTrue(segs_tp.issuperset(segs_n)) + + # Spatial and temporal pipelining. 
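+            # (Enabling both should generate exactly the union of the two
+            # sets, as asserted below.)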
+ options = Option(partition_interlayer=True, + hw_gbuf_save_writeback=True) + segs_stp_lst = list(ilp.gen_segment(options)) + segs_stp = set(segs_stp_lst) + self.assertEqual(len(segs_stp_lst), len(segs_stp)) + self.assertSetEqual(segs_stp, segs_tp | segs_sp) + # Only single-layer and single-vertex segments have the same + # spatial and temporal pipelining. + segs_intersect = segs_tp & segs_sp + segs_single = segs_n + segs_single |= set(PipelineSegment((v,), ilp.network, + ilp.batch_size, ilp.resource) + for v in ilp.dag_vertex_list) + self.assertTrue(segs_intersect.issubset(segs_single)) + + def test_gen_segment_max_degree(self): + ''' gen_segment() maximum degree. ''' + net = self.net['vgg_net'] + ilp = self._make_ilp(net) + + options = Option(partition_interlayer=True, + hw_gbuf_save_writeback=True, + layer_pipeline_max_degree=4) + for segment in ilp.gen_segment(options): + self.assertLessEqual(sum(1 if isinstance(net[l], ConvLayer) else 0 + for ltpl in segment for l in ltpl), + 4) + + def test_gen_segment_vseg(self): + ''' gen_segment() vertex segment. ''' + + for net_name in self.net: + if not net_name.startswith('net'): + continue + net = self.net[net_name] + + ilp = self._make_ilp(net) + options = Option(partition_interlayer=True) + + seg_set = set(ilp.gen_segment(options)) + self.assertTrue(seg_set) + + seg_v_set = set(self._gen_all_segment(net)) + self.assertTrue(seg_set.issubset(seg_v_set)) + + def test_gen_segment_multi_prevs(self): + ''' gen_segment() with multiple previous vertices. ''' + # pylint: disable=protected-access + + net = self.net['net4'] + ilp = self._make_ilp(net) + + vseg_set = set(ilp._gen_vseg()) + self.assertIn((9, 10), vseg_set) + self.assertIn((13, 14), vseg_set) + + options = Option(partition_interlayer=True) + seg_set = set(ilp.gen_segment(options)) + + # 10 only has neighbor source 9; 10p only has local source 10 and + # memory source 8. Valid. + self.assertIn(self._make_segment((9, 10), ilp.network), seg_set) + # 14 has both neighbor source 13, and memory source 12, etc.. Invalid. + self.assertNotIn(self._make_segment((13, 14), ilp.network), seg_set) + + def test_gen_segment_one_nexts(self): + ''' gen_segment() with missing one next vertex. ''' + # pylint: disable=protected-access + + net = self.net['net4'] + ilp = self._make_ilp(net) + + vseg_set = set(ilp._gen_vseg()) + self.assertIn((7, 8), vseg_set) + self.assertNotIn((4, 5, 6), vseg_set) + + options = Option(partition_interlayer=True) + seg_set = set(ilp.gen_segment(options)) + + self.assertIn(self._make_segment((7, 8), ilp.network), seg_set) + self.assertNotIn(self._make_segment((4, 5, 6), ilp.network), seg_set) + + def test_gen_segment_not_opt(self): + ''' gen_segment() not with_opt. 
'''
+
+        options_with_opt = Option(partition_interlayer=True,
+                                  hw_gbuf_save_writeback=True,
+                                  layer_pipeline_opt=True)
+        options_not_opt = Option(partition_interlayer=True,
+                                 hw_gbuf_save_writeback=True,
+                                 layer_pipeline_opt=False)
+
+        # Linear ones
+        for net_name in ['net1', 'net2', 'zfnet']:
+            net = self.net[net_name]
+            ilp = self._make_ilp(net)
+
+            segs_with_opt = set(seg.seg
+                                for seg in ilp.gen_segment(options_with_opt))
+            segs_not_opt = set(seg.seg
+                               for seg in ilp.gen_segment(options_not_opt))
+
+            self.assertSetEqual(segs_with_opt, segs_not_opt)
+
+        # Non-linear ones
+        for net_name in ['net3', 'net4', 'net5', 'net6', 'net7', 'googlenet']:
+            net = self.net[net_name]
+            ilp = self._make_ilp(net)
+
+            segs_with_opt = set(seg.seg
+                                for seg in ilp.gen_segment(options_with_opt))
+            segs_not_opt = set(seg.seg
+                               for seg in ilp.gen_segment(options_not_opt))
+
+            self.assertTrue(segs_with_opt.issuperset(segs_not_opt))
+
+    def test_gen_segment_resnet(self):
+        ''' gen_segment() with ResNet. '''
+
+        net = self.net['resnet152']
+        ilp = self._make_ilp(net)
+
+        options = Option(partition_interlayer=True)
+
+        # One residual module fits.
+        segment = PipelineSegment(
+            (('conv3_2_a',), ('conv3_2_b',), ('conv3_2_c', 'conv3_2_res')),
+            ilp.network, ilp.batch_size, ilp.resource)
+
+        self.assertTupleEqual(net.prevs('conv3_2_res'),
+                              ('conv3_1_res', 'conv3_2_c'))
+        self.assertTrue(segment.valid)
+
+        segs = set(seg.seg for seg in ilp.gen_segment(options))
+        self.assertIn(segment.seg, segs)
+
+    def test_gen_segment_lstm(self):
+        ''' gen_segment() with LSTM cell. '''
+
+        net = self.net['lstm_phoneme']
+        ilp = self._make_ilp(net)
+
+        options = Option(partition_interlayer=True)
+
+        # Find a cell.
+        cname = None
+        for l in net:
+            if l[-6:] == '_igate':
+                cname = l[:-6]
+        self.assertIsNotNone(cname)
+
+        # One LSTM cell fits.
+        segment = PipelineSegment(
+            ((cname + '_cand',),
+             (cname + '_igate', cname + '_cout_i'),
+             (cname + '_fgate', cname + '_cout_f', cname + '_cout'),
+             (cname + '_ogate', cname + '_hout')),
+            ilp.network, ilp.batch_size, ilp.resource)
+
+        self.assertTrue(segment.valid)
+
+        segs = set(seg.seg for seg in ilp.gen_segment(options))
+        self.assertIn(segment.seg, segs)
+
diff --git a/nn_dataflow/tests/pipeline_test/test_pipeline_fixture.py b/nn_dataflow/tests/pipeline_test/test_pipeline_fixture.py
new file mode 100644
index 0000000..2301c71
--- /dev/null
+++ b/nn_dataflow/tests/pipeline_test/test_pipeline_fixture.py
@@ -0,0 +1,588 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+""" + +import unittest + +from collections import OrderedDict + +from nn_dataflow.core import DataLayout +from nn_dataflow.core import FmapRange +from nn_dataflow.core import InputLayer, ConvLayer, FCLayer, PoolingLayer +from nn_dataflow.core import InterLayerPipeline +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import Network +from nn_dataflow.core import NodeRegion +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow.core import PhyDim2 +from nn_dataflow.core import PipelineSegment +from nn_dataflow.core import Resource +from nn_dataflow.core import SchedulingConstraint +from nn_dataflow.core import SchedulingResult + +from nn_dataflow.nns import import_network, all_networks + +class TestPipelineFixture(unittest.TestCase): + ''' Base fixture class for layer pipeline tests. ''' + + def setUp(self): + + self.net = {} + + net = Network('net1') + # Linear. + net.set_input_layer(InputLayer(10, 1)) + net.add('0', FCLayer(10, 20)) + net.add('1', FCLayer(20, 30)) + net.add('1p', PoolingLayer(30, 1, 1)) + net.add('2', FCLayer(30, 40)) + net.add('3', FCLayer(40, 50)) + self.net[net.net_name] = net + + net = Network('net2') + # Long linear. + net.set_input_layer(InputLayer(1, 1)) + for idx in range(16): + net.add(str(idx), FCLayer(1, 1)) + self.net[net.net_name] = net + + net = Network('net3') + # Fork. + # /0-2\ /6- 7- 8\ + # x 4-5 12 + # \1-3/ \9-10-11/ + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1), prevs=net.INPUT_LAYER_KEY) + net.add('1', FCLayer(1, 1), prevs=net.INPUT_LAYER_KEY) + net.add('2', FCLayer(2, 1), prevs=('0', '1')) + net.add('2p', PoolingLayer(1, 1, 1)) + net.add('3', FCLayer(2, 1), prevs=('0', '1')) + net.add('4', FCLayer(2, 1), prevs=('2p', '3')) + net.add('5', FCLayer(1, 1)) + net.add('5p', PoolingLayer(1, 1, 1)) + net.add('6', FCLayer(1, 1), prevs='5p') + net.add('7', FCLayer(1, 1)) + net.add('8', FCLayer(1, 1)) + net.add('9', FCLayer(1, 1), prevs='5p') + net.add('10', FCLayer(1, 1)) + net.add('11', FCLayer(1, 1)) + net.add('12', FCLayer(2, 1), prevs=('8', '11')) + self.net[net.net_name] = net + + net = Network('net4') + # Complex fork. + # /5 \ + # 0-1-2-3-4-6-7-8-10-14 + # \9/ + # \11-12 / + # \13 / + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1)) + net.add('2', FCLayer(1, 1)) + net.add('3', FCLayer(1, 1)) + net.add('4', FCLayer(1, 1)) + net.add('5', FCLayer(1, 1), prevs='4') + net.add('6', FCLayer(1, 1), prevs='4') + net.add('7', FCLayer(1, 1)) + net.add('8', FCLayer(1, 1), prevs='7') + net.add('9', FCLayer(1, 1), prevs='7') + net.add('10', FCLayer(1, 1)) + net.add('10p', PoolingLayer(2, 1, 1), prevs=('8', '10')) + net.add('11', PoolingLayer(1, 1, 1), prevs='4') + net.add('12', FCLayer(1, 1)) + net.add('13', PoolingLayer(1, 1, 1), prevs='4') + net.add('14', FCLayer(5, 1), prevs=('5', '10p', '12', '13')) + self.net[net.net_name] = net + + net = Network('net5') + # Corner cases. 
+ # ----\ + # //1-2\ 7-8\ + # 0-3-4-x 10-11-12 + # \ \5/ 9 / \__/ + # 6--/ + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1), prevs='0') + net.add('2', FCLayer(1, 1)) + net.add('3', FCLayer(1, 1), prevs='0') + net.add('4', FCLayer(1, 1), prevs='3') + net.add('5', FCLayer(1, 1), prevs='3') + net.add('6', FCLayer(1, 1), prevs='0') + net.add('7', FCLayer(5, 1), prevs=('0', '2', '4', '5', '6')) + net.add('8', FCLayer(1, 1)) + net.add('9', FCLayer(5, 1), prevs=('0', '2', '4', '5', '6')) + net.add('10', FCLayer(2, 1), prevs=('8', '9')) + net.add('11', FCLayer(1, 1)) + net.add('12', FCLayer(2, 1), prevs=('10', '11')) + self.net[net.net_name] = net + + net = Network('net6') + # Fmap sizes. + net.set_input_layer(InputLayer(1, 24)) + net.add('0', ConvLayer(1, 1, 24, 3)) + net.add('1', ConvLayer(1, 1, 12, 3, strd=2)) + net.add('1p', PoolingLayer(1, 6, 2)) + net.add('2', ConvLayer(1, 1, 6, 3)) + net.add('3', ConvLayer(1, 1, 6, 3, strd=4), prevs=('0')) + self.net[net.net_name] = net + + net = Network('net7') + # Topological order: see a visited vertex again. + # /--- + # 0-1-\\ + # \2--2p + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1), prevs='0') + net.add('2', FCLayer(1, 1), prevs='0') + net.add('2p', PoolingLayer(3, 1, 1), prevs=('0', '1', '2')) + self.net[net.net_name] = net + + net = Network('net8') + # Forward to the middle. + # /-\ + # 0-1-2-2p-4-4p + # \-3------/ + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1), prevs='0') + net.add('2', FCLayer(1, 1), prevs='1') + net.add('2p', PoolingLayer(2, 1, 1), prevs=('1', '2')) + net.add('3', FCLayer(1, 1), prevs='0') + net.add('4', FCLayer(2, 1), prevs='2p') + net.add('4p', PoolingLayer(2, 1, 1), prevs=('3', '4')) + self.net[net.net_name] = net + + net = Network('net9') + # Previous layers include input and others. + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(2, 1), prevs=(net.INPUT_LAYER_KEY, '0')) + self.net[net.net_name] = net + + # Real networks. + for net_name in all_networks(): + self.net[net_name] = import_network(net_name) + + self.batch_size = 16 + + self.resource = Resource( + proc_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + dram_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(8, 8), + type=NodeRegion.DRAM), + src_data_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(8, 4), + type=NodeRegion.DRAM), + dst_data_region=NodeRegion(origin=PhyDim2(0, 4), dim=PhyDim2(8, 4), + type=NodeRegion.DRAM), + dim_array=PhyDim2(16, 16), size_gbuf=65536, size_regf=64, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + + part = PartitionScheme(order=range(pe.NUM), pdims=[(1, 1)] * pe.NUM) + self.ofmap_layout = DataLayout( + frngs=(FmapRange((0, 0, 0, 0), (2, 4, 16, 16)),), + regions=(NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), + type=NodeRegion.DRAM),), + parts=(part,)) + + + def _make_ilp(self, network): + ''' Make an InterLayerPipeline instance. ''' + return InterLayerPipeline(network, self.batch_size, self.resource) + + def _make_segment(self, vseg, network, temporal=False, max_util_drop=None, + with_opt=True): + ''' Convert vertex segment to (layer) segment. 
''' + kwargs = {} + if max_util_drop is not None: + kwargs['max_util_drop'] = max_util_drop + if not with_opt: + kwargs['with_opt'] = False + ilp = self._make_ilp(network) + seg = tuple(ilp.dag_vertex_list[vidx] for vidx in vseg) + if temporal: + seg = (sum(seg, tuple()),) + return PipelineSegment(seg, ilp.network, ilp.batch_size, ilp.resource, + **kwargs) + + def _make_sched_res(self, sched_seq, time, top_ti=1, top_to=1, top_tb=1, + top_ord=range(le.NUM), dram_time=0, num_nodes=4): + scheme = OrderedDict() + scheme['cost'] = 1.234 + 9.876 + scheme['time'] = max(time, dram_time) + scheme['num_nodes'] = num_nodes + scheme['proc_time'] = time + scheme['bus_time'] = 0 + scheme['dram_time'] = dram_time + scheme['ti'] = [top_ti, 1, 1] + scheme['to'] = [top_to, 1, 1] + scheme['tb'] = [top_tb, 1, 1] + scheme['tvals'] = [[top_ti, top_to, top_tb], [1] * 3, [1] * 3] + scheme['orders'] = [top_ord, range(le.NUM), range(le.NUM)] + return SchedulingResult(scheme=scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=sched_seq) + + def _gen_all_segment(self, network, **kwargs): + ''' + Generate all segments directly from all layers and all vertex segments. + ''' + # pylint: disable=protected-access + ilp = self._make_ilp(network) + for layer in network: + yield PipelineSegment(((layer,),), ilp.network, ilp.batch_size, + ilp.resource) + for vseg in ilp._gen_vseg(): + segment = self._make_segment(vseg, network, **kwargs) + if len(segment) == 1 and len(segment[0]) == 1: + continue + yield segment + + def _validate_allocation(self, segment, allocation): + ''' Validate segment resource allocation. ''' + + # Match segment. + self.assertEqual(len(allocation), len(segment)) + for ltpl, rtpl in zip(segment, allocation): + self.assertEqual(len(rtpl), len(ltpl)) + self.assertTrue(all(isinstance(r, Resource) for r in rtpl)) + + # Number of nodes. + nodes = [] # number of nodes. + for rtpl in allocation: + nodes.append(rtpl[0].proc_region.dim.size()) + self.assertEqual(sum(nodes), self.resource.proc_region.dim.size()) + + # Temporal schedules share processing region; spatial schedules use + # non-overlapped processing regions. + used_proc_nodes = set() # used processing nodes + for rtpl in allocation: + proc_region = rtpl[0].proc_region + self.assertTrue(all(r.proc_region == proc_region for r in rtpl)) + for n in proc_region.iter_node(): + self.assertTrue(self.resource.proc_region.contains_node(n), + '_validate_allocation: node {} outside of ' + 'the processing region {}' + .format(n, self.resource.proc_region)) + self.assertNotIn(n, used_proc_nodes, + '_validate_allocation: node {} has been ' + 'used.'.format(n)) + used_proc_nodes.add(n) + + # Data liveness. + data_regions = {} # layers that have data currently on-chip + for ltpl, rtpl in zip(segment, allocation): + + for l, r in zip(ltpl, rtpl): + + # Check data source. + prev_layers = segment.network.prevs(l) + + for pl in prev_layers: + if pl not in data_regions: + # Previous layer is not on-chip, from memory. + # Try find a layer responsible to fetch shared mem src. + try: + sh_sp_idx = next((i for i in range(len(allocation)) + if allocation[i][0].proc_region + == r.src_data_region)) + except StopIteration: + # No shared mem src. + self.assertEqual( + r.src_data_region, + self.resource.src_data_region, + '_validate_allocation: layer {}\'s prev {} ' + 'is not on-chip, should be from {}, but {}.' + .format(l, pl, self.resource.src_data_region, + r.src_data_region)) + else: + # There exists shared mem src. 
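+                            # (All sharers must have identical prevs, so one
+                            # fetch can serve the whole sharing group.)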
+ sh_l = segment[sh_sp_idx][0] + self.assertEqual(segment.network.prevs(l), + segment.network.prevs(sh_l), + '_validate_allocation: layer {} ' + 'expects on-chip mem src sharing ' + 'with {}, but prevs differ.' + .format(l, sh_l)) + elif data_regions[pl] != r.proc_region: + # Previous layer is on-chip and not local. + self.assertEqual( + r.src_data_region, data_regions[pl], + '_validate_allocation: layer {}\'s prev {} ' + 'is on-chip, should be from {}, but {}.' + .format(l, pl, data_regions[pl], + r.src_data_region)) + + # Update data based on destination. + # Local or store back to memory. Both will be available on-chip. + self.assertTrue(r.dst_data_region == r.proc_region + or r.dst_data_region + == self.resource.dst_data_region, + '_validate_allocation: data can only ' + 'be local or storing back to mem.') + # Overwrite. + local_node_set = set(r.proc_region.iter_node()) + data_regions = {pl: data_regions[pl] for pl in data_regions + if local_node_set.isdisjoint( + data_regions[pl].iter_node())} + data_regions[l] = r.proc_region + + def _validate_constraint(self, segment, constraint): + ''' Validate segment scheduling constraint. ''' + # pylint: disable=too-many-branches + + # Match segment. + self.assertEqual(len(constraint), len(segment)) + for ltpl, ctpl in zip(segment, constraint): + self.assertEqual(len(ctpl), len(ltpl)) + self.assertTrue(all(isinstance(c, SchedulingConstraint) + for c in ctpl)) + + # Same top tb. + top_tb = constraint[0][0].topbat + self.assertTrue(all(c.topbat == top_tb + for ctpl in constraint for c in ctpl)) + + # Top tb is a factor of batch size. + if top_tb: + self.assertEqual((segment.batch_size) % top_tb, 0) + + # Data availability. + + seg_layers = set(l for ltpl in segment for l in ltpl) + + class OutAccPat(object): + ''' Output data access pattern types. ''' + # pylint: disable=too-few-public-methods + ANY = 0 # can access in any way + DBF = -1 # must double-buffer + # SEQ: use any positive value to represent sequential access with + # certain number of groups. + + # Available data in each spatial subregions. Each is represented by a + # tuple of layer name and its output data access pattern. + avail_data = [(None, OutAccPat.ANY) for _ in segment] + + # Get groups of layers sharing the same memory source. + prevs2layers = {} + for ltpl in segment: + l = ltpl[0] + prevs2layers.setdefault(segment.network.prevs(l), []).append(l) + sh_mem_src_groups = [ls for ps, ls in prevs2layers.items() + if not seg_layers.intersection(ps) and len(ls) > 1] + sh_mem_src_topifms = [None] * len(sh_mem_src_groups) + + # Whether to defer fully buffering output. + fb_out = False + fb_out_conv = None + + for sp_idx, (ltpl, ctpl) in enumerate(zip(segment, constraint)): + + self.assertFalse(fb_out, + '_validate_constraint: deferring fully buffering ' + 'from {} should not cross spatial scheduling {}.' + .format(fb_out_conv, sp_idx - 1)) + + for tm_idx, (layer, cstr) in enumerate(zip(ltpl, ctpl)): + + # Source data and their access patterns. + prev_layers = segment.network.prevs(layer) + prev_oaps = [] + for pl in prev_layers: + if pl not in seg_layers: + # Off-chip sources. + poap = OutAccPat.ANY + elif pl in ltpl: + # On-chip and local. + self.assertEqual(avail_data[sp_idx][0], pl, + '_validate_constraint: layer {} ({}) ' + 'local source data {} not available, ' + 'maybe not the immediate previous.' + .format(layer, (sp_idx, tm_idx), pl)) + poap = avail_data[sp_idx][1] + else: + # On-chip and neighbor. 
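+                    # (Search earlier spatial subregions for the producer's
+                    # currently buffered output access pattern.)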
+ poap = next((avail_data[p_sp_idx][1] + for p_sp_idx in range(sp_idx) + if avail_data[p_sp_idx][0] == pl), + None) + self.assertFalse(poap is None, + '_validate_constraint: layer {} ({}) ' + 'neighbor source data {} not ' + 'available on-chip.' + .format(layer, (sp_idx, tm_idx), pl)) + prev_oaps.append(poap) + # Only buffer input if having source on-chip. + has_src = not seg_layers.isdisjoint(prev_layers) + + # The single SEQ source. + seq = None + # str is greater than all numbers, see + # https://docs.python.org/2/library/stdtypes.html#comparisons + seq_prev_oaps = [poap for poap in prev_oaps if poap > 0] + if seq_prev_oaps: + self.assertEqual(len(seq_prev_oaps), 1, + '_validate_constraint: layer {} ({}) ' + 'has multiple SEQ input.' + '\nsrcs: {}, oaps: {}' + .format(layer, (sp_idx, tm_idx), + prev_layers, prev_oaps)) + seq = seq_prev_oaps[0] + + # Destination data. + # Only buffer output if having destination on-chip. + next_layers = segment.network.nexts(layer) + has_dst = not seg_layers.isdisjoint(next_layers) + + # Validation. + + for g_idx, group in enumerate(sh_mem_src_groups): + if layer in group: + if sh_mem_src_topifms[g_idx] is None: + sh_mem_src_topifms[g_idx] = cstr.topifm + self.assertEqual(sh_mem_src_topifms[g_idx], cstr.topifm, + '_validate_constraint: layer {} ({}) ' + 'share memory source with {}, but has ' + 'mismatched topifm {} with {}.' + .format(layer, (sp_idx, tm_idx), + group, cstr.topifm, + sh_mem_src_topifms[g_idx])) + break + else: + if not has_src: + self.assertEqual(cstr.topifm, 0, + '_validate_constraint: layer {} ({}) ' + 'should not constrain input as it ' + 'does not have on-chip sources.' + .format(layer, (sp_idx, tm_idx))) + + if isinstance(segment.network[layer], ConvLayer): + + self.assertFalse(fb_out, + '_validate_constraint: deferring fully ' + 'buffering from {} has not been realized.' + .format(fb_out_conv)) + + if any(pl in ltpl for pl in prev_layers): + # Local source. + lcl_poap = avail_data[sp_idx][1] + self.assertTrue(lcl_poap == OutAccPat.DBF + or lcl_poap == OutAccPat.ANY, + '_validate_constraint: layer {} ({}) ' + 'local source data {} must fully ' + 'buffer output.' + .format(layer, (sp_idx, tm_idx), + lcl_poap)) + + # DBF source. + if OutAccPat.DBF in prev_oaps: + # Must fully buffer CONV input. + self.assertEqual(cstr.topifm, 1, + '_validate_constraint: layer {} ({}) ' + 'input is not fully buffered but has ' + 'DBF source.\nsrcs: {}, oaps: {}' + '\n{}' + .format(layer, (sp_idx, tm_idx), + prev_layers, prev_oaps, + cstr)) + + # SEQ source. + if seq and has_dst: + # Cannot be lazily updated. + self.assertNotIsInstance( + seq, str, + '_validate_constraint: CONV layer {} ({}) cannot ' + 'use lazy update (from {})' + .format(layer, (sp_idx, tm_idx), seq)) + # Must match SEQ. + self.assertEqual(cstr.topifm, seq, + '_validate_constraint: layer {} ({}) ' + 'input groups ({}) and its SEQ src ' + 'output groups ({}) are mismatched.' + '\nsrcs: {}, oaps: {}' + .format(layer, (sp_idx, tm_idx), + cstr.topifm, seq, + prev_layers, prev_oaps)) + # Also must fully buffer CONV output. + self.assertEqual(cstr.topofm, 1, + '_validate_constraint: layer {} ({}) ' + 'output is not fully buffered but has ' + 'SEQ source.\nsrcs: {}, oaps: {}' + .format(layer, (sp_idx, tm_idx), + prev_layers, prev_oaps)) + # Deferred apply to the last layer in the group. + fb_out = True + fb_out_conv = layer + + oap = None + if cstr.topofm == 1: + if cstr.topifm == 1: + # Fully buffer both, can access output in any way. 
+                            # This is fine, as we require buffering either
+                            # input or output for CONV (see below).
+                            oap = OutAccPat.ANY
+                        else:
+                            oap = OutAccPat.DBF
+                    elif has_dst and cstr.topofm > 0:
+                        oap = cstr.topofm
+                        if has_src:
+                            self.assertEqual(cstr.topifm, 1,
+                                             '_validate_constraint: layer {} '
+                                             '({}) has on-chip src and dst '
+                                             'but neither input nor output '
+                                             'are fully buffered.\ncstr: {}.'
+                                             .format(layer, (sp_idx, tm_idx),
+                                                     cstr))
+                    elif has_dst:
+                        # Lazy update, record layer name as seq.
+                        oap = layer
+
+                else:
+
+                    # SEQ source.
+                    if seq and has_dst:
+                        # Must match SEQ, or fully buffer output.
+                        self.assertTrue(cstr.topofm == seq or cstr.topofm == 1
+                                        or seq in cstr.update_dict,
+                                        '_validate_constraint: layer {} ({}) '
+                                        'output is not fully buffered, and '
+                                        'groups ({}) and its SEQ src output '
+                                        'groups ({}) are mismatched, and '
+                                        'lazy update is not used.'
+                                        '\nsrcs: {}, oaps: {}'
+                                        .format(layer, (sp_idx, tm_idx),
+                                                cstr.topofm, seq,
+                                                prev_layers, prev_oaps))
+
+                    if cstr.topofm == 1:
+                        # Fully buffer output.
+                        oap = OutAccPat.DBF
+                    elif isinstance(seq, str):
+                        # Lazy update.
+                        oap = seq
+                    else:
+                        # SEQ output.
+                        oap = cstr.topofm
+
+                # Realize deferred fully buffering output.
+                if cstr.topofm == 1:
+                    fb_out = False # reset
+
+                # Overwrite the previous temporal scheduling.
+                avail_data[sp_idx] = (layer, oap)
+
diff --git a/nn_dataflow/tests/pipeline_test/test_pipeline_segment.py b/nn_dataflow/tests/pipeline_test/test_pipeline_segment.py
new file mode 100644
index 0000000..3635dc8
--- /dev/null
+++ b/nn_dataflow/tests/pipeline_test/test_pipeline_segment.py
@@ -0,0 +1,683 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+"""
+
+import itertools
+
+from nn_dataflow.core import ConvLayer
+from nn_dataflow.core import NodeRegion
+from nn_dataflow.core import PhyDim2
+from nn_dataflow.core import PipelineSegment
+from nn_dataflow.core import PipelineSegmentTiming
+
+from . import TestPipelineFixture
+
+class TestPipelineSegment(TestPipelineFixture):
+    ''' Tests for PipelineSegment. '''
+
+    # pylint: disable=too-many-public-methods
+
+    def test_valid_args(self):
+        ''' Valid arguments. '''
+        segment = PipelineSegment((('0',), ('1', '1p')),
+                                  self.net['net1'], self.batch_size,
+                                  self.resource)
+        self.assertTrue(segment.valid)
+        self.assertTupleEqual(segment.seg, (('0',), ('1', '1p')))
+        self.assertIs(segment.network, self.net['net1'])
+        self.assertEqual(segment.batch_size, self.batch_size)
+        self.assertIs(segment.resource, self.resource)
+
+    def test_invalid_seg(self):
+        ''' Invalid seg. '''
+        with self.assertRaisesRegexp(TypeError,
+                                     'PipelineSegment: .*seg.*tuple.*'):
+            _ = PipelineSegment([('0',), ('1', '1p')],
+                                self.net['net1'], self.batch_size,
+                                self.resource)
+
+        with self.assertRaisesRegexp(TypeError,
+                                     'PipelineSegment: .*seg.*sub-tuple.*'):
+            _ = PipelineSegment(('0', '1', '1p'),
+                                self.net['net1'], self.batch_size,
+                                self.resource)
+
+    def test_invalid_network(self):
+        ''' Invalid network. 
''' + with self.assertRaisesRegexp(TypeError, + 'PipelineSegment: .*network.*'): + _ = PipelineSegment((('0',), ('1', '1p')), + self.net['net1'].input_layer(), self.batch_size, + self.resource) + + def test_invalid_resource(self): + ''' Invalid resource. ''' + with self.assertRaisesRegexp(TypeError, + 'PipelineSegment: .*resource.*'): + _ = PipelineSegment((('0',), ('1', '1p')), + self.net['net1'], self.batch_size, + PhyDim2(1, 1)) + + def test_init_deps_not_valid(self): + ''' Not valid segment due to init deps. ''' + + # Not utilize local data. + segment = self._make_segment((0, 1), self.net['net3'], temporal=True) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Local data not available. + segment = self._make_segment((10, 11, 12), self.net['net5'], + temporal=True) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Multiple neighbor source in one spatial scheduling. + segment = self._make_segment((1, 2), self.net['net8']) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Both memory source and neighbor source. + segment = self._make_segment((13, 14), self.net['net4']) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Valid cases. + + # Both memory destination and neighbor destination. + segment = self._make_segment((7, 8), self.net['net4']) + self.assertTrue(segment.valid) + + def test_init_deps_not_opt(self): + ''' Init deps for segment not with opt. ''' + + # Multiple on-chip sources. + segment = self._make_segment((3, 4), self.net['net8']) + self.assertTrue(segment.valid) + segment = self._make_segment((3, 4), self.net['net8'], with_opt=False) + self.assertFalse(segment.valid) + + # Multiple on-chip destinations. + segment = self._make_segment((4, 5, 6), self.net['net4']) + self.assertTrue(segment.valid) + segment = self._make_segment((4, 5, 6), self.net['net4'], + with_opt=False) + self.assertFalse(segment.valid) + + # Multiple linear chains. + segment = self._make_segment((5, 6), self.net['net4']) + self.assertTrue(segment.valid) + segment = self._make_segment((5, 6), self.net['net4'], with_opt=False) + self.assertFalse(segment.valid) + + def test_alloc_not_valid(self): + ''' Not valid segment due to alloc. ''' + + segment = self._make_segment((0, 1), self.net['net1'], + max_util_drop=0.01) + self.assertFalse(segment.valid) + + def test_as_sequence(self): + ''' As a sequence. ''' + segment = self._make_segment((0, 1), self.net['net1']) + self.assertTrue(segment.valid) + + self.assertSequenceEqual(segment, segment.seg) + self.assertTupleEqual(tuple(segment), segment.seg) + + for ltpl in segment: + for layer in ltpl: + self.assertIn(layer, self.net['net1']) + + def test_equal(self): + ''' Equality. ''' + seg1 = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.1) + seg2 = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.01) + seg3 = self._make_segment((0, 1), self.net['net1'], temporal=True) + self.assertNotEqual(seg1, seg2) + self.assertNotEqual(seg1, seg3) + + seg4 = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.1) + self.assertEqual(seg1, seg4) + + net = self.net['net1'] + self.assertSetEqual(set(self._gen_all_segment(net)), + set(itertools.chain(self._gen_all_segment(net), + self._gen_all_segment(net)))) + + def test_repr(self): + ''' __repr__. 
''' + seg = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.1) + str_ = repr(seg) + self.assertIn(repr(seg.seg), str_) + self.assertIn(repr(seg.resource), str_) + self.assertIn(repr(seg.max_util_drop), str_) + + def test_alloc_proc(self): + ''' _alloc_proc. ''' + # pylint: disable=protected-access + + net = self.net['net1'] + self.assertListEqual([net[l].total_ops() for l in net], + [200, 600, 30, 1200, 2000]) + + ilp = self._make_ilp(net) + + # Single vertex. + + for idx in range(len(ilp.dag_vertex_list)): + segment = self._make_segment((idx,), ilp.network) + psr = segment._alloc_proc() + + self.assertEqual(len(psr), 1) + self.assertTupleEqual(psr[0].origin, (0, 0)) + self.assertTupleEqual(psr[0].dim, self.resource.proc_region.dim) + self.assertEqual(psr[0].type, NodeRegion.PROC) + + # Multiple vertices. + + psr = self._make_segment((0, 1), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertListEqual(nodes, [16, 48]) + + psr = self._make_segment((2, 3), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertListEqual(nodes, [24, 40]) + + psr = self._make_segment((1, 2), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertTrue(nodes == [24, 40] or nodes == [22, 42]) + + psr = self._make_segment((1, 2, 3), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertTrue(nodes == [12, 20, 32] or nodes == [10, 20, 34]) + + # All segments. + + def _check_all_segment(ilp): + for vseg in ilp._gen_vseg(): + segment = self._make_segment(vseg, ilp.network) + psr = segment._alloc_proc() + if psr is None: + continue + + # Utilization. + nodes = [nr.dim.size() for nr in psr] + ops = [sum(ilp.network[l].total_ops() for l in ltpl) + for ltpl in segment] + self.assertEqual(len(nodes), len(ops)) + time = max(o * 1. / n for o, n in zip(ops, nodes)) + max_ops = time * sum(nodes) + real_ops = sum(ops) + self.assertGreaterEqual(real_ops / max_ops, 0.9) + + _check_all_segment(ilp) + + for net_name in ['zfnet', 'net3']: + net = self.net[net_name] + ilp = self._make_ilp(net) + _check_all_segment(ilp) + + def test_allocation(self): + ''' allocation(). ''' + + # Single vertex. + + net = self.net['net1'] + ilp = self._make_ilp(net) + for idx in range(len(ilp.dag_vertex_list)): + segment = self._make_segment((idx,), ilp.network) + alloc = segment.allocation() + self.assertIsNotNone(alloc) + self._validate_allocation(segment, alloc) + + # Linear networks. + + for net_name in ['net1', 'net2']: + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + # This is a linear network structure. + rlist = sum(alloc, tuple()) + + # The data source of all layers except for the first in the + # segment should be previous processing regions. + for r in rlist[1:]: + self.assertEqual(r.src_data_region.type, NodeRegion.PROC, + 'test_segment_allocation: ' + 'data source should be PROC region.') + + # The data destination of all layers except for the last in the + # segment should be local. + for r in rlist[:-1]: + self.assertEqual(r.dst_data_region.type, NodeRegion.PROC, + 'test_segment_allocation: ' + 'data destination should be PROC region.') + + # Complex networks. + + for net_name in ['net3', 'net4', 'net5']: + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + # Real networks. 
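+        # The remaining fixture entries (e.g., zfnet) are real NN
+        # definitions, as opposed to the synthetic 'net*' test networks.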
+ + for net_name in self.net: + + if net_name.startswith('net'): + continue + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + def test_allocation_sh_mem_src(self): + ''' allocation() shared mem src. ''' + + net = self.net['net3'] + + segment = self._make_segment((6, 7, 8, 9), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertEqual(alloc[3][0].src_data_region, alloc[0][0].proc_region) + + segment = self._make_segment((6, 7, 8, 9), net, with_opt=False) + self.assertFalse(segment.valid) + + net = self.net['net5'] + + segment = self._make_segment((1, 2, 3), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertEqual(alloc[2][0].src_data_region, alloc[0][0].proc_region) + + segment = self._make_segment((1, 2, 3), net, with_opt=False) + self.assertFalse(segment.valid) + + net = self.net['net4'] + + segment = self._make_segment((8, 9), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertEqual(alloc[1][0].src_data_region, alloc[0][0].proc_region) + + segment = self._make_segment((8, 9), net, with_opt=False) + self.assertFalse(segment.valid) + + def test_allocation_temp(self): + ''' allocation() temporal. ''' + + for net in self.net.values(): + + for segment in self._gen_all_segment(net, temporal=True): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + def test_allocation_no_time_mux(self): + ''' allocation() no_time_mux. ''' + net = self.net['net2'] + + segment = self._make_segment(tuple(range(16)), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertTrue(all(r.no_time_mux for rtpl in alloc for r in rtpl)) + + segment = self._make_segment(tuple(range(8)), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertFalse(any(r.no_time_mux for rtpl in alloc for r in rtpl)) + + segment = self._make_segment(tuple(range(16)), net, temporal=True) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertFalse(any(r.no_time_mux for rtpl in alloc for r in rtpl)) + + def test_allocation_invalid(self): + ''' allocation() for invalid segment. ''' + segment = self._make_segment((0, 1), self.net['net3'], temporal=True) + self.assertFalse(segment.valid) + self.assertIsNone(segment.allocation()) + + def test_gen_constraint(self): + ''' gen_constraint(). ''' + + # Single vertex. + + for net_name in self.net: + + net = self.net[net_name] + ilp = self._make_ilp(net) + + for idx in range(len(ilp.dag_vertex_list)): + segment = self._make_segment((idx,), ilp.network) + self.assertTrue(segment.valid) + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # No top loop constraint for single-layer segment. + if len(constraint) == 1 and len(constraint[0]) == 1: + for c in itertools.chain.from_iterable(constraint): + self.assertTrue(c.topifm == 0 and c.topofm == 0 + and c.topbat == 0) + + # Spatial pipelining. + + for net_name in self.net: + + if not net_name.startswith('net') and net_name != 'zfnet': + continue + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + if not segment.valid: + continue + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # Special cases. 
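+        # Manually build a two-stage segment over net2, presumably not among
+        # the generated ones above, and check its constraints still validate.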
+ + net = self.net['net2'] + + segment = PipelineSegment((('0', '1'), ('2', '3')), net, + self.batch_size, self.resource) + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + def test_gen_constraint_fbofm_init(self): + ''' gen_constraint() deciding fbofm_init. ''' + + net = self.net['zfnet'] + + # Two spatial, fbofm_init == False. + segment = PipelineSegment((('fc2',), ('fc3',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + + # Two spatial, fbofm_init == False. + segment = PipelineSegment((('conv5', 'pool3'), ('fc1',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[0][1].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + + # Four spatial, fbofm_init == False. + segment = PipelineSegment((('conv1', 'pool1'), ('conv2', 'pool2'), + ('conv3',), ('conv4',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[0][1].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + self.assertTrue(segment.cstr_symargs[1][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[1][1]['fbofm']) + self.assertTrue(segment.cstr_symargs[2][0]['fbifm']) + self.assertFalse(segment.cstr_symargs[2][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[3][0].get('fbifm', False)) + + # Three spatial, fbofm_init == False. + segment = PipelineSegment((('conv4',), ('conv5', 'pool3'), ('fc1',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + self.assertTrue(segment.cstr_symargs[1][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[1][1]['fbofm']) + self.assertTrue(segment.cstr_symargs[2][0]['fbifm']) + + # Three spatial, fbofm_init == False. + segment = PipelineSegment((('conv2', 'pool2'), ('conv3',), ('conv4',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[0][1].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + self.assertTrue(segment.cstr_symargs[1][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[2][0]['fbifm']) + + # Three spatial, fbofm_init == True. + segment = PipelineSegment((('conv3',), ('conv4',), ('conv5', 'pool3')), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertTrue(segment.cstr_symargs[0][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[1][0]['fbifm']) + self.assertFalse(segment.cstr_symargs[1][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[2][0].get('fbifm', False)) + + def test_gen_constraint_sh_mem_src(self): + ''' gen_constraint() shared mem src. ''' + + net = self.net['net3'] + + segment = self._make_segment((6, 7, 8, 9), net) + self.assertTrue(segment.valid) + + # 0 and 3 share memory source. 
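+        # Layers sharing a memory source must read ifmaps with matching
+        # top-level grouping, so their topifm values are asserted equal below.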
+ for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + self.assertEqual(constraint[3][0].topifm, constraint[0][0].topifm) + self.assertTrue(constraint[3][0].topifm <= 1 + or constraint[3][0].topofm <= 1) + self.assertTrue(constraint[0][0].topifm <= 1 + or constraint[0][0].topofm <= 1) + + net = self.net['net5'] + + segment = self._make_segment((1, 2, 3), net) + self.assertTrue(segment.valid) + + # 0 and 2 share memory source. + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # 0 constrains topofm. + self.assertNotEqual(constraint[0][0].topofm, 0) + + # Must fully buffer ifmaps. + self.assertEqual(constraint[2][0].topifm, 1) + self.assertEqual(constraint[0][0].topifm, 1) + + net = self.net['net4'] + + segment = self._make_segment((8, 9), net) + self.assertTrue(segment.valid) + + # 0 and 1 share memory source. + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # No topofm constraint. + self.assertEqual(constraint[0][0].topofm, 0) + self.assertEqual(constraint[1][0].topofm, 0) + + self.assertEqual(constraint[1][0].topifm, constraint[0][0].topifm) + + def test_gen_constraint_temporal(self): + ''' gen_constraint() temporal. ''' + + for net_name in self.net: + + net = self.net[net_name] + + for segment in self._gen_all_segment(net, temporal=True): + if not segment.valid: + continue + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + def test_gen_constraint_hints(self): + ''' gen_constraint() pruning hints. ''' + + # Use ZFNet to give the real fmap dimensions. + net_name = 'zfnet' + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + if not segment.valid: + continue + + hints_set = set() + last_hints = None + + for _, hints in segment.gen_constraint(): + + self.assertTrue(all(isinstance(h, int) and h > 0 + for h in hints), + 'test_gen_constraint_hints: ' + 'all hints should be positive integers only. ' + '{}'.format(hints)) + + self.assertTrue(all( + not all(h < ph for h, ph in zip(hints, phints)) + for phints in hints_set), + 'test_gen_constraint_hints: ' + 'smaller hints are generated too late.') + + if last_hints: + self.assertGreater(hints, last_hints, + 'test_gen_constraint_hints: ' + 'hints should be generated from small ' + 'to large.') + last_hints = hints + + def test_gen_constraint_max_ovhd(self): + ''' gen_constraint() with max_time_overhead. 
''' + + def _make_key(constraint): + return tuple((c.topifm, c.topofm, c.topbat) + for c in itertools.chain.from_iterable(constraint)) + + net = self.net['zfnet'] + + for segment in self._gen_all_segment(net): + if not segment.valid: + continue + + set_all = set() + set_1 = set() + set_5 = set() + + for constraint, _ in segment.gen_constraint(): + + timing = PipelineSegmentTiming(net, 0) + for sp_idx, (ltpl, ctpl) in enumerate(zip(segment, constraint)): + for tm_idx, (l, c) in enumerate(zip(ltpl, ctpl)): + res = self._make_sched_res((0, sp_idx, tm_idx), + 65536 // len(ltpl), + top_ti=c.topifm, + top_to=c.topofm, + top_tb=c.topbat) + timing.add(l, res) + + key = _make_key(constraint) + + set_all.add(key) + if timing.time_overhead <= 0.1: + set_1.add(key) + if timing.time_overhead <= 0.5: + set_5.add(key) + + for constraint, _ in segment.gen_constraint(max_time_overhead=0.1): + key = _make_key(constraint) + set_1.discard(key) + + self.assertFalse(set_1, + 'gen_constraint with max_time_overhead: ' + 'miss generating constraints with <= 0.1 ovhd:\n' + '{}'.format(set_1)) + + for constraint, _ in segment.gen_constraint(max_time_overhead=0.5): + key = _make_key(constraint) + set_5.discard(key) + + self.assertFalse(set_5, + 'gen_constraint with max_time_overhead: ' + 'miss generating constraints with <= 0.5 ovhd:\n' + '{}'.format(set_5)) + + def test_gen_constraint_not_opt(self): + ''' gen_constraint() not with opt. ''' + + def _validate_fully_buffered_constraint(segment, constraint): + layer2idx = dict((l, (sp_idx, tm_idx)) + for sp_idx, ltpl in enumerate(segment) + for tm_idx, l in enumerate(ltpl)) + seg_layers = set(layer2idx.keys()) + + for l, c in zip(itertools.chain.from_iterable(segment), + itertools.chain.from_iterable(constraint)): + + if not isinstance(net[l], ConvLayer): + continue + + onchip_prevs = seg_layers.intersection(net.prevs(l)) + if onchip_prevs: + self.assertEqual(c.topifm, 1) + for p in onchip_prevs: + sp_idx, tm_idx = layer2idx[p] + p_c = constraint[sp_idx][tm_idx] + self.assertEqual(p_c.topofm, 1) + + for net_name in self.net: + + net = self.net[net_name] + + # Spatial pipelining. + for segment in self._gen_all_segment(net, with_opt=False): + if not segment.valid: + continue + + for constraint, _ in segment.gen_constraint(): + _validate_fully_buffered_constraint(segment, constraint) + diff --git a/nn_dataflow/tests/pipeline_test/test_pipeline_segment_timing.py b/nn_dataflow/tests/pipeline_test/test_pipeline_segment_timing.py new file mode 100644 index 0000000..edf4291 --- /dev/null +++ b/nn_dataflow/tests/pipeline_test/test_pipeline_segment_timing.py @@ -0,0 +1,343 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +from nn_dataflow.core import InputLayer, FCLayer, PoolingLayer +from nn_dataflow.core import Network +from nn_dataflow.core import PipelineSegmentTiming + +from . import TestPipelineFixture + +class TestPipelineSegmentTiming(TestPipelineFixture): + ''' Tests for PipelineSegmentTiming. 
''' + + def setUp(self): + super(TestPipelineSegmentTiming, self).setUp() + + self.net1 = self.net['net1'] + + self.net4 = self.net['net4'] + + self.netlr = Network('net1') + self.netlr.set_input_layer(InputLayer(10, 1)) + self.netlr.add('0p1', PoolingLayer(10, 1, 1)) + self.netlr.add('0p2', PoolingLayer(10, 1, 1)) + self.netlr.add('0p3', PoolingLayer(10, 1, 1)) + self.netlr.add('1', FCLayer(10, 20)) + + def test_valid_args(self): + ''' Valid arguments. ''' + timing = PipelineSegmentTiming(self.net1, 3) + self.assertIs(timing.network, self.net1) + self.assertEqual(timing.seg_idx, 3) + + def test_invalid_network(self): + ''' Invalid network. ''' + with self.assertRaisesRegexp(TypeError, + 'PipelineSegmentTiming: .*network.*'): + _ = PipelineSegmentTiming(self.net1.input_layer(), 3) + + def test_add(self): + ''' add(). ''' + # No fused. + + timing = PipelineSegmentTiming(self.net1, 3) + + timing.add('0', self._make_sched_res((3, 0, 0), 123, + top_to=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 0, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 3) + + timing.add('1', self._make_sched_res((3, 1, 0), 141, + top_ti=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 1) + + timing.add('1p', self._make_sched_res((3, 1, 1), 12, + top_ti=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 1)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 1) + + self.assertEqual(timing.bat_ngrp, 2) + self.assertEqual(len(timing.timing_list), 2) + self.assertEqual(len(timing.timing_list[0]), 1) + self.assertEqual(len(timing.timing_list[1]), 2) + + # Fused. + + timing = PipelineSegmentTiming(self.net1, 3) + + timing.add('0', self._make_sched_res((3, 0, 0), 123, + top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 0, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 1) + + timing.add('1', self._make_sched_res((3, 1, 0), 141, + top_to=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 3) + + timing.add('1p', self._make_sched_res((3, 1, 1), 12, + top_to=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 1)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 3) + + # Unmatched BAT group number. + + self.assertEqual(timing.bat_ngrp, 2) + timing.add('2', self._make_sched_res((3, 2, 0), 123, top_tb=4)) + self.assertEqual(timing.bat_ngrp, 1) + + def test_add_all_lr(self): + ''' add() all LocalRegionLayer. ''' + timing = PipelineSegmentTiming(self.netlr, 2) + + timing.add('0p1', self._make_sched_res((2, 0, 0), 40, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + timing.add('0p2', self._make_sched_res((2, 0, 1), 80, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + timing.add('0p3', self._make_sched_res((2, 0, 2), 60, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + timing.add('1', self._make_sched_res((2, 1, 0), 800, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + + def test_add_invalid_sched_seq(self): + ''' add(), invalid sched seq. 
''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 123)) + + with self.assertRaisesRegexp(ValueError, + 'PipelineSegmentTiming: .*belong to.*'): + timing.add('1', self._make_sched_res((2, 1, 0), 123)) + + with self.assertRaisesRegexp(ValueError, + 'PipelineSegmentTiming: .*follow.*'): + timing.add('1p', self._make_sched_res((3, 1, 1), 123)) + + def test_add_already_in(self): + ''' add(), layer already in. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 123)) + with self.assertRaisesRegexp(ValueError, + 'PipelineSegmentTiming: .*layer 0.*'): + timing.add('0', self._make_sched_res((3, 1, 0), 123)) + + def test_time_bat_ngrp(self): + ''' time and critical_time bat_ngrp. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, top_tb=4)) + timing.add('1', self._make_sched_res((3, 1, 0), 130, top_tb=4)) + timing.add('1p', self._make_sched_res((3, 1, 1), 20, top_tb=4)) + timing.add('2', self._make_sched_res((3, 2, 0), 136, top_tb=4)) + self.assertEqual(timing.critical_time, 150) + self.assertEqual(timing.time, 120 // 4 + 130 + 20 + 136 // 4) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 130 + 20 + 136) / 3.) - 1) + + # Unmatched BAT group number. + timing.add('3', self._make_sched_res((3, 3, 0), 100, top_tb=2)) + self.assertEqual(timing.time, 120 + 130 + 20 + 136 + 100) + self.assertAlmostEqual(timing.time_overhead, + timing.time + / ((120 + 130 + 20 + 136 + 100) / 4.) - 1) + + def test_time_ifm_ofm_ngrp(self): + ''' time and critical_time ifm_ngrp and ofm_ngrp. ''' + + # Single-group wait, first critical. + + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_to=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 90, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 120) + # Layer 0 is critical. Layer 0 last BAT group starts at 120 - 120 // 2. + # Layer 1 last BAT group starts 120 // 2 // 3 later, which takes 90 // + # 2. + self.assertEqual(timing.time, + 120 - 120 // 2 + 120 // 2 // 3 + 90 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 90) / 2.) - 1) + + # Single-group wait, second critical. + + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_to=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 150, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + # Layer 1 is critical. Layer 1 first BAT group starts at 120 // 2 // 3, + # and takes 150 for all its BAT groups. + self.assertEqual(timing.time, 120 // 2 // 3 + 150) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150) / 2.) - 1) + + # All-group wait, first critical. + + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_to=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 90, + top_to=3, top_tb=2)) + self.assertEqual(timing.critical_time, 120) + self.assertEqual(timing.time, 120 + 90 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 90) / 2.) - 1) + + # All-group wait, second critical. 
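+        # Neither layer groups its ofmaps (only topifm is set), so layer 1
+        # waits for all of layer 0's first BAT group (120 // 2) to finish.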
+ + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_ti=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 150, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + self.assertEqual(timing.time, 120 // 2 + 150) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150) / 2.) - 1) + + def test_time_linear(self): + ''' time and critical_time linear. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_ti=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 129, + top_to=3, top_tb=2)) + timing.add('1p', self._make_sched_res((3, 1, 1), 21, + top_to=3, top_tb=2)) + timing.add('2', self._make_sched_res((3, 2, 0), 138, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + # Layer 1 is critical. Layer 1+1p first BAT group starts at 120 // 2, + # and last BAT group starts at 150 // 2 later. Layer 2 last BAT group + # starts 150 // 2 // 3 later, and takes 138 // 2. + self.assertEqual(timing.time, + 120 // 2 + 150 // 2 + 150 // 2 // 3 + 138 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 129 + 21 + 138) / 3.) - 1) + + def test_time_branch(self): + ''' time and critical_time branch. ''' + + # Single-group wait. + + timing = PipelineSegmentTiming(self.net4, 3) + timing.add('6', self._make_sched_res((3, 0, 0), 120, + top_ti=3, top_tb=2)) + timing.add('7', self._make_sched_res((3, 1, 0), 150, + top_to=3, top_tb=2)) + timing.add('8', self._make_sched_res((3, 2, 0), 144, + top_ti=3, top_tb=2)) + timing.add('9', self._make_sched_res((3, 3, 0), 168, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 168) + # Layer 9 is critical. Layer 7 first BAT group starts at 120 // 2. + # Layer 8 and 9 first BAT group starts at 150 // 2 // 3 later, and all + # groups of layer 9 take 168. + self.assertEqual(timing.time, + 120 // 2 + 150 // 2 // 3 + 168) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150 + 144 + 168) / 4.) - 1) + + # All-group wait. + + timing = PipelineSegmentTiming(self.net4, 3) + timing.add('6', self._make_sched_res((3, 0, 0), 120, top_tb=2)) + timing.add('7', self._make_sched_res((3, 1, 0), 150, top_tb=2)) + timing.add('8', self._make_sched_res((3, 2, 0), 144, top_tb=2)) + timing.add('9', self._make_sched_res((3, 3, 0), 132, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + # Layer 7 is critical. Layer 7 first BAT group starts at 120 // 2, and + # layer 7 last BAT group ends at 150 later, at which time layer 8 and 9 + # last BAT group starts, and takes 144 // 2. + self.assertEqual(timing.time, 120 // 2 + 150 + 144 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150 + 144 + 132) / 4.) - 1) + + def test_time_all_lr(self): + ''' time and critical_time all LocalRegionLayer. ''' + timing = PipelineSegmentTiming(self.netlr, 2) + timing.add('0p1', self._make_sched_res((2, 0, 0), 40, + top_to=5, top_tb=2)) + timing.add('0p2', self._make_sched_res((2, 0, 1), 80, + top_to=5, top_tb=2)) + timing.add('0p3', self._make_sched_res((2, 0, 2), 60, + top_to=5, top_tb=2)) + timing.add('1', self._make_sched_res((2, 1, 0), 800, + top_ti=5, top_tb=2)) + self.assertEqual(timing.critical_time, 800) + # Layer 1 is critical. Layer 1 first BAT group starts at (40 + 80 + 60) + # // 2 // 5, and takes 800. 
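+        # The three pooling layers run in one spatial stage, so their times
+        # sum to 180 before dividing into BAT and ofmap groups.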
+ self.assertEqual(timing.time, (40 + 80 + 60) // 2 // 5 + 800) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((40 + 80 + 60 + 800) / 2.) - 1) + + def test_time_single_spatial(self): + ''' time and critical_time for single-spatial segment. ''' + + for net_name in self.net: + if not net_name.startswith('net'): + continue + net = self.net[net_name] + + for seg in self._gen_all_segment(net, temporal=True): + if not seg.valid: + continue + self.assertEqual(len(seg), 1) + + timing = PipelineSegmentTiming(net, 0) + for idx, layer in enumerate(seg[0]): + timing.add(layer, + self._make_sched_res((0, 0, idx), + (40 + idx * 7 % 3) * 16, + top_to=4, top_ti=4, + top_tb=4)) + + self.assertEqual(timing.critical_time, timing.time) + self.assertAlmostEqual(timing.time_overhead, 0.) + + def test_time_dram_time(self): + ''' time and critical_time dominated by DRAM time. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, dram_time=100, + top_ti=3, top_tb=4)) + timing.add('1', self._make_sched_res((3, 1, 0), 130, dram_time=140, + top_to=3, top_tb=4)) + timing.add('1p', self._make_sched_res((3, 1, 1), 20, dram_time=10, + top_to=3, top_tb=4)) + timing.add('2', self._make_sched_res((3, 2, 0), 138, dram_time=100, + top_ti=3, top_tb=4)) + self.assertEqual(timing.critical_time, 160) + self.assertEqual(timing.time, 100 + 140 + 10 + 100) + self.assertEqual(timing.dram_time, timing.time) + self.assertLess(timing.node_time, timing.time) + + def test_time_overhead(self): + ''' time_overhead. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, num_nodes=4, + top_ti=3, top_tb=4)) + timing.add('1', self._make_sched_res((3, 1, 0), 130, num_nodes=6, + top_to=3, top_tb=4)) + timing.add('1p', self._make_sched_res((3, 1, 1), 20, num_nodes=6, + top_to=3, top_tb=4)) + timing.add('2', self._make_sched_res((3, 2, 0), 138, num_nodes=3, + top_ti=3, top_tb=4)) + + time_indv = 120 * 4 / 13. + (130 + 20) * 6 / 13. + 138 * 3 / 13. + self.assertAlmostEqual(timing.time_overhead, + timing.time / time_indv - 1) + diff --git a/nn_dataflow/tests/unit_test/test_buf_shr_scheme.py b/nn_dataflow/tests/unit_test/test_buf_shr_scheme.py new file mode 100644 index 0000000..c04bd27 --- /dev/null +++ b/nn_dataflow/tests/unit_test/test_buf_shr_scheme.py @@ -0,0 +1,349 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import math +import unittest + +from nn_dataflow.core import BufShrScheme +from nn_dataflow.core import DataCategoryEnum as de +from nn_dataflow.core import DataDimLoops +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import NodeRegion +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow.core import PhyDim2 + +class TestBufShrScheme(unittest.TestCase): + ''' Tests for BufShrScheme. 
''' + + def setUp(self): + self.ps1 = PartitionScheme(order=[pe.BATP, pe.OUTP, pe.OFMP, pe.INPP], + pdims=[(2, 3), (3, 1), (1, 5), (5, 2)]) + self.ps2 = PartitionScheme(order=range(pe.NUM), + pdims=[(2, 2), (5, 5), (3, 3), (1, 1)]) + self.ps3 = PartitionScheme(order=range(pe.NUM), + pdims=[(1, 6), (1, 2), (4, 1), (3, 5)]) + + self.nr1 = NodeRegion(origin=PhyDim2(0, 0), dim=self.ps1.dim(), + type=NodeRegion.PROC) + self.nr2 = NodeRegion(origin=PhyDim2(0, 0), dim=self.ps2.dim(), + type=NodeRegion.PROC) + self.nr3 = NodeRegion(origin=PhyDim2(0, 0), dim=self.ps3.dim(), + type=NodeRegion.PROC) + + self.bufshr1 = BufShrScheme(self.nr1, self.ps1) + self.bufshr2 = BufShrScheme(self.nr2, self.ps2) + self.bufshr3 = BufShrScheme(self.nr3, self.ps3) + + def test_dim(self): + ''' Accessor dim. ''' + for bufshr, ps in zip([self.bufshr1, self.bufshr2, self.bufshr3], + [self.ps1, self.ps2, self.ps3]): + self.assertTupleEqual(bufshr.dim(de.IFM), ps.dim(pe.OUTP)) + self.assertTupleEqual(bufshr.dim(de.OFM), ps.dim(pe.INPP)) + + self.assertTupleEqual(self.bufshr1.dim(de.FIL), self.ps1.dim(pe.OFMP)) + self.assertTupleEqual(self.bufshr2.dim(de.FIL), + self.ps2.dim(pe.OFMP, pe.BATP)) + self.assertTupleEqual(self.bufshr3.dim(de.FIL), + self.ps3.dim(pe.OFMP, pe.BATP)) + + def test_size(self): + ''' Get size. ''' + for bufshr in [self.bufshr1, self.bufshr2, self.bufshr3]: + for dce in range(de.NUM): + self.assertEqual(bufshr.dim(dce).size(), bufshr.size(dce)) + + def test_dim_fil(self): + ''' Accessor dim with different partitioning for FIL. ''' + # Adjacent, BATP upon OFMP. + ps = PartitionScheme(order=[pe.INPP, pe.OUTP, pe.BATP, pe.OFMP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (15,) * 2) + # Adjacent, OFMP upon BATP. + ps = PartitionScheme(order=[pe.INPP, pe.OFMP, pe.BATP, pe.OUTP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (15,) * 2) + + # Not adjacent, BATP upon OFMP. + ps = PartitionScheme(order=[pe.OUTP, pe.BATP, pe.INPP, pe.OFMP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (5,) * 2) + # Not adjacent, OFMP upon BATP. + ps = PartitionScheme(order=[pe.OFMP, pe.INPP, pe.BATP, pe.OUTP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (3,) * 2) + + # Only BATP. + ps = PartitionScheme(order=[pe.OUTP, pe.BATP, pe.INPP, pe.OFMP], + pdims=[(2, 2), (1, 1), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (3,) * 2) + # Only OFMP. + ps = PartitionScheme(order=[pe.OFMP, pe.INPP, pe.BATP, pe.OUTP], + pdims=[(2, 2), (5, 5), (1, 1), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (5,) * 2) + + def test_dim_invalid_index(self): + ''' Accessor dim invalid index. ''' + with self.assertRaises(IndexError): + _ = self.bufshr1.dim(de.NUM) + + def test_size_invalid_index(self): + ''' Get size invalid index. 
''' + with self.assertRaises(IndexError): + _ = self.bufshr1.size(de.NUM) + + def test_nbr_dists(self): + ''' Accessor nbr_dists. ''' + inf = float('inf') + + self.assertTupleEqual(self.bufshr1.nbr_dists[de.FIL], (5, inf)) + self.assertTupleEqual(self.bufshr1.nbr_dists[de.IFM], (15, 2)) + self.assertTupleEqual(self.bufshr1.nbr_dists[de.OFM], (1, 1)) + + self.assertTupleEqual(self.bufshr2.nbr_dists[de.FIL], (1, 1)) + self.assertTupleEqual(self.bufshr2.nbr_dists[de.IFM], (15, 15)) + self.assertTupleEqual(self.bufshr2.nbr_dists[de.OFM], (inf, inf)) + + self.assertTupleEqual(self.bufshr3.nbr_dists[de.FIL], (3, 5)) + self.assertTupleEqual(self.bufshr3.nbr_dists[de.IFM], (inf, 10)) + self.assertTupleEqual(self.bufshr3.nbr_dists[de.OFM], (1, 1)) + + def test_default_data_loops(self): + ''' Default data_loops in constructor. ''' + data_loops = [None] * de.NUM + data_loops[de.FIL] = DataDimLoops(le.IFM, le.OFM) + data_loops[de.IFM] = DataDimLoops(le.IFM, le.BAT) + data_loops[de.OFM] = DataDimLoops(le.OFM, le.BAT) + + for bufshr, nr, ps in zip([self.bufshr1, self.bufshr2, self.bufshr3], + [self.nr1, self.nr2, self.nr3], + [self.ps1, self.ps2, self.ps3]): + + bufshr_ = BufShrScheme(nr, ps, data_loops) + + for dce in range(de.NUM): + self.assertTupleEqual(bufshr.dim(dce), + bufshr_.dim(dce)) + self.assertTupleEqual(bufshr.nbr_dists[dce], + bufshr_.nbr_dists[dce]) + + def test_data_loops(self): + ''' data_loops in constructor. ''' + data_loops = [None] * de.NUM + data_loops[de.FIL] = DataDimLoops(le.IFM, le.OFM) + data_loops[de.IFM] = DataDimLoops(le.OFM, le.BAT) + data_loops[de.OFM] = DataDimLoops(le.OFM, le.BAT) + + for nr, ps in zip([self.nr1, self.nr2, self.nr3], + [self.ps1, self.ps2, self.ps3]): + + bufshr = BufShrScheme(nr, ps, data_loops) + + self.assertTupleEqual(bufshr.dim(de.IFM), bufshr.dim(de.OFM)) + self.assertTupleEqual(bufshr.nbr_dists[de.IFM], + bufshr.nbr_dists[de.OFM]) + + def test_data_loops_all_lpe(self): + ''' data_loops in constructor have all LoopEnum. ''' + data_loops = [None] * de.NUM + data_loops[de.FIL] = DataDimLoops(le.IFM, le.OFM) + data_loops[de.IFM] = DataDimLoops(le.IFM, le.OFM, le.BAT) + data_loops[de.OFM] = DataDimLoops(le.OFM, le.BAT) + + bufshr = BufShrScheme(self.nr1, self.ps1, data_loops) + + self.assertTupleEqual(bufshr.dim(de.IFM), (1, 1)) + self.assertTrue(all(math.isinf(d) for d in bufshr.nbr_dists[de.IFM])) + + def test_mismatch_node_region(self): + ''' Mismatched node region and part in constructor. ''' + # Smaller node region. Invalid. + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*region.*'): + _ = BufShrScheme(NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(1, 1), + type=NodeRegion.PROC), + self.ps1) + + # Larger node region. Valid. + bufshr = BufShrScheme(NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(100, 100), + type=NodeRegion.PROC), + self.ps1) + self.assertTupleEqual(bufshr.dim(de.IFM), self.ps1.dim(pe.OUTP)) + + def test_nhops_rotate_all(self): + ''' Get nhops_rotate_all. ''' + # With `self.bufshr3` and FIL, the dimension is 4 by 2, with neighbor + # distances 3 and 5. + bufshr = self.bufshr3 + dce = de.FIL + self.assertTupleEqual(bufshr.dim(dce), (4, 2)) + self.assertTupleEqual(bufshr.nbr_dists[dce], (3, 5)) + + # Subgroup as 4 by 2. The whole circle is six hops of 3 and two hops of + # 5, but only 7 of 8 steps. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 8), + (3 * 6 + 5 * 2) * 7 / 8.) + # Subgroup as 4 by 1. One node does three hops of 3, and other three + # nodes do two hops of 3 and one hop of 9 (looping back). 
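+        # The 4x2 group holds two such 4x1 subgroups, hence the final * 2.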
+ self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 4), + ((3 * 3) + (3 * 2 + 9) * 3) / 4. * 2) + # Subgroup as 2 by 1. All nodes do one hop of 3. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 2), + (3 + 3) / 2. * 4) + # Subgroup as 1. No rotation. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 1), 0) + + # Subgroup as 4 by 1. One node does two hops of 3 and two do one hop of + # 3 and 6 each. The 3rd node also sends to the 4th one with two hops of + # 3. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 3), + ((3 * 2) + (3 + 6) * 2 + (3 * 2)) / 3. * 2) + # Subgroup as 4 by 2. The 1st node does three hops of 3 and one hop of + # 5. The 2nd, 3rd, and 4th nodes do two hops of 3, and one hop of 5, + # and one looping back from the 5th node to the 1st node. The 5th node + # does one looping back and three hops of 3. Finally, the 5th node also + # sends to the 6th to 8th nodes. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 5), + ((3 * 3 + 5) + (3 * 2 + 5 + (3 * 3 + 5)) * 3 + + ((3 * 3 + 5) + 3 * 3) + 3 * 3 * 4) / 5.) + # The others are similar. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 6), + ((3 * 4 + 5) + (3 * 3 + 5 + (3 * 2 + 5)) * 4 + + ((3 * 2 + 5) + 3 * 4) + 3 * 2 * 5) / 6.) + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 7), + ((3 * 5 + 5) + (3 * 4 + 5 + (3 * 1 + 5)) * 5 + + ((3 * 1 + 5) + 3 * 5) + 3 * 1 * 6) / 7.) + + def test_nhops_rotate_all_invalid(self): + ''' Get nhops_rotate_all with invalid args. ''' + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*subgroup.*'): + _ = self.bufshr3.nhops_rotate_all( + de.FIL, self.bufshr3.size(de.FIL) + 1) + + def test_nhops_rotate_all_rot_unit(self): + ''' Get nhops_rotate_all with rotation unit count. ''' + + bufshr = self.bufshr3 + dce = de.FIL + self.assertTupleEqual(bufshr.dim(dce), (4, 2)) + + for subgrp_size in range(1, bufshr.size(dce)): + + nhops = bufshr.nhops_rotate_all(dce, subgrp_size) + + for rotation_unit_cnt in range(subgrp_size, 32): + self.assertEqual(bufshr.nhops_rotate_all(dce, subgrp_size, + rotation_unit_cnt), + nhops) + + for rotation_unit_cnt in range(1, subgrp_size): + self.assertLess(bufshr.nhops_rotate_all(dce, subgrp_size, + rotation_unit_cnt), + nhops) + + def test_nhops_rotate_all_cache(self): + ''' Get nhops_rotate_all using cache. ''' + + bufshr = self.bufshr3 + dce = de.FIL + + self.assertFalse(bufshr.nhops_cache) + + nhops_8 = bufshr.nhops_rotate_all(dce, 8) + nhops_4 = bufshr.nhops_rotate_all(dce, 4) + nhops_1 = bufshr.nhops_rotate_all(dce, 1) + self.assertEqual(len(bufshr.nhops_cache), 3) + self.assertEqual(nhops_8, bufshr.nhops_rotate_all(dce, 8)) + self.assertEqual(nhops_4, bufshr.nhops_rotate_all(dce, 4)) + self.assertEqual(nhops_1, bufshr.nhops_rotate_all(dce, 1)) + self.assertEqual(len(bufshr.nhops_cache), 3) + + dce = de.IFM + + nhops_3 = bufshr.nhops_rotate_all(dce, 3) + nhops_2 = bufshr.nhops_rotate_all(dce, 2) + self.assertEqual(len(bufshr.nhops_cache), 5) + self.assertEqual(nhops_3, bufshr.nhops_rotate_all(dce, 3)) + self.assertEqual(nhops_2, bufshr.nhops_rotate_all(dce, 2)) + self.assertEqual(len(bufshr.nhops_cache), 5) + + nhops_rot_unit = bufshr.nhops_rotate_all(dce, 3, 2) + + self.assertEqual(len(bufshr.nhops_cache), 6) + self.assertEqual(nhops_rot_unit, bufshr.nhops_rotate_all(dce, 3, 2)) + self.assertEqual(len(bufshr.nhops_cache), 6) + + def test_nhops_wide_fetch_once(self): + ''' Get nhops_wide_fetch_once. ''' + # With `self.bufshr3` and FIL, the dimension is 4 by 2, with neighbor + # distances 3 and 5. 
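+        # With fetch width 1, each node reads only its own buffered share,
+        # so wide fetch takes no hops (checked for all subgroup sizes below).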
+ bufshr = self.bufshr3 + dce = de.FIL + self.assertTupleEqual(bufshr.dim(dce), (4, 2)) + self.assertTupleEqual(bufshr.nbr_dists[dce], (3, 5)) + + for subgrp_size in range(bufshr.size(dce)): + self.assertAlmostEqual( + bufshr.nhops_wide_fetch_once(dce, subgrp_size, 1), 0) + + # Three nodes fetch one hop of 3, and the last node fetches one hop of + # 9 (looping back). + self.assertAlmostEqual(bufshr.nhops_wide_fetch_once(dce, 4, 2) * 2, + (3 * 3 + 9) / 4. * 2) + # Two nodes fetch one hop of 3, and the 3rd node fetches one hop of 6 + # (looping back). The last node fetches one hop of 3 from the 3rd. + self.assertAlmostEqual(bufshr.nhops_wide_fetch_once(dce, 3, 2) * 2, + (3 * 2 + 6 + 3) / 3. * 2) + # All nodes do one hop of 3. + self.assertAlmostEqual(bufshr.nhops_wide_fetch_once(dce, 2, 2) * 2, + (3 + 3) / 2. * 4) + + for subgrp_size in range(2, bufshr.size(dce)): + self.assertAlmostEqual( + bufshr.nhops_wide_fetch_once(dce, subgrp_size, 1.5) * 1.5, + bufshr.nhops_wide_fetch_once(dce, subgrp_size, 2) * 2. / 2.) + + def test_nhops_wide_fetch_once_inv(self): + ''' Get nhops_wide_fetch_once with invalid args. ''' + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*subgroup.*'): + _ = self.bufshr3.nhops_wide_fetch_once( + de.FIL, self.bufshr3.size(de.FIL) + 1, 2) + + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*width.*'): + _ = self.bufshr3.nhops_wide_fetch_once( + de.FIL, + self.bufshr3.size(de.FIL) / 2, + self.bufshr3.size(de.FIL) / 2 + 1) + + def test_repr(self): + ''' __repr__. ''' + self.assertIn(repr(self.ps1), repr(self.bufshr1)) + self.assertIn(repr(self.ps2), repr(self.bufshr2)) + self.assertIn(repr(self.ps3), repr(self.bufshr3)) + diff --git a/nn_dataflow/tests/unit_test/test_data_layout.py b/nn_dataflow/tests/unit_test/test_data_layout.py index b1855a7..f2c5827 100644 --- a/nn_dataflow/tests/unit_test/test_data_layout.py +++ b/nn_dataflow/tests/unit_test/test_data_layout.py @@ -212,6 +212,52 @@ def test_nhops_to_multidests(self): PhyDim2(2, 2)), nhops) + def test_nhops_to_multidests_fwd(self): + ''' Get nhops_to multiple destinations forwarding. ''' + fr = FmapRange((0,) * 4, (4, 4, 16, 16)) + # First to (2, 2), then (2, 2) to (-1, -2), (-1, -2) to (-2, -3). + nhops = 2 * 4 * 8 * 16 * (2 + 1 + 1 + 0) \ + + 2 * 4 * 8 * 16 * (4 * 7) \ + + 2 * 4 * 8 * 16 * (4 * 2) + self.assertEqual(self.dl1.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), + forwarding=True), + nhops) + + frng1 = FmapRange((0, 4, 0, 0), (4, 8, 16, 16)) + dl = DataLayout(frngs=(self.frng1, frng1), + regions=(self.region1, self.region2), + parts=(self.part1, self.part2)) + self.assertEqual(dl.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), + forwarding=True), + nhops) + + nhops += 2 * 4 * 16 * 16 * ((3 + 4) + 2 * 7 + 2 * 2) + fr = FmapRange((0,) * 4, (16,) * 4) + self.assertEqual(dl.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), + forwarding=True), + nhops) + + # (2, 2) to (3, 10) and (8, 4) + nhops += 4 * 8 * 16 * 16 * (9 + 8) + self.assertEqual(dl.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), PhyDim2(3, 10), + PhyDim2(8, 4), + forwarding=True), + nhops) + + def test_nhops_to_invalid_kwargs(self): + ''' Get nhops_to invalid kwargs. ''' + fr = FmapRange((0,) * 4, (4, 4, 16, 16)) + with self.assertRaisesRegexp(ValueError, 'DataLayout: .*keyword.*'): + _ = self.dl1.nhops_to(fr, PhyDim2(1, 1), f=True) + def test_is_in(self): ''' Whether is_in. 
''' nr1 = self.region1 @@ -255,6 +301,31 @@ def test_is_in(self): dim=PhyDim2(50, 50), type=self.region1.type))) + def test_is_in_folded(self): + ''' Whether is_in with folded regions. ''' + # (1, 1/2), (2/3, 0/1/2), (4, 1/2) + nr1 = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(1, 10), + type=self.region1.type, wtot=3, wbeg=2) + # (1, 1/2), (2, 2) + nr2 = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(1, 3), + type=self.region1.type, wtot=3, wbeg=2) + self.assertTrue(self.dl1.is_in(nr1)) + self.assertFalse(self.dl1.is_in(nr2)) + + # (1-2, 2), (3-4/5-6/7-8, 0/1/2) + region = NodeRegion(origin=PhyDim2(1, 2), dim=PhyDim2(2, 10), + type=self.region1.type, wtot=3, wbeg=1) + part = PartitionScheme(order=range(pe.NUM), + pdims=(PhyDim2(1, 5), PhyDim2(2, 1), + PhyDim2(1, 2), PhyDim2(1, 1))) + dl = DataLayout(frngs=self.dl1.frngs, + regions=(region,), parts=(part,)) + # (1-2, 1/2), (3-4/5-6, -1/0/1/2), (7-8, 0/1/2) + nr3 = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(2, 13), + type=self.region1.type, wtot=4, wbeg=2) + self.assertTrue(dl.is_in(nr3)) + self.assertFalse(dl.is_in(nr2)) + def test_concat(self): ''' Concat. ''' fr = FmapRange((0,) * 4, (30,) * 4) diff --git a/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py b/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py index 8824146..910441f 100644 --- a/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py +++ b/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py @@ -30,6 +30,7 @@ class TestNNDataflowScheme(unittest.TestCase): ''' Tests for NNDataflowScheme. ''' + # pylint: disable=too-many-public-methods # pylint: disable=too-many-public-methods @@ -57,15 +58,21 @@ def setUp(self): c1_layer = self.network['c1'] self.c1res = SchedulingResult( - scheme=OrderedDict([('cost', 1.5), ('time', 2.), ('ops', 4.), + scheme=OrderedDict([('cost', 1.5), ('time', 200.), ('ops', 4.), ('num_nodes', 4), ('cost_op', 0.5), ('cost_access', 1.), ('cost_noc', 0), ('cost_static', 0), - ('proc_time', 2), ('bus_time', 0), + ('proc_time', 200), ('bus_time', 0), ('dram_time', 0), ('access', [[7, 8, 9]] * me.NUM), + ('remote_gbuf_access', [0] * 3), ('total_nhops', [4, 5, 6]), ('fetch', [[1, 1, 1], [2, 2, 2]]), + ('ti', [2, 2, 3]), + ('to', [1, 2, 3]), + ('tb', [1, 2, 3]), + ('tvals', [[2, 1, 1], [2, 2, 2], [3, 3, 3]]), + ('orders', [range(3)] * 2), ]), ofmap_layout=DataLayout( frngs=(FmapRange((0, 0, 0, 0), @@ -76,19 +83,26 @@ def setUp(self): regions=(NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 2), type=NodeRegion.DRAM),), parts=(PartitionScheme(order=range(pe.NUM), - pdims=[(1, 1)] * pe.NUM),))) + pdims=[(1, 1)] * pe.NUM),)), + sched_seq=(0, 0, 0)) p1_layer = self.network['p1'] self.p1res = SchedulingResult( - scheme=OrderedDict([('cost', 0.6), ('time', 0.05), ('ops', 0.1), + scheme=OrderedDict([('cost', 0.6), ('time', 5), ('ops', 0.1), ('num_nodes', 2), ('cost_op', 0.1), ('cost_access', 0.5), ('cost_noc', 0), ('cost_static', 0), - ('proc_time', 0.05), ('bus_time', 0), + ('proc_time', 5), ('bus_time', 0), ('dram_time', 0), ('access', [[.7, .8, .9]] * me.NUM), + ('remote_gbuf_access', [0] * 3), ('total_nhops', [.4, .5, .6]), ('fetch', [[1, 1, 1], [2, 2, 2]]), + ('ti', [2, 2, 3]), + ('to', [1, 2, 3]), + ('tb', [1, 2, 3]), + ('tvals', [[2, 1, 1], [2, 2, 2], [3, 3, 3]]), + ('orders', [range(3)] * 2), ]), ofmap_layout=DataLayout( frngs=(FmapRange((0, 0, 0, 0), @@ -99,12 +113,17 @@ def setUp(self): regions=(NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 2), type=NodeRegion.DRAM),), parts=(PartitionScheme(order=range(pe.NUM), - pdims=[(1, 1)] * pe.NUM),))) + 
pdims=[(1, 1)] * pe.NUM),)), + sched_seq=(0, 1, 0)) + + self.p2res = SchedulingResult( + scheme=self.p1res.scheme, ofmap_layout=self.p1res.ofmap_layout, + sched_seq=(0, 2, 0)) self.dtfl = NNDataflowScheme(self.network, self.input_layout) self.dtfl['c1'] = self.c1res self.dtfl['p1'] = self.p1res - self.dtfl['p2'] = self.p1res + self.dtfl['p2'] = self.p2res def test_init(self): ''' Initial. ''' @@ -225,7 +244,7 @@ def test_setitem_already_exists(self): df['c1'] = self.c1res with self.assertRaisesRegexp(KeyError, 'NNDataflowScheme: .*c1*'): - df['c1'] = self.c1res + df['c1'] = self.c1res._replace(sched_seq=(1, 0, 0)) def test_setitem_prev_not_in(self): ''' __setitem__ previous not existing. ''' @@ -247,6 +266,22 @@ def test_setitem_prev_input_ext(self): df['c2'] = self.c1res self.assertAlmostEqual(df.total_cost, self.c1res.total_cost) + def test_setitem_invalid_seg_idx(self): + ''' __setitem__ invalid segment index. ''' + df = NNDataflowScheme(self.network, self.input_layout) + + with self.assertRaisesRegexp(ValueError, + 'NNDataflowScheme: .*segment index*'): + df['c1'] = self.c1res._replace(sched_seq=(1, 0, 0)) + + df = NNDataflowScheme(self.network, self.input_layout) + df['c1'] = self.c1res + df['p1'] = self.p1res._replace(sched_seq=(1, 0, 0)) + + with self.assertRaisesRegexp(ValueError, + 'NNDataflowScheme: .*segment index*'): + df['p2'] = self.p2res._replace(sched_seq=(0, 0, 0)) + def test_delitem(self): ''' __delitem__. ''' df = NNDataflowScheme(self.network, self.input_layout) @@ -288,7 +323,7 @@ def test_copy_ext(self): 'e1': self.input_layout}) df1['c1'] = self.c1res df1['p1'] = self.p1res - df1['p2'] = self.p1res + df1['p2'] = self.p2res df2 = df1.copy() @@ -330,7 +365,7 @@ def test_fmap_layout_ext(self): 'e1': self.input_layout}) df['c1'] = self.c1res df['p1'] = self.p1res - df['p2'] = self.p1res + df['p2'] = self.p2res flayout = df.fmap_layout(('e0',)) self.assertEqual(flayout, self.input_layout) @@ -345,7 +380,7 @@ def test_fmap_layout_ext(self): def test_properties(self): ''' Property accessors. ''' self.assertAlmostEqual(self.dtfl.total_cost, 1.5 + 0.6 * 2) - self.assertAlmostEqual(self.dtfl.total_time, 2 + 0.05 * 2) + self.assertAlmostEqual(self.dtfl.total_time, 200 + 5) self.assertAlmostEqual(self.dtfl.total_ops, 4 + 0.1 * 2) for a in self.dtfl.total_accesses: @@ -353,21 +388,105 @@ def test_properties(self): self.assertAlmostEqual(self.dtfl.total_noc_hops, (4 + 5 + 6) + (.4 + .5 + .6) * 2) + def test_time_full_net_single_seg(self): + ''' time() when full network fits in a single segment. ''' + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res + dtfl['p1'] = self.p1res._replace(sched_seq=(0, 1, 0)) + dtfl['p2'] = self.p2res._replace(sched_seq=(0, 2, 0)) + dtfl['f1'] = self.c1res._replace(sched_seq=(0, 3, 0)) + self.assertEqual(dtfl.total_time, 200) + + def test_static_cost_adjust(self): + ''' Adjust static cost portion. ''' + + # Add static cost. + idl_unit_cost = 1e-3 + + c1scheme = self.c1res.scheme + c1static = c1scheme['time'] * idl_unit_cost + c1scheme['cost_static'] += c1static + c1scheme['cost_access'] -= c1static + + p1scheme = self.p1res.scheme + p1static = p1scheme['time'] * idl_unit_cost + p1scheme['cost_static'] += p1static + p1scheme['cost_access'] -= p1static + + # No adjust. 
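+        # Each layer sits in its own segment here, so the total time is the
+        # plain sum of per-layer times and no static cost is deducted.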
+ dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1scheme, sched_seq=(1, 0, 0)) + dtfl['p2'] = self.p2res._replace(scheme=p1scheme, sched_seq=(2, 0, 0)) + dtfl['f1'] = self.c1res._replace(scheme=c1scheme, sched_seq=(3, 0, 0)) + + sum_cost = 1.5 + 0.6 + 0.6 + 1.5 + sum_time = 200 + 5 + 5 + 200 + + self.assertAlmostEqual(dtfl.total_cost, sum_cost) + self.assertAlmostEqual(dtfl.total_time, sum_time) + + # With adjust. + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1scheme, sched_seq=(0, 1, 0)) + dtfl['p2'] = self.p2res._replace(scheme=p1scheme, sched_seq=(0, 2, 0)) + dtfl['f1'] = self.c1res._replace(scheme=c1scheme, sched_seq=(1, 0, 0)) + + diff = (sum_time - dtfl.total_time) * idl_unit_cost + self.assertGreater(diff, 0) + self.assertAlmostEqual(dtfl.total_cost, sum_cost -diff) + + # All in one segment. + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1scheme, sched_seq=(0, 1, 0)) + dtfl['p2'] = self.p2res._replace(scheme=p1scheme, sched_seq=(0, 2, 0)) + dtfl['f1'] = self.c1res._replace(scheme=c1scheme, sched_seq=(0, 3, 0)) + + diff = (sum_time - dtfl.total_time) * idl_unit_cost + self.assertGreater(diff, 0) + self.assertAlmostEqual(dtfl.total_cost, sum_cost -diff) + + def test_segment_time_list(self): + ''' segment_time_list(). ''' + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res + dtfl['p1'] = self.p1res + dtfl['p2'] = self.p2res._replace(sched_seq=(1, 0, 0)) + self.assertListEqual(dtfl.segment_time_list(), [205, 5]) + + def test_segment_dram_time_list(self): + ''' segment_dram_time_list(). ''' + c1_scheme = self.c1res.scheme.copy() + c1_scheme['dram_time'] = 180 + p1_scheme = self.p1res.scheme.copy() + p1_scheme['dram_time'] = 5 + p2_scheme = self.p2res.scheme.copy() + p2_scheme['dram_time'] = 10 + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1_scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1_scheme) + dtfl['p2'] = self.p2res._replace(sched_seq=(1, 0, 0), + scheme=p2_scheme) + self.assertListEqual(dtfl.segment_dram_time_list(), [185, 10]) + self.assertListEqual(dtfl.segment_time_list(), [205, 10]) + def test_stats_active_node_pes(self): ''' Per-layer stats: active node PEs. ''' stats = self.dtfl.perlayer_stats('active_node_pes') self.assertEqual(len(stats), len(self.dtfl)) - self.assertAlmostEqual(stats['c1'], 0.5) - self.assertAlmostEqual(stats['p1'], 1) - self.assertAlmostEqual(stats['p2'], 1) + self.assertAlmostEqual(stats['c1'], 0.005) + self.assertAlmostEqual(stats['p1'], 0.01) + self.assertAlmostEqual(stats['p2'], 0.01) def test_stats_dram_bandwidth(self): ''' Per-layer stats: DRAM bandwidth. ''' stats = self.dtfl.perlayer_stats('dram_bandwidth') self.assertEqual(len(stats), len(self.dtfl)) - self.assertAlmostEqual(stats['c1'], (7 + 8 + 9) / 2.) - self.assertAlmostEqual(stats['p1'], (.7 + .8 + .9) / 0.05) - self.assertAlmostEqual(stats['p2'], (.7 + .8 + .9) / 0.05) + self.assertAlmostEqual(stats['c1'], (7. + 8. + 9.) / 200) + self.assertAlmostEqual(stats['p1'], (.7 + .8 + .9) / 5) + self.assertAlmostEqual(stats['p2'], (.7 + .8 + .9) / 5) def test_stats_not_supported(self): ''' Per-layer stats: not supported. 
''' diff --git a/nn_dataflow/tests/unit_test/test_node_region.py b/nn_dataflow/tests/unit_test/test_node_region.py index 73e026d..fa88181 100644 --- a/nn_dataflow/tests/unit_test/test_node_region.py +++ b/nn_dataflow/tests/unit_test/test_node_region.py @@ -25,10 +25,66 @@ def test_valid_args(self): ''' Valid arguments. ''' nr = NodeRegion(dim=PhyDim2(4, 4), origin=PhyDim2(1, 3), - type=NodeRegion.PROC) + type=NodeRegion.PROC, + wtot=2, + wbeg=-1) self.assertTupleEqual(nr.dim, (4, 4), 'dim') self.assertTupleEqual(nr.origin, (1, 3), 'origin') self.assertEqual(nr.type, NodeRegion.PROC, 'type') + self.assertEqual(nr.wtot, 2, 'wtot') + self.assertEqual(nr.wbeg, -1, 'wbeg') + + def test_default_wtot_wbeg(self): + ''' Default wtot and wbeg. ''' + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC) + self.assertEqual(nr.wtot, 8) + self.assertEqual(nr.wbeg, 8) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=6) + self.assertEqual(nr.wtot, 6) + self.assertEqual(nr.wbeg, 6) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wbeg=-5) + self.assertEqual(nr.wtot, 8) + self.assertEqual(nr.wbeg, -5) + + def test_args_kwargs(self): + ''' Different ways to give args and kwargs. ''' + dim = PhyDim2(4, 8) + origin = PhyDim2(1, 3) + dist = PhyDim2(1, 1) + type_ = NodeRegion.PROC + wtot = 6 + wbeg = 5 + + nr0 = NodeRegion(dim=dim, origin=origin, dist=dist, type=type_, + wtot=wtot, wbeg=wbeg) + + nr = NodeRegion(dim, origin, dist, type_, wtot, wbeg) + self.assertTupleEqual(nr, nr0) + + nr = NodeRegion(dim, origin, wbeg=wbeg, wtot=wtot, type=type_, + dist=dist) + self.assertTupleEqual(nr, nr0) + + nr = NodeRegion(dim, origin, dist, type=type_, wtot=wtot, wbeg=wbeg) + self.assertTupleEqual(nr, nr0) + + def test_larger_wtot(self): + ''' wtot > dim.w is valid. ''' + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=20) + self.assertEqual(nr.wtot, 20) def test_invalid_dim(self): ''' Invalid dim. ''' @@ -59,6 +115,45 @@ def test_invalid_type(self): origin=PhyDim2(1, 3), type=NodeRegion.NUM) + def test_invalid_wtot_type(self): + ''' Invalid wtot type. ''' + with self.assertRaisesRegexp(TypeError, 'NodeRegion: .*wtot.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=1.3) + + def test_invalid_wbeg_type(self): + ''' Invalid wbeg type. ''' + with self.assertRaisesRegexp(TypeError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wbeg=1.3) + + def test_invalid_wbeg(self): + ''' Invalid wbeg. ''' + with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=5) + + with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=-5) + + with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=0) + def test_contains_node(self): ''' Whether contains node. ''' nr = NodeRegion(dim=PhyDim2(4, 4), @@ -138,3 +233,165 @@ def test_rel2abs_not_in(self): with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*not in.*'): _ = nr.rel2abs(PhyDim2(0, 4)) + def test_rel2abs_folded(self): + ''' Get rel2abs with folded. 
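Before the folded rel2abs cases below, here are the defaulting and validity rules from the tests above as a sketch (assumed semantics: wtot defaults to dim.w, wbeg defaults to wtot, and a valid wbeg satisfies 1 <= |wbeg| <= wtot):

    def fill_wtot_wbeg(dim_w, wtot=None, wbeg=None):
        wtot = dim_w if wtot is None else wtot
        wbeg = wtot if wbeg is None else wbeg
        if not 1 <= abs(wbeg) <= wtot:
            raise ValueError('NodeRegion: invalid wbeg.')
        return wtot, wbeg

    assert fill_wtot_wbeg(8) == (8, 8)          # both default
    assert fill_wtot_wbeg(8, wtot=6) == (6, 6)  # wbeg follows wtot
    assert fill_wtot_wbeg(8, wbeg=-5) == (8, -5)
    # wbeg=5 or wbeg=-5 with wtot=4, or wbeg=0, would raise, matching the
    # invalid-wbeg tests above.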
''' + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=3) + # 67 + # 543 + # 012 + + self.assertTupleEqual(nr.rel2abs(PhyDim2(1, 2)), (1 + 1, 5)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 3)), (5 + 2, 5)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(0, 5)), (5 + 0, 3)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(3, 7)), (9 + 3, 4)) + + self.assertSetEqual(set(nr.rel2abs(PhyDim2(h, w)) + for h in range(nr.dim.h) + for w in range(nr.dim.w)), + set(nr.iter_node())) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=3, + wbeg=1) + # 7 + # 456 + # 321 + # 0 + + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 0)), (1 + 2, 3)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(1, 2)), (5 + 1, 2)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 3)), (5 + 2, 1)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(0, 5)), (9 + 0, 2)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(3, 7)), (13 + 3, 3)) + + self.assertSetEqual(set(nr.rel2abs(PhyDim2(h, w)) + for h in range(nr.dim.h) + for w in range(nr.dim.w)), + set(nr.iter_node())) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=-2) + # 76 + # 2345 + # 10 + + self.assertTupleEqual(nr.rel2abs(PhyDim2(1, 1)), (1 + 1, 2)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 3)), (5 + 2, 3)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(0, 5)), (5 + 0, 5)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(3, 7)), (9 + 3, 4)) + + self.assertSetEqual(set(nr.rel2abs(PhyDim2(h, w)) + for h in range(nr.dim.h) + for w in range(nr.dim.w)), + set(nr.iter_node())) + + def test_allocate(self): + ''' allocate. ''' + + nr = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC) + + def _common_check(length): + self.assertEqual(len(subregions), length) + aggr_node_set = set() + for sr in subregions: + self.assertTupleEqual(sr.dist, nr.dist) + self.assertEqual(sr.type, NodeRegion.PROC) + self.assertEqual(sr.wtot, 4) + for c in sr.iter_node(): + self.assertTrue(nr.contains_node(c)) + self.assertTrue(aggr_node_set.isdisjoint(sr.iter_node())) + aggr_node_set.update(sr.iter_node()) + self.assertSetEqual(set(nr.iter_node()), aggr_node_set) + + request_list = [4, 4, 4, 4, 4] + self.assertEqual(len(nr.allocate(request_list)), 0) + + request_list = [2, 3, 3, 2, 4, 2] + subregions = nr.allocate(request_list) + # 5544 + # 3344 + # 2221 + # 0011 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (1, 2)) + self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 2) + self.assertTupleEqual(subregions[1].dim, (1, 3)) + self.assertTupleEqual(subregions[1].origin, (1, 5)) + self.assertEqual(subregions[1].wbeg, 2) + self.assertTupleEqual(subregions[2].dim, (1, 3)) + self.assertTupleEqual(subregions[2].origin, (2, 5)) + self.assertEqual(subregions[2].wbeg, -3) + self.assertTupleEqual(subregions[3].dim, (1, 2)) + self.assertTupleEqual(subregions[3].origin, (3, 3)) + self.assertEqual(subregions[3].wbeg, 2) + self.assertTupleEqual(subregions[4].dim, (1, 4)) + self.assertTupleEqual(subregions[4].origin, (3, 5)) + self.assertEqual(subregions[4].wbeg, 2) + self.assertTupleEqual(subregions[5].dim, (1, 2)) + self.assertTupleEqual(subregions[5].origin, (4, 4)) + self.assertEqual(subregions[5].wbeg, -2) + + request_list = [5, 11] + subregions = nr.allocate(request_list) + # 1111 + # 1111 + # 1110 + # 0000 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (1, 5)) + 
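The first block of folded assertions above (wtot=3, default wbeg) follows a plain zig-zag rule: relative w coordinates are chopped into rows of width wtot, each row is stacked dim.h further down, and every other row reverses direction. A self-contained sketch of that rule for the default-wbeg case only (the wbeg variants shift and flip the first row; the allocate checks continue below):

    def rel2abs_folded(h, w, origin, dim_h, wtot):
        # Which fold row the relative w falls in, and the offset inside it.
        row, pos = divmod(w, wtot)
        abs_h = origin[0] + row * dim_h + h      # each fold row is dim.h tall
        if row % 2 == 0:                         # even rows run left-to-right
            abs_w = origin[1] + pos
        else:                                    # odd rows run right-to-left
            abs_w = origin[1] + wtot - 1 - pos
        return abs_h, abs_w

    # dim=(4, 8), origin=(1, 3), wtot=3, matching the first picture above.
    assert rel2abs_folded(1, 2, (1, 3), 4, 3) == (2, 5)
    assert rel2abs_folded(2, 3, (1, 3), 4, 3) == (7, 5)
    assert rel2abs_folded(0, 5, (1, 3), 4, 3) == (5, 3)
    assert rel2abs_folded(3, 7, (1, 3), 4, 3) == (12, 4)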
self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 4) + self.assertTupleEqual(subregions[1].dim, (1, 11)) + self.assertTupleEqual(subregions[1].origin, (2, 5)) + self.assertEqual(subregions[1].wbeg, -3) + + request_list = [2, 4, 4, 2, 4] + subregions = nr.allocate(request_list) + # 4432 + # 4432 + # 0112 + # 0112 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (2, 1)) + self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 1) + self.assertTupleEqual(subregions[1].dim, (2, 2)) + self.assertTupleEqual(subregions[1].origin, (1, 4)) + self.assertEqual(subregions[1].wbeg, 2) + self.assertTupleEqual(subregions[2].dim, (2, 2)) + self.assertTupleEqual(subregions[2].origin, (1, 6)) + self.assertEqual(subregions[2].wbeg, 1) + self.assertTupleEqual(subregions[3].dim, (2, 1)) + self.assertTupleEqual(subregions[3].origin, (3, 5)) + self.assertEqual(subregions[3].wbeg, -1) + self.assertTupleEqual(subregions[4].dim, (2, 2)) + self.assertTupleEqual(subregions[4].origin, (3, 4)) + self.assertEqual(subregions[4].wbeg, -2) + + nr = nr._replace(dist=PhyDim2(2, 1)) + + request_list = [10, 6] + subregions = nr.allocate(request_list) + # 1110 + # 1110 + # 0000 + # 0000 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (2, 5)) + self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 4) + self.assertTupleEqual(subregions[1].dim, (2, 3)) + self.assertTupleEqual(subregions[1].origin, (5, 5)) + self.assertEqual(subregions[1].wbeg, -3) + diff --git a/nn_dataflow/tests/unit_test/test_option.py b/nn_dataflow/tests/unit_test/test_option.py index f713f95..3c6627c 100644 --- a/nn_dataflow/tests/unit_test/test_option.py +++ b/nn_dataflow/tests/unit_test/test_option.py @@ -24,9 +24,12 @@ def test_valid_kwargs(self): ''' Valid keyword arguments. ''' options = Option(sw_gbuf_bypass=(False, False, False), sw_solve_loopblocking=False, + hw_access_forwarding=False, + hw_gbuf_sharing=False, partition_hybrid=True, partition_batch=False, partition_ifmaps=False, + partition_interlayer=False, opt_goal='ed', ntops=10, nprocesses=16, @@ -36,12 +39,18 @@ def test_valid_kwargs(self): 'sw_gbuf_bypass') self.assertEqual(options.sw_solve_loopblocking, False, 'sw_solve_loopblocking') + self.assertEqual(options.hw_access_forwarding, False, + 'hw_access_forwarding') + self.assertEqual(options.hw_gbuf_sharing, False, + 'hw_gbuf_sharing') self.assertEqual(options.partition_hybrid, True, 'partition_hybrid') self.assertEqual(options.partition_batch, False, 'partition_batch') self.assertEqual(options.partition_ifmaps, False, 'partition_ifmaps') + self.assertEqual(options.partition_interlayer, False, + 'partition_interlayer') self.assertEqual(options.opt_goal, 'ed', 'opt_goal') self.assertEqual(options.ntops, 10, 'ntops') self.assertEqual(options.nprocesses, 16, 'nprocesses') @@ -93,6 +102,27 @@ def test_invalid_swgbyp_len(self): with self.assertRaisesRegexp(ValueError, 'Option: .*sw_gbuf_bypass.*'): _ = Option(sw_gbuf_bypass=(False, False)) + def test_invalid_swsol_hwbufshr(self): + ''' Invalid sw_solve_loopblocking and hw_gbuf_sharing comb. ''' + with self.assertRaisesRegexp(ValueError, + 'Option: .*sw_solve_loopblocking.*' + 'hw_gbuf_sharing.*'): + _ = Option(sw_solve_loopblocking=True, hw_gbuf_sharing=True) + + def test_invalid_hwaccfwd_hwbufshr(self): + ''' Invalid hw_access_forwarding and hw_gbuf_sharing comb. 
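The Option mutual-exclusion checks above and below reduce to three pairwise conflicts. A sketch of the equivalent validation (names from this patch; the logic is inferred from the expected errors, not copied from the library):

    def check_option_combos(sw_solve_loopblocking=False,
                            hw_access_forwarding=False,
                            hw_gbuf_sharing=False,
                            hw_gbuf_save_writeback=False):
        if sw_solve_loopblocking and hw_gbuf_sharing:
            raise ValueError('Option: sw_solve_loopblocking conflicts with '
                             'hw_gbuf_sharing.')
        if hw_access_forwarding and hw_gbuf_sharing:
            raise ValueError('Option: hw_access_forwarding conflicts with '
                             'hw_gbuf_sharing.')
        if sw_solve_loopblocking and hw_gbuf_save_writeback:
            raise ValueError('Option: sw_solve_loopblocking conflicts with '
                             'hw_gbuf_save_writeback.')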
''' + with self.assertRaisesRegexp(ValueError, + 'Option: .*hw_access_forwarding.*' + 'hw_gbuf_sharing.*'): + _ = Option(hw_access_forwarding=True, hw_gbuf_sharing=True) + + def test_invalid_swsol_hwswb(self): + ''' Invalid sw_solve_loopblocking and hw_gbuf_save_writeback comb. ''' + with self.assertRaisesRegexp(ValueError, + 'Option: .*sw_solve_loopblocking.*' + 'hw_gbuf_save_writeback.*'): + _ = Option(sw_solve_loopblocking=True, hw_gbuf_save_writeback=True) + def test_invalid_part_hybrid_ifmaps(self): ''' Invalid partition_hybrid and partition_ifmaps comb. ''' with self.assertRaisesRegexp(ValueError, @@ -100,6 +130,26 @@ def test_invalid_part_hybrid_ifmaps(self): 'partition_hybrid.*'): _ = Option(partition_hybrid=False, partition_ifmaps=True) + def test_invalid_time_ovhd(self): + ''' Invalid layer_pipeline_time_ovhd. ''' + with self.assertRaisesRegexp(KeyError, + 'Option: .*layer_pipeline_time_ovhd.*'): + _ = Option(layer_pipeline_time_ovhd=None) + + with self.assertRaisesRegexp(ValueError, + 'Option: .*layer_pipeline_time_ovhd.*'): + _ = Option(layer_pipeline_time_ovhd=-1) + + def test_invalid_max_degree(self): + ''' Invalid layer_pipeline_max_degree. ''' + with self.assertRaisesRegexp(KeyError, + 'Option: .*layer_pipeline_max_degree.*'): + _ = Option(layer_pipeline_max_degree=None) + + with self.assertRaisesRegexp(ValueError, + 'Option: .*layer_pipeline_max_degree.*'): + _ = Option(layer_pipeline_max_degree=-1) + def test_invalid_opt_goal(self): ''' Invalid opt_goal. ''' with self.assertRaisesRegexp(ValueError, 'Option: .*opt_goal.*'): diff --git a/nn_dataflow/tests/unit_test/test_partition_scheme.py b/nn_dataflow/tests/unit_test/test_partition_scheme.py index c30f6f0..b282a19 100644 --- a/nn_dataflow/tests/unit_test/test_partition_scheme.py +++ b/nn_dataflow/tests/unit_test/test_partition_scheme.py @@ -15,6 +15,7 @@ import collections import itertools +import math import unittest from nn_dataflow.core import FmapPosition, FmapRange @@ -248,6 +249,31 @@ def data_loops(): with self.assertRaisesRegexp(TypeError, 'PartitionScheme: .*layer.*'): _ = self.ps1.part_layer(layer, self.ps1.size(pe.BATP)) + def test_part_neighbor_dist(self): + ''' Get part_neighbor_dist. ''' + for ps, nr in zip([self.ps1, self.ps2], [self.nr1, self.nr2]): + + for idx in range(pe.NUM): + nbr_dist = ps.part_neighbor_dist(nr, ps.order[idx]) + dim_below = ps.dim(*ps.order[idx + 1:]) if idx + 1 < pe.NUM \ + else PhyDim2(1, 1) + dim_cur = ps.dim(ps.order[idx]) + + if dim_cur.h == 1: + self.assertTrue(math.isinf(nbr_dist.h)) + else: + self.assertEqual(nbr_dist.h, dim_below.h) + + if dim_cur.w == 1: + self.assertTrue(math.isinf(nbr_dist.w)) + else: + self.assertEqual(nbr_dist.w, dim_below.w) + + def test_part_neighbor_dist_inv(self): + ''' Get part_neighbor_dist invalid arg. ''' + dist = self.ps1.part_neighbor_dist(self.nr1, pe.NUM) + self.assertTrue(all(math.isnan(d) for d in dist)) + def test_projection(self): ''' Get projection. 
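test_part_neighbor_dist above encodes the neighbor-stride rule: at a given partition level, neighboring partitions along a dimension sit apart by the combined extent of all lower-ordered levels, and the distance is infinite along a dimension that the level does not actually split. Restated as a sketch:

    def neighbor_dist(dim_cur, dim_below):
        # dim_cur: this level's partition size along one dimension;
        # dim_below: product of the sizes of all lower-ordered levels.
        return float('inf') if dim_cur == 1 else dim_below

    assert neighbor_dist(1, 4) == float('inf')  # unsplit level: no neighbor
    assert neighbor_dist(2, 4) == 4             # neighbors are 4 nodes apart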
''' diff --git a/nn_dataflow/tests/unit_test/test_resource.py b/nn_dataflow/tests/unit_test/test_resource.py index c1e5c90..6c2602f 100644 --- a/nn_dataflow/tests/unit_test/test_resource.py +++ b/nn_dataflow/tests/unit_test/test_resource.py @@ -45,6 +45,7 @@ def test_valid_args(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) self.assertTupleEqual(resource.proc_region.dim, (2, 2), 'proc_region') self.assertTupleEqual(resource.dram_region.dim, (2, 2), 'dram_region') @@ -53,6 +54,7 @@ def test_valid_args(self): self.assertEqual(resource.size_regf, 512, 'size_regf') self.assertEqual(resource.array_bus_width, 8, 'array_bus_width') self.assertEqual(resource.dram_bandwidth, 128, 'dram_bandwidth') + self.assertFalse(resource.no_time_mux, 'no_time_mux') def test_invalid_proc_region(self): ''' Invalid proc_region. ''' @@ -66,6 +68,7 @@ def test_invalid_proc_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_proc_region_dram(self): @@ -82,6 +85,7 @@ def test_invalid_proc_region_dram(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dram_region(self): @@ -96,6 +100,7 @@ def test_invalid_dram_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dram_region_proc(self): @@ -112,6 +117,7 @@ def test_invalid_dram_region_proc(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_data_region(self): @@ -126,6 +132,7 @@ def test_invalid_data_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) with self.assertRaisesRegexp(TypeError, 'Resource: .*dst_data_.*'): _ = Resource(proc_region=self.proc_region, @@ -137,6 +144,7 @@ def test_invalid_data_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dim_array(self): @@ -151,6 +159,7 @@ def test_invalid_dim_array(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_size_gbuf(self): @@ -165,6 +174,7 @@ def test_invalid_size_gbuf(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_size_regf(self): @@ -179,6 +189,7 @@ def test_invalid_size_regf(self): size_regf=(512,), array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_array_bus_width(self): @@ -194,6 +205,7 @@ def test_invalid_array_bus_width(self): size_regf=512, array_bus_width=1.2, dram_bandwidth=128, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*array_bus_width.*'): @@ -206,6 +218,7 @@ def test_invalid_array_bus_width(self): size_regf=512, array_bus_width=-2, dram_bandwidth=128, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*array_bus_width.*'): @@ -218,6 +231,7 @@ def test_invalid_array_bus_width(self): size_regf=512, array_bus_width=0, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dram_bandwidth(self): @@ -233,6 +247,7 @@ def test_invalid_dram_bandwidth(self): size_regf=512, array_bus_width=8, dram_bandwidth=None, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*dram_bandwidth.*'): @@ -245,6 +260,7 @@ def test_invalid_dram_bandwidth(self): size_regf=512, array_bus_width=8, dram_bandwidth=-3, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*dram_bandwidth.*'): @@ -257,5 +273,22 @@ def test_invalid_dram_bandwidth(self): 
size_regf=512, array_bus_width=8, dram_bandwidth=0, + no_time_mux=False, + ) + + def test_invalid_no_time_mux(self): + ''' Invalid no_time_mux. ''' + with self.assertRaisesRegexp(TypeError, + 'Resource: .*no_time_mux.*'): + _ = Resource(proc_region=self.proc_region, + dram_region=self.dram_region, + src_data_region=self.src_data_region, + dst_data_region=self.dst_data_region, + dim_array=PhyDim2(16, 16), + size_gbuf=131072, + size_regf=512, + array_bus_width=8, + dram_bandwidth=128, + no_time_mux=None, ) diff --git a/nn_dataflow/tests/unit_test/test_scheduling_condition.py b/nn_dataflow/tests/unit_test/test_scheduling_condition.py index 30ea75a..e80f026 100644 --- a/nn_dataflow/tests/unit_test/test_scheduling_condition.py +++ b/nn_dataflow/tests/unit_test/test_scheduling_condition.py @@ -23,6 +23,7 @@ from nn_dataflow.core import PhyDim2 from nn_dataflow.core import Resource from nn_dataflow.core import SchedulingCondition +from nn_dataflow.core import SchedulingConstraint class TestSchedulingCondition(unittest.TestCase): ''' Tests for SchedulingCondition. ''' @@ -39,7 +40,10 @@ def setUp(self): dst_data_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), type=NodeRegion.DRAM), dim_array=PhyDim2(16, 16), size_gbuf=65536, size_regf=64, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + + self.none_cstr = SchedulingConstraint() part = PartitionScheme(order=range(pe.NUM), pdims=[(1, 1)] * pe.NUM) self.ifmap_layout = DataLayout(frngs=(FmapRange((0, 0, 0, 0), @@ -47,24 +51,59 @@ def setUp(self): regions=(self.resource.src_data_region,), parts=(part,)) + self.sched_seq = (2, 0, 0) + def test_valid_args(self): ''' Valid arguments. ''' condition = SchedulingCondition(resource=self.resource, - ifmap_layout=self.ifmap_layout) + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq) self.assertEqual(condition.resource, self.resource) + self.assertEqual(condition.constraint, self.none_cstr) self.assertEqual(condition.ifmap_layout, self.ifmap_layout) + self.assertTupleEqual(condition.sched_seq, self.sched_seq) def test_invalid_resource(self): ''' Invalid resource. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingCondition: .*resource.*'): _ = SchedulingCondition(resource=None, - ifmap_layout=self.ifmap_layout) + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq) + + def test_invalid_constraint(self): + ''' Invalid constraint. ''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingCondition: .*constraint.*'): + _ = SchedulingCondition(resource=self.resource, + constraint=None, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq) def test_invalid_ifmap_layout(self): - ''' Invalid resource. ''' + ''' Invalid ifmap_layout. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingCondition: .*ifmap_layout.*'): _ = SchedulingCondition(resource=self.resource, - ifmap_layout=None) + constraint=self.none_cstr, + ifmap_layout=None, + sched_seq=self.sched_seq) + + def test_invalid_sched_seq(self): + ''' Invalid sched_seq. 
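The sched_seq cases below check shape only. A sketch of the equivalent validation (assumed, mirroring the TypeError/ValueError split in the tests):

    def check_sched_seq(sched_seq):
        if not isinstance(sched_seq, tuple):
            raise TypeError('SchedulingCondition: sched_seq must be a tuple.')
        if len(sched_seq) != 3:
            raise ValueError('SchedulingCondition: sched_seq needs 3 indices '
                             '(segment, spatial, temporal).')

    check_sched_seq((2, 0, 0))     # fine
    # check_sched_seq([2, 0, 0])   # TypeError, as in the first case below
    # check_sched_seq((2, 0))      # ValueError, as in the second case below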
''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingCondition: .*sched_seq.*'): + _ = SchedulingCondition(resource=self.resource, + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=list(self.sched_seq)) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingCondition: .*sched_seq.*'): + _ = SchedulingCondition(resource=self.resource, + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq[:-1]) diff --git a/nn_dataflow/tests/unit_test/test_scheduling_constraint.py b/nn_dataflow/tests/unit_test/test_scheduling_constraint.py new file mode 100644 index 0000000..c401803 --- /dev/null +++ b/nn_dataflow/tests/unit_test/test_scheduling_constraint.py @@ -0,0 +1,354 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import itertools +import unittest + +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow.core import SchedulingConstraint, \ + SchedulingConstraintLayerPipeline + +from nn_dataflow import util + +class TestSchedulingConstraintFixture(unittest.TestCase): + ''' Base fixture class for SchedulingConstraint tests. ''' + + @staticmethod + def _gen_bl(t_end=9): + ''' Generator for bl_t and bl_ord. ''' + return itertools.product(itertools.product(*[range(1, t_end)] * le.NUM), + itertools.permutations(range(le.NUM))) + + +class TestSchedulingConstraint(TestSchedulingConstraintFixture): + ''' Tests for SchedulingConstraint. ''' + + def test_valid_args(self): + ''' Valid arguments. ''' + cstr = SchedulingConstraint(topbat=2, topifm=1, topofm=4) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 4) + self.assertDictEqual(cstr.update_dict, {}) + + cstr = SchedulingConstraint(topbat=2, topofm=4) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 4) + self.assertDictEqual(cstr.update_dict, {}) + + cstr = SchedulingConstraint( + topofm=4, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 4) + self.assertEqual(len(cstr.update_dict), 2) + self.assertIn('l1', cstr.update_dict) + self.assertIn('l2', cstr.update_dict) + + cstr = SchedulingConstraint() + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 0) + self.assertDictEqual(cstr.update_dict, {}) + + def test_invalid_args(self): + ''' Invalid arguments. ''' + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraint: ' + '.*positive integers.*'): + _ = SchedulingConstraint(topbat=-1, topofm=2.) + + def test_invalid_update_dict(self): + ''' Invalid argument update_dict. 
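The update_dict entries shown above deserve a standalone illustration. A minimal sketch of the assumed mechanics (a toy class, not the real SchedulingConstraint): each entry maps a previous layer's name to a callable(self, prev_result) that refines the constraint once that layer's scheduling result is known.

    class Cstr(object):
        def __init__(self, topifm=0, topbat=0, update_dict=None):
            self.topifm, self.topbat = topifm, topbat
            self.update_dict = update_dict or {}

        def update_by_prev(self, prev_results):
            # Apply each lazily updated rule with its layer's result.
            for name, func in self.update_dict.items():
                func(self, prev_results[name])

    cstr = Cstr(update_dict={
        'l1': lambda s, _: setattr(s, 'topbat', 1),
        'l2': lambda s, r: setattr(s, 'topifm', r.topifm),
    })
    cstr.update_by_prev({'l1': None, 'l2': Cstr(topifm=2)})
    assert (cstr.topbat, cstr.topifm) == (1, 2)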
''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + _ = SchedulingConstraint(update_dict=['l1']) + + with self.assertRaisesRegexp(TypeError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + _ = SchedulingConstraint(update_dict={'l1': 1}) + + def test_null_constraint(self): + ''' Null constraint. ''' + cstr = SchedulingConstraint() + + self.assertTrue(cstr.is_valid_top_bl((1, 1, 2), (0, 1, 2))) + self.assertTrue(cstr.is_valid_top_bl((3, 4, 5), (2, 1, 0))) + self.assertTrue(cstr.is_valid_top_bl((1, 1, 1), (1, 2, 0))) + + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + def test_is_valid_top_bl(self): + ''' Whether is_valid_top_bl. ''' + cstr = SchedulingConstraint(topbat=2, topofm=4) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.BAT] == 2 and bl_t[le.OFM] == 4) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraint(topifm=4) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.IFM] == 4) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraint() + for bl_t, bl_ord in self._gen_bl(): + self.assertTrue(cstr.is_valid_top_bl(bl_t, bl_ord)) + + def test_is_valid_part(self): + ''' Whether is_valid_part. ''' + cstr = SchedulingConstraintLayerPipeline( + topbat=2, topifm=1, topofm=4, fbifm=True, fbofm=False) + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + cstr = SchedulingConstraintLayerPipeline(topbat=2, topofm=4, fbifm=True) + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + cstr = SchedulingConstraintLayerPipeline() + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + def test_is_valid_before_update(self): + ''' is_valid_top_bl and is_valid_part called before update. ''' + cstr = SchedulingConstraint( + topofm=4, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + cstr.is_valid_top_bl([1] * le.NUM, range(le.NUM)) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + cstr.is_valid_part(PartitionScheme(order=range(pe.NUM), + pdims=[(2, 2)] * pe.NUM)) + + def test_filter_gen_ts(self): + ''' Get filter_gen_ts. 
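The base-class is_valid_top_bl checks above reduce to exact matching on the constrained top-level factors, with 0 meaning unconstrained; the base class ignores loop order. A sketch, assuming the LoopEnum order (IFM, OFM, BAT) implied by the indices used in these tests:

    def is_valid_top_bl(bl_t, topifm=0, topofm=0, topbat=0):
        tifm, tofm, tbat = bl_t
        return all(want in (0, got) for want, got in
                   ((topifm, tifm), (topofm, tofm), (topbat, tbat)))

    assert is_valid_top_bl((1, 4, 2), topbat=2, topofm=4)
    assert not is_valid_top_bl((1, 4, 3), topbat=2, topofm=4)
    assert is_valid_top_bl((9, 9, 9))  # null constraint accepts anything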
''' + gen_tifm = util.factorize(36, 3) + gen_tofm = util.factorize(20, 3) + gen_tbat = util.factorize(16, 3) + + cstr = SchedulingConstraint(topbat=2, topofm=4) + + gifm, gifm0, gen_tifm = itertools.tee(gen_tifm, 3) + gofm, gofm0, gen_tofm = itertools.tee(gen_tofm, 3) + gbat, gbat0, gen_tbat = itertools.tee(gen_tbat, 3) + fgifm, fgofm, fgbat = cstr.filter_gen_ts(gifm, gofm, gbat) + + self.assertSetEqual(set(fgifm), set(gifm0)) + set_fgofm = set(fgofm) + set_fgbat = set(fgbat) + self.assertTrue(set_fgofm.issubset(set(gofm0))) + self.assertTrue(set_fgbat.issubset(set(gbat0))) + self.assertSetEqual(set_fgofm, + set([(4,) + tpl for tpl in util.factorize(5, 2)])) + self.assertSetEqual(set_fgbat, + set([(2,) + tpl for tpl in util.factorize(8, 2)])) + + cstr = SchedulingConstraint(topifm=4) + + gifm, gifm0, gen_tifm = itertools.tee(gen_tifm, 3) + gofm, gofm0, gen_tofm = itertools.tee(gen_tofm, 3) + gbat, gbat0, gen_tbat = itertools.tee(gen_tbat, 3) + fgifm, fgofm, fgbat = cstr.filter_gen_ts(gifm, gofm, gbat) + + self.assertSetEqual(set(fgofm), set(gofm0)) + self.assertSetEqual(set(fgbat), set(gbat0)) + set_fgifm = set(fgifm) + self.assertTrue(set_fgifm.issubset(set(gifm0))) + self.assertSetEqual(set_fgifm, + set([(4,) + tpl for tpl in util.factorize(9, 2)])) + + cstr = SchedulingConstraint() + + gifm, gifm0, gen_tifm = itertools.tee(gen_tifm, 3) + gofm, gofm0, gen_tofm = itertools.tee(gen_tofm, 3) + gbat, gbat0, gen_tbat = itertools.tee(gen_tbat, 3) + fgifm, fgofm, fgbat = cstr.filter_gen_ts(gifm, gofm, gbat) + + self.assertSetEqual(set(fgifm), set(gifm0)) + self.assertSetEqual(set(fgofm), set(gofm0)) + self.assertSetEqual(set(fgbat), set(gbat0)) + + def test_update_by_prev(self): + ''' Modifier update_by_prev. ''' + cstr = SchedulingConstraint( + topofm=4, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 4) + + r = SchedulingConstraint(topifm=2) + cstr.update_by_prev({'l1': None, 'l2': r}) + + self.assertEqual(cstr.topbat, 1) + self.assertEqual(cstr.topifm, 2) + self.assertEqual(cstr.topofm, 4) + + self.assertFalse(cstr.is_valid_top_bl([1, 4, 1], range(le.NUM))) + self.assertTrue(cstr.is_valid_top_bl([2, 4, 1], range(le.NUM))) + + def test_content_hash(self): + ''' Content-based hash. ''' + cstr1 = SchedulingConstraint(topbat=2) + cstr2 = SchedulingConstraint(topbat=2) + self.assertNotEqual(id(cstr1), id(cstr2)) + self.assertEqual(hash(cstr1), hash(cstr2)) + self.assertEqual(cstr1, cstr2) + + cstr3 = SchedulingConstraint( + topbat=2, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + r = SchedulingConstraint(topifm=2) + cstr3.update_by_prev({'l1': None, 'l2': r}) + cstr4 = SchedulingConstraint(topifm=2, topbat=1) + self.assertNotEqual(id(cstr3), id(cstr4)) + self.assertEqual(hash(cstr3), hash(cstr4)) + self.assertEqual(cstr3, cstr4) + + def test_repr(self): + ''' __repr__. 
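filter_gen_ts, tested above, prunes the factorization generators so that only tuples with the required top-level factor survive, and passes unconstrained generators through untouched. A self-contained sketch (factorize here is a stand-in for util.factorize, which I am assuming yields ordered factor tuples whose product is n):

    def factorize(n, k):
        # All ordered k-tuples of positive ints whose product is n.
        if k == 1:
            yield (n,)
            return
        for f in range(1, n + 1):
            if n % f == 0:
                for rest in factorize(n // f, k - 1):
                    yield (f,) + rest

    def filter_top(gen, top):
        # top == 0 means unconstrained; otherwise the first factor must match.
        return (t for t in gen if top == 0 or t[0] == top)

    assert set(filter_top(factorize(20, 3), 4)) == \
        set((4,) + t for t in factorize(5, 2))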
''' + cstr = SchedulingConstraint(topbat=2) + self.assertIn('SchedulingConstraint(', repr(cstr)) + self.assertIn('topbat=2', repr(cstr)) + self.assertIn('topifm=0', repr(cstr)) + self.assertIn('topofm=0', repr(cstr)) + + cstr = SchedulingConstraint(update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + self.assertIn('update_dict=', repr(cstr)) + self.assertIn('l1', repr(cstr)) + self.assertIn('l2', repr(cstr)) + + +class TestSchedulingConstraintLayerPipeline(TestSchedulingConstraintFixture): + ''' Tests for SchedulingConstraintLayerPipeline. ''' + + def test_valid_args(self): + ''' Valid arguments. ''' + cstr = SchedulingConstraintLayerPipeline( + topbat=2, topifm=1, topofm=4, fbifm=True, fbofm=False) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 4) + + cstr = SchedulingConstraintLayerPipeline(topbat=2, topofm=4, fbifm=True) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 4) + + cstr = SchedulingConstraintLayerPipeline() + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 0) + + cstr = SchedulingConstraintLayerPipeline(fbifm=True, fbofm=True) + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 1) + + def test_invalid_args(self): + ''' Invalid arguments. ''' + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraintLayerPipeline: ' + '.*IFM.*'): + _ = SchedulingConstraintLayerPipeline(topifm=2, fbifm=True) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraintLayerPipeline: ' + '.*OFM.*'): + _ = SchedulingConstraintLayerPipeline(topofm=2, fbofm=True) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraintLayerPipeline: ' + '.*IFM.*OFM.*'): + _ = SchedulingConstraintLayerPipeline(topifm=2, topofm=2) + + def test_null_constraint(self): + ''' Null constraint. ''' + cstr = SchedulingConstraintLayerPipeline() + + self.assertTrue(cstr.is_valid_top_bl((1, 1, 2), (0, 1, 2))) + self.assertTrue(cstr.is_valid_top_bl((3, 4, 5), (2, 1, 0))) + self.assertTrue(cstr.is_valid_top_bl((1, 1, 1), (1, 2, 0))) + + def test_is_valid_top_bl(self): + ''' Whether is_valid_top_bl. ''' + cstr = SchedulingConstraintLayerPipeline(topbat=2, topofm=4, fbifm=True) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.BAT] == 2 and bl_t[le.IFM] == 1 + and bl_t[le.OFM] == 4 + and bl_ord[le.BAT] > bl_ord[le.OFM]) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline(topifm=4, fbofm=True) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.IFM] == 4 and bl_t[le.OFM] == 1 + and (bl_ord[le.IFM] > bl_ord[le.BAT] + or bl_t[le.BAT] == 1)) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline(topofm=4) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.OFM] == 4 + and (bl_ord[le.OFM] > bl_ord[le.BAT] + or bl_t[le.BAT] == 1) + and (bl_ord[le.OFM] > bl_ord[le.IFM] + or bl_t[le.IFM] == 1)) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline(fbifm=True) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.IFM] == 1) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline() + for bl_t, bl_ord in self._gen_bl(): + self.assertTrue(cstr.is_valid_top_bl(bl_t, bl_ord)) + + def test_repr(self): + ''' __repr__. 
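The pipeline subclass below adds loop-order requirements on top of the factor matching: a constrained loop must sit at the top of the blocking hierarchy, so it must be ordered outside the batch (or other) loops unless those loops are trivial. One concrete case, restating the first validity expression from the pipeline is_valid_top_bl test below (topbat=2, topofm=4, fbifm=True), with the assumed LoopEnum order (IFM, OFM, BAT):

    IFM, OFM, BAT = 0, 1, 2

    def valid_case(bl_t, bl_ord):
        return (bl_t[BAT] == 2 and bl_t[IFM] == 1 and bl_t[OFM] == 4
                and bl_ord[BAT] > bl_ord[OFM])  # BAT loop outside OFM loop

    assert valid_case((1, 4, 2), (0, 1, 2))
    assert not valid_case((1, 4, 2), (0, 2, 1))  # OFM outside BAT: rejected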
''' + cstr = SchedulingConstraintLayerPipeline(topbat=2, fbifm=True) + self.assertIn('SchedulingConstraintLayerPipeline', repr(cstr)) + self.assertIn('topbat=2', repr(cstr)) + self.assertIn('topifm=1', repr(cstr)) + self.assertIn('topofm=0', repr(cstr)) + diff --git a/nn_dataflow/tests/unit_test/test_scheduling_result.py b/nn_dataflow/tests/unit_test/test_scheduling_result.py index 39aad90..30ae01c 100644 --- a/nn_dataflow/tests/unit_test/test_scheduling_result.py +++ b/nn_dataflow/tests/unit_test/test_scheduling_result.py @@ -44,6 +44,7 @@ def setUp(self): [30, 40, 50], [400, 500, 600], [5000, 6000, 7000]]), + ('remote_gbuf_access', [0, 0, 0]), ('total_nhops', [123, 456, 789]), ('fetch', [[1, 2, 1], [3, 4, 5]]), ]) @@ -55,38 +56,60 @@ def setUp(self): type=NodeRegion.DRAM),), parts=(part,)) + self.sched_seq = (2, 0, 0) + def test_valid_args(self): ''' Valid arguments. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertIn('ops', result.scheme) self.assertIn('total_nhops', result.scheme) self.assertEqual(result.ofmap_layout, self.ofmap_layout) + self.assertTupleEqual(result.sched_seq, self.sched_seq) def test_invalid_scheme(self): ''' Invalid scheme. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingResult: .*scheme.*'): _ = SchedulingResult(scheme={}, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) def test_invalid_ofmap_layout(self): ''' Invalid ofmap_layout. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingResult: .*ofmap_layout.*'): _ = SchedulingResult(scheme=self.scheme, - ofmap_layout=None) + ofmap_layout=None, + sched_seq=self.sched_seq) + + def test_invalid_sched_seq(self): + ''' Invalid sched_seq. ''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingResult: .*sched_seq.*'): + _ = SchedulingResult(scheme=self.scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=list(self.sched_seq)) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingResult: .*sched_seq.*'): + _ = SchedulingResult(scheme=self.scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq[:-1]) def test_total_cost(self): ''' Accessor total_cost. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_cost, 1.234 + 9.876) def test_total_time(self): ''' Accessor total_time. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_time, 123.4) self.assertGreaterEqual(result.total_time, result.total_node_time) @@ -95,55 +118,74 @@ def test_total_time(self): def test_total_node_time(self): ''' Accessor total_node_time. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_node_time, max(59, 40)) scheme = self.scheme scheme['bus_time'] = 100 result = SchedulingResult(scheme=scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_node_time, max(59, 100)) def test_total_dram_time(self): ''' Accessor total_dram_time. 
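The timing accessors above and just below fit together as follows (the relation is my reading of the assertions, not the library's documented contract): total_node_time is the slower of the processing time and the bus time, and total_time must dominate both the node time and the DRAM time.

    proc_time, bus_time, dram_time = 59, 100, 120
    node_time = max(proc_time, bus_time)  # 100, as asserted above
    total_time = 123.4                    # reported by the fixture's scheme
    assert total_time >= node_time and total_time >= dram_time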
''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_dram_time, 120) def test_total_proc_time(self): ''' Accessor total_proc_time. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_proc_time, 59) scheme = self.scheme scheme['bus_time'] = 100 result = SchedulingResult(scheme=scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_proc_time, 59) def test_total_ops(self): ''' Accessor total_ops. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertEqual(result.total_ops, 1234) def test_total_accesses(self): ''' Accessor total_cost. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertSequenceEqual(result.total_accesses, [9, 120, 1500, 18000]) + def test_total_accesses_rgbuf(self): + ''' Accessor total_accesses remote gbuf. ''' + scheme = self.scheme.copy() + scheme['remote_gbuf_access'] = [10, 20, 30] + result = SchedulingResult(scheme=scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) + self.assertSequenceEqual(result.total_accesses, + [9, 120 + 60, 1500, 18000]) + def test_total_noc_hops(self): ''' Accessor total_noc_hops. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertEqual(result.total_noc_hops, 1368) def test_num_nodes(self): ''' Accessor num_nodes. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertEqual(result.num_nodes, 4) diff --git a/nn_dataflow/tests/unit_test/test_util.py b/nn_dataflow/tests/unit_test/test_util.py index 392e10b..ff37455 100644 --- a/nn_dataflow/tests/unit_test/test_util.py +++ b/nn_dataflow/tests/unit_test/test_util.py @@ -338,6 +338,128 @@ def test_equal_size(self): self.assertLessEqual(max_size - min_size, 1) +class TestUtilGCD(unittest.TestCase): + ''' Tests for util.gcd. ''' + + def test_int(self): + ''' Integers. ''' + self.assertEqual(util.gcd(3, 4), 1) + self.assertEqual(util.gcd(8, 4), 4) + self.assertEqual(util.gcd(3, 9), 3) + self.assertEqual(util.gcd(15, 12), 3) + self.assertEqual(util.gcd(300, 410), 10) + + def test_multi(self): + ''' Multiple values. ''' + self.assertEqual(util.gcd(4, 8, 10), 2) + self.assertEqual(util.gcd(*range(6, 21, 3)), 3) + + def test_single(self): + ''' Single value. ''' + for v in range(1, 10): + self.assertEqual(util.gcd(v), v) + + def test_no_arg(self): + ''' No argument. ''' + with self.assertRaises(ValueError): + _ = util.gcd() + + def test_float(self): + ''' Float. ''' + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.gcd(1., 2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.gcd(1, 2.2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.gcd(1, 2, 3, 4.2) + + def test_non_positive(self): + ''' Non-positive values. 
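test_total_accesses_rgbuf above shows where remote gbuf traffic lands in the totals: accesses served by other nodes' gbufs are added to the GBUF entry of total_accesses. Checking the arithmetic with the fixture's numbers:

    gbuf_accesses = 30 + 40 + 50        # local gbuf accesses: 120
    remote_gbuf_access = [10, 20, 30]   # accesses served by remote gbufs: 60
    assert gbuf_accesses + sum(remote_gbuf_access) == 120 + 60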
''' + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(-1, 2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(1, -2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(3, 6, 9, 12, -21) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(3, 0) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(0, 3) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(0, 5, 10, 15, 20) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(5, 10, 0, 15, 20) + + +class TestUtilLCM(unittest.TestCase): + ''' Tests for util.lcm. ''' + + def test_int(self): + ''' Integers. ''' + self.assertEqual(util.lcm(3, 4), 12) + self.assertEqual(util.lcm(8, 4), 8) + self.assertEqual(util.lcm(3, 9), 9) + self.assertEqual(util.lcm(15, 12), 60) + self.assertEqual(util.lcm(300, 410), 12300) + + def test_multi(self): + ''' Multiple values. ''' + self.assertEqual(util.lcm(4, 8, 10), 40) + self.assertEqual(util.lcm(*range(6, 21, 3)), 180) + + def test_single(self): + ''' Single value. ''' + for v in range(1, 10): + self.assertEqual(util.lcm(v), v) + + def test_no_arg(self): + ''' No argument. ''' + with self.assertRaises(ValueError): + _ = util.lcm() + + def test_float(self): + ''' Float. ''' + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.lcm(1., 2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.lcm(1, 2.2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.lcm(1, 2, 3, 4.2) + + def test_non_positive(self): + ''' Non-positive values. ''' + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(-1, 2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(1, -2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(3, 6, 9, 12, -21) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(3, 0) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(0, 3) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(0, 5, 10, 15, 20) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(5, 10, 0, 15, 20) + + class TestUtilIsclose(unittest.TestCase): ''' Tests for util.isclose. ''' diff --git a/nn_dataflow/tools/nn_dataflow_search.py b/nn_dataflow/tools/nn_dataflow_search.py index 833e167..edd13ae 100644 --- a/nn_dataflow/tools/nn_dataflow_search.py +++ b/nn_dataflow/tools/nn_dataflow_search.py @@ -71,6 +71,8 @@ def stats_dict(dfsch, cost): stats['active_node_pes'] = dfsch.perlayer_stats('active_node_pes') stats['dram_bandwidth'] = dfsch.perlayer_stats('dram_bandwidth') + stats['segment_time'] = dfsch.segment_time_list() + stats['segment_dram_time'] = dfsch.segment_dram_time_list() stats['input_layout'] = dfsch.input_layout stats['ext_layout_dict'] = dfsch.ext_layout_dict stats['schedules'] = dfsch.res_dict @@ -129,7 +131,8 @@ def do_scheduling(args): size_gbuf=size_gbuf, size_regf=size_regf, array_bus_width=array_bus_width, - dram_bandwidth=dram_bandwidth) + dram_bandwidth=dram_bandwidth, + no_time_mux=False) ## Cost. 
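The hunk below threads the new CLI flags into Option. For reference, a hypothetical direct construction with the new knobs (flag values are illustrative, and this assumes the package from this patch is importable; only a subset of keyword arguments is shown since the rest keep their defaults):

    from nn_dataflow.core import Option

    options = Option(sw_solve_loopblocking=False,
                     hw_access_forwarding=True,    # --enable-access-forwarding
                     hw_gbuf_sharing=False,        # conflicts with forwarding
                     hw_gbuf_save_writeback=True,  # --enable-save-writeback
                     partition_interlayer=True,    # --interlayer-partition
                     layer_pipeline_time_ovhd=0.5,
                     layer_pipeline_max_degree=4,
                     layer_pipeline_opt=True)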
@@ -151,9 +154,16 @@ def do_scheduling(args): bypass[de.FIL] = 'f' not in args.disable_bypass options = Option(sw_gbuf_bypass=tuple(bypass), sw_solve_loopblocking=args.solve_loopblocking, + hw_access_forwarding=args.enable_access_forwarding, + hw_gbuf_sharing=args.enable_gbuf_sharing, + hw_gbuf_save_writeback=args.enable_save_writeback, partition_hybrid=args.hybrid_partition, partition_batch=args.batch_partition, partition_ifmaps=args.ifmaps_partition, + partition_interlayer=args.interlayer_partition, + layer_pipeline_time_ovhd=args.layer_pipeline_time_overhead, + layer_pipeline_max_degree=args.layer_pipeline_max_degree, + layer_pipeline_opt=not args.disable_interlayer_opt, opt_goal=args.goal.lower(), ntops=args.top, nprocesses=args.processes, @@ -249,6 +259,20 @@ def argparser(): ap.add_argument('--solve-loopblocking', action='store_true', help='Use analytical solver to choose loop blocking. ' 'Otherwise use exhaustive search.') + ap.add_argument('--enable-access-forwarding', action='store_true', + help='Each node fetches a subset of data and forwards to ' + 'other nodes.') + ap.add_argument('--enable-gbuf-sharing', action='store_true', + help='Share gbuf capacity across nodes through NoC.') + ap.add_argument('--enable-save-writeback', action='store_true', + help='Allow to save the writeback to memory for the ' + 'intermediate data between layers if able to ' + 'store the entire data set in on-chip buffers.') + ap.add_argument('--disable-interlayer-opt', + '--basic-interlayer-partition', + action='store_true', + help='Disable optimizations and only allow basic ' + 'inter-layer pipeline.') ap.add_argument('--hybrid-partition', '--hybrid-partition2d', # deprecated old name @@ -262,6 +286,20 @@ def argparser(): action='store_true', help='Allow partitioning ifmap channel dimension, which ' 'requires extra data synchronization.') + ap.add_argument('--interlayer-partition', '--inter-layer-partition', + action='store_true', + help='Allow partitioning resources across multiple layers ' + 'and process them simultaneously as an inter-layer ' + 'pipeline.') + + ap.add_argument('--layer-pipeline-time-overhead', + type=float, default=float('inf'), + help='maximum allowed execution time overhead due to ' + 'layer pipelining.') + ap.add_argument('--layer-pipeline-max-degree', + type=float, default=float('inf'), + help='maximum allowed layer pipelining degree, i.e., ' + 'number of vertices in a pipeline segment.') ap.add_argument('-g', '--goal', default='e', choices=['e', 'd', 'ed', 'E', 'D', 'ED'], diff --git a/nn_dataflow/util.py b/nn_dataflow/util.py index ccc9edc..368efe8 100644 --- a/nn_dataflow/util.py +++ b/nn_dataflow/util.py @@ -217,6 +217,48 @@ def get_ith_range(rng, idx, num): return beg, end +def gcd(*values): + ''' + Get the greatest common divisor of the given values. + ''' + if any(not isinstance(v, int) for v in values): + raise TypeError('value must be integers.') + if any(v <= 0 for v in values): + raise ValueError('arguments must be positive.') + + if not values: + raise ValueError('must give at least 1 value.') + if len(values) == 1: + return values[0] + if len(values) > 2: + return reduce(gcd, values) + + a, b = values + while b: + a, b = b, a % b + return a + + +def lcm(*values): + ''' + Get the least common multiple of the given values. 
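The variadic gcd above (and lcm just below) reduce the two-argument case over the whole argument list. The same shape in a few self-checking lines of plain Python, with functools.reduce spelled explicitly (the module uses a bare reduce):

    from functools import reduce

    def gcd2(a, b):
        # Euclid's algorithm for two positive integers.
        while b:
            a, b = b, a % b
        return a

    assert reduce(gcd2, (4, 8, 10)) == 2                               # util.gcd(4, 8, 10)
    assert reduce(lambda a, b: a * b // gcd2(a, b), (4, 8, 10)) == 40  # util.lcm(4, 8, 10)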
+ ''' + if any(not isinstance(v, int) for v in values): + raise TypeError('value must be integers.') + if any(v <= 0 for v in values): + raise ValueError('arguments must be positive.') + + if not values: + raise ValueError('must give at least 1 value.') + if len(values) == 1: + return values[0] + if len(values) > 2: + return reduce(lcm, values) + + a, b = values + return a * b // gcd(a, b) + + def isclose(vala, valb, rel_tol=1e-9, abs_tol=0.0): ''' Whether two values are close to each other. diff --git a/requirements.txt b/requirements.txt index 311d9de..8fbb362 100644 --- a/requirements.txt +++ b/requirements.txt @@ -4,3 +4,4 @@ fastcache==1.0.2 pytest==3.1.2 pytest-cov==2.5.1 pytest-xdist==1.17.1 +sympy==1.2.0 diff --git a/setup.py b/setup.py index fa36cca..0358120 100644 --- a/setup.py +++ b/setup.py @@ -54,6 +54,7 @@ def _readme(): 'pytest>=3', 'pytest-cov>=2', 'pytest-xdist>=1', + 'sympy>=1', ], entry_points={