diff --git a/CHANGELOG.md b/CHANGELOG.md
index ad14e46..f48ccba 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,7 +1,71 @@
 List of major changes and improvements
 ======================================
 
-## [Unreleased]
+## [v1.6 -- v2.0] -- 2018-03-01
+
+### Added
+
+- Hardware models.
+
+  - Access forwarding.
+
+  - Buffer sharing scheme.
+    - Use `BufShrScheme` class to represent and calculate NoC transfers.
+
+- Software models.
+
+  - Add `SchedulingConstraint` class to specify loop blocking and partitioning
+    constraints.
+    - Add lazily updated rules to allow refining constraints with previous
+      scheduling results at runtime.
+    - Add subclass `SchedulingConstraintLayerPipeline` for layer pipelining
+      constraints.
+
+  - Add `InterLayerPipeline`.
+    - Layers are organized into `PipelineSegment`s, which are simultaneously
+      mapped onto the resource both spatially and temporally.
+    - Each layer in the segment has a 3-tuple scheduling index including
+      segment index, spatial index, and temporal index.
+    - Each layer in the segment has its resource allocation and scheduling
+      constraint.
+    - Use `PipelineSegmentTiming` to capture the timing relation of layers in
+      the segment.
+    - Specify maximum allowed execution time overhead due to layer pipelining
+      in `Option`.
+    - Specify maximum pipelining degree for layer pipelining in `Option`.
+
+  - Add layer pipelining optimizations.
+    - Ofmap forwarding: alternate layer loop ordering.
+    - Ifmap forwarding: sharing the same inputs from memory to multiple
+      regions.
+    - Support model weight pinning when there is no resource time-multiplexing.
+    - Allow disabling optimizations for layer pipelining to fall back to basic
+      pipelining techniques.
+
+
+### Changed
+
+- Hardware models.
+
+  - Allow data source/destination regions in `Resource` to be non-DATA type.
+
+  - Allow `NodeRegion` to be folded along the w dimension in a zig-zag manner.
+
+- Software models.
+
+  - `LoopBlockingScheme` supports access forwarding and buffer sharing.
+
+  - `LoopBlockingScheme` supports remote node buffers as data regions (non-data
+    type data regions).
+
+  - `partition` unit number-of-hops calculation supports access forwarding and
+    buffer sharing.
+
+  - `DataLayout` supports closest-first forwarding data transfer for access
+    forwarding and buffer sharing.
+
+  - Refactor `NNDataflow` and `NNDataflowScheme` to incorporate inter-layer
+    pipelining.
 
 ## [v1.5 -- v1.6] -- 2018-01-31
 
diff --git a/README.rst b/README.rst
index bd9f990..3ce8b7a 100644
--- a/README.rst
+++ b/README.rst
@@ -9,7 +9,7 @@ Neural Network Dataflow Scheduling
 
 This Python tool allows you to explore the energy-efficient dataflow
 scheduling for neural networks (NNs), including array mapping, loop blocking and
-reordering, and parallel partitioning.
+reordering, and (coarse-grained) parallel processing within and across layers.
 
 For hardware, we assume an Eyeriss-style NN accelerator [Chen16]_, i.e., a 2D
 array of processing elements (PEs) with a local register file in each PE, and a
@@ -26,18 +26,27 @@ In software, we decouple the dataflow scheduling into three subproblems:
   convolutions by blocking and reordering the nested loops. We support
   exhaustive search over all blocking and reordering schemes [Yang16]_, and
   analytical bypass solvers [Gao17]_.
-- Partitioning, which partitions the NN computations for parallel processing.
-  We support batch partitioning, fmap partitioning, output partitioning, input
-  partitioning, and the combination between them (hybrid) [Gao17]_. We use
-  layer-wise greedy beam search.
-
-See the details in our ASPLOS'17 paper [Gao17]_.
+- Parallel processing, which partitions the NN computations across the multiple
+  tiled engines. We support both intra-layer and inter-layer parallelism. For
+  intra-layer, we support batch partitioning, fmap partitioning, output
+  partitioning, input partitioning, and the combination between them (hybrid)
+  [Gao17]_. We also explore various dataflow optimizations including access
+  forwarding and buffer sharing [Gao19]_. We use exhaustive search within each
+  layer. For inter-layer, we support spatial pipelining (inter-layer
+  pipelining) and temporal pipelining (time multiplexing without writing back
+  intermediate data) as well as their optimized scheduling [Gao19]_. We use
+  layer-wise greedy beam search across layers.
+
+See the details in our ASPLOS'17 [Gao17]_ and ASPLOS'19 [Gao19]_ papers.
 
 If you use this tool in your work, we kindly request that you reference our
 paper(s) below, and send us a citation of your work.
 
 - Gao et al., "TETRIS: Scalable and Efficient Neural Network Acceleration with
-  3D Memory", in ASPLOS, April 2017 [Gao17]_.
+  3D Memory", in ASPLOS, April 2017.
+
+- Gao et al., "TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN
+  Accelerators", in ASPLOS, April 2019.
 
 
 Install
@@ -102,6 +111,20 @@ Other options include:
   layers, and output partitioning for FC layers.
 - ``--batch-partitioning`` and ``--ifmap-partitioning``: whether the hybrid
   partitioning also explores batch and input partitioning.
+- ``--enable-access-forwarding``: access forwarding, where the nodes fetch
+  disjoint subsets of data and forward them to other nodes. See [Gao19]_.
+- ``--enable-gbuf-sharing``: buffer sharing, where the global buffer capacity
+  is shared across nodes through the NoC. See [Gao19]_.
+- ``--enable-save-writeback``: allow eliding the intermediate data writeback to
+  memory when switching between layers if it is possible to store the entire
+  data set in on-chip buffers.
+- ``--interlayer-partition``: whether to use inter-layer pipelining to
+  partition resources across multiple layers and process them simultaneously.
+- ``--layer-pipeline-time-overhead``, ``--layer-pipeline-max-degree``:
+  constrain the configuration space of inter-layer pipelining, by specifying
+  the maximum execution time overhead, or the maximum pipelining degree.
+- ``--disable-interlayer-opt``: disable optimizations and only allow basic
+  inter-layer pipelining.
 
 
 Code Structure
@@ -115,7 +138,10 @@ Code Structure
   - Array mapping: ``map_strategy``.
   - Loop blocking and reordering: ``loop_blocking``, ``loop_blocking_scheme``,
     ``loop_blocking_solver``.
-  - Partitioning: ``partition``, ``partition_scheme``.
+  - Intra-layer partitioning: ``partition``, ``partition_scheme``,
+    ``buf_shr_scheme``.
+  - Inter-layer pipelining: ``inter_layer_pipeline``,
+    ``pipeline_segment``.
   - Network and layer: ``network``, ``layer``.
 - ``nns``: example NN definitions.
 - ``tests``: unit tests.
@@ -156,6 +182,10 @@ with the Board of Trustees of Leland Stanford Junior University.
 References
 ----------
 
+.. [Gao19] Gao, Yang, Pu, Horowitz, and Kozyrakis, `TANGRAM: Optimized
+   Coarse-Grained Dataflow for Scalable NN Accelerators
+   `__, in ASPLOS. April, 2019.
+
 .. [Gao17] Gao, Pu, Yang, Horowitz, and Kozyrakis, `TETRIS: Scalable and
    Efficient Neural Network Acceleration with 3D Memory
    `__, in ASPLOS. April, 2017.
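Reviewer note: the flags above map onto `Option` fields that this diff reads elsewhere (e.g., `options.hw_gbuf_sharing` in `loop_blocking_scheme.py`, `options.layer_pipeline_time_ovhd` and `options.layer_pipeline_opt` in the scheduling code). A minimal sketch of setting them programmatically, assuming `Option` accepts these fields as keyword arguments (its definition is not part of this diff):

    # Sketch only: field names are inferred from their usages in this diff;
    # the keyword-argument constructor is an assumption.
    from nn_dataflow.core import Option

    opts = Option(hw_access_forwarding=False,    # --enable-access-forwarding
                  hw_gbuf_sharing=True,          # --enable-gbuf-sharing
                  hw_gbuf_save_writeback=True,   # --enable-save-writeback
                  partition_interlayer=True,     # --interlayer-partition
                  layer_pipeline_time_ovhd=0.1,  # --layer-pipeline-time-overhead
                  layer_pipeline_max_degree=8,   # --layer-pipeline-max-degree
                  layer_pipeline_opt=True)       # omit --disable-interlayer-opt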
diff --git a/nn_dataflow/__init__.py b/nn_dataflow/__init__.py
index 5fc5ef0..8257e4e 100644
--- a/nn_dataflow/__init__.py
+++ b/nn_dataflow/__init__.py
@@ -13,5 +13,5 @@
 program. If not, see .
 """
 
-__version__ = '1.6'
+__version__ = '2.0'
 
diff --git a/nn_dataflow/core/__init__.py b/nn_dataflow/core/__init__.py
index 0fe9784..8a8e178 100644
--- a/nn_dataflow/core/__init__.py
+++ b/nn_dataflow/core/__init__.py
@@ -20,11 +20,13 @@
 from . import loop_enum as LoopEnum
 from . import mem_hier_enum as MemHierEnum
 from . import parallel_enum as ParallelEnum
+from .buf_shr_scheme import BufShrScheme
 from .cost import Cost
 from .data_dim_loops import DataDimLoops
 from .data_layout import DataLayout
 from .fmap_range import FmapPosition, FmapRange, FmapRangeMap
 from .int_range import IntRange
+from .inter_layer_pipeline import InterLayerPipeline
 from .layer import Layer, InputLayer, ConvLayer, FCLayer, \
     LocalRegionLayer, PoolingLayer, EltwiseLayer
 from .loop_blocking_scheme import LoopBlockingScheme
@@ -36,8 +38,12 @@
 from .option import Option
 from .partition_scheme import PartitionScheme
 from .phy_dim2 import PhyDim2
+from .pipeline_segment import PipelineSegment
+from .pipeline_segment_timing import PipelineSegmentTiming
 from .resource import Resource
 from .scheduling import SchedulingCondition, SchedulingResult, Scheduling
+from .scheduling_constraint import SchedulingConstraint, \
+    SchedulingConstraintLayerPipeline
 
 from .nn_dataflow import NNDataflow
 
diff --git a/nn_dataflow/core/buf_shr_scheme.py b/nn_dataflow/core/buf_shr_scheme.py
new file mode 100644
index 0000000..d496d9b
--- /dev/null
+++ b/nn_dataflow/core/buf_shr_scheme.py
@@ -0,0 +1,364 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see .
+"""
+
+import math
+
+from . import data_category_enum as de
+from . import loop_enum as le
+from . import parallel_enum as pe
+from .. import util
+from .layer import ConvLayer
+from .phy_dim2 import PhyDim2
+
+class BufShrScheme(object):
+    '''
+    The buffer sharing scheme.
+    '''
+
+    def __init__(self, node_region, part, data_loops=None):
+        '''
+        `node_region` is the node region in which the buffer sharing takes
+        place.
+
+        `part` is the PartitionScheme instance that determines the buffer
+        sharing scheme.
+
+        `data_loops` is a DataDimLoops instance that determines the
+        relationship between DataCategoryEnum and ParallelEnum. Default is for
+        ConvLayer.
+        '''
+
+        if any(pd > nrd for pd, nrd in zip(part.dim(), node_region.dim)):
+            raise ValueError('BufShrScheme: partitioning scheme does not fit '
+                             'in the node region')
+
+        if data_loops is None:
+            data_loops = ConvLayer.data_loops()
+
+        # Get node group corresponding to each LoopEnum, and the distance
+        # between neighbors in that node group.
+        lpe_dims = [PhyDim2(1, 1)] * le.NUM
+        lpe_nbr_dists = [PhyDim2(float('nan'), float('nan'))] * le.NUM
+
+        # le.BAT corresponds to pe.OFMP and pe.BATP.
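+        # (Nodes that differ only in their OFMP/BATP position use the same
+        # filters, so the le.BAT node group derived below is the group across
+        # which the FIL data category is shared. E.g., adjacent 2x2 OFMP and
+        # 2x1 BATP partitions combine into a 4x2 le.BAT node group.)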
+        idx_ofmp = part.order.index(pe.OFMP)
+        idx_batp = part.order.index(pe.BATP)
+        dim_ofmp = part.dim(pe.OFMP)
+        dim_batp = part.dim(pe.BATP)
+        # If only one of OFMP and BATP exists, use that one.
+        if dim_ofmp.size() == 1:
+            lpe_dims[le.BAT] = dim_batp
+            lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(node_region,
+                                                            pe.BATP)
+        elif dim_batp.size() == 1:
+            lpe_dims[le.BAT] = dim_ofmp
+            lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(node_region,
+                                                            pe.OFMP)
+        else:
+            # If both exist ...
+            if abs(idx_ofmp - idx_batp) == 1:
+                # ... and are adjacent in the partitioning hierarchy, use
+                # both.
+                lpe_dims[le.BAT] = dim_batp * dim_ofmp
+                # Neighbor distance is the smaller one.
+                nbr_dist_ofmp = part.part_neighbor_dist(node_region, pe.OFMP)
+                nbr_dist_batp = part.part_neighbor_dist(node_region, pe.BATP)
+                lpe_nbr_dists[le.BAT] = PhyDim2(*[min(d1, d2) for d1, d2
+                                                  in zip(nbr_dist_ofmp,
+                                                         nbr_dist_batp)])
+            else:
+                # ... but are not adjacent, use the bottom one (with
+                # smaller distance).
+                if idx_ofmp > idx_batp:
+                    lpe_dims[le.BAT] = dim_ofmp
+                    lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(
+                        node_region, pe.OFMP)
+                else:
+                    lpe_dims[le.BAT] = dim_batp
+                    lpe_nbr_dists[le.BAT] = part.part_neighbor_dist(
+                        node_region, pe.BATP)
+
+        # le.OFM corresponds to pe.OUTP.
+        lpe_dims[le.OFM] = part.dim(pe.OUTP)
+        lpe_nbr_dists[le.OFM] = part.part_neighbor_dist(node_region, pe.OUTP)
+
+        # le.IFM corresponds to pe.INPP.
+        lpe_dims[le.IFM] = part.dim(pe.INPP)
+        lpe_nbr_dists[le.IFM] = part.part_neighbor_dist(node_region, pe.INPP)
+
+        # Dimension of the node group.
+        self.dims = []
+        # Distance between the neighbors in the node group.
+        self.nbr_dists = []
+
+        # The nodes corresponding to the LoopEnum unrelated to the data
+        # category will fetch the same data, i.e., sharing the data.
+        for dce in range(de.NUM):
+            lpe = (data_loops[dce].drop(range(le.NUM)) + [None])[0]
+            if lpe is None:
+                self.dims.append(PhyDim2(1, 1))
+                self.nbr_dists.append(PhyDim2(float('inf'), float('inf')))
+            else:
+                self.dims.append(lpe_dims[lpe])
+                self.nbr_dists.append(lpe_nbr_dists[lpe])
+
+        # Check extraordinary neighbor distance.
+        assert all(all((not math.isnan(nd)) and (not math.isinf(nd) or d == 1)
+                       for d, nd in zip(dim, nbr_dist))
+                   for dim, nbr_dist in zip(self.dims, self.nbr_dists))
+
+        self.node_region = node_region
+        self.part = part
+        self.data_loops = data_loops
+
+        # Cache for nhops_rotate_all().
+        self.nhops_cache = {}
+
+    def dim(self, dce):
+        ''' Get the buffer sharing node group dimensions. '''
+        return self.dims[dce]
+
+    def size(self, dce):
+        ''' Get the buffer sharing node group size. '''
+        return self.dims[dce].size()
+
+    def nhops_rotate_all(self, dce, subgrp_size, rotation_unit_cnt=None):
+        '''
+        Number of hops for a rotation operation of an entire round.
+
+        The number of hops is relative to the total unique data size. E.g.,
+        when the data are in N nodes and each node has 1/M data, if all the
+        data have been transferred by 1 hop, the number of hops is N / M.
+
+        The data are spread in N nodes, where N is the group size. Each node
+        holds 1/M data, where M is given by `subgrp_size`. M is rounded up to
+        a factor of N, M' >= M, and every M' nodes form a subgroup. There are
+        N//M' == N//M subgroups. If M' == M, there are no redundant data in
+        the nodes of a subgroup.
+
+        Rotation means the following operation: nodes exchange their data with
+        the minimum number of hops, until every node has seen all the data.
+
+        How to rotate:
+
+        Each subgroup rotates its data independently. A subgroup is typically
+        2D. We chain the nodes in a snaking fashion with a priority dimension.
+        E.g., if the priority dimension is H (the 1st one), then the node chain
+        is (0,0), (1,0), ..., (H-1,0), (H-1,1), (H-2,1), ..., (0,1), (0,2),
+        ..., i.e., first go along H to the end, then turn to W and go one hop
+        to the next H, then turn and go along H, etc.. The priority dimension
+        is chosen to minimize the overall rotation hops.
+
+        We store data in the chained M' nodes of a subgroup as follows, where
+        the index is the i-th 1/M chunk:
+
+        M-1, M-2, ..., 1, 0, | M-1, M-2, ..., 2M-M'
+
+        The first M nodes circulate their data in a loop. In addition, the
+        (M-1)-th node also sends its data to the M-th node. The last M'-M
+        nodes sequentially send data to the right side, and the last node does
+        not send data.
+
+        So in the next step:
+
+        0, M-1, ..., 2, 1, | 0, M-1, ..., 2M-M'+1
+
+        And so on until the last step:
+
+        M-2, M-3, ..., 0, M-1, | M-2, M-3, ..., 2M-M'-1
+
+        Overall, each node except for the last one sends its 1/M data to the
+        right neighbor at each of the M-1 steps. And the (M-1)-th node also
+        sends its 1/M data to the 0-th node.
+
+        Note that we do not restore the initial state after one rotation round
+        (missing one step). Even in the case of multiple rotation rounds, this
+        is OK, as the node does not care about which piece of shared data it
+        starts with, as long as each node sees all data at the end.
+
+        Typically rotation ends after rotating M - 1 node buffers, i.e.,
+        skipping 1 step. When a rotation unit occupies more than one node
+        buffer, i.e., the rotation unit count is less than M, the rotation
+        ends earlier, when the last rotation unit hits the beginning of the
+        first node buffer. E.g., for M = 4 and unit count 3, the last unit
+        initially starts at 2/3 of the 3rd node, so we only rotate 2 + 2/3 =
+        8/3 node buffers, i.e., skipping 4 - 8/3 = 4/3 steps.
+
+        If the rotation unit count is not given (None), assume it is no less
+        than M, i.e., equal to M.
+        '''
+
+        # Check cache.
+        cache_key = (dce, subgrp_size, rotation_unit_cnt)
+        res = self.nhops_cache.get(cache_key, None)
+        if res is not None:
+            return res
+
+        subgrp_dim, idx_pr = self._subgrp_dim(dce, subgrp_size)
+
+        if rotation_unit_cnt is None:
+            rotation_unit_cnt = subgrp_size
+
+        # 1. Send to right neighbor.
+        # If H < W, rotate along H dimension, i.e., go along H to the end,
+        # then turn to W and go one hop to the next H, then turn and go along
+        # H, ...
+        d_pr = subgrp_dim[idx_pr]
+        d_npr = subgrp_dim[1 - idx_pr]
+        # Per-step nhops = (H-1) * W * Dh + (W-1) * Dw
+        n_pr = (d_pr - 1) * d_npr
+        n_npr = d_npr - 1
+        nhops_nbr = self._nhops_with_neighbor_dist(
+            dce,
+            PhyDim2(*[tpl[1] for tpl
+                      in sorted([(idx_pr, n_pr), (1 - idx_pr, n_npr)])]))
+
+        # 2. (M-1)-th node loops back to the 0-th node.
+        # Position of the (M-1)-th node.
+        coord = self._coordinate(subgrp_size - 1, subgrp_dim, idx_pr)
+        # Per-step nhops = distance back to the 0-th node.
+        nhops_lpbk = self._nhops_with_neighbor_dist(dce, coord)
+
+        skipped_steps = max(1, 1. * subgrp_size / rotation_unit_cnt)
+        assert 1 <= skipped_steps <= subgrp_size
+
+        # All steps; normalize; all subgroups.
+        nhops = (nhops_nbr + nhops_lpbk) \
+                * (subgrp_size - skipped_steps) \
+                * (1. / subgrp_size) \
+                * (self.size(dce) // subgrp_size)
+        assert not math.isinf(nhops) and not math.isnan(nhops)
+
+        # Update cache.
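+        # (The result depends only on the cache key components, so per-
+        # instance memoization is safe.)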
+ assert cache_key not in self.nhops_cache + self.nhops_cache[cache_key] = nhops + + return nhops + + def nhops_wide_fetch_once(self, dce, subgrp_size, fetch_width): + ''' + Number of hops for one wide fetch operation. + + The number of hops is relative to the total unique data size. E.g., + when the data are in N nodes and each node has 1/M data, if all the + data have been transferred by 1 hop, the number of hops is N / M. + + The data in the subgroup are spread in M' nodes, where M' rounds up M, + given by `subgrp_size`, to a factor of the group size N. Each node + holds 1/M data. See the rotation function about how the data are + distributed. + + Wide fetch means the following operation: a node needs to access W/M > + 1/M data without rotation, where W is given by `fetch_width`. + + The ceil(W) nodes that will feed the data are those on the upstream + (senders) of the rotation chain to this node. + + The returned number of hops is the sum across all nodes in the group. + Since it is relative to the total unique data size, and not relative to + the fetch data size (fetch width), it is normalized by the fetch width. + The number of hops for all nodes to get (W - 1) / W data from their (W + - 1) upstream nodes is equal to the number of hops for (W - 1) rotation + steps. + ''' + if fetch_width <= 1: + return 0 + elif fetch_width > subgrp_size: + raise ValueError('BufShrScheme: fetch width is larger than ' + 'subgroup size. {} vs. {}.' + .format(fetch_width, subgrp_size)) + + nhops_rot_perstep = self.nhops_rotate_all(dce, subgrp_size) \ + / (subgrp_size - 1) + + ceil_width = math.ceil(fetch_width - 1e-6) + # Total steps = 0 + 1 + 2 + ... + (cw - 1) - (cw - 1) * (cw - w) + total_steps = (ceil_width - 1) * ceil_width / 2 \ + - (ceil_width - 1) * (ceil_width - fetch_width) + + return nhops_rot_perstep * total_steps / fetch_width + + def _subgrp_dim(self, dce, subgrp_size): + ''' + Decide the subgroup dimensions and the priority dimension index. + Priority dimension is the one along which rotation happens. + ''' + # Round up subgroup size to a factor of the group size. + true_subgrp_size = subgrp_size + size = self.size(dce) + while size % true_subgrp_size: + true_subgrp_size += 1 + if true_subgrp_size > size: + raise ValueError('BufShrScheme: subgroup is larger than group. ' + '{} vs. {}.'.format(subgrp_size, size)) + + dim = self.dim(dce) + nbr_dist = self.nbr_dists[dce] + + # The dimension with smaller/larger distance. + idx_sm = 0 if nbr_dist[0] <= nbr_dist[1] else 1 + idx_lg = 1 - idx_sm + dim_sm = dim[idx_sm] + + # The smaller-distance dimension is the priority dimension. + idx_pr = idx_sm + + tpl = [1] * 2 + + # We try to use as much as possible from the smaller-distance dimension + # to the subgroup. Figure out the maximum factor. + for f, _ in util.factorize(dim_sm, 2): + if f > tpl[idx_sm] and true_subgrp_size % f == 0: + tpl[idx_sm] = f + + tpl[idx_lg] = true_subgrp_size // tpl[idx_sm] + + subgrp_dim = PhyDim2(*tpl) + assert subgrp_dim.size() == true_subgrp_size + + return subgrp_dim, idx_pr + + @staticmethod + def _coordinate(index, dim, idx_pr): + ''' + The coordinate of a node with sequential index `index` in the 2D nodes + with dimensions `dim`. The index increases first along the priority + dimension given by `idx_pr` as the dimension index. Return a PhyDim2 + relative coordinate in the subgroup without scaling by the neighbor + distance. 
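+
+        E.g., with `dim` PhyDim2(3, 2) and `idx_pr` 0, indices 0--5 map to
+        (0,0), (1,0), (2,0), (2,1), (1,1), (0,1), i.e., down the first column
+        and back up the second.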
+ ''' + dim_pr, dim_npr = dim if idx_pr == 0 else reversed(dim) + coord_npr, coord_pr = divmod(index, dim_pr) + assert coord_npr < dim_npr and coord_pr < dim_pr + # We go backward in the odd H, i.e., snaking. + if coord_npr % 2 == 1: + coord_pr = dim_pr - 1 - coord_pr + coord = PhyDim2(coord_pr, coord_npr) if idx_pr == 0 \ + else PhyDim2(coord_npr, coord_pr) + return coord + + def _nhops_with_neighbor_dist(self, dce, coord): + ''' + Get the number of hops from (0, 0) to `coord` of the subgroup of data + category `dce`, by scaling by the neighbor distance. + ''' + dist = [c * d if c else 0 for c, d in zip(coord, self.nbr_dists[dce])] + assert not any(math.isinf(d) or math.isnan(d) for d in dist) + return PhyDim2(*dist).hop_dist(PhyDim2(0, 0)) + + def __repr__(self): + return '{}({})'.format( + self.__class__.__name__, + ', '.join([ + 'part={}'.format(repr(self.part)), + 'data_loops={}'.format(repr(self.data_loops))])) + diff --git a/nn_dataflow/core/data_layout.py b/nn_dataflow/core/data_layout.py index 1fb8293..3832705 100644 --- a/nn_dataflow/core/data_layout.py +++ b/nn_dataflow/core/data_layout.py @@ -14,6 +14,7 @@ """ from collections import namedtuple +import itertools from .fmap_range import FmapPosition, FmapRange, FmapRangeMap from .node_region import NodeRegion @@ -84,12 +85,22 @@ def fmap_range_map(self): return frmap - def nhops_to(self, fmap_range, *dest_list): + def nhops_to(self, fmap_range, *dest_list, **kwargs): ''' Get the total number of hops to transfer the FmapRange `fmap_range` to destinations `dest_list` given as a list of absolute coordinates. + + If `forwarding` is True, the data can be forwarded between destinations + rather than all from the source. ''' - nhops = 0 + forwarding = kwargs.pop('forwarding', False) + if kwargs: + raise ValueError('DataLayout: method nhops_to() got an unexpected ' + 'keyword argument: {}.' + .format(kwargs.popitem()[0])) + + # The number of hops to transfer data to each destination individually. + nhops_list = [0] * len(dest_list) for frng, region, part in zip(self.frngs, self.regions, self.parts): @@ -102,8 +113,31 @@ def nhops_to(self, fmap_range, *dest_list): pfrng = part.fmap_range(frng, pidx) size = fmap_range.overlap_size(pfrng) - hop_dist_list = [d.hop_dist(psrc) for d in dest_list] - nhops += size * sum(hop_dist_list) + nhops_list = [n + size * d.hop_dist(psrc) + for n, d in zip(nhops_list, dest_list)] + + if forwarding: + # The number of hops to the first node and its coordinate. + nhops, coord = min(zip(nhops_list, dest_list)) + + # Size of all data. + total_size = self.complete_fmap_range().overlap_size(fmap_range) + + # Data can be forwarded from all sources to any destination. + src_set = {coord} + dst_set = set(dest_list) - src_set + + while dst_set: + # Each forward step, get the min-distance pair of source and + # destination. + src, dst = min(itertools.product(src_set, dst_set), + key=lambda (s, d): d.hop_dist(s)) + dst_set.remove(dst) + src_set.add(dst) + nhops += total_size * dst.hop_dist(src) + + else: + nhops = sum(nhops_list) return nhops diff --git a/nn_dataflow/core/inter_layer_pipeline.py b/nn_dataflow/core/inter_layer_pipeline.py new file mode 100644 index 0000000..2281ddb --- /dev/null +++ b/nn_dataflow/core/inter_layer_pipeline.py @@ -0,0 +1,356 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. 
+ +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import itertools + +from .layer import ConvLayer +from .network import Network +from .pipeline_segment import PipelineSegment +from .resource import Resource + +class InterLayerPipeline(object): + ''' + Inter-layer pipeline. + ''' + + def __init__(self, network, batch_size, resource, max_util_drop=0.05): + if not isinstance(network, Network): + raise TypeError('InterLayerPipeline: network must be ' + 'a Network instance.') + if not isinstance(resource, Resource): + raise TypeError('InterLayerPipeline: resource must be ' + 'a Resource instance.') + if not 0 <= max_util_drop <= 1: + raise ValueError('InterLayerPipeline: max_util_drop must be ' + 'between [0, 1].') + + self.network = network + self.batch_size = batch_size + self.resource = resource + self.max_util_drop = max_util_drop + + self._calc_sched_dag() + + # Vertices starting from which we have generated the segments. + self.seg_vertex_done = set() + + def ordered_layer_list(self): + ''' + Get a list of the layers in their topological order in the scheduling + DAG. + ''' + return list(sum(self.dag_vertex_list, tuple())) + + def gen_segment(self, options): + ''' + Generate all valid inter-layer pipelining segments. + ''' + + kwargs = {'network': self.network, + 'batch_size': self.batch_size, + 'resource': self.resource, + 'max_util_drop': self.max_util_drop, + 'with_opt': options.layer_pipeline_opt, + } + + # No pipelining, each layer sequentially occupies the whole resource. + for layer in self.network: + seg = ((layer,),) + segment = PipelineSegment(seg, **kwargs) + assert segment.valid + yield segment + + # Pipelining. + for vseg in self._gen_vseg(): + + if len(vseg) > options.layer_pipeline_max_degree: + continue + + if len(vseg) == 1 and len(self.dag_vertex_list[vseg[0]]) == 1: + # An individual layer, already returned in no-pipelining case. + continue + + # Use set to eliminate duplicates. + seg_cands = set() + + if options.partition_interlayer: + # Spatial pipelining. + seg = tuple(self.dag_vertex_list[vidx] for vidx in vseg) + seg_cands.add(seg) + + if options.hw_gbuf_save_writeback: + # Temporal pipelining. + # Reduce the spatial dimension. + seg = (tuple(itertools.chain.from_iterable( + self.dag_vertex_list[vidx] for vidx in vseg)),) + seg_cands.add(seg) + + # Determine segment allocation. + for seg in seg_cands: + segment = PipelineSegment(seg, **kwargs) + if segment.valid: + yield segment + + def _gen_vseg(self, vertex_idx=0, done=None): + ''' + Generate vertex segments starting from vertex `vertex_idx`. Yield a + tuple of the vertices in the segment. + + `done` is a set of vertices which have already been scheduled and the + output is already in memory. + + Rules: + + 1. If a vertex does not share any dependencies with the current + segment, i.e., none of its previous vertices is in the current segment + or among the previous vertices of the current segment, we do not add it + to the segment, because there is no benefit to co-locate them. + + 2. If a vertex has multiple previous vertices, at most one of them + can be in the same segment as this vertex, because the output data + availability timing of multiple previous vertices may not match. + + 3. 
If a vertex has multiple next vertices, either all or at most one of
+        them can be NOT in the same segment as this vertex, because only
+        including a small subset saves little data write-back to memory.
+        '''
+
+        vseg = tuple()
+
+        if not done:
+            done = set()
+            # Reset.
+            self.seg_vertex_done = set()
+
+        if self.dag_input_vertex not in done:
+            # Input layer is always in memory.
+            done.add(self.dag_input_vertex)
+
+        # The frontier is the vertex considered to be added to the current
+        # segment.
+        for frontier in range(vertex_idx, len(self.dag_vertex_list)):
+
+            # Check whether the frontier can be added to the current segment.
+
+            frontier_prevs = self.dag_prev_dict[frontier]
+
+            # Whether the frontier shares dependencies with the current
+            # segment, if the segment is not empty.
+            share_deps = not vseg or not frontier_prevs.isdisjoint(
+                set.union(set(vseg), *[self.dag_prev_dict[i] for i in vseg]))
+
+            # Whether multiple previous vertices are in the current segment.
+            multi_prevs = len(frontier_prevs.intersection(vseg)) > 1
+
+            if not share_deps or multi_prevs:
+                # Not sharing any dependencies (rule 1), or multiple previous
+                # vertices in the current segment (rule 2).
+
+                # Make sure the current segment is not empty.
+                assert vseg
+                # Do not extend the segment any more. Note that the current
+                # segment has already been yielded, as well as the recursion,
+                # in the last iteration.
+                break
+
+            # Extend the segment.
+            vseg += (frontier,)
+
+            # Check whether the segment is valid.
+
+            for idx in vseg:
+                nexts = self.dag_next_dict[idx]
+
+                # Either all of the next vertices are outside the segment, or
+                # at most one is outside (rule 3).
+                if not nexts.isdisjoint(vseg) \
+                        and len(nexts.difference(vseg)) > 1:
+                    # The segment is invalid. Need to add more vertices.
+                    break
+            else:
+                # The segment is valid.
+                yield vseg
+
+                # Skip if already done.
+                if frontier + 1 in self.seg_vertex_done:
+                    continue
+
+                # Recursion.
+                for tpl in self._gen_vseg(frontier + 1, done.union(vseg)):
+                    yield tpl
+
+        assert vertex_idx not in self.seg_vertex_done
+        self.seg_vertex_done.add(vertex_idx)
+
+    def _calc_sched_dag(self):
+        '''
+        Build the scheduling DAG of the network. We merge layers with no
+        filters into their last previous layer, so a DAG vertex can contain
+        one or more layers.
+
+        We order and index the DAG vertices in their depth-first topological
+        order. This will also be the order to schedule the layers.
+
+        Also establish two dicts for the previous and next vertices of each
+        DAG vertex.
+
+        In summary, the attributes initialized include: `dag_input_vertex`,
+        `dag_vertex_list`, `dag_vertex_dict`, `dag_prev_dict`,
+        `dag_next_dict`.
+        '''
+
+        # Vertex of the input layer.
+        self.dag_input_vertex = -1
+
+        # The DAG vertex set. Each vertex is a merged layer tuple, represented
+        # by the layer names. Use a list type to make modification easier.
+        dag_vertex_set = []
+
+        for layer_name in self.network:
+            layer = self.network[layer_name]
+
+            if isinstance(layer, ConvLayer):
+                dag_vertex_set.append((layer_name,))
+
+            else:
+                prevs = set(self.network.prevs(layer_name))
+                assert prevs
+
+                # Find a vertex to merge into: the vertex must contain exactly
+                # one previous layer, as its last layer, because a non-last
+                # previous layer will not have its data available to be used
+                # for this layer. Also, that previous layer can only have this
+                # one next layer, because its data will be overwritten by this
+                # layer locally.
+
+                # Check vertices in the reversed order.
+                for idx in reversed(range(len(dag_vertex_set))):
+                    vhead = dag_vertex_set[idx][:-1]
+                    vtail = dag_vertex_set[idx][-1]
+                    if prevs.isdisjoint(vhead) and vtail in prevs \
+                            and len(self.network.nexts(vtail)) == 1:
+                        dag_vertex_set[idx] += (layer_name,)
+                        break
+                else:
+                    # No valid vertex to merge.
+                    dag_vertex_set.append((layer_name,))
+
+        assert sum(len(v) for v in dag_vertex_set) == len(self.network)
+
+        # The DAG vertex list in the topological order.
+        self.dag_vertex_list = self._topological_order(dag_vertex_set)
+
+        # Make a dictionary from layer name to DAG vertex index.
+        self.dag_vertex_dict = {}
+
+        for vidx, v in enumerate(self.dag_vertex_list):
+            for layer_name in v:
+                assert layer_name not in self.dag_vertex_dict
+                self.dag_vertex_dict[layer_name] = vidx
+
+        # Add the input layer.
+        self.dag_vertex_dict[self.network.INPUT_LAYER_KEY] = \
+            self.dag_input_vertex
+        # Add the external layers.
+        for ext_layer in self.network.ext_layers():
+            self.dag_vertex_dict[ext_layer] = self.dag_input_vertex
+
+        # The previous and next relationship of the DAG vertices.
+        self.dag_prev_dict = dict((vidx, set()) for vidx
+                                  in range(len(self.dag_vertex_list)))
+        self.dag_next_dict = dict((vidx, set()) for vidx
+                                  in range(len(self.dag_vertex_list)))
+
+        for layer_name in self.network:
+            vidx = self.dag_vertex_dict[layer_name]
+
+            # Previous layers.
+            for p in self.network.prevs(layer_name):
+                pvidx = self.dag_vertex_dict[p] \
+                        if p and p not in self.network.ext_layers() \
+                        else self.dag_input_vertex
+                if pvidx != vidx:
+                    self.dag_prev_dict[vidx].add(pvidx)
+
+            # Next layers.
+            for n in self.network.nexts(layer_name):
+                if not n:
+                    continue
+                nvidx = self.dag_vertex_dict[n]
+                if nvidx != vidx:
+                    self.dag_next_dict[vidx].add(nvidx)
+
+        # Add next layers of the input layer.
+        self.dag_next_dict[self.dag_input_vertex] = set()
+        for vidx in self.dag_prev_dict:
+            if self.dag_input_vertex in self.dag_prev_dict[vidx]:
+                self.dag_next_dict[self.dag_input_vertex].add(vidx)
+
+    def _topological_order(self, dag_vertex_set):
+        '''
+        Order the DAG vertices in topological order using DFS.
+
+        Specifically, the backtrace order of the depth-first search is the
+        inverse of the topological order. See
+        https://en.wikipedia.org/wiki/Topological_sorting#Depth-first_search
+        '''
+
+        # The visited layers in the DFS order.
+        visited = []
+        # The unseen pending layers.
+        unseen = set(dag_vertex_set)
+        # The layers that have been seen, but not visited due to unvisited
+        # previous layers.
+        seen = set()
+
+        def _dfs(vertex):
+            assert vertex not in seen
+            if vertex in visited:
+                return
+
+            unseen.discard(vertex)
+            seen.add(vertex)
+
+            nexts = []
+            for l in vertex:
+                for n in self.network.nexts(l):
+                    if n and n not in vertex and n not in nexts:
+                        nexts.append(n)
+
+            # Visit next layers in the reversed order, so that the reversed
+            # visit order matches the original order.
+            next_vertices = []
+            for n in reversed(nexts):
+                for nv in unseen:
+                    if n in nv:
+                        next_vertices.append(nv)
+
+            for nv in next_vertices:
+                _dfs(nv)
+
+            visited.append(vertex)
+            seen.remove(vertex)
+
+        # Start from the first layers.
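+        # (`firsts()` may return multiple first layers, e.g., when several
+        # branches read the network input directly, so the DFS below can have
+        # more than one root.)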
+ start_vertices = [] + for l in reversed(self.network.firsts()): + for v in unseen: + if l in v: + start_vertices.append(v) + for v in start_vertices: + _dfs(v) + assert not unseen + assert not seen + + return list(reversed(visited)) + diff --git a/nn_dataflow/core/loop_blocking.py b/nn_dataflow/core/loop_blocking.py index 0c49da7..561d5bb 100644 --- a/nn_dataflow/core/loop_blocking.py +++ b/nn_dataflow/core/loop_blocking.py @@ -20,6 +20,7 @@ from . import loop_blocking_solver from . import loop_enum as le from .. import util +from .buf_shr_scheme import BufShrScheme from .layer import ConvLayer from .loop_blocking_scheme import LoopBlockingScheme @@ -110,7 +111,7 @@ def _loop_blocking_cmp_key(options, cost): def _gen_loopblocking_perprocess( - nested_loop_desc, resource, cost, options, + nested_loop_desc, resource, bufshr, constraint, cost, options, gen_tifm, gen_tofm, gen_tbat, gen_ords): def _gen_bl_ts(): @@ -120,9 +121,8 @@ def _gen_bl_ts(): Transpose LoopEnum-major to BL-major. ''' gen_lp_ts = [None] * le.NUM - gen_lp_ts[le.IFM] = gen_tifm - gen_lp_ts[le.OFM] = gen_tofm - gen_lp_ts[le.BAT] = gen_tbat + gen_lp_ts[le.IFM], gen_lp_ts[le.OFM], gen_lp_ts[le.BAT] = \ + constraint.filter_gen_ts(gen_tifm, gen_tofm, gen_tbat) for lp_ts in itertools.product(*gen_lp_ts): bl_ts = tuple(zip(*lp_ts)) yield bl_ts @@ -133,19 +133,27 @@ def _sweep(): for bl_ts, bl_ords in itertools.product(_gen_bl_ts(), gen_ords): if is_conv_loops and skip_conv(bl_ts, bl_ords): continue + if not constraint.is_valid_top_bl(bl_ts[0], bl_ords[0]): + continue lbs = LoopBlockingScheme( - nested_loop_desc, bl_ts, bl_ords, resource, options) + nested_loop_desc, bl_ts, bl_ords, resource, bufshr, + options) yield lbs return heapq.nsmallest(options.ntops, _sweep(), key=_loop_blocking_cmp_key(options, cost)) -def gen_loopblocking(nested_loop_desc, resource, cost, options): +def gen_loopblocking(nested_loop_desc, resource, part, constraint, cost, + options): ''' Generator for loop blocking. ''' + # Buffer sharing scheme. + bufshr = BufShrScheme(resource.proc_region, part, + nested_loop_desc.data_loops) + # Solver only works for CONV layer. if options.sw_solve_loopblocking \ and nested_loop_desc.data_loops == ConvLayer.data_loops(): @@ -153,8 +161,9 @@ def gen_loopblocking(nested_loop_desc, resource, cost, options): for bl_ts, bl_ords in gen(nested_loop_desc, resource, options): lbs = LoopBlockingScheme(nested_loop_desc, bl_ts, bl_ords, - resource, options) - yield lbs + resource, bufshr, options) + if constraint.is_valid_top_bl(lbs.bl_ts[0], lbs.bl_ords[0]): + yield lbs return ## Exhaustive search. @@ -199,8 +208,8 @@ def retrieve_result_st(): list_ords = list(gen_ords) for tifm, tofm in itertools.product(gen_tifm, gen_tofm): r = apply_func(_gen_loopblocking_perprocess, - (nested_loop_desc, resource, cost, options, - [tifm], [tofm], list_tbat, list_ords)) + (nested_loop_desc, resource, bufshr, constraint, cost, + options, [tifm], [tofm], list_tbat, list_ords)) results.append(r) for lbs in heapq.nsmallest(options.ntops, retrieve_func, diff --git a/nn_dataflow/core/loop_blocking_scheme.py b/nn_dataflow/core/loop_blocking_scheme.py index 3b5d90b..221e5f2 100644 --- a/nn_dataflow/core/loop_blocking_scheme.py +++ b/nn_dataflow/core/loop_blocking_scheme.py @@ -19,6 +19,7 @@ from . import data_category_enum as de from . import loop_enum as le from . import mem_hier_enum as me +from .node_region import NodeRegion from .. 
import util class LoopBlockingScheme(object): @@ -37,7 +38,7 @@ class BL(object): # pylint: disable=too-few-public-methods REGF = 1 NUM = 2 - def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, + def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, bufshr, options): ''' Given blocking factors `bl_ts` and the loop orders `bl_ords`, construct @@ -69,6 +70,9 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, `bl_ords` indicate the loop orders of all levels, indexed by BL. Each entry is a permutation tuple indexed by LoopEnum and gives the positions of the loops at this level. Smaller number means inner loop. + + `bufshr` is a BufShrScheme instance, indicating the buffer sharing + scheme. ''' # pylint: disable=invalid-name @@ -76,6 +80,9 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, # Loop structure. self.nld = nested_loop_desc + # Cache values. + self.total_access_gbuf = [self.nld.total_access_at_of(me.GBUF, dce) + for dce in range(de.NUM)] # Check lengths and values. assert len(bl_ts) == BL.NUM + 1, \ @@ -102,6 +109,9 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, # Need to define time for invalid scheme. self.time = float('inf') + # Buffer sharing initialization. + self._init_bufshr(bufshr, options) + # Buffer data size for one unit. self.unit_size = [tuple() for _ in range(BL.NUM)] self.unit_size[BL.GBUF] = self.nld.usize_gbuf @@ -129,6 +139,34 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, # Data fetch calculation. self._set_fetch() + # Check resource data src/dst region. + self.src_is_dram = (resource.src_data_region.type == NodeRegion.DRAM) + self.dst_is_dram = (resource.dst_data_region.type == NodeRegion.DRAM) + + # Check resource for filter pinning. + self.filter_pinned = False + if resource.no_time_mux: + if all(self.bl_ts[0][lpe] == 1 for lpe + in self.nld.data_loops[de.FIL].loops()): + self.filter_pinned = True + self.fetch[0][de.FIL] = 0 + + # If data regions are not DRAM, can only access once, no spilling. + if not self.src_is_dram: + if self.fetch[BL.GBUF][de.IFM] > 1: + self.valid = False + return + if resource.src_data_region == resource.proc_region: + # Force to store in gbuf. + self.stored_in_gbuf[de.IFM] = True + if not self.dst_is_dram: + if self.fetch[BL.GBUF][de.OFM] > 1: + self.valid = False + return + if resource.dst_data_region == resource.proc_region: + # Force to store in gbuf. + self.stored_in_gbuf[de.OFM] = True + # Now with the fetch times, we can calculate the actual # `stored_in_gbuf` values. # Only store in gbuf if having reuse. @@ -163,6 +201,20 @@ def __init__(self, nested_loop_desc, bl_ts, bl_ords, resource, self.dram_time = float('nan') self.access = [[float('nan')] * de.NUM for _ in range(me.NUM)] + # NoC access due to buffer sharing. + self.noc_access = [0.] * de.NUM + self.bufshr_rotation_access = [0.] * de.NUM + self.bufshr_wide_fetch_access = [0.] * de.NUM + + # Buffer sharing. + self._set_bufshr(resource, bufshr, options) + + # Access forwarding. + self._set_accfwd(bufshr, options) + + # Remote gbuf access. + self.remote_gbuf_access = [0.] * de.NUM + def is_valid(self): ''' Whether is a valid scheme. 
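Reviewer note: the rotation-hop arithmetic that `BufShrScheme.nhops_rotate_all()` implements, and that `LoopBlockingScheme` consumes above, can be reproduced in a few lines. The sketch below is illustrative only, under simplifying assumptions (unit neighbor distances, a single subgroup spanning the whole group, and rotation unit count equal to the subgroup size); it is not the library API:

    # Snake-chain rotation hops for an H x W subgroup with unit node spacing.
    def snake_chain(h, w):
        # Chain the nodes along priority dimension H, zig-zagging across W:
        # (0,0), (1,0), ..., (H-1,0), (H-1,1), ..., (0,1), (0,2), ...
        return [(hh if ww % 2 == 0 else h - 1 - hh, ww)
                for ww in range(w) for hh in range(h)]

    def hop_dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    def rotate_all_nhops(h, w):
        chain = snake_chain(h, w)
        # One step: every node sends its 1/M chunk to the next chain node,
        # and the last node loops back to the first.
        per_step = sum(hop_dist(chain[i], chain[i + 1])
                       for i in range(len(chain) - 1))
        per_step += hop_dist(chain[-1], chain[0])
        m = h * w
        # M - 1 steps per round, normalized to the unique data size.
        return per_step * (m - 1.) / m

    print(rotate_all_nhops(3, 2))  # ((3-1)*2 + (2-1) + 1) * 5/6 = 5.0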
@@ -179,6 +231,7 @@ def data_size(self, blvl, dce=None): size = self.unit_cnt[blvl][dce] * self.unit_size[blvl][dce] if blvl == self.BL.GBUF: size *= 1 if self.stored_in_gbuf[dce] else 0 + size = util.idivc(size, self.bufshr_subgrp_size[dce]) return size @@ -209,6 +262,18 @@ def get_top_level_fetch(self): return self.fetch[self.BL.GBUF] + def get_noc_access(self): + ''' + Get the NoC accesses of each data category. + ''' + if not self.is_valid(): + return None + + if not self.finalized_stats: + self._calc_stats() + + return self.noc_access + def get_access_cost(self, cost): ''' Get the data access cost of loop blocking. @@ -220,6 +285,7 @@ def get_access_cost(self, cost): self._calc_stats() acc_cost = sum(c * sum(a) for c, a in zip(cost.mem_hier, self.access)) + acc_cost += cost.mem_hier_at(me.GBUF) * sum(self.remote_gbuf_access) return acc_cost @@ -248,8 +314,18 @@ def gen_index(self): bl_idxgen_list.append(self._gen_index_single_level(t_x, order_x)) bl_cnt_list.append(cnt_x) + # Buffer sharing. + t_x = self.bufshr_bs_t + order_x = self.bufshr_bs_ord + cnt_x = [x // b for x, b + in zip(self._bl_tp(slice(bl_gbuf + 1, None)), + self.bufshr_bs_t)] + bl_idxgen_list.append(self._gen_index_single_level(t_x, order_x)) + bl_cnt_list.append(cnt_x) + # Between GBUF and REGF. - t_x = self.bl_ts[bl_regf] + t_x = [x // b for x, b + in zip(self.bl_ts[bl_regf], self.bufshr_bs_t)] order_x = self.bl_ords[bl_regf] cnt_x = self._bl_tp(slice(bl_regf + 1, None)) bl_idxgen_list.append(self._gen_index_single_level(t_x, order_x)) @@ -412,8 +488,27 @@ def _calc_stats(self): else self.nld.total_access_at_of(me.GBUF, dce)) * self.fetch[self.BL.GBUF][dce] * self.num_nodes + / self.accfwd_reduction[dce] for dce in range(de.NUM)] + # NoC access. + self.bufshr_rotation_access = self._calc_bufshr_rotation_access( + self.bufshr_rot_fetch) + self.bufshr_wide_fetch_access = self._calc_bufshr_widefetch_access( + self.bufshr_wide_fetch) + self.noc_access = [a1 + a2 for a1, a2 + in zip(self.bufshr_rotation_access, + self.bufshr_wide_fetch_access)] + + if not self.src_is_dram: + self.remote_gbuf_access[de.IFM] += self.access[me.DRAM][de.IFM] + self.access[me.DRAM][de.IFM] = 0 + if not self.dst_is_dram: + self.remote_gbuf_access[de.OFM] += self.access[me.DRAM][de.OFM] + self.access[me.DRAM][de.OFM] = 0 + if self.filter_pinned: + assert self.access[me.DRAM][de.FIL] == 0 + # DRAM access time. self.dram_time = int(math.ceil(sum(self.access[me.DRAM]) / self.dram_bandwidth)) @@ -484,3 +579,458 @@ def _gen_index_single_level(t_x, order_x): # in LoopEnum order. yield tuple(idx[rev_order[lpe]] for lpe in range(le.NUM)) + def _set_accfwd(self, bufshr, options): + ''' + Set access forwarding (AF). + ''' + assert self.is_valid() and not self.finalized_stats + + # DRAM access reduction due to AF. This is the average reduction. Each + # node does not need to fetch exactly 1/N data. + self.accfwd_reduction = [1] * de.NUM + + if not options.hw_access_forwarding and not options.hw_gbuf_sharing: + return + + # If n nodes share the data, each node fetches 1/n of the data. + for dce in range(de.NUM): + self.accfwd_reduction[dce] = bufshr.size(dce) + + def _init_bufshr(self, bufshr, options): + ''' + Initialize buffer sharing (BS). + + Must be called before any buffered data size check. + ''' + assert not hasattr(self, "unit_cnt") + + # Total BS nodes + self.bufshr_grp_size = tuple(bufshr.size(dce) if options.hw_gbuf_sharing + else 1 for dce in range(de.NUM)) + # BS subgroup sizes. 
+        # The initial values are conservative, i.e., assuming the maximum
+        # shared capacity across nodes.
+        # They can be decreased later, but never increased.
+        self.bufshr_subgrp_size = self.bufshr_grp_size
+
+        # Additional BS level between DRAM and GBUF, split out from GBUF level.
+        self.bufshr_bs_t = (1,) * le.NUM
+        self.bufshr_bs_ord = tuple(range(le.NUM))
+
+        # NoC fetch due to rotation.
+        # The fetch times mean the number of hops each data element
+        # (considering all replicas) traverses over the entire nested loops.
+        # The total number of hops of all data over all nodes will be this
+        # value multiplied by the size of the unique data (without replicas).
+        self.bufshr_rot_fetch = [0.] * de.NUM
+        # Rotation round counts.
+        self.bufshr_rot_round_cnt = [0] * de.NUM
+        # Rotation unit counts.
+        self.bufshr_rot_unit_cnt = [1] * de.NUM
+
+        # NoC fetch due to wide fetch. Meaning similar to `bufshr_rot_fetch`.
+        self.bufshr_wide_fetch = [0.] * de.NUM
+        # Wide fetch widths.
+        self.bufshr_wide_fetch_width = [0.] * de.NUM
+
+    def _set_bufshr(self, resource, bufshr, options):
+        '''
+        Set buffer sharing (BS).
+
+        The GBUF level loops, i.e., ti/to/tb[1], decide the order and ranges
+        of the access to data buffered in GBUF, which could spread across
+        multiple nodes.
+
+        - Seq-acc and non-seq-acc data categories.
+
+        Depending on the loop structure, some data categories, whose related
+        loops are not adjacent but are split by the other unrelated loops,
+        have a non-perfectly-sequential access pattern, as the inner
+        dimensions will be accessed multiple times (due to the middle
+        unrelated loops) before switching to the next outer dimension. We
+        call these non-seq-acc data categories.
+
+        E.g., with a CONV layer, OFM is non-seq-acc with the following loop
+        structure:
+
+        for o
+          for i
+            for b
+
+        If there are < 3 non-trivial loops, there is no non-seq data category.
+
+        - Rotation unit.
+
+        The rotation unit for each data category is defined as the shifting
+        size for each rotation step. For seq-acc data categories, the rotation
+        unit is a single REGF unit. For a non-seq-acc data category, the
+        rotation unit is the product of all inner dimension sizes that are not
+        adjacent to the outermost dimension, i.e., we only rotate after all
+        the multiple accesses to the inner dimensions are done.
+
+        - Rotation round.
+
+        Given the definition of rotation unit above, the number of rotation
+        rounds is the product of all unrelated loop blocking factors above the
+        outermost dimension loop of this data category.
+
+        E.g., with the above loops, IFM (i, b) rotates `to` rounds, FIL (i, o)
+        rotates once, and OFM (o, b) rotates only once.
+
+        - Wide fetch.
+
+        The rotation unit size does not affect the NoC access of rotation
+        rounds, but there may be remote accesses without rotation, called wide
+        fetches, if the rotation unit does not fit in a single node GBUF.
+
+        - BS schemes.
+
+        When exploring the BS schemes, we keep the total accesses to DRAM,
+        GBUF, and REGF unchanged, i.e., the previously calculated fetch times
+        are still valid. This is guaranteed by fixing some innermost loops in
+        the GBUF level.
+
+        The other un-fixed loops (we call them flexible loops) can be
+        reordered or further blocked into an additional BS level between the
+        GBUF and DRAM levels. This additional level can help reduce NoC
+        accesses by splitting the data accesses into across-node and
+        within-node parts, and using up the data within a node before
+        switching to the next node.
+ + E.g., the above loop structure can become: + + for i-across-node + for o + for i-within-node + for b + + This optimization reduces IFM (i, b) rotation rounds from `to` to 1, + and increases OFM (o, b) rotation rounds from 1 to `i-across-node`, + i.e., subgroup size of IFM; it does not change FIL (i, o) rotation + rounds. + ''' + assert self.is_valid() and not self.finalized_stats + + if not options.hw_gbuf_sharing: + assert all(gs == 1 for gs in self.bufshr_grp_size) + return + + bl = self.BL.GBUF + blp1 = bl + 1 + + # If bypass GBUF, set subgroup size to 1. + self.bufshr_subgrp_size = tuple(sgs if self.data_size(bl, dce) else 1 + for dce, sgs + in enumerate(self.bufshr_subgrp_size)) + + if all(sgs == 1 for sgs in self.bufshr_subgrp_size): + return + + ## Loop structure. + + # The blocking factors and loop order that are related to BS. + t_x = self.bl_ts[blp1] + ord_x = self.bl_ords[blp1] + + # Non-trivial loops. + nt_loops = set(lpe for lpe in range(le.NUM) if t_x[lpe] > 1) + + # To keep fetch times to all hierarchies unchanged, we fix some loops + # without further blocking them in BS. See _set_fetch(), the + # (unrelated) loops inside the innermost non-trivial dim loop does not + # contribute to the fetch times, so we fix these loops for all data + # categories. + o_inntdim_loop = max( + (self._innt_dim_loop(dce, t_x, ord_x) for dce in range(de.NUM)), + key=lambda lpe: (ord_x[lpe] if lpe is not None else -1)) + # A tuple in the order of outer to inner, i.e., sort by inverse order. + fixed_loops = tuple(sorted( + (lpe for lpe in nt_loops if ord_x[lpe] < ord_x[o_inntdim_loop]), + key=lambda lpe: ord_x[lpe], + reverse=True)) + + # The loops that can be further blocked without affecting the fetch + # times to all hierarchies. + flex_loops = nt_loops.difference(fixed_loops) + + ## Subgroup size candidates. + + def _min_subgrp_size(*dce_list): + ''' + Get the minimum BS subgroup size, but not changing the current + subgroup size. Minimize in the order of the given `dce_list`. + ''' + # No duplication. + assert len(dce_list) == len(set(dce_list)) + + # Free capacity in each node's GBUF. + free_cap = resource.size_gbuf - self.data_size(bl) + + sgs_list = list(self.bufshr_subgrp_size) + + for dce in dce_list: + # Skip no sharing case. + if sgs_list[dce] <= 1: + continue + + cur_dsz = self.data_size(bl, dce) + tot_dsz = cur_dsz * self.bufshr_subgrp_size[dce] + assert cur_dsz > 0 and tot_dsz > 0 + + # min. sgs + # s.t. tot_dsz / sgs <= free_cap + cur_dsz. + for sgs in range(sgs_list[dce], 0, -1): + if self.bufshr_grp_size[dce] % sgs != 0: + # Require subgroup size to be a factor of the group + # size. + continue + if util.idivc(tot_dsz, sgs) <= free_cap + cur_dsz: + sgs_list[dce] = sgs + else: + break + + # Reduce free capacity. + free_cap -= util.idivc(tot_dsz, sgs_list[dce]) - cur_dsz + assert free_cap >= 0 + + return tuple(sgs_list) + + # Original subgroup size. + subgrp_size_cands = [self.bufshr_subgrp_size] + # Reduce subgroup size if data can fit in fewer nodes. Consider all + # orders about which data first shrink. + subgrp_size_cands += set(_min_subgrp_size(*dce_list) for dce_list + in itertools.permutations(range(de.NUM))) + + ## Sweep all BS schemes. + + def _sweep_bufshr(): + for subgrp_size in subgrp_size_cands: + + # `flex_loops` can be further blocked in BS, while others + # cannot (set to 1). 
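+                # (E.g., for the docstring example, a flexible ifmap loop of
+                # factor 4 may be factorized into t_bs = 2 across-node
+                # iterations at the BS level times 2 within-node iterations
+                # left at the GBUF level.)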
+ t_bs_tot = [t_x[lpe] if lpe in flex_loops else 1 + for lpe in range(le.NUM)] + + for t_bs_frac in itertools.product( + *[util.factorize(t, 2) for t in t_bs_tot]): + t_bs = tuple(t[0] for t in t_bs_frac) + + loops_bs_trivial = tuple(lpe for lpe in flex_loops + if t_bs[lpe] == 1) + + for loops_bs_nontrivial, loops_bot in itertools.product( + itertools.permutations([lpe for lpe in flex_loops + if t_bs[lpe] > 1]), + itertools.permutations(flex_loops)): + + loops_bs = loops_bs_trivial + loops_bs_nontrivial + + yield subgrp_size, t_bs, loops_bs, loops_bot + + ## BS NoC fetch times. + + dim_loops = [self.nld.data_loops[dce].loops() for dce in range(de.NUM)] + + def _is_dim_loop(lpe, dce, _dim_loops=dim_loops): + return lpe in _dim_loops[dce] + + def _calc_bufshr_fetch(subgrp_size, t_bs, loops_bs, loops_bot): + ''' + Calculate the BS scheme NoC fetch times. Return rotation fetch, + wide fetch, and other statistics. + + `subgrp_size` is the BS subgroup size for each data category. + + `t_bs` is the blocking factors indexed by LoopEnum for the + additional BS level between DRAM and GBUF, i.e., above `blp1`. They + are fractorized from `t_x`. Only those in `flex_loops` can have + non-1 values. + + `loops_bs` and `loops_bot` are ordered tuples of `flex_loops` from + outer to inner, for the additional BS level and the original GBUF + level (at the bottom) respectively. + ''' + assert set(loops_bs) == set(loops_bot) == flex_loops + assert all(b <= x for b, x in zip(t_bs, t_x)) + assert all(t_bs[lpe] == 1 or lpe in flex_loops + for lpe in range(le.NUM)) + + # Make a list of tuples (LoopEnum, blocking factor)`, each + # corresponds to a non-trivial loop in the additional BS level and + # the original GBUF level, ordered from outer to inner. + lp_t_list = [] + # Additional BS level. + lp_t_list += [(lpe, t_bs[lpe]) + for lpe in loops_bs if t_bs[lpe] > 1] + # GBUF level flex loops. + lp_t_list += [(lpe, util.idivc(t_x[lpe], t_bs[lpe])) + for lpe in loops_bot if t_x[lpe] > t_bs[lpe]] + # GBUF level fixed loops. + lp_t_list += [(lpe, t_x[lpe]) for lpe in fixed_loops] + # Check. + assert all(tpl[1] > 1 for tpl in lp_t_list) + + # Total rotation rounds (over all GBUF filling). + rot_rnd_cnts = [] + # Number of rotation units. + rot_unit_cnts = [] + # Wide fetch widths. + wide_fetch_widths = [] + + # Rotation NoC fetch times. + rot_fetch = [] + # Wide fetch NoC fetch times. + wide_fetch = [] + + for dce in range(de.NUM): + + buf_fetch = self.fetch[blp1][dce] + mem_fetch = self.fetch[blp1-1][dce] + + # Index of the outermost dim loop in `lp_t_list`. None if all + # dim loops are trivial. + idx_odlp = next((i for i, tpl in enumerate(lp_t_list) + if _is_dim_loop(tpl[0], dce)), + None) + + # Rotation rounds. + rotrnds = 1 + if idx_odlp is None or subgrp_size[dce] == 1: + # No rotation. + rotrnds = 0 + elif idx_odlp is not None: + # All unrelated loop factors above the outermost dim loop. + # At DRAM level. + rotrnds *= util.prod(self.nld.data_loops[dce] + .drop(self._bl_tp(slice(blp1)))) + # At GBUF level. + rotrnds *= util.prod(tpl[1] for tpl + in itertools.islice(lp_t_list, + idx_odlp)) + assert ((buf_fetch + 1) // 2 if dce == de.OFM + else buf_fetch) % rotrnds == 0 + assert rotrnds % ((mem_fetch + 1) // 2 if dce == de.OFM + else mem_fetch) == 0 + # Optimization: after fetching data into GBUF, if the data only + # rotate a single time before being replaced, we do not need to + # store them after this single use. 
So instead we can stream + # each rotation unit to all the nodes, and replace it by the + # next rotation unit one by one. This is already supported as + # the data will be broadcast to all nodes regardless of who + # stores it (see partition). + if rotrnds == ((mem_fetch + 1) // 2 if dce == de.OFM + else mem_fetch): + rotrnds = 0 + rot_rnd_cnts.append(rotrnds) + + # Number of rotation units. + rotunits = 1 + # All dimension sizes of the outermost adjacent dim loops. + if idx_odlp is not None: + rotunits = util.prod(tpl[1] for tpl + in itertools.takewhile( + lambda tpl, dce_=dce: + _is_dim_loop(tpl[0], dce_), + itertools.islice(lp_t_list, + idx_odlp, None))) + rot_unit_cnts.append(rotunits) + + # Wide fetch width. + wf_width = 1. * subgrp_size[dce] / rotunits + wide_fetch_widths.append(wf_width) + + # Wide fetch times. + wf_per_bufacc = bufshr.nhops_wide_fetch_once( + dce, subgrp_size[dce], wf_width) + # Use REGF filling (GBUF fetch). + # The last wide fetch before rotation can be combined with the + # rotation steps. + if dce == de.OFM: + # For OFM, if we do multiple wide fetch per rotation step, + # the last one has both read and write. If there is only + # one wide fetch per rotation step, it only has write. + if buf_fetch > 2 * rotrnds - 1: + comb_wf_fetch = 2 * rotrnds + else: + assert buf_fetch == 2 * rotrnds - 1 + comb_wf_fetch = 2 * rotrnds - 1 + else: + comb_wf_fetch = rotrnds + # Since we do not rotate the last step, when wide fetch is + # non-0 (i.e., the last rotation unit is larger than one node + # buffer size), the wide fetch of the last unit has no rotation + # to combine with. + comb_wf_fetch *= 1. * (rotunits - 1) / rotunits + wf = wf_per_bufacc * (buf_fetch - comb_wf_fetch) + assert wf > -1e-4 + wide_fetch.append(wf) + + # Rotation fetch times. + rf_per_rot = bufshr.nhops_rotate_all( + dce, subgrp_size[dce], rotunits) + rf = rf_per_rot * rotrnds + rot_fetch.append(rf) + + return rot_fetch, wide_fetch, \ + rot_rnd_cnts, rot_unit_cnts, wide_fetch_widths + + ## Search for the best BS scheme. + + def _key_func(tuple_): + rot_fetch, wide_fetch = _calc_bufshr_fetch(*tuple_)[:2] + return sum(self._calc_bufshr_rotation_access(rot_fetch)) \ + + sum(self._calc_bufshr_widefetch_access(wide_fetch)) + subgrp_size, t_bs, loops_bs, loops_bot = \ + min(_sweep_bufshr(), key=_key_func) + + # Subgroup size. + self.bufshr_subgrp_size = subgrp_size + + # Loop blocking factors and order. + new_ord = [-1] * le.NUM + ord_idx = 0 + for lpe in reversed(loops_bot + fixed_loops): + new_ord[lpe] = ord_idx + ord_idx += 1 + for lpe in range(le.NUM): + if new_ord[lpe] < 0: + new_ord[lpe] = ord_idx + ord_idx += 1 + self.bl_ords[blp1] = tuple(new_ord) + + # Additional BS level. + new_ord_bs = [-1] * le.NUM + ord_idx = 0 + for lpe in reversed(loops_bs): + if t_bs[lpe] > 1: + new_ord_bs[lpe] = ord_idx + ord_idx += 1 + for lpe in range(le.NUM): + if new_ord_bs[lpe] < 0: + new_ord_bs[lpe] = ord_idx + ord_idx += 1 + self.bufshr_bs_t = tuple(t_bs) + self.bufshr_bs_ord = tuple(new_ord_bs) + + # Set stats. + self.bufshr_rot_fetch, self.bufshr_wide_fetch, \ + self.bufshr_rot_round_cnt, self.bufshr_rot_unit_cnt, \ + self.bufshr_wide_fetch_width = \ + _calc_bufshr_fetch(subgrp_size, t_bs, loops_bs, loops_bot) + + def _calc_bufshr_rotation_access(self, bufshr_rot_fetch): + ''' Calculate the BS rotation NoC accesses, over all nodes. ''' + # All-node access needs to multiply number of groups. 
+        return [self.total_access_gbuf[dce]
+                * bufshr_rot_fetch[dce]
+                * (self.num_nodes // self.bufshr_grp_size[dce])
+                for dce in range(de.NUM)]
+
+    def _calc_bufshr_widefetch_access(self, bufshr_wide_fetch):
+        ''' Calculate the BS wide fetch NoC accesses, over all nodes. '''
+        # All-node access needs to multiply number of groups.
+        return [self.total_access_gbuf[dce]
+                * bufshr_wide_fetch[dce]
+                * (self.num_nodes // self.bufshr_grp_size[dce])
+                for dce in range(de.NUM)]
+
diff --git a/nn_dataflow/core/nn_dataflow.py b/nn_dataflow/core/nn_dataflow.py
index 4d3ec98..d489455 100644
--- a/nn_dataflow/core/nn_dataflow.py
+++ b/nn_dataflow/core/nn_dataflow.py
@@ -13,6 +13,7 @@
 program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
 """
 
+from collections import defaultdict
 import itertools
 import sys
 
@@ -20,6 +21,7 @@
 from .cost import Cost
 from .data_layout import DataLayout
 from .fmap_range import FmapPosition, FmapRange
+from .inter_layer_pipeline import InterLayerPipeline
 from .map_strategy import MapStrategy
 from .network import Network
 from .nn_dataflow_scheme import NNDataflowScheme
@@ -63,6 +65,16 @@ def __init__(self, network, batch_size, resource, cost, map_strategy):
             layer2sched[layer] = sched
             self.layer_sched_dict[layer_name] = sched
 
+        # Inter-layer pipelining.
+        self.ilp = InterLayerPipeline(self.network, self.batch_size,
+                                      self.resource)
+        self.ordered_layer_list = self.ilp.ordered_layer_list()
+
+        # NNDataflowScheme tops.
+        # The top schemes are organized by their ending layers, and are kept
+        # extended until reaching the end of the network.
+        self.nndf_tops = {}
+
         # Default compare key function.
         self.cmp_key = lambda nndf: (nndf.total_cost, nndf.total_time)
 
@@ -78,22 +90,52 @@ def schedule_search(self, options):
         else:
             assert options.opt_goal == 'e'
 
+        # Group the segments by their ending layers.
+        segments = defaultdict(list)
+        for seg in self.ilp.gen_segment(options):
+            if seg not in segments[seg[-1][-1]]:
+                segments[seg[-1][-1]].append(seg)
+
         # Clear and reset.
-        nndf_tops = []
+        self.nndf_tops = {}
 
         # Initial input layout.
+        self.nndf_tops[None] = []
         for input_layout, ext_layout_dict in self._gen_input_layout(options):
             nndf = NNDataflowScheme(self.network, input_layout,
                                     ext_layout_dict)
-            nndf_tops.append(nndf)
+            self.nndf_tops[None].append(nndf)
 
         # Schedule layers.
-        for layer_name in self.network:
+        for layer_name in self.ordered_layer_list:
             if options.verbose:
                 sys.stderr.write('-> {}\n'.format(layer_name))
                 sys.stderr.flush()
-            nndf_tops = self._layer_schedule_search(
-                layer_name, nndf_tops, options)
+
+            # The top schemes ending with the current layer.
+            tops = []
+
+            # The segments ending with the current layer. Use them to extend
+            # the current top schemes.
+            for seg in segments[layer_name]:
+                if options.verbose:
+                    sys.stderr.write(' - {}\n'.format(seg.seg))
+                    sys.stderr.flush()
+                tops += self._segment_schedule_search(seg, options)
+
+            # Always pick and keep top n.
+            tops = sorted(tops, key=self.cmp_key)[:options.ntops]
+
+            # Add to the top list.
+            assert layer_name not in self.nndf_tops
+            self.nndf_tops[layer_name] = tops
+
+        # Final top schemes.
+        nndf_tops = self.nndf_tops.get(self.ordered_layer_list[-1], [])
+        if not nndf_tops:
+            sys.stderr.write('No valid schedule found for {}.\n'
+                             .format(self.network.net_name))
+        for nndf in nndf_tops:
+            assert len(nndf) == len(self.network)
 
         # Cache stats.
         cache_hits = 0
@@ -109,12 +151,100 @@
 
         return nndf_tops, (cache_hits, cache_misses)
 
-    def _layer_schedule_search(self, layer_name, prev_nndf_tops, options):
+    def _segment_schedule_search(self, segment, options):
+        '''
+        Schedule the given PipelineSegment `segment`.
+
+        Return new top NNDataflowScheme instances that include this segment.
+        Will NOT update the `nndf_tops` attribute.
+        '''
+        # We take the top schemes that end with the latest previous layer as
+        # the initial state.
+        first_layer_idx = self.ordered_layer_list.index(segment[0][0])
+        if first_layer_idx == 0:
+            prev_nndf_tops = self.nndf_tops[None]
+        else:
+            prev_nndf_tops = self.nndf_tops.get(
+                self.ordered_layer_list[first_layer_idx - 1], [])
+        if not prev_nndf_tops:
+            return []
+
+        # New top schemes.
+        nndf_tops = []
+
+        # Allocation.
+        allocation = segment.allocation()
+
+        # Forwarding data regions. Map a spatial index to the forwarding
+        # region.
+        fwd_data_region_dict = {}
+        for sh_list in segment.ifm_fwd_dict.values():
+            # A list of spatial indices that share the same ifmaps.
+            r = allocation[sh_list[0].sp_idx][sh_list[0].tm_idx].proc_region
+            for idx in sh_list[1:]:
+                fwd_data_region_dict[idx] = r
+        for fwd_src, fwd_dst_list in segment.ofm_fwd_dict.items():
+            # Ofmaps forwarded to neighbors.
+            r = allocation[fwd_src.sp_idx][fwd_src.tm_idx].proc_region
+            for idx in fwd_dst_list:
+                fwd_data_region_dict[idx] = r
+
+        # Max allowed time overhead for segment timing.
+        max_time_ovhd = options.layer_pipeline_time_ovhd
+
+        # Cost hint Pareto-optimal frontier.
+        frontier = set()
+
+        # Explore constraints.
+        for constraint, hints in segment.gen_constraint(max_time_ovhd):
+
+            # Filter out off-frontier constraints.
+            if any(all(h >= fh for h, fh in zip(hints, fhints))
+                   for fhints in frontier):
+                continue
+
+            # Start from the previous top schemes.
+            curr_nndf_tops = prev_nndf_tops
+
+            # Spatial scheduling.
+            for sp_idx, (ltpl, rtpl, ctpl) \
+                    in enumerate(zip(segment, allocation, constraint)):
+
+                # Temporal scheduling.
+                for tm_idx, (layer, resource, cstr) \
+                        in enumerate(zip(ltpl, rtpl, ctpl)):
+
+                    curr_nndf_tops = self._layer_schedule_search(
+                        layer, resource, cstr, sp_idx, tm_idx,
+                        fwd_data_region_dict.get((sp_idx, tm_idx)),
+                        curr_nndf_tops, options)
+
+            # Filter by time limit.
+            seg_nndf_tops = [nndf for nndf in curr_nndf_tops
+                             if all(timing.time_overhead <= max_time_ovhd
+                                    for timing in nndf.segment_timing_list)]
+
+            # Add to frontier.
+            if seg_nndf_tops:
+                frontier.add(hints)
+
+            nndf_tops += seg_nndf_tops
+
+        # Always pick and keep top n.
+        return sorted(nndf_tops, key=self.cmp_key)[:options.ntops]
+
+    def _layer_schedule_search(self, layer_name, resource, constraint,
+                               spatial_idx, temporal_idx, fwd_data_region,
+                               prev_nndf_tops, options):
         '''
         Schedule the given layer under the given previous top NNDataflowScheme
         instances in `prev_nndf_tops`.
 
-        Return new top NNDataflowScheme instances that include this layer.
+        `spatial_idx` and `temporal_idx` give the spatial and temporal
+        scheduling indices in the segment. The segment index is inferred from
+        the previous top schemes.
+
+        Return new top NNDataflowScheme instances that include this layer. Will
+        NOT update the `nndf_tops` attribute.
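+
+        For example, the layer scheduled second in time (temporal index 1) on
+        the first spatial subregion (spatial index 0) of segment 2 gets
+        sched_seq == (2, 0, 1).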
         '''
         nndf_tops = []
 
@@ -124,8 +254,27 @@ def _layer_schedule_search(self, layer_name, prev_nndf_tops, options):
 
             ifmap_layout = prev_nndf.fmap_layout(self.network.prevs(layer_name))
 
-            condition = SchedulingCondition(resource=self.resource,
-                                            ifmap_layout=ifmap_layout)
+            if fwd_data_region is not None:
+                # Remap source data regions to the forwarding region.
+                ifmap_layout = DataLayout(
+                    frngs=ifmap_layout.frngs,
+                    regions=(fwd_data_region,) * len(ifmap_layout.frngs),
+                    parts=tuple(p.projection(fwd_data_region, appl2frng=True)
+                                for p in ifmap_layout.parts))
+
+            segment_idx = prev_nndf.last_seg_idx
+            if spatial_idx == 0 and temporal_idx == 0:
+                # New segment.
+                segment_idx += 1
+
+            sched_seq = (segment_idx, spatial_idx, temporal_idx)
+
+            constraint.update_by_prev(prev_nndf)
+
+            condition = SchedulingCondition(resource=resource,
+                                            constraint=constraint,
+                                            ifmap_layout=ifmap_layout,
+                                            sched_seq=sched_seq)
 
             try:
                 sched_tops = layer_sched.schedule_search(condition, options)
diff --git a/nn_dataflow/core/nn_dataflow_scheme.py b/nn_dataflow/core/nn_dataflow_scheme.py
index 2e027e4..7eba77a 100644
--- a/nn_dataflow/core/nn_dataflow_scheme.py
+++ b/nn_dataflow/core/nn_dataflow_scheme.py
@@ -19,6 +19,7 @@
 from .. import util
 from .data_layout import DataLayout
 from .network import Network
+from .pipeline_segment_timing import PipelineSegmentTiming
 from .scheduling import SchedulingResult
 
 class NNDataflowScheme(MutableMapping):
@@ -53,8 +54,16 @@ def __init__(self, network, input_layout, ext_layout_dict=None):
 
         self.res_dict = OrderedDict()
 
-        self.total_cost = 0
-        self.total_time = 0
+        # Naive sum of all layer cost.
+        self.sum_cost = 0
+        self.sum_static_cost = 0
+        # Naive sum of all layer time, used to adjust cost.
+        self.sum_time = 0
+
+        # A list of segment schedule timing information.
+        self.segment_timing_list = []
+
+        self.last_seg_idx = -1
 
     def __getitem__(self, layer_name):
         ''' Get the SchedulingResult of a scheduled layer. '''
@@ -84,8 +93,23 @@ def __setitem__(self, layer_name, sched_result):
 
         self.res_dict[layer_name] = sched_result
 
-        self.total_cost += sched_result.total_cost
-        self.total_time += sched_result.total_time
+        self.sum_cost += sched_result.total_cost
+        self.sum_static_cost += sched_result.scheme['cost_static']
+        self.sum_time += sched_result.total_time
+
+        seg_idx = sched_result.sched_seq[0]
+        if seg_idx == self.last_seg_idx + 1:
+            self.segment_timing_list.append(
+                PipelineSegmentTiming(self.network, seg_idx))
+            self.last_seg_idx += 1
+        elif seg_idx == self.last_seg_idx:
+            pass
+        else:
+            raise ValueError('NNDataflowScheme: segment index is invalid. '
+                             'segment {} follows {}.'
+                             .format(seg_idx, self.last_seg_idx))
+        assert len(self.segment_timing_list) - 1 == self.last_seg_idx
+        self.segment_timing_list[-1].add(layer_name, sched_result)
 
     def __delitem__(self, layer_name):
         ''' Not legal to call. '''
@@ -129,6 +153,25 @@ def _ofmap_layout(layer_name):
 
         return DataLayout.concat(*[_ofmap_layout(l) for l in layers])
 
+    @property
+    def total_cost(self):
+        ''' Get the total cost. '''
+        if self.sum_time == 0:
+            return self.sum_cost
+        overcounted_static_cost = (self.sum_static_cost
+                                   * (1 - 1. * self.total_time / self.sum_time))
+        return self.sum_cost - overcounted_static_cost
+
+    @property
+    def total_time(self):
+        ''' Get the total time. '''
+        # Special case, when the entire network fits in one segment. No
+        # pipeline filling/draining delay.
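+        # Otherwise, sum over all segments; each segment time already
+        # includes its own pipeline filling and draining.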
+        if len(self.segment_timing_list) == 1 \
+                and self.__len__() == len(self.network):
+            return self.segment_timing_list[0].critical_time
+        return sum(t.time for t in self.segment_timing_list)
+
     @property
     def total_ops(self):
         ''' Get the total ops. '''
@@ -147,6 +190,16 @@ def total_noc_hops(self):
         ''' Get the total NoC hops. '''
         return sum(sr.total_noc_hops for sr in self.values())
 
+    def segment_time_list(self):
+        ''' Get the time for each segment. '''
+        return [t.time for t in self.segment_timing_list]
+
+    def segment_dram_time_list(self):
+        '''
+        Get the time for each segment on DRAM access.
+        '''
+        return [t.dram_time for t in self.segment_timing_list]
+
     def perlayer_stats(self, stats_name):
         '''
         Get a dict of per-layer stats. Valid stats must be a static method.
diff --git a/nn_dataflow/core/node_region.py b/nn_dataflow/core/node_region.py
index 06baec8..2d3b3df 100644
--- a/nn_dataflow/core/node_region.py
+++ b/nn_dataflow/core/node_region.py
@@ -16,12 +16,15 @@
 import itertools
 from collections import namedtuple
 
+from .. import util
 from .phy_dim2 import PhyDim2
 
 NODE_REGION_LIST = ['dim',
                     'origin',
                     'dist',
                     'type',
+                    'wtot',
+                    'wbeg',
                    ]
 
 class NodeRegion(namedtuple('NodeRegion', NODE_REGION_LIST)):
@@ -31,6 +34,26 @@ class NodeRegion(namedtuple('NodeRegion', NODE_REGION_LIST)):
     The `type` attribute specifies the region type, which could be `PROC` for
     computation processing nodes or `DRAM` for off-chip data storage nodes.
 
+    The node region can be optionally folded along the w dimension in a
+    zig-zag manner. The folding scheme is defined by (wtot, wbeg). `wtot` is
+    always positive, representing the number of nodes between two turns (total
+    width). `wbeg` is the number of nodes before reaching the first turning
+    boundary, with its sign representing the direction. E.g.,
+
+                  ...
+    ******************
+              ********
+              | wbeg |
+
+    or
+
+                  ...
+    ******************
+    *********
+    | -wbeg |
+
+    With folded region, `origin` points to the first node.
+
     NOTE: we cannot overload __contains__ and __iter__ as a node container,
     because the base namedtuple already defines them.
     '''
@@ -46,6 +69,12 @@ def __new__(cls, *args, **kwargs):
         kwargs2 = kwargs.copy()
         if len(args) <= NODE_REGION_LIST.index('dist'):
             kwargs2.setdefault('dist', PhyDim2(1, 1))
+        if len(args) <= NODE_REGION_LIST.index('wtot'):
+            # Default to dim.w but we haven't checked dim yet. Replace later.
+            kwargs2.setdefault('wtot', None)
+        if len(args) <= NODE_REGION_LIST.index('wbeg'):
+            # Default to wtot. Also replace later.
+            kwargs2.setdefault('wbeg', None)
 
         ntp = super(NodeRegion, cls).__new__(cls, *args, **kwargs2)
 
@@ -59,6 +88,19 @@ def __new__(cls, *args, **kwargs):
         if ntp.type not in range(cls.NUM):
             raise ValueError('NodeRegion: type must be a valid type enum.')
 
+        if ntp.wtot is None:
+            ntp = ntp._replace(wtot=ntp.dim.w)
+        if ntp.wbeg is None:
+            ntp = ntp._replace(wbeg=ntp.wtot)
+
+        if not isinstance(ntp.wtot, int):
+            raise TypeError('NodeRegion: wtot must be an int.')
+        if not isinstance(ntp.wbeg, int):
+            raise TypeError('NodeRegion: wbeg must be an int.')
+
+        if not (0 < abs(ntp.wbeg) <= ntp.wtot) and ntp.dim.size() > 0:
+            raise ValueError('NodeRegion: |wbeg| must be in (0, wtot].')
+
         return ntp
 
     def contains_node(self, coordinate):
@@ -79,6 +121,86 @@ def rel2abs(self, rel_coordinate):
             raise ValueError('NodeRegion: relative coordinate {} is not in '
                              'node region {}.'.format(rel_coordinate, self))
 
-        abs_coordinate = self.origin + rel_coordinate * self.dist
+        # Add a starting offset to start from the boundary before the first
+        # node, then modulo wtot to get the delta h and w to this boundary
+        # point.
+        h, w = divmod(rel_coordinate.w + self.wtot - abs(self.wbeg), self.wtot)
+        # Direction for w, changing every time h increments.
+        direction = (-1 if self.wbeg < 0 else 1) * (-1 if h % 2 else 1)
+        # Make w relative to the left boundary.
+        w = w if direction > 0 else self.wtot - 1 - w
+
+        abs_coordinate = self.origin \
+                + PhyDim2(h=h * self.dim.h + rel_coordinate.h,
+                          w=w - (self.wtot - self.wbeg if self.wbeg > 0
+                                 else -self.wbeg - 1)) \
+                * self.dist
 
         return abs_coordinate
 
+    def allocate(self, request_list):
+        '''
+        Allocate node subregions spatially within the node region according to
+        the given `request_list`, which is a list of the numbers of nodes
+        requested.
+
+        Return a list of NodeRegion instances, whose origins are absolute
+        offsets (not relative to the origin of self). The allocation fails if
+        and only if the total number of nodes requested is larger than the
+        number of nodes in the region, in which case an empty list is
+        returned.
+
+        The strategy is to allocate stripe-wise in a zig-zag order, allowing
+        for folding in width. We first determine a stripe height as the
+        greatest common divisor of the requested numbers of nodes. Then we
+        allocate each request as (stripe height, request size / stripe height)
+        to fill in the stripe, and move to the next stripe after the current
+        one is filled. If the width of a request is larger than the remaining
+        width of the current stripe, we use up the remaining width, and fold
+        the request width to the next stripe.
+        '''
+
+        if sum(request_list) > self.dim.size():
+            return []
+
+        hstrp = util.gcd(self.dim.h, *request_list)
+        subregions = []
+
+        wtot = self.dim.w
+        ofs_h, ofs_w = 0, 0
+        move_right = True
+
+        for req in request_list:
+
+            # Subregion.
+            assert req % hstrp == 0
+            width = req // hstrp
+
+            subdim = PhyDim2(hstrp, width)
+            if move_right:
+                origin = PhyDim2(ofs_h, ofs_w)
+                wbeg = min(wtot - ofs_w, width)
+                assert wbeg > 0
+            else:
+                origin = PhyDim2(ofs_h, self.dim.w - ofs_w - 1)
+                wbeg = -min(wtot - ofs_w, width)
+                assert wbeg < 0
+
+            subregions.append(NodeRegion(dim=subdim,
+                                         origin=self.origin \
+                                                 + origin * self.dist,
+                                         dist=self.dist,
+                                         type=self.type,
+                                         wtot=wtot,
+                                         wbeg=wbeg))
+
+            # Move the offset.
+            ofs_w += width
+            while ofs_w >= self.dim.w:
+                # Overflow, fold to the next stripe.
+                ofs_w -= self.dim.w
+                ofs_h += hstrp
+                move_right = not move_right
+
+            # Must not have moved outside the region.
+            assert ofs_h + hstrp <= self.dim.h or ofs_w == 0
+
+        return subregions
+
diff --git a/nn_dataflow/core/option.py b/nn_dataflow/core/option.py
index 451f044..968bf72 100644
--- a/nn_dataflow/core/option.py
+++ b/nn_dataflow/core/option.py
@@ -19,9 +19,16 @@
 OPTION_LIST = ['sw_gbuf_bypass',
                'sw_solve_loopblocking',
+               'hw_access_forwarding',
+               'hw_gbuf_sharing',
+               'hw_gbuf_save_writeback',
                'partition_hybrid',
                'partition_batch',
                'partition_ifmaps',
+               'partition_interlayer',
+               'layer_pipeline_time_ovhd',
+               'layer_pipeline_max_degree',
+               'layer_pipeline_opt',
                'opt_goal',
                'ntops',
                'nprocesses',
@@ -55,9 +62,16 @@ def __new__(cls, *args, **kwargs):
         kwdict.setdefault('sw_gbuf_bypass', (False,) * de.NUM)
         kwdict.setdefault('sw_solve_loopblocking', False)
+        kwdict.setdefault('hw_access_forwarding', False)
+        kwdict.setdefault('hw_gbuf_sharing', False)
+        kwdict.setdefault('hw_gbuf_save_writeback', False)
         kwdict.setdefault('partition_hybrid', False)
         kwdict.setdefault('partition_batch', False)
         kwdict.setdefault('partition_ifmaps', False)
+        kwdict.setdefault('partition_interlayer', False)
+        kwdict.setdefault('layer_pipeline_time_ovhd', float('inf'))
+        kwdict.setdefault('layer_pipeline_max_degree', float('inf'))
+        kwdict.setdefault('layer_pipeline_opt', True)
         kwdict.setdefault('opt_goal', 'e')
         kwdict.setdefault('ntops', 1)
         kwdict.setdefault('nprocesses', 1)
@@ -73,10 +87,38 @@ def __new__(cls, *args, **kwargs):
             raise ValueError('Option: sw_gbuf_bypass must have length {}'
                              .format(de.NUM))
 
+        if ntp.sw_solve_loopblocking and ntp.hw_gbuf_sharing:
+            raise ValueError('Option: sw_solve_loopblocking and '
+                             'hw_gbuf_sharing cannot be simultaneously '
+                             'enabled.')
+
+        if ntp.hw_access_forwarding and ntp.hw_gbuf_sharing:
+            raise ValueError('Option: hw_access_forwarding is implied by '
+                             'hw_gbuf_sharing, thus cannot be both enabled.')
+
+        if ntp.sw_solve_loopblocking and ntp.hw_gbuf_save_writeback:
+            raise ValueError('Option: sw_solve_loopblocking and '
+                             'hw_gbuf_save_writeback cannot be simultaneously '
+                             'enabled.')
+
         if ntp.partition_ifmaps and not ntp.partition_hybrid:
             raise ValueError('Option: partition_ifmaps requires '
                              'partition_hybrid to be set.')
 
+        if not isinstance(ntp.layer_pipeline_time_ovhd, (int, float)):
+            raise TypeError('Option: layer_pipeline_time_ovhd must be a '
+                            'number.')
+        if ntp.layer_pipeline_time_ovhd < 0:
+            raise ValueError('Option: layer_pipeline_time_ovhd must be '
+                             'non-negative.')
+
+        if not isinstance(ntp.layer_pipeline_max_degree, (int, float)):
+            raise TypeError('Option: layer_pipeline_max_degree must be a '
+                            'number.')
+        if ntp.layer_pipeline_max_degree < 0:
+            raise ValueError('Option: layer_pipeline_max_degree must be '
+                             'non-negative.')
+
         if ntp.opt_goal not in ['e', 'd', 'ed']:
             raise ValueError('Option: opt_goal is invalid, must be one of '
                              '\'e\', \'d\', and \'ed\'.')
diff --git a/nn_dataflow/core/partition.py b/nn_dataflow/core/partition.py
index 893599f..6ec2535 100644
--- a/nn_dataflow/core/partition.py
+++ b/nn_dataflow/core/partition.py
@@ -256,8 +256,6 @@ def unit_nhops_to_proc_region(layer, batch_size, region, part,
     category.
     '''
 
-    del options
-
     # FmapRange --> list of node coordinates processing this data.
     fil_dict = {}
     ofm_dict = {}
@@ -285,23 +283,29 @@ def unit_nhops_to_proc_region(layer, batch_size, region, part,
     ifm_dict = util.HashableDict.fromdict(ifm_dict, valfunc=tuple)
     ofm_dict = util.HashableDict.fromdict(ofm_dict, valfunc=tuple)
 
+    # When using access forwarding, each piece of data is only fetched by the
+    # closest node, and then forwarded to ALL nodes that process it, regardless
+    # of which nodes initially store it. In this way, the access forwarding
+    # nhops are independent of the buffer sharing scheme.
+    fwd = options.hw_access_forwarding or options.hw_gbuf_sharing
+
     nhops = [0] * de.NUM
 
-    nhops[de.FIL] = _unit_nhops_to_fil(layer, filter_nodes, fil_dict)
+    nhops[de.FIL] = _unit_nhops_to_fil(layer, filter_nodes, fil_dict, fwd)
 
-    nhops[de.IFM] = _unit_nhops_to_ifm(ifmap_layout, ifm_dict)
+    nhops[de.IFM] = _unit_nhops_to_ifm(ifmap_layout, ifm_dict, fwd)
 
     if ofmap_layout.parts == (part,) and ofmap_layout.regions == (region,):
         # Ofmaps are stored locally, no data transfer.
         pass
     else:
-        nhops[de.OFM] = _unit_nhops_to_ofm(ofmap_layout, ofm_dict)
+        nhops[de.OFM] = _unit_nhops_to_ofm(ofmap_layout, ofm_dict, fwd)
 
     return nhops
 
 @fastcache.clru_cache(maxsize=1024)
-def _unit_nhops_to_fil(layer, filter_nodes, fil_dict):
+def _unit_nhops_to_fil(layer, filter_nodes, fil_dict, fwd=False):
     '''
     Get the total number of hops to transfer filter data.
@@ -312,16 +316,31 @@ def _unit_nhops_to_fil(layer, filter_nodes, fil_dict):
     for filrng, coord_list in fil_dict.items():
         fil_size = filrng[0].size() * filrng[1].size() * layer.filter_size()
 
-        # Min hops to each processing node across all filter source nodes.
-        min_hops = [min(coord.hop_dist(c) for c in filter_nodes)
-                    for coord in coord_list]
-        nhops += fil_size * sum(min_hops)
+        if fwd:
+            # Data can be forwarded from all sources to any destination.
+            src_set = set(filter_nodes)
+            dst_set = set(coord_list)
+
+            while dst_set:
+                # In each forward step, get the min-distance pair of source
+                # and destination.
+                src, dst = min(itertools.product(src_set, dst_set),
+                               key=lambda (s, d): d.hop_dist(s))
+                dst_set.remove(dst)
+                src_set.add(dst)
+                nhops += fil_size * dst.hop_dist(src)
+
+        else:
+            # Min hops to each processing node across all filter source nodes.
+            min_hops = [min(coord.hop_dist(c) for c in filter_nodes)
+                        for coord in coord_list]
+            nhops += fil_size * sum(min_hops)
 
     return nhops
 
 @fastcache.clru_cache(maxsize=1024)
-def _unit_nhops_to_ifm(ifmap_layout, ifm_dict):
+def _unit_nhops_to_ifm(ifmap_layout, ifm_dict, fwd=False):
     '''
     Get the total number of hops to transfer ifmap data.
@@ -330,13 +349,13 @@ def _unit_nhops_to_ifm(ifmap_layout, ifm_dict):
     nhops = 0
 
     for ifrng, coord_list in ifm_dict.items():
-        nhops += ifmap_layout.nhops_to(ifrng, *coord_list)
+        nhops += ifmap_layout.nhops_to(ifrng, *coord_list, forwarding=fwd)
 
     return nhops
 
 @fastcache.clru_cache(maxsize=1024)
-def _unit_nhops_to_ofm(ofmap_layout, ofm_dict):
+def _unit_nhops_to_ofm(ofmap_layout, ofm_dict, fwd=False):
     '''
     Get the total number of hops to transfer ofmap data.
@@ -350,16 +369,28 @@ def _unit_nhops_to_ofm(ofmap_layout, ofm_dict):
         # its buffer and start on it. Other nodes start on zero and send the
         # results to that node to accumulate there.
 
-        # Use the mid node.
-        mid_idx = len(coord_list) // 2
-        for idx, coord in enumerate(coord_list):
-            if idx == mid_idx:
-                # The mid node. Fetch from memory.
-                nhops += ofmap_layout.nhops_to(ofrng, coord)
-            else:
-                # Others. Send to the mid node (one way).
-                dist = coord.hop_dist(coord_list[mid_idx])
-                nhops += util.idivc(ofrng.size() * dist, 2)
+        if fwd:
+            # Use the closest processing node.
+            nhops_read = min(ofmap_layout.nhops_to(ofrng, c)
+                             for c in coord_list)
+            # Accumulation follows the reversed optimal forwarding tree.
+            nhops_accum = ofmap_layout.nhops_to(ofrng, *coord_list,
+                                                forwarding=True)
+            # The path between the mid node and memory is in both, and
+            # accumulation is one-way.
+            nhops += util.idivc(nhops_read + nhops_accum, 2)
+
+        else:
+            # Use the middle node.
+            mid_idx = len(coord_list) // 2
+            for idx, coord in enumerate(coord_list):
+                if idx == mid_idx:
+                    # The mid node. Fetch from memory.
+                    nhops += ofmap_layout.nhops_to(ofrng, coord)
+                else:
+                    # Others. Send to the mid node (one way).
+                    dist = coord.hop_dist(coord_list[mid_idx])
+                    nhops += util.idivc(ofrng.size() * dist, 2)
 
     return nhops
 
diff --git a/nn_dataflow/core/partition_scheme.py b/nn_dataflow/core/partition_scheme.py
index b00a8b4..1735850 100644
--- a/nn_dataflow/core/partition_scheme.py
+++ b/nn_dataflow/core/partition_scheme.py
@@ -173,6 +173,41 @@ def part_layer(self, layer, batch_size):
 
         return p_layer, p_batch_size, p_occ
 
+    def part_neighbor_dist(self, node_region, pae):
+        '''
+        Get the 2D distance between nearest neighbor nodes with the given
+        parallelism in the given node region.
+
+        The returned neighbor distance is a PhyDim2 instance, each dimension
+        of which is the hop distance to the neighbor on that logical
+        dimension.
+        '''
+        if pae not in range(pe.NUM):
+            return PhyDim2(float('nan'), float('nan'))
+
+        hdist = []
+        wdist = []
+
+        for pidx in self.gen_pidx():
+            coord = self.coordinate(node_region, pidx)
+            # On logical h dimension.
+            if pidx[pae].h > 0:
+                pidx_ph = [pidx[p] - PhyDim2(h=1, w=0) if p == pae
+                           else pidx[p] for p in range(pe.NUM)]
+                coord_ph = self.coordinate(node_region, pidx_ph)
+                hdist.append(coord.hop_dist(coord_ph))
+            # On logical w dimension.
+            if pidx[pae].w > 0:
+                pidx_pw = [pidx[p] - PhyDim2(h=0, w=1) if p == pae
+                           else pidx[p] for p in range(pe.NUM)]
+                coord_pw = self.coordinate(node_region, pidx_pw)
+                wdist.append(coord.hop_dist(coord_pw))
+
+        # Average.
+        hd = 1. * sum(hdist) / len(hdist) if hdist else float('inf')
+        wd = 1. * sum(wdist) / len(wdist) if wdist else float('inf')
+
+        return PhyDim2(h=hd, w=wd)
+
     def projection(self, region, appl2frng=False):
         '''
         Get the projection of the partitioning scheme onto a new NodeRegion
diff --git a/nn_dataflow/core/pipeline_segment.py b/nn_dataflow/core/pipeline_segment.py
new file mode 100644
index 0000000..86741cf
--- /dev/null
+++ b/nn_dataflow/core/pipeline_segment.py
@@ -0,0 +1,970 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+"""
+
+from collections import namedtuple, OrderedDict, Counter
+import itertools
+
+from sympy import symbols
+from sympy import Basic as symbasic
+from sympy import Eq as symeq
+from sympy.core.containers import Tuple as symtuple
+from sympy.functions.elementary.piecewise import Piecewise as sympiecewise
+
+from .. import util
+from .layer import ConvLayer
+from .network import Network
+from .resource import Resource
+from .scheduling_constraint import SchedulingConstraintLayerPipeline as Cstr
+
+class PipelineSegment(object):
+    '''
+    Inter-layer pipeline segment.
+
+    A segment is a two-level layer hierarchy, where the first level is
+    spatially scheduled and the second level is temporally scheduled.
+    '''
+
+    # pylint: disable=too-many-instance-attributes
+
+    # Scheduling index in the segment, as a tuple of spatial and temporal
+    # scheduling indices.
+    SchedIndex = namedtuple('SchedIndex', ['sp_idx', 'tm_idx'])
+
+    def __init__(self, seg, network, batch_size, resource, max_util_drop=0.05,
+                 with_opt=True):
+        if not isinstance(seg, tuple):
+            raise TypeError('PipelineSegment: seg must be a tuple.')
+        for ltpl in seg:
+            if not isinstance(ltpl, tuple):
+                raise TypeError('PipelineSegment: seg must be a tuple '
+                                'of sub-tuples.')
+
+        if not isinstance(network, Network):
+            raise TypeError('PipelineSegment: network must be '
+                            'a Network instance.')
+        if not isinstance(resource, Resource):
+            raise TypeError('PipelineSegment: resource must be '
+                            'a Resource instance.')
+
+        self.seg = seg
+        self.network = network
+        self.batch_size = batch_size
+        self.resource = resource
+        self.max_util_drop = max_util_drop
+        self.with_opt = with_opt
+
+        self.valid = self._init_deps()
+        if not self.valid:
+            return
+
+        # Resource allocation.
+        self.valid = self._alloc_resource(max_util_drop=max_util_drop)
+        if not self.valid:
+            return
+
+        # Scheduling constraints.
+        self.valid = self._init_sym_cstrs()
+        if not self.valid:
+            return
+
+    def allocation(self):
+        '''
+        Get the resource allocation, as a tuple of sub-tuples corresponding to
+        the layers in the segment.
+        '''
+        if not self.valid:
+            return None
+        return self.alloc
+
+    def gen_constraint(self, max_time_overhead=float('inf')):
+        '''
+        Generate scheduling constraints for the segment, as a tuple of
+        sub-tuples of SchedulingConstraint instances, corresponding to the
+        layers in the segment.
+
+        Yield the segment constraint tuple, and hints for pruning.
+
+        Pruning hints are the top-level loop blocking factors. Smaller hints
+        indicate better (lower) cost, and larger hints indicate better segment
+        timing (with lower time overhead). Constraints with smaller hints are
+        generated before those with larger hints. So if a constraint results
+        in a valid scheduling, the later ones whose hints are all larger than
+        its hints can be pruned.
+        '''
+        syms = self.cstr_symvals.keys()
+        vals = self.cstr_symvals.values()
+        assert syms and vals
+
+        # Sort from small to large.
+        # This is not a strict ordering, but we guarantee that if all values
+        # in hint A are larger than the corresponding values in hint B, A will
+        # be generated after B.
+        vals = [sorted(v) for v in vals]
+
+        if self.cstr_topbat_idx is not None:
+            # Tovhd = (1 + 1/to + 1 + 1/to + ...) / tb
+            #     >= (1 + 1 + ...) / tb = num_sp_fbs / tb
+            min_topbat = 1. * self.cstr_num_sp_fbs / max_time_overhead
+            pos = self.cstr_topbat_idx
+            vals[pos] = [t for t in vals[pos] if t >= min_topbat]
+
+        for valp in itertools.product(*vals):
+
+            constraint = tuple()
+
+            for atpl in self._subs_symargs(self.cstr_symargs, zip(syms, valp)):
+                ctpl = tuple()
+                for a in atpl:
+                    # Construct kwargs, adjust the types of the values.
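+                    # E.g., a == {'topbat': 2, 'fbofm': True} constructs
+                    # Cstr(topbat=2, fbifm=False, topifm=0, fbofm=True,
+                    # update_dict=None); topofm is left unconstrained since
+                    # the ofmaps are fully buffered.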
+                    kwargs = {}
+                    kwargs['topbat'] = int(a.get('topbat', 0))
+                    kwargs['fbifm'] = bool(a.get('fbifm', False))
+                    if not kwargs['fbifm']:
+                        kwargs['topifm'] = int(a.get('topifm', 0))
+                    kwargs['fbofm'] = bool(a.get('fbofm', False))
+                    if not kwargs['fbofm']:
+                        kwargs['topofm'] = int(a.get('topofm', 0))
+                    kwargs['update_dict'] = a.get('update_dict')
+
+                    c = Cstr(**kwargs)
+                    ctpl += (c,)
+                constraint += (ctpl,)
+
+            if None in valp:
+                assert len(valp) == 1
+                hints = (1,)
+            else:
+                hints = tuple(valp)
+
+            yield constraint, hints
+
+    def __getitem__(self, index):
+        return self.seg[index]
+
+    def __iter__(self):
+        return self.seg.__iter__()
+
+    def __len__(self):
+        return len(self.seg)
+
+    def __eq__(self, other):
+        if isinstance(other, self.__class__):
+            # pylint: disable=protected-access
+            return self._key_attrs() == other._key_attrs()
+        return NotImplemented
+
+    def __ne__(self, other):
+        return not self == other
+
+    def __hash__(self):
+        return hash(tuple(self._key_attrs()))
+
+    def __repr__(self):
+        return '{}({})'.format(
+            self.__class__.__name__,
+            ', '.join([
+                'seg={}'.format(repr(self.seg)),
+                'network={}'.format(repr(self.network)),
+                'batch_size={}'.format(repr(self.batch_size)),
+                'resource={}'.format(repr(self.resource)),
+                'max_util_drop={}'.format(repr(self.max_util_drop)),
+                'with_opt={}'.format(repr(self.with_opt))]))
+
+    def _key_attrs(self):
+        ''' Used for comparison. '''
+        return (self.seg, self.network, self.batch_size, self.resource,
+                self.max_util_drop, self.with_opt)
+
+    def _init_deps(self):
+        '''
+        Initialize the dependency relationship of the layers in the segment,
+        as a mapping of the scheduling indices, and check validity. Return
+        whether the segment is valid to schedule.
+
+        We categorize dependencies into 3 categories:
+        - local: with the same spatial index but different temporal indices;
+        - neighbor: with different spatial indices but in the same segment;
+        - memory: in different segments, from/to memory.
+
+        The values of the src/dst dicts are tuples of indices of the neighbor
+        dependencies. A layer can have at most one neighbor source (which must
+        be a last temporally scheduled layer), but may have multiple neighbor
+        destinations (which could be temporally scheduled in the middle).
+        Also, all layers with the same spatial index can have at most one
+        neighbor source.
+
+        The special index `None` means a memory dependency, i.e., from/to
+        memory. Memory sources and neighbor sources must be mutually
+        exclusive, in order to correctly set the src data regions; memory
+        destinations and neighbor destinations can co-exist.
+
+        Local dependencies are omitted, as by default each layer has its
+        immediately previous layer as the local source and its immediately
+        next layer as the local destination.
+
+        Construct an ifmap forwarding dict for shared memory source data. It
+        maps previous layer name tuples, to a list of scheduling indices of
+        all layers in this segment that share these exact previous layers. The
+        first in the list is responsible for fetching the previous layer data
+        and forwarding them to the others. We allow shared memory source data
+        between two layers only when both layers have only memory dependencies
+        (so their temporal indices must be 0), and their previous layers are
+        exactly the same.
+
+        Construct an ofmap forwarding dict for multiple destinations of both
+        on-chip and off-chip.
+        It maps the scheduling index of a layer in this segment that has both
+        memory and neighbor/local destinations (so it needs to store its
+        ofmaps back to memory), to a list of scheduling indices of all layers
+        in this segment that accept its ofmaps as ifmaps. Neighbor
+        dependencies are only between the last temporal one and the first
+        temporal ones; local dependencies are only between adjacent temporal
+        ones.
+        '''
+
+        self.src_dict = [[None for _ in ltpl] for ltpl in self.seg]
+        self.dst_dict = [[None for _ in ltpl] for ltpl in self.seg]
+
+        self.ifm_fwd_dict = {}
+        self.ofm_fwd_dict = {}
+
+        # Mapping from layer to spatial/temporal indices in the segment.
+        layer2idx = {l: PipelineSegment.SchedIndex(sp_idx, tm_idx)
+                     for sp_idx, ltpl in enumerate(self.seg)
+                     for tm_idx, l in enumerate(ltpl)}
+
+        # Mapping from previous layer tuple to layer.
+        prevs2layer = {}
+
+        for sp_idx, ltpl in enumerate(self.seg):
+
+            single_nbr_src = None
+
+            for tm_idx, l in enumerate(ltpl):
+
+                assert layer2idx[l] == (sp_idx, tm_idx)
+
+                # Sources.
+                src = tuple()
+
+                prevs = self.network.prevs(l)
+                assert all(p not in layer2idx or layer2idx[p] < layer2idx[l]
+                           for p in prevs)
+                mem_src = [p for p in prevs if p not in layer2idx]
+                lcl_src = [p for p in prevs if p not in mem_src
+                           and layer2idx[p].sp_idx == sp_idx]
+                nbr_src = [p for p in prevs if p not in mem_src + lcl_src]
+
+                # Ensure that the single local source is the immediately
+                # previous layer. Checked at the destination, so here are
+                # assertions.
+                if not lcl_src:
+                    assert tm_idx == 0
+                else:
+                    assert len(lcl_src) == 1 \
+                            and layer2idx[lcl_src[0]].tm_idx == tm_idx - 1
+
+                # Mutually exclusive.
+                if mem_src and nbr_src:
+                    # We now allow each spatial scheduling (vertex) to have
+                    # both memory source and neighbor source when generating
+                    # segments. But each single layer cannot have both;
+                    # otherwise there would be multiple source data regions.
+                    return False
+
+                if mem_src:
+                    # Memory source.
+                    src += (None,)
+                if nbr_src:
+                    # Neighbor source.
+                    # The single neighbor source must be the last temporally
+                    # scheduled.
+                    assert len(nbr_src) == 1
+                    prev_idx = layer2idx[nbr_src[0]]
+                    assert prev_idx.tm_idx == len(self.seg[prev_idx.sp_idx]) - 1
+                    # Single neighbor source across this spatial scheduling.
+                    if single_nbr_src is not None:
+                        return False
+                    single_nbr_src = prev_idx
+                    src += (prev_idx,)
+
+                # Shared memory source.
+                if mem_src and not lcl_src:
+                    assert not nbr_src
+                    assert tm_idx == 0
+                    if prevs in prevs2layer:
+                        fet_idx = layer2idx[prevs2layer[prevs]]
+                        self.ifm_fwd_dict.setdefault(prevs, [fet_idx]).append(
+                            layer2idx[l])
+                    else:
+                        prevs2layer[prevs] = l
+
+                # Destinations.
+                dst = tuple()
+
+                nexts = self.network.nexts(l)
+                assert all(n not in layer2idx or layer2idx[n] > layer2idx[l]
+                           for n in nexts)
+                mem_dst = [n for n in nexts if n not in layer2idx]
+                lcl_dst = [n for n in nexts if n not in mem_dst
+                           and layer2idx[n].sp_idx == sp_idx]
+                nbr_dst = [n for n in nexts if n not in mem_dst + lcl_dst]
+
+                # Ensure that the single local destination is the immediately
+                # next layer.
+                if not lcl_dst:
+                    if tm_idx != len(ltpl) - 1:
+                        # Does not utilize local data; sub-optimal.
+                        return False
+                else:
+                    if len(lcl_dst) != 1 \
+                            or layer2idx[lcl_dst[0]].tm_idx != tm_idx + 1:
+                        # Local data will not be available if not adjacent.
+                        return False
+
+                # Mutually exclusive.
+                # Now they can co-exist.
+                # assert not mem_dst or not nbr_dst
+                if mem_dst and nbr_dst:
+                    assert tm_idx == len(ltpl) - 1
+                    self.ofm_fwd_dict[layer2idx[l]] = [layer2idx[n]
+                                                       for n in nbr_dst]
+                if mem_dst and lcl_dst:
+                    assert not nbr_dst
+                    self.ofm_fwd_dict[layer2idx[l]] = [layer2idx[lcl_dst[0]]]
+
+                if mem_dst:
+                    # Memory destination.
+                    dst += (None,)
+                if nbr_dst:
+                    # Neighbor destinations.
+                    # This layer is the last temporally scheduled.
+                    assert tm_idx == len(ltpl) - 1
+                    dst += tuple(layer2idx[n] for n in nbr_dst)
+
+                # Basic pipelining requires a linear structure (on-chip).
+                if not self.with_opt:
+                    if len(nbr_src) + len(lcl_src) > 1 \
+                            or len(nbr_dst) + len(lcl_dst) > 1 \
+                            or ((sp_idx, tm_idx) != (0, 0)
+                                and not nbr_src and not lcl_src):
+                        return False
+
+                self.src_dict[sp_idx][tm_idx] = src
+                self.dst_dict[sp_idx][tm_idx] = dst
+
+        return True
+
+    def _alloc_resource(self, max_util_drop=0.05):
+        '''
+        Decide the resource allocation. Return whether the allocation
+        succeeds.
+
+        `max_util_drop` specifies the maximum utilization drop due to
+        mismatched throughput between layers.
+        '''
+
+        self.alloc = tuple()
+
+        # Allocate processing subregions.
+        subregions = self._alloc_proc(max_util_drop=max_util_drop)
+        if not subregions:
+            return False
+
+        no_time_mux = len(self.network) == sum(len(ltpl) for ltpl in self.seg)
+        # All layers that have model filters must be spatially scheduled.
+        if no_time_mux:
+            for ltpl in self.seg:
+                if len([l for l in ltpl
+                        if isinstance(self.network[l], ConvLayer)]) > 1:
+                    no_time_mux = False
+                    break
+
+        for sp_idx, ltpl in enumerate(self.seg):
+
+            # Resource for the subregion.
+            rtpl = tuple()
+
+            for tm_idx, _ in enumerate(ltpl):
+
+                # Processing region.
+                proc_region = subregions[sp_idx]
+
+                # Data source.
+                src = self.src_dict[sp_idx][tm_idx]
+                if None in src:
+                    # Data source is memory.
+                    assert src == (None,)
+                    src_data_region = self.resource.src_data_region
+                    for sh_idx_list in self.ifm_fwd_dict.values():
+                        # Find shared memory source to use forwarding.
+                        if (sp_idx, tm_idx) in sh_idx_list[1:]:
+                            src_data_region = subregions[sh_idx_list[0].sp_idx]
+                            break
+                elif src:
+                    # Data source is a neighbor.
+                    assert len(src) == 1
+                    src_data_region = subregions[src[0].sp_idx]
+                else:
+                    # Data source is all local.
+                    src_data_region = proc_region
+
+                # Data destination.
+                dst = self.dst_dict[sp_idx][tm_idx]
+                if None in dst:
+                    # Data destination is memory.
+                    # assert dst == (None,)
+                    # Now we can have both memory and neighbor destinations.
+                    # If they co-exist, we need to store the data locally and
+                    # also store them back to memory. In this case the dst
+                    # data region is set to memory.
+                    dst_data_region = self.resource.dst_data_region
+                elif dst:
+                    # Data destinations are neighbors.
+                    # Put data in local. The next layers will fetch.
+                    dst_data_region = proc_region
+                else:
+                    # Data destination is all local.
+                    dst_data_region = proc_region
+
+                # Make resource.
+                # Note that DRAM bandwidth is not split here. We optimistically
+                # assume each layer can use the full DRAM bandwidth at
+                # different times. We adjust this assumption when calculating
+                # the segment timing.
+                rtpl += (self.resource._replace(
+                    proc_region=proc_region,
+                    src_data_region=src_data_region,
+                    dst_data_region=dst_data_region,
+                    no_time_mux=no_time_mux),)
+
+            assert len(rtpl) == len(ltpl)
+            self.alloc += (rtpl,)
+        assert len(self.alloc) == len(self.seg)
+
+        return True
+
+    def _alloc_proc(self, max_util_drop=0.05):
+        '''
+        Allocate processing subregions for the segment.
+
+        Return a list of processing subregions corresponding to the
+        first-level (spatially scheduled) layers in the segment. Return None
+        if the allocation fails.
+
+        `max_util_drop` specifies the maximum utilization drop due to
+        mismatched throughput between layers.
+        '''
+
+        # Spatial allocation.
+        proc_region = self.resource.proc_region
+        dim_nodes = proc_region.dim
+        total_nodes = dim_nodes.size()
+
+        # Number of operations of each spatial allocation.
+        ops = [sum(self.network[l].total_ops() for l in ltpl)
+               for ltpl in self.seg]
+
+        # Enforce a common factor among the numbers of nodes allocated to all
+        # vertices in the segment. Such a common factor is likely to be the
+        # common height of the vertex node regions.
+        common_factor_list = [cf for cf, _ in util.factorize(dim_nodes.h, 2)]
+
+        for cf in sorted(common_factor_list, reverse=True):
+            # Pick the largest common factor within the utilization constraint.
+
+            # The number of nodes of each vertex should be proportional to the
+            # number of ops of the vertex.
+            nodes_raw = [o * 1. / sum(ops) * total_nodes for o in ops]
+
+            # Round to the common factor multiples.
+            assert total_nodes % cf == 0
+            nodes = [max(1, int(round(nr / cf))) * cf for nr in nodes_raw]
+            # Fix the margin.
+            while sum(nodes) != total_nodes:
+                diff = [n - nr for n, nr in zip(nodes, nodes_raw)]
+                if sum(nodes) > total_nodes:
+                    # Decrease the nodes for the vertex with the maximum
+                    # positive difference.
+                    idx, _ = max(enumerate(diff), key=lambda tpl: tpl[1])
+                    nodes[idx] -= cf
+                else:
+                    # Increase the nodes for the vertex with the minimum
+                    # negative difference.
+                    idx, _ = min(enumerate(diff), key=lambda tpl: tpl[1])
+                    nodes[idx] += cf
+
+            if 0 in nodes:
+                continue
+
+            # Utilization.
+            time = max(o * 1. / n for o, n in zip(ops, nodes))
+            utilization = sum(ops) / time / sum(nodes)
+            assert utilization < 1 + 1e-6
+
+            if utilization >= 1 - max_util_drop:
+                # Found.
+                break
+
+        else:
+            # Not found.
+            return None
+
+        # Allocate in the processing region according to the number of nodes.
+        subregions = proc_region.allocate(nodes)
+        assert subregions
+        assert len(subregions) == len(self.seg)
+        if len(subregions) == 1:
+            assert subregions[0] == proc_region
+
+        return subregions
+
+    def _init_sym_cstrs(self):
+        '''
+        Initialize the symbolic scheduling constraints for the layers in the
+        segment, by constructing a nested list of dicts `cstr_symargs` whose
+        values can be symbolic expressions for the keyword arguments of the
+        layers in the segment, and a dict `cstr_symvals` mapping each symbol
+        to its possible numerical values.
+
+        Rules for constraints.
+
+        - Top BAT loop factor.
+
+        With a single layer, there is no constraint on the top BAT loop
+        factor. Otherwise all layers must share the same factor, namely
+        `topbat_shr`.
+
+        - Fmap forwarding and fully buffering.
+
+        Only CONV layers require fully buffering fmaps. Local-region layers
+        process data in a streaming manner.
+
+        Each CONV layer, and all local-region layers immediately following it
+        within the same spatial scheduling, are made into a group G.
+
+        (initial) If G is both the first spatial and the first temporal
+        scheduling with a CONV layer, it can choose whether to fully buffer
+        ofmaps or not. This is a configuration to explore, namely
+        `fbofm_init`. We decide its value by choosing the one that gives the
+        fewer fully buffered inter-spatial pairs on the critical forwarding
+        path, and the smaller maximum fully buffered data size.
+
+        (within-group) Within G, the CONV layer, and all local-region layers,
+        should use the same top OFM factors (IFM factors are automatically
+        determined by OFM factors in local-region layers), unless the CONV
+        ofmaps need to be fully buffered, in which case, the CONV layer and
+        the last layer in G fully buffer ofmaps (top OFM factor is 1), and the
+        other layers still use the same top OFM factors but can be different
+        from 1.
+
+        (inter-temporal) If G has a source from G' in the same spatial
+        scheduling (which must be immediately before G), G should fully buffer
+        ifmaps, and G' should fully buffer ofmaps.
+
+        (inter-spatial) If G has a source from G' in another spatial
+        scheduling (where the source must be the last temporal scheduling in
+        G' and that spatial scheduling),
+        (a) if G' already fully buffers ofmaps, make G fully buffer ifmaps.
+        (b) otherwise, make G fully buffer ofmaps (do not require G' to fully
+        buffer ifmaps; leave it to other rules, e.g. inter-temporal, to
+        decide); forward data between G' and G, by matching their top O/IFM
+        factors (biasing this case for smaller pipeline filling delay).
+        Notice the destination can be: (1) the leading CONV layer, whose top
+        IFM factor is constrained; (2) a local-region layer, where we
+        constrain the top OFM factors of this group (except when otherwise
+        constrained by fully buffering ofmaps).
+        '''
+        # pylint: disable=too-many-branches
+
+        # Symbolic variables mapping to numerical values.
+        symvals = dict()
+
+        # Top BAT loop factor.
+        topbat = symbols('topbat_shr', integer=True)
+        symvals[topbat] = [t for t, _ in util.factorize(self.batch_size, 2)]
+
+        # Whether the initial CONV layer fully buffers ofmaps.
+        fbofm_init = symbols('fbofm_init')
+        symvals[fbofm_init] = [False, True]
+
+        def _layer_topofm_vals(layer_name):
+            layer = self.network[layer_name]
+            # We require that the total ofmap size takes at least 5% of the
+            # gbuf capacity of a single node, to avoid too fine blocking.
+            tmax = layer.total_ofmap_size(self.batch_size) \
+                    / (0.05 * self.resource.size_gbuf)
+            vals = [t for t, _ in util.factorize(layer.nofm, 2)
+                    if t <= tmax or t == 1]
+            assert vals
+            return vals
+
+        def _layer_topifm_vals(layer_name):
+            layer = self.network[layer_name]
+            # We require that the total ifmap size takes at least 5% of the
+            # gbuf capacity of a single node, to avoid too fine blocking.
+            tmax = layer.total_ifmap_size(self.batch_size) \
+                    / (0.05 * self.resource.size_gbuf)
+            vals = [t for t, _ in util.factorize(layer.nifm, 2)
+                    if t <= tmax or t == 1]
+            assert vals
+            return vals
+
+        # Layer constraint kwargs.
+        symargs = [[{'topbat': topbat} for _ in ltpl] for ltpl in self.seg]
+
+        # Candidates for the critical forwarding path between spatial
+        # scheduling.
+        sp_crit_path_cands = set()
+        sp_crit_path_cands.add((0,))  # init with the first spatial.
+
+        # The last CONV layer index.
+        last_conv = PipelineSegment.SchedIndex(-1, 0)
+
+        # Whether the current group needs to fully buffer ofmaps. Delayed
+        # apply to the last layer in the group.
+        curr_fbofm = False
+
+        for sp_idx, ltpl in enumerate(self.seg):
+
+            # Initial topofm, in case of a non-CONV starting layer.
+            curr_topofm = symbols('topofm_{}_s'.format(sp_idx), integer=True)
+            symvals[curr_topofm] = _layer_topofm_vals(ltpl[0])
+
+            for tm_idx, l in enumerate(ltpl):
+
+                layer = self.network[l]
+                curr_sa = symargs[sp_idx][tm_idx]
+
+                # Neighbor source dependency.
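+                # A neighbor source is the last temporally scheduled layer of
+                # an earlier spatial scheduling (see _init_deps).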
+                nsrc_sa = None
+                src_deps = self.src_dict[sp_idx][tm_idx]
+                if any(s is not None for s in src_deps):
+                    assert len(src_deps) == 1
+                    nbr_src = src_deps[0]
+                    assert nbr_src.sp_idx < sp_idx
+                    nsrc_sa = symargs[nbr_src.sp_idx][nbr_src.tm_idx]
+                    assert nsrc_sa  # not empty, used to test nbr src exists.
+                    # Set critical path candidates.
+                    new_cands = set()
+                    for cand in sp_crit_path_cands:
+                        if cand[-1] == nbr_src.sp_idx:
+                            new_cands.add(cand + (sp_idx,))
+                    sp_crit_path_cands |= new_cands
+
+                if isinstance(layer, ConvLayer):
+                    # Conv layer.
+
+                    # The last group may require to fully buffer ofmaps.
+                    # Delayed apply to the immediately previous layer.
+                    if curr_fbofm is not False:
+                        assert last_conv >= (0, 0)
+                        if last_conv.sp_idx == sp_idx:
+                            assert tm_idx > 0
+                            lsrc_sa = symargs[sp_idx][tm_idx - 1]
+                        else:
+                            lsrc_sa = symargs[last_conv.sp_idx][-1]
+                        lsrc_sa['fbofm'] = curr_fbofm
+                    # Reset.
+                    curr_fbofm = False
+
+                    # New topofm for a new group.
+                    curr_topofm = symbols('topofm_{}_{}'.format(sp_idx, tm_idx),
+                                          integer=True)
+                    symvals[curr_topofm] = _layer_topofm_vals(l)
+
+                    # Set topofm.
+                    curr_sa['topofm'] = curr_topofm
+
+                    if sp_idx == last_conv.sp_idx:
+                        # Rule inter-temporal.
+                        assert tm_idx > 0
+                        # Make this group fully buffer ifmaps.
+                        curr_sa['fbifm'] = True
+                        # Make the last group fully buffer ofmaps.
+                        last_sa = symargs[sp_idx][last_conv.tm_idx]
+                        lsrc_sa = symargs[sp_idx][tm_idx - 1]
+                        last_sa['fbofm'] = True
+                        lsrc_sa['fbofm'] = True
+
+                    elif nsrc_sa:
+                        # Rule inter-spatial.
+                        # We only look at this rule when the inter-temporal
+                        # rule does not apply and the ifmaps of this group are
+                        # not yet required to be fully buffered.
+                        if not self.with_opt:
+                            # Basic pipelining requires fully buffering all
+                            # pairs of neighbor src/dst.
+                            nsrc_sa['fbofm'] = True
+                        nsrc_fbofm = nsrc_sa.get('fbofm', False)
+                        # (a): if the source already fully buffers ofmaps.
+                        # Make this group fully buffer ifmaps.
+                        curr_sa['fbifm'] = symeq(nsrc_fbofm, True)
+                        # (b)-(1): otherwise.
+                        # Make this group fully buffer ofmaps.
+                        curr_sa['fbofm'] = symeq(nsrc_fbofm, False)
+                        curr_fbofm = symeq(nsrc_fbofm, False)  # delayed apply.
+                        # Match top OFM/IFM factors.
+                        curr_sa['topifm'] = sympiecewise(
+                            (nsrc_sa['topofm'], symeq(nsrc_fbofm, False)),
+                            (curr_sa.get('topifm', 0), True))
+
+                    elif last_conv < (0, 0):
+                        # The first CONV layer.
+                        # Rule initial.
+                        curr_sa['fbofm'] = fbofm_init
+                        curr_fbofm = fbofm_init
+
+                    last_conv = PipelineSegment.SchedIndex(sp_idx, tm_idx)
+
+                else:
+                    # Non-Conv layer.
+
+                    if nsrc_sa:
+                        # Rule inter-spatial, (b)-(2).
+                        nsrc_fbofm = nsrc_sa.get('fbofm', False)
+                        curr_topofm = sympiecewise(
+                            (nsrc_sa['topofm'], symeq(nsrc_fbofm, False)),
+                            (curr_topofm, True))
+                        # Also backtrace this group.
+                        for bt_idx in range(last_conv.tm_idx, tm_idx):
+                            symargs[sp_idx][bt_idx]['topofm'] = curr_topofm
+
+                    # Rule within-group.
+                    curr_sa['topofm'] = curr_topofm
+
+                # If this layer has no on-chip destinations, cancel the
+                # requirement to fully buffer ofmaps.
+                if all(d is None for d in self.dst_dict[sp_idx][tm_idx]) \
+                        and tm_idx == len(ltpl) - 1:
+                    curr_sa.pop('fbofm', False)
+
+        # Simplify.
+        self._simplify_symargs(symargs, symvals)
+
+        # Get the critical forwarding path between spatial scheduling.
+        # The critical path has the longest forwarding chain.
+        sp_crit_path = max(sp_crit_path_cands, key=len)
+
+        # Check the maximum fully-buffering size, and decide fbofm_init.
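+        # A candidate value is discarded if, for any spatial scheduling, the
+        # fully buffered fmaps exceed the total gbuf capacity of its
+        # subregion; the inner loop then breaks and skips the for-else below.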
+        opt_val = None
+        opt_key = (float('inf'),) * 2  # (num of fb pairs, max fb size)
+        num_sp_fbs = 0
+        for val in symvals.get(fbofm_init, [False]):
+            subs_symargs = self._subs_symargs(symargs, fbofm_init, val)
+            maxsz = 0
+            numfb = 0
+            for sp_idx, (ltpl, atpl) in enumerate(zip(self.seg, subs_symargs)):
+                ms = max(itertools.chain(
+                    ((self.network[l].total_ofmap_size() if a.get('fbofm')
+                      else 0)
+                     + (self.network[l].total_ifmap_size() if a.get('fbifm')
+                        else 0)
+                     for l, a in zip(ltpl, atpl)),
+                    [0]))  # safe max with default.
+                if ms > self.alloc[sp_idx][0].proc_region.dim.size() \
+                        * self.alloc[sp_idx][0].size_gbuf:
+                    break
+                maxsz = max(maxsz, ms)
+                if sp_idx in sp_crit_path and atpl[-1].get('fbofm', False):
+                    numfb += 1
+            else:
+                key = (numfb, maxsz)
+                if key < opt_key:
+                    opt_val, opt_key = val, key
+                    num_sp_fbs = numfb
+        if opt_val is None:
+            return False
+        # Use the optimal value.
+        symvals[fbofm_init] = [opt_val]
+        self._simplify_symargs(symargs, symvals)
+
+        # Shared memory sources must have the same topifm.
+        for sh_idx_list in self.ifm_fwd_dict.values():
+            assert len(sh_idx_list) > 1
+            fet_sp_idx = sh_idx_list[0].sp_idx
+            sh_symarg_list = [symargs[idx.sp_idx][0] for idx in sh_idx_list]
+
+            # Must have no constraint on ifmaps access from memory.
+            assert all(not sa.get('fbifm', False) and not sa.get('topifm', 0)
+                       for sa in sh_symarg_list)
+
+            # Cannot constrain both topifm and topofm.
+            if any(sa.get('fbofm', False) or sa.get('topofm', 0)
+                   for sa in sh_symarg_list):
+                sh_kwargs = {'fbifm': True}
+            else:
+                topifm = symbols('topifm_{}'.format(fet_sp_idx), integer=True)
+                symvals[topifm] = _layer_topifm_vals(self.seg[fet_sp_idx][0])
+                sh_kwargs = {'topifm': topifm}
+
+            # Set constraints.
+            for sa in sh_symarg_list:
+                sa.update(sh_kwargs)
+
+        # Simplify.
+        self._simplify_symargs(symargs, symvals)
+
+        # Turn constraints into lazily updated rules.
+        self._lazify_topofm_symargs(symargs, symvals)
+        # Cannot simplify any more as update_dict is not sympifi-able.
+
+        # Sort the symbol dict.
+        symvals = OrderedDict(sorted(((s, symvals[s]) for s in symvals),
+                                     key=lambda item: str(item[0])))
+
+        if not symvals:
+            # Must add a dummy symbol so iterative substitution can happen.
+            symvals[symbols('_dummy')] = [None]
+
+        self.cstr_symargs = symargs
+        self.cstr_symvals = symvals
+        self.cstr_num_sp_fbs = num_sp_fbs
+        try:
+            self.cstr_topbat_idx = list(symvals.keys()).index(topbat)
+        except ValueError:
+            self.cstr_topbat_idx = None
+
+        return True
+
+    @staticmethod
+    def _simplify_symargs_one_pass(symargs, symvals):
+        '''
+        Simplify symargs and symvals in-place:
+        - If fbi/ofm is False, then remove it.
+        - If fbi/ofm is True, then remove topi/ofm.
+        - If a symbol can take only one value, then substitute it.
+        - If a symbol only occurs once, then remove its constraint.
+
+        Return whether the symargs and symvals are already simplified.
+        '''
+        for a in itertools.chain.from_iterable(symargs):
+            is_fbifm = a.get('fbifm')
+            is_fbofm = a.get('fbofm')
+            # pylint: disable=singleton-comparison
+            # lhs may be symbolic, see
+            # docs.sympy.org/latest/modules/logic.html#sympy.logic.boolalg.BooleanTrue
+            if is_fbifm == True:
+                a.pop('topifm', 0)
+            if is_fbifm == False:
+                a.pop('fbifm', False)
+            if is_fbofm == True:
+                a.pop('topofm', 0)
+            if is_fbofm == False:
+                a.pop('fbofm', False)
+
+        subs_dict = {}
+
+        # Possible values for symbols.
+        subs_dict.update(
+            (s, symvals[s][0]) for s in symvals if len(symvals[s]) == 1)
+
+        # Count the occurrence of symbols in all args (values).
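+        # A symbol that occurs in at most one arg couples no two layers, so
+        # its constraint can be dropped: top factors fall back to 0 (i.e.,
+        # unconstrained), and fully-buffering flags fall back to False.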
+        symcnts = Counter(
+            s for a in itertools.chain.from_iterable(symargs)
+            for val in a.values() for s in symtuple(val).free_symbols)
+        assert set(symcnts.keys()).issubset(symvals.keys())
+        subs_dict.update((s, None)
+                         for s in set(symvals.keys()) - set(symcnts.keys()))
+        subs_dict.update((s, 0 if str(s).startswith('top') else False)
+                         for s in symcnts if symcnts[s] <= 1)
+
+        # Substitute symbols and remove them from the symbol dict.
+        for a in itertools.chain.from_iterable(symargs):
+            for k in a:
+                a[k] = symtuple(a[k]).subs(subs_dict)[0]
+        for s in subs_dict:
+            del symvals[s]
+
+        return not subs_dict
+
+    def _simplify_symargs(self, symargs, symvals):
+        ''' Simplify symargs and symvals in-place iteratively. '''
+        while not self._simplify_symargs_one_pass(symargs, symvals):
+            pass
+        used_syms = symtuple(
+            *[symtuple(*a.values())
+              for a in itertools.chain.from_iterable(symargs)]).free_symbols
+        assert set(used_syms) == set(symvals.keys())
+        assert all(val for val in symvals.values())
+
+    @staticmethod
+    def _subs_symargs(symargs, *subs_args):
+        '''
+        Substitute symbols. The additional arguments are passed to subs().
+
+        Return a new substituted copy without modifying the original one.
+        '''
+        # sympify=False is necessary because there may be str in the values.
+        return [[dict((k, symtuple(a[k], sympify=False).subs(*subs_args)[0])
+                      for k in a) for a in atpl] for atpl in symargs]
+
+    class TopOfmUpdateLambda(symbasic):
+        ''' A sympifi-able lambda function to lazily update topofm. '''
+        def __new__(cls, *args):
+            return super(PipelineSegment.TopOfmUpdateLambda, cls).__new__(cls)
+        def __call__(self, arg_s, arg_r):
+            setattr(arg_s, 'topofm', arg_r.scheme['to'][0])
+
+    def _lazify_topofm_symargs(self, symargs, symvals):
+        '''
+        Turn qualified topofm constraints into lazily updated rules.
+
+        If a symbol is only used as the topofm constraint by a single CONV
+        layer and some local-region layers, we can turn it into a lazily
+        updated rule.
+        '''
+        sym2conv = {}  # symbol --> the only CONV layer using it.
+        sym2lrs = {}   # symbol --> list of local-region layers using it.
+        unqual_syms = set()  # symbols used by two or more CONV layers.
+        for l, a in zip(itertools.chain.from_iterable(self.seg),
+                        itertools.chain.from_iterable(symargs)):
+            layer = self.network[l]
+            if isinstance(layer, ConvLayer):
+                topofm = a.get('topofm', 0)
+                topifm = a.get('topifm', 0)
+                for s in symtuple(topofm, topifm).free_symbols:
+                    if s not in unqual_syms:
+                        if s in sym2conv:
+                            # If a symbol is used in two CONV layers, it
+                            # cannot be lazily updated.
+                            del sym2conv[s]
+                            sym2lrs.pop(s, [])
+                            unqual_syms.add(s)
+                        elif topofm == s:
+                            assert s not in sym2lrs
+                            sym2conv[s] = l
+            else:
+                topofm = a.get('topofm', 0)
+                if topofm in sym2conv:
+                    sym2lrs.setdefault(topofm, []).append(l)
+        assert 0 not in sym2conv and 0 not in sym2lrs
+
+        syms = sym2conv.keys()  # symbols to be lazily updated.
+        lr2conv = {}  # local-region layer to the CONV layer constraining it.
+        for s in syms:
+            for lr in sym2lrs.get(s, []):
+                lr2conv[lr] = sym2conv[s]
+        lconvs = set(lr2conv.values())  # CONV layers whose topofm is removed.
+
+        for l, a in zip(itertools.chain.from_iterable(self.seg),
+                        itertools.chain.from_iterable(symargs)):
+            if l in lconvs:
+                # Remove the CONV topofm.
+                assert sym2conv[a['topofm']] == l
+                del a['topofm']
+            elif l in lr2conv:
+                # Link the local-region layer to the CONV layer.
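+                # Its topofm will then be set lazily from the CONV layer's
+                # scheduling result through update_dict (see
+                # TopOfmUpdateLambda above).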
+                lconv = lr2conv[l]
+                assert sym2conv[a['topofm']] == lconv
+                del a['topofm']
+                a['update_dict'] = {
+                    lconv: PipelineSegment.TopOfmUpdateLambda()}
+
+        for s in syms:
+            del symvals[s]
+
diff --git a/nn_dataflow/core/pipeline_segment_timing.py b/nn_dataflow/core/pipeline_segment_timing.py
new file mode 100644
index 0000000..c1e4d07
--- /dev/null
+++ b/nn_dataflow/core/pipeline_segment_timing.py
@@ -0,0 +1,233 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+"""
+
+from collections import namedtuple, OrderedDict
+
+from . import loop_enum as le
+from .loop_blocking_scheme import LoopBlockingScheme
+from .layer import ConvLayer
+from .network import Network
+
+class PipelineSegmentTiming(object):
+    ''' Timing information of a pipeline segment. '''
+
+    # Each layer timing info is a tuple:
+    # - time: the total time.
+    # - node_time: the total time on node processing.
+    # - dram_time: the total time on DRAM access.
+    # - num_nodes: the number of processing nodes.
+    # - ngrp: the OFM group number.
+    # - ts_xb: when to start.
+    # - td_xb: when the first BAT group of this and all prev layers is done.
+    # Time is stored multiplied by the lazily updated BAT group number (_xb).
+    # Notice that (td - ts) may be greater than (time), because fused layers
+    # can have an earlier start time, but done time is sequentially
+    # accumulated.
+    LayerTiming = namedtuple('LayerTiming', ['time', 'node_time', 'dram_time',
+                                             'num_nodes', 'ngrp',
+                                             'ts_xb', 'td_xb'])
+
+    def __init__(self, network, seg_idx):
+
+        if not isinstance(network, Network):
+            raise TypeError('PipelineSegmentTiming: network must be a '
+                            'Network instance.')
+        self.network = network
+
+        # Scheduling sequence number.
+        self.seg_idx = seg_idx
+        self.last_sched_seq = None
+
+        # Time properties.
+        # The time on DRAM accesses.
+        self.dram_time = 0
+        # The time on node processing.
+        self.node_time = 0
+        # The critical (longest) spatial scheduling time.
+        self.critical_time = 0
+
+        # Mapping from layer name to spatial and temporal indices.
+        self.layer2idx = OrderedDict()
+
+        # The number of groups of which BAT are sequentially processed, i.e.,
+        # the degree of batch pipelining, shared by all layers in the segment.
+        # Lazily updated.
+        self.bat_ngrp = None
+
+        # Timing of each layer, indexed by spatial and temporal indices.
+        self.timing_list = []
+
+    @property
+    def time(self):
+        ''' The total time of the end-to-end segment processing. '''
+        return max(self.node_time, self.dram_time)
+
+    @property
+    def time_overhead(self):
+        '''
+        The time overhead as a percentage, to process the layers in the
+        segment compared to processing the layers individually.
+        '''
+        total_num_nodes = sum(tlist[0].num_nodes
+                              for tlist in self.timing_list)
+        # Sum up the max of scaled node time and DRAM time.
+        time_indv = sum(max(1. * timing.node_time * timing.num_nodes
* timing.node_time * timing.num_nodes + / total_num_nodes, + timing.dram_time) + for tlist in self.timing_list + for timing in tlist) + return (self.time - time_indv) / time_indv + + def add(self, layer_name, sched_result): + ''' Add the SchedulingResult of a new layer. ''' + + sched_seq = sched_result.sched_seq + + if sched_seq[0] != self.seg_idx: + raise ValueError('PipelineSegmentTiming: sched_seq {} does not ' + 'belong to segment {}.' + .format(sched_seq, self.seg_idx)) + + if sched_seq == self._sched_seq_incr(1): + # New spatial scheduling. + self.timing_list.append([]) + elif sched_seq == self._sched_seq_incr(2): + # New temporal scheduling. + pass + else: + raise ValueError('PipelineSegmentTiming: sched_seq {} cannot ' + 'follow {}' + .format(sched_seq, self.last_sched_seq)) + self.last_sched_seq = sched_seq + + if layer_name in self.layer2idx: + raise ValueError('PipelineSegmentTiming: layer {} already in ' + 'segment, old sched_seq {}, new sched_seq {}.' + .format(layer_name, self.layer2idx[layer_name], + sched_seq[1:])) + self.layer2idx[layer_name] = sched_seq[1:] + + # Add layer timing. + + timing = self._make_layer_timing(layer_name, sched_result) + assert not self.timing_list[-1] \ + or timing.num_nodes == self.timing_list[-1][-1].num_nodes + self.timing_list[-1].append(timing) + assert self.last_sched_seq[1] + 1 == len(self.timing_list) + assert self.last_sched_seq[2] + 1 == len(self.timing_list[-1]) + + # Update time. + + # Critical time, as the longest of all spatial scheduling. + assert all(sum(timing.time for timing in tlist) + <= tlist[-1].td_xb - tlist[0].ts_xb + for tlist in self.timing_list) + self.critical_time = max(tlist[-1].td_xb - tlist[0].ts_xb + for tlist in self.timing_list) + + # DRAM time. + # Each layer DRAM time is calculated using the layer accesses and the + # maximum bandwidth. Accumulating the accesses is accumulating the + # time. + self.dram_time += sched_result.total_dram_time + + # Node time, as the max of end time of the last BAT group. + # The interval between BAT groups is determined by the critical time of + # one BAT group. + self.node_time = max((tlist[-1].td_xb + + self.critical_time * (self.bat_ngrp - 1)) + // self.bat_ngrp + for tlist in self.timing_list) + assert self.node_time >= self.critical_time + + def _sched_seq_incr(self, pos): + ''' Get the next sched seq incremented at the given position. ''' + if not self.last_sched_seq: + return (self.seg_idx, 0, 0) + assert len(self.last_sched_seq) == 3 + return self.last_sched_seq[:pos] + (self.last_sched_seq[pos] + 1,) \ + + (0,) * (2 - pos) + + def _make_layer_timing(self, layer_name, sched_result): + ''' Construct and return the layer timing. ''' + # Top-level ordered loops, from outermost to innermost. + ord_loops = LoopBlockingScheme.ordered_loops( + sched_result.scheme['tvals'][0], sched_result.scheme['orders'][0]) + + # Top loop blocking factors. + top_ts = [1] * le.NUM + if ord_loops and ord_loops[0][0] == le.BAT: + top_ts[le.BAT] = ord_loops.pop(0)[1] + if ord_loops: + lpe, t = ord_loops.pop(0) + assert lpe == le.IFM or lpe == le.OFM + top_ts[lpe] = t + + # Lazily update BAT group number. + if not self.bat_ngrp: + self.bat_ngrp = top_ts[le.BAT] + elif self.bat_ngrp != top_ts[le.BAT]: + # Unmatched. + self.bat_ngrp = 1 + + # IFM/OFM group number. + ifm_ngrp, ofm_ngrp = top_ts[le.IFM], top_ts[le.OFM] + + # Time on node processing and DRAM access. + node_time = sched_result.total_node_time + dram_time = sched_result.total_dram_time + # Number of nodes. 
+ num_nodes = sched_result.num_nodes + + # Calculate timing. + sp_idx, tm_idx = self.layer2idx[layer_name] + is_conv = isinstance(self.network[layer_name], ConvLayer) + time = sched_result.total_time + ts_xb = 0 + td_xb = 0 + for p in self.network.prevs(layer_name): + if p not in self.layer2idx: + # Off-chip source. + continue + # On-chip source. + p_sp_idx, p_tm_idx = self.layer2idx[p] + p_timing = self.timing_list[p_sp_idx][p_tm_idx] + if p_sp_idx == sp_idx: + assert p_tm_idx == tm_idx - 1 + # Same spatial scheduling. + if not is_conv and ofm_ngrp == p_timing.ngrp: + # Fused. + start = p_timing.ts_xb + p_timing.time // p_timing.ngrp + else: + # Not fused. + start = p_timing.td_xb + # Also constrain the done time. + td_xb = p_timing.td_xb + time + else: + assert p_sp_idx < sp_idx + assert p_tm_idx == len(self.timing_list[p_sp_idx]) - 1 + # Previous spatial scheduling. + if (ifm_ngrp if is_conv else ofm_ngrp) == p_timing.ngrp: + # I/OFM group forwarding. + start = p_timing.ts_xb + p_timing.time // p_timing.ngrp + else: + # All I/OFM double buffering. + start = p_timing.td_xb + ts_xb = max(ts_xb, start) + td_xb = max(td_xb, ts_xb + time) + + return PipelineSegmentTiming.LayerTiming( + time=time, node_time=node_time, dram_time=dram_time, + num_nodes=num_nodes, ngrp=ofm_ngrp, ts_xb=ts_xb, td_xb=td_xb) + diff --git a/nn_dataflow/core/resource.py b/nn_dataflow/core/resource.py index bcad270..73c4fd7 100644 --- a/nn_dataflow/core/resource.py +++ b/nn_dataflow/core/resource.py @@ -28,6 +28,7 @@ 'size_regf', 'array_bus_width', 'dram_bandwidth', + 'no_time_mux', ] class Resource(namedtuple('Resource', RESOURCE_LIST)): @@ -79,5 +80,8 @@ def __new__(cls, *args, **kwargs): if ntp.dram_bandwidth <= 0: raise ValueError('Resource: dram_bandwidth must be positive.') + if not isinstance(ntp.no_time_mux, bool): + raise TypeError('Resource: no_time_mux must be boolean') + return ntp diff --git a/nn_dataflow/core/scheduling.py b/nn_dataflow/core/scheduling.py index 1cf9b70..0f1398b 100644 --- a/nn_dataflow/core/scheduling.py +++ b/nn_dataflow/core/scheduling.py @@ -20,6 +20,7 @@ from . import data_category_enum as de from . import loop_blocking from . import loop_enum as le +from . import mem_hier_enum as me from . import partition from .. import util from .cost import Cost @@ -28,10 +29,13 @@ from .layer import Layer from .map_strategy import MapStrategy from .resource import Resource +from .scheduling_constraint import SchedulingConstraint class SchedulingCondition(namedtuple('SchedulingCondition', ['resource', + 'constraint', 'ifmap_layout', + 'sched_seq', ])): ''' Layer scheduling condition. @@ -43,9 +47,17 @@ def __new__(cls, *args, **kwargs): if not isinstance(ntp.resource, Resource): raise TypeError('SchedulingCondition: resource must be ' 'a Resource instance.') + if not isinstance(ntp.constraint, SchedulingConstraint): + raise TypeError('SchedulingCondition: constraint must be ' + 'a SchedulingConstraint instance.') if not isinstance(ntp.ifmap_layout, DataLayout): raise TypeError('SchedulingCondition: ifmap_layout must be ' 'a DataLayout instance.') + if not isinstance(ntp.sched_seq, tuple): + raise TypeError('SchedulingCondition: sched_seq must be a tuple.') + if len(ntp.sched_seq) != 3: + raise ValueError('SchedulingCondition: sched_seq must have ' + '(segment, spatial, temporal) 3 indices.') return ntp @@ -53,6 +65,7 @@ def __new__(cls, *args, **kwargs): class SchedulingResult(namedtuple('SchedulingResult', ['scheme', 'ofmap_layout', + 'sched_seq', ])): ''' Layer scheduling result. 
@@ -67,6 +80,11 @@ def __new__(cls, *args, **kwargs): if not isinstance(ntp.ofmap_layout, DataLayout): raise TypeError('SchedulingResult: ofmap_layout must be ' 'a DataLayout instance.') + if not isinstance(ntp.sched_seq, tuple): + raise TypeError('SchedulingResult: sched_seq must be a tuple.') + if len(ntp.sched_seq) != 3: + raise ValueError('SchedulingResult: sched_seq must have ' + '(segment, spatial, temporal) 3 indices.') return ntp @@ -103,7 +121,9 @@ def total_ops(self): @property def total_accesses(self): ''' Get the total accesses at all memory hierarchies as a list. ''' - return [sum(acc) for acc in self.scheme['access']] + accesses = [sum(acc) for acc in self.scheme['access']] + accesses[me.GBUF] += sum(self.scheme['remote_gbuf_access']) + return accesses @property def total_noc_hops(self): @@ -160,7 +180,8 @@ def schedule_search(self, condition, options): # Ifmap layout. ifmap_layout = condition.ifmap_layout - if not ifmap_layout.is_in(resource.src_data_region): + # Ifmap should be from the source data region or local. + if not ifmap_layout.is_in(resource.src_data_region, proc_region): raise ValueError('Scheduling: ifmap layout is not contained in ' 'source data region.') ifrng = ifmap_layout.complete_fmap_range() @@ -180,7 +201,7 @@ def schedule_search(self, condition, options): guaranteed=True): # Explore single-node schedules. lbs_tops = list(self.schedule_search_per_node( - part, resource, options)) + part, resource, condition.constraint, options)) if not lbs_tops: continue @@ -201,7 +222,8 @@ def schedule_search(self, condition, options): filter_nodes, ifmap_layout, ofmap_layout, options) # Make scheduling result. - tops += [self._get_result(lbs, part, ofmap_layout, unit_nhops) + tops += [self._get_result(lbs, part, ofmap_layout, + condition.sched_seq, unit_nhops) for lbs in lbs_tops] # Pick the top n. @@ -231,7 +253,7 @@ def cache_stats(self): return (info.hits, info.misses) @fastcache.clru_cache(maxsize=1024) - def schedule_search_per_node(self, part, resource, options): + def schedule_search_per_node(self, part, resource, constraint, options): ''' Search the best mapping strategies and loop blocking schemes for a single node after partitioning. Return the top LoopBlockingScheme @@ -252,14 +274,15 @@ def schedule_search_per_node(self, part, resource, options): # Explore loop blocking schemes. for lbs in loop_blocking.gen_loopblocking( - nested_loop_desc, resource, self.cost, options): + nested_loop_desc, resource, part, constraint, self.cost, + options): if lbs.is_valid(): lbs_tops.append(lbs) return lbs_tops - def _get_result(self, lbs, part, ofmap_layout, unit_nhops): + def _get_result(self, lbs, part, ofmap_layout, sched_seq, unit_nhops): ''' Make the schedule result from loop blocking and partitioning. ''' @@ -268,8 +291,13 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): # Cost components. cost_access = lbs.get_access_cost(self.cost) - total_nhops = [unh * f for unh, f - in zip(unit_nhops, lbs.get_top_level_fetch())] + # Inter-node data forwarding/rotation hops. + node_nhops = lbs.get_noc_access() + # Memory access hops. + mem_nhops = [unh * f for unh, f + in zip(unit_nhops, lbs.get_top_level_fetch())] + # Total hops = inter-node hops + memory hops. 
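+        # (Illustrative arithmetic, with made-up numbers: for one data
+        # category, unit_nhops = 10 and a top-level fetch count of 4 give
+        # mem_nhops = 40; with node_nhops = 15 from forwarding/rotation,
+        # total_nhops = 15 + 40 = 55.)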
+ total_nhops = [nnh + mnh for nnh, mnh in zip(node_nhops, mem_nhops)] cost_noc = self.cost.noc_hop * sum(total_nhops) cost_op = self.cost.mac_op * lbs.ops @@ -283,6 +311,7 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): scheme['time'] = lbs.time scheme['ops'] = lbs.ops scheme['num_nodes'] = lbs.num_nodes + scheme['is_dram'] = (lbs.src_is_dram, lbs.dst_is_dram) scheme['cost_op'] = cost_op scheme['cost_access'] = cost_access scheme['cost_noc'] = cost_noc @@ -291,6 +320,7 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): scheme['bus_time'] = lbs.bus_time scheme['dram_time'] = lbs.dram_time scheme['access'] = lbs.get_access() + scheme['remote_gbuf_access'] = lbs.remote_gbuf_access scheme['total_nhops'] = total_nhops scheme['fetch'] = lbs.fetch @@ -305,10 +335,23 @@ def _get_result(self, lbs, part, ofmap_layout, unit_nhops): for bl in range(lbs.BL.NUM)] scheme['unit_size'] = lbs.unit_size scheme['unit_cnt'] = lbs.unit_cnt + scheme['accfwd_reduction'] = lbs.accfwd_reduction + scheme['bufshr_grp_size'] = lbs.bufshr_grp_size + scheme['bufshr_subgrp_size'] = lbs.bufshr_subgrp_size + scheme['bufshr_bs_t'] = lbs.bufshr_bs_t + scheme['bufshr_bs_ord'] = lbs.bufshr_bs_ord + scheme['bufshr_rot_fetch'] = lbs.bufshr_rot_fetch + scheme['bufshr_rot_round_cnt'] = lbs.bufshr_rot_round_cnt + scheme['bufshr_rot_unit_cnt'] = lbs.bufshr_rot_unit_cnt + scheme['bufshr_wide_fetch'] = lbs.bufshr_wide_fetch + scheme['bufshr_wide_fetch_width'] = lbs.bufshr_wide_fetch_width # Partitioning. scheme['part'] = part + scheme['mem_nhops'] = mem_nhops + scheme['node_nhops'] = node_nhops scheme['unit_nhops'] = unit_nhops - return SchedulingResult(scheme=scheme, ofmap_layout=ofmap_layout) + return SchedulingResult(scheme=scheme, ofmap_layout=ofmap_layout, + sched_seq=sched_seq) diff --git a/nn_dataflow/core/scheduling_constraint.py b/nn_dataflow/core/scheduling_constraint.py new file mode 100644 index 0000000..3f8f6fe --- /dev/null +++ b/nn_dataflow/core/scheduling_constraint.py @@ -0,0 +1,190 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import numbers + +from . import loop_enum as le +from .. import util +from .loop_blocking_scheme import LoopBlockingScheme + +class SchedulingConstraint(util.ContentHashClass): + ''' + Layer scheduling constraint, which constrains top loop blocking factors. + ''' + + def __init__(self, topbat=0, topifm=0, topofm=0, update_dict=None): + ''' + `topbat`, `topifm`, `topofm` specify the top-level loop blocking + factors. + + `update_dict` specifies lazily updated rules to refine the constraint + with previous scheduling results. It should be a mapping, from previous + layer name to a function which takes two arguments: self, and the + SchedulingResult instance of that layer. 
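+
+        For illustration only (a sketch; 'prev' is a hypothetical layer
+        name):
+
+            SchedulingConstraint(
+                topbat=4,
+                update_dict={'prev': lambda self, res: setattr(
+                    self, 'topofm', res.scheme['to'][0])})
+
+        fixes the top BAT blocking factor to 4, and lazily sets `topofm` from
+        the scheduling result of layer 'prev' once that result is available.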
+        '''
+        if any(n < 0 or not isinstance(n, numbers.Integral)
+               for n in [topbat, topifm, topofm]):
+            raise ValueError('SchedulingConstraint: '
+                             'constrained factors must be non-negative '
+                             'integers.')
+
+        if not update_dict:
+            update_dict = {}
+        if not isinstance(update_dict, dict):
+            raise TypeError('SchedulingConstraint: '
+                            'update_dict must be a dict instance.')
+        update_dict = util.HashableDict.fromdict(update_dict)
+        for val in update_dict.values():
+            if not callable(val):
+                raise TypeError('SchedulingConstraint: '
+                                'values in update_dict must be callable.')
+
+        self.topbat = topbat
+        self.topifm = topifm
+        self.topofm = topofm
+        self.update_dict = update_dict
+
+    def is_valid_top_bl(self, top_bl_t, top_bl_ord):
+        '''
+        Whether the given `top_bl_t` and `top_bl_ord` are valid under the
+        constraint.
+        '''
+        if self.update_dict:
+            raise ValueError('SchedulingConstraint: update_dict is not empty, '
+                             'rules have not been updated.')
+
+        if self.topbat and self.topbat != top_bl_t[le.BAT]:
+            return False
+        if self.topifm and self.topifm != top_bl_t[le.IFM]:
+            return False
+        if self.topofm and self.topofm != top_bl_t[le.OFM]:
+            return False
+
+        del top_bl_ord
+
+        return True
+
+    def is_valid_part(self, part):
+        '''
+        Whether the given `part` is valid under the constraint.
+        '''
+        # pylint: disable=unused-argument
+        if self.update_dict:
+            raise ValueError('SchedulingConstraint: update_dict is not empty, '
+                             'rules have not been updated.')
+
+        return True
+
+    def filter_gen_ts(self, gen_tifm, gen_tofm, gen_tbat):
+        ''' Get the filtered generators for loop blocking factors. '''
+        return self._filter_gen(gen_tifm, self.topifm), \
+               self._filter_gen(gen_tofm, self.topofm), \
+               self._filter_gen(gen_tbat, self.topbat)
+
+    def update_by_prev(self, prev_results):
+        '''
+        Use the rules specified by `update_dict` to update the constraint,
+        based on the previous layer scheduling results `prev_results`, a
+        mapping from previous layer name to SchedulingResult instance.
+        '''
+        for layer_name in self.update_dict:
+            self.update_dict[layer_name](self, prev_results[layer_name])
+        self.update_dict = util.HashableDict()  # Clear the updated rules.
+
+    @staticmethod
+    def _filter_gen(gen, topt=0):
+        ''' Get a new generator which filters the top factor. '''
+        for tpl in gen:
+            if topt == 0 or tpl[0] == topt:
+                yield tpl
+
+    def __repr__(self):
+        return '{}({})'.format(
+            self.__class__.__name__,
+            ', '.join(['{}={}'.format(k, repr(v))
+                       for k, v in self.__dict__.items()]))
+
+
+class SchedulingConstraintLayerPipeline(SchedulingConstraint):
+    '''
+    Layer scheduling constraint for inter-layer pipelining.
+
+    The constraint includes:
+    - topbat: top BAT loop blocking factor, which decides the number of groups
+      for batch pipelining. It must match across all layers in a pipeline
+      segment.
+    - topifm/topofm: top IFM/OFM loop blocking factors, which decide the
+      number of groups for fmap data forwarding between adjacent spatially
+      scheduled layers in a pipeline segment. They must match between
+      forwarding source/destination layers.
+    - fbifm/fbofm: whether to fully buffer the fmap data of the layer on-chip.
+      This is the baseline double-buffering between pipelined layers.
+
+    For loop orders, the BAT loop must be the outermost for batch pipelining.
+    The loop associated with the forwarded data (IFM or OFM) must follow as
+    the second outermost. If a data category (IFM or OFM) is fully buffered,
+    the corresponding loop is a trivial loop, which can be placed anywhere.
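+
+    For illustration only, a sketch of the intended use:
+
+        SchedulingConstraintLayerPipeline(topbat=2, fbofm=True)
+
+    requires the top BAT factor to be 2 (two batch pipelining groups), forces
+    topofm = 1 (OFM fully buffered), and accepts a scheme only if the BAT
+    loop is the outermost non-trivial top-level loop.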
+ ''' + + def __init__(self, topbat=0, topifm=0, topofm=0, fbifm=False, fbofm=False, + update_dict=None): + + if fbifm: + # Fully-buffered IFM <=> topifm = 1. + if topifm != 0 and topifm != 1: + raise ValueError('SchedulingConstraintLayerPipeline: ' + 'fully-buffered IFM implies topifm = 1.') + topifm = 1 + + if fbofm: + # Fully-buffered OFM <=> topofm = 1. + if topofm != 0 and topofm != 1: + raise ValueError('SchedulingConstraintLayerPipeline: ' + 'fully-buffered OFM implies topofm = 1.') + topofm = 1 + + if topifm > 1 and topofm > 1: + raise ValueError('SchedulingConstraintLayerPipeline: ' + 'impossible to have both topifm and topofm > 1, ' + 'at least one of IFM and OFM must be a trivial ' + 'loop (= 1) or not constrained (= 0).') + + super(SchedulingConstraintLayerPipeline, self).__init__( + topbat=topbat, topifm=topifm, topofm=topofm, + update_dict=update_dict) + + def is_valid_top_bl(self, top_bl_t, top_bl_ord): + + if not super(SchedulingConstraintLayerPipeline, self).is_valid_top_bl( + top_bl_t, top_bl_ord): + return False + + # Loop orders. + # Ordered loops from outer to inner. + ord_lpe = LoopBlockingScheme.ordered_loops(top_bl_t, top_bl_ord, + lpe_only=True) + if self.topbat > 1: + if ord_lpe.pop(0) != le.BAT: + return False + # topifm and topofm cannot trigger together. + if self.topifm > 1: + if ord_lpe.pop(0) != le.IFM: + return False + if self.topofm > 1: + if ord_lpe.pop(0) != le.OFM: + return False + + return True + diff --git a/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py b/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py index 559bbf6..82b72b6 100644 --- a/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py +++ b/nn_dataflow/tests/dataflow_test/test_nn_dataflow.py @@ -18,9 +18,10 @@ import StringIO from nn_dataflow.core import Cost -from nn_dataflow.core import InputLayer, FCLayer +from nn_dataflow.core import InputLayer, ConvLayer, FCLayer from nn_dataflow.core import MapStrategy, MapStrategyEyeriss from nn_dataflow.core import MemHierEnum as me +from nn_dataflow.core import Network from nn_dataflow.core import NodeRegion from nn_dataflow.core import NNDataflow from nn_dataflow.core import Option @@ -37,6 +38,25 @@ def setUp(self): self.alex_net = import_network('alex_net') self.vgg_net = import_network('vgg_net') + net = Network('simple') + net.set_input_layer(InputLayer(4, 2)) + net.add('1', ConvLayer(4, 4, 2, 1)) + net.add('2', ConvLayer(4, 4, 2, 1)) + # Two more layers to avoid single-segment case. + net.add('a1', ConvLayer(4, 1, 1, 1, strd=2)) + net.add('a2', ConvLayer(1, 1, 1, 1)) + self.simple_net = net + + net = Network('complex') + net.set_input_layer(InputLayer(8, 8)) + net.add('1', ConvLayer(8, 8, 8, 1)) + net.add('2a', ConvLayer(8, 8, 8, 1), prevs=('1',)) + net.add('3a', ConvLayer(8, 8, 8, 1)) + net.add('2b', ConvLayer(8, 8, 8, 1), prevs=('1',)) + net.add('3b', ConvLayer(8, 8, 8, 1)) + net.add('4', ConvLayer(16, 8, 8, 1), prevs=('3a', '3b')) + self.complex_net = net + self.map_strategy = MapStrategyEyeriss self.resource = Resource(proc_region=NodeRegion(origin=PhyDim2(0, 0), @@ -56,6 +76,7 @@ def setUp(self): size_regf=512 // 2, # 512 B array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) self.cost = Cost(mac_op=1, @@ -127,6 +148,144 @@ def test_verbose(self): for layer in network: self.assertIn(layer, stderr_value) + def test_pipelining(self): + ''' Pipelining. 
''' + network = self.alex_net + batch_size = 1 + + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True) + nnd = NNDataflow(network, batch_size, self.resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fast_forward_infeasible(self): + ''' Enter fast forward due to infeasible constraint. ''' + network = self.simple_net + batch_size = 1 + + # Very small gbuf size. Small fmap tpart is infeasible. + resource = self.resource._replace( + dim_array=PhyDim2(2, 2), + size_gbuf=16) + + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + # No pipelining is feasible. + for dtfl in tops: + self.assertTupleEqual(dtfl['1'].sched_seq, (0, 0, 0)) + self.assertTupleEqual(dtfl['2'].sched_seq, (1, 0, 0)) + + def test_fast_forward_found(self): + ''' Enter fast forward due to early found. ''' + network = self.simple_net + batch_size = 1 + + # No time overhead limit. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=float('inf')) + nnd = NNDataflow(network, batch_size, self.resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fast_forward_crit_time(self): + ''' Enter fast forward due to long critical time. ''' + network = self.simple_net + batch_size = 1 + + # Multiple nodes for spatial pipelining. + resource = self.resource._replace( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + dim_array=PhyDim2(1, 1), + ) + + # Very strict time overhead limit. + # At large fmap tpart, utilization decreases and critical time would + # increase. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=1e-3) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fast_forward_frontier(self): + ''' Enter fast forward due to off-frontier. ''' + network = self.simple_net + batch_size = 16 + + # Multiple nodes for spatial pipelining. + resource = self.resource._replace( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + dim_array=PhyDim2(2, 2), + ) + + # No time overhead limit. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=float('inf')) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_fmap_fwd(self): + ''' + Fmap forward with shared mem sources or both on/off-chip destinations. + ''' + network = self.complex_net + batch_size = 16 + + # Multiple nodes for spatial pipelining. + resource = self.resource._replace( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + ) + + # No time overhead limit. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True, + layer_pipeline_time_ovhd=float('inf')) + nnd = NNDataflow(network, batch_size, resource, self.cost, + self.map_strategy) + + tops, _ = nnd.schedule_search(options) + self.assertTrue(tops) + + def test_sched_instance_sharing(self): + ''' Scheduling instance sharing between layers. 
''' + network = self.alex_net + batch_size = 1 + + nnd = NNDataflow(network, batch_size, self.resource, self.cost, + self.map_strategy) + + self.assertIs(nnd.layer_sched_dict['conv1_a'], + nnd.layer_sched_dict['conv1_b']) + self.assertIs(nnd.layer_sched_dict['conv2_a'], + nnd.layer_sched_dict['conv2_b']) + self.assertIs(nnd.layer_sched_dict['pool1_a'], + nnd.layer_sched_dict['pool1_b']) + def test_opt_goal(self): ''' Optimization goal. ''' network = self.alex_net @@ -206,22 +365,23 @@ def test_no_valid_dataflow(self): # Very small REGF. self.resource = Resource(proc_region=NodeRegion(origin=PhyDim2(0, 0), - dim=PhyDim2(1, 1), + dim=PhyDim2(4, 4), type=NodeRegion.PROC), dram_region=NodeRegion( origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), type=NodeRegion.DRAM), src_data_region=NodeRegion( - origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), + origin=PhyDim2(0, 0), dim=PhyDim2(4, 4), type=NodeRegion.DRAM), dst_data_region=NodeRegion( - origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), + origin=PhyDim2(0, 0), dim=PhyDim2(4, 4), type=NodeRegion.DRAM), dim_array=PhyDim2(16, 16), size_gbuf=128 * 1024 // 2, # 128 kB size_regf=2, array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) nnd = NNDataflow(self.alex_net, 4, self.resource, self.cost, @@ -230,6 +390,13 @@ def test_no_valid_dataflow(self): self.assertFalse(tops) + # With inter-layer pipelining. + options = Option(hw_gbuf_save_writeback=True, + partition_interlayer=True) + tops, _ = nnd.schedule_search(options) + + self.assertFalse(tops) + def test_scheduling_failure(self): ''' Layer scheduling failure. ''' network = self.alex_net @@ -346,6 +513,7 @@ def test_eyeriss_isscc16(self): size_regf=261, # 225 + 12 + 24 array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) cost = Cost(mac_op=2e-12, @@ -442,6 +610,7 @@ def test_eyeriss_asplos17(self): size_regf=1024 // 2, # 1 kB array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) cost = Cost(mac_op=2e-12, @@ -474,6 +643,7 @@ def test_eyeriss_asplos17(self): size_regf=512 // 2, # 512 B array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False, ) cost = Cost(mac_op=2e-12, diff --git a/nn_dataflow/tests/dataflow_test/test_scheduling.py b/nn_dataflow/tests/dataflow_test/test_scheduling.py index 392188f..eca82d8 100644 --- a/nn_dataflow/tests/dataflow_test/test_scheduling.py +++ b/nn_dataflow/tests/dataflow_test/test_scheduling.py @@ -28,6 +28,7 @@ from nn_dataflow.core import Resource from nn_dataflow.core import Scheduling from nn_dataflow.core import SchedulingCondition, SchedulingResult +from nn_dataflow.core import SchedulingConstraint class TestScheduling(unittest.TestCase): ''' Tests for Scheduling module. 
''' @@ -44,6 +45,9 @@ def setUp(self): self.cost = Cost(mac_op=1, mem_hier=(200, 6, 2, 1), noc_hop=50, idl_unit=50) + self.none_cstr = SchedulingConstraint() + self.cstr = SchedulingConstraint(topofm=1, topbat=self.batch_size) + self.resource = Resource( proc_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(4, 4), type=NodeRegion.PROC), @@ -54,7 +58,8 @@ def setUp(self): dst_data_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(4, 1), type=NodeRegion.DRAM), dim_array=PhyDim2(16, 16), size_gbuf=65536, size_regf=64, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) self.options = Option(partition_hybrid=True, partition_batch=True, partition_ifmaps=True, ntops=10) @@ -74,6 +79,8 @@ def setUp(self): parts=(part.projection(self.resource.src_data_region, appl2frng=True),)) + self.sched_seq = (2, 0, 1) + def test_valid_args(self): ''' Valid arguments for constructor. ''' schd = Scheduling(self.layers['BASE'], self.batch_size, self.cost, @@ -116,7 +123,9 @@ def test_schedule_search(self): MapStrategyEyeriss) condition = SchedulingCondition(resource=self.resource, - ifmap_layout=ifmap_layout) + constraint=self.cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) res = schd.schedule_search(condition, self.options) @@ -142,11 +151,19 @@ def test_schedule_search(self): self.assertEqual(r.num_nodes, self.resource.proc_region.dim.size()) + # Constraint. + for r in res: + self.assertEqual(r.scheme['to'][0], 1) + # Ofmap layout. for r in res: self.assertEqual(r.ofmap_layout.complete_fmap_range().size(), layer.total_ofmap_size(self.batch_size)) + # Sequence number. + for r in res: + self.assertTupleEqual(r.sched_seq, condition.sched_seq) + def test_schedule_search_ilayout(self): ''' Invalid ifmap_layout. ''' layer = self.layers['BASE'] @@ -157,9 +174,11 @@ def test_schedule_search_ilayout(self): # Shift ifmap out of memory region. condition = SchedulingCondition( resource=self.resource, + constraint=self.none_cstr, ifmap_layout=self.ifmap_layouts['BASE']._replace( regions=tuple(r._replace(origin=PhyDim2(-10, -10)) - for r in self.ifmap_layouts['BASE'].regions))) + for r in self.ifmap_layouts['BASE'].regions)), + sched_seq=self.sched_seq) with self.assertRaisesRegexp(ValueError, 'Scheduling: .*ifmap.*'): _ = schd.schedule_search(condition, self.options) @@ -167,7 +186,9 @@ def test_schedule_search_ilayout(self): # Not match layer. 
condition = SchedulingCondition( resource=self.resource, - ifmap_layout=self.ifmap_layouts['POOL']) + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layouts['POOL'], + sched_seq=self.sched_seq) with self.assertRaisesRegexp(ValueError, 'Scheduling: .*ifmap.*'): _ = schd.schedule_search(condition, self.options) @@ -182,7 +203,9 @@ def test_schedule_search_nolbs(self): condition = SchedulingCondition( resource=self.resource._replace(size_regf=0), - ifmap_layout=ifmap_layout) + constraint=self.none_cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) res = schd.schedule_search(condition, self.options) @@ -203,7 +226,9 @@ def test_pernode_sched_cache(self): self.assertTupleEqual(schd.cache_stats(), (0, 0)) condition = SchedulingCondition(resource=self.resource, - ifmap_layout=ifmap_layout) + constraint=self.cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) Scheduling.schedule_search.cache_clear() _ = schd.schedule_search(condition, self.options) @@ -232,7 +257,9 @@ def test_pernode_sched_cache_key(self): MapStrategyEyeriss) condition = SchedulingCondition(resource=self.resource, - ifmap_layout=ifmap_layout) + constraint=self.cstr, + ifmap_layout=ifmap_layout, + sched_seq=self.sched_seq) _ = schd.schedule_search(condition, self.options) @@ -241,6 +268,7 @@ def test_pernode_sched_cache_key(self): # Make another instance. rsrc = Resource(**self.resource._asdict()) + cstr = self.cstr opts = Option(**self.options._asdict()) self.assertNotEqual(id(rsrc), id(self.resource)) self.assertNotEqual(id(opts), id(self.options)) @@ -248,7 +276,7 @@ def test_pernode_sched_cache_key(self): part = PartitionScheme(order=(pe.BATP, pe.INPP, pe.OUTP, pe.OFMP), pdims=((2, 4), (2, 1), (1, 1), (1, 1))) - _ = schd.schedule_search_per_node(part, rsrc, opts) + _ = schd.schedule_search_per_node(part, rsrc, cstr, opts) h2, m2 = schd.cache_stats() self.assertEqual(h2, h + 1) diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py index 9539f92..045b58e 100644 --- a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking.py @@ -149,11 +149,34 @@ def test_gen_loopblocking_byp_sol(self): self.assertLessEqual(cnt, 8) + def test_gen_loopblocking_cstr(self): + ''' gen_loopblocking with constraint. ''' + + for lbs in self._gen_loopblocking(rsrckey='LG', cstr=self.cstr): + + self.assertTrue(self.cstr.is_valid_top_bl(lbs.bl_ts[0], + lbs.bl_ords[0])) + + def test_gen_loopblocking_cstr_sol(self): + ''' gen_loopblocking using bypass solvers with constraint. ''' + + cnt1 = len(list(self._gen_loopblocking(optkey='BYPSOL'))) + + lbs_list = list(self._gen_loopblocking(optkey='BYPSOL', cstr=self.cstr)) + self.assertTrue(all( + self.cstr.is_valid_top_bl(lbs.bl_ts[0], lbs.bl_ords[0]) + for lbs in lbs_list)) + cnt2 = len(lbs_list) + + self.assertLessEqual(cnt2, cnt1) + def _gen_loopblocking(self, wlkey='BASE', rsrckey='BASE', - optkey='BASE', skip_invalid=False): + optkey='BASE', cstr=None, skip_invalid=False): ''' gen_loopblocking trampoline. 
''' + if cstr is None: + cstr = self.none_cstr for lbs in loop_blocking.gen_loopblocking( - self.nld[wlkey], self.resource[rsrckey], + self.nld[wlkey], self.resource[rsrckey], self.part, cstr, self.cost, self.options[optkey]): if not skip_invalid or lbs.is_valid(): yield lbs diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py index b3e5fe7..fc15ba9 100644 --- a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_fixture.py @@ -14,8 +14,11 @@ """ import itertools +import math import unittest +from nn_dataflow.core import partition +from nn_dataflow.core import BufShrScheme from nn_dataflow.core import ConvLayer, PoolingLayer from nn_dataflow.core import Cost from nn_dataflow.core import DataDimLoops @@ -27,12 +30,16 @@ from nn_dataflow.core import NestedLoopDesc from nn_dataflow.core import NodeRegion from nn_dataflow.core import Option +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme from nn_dataflow.core import PhyDim2 from nn_dataflow.core import Resource +from nn_dataflow.core import SchedulingConstraint from nn_dataflow import util class TestLoopBlockingFixture(unittest.TestCase): ''' Base fixture class for LoopBlocking tests. ''' + # pylint: disable=too-many-instance-attributes def setUp(self): @@ -41,6 +48,7 @@ def setUp(self): self.layer['BASE'] = ConvLayer(12, 10, 28, 3) self.layer['LGFIL'] = ConvLayer(2, 4, 28, 20) self.layer['POOL'] = PoolingLayer(32, 28, 2) + self.layer['PAR'] = ConvLayer(24, 36, 56, 3) self.batch_size = 4 # Resource. @@ -55,19 +63,60 @@ def setUp(self): proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=dim_array, size_gbuf=65536, size_regf=64, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) # Larger resource with sufficient capacity, to make all schemes valid. self.resource['LG'] = Resource( proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) # Small resource. self.resource['SM'] = Resource( proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=dim_array, size_gbuf=4096, size_regf=16, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + # Multi-node parallel resource. + self.resource['PAR'] = Resource( + proc_region=NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(4, 2), + type=NodeRegion.PROC), + dram_region=data_region, + src_data_region=data_region, dst_data_region=data_region, + dim_array=dim_array, size_gbuf=25000, size_regf=64, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + # Resource with no data regions. 
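+        # (These src/dst data regions are PROC type rather than DRAM type,
+        # exercising the newly allowed non-data-type data regions in
+        # Resource.)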
+ proc_data_region = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(1, 1), + type=NodeRegion.PROC) + self.resource['SRCNOTDATA'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=proc_data_region, dst_data_region=data_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + self.resource['DSTNOTDATA'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=data_region, dst_data_region=proc_data_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + self.resource['DATALOCAL'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=proc_region, dst_data_region=proc_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + # Filter pinning. + self.resource['FILPIN'] = Resource( + proc_region=proc_region, dram_region=data_region, + src_data_region=data_region, dst_data_region=data_region, + dim_array=dim_array, size_gbuf=1024 ** 3, size_regf=1024 ** 3, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=True) # Nested loop description after mapping. self.nld = {} @@ -114,6 +163,12 @@ def setUp(self): le.BAT)), unit_ops=1, unit_time=1) + # Fake partition scheme. + self.part = PartitionScheme(range(pe.NUM), ((1, 1),) * pe.NUM) + + # Fake buffer sharing scheme. + self.bufshr = BufShrScheme(proc_region, self.part) + # Options. self.options = {} # Basic. @@ -128,6 +183,20 @@ def setUp(self): self.options['BYPSOL'] = Option(sw_gbuf_bypass=(True,) * 3, sw_solve_loopblocking=True, ntops=2 ** 30) + # Access forwarding. + self.options['ACCFWD'] = Option(hw_access_forwarding=True, + ntops=2 ** 30) + # Buffer sharing. + self.options['BUFSHR'] = Option(hw_gbuf_sharing=True, + ntops=2 ** 30) + # Buffer sharing with bypassing. + self.options['BUFSHR-BYP'] = Option(sw_gbuf_bypass=(True,) * 3, + hw_gbuf_sharing=True, + ntops=2 ** 30) + + # Constraint. + self.none_cstr = SchedulingConstraint() + self.cstr = SchedulingConstraint(topifm=1, topbat=1) # Cost. self.cost = Cost(mac_op=1, mem_hier=(200, 6, 2, 1), @@ -140,7 +209,7 @@ def _lbs(self, bl_ts, bl_ords=None, wlkey='BASE', rsrckey='BASE', bl_ords = (tuple(range(le.NUM)), tuple(range(le.NUM))) \ if not bl_ords else bl_ords return LoopBlockingScheme(self.nld[wlkey], bl_ts, bl_ords, - self.resource[rsrckey], + self.resource[rsrckey], self.bufshr, self.options[optkey]) def _gen_loopblocking_all(self, wlkey='BASE'): @@ -196,6 +265,94 @@ def _make_bl_ts(self, ti_part, to_part, tb_part, wlkey='BASE'): lp_ts[le.BAT] = tb return tuple(zip(*lp_ts)) + def _part_nld(self, part, layerkey='PAR'): + ''' Make a partitioned NestedLoopDesc and its partition occupation. ''' + p_layer, p_batch_size, p_occ = part.part_layer(self.layer[layerkey], + self.batch_size) + p_nld = next(MapStrategyEyeriss(p_layer, p_batch_size, p_occ, + self.resource['PAR'].dim_array) + .gen_nested_loop_desc()) + return p_nld + + def _gen_all_partition(self, layerkey='PAR'): + ''' + Generate PartitionScheme. 
+ ''' + options = Option(partition_hybrid=True, + partition_batch=True, + partition_ifmaps=True, + ntops=2 ** 30) + + for part in partition.gen_partition( + self.layer[layerkey], self.batch_size, + self.resource['PAR'].proc_region.dim, options): + yield part + + def _total_part_size(self, part, layerkey='PAR'): + ''' Get the total partitioned data size. ''' + layer = self.layer[layerkey] + + nifm = util.idivc(layer.nifm, part.size(pe.INPP)) * part.size(pe.INPP) + nofm = util.idivc(layer.nofm, part.size(pe.OUTP)) * part.size(pe.OUTP) + hofm = util.idivc(layer.hofm, part.dim(pe.OFMP).h) * part.dim(pe.OFMP).h + wofm = util.idivc(layer.wofm, part.dim(pe.OFMP).w) * part.dim(pe.OFMP).w + batch_size = util.idivc(self.batch_size, part.size(pe.BATP)) \ + * part.size(pe.BATP) + + full_layer = ConvLayer(nifm, nofm, (hofm, wofm), + (layer.hfil, layer.wfil), + (layer.htrd, layer.wtrd)) + filter_size = full_layer.total_filter_size() + ifmap_size = full_layer.total_ifmap_size(batch_size) + ofmap_size = full_layer.total_ofmap_size(batch_size) + + self.assertGreaterEqual(filter_size, layer.total_filter_size()) + self.assertLess(filter_size, layer.total_filter_size() * 1.2 * 1.2) + self.assertGreaterEqual(ofmap_size, + layer.total_ofmap_size(self.batch_size)) + self.assertLess(ofmap_size, + layer.total_ofmap_size(self.batch_size) + * 1.2 * 1.2 * 1.2) + self.assertGreaterEqual(ifmap_size, + layer.total_ifmap_size(self.batch_size)) + + return filter_size, ifmap_size, ofmap_size + + def _bufshr_params(self, lbs): + ''' + Get buffer sharing parameters. + + Return subgroup sizes, rotation unit counts. + + Finally, a list of ordered loops as a tuple of LoopEnum and blocking + factor ordered from outermost to innermost excluding trivial loops. + ''' + # GBUF level. + blp1 = lbs.BL.GBUF + 1 + t_x = lbs.bl_ts[blp1] + ord_x = lbs.bl_ords[blp1] + # BS level. + t_bs = lbs.bufshr_bs_t + ord_bs = lbs.bufshr_bs_ord + + self.assertTrue(all(x % b == 0 for x, b in zip(t_x, t_bs))) + + subgrp_size = lbs.bufshr_subgrp_size + rot_unit_cnt = lbs.bufshr_rot_unit_cnt + + # Loops as a tuple of LoopEnum and blocking factor, ordered from + # outermost to innermost, excluding trivial loops. + lp_t_list = sorted([(lpe, t_bs[lpe]) + for lpe in range(le.NUM) if t_bs[lpe] > 1], + key=lambda tpl: ord_bs[tpl[0]], + reverse=True) \ + + sorted([(lpe, t_x[lpe] / t_bs[lpe]) + for lpe in range(le.NUM) if t_x[lpe] > t_bs[lpe]], + key=lambda tpl: ord_x[tpl[0]], + reverse=True) + + return subgrp_size, rot_unit_cnt, lp_t_list + class _SimBuffer(object): ''' A data buffer model for simulation. ''' @@ -222,6 +379,9 @@ def __init__(self, dce, buf_cnt_pr, unit_size, bypass=False): # E.g., (c0, c1). self.buf_cnt_pr = buf_cnt_pr + # Range index cache. + self.ridx_pr_cache = {} + def access_size(self): ''' Get access size. ''' return self.access * self.unit_size @@ -239,8 +399,7 @@ def do_access(self, idx_pr, cnt_pr, read=1, write=0): return cnt_pr # Range index. - ridx_pr = tuple(idx // buf_cnt for idx, buf_cnt - in zip(idx_pr, self.buf_cnt_pr)) + ridx_pr = self._range_idx_pr(idx_pr) # Access. self.access += util.prod(cnt_pr) * (read + write) @@ -253,10 +412,308 @@ def do_access(self, idx_pr, cnt_pr, read=1, write=0): self.data = ridx_pr return self.buf_cnt_pr - def _sim_access_conv(self, lbs): + def _range_idx_pr(self, idx_pr): + ''' Get the range index of all dimensions. 
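+            For example (illustrative values): with buf_cnt_pr = (4, 2),
+            index (9, 3) falls in range (9 // 4, 3 // 2) = (2, 1).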
''' + ridx_pr = self.ridx_pr_cache.get(idx_pr, None) + if ridx_pr is None: + ridx_pr = tuple(idx // buf_cnt for idx, buf_cnt + in zip(idx_pr, self.buf_cnt_pr)) + self.ridx_pr_cache[idx_pr] = ridx_pr + return ridx_pr + + class _SimBufferSharing(_SimBuffer): + ''' A data buffer model with buffer sharing. ''' + + def __init__(self, dce, buf_cnt_pr, unit_size, + subgrp_size, rot_unit_cnt, lp_t_list, dim_loops, + bypass=False): + + # pylint: disable=protected-access + self.base = super(TestLoopBlockingFixture._SimBufferSharing, self) + + self.base.__init__(dce, buf_cnt_pr, unit_size, bypass=bypass) + + # Number of rotation steps, of each range. + self.rot_step_cnt = {} + # Rotation accesses, in unit counts (* unit size). + self.rot_access = 0 + # Wide fetch accesses, in unit counts (* unit size). + self.wf_access = 0 + + # Rotation rounds per load of a range. If only rotate a single + # round per data load, the rotation is unnecessary. + self.rot_rnd_cnt_per_load = None + + if self.bypass: + return + + # Subrange. + # A list in the accessing order of subrange indexes, i.e., the + # ranges of the next level; and the unit counts in one subrange. + self.subrng_list, self.subrng_cnt_pr = \ + self._init_sub_range(lp_t_list, dim_loops) + # Subrange index to the position in the list. + self.subrng_idx_dict = \ + dict((sr, i) for i, sr in enumerate(self.subrng_list)) + # Number of subranges. + self.subrng_num = len(self.subrng_list) + + # Local buffer. + self.buf_num = subgrp_size + # Number of subranges in each buffer. + self.buf_subrng_num = 1. * self.subrng_num / self.buf_num + + # The location centroid of each subrange, i.e., buffer index + # weighted by fraction. + self.buf_subrng_centroid = [] + cur_buf_cap = self.buf_subrng_num + cur_buf_idx = 0 + for _ in range(self.subrng_num): + centroid = 0 + rem_frac = 1. + while rem_frac > 0.: + if cur_buf_cap >= rem_frac: + # Fits in the current buffer. + centroid += cur_buf_idx * rem_frac + cur_buf_cap -= rem_frac + rem_frac = 0. + break + else: + # Partially fits. + centroid += cur_buf_idx * cur_buf_cap + rem_frac -= cur_buf_cap + cur_buf_cap = self.buf_subrng_num + cur_buf_idx += 1 + self.buf_subrng_centroid.append(centroid) + + # Rotation unit. + # Rotation step happens when moving to the new rotation unit. + assert self.subrng_num % rot_unit_cnt == 0 + self.rot_unit_size = self.subrng_num // rot_unit_cnt + # Steps per rotation round. + self.rot_steps_per_round = 1 + while (self.rot_steps_per_round * self.rot_unit_size + + self.buf_subrng_num < self.subrng_num + and (self.rot_steps_per_round + 1) * self.rot_unit_size + < self.subrng_num): + self.rot_steps_per_round += 1 + + # The rotation unit currently worked on. + self.cur_rot_unit = 0 + # Rotation steps of the current load of the current range. + self.cur_rot_step_cnt = 0 + + # Last wide fetch subrange index. + self.last_wf_subrng_idx = 0 + # Amount of sequential wide fetch, can be combined with rotation. + self.seq_wf_acc = 0 + # Total saved (combined with rotation) wide fetch access. + self.saved_wf_access = 0 + + # Subrange index cache. + self.sridx_pr_cache = {} + + def rotation_rounds(self): + ''' Get number of rotation rounds. ''' + + # Ensure all ranges have the same rotation steps. 
+ steps_list = self.rot_step_cnt.values() + if not steps_list: + return 0 + assert all(s == steps_list[0] for s in steps_list) + steps = steps_list[0] + if steps == 0: + return 0 + + assert steps % self.rot_steps_per_round == 0 + + if self.rot_rnd_cnt_per_load == 1: + return 0 + return steps // self.rot_steps_per_round + + def rotation_access_size(self): + ''' Get total rotation access size. ''' + if self.rot_rnd_cnt_per_load == 1: + return 0 + return self.rot_access * self.unit_size + + def wide_fetch_access_size(self): + ''' Get total wide fetch access size. ''' + if self.rot_rnd_cnt_per_load == 1: + return (self.wf_access + self.saved_wf_access) * self.unit_size + return self.wf_access * self.unit_size + + def do_access(self, idx_pr, cnt_pr, read=1, write=0): + + ret = self.base.do_access(idx_pr, cnt_pr, read=read, write=write) + + if self.bypass: + # Bypass, skip buffer sharing. + return ret + + # Range index. + ridx_pr = self._range_idx_pr(idx_pr) + + if any(ret): + # Miss in the shared buffer and load new range. Reset. + self.cur_rot_unit = 0 + self.rot_step_cnt.setdefault(ridx_pr, 0) + + if self.cur_rot_step_cnt == 0: + # Initial fetch, no replaced data yet. + assert self.rot_rnd_cnt_per_load is None + else: + rot_rnd_cnt_per_load, rem_ = divmod( + self.cur_rot_step_cnt, self.rot_steps_per_round) + assert rem_ == 0 + assert self.rot_rnd_cnt_per_load is None \ + or self.rot_rnd_cnt_per_load == rot_rnd_cnt_per_load + self.rot_rnd_cnt_per_load = rot_rnd_cnt_per_load + self.cur_rot_step_cnt = 0 + + assert all(cnt <= subrng_cnt for cnt, subrng_cnt + in zip(cnt_pr, self.subrng_cnt_pr)) + + # Subrange index. + sridx_pr = self._subrange_idx_pr(idx_pr) + + # Rotation unit index. + ru_idx = self._subrng_rot_unit_idx(sridx_pr) + + if ru_idx != self.cur_rot_unit: + # Move to next rotation unit. + + if (self.cur_rot_unit + 1) * self.rot_unit_size \ + >= self.subrng_num: + # The current rotation unit is the last one. Start a new + # rotation round. + # Do not rotate back to the initial state. Instead start + # from the current state. + self.cur_rot_unit = 0 + + self.last_wf_subrng_idx = 0 + self.seq_wf_acc = 0 + + elif self.cur_rot_unit * self.rot_unit_size \ + + self.buf_subrng_num >= self.subrng_num: + # The last rotation unit is already local. No more rotation. + self.cur_rot_unit += 1 + + else: + # Rotate by one rotation unit, but not exceeding the end. + offset = min(self.rot_unit_size, + self.subrng_num + - self.cur_rot_unit * self.rot_unit_size + - self.buf_subrng_num) + assert offset > 0 + + # All subranges shift by the above offset. + acc_ = (1. * offset / self.buf_subrng_num) * self.subrng_num + self.rot_access += util.prod(self.subrng_cnt_pr) * acc_ + self.cur_rot_unit += 1 + + # One rotation step. + self.rot_step_cnt[ridx_pr] += 1 + self.cur_rot_step_cnt += 1 + + # Combine wide fetch with rotation. + self.wf_access -= self.seq_wf_acc + self.saved_wf_access += self.seq_wf_acc + self.seq_wf_acc = 0 + + assert ru_idx == self.cur_rot_unit + + # Buffer index of which has this subrange. + buf_idx = self._subrng_buf_idx(sridx_pr) + + # Wide fetch from possibly remote buffer. + wf_acc = util.prod(cnt_pr) * (read + write) * buf_idx + self.wf_access += wf_acc + + # Record amount of sequential wide fetch. + subrng_idx = self.subrng_idx_dict[sridx_pr] + if subrng_idx >= self.last_wf_subrng_idx: + self.seq_wf_acc += wf_acc + else: + self.seq_wf_acc = wf_acc + self.last_wf_subrng_idx = subrng_idx + + return ret + + def _subrange_idx_pr(self, idx_pr): + ''' Get the subrange index of all dimensions. 
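+            For example (illustrative values): with buf_cnt_pr = (8,) and
+            subrng_cnt_pr = (2,), index (13,) maps to ((13 % 8) // 2,) = (2,).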
''' + sridx_pr = self.sridx_pr_cache.get(idx_pr, None) + if sridx_pr is None: + sridx_pr = tuple((idx % buf_cnt) // subrng_cnt + for idx, buf_cnt, subrng_cnt + in zip(idx_pr, self.buf_cnt_pr, + self.subrng_cnt_pr)) + self.sridx_pr_cache[idx_pr] = sridx_pr + return sridx_pr + + def _subrng_rot_unit_idx(self, sridx_pr): + ''' Get the rotation unit index of the subrange. ''' + return self.subrng_idx_dict[sridx_pr] // self.rot_unit_size + + def _subrng_buf_idx(self, sridx_pr): + ''' Get the buffer index of which currently has the subrange. ''' + subrng_idx = self.subrng_idx_dict[sridx_pr] + + # Start from the current rotation unit. + subrng_idx -= self.cur_rot_unit * self.rot_unit_size + subrng_idx %= self.subrng_num + + return self.buf_subrng_centroid[subrng_idx] + + def _init_sub_range(self, lp_t_list, dim_loops): + + assert len(dim_loops) == 2 + + subrng_list = [(0, 0)] + subrng_sz_pr = [1, 1] + + # From inner to outer. + for lpe, t in reversed(lp_t_list): + # The data dimension index of this loop. + try: + d = dim_loops.index(lpe) + except ValueError: + # This loop is not related to the data, skip. + assert lpe not in dim_loops + continue + + # Size of this dimension of current loop body, i.e., all inner + # loops. + s = subrng_sz_pr[d] + + # Make the new subrange list, by looping over the current loop + # body with the current loop factor, and updating this + # dimension. + new_subrng_list = [] + for i in range(t): + new_subrng_list += [tuple(i_ + i * s if d_ == d else i_ + for d_, i_ in enumerate(sr)) + for sr in subrng_list] + subrng_list = new_subrng_list + + # Update size of this dimension. + subrng_sz_pr[d] *= t + + # Check. + assert len(set(subrng_list)) == len(subrng_list) + assert len(subrng_list) == util.prod(subrng_sz_pr) + + subrng_cnt_pr = tuple(buf_cnt // subrng_sz for buf_cnt, subrng_sz + in zip(self.buf_cnt_pr, subrng_sz_pr)) + + return subrng_list, subrng_cnt_pr + + def _sim_access_conv(self, lbs, get_bufshr=False): ''' Get data access by actually simulating and generating loops for CONV layer. + + If `get_bufshr` is True, also return bufshr stats. ''' self.assertTrue(lbs.is_valid(), '_sim_access_conv: invalid lbs.') @@ -264,6 +721,9 @@ def _sim_access_conv(self, lbs): lpts = zip(*lbs.bl_ts) + subgrp_size, rot_unit_cnt, lp_t_list = self._bufshr_params(lbs) + data_loops = lbs.nld.data_loops + # Get buffered unit counts at each level. dram_buf_cnt_pr_list = [tuple(util.prod(lpts[lpe]) for lpe in data_loops[dce].loops()) @@ -285,10 +745,11 @@ def _sim_access_conv(self, lbs): ) gbufs = [None] * de.NUM for dce, buf_cnt_pr in enumerate(gbuf_buf_cnt_pr_list): - gbufs[dce] = self._SimBuffer(dce, buf_cnt_pr, - lbs.nld.unit_access[me.GBUF][dce], - bypass=(not lbs.stored_in_gbuf[dce]), - ) + gbufs[dce] = self._SimBufferSharing( + dce, buf_cnt_pr, lbs.nld.unit_access[me.GBUF][dce], + subgrp_size[dce], rot_unit_cnt[dce], lp_t_list, + data_loops[dce].loops(), + bypass=(not lbs.stored_in_gbuf[dce])) regfs = [None] * de.NUM for dce, buf_cnt_pr in enumerate(regf_buf_cnt_pr_list): regfs[dce] = self._SimBuffer(dce, buf_cnt_pr, @@ -334,8 +795,151 @@ def _sim_access_conv(self, lbs): dram_access = [drams[dce].access_size() for dce in range(de.NUM)] gbuf_access = [gbufs[dce].access_size() for dce in range(de.NUM)] + + # Sum over all nodes. + dram_access = [a * lbs.num_nodes // r for a, r + in zip(dram_access, lbs.accfwd_reduction)] + gbuf_access = [a * lbs.num_nodes for a in gbuf_access] + + # Buffer sharing. 
+ if get_bufshr: + rotation_access = [gbufs[dce].rotation_access_size() + * (lbs.num_nodes // subgrp_size[dce]) + for dce in range(de.NUM)] + wide_fetch_access = [gbufs[dce].wide_fetch_access_size() + * (lbs.num_nodes // subgrp_size[dce]) + for dce in range(de.NUM)] + rotation_rounds = [gbufs[dce].rotation_rounds() + for dce in range(de.NUM)] + + return dram_access, gbuf_access, \ + (rotation_access, wide_fetch_access, rotation_rounds) + + else: + for dce in range(de.NUM): + self.assertAlmostEqual(gbufs[dce].rotation_access_size(), 0, + msg='_sim_access_conv: non-0 ' + 'rotation access with no bufshr.') + self.assertAlmostEqual(gbufs[dce].wide_fetch_access_size(), 0, + msg='_sim_access_conv: non-0 ' + 'wide fetch access with no bufshr.') + self.assertEqual(gbufs[dce].rotation_rounds(), 0, + msg='_sim_access_conv: non-0 ' + 'rotation rounds with no bufshr.') + return dram_access, gbuf_access + def _average_neighbor_nhops(self, bufshr, subgrp_size): + ''' Get the average neighbor number of hops. ''' + + avg_nbr_nhops = [] + + for dce in range(de.NUM): + # pylint: disable=protected-access + + subgrp_dim, idx_pr = bufshr._subgrp_dim(dce, subgrp_size[dce]) + nbr_dist = bufshr.nbr_dists[dce] + + d_pr = subgrp_dim[idx_pr] + d_npr = subgrp_dim[1 - idx_pr] + n_pr = (d_pr - 1) * d_npr + n_npr = d_npr - 1 + nhops_nbr = bufshr._nhops_with_neighbor_dist( + dce, + PhyDim2(*[tpl[1] for tpl + in sorted([(idx_pr, n_pr), (1 - idx_pr, n_npr)])])) + + nhops_nbr /= 1. * subgrp_size[dce] + + coord = bufshr._coordinate(subgrp_size[dce] - 1, subgrp_dim, idx_pr) + nhops_lpbk = bufshr._nhops_with_neighbor_dist(dce, coord) + + nhops_lpbk /= 1. * subgrp_size[dce] + + nhops = nhops_nbr + nhops_lpbk + + if subgrp_size[dce] <= 1: + self.assertAlmostEqual(nhops, 0) + elif subgrp_dim.size() == subgrp_size[dce]: + self.assertTrue(min(nbr_dist) <= nhops + <= max(nbr_dist) + + 1. * sum(subgrp_dim) / subgrp_dim.size(), + '_average_neighbor_nhops: {}: ' + 'subgrp_size {}, subgrp_dim {}, idx_pr {}, ' + 'nbr_dist {}, nhops {} = {} + {}' + .format(dce, subgrp_size[dce], subgrp_dim, + idx_pr, nbr_dist, + nhops, nhops_nbr, nhops_lpbk)) + + assert not math.isnan(nhops) and not math.isinf(nhops) + avg_nbr_nhops.append(nhops) + + return avg_nbr_nhops + + def _verify_bufshr_stats(self, dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, test_name): + ''' Verify the buffer sharing stats returned by access simulation. ''' + + rotation_access, wide_fetch_access, rotation_rounds = bufshr_stats + + avg_nbr_nhops = self._average_neighbor_nhops(bufshr, + lbs.bufshr_subgrp_size) + + # Mem hierarchy. + access = lbs.get_access() + + self.assertListEqual(access[me.DRAM], dram_access, + 'test_access: DRAM: ' + 'model {} vs. sim {}.' + .format(access[me.DRAM], dram_access)) + self.assertListEqual(access[me.GBUF], gbuf_access, + 'test_access: GBUF: ' + 'model {} vs. sim {}.' + .format(access[me.GBUF], gbuf_access)) + self.assertListEqual(access[me.REGF], + [lbs.ops, lbs.ops, lbs.ops * 2]) + + # NoC. + noc_access = lbs.get_noc_access() + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_rotation_access[dce] + + lbs.bufshr_wide_fetch_access[dce], + noc_access[dce]) + + for dce in range(de.NUM): + if lbs.bufshr_subgrp_size[dce] <= 1: + self.assertAlmostEqual(noc_access[dce], 0) + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_rot_round_cnt[dce], + rotation_rounds[dce], + msg=('{}: mismatch rotation round count ' + 'at {}:\nmodel: {}; sim: {}.' 
+ .format(test_name, dce, + lbs.bufshr_rot_round_cnt, + rotation_rounds))) + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_rotation_access[dce], + rotation_access[dce] * avg_nbr_nhops[dce], + msg=('{}: mismatch NoC rotation access ' + 'at {}:\nmodel: {}; sim: {} x {}.' + .format(test_name, dce, + lbs.bufshr_rotation_access, + rotation_access, + avg_nbr_nhops))) + + for dce in range(de.NUM): + self.assertAlmostEqual(lbs.bufshr_wide_fetch_access[dce], + wide_fetch_access[dce] * avg_nbr_nhops[dce], + msg=('{}: mismatch NoC wide fetch access ' + 'at {}:\nmodel: {}; sim: {} x {}.' + .format(test_name, dce, + lbs.bufshr_wide_fetch_access, + wide_fetch_access, + avg_nbr_nhops))) + def _regularized_scheme(self, bl_ts, bl_ords): ''' Get the regularized scheme which will not be skipped. ''' diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_partition.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_partition.py new file mode 100644 index 0000000..1f8f1e5 --- /dev/null +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_partition.py @@ -0,0 +1,412 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +from nn_dataflow.core import BufShrScheme +from nn_dataflow.core import DataCategoryEnum as de +from nn_dataflow.core import loop_blocking +from nn_dataflow.core import LoopBlockingScheme +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow import util + +from . import TestLoopBlockingFixture + +class TestLoopBlockingPartition(TestLoopBlockingFixture): + ''' Tests for LoopBlocking module with partitioning. ''' + + def setUp(self): + + super(TestLoopBlockingPartition, self).setUp() + + # LoopBlockingScheme records stats of all nodes. + self.total_ops = self.layer['PAR'].total_ops(self.batch_size) + + self.par_proc_region = self.resource['PAR'].proc_region + + def test_accfwd(self): + ''' Scheme using accfwd. ''' + + for part in self._gen_all_partition(): + + p_nld = self._part_nld(part) + + filter_size, ifmap_size, ofmap_size = self._total_part_size(part) + + bufshr = BufShrScheme(self.par_proc_region, part) + + # Filter may still have redundant fetch. + fil_fetch = part.size(pe.BATP, pe.OFMP) // bufshr.size(de.FIL) + + for lbs in loop_blocking.gen_loopblocking( + p_nld, self.resource['PAR'], part, self.none_cstr, + self.cost, self.options['ACCFWD']): + if not lbs.is_valid(): + continue + + # Ops. + self.assertAlmostEqual(lbs.ops, self.total_ops) + + # Access forwarding reduction. + accfwd_red = lbs.accfwd_reduction + self.assertEqual(accfwd_red[de.FIL], + part.size(pe.BATP, pe.OFMP) // fil_fetch) + self.assertEqual(accfwd_red[de.OFM], part.size(pe.INPP)) + self.assertEqual(accfwd_red[de.IFM], part.size(pe.OUTP)) + + # Top fetch and access. 
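+                # (Illustrative check, with made-up numbers: each top-level
+                # fetch reads the whole partitioned data size once, so a FIL
+                # fetch count of 2 with redundancy factor fil_fetch = 2 gives
+                # top_access[de.FIL] = 2 * filter_size * 2.)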
+                top_fetch = lbs.fetch[0]
+                top_access = lbs.access[0]
+                self.assertAlmostEqual(top_access[de.FIL],
+                                       top_fetch[de.FIL] * filter_size
+                                       * fil_fetch)
+                self.assertAlmostEqual(top_access[de.OFM],
+                                       top_fetch[de.OFM] * ofmap_size)
+                self.assertGreaterEqual(top_access[de.IFM],
+                                        top_fetch[de.IFM] * ifmap_size)
+
+    def test_bufshr(self):
+        ''' Scheme using bufshr. '''
+
+        for part in self._gen_all_partition():
+
+            p_nld = self._part_nld(part)
+
+            bufshr = BufShrScheme(self.par_proc_region, part)
+
+            # Filter may still have redundant fetch.
+            fil_fetch = part.size(pe.BATP, pe.OFMP) // bufshr.size(de.FIL)
+
+            for optkey in ['BUFSHR', 'BUFSHR-BYP']:
+
+                for lbs in loop_blocking.gen_loopblocking(
+                        p_nld, self.resource['PAR'], part, self.none_cstr,
+                        self.cost, self.options[optkey]):
+                    if not lbs.is_valid():
+                        continue
+
+                    # Ops.
+                    self.assertAlmostEqual(lbs.ops, self.total_ops)
+
+                    # Buffer sharing uses access forwarding reduction.
+                    accfwd_red = lbs.accfwd_reduction
+                    self.assertEqual(accfwd_red[de.FIL],
+                                     part.size(pe.BATP, pe.OFMP) // fil_fetch)
+                    self.assertEqual(accfwd_red[de.OFM], part.size(pe.INPP))
+                    self.assertEqual(accfwd_red[de.IFM], part.size(pe.OUTP))
+
+                    # Buffer sharing group size.
+                    bufshr_grp_size = lbs.bufshr_grp_size
+                    self.assertSequenceEqual(bufshr_grp_size, accfwd_red)
+
+                    # Buffer sharing subgroup size.
+                    bufshr_subgrp_size = lbs.bufshr_subgrp_size
+                    self.assertTrue(all(subgrp <= grp for subgrp, grp
+                                        in zip(bufshr_subgrp_size,
+                                               bufshr_grp_size)))
+
+    def test_bufshr_access(self):
+        ''' Access of scheme using bufshr. '''
+
+        for part in self._gen_all_partition():
+
+            p_nld = self._part_nld(part)
+
+            bufshr = BufShrScheme(self.par_proc_region, part)
+
+            for lbs in loop_blocking.gen_loopblocking(
+                    p_nld, self.resource['PAR'], part, self.none_cstr,
+                    self.cost, self.options['BUFSHR']):
+                if not lbs.is_valid():
+                    continue
+
+                # Skip those without bufshr.
+                if all(sgs <= 1 for sgs in lbs.bufshr_subgrp_size):
+                    continue
+
+                # Sim.
+                dram_access, gbuf_access, bufshr_stats = \
+                        self._sim_access_conv(lbs, get_bufshr=True)
+
+                self._verify_bufshr_stats(dram_access, gbuf_access,
+                                          bufshr_stats, lbs, bufshr,
+                                          'test_bufshr_access')
+
+    def test_bufshr_access_byp(self):
+        ''' Access of scheme using bufshr with bypassing. '''
+
+        for part in self._gen_all_partition():
+
+            p_nld = self._part_nld(part)
+
+            bufshr = BufShrScheme(self.par_proc_region, part)
+
+            for lbs in loop_blocking.gen_loopblocking(
+                    p_nld, self.resource['PAR'], part, self.none_cstr,
+                    self.cost, self.options['BUFSHR-BYP']):
+                if not lbs.is_valid():
+                    continue
+
+                # Skip those without bufshr.
+                if all(sgs <= 1 for sgs in lbs.bufshr_subgrp_size):
+                    continue
+                # Skip those without bypassing.
+                if all(lbs.stored_in_gbuf):
+                    continue
+
+                # Sim.
+                dram_access, gbuf_access, bufshr_stats = \
+                        self._sim_access_conv(lbs, get_bufshr=True)
+
+                self._verify_bufshr_stats(dram_access, gbuf_access,
+                                          bufshr_stats, lbs, bufshr,
+                                          'test_bufshr_access_byp')
+
+    def test_bufshr_rotation_example(self):
+        ''' Example scheme using bufshr with rotation. '''
+
+        # Make a PartitionScheme that allows bufshr for all data categories.
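+        # (pdims is indexed by ParallelEnum, one (h, w) partitioning factor
+        # pair per parallelism dimension.)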
+ part = PartitionScheme(order=range(pe.NUM), + pdims=((2, 1), (1, 2), (1, 1), (2, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertTrue(all(bufshr.size(dce) > 1 for dce in range(de.NUM)), + 'test_bufshr_rotation_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size {}' + .format(part, + [bufshr.size(dce) for dce in range(de.NUM)])) + + # Make a LoopBlockingScheme that uses bufshr for all data categories. + p_nld = self._part_nld(part) + bl_ts = ((util.idivc(p_nld.loopcnt[le.IFM], 6), + util.idivc(p_nld.loopcnt[le.OFM], 9), + util.idivc(p_nld.loopcnt[le.BAT], 2)), + (3, 3, 2), + (2, 3, 1)) + bl_ords = (tuple(range(le.NUM)), tuple(range(le.NUM))) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, self.resource['PAR'], + bufshr, self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertTrue(all(sgs > 1 for sgs in lbs.bufshr_subgrp_size) + and all(t > 1 for t in bl_ts[0]), + 'test_bufshr_rotation_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, top factors {}, bufshr subgrp size {}' + .format((bl_ts, bl_ords), bl_ts[0], + lbs.bufshr_subgrp_size)) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, 'test_bufshr_rotation_example') + + def test_bufshr_skip_rot_example(self): + ''' Example scheme using bufshr that skips the single rotation. ''' + + # Make a PartitionScheme that allows bufshr for IFM. + part = PartitionScheme(order=range(pe.NUM), + pdims=((2, 2), (1, 1), (2, 1), (1, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertEqual(bufshr.size(de.IFM), 4, + 'test_bufshr_skip_rot_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size for {} {}.' + .format(part, de.IFM, bufshr.size(de.IFM))) + + # Make a LoopBlockingScheme that has a single rotation for IFM. + p_nld = self._part_nld(part) + bl_ts = ((util.idivc(p_nld.loopcnt[le.IFM], 3), + util.idivc(p_nld.loopcnt[le.OFM], 3), + util.idivc(p_nld.loopcnt[le.BAT], 2)), + (1, 1, 2), + (3, 3, 1)) + bl_ords = (tuple(range(le.NUM)), tuple(range(le.NUM))) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, self.resource['PAR'], + bufshr, self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertEqual(lbs.bufshr_subgrp_size[de.IFM], 4, + 'test_bufshr_skip_rot_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr subgrp size for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_subgrp_size[de.IFM])) + self.assertGreater(lbs.bufshr_wide_fetch_width[de.IFM], 1, + 'test_bufshr_skip_rot_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr wide fetch width for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_wide_fetch_width[de.IFM])) + self.assertEqual(lbs.bufshr_rot_round_cnt[de.IFM], 0, + 'test_bufshr_skip_rot_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr rotation rounds for {} {}' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_rot_round_cnt[de.IFM])) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, + 'test_bufshr_skip_rot_example') + + def test_bufshr_wide_fetch_example(self): + ''' Example scheme using bufshr with wide fetch. 
''' + + # Make a PartitionScheme that allows bufshr for IFM. + part = PartitionScheme(order=range(pe.NUM), + pdims=((2, 2), (1, 1), (2, 1), (1, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertEqual(bufshr.size(de.IFM), 4, + 'test_bufshr_wide_fetch_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size for {} {}.' + .format(part, de.IFM, bufshr.size(de.IFM))) + + for t1, t2 in [((3, 3, 1), (1, 1, 2)), + ((1, 3, 2), (3, 1, 1))]: + # Make a LoopBlockingScheme that has wide fetch for IFM. + p_nld = self._part_nld(part) + bl_ts = (tuple(util.idivc(p_nld.loopcnt[lpe], t1[lpe] * t2[lpe]) + for lpe in range(le.NUM)), + t1, t2) + # At GBUF level, from inner to outer: le.BAT, le.IFM, le.OFM. + bl_ords = (tuple(range(le.NUM)), (1, 2, 0)) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, + self.resource['PAR'], bufshr, + self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertEqual(lbs.bufshr_subgrp_size[de.IFM], 4, + 'test_bufshr_wide_fetch_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr subgrp size for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_subgrp_size[de.IFM])) + self.assertGreater(lbs.bufshr_wide_fetch_width[de.IFM], 1, + 'test_bufshr_wide_fetch_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr wide fetch width for {} {}.' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_wide_fetch_width[de.IFM])) + self.assertGreater(lbs.bufshr_rot_round_cnt[de.IFM], 0, + 'test_bufshr_wide_fetch_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr rotation rounds for {} {}' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_rot_round_cnt[de.IFM])) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, + 'test_bufshr_wide_fetch_example') + + def test_bufshr_multisubgrp_example(self): + ''' Example scheme using bufshr with multiple subgroups in a group. ''' + + # Make a PartitionScheme that allows bufshr for IFM. + part = PartitionScheme(order=list(reversed(range(pe.NUM))), + pdims=((2, 2), (1, 1), (2, 1), (1, 1))) + bufshr = BufShrScheme(self.par_proc_region, part) + self.assertEqual(bufshr.size(de.IFM), 4, + 'test_bufshr_multisubgrp_example: ' + 'made-up PartitionScheme is not expected: ' + '{}, bufshr size for {} {}.' + .format(part, de.IFM, bufshr.size(de.IFM))) + + # Make a LoopBlockingScheme that has multi subgroups per group for IFM. + p_nld = self._part_nld(part) + bl_ts = ((util.idivc(p_nld.loopcnt[le.IFM], 1), + util.idivc(p_nld.loopcnt[le.OFM], 3), + util.idivc(p_nld.loopcnt[le.BAT], 2)), + (1, 3, 2), + (1, 1, 1)) + # At GBUF level, from inner to outer: le.BAT, le.OFM, le.IFM. 
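+        # (Each bl_ords entry is indexed by LoopEnum and gives that loop's
+        # position, with 0 being the innermost.)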
+ bl_ords = (tuple(range(le.NUM)), (2, 1, 0)) + lbs = LoopBlockingScheme(p_nld, bl_ts, bl_ords, self.resource['PAR'], + bufshr, self.options['BUFSHR']) + self.assertTrue(lbs.is_valid()) + self.assertGreater(sum(lbs.get_noc_access()), 0) + self.assertGreater(lbs.bufshr_grp_size[de.IFM], + lbs.bufshr_subgrp_size[de.IFM], + 'test_bufshr_multisubgrp_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr grp size {}, bufshr subgrp size {}' + .format((bl_ts, bl_ords), lbs.bufshr_grp_size, + lbs.bufshr_subgrp_size)) + self.assertGreater(lbs.bufshr_rot_round_cnt[de.IFM], 0, + 'test_bufshr_multisubgrp_example: ' + 'made-up LoopBlockingScheme is not expected: ' + '{}, bufshr rotation rounds for {} {}' + .format((bl_ts, bl_ords), de.IFM, + lbs.bufshr_rot_round_cnt[de.IFM])) + + # Sim. + dram_access, gbuf_access, bufshr_stats = \ + self._sim_access_conv(lbs, get_bufshr=True) + + self._verify_bufshr_stats(dram_access, gbuf_access, bufshr_stats, + lbs, bufshr, + 'test_bufshr_multisubgrp_example') + + def test_bufshr_get_noc_access(self): + ''' get_noc_access of scheme using bufshr. ''' + + for part in self._gen_all_partition(): + + p_nld = self._part_nld(part) + + for lbs in loop_blocking.gen_loopblocking( + p_nld, self.resource['PAR'], part, self.none_cstr, + self.cost, self.options['BUFSHR']): + + noc_access = lbs.get_noc_access() + + if not lbs.is_valid(): + self.assertIsNone(noc_access) + + else: + for dce in range(de.NUM): + self.assertAlmostEqual( + lbs.bufshr_rotation_access[dce] + + lbs.bufshr_wide_fetch_access[dce], + noc_access[dce]) + + def test_bufshr_localregionlayer(self): + ''' Scheme using bufshr for LocalRegionLayer. ''' + + for part in self._gen_all_partition(layerkey='POOL'): + + p_nld = self._part_nld(part, layerkey='POOL') + + for lbs in loop_blocking.gen_loopblocking( + p_nld, self.resource['PAR'], part, self.none_cstr, + self.cost, self.options['BUFSHR']): + if not lbs.is_valid(): + continue + + self.assertTrue(all(gs == 1 for gs in lbs.bufshr_grp_size), + 'test_bufshr_localregionlayer: ' + 'non-1 bufshr group size {}, part {}' + .format(lbs.bufshr_grp_size, part)) + diff --git a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py index eafa94a..d215874 100644 --- a/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py +++ b/nn_dataflow/tests/loop_blocking_test/test_loop_blocking_scheme.py @@ -381,3 +381,98 @@ def test_ordered_loops(self): self.assertListEqual(list(reversed(rev_loops)), ord_loops) self.assertListEqual([tpl[0] for tpl in ord_loops], ord_lpes) + def test_data_region_fetch(self): + ''' PROC type data regions. ''' + + # Multiple fetches with normal DATA regions. + bl_ts = self._make_bl_ts((0, 1, 1), (0, 1, 1), (0, 1, 1)) + bl_ords = [[0] * le.NUM for _ in range(2)] + bl_ords[0][le.IFM] = 1 + bl_ords[0][le.OFM] = 2 + bl_ords[0][le.BAT] = 0 + bl_ords[1] = range(le.NUM) + lbs_norm = self._lbs(bl_ts, bl_ords) + self.assertTrue(lbs_norm.is_valid()) + self.assertGreater(lbs_norm.fetch[0][de.IFM], 1) + self.assertGreater(lbs_norm.fetch[0][de.OFM], 1) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='SRCNOTDATA') + self.assertFalse(lbs.is_valid()) + lbs = self._lbs(bl_ts, bl_ords, rsrckey='DSTNOTDATA') + self.assertFalse(lbs.is_valid()) + + # Single top-level fetch. 
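+        # (Top-level blocking factors of 1 mean each data category is read
+        # from the top memory level only once.)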
+ bl_ts = self._make_bl_ts((1, 0, 1), (1, 0, 1), (1, 0, 1)) + lbs_norm = self._lbs(bl_ts, rsrckey='LG') + + lbs = self._lbs(bl_ts, rsrckey='SRCNOTDATA') + self.assertTrue(lbs.is_valid()) + self.assertLess(lbs.get_access_cost(self.cost), + lbs_norm.get_access_cost(self.cost)) + self.assertAlmostEqual(lbs_norm.get_access_cost(self.cost) + - lbs.get_access_cost(self.cost), + lbs.remote_gbuf_access[de.IFM] + * (self.cost.mem_hier_at(me.DRAM) + - self.cost.mem_hier_at(me.GBUF))) + self.assertAlmostEqual(lbs.access[me.DRAM][de.FIL], + lbs_norm.access[me.DRAM][de.FIL]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.IFM], 0) + self.assertAlmostEqual(lbs.access[me.DRAM][de.OFM], + lbs_norm.access[me.DRAM][de.OFM]) + self.assertAlmostEqual(lbs.access[me.GBUF][de.IFM], + lbs_norm.access[me.GBUF][de.IFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.IFM], + lbs_norm.access[me.DRAM][de.IFM]) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='DSTNOTDATA') + self.assertTrue(lbs.is_valid()) + self.assertLess(lbs.get_access_cost(self.cost), + lbs_norm.get_access_cost(self.cost)) + self.assertAlmostEqual(lbs_norm.get_access_cost(self.cost) + - lbs.get_access_cost(self.cost), + lbs.remote_gbuf_access[de.OFM] + * (self.cost.mem_hier_at(me.DRAM) + - self.cost.mem_hier_at(me.GBUF))) + self.assertAlmostEqual(lbs.access[me.DRAM][de.FIL], + lbs_norm.access[me.DRAM][de.FIL]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.IFM], + lbs_norm.access[me.DRAM][de.IFM]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.OFM], 0) + self.assertAlmostEqual(lbs.access[me.GBUF][de.OFM], + lbs_norm.access[me.GBUF][de.OFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.OFM], + lbs_norm.access[me.DRAM][de.OFM]) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='DATALOCAL') + self.assertTrue(lbs.is_valid()) + self.assertLess(lbs.get_access_cost(self.cost), + lbs_norm.get_access_cost(self.cost)) + self.assertAlmostEqual(lbs.access[me.DRAM][de.FIL], + lbs_norm.access[me.DRAM][de.FIL]) + self.assertAlmostEqual(lbs.access[me.DRAM][de.IFM], 0) + self.assertAlmostEqual(lbs.access[me.DRAM][de.OFM], 0) + self.assertAlmostEqual(lbs.access[me.GBUF][de.IFM], + lbs_norm.access[me.GBUF][de.IFM]) + self.assertAlmostEqual(lbs.access[me.GBUF][de.OFM], + lbs_norm.access[me.GBUF][de.OFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.IFM], + lbs_norm.access[me.DRAM][de.IFM]) + self.assertAlmostEqual(lbs.remote_gbuf_access[de.OFM], + lbs_norm.access[me.DRAM][de.OFM]) + + def test_fil_pinning(self): + ''' Filter pinning. 
''' + + bl_ts = self._make_bl_ts((1, 0, 1), (1, 0, 1), (0, 1, 1)) + bl_ords = [range(le.NUM) for _ in range(2)] + + lbs_norm = self._lbs(bl_ts, bl_ords) + self.assertTrue(lbs_norm.is_valid()) + self.assertGreater(lbs_norm.fetch[0][de.FIL], 0) + self.assertGreater(lbs_norm.get_access()[0][de.FIL], 0) + + lbs = self._lbs(bl_ts, bl_ords, rsrckey='FILPIN') + self.assertTrue(lbs.is_valid()) + self.assertEqual(lbs.fetch[0][de.FIL], 0) + self.assertEqual(lbs.get_access()[0][de.FIL], 0) + diff --git a/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py b/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py index c448d45..f6c2458 100644 --- a/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py +++ b/nn_dataflow/tests/map_strategy_test/test_map_strategy_fixture.py @@ -66,5 +66,6 @@ def setUp(self): proc_region=proc_region, dram_region=data_region, src_data_region=data_region, dst_data_region=data_region, dim_array=PhyDim2(12, 14), size_gbuf=108*1024, size_regf=520, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) diff --git a/nn_dataflow/tests/partition_test/test_partition_fixture.py b/nn_dataflow/tests/partition_test/test_partition_fixture.py index 28bab0d..90ce832 100644 --- a/nn_dataflow/tests/partition_test/test_partition_fixture.py +++ b/nn_dataflow/tests/partition_test/test_partition_fixture.py @@ -69,6 +69,16 @@ def setUp(self): partition_batch=True, partition_ifmaps=False, **optdict) + self.options['ACCFWD'] = Option(partition_hybrid=True, + partition_batch=True, + partition_ifmaps=True, + hw_access_forwarding=True, + **optdict) + self.options['BUFSHR'] = Option(partition_hybrid=True, + partition_batch=True, + partition_ifmaps=True, + hw_gbuf_sharing=True, + **optdict) def _gen_partition(self, wlkey='BASE', dnkey='BASE', optkey='BASE', guaranteed=False): diff --git a/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py b/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py index 8b5c8b1..7f01e5b 100644 --- a/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py +++ b/nn_dataflow/tests/partition_test/test_unit_nhops_to_proc_region.py @@ -316,6 +316,66 @@ def test_ofmap_local(self): self.assertEqual(nhops[de.OFM], 0) + def test_use_fwd(self): + ''' Use access forwarding. 
'''
+        layer = self.layers['BASE']
+
+        part = PartitionScheme(order=(pe.BATP, pe.INPP, pe.OUTP, pe.OFMP),
+                               pdims=((2, 1), (2, 4), (1, 2), (2, 1)))
+
+        nr = NodeRegion(origin=PhyDim2(0, 0), dim=part.dim(),
+                        type=NodeRegion.PROC)
+
+        far_dist = 1000
+
+        ilayout = self._make_data_layout(
+            layer.nifm, layer.hifm, layer.wifm, PhyDim2(-far_dist, 0),
+            (1, 1), (1, 1), PhyDim2(1, 1))
+
+        olayout = self._make_data_layout(
+            layer.nofm, layer.hofm, layer.wofm, PhyDim2(0, -far_dist),
+            (1, 1), (1, 1), PhyDim2(1, 1))
+
+        filter_nodes = frozenset([PhyDim2(far_dist, 0), PhyDim2(0, far_dist)])
+
+        nhops_base = partition.unit_nhops_to_proc_region(
+            layer, self.batch_size, nr, part,
+            filter_nodes, ilayout, olayout, self.options['BASE'])
+        nhops_accfwd = partition.unit_nhops_to_proc_region(
+            layer, self.batch_size, nr, part,
+            filter_nodes, ilayout, olayout, self.options['ACCFWD'])
+        nhops_bufshr = partition.unit_nhops_to_proc_region(
+            layer, self.batch_size, nr, part,
+            filter_nodes, ilayout, olayout, self.options['BUFSHR'])
+
+        for dce in range(de.NUM):
+            self.assertEqual(nhops_accfwd[dce], nhops_bufshr[dce])
+
+        # In the basic access scheme, FIL and IFM are independently fetched,
+        # resulting in repeated remote fetches. OFM is merged locally and only
+        # stored back remotely once.
+        self.assertGreater(nhops_base[de.FIL],
+                           layer.total_filter_size() * far_dist
+                           * part.size(pe.BATP) * part.size(pe.OFMP) * 0.8)
+        self.assertGreater(nhops_base[de.IFM],
+                           layer.total_ifmap_size(self.batch_size) * far_dist
+                           * part.size(pe.OUTP) * 0.8)
+
+        p_layer, p_batch_size, _ = part.part_layer(layer, self.batch_size)
+        # With forwarding, each data category is remotely fetched only once.
+        self.assertLess(nhops_accfwd[de.FIL],
+                        p_layer.total_filter_size()
+                        * part.size(pe.INPP, pe.OUTP)
+                        * (far_dist + nr.dim.size()))
+        self.assertLess(nhops_accfwd[de.IFM],
+                        p_layer.total_ifmap_size(p_batch_size)
+                        * part.size(pe.INPP, pe.OFMP, pe.BATP)
+                        * (far_dist + nr.dim.size()))
+        self.assertLess(nhops_accfwd[de.OFM],
+                        p_layer.total_ofmap_size(p_batch_size)
+                        * part.size(pe.OUTP, pe.OFMP, pe.BATP)
+                        * (far_dist + nr.dim.size()))
+
     def _make_data_layout(self, nfm, hfm, wfm, origin, bdim, ndim, dims):
         ''' Make a DataLayout instance. '''
 
         frng = FmapRange((0,) * 4, (self.batch_size, nfm, hfm, wfm))
diff --git a/nn_dataflow/tests/pipeline_test/__init__.py b/nn_dataflow/tests/pipeline_test/__init__.py
new file mode 100644
index 0000000..204e01c
--- /dev/null
+++ b/nn_dataflow/tests/pipeline_test/__init__.py
@@ -0,0 +1,17 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+""" + +from .test_pipeline_fixture import TestPipelineFixture + diff --git a/nn_dataflow/tests/pipeline_test/test_inter_layer_pipeline.py b/nn_dataflow/tests/pipeline_test/test_inter_layer_pipeline.py new file mode 100644 index 0000000..fac452b --- /dev/null +++ b/nn_dataflow/tests/pipeline_test/test_inter_layer_pipeline.py @@ -0,0 +1,497 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import re + +from nn_dataflow.core import InputLayer, ConvLayer, FCLayer, PoolingLayer +from nn_dataflow.core import InterLayerPipeline +from nn_dataflow.core import Network +from nn_dataflow.core import Option +from nn_dataflow.core import PhyDim2 +from nn_dataflow.core import PipelineSegment + +from . import TestPipelineFixture + +class TestInterLayerPipeline(TestPipelineFixture): + ''' Tests for InterLayerPipeline. ''' + + def test_valid_args(self): + ''' Valid arguments. ''' + ilp = InterLayerPipeline(self.net['net1'], self.batch_size, + self.resource, max_util_drop=0.1) + self.assertIs(ilp.network, self.net['net1']) + self.assertEqual(ilp.batch_size, self.batch_size) + self.assertIs(ilp.resource, self.resource) + self.assertEqual(ilp.max_util_drop, 0.1) + + def test_invalid_network(self): + ''' Invalid network. ''' + with self.assertRaisesRegexp(TypeError, + 'InterLayerPipeline: .*network.*'): + _ = InterLayerPipeline(self.net['net1'].input_layer(), + self.batch_size, self.resource) + + def test_invalid_resource(self): + ''' Invalid resource. ''' + with self.assertRaisesRegexp(TypeError, + 'InterLayerPipeline: .*resource.*'): + _ = InterLayerPipeline(self.net['net1'], self.batch_size, + PhyDim2(1, 1)) + + def test_invalid_max_util_drop(self): + ''' Invalid max_util_drop. ''' + with self.assertRaisesRegexp(ValueError, + 'InterLayerPipeline: .*max_util_drop.*'): + _ = InterLayerPipeline(self.net['net1'], self.batch_size, + self.resource, max_util_drop=1.1) + + with self.assertRaisesRegexp(ValueError, + 'InterLayerPipeline: .*max_util_drop.*'): + _ = InterLayerPipeline(self.net['net1'], self.batch_size, + self.resource, max_util_drop=-0.1) + + def test_topological_order(self): + ''' Topological order. ''' + for net in self.net.values(): + + if not net.net_name.startswith('net'): + continue + + ilp = self._make_ilp(net) + + for layer in net: + vidx = ilp.dag_vertex_dict[layer] + + self.assertIn(layer, ilp.dag_vertex_list[vidx]) + + # Layer is named by topological order. + self.assertTrue(layer.startswith(str(vidx))) + + # Disjoint union. + vs_list = [set(v) for v in ilp.dag_vertex_list] + + for idx, vs in enumerate(vs_list): + for vs2 in vs_list[:idx]: + self.assertTrue(vs.isdisjoint(vs2)) + self.assertSetEqual(set.union(*vs_list), set(net)) + + def test_vertex_no_merge_lr(self): + ''' LocalRegionLayer has no previous layer to merge with. 
''' + net = Network('tmp_net') + net.set_input_layer(InputLayer(30, 1)) + net.add('0', PoolingLayer(30, 1, 1)) + net.add('1', FCLayer(30, 40)) + net.add('1p', PoolingLayer(40, 1, 1)) + + ilp = self._make_ilp(net) + + for layer in net: + vidx = ilp.dag_vertex_dict[layer] + + self.assertIn(layer, ilp.dag_vertex_list[vidx]) + + # Layer is named by topological order. + self.assertTrue(layer.startswith(str(vidx))) + + def test_prev(self): + ''' Previous relationship. ''' + for net in self.net.values(): + + ilp = self._make_ilp(net) + + for vidx, prevs in ilp.dag_prev_dict.items(): + + # Previous layers of the current vertex. + prev_layers = set() + v = ilp.dag_vertex_list[vidx] + for l in v: + prev_layers.update(net.prevs(l)) + prev_layers.difference_update(v) + + for pvidx in prevs: + + # Previous vertices should be ordered before this vertex. + self.assertLess(pvidx, vidx) + + # Previous vertex should have at least one previous layer. + if pvidx < 0: + self.assertTrue( + None in prev_layers + or not prev_layers.isdisjoint(net.ext_layers())) + else: + pv = ilp.dag_vertex_list[pvidx] + self.assertFalse(prev_layers.isdisjoint(pv)) + + def test_next(self): + ''' Next relationship. ''' + for net in self.net.values(): + + ilp = self._make_ilp(net) + + for vidx, nexts in ilp.dag_next_dict.items(): + + # Next layers of the current vertex. + next_layers = set() + if vidx < 0: + # Go through all layers and add those with input layer as + # previous. + for l in net: + prevs = set(net.prevs(l)) + if None in prevs \ + or not prevs.isdisjoint(net.ext_layers()): + next_layers.add(l) + else: + v = ilp.dag_vertex_list[vidx] + for l in v: + next_layers.update(net.nexts(l)) + next_layers.difference_update(v) + + for nvidx in nexts: + + # Next vertices should be ordered after this vertex. + self.assertGreater(nvidx, vidx) + + # Next vertex should have at least one next layer. + nv = ilp.dag_vertex_list[nvidx] + self.assertFalse(next_layers.isdisjoint(nv)) + + def test_match_prev_next(self): + ''' Previous and next relationships match. ''' + for net in self.net.values(): + + ilp = self._make_ilp(net) + + for vidx, prevs in ilp.dag_prev_dict.items(): + for pvidx in prevs: + self.assertIn(vidx, ilp.dag_next_dict[pvidx]) + + for vidx, nexts in ilp.dag_next_dict.items(): + for nvidx in nexts: + self.assertIn(vidx, ilp.dag_prev_dict[nvidx]) + + def test_gen_vseg(self): + ''' _gen_vseg. ''' + # pylint: disable=protected-access + + # Simple case. + ilp = self._make_ilp(self.net['net1']) + num = len(ilp.dag_vertex_list) + self.assertEqual(len(list(ilp._gen_vseg())), + (num + 1) * num // 2) + + # Linear case. + # Number of different vsegs of n = 1 + ... + n + ilp = self._make_ilp(self.net['net2']) + num = len(ilp.dag_vertex_list) + self.assertEqual(len(list(ilp._gen_vseg())), + (num + 1) * num // 2) + + # Fork case. + ilp = self._make_ilp(self.net['net4']) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), 39) + # Case with one of multiple previous vertices on-chip. + self.assertIn((9, 10), vseg_list) + self.assertIn((13, 14), vseg_list) + # Case with only one next vertex off-chip. + self.assertIn((7, 8), vseg_list) + self.assertNotIn((4, 5, 6), vseg_list) + + # Multiple first layers. + self.assertGreater(len(self.net['net3'].firsts()), 1) + ilp = self._make_ilp(self.net['net3']) + vseg_list = list(ilp._gen_vseg()) + self.assertIn((0,), vseg_list) + self.assertIn((1,), vseg_list) + + # Verify rules. 
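+        # (net5 stresses the vseg formation rules; the comments below name
+        # the rule each case exercises.)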
+ ilp = self._make_ilp(self.net['net5']) + vseg_list = list(ilp._gen_vseg()) + # Layers with no shared dependencies. + self.assertNotIn((2, 3, 4), vseg_list) + self.assertNotIn((8, 9), vseg_list) + # Multiple previous layers. + self.assertNotIn((5, 6, 7), vseg_list) + self.assertNotIn((8, 9, 10), vseg_list) + self.assertNotIn((10, 11, 12), vseg_list) + # Multiple next layers. + self.assertNotIn((0, 1, 2, 3), vseg_list) + self.assertIn((3, 4), vseg_list) + self.assertIn((3, 4, 5), vseg_list) + self.assertIn((10, 11), vseg_list) + + # No duplicate. + for net in self.net.values(): + ilp = self._make_ilp(net) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), len(set(vseg_list))) + + # Real networks. + ilp = self._make_ilp(self.net['zfnet']) + self.assertEqual(len(ilp.dag_vertex_list), 8) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), 36) + + ilp = self._make_ilp(self.net['vgg_net']) + self.assertEqual(len(ilp.dag_vertex_list), 16) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), 136) + + # Large networks with forks. + for net_name in ['googlenet', 'resnet152']: + net = self.net[net_name] + + ilp = self._make_ilp(net) + vseg_list = list(ilp._gen_vseg()) + self.assertEqual(len(vseg_list), len(set(vseg_list))) + + # The number of different vsegs is between one and eight times of + # the number of layers. + self.assertGreater(len(vseg_list), len(net)) + self.assertLessEqual(len(vseg_list), len(net) * 8) + + def test_gen_vseg_twice(self): + ''' _gen_vseg twice. ''' + # pylint: disable=protected-access + for net_name in self.net: + if not net_name.startswith('net'): + continue + + net = self.net[net_name] + ilp = self._make_ilp(net) + + vseg_list_1 = list(ilp._gen_vseg()) + vseg_list_2 = list(ilp._gen_vseg()) + + self.assertListEqual(vseg_list_1, vseg_list_2) + + def test_ordered_layer_list(self): + ''' Get ordered_layer_list. ''' + + # https://stackoverflow.com/a/4836734/5277823 + nat_key = lambda key: tuple(int(c) if c.isdigit() else c.lower() + for c in re.split('([0-9]+)', key)) + + for net_name in ['net1', 'net2', 'net3', 'net4', 'net5']: + net = self.net[net_name] + ilp = self._make_ilp(net) + ord_list = ilp.ordered_layer_list() + + # In natural order. + self.assertTrue(all(nat_key(l1) < nat_key(l2) for l1, l2 + in zip(ord_list, ord_list[1:]))) + + def test_gen_segment(self): + ''' gen_segment(). ''' + for net_name in self.net: + net = self.net[net_name] + ilp = self._make_ilp(net) + + # No pipelining. + options = Option() + segs_n_lst = list(ilp.gen_segment(options)) + segs_n = set(segs_n_lst) + self.assertEqual(len(segs_n_lst), len(segs_n)) + for seg in segs_n: + self.assertEqual(len(seg), 1) + self.assertEqual(len(seg[0]), 1) + self.assertIn(seg[0][0], net) + + # Spatial pipelining. + options = Option(partition_interlayer=True) + segs_sp_lst = list(ilp.gen_segment(options)) + segs_sp = set(segs_sp_lst) + self.assertEqual(len(segs_sp_lst), len(segs_sp)) + for seg in segs_sp: + for ltpl in seg: + self.assertLessEqual(sum(1 for l in ltpl + if isinstance(l, ConvLayer)), + 1) + self.assertTrue(segs_sp.issuperset(segs_n)) + + # Temporal pipelining. + options = Option(hw_gbuf_save_writeback=True) + segs_tp_lst = list(ilp.gen_segment(options)) + segs_tp = set(segs_tp_lst) + self.assertEqual(len(segs_tp_lst), len(segs_tp)) + for seg in segs_tp: + self.assertEqual(len(seg), 1) + self.assertTrue(segs_tp.issuperset(segs_n)) + + # Spatial and temporal pipelining. 
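+            # (Enabling both should generate exactly the union of the two
+            # sets, as asserted below.)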
+ options = Option(partition_interlayer=True, + hw_gbuf_save_writeback=True) + segs_stp_lst = list(ilp.gen_segment(options)) + segs_stp = set(segs_stp_lst) + self.assertEqual(len(segs_stp_lst), len(segs_stp)) + self.assertSetEqual(segs_stp, segs_tp | segs_sp) + # Only single-layer and single-vertex segments have the same + # spatial and temporal pipelining. + segs_intersect = segs_tp & segs_sp + segs_single = segs_n + segs_single |= set(PipelineSegment((v,), ilp.network, + ilp.batch_size, ilp.resource) + for v in ilp.dag_vertex_list) + self.assertTrue(segs_intersect.issubset(segs_single)) + + def test_gen_segment_max_degree(self): + ''' gen_segment() maximum degree. ''' + net = self.net['vgg_net'] + ilp = self._make_ilp(net) + + options = Option(partition_interlayer=True, + hw_gbuf_save_writeback=True, + layer_pipeline_max_degree=4) + for segment in ilp.gen_segment(options): + self.assertLessEqual(sum(1 if isinstance(net[l], ConvLayer) else 0 + for ltpl in segment for l in ltpl), + 4) + + def test_gen_segment_vseg(self): + ''' gen_segment() vertex segment. ''' + + for net_name in self.net: + if not net_name.startswith('net'): + continue + net = self.net[net_name] + + ilp = self._make_ilp(net) + options = Option(partition_interlayer=True) + + seg_set = set(ilp.gen_segment(options)) + self.assertTrue(seg_set) + + seg_v_set = set(self._gen_all_segment(net)) + self.assertTrue(seg_set.issubset(seg_v_set)) + + def test_gen_segment_multi_prevs(self): + ''' gen_segment() with multiple previous vertices. ''' + # pylint: disable=protected-access + + net = self.net['net4'] + ilp = self._make_ilp(net) + + vseg_set = set(ilp._gen_vseg()) + self.assertIn((9, 10), vseg_set) + self.assertIn((13, 14), vseg_set) + + options = Option(partition_interlayer=True) + seg_set = set(ilp.gen_segment(options)) + + # 10 only has neighbor source 9; 10p only has local source 10 and + # memory source 8. Valid. + self.assertIn(self._make_segment((9, 10), ilp.network), seg_set) + # 14 has both neighbor source 13, and memory source 12, etc.. Invalid. + self.assertNotIn(self._make_segment((13, 14), ilp.network), seg_set) + + def test_gen_segment_one_nexts(self): + ''' gen_segment() with missing one next vertex. ''' + # pylint: disable=protected-access + + net = self.net['net4'] + ilp = self._make_ilp(net) + + vseg_set = set(ilp._gen_vseg()) + self.assertIn((7, 8), vseg_set) + self.assertNotIn((4, 5, 6), vseg_set) + + options = Option(partition_interlayer=True) + seg_set = set(ilp.gen_segment(options)) + + self.assertIn(self._make_segment((7, 8), ilp.network), seg_set) + self.assertNotIn(self._make_segment((4, 5, 6), ilp.network), seg_set) + + def test_gen_segment_not_opt(self): + ''' gen_segment() not with_opt. 
'''
+
+        options_with_opt = Option(partition_interlayer=True,
+                                  hw_gbuf_save_writeback=True,
+                                  layer_pipeline_opt=True)
+        options_not_opt = Option(partition_interlayer=True,
+                                 hw_gbuf_save_writeback=True,
+                                 layer_pipeline_opt=False)
+
+        # Linear ones
+        for net_name in ['net1', 'net2', 'zfnet']:
+            net = self.net[net_name]
+            ilp = self._make_ilp(net)
+
+            segs_with_opt = set(seg.seg
+                                for seg in ilp.gen_segment(options_with_opt))
+            segs_not_opt = set(seg.seg
+                               for seg in ilp.gen_segment(options_not_opt))
+
+            self.assertSetEqual(segs_with_opt, segs_not_opt)
+
+        # Non-linear ones
+        for net_name in ['net3', 'net4', 'net5', 'net6', 'net7', 'googlenet']:
+            net = self.net[net_name]
+            ilp = self._make_ilp(net)
+
+            segs_with_opt = set(seg.seg
+                                for seg in ilp.gen_segment(options_with_opt))
+            segs_not_opt = set(seg.seg
+                               for seg in ilp.gen_segment(options_not_opt))
+
+            self.assertTrue(segs_with_opt.issuperset(segs_not_opt))
+
+    def test_gen_segment_resnet(self):
+        ''' gen_segment() with ResNet. '''
+
+        net = self.net['resnet152']
+        ilp = self._make_ilp(net)
+
+        options = Option(partition_interlayer=True)
+
+        # One residual module fits.
+        segment = PipelineSegment(
+            (('conv3_2_a',), ('conv3_2_b',), ('conv3_2_c', 'conv3_2_res')),
+            ilp.network, ilp.batch_size, ilp.resource)
+
+        self.assertTupleEqual(net.prevs('conv3_2_res'),
+                              ('conv3_1_res', 'conv3_2_c'))
+        self.assertTrue(segment.valid)
+
+        segs = set(seg.seg for seg in ilp.gen_segment(options))
+        self.assertIn(segment.seg, segs)
+
+    def test_gen_segment_lstm(self):
+        ''' gen_segment() with LSTM cell. '''
+
+        net = self.net['lstm_phoneme']
+        ilp = self._make_ilp(net)
+
+        options = Option(partition_interlayer=True)
+
+        # Find a cell.
+        cname = None
+        for l in net:
+            if l[-6:] == '_igate':
+                cname = l[:-6]
+        self.assertIsNotNone(cname)
+
+        # One LSTM cell fits.
+        segment = PipelineSegment(
+            ((cname + '_cand',),
+             (cname + '_igate', cname + '_cout_i'),
+             (cname + '_fgate', cname + '_cout_f', cname + '_cout'),
+             (cname + '_ogate', cname + '_hout')),
+            ilp.network, ilp.batch_size, ilp.resource)
+
+        self.assertTrue(segment.valid)
+
+        segs = set(seg.seg for seg in ilp.gen_segment(options))
+        self.assertIn(segment.seg, segs)
+
diff --git a/nn_dataflow/tests/pipeline_test/test_pipeline_fixture.py b/nn_dataflow/tests/pipeline_test/test_pipeline_fixture.py
new file mode 100644
index 0000000..2301c71
--- /dev/null
+++ b/nn_dataflow/tests/pipeline_test/test_pipeline_fixture.py
@@ -0,0 +1,588 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+""" + +import unittest + +from collections import OrderedDict + +from nn_dataflow.core import DataLayout +from nn_dataflow.core import FmapRange +from nn_dataflow.core import InputLayer, ConvLayer, FCLayer, PoolingLayer +from nn_dataflow.core import InterLayerPipeline +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import Network +from nn_dataflow.core import NodeRegion +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow.core import PhyDim2 +from nn_dataflow.core import PipelineSegment +from nn_dataflow.core import Resource +from nn_dataflow.core import SchedulingConstraint +from nn_dataflow.core import SchedulingResult + +from nn_dataflow.nns import import_network, all_networks + +class TestPipelineFixture(unittest.TestCase): + ''' Base fixture class for layer pipeline tests. ''' + + def setUp(self): + + self.net = {} + + net = Network('net1') + # Linear. + net.set_input_layer(InputLayer(10, 1)) + net.add('0', FCLayer(10, 20)) + net.add('1', FCLayer(20, 30)) + net.add('1p', PoolingLayer(30, 1, 1)) + net.add('2', FCLayer(30, 40)) + net.add('3', FCLayer(40, 50)) + self.net[net.net_name] = net + + net = Network('net2') + # Long linear. + net.set_input_layer(InputLayer(1, 1)) + for idx in range(16): + net.add(str(idx), FCLayer(1, 1)) + self.net[net.net_name] = net + + net = Network('net3') + # Fork. + # /0-2\ /6- 7- 8\ + # x 4-5 12 + # \1-3/ \9-10-11/ + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1), prevs=net.INPUT_LAYER_KEY) + net.add('1', FCLayer(1, 1), prevs=net.INPUT_LAYER_KEY) + net.add('2', FCLayer(2, 1), prevs=('0', '1')) + net.add('2p', PoolingLayer(1, 1, 1)) + net.add('3', FCLayer(2, 1), prevs=('0', '1')) + net.add('4', FCLayer(2, 1), prevs=('2p', '3')) + net.add('5', FCLayer(1, 1)) + net.add('5p', PoolingLayer(1, 1, 1)) + net.add('6', FCLayer(1, 1), prevs='5p') + net.add('7', FCLayer(1, 1)) + net.add('8', FCLayer(1, 1)) + net.add('9', FCLayer(1, 1), prevs='5p') + net.add('10', FCLayer(1, 1)) + net.add('11', FCLayer(1, 1)) + net.add('12', FCLayer(2, 1), prevs=('8', '11')) + self.net[net.net_name] = net + + net = Network('net4') + # Complex fork. + # /5 \ + # 0-1-2-3-4-6-7-8-10-14 + # \9/ + # \11-12 / + # \13 / + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1)) + net.add('2', FCLayer(1, 1)) + net.add('3', FCLayer(1, 1)) + net.add('4', FCLayer(1, 1)) + net.add('5', FCLayer(1, 1), prevs='4') + net.add('6', FCLayer(1, 1), prevs='4') + net.add('7', FCLayer(1, 1)) + net.add('8', FCLayer(1, 1), prevs='7') + net.add('9', FCLayer(1, 1), prevs='7') + net.add('10', FCLayer(1, 1)) + net.add('10p', PoolingLayer(2, 1, 1), prevs=('8', '10')) + net.add('11', PoolingLayer(1, 1, 1), prevs='4') + net.add('12', FCLayer(1, 1)) + net.add('13', PoolingLayer(1, 1, 1), prevs='4') + net.add('14', FCLayer(5, 1), prevs=('5', '10p', '12', '13')) + self.net[net.net_name] = net + + net = Network('net5') + # Corner cases. 
+ # ----\ + # //1-2\ 7-8\ + # 0-3-4-x 10-11-12 + # \ \5/ 9 / \__/ + # 6--/ + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1), prevs='0') + net.add('2', FCLayer(1, 1)) + net.add('3', FCLayer(1, 1), prevs='0') + net.add('4', FCLayer(1, 1), prevs='3') + net.add('5', FCLayer(1, 1), prevs='3') + net.add('6', FCLayer(1, 1), prevs='0') + net.add('7', FCLayer(5, 1), prevs=('0', '2', '4', '5', '6')) + net.add('8', FCLayer(1, 1)) + net.add('9', FCLayer(5, 1), prevs=('0', '2', '4', '5', '6')) + net.add('10', FCLayer(2, 1), prevs=('8', '9')) + net.add('11', FCLayer(1, 1)) + net.add('12', FCLayer(2, 1), prevs=('10', '11')) + self.net[net.net_name] = net + + net = Network('net6') + # Fmap sizes. + net.set_input_layer(InputLayer(1, 24)) + net.add('0', ConvLayer(1, 1, 24, 3)) + net.add('1', ConvLayer(1, 1, 12, 3, strd=2)) + net.add('1p', PoolingLayer(1, 6, 2)) + net.add('2', ConvLayer(1, 1, 6, 3)) + net.add('3', ConvLayer(1, 1, 6, 3, strd=4), prevs=('0')) + self.net[net.net_name] = net + + net = Network('net7') + # Topological order: see a visited vertex again. + # /--- + # 0-1-\\ + # \2--2p + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1), prevs='0') + net.add('2', FCLayer(1, 1), prevs='0') + net.add('2p', PoolingLayer(3, 1, 1), prevs=('0', '1', '2')) + self.net[net.net_name] = net + + net = Network('net8') + # Forward to the middle. + # /-\ + # 0-1-2-2p-4-4p + # \-3------/ + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(1, 1), prevs='0') + net.add('2', FCLayer(1, 1), prevs='1') + net.add('2p', PoolingLayer(2, 1, 1), prevs=('1', '2')) + net.add('3', FCLayer(1, 1), prevs='0') + net.add('4', FCLayer(2, 1), prevs='2p') + net.add('4p', PoolingLayer(2, 1, 1), prevs=('3', '4')) + self.net[net.net_name] = net + + net = Network('net9') + # Previous layers include input and others. + net.set_input_layer(InputLayer(1, 1)) + net.add('0', FCLayer(1, 1)) + net.add('1', FCLayer(2, 1), prevs=(net.INPUT_LAYER_KEY, '0')) + self.net[net.net_name] = net + + # Real networks. + for net_name in all_networks(): + self.net[net_name] = import_network(net_name) + + self.batch_size = 16 + + self.resource = Resource( + proc_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(8, 8), + type=NodeRegion.PROC), + dram_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(8, 8), + type=NodeRegion.DRAM), + src_data_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(8, 4), + type=NodeRegion.DRAM), + dst_data_region=NodeRegion(origin=PhyDim2(0, 4), dim=PhyDim2(8, 4), + type=NodeRegion.DRAM), + dim_array=PhyDim2(16, 16), size_gbuf=65536, size_regf=64, + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + + part = PartitionScheme(order=range(pe.NUM), pdims=[(1, 1)] * pe.NUM) + self.ofmap_layout = DataLayout( + frngs=(FmapRange((0, 0, 0, 0), (2, 4, 16, 16)),), + regions=(NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), + type=NodeRegion.DRAM),), + parts=(part,)) + + + def _make_ilp(self, network): + ''' Make an InterLayerPipeline instance. ''' + return InterLayerPipeline(network, self.batch_size, self.resource) + + def _make_segment(self, vseg, network, temporal=False, max_util_drop=None, + with_opt=True): + ''' Convert vertex segment to (layer) segment. 
''' + kwargs = {} + if max_util_drop is not None: + kwargs['max_util_drop'] = max_util_drop + if not with_opt: + kwargs['with_opt'] = False + ilp = self._make_ilp(network) + seg = tuple(ilp.dag_vertex_list[vidx] for vidx in vseg) + if temporal: + seg = (sum(seg, tuple()),) + return PipelineSegment(seg, ilp.network, ilp.batch_size, ilp.resource, + **kwargs) + + def _make_sched_res(self, sched_seq, time, top_ti=1, top_to=1, top_tb=1, + top_ord=range(le.NUM), dram_time=0, num_nodes=4): + scheme = OrderedDict() + scheme['cost'] = 1.234 + 9.876 + scheme['time'] = max(time, dram_time) + scheme['num_nodes'] = num_nodes + scheme['proc_time'] = time + scheme['bus_time'] = 0 + scheme['dram_time'] = dram_time + scheme['ti'] = [top_ti, 1, 1] + scheme['to'] = [top_to, 1, 1] + scheme['tb'] = [top_tb, 1, 1] + scheme['tvals'] = [[top_ti, top_to, top_tb], [1] * 3, [1] * 3] + scheme['orders'] = [top_ord, range(le.NUM), range(le.NUM)] + return SchedulingResult(scheme=scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=sched_seq) + + def _gen_all_segment(self, network, **kwargs): + ''' + Generate all segments directly from all layers and all vertex segments. + ''' + # pylint: disable=protected-access + ilp = self._make_ilp(network) + for layer in network: + yield PipelineSegment(((layer,),), ilp.network, ilp.batch_size, + ilp.resource) + for vseg in ilp._gen_vseg(): + segment = self._make_segment(vseg, network, **kwargs) + if len(segment) == 1 and len(segment[0]) == 1: + continue + yield segment + + def _validate_allocation(self, segment, allocation): + ''' Validate segment resource allocation. ''' + + # Match segment. + self.assertEqual(len(allocation), len(segment)) + for ltpl, rtpl in zip(segment, allocation): + self.assertEqual(len(rtpl), len(ltpl)) + self.assertTrue(all(isinstance(r, Resource) for r in rtpl)) + + # Number of nodes. + nodes = [] # number of nodes. + for rtpl in allocation: + nodes.append(rtpl[0].proc_region.dim.size()) + self.assertEqual(sum(nodes), self.resource.proc_region.dim.size()) + + # Temporal schedules share processing region; spatial schedules use + # non-overlapped processing regions. + used_proc_nodes = set() # used processing nodes + for rtpl in allocation: + proc_region = rtpl[0].proc_region + self.assertTrue(all(r.proc_region == proc_region for r in rtpl)) + for n in proc_region.iter_node(): + self.assertTrue(self.resource.proc_region.contains_node(n), + '_validate_allocation: node {} outside of ' + 'the processing region {}' + .format(n, self.resource.proc_region)) + self.assertNotIn(n, used_proc_nodes, + '_validate_allocation: node {} has been ' + 'used.'.format(n)) + used_proc_nodes.add(n) + + # Data liveness. + data_regions = {} # layers that have data currently on-chip + for ltpl, rtpl in zip(segment, allocation): + + for l, r in zip(ltpl, rtpl): + + # Check data source. + prev_layers = segment.network.prevs(l) + + for pl in prev_layers: + if pl not in data_regions: + # Previous layer is not on-chip, from memory. + # Try find a layer responsible to fetch shared mem src. + try: + sh_sp_idx = next((i for i in range(len(allocation)) + if allocation[i][0].proc_region + == r.src_data_region)) + except StopIteration: + # No shared mem src. + self.assertEqual( + r.src_data_region, + self.resource.src_data_region, + '_validate_allocation: layer {}\'s prev {} ' + 'is not on-chip, should be from {}, but {}.' + .format(l, pl, self.resource.src_data_region, + r.src_data_region)) + else: + # There exists shared mem src. 
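+                            # (All sharers must have identical prevs, so one
+                            # fetch can serve the whole sharing group.)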
+ sh_l = segment[sh_sp_idx][0] + self.assertEqual(segment.network.prevs(l), + segment.network.prevs(sh_l), + '_validate_allocation: layer {} ' + 'expects on-chip mem src sharing ' + 'with {}, but prevs differ.' + .format(l, sh_l)) + elif data_regions[pl] != r.proc_region: + # Previous layer is on-chip and not local. + self.assertEqual( + r.src_data_region, data_regions[pl], + '_validate_allocation: layer {}\'s prev {} ' + 'is on-chip, should be from {}, but {}.' + .format(l, pl, data_regions[pl], + r.src_data_region)) + + # Update data based on destination. + # Local or store back to memory. Both will be available on-chip. + self.assertTrue(r.dst_data_region == r.proc_region + or r.dst_data_region + == self.resource.dst_data_region, + '_validate_allocation: data can only ' + 'be local or storing back to mem.') + # Overwrite. + local_node_set = set(r.proc_region.iter_node()) + data_regions = {pl: data_regions[pl] for pl in data_regions + if local_node_set.isdisjoint( + data_regions[pl].iter_node())} + data_regions[l] = r.proc_region + + def _validate_constraint(self, segment, constraint): + ''' Validate segment scheduling constraint. ''' + # pylint: disable=too-many-branches + + # Match segment. + self.assertEqual(len(constraint), len(segment)) + for ltpl, ctpl in zip(segment, constraint): + self.assertEqual(len(ctpl), len(ltpl)) + self.assertTrue(all(isinstance(c, SchedulingConstraint) + for c in ctpl)) + + # Same top tb. + top_tb = constraint[0][0].topbat + self.assertTrue(all(c.topbat == top_tb + for ctpl in constraint for c in ctpl)) + + # Top tb is a factor of batch size. + if top_tb: + self.assertEqual((segment.batch_size) % top_tb, 0) + + # Data availability. + + seg_layers = set(l for ltpl in segment for l in ltpl) + + class OutAccPat(object): + ''' Output data access pattern types. ''' + # pylint: disable=too-few-public-methods + ANY = 0 # can access in any way + DBF = -1 # must double-buffer + # SEQ: use any positive value to represent sequential access with + # certain number of groups. + + # Available data in each spatial subregions. Each is represented by a + # tuple of layer name and its output data access pattern. + avail_data = [(None, OutAccPat.ANY) for _ in segment] + + # Get groups of layers sharing the same memory source. + prevs2layers = {} + for ltpl in segment: + l = ltpl[0] + prevs2layers.setdefault(segment.network.prevs(l), []).append(l) + sh_mem_src_groups = [ls for ps, ls in prevs2layers.items() + if not seg_layers.intersection(ps) and len(ls) > 1] + sh_mem_src_topifms = [None] * len(sh_mem_src_groups) + + # Whether to defer fully buffering output. + fb_out = False + fb_out_conv = None + + for sp_idx, (ltpl, ctpl) in enumerate(zip(segment, constraint)): + + self.assertFalse(fb_out, + '_validate_constraint: deferring fully buffering ' + 'from {} should not cross spatial scheduling {}.' + .format(fb_out_conv, sp_idx - 1)) + + for tm_idx, (layer, cstr) in enumerate(zip(ltpl, ctpl)): + + # Source data and their access patterns. + prev_layers = segment.network.prevs(layer) + prev_oaps = [] + for pl in prev_layers: + if pl not in seg_layers: + # Off-chip sources. + poap = OutAccPat.ANY + elif pl in ltpl: + # On-chip and local. + self.assertEqual(avail_data[sp_idx][0], pl, + '_validate_constraint: layer {} ({}) ' + 'local source data {} not available, ' + 'maybe not the immediate previous.' + .format(layer, (sp_idx, tm_idx), pl)) + poap = avail_data[sp_idx][1] + else: + # On-chip and neighbor. 
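+                    # (Search earlier spatial subregions for the producer's
+                    # currently buffered output access pattern.)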
+ poap = next((avail_data[p_sp_idx][1] + for p_sp_idx in range(sp_idx) + if avail_data[p_sp_idx][0] == pl), + None) + self.assertFalse(poap is None, + '_validate_constraint: layer {} ({}) ' + 'neighbor source data {} not ' + 'available on-chip.' + .format(layer, (sp_idx, tm_idx), pl)) + prev_oaps.append(poap) + # Only buffer input if having source on-chip. + has_src = not seg_layers.isdisjoint(prev_layers) + + # The single SEQ source. + seq = None + # str is greater than all numbers, see + # https://docs.python.org/2/library/stdtypes.html#comparisons + seq_prev_oaps = [poap for poap in prev_oaps if poap > 0] + if seq_prev_oaps: + self.assertEqual(len(seq_prev_oaps), 1, + '_validate_constraint: layer {} ({}) ' + 'has multiple SEQ input.' + '\nsrcs: {}, oaps: {}' + .format(layer, (sp_idx, tm_idx), + prev_layers, prev_oaps)) + seq = seq_prev_oaps[0] + + # Destination data. + # Only buffer output if having destination on-chip. + next_layers = segment.network.nexts(layer) + has_dst = not seg_layers.isdisjoint(next_layers) + + # Validation. + + for g_idx, group in enumerate(sh_mem_src_groups): + if layer in group: + if sh_mem_src_topifms[g_idx] is None: + sh_mem_src_topifms[g_idx] = cstr.topifm + self.assertEqual(sh_mem_src_topifms[g_idx], cstr.topifm, + '_validate_constraint: layer {} ({}) ' + 'share memory source with {}, but has ' + 'mismatched topifm {} with {}.' + .format(layer, (sp_idx, tm_idx), + group, cstr.topifm, + sh_mem_src_topifms[g_idx])) + break + else: + if not has_src: + self.assertEqual(cstr.topifm, 0, + '_validate_constraint: layer {} ({}) ' + 'should not constrain input as it ' + 'does not have on-chip sources.' + .format(layer, (sp_idx, tm_idx))) + + if isinstance(segment.network[layer], ConvLayer): + + self.assertFalse(fb_out, + '_validate_constraint: deferring fully ' + 'buffering from {} has not been realized.' + .format(fb_out_conv)) + + if any(pl in ltpl for pl in prev_layers): + # Local source. + lcl_poap = avail_data[sp_idx][1] + self.assertTrue(lcl_poap == OutAccPat.DBF + or lcl_poap == OutAccPat.ANY, + '_validate_constraint: layer {} ({}) ' + 'local source data {} must fully ' + 'buffer output.' + .format(layer, (sp_idx, tm_idx), + lcl_poap)) + + # DBF source. + if OutAccPat.DBF in prev_oaps: + # Must fully buffer CONV input. + self.assertEqual(cstr.topifm, 1, + '_validate_constraint: layer {} ({}) ' + 'input is not fully buffered but has ' + 'DBF source.\nsrcs: {}, oaps: {}' + '\n{}' + .format(layer, (sp_idx, tm_idx), + prev_layers, prev_oaps, + cstr)) + + # SEQ source. + if seq and has_dst: + # Cannot be lazily updated. + self.assertNotIsInstance( + seq, str, + '_validate_constraint: CONV layer {} ({}) cannot ' + 'use lazy update (from {})' + .format(layer, (sp_idx, tm_idx), seq)) + # Must match SEQ. + self.assertEqual(cstr.topifm, seq, + '_validate_constraint: layer {} ({}) ' + 'input groups ({}) and its SEQ src ' + 'output groups ({}) are mismatched.' + '\nsrcs: {}, oaps: {}' + .format(layer, (sp_idx, tm_idx), + cstr.topifm, seq, + prev_layers, prev_oaps)) + # Also must fully buffer CONV output. + self.assertEqual(cstr.topofm, 1, + '_validate_constraint: layer {} ({}) ' + 'output is not fully buffered but has ' + 'SEQ source.\nsrcs: {}, oaps: {}' + .format(layer, (sp_idx, tm_idx), + prev_layers, prev_oaps)) + # Deferred apply to the last layer in the group. + fb_out = True + fb_out_conv = layer + + oap = None + if cstr.topofm == 1: + if cstr.topifm == 1: + # Fully buffer both, can access output in any way. 
+                            # This is fine, as we require buffering either
+                            # input or output for CONV (see below).
+                            oap = OutAccPat.ANY
+                        else:
+                            oap = OutAccPat.DBF
+                    elif has_dst and cstr.topofm > 0:
+                        oap = cstr.topofm
+                        if has_src:
+                            self.assertEqual(cstr.topifm, 1,
+                                             '_validate_constraint: layer {} '
+                                             '({}) has on-chip src and dst '
+                                             'but neither input nor output '
+                                             'are fully buffered.\ncstr: {}.'
+                                             .format(layer, (sp_idx, tm_idx),
+                                                     cstr))
+                    elif has_dst:
+                        # Lazy update, record layer name as seq.
+                        oap = layer
+
+                else:
+
+                    # SEQ source.
+                    if seq and has_dst:
+                        # Must match SEQ, or fully buffer output.
+                        self.assertTrue(cstr.topofm == seq or cstr.topofm == 1
+                                        or seq in cstr.update_dict,
+                                        '_validate_constraint: layer {} ({}) '
+                                        'output is not fully buffered, and '
+                                        'groups ({}) and its SEQ src output '
+                                        'groups ({}) are mismatched, and '
+                                        'lazy update is not used.'
+                                        '\nsrcs: {}, oaps: {}'
+                                        .format(layer, (sp_idx, tm_idx),
+                                                cstr.topofm, seq,
+                                                prev_layers, prev_oaps))
+
+                    if cstr.topofm == 1:
+                        # Fully buffer output.
+                        oap = OutAccPat.DBF
+                    elif isinstance(seq, str):
+                        # Lazy update.
+                        oap = seq
+                    else:
+                        # SEQ output.
+                        oap = cstr.topofm
+
+                # Realize deferred fully buffering output.
+                if cstr.topofm == 1:
+                    fb_out = False # reset
+
+                # Overwrite the previous temporal scheduling.
+                avail_data[sp_idx] = (layer, oap)
+
diff --git a/nn_dataflow/tests/pipeline_test/test_pipeline_segment.py b/nn_dataflow/tests/pipeline_test/test_pipeline_segment.py
new file mode 100644
index 0000000..3635dc8
--- /dev/null
+++ b/nn_dataflow/tests/pipeline_test/test_pipeline_segment.py
@@ -0,0 +1,683 @@
+""" $lic$
+Copyright (C) 2016-2019 by The Board of Trustees of Stanford University
+
+This program is free software: you can redistribute it and/or modify it under
+the terms of the Modified BSD-3 License as published by the Open Source
+Initiative.
+
+This program is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
+PARTICULAR PURPOSE. See the BSD-3 License for more details.
+
+You should have received a copy of the Modified BSD-3 License along with this
+program. If not, see <https://opensource.org/licenses/BSD-3-Clause>.
+"""
+
+import itertools
+
+from nn_dataflow.core import ConvLayer
+from nn_dataflow.core import NodeRegion
+from nn_dataflow.core import PhyDim2
+from nn_dataflow.core import PipelineSegment
+from nn_dataflow.core import PipelineSegmentTiming
+
+from . import TestPipelineFixture
+
+class TestPipelineSegment(TestPipelineFixture):
+    ''' Tests for PipelineSegment. '''
+
+    # pylint: disable=too-many-public-methods
+
+    def test_valid_args(self):
+        ''' Valid arguments. '''
+        segment = PipelineSegment((('0',), ('1', '1p')),
+                                  self.net['net1'], self.batch_size,
+                                  self.resource)
+        self.assertTrue(segment.valid)
+        self.assertTupleEqual(segment.seg, (('0',), ('1', '1p')))
+        self.assertIs(segment.network, self.net['net1'])
+        self.assertEqual(segment.batch_size, self.batch_size)
+        self.assertIs(segment.resource, self.resource)
+
+    def test_invalid_seg(self):
+        ''' Invalid seg. '''
+        with self.assertRaisesRegexp(TypeError,
+                                     'PipelineSegment: .*seg.*tuple.*'):
+            _ = PipelineSegment([('0',), ('1', '1p')],
+                                self.net['net1'], self.batch_size,
+                                self.resource)
+
+        with self.assertRaisesRegexp(TypeError,
+                                     'PipelineSegment: .*seg.*sub-tuple.*'):
+            _ = PipelineSegment(('0', '1', '1p'),
+                                self.net['net1'], self.batch_size,
+                                self.resource)
+
+    def test_invalid_network(self):
+        ''' Invalid network. 
''' + with self.assertRaisesRegexp(TypeError, + 'PipelineSegment: .*network.*'): + _ = PipelineSegment((('0',), ('1', '1p')), + self.net['net1'].input_layer(), self.batch_size, + self.resource) + + def test_invalid_resource(self): + ''' Invalid resource. ''' + with self.assertRaisesRegexp(TypeError, + 'PipelineSegment: .*resource.*'): + _ = PipelineSegment((('0',), ('1', '1p')), + self.net['net1'], self.batch_size, + PhyDim2(1, 1)) + + def test_init_deps_not_valid(self): + ''' Not valid segment due to init deps. ''' + + # Not utilize local data. + segment = self._make_segment((0, 1), self.net['net3'], temporal=True) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Local data not available. + segment = self._make_segment((10, 11, 12), self.net['net5'], + temporal=True) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Multiple neighbor source in one spatial scheduling. + segment = self._make_segment((1, 2), self.net['net8']) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Both memory source and neighbor source. + segment = self._make_segment((13, 14), self.net['net4']) + self.assertFalse(segment.valid) + self.assertFalse(hasattr(segment, 'alloc')) + + # Valid cases. + + # Both memory destination and neighbor destination. + segment = self._make_segment((7, 8), self.net['net4']) + self.assertTrue(segment.valid) + + def test_init_deps_not_opt(self): + ''' Init deps for segment not with opt. ''' + + # Multiple on-chip sources. + segment = self._make_segment((3, 4), self.net['net8']) + self.assertTrue(segment.valid) + segment = self._make_segment((3, 4), self.net['net8'], with_opt=False) + self.assertFalse(segment.valid) + + # Multiple on-chip destinations. + segment = self._make_segment((4, 5, 6), self.net['net4']) + self.assertTrue(segment.valid) + segment = self._make_segment((4, 5, 6), self.net['net4'], + with_opt=False) + self.assertFalse(segment.valid) + + # Multiple linear chains. + segment = self._make_segment((5, 6), self.net['net4']) + self.assertTrue(segment.valid) + segment = self._make_segment((5, 6), self.net['net4'], with_opt=False) + self.assertFalse(segment.valid) + + def test_alloc_not_valid(self): + ''' Not valid segment due to alloc. ''' + + segment = self._make_segment((0, 1), self.net['net1'], + max_util_drop=0.01) + self.assertFalse(segment.valid) + + def test_as_sequence(self): + ''' As a sequence. ''' + segment = self._make_segment((0, 1), self.net['net1']) + self.assertTrue(segment.valid) + + self.assertSequenceEqual(segment, segment.seg) + self.assertTupleEqual(tuple(segment), segment.seg) + + for ltpl in segment: + for layer in ltpl: + self.assertIn(layer, self.net['net1']) + + def test_equal(self): + ''' Equality. ''' + seg1 = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.1) + seg2 = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.01) + seg3 = self._make_segment((0, 1), self.net['net1'], temporal=True) + self.assertNotEqual(seg1, seg2) + self.assertNotEqual(seg1, seg3) + + seg4 = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.1) + self.assertEqual(seg1, seg4) + + net = self.net['net1'] + self.assertSetEqual(set(self._gen_all_segment(net)), + set(itertools.chain(self._gen_all_segment(net), + self._gen_all_segment(net)))) + + def test_repr(self): + ''' __repr__. 
''' + seg = self._make_segment((0, 1), self.net['net1'], max_util_drop=0.1) + str_ = repr(seg) + self.assertIn(repr(seg.seg), str_) + self.assertIn(repr(seg.resource), str_) + self.assertIn(repr(seg.max_util_drop), str_) + + def test_alloc_proc(self): + ''' _alloc_proc. ''' + # pylint: disable=protected-access + + net = self.net['net1'] + self.assertListEqual([net[l].total_ops() for l in net], + [200, 600, 30, 1200, 2000]) + + ilp = self._make_ilp(net) + + # Single vertex. + + for idx in range(len(ilp.dag_vertex_list)): + segment = self._make_segment((idx,), ilp.network) + psr = segment._alloc_proc() + + self.assertEqual(len(psr), 1) + self.assertTupleEqual(psr[0].origin, (0, 0)) + self.assertTupleEqual(psr[0].dim, self.resource.proc_region.dim) + self.assertEqual(psr[0].type, NodeRegion.PROC) + + # Multiple vertices. + + psr = self._make_segment((0, 1), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertListEqual(nodes, [16, 48]) + + psr = self._make_segment((2, 3), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertListEqual(nodes, [24, 40]) + + psr = self._make_segment((1, 2), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertTrue(nodes == [24, 40] or nodes == [22, 42]) + + psr = self._make_segment((1, 2, 3), net)._alloc_proc() + nodes = [nr.dim.size() for nr in psr] + self.assertTrue(nodes == [12, 20, 32] or nodes == [10, 20, 34]) + + # All segments. + + def _check_all_segment(ilp): + for vseg in ilp._gen_vseg(): + segment = self._make_segment(vseg, ilp.network) + psr = segment._alloc_proc() + if psr is None: + continue + + # Utilization. + nodes = [nr.dim.size() for nr in psr] + ops = [sum(ilp.network[l].total_ops() for l in ltpl) + for ltpl in segment] + self.assertEqual(len(nodes), len(ops)) + time = max(o * 1. / n for o, n in zip(ops, nodes)) + max_ops = time * sum(nodes) + real_ops = sum(ops) + self.assertGreaterEqual(real_ops / max_ops, 0.9) + + _check_all_segment(ilp) + + for net_name in ['zfnet', 'net3']: + net = self.net[net_name] + ilp = self._make_ilp(net) + _check_all_segment(ilp) + + def test_allocation(self): + ''' allocation(). ''' + + # Single vertex. + + net = self.net['net1'] + ilp = self._make_ilp(net) + for idx in range(len(ilp.dag_vertex_list)): + segment = self._make_segment((idx,), ilp.network) + alloc = segment.allocation() + self.assertIsNotNone(alloc) + self._validate_allocation(segment, alloc) + + # Linear networks. + + for net_name in ['net1', 'net2']: + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + # This is a linear network structure. + rlist = sum(alloc, tuple()) + + # The data source of all layers except for the first in the + # segment should be previous processing regions. + for r in rlist[1:]: + self.assertEqual(r.src_data_region.type, NodeRegion.PROC, + 'test_segment_allocation: ' + 'data source should be PROC region.') + + # The data destination of all layers except for the last in the + # segment should be local. + for r in rlist[:-1]: + self.assertEqual(r.dst_data_region.type, NodeRegion.PROC, + 'test_segment_allocation: ' + 'data destination should be PROC region.') + + # Complex networks. + + for net_name in ['net3', 'net4', 'net5']: + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + # Real networks. 
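+        # The remaining fixture entries (e.g., zfnet) are real NN
+        # definitions, as opposed to the synthetic 'net*' test networks.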
+ + for net_name in self.net: + + if net_name.startswith('net'): + continue + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + def test_allocation_sh_mem_src(self): + ''' allocation() shared mem src. ''' + + net = self.net['net3'] + + segment = self._make_segment((6, 7, 8, 9), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertEqual(alloc[3][0].src_data_region, alloc[0][0].proc_region) + + segment = self._make_segment((6, 7, 8, 9), net, with_opt=False) + self.assertFalse(segment.valid) + + net = self.net['net5'] + + segment = self._make_segment((1, 2, 3), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertEqual(alloc[2][0].src_data_region, alloc[0][0].proc_region) + + segment = self._make_segment((1, 2, 3), net, with_opt=False) + self.assertFalse(segment.valid) + + net = self.net['net4'] + + segment = self._make_segment((8, 9), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertEqual(alloc[1][0].src_data_region, alloc[0][0].proc_region) + + segment = self._make_segment((8, 9), net, with_opt=False) + self.assertFalse(segment.valid) + + def test_allocation_temp(self): + ''' allocation() temporal. ''' + + for net in self.net.values(): + + for segment in self._gen_all_segment(net, temporal=True): + + alloc = segment.allocation() + if alloc is None: + continue + + self._validate_allocation(segment, alloc) + + def test_allocation_no_time_mux(self): + ''' allocation() no_time_mux. ''' + net = self.net['net2'] + + segment = self._make_segment(tuple(range(16)), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertTrue(all(r.no_time_mux for rtpl in alloc for r in rtpl)) + + segment = self._make_segment(tuple(range(8)), net) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertFalse(any(r.no_time_mux for rtpl in alloc for r in rtpl)) + + segment = self._make_segment(tuple(range(16)), net, temporal=True) + self.assertTrue(segment.valid) + + alloc = segment.allocation() + self.assertFalse(any(r.no_time_mux for rtpl in alloc for r in rtpl)) + + def test_allocation_invalid(self): + ''' allocation() for invalid segment. ''' + segment = self._make_segment((0, 1), self.net['net3'], temporal=True) + self.assertFalse(segment.valid) + self.assertIsNone(segment.allocation()) + + def test_gen_constraint(self): + ''' gen_constraint(). ''' + + # Single vertex. + + for net_name in self.net: + + net = self.net[net_name] + ilp = self._make_ilp(net) + + for idx in range(len(ilp.dag_vertex_list)): + segment = self._make_segment((idx,), ilp.network) + self.assertTrue(segment.valid) + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # No top loop constraint for single-layer segment. + if len(constraint) == 1 and len(constraint[0]) == 1: + for c in itertools.chain.from_iterable(constraint): + self.assertTrue(c.topifm == 0 and c.topofm == 0 + and c.topbat == 0) + + # Spatial pipelining. + + for net_name in self.net: + + if not net_name.startswith('net') and net_name != 'zfnet': + continue + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + if not segment.valid: + continue + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # Special cases. 
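+        # Manually build a two-stage segment over net2, presumably not among
+        # the generated ones above, and check its constraints still validate.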
+ + net = self.net['net2'] + + segment = PipelineSegment((('0', '1'), ('2', '3')), net, + self.batch_size, self.resource) + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + def test_gen_constraint_fbofm_init(self): + ''' gen_constraint() deciding fbofm_init. ''' + + net = self.net['zfnet'] + + # Two spatial, fbofm_init == False. + segment = PipelineSegment((('fc2',), ('fc3',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + + # Two spatial, fbofm_init == False. + segment = PipelineSegment((('conv5', 'pool3'), ('fc1',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[0][1].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + + # Four spatial, fbofm_init == False. + segment = PipelineSegment((('conv1', 'pool1'), ('conv2', 'pool2'), + ('conv3',), ('conv4',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[0][1].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + self.assertTrue(segment.cstr_symargs[1][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[1][1]['fbofm']) + self.assertTrue(segment.cstr_symargs[2][0]['fbifm']) + self.assertFalse(segment.cstr_symargs[2][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[3][0].get('fbifm', False)) + + # Three spatial, fbofm_init == False. + segment = PipelineSegment((('conv4',), ('conv5', 'pool3'), ('fc1',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + self.assertTrue(segment.cstr_symargs[1][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[1][1]['fbofm']) + self.assertTrue(segment.cstr_symargs[2][0]['fbifm']) + + # Three spatial, fbofm_init == False. + segment = PipelineSegment((('conv2', 'pool2'), ('conv3',), ('conv4',)), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertFalse(segment.cstr_symargs[0][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[0][1].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[1][0].get('fbifm', False)) + self.assertTrue(segment.cstr_symargs[1][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[2][0]['fbifm']) + + # Three spatial, fbofm_init == True. + segment = PipelineSegment((('conv3',), ('conv4',), ('conv5', 'pool3')), + net, self.batch_size, self.resource) + self.assertTrue(segment.valid) + self.assertTrue(segment.cstr_symargs[0][0]['fbofm']) + self.assertTrue(segment.cstr_symargs[1][0]['fbifm']) + self.assertFalse(segment.cstr_symargs[1][0].get('fbofm', False)) + self.assertFalse(segment.cstr_symargs[2][0].get('fbifm', False)) + + def test_gen_constraint_sh_mem_src(self): + ''' gen_constraint() shared mem src. ''' + + net = self.net['net3'] + + segment = self._make_segment((6, 7, 8, 9), net) + self.assertTrue(segment.valid) + + # 0 and 3 share memory source. 
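+        # Layers sharing a memory source must read ifmaps with matching
+        # top-level grouping, so their topifm values are asserted equal below.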
+ for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + self.assertEqual(constraint[3][0].topifm, constraint[0][0].topifm) + self.assertTrue(constraint[3][0].topifm <= 1 + or constraint[3][0].topofm <= 1) + self.assertTrue(constraint[0][0].topifm <= 1 + or constraint[0][0].topofm <= 1) + + net = self.net['net5'] + + segment = self._make_segment((1, 2, 3), net) + self.assertTrue(segment.valid) + + # 0 and 2 share memory source. + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # 0 constrains topofm. + self.assertNotEqual(constraint[0][0].topofm, 0) + + # Must fully buffer ifmaps. + self.assertEqual(constraint[2][0].topifm, 1) + self.assertEqual(constraint[0][0].topifm, 1) + + net = self.net['net4'] + + segment = self._make_segment((8, 9), net) + self.assertTrue(segment.valid) + + # 0 and 1 share memory source. + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + # No topofm constraint. + self.assertEqual(constraint[0][0].topofm, 0) + self.assertEqual(constraint[1][0].topofm, 0) + + self.assertEqual(constraint[1][0].topifm, constraint[0][0].topifm) + + def test_gen_constraint_temporal(self): + ''' gen_constraint() temporal. ''' + + for net_name in self.net: + + net = self.net[net_name] + + for segment in self._gen_all_segment(net, temporal=True): + if not segment.valid: + continue + + for constraint, _ in segment.gen_constraint(): + self._validate_constraint(segment, constraint) + + def test_gen_constraint_hints(self): + ''' gen_constraint() pruning hints. ''' + + # Use ZFNet to give the real fmap dimensions. + net_name = 'zfnet' + + net = self.net[net_name] + + for segment in self._gen_all_segment(net): + if not segment.valid: + continue + + hints_set = set() + last_hints = None + + for _, hints in segment.gen_constraint(): + + self.assertTrue(all(isinstance(h, int) and h > 0 + for h in hints), + 'test_gen_constraint_hints: ' + 'all hints should be positive integers only. ' + '{}'.format(hints)) + + self.assertTrue(all( + not all(h < ph for h, ph in zip(hints, phints)) + for phints in hints_set), + 'test_gen_constraint_hints: ' + 'smaller hints are generated too late.') + + if last_hints: + self.assertGreater(hints, last_hints, + 'test_gen_constraint_hints: ' + 'hints should be generated from small ' + 'to large.') + last_hints = hints + + def test_gen_constraint_max_ovhd(self): + ''' gen_constraint() with max_time_overhead. 
''' + + def _make_key(constraint): + return tuple((c.topifm, c.topofm, c.topbat) + for c in itertools.chain.from_iterable(constraint)) + + net = self.net['zfnet'] + + for segment in self._gen_all_segment(net): + if not segment.valid: + continue + + set_all = set() + set_1 = set() + set_5 = set() + + for constraint, _ in segment.gen_constraint(): + + timing = PipelineSegmentTiming(net, 0) + for sp_idx, (ltpl, ctpl) in enumerate(zip(segment, constraint)): + for tm_idx, (l, c) in enumerate(zip(ltpl, ctpl)): + res = self._make_sched_res((0, sp_idx, tm_idx), + 65536 // len(ltpl), + top_ti=c.topifm, + top_to=c.topofm, + top_tb=c.topbat) + timing.add(l, res) + + key = _make_key(constraint) + + set_all.add(key) + if timing.time_overhead <= 0.1: + set_1.add(key) + if timing.time_overhead <= 0.5: + set_5.add(key) + + for constraint, _ in segment.gen_constraint(max_time_overhead=0.1): + key = _make_key(constraint) + set_1.discard(key) + + self.assertFalse(set_1, + 'gen_constraint with max_time_overhead: ' + 'miss generating constraints with <= 0.1 ovhd:\n' + '{}'.format(set_1)) + + for constraint, _ in segment.gen_constraint(max_time_overhead=0.5): + key = _make_key(constraint) + set_5.discard(key) + + self.assertFalse(set_5, + 'gen_constraint with max_time_overhead: ' + 'miss generating constraints with <= 0.5 ovhd:\n' + '{}'.format(set_5)) + + def test_gen_constraint_not_opt(self): + ''' gen_constraint() not with opt. ''' + + def _validate_fully_buffered_constraint(segment, constraint): + layer2idx = dict((l, (sp_idx, tm_idx)) + for sp_idx, ltpl in enumerate(segment) + for tm_idx, l in enumerate(ltpl)) + seg_layers = set(layer2idx.keys()) + + for l, c in zip(itertools.chain.from_iterable(segment), + itertools.chain.from_iterable(constraint)): + + if not isinstance(net[l], ConvLayer): + continue + + onchip_prevs = seg_layers.intersection(net.prevs(l)) + if onchip_prevs: + self.assertEqual(c.topifm, 1) + for p in onchip_prevs: + sp_idx, tm_idx = layer2idx[p] + p_c = constraint[sp_idx][tm_idx] + self.assertEqual(p_c.topofm, 1) + + for net_name in self.net: + + net = self.net[net_name] + + # Spatial pipelining. + for segment in self._gen_all_segment(net, with_opt=False): + if not segment.valid: + continue + + for constraint, _ in segment.gen_constraint(): + _validate_fully_buffered_constraint(segment, constraint) + diff --git a/nn_dataflow/tests/pipeline_test/test_pipeline_segment_timing.py b/nn_dataflow/tests/pipeline_test/test_pipeline_segment_timing.py new file mode 100644 index 0000000..edf4291 --- /dev/null +++ b/nn_dataflow/tests/pipeline_test/test_pipeline_segment_timing.py @@ -0,0 +1,343 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +from nn_dataflow.core import InputLayer, FCLayer, PoolingLayer +from nn_dataflow.core import Network +from nn_dataflow.core import PipelineSegmentTiming + +from . import TestPipelineFixture + +class TestPipelineSegmentTiming(TestPipelineFixture): + ''' Tests for PipelineSegmentTiming. 
''' + + def setUp(self): + super(TestPipelineSegmentTiming, self).setUp() + + self.net1 = self.net['net1'] + + self.net4 = self.net['net4'] + + self.netlr = Network('net1') + self.netlr.set_input_layer(InputLayer(10, 1)) + self.netlr.add('0p1', PoolingLayer(10, 1, 1)) + self.netlr.add('0p2', PoolingLayer(10, 1, 1)) + self.netlr.add('0p3', PoolingLayer(10, 1, 1)) + self.netlr.add('1', FCLayer(10, 20)) + + def test_valid_args(self): + ''' Valid arguments. ''' + timing = PipelineSegmentTiming(self.net1, 3) + self.assertIs(timing.network, self.net1) + self.assertEqual(timing.seg_idx, 3) + + def test_invalid_network(self): + ''' Invalid network. ''' + with self.assertRaisesRegexp(TypeError, + 'PipelineSegmentTiming: .*network.*'): + _ = PipelineSegmentTiming(self.net1.input_layer(), 3) + + def test_add(self): + ''' add(). ''' + # No fused. + + timing = PipelineSegmentTiming(self.net1, 3) + + timing.add('0', self._make_sched_res((3, 0, 0), 123, + top_to=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 0, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 3) + + timing.add('1', self._make_sched_res((3, 1, 0), 141, + top_ti=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 1) + + timing.add('1p', self._make_sched_res((3, 1, 1), 12, + top_ti=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 1)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 1) + + self.assertEqual(timing.bat_ngrp, 2) + self.assertEqual(len(timing.timing_list), 2) + self.assertEqual(len(timing.timing_list[0]), 1) + self.assertEqual(len(timing.timing_list[1]), 2) + + # Fused. + + timing = PipelineSegmentTiming(self.net1, 3) + + timing.add('0', self._make_sched_res((3, 0, 0), 123, + top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 0, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 1) + + timing.add('1', self._make_sched_res((3, 1, 0), 141, + top_to=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 0)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 3) + + timing.add('1p', self._make_sched_res((3, 1, 1), 12, + top_to=3, top_tb=2)) + self.assertTupleEqual(timing.last_sched_seq, (3, 1, 1)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 3) + + # Unmatched BAT group number. + + self.assertEqual(timing.bat_ngrp, 2) + timing.add('2', self._make_sched_res((3, 2, 0), 123, top_tb=4)) + self.assertEqual(timing.bat_ngrp, 1) + + def test_add_all_lr(self): + ''' add() all LocalRegionLayer. ''' + timing = PipelineSegmentTiming(self.netlr, 2) + + timing.add('0p1', self._make_sched_res((2, 0, 0), 40, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + timing.add('0p2', self._make_sched_res((2, 0, 1), 80, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + timing.add('0p3', self._make_sched_res((2, 0, 2), 60, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + timing.add('1', self._make_sched_res((2, 1, 0), 800, top_to=4)) + self.assertEqual(timing.timing_list[-1][-1].ngrp, 4) + + def test_add_invalid_sched_seq(self): + ''' add(), invalid sched seq. 
''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 123)) + + with self.assertRaisesRegexp(ValueError, + 'PipelineSegmentTiming: .*belong to.*'): + timing.add('1', self._make_sched_res((2, 1, 0), 123)) + + with self.assertRaisesRegexp(ValueError, + 'PipelineSegmentTiming: .*follow.*'): + timing.add('1p', self._make_sched_res((3, 1, 1), 123)) + + def test_add_already_in(self): + ''' add(), layer already in. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 123)) + with self.assertRaisesRegexp(ValueError, + 'PipelineSegmentTiming: .*layer 0.*'): + timing.add('0', self._make_sched_res((3, 1, 0), 123)) + + def test_time_bat_ngrp(self): + ''' time and critical_time bat_ngrp. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, top_tb=4)) + timing.add('1', self._make_sched_res((3, 1, 0), 130, top_tb=4)) + timing.add('1p', self._make_sched_res((3, 1, 1), 20, top_tb=4)) + timing.add('2', self._make_sched_res((3, 2, 0), 136, top_tb=4)) + self.assertEqual(timing.critical_time, 150) + self.assertEqual(timing.time, 120 // 4 + 130 + 20 + 136 // 4) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 130 + 20 + 136) / 3.) - 1) + + # Unmatched BAT group number. + timing.add('3', self._make_sched_res((3, 3, 0), 100, top_tb=2)) + self.assertEqual(timing.time, 120 + 130 + 20 + 136 + 100) + self.assertAlmostEqual(timing.time_overhead, + timing.time + / ((120 + 130 + 20 + 136 + 100) / 4.) - 1) + + def test_time_ifm_ofm_ngrp(self): + ''' time and critical_time ifm_ngrp and ofm_ngrp. ''' + + # Single-group wait, first critical. + + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_to=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 90, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 120) + # Layer 0 is critical. Layer 0 last BAT group starts at 120 - 120 // 2. + # Layer 1 last BAT group starts 120 // 2 // 3 later, which takes 90 // + # 2. + self.assertEqual(timing.time, + 120 - 120 // 2 + 120 // 2 // 3 + 90 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 90) / 2.) - 1) + + # Single-group wait, second critical. + + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_to=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 150, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + # Layer 1 is critical. Layer 1 first BAT group starts at 120 // 2 // 3, + # and takes 150 for all its BAT groups. + self.assertEqual(timing.time, 120 // 2 // 3 + 150) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150) / 2.) - 1) + + # All-group wait, first critical. + + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_to=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 90, + top_to=3, top_tb=2)) + self.assertEqual(timing.critical_time, 120) + self.assertEqual(timing.time, 120 + 90 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 90) / 2.) - 1) + + # All-group wait, second critical. 
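+        # Neither layer groups its ofmaps (only topifm is set), so layer 1
+        # waits for all of layer 0's first BAT group (120 // 2) to finish.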
+ + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_ti=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 150, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + self.assertEqual(timing.time, 120 // 2 + 150) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150) / 2.) - 1) + + def test_time_linear(self): + ''' time and critical_time linear. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, + top_ti=3, top_tb=2)) + timing.add('1', self._make_sched_res((3, 1, 0), 129, + top_to=3, top_tb=2)) + timing.add('1p', self._make_sched_res((3, 1, 1), 21, + top_to=3, top_tb=2)) + timing.add('2', self._make_sched_res((3, 2, 0), 138, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + # Layer 1 is critical. Layer 1+1p first BAT group starts at 120 // 2, + # and last BAT group starts at 150 // 2 later. Layer 2 last BAT group + # starts 150 // 2 // 3 later, and takes 138 // 2. + self.assertEqual(timing.time, + 120 // 2 + 150 // 2 + 150 // 2 // 3 + 138 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 129 + 21 + 138) / 3.) - 1) + + def test_time_branch(self): + ''' time and critical_time branch. ''' + + # Single-group wait. + + timing = PipelineSegmentTiming(self.net4, 3) + timing.add('6', self._make_sched_res((3, 0, 0), 120, + top_ti=3, top_tb=2)) + timing.add('7', self._make_sched_res((3, 1, 0), 150, + top_to=3, top_tb=2)) + timing.add('8', self._make_sched_res((3, 2, 0), 144, + top_ti=3, top_tb=2)) + timing.add('9', self._make_sched_res((3, 3, 0), 168, + top_ti=3, top_tb=2)) + self.assertEqual(timing.critical_time, 168) + # Layer 9 is critical. Layer 7 first BAT group starts at 120 // 2. + # Layer 8 and 9 first BAT group starts at 150 // 2 // 3 later, and all + # groups of layer 9 take 168. + self.assertEqual(timing.time, + 120 // 2 + 150 // 2 // 3 + 168) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150 + 144 + 168) / 4.) - 1) + + # All-group wait. + + timing = PipelineSegmentTiming(self.net4, 3) + timing.add('6', self._make_sched_res((3, 0, 0), 120, top_tb=2)) + timing.add('7', self._make_sched_res((3, 1, 0), 150, top_tb=2)) + timing.add('8', self._make_sched_res((3, 2, 0), 144, top_tb=2)) + timing.add('9', self._make_sched_res((3, 3, 0), 132, top_tb=2)) + self.assertEqual(timing.critical_time, 150) + # Layer 7 is critical. Layer 7 first BAT group starts at 120 // 2, and + # layer 7 last BAT group ends at 150 later, at which time layer 8 and 9 + # last BAT group starts, and takes 144 // 2. + self.assertEqual(timing.time, 120 // 2 + 150 + 144 // 2) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((120 + 150 + 144 + 132) / 4.) - 1) + + def test_time_all_lr(self): + ''' time and critical_time all LocalRegionLayer. ''' + timing = PipelineSegmentTiming(self.netlr, 2) + timing.add('0p1', self._make_sched_res((2, 0, 0), 40, + top_to=5, top_tb=2)) + timing.add('0p2', self._make_sched_res((2, 0, 1), 80, + top_to=5, top_tb=2)) + timing.add('0p3', self._make_sched_res((2, 0, 2), 60, + top_to=5, top_tb=2)) + timing.add('1', self._make_sched_res((2, 1, 0), 800, + top_ti=5, top_tb=2)) + self.assertEqual(timing.critical_time, 800) + # Layer 1 is critical. Layer 1 first BAT group starts at (40 + 80 + 60) + # // 2 // 5, and takes 800. 
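+        # The three pooling layers run in one spatial stage, so their times
+        # sum to 180 before dividing into BAT and ofmap groups.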
+ self.assertEqual(timing.time, (40 + 80 + 60) // 2 // 5 + 800) + self.assertAlmostEqual(timing.time_overhead, + timing.time / ((40 + 80 + 60 + 800) / 2.) - 1) + + def test_time_single_spatial(self): + ''' time and critical_time for single-spatial segment. ''' + + for net_name in self.net: + if not net_name.startswith('net'): + continue + net = self.net[net_name] + + for seg in self._gen_all_segment(net, temporal=True): + if not seg.valid: + continue + self.assertEqual(len(seg), 1) + + timing = PipelineSegmentTiming(net, 0) + for idx, layer in enumerate(seg[0]): + timing.add(layer, + self._make_sched_res((0, 0, idx), + (40 + idx * 7 % 3) * 16, + top_to=4, top_ti=4, + top_tb=4)) + + self.assertEqual(timing.critical_time, timing.time) + self.assertAlmostEqual(timing.time_overhead, 0.) + + def test_time_dram_time(self): + ''' time and critical_time dominated by DRAM time. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, dram_time=100, + top_ti=3, top_tb=4)) + timing.add('1', self._make_sched_res((3, 1, 0), 130, dram_time=140, + top_to=3, top_tb=4)) + timing.add('1p', self._make_sched_res((3, 1, 1), 20, dram_time=10, + top_to=3, top_tb=4)) + timing.add('2', self._make_sched_res((3, 2, 0), 138, dram_time=100, + top_ti=3, top_tb=4)) + self.assertEqual(timing.critical_time, 160) + self.assertEqual(timing.time, 100 + 140 + 10 + 100) + self.assertEqual(timing.dram_time, timing.time) + self.assertLess(timing.node_time, timing.time) + + def test_time_overhead(self): + ''' time_overhead. ''' + timing = PipelineSegmentTiming(self.net1, 3) + timing.add('0', self._make_sched_res((3, 0, 0), 120, num_nodes=4, + top_ti=3, top_tb=4)) + timing.add('1', self._make_sched_res((3, 1, 0), 130, num_nodes=6, + top_to=3, top_tb=4)) + timing.add('1p', self._make_sched_res((3, 1, 1), 20, num_nodes=6, + top_to=3, top_tb=4)) + timing.add('2', self._make_sched_res((3, 2, 0), 138, num_nodes=3, + top_ti=3, top_tb=4)) + + time_indv = 120 * 4 / 13. + (130 + 20) * 6 / 13. + 138 * 3 / 13. + self.assertAlmostEqual(timing.time_overhead, + timing.time / time_indv - 1) + diff --git a/nn_dataflow/tests/unit_test/test_buf_shr_scheme.py b/nn_dataflow/tests/unit_test/test_buf_shr_scheme.py new file mode 100644 index 0000000..c04bd27 --- /dev/null +++ b/nn_dataflow/tests/unit_test/test_buf_shr_scheme.py @@ -0,0 +1,349 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import math +import unittest + +from nn_dataflow.core import BufShrScheme +from nn_dataflow.core import DataCategoryEnum as de +from nn_dataflow.core import DataDimLoops +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import NodeRegion +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow.core import PhyDim2 + +class TestBufShrScheme(unittest.TestCase): + ''' Tests for BufShrScheme. 
''' + + def setUp(self): + self.ps1 = PartitionScheme(order=[pe.BATP, pe.OUTP, pe.OFMP, pe.INPP], + pdims=[(2, 3), (3, 1), (1, 5), (5, 2)]) + self.ps2 = PartitionScheme(order=range(pe.NUM), + pdims=[(2, 2), (5, 5), (3, 3), (1, 1)]) + self.ps3 = PartitionScheme(order=range(pe.NUM), + pdims=[(1, 6), (1, 2), (4, 1), (3, 5)]) + + self.nr1 = NodeRegion(origin=PhyDim2(0, 0), dim=self.ps1.dim(), + type=NodeRegion.PROC) + self.nr2 = NodeRegion(origin=PhyDim2(0, 0), dim=self.ps2.dim(), + type=NodeRegion.PROC) + self.nr3 = NodeRegion(origin=PhyDim2(0, 0), dim=self.ps3.dim(), + type=NodeRegion.PROC) + + self.bufshr1 = BufShrScheme(self.nr1, self.ps1) + self.bufshr2 = BufShrScheme(self.nr2, self.ps2) + self.bufshr3 = BufShrScheme(self.nr3, self.ps3) + + def test_dim(self): + ''' Accessor dim. ''' + for bufshr, ps in zip([self.bufshr1, self.bufshr2, self.bufshr3], + [self.ps1, self.ps2, self.ps3]): + self.assertTupleEqual(bufshr.dim(de.IFM), ps.dim(pe.OUTP)) + self.assertTupleEqual(bufshr.dim(de.OFM), ps.dim(pe.INPP)) + + self.assertTupleEqual(self.bufshr1.dim(de.FIL), self.ps1.dim(pe.OFMP)) + self.assertTupleEqual(self.bufshr2.dim(de.FIL), + self.ps2.dim(pe.OFMP, pe.BATP)) + self.assertTupleEqual(self.bufshr3.dim(de.FIL), + self.ps3.dim(pe.OFMP, pe.BATP)) + + def test_size(self): + ''' Get size. ''' + for bufshr in [self.bufshr1, self.bufshr2, self.bufshr3]: + for dce in range(de.NUM): + self.assertEqual(bufshr.dim(dce).size(), bufshr.size(dce)) + + def test_dim_fil(self): + ''' Accessor dim with different partitioning for FIL. ''' + # Adjacent, BATP upon OFMP. + ps = PartitionScheme(order=[pe.INPP, pe.OUTP, pe.BATP, pe.OFMP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (15,) * 2) + # Adjacent, OFMP upon BATP. + ps = PartitionScheme(order=[pe.INPP, pe.OFMP, pe.BATP, pe.OUTP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (15,) * 2) + + # Not adjacent, BATP upon OFMP. + ps = PartitionScheme(order=[pe.OUTP, pe.BATP, pe.INPP, pe.OFMP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (5,) * 2) + # Not adjacent, OFMP upon BATP. + ps = PartitionScheme(order=[pe.OFMP, pe.INPP, pe.BATP, pe.OUTP], + pdims=[(2, 2), (5, 5), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (3,) * 2) + + # Only BATP. + ps = PartitionScheme(order=[pe.OUTP, pe.BATP, pe.INPP, pe.OFMP], + pdims=[(2, 2), (1, 1), (3, 3), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (3,) * 2) + # Only OFMP. + ps = PartitionScheme(order=[pe.OFMP, pe.INPP, pe.BATP, pe.OUTP], + pdims=[(2, 2), (5, 5), (1, 1), (7, 7)]) + nr = NodeRegion(origin=PhyDim2(0, 0), dim=ps.dim(), + type=NodeRegion.PROC) + self.assertTupleEqual(BufShrScheme(nr, ps).dim(de.FIL), (5,) * 2) + + def test_dim_invalid_index(self): + ''' Accessor dim invalid index. ''' + with self.assertRaises(IndexError): + _ = self.bufshr1.dim(de.NUM) + + def test_size_invalid_index(self): + ''' Get size invalid index. 
''' + with self.assertRaises(IndexError): + _ = self.bufshr1.size(de.NUM) + + def test_nbr_dists(self): + ''' Accessor nbr_dists. ''' + inf = float('inf') + + self.assertTupleEqual(self.bufshr1.nbr_dists[de.FIL], (5, inf)) + self.assertTupleEqual(self.bufshr1.nbr_dists[de.IFM], (15, 2)) + self.assertTupleEqual(self.bufshr1.nbr_dists[de.OFM], (1, 1)) + + self.assertTupleEqual(self.bufshr2.nbr_dists[de.FIL], (1, 1)) + self.assertTupleEqual(self.bufshr2.nbr_dists[de.IFM], (15, 15)) + self.assertTupleEqual(self.bufshr2.nbr_dists[de.OFM], (inf, inf)) + + self.assertTupleEqual(self.bufshr3.nbr_dists[de.FIL], (3, 5)) + self.assertTupleEqual(self.bufshr3.nbr_dists[de.IFM], (inf, 10)) + self.assertTupleEqual(self.bufshr3.nbr_dists[de.OFM], (1, 1)) + + def test_default_data_loops(self): + ''' Default data_loops in constructor. ''' + data_loops = [None] * de.NUM + data_loops[de.FIL] = DataDimLoops(le.IFM, le.OFM) + data_loops[de.IFM] = DataDimLoops(le.IFM, le.BAT) + data_loops[de.OFM] = DataDimLoops(le.OFM, le.BAT) + + for bufshr, nr, ps in zip([self.bufshr1, self.bufshr2, self.bufshr3], + [self.nr1, self.nr2, self.nr3], + [self.ps1, self.ps2, self.ps3]): + + bufshr_ = BufShrScheme(nr, ps, data_loops) + + for dce in range(de.NUM): + self.assertTupleEqual(bufshr.dim(dce), + bufshr_.dim(dce)) + self.assertTupleEqual(bufshr.nbr_dists[dce], + bufshr_.nbr_dists[dce]) + + def test_data_loops(self): + ''' data_loops in constructor. ''' + data_loops = [None] * de.NUM + data_loops[de.FIL] = DataDimLoops(le.IFM, le.OFM) + data_loops[de.IFM] = DataDimLoops(le.OFM, le.BAT) + data_loops[de.OFM] = DataDimLoops(le.OFM, le.BAT) + + for nr, ps in zip([self.nr1, self.nr2, self.nr3], + [self.ps1, self.ps2, self.ps3]): + + bufshr = BufShrScheme(nr, ps, data_loops) + + self.assertTupleEqual(bufshr.dim(de.IFM), bufshr.dim(de.OFM)) + self.assertTupleEqual(bufshr.nbr_dists[de.IFM], + bufshr.nbr_dists[de.OFM]) + + def test_data_loops_all_lpe(self): + ''' data_loops in constructor have all LoopEnum. ''' + data_loops = [None] * de.NUM + data_loops[de.FIL] = DataDimLoops(le.IFM, le.OFM) + data_loops[de.IFM] = DataDimLoops(le.IFM, le.OFM, le.BAT) + data_loops[de.OFM] = DataDimLoops(le.OFM, le.BAT) + + bufshr = BufShrScheme(self.nr1, self.ps1, data_loops) + + self.assertTupleEqual(bufshr.dim(de.IFM), (1, 1)) + self.assertTrue(all(math.isinf(d) for d in bufshr.nbr_dists[de.IFM])) + + def test_mismatch_node_region(self): + ''' Mismatched node region and part in constructor. ''' + # Smaller node region. Invalid. + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*region.*'): + _ = BufShrScheme(NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(1, 1), + type=NodeRegion.PROC), + self.ps1) + + # Larger node region. Valid. + bufshr = BufShrScheme(NodeRegion(origin=PhyDim2(0, 0), + dim=PhyDim2(100, 100), + type=NodeRegion.PROC), + self.ps1) + self.assertTupleEqual(bufshr.dim(de.IFM), self.ps1.dim(pe.OUTP)) + + def test_nhops_rotate_all(self): + ''' Get nhops_rotate_all. ''' + # With `self.bufshr3` and FIL, the dimension is 4 by 2, with neighbor + # distances 3 and 5. + bufshr = self.bufshr3 + dce = de.FIL + self.assertTupleEqual(bufshr.dim(dce), (4, 2)) + self.assertTupleEqual(bufshr.nbr_dists[dce], (3, 5)) + + # Subgroup as 4 by 2. The whole circle is six hops of 3 and two hops of + # 5, but only 7 of 8 steps. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 8), + (3 * 6 + 5 * 2) * 7 / 8.) + # Subgroup as 4 by 1. One node does three hops of 3, and other three + # nodes do two hops of 3 and one hop of 9 (looping back). 
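+        # The 4x2 group holds two such 4x1 subgroups, hence the final * 2.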
+ self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 4), + ((3 * 3) + (3 * 2 + 9) * 3) / 4. * 2) + # Subgroup as 2 by 1. All nodes do one hop of 3. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 2), + (3 + 3) / 2. * 4) + # Subgroup as 1. No rotation. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 1), 0) + + # Subgroup as 4 by 1. One node does two hops of 3 and two do one hop of + # 3 and 6 each. The 3rd node also sends to the 4th one with two hops of + # 3. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 3), + ((3 * 2) + (3 + 6) * 2 + (3 * 2)) / 3. * 2) + # Subgroup as 4 by 2. The 1st node does three hops of 3 and one hop of + # 5. The 2nd, 3rd, and 4th nodes do two hops of 3, and one hop of 5, + # and one looping back from the 5th node to the 1st node. The 5th node + # does one looping back and three hops of 3. Finally, the 5th node also + # sends to the 6th to 8th nodes. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 5), + ((3 * 3 + 5) + (3 * 2 + 5 + (3 * 3 + 5)) * 3 + + ((3 * 3 + 5) + 3 * 3) + 3 * 3 * 4) / 5.) + # The others are similar. + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 6), + ((3 * 4 + 5) + (3 * 3 + 5 + (3 * 2 + 5)) * 4 + + ((3 * 2 + 5) + 3 * 4) + 3 * 2 * 5) / 6.) + self.assertAlmostEqual(bufshr.nhops_rotate_all(dce, 7), + ((3 * 5 + 5) + (3 * 4 + 5 + (3 * 1 + 5)) * 5 + + ((3 * 1 + 5) + 3 * 5) + 3 * 1 * 6) / 7.) + + def test_nhops_rotate_all_invalid(self): + ''' Get nhops_rotate_all with invalid args. ''' + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*subgroup.*'): + _ = self.bufshr3.nhops_rotate_all( + de.FIL, self.bufshr3.size(de.FIL) + 1) + + def test_nhops_rotate_all_rot_unit(self): + ''' Get nhops_rotate_all with rotation unit count. ''' + + bufshr = self.bufshr3 + dce = de.FIL + self.assertTupleEqual(bufshr.dim(dce), (4, 2)) + + for subgrp_size in range(1, bufshr.size(dce)): + + nhops = bufshr.nhops_rotate_all(dce, subgrp_size) + + for rotation_unit_cnt in range(subgrp_size, 32): + self.assertEqual(bufshr.nhops_rotate_all(dce, subgrp_size, + rotation_unit_cnt), + nhops) + + for rotation_unit_cnt in range(1, subgrp_size): + self.assertLess(bufshr.nhops_rotate_all(dce, subgrp_size, + rotation_unit_cnt), + nhops) + + def test_nhops_rotate_all_cache(self): + ''' Get nhops_rotate_all using cache. ''' + + bufshr = self.bufshr3 + dce = de.FIL + + self.assertFalse(bufshr.nhops_cache) + + nhops_8 = bufshr.nhops_rotate_all(dce, 8) + nhops_4 = bufshr.nhops_rotate_all(dce, 4) + nhops_1 = bufshr.nhops_rotate_all(dce, 1) + self.assertEqual(len(bufshr.nhops_cache), 3) + self.assertEqual(nhops_8, bufshr.nhops_rotate_all(dce, 8)) + self.assertEqual(nhops_4, bufshr.nhops_rotate_all(dce, 4)) + self.assertEqual(nhops_1, bufshr.nhops_rotate_all(dce, 1)) + self.assertEqual(len(bufshr.nhops_cache), 3) + + dce = de.IFM + + nhops_3 = bufshr.nhops_rotate_all(dce, 3) + nhops_2 = bufshr.nhops_rotate_all(dce, 2) + self.assertEqual(len(bufshr.nhops_cache), 5) + self.assertEqual(nhops_3, bufshr.nhops_rotate_all(dce, 3)) + self.assertEqual(nhops_2, bufshr.nhops_rotate_all(dce, 2)) + self.assertEqual(len(bufshr.nhops_cache), 5) + + nhops_rot_unit = bufshr.nhops_rotate_all(dce, 3, 2) + + self.assertEqual(len(bufshr.nhops_cache), 6) + self.assertEqual(nhops_rot_unit, bufshr.nhops_rotate_all(dce, 3, 2)) + self.assertEqual(len(bufshr.nhops_cache), 6) + + def test_nhops_wide_fetch_once(self): + ''' Get nhops_wide_fetch_once. ''' + # With `self.bufshr3` and FIL, the dimension is 4 by 2, with neighbor + # distances 3 and 5. 
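+        # With fetch width 1, each node reads only its own buffered share,
+        # so wide fetch takes no hops (checked for all subgroup sizes below).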
+ bufshr = self.bufshr3 + dce = de.FIL + self.assertTupleEqual(bufshr.dim(dce), (4, 2)) + self.assertTupleEqual(bufshr.nbr_dists[dce], (3, 5)) + + for subgrp_size in range(bufshr.size(dce)): + self.assertAlmostEqual( + bufshr.nhops_wide_fetch_once(dce, subgrp_size, 1), 0) + + # Three nodes fetch one hop of 3, and the last node fetches one hop of + # 9 (looping back). + self.assertAlmostEqual(bufshr.nhops_wide_fetch_once(dce, 4, 2) * 2, + (3 * 3 + 9) / 4. * 2) + # Two nodes fetch one hop of 3, and the 3rd node fetches one hop of 6 + # (looping back). The last node fetches one hop of 3 from the 3rd. + self.assertAlmostEqual(bufshr.nhops_wide_fetch_once(dce, 3, 2) * 2, + (3 * 2 + 6 + 3) / 3. * 2) + # All nodes do one hop of 3. + self.assertAlmostEqual(bufshr.nhops_wide_fetch_once(dce, 2, 2) * 2, + (3 + 3) / 2. * 4) + + for subgrp_size in range(2, bufshr.size(dce)): + self.assertAlmostEqual( + bufshr.nhops_wide_fetch_once(dce, subgrp_size, 1.5) * 1.5, + bufshr.nhops_wide_fetch_once(dce, subgrp_size, 2) * 2. / 2.) + + def test_nhops_wide_fetch_once_inv(self): + ''' Get nhops_wide_fetch_once with invalid args. ''' + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*subgroup.*'): + _ = self.bufshr3.nhops_wide_fetch_once( + de.FIL, self.bufshr3.size(de.FIL) + 1, 2) + + with self.assertRaisesRegexp(ValueError, 'BufShrScheme: .*width.*'): + _ = self.bufshr3.nhops_wide_fetch_once( + de.FIL, + self.bufshr3.size(de.FIL) / 2, + self.bufshr3.size(de.FIL) / 2 + 1) + + def test_repr(self): + ''' __repr__. ''' + self.assertIn(repr(self.ps1), repr(self.bufshr1)) + self.assertIn(repr(self.ps2), repr(self.bufshr2)) + self.assertIn(repr(self.ps3), repr(self.bufshr3)) + diff --git a/nn_dataflow/tests/unit_test/test_data_layout.py b/nn_dataflow/tests/unit_test/test_data_layout.py index b1855a7..f2c5827 100644 --- a/nn_dataflow/tests/unit_test/test_data_layout.py +++ b/nn_dataflow/tests/unit_test/test_data_layout.py @@ -212,6 +212,52 @@ def test_nhops_to_multidests(self): PhyDim2(2, 2)), nhops) + def test_nhops_to_multidests_fwd(self): + ''' Get nhops_to multiple destinations forwarding. ''' + fr = FmapRange((0,) * 4, (4, 4, 16, 16)) + # First to (2, 2), then (2, 2) to (-1, -2), (-1, -2) to (-2, -3). + nhops = 2 * 4 * 8 * 16 * (2 + 1 + 1 + 0) \ + + 2 * 4 * 8 * 16 * (4 * 7) \ + + 2 * 4 * 8 * 16 * (4 * 2) + self.assertEqual(self.dl1.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), + forwarding=True), + nhops) + + frng1 = FmapRange((0, 4, 0, 0), (4, 8, 16, 16)) + dl = DataLayout(frngs=(self.frng1, frng1), + regions=(self.region1, self.region2), + parts=(self.part1, self.part2)) + self.assertEqual(dl.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), + forwarding=True), + nhops) + + nhops += 2 * 4 * 16 * 16 * ((3 + 4) + 2 * 7 + 2 * 2) + fr = FmapRange((0,) * 4, (16,) * 4) + self.assertEqual(dl.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), + forwarding=True), + nhops) + + # (2, 2) to (3, 10) and (8, 4) + nhops += 4 * 8 * 16 * 16 * (9 + 8) + self.assertEqual(dl.nhops_to(fr, + PhyDim2(-1, -2), PhyDim2(-2, -3), + PhyDim2(2, 2), PhyDim2(3, 10), + PhyDim2(8, 4), + forwarding=True), + nhops) + + def test_nhops_to_invalid_kwargs(self): + ''' Get nhops_to invalid kwargs. ''' + fr = FmapRange((0,) * 4, (4, 4, 16, 16)) + with self.assertRaisesRegexp(ValueError, 'DataLayout: .*keyword.*'): + _ = self.dl1.nhops_to(fr, PhyDim2(1, 1), f=True) + def test_is_in(self): ''' Whether is_in. 
''' nr1 = self.region1 @@ -255,6 +301,31 @@ def test_is_in(self): dim=PhyDim2(50, 50), type=self.region1.type))) + def test_is_in_folded(self): + ''' Whether is_in with folded regions. ''' + # (1, 1/2), (2/3, 0/1/2), (4, 1/2) + nr1 = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(1, 10), + type=self.region1.type, wtot=3, wbeg=2) + # (1, 1/2), (2, 2) + nr2 = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(1, 3), + type=self.region1.type, wtot=3, wbeg=2) + self.assertTrue(self.dl1.is_in(nr1)) + self.assertFalse(self.dl1.is_in(nr2)) + + # (1-2, 2), (3-4/5-6/7-8, 0/1/2) + region = NodeRegion(origin=PhyDim2(1, 2), dim=PhyDim2(2, 10), + type=self.region1.type, wtot=3, wbeg=1) + part = PartitionScheme(order=range(pe.NUM), + pdims=(PhyDim2(1, 5), PhyDim2(2, 1), + PhyDim2(1, 2), PhyDim2(1, 1))) + dl = DataLayout(frngs=self.dl1.frngs, + regions=(region,), parts=(part,)) + # (1-2, 1/2), (3-4/5-6, -1/0/1/2), (7-8, 0/1/2) + nr3 = NodeRegion(origin=PhyDim2(1, 1), dim=PhyDim2(2, 13), + type=self.region1.type, wtot=4, wbeg=2) + self.assertTrue(dl.is_in(nr3)) + self.assertFalse(dl.is_in(nr2)) + def test_concat(self): ''' Concat. ''' fr = FmapRange((0,) * 4, (30,) * 4) diff --git a/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py b/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py index 8824146..910441f 100644 --- a/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py +++ b/nn_dataflow/tests/unit_test/test_nn_dataflow_scheme.py @@ -30,6 +30,7 @@ class TestNNDataflowScheme(unittest.TestCase): ''' Tests for NNDataflowScheme. ''' + # pylint: disable=too-many-public-methods # pylint: disable=too-many-public-methods @@ -57,15 +58,21 @@ def setUp(self): c1_layer = self.network['c1'] self.c1res = SchedulingResult( - scheme=OrderedDict([('cost', 1.5), ('time', 2.), ('ops', 4.), + scheme=OrderedDict([('cost', 1.5), ('time', 200.), ('ops', 4.), ('num_nodes', 4), ('cost_op', 0.5), ('cost_access', 1.), ('cost_noc', 0), ('cost_static', 0), - ('proc_time', 2), ('bus_time', 0), + ('proc_time', 200), ('bus_time', 0), ('dram_time', 0), ('access', [[7, 8, 9]] * me.NUM), + ('remote_gbuf_access', [0] * 3), ('total_nhops', [4, 5, 6]), ('fetch', [[1, 1, 1], [2, 2, 2]]), + ('ti', [2, 2, 3]), + ('to', [1, 2, 3]), + ('tb', [1, 2, 3]), + ('tvals', [[2, 1, 1], [2, 2, 2], [3, 3, 3]]), + ('orders', [range(3)] * 2), ]), ofmap_layout=DataLayout( frngs=(FmapRange((0, 0, 0, 0), @@ -76,19 +83,26 @@ def setUp(self): regions=(NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 2), type=NodeRegion.DRAM),), parts=(PartitionScheme(order=range(pe.NUM), - pdims=[(1, 1)] * pe.NUM),))) + pdims=[(1, 1)] * pe.NUM),)), + sched_seq=(0, 0, 0)) p1_layer = self.network['p1'] self.p1res = SchedulingResult( - scheme=OrderedDict([('cost', 0.6), ('time', 0.05), ('ops', 0.1), + scheme=OrderedDict([('cost', 0.6), ('time', 5), ('ops', 0.1), ('num_nodes', 2), ('cost_op', 0.1), ('cost_access', 0.5), ('cost_noc', 0), ('cost_static', 0), - ('proc_time', 0.05), ('bus_time', 0), + ('proc_time', 5), ('bus_time', 0), ('dram_time', 0), ('access', [[.7, .8, .9]] * me.NUM), + ('remote_gbuf_access', [0] * 3), ('total_nhops', [.4, .5, .6]), ('fetch', [[1, 1, 1], [2, 2, 2]]), + ('ti', [2, 2, 3]), + ('to', [1, 2, 3]), + ('tb', [1, 2, 3]), + ('tvals', [[2, 1, 1], [2, 2, 2], [3, 3, 3]]), + ('orders', [range(3)] * 2), ]), ofmap_layout=DataLayout( frngs=(FmapRange((0, 0, 0, 0), @@ -99,12 +113,17 @@ def setUp(self): regions=(NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 2), type=NodeRegion.DRAM),), parts=(PartitionScheme(order=range(pe.NUM), - pdims=[(1, 1)] * pe.NUM),))) + 
pdims=[(1, 1)] * pe.NUM),)), + sched_seq=(0, 1, 0)) + + self.p2res = SchedulingResult( + scheme=self.p1res.scheme, ofmap_layout=self.p1res.ofmap_layout, + sched_seq=(0, 2, 0)) self.dtfl = NNDataflowScheme(self.network, self.input_layout) self.dtfl['c1'] = self.c1res self.dtfl['p1'] = self.p1res - self.dtfl['p2'] = self.p1res + self.dtfl['p2'] = self.p2res def test_init(self): ''' Initial. ''' @@ -225,7 +244,7 @@ def test_setitem_already_exists(self): df['c1'] = self.c1res with self.assertRaisesRegexp(KeyError, 'NNDataflowScheme: .*c1*'): - df['c1'] = self.c1res + df['c1'] = self.c1res._replace(sched_seq=(1, 0, 0)) def test_setitem_prev_not_in(self): ''' __setitem__ previous not existing. ''' @@ -247,6 +266,22 @@ def test_setitem_prev_input_ext(self): df['c2'] = self.c1res self.assertAlmostEqual(df.total_cost, self.c1res.total_cost) + def test_setitem_invalid_seg_idx(self): + ''' __setitem__ invalid segment index. ''' + df = NNDataflowScheme(self.network, self.input_layout) + + with self.assertRaisesRegexp(ValueError, + 'NNDataflowScheme: .*segment index*'): + df['c1'] = self.c1res._replace(sched_seq=(1, 0, 0)) + + df = NNDataflowScheme(self.network, self.input_layout) + df['c1'] = self.c1res + df['p1'] = self.p1res._replace(sched_seq=(1, 0, 0)) + + with self.assertRaisesRegexp(ValueError, + 'NNDataflowScheme: .*segment index*'): + df['p2'] = self.p2res._replace(sched_seq=(0, 0, 0)) + def test_delitem(self): ''' __delitem__. ''' df = NNDataflowScheme(self.network, self.input_layout) @@ -288,7 +323,7 @@ def test_copy_ext(self): 'e1': self.input_layout}) df1['c1'] = self.c1res df1['p1'] = self.p1res - df1['p2'] = self.p1res + df1['p2'] = self.p2res df2 = df1.copy() @@ -330,7 +365,7 @@ def test_fmap_layout_ext(self): 'e1': self.input_layout}) df['c1'] = self.c1res df['p1'] = self.p1res - df['p2'] = self.p1res + df['p2'] = self.p2res flayout = df.fmap_layout(('e0',)) self.assertEqual(flayout, self.input_layout) @@ -345,7 +380,7 @@ def test_fmap_layout_ext(self): def test_properties(self): ''' Property accessors. ''' self.assertAlmostEqual(self.dtfl.total_cost, 1.5 + 0.6 * 2) - self.assertAlmostEqual(self.dtfl.total_time, 2 + 0.05 * 2) + self.assertAlmostEqual(self.dtfl.total_time, 200 + 5) self.assertAlmostEqual(self.dtfl.total_ops, 4 + 0.1 * 2) for a in self.dtfl.total_accesses: @@ -353,21 +388,105 @@ def test_properties(self): self.assertAlmostEqual(self.dtfl.total_noc_hops, (4 + 5 + 6) + (.4 + .5 + .6) * 2) + def test_time_full_net_single_seg(self): + ''' time() when full network fits in a single segment. ''' + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res + dtfl['p1'] = self.p1res._replace(sched_seq=(0, 1, 0)) + dtfl['p2'] = self.p2res._replace(sched_seq=(0, 2, 0)) + dtfl['f1'] = self.c1res._replace(sched_seq=(0, 3, 0)) + self.assertEqual(dtfl.total_time, 200) + + def test_static_cost_adjust(self): + ''' Adjust static cost portion. ''' + + # Add static cost. + idl_unit_cost = 1e-3 + + c1scheme = self.c1res.scheme + c1static = c1scheme['time'] * idl_unit_cost + c1scheme['cost_static'] += c1static + c1scheme['cost_access'] -= c1static + + p1scheme = self.p1res.scheme + p1static = p1scheme['time'] * idl_unit_cost + p1scheme['cost_static'] += p1static + p1scheme['cost_access'] -= p1static + + # No adjust. 
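+        # Each layer sits in its own segment here, so the total time is the
+        # plain sum of per-layer times and no static cost is deducted.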
+ dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1scheme, sched_seq=(1, 0, 0)) + dtfl['p2'] = self.p2res._replace(scheme=p1scheme, sched_seq=(2, 0, 0)) + dtfl['f1'] = self.c1res._replace(scheme=c1scheme, sched_seq=(3, 0, 0)) + + sum_cost = 1.5 + 0.6 + 0.6 + 1.5 + sum_time = 200 + 5 + 5 + 200 + + self.assertAlmostEqual(dtfl.total_cost, sum_cost) + self.assertAlmostEqual(dtfl.total_time, sum_time) + + # With adjust. + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1scheme, sched_seq=(0, 1, 0)) + dtfl['p2'] = self.p2res._replace(scheme=p1scheme, sched_seq=(0, 2, 0)) + dtfl['f1'] = self.c1res._replace(scheme=c1scheme, sched_seq=(1, 0, 0)) + + diff = (sum_time - dtfl.total_time) * idl_unit_cost + self.assertGreater(diff, 0) + self.assertAlmostEqual(dtfl.total_cost, sum_cost -diff) + + # All in one segment. + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1scheme, sched_seq=(0, 1, 0)) + dtfl['p2'] = self.p2res._replace(scheme=p1scheme, sched_seq=(0, 2, 0)) + dtfl['f1'] = self.c1res._replace(scheme=c1scheme, sched_seq=(0, 3, 0)) + + diff = (sum_time - dtfl.total_time) * idl_unit_cost + self.assertGreater(diff, 0) + self.assertAlmostEqual(dtfl.total_cost, sum_cost -diff) + + def test_segment_time_list(self): + ''' segment_time_list(). ''' + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res + dtfl['p1'] = self.p1res + dtfl['p2'] = self.p2res._replace(sched_seq=(1, 0, 0)) + self.assertListEqual(dtfl.segment_time_list(), [205, 5]) + + def test_segment_dram_time_list(self): + ''' segment_dram_time_list(). ''' + c1_scheme = self.c1res.scheme.copy() + c1_scheme['dram_time'] = 180 + p1_scheme = self.p1res.scheme.copy() + p1_scheme['dram_time'] = 5 + p2_scheme = self.p2res.scheme.copy() + p2_scheme['dram_time'] = 10 + dtfl = NNDataflowScheme(self.network, self.input_layout) + dtfl['c1'] = self.c1res._replace(scheme=c1_scheme) + dtfl['p1'] = self.p1res._replace(scheme=p1_scheme) + dtfl['p2'] = self.p2res._replace(sched_seq=(1, 0, 0), + scheme=p2_scheme) + self.assertListEqual(dtfl.segment_dram_time_list(), [185, 10]) + self.assertListEqual(dtfl.segment_time_list(), [205, 10]) + def test_stats_active_node_pes(self): ''' Per-layer stats: active node PEs. ''' stats = self.dtfl.perlayer_stats('active_node_pes') self.assertEqual(len(stats), len(self.dtfl)) - self.assertAlmostEqual(stats['c1'], 0.5) - self.assertAlmostEqual(stats['p1'], 1) - self.assertAlmostEqual(stats['p2'], 1) + self.assertAlmostEqual(stats['c1'], 0.005) + self.assertAlmostEqual(stats['p1'], 0.01) + self.assertAlmostEqual(stats['p2'], 0.01) def test_stats_dram_bandwidth(self): ''' Per-layer stats: DRAM bandwidth. ''' stats = self.dtfl.perlayer_stats('dram_bandwidth') self.assertEqual(len(stats), len(self.dtfl)) - self.assertAlmostEqual(stats['c1'], (7 + 8 + 9) / 2.) - self.assertAlmostEqual(stats['p1'], (.7 + .8 + .9) / 0.05) - self.assertAlmostEqual(stats['p2'], (.7 + .8 + .9) / 0.05) + self.assertAlmostEqual(stats['c1'], (7. + 8. + 9.) / 200) + self.assertAlmostEqual(stats['p1'], (.7 + .8 + .9) / 5) + self.assertAlmostEqual(stats['p2'], (.7 + .8 + .9) / 5) def test_stats_not_supported(self): ''' Per-layer stats: not supported. 
''' diff --git a/nn_dataflow/tests/unit_test/test_node_region.py b/nn_dataflow/tests/unit_test/test_node_region.py index 73e026d..fa88181 100644 --- a/nn_dataflow/tests/unit_test/test_node_region.py +++ b/nn_dataflow/tests/unit_test/test_node_region.py @@ -25,10 +25,66 @@ def test_valid_args(self): ''' Valid arguments. ''' nr = NodeRegion(dim=PhyDim2(4, 4), origin=PhyDim2(1, 3), - type=NodeRegion.PROC) + type=NodeRegion.PROC, + wtot=2, + wbeg=-1) self.assertTupleEqual(nr.dim, (4, 4), 'dim') self.assertTupleEqual(nr.origin, (1, 3), 'origin') self.assertEqual(nr.type, NodeRegion.PROC, 'type') + self.assertEqual(nr.wtot, 2, 'wtot') + self.assertEqual(nr.wbeg, -1, 'wbeg') + + def test_default_wtot_wbeg(self): + ''' Default wtot and wbeg. ''' + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC) + self.assertEqual(nr.wtot, 8) + self.assertEqual(nr.wbeg, 8) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=6) + self.assertEqual(nr.wtot, 6) + self.assertEqual(nr.wbeg, 6) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wbeg=-5) + self.assertEqual(nr.wtot, 8) + self.assertEqual(nr.wbeg, -5) + + def test_args_kwargs(self): + ''' Different ways to give args and kwargs. ''' + dim = PhyDim2(4, 8) + origin = PhyDim2(1, 3) + dist = PhyDim2(1, 1) + type_ = NodeRegion.PROC + wtot = 6 + wbeg = 5 + + nr0 = NodeRegion(dim=dim, origin=origin, dist=dist, type=type_, + wtot=wtot, wbeg=wbeg) + + nr = NodeRegion(dim, origin, dist, type_, wtot, wbeg) + self.assertTupleEqual(nr, nr0) + + nr = NodeRegion(dim, origin, wbeg=wbeg, wtot=wtot, type=type_, + dist=dist) + self.assertTupleEqual(nr, nr0) + + nr = NodeRegion(dim, origin, dist, type=type_, wtot=wtot, wbeg=wbeg) + self.assertTupleEqual(nr, nr0) + + def test_larger_wtot(self): + ''' wtot > dim.w is valid. ''' + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=20) + self.assertEqual(nr.wtot, 20) def test_invalid_dim(self): ''' Invalid dim. ''' @@ -59,6 +115,45 @@ def test_invalid_type(self): origin=PhyDim2(1, 3), type=NodeRegion.NUM) + def test_invalid_wtot_type(self): + ''' Invalid wtot type. ''' + with self.assertRaisesRegexp(TypeError, 'NodeRegion: .*wtot.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=1.3) + + def test_invalid_wbeg_type(self): + ''' Invalid wbeg type. ''' + with self.assertRaisesRegexp(TypeError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wbeg=1.3) + + def test_invalid_wbeg(self): + ''' Invalid wbeg. ''' + with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=5) + + with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=-5) + + with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*wbeg.*'): + _ = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=0) + def test_contains_node(self): ''' Whether contains node. ''' nr = NodeRegion(dim=PhyDim2(4, 4), @@ -138,3 +233,165 @@ def test_rel2abs_not_in(self): with self.assertRaisesRegexp(ValueError, 'NodeRegion: .*not in.*'): _ = nr.rel2abs(PhyDim2(0, 4)) + def test_rel2abs_folded(self): + ''' Get rel2abs with folded. 
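Before the folded rel2abs cases below, here are the defaulting and validity rules from the tests above as a sketch (assumed semantics: wtot defaults to dim.w, wbeg defaults to wtot, and a valid wbeg satisfies 1 <= |wbeg| <= wtot):

    def fill_wtot_wbeg(dim_w, wtot=None, wbeg=None):
        wtot = dim_w if wtot is None else wtot
        wbeg = wtot if wbeg is None else wbeg
        if not 1 <= abs(wbeg) <= wtot:
            raise ValueError('NodeRegion: invalid wbeg.')
        return wtot, wbeg

    assert fill_wtot_wbeg(8) == (8, 8)          # both default
    assert fill_wtot_wbeg(8, wtot=6) == (6, 6)  # wbeg follows wtot
    assert fill_wtot_wbeg(8, wbeg=-5) == (8, -5)
    # wbeg=5 or wbeg=-5 with wtot=4, or wbeg=0, would raise, matching the
    # invalid-wbeg tests above.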
''' + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=3) + # 67 + # 543 + # 012 + + self.assertTupleEqual(nr.rel2abs(PhyDim2(1, 2)), (1 + 1, 5)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 3)), (5 + 2, 5)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(0, 5)), (5 + 0, 3)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(3, 7)), (9 + 3, 4)) + + self.assertSetEqual(set(nr.rel2abs(PhyDim2(h, w)) + for h in range(nr.dim.h) + for w in range(nr.dim.w)), + set(nr.iter_node())) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=3, + wbeg=1) + # 7 + # 456 + # 321 + # 0 + + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 0)), (1 + 2, 3)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(1, 2)), (5 + 1, 2)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 3)), (5 + 2, 1)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(0, 5)), (9 + 0, 2)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(3, 7)), (13 + 3, 3)) + + self.assertSetEqual(set(nr.rel2abs(PhyDim2(h, w)) + for h in range(nr.dim.h) + for w in range(nr.dim.w)), + set(nr.iter_node())) + + nr = NodeRegion(dim=PhyDim2(4, 8), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC, + wtot=4, + wbeg=-2) + # 76 + # 2345 + # 10 + + self.assertTupleEqual(nr.rel2abs(PhyDim2(1, 1)), (1 + 1, 2)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(2, 3)), (5 + 2, 3)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(0, 5)), (5 + 0, 5)) + self.assertTupleEqual(nr.rel2abs(PhyDim2(3, 7)), (9 + 3, 4)) + + self.assertSetEqual(set(nr.rel2abs(PhyDim2(h, w)) + for h in range(nr.dim.h) + for w in range(nr.dim.w)), + set(nr.iter_node())) + + def test_allocate(self): + ''' allocate. ''' + + nr = NodeRegion(dim=PhyDim2(4, 4), + origin=PhyDim2(1, 3), + type=NodeRegion.PROC) + + def _common_check(length): + self.assertEqual(len(subregions), length) + aggr_node_set = set() + for sr in subregions: + self.assertTupleEqual(sr.dist, nr.dist) + self.assertEqual(sr.type, NodeRegion.PROC) + self.assertEqual(sr.wtot, 4) + for c in sr.iter_node(): + self.assertTrue(nr.contains_node(c)) + self.assertTrue(aggr_node_set.isdisjoint(sr.iter_node())) + aggr_node_set.update(sr.iter_node()) + self.assertSetEqual(set(nr.iter_node()), aggr_node_set) + + request_list = [4, 4, 4, 4, 4] + self.assertEqual(len(nr.allocate(request_list)), 0) + + request_list = [2, 3, 3, 2, 4, 2] + subregions = nr.allocate(request_list) + # 5544 + # 3344 + # 2221 + # 0011 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (1, 2)) + self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 2) + self.assertTupleEqual(subregions[1].dim, (1, 3)) + self.assertTupleEqual(subregions[1].origin, (1, 5)) + self.assertEqual(subregions[1].wbeg, 2) + self.assertTupleEqual(subregions[2].dim, (1, 3)) + self.assertTupleEqual(subregions[2].origin, (2, 5)) + self.assertEqual(subregions[2].wbeg, -3) + self.assertTupleEqual(subregions[3].dim, (1, 2)) + self.assertTupleEqual(subregions[3].origin, (3, 3)) + self.assertEqual(subregions[3].wbeg, 2) + self.assertTupleEqual(subregions[4].dim, (1, 4)) + self.assertTupleEqual(subregions[4].origin, (3, 5)) + self.assertEqual(subregions[4].wbeg, 2) + self.assertTupleEqual(subregions[5].dim, (1, 2)) + self.assertTupleEqual(subregions[5].origin, (4, 4)) + self.assertEqual(subregions[5].wbeg, -2) + + request_list = [5, 11] + subregions = nr.allocate(request_list) + # 1111 + # 1111 + # 1110 + # 0000 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (1, 5)) + 
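The first block of folded assertions above (wtot=3, default wbeg) follows a plain zig-zag rule: relative w coordinates are chopped into rows of width wtot, each row is stacked dim.h further down, and every other row reverses direction. A self-contained sketch of that rule for the default-wbeg case only (the wbeg variants shift and flip the first row; the allocate checks continue below):

    def rel2abs_folded(h, w, origin, dim_h, wtot):
        # Which fold row the relative w falls in, and the offset inside it.
        row, pos = divmod(w, wtot)
        abs_h = origin[0] + row * dim_h + h      # each fold row is dim.h tall
        if row % 2 == 0:                         # even rows run left-to-right
            abs_w = origin[1] + pos
        else:                                    # odd rows run right-to-left
            abs_w = origin[1] + wtot - 1 - pos
        return abs_h, abs_w

    # dim=(4, 8), origin=(1, 3), wtot=3, matching the first picture above.
    assert rel2abs_folded(1, 2, (1, 3), 4, 3) == (2, 5)
    assert rel2abs_folded(2, 3, (1, 3), 4, 3) == (7, 5)
    assert rel2abs_folded(0, 5, (1, 3), 4, 3) == (5, 3)
    assert rel2abs_folded(3, 7, (1, 3), 4, 3) == (12, 4)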
self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 4) + self.assertTupleEqual(subregions[1].dim, (1, 11)) + self.assertTupleEqual(subregions[1].origin, (2, 5)) + self.assertEqual(subregions[1].wbeg, -3) + + request_list = [2, 4, 4, 2, 4] + subregions = nr.allocate(request_list) + # 4432 + # 4432 + # 0112 + # 0112 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (2, 1)) + self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 1) + self.assertTupleEqual(subregions[1].dim, (2, 2)) + self.assertTupleEqual(subregions[1].origin, (1, 4)) + self.assertEqual(subregions[1].wbeg, 2) + self.assertTupleEqual(subregions[2].dim, (2, 2)) + self.assertTupleEqual(subregions[2].origin, (1, 6)) + self.assertEqual(subregions[2].wbeg, 1) + self.assertTupleEqual(subregions[3].dim, (2, 1)) + self.assertTupleEqual(subregions[3].origin, (3, 5)) + self.assertEqual(subregions[3].wbeg, -1) + self.assertTupleEqual(subregions[4].dim, (2, 2)) + self.assertTupleEqual(subregions[4].origin, (3, 4)) + self.assertEqual(subregions[4].wbeg, -2) + + nr = nr._replace(dist=PhyDim2(2, 1)) + + request_list = [10, 6] + subregions = nr.allocate(request_list) + # 1110 + # 1110 + # 0000 + # 0000 + _common_check(len(request_list)) + self.assertTupleEqual(subregions[0].dim, (2, 5)) + self.assertTupleEqual(subregions[0].origin, (1, 3)) + self.assertEqual(subregions[0].wbeg, 4) + self.assertTupleEqual(subregions[1].dim, (2, 3)) + self.assertTupleEqual(subregions[1].origin, (5, 5)) + self.assertEqual(subregions[1].wbeg, -3) + diff --git a/nn_dataflow/tests/unit_test/test_option.py b/nn_dataflow/tests/unit_test/test_option.py index f713f95..3c6627c 100644 --- a/nn_dataflow/tests/unit_test/test_option.py +++ b/nn_dataflow/tests/unit_test/test_option.py @@ -24,9 +24,12 @@ def test_valid_kwargs(self): ''' Valid keyword arguments. ''' options = Option(sw_gbuf_bypass=(False, False, False), sw_solve_loopblocking=False, + hw_access_forwarding=False, + hw_gbuf_sharing=False, partition_hybrid=True, partition_batch=False, partition_ifmaps=False, + partition_interlayer=False, opt_goal='ed', ntops=10, nprocesses=16, @@ -36,12 +39,18 @@ def test_valid_kwargs(self): 'sw_gbuf_bypass') self.assertEqual(options.sw_solve_loopblocking, False, 'sw_solve_loopblocking') + self.assertEqual(options.hw_access_forwarding, False, + 'hw_access_forwarding') + self.assertEqual(options.hw_gbuf_sharing, False, + 'hw_gbuf_sharing') self.assertEqual(options.partition_hybrid, True, 'partition_hybrid') self.assertEqual(options.partition_batch, False, 'partition_batch') self.assertEqual(options.partition_ifmaps, False, 'partition_ifmaps') + self.assertEqual(options.partition_interlayer, False, + 'partition_interlayer') self.assertEqual(options.opt_goal, 'ed', 'opt_goal') self.assertEqual(options.ntops, 10, 'ntops') self.assertEqual(options.nprocesses, 16, 'nprocesses') @@ -93,6 +102,27 @@ def test_invalid_swgbyp_len(self): with self.assertRaisesRegexp(ValueError, 'Option: .*sw_gbuf_bypass.*'): _ = Option(sw_gbuf_bypass=(False, False)) + def test_invalid_swsol_hwbufshr(self): + ''' Invalid sw_solve_loopblocking and hw_gbuf_sharing comb. ''' + with self.assertRaisesRegexp(ValueError, + 'Option: .*sw_solve_loopblocking.*' + 'hw_gbuf_sharing.*'): + _ = Option(sw_solve_loopblocking=True, hw_gbuf_sharing=True) + + def test_invalid_hwaccfwd_hwbufshr(self): + ''' Invalid hw_access_forwarding and hw_gbuf_sharing comb. 
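The Option mutual-exclusion checks above and below reduce to three pairwise conflicts. A sketch of the equivalent validation (names from this patch; the logic is inferred from the expected errors, not copied from the library):

    def check_option_combos(sw_solve_loopblocking=False,
                            hw_access_forwarding=False,
                            hw_gbuf_sharing=False,
                            hw_gbuf_save_writeback=False):
        if sw_solve_loopblocking and hw_gbuf_sharing:
            raise ValueError('Option: sw_solve_loopblocking conflicts with '
                             'hw_gbuf_sharing.')
        if hw_access_forwarding and hw_gbuf_sharing:
            raise ValueError('Option: hw_access_forwarding conflicts with '
                             'hw_gbuf_sharing.')
        if sw_solve_loopblocking and hw_gbuf_save_writeback:
            raise ValueError('Option: sw_solve_loopblocking conflicts with '
                             'hw_gbuf_save_writeback.')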
''' + with self.assertRaisesRegexp(ValueError, + 'Option: .*hw_access_forwarding.*' + 'hw_gbuf_sharing.*'): + _ = Option(hw_access_forwarding=True, hw_gbuf_sharing=True) + + def test_invalid_swsol_hwswb(self): + ''' Invalid sw_solve_loopblocking and hw_gbuf_save_writeback comb. ''' + with self.assertRaisesRegexp(ValueError, + 'Option: .*sw_solve_loopblocking.*' + 'hw_gbuf_save_writeback.*'): + _ = Option(sw_solve_loopblocking=True, hw_gbuf_save_writeback=True) + def test_invalid_part_hybrid_ifmaps(self): ''' Invalid partition_hybrid and partition_ifmaps comb. ''' with self.assertRaisesRegexp(ValueError, @@ -100,6 +130,26 @@ def test_invalid_part_hybrid_ifmaps(self): 'partition_hybrid.*'): _ = Option(partition_hybrid=False, partition_ifmaps=True) + def test_invalid_time_ovhd(self): + ''' Invalid layer_pipeline_time_ovhd. ''' + with self.assertRaisesRegexp(KeyError, + 'Option: .*layer_pipeline_time_ovhd.*'): + _ = Option(layer_pipeline_time_ovhd=None) + + with self.assertRaisesRegexp(ValueError, + 'Option: .*layer_pipeline_time_ovhd.*'): + _ = Option(layer_pipeline_time_ovhd=-1) + + def test_invalid_max_degree(self): + ''' Invalid layer_pipeline_max_degree. ''' + with self.assertRaisesRegexp(KeyError, + 'Option: .*layer_pipeline_max_degree.*'): + _ = Option(layer_pipeline_max_degree=None) + + with self.assertRaisesRegexp(ValueError, + 'Option: .*layer_pipeline_max_degree.*'): + _ = Option(layer_pipeline_max_degree=-1) + def test_invalid_opt_goal(self): ''' Invalid opt_goal. ''' with self.assertRaisesRegexp(ValueError, 'Option: .*opt_goal.*'): diff --git a/nn_dataflow/tests/unit_test/test_partition_scheme.py b/nn_dataflow/tests/unit_test/test_partition_scheme.py index c30f6f0..b282a19 100644 --- a/nn_dataflow/tests/unit_test/test_partition_scheme.py +++ b/nn_dataflow/tests/unit_test/test_partition_scheme.py @@ -15,6 +15,7 @@ import collections import itertools +import math import unittest from nn_dataflow.core import FmapPosition, FmapRange @@ -248,6 +249,31 @@ def data_loops(): with self.assertRaisesRegexp(TypeError, 'PartitionScheme: .*layer.*'): _ = self.ps1.part_layer(layer, self.ps1.size(pe.BATP)) + def test_part_neighbor_dist(self): + ''' Get part_neighbor_dist. ''' + for ps, nr in zip([self.ps1, self.ps2], [self.nr1, self.nr2]): + + for idx in range(pe.NUM): + nbr_dist = ps.part_neighbor_dist(nr, ps.order[idx]) + dim_below = ps.dim(*ps.order[idx + 1:]) if idx + 1 < pe.NUM \ + else PhyDim2(1, 1) + dim_cur = ps.dim(ps.order[idx]) + + if dim_cur.h == 1: + self.assertTrue(math.isinf(nbr_dist.h)) + else: + self.assertEqual(nbr_dist.h, dim_below.h) + + if dim_cur.w == 1: + self.assertTrue(math.isinf(nbr_dist.w)) + else: + self.assertEqual(nbr_dist.w, dim_below.w) + + def test_part_neighbor_dist_inv(self): + ''' Get part_neighbor_dist invalid arg. ''' + dist = self.ps1.part_neighbor_dist(self.nr1, pe.NUM) + self.assertTrue(all(math.isnan(d) for d in dist)) + def test_projection(self): ''' Get projection. 
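test_part_neighbor_dist above encodes the neighbor-stride rule: at a given partition level, neighboring partitions along a dimension sit apart by the combined extent of all lower-ordered levels, and the distance is infinite along a dimension that the level does not actually split. Restated as a sketch:

    def neighbor_dist(dim_cur, dim_below):
        # dim_cur: this level's partition size along one dimension;
        # dim_below: product of the sizes of all lower-ordered levels.
        return float('inf') if dim_cur == 1 else dim_below

    assert neighbor_dist(1, 4) == float('inf')  # unsplit level: no neighbor
    assert neighbor_dist(2, 4) == 4             # neighbors are 4 nodes apart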
''' diff --git a/nn_dataflow/tests/unit_test/test_resource.py b/nn_dataflow/tests/unit_test/test_resource.py index c1e5c90..6c2602f 100644 --- a/nn_dataflow/tests/unit_test/test_resource.py +++ b/nn_dataflow/tests/unit_test/test_resource.py @@ -45,6 +45,7 @@ def test_valid_args(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) self.assertTupleEqual(resource.proc_region.dim, (2, 2), 'proc_region') self.assertTupleEqual(resource.dram_region.dim, (2, 2), 'dram_region') @@ -53,6 +54,7 @@ def test_valid_args(self): self.assertEqual(resource.size_regf, 512, 'size_regf') self.assertEqual(resource.array_bus_width, 8, 'array_bus_width') self.assertEqual(resource.dram_bandwidth, 128, 'dram_bandwidth') + self.assertFalse(resource.no_time_mux, 'no_time_mux') def test_invalid_proc_region(self): ''' Invalid proc_region. ''' @@ -66,6 +68,7 @@ def test_invalid_proc_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_proc_region_dram(self): @@ -82,6 +85,7 @@ def test_invalid_proc_region_dram(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dram_region(self): @@ -96,6 +100,7 @@ def test_invalid_dram_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dram_region_proc(self): @@ -112,6 +117,7 @@ def test_invalid_dram_region_proc(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_data_region(self): @@ -126,6 +132,7 @@ def test_invalid_data_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) with self.assertRaisesRegexp(TypeError, 'Resource: .*dst_data_.*'): _ = Resource(proc_region=self.proc_region, @@ -137,6 +144,7 @@ def test_invalid_data_region(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dim_array(self): @@ -151,6 +159,7 @@ def test_invalid_dim_array(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_size_gbuf(self): @@ -165,6 +174,7 @@ def test_invalid_size_gbuf(self): size_regf=512, array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_size_regf(self): @@ -179,6 +189,7 @@ def test_invalid_size_regf(self): size_regf=(512,), array_bus_width=8, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_array_bus_width(self): @@ -194,6 +205,7 @@ def test_invalid_array_bus_width(self): size_regf=512, array_bus_width=1.2, dram_bandwidth=128, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*array_bus_width.*'): @@ -206,6 +218,7 @@ def test_invalid_array_bus_width(self): size_regf=512, array_bus_width=-2, dram_bandwidth=128, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*array_bus_width.*'): @@ -218,6 +231,7 @@ def test_invalid_array_bus_width(self): size_regf=512, array_bus_width=0, dram_bandwidth=128, + no_time_mux=False, ) def test_invalid_dram_bandwidth(self): @@ -233,6 +247,7 @@ def test_invalid_dram_bandwidth(self): size_regf=512, array_bus_width=8, dram_bandwidth=None, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*dram_bandwidth.*'): @@ -245,6 +260,7 @@ def test_invalid_dram_bandwidth(self): size_regf=512, array_bus_width=8, dram_bandwidth=-3, + no_time_mux=False, ) with self.assertRaisesRegexp(ValueError, 'Resource: .*dram_bandwidth.*'): @@ -257,5 +273,22 @@ def test_invalid_dram_bandwidth(self): 
size_regf=512, array_bus_width=8, dram_bandwidth=0, + no_time_mux=False, + ) + + def test_invalid_no_time_mux(self): + ''' Invalid no_time_mux. ''' + with self.assertRaisesRegexp(TypeError, + 'Resource: .*no_time_mux.*'): + _ = Resource(proc_region=self.proc_region, + dram_region=self.dram_region, + src_data_region=self.src_data_region, + dst_data_region=self.dst_data_region, + dim_array=PhyDim2(16, 16), + size_gbuf=131072, + size_regf=512, + array_bus_width=8, + dram_bandwidth=128, + no_time_mux=None, ) diff --git a/nn_dataflow/tests/unit_test/test_scheduling_condition.py b/nn_dataflow/tests/unit_test/test_scheduling_condition.py index 30ea75a..e80f026 100644 --- a/nn_dataflow/tests/unit_test/test_scheduling_condition.py +++ b/nn_dataflow/tests/unit_test/test_scheduling_condition.py @@ -23,6 +23,7 @@ from nn_dataflow.core import PhyDim2 from nn_dataflow.core import Resource from nn_dataflow.core import SchedulingCondition +from nn_dataflow.core import SchedulingConstraint class TestSchedulingCondition(unittest.TestCase): ''' Tests for SchedulingCondition. ''' @@ -39,7 +40,10 @@ def setUp(self): dst_data_region=NodeRegion(origin=PhyDim2(0, 0), dim=PhyDim2(1, 1), type=NodeRegion.DRAM), dim_array=PhyDim2(16, 16), size_gbuf=65536, size_regf=64, - array_bus_width=float('inf'), dram_bandwidth=float('inf')) + array_bus_width=float('inf'), dram_bandwidth=float('inf'), + no_time_mux=False) + + self.none_cstr = SchedulingConstraint() part = PartitionScheme(order=range(pe.NUM), pdims=[(1, 1)] * pe.NUM) self.ifmap_layout = DataLayout(frngs=(FmapRange((0, 0, 0, 0), @@ -47,24 +51,59 @@ def setUp(self): regions=(self.resource.src_data_region,), parts=(part,)) + self.sched_seq = (2, 0, 0) + def test_valid_args(self): ''' Valid arguments. ''' condition = SchedulingCondition(resource=self.resource, - ifmap_layout=self.ifmap_layout) + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq) self.assertEqual(condition.resource, self.resource) + self.assertEqual(condition.constraint, self.none_cstr) self.assertEqual(condition.ifmap_layout, self.ifmap_layout) + self.assertTupleEqual(condition.sched_seq, self.sched_seq) def test_invalid_resource(self): ''' Invalid resource. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingCondition: .*resource.*'): _ = SchedulingCondition(resource=None, - ifmap_layout=self.ifmap_layout) + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq) + + def test_invalid_constraint(self): + ''' Invalid constraint. ''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingCondition: .*constraint.*'): + _ = SchedulingCondition(resource=self.resource, + constraint=None, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq) def test_invalid_ifmap_layout(self): - ''' Invalid resource. ''' + ''' Invalid ifmap_layout. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingCondition: .*ifmap_layout.*'): _ = SchedulingCondition(resource=self.resource, - ifmap_layout=None) + constraint=self.none_cstr, + ifmap_layout=None, + sched_seq=self.sched_seq) + + def test_invalid_sched_seq(self): + ''' Invalid sched_seq. 
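The sched_seq cases below check shape only. A sketch of the equivalent validation (assumed, mirroring the TypeError/ValueError split in the tests):

    def check_sched_seq(sched_seq):
        if not isinstance(sched_seq, tuple):
            raise TypeError('SchedulingCondition: sched_seq must be a tuple.')
        if len(sched_seq) != 3:
            raise ValueError('SchedulingCondition: sched_seq needs 3 indices '
                             '(segment, spatial, temporal).')

    check_sched_seq((2, 0, 0))     # fine
    # check_sched_seq([2, 0, 0])   # TypeError, as in the first case below
    # check_sched_seq((2, 0))      # ValueError, as in the second case below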
''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingCondition: .*sched_seq.*'): + _ = SchedulingCondition(resource=self.resource, + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=list(self.sched_seq)) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingCondition: .*sched_seq.*'): + _ = SchedulingCondition(resource=self.resource, + constraint=self.none_cstr, + ifmap_layout=self.ifmap_layout, + sched_seq=self.sched_seq[:-1]) diff --git a/nn_dataflow/tests/unit_test/test_scheduling_constraint.py b/nn_dataflow/tests/unit_test/test_scheduling_constraint.py new file mode 100644 index 0000000..c401803 --- /dev/null +++ b/nn_dataflow/tests/unit_test/test_scheduling_constraint.py @@ -0,0 +1,354 @@ +""" $lic$ +Copyright (C) 2016-2019 by The Board of Trustees of Stanford University + +This program is free software: you can redistribute it and/or modify it under +the terms of the Modified BSD-3 License as published by the Open Source +Initiative. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the BSD-3 License for more details. + +You should have received a copy of the Modified BSD-3 License along with this +program. If not, see . +""" + +import itertools +import unittest + +from nn_dataflow.core import LoopEnum as le +from nn_dataflow.core import ParallelEnum as pe +from nn_dataflow.core import PartitionScheme +from nn_dataflow.core import SchedulingConstraint, \ + SchedulingConstraintLayerPipeline + +from nn_dataflow import util + +class TestSchedulingConstraintFixture(unittest.TestCase): + ''' Base fixture class for SchedulingConstraint tests. ''' + + @staticmethod + def _gen_bl(t_end=9): + ''' Generator for bl_t and bl_ord. ''' + return itertools.product(itertools.product(*[range(1, t_end)] * le.NUM), + itertools.permutations(range(le.NUM))) + + +class TestSchedulingConstraint(TestSchedulingConstraintFixture): + ''' Tests for SchedulingConstraint. ''' + + def test_valid_args(self): + ''' Valid arguments. ''' + cstr = SchedulingConstraint(topbat=2, topifm=1, topofm=4) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 4) + self.assertDictEqual(cstr.update_dict, {}) + + cstr = SchedulingConstraint(topbat=2, topofm=4) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 4) + self.assertDictEqual(cstr.update_dict, {}) + + cstr = SchedulingConstraint( + topofm=4, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 4) + self.assertEqual(len(cstr.update_dict), 2) + self.assertIn('l1', cstr.update_dict) + self.assertIn('l2', cstr.update_dict) + + cstr = SchedulingConstraint() + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 0) + self.assertDictEqual(cstr.update_dict, {}) + + def test_invalid_args(self): + ''' Invalid arguments. ''' + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraint: ' + '.*positive integers.*'): + _ = SchedulingConstraint(topbat=-1, topofm=2.) + + def test_invalid_update_dict(self): + ''' Invalid argument update_dict. 
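The update_dict entries shown above deserve a standalone illustration. A minimal sketch of the assumed mechanics (a toy class, not the real SchedulingConstraint): each entry maps a previous layer's name to a callable(self, prev_result) that refines the constraint once that layer's scheduling result is known.

    class Cstr(object):
        def __init__(self, topifm=0, topbat=0, update_dict=None):
            self.topifm, self.topbat = topifm, topbat
            self.update_dict = update_dict or {}

        def update_by_prev(self, prev_results):
            # Apply each lazily updated rule with its layer's result.
            for name, func in self.update_dict.items():
                func(self, prev_results[name])

    cstr = Cstr(update_dict={
        'l1': lambda s, _: setattr(s, 'topbat', 1),
        'l2': lambda s, r: setattr(s, 'topifm', r.topifm),
    })
    cstr.update_by_prev({'l1': None, 'l2': Cstr(topifm=2)})
    assert (cstr.topbat, cstr.topifm) == (1, 2)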
''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + _ = SchedulingConstraint(update_dict=['l1']) + + with self.assertRaisesRegexp(TypeError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + _ = SchedulingConstraint(update_dict={'l1': 1}) + + def test_null_constraint(self): + ''' Null constraint. ''' + cstr = SchedulingConstraint() + + self.assertTrue(cstr.is_valid_top_bl((1, 1, 2), (0, 1, 2))) + self.assertTrue(cstr.is_valid_top_bl((3, 4, 5), (2, 1, 0))) + self.assertTrue(cstr.is_valid_top_bl((1, 1, 1), (1, 2, 0))) + + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + def test_is_valid_top_bl(self): + ''' Whether is_valid_top_bl. ''' + cstr = SchedulingConstraint(topbat=2, topofm=4) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.BAT] == 2 and bl_t[le.OFM] == 4) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraint(topifm=4) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.IFM] == 4) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraint() + for bl_t, bl_ord in self._gen_bl(): + self.assertTrue(cstr.is_valid_top_bl(bl_t, bl_ord)) + + def test_is_valid_part(self): + ''' Whether is_valid_part. ''' + cstr = SchedulingConstraintLayerPipeline( + topbat=2, topifm=1, topofm=4, fbifm=True, fbofm=False) + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + cstr = SchedulingConstraintLayerPipeline(topbat=2, topofm=4, fbifm=True) + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + cstr = SchedulingConstraintLayerPipeline() + self.assertTrue(cstr.is_valid_part(PartitionScheme( + order=range(pe.NUM), pdims=[(2, 2)] * pe.NUM))) + + def test_is_valid_before_update(self): + ''' is_valid_top_bl and is_valid_part called before update. ''' + cstr = SchedulingConstraint( + topofm=4, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + cstr.is_valid_top_bl([1] * le.NUM, range(le.NUM)) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraint: ' + '.*update_dict.*'): + cstr.is_valid_part(PartitionScheme(order=range(pe.NUM), + pdims=[(2, 2)] * pe.NUM)) + + def test_filter_gen_ts(self): + ''' Get filter_gen_ts. 
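The base-class is_valid_top_bl checks above reduce to exact matching on the constrained top-level factors, with 0 meaning unconstrained; the base class ignores loop order. A sketch, assuming the LoopEnum order (IFM, OFM, BAT) implied by the indices used in these tests:

    def is_valid_top_bl(bl_t, topifm=0, topofm=0, topbat=0):
        tifm, tofm, tbat = bl_t
        return all(want in (0, got) for want, got in
                   ((topifm, tifm), (topofm, tofm), (topbat, tbat)))

    assert is_valid_top_bl((1, 4, 2), topbat=2, topofm=4)
    assert not is_valid_top_bl((1, 4, 3), topbat=2, topofm=4)
    assert is_valid_top_bl((9, 9, 9))  # null constraint accepts anything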
''' + gen_tifm = util.factorize(36, 3) + gen_tofm = util.factorize(20, 3) + gen_tbat = util.factorize(16, 3) + + cstr = SchedulingConstraint(topbat=2, topofm=4) + + gifm, gifm0, gen_tifm = itertools.tee(gen_tifm, 3) + gofm, gofm0, gen_tofm = itertools.tee(gen_tofm, 3) + gbat, gbat0, gen_tbat = itertools.tee(gen_tbat, 3) + fgifm, fgofm, fgbat = cstr.filter_gen_ts(gifm, gofm, gbat) + + self.assertSetEqual(set(fgifm), set(gifm0)) + set_fgofm = set(fgofm) + set_fgbat = set(fgbat) + self.assertTrue(set_fgofm.issubset(set(gofm0))) + self.assertTrue(set_fgbat.issubset(set(gbat0))) + self.assertSetEqual(set_fgofm, + set([(4,) + tpl for tpl in util.factorize(5, 2)])) + self.assertSetEqual(set_fgbat, + set([(2,) + tpl for tpl in util.factorize(8, 2)])) + + cstr = SchedulingConstraint(topifm=4) + + gifm, gifm0, gen_tifm = itertools.tee(gen_tifm, 3) + gofm, gofm0, gen_tofm = itertools.tee(gen_tofm, 3) + gbat, gbat0, gen_tbat = itertools.tee(gen_tbat, 3) + fgifm, fgofm, fgbat = cstr.filter_gen_ts(gifm, gofm, gbat) + + self.assertSetEqual(set(fgofm), set(gofm0)) + self.assertSetEqual(set(fgbat), set(gbat0)) + set_fgifm = set(fgifm) + self.assertTrue(set_fgifm.issubset(set(gifm0))) + self.assertSetEqual(set_fgifm, + set([(4,) + tpl for tpl in util.factorize(9, 2)])) + + cstr = SchedulingConstraint() + + gifm, gifm0, gen_tifm = itertools.tee(gen_tifm, 3) + gofm, gofm0, gen_tofm = itertools.tee(gen_tofm, 3) + gbat, gbat0, gen_tbat = itertools.tee(gen_tbat, 3) + fgifm, fgofm, fgbat = cstr.filter_gen_ts(gifm, gofm, gbat) + + self.assertSetEqual(set(fgifm), set(gifm0)) + self.assertSetEqual(set(fgofm), set(gofm0)) + self.assertSetEqual(set(fgbat), set(gbat0)) + + def test_update_by_prev(self): + ''' Modifier update_by_prev. ''' + cstr = SchedulingConstraint( + topofm=4, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 4) + + r = SchedulingConstraint(topifm=2) + cstr.update_by_prev({'l1': None, 'l2': r}) + + self.assertEqual(cstr.topbat, 1) + self.assertEqual(cstr.topifm, 2) + self.assertEqual(cstr.topofm, 4) + + self.assertFalse(cstr.is_valid_top_bl([1, 4, 1], range(le.NUM))) + self.assertTrue(cstr.is_valid_top_bl([2, 4, 1], range(le.NUM))) + + def test_content_hash(self): + ''' Content-based hash. ''' + cstr1 = SchedulingConstraint(topbat=2) + cstr2 = SchedulingConstraint(topbat=2) + self.assertNotEqual(id(cstr1), id(cstr2)) + self.assertEqual(hash(cstr1), hash(cstr2)) + self.assertEqual(cstr1, cstr2) + + cstr3 = SchedulingConstraint( + topbat=2, + update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + r = SchedulingConstraint(topifm=2) + cstr3.update_by_prev({'l1': None, 'l2': r}) + cstr4 = SchedulingConstraint(topifm=2, topbat=1) + self.assertNotEqual(id(cstr3), id(cstr4)) + self.assertEqual(hash(cstr3), hash(cstr4)) + self.assertEqual(cstr3, cstr4) + + def test_repr(self): + ''' __repr__. 
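filter_gen_ts, tested above, prunes the factorization generators so that only tuples with the required top-level factor survive, and passes unconstrained generators through untouched. A self-contained sketch (factorize here is a stand-in for util.factorize, which I am assuming yields ordered factor tuples whose product is n):

    def factorize(n, k):
        # All ordered k-tuples of positive ints whose product is n.
        if k == 1:
            yield (n,)
            return
        for f in range(1, n + 1):
            if n % f == 0:
                for rest in factorize(n // f, k - 1):
                    yield (f,) + rest

    def filter_top(gen, top):
        # top == 0 means unconstrained; otherwise the first factor must match.
        return (t for t in gen if top == 0 or t[0] == top)

    assert set(filter_top(factorize(20, 3), 4)) == \
        set((4,) + t for t in factorize(5, 2))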
''' + cstr = SchedulingConstraint(topbat=2) + self.assertIn('SchedulingConstraint(', repr(cstr)) + self.assertIn('topbat=2', repr(cstr)) + self.assertIn('topifm=0', repr(cstr)) + self.assertIn('topofm=0', repr(cstr)) + + cstr = SchedulingConstraint(update_dict={ + 'l1': lambda s, _: setattr(s, 'topbat', 1), + 'l2': lambda s, r: setattr(s, 'topifm', r.topifm), + }) + self.assertIn('update_dict=', repr(cstr)) + self.assertIn('l1', repr(cstr)) + self.assertIn('l2', repr(cstr)) + + +class TestSchedulingConstraintLayerPipeline(TestSchedulingConstraintFixture): + ''' Tests for SchedulingConstraintLayerPipeline. ''' + + def test_valid_args(self): + ''' Valid arguments. ''' + cstr = SchedulingConstraintLayerPipeline( + topbat=2, topifm=1, topofm=4, fbifm=True, fbofm=False) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 4) + + cstr = SchedulingConstraintLayerPipeline(topbat=2, topofm=4, fbifm=True) + self.assertEqual(cstr.topbat, 2) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 4) + + cstr = SchedulingConstraintLayerPipeline() + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 0) + self.assertEqual(cstr.topofm, 0) + + cstr = SchedulingConstraintLayerPipeline(fbifm=True, fbofm=True) + self.assertEqual(cstr.topbat, 0) + self.assertEqual(cstr.topifm, 1) + self.assertEqual(cstr.topofm, 1) + + def test_invalid_args(self): + ''' Invalid arguments. ''' + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraintLayerPipeline: ' + '.*IFM.*'): + _ = SchedulingConstraintLayerPipeline(topifm=2, fbifm=True) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraintLayerPipeline: ' + '.*OFM.*'): + _ = SchedulingConstraintLayerPipeline(topofm=2, fbofm=True) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingConstraintLayerPipeline: ' + '.*IFM.*OFM.*'): + _ = SchedulingConstraintLayerPipeline(topifm=2, topofm=2) + + def test_null_constraint(self): + ''' Null constraint. ''' + cstr = SchedulingConstraintLayerPipeline() + + self.assertTrue(cstr.is_valid_top_bl((1, 1, 2), (0, 1, 2))) + self.assertTrue(cstr.is_valid_top_bl((3, 4, 5), (2, 1, 0))) + self.assertTrue(cstr.is_valid_top_bl((1, 1, 1), (1, 2, 0))) + + def test_is_valid_top_bl(self): + ''' Whether is_valid_top_bl. ''' + cstr = SchedulingConstraintLayerPipeline(topbat=2, topofm=4, fbifm=True) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.BAT] == 2 and bl_t[le.IFM] == 1 + and bl_t[le.OFM] == 4 + and bl_ord[le.BAT] > bl_ord[le.OFM]) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline(topifm=4, fbofm=True) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.IFM] == 4 and bl_t[le.OFM] == 1 + and (bl_ord[le.IFM] > bl_ord[le.BAT] + or bl_t[le.BAT] == 1)) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline(topofm=4) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.OFM] == 4 + and (bl_ord[le.OFM] > bl_ord[le.BAT] + or bl_t[le.BAT] == 1) + and (bl_ord[le.OFM] > bl_ord[le.IFM] + or bl_t[le.IFM] == 1)) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline(fbifm=True) + for bl_t, bl_ord in self._gen_bl(): + valid = (bl_t[le.IFM] == 1) + self.assertEqual(cstr.is_valid_top_bl(bl_t, bl_ord), valid) + + cstr = SchedulingConstraintLayerPipeline() + for bl_t, bl_ord in self._gen_bl(): + self.assertTrue(cstr.is_valid_top_bl(bl_t, bl_ord)) + + def test_repr(self): + ''' __repr__. 
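The pipeline subclass below adds loop-order requirements on top of the factor matching: a constrained loop must sit at the top of the blocking hierarchy, so it must be ordered outside the batch (or other) loops unless those loops are trivial. One concrete case, restating the first validity expression from the pipeline is_valid_top_bl test below (topbat=2, topofm=4, fbifm=True), with the assumed LoopEnum order (IFM, OFM, BAT):

    IFM, OFM, BAT = 0, 1, 2

    def valid_case(bl_t, bl_ord):
        return (bl_t[BAT] == 2 and bl_t[IFM] == 1 and bl_t[OFM] == 4
                and bl_ord[BAT] > bl_ord[OFM])  # BAT loop outside OFM loop

    assert valid_case((1, 4, 2), (0, 1, 2))
    assert not valid_case((1, 4, 2), (0, 2, 1))  # OFM outside BAT: rejected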
''' + cstr = SchedulingConstraintLayerPipeline(topbat=2, fbifm=True) + self.assertIn('SchedulingConstraintLayerPipeline', repr(cstr)) + self.assertIn('topbat=2', repr(cstr)) + self.assertIn('topifm=1', repr(cstr)) + self.assertIn('topofm=0', repr(cstr)) + diff --git a/nn_dataflow/tests/unit_test/test_scheduling_result.py b/nn_dataflow/tests/unit_test/test_scheduling_result.py index 39aad90..30ae01c 100644 --- a/nn_dataflow/tests/unit_test/test_scheduling_result.py +++ b/nn_dataflow/tests/unit_test/test_scheduling_result.py @@ -44,6 +44,7 @@ def setUp(self): [30, 40, 50], [400, 500, 600], [5000, 6000, 7000]]), + ('remote_gbuf_access', [0, 0, 0]), ('total_nhops', [123, 456, 789]), ('fetch', [[1, 2, 1], [3, 4, 5]]), ]) @@ -55,38 +56,60 @@ def setUp(self): type=NodeRegion.DRAM),), parts=(part,)) + self.sched_seq = (2, 0, 0) + def test_valid_args(self): ''' Valid arguments. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertIn('ops', result.scheme) self.assertIn('total_nhops', result.scheme) self.assertEqual(result.ofmap_layout, self.ofmap_layout) + self.assertTupleEqual(result.sched_seq, self.sched_seq) def test_invalid_scheme(self): ''' Invalid scheme. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingResult: .*scheme.*'): _ = SchedulingResult(scheme={}, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) def test_invalid_ofmap_layout(self): ''' Invalid ofmap_layout. ''' with self.assertRaisesRegexp(TypeError, 'SchedulingResult: .*ofmap_layout.*'): _ = SchedulingResult(scheme=self.scheme, - ofmap_layout=None) + ofmap_layout=None, + sched_seq=self.sched_seq) + + def test_invalid_sched_seq(self): + ''' Invalid sched_seq. ''' + with self.assertRaisesRegexp(TypeError, + 'SchedulingResult: .*sched_seq.*'): + _ = SchedulingResult(scheme=self.scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=list(self.sched_seq)) + + with self.assertRaisesRegexp(ValueError, + 'SchedulingResult: .*sched_seq.*'): + _ = SchedulingResult(scheme=self.scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq[:-1]) def test_total_cost(self): ''' Accessor total_cost. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_cost, 1.234 + 9.876) def test_total_time(self): ''' Accessor total_time. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_time, 123.4) self.assertGreaterEqual(result.total_time, result.total_node_time) @@ -95,55 +118,74 @@ def test_total_time(self): def test_total_node_time(self): ''' Accessor total_node_time. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_node_time, max(59, 40)) scheme = self.scheme scheme['bus_time'] = 100 result = SchedulingResult(scheme=scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_node_time, max(59, 100)) def test_total_dram_time(self): ''' Accessor total_dram_time. 
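The timing accessors above and just below fit together as follows (the relation is my reading of the assertions, not the library's documented contract): total_node_time is the slower of the processing time and the bus time, and total_time must dominate both the node time and the DRAM time.

    proc_time, bus_time, dram_time = 59, 100, 120
    node_time = max(proc_time, bus_time)  # 100, as asserted above
    total_time = 123.4                    # reported by the fixture's scheme
    assert total_time >= node_time and total_time >= dram_time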
''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_dram_time, 120) def test_total_proc_time(self): ''' Accessor total_proc_time. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_proc_time, 59) scheme = self.scheme scheme['bus_time'] = 100 result = SchedulingResult(scheme=scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertAlmostEqual(result.total_proc_time, 59) def test_total_ops(self): ''' Accessor total_ops. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertEqual(result.total_ops, 1234) def test_total_accesses(self): ''' Accessor total_cost. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertSequenceEqual(result.total_accesses, [9, 120, 1500, 18000]) + def test_total_accesses_rgbuf(self): + ''' Accessor total_accesses remote gbuf. ''' + scheme = self.scheme.copy() + scheme['remote_gbuf_access'] = [10, 20, 30] + result = SchedulingResult(scheme=scheme, + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) + self.assertSequenceEqual(result.total_accesses, + [9, 120 + 60, 1500, 18000]) + def test_total_noc_hops(self): ''' Accessor total_noc_hops. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertEqual(result.total_noc_hops, 1368) def test_num_nodes(self): ''' Accessor num_nodes. ''' result = SchedulingResult(scheme=self.scheme, - ofmap_layout=self.ofmap_layout) + ofmap_layout=self.ofmap_layout, + sched_seq=self.sched_seq) self.assertEqual(result.num_nodes, 4) diff --git a/nn_dataflow/tests/unit_test/test_util.py b/nn_dataflow/tests/unit_test/test_util.py index 392e10b..ff37455 100644 --- a/nn_dataflow/tests/unit_test/test_util.py +++ b/nn_dataflow/tests/unit_test/test_util.py @@ -338,6 +338,128 @@ def test_equal_size(self): self.assertLessEqual(max_size - min_size, 1) +class TestUtilGCD(unittest.TestCase): + ''' Tests for util.gcd. ''' + + def test_int(self): + ''' Integers. ''' + self.assertEqual(util.gcd(3, 4), 1) + self.assertEqual(util.gcd(8, 4), 4) + self.assertEqual(util.gcd(3, 9), 3) + self.assertEqual(util.gcd(15, 12), 3) + self.assertEqual(util.gcd(300, 410), 10) + + def test_multi(self): + ''' Multiple values. ''' + self.assertEqual(util.gcd(4, 8, 10), 2) + self.assertEqual(util.gcd(*range(6, 21, 3)), 3) + + def test_single(self): + ''' Single value. ''' + for v in range(1, 10): + self.assertEqual(util.gcd(v), v) + + def test_no_arg(self): + ''' No argument. ''' + with self.assertRaises(ValueError): + _ = util.gcd() + + def test_float(self): + ''' Float. ''' + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.gcd(1., 2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.gcd(1, 2.2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.gcd(1, 2, 3, 4.2) + + def test_non_positive(self): + ''' Non-positive values. 
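test_total_accesses_rgbuf above shows where remote gbuf traffic lands in the totals: accesses served by other nodes' gbufs are added to the GBUF entry of total_accesses. Checking the arithmetic with the fixture's numbers:

    gbuf_accesses = 30 + 40 + 50        # local gbuf accesses: 120
    remote_gbuf_access = [10, 20, 30]   # accesses served by remote gbufs: 60
    assert gbuf_accesses + sum(remote_gbuf_access) == 120 + 60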
''' + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(-1, 2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(1, -2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(3, 6, 9, 12, -21) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(3, 0) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(0, 3) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(0, 5, 10, 15, 20) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.gcd(5, 10, 0, 15, 20) + + +class TestUtilLCM(unittest.TestCase): + ''' Tests for util.lcm. ''' + + def test_int(self): + ''' Integers. ''' + self.assertEqual(util.lcm(3, 4), 12) + self.assertEqual(util.lcm(8, 4), 8) + self.assertEqual(util.lcm(3, 9), 9) + self.assertEqual(util.lcm(15, 12), 60) + self.assertEqual(util.lcm(300, 410), 12300) + + def test_multi(self): + ''' Multiple values. ''' + self.assertEqual(util.lcm(4, 8, 10), 40) + self.assertEqual(util.lcm(*range(6, 21, 3)), 180) + + def test_single(self): + ''' Single value. ''' + for v in range(1, 10): + self.assertEqual(util.lcm(v), v) + + def test_no_arg(self): + ''' No argument. ''' + with self.assertRaises(ValueError): + _ = util.lcm() + + def test_float(self): + ''' Float. ''' + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.lcm(1., 2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.lcm(1, 2.2) + + with self.assertRaisesRegexp(TypeError, '.*integers.*'): + _ = util.lcm(1, 2, 3, 4.2) + + def test_non_positive(self): + ''' Non-positive values. ''' + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(-1, 2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(1, -2) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(3, 6, 9, 12, -21) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(3, 0) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(0, 3) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(0, 5, 10, 15, 20) + + with self.assertRaisesRegexp(ValueError, '.*positive.*'): + _ = util.lcm(5, 10, 0, 15, 20) + + class TestUtilIsclose(unittest.TestCase): ''' Tests for util.isclose. ''' diff --git a/nn_dataflow/tools/nn_dataflow_search.py b/nn_dataflow/tools/nn_dataflow_search.py index 833e167..edd13ae 100644 --- a/nn_dataflow/tools/nn_dataflow_search.py +++ b/nn_dataflow/tools/nn_dataflow_search.py @@ -71,6 +71,8 @@ def stats_dict(dfsch, cost): stats['active_node_pes'] = dfsch.perlayer_stats('active_node_pes') stats['dram_bandwidth'] = dfsch.perlayer_stats('dram_bandwidth') + stats['segment_time'] = dfsch.segment_time_list() + stats['segment_dram_time'] = dfsch.segment_dram_time_list() stats['input_layout'] = dfsch.input_layout stats['ext_layout_dict'] = dfsch.ext_layout_dict stats['schedules'] = dfsch.res_dict @@ -129,7 +131,8 @@ def do_scheduling(args): size_gbuf=size_gbuf, size_regf=size_regf, array_bus_width=array_bus_width, - dram_bandwidth=dram_bandwidth) + dram_bandwidth=dram_bandwidth, + no_time_mux=False) ## Cost. 
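The hunk below threads the new CLI flags into Option. For reference, a hypothetical direct construction with the new knobs (flag values are illustrative, and this assumes the package from this patch is importable; only a subset of keyword arguments is shown since the rest keep their defaults):

    from nn_dataflow.core import Option

    options = Option(sw_solve_loopblocking=False,
                     hw_access_forwarding=True,    # --enable-access-forwarding
                     hw_gbuf_sharing=False,        # conflicts with forwarding
                     hw_gbuf_save_writeback=True,  # --enable-save-writeback
                     partition_interlayer=True,    # --interlayer-partition
                     layer_pipeline_time_ovhd=0.5,
                     layer_pipeline_max_degree=4,
                     layer_pipeline_opt=True)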
@@ -151,9 +154,16 @@ def do_scheduling(args): bypass[de.FIL] = 'f' not in args.disable_bypass options = Option(sw_gbuf_bypass=tuple(bypass), sw_solve_loopblocking=args.solve_loopblocking, + hw_access_forwarding=args.enable_access_forwarding, + hw_gbuf_sharing=args.enable_gbuf_sharing, + hw_gbuf_save_writeback=args.enable_save_writeback, partition_hybrid=args.hybrid_partition, partition_batch=args.batch_partition, partition_ifmaps=args.ifmaps_partition, + partition_interlayer=args.interlayer_partition, + layer_pipeline_time_ovhd=args.layer_pipeline_time_overhead, + layer_pipeline_max_degree=args.layer_pipeline_max_degree, + layer_pipeline_opt=not args.disable_interlayer_opt, opt_goal=args.goal.lower(), ntops=args.top, nprocesses=args.processes, @@ -249,6 +259,20 @@ def argparser(): ap.add_argument('--solve-loopblocking', action='store_true', help='Use analytical solver to choose loop blocking. ' 'Otherwise use exhaustive search.') + ap.add_argument('--enable-access-forwarding', action='store_true', + help='Each node fetches a subset of data and forwards to ' + 'other nodes.') + ap.add_argument('--enable-gbuf-sharing', action='store_true', + help='Share gbuf capacity across nodes through NoC.') + ap.add_argument('--enable-save-writeback', action='store_true', + help='Allow to save the writeback to memory for the ' + 'intermediate data between layers if able to ' + 'store the entire data set in on-chip buffers.') + ap.add_argument('--disable-interlayer-opt', + '--basic-interlayer-partition', + action='store_true', + help='Disable optimizations and only allow basic ' + 'inter-layer pipeline.') ap.add_argument('--hybrid-partition', '--hybrid-partition2d', # deprecated old name @@ -262,6 +286,20 @@ def argparser(): action='store_true', help='Allow partitioning ifmap channel dimension, which ' 'requires extra data synchronization.') + ap.add_argument('--interlayer-partition', '--inter-layer-partition', + action='store_true', + help='Allow partitioning resources across multiple layers ' + 'and process them simultaneously as an inter-layer ' + 'pipeline.') + + ap.add_argument('--layer-pipeline-time-overhead', + type=float, default=float('inf'), + help='maximum allowed execution time overhead due to ' + 'layer pipelining.') + ap.add_argument('--layer-pipeline-max-degree', + type=float, default=float('inf'), + help='maximum allowed layer pipelining degree, i.e., ' + 'number of vertices in a pipeline segment.') ap.add_argument('-g', '--goal', default='e', choices=['e', 'd', 'ed', 'E', 'D', 'ED'], diff --git a/nn_dataflow/util.py b/nn_dataflow/util.py index ccc9edc..368efe8 100644 --- a/nn_dataflow/util.py +++ b/nn_dataflow/util.py @@ -217,6 +217,48 @@ def get_ith_range(rng, idx, num): return beg, end +def gcd(*values): + ''' + Get the greatest common divisor of the given values. + ''' + if any(not isinstance(v, int) for v in values): + raise TypeError('value must be integers.') + if any(v <= 0 for v in values): + raise ValueError('arguments must be positive.') + + if not values: + raise ValueError('must give at least 1 value.') + if len(values) == 1: + return values[0] + if len(values) > 2: + return reduce(gcd, values) + + a, b = values + while b: + a, b = b, a % b + return a + + +def lcm(*values): + ''' + Get the least common multiple of the given values. 
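The variadic gcd above (and lcm just below) reduce the two-argument case over the whole argument list. The same shape in a few self-checking lines of plain Python, with functools.reduce spelled explicitly (the module uses a bare reduce):

    from functools import reduce

    def gcd2(a, b):
        # Euclid's algorithm for two positive integers.
        while b:
            a, b = b, a % b
        return a

    assert reduce(gcd2, (4, 8, 10)) == 2                               # util.gcd(4, 8, 10)
    assert reduce(lambda a, b: a * b // gcd2(a, b), (4, 8, 10)) == 40  # util.lcm(4, 8, 10)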
+ ''' + if any(not isinstance(v, int) for v in values): + raise TypeError('value must be integers.') + if any(v <= 0 for v in values): + raise ValueError('arguments must be positive.') + + if not values: + raise ValueError('must give at least 1 value.') + if len(values) == 1: + return values[0] + if len(values) > 2: + return reduce(lcm, values) + + a, b = values + return a * b // gcd(a, b) + + def isclose(vala, valb, rel_tol=1e-9, abs_tol=0.0): ''' Whether two values are close to each other. diff --git a/requirements.txt b/requirements.txt index 311d9de..8fbb362 100644 --- a/requirements.txt +++ b/requirements.txt @@ -4,3 +4,4 @@ fastcache==1.0.2 pytest==3.1.2 pytest-cov==2.5.1 pytest-xdist==1.17.1 +sympy==1.2.0 diff --git a/setup.py b/setup.py index fa36cca..0358120 100644 --- a/setup.py +++ b/setup.py @@ -54,6 +54,7 @@ def _readme(): 'pytest>=3', 'pytest-cov>=2', 'pytest-xdist>=1', + 'sympy>=1', ], entry_points={