Tags: LBANN/lbann
Improved I/O performance for Python dataset reader by reusing memory allocation.
Improved the performance of the Python dataset data reader by using shared memory allocations, which reduces the overhead of copying between Python and C++.
2024_09_10_v0.105_pre_release Incorporating updated CI build infrastructure and associated superbuild configurations.
Integration of features and bug fixes targeted for the v0.105 release.
Baseline version for benchmarking Wide-ResNet50 model with AMP.
============================== Release Notes: v0.104 ================… …==============
C++ API:

Support for new training algorithms:

Support for new network structures:
 - Added GPT-3 transformers and training recipes

Support for new layers:
 - Select operator (set tensor value based on predicate)
 - Model parallelism for channel-wise fully-connected layers

Python front-end:
 - Support for PyTorch Module conversion to LBANN graphs (requires PyTorch 2.0 or newer, compiled with PyTorch Dynamo)

Performance optimizations:
 - Support in-place computations for capable layers as a memory optimization
 - Allow distconv-enabled convolution and batchnorm layers to reuse their input activations as error signals as a memory optimization if the parent layer does not need its activations in the backward pass. This optimization can be disabled by setting the environment variable DISTCONV_DISABLE_MEM_OPT=1.
 - Added support for selective weight sharding (also known as Fully-Sharded Data Parallelism, or FSDP). To enable, set sharded=true on weight objects.
 - Allow distconv to be disabled at runtime with LBANN_DISABLE_DISTCONV=1.
 - Activations are now deallocated when no longer needed via a reference counter; disable with LBANN_DISABLE_ACT_GC=1.
 - Added option for LBANN to set the number of OMP threads to a modest default (4) if the environment doesn't specify anything.
 - Save memory on backpropagation by not replicating gradients between GradientManager and data_type_optimizer
 - Save more memory in FSDP by synchronizing previous outstanding async communication calls and freeing up local gradient contributions
 - FSDP: release full weight views after backprop
 - Batching heads in multi-head attention into single operations instead of on a per-head basis
 - Stacking the weights and biases for queries/keys/values in self-attention

Model portability & usability:
 - Added support for profiling with Caliper

Experiments & Applications:
 - Updated CosmoFlow model to automatically scale the model architecture and parallelism with input size.
 - Added a PyTorch reference implementation of CosmoFlow.

Internal features:
 - Removed the mini_batch_size parameter from the following functions in the layer class hierarchy (fp_setup_inputs, fp_setup_outputs, bp_setup_gradient_wrt_inputs) and in the distconv_adapter class (fp_setup, bp_setup)
 - Support global and local gradient norm clipping with the clip_gradient_norm callback
 - Interactive progress bar with the progress_bar callback
 - Evaluate progress callback allows for periodic monitoring during training with an independent data set (intra-epoch evaluation)
 - Detailed memory usage profiling with the memory_profiler callback
 - Refactored subgraph parallelism

I/O & data readers:
 - Renamed percent_of_data_to_use more accurately to fraction_of_data_to_use.
 - DataReaderMetaData, training_dr_linearized_data_size, and num_parallel_readers were removed from the model and layer API, and instead reside in the data ingestion pipeline.
 - Fixed implementation of background I/O to achieve better decoupling of background data fetch. Can be enabled / disabled with a runtime flag.
 - Set the default number of I/O threads to 4
 - Changed the I/O and transform pipeline to use a bank of RNGs that is now indexed by the sample ID in the load sequence, rather than the I/O thread ID. This eliminates variability when using different numbers of I/O threads.
 - Moved state tracking current position in a data set from the data reader to the dataset class.
 - Split the I/O RNGs into two banks, one for training and one for all other execution modes.

Build system:
 - Updated build script to use CachedCMakeProject mode, which should simplify the overall workflow
 - Set a default time limit for CI tests to avoid unnecessary stalls

Bug fixes:
 - Fixed a bug where in-place layers sometimes attached a locked view of a matrix to a mutable view.
 - Fixed a bug when trying to use the legacy HDF5 data reader without data store.
 - Fixed concurrency bugs in the data store
 - Fixed DistConv memory optimization bug

Retired features:
 - Support for autoencoder strategy in the summarize images callback was removed
 - Removed deprecated Layer protobuf fields: weight_data, num_neurons_from_data_reader
 - Removed support for calculating a global mini-batch across multiple models using the imcomm callback or multiple trainers. The mini-batch is now strictly contained to a single model in a single trainer. This deprecates an unused (and old) multi-model execution mode using the imcomm callback that predated LTFB.
 - Removed the notion of effective mini-batch size versus current mini-batch size.
 - Removed world master mini-batch adjustment.
 - Removed model offset field. No longer necessary since data sets do not span models.
 - Removed the cached value of the current mini-batch size from the SGD execution context. It is now only cached in the model.
 - Removed the imcomm "inter-model" callback
 - Removed the num-parallel-readers parameter to the I/O subsystem. This eliminates an older version of I/O parallelism that relied on a non-data-parallel I/O buffer and had different ranks fetching entire mini-batches. It is superseded by standard data-parallel I/O.
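The v0.104 change to index the I/O RNGs by sample ID rather than by I/O thread ID can be illustrated with a small sketch. This is not LBANN code: `BASE_SEED`, `augment`, and `load_epoch` are hypothetical names, and seeding one generator per sample is a simplification of the "bank of RNGs" the notes describe. The point it shows is only that, once randomness depends on the sample's position in the load sequence, the thread count no longer affects the result.

```python
import random
from concurrent.futures import ThreadPoolExecutor

BASE_SEED = 1234  # hypothetical base seed, not an LBANN setting


def augment(sample_id):
    """Stand-in for a random transform: draws one value for the given sample."""
    # The generator is derived from the sample's position in the load
    # sequence, so it does not matter which thread handles the sample.
    rng = random.Random(BASE_SEED * 1_000_003 + sample_id)
    return rng.random()


def load_epoch(num_samples, num_threads):
    """Process all samples with a pool of I/O threads, keeping sample order."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in input order regardless of scheduling.
        return list(pool.map(augment, range(num_samples)))


one_thread = load_epoch(16, num_threads=1)
four_threads = load_epoch(16, num_threads=4)
```

Here `one_thread` and `four_threads` hold the same values. Had each thread owned its own generator instead, the values would depend on which thread happened to pick up which sample.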
Stable branch used for the Cosmoflow baseline experiments.
============================== Release Notes: v0.103 ================… …==============
C++ API:
 - Added ability to load models and run inference from external C++ applications
 - Added inference-only execution algorithm

Support for new training algorithms:
 - 2nd-order optimization with K-FAC. Currently supports fully-connected, convolution, and GRU layers.
 - Added Sub-graph parallelism support for multi-branch architectures (split, slice, sum, and concat layers)
 - Data + sub-graph parallelism for in-core models (D&SP and D&SP-cSub)
 - Initial sub-graph parallelism support for common layers in Transformers
 - Model topology mutation in LTFB/PBT
 - Added sub-grid parallelism support for K-FAC using primary and secondary grids
 - Truncation selection exchange for LTFB/PBT
 - Regularized evolution for LTFB/PBT
 - Hyperparameter grid search
 - Multi-GAN training algorithm with multiple discriminators

Support for new network structures:
 - Edge-conditioned graph neural networks
 - RoBERTa with pretrained weights

Support for new layers:
 - Added support for 2D Matrices for Scatter and Gather layers
 - Added support for distributed Scatter and Gather layers
 - DistConv enabled 3D MatMul
 - Added image rotation layer and composite image transformation layer (rotate, shear, translate)
 - Added distributed tensor parallelism with channelwise decomposition for channelwise fully connected layer
 - Added "binary-with-constant" operators
 - Updated deconvolution layer to match PyTorch's API
 - Updated identity layer to copy tensors to enable tensor parallelism in subsequent layers in the compute graph
 - Added IdentityZero layer that allows alternating generator/discriminator updates for training GANs.
 - Added an External layer that enables a separately-compiled library to be loaded dynamically
 - Added support for labels_only mode on data-parallel cross entropy layer

Python front-end:
 - Added support for building and launching jobs on Fugaku
 - Added Riken as a known compute center
 - Added Perlmutter as a known compute center
 - Added support for PJM as job launcher
 - Unified convolution/deconvolution interface to better approximate PyTorch.
 - Added circular (periodic) padding transformation for 2D and 3D tensors
 - Added support for Flux job scheduler

Performance optimizations:
 - Enabled the input layers to use a view of the I/O buffers in the buffered data coordinator
 - Use default-allocated GPU memory for long-lived buffers
 - Optimized GPU kernels for entry-wise operators
 - Optionally use default-allocated GPU memory for long-lived buffers

Model portability & usability:
 - Weight initialization from NumPy files
 - Expanded layer documentation

Experiments & Applications:
 - Example for training Transformer model with D&SP and D&SP-cSub
 - PROBIESNet model for HRRL data
 - Cosmo 3D GAN
 - MNIST GAN
 - Image GAN
 - Example Distributed Graph Convolutions Networks
 - NASNet
 - RoBERTa

Internal features:
 - Added operator class
 - Added AlternateUpdates callback to be used with IdentityZero layers for training GANs.
 - Added support for serializing network architectures to protobuf format.
 - Reformatted headers and implementation files for a more IWYU paradigm.
 - General support for ROCm-enabled DistConv
 - Support for use of libfabric plugin for RCCL and NCCL
 - Framework-wide improvements in support for ROCm and MIOpen
 - Callback for alternating optimizer layer update
 - Command line argument to hang the LBANN application for debugging
 - Add a cuTT/hipTT backend to the permute layer
 - Add a permute layer utilizing cuTENSOR for the permute implementation
 - Weight initializer from NumPy file

I/O & data readers:
 - Updated SMILES data reader to use sample lists
 - Added explicitly managed buffered reading and local unpacking for the SMILES data reader to minimize file access
 - Sample lists with integral indices can use range format (start ... end)
 - Added a new extensible HDF5 data reader that uses a data set schema and experiment schema files to define how the data is represented. This allows the user to change the representation of data without changing the data reader.
 - Changed the input layer to take a data field and only produce a single output. Currently valid data fields are samples, labels, and responses.
 - Added support for using arbitrary field names with HDF5 data reader.
 - Updated the data coordinator and data readers to take dynamic data fields rather than fixed fields. Input buffers are no longer allocated for fields that are not used in active models.
 - Added support in the generic data reader and synthetic data reader classes for arbitrary data fields.
 - Added support for data readers to return full Conduit nodes to the Data Coordinator.
 - Data coordinator can now directly return packed data fields to input layers.
 - Added padding and cutout transformations

Build system:
 - Added support for using upstream Spack repositories
 - Added support to reuse existing Spack environments, which significantly decreases the startup time of running a CI job
 - Enforce consistent GPU targets in Spack environment
 - Switched from Bamboo to GitLab CI framework

Bug fixes:
 - Fixed GPU kernels that launched with more blocks than allowed
 - Fixed build and runtime errors with DistConv
 - Use correct LayerNorm operation in "Attention Is All You Need" Transformer
 - Fixed a bug where the input layer performed unnecessary memory allocations.
 - Bug fixes within Cosmoflow and U-Net models
 - Fixed a bug in the GPU-based computation of the batchnorm statistics
 - Patch for when distconv'd input layer is followed by non-distconv layer
 - Bugfix input layer activations: Fixed the input layer so that it would only resize the activation matrix if it wasn't already set up to be a view of the data_coordinator's matrix. This addresses a significant performance bug in the data ingestion where the activation matrix was a view into the data coordinator's internal buffers.
 - Fixed bad convolution parameters producing incorrect layer shapes.
 - Enabling tensor copy on distconv-enabled Identity layer
 - General cleanup and improvement in the coverage and robustness of CI testing
 - Fix buffer overflow in SMILES data reader
 - Fix a bug in TSE
 - Do not construct bias weights when not needed in conv and FC modules
 - Use tournament set in LTFB with truncation selection exchange
 - Cleanup data reader tests memory leaks
 - Fixed a buffer overrun, heap overflow, and double allocation of the data store in the SMILES data reader
 - Match LayerNorm and InstanceNorm layers to PyTorch
 - Make sure GPU grid dims are valid in slice/concat layers
 - Fixed incorrect matrix ordering in K-FAC for conv layer
 - Bugfix for polynomial learning rate schedule
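For readers unfamiliar with the circular (periodic) padding listed under the v0.103 Python front-end, the following is a minimal pure-Python sketch of the idea on a 2D grid. It is not the LBANN transformation: `circular_pad_2d` is a hypothetical helper that only shows what "wrap-around" padding means for each border cell.

```python
def circular_pad_2d(grid, pad):
    """Pad a 2D list-of-lists by `pad` cells on every side, wrapping around."""
    rows, cols = len(grid), len(grid[0])
    # Output cell (r, c) reads from the input cell shifted back by `pad`,
    # with indices taken modulo the input size so the edges wrap.
    return [
        [grid[(r - pad) % rows][(c - pad) % cols] for c in range(cols + 2 * pad)]
        for r in range(rows + 2 * pad)
    ]


grid = [[1, 2, 3],
        [4, 5, 6]]
padded = circular_pad_2d(grid, 1)
# padded == [[6, 4, 5, 6, 4],
#            [3, 1, 2, 3, 1],
#            [6, 4, 5, 6, 4],
#            [3, 1, 2, 3, 1]]
```

The row above the first input row is a copy of the last input row, and likewise for columns, which is what makes the padded tensor periodic.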
============================== Release Notes: v0.102 ================… …==============
Support for new training algorithms:
 - LTFB is now a first-class training algorithm.
 - LTFB now allows multiple metrics. The local algorithm is favored by each trainer and a partner model must win every metric to be declared the tournament winner.
 - The batched iterative optimizer (sgd_training_algorithm) was refactored for consistency.
 - Improved documentation of training algorithm infrastructure.

Support for new network structures:
 - ATOM WAE model - character-based Wasserstein Autoencoder
 - Community GAN model for graph data sets

Support for new layers:
 - "DFTAbs" layer that computes the absolute value of the channel-wise DFT of the input data
 - Added support for 3D Matrix Multiplication
 - Added scatter and gather neural network layers
 - CPU-based GRU layers using oneDNN
 - Added batch-wise reduce-sum
 - ArcFace loss

Python front-end:
 - Added 3D U-Net Model
 - Added Cosmoflow Model
 - Ported CANDLE Pilot1 models
 - Support nvprof
 - Added channelwise fully connected layer
 - Added support for non-square kernels, padding, stride, and dilation for the convolution module
 - Support for OpenMPI launcher

Performance optimizations:
 - Use cuDNN 8 RNN API and CUDA Graphs in GRU layer
 - Cache CUDA Graphs for each active mini-batch size
 - Tuned performance of slice, concatenate, and tessellate layers on ARM processors
 - Parallelize computation of Gaussian random numbers
 - Optimized tessellate, concatenate, and slice layers on CPU

Experiments & Applications:
 - Added experiment scripts for ATOM cWAE Gordon Bell simulations
 - LBANN-ATOM model inference and analysis

Internal features:
 - Wrapper classes for CUDA Graphs API
 - Elementary examples of using complex numbers
 - cuDNN handles are now wrapped in RAII management classes
 - Improved HWLOC compatibility for v1.11 and v2.x
 - Added an enum type of visitor hooks that will eventually be used to allow callbacks or other visitors to operate at user-defined hook points
 - Changed checkpoint logic to checkpoint at the start of epochs and changed the naming scheme to use the callback phase (visitor hook) in the name rather than the current execution context.
 - Added in-memory binary model exchange for LTFB.
 - Added support for ROCm and MIOpen
 - Added support for oneDNN
 - Updated the bamboo test environment to use local executable rather than hard coded executables
 - Overhauled and refactored serialization throughout code to use Cereal serialization library
 - Significant cleanup and refactoring of code base to improve compile times. Moving to ensure that code adheres to standard split of header between declaration and implementation functions (for templated code). Specifically focused on serialization functions and comm class. Reduced dependencies caused by overreaching header inclusions.
 - The relationship of execution_contexts and training_algorithms was clarified. There is still work to do here.
 - Added DistConv tests for both convolution and pooling layers
 - Support padding in distributed embedding layer
 - Added dump model graph callback
 - Added perturb learning rate callback
 - Added batched inference algorithm
 - Switched ATOM tests to use CPU embedding and tessellate layers to minimize noise

I/O & data readers:
 - Experimental data reader that generates graph random walks with HavoqGT
 - Added explicit tournament execution mode
 - Added support to split training data reader into validation and tournament readers
 - node2vec data reader

Build system:
 - Hydrogen v1.5.0+
 - Aluminum v0.5.0+
 - DiHydrogen v0.2.0 is required
 - C++14 or newer standard with CUDA (CMake: "-DCMAKE_CUDA_STANDARD=14")
 - OpenCV is now an optional dependency via CMake "LBANN_WITH_VISION"
 - CNPY is now an optional dependency via CMake "LBANN_WITH_CNPY"
 - Adds support in the build_lbann.sh script for concretizing extra packages with the primary LBANN installation
 - New features in the build script to set up / configure the build environment, but stop and allow the user to manually add extra packages
 - Add a set of user-focused build scripts that use the main build_lbann.sh script to set up good defaults on known systems
 - Added application specific build scripts for users such as ATOM
 - Added support for pulling from Spack mirrors and setting them up
 - Split embedded Python support from Python Front End
 - Switched Spack-based build script to use Spack's clingo concretizer

Bug fixes:
 - Fixed a bug where LBANN didn't set the Hydrogen RNG seed
 - Fixed both the CosmoFlow and UNet models in the PFE, and addressed issues in the data reader and data coordinator.
 - Fixed the HDF5 data reader to properly specify the supported I/O types
 - Fixed calculation of the linearized response size
 - Fixed the data coordinator's interface to input_layer
 - Fixed error with deterministic execution of dropout layers

Retired features:
 - Removed deprecated JAG leader mode which was made obsolete when the data reader moved into the data coordinator
 - Removed the deprecated partitioned data reader modes that were used to partition and overlap data sets for multiple models
 - Removed deprecated ActivationDescriptor class
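The multi-metric LTFB rule in the v0.102 notes (the local model is favored, and a partner must win every metric) can be sketched as a small decision function. This is an illustration rather than LBANN's implementation: `partner_wins`, the metric names, and the `higher_is_better` mapping are hypothetical, and treating a tie as a win for the local model is an assumption drawn from the phrase "the local algorithm is favored".

```python
def partner_wins(local, partner, higher_is_better):
    """Return True only if the partner beats the local model on every metric."""
    for metric, local_value in local.items():
        partner_value = partner[metric]
        if higher_is_better[metric]:
            better = partner_value > local_value
        else:
            better = partner_value < local_value
        if not better:
            # Assumption: any loss or tie keeps the local model.
            return False
    return True


local = {"accuracy": 0.90, "loss": 0.35}
partner_a = {"accuracy": 0.92, "loss": 0.30}  # better on both metrics
partner_b = {"accuracy": 0.95, "loss": 0.40}  # better accuracy, worse loss
direction = {"accuracy": True, "loss": False}

a_wins = partner_wins(local, partner_a, direction)  # True
b_wins = partner_wins(local, partner_b, direction)  # False
```

Requiring a win on every metric makes the exchange conservative: a partner that improves one metric at the expense of another does not replace the local model.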