[RF] New RooFit batch mode with GPU support #9004

Merged
merged 53 commits into from
Dec 11, 2021
08ab252
[RF] Implement prototype RooFitDriver and RooBatchCompute library
manolismih Jan 6, 2021
204eafe
[RF] Implementing Roo{Add,Prod}Pdf, reordering includes
manolismih May 20, 2021
d9d83ec
[RF] Implement more PDFs in the new batchcompute library
manolismih May 26, 2021
4e912bd
[RF] Finalize RooBatchCompute library
manolismih Jun 7, 2021
25a576d
[RF] Implement weighted fits in RooFitDriver
manolismih Jun 14, 2021
e87b08c
[RF] Implement multithread computations in RooBatchCompute
manolismih Jul 1, 2021
7b82a5a
[RF] Follow RooFit naming conventions in `roofit/roobatchcompute`
manolismih Jul 1, 2021
bb68097
[RF] Implement CPU computeBatch for RooGenericPdf
manolismih Jul 5, 2021
688ebc4
[RF] Implement canComputeBatchWithCuda()
manolismih Jul 8, 2021
471c7cf
[RF] Consider constraints term in new RooFit batch mode
guitargeek Jun 23, 2021
8ab4c48
[RF] Implement extended fit mode in new RooFit batch mode
guitargeek Jun 23, 2021
78e7bd3
[RF] Ignore RooAbsArgs that are above fitted node in RooFitDrver
guitargeek Jun 23, 2021
7dbfd09
[RF] Separate buffer for scalar results in RooFitDriver
guitargeek Jun 29, 2021
443ff46
[RF] Evaluate nodes that don't depend on observables in scalar mode
guitargeek Jul 2, 2021
8ab9540
[RF] Compute with CUDA in parallel with CPU step 1/2
manolismih Aug 3, 2021
9bd6a91
[RF] Compute with CUDA in parallel with CPU step 2/2
manolismih Aug 18, 2021
4a13dbe
[RF] Correcty synchronize integral objects in RooFitDriver
guitargeek Jul 15, 2021
35a39d3
[RF] Bugfixes in RooFitDriver::markGPUNodes()
manolismih Sep 6, 2021
64151a3
[RF] Implement fallback mode for RooFit batch mode
guitargeek Jul 14, 2021
a7f1b36
[RF] Skip constraint pdfs in RooProdPdf::computeBatch
guitargeek Jul 2, 2021
51e5697
[RF] Make dispatch pointer a function parameter
manolismih Sep 7, 2021
d9107cd
[RF] Introduce BatchMode options 'rbc::Cpu', 'rbc::Cuda' and 'rbc::Off'
manolismih Sep 14, 2021
cfdc3ae
[RF] Update RooBatchCompute's README.md
manolismih Sep 14, 2021
4af1ced
[RF] Write docs for RooBatchCompute, RooFitDriver, and RooNLLVarNew
manolismih Sep 15, 2021
9cc525e
[RF] Account for every client and server in RooFitDriver
manolismih Oct 5, 2021
50b311d
[RF] Avoid clone of constrained term when using bach mode
lmoneta Oct 6, 2021
6d758aa
[RF] getValues() from RooAbsReals using the RooFitDriver
manolismih Oct 7, 2021
dc87a57
[RF] Introduce threshold when assigning slow nodes to GPU/CPU
manolismih Oct 12, 2021
e1ab666
[RF] Use `std::map` instead of `std::unordered_map` in RooBatchCompute
guitargeek Aug 30, 2021
8a4b982
[RF] Implement (multi-)range fits in new batch mode
guitargeek Oct 20, 2021
b55296a
[RF] Bugfix (CUDA computations always run in default stream)
manolismih Oct 21, 2021
0594c56
[RF] Update the string that `testRooAbsPdf` expects from message logger
guitargeek Nov 19, 2021
72931f0
[RF] Add support for non-scalar integrals in new RooFit batch mode
guitargeek Oct 20, 2021
33f919e
[RF] Add support for RooAbsCachedPdf in new RooFit batch mode
guitargeek Oct 25, 2021
604c52c
[RF] Simplify RooFitDriver constructor
guitargeek Oct 27, 2021
7ac79f5
[RF] RooFitDriver: also add servers of observables to computation queue
guitargeek Oct 27, 2021
9331c0f
[RF] New `RooAbsData::getCategoryBatches()` for category data access
guitargeek Oct 28, 2021
6270a67
[RF] Consider all RooAbsArgs in RooFitDriver, including now categories
guitargeek Oct 28, 2021
45f7684
[RF] Log loaded computation libraries only in RooFitDriver
guitargeek Oct 29, 2021
4d4cdf3
[RF] Remove `RooPolynomial` batch code (can't handle vectors for coefs)
guitargeek Nov 1, 2021
8d4356e
[RF] Implement `RooRealSumFunc/Pdf::computeBatch`
guitargeek Nov 1, 2021
9fe5ee3
[RF] RooFitDriver: `AbsBuffer` helpers to factor out buffering logic
guitargeek Nov 2, 2021
23c72ee
[RF] Adapt RooFit driver to take arbitrary reducer nodes as top node
guitargeek Nov 2, 2021
dec4630
[RF] Add constraint term in new batch mode with `RooAddition`
guitargeek Nov 9, 2021
16a0171
[RF] Split up data in RooFitDriver if there is a RooSimultaneous
guitargeek Nov 12, 2021
46530e6
[RF] RooSimultaneous support in new batchmode with RooFitDriver
guitargeek Nov 12, 2021
af5ea3b
[RF] Support for `BinIntegration()` in new batch mode
guitargeek Nov 19, 2021
e42ec1f
[RF] Avoid premature deletion of nodeInfo and buffers for integrals
guitargeek Nov 19, 2021
399b8b9
[RF] Remove assertion for same layout in `RooDataHist::calcTreeIndex`
guitargeek Nov 19, 2021
c64cf6c
[RF] Use Kahan summation in `RooNLLVarNew`
guitargeek Nov 17, 2021
4b8b37f
[RF] Add support for NaN-packing in new batch mode
guitargeek Nov 19, 2021
4b7b14d
[RF] Add basic unit test for RooFitDriver (`testRooFitDriver`)
guitargeek Dec 8, 2021
93a3aba
[RF] Disable `DISABLED_IntegrateBins_SubRange` in testTestStatistics
guitargeek Dec 9, 2021
5 changes: 5 additions & 0 deletions cmake/modules/RootConfiguration.cmake
@@ -378,6 +378,11 @@ if(CMAKE_USE_PTHREADS_INIT)
else()
set(haspthread undef)
endif()
if(cuda)
set(hascuda define)
else()
set(hascuda undef)
endif()
if(x11)
set(hasxft define)
else()
1 change: 1 addition & 0 deletions config/RConfigure.in
@@ -25,6 +25,7 @@
#@setresuid@ R__HAS_SETRESUID /**/
#@hasmathmore@ R__HAS_MATHMORE /**/
#@haspthread@ R__HAS_PTHREAD /**/
#@hascuda@ R__HAS_CUDA /**/
#@hasxft@ R__HAS_XFT /**/
#@hascocoa@ R__HAS_COCOA /**/
#@hasvc@ R__HAS_VC /**/
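
For context, the `#@hascuda@ R__HAS_CUDA` entry above is a template line. Assuming it expands to `#define R__HAS_CUDA` in the generated `RConfigure.h` when ROOT is configured with `cuda` enabled (and to `#undef` otherwise), C++ code could branch on the macro roughly as sketched below. The function name `cudaBuildStatus` is hypothetical, and the header is deliberately not included so that the snippet compiles on its own.

```cpp
#include <string>

// Hypothetical helper, not part of ROOT: reports whether this translation
// unit was compiled with R__HAS_CUDA defined.
// Assumption: in a real ROOT build the macro would come from RConfigure.h.
std::string cudaBuildStatus()
{
#ifdef R__HAS_CUDA
   return "cuda";
#else
   return "cpu-only";
#endif
}
```

Compiled standalone, without the ROOT header, the function takes the `#else` branch.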
19 changes: 13 additions & 6 deletions roofit/batchcompute/CMakeLists.txt
@@ -1,6 +1,7 @@
# Library which powers fast batch computations in Roofit.
ROOT_LINKER_LIBRARY(RooBatchCompute
src/Initialisation.cxx
src/RooMath.cxx
src/RunContext.cxx
DEPENDENCIES
Core
@@ -24,32 +25,38 @@ endif()
# Instantiations of the shared objects which provide the actual computation functions.

# Generic implementation for CPUs that don't support vector instruction sets.
ROOT_LINKER_LIBRARY(RooBatchCompute_GENERIC src/RooBatchCompute.cxx TYPE SHARED DEPENDENCIES RooFitCore RooBatchCompute)
ROOT_LINKER_LIBRARY(RooBatchCompute_GENERIC src/RooBatchCompute.cxx src/ComputeFunctions.cxx TYPE SHARED DEPENDENCIES RooBatchCompute)
target_compile_options(RooBatchCompute_GENERIC PRIVATE ${common-flags} -DRF_ARCH=GENERIC)

# Windows platform and ICC compiler need special code and testing, thus the feature has not been implemented yet for these.
if (ROOT_PLATFORM MATCHES "linux|macosx" AND CMAKE_SYSTEM_PROCESSOR MATCHES x86_64 AND CMAKE_CXX_COMPILER_ID MATCHES "GNU|Clang")

target_compile_options(RooBatchCompute PRIVATE -DR__RF_ARCHITECTURE_SPECIFIC_LIBS)

ROOT_LINKER_LIBRARY(RooBatchCompute_SSE4.1 src/RooBatchCompute.cxx TYPE SHARED DEPENDENCIES RooFitCore RooBatchCompute)
ROOT_LINKER_LIBRARY(RooBatchCompute_AVX src/RooBatchCompute.cxx TYPE SHARED DEPENDENCIES RooFitCore RooBatchCompute)
ROOT_LINKER_LIBRARY(RooBatchCompute_AVX2 src/RooBatchCompute.cxx TYPE SHARED DEPENDENCIES RooFitCore RooBatchCompute)
ROOT_LINKER_LIBRARY(RooBatchCompute_SSE4.1 src/RooBatchCompute.cxx src/ComputeFunctions.cxx TYPE SHARED DEPENDENCIES RooBatchCompute)
ROOT_LINKER_LIBRARY(RooBatchCompute_AVX src/RooBatchCompute.cxx src/ComputeFunctions.cxx TYPE SHARED DEPENDENCIES RooBatchCompute)
ROOT_LINKER_LIBRARY(RooBatchCompute_AVX2 src/RooBatchCompute.cxx src/ComputeFunctions.cxx TYPE SHARED DEPENDENCIES RooBatchCompute)

# Flags -fno-signaling-nans, -fno-trapping-math and -O3 are necessary to enable autovectorization (especially for GCC).
set(common-flags $<$<CXX_COMPILER_ID:GNU>:-fno-signaling-nans>)
list(APPEND common-flags $<$<OR:$<CONFIG:Release>,$<CONFIG:RelWithDebInfo>>: -fno-trapping-math -O3>)

target_compile_options(RooBatchCompute_SSE4.1 PRIVATE ${common-flags} -msse4 -DRF_ARCH=SSE4)
target_compile_options(RooBatchCompute_AVX PRIVATE ${common-flags} -mavx -DRF_ARCH=AVX )
target_compile_options(RooBatchCompute_AVX PRIVATE ${common-flags} -mavx -DRF_ARCH=AVX)
target_compile_options(RooBatchCompute_AVX2 PRIVATE ${common-flags} -mavx2 -DRF_ARCH=AVX2)

# AVX512 is only supported in gcc 6+
# We focus on AVX512 capable processors that support at least the skylake-avx512 instruction sets.
if(NOT (CMAKE_CXX_COMPILER_ID STREQUAL "GNU") OR CMAKE_CXX_COMPILER_VERSION VERSION_GREATER_EQUAL 6)
ROOT_LINKER_LIBRARY(RooBatchCompute_AVX512 src/RooBatchCompute.cxx TYPE SHARED DEPENDENCIES RooFitCore RooBatchCompute)
ROOT_LINKER_LIBRARY(RooBatchCompute_AVX512 src/RooBatchCompute.cxx src/ComputeFunctions.cxx TYPE SHARED DEPENDENCIES RooBatchCompute)
target_compile_options(RooBatchCompute_AVX512 PRIVATE ${common-flags} -march=skylake-avx512 -DRF_ARCH=AVX512)
endif()

endif() # vector versions of library

if (cuda)
ROOT_LINKER_LIBRARY(RooBatchCompute_CUDA src/RooBatchCompute.cu src/ComputeFunctions.cu TYPE SHARED DEPENDENCIES RooBatchCompute)
target_compile_options(RooBatchCompute_CUDA PRIVATE -DRF_ARCH=CUDA -lineinfo --expt-relaxed-constexpr)
endif()

ROOT_INSTALL_HEADERS()
77 changes: 34 additions & 43 deletions roofit/batchcompute/README.md
@@ -5,63 +5,54 @@ _Contains optimized computation functions for PDFs that enable significantly fas
### Purpose
While fitting, a significant amount of time and processing power is spent on computing the probability function for every event and PDF involved in the fitting model. To speed up this process, RooFit can use the computation functions provided in this library. These functions process whole data arrays (batches) instead of a single event at a time, as the legacy `evaluate()` function in RooFit does. In addition, the code is written in a manner that allows for compiler optimizations, notably auto-vectorization. This library is compiled multiple times for different [vector instruction set architectures](https://en.wikipedia.org/wiki/SIMD), and an automatic hardware detection mechanism in the library selects the optimal version at runtime. **As a result, fits can benefit from a speedup of 3x-16x.**

As of ROOT v6.26, RooBatchCompute also provides multithreaded and [CUDA](https://en.wikipedia.org/wiki/CUDA) instances of the computation functions, resulting in even greater improvements in fitting times.
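
The runtime selection among the per-architecture builds can be illustrated with a minimal sketch. It is not the library's actual detection code: the helper name `pickBatchComputeLibrary` and the priority order are assumptions, the library names are taken from the CMake targets in this PR, and the check relies on the GCC/Clang builtin `__builtin_cpu_supports` (x86 only), falling back to the generic name elsewhere.

```cpp
#include <string>

// Hypothetical helper, not part of RooBatchCompute: returns the name of the
// most capable library variant that the running CPU appears to support.
// Simplification: "avx512f" is only one of the features that a
// skylake-avx512 build would need.
std::string pickBatchComputeLibrary()
{
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
   __builtin_cpu_init(); // initialise the CPU feature data queried below
   if (__builtin_cpu_supports("avx512f")) return "RooBatchCompute_AVX512";
   if (__builtin_cpu_supports("avx2")) return "RooBatchCompute_AVX2";
   if (__builtin_cpu_supports("avx")) return "RooBatchCompute_AVX";
   if (__builtin_cpu_supports("sse4.1")) return "RooBatchCompute_SSE4.1";
#endif
   // Fallback for CPUs or compilers without the checks above.
   return "RooBatchCompute_GENERIC";
}
```

Checking from the widest instruction set downwards means the first match is the most capable variant, and the generic build remains the unconditional fallback.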

### How to use
This library is an internal component of RooFit, so users are not expected to interact with it directly. Instead, they can benefit from significantly faster fits by calling `fitTo()` and providing a `BatchMode(RooBatchCompute::Cpu)` or a `BatchMode(RooBatchCompute::Cuda)` option.
```c++
// fit using the most efficient library that the computer's cpu can support
RooMyPDF.fitTo(data, BatchMode("cpu"));

// fit using the cuda library along with the most efficient library that the computer's cpu can support
RooMyPDF.fitTo(data, BatchMode("cuda"));
```
**Note: If the system does not support vector instructions, the `RooBatchCompute::Cpu` option is still guaranteed to work properly by using a generic CPU library. In contrast, users must first make sure that their system supports CUDA in order to use the `RooBatchCompute::Cuda` option. If this is not the case, an exception will be thrown.**

If `RooBatchCompute::Cuda` is selected, RooFit will launch CUDA kernels for computing probabilities and potentially other intensive computations. At the same time, the most efficient CPU library loaded will handle parts of the computations in parallel with the GPU (or, if that is faster, all of them), thus taking full advantage of the available hardware. For this purpose, `RooFitDriver`, a newly created RooFit class (in roofitcore), takes over the task of analyzing the computations and assigning each to the correct piece of hardware, taking into consideration the performance boost or penalty that may arise with every method of computing.

#### Multithread computations
The CPU instance of the computing library can furthermore execute multithreaded computations. This also applies to computations handled by the CPU in the `RooBatchCompute::Cuda` mode. To use them, set the desired number of parallel tasks before calling `fitTo()`, as shown below:
```c++
ROOT::EnableImplicitMT(nThreads);
RooMyPDF.fitTo(data, BatchMode(RooBatchCompute::Cpu)); // can also use RooBatchCompute::Cuda
```
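
To give an intuition for what running a batch as several parallel tasks involves, the sketch below splits `nEvents` events into contiguous, near-equal ranges, one per task. This is only an illustration under stated assumptions: the helper `taskRange` is hypothetical, and the README does not describe the partitioning scheme that RooBatchCompute uses internally.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical helper, not part of RooBatchCompute: returns the half-open
// range [begin, end) of events that task number `task` (0-based) would
// process when nEvents events are divided among nTasks tasks.
// The first (nEvents % nTasks) tasks each receive one extra event.
std::pair<std::size_t, std::size_t> taskRange(std::size_t nEvents, std::size_t nTasks, std::size_t task)
{
   const std::size_t base = nEvents / nTasks;
   const std::size_t rest = nEvents % nTasks;
   const std::size_t begin = task * base + (task < rest ? task : rest);
   const std::size_t end = begin + base + (task < rest ? 1 : 0);
   return {begin, end};
}
```

Because the ranges do not overlap, each task can write to its own slice of the output array without synchronization.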

### User-made PDFs
The easiest and most efficient way of accelerating your PDFs is to request their addition to the official RooFit by submitting a ticket [here](https://github.com/root-project/root/issues/new). The ROOT team will gladly assist you and take care of the details.

The above process might take some time and the users will be required to update ROOT to use the newly introduced PDFs. In the meantime, you are able to significantly improve the speed of fitting (but not take full advantage of the RooBatchCompute library), at least by using the batch evaluation feature.
To make use of it, one should override [`RooAbsReal::evaluateSpan()`](https://root.cern.ch/doc/master/classRooAbsReal.html#a1e5129ffbc63bfd04c01511fd354b1b8)
While your code is integrated, you are able to significantly improve the speed of fitting (but not take full advantage of the RooBatchCompute library), at least by using the batch evaluation feature.
To make use of it, one should override `RooAbsReal::computeBatch()`
```c++
RooSpan<double> RooMyPDF::evaluateSpan(RooBatchCompute::RunContext& evalData, const RooArgSet* normSet) const
void RooMyPDF::computeBatch(RooBatchCompute::RooBatchComputeInterface*, double* output, size_t nEvents, RooBatchCompute::DataMap& dataMap) const
```
The evalData is a simple struct that holds the vector data for the fitting in the form of `RooSpan<double>`.
The normSet (normalization set) is used for invoking the computation and retrieving the values of the variables of the PDF.
You don't need to worry about these arguments as they will be provided by the RooFit internal functions that will call `evaluateSpan()`.
This method must be implemented so that it fills the `output` array with the **normalized** probabilities computed for `nEvents` events, the data of which can be retrieved from `dataMap`. `dataMap` is a simple `std::map<RooRealVar*, RooSpan<const double>>`. Note that it is not necessary to evaluate any of the objects that the PDF relies on, because they have already been evaluated by the RooFitDriver, so their updated results are always present in `dataMap`. The `RooBatchCompute::RooBatchComputeInterface` pointer should be ignored.

```c++
RooSpan<double> RooMyPDF::evaluateSpan(RooBatchCompute::RunContext& evalData, const RooArgSet* normSet) const
void RooMyPDF::computeBatch(RooBatchCompute::RooBatchComputeInterface*, double* output, size_t nEvents, RooBatchCompute::DataMap& dataMap) const
{
// Retrieve `RooSpan`s for each parameter of the PDF
RooSpan<const double> span1 = var1->getValues(evalData, normSet);
// or: auto span1 = var1->getValues(evalData, normSet);
RooSpan<const double> span2 = var2->getValues(evalData, normSet);
RooSpan<const double> span1 = dataMap.at(&*proxyVar1);
// or: auto span1 = dataMap.at(&*proxyVar1);
RooSpan<const double> span2 = dataMap.at(&*proxyVar2);

// let's assume c is a scalar parameter of the PDF. In this case getValues will return a RooSpan with only one value.
RooSpan<const double> scalar = c->getValues(evalData, normset);

// Get the number of nEvents
size_t nEvents=0;
for (auto& i:{xData,meanData,sigmaData})
nEvents = std::max(nEvents,i.size());

// Allocate the output array
evalData.makeBatch(this, nEvents);
// let's assume c is a scalar parameter of the PDF. In this case the dataMap contains a RooSpan with only one value.
RooSpan<const double> scalar = dataMap.at(&*c);

// Perform computations in a for-loop
// Use VDT if possible to facilitate auto-vectorization
for (size_t i=0; i<nEvents; ++i) {
output[i] = RooBatchCompute::fast_log(span1[i]+span2[i]) + scalar[0]; //scalar is a RooSpan of length 1
}
return output;
}
```
Make sure to add the `evaluateSpan()` function signature in the header `RooMyPDF.h` and mark it as `override` to ensure that you have successfully overridden the method. In case the data types (scalar or vector) for the variables cannot be predicted when writing the source code, you can use [BracketAdapterWithMask](https://github.com/root-project/root/blob/2b84398d4f52462a120083b3c5d1e0b952cc5221/roofit/batchcompute/inc/BracketAdapter.h#L55). This class overloads the `operator[]` and is constructed from a RooSpan. In case the RooSpan used for construction has a length of 1, i.e. represents a scalar variable, then `BracketAdapterWithMask::operator[]` always returns the scalar value, regardless of the index used. This allows us to write:

```c++
RooSpan<double> RooMyPDF::evaluateSpan(RooBatchCompute::RunContext& evalData, const RooArgSet* normSet) const
{
// Construct BracketAdapterWithMasks for each variable if we're not sure whether they are scalar of vectors.
BracketAdapterWithMask adapter1(var1->getValues(evalData, normSet));
BracketAdapterWithMask adapter2(var2->getValues(evalData, normSet));
BracketAdapterWithMask scalar(c->getValues(evalData, normSet));

// prepare the computations as above
...

// by calling adapter[i] we either get the i-th or 0-th element, if the variable is a vector or a scalar respectively.
for (size_t i=0; i<nEvents; ++i) {
output[i] = RooBatchCompute::fast_log(adapter1[i]+adapter2[i]) + scalar[i];
}
return output;
}
```

As a final note, always remember to prepend `RooBatchCompute::` to the classes defined in the RooBatchCompute library, or write `using namespace RooBatchCompute`.
Make sure to add the `computeBatch()` function signature in the header `RooMyPDF.h` and mark it as `override` to ensure that you have successfully overridden the method. As a final note, always remember to prepend `RooBatchCompute::` to the classes defined in the RooBatchCompute library, or write `using namespace RooBatchCompute`.
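
The contract of `computeBatch()` (read the input arrays from the data map, fill `output` for `nEvents` events, treat length-1 inputs as scalars) can be tried out without ROOT using the standalone sketch below. Everything in it is a stand-in chosen for illustration: `MockDataMap` replaces `RooBatchCompute::DataMap` and is keyed by name rather than by pointer, `std::vector<double>` replaces `RooSpan<const double>`, `std::log` replaces `RooBatchCompute::fast_log`, and `computeBatchSketch` is a hypothetical free function rather than a member override.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for RooBatchCompute::DataMap: variable name -> values.
using MockDataMap = std::map<std::string, std::vector<double>>;

// Fills output[i] = log(x[i] + y[i]) + c for nEvents events.
// "c" holds a single value and is read at index 0, like a span of length 1.
void computeBatchSketch(double *output, std::size_t nEvents, const MockDataMap &dataMap)
{
   const std::vector<double> &x = dataMap.at("x");
   const std::vector<double> &y = dataMap.at("y");
   const double c = dataMap.at("c")[0];

   // A plain loop over contiguous arrays, which compilers can auto-vectorize.
   for (std::size_t i = 0; i < nEvents; ++i) {
      output[i] = std::log(x[i] + y[i]) + c;
   }
}
```

As in the README example, the function returns nothing: the caller owns the `output` buffer and the function only fills it.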