Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well as stencil versions of partitioning algorithms. thrust::tabulate has been introduced to tabulate the values of functions taking integers. For 32-bit types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation provides 80% faster performance.
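To make "tabulate the values of functions taking integers" concrete, here is a minimal sketch of those semantics in standard C++. The names tabulate_sketch and first_squares are hypothetical and exist only for illustration. This is not Thrust's implementation; it assumes only that element i of the range receives f(i), with i counted from zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in written in standard C++ (not Thrust code):
// assigns f(i) to the i-th element of [first, last), with i counted from 0.
template <typename ForwardIt, typename UnaryOp>
void tabulate_sketch(ForwardIt first, ForwardIt last, UnaryOp f)
{
    std::ptrdiff_t i = 0;
    for (; first != last; ++first, ++i)
    {
        *first = f(i);
    }
}

// Example use: fill a vector with the squares of its indices.
inline std::vector<int> first_squares(std::size_t n)
{
    std::vector<int> v(n);
    tabulate_sketch(v.begin(), v.end(),
                    [](std::ptrdiff_t i) { return static_cast<int>(i * i); });
    return v;
}
```

With Thrust itself, the corresponding call would presumably take the same shape, a range followed by a unary function: thrust::tabulate(v.begin(), v.end(), f).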
Breaking API Changes
- Custom user backend systems' tag types must now inherit from the corresponding system's execution policy (e.g. thrust::cuda::execution_policy) instead of the tag (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch.
- thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.
- Pointer template parameters have been eliminated.
- iterator_adaptor has been moved into the
- iterator_facade has been moved into the
- iterator_core_access has been moved into the
- All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
- typedef is now a signed integral type instead of a floating point type.
- normal_distribution has been moved into the
- Placeholder expressions may no longer include the comma operator.
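The first breaking change above, about which base class a custom tag inherits from, can be pictured with a small model of tag dispatch in standard C++. Everything here is hypothetical: the namespace sys, the functions algo and algo_impl, and the types good_backend and bad_backend are illustrative stand-ins, not Thrust's actual dispatch code. The sketch assumes the entry point recovers the most-derived policy type through the template parameter of the policy base class, which is one way a customization can go unfound without any compiler diagnostic.

```cpp
#include <cassert>
#include <string>

namespace sys  // hypothetical backend, standing in for something like thrust::cuda
{
    // Policy base template: algorithms are overloaded on execution_policy<Derived>.
    template <typename Derived>
    struct execution_policy {};

    // The system's own tag is one particular policy.
    struct tag : execution_policy<tag> {};

    // The system's default implementation of an algorithm.
    template <typename Derived>
    std::string algo_impl(execution_policy<Derived>&) { return "system default"; }
}

// Entry point: casts to the type named by the policy base, then dispatches
// through argument-dependent lookup.
template <typename Derived>
std::string algo(sys::execution_policy<Derived>& exec)
{
    return algo_impl(static_cast<Derived&>(exec));
}

// Inherits from the policy template: Derived is deduced as good_backend,
// so the overload below is found.
struct good_backend : sys::execution_policy<good_backend> {};
std::string algo_impl(good_backend&) { return "user customization"; }

// Inherits from the tag: Derived is deduced as sys::tag, so the overload
// below is never considered and the system default runs instead.
struct bad_backend : sys::tag {};
std::string algo_impl(bad_backend&) { return "user customization"; }
```

In this model, algo called on a good_backend object reaches the user's overload, while algo called on a bad_backend object compiles cleanly and falls through to the system default.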
- Execution Policies
Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.
For example, instead of wrapping raw pointers, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution.
The following execution policies are supported in this version:
- uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in
- Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their
- Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as
- THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend.
- CUDA merge performance is 2-15x faster.
- CUDA comparison sort performance is 1.3-4x faster.
- CUDA set operation performance is 1.5-15x faster.
- TBB reduce_by_key performance is 80% faster.
- Several algorithms have been parallelized with TBB.
- Support for user allocators in vectors has been improved.
- The sparse_vector example is now implemented with merge_by_key instead of sort_by_key.
- Warnings have been eliminated in various contexts.
- Warnings about __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts.
- Documentation about algorithm requirements has been improved.
- Simplified the minimal_custom_backend example.
- Simplified the cuda/custom_temporary_allocation example.
- Simplified the cuda/fallback_allocator example.
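The item above about controlling temporary storage by passing allocators through execution policies can be sketched in standard C++. This is a conceptual model only: counting_allocator, policy_with_allocator, with_allocator, and sort_sketch are hypothetical names, and the sketch does not reproduce Thrust's syntax or internals. It assumes only that a policy object can carry an allocator and that an algorithm draws its scratch storage from it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <memory>
#include <vector>

// Hypothetical allocator that records how many elements were requested.
template <typename T>
struct counting_allocator
{
    using value_type = T;
    std::size_t* requested;  // counter owned by the caller

    explicit counting_allocator(std::size_t* counter) : requested(counter) {}

    template <typename U>
    counting_allocator(const counting_allocator<U>& other) : requested(other.requested) {}

    T* allocate(std::size_t n)
    {
        *requested += n;
        return std::allocator<T>().allocate(n);
    }

    void deallocate(T* p, std::size_t n) { std::allocator<T>().deallocate(p, n); }
};

template <typename T, typename U>
bool operator==(const counting_allocator<T>& a, const counting_allocator<U>& b)
{
    return a.requested == b.requested;
}

template <typename T, typename U>
bool operator!=(const counting_allocator<T>& a, const counting_allocator<U>& b)
{
    return !(a == b);
}

// Hypothetical policy object that carries the user's allocator.
template <typename Alloc>
struct policy_with_allocator
{
    Alloc alloc;
};

template <typename Alloc>
policy_with_allocator<Alloc> with_allocator(Alloc alloc)
{
    return policy_with_allocator<Alloc>{alloc};
}

// An algorithm that needs scratch space and obtains it through the policy.
template <typename Alloc, typename RandomIt>
void sort_sketch(const policy_with_allocator<Alloc>& exec, RandomIt first, RandomIt last)
{
    using T = typename std::iterator_traits<RandomIt>::value_type;
    using TAlloc = typename std::allocator_traits<Alloc>::template rebind_alloc<T>;

    // Temporary storage comes from the allocator carried by the policy.
    std::vector<T, TAlloc> scratch(first, last, TAlloc(exec.alloc));
    std::sort(scratch.begin(), scratch.end());
    std::copy(scratch.begin(), scratch.end(), first);
}
```

The design point is that the algorithm never names a concrete allocation function. Swapping the allocator in the policy is enough to redirect every temporary allocation, for example to a pool or a cache.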
- #248 fix broken counting_iterator<float> behavior with OpenMP
- #231, #209 fix set operation failures with CUDA
- #187 fix incorrect occupancy calculation with CUDA
- #153 fix broken multigpu behavior with CUDA
- #142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
- #208 correctly initialize elements in temporary storage when necessary
- #16 fix compilation error when sorting bool with CUDA
- #10 fix ambiguous overloads of
g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly, causing infinite recursion in examples such as cuda/custom_temporary_allocation.
- Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
- Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
- Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.