
Summary

Thrust 1.7.0 introduces a new interface for controlling algorithm execution as well as several new algorithms and performance improvements. With this new interface, users may directly control how algorithms execute as well as details such as the allocation of temporary storage. Key/value versions of thrust::merge and the set operation algorithms have been added, as well as stencil versions of the partitioning algorithms. thrust::tabulate has been introduced to tabulate the values of functions taking integers. For 32-bit types, new CUDA merge and set operations provide 2-15x faster performance while a new CUDA comparison sort provides 1.3-4x faster performance. Finally, a new TBB reduce_by_key implementation provides 80% faster performance.

Breaking API Changes

  • Dispatch
    • Custom user backend systems' tag types must now inherit from the corresponding system's execution_policy template (e.g. thrust::cuda::execution_policy) instead of the tag struct (e.g. thrust::cuda::tag). Otherwise, algorithm specializations will silently go unfound during dispatch.

      See examples/minimal_custom_backend.cu and examples/cuda/fallback_allocator.cu for usage examples.

    • thrust::advance and thrust::distance are no longer dispatched based on iterator system type and thus may no longer be customized.

  • Iterators
    • iterator_facade and iterator_adaptor's Pointer template parameters have been eliminated.
    • iterator_adaptor has been moved into the thrust namespace (previously thrust::experimental::iterator_adaptor).
    • iterator_facade has been moved into the thrust namespace (previously thrust::experimental::iterator_facade).
    • iterator_core_access has been moved into the thrust namespace (previously thrust::experimental::iterator_core_access).
    • All iterators' nested pointer typedef (the type of the result of operator->) is now void instead of a pointer type to indicate that such expressions are currently impossible.
    • Floating point counting_iterators' nested difference_type typedef is now a signed integral type instead of a floating point type.
  • Other
    • normal_distribution has been moved into the thrust::random namespace (previously thrust::random::experimental::normal_distribution).
    • Placeholder expressions may no longer include the comma operator.

New Features

  • Execution Policies
    • Users may directly control the dispatch of algorithm invocations with optional execution policy arguments.

      For example, instead of wrapping raw pointers allocated by cudaMalloc with thrust::device_ptr, the thrust::device execution_policy may be passed as an argument to an algorithm invocation to enable CUDA execution.

      The following execution policies are supported in this version:

      • thrust::host
      • thrust::device
      • thrust::cpp::par
      • thrust::cuda::par
      • thrust::omp::par
      • thrust::tbb::par
  • Algorithms
    • free
    • get_temporary_buffer
    • malloc
    • merge_by_key
    • partition with stencil
    • partition_copy with stencil
    • return_temporary_buffer
    • set_difference_by_key
    • set_intersection_by_key
    • set_symmetric_difference_by_key
    • set_union_by_key
    • stable_partition with stencil
    • stable_partition_copy with stencil
    • tabulate

New Examples

  • uninitialized_vector demonstrates how to use a custom allocator to avoid the automatic initialization of elements in thrust::device_vector.

Other Enhancements

  • Authors of custom backend systems may manipulate arbitrary state during algorithm dispatch by incorporating it into their execution_policy parameter.
  • Users may control the allocation of temporary storage during algorithm execution by passing standard allocators as parameters via execution policies such as thrust::device.
  • THRUST_DEVICE_SYSTEM_CPP has been added as a compile-time target for the device backend.
  • CUDA merge performance is 2-15x faster.
  • CUDA comparison sort performance is 1.3-4x faster.
  • CUDA set operation performance is 1.5-15x faster.
  • TBB reduce_by_key performance is 80% faster.
  • Several algorithms have been parallelized with TBB.
  • Support for user allocators in vectors has been improved.
  • The sparse_vector example is now implemented with merge_by_key instead of sort_by_key.
  • Warnings have been eliminated in various contexts.
  • Warnings about __host__ or __device__-only functions called from __host__ __device__ functions have been eliminated in various contexts.
  • Documentation about algorithm requirements has been improved.
  • The minimal_custom_backend example has been simplified.
  • The cuda/custom_temporary_allocation example has been simplified.
  • The cuda/fallback_allocator example has been simplified.

Bug Fixes

  • #248 fix broken counting_iterator<float> behavior with OpenMP
  • #231, #209 fix set operation failures with CUDA
  • #187 fix incorrect occupancy calculation with CUDA
  • #153 fix broken multigpu behavior with CUDA
  • #142 eliminate warning produced by thrust::random::taus88 and MSVC 2010
  • #208 correctly initialize elements in temporary storage when necessary
  • #16 fix compilation error when sorting bool with CUDA
  • #10 fix ambiguous overloads of reinterpret_tag

Known Issues

  • g++ versions 4.3 and lower may fail to dispatch thrust::get_temporary_buffer correctly, causing infinite recursion in examples such as cuda/custom_temporary_allocation.

Acknowledgments

  • Thanks to Sean Baxter, Bryan Catanzaro, and Manjunath Kudlur for contributing a faster merge implementation for CUDA.
  • Thanks to Sean Baxter for contributing a faster set operation implementation for CUDA.
  • Thanks to Cliff Woolley for contributing a correct occupancy calculation algorithm.