<a href="https://colab.research.google.com/github/trefftzc/cis677/blob/main/Thrust_algorithms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Thrust's algorithms

Based on
https://nvidia.github.io/cccl/thrust/api_docs/algorithms.html

Nine groups of algorithms:

1. Copying
2. Merging
3. Prefix sums
4. Reductions
5. Reordering
6. Searching
7. Set Operations
8. Sorting
9. Transformations

## 1. Copying

a. gather

b. scatter

c. swap_ranges

d. copy

e. copy_n

f. uninitialized_copy



1.a. gather

gather copies elements from a source array into a destination range according to a map. For each input iterator i in the range [map_first, map_last), the value input_first[*i] is assigned to *(result + (i - map_first)). The source iterator input_first must permit random access.

In [27]:
%%writefile gather.cu
#include <thrust/gather.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

int main() {
  // mark even indices with a 1; odd indices with a 0
  int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
  thrust::device_vector<int> d_values(values, values + 10);

  // gather all even indices into the first half of the range
  // and odd indices to the last half of the range
  int map[10]   = {0, 2, 4, 6, 8, 1, 3, 5, 7, 9};
  thrust::device_vector<int> d_map(map, map + 10);

  thrust::device_vector<int> d_output(10);
  thrust::gather(d_map.begin(), d_map.end(),
               d_values.begin(),
               d_output.begin());
  // d_output is now {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
  thrust::host_vector<int> h_output(10);
  thrust::copy(d_output.begin(), d_output.end(), h_output.begin());
  for(int value : h_output) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  return 0;
}

Writing gather.cu


In [28]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub gather.cu -o gather -arch sm_75

In [29]:
!./gather

1 1 1 1 1 0 0 0 0 0 
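The assignment rule for gather can be written out as a serial loop. The following is a minimal host-only sketch in standard C++, not Thrust's implementation; the name `gather_ref` is hypothetical and is used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for the gather rule: output[k] = input[map[k]].
// gather_ref is a hypothetical helper, not part of Thrust.
std::vector<int> gather_ref(const std::vector<std::size_t>& map,
                            const std::vector<int>& input) {
  std::vector<int> output(map.size());
  for (std::size_t k = 0; k < map.size(); ++k) {
    assert(map[k] < input.size());  // every map entry must index into input
    output[k] = input[map[k]];
  }
  return output;
}
```

Calling `gather_ref` with the map and values from gather.cu yields the same sequence that the device version prints.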


1.b. scatter

scatter copies elements from a source range into an output array according to a map. For each iterator i in the range [first, last), the value *i is assigned to output[*(map + (i - first))]. The output iterator must permit random access. If the same index appears more than once in the range [map, map + (last - first)), the result is undefined.

In [30]:
%%writefile scatter.cu
#include <thrust/scatter.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

int main() {
  // mark even indices with a 1; odd indices with a 0
  int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
  thrust::device_vector<int> d_values(values, values + 10);

  // scatter all even indices into the first half of the
  // range, and odd indices vice versa
  int map[10]   = {0, 5, 1, 6, 2, 7, 3, 8, 4, 9};
  thrust::device_vector<int> d_map(map, map + 10);

  thrust::device_vector<int> d_output(10);
  thrust::scatter(d_values.begin(), d_values.end(),
                d_map.begin(), d_output.begin());
  // d_output is now {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
  thrust::host_vector<int> h_output(10);
  thrust::copy(d_output.begin(), d_output.end(), h_output.begin());
  for(int value : h_output) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  return 0;
}



Writing scatter.cu


In [31]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub scatter.cu -o scatter -arch sm_75

In [32]:
!./scatter

1 1 1 1 1 0 0 0 0 0 
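The scatter rule is the mirror image of gather: the map selects the destination slot rather than the source slot. A minimal host-only sketch in standard C++ follows; it is not Thrust's implementation, and the name `scatter_ref` and the `output_size` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for the scatter rule: output[map[k]] = input[k].
// scatter_ref is a hypothetical helper, not part of Thrust.
// Output slots that no map entry names keep their initial value (0 here).
std::vector<int> scatter_ref(const std::vector<int>& input,
                             const std::vector<std::size_t>& map,
                             std::size_t output_size) {
  assert(map.size() == input.size());
  std::vector<int> output(output_size, 0);
  for (std::size_t k = 0; k < input.size(); ++k) {
    assert(map[k] < output_size);  // every map entry must index into output
    output[map[k]] = input[k];
  }
  return output;
}
```

In this serial loop a repeated map index simply means the last write wins. In a parallel scatter the writes are unordered, which is why the result is undefined when an index appears more than once.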


1.c. swap_ranges

swap_ranges swaps each of the elements in the range [first1, last1) with the corresponding element in the range [first2, first2 + (last1 - first1)). That is, for each integer n such that 0 <= n < (last1 - first1), it swaps *(first1 + n) and *(first2 + n). The return value is first2 + (last1 - first1).

In [33]:
%%writefile swap_ranges.cu
#include <thrust/swap.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

int main() {
  thrust::device_vector<int> v1(2), v2(2);
  v1[0] = 1;
  v1[1] = 2;
  v2[0] = 3;
  v2[1] = 4;

  thrust::swap_ranges(v1.begin(), v1.end(), v2.begin());
// v1[0] == 3, v1[1] == 4, v2[0] == 1, v2[1] == 2
  thrust::host_vector<int> h_v1(2);
  thrust::host_vector<int> h_v2(2);
  thrust::copy(v1.begin(), v1.end(), h_v1.begin());
  thrust::copy(v2.begin(), v2.end(), h_v2.begin());
  std::cout << "v1: ";
  for(int value : h_v1) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  std::cout << "v2: ";
  for(int value : h_v2) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  return 0;
}


Writing swap_ranges.cu


In [34]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub swap_ranges.cu -o swap_ranges -arch sm_75

In [35]:
!./swap_ranges

v1: 3 4 
v2: 1 2 


1.d. copy

copy copies elements from the range [first, last) to the range [result, result + (last - first)). That is, it performs the assignments *result = *first, *(result + 1) = *(first + 1), and so on. In general, for every integer n with 0 <= n < last - first, copy performs the assignment *(result + n) = *(first + n). Unlike std::copy, thrust::copy offers no guarantee on the order of operations. As a result, calling copy with overlapping source and destination ranges has undefined behavior.

The return value is result + (last - first).

In [36]:
%%writefile copy.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>

int main() {
  thrust::device_vector<int> vec0(10);
  thrust::device_vector<int> vec1(10);
  for(int i = 0; i < 10; ++i) {
    vec0[i] = i;
  }

  thrust::copy(vec0.begin(), vec0.end(),
             vec1.begin());

// vec1 is now a copy of vec0
  thrust::host_vector<int> h_v0(10);
  thrust::host_vector<int> h_v1(10);
  thrust::copy(vec0.begin(), vec0.end(), h_v0.begin());
  thrust::copy(vec1.begin(), vec1.end(), h_v1.begin());
  std::cout << "vec0: ";
  for(int value : h_v0) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  std::cout << "vec1: ";
  for(int value : h_v1) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  return 0;
}


Writing copy.cu


In [37]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub copy.cu -o copy -arch sm_75

In [38]:
!./copy

vec0: 0 1 2 3 4 5 6 7 8 9 
vec1: 0 1 2 3 4 5 6 7 8 9 


1.e. copy_n

copy_n copies elements from the range [first, first + n) to the range [result, result + n). That is, it performs the assignments *result = *first, *(result + 1) = *(first + 1), and so on. In general, for every integer i with 0 <= i < n, copy_n performs the assignment *(result + i) = *(first + i). Unlike std::copy_n, thrust::copy_n offers no guarantee on the order of operations. As a result, calling copy_n with overlapping source and destination ranges has undefined behavior.

The return value is result + n.

The algorithm’s execution is parallelized as determined by exec, the execution policy passed as the first argument.

The following code snippet demonstrates how to use copy_n to copy the first five elements of one device_vector to another using the thrust::device execution policy.

In [39]:
%%writefile copy_n.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <iostream>

int main() {
  thrust::device_vector<int> vec0(10);
  thrust::device_vector<int> vec1(10);
  for(int i = 0; i < 10; ++i) {
    vec0[i] = i;
  }
  int n = 5;
  thrust::copy_n(thrust::device,vec0.begin(), n,
             vec1.begin());

// vec1 now contains the first 5 elements of vec0
  thrust::host_vector<int> h_v0(10);
  thrust::host_vector<int> h_v1(10);
  thrust::copy(vec0.begin(), vec0.end(), h_v0.begin());
  thrust::copy(vec1.begin(), vec1.end(), h_v1.begin());
  std::cout << "vec0: ";
  for(int value : h_v0) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  std::cout << "vec1: ";
  for(int value : h_v1) {
    std::cout << value << " ";
  }
  std::cout << std::endl;
  return 0;
}

Writing copy_n.cu


In [40]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub copy_n.cu -o copy_n -arch sm_75

In [41]:
!./copy_n

vec0: 0 1 2 3 4 5 6 7 8 9 
vec1: 0 1 2 3 4 0 0 0 0 0 


1.f. uninitialized_copy

In Thrust, the function thrust::device_new allocates memory for an object and then creates an object at that location by calling a constructor. Occasionally, however, it is useful to separate those two operations. If each iterator in the range [result, result + (last - first)) points to uninitialized memory, then uninitialized_copy creates a copy of [first, last) in that range. That is, for each iterator i in the input, uninitialized_copy creates a copy of *i in the location pointed to by the corresponding iterator in the output range, by calling the copy constructor of ForwardIterator's value_type with *i as its argument.

The algorithm’s execution is parallelized as determined by exec.


In [42]:
%%writefile unitialized_copy.cu
#include <thrust/uninitialized_copy.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/device_malloc.h>
#include <iostream>

struct Int
{
  __host__ __device__
  Int(int x) : val(x) {}
  int val;
};

const int N = 137;


int main() {
  Int val(46);
  thrust::device_vector<Int> input(N, val);
  thrust::device_ptr<Int> array = thrust::device_malloc<Int>(N);
  thrust::uninitialized_copy(thrust::device, input.begin(), input.end(), array);

// Int x = array[i];
// x.val == 46 for all 0 <= i < N


  return 0;
}

Writing unitialized_copy.cu


In [43]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub unitialized_copy.cu -o unitialized_copy -arch sm_75

In [44]:
!./unitialized_copy

## 2. Merging

a. merge

b. merge_by_key

2.a. merge

merge combines two sorted ranges [first1, last1) and [first2, last2) into a single sorted range. That is, it copies from [first1, last1) and [first2, last2) into [result, result + (last1 - first1) + (last2 - first2)) such that the resulting range is in ascending order. merge is stable, meaning both that the relative order of elements within each input range is preserved, and that for equivalent elements in both input ranges the element from the first range precedes the element from the second. The return value is result + (last1 - first1) + (last2 - first2).

This version of merge compares elements using operator<.

In [45]:
%%writefile merge.cu
#include <thrust/merge.h>
#include <iostream>

using namespace std;

int main() {
  int A1[6] = {1, 3, 5, 7, 9, 11};
  int A2[7] = {1, 1, 2, 3, 5,  8, 13};

  int result[13];

  int *result_end = thrust::merge(A1, A1 + 6, A2, A2 + 7, result);
  // result = {1, 1, 1, 2, 3, 3, 5, 5, 7, 8, 9, 11, 13}

  for(int i = 0; i < result_end - result; i++) {
    cout << result[i] << " ";
  }
  cout << endl;

}

Writing merge.cu


In [46]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub merge.cu -o merge -arch sm_75

In [47]:
!./merge

1 1 1 2 3 3 5 5 7 8 9 11 13 
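The stability guarantee is easier to see when each merged element carries a tag naming the range it came from. The following host-only sketch is a serial two-pointer merge in standard C++, not Thrust's implementation; the name `merge_ref` and the tagging scheme are hypothetical.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Serial two-pointer merge; merge_ref is a hypothetical helper.
// Each output element is (value, source), where source is 1 or 2,
// so that the tie-breaking rule is visible in the result.
std::vector<std::pair<int, int>> merge_ref(const std::vector<int>& a,
                                           const std::vector<int>& b) {
  std::vector<std::pair<int, int>> out;
  out.reserve(a.size() + b.size());
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    // Take from b only when it is strictly smaller, so ties go to a.
    if (b[j] < a[i]) {
      out.emplace_back(b[j++], 2);
    } else {
      out.emplace_back(a[i++], 1);
    }
  }
  while (i < a.size()) out.emplace_back(a[i++], 1);  // leftover from a
  while (j < b.size()) out.emplace_back(b[j++], 2);  // leftover from b
  return out;
}
```

With the arrays from merge.cu, the three leading 1s come out tagged 1, 2, 2: the 1 from the first range precedes the two equivalent 1s from the second range.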


2.b. merge_by_key

merge_by_key performs a key-value merge. That is, merge_by_key copies elements from [keys_first1, keys_last1) and [keys_first2, keys_last2) into a single range, [keys_result, keys_result + (keys_last1 - keys_first1) + (keys_last2 - keys_first2)) such that the resulting range is in ascending key order.

At the same time, merge_by_key copies elements from the two associated ranges [values_first1, values_first1 + (keys_last1 - keys_first1)) and [values_first2, values_first2 + (keys_last2 - keys_first2)) into a single range, [values_result, values_result + (keys_last1 - keys_first1) + (keys_last2 - keys_first2)) such that the resulting range is in ascending order implied by each input element’s associated key.

merge_by_key is stable, meaning both that the relative order of elements within each input range is preserved, and that for equivalent elements in all input key ranges the element from the first range precedes the element from the second.

The return value is a pair of iterators: (keys_result + (keys_last1 - keys_first1) + (keys_last2 - keys_first2)) and (values_result + (keys_last1 - keys_first1) + (keys_last2 - keys_first2)).

This version of merge_by_key compares key elements using a function object comp.

The algorithm’s execution is parallelized using exec.

In [48]:
%%writefile merge_by_key.cu
#include <thrust/merge.h>
#include <thrust/pair.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;

int main() {
  int A_keys[6] = {11, 9, 7, 5, 3, 1};
  int A_vals[6] = { 0, 0, 0, 0, 0, 0};

  int B_keys[7] = {13, 8, 5, 3, 2, 1, 1};
  int B_vals[7] = { 1, 1, 1, 1, 1, 1, 1};

  int keys_result[13];
  int vals_result[13];

  thrust::pair<int*,int*> end =
    thrust::merge_by_key(thrust::host,
                       A_keys, A_keys + 6,
                       B_keys, B_keys + 7,
                       A_vals, B_vals,
                       keys_result, vals_result,
                       ::cuda::std::greater<int>());

// keys_result = {13, 11, 9, 8, 7, 5, 5, 3, 3, 2, 1, 1, 1}
// vals_result = { 1,  0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1}

  for(int i = 0; i < end.first - keys_result; i++) {
    cout << keys_result[i] << " " << vals_result[i] << endl;
  }
  cout << endl;

}

Writing merge_by_key.cu


In [49]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub merge_by_key.cu -o merge_by_key -arch sm_75

In [50]:
!./merge_by_key

13 1
11 0
9 0
8 1
7 0
5 0
5 1
3 0
3 1
2 1
1 0
1 1
1 1
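A key-value merge moves each value together with its key, and the comparison object decides what "in order" means. The sketch below is a serial host-only version in standard C++, not Thrust's implementation; the name `merge_by_key_ref` is hypothetical, and `std::greater<int>` stands in for the `::cuda::std::greater<int>` used above.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Serial key-value merge; merge_by_key_ref is a hypothetical helper.
// comp(x, y) returns true when key x must come before key y.
template <typename Compare>
std::pair<std::vector<int>, std::vector<int>> merge_by_key_ref(
    const std::vector<int>& keys1, const std::vector<int>& vals1,
    const std::vector<int>& keys2, const std::vector<int>& vals2,
    Compare comp) {
  std::vector<int> keys_out, vals_out;
  std::size_t i = 0, j = 0;
  while (i < keys1.size() || j < keys2.size()) {
    // Take from the second range only if the first is exhausted or the
    // second key strictly precedes the first; ties go to the first range.
    bool take_second =
        i == keys1.size() || (j < keys2.size() && comp(keys2[j], keys1[i]));
    if (take_second) {
      keys_out.push_back(keys2[j]);
      vals_out.push_back(vals2[j]);
      ++j;
    } else {
      keys_out.push_back(keys1[i]);
      vals_out.push_back(vals1[i]);
      ++i;
    }
  }
  return std::make_pair(keys_out, vals_out);
}
```

Both input key ranges must already be ordered according to the same comparison object, which is why the example arrays are listed in descending order.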



## 3. Prefix Sums

a. inclusive_scan

b. exclusive_scan


3.a. inclusive_scan

inclusive_scan computes an inclusive prefix sum operation. The term ‘inclusive’ means that each result includes the corresponding input operand in the partial sum. When the input and output sequences are the same, the scan is performed in-place.

inclusive_scan is similar to std::partial_sum in the STL. The primary difference between the two functions is that std::partial_sum guarantees a serial summation order, while inclusive_scan requires associativity of the binary operation to parallelize the prefix sum.

Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

The algorithm’s execution is parallelized as determined by exec.

In [51]:
%%writefile inclusive_scan.cu
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <iostream>

using namespace std;

int main() {

  int data[10] = {-5, 0, 2, -3, 2, 4, 0, -1, 2, 8};

  thrust::maximum<int> binary_op;

  thrust::inclusive_scan(thrust::host, data, data + 10, data, binary_op); // in-place scan

  // data is now {-5, 0, 2, 2, 2, 4, 4, 4, 4, 8}

  for(int i = 0; i < 10; i++) {
    cout << data[i] << " ";
  }
  cout << endl;

  return 0;

}


Writing inclusive_scan.cu


In [52]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub inclusive_scan.cu -o inclusive_scan -arch sm_75

In [53]:
!./inclusive_scan


-5 0 2 2 2 4 4 4 4 8 
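The serial order that std::partial_sum guarantees can be written as a single loop. The following host-only sketch in standard C++ is a reference for the result an inclusive scan should produce, not Thrust's parallel implementation; the name `inclusive_scan_ref` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Serial inclusive scan; inclusive_scan_ref is a hypothetical helper.
// out[k] = op(in[0], in[1], ..., in[k]), so out[k] includes in[k].
template <typename BinaryOp>
std::vector<int> inclusive_scan_ref(const std::vector<int>& in, BinaryOp op) {
  std::vector<int> out(in.size());
  if (in.empty()) return out;
  out[0] = in[0];
  for (std::size_t k = 1; k < in.size(); ++k) {
    out[k] = op(out[k - 1], in[k]);  // extend the running result by one element
  }
  return out;
}
```

With an associative operator on integers, such as maximum or addition, a parallel scan is free to regroup the operations and still arrive at the same values as this loop.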


3.b. exclusive_scan

exclusive_scan computes an exclusive prefix sum operation. The term ‘exclusive’ means that each result does not include the corresponding input operand in the partial sum. More precisely, init is assigned to *result and the value binary_op(init, *first) is assigned to *(result + 1), and so on. This version of the function requires both an associative operator and an initial value init. When the input and output sequences are the same, the scan is performed in-place.

Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run.

The algorithm’s execution is parallelized as determined by exec.

In [54]:
%%writefile exclusive_scan.cu
#include <thrust/scan.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <iostream>

using namespace std;

int main() {

  int data[10] = {-5, 0, 2, -3, 2, 4, 0, -1, 2, 8};

  thrust::maximum<int> binary_op;
  // The initial value is 1
  thrust::exclusive_scan(thrust::host, data, data + 10, data, 1, binary_op); // in-place scan

  // data is now {1, 1, 1, 2, 2, 2, 4, 4, 4, 4 }

  for(int i = 0; i < 10; i++) {
    cout << data[i] << " ";
  }
  cout << endl;

  return 0;

}


Writing exclusive_scan.cu


In [55]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub exclusive_scan.cu -o exclusive_scan -arch sm_75

In [56]:
!./exclusive_scan

1 1 1 2 2 2 4 4 4 4 
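The difference from an inclusive scan is only where the running value is written: before the current element is folded in rather than after. A minimal serial sketch in standard C++ follows; it is not Thrust's implementation, and the name `exclusive_scan_ref` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Serial exclusive scan; exclusive_scan_ref is a hypothetical helper.
// out[0] = init and out[k] = op(init, in[0], ..., in[k-1]),
// so out[k] never includes in[k].
template <typename BinaryOp>
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init,
                                    BinaryOp op) {
  std::vector<int> out(in.size());
  int running = init;
  for (std::size_t k = 0; k < in.size(); ++k) {
    out[k] = running;               // write first ...
    running = op(running, in[k]);   // ... then fold in the current element
  }
  return out;
}
```

Because the last input element is never folded into any output, the final 8 in the example data does not appear in the result.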


## 4. Reductions

a. Comparisons

b. Counting

c. Extrema

d. Logical

e. Predicates

f. Transformed Reductions

g. thrust::reduce

h. thrust::reduce_by_key

i. thrust::reduce_into

4.a. Comparisons

equal returns true if the two ranges [first1, last1) and [first2, first2 + (last1 - first1)) are identical when compared element-by-element, and otherwise returns false.

This version of equal returns true if and only if for every iterator i in [first1, last1), binary_pred(*i, *(first2 + (i - first1))) is true.

The following code snippet demonstrates how to use equal to compare the elements in two ranges modulo 2.

In [57]:
%%writefile equal.cu
#include <thrust/equal.h>
#include <iostream>

using namespace std;

struct compare_modulo_two
{
  __host__ __device__
  bool operator()(int x, int y) const
  {
    return (x % 2) == (y % 2);
  }
};

int main() {
  int x[6] = {0, 2, 4, 6, 8, 10};
  int y[6] = {1, 3, 5, 7, 9, 11};

  // compare all 6 elements; every pair differs in parity
  bool result = thrust::equal(x, x + 6, y, compare_modulo_two());

  // result is false (printed as 0)
  cout << result << endl;

  return 0;
}

Writing equal.cu


In [58]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub equal.cu -o equal -arch sm_75

In [59]:
!./equal

0


4.b. Counting

count finds the number of elements in [first,last) that are equal to value. More precisely, count returns the number of iterators i in [first, last) such that *i == value.

count_if finds the number of elements in [first,last) for which a predicate is true. More precisely, count_if returns the number of iterators i in [first, last) such that pred(*i) == true.

The algorithm’s execution is parallelized as determined by exec.

In [60]:
%%writefile count.cu
#include <thrust/count.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;

struct is_odd
{
  __host__ __device__
  bool operator()(int x) const
  {
    return x % 2 != 0;  // also true for negative odd values
  }
};

int main() {
// Example of count
 thrust::device_vector<int> vec(5,0);
  vec[1] = 1;
  vec[3] = 1;
  vec[4] = 1;

  // count the 1s
  int result = thrust::count(vec.begin(), vec.end(), 1);
  // result == 3
  cout << result << endl;
// ------------------------------------------------
// Example of count_if
// fill a device_vector with even & odd numbers

  vec[0] = 0;
  vec[1] = 1;
  vec[2] = 2;
  vec[3] = 3;
  vec[4] = 4;

// count the odd elements in vec
   result = thrust::count_if(thrust::device, vec.begin(), vec.end(), is_odd());
// result == 2
  cout << result << endl;
  return 0;
}

Writing count.cu


In [61]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub count.cu -o count -arch sm_75

In [62]:
!./count

3
2
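One caveat when writing a parity predicate such as is_odd: in C++ the remainder takes the sign of the dividend, so -3 % 2 is -1, and a test of the form x % 2 == 1 misses negative odd values. The host-only sketch below uses std::count_if as a serial stand-in for thrust::count_if; the names `is_odd_any_sign` and `count_odd` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Parity test that also works for negative values, since -3 % 2 == -1.
// is_odd_any_sign is a hypothetical name used only in this sketch.
struct is_odd_any_sign {
  bool operator()(int x) const { return x % 2 != 0; }
};

// Serial equivalent of count_if on a host vector.
std::ptrdiff_t count_odd(const std::vector<int>& v) {
  return std::count_if(v.begin(), v.end(), is_odd_any_sign());
}
```

For the non-negative values used in count.cu both forms of the test give the same count.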


4.c. Extrema

minmax_element finds the smallest and largest elements in the range [first, last). It returns a pair of iterators (imin, imax) where imin is the same iterator returned by min_element and imax is the same iterator returned by max_element. This function is potentially more efficient than separate calls to min_element and max_element.

In [63]:
%%writefile minmax.cu
#include <thrust/extrema.h>
#include <thrust/pair.h>
#include <iostream>

using namespace std;

struct key_value
{
  int key;
  int value;
};

struct compare_key_value
{
  __host__ __device__
  bool operator()(key_value lhs, key_value rhs)
  {
    return lhs.key < rhs.key;
  }
};

int main() {
  key_value data[4] = { {4,5}, {0,7}, {2,3}, {6,1} };

  thrust::pair<key_value*,key_value*> extrema = thrust::minmax_element(data, data + 4, compare_key_value());

  // extrema.first   == data + 1
  // *extrema.first  == {0,7}
  // extrema.second  == data + 3
  // *extrema.second == {6,1}
  cout << (*extrema.first).key << " " << (*extrema.first).value << endl;
  cout << (*extrema.second).key << " " << (*extrema.second).value << endl;

  return 0;
}

Writing minmax.cu


In [64]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub minmax.cu -o minmax -arch sm_75

In [65]:
!./minmax

0 7
6 1


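max_element finds the largest element in the range [first, last) and returns an iterator to it. The version used here compares elements with a function object comp and takes an execution policy (thrust::host) as its first argument.

In [66]: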
%%writefile maxelement.cu
#include <thrust/extrema.h>
#include <thrust/pair.h>
#include <iostream>

using namespace std;

struct key_value
{
  int key;
  int value;
};

struct compare_key_value
{
  __host__ __device__
  bool operator()(key_value lhs, key_value rhs)
  {
    return lhs.key < rhs.key;
  }
};

int main() {
  key_value data[4] = { {4,5}, {0,7}, {2,3}, {6,1} };

  key_value *largest = thrust::max_element(thrust::host, data, data + 4, compare_key_value());

  // largest == data + 3
  // *largest == {6,1}
  cout << (*largest).key << " " << (*largest).value << endl;

  return 0;
}

Writing maxelement.cu


In [67]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub maxelement.cu -o maxelement -arch sm_75

In [68]:
!./maxelement

6 1


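min_element finds the smallest element in the range [first, last) and returns an iterator to it. As with max_element, the version used here compares elements with a function object comp and takes an execution policy (thrust::host) as its first argument.

In [69]: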
%%writefile minelement.cu
#include <thrust/extrema.h>
#include <thrust/pair.h>
#include <iostream>

using namespace std;

struct key_value
{
  int key;
  int value;
};

struct compare_key_value
{
  __host__ __device__
  bool operator()(key_value lhs, key_value rhs)
  {
    return lhs.key < rhs.key;
  }
};

int main() {

  key_value data[4] = { {4,5}, {0,7}, {2,3}, {6,1} };

  key_value *smallest = thrust::min_element(thrust::host, data, data + 4, compare_key_value());

// smallest == data + 1
// *smallest == {0,7}
  cout << (*smallest).key << " " << (*smallest).value << endl;

  return 0;
}

Writing minelement.cu


In [70]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub minelement.cu -o minelement -arch sm_75

In [71]:
!./minelement

0 7


## 4.d. Logical

4.d.1.  all_of

4.d.2.  any_of

4.d.3. none_of



## 4.d.1. all_of

all_of determines whether all elements in a range satisfy a predicate. Specifically, all_of returns true if pred(*i) is true for every iterator i in the range [first, last) and false otherwise.

The algorithm’s execution is parallelized as determined by exec.

In [72]:
%%writefile all_of.cu
#include <thrust/logical.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;

int main() {
  bool A[3] = {true, true, false};
  bool result;
  result = thrust::all_of(thrust::host, A, A + 2, ::cuda::std::identity{}); // returns true
  cout << result << endl;
  result = thrust::all_of(thrust::host, A, A + 3, ::cuda::std::identity{}); // returns false
  cout << result << endl;
  // empty range
  result = thrust::all_of(thrust::host, A, A, ::cuda::std::identity{}); // returns true
  cout << result << endl;
  return 0;

}

Writing all_of.cu


In [73]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub all_of.cu -o all_of -arch sm_75

In [74]:
!./all_of

1
0
1


## 4.d.2. any_of

any_of determines whether any element in a range satisfies a predicate. Specifically, any_of returns true if pred(*i) is true for any iterator i in the range [first, last) and false otherwise.

In [75]:
%%writefile any_of.cu
#include <thrust/logical.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;

int main() {
  bool A[3] = {true, true, false};
  bool result;
  result = thrust::any_of(A, A + 2, ::cuda::std::identity{}); // returns true
  cout << result << endl;
  result = thrust::any_of(A, A + 3, ::cuda::std::identity{}); // returns true
  cout << result << endl;
  result = thrust::any_of(A + 2, A + 3, ::cuda::std::identity{}); // returns false
  cout << result << endl;
  // empty range
  result = thrust::any_of(A, A, ::cuda::std::identity{}); // returns false
  cout << result << endl;
  return 0;

}

Writing any_of.cu


In [76]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub any_of.cu -o any_of -arch sm_75

In [77]:
!./any_of

1
1
0
0


## 4.d.3. none_of

none_of determines whether no element in a range satisfies a predicate. Specifically, none_of returns true if there is no iterator i in the range [first, last) such that pred(*i) is true, and false otherwise.

The algorithm’s execution is parallelized as determined by exec.

In [78]:
%%writefile none_of.cu
#include <thrust/logical.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;

int main() {
  bool A[3] = {true, true, false};
  bool result;

  result = thrust::none_of(thrust::host, A, A + 2, ::cuda::std::identity{}); // returns false
  cout << result << endl;
  result = thrust::none_of(thrust::host, A, A + 3, ::cuda::std::identity{}); // returns false
  cout << result << endl;
  result = thrust::none_of(thrust::host, A + 2, A + 3, ::cuda::std::identity{}); // returns true
  cout << result << endl;
  // empty range
  result = thrust::none_of(thrust::host, A, A, ::cuda::std::identity{}); // returns true
  cout << result << endl;
  return 0;

}

Writing none_of.cu


In [79]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub none_of.cu -o none_of -arch sm_75

In [80]:
!./none_of

0
0
1
1
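The three logical reductions are tied together by simple identities: none_of is the negation of any_of, and on an empty range all_of and none_of are both true while any_of is false. The host-only sketch below checks this with the standard-library counterparts of the Thrust functions; the name `logical_summary` is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <vector>

// Returns {all_of, any_of, none_of} for the identity predicate.
// logical_summary is a hypothetical helper built on the standard algorithms.
std::array<bool, 3> logical_summary(const std::vector<bool>& v) {
  auto is_true = [](bool b) { return b; };
  bool all = std::all_of(v.begin(), v.end(), is_true);
  bool any = std::any_of(v.begin(), v.end(), is_true);
  bool none = std::none_of(v.begin(), v.end(), is_true);
  std::array<bool, 3> result = {{all, any, none}};
  return result;
}
```

The empty-range results follow from the definitions: "every element satisfies pred" holds when there are no elements to check, and "some element satisfies pred" cannot.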


## 4.e. Predicates

4.e.1. is_sorted

4.e.2. is_sorted_until

4.e.3. is_partitioned

## 4.e.1. is_sorted

is_sorted returns true if the range [first, last) is sorted in ascending order, and false otherwise.

Specifically, this version of is_sorted returns false if for some iterator i in the range [first, last - 1) the expression *(i + 1) < *i is true.

The algorithm’s execution is parallelized as determined by exec.

The following code demonstrates how to use is_sorted to test whether the contents of a device_vector are sorted in ascending order, using the thrust::device execution policy for parallelization.

In [81]:
%%writefile is_sorted.cu
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;

int main() {
  thrust::device_vector<int> v(6);
  v[0] = 1;
  v[1] = 4;
  v[2] = 2;
  v[3] = 8;
  v[4] = 5;
  v[5] = 7;

  bool result = thrust::is_sorted(thrust::device, v.begin(), v.end());

  // result == false
  cout << result << endl;

  thrust::sort(v.begin(), v.end());
  result = thrust::is_sorted(thrust::device, v.begin(), v.end());

  // result == true
  cout << result << endl;
  return 0;
}

Writing is_sorted.cu


In [82]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub is_sorted.cu -o is_sorted -arch sm_75

In [83]:
!./is_sorted

0
1


## 4.e.2. is_sorted_until

This version of is_sorted_until returns the last iterator i in [first, last] for which the range [first, i) is sorted using operator<. If distance(first, last) < 2, is_sorted_until simply returns last.

In [84]:
%%writefile is_sorted_until.cu
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;

int main() {

  int A[8] = {0, 1, 2, 3, 0, 1, 2, 3};

  int * B = thrust::is_sorted_until(A, A + 8);

  // B - A is 4
  // [A, B) is sorted
  int resultDifference = B - A;
  cout << resultDifference << endl;
  bool result = thrust::is_sorted(thrust::host, A, B);
  cout << result << endl;
}

Writing is_sorted_until.cu


In [85]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub is_sorted_until.cu -o is_sorted_until -arch sm_75

In [86]:
!./is_sorted_until

4
1
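The returned position is the index of the first element that is smaller than its predecessor, or the end of the range if there is none. A minimal serial sketch in standard C++ follows, using indices instead of iterators; the name `is_sorted_until_ref` is hypothetical and this is not Thrust's implementation.

```cpp
#include <cstddef>
#include <vector>

// Serial reference; is_sorted_until_ref is a hypothetical helper.
// Returns the largest n such that the first n elements are sorted with
// operator<, which is the index of the first descent or v.size().
std::size_t is_sorted_until_ref(const std::vector<int>& v) {
  for (std::size_t k = 1; k < v.size(); ++k) {
    if (v[k] < v[k - 1]) return k;  // v[k] breaks the ascending order
  }
  return v.size();  // covers empty and one-element ranges as well
}
```

For the array in is_sorted_until.cu the first descent is at index 4, where the second 0 follows the 3.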


## 4.e.3. is_partitioned

is_partitioned returns true if the given range is partitioned with respect to a predicate, and false otherwise.

Specifically, is_partitioned returns true if [first, last) is empty or if [first, last) is partitioned by pred, i.e. if all elements that satisfy pred appear before those that do not.

In [87]:
%%writefile is_partitioned.cu
#include <thrust/partition.h>
#include <iostream>

using namespace std;

struct is_even
{
  __host__ __device__
  bool operator()(const int &x)
  {
    return (x % 2) == 0;
  }
};

int main() {

  int A[] = {2, 4, 6, 8, 10, 1, 3, 5, 7, 9};
  int B[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  bool result;
  result = thrust::is_partitioned(A, A + 10, is_even()); // returns true
  cout << result << endl;
  result = thrust::is_partitioned(B, B + 10, is_even()); // returns false
  cout << result << endl;
  return 0;
}

Writing is_partitioned.cu


In [88]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub is_partitioned.cu -o is_partitioned -arch sm_75

In [89]:
!./is_partitioned

1
0
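The partition test amounts to two passes over the range: skip the leading elements that satisfy the predicate, then verify that none of the remaining elements do. The host-only sketch below writes this out in standard C++; the name `is_partitioned_ref` is hypothetical and this is not Thrust's implementation.

```cpp
#include <cstddef>
#include <vector>

// Serial reference; is_partitioned_ref is a hypothetical helper.
template <typename Pred>
bool is_partitioned_ref(const std::vector<int>& v, Pred pred) {
  std::size_t k = 0;
  // Skip the leading run of elements that satisfy pred.
  while (k < v.size() && pred(v[k])) ++k;
  // No later element may satisfy pred.
  for (; k < v.size(); ++k) {
    if (pred(v[k])) return false;
  }
  return true;  // also reached for an empty range
}
```

Neither half of the range has to be sorted; only the boundary between "satisfies pred" and "does not" matters.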


## 4.f. Transformed Reductions

4.f.1. transform_reduce

4.f.2. inner_product

## 4.f.1. transform_reduce

transform_reduce fuses the transform and reduce operations. transform_reduce is equivalent to performing a transformation defined by unary_op into a temporary sequence and then performing reduce on the transformed sequence. In most cases, fusing these two operations together is more efficient, since fewer memory reads and writes are required.

transform_reduce performs a reduction on the transformation of the sequence [first, last) according to unary_op. Specifically, unary_op is applied to each element of the sequence and then the result is reduced to a single value with binary_op using the initial value init. Note that the transformation unary_op is not applied to the initial value init. The order of reduction is not specified, so binary_op must be both commutative and associative.

The algorithm’s execution is parallelized as determined by exec.

In [90]:
%%writefile transform_reduce.cu

#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <iostream>

using namespace std;



template<typename T>
struct absolute_value
{
  __host__ __device__ T operator()(const T &x) const
  {
    return x < T(0) ? -x : x;
  }
};

int main() {

  int data[6] = {-1, 0, -2, -2, 1, -3};
  int result = thrust::transform_reduce(thrust::host,
                                      data, data + 6,
                                      absolute_value<int>(),
                                      0,
                                      thrust::maximum<int>());
// result == 3
  cout << result << endl;
  return 0;

}

Writing transform_reduce.cu


In [91]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub transform_reduce.cu -o transform_reduce -arch sm_75

In [92]:
!./transform_reduce

3
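The equivalence between the two-step and fused forms can be shown with a serial host-only sketch in standard C++. It mirrors the absolute-value and maximum operators of the example, is not Thrust's implementation, and the names `two_step_abs_max` and `fused_abs_max` are hypothetical.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Two-step form: transform into a temporary, then reduce the temporary.
int two_step_abs_max(const std::vector<int>& data, int init) {
  std::vector<int> tmp(data.size());
  std::transform(data.begin(), data.end(), tmp.begin(),
                 [](int x) { return std::abs(x); });
  int result = init;
  for (int t : tmp) result = std::max(result, t);
  return result;
}

// Fused form: transform each element as it is reduced, with no temporary.
// The initial value init is never passed through the transformation.
int fused_abs_max(const std::vector<int>& data, int init) {
  int result = init;
  for (int x : data) result = std::max(result, std::abs(x));
  return result;
}
```

The fused form reads each input element once and writes nothing but the final value, which is where the saving in memory traffic comes from.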


## 4.f.2. inner_product

inner_product calculates an inner product of the ranges [first1, last1) and [first2, first2 + (last1 - first1)).

Specifically, this version of inner_product computes the sum init + (*first1 * *first2) + (*(first1+1) * *(first2+1)) + ...

Unlike the C++ Standard Template Library function std::inner_product, this version offers no guarantee on order of execution.

In [93]:
%%writefile inner_product.cu
#include <thrust/inner_product.h>
#include <iostream>

using namespace std;

int main() {

  float vec1[3] = {1.0f, 2.0f, 5.0f};
  float vec2[3] = {4.0f, 1.0f, 5.0f};

  float result = thrust::inner_product(vec1, vec1 + 3, vec2, 0.0f);

  // result == 31.0f
  cout << result << endl;
  return 0;
}

Writing inner_product.cu


In [94]:
!nvcc -Icccl/thrust -Icccl/libcudacxx/include -Icccl/cub inner_product.cu -o inner_product -arch sm_75

In [95]:
!./inner_product

31
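For reference, the sum that inner_product computes can be written as a serial loop. The sketch below is host-only standard C++ with a fixed left-to-right order, which Thrust does not guarantee; the name `inner_product_ref` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference: init + a[0]*b[0] + a[1]*b[1] + ...
// inner_product_ref is a hypothetical helper, not part of Thrust.
float inner_product_ref(const std::vector<float>& a,
                        const std::vector<float>& b, float init) {
  assert(a.size() <= b.size());  // b must cover the whole first range
  float sum = init;
  for (std::size_t k = 0; k < a.size(); ++k) {
    sum += a[k] * b[k];
  }
  return sum;
}
```

With the vectors from inner_product.cu the terms are 4, 2 and 25. For floating-point inputs that are not exactly representable, a different summation order may change the last bits of the result.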
