## Overview
Concurrency describes distinct tasks that interact with each other and often overlap in time; they can even run on a single core. With parallelism, the tasks typically run the same code on multiple cores at the same time and do not depend on each other. Parallelism can be explicit or implicit. With explicit parallelism the programmer specifies how the work is parallelized. This can produce better performance, but the code does not automatically scale with the number of cores on the machine, so it is mainly useful when writing for specific hardware. With implicit parallelism the decision is left to the implementation, which can make the best use of the available resources.  
There are 2 kinds of parallelism: task parallelism and data parallelism.
 * Task parallelism : Used for distributing processing, also called thread-level parallelism. When we have a large computation, we split it into smaller tasks and start some threads to run these tasks in parallel. Task parallelism is implemented with the fork-join paradigm: we start multiple tasks/threads and wait for all of them to finish. The tasks can all be different.
 * Data parallelism : Used for distributing data, also called vectorization. When we have a large dataset, we split it into several smaller subsets and start some threads to process these subsets in parallel on different cores. Each thread produces the result for its part of the data, and a final reduce step combines the partial results into the overall result. For example, given an array of N elements, we split it into M sub-arrays and start M threads, one per sub-array, followed by a reduce task that combines the results, so the M tasks are all the same. Data parallelism is used in graphics processing units. Modern CPUs also have direct support for vectorization, known as the single instruction, multiple data (SIMD) architecture: CPUs with vector units provide separate instructions that operate on several data items at once. Besides the concurrency itself, data locality also improves performance in data parallelism, because data is stored close to where it is processed.

## Data Parallelism Example
This example sums the elements of a vector using 4 tasks running in parallel. The vector is divided into 4 parts, each task adds up one part, and in the final step the results from the tasks are added to get the full sum.

```
#include <algorithm>
#include <future>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

static std::mt19937 mt;    //Random number generator
static std::uniform_real_distribution<double> dist(0, 100); //Uniform distribution from 0 to 100 for the random number generator

//Compute the sum of a range of elements
double accum(const double *beg, const double *end)
{
    return std::accumulate(beg, end, 0.0);
}

//Divide the data into 4 parts
//Use separate threads to process each subset
double add_parallel(const std::vector<double>& vec)
{
    auto vec0 = vec.data();
    auto vsize = vec.size();

    //Start the threads
    auto fut1 = std::async(std::launch::async, accum, vec0, vec0 + vsize/4);
    auto fut2 = std::async(std::launch::async, accum, vec0 + vsize/4, vec0 + 2*vsize/4);
    auto fut3 = std::async(std::launch::async, accum, vec0 + 2*vsize/4, vec0 + 3*vsize/4);
    auto fut4 = std::async(std::launch::async, accum, vec0 + 3*vsize/4, vec0 + vsize);

    //Reduce step
    return fut1.get() + fut2.get() + fut3.get() + fut4.get();
}

int main()
{
    //Populate a vector with elements 1,2, ...., 16
    std::vector<double> vec(16);
    std::iota(vec.begin(), vec.end(), 1.0);

    //Populate a vector with 10,000 random elements
    std::vector<double> vrand(10'000);
    std::generate(vrand.begin(), vrand.end(), []{ return dist(mt); });

    //Print the sums so that the results are actually used
    std::cout << "Sum of 1..16: " << add_parallel(vec) << '\n';
    std::cout << "Sum of random elements: " << add_parallel(vrand) << '\n';
}
```

## Standard Algorithms
The standard algorithms are a set of functions in the standard library which implement classic operations like searching, sorting, populating, copying, reordering etc. They operate on containers and sequences of data, and are declared in the \<algorithm\> and \<numeric\> headers. To use one of these algorithms, we call the corresponding function with an iterator range (the end iterator is not included in the range). The algorithm iterates over the range, performs its operation on the elements and returns the result.  
For example, the std::find() algorithm returns an iterator to the first matching element, or the end iterator if there is no match.
```
std::string str{"Hello world"};
auto res = std::find(str.cbegin(), str.cend(), 'o');
if(res != str.cend())
{
    //Access the result
}
```
Many algorithms take a predicate, a function which takes arguments of the element type and returns a bool. For example, std::find_if() allows us to supply our own predicate (any callable object). In the code below, a predicate is used with std::find_if() to ignore case.
```
auto res = std::find_if(str.cbegin(), str.cend(),[](const char c)
                                                 {
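                                                     //Cast to unsigned char first, toupper() is undefined for negative char values
                                                     return std::toupper(static_cast<unsigned char>(c)) == 'O';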
                                                 });
```

## Execution Policies
There are 4 different ways to execute our code.
* Sequential : A single instruction processing one data item.
* Vectorized : A single instruction processing several data items. Requires suitable data structures (arrays and vectors) and hardware support.
* Parallelized : Several instructions, each processing one data item, at the same time. Requires suitable algorithms.
* Parallelized + Vectorized : Several instructions, each processing several data items, at the same time. Requires a suitable algorithm, data structure and hardware support.

Since C++17 we can specify an execution policy when calling an algorithm. Asking for a specific policy is only a request and it may be ignored, for example if vectorized hardware is not present, if insufficient system resources are available, or if a parallel or vectorized version has not been implemented by the compiler's standard library. We tell the algorithm which policy to use by passing an execution policy object as an optional first argument. If no policy is passed, the sequential implementation is used.
```
std::sort(vec.begin(), vec.end()); //Perform sort without a policy, sequential execution

std::sort(std::execution::seq, vec.begin(), vec.end()); //Perform sort using sequential execution
std::sort(std::execution::par, vec.begin(), vec.end()); //Perform sort using parallel execution
std::sort(std::execution::par_unseq, vec.begin(), vec.end()); //Perform sort using parallel and vectorized execution
std::sort(std::execution::unseq, vec.begin(), vec.end()); //Perform sort using vectorized execution (since C++20)
```
In sequenced execution, all operations are performed on a single thread (the thread which calls the algorithm) and the operations are not interleaved. In parallel execution, operations are performed in parallel across a number of threads. Operations performed on different threads may interleave, so the programmer is responsible for preventing data races.  
```
#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<int> vec(20'000);
    int count = 0;

    //count is shared by multiple threads and is not protected, which is a data race
    //The vector will typically not be filled with 1 to 20,000, some entries will be duplicated
    std::for_each(std::execution::par, vec.begin(), vec.end(), [&count](int& x)
                                                               {
                                                                   x = ++count;
                                                               }
                                                               );
}
```
In unsequenced execution, operations are performed on a single thread but may be interleaved (vectorized), so the programmer should avoid any modification of shared state between the elements and must not use mutexes, locks or other forms of synchronization. In parallel unsequenced execution, operations are performed in parallel across a number of threads, and operations performed on the same thread may also be interleaved, so the programmer must avoid both data races and modification of shared state between the elements.  
Most of the algorithms in C++ now support execution policies. Some algorithms are naturally sequential (e.g. equal_range()) and are left untouched. For some algorithms, new versions have been introduced under new names; for example, the version of accumulate() with policy support is called reduce().  
Without a policy, an exception thrown by the algorithm can be handled by the caller, or its caller, and so on. If nothing up the call stack handles the exception, the program terminates. With execution policies things are a bit more complicated, as there may be multiple threads, any of which can throw an exception, and each thread has its own call stack. For this reason std::terminate() is called if an exception escapes from an operation inside an algorithm invoked with one of the standard policies.

## New Parallel Algorithms
std::accumulate() is the older non-parallel algorithm. It takes an iterator range and an initial value, adds all the elements to the initial value and returns the result. We can pass a callable object as an optional fourth argument to replace + with another operation. C++17 introduced std::reduce(), a re-implementation of std::accumulate() that supports execution policies.
```
std::vector<int> vec{0, 1, 2, 3, 4, 5, 6, 7};

auto sum = std::reduce(std::execution::par, vec.begin(), vec.end(), 0);
```
std::reduce() may not give the correct answer if reordering or regrouping the operations alters the result, i.e. the operation must be commutative and associative. This holds for integer addition and multiplication, but not for subtraction or division. Floating-point addition and multiplication are not strictly associative either because of rounding, so a floating-point result may differ slightly from the one std::accumulate() gives.  
std::partial_sum() is another older non-parallel algorithm. It computes running totals (prefix sums): each element in the destination vector is the sum of the elements so far in the source vector (e.g. the 3rd element in the destination vector is the sum of the first 3 elements in the source vector, the 4th element is the sum of the first 4 elements, and so on). C++17 introduced std::inclusive_scan(), a re-implementation of std::partial_sum() that supports execution policies.
```
std::vector<int> vec{1, 2, 3, 4};
std::vector<int> vec2(vec.size());

std::inclusive_scan(std::execution::par, vec.begin(), vec.end(), vec2.begin());
```
There is also std::exclusive_scan(), which is similar but excludes the current element from each sum and takes an initial value.  
std::transform() is an older algorithm that predates execution policies. It takes an iterator range and a callable object, applies the callable object to every element and stores the results in the destination. Another form of std::transform() takes 2 input ranges and a callable object with 2 parameters. std::transform() and std::reduce() can be used to implement map and reduce in parallel programming: divide the data into subsets, start a thread for each subset, have each thread call transform() on its data, then call reduce() to combine the threads' results into the final answer. C++17 provides a single function that does both and accepts an execution policy, std::transform_reduce(). In its two-range form, std::transform_reduce() takes a transform function, which takes 2 arguments of the element types and returns a value of its result type. It also takes a reduce function, which takes 2 arguments of the transform's return type and returns a value of the final result type. The default transform function is * and the default reduce function is +.
```
//Find biggest error between 2 vectors

std::vector<double> expected{0.1, 0.2, 0.3, 0.4, 0.5};
std::vector<double> actual{0.09, 0.22, 0.27, 0.41, 0.52};

auto max_diff = std::transform_reduce(std::execution::par,
                                      std::begin(expected), std::end(expected),
                                      std::begin(actual),
                                      0.0, //Initial value for the largest difference
                                      [](auto diff1, auto diff2){return std::max(diff1, diff2);}, //Reduce operation
                                      [](auto exp, auto act){return std::abs(act-exp);} //Transform operation
                                      );
```