###  Activity 2.0: Loop Ordering and False Sharing

This activity reinforces the concept of reduction and the caching principles taught in the lecture on Cilk on Sep. 18. It is recommended that you run this on the CS machines `gradx.cs.jhu.edu` or `ugradx.cs.jhu.edu`, where the results match the expected findings. You may use any machine that has at least 4 cores, but your results may differ somewhat. It is OK if your results don't track exactly with the expected findings. On my M1 Apple Silicon laptop, for example, the results get confusing.

**Due date**: Tuesday October 1st, 2024, 5:00 pm EDT.

**Instructions for Submission**: Submit via Gradescope.

### The Program

This is a nested loop program that counts the number of occurrences of each token from a list of tokens in an array of elements. This is a common computing pattern in data analytics. For example, it could be used to count the number of messages sent in a network from a set of sources.

There are two serial versions of the program.  These are:
  * `countTokensElementsFirst`: loop over the larger `elements` array in the outer loop and the smaller `tokens` array in the inner loop.
  * `countTokensTokensFirst`: loop over the smaller `tokens` array in the outer loop and the larger `elements` array in the inner loop.
  
This is not a 2-dimensional data structure like our previous examples. It is two separate arrays.
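
The two loop orders can be sketched as follows. This is only an illustration, not the code in the activity file: the function names `count_elements_outer` and `count_tokens_outer`, the `std::vector<int>` types, and the assumption that `token_counts[t]` holds the count for `tokens[t]` are all hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: larger `elements` array in the outer loop,
// smaller `tokens` array in the inner loop.
void count_elements_outer(const std::vector<int>& elements,
                          const std::vector<int>& tokens,
                          std::vector<int>& token_counts) {
    for (std::size_t e = 0; e < elements.size(); ++e) {
        for (std::size_t t = 0; t < tokens.size(); ++t) {
            if (elements[e] == tokens[t]) {
                ++token_counts[t];
            }
        }
    }
}

// Hypothetical sketch: smaller `tokens` array in the outer loop,
// larger `elements` array in the inner loop.
void count_tokens_outer(const std::vector<int>& elements,
                        const std::vector<int>& tokens,
                        std::vector<int>& token_counts) {
    for (std::size_t t = 0; t < tokens.size(); ++t) {
        for (std::size_t e = 0; e < elements.size(); ++e) {
            if (elements[e] == tokens[t]) {
                ++token_counts[t];
            }
        }
    }
}
```

Both functions produce the same counts. They differ only in the order in which the two arrays are traversed.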

#### Programming


Complete the *TODO* instructions in [ebook/activities/tokens_omp/activity2_tokens.cpp](https://github.com/randalburns/pcds.2024/blob/main/ebook/activities/tokens_omp/activity2_tokens.cpp)

1. Add `parallel for` directives to functions:
   * `omp_countTokensElementsFirst`
   * `omp_countTokensTokensFirst`
2. Add a `parallel for` directive with a `reduction` clause on the array `token_counts` to:
    * `omp_countTokensElementsFirst_reduce`
    * `omp_countTokensTokensFirst_reduce`
  
Reduction over arrays in C/C++ was added in OpenMP 4.5 and requires you to specify the length of the array section being reduced. A simple example is provided in https://dvalters.github.io/optimisation/code/2016/11/06/OpenMP-array_reduction.html.
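
As a minimal sketch of the array-section syntax, here is a histogram unrelated to the activity code. The function name `histogram` and its parameters are hypothetical, and the sketch assumes every value in `data` lies in `[0, num_bins)`.

```cpp
#include <vector>

// Hypothetical example: count how often each value in [0, num_bins) occurs.
// Assumes every entry of `data` is a valid bin index.
std::vector<int> histogram(const std::vector<int>& data, int num_bins) {
    std::vector<int> bins(num_bins, 0);
    int* counts = bins.data();  // the array-section syntax needs a pointer or array
    const int n = static_cast<int>(data.size());

    // Each thread works on a private, zero-initialized copy of
    // counts[0 .. num_bins-1]; the copies are summed into the original
    // array when the loop ends.
    #pragma omp parallel for reduction(+ : counts[:num_bins])
    for (int i = 0; i < n; ++i) {
        ++counts[data[i]];
    }
    return bins;
}
```

The section `counts[:num_bins]` names the first `num_bins` entries of the array. If the code is compiled without `-fopenmp`, the pragma is ignored and the loop runs serially with the same result.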

3. Unroll the loop 8 times in `unroll_omp_countTokensElementsFirst_reduce`. You may assume that the length of the `tokens` array is evenly divisible by 8.
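
The general shape of manual unrolling can be sketched on a single inner loop with an unroll factor of 4; a factor of 8 follows the same pattern. The names `count_rolled` and `count_unrolled4` and their signatures are hypothetical and are not taken from the activity file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled version: the loop body handles one token per iteration.
void count_rolled(int element, const std::vector<int>& tokens,
                  std::vector<int>& counts) {
    for (std::size_t t = 0; t < tokens.size(); ++t) {
        if (element == tokens[t]) ++counts[t];
    }
}

// Unrolled by 4: the body is written out four times per iteration
// and t advances by 4. Requires tokens.size() to be a multiple of 4.
void count_unrolled4(int element, const std::vector<int>& tokens,
                     std::vector<int>& counts) {
    assert(tokens.size() % 4 == 0);
    for (std::size_t t = 0; t < tokens.size(); t += 4) {
        if (element == tokens[t])     ++counts[t];
        if (element == tokens[t + 1]) ++counts[t + 1];
        if (element == tokens[t + 2]) ++counts[t + 2];
        if (element == tokens[t + 3]) ++counts[t + 3];
    }
}
```

Both versions compute identical counts, so the rolled version can serve as a correctness check for the unrolled one.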

On the `gradx.cs.jhu.edu` machine after I added this code, I got the timing results
```
Tokens First time: 8.07097 seconds
Elements First time: 6.93468 seconds
OMP Tokens First time: 2.10465 seconds
OMP Elements First time: 1.78919 seconds
OMP Tokens First Reduce time: 1.99353 seconds
OMP Elements First Reduce time.: 1.78073 seconds
Unroll OMP Elements First Reduce time.: 0.926184 seconds
```
building with the command line
```
g++ -O0 -fopenmp activity2_tokens.cpp
```
Compiling with `-O0` disables compiler optimizations, so the compiler does not transform the loops in ways that would confound our results.

#### Questions

Provide brief but complete answers to the following questions in the Answers cell below.

1. Why is it more efficient to iterate over the `tokens` in the inner loop? 
(_Hint_: Access to both arrays is sequential. This is a question of memory access patterns, cache capacity and cache misses.)


2. Of the functions `omp_countTokensElementsFirst` and `omp_countTokensTokensFirst`:


    a. Which function performs <b><i>unsafe sharing</i></b> in the `tokens` array?


    b. Which function assigns different elements of the `tokens` array to different threads?

    
3. For the function that assigns different tokens to different threads, how does <b><i>false sharing</i></b> arise? Be specific about the memory access pattern or include a drawing/schema.
4. For the unrolled loop, why is it more efficient? What computations and instructions are avoided?

#### Answers

<i>Include your answers to Questions 1-4 in this cell</i>

#### Code

<i>Copy-Paste your code from `activity2_tokens.cpp` to the cell below</i>

// C++ Code goes here

// TODO