## RSBench
Compiler: gcc 7.2.0  
Flags: `-std=gnu99 -fopenmp -ffast-math -march=native -g -Ofast -DSTATUS`  
Libs: `-lm`   

  
(Haswell) Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz  
`divsd`: 10-20 cycles  
`mulsd`: 5 cycles  
`movsd`: 3 cycles  
L1 Cache: 32 kB, 8 way, 64 sets, 64 B line size, **latency 4**, per core.    
L2 Cache: 256 kB, 8 way, 512 sets, 64 B line size, **latency 12**, per core.
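
The set counts quoted above follow from capacity, associativity, and line size. A minimal sketch of that arithmetic, assuming "kB" here means 1024 bytes (the helper name `cache_sets` is illustrative, not part of RSBench):

```c
#include <assert.h>
#include <stddef.h>

/* Number of sets in a set-associative cache:
 * sets = capacity / (ways * line size). */
size_t cache_sets(size_t capacity_bytes, size_t ways, size_t line_bytes)
{
    assert(ways > 0 && line_bytes > 0);
    return capacity_bytes / (ways * line_bytes);
}
```

Calling `cache_sets(32 * 1024, 8, 64)` gives the 64 sets listed for L1, and `cache_sets(256 * 1024, 8, 64)` gives the 512 sets listed for L2.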

|   | L1 Cache | L2 Cache | L3 Cache | DRAM |
|:---|:------:|:--------:|:----------:|:-----|
|**Cache Lines / Cycle** | 1.9728 | 1.2120 | 0.7745 | 0.4402 |
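
To relate the table to more familiar units, the sketch below converts cache lines per cycle into bytes per cycle and, assuming the nominal 2.30 GHz clock applies, into GB/s. The function names are illustrative, and the table does not state how the figures were measured, so treat the results as rough estimates only.

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64.0 /* line size quoted for L1 and L2 above */

/* Bytes moved per cycle for a given cache-lines-per-cycle rate. */
double bytes_per_cycle(double lines_per_cycle)
{
    assert(lines_per_cycle >= 0.0);
    return lines_per_cycle * CACHE_LINE_BYTES;
}

/* Bytes per cycle times 1e9 cycles per second per GHz gives GB/s.
 * Assumption: the core runs at the nominal clock passed in. */
double gbytes_per_second(double lines_per_cycle, double clock_ghz)
{
    return bytes_per_cycle(lines_per_cycle) * clock_ghz;
}
```

For example, `bytes_per_cycle(0.4402)` puts the DRAM entry at about 28.2 bytes per cycle.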

  ---
### Serial Run

#### CPU Time
Inclusively, RSBench spends 98.7% of its time in `xs_kernel.c`.   
Exclusively, only 53.0% of total time is spent in `xs_kernel.c`.   
The remaining 45.7% is spent in the math library (`libm-2.17.so`).    

|xs_kernel.c|CPU Inclusive|CPU Exclusive|
|:----------|:-----------:|:-----------:|
|`calculate_micro_xs_doppler`|97.6%|24.6%|
|---> `loop at line 181`|46.4%|21.4%|
|`calculate_sig_T`|48.1%|3.4%|
|`fast_nuclear_W`|25.0%|23.9%|
|---> `line 72`|20.9%|20.9%|

---

#### calculate_micro_xs_doppler( ) | Loop at line 181
|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 21.4%     |    **0.49**           | **9.0%**                |     73.5%               | 
```c
	// Loop over Poles within window, add contributions                 
|181|	for( int i = w.start; i < w.end; i++ )                        //       CPUTIME
|182|   {
|183|	    Pole pole = data.poles[nuc][i];                                    // 4.5%
|184|
|185|	    // Prep Z
|186|	    double complex Z = (E - pole.MP_EA) * dopp;
|187|	    if( cabs(Z) < 6.0 )
|188|		    (*abrarov)++;
|189|	    (*alls)++;
|190|
|191|	    // Evaluate Fadeeva Function
|192|	    complex double faddeeva = fast_nuclear_W( Z );
|193|
|194|	    // Update W
|195|	    sigT += creal( pole.MP_RT * faddeeva * sigTfactors[pole.l_value] );// 8.6%
|196|	    sigA += creal( pole.MP_RA * faddeeva);
|197|	    sigF += creal( pole.MP_RF * faddeeva);
|198|   }
```
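
One possible contributor to the L1 miss rate in this loop is the by-value copy on line 183. The sketch below is a hypothetical stand-in for the `Pole` struct: the field names come from the listing, but the field types and layout are assumptions. It counts how many 64 B cache lines one such object spans.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Hypothetical layout: field names are taken from the loop above,
 * the types are assumed and may differ from the real RSBench struct. */
typedef struct {
    double complex MP_EA;
    double complex MP_RT;
    double complex MP_RA;
    double complex MP_RF;
    short l_value;
} PoleSketch;

/* Cache lines touched by an object of `size` bytes that starts
 * `offset` bytes into a line-aligned buffer. */
size_t lines_touched(size_t offset, size_t size, size_t line_bytes)
{
    assert(size > 0 && line_bytes > 0);
    size_t first = offset / line_bytes;
    size_t last = (offset + size - 1) / line_bytes;
    return last - first + 1;
}
```

Under these assumptions the four complex fields alone fill 64 bytes, so with `l_value` added every pole spans at least two cache lines, and `lines_touched(0, sizeof(PoleSketch), 64)` reflects that. Whether this explains the measured 9.0% miss rate is not established by the profile.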

#### calculate_sig_T (Where most of the libm time comes from: 49.7% of total CPU time)
|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 48.1%(Inclusive)     |    1.46           | 1.7%                   |     73.4%               | 
```c
|208|void calculate_sig_T( int nuc, double E, Input input, CalcDataPtrs data, complex double * sigTfactors )
|209|{
|210|	double phi;
|211|
|212|	for( int i = 0; i < input.numL; i++ )
|213|	{
|214|		phi = data.pseudo_K0RS[nuc][i] * sqrt(E);
|215|
|216|		if( i == 1 )
|217|			phi -= - atan( phi );
|218|		else if( i == 2 )
|219|			phi -= atan( 3.0 * phi / (3.0 - phi*phi));
|220|		else if( i == 3 )
|221|			phi -= atan(phi*(15.0-phi*phi)/(15.0-6.0*phi*phi));
|222|
|223|		phi *= 2.0;
|224|
|225|		sigTfactors[i] = cos(phi) - sin(phi) * _Complex_I;
|226|	}
|227|}
```

#### fast_nuclear_W
"This function uses a combination of the Abrarov Approximation
and the QUICK_W three term asymptotic expansion.
Only expected to use Abrarov ~0.5% of the time."
  
The function defines several hard-coded values and spends most of its time on line 72 (20.9% of CPU time):

|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 20.9 %    |    1.15                | 2.0%                    |     73.5%               | 
```c
|72| double complex W = I * Z * (a/(Z*Z - b) + c/(Z*Z - d));  
```

---
### 72 Thread Run
#### calculate_sig_T (Where most of the libm time comes from)

|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 27.0% (Inclusive)     |    0.46          | 1.8%           |     69.0%               | 
```c
|208|void calculate_sig_T( int nuc, double E, Input input, CalcDataPtrs data, complex double * sigTfactors )
|209|{
|210|	double phi;
|211|
|212|	for( int i = 0; i < input.numL; i++ )
|213|	{
|214|		phi = data.pseudo_K0RS[nuc][i] * sqrt(E);
|215|
|216|		if( i == 1 )
|217|			phi -= - atan( phi );
|218|		else if( i == 2 )
|219|			phi -= atan( 3.0 * phi / (3.0 - phi*phi));
|220|		else if( i == 3 )
|221|			phi -= atan(phi*(15.0-phi*phi)/(15.0-6.0*phi*phi));
|222|
|223|		phi *= 2.0;
|224|
|225|		sigTfactors[i] = cos(phi) - sin(phi) * _Complex_I;
|226|	}
|227|}
```

#### fast_nuclear_W
  
The function defines several hard-coded values and spends most of its time on line 72 (22.5% of CPU time):

|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 37.8%     |    0.51           | 2.9%                   |     69.0%               | 
```c
| 3|// This function uses a combination of the Abrarov Approximation
| 4|// and the QUICK_W three term asymptotic expansion.
| 5|// Only expected to use Abrarov ~0.5% of the time.
| 6|double complex fast_nuclear_W( double complex Z )
| 7|{
| 8|	// Abrarov 
| 9|	if( cabs(Z) < 6.0 )
|10|	{
|11|		// Precomputed parts for speeding things up
|12|		// (N = 10, Tm = 12.0)
            (...)
            
|52|		double complex W = I * ( 1 - cexp(I*12.*Z) ) / (12. * Z );
|53|		double complex sum = 0;
|54|		for( int n = 0; n < 10; n++ )
|55|		{
|56|			complex double top = neg_1n[n] * cexp(I*12.*Z) - 1.;
|57|			complex double bot = denominator_left[n] - 144.*Z*Z;
|58|			sum += an[n] * (top/bot);
|59|		}
|60|		W += prefactor * Z  * sum;
|61|		return W;
|62|	}
|63|
|64|	// QUICK_2 3 Term Asymptotic Expansion (Accurate to O(1e-6)).
|65|	// Pre-computed parameters
        (...)
        
|71|	// Three Term Asymptotic Expansion
|72|	double complex W = I * Z * (a/(Z*Z - b) + c/(Z*Z - d));
|73|       
|74|	return W;
|75|}
```

### Multithreading Result
When run on a single thread, RSBench is limited by the **math library, which pressures the execution core**, and by **memory bandwidth/latency**. When run using all threads (72 on Haswell (thing), 88 on Broadwell (it)), the bottleneck shifts and is caused, by a slight majority, by **branch mispredictions**.  