## RSBench

Compiler: gcc 7.2.0  
Flags: `-std=gnu99 -fopenmp -ffast-math -march=native -g -Ofast -DSTATUS`  
Libs: `-lm`   
Run Options: `-t 1` (serial run)


### Serial Run
(Haswell) Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz  
`divsd`: 10-20 cycles  
`mulsd`: 5 cycles  
`movsd`: 3 cycles  
L1 Cache: 32 kB, 8 way, 64 sets, 64 B line size, **latency 4 cycles**, per core.    
L2 Cache: 256 kB, 8 way, 512 sets, 64 B line size, **latency 12 cycles**, per core.
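
The cache sizes above follow from the geometry (ways × sets × line size). A minimal sketch of that check; the function names are hypothetical, and the set-index helper assumes plain modulo indexing on the line address:

```c
#include <stddef.h>

/* Capacity of a set-associative cache in bytes: ways * sets * line size. */
size_t cache_size_bytes(size_t ways, size_t sets, size_t line_bytes)
{
    return ways * sets * line_bytes;
}

/* Set an address maps to, assuming simple modulo indexing (an assumption). */
size_t cache_set_index(size_t addr, size_t sets, size_t line_bytes)
{
    return (addr / line_bytes) % sets;
}
```

With the values listed, `cache_size_bytes(8, 64, 64)` is 32768 B (32 kB) and `cache_size_bytes(8, 512, 64)` is 262144 B (256 kB). Under the modulo-indexing assumption, addresses 4096 B apart land in the same L1 set.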
  

---

### CPU Time
Inclusive: RSBench spends 98.7% of its CPU time in `xs_kernel.c` and the functions it calls.   
Exclusive: only 53.0% of total time is spent in `xs_kernel.c` itself.   
The remaining 45.7% is spent in the math library (`libm-2.17.so`).  
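
The libm share is the difference between the inclusive and exclusive figures. A minimal sketch of that arithmetic (the function name is hypothetical, and it assumes all callee time belongs to one library):

```c
/* Time spent in callees, in percent of total: inclusive minus exclusive. */
double callee_time_pct(double inclusive_pct, double exclusive_pct)
{
    return inclusive_pct - exclusive_pct;
}
```

Here `callee_time_pct(98.7, 53.0)` gives the 45.7% attributed to libm.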

  


|xs_kernel.c|CPU Inclusive|CPU Exclusive|
|:----------|:-----------:|:-----------:|
|calculate_micro_xs_doppler|97.6%|24.6%|
|---> loop at line 181|46.4%|21.4%|
|calculate_sig_T|48.1%|3.4%|
|fast_nuclear_W|25.0%|23.9%|
|---> line 72|20.9%|20.9%|

---

#### calculate_micro_xs_doppler( ) | Loop at line 181
|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 21.4%     |    **0.49**           | **9.0%**                |     73.5%               | 
```c
	// Loop over Poles within window, add contributions                 
|181|	for( int i = w.start; i < w.end; i++ )                        //       CPUTIME
|182|   {
|183|	    Pole pole = data.poles[nuc][i];                                    // 4.5%
|184|
|185|	    // Prep Z
|186|	    double complex Z = (E - pole.MP_EA) * dopp;
|187|	    if( cabs(Z) < 6.0 )
|188|		    (*abrarov)++;
|189|	    (*alls)++;
|190|
|191|	    // Evaluate Fadeeva Function
|192|	    complex double faddeeva = fast_nuclear_W( Z );
|193|
|194|	    // Update W
|195|	    sigT += creal( pole.MP_RT * faddeeva * sigTfactors[pole.l_value] );// 8.6%
|196|	    sigA += creal( pole.MP_RA * faddeeva);
|197|	    sigF += creal( pole.MP_RF * faddeeva);
|198|   }
```

#### calculate_sig_T (where most of the libm time comes from; 49.7% of total CPU time)
|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 48.1% (inclusive)    |    1.46           | 1.7%                   |     73.4%               | 
```c
|208|void calculate_sig_T( int nuc, double E, Input input, CalcDataPtrs data, complex double * sigTfactors )
|209|{
|210|	double phi;
|211|
|212|	for( int i = 0; i < input.numL; i++ )
|213|	{
|214|		phi = data.pseudo_K0RS[nuc][i] * sqrt(E);
|215|
|216|		if( i == 1 )
|217|			phi -= - atan( phi );
|218|		else if( i == 2 )
|219|			phi -= atan( 3.0 * phi / (3.0 - phi*phi));
|220|		else if( i == 3 )
|221|			phi -= atan(phi*(15.0-phi*phi)/(15.0-6.0*phi*phi));
|222|
|223|		phi *= 2.0;
|224|
|225|		sigTfactors[i] = cos(phi) - sin(phi) * _Complex_I;
|226|	}
|227|}
```

#### fast_nuclear_W
"This function uses a combination of the Abrarov Approximation
and the QUICK_W three term asymptotic expansion.
Only expected to use Abrarov ~0.5% of the time."
  
The function defines several hard-coded constants and spends most of its time on line 72 (20.9% of CPU time):

|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | L2 Data Cache Miss Rate |
|:---------:|:----------------------:|:-----------------------:|:-----------------------:|
| 20.9%      |    1.15            | 2.0%                   |     73.5%               | 
```c
|72| double complex W = I * Z * (a/(Z*Z - b) + c/(Z*Z - d));  //20.9% CPUTIME(E)
```
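
Line 72 contains two complex divisions, each with a real numerator. One way to see where the `divsd` latency listed under Serial Run could enter is to write the expression out in real arithmetic. The sketch below is an illustration, not RSBench code: the function names are hypothetical, `a` to `d` are passed as parameters because their values are not listed here, and the instruction sequence gcc actually emits under `-Ofast` has not been checked.

```c
#include <complex.h>

/* Line 72 as written, with the hard-coded constants passed in. */
double complex w_line72(double complex Z, double a, double b, double c, double d)
{
    return I * Z * (a / (Z * Z - b) + c / (Z * Z - d));
}

/* real / complex written out: k / (x + iy) = k * (x - iy) / (x^2 + y^2). */
static double complex real_over_complex(double k, double x, double y)
{
    double inv = k / (x * x + y * y);   /* one scalar division per term */
    return inv * x - inv * y * I;
}

/* Same expression with Z*Z and both divisions expanded into real arithmetic. */
double complex w_line72_expanded(double complex Z, double a, double b, double c, double d)
{
    double zr = creal(Z), zi = cimag(Z);
    double sr = zr * zr - zi * zi;      /* Re(Z*Z) */
    double si = 2.0 * zr * zi;          /* Im(Z*Z) */
    double complex t = real_over_complex(a, sr - b, si)
                     + real_over_complex(c, sr - d, si);
    return I * Z * t;
}
```

Written this way, the line needs at least two scalar divisions plus a chain of dependent multiplies, which is one plausible reading of why a single source line accounts for 20.9% of CPU time. Confirming it would require inspecting the disassembly.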

---

### CPI
|Scope|Cycles per Instruction (CPI)|
|:-----|:-----:|
|`calculate_micro_xs_doppler( ) : Loop at line 181`|  **2.04 CPI**  |
|`calculate_sig_T( ) : Exclusive for math lib`| 0.69 CPI | 
|`fast_nuclear_W( ) : Line 72`| 0.87 CPI|
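
The CPI values are the reciprocals of the instructions-per-cycle figures reported in the per-function tables. A minimal sketch of the conversion (function names are hypothetical):

```c
/* Cycles per instruction from instructions per cycle. */
double cpi_from_ipc(double ipc)
{
    return 1.0 / ipc;
}

/* Cycles per instruction from raw counter values. */
double cpi_from_counts(double cycles, double instructions)
{
    return cycles / instructions;
}
```

For example, `cpi_from_ipc(0.49)` is about 2.04, `cpi_from_ipc(1.46)` about 0.68, and `cpi_from_ipc(1.15)` about 0.87. The small difference from the 0.69 in the table is consistent with the IPC figures having been rounded.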

#### Issue Cycles
`calculate_micro_xs_doppler( ) | Loop at line 181`:  
-- 2.40e+10 Full Issue | 11.3% Cycles Issuing Max Instructions  
-- 7.60e+07 No Issue | less than 0.1% Cycles Issuing No Instructions  
-- 2.12e+11 Total Cycles  
  
`calculate_sig_T( ) | Exclusive for math lib`:  
-- 2.41e+11 Full Issue | 51.5% Cycles Issuing Max Instructions  
-- 7.42e+08 No Issue | 0.2% Cycles Issuing No Instructions  
-- 4.68e+11 Total Cycles  
  
`fast_nuclear_W( ) | Line 72`:  
-- 5.38e+09 Full Issue | 2.6% Cycles Issuing Max Instructions  
-- 2.00e+07 No Issue | less than 0.1% Cycles Issuing No Instructions   
-- 2.04e+11 Total Cycles  

#### Retiring Cycles
`calculate_micro_xs_doppler( ) | Loop at line 181`:  
--  2.87e+10 Full Retire | 13.5% Cycles Retiring Max Instructions  
--  8.08e+10 No Retire | 38.1% Cycles Retiring No Instructions  
-- 2.12e+11 Total Cycles  
  
`calculate_sig_T( ) | Exclusive for math lib`:  
-- 1.35e+11 Full Retire | 28.8% Cycles Retiring Max Instructions  
-- 1.11e+11 No Retire | 23.7% Cycles Retiring No Instructions  
-- 4.68e+11 Total Cycles  
  
`fast_nuclear_W( ) | Line 72`:  
-- 5.99e+10 Full Retire | 29.4% Cycles Retiring Max Instructions  
-- 4.14e+10 No Retire | 20.3% Cycles Retiring No Instructions   
-- 2.04e+11 Total Cycles  
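
Each percentage in the issue and retire lists is the event's cycle count divided by the total cycles for that scope. The sketch below reproduces that, and also derives the share of cycles that fall between "full" and "none". The type and function names are hypothetical, and the partial figure assumes the full and none counters are disjoint subsets of the total.

```c
/* Split of total cycles into full / none / partial, in percent. */
typedef struct {
    double full_pct;
    double none_pct;
    double partial_pct;
} CycleBreakdown;

CycleBreakdown cycle_breakdown(double full_cycles, double none_cycles,
                               double total_cycles)
{
    CycleBreakdown b;
    b.full_pct = 100.0 * full_cycles / total_cycles;
    b.none_pct = 100.0 * none_cycles / total_cycles;
    /* Whatever is neither full nor empty. */
    b.partial_pct = 100.0 - b.full_pct - b.none_pct;
    return b;
}
```

For the loop at line 181, `cycle_breakdown(2.87e10, 8.08e10, 2.12e11)` gives about 13.5% full-retire and 38.1% no-retire cycles, leaving roughly 48% of cycles retiring some instructions but fewer than the maximum.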

### Memory
#### Data Cache
`calculate_micro_xs_doppler( ) | Loop at line 181`:   
-- 4.94e+09 L1 Data Cache Misses | **9.0%** L1 Cache Miss Rate  
-- 3.63e+09 L2 Data Cache Misses | 26.5% L1 Misses Hit L2  
-- 5.47e+10 Load/Store Instructions

`calculate_sig_T( ) | Exclusive for math lib`:  
-- 3.50e+09 L1 Data Cache Misses | 1.7% L1 Cache Miss Rate  
-- 2.57e+09 L2 Data Cache Misses | 26.6% L1 Misses Hit L2  
-- 2.06e+11 Load/Store Instructions

`fast_nuclear_W( ) | Line 72`:  
-- 2.75e+09 L1 Data Cache Misses | 2.0% L1 Cache Miss Rate  
-- 2.02e+09 L2 Data Cache Misses | 26.5% L1 Misses Hit L2  
-- 1.41e+11 Load/Store Instructions
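
The miss rates in this section and in the per-function tables come straight from these counters: L1 miss rate is L1 misses over load/store instructions, and L2 miss rate is L2 misses over L1 misses. A minimal sketch, with hypothetical function names:

```c
/* L1 data cache miss rate in percent of load/store instructions. */
double l1_miss_rate_pct(double l1_misses, double load_stores)
{
    return 100.0 * l1_misses / load_stores;
}

/* L2 data cache miss rate in percent of L1 misses. */
double l2_miss_rate_pct(double l2_misses, double l1_misses)
{
    return 100.0 * l2_misses / l1_misses;
}

/* Share of L1 misses that are served by L2, in percent. */
double l2_hit_pct(double l2_misses, double l1_misses)
{
    return 100.0 - l2_miss_rate_pct(l2_misses, l1_misses);
}
```

For `fast_nuclear_W` line 72, `l2_miss_rate_pct(2.02e9, 2.75e9)` is about 73.5%, so about 26.5% of L1 misses hit in L2. For the loop at line 181, `l1_miss_rate_pct(4.94e9, 5.47e10)` is about 9.0%.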
