## CoMD
Classical Molecular Dynamics algorithms and workloads


Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell)  
`divsd`: 10-20 cycles  
`mulsd`: 5 cycles  
`movsd`: 3 cycles  
L1 Cache: 32 kB, 8 way, 64 sets, 64 B line size, **latency 4 cycles**, per core.  
L2 Cache: 256 kB, 8 way, 512 sets, 64 B line size, **latency 12 cycles**, per core.
  
| Metric | L1 Cache | L2 Cache | L3 Cache | DRAM |
|:---|:---:|:---:|:---:|:---:|
|**Cache Lines / Cycle** | 1.9728 | 1.2120 | 0.7745 | 0.4402 |
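
As a sanity check on the figures above, the sketch below recomputes a cache's capacity from its geometry (ways × sets × line size) and converts a rate in cache lines per cycle into GB/s. The function names are hypothetical, and the conversion assumes 64 B lines and the nominal 2.30 GHz clock; the clock actually sustained during the measurement is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Capacity of a set-associative cache in bytes: ways * sets * line size. */
size_t cache_bytes(size_t ways, size_t sets, size_t line_bytes)
{
    assert(ways > 0 && sets > 0 && line_bytes > 0);
    return ways * sets * line_bytes;
}

/* Convert cache lines per cycle into GB/s (1 GB = 1e9 B).
 * clock_ghz is an assumption: pass the nominal clock (2.3 here),
 * which may differ from the clock during the measurement. */
double lines_per_cycle_to_gbs(double lines_per_cycle, double line_bytes,
                              double clock_ghz)
{
    /* lines/cycle * bytes/line * 1e9 cycles/s, divided by 1e9 B/GB */
    return lines_per_cycle * line_bytes * clock_ghz;
}
```

For example, `cache_bytes(8, 64, 64)` gives 32768 B (32 kB) and `cache_bytes(8, 512, 64)` gives 262144 B (256 kB), matching the L1 and L2 sizes listed. Under the stated assumptions, `lines_per_cycle_to_gbs(1.9728, 64.0, 2.3)` comes to roughly 290 GB/s for L1.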

  ---
### Serial Run
CPUTIME: 8.94e+06  
94.2% of CPUTIME is spent in the `ljForce(SimFlat* s)` function in `ljForce.c`  
76.7% of total CPUTIME is spent in the `for (int jOff=MAXATOMS*jBox,ij=0; ij<nJBox; ij++,jOff++)` loop, line 189



#### `ljForce(SimFlat* s)`: 94.2% CPUTIME

---
```c
|144|int ljForce(SimFlat* s)
       (...)
|169|  for (int iBox=0; iBox<s->boxes->nLocalBoxes; iBox++)
         (...)
|175|    for (int jTmp=0; jTmp<nNbrBoxes; jTmp++)
           (...)
|185|      for (int iOff=iBox*MAXATOMS,ii=0; ii<nIBox; ii++,iOff++)
             (...)
```           
|Scope  |  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | 
|:---|:---------:|:----------------------:|:-----------------------:| 
| `for (int jOff=MAXATOMS*jBox,ij=0; ij<nJBox; ij++,jOff++)`| 76.7% CPU |    1.01 Ins / Cycle    | 0.6% L1 Data Cache Miss | 
```c      
|189|        for (int jOff=MAXATOMS*jBox,ij=0; ij<nJBox; ij++,jOff++)
|190|        {                                                        
|191|          real_t dr[3];
|192|          int jId = s->atoms->gid[jOff];  
|193|          if (jBox < s->boxes->nLocalBoxes && jId <= iId )      //  5.9% CPUTIME
|194|            continue; // don't double count local-local pairs.
|195|          real_t r2 = 0.0;
|196|          for (int m=0; m<3; m++)
|197|          {
|198|            dr[m] = s->atoms->r[iOff][m]-s->atoms->r[jOff][m];  //  6.4% CPUTIME
```
  
  
|Scope|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | 
|:---|:---:|:---:|:---:|
|`r2+=dr[m]*dr[m];` | 11.7% CPU |    1.38 Ins / Cycle    | 1.6% L1 Data Cache Miss |
```c        
|199|            r2+=dr[m]*dr[m];                                     
|200|          }
|201|
|202|          if ( r2 > rCut2) continue;                            //  6.8% CPUTIME
|203|  
|204|          // Important note:
|205|          // from this point on r actually refers to 1.0/r
|206|          r2 = 1.0/r2;
```
|Scope|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | 
|:---|:---:|:---:|:---:|
|`real_t r6 = s6 * (r2*r2*r2);` | 12.8% CPU |    0.68 Ins / Cycle    | 0.2% L1 Data Cache Miss |
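
**Note:** dependency chain. The division on line 206 feeds `r6` on line 207, which in turn feeds `eLocal` on line 208, so these operations cannot overlap.
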
```c        
|207|          real_t r6 = s6 * (r2*r2*r2);                           
|208|          real_t eLocal = r6 * (r6 - 1.0) - eShift;             //  5.4% CPUTIME
|209|          s->atoms->U[iOff] += 0.5*eLocal;                      //  6.6% CPUTIME
|210|          s->atoms->U[jOff] += 0.5*eLocal;
|211|
|212|          // calculate energy contribution based on whether
|213|          // the neighbor box is local or remote
|214|          if (jBox < s->boxes->nLocalBoxes)
|215|            ePot += eLocal;
|216|          else
|217|            ePot += 0.5 * eLocal;
|218|
|219|          // different formulation to avoid sqrt computation
|220|          real_t fr = - 4.0*epsilon*r6*r2*(12.0*r6 - 6.0);
|221|          for (int m=0; m<3; m++)
|222|          {
|223|            s->atoms->f[iOff][m] -= dr[m]*fr;
|224|            s->atoms->f[jOff][m] += dr[m]*fr;                         
|225|          }
|226|        }// loop over atoms in jBox 
|227|      } // loop over atoms in iBox
|228|    } // loop over neighbor boxes
|229|  } // loop over local boxes in system
       (...)
|236|}       
```
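
The dependency chain noted for lines 206-208 can be isolated in a standalone function. This is a minimal sketch, not CoMD code: `LjPair` and `lj_pair` are hypothetical names, and `double` stands in for `real_t`. Each step needs the previous result. If the cycle counts quoted at the top are read as latencies (`divsd` 10-20 cycles, `mulsd` 5 cycles), `r6` is ready no sooner than about 10 + 3 × 5 = 25 cycles after `r2`, assuming the compiler keeps the multiplications in source order.

```c
#include <assert.h>

/* Result of one pair interaction (hypothetical helper type, not in CoMD). */
typedef struct {
    double eLocal; /* pair energy term, as computed on line 208 */
    double fr;     /* scalar that multiplies dr[m], as on line 220 */
} LjPair;

/* Energy term and force factor for one pair at squared distance r2. */
LjPair lj_pair(double r2, double s6, double epsilon, double eShift)
{
    assert(r2 > 0.0);                   /* the division below needs r2 != 0 */
    LjPair out;
    double ir2 = 1.0 / r2;              /* one division starts the chain */
    double r6 = s6 * (ir2 * ir2 * ir2); /* three multiplies, each waiting on the last */
    out.eLocal = r6 * (r6 - 1.0) - eShift;
    out.fr = -4.0 * epsilon * r6 * ir2 * (12.0 * r6 - 6.0);
    return out;
}
```

As a quick check, with `s6 = 64.0` and `r2 = 4.0` the value of `r6` is exactly 1, so `eLocal` reduces to `-eShift` and `fr` to `-24.0 * epsilon / r2`.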

---

### 72 Thread Run: `./CoMD -x 200 -y 200 -z 200`
```c
|145|int ljForce(SimFlat* s)
|146|{
        (...)
|173|   for (int iBox=0; iBox<s->boxes->nLocalBoxes; iBox++)
|174|   {
           (...)
|178|      for (int jTmp=0; jTmp<nNbrBoxes; jTmp++)
|179|      {
              (...)
|187|         for (int iOff=MAXATOMS*iBox; iOff<(iBox*MAXATOMS+nIBox); iOff++)
|188|         {
                 (...)
```
                 
|Scope|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | 
|:---|:---:|:---:|:---:|
|`for (int jOff=jBox*MAXATOMS; jOff<(jBox*MAXATOMS+nJBox); jOff++)` | 53.6% CPU |  0.53 Ins / Cycle    | 0.4% L1 Data Cache Miss |
```c
|191|            for (int jOff=jBox*MAXATOMS; jOff<(jBox*MAXATOMS+nJBox); jOff++)
|192|            {
|193|               real3 dr;
|194|               real_t r2 = 0.0;
|195|               for (int m=0; m<3; m++)
|196|               {
|197|                  dr[m] = s->atoms->r[iOff][m]-s->atoms->r[jOff][m];
```

|Scope|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | 
|:---|:---:|:---:|:---:|
|`r2+=dr[m]*dr[m];` | 9.4% CPU |    0.60 Ins / Cycle    | 0.6% L1 Data Cache Miss |
```c
|198|                  r2+=dr[m]*dr[m];
|199|               }
|200|
|201|               if ( r2 <= rCut2 && r2 > 0.0)
|202|               {
|203|
|204|                  // Important note:
|205|                  // from this point on r actually refers to 1.0/r
|206|                  r2 = 1.0/r2;
```

|Scope|  CPUTIME  | Instructions per Cycle | L1 Data Cache Miss Rate | 
|:---|:---:|:---:|:---:|
|`real_t r6 = s6 * (r2*r2*r2);` | 9.4% CPU |    0.36 Ins / Cycle    | 0.2% L1 Data Cache Miss |
```c
|207|                  real_t r6 = s6 * (r2*r2*r2);
|208|                  real_t eLocal = r6 * (r6 - 1.0) - eShift;
|209|                  s->atoms->U[iOff] += 0.5*eLocal;
|210|                  ePot += 0.5*eLocal;
|211|
|212|                  // different formulation to avoid sqrt computation
|213|                  real_t fr = - 4.0*epsilon*r6*r2*(12.0*r6 - 6.0);
|214|                  for (int m=0; m<3; m++)
|215|                  {
|216|                     s->atoms->f[iOff][m] -= dr[m]*fr;
|217|                  }
|218|               }
|219|            } // loop over atoms in jBox
|220|         } // loop over atoms in iBox
|221|      } // loop over neighbor boxes
|222|   } // loop over local boxes in system
        (...)
|228|}

```
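
The serial and 72-thread listings differ in how pairs are visited. The serial loop skips local pairs with `jId <= iId` and updates both atoms of each pair, while the 72-thread loop visits every ordered pair and writes only to atom `iOff`, presumably so that no two threads update the same atom. The sketch below is a simplified illustration under stated assumptions (hypothetical function names, a flat atom array instead of link-cell boxes, no halo atoms, `double` for `real_t`) for checking that both formulations produce the same energy and forces.

```c
#include <assert.h>

/* Displacement r[i] - r[j] into dr; returns the squared distance. */
static double displacement(double r[][3], int i, int j, double dr[3])
{
    double r2 = 0.0;
    for (int m = 0; m < 3; m++) {
        dr[m] = r[i][m] - r[j][m];
        r2 += dr[m] * dr[m];
    }
    return r2;
}

/* Serial style: each pair visited once (j > i), both atoms updated. */
double lj_half_pairs(int n, double r[][3], double f[][3], double U[],
                     double s6, double epsilon, double eShift, double rCut2)
{
    double ePot = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double dr[3];
            double r2 = displacement(r, i, j, dr);
            if (r2 > rCut2) continue;
            double ir2 = 1.0 / r2;
            double r6 = s6 * (ir2 * ir2 * ir2);
            double eLocal = r6 * (r6 - 1.0) - eShift;
            U[i] += 0.5 * eLocal;
            U[j] += 0.5 * eLocal;
            ePot += eLocal;
            double fr = -4.0 * epsilon * r6 * ir2 * (12.0 * r6 - 6.0);
            for (int m = 0; m < 3; m++) {
                f[i][m] -= dr[m] * fr;
                f[j][m] += dr[m] * fr; /* write to the partner atom */
            }
        }
    }
    return ePot;
}

/* Threaded style: every ordered pair visited, only atom i written. */
double lj_full_pairs(int n, double r[][3], double f[][3], double U[],
                     double s6, double epsilon, double eShift, double rCut2)
{
    double ePot = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double dr[3];
            double r2 = displacement(r, i, j, dr);
            if (!(r2 <= rCut2 && r2 > 0.0)) continue; /* r2 > 0 skips i == j */
            double ir2 = 1.0 / r2;
            double r6 = s6 * (ir2 * ir2 * ir2);
            double eLocal = r6 * (r6 - 1.0) - eShift;
            U[i] += 0.5 * eLocal;   /* half the pair energy per visit */
            ePot += 0.5 * eLocal;
            double fr = -4.0 * epsilon * r6 * ir2 * (12.0 * r6 - 6.0);
            for (int m = 0; m < 3; m++)
                f[i][m] -= dr[m] * fr;
        }
    }
    return ePot;
}
```

Both functions accumulate into `f` and `U`, so the caller zeroes those arrays first. The full-pair version evaluates each interaction twice, which is the price for writing only to atom `i`.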


  
72 Threads -x 200 -y 200 -z 200: `Aggregate CPUTIME`: 3.13e+10   
1 Thread: `Aggregate CPUTIME`: 8.94e+06

`3.72e+09 / (1.42e+12 + 3.10e+11)` = 0.0021502890173410406