(Pulling_Sim_Data)=
# Extracting Simulation Data

Once you have run your simulations, you need to extract the correct information in a meaningful way. Unfortunately, it is not as easy as running all the data through a master analysis program; you must first choose how you will analyze the data (e.g. using any of the analysis methods shown in the theory section), then ensure that your data are independent and uncorrelated [as discussed previously](Runnning_Sims).

## Extracting the Correct Information
For this step, we must ensure that the data we are extracting from our simulation is meaningful to the analysis technique we will run. Recall that each method needs different information:

* [TI](TI) requires $\frac{dU}{d\lambda}$ at each $\vec{q}$ point.
* [EXP](exp_avg) needs *either* $\Delta  U_{k,k+1}(\vec{q})$ or $\Delta U_{k,k-1}(\vec{q})$ depending on which direction EXP is being operated in. 
* [BAR](BAR) must have *both* $\Delta  U_{k,k+1}(\vec{q})$ and $\Delta U_{k,k-1}(\vec{q})$ between all pairs of states.
* [WHAM](WHAM) and [MBAR](MBAR) have to have the complete set of $\Delta  U_{k,j}(\vec{q})$ with $j=1\ldots K$ along the entire transformation path; WHAM must have this information binned.
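The requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nested-list layout `energies[k][n][j]` (sample `n` drawn from state `k`, re-evaluated with the potential of state `j`), the function names, and the sign convention $\Delta U_{k,j} = U_j - U_k$ are all hypothetical choices made for this example, not the output format of any particular simulation package.

```python
# Hypothetical layout (an assumption for this sketch):
#   energies[k][n][j] = U_j(q_n), where q_n is the n-th sample drawn from state k.
# Sign convention assumed here: Delta U_{k,j}(q) = U_j(q) - U_k(q).


def forward_differences(energies, k):
    """Delta U_{k,k+1} for every sample drawn from state k (EXP forward, BAR)."""
    return [u[k + 1] - u[k] for u in energies[k]]


def reverse_differences(energies, k):
    """Delta U_{k,k-1} for every sample drawn from state k (EXP reverse, BAR)."""
    return [u[k - 1] - u[k] for u in energies[k]]


def all_differences(energies, k):
    """Delta U_{k,j} for all j, for every sample drawn from state k (MBAR, WHAM)."""
    return [[u_j - u[k] for u_j in u] for u in energies[k]]
```

EXP would use only one of the first two functions, BAR needs both for each adjacent pair of states, and MBAR needs the full output of `all_differences` for every state; WHAM would additionally bin those values.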

The potential derivative required for TI must generally be calculated during the simulation; it cannot be post-processed by a code that does not evaluate the derivatives. The potential energy differences required for EXP, BAR, MBAR, and WHAM can be calculated either during the simulation or in post-processing. We recommend calculating the potential differences in code when possible to avoid the extra overhead and possible errors produced by running the simulation twice, and to reduce the amount of stored information. Although TI must usually be calculated in code, as it requires the derivative, there is one condition under which it is actually the cheapest quantity to compute. If the alchemical path you have chosen is a [linear alchemical path](linear_xform), written as $U(\lambda,\vec{q}) = (1-\lambda)U_0(\vec{q}) + \lambda U_1(\vec{q})$, then you get

$\frac{dU}{d\lambda}=U_1(\vec{q}) - U_0(\vec{q})$

which can be calculated with only the endpoint energies. However, because of the [problems with linear paths](linear_xform), this simplification is rarely that useful.
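As a minimal sketch of this shortcut, the snippet below assumes the linear path is parameterized as $U(\lambda) = (1-\lambda)U_0 + \lambda U_1$, so that $\lambda=0$ corresponds to $U_0$; with the opposite labeling of the end states the sign of the derivative flips. The function names are hypothetical and chosen only for this illustration.

```python
def linear_potential(lam, u0, u1):
    """Potential along an assumed linear path: U(lam) = (1 - lam)*U_0 + lam*U_1."""
    return (1.0 - lam) * u0 + lam * u1


def du_dlambda_linear(u0, u1):
    """Derivative of the assumed linear path with respect to lambda.

    It does not depend on lambda, so the two endpoint energies of a
    configuration are all that is needed.
    """
    return u1 - u0


if __name__ == "__main__":
    # Compare against a central finite difference at an arbitrary lambda.
    u0, u1, lam, h = -12.5, -7.25, 0.3, 1.0e-4
    numeric = (linear_potential(lam + h, u0, u1)
               - linear_potential(lam - h, u0, u1)) / (2.0 * h)
    print(du_dlambda_linear(u0, u1), numeric)
```

Because the derivative is the same at every $\lambda$, it can be recovered in post-processing from stored endpoint energies, which is what makes this special case cheap.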

(correlation)=
## Correlation

Once we have extracted the information, we need to make sure that the data we process to extract free energies is independent. As [mentioned in running simulations](independent_samples), samples close together in time are correlated with each other in all but the simplest systems, and we **must** have uncorrelated samples for our data to be meaningful.

The autocorrelation time measures the time between effectively uncorrelated samples, and a number of approaches exist for calculating it. The most common starting point is to compute the normalized autocorrelation function of an observable *X* over the duration of the whole simulation, $\mathcal{T}$.<!-- Trying to make the time T distinguishable from the temperature T--> We first define

$\displaystyle \delta X(t) = X(t) - \mathcal{T}^{-1}\int_{0}^{\mathcal{T}} X(t')\, dt'$

which is the instantaneous value of *X* minus the average value of *X* over the simulation. We now compute the quantity:

$\displaystyle C_X(\Delta t) = \frac{\int_{0}^{\mathcal{T}-\Delta t} \delta X(t)\, \delta X(t+\Delta t)\, dt}{\int_{0}^{\mathcal{T}} \delta X(t)^2\, dt}$

where $\Delta t$ is the lag time separating two samples. If $C_X$ is zero at $\Delta t$ and at all longer lag times, then two samples separated by $\Delta t$ are uncorrelated and can be treated as independent.

Since we usually have *N* samples taken a time $\delta t$ apart, $C_X$ can only be evaluated at discrete lags $i\,\delta t$, requiring us to redefine our two equations:

$\delta X(i) = X(i) - \frac{1}{N}\sum_{j=0}^{N-1} X(j)$

and

$C_X(i) = \frac{\sum_{j=0}^{N-1-i} \delta X(j) \delta X(j+i)}{\sum_{j=0}^{N-1} \delta X(j)^2}$

where the autocorrelation time, $\tau$, is now the integral under $C_X$. One must be careful when integrating this function, though, as the noise becomes substantial, especially at lags longer than half the simulation time. Often, the autocorrelation function can be fit to an exponential, which makes $\tau$ the relaxation time of the exponential function. A good rule of thumb is that simulations should run for a minimum of 50$\tau$, since longer correlation times may not be detected in short simulations. Once you have $\tau$, under standard statistical assumptions, samples can be considered independent if they are spaced by 2$\tau$. If you do not account for the correlation time, your estimated statistical uncertainty will be lower than the true uncertainty. Fortunately, many simulation packages come with methods to calculate the autocorrelation time, some of which are more sophisticated than the approach presented here.
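The discrete procedure described above can be sketched in plain Python as follows. This is an illustration rather than a production estimator: the function names are hypothetical, and stopping the integral at the first non-positive value of $C_X$ is a simple heuristic assumed here to keep the noisy tail out of the sum; it is not prescribed by the text.

```python
import random


def autocorrelation(x, max_lag):
    """Normalized autocorrelation C_X(i) for lags i = 0 .. max_lag."""
    n = len(x)
    mean = sum(x) / n
    dx = [xi - mean for xi in x]        # delta X(i): deviation from the mean
    denom = sum(d * d for d in dx)      # sum over j of delta X(j)^2
    return [
        sum(dx[j] * dx[j + i] for j in range(n - i)) / denom
        for i in range(max_lag + 1)
    ]


def integrated_time(c, dt):
    """Trapezoidal integral of C_X in the same time units as dt.

    Assumption of this sketch: integration stops at the first
    non-positive value of C_X so that the noisy tail is excluded.
    """
    area = 0.5 * c[0]
    for value in c[1:]:
        if value <= 0.0:
            break
        area += value
    return area * dt


if __name__ == "__main__":
    # Synthetic correlated series (an AR(1) process), for demonstration only.
    rng = random.Random(0)
    x = [0.0]
    for _ in range(4999):
        x.append(0.8 * x[-1] + rng.gauss(0.0, 1.0))
    c = autocorrelation(x, max_lag=50)
    print("estimated tau (ps):", integrated_time(c, dt=10.0))
```

For real data, the observable `x` would be the time series being analyzed (for example $dU/d\lambda$ or $\Delta U$), and `dt` the time between stored snapshots.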

### Applying Correlation Corrections
Once the correlation time is calculated, you still must apply it to your data. If your free energy method computes single-state averages, like [TI](TI), then the average over all samples can be used for your mean; this *includes correlated samples*. The effective variance of that mean is then the regular variance multiplied by $2\tau/\Delta t$ (equivalently, the standard error is multiplied by $\sqrt{2\tau/\Delta t}$), where $\Delta t$ is the time difference between samples. As an alternative, or when your averages are not computed from single states, you can *subsample* the data by analyzing only the set of samples separated by 2$\tau$. Consider the following simple example with [BAR](BAR), where we want the free energy difference between states 1 and 2.

* 5 ns of simulation time
* 10 ps between each snapshot (500 snapshots total)
* Assume an autocorrelation time for 1 &rarr; 2 of 20 ps
* Assume an autocorrelation time for 2 &rarr; 1 of 40 ps

Under these conditions, when we go to analyze $\Delta U_{1,2}$, we need every fourth sample, as the correlation time is 20 ps and we want a sample every $2\tau = 40\,\mathrm{ps}$. Similarly, we should take every eighth sample from $\Delta U_{2,1}$, since the correlation time is 40 ps and $2\tau = 80\,\mathrm{ps}$. If this is done correctly, we will not have discarded any unique data, as the information we ignored is already contained in the samples we kept.
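The subsampling in this example can be written as a short sketch. The function names are hypothetical, and rounding the stride up to a whole number of snapshots is an assumption made here so that kept samples are never closer than 2$\tau$.

```python
import math


def subsample_stride(tau, dt):
    """Snapshots to step between kept samples so they are at least 2*tau apart."""
    return max(1, math.ceil(2.0 * tau / dt))


def subsample(samples, tau, dt):
    """Keep only samples separated by at least 2*tau."""
    return samples[::subsample_stride(tau, dt)]


if __name__ == "__main__":
    snapshots = list(range(500))                      # 5 ns at 10 ps per snapshot
    forward = subsample(snapshots, tau=20.0, dt=10.0)   # for Delta U_{1,2}
    reverse = subsample(snapshots, tau=40.0, dt=10.0)   # for Delta U_{2,1}
    print(len(forward), len(reverse))                 # prints "125 63"
```

Of the 500 stored snapshots, only 125 and 63 effectively independent samples remain for the two directions, which is the sample count the uncertainty estimate should be based on.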

Truly independent sampling of configurations is achieved only if *all* coordinates are also uncorrelated between samples, not just the energies. Although independent energies usually imply independent samples, there are situations where the energy is approximately independently sampled within the noise, but the configuration space is not as well sampled. This may occur, for example, when a ligand has one conformation with a binding affinity comparable to that of another, but rarely visits that conformer. If one were looking only at energies, it may be hard to detect this lack of sampling.