# Problem 1-1: Data deduplication in free energy calculations

## 1. Overlapped time frames in a MD simulation

This notebook discusses whether overlapped time frames need to be deduplicated when calculating free energy differences. Specifically, "overlapped" or "duplicate" time frames arise whenever a simulation is interrupted before reaching the final time frame, for example due to a timeout or a GROMACS error, and is then restarted. Restarting such a simulation requires a checkpoint file. Because the simulation is typically restarted from the last saved checkpoint rather than from the exact point where it stopped, the time frames between the last checkpoint and the point of interruption are written twice.

To demonstrate this, I use the simple system from the previous Problem 1, a molecule composed of 4 uncharged vdW sites. The folder `Data/Problem_1-1/interrupted_simulation` contains the output files of a 2D alchemical metadynamics simulation of this system. The simulation was manually interrupted roughly halfway through and was later extended to the intended length of 20 ns.

Below we first check the file COLVAR obtained from this simulation:

In [1]:
import plumed
data = plumed.read_as_pandas('Data/Problem_1-1/interrupted_simulation/COLVAR')
data



| | time | theta | lambda | metad.bias |
|---|---|---|---|---|
| 0 | 0.00000 | -3.032395 | 0.0 | 0.000000 |
| 1 | 0.02000 | -3.073884 | 0.0 | 0.000000 |
| 2 | 0.04000 | 3.050476 | 0.0 | 0.000000 |
| 3 | 0.06000 | 2.953665 | 0.0 | 0.000000 |
| 4 | 0.08000 | 2.978590 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... |
| 1090377 | 19999.92095 | -2.340282 | 6.0 | 341.633030 |
| 1090378 | 19999.94095 | -2.279066 | 6.0 | 339.879673 |
| 1090379 | 19999.96095 | -2.268446 | 7.0 | 345.509855 |
| 1090380 | 19999.98095 | -2.196914 | 6.0 | 337.667748 |


With `STRIDE=10` and `dt=0.002` (ps), 20 ns of simulation should generate 1000001 data points (including the initial time frame). However, the COLVAR file of the interrupted simulation has 1090382 frames because a substantial number of time frames are overlapped. As can be checked below, the simulation stopped at 9647.28 ps, while the last checkpoint had only been written at 7839.68 ps, since GROMACS updates the checkpoint file only every 15 minutes by default. As a result, when the simulation was restarted from the checkpoint file, it continued from 7839.68 ps instead of 9647.28 ps, which duplicated the time frames from 7839.68 ps to 9647.28 ps.
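The frame counts quoted above can be reproduced with a short back-of-the-envelope calculation. The sketch below uses only the numbers stated in the text; the variable names are chosen here for illustration and do not appear in the simulation inputs.

```python
# Values taken from the text above.
dt = 0.002               # MD time step (ps)
stride = 10              # PLUMED output stride (steps)
total_time = 20000.0     # simulation length (ps), i.e. 20 ns
t_checkpoint = 7839.68   # time of the last checkpoint before the interruption (ps)
t_stop = 9647.28         # time at which the simulation stopped (ps)

interval = dt * stride   # time between two consecutive COLVAR frames (ps)

# round() guards against floating-point noise in the divisions
n_expected = round(total_time / interval) + 1                 # includes the frame at t = 0
n_duplicated = round((t_stop - t_checkpoint) / interval) + 1  # both end points are written twice
n_total = n_expected + n_duplicated

print(n_expected, n_duplicated, n_total)
```

The sum of the expected and the duplicated frames matches the number of frames found in the COLVAR file.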

In [2]:
data.loc[482360:482370]

| | time | theta | lambda | metad.bias |
|---|---|---|---|---|
| 482360 | 9647.200458 | -0.296379 | 2.0 | 243.001426 |
| 482361 | 9647.220458 | -0.251398 | 1.0 | 245.201731 |
| 482362 | 9647.240458 | -0.225176 | 2.0 | 244.034849 |
| 482363 | 9647.260458 | -0.27779 | 0.0 | 247.643763 |
| 482364 | 9647.280458 | -0.374748 | 1.0 | 243.537231 |
| 482365 | 7839.680372 | -2.939678 | 1.0 | 249.620467 |
| 482366 | 7839.700372 | -3.086773 | 1.0 | 250.748685 |
| 482367 | 7839.720372 | 3.026472 | 0.0 | 254.126936 |
| 482368 | 7839.740372 | 2.82896 | 0.0 | 252.104507 |
| 482369 | 7839.760372 | 2.571524 | 0.0 | 246.34669 |
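Rather than locating the restart by eye, one can search for the rows where the time stamp goes backwards. The following is a minimal sketch in plain Python; `find_restarts` is a hypothetical helper written for this illustration, and the time stamps are copied from the excerpt above. On the full data frame, the same test could be done by comparing each time with the previous one, for example with `data["time"].diff() < 0`.

```python
def find_restarts(times):
    """Return the positions at which the time stamp goes backwards."""
    return [i for i in range(1, len(times)) if times[i] < times[i - 1]]

# Time stamps copied from rows 482360-482369 of the table above
times = [9647.200458, 9647.220458, 9647.240458, 9647.260458, 9647.280458,
         7839.680372, 7839.700372, 7839.720372, 7839.740372, 7839.760372]

restarts = find_restarts(times)   # positions are relative to this excerpt
first_row = 482360                # index of the first row of the excerpt
print([first_row + i for i in restarts])
```

A simulation that was never interrupted should give an empty list.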


## 2. The influence of overlapped time frames on free energy calculations

Overlapped time frames could influence the average bias used for reweighting in free energy calculations. To estimate the free energy difference correctly, I assumed that we should discard the first occurrence of the overlapped time frames so that a correct and continuous CV time series is considered. Given that the statistics of the duplicate time frames should be roughly the same, I expected the results obtained with and without deduplication to be close to each other. However, as demonstrated below, the free energy difference calculated with data deduplication is statistically different from the one calculated without it. It is also inconsistent with the benchmark obtained from a simulation without any interruption.

Below are the functions we previously used for free energy calculations:

In [3]:
import numpy as np
np.random.seed(1994) # makes notebook reproducible
kBT = 2.478956208925815

def analyze(traj, n_blocks, discard=0):
    n = int(len(traj) * (1.0 - discard))   # number of data points considered
    # make sure the number of frames is a multiple of n_blocks (by dropping a few more of the earliest frames)
    n = (n // n_blocks) * n_blocks
    bias = np.array(traj["metad.bias"])
    bias -= np.max(bias) # avoid overflows
    w = np.exp(bias / kBT)[-n:].reshape((n_blocks, -1)) # weight of each frame, shape: (n_blocks, frames per block)
    
    # A: coupled state, B: uncoupled state
    isA = np.array(traj["lambda"] == 0)[-n:].reshape((n_blocks, -1)) # True if in A (np.average treats booleans as 0 or 1)
    isB = np.array(traj["lambda"] == np.max(traj["lambda"]))[-n:].reshape((n_blocks, -1)) # True if in B
    
    B = 200 # number of bootstrap iterations
    boot = np.random.choice(n_blocks, size=(B, n_blocks))  # resample block indices with replacement; size is the output shape
    popA = np.average(isA[boot], axis=(1,2), weights=w[boot])  # Note that isA[boot] is a 3D array
    popB = np.average(isB[boot], axis=(1,2), weights=w[boot])  # shapes of popA and popB: (B,)

    df = np.log(popA / popB) # this is in kBT units
    popA0 = np.average(isA, weights=w)
    popB0 = np.average(isB, weights=w)
    return np.log(popA0 / popB0), np.std(df)

def time_average(hills, t0=0.75):
    # time-averaged potential; with the default t0, the average is taken over the final 25% of the hills
    n0 = int(len(hills) * t0)   # number of hills that keep their full height
    w = np.hstack((np.ones(n0), np.linspace(1, 0, len(hills) - n0)))  # weight 1 for the first n0 hills, then a linear ramp from 1 to 0
    hills = hills.copy()
    hills.height *= w
    return hills
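For reference, the estimator evaluated by `analyze` can be written out explicitly. The notation below (frame index $t$, bias $V_t$ read from the `metad.bias` column, indicator function $\mathbb{1}_X$) is introduced here only for illustration; the expressions themselves are read directly off the code above.

```latex
w_t = \exp\!\left(\frac{V_t - \max_s V_s}{k_B T}\right), \qquad
p_X = \frac{\sum_t w_t \, \mathbb{1}_X(\lambda_t)}{\sum_t w_t}, \qquad
\Delta f = \ln\frac{p_A}{p_B}
         = \ln\frac{\sum_t w_t \, \mathbb{1}_A(\lambda_t)}{\sum_t w_t \, \mathbb{1}_B(\lambda_t)}
```

Here A is the coupled state ($\lambda = 0$), B is the uncoupled state (the largest value of $\lambda$), and the sums run over the frames retained after discarding the initial fraction. The shift by $\max_s V_s$ only prevents overflow and cancels in the ratio. The reported uncertainty is the standard deviation of $\Delta f$ over 200 bootstrap resamples of the blocks.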

With the functions above, we can calculate the free energy difference with or without data deduplication, as elaborated below.

### 2-1. Method 1: Free energy calculations without data deduplication

Here we repeat exactly what we did in the previous Problem 1 notebook, that is, the analysis without any data deduplication.

#### Step 1: Generate `HILLS_2D_modified`.

In [4]:
hills = plumed.read_as_pandas('Data/Problem_1-1/interrupted_simulation/HILLS_2D')
hills_avg = time_average(hills, t0=0.8)
plumed.write_pandas(hills_avg, 'Data/Problem_1-1/interrupted_simulation/without_deduplication/HILLS_2D_modified')
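To make the effect of `t0=0.8` concrete, the weights that `time_average` multiplies into the hill heights can be reproduced on a toy series. This is a plain-Python sketch that is meant to mirror the `np.hstack`/`np.linspace` line of `time_average`; `ramp_weights` is a hypothetical helper, and the series of 20 hills is made up for illustration.

```python
def ramp_weights(n_hills, t0):
    """Plain-Python equivalent of the weights built inside time_average."""
    n0 = int(n_hills * t0)   # hills that keep their full height
    m = n_hills - n0         # hills in the final ramp
    if m == 1:
        ramp = [1.0]         # np.linspace(1, 0, 1) returns [1.]
    else:
        ramp = [1.0 - i / (m - 1) for i in range(m)]  # linear ramp from 1 to 0
    return [1.0] * n0 + ramp

weights = ramp_weights(20, t0=0.8)
print(weights[:16])   # first 80% of the hills: full weight
print(weights[16:])   # last 20% of the hills: scaled down linearly to zero
```

Since the weights are assigned by row position, the number and order of the rows in `HILLS_2D` determine which hills are scaled down.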



#### Step 2: Generate `COLVAR_SUM_BIAS` that contains the average bias for reweighting

In [5]:
%%bash
source /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2/sourceme.sh
cd Data/Problem_1-1/interrupted_simulation/without_deduplication/
plumed driver --plumed plumed_sum_bias.dat --noatoms

PLUMED: PLUMED is starting
PLUMED: Version: 2.8.0-dev (git: 63008b018) compiled on Nov 21 2020 at 02:44:56
PLUMED: Please cite these papers when using PLUMED [1][2]
PLUMED: For further information see the PLUMED web page at http://www.plumed.org
PLUMED: Root: /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2/
PLUMED: For installed feature, see /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2//src/config/config.txt
PLUMED: Molecular dynamics engine: driver
PLUMED: Precision of reals: 8
PLUMED: Running over 1 node
PLUMED: Number of threads: 1
PLUMED: Cache line size: 512
PLUMED: Number of atoms: 0
PLUMED: File suffix: 
PLUMED: FILE: plumed_sum_bias.dat
PLUMED: Action READ
PLUMED:   with label theta
PLUMED:   with stride 1
PLUMED:   reading data from file ../COLVAR
PLUMED:   reading value theta and storing as theta
PLUMED: Action READ
PLUMED:   with label lambda
PLUMED:   with stride 1
PLUMED:   reading data from file ../COLVAR
PLUMED:   reading value lambda and storing as lambda
PLU

#### Step 3: Calculate the free energy difference and its uncertainty

In [6]:
results = analyze(plumed.read_as_pandas('Data/Problem_1-1/interrupted_simulation/without_deduplication/COLVAR_SUM_BIAS'), n_blocks=50, discard=0.2)
print(f'The free energy difference obtained without data deduplication is {results[0]:.3f} +/- {results[1]:.3f}kT.')



The free energy difference obtained without data deduplication is -2.587 +/- 0.073kT.


### 2-2. Method 2: Free energy calculations WITH data deduplication

Below we first write a function to read and deduplicate PLUMED outputs as needed:

In [7]:
def read_plumed_output(plumed_output):
    """
    This function reads a PLUMED output file and removes duplicate time frames. If several
    rows share the same time (e.g. because the simulation was restarted from a checkpoint),
    only the last occurrence of each time frame is kept. If there are no duplicates, this
    function does nothing but read in the data. The file on disk is not modified.

    Parameters
    ----------
    plumed_output (str): The filename of the PLUMED output (such as HILLS or COLVAR files) to be read

    Returns
    -------
    data (pandas.DataFrame): The deduplicated data
    """
    # Note that the lines below have no effect on the data if the time series has no duplicates or N/A values.
    data_original = plumed.read_as_pandas(plumed_output)
    data = data_original[~data_original["time"].duplicated(keep='last')]  # keep only the last occurrence of each time frame
    data = data.dropna()        # drop rows with N/A values, if any
    data = data.reset_index(drop=True)   # renumber the rows from 0 without keeping the old index as a column

    return data

Here is a quick look at the result of this function:

In [8]:
data = read_plumed_output('Data/Problem_1-1/interrupted_simulation/COLVAR')
data



| | time | theta | lambda | metad.bias |
|---|---|---|---|---|
| 0 | 0.00000 | -3.032395 | 0.0 | 0.000000 |
| 1 | 0.02000 | -3.073884 | 0.0 | 0.000000 |
| 2 | 0.04000 | 3.050476 | 0.0 | 0.000000 |
| 3 | 0.06000 | 2.953665 | 0.0 | 0.000000 |
| 4 | 0.08000 | 2.978590 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... |
| 999996 | 19999.92095 | -2.340282 | 6.0 | 341.633030 |
| 999997 | 19999.94095 | -2.279066 | 6.0 | 339.879673 |
| 999998 | 19999.96095 | -2.268446 | 7.0 | 345.509855 |
| 999999 | 19999.98095 | -2.196914 | 6.0 | 337.667748 |


As shown above, after data deduplication, the total number of time frames becomes 1000001. In addition, it can be checked that the first occurrences of the duplicate time frames were discarded.
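The rule applied by `duplicated(keep='last')` can be illustrated on a toy series in which a run stops at t = 4 and the restart resumes from t = 2. The sketch below is a plain-Python stand-in written for this illustration, not the pandas implementation; `keep_last_occurrence` and the `run1`/`run2` labels are hypothetical.

```python
def keep_last_occurrence(rows):
    """Keep, for every time stamp, only the row that was written last.

    rows is a list of (time, value) tuples in file order.
    """
    seen = set()
    kept = []
    for time, value in reversed(rows):   # walk backwards so that later rows win
        if time not in seen:
            seen.add(time)
            kept.append((time, value))
    kept.reverse()                       # restore file order
    return kept

# Toy series: the first run stops at t = 4, the restart resumes from t = 2
rows = [(0, "run1"), (1, "run1"), (2, "run1"), (3, "run1"), (4, "run1"),
        (2, "run2"), (3, "run2"), (4, "run2"), (5, "run2"), (6, "run2")]

dedup = keep_last_occurrence(rows)
times = [t for t, _ in dedup]
print(dedup)
```

In the toy series, the overlapped frames at t = 2, 3 and 4 are taken from the restarted run, and the resulting time series increases monotonically.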

Note that the only difference between Method 2 and Method 1 is that in Method 2, instead of using `plumed.read_as_pandas` to read in PLUMED outputs, we read in the files using our own function `read_plumed_output`, which deduplicates the data.

#### Step 1: Deduplicate `HILLS_2D` and generate `HILLS_2D_modified`
Before generating `HILLS_2D_modified`, we need to deduplicate the overlapped time frames as below.

In [9]:
hills = read_plumed_output('Data/Problem_1-1/interrupted_simulation/HILLS_2D')
hills_avg = time_average(hills, t0=0.8)
plumed.write_pandas(hills_avg, 'Data/Problem_1-1/interrupted_simulation/with_deduplication/HILLS_2D_modified')



#### Step 2: Deduplicate `COLVAR` and generate `COLVAR_SUM_BIAS` that contains the average bias for reweighting

In [10]:
%%bash
source /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2/sourceme.sh
cd Data/Problem_1-1/interrupted_simulation/with_deduplication/
plumed driver --plumed plumed_sum_bias.dat --noatoms

PLUMED: PLUMED is starting
PLUMED: Version: 2.8.0-dev (git: 63008b018) compiled on Nov 21 2020 at 02:44:56
PLUMED: Please cite these papers when using PLUMED [1][2]
PLUMED: For further information see the PLUMED web page at http://www.plumed.org
PLUMED: Root: /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2/
PLUMED: For installed feature, see /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2//src/config/config.txt
PLUMED: Molecular dynamics engine: driver
PLUMED: Precision of reals: 8
PLUMED: Running over 1 node
PLUMED: Number of threads: 1
PLUMED: Cache line size: 512
PLUMED: Number of atoms: 0
PLUMED: File suffix: 
PLUMED: FILE: plumed_sum_bias.dat
PLUMED: Action READ
PLUMED:   with label theta
PLUMED:   with stride 1
PLUMED:   reading data from file ../COLVAR
PLUMED:   reading value theta and storing as theta
PLUMED: Action READ
PLUMED:   with label lambda
PLUMED:   with stride 1
PLUMED:   reading data from file ../COLVAR
PLUMED:   reading value lambda and storing as lambda
PLU

#### Step 3: Calculate the free energy difference and its uncertainty

In [11]:
results = analyze(plumed.read_as_pandas('Data/Problem_1-1/interrupted_simulation/with_deduplication/COLVAR_SUM_BIAS'), n_blocks=50, discard=0.2)
print(f'The free energy difference obtained with data deduplication is {results[0]:.3f} +/- {results[1]:.3f}kT.')


The free energy difference obtained with data deduplication is -1.756 +/- 0.065kT.


### 2-3. Comparison of the methods

Here is a summary of the results obtained above:
- Without data deduplication of HILLS and COLVAR: **-2.587 +/- 0.073 kT**
- With data deduplication of HILLS and COLVAR: **-1.756 +/- 0.065 kT**

As mentioned above, my understanding is that the data should be deduplicated so that the time series considered is continuous. However, the result obtained with data deduplication does not seem to be correct. To show this, I performed another simulation with exactly the same input files, whose PLUMED outputs are saved in the folder `Data/Problem_1-1/uninterrupted_simulation`. Since this simulation was not interrupted at all, it contains no duplicate time frames, and we can use it as a reference for our free energy calculations. Below we calculate the free energy difference of the system from this simulation.

In [12]:
hills = plumed.read_as_pandas('Data/Problem_1-1/uninterrupted_simulation/HILLS_2D')
hills_avg = time_average(hills, t0=0.8)
plumed.write_pandas(hills_avg, 'Data/Problem_1-1/uninterrupted_simulation/HILLS_2D_modified')



In [13]:
%%bash
source /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2/sourceme.sh
cd Data/Problem_1-1/uninterrupted_simulation/
plumed driver --plumed plumed_sum_bias.dat --noatoms

PLUMED: PLUMED is starting
PLUMED: Version: 2.8.0-dev (git: 63008b018) compiled on Nov 21 2020 at 02:44:56
PLUMED: Please cite these papers when using PLUMED [1][2]
PLUMED: For further information see the PLUMED web page at http://www.plumed.org
PLUMED: Root: /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2/
PLUMED: For installed feature, see /Users/Wei-TseHsu/Documents/Software/PLUMED/plumed2//src/config/config.txt
PLUMED: Molecular dynamics engine: driver
PLUMED: Precision of reals: 8
PLUMED: Running over 1 node
PLUMED: Number of threads: 1
PLUMED: Cache line size: 512
PLUMED: Number of atoms: 0
PLUMED: File suffix: 
PLUMED: FILE: plumed_sum_bias.dat
PLUMED: Action READ
PLUMED:   with label theta
PLUMED:   with stride 1
PLUMED:   reading data from file COLVAR
PLUMED:   reading value theta and storing as theta
PLUMED: Action READ
PLUMED:   with label lambda
PLUMED:   with stride 1
PLUMED:   reading data from file COLVAR
PLUMED:   reading value lambda and storing as lambda
PLUMED: A

In [14]:
results = analyze(plumed.read_as_pandas('Data/Problem_1-1/uninterrupted_simulation/COLVAR_SUM_BIAS'), n_blocks=500, discard=0.2)
print(f'The free energy difference is {results[0]:.3f} +/- {results[1]:.3f}kT.')



The free energy difference is -2.335 +/- 0.069kT.
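To put the three estimates on a common footing, each result from the interrupted simulation can be compared with the reference in units of the combined uncertainty. This is a rough sketch that assumes the two estimates being compared have independent, approximately Gaussian errors; `deviation_in_sigma` is a hypothetical helper, and the numbers are copied from the cell outputs above.

```python
import math

# (estimate, uncertainty) in kT, copied from the cell outputs above
without_dedup = (-2.587, 0.073)
with_dedup = (-1.756, 0.065)
reference = (-2.335, 0.069)

def deviation_in_sigma(result, ref):
    """Difference between two estimates divided by their combined uncertainty."""
    diff = result[0] - ref[0]
    sigma = math.sqrt(result[1] ** 2 + ref[1] ** 2)
    return diff / sigma

z_without = deviation_in_sigma(without_dedup, reference)
z_with = deviation_in_sigma(with_dedup, reference)
print(f"without deduplication: {z_without:+.1f} sigma")
print(f"with deduplication:    {z_with:+.1f} sigma")
```

Under these assumptions, the estimate without deduplication lies about 2.5 combined standard deviations from the reference, and the estimate with deduplication lies about 6 away.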


As shown above, our analysis method with data deduplication fails to produce a result consistent with the reference. This leaves me with the following two questions:
- **Question 1**: When calculating free energies from a simulation that was interrupted at some point due to a timeout or a crash, should we deduplicate the overlapped time frames or not?
- **Question 2**: If the data should be deduplicated, what caused the result to deviate from the reference?

I have been looking into this problem for a while, but I have not been able to resolve it. It is quite common for a simulation to be terminated and restarted at some point, especially when running a long simulation. Therefore, it would be very helpful if you could share your experience with this. I would greatly appreciate your insights!