# Research data supporting "Systematic discrepancies between reference methods on non-covalent interactions within the S66 dataset"

This notebook accompanies the paper **Systematic discrepancies between reference methods on non-covalent interactions within the S66 dataset**. It is available on GitHub at https://github.com/zenandrea/FNDMC-S66 and can also be run on [Colab](https://colab.research.google.com/github/zenandrea/FNDMC-S66/blob/main/analyse.ipynb).

### Abstract

The accurate treatment of non-covalent interactions is necessary to model a wide range of applications, from molecular crystals to surface catalysts to aqueous solutions and many more.
Quantum diffusion Monte Carlo (DMC) and coupled cluster theory with single, double and perturbative triple excitations [CCSD(T)] are considered two widely-trusted methods for treating non-covalent interactions.
However, while they have been well-validated for small molecules, recent work has indicated that these two methods can disagree by more than 7.5 kcal/mol for larger systems.
The origin of this discrepancy remains unknown, and the absence of systematic comparisons, especially for medium-sized complexes, prevents us from identifying the systems where such disagreement may occur, as well as the possible extent of these differences.
In this work, we leverage the latest developments in DMC to compute interaction energies for the entire S66 dataset, containing 66 medium-sized complexes with a balanced representation of dispersion and electrostatic interactions.
Comparison to CCSD(T) reveals systematic trends, with DMC predicting stronger binding than CCSD(T) for electrostatic-dominated systems, while the binding becomes weaker for dispersion-dominated systems.
We show that the relative strength of this discrepancy is correlated with the ratio of electrostatic to dispersion interactions, as obtained from energy decomposition analysis methods.
These new insights set the stage for guiding future developments in DMC and CCSD(T), as well as in the wider electronic structure theory community.

## Table of Contents
- [Analysis of the DMC data for the S66 dataset](#analysis-of-the-dmc-data-for-the-s66-dataset)
- [SI - Estimating CCSD(T) deformation energy](#si---estimating-ccsdt-deformation-energy)
- [SI - Validating the use of CCSD(T) deformation energy](#si---validating-the-use-of-ccsdt-deformation-energy)
- [SI - Comparing DLA and TM localization schemes for H<sub>2</sub>O and AcOH dimer](#si---comparing-dla-and-tm-localization-schemes-for-h2o-and-acoh-dimer)
- [SI - Previous CCSD(T) literature and final CCSD(T) and CCSD(cT)-fit estimates](#si---previous-ccsdt-literature-and-final-ccsdt-and-ccsdct-fit-estimates)
- [SI - Timestep dependence for the binding energy of each S66 system](#si---timestep-dependence-for-the-binding-energy-of-each-s66-system)
- [SI - Acetic acid dimer validation](#si---acetic-acid-dimer-validation)
- [MAIN - Comparison of DMC against CCSD(T) and CCSD(cT)](#main---comparison-of-dmc-against-ccsdt-and-ccsdct)
- [SI - Comparison of DMC against CCSD(T)](#si---comparison-of-dmc-against-ccsdt)


In [None]:
# Check if we are in Google Colab environment
try:
    import google.colab
    IN_COLAB = True
    usetex = False
except ImportError:
    import os
    IN_COLAB = False
    if os.path.expanduser('~') == '/home/shixubenjamin':
        usetex = True
    else:
        usetex = False


# If in Google Colab, install the necessary data and set up the necessary environment
if IN_COLAB:
    import os

    # Replace 'YOUR_PAT_HERE' with your own personal access token (PAT) at runtime; never commit a real token
    os.environ['GITHUB_TOKEN'] = 'YOUR_PAT_HERE'

    # Retrieve the token from environment variables
    token = os.environ['GITHUB_TOKEN']
    repo_url = "https://github.com/zenandrea/FNDMC-S66.git"

    # Insert the token into the repository URL
    import getpass

    from urllib.parse import quote

    # URL-encode the token to handle special characters
    token_encoded = quote(token)

    # Construct the authenticated URL
    authenticated_url = repo_url.replace("https://", f"https://{token_encoded}@")

    # Clone the repository
    !git clone {authenticated_url}
    %cd /content/FNDMC-S66
    ! sudo apt-get install texlive-latex-recommended 
    ! sudo apt-get install dvipng texlive-latex-extra texlive-fonts-recommended  
    ! wget http://mirrors.ctan.org/macros/latex/contrib/type1cm.zip 
    ! unzip type1cm.zip -d /tmp/type1cm 
    ! cd /tmp/type1cm/type1cm/ && sudo latex type1cm.ins
    ! sudo mkdir /usr/share/texmf/tex/latex/type1cm 
    ! sudo cp /tmp/type1cm/type1cm/type1cm.sty /usr/share/texmf/tex/latex/type1cm 
    ! sudo texhash 
    ! apt install cm-super


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from Scripts.def_colors import map_DMC, dmc_color
from Scripts.define_setup import *
from Scripts.myfit import fit_err, fun_lin, fun_quad, fun_cub, fun_quart, get_chi2_alpha_parfun
from Scripts.jup_plot import *
import re

# Set the display option for maximum rows (you can adjust this based on your needs)
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)

if usetex:
    textrue_import()

replot_graphs = False

# Load the data
dimer_info = pd.read_csv('Data/Collated_DMC_Energies/dim_info.csv', index_col=0)
monomer_info = pd.read_csv('Data/Collated_DMC_Energies/mol_info.csv', index_col=0)

dimer_dmc_total_energy_data = pd.read_csv('Data/Collated_DMC_Energies/results_dim.csv', index_col=0)
monomer_dmc_total_energy_data = pd.read_csv('Data/Collated_DMC_Energies/results_mol.csv', index_col=0)
monomer_geometry_correction_data = pd.read_csv('Data/Collated_DMC_Energies/delta_mol_ref.csv', index_col=0)

# Define formatted names for the dimer systems
formatted_name_list = []
for system_id in range(1,67):
    # Raw strings keep LaTeX commands such as \pi and \cdots from being parsed as escape sequences
    dimer_name = dimer_info.loc[system_id,'name'].replace('pi', r'$\pi$')
    dimer_name = dimer_name.split('_')

    # Subscript the digits in the monomer pair (e.g. MeNH2 -> MeNH$_2$), then split it into the two monomers
    monomer_pair = re.sub(r'(\d+)', r'$_\1$', dimer_name[1]).split('-')
    part2 = monomer_pair[0]
    part3 = monomer_pair[1]

    if len(dimer_name) > 2:
        additional_info = dimer_name[2]
        formatted_name = part2 + r'$\cdots$' + part3 + ' (' + additional_info + ')'
    else:
        formatted_name = part2 + r'$\cdots$' + part3
    formatted_name_list.append(formatted_name)

dimer_info['formatted_name'] = formatted_name_list
dimer_info = dimer_info[['formatted_name','name','mol1','mol2','Nel','Nelv','atoms']]
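
The label formatting in the loop above can be checked in isolation. The sketch below wraps the same string operations in a helper; the function name `format_dimer_name` and the two raw labels are illustrative assumptions, since the real labels come from `dim_info.csv`, whose exact format is not shown here.

```python
import re

def format_dimer_name(raw_name: str) -> str:
    """Turn a raw dimer label (hypothetical format) into a LaTeX-ready plot label."""
    pieces = raw_name.replace('pi', r'$\pi$').split('_')
    # pieces[0] is the numeric prefix; pieces[1] holds "Monomer1-Monomer2"
    mon1, mon2 = re.sub(r'(\d+)', r'$_\1$', pieces[1]).split('-')
    label = mon1 + r'$\cdots$' + mon2
    # An optional third piece describes the binding motif, e.g. "pi-pi"
    if len(pieces) > 2:
        label += ' (' + pieces[2] + ')'
    return label

print(format_dimer_name('03_Water-MeNH2'))
print(format_dimer_name('24_Benzene-Benzene_pi-pi'))
```

Raw strings are used throughout so that `\pi` and `\cdots` reach matplotlib unchanged.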

### Analysis of the DMC data for the S66 dataset

In [2]:
# Compute the binding energy for the S66
# Filter dmc data to only include data with dmc_type = 'DMCdla5' and dmc_Jas = 'Jopt'
filtered_dimer_dmc_total_energy_data = dimer_dmc_total_energy_data[(dimer_dmc_total_energy_data['dmc_type'] == 'DMCdla5') & (dimer_dmc_total_energy_data['dmc_Jas'] == 'Jopt')]

dmc_energy_data = {system_id: {'total_energy_dimer': 0, 'total_energy_monomer_1':0, 'total_energy_monomer_2':0, 'binding_energy': 0} for system_id in range(1,67)}

system_name_dict = {system_id: {'Original': '', 'New': ''} for system_id in range(1,67)}

# Loop over the dimers
for system_id, system_data in filtered_dimer_dmc_total_energy_data.groupby('ID'):
    system_data = system_data.sort_values('tau', ascending=False)
    system_data.set_index( 'tau', inplace=True )
    system_name = dimer_info.loc[system_id,'name']
    
    monomer_data = {1:0, 2:0}
    monomer_geometry_correction = {1:0, 2:0}
    # Get the monomer data
    for monomer_num in [1,2]:
        monomer_id = f'{system_id:02d}_{monomer_num}'
        monomer_name = dimer_info.loc[system_id,f'mol{monomer_num}']
        monomer_ref_id = monomer_info.loc[monomer_name, 'ref']
        monomer_ref_data = monomer_dmc_total_energy_data[(monomer_dmc_total_energy_data['mol_id'] == monomer_ref_id) & (monomer_dmc_total_energy_data['dmc_type'] == 'DMCdla5') & (monomer_dmc_total_energy_data['dmc_Jas'] == 'Jopt')].sort_values('tau', ascending=False)
        monomer_ref_data.set_index( 'tau', inplace=True )
        # Add the geometry correction
        monomer_ref_data['ene'] = monomer_ref_data['ene'] + monomer_geometry_correction_data[monomer_geometry_correction_data['mol_id'] == monomer_id]['ene-ref'].values[0]
        monomer_data[monomer_num] = monomer_ref_data
    dmc_energy_data[system_id]['total_energy_dimer'] = system_data.copy()
    dmc_energy_data[system_id]['total_energy_monomer_1'] = monomer_data[1]
    dmc_energy_data[system_id]['total_energy_monomer_2'] = monomer_data[2]
    # Compute the binding energy
    system_data['binding_energy'] = system_data['ene'] - monomer_data[1]['ene'] - monomer_data[2]['ene']
    system_data['binding_energy_err'] = (system_data['err']**2 + monomer_data[1]['err']**2 + monomer_data[2]['err']**2)**0.5
    dmc_energy_data[system_id]['binding_energy'] = system_data
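
The arithmetic in the cell above reduces to a difference of total energies, with the stochastic errors combined in quadrature. Below is a minimal sketch with made-up numbers; all values are hypothetical, and the `ene-ref` geometry correction that the notebook adds to the monomer energies is omitted for brevity.

```python
import numpy as np

# Hypothetical total energies [kcal/mol] and stochastic errors at two time steps
tau = np.array([0.03, 0.01])
dimer_ene, dimer_err = np.array([-110.0, -110.4]), np.array([0.03, 0.04])
mono1_ene, mono1_err = np.array([-50.0, -50.1]), np.array([0.02, 0.02])
mono2_ene, mono2_err = np.array([-55.0, -55.2]), np.array([0.02, 0.03])

# Binding energy: dimer minus the two monomers, evaluated at matching time steps
binding_energy = dimer_ene - mono1_ene - mono2_ene

# Independent stochastic errors add in quadrature
binding_energy_err = np.sqrt(dimer_err**2 + mono1_err**2 + mono2_err**2)

for t, e, s in zip(tau, binding_energy, binding_energy_err):
    print(f'tau = {t:.2f}: E_bind = {e:.2f} +/- {s:.2f} kcal/mol')
```

In the notebook the same operations act on DataFrame columns indexed by `tau`, so pandas aligns the rows by time step automatically.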

### SI - Estimating CCSD(T) deformation energy

In [3]:
counter=0
final_monomer_total_energy = {}


if replot_graphs:
    latex_input_str = ''

    for mol, monomer in monomer_info.groupby('mol'):
        monomer_dimer_index = monomer['ref'].tolist()[0].split('_')[0]
        monomer_name = re.sub(r'(\d+)', r'$_\1$', monomer.index.tolist()[0])
        monomer_data = monomer_dmc_total_energy_data[(monomer_dmc_total_energy_data['mol_id'] == monomer['ref'].values[0]) & (monomer_dmc_total_energy_data['dmc_type'] == 'DMCdla5') & (monomer_dmc_total_energy_data['dmc_Jas'] == 'Jopt')].sort_values('tau', ascending=False)
        monomer_data.set_index( 'tau', inplace=True )
        fig, ax = plt.subplots(figsize=(3.365,2), dpi=300,constrained_layout=True)

        # Fit the linear data
        fitting_data = monomer_data[ monomer_data.index <= 0.015 ]
        xdata = fitting_data.index.to_numpy()
        ydata = fitting_data['ene'].to_numpy()
        sigma = fitting_data['err'].to_numpy()
        xfit1, m1, s1 = fit_err(xdata,ydata,sigma,fitfun=fun_lin)

        # Fit the cubic data
        fitting_data = monomer_data[ monomer_data.index <= 0.11]
        xdata = fitting_data.index.to_numpy()
        ydata = fitting_data['ene'].to_numpy()
        sigma = fitting_data['err'].to_numpy()
        xfit3, m3, s3 = fit_err(xdata,ydata,sigma,fitfun=fun_cub)

        # Determine which fit is the best fit
        lin_cub_diff = abs(m1[0] - m3[0])
        if lin_cub_diff > s3[0]:
            system_error = lin_cub_diff
            error_type = r'$\Delta_\textrm{cubic fit}^\textrm{linear fit}$'
        else:
            system_error = s3[0]
            error_type = r'$\sigma_\textrm{cubic fit}$'
        extrap_system_total_energy = m3[0]

        final_monomer_total_energy[monomer['ref'].tolist()[0]] = {
            'Monomer': monomer_name,
            'Dimer Geometry': dimer_info.loc[int(monomer_dimer_index),'formatted_name'] + f" (ID {int(monomer_dimer_index)})",
            'Order': monomer['ref'].values[0].split('_')[1],
            'Total Energy': m3[0],
            'Total Energy Error': system_error,
            # Raw f-string so that the LaTeX \pm is not treated as an escape sequence
            'Formatted Total Energy': rf'{m3[0]:.2f}$\pm${system_error:.2f}',
            'Error Type': error_type,
        }


        # Plot the actual computed data
        ax.errorbar(monomer_data.index.tolist(), monomer_data['ene'].values - extrap_system_total_energy, yerr=monomer_data['err'].values, fmt='o', color='black',markeredgecolor='none',markersize=4, label=r'DMC//DLA')

        # These panels show monomer total energies, so the legend reports E^tot rather than E^bind
        ax.plot(xfit1,m1 - extrap_system_total_energy,'--',color='blue', label=r'linear fit ($E^\textrm{tot}_{\tau \to 0}=$' + f'{m1[0]:.2f}' + r'${\pm}$' + f'{s1[0]:.2f})')
        ax.fill_between(xfit1,m1 - extrap_system_total_energy -1*s1,m1 - extrap_system_total_energy +1*s1,color='blue',alpha=0.2)

        ax.plot(xfit3,m3 - extrap_system_total_energy,'--',color='green', label=r'cubic fit ($E^\textrm{tot}_{\tau \to 0}=$' + f'{m3[0]:.2f}' + r'${\pm}$' + f'{s3[0]:.2f})')
        ax.fill_between(xfit3,m3 - extrap_system_total_energy -1*s3,m3 - extrap_system_total_energy +1*s3,color='green',alpha=0.2)

        ax.set_xlabel( 'DMC timestep [a.u.]' )
        ax.set_xticks( [0, 0.003, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 0.2, 0.3 ] )
        ax.set_xticklabels( [ '0', '3E-3', '0.01', '0.02', '0.03', '0.04', '0.05', '0.06', '0.1', '0.2', '0.3' ], rotation=90 )
        ax.set_xlim( [0,0.1*1.03] )
        ax.set_ylim([-5,5])
        ax.set_ylabel( 'Total Energy [kcal/mol]' )
        ax.legend(loc='lower left', fontsize=7)
        ax.set_title(f'{monomer_name}')

        counter +=1
        plt.savefig(f'Figures/Fig_SI_Monomer_{counter:02d}.png')

        latex_input_str += r"""\begin{figure}[!h]
\includegraphics[width=3.365in]{"""+ f"Figures/Fig_SI_Monomer_{counter:02d}.png" + r"""}
\caption{\label{fig:""" + f"monomer_{counter:02d}" + r"""} The time step dependence of the total energy of the """ + monomer_name + r""" monomer in the """ + dimer_info.loc[int(monomer_dimer_index),'formatted_name'] + f" dimer (ID {int(monomer_dimer_index)}) " + "geometry." + r"""}
\end{figure}
    
"""
    np.save('Data/Final/final_monomer_total_energy.npy', final_monomer_total_energy)

else:
    final_monomer_total_energy = np.load('Data/Final/final_monomer_total_energy.npy', allow_pickle=True).item()

In [4]:
# Convert the dictionary to a DataFrame
final_monomer_total_energy_df = pd.DataFrame(final_monomer_total_energy).T
final_monomer_total_energy_df = final_monomer_total_energy_df[['Monomer', 'Dimer Geometry', 'Order', 'Formatted Total Energy','Error Type']]
final_monomer_total_energy_df.columns = ['Monomer', 'Dimer Geometry', 'Order', 'Total Energy [kcal/mol]','Error Type']
latex_input_str = convert_df_to_latex_input(
    final_monomer_total_energy_df,
    start_input = '\\begin{table}',
    label = 'tab:monomer_tot_ene',
    caption = r'Total energy of the 14 monomers which make up the S66 dataset. These geometries are taken from specific dimer complexes within the S66 dataset that are identified in the table and the order in which the monomer appears (important for dimers consisting of the same molecule) is reported. The type of line used to extrapolate to the zero time step limit is also identified.',
    end_input = '\\end{table}',
    replace_input = {
    },
    center = True,
    df_latex_skip = 0,
    index=False,
    output_str = True,
    column_format = 'l' + 'r'*(len(final_monomer_total_energy_df.columns)-1)
)

with open('Tables/Table_SI_Monomer_tot_ene.tex', 'w') as f:
    f.write(latex_input_str)

final_monomer_total_energy_df

| | Monomer | Dimer Geometry | Order | Total Energy [kcal/mol] | Error Type |
|---|---|---|---|---|---|
| 21_1 | AcNH$_2$ | AcNH$_2$$\cdots$AcNH$_2$ (ID 21) | 1 | -25290.29$\pm$0.03 | $\sigma_\textrm{cubic fit}$ |
| 20_1 | AcOH | AcOH$\cdots$AcOH (ID 20) | 1 | -28725.26$\pm$0.05 | $\Delta_\textrm{cubic fit}^\textrm{linear fit}$ |
| 24_1 | Benzene | Benzene$\cdots$Benzene ($\pi$-$\pi$) (ID 24) | 1 | -23624.42$\pm$0.04 | $\sigma_\textrm{cubic fit}$ |
| 37_1 | Cyclopentane | Cyclopentane$\cdots$Neopentane (ID 37) | 1 | -21586.06$\pm$0.03 | $\sigma_\textrm{cubic fit}$ |
| 30_2 | Ethene | Benzene$\cdots$Ethene (ID 30) | 2 | -8610.43$\pm$0.02 | $\sigma_\textrm{cubic fit}$ |
| 32_2 | Ethyne | Uracil$\cdots$Ethyne (ID 32) | 2 | -7823.07$\pm$0.02 | $\sigma_\textrm{cubic fit}$ |
| 56_2 | MeNH$_2$ | Benzene$\cdots$MeNH$_2$ (NH-$\pi$) (ID 56) | 2 | -11671.48$\pm$0.04 | $\Delta_\textrm{cubic fit}^\textrm{linear fit}$ |
| 55_2 | MeOH | Benzene$\cdots$MeOH (OH-$\pi$) (ID 55) | 2 | -15103.82$\pm$0.02 | $\sigma_\textrm{cubic fit}$ |
| 36_1 | Neopentane | Neopentane$\cdots$Neopentane (ID 36) | 1 | -22350.00$\pm$0.04 | $\sigma_\textrm{cubic fit}$ |
| 34_1 | Pentane | Pentane$\cdots$Pentane (ID 34) | 1 | -22346.69$\pm$0.03 | $\sigma_\textrm{cubic fit}$ |
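
The "Error Type" column reflects how the zero-time-step value is chosen: the cubic-fit intercept is quoted, and its error is the larger of the cubic-fit uncertainty and the spread between the linear and cubic intercepts. The sketch below illustrates that rule with `numpy.polyfit` as a stand-in; it is not the repository's `fit_err` routine (whose internals are not shown here), and the helper name, the synthetic data, and the use of a full cubic polynomial are assumptions.

```python
import numpy as np

def extrapolate_to_zero(tau, ene, err, degree):
    """Weighted polynomial fit; returns the tau -> 0 intercept and its 1-sigma error."""
    coeffs, cov = np.polyfit(tau, ene, degree, w=1.0 / err, cov='unscaled')
    # np.polyfit orders coefficients from highest power to lowest, so the intercept is last
    return coeffs[-1], np.sqrt(cov[-1, -1])

# Hypothetical time-step data [kcal/mol], generated from a known cubic with intercept -5.0
tau = np.array([0.003, 0.01, 0.03, 0.05, 0.1])
err = np.full_like(tau, 0.03)
ene = -5.0 + 2.0 * tau + 40.0 * tau**3

# Same fitting windows as in the notebook: tau <= 0.015 (linear) and tau <= 0.11 (cubic)
lin_mask = tau <= 0.015
cub_mask = tau <= 0.11
e_lin, s_lin = extrapolate_to_zero(tau[lin_mask], ene[lin_mask], err[lin_mask], degree=1)
e_cub, s_cub = extrapolate_to_zero(tau[cub_mask], ene[cub_mask], err[cub_mask], degree=3)

# Quote the cubic intercept; take the larger of the fit error and the linear/cubic spread
final_error = max(abs(e_lin - e_cub), s_cub)
print(f'{e_cub:.2f} +/- {final_error:.2f}')
```

Passing `w=1/err` with `cov='unscaled'` treats the supplied errors as absolute uncertainties, which is the appropriate choice when they are trusted stochastic error bars.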


In [5]:
deformation_energy_data = {system_id: {'dimer name': 0,'mol1 name': 0, 'mol2 name': 0, 'mol1 deformation energy': 0, 'mol2 deformation energy': 0} for system_id in range(1,67)}

for i in range(1,67):
    formatted_name = dimer_info.loc[i,'formatted_name']
    mol_1_name = dimer_info.loc[i,'mol1']
    mol_2_name = dimer_info.loc[i,'mol2']
    mol_1_deformation_energy = monomer_geometry_correction_data.loc[monomer_geometry_correction_data['mol_id'] == f'{i:02d}_1', 'ene-ref'].values[0]
    mol_2_deformation_energy = monomer_geometry_correction_data.loc[monomer_geometry_correction_data['mol_id'] == f'{i:02d}_2','ene-ref'].values[0]
    deformation_energy_data[i] = {'dimer name': formatted_name, 'mol1 deformation energy': f'{mol_1_deformation_energy:.3f}', 'mol2 deformation energy': f'{mol_2_deformation_energy:.3f}'}
# Convert to a pandas DataFrame and then convert to a LaTeX string

deformation_energy_data_df = pd.DataFrame(deformation_energy_data).T
deformation_energy_data_df.columns = ['Dimer Name', r'$\Delta E_\textrm{mon. 1, def.}^\textrm{CCSD(T)}$ [kcal/mol]', r'$\Delta E_\textrm{mon. 2, def.}^\textrm{CCSD(T)}$ [kcal/mol]']
latex_input_str = convert_df_to_latex_input(
    deformation_energy_data_df,
    start_input = '\\begin{table}',
    label = 'tab:monomer_deformation_ene',
    caption = r'Deformation energy for the two monomers within each of the dimers of the S66 dataset. This energy is with respect to the geometry used in Table~\ref{tab:monomer_tot_ene}.',
    end_input = '\\end{table}',
    replace_input = {
    },
    center = True,
    df_latex_skip = 0,
    adjustbox = 0.9,
    index=True,
    output_str = True,
    column_format = 'll' + 'r'*(len(deformation_energy_data_df.columns)-1)
)


# Write the DataFrame to a latex input
latex_input_str = '\n'.join(latex_input_str.splitlines()[7:-4]) + '\n'

with open('Tables/Table_SI_Deformation_energy_table.tex', 'w') as f:
    f.write(r"""\LTcapwidth=\textwidth
    
\begin{longtable}{llrr}
\caption{\label{tab:monomer_deformation_ene}Deformation energy for the two monomers within each of the dimers of the S66 dataset. This energy is with respect to the geometry used in Table~\ref{tab:monomer_tot_ene}.} \\

\toprule
ID & Dimer Name & $\Delta E_\textrm{mon. 1, def.}^\textrm{CCSD(T)}$ [kcal/mol] & $\Delta E_\textrm{mon. 2, def.}^\textrm{CCSD(T)}$ [kcal/mol] \\
\midrule
\endfirsthead



\caption[]{(continued)} \\
\endhead

\multicolumn{4}{r}{{Continued on next page}} \\
\endfoot

\bottomrule
\endlastfoot

""")
    f.write(latex_input_str)
    f.write(r"\end{longtable}")

deformation_energy_data_df

| | Dimer Name | $\Delta E_\textrm{mon. 1, def.}^\textrm{CCSD(T)}$ [kcal/mol] | $\Delta E_\textrm{mon. 2, def.}^\textrm{CCSD(T)}$ [kcal/mol] |
|---|---|---|---|
| 1 | Water$\cdots$Water | 0.031 | 0.0 |
| 2 | Water$\cdots$MeOH | 0.042 | -0.016 |
| 3 | Water$\cdots$MeNH$_2$ | 0.109 | -0.026 |
| 4 | Water$\cdots$Peptide | 0.087 | 0.067 |
| 5 | MeOH$\cdots$MeOH | 0.056 | -0.022 |
| 6 | MeOH$\cdots$MeNH$_2$ | 0.222 | -0.026 |
| 7 | MeOH$\cdots$Peptide | 0.147 | -0.006 |
| 8 | MeOH$\cdots$Water | 0.038 | -0.001 |
| 9 | MeNH$_2$$\cdots$MeOH | -0.003 | -0.033 |
| 10 | MeNH$_2$$\cdots$MeNH$_2$ | 0.005 | -0.015 |


### SI - Validating the use of CCSD(T) deformation energy

In [6]:
monomer_deformation_ene_dict = {}
counter = 0 
for mol, monomer in monomer_info.groupby('mol'):
    if monomer.test.tolist()[0] == True:
        monomer_all_dmc_data = monomer_dmc_total_energy_data[(monomer_dmc_total_energy_data['mol'] == mol) & (monomer_dmc_total_energy_data['dmc_type'] == 'DMCdla5') & (monomer_dmc_total_energy_data['dmc_Jas'] == 'Jopt')].sort_values('tau', ascending=False)
        ref_id = monomer['ref'].tolist()[0]
        

        ref_monomer_data = monomer_all_dmc_data[monomer_all_dmc_data['mol_id'] == ref_id]
        ref_monomer_data.set_index( 'tau', inplace=True )
        ref_monomer_tau_list = ref_monomer_data.index.tolist()
        for monomer_id in set(monomer_all_dmc_data['mol_id'].tolist()):
            monomer_dimer_id = monomer_id.split('_')[0]
            monomer_data = monomer_all_dmc_data[monomer_all_dmc_data['mol_id'] == monomer_id]
            monomer_data.set_index( 'tau', inplace=True )
            monomer_tau = 0.01
            monomer_deformation_ene = monomer_data.loc[monomer_tau]['ene'] - ref_monomer_data.loc[monomer_tau]['ene']
            monomer_deformation_ene_err = np.sqrt(monomer_data.loc[monomer_tau]['err']**2 + ref_monomer_data.loc[monomer_tau]['err']**2)
            ccsdt_deformation_ene = monomer_geometry_correction_data[monomer_geometry_correction_data['mol_id'] == monomer_id]['ene-ref'].values[0]
            if ref_id == monomer_id:
                monomer_deformation_ene_err = 0.0
            delta = ccsdt_deformation_ene - monomer_deformation_ene
            # Raw f-strings keep the LaTeX \pm from being parsed as an escape sequence
            monomer_deformation_ene_dict[counter] = {
                'Monomer': final_monomer_total_energy[ref_id]['Monomer'],
                'Dimer Geometry': dimer_info.loc[int(monomer_dimer_id),'formatted_name'],
                'Order': monomer_id.split('_')[1],
                r'$\Delta E_\textrm{def.}^\textrm{DMC}$': rf'{monomer_deformation_ene:.2f} $\pm$ {monomer_deformation_ene_err:.2f}',
                r'$\Delta E_\textrm{def.}^\textrm{CCSD(T)}$': f'{ccsdt_deformation_ene:.2f}',
                'Deviation': rf'{delta:.2f} $\pm$ {monomer_deformation_ene_err:.2f}',
            }
            counter += 1

In [7]:
monomer_deformation_ene_df = pd.DataFrame(monomer_deformation_ene_dict).T
monomer_deformation_ene_df

# Create latex input for the table
latex_input_str = convert_df_to_latex_input(
    monomer_deformation_ene_df,
    start_input = '\\begin{table}',
    label = 'tab:monomer_deformation_ene_validation',
    caption = r'Comparison between DMC ($0.01\,$au time step) and CCSD(T) for the deformation energy $E_\textrm{def.}$ of a subset of AcNH$_2$, AcOH, cyclopentane, peptide and uracil monomers found in the S66 dataset. The order in which the monomer appears in the dimer (in the provided .xyz geometry) is given. The reference monomer configuration to calculate $E_\textrm{def.}$ is given in Table~\ref{tab:monomer_deformation_ene}.',
    end_input = '\\end{table}',
    replace_input = {
    },
    center = True,
    df_latex_skip = 0,
    index=False,
    output_str = True,
    column_format = 'l' + 'r'*(len(monomer_deformation_ene_df.columns)-1)
)

display(monomer_deformation_ene_df)

with open('Tables/Table_SI_Monomer_deformation_ene_validation.tex', 'w') as f:
    f.write(latex_input_str)


| | Monomer | Dimer Geometry | Order | $\Delta E_\textrm{def.}^\textrm{DMC}$ | $\Delta E_\textrm{def.}^\textrm{CCSD(T)}$ | Deviation |
|---|---|---|---|---|---|---|
| 0 | AcNH$_2$ | AcNH$_2$$\cdots$AcNH$_2$ | 1 | 0.00 $\pm$ 0.00 | 0.0 | 0.00 $\pm$ 0.00 |
| 1 | AcNH$_2$ | AcNH$_2$$\cdots$Uracil | 1 | 0.10 $\pm$ 0.05 | 0.06 | -0.04 $\pm$ 0.05 |
| 2 | AcNH$_2$ | Pentane$\cdots$AcNH$_2$ | 2 | -0.64 $\pm$ 0.05 | -0.7 | -0.07 $\pm$ 0.05 |
| 3 | AcNH$_2$ | Benzene$\cdots$AcNH$_2$ (NH-$\pi$) | 2 | -0.68 $\pm$ 0.04 | -0.65 | 0.03 $\pm$ 0.04 |
| 4 | AcNH$_2$ | AcNH$_2$$\cdots$AcNH$_2$ | 2 | -0.07 $\pm$ 0.05 | -0.0 | 0.07 $\pm$ 0.05 |
| 5 | AcOH | Benzene$\cdots$AcOH (OH-$\pi$) | 2 | -1.18 $\pm$ 0.05 | -1.28 | -0.10 $\pm$ 0.05 |
| 6 | AcOH | AcOH$\cdots$Uracil | 1 | 0.15 $\pm$ 0.06 | 0.07 | -0.08 $\pm$ 0.06 |
| 7 | AcOH | Ethyne$\cdots$AcOH (OH-$\pi$) | 2 | -1.18 $\pm$ 0.06 | -1.25 | -0.07 $\pm$ 0.06 |
| 8 | AcOH | AcOH$\cdots$AcOH | 1 | 0.00 $\pm$ 0.00 | 0.0 | 0.00 $\pm$ 0.00 |
| 9 | AcOH | Benzene$\cdots$AcOH | 2 | -1.18 $\pm$ 0.04 | -1.29 | -0.12 $\pm$ 0.04 |


### SI - Comparing DLA and TM localization schemes for H<sub>2</sub>O and AcOH dimer

In [8]:
# Compute the binding energy with the TM

tm_filtered_dimer_dmc_total_energy_data = dimer_dmc_total_energy_data[(dimer_dmc_total_energy_data['dmc_type'] == 'DMCtm5') & (dimer_dmc_total_energy_data['dmc_Jas'] == 'JoptLA')]

tm_dmc_energy_data = {}

tm_system_name_dict = {system_id: {'Original': '', 'New': ''} for system_id in range(1,67)}

system_loc_scheme_binding_energy = {id: {'TM Binding Energy': 0, 'TM Binding Energy Error': 0, 'DLA Binding Energy': 0, 'DLA Binding Energy Error': 0} for id in [1,20]}

# Loop over the dimers
for system_id, system_data in tm_filtered_dimer_dmc_total_energy_data.groupby('ID'):
    tm_dmc_energy_data[system_id] = {'total_energy_dimer': 0, 'total_energy_monomer_1':0, 'total_energy_monomer_2':0, 'binding_energy': 0}
    system_data = system_data.sort_values('tau', ascending=False)
    system_data.set_index( 'tau', inplace=True )
    system_name = dimer_info.loc[system_id,'name']
    
    monomer_data = {1:0, 2:0}
    monomer_geometry_correction = {1:0, 2:0}
    # Get the monomer data
    for monomer_num in [1,2]:
        monomer_id = f'{system_id:02d}_{monomer_num}'
        monomer_name = dimer_info.loc[system_id,f'mol{monomer_num}']
        monomer_ref_id = monomer_info.loc[monomer_name, 'ref']
        monomer_ref_data = monomer_dmc_total_energy_data[(monomer_dmc_total_energy_data['mol_id'] == monomer_ref_id) & (monomer_dmc_total_energy_data['dmc_type'] == 'DMCtm5') & (monomer_dmc_total_energy_data['dmc_Jas'] == 'JoptLA')].sort_values('tau', ascending=False)
        monomer_ref_data.set_index( 'tau', inplace=True )
        # Add the geometry correction
        monomer_ref_data['ene'] = monomer_ref_data['ene'] + monomer_geometry_correction_data[monomer_geometry_correction_data['mol_id'] == monomer_id]['ene-ref'].values[0]
        monomer_data[monomer_num] = monomer_ref_data
    tm_dmc_energy_data[system_id]['total_energy_dimer'] = system_data.copy()
    tm_dmc_energy_data[system_id]['total_energy_monomer_1'] = monomer_data[1]
    tm_dmc_energy_data[system_id]['total_energy_monomer_2'] = monomer_data[2]
    # Compute the binding energy
    system_data['binding_energy'] = system_data['ene'] - monomer_data[1]['ene'] - monomer_data[2]['ene']
    system_data['binding_energy_err'] = (system_data['err']**2 + monomer_data[1]['err']**2 + monomer_data[2]['err']**2)**0.5
    tm_dmc_energy_data[system_id]['binding_energy'] = system_data

    # Extrapolate the binding energy to the zero time step limit
    fitting_data = system_data[ system_data.index <= 0.015 ]
    xdata = fitting_data.index.to_numpy()
    ydata = fitting_data['binding_energy'].to_numpy()
    sigma = fitting_data['binding_energy_err'].to_numpy()
    xfit1, m1, s1 = fit_err(xdata,ydata,sigma,fitfun=fun_lin)

    fitting_data = system_data[ system_data.index <= 0.11]
    xdata = fitting_data.index.to_numpy()
    ydata = fitting_data['binding_energy'].to_numpy()
    sigma = fitting_data['binding_energy_err'].to_numpy()
    xfit3, m3, s3 = fit_err(xdata,ydata,sigma,fitfun=fun_cub)

    linear_cubic_diff = abs(m1[0] - m3[0])
    if linear_cubic_diff > s3[0]:
        system_error = linear_cubic_diff
    else:
        system_error = s3[0]

    system_loc_scheme_binding_energy[system_id]['TM Binding Energy'] = m3[0]
    system_loc_scheme_binding_energy[system_id]['TM Binding Energy Error'] = system_error

    # Compute the binding energy with the DLA
    fitting_data = dmc_energy_data[system_id]['binding_energy'][dmc_energy_data[system_id]['binding_energy'].index <= 0.015 ]
    xdata = fitting_data.index.to_numpy()
    ydata = fitting_data['binding_energy'].to_numpy()
    sigma = fitting_data['binding_energy_err'].to_numpy()
    xfit1, m1, s1 = fit_err(xdata,ydata,sigma,fitfun=fun_lin)

    fitting_data = dmc_energy_data[system_id]['binding_energy'][dmc_energy_data[system_id]['binding_energy'].index <= 0.11]
    xdata = fitting_data.index.to_numpy()
    ydata = fitting_data['binding_energy'].to_numpy()
    sigma = fitting_data['binding_energy_err'].to_numpy()
    xfit3, m3, s3 = fit_err(xdata,ydata,sigma,fitfun=fun_cub)

    linear_cubic_diff = abs(m1[0] - m3[0])
    if linear_cubic_diff > s3[0]:
        system_error = linear_cubic_diff
    else:
        system_error = s3[0]

    system_loc_scheme_binding_energy[system_id]['DLA Binding Energy'] = m3[0]
    system_loc_scheme_binding_energy[system_id]['DLA Binding Energy Error'] = system_error

# Get estimate from the DTM localization scheme for the AcOH...AcOH dimer
dtm_acoh_dimer_bind_ene = pd.read_csv('Data/Acetic_Acid_Validation/LDA_eCEPP_DLTM_CASINO.csv',index_col=0).sort_values('tau', ascending=True)

fitting_data = dtm_acoh_dimer_bind_ene[dtm_acoh_dimer_bind_ene.index <= 0.015 ]
xdata = fitting_data.index.to_numpy()
ydata = fitting_data['ene'].to_numpy()
sigma = fitting_data['err'].to_numpy()
xfit1, m1, s1 = fit_err(xdata,ydata,sigma,fitfun=fun_lin)

fitting_data = dtm_acoh_dimer_bind_ene[dtm_acoh_dimer_bind_ene.index <= 0.11]
xdata = fitting_data.index.to_numpy()
ydata = fitting_data['ene'].to_numpy()
sigma = fitting_data['err'].to_numpy()
xfit3, m3, s3 = fit_err(xdata,ydata,sigma,fitfun=fun_cub)

linear_cubic_diff = abs(m1[0] - m3[0])
if linear_cubic_diff > s3[0]:
    system_error = linear_cubic_diff
else:
    system_error = s3[0]

# Create a dictionary with tuples as values for binding energy and error
system_loc_scheme_binding_energy_formatted = {
    (r'H2O$\cdots$H2O', 'TM'): (system_loc_scheme_binding_energy[1]['TM Binding Energy'], system_loc_scheme_binding_energy[1]['TM Binding Energy Error']),
    (r'H2O$\cdots$H2O', 'DLA'): (system_loc_scheme_binding_energy[1]['DLA Binding Energy'], system_loc_scheme_binding_energy[1]['DLA Binding Energy Error']),
    (r'AcOH$\cdots$AcOH', 'TM'): (system_loc_scheme_binding_energy[20]['TM Binding Energy'], system_loc_scheme_binding_energy[20]['TM Binding Energy Error']),
    (r'AcOH$\cdots$AcOH', 'DLA'): (system_loc_scheme_binding_energy[20]['DLA Binding Energy'], system_loc_scheme_binding_energy[20]['DLA Binding Energy Error']),
    (r'AcOH$\cdots$AcOH', 'DTM'): (m3[0], system_error)
}

# Convert the dictionary to a pandas DataFrame
system_loc_scheme_binding_energy_df = pd.DataFrame(system_loc_scheme_binding_energy_formatted).T

# Rename the columns for clarity
system_loc_scheme_binding_energy_df.columns = ['Initial Binding Energy', 'Error']

# Optionally apply formatting later if you need it displayed as strings
system_loc_scheme_binding_energy_df['Eint [kcal/mol]'] = system_loc_scheme_binding_energy_df.apply(
    lambda row: rf"{row['Initial Binding Energy']:.2f}$\pm${row['Error']:.2f}", axis=1
)

system_loc_scheme_binding_energy_df = system_loc_scheme_binding_energy_df[['Eint [kcal/mol]']]

# Create latex input string
latex_input_str = convert_df_to_latex_input(
    system_loc_scheme_binding_energy_df,
    start_input = '\\begin{table}',
    label = 'tab:loc_scheme_test',
    caption = r'Comparison of the extrapolated interaction energy $\Delta_\textrm{int}$ for the TM and DLA localization schemes for the H$_2$O$\cdots$H$_2$O (ID 1) and AcOH$\cdots$AcOH dimers (ID 20).',
    end_input = '\\end{table}',
    center = True,
    df_latex_skip = 0,
    index=True,
    output_str = True,
    column_format = 'll' + 'r'*(len(system_loc_scheme_binding_energy_df.columns)-1)
)

with open('Tables/Table_SI_System_loc_scheme.tex','w') as f:
    f.write(latex_input_str)

system_loc_scheme_binding_energy_df

| | | Eint [kcal/mol] |
|---|---|---|
| H2O$\cdots$H2O | TM | -5.06$\pm$0.04 |
| H2O$\cdots$H2O | DLA | -5.17$\pm$0.03 |
| AcOH$\cdots$AcOH | TM | -19.98$\pm$0.06 |
| AcOH$\cdots$AcOH | DLA | -20.17$\pm$0.11 |
| AcOH$\cdots$AcOH | DTM | -20.30$\pm$0.09 |


### SI - Previous CCSD(T) literature and final CCSD(T) and CCSD(cT)-fit estimates
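
The cell in this section performs two small calculations: it rescales the (T) contribution into a (cT)-fit estimate using the ratio of MP2 to CCSD correlation contributions, and it combines three literature CCSD(T) values into a mean with an error of twice their standard deviation. A minimal standalone sketch follows; the function names and all input numbers are hypothetical, and only the constants 0.7764 and 0.278 are taken from the notebook code.

```python
import numpy as np

def ct_from_t(e_t, e_hf, e_mp2, e_ccsd):
    """Rescale a (T) contribution to a (cT)-fit estimate using the notebook's fitted ratio."""
    # Ratio of the MP2 and CCSD correlation contributions to the interaction energy
    ratio = (e_mp2 - e_hf) / (e_ccsd - e_hf)
    return e_t / (0.7764 + 0.278 * ratio)

def combine_references(values):
    """Average several reference energies; the error is twice their population std."""
    values = np.asarray(values, dtype=float)
    return values.mean(), 2.0 * values.std()

# Hypothetical interaction-energy contributions [kcal/mol] for one dimer
e_hf, e_mp2, e_ccsd, e_t = -3.0, -5.0, -4.6, -0.30
delta = ct_from_t(e_t, e_hf, e_mp2, e_ccsd) - e_t

# Hypothetical CCSD(T) values from three references
final, error = combine_references([-4.92, -4.98, -4.95])
print(f'(cT)-(T) = {delta:+.3f}; CCSD(T) = {final:.2f} +/- {error:.2f}')
```

`numpy.std` defaults to the population standard deviation (`ddof=0`), matching the `np.std` call in the notebook.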

In [9]:
keshwarni_cc_data = pd.read_excel('Data/Coupled_Cluster_References/Kesharwani_10.1071_CH17588_SI.xlsx', sheet_name='F12c_aVTZ-F12 CCSD',usecols = 'J:L').dropna().drop([7, 8]).reset_index(drop=True)
keshwarni_cc_data.columns = ['HF', 'MP2', 'CCSD']
keshwarni_cc_data['(T)'] = pd.read_excel('Data/Coupled_Cluster_References/Kesharwani_10.1071_CH17588_SI.xlsx', sheet_name='(T) ',usecols = 'P').drop(list(range(19))).reset_index(drop=True)['Unnamed: 15']
keshwarni_cc_data['(cT)'] = keshwarni_cc_data['(T)'] /( 0.7764+0.278*(keshwarni_cc_data['MP2'] - keshwarni_cc_data['HF'])/(keshwarni_cc_data['CCSD']  - keshwarni_cc_data['HF']))
keshwarni_cc_data['(cT)-(T)'] = keshwarni_cc_data['(cT)'] - keshwarni_cc_data['(T)']

# Load CCSD(T) references
ccsdt_references = pd.read_csv('Data/Coupled_Cluster_References/Hobza_Nagy.csv', index_col=0)
ccsdt_references['CCSD(T) Final'] = ccsdt_references['Hobza_1']
ccsdt_references['CCSD(T) Error'] = ccsdt_references['Hobza_1']
ccsdt_references['CCSD(cT)-fit Final'] = ccsdt_references['Hobza_1']

for i in range(66):
    if np.isnan(ccsdt_references['Martin_Gold'][i+1]):
        ccsdt_references.loc[i+1,'CCSD(T) Final'] = np.average([ccsdt_references['Hobza_2'][i+1], ccsdt_references['Martin_Silver'][i+1], ccsdt_references['14k-Gold'][i+1]])
        ccsdt_references.loc[i+1,'CCSD(T) Error'] = np.std([ccsdt_references['Hobza_2'][i+1], ccsdt_references['Martin_Silver'][i+1], ccsdt_references['14k-Gold'][i+1]])*2
        ccsdt_references.loc[i+1,'CCSD(cT)-fit Final'] = ccsdt_references.loc[i+1,'Martin_Silver'] - keshwarni_cc_data['(cT)-(T)'][i]

    else:
        ccsdt_references.loc[i+1,'CCSD(T) Final'] = np.average([ccsdt_references['Hobza_2'][i+1], ccsdt_references['Martin_Gold'][i+1], ccsdt_references['14k-Gold'][i+1]])
        ccsdt_references.loc[i+1,'CCSD(T) Error'] = np.std([ccsdt_references['Hobza_2'][i+1], ccsdt_references['Martin_Gold'][i+1], ccsdt_references['14k-Gold'][i+1]])*2
        ccsdt_references.loc[i+1,'CCSD(cT)-fit Final'] = ccsdt_references.loc[i+1,'Martin_Silver'] - keshwarni_cc_data['(cT)-(T)'][i]

ccsdt_raw_references = ccsdt_references.copy()

# Round to nearest 0.01 kcal/mol
ccsdt_references = ccsdt_references.round(2)
ccsdt_references = ccsdt_references.applymap(lambda x: f"{x:.2f}" if isinstance(x, (int, float)) else x)
ccsdt_references['formatted_name'] = dimer_info['formatted_name']

# Make a list combining CCSD(T) Final $\pm$ CCSD(T) Error
ccsdt_references['CCSD(T) Final'] = ccsdt_references['CCSD(T) Final'].astype(str)
ccsdt_references['CCSD(T) Error'] = ccsdt_references['CCSD(T) Error'].astype(str)
ccsdt_references['CCSD(cT)-fit Final'] = ccsdt_references['CCSD(cT)-fit Final'].astype(str)
ccsdt_references['CCSD(T) Final'] = ccsdt_references['CCSD(T) Final'] + r'$\pm$' + ccsdt_references['CCSD(T) Error']
ccsdt_references['CCSD(cT)-fit Final'] = ccsdt_references['CCSD(cT)-fit Final'] + r'$\pm$' + ccsdt_references['CCSD(T) Error']

# Only include 'formatted_name', 'Hobza_2', 'Martin_Silver', '14k-Gold', 'CCSD(T) Final' and 'CCSD(cT)-fit Final' columns
ccsdt_references_table = ccsdt_references[['formatted_name', 'Hobza_2', 'Martin_Silver', '14k-Gold', 'CCSD(T) Final', 'CCSD(cT)-fit Final']]
ccsdt_references_table.columns = ['System', r'\v{R}ez\'a\v{c} \textit{et al.} (2006)', r'Kesharwani \textit{et al.} (2018)', r'Nagy \textit{et al.} (2023)', 'Final CCSD(T)', 'Final CCSD(cT)-fit']

# Write the DataFrame to a latex input
latex_input_str = '\n'.join(convert_df_to_latex_input(
    ccsdt_references_table,
    start_input = '\\begin{table}',
    label = 'tab:cc_references',
    caption = r'CCSD(T) references for the S66 dataset. The final CCSD(T) and CCSD(cT)-fit values are computed as the average of the values from the three references. The error is computed as twice the standard deviation of the values from the three references.',
    end_input = '\\end{table}',
    replace_input = {
    },
    adjustbox = 1,
    center = True,
    df_latex_skip = 0,
    rotate_column_header = True,
    output_str = True,
    column_format = 'll' + 'r'*len(ccsdt_references_table.columns)
).splitlines()[7:-4]) + '\n'

with open('Tables/Table_SI_CCSDT_references_table.tex', 'w') as f:
    f.write(r"""\LTcapwidth=\textwidth
    
\begin{longtable}{llrrrrrr}
\caption{\label{tab:cc_references}CCSD(T) references for the S66 dataset. The final CCSD(T) and CCSD(cT)-fit values are computed as the average of the values from the three references. The error is computed as twice the standard deviation of the values from the three references.} \\

\toprule
 & \rotatebox{90}{System} & \rotatebox{90}{\v{R}ez\'a\v{c} \textit{et al.} (2006)} & \rotatebox{90}{Kesharwani \textit{et al.} (2018)} & \rotatebox{90}{Nagy \textit{et al.} (2023)} & \rotatebox{90}{Final CCSD(T)} & \rotatebox{90}{Final CCSD(cT)-fit} \\ 
\midrule
\endfirsthead



\caption[]{(continued)} \\
\endhead

\multicolumn{8}{r}{{Continued on next page}} \\
\endfoot

\bottomrule
\endlastfoot

""")
    f.write(latex_input_str)
    f.write(r"\end{longtable}")

# ccsdt_references_table_latex = ccsdt_references_table.to_latex(index=True, escape=False, column_format='lrrrrr')
ccsdt_references_table


ID,System,\v{R}ez\'a\v{c} \textit{et al.} (2006),Kesharwani \textit{et al.} (2018),Nagy \textit{et al.} (2023),Final CCSD(T),Final CCSD(cT)-fit
1,Water$\cdots$Water,-5.01,-4.98,-4.99,-4.99$\pm$0.03,-4.96$\pm$0.03
2,Water$\cdots$MeOH,-5.7,-5.67,-5.67,-5.68$\pm$0.03,-5.63$\pm$0.03
3,Water$\cdots$MeNH$_2$,-7.04,-6.99,-7.0,-7.01$\pm$0.05,-6.94$\pm$0.05
4,Water$\cdots$Peptide,-8.22,-8.18,-8.19,-8.20$\pm$0.03,-8.15$\pm$0.03
5,MeOH$\cdots$MeOH,-5.85,-5.82,-5.83,-5.83$\pm$0.02,-5.78$\pm$0.02
6,MeOH$\cdots$MeNH$_2$,-7.67,-7.62,-7.62,-7.64$\pm$0.04,-7.55$\pm$0.04
7,MeOH$\cdots$Peptide,-8.34,-8.31,-8.31,-8.32$\pm$0.03,-8.25$\pm$0.03
8,MeOH$\cdots$Water,-5.09,-5.06,-5.07,-5.08$\pm$0.02,-5.03$\pm$0.02
9,MeNH$_2$$\cdots$MeOH,-3.11,-3.09,-3.09,-3.10$\pm$0.02,-3.05$\pm$0.02
10,MeNH$_2$$\cdots$MeNH$_2$,-4.22,-4.18,-4.19,-4.20$\pm$0.03,-4.13$\pm$0.03
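
The averaging rule used in the loop above can also be written without an explicit loop. The following is a minimal sketch, assuming the same column names as in `Hobza_Nagy.csv`; the numerical values are toy numbers chosen for illustration and are not taken from the reference files.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical, NOT taken from the reference files)
refs = pd.DataFrame(
    {
        "Hobza_2": [-5.00, -3.00],
        "Martin_Gold": [-4.90, np.nan],
        "Martin_Silver": [-4.95, -3.10],
        "14k-Gold": [-5.10, -3.20],
    },
    index=[1, 2],
)

# Prefer Martin_Gold and fall back to Martin_Silver where it is missing
martin_best = refs["Martin_Gold"].fillna(refs["Martin_Silver"])
stacked = pd.concat([refs["Hobza_2"], martin_best, refs["14k-Gold"]], axis=1)

# Final value: mean of the three references
refs["CCSD(T) Final"] = stacked.mean(axis=1)
# Error: twice the standard deviation; ddof=0 matches the np.std default
refs["CCSD(T) Error"] = 2 * stacked.std(axis=1, ddof=0)

print(refs[["CCSD(T) Final", "CCSD(T) Error"]])
```

Note that `DataFrame.std` defaults to `ddof=1`, whereas `np.std` defaults to `ddof=0`, so the explicit `ddof=0` is needed to reproduce the convention of the loop.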


## SI - Timestep dependence for the binding energy of each S66 system

In [10]:
# Plot the binding energy and the total energy of the dimer

final_binding_energy =  {f'{system_id}': [0,0] for system_id in range(1,67)}
final_all_energy =  {f'{system_id}': {energy: 0 for energy in  ['Dimer', 'Dimer Error', 'Monomer 1','Monomer 1 Error', 'Monomer 2', 'Monomer 2 Error', 'Binding Energy', 'Binding Energy Error']} for system_id in range(1,67)}

latex_input_str = ''

if replot_graphs:
    for system_id in range(1,67):
        name = dimer_info.loc[system_id,'formatted_name']
        
        fig, ax = plt.subplots(1,2,figsize=(6.69,2), dpi=300,constrained_layout=True)
        ax[0].set_xlabel( 'DMC timestep [a.u.]' )
        ax[0].set_xticks( [0, 0.003, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 0.2, 0.3 ] )
        ax[0].set_xticklabels( [ '0', '3E-3', '0.01', '0.02', '0.03', '0.04', '0.05', '0.06', '0.1', '0.2', '0.3' ], rotation=90 )
        ax[0].set_xlim( [0,0.1*1.03] )
        ax[0].set_ylabel( r'$\Delta E_\textrm{int}$ [kcal/mol]' )

        # reference quantum-chemistry result
        ax[0].axhline( ccsdt_raw_references.loc[system_id,'CCSD(T) Final'], c='gray', ls='--', label='CCSD(T)')
        ccsdt_upper_lim = float(ccsdt_raw_references.loc[system_id,'CCSD(T) Final']) + float(ccsdt_raw_references.loc[system_id,'CCSD(T) Error'])
        ccsdt_lower_lim = float(ccsdt_raw_references.loc[system_id,'CCSD(T) Final']) - float(ccsdt_raw_references.loc[system_id,'CCSD(T) Error'])
        ax[0].fill_between([0,0.15],[ccsdt_lower_lim,ccsdt_lower_lim], [ccsdt_upper_lim,ccsdt_upper_lim], color='gray',alpha=0.2,edgecolor='none')

        ax[0].errorbar(dmc_energy_data[system_id]['binding_energy'].index.tolist(), dmc_energy_data[system_id]['binding_energy']['binding_energy'].values, yerr=dmc_energy_data[system_id]['binding_energy']['binding_energy_err'].values, fmt='o', color='black',markeredgecolor='none',markersize=4, label=r'DMC//DLA')

        system_binding_energy_data = dmc_energy_data[system_id]['binding_energy']

        taumaxfit = 0.11 #0.10
        fitting_data = system_binding_energy_data[ system_binding_energy_data.index <= taumaxfit ]
        xdata = fitting_data.index.to_numpy()
        ydata = fitting_data['binding_energy'].to_numpy()
        sigma = fitting_data['binding_energy_err'].to_numpy()

        xfit, m, s = fit_err(xdata,ydata,sigma,fitfun=fun_cub)
        ax[0].plot(xfit,m,'--',color='red', label=r'$\Delta E_\textrm{int}^\textrm{cubic extrap.}=$' + f'{m[0]:.2f}' + r'${\pm}$' + f'{s[0]:.2f}')
        ax[0].fill_between(xfit,m-1*s,m+1*s,color='red',alpha=0.2)

        binding_energy_data = system_binding_energy_data[ system_binding_energy_data.index <= 0.011 ]
        xdata = binding_energy_data.index.to_numpy()
        ydata = binding_energy_data['binding_energy'].to_numpy()
        sigma = binding_energy_data['binding_energy_err'].to_numpy()

        xfit1, m1, s1 = fit_err(xdata,ydata,sigma,fitfun=fun_lin)
        ax[0].plot(xfit1,m1,'--',color='blue', label=r'$\Delta E_\textrm{int}^\textrm{lin. extrap.}=$' + f'{m1[0]:.2f}' + r'${\pm}$' + f'{s1[0]:.2f}')
        ax[0].fill_between(xfit1,m1-1*s1,m1+1*s1,color='blue',alpha=0.2)

        linear_cubic_diff = abs(m[0] - m1[0])
        if linear_cubic_diff > s[0]:
            system_error = linear_cubic_diff
            error_type = r'$\Delta_\textrm{cubic fit}^\textrm{linear fit}$'
        else:
            system_error = s[0]
            error_type = r'$\sigma_\textrm{cubic fit}$'


        final_binding_energy[f'{system_id}'] = [m[0],system_error]
        final_all_energy[f'{system_id}']['Binding Energy'] = m[0]
        final_all_energy[f'{system_id}']['Binding Energy Error'] = system_error
        final_all_energy[f'{system_id}']['Error Type'] = error_type

        # Collect the legend entries of the binding-energy panel
        handles, labels = ax[0].get_legend_handles_labels()

        # Reorder the entries so that the DMC data points directly follow the CCSD(T) reference
        order = [0,3,1,2]

        # Apply the new order to the legend
        ax[0].legend([handles[i] for i in order], [labels[i] for i in order], fontsize=7, ncol=2,frameon=False)

        ax[1].set_xlabel( 'DMC timestep [a.u.]' )
        ax[1].set_xticks( [0, 0.003, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 0.2, 0.3 ] )
        ax[1].set_xticklabels( [ '0', '3E-3', '0.01', '0.02', '0.03', '0.04', '0.05', '0.06', '0.1', '0.2', '0.3' ], rotation=90 )
        ax[1].set_xlim( [0,0.1*1.03] )
        ax[1].set_ylabel( 'Total Energy [kcal/mol]' )


        system_dimer_total_energy_data = dmc_energy_data[system_id]['total_energy_dimer']

        fitting_data = system_dimer_total_energy_data[ system_dimer_total_energy_data.index <= 0.013 ]
        xdata = fitting_data.index.to_numpy()
        ydata = fitting_data['ene'].to_numpy()
        sigma = fitting_data['err'].to_numpy()

        xfit1, m1, s1 = fit_err(xdata,ydata,sigma,fitfun=fun_lin)

        extrap_system_total_energy = m1[0]

        ax[1].plot(xfit1,m1 - extrap_system_total_energy,'--',color='blue', label=r'$E^\textrm{lin. extrap.}=$' + f'{m1[0]:.2f}' + r'${\pm}$' + f'{s1[0]:.2f}')
        ax[1].fill_between(xfit1,m1 - extrap_system_total_energy -1*s1,m1 - extrap_system_total_energy +1*s1,color='blue',alpha=0.2)

        ax[1].errorbar(dmc_energy_data[system_id]['total_energy_dimer'].index.tolist(), dmc_energy_data[system_id]['total_energy_dimer']['ene'].values - extrap_system_total_energy, yerr=dmc_energy_data[system_id]['total_energy_dimer']['err'].values, fmt='o', color='black',markeredgecolor='none',markersize=4, label=r'DMC//DLA')

        taumaxfit = 0.11 #0.10
        fitting_data = system_dimer_total_energy_data[ system_dimer_total_energy_data.index <= taumaxfit ]
        xdata = fitting_data.index.to_numpy()
        ydata = fitting_data['ene'].to_numpy()
        sigma = fitting_data['err'].to_numpy()

        xfit3, m3, s3 = fit_err(xdata,ydata,sigma,fitfun=fun_cub)

        ax[1].plot(xfit3,m3 - extrap_system_total_energy,'--',color='red', label=r'$E^\textrm{cubic extrap.}=$' + f'{m3[0]:.2f}' + r'${\pm}$' + f'{s3[0]:.2f}')
        ax[1].fill_between(xfit3,m3 - extrap_system_total_energy -1*s3,m3 - extrap_system_total_energy +1*s3,color='red',alpha=0.2)

        # Collect the legend entries of the total-energy panel
        handles, labels = ax[1].get_legend_handles_labels()

        # Reorder the entries so that the DMC data points come first
        order = [2,0,1]

        # Apply the new order to the legend
        ax[1].legend([handles[i] for i in order], [labels[i] for i in order], fontsize=7,frameon=False)
        ax[1].set_ylim([-5,5])

        fig.suptitle(f'{name} (ID {system_id})')
        fig.savefig(f'Figures/Fig_SI_S66_{system_id:02d}.png',format='png')
        latex_input_str += r"""\begin{figure}[!h]
    \includegraphics[width=6.69in]{"""+ f"Figures/Fig_SI_S66_{system_id:02d}.png" + r"""}
    \caption{\label{fig:""" + f"dimer_{system_id:02d}" + r"""} The time step dependence of $\Delta E_\textrm{int}$ and the total energy of the dimer complex for the """ + f'{name} (ID {system_id}) dimer.' + r"""}
\end{figure}
    
"""
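        # Store the zero-time-step dimer total energy from the cubic fit above (m3, s3);
        # its error follows the same rule as for the binding energy and the monomers
        dimer_linear_cubic_diff = abs(extrap_system_total_energy - m3[0])
        final_all_energy[f'{system_id}']['Dimer'] = m3[0]
        final_all_energy[f'{system_id}']['Dimer Error'] = max(dimer_linear_cubic_diff, s3[0])

        # Make fits for the total energy of the monomers as well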

        for monomer_num in [1,2]:
            monomer_total_energy_data = dmc_energy_data[system_id][f'total_energy_monomer_{monomer_num}']

            fitting_data = monomer_total_energy_data[ monomer_total_energy_data.index <= 0.045 ]
            xdata = fitting_data.index.to_numpy()
            ydata = fitting_data['ene'].to_numpy()
            sigma = fitting_data['err'].to_numpy()

            xfit1, m1, s1 = fit_err(xdata,ydata,sigma,fitfun=fun_lin)

            fitting_data = monomer_total_energy_data[ monomer_total_energy_data.index <= 0.11 ]
            xdata = fitting_data.index.to_numpy()
            ydata = fitting_data['ene'].to_numpy()
            sigma = fitting_data['err'].to_numpy()

            xfit3, m3, s3 = fit_err(xdata,ydata,sigma,fitfun=fun_cub)

            linear_cubic_diff = abs(m1[0] - m3[0])
            if linear_cubic_diff > s3[0]:
                system_error = linear_cubic_diff
                error_type = r'$\Delta_\textrm{cubic fit}^\textrm{linear fit}$'
            else:
                system_error = s3[0]
                error_type = r'$\sigma_\textrm{cubic fit}$'


            final_all_energy[f'{system_id}'][f'Monomer {monomer_num}'] = m3[0]
            final_all_energy[f'{system_id}'][f'Monomer {monomer_num} Error'] = system_error
            final_all_energy[f'{system_id}'][f'Monomer {monomer_num} Error Type'] = error_type
    
    np.save('Data/Final/final_binding_energy.npy', final_binding_energy)
    np.save('Data/Final/final_all_energy.npy', final_all_energy)

else:
    final_binding_energy = np.load('Data/Final/final_binding_energy.npy', allow_pickle=True).item()
    final_all_energy = np.load('Data/Final/final_all_energy.npy', allow_pickle=True).item()

# latex_input_str = ''
# for system_id in range(1,67):
#     name = dimer_info.loc[system_id,'formatted_name']
#     latex_input_str += r"""\begin{figure}[!h]
#     \includegraphics[width=6.69in]{"""+ f"Figures/Fig_SI_S66_{system_id:02d}.png" + r"""}
#     \caption{\label{fig:""" + f"dimer_{system_id:02d}" + r"""} The time step dependence of $\Delta E_\textrm{int}$ and the total energy of the dimer complex for the """ + f'{name} (ID {system_id}) dimer.' + r"""}
# \end{figure}
    
# """
    


In [11]:
# Turn the final_all_energy dictionary into a pandas dataframe
final_binding_energy_df = pd.DataFrame(final_all_energy).T

# Add a 'System' column with the formatted dimer names from dimer_info
final_binding_energy_df['System'] = [dimer_info.loc[system_id,'formatted_name'] for system_id in range(1,67)]

# Give binding energy and error a new name (use .iloc because the index holds the string IDs '1'..'66')
final_binding_energy_df[r'$\Delta E_\textrm{int}$ [kcal/mol]'] = [f"{final_binding_energy_df['Binding Energy'].iloc[system_id].round(2):.2f}$\\pm${final_binding_energy_df['Binding Energy Error'].iloc[system_id].round(2):.2f}" for system_id in range(66)]
final_binding_energy_df = final_binding_energy_df[['System',r'$\Delta E_\textrm{int}$ [kcal/mol]','Error Type']]


# Write the DataFrame to a latex input
latex_input_str = '\n'.join(convert_df_to_latex_input(
    final_binding_energy_df,
    start_input = '\\begin{table}',
    label = 'tab:dmc-final-energies',
    caption = r'Final DMC $\Delta E_\textrm{int}$ estimates for the S66 dataset. The polynomial fit (either linear or cubic) used to extrapolate the zero time step value is also reported.',
    end_input = '\\end{table}',
    replace_input = {
    },
    adjustbox = 1,
    center = True,
    df_latex_skip = 0,
    rotate_column_header = True,
    output_str = True,
    column_format = 'll' + 'r'*len(final_binding_energy_df.columns)
).splitlines()[7:-4]) + '\n'

with open('Tables/Table_SI_Final_binding_energy.tex', 'w') as f:
    f.write(r"""\LTcapwidth=\textwidth
    
\begin{longtable}{llrr}
\caption{\label{tab:dmc-final-energies}Final DMC $\Delta E_\textrm{int}$ estimates for the S66 dataset. The polynomial fit (either linear or cubic) used to extrapolate the zero time step value is also reported.} \\

\toprule
 & System & $\Delta E_\textrm{int}$ [kcal/mol] & Fit type \\
\midrule
\endfirsthead



\caption[]{(continued)} \\
\endhead

\multicolumn{4}{r}{{Continued on next page}} \\
\endfoot

\bottomrule
\endlastfoot

""")
    f.write(latex_input_str)
    f.write(r"\end{longtable}")

final_binding_energy_df


ID,System,$\Delta E_\textrm{int}$ [kcal/mol],Error Type
1,Water$\cdots$Water,-5.17$\pm$0.03,$\sigma_\textrm{cubic fit}$
2,Water$\cdots$MeOH,-5.82$\pm$0.04,$\sigma_\textrm{cubic fit}$
3,Water$\cdots$MeNH$_2$,-7.18$\pm$0.04,$\sigma_\textrm{cubic fit}$
4,Water$\cdots$Peptide,-8.58$\pm$0.06,$\sigma_\textrm{cubic fit}$
5,MeOH$\cdots$MeOH,-5.93$\pm$0.10,$\Delta_\textrm{cubic fit}^\textrm{linear fit}$
6,MeOH$\cdots$MeNH$_2$,-7.83$\pm$0.07,$\Delta_\textrm{cubic fit}^\textrm{linear fit}$
7,MeOH$\cdots$Peptide,-8.57$\pm$0.07,$\sigma_\textrm{cubic fit}$
8,MeOH$\cdots$Water,-5.24$\pm$0.07,$\Delta_\textrm{cubic fit}^\textrm{linear fit}$
9,MeNH$_2$$\cdots$MeOH,-3.11$\pm$0.07,$\Delta_\textrm{cubic fit}^\textrm{linear fit}$
10,MeNH$_2$$\cdots$MeNH$_2$,-4.20$\pm$0.10,$\Delta_\textrm{cubic fit}^\textrm{linear fit}$
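
The error-assignment rule used in the time-step cell above (take the larger of the cubic-fit uncertainty and the difference between the cubic and linear extrapolations) can be sketched in isolation. The helpers `extrapolate_to_zero` and `final_error` below are hypothetical stand-ins, not the notebook's `fit_err`, `fun_lin` or `fun_cub`; they assume a plain weighted polynomial fit via `np.polyfit`, and the data are synthetic.

```python
import numpy as np

def extrapolate_to_zero(tau, energy, sigma, degree):
    """Weighted polynomial fit; return the tau -> 0 value and its 1-sigma error."""
    weights = 1.0 / np.asarray(sigma, dtype=float)
    coeffs, cov = np.polyfit(tau, energy, degree, w=weights, cov="unscaled")
    # The last coefficient is the constant term, i.e. the value at tau = 0
    return coeffs[-1], np.sqrt(cov[-1, -1])

def final_error(cubic_value, cubic_sigma, linear_value):
    """Larger of the cubic-fit error and the |cubic - linear| difference."""
    return max(abs(cubic_value - linear_value), cubic_sigma)

# Synthetic time-step data (hypothetical): exactly linear in tau
tau = np.array([0.003, 0.01, 0.03, 0.05, 0.1])
energy = -5.0 + 2.0 * tau
sigma = np.full_like(tau, 0.05)

# Cubic fit over all time steps, linear fit over the smallest ones only
cubic_value, cubic_sigma = extrapolate_to_zero(tau, energy, sigma, degree=3)
small = tau <= 0.011
linear_value, _ = extrapolate_to_zero(tau[small], energy[small], sigma[small], degree=1)

error = final_error(cubic_value, cubic_sigma, linear_value)
print(f"{cubic_value:.2f} +/- {error:.2f}")
```

With this rule, the quoted uncertainty never falls below the statistical error of the cubic fit, and it grows whenever the two extrapolations disagree by more than that error.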


## SI - Acetic acid dimer validation

In [12]:
acetic_acid_data = {'LDA//DLA(eCEPP)//CASINO': {}, 'LDA//TM(eCEPP)//CASINO': {}, 'LDA//DTM(eCEPP)//CASINO': {}, 'LDA//TM(ccECP)//QMCPACK': {}, 'PBE//TM(ccECP)//QMCPACK': {}, 'PBE0//TM(ccECP)//QMCPACK': {},'LDA//AE//QMCPACK': {}}


acetic_acid_data['LDA//DLA(eCEPP)//CASINO'] = pd.read_csv('Data/Acetic_Acid_Validation/LDA_eCEPP_DLA_CASINO.csv',index_col=0).sort_values('tau', ascending=True)
acetic_acid_data['LDA//TM(eCEPP)//CASINO'] = pd.read_csv('Data/Acetic_Acid_Validation/LDA_eCEPP_TM_CASINO.csv',index_col=0).sort_values('tau', ascending=True)
acetic_acid_data['LDA//DTM(eCEPP)//CASINO'] = pd.read_csv('Data/Acetic_Acid_Validation/LDA_eCEPP_DLTM_CASINO.csv',index_col=0).sort_values('tau', ascending=True)
acetic_acid_data['LDA//TM(ccECP)//QMCPACK'] =  pd.read_csv('Data/Acetic_Acid_Validation/LDA_ccECP_TM_QMCPACK.csv', index_col=0, skiprows=2).iloc[:,[-3,-2]]
acetic_acid_data['PBE0//TM(ccECP)//QMCPACK'] =  pd.read_csv('Data/Acetic_Acid_Validation/PBE0_ccECP_TM_QMCPACK.csv', index_col=0, skiprows=2).iloc[:,[-3,-2]]
acetic_acid_data['PBE//TM(ccECP)//QMCPACK'] =  pd.read_csv('Data/Acetic_Acid_Validation/PBE_ccECP_TM_QMCPACK.csv', index_col=0, skiprows=2).iloc[:,[-3,-2]]
acetic_acid_data['LDA//AE//QMCPACK'] =  pd.read_csv('Data/Acetic_Acid_Validation/LDA_AE_QMCPACK.csv', index_col=0, skiprows=2).iloc[:,[-3,-2]]

for method in acetic_acid_data:
    acetic_acid_data[method].columns = ['binding_energy', 'binding_energy_err']


fig, ax = plt.subplots(figsize=(4,3), dpi=300,constrained_layout=True)
ax.set_xlabel( 'DMC timestep [a.u.]' )
ax.set_xticks( [0, 0.003, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.1, 0.2, 0.3 ] )
ax.set_xticklabels( [ '0', '3E-3', '0.01', '0.02', '0.03', '0.04', '0.05', '0.06', '0.1', '0.2', '0.3' ], rotation=90 )
ax.set_xlim( [0,0.1*1.03] )

# Plot the binding energy for each method
for method in acetic_acid_data:
    ax.errorbar(acetic_acid_data[method].index.tolist(),acetic_acid_data[method]['binding_energy'].tolist(),yerr=acetic_acid_data[method]['binding_energy_err'].tolist(),fmt='o',markerfacecolor='none',markersize=4,label=method,alpha=0.45,markeredgewidth=1)


# ax.errorbar(acetic_acid_data['TM PBE0'].index.tolist(),acetic_acid_data['TM PBE0']['binding_energy'].tolist(),yerr=acetic_acid_data['TM PBE0']['binding_energy_err'].tolist(),fmt='x',color='red',markersize=4,label='TM PBE0',alpha=0.45,markeredgewidth=1)
# ax.errorbar(acetic_acid_data['TM PBE'].index.tolist(),acetic_acid_data['TM PBE']['binding_energy'].tolist(),yerr=acetic_acid_data['TM PBE']['binding_energy_err'].tolist(),fmt='x',color='blue',markersize=4,label='TM PBE',alpha=0.45,markeredgewidth=1)
# ax.errorbar(acetic_acid_data['TM LDA'].index.tolist(),acetic_acid_data['TM LDA']['binding_energy'].tolist(),yerr=acetic_acid_data['TM LDA']['binding_energy_err'].tolist(),fmt='x',color='green',markersize=4,label='TM LDA',alpha=0.45,markeredgewidth=1)
# ax.errorbar(acetic_acid_data['AE LDA'].index.tolist(),acetic_acid_data['AE LDA']['binding_energy'].tolist(),yerr=acetic_acid_data['AE LDA']['binding_energy_err'].tolist(),fmt='s',color='brown',markerfacecolor='none',markersize=4,label='AE LDA',alpha=0.45,markeredgewidth=1)

# reference quantum-chemistry result
ax.axhline( ccsdt_raw_references.loc[20,'CCSD(T) Final'], c='gray', ls='--', label='CCSD(T)')
ccsdt_upper_lim = float(ccsdt_raw_references.loc[20,'CCSD(T) Final']) + float(ccsdt_raw_references.loc[20,'CCSD(T) Error'])
ccsdt_lower_lim = float(ccsdt_raw_references.loc[20,'CCSD(T) Final']) - float(ccsdt_raw_references.loc[20,'CCSD(T) Error'])
ax.fill_between([0,0.15],[ccsdt_lower_lim,ccsdt_lower_lim], [ccsdt_upper_lim,ccsdt_upper_lim], color='gray',alpha=0.2,edgecolor='none')

ax.set_ylim([-22,-18])
ax.legend(ncol=2,fontsize=7,frameon=True)
plt.savefig('Figures/Fig_SI_Acetic_Acid_Validation.png')


In [13]:
# Table of the interaction energy at the first tabulated (smallest) time step for each method
acetic_acid_table = {method: {r'$\tau$': f'{acetic_acid_data[method].index.tolist()[0]:.3f}', r'$\Delta E_\textrm{int}$': f"{acetic_acid_data[method]['binding_energy'].tolist()[0]:.2f}$\\pm${acetic_acid_data[method]['binding_energy_err'].tolist()[0]:.2f}" } for method in acetic_acid_data}

acetic_acid_table_df = pd.DataFrame(acetic_acid_table).T
display(acetic_acid_table_df)

# Write the DataFrame to a latex input
latex_input_str = convert_df_to_latex_input(
    acetic_acid_table_df,
    start_input = '\\begin{table}',
    label = 'tab:acetic_acid_validation',
    caption = r'Validation of the DLA localization scheme with an LDA trial wave-function for the AcOH$\cdots$AcOH dimer (ID 20). The smallest time step $\tau$ and the corresponding interaction energy $\Delta E_\textrm{int}$ are reported for various trial wave-functions and localization schemes, as well as for an all-electron LDA calculation.',
    end_input = '\\end{table}',
    center = True,
    df_latex_skip = 0,
    output_str = True,
    column_format = 'll' + 'r'*(len(acetic_acid_table_df.columns)-1)
)


with open('Tables/Table_SI_Acetic_acid_validation.tex','w') as f:
    f.write(latex_input_str)


Method,$\tau$,$\Delta E_\textrm{int}$
LDA//DLA(eCEPP)//CASINO,0.003,-20.27$\pm$0.08
LDA//TM(eCEPP)//CASINO,0.003,-19.99$\pm$0.08
LDA//DTM(eCEPP)//CASINO,0.003,-20.15$\pm$0.10
LDA//TM(ccECP)//QMCPACK,0.002,-20.25$\pm$0.14
PBE//TM(ccECP)//QMCPACK,0.01,-20.10$\pm$0.06
PBE0//TM(ccECP)//QMCPACK,0.01,-20.17$\pm$0.06
LDA//AE//QMCPACK,0.002,-20.19$\pm$0.11
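
One way to read the table above is to express the difference between two entries in units of their combined statistical error. The snippet below is a minimal sketch of that comparison; the helper `n_sigma` is hypothetical (it is not part of the notebook) and assumes that the two error bars are independent one-sigma uncertainties.

```python
import math

def n_sigma(value_a, err_a, value_b, err_b):
    """Difference between two estimates in units of their combined 1-sigma error."""
    combined = math.sqrt(err_a**2 + err_b**2)
    return (value_a - value_b) / combined

# Values copied from the table above (kcal/mol)
dla = (-20.27, 0.08)   # LDA//DLA(eCEPP)//CASINO
tm = (-19.99, 0.08)    # LDA//TM(eCEPP)//CASINO
ae = (-20.19, 0.11)    # LDA//AE//QMCPACK

print(f"DLA vs TM: {n_sigma(*dla, *tm):+.1f} sigma")
print(f"DLA vs AE: {n_sigma(*dla, *ae):+.1f} sigma")
```

Because the entries are taken at different time steps, such a comparison is only indicative.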


## MAIN - Comparison of DMC against CCSD(T) and CCSD(cT)

In [14]:
# Print the final data. Create a figure with three rows and plot final_binding_energy
fig, axs = plt.subplots( nrows=3, ncols=1, figsize=(6.67,7),dpi=600,constrained_layout=True)

datarange1 = list(range(1,24))
datarange2 = list(range(24,47))
datarange3 = list(range(47,67))

axs[0].axhline(0, color='k', ls='--')
axs[1].axhline(0, color='k', ls='--')
axs[2].axhline(0, color='k', ls='--')


axs[0].errorbar(datarange1,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange1], yerr = [final_binding_energy[f'{i}'][1] for i in datarange1], capsize=3, marker = 'none', ls='none', color = 'blue')
axs[1].errorbar(datarange2,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange2],yerr = [final_binding_energy[f'{i}'][1] for i in datarange2], marker = 'none',ls='none', color = 'blue', capsize=3)
axs[2].errorbar(datarange3,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange3], yerr = [final_binding_energy[f'{i}'][1] for i in datarange3], marker = 'none',ls='none', color = 'blue', capsize=3)

# Plot the final (averaged) CCSD(T) reference value
axs[0].scatter(datarange1, [ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0] for x in datarange1],c='silver',marker='x', label=f'CCSD(T) [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0]) for x in datarange1]):.2f}]')
axs[1].scatter(datarange2, [ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0] for x in datarange2],c='silver',marker='x', label=f'CCSD(T) [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0]) for x in datarange2]):.2f}]')
axs[2].scatter(datarange3, [ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0] for x in datarange3],c='silver',marker='x', label=f'CCSD(T) [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0]) for x in datarange3]):.2f}]')

axs[0].set_xticks(datarange1)
# Plot the names in the figure
for i in datarange1:
    axs[0].text(i,-0.9,f"{final_binding_energy[f'{i}'][0]:.2f}({int(round(100*final_binding_energy[f'{i}'][1]))})",fontsize=8,ha='center',rotation=90,  bbox=dict(facecolor='white', edgecolor='none',alpha=0.8 ))

axs[1].set_xticks(datarange2)
for i in datarange2:
    axs[1].text(i,-0.9,f"{final_binding_energy[f'{i}'][0]:.2f}({int(round(100*final_binding_energy[f'{i}'][1]))})",fontsize=8,ha='center',rotation=90,  bbox=dict(facecolor='white', edgecolor='none',alpha=0.8 ))
axs[2].set_xticks(datarange3)
for i in datarange3:
    axs[2].text(i,-0.9,f"{final_binding_energy[f'{i}'][0]:.2f}({int(round(100*final_binding_energy[f'{i}'][1]))})",fontsize=8,ha='center',rotation=90,  bbox=dict(facecolor='white', edgecolor='none',alpha=0.8 ))

axs[0].set_ylim([-1,1])
axs[1].set_ylim([-1,1])
axs[2].set_ylim([-1,1])

axs[2].set_xlabel('S66 system')

axs[0].legend(loc='upper left')
axs[1].legend(loc='upper left')
axs[2].legend(loc='upper left')

axs[0].set_title('H-bonded systems')
axs[1].set_title('Dispersion systems')
axs[2].set_title('Mixed systems')

fig.supylabel('Difference against DMC [kcal/mol]')

plt.savefig('Figures/Fig_MAIN_S66_compare_woCT.png')

# Add the CCSD(cT)-fit estimates for the dispersion-dominated systems
axs[1].scatter(datarange2, [ccsdt_raw_references.loc[x,'CCSD(cT)-fit Final']- final_binding_energy[f'{x}'][0] for x in datarange2],c='gold',marker='x', label=f'CCSD(cT)-fit [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(cT)-fit Final"] - final_binding_energy[f"{x}"][0]) for x in datarange2]):.2f}]')
# Refresh the legend so that the CCSD(cT)-fit entry is included
axs[1].legend(loc='upper left')

plt.savefig('Figures/Fig_MAIN_S66_compare.png')


In [15]:
print(f'Overall MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0]) for x in datarange1 + datarange2 + datarange3]):.2f}')

Overall MAD: 0.21
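
The mean absolute deviation (MAD) printed above, and the mean relative deviation (MRD) used in the relative-difference plots, can be illustrated on a small subset. This is a minimal sketch using the rounded values of the first three S66 systems from the tables above; the notebook itself works with the unrounded data for all 66 systems, so the numbers here are not expected to match the overall MAD.

```python
import numpy as np

# First three S66 systems, rounded values from the tables above (kcal/mol)
dmc = np.array([-5.17, -5.82, -7.18])      # final DMC interaction energies
ccsdt = np.array([-4.99, -5.68, -7.01])    # final CCSD(T) interaction energies

diff = ccsdt - dmc

# Mean absolute deviation in kcal/mol
mad = np.mean(np.abs(diff))
# Mean relative deviation in percent, relative to the DMC value
mrd = np.mean(np.abs(diff / dmc)) * 100

print(f"MAD: {mad:.2f} kcal/mol, MRD: {mrd:.2f}%")
```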


In [16]:
# Plot the difference between CCSD(T) and DMC for the H-bonded systems (IDs 1-23) in a single panel
fig, axs = plt.subplots( figsize=(6.67,3.5),dpi=600,constrained_layout=True)

datarange1 = list(range(1,24))

axs.axhline(0, color='k', ls='--',zorder=1,alpha=0.8)

axs.errorbar(datarange1,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange1], yerr = [final_binding_energy[f'{i}'][1] for i in datarange1], capsize=3, marker = 'none', ls='none', color = 'blue',zorder=1,alpha=0.7)

# Plot the final (averaged) CCSD(T) reference value
axs.scatter(datarange1, [ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0] for x in datarange1],c='silver',marker='x', label=f'CCSD(T) [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0]) for x in datarange1]):.2f}]',alpha=0.85)


axs.set_xticks(datarange1)
axs.set_xticklabels([f"{dimer_info.loc[system_id,'formatted_name']} - {system_id:02d}" for system_id in datarange1],rotation=90)
# Plot the names in the figure
for i in datarange1:
    axs.text(i,-1.1,f"{final_binding_energy[f'{i}'][0]:.2f}({int(round(100*final_binding_energy[f'{i}'][1]))})",fontsize=8,ha='center',rotation=90) #,  bbox=dict(facecolor='white', edgecolor='none',alpha=0.8 ))

axs.set_ylim([-1.2,1.2])
axs.set_yticks([-1.2,-0.6,0.0,0.6,1.2])

axs.set_xlabel('System')

axs.legend(loc='upper left')

axs.set_title('H-bonded systems')


fig.supylabel('Difference against DMC [kcal/mol]')

plt.savefig('Figures/Fig_MAIN_S66_compare_a_H_bonded.png')

In [17]:
# Plot the difference between CCSD(T), CCSD(cT)-fit and DMC for the dispersion-dominated systems (IDs 24-46) in a single panel
fig, axs = plt.subplots( figsize=(6.67,4.2),dpi=600,constrained_layout=True)

datarange1 = list(range(24,47))

axs.axhline(0, color='k', ls='--',zorder=1,alpha=0.8)

axs.errorbar(datarange1,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange1], yerr = [final_binding_energy[f'{i}'][1] for i in datarange1], capsize=3, marker = 'none', ls='none', color = 'blue',zorder=1,alpha=0.7)

# Plot the final (averaged) CCSD(T) reference value
axs.scatter(datarange1, [ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0] for x in datarange1],c='silver',marker='x', label=f'CCSD(T) [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0]) for x in datarange1]):.2f}]',alpha=0.85)

# Plot CCSD(cT)-fit estimates (use datarange1 defined in this cell rather than datarange2 left over from an earlier cell)
axs.scatter(datarange1, [ccsdt_raw_references.loc[x,'CCSD(cT)-fit Final']- final_binding_energy[f'{x}'][0] for x in datarange1],c='gold',marker='x', label=f'CCSD(cT)-fit [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(cT)-fit Final"] - final_binding_energy[f"{x}"][0]) for x in datarange1]):.2f}]',alpha=0.7)

axs.set_xticks(datarange1)
axs.set_xticklabels([f"{dimer_info.loc[system_id,'formatted_name']} - {system_id:02d}" for system_id in datarange1],rotation=90)
# Plot the names in the figure
for i in datarange1:
    axs.text(i,-1.1,f"{final_binding_energy[f'{i}'][0]:.2f}({int(round(100*final_binding_energy[f'{i}'][1]))})",fontsize=8,ha='center',rotation=90) #,  bbox=dict(facecolor='white', edgecolor='none',alpha=0.8 ))

axs.set_ylim([-1.2,1.2])
axs.set_yticks([-1.2,-0.6,0.0,0.6,1.2])

axs.set_xlabel('System')

axs.legend(loc='upper left')

axs.set_title('Dispersion-dominated systems')


fig.supylabel('Difference against DMC [kcal/mol]')

plt.savefig('Figures/Fig_MAIN_S66_compare_b_Dispersion_dominated.png')

In [18]:
# Plot the difference between CCSD(T) and DMC for the mixed systems (IDs 47-66) in a single panel
fig, axs = plt.subplots( figsize=(6.67,4.0),dpi=600,constrained_layout=True)

datarange1 = list(range(47,67))

axs.axhline(0, color='k', ls='--',zorder=1,alpha=0.8)

axs.errorbar(datarange1,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange1], yerr = [final_binding_energy[f'{i}'][1] for i in datarange1], capsize=3, marker = 'none', ls='none', color = 'blue',zorder=1,alpha=0.7)

# Plot the final (averaged) CCSD(T) reference value
axs.scatter(datarange1, [ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0] for x in datarange1],c='silver',marker='x', label=f'CCSD(T) [MAD: {np.mean([abs(ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0]) for x in datarange1]):.2f}]',alpha=0.85)

axs.set_xticks(datarange1)
axs.set_xticklabels([f"{dimer_info.loc[system_id,'formatted_name']} - {system_id:02d}" for system_id in datarange1],rotation=90)
# Plot the names in the figure
for i in datarange1:
    axs.text(i,-1.1,f"{final_binding_energy[f'{i}'][0]:.2f}({int(round(100*final_binding_energy[f'{i}'][1]))})",fontsize=8,ha='center',rotation=90) #,  bbox=dict(facecolor='white', edgecolor='none',alpha=0.8 ))

axs.set_ylim([-1.2,1.2])
axs.set_yticks([-1.2,-0.6,0.0,0.6,1.2])

axs.set_xlabel('System')

axs.legend(loc='upper left')

axs.set_title('Mixed systems')


fig.supylabel('Difference against DMC [kcal/mol]')

plt.savefig('Figures/Fig_MAIN_S66_compare_c_Mixed.png')

In [19]:
# Plot relative differences for all systems
fig, axs = plt.subplots( nrows=3, ncols=1, figsize=(6.67,7),dpi=600,constrained_layout=True)

datarange1 = list(range(1,24))
datarange2 = list(range(24,47))
datarange3 = list(range(47,67))

axs[0].axhline(0, color='k', ls='--')
axs[1].axhline(0, color='k', ls='--')
axs[2].axhline(0, color='k', ls='--')


# DMC reference at zero in each panel, with the DMC error expressed as a percentage of the DMC binding energy
axs[0].errorbar(datarange1,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange1], yerr = [abs(final_binding_energy[f'{i}'][1]*100/final_binding_energy[f'{i}'][0]) for i in datarange1], capsize=3, marker = 'none', ls='none', color = 'blue')
axs[1].errorbar(datarange2,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange2],yerr = [abs(final_binding_energy[f'{i}'][1]*100/final_binding_energy[f'{i}'][0]) for i in datarange2], marker = 'none',ls='none', color = 'blue', capsize=3)
axs[2].errorbar(datarange3,[final_binding_energy[f'{i}'][0] - final_binding_energy[f'{i}'][0] for i in datarange3], yerr = [abs(final_binding_energy[f'{i}'][1]*100/final_binding_energy[f'{i}'][0]) for i in datarange3], marker = 'none',ls='none', color = 'blue', capsize=3)

# Plot the final (averaged) CCSD(T) reference value
axs[0].scatter(datarange1, [(ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange1],c='silver',marker='x', label=f'CCSD(T) [MRD: {np.mean([abs((ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0])/final_binding_energy[f"{x}"][0])*100 for x in datarange1]):.2f}%]')
axs[1].scatter(datarange2, [(ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange2],c='silver',marker='x', label=f'CCSD(T) [MRD: {np.mean([abs((ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0])/final_binding_energy[f"{x}"][0])*100 for x in datarange2]):.2f}%]')
axs[2].scatter(datarange3, [(ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange3],c='silver',marker='x', label=f'CCSD(T) [MRD: {np.mean([abs((ccsdt_raw_references.loc[x,"CCSD(T) Final"] - final_binding_energy[f"{x}"][0])/final_binding_energy[f"{x}"][0])*100 for x in datarange3]):.2f}%]')

# Plot the cT_data
# axs[0].scatter(datarange1, [(-s66_cT_data[x-1]- final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange1],c='gold',marker='x', label=f'CCSD(cT)-fit [MAD: {np.mean([abs(-s66_cT_data[x-1] - final_binding_energy[f"{x}"][0]) for x in datarange1]):.2f}]')
axs[1].scatter(datarange2, [(ccsdt_raw_references.loc[x,'CCSD(cT)-fit Final']- final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange2],c='gold',marker='x', label=f'CCSD(cT)-fit [MRD: {np.mean([abs((ccsdt_raw_references.loc[x,"CCSD(cT)-fit Final"] - final_binding_energy[f"{x}"][0])/final_binding_energy[f"{x}"][0])*100 for x in datarange2]):.2f}%]')
# axs[2].scatter(datarange3, [(-s66_cT_data[x-1]- final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange3],c='gold',marker='x', label=f'CCSD(cT)-fit [MAD: {np.mean([abs(-s66_cT_data[x-1] - final_binding_energy[f"{x}"][0]) for x in datarange3]):.2f}]')

axs[0].set_xticks(datarange1)
axs[1].set_xticks(datarange2)
axs[2].set_xticks(datarange3)

axs[0].set_ylim([-10,20])
axs[1].set_ylim([-10,20])
axs[2].set_ylim([-10,20])

axs[2].set_xlabel('S66 system')
axs[0].legend(loc='upper center')
axs[1].legend(loc='upper center')
axs[2].legend(loc='upper center')

axs[0].set_title('H-bonded systems')
axs[1].set_title('Dispersion systems')
axs[2].set_title('Mixed systems')

fig.supylabel('Relative difference against DMC [%]')
plt.savefig('Figures/Fig_SI_S66_compare_relative.png')
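
The legends in these figures summarise each panel with a single number, labelled MAD for the absolute plots and MRD for the relative ones. As computed in the cells above, MAD is the mean of |CCSD(T) - DMC| and MRD is the mean of |(CCSD(T) - DMC)/DMC| in percent. The sketch below restates both with plain Python; the function names are hypothetical, and the three input pairs are the rounded values for S66 systems 1 to 3 shown in the SI comparison table, used only for illustration.

```python
def mean_absolute_deviation(reference, other):
    """MAD: mean of |other - reference|, in the units of the inputs."""
    return sum(abs(o - r) for r, o in zip(reference, other)) / len(reference)


def mean_relative_deviation(reference, other):
    """MRD: mean of |(other - reference) / reference|, in percent."""
    return 100 * sum(abs((o - r) / r) for r, o in zip(reference, other)) / len(reference)


# Illustrative input only: rounded interaction energies (kcal/mol) for S66 systems 1-3
dmc = [-5.17, -5.82, -7.18]
ccsdt = [-4.99, -5.68, -7.01]

mad = mean_absolute_deviation(dmc, ccsdt)
mrd = mean_relative_deviation(dmc, ccsdt)
print(f"MAD: {mad:.2f} kcal/mol, MRD: {mrd:.2f}%")
```

Because only three rounded entries are used, these numbers are not expected to match the values printed in the figure legends.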

## SI - Comparison of DMC against CCSD(T)

In [20]:
# Turn the final_all_energy dictionary into a pandas dataframe
final_binding_energy_comparison_df = pd.DataFrame(final_all_energy).T

# Add a 'System' column from dimer_info.loc[system_id,'formatted_name']
final_binding_energy_comparison_df['System'] = [dimer_info.loc[system_id,'formatted_name'] for system_id in range(1,67)]

# Combine the binding energy and its error into a single 'value$\pm$error' string column
# (.iloc makes the positional lookup explicit and avoids the pandas positional-indexing warning)
final_binding_energy_comparison_df[r'$\Delta E_\textrm{int.}^\textrm{DMC}$ [kcal/mol]'] = [rf"{final_binding_energy_comparison_df['Binding Energy'].iloc[system_id].round(2):.2f}$\pm${final_binding_energy_comparison_df['Binding Energy Error'].iloc[system_id].round(2):.2f}" for system_id in range(66)]
final_binding_energy_comparison_df = final_binding_energy_comparison_df[['System',r'$\Delta E_\textrm{int.}^\textrm{DMC}$ [kcal/mol]']]
final_binding_energy_comparison_df[r'$\Delta E_\textrm{int.}^\textrm{CCSD(T)}$ [kcal/mol]'] = ccsdt_references['CCSD(T) Final'].tolist()
# Deviation of CCSD(T) from DMC, with the two errors combined in quadrature
final_binding_energy_comparison_df['Deviation'] = [rf"{ccsdt_raw_references['CCSD(T) Final'].tolist()[i] - final_binding_energy[f'{i+1}'][0]:.2f}$\pm${np.sqrt(ccsdt_raw_references['CCSD(T) Error'].tolist()[i]**2 + final_binding_energy[f'{i+1}'][1]**2):.2f}" for i in range(66)]

# Write the DataFrame to a latex input
latex_input_str = '\n'.join(convert_df_to_latex_input(
    final_binding_energy_comparison_df,
    start_input = '\\begin{table}',
    label = 'tab:dmc-cc-comparison',
    caption = r'Final DMC and CCSD(T) $\Delta E_\textrm{int.}$ estimates for the S66 dataset in kcal/mol, together with the deviation of CCSD(T) from DMC.',
    end_input = '\\end{table}',
    replace_input = {
    },
    adjustbox = 1,
    center = True,
    df_latex_skip = 0,
    rotate_column_header = True,
    output_str = True,
    column_format = 'll' + 'r'*(len(final_binding_energy_comparison_df.columns)-1)
).splitlines()[7:-4]) + '\n'

with open('Tables/Table_SI_DMC_CC_compare.tex', 'w') as f:
    f.write(r"""\LTcapwidth=\textwidth
\small
\begin{longtable}{llrrr}
\caption{\label{tab:dmc-cc-comparison}Final DMC and CCSD(T) $\Delta E_\textrm{int.}$ estimates for the S66 dataset in kcal/mol, together with the deviation of CCSD(T) from DMC.} \\

\toprule
 & System & $\Delta E_\textrm{int.}^\textrm{DMC}$ [kcal/mol] & $\Delta E_\textrm{int.}^\textrm{CCSD(T)}$ [kcal/mol] & Deviation [kcal/mol] \\
\midrule
\endfirsthead



\caption[]{(continued)} \\
\endhead

\multicolumn{5}{r}{{Continued on next page}} \\
\endfoot

\bottomrule
\endlastfoot

""")
    f.write(latex_input_str)
    f.write(r"\end{longtable}")
    f.write(r"\normalsize")

display(final_binding_energy_comparison_df)



Unnamed: 0,System,$\Delta E_\textrm{int.}^\textrm{DMC}$ [kcal/mol],$\Delta E_\textrm{int.}^\textrm{CCSD(T)}$ [kcal/mol],Deviation [kcal/mol]
1,Water$\cdots$Water,-5.17$\pm$0.03,-4.99$\pm$0.03,0.17$\pm$0.04
2,Water$\cdots$MeOH,-5.82$\pm$0.04,-5.68$\pm$0.03,0.14$\pm$0.05
3,Water$\cdots$MeNH$_2$,-7.18$\pm$0.04,-7.01$\pm$0.05,0.18$\pm$0.06
4,Water$\cdots$Peptide,-8.58$\pm$0.06,-8.20$\pm$0.03,0.39$\pm$0.07
5,MeOH$\cdots$MeOH,-5.93$\pm$0.10,-5.83$\pm$0.02,0.09$\pm$0.10
6,MeOH$\cdots$MeNH$_2$,-7.83$\pm$0.07,-7.64$\pm$0.04,0.19$\pm$0.08
7,MeOH$\cdots$Peptide,-8.57$\pm$0.07,-8.32$\pm$0.03,0.25$\pm$0.07
8,MeOH$\cdots$Water,-5.24$\pm$0.07,-5.08$\pm$0.02,0.16$\pm$0.07
9,MeNH$_2$$\cdots$MeOH,-3.11$\pm$0.07,-3.10$\pm$0.02,0.02$\pm$0.08
10,MeNH$_2$$\cdots$MeNH$_2$,-4.20$\pm$0.10,-4.20$\pm$0.03,-0.00$\pm$0.10
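
The Deviation column is the CCSD(T) value minus the DMC value, and its uncertainty adds the CCSD(T) and DMC errors in quadrature, which treats the two errors as independent. A minimal sketch for a single row, assuming a hypothetical helper `deviation_with_error` and using the rounded entries of system 1:

```python
from math import sqrt


def deviation_with_error(dmc, dmc_err, ccsdt, ccsdt_err):
    """Return CCSD(T) - DMC and its uncertainty (errors added in quadrature)."""
    return ccsdt - dmc, sqrt(ccsdt_err**2 + dmc_err**2)


# Rounded table entries for system 1 (water dimer), in kcal/mol
deviation, error = deviation_with_error(-5.17, 0.03, -4.99, 0.03)
print(f"{deviation:.2f}$\\pm${error:.2f}")
```

The rounded inputs give a deviation of 0.18 kcal/mol, whereas the table lists 0.17 kcal/mol: the notebook subtracts the unrounded energies and rounds only the final result.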


## MAIN - Analysis of differences based on SAPT

In [21]:
# Plot the error between DMC and CCSD(T) against the dispersion/electrostatic ratio

import Data.SAPT.Sherrill_Biofragment_SAPT_S66 as sapt_s66

binding_energy_decomposition = pd.DataFrame( sapt_s66.DATA )

binding_energy_decomposition['ELST DISP+ELST RATIO'] = binding_energy_decomposition['SAPT ELST ENERGY'] /(binding_energy_decomposition['SAPT DISP ENERGY'] + binding_energy_decomposition['SAPT ELST ENERGY'])

binding_energy_decomposition['LOG(ELST DISP RATIO)'] = np.log(binding_energy_decomposition['SAPT ELST ENERGY'] /(binding_energy_decomposition['SAPT DISP ENERGY']))
binding_energy_decomposition['Dimer Formatted Name'] = dimer_info['formatted_name'].tolist()


fig, axs = plt.subplots(figsize=(3.36,3.5),dpi=600,constrained_layout=True)

quantity_to_look_at = 'LOG(ELST DISP RATIO)'

axs.scatter(np.array(binding_energy_decomposition[quantity_to_look_at].tolist())[[x-1 for x in datarange1]], [(ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange1],c='red',marker='x', label='Electrostatic')
axs.scatter(np.array(binding_energy_decomposition[quantity_to_look_at].tolist())[[x-1 for x in datarange2]], [(ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange2],c='blue',marker='x',label='Dispersion')
axs.scatter(np.array(binding_energy_decomposition[quantity_to_look_at].tolist())[[x-1 for x in datarange3]], [(ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in datarange3],c='green',marker='x', label='Mixed')

axs.set_xlabel(r'LOG(ELST/DISP) from SAPT')
axs.set_ylabel(r'[DMC-CCSD(T)]/|DMC| [%]')
axs.legend()

plt.savefig('Figures/Fig_MAIN_Error_decomposition.png')

In [22]:
# Get the R^2 value for the linear fit
from scipy.stats import linregress

all_systems = datarange1 + datarange2 + datarange3
# x: LOG(ELST/DISP) from SAPT; y: relative difference of CCSD(T) against DMC in percent
x_fit = np.array(binding_energy_decomposition[quantity_to_look_at].tolist())[[x-1 for x in all_systems]]
y_fit = [(ccsdt_raw_references.loc[x,'CCSD(T) Final'] - final_binding_energy[f'{x}'][0])*100/final_binding_energy[f'{x}'][0] for x in all_systems]
slope, intercept, r_value, p_value, std_err = linregress(x_fit, y_fit)
print(f'R^2 value for the linear fit: {r_value**2:.2f}')

R^2 value for the linear fit: 0.78
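
For a straight-line fit, R^2 is the square of the Pearson correlation coefficient between x and y, which is what `r_value**2` reports. The sketch below spells this out with the standard library instead of `scipy`. It is an illustration under stated assumptions: `linear_fit_r2` is a hypothetical helper, and the inputs are the rounded values for S66 systems 1 to 10 only, as displayed in the tables of this notebook, so the result is not expected to reproduce the 0.78 obtained for all 66 systems.

```python
from math import fsum


def linear_fit_r2(x, y):
    """Least-squares line y = slope*x + intercept and its R^2 (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = fsum(x) / n, fsum(y) / n
    sxx = fsum((xi - mean_x) ** 2 for xi in x)
    syy = fsum((yi - mean_y) ** 2 for yi in y)
    sxy = fsum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared Pearson correlation
    return slope, intercept, sxy**2 / (sxx * syy)


# Rounded values for S66 systems 1-10 only (illustrative subset)
log_elst_disp = [1.947, 1.695, 1.792, 1.623, 1.534, 1.506, 1.419, 1.765, 0.986, 0.882]
dmc = [-5.17, -5.82, -7.18, -8.58, -5.93, -7.83, -8.57, -5.24, -3.11, -4.20]
ccsdt = [-4.99, -5.68, -7.01, -8.20, -5.83, -7.64, -8.32, -5.08, -3.10, -4.20]

# Same quantity as on the y-axis of the scatter plot: (CCSD(T) - DMC) * 100 / DMC
rel_diff = [(c - d) * 100 / d for d, c in zip(dmc, ccsdt)]

slope, intercept, r2 = linear_fit_r2(log_elst_disp, rel_diff)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R^2 = {r2:.2f}")
```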


In [23]:
binding_energy_decomposition_df = binding_energy_decomposition[['Dimer Formatted Name','SAPT ELST ENERGY','SAPT EXCH ENERGY','SAPT IND ENERGY','SAPT DISP ENERGY','LOG(ELST DISP RATIO)']]
binding_energy_decomposition_df.columns = ['System','ELST','EXCH','IND','DISP','LOG(ELST/DISP)']

# Round to 3 decimal places
binding_energy_decomposition_df = binding_energy_decomposition_df.round(3)
display(binding_energy_decomposition_df)

# Write into latex input
latex_input_str = convert_df_to_latex_input(
    binding_energy_decomposition_df,
    start_input = '\\begin{table}',
    label = 'tab:sapt_s66_decomposition',
    caption = r'Symmetry-adapted perturbation theory (SAPT) energy decomposition for the S66 dataset taken from Ref.~\citenum{SAPT_Sherrill} using the SAPT0S-SA-jadz level of theory. The electrostatic (ELST), exchange (EXCH), induction (IND) and dispersion (DISP) components of the interaction energy are reported. The natural logarithm of the ratio between the electrostatic and dispersion energy is also reported.',
    end_input = '\\end{table}',
    replace_input = {
        '000 &': ' &',
        '000 \\': ' \\',
    },
    center = True,
    index=False,
    df_latex_skip = 0,
    output_str = True,
    column_format = 'l' + 'r'*(len(binding_energy_decomposition_df.columns)-1)
)

# Make table into multiple pages
latex_input_str = '\n'.join(latex_input_str.splitlines()[7:-4]) + '\n'


with open('Tables/Table_SI_SAPT_S66_decomposition.tex','w') as f:
    f.write(r"""\LTcapwidth=\textwidth
\begin{longtable}{lrrrrr}
\caption{\label{tab:sapt_s66_decomposition}Symmetry-adapted perturbation theory (SAPT) energy decomposition for the S66 dataset taken from Ref.~\citenum{SAPT_Sherrill} using the SAPT0S-SA-jadz level of theory. The electrostatic (ELST), exchange (EXCH), induction (IND) and dispersion (DISP) components of the interaction energy are reported. The natural logarithm of the ratio between the electrostatic and dispersion energy is also reported.} \\

\toprule
System & ELST & EXCH & IND & DISP & LOG(ELST/DISP) \\ 
\midrule
\endfirsthead

\caption[]{(continued)}\\
\toprule
System & ELST & EXCH & IND & DISP & LOG(ELST/DISP) \\ 
\midrule
\endhead

\multicolumn{6}{r}{(Continued on next page)}\\
\endfoot

\bottomrule
\endlastfoot


""")
    f.write(latex_input_str)
    f.write(r"\end{longtable}")

Unnamed: 0,System,ELST,EXCH,IND,DISP,LOG(ELST/DISP)
S66-1,Water$\cdots$Water,-8.569,6.651,-1.992,-1.222,1.947
S66-2,Water$\cdots$MeOH,-9.517,8.04,-2.456,-1.747,1.695
S66-3,Water$\cdots$MeNH$_2$,-12.719,11.83,-3.785,-2.12,1.792
S66-4,Water$\cdots$Peptide,-13.376,11.329,-3.791,-2.639,1.623
S66-5,MeOH$\cdots$MeOH,-9.547,8.413,-2.586,-2.059,1.534
S66-6,MeOH$\cdots$MeNH$_2$,-13.21,13.167,-4.194,-2.93,1.506
S66-7,MeOH$\cdots$Peptide,-13.224,12.114,-3.984,-3.199,1.419
S66-8,MeOH$\cdots$Water,-8.454,6.853,-2.092,-1.447,1.765
S66-9,MeNH$_2$$\cdots$MeOH,-4.35,4.261,-0.991,-1.622,0.986
S66-10,MeNH$_2$$\cdots$MeNH$_2$,-5.97,6.435,-1.489,-2.472,0.882
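
The last column can be checked directly from the ELST and DISP columns: both components are negative for these systems, so their ratio is positive and its natural logarithm is well defined. A positive value means the electrostatic term is larger in magnitude than the dispersion term. The short sketch below recomputes the column for three of the rows shown above; the dictionary layout is an assumption made for illustration.

```python
from math import log

# (ELST, DISP) copied from the table above, in the table's units
sapt_rows = {
    "S66-1": (-8.569, -1.222),
    "S66-9": (-4.35, -1.622),
    "S66-10": (-5.97, -2.472),
}

# Natural logarithm of the ELST/DISP ratio for each system
log_ratio = {name: log(elst / disp) for name, (elst, disp) in sapt_rows.items()}

for name, value in log_ratio.items():
    print(f"{name}: {value:.3f}")
```

Small differences in the last digit relative to the table can occur because the inputs used here are already rounded to three decimals.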
