<font size="14">Collect R-L BCPs with corresponding regions and Rho values.</font>

This script is designed to pull data from a computational chemistry output file and merge the desired data into an Excel file that shows the bond paths between ligand and enzyme atoms with its corresponding Rho value. The output will be an Excel file containing system information about the bond paths, critical points, the regions, and charge density (rho).

In [1]:
import os
import pandas as pd
import re

current_directory = os.getcwd()
file_path = os.path.join(current_directory, "pyrr1_c1_sys1_SP_QTAIM.out") # Modify file name
ligand = 'pyrr1' # Modify this as needed

In order to run this script, which modifies and adjusts Excel files, the python environment needs xlsxwriter and openpyxl installed.

In your terminal, install xlsxwriter with the following command:

    pip install xlsxwriter

and

    pip install openpyxl


<font size ="10">Section 1: Collecting Bond path information.</font>

This section collects the "Bond Path" information from the AMS output file, used as input here, and creates an Excel file containing all bond path numbers, critical point numbers, and atom numbers.

In [2]:
temp_file_path = os.path.join(current_directory, "temp_file.txt")
excel_path = os.path.join(current_directory, "bond_paths_full.xlsx")


section_found = False
section_content = []

with open(file_path, 'r') as file:
    for line in file:
        # Start recording from the occurrence of '   1      2  '
        if '   1      2  ' in line and not section_found:
            section_found = True
            section_content.append(line)
            continue

        if section_found:
            if '---' in line or 'ANOTHER_SECTION_TITLE' in line or line.strip() == '':
                break
            section_content.append(line)
    
    if not section_found:
        print("Pattern '   1      2  ' not found. Exiting.")
        raise ValueError("Required pattern not found in the file.")

with open(temp_file_path, 'w') as temp_file:
    temp_file.writelines(section_content)

temp_file_df = pd.read_csv(temp_file_path)

print("First 5 rows of the DataFrame:")
print(temp_file_df.head())
print()
print(f"The data has been written to the file: {temp_file_path}")

First 5 rows of the DataFrame:
  1   582       1      2     1.810282     1.811319      21
0     2   350       1     10     1.533579     1.5...      
1     3   415       1    156     1.107736     1.1...      
2     4   711       1    157     1.109333     1.1...      
3     5   516       1    251     2.893704     3.0...      
4     6   348       2      3     1.196800     1.1...      

The data has been written to the file: c:\Users\Bellu\OneDrive\Desktop\temp_file.txt


In [3]:
# Define custom column widths from the input file; currently interested in arrays 0-3; other arrays are important later
colspecs = [(0, 4), (5, 10), (11, 20), (21, 30), (40, 50), (50, 60), (60, 70)]

df = pd.read_fwf(temp_file_path, colspecs=colspecs)

df.columns = ['#BP', '#CP', 'Atom 1', 'Atom 2', 'Distance', 'BP Length', 'BP Steps']

df['Atom 1'] = pd.to_numeric(df['Atom 1'], errors='coerce') # Ensures integers; returns NaN if a number isn't present
df['Atom 2'] = pd.to_numeric(df['Atom 2'], errors='coerce')

df_shortened = df.iloc[:, :4]

# Prevent resetting the index to avoid any unintended re-indexing; this avoids atom 100 written as 0, 101 as 1, 102 as 2, ... 
df_shortened.reset_index(drop=True, inplace=True)

with pd.ExcelWriter(excel_path, engine='xlsxwriter') as writer:
    df_shortened.to_excel(writer, index=False)

    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    worksheet.set_column('A:D', 15) 

os.remove(temp_file_path)

print("First 5 rows of the DataFrame:")
print(df_shortened.head())
print()
print(f"The data has been written to the file: {excel_path}")

First 5 rows of the DataFrame:
   #BP  #CP  Atom 1  Atom 2
0    2  350       1      10
1    3  415       1     156
2    4  711       1     157
3    5  516       1     251
4    6  348       2       3

The data has been written to the file: c:\Users\Bellu\OneDrive\Desktop\bond_paths_full.xlsx


<font size="12">Section 2: Collecting Geometry information.</font>

In this block, the code reads the "Geometry" section of the input file and places all atom numbers and atom regions into a new Excel file.

In [None]:
temp_file_path2 = os.path.join(current_directory, "temp_file2.txt")
geometries = os.path.join(current_directory, "geometry.xlsx")

section_found2 = False
section_content2 = []

with open(file_path, 'r') as file2:
    for line2 in file2:
        # Check if the line contains the first occurrence of '#BP'
        if '  Index Symbol   x (angstrom)   y (angstrom)   z (angstrom)' in line2 and not section_found2:
            section_found2 = True  # Start recording from this point
            section_content2.append(line2)
            continue

        if section_found2:
            if '---' in line2 or 'ANOTHER_SECTION_TITLE' in line2 or line2.strip() == '':
                break
          
            section_content2.append(line2)

    if not section_found2:
        print("Pattern for geometries not found. Exiting.")
        raise ValueError("Required pattern not found in the file.")

with open(temp_file_path2, 'w') as temp_file2: #local temp file
    temp_file2.writelines(section_content2)

column_widths = [7, 7, 15, 15, 15, 33]  # set widths for temporary file, accounts for long region names

df2 = pd.read_fwf(temp_file_path2, widths=column_widths)

df2.to_excel(geometries, index=False)

df2_shortened = df2.iloc[:, [0, 5]]  # Index 0 is atom number, index 5 is region name

df2_shortened.to_excel(geometries, index=False)

os.remove('temp_file2.txt')

print("First 5 rows of the DataFrame:")
print(df2_shortened.head())
print()
print(f"The data has been written to the file: {geometries}")

First 5 rows of the DataFrame:
   ------                      -
0       1  region=cys170minusONH
1       2  region=cys170minusONH
2       3  region=cys170minusONH
3       4  region=cys170minusONH
4       5   region=thr107minusCO

The data has been written to the file: c:\Users\Bellu\OneDrive\Desktop\geometry.xlsx


<font size="12">Section 3: Collate Bond Path Data with Geometry Data</font>

The "Geometry" file is read and merged into the "Bond Path" file, creating a third Excel that contains the entire system's atoms, regions, and bond path numbers.

In [6]:
bond_paths_file = os.path.join(current_directory, "bond_paths_full.xlsx")
geometry_file = os.path.join(current_directory, "geometry_shortened.xlsx")
output_file = os.path.join(current_directory, "bond_path_with_geo_data.xlsx")

def merge_bond_paths_with_regions(bond_paths_file, geometry_file, output_file):
    bond_paths = pd.read_excel(excel_path)
    geometry = pd.read_excel(geometries)
    
    # Creates column names for the geometry data; because bond_path has headers, so to append them, geometry nust have columns also
    geometry.columns = ['Atom ID', 'Region']
    
    bond_paths = bond_paths.merge(geometry, how='left', left_on='Atom 1', right_on='Atom ID') 
    bond_paths = bond_paths.rename(columns={'Region': 'Region Atom 1'}).drop('Atom ID', axis=1)
    
    bond_paths = bond_paths.merge(geometry, how='left', left_on='Atom 2', right_on='Atom ID')
    bond_paths = bond_paths.rename(columns={'Region': 'Region Atom 2'}).drop('Atom ID', axis=1)
    
    bond_paths.to_excel(output_file, index=False)

merge_bond_paths_with_regions(bond_paths_file, geometry_file, output_file)

output_df = pd.read_excel(output_file)

print("First 5 rows of the DataFrame:")
print(output_df.head())
print()
print(f"The data has been written to the file: {output_file}")

First 5 rows of the DataFrame:
   #BP  #CP  Atom 1  Atom 2          Region Atom 1          Region Atom 2
0    2  350       1      10  region=cys170minusONH  region=cys170minusONH
1    3  415       1     156  region=cys170minusONH  region=cys170minusONH
2    4  711       1     157  region=cys170minusONH  region=cys170minusONH
3    5  516       1     251  region=cys170minusONH          region=val77R
4    6  348       2       3  region=cys170minusONH  region=cys170minusONH

The data has been written to the file: c:\Users\Bellu\OneDrive\Desktop\bond_path_with_geo_data.xlsx


<font size="12">Section 4: Read Bond Path-Geometry Data for Ligand.</font>

This section uses the "Bond Path with Geo Data" Excel file, which contains the entire system's bond paths, critical points, atoms, and regions for the variable 'ligand'. This is assigned at the top of the script as pyrr3. The 5 & 6 columns of the file, which contain region information, are read and return only rows where 'ligand' appears once. This is appended to a file called "Bond Path Filtered". 

There is no fail for this section; if the input had a Bond Path section, then it has Rho values.

In [8]:
bond_path = os.path.join(current_directory, "bond_path_with_geo_data.xlsx")
bond_path_df = pd.read_excel(bond_path)

filtered_df = bond_path_df[((bond_path_df.iloc[:, 4].str.contains(ligand, na=False)) & 
                            (~bond_path_df.iloc[:, 5].str.contains(ligand, na=False))) |
                           ((~bond_path_df.iloc[:, 4].str.contains(ligand, na=False)) & 
                            (bond_path_df.iloc[:, 5].str.contains(ligand, na=False)))]

filtered_df.to_excel(os.path.join(current_directory, "bond_paths_filtered.xlsx"), index=False)

print("First 5 rows of the DataFrame:")
print(filtered_df.head())
print()
print(f"The data has been written to the file: {bond_path}")

First 5 rows of the DataFrame:
    #BP  #CP  Atom 1  Atom 2          Region Atom 1 Region Atom 2
5     7  539       2     102  region=cys170minusONH  region=pyrr1
29   31  543      12     240    region=ala26minusNH  region=pyrr1
32   34  393      14      76    region=ala26minusNH  region=pyrr1
33   35  391      14     183    region=ala26minusNH  region=pyrr1
46   48  425      21      34         region=gln28BB  region=pyrr1

The data has been written to the file: c:\Users\Bellu\OneDrive\Desktop\bond_path_with_geo_data.xlsx


<font size ="12">Section 5: Collate Rho Data.</font>

Using the "Bond Paths Filtered" file, the input file is read for rho values of bond paths which contain exactly one instance of 'ligand'. The rho values and data from "Bond Paths Filtered" are appended to "Bond Paths Rho", which now contains all the desired data: bond path numbers, critical point numbers, both regions, and rho.

In [9]:
bond_path2 = os.path.join(current_directory, "bond_paths_filtered.xlsx")
bond_path_df2 = pd.read_excel(bond_path2)

def extract_rho_value(bond_path_number, file_path):
    bond_path_pattern = f'CP #\\s*{bond_path_number}'  
    rho_pattern = r'Rho\s*=\s*([\d.eE+-]+)'  # Regex to capture scientific notation
    section_found = False

    with open(file_path, 'r') as file:
        for line in file:
            if re.search(bond_path_pattern, line):
                section_found = True  
                continue
            
            if section_found:
                match = re.search(rho_pattern, line)
                if match:
                    return float(match.group(1))  # Return the Rho value in scientific notation as a float
    return None

bond_path_df2['Rho'] = bond_path_df2['#CP'].apply(lambda x: extract_rho_value(x, file_path))

bond_path_df2.to_excel(os.path.join(current_directory, "bond_paths_rho.xlsx"), index=False)

with pd.ExcelWriter(os.path.join(current_directory, "bond_paths_rho.xlsx"), engine='xlsxwriter') as writer:
    bond_path_df2.to_excel(writer, index=False)

    workbook = writer.book
    worksheet = writer.sheets['Sheet1']

    worksheet.set_column('E:E', 25)  
    worksheet.set_column('F:F', 25) 
    worksheet.set_column('G:G', 20) 

    last_row = len(bond_path_df2) + 1  # +1 for the header row

    worksheet.write_formula(f'G{last_row + 1}', f'=SUM(G2:G{last_row})', 
                            workbook.add_format({'bold': True}))
    
print("First 5 rows of the DataFrame:")
print(bond_path_df2.head())
print()
print(f"The data has been written to the file: {bond_path2}")


First 5 rows of the DataFrame:
   #BP  #CP  Atom 1  Atom 2          Region Atom 1 Region Atom 2       Rho
0    7  539       2     102  region=cys170minusONH  region=pyrr1  0.013294
1   31  543      12     240    region=ala26minusNH  region=pyrr1  0.017597
2   34  393      14      76    region=ala26minusNH  region=pyrr1  0.010896
3   35  391      14     183    region=ala26minusNH  region=pyrr1  0.014879
4   48  425      21      34         region=gln28BB  region=pyrr1  0.005474

The data has been written to the file: c:\Users\Bellu\OneDrive\Desktop\bond_paths_filtered.xlsx


Run one of the following cells depending on your OS.

In [None]:
# Mac
os.system(f"open {os.path.join(current_directory, 'bond_paths_rho.xlsx')}")

In [None]:
# Linux
os.system(f"xdg-open {os.path.join(current_directory, 'bond_paths_rho.xlsx')}")

In [10]:
# Windows
os.startfile(os.path.join(current_directory, "bond_paths_rho.xlsx"))

<font size="12">Section 6: Remove all unnecessary files. </font>

For troubleshooting purposes, absolutely don't run following script block.

In [11]:
def remove_files(filenames, directory):
    for filename in filenames:
        file_path = os.path.join(directory, filename)
        try:
            os.remove(file_path)
            print(f"Removed: {file_path}")
        except FileNotFoundError:
            print(f"File not found: {file_path}")
        except Exception as e:
            print(f"Error removing {file_path}: {e}")

files_to_remove = [
    "bond_path_with_geo_data.xlsx",
    "bond_paths_filtered.xlsx",
    "bond_paths_full.xlsx",
    "geometry.xlsx"
]

remove_files(files_to_remove, current_directory)


Removed: c:\Users\Bellu\OneDrive\Desktop\bond_path_with_geo_data.xlsx
Removed: c:\Users\Bellu\OneDrive\Desktop\bond_paths_filtered.xlsx
Removed: c:\Users\Bellu\OneDrive\Desktop\bond_paths_full.xlsx
Removed: c:\Users\Bellu\OneDrive\Desktop\geometry.xlsx


<font size=12>Is this accurate?</font>

How do I know this worked instead of the script collecting random lines and appending values? The Excel generated with this script was compared against two other Excels: one created by two individuals performing manual analyses and one created by hobbling the results of two bash scripts together. The values were the same.

<font size=12>Future Work</font>

Next steps will be to make the entire script a single function that can iterate over several AMS output files, and collate them into a single Excel file.