<font size="14">Collect R-L BCPs with corresponding regions and Rho values.</font>

This script pulls data from a computational chemistry output file and merges it into an Excel file that shows the bond paths between ligand and enzyme atoms with their corresponding rho values. The output is an Excel file containing the bond paths, critical points, regions, and charge density (rho) for the system.

The first code block defines the current working directory, which file is used as input, and the ligand within the input.

In [62]:
import os
import pandas as pd
import re

current_directory = os.getcwd()
file_path = os.path.join(current_directory, "pyrr1_c1_sys1_SP_QTAIM.out") # Modify file name
ligand = 'pyrr1' # Modify this as needed

This script uses xlsxwriter and openpyxl to modify and adjust Excel files.

In your terminal, install xlsxwriter with the following command:

    pip install xlsxwriter

and

    pip install openpyxl


<font size="12">Section 1: Collecting Bond Path Information.</font>

This section collects the "Bond Path" information from the AMS output file, used as input here, and creates an Excel file containing all bond path numbers, critical point numbers, and atom numbers.
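
The scan used in this section follows a common pattern: skip lines until a start marker appears, then collect lines until a blank line or a terminator. A minimal sketch of that pattern as a reusable helper is shown here. The name `extract_section` and its parameters are illustrative and not part of the original script, and the sample lines are invented rather than copied from a real AMS file.

```python
def extract_section(lines, start_marker, stop_markers=("---",)):
    """Return lines from the first one containing start_marker up to
    (not including) the first blank line or line containing a stop marker."""
    collected = []
    recording = False
    for line in lines:
        if not recording:
            if start_marker in line:
                recording = True
                collected.append(line)
            continue
        if line.strip() == "" or any(marker in line for marker in stop_markers):
            break
        collected.append(line)
    if not recording:
        raise ValueError(f"Start marker {start_marker!r} not found.")
    return collected


# Invented sample text standing in for an output file
sample = [
    "header text\n",
    "   1      2   first row\n",
    "   2      3   second row\n",
    "\n",
    "   9      9   not part of the section\n",
]
section = extract_section(sample, "   1      2")
print(len(section))  # 2
```

Because an open file handle is also an iterable of lines, the same helper could be called as `extract_section(file, '   1      2')` inside a `with open(...)` block.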


In [63]:
# The temp_file_path text file will temporarily hold data before writing it to the bond_paths_full Excel file.

temp_file_path = os.path.join(current_directory, "temp_file.txt")
excel_path = os.path.join(current_directory, "bond_paths_full.xlsx")

In [64]:
section_found = False # Starts as False; the loop below sets it to True once the start pattern is found
section_content = [] # This is where data is stored if section_found = True

with open(file_path, 'r') as file: # Opens the input file
    for line in file:
        # Start recording from the occurrence of '   1      2  '
        if '   1      2' in line and not section_found: # If this pattern is found while section_found = False
            section_found = True # This condition is now true and the function will start appending lines to section_content
            section_content.append(line)
            continue

        if section_found: # When this is true, continue reading and appending
            if '---' in line or 'ANOTHER_SECTION_TITLE' in line or line.strip() == '': 
                break
            section_content.append(line)
    
    if not section_found: # Fail if pattern is not found
        print("Pattern '   1      2' not found. Exiting.")
        raise ValueError("Required pattern not found in the file.")

with open(temp_file_path, 'w') as temp_file:
    temp_file.writelines(section_content)


The following block imports the lines from temp_file_path into a dataframe for readability. If the lines were read and appended correctly, the first five rows of the dataframe will be printed and look similar to this (pandas treats the first line of the file as the header row, so it appears above row 0):

    First 5 rows of the DataFrame:
     2   360       1      3     1.454230     1.454787      19
    0     3   480       1    124     1.046427     1.0...      
    1     4   219       2    115     3.034049     3.1...      
    2     5   220       2    117     3.218932     3.2...      
    3     6   377       3      4     1.530883     1.5...      
    4     7   221       3     11     1.519229     1.5...      

    The data has been written to the file:  path/to/directory/temp_file.txt

If this does not appear, then the bond path section was not extracted correctly. To troubleshoot, check:
* Is the input file correctly named at the top of the script?
* Does the ligand variable match the input file?
* Some multi-atom files don't start with '   1      2' in the bond path section. Check the input file to see whether the first bond path is 1 to 3 or another atom.
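
For the last point, one option is to detect the first bond path row by its shape instead of hard-coding '   1      2'. The sketch below assumes that a data row starts with four whitespace-separated integers (#BP, #CP, Atom 1, Atom 2) followed by a decimal number, as in the preview above. `find_first_bond_path_row` is a hypothetical helper, and the pattern could also match rows of other tables in a real output file, so it should be checked against the actual file before use.

```python
import re

# Assumed row shape: bond path number 1, three more integers, then a decimal number
ROW_PATTERN = re.compile(r"^\s*1\s+\d+\s+\d+\s+\d+\s+\d+\.\d+")


def find_first_bond_path_row(lines):
    """Return the position of the first line that looks like bond path 1, or None."""
    for number, line in enumerate(lines):
        if ROW_PATTERN.match(line):
            return number
    return None


sample = [
    "some text before the table\n",
    " #BP  #CP  Atom 1  Atom 2  ...\n",  # invented header line
    "  1   582       1      2     1.810282     1.811319      21\n",
    "  2   350       1     10     1.533579     1.534000      19\n",
]
print(find_first_bond_path_row(sample))  # 2
```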

In [65]:
temp_file_df = pd.read_csv(temp_file_path)

print("First 5 rows of the DataFrame:")
print(temp_file_df.head())
print()
print(f"The data has been written to the file: {temp_file_path}")

First 5 rows of the DataFrame:
  1   582       1      2     1.810282     1.811319      21
0     2   350       1     10     1.533579     1.5...      
1     3   415       1    156     1.107736     1.1...      
2     4   711       1    157     1.109333     1.1...      
3     5   516       1    251     2.893704     3.0...      
4     6   348       2      3     1.196800     1.1...      

The data has been written to the file: c:\Users\jenbu\Desktop\5580\new_get_rho\temp_file.txt


The same data set is read into a dataframe using the column specifications for exact widths, and each column is given a name. Then, the atom numbers in the Atom 1 and Atom 2 columns are converted to numeric values, and the first four columns are copied to a second dataframe. The second dataframe's row index is rebuilt as a plain 0-based range, with the old index dropped rather than written out; early drafts of this script had atoms above 100 reindexed to 0.

xlsxwriter is called in to write the second dataframe to an Excel file and set column widths. In the last line, the temporary text file is deleted.
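
One detail of `pd.read_fwf` is worth checking at this step: unless `header=None` is passed, it treats the first line of the file as column names, so assigning `df.columns` afterwards replaces that first data row instead of labelling it. The sketch below shows the difference on the first three bond paths from the preview above. The six-character column layout is an assumption made for the example and is not the AMS layout.

```python
import io

import pandas as pd

rows = [(1, 582, 1, 2), (2, 350, 1, 10), (3, 415, 1, 156)]
# Build fixed-width text: each field right-aligned in 6 characters (illustrative layout)
text = "".join("".join(f"{value:>6}" for value in row) + "\n" for row in rows)
colspecs = [(0, 6), (6, 12), (12, 18), (18, 24)]
names = ["#BP", "#CP", "Atom 1", "Atom 2"]

# Default behaviour: the first line is consumed as the header
lossy = pd.read_fwf(io.StringIO(text), colspecs=colspecs)
lossy.columns = names

# header=None keeps every line as data
full = pd.read_fwf(io.StringIO(text), colspecs=colspecs, header=None, names=names)

print(len(lossy), len(full))  # 2 3
```

If bond path 1 should appear in the final table, passing `header=None` in the real call is one way to keep it.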

In [66]:
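# Define custom column widths from the input file; currently interested in columns 0-3; the other columns are important later
colspecs = [(0, 4), (5, 10), (11, 20), (21, 30), (40, 50), (50, 60), (60, 70)]

# Note: read_fwf treats the first line of the file as the header row unless header=None is passed,
# so the first bond path in the file is replaced by the column names below and does not appear as a data row.
df = pd.read_fwf(temp_file_path, colspecs=colspecs)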

df.columns = ['#BP', '#CP', 'Atom 1', 'Atom 2', 'Distance', 'BP Length', 'BP Steps']

df['Atom 1'] = pd.to_numeric(df['Atom 1'], errors='coerce') # Ensures integers; returns NaN if a number isn't present
df['Atom 2'] = pd.to_numeric(df['Atom 2'], errors='coerce')

# Keep only the first four columns and rebuild the row index as a plain 0-based range.
# drop=True discards the old index instead of keeping it as a column; early drafts had atom 100 written as 0, 101 as 1, 102 as 2, ...
df_shortened = df.iloc[:, :4].reset_index(drop=True)

with pd.ExcelWriter(excel_path, engine='xlsxwriter') as writer:
    df_shortened.to_excel(writer, index=False)

    workbook = writer.book
    worksheet = writer.sheets['Sheet1']
    worksheet.set_column('A:D', 15) 

os.remove(temp_file_path)

The next block shows the first five lines from the shortened dataframe and should look similar to this:

    First 5 rows of the DataFrame:
            #BP  #CP  Atom 1  Atom 2
        0    2  350       1      10
        1    3  415       1     156
        2    4  711       1     157
        3    5  516       1     251
        4    6  348       2       3  

    The data has been written to the file:  path/to/directory/bond_paths_full.xlsx


In [67]:
print("First 5 rows of the DataFrame:")
print(df_shortened.head())
print()
print(f"The data has been written to the file: {excel_path}")

First 5 rows of the DataFrame:
   #BP  #CP  Atom 1  Atom 2
0    2  350       1      10
1    3  415       1     156
2    4  711       1     157
3    5  516       1     251
4    6  348       2       3

The data has been written to the file: c:\Users\jenbu\Desktop\5580\new_get_rho\bond_paths_full.xlsx


<font size="12">Section 2: Collecting Geometry information.</font>

In this block, the code reads the "Geometry" section of the input file and places all atom numbers and atom regions into a new Excel file.
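
As a cross-check on the fixed-width parsing used below, the same two fields can be pulled out by splitting each line on whitespace. This is a minimal sketch that assumes each geometry row has the form index, symbol, three coordinates, and an optional `region=...` token. `parse_geometry_line` is a hypothetical helper, and the element symbols and coordinates in the sample are placeholders rather than values from a real file.

```python
def parse_geometry_line(line):
    """Return (atom_index, region_name) for a geometry row, or None for other lines."""
    tokens = line.split()
    if len(tokens) < 5 or not tokens[0].isdigit():
        return None
    region = None
    for token in tokens[5:]:
        if token.startswith("region="):
            region = token[len("region="):]
            break
    return int(tokens[0]), region


sample = [
    "  Index Symbol   x (angstrom)   y (angstrom)   z (angstrom)\n",
    "      1      C       0.000000       0.000000       0.000000       region=cys170minusONH\n",
    "      2      H       1.000000       0.000000       0.000000\n",
]
parsed = [parse_geometry_line(line) for line in sample]
print(parsed)  # [None, (1, 'cys170minusONH'), (2, None)]
```

The sketch strips the `region=` prefix, whereas the script keeps it; either form works with the substring search used for the ligand later on.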

In [68]:
# Here, new variables are designated, both the temporary text file and new Excel file.

temp_file_path2 = os.path.join(current_directory, "temp_file2.txt")
geometries = os.path.join(current_directory, "geometry.xlsx")

In [69]:
section_found2 = False # Starts as False; the loop below sets it to True once the geometry header is found
section_content2 = [] # This is where data is stored once section_found2 = True

with open(file_path, 'r') as file2: # Opens the input file
    for line2 in file2:
        # Check if the line contains the first occurrence of the geometry table header
        if '  Index Symbol   x (angstrom)   y (angstrom)   z (angstrom)' in line2 and not section_found2: # If pattern found while section_found2 = False
            section_found2 = True  # Start recording from this point
            section_content2.append(line2)
            continue

        if section_found2: # Continue reading and appending
            if '---' in line2 or 'ANOTHER_SECTION_TITLE' in line2 or line2.strip() == '':
                break
          
            section_content2.append(line2)

    if not section_found2: # Fail if pattern is not found
        print("Pattern for geometries not found. Exiting.")
        raise ValueError("Required pattern not found in the file.")

with open(temp_file_path2, 'w') as temp_file2: # Second temporary file
    temp_file2.writelines(section_content2)

In [70]:
# Define column widths for the second temporary file
column_widths = [7, 7, 15, 15, 15, 33]  # Accounts for long region names

df2 = pd.read_fwf(temp_file_path2, widths=column_widths) # Read the second temporary file into a new data frame using column widths

df2_shortened = df2.iloc[:, [0, 5]]  # Keep only columns 0 and 5 of the geometry data; 0 is atom number, 5 is region name

df2_shortened.to_excel(geometries, index=False) # Write the shortened dataframe to geometry.xlsx

os.remove(temp_file_path2) # Delete the second temporary file

The next block shows the first five lines from the second shortened dataframe with Geometry data:

    First 5 rows of the DataFrame:
        Index             Unnamed: 5
    0      1  region=cys170minusONH
    1      2  region=cys170minusONH
    2      3  region=cys170minusONH
    3      4  region=cys170minusONH
    4      5   region=thr107minusCO


    The data has been written to the file:  path/to/directory/geometry.xlsx


In [71]:
print("First 5 rows of the DataFrame:")
print(df2_shortened.head())
print()
print(f"The data has been written to the file: {geometries}")

First 5 rows of the DataFrame:
   Index             Unnamed: 5
0      1  region=cys170minusONH
1      2  region=cys170minusONH
2      3  region=cys170minusONH
3      4  region=cys170minusONH
4      5   region=thr107minusCO

The data has been written to the file: c:\Users\jenbu\Desktop\5580\new_get_rho\geometry.xlsx


<font size="12">Section 3: Collate Bond Path Data with Geometry Data</font>

The "Geometry" file is read and merged into the "Bond Path" file, creating a third Excel file that contains the entire system's atoms, regions, and bond path numbers.
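
Conceptually, the two left merges in this section amount to looking up each atom number in an atom-to-region table. The sketch below shows that idea with a plain dictionary. The first two bond paths and the three regions come from the previews in this notebook; the third bond path is invented to show what happens when an atom is missing from the geometry data.

```python
# Atom number -> region, taken from the previews in this notebook
regions = {
    1: "region=cys170minusONH",
    10: "region=cys170minusONH",
    251: "region=val77R",
}

bond_paths = [
    {"#BP": 2, "#CP": 350, "Atom 1": 1, "Atom 2": 10},
    {"#BP": 5, "#CP": 516, "Atom 1": 1, "Atom 2": 251},
    {"#BP": 99, "#CP": 999, "Atom 1": 1, "Atom 2": 9999},  # invented row: atom absent from geometry
]

for row in bond_paths:
    # dict.get returns None for an unknown atom; a pandas left merge leaves NaN in the same situation
    row["Region Atom 1"] = regions.get(row["Atom 1"])
    row["Region Atom 2"] = regions.get(row["Atom 2"])

print(bond_paths[1]["Region Atom 2"])  # region=val77R
print(bond_paths[2]["Region Atom 2"])  # None
```

A missing region in the merged file is therefore a sign that the geometry section was cut short or parsed with the wrong column widths.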

In [72]:
# The bond_paths and geometry Excel files are assigned to variables and used as input for the next function

bond_paths_file = os.path.join(current_directory, "bond_paths_full.xlsx")
geometry_file = os.path.join(current_directory, "geometry.xlsx") # Must match the file written in Section 2

# After the function joins the relevant data from bond_path and geometry, it will save it to the following file
output_file = os.path.join(current_directory, "bond_path_with_geo_data.xlsx")

In [73]:
# Function that will correlate the atom numbers in bond_path to the atom numbers in geometries.

def merge_bond_paths_with_regions(bond_paths_file, geometry_file, output_file): # input, input, output
    bond_paths = pd.read_excel(bond_paths_file) # input
    geometry = pd.read_excel(geometry_file) # input
    
    # Creates column names for the geometry data; bond_path has headers, so to merge them, geometry must have columns also
    geometry.columns = ['Atom ID', 'Region']
    
    # Look up the region of each Atom 1 in the geometry data and add it to bond_paths as a new column, 'Region Atom 1'
    bond_paths = bond_paths.merge(geometry, how='left', left_on='Atom 1', right_on='Atom ID') 
    bond_paths = bond_paths.rename(columns={'Region': 'Region Atom 1'}).drop('Atom ID', axis=1)
    
    # Look up the region of each Atom 2 in the geometry data and add it to bond_paths as a new column, 'Region Atom 2'
    bond_paths = bond_paths.merge(geometry, how='left', left_on='Atom 2', right_on='Atom ID')
    bond_paths = bond_paths.rename(columns={'Region': 'Region Atom 2'}).drop('Atom ID', axis=1)
    
    bond_paths.to_excel(output_file, index=False) # Write the data to the output file

merge_bond_paths_with_regions(bond_paths_file, geometry_file, output_file) # Run the above function

output_df = pd.read_excel(output_file) # Read the bond_path_with_geo_data Excel file back into a dataframe for the preview below


The next block shows the first five lines of the merged dataframe, which now includes the region of each atom:

      First 5 rows of the DataFrame:
         #BP  #CP  Atom 1  Atom 2          Region Atom 1          Region Atom 2
      0    2  350       1      10  region=cys170minusONH  region=cys170minusONH
      1    3  415       1     156  region=cys170minusONH  region=cys170minusONH
      2    4  711       1     157  region=cys170minusONH  region=cys170minusONH
      3    5  516       1     251  region=cys170minusONH          region=val77R
      4    6  348       2       3  region=cys170minusONH  region=cys170minusONH


    The data has been written to the file:  path/to/directory/bond_path_with_geo_data.xlsx


In [74]:
print("First 5 rows of the DataFrame:")
print(output_df.head())
print()
print(f"The data has been written to the file: {output_file}")

First 5 rows of the DataFrame:
   #BP  #CP  Atom 1  Atom 2          Region Atom 1          Region Atom 2
0    2  350       1      10  region=cys170minusONH  region=cys170minusONH
1    3  415       1     156  region=cys170minusONH  region=cys170minusONH
2    4  711       1     157  region=cys170minusONH  region=cys170minusONH
3    5  516       1     251  region=cys170minusONH          region=val77R
4    6  348       2       3  region=cys170minusONH  region=cys170minusONH

The data has been written to the file: c:\Users\jenbu\Desktop\5580\new_get_rho\bond_path_with_geo_data.xlsx


<font size="12">Section 4: Read Bond Path-Geometry Data for Ligand.</font>

This section filters the "Bond Path with Geo Data" Excel file, which contains the entire system's bond paths, critical points, atoms, and regions, for the variable 'ligand'. This is assigned at the top of the script as pyrr1. The fifth and sixth columns of the file (positional indices 4 and 5), which contain region information, are read, and only rows where 'ligand' appears in exactly one of the two are kept. These rows are written to a file called "Bond Paths Filtered".

There is no failure check for this section; if the input had a Bond Path section, then it has Rho values.
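
The filter in this section is an exclusive-or: a row is kept when the ligand name appears in one region column but not in both. A minimal sketch of that test in plain Python follows; `touches_ligand_once` is a hypothetical helper, and the region strings are taken from the previews in this notebook.

```python
def touches_ligand_once(region_1, region_2, ligand):
    """True when exactly one of the two region strings contains the ligand name."""
    in_first = ligand in (region_1 or "")   # treat a missing region as empty text
    in_second = ligand in (region_2 or "")
    return in_first != in_second            # != on booleans is exclusive-or


print(touches_ligand_once("region=cys170minusONH", "region=pyrr1", "pyrr1"))  # True
print(touches_ligand_once("region=pyrr1", "region=pyrr1", "pyrr1"))           # False
print(touches_ligand_once("region=cys170minusONH", "region=val77R", "pyrr1")) # False
```

With pandas boolean Series the same test can be written as `mask_a ^ mask_b`. Note that this is a substring search, so a ligand named `pyrr1` would also match a region named `pyrr10`.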

In [75]:
# Read in the joined bond path-geometry file to a new dataframe
bond_path = os.path.join(current_directory, "bond_path_with_geo_data.xlsx")
bond_path_df = pd.read_excel(bond_path)

# Output file containing only lines where ligand appears once
output_file2 = os.path.join(current_directory, "bond_paths_filtered.xlsx")

In [76]:
# Search bond_path_df for instances of 'ligand'. If 'ligand' is present in positional column 4 (Region Atom 1) and not column 5 (Region Atom 2),
# or in column 5 and not column 4, keep that entire row in the filtered dataframe

filtered_df = bond_path_df[((bond_path_df.iloc[:, 4].str.contains(ligand, na=False)) & 
                            (~bond_path_df.iloc[:, 5].str.contains(ligand, na=False))) | # Ligand in 4 and not in 5
                           ((~bond_path_df.iloc[:, 4].str.contains(ligand, na=False)) & 
                            (bond_path_df.iloc[:, 5].str.contains(ligand, na=False)))] # Ligand in 5 and not in 4

# Write the filtered dataframe to a new Excel file
filtered_df.to_excel(os.path.join(current_directory, output_file2), index=False)

The next block shows the first five lines of the filtered dataframe:

    First 5 rows of the DataFrame:
        #BP  #CP  Atom 1  Atom 2          Region Atom 1 Region Atom 2
    5     7  539       2     102  region=cys170minusONH  region=pyrr1
    29   31  543      12     240    region=ala26minusNH  region=pyrr1
    32   34  393      14      76    region=ala26minusNH  region=pyrr1
    33   35  391      14     183    region=ala26minusNH  region=pyrr1
    46   48  425      21      34         region=gln28BB  region=pyrr1   

    The data has been written to the file:  path/to/directory/bond_paths_filtered.xlsx


In [77]:
print("First 5 rows of the DataFrame:")
print(filtered_df.head())
print()
print(f"The data has been written to the file: {output_file2}")

First 5 rows of the DataFrame:
    #BP  #CP  Atom 1  Atom 2          Region Atom 1 Region Atom 2
5     7  539       2     102  region=cys170minusONH  region=pyrr1
29   31  543      12     240    region=ala26minusNH  region=pyrr1
32   34  393      14      76    region=ala26minusNH  region=pyrr1
33   35  391      14     183    region=ala26minusNH  region=pyrr1
46   48  425      21      34         region=gln28BB  region=pyrr1

The data has been written to the file: c:\Users\jenbu\Desktop\5580\new_get_rho\bond_paths_filtered.xlsx


There is now an Excel file, bond_paths_filtered, that contains every bond path in which one ligand atom interacts with a non-ligand atom. This data is useful for determining which residues in the active site interact with the ligand, and for creating cluster models appropriate for QM analyses.

Of immediate interest is collecting the charge density values of each of the bond paths. The summed rho value collected in the next section is most useful for qualifying inhibitor binding (weak, moderate, excellent).

<font size ="12">Section 5: Collate Rho Data.</font>

Using the "Bond Paths Filtered" file, the input file is read for rho values of bond paths which contain exactly one instance of 'ligand'. The rho values and data from "Bond Paths Filtered" are appended to "Bond Paths Rho", which now contains all the desired data: bond path numbers, critical point numbers, both regions, and rho.
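
The function in this section reopens the input file once per critical point. A possible alternative is to read the file once and build a dictionary from critical point number to rho, as sketched below. The sketch assumes, like the script's own patterns, that each critical point starts with a line containing `CP # <number>` and that the first later line containing `Rho = <value>` belongs to it. `collect_rho_values` is a hypothetical helper, and the sample layout is invented; only the two CP numbers and rho values come from the previews in this notebook.

```python
import re

CP_PATTERN = re.compile(r"CP #\s*(\d+)")
RHO_PATTERN = re.compile(r"Rho\s*=\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def collect_rho_values(lines):
    """Map each critical point number to the first rho value that follows it."""
    rho_by_cp = {}
    current_cp = None
    for line in lines:
        cp_match = CP_PATTERN.search(line)
        if cp_match:
            current_cp = int(cp_match.group(1))
            continue
        if current_cp is not None and current_cp not in rho_by_cp:
            rho_match = RHO_PATTERN.search(line)
            if rho_match:
                rho_by_cp[current_cp] = float(rho_match.group(1))
    return rho_by_cp


# Invented layout; a real output file has more lines per critical point
sample = [
    " CP # 539\n",
    " some other line\n",
    " Rho = 1.3294E-02\n",
    " CP # 543\n",
    " Rho = 1.7597E-02\n",
]
print(collect_rho_values(sample))  # {539: 0.013294, 543: 0.017597}
```

The dictionary could then be applied to the dataframe with `bond_path_df2['#CP'].map(rho_by_cp)`, which avoids the repeated file reads.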

In [96]:
# Read in the filtered bond path data and put in a dataframe
bond_path2 = os.path.join(current_directory, "bond_paths_filtered.xlsx")
bond_path_df2 = pd.read_excel(bond_path2)

# Output file for relevant bond path and charge density data
bond_path_rho = os.path.join(current_directory, "bond_paths_rho.xlsx")

In [97]:
# This function reads the initial input file for a critical point number, finds the first rho value that follows it,
# and returns that value as a float.

def extract_rho_value(cp_number, file_path): 
    cp_pattern = f'CP #\\s*{cp_number}(?!\\d)'  # Header of this critical point only; (?!\d) stops CP 53 from matching CP 539
    rho_pattern = r'Rho\s*=\s*([\d.eE+-]+)'  # Regex to capture scientific notation
    section_found = False

    with open(file_path, 'r') as file:
        for line in file:
            if re.search(cp_pattern, line): # If this critical point's header is found, start checking the lines after it
                section_found = True  
                continue
            
            if section_found:
                match = re.search(rho_pattern, line) # Search this line for the rho_pattern
                if match: # If found, return the rho value as a float
                    return float(match.group(1))  
    return None

# For each value in the '#CP' column, run the extract_rho_value function and place the rho value associated with that critical point
# back into the dataframe in a new column titled 'Rho'
bond_path_df2['Rho'] = bond_path_df2['#CP'].apply(lambda x: extract_rho_value(x, file_path))

# Write the expanded dataframe to the bond_path_rho Excel file
bond_path_df2.to_excel(os.path.join(current_directory, bond_path_rho), index=False)

In [98]:
# Call in xlsxwriter to adjust the columns so the full region name is visible
with pd.ExcelWriter(os.path.join(current_directory, bond_path_rho), engine='xlsxwriter') as writer:
    bond_path_df2.to_excel(writer, index=False)

    workbook = writer.book
    worksheet = writer.sheets['Sheet1']

    worksheet.set_column('E:E', 25)  
    worksheet.set_column('F:F', 25) 
    worksheet.set_column('G:G', 20) 

    # Identify the last row with rho values
    last_row = len(bond_path_df2) + 1  # +1 for the header row

    # Then sum all the rho values and place it at the end, bolded
    worksheet.write_formula(f'G{last_row + 1}', f'=SUM(G2:G{last_row})', 
                            workbook.add_format({'bold': True}))

The next block shows the first five lines of the dataframe with the new Rho column:

    First 5 rows of the DataFrame:
    #BP  #CP  Atom 1  Atom 2          Region Atom 1 Region Atom 2       Rho
    0    7  539       2     102  region=cys170minusONH  region=pyrr1  0.013294
    1   31  543      12     240    region=ala26minusNH  region=pyrr1  0.017597
    2   34  393      14      76    region=ala26minusNH  region=pyrr1  0.010896
    3   35  391      14     183    region=ala26minusNH  region=pyrr1  0.014879
    4   48  425      21      34         region=gln28BB  region=pyrr1  0.005474

    The data has been written to the file:  path/to/directory/bond_paths_rho.xlsx


In [99]:
print("First 5 rows of the DataFrame:")
print(bond_path_df2.head())
print()
print(f"The data has been written to the file: {bond_path_rho}")

First 5 rows of the DataFrame:
   #BP  #CP  Atom 1  Atom 2          Region Atom 1 Region Atom 2       Rho
0    7  539       2     102  region=cys170minusONH  region=pyrr1  0.013294
1   31  543      12     240    region=ala26minusNH  region=pyrr1  0.017597
2   34  393      14      76    region=ala26minusNH  region=pyrr1  0.010896
3   35  391      14     183    region=ala26minusNH  region=pyrr1  0.014879
4   48  425      21      34         region=gln28BB  region=pyrr1  0.005474

The data has been written to the file: c:\Users\jenbu\Desktop\5580\new_get_rho\bond_paths_rho.xlsx


To open the bond_paths_rho Excel file, run one of the following cells depending on your OS.

In [92]:
# Mac
os.system(f"open {os.path.join(current_directory, bond_path_rho)}")

1

In [94]:
# Linux
os.system(f"xdg-open {os.path.join(current_directory, bond_path_rho)}")

1

In [95]:
# Windows
os.startfile(os.path.join(current_directory, bond_path_rho))
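
The three OS-specific cells above could also be folded into one helper that picks the command from `sys.platform`. This is a minimal sketch: `open_command` and `open_file` are hypothetical names, and passing the command as a list avoids shell quoting problems when the path contains spaces.

```python
import os
import subprocess
import sys


def open_command(path, platform=sys.platform):
    """Return the argument list that opens path, or None on Windows,
    where os.startfile is used instead."""
    if platform.startswith("win"):
        return None
    if platform == "darwin":  # macOS
        return ["open", path]
    return ["xdg-open", path]  # assumed default for Linux desktops


def open_file(path):
    command = open_command(path)
    if command is None:
        os.startfile(path)  # only exists on Windows
    else:
        subprocess.run(command, check=False)


print(open_command("bond paths rho.xlsx", platform="darwin"))  # ['open', 'bond paths rho.xlsx']
```

Calling `open_file(bond_path_rho)` would then work from a single cell on all three systems, assuming `xdg-open` is installed on Linux.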


<font size="12">Section 6: Remove all unnecessary files. </font>

If you need the intermediate files for extended troubleshooting, do not run the following script block.

In [None]:
def remove_files(filenames, directory): # Remove any given files in a given directory
    for filename in filenames:
        file_path = os.path.join(directory, filename) # Join the path and file name(s)
        try:
            os.remove(file_path)
            print(f"Removed: {file_path}")
        except FileNotFoundError:
            print(f"File not found: {file_path}") # Error if a file is not found in the directory
        except Exception as e:
            print(f"Error removing {file_path}: {e}") # Error if a file cannot be removed (ie, it's open)

files_to_remove = [ # Names of all the Excel files made earlier in the script
    "bond_path_with_geo_data.xlsx",
    "bond_paths_filtered.xlsx",
    "bond_paths_full.xlsx",
    "geometry.xlsx"
]

remove_files(files_to_remove, current_directory) # Run the function in the current working directory and List the removed files


Removed: c:\Users\jenbu\Desktop\5580\new_get_rho\bond_path_with_geo_data.xlsx
Removed: c:\Users\jenbu\Desktop\5580\new_get_rho\bond_paths_filtered.xlsx
Removed: c:\Users\jenbu\Desktop\5580\new_get_rho\bond_paths_full.xlsx
Removed: c:\Users\jenbu\Desktop\5580\new_get_rho\geometry.xlsx


<font size=12>Is this accurate?</font>

How do I know this worked, rather than the script collecting random lines and appending values? The Excel file generated with this script was compared against two other Excel files: one created by two individuals performing manual analyses and one created by cobbling together the results of two bash scripts. The values were the same.
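
A comparison like this can also be automated. The sketch below checks rho values keyed by critical point number against a reference set with a tolerance. `compare_rho` and the tolerance are assumptions; the script values come from the preview in Section 5, while the reference dictionaries are stand-ins for a manual analysis, with one value deliberately altered to show a mismatch.

```python
import math


def compare_rho(script_values, reference_values, rel_tol=1e-6):
    """Return the critical point numbers whose rho values differ or are missing."""
    mismatches = []
    for cp, reference in reference_values.items():
        value = script_values.get(cp)
        if value is None or not math.isclose(value, reference, rel_tol=rel_tol):
            mismatches.append(cp)
    return mismatches


script_values = {539: 0.013294, 543: 0.017597, 393: 0.010896}
reference_ok = dict(script_values)                  # stand-in for a matching manual analysis
reference_bad = {**script_values, 393: 0.020000}    # deliberately altered value

print(compare_rho(script_values, reference_ok))   # []
print(compare_rho(script_values, reference_bad))  # [393]
```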

<font size=12>Future Work</font>

Next steps will be to make the entire script a single function that can iterate over several AMS output files, and collate them into a single Excel file.
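
One possible shape for that loop is sketched below. It assumes that every output file name ends in `_QTAIM.out` and that the ligand name is the text before the first underscore, as in `pyrr1_c1_sys1_SP_QTAIM.out`. The names `ligand_from_filename`, `find_output_files`, and `process_all` are hypothetical, `process_one` stands for the per-file pipeline from Sections 1 to 5, and the second file name in the demo is invented.

```python
import tempfile
from pathlib import Path


def ligand_from_filename(path):
    """Assumes the ligand name is the text before the first underscore."""
    return Path(path).name.split("_")[0]


def find_output_files(directory, pattern="*_QTAIM.out"):
    return sorted(Path(directory).glob(pattern))


def process_all(directory, process_one):
    """Call process_one(path, ligand) for each output file and collect the results."""
    results = {}
    for path in find_output_files(directory):
        results[path.name] = process_one(path, ligand_from_filename(path))
    return results


# Demo with empty placeholder files and a stand-in for the real pipeline
with tempfile.TemporaryDirectory() as tmp:
    for name in ("pyrr1_c1_sys1_SP_QTAIM.out", "pyrr2_c1_sys1_SP_QTAIM.out", "notes.txt"):
        (Path(tmp) / name).write_text("")
    demo = process_all(tmp, lambda path, ligand: ligand)

print(demo)
```

The per-file results could then be concatenated with `pd.concat` and written to a single workbook.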