# Main IFG Script
Creates an Excel file of functional group data from a target set of SMILES codes.

Before running this script, ensure that IFG has been installed on your computer.  <br>
For more information about this script, see its page in the [IFG docs](https://wtriddle.github.io/IFG/scripts/data_collection.html)

## Script Modules and Packages

Imported modules and usages:

- <b>csv</b> loads SMILES csv file data into python variables
- <b>os</b> gets file paths for python to locate files on computer file system
- <b>logging</b> logs structures which fail the functional group processing algorithm
- <b>traceback</b> collects the exact python error for failed structures
- <b>pandas</b> tabulates functional group python data and exports it to excel 
- <b>tqdm</b> shows visual progress bar of processing through each SMILES structure 
- <b>chem.molecule</b> provides the <b>Molecule</b> class from the IFG python package, which computes the functional groups present in a SMILES code

In [1]:
import csv
import logging
import os
import traceback

import pandas
from tqdm import tqdm

from chem.molecule import Molecule

## Script I/O Files

File path to csv file of SMILES codes to process in this script 

In [2]:
STRUCTURES_PATH = os.getcwd() + "/smiles/smiles.csv"
print(STRUCTURES_PATH)

c:\Users\wtrid\Documents\Software Development\IFG\ifg\scripts/smiles/smiles.csv


File path to excel file that will be generated by this script

In [3]:
MAIN_OUTPUT_PATH = os.getcwd() + '/output/functional_groups.xlsx'
print(MAIN_OUTPUT_PATH)

c:\Users\wtrid\Documents\Software Development\IFG\ifg\scripts/output/functional_groups.xlsx
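Writing to `MAIN_OUTPUT_PATH` will likely fail if the `output` folder does not exist yet. The following is a minimal sketch, not part of the original script, of building both paths with `os.path.join` and creating the output folder up front. The helper name `build_paths` is hypothetical, and a temporary directory stands in for `os.getcwd()`.

```python
import os
import tempfile


def build_paths(base_dir: str) -> tuple[str, str]:
    """Return (input csv path, output xlsx path) located under base_dir."""
    structures_path = os.path.join(base_dir, "smiles", "smiles.csv")
    output_path = os.path.join(base_dir, "output", "functional_groups.xlsx")
    # Create the output folder if it is missing; exist_ok avoids an error on reruns.
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    return structures_path, output_path


# Demonstrated in a temporary directory instead of the current working directory.
with tempfile.TemporaryDirectory() as tmp:
    structures_path, output_path = build_paths(tmp)
    output_dir_exists = os.path.isdir(os.path.dirname(output_path))
    input_file_exists = os.path.isfile(structures_path)
```

Checking `os.path.isfile(STRUCTURES_PATH)` before loading gives an early, readable signal when the input csv is missing.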


## Script Data Set-Up

SMILES codes data load

The cell below opens the SMILES csv file at <b>STRUCTURES_PATH</b> and reads it row by row with the python module <b>csv</b>. <b>STRUCTURES_CSV_FILE</b> is the reader object returned by the <b>csv</b> module; its rows are unpacked into a python list of (SMILES, refcode) pairs called <b>STRUCTURES</b>, and the header row is dropped.

Please ensure that your csv file matches the default format: a header row followed by rows of two columns, the SMILES code and then the refcode. If you receive an error while loading, adjust your csv file to the default format or change the loading code to suit your data. Also ensure that the SMILES codes are in a format usable by the IFG package.
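To make the expected layout concrete, here is a minimal sketch of a loader that checks each row has exactly two columns. It is an illustration rather than the script's own loading code: the function name `load_structures` and the header names in the sample are assumptions, and an in-memory string stands in for the real csv file.

```python
import csv
import io

# Hypothetical sample content; the header names are an assumption.
SAMPLE_CSV = "SMILES,Refcode\nCOC,ABEWAG\nCC(O)=O,ACETAC07\n"


def load_structures(handle) -> list[tuple[str, str]]:
    """Read (smiles, refcode) pairs from an open csv handle, skipping the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    structures = []
    for row in reader:
        if len(row) != 2:
            # reader.line_num is the number of the source line just read
            raise ValueError(f"line {reader.line_num}: expected 2 columns, got {len(row)}")
        smiles, refcode = row
        structures.append((smiles.strip(), refcode.strip()))
    return structures


structures = load_structures(io.StringIO(SAMPLE_CSV))
```

A row with a missing or extra column then fails with the offending line number instead of a tuple-unpacking error.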

In [4]:
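with open(STRUCTURES_PATH, "r", encoding="UTF-8", newline="") as structures_file:
    STRUCTURES_CSV_FILE = csv.reader(structures_file)
    # Drop the header row, then keep each (smiles, refcode) pair
    STRUCTURES = [(smiles, refcode) for (smiles, refcode) in STRUCTURES_CSV_FILE][1:]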
STRUCTURES

[('CC(=O)NCCNC(C)=O', 'ABAWEG01'),
 ('c1ccc2c(c1)cccc2C#Cc1ccccc1C#Cc1cccc2ccccc12', 'ABEJIC'),
 ('O=C1NC2C(N(CN2N(=O)=O)N(=O)=O)N1N(=O)=O', 'ABEJOH'),
 ('COC', 'ABEWAG'),
 ('CC(O)=O', 'ACETAC07'),
 ('CC(=O)Nc1ccc(C)cc1', 'ACTOLD07'),
 ('OCC1OC(O)C(O)C(O)C1O', 'ADGALA03'),
 ('COC(=O)c1nccnc1N', 'ADUWUS01'),
 ('[N+]C1CCC2CCC(N2C1=O)C(=O)[O-]', 'AFAFAP'),
 ('CC1(C)OC(CC(O)=O)C(=O)O1', 'AFEREJ'),
 ('CC1(CC2CCCCN2O1)C#N', 'AHETOX'),
 ('CC(C)C(=O)NCc1ccccc1', 'AHEYOC'),
 ('N#Cc1cccc(Oc2cccnc2)c1C#N', 'AHITUH'),
 ('CC(C)(C)c1ccc(O)c(CN2CCN(CC2)Cc2cc(ccc2O)C(C)(C)C)c1', 'AHOPOD'),
 ('[N+]C(C)C(=O)[O-]', 'ALUCAL05'),
 ('O=C(Nc1cccc2ccccc12)Nc1cccc2ccccc12', 'AMAFEZ01'),
 ('[N+]CCCCCC(=O)[O-]', 'AMCAPR11'),
 ('CC1(C)O[N+](=C2CCCCC2=C1)[O-]', 'ANAXOD'),
 ('COc1ccc(cc1)C(O)=O', 'ANISIC04'),
 ('CC1(COc2cc(ccc12)C(=O)N1CCCCC1)Cc1ccccc1', 'ANIVID'),
 ('CC1(CCCCC1)OC(=O)c1c2ccccc2cc2ccccc12', 'ANOBAH'),
 ('CC(C)(C)OC(=O)C12C3c4ccccc4C(C(c4ccccc14)c1ccccc21)(C(=O)OC(C)(C)C)c1ccccc31',
  'ANODAJ'),
 ...

Functional Group Data Variables Defined

- <b>all_data</b> each entry is a molecule's derived functional group data using the all data format
- <b>exact_data</b> each entry is a molecule's derived functional group data using the exact data format
- <b>mol</b> a variable for an individual molecule based on the <b>chem.molecule</b> Molecule class
- <b>failed_mols</b> a list of the structures (SMILES code and refcode) which did not pass the functional group algorithm due to an internal error

<b>all_data</b> and <b>exact_data</b> contain the list version of data which is converted into a table format with column names and row values via <b>pandas</b> later. See below for their usages.


In [5]:
all_data: list[dict] = []
exact_data: list[dict] = []
mol: Molecule
failed_mols: list[str] = []

Logging File Set-Up

The logging setup code below creates an empty main.log file and prepares the <b>logging</b> module to log failed structure information into main.log during the structure processing steps below.

In [6]:
# Opening in write mode creates main.log or empties an existing one
with open("main.log", mode="w", encoding="UTF-8"):
    pass
logging.basicConfig(format='%(message)s', filename='main.log')
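As an illustration of what ends up in the log, the sketch below writes one failure message with its traceback to a file and reads it back. It assumes a separate demo logger and a temporary file (the names `ifg_demo`, `demo.log`, and `DEMO01` are hypothetical) so that it does not interfere with the main.log configuration above.

```python
import logging
import os
import tempfile
import traceback

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "demo.log")  # hypothetical file name

# A dedicated logger avoids touching the root logger configured by the script.
logger = logging.getLogger("ifg_demo")
handler = logging.FileHandler(log_path, mode="w", encoding="UTF-8")
handler.setFormatter(logging.Formatter("%(message)s"))
logger.addHandler(handler)
logger.propagate = False

try:
    int("not a number")  # stand-in for a structure that fails processing
except ValueError:
    # traceback.format_exc() returns the current exception's traceback as text
    logger.error(f"DEMO01 failed to be processed\n{traceback.format_exc()}")

# Close the handler so the message is flushed before the file is read back.
handler.close()
logger.removeHandler(handler)

with open(log_path, encoding="UTF-8") as log_file:
    log_text = log_file.read()
```

The resulting text contains the one-line message followed by the full Python traceback, which is the level of detail the main.log file is meant to provide.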

## Script Structures Processing

Functional Groups Identification Structures Processing

The loop below processes each loaded SMILES code and identifies its functional groups using the <b>Molecule</b> class from the IFG python package. The data extracted from each structure is appended to both the <b>all_data</b> and <b>exact_data</b> lists until all structures have been processed. Structures that raise an error are recorded in <b>failed_mols</b>, logged to main.log, and skipped.


In [7]:
##### Structure Bar Status #####
with tqdm(total=len(STRUCTURES)) as bar:

    ##### SMILES Structure Loop #####
    for (smiles, refcode) in STRUCTURES:

        ##### Molecule Data #####
        try:
            mol = Molecule(smiles, name=refcode, type="mol")
        except Exception:  # a bare except would also swallow KeyboardInterrupt
            failed_mols.append(smiles + " " + refcode)
            logging.error(f"{refcode} {smiles} Failed to be processed \n {traceback.format_exc()}")
            bar.update(1)
            continue

        ##### All Functional Group Format Data #####
        all_data.append({
            "Refcode": mol.name,
            "SMILES": smiles,
            "Aromatic Rings": mol.aromatic_ring_count,
            "Non Aromatic Rings": mol.non_aromatic_ring_count,
            "Rings": mol.total_ring_count,
            "AminoAcid": "Yes" if mol.amino_acid else "No",
            **mol.functional_groups_all,
        })

        ##### Exact Functional Group Format Data #####
        exact_data.append({
            "Refcode": mol.name,
            "SMILES": smiles,
            "Aromatic Rings": mol.aromatic_ring_count,
            "Non Aromatic Rings": mol.non_aromatic_ring_count,
            "Rings": mol.total_ring_count,
            "AminoAcid": "Yes" if mol.amino_acid else "No",
            **mol.functional_groups_exact,
        })

        ##### Status Bar Update #####
        bar.update(1)

100%|██████████| 831/831 [00:05<00:00, 148.04it/s]
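The shape of this loop can be tried without the IFG package installed. The sketch below keeps the same try/skip structure and the same `**` dictionary unpacking, but replaces <b>Molecule</b> with a hypothetical stand-in, `count_symbols`, that merely counts N and O characters. It is not the IFG algorithm and says nothing about real functional groups.

```python
from collections import Counter


def count_symbols(smiles: str) -> dict[str, int]:
    """Hypothetical stand-in for Molecule: counts N and O characters only."""
    if not smiles:
        raise ValueError("empty SMILES")
    return dict(Counter(ch for ch in smiles if ch in "NO"))


structures = [
    ("COC", "ABEWAG"),
    ("", "BROKEN"),  # deliberately invalid entry to exercise the failure path
    ("CC(=O)NCCNC(C)=O", "ABAWEG01"),
]
rows: list[dict] = []
failed: list[str] = []

for smiles, refcode in structures:
    try:
        counts = count_symbols(smiles)
    except Exception:
        # Record the failure and move on to the next structure.
        failed.append(f"{smiles} {refcode}")
        continue
    # Fixed columns first, then one key per counted item via ** unpacking.
    rows.append({"Refcode": refcode, "SMILES": smiles, **counts})
```

Because each record only carries keys for what was found in that structure, the records have different key sets; this is why missing values need to be filled when the lists are tabulated.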


Failed Structures Output

For a brief inspection of the failed structures, use the code cell below. For greater detail, check the main.log file in your `ifg/scripts` folder.

In [8]:
failed_mols

[]

## Script Data Tabulation

Functional Group Dataframes Set-Up

The two <b>pandas</b> dataframes <b>df_all</b> and <b>df_exact</b> are python tables built from the <b>all_data</b> and <b>exact_data</b> lists, respectively. Each converts the data collected during processing into tabulated row/column form that can be exported to Excel. Functional groups missing from a molecule are filled with 0, and each table is indexed by refcode.

In [9]:
df_all = pandas.DataFrame(all_data).fillna(0).set_index("Refcode")
df_exact = pandas.DataFrame(exact_data).fillna(0).set_index("Refcode")
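When <b>pandas</b> builds a dataframe from a list of dictionaries, the columns are the union of all keys, and a key missing from a record becomes NaN, which `fillna(0)` then replaces with 0. The pure-python sketch below imitates that effect so it can be followed without pandas; the helper `tabulate` is hypothetical and is not how pandas is implemented.

```python
def tabulate(records: list[dict], fill=0) -> tuple[list[str], list[list]]:
    """Return (column names, rows), with missing entries replaced by fill."""
    columns: list[str] = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)  # keep first-seen column order
    rows = [[record.get(col, fill) for col in columns] for record in records]
    return columns, rows


records = [
    {"Refcode": "ABEJIC", "Alkyne": 2},
    {"Refcode": "ABAWEG01", "Amide": 2},
]
columns, rows = tabulate(records)
```

In the real dataframes the filled columns hold floating-point values, which is why the counts in the tables display as `2.0` and `0.0` rather than `2` and `0`.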

Show the tabulated version of <b>all_data</b> (<b>df_all</b>)

In [10]:
df_all

| Refcode | SMILES | Aromatic Rings | Non Aromatic Rings | Rings | AminoAcid | Ketone | Amide | SecondaryAmine | Alkyne | Non Aromatic Ketone | ... | Imide | HydroPeroxide | Non Aromatic Carbonate | Non Aromatic Alkyne | Isonitrile | Aromatic TertiaryAmine | Aldoxime | Non Aromatic Peroxide | Carbonate | Acetal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ABAWEG01 | CC(=O)NCCNC(C)=O | 0 | 0 | 0 | No | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ABEJIC | c1ccc2c(c1)cccc2C#Cc1ccccc1C#Cc1cccc2ccccc12 | 5 | 0 | 5 | No | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ABEJOH | O=C1NC2C(N(CN2N(=O)=O)N(=O)=O)N1N(=O)=O | 0 | 2 | 2 | No | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ABEWAG | COC | 0 | 0 | 0 | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ACETAC07 | CC(O)=O | 0 | 0 | 0 | No | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| YURRIN | CCOC(=O)C(C)=C(CC(O)c1ccc(cc1)C#N)c1ccccc1 | 2 | 0 | 2 | No | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZILFUV | CC1=C(C=C(C#N)C(=O)N1)c1ccncc1C | 1 | 1 | 2 | No | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZOJRAR01 | Cc1cc(C=O)c(O)c(c1)C(C)(C)C | 1 | 0 | 1 | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZOZXOB | OCC1OC(CC1O)N1C=NC2=C1C(=O)NC=N2 | 0 | 3 | 3 | No | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


Show the tabulated version of <b>exact_data</b> (<b>df_exact</b>)

In [11]:
df_exact

| Refcode | SMILES | Aromatic Rings | Non Aromatic Rings | Rings | AminoAcid | Amide | Alkyne | Non Aromatic Amide | Non Aromatic TertiaryAmine | Nitro | ... | Imide | HydroPeroxide | Non Aromatic Carbonate | Non Aromatic Alkyne | Isonitrile | Aromatic TertiaryAmine | Aldoxime | Non Aromatic Peroxide | Carbonate | Acetal |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ABAWEG01 | CC(=O)NCCNC(C)=O | 0 | 0 | 0 | No | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ABEJIC | c1ccc2c(c1)cccc2C#Cc1ccccc1C#Cc1cccc2ccccc12 | 5 | 0 | 5 | No | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ABEJOH | O=C1NC2C(N(CN2N(=O)=O)N(=O)=O)N1N(=O)=O | 0 | 2 | 2 | No | 0.0 | 0.0 | 2.0 | 2.0 | 3.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ABEWAG | COC | 0 | 0 | 0 | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ACETAC07 | CC(O)=O | 0 | 0 | 0 | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| YURRIN | CCOC(=O)C(C)=C(CC(O)c1ccc(cc1)C#N)c1ccccc1 | 2 | 0 | 2 | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZILFUV | CC1=C(C=C(C#N)C(=O)N1)c1ccncc1C | 1 | 1 | 2 | No | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZOJRAR01 | Cc1cc(C=O)c(O)c(c1)C(C)(C)C | 1 | 0 | 1 | No | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ZOZXOB | OCC1OC(CC1O)N1C=NC2=C1C(=O)NC=N2 | 0 | 3 | 3 | No | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Python Table To Excel Export

Pandas Excel Exporter Interface with xlsxwriter

<b>writer</b> is a <b>pandas</b> <b>ExcelWriter</b> object that writes dataframes (the tables defined above) to an Excel file. The target file is the Excel file configured in the script I/O files section.

In [12]:
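# set_column in the export cell is an xlsxwriter worksheet method, so request that engine explicitly
writer = pandas.ExcelWriter(MAIN_OUTPUT_PATH, engine="xlsxwriter")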

Excel Export with Pandas

The code below places the two dataframes (tables) shown above into two separate sheets of your `MAIN_OUTPUT_PATH` Excel file. The Refcode and SMILES columns get fixed widths, every other column's width is set from the length of its header plus padding, and then the writer object is closed, which saves the file.

In [13]:
##### All Functional Groups Data Sheet Export #####
df_all.to_excel(writer, sheet_name="all_data", freeze_panes=(1, 1))
all_sheet = writer.sheets["all_data"]
all_sheet.set_column(0, 0, 13)      # Refcode column width
all_sheet.set_column(1, 1, 125)     # SMILES column width
df_all_columns: list[str] = [str(col) for col in df_all.columns][1:]
for i, col in enumerate(df_all_columns):
    all_sheet.set_column(i+2, i+2, len(col)+7)

##### Exact Functional Groups Data Sheet Export #####
df_exact.to_excel(writer, sheet_name="exact_data", freeze_panes=(1, 1))
exact_sheet = writer.sheets["exact_data"]
exact_sheet.set_column(0, 0, 13)      # Refcode column width
exact_sheet.set_column(1, 1, 125)     # SMILES column width
df_exact_columns: list[str] = [str(col) for col in df_exact.columns][1:]
for i, col in enumerate(df_exact_columns):
    exact_sheet.set_column(i+2, i+2, len(col)+7)

##### Excel File Save #####
writer.close()
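The width logic is repeated for both sheets, so it could be factored into one function. The sketch below reproduces the same arithmetic (fixed widths of 13 and 125, then header length plus 7) in a hypothetical helper, `column_widths`, that only computes the numbers and does not touch Excel.

```python
def column_widths(
    columns: list[str],
    index_width: int = 13,
    smiles_width: int = 125,
    padding: int = 7,
) -> list[tuple[int, int]]:
    """Return (sheet column index, width) pairs; columns[0] is assumed to be SMILES."""
    # Sheet column 0 is the Refcode index, sheet column 1 is SMILES.
    widths = [(0, index_width), (1, smiles_width)]
    for offset, name in enumerate(columns[1:]):
        widths.append((offset + 2, len(name) + padding))
    return widths


widths = column_widths(["SMILES", "Aromatic Rings", "Rings", "AminoAcid"])
```

Each pair could then be applied with `sheet.set_column(index, index, width)`, once for `all_sheet` and once for `exact_sheet`.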

Open the Excel file at your `MAIN_OUTPUT_PATH` to see the tabulated data. If you encounter any significant errors, please first research the problem as best you can and attempt to solve it. If the issue persists, you may open an issue in the IFG GitHub repository.