In [None]:
import os
os.getcwd()
os.chdir("/home/allen")

## Implementation of the main steps for chemical data curation following the paper "Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research" from 2010 (D.Fourches, ...)

Link: https://pubmed.ncbi.nlm.nih.gov/20572635/


In [None]:
# import modules
from rdkit import Chem
from opencadd.compounds.standardization import (
    convert_format,
    remove_fragments,
    disconnect_metals,
    detect_inorganic,
    remove_salts,
)

In [None]:
# INITIAL LIST OF SMILES
# import test smiles dataset
test_smiles = "NC(CC(=O)O)C(=O)[O-].O.O.[Na+]"
test_smiles2 = "C(C(=O)[O-])(Cc1n[n-]nn1)(C[NH3+])(C[N+](=O)[O-])"
test_smiles3 = "CN(C)C.Cl.Cl.Br"

### Removal of Inorganics and Mixtures

Since molecular descriptors can only be computed for organic compounds, all inorganic compounds must be removed before the descriptors are calculated (Fourches 2010, Chapter 2.1).

To flag and subsequently remove compounds that contain inorganic molecules, we can use the function `detect_inorganic`. This function returns the boolean value `True` when it finds an inorganic molecule. We can run this flagging as a pre-processing step on the data and discard the flagged compounds.
"Inorganic compounds are known to have biological effects, like for example toxic effects." (Fourches 2010, Chapter 2.1; TODO: fix citation)
Because of this potential bioactivity, we cannot tell whether the recorded activity of a mixed compound is caused by its organic or its inorganic part. The entry is therefore useless and can be discarded. TODO: this should be logged, and manual curation should be enabled.
An alternative and easy approach would be to run a substring search on every SMILES string and flag each match against an inorganic compound pattern (the set of search patterns still has to be defined).
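
The substring search described above could be sketched as follows. This is a minimal illustration, not the behaviour of `detect_inorganic`: the pattern set `INORGANIC_SYMBOLS` and the function names are assumptions made for this example, and the list of elements is far from complete. The sketch only inspects bracket atoms, so neutral inorganic components such as water (`O`) are not caught.

```python
import re

# Hypothetical pattern set: element symbols treated as "inorganic" flags.
# This list is an assumption for illustration and is far from complete.
INORGANIC_SYMBOLS = {"Li", "Na", "K", "Mg", "Ca", "Fe", "Cu", "Zn", "Hg", "Pt"}

# Element symbol at the start of a bracket atom, after an optional isotope number.
BRACKET_SYMBOL = re.compile(r"\[\d*([A-Z][a-z]?)")


def find_inorganic_symbols(smiles):
    """Return the flagged element symbols found in bracket atoms of a SMILES string."""
    return {s for s in BRACKET_SYMBOL.findall(smiles) if s in INORGANIC_SYMBOLS}


def flag_records(smiles_list):
    """Split records into (clean, flagged); flagged entries keep their matches."""
    clean, flagged = [], []
    for smiles in smiles_list:
        hits = find_inorganic_symbols(smiles)
        if hits:
            # Keep the matched symbols so the record can be logged and curated manually.
            flagged.append((smiles, sorted(hits)))
        else:
            clean.append(smiles)
    return clean, flagged
```

Returning the flagged records together with their matches, instead of dropping them silently, keeps the door open for the logging and manual curation mentioned above.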

Because the treatment of mixtures is not as simple as it appears, the paper (Fourches 2010) recommends deleting records that contain mixtures. TODO: these records can again be logged so that the set can be curated manually. To ease the curation, various filtering functions can be implemented to help decide which records to keep and which to discard. Three types of mixtures are described. TODO: check whether an implementation would be easy and fast. A common and widely used practice is to retain the molecule with the highest molecular weight or the largest number of atoms (Fourches 2010, Chapter 2.1), but the paper states that this might not be the best solution. Mixtures should only be investigated further if there is reason to believe that the biological activity is really caused by the largest molecule and not by the mixture itself.
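
The "retain the largest molecule" practice can be approximated at the string level, before any conversion to mol objects. The sketch below is an assumption-laden illustration: the function names are hypothetical, the atom count is a regex approximation of the SMILES organic subset, and splitting on `.` ignores the rare case of ring-closure bonds that span a dot. RDKit's fragment handling (for example `Chem.GetMolFrags`) would be the more robust route once molecules are parsed.

```python
import re

# Tokens counted as atoms: any bracket atom, the two-letter halogens,
# then one-letter organic-subset atoms (aliphatic and aromatic).
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Bracket atoms that are plain hydrogen, e.g. [H], [H+], [2H].
HYDROGEN_ATOM = re.compile(r"\[\d*H[+-]?\d*\]")


def heavy_atom_count(component):
    """Approximate heavy-atom count of one dot-free SMILES component."""
    tokens = ATOM_TOKEN.findall(component)
    return sum(1 for token in tokens if not HYDROGEN_ATOM.fullmatch(token))


def split_mixture(smiles):
    """Return the dot-separated components of a SMILES string."""
    return [part for part in smiles.split(".") if part]


def largest_component(smiles):
    """Return the component with the most heavy atoms (the first one wins on ties)."""
    return max(split_mixture(smiles), key=heavy_atom_count)
```

For `test_smiles3`, `largest_component` keeps `CN(C)C` and drops the halide components. As the paper cautions, this choice is only justified when the activity can be attributed to the largest molecule.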

These actions might be performed before the entered SMILES are converted into mol files, since some of the described steps are plain string pattern searches.

In [None]:

#  Removal of mixtures, inorganics (and eventually organometallics)
# Functions detect_inorganic,remove_fragments, disconnect_metals, detect_inorganic again

In [None]:
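# Sketch for filtering the inorganic records
# Assumption: detect_inorganic accepts an RDKit Mol and returns a bool.

def split_inorganics(smiles_list):
    """Split SMILES records into organic and inorganic ones."""
    records_organics = []
    records_inorganics = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Something is wrong with: {smiles}")
        # The SMILES strings are stored here; RDKit Mol objects could be
        # kept in a plain Python list in the same way.
        if detect_inorganic(mol):
            records_inorganics.append(smiles)
        else:
            records_organics.append(smiles)
    return records_organics, records_inorganics

new_data = [test_smiles, test_smiles2, test_smiles3]
records_organics, records_inorganics = split_inorganics(new_data)
# no further processing will happen to `records_inorganics`
# `records_organics` is passed on in the pipeline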


In [None]:
# Sketch for filtering the mixture records

# TODO: Write a helper function that returns a boolean value when it finds a fragment
# (the InChI is changed by a removed fragment) or a metal (disconnect_metals has altered the record).
# The functions could be named `detect_fragment` and `detect_metals`.
# Or write one function that just checks whether executing a function actually
# altered the InChI --> might be the simpler solution

def split_mixtures(smiles_list):
    """Split records into those without and those with fragments."""
    without_fragments = []
    with_fragments = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Something is wrong with: {smiles}")
        if detect_fragment(mol):  # hypothetical helper, see TODO above
            with_fragments.append(smiles)
        else:
            without_fragments.append(smiles)
    return without_fragments, with_fragments

no_fragement_record, contains_fragment_records = split_mixtures(records_organics)

# [OPTIONAL] keep only the largest fragment of each record
# `choose_largest_fragment` is a hypothetical helper that returns the reduced record.
contains_largest_fragement = [choose_largest_fragment(x) for x in contains_fragment_records]

# OR remove known common fragments with the function `remove_fragments`
# The code would be the same as above.
# A helper function is still needed to continue the pipeline with this set
# without mixing it up with the "safe_dataset".

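The TODO in the mixture cell suggests checking whether a processing function actually altered a record's InChI. A generic version of that check could look like the sketch below. The names `was_altered`, `partition_by_change`, and the `identifier` parameter are assumptions made for this example; in this notebook `identifier` could be an InChI generator such as RDKit's `Chem.MolToInchi`, while the default `str` is only suitable for plain strings.

```python
def was_altered(transform, record, identifier=str):
    """Return (changed, result): whether `transform` changed the record's identifier.

    `identifier` maps a record to a comparable string, for example an InChI.
    It is evaluated before `transform` runs, so in-place edits are still detected.
    """
    before = identifier(record)
    result = transform(record)
    return identifier(result) != before, result


def partition_by_change(transform, records, identifier=str):
    """Apply `transform` to every record and split the results into (unchanged, changed)."""
    unchanged, changed = [], []
    for record in records:
        altered, result = was_altered(transform, record, identifier)
        (changed if altered else unchanged).append(result)
    return unchanged, changed
```

With this helper, `detect_fragment` and `detect_metals` would not need separate implementations: passing `remove_fragments` or `disconnect_metals` as `transform` would separate the records that were touched from those that were not, assuming both functions return the processed molecule.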





In [None]:
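# Sketch for filtering metals
# `detect_metals` is the hypothetical helper named in the TODO of the mixture cell.

def split_metals(smiles_list):
    """Split records into those with and those without metals."""
    metal_true = []
    metal_false = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            raise ValueError(f"Something is wrong with: {smiles}")
        if detect_metals(mol):
            metal_true.append(smiles)
        else:
            metal_false.append(smiles)
    return metal_true, metal_false

metal_true, metal_false = split_metals(no_fragement_record)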

### Structural Conversion and Cleaning

Some drugs need to be transformed "into their salt form to enhance how the drug dissolves (...) and (to) increase its effectiveness" (https://www.drugs.com/article/pharmaceutical-salts.html, accessed 03/12/21). It is therefore common for chemical compound databases to contain records of salts. If possible, it is recommended to delete the records containing salts completely since, as with inorganic compounds, "most descriptor-generating software (can not process salts)" (Fourches 2010, Chapter 2.2). Although not desirable, converting compounds into their neutral forms is still an acceptable procedure. Such cases should be tagged, filtered, and afterwards manually curated or compared to the actual neutral form of the compound.
If we want to continue working on the converted records, we should perform the following steps:
- check whether records contain compounds with metals --> difficult case, filter out (already done one step earlier)
- remove the salts from the record
- normalize the record (normalization or basic standardization)
- neutralize the charges
- to be discussed: adding or removing hydrogens, both have pros and cons (adding H: pro --> higher prediction performance, con --> may introduce noise --> less reliable models; removing H: might introduce errors when calculating descriptors, because certain cases might not be handled well)
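
The salt-removal and tagging steps above could be sketched at the string level as follows. This is independent of the notebook's `remove_salts`, whose behaviour is not shown here. The counter-ion set and the function names are assumptions made for this example, not a vetted salt dictionary, and the sketch does not neutralize any charges.

```python
# Hypothetical list of counter-ions and solvents written as SMILES components.
# The entries are assumptions for illustration, not a vetted salt dictionary.
COMMON_COUNTERIONS = {"[Na+]", "[K+]", "[Cl-]", "[Br-]", "Cl", "Br", "O"}


def strip_counterions(smiles, counterions=COMMON_COUNTERIONS):
    """Drop known counter-ion components; return (stripped_smiles, removed_components)."""
    kept, removed = [], []
    for part in smiles.split("."):
        (removed if part in counterions else kept).append(part)
    if not kept:
        # The record consisted only of counter-ions: leave it untouched.
        return smiles, []
    return ".".join(kept), removed


def tag_salts(smiles_list):
    """Return one dict per record so that converted entries can be reviewed manually."""
    tagged = []
    for smiles in smiles_list:
        stripped, removed = strip_counterions(smiles)
        tagged.append({
            "original": smiles,
            "stripped": stripped,
            "removed": removed,
            "needs_review": bool(removed),
        })
    return tagged
```

Keeping the original SMILES next to the stripped one makes it possible to compare a converted record with the actual neutral form later. For `test_smiles`, the stripped result is still the charged carboxylate, so the charge-neutralization step remains necessary.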



In [None]:
# Structural conversion
# Cleaning/removal of salts
# Functions remove_salts
# normalize_molecules
# handle_charges
# handle_hydrogens

### Normalization of Specific Chemotypes

This step is more complex than plain normalization.

In [None]:
# Normalization of specific chemotypes
# normalize_molecules

In [None]:
# Treatment of tautomeric forms
# handle_tautomers

### Removal of duplicates
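
One possible shape for this step is an order-preserving filter that keeps the first record per key and returns the duplicates separately for inspection. The function name and the `key` parameter are assumptions made for this sketch. In this notebook the key would need to be a canonical identifier, for example a canonical SMILES from `Chem.MolToSmiles(Chem.MolFromSmiles(s))`, because two different SMILES strings can describe the same molecule.

```python
def remove_duplicates(records, key=lambda record: record):
    """Keep the first record per key; return (unique, duplicates) in input order."""
    seen = set()
    unique, duplicates = [], []
    for record in records:
        k = key(record)
        if k in seen:
            # Duplicates are collected rather than dropped so they can be inspected.
            duplicates.append(record)
        else:
            seen.add(k)
            unique.append(record)
    return unique, duplicates
```

With the default key, only string-identical records are treated as duplicates, so `CCO` and `OCC` would both be kept unless a canonicalizing `key` is supplied.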

In [None]:
# Analysis/removal of duplicates

In [None]:
# Manual inspection