<a href="https://colab.research.google.com/github/supersciencegrl/chemistry-scripts/blob/main/batch_chemical_parameters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Batch chemical parameters

Input a list of chemical names (or CAS numbers!) and receive the CAS number, molecular weight, molecular formula and empirical formula of each compound in return.

Click 'play' on the next two code blocks to import modules and functions for your script. You only need to do this once at the start of every session! 

In [11]:
# Install packages. You only need to do this once per device (probably)
!pip install molmass
!pip install pubchempy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [12]:
# Import modules and functions. You only need to do this once, at the start of every session
import io
import re
import urllib.error

from google.colab import files
import molmass
import pandas as pd
import pubchempy as pcp
import requests

def iscas(name):
    ''' Function to determine whether a string is a CAS number '''

    x = re.match(r'\d{2,7}-\d\d-\d$', name.strip())
    if x:
        z = x[0]
        # Reverse the digits so the check digit sits at index 0 (weight 0)
        # and the remaining digits get weights 1, 2, 3... from right to left
        digitsBackwards = z.replace('-', '')[::-1]
        checksum = 0
        for i, digit in enumerate(digitsBackwards):
            checksum += (i * int(digit))
        return checksum % 10 == int(z[-1])

    else: # No regex match
        return False

def lookupCompoundPubChem(name, identifier):
    ''' returns pubchempy Compound or None from identifier '''

    try:
        # The network request happens here, so it must sit inside the try block
        compounds = pcp.get_compounds(name, identifier)
        return compounds[0]

    except IndexError: # No matching compound found
        return None
    except urllib.error.URLError as error:
        print(error)
        return None

def searchFromPubChem(Compound):
    ''' Finds name, IUPAC name, MW, MF, exact mass, SMILES, InChI, InChIKey and CAS # from PubChem Compound '''
    
    cpddict = {}
    if Compound:
        if Compound.synonyms:
            cpddict['cpdname'] = Compound.synonyms[0]
        cpddict['iupac'] = Compound.iupac_name
        try:
            cpddict['mw'] = float(Compound.molecular_weight)
        except ValueError:
            cpddict['mw'] = Compound.molecular_weight
        cpddict['mf'] = Compound.molecular_formula
        cpddict['exactMass'] = Compound.exact_mass
        cpddict['smiles'] = Compound.isomeric_smiles
        cpddict['inchi'] = Compound.inchi
        cpddict['inchikey'] = Compound.inchikey

        cpddict['cas'] = CompoundtoCAS(Compound)
        
    else:
        return None

    return cpddict

def CompoundtoCAS(Compound):
    ''' Returns CAS # from PubChem Compound '''

    if Compound:
        # Return the first synonym that is a valid CAS number
        for s in Compound.synonyms:
            if iscas(s):
                return s

    return None

def calcEmpiricalFormula(Compound):
    ''' Returns empirical formula from a Compound '''

    if Compound:
        mf = Compound.molecular_formula
        if mf:
            formula = molmass.Formula(mf)
            return formula.empirical
        else:
            return None

    else:
        return None

# For debugging
test = 'paracetamol'
test_compound_list = ['paracetamol', 'methylene blue', 'acetic acid', 'iron(II) chloride', 'RuPhos', '64-19-7', 'notACompoundTest']
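
The check-digit test in `iscas` above can be traced by hand. Take acetic acid, 64-19-7: setting aside the final check digit, the remaining digits are weighted 1, 2, 3... from right to left, so 1×9 + 2×1 + 3×4 + 4×6 = 47, and 47 mod 10 = 7, which matches the check digit. The sketch below repeats that arithmetic; the helper name `cas_check_digit` is hypothetical and only used for illustration.

```python
def cas_check_digit(cas):
    """Compute the expected check digit for a CAS number string such as '64-19-7'."""
    digits = cas.replace('-', '')[:-1]  # Drop hyphens and the final check digit
    # Weight the remaining digits 1, 2, 3... starting from the rightmost one
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10

print(cas_check_digit('64-19-7'))   # acetic acid
print(cas_check_digit('103-90-2'))  # paracetamol
```

A string only counts as a CAS number if the computed digit equals its last digit, which is why a mistyped number is usually rejected before any PubChem lookup.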

## Upload your csv below when the 'Choose Files' button appears!

In [17]:
# Input your list of compounds as a csv
!rm -f *.csv # Delete all existing csv files (-f avoids an error if there are none)
uploaded = files.upload()
filename = next(iter(uploaded))
input_df = pd.read_csv(io.BytesIO(uploaded[filename]), header=None)

compound_list = input_df[input_df.columns[0]]

num_compounds = len(input_df.index)
suffix = 's' if num_compounds != 1 else ''
print(f'\n{num_compounds} compound{suffix} imported!')

Saving input.csv to input.csv

5 compounds imported!
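
The upload cell reads the file without a header row and keeps only the first column, so the input should hold one chemical name or CAS number per row. Below is a minimal sketch of writing and checking such a file. It assumes the standard-library `csv` module in place of pandas so that it also runs outside Colab, and the helper names `write_input_csv` and `read_first_column` are hypothetical.

```python
import csv
import os
import tempfile

def write_input_csv(path, compounds):
    """Write one compound per row, with no header row."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for cpd in compounds:
            writer.writerow([cpd])

def read_first_column(path):
    """Read the first column back, mirroring how the notebook selects compounds."""
    with open(path, newline='', encoding='utf-8') as f:
        return [row[0] for row in csv.reader(f) if row]

compounds = ['paracetamol', 'methylene blue', 'acetic acid', 'iron(II) chloride', '64-19-7']
path = os.path.join(tempfile.mkdtemp(), 'input.csv')
write_input_csv(path, compounds)
print(read_first_column(path))
```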


### Now generate your result table:

In [21]:
# Set up result table structure
columns = ['name', 'CAS', 'MW', 'MF', 'empirical formula']
df = pd.DataFrame(columns=columns)

# Featurize compound list
for cpd in compound_list:
    # Find compound in PubChem
    Compound = lookupCompoundPubChem(cpd, 'name') # Suitable for chemical name or CAS lookup
    cas = CompoundtoCAS(Compound)
    mw = Compound.molecular_weight if Compound else None
    mf = Compound.molecular_formula if Compound else None

    parameters = {'name': cpd,
                  'CAS': cas,
                  'MW': mw,
                  'MF': mf,
                  'empirical formula': calcEmpiricalFormula(Compound)} # Must match columns above
    df.loc[len(df)] = parameters # Append to DataFrame

### View your results with the next cell
(does not update automatically)

In [22]:
print(df)

               name          CAS      MW           MF empirical formula
0       paracetamol     103-90-2  151.16      C8H9NO2           C8H9NO2
1    methylene blue      61-73-4   319.9  C16H18ClN3S       C16H18ClN3S
2  notACompoundTest         None    None         None              None
3            RuPhos  787618-22-8   466.6    C30H43O2P         C30H43O2P
4       acetic acid      64-19-7   60.05       C2H4O2              CH2O
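
The acetic acid row shows how the two formula columns differ. The empirical formula divides every element count by their greatest common divisor, so C2H4O2 reduces to CH2O, whereas the counts in C8H9NO2 share no common factor and stay unchanged. The notebook delegates this to `molmass`; the sketch below is a simplified stand-in that assumes flat formulae without brackets, charges or isotopes, and the function name `empirical_formula` is hypothetical.

```python
import re
from functools import reduce
from math import gcd

def empirical_formula(mf):
    """Reduce a flat molecular formula such as 'C2H4O2' to its empirical formula."""
    # Each match is (element symbol, optional count); a missing count means 1
    pairs = re.findall(r'([A-Z][a-z]?)(\d*)', mf)
    counts = {}
    for element, count in pairs:
        counts[element] = counts.get(element, 0) + int(count or 1)
    divisor = reduce(gcd, counts.values())
    # Keep the element order of the input; omit counts of 1
    return ''.join(
        element + (str(n // divisor) if n // divisor > 1 else '')
        for element, n in counts.items()
    )

print(empirical_formula('C2H4O2'))   # acetic acid
print(empirical_formula('C8H9NO2'))  # paracetamol
```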


### And download your results as a csv with the next cell

In [16]:
df.to_csv('output.csv', encoding='utf-8-sig')
files.download('output.csv')
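
The `utf-8-sig` encoding prefixes the file with a byte-order mark, which helps spreadsheet programs such as Excel recognise the file as UTF-8. That is presumably why it is used here, although the notebook does not say so. A minimal sketch of the difference, using only the standard library:

```python
text = 'name,MW\nparacetamol,151.16\n'

plain = text.encode('utf-8')
with_bom = text.encode('utf-8-sig')

# utf-8-sig adds the three-byte byte-order mark EF BB BF at the start
print(with_bom[:3])
print(with_bom[3:] == plain)
```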

This script was made for [@Imekulo](https://twitter.com/Imekulo) on Twitter. 

Please [contact me](https://supersciencegrl.co.uk) or comment with improvements. Feel absolutely free to reuse, preferably with credit so others can find the original link with any updates. It's relatively easy to add new columns: I normally use modifications of [@khoivan88](https://github.com/khoivan88)'s script `pka_lookup_pubchem.py` [here](https://github.com/khoivan88/pka_lookup) to do so. 
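
As a sketch of how an extra column might be added: extend the `columns` list and add a matching key to the `parameters` dictionary built in the loop. The example below assumes the attribute names `isomeric_smiles` and `inchikey` that `searchFromPubChem` already reads. `FakeCompound` and `build_row` are hypothetical stand-ins so that the snippet runs without a PubChem request.

```python
from types import SimpleNamespace

# Column names must match the keys produced by build_row below
columns = ['name', 'MW', 'MF', 'SMILES', 'InChIKey']

def build_row(name, compound):
    """Collect one result row from a Compound-like object, or Nones if the lookup failed."""
    return {
        'name': name,
        'MW': compound.molecular_weight if compound else None,
        'MF': compound.molecular_formula if compound else None,
        'SMILES': compound.isomeric_smiles if compound else None,
        'InChIKey': compound.inchikey if compound else None,
    }

# Hypothetical stand-in for a pubchempy Compound (values for acetic acid)
FakeCompound = SimpleNamespace(
    molecular_weight='60.05',
    molecular_formula='C2H4O2',
    isomeric_smiles='CC(=O)O',
    inchikey='QTBSBXVTEAMEQO-UHFFFAOYSA-N',
)

print(build_row('acetic acid', FakeCompound))
print(build_row('notACompoundTest', None))
```

In the notebook itself, the same pattern would use the real `Compound` returned by `lookupCompoundPubChem` in place of the stand-in.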