<a href="https://colab.research.google.com/github/sofia-sunny/Short_Introductory_Tutorials/blob/main/05_Pubchem_Data_Retrieval.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **PubChem and Data Retrieval**
**PubChem** is a comprehensive chemical database that contains information on millions of chemical compounds. Each compound entry includes details such as molecular structure, chemical identifiers (like SMILES, InChI), molecular weight, and chemical properties (e.g., logP, solubility).


One way to retrieve chemical information from the **PubChem** database is through a Python library called **PubChemPy**. We will first install it and then import it as `pcp`.

In [None]:
# Install RDKit (if not already installed)
!pip install rdkit

Collecting rdkit
  Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.0 kB)
Downloading rdkit-2025.3.3-cp311-cp311-manylinux_2_28_x86_64.whl (34.9 MB)
Installing collected packages: rdkit
Successfully installed rdkit-2025.3.3


In [None]:
# Install and import pubchempy
!pip install pubchempy
import pubchempy as pcp

Collecting pubchempy
  Downloading PubChemPy-1.0.4.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pubchempy
  Building wheel for pubchempy (setup.py) ... done
  Created wheel for pubchempy: filename=PubChemPy-1.0.4-py3-none-any.whl size=13818 sha256=6148d649d973a1bc308cdf8eedab850ff5eede40dcdddb5d0ee353b036f4f093
  Stored in directory: /root/.cache/pip/wheels/8b/e3/6c/3385b2db08b0985a87f5b117f98d0cb61a3ae3ca3bcbbd8307
Successfully built pubchempy
Installing collected packages: pubchempy
Successfully installed pubchempy-1.0.4


### **pcp.get_compounds() Function**
The **pcp.get_compounds** function fetches chemical data (such as molecular weight and formula) from PubChem using an **identifier.** It is a convenient way to query compound data and work with it programmatically, which makes it a valuable tool in chemoinformatics.

### It returns **a list of Compound objects** that match the search criteria

### **Syntax:**
The general syntax for pcp.get_compounds() is as follows:

### **pcp.get_compounds**(identifier, namespace='cid', \*\*kwargs)

### **identifier:** This is the value used to search for compounds. It can be a name, a CID (PubChem Compound Identifier), a SMILES string, an InChI, or a molecular formula.

### **namespace:** This specifies the type of identifier being used. Common namespaces include:

* **'cid'**: PubChem Compound Identifier (default)
* **'name'**: Common name of the compound
* **'smiles'**: SMILES string
* **'inchi'**: InChI string
* **'formula'**: Molecular formula

**Example:**
aspirin = pcp.get_compounds('aspirin', 'name')
Here `'aspirin'` is the identifier and `'name'` is the namespace.
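Since the namespace has to match the kind of identifier, it can help to pick it programmatically. The sketch below is only a rough heuristic and not part of PubChemPy: `guess_namespace` is a hypothetical helper that treats all-digit input as a CID, strings starting with `InChI=` as InChI, and everything else as a name. It makes no attempt to tell SMILES strings and formulas apart from names, so those namespaces still need to be passed explicitly.

```python
def guess_namespace(identifier):
    """Guess a namespace for an identifier (hypothetical helper, heuristic only)."""
    text = str(identifier).strip()
    if text.isdigit():
        return 'cid'    # all digits: treat as a PubChem CID
    if text.startswith('InChI='):
        return 'inchi'  # standard InChI strings carry this prefix
    return 'name'       # fallback: assume a common name

for item in ['aspirin', '2244', 2244, 'InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3']:
    print(item, '->', guess_namespace(item))
```

The result could then be passed on as `pcp.get_compounds(identifier, guess_namespace(identifier))`.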

### **\*\*kwargs:** Additional arguments to customize the search or the data retrieved. These can include search options like searchtype for similarity searches or filters for property-based searches.

### Using **pcp.get_compounds()** to get the SMILES of a compound from its **name**
**get_compounds('aspirin', 'name')** returns a list of Compound objects that match the name aspirin.

In [None]:
# Use pubchempy to get the compound by name
compound = pcp.get_compounds('aspirin', 'name')
# pcp.get_compounds returns a list, even when there is only one match
type(compound)

list

We are interested in getting the first member of the list: **compound[0]**. This first item is an object of type Compound, which represents the most relevant match from PubChem.
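Indexing with `[0]` raises an `IndexError` when the search returns an empty list. A minimal sketch of a guard is shown below; `first_match` is a hypothetical helper name, and plain strings stand in for Compound objects so the example runs without a network connection.

```python
def first_match(results):
    """Return the first item of a result list, or None if the list is empty."""
    return results[0] if results else None

# Plain strings stand in for Compound objects here
print(first_match(['top hit', 'second hit']))
print(first_match([]))
```

With PubChemPy this could be used as `first_match(pcp.get_compounds('aspirin', 'name'))`.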

Get the canonical SMILES of the first compound in the list:

In [None]:
aspirin_canonical_smiles = compound[0].canonical_smiles
# Print the SMILES string
print(f"The canonical SMILES of aspirin is: {aspirin_canonical_smiles}")

The canonical SMILES of aspirin is: CC(=O)OC1=CC=CC=C1C(=O)O


### Another Example:
Using **pcp.get_compounds()** to get the SMILES and name of a compound from its **CID**

In [None]:
compound = pcp.get_compounds('2244', 'cid')
compound_smiles = compound[0].canonical_smiles
print(f"The canonical SMILES of the compound is: {compound_smiles}")
compound_name = compound[0].iupac_name
print(f"The name of the compound is: {compound_name}")

The canonical SMILES of the compound is: CC(=O)OC1=CC=CC=C1C(=O)O
The name of the compound is: 2-acetyloxybenzoic acid


### Using **pcp.get_compounds()** to get the SMILES and name of a compound from its molecular **formula**

In [None]:
compound = pcp.get_compounds('C2H6O', 'formula')
compound_smiles = compound[0].canonical_smiles
print(f"The canonical SMILES of the compound is: {compound_smiles}")
compound_name = compound[0].iupac_name
print(f"The name of the compound is: {compound_name}")

The canonical SMILES of the compound is: CCO
The name of the compound is: ethanol
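A molecular formula can match more than one compound: C2H6O fits both ethanol and dimethyl ether (methoxymethane), so looking only at the first result may hide isomers. The sketch below lists several matches at once. `summarize_matches` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for Compound objects (with the `cid` and `iupac_name` attributes used in this tutorial) so the example runs offline.

```python
from types import SimpleNamespace

def summarize_matches(compounds, limit=5):
    """Return (cid, iupac_name) pairs for the first `limit` compounds."""
    return [(c.cid, c.iupac_name) for c in compounds[:limit]]

# Stand-ins for the Compound objects a formula search might return
matches = [
    SimpleNamespace(cid=702, iupac_name='ethanol'),
    SimpleNamespace(cid=8254, iupac_name='methoxymethane'),
]

for cid, name in summarize_matches(matches):
    print(cid, name)
```

In the notebook, the list returned by `pcp.get_compounds('C2H6O', 'formula')` would take the place of `matches`.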


### **From SMILES to name and other identifiers:**

In [None]:
# SMILES string for a compound
comp_smiles = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

In [None]:
# Look up the SMILES string and keep the first match (index 0)
compound = pcp.get_compounds(comp_smiles, 'smiles')[0]

In [None]:
# Print compound information
print(f"Common Name: {compound.synonyms[0]}")
print(f"IUPAC Name: {compound.iupac_name}")
print(f"CID: {compound.cid}")
print(f"Molecular Formula: {compound.molecular_formula}")
print(f"Molecular Weight: {compound.molecular_weight}")
print(f"Canonical SMILES: {compound.canonical_smiles}")

Common Name: caffeine
IUPAC Name: 1,3,7-trimethylpurine-2,6-dione
CID: 2519
Molecular Formula: C8H10N4O2
Molecular Weight: 194.19
Canonical SMILES: CN1C=NC2=C1C(=O)N(C(=O)N2C)C
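When the same fields are needed for many compounds, collecting them into a dictionary is often handier than printing them one by one. The following is a minimal sketch: `compound_record` and `FIELDS` are hypothetical names, and the `SimpleNamespace` stand-in simply carries the caffeine values from the output above so that the example runs without querying PubChem.

```python
from types import SimpleNamespace

# Attribute names used in this tutorial
FIELDS = ('cid', 'iupac_name', 'molecular_formula',
          'molecular_weight', 'canonical_smiles')

def compound_record(compound, fields=FIELDS):
    """Collect selected attributes into a dict; missing attributes become None."""
    return {field: getattr(compound, field, None) for field in fields}

# Stand-in for a Compound object, filled with the caffeine values shown above
caffeine = SimpleNamespace(
    cid=2519,
    iupac_name='1,3,7-trimethylpurine-2,6-dione',
    molecular_formula='C8H10N4O2',
    molecular_weight=194.19,
    canonical_smiles='CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
)

record = compound_record(caffeine)
for key, value in record.items():
    print(f"{key}: {value}")
```

A list of such dictionaries can be turned into a table later, for instance with pandas.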


### We can write a function to **get the SMILES from the name**


In [None]:
# Function to get SMILES from a chemical name
def get_smiles(chemical_name):
    compounds = pcp.get_compounds(chemical_name, 'name')
    if not compounds:  # no match in PubChem: avoid an IndexError on [0]
        return None
    return compounds[0].canonical_smiles

### Example:

In [None]:
# Example chemical name
chemical_name = "ethanol"

# Get SMILES
smiles = get_smiles(chemical_name)

# Print the results
print(f"Chemical Name: {chemical_name}")
print(f"SMILES: {smiles}")


Chemical Name: ethanol
SMILES: CCO
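Each lookup by name sends a request to PubChem, so asking for the same name twice repeats the same network call. One possible way to avoid this is to cache results with `functools.lru_cache` from the standard library. In the sketch below, `slow_lookup` is a stand-in for a network-backed function such as `get_smiles`, and the `calls` list exists only to show how many uncached lookups happen.

```python
from functools import lru_cache

calls = []  # records each uncached lookup, for demonstration only

def slow_lookup(name):
    """Stand-in for a network call such as get_smiles(name)."""
    calls.append(name)
    return {'ethanol': 'CCO', 'water': 'O'}.get(name)

@lru_cache(maxsize=None)
def cached_lookup(name):
    """Return the cached result if this name was already looked up."""
    return slow_lookup(name)

for name in ['ethanol', 'water', 'ethanol']:
    print(name, cached_lookup(name))

print('Uncached lookups:', calls)
```

Note that `lru_cache` also remembers `None` results, so a name that fails once will not be retried until the cache is cleared with `cached_lookup.cache_clear()`.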


### Writing a Function to Return a List of SMILES from a List of Compound Names

Define a function called **get_smiles_list** that takes a **list of compound names** and returns a list of their corresponding SMILES strings **(smiles_list)**. Names that PubChem cannot match are represented by `None`, so the output stays aligned with the input.


In [None]:
# Define a function to get SMILES strings from a list of compound names
def get_smiles_list(compound_names):
    smiles_list = []
    for name in compound_names:
        compounds = pcp.get_compounds(name, 'name')
        if compounds:  # a non-empty list means PubChem found at least one match
            smiles = compounds[0].canonical_smiles
            smiles_list.append(smiles)
        else:  # no match: keep a placeholder so positions stay aligned
            smiles_list.append(None)
    return smiles_list

### Using the above function:

In [None]:
# Suppose we have the following list of compounds:
compound_names = ["water", "methane", "ethanol", "glucose", "caffeine", "acetone", "benzene", "ibuprofen"]

# Example of using the above function to get the list of SMILES strings
get_smiles_list(compound_names)

['O',
 'C',
 'CCO',
 'C(C1C(C(C(C(O1)O)O)O)O)O',
 'CN1C=NC2=C1C(=O)N(C(=O)N2C)C',
 'CC(=O)C',
 'C1=CC=CC=C1',
 'CC(C)CC1=CC=C(C=C1)C(C)C(=O)O']
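A bare list of SMILES is only meaningful next to the list of names it came from. As a possible follow-up step, the sketch below pairs the two lists into a dictionary and separately reports names that came back as `None`. `pair_names_with_smiles` is a hypothetical helper, and `'not_a_real_compound'` is a made-up name used only to illustrate the missing case.

```python
def pair_names_with_smiles(names, smiles_list):
    """Split lookups into a {name: smiles} dict and a list of unmatched names."""
    found = {}
    missing = []
    for name, smiles in zip(names, smiles_list):
        if smiles is None:
            missing.append(name)   # lookup returned no match
        else:
            found[name] = smiles
    return found, missing

# Sample data: two results from above plus one made-up name with no match
names = ['water', 'ethanol', 'not_a_real_compound']
smiles_list = ['O', 'CCO', None]

found, missing = pair_names_with_smiles(names, smiles_list)
print(found)
print('No match for:', missing)
```

In the notebook, `compound_names` and `get_smiles_list(compound_names)` would replace the sample lists.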