
# Molecular Informatics: Chemical Property and Structure Retrieval

## 1. Introduction and Objectives
In this notebook, we explore the properties and structures of chemical compounds using data from the PubChem database. Given one or more compound names, the notebook retrieves:
- Molecular formula and weight
- Experimental data such as melting point, boiling point, and solubility
- GHS hazard statements
- Molecular structure visualizations

### Learning Objectives
1. Use the PubChem APIs (`pug_rest` and `pug_view`) to retrieve chemical data.
2. Parse and display chemical data in Python.
3. Clean parsed data to produce concise chemical property reports.
4. Visualize chemical structures using RDKit.

Let's start by loading the required libraries and defining functions to retrieve data from PubChem.


## 2. Required Libraries

Import the necessary libraries for API calls, data processing, and visualization.

In [None]:
# Import necessary libraries
import requests  # For PubChem API requests
import pandas as pd  # For tabular data handling
from rdkit import Chem  # RDKit for molecular representation
from rdkit.Chem import Draw  # RDKit for molecular visualization
from IPython.display import display  # To display images in the notebook
import re  # For data cleaning

# Install ipywidgets if not already installed
# Run the following in the terminal or notebook if you encounter an issue:
# pip install ipywidgets
import ipywidgets as widgets  # For interactive widgets


## 3. Function Definitions

### 3.1 CID Retrieval
The first step is to retrieve the PubChem Compound Identifier (CID) for a given chemical name.

In [None]:
def get_cid(chemical_name):
    """
    Fetch the CID (Compound Identifier) for a chemical name using the PubChem API.
    """
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{chemical_name}/cids/JSON"
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.json()["IdentifierList"]["CID"][0]
    except Exception as e:
        print(f"Error fetching CID for {chemical_name}: {e}")
        return None
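
Chemical names can contain spaces, commas, or slashes, which are awkward to place directly in a URL path. The sketch below shows one way to percent-encode the name before building the lookup URL. `build_cid_url` and `PUG_REST_BASE` are hypothetical names introduced here for illustration; they are not part of the notebook's own functions, and the sketch uses only the standard library.

```python
from urllib.parse import quote

# Assumed base URL, copied from the request used in get_cid
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_cid_url(chemical_name):
    """Hypothetical helper: build the name-to-CID URL with the name percent-encoded."""
    # safe="" also encodes "/" so the name stays a single path segment
    encoded_name = quote(chemical_name.strip(), safe="")
    return f"{PUG_REST_BASE}/compound/name/{encoded_name}/cids/JSON"


print(build_cid_url("acetic acid"))
```

The returned string could be passed to `requests.get` in place of the hand-built URL.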


### 3.2 Explanation of the fetch_chemical_data Function

The fetch_chemical_data function in this notebook is designed to retrieve two types of data from the PubChem database for a given compound: 
1. **Basic Molecular Properties** (e.g., Molecular Formula, Molecular Weight, SMILES string)
2. **Experimental Properties** (e.g., Melting Point, Boiling Point, Density, Solubility)

These two categories require different API endpoints due to the nature of PubChem's data organization:

#### 1. Retrieving Basic Molecular Properties Using pug_rest

The `pug_rest` endpoint is PubChem's standard REST API for retrieving basic chemical data. It returns properties such as the **Molecular Formula, Molecular Weight, and SMILES** string, each of which is a single value associated with a compound in PubChem's database. The request URL has the following structure:
```python
prop_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/property/MolecularWeight,MolecularFormula,CanonicalSMILES/JSON"
```

In [None]:
def fetch_basic_properties(cid):
    """
    Fetch the molecular formula, molecular weight, and SMILES string for a CID via pug_rest.
    """
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/property/MolecularWeight,MolecularFormula,CanonicalSMILES/JSON"
    try:
        response = requests.get(url)
        response.raise_for_status()
        properties = response.json()["PropertyTable"]["Properties"][0]
        weight = properties.get("MolecularWeight")
        return {
            "Molecular Formula": properties.get("MolecularFormula", "N/A"),
            # Append the unit only when a weight was actually returned
            "Molecular Weight": f"{weight} g/mol" if weight is not None else "N/A",
            "SMILES": properties.get("CanonicalSMILES", None)
        }
    except Exception as e:
        print(f"Error fetching basic properties for CID {cid}: {e}")
        return {}
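
When several compounds are needed, `pug_rest` also accepts a comma-separated list of CIDs in a single property request. The sketch below builds such a URL and indexes the rows of a response by CID. The helper names `build_batch_property_url` and `index_properties_by_cid` are hypothetical, and `sample_payload` is a hand-written stand-in in the shape of a `PropertyTable` response, not live data.

```python
def build_batch_property_url(cids, properties=("MolecularFormula", "MolecularWeight")):
    """Hypothetical helper: one pug_rest property URL covering several CIDs."""
    cid_list = ",".join(str(cid) for cid in cids)
    prop_list = ",".join(properties)
    return (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
        f"{cid_list}/property/{prop_list}/JSON"
    )


def index_properties_by_cid(payload):
    """Map each CID to its property row from a PropertyTable-shaped dict."""
    rows = payload.get("PropertyTable", {}).get("Properties", [])
    return {row["CID"]: row for row in rows if "CID" in row}


# Hand-written sample mimicking the response shape (illustrative values only)
sample_payload = {
    "PropertyTable": {
        "Properties": [
            {"CID": 962, "MolecularFormula": "H2O", "MolecularWeight": "18.015"},
            {"CID": 702, "MolecularFormula": "C2H6O", "MolecularWeight": "46.07"},
        ]
    }
}

print(build_batch_property_url([962, 702]))
by_cid = index_properties_by_cid(sample_payload)
print(by_cid[702]["MolecularFormula"])
```

Batching like this trades one request per compound for one request per group, which may matter when many names are entered at once.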

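### 3.4 Implementation of fetch_chemical_data

In [None]:
def fetch_chemical_data(cid):
    """
    Retrieve molecular properties, experimental data, and hazards for a single CID.

    Note: Section 7 defines another function with this same name that takes a
    list of chemical names; once that cell runs, it replaces this definition.
    Call this version before running Section 7, or rename one of the two.
    """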
    # Initialize data structures
    properties = {"Molecular Formula": "N/A", "Molecular Weight": "N/A", "SMILES": ""}
    experimental_data = {"Melting Point": "N/A", "Boiling Point": "N/A", "Solubility": "N/A"}
    hazards = "N/A"

    # Fetch basic properties
    prop_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{cid}/property/MolecularWeight,MolecularFormula,CanonicalSMILES/JSON"
    try:
        response = requests.get(prop_url)
        response.raise_for_status()
        prop_data = response.json()["PropertyTable"]["Properties"][0]
        weight = prop_data.get("MolecularWeight")
        properties.update({
            "Molecular Formula": prop_data.get("MolecularFormula", "N/A"),
            # Append the unit only when a weight was actually returned
            "Molecular Weight": f"{weight} g/mol" if weight is not None else "N/A",
            "SMILES": prop_data.get("CanonicalSMILES", "")
        })
    except Exception as e:
        print(f"Error fetching properties for CID {cid}: {e}")

    # Fetch experimental data
    exp_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
    try:
        response = requests.get(exp_url)
        response.raise_for_status()
        sections = response.json().get("Record", {}).get("Section", [])
        for section in sections:
            if section.get("TOCHeading") == "Chemical and Physical Properties":
                for sub_section in section.get("Section", []):
                    if sub_section.get("TOCHeading") == "Experimental Properties":
                        for prop in sub_section.get("Section", []):
                            heading = prop.get("TOCHeading")
                            if heading in experimental_data:
                                values = [
                                    info.get("Value", {}).get("StringWithMarkup", [{}])[0].get("String", "")
                                    for info in prop.get("Information", [])
                                ]
                                if heading in ["Melting Point", "Boiling Point"]:
                                    experimental_data[heading] = ", ".join(v for v in values if "°C" in v) or "N/A"
                                elif heading == "Solubility":
                                    experimental_data[heading] = clean_solubility(values)
    except Exception as e:
        print(f"Error fetching experimental data for CID {cid}: {e}")

    # Fetch hazard statements
    hazards = fetch_hazard_statements(cid)

    return properties, experimental_data, hazards




#### 2. Why `pug_view` Is Required for Experimental Properties

Experimental data (such as melting point, boiling point, density, and solubility) often involve detailed records with multiple data points, including conditions (e.g., temperature and pressure), source citations, and formatting for easy readability. Such data is stored in a more hierarchical and structured format, which `pug_rest` does not support.

To handle this, PubChem provides the `pug_view` endpoint, which allows access to the more complex records, including:

- **Hierarchical sections** (organized under "Chemical and Physical Properties" and "Experimental Properties").
- **Details on experimental conditions** and **source references** for each measurement.
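
The nested layout described above can be walked with a small helper that follows a path of `TOCHeading` names. This is a minimal sketch: `find_section` and `get_nested_section` are hypothetical helpers, and `mock_record` is a hand-built dictionary that only imitates the shape of a `pug_view` record.

```python
def find_section(sections, heading):
    """Return the first section whose TOCHeading equals `heading`, or {}."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
    return {}


def get_nested_section(record, path):
    """Follow a list of TOCHeading names down through nested 'Section' lists."""
    current = record.get("Record", {})
    for heading in path:
        current = find_section(current.get("Section", []), heading)
    return current


# Hand-built stand-in for a pug_view response (not real PubChem data)
mock_record = {
    "Record": {
        "Section": [
            {
                "TOCHeading": "Chemical and Physical Properties",
                "Section": [
                    {
                        "TOCHeading": "Experimental Properties",
                        "Section": [
                            {
                                "TOCHeading": "Boiling Point",
                                "Information": [
                                    {"Value": {"StringWithMarkup": [{"String": "100 °C"}]}}
                                ],
                            }
                        ],
                    }
                ],
            }
        ]
    }
}

boiling = get_nested_section(
    mock_record,
    ["Chemical and Physical Properties", "Experimental Properties", "Boiling Point"],
)
strings = [
    info["Value"]["StringWithMarkup"][0]["String"]
    for info in boiling.get("Information", [])
]
print(strings)
```

A missing heading anywhere along the path yields an empty dictionary rather than an exception, so callers can fall back to "N/A".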

Using the `pug_view` endpoint, we fetch experimental data as follows:

```python
exp_url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
```

In [None]:
def clean_solubility(values):
    """
    Remove parenthesized or bracketed metadata, such as '(NTP, 1992)', from solubility values.
    """
    cleaned_values = []
    for value in values:
        cleaned_value = re.sub(r'\s*\(.*?\)|\s*\[.*?\]', '', value).strip()
        if cleaned_value:
            cleaned_values.append(cleaned_value)
    return ", ".join(cleaned_values) if cleaned_values else "N/A"

def fetch_experimental_data(cid):
    """
    Retrieve experimental data (melting point, boiling point, solubility) using the PubChem API.
    """
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
    experimental_data = {"Melting Point": "N/A", "Boiling Point": "N/A", "Solubility": "N/A"}
    try:
        response = requests.get(url)
        response.raise_for_status()
        sections = response.json().get("Record", {}).get("Section", [])
        for section in sections:
            if section.get("TOCHeading") == "Chemical and Physical Properties":
                for sub_section in section.get("Section", []):
                    if sub_section.get("TOCHeading") == "Experimental Properties":
                        for prop in sub_section.get("Section", []):
                            heading = prop.get("TOCHeading")
                            if heading in experimental_data:
                                values = [info.get("Value", {}).get("StringWithMarkup", [{}])[0].get("String", "")
                                          for info in prop.get("Information", [])]
                                if heading in ["Melting Point", "Boiling Point"]:
                                    experimental_data[heading] = ", ".join(v for v in values if "°C" in v) or "N/A"
                                elif heading == "Solubility":
                                    experimental_data[heading] = clean_solubility(values)
        return experimental_data
    except Exception as e:
        print(f"Error fetching experimental data for CID {cid}: {e}")
        return experimental_data
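
The melting and boiling points above are kept as text. If numeric values are wanted, for example to sort the table, one possible approach is to extract the first Celsius figure with a regular expression. `parse_celsius` is a hypothetical helper and the input strings are made-up examples; ranges and qualifiers such as "decomposes" are not interpreted.

```python
import re

# A number (optionally negative or decimal) followed by a degree sign and "C".
# The lookbehind stops the "-" in a range like "10-12" from being read as a minus sign.
_CELSIUS = re.compile(r"(?<![\d.])(-?\d+(?:\.\d+)?)\s*°\s*C\b")


def parse_celsius(text):
    """Hypothetical helper: return the first Celsius value in `text` as a float, else None."""
    match = _CELSIUS.search(text)
    return float(match.group(1)) if match else None


for sample in ["16.6 °C", "-114.1 °C", "173.1 °F", "Decomposes"]:
    print(sample, "->", parse_celsius(sample))
```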


### 3.3 Why Hazard Information Retrieval Differs

Hazard information is retrieved differently from basic molecular properties and experimental data for two reasons:

- **Complexity of Data**: Hazard classifications include detailed regulatory information, such as GHS (Globally Harmonized System) codes, hazard statements, and warning labels.
- **Separate API Sections**: Hazard information is stored in the "Safety and Hazards" section of the `pug_view` hierarchy, requiring specific parsing logic to navigate the nested structure.

To ensure the hazard data is readable and concise, the following steps are taken during data cleaning:

- **Removal of Regulatory Codes**: Strings like "H225" or "H319" are omitted as they are not user-friendly.
- **Exclusion of Metadata**: Annotations like "[EU Classification]" are stripped from the hazard statements.
- **Elimination of Redundant Phrases**: Repeated or unnecessary text is removed to improve clarity.
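
The cleaning steps listed above can be illustrated on a few made-up statements. This is a sketch only: `clean_one_statement` and `deduplicate` are hypothetical helpers written for this example, and the sample strings are invented to resemble GHS statements.

```python
import re


def clean_one_statement(statement):
    """Illustrative only: strip the code and annotations from one hazard string."""
    text = re.sub(r"\bH\d{3}\b", "", statement)       # remove codes such as H225
    text = re.sub(r"\s*(\[.*?\]|\(.*?\))", "", text)  # remove bracketed or parenthesized notes
    return text.strip(" :")                           # tidy leftover punctuation


def deduplicate(statements):
    """Keep the first occurrence of each statement, ignoring case."""
    seen, unique = set(), []
    for statement in statements:
        key = statement.lower()
        if statement and key not in seen:
            seen.add(key)
            unique.append(statement)
    return unique


samples = [
    "H225: Highly flammable liquid and vapour [EU Classification]",
    "H225: Highly flammable liquid and vapour",
    "H319: Causes serious eye irritation",
]
cleaned = deduplicate(clean_one_statement(s) for s in samples)
print(cleaned)
```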

In [None]:
def fetch_hazard_statements(cid):
    """
    Fetch and parse hazard information for a given CID from the "Safety and Hazards" section.
    """
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
    hazard_statements = []
    try:
        response = requests.get(url)
        response.raise_for_status()
        sections = response.json().get("Record", {}).get("Section", [])
        for section in sections:
            if section.get("TOCHeading") == "Safety and Hazards":
                for sub_section in section.get("Section", []):
                    if sub_section.get("TOCHeading") == "Hazards Identification":
                        for ghs_section in sub_section.get("Section", []):
                            if ghs_section.get("TOCHeading") == "GHS Classification":
                                for info in ghs_section.get("Information", []):
                                    # Use .get so one malformed entry does not discard every statement
                                    if info.get("Name") == "GHS Hazard Statements":
                                        hazard_statements += [
                                            statement.get("String", "")
                                            for statement in info.get("Value", {}).get("StringWithMarkup", [])
                                        ]
        return clean_hazard_descriptions(hazard_statements)
    except Exception as e:
        print(f"Error fetching hazard statements for CID {cid}: {e}")
        return "N/A"
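
Both `fetch_experimental_data` and `fetch_hazard_statements` download the same `pug_view` record, so each compound costs two identical requests. One possible way to avoid the repeat is to memoize the download per CID. In the sketch below, `make_cached_fetcher` is a hypothetical wrapper, and the downloader is passed in as a parameter so the idea can be tried with a stand-in function instead of a network call.

```python
from functools import lru_cache


def make_cached_fetcher(download, maxsize=128):
    """Hypothetical wrapper: memoize a download(cid) callable per CID."""
    @lru_cache(maxsize=maxsize)
    def fetch(cid):
        return download(cid)
    return fetch


# Stand-in downloader that records calls instead of contacting PubChem
calls = []


def fake_download(cid):
    calls.append(cid)
    return {"Record": {"RecordNumber": cid, "Section": []}}


fetch_record = make_cached_fetcher(fake_download)
first = fetch_record(702)
second = fetch_record(702)  # served from the cache, no second download
print(len(calls))
```

In the notebook, `download` could be a small function that calls `requests.get` on the `pug_view` URL and returns `response.json()`. The cached dictionary is shared between callers, so it should be treated as read-only.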


## 4. Parse and Clean Hazard Information

Hazard classifications include regulatory details, such as GHS codes and warnings. This function parses and cleans hazard data to remove redundant and unnecessary information.


In [None]:
def clean_hazard_descriptions(statements):
    """
    Clean hazard descriptions by removing duplicates, regulatory codes, and unnecessary metadata.
    """
    cleaned_statements = []
    seen_descriptions = set()

    for statement in statements:
        hazards = [s.strip() for s in statement.split(";")]
        for hazard in hazards:
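            hazard = re.sub(r'\bH\d{3}\b', '', hazard)  # Drop GHS codes such as H225
            hazard = re.sub(r'\s*\[.*?\]|\s*\(.*?\)', '', hazard)  # Drop bracketed or parenthesized metadata
            hazard = hazard.strip(" :+")  # Drop punctuation left behind by the removed code
            hazard_lower = hazard.lower()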
            if hazard and hazard_lower not in seen_descriptions:
                cleaned_statements.append(hazard)
                seen_descriptions.add(hazard_lower)

    return ", ".join(cleaned_statements) if cleaned_statements else "N/A"


## 5. Fetch Molecular Structures

Retrieve the molecular structure from SMILES and generate an RDKit object for visualization.


In [None]:
def fetch_structure(smiles):
    """
    Build an RDKit molecule object from a SMILES string; returns None if parsing fails.
    """
    try:
        mol = Chem.MolFromSmiles(smiles)
        return mol if mol else None
    except Exception as e:
        print(f"Error generating structure for SMILES {smiles}: {e}")
        return None


## 7. Fetch and Display Data for Multiple Compounds

This main function retrieves chemical data, including structures, and organizes the results into a table.


In [None]:
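def fetch_chemical_data(chemical_names):
    """
    Look up each chemical name and return a DataFrame of properties
    plus a list of (name, RDKit molecule) pairs for drawing.
    """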
    data_rows = []
    structures = []
    for chemical_name in chemical_names:
        cid = get_cid(chemical_name)
        if cid:
            basic_properties = fetch_basic_properties(cid)
            experimental_data = fetch_experimental_data(cid)
            hazards = fetch_hazard_statements(cid)
            smiles = basic_properties.get("SMILES", "")
            mol = fetch_structure(smiles) if smiles else None
            if mol:
                structures.append((chemical_name, mol))
            data_rows.append({
                "Chemical Name": chemical_name,
                "Molecular Formula": basic_properties.get("Molecular Formula", "N/A"),
                "Molecular Weight": basic_properties.get("Molecular Weight", "N/A"),
                "Melting Point (°C)": experimental_data.get("Melting Point", "N/A"),
                "Boiling Point (°C)": experimental_data.get("Boiling Point", "N/A"),
                "Solubility": experimental_data.get("Solubility", "N/A"),
                "Hazards": hazards
            })
        else:
            data_rows.append({
                "Chemical Name": chemical_name,
                "Molecular Formula": "Error",
                "Molecular Weight": "Error",
                "Melting Point (°C)": "Error",
                "Boiling Point (°C)": "Error",
                "Solubility": "Error",
                "Hazards": "Error"
            })
    return pd.DataFrame(data_rows), structures
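
The loop above issues several requests per compound, so a long list of names can produce a burst of traffic. PubChem publishes request-rate limits for programmatic access (to our knowledge, about five requests per second; check the current policy). A minimal sketch of spacing out calls follows. `Throttle` is a hypothetical helper, and its `clock` and `sleep` parameters exist only so it can be exercised without real waiting.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum interval between successive calls."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_per_second=5)
for name in ["water", "ethanol"]:
    throttle.wait()  # call this before each request
    print(f"ready to request {name}")
```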


## 8. Interactive Input and Results Display
Enter chemical names to fetch their data, display it in a scrollable table, and visualize the molecular structures.


In [None]:
def display_scrollable_table(df):
    """
    Show the DataFrame inside a height-limited, scrollable output area.
    """
    styled_df = df.style.set_table_attributes("style='display:inline'")
    styled_df = styled_df.set_properties(**{'white-space': 'pre-wrap'})  # Enable text wrapping
    # Apply the scroll settings to the Output that actually holds the table
    output = widgets.Output(layout=widgets.Layout(max_height='400px', overflow='auto'))
    with output:
        display(styled_df)
    scrollable_widget = widgets.VBox([
        widgets.Label("Chemical Data Table"),
        output
    ])
    display(scrollable_widget)

def display_structures(structures):
    for name, mol in structures:
        print(f"Structure of {name}:")
        display(Draw.MolToImage(mol, size=(150, 150)))

## 9. User Input and Execution

In [None]:

chemical_names_input = input("Enter chemical names separated by commas: ")
chemical_names = [name.strip() for name in chemical_names_input.split(",") if name.strip()]

chemical_data_df, structures = fetch_chemical_data(chemical_names)
display_scrollable_table(chemical_data_df)
display_structures(structures)

## 10. Conclusion

This notebook demonstrates the use of PubChem APIs for retrieving molecular and experimental properties, hazard information, and molecular structures. By combining pug_rest and pug_view, we efficiently handle different types of chemical data.