# Exploring PubChem

**UNDER CONSTRUCTION**
This is a non-programmatic activity designed to further your familiarity with PubChem. It is based on the Current Protocols article [Exploring Chemical Information in PubChem](https://currentprotocols.onlinelibrary.wiley.com/doi/10.1002/cpz1.217) and the 2019 Cheminformatics OLCC activity [4.1 PubChem Web Interface for Text](https://chem.libretexts.org/Courses/Intercollegiate_Courses/Cheminformatics/04%3A_Searching_Databases_for_Chemical_Information/4.01%3A_PubChem_Web_Interfaces_for_Text).

# PubChem Overview (2025)

The [PubChem homepage](https://pubchem.ncbi.nlm.nih.gov) provides a unified search interface that allows users to explore an extensive set of interrelated chemical and biological databases. Originally centered around the three core databases—**[Compound](https://pubchem.ncbi.nlm.nih.gov/docs/compounds)**, **[Substance](https://pubchem.ncbi.nlm.nih.gov/docs/substances)**, and **[BioAssays](https://pubchem.ncbi.nlm.nih.gov/docs/bioassays)**—PubChem has since expanded to include additional data domains such as **[Genes](https://pubchem.ncbi.nlm.nih.gov/docs/genes)**, **[Proteins](https://pubchem.ncbi.nlm.nih.gov/docs/proteins)**, **[Pathways](https://pubchem.ncbi.nlm.nih.gov/docs/pathways)**, **[Taxonomies](https://pubchem.ncbi.nlm.nih.gov/docs/taxonomies)**, **[Cell Lines](https://pubchem.ncbi.nlm.nih.gov/docs/cell-lines)**, **[Patents](https://pubchem.ncbi.nlm.nih.gov/docs/patents)**, **[Literature](https://pubchem.ncbi.nlm.nih.gov/docs/literature)**, and more. These additional records enable users to explore the relationships between small molecules and their biological context.



## PubChem as a Data Aggregator
PubChem is often mistaken for a data validator. It is really a data aggregator, and as of July 2025 there are over 1,063 sources, including government agencies, university labs, pharmaceutical companies, substance vendors, and other databases. An up-to-date list of PubChem’s data sources is available at the [PubChem Sources page](https://pubchem.ncbi.nlm.nih.gov/sources). When a source contributes data, it is stored as a substance or bioassay record and assigned a substance ID (SID) or assay ID (AID). PubChem then uses a canonicalization technique to associate the SID with a specific chemical structure, and uses that structure to link the substance record to a compound record. The compound record is therefore an aggregation of the substance records deposited by multiple data sources for a specific compound, and it maintains data provenance by linking back to the original substance records and their depositors' information. If a deposited bioassay has an associated SID, it is connected to the compound record that the SID is associated with. See the PubChem document: [What is the difference between a substance and a compound in PubChem?](https://pubchem.ncbi.nlm.nih.gov/docs/compound-vs-substance)

<div class="alert alert-block alert-success"> 


<details>
<summary>Does PubChem validate the data uploaded by a depositor?</summary>
<br><p>No. PubChem only validates the chemical structure, and then uses that structure to connect the substance record with the compound record of that chemical.
</p>
</details>
</div>



![image.png](attachment:318221c2-de26-4095-ae79-9c8ab6a614ca.png)  
Relationship showing how data sources contribute substance and bioassay data that can be connected to a compound's data record. ([image source link](https://pubchem.ncbi.nlm.nih.gov/docs/compound-vs-substance).)
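
The link between one compound record and its contributing substance records can also be inspected through PUG REST. The sketch below only builds the request URL and parses the response shape; the `sample` payload is a hypothetical, shortened response whose SID values are placeholders, and the helper names `sids_url` and `parse_sids` are introduced here for illustration.

```python
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def sids_url(cid):
    """Build the PUG REST URL that lists the substance IDs linked to one compound."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/sids/JSON"

def parse_sids(payload):
    """Return the list of SIDs from a PUG REST 'sids' response (empty if none)."""
    info = payload.get("InformationList", {}).get("Information", [])
    return info[0].get("SID", []) if info else []

# Hypothetical, heavily shortened response used only to show the shape;
# the SID values are placeholders, not real PubChem identifiers.
sample = {"InformationList": {"Information": [{"CID": 887, "SID": [101, 102, 103]}]}}

print(sids_url(887))
print(len(parse_sids(sample)), "substance records in the sample")

# With network access (not run here):
#   import requests
#   sids = parse_sids(requests.get(sids_url(887), timeout=30).json())
```

Each SID returned this way corresponds to one depositor's record, which is how provenance stays traceable from the aggregated compound page.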



To gain an understanding of PubChem, go to the About link at the top of the [homepage](https://pubchem.ncbi.nlm.nih.gov/). The left column allows you to navigate through the documentation.

In short, the PubChem homepage serves as a **centralized hub** for accessing chemical, biological, and pharmacological data linked through a chemical lens.

# PubChem's Landing Page
This activity will be a quick exploration of the features of PubChem's landing page, [https://pubchem.ncbi.nlm.nih.gov/](https://pubchem.ncbi.nlm.nih.gov/).

## About
The left frame of the [About link](https://pubchem.ncbi.nlm.nih.gov/docs/about) provides access to the resources in the [Docs link](https://pubchem.ncbi.nlm.nih.gov/docs/) and we will systematically go over these shortly.

### Entrez
NIH Entrez (French for "Enter") is the search and retrieval system of the National Center for Biotechnology Information (NCBI), a division of the U.S. National Library of Medicine (NLM) of the National Institutes of Health (NIH). It is essentially an integrated search engine for accessing a wide range of interconnected databases hosted by NCBI, of which PubChem is one. Entrez is integrated throughout NCBI resources and can be accessed through the search bar at the top of the NCBI landing page, [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/).


![image.png](attachment:18ebe53b-3d37-4063-81ee-578dd11da81a.png)  
Screenshot of [https://www.ncbi.nlm.nih.gov/](https://www.ncbi.nlm.nih.gov/) (08/2/2025).


The [Entrez Help NCBI Help Manual](https://www.ncbi.nlm.nih.gov/books/NBK3837/#EntrezHelp.Access_to_the_Entrez_System) is a 20-minute read that covers the Entrez databases and how to access and search them. A complete list of Entrez resources can be found at the [All Resources link](https://www.ncbi.nlm.nih.gov/guide/all/). PubChem's home page is a "unified chemical search interface" that covers all three of PubChem's primary databases (Substance, Compound and BioAssay) and supports searches by molecular formula and by line notations such as SMILES and InChI. Text searches within Entrez are also available from the PubChem homepage; see [Entrez Advanced Search](https://pubchem.ncbi.nlm.nih.gov/docs/advanced-search-entrez) for more information.
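
Entrez text searches can also be issued programmatically through NCBI's E-utilities. The following is a minimal sketch, assuming the `esearch` endpoint with the `pccompound` database name; the helper names and the `sample` response are illustrative, and the sample is trimmed to the two fields the parser reads.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, db="pccompound", retmax=5):
    """Build an E-utilities ESearch URL for a text query against an Entrez database."""
    query = urlencode({"db": db, "term": term, "retmode": "json", "retmax": retmax})
    return f"{EUTILS}?{query}"

def parse_ids(payload):
    """Return the list of record IDs (as strings) from an ESearch JSON response."""
    return payload.get("esearchresult", {}).get("idlist", [])

# Hypothetical, trimmed response showing only the fields used above.
sample = {"esearchresult": {"count": "1", "idlist": ["887"]}}

print(esearch_url("methanol"))
print(parse_ids(sample))
```

Changing `db` to `pcsubstance` or `pcassay` would target the Substance or BioAssay database instead.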




# NLM Video Tutorials on PubChem
The following video tutorials are part of the NLM PubChem training course and are available here for you to review.
## Searching with Structures in PubChem
This tutorial goes over using the PubChem Sketcher. A more comprehensive tutorial can be found in the [PubChem Sketcher Help Doc](https://pubchem.ncbi.nlm.nih.gov/docs/sketcher-help).  

In [None]:
%%html
<!-- This code cell embeds a YouTube video. Run it with Shift-Enter the first time you start the kernel. -->
<iframe width="560" height="315" src="https://www.youtube.com/embed/DWTvE0pXwBU?si=TOIfORcuJspPehQb" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## Finding Chemical Information in PubChem

In [None]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/jZw7w9jithI?si=K9YMsxLYILAtVyGQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## Finding Links and Citations in PubChem

In [None]:
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/jpb0XZeCd5Q?si=fmhtaZNgcF_OkNa8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

```python
import json

import requests

def get_lcss(cid):
    """Fetch the Laboratory Chemical Safety Summary (LCSS) view of a compound record."""
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
    params = {'toc': 'LCSS TOC'}  # restrict the record to the LCSS table of contents
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"HTTP {response.status_code}: could not fetch LCSS for CID {cid}")
        return None

lcss = get_lcss(887)  # methanol
print(json.dumps(lcss, indent=2))
```

In [4]:
import requests
import json
import os

def get_lcss(cid):
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
    params = {'toc': 'LCSS TOC'}
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"HTTP {response.status_code}: could not fetch LCSS for CID {cid}")
        return None

# Step 1: Create the 'data' directory in the current working directory
data_dir = os.path.join(os.getcwd(), "data")
os.makedirs(data_dir, exist_ok=True)  # create if it doesn't exist

# Step 2: Fetch and save the LCSS data
cid = 887  # methanol
lcss_data = get_lcss(cid)

if lcss_data:
    output_path = os.path.join(data_dir, f"CID_{cid}_LCSS.json")
    with open(output_path, "w") as f:
        json.dump(lcss_data, f, indent=2)
    print(f"LCSS data saved to: {output_path}")


LCSS data saved to: /home/rebelford/datachemOLCC/module-development/02-Public-Chemical-Databases/data/CID_887_LCSS.json


In [None]:
import requests
import json
import os

# Step 0: Setup data directory
data_dir = os.path.join(os.getcwd(), "data")
os.makedirs(data_dir, exist_ok=True)

# Step 1: Define stock room chemicals
stock_room = ['acetone', 'benzene', 'methanol', 'ethanol']

# Step 2: Function to get CID from chemical name
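def get_cid(chemical_name):
    # Percent-encode the name so spaces and other reserved characters are safe in the URL path
    safe_name = requests.utils.quote(chemical_name, safe="")
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/{safe_name}/cids/JSON"
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        cids = data.get("IdentifierList", {}).get("CID", [])
        # A name can match several compounds; keep only the first CID returned
        return cids[0] if cids else None
    else:
        print(f"HTTP {response.status_code}: error getting CID for {chemical_name}")
        return None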

# Step 3: Function to get LCSS data from CID
def get_lcss(cid):
    url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON"
    params = {'toc': 'LCSS TOC'}
    response = requests.get(url, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error getting LCSS for CID {cid}")
        return None

# Step 4: Loop through each chemical and process
for chemical in stock_room:
    output_path = os.path.join(data_dir, f"{chemical}.json")
    
    # Skip if file already exists
    if os.path.exists(output_path):
        print(f"Skipping {chemical}: file already exists.")
        continue

    # Get CID
    cid = get_cid(chemical)
    if cid is None:
        print(f"Skipping {chemical}: CID not found.")
        continue

    # Get LCSS JSON
    lcss_data = get_lcss(cid)
    if lcss_data is None:
        print(f"Skipping {chemical}: LCSS not found.")
        continue

    # Save to file
    with open(output_path, "w") as f:
        json.dump(lcss_data, f, indent=2)
    print(f"Saved LCSS for {chemical} (CID {cid}) to {output_path}")
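
PubChem's usage policy asks for no more than five requests per second, so a longer stock-room list is best fetched with a pause between calls. The helper below is a sketch of one way to do that; `polite_map`, its parameters, and the `fake_cids` stand-in for `get_cid()` are hypothetical names introduced for illustration, and the clock and sleep functions are injectable so the example runs without waiting or touching the network.

```python
import time

def polite_map(fetch, items, min_interval=0.25, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(item) for each item, keeping at least min_interval seconds between calls."""
    results = {}
    last_call = None
    for item in items:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        results[item] = fetch(item)
    return results

# Stand-in for get_cid(); a real run would pass the network function instead.
fake_cids = {"acetone": 180, "methanol": 887}

# A frozen clock and a recording "sleep" make the pauses visible without waiting.
pauses = []
result = polite_map(fake_cids.get, ["acetone", "methanol", "unobtainium"],
                    min_interval=0.25, sleep=pauses.append, clock=lambda: 0.0)
print(result)
print(pauses)
```

In the notebook, `polite_map(get_cid, stock_room)` would return a name-to-CID dictionary while keeping the request rate under four per second with the default interval.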


## Making a Dictionary
### Load all LCSS JSON files into a dictionary

In [5]:
import os
import json

data_dir = os.path.join(os.getcwd(), "data")
lcss_dict = {}

# Load each JSON file in the data directory
for filename in os.listdir(data_dir):
    if filename.endswith(".json"):
        chemical_name = os.path.splitext(filename)[0].strip()
        filepath = os.path.join(data_dir, filename)
        with open(filepath, 'r') as f:
            lcss_dict[chemical_name] = json.load(f)


### Explore LCSS Structure Programmatically


In [6]:
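# Save every loaded record to one combined JSON file so the structure can be inspected in a single place
with open("combined_lcss.json", "w") as f:
    json.dump(lcss_dict, f, indent=2)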


### Step 1: Explore LCSS Recursively (TOC Headings)

In [7]:
def explore_sections(sections, level=0):
    for section in sections:
        heading = section.get("TOCHeading", "No Heading")
        print("  " * level + f"- {heading}")
        if "Section" in section:
            explore_sections(section["Section"], level + 1)

# Define and call on acetone
if "acetone" in lcss_dict:
    acetone_lcss = lcss_dict["acetone"]
    explore_sections(acetone_lcss["Record"]["Section"])
else:
    print("Acetone not found in dictionary.")

- 2D Structure
- 3D Conformer
- Crystal Structures
  - Crystal Structure Data
- Primary Hazards
- Hazard Classification
  - GHS Classification
  - NFPA Hazard Classification
- Substance Identification
  - CAS
  - IUPAC Name
  - Depositor-Supplied Synonyms
  - Molecular Formula
  - Molecular Weight
  - InChI
  - InChIKey
  - Physical Description
  - Chemical Classes
- Chemical and Physical Properties
  - Boiling Point
  - Melting Point
  - Flash Point
  - Solubility
  - Density
  - Vapor Density
  - Vapor Pressure
  - Critical Temperature & Pressure
  - Color/Form
  - Odor
  - Odor Threshold
  - Taste
- Flammability and Explosivity
  - Fire Hazards
  - Fire Potential
  - Flammable Limits
  - Explosive Limits and Potential
  - Lower Explosive Limit (LEL)
  - Upper Explosive Limit (UEL)
  - Autoignition Temperature
- Stability and Reactivity
  - Physical Dangers
  - Chemical Dangers
  - Reactivity Profile
  - Hazardous Reactivities & Incompatibilities
  - Air and Water Reactions
  - React
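
Printing the outline is useful for reading, but collecting the headings as data makes them searchable. The sketch below walks the same nested `Section` lists and yields each heading with its full path; `heading_paths` and the hand-made `sample_sections` are illustrative names, and the sample mirrors only two headings from the outline above.

```python
def heading_paths(sections, parents=()):
    """Yield every TOC heading as a tuple path from the top-level section down."""
    for section in sections:
        path = parents + (section.get("TOCHeading", "No Heading"),)
        yield path
        # Recurse into subsections, if any, carrying the path so far
        yield from heading_paths(section.get("Section", []), path)

# Tiny hand-made structure mirroring two headings from the outline above.
sample_sections = [
    {"TOCHeading": "Primary Hazards"},
    {"TOCHeading": "Hazard Classification",
     "Section": [{"TOCHeading": "GHS Classification"},
                 {"TOCHeading": "NFPA Hazard Classification"}]},
]

paths = list(heading_paths(sample_sections))
for p in paths:
    print(" > ".join(p))
```

On a real record the call would be `list(heading_paths(acetone_lcss["Record"]["Section"]))`.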

### Step 2: Display content of selected sections

In [8]:
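def extract_info(section):
    """Print the first text string of each Information entry in a section."""
    for info in section.get("Information", []):
        # Some values carry no StringWithMarkup (or an empty list); treat those as blank
        markup = info.get("Value", {}).get("StringWithMarkup") or [{}]
        value = markup[0].get("String", "")
        if value:
            print("  •", value)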


### Step 3: Search for a TOC Heading and print it

In [9]:
def find_section_by_heading(sections, heading_to_find):
    for section in sections:
        heading = section.get("TOCHeading", "")
        if heading.lower() == heading_to_find.lower():
            return section
        if "Section" in section:
            result = find_section_by_heading(section["Section"], heading_to_find)
            if result:
                return result
    return None


In [10]:
# Exact-match search: this returns None unless some heading is exactly "First Aid Measures"
section = find_section_by_heading(acetone_lcss["Record"]["Section"], "First Aid Measures")
if section:
    print(f"== {section['TOCHeading']} ==")
    extract_info(section)
else:
    print("Section not found.")


Section not found.
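
The search comes back empty because `find_section_by_heading()` requires an exact heading match, and the top-level headings of this record include "First Aid" rather than "First Aid Measures". A more forgiving alternative is a substring search that returns every matching section. This is a sketch only: `find_sections_containing` is a hypothetical helper, and the subsection name in `sample_sections` is made up for the demonstration.

```python
def find_sections_containing(sections, text):
    """Return every section whose TOCHeading contains text (case-insensitive)."""
    matches = []
    for section in sections:
        if text.lower() in section.get("TOCHeading", "").lower():
            matches.append(section)
        # Keep searching subsections even when the parent already matched
        matches.extend(find_sections_containing(section.get("Section", []), text))
    return matches

# Hand-made structure; the nested heading is illustrative, not taken from the record.
sample_sections = [
    {"TOCHeading": "Storage and Handling"},
    {"TOCHeading": "First Aid", "Section": [{"TOCHeading": "Inhalation First Aid"}]},
]

found = [s["TOCHeading"] for s in find_sections_containing(sample_sections, "first aid")]
print(found)
```

Returning a list instead of the first hit makes it obvious when a search term is ambiguous.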


In [11]:
def extract_info(section, level=0):
    indent = "  " * level
    heading = section.get("TOCHeading", "Unknown")
    print(f"{indent}== {heading} ==")

    # Print textual content if present
    if "Information" in section:
        for info in section["Information"]:
            value = info.get("Value", {}).get("StringWithMarkup", [{}])[0].get("String", "")
            if value:
                print(f"{indent}• {value}")

    # Recursively print content in subsections
    if "Section" in section:
        for subsection in section["Section"]:
            extract_info(subsection, level + 1)


In [12]:
def interactive_section_browser(lcss_record):
    top_sections = lcss_record["Record"]["Section"]
    headings = [s.get("TOCHeading", "Unknown") for s in top_sections]
    
    print("\nChoose a section:")
    for i, h in enumerate(headings):
        print(f"{i+1}. {h}")
    
    choice = int(input("\nEnter a number: ")) - 1
    if 0 <= choice < len(top_sections):
        section = top_sections[choice]
        print()
        extract_info(section)  # now recursive
    else:
        print("Invalid choice.")

interactive_section_browser(acetone_lcss)


Choose a section:
1. 2D Structure
2. 3D Conformer
3. Crystal Structures
4. Primary Hazards
5. Hazard Classification
6. Substance Identification
7. Chemical and Physical Properties
8. Flammability and Explosivity
9. Stability and Reactivity
10. Toxicity and Health
11. Exposure and Personal Protection
12. Storage and Handling
13. First Aid
14. Cleanup and Disposal



Enter a number:  2



== 3D Conformer ==
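
Only the heading is printed for this section because `extract_info()` reads just the first `StringWithMarkup` string, and this section apparently has none. The sketch below shows one way to pull more out of a value. It assumes, without confirmation from the record itself, that values may hold several `StringWithMarkup` entries or a `Number` list with an optional `Unit`; `value_to_text` is a hypothetical helper name.

```python
def value_to_text(value):
    """Return a list of display strings for one PUG View 'Value' dictionary."""
    # Collect every non-empty text string, not just the first one
    parts = [m.get("String", "") for m in value.get("StringWithMarkup", [])]
    parts = [p for p in parts if p]
    # Assumption: numeric values are stored as a "Number" list with an optional "Unit"
    if not parts and "Number" in value:
        numbers = ", ".join(str(n) for n in value["Number"])
        parts = [f"{numbers} {value.get('Unit', '')}".strip()]
    return parts

# Hand-made values for illustration only.
text_value = {"StringWithMarkup": [{"String": "Highly flammable"}, {"String": "Irritant"}]}
numeric_value = {"Number": [58.08], "Unit": "g/mol"}

print(value_to_text(text_value))
print(value_to_text(numeric_value))
print(value_to_text({}))
```

Sections that hold only structure images or external links would still come back empty, which is a reasonable signal to skip them in a text report.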


In [None]:
def interactive_section_browser_loop(lcss_record):
    top_sections = lcss_record["Record"]["Section"]
    headings = [s.get("TOCHeading", "Unknown") for s in top_sections]

    while True:
        print("\nChoose a section (or 0 to exit):")
        for i, h in enumerate(headings):
            print(f"{i+1}. {h}")
        
        try:
            choice = int(input("\nEnter a number: "))
            if choice == 0:
                break
            index = choice - 1
            if 0 <= index < len(top_sections):
                print()
                extract_info(top_sections[index])
            else:
                print("Invalid number.")
        except ValueError:
            print("Please enter a valid number.")

interactive_section_browser_loop(acetone_lcss)

# Maintain Data Provenance
## Step 1: Build Reference lookup dictionary

In [13]:
def build_reference_lookup(lcss_record):
    ref_entries = lcss_record["Record"].get("Reference", [])
    ref_dict = {}
    for ref in ref_entries:
        ref_num = ref.get("ReferenceNumber")
        if ref_num is not None:
            ref_dict[ref_num] = ref
    return ref_dict


## Step 2: Update extract_info() with citation

In [14]:
def extract_info(section, ref_lookup=None, level=0):
    indent = "  " * level
    heading = section.get("TOCHeading", "Unknown")
    print(f"{indent}== {heading} ==")

    if "Information" in section:
        for info in section["Information"]:
            value = info.get("Value", {}).get("StringWithMarkup", [{}])[0].get("String", "")
            ref_num = info.get("ReferenceNumber")
            citation = ""
            if ref_num and ref_lookup and ref_num in ref_lookup:
                ref = ref_lookup[ref_num]
                citation = f" [{ref.get('SourceName', 'Unknown Source')}]"
            if value:
                print(f"{indent}• {value}{citation}")

    # Recurse into subsections
    if "Section" in section:
        for subsection in section["Section"]:
            extract_info(subsection, ref_lookup, level + 1)
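
Printing works for browsing, but keeping provenance usually means storing the source next to each value. The following sketch returns rows of (heading path, text, source name) instead of printing them. `collect_rows` and the `demo_` variables are hypothetical names, and the miniature record uses placeholder text and a placeholder source.

```python
def collect_rows(section, ref_lookup, path=()):
    """Return (heading path, text, source name) tuples for a section and its subsections."""
    path = path + (section.get("TOCHeading", "Unknown"),)
    rows = []
    for info in section.get("Information", []):
        # Resolve the ReferenceNumber to a source name, falling back if it is missing
        ref = ref_lookup.get(info.get("ReferenceNumber"), {})
        source = ref.get("SourceName", "Unknown Source")
        for markup in info.get("Value", {}).get("StringWithMarkup", []):
            if markup.get("String"):
                rows.append((" > ".join(path), markup["String"], source))
    for sub in section.get("Section", []):
        rows.extend(collect_rows(sub, ref_lookup, path))
    return rows

# Hypothetical miniature record; the text and source name are placeholders.
demo_record = {"Record": {
    "Reference": [{"ReferenceNumber": 1, "SourceName": "Example Source"}],
    "Section": [{"TOCHeading": "Odor",
                 "Information": [{"ReferenceNumber": 1,
                                  "Value": {"StringWithMarkup": [{"String": "Placeholder text"}]}}]}],
}}
demo_refs = {r["ReferenceNumber"]: r for r in demo_record["Record"]["Reference"]}
demo_rows = collect_rows(demo_record["Record"]["Section"][0], demo_refs)
print(demo_rows)
```

Rows in this shape can be written out with the `csv` module or loaded into a table, so every value keeps its source attached.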


In [15]:
acetone_lcss = lcss_dict["acetone"]
ref_lookup = build_reference_lookup(acetone_lcss)
explore_sections(acetone_lcss["Record"]["Section"])  # still works
extract_info(acetone_lcss["Record"]["Section"][0], ref_lookup=ref_lookup)


- 2D Structure
- 3D Conformer
- Crystal Structures
  - Crystal Structure Data
- Primary Hazards
- Hazard Classification
  - GHS Classification
  - NFPA Hazard Classification
- Substance Identification
  - CAS
  - IUPAC Name
  - Depositor-Supplied Synonyms
  - Molecular Formula
  - Molecular Weight
  - InChI
  - InChIKey
  - Physical Description
  - Chemical Classes
- Chemical and Physical Properties
  - Boiling Point
  - Melting Point
  - Flash Point
  - Solubility
  - Density
  - Vapor Density
  - Vapor Pressure
  - Critical Temperature & Pressure
  - Color/Form
  - Odor
  - Odor Threshold
  - Taste
- Flammability and Explosivity
  - Fire Hazards
  - Fire Potential
  - Flammable Limits
  - Explosive Limits and Potential
  - Lower Explosive Limit (LEL)
  - Upper Explosive Limit (UEL)
  - Autoignition Temperature
- Stability and Reactivity
  - Physical Dangers
  - Chemical Dangers
  - Reactivity Profile
  - Hazardous Reactivities & Incompatibilities
  - Air and Water Reactions
  - React