# Python for browsing MeSH 'Persons' branch in PubMed

Author: [Dan Wendling](https://github.com/wendlingd/Browse-InfoSeeking-PubMed); Last modified 2025-07-05

**You have to register for a MyNCBI account before you can run this. Details below.**

This report counts PubMed studies that describe how various audiences seek and consume biomedical information; an HTML file allows users to browse the list and visit PubMed records without running Python.

Requires:
    
- Email address and API key from your MyNCBI account; https://www.ncbi.nlm.nih.gov/books/NBK3842/
- BioPython package and Entrez, https://biopython.org/docs/latest/Tutorial/chapter_entrez.html
  - PubMed Entrez esearch: https://biopython.org/docs/latest/Tutorial/chapter_entrez.html#esearch-searching-the-entrez-databases
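
The credentials cell later in this notebook opens a file named `cred.json` and looks up the keys `email` and `key`. A minimal sketch of creating and reading such a file follows. The values are placeholders, and the file is written to a temporary folder here (an assumption made for illustration) so the example cannot overwrite a real `cred.json`.

```python
import json
import tempfile
from pathlib import Path

# Placeholder values; substitute the email address and API key from your own MyNCBI account.
example_creds = {"email": "you@example.org", "key": "YOUR_NCBI_API_KEY"}

# Written to a temporary folder for illustration only.
cred_path = Path(tempfile.mkdtemp()) / "cred.json"
cred_path.write_text(json.dumps(example_creds, indent=2), encoding="utf-8")

# Read it back the same way the notebook does.
with open(cred_path, encoding="utf-8") as f:
    creds = json.load(f)

print(sorted(creds))  # the two keys the notebook looks up
```

If you keep the real file next to the notebook, consider excluding it from version control so the API key is not published.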

The search strategy limits results by publication date to the last 6 years (`"last 6 years"[PDat]`), so you will retrieve at least 5 years of records.
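
To try a different window, only the number inside the date clause needs to change. The helper below is a hypothetical sketch (`date_clause` is not part of this notebook) and assumes PubMed accepts the same `"last N years"[PDat]` pattern for other values of N.

```python
def date_clause(years: int = 6) -> str:
    """Return a PubMed relative-date clause such as ("last 6 years"[PDat])."""
    if years < 1:
        raise ValueError("years must be a positive integer")
    return f'("last {years} years"[PDat])'

print(date_clause())    # the notebook's default window
print(date_clause(10))  # a wider window, assuming PubMed accepts this value
```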

MeSH is revised periodically; you may want to update the CSV file used here (`personsBranch.csv`) to match the pages starting from https://www.ncbi.nlm.nih.gov/mesh/68009272 or https://meshb.nlm.nih.gov/treeView > Named Groups > Persons. Most yearly updates are in place by January.
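
The notebook reads two columns from this file, `MeSH` and `Indent`. A small sketch of that layout follows. The three rows and their indent levels are illustrative assumptions, not a copy of the real `personsBranch.csv`, and the standard-library `csv` module is used only to keep the example self-contained (the notebook itself uses pandas).

```python
import csv
import io

# Illustrative rows only; the real file is maintained by hand from the MeSH tree.
sample_csv = """Indent,MeSH
1,Age Groups
2,Adolescent
1,Caregivers
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
for row in rows:
    # Indent selects the CSS class (indent1, indent2, ...) used in the HTML report.
    print(f'{"  " * (int(row["Indent"]) - 1)}{row["MeSH"]}')
```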


In [None]:
# If needed, install or upgrade Biopython from inside the notebook
# !pip install biopython
# !pip install --upgrade biopython


---
## **Startup**


In [None]:
import os
import json
from Bio import Entrez
import pandas as pd
import time
import urllib.parse

from datetime import datetime, timedelta
today = datetime.today().strftime('%Y-%m-%d')

from pathlib import Path
# Set working directory and directories for read/write
home_folder = str(Path.home())  # equivalent to os.path.expanduser('~')
# os.path.join inserts the path separator that plain string concatenation would omit;
# the trailing '' keeps a final separator on the folder path
currFileDir = os.path.join(home_folder, 'Browse-Infoseeking', '')

print(f'Our file directory is {currFileDir}')


In [None]:
# Load the user groups file - selected levels and terms from the Persons branch (M01) of NLM's MeSH vocabulary
audienceList = pd.read_csv('Browse-Infoseeking/Data/MatchFiles/personsBranch.csv')

# Limit if testing
# audienceList = audienceList.sample(5)

print(f'audienceList shape is {audienceList.shape}\n')
audienceList.head(10)


---
## **MyNCBI authentication, testing**

NCBI's documentation covers the requirements and options:
- MyNCBI: you must include an email address in your queries: https://www.ncbi.nlm.nih.gov/myncbi/
- E-utilities Quick Start: https://www.ncbi.nlm.nih.gov/books/NBK25500/
- API keys: also include an API key from your MyNCBI account, which raises your limit from 3 to 10 requests per second: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/
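
The per-second request limits discussed in this section translate directly into a minimum pause between calls. The sketch below assumes the limits NCBI has published (3 requests per second without an API key, 10 with one); the function name is hypothetical. The notebook's `TIME_DELAY` of 0.5 seconds is longer than either minimum.

```python
def min_delay_seconds(has_api_key: bool) -> float:
    """Smallest pause between requests that stays within NCBI's published limits."""
    requests_per_second = 10 if has_api_key else 3
    return 1.0 / requests_per_second

for has_key in (False, True):
    print(f"API key: {has_key}, minimum delay: {min_delay_seconds(has_key):.2f} s")
```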

"In order not to overload the E-utility servers, NCBI recommends that users post no more than three URL requests per second and limit large jobs to either weekends or between 9:00 PM and 5:00 AM Eastern time during weekdays." - https://www.ncbi.nlm.nih.gov/books/NBK25497/


In [None]:
# Set a delay to avoid getting blocked
TIME_DELAY = 0.5

# Load credentials and set Entrez parameters. These will be automatically included in the API requests below.
with open('cred.json') as f:
    creds = json.load(f)

Entrez.email = creds['email']
Entrez.api_key = creds['key']

print(Entrez.email)


In [None]:
# Test your authentication by asking for the list of supported databases
with Entrez.einfo() as handle:
    record = Entrez.read(handle)

print(record["DbList"])


In [None]:
# Test PubMed access by asking for the record count of one search
with Entrez.esearch(db="pubmed", term="health literacy[Mesh]", retmax=0) as stream:
    record = Entrez.read(stream)

print(f'{int(record["Count"]):,}')


---
## **Configure the search query**

You can alter the search strategy to suit your needs; the default is shown below. If you edit a PubMed URL by hand, a URL encoder/decoder such as https://meyerweb.com/eric/tools/dencoder/ can help.

```
(NAMED AUDIENCE)[Mesh]
AND ("Information Seeking Behavior"[Mesh] OR "user stud*"[Title/Abstract] OR "case stud*"[Title/Abstract] OR usability[Title/Abstract])
AND ("last 6 years"[PDat])
```
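
For a concrete audience, the strategy can also be assembled and URL-encoded in Python. This sketch uses `Caregivers` as the audience and the standard-library `urllib.parse.quote_plus`; the variable names are illustrative.

```python
import urllib.parse

audience = "Caregivers"
query = (
    f"{audience}[Mesh] "
    'AND ("Information Seeking Behavior"[Mesh] OR "user stud*"[Title/Abstract] '
    'OR "case stud*"[Title/Abstract] OR usability[Title/Abstract]) '
    'AND ("last 6 years"[PDat])'
)

# quote_plus turns spaces into "+" and percent-encodes quotes, brackets, and slashes.
url = "https://pubmed.ncbi.nlm.nih.gov/?term=" + urllib.parse.quote_plus(query)
print(url)

# Decoding the encoded part recovers the original query string.
decoded = urllib.parse.unquote_plus(url.split("?term=", 1)[1])
print(decoded == query)
```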

We experimented with the following; however, the number of false positives may require extra time to sort through:

- "User-Centered Design"[Mesh]
- "Human-Centered Design"[Title/Abstract]
- "Consumer Health Information"[Mesh]
- "Data Collection"[Mesh] 
- "Qualitative Research"[Mesh] 
- "interview*"[Title/Abstract]
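
One way to experiment with these candidates is to build the topic clause from a list, so that a term can be added or dropped in one place. The function below is a hypothetical sketch, not the notebook's code; it only joins strings and does not call PubMed.

```python
DEFAULT_TOPIC_TERMS = [
    '"Information Seeking Behavior"[Mesh]',
    '"user stud*"[Title/Abstract]',
    '"case stud*"[Title/Abstract]',
    'usability[Title/Abstract]',
]

def topic_clause(extra_terms=()):
    """OR together the default topic terms plus any experimental ones."""
    terms = DEFAULT_TOPIC_TERMS + list(extra_terms)
    return "(" + " OR ".join(terms) + ")"

# Try one of the experimental terms alongside the defaults.
print(topic_clause(['"Qualitative Research"[Mesh]']))
```

Comparing the count with and without an extra term gives a rough sense of how many additional records, and potential false positives, it brings in.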


In [None]:
# --- constant parts of the search -------------------------------------------
CORE_FILTERS = (
    ' AND ("Information Seeking Behavior"[Mesh] '
    'OR "user stud*"[Title/Abstract] '
    'OR "case stud*"[Title/Abstract] '
    'OR usability[Title/Abstract]) '
    'AND ("last 6 years"[PDat])'
)
PUBMED_BASE  = "https://pubmed.ncbi.nlm.nih.gov/?term="


---
## **Collect from API**


In [None]:
def build_query(mesh_term: str) -> str:
    """Return a full PubMed query for one MeSH descriptor."""
    descriptor = f'"{mesh_term}"[Mesh]' if " " in mesh_term else f'{mesh_term}[Mesh]'
    return descriptor + CORE_FILTERS

def get_count(mesh_term: str) -> int:
    """Fetch record count for a single PubMed query."""
    query = build_query(mesh_term)
    try:
        with Entrez.esearch(db="pubmed",
                            term=query,
                            retmax=0,           # we only need the count
                            rettype="count") as h:
            res = Entrez.read(h)
        return int(res["Count"])
    except Exception as exc:
        print(f"{mesh_term}: {exc}")
        return 0

def make_link(query: str, count: int) -> str:
    """Convert query & count into an HTML hyperlink."""
    url = PUBMED_BASE + urllib.parse.quote_plus(query)
    return f'<a href="{url}" target="_blank">{count:,}</a>'

# ---------------------------------------------------------------------------
results = []

for _, row in audienceList.iterrows():
    term   = row["MeSH"]
    indent = row["Indent"]

    query  = build_query(term)
    count  = get_count(term)
    if count:
        results.append([indent, term, make_link(query, count)])
        print(f"{term}: {count:,}")

    time.sleep(TIME_DELAY)

databaseResult = pd.DataFrame(results, columns=["Indent", "MeSH", "Count"])

print("\n=== Done ===")


In [None]:
# View the resulting dataframe
pd.set_option('display.max_colwidth', None)

databaseResult.head()


---
## **Output to HTML**

In [None]:
# Add list item content
def make_html_row(row):
    return f'<li class="indent{row["Indent"]}">{row["MeSH"]} ({row["Count"]})</li>'

# Generate the <li> rows
databaseResult['HTML'] = databaseResult.apply(make_html_row, axis=1)

# HTML skeleton
html_builder = [
    "<html><head><title>Audience-specific information studies at pubmed.gov</title>",
    "<style>",
    "  body {margin-left:1em; font-family: Arial, Helvetica, sans-serif;}",
    "  ul {list-style-type: none; margin-left:0; padding-left:0;}",
    "  .indent1 {padding-left:0em;}",
    "  .indent2 {padding-left:2em;}",
    "  .indent3 {padding-left:4em;}",
    "  .indent4 {padding-left:6em;}",
    "  .indent5 {padding-left:8em;}",
    "</style></head><body>",
    "<h1>Understand your audiences: Health-medical information-seeking-behavior studies at pubmed.gov</h1>",
    f"<p><strong>{today}</strong> &ndash; Use this report to understand information needs, information seeking, and ",
    "information-use behaviors for various audiences for biomedical information. Example: The &quot;caregivers&quot; ",
    "link below runs the following search strategy at ",
    '<a href="https://pubmed.ncbi.nlm.nih.gov/">https://pubmed.ncbi.nlm.nih.gov/</a>: Caregivers[Mesh] AND ',
    "(&quot;Information Seeking Behavior&quot;[Mesh] OR &quot;user stud*&quot;[Title/Abstract] OR &quot;case ",
    "stud*&quot;[Title/Abstract] OR usability[Title/Abstract]) AND (&quot;last 6 years&quot;[PDat]). This report ",
    "was built from selected layers and terms in the ",
    '<a href="https://www.ncbi.nlm.nih.gov/mesh/68009272">Persons branch of the MeSH tree.</a> Terms with zero ',
    "results are not shown. Note: New database records are built incrementally, and some new records do not have ",
    "MeSH terms - the newest records may not be retrievable here. Counts will go out of date; Python code for running ",
    "the report yourself is at ",
    '<a href="https://github.com/wendlingd/Browse-InfoSeeking-PubMed">https://github.com/wendlingd/Browse-InfoSeeking-PubMed.</a></p>',
    "<ul>"
]

# Append all list items
html_builder.extend(databaseResult['HTML'].tolist())

# Close the HTML
html_builder.extend([
    "</ul>",
    "</body>",
    "</html>"
])

# Write the output file, creating the folder first if it does not exist
output_path = "Doc/DataReports/InfoSeekingStudies.html"
os.makedirs(os.path.dirname(output_path), exist_ok=True)
with open(output_path, "w", encoding="utf-8") as f:
    f.write('\n'.join(html_builder))

print(f'\n*** Done ***\n\nReport written to {output_path}. If you log into your MyNCBI account before viewing the report, the search terms will be highlighted.')

