PubMed Journal Article Fetcher for Google Colab
This notebook fetches articles from specified journals for a given month using the PubMed API (NCBI E-utilities, accessed through Biopython's Entrez module).

In [1]:
# Install required packages

!pip install biopython pandas requests

import os, pandas as pd, requests, time, warnings
from Bio import Entrez
from datetime import datetime, timedelta
import xml.etree.ElementTree as ET
from typing import List, Dict, Optional
# warnings.filterwarnings("ignore")

Collecting biopython


  Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (13 kB)


Downloading biopython-1.85-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.3 MB)

Installing collected packages: biopython


Successfully installed biopython-1.85


# Instructions for use

📋 INSTRUCTIONS:

1. Update the EMAIL variable with your email address (required by NCBI)
2. Modify the JOURNALS list with your desired journals
3. Set the YEAR and MONTH you want to search
4. Run the main() function, passing the year and month: main(YEAR, MONTH)

⚠️  IMPORTANT NOTES:

- Use your real email address - it’s required by NCBI’s usage policy
- Journal names should match PubMed’s format exactly
- Large queries may take several minutes to complete
- Respect NCBI’s rate limits (about 3 requests per second without an API key)

🚀 To start, run: main(YEAR, MONTH)
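
Before running anything against NCBI, it can help to preview the search term that the journal list and month will produce. The following is a minimal offline sketch, not the notebook's own code: `build_pubmed_query` is a hypothetical helper name, and it assumes the query combines each journal name with the `[Journal]` field tag and a `[PDAT]` (publication date) range covering one calendar month.

```python
import calendar
from typing import List


def build_pubmed_query(journals: List[str], year: int, month: int) -> str:
    """Hypothetical helper: build a PubMed search term for one calendar month."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year}/{month:02d}/01"
    end = f"{year}/{month:02d}/{last_day:02d}"
    journal_terms = " OR ".join(f'"{name}"[Journal]' for name in journals)
    return f"({journal_terms}) AND ({start}[PDAT] : {end}[PDAT])"


# Leap-year February, to show that the end date follows the calendar
print(build_pubmed_query(["Rhinology", "Laryngoscope"], 2024, 2))
```

Pasting the printed term into the PubMed search box is a quick way to check that each journal name is recognized before a long fetch.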

In [2]:
# Configuration - MODIFY THESE VALUES
EMAIL = os.getenv("UNPAYWALL_EMAIL", "")

JOURNALS = [
    "International Forum of Allergy & Rhinology",
    "Rhinology",
    "JAMA Otolaryngology–Head & Neck Surgery",
    "Otolaryngology–Head and Neck Surgery",
    "European Annals of Otorhinolaryngology–Head and Neck Diseases",
    "Journal of Voice",
    "American Journal of Rhinology & Allergy",
    "JARO – Journal of the Association for Research in Otolaryngology",
    "Journal of Otolaryngology–Head & Neck Surgery",
    "Laryngoscope",
    "Auris Nasus Larynx",  # the trailing comma matters: without it Python joins this string with the next one
    "new england journal of medicine",
    "JAMA"
]

In [3]:
from datetime import datetime

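now = datetime.now()
# Search the previous calendar month; in January that is December of the prior year
YEAR = now.year if now.month > 1 else now.year - 1
MONTH = now.month - 1 if now.month > 1 else 12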

In [4]:
#!/usr/bin/env python3
class PubMedFetcher:

    def __init__(self, email: str):
        """Initialize with email for API requests"""
        self.email = email
        Entrez.email = email
        # Be respectful to NCBI servers
        self.request_delay = 0.34  # ~3 requests per second max

    def search_articles(self, journals: List[str], year: int, month: int) -> List[str]:
        """
        Search for articles in specified journals for given month/year

        Args:
            journals: List of journal names
            year: Year to search
            month: Month to search (1-12)

        Returns:
            List of PubMed IDs
        """
        # Format date range for the month
        start_date = f"{year}/{month:02d}/01"

        # Calculate last day of month
        if month == 12:
            next_month = 1
            next_year = year + 1
        else:
            next_month = month + 1
            next_year = year

        last_day = (datetime(next_year, next_month, 1) - timedelta(days=1)).day
        end_date = f"{year}/{month:02d}/{last_day}"

        # Build search query
        journal_query = " OR ".join([f'"{journal}"[Journal]' for journal in journals])
        date_query = f"({start_date}[PDAT] : {end_date}[PDAT])"
        search_query = f"({journal_query}) AND {date_query}"

        print(f"Search query: {search_query}")
        print(f"Searching for articles from {start_date} to {end_date}")

        try:
            # Search PubMed
            handle = Entrez.esearch(
                db="pubmed",
                term=search_query,
                retmax=10000,  # Adjust based on expected results
                sort="pub_date"  # Biopython URL-encodes parameters, so a literal "+" would not be read as a space
            )
            search_results = Entrez.read(handle)
            handle.close()

            id_list = search_results["IdList"]
            print(f"Found {len(id_list)} articles")
            return id_list

        except Exception as e:
            print(f"Error searching PubMed: {e}")
            return []

    def fetch_article_details(self, pmid_list: List[str]) -> List[Dict]:
        """
        Fetch detailed information for articles

        Args:
            pmid_list: List of PubMed IDs

        Returns:
            List of article dictionaries
        """
        articles = []
        batch_size = 200  # Process in batches to avoid overwhelming API

        for i in range(0, len(pmid_list), batch_size):
            batch = pmid_list[i:i + batch_size]
            print(f"Processing batch {i//batch_size + 1}/{(len(pmid_list)-1)//batch_size + 1}")

            try:
                # Fetch article details
                handle = Entrez.efetch(
                    db="pubmed",
                    id=",".join(batch),
                    rettype="xml",
                    retmode="xml"
                )
                records = handle.read()
                handle.close()

                # Parse XML
                root = ET.fromstring(records)

                for article_elem in root.findall(".//PubmedArticle"):
                    article_info = self._parse_article_xml(article_elem)
                    if article_info:
                        articles.append(article_info)

                # Be respectful to servers
                time.sleep(self.request_delay)

            except Exception as e:
                print(f"Error fetching batch: {e}")
                continue

        return articles

    def _parse_article_xml(self, article_elem) -> Optional[Dict]:
        """Parse article XML element to extract information"""
        try:
            # Extract basic article info
            medline_citation = article_elem.find(".//MedlineCitation")
            article = medline_citation.find(".//Article")

            # PMID
            pmid = medline_citation.find(".//PMID").text

            # Title
            title_elem = article.find(".//ArticleTitle")
            # itertext() keeps text nested inside inline markup such as <i> or <sub>
            title = "".join(title_elem.itertext()).strip() if title_elem is not None else "N/A"

            # Authors
            authors = []
            author_list = article.find(".//AuthorList")
            if author_list is not None:
                for author in author_list.findall(".//Author"):
                    last_name = author.find(".//LastName")
                    first_name = author.find(".//ForeName")
                    if last_name is not None:
                        author_name = last_name.text
                        if first_name is not None:
                            author_name += f", {first_name.text}"
                        authors.append(author_name)

            authors_str = "; ".join(authors[:5])  # Limit to first 5 authors
            if len(authors) > 5:
                authors_str += " et al."

            # Journal
            journal_elem = article.find(".//Journal/Title")
            journal = journal_elem.text if journal_elem is not None else "N/A"

            # Publication date
            pub_date = article.find(".//Journal/JournalIssue/PubDate")
            year = month = day = ""

            if pub_date is not None:
                year_elem = pub_date.find(".//Year")
                month_elem = pub_date.find(".//Month")
                day_elem = pub_date.find(".//Day")

                year = year_elem.text if year_elem is not None else ""
                month = month_elem.text if month_elem is not None else ""
                day = day_elem.text if day_elem is not None else ""

            # Join only the parts that are present, so a missing month does not leave "--"
            pub_date_str = "-".join(part for part in (year, month, day) if part)

            # Volume and Issue
            volume_elem = article.find(".//Journal/JournalIssue/Volume")
            issue_elem = article.find(".//Journal/JournalIssue/Issue")

            volume = volume_elem.text if volume_elem is not None else ""
            issue = issue_elem.text if issue_elem is not None else ""

            # Pages
            pagination = article.find(".//Pagination/MedlinePgn")
            pages = pagination.text if pagination is not None else ""

            # DOI
            article_ids = article_elem.find(".//PubmedData/ArticleIdList")
            doi = ""
            if article_ids is not None:
                for article_id in article_ids.findall(".//ArticleId"):
                    if article_id.get("IdType") == "doi":
                        doi = article_id.text
                        break

            # Abstract: join every AbstractText section, prefixing the label of structured sections
            abstract_parts = []
            for abs_part in article.findall(".//Abstract/AbstractText"):
                label = abs_part.get("Label", "")
                # itertext() keeps text nested inside inline markup such as <i> or <sup>
                text = "".join(abs_part.itertext()).strip()
                abstract_parts.append(f"{label}: {text}" if label else text)
            abstract = " ".join(abstract_parts)

            return {
                "PMID": pmid,
                "Title": title,
                "Authors": authors_str,
                "Journal": journal,
                "Publication_Date": pub_date_str,
                "Volume": volume,
                "Issue": issue,
                "Pages": pages,
                "DOI": doi,
                "Abstract": abstract[:500] + "..." if len(abstract) > 500 else abstract  # Truncate long abstracts
            }

        except Exception as e:
            print(f"Error parsing article: {e}")
            return None

def main(year: int, month: int):
    """Search PubMed for the given year and month, save the results to CSV, and return them as a DataFrame (None on failure)."""

    print("=== PubMed Journal Article Fetcher ===\n")

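    # Configuration - MODIFY THESE VALUES
    # Note: main() uses these local values, not the EMAIL and JOURNALS defined in the configuration cell above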
    EMAIL = "shvecht@gmail.com"  # Replace with your email

    JOURNALS = [
        "International Forum of Allergy & Rhinology",
        "Rhinology",
        "JAMA Otolaryngology–Head & Neck Surgery",
        "Otolaryngology–Head and Neck Surgery",
        "European Annals of Otorhinolaryngology–Head and Neck Diseases",
        "Journal of Voice",
        "American Journal of Rhinology & Allergy",
        "JARO – Journal of the Association for Research in Otolaryngology",
        "Journal of Otolaryngology–Head & Neck Surgery",
        "Laryngoscope",
        "Auris Nasus Larynx",
        "new england journal of medicine",
        "JAMA"
    ]  # Replace with your journal list

    # Using the passed in year and month instead of global variables
    search_year = year
    search_month = month

    print(f"Email: {EMAIL}")
    print(f"Journals: {', '.join(JOURNALS)}")
    print(f"Search period: {search_year}-{search_month:02d}")
    print("-" * 50)

    # Validate email
    if not EMAIL or EMAIL == "your.email@example.com":
        print("⚠️  Please update the EMAIL variable with your actual email address!")
        print("This is required by NCBI's API usage policy.")
        return

    # Initialize fetcher
    fetcher = PubMedFetcher(EMAIL)

    # Search for articles using the passed in year and month
    print("🔍 Searching for articles...")
    pmid_list = fetcher.search_articles(JOURNALS, search_year, search_month)

    if not pmid_list:
        print("❌ No articles found matching the criteria.")
        return

    # Fetch article details
    print(f"\n📖 Fetching details for {len(pmid_list)} articles...")
    articles = fetcher.fetch_article_details(pmid_list)

    if not articles:
        print("❌ Failed to fetch article details.")
        return

    # Create DataFrame
    df = pd.DataFrame(articles)

    # Display results
    print(f"\n✅ Successfully retrieved {len(articles)} articles!")
    print(f"\nColumns: {', '.join(df.columns.tolist())}")

    # Show first few rows
    print(f"\nFirst 5 articles:")
    print(df.head().to_string(max_colwidth=50))

    # Save to CSV
    output_dir = os.path.join(str(search_year), f"{search_month:02d}")
    os.makedirs(output_dir, exist_ok=True)
    filename = os.path.join(output_dir, f"pubmed_articles_{search_year}_{search_month:02d}.csv")
    df.to_csv(filename, index=False)
    print(f"\n💾 Results saved to: {filename}")

    # Display summary statistics
    print(f"\n📊 Summary:")
    print(f"Total articles: {len(articles)}")
    print(f"Unique journals: {df['Journal'].nunique()}")
    print(f"Articles per journal:")
    journal_counts = df['Journal'].value_counts()
    for journal, count in journal_counts.head(10).items():
        print(f"  • {journal}: {count}")

    return df
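
PubMed XML can carry inline markup (for example `<i>` or `<sub>`) inside titles and abstracts, and a `PubDate` element may hold a single `MedlineDate` string instead of separate `Year`, `Month`, and `Day` children. The following is a minimal sketch of handling both cases with the standard library. The record is made up for illustration, and `element_text` and `publication_date` are hypothetical helper names rather than part of the class above.

```python
import xml.etree.ElementTree as ET

# Made-up record for illustration only; it is not a real PubMed entry.
SAMPLE_XML = """
<PubmedArticle>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <Article>
      <Journal>
        <JournalIssue>
          <PubDate><MedlineDate>2025 Sep-Oct</MedlineDate></PubDate>
        </JournalIssue>
      </Journal>
      <ArticleTitle>Effect of <i>in vitro</i> exposure on cilia</ArticleTitle>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def element_text(elem) -> str:
    """Return all text inside elem, including text nested in child tags."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def publication_date(pub_date) -> str:
    """Join Year/Month/Day when present, otherwise fall back to MedlineDate."""
    if pub_date is None:
        return ""
    parts = [element_text(pub_date.find(tag)) for tag in ("Year", "Month", "Day")]
    joined = "-".join(part for part in parts if part)
    return joined or element_text(pub_date.find("MedlineDate"))


root = ET.fromstring(SAMPLE_XML)
title = element_text(root.find(".//ArticleTitle"))
date = publication_date(root.find(".//PubDate"))
print(title)
print(date)
```

Reading only `.text` on the same title element would return just the text before the first child tag, which is why the sketch joins `itertext()` instead.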

In [5]:
output_dir = os.path.join(str(YEAR), f"{MONTH:02d}")
os.makedirs(output_dir, exist_ok=True)
df = main(YEAR, MONTH)
# main() returns None when the search or fetch fails, so export only when results exist
if df is not None:
    df.to_csv(os.path.join(output_dir, "ent_raw_results.csv"), index=False)
    df.to_json(os.path.join(output_dir, "ent_all_results.json"), orient="records", force_ascii=False, indent=2)


=== PubMed Journal Article Fetcher ===

Email: shvecht@gmail.com
Journals: International Forum of Allergy & Rhinology, Rhinology, JAMA Otolaryngology–Head & Neck Surgery, Otolaryngology–Head and Neck Surgery, European Annals of Otorhinolaryngology–Head and Neck Diseases, Journal of Voice, American Journal of Rhinology & Allergy, JARO – Journal of the Association for Research in Otolaryngology, Journal of Otolaryngology–Head & Neck Surgery, Laryngoscope, Auris Nasus Larynx, new england journal of medicine, JAMA
Search period: 2025-09
--------------------------------------------------
🔍 Searching for articles...
Search query: ("International Forum of Allergy & Rhinology"[Journal] OR "Rhinology"[Journal] OR "JAMA Otolaryngology–Head & Neck Surgery"[Journal] OR "Otolaryngology–Head and Neck Surgery"[Journal] OR "European Annals of Otorhinolaryngology–Head and Neck Diseases"[Journal] OR "Journal of Voice"[Journal] OR "American Journal of Rhinology & Allergy"[Journal] OR "JARO – Journal of t

Found 519 articles

📖 Fetching details for 519 articles...
Processing batch 1/3


Processing batch 2/3


Processing batch 3/3



✅ Successfully retrieved 519 articles!

Columns: PMID, Title, Authors, Journal, Publication_Date, Volume, Issue, Pages, DOI, Abstract

First 5 articles:
       PMID                                              Title                                            Authors                                    Journal Publication_Date Volume Issue              Pages                        DOI                                           Abstract
0  41026796  Evaluation of Disparities in Management of Chr...  Zhu, Christina; Zheng, Wynne; Clementi, Emily;...    American journal of rhinology & allergy      2025-Sep-30               19458924251383016  10.1177/19458924251383016  ObjectiveTo evaluate disparities in the manage...
1  41026592  Artificial Intelligence Model for Imaging-Base...  Dayan, Gabriel S; Hénique, Gautier; Bahig, Hou...  JAMA otolaryngology-- head & neck surgery      2025-Sep-30                                  10.1001/jamaoto.2025.3225  IMPORTANCE: Although not included in the eig
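
The JSON export written with `orient="records"` is a plain list of objects, one per article, so it can be read back without pandas. The following is a rough sketch under that assumption. The records are made up and use only three of the exported columns, and a temporary directory stands in for the notebook's year/month output folder.

```python
import json
import os
import tempfile
from collections import Counter

# Made-up records shaped like the notebook's JSON export (orient="records").
records = [
    {"PMID": "1", "Journal": "Rhinology", "DOI": "10.0000/example.1"},
    {"PMID": "2", "Journal": "Rhinology", "DOI": ""},
    {"PMID": "3", "Journal": "Laryngoscope", "DOI": "10.0000/example.3"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "ent_all_results.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)

    # Reload the export as a list of dicts, one per article
    with open(path, encoding="utf-8") as fh:
        loaded = json.load(fh)

# Count articles per journal and list the PMIDs that have no DOI
per_journal = Counter(row["Journal"] for row in loaded)
missing_doi = [row["PMID"] for row in loaded if not row["DOI"]]
print(per_journal.most_common())
print(missing_doi)
```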