# Download PubMed Abstracts

> **Note**: This notebook is compatible with both Google Colab and local Jupyter environments. Colab-specific sections are clearly marked.

In this notebook, we collect a large set of PubMed abstracts through the NCBI Entrez API to build a domain-specific corpus for training word embeddings. We use a MeSH-based search to retrieve relevant articles, drop records that have no abstract, and save the cleaned data to CSV for downstream NLP tasks. This corpus will serve as the foundation for comparing embedding quality across tokenization methods in a biomedical context.

In [13]:
# install biopython
!pip install biopython



## Import Libraries

In [14]:
import sys
import os
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')

    project_path = '/content/drive/MyDrive/NLP_Projects/Week_3/word-embeddings-playground'
    if os.path.exists(project_path):
        os.chdir(project_path)
        print(f"Changed working directory to: {project_path}")
    else:
        raise FileNotFoundError(f"Project path not found: {project_path}")
else:
    print("Not running in Colab — skipping Drive mount.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Changed working directory to: /content/drive/MyDrive/NLP_Projects/Week_3/word-embeddings-playground


In [15]:
# import libraries
from Bio import Entrez, Medline
import pandas as pd
import numpy as np
import time

## Collecting PubMed Abstracts Using Entrez and Medline

In this section, we collect biomedical abstracts from PubMed using the NCBI Entrez API.

1. **Set up Entrez access**: We begin by specifying a contact email (required by NCBI) to identify ourselves in API requests.
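2. **Search for articles**: We define a query using the MeSH term `"medicine"` and request up to 100,000 matching PubMed IDs. In the run recorded below, the search returned 9,999 IDs, which is consistent with the cap of roughly 10,000 records that NCBI's ESearch places on a single PubMed query.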
3. **Fetch abstracts**: We then download the abstracts in batches of 500 using the `efetch` endpoint. For each article, we extract:
   - PMID (PubMed ID)
   - Title
   - Abstract text

We add a 1-second delay between batches to respect NCBI’s rate limits.
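
The batching in step 3 can be sketched without any network access. The helper below, `chunk_ids` (a hypothetical name, not part of Biopython or this notebook), splits a list of PMIDs into consecutive batches and joins each batch into the comma-separated string that is passed as the `id` argument of a fetch call. The sample PMIDs are made up for illustration.

```python
def chunk_ids(ids, batch_size=500):
    """Yield comma-joined batches of at most `batch_size` IDs (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    for start in range(0, len(ids), batch_size):
        yield ','.join(ids[start:start + batch_size])


# Made-up PMIDs, for illustration only
sample_ids = ['101', '102', '103', '104', '105']
batches = list(chunk_ids(sample_ids, batch_size=2))
print(batches)  # ['101,102', '103,104', '105']
```

With five IDs and a batch size of 2, the last batch holds a single ID, which mirrors how the final batch of a real run can be smaller than the others.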

In [16]:
Entrez.email = 'rymcnamara4@gmail.com'

search_query = 'medicine[MeSH Terms]'

In [17]:
def search_abstracts(query, max_results = 100_000):
  """
  Search PubMed for article IDs using a specified query.

  Parameters:
      query (str): The search term or MeSH query to run against PubMed.
      max_results (int, optional): Maximum number of PubMed IDs to retrieve. Default is 100,000.

  Returns:
      list: A list of PubMed IDs (PMIDs) matching the query.
  """
  handle = Entrez.esearch(db = 'pubmed', term = query, retmax = max_results, retmode = 'xml')
  record = Entrez.read(handle)
  handle.close()  # release the connection once the response has been parsed
  return record['IdList']

In [18]:
ids = search_abstracts(search_query)
print(f'Found {len(ids)} abstracts.')

Found 9999 abstracts.
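
The count above is well below the requested 100,000. One way to check whether a search was truncated is to compare the `Count` field of the parsed ESearch result (the total number of matches) with the number of IDs actually returned in `IdList`. A minimal sketch, assuming the parsed record behaves like a dictionary with those two keys; `summarize_search` is a hypothetical helper, and the sample record is made up rather than taken from a real query:

```python
def summarize_search(record):
    """Return (total_matches, ids_returned, truncated) for a parsed ESearch record."""
    total = int(record['Count'])      # total matches reported by the server
    returned = len(record['IdList'])  # IDs actually included in this response
    return total, returned, returned < total


# Made-up record for illustration; a real one would come from Entrez.read(handle)
fake_record = {'Count': '25000', 'IdList': ['101', '102', '103']}
total, returned, truncated = summarize_search(fake_record)
print(f'{returned} of {total} matching IDs returned; truncated: {truncated}')
```

If `truncated` is true, the corpus covers only part of the matching literature, which is worth recording alongside the saved data.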


In [19]:
def fetch_abstracts(ids, batch_size = 500):
  """
  Fetch PubMed article metadata (title, abstract, PMID) for a list of IDs.

  Parameters:
      ids (list): A list of PubMed IDs (PMIDs) to retrieve data for.
      batch_size (int, optional): Number of articles to fetch per API call. Default is 500.

  Returns:
      list: A list of dictionaries, each containing 'PMID', 'Title', and 'Abstract' for one article.
  """
  abstracts = []
  for i in range(0, len(ids), batch_size):
    batch_ids = ids[i:i + batch_size]
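    fetch_handle = Entrez.efetch(db = 'pubmed', id = ','.join(batch_ids), rettype = 'medline', retmode = 'text')
    records = Medline.parse(fetch_handle)
    for rec in records:
      title = rec.get('TI', 'No Title')
      abstract = rec.get('AB', 'No Abstract')
      pmid = rec.get('PMID', 'Unknown')  # renamed from `id` to avoid shadowing the built-in
      # Store the parsed title itself, not the literal string 'title'
      abstracts.append({'PMID': pmid, 'Title': title, 'Abstract': abstract})
    fetch_handle.close()  # Medline.parse reads lazily, so close only after the loop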

    print(f'Fetched {i + len(batch_ids)} abstracts so far...')
    time.sleep(1)

  return abstracts

In [20]:
abstracts = fetch_abstracts(ids)

Fetched 500 abstracts so far...
Fetched 1000 abstracts so far...
Fetched 1500 abstracts so far...
Fetched 2000 abstracts so far...
Fetched 2500 abstracts so far...
Fetched 3000 abstracts so far...
Fetched 3500 abstracts so far...
Fetched 4000 abstracts so far...
Fetched 4500 abstracts so far...
Fetched 5000 abstracts so far...
Fetched 5500 abstracts so far...
Fetched 6000 abstracts so far...
Fetched 6500 abstracts so far...
Fetched 7000 abstracts so far...
Fetched 7500 abstracts so far...
Fetched 8000 abstracts so far...
Fetched 8500 abstracts so far...
Fetched 9000 abstracts so far...
Fetched 9500 abstracts so far...
Fetched 9999 abstracts so far...


## Convert and Save Abstracts to CSV

After collecting the PubMed abstracts, we convert the data into a pandas DataFrame for easier processing and analysis.

Steps:
1. Convert the list of abstracts to a `DataFrame`.
2. Filter out entries that contain no abstract text.
3. Print the number of remaining abstracts.
4. Save the cleaned DataFrame to a CSV file in the `data/` directory.
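
The filtering and saving steps above can be tried on a toy example first. The sketch below uses three made-up records and writes the CSV to an in-memory buffer instead of the `data/` directory, so nothing on disk is touched. It assumes, as in this notebook, that missing abstracts are marked with the placeholder string `'No Abstract'`.

```python
import io

import pandas as pd

# Made-up records for illustration only
toy_records = [
    {'PMID': '101', 'Title': 'First title', 'Abstract': 'Some abstract text.'},
    {'PMID': '102', 'Title': 'Second title', 'Abstract': 'No Abstract'},
    {'PMID': '103', 'Title': 'Third title', 'Abstract': 'More abstract text.'},
]

toy_df = pd.DataFrame(toy_records)

# Keep only the rows whose abstract is not the placeholder
cleaned_df = toy_df[toy_df['Abstract'] != 'No Abstract']
print(f'Kept {len(cleaned_df)} of {len(toy_df)} records.')

# Write to an in-memory buffer rather than a file
buffer = io.StringIO()
cleaned_df.to_csv(buffer, header=True, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Comparing the row counts before and after the filter gives a quick check that only placeholder rows were dropped.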

In [21]:
abstracts_df = pd.DataFrame(abstracts)

In [22]:
abstracts_df = abstracts_df[abstracts_df['Abstract'] != 'No Abstract']

In [23]:
print(f'There are {len(abstracts_df)} abstracts.')

There are 9982 abstracts.


In [24]:
os.makedirs('./data', exist_ok = True)  # create the output directory if it is missing
abstracts_df.to_csv('./data/pubmed_abstracts.csv', header = True, index = False)