# Multimodal RAG for Biomedical Research using BigQuery

## Introduction

This notebook demonstrates how to build a serverless, multimodal Retrieval-Augmented Generation (RAG) pipeline for biomedical literature using BigQuery. It shows how to overcome the limitations of text-only search by building a pipeline that can "read" and reason about the visual information in scientific articles, such as charts, tables, and diagrams. By the end of this notebook, you will have a scalable multimodal search engine built on BigQuery's native AI capabilities. Apart from the initial upload of public PDF articles into Cloud Storage (GCS), the workflow is orchestrated with BigQuery DataFrames (BigFrames), which provides a pandas-like API for scalable data processing.

![Architecture Diagram](https://github.com/rarsan/MedQuery/blob/d9c0fa8b699b842c3e0d19f046dd17077c3ac83b/assets/arch_diagram.png?raw=true)

The core technical steps are as follows:

1.  **Data Ingestion and Preparation**: We begin by cataloging scientific articles from the PMC Open Access dataset. A `bpd.remote_function` is defined and deployed to process the source PDFs from Cloud Storage. This function splits each multi-page document into single-page PNG files, creating a new, image-based dataset ready for multimodal analysis.

2.  **Multimodal Embedding Generation**: Each page image is represented as a blob object in a BigFrames DataFrame. We then use the `bigframes.ml.llm.MultimodalEmbeddingGenerator` to call the `multimodalembedding` model, generating a 1408-dimension vector embedding for each page image. This captures the visual and textual information of the page without relying on traditional OCR.

3.  **Vector Indexing for Semantic Search**: The generated embeddings are stored in a BigQuery table. To enable efficient similarity search, a `VECTOR_INDEX` is created on the embedding column using the `bigframes.bigquery.create_vector_index` function, configured with a `COSINE` distance type and an `IVF` index type.

4.  **Enrichment with Generative AI**: To add structured metadata, we leverage BigFrames' `.ai` functions. We use `.ai.classify` to categorize articles by study type and patient population, and `.ai.map` (which uses `AI.GENERATE_TABLE` under the hood) to extract PICO components from abstracts.

5.  **Multimodal RAG Implementation**: The final RAG pipeline is implemented by:
    *   Generating an embedding for a natural language query.
    *   Using `bigframes.bigquery.vector_search` to perform a similarity search against the indexed page embeddings to retrieve the most relevant page images.
    *   Passing the original query and the retrieved page images as context to a `bigframes.ml.llm.GeminiTextGenerator` to synthesize a final, evidence-based answer.
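
To make the retrieval step concrete, here is a minimal pure-Python sketch of what a cosine top-k search computes. It is an illustration only, not how BigQuery's vector search or an `IVF` index is implemented; the page identifiers are hypothetical and toy 3-dimensional vectors stand in for the 1408-dimension embeddings.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], pages: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Rank page embeddings by cosine distance to the query and keep the k nearest."""
    scored = [(page_id, cosine_distance(query, vec)) for page_id, vec in pages.items()]
    return sorted(scored, key=lambda item: item[1])[:k]


# Hypothetical page identifiers with toy 3-d vectors.
page_embeddings = {
    "PMC_A/page_1": [0.9, 0.1, 0.0],
    "PMC_A/page_2": [0.0, 1.0, 0.0],
    "PMC_B/page_1": [0.7, 0.7, 0.1],
}
query_embedding = [1.0, 0.0, 0.0]
nearest = top_k(query_embedding, page_embeddings, k=2)
```

An exhaustive scan like this is linear in the number of pages; an approximate index trades a little recall for much lower latency on large tables.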

## Setup and Configuration

In [None]:
import bigframes.pandas as bpd
import os
from google.cloud import storage
from google.cloud import bigquery
from bigframes.ml.llm import GeminiTextGenerator, MultimodalEmbeddingGenerator

PROJECT_ID = "YOUR_PROJECT_ID"  # @param {type:"string"}
REGION = "us-central1"  # @param {type:"string"}

# Names of bucket and dataset (to be created)
BUCKET_NAME = "YOUR_BUCKET_NAME"  # @param {type:"string"}
DATASET_ID = "YOUR_DATASET_ID"  # @param {type:"string"}

bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = REGION

### 2. Authenticate to Google Cloud

In [10]:
import sys
if "google.colab" in sys.modules:
    from google.colab import auth
    auth.authenticate_user()
else:
    # Application Default Credentials
    pass

### 3. Install necessary libraries

In [None]:
!pip install --upgrade bigframes tqdm

## Cloud Resource Setup

### 1. Create BigQuery dataset and GCS bucket

In [None]:

storage_client = storage.Client()
bq_client = bigquery.Client(project=PROJECT_ID, location=REGION)

# Create dataset
dataset = bigquery.Dataset(f"{PROJECT_ID}.{DATASET_ID}")
dataset.location = REGION
dataset = bq_client.create_dataset(dataset, exists_ok=True)
print(f"BQ Dataset '{DATASET_ID}' created successfully.")

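# Create bucket if it does not already exist
from google.api_core.exceptions import NotFound

try:
    bucket = storage_client.get_bucket(BUCKET_NAME)
    print(f"GCS Bucket '{BUCKET_NAME}' already exists.")
except NotFound:
    print(f"GCS Bucket '{BUCKET_NAME}' not found. Creating it...")
    # Create the bucket in the same region as the BigQuery dataset.
    bucket = storage_client.create_bucket(BUCKET_NAME, location=REGION)
    print(f"GCS Bucket '{BUCKET_NAME}' created successfully.")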

### 2. Create BigQuery cloud connection
Follow the steps [here](https://cloud.google.com/bigquery/docs/multimodal-data-dataframes-tutorial#create_a_connection) to create a BigQuery Cloud Resource connection and [grant it the necessary permissions](https://cloud.google.com/bigquery/docs/multimodal-data-dataframes-tutorial#grant-permissions) to use Cloud Storage and Vertex AI.

## Data Onboarding (Structured and Unstructured)

### 1. Download List of Articles

In this notebook, we will download, process, and analyze articles on nutritional health.
The first step is to download these public PDF articles into your GCS bucket. To do so, use this compiled list of nutritional health articles published in 2025 (as of Sept 16, 2025):

In [3]:
!gsutil cp gs://biomedical-search/nutrition-health/nutrition-health.2025.csv .

Copying gs://biomedical-search/nutrition-health/nutrition-health.2025.csv...
- [1 files][  2.5 MiB/  2.5 MiB]                                                
Operation completed over 1 objects/2.5 MiB.                                      



Note: PubMed Central provides a [search portal](https://pmc.ncbi.nlm.nih.gov/search) for retrieving a list of article IDs that match a user query. To save you this manual step, we ran the following query and saved the search results (as of Sep 16, 2025) in `nutrition-health.2025.csv`, which you just downloaded:

```
"has pdf"[filter] AND "Diet, Food, and Nutrition"[mh] AND ("Public Health"[mh] OR "Health Status"[mh] OR "Disease"[mh] OR "Exercise"[mh]) NOT "Animal Nutritional Physiological Phenomena"[mh]
```
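
The upload script later in this notebook normalizes the CSV column names (lowercasing them and replacing `/`, spaces, and `-` with underscores) before using columns such as `journal_book` and `publication_year`. The sketch below applies the same rule in plain Python. The raw header names are an assumption about the portal's export format, chosen to match the normalized names the script expects; the actual file may differ.

```python
import re


def normalize_header(name: str) -> str:
    """Lowercase a header and replace '/', ' ' and '-' with underscores."""
    return re.sub(r"[/ -]", "_", name.lower())


# Assumed export headers; the real CSV may use different names.
raw_headers = ["PMID", "PMCID", "DOI", "Journal/Book", "Publication Year", "Create Date"]
normalized = [normalize_header(header) for header in raw_headers]
```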

### 2. Upload Article PDFs into GCS bucket

The following Python script is a one-time retrieval of the articles for subsequent processing. It reads the `nutrition-health.2025.csv` file, which lists the scientific publications relevant to our research.

For each article, identified by its PubMed Central ID (PMCID), the script performs a series of automated steps:

- Fetches Full Article Packages: It queries the PMC Open Access (OA) web service to locate and download the complete article package (.tar.gz). This package is crucial because it contains both the unstructured data (PDF) and the semi-structured data (NXML) file with the article's metadata and full text.

- Extracts and Uploads Key Files: The script opens the downloaded archive, extracts the primary PDF and NXML files, and uploads them into your GCS bucket in a nested hierarchy (gs://<bucket-name>/articles/<PMCID>/). This process is idempotent; it checks if an article has already been processed and skips it to avoid redundant work.

- Parses and Collects Metadata: As it processes each NXML file, it extracts key metadata fields such as the article's title, abstract, and the full body text.

- Creates a Metadata Manifest: Finally, all the extracted metadata is compiled and saved into a new BQ table.

By the end of this process, our GCS bucket is populated with the core content of each article, and we have a clean, structured metadata table for the next stage of analysis.

In [136]:
import os
import tarfile
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.request import urlopen

import pandas as pd
import requests
from google.cloud import bigquery
from google.cloud import storage
from tqdm import tqdm

def parse_article_xml(xml_content: str) -> dict:
    """Parses the NXML content to extract key metadata fields."""
    try:
        # Replace non-breaking space characters which can cause parsing errors
        xml_content = xml_content.replace('\xa0', ' ')
        root = ET.fromstring(xml_content)
        
        title = root.findtext('.//article-title')
        
        abstract_element = root.find('.//abstract')
        abstract = ''.join(abstract_element.itertext()).strip() if abstract_element is not None else None
        
        body_element = root.find('.//body')
        body_text = ''.join(body_element.itertext()).strip() if body_element is not None else None

        return {
            'title': title,
            'abstract': abstract,
            'body_text': body_text
        }
    except ET.ParseError as e:
        print(f"      - Warning: XML parsing error: {e}")
        return {}

def process_article(row: pd.Series, bucket: storage.Bucket) -> tuple[str, dict]:
    """
    Finds the article package URL, downloads it, extracts NXML and PDF,
    uploads them to GCS, and returns the extracted metadata.
    """
    pmcid = row.get('pmcid')
    if not pmcid or pd.isna(pmcid):
        return "no_pmcid", {}

    gcs_article_folder = f"articles/{pmcid}"

    # Efficiently check for existing NXML and PDF files by iterating through blobs once
    # without loading the entire list into memory.
    blobs_iterator = bucket.list_blobs(prefix=f"{gcs_article_folder}/")
    nxml_files = []
    pdf_files = []
    for blob in blobs_iterator:
        if blob.name.lower().endswith('.nxml'):
            nxml_files.append(blob)
        elif blob.name.lower().endswith('.pdf'):
            pdf_files.append(blob)

    # Assume the first NXML found is the main one.
    nxml_blob = nxml_files[0] if nxml_files else None

    if nxml_blob:
        try:
            # download_as_string() is deprecated; download_as_bytes() is its replacement.
            xml_content = nxml_blob.download_as_bytes().decode('utf-8')
            metadata = parse_article_xml(xml_content)

            # Populate the full metadata dictionary for skipped items
            metadata['pmcid'] = pmcid
            metadata['nxml_gcs_uri'] = f"gs://{bucket.name}/{nxml_blob.name}"

            # Find the PDF that matches the NXML basename for correctness
            article_basename = os.path.splitext(os.path.basename(nxml_blob.name))[0]
            pdf_blob = next((p for p in pdf_files if os.path.splitext(os.path.basename(p.name))[0] == article_basename), None)
            if pdf_blob:
                metadata['pdf_gcs_uri'] = f"gs://{bucket.name}/{pdf_blob.name}"

            return "skipped", metadata
        except Exception as e:
            print(f"  - Error reading existing NXML for {pmcid}: {e}")
            return "skipped_error", {}

    try:
        # 1. Get article package URL from PMC OA Web Service
        oa_url = f"https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi?id={pmcid}"
        response = requests.get(oa_url, timeout=20)
        response.raise_for_status()
        
        root = ET.fromstring(response.content)
        package_link_element = root.find(".//link[@format='tgz']")
        
        if package_link_element is None or 'href' not in package_link_element.attrib:
            return "no_package_link", {}

        package_url = package_link_element.get('href').replace("ftp://", "https://", 1)
        package_filename = os.path.basename(urlparse(package_url).path)
        
        # 2. Download the package to a temporary local file
        local_temp_path = f"/tmp/{package_filename}"
        with urlopen(package_url, timeout=90) as ftp_response:
            with open(local_temp_path, 'wb') as f:
                f.write(ftp_response.read())

        # 3. Open the tarfile and process NXML and PDF files
        with tarfile.open(local_temp_path, "r:gz") as tar:
            # First, find the main .nxml file to establish the article's base filename.
            all_members = tar.getmembers()
            nxml_member = next((m for m in all_members if m.isfile() and m.name.lower().endswith('.nxml')), None)

            if not nxml_member:
                os.remove(local_temp_path)
                return "no_content_in_package", {}
            
            article_basename = os.path.splitext(os.path.basename(nxml_member.name))[0]
            
            # Process and upload the NXML file
            nxml_gcs_path = f"{gcs_article_folder}/{os.path.basename(nxml_member.name)}"
            nxml_blob = bucket.blob(nxml_gcs_path)
            nxml_content_bytes = tar.extractfile(nxml_member).read()
            nxml_blob.upload_from_string(nxml_content_bytes, content_type='application/xml')
            
            xml_content_str = nxml_content_bytes.decode('utf-8')
            article_metadata = parse_article_xml(xml_content_str)
            article_metadata['nxml_gcs_uri'] = f"gs://{bucket.name}/{nxml_gcs_path}"

            # Find and upload the corresponding PDF file
            pdf_member = next((m for m in all_members if m.isfile() and os.path.splitext(os.path.basename(m.name))[0] == article_basename and m.name.lower().endswith('.pdf')), None)
            if pdf_member:
                pdf_gcs_path = f"{gcs_article_folder}/{os.path.basename(pdf_member.name)}"
                pdf_blob = bucket.blob(pdf_gcs_path)
                pdf_blob.upload_from_string(tar.extractfile(pdf_member).read(), content_type='application/pdf')
                article_metadata['pdf_gcs_uri'] = f"gs://{bucket.name}/{pdf_gcs_path}"

        # 4. Clean up the local temporary file
        os.remove(local_temp_path)

        if not article_metadata:
            return "no_content_in_package", {}
            
        article_metadata['pmcid'] = pmcid
        return "processed", article_metadata

    except requests.exceptions.RequestException as e:
        return "failed_fetch", {"error": str(e)}
    except tarfile.ReadError as e:
        return "failed_tar_read", {"error": str(e)}
    except Exception as e:
        return "failed_other", {"error": str(e)}

Run the following to upload the first 50 articles to your GCS bucket:

In [None]:
# --- Configuration ---
CSV_FILE = 'nutrition-health.2025.csv'  # Downloaded to the working directory by the gsutil step above
MAX_ARTICLES_TO_PROCESS = 50 # Set to None to process all articles
# --- End Configuration ---

if not os.path.exists(CSV_FILE):
    # Raise instead of calling exit(), which would shut down the notebook kernel.
    raise FileNotFoundError(f"'{CSV_FILE}' not found. Please download it first.")

articles_df = pd.read_csv(CSV_FILE)
# Normalize column names: lowercase, replace special chars
articles_df.columns = articles_df.columns.str.lower().str.replace('[/ -]', '_', regex=True)

if MAX_ARTICLES_TO_PROCESS is not None:
    print(f"Limiting processing to the first {MAX_ARTICLES_TO_PROCESS} articles for testing.")
    articles_df = articles_df.head(MAX_ARTICLES_TO_PROCESS)

print(f"Found {len(articles_df)} articles to process from '{CSV_FILE}'.")

all_metadata = []
results = {"processed": 0, "skipped": 0, "no_pmcid": 0, "no_package_link": 0, "failed": 0}

with tqdm(total=len(articles_df), desc="Processing articles") as pbar:
    for index, row in articles_df.iterrows():
        status, metadata = process_article(row, bucket)
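        # Collapse detailed statuses such as 'failed_fetch' or 'skipped_error' into
        # their summary bucket; other statuses (e.g. 'no_pmcid') keep their full name.
        status_category = next((c for c in ("failed", "skipped") if status.startswith(c)), status)
        results[status_category] = results.get(status_category, 0) + 1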
        if status in ["processed", "skipped"] and metadata:
            all_metadata.append(metadata)
        pbar.set_postfix(processed=results['processed'], skipped=results['skipped'], failed=results['failed'])
        pbar.update(1)
        # Pause to avoid overwhelming the API with requests.
        time.sleep(0.4)

print("\n--- Processing Summary ---")
print(f"✅ Processed and uploaded: {results['processed']} new articles.")
print(f"⏩ Skipped (already exist): {results['skipped']} articles.")
print(f"❌ Failed: {results['failed']} articles.")

if all_metadata:
    processed_metadata_df = pd.DataFrame(all_metadata)

    # Merge the extracted metadata with the original dataframe from the CSV
    final_df = pd.merge(
        articles_df,
        processed_metadata_df,
        on='pmcid',
        how='left',
        suffixes=('_csv', '_xml')
    )

    # The XML title is preferred. Fill missing values with the original CSV title.
    final_df['title'] = final_df['title_xml'].fillna(final_df['title_csv'])

    # Define the final column set in the desired order
    bq_columns = [
        'pmid', 'pmcid', 'doi', 'title', 'authors', 'citation',
        'journal_book', 'publication_year', 'create_date', 'abstract',
        'pdf_gcs_uri', 'nxml_gcs_uri',
        'body_text',
    ]
    
    # Filter to only the columns we want in BigQuery and handle potential missing columns
    bq_df = final_df[[col for col in bq_columns if col in final_df.columns]]

    # --- Save to BigQuery using BigFrames ---
    print(f"\nUploading metadata for {len(bq_df)} articles to BigQuery table 'articles'...")
    try:
        # Convert pandas DataFrame to BigQuery DataFrame
        bq_articles_df = bpd.read_pandas(bq_df)
        
        # Write the BigQuery DataFrame to a BigQuery table
        bq_articles_df.to_gbq(
            destination_table=f"{DATASET_ID}.articles",
            if_exists='append'
        )
        print(f"✅ Successfully uploaded metadata to BigQuery.")
    except Exception as e:
        print(f"❌ Error uploading to BigQuery: {e}")
        print(f"Saving metadata to local CSV 'articles_metadata.csv' as a fallback.")
        bq_df.to_csv('articles_metadata.csv', index=False)
else:
    print("\nNo new or existing metadata was processed to save.")

Found 8632 articles to process from 'data/nutrition-health.2024.csv'.


Processing articles: 100%|██████████| 8632/8632 [11:02:02<00:00,  4.60s/it, failed=7, processed=8619, skipped=0]    



--- Processing Summary ---
✅ Processed and uploaded: 8619 new articles.
⏩ Skipped (already exist): 0 articles.
❌ Failed: 7 articles.

Uploading metadata for 8632 articles to BigQuery table 'articles'...


✅ Successfully uploaded metadata to BigQuery.


In [201]:
bq_articles_df.head(5)


Unnamed: 0,pmid,pmcid,doi,title,authors,citation,journal_book,publication_year,create_date,abstract,pdf_gcs_uri,nxml_gcs_uri,body_text
0,38337706,PMC10857452,10.3390/nu16030421,The Value of an Ecological Approach to Improve...,"Raiten DJ, Steiber AL, Dary O, Bremer AA.",Nutrients. 2024 Jan 31;16(3):421. doi: 10.3390...,Nutrients,2024,2024/02/10,"Globally, children are exposed to multiple hea...",gs://bq-ai-compete-pmc-oa/articles/PMC10857452...,gs://bq-ai-compete-pmc-oa/articles/PMC10857452...,"1. IntroductionGlobally, a holistic public hea..."
1,39095683,PMC11782318,10.1007/s12094-024-03595-1,Charting cancer’s course: revealing the role o...,"Martin-Quesada AI, Hennessy MA, Gutiérrez AC.",Clin Transl Oncol. 2025 Feb;27(2):473-485. doi...,Clin Transl Oncol,2025,2024/08/02,A variety of pathophysiological mechanisms exi...,gs://bq-ai-compete-pmc-oa/articles/PMC11782318...,gs://bq-ai-compete-pmc-oa/articles/PMC11782318...,IntroductionThe recognition of physical exerci...
2,39155613,PMC11730648,10.1002/hpja.913,Nutrition and physical activity practices in f...,"Tran G, Kerr E, Kelly B, Ryan ST, Norman J, Ha...",Health Promot J Austr. 2025 Jan;36(1):e913. do...,Health Promot J Austr,2025,2024/08/19,AbstractIssue Addressed\nMunch & Move is a New...,gs://bq-ai-compete-pmc-oa/articles/PMC11730648...,gs://bq-ai-compete-pmc-oa/articles/PMC11730648...,1BACKGROUNDThe New South Wales (NSW) Ministry ...
3,39796475,PMC11722646,10.3390/nu17010041,Beyond Borders: Investigating the Impact of CO...,"Li J, Wilczyńska DM, Lipowska M, Łada-Maśko AB...",Nutrients. 2024 Dec 26;17(1):41. doi: 10.3390/...,Nutrients,2024,2025/01/11,Background/Objectives: The mechanisms linking ...,gs://bq-ai-compete-pmc-oa/articles/PMC11722646...,gs://bq-ai-compete-pmc-oa/articles/PMC11722646...,1. IntroductionSince the first confirmed case ...
4,39694877,PMC11710946,10.1111/ijpo.13195,\n,"Strugnell C, Gaskin CJ, Becker D, Orellana L, ...",Pediatr Obes. 2025 Feb;20(2):e13195. doi: 10.1...,Pediatr Obes,2025,2024/12/18,SummaryBackgroundDuring the coronavirus diseas...,gs://bq-ai-compete-pmc-oa/articles/PMC11710946...,gs://bq-ai-compete-pmc-oa/articles/PMC11710946...,1INTRODUCTIONPoorer child health and well‐bein...
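
The `abstract` and `body_text` values above show section headings fused with the text that follows them (for example `1. IntroductionGlobally` and `AbstractIssue Addressed`). This is a side effect of joining the `itertext()` fragments with an empty string. The sketch below contrasts that with a space-separated join; the XML is a made-up snippet modeled on row 2, and `joined_text` is a hypothetical helper, not part of the script above.

```python
import xml.etree.ElementTree as ET

# Made-up abstract fragment with nested section titles.
SAMPLE_ABSTRACT = (
    "<abstract><title>Abstract</title>"
    "<sec><title>Issue Addressed</title><p>Munch &amp; Move is a program.</p></sec>"
    "</abstract>"
)


def joined_text(element: ET.Element, separator: str) -> str:
    """Join the non-empty text fragments of an element with the given separator."""
    fragments = (fragment.strip() for fragment in element.itertext())
    return separator.join(fragment for fragment in fragments if fragment)


abstract = ET.fromstring(SAMPLE_ABSTRACT)
fused = "".join(abstract.itertext()).strip()  # behaviour of the script above
spaced = joined_text(abstract, " ")           # headings stay separated
```

The trade-off is that a separator also lands around inline markup, so a formula such as H<sub>2</sub>O would come out as `H 2 O`. Whether that matters depends on how the text is used downstream.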


## Generative AI in BigFrames

The new [BigFrames AI accessor](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.ai.AIAccessor) methods are high-level abstractions that can simplify your code and make it more readable. They are designed to replace manual SQL construction and complex data manipulation with simple, chained Python calls.

In [None]:
bpd.options.experiments.ai_operators = True
bpd.options.compute.ai_ops_confirmation_threshold = 100

# LLM to be used in BigQuery AI functions
model = GeminiTextGenerator(model_name="gemini-2.0-flash-001")

# bq_articles_df = bpd.read_gbq(F"{PROJECT_ID}.{DATASET_ID}.articles", use_cache=False)
articles_sample_df = bq_articles_df.head(100)

### 1. Filter with natural language (AI.filter)

You can filter a DataFrame using a natural language condition instead of SQL. This lets you narrow down articles with conditions such as "only studies involving children" or "articles that discuss side-effects", which fixed, predefined filter widgets cannot express.
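
The instruction passed to these methods is a template whose `{column}` placeholders refer to DataFrame columns. Conceptually, each row's values are substituted into the template before the text is sent to the model. The plain-Python sketch below illustrates that idea only; it is not BigFrames' internal implementation, and the row values are invented.

```python
def render_instruction(template: str, row: dict[str, str]) -> str:
    """Fill {column} placeholders in the template with values from one row."""
    # Extra keys in the row are ignored; only referenced columns are used.
    return template.format(**row)


template = (
    "Based on the following abstract, the article is relevant to this topic: "
    "{topic}. Abstract: {abstract}"
)
# Hypothetical row; only 'topic' and 'abstract' are referenced by the template.
row = {
    "pmcid": "PMC0000000",
    "topic": "Multiple Sclerosis",
    "abstract": "Multiple sclerosis (MS) is a chronic condition.",
}
prompt = render_instruction(template, row)
```

This is why the cell below first adds a constant `topic` column: the template can only reference values that exist as columns.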

In [None]:
articles_sample_df['topic'] = 'Multiple Sclerosis'

filtered_articles_df = articles_sample_df.ai.filter(
    "Based on the following abstract, the article is relevant to this topic: {topic}. Abstract: {abstract}", model
)
filtered_articles_df


Unnamed: 0,pmid,pmcid,doi,title,authors,citation,journal_book,publication_year,create_date,abstract,pdf_gcs_uri,nxml_gcs_uri,body_text,topic
0,40871741,PMC12389322,10.3390/nu17162713,The Role of Nutrition and Physical Activity in...,"Grosu C, Ignat EB, Alexa D, Ciubotaru A, Leon ...",Nutrients. 2025 Aug 21;17(16):2713. doi: 10.33...,Nutrients,2025,2025/08/28,"Multiple sclerosis (MS) is a chronic, immune-m...",gs://bq-ai-compete-pmc-oa/articles/PMC12389322...,gs://bq-ai-compete-pmc-oa/articles/PMC12389322...,1. IntroductionMultiple sclerosis (MS) is a de...,Multiple Sclerosis


### 2. Classify with natural language (AI.classify)

You can automatically classify rows into a set of predefined labels. This is a great way to add new structured metadata. 

Let's first classify by study type. This is a critical classification because it lets us distinguish between high-level evidence like a meta-analysis and preliminary evidence like a case report. Here are the labels used:
- Systematic Review / Meta-Analysis
- Randomized Controlled Trial (RCT)
- Observational Study (e.g., Cohort, Case-Control)
- Review Article / Expert Opinion
- Basic Science / Animal Study

In [None]:
study_design_labels = ["Systematic Review / Meta-Analysis", "Randomized Controlled Trial", "Observational Study", "Review Article / Expert Opinion", "Basic Science / Animal Study"]

articles_sample_df = articles_sample_df.ai.classify(
    "Classify this biomedical article by study design based on its abstract: {abstract}", model,
    study_design_labels, 'study_type'
)

articles_sample_df[['pmcid', 'title', 'journal_book', 'abstract', 'study_type']]

Unnamed: 0,pmcid,title,journal_book,abstract,study_type
0,PMC10892465,"Weight Categories, Trajectories, Eating Behavi...",Nutrients,Pre-pregnancy overweight and obesity are assoc...,Observational Study
1,PMC11547757,The Multiple Challenges of Nutritional Microbi...,Nutrients,The global coronavirus disease 2019 (COVID-19)...,Review Article / Expert Opinion
2,PMC11934245,"Eggs, Dietary Choline, and Nonalcoholic Fatty ...",J Nutr,BackgroundEggs are rich in bioactive compounds...,Observational Study
3,PMC11795593,Transcriptomic and metabolomic-based revelatio...,Poult Sci,To investigate the effect of fresh corn extrac...,Basic Science / Animal Study
4,PMC12252281,Effects of 12-Week Dietary Inflammatory Index-...,Nutrients,Background: Frailty is common in colorectal ca...,Randomized Controlled Trial
5,PMC11064901,"Habitual coffee consumption and office, home, ...",J Hypertens,Objectives:Heterogeneous are the results of th...,Observational Study
6,PMC11999145,Association between advanced lung cancer infla...,PLoS One,IntroductionGallstones are a common digestive ...,Observational Study
7,PMC11357025,Cardiovascular Risk Factors as Predictors of N...,Nutrients,Aging is commonly accompanied by increased car...,Observational Study
8,PMC11357324,Creatine Improves Total Sleep Duration Followi...,Nutrients,Females historically experience sleep disturba...,Randomized Controlled Trial
9,PMC11023303,Perinatal mortality in German dairy cattle: Un...,PLoS One,Perinatal mortality (PM) is a common issue on ...,Observational Study


Let's classify by population or target subject, as nutritional needs vary dramatically across different populations. For example, a clinician looking for information for an elderly patient can easily filter out pediatric or athletic studies.
We will use the following labels:
- Pediatric / Children
- Geriatric / Elderly
- Athletes
- Clinical Population (e.g., patients with a specific disease)
- General Population



In [7]:
population_labels = ["Pediatric", "Geriatric", "Athletes", "Clinical Population", "General Population"]

articles_sample_df = articles_sample_df.ai.classify(
    "Classify this biomedical article by target patient population based on its abstract: {abstract}", model,
    population_labels, 'population'
)

articles_sample_df[['pmcid', 'title', 'journal_book', 'abstract', 'population']]

Unnamed: 0,pmcid,title,journal_book,abstract,population
0,PMC10892465,"Weight Categories, Trajectories, Eating Behavi...",Nutrients,Pre-pregnancy overweight and obesity are assoc...,Clinical Population
1,PMC11547757,The Multiple Challenges of Nutritional Microbi...,Nutrients,The global coronavirus disease 2019 (COVID-19)...,General Population
2,PMC11934245,"Eggs, Dietary Choline, and Nonalcoholic Fatty ...",J Nutr,BackgroundEggs are rich in bioactive compounds...,General Population
3,PMC11795593,Transcriptomic and metabolomic-based revelatio...,Poult Sci,To investigate the effect of fresh corn extrac...,General Population
4,PMC12252281,Effects of 12-Week Dietary Inflammatory Index-...,Nutrients,Background: Frailty is common in colorectal ca...,Clinical Population
5,PMC11064901,"Habitual coffee consumption and office, home, ...",J Hypertens,Objectives:Heterogeneous are the results of th...,General Population
6,PMC11999145,Association between advanced lung cancer infla...,PLoS One,IntroductionGallstones are a common digestive ...,General Population
7,PMC11357025,Cardiovascular Risk Factors as Predictors of N...,Nutrients,Aging is commonly accompanied by increased car...,Geriatric
8,PMC11357324,Creatine Improves Total Sleep Duration Followi...,Nutrients,Females historically experience sleep disturba...,Clinical Population
9,PMC11023303,Perinatal mortality in German dairy cattle: Un...,PLoS One,Perinatal mortality (PM) is a common issue on ...,General Population
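
With both label columns in place, narrowing results by study design or population becomes a simple equality filter. The sketch below uses plain Python dictionaries so it runs without BigQuery; the label values are copied from the sample outputs above, and `select` is a hypothetical helper. In the notebook itself, the same filter would be ordinary boolean indexing on the DataFrame.

```python
# A few rows with labels taken from the sample outputs above.
labeled_rows = [
    {"pmcid": "PMC10892465", "study_type": "Observational Study", "population": "Clinical Population"},
    {"pmcid": "PMC12252281", "study_type": "Randomized Controlled Trial", "population": "Clinical Population"},
    {"pmcid": "PMC11357025", "study_type": "Observational Study", "population": "Geriatric"},
    {"pmcid": "PMC11357324", "study_type": "Randomized Controlled Trial", "population": "Clinical Population"},
]


def select(rows, *, study_types=None, populations=None):
    """Keep rows whose labels fall in the given sets; None means 'no constraint'."""
    return [
        row
        for row in rows
        if (study_types is None or row["study_type"] in study_types)
        and (populations is None or row["population"] in populations)
    ]


geriatric = select(labeled_rows, populations={"Geriatric"})
clinical_rcts = select(
    labeled_rows,
    study_types={"Randomized Controlled Trial"},
    populations={"Clinical Population"},
)
```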


In [8]:
articles_sample_df.head(5)

Unnamed: 0,pmid,pmcid,doi,title,authors,citation,journal_book,publication_year,create_date,abstract,pdf_gcs_uri,nxml_gcs_uri,body_text,study_type,population
0,38398884,PMC10892465,10.3390/nu16040560,"Weight Categories, Trajectories, Eating Behavi...","Schenk S, Ravussin Y, Lacroix A, Quansah DY, P...",Nutrients. 2024 Feb 18;16(4):560. doi: 10.3390...,Nutrients,2024,2024/02/24,Pre-pregnancy overweight and obesity are assoc...,gs://bq-ai-compete-pmc-oa/articles/PMC10892465...,gs://bq-ai-compete-pmc-oa/articles/PMC10892465...,1. IntroductionThe increasing prevalence of ob...,Observational Study,Clinical Population
1,39519526,PMC11547757,10.3390/nu16213693,The Multiple Challenges of Nutritional Microbi...,"Donkers A, Seel W, Klümpen L, Simon MC.",Nutrients. 2024 Oct 30;16(21):3693. doi: 10.33...,Nutrients,2024,2024/11/09,The global coronavirus disease 2019 (COVID-19)...,gs://bq-ai-compete-pmc-oa/articles/PMC11547757...,gs://bq-ai-compete-pmc-oa/articles/PMC11547757...,1. IntroductionThe term gut microbiota refers ...,Review Article / Expert Opinion,General Population
2,39424072,PMC11934245,10.1016/j.tjnut.2024.10.026,"Eggs, Dietary Choline, and Nonalcoholic Fatty ...","Yiannakou I, Long MT, Jacques PF, Beiser A, Pi...",J Nutr. 2025 Mar;155(3):923-935. doi: 10.1016/...,J Nutr,2025,2024/10/18,BackgroundEggs are rich in bioactive compounds...,gs://bq-ai-compete-pmc-oa/articles/PMC11934245...,gs://bq-ai-compete-pmc-oa/articles/PMC11934245...,IntroductionThe egg is a nutrient-dense food t...,Observational Study,General Population
3,39848207,PMC11795593,10.1016/j.psj.2025.104814,Transcriptomic and metabolomic-based revelatio...,"Tian J, Wu Y, Zhao W, Zhang G, Zhang H, Xue L,...",Poult Sci. 2025 Feb;104(2):104814. doi: 10.101...,Poult Sci,2025,2025/01/23,To investigate the effect of fresh corn extrac...,gs://bq-ai-compete-pmc-oa/articles/PMC11795593...,gs://bq-ai-compete-pmc-oa/articles/PMC11795593...,"IntroductionNowadays, the global broiler indus...",Basic Science / Animal Study,General Population
4,40647307,PMC12252281,10.3390/nu17132203,Effects of 12-Week Dietary Inflammatory Index-...,"Wang Y, Liu Y, Cheng L, He J, Cheng X, Lin X, ...",Nutrients. 2025 Jul 1;17(13):2203. doi: 10.339...,Nutrients,2025,2025/07/12,Background: Frailty is common in colorectal ca...,gs://bq-ai-compete-pmc-oa/articles/PMC12252281...,gs://bq-ai-compete-pmc-oa/articles/PMC12252281...,1. IntroductionColorectal cancer (CRC) has bec...,Randomized Controlled Trial,Clinical Population


### 3a. Map to extract info with natural language (AI.map)

In evidence-based medicine and research, the PICO framework is a standard method for summarizing the core components of a clinical study. It's an acronym for:
- P - Population/Patient: Who was studied? (e.g., "Post-menopausal women", "Children with asthma")
- I - Intervention: What was the treatment or exposure? (e.g., "Vitamin D supplementation", "Mediterranean diet")
- C - Comparison/Control: What was the intervention compared to? (e.g., "Placebo", "Standard diet")
- O - Outcome: What was the result being measured? (e.g., "Bone density", "Reduction in symptoms")

Manually identifying the PICO elements in dozens of articles is time-consuming. Use `.ai.map` to automate this by running an LLM over each article's abstract and extracting the components in a structured form.
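
As a rough picture of the structured record this step aims for, the sketch below models one PICO result as a dataclass and fills absent components with `"N/A"`, the convention suggested by the commented-out prompt in the next cell. The `Pico` class and `pico_from_mapping` helper are hypothetical names; the field values reuse the examples from the list above.

```python
from dataclasses import asdict, dataclass, fields


@dataclass
class Pico:
    """One extracted PICO record; missing components default to 'N/A'."""
    population: str = "N/A"
    intervention: str = "N/A"
    comparison: str = "N/A"
    outcome: str = "N/A"


def pico_from_mapping(raw: dict) -> Pico:
    """Build a Pico record, treating missing, None, or blank values as 'N/A'."""
    cleaned = {}
    for field in fields(Pico):
        value = (raw.get(field.name) or "").strip()
        cleaned[field.name] = value or "N/A"
    return Pico(**cleaned)


example = pico_from_mapping({
    "population": "Post-menopausal women",
    "intervention": "Vitamin D supplementation",
    "comparison": "",  # e.g. a study with no explicit comparator
    "outcome": "Bone density",
})
record = asdict(example)
```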

In [None]:
# Define the prompt for the extraction task
# prompt = """
# From the following article's abstract , extract the PICO components (Population, Intervention, Comparison, Outcome). 
# If a component is not mentioned, use "N/A" as the value.
# Abstract: {abstract}
# """

prompt = """
You are a helpful biomedical research assistant.
Extract the PICO components (Population, Intervention, Comparison, Outcome) from the following biomedical article abstract: {abstract}
"""

sample_rows_df = articles_sample_df.copy()

sample_rows_df.ai.map(
    instruction=prompt,
    model=model,
    output_schema={"population": "string", "intervention": "string", "comparison": "string", "outcome": "string"}
)[['pmcid', 'title', 'journal_book', 'abstract', 'population', 'intervention', 'comparison', 'outcome']]

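Let's save the classification labels (`study_type` and `population`) to a new table, `article_labels`. Note that the PICO fields returned by `.ai.map` above are only displayed; the result is not assigned back to `articles_sample_df`, so those fields are not part of this table.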

In [None]:
# Save the results to a BigQuery table.
articles_sample_df.to_gbq(f"{PROJECT_ID}.{DATASET_ID}.article_labels", if_exists="replace")

### 3b. Use `AI.GENERATE_TABLE` to extract info with natural language

The `ai.map` function in BigFrames is a convenient wrapper that, under the hood, generates a SQL query. Here is the equivalent SQL, which uses the `AI.GENERATE_TABLE` function to perform the same PICO information extraction task:

In [None]:
%%bigquery --project {PROJECT_ID}
-- This SQL assumes you have a BigQuery model `llm_model` created that points to
-- a Vertex AI LLM (e.g., gemini-2.0-flash-001).
SELECT
  pmcid,
  title,
  comparison,
  population,
  intervention,
  outcome,
  prompt
FROM AI.GENERATE_TABLE(
    MODEL `pdf_analysis.llm_model`, -- Replace with your BQ dataset and LLM model
    (
      SELECT
        pmcid,
        title,
        abstract,
        (
          "You are a helpful biomedical research assistant. Extract the specific PICO components or Population, Intervention, Comparison, Outcome from the following biomedical article abstract: ",
          abstract
        ) AS prompt
      FROM
        `pdf_analysis.articles` -- Replace with your BQ dataset
      LIMIT 2
    ),
    STRUCT(
      "population STRING, intervention STRING, comparison STRING, outcome STRING" AS output_schema,
      0 AS temperature
    )
);
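To make the relationship between the two interfaces more concrete, here is a minimal sketch of how the `output_schema` dictionary passed to `.ai.map` lines up with the schema string used in the SQL above. The helper `schema_dict_to_sql` is hypothetical and written only for illustration; it is not part of BigFrames, and the translation BigFrames actually performs may differ.

```python
def schema_dict_to_sql(output_schema: dict[str, str]) -> str:
    """Render {"population": "string"} as "population STRING" (hypothetical helper)."""
    return ", ".join(
        f"{name} {dtype.upper()}" for name, dtype in output_schema.items()
    )


# Same schema as the one passed to `.ai.map` earlier in this section.
pico_schema = {
    "population": "string",
    "intervention": "string",
    "comparison": "string",
    "outcome": "string",
}
print(schema_dict_to_sql(pico_schema))
```

The printed string has the same shape as the `output_schema` value in the `STRUCT(...)` argument of the SQL query.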

## Multimodal Analysis in BigFrames

In [217]:
# Use this if bpd session is reset or lost
bq_articles_df = bpd.read_gbq(f"{DATASET_ID}.articles", use_cache=False)

### 1. Split multi-page articles into single-page PDFs

Let's create a remote function that splits each PDF article into single-page PDFs and also renders each page as a PNG image.

In [181]:
import fitz
from google.cloud import storage
from pathlib import Path

@bpd.remote_function(
    reuse=False,
    packages=["PyMuPDF", "google-cloud-storage"],
    cloud_function_service_account="default",
    cloud_function_memory_mib=2048
)
def pdf_split(pdf_uri: str, dst_folder:str) -> list[str]:
    print(f"Starting to process PDF: {pdf_uri}")

    try:
        storage_client = storage.Client()
        bucket_name, source_blob_name = pdf_uri.split('/', 3)[2:]
        bucket = storage_client.bucket(bucket_name)
        source_path = Path(source_blob_name)

        # Determine the destination prefix based on the source file's parent folder (e.g., a PMCID)
        pmc_id_folder = source_path.parent.name
        destination_prefix = f"{dst_folder}/{pmc_id_folder}/"

        # --- Idempotency Check ---
        # Check if files already exist at the destination. If so, skip processing.
        existing_blobs = list(bucket.list_blobs(prefix=destination_prefix, max_results=1))
        if existing_blobs:
            print(f"Destination {destination_prefix} already populated. Skipping processing.")
            # List all blobs and derive the unique base paths
            all_blobs_in_folder = bucket.list_blobs(prefix=destination_prefix)
            processed_uris = {str(Path(blob.name).with_suffix('')) for blob in all_blobs_in_folder}
            return list(processed_uris)

        # --- If not processed, proceed with downloading and splitting ---
        print(f"Downloading blob: {source_blob_name} from bucket: {bucket_name}")
        source_blob = bucket.blob(source_blob_name)
        pdf_bytes = source_blob.download_as_bytes()
        print(f"Successfully downloaded {len(pdf_bytes)} bytes.")

        generated_page_ids = []
        with fitz.open(stream=pdf_bytes, filetype="pdf") as pdf_document:
            print(f"PDF has {len(pdf_document)} pages.")
            base_filename = source_path.stem

            for page_num in range(len(pdf_document)):
                page = None
                pix = None
                try:
                    page = pdf_document.load_page(page_num)
                    with fitz.open() as new_pdf:
                        new_pdf.insert_pdf(pdf_document, from_page=page_num, to_page=page_num)
                        pdf_output = new_pdf.tobytes()

                    pix = page.get_pixmap(dpi=150)
                    png_output = pix.tobytes("png")

                    # Define a unique page_id without extension, including the subfolder.
                    page_id_base = f"{base_filename}_page_{page_num + 1}"
                    page_id = str(Path(pmc_id_folder) / page_id_base)
                    # Construct destination blob names using the page_id.
                    full_base_path = str(Path(dst_folder) / page_id)

                    # Upload the files
                    bucket.blob(f"{full_base_path}.pdf").upload_from_string(pdf_output, content_type='application/pdf')
                    bucket.blob(f"{full_base_path}.png").upload_from_string(png_output, content_type='image/png')
                    generated_page_ids.append(full_base_path)
                finally:
                    # Clean up large loop-scoped objects explicitly.
                    del page, pix

        print(f"Successfully processed. Returning {len(generated_page_ids)} page IDs.")
        return generated_page_ids

    except Exception as e:
        print(f"An error occurred while processing {pdf_uri}: {e}")
        return [] # Return an empty list on error
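The naming scheme used by `pdf_split` can be checked locally without touching Cloud Storage. The sketch below reproduces only the path logic: how the bucket and object name are derived from the URI, and how each page's extension-less base path is built. The helper names and the example URI are hypothetical, and `PurePosixPath` is used here (instead of `Path`) so the result does not depend on the local operating system.

```python
from pathlib import PurePosixPath


def split_gcs_uri(gcs_uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/to/file' into (bucket name, object name)."""
    bucket_name, blob_name = gcs_uri.split("/", 3)[2:]
    return bucket_name, blob_name


def page_base_path(blob_name: str, dst_folder: str, page_num: int) -> str:
    """Build the extension-less destination path for a zero-based page index."""
    source_path = PurePosixPath(blob_name)
    pmc_id_folder = source_path.parent.name  # e.g. the PMCID folder
    page_name = f"{source_path.stem}_page_{page_num + 1}"
    return str(PurePosixPath(dst_folder) / pmc_id_folder / page_name)


# Hypothetical URI, used only to illustrate the naming scheme.
bucket, blob = split_gcs_uri("gs://example-bucket/articles/PMC0000001/sample-article.pdf")
print(bucket)
print(page_base_path(blob, "pages", 0))
```

Because the base path carries no extension, the same string can be suffixed with `.pdf` or `.png` to address either rendition of a page.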

Let's keep only the columns we need to process the article PDFs:

In [219]:
# bq_articles_df = bpd.read_gbq_table(f"{DATASET_ID}.articles")
pdf_df = bq_articles_df.drop(columns=['body_text', 'nxml_gcs_uri'])

In [223]:
pdf_df.shape

(8632, 11)

Now let's run `pdf_split` directly on your dataframe of article PDFs. This will take several minutes...

In [None]:
destination_folder = "pages"
processed_pdf_df = pdf_df.assign(
    page_base_uris=pdf_df['pdf_gcs_uri'].apply(pdf_split, args=(destination_folder,))
)

# Materialize the results to a new, permanent table.
# This is the action that will trigger the long-running job.
processed_pdf_df.to_gbq(f"{DATASET_ID}.article_pages", if_exists="append")

print(f"Successfully processed PDFs and saved results to `article_pages`")

Successfully processed PDFs and saved results to `article_pages`


Notice how each article was split into several single-page PDFs, and each page was also rendered as a PNG file, all persisted under a new `pages` folder in the same GCS bucket:

![PDF articles are split into single-page PDFs and PNGs for multimodal analysis](https://github.com/rarsan/BQ-Biomed-AI/blob/1ac2d334e61f7c0f8dd5de99676b0b4be6d9d8cf/assets/gcs_pdf_articles_split_into_pages.png?raw=True)

Let's read this dataset of articles, each with its list of pages, back from BigQuery:

In [None]:
processed_pdf_df = bpd.read_gbq(f"{DATASET_ID}.article_pages")

In [226]:
processed_pdf_df.shape

(8632, 12)

In [227]:
# Explode the 'page_base_uris' array to create a new row for each page.
pages_df = processed_pdf_df.explode('page_base_uris').rename(columns={'page_base_uris': 'page_id'})
cols = ['page_id'] + [col for col in pages_df.columns if col != 'page_id']
pages_df = pages_df[cols]

# Immediately reset the index to ensure it's unique (0, 1, 2, ...)
pages_df = pages_df.reset_index(drop=True)
print("Shape of the new flattened DataFrame:", pages_df.shape)
pages_df.head()

Shape of the new flattened DataFrame: (114550, 12)


Unnamed: 0,page_id,pmid,pmcid,doi,title,authors,citation,journal_book,publication_year,create_date,abstract,pdf_gcs_uri
0,pages/PMC11496284/fpubh-12-1436683_page_15,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...
1,pages/PMC11496284/fpubh-12-1436683_page_9,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...
2,pages/PMC11496284/fpubh-12-1436683_page_8,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...
3,pages/PMC11496284/fpubh-12-1436683_page_11,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...
4,pages/PMC11496284/fpubh-12-1436683_page_16,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...


In [228]:
pages_df.sample(n=1)

Unnamed: 0,page_id,pmid,pmcid,doi,title,authors,citation,journal_book,publication_year,create_date,abstract,pdf_gcs_uri
32972,pages/PMC11658349/12916_2024_Article_3809_page_3,39695648,PMC11658349,10.1186/s12916-024-03809-x,Healthy eating patterns associated with reduce...,"Xia B, Li Y, Hu L, Xie P, Mi N, Lv L, Liang Z,...",BMC Med. 2024 Dec 18;22(1):589. doi: 10.1186/s...,BMC Med,2024,2024/12/19,BackgroundLimited epidemiological evidence exi...,gs://bq-ai-compete-pmc-oa/articles/PMC11658349...


In [229]:
# After exploding, the 'page_id' column contains the base URI string for each page.
# We'll use this string to derive the page number and the full URIs for the PDF and image files.
pages_df = pages_df.assign(
    # Extract the page number from the end of the base URI string.
    page_number=pages_df['page_id'].str.extract(r'_page_(\d+)').astype('Int64'),
    # Append extensions to the base URI to get the full paths
    page_pdf_uri="gs://" + BUCKET_NAME + '/' + pages_df['page_id'] + '.pdf',
    page_image_uri="gs://" + BUCKET_NAME + '/' + pages_df['page_id'] + '.png',
)

pages_df.head()


Unnamed: 0,page_id,pmid,pmcid,doi,title,authors,citation,journal_book,publication_year,create_date,abstract,pdf_gcs_uri,page_number,page_pdf_uri,page_image_uri
0,pages/PMC11496284/fpubh-12-1436683_page_15,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...,15,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...
1,pages/PMC11496284/fpubh-12-1436683_page_9,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...,9,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...
2,pages/PMC11496284/fpubh-12-1436683_page_8,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...,8,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...
3,pages/PMC11496284/fpubh-12-1436683_page_11,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...,11,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...
4,pages/PMC11496284/fpubh-12-1436683_page_16,39444959,PMC11496284,10.3389/fpubh.2024.1436683,The influence of minimum dietary diversity on ...,"Shibeshi AH, Asfaw ZG.",Front Public Health. 2024 Oct 9;12:1436683. do...,Front Public Health,2024,2024/10/24,BackgroundUndernutrition persists as a critica...,gs://bq-ai-compete-pmc-oa/articles/PMC11496284...,16,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...,gs://bq-ai-compete-pmc-oa/pages/PMC11496284/fp...
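For a single row, the derivation above amounts to a regular-expression match plus string concatenation. The following plain-Python sketch mirrors that logic outside BigFrames; `derive_page_fields` is a hypothetical helper, and the example `page_id` and bucket name are taken from the output shown above.

```python
import re

PAGE_PATTERN = re.compile(r"_page_(\d+)")


def derive_page_fields(page_id: str, bucket_name: str) -> dict:
    """Derive the page number and full URIs from an extension-less page_id."""
    match = PAGE_PATTERN.search(page_id)
    # Mirror the nullable integer column: None when no page suffix is found.
    page_number = int(match.group(1)) if match else None
    base_uri = f"gs://{bucket_name}/{page_id}"
    return {
        "page_number": page_number,
        "page_pdf_uri": base_uri + ".pdf",
        "page_image_uri": base_uri + ".png",
    }


fields = derive_page_fields(
    "pages/PMC11496284/fpubh-12-1436683_page_15", "bq-ai-compete-pmc-oa"
)
print(fields)
```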


Save the results into a new `pages` table:

In [230]:
# Save the results to a BigQuery table.
pages_df.to_gbq(f"{PROJECT_ID}.{DATASET_ID}.pages", if_exists="append")

### 2. Read and Resize Images

Later in this section we will use a multimodal embedding model to generate vector embeddings for all page images.

While the embedding model automatically resizes images, sending smaller, pre-resized images is a good practice: it reduces the amount of data transferred and processed per request, which can lower latency.

Let's first read the pages table and create a multimodal dataframe:

In [None]:
pages_df = bpd.read_gbq(f"{PROJECT_ID}.{DATASET_ID}.pages", use_cache=False)
print("Shape of the initial pages dataframe:", pages_df.shape)
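# Create a blob column from the image URI column
pages_df["image"] = pages_df["page_image_uri"].str.to_blob()
# Use the blob accessor to read each object's size; `Series.size` is a property, not a method
pages_df["image_size"] = pages_df["image"].blob.size()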
pages_df.sample(1)

Alternatively, you can create a multimodal dataframe directly from the images in the GCS bucket.
We will use the first option above (the `pages` table), because it combines the pre-extracted structured data with the newly fetched unstructured blob objects.


In [None]:
# Alternative for creating a multimodal dataframe directly from the images in the GCS bucket
# pages_df = bpd.from_glob_path(f"gs://{BUCKET_NAME}/pages/*.png", name="image")

In [285]:
pages_df.shape

(207015, 16)

In [None]:
sample_pages_df = pages_df.head(5).copy()
sample_pages_df

In [307]:
sample_pages_df["image"] = sample_pages_df["page_image_uri"].str.to_blob()

# Alternative (not used here): write each resized image to a per-page destination URI.
# sample_pages_df = sample_pages_df.assign(
#     page_resized_image_uri="gs://" + BUCKET_NAME + '/' + sample_pages_df['page_id'] + '_512.jpeg',
# )
# ...then call image_resize((512, 512), dst=sample_pages_df['page_resized_image_uri'], engine="opencv")

# Resize each page image to 512x512 and write the results under the `image_resized/` folder.
sample_pages_df["image_resized"] = sample_pages_df["image"].blob.image_resize(
    (512, 512), dst=f"gs://{BUCKET_NAME}/image_resized/", engine="opencv"
)

sample_pages_df[['page_image_uri', 'image', 'image_resized']]

Unnamed: 0,page_image_uri,image,image_resized
0,gs://bq-ai-compete-pmc-oa/pages/PMC12300522/nutrients-17-02382_page_38.png,,
1,gs://bq-ai-compete-pmc-oa/pages/PMC10934908/nutrients-16-00583_page_11.png,,
2,gs://bq-ai-compete-pmc-oa/pages/PMC12145321/394_2025_Article_3733_page_5.png,,
3,gs://bq-ai-compete-pmc-oa/pages/PMC11368077/1678-8060-mioc-119-e240055_page_8.png,,
4,gs://bq-ai-compete-pmc-oa/pages/PMC11960743/trae140_page_5.png,,


### 3. Generate Image Embeddings 

In [None]:
# Generate embeddings
embedding_model = MultimodalEmbeddingGenerator(model_name='multimodalembedding@001')

# Make embedding call and store the resulting DF indexed identically to 'pages_df'
prediction_result_df = embedding_model.predict(pages_df['image'])

page_embeddings_df = pages_df.assign(
    embedding=prediction_result_df['ml_generate_embedding_result'],
    embedding_status=prediction_result_df['ml_generate_embedding_status']
)

# Display the result to confirm the new columns are present alongside all original columns.
page_embeddings_df[['page_id', 'page_image_uri', 'image', 'embedding', 'embedding_status']].head()

Unnamed: 0,page_id,page_image_uri,image,embedding,embedding_status
0,pages/PMC12300522/nutrients-17-02382_page_38,gs://bq-ai-compete-pmc-oa/pages/PMC12300522/nutrients-17-02382_page_38.png,,[ 0.00073479 0.00139476 0.02833971 ... -0.00793358 -0.0050436  -0.05914517],
1,pages/PMC10934908/nutrients-16-00583_page_11,gs://bq-ai-compete-pmc-oa/pages/PMC10934908/nutrients-16-00583_page_11.png,,[ 0.00774734 -0.01377369 0.00907896 ... -0.02722635 0.00464822  -0.0553794 ],
2,pages/PMC12145321/394_2025_Article_3733_page_5,gs://bq-ai-compete-pmc-oa/pages/PMC12145321/394_2025_Article_3733_page_5.png,,[ 0.0300483 -0.00139495 0.01621595 ... 0.01112892 0.00044618  -0.05923868],
3,pages/PMC11368077/1678-8060-mioc-119-e240055_page_8,gs://bq-ai-compete-pmc-oa/pages/PMC11368077/1678-8060-mioc-119-e240055_page_8.png,,[ 0.01762628 0.05310114 -0.0054284 ... -0.00721965 0.00441023  -0.00886399],
4,pages/PMC11960743/trae140_page_5,gs://bq-ai-compete-pmc-oa/pages/PMC11960743/trae140_page_5.png,,[ 0.01361597 0.00714499 0.01628631 ... 0.00099009 0.02931252  -0.02790345],


In [238]:
page_embeddings_df.shape

(207015, 18)

Persist page image embeddings into new BigQuery table `page_embeddings` excluding the blob image:

In [None]:
page_embeddings_df.drop(columns=['image']).to_gbq(f"{PROJECT_ID}.{DATASET_ID}.page_embeddings", if_exists="append")

Check for any potential embedding failures:

In [249]:
failed_pages_df = page_embeddings_df[page_embeddings_df["embedding_status"] != ""]
print(f"Found {len(failed_pages_df)} rows that failed embedding generation.")

failed_pages_df[['page_id', 'embedding_status']].head()

Found 47 rows that failed embedding generation.


Unnamed: 0,page_id,embedding_status
20943,pages/PMC11349736/fpubh-12-1357891_page_2,INVALID_ARGUMENT: Multimodal embedding failed ...
21064,pages/PMC10959929/41598_2024_Article_57627_page_6,INVALID_ARGUMENT: Multimodal embedding failed ...
21179,pages/PMC12021309/JEP-31-0_page_9,INVALID_ARGUMENT: Multimodal embedding failed ...
21191,pages/PMC11124383/nutrients-16-01410_page_4,INVALID_ARGUMENT: Multimodal embedding failed ...
21202,pages/PMC11346270/12889_2024_Article_19856_pag...,INVALID_ARGUMENT: Multimodal embedding failed ...


Delete or retry those few failed rows since the subsequent vector index creation won't work with empty `embedding` values.

In [250]:
prediction_result_df = embedding_model.predict(failed_pages_df['image'], max_retries=1)

corrected_pages_df = failed_pages_df.assign(
    embedding=prediction_result_df['ml_generate_embedding_result'],
    embedding_status=prediction_result_df['ml_generate_embedding_status']
)

In [253]:
newly_successful_df = corrected_pages_df[corrected_pages_df["embedding_status"] == ""]
print(f"Fixed {len(newly_successful_df)} rows that previously failed embedding generation.")


Fixed 47 rows that previously failed embedding generation.


Merge new corrected rows into existing `page_embeddings` table

In [254]:
temp_table_id = f"{DATASET_ID}.temp_corrected_embeddings"
newly_successful_df.to_gbq(temp_table_id, if_exists="replace")
print(f"Staged corrected rows in temporary table: {temp_table_id}")

# Use a MERGE statement to update the matching rows of the main table in place.
# Only `embedding` and `embedding_status` are overwritten; all other columns are left untouched.
merge_sql = f"""
MERGE `{PROJECT_ID}.{DATASET_ID}.page_embeddings` AS target
USING `{PROJECT_ID}.{DATASET_ID}.temp_corrected_embeddings` AS source
ON target.page_id = source.page_id
WHEN MATCHED THEN
    UPDATE SET
    target.embedding = source.embedding,
    target.embedding_status = source.embedding_status
"""

print("Merging corrections into the main page_embeddings table...")
merge_job = bq_client.query(merge_sql)
merge_job.result()  # Wait for the MERGE job to complete

Staged corrected rows in temporary table: temp_corrected_embeddings
Merging corrections into the main page_embeddings table...


<google.cloud.bigquery.table._EmptyRowIterator at 0x15f753b10>
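The effect of the `MERGE` statement can be illustrated with plain Python dictionaries. This is only a sketch of the `WHEN MATCHED THEN UPDATE` semantics on toy data, not how BigQuery executes the statement; the function name and the rows below are hypothetical.

```python
def merge_corrections(target_rows: list[dict], corrected_rows: list[dict]) -> list[dict]:
    """Update embedding fields of target rows whose page_id appears in corrected_rows."""
    corrections = {row["page_id"]: row for row in corrected_rows}
    merged = []
    for row in target_rows:
        fix = corrections.get(row["page_id"])
        if fix is not None:
            # Matched: overwrite only the two embedding columns.
            row = {
                **row,
                "embedding": fix["embedding"],
                "embedding_status": fix["embedding_status"],
            }
        merged.append(row)
    # Source rows without a match are ignored, since there is no WHEN NOT MATCHED clause.
    return merged


# Toy rows, for illustration only.
target = [
    {"page_id": "p1", "embedding": [0.1, 0.2], "embedding_status": ""},
    {"page_id": "p2", "embedding": [], "embedding_status": "INVALID_ARGUMENT"},
]
corrected = [{"page_id": "p2", "embedding": [0.3, 0.4], "embedding_status": ""}]
merged_rows = merge_corrections(target, corrected)
print(merged_rows)
```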

## Vector Search in BigFrames

### 1. Create Vector Index

Let's create a vector index over the `page_embeddings` table using the BigFrames BigQuery built-in function `create_vector_index`.

Note: vector index creation requires tables with more than 5000 rows.

In [263]:
import bigframes.bigquery as bbq

bbq.create_vector_index(
    table_id = f"{DATASET_ID}.page_embeddings",
    column_name = "embedding",
    stored_column_names=['page_id', 'pmcid', 'page_number', 'journal_book', 'publication_year', 'title', 'abstract', 'pdf_gcs_uri', 'page_pdf_uri', 'page_image_uri'],
    replace= True,
    index_name = "page_embedding_index",
    distance_type="cosine",
    index_type= "ivf"
)

Check the status of the vector index and percentage coverage:

In [264]:
sql_query = f"SELECT * FROM `{PROJECT_ID}.{DATASET_ID}`.INFORMATION_SCHEMA.VECTOR_INDEXES WHERE index_name = 'page_embedding_index'"

index_status = (
    bpd.read_gbq(sql_query, use_cache=False)
    [["index_name", "table_name", "index_status", "coverage_percentage", "last_refresh_time", "unindexed_row_count", "total_logical_bytes", "total_storage_bytes"]]
)
index_status

Unnamed: 0,index_name,table_name,index_status,coverage_percentage,last_refresh_time,unindexed_row_count,total_logical_bytes,total_storage_bytes
0,page_embedding_index,page_embeddings,ACTIVE,0,,206964,0,0


### 2. Semantic Search over all pages

In [266]:
def semantic_search_pages(query, top_k=5):
    query_embedding_df = embedding_model.predict([query])

    vector_search_results = bbq.vector_search(
        base_table=f"{DATASET_ID}.page_embeddings",
        column_to_search="embedding",
        query=query_embedding_df,
        query_column_to_search="ml_generate_embedding_result",
        distance_type="COSINE",
        fraction_lists_to_search = 0.15,
        top_k=top_k
    )

    return vector_search_results.rename(columns={'content': 'query'})
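Conceptually, the search ranks pages by cosine distance between the query embedding and each page embedding. The sketch below is an exact, brute-force version of that ranking in plain Python on toy two-dimensional vectors; it is meant only to illustrate the distance computation. The real search above runs inside BigQuery and, with an IVF index and `fraction_lists_to_search`, examines only a fraction of the index lists, so its results are approximate.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def brute_force_top_k(query: list[float], pages: dict[str, list[float]], top_k: int = 5):
    """Return the top_k (page_id, distance) pairs, closest first."""
    scored = [(page_id, cosine_distance(query, emb)) for page_id, emb in pages.items()]
    return sorted(scored, key=lambda item: item[1])[:top_k]


# Toy embeddings, for illustration only (real page embeddings have 1408 dimensions).
toy_pages = {"page_a": [1.0, 0.0], "page_b": [0.0, 1.0], "page_c": [1.0, 1.0]}
nearest = brute_force_top_k([1.0, 0.1], toy_pages, top_k=2)
print(nearest)
```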


Let's run a simple semantic search

In [None]:
text_question = "What are the effects of ketogenic diet for patients with multiple sclerosis?"
results = semantic_search_pages(text_question)

In [268]:
results['image'] = results['page_image_uri'].str.to_blob()
results[["query", "page_id", "title", "page_image_uri", "image", "distance"]].sort_values("distance")


Unnamed: 0,query,page_id,title,page_image_uri,image,distance
0,What are the effects of ketogenic diet for patients with multiple sclerosis?,pages/PMC11085120/nutrients-16-01258_page_7,Ketogenic Diet in the Treatment of Epilepsy,gs://bq-ai-compete-pmc-oa/pages/PMC11085120/nutrients-16-01258_page_7.png,,0.856061
0,What are the effects of ketogenic diet for patients with multiple sclerosis?,pages/PMC12389322/nutrients-17-02713_page_4,The Role of Nutrition and Physical Activity in Modulating Disease Progression and Quality of Life in Multiple Sclerosis,gs://bq-ai-compete-pmc-oa/pages/PMC12389322/nutrients-17-02713_page_4.png,,0.859029
0,What are the effects of ketogenic diet for patients with multiple sclerosis?,pages/PMC12389322/nutrients-17-02713_page_5,The Role of Nutrition and Physical Activity in Modulating Disease Progression and Quality of Life in Multiple Sclerosis,gs://bq-ai-compete-pmc-oa/pages/PMC12389322/nutrients-17-02713_page_5.png,,0.861456
0,What are the effects of ketogenic diet for patients with multiple sclerosis?,pages/PMC10890917/10-1055-s-0044-1779269_page_2,Ketogenic diet in pharmacoresistant epilepsies: a clinical nutritional assessment,gs://bq-ai-compete-pmc-oa/pages/PMC10890917/10-1055-s-0044-1779269_page_2.png,,0.872318
0,What are the effects of ketogenic diet for patients with multiple sclerosis?,pages/PMC12378136/40620_2025_Article_2285_page_12,Ketogenic diets in chronic kidney disease patients: a review for skeptics by skeptics,gs://bq-ai-compete-pmc-oa/pages/PMC12378136/40620_2025_Article_2285_page_12.png,,0.872888


## Putting it all together for a multimodal RAG

### Helper Dataframe function for LLM predict with multi-blob

Let's create a wrapper function for BigFrames LLM predict call which takes a single text prompt and a dataframe of relevant pages (vector search results), and performs a single predict call with multiple blobs. This will be used in our RAG application relying on BigFrames only.

In [271]:
from bigframes.ml.llm import GeminiTextGenerator

def predict_with_multiple_blobs(
    model: GeminiTextGenerator,
    question: str,
    pages_df: bpd.DataFrame,
    temperature: float = 0,
    max_output_tokens: int = 1024,
) -> bpd.DataFrame:
    """
    Generates a response from a single text prompt and multiple page blobs.
    This version works directly with a BigFrames DataFrame to improve efficiency.

    Args:
        model: An initialized GeminiTextGenerator model.
        question: The textual part of the prompt.
        pages_df: The BigFrames DataFrame of retrieved pages; it must contain
            a `page_pdf_uri` column.
        temperature: The model temperature for controlling randomness.
        max_output_tokens: The maximum number of tokens to generate.

    Returns:
        A BigFrames DataFrame containing the model's single response.
    """
    # 1. Extract the single-page PDF URIs from the pages DataFrame.
    uri_series = pages_df['page_pdf_uri']
    
    # 2. Convert the URIs to blobs.
    # This assumes a default connection is configured, for example:
    # bpd.options.bigquery.connection = "your-project.your-region.your-connection"
    blob_series = uri_series.str.to_blob()

    # 3. Pivot the blob Series to transform the rows of images into columns
    #    in a single-row DataFrame.
    blob_df = blob_series.to_frame(name="blob").reset_index(drop=True)
    blob_df['col_id'] = "image_" + blob_df.index.to_series().astype(str)
    blob_df['row_id'] = 0
    predict_df = blob_df.pivot(index="row_id", columns="col_id", values="blob")

    # 4. Construct the prompt with the text question and the new image columns.
    prompt = [
        "You are a helpful biomedical research assistant. ",
        "Based on the provided PDF pages retrieved from scientific biomedical articles, ",
        "provide a comprehensive and synthesized answer to this question: ",
        f"{question}"
    ]
    for col in predict_df.columns:
        prompt.append(predict_df[col])

    # For debugging, print the full prompt array
    print("Full prompt array for LLM:")
    for item in prompt:
        if isinstance(item, str):
            print(f"  - (str) {item[:100]}...")
        elif isinstance(item, bpd.Series):
            print(f"  - (bpd.Series) {item.name}: {item.iloc[0]}")
        else:
            print(f"  - (other) {type(item)}")

    # 5. Call predict with the single-row DataFrame.
    response_df = model.predict(
        predict_df,
        prompt=prompt,
        temperature=temperature,
        max_output_tokens=max_output_tokens,
    )
    return response_df
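The pivot step is the least obvious part of this function: it turns N rows (one per retrieved page) into a single row with N columns, so that one `predict` call sees every page. The plain-Python sketch below imitates that reshaping with a dictionary standing in for the single-row DataFrame. The helper name, the shortened instruction text, and the URIs are hypothetical.

```python
def build_multi_blob_prompt(
    question: str, page_uris: list[str]
) -> tuple[dict[str, str], list[str]]:
    """Mimic the pivot: one column per page, then a prompt that references each column."""
    # One column per retrieved page, named image_0, image_1, ... as in the function above.
    single_row = {f"image_{i}": uri for i, uri in enumerate(page_uris)}
    prompt = [
        "Based on the provided PDF pages, answer this question: ",
        question,
    ]
    # Append every page column so the model receives all pages in a single call.
    prompt.extend(single_row.values())
    return single_row, prompt


# Hypothetical URIs, for illustration only.
row, prompt_parts = build_multi_blob_prompt(
    "toy question",
    ["gs://example-bucket/pages/a_page_1.pdf", "gs://example-bucket/pages/a_page_2.pdf"],
)
print(list(row))
print(len(prompt_parts))
```

In the real function the prompt elements after the question are BigFrames Series of blobs rather than strings, but the shape of the prompt list is the same.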

Let's give that predict wrapper function a try with the same text question:

In [None]:
model = GeminiTextGenerator(model_name="gemini-2.0-flash-001")
text_question = "What are the effects of ketogenic diet for patients with multiple sclerosis?"

# Get the single response from the model
response = predict_with_multiple_blobs(
    model=model,
    question=text_question,
    pages_df=results
)

print("\n--- LLM Response ---")
print(response['ml_generate_text_llm_result'].iloc[0])

Full prompt array for LLM:
  - (str) You are a helpful biomedical research assistant. ...
  - (str) Based on the provided PDF pages retrieved from scientific biomedical articles, ...
  - (str) provide a comprehensive and synthesized answer to this question: ...
  - (str) What are the effects of ketogenic diet for patients with multiple sclerosis?...
  - (bpd.Series) image_0: {'uri': 'gs://bq-ai-compete-pmc-oa/pages/PMC12389322/nutrients-17-02713_page_4.pdf', 'version': None, 'details': None}
  - (bpd.Series) image_1: {'uri': 'gs://bq-ai-compete-pmc-oa/pages/PMC12389322/nutrients-17-02713_page_5.pdf', 'version': None, 'details': None}
  - (bpd.Series) image_2: {'uri': 'gs://bq-ai-compete-pmc-oa/pages/PMC11085120/nutrients-16-01258_page_7.pdf', 'version': None, 'details': None}
  - (bpd.Series) image_3: {'uri': 'gs://bq-ai-compete-pmc-oa/pages/PMC10890917/10-1055-s-0044-1779269_page_2.pdf', 'version': None, 'details': None}
  - (bpd.Series) image_4: {'uri': 'gs://bq-ai-compete-pmc-oa/pag


--- LLM Response ---
Based on the provided text, here's a synthesized summary of the effects of the ketogenic diet (KD) for patients with multiple sclerosis (MS):

**Potential Benefits:**

*   **Neuroprotection:** KD may exert neuroprotective effects. It has been linked to increased levels of brain-derived neurotrophic factor (BDNF), which supports neuronal survival and promotes neuroplasticity.
*   **Anti-inflammatory Effects:** KD modulates the inflammatory response implicated in MS pathology. It suppresses the activation of microglia and astrocytes (key mediators of neuroinflammation), reduces oxidative stress, and enhances mitochondrial function.
*   **Reduced Demyelination and Neuronal Damage:** In experimental models, KD administration led to reduced infiltration of immune cells into the central nervous system (CNS) and decreased myelin-reactive T cell responses, thereby attenuating demyelination and neuronal damage.
*   **Improved Disability and Disease Progression:** Clinical 

### Multimodal RAG

Let's put everything together:

In [None]:
def multimodal_rag(
    model: GeminiTextGenerator,
    query_text: str = None,
    top_k: int = 5,
    temperature: float = 0,
    max_output_tokens: int = 1024,
) -> str:
    """
    Performs Retrieval-Augmented Generation (RAG) using multimodal search.

    This function first retrieves the most relevant document pages using a
    semantic search. It then passes the user's question and the retrieved
    pages to a generative model to synthesize a final answer, using the
    `predict_with_multiple_blobs` llm utility function.

    Args:
        model (GeminiTextGenerator):
            An instance of the GeminiTextGenerator model.
        query_text (str, optional): 
            The text part of the user's query.
        top_k (int):
            The number of top matching pages to retrieve for context.
        temperature (float):
            The model temperature for controlling randomness.
        max_output_tokens (int):
            The maximum number of tokens to generate.

    Returns:
        str:
            The synthesized answer from the generative model.
    """
    print("\n--- User Query ---")
    if query_text: print(f"Text: {query_text}")
    
    search_results = semantic_search_pages(
        query_text,
        top_k=top_k
    )
        
    print("\nGenerating final answer with RAG...")
    response_df = predict_with_multiple_blobs(
        model, 
        query_text, 
        search_results,
        temperature=temperature,
        max_output_tokens=max_output_tokens
    )

    llm_response_text = response_df['ml_generate_text_llm_result'].iloc[0]
    print("\n--- LLM Response ---")
    print(llm_response_text)
    return llm_response_text
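Stripped of the BigQuery specifics, the function above is a two-step retrieve-then-generate flow. The sketch below shows that control flow with stand-in callables, which can be handy for testing the orchestration logic without calling any model. All names here are hypothetical, and the empty-result guard is an addition of this sketch rather than something `multimodal_rag` does.

```python
from typing import Callable, Sequence


def rag_answer(
    question: str,
    retrieve: Callable[[str, int], Sequence[str]],
    generate: Callable[[str, Sequence[str]], str],
    top_k: int = 5,
) -> str:
    """Retrieve the top_k pages for a question, then generate an answer from them."""
    pages = retrieve(question, top_k)
    if not pages:
        # Guard added in this sketch only: nothing to ground the answer on.
        return "No relevant pages were found."
    return generate(question, pages)


def fake_retrieve(question: str, top_k: int) -> Sequence[str]:
    """Stand-in for the vector search step."""
    corpus = ["page_1.pdf", "page_2.pdf", "page_3.pdf"]
    return corpus[:top_k]


def fake_generate(question: str, pages: Sequence[str]) -> str:
    """Stand-in for the LLM call."""
    return f"Answer to {question!r} based on {len(pages)} page(s)."


answer_text = rag_answer("toy question", fake_retrieve, fake_generate, top_k=2)
print(answer_text)
```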

In [None]:
model = GeminiTextGenerator(model_name="gemini-2.0-flash-001")
text_question = "What are the effects of ketogenic diet for patients with multiple sclerosis?"

## Other sample questions to ask...
# text_question = "What are the most effective exercise routines for reducing biomarkers of metabolic disease, and can you show me data for both men and women?"
# text_question = "What is the evidence for a direct link between unhealthy diets and an increased risk of dementia later in life?"
# text_question = "I'm a public health official. Provide me with the key takeaways from the research on effective community-based interventions that have successfully improved adolescent eating habits in underserved populations."

answer = multimodal_rag(model, text_question)


--- User Query ---
Text: What are the effects of ketogenic diet for patients with multiple sclerosis?



Generating final answer with RAG...



--- LLM Response ---
Based on the provided PDF pages, the effects of the ketogenic diet (KD) for patients with multiple sclerosis (MS) are:

*   **Neuroprotective, anti-inflammatory, and metabolic benefits:** Preliminary studies suggest that KD may exert these benefits, potentially contributing to improvements in neurological function, disease progression, and patient quality of life.
*   **Reduction in inflammation and neuroprotection:** KD modulates the inflammatory response implicated in MS pathology. It suppresses the activation of microglia and astrocytes, enhances mitochondrial function, reduces oxidative stress, reduces infiltration of immune cells into the central nervous system (CNS), decreases myelin-reactive T cell responses, and increases levels of brain-derived neurotrophic factor (BDNF).
*   **Improvements in disability and disease progression:** Clinical evidence supports a potential role for KD in improving physical disability and altering disease trajectory in MS. Stu

## Cleanup

In [None]:
# Clean up GCP assets created as part of bigframes remote_function
def cleanup_remote_function_assets(remote_udf, ignore_failures=False):
    """Clean up the GCP assets behind a bigframes remote function."""

    session = bpd.get_global_session()

    # Clean up BQ remote function
    try:
        session.bqclient.delete_routine(remote_udf.bigframes_remote_function)
    except Exception:
        # Re-raise unless the caller chose to ignore cleanup failures
        if not ignore_failures:
            raise

    # Clean up cloud function
    try:
        session.cloudfunctionsclient.delete_function(name=remote_udf.bigframes_cloud_function)
    except Exception:
        # Re-raise unless the caller chose to ignore cleanup failures
        if not ignore_failures:
            raise

# Delete remote functions
# cleanup_remote_function_assets(pdf_split)

# Close session to delete associated resources
# bpd.close_session()

# Delete GCS bucket
# !gsutil -m rm -r gs://{BUCKET_NAME}

# Delete BigQuery dataset
# !bq rm -r -f {PROJECT_ID}:{DATASET_ID}