### Instructions for Running Grobid Docker

Before running this script, make sure you have Grobid running as a Docker container.

You can start Grobid using the following command:

```bash
docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.0
```

This exposes the Grobid REST API at `http://localhost:8070`.
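
Before sending PDFs, it can help to confirm the container is actually reachable. The sketch below is one way to do that, assuming Grobid's `/api/isalive` health endpoint. The helper name `grobid_is_alive` is hypothetical, and it uses the standard library's `urllib` rather than `requests` so that it runs without extra dependencies.

```python
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if the Grobid service answers its health check, else False."""
    try:
        # Assumption: the service exposes GET /api/isalive and answers 200 when ready.
        with urllib.request.urlopen(f"{base_url}/api/isalive", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, DNS failures, and HTTP error statuses.
        return False
```

Calling `grobid_is_alive()` at the top of the notebook gives a quick yes/no answer before any long-running extraction starts.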

In [10]:
import os
import re

import requests
from lxml import etree

In [3]:
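# --- File & Path Helpers ---
def ensure_dir_exists(folder):
    """Create `folder` (and any missing parents) if it does not already exist."""
    os.makedirs(folder, exist_ok=True)

def get_file_path(filename, folder):
    """Build an absolute path to `filename` inside `folder`, relative to the working directory."""
    return os.path.join(os.getcwd(), folder, filename)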

In [13]:
# --- Config & Path ---
PDF_FILENAME = "2021Bouza.pdf"  # Change this to your target file
PROCESSED_DIR = "data/processed"
BASENAME = os.path.splitext(PDF_FILENAME)[0]
PYMUPDF_TXT_PATH = os.path.join(PROCESSED_DIR, f"{BASENAME}_pymupdf.txt")

In [24]:
# --- Extraction Logic ---
def extract_metadata_from_pymupdf(txt_path):
    with open(txt_path, "r", encoding="utf-8") as f:
        text = f.read()

    # Title
    title_match = re.search(r"(?<=\| )(.*?nm wavelength range)", text, re.DOTALL)
    title = title_match.group(1).strip() if title_match else ""

    # Abstract
    abstract_match = re.search(r"ABSTRACT\s*(.*?)\s*I\. INTRODUCTION", text, re.DOTALL)
    abstract = abstract_match.group(1).strip() if abstract_match else ""

    # Authors
    authors_match = re.search(r"wavelength range\s+(.*?)\s+AFFILIATIONS", text, re.DOTALL)
    authors = authors_match.group(1).replace('\n', ' ').strip() if authors_match else ""

    # Sections
    sections = re.findall(r"\n([IVX]+\. [A-Z \-]+)\n(.*?)(?=\n[IVX]+\. |$)", text, re.DOTALL)

    return title, authors, abstract, sections

In [26]:
# --- Run Extraction ---
title, authors, abstract, sections = extract_metadata_from_pymupdf(PYMUPDF_TXT_PATH)

# --- Output ---
print("Title:", title)
print("Authors:", authors)
print("Abstract:", abstract)
print("Sections:")
for heading, content in sections:
    print(f"\n{heading.strip()}")
    print(content.strip()[:300], "...")

Title: DECEMBER 02 2021
The spectrum of a 1-μm-wavelength-driven tin microdroplet
laser-produced plasma source in the 5.5–265.5 nm
wavelength range
Z. Bouza 
 ; J. Byers 
 ; J. Scheers 
 ; R. Schupp 
 ; Y. Mostafa; L. Behnke; Z. Mazzotta 
 ; J. Sheil 
 ;
W. Ubachs 
 ; R. Hoekstra 
 ; M. Bayraktar 
 ; O. O. Versolato  
AIP Advances 11, 125003 (2021)
https://doi.org/10.1063/5.0073839
Articles You May Be Interested In
Radiation transport and scaling of optical depth in Nd:YAG laser-produced microdroplet-tin plasma
Appl. Phys. Lett. (September 2019)
Production of 13.5 nm light with 5% conversion efficiency from 2 μ m laser-driven tin microdroplet plasma
Appl. Phys. Lett. (December 2023)
Laser-induced vaporization of a stretching sheet of liquid tin
J. Appl. Phys. (February 2021)
 05 July 2025 13:58:50


--- Page 2 ---

AIP Advances
ARTICLE
scitation.org/journal/adv
The spectrum of a 1-μm-wavelength-driven tin
microdroplet laser-produced plasma source
in the 5.5–265.5 nm wavelength range
A

In [28]:
title

'DECEMBER 02 2021\nThe spectrum of a 1-μm-wavelength-driven tin microdroplet\nlaser-produced plasma source in the 5.5–265.5 nm\nwavelength range\nZ. Bouza \n ; J. Byers \n ; J. Scheers \n ; R. Schupp \n ; Y. Mostafa; L. Behnke; Z. Mazzotta \n ; J. Sheil \n ;\nW. Ubachs \n ; R. Hoekstra \n ; M. Bayraktar \n ; O. O. Versolato \ue923 \nAIP Advances 11, 125003 (2021)\nhttps://doi.org/10.1063/5.0073839\nArticles You May Be Interested In\nRadiation transport and scaling of optical depth in Nd:YAG laser-produced microdroplet-tin plasma\nAppl. Phys. Lett. (September 2019)\nProduction of 13.5\u2009nm light with 5% conversion efficiency from 2 μ m laser-driven tin microdroplet plasma\nAppl. Phys. Lett. (December 2023)\nLaser-induced vaporization of a stretching sheet of liquid tin\nJ. Appl. Phys. (February 2021)\n 05 July 2025 13:58:50\n\n\n--- Page 2 ---\n\nAIP Advances\nARTICLE\nscitation.org/journal/adv\nThe spectrum of a 1-μm-wavelength-driven tin\nmicrodroplet laser-produced plasma source\n
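
The repr above shows why the title capture overshoots. On the first page the title wraps as `nm\nwavelength range`, so the literal phrase `nm wavelength range` (single spaces) first matches on page 2, and the lazy `.*?` swallows everything in between. A minimal sketch of a whitespace-tolerant variant follows. The helper `flexible_phrase` and the sample text are made up for illustration, and the `| ` start anchor is left unchanged, so any text between that anchor and the real title would still be captured.

```python
import re


def flexible_phrase(phrase):
    """Turn a literal phrase into a pattern that accepts any whitespace between its words."""
    return r"\s+".join(re.escape(word) for word in phrase.split())


# Made-up text: a title wrapped across two lines, then repeated unwrapped on a later page.
sample = (
    "HEADER | A study of things in the 1-2 nm\n"
    "wavelength range\n"
    "Some Author\n"
    "--- Page 2 ---\n"
    "A study of things in the 1-2 nm wavelength range\n"
)

# Original style: single literal spaces, so the wrapped occurrence is skipped.
strict = re.search(r"(?<=\| )(.*?nm wavelength range)", sample, re.DOTALL)

# Tolerant style: any run of whitespace (including newlines) between the words.
flexible = re.search(
    r"(?<=\| )(.*?" + flexible_phrase("nm wavelength range") + ")",
    sample,
    re.DOTALL,
)

strict_title = strict.group(1)
# Collapse the line breaks inside the captured title into single spaces.
flexible_title = " ".join(flexible.group(1).split())

print(repr(strict_title))
print(repr(flexible_title))
```

The same idea applies to the authors pattern, which also relies on the literal phrase `wavelength range`.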

In [29]:
sections

[('I. INTRODUCTION',
  'Laser-produced plasma (LPP) generated from liquid tin (Sn)\nmicrodroplets provides extreme ultraviolet (EUV) light for mod-\nern nanolithography,1–7 enabling the continued reduction of feature\nsizes on affordable integrated circuits (ICs). Such laser-produced\nplasmas of tin are characterized by a strong emission peak near\n13.5 nm, originating from transitions between complex excited\nstates in multiply charged Sn10+–Sn15+ ions.8–17\nMultilayer optics are used in industrial lithography machines\nto collect the EUV light from its source and to provide an image\nof the so-called mask onto the wafer. These optics are designed to\nreflect wavelengths in a 2%-wavelength bandwidth centered around\n13.5 nm (the bandwidth limitation is, in part, due to the many ∼10\nrequired reflective surfaces).18,19 As such, most spectroscopic works\non Sn LPPs have focused on the “in-band” wavelength region17,20–24\nor on nearby out-of-band (OOB) EUV emission features,14,23,25–31\n

In [41]:
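# ==== Grobid Extraction ====
def parse_pdf_with_grobid(pdf_path, output_path, service_url="http://localhost:8070/api/processFulltextDocument"):
    """Send a PDF to Grobid's full-text endpoint and save the returned TEI XML."""
    with open(pdf_path, 'rb') as pdf_file:
        # Grobid expects the PDF in a multipart form field named 'input'.
        files = {'input': (pdf_path, pdf_file, 'application/pdf')}
        response = requests.post(service_url, files=files)

    # Fail loudly on any non-200 response and include Grobid's own message.
    if response.status_code != 200:
        raise RuntimeError(f"Grobid error: {response.status_code} {response.text}")

    with open(output_path, "w", encoding="utf-8") as out:
        out.write(response.text)
    print(f"[Grobid] TEI XML written to {output_path}")
    return output_path

# pdf_path and tei_output must already be defined; the cell that sets them is not shown here.
parse_pdf_with_grobid(pdf_path, tei_output)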


[Grobid] TEI XML written to /Users/jamesbyers/code/github/knowledge_graphs/data/processed/2021Bouza_tei.xml


'/Users/jamesbyers/code/github/knowledge_graphs/data/processed/2021Bouza_tei.xml'
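
The cell that turns the TEI file into `title`, `authors`, `abstract`, and `sections` is not shown in this notebook. The sketch below is one possible way to do it, not the notebook's actual parser. It assumes Grobid's usual TEI layout (title under `titleStmt`, authors under `analytic` in the header, abstract under `profileDesc`, sections as `div` elements in `body`). The function name `parse_tei` and the sample document are hypothetical. It uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(elem):
    """Join all text inside an element and collapse runs of whitespace."""
    if elem is None:
        return ""
    return " ".join("".join(elem.itertext()).split())


def parse_tei(tei_xml):
    """Extract title, authors, abstract, and body sections from a TEI XML string."""
    root = ET.fromstring(tei_xml)
    header = root.find("tei:teiHeader", TEI_NS)

    title, authors, abstract = "", [], ""
    if header is not None:
        # Search only the header so that cited works elsewhere in the file are ignored.
        title = _text(header.find(".//tei:titleStmt/tei:title", TEI_NS))
        for pers in header.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
            names = [_text(n) for n in pers.findall("tei:forename", TEI_NS)]
            names.append(_text(pers.find("tei:surname", TEI_NS)))
            authors.append(" ".join(n for n in names if n))
        abstract = _text(header.find(".//tei:profileDesc/tei:abstract", TEI_NS))

    sections = []
    for div in root.findall("tei:text/tei:body/tei:div", TEI_NS):
        sections.append({
            "heading": _text(div.find("tei:head", TEI_NS)),
            "paragraphs": [_text(p) for p in div.findall("tei:p", TEI_NS)],
        })
    return title, authors, abstract, sections


# Made-up TEI document with the element layout assumed above.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title level="a" type="main">Sample title</title></titleStmt>
      <sourceDesc><biblStruct><analytic>
        <author><persName>
          <forename type="first">Ada</forename><surname>Lovelace</surname>
        </persName></author>
      </analytic></biblStruct></sourceDesc>
    </fileDesc>
    <profileDesc><abstract><div><p>Short abstract.</p></div></abstract></profileDesc>
  </teiHeader>
  <text><body>
    <div><head n="I">INTRODUCTION</head><p>First paragraph.</p><p>Second.</p></div>
  </body></text>
</TEI>"""

title_demo, authors_demo, abstract_demo, sections_demo = parse_tei(SAMPLE_TEI)
print(title_demo, authors_demo, abstract_demo, sections_demo)
```

Each section comes back as a dict with `heading` and `paragraphs` keys, which is the shape the printing cell below expects. For a real file, read it with `open(path, encoding="utf-8").read()` and pass the string to `parse_tei`.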

In [None]:
# --- Print/Return Results ---
print("\n" + "="*80)
print("Title:", title)
print("Authors:", authors)
print("Abstract:", abstract[:300] + ("..." if len(abstract) > 300 else ""))
print("Sections:")
for section in sections[:3]:  # Print the first 3 sections as an example
    print(f"\n== {section['heading']} ==")
    if section['paragraphs']:
        first = section['paragraphs'][0]
        print(first[:200] + ("..." if len(first) > 200 else ""))
    else:
        print("[No Content]")
print("="*80)

# Abstract and sections now available as variables for further use.


Title: The spectrum of a 1-μm-wavelength-driven tin microdroplet laser-produced plasma source in the 5.5-265.5 nm wavelength range
Authors: []
Abstract: Production of 13.5 nm light with 5% conversion efficiency from 2 μ m laser-driven tin microdroplet plasma
Sections:


In [9]:
sections

[]