# Simple Materials Data Extraction with Hugging Face (Google Colab)

This notebook performs **simple, single-pass extraction** of materials science data from PDF papers using Hugging Face Transformers with `meta-llama/Llama-3.2-3B-Instruct`.

**What this notebook does:**
1. Load Llama-3.2-3B-Instruct model with 4-bit quantization
2. Extract text from PDF papers
3. Use a single LLM prompt to extract materials data
4. Output structured JSON with compositions and properties

**This is a simplified approach.** For production use with iterative refinement, evaluation, and validation, see the full KnowMat2 pipeline notebook.

## Prerequisites

- Google Colab with GPU runtime (T4 or better)
- Hugging Face account with access to Llama models
- HF token for gated model access

## 1. Setup Environment

In [None]:
# Check GPU availability
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"GPU Memory: {gpu_memory:.1f} GB")
else:
    print("WARNING: No GPU detected!")
    print("Go to Runtime > Change runtime type > Select GPU")

GPU: Tesla T4
GPU Memory: 15.8 GB


In [None]:
# Login to Hugging Face for gated model access
# You need to:
# 1. Create a HF account at https://huggingface.co
# 2. Accept Llama license at https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
# 3. Create an access token at https://huggingface.co/settings/tokens

from huggingface_hub import login
login()  # This will prompt for your HF token

In [None]:
!curl -LsSf https://astral.sh/uv/install.sh | sh
!uv --version

downloading uv 0.9.25 x86_64-unknown-linux-gnu
no checksums to verify
installing to /usr/local/bin
  uv
  uvx
everything's installed!
uv 0.9.25


In [None]:
# Install required packages
!uv pip install -q --system transformers accelerate bitsandbytes
!uv pip install -q --system PyPDF2
print("Dependencies installed!")

Dependencies installed!


## 2. Load Llama-3.2-3B-Instruct Model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"

print(f"Loading {MODEL_ID}...")
print("This may take a few minutes on first run.\n")

# Configure 4-bit quantization to fit in Colab GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,  # newer transformers releases warn that this is deprecated in favor of `dtype`
)

print(f"\nModel loaded successfully!")
print(f"Model device: {model.device}")

Loading meta-llama/Llama-3.2-3B-Instruct...
This may take a few minutes on first run.



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!



Model loaded successfully!
Model device: cuda:0
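
As a rough sanity check on why 4-bit quantization fits comfortably on a T4, the weight footprint can be estimated from the parameter count and the bits stored per weight. The sketch below is back-of-the-envelope arithmetic only. The figure of about 3.2 billion parameters is an approximation, and the estimate ignores activations, the KV cache, and quantization metadata. `estimate_weight_gb` is a hypothetical helper, not part of any library.

```python
def estimate_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: roughly 3.2e9 parameters for the 3B model.
N_PARAMS = 3.2e9

fp16_gb = estimate_weight_gb(N_PARAMS, 16)
nf4_gb = estimate_weight_gb(N_PARAMS, 4)
print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB (before quantization overhead)")
```

On a loaded model, `model.get_memory_footprint()` should report the actual figure in bytes, which is the better number to trust.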


In [None]:
# Create text generation pipeline
text_generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=2048,
    do_sample=False,  # Deterministic output
    return_full_text=False,
)

# Test the model
print("Testing model...")
test_messages = [
    {"role": "user", "content": "Say 'ready' if you can help extract materials science data."}
]
response = text_generator(test_messages)
print(f"Response: {response[0]['generated_text'][:200]}")
print("\nModel is ready!")

Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Testing model...
Response: Ready. What materials science data do you need help extracting?

Model is ready!


## 3. Define Extraction Functions

In [None]:
from PyPDF2 import PdfReader
import json
import re

def extract_text_from_pdf(pdf_path, max_pages=10):
    """Extract text from PDF file.

    Args:
        pdf_path: Path to the PDF file
        max_pages: Maximum number of pages to extract (default 10)

    Returns:
        Extracted text as string
    """
    reader = PdfReader(pdf_path)
    text_parts = []

    for page in reader.pages[:max_pages]:
        page_text = page.extract_text()
        if page_text:
            text_parts.append(page_text)

    return "\n\n".join(text_parts)


def extract_json_from_response(text):
    """Extract JSON object from LLM response text."""
    # Remove markdown code blocks if present
    text = re.sub(r'```json\s*', '', text)
    text = re.sub(r'```\s*', '', text)

    # Find JSON object
    start = text.find('{')
    end = text.rfind('}')

    if start != -1 and end != -1 and end > start:
        json_str = text[start:end+1]
        try:
            return json.loads(json_str)
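        except json.JSONDecodeError:
            # Fallback: the model sometimes emits single-quoted pseudo-JSON.
            # This blunt swap also corrupts apostrophes inside string values.
            json_str = json_str.replace("'", '"')
            try:
                return json.loads(json_str)
            except json.JSONDecodeError:
                pass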

    return None


def extract_materials_data(text, generator, max_text_length=6000):
    """Extract materials science data from paper text using LLM.

    Args:
        text: Paper text to extract from
        generator: HuggingFace text generation pipeline
        max_text_length: Maximum characters of text to send to LLM

    Returns:
        Dictionary with extracted data or error information
    """
    # Truncate text if too long
    if len(text) > max_text_length:
        text = text[:max_text_length] + "\n[... truncated ...]"

    prompt = f"""You are a materials science expert. Extract structured data from this research paper.

Return a JSON object with this exact structure:
{{
  "compositions": [
    {{
      "formula": "chemical formula or composition",
      "processing": "synthesis/processing method",
      "properties": [
        {{
          "name": "property name",
          "value": "measured value",
          "unit": "unit of measurement",
          "conditions": "measurement conditions (optional)"
        }}
      ]
    }}
  ]
}}

Guidelines:
- Extract ALL material compositions mentioned in the paper
- Include ALL measured properties (mechanical, thermal, electrical, optical, etc.)
- Preserve exact values as reported (including inequalities like ">50" or ranges like "10-20")
- Include units for all properties
- Note measurement conditions when provided (temperature, pressure, etc.)

Paper text:
{text}

Return ONLY the JSON object, no other text or explanation."""

    messages = [{"role": "user", "content": prompt}]

    try:
        response = generator(messages)
        response_text = response[0]["generated_text"]

        # Parse JSON from response
        data = extract_json_from_response(response_text)

        if data:
            return data
        else:
            return {
                "error": "Could not parse JSON from response",
                "raw_response": response_text[:1000]
            }

    except Exception as e:
        return {
            "error": str(e),
            "raw_response": None
        }


print("Extraction functions defined!")

Extraction functions defined!
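
The `find('{')` / `rfind('}')` approach in `extract_json_from_response` fails when the model appends text containing a stray brace after the JSON, because the slice then runs past the end of the object. One possible alternative, shown here as a sketch rather than a drop-in replacement, scans for the first position at which `json.JSONDecoder.raw_decode` succeeds. The function name `first_json_object` is hypothetical.

```python
import json

def first_json_object(text):
    """Return the first decodable JSON object in text, or None if there is none."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores what follows.
            obj, _ = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
    return None

reply = 'Here is the data:\n```json\n{"compositions": []}\n```\nNote: use {braces} carefully.'
print(first_json_object(reply))
```

Because parsing stops at the end of the first valid object, trailing commentary and code fences need no separate cleanup.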


## 4. Upload PDF Papers

In [None]:
from google.colab import files
from pathlib import Path

# Create papers directory
papers_dir = Path('/content/papers')
papers_dir.mkdir(exist_ok=True)

print("Upload your PDF papers:")
print("(You can upload multiple files)\n")

uploaded = files.upload()

# Move uploaded files to papers directory
for filename in uploaded.keys():
    if filename.endswith('.pdf'):
        src = Path(filename)
        dst = papers_dir / filename
        src.rename(dst)
        print(f"Saved: {dst}")

Upload your PDF papers:
(You can upload multiple files)



Saving 46248.pdf to 46248.pdf
Saving d2ra01505f.pdf to d2ra01505f.pdf
Saving rosi2003.pdf to rosi2003.pdf
Saving synergy-of-co(h2o)6-2-with-a-polyoxometalate-leads-to-aqueous-homogeneous-hydrogen-evolution-experiments-and.pdf to synergy-of-co(h2o)6-2-with-a-polyoxometalate-leads-to-aqueous-homogeneous-hydrogen-evolution-experiments-and.pdf
Saving the-inconsistency-in-adsorption-properties-and-powder-xrd-data-of-mof-5-is-rationalized-by-framework-interpenetration.pdf to the-inconsistency-in-adsorption-properties-and-powder-xrd-data-of-mof-5-is-rationalized-by-framework-interpenetration.pdf
Saved: /content/papers/46248.pdf
Saved: /content/papers/d2ra01505f.pdf
Saved: /content/papers/rosi2003.pdf
Saved: /content/papers/synergy-of-co(h2o)6-2-with-a-polyoxometalate-leads-to-aqueous-homogeneous-hydrogen-evolution-experiments-and.pdf
Saved: /content/papers/the-inconsistency-in-adsorption-properties-and-powder-xrd-data-of-mof-5-is-rationalized-by-framework-interpenetration.pdf


In [None]:
# List available papers
papers_dir = Path('/content/papers')
pdf_files = list(papers_dir.glob('*.pdf'))

if not pdf_files:
    print(f"No PDF files found in {papers_dir}")
    print("Please upload PDF papers using the cell above.")
    papers_to_process = []
else:
    print(f"Found {len(pdf_files)} PDF papers:\n")
    for i, pdf_path in enumerate(pdf_files, 1):
        size_mb = pdf_path.stat().st_size / (1024 * 1024)
        print(f"{i}. {pdf_path.name} ({size_mb:.2f} MB)")

    papers_to_process = sorted(pdf_files, key=lambda p: p.stat().st_size)
    print(f"\nWill process {len(papers_to_process)} papers.")

Found 5 PDF papers:

1. rosi2003.pdf (0.28 MB)
2. 46248.pdf (0.46 MB)
3. the-inconsistency-in-adsorption-properties-and-powder-xrd-data-of-mof-5-is-rationalized-by-framework-interpenetration.pdf (1.25 MB)
4. d2ra01505f.pdf (0.88 MB)
5. synergy-of-co(h2o)6-2-with-a-polyoxometalate-leads-to-aqueous-homogeneous-hydrogen-evolution-experiments-and.pdf (6.36 MB)

Will process 5 papers.


## 5. Run Simple Extraction

In [None]:
# Run extraction on uploaded papers
from pathlib import Path
import json

if not papers_to_process:
    print("No papers to process. Please upload PDFs first.")
else:
    output_dir = Path('/content/output')
    output_dir.mkdir(exist_ok=True)

    results = []

    for i, paper in enumerate(papers_to_process, 1):
        print(f"\n{'='*60}")
        print(f"Processing [{i}/{len(papers_to_process)}]: {paper.name}")
        print('='*60)

        try:
            # Extract text from PDF
            print("Extracting text from PDF...")
            text = extract_text_from_pdf(paper)
            print(f"Extracted {len(text)} characters")

            # Run extraction
            print("\nRunning materials extraction with Llama...")
            data = extract_materials_data(text, text_generator)

            # Save results
            output_file = output_dir / f"{paper.stem}_extraction.json"
            with open(output_file, 'w') as f:
                json.dump(data, f, indent=2)

            print(f"Results saved to: {output_file}")

            # Display summary
            if 'compositions' in data:
                n_comp = len(data['compositions'])
                total_props = sum(len(c.get('properties', [])) for c in data['compositions'])
                print(f"\nExtracted:")
                print(f"  - {n_comp} compositions")
                print(f"  - {total_props} total properties")

                # Show first few compositions
                for j, comp in enumerate(data['compositions'][:3], 1):
                    formula = comp.get('formula', 'N/A')
                    n_props = len(comp.get('properties', []))
                    print(f"  {j}. {formula} ({n_props} properties)")

                results.append({
                    'paper': paper.name,
                    'status': 'success',
                    'compositions': n_comp,
                    'properties': total_props
                })
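            elif 'error' in data:
                print(f"\nExtraction error: {data['error']}")
                results.append({
                    'paper': paper.name,
                    'status': 'error',
                    'error': data['error']
                })
            else:
                # Valid JSON that matches neither expected shape; record it
                # so the paper still appears in the summary.
                print("\nUnexpected response structure (no 'compositions' key)")
                results.append({
                    'paper': paper.name,
                    'status': 'error',
                    'error': "missing 'compositions' key"
                })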

        except Exception as e:
            print(f"\nFailed: {e}")
            results.append({
                'paper': paper.name,
                'status': 'failed',
                'error': str(e)
            })

    # Print summary
    print(f"\n{'='*60}")
    print("EXTRACTION SUMMARY")
    print('='*60)
    for r in results:
        if r['status'] == 'success':
            print(f"  [OK] {r['paper']}: {r['compositions']} compositions, {r['properties']} properties")
        else:
            print(f"  [X]  {r['paper']}: {r.get('error', 'unknown error')}")


Processing [1/5]: rosi2003.pdf
Extracting text from PDF...
Extracted 19283 characters

Running materials extraction with Llama...
Results saved to: /content/output/rosi2003_extraction.json

Extracted:
  - 3 compositions
  - 5 total properties
  1. Zn4O(BDC)3(BDC/H110051,4-benzenedicarboxylate) (2 properties)
  2. Zn4O(BDC)3(BDC/H110051,4-benzenedicarboxylate) (2 properties)
  3. Zn4O(BDC)3(BDC/H110051,4-benzenedicarboxylate) (1 properties)

Processing [2/5]: 46248.pdf
Extracting text from PDF...
Extracted 24789 characters

Running materials extraction with Llama...
Results saved to: /content/output/46248_extraction.json

Extracted:
  - 1 compositions
  - 4 total properties
  1. single-wall carbon nanotubes (4 properties)

Processing [3/5]: d2ra01505f.pdf
Extracting text from PDF...
Extracted 50516 characters

Running materials extraction with Llama...
Results saved to: /content/output/d2ra01505f_extraction.json

Extracted:
  - 1 compositions
  - 3 total properties
  1. Zn4(1,4-benzo d
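
The run above extracted between 19,283 and 50,516 characters per paper, but `extract_materials_data` sends only the first 6,000 (its `max_text_length` default), so most of each paper never reaches the model. One way around this, sketched below under the assumption that overlapping fixed-size character windows are acceptable, is to split the text into chunks, run extraction on each, and pool the resulting `compositions` lists. `chunk_text` and `extract_all_chunks` are hypothetical helpers, and the overlap of 500 characters is an arbitrary choice.

```python
def chunk_text(text, chunk_size=6000, overlap=500):
    """Split text into overlapping character windows that together cover all of it."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

def extract_all_chunks(text, extract_fn, chunk_size=6000, overlap=500):
    """Run extract_fn on every chunk and pool the compositions it returns."""
    compositions, failed = [], []
    for n, chunk in enumerate(chunk_text(text, chunk_size, overlap), 1):
        data = extract_fn(chunk)
        if isinstance(data, dict) and isinstance(data.get("compositions"), list):
            compositions.extend(data["compositions"])
        else:
            failed.append(n)  # remember which chunk produced no usable result
    return {"compositions": compositions, "failed_chunks": failed}

# Stand-in extractor so the sketch runs without a model.
def fake_extract(chunk):
    return {"compositions": [{"formula": f"chunk of {len(chunk)} chars", "properties": []}]}

result = extract_all_chunks("x" * 19283, fake_extract)
print(len(result["compositions"]), "compositions from", len(chunk_text("x" * 19283)), "chunks")
```

In the notebook this could be called as `extract_all_chunks(text, lambda t: extract_materials_data(t, text_generator))`. Each chunk costs a full generation pass, and overlapping windows can report the same composition twice.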

## 6. View Extraction Results

In [None]:
import json
from pathlib import Path

output_dir = Path('/content/output')
extraction_files = list(output_dir.glob('*_extraction.json'))

if not extraction_files:
    print("No extraction results found.")
else:
    print(f"Found {len(extraction_files)} extraction result(s):\n")

    for ef in extraction_files:
        print(f"{'='*60}")
        print(f"File: {ef.name}")
        print('='*60)

        with open(ef) as f:
            data = json.load(f)

        if 'compositions' in data:
            compositions = data['compositions']
            print(f"\nTotal compositions: {len(compositions)}\n")

            for i, comp in enumerate(compositions[:5], 1):
                print(f"--- Composition {i} ---")
                print(f"Formula: {comp.get('formula', 'N/A')}")
                print(f"Processing: {str(comp.get('processing') or 'N/A')[:80]}")  # tolerate null/non-string values

                props = comp.get('properties', [])
                if props:
                    print(f"Properties ({len(props)}):")
                    for p in props[:5]:
                        name = p.get('name', 'N/A')
                        value = p.get('value', 'N/A')
                        unit = p.get('unit', '')
                        conditions = p.get('conditions', '')
                        cond_str = f" @ {conditions}" if conditions else ""
                        print(f"  - {name}: {value} {unit}{cond_str}")
                    if len(props) > 5:
                        print(f"  ... and {len(props) - 5} more properties")
                print()

            if len(compositions) > 5:
                print(f"... and {len(compositions) - 5} more compositions")

        elif 'error' in data:
            print(f"\nError: {data['error']}")
            if 'raw_response' in data and data['raw_response']:
                print(f"\nRaw response preview:")
                print(data['raw_response'][:500])

Found 5 extraction result(s):

File: rosi2003_extraction.json

Total compositions: 3

--- Composition 1 ---
Formula: Zn4O(BDC)3(BDC/H110051,4-benzenedicarboxylate)
Processing: synthesis/processing method
Properties (2):
  - weight percent of hydrogen: 4.5 weight percent @ at 78 K
  - weight percent of hydrogen: 1.0 weight percent @ at room temperature

--- Composition 2 ---
Formula: Zn4O(BDC)3(BDC/H110051,4-benzenedicarboxylate)
Processing: synthesis/processing method
Properties (2):
  - weight percent of hydrogen: 2.0 weight percent @ at room temperature and 10 bar
  - weight percent of hydrogen: 4.0 weight percent @ at room temperature and 10 bar

--- Composition 3 ---
Formula: Zn4O(BDC)3(BDC/H110051,4-benzenedicarboxylate)
Processing: synthesis/processing method
Properties (1):
  - weight percent of hydrogen: 6.5 weight percent @ design target for automobile fueling

File: d2ra01505f_extraction.json

Total compositions: 1

--- Composition 1 ---
Formula: Zn4(1,4-benzo dicarboxylic ac
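
In the `rosi2003` result above, the same formula string appears as three separate compositions. A post-processing step could merge entries whose formulas match exactly. The sketch below is one possible approach and not part of this notebook's pipeline: `merge_by_formula` is a hypothetical helper, the sample input is illustrative, and exact matching on the whitespace-stripped formula is an assumption that will not catch the same material written two different ways. Whether merging is appropriate also depends on whether the model has mislabeled distinct materials with one formula.

```python
def merge_by_formula(compositions):
    """Merge compositions with identical (whitespace-stripped) formulas."""
    merged = {}
    for comp in compositions:
        key = str(comp.get("formula", "")).strip()
        # The first occurrence of a formula supplies the processing field.
        entry = merged.setdefault(
            key, {"formula": key, "processing": comp.get("processing"), "properties": []}
        )
        for prop in comp.get("properties") or []:
            if prop not in entry["properties"]:  # drop exact duplicates
                entry["properties"].append(prop)
    return list(merged.values())

# Illustrative input only; not taken from a real extraction file.
sample = [
    {"formula": "Zn4O(BDC)3", "processing": "N/A",
     "properties": [{"name": "hydrogen uptake", "value": "4.5", "unit": "wt%"}]},
    {"formula": "Zn4O(BDC)3 ", "processing": "N/A",
     "properties": [{"name": "hydrogen uptake", "value": "1.0", "unit": "wt%"}]},
]
for comp in merge_by_formula(sample):
    print(comp["formula"], len(comp["properties"]), "properties")
```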

## 7. Download Results

In [None]:
from google.colab import files
from pathlib import Path
import shutil

output_dir = Path('/content/output')
extraction_files = list(output_dir.glob('*_extraction.json'))

if extraction_files:
    # Create zip file of results
    shutil.make_archive('/content/extraction_results', 'zip', output_dir)
    files.download('/content/extraction_results.zip')
    print(f"Download started! ({len(extraction_files)} file(s))")
else:
    print("No results to download.")

Download started! (5 file(s))
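
For analysis in a spreadsheet, the nested JSON can be flattened to one row per property. The sketch below uses only the standard library. `flatten_extraction` and the column names are illustrative choices, not part of the notebook, and the example input reuses the sample from the output format shown in the Summary.

```python
import csv
import io

FIELDS = ["paper", "formula", "processing", "name", "value", "unit", "conditions"]

def flatten_extraction(paper, data):
    """Yield one flat dict per (composition, property) pair."""
    for comp in data.get("compositions", []):
        for prop in comp.get("properties") or []:
            yield {
                "paper": paper,
                "formula": comp.get("formula", ""),
                "processing": comp.get("processing", ""),
                "name": prop.get("name", ""),
                "value": prop.get("value", ""),
                "unit": prop.get("unit", ""),
                "conditions": prop.get("conditions", ""),
            }

# Example input in the notebook's output format (values are illustrative).
example = {"compositions": [{
    "formula": "Zr41.2Ti13.8Cu12.5Ni10Be22.5",
    "processing": "Arc melting + water quenching",
    "properties": [{"name": "hardness", "value": "5.2", "unit": "GPa"}],
}]}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(flatten_extraction("example.pdf", example))
print(buffer.getvalue())
```

Writing to a file in `/content/output` instead of a `StringIO` buffer would place the table inside the downloaded zip archive.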


## 8. (Optional) Extract from a Single Paper Interactively

In [None]:
# Interactive extraction from a single paper
# Modify the path below to point to your PDF

pdf_path = "/content/papers/your_paper.pdf"  # <-- Change this!

from pathlib import Path
import json

pdf_file = Path(pdf_path)

if not pdf_file.exists():
    print(f"File not found: {pdf_path}")
    print("\nAvailable papers:")
    for p in Path('/content/papers').glob('*.pdf'):
        print(f"  - {p}")
else:
    print(f"Processing: {pdf_file.name}\n")

    # Extract text
    text = extract_text_from_pdf(pdf_file)
    print(f"Extracted {len(text)} characters\n")

    # Run extraction
    print("Running extraction...\n")
    data = extract_materials_data(text, text_generator)

    # Display results
    print(json.dumps(data, indent=2))

File not found: /content/papers/your_paper.pdf

Available papers:
  - /content/papers/rosi2003.pdf
  - /content/papers/46248.pdf
  - /content/papers/the-inconsistency-in-adsorption-properties-and-powder-xrd-data-of-mof-5-is-rationalized-by-framework-interpenetration.pdf
  - /content/papers/d2ra01505f.pdf
  - /content/papers/synergy-of-co(h2o)6-2-with-a-polyoxometalate-leads-to-aqueous-homogeneous-hydrogen-evolution-experiments-and.pdf


## Summary

This notebook performs **simple, single-pass extraction** of materials science data in four steps:

1. **Load `meta-llama/Llama-3.2-3B-Instruct`** with 4-bit quantization
2. **Extract text** from PDFs using PyPDF2
3. **Use a single prompt** to extract compositions and properties
4. **Output structured JSON** with materials data

**Output format:**
```json
{
  "compositions": [
    {
      "formula": "Zr41.2Ti13.8Cu12.5Ni10Be22.5",
      "processing": "Arc melting + water quenching",
      "properties": [
        {"name": "glass transition temperature", "value": "625", "unit": "K"},
        {"name": "hardness", "value": "5.2", "unit": "GPa"}
      ]
    }
  ]
}
```
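
Because a small model does not always follow the requested schema, it can help to check a parsed result against the format above before using it. The following is a minimal hand-rolled check, not a full JSON Schema validation. `validate_extraction` is a hypothetical helper, and treating `name`, `value`, and `unit` as required is an assumption based on the prompt marking only `conditions` as optional.

```python
def validate_extraction(data):
    """Return a list of problems; an empty list means the shape looks right."""
    if not isinstance(data, dict) or not isinstance(data.get("compositions"), list):
        return ["top level must be an object with a 'compositions' list"]
    problems = []
    for i, comp in enumerate(data["compositions"]):
        if not isinstance(comp, dict):
            problems.append(f"compositions[{i}] is not an object")
            continue
        if not isinstance(comp.get("formula"), str) or not comp["formula"].strip():
            problems.append(f"compositions[{i}] has no formula")
        props = comp.get("properties")
        if not isinstance(props, list):
            problems.append(f"compositions[{i}].properties is not a list")
            continue
        for j, prop in enumerate(props):
            if not isinstance(prop, dict):
                problems.append(f"compositions[{i}].properties[{j}] is not an object")
                continue
            for key in ("name", "value", "unit"):
                if key not in prop:
                    problems.append(f"compositions[{i}].properties[{j}] lacks '{key}'")
    return problems

good = {"compositions": [{
    "formula": "Zr41.2Ti13.8Cu12.5Ni10Be22.5",
    "processing": "Arc melting + water quenching",
    "properties": [{"name": "hardness", "value": "5.2", "unit": "GPa"}],
}]}
bad = {"compositions": [{"formula": "", "properties": [{"name": "hardness"}]}]}

print(validate_extraction(good))
for problem in validate_extraction(bad):
    print(problem)
```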

**Limitations of simple extraction:**
- No iterative refinement based on evaluation
- No hallucination detection or correction
- No aggregation from multiple extraction attempts
- May miss data or include errors
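
A cheap first check for these failure modes is to test whether each extracted value actually occurs in the source text, and whether a field merely echoes a placeholder from the prompt template, as the `processing` field does in the `rosi2003` result, where it reads "synthesis/processing method". The sketch below is a heuristic under those assumptions, not hallucination detection in any rigorous sense. `flag_suspect_fields` and `PLACEHOLDERS` are illustrative names, and the source sentence and input are toy data.

```python
# Placeholder strings copied from the prompt template in extract_materials_data.
PLACEHOLDERS = {
    "chemical formula or composition",
    "synthesis/processing method",
    "property name",
    "measured value",
    "unit of measurement",
}

def flag_suspect_fields(data, source_text):
    """Return (location, reason) pairs for fields that look unreliable."""
    flags = []
    for i, comp in enumerate(data.get("compositions", [])):
        for field in ("formula", "processing"):
            if str(comp.get(field, "")).strip().lower() in PLACEHOLDERS:
                flags.append((f"compositions[{i}].{field}", "echoes prompt placeholder"))
        for j, prop in enumerate(comp.get("properties") or []):
            value = str(prop.get("value", "")).strip()
            # Plain substring test: misses reformatted numbers, catches invented ones.
            if value and value not in source_text:
                flags.append((f"compositions[{i}].properties[{j}].value",
                              "value not found in source text"))
    return flags

# Toy inputs for illustration only.
source = "The material adsorbed 4.5 weight percent hydrogen at 78 K."
data = {"compositions": [{
    "formula": "MOF-5",
    "processing": "synthesis/processing method",
    "properties": [
        {"name": "hydrogen uptake", "value": "4.5", "unit": "wt%"},
        {"name": "hydrogen uptake", "value": "9.9", "unit": "wt%"},
    ],
}]}

for location, reason in flag_suspect_fields(data, source):
    print(location, "->", reason)
```

A substring match also passes values that appear in the text but belong to a different material, so flagged and unflagged entries alike still need human review.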

**For production use**, consider the full KnowMat2 pipeline, which includes:
- Subfield detection
- Iterative extraction with evaluation feedback
- Aggregation and validation
- Quality flagging for human review