## 📄🔍 A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

This notebook demonstrates how to use the **Schema-Miner** package to discover and refine structured schemas from scientific literature with the help of **large language models (LLMs)** and **expert feedback**.

Schemas help turn **unstructured scientific text** (papers, specifications) into structured, machine-readable representations. With Schema-Miner, you can:  
- Automatically extract candidate schemas from domain documents  
- Iteratively refine schemas with expert feedback 
- Validate schemas against a wider corpus of literature

We’ll walk through the process step by step, using **Atomic Layer Deposition (ALD)** as an example domain.

## 🔎 Workflow Overview

The schema mining workflow has **three main stages**:

1. **Initial Schema Mining** → Generate an initial schema from a process specification using an LLM  
2. **Preliminary Schema Refinement** → Incorporate expert feedback and a small curated corpus of scientific papers  
3. **Finalize Schema Refinement** → Refine and validate the schema against a larger corpus of scientific papers

A complete workflow diagram and detailed explanation can be found [here](../../README.md).

**Overview**
1. [Installing Schema_miner](#installing-schema-miner)
2. [Configuration](#configuration)
3. [Stage 01: Initial Schema Mining](#initial-schema-mining)
4. [Stage 02: Preliminary Schema Refinement](#pre-schema-refinement)
5. [Stage 03: Finalize Schema Refinement](#finalize-schema-refinement)

### 🗂️ Installing Schema_miner <a id='installing-schema-miner'></a>

You can install schema-miner via **pip**:

```bash
pip install schema-miner
```

Or directly from GitHub (latest development version):

```bash
git clone https://github.com/sciknoworg/schema-miner.git
cd schema-miner
pip install -r requirements.txt
```

### 🛠️ Configuration <a id='configuration'></a>

Before running schema-miner, configure your environment.

In [18]:
import warnings
warnings.filterwarnings("ignore")

In [19]:
# Configure logging
import logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

In [None]:
from schema_miner.config.envConfig import EnvConfig
EnvConfig.OPENAI_api_key = '<insert-your-openai-key>'
EnvConfig.OPENAI_organization_id = '<insert-your-openai-organization-id>'
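
If you prefer not to hard-code credentials in the notebook, one option is to read them from environment variables. The sketch below is only an illustration: the helper `read_required_env` and the variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` are assumptions made for this example, not schema-miner conventions.

```python
import os


def read_required_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage (variable names are assumptions, not schema-miner conventions):
# EnvConfig.OPENAI_api_key = read_required_env("OPENAI_API_KEY")
# EnvConfig.OPENAI_organization_id = read_required_env("OPENAI_ORG_ID")
```

Failing early with a clear message tends to be easier to debug than an authentication error raised later in the middle of a schema extraction call.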

### 📂 Data Setup

Next, tell schema-miner where to find its input documents:

- **Domain specification document(s)** → used for initial schema mining  
- **Curated corpus** (high-quality literature) → used for refinement  
- **Broader corpus** (larger set of papers) → used for final validation

Update the paths below according to your setup.

In [20]:
# Process Specification for Stage 1
process_specification_filepath = '../data/stage-1/Atomic-Layer-Deposition/Experimental-Usecase'
process_specification_filename = 'ALD-Process-Development.pdf'

# Small curated corpus of scientific papers
scientific_paper_stage2_dir = '../data/stage-2/Atomic-Layer-Deposition/research-papers/experimental-usecase'

# Bigger corpus of scientific papers
scientific_paper_stage3_dir = '../data/stage-3/Atomic-Layer-Deposition/research-papers/experimental_usecase'
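
Because a mistyped path usually surfaces only when PDF extraction fails, it can be worth checking the inputs up front. The following is a minimal sketch using only the standard library; `find_missing_paths` is a hypothetical helper introduced here, not part of schema-miner.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the subset of `paths` (as strings) that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]


# Hypothetical usage with the variables defined above:
# missing = find_missing_paths([
#     Path(process_specification_filepath) / process_specification_filename,
#     scientific_paper_stage2_dir,
#     scientific_paper_stage3_dir,
# ])
# if missing:
#     raise FileNotFoundError(f"Missing input paths: {missing}")
```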

### 🧪 Process Setup

In this notebook, we demonstrate schema mining on the domain of **Atomic Layer Deposition (ALD)**.  
This process involves alternating exposures of a substrate to chemical precursors, enabling precise thin-film growth at the atomic scale.

> 💡 To adapt this notebook to a new domain, replace the domain name/description here and update the input literature paths in the Data Setup section.

In [21]:
# Set the name and description of the process whose schema will be extracted
from schema_miner.config.processConfig import ProcessConfig
ProcessConfig.Process_name = "Atomic Layer Deposition"
ProcessConfig.Process_description = "An ALD process involves a series of controlled chemical reactions used to deposit thin films on a surface at an atomic level"

### ⚙️ Stage 01: Initial Schema Mining <a id='initial-schema-mining'></a>

During Stage 1, the LLM designs an initial JSON schema from a process specification document that contains limited information about an ALD process. This example uses OpenAI's GPT-4o model (`gpt-4o`) with the prompt template provided [here](../../schema_miner/prompts/schema_extraction/prompt_template1.py).

In [22]:
import json
from pathlib import Path
from schema_miner.pdf_text_extractor import pdf_text_extractor
from schema_miner.schema_extractor.extract_schema import extract_schema_stage1

In [None]:
# Large Language Model (LLM) to be used for Schema Extraction
llm_model_name = 'gpt-4o'

results_file_path = "../results/stage-1/Atomic-Layer-Deposition/experimental-schema"
process_specification = pdf_text_extractor(process_specification_filepath, process_specification_filename, return_text = True)
schema = extract_schema_stage1(llm_model_name, process_specification, results_file_path, save_schema = True)

In [24]:
print(f'{ProcessConfig.Process_name} Schema:\n{json.dumps(schema, indent = 2)}')

Atomic Layer Deposition Schema:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Atomic Layer Deposition Process",
  "type": "object",
  "properties": {
    "reactantSelection": {
      "type": "object",
      "description": "Details about the precursor and co-reactant selection.",
      "properties": {
        "precursor": {
          "type": "string",
          "description": "The chemical compound used as the precursor."
        },
        "coReactant": {
          "type": "string",
          "description": "The chemical compound used as the co-reactant."
        },
        "deliveryMethod": {
          "type": "string",
          "enum": [
            "vapor drawn",
            "carrier gas assisted",
            "bubbling"
          ],
          "description": "Method of delivering the precursor to the chamber."
        }
      },
      "required": [
        "precursor",
        "coReactant"
      ]
    },
    "chemicalComposition": {
      "type": "object",
 

In [25]:
expert_feedback_stage1 = 'The reactivity should be mentioned either at the process conditions or film properties. It’s the result of deposition under specific conditions not a standard property of the precursor or co-reactant.'
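
When preparing feedback like the statement above, an expert may find a flat list of property paths easier to scan than the nested JSON. The sketch below walks a JSON Schema dictionary and collects dotted paths; `list_property_paths` is a hypothetical helper, and the `example` schema is a hand-written fragment modeled on the output shown earlier, not the full mined schema.

```python
def list_property_paths(schema, prefix=""):
    """Recursively collect dotted paths of all properties in a JSON Schema dict."""
    paths = []
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        paths.append(path)
        if isinstance(sub, dict):
            # Descend into nested objects.
            paths.extend(list_property_paths(sub, path))
            # Descend into array item definitions, marking them with "[]".
            items = sub.get("items")
            if isinstance(items, dict):
                paths.extend(list_property_paths(items, path + "[]"))
    return paths


# Hand-written fragment for illustration only.
example = {
    "type": "object",
    "properties": {
        "reactantSelection": {
            "type": "object",
            "properties": {
                "precursor": {"type": "string"},
                "coReactant": {"type": "string"},
            },
        }
    },
}
print(list_property_paths(example))
# ['reactantSelection', 'reactantSelection.precursor', 'reactantSelection.coReactant']
```

In the notebook, calling `list_property_paths(schema)` on the dictionary returned by Stage 1 should give a comparable overview, assuming the schema nests its fields under `properties` as in the output above.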

### 🔄 Stage 02: Preliminary Schema Refinement <a id='pre-schema-refinement'></a>

During Stage 2, the initial JSON schema from Stage 1 is refined using a curated dataset of scientific literature on the ALD process, compiled by domain experts. This stage is iterative: expert feedback is incorporated into each schema the LLM generates. In this notebook, we use feedback from a single expert to demonstrate how it fits into the workflow. The prompt used for this stage is provided [here](../../schema_miner/prompts/schema_extraction/prompt_template2.py).

In [None]:
from schema_miner.schema_extractor.extract_schema import extract_schema_stage2
    
schema = Path("../results/stage-1/Atomic-Layer-Deposition/experimental-schema/gpt-4o.json")
results_file_path = "../results/stage-2/Atomic-Layer-Deposition/experimental-schema"
scientific_paper = pdf_text_extractor(scientific_paper_stage2_dir, '1 Groner et al.pdf', return_text = True)
schema = extract_schema_stage2(llm_model_name, schema, expert_feedback_stage1, scientific_paper, results_file_path, save_schema = True)
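
The cell above refines the schema with a single paper. To iterate over every PDF in a corpus directory, a loop along the following lines could be used. This is a sketch under stated assumptions: `refine_over_corpus` is a hypothetical helper, the reader and refinement functions are passed in by the caller, and it is assumed (not confirmed by the source) that the refinement function accepts the schema returned by the previous iteration and that the same feedback is reused for every paper.

```python
from pathlib import Path


def refine_over_corpus(schema, corpus_dir, read_paper, refine_step):
    """Refine `schema` once per PDF in `corpus_dir`, in sorted filename order.

    `read_paper(directory, filename)` returns the paper text, and
    `refine_step(schema, paper_text)` returns the updated schema.
    Both callables are supplied by the caller.
    """
    for pdf in sorted(Path(corpus_dir).glob("*.pdf")):
        paper_text = read_paper(str(pdf.parent), pdf.name)
        schema = refine_step(schema, paper_text)
    return schema


# Hypothetical wiring with the functions used in this notebook:
# schema = refine_over_corpus(
#     schema,
#     scientific_paper_stage2_dir,
#     read_paper=lambda d, f: pdf_text_extractor(d, f, return_text=True),
#     refine_step=lambda s, text: extract_schema_stage2(
#         llm_model_name, s, expert_feedback_stage1, text,
#         results_file_path, save_schema=True),
# )
```

In a real human-in-the-loop run, an expert would typically review the schema between iterations and update the feedback, so the `refine_step` callable is the natural place to plug that in.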

In [27]:
print(f'{ProcessConfig.Process_name} Schema:\n{json.dumps(schema, indent = 2)}')

Atomic Layer Deposition Schema:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Atomic Layer Deposition Process",
  "type": "object",
  "properties": {
    "reactantSelection": {
      "type": "object",
      "description": "Details about the precursor and co-reactant selection.",
      "properties": {
        "precursor": {
          "type": "string",
          "description": "The chemical compound used as the precursor."
        },
        "coReactant": {
          "type": "string",
          "description": "The chemical compound used as the co-reactant."
        },
        "deliveryMethod": {
          "type": "string",
          "enum": [
            "vapor drawn",
            "carrier gas assisted",
            "bubbling"
          ],
          "description": "Method of delivering the precursor to the chamber."
        }
      },
      "required": [
        "precursor",
        "coReactant"
      ]
    },
    "chemicalComposition": {
      "type": "object",
 

In [28]:
expert_feedback_stage2 = '1. Remove the reactor property in the additional details and include this in the reactor design property in the process conditions unit under the reactor name.\n2. Bubblertemperatures: merge it with the precursor temperature property under the BubblerTemperature name in the precursor unit.\n3. Substrate type and substrate units should be merged under the Substrate name'

### ✅ Stage 03: Finalize Schema Refinement <a id='finalize-schema-refinement'></a>

Stage 3 follows the same workflow as Stage 2; the key difference is that it draws on a larger corpus of scientific literature on the ALD process. The goal is to give the LLM a broader, more comprehensive view of the target process. The prompt used for this stage is provided [here](../../schema_miner/prompts/schema_extraction/prompt_template3.py).

In [None]:
from schema_miner.schema_extractor.extract_schema import extract_schema_stage3
    
schema = Path("../results/stage-2/Atomic-Layer-Deposition/experimental-schema/gpt-4o.json")
results_file_path = "../results/stage-3/Atomic-Layer-Deposition/experimental-schema"
scientific_paper = pdf_text_extractor(scientific_paper_stage3_dir, '1-Mattinen et al.pdf', return_text = True)
schema = extract_schema_stage3(llm_model_name, schema, expert_feedback_stage2, scientific_paper, results_file_path, save_schema = True)
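
To see what a refinement stage actually changed, one simple approach is to compare the property paths of the schema before and after. The sketch below is illustrative only: `property_paths` and `diff_schemas` are hypothetical helpers, and the two toy schemas are hand-written to mirror the kind of merge requested in the Stage 2 feedback, not real outputs.

```python
def property_paths(schema, prefix=""):
    """Return the set of dotted property paths in a JSON Schema dict."""
    paths = set()
    for name, sub in schema.get("properties", {}).items():
        path = f"{prefix}.{name}" if prefix else name
        paths.add(path)
        if isinstance(sub, dict):
            paths |= property_paths(sub, path)
    return paths


def diff_schemas(old, new):
    """List property paths added to or removed from `old` to obtain `new`."""
    old_paths, new_paths = property_paths(old), property_paths(new)
    return {
        "added": sorted(new_paths - old_paths),
        "removed": sorted(old_paths - new_paths),
    }


# Toy schemas for illustration only.
before = {"properties": {"substrateType": {}, "substrateUnits": {}}}
after = {"properties": {"substrate": {"properties": {"type": {}}}}}
print(diff_schemas(before, after))
# {'added': ['substrate', 'substrate.type'], 'removed': ['substrateType', 'substrateUnits']}
```

In the notebook, the Stage 2 file could be loaded with `json.loads(Path(...).read_text())` and compared against the dictionary returned by `extract_schema_stage3`, assuming both are plain JSON Schema dictionaries.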

In [30]:
print(f'{ProcessConfig.Process_name} Schema:\n{json.dumps(schema, indent = 2)}')

Atomic Layer Deposition Schema:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Atomic Layer Deposition Process",
  "type": "object",
  "properties": {
    "reactantSelection": {
      "type": "object",
      "description": "Details about the precursor and co-reactant selection.",
      "properties": {
        "precursor": {
          "type": "string",
          "description": "The chemical compound used as the precursor."
        },
        "coReactant": {
          "type": "string",
          "description": "The chemical compound used as the co-reactant."
        },
        "deliveryMethod": {
          "type": "string",
          "enum": [
            "vapor drawn",
            "carrier gas assisted",
            "bubbling"
          ],
          "description": "Method of delivering the precursor to the chamber."
        }
      },
      "required": [
        "precursor",
        "coReactant"
      ]
    },
    "chemicalComposition": {
      "type": "object",
 