## üß†üîó SciKGExtract: Agentic Pipeline for Scientific Knowledge Graph Extraction - Example Usage

This notebook demonstrates how to use **SciKGExtract** to extract structured knowledge from scientific documents, normalize the extracted knowledge against external databases such as [PubChem](https://pubchem.ncbi.nlm.nih.gov/), evaluate the quality of the extraction, refine the extraction based on feedback, and finally populate an ORKG knowledge graph.

The SciKGExtract framework uses an **Agentic Pipeline architecture** in which agents execute sequentially to perform the tasks involved in knowledge extraction. An Orchestrator Agent coordinates the components of the pipeline: extraction, normalization, evaluation, refinement, and KG population. We will see how these components/agents work together to populate a knowledge graph with high-quality structured data extracted from scientific literature.

We will walk through the process step-by-step, using the **ZnO ALD processes** extraction as an example.

#### 📋 Overview
We will explore the following scenarios to demonstrate the capabilities of SciKGExtract:
1. **Basic Structured Knowledge Extraction** → Extract structured knowledge from scientific documents without normalization, evaluation, or refinement.
2. **Knowledge Extraction with Normalization** → Extract structured knowledge and normalize it using external databases like PubChem.
3. **Knowledge Extraction with Normalization, Evaluation, and Refinement** → Extract structured knowledge, normalize it, evaluate the quality of the extraction, and refine it based on feedback.
4. **Populating an ORKG Knowledge Graph** → Populate an ORKG knowledge graph with the refined structured knowledge extracted from scientific documents.

#### ⚙️ Section 1: Initial Configurations
We start with the initial configuration needed for the extraction process: setting up the LLM models, defining the input and output directories, and providing the process description for ZnO ALD.

##### 📦 Import Necessary Packages and Modules

In [1]:
# Append the parent directory to sys.path
import sys
sys.path.append("..")

# Apply nest_asyncio to allow nested event loops
import nest_asyncio
nest_asyncio.apply()

In [2]:
# Python Imports
import json

# SciKG-Extract Config Imports
from scikg_extract.config.process.processConfig import ProcessConfig
from scikg_extract.config.llm.envConfig import EnvConfig

# Import Utilities
from scikg_extract.utils.log_handler import LogHandler
from scikg_extract.utils.file_utils import read_json_file, read_text_file

##### 📝 Configure Logging

In [3]:
# Setup and Initialize Module Logging
logger = LogHandler.setup_module_logging("scikg_extract")

##### 🤖 Setting Up LLM Models

In [4]:
# LLM to be used for structured knowledge extraction
llm_name_extraction = "gpt-4o"

# LLM to be used for Normalization especially for chemical entities disambiguation
llm_name_normalization = "gpt-5"

# LLM to be used for Reflection/Evaluation of the extracted knowledge
llm_name_reflection = "gpt-4o"

# LLM to be used for Feedback formulation based on the evaluation results
llm_name_feedback = "gpt-4o"

In [5]:
# Set OpenAI API Key and Organization ID
EnvConfig.OPENAI_api_key = '<insert-your-openai-key>'
EnvConfig.OPENAI_organization_id = '<insert-your-openai-organization-id>'
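
Pasting a key directly into a notebook cell makes it easy to commit by accident. As a hedged alternative, the sketch below reads the credentials from environment variables instead. The variable names `OPENAI_API_KEY` and `OPENAI_ORGANIZATION_ID` and the helper `read_openai_credentials` are assumptions made for this example, not part of SciKGExtract.

```python
import os
from typing import Mapping, Optional, Tuple


def read_openai_credentials(
    env: Mapping[str, str] = os.environ,
) -> Tuple[str, Optional[str]]:
    """Return (api_key, organization_id) read from environment variables.

    The variable names are assumptions of this sketch. The organization ID
    is treated as optional and returned as None when it is not set.
    """
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    organization_id = env.get("OPENAI_ORGANIZATION_ID", "").strip() or None
    return api_key, organization_id


# Intended use in this notebook (EnvConfig is imported in an earlier cell).
# Whether EnvConfig accepts a missing organization ID is not confirmed here.
# api_key, organization_id = read_openai_credentials()
# EnvConfig.OPENAI_api_key = api_key
# EnvConfig.OPENAI_organization_id = organization_id
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.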

##### 📂 Input and Output Directories

In [5]:
# Scientific document containing ZnO ALD experimental processes in markdown format
scientific_docs_dir = "../data/research-papers/ALD/markdown/ZnO-IGZO-papers/experimental-usecase/ZnO/ZnEt2 - H2O/2 Lujala et al.md"
scientific_document = read_text_file(scientific_docs_dir)

# ALD Process Schema path for experimental processes
process_schema_path = "../data/schemas/ALD-experimental/ALD-experimental-schema.json"
process_schema = read_json_file(process_schema_path)

# Domain-expert curated examples for ZnO ALD processes
examples_path = "../data/examples/Atomic-layer-deposition/ZnO/example1.txt"
examples = read_text_file(examples_path)

# Manually curated PubChem synonym to CID mapping dictionary
pubchem_lookup_dict_path = "../data/resources/PubChem-Synonym-CID.json"
synonym_to_cid_mapping = read_json_file(pubchem_lookup_dict_path)

# PubChem LMDB database created from PubChem CID data dump
lmdb_pubchem_path = "../data/external/pubchem/pubchem_cid_lmdb"

##### 🧪 ZnO ALD Process Description

In [6]:
# Process Name
ProcessConfig.Process_name = "Atomic Layer Deposition"

ProcessConfig.Process_description = """
Atomic layer deposition (ALD) is a surface-controlled thin film deposition technique that can enable ultimate control over the film thickness, uniformity on large-area substrates and conformality on 3D (nano)structures. Each ALD cycle consists of at least two half-cycles (but can be more complex), containing a precursor dose step and a co-reactant exposure step, separated by purge or pump steps. Ideally the same amount of material is deposited in each cycle, due to the self-limiting nature of the reactions of the precursor and co-reactant with the surface groups on the substrate. By carrying out a certain number of ALD cycles, the targeted film thickness can be obtained.

In this extraction task, we are focusing on ZnO (Zinc Oxide) thin film deposition via ALD. A ZnO ALD (Zinc Oxide Atomic Layer Deposition) process deposits thin ZnO films through sequential, self-limiting surface reactions between a zinc precursor and an oxidant. The process typically consists of repeating ALD cycles, each containing a precursor pulse (e.g., diethylzinc (DEZ), Zn(acac)₂, or Zn(thd)₂), a purge step, an oxidant pulse (commonly H₂O, O₃, or O₂ plasma), followed by another purge. These reactions form a conformal zinc-oxygen layer per cycle with precise thickness control. The aim of a ZnO ALD process is to produce high-quality, uniform, conformal ZnO films with controlled thickness, crystallinity (amorphous or polycrystalline depending on temperature), and stoichiometry.
"""

#### 🎯 Section 2: Basic Structured Knowledge Extraction
In this section, we demonstrate basic structured knowledge extraction from scientific documents using the SciKGExtract framework. We extract information related to ZnO processes from a scientific paper downloaded from the [AtomicLimits Database](https://www.atomiclimits.com/alddatabase/). All subsequent sections use the same article as input, which keeps the results comparable as the extraction process evolves.

##### 📦 Import the Necessary Modules and Packages

In [7]:
# Import Orchestrator Agent
from scikg_extract.agents.orchestrator_agent import orchestrate_extraction_workflow

# Import Configurations
from scikg_extract.config.agents.orchestrator import OrchestratorConfig
from scikg_extract.config.agents.workflow import WorkflowConfig

# Import Data Models
from data.models.schema.ALD_experimental_schema import ALDProcessList

##### ⚙️ Orchestrator Agent Configuration
The Orchestrator Agent coordinates the different components of the extraction pipeline. We configure it with the parameters needed for the basic extraction task.

In [8]:
# Initialize orchestrator configuration
orchestrator_config = OrchestratorConfig(
    llm_name=llm_name_extraction,
    process_schema=process_schema,
    scientific_document=scientific_document,
    examples=examples,
    extraction_data_model=ALDProcessList
)

##### 🔀 Workflow Configuration
The workflow configuration defines different flags that control the behavior of the extraction pipeline. For basic extraction, we will set the flags to disable normalization, evaluation, and refinement.

In [10]:
# Initialize the Workflow configuration
workflow_config = WorkflowConfig(normalize_extracted_data=False, clean_extracted_data=False, validate_extracted_data=False)
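
One way to picture the flags above is as switches that decide which optional stages run after extraction. The following is a minimal, self-contained sketch of that idea. `ToyWorkflowFlags`, `make_stage`, `build_stages`, and `run_pipeline` are hypothetical names invented for illustration, the stages are placeholders that only record that they ran, and the stage order is an assumption rather than a description of how SciKGExtract is implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]
Stage = Callable[[State], State]


@dataclass
class ToyWorkflowFlags:
    """Hypothetical stand-in mirroring the three flags passed to WorkflowConfig."""
    normalize_extracted_data: bool = False
    clean_extracted_data: bool = False
    validate_extracted_data: bool = False


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(state: State) -> State:
        return {**state, "trace": state["trace"] + [name]}
    return stage


def build_stages(flags: ToyWorkflowFlags) -> List[Stage]:
    """Extraction always runs; each flag switches on one optional stage.

    The stage order used here is an assumption of this sketch.
    """
    stages = [make_stage("extract")]
    if flags.normalize_extracted_data:
        stages.append(make_stage("normalize"))
    if flags.clean_extracted_data:
        stages.append(make_stage("clean"))
    if flags.validate_extracted_data:
        stages.append(make_stage("validate"))
    return stages


def run_pipeline(flags: ToyWorkflowFlags) -> State:
    """Run the enabled stages one after another, threading the state through."""
    state: State = {"trace": []}
    for stage in build_stages(flags):
        state = stage(state)
    return state


basic_run = run_pipeline(ToyWorkflowFlags())
full_run = run_pipeline(
    ToyWorkflowFlags(normalize_extracted_data=True, validate_extracted_data=True)
)
print(basic_run["trace"])  # ['extract']
print(full_run["trace"])   # ['extract', 'normalize', 'validate']
```

With all flags left at `False`, only the extraction stage runs, which corresponds to the basic scenario configured in this section.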

##### ▶️ Execute the Basic Structured Knowledge Extraction Workflow
With the configurations in place, we can now execute the basic structured knowledge extraction workflow using the SciKGExtract framework. The orchestrator agent will manage the flow of data and control between the different components of the pipeline to extract structured knowledge from the input scientific documents.

In [None]:
# Extract knowledge using the orchestrator agent
final_state = orchestrate_extraction_workflow(orchestrator_config, workflow_config)

# Get the extracted knowledge from the final state
extracted_knowledge = final_state["extracted_json"]

In [None]:
# Display Extracted Knowledge
print(json.dumps(extracted_knowledge, indent=4))
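
Printing is enough for a quick look, but it can also help to keep the result on disk so later runs can be compared against it. The sketch below writes a JSON-serializable object to a file using only the standard library. The helper name `save_json` and the output path are assumptions for illustration, and the demo dictionary is toy data rather than real pipeline output.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_json(data: Any, path: Path) -> Path:
    """Write `data` as indented UTF-8 JSON, creating parent folders if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps characters such as chemical subscripts readable.
        json.dump(data, handle, indent=4, ensure_ascii=False)
    return path


# Demo with toy data in a temporary directory. In the notebook one would
# pass `extracted_knowledge` and a path of one's own choosing instead.
demo_dir = Path(tempfile.mkdtemp())
saved_path = save_json(
    {"material": "ZnO", "coreactant": "H₂O"},
    demo_dir / "outputs" / "basic_extraction.json",
)
reloaded = json.loads(saved_path.read_text(encoding="utf-8"))
```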

#### 🔗 Knowledge Extraction with Normalization
In this section, we extend the basic structured knowledge extraction process by normalizing the extracted data against external databases such as PubChem. Normalization standardizes the extracted information, especially properties such as chemical names and identifiers, by mapping similar entities to a common identifier from the PubChem database.
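
To make the idea of mapping different spellings to one identifier concrete, here is a minimal sketch of a two-tier lookup: a small curated dictionary is consulted first, then a larger index. Everything in it is an assumption made for illustration. The function names are hypothetical, plain dictionaries stand in for the curated JSON file and the LMDB index loaded in Section 1, the lookup order is a guess, the identifiers are placeholders rather than real PubChem CIDs, and the LLM-based disambiguation step is left out.

```python
from typing import Dict, Mapping, Optional


def normalize_name(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def build_index(raw: Mapping[str, str]) -> Dict[str, str]:
    """Normalize the keys of a synonym-to-identifier mapping once, up front."""
    return {normalize_name(synonym): cid for synonym, cid in raw.items()}


def lookup_cid(
    name: str,
    curated: Mapping[str, str],
    indexed: Mapping[str, str],
) -> Optional[str]:
    """Return an identifier for `name`, preferring the curated mapping.

    Returns None when neither mapping knows the name.
    """
    key = normalize_name(name)
    if key in curated:
        return curated[key]
    return indexed.get(key)


# Placeholder identifiers, NOT real PubChem CIDs.
curated_synonyms = build_index({"Diethylzinc": "CID-PLACEHOLDER-A"})
indexed_synonyms = build_index(
    {"DEZ": "CID-PLACEHOLDER-A", "Water": "CID-PLACEHOLDER-B"}
)

for mention in ["  diethylzinc ", "DEZ", "water", "unknown precursor"]:
    print(repr(mention), "->", lookup_cid(mention, curated_synonyms, indexed_synonyms))
```

In this toy data the two spellings of the zinc precursor resolve to the same placeholder identifier, which is the effect normalization aims for.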

##### ⚙️ Orchestrator Agent Configuration
The Orchestrator Agent configuration is similar to the basic extraction setup, with three additional normalization properties: the PubChem LMDB path, which holds an indexed PubChem synonym-to-CID mapping for fast lookup; the synonym-to-CID mapping dictionary, which holds manually curated synonyms and their PubChem CIDs; and the name of the LLM used for disambiguation during normalization.

In [13]:
# Initialize orchestrator configuration
orchestrator_config = OrchestratorConfig(
    llm_name=llm_name_extraction,
    normalization_llm_name=llm_name_normalization,
    process_schema=process_schema,
    scientific_document=scientific_document,
    examples=examples,
    extraction_data_model=ALDProcessList,
    pubchem_lmdb_path=lmdb_pubchem_path,
    synonym_to_cid_mapping=synonym_to_cid_mapping
)

##### 🔀 Workflow Configuration
The workflow configuration for normalization will enable the normalization flag while keeping evaluation and refinement disabled.

In [14]:
# Initialize the Workflow configuration
workflow_config = WorkflowConfig(normalize_extracted_data=True, clean_extracted_data=False, validate_extracted_data=False)

##### ▶️ Execute the Knowledge Extraction with Normalization Workflow
With the updated configurations, we can now execute the knowledge extraction workflow with normalization enabled. The orchestrator agent will coordinate the extraction and normalization processes to produce standardized structured knowledge from the input scientific documents.

In [None]:
# Extract knowledge using the orchestrator agent
final_state = orchestrate_extraction_workflow(orchestrator_config, workflow_config)

# Get the extracted knowledge from the final state
normalized_extracted_knowledge = final_state["normalized_json"]

In [None]:
# Display Extracted Knowledge
print(json.dumps(normalized_extracted_knowledge, indent=4))
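
To see what normalization changed, it can be useful to list the fields that differ between two JSON-like results. The helper below is a generic sketch for that purpose. `diff_paths` is a hypothetical name, it assumes nothing about the actual output schema, and the demo dictionaries are toy data with a placeholder identifier. Note that in this notebook the basic and normalized results come from separate LLM runs, so some differences may reflect run-to-run variation rather than normalization.

```python
from typing import Any, List


def diff_paths(before: Any, after: Any, prefix: str = "") -> List[str]:
    """Return the paths at which two JSON-like values differ."""
    if isinstance(before, dict) and isinstance(after, dict):
        changed: List[str] = []
        for key in sorted(set(before) | set(after)):
            path = f"{prefix}.{key}" if prefix else str(key)
            if key not in before or key not in after:
                # Key was added or removed.
                changed.append(path)
            else:
                changed.extend(diff_paths(before[key], after[key], path))
        return changed
    if isinstance(before, list) and isinstance(after, list) and len(before) == len(after):
        changed = []
        for index, (old, new) in enumerate(zip(before, after)):
            changed.extend(diff_paths(old, new, f"{prefix}[{index}]"))
        return changed
    # Scalars, mismatched types, or lists of different length.
    return [] if before == after else [prefix or "<root>"]


# Toy data; the identifier is a placeholder, not a real PubChem CID.
toy_before = {"precursor": {"name": "DEZ"}, "steps": ["dose", "purge"]}
toy_after = {
    "precursor": {"name": "DEZ", "cid": "CID-PLACEHOLDER-A"},
    "steps": ["dose", "pump"],
}
toy_diff = diff_paths(toy_before, toy_after)
print(toy_diff)  # ['precursor.cid', 'steps[1]']
```

In the notebook, a call such as `diff_paths(extracted_knowledge, normalized_extracted_knowledge)` would give a similar list for the real results.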

#### ✅ Knowledge Extraction with Normalization, Evaluation, and Refinement
In this section, we will further enhance the knowledge extraction process by incorporating LLM-as-a-Judge evaluation and refinement based on feedback. This will help improve the quality of the extracted knowledge by assessing its completeness and correctness, and refining it based on the evaluation results.
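
The evaluate-then-refine idea can be sketched as a small loop with a retry budget. This is only an illustration under stated assumptions: `evaluate_and_refine`, `toy_judge`, and `toy_refine` are hypothetical names, the judge is a simple completeness-style check rather than an LLM, the data is toy data, and `max_retries` merely plays the role that `total_validation_retries` appears to play in the workflow configuration used later in this section.

```python
from typing import Callable, Dict, List, Tuple

Extraction = Dict[str, str]
Judge = Callable[[Extraction], List[str]]            # returns issues; empty means accepted
Refiner = Callable[[Extraction, List[str]], Extraction]


def evaluate_and_refine(
    extraction: Extraction,
    judge: Judge,
    refine: Refiner,
    max_retries: int = 1,
) -> Tuple[Extraction, List[str], int]:
    """Judge the extraction and refine it until accepted or the budget is spent.

    Returns the final extraction, any remaining issues, and the retries used.
    """
    issues = judge(extraction)
    retries = 0
    while issues and retries < max_retries:
        extraction = refine(extraction, issues)
        retries += 1
        issues = judge(extraction)
    return extraction, issues, retries


# Toy stand-ins for the judge and the refiner.
REQUIRED_FIELDS = ("precursor", "coreactant")
TOY_DOCUMENT_FACTS = {"precursor": "diethylzinc", "coreactant": "H2O"}


def toy_judge(extraction: Extraction) -> List[str]:
    """Completeness-style check: report required fields that are missing."""
    return [field for field in REQUIRED_FIELDS if field not in extraction]


def toy_refine(extraction: Extraction, issues: List[str]) -> Extraction:
    """Fill each reported field from the toy document facts, if available."""
    patch = {f: TOY_DOCUMENT_FACTS[f] for f in issues if f in TOY_DOCUMENT_FACTS}
    return {**extraction, **patch}


refined, remaining, used = evaluate_and_refine(
    {"precursor": "diethylzinc"}, toy_judge, toy_refine, max_retries=1
)
print(refined, remaining, used)
```

Bounding the loop keeps cost predictable: each retry in a real pipeline would mean additional LLM calls.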

##### 📦 Import the Validation Rubrics
We will import the necessary validation rubrics that will be used by the LLM-as-a-Judge to evaluate the quality of the extracted knowledge. These rubrics are an extension of the [YESciEval framework](https://github.com/sciknoworg/YESciEval).

In [8]:
# Validation Rubrics Imports
from scikg_extract.evaluation.rubrics.informativeness import Completeness, Correctness

##### ⚙️ Orchestrator Agent Configuration
The orchestrator agent configuration now includes additional parameters for evaluation and refinement: the LLM model for reflection, the list of rubrics used for evaluation, and the LLM model for feedback incorporation.

In [None]:
# Initialize orchestrator configuration
orchestrator_config = OrchestratorConfig(
    llm_name=llm_name_extraction,
    normalization_llm_name=llm_name_normalization,
    process_schema=process_schema,
    scientific_document=scientific_document,
    examples=examples,
    extraction_data_model=ALDProcessList,
    pubchem_lmdb_path=lmdb_pubchem_path,
    synonym_to_cid_mapping=synonym_to_cid_mapping,
    reflection_llm_name=llm_name_reflection,
    rubrics=[Completeness, Correctness],
    feedback_llm_name=llm_name_feedback
)

##### 🔀 Workflow Configuration
The workflow configuration now enables normalization, evaluation, and refinement flags to activate the respective components in the extraction pipeline.

In [None]:
# Initialize the Workflow configuration
workflow_config = WorkflowConfig(normalize_extracted_data=True, clean_extracted_data=False, validate_extracted_data=True, total_validation_retries=1)

##### ▶️ Execute the Extraction Workflow with Normalization, Evaluation, and Refinement
With all the configurations set, we can now execute the knowledge extraction workflow that includes normalization, evaluation, and refinement. The orchestrator agent will oversee the entire process, ensuring that each component functions correctly and that the final output is a high-quality structured knowledge representation extracted from the scientific documents.

In [None]:
# Extract knowledge using the orchestrator agent
final_state = orchestrate_extraction_workflow(orchestrator_config, workflow_config)

# Get the extracted knowledge from the final state
normalized_extracted_knowledge = final_state["normalized_json"]

In [None]:
# Display Extracted Knowledge
print(json.dumps(normalized_extracted_knowledge, indent=4))