## üß†üîó SciKGExtract: Agentic Pipeline for Scientific Knowledge Graph Extraction - Tutorials!

### Tutorial 3: Knowledge Extraction with Normalization and Reflection from Scientific Text
Welcome to the third tutorial of SciKGExtract! In this tutorial, we introduce the reflection agent, which evaluates the extracted knowledge against predefined criteria. Specifically, we use the LLM-as-a-Judge paradigm to assess the quality of the extracted knowledge on rubrics such as completeness and correctness. These evaluations, with rating scores and rationales, give a detailed picture of the extraction quality and identify areas where the extraction agent can improve.

**Prerequisites:**
- Completion of Tutorial 1: [SciKGExtract_Knowledge_Extraction](01_SciKGExtract_Knowledge_Extraction.ipynb)
- Completion of Tutorial 2: [SciKGExtract_Knowledge_Extraction_Normalization](02_SciKGExtract_Knowledge_Extraction_Normalization.ipynb)

**Focus of this Tutorial:**
- How to configure and execute the knowledge extraction workflow with normalization and reflection enabled.
- Different evaluation rubrics used by the reflection agent to assess the quality of the extracted knowledge.
- The nature of the evaluation results including rating scores and rationales provided by the reflection agent.

#### 📋 Overview
We will cover the following sections in this tutorial:
1. [**Initial Configuration**](#section-1-initial-configurations): Setting up the environment, loading necessary libraries, and defining the sample scientific text.
2. [**Process Definition**](#section-2-process-definition): Defining the scientific process for which we want to perform structured knowledge extraction.
3. [**Knowledge Extraction with Normalization and Reflection**](#section-3-knowledge-extraction-with-normalization-and-reflection): Configuring and executing the knowledge extraction workflow with normalization and reflection enabled.
4. [**Interpreting Evaluation Results**](#section-4-interpreting-evaluation-results): Understanding the evaluation results provided by the reflection agent including rating scores and rationales.
5. [**Next Steps**](#section-5-next-steps): Brief overview of what will be covered in the next tutorial.

#### ⚙️ Section 1: Initial Configurations <a id="section-1-initial-configurations"></a>
We will start with the initial configurations necessary for the extraction process. These include loading the required libraries, setting up the LLM models, and defining the input and output directories.

##### Import Necessary Packages and Modules

In [1]:
# Append the parent directory to sys.path
import sys
sys.path.append("../..")

# Apply nest_asyncio to allow nested event loops
import nest_asyncio
nest_asyncio.apply()

In [2]:
# Python Imports
from IPython.display import JSON, display

# SciKG-Extract Config Imports
from scikg_extract.config.process.processConfig import ProcessConfig
from scikg_extract.config.llm.envConfig import EnvConfig

# Import Utilities
from scikg_extract.utils.log_handler import LogHandler
from scikg_extract.utils.file_utils import read_json_file, read_text_file

# Import Orchestrator Agent
from scikg_extract.agents.orchestrator_agent import orchestrate_extraction_workflow

# Import Configurations
from scikg_extract.config.agents.orchestrator import OrchestratorConfig
from scikg_extract.config.agents.workflow import WorkflowConfig

# Import Data Models
from data.models.schema.ALD_experimental_schema import ALDProcessList

##### Configure Logging

In [3]:
# Setup and Initialize Module Logging
logger = LogHandler.setup_module_logging("scikg_extract")

##### Setting Up LLM Model

In [29]:
# LLM to be used for structured knowledge extraction
llm_name_extraction = "gpt-4o"

# LLM to be used for Normalization especially for chemical entities disambiguation
llm_name_normalization = "gpt-5"

# LLM to be used for Reflection/Evaluation of the extracted knowledge
llm_name_reflection = "deepseek-r1"

In [None]:
# Set OpenAI API Key and Organization ID
EnvConfig.OPENAI_api_key = '<insert-your-openai-key>'
EnvConfig.OPENAI_organization_id = '<insert-your-openai-organization-id>'
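
Hardcoding credentials in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, the key could instead be read from environment variables. The variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` and the helper `read_openai_credentials` are assumptions for illustration, not names that SciKGExtract is known to require.

```python
import os


def read_openai_credentials(env=os.environ):
    """Return (api_key, organization_id) read from environment variables."""
    # Variable names are assumptions for this sketch, not SciKGExtract requirements.
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # The organization ID is treated as optional here.
    return api_key, env.get("OPENAI_ORG_ID")
```

The returned values could then be assigned to `EnvConfig.OPENAI_api_key` and `EnvConfig.OPENAI_organization_id` in place of the literal strings.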

##### Input and Output Directories

In [30]:
# Scientific document containing ZnO ALD experimental processes in markdown format
scientific_docs_dir = "../../data/research-papers/ALD/markdown/ZnO-IGZO-papers/experimental-usecase/ZnO/ZnEt2 - H2O/2 Lujala et al.md"
scientific_document = read_text_file(scientific_docs_dir)

# ALD Process Schema path for experimental processes
process_schema_path = "../../data/schemas/ALD-experimental/ALD-experimental-schema.json"
process_schema = read_json_file(process_schema_path)

# Domain-expert curated examples for ZnO ALD processes
examples_path = "../../data/examples/Atomic-layer-deposition/ZnO/example1.txt"
examples = read_text_file(examples_path)

# Manually curated PubChem synonym to CID mapping dictionary
pubchem_lookup_dict_path = "../../data/resources/PubChem-Synonym-CID.json"
synonym_to_cid_mapping = read_json_file(pubchem_lookup_dict_path)

# PubChem LMDB database created from PubChem CID data dump
lmdb_pubchem_path = "../../data/external/pubchem/pubchem_cid_lmdb"
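
Since the workflow depends on several files and one database directory, it can help to confirm that they exist before starting a run that takes minutes. The helper below is a small sketch using only the standard library; `find_missing_paths` is a hypothetical name and not part of SciKGExtract.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk, in input order."""
    return [str(p) for p in paths if not Path(p).exists()]
```

For example, `find_missing_paths([scientific_docs_dir, process_schema_path, examples_path, pubchem_lookup_dict_path, lmdb_pubchem_path])` would return an empty list when every input is in place.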

#### 🧪 Section 2: Process Definition <a id="section-2-process-definition"></a>
In this section, we will define the scientific process for which we want to perform structured knowledge extraction. We will use the Atomic Layer Deposition (ALD) of Zinc Oxide (ZnO) as our example process. We have developed a detailed process schema for [ALD experimental process](../../data/schemas/ALD-experimental/ALD-experimental-schema.json) using our previous work: [**SCHEMA-MINER PRO**](https://github.com/sciknoworg/schema-miner), and will use it to guide the extraction process. Additionally, we will provide domain-expert curated examples to help the extraction agent understand the specific entities and relationships we are interested in.

In [31]:
# Process Name
ProcessConfig.Process_name = "Atomic Layer Deposition"

ProcessConfig.Process_description = """
Atomic layer deposition (ALD) is a surface-controlled thin film deposition technique that can enable ultimate control over the film thickness, uniformity on large-area substrates and conformality on 3D (nano)structures. Each ALD cycle consists of at least two half-cycles (but can be more complex), containing a precursor dose step and a co-reactant exposure step, separated by purge or pump steps. Ideally the same amount of material is deposited in each cycle, due to the self-limiting nature of the reactions of the precursor and co-reactant with the surface groups on the substrate. By carrying out a certain number of ALD cycles, the targeted film thickness can be obtained.

In this extraction task, we are focusing on ZnO (Zinc Oxide) thin film deposition via ALD. A ZnO ALD (Zinc Oxide Atomic Layer Deposition) process deposits thin ZnO films through sequential, self-limiting surface reactions between a zinc precursor and an oxidant. The process typically consists of repeating ALD cycles, each containing a precursor pulse (e.g., diethylzinc (DEZ), Zn(acac)₂, or Zn(thd)₂), a purge step, an oxidant pulse (commonly H₂O, O₃, or O₂ plasma), followed by another purge. These reactions form a conformal zinc-oxygen layer per cycle with precise thickness control. The aim of a ZnO ALD process is to produce high-quality, uniform, conformal ZnO films with controlled thickness, crystallinity (amorphous or polycrystalline depending on temperature), and stoichiometry.
"""

#### 📄 Section 3: Knowledge Extraction with Normalization and Reflection <a id="section-3-knowledge-extraction-with-normalization-and-reflection"></a>
In this section, we further enhance the knowledge extraction process by incorporating LLM-as-a-Judge evaluations as part of the reflection agent. The reflection agent assesses the quality of the extracted knowledge based on rubrics defined in the [YESciEval](https://aclanthology.org/2025.acl-long.675.pdf) framework. Specifically, we evaluate the completeness and correctness of the extracted knowledge. The reflection agent provides a rating score and a rationale for each criterion, which helps us understand the strengths and weaknesses of the extraction process.

##### Import the Validation Rubrics
We will import the necessary validation rubrics that will be used by the LLM-as-a-Judge to evaluate the quality of the extracted knowledge. These rubrics are an extension of the [YESciEval framework](https://github.com/sciknoworg/YESciEval).

In [32]:
# Validation Rubrics Imports
from scikg_extract.evaluation.rubrics.informativeness import Completeness, Correctness
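
To make the LLM-as-a-Judge idea concrete, the sketch below shows one way a rubric could be turned into a judging prompt and how a reply carrying a rating and a rationale could be checked. It is an illustration only: the `Rubric` dataclass, `build_judge_prompt`, `parse_judgement`, the 1 to 5 scale, and the JSON reply format are all assumptions, and none of them is the actual `scikg_extract` or YESciEval implementation.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric description; not the scikg_extract or YESciEval class."""
    name: str
    question: str
    min_score: int = 1  # assumed scale bounds
    max_score: int = 5


def build_judge_prompt(rubric, source_text, extracted_json):
    """Assemble a judging prompt that asks for a rating and a rationale as JSON."""
    return (
        f"Rubric: {rubric.name}\n"
        f"Question: {rubric.question}\n"
        f"Rate from {rubric.min_score} to {rubric.max_score}.\n"
        'Answer as JSON: {"rating": <int>, "rationale": "<text>"}\n\n'
        f"Source text:\n{source_text}\n\n"
        f"Extracted knowledge:\n{json.dumps(extracted_json, indent=2)}"
    )


def parse_judgement(rubric, raw_reply):
    """Parse the judge's JSON reply and check that the rating is within the scale."""
    reply = json.loads(raw_reply)
    rating = int(reply["rating"])
    if not rubric.min_score <= rating <= rubric.max_score:
        raise ValueError(f"rating {rating} is outside the {rubric.name} scale")
    return {"rating": rating, "rationale": str(reply["rationale"])}
```

Asking the judge for a rationale alongside the score is what makes the evaluation actionable: the score says how good the extraction is, while the rationale says what to fix.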

##### Orchestrator Agent Configuration
The orchestrator agent configuration will now include additional parameters for evaluation. The additions include specifying the LLM model to be used as LLM-as-a-Judge, and defining a list of evaluation rubrics to be used for assessing the extracted knowledge.

In [33]:
# Initialize orchestrator configuration
orchestrator_config = OrchestratorConfig(
    llm_name=llm_name_extraction,
    normalization_llm_name=llm_name_normalization,
    process_schema=process_schema,
    scientific_document=scientific_document,
    examples=examples,
    extraction_data_model=ALDProcessList,
    pubchem_lmdb_path=lmdb_pubchem_path,
    synonym_to_cid_mapping=synonym_to_cid_mapping,
    reflection_llm_name=llm_name_reflection,
    rubrics=[Completeness, Correctness]
)

##### Workflow Configuration
The workflow configuration now enables normalization and evaluation flags to activate the respective components in the extraction pipeline.

In [34]:
# Initialize the Workflow configuration
workflow_config = WorkflowConfig(normalize_extracted_data=True, clean_extracted_data=False, validate_extracted_data=True, total_validation_retries=1)

##### Execute the Extraction Workflow with Normalization and Evaluation
With all the configurations set, we can now execute the knowledge extraction workflow, which includes extraction, normalization, and evaluation. The orchestrator agent oversees the entire process and returns a final state containing both the structured knowledge extracted from the scientific text and the evaluation results.

In [35]:
# Extract knowledge using the orchestrator agent
final_state = orchestrate_extraction_workflow(orchestrator_config, workflow_config)

# Get the extracted knowledge from the final state
normalized_extracted_knowledge = final_state["normalized_json"]

2026-01-20 14:40:22 - scikg_extract.agents.orchestrator_agent - INFO - Starting Orchestrator Agent for extraction workflow...
2026-01-20 14:40:22 - scikg_extract.agents.orchestrator_agent - INFO - Compiled the orchestractor workflow graph.
2026-01-20 14:40:22 - scikg_extract.agents.orchestrator_agent - INFO - Invoking the orchestrator workflow...
2026-01-20 14:40:22 - scikg_extract.agents.extraction_agent - INFO - Starting knowledge extraction agent...
2026-01-20 14:40:22 - scikg_extract.agents.extraction_agent - INFO - Executing the extraction StateGraph for structured knowledge extraction...
2026-01-20 14:40:22 - scikg_extract.tools.extraction.structured_knowledge_extraction - INFO - Starting structured knowledge extraction tool...
2026-01-20 14:42:00 - scikg_extract.tools.extraction.json_validator - INFO - Starting JSON validation tool...
2026-01-20 14:42:00 - scikg_extract.tools.extraction.pubchem_normalization - INFO - Starting PubChem normalization tool...
2026-01-20 14:42:15 - s

##### Display Extracted Knowledge
Because the extracted, normalized knowledge follows the defined process schema, which is complex and deeply nested and can contain multiple processes, we display selected parts separately for clarity. Specifically, we display the ALD system, the reactant selection, and the deposition temperature (from the process parameters) of the first extracted process. The complete extracted and normalized knowledge is available in the `normalized_extracted_knowledge` variable.

In [36]:
print("### ALD System ###")
ald_system = normalized_extracted_knowledge["processes"][0]["aldSystem"]
display(JSON(ald_system))

print("### Reactants Selection ###")
reactants_selection = normalized_extracted_knowledge["processes"][0]["reactantSelection"]
display(JSON(reactants_selection))

print("### Process Parameters - Deposition Temperature ###")
process_parameters_temperature = normalized_extracted_knowledge["processes"][0]["processParameters"]["temperature"]
display(JSON(process_parameters_temperature))

### ALD System ###


<IPython.core.display.JSON object>

### Reactants Selection ###


<IPython.core.display.JSON object>

### Process Parameters - Deposition Temperature ###


<IPython.core.display.JSON object>
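
Direct indexing such as `["processes"][0]["processParameters"]["temperature"]` raises an error if the extraction left out any level of that path. A tolerant lookup can be sketched as follows; `get_nested` is a hypothetical helper written for this tutorial, not a SciKGExtract utility.

```python
def get_nested(data, path, default=None):
    """Walk `path` (dict keys and list indices) through `data`.

    Returns `default` as soon as any step is missing or not indexable.
    """
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            return default
    return current
```

With it, `get_nested(normalized_extracted_knowledge, ["processes", 0, "processParameters", "temperature"])` returns `None` instead of failing when the temperature was not extracted.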

#### ⚖️ Section 4: Interpreting Evaluation Results <a id="section-4-interpreting-evaluation-results"></a>
In this section, we will interpret the evaluation results provided by the reflection agent. The evaluation results include rating scores and rationales for each of the defined rubrics: completeness and correctness. We will analyze these results to understand the quality of the extracted knowledge and identify areas for improvement.

In [37]:
# Get the evaluation results from the final state
evaluation_results = final_state["evaluation_results"]

In [38]:
print("### Evaluation Results -- Completeness ###")
display(JSON(evaluation_results["completeness"]))

print("### Evaluation Results -- Correctness ###")
display(JSON(evaluation_results["correctness"]))

### Evaluation Results -- Completeness ###


<IPython.core.display.JSON object>

### Evaluation Results -- Correctness ###


<IPython.core.display.JSON object>

Based on these evaluation results, we can identify specific areas where the extraction process can be improved. In this example, the completeness rationale notes that the extracted knowledge misses some required properties, such as `numberOfCycles` in the thickness control. This indicates that the extraction agent needs to be more thorough in capturing all relevant details from the scientific text. Additionally, the correctness evaluation highlights inaccuracies in the `growthPerCycle` values, suggesting that the extraction agent needs to capture value ranges more carefully when the text specifies them.

These insights will be valuable for refining the extraction process in future iterations, ensuring that the extracted knowledge is both comprehensive and accurate.
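
When several rubrics are evaluated, it can be convenient to list only those that fall below a chosen score. The sketch below assumes that each entry of `evaluation_results` is a dictionary with a numeric `rating` and a text `rationale`; this tutorial does not confirm that shape, so the key names, the helper `rubrics_needing_attention`, and the default threshold of 4 should be read as assumptions.

```python
def rubrics_needing_attention(evaluation_results, threshold=4):
    """Return (rubric_name, rating, rationale) for every rubric rated below `threshold`."""
    flagged = []
    for name, result in evaluation_results.items():
        # Assumed result shape: {"rating": <number>, "rationale": <text>}.
        rating = result.get("rating")
        if rating is not None and rating < threshold:
            flagged.append((name, rating, result.get("rationale", "")))
    # Lowest ratings first, so the weakest rubric is reviewed first.
    return sorted(flagged, key=lambda item: item[1])
```

If the real result objects use different field names, only the two `result.get(...)` lookups would need to change.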

#### 🚀 Section 5: Next Steps <a id="section-5-next-steps"></a>
In the next tutorial, [SciKGExtract_Knowledge_Extraction_Normalization_Refinement](04_SciKGExtract_Knowledge_Extraction_Normalization_Refinement.ipynb), we will explore how to refine and improve the extraction process based on the evaluation results obtained from the reflection agent. We will discuss the feedback agent, which can use these evaluation insights to iteratively enhance the extraction quality.