## 🧠🔗 SciKGExtract: Agentic Pipeline for Scientific Knowledge Graph Extraction - Tutorials!

### Tutorial 1: Knowledge Extraction from Scientific Texts
Welcome to the first tutorial of SciKGExtract! In this tutorial, we will demonstrate how to use the SciKGExtract pipeline to extract structured knowledge from scientific texts.

**What is SciKGExtract?**

The SciKGExtract framework leverages an **Agentic Pipeline architecture** with sequential agent execution to perform the various tasks involved in the knowledge extraction process. The overall execution is orchestrated by an Orchestrator Agent, which coordinates the different components of the pipeline including extraction, normalization, evaluation, refinement, and KG population.
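To make the idea of sequential agent execution concrete, here is a minimal, self-contained sketch of an orchestrator that passes a shared state through a list of stages. It is not SciKGExtract's actual implementation: the function names (`extract`, `normalize`, `run_pipeline`) and the state keys are hypothetical and only illustrate the pattern.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Stage = Callable[[State], State]


def extract(state: State) -> State:
    # Placeholder stage: a real extraction agent would call an LLM here.
    return {**state, "extracted": {"text_length": len(str(state["document"]))}}


def normalize(state: State) -> State:
    # Placeholder stage: simply marks the state as normalized.
    return {**state, "normalized": True}


def run_pipeline(state: State, stages: List[Stage]) -> State:
    # The orchestrator runs each stage in order, feeding it the previous state.
    for stage in stages:
        state = stage(state)
    return state


final_state = run_pipeline(
    {"document": "ZnO films were grown by ALD."},
    [extract, normalize],
)
print(final_state["extracted"], final_state["normalized"])
```

Each stage receives the state produced by the previous one, so dropping a stage from the list is enough to skip it.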

**Focus of this Tutorial**

In this tutorial, we focus only on the **Knowledge Extraction** component of the SciKGExtract pipeline. We will demonstrate how to extract entities and relationships from a sample scientific text using the extraction agent and output the structured knowledge. Subsequent tutorials will cover the other components of the framework, including normalization, evaluation, refinement, and KG population.

#### 📋 Overview
We will cover the following steps in this tutorial:
1. [**Initial Configuration**](#section-1-initial-configurations): Setting up the environment, loading necessary libraries, and defining the sample scientific text.
2. [**Process Definition**](#section-2-process-definition): Defining the scientific process for which we want to perform structured knowledge extraction.
3. [**Knowledge Extraction**](#section-3-knowledge-extraction): Using the orchestrator agent to extract entities and relationships from the scientific text based on the process schema.
4. [**Next Steps**](#section-4-next-steps): Brief overview of what will be covered in the next tutorial.

#### ⚙️ Section 1: Initial Configurations <a id="section-1-initial-configurations"></a>
We start with the initial configuration required for the extraction process: loading the required libraries, setting up the LLM, and defining the input and output directories.

##### Import Necessary Packages and Modules

In [1]:
# Append the parent directory to sys.path
import sys
sys.path.append("../..")

# Apply nest_asyncio to allow nested event loops
import nest_asyncio
nest_asyncio.apply()

In [2]:
# Python Imports
from IPython.display import JSON, display

# SciKG-Extract Config Imports
from scikg_extract.config.process.processConfig import ProcessConfig
from scikg_extract.config.llm.envConfig import EnvConfig

# Import Utilities
from scikg_extract.utils.log_handler import LogHandler
from scikg_extract.utils.file_utils import read_json_file, read_text_file

# Import Orchestrator Agent
from scikg_extract.agents.orchestrator_agent import orchestrate_extraction_workflow

# Import Configurations
from scikg_extract.config.agents.orchestrator import OrchestratorConfig
from scikg_extract.config.agents.workflow import WorkflowConfig

# Import Data Models
from data.models.schema.ALD_experimental_schema import ALDProcessList

##### Configure Logging

In [3]:
# Setup and Initialize Module Logging
logger = LogHandler.setup_module_logging("scikg_extract")

##### Setting Up LLM Model

In [4]:
# LLM to be used for structured knowledge extraction
llm_name_extraction = "gpt-4o"

In [None]:
# Set OpenAI API Key and Organization ID
EnvConfig.OPENAI_api_key = '<insert-your-openai-key>'
EnvConfig.OPENAI_organization_id = '<insert-your-openai-organization-id>'
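Rather than pasting credentials into the notebook, one option is to read them from environment variables. The sketch below is an assumption-laden illustration: the helper `read_secret` and the variable names `OPENAI_API_KEY` and `OPENAI_ORGANIZATION_ID` are hypothetical choices, not something the framework requires.

```python
import os


def read_secret(name: str, placeholder: str = "<missing>") -> str:
    # Return the environment variable if set and non-empty, else a visible placeholder.
    value = os.environ.get(name, "").strip()
    return value or placeholder


# Hypothetical variable names; use whatever names your shell or .env file defines.
openai_api_key = read_secret("OPENAI_API_KEY")
openai_organization_id = read_secret("OPENAI_ORGANIZATION_ID")

# These values could then be assigned to EnvConfig.OPENAI_api_key and
# EnvConfig.OPENAI_organization_id exactly as in the cell above.
```

Keeping secrets out of the notebook avoids committing them to version control by accident.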

##### Input and Output Directories

In [5]:
# Scientific document containing ZnO ALD experimental processes in markdown format
scientific_docs_dir = "../../data/research-papers/ALD/markdown/ZnO-IGZO-papers/experimental-usecase/ZnO/ZnEt2 - H2O/2 Lujala et al.md"
scientific_document = read_text_file(scientific_docs_dir)

# ALD Process Schema path for experimental processes
process_schema_path = "../../data/schemas/ALD-experimental/ALD-experimental-schema.json"
process_schema = read_json_file(process_schema_path)

# Domain-expert curated examples for ZnO ALD processes
examples_path = "../../data/examples/Atomic-layer-deposition/ZnO/example1.txt"
examples = read_text_file(examples_path)
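Because the paths above are relative to the notebook location, a missing file is a common source of confusing errors. The following sketch shows one way to fail early with a clear message. The helpers `load_text` and `load_json` are hypothetical stand-ins written for illustration; they are not the framework's `read_text_file` and `read_json_file`, whose behavior is not shown here.

```python
import json
import tempfile
from pathlib import Path


def load_text(path: str) -> str:
    # Fail early with a clear message if the file is missing.
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Input file not found: {file_path}")
    return file_path.read_text(encoding="utf-8")


def load_json(path: str) -> dict:
    # Parse the file contents as JSON after the existence check.
    return json.loads(load_text(path))


# Demonstration with a temporary schema file (hypothetical content).
with tempfile.TemporaryDirectory() as tmp_dir:
    schema_file = Path(tmp_dir) / "schema.json"
    schema_file.write_text('{"title": "demo-schema"}', encoding="utf-8")
    demo_schema = load_json(str(schema_file))
    print(demo_schema["title"])
```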

#### 🧪 Section 2: Process Definition <a id="section-2-process-definition"></a>
In this section, we will define the scientific process for which we want to perform structured knowledge extraction. We will use the Atomic Layer Deposition (ALD) of Zinc Oxide (ZnO) as our example process. We have developed a detailed process schema for [ALD experimental process](../../data/schemas/ALD-experimental/ALD-experimental-schema.json) using our previous work: [**SCHEMA-MINER PRO**](https://github.com/sciknoworg/schema-miner), and will use it to guide the extraction process. Additionally, we will provide domain-expert curated examples to help the extraction agent understand the specific entities and relationships we are interested in.

In [6]:
# Process Name
ProcessConfig.Process_name = "Atomic Layer Deposition"

ProcessConfig.Process_description = """
Atomic layer deposition (ALD) is a surface-controlled thin film deposition technique that can enable ultimate control over the film thickness, uniformity on large-area substrates and conformality on 3D (nano)structures. Each ALD cycle consists of at least two half-cycles (but can be more complex), containing a precursor dose step and a co-reactant exposure step, separated by purge or pump steps. Ideally the same amount of material is deposited in each cycle, due to the self-limiting nature of the reactions of the precursor and co-reactant with the surface groups on the substrate. By carrying out a certain number of ALD cycles, the targeted film thickness can be obtained.

In this extraction task, we are focusing on ZnO (Zinc Oxide) thin film deposition via ALD. A ZnO ALD (Zinc Oxide Atomic Layer Deposition) process deposits thin ZnO films through sequential, self-limiting surface reactions between a zinc precursor and an oxidant. The process typically consists of repeating ALD cycles, each containing a precursor pulse (e.g., diethylzinc (DEZ), Zn(acac)₂, or Zn(thd)₂), a purge step, an oxidant pulse (commonly H₂O, O₃, or O₂ plasma), followed by another purge. These reactions form a conformal zinc-oxygen layer per cycle with precise thickness control. The aim of a ZnO ALD process is to produce high-quality, uniform, conformal ZnO films with controlled thickness, crystallinity (amorphous or polycrystalline depending on temperature), and stoichiometry.
"""

#### 📄 Section 3: Knowledge Extraction <a id="section-3-knowledge-extraction"></a>
In this section, we demonstrate how to use the SciKGExtract framework to extract structured knowledge from the scientific text loaded above. The SciKGExtract Orchestrator Agent coordinates the process by invoking the Extraction Agent, which identifies and extracts relevant entities and relationships based on the defined process schema. To instruct the Orchestrator Agent, we provide an orchestrator configuration that specifies the required parameters and a workflow configuration that specifies which components to execute.

##### Orchestrator Agent Configuration
The Orchestrator Agent is responsible for coordinating the different components of the extraction pipeline. We configure it with the parameters required for the basic extraction task.

In [7]:
# Initialize orchestrator configuration
orchestrator_config = OrchestratorConfig(
    llm_name=llm_name_extraction,
    process_schema=process_schema,
    scientific_document=scientific_document,
    examples=examples,
    extraction_data_model=ALDProcessList
)

##### Workflow Configuration
The workflow configuration defines flags that control the behavior of the extraction pipeline. To run knowledge extraction only, we set all three flags to `False`, which disables normalization, evaluation, and refinement.

In [8]:
# Initialize the Workflow configuration
workflow_config = WorkflowConfig(normalize_extracted_data=False, clean_extracted_data=False, validate_extracted_data=False)
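To illustrate how boolean flags like these can gate pipeline stages, here is a small standalone sketch. `DemoWorkflowFlags` is a hypothetical stand-in that only mirrors the three flag names used above, and the stage names returned by `planned_stages` are assumptions for illustration; the real `WorkflowConfig` may map flags to stages differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DemoWorkflowFlags:
    # Hypothetical stand-in that mirrors the three flags used in the cell above.
    normalize_extracted_data: bool = False
    clean_extracted_data: bool = False
    validate_extracted_data: bool = False


def planned_stages(flags: DemoWorkflowFlags) -> List[str]:
    # Extraction always runs; each optional stage is appended only if enabled.
    stages = ["extraction"]
    if flags.normalize_extracted_data:
        stages.append("normalization")
    if flags.clean_extracted_data:
        stages.append("cleaning")
    if flags.validate_extracted_data:
        stages.append("validation")
    return stages


print(planned_stages(DemoWorkflowFlags()))
print(planned_stages(DemoWorkflowFlags(normalize_extracted_data=True)))
```

With every flag set to `False`, only the extraction stage remains in the plan, which matches the extraction-only setup used in this tutorial.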

##### Execute the Basic Structured Knowledge Extraction Workflow
With the configurations in place, we can now execute the basic structured knowledge extraction workflow using the SciKGExtract framework. The orchestrator agent will manage the flow of data and control between the different components of the pipeline to extract structured knowledge from the input scientific documents.

In [9]:
# Extract knowledge using the orchestrator agent
final_state = orchestrate_extraction_workflow(orchestrator_config, workflow_config)

# Get the extracted knowledge from the final state
extracted_knowledge = final_state["extracted_json"]

2026-01-19 18:07:28 - scikg_extract.agents.orchestrator_agent - INFO - Starting Orchestrator Agent for extraction workflow...
2026-01-19 18:07:28 - scikg_extract.agents.orchestrator_agent - INFO - Compiled the orchestractor workflow graph.
2026-01-19 18:07:28 - scikg_extract.agents.orchestrator_agent - INFO - Invoking the orchestrator workflow...
2026-01-19 18:07:28 - scikg_extract.agents.extraction_agent - INFO - Starting knowledge extraction agent...
2026-01-19 18:07:28 - scikg_extract.agents.extraction_agent - INFO - Executing the extraction StateGraph for structured knowledge extraction...
2026-01-19 18:07:28 - scikg_extract.tools.extraction.structured_knowledge_extraction - INFO - Starting structured knowledge extraction tool...
2026-01-19 18:07:55 - scikg_extract.tools.extraction.json_validator - INFO - Starting JSON validation tool...


##### Display Extracted Knowledge
The extracted knowledge follows the defined process schema, which is complex and deeply nested and can contain multiple processes, so we display selected parts separately for clarity. Specifically, we display the ALD system, the reactant selection, and the deposition temperature (from the process parameters) of the first extracted process. The complete extracted knowledge can be explored in the `extracted_knowledge` variable.

In [10]:
print("### ALD System ###")
ald_system = extracted_knowledge["processes"][0]["aldSystem"]
display(JSON(ald_system))

print("### Reactants Selection ###")
reactants_selection = extracted_knowledge["processes"][0]["reactantSelection"]
display(JSON(reactants_selection))

print("### Process Parameters - Deposition Temperature ###")
process_parameters_temperature = extracted_knowledge["processes"][0]["processParameters"]["temperature"]
display(JSON(process_parameters_temperature))
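`display(JSON(...))` relies on the notebook front end to render an interactive tree; in front ends without a JSON renderer it may fall back to a plain text placeholder. As an alternative, the sketch below prints nested values as indented text and tolerates missing keys. The helper `get_nested` and the sample dictionary are hypothetical: the sample only mimics the key names accessed above and does not contain real extraction results.

```python
import json
from typing import Any


def get_nested(data: Any, *keys: Any, default: Any = None) -> Any:
    # Walk dict keys and list indices, returning `default` if any step is missing.
    current = data
    for key in keys:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            return default
    return current


# Hypothetical sample shaped like the keys accessed above; values are illustrative only.
sample_knowledge = {
    "processes": [
        {"aldSystem": {"example_field": "example value"}, "processParameters": {}}
    ]
}

ald_system = get_nested(sample_knowledge, "processes", 0, "aldSystem", default={})
temperature = get_nested(sample_knowledge, "processes", 0, "processParameters", "temperature")

print(json.dumps(ald_system, indent=2, ensure_ascii=False))
print(temperature)
```

The same helper can be applied to `extracted_knowledge`, which is useful when the model omits an optional field and direct indexing would raise a `KeyError`.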


#### 🚀 Section 4: Next Steps <a id="section-4-next-steps"></a>
In the next tutorial, [SciKGExtract_Knowledge_Extraction_Normalization](02_SciKGExtract_Knowledge_Extraction_Normalization.ipynb), we will build on the extraction process demonstrated here by normalizing the extracted entities, especially chemical compounds, against external databases such as PubChem. This will enhance the quality and usability of the extracted knowledge for downstream applications.