## üß†üîó SciKGExtract: Agentic Pipeline for Scientific Knowledge Graph Extraction - Tutorials!

### Tutorial 2: Knowledge Extraction With Normalization from Scientific Texts
Welcome to the second tutorial of SciKGExtract! In this tutorial, we will expand upon the knowledge extraction process demonstrated in the first [tutorial](01_SciKGExtract_Knowledge_Extraction.ipynb) by incorporating normalization of the extracted entities, particularly chemical compounds, using external databases like [PubChem](https://pubchem.ncbi.nlm.nih.gov/). This will enhance the quality and usability of the extracted knowledge for downstream applications.

**Focus of this Tutorial:**
The tutorial covers two key aspects:
1. **Knowledge Extraction**: We will use the SciKGExtract framework to extract structured knowledge from scientific texts, focusing on the Atomic Layer Deposition (ALD) process for Zinc Oxide (ZnO).
2. **Normalization**: SciKGExtract will be configured to normalize the extracted chemical entities using PubChem, ensuring that the extracted data is standardized and can be easily integrated with other datasets.

#### 📋 Overview
We will cover the following steps in this tutorial:
1. [**Initial Configuration**](#section-1-initial-configurations): Setting up the environment, loading necessary libraries, and defining the sample scientific text.
2. [**Process Definition**](#section-2-process-definition): Defining the scientific process for which we want to perform structured knowledge extraction.
3. [**Knowledge Extraction with Normalization**](#section-3-knowledge-extraction-with-normalization): Using the orchestration agent to extract and normalize entities and relationships from the scientific text based on the process schema.
4. [**Next Steps**](#section-4-next-steps): Brief overview of what will be covered in the next tutorial.

#### ⚙️ Section 1: Initial Configurations <a id="section-1-initial-configurations"></a>
We start with the initial configuration required for the extraction process: loading the required libraries, setting up the LLMs, and defining the input and output directories.

##### Import Necessary Packages and Modules

In [1]:
# Append the parent directory to sys.path
import sys
sys.path.append("../..")

# Apply nest_asyncio to allow nested event loops
import nest_asyncio
nest_asyncio.apply()

In [11]:
# Python Imports
import json

# SciKG-Extract Config Imports
from scikg_extract.config.process.processConfig import ProcessConfig
from scikg_extract.config.llm.envConfig import EnvConfig

# Import Utilities
from scikg_extract.utils.log_handler import LogHandler
from scikg_extract.utils.file_utils import read_json_file, read_text_file

# Import Orchestrator Agent
from scikg_extract.agents.orchestrator_agent import orchestrate_extraction_workflow

# Import Configurations
from scikg_extract.config.agents.orchestrator import OrchestratorConfig
from scikg_extract.config.agents.workflow import WorkflowConfig

# Import Data Models
from data.models.schema.ALD_experimental_schema import ALDProcessList

##### Configure Logging

In [3]:
# Setup and Initialize Module Logging
logger = LogHandler.setup_module_logging("scikg_extract")

##### Setting Up LLM Model

In [4]:
# LLM to be used for structured knowledge extraction
llm_name_extraction = "gpt-4o"

# LLM to be used for normalization, in particular for disambiguating chemical entities
llm_name_normalization = "gpt-5"

In [None]:
# Set OpenAI API Key and Organization ID
EnvConfig.OPENAI_api_key = '<insert-your-openai-key>'
EnvConfig.OPENAI_organization_id = '<insert-your-openai-organization-id>'
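
Hard-coding credentials in a notebook makes them easy to leak. As a minimal sketch, the values could instead be read from environment variables. The helper `read_openai_credentials` and the variable names `OPENAI_API_KEY` and `OPENAI_ORG_ID` are assumptions for illustration and are not part of SciKGExtract.

```python
import os


def read_openai_credentials(env=os.environ):
    """Return (api_key, organization_id) read from environment variables.

    The variable names OPENAI_API_KEY and OPENAI_ORG_ID are assumptions
    made for this sketch; adapt them to your own environment.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # The organization ID is treated as optional in this sketch.
    return api_key, env.get("OPENAI_ORG_ID")


# In the notebook, the result would be assigned to the same attributes as above:
# EnvConfig.OPENAI_api_key, EnvConfig.OPENAI_organization_id = read_openai_credentials()
```

The function accepts any mapping, so it can be exercised with a plain dictionary without touching the real environment.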

##### Input and Output Directories

In [5]:
# Scientific document containing ZnO ALD experimental processes in markdown format
scientific_docs_dir = "../../data/research-papers/ALD/markdown/ZnO-IGZO-papers/experimental-usecase/ZnO/ZnEt2 - H2O/2 Lujala et al.md"
scientific_document = read_text_file(scientific_docs_dir)

# ALD Process Schema path for experimental processes
process_schema_path = "../../data/schemas/ALD-experimental/ALD-experimental-schema.json"
process_schema = read_json_file(process_schema_path)

# Domain-expert curated examples for ZnO ALD processes
examples_path = "../../data/examples/Atomic-layer-deposition/ZnO/example1.txt"
examples = read_text_file(examples_path)

# Manually curated PubChem synonym to CID mapping dictionary
pubchem_lookup_dict_path = "../../data/resources/PubChem-Synonym-CID.json"
synonym_to_cid_mapping = read_json_file(pubchem_lookup_dict_path)

# PubChem LMDB database created from PubChem CID data dump
lmdb_pubchem_path = "../../data/external/pubchem/pubchem_cid_lmdb"
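
Because the workflow depends on several local files and one database directory, it can help to check that they exist before starting a long extraction run. The following is a small sketch of such a check; `find_missing_paths` is a hypothetical helper, not a SciKGExtract utility.

```python
from pathlib import Path


def find_missing_paths(paths):
    """Return the labels of all entries in `paths` whose target does not exist.

    `paths` maps a human-readable label to a file or directory path.
    """
    return [label for label, path in paths.items() if not Path(path).exists()]


# Possible usage with the variables defined in the cell above:
# missing = find_missing_paths({
#     "scientific document": scientific_docs_dir,
#     "process schema": process_schema_path,
#     "examples": examples_path,
#     "PubChem synonym mapping": pubchem_lookup_dict_path,
#     "PubChem LMDB": lmdb_pubchem_path,
# })
# if missing:
#     raise FileNotFoundError(f"Missing inputs: {missing}")
```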

#### 🧪 Section 2: Process Definition <a id="section-2-process-definition"></a>
In this section, we will define the scientific process for which we want to perform structured knowledge extraction. We will use the Atomic Layer Deposition (ALD) of Zinc Oxide (ZnO) as our example process. We have developed a detailed process schema for [ALD experimental process](../../data/schemas/ALD-experimental/ALD-experimental-schema.json) using our previous work: [**SCHEMA-MINER PRO**](https://github.com/sciknoworg/schema-miner), and will use it to guide the extraction process. Additionally, we will provide domain-expert curated examples to help the extraction agent understand the specific entities and relationships we are interested in.

In [6]:
# Process Name
ProcessConfig.Process_name = "Atomic Layer Deposition"

ProcessConfig.Process_description = """
Atomic layer deposition (ALD) is a surface-controlled thin film deposition technique that can enable ultimate control over the film thickness, uniformity on large-area substrates and conformality on 3D (nano)structures. Each ALD cycle consists of at least two half-cycles (but can be more complex), containing a precursor dose step and a co-reactant exposure step, separated by purge or pump steps. Ideally the same amount of material is deposited in each cycle, due to the self-limiting nature of the reactions of the precursor and co-reactant with the surface groups on the substrate. By carrying out a certain number of ALD cycles, the targeted film thickness can be obtained.

In this extraction task, we are focusing on ZnO (Zinc Oxide) thin film deposition via ALD. A ZnO ALD (Zinc Oxide Atomic Layer Deposition) process deposits thin ZnO films through sequential, self-limiting surface reactions between a zinc precursor and an oxidant. The process typically consists of repeating ALD cycles, each containing a precursor pulse (e.g., diethylzinc (DEZ), Zn(acac)₂, or Zn(thd)₂), a purge step, an oxidant pulse (commonly H₂O, O₃, or O₂ plasma), followed by another purge. These reactions form a conformal zinc-oxygen layer per cycle with precise thickness control. The aim of a ZnO ALD process is to produce high-quality, uniform, conformal ZnO films with controlled thickness, crystallinity (amorphous or polycrystalline depending on temperature), and stoichiometry.
"""

#### 📄 Section 3: Knowledge Extraction with Normalization <a id="section-3-knowledge-extraction-with-normalization"></a>
In this section, we will extend the knowledge extraction process by incorporating normalization of the extracted entities, particularly chemical compounds, using external databases like PubChem. We will configure the SciKGExtract Orchestrator Agent to include normalization steps in the extraction workflow. The Orchestrator Agent will utilize the Extraction Agent to identify and extract relevant entities and relationships based on the defined process schema, and then normalize these entities using PubChem.

##### Orchestrator Agent Configuration
The Orchestrator Agent configuration is similar to the one used for plain knowledge extraction, with three additional normalization properties: the path to the PubChem LMDB database, which indexes PubChem synonym-to-CID mappings for fast lookup; the manually curated synonym-to-CID mapping dictionary; and the name of the LLM used to disambiguate entities during normalization.

In [7]:
# Initialize orchestrator configuration
orchestrator_config = OrchestratorConfig(
    llm_name=llm_name_extraction,
    normalization_llm_name=llm_name_normalization,
    process_schema=process_schema,
    scientific_document=scientific_document,
    examples=examples,
    extraction_data_model=ALDProcessList,
    pubchem_lmdb_path=lmdb_pubchem_path,
    synonym_to_cid_mapping=synonym_to_cid_mapping
)
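
To make the role of the two lookup resources more concrete, here is a minimal sketch of a synonym lookup that consults a curated mapping first and falls back to a larger index. It is not SciKGExtract's implementation: the function `normalize_compound`, the lowercase keys, and the synonym-to-CID-list layout are all assumptions, a plain dictionary stands in for the LMDB database, and any LLM-based disambiguation is left out. The CIDs in the toy data are taken from the tutorial output shown later.

```python
PUBCHEM_COMPOUND_URL = "https://pubchem.ncbi.nlm.nih.gov/compound/{cid}"


def normalize_compound(name, curated, indexed):
    """Resolve `name` to PubChem compound URLs.

    The curated mapping is consulted first; the indexed mapping is used only
    when the curated one has no entry. Both are assumed (for this sketch) to
    map a lowercase synonym to a list of CIDs.
    """
    key = name.strip().lower()
    cids = curated.get(key) or indexed.get(key) or []
    return {
        "value": name,
        "sameAs": [PUBCHEM_COMPOUND_URL.format(cid=cid) for cid in cids],
    }


# Toy data: a hypothetical curated entry and a dict standing in for the LMDB index.
curated = {"dezn": [11185, 101667988]}
indexed = {"h2o": [962], "zno": [14806]}

water = normalize_compound("H2O", curated, indexed)
unknown = normalize_compound("not-a-compound", curated, indexed)
```

An unmatched name comes back with an empty `sameAs` list, which keeps the output shape uniform for downstream code.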

##### Workflow Configuration
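The workflow configuration enables the normalization flag (`normalize_extracted_data=True`) and leaves the other two flags, `clean_extracted_data` and `validate_extracted_data`, disabled, so no evaluation or refinement is performed.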

In [8]:
# Initialize the Workflow configuration
workflow_config = WorkflowConfig(normalize_extracted_data=True, clean_extracted_data=False, validate_extracted_data=False)

##### Execute the Knowledge Extraction with Normalization Workflow
With the updated configurations, we can now execute the knowledge extraction workflow with normalization enabled. The orchestrator agent will coordinate the extraction and normalization processes to produce standardized structured knowledge from the input scientific documents.

In [None]:
# Extract knowledge using the orchestrator agent
final_state = orchestrate_extraction_workflow(orchestrator_config, workflow_config)

# Get the extracted knowledge from the final state
normalized_extracted_knowledge = final_state["normalized_json"]

##### Display Extracted Knowledge
The extracted knowledge follows the process schema, which is complex, deeply nested, and can contain multiple processes. For clarity, we display parts of it separately: the ALD system, the reactant selection, and the deposition temperature (from the process parameters) of the first extracted process. These components show how normalization has standardized the chemical entities extracted from the scientific text. The complete normalized output is available in the `normalized_extracted_knowledge` variable.

In [12]:
print("### ALD System ###\n")
ald_system = normalized_extracted_knowledge["processes"][0]["aldSystem"]
print(json.dumps(ald_system, separators=(',', ':')))

print("\n### Reactants Selection ###\n")
reactants_selection = normalized_extracted_knowledge["processes"][0]["reactantSelection"]
print(json.dumps(reactants_selection, separators=(',', ':')))

print("\n### Process Parameters - Deposition Temperature ###\n")
process_parameters_temperature = normalized_extracted_knowledge["processes"][0]["processParameters"]["temperature"]
print(json.dumps(process_parameters_temperature, separators=(',', ':')))

### ALD System ###

{"aldMethod":[{"compound":{"value":"ZnO","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/14806"]},"method":"TALD"}],"materialDeposited":{"value":"ZnO","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/14806"]}}

### Reactants Selection ###

{"precursor":[{"compound":{"value":"ZnO","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/14806"]},"precursor":{"value":"DMZn","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/6093185"]}},{"compound":{"value":"ZnO","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/14806"]},"precursor":{"value":"DEZn","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/11185","https://pubchem.ncbi.nlm.nih.gov/compound/101667988"]}}],"coReactant":[{"compound":{"value":"ZnO","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/14806"]},"coReactant":{"value":"H2O","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/962"]}}],"carrierGas":{"value":"N2","sameAs":["https://pubchem.ncbi.nlm.nih.gov/compound/947"]},"purgingGas":{"value
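
Since normalized entities appear as `{"value": ..., "sameAs": [...]}` objects at many nesting levels, a small recursive walk can gather them into one flat mapping. The sketch below assumes only that shape, as seen in the output above; `collect_normalized_entities` is a hypothetical helper, and the sample is an abridged copy of the ALD System output.

```python
def collect_normalized_entities(node, found=None):
    """Recursively collect every {"value", "sameAs"} object in nested JSON.

    Returns a dict mapping each entity value to its list of sameAs links.
    If a value occurs more than once, the last occurrence wins.
    """
    if found is None:
        found = {}
    if isinstance(node, dict):
        if "value" in node and "sameAs" in node:
            found[node["value"]] = node["sameAs"]
        for child in node.values():
            collect_normalized_entities(child, found)
    elif isinstance(node, list):
        for item in node:
            collect_normalized_entities(item, found)
    return found


# Abridged copy of the ALD System output shown above.
sample = {
    "aldMethod": [
        {
            "compound": {
                "value": "ZnO",
                "sameAs": ["https://pubchem.ncbi.nlm.nih.gov/compound/14806"],
            },
            "method": "TALD",
        }
    ],
    "materialDeposited": {
        "value": "ZnO",
        "sameAs": ["https://pubchem.ncbi.nlm.nih.gov/compound/14806"],
    },
}

entities = collect_normalized_entities(sample)
```

In the notebook, the same function could be applied to the full result, for example `collect_normalized_entities(normalized_extracted_knowledge)`.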

#### 🚀 Section 4: Next Steps <a id="section-4-next-steps"></a>
In the next tutorial, [SciKGExtract_Knowledge_Extraction_Normalization_Reflection](03_SciKGExtract_Knowledge_Extraction_Normalization_Reflection.ipynb), we will introduce another agent, the Reflection Agent, which evaluates the extracted knowledge against predefined criteria. Specifically, we will use an LLM-as-Judge approach to assess the quality of the extracted knowledge on criteria such as completeness and correctness. These evaluations help identify where the extraction agent can improve.