## 📄🔍 A Human-in-the-Loop Workflow for Scientific Schema Mining with Large Language Models

This notebook presents a human-in-the-loop approach to scientific schema mining with large language models (LLMs). It covers preparing the data needed for schema mining and the three stages of extracting and refining JSON schemas from scientific literature, each incorporating expert feedback. A complete workflow diagram and detailed explanation can be found [here](../README.md).

In this notebook, we take an ALD (Atomic Layer Deposition) process as the running example and design a schema for it!

**Overview**
1. [Preparing the data](#preparing-data)
2. [Stage 01: Initial Schema Mining](#initial-schema-mining)
3. [Stage 02: Preliminary Schema Refinement](#pre-schema-refinement)
4. [Stage 03: Finalize Schema Refinement](#finalize-schema-refinement)

### 🗂️ Preparing the data <a id='preparing-data'></a>

The first step before schema mining is to prepare the data for the large language model (LLM): the text of the PDF documents on Atomic Layer Deposition (ALD) processes is extracted for each stage. For Stage 1, we use the process specification document provided by domain experts to extract the initial JSON schema. For Stage 2, we use a small dataset of scientific research papers on the ALD process, curated by domain experts; for Stage 3, we use a larger collection of research papers. All relevant documents are organized and available [here](../data).

In [2]:
import os
from utils.utils import pdf_loader

def convert_pdf_to_text(source_dir_path: str):
    """
    Extracts text from all the PDFs in the source directory

    Args:
        source_dir_path (str): The file path of the source directory

    Returns:
        list: The list of extracted text strings, one per PDF file in the source directory
    """
    pdf_text_lst = []
    
    #Iterating over the files in the source directory in sorted order, so the result is reproducible
    for filename in sorted(os.listdir(source_dir_path)):
        #Skipping anything that is not a PDF file (case-insensitive extension check)
        if not filename.lower().endswith('.pdf'):
            continue

        #Reading and extracting the pdf text
        complete_file_path = os.path.join(source_dir_path, filename)
        pdf_text = pdf_loader(complete_file_path)
        pdf_text_lst.append(pdf_text)
    return pdf_text_lst

In [3]:
#Reading the process specification documents for stage 1
source_dir_path = '../data/stage-1'
stage_1_text = convert_pdf_to_text(source_dir_path)

#Reading the domain-expert's small curated collection of research papers for stage 2
source_dir_path = '../data/stage-2/research-papers/experimental-usecase'
stage_2_text = convert_pdf_to_text(source_dir_path)

#Reading the large collection of research papers for stage 3
source_dir_path = '../data/stage-3/research-papers/experimental_usecase'
stage_3_text = convert_pdf_to_text(source_dir_path)
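Later cells index into these lists (for example `stage_1_text[0]`), so an empty or misnamed directory would only surface as an error further down. The following is a minimal sketch of an optional sanity check; `check_loaded_texts` is a hypothetical helper that is not part of the repository, and the toy strings stand in for the real extracted text.

```python
from typing import Dict, List


def check_loaded_texts(texts_by_stage: Dict[str, List[str]]) -> Dict[str, int]:
    """Return the number of non-empty documents per stage, raising if a stage has none."""
    counts = {}
    for stage, texts in texts_by_stage.items():
        # Treat None, empty, and whitespace-only strings as unusable
        non_empty = [t for t in texts if t and t.strip()]
        if not non_empty:
            raise ValueError(f'No usable PDF text found for {stage}')
        counts[stage] = len(non_empty)
    return counts


# Toy data standing in for stage_1_text and stage_2_text
print(check_loaded_texts({'stage-1': ['spec text'], 'stage-2': ['paper a', '  ', 'paper b']}))
# prints {'stage-1': 1, 'stage-2': 2}
```

In the notebook, the same call could be made with `{'stage-1': stage_1_text, 'stage-2': stage_2_text, 'stage-3': stage_3_text}`.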


### 📄⛏️ Stage 01: Initial Schema Mining <a id='initial-schema-mining'></a>

During Stage 1, we task the LLM with designing an initial JSON schema based on a process specification document that contains limited information about an ALD process. For this example, we use OpenAI's GPT-4o model with the prompt template provided [here](../src/prompts/prompt_template1.py).

In [4]:
import json
from utils.utils import extract_json_schema
from services.openai_llm_inference import Openai_LLM_Inference
from prompts import prompt_template1

def schema_extraction_stage1(llm_model_name: str, specification_doc: str):
    """
    Extracts the initial JSON schema from the given process specification document using the large language model.

    Args:
        llm_model_name (str): The large language model name
        specification_doc (str): The text of the process specification document

    Returns:
        dict: The extracted initial schema, or None if the model returns no output or no JSON schema can be extracted
    """
    try:
        #Initializing the Object for LLM inference
        llm = Openai_LLM_Inference(llm_model_name)
        print(f'Using {llm}')
        
        #Calling the LLM's completion API
        var_dict = {'context': specification_doc}
        model_output = llm.completion(prompt_template1, var_dict)
        if not model_output:
            print(f'No Output from the Model: {llm_model_name}, terminating the workflow')
            return None

        #Extracting the initial schema
        print('\nExtracting the JSON object from the model\'s output...')
        initial_schema = extract_json_schema(model_output, json_encl_expr = [('```json', '```'), ('```', '```')])
        if not initial_schema:
            print('Error extracting the JSON schema from the LLM\'s output, Stopping the inference from the LLM...')
            return None
        
        print('\nInitial Schema successfully extracted from the LLM\'s Output!!')
        return initial_schema
    except Exception as e:
        print(f'Exception Occurred while doing LLM ({llm_model_name}) Inference')
        print(f'Exception: {e}')
        return None

In [5]:
process_specification_doc = stage_1_text[0]
llm_model_name = 'gpt-4o'
initial_schema = schema_extraction_stage1(llm_model_name, process_specification_doc)

Using OPENAI - LLM Inference with Model: gpt-4o

Extracting the JSON object from the model's output...

Initial Schema successfully extracted from the LLM's Output!!
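The extraction step above relies on `extract_json_schema` from `utils.utils`, whose implementation is not shown in this notebook. As a rough illustration of the idea only, the sketch below tries each (opening, closing) marker pair in order and returns the first enclosed span that parses as a JSON object. The function name `extract_fenced_json` and its behavior are assumptions and may differ from the repository's helper.

```python
import json
from typing import List, Optional, Tuple

FENCE = '`' * 3  # three backticks, built this way to keep the example readable


def extract_fenced_json(text: str, enclosures: List[Tuple[str, str]]) -> Optional[dict]:
    """Try each (opening, closing) marker pair in order; return the first span that parses as a JSON object."""
    for opening, closing in enclosures:
        start = text.find(opening)
        if start == -1:
            continue  # this opening marker does not occur in the text
        start += len(opening)
        end = text.find(closing, start)
        if end == -1:
            continue  # no closing marker after the opening one
        try:
            parsed = json.loads(text[start:end])
        except json.JSONDecodeError:
            continue  # the enclosed span is not valid JSON, try the next pair
        if isinstance(parsed, dict):
            return parsed
    return None


# A toy model reply with a JSON block wrapped in a fenced section
reply = f'Here is the schema:\n{FENCE}json\n{{"type": "object"}}\n{FENCE}\nDone.'
print(extract_fenced_json(reply, [(FENCE + 'json', FENCE), (FENCE, FENCE)]))
# prints {'type': 'object'}
```

Listing the more specific marker pair first matters: with only the plain-fence pair, the enclosed span would start with the word `json` and fail to parse.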


In [6]:
print('Initial Schema:\n')
print(json.dumps(initial_schema, indent=4))

Initial Schema:

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Atomic Layer Deposition (ALD) Process Schema",
    "type": "object",
    "properties": {
        "processParameters": {
            "type": "object",
            "description": "Parameters defining the ALD process.",
            "properties": {
                "temperature": {
                    "type": "number",
                    "description": "Temperature at which the ALD process is conducted, in Celsius.",
                    "units": "Celsius"
                },
                "pressure": {
                    "type": "number",
                    "description": "Pressure inside the reactor during the ALD process, in Pascals.",
                    "units": "Pascals"
                },
                "precursorFlowRate": {
                    "type": "number",
                    "description": "Flow rate of the precursor gas, in standard cubic centimeters per minute (sccm).",
          

In [7]:
expert_feedback_stage1 = 'The reactivity should be mentioned either at the process conditions or film properties. It’s the result of deposition under specific conditions not a standard property of the precursor or co-reactant.'

### 📄⛏️ Stage 02: Preliminary Schema Refinement <a id='pre-schema-refinement'></a>

During Stage 2, the initial JSON schema from Stage 1 is further refined using a curated dataset of scientific literature related to the ALD process, compiled by domain experts. This stage follows an iterative approach, incorporating expert feedback into each schema generated by the LLM. In this notebook, we use only one expert's feedback to demonstrate how it can be incorporated into the workflow. The prompt used for this stage is provided [here](../src/prompts/prompt_template2.py).

In [8]:
from prompts import prompt_template2

def schema_extraction_stage2(llm_model_name: str, schema: dict, review: str, scientific_literature: list):
    """
    Updates the initial JSON schema using the small, domain-expert-curated collection of scientific literature and the expert's feedback

    Args:
        llm_model_name (str): The large language model name
        schema (dict): The initial schema from stage 1
        review (str): The expert's feedback on the initial schema from stage 1
        scientific_literature (list): The list of texts from the domain-expert-curated scientific literature

    Returns:
        dict: The updated JSON schema
    """
    
    #Initializing the object for LLM Inference
    llm = Openai_LLM_Inference(llm_model_name)
    print(f'Using {llm}')
    
    #Iterating over the scientific literature, numbering the iterations from 1
    for iteration, sci_lit in enumerate(scientific_literature, start=1):
        print(f'\n========================== Iteration {iteration} =================================')
        
        #Extracting the schema from the LLM
        print('\nCalling the completion API of the model')
        var_dict = {'current_schema': schema, 'full_text': sci_lit, 'domain_expert_review': review}
        model_output = llm.completion(prompt_template2, var_dict)
        
        #llm.completion returns None if any exception occurs during the LLM inference
        if not model_output:
            print(f'No output from model: {llm_model_name}, continuing to the next paper..')
            continue
        
        #Extracting the updated schema
        updated_schema = extract_json_schema(model_output, json_encl_expr = [('```json', '```'), ('```', '```')])
        if not updated_schema:
            print('Unable to extract JSON, stopping the inference from the LLM...')
            break
        
        #Updating the previous schema with the current schema for next iteration
        schema = updated_schema
        
    print(f'JSON schema updated with LLM: {llm_model_name}')
    return schema

In [9]:
llm_model_name = 'gpt-4o'
stage2_schema = schema_extraction_stage2(llm_model_name, initial_schema, expert_feedback_stage1, stage_2_text[:5])

Using OPENAI - LLM Inference with Model: gpt-4o


Calling the completion API of the model


Calling the completion API of the model


Calling the completion API of the model


Calling the completion API of the model


Calling the completion API of the model
JSON schema updated with LLM: gpt-4o
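Before an expert writes feedback on a refined schema, it can help to see which properties the refinement added or removed. The sketch below is one possible review aid, not part of the repository: `property_paths` and `diff_schemas` are hypothetical helpers that compare the dotted property paths of two JSON Schemas, and the two small schemas are toy stand-ins for `initial_schema` and `stage2_schema`.

```python
from typing import Set


def property_paths(schema: dict, prefix: str = '') -> Set[str]:
    """Collect dotted paths of all properties, recursing into nested objects and array items."""
    paths = set()
    for name, sub in schema.get('properties', {}).items():
        path = f'{prefix}.{name}' if prefix else name
        paths.add(path)
        if isinstance(sub, dict):
            # Nested object properties
            paths |= property_paths(sub, path)
            # Properties of objects inside an array, marked with []
            items = sub.get('items')
            if isinstance(items, dict):
                paths |= property_paths(items, path + '[]')
    return paths


def diff_schemas(old: dict, new: dict) -> dict:
    """Report which property paths appear only in the new or only in the old schema."""
    old_paths, new_paths = property_paths(old), property_paths(new)
    return {'added': sorted(new_paths - old_paths), 'removed': sorted(old_paths - new_paths)}


old = {'properties': {'processParameters': {
    'type': 'object', 'properties': {'temperature': {'type': 'number'}}}}}
new = {'properties': {
    'processParameters': {'type': 'object', 'properties': {
        'temperature': {'type': 'number'}, 'pressure': {'type': 'number'}}},
    'substrate': {'type': 'string'}}}
print(diff_schemas(old, new))
# prints {'added': ['processParameters.pressure', 'substrate'], 'removed': []}
```

This only compares property names and nesting; changes to descriptions, types, or units would need a separate comparison.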


In [10]:
print('Stage-2 Updated Schema:\n')
print(json.dumps(stage2_schema, indent=4))

Stage-2 Updated Schema:

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Atomic Layer Deposition (ALD) Process Schema",
    "type": "object",
    "properties": {
        "processParameters": {
            "type": "object",
            "description": "Parameters defining the ALD process.",
            "properties": {
                "temperature": {
                    "type": "number",
                    "description": "Temperature at which the ALD process is conducted, in Celsius. Typical range is 50 to 500 \u00b0C.",
                    "units": "Celsius"
                },
                "pressure": {
                    "type": "number",
                    "description": "Pressure inside the reactor during the ALD process, typically maintained at 0.25 Torr.",
                    "units": "Pascals"
                },
                "precursorFlowRate": {
                    "type": "number",
                    "description": "Flow rate of the precursor

In [11]:
expert_feedback_stage2 = '1. Remove the reactor property in the additional details and include this in the reactor design property in the process conditions unit under the reactor name.\n2. Bubblertemperatures: merge it with the precursor temperature property under the BubblerTemperature name in the precursor unit.\n3. Substrate type and substrate units should be merged under the Substrate name'

### 📄⛏️ Stage 03: Finalize Schema Refinement <a id='finalize-schema-refinement'></a>

Stage 3 follows the same workflow as Stage 2; the key difference is that it uses a larger corpus of scientific literature related to the ALD process. The goal of this stage is to provide the LLM with a broader perspective and more comprehensive information about the target process. The prompt used for this stage is provided [here](../src/prompts/prompt_template3.py).

In [12]:
from prompts import prompt_template3

def schema_extraction_stage3(llm_model_name: str, schema: dict, review: str, scientific_literature: list):
    """
    Updates the JSON schema using the large scientific literature dataset and the expert's feedback

    Args:
        llm_model_name (str): The large language model name
        schema (dict): The JSON schema from stage 2
        review (str): The expert's feedback on the schema from stage 2
        scientific_literature (list): The list of texts from the large scientific literature dataset

    Returns:
        dict: The updated JSON schema
    """
    
    #Initializing the object for LLM Inference
    llm = Openai_LLM_Inference(llm_model_name)
    print(f'Using {llm}')
    
    #Iterating over the scientific literature, numbering the iterations from 1
    for iteration, sci_lit in enumerate(scientific_literature, start=1):
        print(f'\n========================== Iteration {iteration} =================================')
        
        #Extracting the schema from the LLM
        print('\nCalling the completion API of the model')
        var_dict = {'current_schema': schema, 'full_text': sci_lit, 'domain_expert_review': review}
        model_output = llm.completion(prompt_template3, var_dict)

        #llm.completion returns None if any exception occurs during the LLM inference
        if not model_output:
            print(f'No output from model: {llm_model_name}, continuing to the next paper..')
            continue
        
        #Extracting the updated schema
        updated_schema = extract_json_schema(model_output, json_encl_expr = [('```json', '```'), ('```', '```')])
        if not updated_schema:
            print('Unable to extract JSON, stopping the inference from the LLM...')
            break

        #Updating the previous schema with the current schema for next iteration
        schema = updated_schema
    
    print(f'JSON schema updated with LLM: {llm_model_name}')
    return schema

In [13]:
llm_model_name = 'gpt-4o'
stage3_schema = schema_extraction_stage3(llm_model_name, stage2_schema, expert_feedback_stage2, stage_3_text[:5])

Using OPENAI - LLM Inference with Model: gpt-4o


Calling the completion API of the model


Calling the completion API of the model


Calling the completion API of the model
JSON schema updated with LLM: gpt-4o
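The Stage 2 and Stage 3 functions differ only in the prompt template and the corpus they receive, so the shared loop could be factored into one function. The sketch below shows that idea under stated assumptions: `refine_schema` is a hypothetical name, and the model call and the JSON extraction are passed in as callables so that the example runs with a stub in place of the real LLM.

```python
import json
from typing import Callable, Iterable, Optional


def refine_schema(schema: dict,
                  review: str,
                  papers: Iterable[str],
                  complete: Callable[[dict], Optional[str]],
                  parse: Callable[[str], Optional[dict]]) -> dict:
    """Feed papers to the model one at a time, carrying the latest schema forward."""
    for paper in papers:
        var_dict = {'current_schema': schema, 'full_text': paper, 'domain_expert_review': review}
        output = complete(var_dict)
        if not output:
            continue  # no model output: skip this paper
        updated = parse(output)
        if not updated:
            break  # unparseable output: stop and keep the last good schema
        schema = updated
    return schema


def fake_complete(var_dict: dict) -> str:
    # Stand-in for the real model call: adds one property named after the paper
    schema = json.loads(json.dumps(var_dict['current_schema']))  # deep copy via JSON round trip
    schema.setdefault('properties', {})[var_dict['full_text']] = {'type': 'string'}
    return json.dumps(schema)


result = refine_schema({'type': 'object'}, 'no feedback', ['paperA', 'paperB'], fake_complete, json.loads)
print(sorted(result['properties']))
# prints ['paperA', 'paperB']
```

In the notebook, `complete` could wrap `llm.completion` with `prompt_template2` or `prompt_template3`, and `parse` could wrap `extract_json_schema` with the same `json_encl_expr` argument used above.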


In [14]:
print('Stage-3 Updated Schema:\n')
print(json.dumps(stage3_schema, indent=4))

Stage-3 Updated Schema:

{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Atomic Layer Deposition (ALD) Process Schema",
    "type": "object",
    "properties": {
        "processParameters": {
            "type": "object",
            "description": "Parameters defining the ALD process.",
            "properties": {
                "temperature": {
                    "type": "number",
                    "description": "Temperature at which the ALD process is conducted, in Celsius. Typical range is 50 to 500 \u00b0C.",
                    "units": "Celsius"
                },
                "pressure": {
                    "type": "number",
                    "description": "Pressure inside the reactor during the ALD process, typically maintained at 0.25 Torr.",
                    "units": "Pascals"
                },
                "precursorFlowRate": {
                    "type": "number",
                    "description": "Flow rate of the precursor