
## MultiProvSense

This module takes a scientific publication (in PDF format) as input, processes it, and performs the following tasks:

1. **Reads the PDF file** – Extracts content from the document.
2. **Extracts Provenance Information** – Gathers publication metadata and formats it into JSON.
3. **Generates PROV-O Representation** – Creates a Turtle format output following the PROV ontology.

The Turtle representation generated with the sequential chat:

```turtle
@prefix brainkb: <https://brainkb.org/>.
@prefix prov: <http://www.w3.org/ns/prov#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.

brainkb:paper-1
    a prov:Entity ;
    dc:title "CodeKGC: Code Language Model for Generative Knowledge Graph Construction" ;
    dc:identifier "10.1145/3641850" ;
    dc:publisher "ACM" ;
    dc:rights "Copyright 2024 held by the owner/author(s). Publication rights licensed to ACM." ;
    dc:date "March 2024" ;
    prov:wasAttributedTo [
        a prov:Agent ;
        prov:actedOnBehalfOf brainkb:organization-1 ;
        prov:wasAssociatedWith brainkb:paper-1 ;
        foaf:name "Zhen Bi" ;
        foaf:mbox <mailto:bizhen_zju@zju.edu.cn>
    ], [
        a prov:Agent ;
        prov:actedOnBehalfOf brainkb:organization-2 ;
        prov:wasAssociatedWith brainkb:paper-1 ;
        foaf:name "Jing Chen" ;
        foaf:mbox <mailto:jingc0116@gmail.com>
    ], [
        a prov:Agent ;
        prov:actedOnBehalfOf brainkb:organization-2 ;
        prov:wasAssociatedWith brainkb:paper-1 ;
        foaf:name "Yinuo Jiang" ;
        foaf:mbox <mailto:3200100732@zju.edu.cn>
    ], [
        a prov:Agent ;
        prov:actedOnBehalfOf brainkb:organization-3 ;
        prov:wasAssociatedWith brainkb:paper-1 ;
        foaf:name "Feiyu Xiong" ;
        foaf:mbox <mailto:feiyu.xfy@zju.edu.cn>
    ], [
        a prov:Agent ;
        prov:actedOnBehalfOf brainkb:organization-3 ;
        prov:wasAssociatedWith brainkb:paper-1 ;
        foaf:name "Wei Guo" ;
        foaf:mbox <mailto:huaisu@taobao.com>
    ], [
        a prov:Agent ;
        prov:actedOnBehalfOf brainkb:organization-2 ;
        prov:wasAssociatedWith brainkb:paper-1 ;
        foaf:name "Huajun Chen" ;
        foaf:mbox <mailto:huajunsir@zju.edu.cn>
    ], [
        a prov:Agent ;
        prov:actedOnBehalfOf brainkb:organization-2 ;
        prov:wasAssociatedWith brainkb:paper-1 ;
        foaf:name "Ningyu Zhang" ;
        foaf:mbox <mailto:zhangningyu@zju.edu.cn>
    ].

brainkb:organization-1
    a foaf:Organization ;
    foaf:name "Zhejiang University".

brainkb:organization-2
    a foaf:Organization ;
    foaf:name "Zhejiang University-Ant Group Joint Laboratory of Knowledge Graph".

brainkb:organization-3
    a foaf:Organization ;
    foaf:name "Alibaba Group".
```
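The mapping from extracted metadata to Turtle like the output above can also be sketched deterministically, without an LLM. This is a minimal illustration, not the module's implementation: the `metadata_to_turtle` function, the `author-N` identifiers, and the input keys (`title`, `doi`, `authors`, `name`) are assumptions chosen for the example.

```python
def escape_literal(value: str) -> str:
    """Escape characters that would break a double-quoted Turtle literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def metadata_to_turtle(meta: dict) -> str:
    """Render a metadata dict (hypothetical schema) as PROV-O Turtle."""
    authors = meta.get("authors", [])
    author_ids = [f"brainkb:author-{i}" for i in range(1, len(authors) + 1)]

    # Predicates of the paper entity; joined with " ;" and closed with " ." below.
    paper = [
        "a prov:Entity",
        f'dc:title "{escape_literal(meta["title"])}"',
        f'dc:identifier "{escape_literal(meta["doi"])}"',
    ]
    if author_ids:
        paper.append("prov:wasAttributedTo " + ", ".join(author_ids))

    lines = [
        "@prefix brainkb: <https://brainkb.org/> .",
        "@prefix prov: <http://www.w3.org/ns/prov#> .",
        "@prefix foaf: <http://xmlns.com/foaf/0.1/> .",
        "@prefix dc: <http://purl.org/dc/elements/1.1/> .",
        "",
        "brainkb:paper-1",
        "    " + " ;\n    ".join(paper) + " .",
    ]

    # One prov:Agent node per author, referenced from the paper entity.
    for author_id, author in zip(author_ids, authors):
        lines += [
            "",
            author_id,
            "    a prov:Agent ;",
            f'    foaf:name "{escape_literal(author["name"])}" .',
        ]
    return "\n".join(lines) + "\n"


example = {
    "title": "CodeKGC: Code Language Model for Generative Knowledge Graph Construction",
    "doi": "10.1145/3641850",
    "authors": [{"name": "Zhen Bi"}, {"name": "Jing Chen"}],
}
print(metadata_to_turtle(example))
```

A fixed template like this gives up the flexibility of the LLM-generated output, but its structure does not vary between runs, which makes it a convenient baseline to compare against.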

Contact: tekraj@mit.edu

License: MIT


In [1]:
import os
from dotenv import load_dotenv
import warnings

warnings.filterwarnings("ignore", message="Model openai/gpt-4o-mini is not found")


load_dotenv()  # load variables from a local .env file into the environment

OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")

model = "meta-llama/llama-3.1-70b-instruct"
llm_config = {
    "model": model,
    "base_url": "https://openrouter.ai/api/v1",  # connect to OpenRouter
    "api_key": OPENROUTER_API_KEY,
}
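
Because `os.environ.get` returns `None` when the variable is missing, a missing key would only surface later as an authentication error. A minimal sketch of failing early is shown below; the `build_llm_config` helper and its `env` parameter are hypothetical and not part of the notebook.

```python
import os
from typing import Mapping


def build_llm_config(model: str, env: Mapping[str, str] = os.environ) -> dict:
    """Build the LLM config dict, raising early if the API key is missing."""
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; add it to .env or export it"
        )
    return {
        "model": model,
        "base_url": "https://openrouter.ai/api/v1",
        "api_key": api_key,
    }


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"OPENROUTER_API_KEY": "placeholder-key"}
config = build_llm_config("meta-llama/llama-3.1-70b-instruct", env=demo_env)
print(config["base_url"])
```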


In [2]:
from typing import Annotated

import PyPDF2


# This function can be registered as a tool for agent code execution.
def read_first_two_pages_with_metadata(file_path: Annotated[str, "input file path"]) -> str:
    """Return the text of the first two pages of the PDF at file_path."""
    text = ""
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)

        # Extract text from the first two pages only.
        for i, page in enumerate(reader.pages):
            if i >= 2:
                break
            page_text = page.extract_text() or ""
            # Drop characters that cannot be encoded as UTF-8 (e.g. lone surrogates).
            text += page_text.encode("utf-8", errors="ignore").decode("utf-8")

    return text
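
Text extracted from PDFs often contains words split by a hyphen at a line break and irregular spacing. The sketch below shows one possible light clean-up pass before the text is sent to the agents; `tidy_extracted_text` is a hypothetical helper, and the rules are heuristics (rejoining will also remove a genuine hyphen that happens to fall at a line end, and missing spaces between words are not restored).

```python
import re


def tidy_extracted_text(text: str) -> str:
    """Apply simple, lossy clean-up heuristics to PDF-extracted text."""
    # Rejoin words split by a hyphen at a line break: "Labora-\ntory" -> "Laboratory".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim leading and trailing whitespace on every line.
    return "\n".join(line.strip() for line in text.splitlines())


sample = "Joint Labora-\ntory  of Knowledge Graph "
print(tidy_extracted_text(sample))
```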

In [3]:
from autogen import ConversableAgent, AssistantAgent

metadata_extractor_agent = AssistantAgent(
    "MetadataExtractorAgent",
    system_message="""
                    Extract the following metadata from the given input text:
                
                    1. Paper Title: The title of the paper.
                    2. Authors: List of all authors.
                       - Affiliations: List all affiliations for each author (if any).
                       - Emails: List all email addresses for each author (if any).
                    3. DOI: The DOI of the paper, if available.
                    4. Editors: List of all editors.
                       - Affiliations: List all affiliations for each editor (if any).
                       - Emails: List all email addresses for each editor (if any).
                    5. Reviewers: List of all reviewers.
                       - Affiliations: List all affiliations for each reviewer (if any).
                       - Emails: List all email addresses for each reviewer (if any).
                    6. Key Dates:
                       - Date Received: The date the paper was received.
                       - Date Revised: The date the paper was revised.
                       - Date Published: The date the paper was published.
                    7. Publisher name: The name of the journal or conference or workshop in which the paper was published.
                    8. License Information: The license under which the paper is published, if specified.
                    
                    Please note:
                    - Each author, editor, and reviewer may have multiple affiliations and email addresses. Include all listed affiliations and emails for each person.
                    - If any of the above metadata is missing or unavailable, mark it as 'N/A.'
                    
                    Return the extracted information in a structured JSON format.
                    """,
    llm_config=llm_config,
    human_input_mode="NEVER"
)

turtle_agent = AssistantAgent(
    "ProvenanceGeneratorAgent",
    system_message="""You are an AI agent that converts the JSON provenance file into the turtle representation. 
    Convert the provided JSON file, which contains provenance metadata, into Turtle format. 
    Use https://brainkb.org as the default prefix and apply terms from the provenance ontology.""",
    llm_config=llm_config,
    human_input_mode="NEVER"
)

json_result_agent = ConversableAgent(
    "InitialResultFormatterAgent",
    system_message="""Extract only the JSON object from the text, including all nested data within curly braces {} and brackets []. Ignore any introductory text, explanations, or code formatting such as ```json around the JSON. """,
    llm_config=llm_config,
    human_input_mode="NEVER"
)


user_agent = ConversableAgent(
    "UserAgent",
    human_input_mode="NEVER",
)



task = read_first_two_pages_with_metadata("sample_pdf/3641850_.pdf")
chat_results = user_agent.initiate_chats(
    [
        {
            "recipient": metadata_extractor_agent,
            "message": task,
            "max_turns": 1,
            "summary_method": "last_msg",
        },
        {
            "recipient": json_result_agent,
            "message": "Extract JSON",
            "max_turns": 1,
            "summary_method": "last_msg",
        },
        {
            "recipient": turtle_agent,
            "message": """You are an AI agent that converts the JSON provenance file into the turtle representation. 
    Convert the provided JSON file, which contains provenance metadata, into Turtle format. 
    Use https://brainkb.org as the default prefix and apply terms from the provenance ontology.""",
            "max_turns": 2,
            "summary_method": "last_msg",
        },
       
    ]
)

print("Third chat summary:", chat_results[2].summary)

flaml.automl is not available. Please install flaml[automl] to enable AutoML functionalities.

********************************************************************************
Starting a new chat....
********************************************************************************
UserAgent (to MetadataExtractorAgent):

CodeKGC:CodeLanguageModelforGenerativeKnowledge
GraphConstruction
ZHENBI andJINGCHEN ,ZhejiangUniversity,Hangzhou,ChinaandZhejiangUniversity-AntGroup
Joint Laboratoryof Knowledge Graph, Hangzhou, China
YINUOJIANG ,ZhejiangUniversity,Hangzhou,ChinaandZhejiangUniversity—AntGroupJointLabora-
toryof Knowledge Graph,Hangzhou, China
FEIYU XIONG andWEIGUO ,Alibaba Group,Hangzhou, China
HUAJUN CHEN andNINGYU ZHANG ,Zhejiang University, Hangzhou, China and Zhejiang
University—Ant GroupJoint Laboratoryof Knowledge Graph, Hangzhou, China
Current generative knowledge graph construction approaches usually fail to capture structural knowledge
by simply flattening natural language into serialized texts or a specification language. However, l
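
The second step of the sequential chat asks an LLM agent to isolate the JSON object from the extractor's reply. As a point of comparison, the same step can be sketched deterministically with the standard library. `extract_first_json_object` is a hypothetical helper, not part of the notebook, and it assumes the reply contains at least one syntactically valid JSON object.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first valid JSON object embedded in free-form text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever follows it.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in text")


reply = 'Here is the metadata: {"title": "CodeKGC", "doi": "10.1145/3641850"} Hope this helps.'
print(extract_first_json_object(reply))
```

Unlike the agent-based step, this fails loudly when the reply contains no parseable JSON, which can make malformed extractor output easier to spot.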

The Turtle representation generated with the group chat:

```turtle
@prefix brainkb: <https://brainkb.org/>.
@prefix prov: <http://www.w3.org/ns/prov#>.
@prefix foaf: <http://xmlns.com/foaf/0.1/>.

brainkb:paper_1 a prov:Entity ;
  prov:wasDerivedFrom <https://doi.org/10.1145/3641850> ;
  foaf:title "CodeKGC: Code Language Model for Generative Knowledge Graph Construction" ;
  brainkb:hasPublisher brainkb:publisher_1 ;
  brainkb:hasLicense brainkb:license_1 ;
  brainkb:hasPublicationDate "March 2024" ;
  prov:generatedAtTime "2024-03-01T00:00:00Z"^^xsd:dateTime ;
  prov:hadPrimarySource <https://doi.org/10.1145/3641850> ;
  prov:wasDerivedFrom brainkb:previous_version.

brainkb:publisher_1 a foaf:Organization ;
  foaf:name "ACM" ;
  foaf:homepage <https://www.acm.org/> ;
  foaf:logo <https://www.acm.org/images/logo.png>.

brainkb:license_1 a prov:Entity ;
  foaf:description "Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.".

brainkb:author_1 a foaf:Person ;
  foaf:name "Zhen Bi" ;
  foaf:email <mailto:bizhen_zju@zju.edu.cn> ;
  brainkb:hasAffiliation brainkb:affiliation_1, brainkb:affiliation_2.

brainkb:author_2 a foaf:Person ;
  foaf:name "Jing Chen" ;
  foaf:email <mailto:jingc0116@gmail.com> ;
  brainkb:hasAffiliation brainkb:affiliation_1, brainkb:affiliation_2.

brainkb:author_3 a foaf:Person ;
  foaf:name "Yinuo Jiang" ;
  foaf:email <mailto:3200100732@zju.edu.cn> ;
  brainkb:hasAffiliation brainkb:affiliation_1, brainkb:affiliation_2.

brainkb:author_4 a foaf:Person ;
  foaf:name "Feiyu Xiong" ;
  foaf:email <mailto:feiyu.xfy@zju.edu.cn> ;
  brainkb:hasAffiliation brainkb:affiliation_3.

brainkb:author_5 a foaf:Person ;
  foaf:name "Wei Guo" ;
  foaf:email <mailto:huaisu@taobao.com> ;
  brainkb:hasAffiliation brainkb:affiliation_3.

brainkb:author_6 a foaf:Person ;
  foaf:name "Huajun Chen" ;
  foaf:email <mailto:huajunsir@zju.edu.cn> ;
  brainkb:hasAffiliation brainkb:affiliation_1, brainkb:affiliation_2.

brainkb:author_7 a foaf:Person ;
  foaf:name "Ningyu Zhang" ;
  foaf:email <mailto:zhangningyu@zju.edu.cn> ;
  brainkb:hasAffiliation brainkb:affiliation_1, brainkb:affiliation_2.

brainkb:affiliation_1 a foaf:Organization ;
  foaf:name "Zhejiang University, Hangzhou, China".

brainkb:affiliation_2 a foaf:Organization ;
  foaf:name "Zhejiang University-Ant Group Joint Laboratory of Knowledge Graph, Hangzhou, China".

brainkb:affiliation_3 a foaf:Organization ;
  foaf:name "Alibaba Group, Hangzhou, China".

brainkb:paper_1 prov:wasAttributedTo brainkb:author_1, brainkb:author_2, brainkb:author_3, brainkb:author_4, brainkb:author_5, brainkb:author_6, brainkb:author_7.

brainkb:previous_version a prov:Entity ;
  prov:generatedAtTime "2023-12-01T00:00:00Z"^^xsd:dateTime ;
  prov:wasDerivedFrom brainkb:even_earlier_version.
```
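
LLM-generated Turtle is worth checking before it is loaded into a store. For instance, the output above uses `xsd:dateTime` without an `@prefix xsd:` declaration, which Turtle requires. The sketch below is a rough regex-based check for such undeclared prefixes; `undeclared_prefixes` is a hypothetical helper, it is not a Turtle parser, and it assumes single-line string literals and the `@prefix` declaration form only.

```python
import re


def undeclared_prefixes(turtle: str) -> set[str]:
    """Return prefixes used in prefixed names but never declared via @prefix."""
    declared = set(re.findall(r"@prefix\s+([A-Za-z][\w-]*):", turtle))

    # Remove string literals first, then IRIs, so that text such as
    # "CodeKGC: Code ..." or <mailto:...> is not mistaken for a prefixed name.
    stripped = re.sub(r'"(?:[^"\\]|\\.)*"', " ", turtle)
    stripped = re.sub(r"<[^>\s]*>", " ", stripped)

    # A prefixed name looks like prefix:local, with the local part starting
    # right after the colon.
    used = set(re.findall(r"(?<![\w:])([A-Za-z][\w-]*):[A-Za-z_]", stripped))
    return used - declared


sample = """
@prefix prov: <http://www.w3.org/ns/prov#>.
@prefix ex: <https://example.org/>.
ex:paper a prov:Entity ;
  prov:generatedAtTime "2024-03-01T00:00:00Z"^^xsd:dateTime ;
  ex:title "CodeKGC: Code Language Model" .
"""
print(undeclared_prefixes(sample))
```

For anything beyond a quick sanity check, parsing the output with a real RDF library would be the more reliable option.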

In [4]:
from autogen import ConversableAgent, AssistantAgent, GroupChat, GroupChatManager
import os
from dotenv import load_dotenv
import warnings
from typing import Annotated
import yaml
import PyPDF2

warnings.filterwarnings("ignore", message="Model openai/gpt-4o-mini is not found")

load_dotenv()

OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")

model = "meta-llama/llama-3.1-70b-instruct"
llm_config = {
    "model": model,
    "base_url": "https://openrouter.ai/api/v1",  # connect to OpenRouter
    "api_key": OPENROUTER_API_KEY,
    "seed": 42,
}

# Define function to read first two pages of PDF with metadata
def read_first_two_pages_with_metadata(file_path: Annotated[str, "input file path"]) -> str:
    text = ""
    with open(file_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        
        # Extract text from the first two pages
        for i, page in enumerate(reader.pages):
            if i < 2:
                page_text = page.extract_text() or ""
                text += page_text.encode('utf-8', errors='ignore').decode('utf-8')
            else:
                break  # stop after first 2 pages
    
    return text

# Define agents
metadata_extractor_agent = AssistantAgent(
    "MetadataExtractorAgent",
    system_message="""
        Extract the following metadata from the given input text:
        
        1. Paper Title: The title of the paper.
        2. Authors: List of all authors.
           - Affiliations: List all affiliations for each author (if any).
           - Emails: List all email addresses for each author (if any).
        3. DOI: The DOI of the paper, if available.
        4. Editors: List of all editors.
           - Affiliations: List all affiliations for each editor (if any).
           - Emails: List all email addresses for each editor (if any).
        5. Reviewers: List of all reviewers.
           - Affiliations: List all affiliations for each reviewer (if any).
           - Emails: List all email addresses for each reviewer (if any).
        6. Key Dates:
           - Date Received: The date the paper was received.
           - Date Revised: The date the paper was revised.
           - Date Published: The date the paper was published.
        7. Publisher name: The name of the journal or conference or workshop in which the paper was published.
        8. License Information: The license under which the paper is published, if specified.
        
        Please return the extracted information in a structured JSON format.
    """,
    llm_config=llm_config,
    human_input_mode="NEVER"
)

turtle_agent = ConversableAgent(
    "ProvenanceGeneratorAgent",
    system_message="Convert the provided JSON metadata into Turtle format using https://brainkb.org as the default prefix and applying terms from the provenance ontology.",
    llm_config=llm_config,
    human_input_mode="NEVER"
)

json_result_agent = ConversableAgent(
    "InitialResultFormatterAgent",
    system_message="You are a JSON cleaner agent",
    llm_config=llm_config,
    human_input_mode="NEVER"
)

user_agent = ConversableAgent(
    "UserAgent",
    # Stop when a reply contains TERMINATE; `content` can be None, so fall back to "".
    is_termination_msg=lambda x: "TERMINATE" in (x.get("content") or ""),
    human_input_mode="NEVER",
)

# Message builders for the nested chats: each forwards the previous reply with an instruction.
def message(recipient, messages, sender, config):
    last_content = recipient.chat_messages_for_summary(sender)[-1]["content"]
    return f"Extract only the JSON object from the text, including all nested data within curly braces and brackets. Ignore any introductory text, explanations, or code formatting such as Markdown code fences around the JSON:\n\n{last_content}"

def turtle_msg(recipient, messages, sender, config):
    last_content = recipient.chat_messages_for_summary(sender)[-1]["content"]
    return f"Take the JSON provenance information from the context and convert it into the Turtle representation. Use https://brainkb.org as the default prefix and apply terms from the provenance ontology:\n\n{last_content}\n\nReturn TERMINATE once the Turtle generation is complete."

# Register nested chats with the user proxy agent
user_agent.register_nested_chats(
    [
        {
            "recipient": json_result_agent,
            "message": message,
            "max_turns": 1,
            "summary_method": "last_msg", 
           
        },
        {
            "recipient": turtle_agent,
            "message":turtle_msg,
            "max_turns": 1,
            "summary_method": "last_msg", 
        },
    ],
    trigger=metadata_extractor_agent,
)

 

# Start chat with metadata extraction
task = read_first_two_pages_with_metadata("sample_pdf/3641850_.pdf")
print(f"Task to metadata_extractor_agent: {task}")

# Initiate chat
chat_result = user_agent.initiate_chat(
    recipient=metadata_extractor_agent,
    message=task,
    max_turns=2,
    summary_method="last_msg",
)

Task to metadata_extractor_agent: CodeKGC:CodeLanguageModelforGenerativeKnowledge
GraphConstruction
ZHENBI andJINGCHEN ,ZhejiangUniversity,Hangzhou,ChinaandZhejiangUniversity-AntGroup
Joint Laboratoryof Knowledge Graph, Hangzhou, China
YINUOJIANG ,ZhejiangUniversity,Hangzhou,ChinaandZhejiangUniversity—AntGroupJointLabora-
toryof Knowledge Graph,Hangzhou, China
FEIYU XIONG andWEIGUO ,Alibaba Group,Hangzhou, China
HUAJUN CHEN andNINGYU ZHANG ,Zhejiang University, Hangzhou, China and Zhejiang
University—Ant GroupJoint Laboratoryof Knowledge Graph, Hangzhou, China
Current generative knowledge graph construction approaches usually fail to capture structural knowledge
by simply flattening natural language into serialized texts or a specification language. However, large gen-
erative language model trained on structured data such as code has demonstrated impressive capability in
understandingnaturallanguageforstructuralpredictionandreasoningtasks.Intuitively,weaddressthetask
ofgenerativeknowl