# Convert text to a graph

This notebook demonstrates how to extract a knowledge graph from any text using the `knowledge_graph_maker` library.

Steps:
1. Split the document
2. Construct the knowledge graph
3. Save the graph to a graph database

In [1]:
import os

In [2]:
os.getenv("OPENAI_API_KEY")

'sk-proj-<redacted>'

In [3]:
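# The OpenAI clients read OPENAI_API_KEY from the environment, so it only has to be set, not reassigned.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running this notebook")
os.environ["GROQ_API_KEY"] = "TEST"  # placeholder value; this notebook only calls OpenAI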
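
Echoing `os.getenv("OPENAI_API_KEY")` in a cell writes the secret into the saved notebook. A minimal sketch of a safer check follows; `mask_secret` is a hypothetical helper, not part of any library used here, and it shows only whether a key is set plus its last few characters.

```python
import os


def mask_secret(value, visible=4):
    """Return a display-safe version of a secret (hypothetical helper)."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to reveal anything safely
        return "*" * len(value)
    return "*" * 8 + value[-visible:]


# Prints a masked value instead of the key itself
print(mask_secret(os.getenv("OPENAI_API_KEY")))
```

The fixed run of eight asterisks hides the real length of the key as well as its content.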

In [4]:
from knowledge_graph_maker import GraphMaker, Ontology, OpenAIClient
from knowledge_graph_maker import Document

## Split the document

In [5]:
from langchain import hub
from langchain_openai import ChatOpenAI
from langchain.chains import create_extraction_chain_pydantic
from langchain_core.pydantic_v1 import BaseModel
from typing import Optional, List

### Extract propositions from each paragraph

In this part, we use an LLM to extract standalone statements (propositions) from a raw piece of text.

Example: Greg went to the park. He likes walking > ['Greg went to the park.', 'Greg likes walking']

Pulling out propositions is done via a well-crafted prompt. I'm going to pull it from LangChain Hub, LangChain's home for prompts.

The pulled prompt instructs the LLM to extract propositions.

In [6]:
obj = hub.pull("wfh/proposal-indexing")

Please use the `langsmith sdk` instead:
  pip install langsmith
Use the `pull_prompt` method.
  res_dict = client.pull_repo(owner_repo_commit)


We will use gpt-4o-mini to extract the propositions.

In [7]:
# chunking_llm = ChatOpenAI(model='gpt-4o-mini', openai_api_key = os.getenv("OPENAI_API_KEY", 'YouKey'))

In [7]:
chunking_llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)

Then I'll make a runnable with LangChain. Piping the prompt into the LLM is a short way to combine the two.

In [8]:
# use it in a runnable
runnable = obj | chunking_llm

The output from the runnable is a JSON-like structure in a string, so we need to pull the sentences out. LangChain's example extraction was giving me a hard time, so I'm doing it manually with a pydantic data class. There is definitely room to improve this.

Create your class, then bind it to the LLM with `with_structured_output`.

In [9]:
# Pydantic data class
class Sentences(BaseModel):
    sentences: List[str]

# Extraction
structured_llm = chunking_llm.with_structured_output(Sentences)
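
If structured output is not an option, the sentences can also be pulled out of the raw string by hand. The sketch below is an alternative under assumptions, not what this notebook runs: it supposes the model returns either a JSON array of strings or a JSON object with a `sentences` key, possibly wrapped in a Markdown code fence. `parse_sentences` is a hypothetical helper name.

```python
import json
import re


def parse_sentences(raw):
    """Extract a list of sentences from a JSON-like string (hypothetical helper)."""
    text = raw.strip()
    # Drop a leading or trailing Markdown code fence (three backticks), if present
    text = re.sub(r"^`{3}(?:json)?\s*|\s*`{3}$", "", text)
    data = json.loads(text)
    # Accept {"sentences": [...]} as well as a bare list
    if isinstance(data, dict):
        data = data.get("sentences", [])
    if not isinstance(data, list):
        raise ValueError("expected a list of sentences")
    # Normalise whitespace and skip empty entries
    return [str(item).strip() for item in data if str(item).strip()]
```

Anything that is not valid JSON raises `json.JSONDecodeError`, which is the main reason the structured-output route is usually less fragile.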

Define a function to extract propositions

In [10]:
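def get_propositions(text):
    # Step 1: the prompt | LLM runnable returns the propositions as a string
    runnable_output = runnable.invoke({"input": text}).content

    # Step 2: structured output parses that string into a list of sentences
    propositions = structured_llm.invoke(runnable_output).sentences
    return propositions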

Load source document

In [11]:
with open('./data_input/txt/eldenring.txt') as file:
    essay = file.read()

Then you need to decide what you send to the proposition extractor. The prompt has an example that is about 1K characters long, so I would experiment with what works for you. This isn't another chunking decision; just pick something reasonable and try it out.

I'm using paragraphs.

Split the document at each blank line.

In [12]:
paragraphs = essay.split("\n\n")
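
A plain `split("\n\n")` keeps empty strings when there are several blank lines in a row and does not split on "blank" lines that contain spaces. A small sketch of a more forgiving splitter, assuming that behaviour matters for your input; `split_paragraphs` is a hypothetical helper name.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines, ignoring whitespace-only lines and empty pieces."""
    # A newline, optional whitespace (including further newlines), then a newline
    parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]
```

Single line breaks inside a paragraph are left alone, so bullet lists like the second paragraph of this document stay together.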

Let's see how many we have

In [13]:
len(paragraphs)

81

In [14]:
paragraphs

['Lore in Elden Ring covers all the information related to the world and mythos of the game. From past events that occurred long ago, to the history of the Lands Between and the mystery of the elusive Elden Ring, all can be found here. Please note that this page contains heavy spoilers.',
 '- The Mythos of Elden Ring was written by George R. R. Martin. Mythos refers to the overall narrative theme or plot structure. It can also be interpreted as a belief system.\n- The Story of Elden Ring was written by Hidetaka Miyazaki and his team at FromSoftware\n- The Lore and Story of Elden Ring are told in a similar manner to other FromSofware "Souls" games, so players should expect to find plenty left to interpretation as well as seemingly contradictory statements about various game characters, elements, and concepts.',
 'Elden Ring Lore Overview',
 'The storyteller folds her slender hands - both pairs - and speaks. “It happened an age ago. But when I recall, I see it true.” So begins the tale o

Here only the first two paragraphs are kept. For each paragraph, extract the propositions and add them to a single list of propositions.

In [15]:
paragraphs = paragraphs[0:2]
paragraphs

['Lore in Elden Ring covers all the information related to the world and mythos of the game. From past events that occurred long ago, to the history of the Lands Between and the mystery of the elusive Elden Ring, all can be found here. Please note that this page contains heavy spoilers.',
 '- The Mythos of Elden Ring was written by George R. R. Martin. Mythos refers to the overall narrative theme or plot structure. It can also be interpreted as a belief system.\n- The Story of Elden Ring was written by Hidetaka Miyazaki and his team at FromSoftware\n- The Lore and Story of Elden Ring are told in a similar manner to other FromSofware "Souls" games, so players should expect to find plenty left to interpretation as well as seemingly contradictory statements about various game characters, elements, and concepts.']

In [16]:
# Credit before extracting propositions $28.13
proposition_list = []

for para in paragraphs:
    propositions = get_propositions(para)
    
    proposition_list.extend(propositions)
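
With 81 paragraphs, one failed LLM call would abort the whole loop. A minimal sketch of a more forgiving loop, assuming you would rather record failures and continue; `extract_all` is a hypothetical helper, and the extractor is passed in so any function with the same shape as `get_propositions` can be used.

```python
def extract_all(paragraphs, extractor):
    """Apply extractor to each paragraph; collect propositions and failures separately."""
    propositions, failed = [], []
    for index, para in enumerate(paragraphs):
        try:
            propositions.extend(extractor(para))
        except Exception as exc:
            # Keep going, but remember which paragraph failed and why
            failed.append((index, repr(exc)))
    return propositions, failed
```

In this notebook it could be called as `proposition_list, failed = extract_all(paragraphs, get_propositions)`, and the paragraphs listed in `failed` can be retried later.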

Let's take a look at what the propositions look like

In [17]:
# Credit after extracting propositions $28.08
print(f"You have {len(proposition_list)} propositions")
proposition_list

You have 17 propositions


['Lore in Elden Ring covers all the information related to the world of Elden Ring.',
 'Lore in Elden Ring covers all the information related to the mythos of Elden Ring.',
 'Lore in Elden Ring includes past events that occurred long ago.',
 'Lore in Elden Ring includes the history of the Lands Between.',
 'Lore in Elden Ring includes the mystery of the elusive Elden Ring.',
 'All information related to the world and mythos of Elden Ring can be found on this page.',
 'This page contains heavy spoilers.',
 'The Mythos of Elden Ring was written by George R. R. Martin.',
 'Mythos refers to the overall narrative theme or plot structure.',
 'Mythos can also be interpreted as a belief system.',
 'The Story of Elden Ring was written by Hidetaka Miyazaki.',
 'Hidetaka Miyazaki was part of a team at FromSoftware.',
 "The Lore and Story of Elden Ring are told in a similar manner to other FromSoftware 'Souls' games.",
 'Players should expect to find plenty left to interpretation in Elden Ring.',


In [None]:
proposition_list

#### Save propositions to text file

In [19]:

# file_name = 'propositions.txt'

# # Open the file and write the contents
# with open(file_name, 'w') as file:
#     # Add a newline character to each line
#     file.writelines(f"{line}\n" for line in proposition_list)
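
Writing one proposition per line works until a proposition itself contains a line break. A sketch of a JSON round trip that avoids this, assuming the propositions are plain strings; the function names and the file name in the usage note are hypothetical.

```python
import json
from pathlib import Path


def save_propositions(propositions, path):
    """Write the list of propositions to a JSON file."""
    payload = json.dumps(list(propositions), ensure_ascii=False, indent=2)
    Path(path).write_text(payload, encoding="utf-8")


def load_propositions(path):
    """Read the list of propositions back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, `save_propositions(proposition_list, "propositions.json")` after extraction and `load_propositions("propositions.json")` in a later session, which avoids paying for the extraction twice.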

In [20]:
# from propositions3 import eldenring_propositions as erp

In [21]:
# print (f"You have {len(erp)} propositions")

So you'll see that they look like regular sentences, but they are actually statements that are able to stand on their own. For example, one of the sentences in the raw text is "They meant well, but this is rarely true." If you were to chunk that on its own, the LLM would have no idea who you're talking about. Who meant well? What is rarely true? The propositions resolve those references.

### Group each proposition to a chunk (Optional)

Use an LLM that can reason about each proposition and determine whether it should be part of an existing chunk or start a new chunk.

(Note: After testing, I found that grouping the propositions into chunks is not very effective.)

In [22]:
# from agentic_chunker import AgenticChunker

In [23]:
# ac = AgenticChunker()

In [24]:
# ac.add_propositions(erp)

In [25]:
# ac.pretty_print_chunks()

In [26]:
# ac.pretty_print_chunk_outline()

In [27]:
# print(ac.get_chunks(get_type='list_of_strings'))

In [28]:
# mylist = ac.get_chunks(get_type='list_of_strings')

In [29]:
# len(mylist)

In [30]:
# mylist

## Create the knowledge graph

### Define the Ontology

The ontology is a pydantic model with the following schema. 

```python
class Ontology(BaseModel):
    labels: List[Union[str, Dict]]
    relationships: List[str]
```



In [18]:
ontology = Ontology(
    labels=[
        {"Person": "Person name without any adjectives, Remember a person may be referenced by their name or using a pronoun"},
        {"Object": "Objects are inanimate things that a person uses, Do not add the definite article 'the' in the object name"},
        {"Event": "Event involving multiple persons. Do not include qualifiers or verbs like gives, leaves, works etc."},
        {"Place": "Places are locations where specific events took place and where persons can go to and where objects can be found"},
        {"Faction": "Name of a group that a person belongs to, without any adjectives. Remember, a person may only belong to one faction"},
        {"Miscellaneous": "Any important concept that cannot be categorised with any other given label"},
    ],
    relationships=[
        "Relation between any pair of Entities"
        ],
)
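
Because the label list may mix plain strings with `{name: description}` dicts, it helps to be able to list just the names, for instance to check later that every extracted node uses a known label. A minimal sketch; `label_names` is a hypothetical helper and not part of `knowledge_graph_maker`.

```python
def label_names(labels):
    """Return the label names from a mixed list of strings and {name: description} dicts."""
    names = []
    for entry in labels:
        if isinstance(entry, dict):
            # A dict entry maps the label name to its description
            names.extend(entry.keys())
        else:
            names.append(entry)
    return names
```

Called on the list passed to `Ontology` above, it would yield the six names from `Person` to `Miscellaneous`.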

### Select an LLM

This model will:
- Generate a summary of each proposition
- Extract the nodes and edges from the pydantic documents


In [19]:
## OpenAI model
oai_model = "gpt-4o-mini"

llm = OpenAIClient(model=oai_model, temperature=0.1, top_p=0.5)

### Create pydantic documents from chunks

`Document` is a pydantic model with the following schema:

```python
class Document(BaseModel):
    text: str
    metadata: dict
```

The metadata we add to the document here is copied to every relation that is extracted out of the document. More often than not, a pair of nodes has multiple relations with each other. The metadata helps add more context to these relations.

In this example, I generate a summary of each text chunk and record the timestamp of the run, to be used as metadata.

In [20]:
import datetime
current_time = str(datetime.datetime.now())

In [21]:
def generate_summary(text):
    SYS_PROMPT = (
        "Succinctly summarise the text provided by the user. "
        "Respond only with the summary and no other comments"
    )
    try:
        return llm.generate(user_message=text, system_message=SYS_PROMPT)
    except Exception:
        # Fall back to an empty summary if the LLM call fails
        return ""
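
Falling back to an empty summary on the first error hides transient failures such as rate limits. A sketch of a retry wrapper with exponential backoff, assuming a few retries are acceptable before giving up; `with_retries` and its parameters are hypothetical, and `sleep` is injectable only so the function is easy to test.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the last error
                raise
            sleep(base_delay * 2 ** attempt)
```

Inside the summary function this could wrap the call as `with_retries(lambda: llm.generate(user_message=text, system_message=SYS_PROMPT))`, with the empty-string fallback kept for the case where every attempt fails.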

In [35]:
# proposition_list = erp
# proposition_list

In [22]:
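# Build the documents eagerly: a lazy map() can only be consumed once,
# so a second list(docs) would silently return an empty list.
docs = [
    Document(text=t, metadata={"summary": generate_summary(t), "generated_at": current_time})
    for t in proposition_list
]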

### Generate graph from pydantic documents

In [23]:
graph_maker = GraphMaker(ontology=ontology, llm_client=llm, verbose=True)

In [38]:
# mylist = list(docs)

#### Save Documents to text file

In [39]:
# # Specify the file name
# file_name = 'documents.txt'

# # Open the file and write the contents
# with open(file_name, 'w') as file:
#     # Add a newline character to each line
#     file.writelines(f"{line}\n" for line in mylist)

In [46]:
from documents import doclist as dl

In [None]:
mylist = list(dl)
mylist

In [None]:
mylist[0:1]

In [None]:
# graph = graph_maker.from_documents(
#     mylist, 
#     delay_s_between=0  
#     )

#### Generate Edges of Graph

In [None]:
# credit before creating edges - $28.06
graph = graph_maker.from_documents(
    list(docs),
    delay_s_between=0,  # delay between documents; raise it if the API rate limit is hit (the Groq API maxes out pretty fast)
)
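
The idea behind a delay between documents is simple throttling. The sketch below only illustrates that idea and is not the library's implementation; `process_with_delay` and its parameters are hypothetical.

```python
import time


def process_with_delay(items, handler, delay_s=0.0, sleep=time.sleep):
    """Run handler on each item, pausing delay_s seconds between consecutive items."""
    results = []
    for index, item in enumerate(items):
        # No pause before the first item, only between items
        if index > 0 and delay_s > 0:
            sleep(delay_s)
        results.append(handler(item))
    return results
```

A fixed pause is the bluntest way to stay under a rate limit; it trades total run time for fewer rejected requests.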

In [25]:
graph

[Edge(node_1=Node(label='Miscellaneous', name='Lore'), node_2=Node(label='Place', name='Elden Ring'), relationship='Lore contains information related to the world of Elden Ring.', metadata={'summary': 'Lore in Elden Ring encompasses all information pertaining to its world.', 'generated_at': '2024-09-18 23:29:57.130501'}, order=0),
 Edge(node_1=Node(label='Miscellaneous', name='Lore'), node_2=Node(label='Miscellaneous', name='Elden Ring'), relationship='Lore contains information related to the mythos of Elden Ring.', metadata={'summary': 'Lore in Elden Ring encompasses the mythos and background information of the game.', 'generated_at': '2024-09-18 23:29:57.130501'}, order=1),
 Edge(node_1=Node(label='Miscellaneous', name='Lore'), node_2=Node(label='Event', name='past events'), relationship='Lore includes past events that occurred long ago.', metadata={'summary': "Elden Ring's lore encompasses historical events from the distant past.", 'generated_at': '2024-09-18 23:29:57.130501'}, orde

#### Save Edges to text file

In [67]:
# # Specify the file name
# file_name = 'graph.txt'

# # Open the file and write the contents
# with open(file_name, 'w') as file:
#     # Add a newline character to each line
#     file.writelines(f"{line}\n" for line in graph)
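
Writing `f"{line}\n"` for each edge stores its Python repr, which then has to be turned back into objects by hand. A sketch of a JSON Lines round trip instead, assuming each edge has first been converted to a plain dict (for example with `edge.model_dump()`, as used further down); the function names are hypothetical.

```python
import json


def edges_to_jsonl(edge_dicts, path):
    """Write one JSON object per line, one line per edge."""
    with open(path, "w", encoding="utf-8") as handle:
        for edge in edge_dicts:
            handle.write(json.dumps(edge, ensure_ascii=False) + "\n")


def edges_from_jsonl(path):
    """Read the edge dicts back, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

This works here because the metadata holds only strings; values such as `datetime` objects would need converting before `json.dumps`.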

In [32]:
from edges import edgelist as el

In [33]:
el

[Edge(node_1=Node(label='Miscellaneous', name='Lore'), node_2=Node(label='Place', name='Elden Ring'), relationship='Lore contains information related to the world of Elden Ring.', metadata={'summary': 'Lore in Elden Ring encompasses all the information about its world.', 'generated_at': '2024-09-18 17:05:35.996718'}, order=0),
 Edge(node_1=Node(label='Miscellaneous', name='Lore'), node_2=Node(label='Miscellaneous', name='Elden Ring'), relationship='Lore contains information related to the mythos of Elden Ring.', metadata={'summary': 'Lore in Elden Ring encompasses the mythos and background information of the game.', 'generated_at': '2024-09-18 17:05:35.996718'}, order=1),
 Edge(node_1=Node(label='Miscellaneous', name='Lore'), node_2=Node(label='Event', name='past events'), relationship='Lore includes past events that occurred long ago.', metadata={'summary': "Elden Ring's lore encompasses historical events from the distant past.", 'generated_at': '2024-09-18 17:05:35.996718'}, order=2)

In [35]:
for edge in el:
    print(edge.model_dump(exclude={'metadata'}), "\n\n")

{'node_1': {'label': 'Miscellaneous', 'name': 'Lore'}, 'node_2': {'label': 'Place', 'name': 'Elden Ring'}, 'relationship': 'Lore contains information related to the world of Elden Ring.', 'order': 0} 


{'node_1': {'label': 'Miscellaneous', 'name': 'Lore'}, 'node_2': {'label': 'Miscellaneous', 'name': 'Elden Ring'}, 'relationship': 'Lore contains information related to the mythos of Elden Ring.', 'order': 1} 


{'node_1': {'label': 'Miscellaneous', 'name': 'Lore'}, 'node_2': {'label': 'Event', 'name': 'past events'}, 'relationship': 'Lore includes past events that occurred long ago.', 'order': 2} 


{'node_1': {'label': 'Miscellaneous', 'name': 'Lore'}, 'node_2': {'label': 'Place', 'name': 'Lands Between'}, 'relationship': 'Lore includes the history of the Lands Between.', 'order': 3} 


{'node_1': {'label': 'Miscellaneous', 'name': 'Lore'}, 'node_2': {'label': 'Miscellaneous', 'name': 'Elden Ring'}, 'relationship': 'Lore includes the mystery of Elden Ring.', 'order': 4} 


{'node_1': 
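
In the output above, `Elden Ring` appears once as a `Place` and once as `Miscellaneous`. A small sketch for spotting such conflicts before saving the graph, assuming the edges are available as dicts shaped like the printed ones; `conflicting_labels` is a hypothetical helper.

```python
from collections import defaultdict


def conflicting_labels(edge_dicts):
    """Return {node name: sorted labels} for names that carry more than one label."""
    labels_by_name = defaultdict(set)
    for edge in edge_dicts:
        for key in ("node_1", "node_2"):
            node = edge[key]
            labels_by_name[node["name"]].add(node["label"])
    return {
        name: sorted(labels)
        for name, labels in labels_by_name.items()
        if len(labels) > 1
    }
```

Names that show up here are candidates for tightening the label descriptions in the ontology or for merging by hand.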

## Save the graph to a graph database

In [36]:
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "abc123456"
os.environ["NEO4J_URI"] = "bolt://localhost:7687"


In [37]:
from knowledge_graph_maker import Neo4jGraphModel

create_indices = False
neo4j_graph = Neo4jGraphModel(edges=el, create_indices=create_indices)
neo4j_graph.save()
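
To give a feel for what saving an edge to Neo4j involves, here is a sketch that builds a parameterized Cypher `MERGE` statement from one edge dict. It is an illustration under assumptions and not how `Neo4jGraphModel` works internally: the relationship type `RELATED_TO`, the `description` property and the helper name `edge_to_cypher` are all hypothetical. Node labels generally cannot be supplied as ordinary query parameters, so they are validated and interpolated, while the values go in as parameters.

```python
import re

# Only allow simple identifiers as labels, since they are interpolated into the query
_LABEL_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def edge_to_cypher(edge):
    """Build (query, params) that merges both nodes and the relationship of one edge dict."""
    label_1 = edge["node_1"]["label"]
    label_2 = edge["node_2"]["label"]
    for label in (label_1, label_2):
        if not _LABEL_RE.match(label):
            raise ValueError(f"unsafe label: {label!r}")
    query = (
        f"MERGE (a:{label_1} {{name: $name_1}}) "
        f"MERGE (b:{label_2} {{name: $name_2}}) "
        "MERGE (a)-[r:RELATED_TO {description: $description}]->(b)"
    )
    params = {
        "name_1": edge["node_1"]["name"],
        "name_2": edge["node_2"]["name"],
        "description": edge["relationship"],
    }
    return query, params
```

`MERGE` rather than `CREATE` keeps reruns from duplicating nodes that already exist with the same label and name.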