# Convert text to a graph

This notebook demonstrates how to extract a graph from any text using the graph maker.

Steps:
1. Split the document
2. Construct the knowledge graph
3. Save the graph to a graph database

In [1]:
import os

In [2]:
from knowledge_graph_maker import GraphMaker, Ontology, OllamaClient
from knowledge_graph_maker import Document

INFO:numexpr.utils:Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.


## Split the document

In [3]:
from langchain import hub
from langchain.chains import create_extraction_chain_pydantic
from langchain_core.pydantic_v1 import BaseModel
from typing import Optional, List

### Extract propositions from each paragraph

In this part, we will use an LLM to extract standalone statements from a raw piece of text.

Example: Greg went to the park. He likes walking > ['Greg went to the park.', 'Greg likes walking']

Pulling out propositions is done via a well-crafted prompt. I'm going to pull it from LangChain Hub, LangChain's home for prompts.

Use a custom prompt to instruct an LLM to extract propositions.

In [9]:
obj = hub.pull("wfh/proposal-indexing")

Please use the `langsmith sdk` instead:
  pip install langsmith
Use the `pull_prompt` method.
  res_dict = client.pull_repo(owner_repo_commit)


We will be using llama3.1, served locally through Ollama, to extract the propositions

In [10]:
# from langchain_ollama.llms import OllamaLLM
from langchain_ollama import ChatOllama

In [11]:
chunking_llm = ChatOllama(
    model="llama3.1",
    temperature=0,
    # other params...
)

Then I'll make a runnable with LangChain, which is a short way to combine the prompt and the LLM

In [12]:
# use it in a runnable
runnable = obj | chunking_llm

The output from the runnable is a JSON-like structure inside a string, so we need to pull the sentences out. I found that LangChain's example extraction was giving me a hard time, so I'm doing it manually with a pydantic data class. There is definitely room to improve this.

Create your class, then bind it to the LLM with `with_structured_output`.

In [13]:
# Pydantic data class
class Sentences(BaseModel):
    sentences: List[str]

# Extraction
structured_llm = chunking_llm.with_structured_output(Sentences)
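Since the runnable usually returns a JSON array embedded in free-form text, that array could also be pulled out with the standard library instead of a second LLM call. The following is only a minimal sketch under that assumption: `extract_json_list` is a hypothetical helper, not part of LangChain, and the sample string is made up for illustration.

```python
import json
import re
from typing import List


def extract_json_list(raw: str) -> List[str]:
    """Return the first JSON array of strings found in `raw`, or [] if none parses."""
    # Greedy match from the first '[' to the last ']' in the text.
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        return []
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Keep only string items; anything else is ignored.
    return [item for item in parsed if isinstance(item, str)]


sample = (
    'Here is the decomposed content:\n\n[\n  "Greg went to the park.",\n'
    '  "Greg likes walking."\n]\n\nI followed the steps.'
)
print(extract_json_list(sample))
```

This approach breaks down when the model answers with a numbered list or adds bracketed text after the array, which is one reason the structured-output route is used here.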

In [9]:
res = structured_llm.invoke("Lore in Elden Ring covers all the information related to the world and mythos of the game. From past events that occurred long ago, to the history of the Lands Between and the mystery of the elusive Elden Ring, all can be found here. Please note that this page contains heavy spoilers.")

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Inspect the result before defining a function to extract propositions

In [10]:
res

Sentences(sentences=['The lore in Elden Ring covers all the information related to the world and mythos of the game.', 'From past events that occurred long ago, to the history of the Lands Between and the mystery of the elusive Elden Ring, all can be found here.', 'Please note that this page contains heavy spoilers.'])

In [11]:
res.sentences

['The lore in Elden Ring covers all the information related to the world and mythos of the game.',
 'From past events that occurred long ago, to the history of the Lands Between and the mystery of the elusive Elden Ring, all can be found here.',
 'Please note that this page contains heavy spoilers.']

In [134]:
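def get_propositions(text):
    """Decompose `text` into standalone propositions using the two-step LLM pipeline."""
    # Step 1: the hub prompt + LLM return free-form text containing the propositions
    runnable_output = runnable.invoke({"input": text})

    print("Runnable Output:", runnable_output)
    print("Runnable Output Content:", runnable_output.content)

    # Step 2: coerce that text into the Sentences schema
    res = structured_llm.invoke(runnable_output.content)
    return res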

Load source document

In [15]:
with open('./data_input/txt/eldenring.txt') as file:
    essay = file.read()

Then you need to decide what you send to your proposition maker. The prompt has an example that is about 1K characters long, so I would experiment with what works for you. This isn't another chunking decision; just pick something reasonable and try it out.

I'm using paragraphs

Split the document at each blank line (paragraph break)

In [17]:
paragraphs = essay.split("\n\n")

Let's see how many we have

In [18]:
len(paragraphs)

81
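Splitting on blank lines also yields headings and other short fragments that are not worth sending to the LLM. A minimal sketch of a cleanup step is shown here; `split_paragraphs` and its `min_chars` parameter are hypothetical names, and the default of 50 characters is an assumption that mirrors the length check used later in this notebook.

```python
from typing import List


def split_paragraphs(text: str, min_chars: int = 50) -> List[str]:
    """Split on blank lines, strip whitespace, and keep fragments longer than `min_chars`."""
    candidates = (p.strip() for p in text.split("\n\n"))
    return [p for p in candidates if len(p) > min_chars]


demo = (
    "Short heading\n\n"
    + "A longer paragraph about the Lands Between. " * 2
    + "\n\n\n\nTiny"
)
print(split_paragraphs(demo))
```

Only the long middle paragraph survives in this demo; the heading, the empty fragment, and the trailing short line are dropped.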

In [19]:
paragraphs

['Lore in Elden Ring covers all the information related to the world and mythos of the game. From past events that occurred long ago, to the history of the Lands Between and the mystery of the elusive Elden Ring, all can be found here. Please note that this page contains heavy spoilers.',
 '- The Mythos of Elden Ring was written by George R. R. Martin. Mythos refers to the overall narrative theme or plot structure. It can also be interpreted as a belief system.\n- The Story of Elden Ring was written by Hidetaka Miyazaki and his team at FromSoftware\n- The Lore and Story of Elden Ring are told in a similar manner to other FromSofware "Souls" games, so players should expect to find plenty left to interpretation as well as seemingly contradictory statements about various game characters, elements, and concepts.',
 'Elden Ring Lore Overview',
 'The storyteller folds her slender hands - both pairs - and speaks. “It happened an age ago. But when I recall, I see it true.” So begins the tale o

For each paragraph, extract the propositions and add them to a list of propositions

In [192]:
print(paragraphs[9:10])
propo = get_propositions(paragraphs[9:10])
propo

["White Mask Varre\nThe White Masks encountered in Mohgwyn Palace were initially abducted by Mogh, Lord of Blood and turned into Bloody Fingers through an unknown ritual involving Mogh's accursed Omen blood. Of the four White Masks the Tarnished encounters in game, only Varre was able to withstand and tame the accursed blood. Despite being abducted by the Lord of Blood, he remains loyal to the Mohgwyn Dynasty and tries to indoctrinate the Tarnished into the dynasty. Even as he is dying at the at the end of his questline, forsaken by his Lord, he calls out to Mohg and blesses the Mohgwyn Dynasty."]


INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Runnable Output: content='Here is the decomposed content:\n\n[\n  "The White Masks encountered in Mohgwyn Palace were initially abducted.",\n  "Mogh, Lord of Blood, abducted the White Masks.",\n  "Mogh used an unknown ritual involving Mogh\'s accursed Omen blood to transform the White Masks into Bloody Fingers.",\n  "Varre was one of the four White Masks that the Tarnished encountered in game.",\n  "Varre was able to withstand and tame Mogh\'s accursed Omen blood.",\n  "Despite being abducted by Mogh, Varre remains loyal to the Mohgwyn Dynasty.",\n  "Varre tries to indoctrinate the Tarnished into the Mohgwyn Dynasty.",\n  "As Varre is dying at the end of his questline, he is forsaken by his Lord, Mohg.",\n  "Even as he is dying, Varre calls out to Mohg and blesses the Mohgwyn Dynasty."\n]\n\nI followed the steps you provided:\n\n1. Split compound sentences into simple sentences.\n2. Separated named entities with descriptive information into distinct propositions.\n3. Decontextualized t

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Sentences(sentences=['The White Masks were initially abducted.', 'Mogh, Lord of Blood, abducted the White Masks.', "Mogh used an unknown ritual involving Mogh's accursed Omen blood to transform the White Masks into Bloody Fingers.", 'Varre was one of the four White Masks that the Tarnished encountered in game.', "Varre was able to withstand and tame Mogh's accursed Omen blood.", 'Despite being abducted by Mogh, Varre remains loyal to the Mohgwyn Dynasty.', 'Varre tries to indoctrinate the Tarnished into the Mohgwyn Dynasty.', 'As Varre is dying at the end of his questline, he is forsaken by his Lord, Mohg.', 'Even as he is dying, Varre calls out to Mohg and blesses the Mohgwyn Dynasty.'])

In [191]:
print(paragraphs[9:10][0])
propo = get_propositions(paragraphs[9:10][0])
propo

White Mask Varre
The White Masks encountered in Mohgwyn Palace were initially abducted by Mogh, Lord of Blood and turned into Bloody Fingers through an unknown ritual involving Mogh's accursed Omen blood. Of the four White Masks the Tarnished encounters in game, only Varre was able to withstand and tame the accursed blood. Despite being abducted by the Lord of Blood, he remains loyal to the Mohgwyn Dynasty and tries to indoctrinate the Tarnished into the dynasty. Even as he is dying at the at the end of his questline, forsaken by his Lord, he calls out to Mohg and blesses the Mohgwyn Dynasty.


INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Runnable Output: content='Here is the decomposed content:\n\n1. "The White Masks encountered in Mohgwyn Palace were initially abducted by Mogh."\n2. "Mogh is the Lord of Blood."\n3. "An unknown ritual involving Mogh\'s accursed Omen blood was used to turn the White Masks into Bloody Fingers."\n4. "Varre is one of the four White Masks that the Tarnished encounters in game."\n5. "Varre was able to withstand and tame Mogh\'s accursed Omen blood."\n6. "Despite being abducted by Mogh, Varre remains loyal to the Mohgwyn Dynasty."\n7. "Varre tries to indoctrinate the Tarnished into the Mohgwyn Dynasty."\n8. "As Varre is dying at the end of his questline, he is forsaken by his Lord, Mohg."\n9. "Even as he is dying, Varre calls out to Mohg and blesses the Mohgwyn Dynasty."\n\nHere are the results in JSON format:\n\n[\n  "The White Masks encountered in Mohgwyn Palace were initially abducted by Mogh.",\n  "Mogh is the Lord of Blood.",\n  "An unknown ritual involving Mogh\'s accursed Omen blood wa

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Sentences(sentences=['The White Masks encountered in Mohgwyn Palace were initially abducted by Mogh.', 'Mogh is the Lord of Blood.', "An unknown ritual involving Mogh's accursed Omen blood was used to turn the White Masks into Bloody Fingers.", 'Varre is one of the four White Masks that the Tarnished encounters in game.', "Varre was able to withstand and tame Mogh's accursed Omen blood.", 'Despite being abducted by Mogh, Varre remains loyal to the Mohgwyn Dynasty.', 'Varre tries to indoctrinate the Tarnished into the Mohgwyn Dynasty.', 'As Varre is dying at the end of his questline, he is forsaken by his Lord, Mogh.', 'Even as he is dying, Varre calls out to Mogh and blesses the Mohgwyn Dynasty.'])

In [187]:
print(type(paragraphs[10:11]))
print(type(paragraphs[10:11][0]))

<class 'list'>
<class 'str'>


In [189]:
print(paragraphs[10:11])
propo = get_propositions(paragraphs[10:11])
propo

["Bloody Finger Hunter Yura\nA samurai from the Land of Reeds whose purpose is to hunt down Bloody Fingers. His biggest target is Eleonora, Violet Bloody Finger, a Tarnished whom he holds in high regard. It is uncertain what exactly caused Eleonora's corruption into a Bloody Finger - Yura himself warns the Tarnished against the dangers of Dragon Communion, stating that those who partake in it one day lose their humanity. On the other hand, other Bloody Fingers the Tarnished encounter, some of which are Yura's targets, are those that have been approached and swayed by Mohg, the Lord of Blood himself."]


INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Runnable Output: content='Here is the decomposition:\n\n[\n  "Bloody Finger Hunter Yura was a samurai from the Land of Reeds.",\n  "Yura\'s purpose was to hunt down Bloody Fingers.",\n  "Eleonora was a Tarnished whom Yura held in high regard.",\n  "Eleonora was also known as Violet Bloody Finger.",\n  "It is uncertain what caused Eleonora\'s corruption into a Bloody Finger.",\n  "Yura warned the Tarnished against the dangers of Dragon Communion.",\n  "Those who partake in Dragon Communion one day lose their humanity, according to Yura.",\n  "Some Bloody Fingers that the Tarnished encountered were targets of Yura\'s hunt.",\n  "These Bloody Fingers had been approached and swayed by Mohg, the Lord of Blood.",\n  "Mohg was also known as the Lord of Blood."\n]\n\nNote: I\'ve tried to maintain the original phrasing from the input whenever possible, while still breaking down the complex sentences into simpler ones.' response_metadata={'model': 'llama3.1', 'created_at': '2024-09-13T03:28:08.3

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Sentences(sentences=['Bloody Finger Hunter Yura was a samurai from the Land of Reeds.', "Yura's purpose was to hunt down Bloody Fingers.", 'Eleonora was a Tarnished whom Yura held in high regard.', 'Eleonora was also known as Violet Bloody Finger.', "It is uncertain what caused Eleonora's corruption into a Bloody Finger.", 'Yura warned the Tarnished against the dangers of Dragon Communion.', 'Those who partake in Dragon Communion one day lose their humanity, according to Yura.', "Some Bloody Fingers that the Tarnished encountered were targets of Yura's hunt.", 'These Bloody Fingers had been approached and swayed by Mohg, the Lord of Blood.', 'Mohg was also known as the Lord of Blood.'])

In [200]:
print(paragraphs[10:11][0])
propo = get_propositions(list(paragraphs[10:11][0]))
propo

['B', 'l', 'o', 'o', 'd', 'y', ' ', 'F', 'i', 'n', 'g', 'e', 'r', ' ', 'H', 'u', 'n', 't', 'e', 'r', ' ', 'Y', 'u', 'r', 'a', '\n', 'A', ' ', 's', 'a', 'm', 'u', 'r', 'a', 'i', ' ', 'f', 'r', 'o', 'm', ' ', 't', 'h', 'e', ' ', 'L', 'a', 'n', 'd', ' ', 'o', 'f', ' ', 'R', 'e', 'e', 'd', 's', ' ', 'w', 'h', 'o', 's', 'e', ' ', 'p', 'u', 'r', 'p', 'o', 's', 'e', ' ', 'i', 's', ' ', 't', 'o', ' ', 'h', 'u', 'n', 't', ' ', 'd', 'o', 'w', 'n', ' ', 'B', 'l', 'o', 'o', 'd', 'y', ' ', 'F', 'i', 'n', 'g', 'e', 'r', 's', '.', ' ', 'H', 'i', 's', ' ', 'b', 'i', 'g', 'g', 'e', 's', 't', ' ', 't', 'a', 'r', 'g', 'e', 't', ' ', 'i', 's', ' ', 'E', 'l', 'e', 'o', 'n', 'o', 'r', 'a', ',', ' ', 'V', 'i', 'o', 'l', 'e', 't', ' ', 'B', 'l', 'o', 'o', 'd', 'y', ' ', 'F', 'i', 'n', 'g', 'e', 'r', ',', ' ', 'a', ' ', 'T', 'a', 'r', 'n', 'i', 's', 'h', 'e', 'd', ' ', 'w', 'h', 'o', 'm', ' ', 'h', 'e', ' ', 'h', 'o', 'l', 'd', 's', ' ', 'i', 'n', ' ', 'h', 'i', 'g', 'h', ' ', 'r', 'e', 'g', 'a', 'r', 'd', '.'

In [None]:
for para in paragraphs:
    print("para: ", para)
    # Pass the paragraph as a string; list(para) would split it into single characters
    propo = get_propositions(para)
    print(propo)

In [151]:
proposition_list = []

for para in paragraphs[10:]:
    print("paragraph: ", para)
    print("len: ", len(para))
    # Skip short fragments such as headings, so a stale `propo` is never reused
    if len(para) <= 50:
        continue
    propo = get_propositions(para)

    if propo:
        proposition_list.extend(propo.sentences)

paragraph:  Bloody Finger Hunter Yura
A samurai from the Land of Reeds whose purpose is to hunt down Bloody Fingers. His biggest target is Eleonora, Violet Bloody Finger, a Tarnished whom he holds in high regard. It is uncertain what exactly caused Eleonora's corruption into a Bloody Finger - Yura himself warns the Tarnished against the dangers of Dragon Communion, stating that those who partake in it one day lose their humanity. On the other hand, other Bloody Fingers the Tarnished encounter, some of which are Yura's targets, are those that have been approached and swayed by Mohg, the Lord of Blood himself.
len:  603


INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


Runnable Output: content='Here is the decomposed content:\n\n[ "Bloody Finger Hunter Yura is a samurai.", "Yura is from the Land of Reeds.", "Yura\'s purpose is to hunt down Bloody Fingers.", "Eleonora is a Tarnished whom Yura holds in high regard.", "Eleonora is also known as Violet Bloody Finger.", "It is uncertain what caused Eleonora\'s corruption into a Bloody Finger.", "Yura warns the Tarnished against the dangers of Dragon Communion.", "Those who partake in Dragon Communion one day lose their humanity, according to Yura.", "Some Bloody Fingers that the Tarnished encounter are targets of Yura\'s hunt.", "These Bloody Fingers have been approached and swayed by Mohg, the Lord of Blood.", "Mohg is also known as the Lord of Blood." ]' response_metadata={'model': 'llama3.1', 'created_at': '2024-09-13T03:10:21.2983027Z', 'message': {'role': 'assistant', 'content': ''}, 'done_reason': 'stop', 'done': True, 'total_duration': 2579315900, 'load_duration': 20529900, 'prompt_eval_count': 777

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 200 OK"


ValidationError: 1 validation error for Sentences
sentences
  value is not a valid list (type=type_error.list)
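A single malformed response, like the `ValidationError` shown here, stops the whole loop. One way to keep going is to catch the failure per paragraph and record it for a later retry. The sketch below assumes nothing about LangChain: `collect_propositions` and `fake_extract` are hypothetical names, and `fake_extract` merely stands in for the LLM call.

```python
from typing import Callable, Iterable, List, Tuple


def collect_propositions(
    paragraphs: Iterable[str],
    extract: Callable[[str], List[str]],
) -> Tuple[List[str], List[str]]:
    """Run `extract` on each paragraph; return (propositions, failed_paragraphs)."""
    propositions: List[str] = []
    failed: List[str] = []
    for para in paragraphs:
        try:
            propositions.extend(extract(para))
        except Exception:  # e.g. a pydantic ValidationError from the structured output
            failed.append(para)
    return propositions, failed


def fake_extract(para: str) -> List[str]:
    """Stand-in for the LLM call: fails on paragraphs containing 'bad'."""
    if "bad" in para:
        raise ValueError("model returned something that is not a list")
    return [s.strip() + "." for s in para.split(".") if s.strip()]


props, failed = collect_propositions(
    ["Yura is a samurai. Yura hunts.", "bad output"], fake_extract
)
print(props)
print(failed)
```

In this notebook the `extract` argument could be something like `lambda p: get_propositions(p).sentences`, and the paragraphs collected in `failed` could be retried or inspected by hand.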

Let's take a look at what the propositions look like

In [307]:
print(f"You have {len(proposition_list)} propositions")
proposition_list

You have 41 propositions


['Lore in Elden Ring covers all the information related to the world.',
 'The world refers to the Lands Between.',
 'The lore also covers past events that occurred long ago.',
 'These past events are part of the history of the Lands Between.',
 'There is a mystery surrounding the elusive Elden Ring.',
 'This page contains heavy spoilers.',
 'The Mythos of Elden Ring was written by George R. R. Martin.',
 'George R. R. Martin wrote the overall narrative theme or plot structure of Elden Ring.',
 'Mythos can also be interpreted as a belief system.',
 'The Story of Elden Ring was written by Hidetaka Miyazaki and his team at FromSoftware.',
 'Hidetaka Miyazaki and his team at FromSoftware wrote the story of Elden Ring.',
 "The Lore and Story of Elden Ring are told in a similar manner to other FromSoftware 'Souls' games.",
 'Players should expect to find plenty left to interpretation about various game characters, elements, and concepts.',
 'Seemingly contradictory statements will be found a

In [None]:
from proposition2 import eldenring_propositions as erp

In [None]:
print(f"You have {len(erp)} propositions")

In [None]:
erp
proposition_list = erp[:11]

proposition_list

So you'll see that they look like regular sentences, but they are actually statements that are able to stand on their own. For example, one of the sentences in the raw text is "They meant well, but this is rarely true." If you were to chunk that on its own, the LLM would have no idea who you're talking about. Who meant well? What is rarely true? The propositions resolve those references.

### Group each proposition into a chunk (Optional)

Use an LLM that can reason about each proposition and determine whether it should be part of an existing chunk or whether a new chunk should be made.

(Note: After testing, I found that grouping propositions into chunks is not very effective.)

In [None]:
from agentic_chunker import AgenticChunker

In [None]:
ac = AgenticChunker()

In [None]:
ac.add_propositions(erp)

In [None]:
ac.pretty_print_chunks()

In [None]:
ac.pretty_print_chunk_outline()

In [None]:
print(ac.get_chunks(get_type='list_of_strings'))

In [None]:
mylist = ac.get_chunks(get_type='list_of_strings')

In [None]:
len(mylist)

In [None]:
mylist

## Create the knowledge graph

### Define the Ontology

The ontology is a pydantic model with the following schema. 

```python
class Ontology(BaseModel):
    labels: List[Union[str, Dict]]
    relationships: List[str]
```
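The schema lets each label be either a plain string or a dict mapping the label name to a description. As a purely illustrative sketch of what that mixed form looks like in practice (this is not how the library processes labels, and `normalise_labels` is a hypothetical helper), the two shapes can be flattened into name and description pairs:

```python
from typing import Dict, List, Tuple, Union

Label = Union[str, Dict[str, str]]


def normalise_labels(labels: List[Label]) -> List[Tuple[str, str]]:
    """Turn mixed str/dict labels into (name, description) pairs."""
    pairs: List[Tuple[str, str]] = []
    for label in labels:
        if isinstance(label, str):
            # A bare string carries no description.
            pairs.append((label, ""))
        else:
            pairs.extend(label.items())
    return pairs


print(normalise_labels(["Place", {"Person": "Person name without any adjectives"}]))
```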



In [None]:
ontology = Ontology(
    labels=[
        {"Person": "Person name without any adjectives, Remember a person may be referenced by their name or using a pronoun"},
        {"Object": "Do not add the definite article 'the' in the object name"},
        {"Event": "Event involving multiple people. Do not include qualifiers or verbs like gives, leaves, works etc."},
        {"Place": "Locations where specific events took place"},
        {"Faction": "Name of a group that a person belongs to, without any adjectives. Remember, a person may only belong to one faction"},
        {"Miscellaneous": "Any important concept that cannot be categorised with any other given label"},
    ],
    relationships=[
        "Relation between any pair of Entities"
        ],
)

### Select an LLM

This model will:
- Generate a summary of each proposition
- Extract the nodes and edges from the pydantic documents


In [None]:
## Ollama model (used in this notebook)
model = "mistral-openorca:latest"

## Groq models
# model = "mixtral-8x7b-32768"
# model ="llama3-8b-8192"
# model = "llama3-70b-8192"
# model="gemma-7b-it"

## Open AI models
# oai_model="gpt-4o-mini"

## Use Groq (GroqClient must be imported first)
# llm = GroqClient(model=model, temperature=0.1, top_p=0.5)

## OR Use OpenAI (OpenAIClient must be imported first)
# llm = OpenAIClient(model=oai_model, temperature=0.1, top_p=0.5)

## Use Ollama
llm = OllamaClient(model=model, temperature=0.1, top_p=0.5)


### Create pydantic documents from chunks

Document is a pydantic model with the following schema

```python
class Document(BaseModel):
    text: str
    metadata: dict
```

The metadata we add to the document here is copied to every relation that is extracted out of the document. More often than not, the node pairs have multiple relations with each other. The metadata helps add more context to these relations.

In this example I am generating a summary of the text chunk, and the timestamp of the run, to be used as metadata. 
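To make the idea of metadata being copied onto every extracted relation concrete, here is a small stand-alone sketch using dataclasses. `Doc`, `Relation`, and `attach_metadata` are hypothetical stand-ins for illustration only and are not the library's own classes or internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Doc:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Relation:
    source: str
    target: str
    relationship: str
    metadata: Dict[str, str] = field(default_factory=dict)


def attach_metadata(doc: Doc, triples: List[Tuple[str, str, str]]) -> List[Relation]:
    """Copy the document's metadata onto every relation extracted from it."""
    return [Relation(s, t, r, dict(doc.metadata)) for s, t, r in triples]


doc = Doc("Varre remains loyal to the Mohgwyn Dynasty.", {"summary": "Varre's loyalty"})
relations = attach_metadata(doc, [("Varre", "Mohgwyn Dynasty", "is loyal to")])
print(relations[0].metadata)
```

Each relation gets its own copy of the dict, so editing the metadata of one relation does not affect the others or the source document.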

In [None]:
import datetime
current_time = str(datetime.datetime.now())

In [None]:
def generate_summary(text):
    SYS_PROMPT = (
        "Succinctly summarise the text provided by the user. "
        "Respond only with the summary and no other comments"
    )
    try:
        return llm.generate(user_message=text, system_message=SYS_PROMPT)
    except Exception:
        # Fall back to an empty summary if the LLM call fails
        return ""

In [None]:
# Build a list (not a lazy map) so the documents can be reused in later cells
docs = [
    Document(text=t, metadata={"summary": generate_summary(t), "generated_at": current_time})
    for t in proposition_list
]

In [None]:
mylist = list(docs)
mylist
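One detail worth keeping in mind when the same collection of documents is consumed in more than one cell: a `map` object is a one-shot iterator, whereas a list can be read any number of times. The toy example below uses made-up variable names and plain integers to show the difference.

```python
# A map object can only be consumed once.
squares_lazy = map(lambda n: n * n, [1, 2, 3])
first_pass = list(squares_lazy)
second_pass = list(squares_lazy)  # the iterator is already exhausted

# A list can be reused freely.
squares_list = [n * n for n in [1, 2, 3]]
reuse_a = list(squares_list)
reuse_b = list(squares_list)

print(first_pass, second_pass)  # [1, 4, 9] []
print(reuse_a, reuse_b)         # [1, 4, 9] [1, 4, 9]
```

Applied to this notebook, calling `list(docs)` twice on a lazy `map` would hand an empty list to the graph maker on the second call.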

### Generate graph from pydantic documents

In [None]:
graph_maker = GraphMaker(ontology=ontology, llm_client=llm, verbose=False)

In [None]:
graph = graph_maker.from_documents(
    list(docs),
    delay_s_between=0,  # seconds to wait between documents; increase this when using a rate-limited API such as Groq
)
print("Total number of Edges", len(graph))

In [None]:
graph

In [None]:
for edge in graph:
    # Print each edge without its metadata
    print(edge.model_dump(exclude={"metadata"}), "\n\n")

## Save the graph to a graph database

In [None]:
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "12345678"
os.environ["NEO4J_URI"]= "bolt://localhost:7687"

In [None]:
from knowledge_graph_maker import Neo4jGraphModel

create_indices = False
neo4j_graph = Neo4jGraphModel(edges=graph, create_indices=create_indices)
neo4j_graph.save()