# Knowledge Graph Extraction

This notebook contains my investigations into creating knowledge graphs using LlamaIndex, LangChain, and Neo4j.  My goal is to produce a repeatable pattern for knowledge graph generation template that can be applied across many use cases. 

## Theory/Process

As I understand the process from my research the primary steps we need to take in order to go from unstructured data (such as a website, book, article, video, image, etc) to structures graph data is the following:

1. Pull the data into the application
2. Create representations of that unstructured data as a graph
3. Upload that data to our graph database (Neo4j in my case)
4. Query that data for RAG

---

## Setup

First we need to import some dependencies from LangChain and LlamaIndex:

In [2]:
from llama_index.core.schema import BaseNode, TextNode
from llama_index.readers.web  import SimpleWebPageReader

Then we can import modules related to HTML parsing:

In [3]:
from bs4 import BeautifulSoup, Tag, SoupStrainer
from markdownify import markdownify as md

I have some additional custom modules that I am using:

In [4]:
from src.log_utils import markdownTable
from src.llm import ollama, jsonOutputFixer

These are some base modules we use:

In [5]:
from typing import Callable
import json
import time
import re

Now that we have our libraries imported we can setup a couple of utility functions which will make life easier for us:

In [6]:
# General Utility Functions

def executionTimer() -> Callable:
  """
  A function timing utility.  This curry function will record a start time when called, then when the inner function is called will return the duration between the 2 calls.
  """
  start = time.perf_counter()
  def executionTimerEnd() -> str:
    end = time.perf_counter()

    return f'{end - start:0.2f} sec'

def stringCleanUp(chunk: str, replacers: list[tuple[str, str, re.RegexFlag]]):
  """
  A wrapper function around the 're' modules "#sub" method.  This allows us to make multiple edits to a string
  """
  output = chunk;

  for replacer in replacers:
    flags = re.NOFLAG
    if (len(replacer) == 3):
      flags = replacer[2]

    output = re.sub(replacer[0], replacer[1], output, flags=flags)

  return output

As part of the setup I am going to create a Node list that I can use down in the node processing.  This should let me experiment with multiple different types of documents

In [7]:
nodes: list[BaseNode] = list()

## HTML Utilities

As part of this process I am designing, we will want to fetch data from web browsers.  The example I am using is the [Dungeons & Dragons 5e SRD](https://5esrd.com).

The functions we are setting up here are:

| Function | Description |
| :-: | --- |
| extractElements | Pulls elements based on the provided arguments and returns them as a list |
| removeElements | Pulls elements based on the provided arguments but returns nothing |
| htmlToMarkdown | Converts our HTML document to Markdown by abstracting the heading style from the `markdownify` lib |

In [8]:
# HTML Utility Functions

from typing import TypedDict


BeautifulSoupSearch = TypedDict(
  'HTMLSearch',
  {
    'name': SoupStrainer,
    'attrs': SoupStrainer,
    'recursive': bool,
    'string': SoupStrainer
  }
)

def extractElements(html: Tag, search: BeautifulSoupSearch):
  output: list[Tag] = list()
  elements = html.findAll(
    name=search.get('name'),
    attr=search.get('attrs'),
    recursive=search.get('recursive') or True,
    string=search.get('string')
  )

  for elem in elements:
    output.append(elem.extract());

  return output;

def removeElements(html: Tag, search: BeautifulSoupSearch):
  extractElements(html, search)

def htmlToMarkdown(html: str):
  rawMdStr = md(str(html), heading_style="ATX")
  return stringCleanUp(
    rawMdStr,
    [
      [r'\n{3,}', '\n\n'],      # Remove excessive new lines
      [r'â', '-'],              # Remove unicode character
      [r'', ''],               # Remove unicode character
      [r'\x94', '', re.UNICODE] # Remove unicode character
    ]
  )

def cleanupWebsite():
  ...


### Loading the Data

With our utilities in place, we can now load the actual html.  Under the hood I am using `SimpleWebPageReader` from LlamaIndex.  The goal with this process is to support recursive loading of documents by allowing for links to be extracted from the text.  This does mean that we are loading/evaluating web pages one at a time but this is more memory friendly than loading the whole website and then parsing the documents.

In [9]:
def loadWebsite(
    url: str,
    contentTag: str = None,
    excludedTags: list[BeautifulSoupSearch] | None = None,
    extractTags: dict[str, BeautifulSoupSearch] = None,
    extractLinks: bool | str = False
):
  # Pull in html using LlamaIndex Loader
  page = SimpleWebPageReader().load_data([url]).pop()

  # The 'page' is a string so we need to convert that to HTML.
  # BeautifulSoup can help us with that
  html = BeautifulSoup(page.text, 'html.parser')

  # Depending on the URL you may get better results by 
  if (contentTag):
    html = html.find(contentTag)

  # Setup the extracted tags dict
  extractedTags: dict[str, list[str]] = dict()

  # Links are a valuable part of parsing process.  This motivates
  # the option for links to be its own extraction task as opposed
  # requiring the user enter a BeautifulSoup query for links
  if (isinstance(extractLinks, str) or extractLinks == True):
    links = list()
    for link in html.findAll(f'a', href=True):
      # If a string is passed, then we want to filter on that string
      if (isinstance(extractLinks, str)):
        if (link['href'].startswith(extractLinks)):
          links.append(link['href'])
      else:
        links.append(link['href'])
    
    extractedTags['a'] = links

  # If the extract tags arg has been provided, we want to go
  # find any instance of the search and return it as part of
  # the dict.
  if (extractTags):
    # Get the keys from the parameter
    keys = extractTags.keys();
    # Create a new dict using those keys
    for key, val in zip(keys, [list()]*len(keys)):
      extractedTags[key] = val

    for key in keys:
      search = extractTags[key]
      # Returns a list of html elements that match
      elements = extractElements(html, search)

      # All data should be markdown so we are going to
      # run the htmlToMarkdown cleanup
      for elem in elements:
        extractedTags[key].append(htmlToMarkdown(str(elem)))

  # If excluded strings have been provided, we need to remove
  # those from the HTML.  These will be HTML tags such as 'script'
  # or 'div'
  if (excludedTags):
    [removeElements(html, search) for search in excludedTags]
  
  # Once we have extracted all of the tags we have our "final"
  # HTML document that we can then convert into markdown
  markdown = htmlToMarkdown(str(html))

  return markdown, extractedTags


### Html Extraction

Now that we have our downloader process setup, we can go ahead and verify that we are able to download the website (as markdown).  I have configured the website loader to extract links and tables from the html before parsing to markdown.

The **links** are extracted so that we can recursively fetch additional pages if we would like.  The **tables** were extracted on a hunch that they may mess with our ability to generate chunks as we could end up with 1/2 a table in one chunk and another have in another chunk.

In [10]:
SOURCE_URL = "https://www.5esrd.com/classes"

websiteMd, tagMap = loadWebsite(
  SOURCE_URL,
  contentTag="main",
  excludedTags=[
    { "name": "script" },
    { "name": "div", "attrs": { "id": "toc_container" }}
  ],
  extractTags={ "tables": { "name": "table" } },
  extractLinks=SOURCE_URL
)

linkLength = 0
if tagMap.get('a'):
  linkLength = len(tagMap.get('a')) 

print(markdownTable(
  ['Index', 'Value'],
  [
    [ 'Website MD Size', len(websiteMd) ],
    [ 'Extracted Elements', ', '.join(list(tagMap.keys())) ],
    [ 'Links Found', linkLength],
    [ 'Tables Found', len(tagMap.get('tables')) if tagMap.get('tables') else 'N/A' ]
  ]
))

| Index | Value |
| --- | --- |
| Website MD Size | 23139 |
| Extracted Elements | a, tables |
| Links Found | 67 |
| Tables Found | 6 |




## Document Processing

The next main step in the process is the conversion of the unstructured data.  There are a couple of techniques I am going to explore here.  Those techniques are:

| Type | Description |
| :-: | --- |
| Recursive | Split on a list of user defined characters (recommended by LangChain) |
| Markdown Header | Splits text based on Markdown-specific characters. Notably, this adds in relevant information about where that chunk came from (based on the Markdown) |
| Semantic | First splits on sentences. Then combines ones next to each other if they are semantically similar enough. Taken from Greg Kamradt - Currently marked as _Experimental_ in the LangChain Docs |
| Custom Markdown Header | A home rolled solution for chunking that is based on my own technique |

To test each implementation, I wrapped in a function which returns a LangChain `Document` list.  This standard interface allows us to test each technique using a higher order function.

First we setup our question

In [11]:
evaluatorQuestion = 'What is the purpose of levels?'

Below are the first 3 wrapped in a the standard interface:

In [12]:
from llama_index.core import VectorStoreIndex
from llama_index.core.llms import ChatMessage, MessageRole
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import OllamaEmbeddings
from llama_index_client import Document
from src.llm import get_service_context


def recursiveTextSplitter(markdown: str) -> list[Document]:
  splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=1000,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
  )

  return splitter.create_documents([markdown])

def markdownTextSplitter(markdown: str) -> list[Document]:
  splitter = MarkdownHeaderTextSplitter(
    strip_headers=False,
    headers_to_split_on = [
      ("#", "Header 1"),
      ("##", "Header 2"),
      ("###", "Header 3"),
      ("####", "Header 4"),
      ("#####", "Header 5"),
    ]
  )

  return splitter.split_text(markdown)

def semanticTextSplitter(markdown: str) -> list[Document]:
  splitter = SemanticChunker(OllamaEmbeddings(model="mistral:7b"))
  return splitter.create_documents([markdown])

Below is the custom implementation that I wrote myself:

In [13]:
def splitMdByParagraph(markdownText: str, chunkList: list[str]):
  chunkList.extend(re.split('\n\n', markdownText))

def splitMdByHeader(markdownText: str, depth: int) -> list[str]:
  pattern = r'^#{' + str(depth) + r'}\s.*$'
  return re.split(pattern, markdownText, flags=re.MULTILINE)

def markdownHeaderParser(markdown: str, chunkList: list[str] = None, depth: int = None, max_chunk_size: int = None) -> list[str]:
  if (not depth):
    depth = 1

  if (not max_chunk_size):
    max_chunk_size = 500

  if (not chunkList):
    chunkList = list()

  subStrings = splitMdByHeader(markdown, depth)

  for documentPart in subStrings:
    if (len(documentPart) < max_chunk_size):
      chunkList.extend([documentPart])
    elif depth <= 5:
      newDepth = depth + 1
      markdownHeaderParser(documentPart, chunkList, depth=newDepth, max_chunk_size=max_chunk_size)
    else:
      splitMdByParagraph(documentPart, chunkList)

  return list(filter(lambda chunk: len(chunk) > 0, chunkList))

def customMarkdownTextSplitter(markdown: str) -> list[Document]:
  chunks = markdownHeaderParser(markdown, max_chunk_size=1000)

  documents: list[Document] = list()
  for chunk in chunks:
    documents.append(Document(chunk))

  return documents

Finally, we add the evaluator function for text splitting.  This will take in the previously setup splitter wrapper functions and output the results as well as some timing information for us.  

As part of the process we are crossing our streams a little bit and going from LangChain tooling over to LlamaIndex Embedding generation.  This is simply a product of my previous investigations where I started with LlamaIndex.

In [14]:
def langChainToLLamaIndex(langchainDocs: list[TextNode]) -> list[TextNode]:
  llamaDocs: list[TextNode] = list()
  for doc in langchainDocs:
    llamaDocs.append(TextNode(text=doc.page_content))

  return llamaDocs

def textSplitterEvaluation(
    fn: Callable[[str], list[Document]],
    name: str,
    fnInput: str,
    prompt: str,
    sourceName: str,
    chatHistory: list[ChatMessage] = None
):
  output = dict({
    'name': name,
    'success': False,
    'execution_msg': '',
    'splitter_result': '',
    'fn_execution_time': '',
    'index': None,
    'vector_execution_time': '',
    'query_result': '',
    'query_execution_time': ''
  });

  try:
    tic = time.perf_counter()
    output['splitter_result'] = fn(fnInput)
    toc = time.perf_counter()
    output['fn_execution_time'] = f'{toc - tic:0.2f} sec'

    nodes = langChainToLLamaIndex(output['splitter_result'])

    # Add url as metadata to each node
    for node in nodes:
      node.metadata['source'] = sourceName

    output['nodes'] = nodes

    tic = time.perf_counter()
    index = VectorStoreIndex(nodes=output['nodes'], service_context=get_service_context())
    toc = time.perf_counter()
    output['vector_execution_time'] = f'{toc - tic:0.2f} sec'

    queryEngine = index.as_chat_engine(
      verbose=True,
      chat_mode='best'
    );

    tic = time.perf_counter()
    
    output['query_result'] = queryEngine.chat(
      message=prompt,
      chat_history=chatHistory
    )
    toc = time.perf_counter()
    output['query_execution_time'] = f'{toc - tic:0.2f} sec'
    
    output['success'] = True
    output['execution_msg'] = 'Success'
  except KeyError as kErr:
    output['execution_msg'] = f'Dict Key Lookup Err [KeyError]: {kErr}'
    exit()
  except Exception as err:
    output['execution_msg'] = f'Error: {err.with_traceback()}'
  finally:
    return output

Finally with the setup out of the way, we can go ahead and test each technique.  This takes a minute as it generates the embeddings.

In [15]:
testResults = dict({
  'recursive': textSplitterEvaluation(recursiveTextSplitter, 'Recursive', websiteMd, evaluatorQuestion, SOURCE_URL),
  'markdown_header': textSplitterEvaluation(markdownTextSplitter, 'Markdown Header', websiteMd, evaluatorQuestion, SOURCE_URL),
  'custom_markdown_header': textSplitterEvaluation(customMarkdownTextSplitter, 'Custom Markdown Header', websiteMd, evaluatorQuestion, SOURCE_URL),
  'semantic': textSplitterEvaluation(semanticTextSplitter, 'Semantic', websiteMd, evaluatorQuestion, SOURCE_URL)
})

resultHeaders = ['Name', 'Success', 'Splitter Time', 'Index Time', 'Query Time']
resultTable = []
resultDisplay = [];

for recordKey in testResults.keys():
  result = testResults[recordKey]
  resultTable.append(
    [ result['name'], result['success'], result['fn_execution_time'], result['vector_execution_time'], result['query_execution_time'] ]
  )
 
  msg = ''
  if (result['success']):
    msg = result['query_result']
  else:
    msg = result['execution_msg']

  resultDisplay.append('**{}**\n```\n{}\n```\n\n'.format(result['name'], msg))

print(markdownTable(
  headers=resultHeaders,
  values=resultTable
))

print(''.join(resultDisplay))

[1;3;38;5;200mThought: I need to use a tool to help me answer the question.
Action: query_engine_tool
Action Input: {'input': 'What is the purpose of levels?', 'num_beams': 5}
[0m[1;3;34mObservation: The purpose of levels in the context of Dungeons & Dragons is to determine the number of spell slots a character has available for casting spells. The levels of the bard, cleric, druid, sorcerer, and wizard classes are added together, along with half of the character's levels in the paladin and ranger classes, to determine the total number of spell slots. This total is then consulted with the **Table: Multiclass Spellcaster** to determine the actual spell slots available to the character.
[0m[1;3;38;5;200mThought: I can use the `query_engine_tool` to get more information about the purpose of levels in Dungeons & Dragons.
Action: query_engine_tool
Action Input: {'input': 'What is the purpose of levels in D&D?', 'num_beams': 5}
[0m[1;3;34mObservation: In Dungeons & Dragons (D&D), leve

#### Example Results

| Name | Success | Splitter Time | Index Time | Query Time |
| --- | --- | --- | --- | --- |
| Recursive | True | 0.00 sec | 30.52 sec | 4.56 sec |
| Markdown Header | True | 0.00 sec | 20.10 sec | 15.92 sec |
| Custom Markdown Header | False |  |  |  |
| Semantic | True | 67.77 sec | 13.30 sec | 7.06 sec |


**Recursive**
```
The purpose of levels in the context of Dungeons & Dragons is to represent a character's advancement and growth as they gain experience points (XP) and increase their proficiency bonus. As a character gains levels, their hit point maximum increases, and they gain access to new spells and abilities. The levels also determine the character's available spell slots, which are used to cast spells. Overall, the levels system is designed to provide a framework for characters to grow and develop as they progress through the game.
```

**Markdown Header**
```
Levels in D&D serve several purposes:

1. Progression: Levels allow players to progress their characters over time, gaining new abilities, skills, and powers as they increase in level. This progression provides a sense of accomplishment and growth for the player and their character.
2. Balance: By limiting the power of lower-level characters compared to higher-level ones, levels help maintain balance in gameplay. As players gain experience and defeat challenges, their characters become stronger and more capable, allowing them to tackle increasingly difficult content.
3. Customization: Each level provides new options for character customization, such as additional spells or abilities. This allows players to tailor their characters to fit their preferred playstyle and the needs of their party.
4. Storytelling: Levels can be used to drive the story forward by providing challenges and obstacles that must be overcome. As players progress through levels, they encounter more difficult enemies, uncover new plot points, or discover hidden secrets.
5. Skill development: Each level provides an opportunity for players to develop their character's skills further. Whether it's learning new spells, improving combat abilities, or mastering a particular tool or weapon, levels allow players to grow and improve their characters over time.
6. Character growth: Levels provide a framework for character growth, allowing players to explore new abilities, develop their characters' personalities, and create meaningful connections with other characters in the game world.
7. Replayability: The leveling system encourages replayability by providing different challenges and experiences each time a player starts a new character or campaign. This helps keep the game fresh and exciting, even for experienced players.
8. Roleplaying opportunities: Levels can be used to create roleplaying opportunities, such as character milestones, achievements, or setbacks. These moments can help deepen the player's connection with their character and enhance the overall roleplaying experience.
9. Mechanical progression: Levels provide a mechanical framework for advancement, allowing players to unlock new abilities, spells, or items as they progress through the game. This progression helps maintain interest and challenge throughout the game.
10. Player satisfaction: Finally, levels can provide a sense of accomplishment and satisfaction for players as they progress through the game, overcoming challenges and achieving new milestones. This satisfaction can help keep players engaged and motivated to continue playing.
```

**Custom Markdown Header**
```

```

**Semantic**
```
The purpose of levels in the context of the given information is to determine the number of spell slots a multiclass spellcaster has available for casting spells at each spell level. The levels are used to calculate the number of spell slots gained from each class, and the resulting total number of spell slots available for casting spells at each spell level.

For example, a 10th-level multiclass spellcaster would have 4 spell slots available at 1st level, 3 spell slots available at 2nd level, and so on, until they reach their highest level (10th) where they have 4 spell slots available. This allows the spellcaster to cast spells of increasing difficulty and power as they gain experience and level up.
```

### Semantic + Tables

One curiosity I had while working on on the chunking step was _if_ the markdown tables could improve the vectors within the _Semantic_ chunker.  I had initially setup the parser to remove them because in my _Custom Markdown_ approach I was trying to avoid them being separated.  And with the text-based (token-based) chunkers, this was a high likely hood.  

So next I went ahead and created an alternative html document where I skipped the table extraction:

In [16]:
websiteMdWithTables, tagMap = loadWebsite(
  SOURCE_URL,
  contentTag="main",
  excludedTags=[
    { "name": "script" },
    { "name": "div", "attrs": { "id": "toc_container" }}
  ],
  extractLinks=SOURCE_URL
)

semanticTableResult = textSplitterEvaluation(semanticTextSplitter, 'SemanticTextSplitterWithTables', websiteMdWithTables, evaluatorQuestion, SOURCE_URL)

print(markdownTable(
  headers=resultHeaders,
  values=[
    [ semanticTableResult['name'], semanticTableResult['success'], semanticTableResult['fn_execution_time'], semanticTableResult['vector_execution_time'], semanticTableResult['query_execution_time'] ],
    [ testResults['semantic']['name'], testResults['semantic']['success'], testResults['semantic']['fn_execution_time'], testResults['semantic']['vector_execution_time'], testResults['semantic']['query_execution_time'] ]
  ]
))

print('**{}**\n```\n{}\n```\n\n'.format(semanticTableResult['name'], semanticTableResult['query_result'] if semanticTableResult['success'] else semanticTableResult['execution_msg'] ))
print('**{}**\n```\n{}\n```\n\n'.format(testResults['semantic']['name'], testResults['semantic']['query_result'] if testResults['semantic']['success'] else testResults['semantic']['execution_msg']))

[1;3;38;5;200mThought: I need to use a tool to help me answer the question.
Action: query_engine_tool
Action Input: {'input': 'What is the purpose of levels?', 'num_beams': 5}
[0m[1;3;34mObservation: The purpose of levels in the context provided is to determine the number of spell slots a multiclass spellcaster has available for casting spells at each spell level. The levels are used to calculate the number of spell slots gained from each class, and the resulting total number of spell slots available for casting spells at each spell level.

For example, a 10th-level multiclass spellcaster has 4 spell slots available at 1st level, 3 spell slots available at 2nd level, and so on. This allows the spellcaster to cast more complex and powerful spells as they gain experience and level up.

In addition to determining the number of spell slots available, levels also affect other abilities and features of a multiclass spellcaster, such as their ability to cast certain spells or use specific 

Here are some example results form that test.  It does seem that with additional context I can 

| Name | Success | Splitter Time | Index Time | Query Time |
| --- | --- | --- | --- | --- |
| SemanticTextSplitterWithTables | True | 67.30 sec | 13.29 sec | 7.46 sec |
| Semantic | True | 67.77 sec | 13.30 sec | 7.06 sec |


**SemanticTextSplitterWithTables**
```
The purpose of levels in the context of spell slots and alignment is to provide a framework for determining the maximum number of spells a creature can cast per day, based on their level of proficiency and their alignment. The levels represent different stages of proficiency and power, with higher levels corresponding to more powerful and skilled creatures.

In the context of Pact Magic, levels also determine the number of spell slots available for casting spells from other classes. This allows creatures with both the Spellcasting class feature and the Pact Magic class feature to combine their own spells with those of other classes, creating a more diverse and powerful magical repertoire.

Overall, levels in this context serve as a way to track a creature's progress and abilities, and to provide a framework for determining their maximum potential for spellcasting and magic use.
```


**Semantic**
```
The purpose of levels in the context of the given information is to determine the number of spell slots a multiclass spellcaster has available for casting spells at each spell level. The levels are used to calculate the number of spell slots gained from each class, and the resulting total number of spell slots available for casting spells at each spell level.

For example, a 10th-level multiclass spellcaster would have 4 spell slots available at 1st level, 3 spell slots available at 2nd level, and so on, until they reach their highest level (10th) where they have 4 spell slots available. This allows the spellcaster to cast spells of increasing difficulty and power as they gain experience and level up.
```

### Proposition Extraction

Recently I was watching a [video](https://www.youtube.com/watch?v=8OJC21T2SL4) on text extraction techniques and one technique that was proposed as a solution was based off of this [research paper](https://arxiv.org/abs/2312.06648) which suggests converting the text into propositions (a statement or assertion that expresses a judgment or opinion).  

As a product of the paper is a prompt which has been published on LangChain's LangSmith Hub: [Link](https://smith.langchain.com/hub/wfh/proposal-indexing).

Below is a splitter function which process the text into propositions.

In [17]:
from src.prompts import proposition
from src.llm import ollama
from langchain.llms.ollama import Ollama

Llama2LangChain = Ollama(model="mistral:7b")
test = proposition.PropositionPrompt()

propWebsitePrompt = test.createPrompt(websiteMd)

print(type(propWebsitePrompt))

print(markdownTable(
  headers=['Element', 'Size'],
  values=[
    [ 'Text', len(websiteMd) ],
    [ 'Tokens', Llama2LangChain.get_num_tokens(websiteMd) ]
  ]
))

NotImplementedError: Unsupported message type: <class 'llama_index.core.base.llms.types.ChatMessage'>

In [33]:
propositionNodes = markdownTextSplitter(websiteMd)

propositionList: list[str] = []
for node in propositionNodes:
  propositions = test.chat(
    llm=Llama2LangChain,
    message=node.page_content
  )

  propositionList.extend(propositions)

print(markdownTable(
  headers=['Element', 'Size'],
  values=[
    [ 'Documents', len(propositionNodes) ],
    [ 'Propositions', len(propositionList) ]
  ]
))

print(f"Propositions:\n```json\n{json.dumps(propositionList)}\n```\n\n")


NotImplementedError: Unsupported message type: <class 'llama_index.core.base.llms.types.ChatMessage'>

In [None]:
nodes.extend(testResults['semantic']['nodes'])

## Node Metadata

The next step is we want to look at each node and try and find some additional metadata about each one.  This metadata can then assist the LLM in finding the correct chunks when doing the retrieval step.



- Title Extraction
- Q&A Extraction
- Entity Extraction