<a href="https://colab.research.google.com/github/seanreed1111/colab-demos/blob/master/llamaindex_ingestion_and_metadata_cookbooks.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:

# - https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-4/Ingestion_Pipeline/
# - https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-4/Metadata_Extraction/
# - https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-3/Evaluating_RAG_Systems/
# - https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-6/Router_And_SubQuestion_QueryEngine/

In [1]:
!pip install -qqq llama-index llama-index-vector-stores-qdrant

In [6]:
import os

import nest_asyncio
from google.colab import userdata
from llama_index.core import Document, SimpleDirectoryReader
from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Colab already runs an event loop; allow the pipeline's async calls to nest inside it.
nest_asyncio.apply()

# Read the API key from Colab's secrets store.
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')

In [5]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

--2024-10-03 15:07:27--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-10-03 15:07:27 (5.34 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [10]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
    ]
)
nodes = pipeline.run(documents=documents);nodes[0].metadata

{'file_path': '/content/data/paul_graham/paul_graham_essay.txt',
 'file_name': 'paul_graham_essay.txt',
 'file_type': 'text/plain',
 'file_size': 75042,
 'creation_date': '2024-10-03',
 'last_modified_date': '2024-10-03'}
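
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of sliding-window chunking. It assumes whitespace-separated words stand in for tokens (the real `TokenTextSplitter` counts tokenizer tokens), and `chunk_words` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into word windows; consecutive windows share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last word of the previous one, which is the role the 100-token overlap plays in the pipeline above.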

## Let's add a title extractor to the pipeline

In [9]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
    ]
)
nodes = pipeline.run(documents=documents);nodes[0].metadata

100%|██████████| 5/5 [00:01<00:00,  3.80it/s]


{'file_path': '/content/data/paul_graham/paul_graham_essay.txt',
 'file_name': 'paul_graham_essay.txt',
 'file_type': 'text/plain',
 'file_size': 75042,
 'creation_date': '2024-10-03',
 'last_modified_date': '2024-10-03',
 'document_title': 'The Intersection of Technology, Art, and Philosophy: A Journey through Writing, Programming, and Artificial Intelligence'}

In [11]:
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        OpenAIEmbedding(), #creates nodes[0].embedding
    ]
)
nodes = pipeline.run(documents=documents);nodes[0].metadata#, nodes[0].embedding

100%|██████████| 5/5 [00:01<00:00,  4.05it/s]


{'file_path': '/content/data/paul_graham/paul_graham_essay.txt',
 'file_name': 'paul_graham_essay.txt',
 'file_type': 'text/plain',
 'file_size': 75042,
 'creation_date': '2024-10-03',
 'last_modified_date': '2024-10-03',
 'document_title': 'The Intersection of Art, Technology, and Programming: A Journey from Short Stories to AI and Fine Arts'}

In [12]:
# save and load to cache
pipeline.cache.persist("./llama_cache.json")
new_cache = IngestionCache.from_persist_path("./llama_cache.json")
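
The idea behind the cache can be sketched without LlamaIndex: hash the inputs together with the transformation, and reuse the stored output when the same key comes up again. This is only an illustration under that assumption; `expensive_upper`, `cache_key`, and `run_cached` are hypothetical names, and the library's actual key format and storage may differ.

```python
import hashlib
import json

calls = {"count": 0}  # counts how often the expensive step really runs


def expensive_upper(texts):
    """Stand-in for a costly transformation such as an embedding or LLM call."""
    calls["count"] += 1
    return [t.upper() for t in texts]


def cache_key(texts, transform_name):
    """Hash the inputs together with the transformation's name."""
    payload = json.dumps({"texts": texts, "transform": transform_name})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_cached(texts, transform, cache):
    key = cache_key(texts, transform.__name__)
    if key not in cache:
        cache[key] = transform(texts)  # cache miss: do the work once
    return cache[key]


cache = {}
first = run_cached(["a", "b"], expensive_upper, cache)
second = run_cached(["a", "b"], expensive_upper, cache)  # served from the cache

# Round-trip through JSON, analogous to persisting the cache and loading it back.
restored = json.loads(json.dumps(cache))
third = run_cached(["a", "b"], expensive_upper, restored)
print(first, calls["count"])
```

The expensive function runs once even though the result is requested three times, including once after the cache was serialized and restored.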

In [13]:

#uses the premade cache
new_pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        OpenAIEmbedding(), #creates nodes[0].embedding
    ],
    cache=new_cache,
)
nodes = new_pipeline.run(documents=documents)  # run the pipeline that holds the reloaded cache

In [14]:
nodes = pipeline.run(documents=documents);nodes[0].text

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then s

# RAG using Ingestion Pipeline

In [15]:
import qdrant_client

from llama_index.vector_stores.qdrant import QdrantVectorStore

client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(
    client=client, collection_name="llama_index_vector_store"
)
pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TitleExtractor(),
        OpenAIEmbedding(),
    ],
    cache=new_cache,
    vector_store=vector_store,
)
# Ingest directly into a vector db
nodes = pipeline.run(documents=documents)



In [16]:
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("What did paul graham do growing up?")

print(response)

Paul Graham skipped a step in the evolution of computers and went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting to him.


# Custom Transformations

## Implementing custom transformations is pretty easy.

Let's include a transformation that removes special characters from the text before generating embeddings.

The primary requirement for transformations is that they should take a list of nodes as input and return a modified list of nodes.
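
That contract (list of nodes in, list of nodes out) can be sketched without any library. The example below is an illustration only: `FakeNode` is a hypothetical stand-in for a LlamaIndex node, `add_cve_urls` is a made-up transformation, and the NVD URL pattern is an assumption chosen for the demo.

```python
import re
from dataclasses import dataclass, field

CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")


@dataclass
class FakeNode:
    """Hypothetical stand-in for a LlamaIndex node: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def add_cve_urls(nodes, **kwargs):
    """Take a list of nodes, annotate each in place, and return the list."""
    for node in nodes:
        ids = sorted(set(CVE_PATTERN.findall(node.text)))
        if ids:
            # Assumed URL layout for the demo; adjust to the advisory source you use.
            node.metadata["cve_urls"] = [
                f"https://nvd.nist.gov/vuln/detail/{cve_id}" for cve_id in ids
            ]
    return nodes


nodes_demo = add_cve_urls([
    FakeNode("Log4Shell is tracked as CVE-2021-44228."),
    FakeNode("No identifiers here."),
])
print(nodes_demo[0].metadata)
```

A transformation may change node text, node metadata, or both, as long as it hands back the list of nodes.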



In [17]:
from llama_index.core.schema import TransformComponent
import re

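# The same pattern could back a CVE extractor that scans the text and adds a CVE URL to the metadata.
class TextCleaner(TransformComponent):
    """Strip every character except ASCII letters, digits, and spaces."""

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            # Newlines are removed too, so words on either side of a line break get fused.
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)
        return nodes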


pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(chunk_size=1024, chunk_overlap=100),
        TextCleaner(),
        OpenAIEmbedding(),
    ],
    cache=new_cache,
)

nodes = pipeline.run(documents=documents);nodes[0].text

'What I Worked OnFebruary 2021Before college the two main things I worked on outside of school were writing and programming I didnt write essays I wrote what beginning writers were supposed to write then and probably still are short stories My stories were awful They had hardly any plot just characters with strong feelings which I imagined made them deepThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called data processing This was in 9th grade so I was 13 or 14 The school districts 1401 happened to be in the basement of our junior high school and my friend Rich Draves and I got permission to use it It was like a mini Bond villains lair down there with all these alienlooking machines  CPU disk drives printer card reader  sitting up on a raised floor under bright fluorescent lightsThe language we used was an early version of Fortran You had to type programs on punch cards then stack them in the card reader and press a button to loa
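
The output above fuses words across paragraph breaks ("deepThe first") because the pattern `[^0-9A-Za-z ]` deletes newlines along with punctuation. If that is not wanted, one possible variant keeps whitespace and then collapses it. This is a sketch; `clean_text` is a hypothetical helper and not part of the notebook's pipeline.

```python
import re


def clean_text(text):
    """Drop punctuation but keep word boundaries."""
    # Remove everything except letters, digits, and whitespace (\s keeps newlines).
    no_punct = re.sub(r"[^0-9A-Za-z\s]", "", text)
    # Collapse each run of whitespace into a single space.
    return re.sub(r"\s+", " ", no_punct).strip()


original = "made them deep.\n\nThe first programs I tried"
fused = re.sub(r"[^0-9A-Za-z ]", "", original)  # the notebook's pattern
kept = clean_text(original)
print(fused)
print(kept)
```

The same two lines could replace the body of `TextCleaner.__call__` if word boundaries matter for the embeddings.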

## END OF https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-4/Ingestion_Pipeline/

# METADATA EXTRACTION
#### https://docs.llamaindex.ai/en/stable/examples/cookbooks/oreilly_course_cookbooks/Module-4/Metadata_Extraction/

In [18]:
!pip install -qqq llama-index
!pip install -qqq llama_index-readers-web

In [19]:
from llama_index.llms.openai import OpenAI
from llama_index.core.schema import MetadataMode
from llama_index.core import Settings

In [20]:
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo", max_tokens=512)
Settings.llm = llm

## Uses QuestionsAnsweredExtractor

In [22]:
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import QuestionsAnsweredExtractor
from llama_index.readers.web import SimpleWebPageReader

node_parser = TokenTextSplitter(
    separator=" ", chunk_size=256, chunk_overlap=128
)

question_extractor = QuestionsAnsweredExtractor(
    questions=3, llm=llm, metadata_mode=MetadataMode.EMBED
)

In [23]:
# from llama_index.readers.web import SimpleWebPageReader
reader = SimpleWebPageReader(html_to_text=True)
docs = reader.load_data(urls=["https://eugeneyan.com/writing/llm-patterns/"])

In [24]:
print(docs[0].get_content())

# [eugeneyan](/)

  * [Start Here](/start-here/ "Start Here")
  * [Writing](/writing/ "Writing")
  * [Speaking](/speaking/ "Speaking")
  * [Prototyping](/prototyping/ "Prototyping")
  * [About](/about/ "About")

# Patterns for Building LLM-based Systems & Products

[ [llm](/tag/llm/) [engineering](/tag/engineering/)
[production](/tag/production/) [🔥](/tag/🔥/) ]  · 66 min read

> Discussions on [HackerNews](https://news.ycombinator.com/item?id=36965993),
> [Twitter](https://twitter.com/eugeneyan/status/1686531758701899776), and
> [LinkedIn](https://www.linkedin.com/posts/eugeneyan_patterns-for-building-
> llm-based-systems-activity-7092300473981927424-_wVo)

“There is a large class of problems that are easy to imagine and build demos
for, but extremely hard to make products out of. For example, self-driving:
It’s easy to demo a car self-driving around a block, but making it into a
product takes a decade.” -
[Karpathy](https://twitter.com/eugeneyan/status/1672692174704766976

In [25]:
orig_nodes = node_parser.get_nodes_from_documents(docs)

In [26]:
print(orig_nodes[20:28][3].get_content(metadata_mode="all"))

because evals were often conducted with untested, incorrect
ROUGE implementations.

![Dimensions of model evaluations with ROUGE](/assets/rogue-scores.jpg)

Dimensions of model evaluations with ROUGE
([source](https://aclanthology.org/2023.acl-long.107/))

And even with recent benchmarks such as MMLU, **the same model can get
significantly different scores based on the eval implementation**.
[Huggingface compared the original MMLU
implementation](https://huggingface.co/blog/evaluating-mmlu-leaderboard) with
the HELM and EleutherAI implementations and found that the same example could
have different prompts across various providers.

![Different prompts for the same question across MMLU
implementations](/assets/mmlu-prompt.jpg)

Different prompts for the same question across MMLU implementations
([source](https://huggingface.co/blog/evaluating-mmlu-leaderboard))

Furthermore, the evaluation approach differed across all three benchmarks:

  * Original MMLU: Compares predicted probabiliti

In [27]:
nodes_1 = node_parser.get_nodes_from_documents(docs)[20:28]
nodes_1 = question_extractor(nodes_1)

100%|██████████| 8/8 [00:03<00:00,  2.23it/s]


In [28]:
print(nodes_1[3].get_content(metadata_mode="all"))

[Excerpt from document]
questions_this_excerpt_can_answer: 1. How do different implementations of the MMLU benchmark affect the scores of the same model?
2. What are the differences in evaluation approaches between the original MMLU benchmark, HELM, and EleutherAI implementations?
3. How do varying prompts for the same question impact the evaluation of models in the MMLU benchmark?
Excerpt:
-----
because evals were often conducted with untested, incorrect
ROUGE implementations.

![Dimensions of model evaluations with ROUGE](/assets/rogue-scores.jpg)

Dimensions of model evaluations with ROUGE
([source](https://aclanthology.org/2023.acl-long.107/))

And even with recent benchmarks such as MMLU, **the same model can get
significantly different scores based on the eval implementation**.
[Huggingface compared the original MMLU
implementation](https://huggingface.co/blog/evaluating-mmlu-leaderboard) with
the HELM and EleutherAI implementations and found that the same example could
have diff
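
The printed node shows how the extracted questions are placed in front of the excerpt when metadata is included. A rough sketch of that layout follows; `render_with_metadata` is a hypothetical helper that only mimics the format printed above and is not the library's template code.

```python
def render_with_metadata(text, metadata):
    """Mimic the layout printed above: header, key-value lines, then the excerpt."""
    meta_lines = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    return f"[Excerpt from document]\n{meta_lines}\nExcerpt:\n-----\n{text}\n-----"


rendered = render_with_metadata(
    "because evals were often conducted with untested, incorrect ROUGE implementations.",
    {
        "questions_this_excerpt_can_answer": (
            "1. How do different implementations of the MMLU benchmark "
            "affect the scores of the same model?"
        )
    },
)
print(rendered)
```

When metadata is part of the text that gets embedded, a query phrased like one of the extracted questions can sit closer to the node, which may explain why the two indexes built below retrieve different source nodes.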

## Build Indexes

In [30]:
from llama_index.core import VectorStoreIndex
from llama_index.core.response.notebook_utils import (
    display_source_node,
    display_response,
)
index0 = VectorStoreIndex(orig_nodes)
index1 = VectorStoreIndex(orig_nodes[:20] + nodes_1 + orig_nodes[28:])

## Build Query Engine

In [31]:
query_engine0 = index0.as_query_engine(similarity_top_k=1)
query_engine1 = index1.as_query_engine(similarity_top_k=1)
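
With `similarity_top_k=1`, each engine answers from the single best-matching node. The retrieval step can be sketched as ranking by cosine similarity. The vectors below are toy values, the helper names are hypothetical, and a real index would use embeddings from the configured embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, node_vecs, k=1):
    """Return the k (node_id, score) pairs most similar to the query."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_vectors = {"node_a": [1.0, 0.0], "node_b": [0.6, 0.8], "node_c": [0.0, 1.0]}
best = top_k([0.5, 0.9], toy_vectors, k=1)
print(best)
```

Because only one node reaches the LLM, any change in how nodes are embedded (such as added metadata) can change the whole answer.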

In [32]:
query_str = (
    "Can you describe metrics for evaluating text generation quality, compare"
    " them, and tell me about their downsides"
)

response0 = query_engine0.query(query_str)
response1 = query_engine1.query(query_str)

### Response 0

In [35]:
print(query_str, '\n')
display_response(
    response0, source_length=1000, show_source=True, show_source_metadata=True
)

Can you describe metrics for evaluating text generation quality, compare them, and tell me about their downsides 



**`Final Response:`** Metrics for evaluating text generation quality can be categorized as context-dependent or context-free. Context-dependent metrics consider the context of the task and may need adjustments for different tasks. On the other hand, context-free metrics are task-agnostic and compare the generated output with provided references, making them versatile for various tasks.

Some commonly used metrics include BLEU, ROUGE, BERTScore, and MoverScore. BLEU is a precision-based metric that counts matching n-grams in the generated output and the reference. ROUGE evaluates the overlap of n-grams and word sequences between the generated text and the reference. BERTScore measures the similarity between the model's output and the reference using contextual embeddings. MoverScore assesses the similarity between the generated text and the reference based on the Earth Mover's Distance.

Each metric has its downsides. For example, BLEU may not consider semantic similarity, ROUGE may not capture the overall meaning, BERTScore could be computationally expensive, and MoverScore may require additional computational resources. These downsides highlight the importance of understanding the limitations of each metric when evaluating text generation quality.

---

**`Source Node 1/1`**

**Node ID:** 63c32918-94ab-49aa-88ca-19774e78d081<br>**Similarity:** 0.8380886067511155<br>**Text:** GPT-4) prefers the output of one model over a reference model. Metrics include win rate, bias, latency, price, variance, etc. Validated to have high agreement with 20k human annotations.

We can group metrics into two categories: context-dependent or context-free.

  * **Context-dependent** : These take context into account. They’re often proposed for a specific task; repurposing them for other tasks will require some adjustment.
  * **Context-free** : These aren’t tied to the context when evaluating generated output; they only compare the output with the provided gold references. As they’re task agnostic, they’re easier to apply to a wide variety of tasks.

To get a better sense of these metrics (and their potential shortfalls), we’ll
explore a few of the commonly used metrics such as BLEU, ROUGE, BERTScore, and
MoverScore.

**[BLEU](https://dl.acm.org/doi/10.3115/1073083.1073135) (Bilingual Evaluation
Understudy)** is a precision-based metric: It counts the number of n-grams in
th...<br>**Metadata:** {}<br>

### Response 1

In [36]:
print(query_str, '\n')

display_response(
    response1, source_length=1000, show_source=True, show_source_metadata=True
)

Can you describe metrics for evaluating text generation quality, compare them, and tell me about their downsides 



**`Final Response:`** Metrics for evaluating text generation quality vary in their effectiveness depending on the task requirements. Some metrics, like BLEU and ROUGE, are commonly used but may not be suitable for tasks that demand creativity and diversity. These metrics rely on n-gram overlap between the generated text and a reference, which can limit their applicability in tasks such as abstractive summarization or dialogue generation where responses can vary widely. Additionally, these metrics may exhibit poor adaptability to different tasks and have issues with reproducibility, leading to challenges in reliably evaluating the quality of text generation models.

---

**`Source Node 1/1`**

**Node ID:** cfa77c3a-8b7e-41a4-9fd3-13eb1044cc8c<br>**Similarity:** 0.8512667214191415<br>**Text:** with tasks that require creativity and
diversity](https://arxiv.org/abs/2303.16634).

Second, these metrics often have **poor adaptability to a wider variety of
tasks**. Adopting a metric proposed for one task to another is not always
prudent. For example, exact match metrics such as BLEU and ROUGE are a poor
fit for tasks like abstractive summarization or dialogue. Since they’re based
on n-gram overlap between output and reference, they don’t make sense for a
dialogue task where a wide variety of responses are possible. An output can
have zero n-gram overlap with the reference but yet be a good response.

Third, these metrics have **poor reproducibility**. Even for the same metric,
[high variance is reported across different
studies](https://arxiv.org/abs/2008.12009), possibly due to variations in
human judgment collection or metric parameter settings. Another study of
[ROUGE scores](https://aclanthology.org/2023.acl-long.107/) across 2,000
studies found that scores were hard to re...<br>**Metadata:** {'questions_this_excerpt_can_answer': '1. How do existing metrics for evaluating natural language generation models perform when tasks require creativity and diversity?\n2. Why is it not always appropriate to adopt a metric proposed for one task to evaluate performance on another task in natural language generation?\n3. What challenges are associated with the reproducibility of metrics used to evaluate natural language generation models, and how do these challenges impact the reliability of research findings in the field?'}<br>

# Metadata Extraction Usage Pattern

You can use LLMs to automate metadata extraction with our Metadata Extractor modules.

Our metadata extractor modules include the following "feature extractors":

- `SummaryExtractor` - automatically extracts a summary over a set of Nodes
- `QuestionsAnsweredExtractor` - extracts a set of questions that each Node can answer
- `TitleExtractor` - extracts a title over the context of each Node
- `EntityExtractor` - extracts entities (e.g. names of places, people, things) mentioned in the content of each Node

You can then chain the metadata extractors with a node parser in a single pipeline.

In [48]:
# https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_metadata_extractor/

In [58]:
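import llama_index.core.extractors.metadata_extractors  # import the submodule explicitly rather than relying on earlier imports
dir(llama_index.core.extractors.metadata_extractors)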

['Any',
 'BaseExtractor',
 'BaseNode',
 'BasePydanticProgram',
 'Callable',
 'DEFAULT_ENTITY_MAP',
 'DEFAULT_ENTITY_MODEL',
 'DEFAULT_EXTRACT_TEMPLATE_STR',
 'DEFAULT_KEYWORD_EXTRACT_TEMPLATE',
 'DEFAULT_NUM_WORKERS',
 'DEFAULT_QUESTION_GEN_TMPL',
 'DEFAULT_SUMMARY_EXTRACT_TEMPLATE',
 'DEFAULT_TITLE_COMBINE_TEMPLATE',
 'DEFAULT_TITLE_NODE_TEMPLATE',
 'Dict',
 'Field',
 'KeywordExtractor',
 'LLM',
 'List',
 'Optional',
 'PrivateAttr',
 'PromptTemplate',
 'PydanticProgramExtractor',
 'QuestionsAnsweredExtractor',
 'Sequence',
 'SerializeAsAny',
 'Settings',
 'SummaryExtractor',
 'TextNode',
 'TitleExtractor',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'add_class_name',
 'cast',
 'run_jobs']

In [65]:
from llama_index.core.extractors.metadata_extractors import KeywordExtractor  # not covered on the docs page linked above
kw_extractor = KeywordExtractor(llm=llm, keywords=10)  # a custom keyword-extraction prompt template can also be supplied

In [66]:
from llama_index.core.extractors import (
    TitleExtractor,
    QuestionsAnsweredExtractor,
)
from llama_index.core.node_parser import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ", chunk_size=512, chunk_overlap=128
)
title_extractor = TitleExtractor(nodes=5)
qa_extractor = QuestionsAnsweredExtractor(questions=3)

# assume documents are defined -> extract nodes
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[text_splitter, title_extractor, qa_extractor, kw_extractor]
)

nodes = pipeline.run(
    documents=documents,
    in_place=True,
    show_progress=True,
)

Parsing nodes:   0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 5/5 [00:01<00:00,  4.48it/s]
100%|██████████| 46/46 [00:19<00:00,  2.34it/s]
100%|██████████| 46/46 [00:09<00:00,  5.06it/s]


In [67]:
nodes[0].metadata

{'file_path': '/content/data/paul_graham/paul_graham_essay.txt',
 'file_name': 'paul_graham_essay.txt',
 'file_type': 'text/plain',
 'file_size': 75042,
 'creation_date': '2024-10-03',
 'last_modified_date': '2024-10-03',
 'document_title': '"From Punch Cards to AI: A Journey in Programming, Philosophy, and the Illusion of Artificial Intelligence"',
 'questions_this_excerpt_can_answer': "1. How did the transition from using punch cards on the IBM 1401 to microcomputers impact the author's programming experience and capabilities?\n2. What were some of the challenges the author faced when working with the IBM 1401, and how did these limitations shape their early programming endeavors?\n3. How did the author's early experiences with writing short stories and programming in their youth influence their later work and perspectives on technology and creativity?",
 'excerpt_keywords': 'punch cards, IBM 1401, programming, Fortran, microcomputers, data processing, creativity, technology, limitat

# Extract Metadata Using PydanticProgramExtractor

In [37]:
from pydantic import BaseModel, Field
from typing import List

class NodeMetadata(BaseModel):
    """Node metadata."""

    entities: List[str] = Field(
        ..., description="Unique entities in this text chunk."
    )
    summary: str = Field(
        ..., description="A concise summary of this text chunk."
    )

In [39]:
from llama_index.program.openai import OpenAIPydanticProgram
from llama_index.core.extractors import PydanticProgramExtractor

EXTRACT_TEMPLATE_STR = """\
Here is the content of the section:
----------------
{context_str}
----------------
Given the contextual information, extract out a {class_name} object.\
"""

openai_program = OpenAIPydanticProgram.from_defaults(
    output_cls=NodeMetadata,
    prompt_template_str="{input}",
    extract_template_str=EXTRACT_TEMPLATE_STR,
)

metadata_extractor = PydanticProgramExtractor(
    program=openai_program, input_key="input", show_progress=True
)

extract_metadata = metadata_extractor.extract(orig_nodes[0:1])

100%|██████████| 1/1 [00:01<00:00,  1.25s/it]


In [40]:
extract_metadata

[{'entities': ['eugeneyan', 'llm', 'engineering', 'production'],
  'summary': 'Patterns for Building LLM-based Systems & Products'}]

In [41]:
metadata_nodes = metadata_extractor.process_nodes(orig_nodes[0:1]);metadata_nodes

100%|██████████| 1/1 [00:00<00:00,  1.05it/s]


[TextNode(id_='a64e2060-42d5-48c9-b719-74c70f6a8b36', embedding=None, metadata={'entities': ['eugeneyan', 'llm', 'engineering', 'production'], 'summary': 'Patterns for Building LLM-based Systems & Products - Discussions on HackerNews, Twitter, and LinkedIn. Content includes discussions on self-driving technology.'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='https://eugeneyan.com/writing/llm-patterns/', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='31bdcee06733c3d18c370cb5296006308d8e200cf59ec243654906e320b0825a'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='f8852ce8-d054-4c27-a20b-a51b0f3fa140', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='993e43bb060cf2f183f894f8dec6708eadcac2b7d2760a94916dc82c24255acc')}, text='# [eugeneyan](/)\n\n  * [Start Here](/start-here/ "Start Here")\n  * [Writing](/writing/ "Writing")\n  * [Speaking](/speaking/ "Speaking")\n  * [Prototyp

In [46]:
metadata_nodes[0].metadata

{'entities': ['eugeneyan', 'llm', 'engineering', 'production'],
 'summary': 'Patterns for Building LLM-based Systems & Products - Discussions on HackerNews, Twitter, and LinkedIn. Content includes discussions on self-driving technology.'}
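
The two calls above differ in what they return: `extract` gives back one metadata dict per node, while `process_nodes` returns the nodes with that metadata attached. A minimal sketch of the relationship, assuming the merge is a plain dictionary update; `ToyNode` and `ToyExtractor` are hypothetical stand-ins, and a real extractor would call an LLM inside `extract`.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


class ToyExtractor:
    """Hypothetical extractor that counts words instead of calling an LLM."""

    def extract(self, nodes):
        # One metadata dict per node, in the same order as the input.
        return [{"word_count": len(node.text.split())} for node in nodes]

    def process_nodes(self, nodes):
        # Merge each extracted dict into the matching node's metadata.
        for node, extra in zip(nodes, self.extract(nodes)):
            node.metadata.update(extra)
        return nodes


toy_nodes = [ToyNode("Patterns for Building LLM-based Systems"), ToyNode("Evals")]
extractor = ToyExtractor()
extracted = extractor.extract(toy_nodes)
processed = extractor.process_nodes(toy_nodes)
print(extracted)
print(processed[0].metadata)
```

In the notebook each call triggers a fresh LLM request, which is why the summary returned by `extract` and the one stored by `process_nodes` are worded differently.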