# Building a RAG Pipeline with Metadata Extraction

<a href="https://colab.research.google.com/github/run-llama/llama_extract/blob/main/examples/rag/rag_metadata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cookbook shows you how to build an end-to-end RAG pipeline that uses automatic metadata extraction to improve retrieval and synthesis quality over unstructured text data. The core tools we use are LlamaIndex, LlamaExtract, and LlamaParse.

In [None]:
!pip install llama-extract llama-parse llama-index
!pip install llama-index-llms-openai llama-index-embeddings-openai

## Setup

Create a [LlamaCloud account](https://cloud.llamaindex.ai/) if you haven't already done so, then set your LlamaCloud API key below.

In [None]:
import os

os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."
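
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch, assuming the key was exported in your shell beforehand, the check below fails early with a readable message if it is missing. `require_env` is a hypothetical helper written for this example, not part of any LlamaCloud SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before running this notebook."
        )
    return value


# Hypothetical usage, once the key is exported in your shell:
# api_key = require_env("LLAMA_CLOUD_API_KEY")
```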

#### Load Data

In [None]:
from pathlib import Path

folder = "../data/resumes"
files = ["12780508.pdf", "14224370.pdf", "19545827.pdf"]
full_files = [str(Path(folder) / f) for f in files]
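
Before sending anything to a remote service, it can help to confirm that every path resolves to a real file. The sketch below is one way to do that; `find_missing` is a hypothetical helper introduced here, and the usage lines assume the `full_files` list defined above.

```python
from pathlib import Path


def find_missing(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [p for p in paths if not Path(p).is_file()]


# Hypothetical usage with the `full_files` list defined above:
# missing = find_missing(full_files)
# if missing:
#     raise FileNotFoundError(f"Missing input files: {missing}")
```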

## Load Documents and attach Metadata

We extract the metadata from each document and attach it on top of the parsed text.

There are two options for defining the schema:
1. **Use a pre-defined schema**: We use a pre-defined `ResumeMetadata` class to extract metadata values into. This is the most reliable way to generate metadata.
2. **Infer metadata using LlamaExtract**: We can use LlamaExtract's schema inference capabilities to infer a metadata schema from an existing set of documents.

**NOTE**: If you are using (2), you need to make sure you edit the schema afterwards to make it concise and non-nested. LlamaExtract's schema inference is currently in beta and may extract complicated schemas from existing documents. Simple, concise metadata typically works much better for RAG setups! 
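
To make "non-nested" concrete, here is a rough sketch that lists the top-level properties of a JSON schema that are objects or arrays of objects, which are the ones you would likely want to simplify or remove. `nested_fields` and the toy `example_schema` are hypothetical and exist only for illustration; they are not part of LlamaExtract.

```python
def nested_fields(schema: dict) -> list:
    """List top-level properties that are objects or arrays of objects."""
    nested = []
    for name, spec in schema.get("properties", {}).items():
        if spec.get("type") == "object":
            nested.append(name)
        elif (
            spec.get("type") == "array"
            and spec.get("items", {}).get("type") == "object"
        ):
            nested.append(name)
    return nested


# Toy schema for illustration only
example_schema = {
    "type": "object",
    "properties": {
        "Skills": {"type": "array", "items": {"type": "string"}},
        "Education": {"type": "object", "properties": {"degree": {"type": "string"}}},
        "WorkHistory": {"type": "array", "items": {"type": "object", "properties": {}}},
        "ProfessionalSummary": {"type": "string"},
    },
}
print(nested_fields(example_schema))  # ['Education', 'WorkHistory']
```

Plain strings and arrays of strings pass the check, since they can be stored as simple metadata values.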

In [None]:
from llama_extract import LlamaExtract

SCHEMA_NAME = "TEST_SCHEMA_2"
extractor = LlamaExtract()

### Option 1: Define the Schema Manually

In [None]:
from pydantic import BaseModel, Field


class ResumeMetadata(BaseModel):
    """Resume metadata."""

    years_of_experience: int = Field(
        ..., description="Number of years of work experience."
    )
    highest_degree: str = Field(
        ...,
        description="Highest degree earned (options: High School, Bachelor's, Master's, Doctoral, Professional)",
    )
    professional_summary: str = Field(
        ..., description="A general summary of the candidate's experience"
    )

In [None]:
ResumeMetadata.schema()

{'description': 'Resume metadata.',
 'properties': {'years_of_experience': {'description': 'Number of years of work experience.',
   'title': 'Years Of Experience',
   'type': 'integer'},
  'highest_degree': {'description': "Highest degree earned (options: High School, Bachelor's, Master's, Doctoral, Professional)",
   'title': 'Highest Degree',
   'type': 'string'},
  'professional_summary': {'description': "A general summary of the candidate's experience",
   'title': 'Professional Summary',
   'type': 'string'}},
 'required': ['years_of_experience', 'highest_degree', 'professional_summary'],
 'title': 'ResumeMetadata',
 'type': 'object'}

In [None]:
extraction_schema = await extractor.acreate_schema(
    "TEST_SCHEMA_3", ResumeMetadata.schema()
)

### Option 2: Schema Inference

We first use LlamaExtract to infer the schema from a subset of these files.

Make sure you specify a schema name - this will be visible in the UI! 

In [None]:
extraction_schema = await extractor.ainfer_schema(SCHEMA_NAME, [full_files[0]])

In [None]:
extraction_schema.data_schema

{'type': 'object',
 'properties': {'Skills': {'type': 'array', 'items': {'type': 'string'}},
  'Education': {'type': 'object',
   'properties': {'degree': {'type': 'string'},
    'institution': {'type': 'string'},
    'fieldOfStudy': {'type': 'string'},
    'graduationDate': {'type': 'string'}}},
  'Supervision': {'type': 'object',
   'properties': {'teamSize': {'type': 'integer'}}},
  'WorkHistory': {'type': 'array',
   'items': {'type': 'object',
    'properties': {'endDate': {'type': 'string'},
     'jobTitle': {'type': 'string'},
     'location': {'type': 'string'},
     'startDate': {'type': 'string'},
     'companyName': {'type': 'string'},
     'responsibilities': {'type': 'array', 'items': {'type': 'string'}}}}},
  'Accomplishments': {'type': 'array', 'items': {'type': 'string'}},
  'AccountingSupport': {'type': 'object',
   'properties': {'hours': {'type': 'integer'}, 'tasks': {'type': 'string'}}},
  'ProfessionalSummary': {'type': 'string'},
  'FinancialServiceRepresentative'

#### Adjust the Schema

Make any modifications to the schema as necessary. (**note**: This may depend on the output of your specific extraction)

In [None]:
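import copy

# Deep-copy so that deleting nested keys does not also mutate extraction_schema.data_schema
new_schema = copy.deepcopy(extraction_schema.data_schema)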
del new_schema["properties"]["AccountingSupport"]
del new_schema["properties"]["FinancialServiceRepresentative"]

# TODO: make further modifications yourself to make sure the extracted metadata is flat and concise

In [None]:
new_schema

{'type': 'object',
 'properties': {'Skills': {'type': 'array', 'items': {'type': 'string'}},
  'Education': {'type': 'object',
   'properties': {'degree': {'type': 'string'},
    'institution': {'type': 'string'},
    'fieldOfStudy': {'type': 'string'},
    'graduationDate': {'type': 'string'}}},
  'Supervision': {'type': 'object',
   'properties': {'teamSize': {'type': 'integer'}}},
  'WorkHistory': {'type': 'array',
   'items': {'type': 'object',
    'properties': {'endDate': {'type': 'string'},
     'jobTitle': {'type': 'string'},
     'location': {'type': 'string'},
     'startDate': {'type': 'string'},
     'companyName': {'type': 'string'},
     'responsibilities': {'type': 'array', 'items': {'type': 'string'}}}}},
  'Accomplishments': {'type': 'array', 'items': {'type': 'string'}},
  'ProfessionalSummary': {'type': 'string'}}}

In [None]:
update_response = await extractor.aupdate_schema(extraction_schema.id, new_schema)

### Run Extraction

We now run extraction on each document and keep the list of results; each result holds its extracted fields as a JSON dictionary in `data`.

In [None]:
extraction_results = await extractor.aextract(extraction_schema.id, full_files)

Extracting files: 100%|██████████| 3/3 [00:04<00:00,  1.41s/it]


In [None]:
extraction_results

[ExtractionResult(id='c339f6cf-4a8b-4705-b9ca-74c4c8946803', created_at=datetime.datetime(2024, 7, 24, 22, 14, 28, 731534, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2024, 7, 24, 22, 14, 28, 731534, tzinfo=datetime.timezone.utc), schema_id='446278bf-ba61-4e2c-bfcb-37645e86126b', data={'highest_degree': "Bachelor's", 'years_of_experience': '2', 'professional_summary': 'Experienced financial service representative with a background in accounting support, customer service, and team supervision. Skilled in maintaining financial records, processing accounts payable, and ensuring compliance with procedural standards.'}, file=File(id='857ee650-0177-45d3-a39c-edf769deb86d', created_at=datetime.datetime(2024, 7, 24, 21, 47, 29, 573597, tzinfo=datetime.timezone.utc), updated_at=datetime.datetime(2024, 7, 24, 21, 47, 29, 573597, tzinfo=datetime.timezone.utc), name='12780508.pdf', file_size=25458, file_type='pdf', project_id='41711594-88c8-4ddf-b5bc-9c7a20725158', last_modified_at

If you pre-specified the metadata schema through `ResumeMetadata`, run the code block below as-is. If you used LlamaExtract's schema inference instead, run the commented-out code.

In [None]:
# Use this if you pre-specified the metadata schema
metadatas = [ResumeMetadata.parse_obj(r.data).dict() for r in extraction_results]

# # Use this if you are using LlamaExtract's schema inference
# # NOTE: Nested schemas do not work well for metadata filtering.
# # If LlamaExtract inferred a nested schema, it is your responsibility to simplify and flatten it
# # so we can easily attach to each document!
# metadatas = [r.data for r in extraction_results]
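
If the inferred schema is still nested, one possible way to flatten each extracted dictionary before attaching it is sketched below. `flatten_metadata` is a hypothetical helper, not a LlamaExtract or LlamaIndex function. It makes two assumptions that you should adjust to your data: nested objects become dotted keys, and lists of scalars are joined into a comma-separated string. Lists of objects (such as a work history) are dropped because they have no obvious flat form.

```python
def flatten_metadata(data: dict, parent_key: str = "", sep: str = ".") -> dict:
    """Flatten a nested dict into a single level of scalar values."""
    flat = {}
    for key, value in data.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Nested object: recurse and prefix the child keys
            flat.update(flatten_metadata(value, full_key, sep))
        elif isinstance(value, list):
            # Keep only lists made entirely of scalars; skip lists of objects
            if all(isinstance(v, (str, int, float, bool)) for v in value):
                flat[full_key] = ", ".join(str(v) for v in value)
        else:
            flat[full_key] = value
    return flat


# Toy record for illustration only
record = {
    "Skills": ["Excel", "SQL"],
    "Education": {"degree": "BS", "institution": "Example University"},
    "WorkHistory": [{"jobTitle": "Accountant"}],
}
print(flatten_metadata(record))
```

With a helper like this, the inferred-schema path could become `metadatas = [flatten_metadata(r.data) for r in extraction_results]`.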

In [None]:
metadatas[1]

{'years_of_experience': 10,
 'highest_degree': "Bachelor's",
 'professional_summary': 'Degreed accountant with more than 10 years of diversified accounting experience seeking accounting position at a well-established company in Houston.'}

### Load Documents

We then load these documents (using LlamaParse), and attach the metadata dictionaries to each document.

In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(result_type="text")
docs = parser.load_data(file_path=full_files)
# attach metadata
for metadata, doc in zip(metadatas, docs):
    doc.metadata.update(metadata)
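
Note that `zip` silently stops at the shorter input, so the loop above relies on the parser returning exactly one document per file, in the same order as `full_files`. The sketch below adds an explicit length check. `attach_metadata` and the `FakeDoc` stand-in are hypothetical and exist only to keep the example self-contained; with real parsed documents you would pass `docs` directly.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a parsed document; only the `metadata` dict matters here."""

    text: str
    metadata: dict = field(default_factory=dict)


def attach_metadata(docs, metadatas):
    """Attach one metadata dict per document, refusing mismatched lengths."""
    if len(docs) != len(metadatas):
        raise ValueError(
            f"Got {len(docs)} documents but {len(metadatas)} metadata dicts."
        )
    for doc, metadata in zip(docs, metadatas):
        doc.metadata.update(metadata)
    return docs


demo_docs = attach_metadata(
    [FakeDoc("resume A"), FakeDoc("resume B")],
    [{"years_of_experience": 2}, {"years_of_experience": 10}],
)
print([d.metadata for d in demo_docs])
```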

## Build Index and Run

With the metadata attached to each document, we can now build a vector index and query it.

Since each document carries metadata, we can either specify metadata filters directly or [auto-infer them](https://docs.llamaindex.ai/en/stable/examples/vector_stores/pinecone_auto_retriever/) to get higher-precision retrieval.

In [None]:
from llama_index.core import VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding


llm = OpenAI(model="gpt-4o-mini")
embed_model = OpenAIEmbedding(model="text-embedding-3-small")

Settings.llm = llm
Settings.embed_model = embed_model

index = VectorStoreIndex(docs)

In [None]:
from llama_index.core.vector_stores import MetadataFilters
from llama_index.core.query_engine import RetrieverQueryEngine
from typing import Optional


def get_query_engine(filters: Optional[MetadataFilters] = None):
    retriever = index.as_retriever(similarity_top_k=2, filters=filters)
    query_engine = RetrieverQueryEngine.from_args(
        retriever, response_mode="tree_summarize"
    )
    return query_engine

In [None]:
# Try querying with metadata filters
filters = MetadataFilters.from_dicts(
    [{"key": "years_of_experience", "value": 5, "operator": ">"}]
)
query_engine = get_query_engine(filters=filters)
response = query_engine.query(
    "What is the most recent job experience of the most senior candidate?"
)
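
Conceptually, a metadata filter narrows the candidate set before similarity ranking. The pure-Python sketch below mimics that behavior on plain dictionaries, assuming that multiple filters are combined with AND. It is an illustration only, not how LlamaIndex or any vector store implements filtering, and `matches`, `OPS`, and the toy `candidates` are hypothetical names.

```python
import operator

# Map filter operator strings to Python comparison functions
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def matches(metadata: dict, filter_dicts: list) -> bool:
    """Return True if the metadata satisfies every filter (AND semantics)."""
    for f in filter_dicts:
        if f["key"] not in metadata:
            return False
        if not OPS[f["operator"]](metadata[f["key"]], f["value"]):
            return False
    return True


# Toy metadata for illustration only
candidates = [
    {"years_of_experience": 2, "highest_degree": "Bachelor's"},
    {"years_of_experience": 10, "highest_degree": "Bachelor's"},
]
filter_dicts = [{"key": "years_of_experience", "value": 5, "operator": ">"}]
kept = [m for m in candidates if matches(m, filter_dicts)]
print(kept)  # only the candidate with 10 years of experience remains
```

This is why flat, correctly typed metadata matters: a comparison such as `>` only behaves as expected when `years_of_experience` is stored as a number rather than a string.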

In [None]:
print("**** RESPONSE ****")
print(str(response))

print("**** METADATA ****")
print(response.source_nodes[0].get_content(metadata_mode="all"))

**** RESPONSE ****
The most recent job experience of the most senior candidate is as an accountant, where they performed a variety of support duties related to the accounting function within a credit union. Their responsibilities included maintaining financial records, processing accounts payable, posting general ledger entries, reconciling accounts, and supervising two accounting clerks. They also prepared daily cash flow reports and ensured staff were adequately trained in their roles.
**** METADATA ****
years_of_experience: 10
highest_degree: Bachelor's
professional_summary: Degreed accountant with more than 10 years of diversified accounting experience seeking accounting position at a well-established company in Houston.

         40hrs Perform a variety of support duties related to the accounting function within the credit union; assisting the accounting team in
         maintaining the financial, statistical, and accounting records; Accounts Payable processing; posting general le