# LlamaCloud Client SDK: Document Metadata Management

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/client_sdk/doc_metadata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial shows you how to update the metadata of a document that is already in a LlamaCloud pipeline.

**NOTE**: To add new documents with metadata, check out our "Inserting Custom Documents" tutorial.

You can update metadata in two ways with the low-level client SDK: 
- Using our `update_pipeline_file` method to update the metadata of an uploaded file.
- Using our `upsert_batch_pipeline_documents` method to update the metadata of uploaded documents.

## Setup

Here we set up our environment variables, data, and the client SDK.

In [2]:
import nest_asyncio
nest_asyncio.apply()

In [3]:
import os

os.environ["LLAMA_CLOUD_BASE_URL"] = "https://api.cloud.llamaindex.ai"

In [None]:
os.environ["LLAMA_CLOUD_API_KEY"] = "<LLAMA_CLOUD_API_KEY>"
os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"
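Hardcoding keys in a notebook makes it easy to run later cells with a placeholder still in place. As a minimal sketch, a small helper can fail early when a variable is missing or still set to its `<...>` placeholder. The name `require_env` is hypothetical and not part of the LlamaCloud SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unusable.

    Hypothetical helper for this tutorial; not part of the LlamaCloud SDK.
    """
    value = os.environ.get(name, "").strip()
    # Treat the tutorial's "<...>" placeholders the same as an unset variable.
    if not value or (value.startswith("<") and value.endswith(">")):
        raise RuntimeError(f"Environment variable {name} is not set to a real value")
    return value
```

For example, `require_env("LLAMA_CLOUD_API_KEY")` could be called before constructing the client so that a missing key surfaces as a clear error rather than a failed request.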

#### Load Data

In [5]:
!mkdir -p data && wget "https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf" -O data/apple_2021_10k.pdf

--2024-07-03 21:18:33--  https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf
Resolving s2.q4cdn.com (s2.q4cdn.com)... 2a0b:4d07:2::3, 2a0b:4d07:2::1, 2a0b:4d07:2::4, ...
Connecting to s2.q4cdn.com (s2.q4cdn.com)|2a0b:4d07:2::3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 789896 (771K) [application/pdf]
Saving to: ‘apple_2021_10k.pdf’


2024-07-03 21:18:33 (12.3 MB/s) - ‘apple_2021_10k.pdf’ saved [789896/789896]



In [4]:
from llama_parse import LlamaParse

documents = LlamaParse(result_type="markdown").load_data("./data/apple_2021_10k.pdf")



#### Setup LlamaCloud Client SDK

In [18]:
from llama_cloud.client import LlamaCloud

client = LlamaCloud(
    token=os.environ["LLAMA_CLOUD_API_KEY"],
    base_url=os.environ["LLAMA_CLOUD_BASE_URL"]
)

#### Setup Index

Please set up an empty index. You can do this either through the UI or [programmatically](https://docs.cloud.llamaindex.ai/llamacloud/guides/framework_integration).

After you've done so, record the `pipeline_id`, `pipeline_name`, `project_id`, and `project_name` in the variables below. You'll need these later!

In [19]:
pipeline_id = "<pipeline_id>"
pipeline_name = "<pipeline_name>"
project_id = "<project_id>"
project_name = "<project_name>"
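Because the later cells fail in less obvious ways if a placeholder is left unchanged, a quick sanity check can help. The sketch below assumes that pipeline and project IDs are UUIDs, which matches the `pipeline_id` values printed later in this notebook but is not something this tutorial guarantees. The helper name `looks_like_uuid` is hypothetical.

```python
import uuid


def looks_like_uuid(value: str) -> bool:
    """Return True if `value` parses as a UUID.

    Assumption: LlamaCloud IDs are UUIDs, as the outputs in this notebook suggest.
    """
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True
```

With this in place, `looks_like_uuid(pipeline_id)` returns `False` for the unedited `"<pipeline_id>"` placeholder, so you can check both `pipeline_id` and `project_id` before making any API calls.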

## Updating Metadata in Files


The files can be either manually uploaded files or files from a data source, once they have been ingested.

#### Updating Metadata through `update_pipeline_file`

In [20]:
# upload file and add file to pipeline
with open('data/apple_2021_10k.pdf', 'rb') as f:
    file = client.files.upload_file(upload_file=f, project_id=project_id)
    pipeline_files = client.pipelines.add_files_to_pipeline(pipeline_id, request=[{'file_id': file.id}]) 

In [None]:
# adding metadata
pipeline_files = client.pipelines.update_pipeline_file(
    pipeline_id=pipeline_id, file_id=file.id, custom_metadata={ "editor": "jerry_liu" }
) 
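If several files need the same metadata, the call above can be wrapped in a loop. The following is a minimal sketch: `set_metadata_for_files` is a hypothetical helper, and it relies only on `update_pipeline_file` accepting the `pipeline_id`, `file_id`, and `custom_metadata` keyword arguments used in the previous cell.

```python
from typing import Any, Dict, Iterable, List


def set_metadata_for_files(
    client: Any,
    pipeline_id: str,
    file_ids: Iterable[str],
    custom_metadata: Dict[str, Any],
) -> List[Any]:
    """Apply the same custom metadata to each file in a pipeline.

    Hypothetical convenience wrapper around `update_pipeline_file`.
    """
    results = []
    for file_id in file_ids:
        results.append(
            client.pipelines.update_pipeline_file(
                pipeline_id=pipeline_id,
                file_id=file_id,
                # Pass a copy so each call gets its own dict.
                custom_metadata=dict(custom_metadata),
            )
        )
    return results
```

A call such as `set_metadata_for_files(client, pipeline_id, [file.id], {"editor": "jerry_liu"})` would then be equivalent to the single update shown above.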

#### Updating Metadata through `upsert_batch_pipeline_documents`

In [23]:
pipeline_docs = client.pipelines.list_pipeline_documents(pipeline_id)
len(pipeline_docs)

1

In [24]:
# inspect the first document
pipeline_docs[0].metadata

{'file_size': '789896',
 'last_modified_at': '2024-07-04T06:39:23',
 'file_path': 'apple_2021_10k.pdf',
 'file_name': 'apple_2021_10k.pdf',
 'pipeline_id': 'b4b8a624-cd50-4f54-8d20-a756427d961f'}

In [14]:
# change the metadata of the document
pipeline_docs[0].metadata["editor"] = "simon_suo"

In [15]:
upserted_docs = client.pipelines.upsert_batch_pipeline_documents(pipeline_id, request=[pipeline_docs[0]])
upserted_docs[0].metadata

{'file_size': '789896',
 'last_modified_at': '2024-07-04T06:37:50',
 'file_path': 'apple_2021_10k.pdf',
 'file_name': 'apple_2021_10k.pdf',
 'pipeline_id': 'a2de81e0-6917-4e23-8874-5f5170b1aa79',
 'editor': 'simon_suo'}
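The same pattern extends to every document in a pipeline. As a sketch under the assumption that each document's `metadata` is a mutable dict (as the assignment above suggests), the hypothetical helper below merges extra keys into all documents and sends them back in a single `upsert_batch_pipeline_documents` call.

```python
from typing import Any, Dict, List


def upsert_with_metadata(
    client: Any,
    pipeline_id: str,
    docs: List[Any],
    extra_metadata: Dict[str, Any],
) -> List[Any]:
    """Merge `extra_metadata` into each document, then upsert them as one batch.

    Hypothetical helper; it mutates the documents in `docs` in place.
    """
    for doc in docs:
        # Existing keys are kept; keys in `extra_metadata` overwrite on conflict.
        doc.metadata.update(extra_metadata)
    return client.pipelines.upsert_batch_pipeline_documents(pipeline_id, request=docs)
```

For instance, `upsert_with_metadata(client, pipeline_id, pipeline_docs, {"editor": "simon_suo"})` would tag every listed document rather than only the first.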

## Test Retrieval

We test retrieval through the framework integration.

#### Retrieval Through the Framework Integration

We define a query engine through the Python framework, using our `LlamaCloudIndex`.

In [None]:
from llama_index.indices.managed.llama_cloud import LlamaCloudIndex
import os

index = LlamaCloudIndex(
  name=pipeline_name, 
  project_name=project_name,
  api_key=os.getenv("LLAMA_CLOUD_API_KEY")
)

query_engine = index.as_query_engine(rerank_top_n=1)
response = query_engine.query("Who is the editor of this document?")
print(str(response) + "\n-------\n\nSources:\n\n")
for n in response.source_nodes:
    print(n.get_content(metadata_mode="all"))
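Rather than reading the printed sources by eye, you can check programmatically whether the updated metadata reached the retrieved nodes. This is a minimal sketch: `metadata_values` is a hypothetical helper, and it assumes each source node exposes its metadata as a dict at `n.node.metadata`.

```python
from typing import Any, Iterable, List


def metadata_values(source_nodes: Iterable[Any], key: str) -> List[Any]:
    """Collect the values stored under `key` in each source node's metadata."""
    values = []
    for n in source_nodes:
        # Assumption: `n.node.metadata` is a plain dict of metadata fields.
        metadata = n.node.metadata
        if key in metadata:
            values.append(metadata[key])
    return values
```

Calling `metadata_values(response.source_nodes, "editor")` should then return the editor values attached to the retrieved nodes, or an empty list if the metadata has not propagated yet.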