# LlamaCloud Client SDK: Inserting Custom Documents

<a href="https://colab.research.google.com/github/run-llama/llamacloud-demo/blob/main/examples/client_sdk/create_custom_doc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial shows you how to use the lower-level LlamaCloud Client SDK to insert custom documents into a pipeline.

We can do this on two levels:

- Inserting document text as `CloudDocumentCreate` objects.
- Uploading files directly.

We insert both a financial document (parsed separately with LlamaParse) and a toy custom document.

We can then retrieve from the pipeline using either the lower-level SDK or the higher-level framework integration.

In [2]:
import nest_asyncio
nest_asyncio.apply()

In [None]:
!pip install llama-index
!pip install llama-cloud

## Setup

Here we set up our environment variables, data, and the client SDK.

In [3]:
import os

os.environ["LLAMA_CLOUD_BASE_URL"] = "https://api.cloud.llamaindex.ai"

In [None]:
os.environ["LLAMA_CLOUD_API_KEY"] = "<LLAMA_CLOUD_API_KEY>"
os.environ["OPENAI_API_KEY"] = "<OPENAI_API_KEY>"
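
Placeholder values such as `<LLAMA_CLOUD_API_KEY>` are easy to leave in place by accident. A minimal sketch of a pre-flight check follows, assuming the keys are supplied through environment variables as above. The helper name `missing_keys` is hypothetical and not part of either SDK.

```python
import os

REQUIRED_KEYS = ("LLAMA_CLOUD_API_KEY", "OPENAI_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required keys that are unset, empty, or still a <PLACEHOLDER>."""
    missing = []
    for key in required:
        value = env.get(key, "").strip()
        # Treat "<SOMETHING>" as an unfilled template value (an assumption of this sketch).
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(key)
    return missing


problems = missing_keys(os.environ)
if problems:
    print("Set these before continuing:", ", ".join(problems))
```

Running this before creating the client surfaces a missing key immediately instead of as an authentication error later on.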

#### Load Data

In [5]:
!mkdir -p data
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf" -O data/apple_2021_10k.pdf

--2024-07-03 21:18:33--  https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf
Resolving s2.q4cdn.com (s2.q4cdn.com)... 2a0b:4d07:2::3, 2a0b:4d07:2::1, 2a0b:4d07:2::4, ...
Connecting to s2.q4cdn.com (s2.q4cdn.com)|2a0b:4d07:2::3|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 789896 (771K) [application/pdf]
Saving to: ‘apple_2021_10k.pdf’


2024-07-03 21:18:33 (12.3 MB/s) - ‘apple_2021_10k.pdf’ saved [789896/789896]



In [6]:
from llama_parse import LlamaParse

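# Parse the PDF downloaded above into markdown documents
documents = LlamaParse(result_type="markdown").load_data("./data/apple_2021_10k.pdf")
# Tag every parsed document so it can be filtered on later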
for d in documents:
    d.metadata["type"] = "financial"

Started parsing the file under job_id cac11eca-f879-4906-8163-ce475134f434


#### Setup LlamaCloud Client SDK and Framework Client

Here we define both the client (giving us access to low-level client operations) as well as the `LlamaCloudIndex` defined through the framework.

In [10]:
from llama_cloud.client import LlamaCloud

client = LlamaCloud(
    token=os.environ["LLAMA_CLOUD_API_KEY"],
    base_url=os.environ["LLAMA_CLOUD_BASE_URL"]
)

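from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

# NOTE: pipeline_name and project_name are defined in the "Setup Index" section below;
# run that cell first.
index = LlamaCloudIndex(
    name=pipeline_name,
    project_name=project_name,
    api_key=os.getenv("LLAMA_CLOUD_API_KEY"),
)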

#### Setup Index

Please set up an empty index. You can do this either through the UI or [programmatically](https://docs.cloud.llamaindex.ai/llamacloud/guides/framework_integration).

After you've done so, note down the `pipeline_id`, `pipeline_name`, `project_id`, and `project_name` in the variables below. The client SDK calls use the ids, and `LlamaCloudIndex` uses the names.

In [12]:
pipeline_id = "<pipeline_id>"
pipeline_name = "<pipeline_name>"
project_id = "<project_id>"
project_name = "<project_name>"

## Inserting Documents

Now let's create the custom Document objects. This assumes your pipeline was created in the previous section and that its pipeline and project ids are set in the variables above.

We insert one document containing the parsed document text, and another document as a toy example.

#### Inserting Document Objects through the Client SDK

In [13]:
from llama_index.core.schema import Document

all_documents = [
    *documents, # LlamaParsed document
    Document(
        text="Jerry likes apples",
        metadata={"type": "test"},
    )
]

llama_cloud_documents = [d.to_cloud_document() for d in all_documents]

upserted_docs = client.pipelines.upsert_batch_pipeline_documents(
    pipeline_id, request=llama_cloud_documents
)
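
For larger document sets it may be preferable to send several smaller requests instead of one. The sketch below assumes that `upsert_batch_pipeline_documents` returns a list, as `upserted_docs` above suggests. The helper names and the default batch size of 100 are hypothetical illustrations, not documented limits.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upsert_in_batches(client, pipeline_id, cloud_documents, batch_size=100):
    """Upsert documents batch by batch and collect whatever each call returns."""
    upserted = []
    for batch in chunked(cloud_documents, batch_size):
        # Same SDK call as in the cell above, applied to one slice at a time.
        result = client.pipelines.upsert_batch_pipeline_documents(
            pipeline_id, request=batch
        )
        upserted.extend(result)
    return upserted
```

With the variables from this notebook, the call would be `upsert_in_batches(client, pipeline_id, llama_cloud_documents)`.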

#### Inserting Document Objects through the Framework Integration

You can also call `index.insert` to upload document objects directly, using the types defined by the framework.

In [53]:
# NOTE: the llamaparsed document is already in the right representation
from llama_index.core.schema import Document

documents_to_upload = [
    *documents,
    Document(
        text="Bob likes burgers",
        metadata={
            "type": "test"
        }
    )
]

for doc in documents_to_upload:
    index.insert(doc)

#### Inserting Files Directly

You can also insert files directly.

**NOTE**: To customize metadata, follow the "Document Metadata Management" tutorial.

In [None]:
with open('data/apple_2021_10k.pdf', 'rb') as f:
    file = client.files.upload_file(upload_file=f, project_id=project_id)
    pipeline_files = client.pipelines.add_files_to_pipeline(pipeline_id, request=[{'file_id': file.id}])
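
The same two calls extend naturally to a whole folder. The sketch below is one possible way to do that: the helper `upload_directory` and its `pattern` parameter are hypothetical, and it only uses `client.files.upload_file` and `client.pipelines.add_files_to_pipeline` in the form shown above.

```python
from pathlib import Path


def upload_directory(client, project_id, pipeline_id, directory, pattern="*.pdf"):
    """Upload every file matching `pattern` and attach the uploads to the pipeline."""
    file_ids = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open("rb") as f:
            uploaded = client.files.upload_file(upload_file=f, project_id=project_id)
        file_ids.append(uploaded.id)

    # Nothing matched: skip the pipeline call entirely.
    if not file_ids:
        return []

    return client.pipelines.add_files_to_pipeline(
        pipeline_id, request=[{"file_id": file_id} for file_id in file_ids]
    )
```

For this notebook, `upload_directory(client, project_id, pipeline_id, "data")` would pick up `data/apple_2021_10k.pdf`.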

#### Validating the Documents

After the documents have been inserted, we can validate that they exist in the pipeline.

In [None]:
pipeline_docs = client.pipelines.list_pipeline_documents(pipeline_id)
docs = [Document.from_cloud_document(d) for d in pipeline_docs]
print(len(pipeline_docs))
print(docs[0].get_content(metadata_mode="all"))
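
Beyond the total count, it can help to check how many documents carry each `type` tag. This is a small sketch, assuming each document exposes a `metadata` dict as the framework `Document` objects in `docs` do. The helper `count_by_metadata` is hypothetical.

```python
from collections import Counter


def count_by_metadata(docs, key="type"):
    """Count documents per value of metadata[key]; a missing key counts as None."""
    return Counter((doc.metadata or {}).get(key) for doc in docs)
```

Calling `count_by_metadata(docs)` after the cell above should show separate counts for the `financial` and `test` tags set earlier in this tutorial.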

#### Deleting the Documents

If you want to reset, you can use the client SDK to delete pipeline documents.

In [None]:
pipeline_docs = client.pipelines.list_pipeline_documents(pipeline_id)
for doc in pipeline_docs:
    client.pipelines.delete_pipeline_document(pipeline_id, doc.id)
client.pipelines.sync_pipeline(pipeline_id)
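
To remove only some documents, for example just the toy ones, the loop above can be filtered on metadata. The sketch assumes that the objects returned by `list_pipeline_documents` expose a `metadata` dict alongside `id`. The helper `delete_documents_by_type` is hypothetical.

```python
def delete_documents_by_type(client, pipeline_id, doc_type):
    """Delete pipeline documents whose metadata["type"] equals `doc_type`."""
    deleted_ids = []
    for doc in client.pipelines.list_pipeline_documents(pipeline_id):
        # Assumption: each returned document has a `metadata` dict (possibly None).
        if (doc.metadata or {}).get("type") == doc_type:
            client.pipelines.delete_pipeline_document(pipeline_id, doc.id)
            deleted_ids.append(doc.id)
    return deleted_ids
```

As in the cell above, follow this with `client.pipelines.sync_pipeline(pipeline_id)` so the pipeline reflects the deletions.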

## Test Retrieval

We can test retrieval through both the client SDK as well as the framework integration.

#### Retrieval Through the Client SDK

In [46]:
# Example 1 - retrieve documents filtered by the "type" metadata field
results = client.pipelines.run_search(
    pipeline_id, 
    query='what does jerry like',  
    search_filters={
        "filters": [
          {
            "key": "type",
            "value": "test",
            "operator": "=="
          },
        ],
    }
)
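
The `search_filters` argument is a plain dictionary, so it can be generated rather than written out by hand. A minimal sketch that reproduces the shape used above for equality filters follows. The helper `build_search_filters` is hypothetical and only covers the `==` operator shown in this tutorial.

```python
def build_search_filters(**equals):
    """Build a search_filters dict with one "==" filter per keyword argument."""
    return {
        "filters": [
            {"key": key, "value": value, "operator": "=="}
            for key, value in equals.items()
        ]
    }
```

For example, `build_search_filters(type="test")` yields the same dictionary that is passed to `run_search` above.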

In [None]:
print(f"Returned {len(results.retrieval_nodes)} nodes")

for node in results.retrieval_nodes:
    print(node.node.text)
    # print("type", node.node.extra_info["type"])
    print("------")

print(results)

Note that the filter restricts results to documents whose `type` metadata is `test`, so nodes from the financial document are not returned even if their text is similar to the query.

#### Retrieval Through the Framework Integration

We can also define a retriever in the Python framework via our `LlamaCloudIndex`.

In [50]:
retriever = index.as_retriever(rerank_top_n=5)
nodes = retriever.retrieve("Purchases of marketable securities in 2020")
for n in nodes:
    print("-----")
    print(n.get_content())

-----
Summary of cash flows related to investing activities including **purchases** and proceeds from **marketable** and non-**marketable** **securities**, property acquisitions, and other investing activities.,
with the following table title:
Cash Flows from Investing Activities,
with the following columns:
- Activity: None
- **2020**: None
- 2019: None
- 2018: None

|**Purchases** of **marketable** **securities**| |(109,558)|(114,938)|(39,630)|
|---|---|---|---|---|
|Proceeds from maturities of **marketable** **securities**| |59,023|69,918|40,102|
|Proceeds from sales of **marketable** **securities**| |47,460|50,473|56,988|
|Payments for acquisition of property, plant and equipment| |(11,085)|(7,309)|(10,495)|
|Payments made in connection with business acquisitions, net| |(33)|(1,524)|(624)|
|**Purchases** of non-**marketable** **securities**| |(131)|(210)|(1,001)|
|Proceeds from non-**marketable** **securities**| |387|92|1,634|
|Other| |(608)|(791)|(1,078)|
|Cash generated by/(used 

#### E2E RAG Pipeline

You can just as easily define an end-to-end RAG setup, using an LLM of your choice.

In [28]:
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")
query_engine = index.as_query_engine(rerank_top_n=5, llm=llm)
response = query_engine.query("federal deferred tax in 2019-2021")
print(str(response))

The federal deferred tax amounts for the years 2019 to 2021 are as follows:
- 2019: ($2,939)
- 2020: ($3,619)
- 2021: ($7,176)


In [29]:
response = query_engine.query("What does Jerry like?")
print(str(response))

Jerry likes apples.
