# Inline Citations with LlamaCloud
This notebook shows how to generate answers with inline citations using LlamaCloud.

## Setup

Install the core packages and download the source files. You will need to upload these documents to LlamaCloud.

In [None]:
!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-index-question-gen-openai
!pip install llama-index-postprocessor-flag-embedding-reranker
!pip install git+https://github.com/FlagOpen/FlagEmbedding.git
!pip install llama-parse

In [None]:
# create the target directory first; wget -O does not create missing directories
!mkdir -p data

# download Apple
!wget "https://s2.q4cdn.com/470004039/files/doc_earnings/2023/q4/filing/_10-K-Q4-2023-As-Filed.pdf" -O data/apple_2023.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2022/q4/_10-K-2022-(As-Filed).pdf" -O data/apple_2022.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2021/q4/_10-K-2021-(As-Filed).pdf" -O data/apple_2021.pdf
!wget "https://s2.q4cdn.com/470004039/files/doc_financials/2020/ar/_10-K-2020-(As-Filed).pdf" -O data/apple_2020.pdf
!wget "https://www.dropbox.com/scl/fi/i6vk884ggtq382mu3whfz/apple_2019_10k.pdf?rlkey=eudxh3muxh7kop43ov4bgaj5i&dl=1" -O data/apple_2019.pdf

# download Tesla
!wget "https://ir.tesla.com/_flysystem/s3/sec/000162828024002390/tsla-20231231-gen.pdf" -O data/tesla_2023.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000095017023001409/tsla-20221231-gen.pdf" -O data/tesla_2022.pdf
!wget "https://www.dropbox.com/scl/fi/ptk83fmye7lqr7pz9r6dm/tesla_2021_10k.pdf?rlkey=24kxixeajbw9nru1sd6tg3bye&dl=1" -O data/tesla_2021.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000156459021004599/tsla-10k_20201231-gen.pdf" -O data/tesla_2020.pdf
!wget "https://ir.tesla.com/_flysystem/s3/sec/000156459020004475/tsla-10k_20191231-gen_0.pdf" -O data/tesla_2019.pdf

Configure async support and the API keys for LlamaCloud and OpenAI. The OpenAI LLM is used for response synthesis.

In [51]:
# llama-parse is async-first; running async code in a notebook requires nest_asyncio
import nest_asyncio
nest_asyncio.apply()

In [52]:
import os
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = ""

In [53]:
# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = ""
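The two cells above leave the keys as empty strings, which will only fail later, at the first API call. As a minimal sketch, a small guard can surface a missing key right away. The helper name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, failing early if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; fill it in before running the rest of the notebook"
        )
    return value
```

Calling `require_env("LLAMA_CLOUD_API_KEY")` and `require_env("OPENAI_API_KEY")` right after setting the keys would stop the notebook before any remote call is attempted.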

## Load Documents into LlamaCloud

The first step is to download the five Apple and five Tesla 10-K filings (2019 through 2023) and upload them to LlamaCloud.

You can do this by creating a pipeline and uploading the documents via the "Files" mode.

After this is done, proceed to the next section.
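Before uploading, it can help to confirm that all ten PDFs actually arrived, since a failed download may leave a missing or empty file behind. The sketch below assumes the `data/` directory and the `<company>_<year>.pdf` naming used by the download commands in the Setup section; the function name `missing_filings` is hypothetical.

```python
from pathlib import Path
from typing import List

# File names follow the wget commands in the Setup section.
COMPANIES = ("apple", "tesla")
YEARS = range(2019, 2024)  # 2019 through 2023


def missing_filings(data_dir: str = "data") -> List[str]:
    """Return the expected 10-K PDFs that are absent or empty in data_dir."""
    missing = []
    for company in COMPANIES:
        for year in YEARS:
            path = Path(data_dir) / f"{company}_{year}.pdf"
            if not path.is_file() or path.stat().st_size == 0:
                missing.append(path.name)
    return missing
```

An empty list from `missing_filings()` suggests the downloads completed; it does not verify that the files are valid PDFs.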

## Define NodeCitationProcessor
Add each node's ID to its metadata so that the citation links in the response can be matched back to the source nodes.

In [91]:
from typing import List, Optional

from llama_index.core import QueryBundle
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore


class NodeCitationProcessor(BaseNodePostprocessor):
    """
    Copy each node's node_id into its metadata for citation purposes.
    In this notebook, SYSTEM_CITATION_PROMPT is passed to the LLM as its system prompt to enable this feature.
    """

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        for node_score in nodes:
            node_score.node.metadata["node_id"] = node_score.node.node_id
        return nodes
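The postprocessor matters because the LLM can only cite an ID that appears in the text it receives for each node. The sketch below illustrates the idea with plain-Python stand-ins: `FakeNode`, `add_node_ids`, and `render_for_llm` are hypothetical and are not LlamaIndex classes or functions, and the "key: value" rendering is an assumption about how metadata is typically placed ahead of the node text.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeNode:
    """Hypothetical stand-in for a retrieved node; not a LlamaIndex class."""
    node_id: str
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def add_node_ids(nodes: List[FakeNode]) -> List[FakeNode]:
    """Mirror the postprocessor: copy each node's ID into its metadata."""
    for node in nodes:
        node.metadata["node_id"] = node.node_id
    return nodes


def render_for_llm(node: FakeNode) -> str:
    """Hypothetical rendering: metadata as 'key: value' lines, then the text."""
    header = "\n".join(f"{key}: {value}" for key, value in node.metadata.items())
    return f"{header}\n\n{node.text}" if header else node.text


nodes = add_node_ids(
    [FakeNode("xyz", "A baby llama is called a cria.", {"file_name": "llama.pdf"})]
)
print(render_for_llm(nodes[0]))
```

With the ID copied into the metadata, the rendered text carries a `node_id: xyz` line that the model can echo back in a citation tag.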

## Define System Citation Prompt
Define a system prompt that instructs the LLM to add a citation link, based on the node metadata, to each statement it draws from the retrieved nodes.

In [92]:
SYSTEM_CITATION_PROMPT = """You have been provided information from a knowledge base, passed to you as nodes of information.
Each node has useful metadata such as node ID, file name, page, etc.
Please add a citation to the data node for each sentence or paragraph that you reference from the provided information.
The citation format is: [citation:<node_id>]()
Where <node_id> is the unique identifier of the data node.

Example:
We have two nodes:
  node_id: xyz
  file_name: llama.pdf
  
  node_id: abc
  file_name: animal.pdf

User question: Tell me a fun fact about Llama.
Your answer:
A baby llama is called a "cria" [citation:xyz]().
It often lives in the desert [citation:abc]().
It's a cute animal."""
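Because the model produces the citation tags as free text, it may occasionally emit an ID that does not correspond to any retrieved node. As a rough sketch, the tags in an answer can be extracted and compared against the known node IDs. The helper names `cited_ids` and `unknown_citations` are hypothetical; the pattern follows the `[citation:<node_id>]` format defined in the prompt above.

```python
import re
from typing import Iterable, List, Set

# Matches the [citation:<node_id>] tag described in the system prompt.
CITATION_PATTERN = re.compile(r"\[citation:([^\]]+)\]")


def cited_ids(answer: str) -> List[str]:
    """Return cited node IDs in order of first appearance, without duplicates."""
    seen: Set[str] = set()
    ordered: List[str] = []
    for match in CITATION_PATTERN.finditer(answer):
        node_id = match.group(1).strip()
        if node_id not in seen:
            seen.add(node_id)
            ordered.append(node_id)
    return ordered


def unknown_citations(answer: str, known_ids: Iterable[str]) -> List[str]:
    """Return cited IDs that do not match any retrieved node."""
    known = set(known_ids)
    return [node_id for node_id in cited_ids(answer) if node_id not in known]
```

A non-empty result from `unknown_citations` is a signal that the answer cites something the retriever never returned, which is worth logging before the links are rendered.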

## Define LlamaCloud Retriever over Documents

In this section we define a LlamaCloud retriever over the uploaded documents.

In [93]:
import os

from llama_index.indices.managed.llama_cloud import LlamaCloudIndex

index = LlamaCloudIndex(
    name="apple_tesla_demo_2",
    project_name="llamacloud_demo",
    api_key=os.environ["LLAMA_CLOUD_API_KEY"],
)

#### Define chunk retriever

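The chunk-level retriever runs a vector search and then reranks the results, keeping the top 5 chunks (`rerank_top_n=5`). The query engine built on it uses `gpt-4o-mini` with the citation system prompt and `tree_summarize` response synthesis.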

In [94]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.llms.openai import OpenAI

# Retrieve chunks and keep the top 5 after reranking
chunk_retriever = index.as_retriever(
    retrieval_mode="chunks",
    rerank_top_n=5,
)

# The citation instructions are passed to the LLM as its system prompt
llm = OpenAI(model="gpt-4o-mini", system_prompt=SYSTEM_CITATION_PROMPT)
query_engine_chunk = RetrieverQueryEngine.from_args(
    chunk_retriever,
    llm=llm,
    response_mode="tree_summarize",
    # Expose each node's ID in its metadata so the LLM can cite it
    node_postprocessors=[NodeCitationProcessor()],
)

## Generate final output matching citations with page labels
Given the cited source nodes, look up each node's page label and build the final URL.

In [95]:
import re

def process_citations_with_sources(response) -> str:
    content = str(response)
    source_nodes = response.source_nodes

    # Create a lookup: citation_id -> page_label
    id_to_label = {
        str(node.id_): node.metadata.get('page_label', 'unknown')
        for node in source_nodes
    }

    # Track citation order and assign human-friendly numbers
    citation_order = {}
    citation_counter = 1

    def replace(match):
        nonlocal citation_counter
        citation_id = match.group(1).strip()
        if citation_id not in citation_order:
            citation_order[citation_id] = citation_counter
            citation_counter += 1
        number = citation_order[citation_id]
        page_label = id_to_label.get(citation_id, 'unknown')
        return f"[{number}](https://fake.url/SampleFile#page={page_label})"

    # Replace complete citations
    citation_regex = re.compile(r'\[citation:([^\]]+)\]')
    content = citation_regex.sub(replace, content)

    # Remove incomplete/broken citation tags
    incomplete_regex = re.compile(r'\[citation:[^\]]*$')
    content = incomplete_regex.sub('', content)

    return content
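The prompt's citation format ends in an empty `()`, and the pattern above only consumes the bracketed tag, so the sample output further down shows a stray `()` after each rendered link. The following is a sketch of one possible variant that also consumes the empty parentheses. The function `render_citations` and its `base_url` parameter are hypothetical, and it takes the ID-to-page lookup as a plain dictionary so it can run without a query response object.

```python
import re
from typing import Dict

# Match a full citation tag plus the optional empty "()" from the prompt format.
CITATION_WITH_PARENS = re.compile(r"\[citation:([^\]]+)\](?:\(\))?")
# Match a tag that was cut off before its closing bracket at the end of the text.
TRUNCATED_CITATION = re.compile(r"\[citation:[^\]]*$")


def render_citations(content: str, id_to_label: Dict[str, str], base_url: str) -> str:
    """Replace citation tags with numbered links; base_url is a placeholder."""
    numbers: Dict[str, int] = {}

    def replace(match: re.Match) -> str:
        citation_id = match.group(1).strip()
        # Reuse the number for a repeated ID, otherwise assign the next one.
        number = numbers.setdefault(citation_id, len(numbers) + 1)
        page_label = id_to_label.get(citation_id, "unknown")
        return f"[{number}]({base_url}#page={page_label})"

    content = CITATION_WITH_PARENS.sub(replace, content)
    return TRUNCATED_CITATION.sub("", content)


text = "Risk A [citation:n1](). Risk B [citation:n2](). Cut off [citation:n3"
print(render_citations(text, {"n1": "19", "n2": "20"}, "https://fake.url/SampleFile"))
```

Repeated IDs keep the number they were first assigned, and a tag truncated at the end of the text is dropped rather than rendered as a broken link.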

## Query it

In [96]:
response = query_engine_chunk.query("What are the tiny risks for apple 2022")

In [97]:
content = process_citations_with_sources(response)
print(content)

Apple Inc. faces several risks that could impact its business and financial performance in 2022. These include:

1. **Foreign Currency Exchange Risks**: The company's financial performance is subject to risks associated with changes in the value of the U.S. dollar relative to local currencies. Fluctuations in foreign currency exchange rates can adversely affect gross margins on products sold internationally, potentially leading to reduced demand if international pricing is raised to offset currency strength [1](https://fake.url/SampleFile#page=19)().

2. **Credit Risk**: Apple is exposed to credit risk related to its trade accounts receivable and vendor non-trade receivables. This risk is heightened during economic downturns, especially since a significant portion of its trade receivables is not covered by collateral or credit insurance [1](https://fake.url/SampleFile#page=19)().

3. **Supply Chain Risks**: The company relies heavily on single-source suppliers for many components. Any 