# LlamaParse + Cortex Search

This notebook shows how to parse a complex report with LlamaParse, load the parsed output into Snowflake, and build a RAG pipeline with Cortex Search on top of that data.

## Set key and import libraries
For this example, you will need a LlamaCloud API key and a Snowflake account. Get a LlamaCloud API key by following these [instructions](https://docs.cloud.llamaindex.ai/api_key), and sign up for Snowflake [here](https://signup.snowflake.com/).

In [None]:
import os
import nest_asyncio
nest_asyncio.apply()

# llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-..."

# snowflake
os.environ["SNOWFLAKE_ACCOUNT"] = "..." # note: "_" can cause problems with the connection, use "-" instead
os.environ["SNOWFLAKE_USER"] = "..."
os.environ["SNOWFLAKE_PASSWORD"] = "..."
os.environ["SNOWFLAKE_ROLE"] = "..."
os.environ["SNOWFLAKE_WAREHOUSE"] = "..."
os.environ["SNOWFLAKE_DATABASE"] = "SEC_10KS" # note: make sure to use a database that already exists
os.environ["SNOWFLAKE_SCHEMA"] = "PUBLIC"
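
Before opening a connection, it can help to confirm that every setting above has been filled in and that the account identifier follows the hyphen advice in the comment. The sketch below is one possible way to do that. `REQUIRED_SETTINGS`, `missing_settings`, and `normalize_account` are hypothetical helper names introduced here for illustration; they are not part of any Snowflake or LlamaIndex library.

```python
import os

# Names of the environment variables set in the cell above.
REQUIRED_SETTINGS = [
    "LLAMA_CLOUD_API_KEY",
    "SNOWFLAKE_ACCOUNT",
    "SNOWFLAKE_USER",
    "SNOWFLAKE_PASSWORD",
    "SNOWFLAKE_ROLE",
    "SNOWFLAKE_WAREHOUSE",
    "SNOWFLAKE_DATABASE",
    "SNOWFLAKE_SCHEMA",
]


def is_placeholder(value):
    """True if the value is empty or still ends with the '...' placeholder."""
    value = value.strip()
    return value == "" or value.endswith("...")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are unset or still placeholders."""
    return [name for name in required if is_placeholder(env.get(name, ""))]


def normalize_account(account):
    """Replace underscores with hyphens, per the note in the cell above."""
    return account.replace("_", "-")


# Report anything that still needs a real value.
print(missing_settings(os.environ))
```

An empty list means every setting has a non-placeholder value; it does not prove the credentials are valid.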

## Load Data

Use Snowflake's latest 10K or your favorite PDF.

To use Snowflake's latest 10K, download it from [here](https://d18rn0p25nwr6d.cloudfront.net/CIK-0001640147/663fb935-b123-4bbb-8827-905bcbb8953c.pdf) and save it as `snowflake_2025_10k.pdf` in the notebook's working directory.

## Parse PDF

In [2]:
from llama_cloud_services import LlamaParse

parser = LlamaParse(
    num_workers=4,       # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",       # optionally define a language, default=en
)

# sync
result = parser.parse("./snowflake_2025_10k.pdf")

Started parsing the file under job_id c03cc793-7ead-4791-a48d-edf9e0a2dd70


In [3]:
# get the llama-index markdown documents
markdown_documents = result.get_markdown_documents(split_by_page=False)

## Convert LlamaIndex Documents to Dataframe

In [4]:
import pandas as pd

def documents_to_dataframe(documents):
    rows = []
    for doc in documents:
        row = {}
        # Store document ID
        row["ID"] = doc.id_
        # Add all metadata items as separate columns
        for key, value in doc.metadata.items():
            row[key] = value
        # Get text from the text_resource attribute, if available
        row["text"] = getattr(doc.text_resource, "text", None)
        rows.append(row)
    return pd.DataFrame(rows)
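
To see what each row looks like before it reaches pandas, the flattening step can be tried on stand-in objects. This is a minimal sketch: `document_to_row` is a hypothetical helper that mirrors the loop body above, and the `SimpleNamespace` objects only imitate the three attributes the function reads (`id_`, `metadata`, `text_resource`). They are not real LlamaIndex documents.

```python
from types import SimpleNamespace


def document_to_row(doc):
    """Flatten one document into a dict, mirroring the loop body above."""
    row = {"ID": doc.id_}
    # Each metadata key becomes its own column.
    row.update(doc.metadata)
    # Falls back to None when text_resource is missing or has no text.
    row["text"] = getattr(doc.text_resource, "text", None)
    return row


# Stand-in objects with the same attributes the function reads.
fake_doc = SimpleNamespace(
    id_="doc-1",
    metadata={"file_name": "./example.pdf"},
    text_resource=SimpleNamespace(text="# Title\n\nBody"),
)
empty_doc = SimpleNamespace(id_="doc-2", metadata={}, text_resource=None)

rows = [document_to_row(fake_doc), document_to_row(empty_doc)]
print(rows)
```

When documents carry different metadata keys, the resulting DataFrame has a column for every key seen, with missing values where a document lacks that key.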

In [5]:
documents_df = documents_to_dataframe(markdown_documents)

In [6]:
documents_df

| | ID | file_name | text |
|---|---|---|---|
| 0 | 205e6e96-5ff2-4151-ab53-3d37f45af844 | ./snowflake_2025_10k.pdf | # Table of Contents\n\n# UNITED STATES\n\n# SE... |


## Write DataFrame to Snowflake table

In [7]:
from snowflake.snowpark import Session

# Create Snowpark session
connection_parameters = {
    "account": os.getenv("SNOWFLAKE_ACCOUNT"),
    "user": os.getenv("SNOWFLAKE_USER"),
    "password": os.getenv("SNOWFLAKE_PASSWORD"),
    "role": os.getenv("SNOWFLAKE_ROLE"),
    "warehouse": os.getenv("SNOWFLAKE_WAREHOUSE"),
    "database": os.getenv("SNOWFLAKE_DATABASE"),
    "schema": os.getenv("SNOWFLAKE_SCHEMA"),
}

session = Session.builder.configs(connection_parameters).create()



In [8]:
# convert to Snowpark DataFrame
snowpark_df = session.create_dataframe(documents_df)

In [None]:
# Write Snowpark DataFrame to a Snowflake table
# Use 'overwrite' to replace table or 'append' to add to existing table
snowpark_df.write.mode("overwrite").save_as_table("snowflake_10k")

## Split the text

In [42]:
split_text_sql = """
CREATE OR REPLACE TABLE SNOWFLAKE_10K_MARKDOWN_CHUNKS AS
SELECT
    ID,
    "file_name" as FILE_NAME,
    c.value::string as TEXT
FROM
    SNOWFLAKE_10K,
    LATERAL FLATTEN(input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER(
        "text",
        'markdown',
        512,
        128
    )) c;
"""
session.sql(split_text_sql).collect()

[Row(status='Table SNOWFLAKE_10K_MARKDOWN_CHUNKS successfully created.')]
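
The last two arguments in the query above are the chunk size (512 characters) and the overlap between consecutive chunks (128 characters). The sketch below illustrates what overlap means using a plain sliding window. It is a deliberate simplification: the Cortex function splits recursively on format-aware separators, which this code does not attempt, and `chunk_with_overlap` is a hypothetical name used only for illustration.

```python
def chunk_with_overlap(text, chunk_size=512, overlap=128):
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(str(i % 10) for i in range(1000))
chunks = chunk_with_overlap(sample)
print(len(chunks), [len(c) for c in chunks])
```

With these settings each window advances by 384 characters, so the tail of one chunk reappears at the head of the next. That repetition helps a retrieved chunk keep sentences that would otherwise be cut at a boundary.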

## Create Cortex Search Service

In [45]:
create_search_service_sql = """
CREATE OR REPLACE CORTEX SEARCH SERVICE SNOWFLAKE_10K_SEARCH_SERVICE
  ON TEXT
  ATTRIBUTES ID, FILE_NAME
  WAREHOUSE = S
  TARGET_LAG = '1 hour'
AS (
  SELECT
    ID,
    FILE_NAME,
    TEXT
  FROM SEC_10KS.PUBLIC.SNOWFLAKE_10K_MARKDOWN_CHUNKS
);
"""
session.sql(create_search_service_sql).collect()

[Row(status='Cortex search service SNOWFLAKE_10K_SEARCH_SERVICE successfully created.')]
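
The statement above hard-codes the warehouse (`S`) and the fully qualified source table. One way to make it reusable across accounts is to build the SQL from parameters, as sketched below. `build_search_service_sql` is a hypothetical helper, and it interpolates identifiers directly into the string, so it should only be given trusted values.

```python
def build_search_service_sql(
    service,
    warehouse,
    source_table,
    search_column="TEXT",
    attributes=("ID", "FILE_NAME"),
    target_lag="1 hour",
):
    """Return a CREATE CORTEX SEARCH SERVICE statement shaped like the one above."""
    attribute_list = ", ".join(attributes)
    select_list = ", ".join([*attributes, search_column])
    return (
        f"CREATE OR REPLACE CORTEX SEARCH SERVICE {service}\n"
        f"  ON {search_column}\n"
        f"  ATTRIBUTES {attribute_list}\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  TARGET_LAG = '{target_lag}'\n"
        f"AS (\n"
        f"  SELECT {select_list}\n"
        f"  FROM {source_table}\n"
        f");"
    )


sql = build_search_service_sql(
    service="SNOWFLAKE_10K_SEARCH_SERVICE",
    warehouse="MY_WH",  # placeholder warehouse name
    source_table="SEC_10KS.PUBLIC.SNOWFLAKE_10K_MARKDOWN_CHUNKS",
)
print(sql)
```

The returned string can be passed to `session.sql(...).collect()` in the same way as `create_search_service_sql`.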

In [46]:
import os
from snowflake.core import Root
from typing import List
from snowflake.snowpark.session import Session

class CortexSearchRetriever:

    def __init__(self, snowpark_session: Session, limit_to_retrieve: int = 4):
        self._snowpark_session = snowpark_session
        self._limit_to_retrieve = limit_to_retrieve

    def retrieve(self, query: str) -> List[str]:
        # Use the session passed to the constructor rather than the notebook-level global
        root = Root(self._snowpark_session)

        search_service = (root
          .databases["SEC_10KS"]
          .schemas["PUBLIC"]
          .cortex_search_services["SNOWFLAKE_10K_SEARCH_SERVICE"]
        )
        resp = search_service.search(
          query=query,
          columns=["text"],
          limit=self._limit_to_retrieve
        )

        if resp.results:
            return [curr["text"] for curr in resp.results]
        else:
            return []
        
retriever = CortexSearchRetriever(snowpark_session=session, limit_to_retrieve=5)
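
Because the service was created with `ID` and `FILE_NAME` as attributes, a search can in principle be restricted to a single file. The sketch below only builds the filter object. `eq_filter` is a hypothetical helper, and the `filter` argument with an `@eq` operator is how the Cortex Search documentation describes attribute filtering as far as I know, so verify it against the current docs before relying on it.

```python
def eq_filter(column, value):
    """Build an equality filter on one attribute column."""
    return {"@eq": {column: value}}


# The file_name value matches the one stored in the DataFrame above.
file_filter = eq_filter("FILE_NAME", "./snowflake_2025_10k.pdf")
print(file_filter)

# Assumed usage inside CortexSearchRetriever.retrieve (not executed here):
# resp = search_service.search(
#     query=query,
#     columns=["text"],
#     filter=file_filter,
#     limit=self._limit_to_retrieve,
# )
```

With only one report loaded the filter changes nothing, but it becomes useful once several filings share the same table.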

In [68]:
retrieved_context = retriever.retrieve("What was the total revenue (in billions) for Snowflake in FY 2024? How much of that was product revenue?")

retrieved_context

['(1) For the fiscal years ended January 31, 2025, 2024, and 2023, respectively, approximately 65%, 67%, and 71% of cost of product revenue represented third-party cloud infrastructure expenses incurred in connection with the customers’ use of the Snowflake platform and the deployment and maintenance of the platform on public clouds, including different regional deployments.',
 'Our revenue was $3.6 billion, $2.8 billion, and $2.1 billion for the fiscal years ended January 31, 2025, 2024, and 2023, respectively. As a result of our historical rapid growth, limited operating history, large number of new product features, including those incorporating artificial intelligence and machine learning technology (AI Technology), and unstable macroeconomic conditions, our ability to accurately forecast our future results of operations, including revenue, gross margin, remaining performance',
 'reflecting these adjustments. Our platform has been adopted by many of the world’s largest organization

## Create a RAG

In [57]:
from snowflake.cortex import complete

class RAG:

    def __init__(self, session):
        self.session = session
        self.retriever = CortexSearchRetriever(snowpark_session=self.session, limit_to_retrieve=10)

    def retrieve_context(self, query: str) -> list:
        """
        Retrieve relevant text chunks from the Cortex Search service.
        """
        return self.retriever.retrieve(query)

    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        prompt = f"""
          You are an expert assistant extracting information from context provided.
          Answer the question concisely, yet completely. Only use the information provided.
          Context: {context_str}
          Question:
          {query}
          Answer:
        """
        # Use the session stored on the instance rather than the notebook-level global
        response = complete("claude-4-sonnet", prompt, session=self.session)
        return response

    def query(self, query: str) -> str:
        context_str = self.retrieve_context(query)
        return self.generate_completion(query, context_str)


rag = RAG(session)
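
In `generate_completion`, `context_str` is a Python list, so the f-string inserts its `repr` (brackets, quotes, and escaped newlines) into the prompt. A possible refinement is to join the chunks into numbered passages first. `format_context` below is a hypothetical helper sketched for illustration, not part of the original class.

```python
def format_context(chunks):
    """Join retrieved chunks into numbered passages separated by blank lines."""
    return "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )


example_chunks = [
    "Our revenue was $3.6 billion, $2.8 billion, and $2.1 billion ... ",
    "  Another retrieved passage.",
]
print(format_context(example_chunks))
```

To use it, the prompt could interpolate `format_context(context_str)` in place of `context_str`. The numbering also gives the model a simple way to refer to individual passages.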

In [66]:
response = rag.query("What was the total revenue (in billions) for Snowflake in FY 2024? How much of that was product revenue?")

In [67]:
from IPython.display import Markdown

display(Markdown(response))

Based on the context provided, Snowflake's total revenue for fiscal year 2024 (ended January 31, 2024) was **$2.8 billion**.

For the product revenue component, the context shows that product revenue represented **95%** of total revenue in FY 2024. Therefore, product revenue was approximately **$2.66 billion** ($2.8 billion × 95%).

The remaining 5% ($0.14 billion) came from professional services and other revenue.
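
The arithmetic in the answer can be checked in a few lines. This sketch takes the model's figures ($2.8 billion total, 95% product share) at face value; it confirms only that the numbers are internally consistent, not that they match the exact amounts reported in the filing.

```python
total_revenue_b = 2.8   # total FY 2024 revenue in billions, from the answer above
product_share = 0.95    # product revenue share, from the answer above

# Round to two decimals to match the precision used in the answer.
product_revenue_b = round(total_revenue_b * product_share, 2)
other_revenue_b = round(total_revenue_b - product_revenue_b, 2)

print(product_revenue_b, other_revenue_b)
```

Since the product figure is derived from a rounded total and a rounded percentage, treat it as an approximation.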