# Snowflake Production-ready RAG Quickstart

In this quickstart, we'll show how to build a RAG with the full snowflake stack including Cortex LLM Functions, Cortex Search, and TruLens observability.

In addition, we'll show how to run TruLens feedback functions with Cortex as the backend, and how to log TruLens traces and evaluation metrics to a Snowflake table.

Last, we'll show how to use TruLens guardrails for filtering retrieved context and reducing hallucination.

## Setup

First, we'll install the packages needed

In [None]:
# pip install snowflake-snowpark-python
# pip install notebook
# pip install snowflake-ml-python
# pip install trulens-eval
# pip install snowflake-sqlalchemy
# pip install llama-index
# pip install llama-index-readers-github
# pip install llama-index-embeddings-huggingface

Then we can load our credentials and set our Snowflake connection

In [1]:
from dotenv import load_dotenv

load_dotenv()

# import necessary packages
from snowflake.snowpark.session import Session


import os
from dotenv import load_dotenv

load_dotenv()

connection_details = {
    'account':  os.environ["SNOWFLAKE_ACCOUNT"],
    'user': os.environ["SNOWFLAKE_USER"],
    'password': os.environ["SNOWFLAKE_USER_PASSWORD"],
    'role': os.environ["SNOWFLAKE_ROLE"],
    'database': os.environ["SNOWFLAKE_DATABASE"],
    'schema': os.environ["SNOWFLAKE_SCHEMA"],
    'warehouse': os.environ["SNOWFLAKE_WAREHOUSE"]
}

session = Session.builder.configs(connection_details).create()

## Using Cortex Complete

With the session set, we have what need to call a Snowflake Cortex LLM:

In [2]:
from snowflake.cortex import Complete

text = """
    The Snowflake company was co-founded by Thierry Cruanes, Marcin Zukowski,
    and Benoit Dageville in 2012 and is headquartered in Bozeman, Montana.
"""

print(Complete("mistral-large", "how do snowflakes get their unique patterns?"))

Complete() is experimental since 1.0.12. Do not use it in production. 


 Snowflakes get their unique patterns through a complex process that involves both physics and chemistry. It all starts with a tiny particle in the atmosphere, like a dust or pollen grain, which serves as a nucleus for the snowflake to form around.

As this particle cools, water vapor in the air begins to condense and freeze onto it, forming an ice crystal. The shape of this initial crystal is determined by the arrangement of water molecules, which naturally form a hexagonal structure due to the hydrogen bonds between them.

As the ice crystal falls through the atmosphere, it encounters different temperatures and levels of humidity. These varying conditions cause the crystal to grow in a unique way, with intricate branches and patterns forming as more water molecules attach to it.

The six-sided symmetry of snowflakes comes from the hexagonal structure of water molecules. However, the specific pattern of each snowflake is influenced by the exact path it takes through the atmosphere and

## Cortex Search

Next, we'll turn to the retrieval component of our RAG and set up Cortex Search.

This requires three steps:

1. Read and preprocess unstructured documents.
2. Embed the cleaned documents with Arctic Embed.
3. Call the Cortex search service.

### Read and preprocess unstructured documents

For this example, we want to load Cortex Search with documentation from Github about a popular open-source library, Streamlit. To do so, we'll use a GitHub data loader available from LlamaHub.

Here we'll also expend some effort to clean up the text so we can get better search results.

In [None]:
import nest_asyncio
nest_asyncio.apply()
from llama_index.readers.github import GithubRepositoryReader, GithubClient

github_token = os.environ["GITHUB_TOKEN"]
client = github_client = GithubClient(github_token=github_token, verbose=False)

reader = GithubRepositoryReader(
    github_client=github_client,
    owner="streamlit",
    repo="docs",
    use_parser=False,
    verbose=True,
    filter_directories=(
        ["content"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
    filter_file_extensions=(
        [".md"],
        GithubRepositoryReader.FilterType.INCLUDE,
    )
)

documents = reader.load_data(branch="main")

import re

def clean_up_text(content: str) -> str:
    """
    Remove unwanted characters and patterns in text input.

    :param content: Text input.
    
    :return: Cleaned version of original text input.
    """

    # Fix hyphenated words broken by newline
    content = re.sub(r'(\w+)-\n(\w+)', r'\1\2', content)

    unwanted_patterns = ['---\nvisible: false','---', '#','slug:']
    for pattern in unwanted_patterns:
        content = re.sub(pattern, "", content)

    # Remove all slugs starting with a \ and stopping at the first space
    content = re.sub(r'\\slug: [^\s]*', '', content)

    # normalize whitespace
    content = re.sub(r'\s+', ' ', content)
    return content

cleaned_documents = []

for d in documents:
    cleaned_text = clean_up_text(d.text)
    d.text = cleaned_text
    cleaned_documents.append(d)

### Embed the preprocessed documents

We'll use Snowflake's Arctic Embed model available from HuggingFace to embed the documents. We'll also use Llama-Index's `SemanticSplitterNodeParser` for processing.

In [None]:

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SemanticSplitterNodeParser

embed_model = HuggingFaceEmbedding("Snowflake/snowflake-arctic-embed-m")

splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=85, embed_model=embed_model
)    

With the embed model and splitter, we can execute them in an ingestion pipeline

In [None]:
from llama_index.core.ingestion import IngestionPipeline

cortex_search_pipeline = IngestionPipeline(
    transformations=[
        splitter,
    ],
)

results = cortex_search_pipeline.run(show_progress=True, documents=cleaned_documents)

import numpy as np

print(f"Roughly the proportion of chunks that are bigger than 512 tokens (approx 385 English words): {np.mean([len(curr.text.split()) > 385 for curr in res])}")

### Load data to Cortex Search

Now that we've embedded our documents, we're ready to load them to Cortex Search.

Here we can use the same connection details as we set up for Cortex Complete.

In [None]:
import os
import snowflake.connector
from tqdm.auto import tqdm

conn = snowflake.connector.connect(
    user=connection_details["user"],
    password=connection_details["password"],
    account=connection_details["account"],
    warehouse=connection_details["warehouse"],
    database=connection_details["database"],
    schema=connection_details["schema"]
)

conn.cursor().execute("CREATE OR REPLACE TABLE streamlit_docs(doc_text VARCHAR)")
for curr in tqdm(result):
    conn.cursor().execute("INSERT INTO streamlit_docs VALUES (%s)", curr.text)

### Call the Cortex Search Service

Here we'll create a CortexSearchRetreiver class to connect to our cortex search service and add the `retrieve` method that we can leverage for calling it.

In [3]:
import os
from snowflake.core import Root

class CortexSearchRetriever:

    def __init__(self, session = session, limit_to_retrieve: int = 4):
        self.session = session
        self._limit_to_retrieve = limit_to_retrieve
    
    def retrieve(self, query: str):
        root = Root(self.session)
        cortex_search_service = root.databases[
                os.environ["SNOWFLAKE_DATABASE"]].schemas[
                    os.environ["SNOWFLAKE_SCHEMA"]].cortex_search_services[
                        os.environ["SNOWFLAKE_CORTEX_SEARCH_SERVICE"]]
        resp = cortex_search_service.search(
                query=query,
                columns=["doc_text"],
                limit=self._limit_to_retrieve,
            )
        if resp.results:
            return [curr["doc_text"] for curr in resp.results]
        session.close()

In [4]:
retriever = CortexSearchRetriever(session=session, limit_to_retrieve=4)

retrieved_context = retriever.retrieve(query="How do I launch a streamlit app?")

len(retrieved_context)

4

## Create a RAG with built-in observability

Now that we've set up the components we need from Snowflake Cortex, we can build our RAG.

We'll do this by creating a custom python class with each the methods we need. We'll also add TruLens instrumentation with the `@instrument` decorator to our app.

The first thing we need to do however, is to set the database connection where we'll log the traces and evaluation results from our application. This way we have a stored record that we can use to understand the app's performance. This is done when initializing `Tru`.

In [25]:
from trulens_eval import Tru


db_url = "snowflake://{user}:{password}@{account}/{dbname}/{schema}?warehouse={warehouse}&role={role}".format(
    user=os.environ["SNOWFLAKE_USER"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    password=os.environ["SNOWFLAKE_USER_PASSWORD"],
    dbname=os.environ["SNOWFLAKE_DATABASE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    role=os.environ["SNOWFLAKE_ROLE"],
)

tru = Tru(database_url=db_url)

Now we can construct the RAG.

In [26]:
from trulens_eval.tru_custom_app import instrument

class RAG_from_scratch:

    def __init__(self):
        self.retriever = CortexSearchRetriever(session=session, limit_to_retrieve=4)
    @instrument
    def retrieve_context(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = self.retriever.retrieve(query)
        return results

    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = Complete("mistral-large",query)
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve_context(query)
        completion = self.generate_completion(query, context_str)
        return completion

rag = RAG_from_scratch()

In [27]:
from trulens_eval.feedback.provider.cortex import Cortex
from trulens_eval.feedback import Feedback
from trulens_eval import Select
import numpy as np

provider = Cortex("mistral-large")

f_groundedness = (
    Feedback(
    provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(Select.RecordCalls.retrieve_context.rets[:].collect())
    .on_output()
)

f_context_relevance = (
    Feedback(
    provider.context_relevance,
    name="Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve_context.rets[:])
    .aggregate(np.mean)
)

f_answer_relevance = (
    Feedback(
    provider.relevance,
    name="Answer Relevance")
    .on_input()
    .on_output()
    .aggregate(np.mean)
)

feedbacks = [f_context_relevance,
            f_answer_relevance,
            f_groundedness,
        ]

✅ In Groundedness, input source will be set to __record__.app.retrieve_context.rets[:].collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.retrieve_context.rets[:] .
✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .


In [28]:
from trulens_eval import TruCustomApp
tru_rag = TruCustomApp(rag,
    app_id = 'RAG v1',
    feedbacks = [f_groundedness, f_answer_relevance, f_context_relevance])

In [39]:
prompts = [
    "How do I launch a streamlit app?",
    "How can I capture the state of my session in streamlit?",
    "How do I install streamlit?",
    "How do I change the background color of a streamlit app?",
    "What's the advantage of using a streamlit form?",
    "What are some ways I should use checkboxes?",
    "How can I conserve space and hide away content?",
    "Can you recommend some resources for learning Streamlit?",
    "What are some common use cases for Streamlit?",
    "How can I deploy a streamlit app on the cloud?",
    "How do I add a logo to streamlit?",
    "What is the best way to deploy a Streamlit app?",
    "How should I use a streamlit toggle?",
    "How do I add new pages to my streamlit app?",
    "How do I write a dataframe to display in my dashboard?",
    "Can I plot a map in streamlit? If so, how?",
    "How do vector stores enable efficient similarity search?",
    "How do I prevent my child from using the internet?",
    "What should I pack for a camping trip this weekend?",
    "How do I defend myself against bear attacks?"
]

In [40]:
with tru_rag as recording:
    for prompt in prompts:
        rag.query(prompt)

In [30]:
from trulens_eval import Tru
tru = Tru()
tru.run_dashboard(port=1235)

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
Dashboard already running at path:   Network URL: http://192.168.4.206:1235



<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

In [31]:
tru.get_leaderboard()

Unnamed: 0_level_0,Answer Relevance,Context Relevance,Groundedness,latency,total_cost
app_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RAG v1,1.0,0.75,0.815385,6.0,0.0


In [32]:
last_record = recording.records[-1]

from trulens_eval.utils.display import get_feedback_result
get_feedback_result(last_record, 'Context Relevance')

Unnamed: 0,question,context,ret
0,How do I launch a streamlit dashboard?,Your environment's name will appear in parenth...,0.8
1,How do I launch a streamlit dashboard?,![Watch your app launch](/images/streamlit-com...,0.4
2,How do I launch a streamlit dashboard?,The first step is to create a new Python scrip...,0.9
3,How do I launch a streamlit dashboard?,title: Basic concepts of Streamlit /get-start...,0.9


## Use Guardrails

In addition to making informed iteration, we can also directly use feedback results as guardrails at inference time. In particular, here we show how to use the context relevance score as a guardrail to filter out irrelevant context before it gets passed to the LLM. This both reduces hallucination and improves efficiency.

To do so, we'll rebuild our RAG using the @context-filter decorator on the method we want to filter, and pass in the feedback function and threshold to use for guardrailing.

In [33]:
# note: feedback function used for guardrail must only return a score, not also reasons
f_context_relevance_score = (
    Feedback(provider.context_relevance, name = "Context Relevance")
)

from trulens_eval.guardrails.base import context_filter

# note: feedback function used for guardrail must only return a score, not also reasons
f_context_relevance_score = (
    Feedback(provider.context_relevance, name = "Context Relevance")
    .on_input()
    .on(Select.RecordCalls.retrieve.rets)
)

from trulens_eval.guardrails.base import context_filter

class filtered_RAG_from_scratch:

    def __init__(self):
        self.retriever = CortexSearchRetriever(session=session, limit_to_retrieve=4)
    @instrument
    @context_filter(f_context_relevance_score, 0.75, keyword_for_prompt="query")
    def retrieve_context(self, query: str) -> list:
        """
        Retrieve relevant text from vector store.
        """
        results = self.retriever.retrieve(query)
        return results

    @instrument
    def generate_completion(self, query: str, context_str: list) -> str:
        """
        Generate answer from context.
        """
        completion = Complete("mistral-large",query)
        return completion

    @instrument
    def query(self, query: str) -> str:
        context_str = self.retrieve_context(query=query)
        completion = self.generate_completion(query=query, context_str=context_str)
        return completion

filtered_rag = filtered_RAG_from_scratch()


✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input context will be set to __record__.app.retrieve.rets .


In [34]:
from trulens_eval import TruCustomApp
filtered_tru_rag = TruCustomApp(filtered_rag,
    app_id = 'RAG v2',
    feedbacks = [f_groundedness, f_answer_relevance, f_context_relevance])

In [41]:
with filtered_tru_rag as recording:
    for prompt in prompts:
        filtered_rag.query(prompt)



In [42]:
tru.get_leaderboard()

Unnamed: 0_level_0,Context Relevance,Groundedness,Answer Relevance,latency,total_cost
app_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
RAG v2,0.854167,0.571732,1.0,23.285714,0.0
RAG v1,0.478571,0.491251,1.0,23.285714,0.0
