# Build a Retrieval Augmented Generation (RAG) based LLM assistant using Streamlit and Snowflake Cortex Search


Be sure to add langchain to the notebook's packages using the packages dropdown in the top right.

### A quick note on chunks

Chunk size is the maximum number of characters that a chunk can contain.
Chunk overlap is the number of characters that should overlap between two adjacent chunks.

The chunk size and chunk overlap parameters can be used to control the granularity of the text splitting. A smaller chunk size will result in more chunks, while a larger chunk size will result in fewer chunks. A larger chunk overlap will result in more chunks sharing common characters, while a smaller chunk overlap will result in fewer chunks sharing common characters.
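
To make the effect of these two parameters concrete, here is a minimal sketch in plain Python. It is not how LangChain splits text; `chunk_with_overlap` is a hypothetical helper that simply cuts fixed-size windows, assuming each new window starts `chunk_size - chunk_overlap` characters after the previous one.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


text = "This is a piece of text."
no_overlap = chunk_with_overlap(text, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_with_overlap(text, chunk_size=10, chunk_overlap=5)

print(len(no_overlap), no_overlap)
print(len(with_overlap), with_overlap)
```

With the same chunk size, the overlapping version produces more chunks, and each chunk repeats the tail of the one before it.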

There are many different ways to split text. Some common methods include:

- Character-based splitting: This method divides the text into chunks based on individual characters.
- Word-based splitting: This method divides the text into chunks based on words.
- Sentence-based splitting: This method divides the text into chunks based on sentences.

We will use the RecursiveCharacterTextSplitter from the LangChain library, which splits text recursively. This means that it tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then individual characters) and keeps splitting until the chunks are small enough.
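
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical helper, the separator list is an assumption, and unlike the real splitter it does not merge small pieces back together or apply any overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, then recurse with finer
    separators on any piece that is still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, chunk_size, finer))
    return chunks


print(recursive_split("This is a piece of text.", 10))
print(recursive_split("abcdefghijklmnop", 5))
```

The first call never needs to cut inside a word because splitting on spaces is already enough. The second has no separators at all, so it falls through to fixed-size character cuts.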

An example using the LangChain splitter follows:

In [None]:
USE DATABASE RAG_CHAT;
USE SCHEMA DATA;

In [None]:
# Import the packages that we'll use later. If you get an error, make sure your package is added in the 'packages' dropdown above.
# I have seen scenarios where, after adding a package, it still says it's not available. If that happens, restart your notebook session.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = "This is a piece of text."

splitter = RecursiveCharacterTextSplitter(
    chunk_size = 10, #Adjust this as you see fit
    chunk_overlap  = 5, #This lets adjacent chunks share some text. Useful for keeping chunks contextual
    length_function = len
)

chunks = splitter.split_text(text)

for chunk in chunks:
    print(chunk)

## Organize Documents and Create Pre-Processing Function

Create a table function that splits the text of the PDF documents into chunks. We will use the LangChain Python library for the chunking; the text itself is read from the PDFs later with the Cortex PARSE_DOCUMENT function. Because LangChain is available in the integrated Anaconda repository as part of Snowpark Python, there are no manual installs and no Python environment or dependency management required.


In [None]:
-- Here we are going to create a function that we will call in Snowflake.

create or replace function text_chunker(pdf_text string)
returns table (chunk varchar)
language python
runtime_version = '3.9'
handler = 'text_chunker'
packages = ('snowflake-snowpark-python', 'langchain')
as
$$
from snowflake.snowpark.types import StringType, StructField, StructType
from langchain.text_splitter import RecursiveCharacterTextSplitter
import pandas as pd

class text_chunker:

    def process(self, pdf_text: str):
        
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size = 1512, #Adjust this as you see fit, but this is not bad for a pdf document
            chunk_overlap  = 256, #This will have some form of overlap. Useful for keeping chunks contextual
            length_function = len
        )
    
        chunks = text_splitter.split_text(pdf_text)
        df = pd.DataFrame(chunks, columns=['chunks'])
        
        yield from df.itertuples(index=False, name=None)
$$;
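
The handler shape used above (a class whose `process` method yields one tuple per output row) can be tried locally before deploying. The sketch below is only an illustration: `TextChunkerSketch` is a hypothetical class, and it uses fixed-size slices as a stand-in for the LangChain splitter so that it runs without any extra packages.

```python
class TextChunkerSketch:
    """Mirrors the handler shape: process() yields one tuple per output row."""

    def __init__(self, chunk_size: int = 20):
        self.chunk_size = chunk_size

    def process(self, pdf_text: str):
        # Stand-in splitter (assumption): plain fixed-size slices, no overlap.
        for start in range(0, len(pdf_text), self.chunk_size):
            # Each yielded tuple becomes one row with a single "chunk" column.
            yield (pdf_text[start:start + self.chunk_size],)


sample = "Lorem Ipsum is simply dummy text of the printing industry."
rows = list(TextChunkerSketch(chunk_size=20).process(sample))

for row in rows:
    print(row)
```

Each row is a one-element tuple because the function is declared to return a table with a single column.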

If we want to test the function we just created, we can.

In [None]:
select * from table(text_chunker('Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry\'s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Why do we use it?
It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using \'Content here, content here\', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for \'lorem ipsum\' will uncover many web sites still in their infancy. Various versions have evolved over the years, sometimes by accident, sometimes on purpose (injected humour and the like).


Where does it come from?
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32.

The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.

Where can I get some?
There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour, or randomised words which don\'t look even slightly believable. If you are going to use a passage of Lorem Ipsum, you need to be sure there isn\'t anything embarrassing hidden in the middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary. Use a dictionary of over 200 Latin words, combined with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc.'));

Step 2. Download sample [PDF documents](https://github.com/Snowflake-Labs/sfguide-ask-questions-to-your-documents-using-rag-with-snowflake-cortex-search/tree/main)

Step 3. With our function created and tested, we now want to create a Stage with Directory Table where you will be uploading your documents.

In [None]:
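-- Uncomment the next line to create the stage. Note that CREATE OR REPLACE drops an existing stage along with its files.
-- create or replace stage docs ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE') DIRECTORY = ( ENABLE = true );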


Step 4. Upload documents to your staging area

- Select Data on the left
- Click on your database RAG_CHAT
- Click on your schema DATA
- Click on Stages and select DOCS
- On the top right click on the **+Files** button
- Drag and drop the PDF documents you downloaded


Step 5. Check that the files have been successfully uploaded

In [None]:
ls @docs;

## Pre-process and Label Documents

Step 1. Create the table where we are going to store the chunks for each PDF.

In [None]:
create or replace TABLE DOCS_CHUNKS_TABLE ( 
    RELATIVE_PATH VARCHAR(16777216), -- Relative path to the PDF file
    SIZE NUMBER(38,0), -- Size of the PDF
    FILE_URL VARCHAR(16777216), -- URL for the PDF
    SCOPED_FILE_URL VARCHAR(16777216), -- Scoped url (you can choose which one to keep depending on your use case)
    CHUNK VARCHAR(16777216), -- Piece of text
    CATEGORY VARCHAR(16777216) -- Will hold the document category to enable filtering
);

Step 2. Use the Cortex PARSE_DOCUMENT function to read the PDF documents from the staging area. Use the function previously created to split the text into chunks. There is no need to create embeddings, as that will be managed automatically by the Cortex Search service later.

In [None]:
insert into docs_chunks_table (relative_path, size, file_url,
                            scoped_file_url, chunk)

    select relative_path, 
            size,
            file_url, 
            build_scoped_file_url(@docs, relative_path) as scoped_file_url,
            func.chunk as chunk
    from 
        directory(@docs),
        TABLE(text_chunker (TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@docs, relative_path, {'mode': 'LAYOUT'})))) as func;

We can take a look at what this created. In this case, it's nothing too spectacular, just the text from the PDFs chunked up (as expected).

In [None]:
select * from docs_chunks_table;

### Label the product category

We are going to use the power of Large Language Models to easily classify the documents we are ingesting in our RAG application. We are just going to use the file name but you could also use some of the content of the doc itself. Depending on your use case you may want to use different approaches. We are going to use a foundation LLM but you could even fine-tune your own LLM for your use case.

First we will create a temporary table with each unique file name, and we will pass that file name to an LLM using the Cortex Complete function, with a prompt to classify what that user guide refers to. The prompt will be as simple as this, but you can try to customize it depending on your use case and documents. Classification is not mandatory for Cortex Search, but we want to use it here to also demo hybrid search.

This will be the prompt, with the file name concatenated in between the `<file>` tags: `'Given the name of the file between <file> and </file> determine if it is related to bikes or snow. Use only one word <file> ' || relative_path || '</file>'`
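
To see exactly what string the model receives, the same concatenation can be sketched in Python. This is just an illustration of the prompt construction and of the newline trimming applied to the answer; `build_category_prompt` and `clean_category` are hypothetical helpers, the file name is made up, and the sample answer is not real model output.

```python
def build_category_prompt(relative_path: str) -> str:
    """Wrap the file name in <file> tags, as the SQL concatenation does."""
    return (
        "Given the name of the file between <file> and </file> determine if it is "
        "related to bikes or snow. Use only one word <file> " + relative_path + "</file>"
    )


def clean_category(raw_answer: str) -> str:
    """Strip leading and trailing newlines (assumed to be what the TRIM call is for)."""
    return raw_answer.strip("\n")


# Hypothetical file name, for illustration only.
print(build_category_prompt("example_bike_guide.pdf"))

# Hypothetical raw answer with stray newlines around the single word.
print(repr(clean_category("\nBike\n")))
```

Because the category is free text from the model, it is worth checking the distinct values it produced before filtering on them.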

In [None]:
CREATE
OR REPLACE TEMPORARY TABLE docs_categories AS WITH unique_documents AS (
  SELECT
    DISTINCT relative_path
  FROM
    docs_chunks_table
),
docs_category_cte AS (
  SELECT
    relative_path,
    TRIM(snowflake.cortex.COMPLETE (
      'llama3-70b', -- You could use whatever LLM you want here. 
      'Given the name of the file between <file> and </file> determine if it is related to bikes or snow. Use only one word <file> ' || relative_path || '</file>'
    ), '\n') AS category
  FROM
    unique_documents
)
SELECT
  *
FROM
  docs_category_cte;

You can check that table to identify how many categories have been created and if they are correct:

In [None]:
select category from docs_categories group by category;

We can also check that each document category is correct:

In [None]:
select * from docs_categories;

Now we can update the table of text chunks that the Cortex Search service will use, so that it includes the category for each document:

In [None]:
update docs_chunks_table 
  SET category = docs_categories.category
  from docs_categories
  where  docs_chunks_table.relative_path = docs_categories.relative_path;

And let's go ahead and see how our table looks now:

In [None]:
select * from docs_chunks_table;

## Create Cortex Search Service

The next step is to create the CORTEX SEARCH SERVICE on the table we created before.

- The name of the service is CC_SEARCH_SERVICE_CS.
- The service will use the column chunk to create embeddings and perform retrieval based on similarity search.
- The column category could be used as a filter.
- To keep this service updated, warehouse CC_FINS_WH will be used.
- The service will be refreshed every minute.
- The data retrieved will contain the chunk, relative_path, file_url and category.

### A few notes about the CORTEX SEARCH SERVICE

When you create a Cortex Search Service, Snowflake performs transformations on your source data to get it ready for low-latency serving. When you create a search service, the search index is built as part of the create process. This means the CREATE CORTEX SEARCH SERVICE statement may take longer to complete for larger datasets.


The following command triggers the building of the search service for your data. 

In this example:

- Queries to the service will search for matches in the chunk column.
- The TARGET_LAG parameter dictates that the Cortex Search Service will check for updates to the base table docs_chunks_table approximately every minute. This would likely be too frequent for a production system.
- The columns relative_path, file_url and category will be indexed so that they can be returned along with the results of queries on the chunk column.
- The column category will be available as a filter column when querying the chunk column.
- The warehouse CC_FINS_WH will be used for materializing the results of the specified query initially and each time the base table is changed.

Note
- Depending on the size of the warehouse specified in the query and the number of rows in your table, this CREATE command may take up to several hours to complete.
- Snowflake recommends using a dedicated warehouse of size no larger than MEDIUM for each service.
- Columns in the ATTRIBUTES field must be included in the source query, either via explicit enumeration or wildcard, ( * ) .


In [None]:
create or replace CORTEX SEARCH SERVICE CC_SEARCH_SERVICE_CS
ON chunk
ATTRIBUTES category
warehouse = CC_FINS_WH
TARGET_LAG = '1 minute'
as (
    select chunk,
        relative_path,
        file_url,
        category
    from docs_chunks_table
);

At this point, our basic setup is complete and we're ready to build our chat interface in Streamlit. First, though, we can do a quick check on our Search Service, just to make sure everything is fine.


In [None]:
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
      'RAG_CHAT.DATA.CC_SEARCH_SERVICE_CS',
      '{
        "query": "Is there a temperature where we should not use our premium bicycle?",
        "columns":[
            "chunk",
            "category"
        ],
        "filter": {"@eq": {"category": "Bike"} },
        "limit":1
      }'
  )
)['results'] as results;

Here we can see that it found some relevant content for us (hopefully), but it's not formatted or easy to use. This is simply to show that the search service is working.
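
If you want something more readable while testing, a few lines of Python can tidy up the JSON. This is a rough sketch under assumptions: the sample below is made up, and it assumes the response has a top-level `results` list with one object per hit holding the requested columns. `format_results` is a hypothetical helper, not part of any Snowflake API.

```python
import json

# Hypothetical sample, shaped like the assumed response described above.
sample_response = json.dumps({
    "results": [
        {"chunk": "Example chunk text\nabout riding in cold   weather.", "category": "Bike"}
    ]
})


def format_results(response_json: str, max_chars: int = 60) -> list[str]:
    """Turn the raw JSON into one short, numbered line per hit."""
    results = json.loads(response_json).get("results", [])
    lines = []
    for rank, row in enumerate(results, start=1):
        # Collapse newlines and repeated spaces so each hit fits on one line.
        chunk = " ".join(str(row.get("chunk", "")).split())
        if len(chunk) > max_chars:
            chunk = chunk[:max_chars].rstrip() + "..."
        lines.append(f"{rank}. [{row.get('category', 'n/a')}] {chunk}")
    return lines


for line in format_results(sample_response):
    print(line)
```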

### At this point, to keep things simple, you can stop and go to the Streamlit Application.

However, if you want to go ahead and create a task that will automatically ingest new files, continue on.

We're going to create a Stream first. This will keep track of new files in our stage.

In [None]:
CREATE STREAM docs_stream ON STAGE docs;

We're going to refresh the stage manually here. There are ways to automate this, but keep in mind that after adding files you need to refresh the stage so the stream picks them up.

In [None]:
ALTER STAGE docs REFRESH;

Next, we are creating a Snowflake Task. That Task will have some conditions to be executed and one action to take:

Where: This is going to be executed using warehouse RAG_CHAT_WH. Please change this to the name of your own warehouse.

When: Check every 10 minutes, and execute in the case of new records in the docs_stream stream

What to do: Process the files and insert the records in the docs_chunks_table
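
The where/when/what logic can be pictured with a toy model in Python. This is only an analogy to help with the mental model, not Snowflake code: `ToyStream`, `run_task_once` and `fake_chunker` are hypothetical names, and the "stream" here is just a list of file names that is emptied once it has been read.

```python
class ToyStream:
    """Toy stand-in for a stream: remembers files added since the last read."""

    def __init__(self):
        self._pending = []

    def add_file(self, relative_path: str) -> None:
        self._pending.append(relative_path)

    def has_data(self) -> bool:
        return len(self._pending) > 0

    def consume(self) -> list:
        # Reading the pending files empties the stream.
        rows, self._pending = self._pending, []
        return rows


def run_task_once(stream: ToyStream, table: list, chunker) -> int:
    """One scheduled run: do nothing unless the stream has new files."""
    if not stream.has_data():  # the "when" condition
        return 0
    inserted = 0
    for path in stream.consume():  # the "what to do" part
        for chunk in chunker(path):
            table.append((path, chunk))
            inserted += 1
    return inserted


def fake_chunker(path: str) -> list:
    # Hypothetical chunker that pretends every file yields two chunks.
    return [f"{path} chunk 1", f"{path} chunk 2"]


stream, table = ToyStream(), []
print(run_task_once(stream, table, fake_chunker))  # nothing new yet
stream.add_file("new_guide.pdf")
print(run_task_once(stream, table, fake_chunker))  # processes the new file
print(run_task_once(stream, table, fake_chunker))  # already consumed
```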

In [None]:
create or replace task parse_and_insert_pdf_task 
    warehouse = RAG_CHAT_WH
    schedule = '10 minute'
    when system$stream_has_data('docs_stream')
    as
  
    insert into docs_chunks_table (relative_path, size, file_url,
                            scoped_file_url, chunk)
    select relative_path, 
            size,
            file_url, 
            build_scoped_file_url(@docs, relative_path) as scoped_file_url,
            func.chunk as chunk
    from 
        docs_stream,
        TABLE(text_chunker (TO_VARCHAR(SNOWFLAKE.CORTEX.PARSE_DOCUMENT(@docs, relative_path, {'mode': 'LAYOUT'})))) as func;

alter task parse_and_insert_pdf_task resume;

In [None]:
-- and if we want to shut down the task
alter task parse_and_insert_pdf_task suspend;