# Setting up the RAG Pipeline using Cortex Search

In this notebook, we set up a RAG pipeline over the Avalanche customer review data that was previously prepared and stored in a stage.

By the end of the tutorial, we'll have a RAG pipeline that lets us ask questions about the data.

## Install prerequisites

Make sure the `snowflake.core` library is installed: click Packages in the top menu and enter `snowflake.core` in the text box.

## List staged data

Here, we list the contents of the stage at `@AVALANCHE_DB.AVALANCHE_SCHEMA.CUSTOMER_REVIEWS`.

In [None]:
ls @avalanche_db.avalanche_schema.customer_reviews;

## Parse content

Here, we parse the content of the `DOCX` files stored in the stage using the Cortex `PARSE_DOCUMENT()` function.

The parsed content is stored in a newly created table, `AVALANCHE_DB.AVALANCHE_SCHEMA.PARSED_CONTENT`.

In [None]:
CREATE OR REPLACE TABLE AVALANCHE_DB.AVALANCHE_SCHEMA.PARSED_CONTENT AS
SELECT
    relative_path,
    TO_VARCHAR(
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
            @avalanche_db.avalanche_schema.customer_reviews,
            relative_path,
            {'mode': 'LAYOUT'}
        ):content
    ) AS parsed_text
FROM DIRECTORY(@avalanche_db.avalanche_schema.customer_reviews)
WHERE relative_path LIKE '%.docx';

## Query parsed content

Let's now query the parsed content stored in the `PARSED_CONTENT` table.

In [None]:
SELECT * FROM AVALANCHE_DB.AVALANCHE_SCHEMA.PARSED_CONTENT LIMIT 5;

## Chunk content

We'll now split the parsed text into chunks with `SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER`.

The chunks are stored in `AVALANCHE_DB.AVALANCHE_SCHEMA.CHUNKED_CONTENT`.

In [None]:
CREATE OR REPLACE TABLE AVALANCHE_DB.AVALANCHE_SCHEMA.CHUNKED_CONTENT (
    file_name VARCHAR,
    CHUNK VARCHAR
);

INSERT INTO AVALANCHE_DB.AVALANCHE_SCHEMA.CHUNKED_CONTENT (file_name, CHUNK)
SELECT
    relative_path,
    c.value AS CHUNK
FROM
    AVALANCHE_DB.AVALANCHE_SCHEMA.PARSED_CONTENT,
    LATERAL FLATTEN( input => SNOWFLAKE.CORTEX.SPLIT_TEXT_RECURSIVE_CHARACTER (
        parsed_text,
        'markdown',
        1800,
        250
    )) c;
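
To build intuition for the two numeric arguments above, here is a minimal Python sketch of overlapping chunking. It assumes `1800` is the chunk size and `250` is the overlap, both in characters. It is a simplification: the Cortex function splits recursively on separators suited to the `'markdown'` format, whereas this hypothetical `chunk_with_overlap` helper simply slides a fixed-size character window over the text.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1800, overlap: int = 250) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters.

    Illustrative only: this is NOT how SPLIT_TEXT_RECURSIVE_CHARACTER is
    implemented; it only shows what chunk size and overlap mean.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


# Hypothetical 4000-character document made of repeating digits
sample = "".join(str(i % 10) for i in range(4000))
chunks = chunk_with_overlap(sample)
print([len(c) for c in chunks])  # → [1800, 1800, 900]
```

Each chunk starts 1550 characters after the previous one, so the last 250 characters of one chunk reappear at the start of the next. This overlap helps keep a sentence that straddles a boundary retrievable from at least one chunk.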

## Query chunked content

Let's now query the chunked content stored in the `AVALANCHE_DB.AVALANCHE_SCHEMA.CHUNKED_CONTENT` table.

In [None]:
SELECT * FROM AVALANCHE_DB.AVALANCHE_SCHEMA.CHUNKED_CONTENT LIMIT 10;

## Create RAG pipeline with Cortex Search

Here, we'll create the RAG pipeline with Cortex Search by defining a `CORTEX SEARCH SERVICE` over the chunked content. The service is made available at `AVALANCHE_DB.AVALANCHE_SCHEMA.AVALANCHE_SEARCH_SERVICE`.

In [None]:
CREATE OR REPLACE CORTEX SEARCH SERVICE AVALANCHE_DB.AVALANCHE_SCHEMA.AVALANCHE_SEARCH_SERVICE
    ON chunk
    WAREHOUSE = compute_wh
    TARGET_LAG = '1 minute'
    EMBEDDING_MODEL = 'snowflake-arctic-embed-l-v2.0'
    AS (
    SELECT
        file_name,
        chunk
    FROM AVALANCHE_DB.AVALANCHE_SCHEMA.CHUNKED_CONTENT
    );

## RAG query

Now comes the fun part: using the RAG pipeline to ask a question.

First, let's use SQL to perform a RAG query using `SNOWFLAKE.CORTEX.SEARCH_PREVIEW()`.

In [None]:
-- Query it with SQL
SELECT PARSE_JSON(
  SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
      'AVALANCHE_DB.AVALANCHE_SCHEMA.AVALANCHE_SEARCH_SERVICE',
      '{
         "query": "Any goggles review?",
         "columns":[
            "file_name",
            "CHUNK"
         ],
         "limit":3
      }'
  )
)['results'] as results;

Now, let's perform the RAG query in Python.

In [None]:
# Query it with Python
from snowflake.core import Root
from snowflake.snowpark.context import get_active_session
import streamlit as st
import json
import pandas as pd

session = get_active_session()

prompt="Any goggles review?"

root = Root(session)

# Query service
svc = (root
  .databases["AVALANCHE_DB"]
  .schemas["AVALANCHE_SCHEMA"]
  .cortex_search_services["AVALANCHE_SEARCH_SERVICE"]
)

resp = svc.search(
  query=prompt,
  columns=["CHUNK", "file_name"],
  limit=3
).to_json()

# JSON formatting
json_conv = json.loads(resp) if isinstance(resp, str) else resp
search_df = pd.json_normalize(json_conv['results'])

for _, row in search_df.iterrows():
    st.write(f"**{row['CHUNK']}**")
    st.caption(row['file_name'])
    st.write('---')
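
As one possible next step, the retrieved chunks can be stitched into a context block for a prompt. The sketch below runs without a Snowflake session. It assumes the response has a top-level `results` list whose rows carry the `CHUNK` and `file_name` keys used above. The `build_context` helper, the sample payload, and the prompt wording are hypothetical and only illustrate the shape of the data.

```python
import json


def build_context(resp, chunk_key="CHUNK", source_key="file_name"):
    """Turn a search response (JSON string or dict) into one context string."""
    data = json.loads(resp) if isinstance(resp, str) else resp
    parts = []
    for i, row in enumerate(data.get("results", []), start=1):
        source = row.get(source_key, "unknown")
        parts.append(f"[{i}] ({source})\n{row.get(chunk_key, '')}")
    return "\n\n".join(parts)


# Hypothetical payload mimicking the shape of the search response;
# the review text and file names are made up for illustration.
sample_resp = json.dumps({
    "results": [
        {"CHUNK": "The goggles fogged up quickly.", "file_name": "review-001.docx"},
        {"CHUNK": "Great lens clarity on sunny days.", "file_name": "review-002.docx"},
    ]
})

question = "Any goggles review?"
context = build_context(sample_resp)
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)
```

Numbering each chunk and keeping its file name next to the text makes it easier to trace an answer back to the review it came from.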