## Generating Synthetic Questions and Answers for RAG Evaluation

This notebook demonstrates how to use Cortex to generate synthetic questions and answers for evaluating a RAG application

This is a following the example in the [Hugging Face RAG evaluation notebook](https://huggingface.co/learn/cookbook/rag_evaluation), but using Cortex to generate the synthetic questions and answers and performing the operations in Snowpark (and not on the local machine).

In [1]:
# Snowpark for Python
from snowflake.snowpark.session import Session
from snowflake.snowpark.types import Variant
from snowflake.snowpark.version import VERSION

# Snowpark ML
# Misc
import pandas as pd
import json
import logging 
logger = logging.getLogger("snowflake.snowpark.session")
logger.setLevel(logging.ERROR)

from snowflake import connector
from snowflake.ml.utils import connection_params

In [2]:
with open('../../creds.json') as f:
    data = json.load(f)
    USERNAME = data['user']
    PASSWORD = data['password']
    SF_ACCOUNT = data['account']
    SF_WH = data['warehouse']

CONNECTION_PARAMETERS = {
   "account": SF_ACCOUNT,
   "user": USERNAME,
   "password": PASSWORD,
}

session = Session.builder.configs(CONNECTION_PARAMETERS).create()

In [3]:
snowflake_environment = session.sql('select current_user(), current_version()').collect()
snowpark_version = VERSION

from snowflake.ml import version
mlversion = version.VERSION


# Current Environment Details
print('User                        : {}'.format(snowflake_environment[0][0]))
print('Role                        : {}'.format(session.get_current_role()))
print('Database                    : {}'.format(session.get_current_database()))
print('Schema                      : {}'.format(session.get_current_schema()))
print('Warehouse                   : {}'.format(session.get_current_warehouse()))
print('Snowflake version           : {}'.format(snowflake_environment[0][1]))
print('Snowpark for Python version : {}.{}.{}'.format(snowpark_version[0],snowpark_version[1],snowpark_version[2]))
print('Snowflake ML version        : {}.{}.{}'.format(mlversion[0],mlversion[2],mlversion[4]))

User                        : RSHAH
Role                        : "RAJIV"
Database                    : "RAJIV"
Schema                      : "DOCAI"
Warehouse                   : "RAJIV"
Snowflake version           : 8.18.0
Snowpark for Python version : 1.11.1
Snowflake ML version        : 1.5.0


## Get Data
Movie reviews and the task is extracting actor names and movies from the reviews

In [4]:
from snowflake.snowpark.functions import col
import snowflake.snowpark.functions as f
from snowflake.snowpark.functions import col, split, lit, regexp_extract
from snowflake.cortex import Complete


article_df = session.table("RAJIV.PUBLIC.CONTENT_CHUNKS_10K")
filtered_df = article_df.filter(col("company_name").like("RIVIAN%"))

# Now you can perform further operations on filtered_df, like displaying the data
filtered_df.show()


------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"SEC_DOCUMENT_ID"          |"DOCUMENT_TYPE"   |"COMPANY_NAME"                |"SIC_CODE_CATEGORY"      |"SIC_CODE_DESCRIPTION"                 |"COUNTRY"  |"PERIOD_END_DATE"  |"CONTENT_CHUNK"                                     |"START_INDEX"  |"EMBEDDING"                                         |"DOCUMENT_INDEX_ROWNUM"  |"ROWNUM"  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [11]:
filtered_df.count()

497

You will only want to select a subset of the chunks to build your generated examples. 

In [5]:
shuffled_df = filtered_df.sample(n=50)
shuffled_df.count()

50

Let's build the Q&A pairs for the RAG evaluation.

In [6]:
QA_generation_prompt = """
Your task is to write a factoid question and an answer given a context.
Your factoid question should be answerable with a specific, concise piece of factual information from the context.
Your factoid question should be formulated in the same style as questions users could ask in a search engine.
This means that your factoid question MUST NOT mention something like "according to the passage" or "context".

Provide your answer as follows:

Output:::
Factoid question: (your factoid question)
Answer: (your answer to the factoid question)

Now here is the context.

Context: """

In [7]:
outdf = shuffled_df.withColumn(
    "GenQA",
    Complete(
        model='mixtral-8x7b',prompt = f.concat(
            f.lit(QA_generation_prompt),
            f.col("CONTENT_CHUNK"),
            f.lit("\n Output:::")
            ))
)
outputs = outdf.to_pandas()

Complete() is experimental since 1.0.12. Do not use it in production. 


In [8]:
outputs.GENQA[1]

" Factoid question: How does US GAAP affect the reporting of amounts in a company's financial statements?\nAnswer: US GAAP requires management to make estimates and assumptions that affect the amounts reported in a company's consolidated financial statements and accompanying notes. These estimates and assumptions are based on historical experience and other factors that management believes to be reasonable under the circumstances. The results of operations may be adversely affected if these assumptions change or if actual circumstances differ from those in the assumptions, which could cause the results of operations to fall below publicly announced guidance or the expectations of securities analysts and investors, potentially leading to a decline in the market price of the company's stock."

In [9]:
outdf = outdf.withColumn("split_col", split(col("GENQA"), lit('\n')))

# Create new columns 'GENQ' and 'GENA' by accessing elements of the array
outdf = outdf.withColumn("GENQ", col("split_col")[0])
outdf = outdf.withColumn("GENA", col("split_col")[1])

outdf.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"SEC_DOCUMENT_ID"          |"DOCUMENT_TYPE"   |"COMPANY_NAME"                |"SIC_CODE_CATEGORY"      |"SIC_CODE_DESCRIPTION"                 |"COUNTRY"  |"PERIOD_END_DATE"  |"CONTENT_CHUNK"                                     |"START_INDEX"  |"EMBEDDING"                                         |"DOCUMENT_INDEX_ROWNUM"  |"ROWNUM"  |"GENQA"                                             |"SPLIT_COL"                                         |"GENQ"   

In [10]:
#save our work periodically
table_name = "RAJIV.PUBLIC.CONTENT_CHUNKS_10K_GQA"  # Specify your new table name here
outdf.write.mode("overwrite").save_as_table(table_name)

In [11]:
outdf = session.table("RAJIV.PUBLIC.CONTENT_CHUNKS_10K_GQA")

## Critique

The questions generated by the previous agent can have many flaws: we should do a quality check before validating these questions.

### Groundness Criteria

In [12]:
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a integer between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and context.

Question: """

In [13]:
outdf = outdf.withColumn(
    "Eval_Groundness",
    Complete(
        model='mixtral-8x7b',prompt = f.concat(
            f.lit(question_groundedness_critique_prompt),
            f.col("GENQ"),
            f.lit("\n Context: "),
            f.col("CONTENT_CHUNK"),
            f.lit("\n Answer:::")
            ))
)

In [14]:
## you can always pull this into pandas to see the results
outputs = outdf.to_pandas()
outputs.EVAL_GROUNDNESS[4]

" Answer: The primary components of Rivian Automotive's R&D costs include personnel expenses for teams in engineering and research, prototyping expenses, consulting and contractor expenses, depreciation expenses, allocation of indirect expenses, and recurring non-cash stock compensation charges.\n\nEvaluation: The context provides a detailed breakdown of the components that constitute Rivian Automotive's R&D costs. Each component is explicitly mentioned, allowing for a clear and unambiguous answer to the question. The only addition to the answer based on the context is the mention of recurring non-cash stock compensation charges, which were introduced in the quarter ended December 31, 2021.\n\nTotal rating: 5"

Let's split out the score,which will later filter on.

In [15]:
# Extracting the score by taking the rightmost character of the last part of split_rating
outdf = outdf.withColumn("score_EG", regexp_extract(col("EVAL_GROUNDNESS"), r"Total rating:\s*(\d)", 1).cast("int"))


# Show results to verify
outdf.show()



--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"SEC_DOCUMENT_ID"          |"DOCUMENT_TYPE"   |"COMPANY_NAME"                |"SIC_CODE_CATEGORY"      |"SIC_CODE_DESCRIPTION"                 |"COUNTRY"  |"PERIOD_END_DATE"  |"CONTENT_CHUNK"                                     |"START_INDEX"  |"EMBEDDING"                                         |"DOCUMENT_INDEX_ROWNUM"  |"ROWNUM"  |"GENQA"                                          

In [16]:
working_table_name = "RAJIV.PUBLIC.CONTENT_CHUNKS_10K_GQA_All"  # Specify your new table name here
# Write the DataFrame to a new table in Snowflake
outdf.write.mode("overwrite").save_as_table(working_table_name)

### Relevance Criteria

In [17]:
question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to analysts.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as an integer between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: """

In [18]:
outdf = outdf.withColumn(
    "EVAL_RELEVANCE",
    Complete(
        model='mixtral-8x7b',prompt = f.concat(
            f.lit(question_relevance_critique_prompt),
            f.col("GENQ"),
            f.lit("\n Answer:::")
            ))
)

# Extracting the score by taking the rightmost character of the last part of split_rating
outdf = outdf.withColumn("score_ER", regexp_extract(col("EVAL_RELEVANCE"), r"Total rating:\s*(\d)", 1).cast("int"))

outdf.write.mode("overwrite").save_as_table(working_table_name)

# Show results to verify
outdf.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"SEC_DOCUMENT_ID"          |"DOCUMENT_TYPE"   |"COMPANY_NAME"                |"SIC_CODE_CATEGORY"      |"SIC_CODE_DESCRIPTION"                 |"COUNTRY"  |"PERIOD_END_DATE"  |"CONTENT_CHUNK"                                     |"START_INDEX"  |"EMBEDDING"                                         |"DOCUMENT_INDEX_ROWN

### Critique Criteria

In [19]:
outdf = session.table(working_table_name)
outdf.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"SEC_DOCUMENT_ID"          |"DOCUMENT_TYPE"   |"COMPANY_NAME"                |"SIC_CODE_CATEGORY"      |"SIC_CODE_DESCRIPTION"                 |"COUNTRY"  |"PERIOD_END_DATE"  |"CONTENT_CHUNK"                                     |"START_INDEX"  |"EMBEDDING"                                         |"DOCUMENT_INDEX_ROWN

In [20]:
question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independant this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.

For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independant from the context.

Provide your answer as follows:

Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as an integer between 1 and 5)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here is the question.

Question: """

In [21]:
outdf = outdf.withColumn(
    "EVAL_CRITIQUE",
    Complete(
        model='mixtral-8x7b',prompt = f.concat(
            f.lit(question_relevance_critique_prompt),
            f.col("GENQ"),
            f.lit("\n Answer:::")
            ))
)

# Extracting the score by taking the rightmost character of the last part of split_rating
outdf = outdf.withColumn("score_EC", regexp_extract(col("EVAL_CRITIQUE"), r"Total rating:\s*(\d)", 1).cast("int"))

outdf.write.mode("overwrite").save_as_table(working_table_name)

# Show results to verify
outdf.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"SEC_DOCUMENT_ID"          |"DOCUMENT_TYPE"   |"COMPANY_NAME"                |"SIC_CODE_CATEGORY"      |"SIC_CODE_DESCRIPTION"                 |"COUNTRY"  |"PERIOD_END_DATE"  |"CONTENT_CHUNK"                                     |"START_INDEX"  |"EMBEDD

Let's keep only the questions that have scored well, by filtering on the score.

In [22]:
filtered_df = outdf.filter((outdf['SCORE_EC'] > 3) & (outdf['SCORE_ER'] > 3) & (outdf['SCORE_EG'] > 3))
filtered_df.count()


25

In [23]:
selected_df = filtered_df.select("CONTENT_CHUNK", "GENQ", "GENA")
# Now, to save this DataFrame as a new table in Snowflake
selected_df.write.mode("overwrite").saveAsTable("RAJIV.PUBLIC.CONTENT_CHUNKS_10K_GQA")