# Content Design for RAG
This notebook is part of a collection of material related to content design principles for retrieval-augmented generation (RAG).

You can explore the complete collection here: [Content Design for RAG on GitHub](https://github.com/spackows/ICAAI-2024_RAG-CD/blob/main/README.md)

**Example scenario**

Imagine your company sells seeds and gardening supplies online.  On your website, you have articles with gardening information and advice.  You are building a RAG solution for your company website that can answer customer questions about your products, using your website articles as a knowledge base.

# Reggression testing
As you make changes to your RAG solution over time, you need to test that the solution has not regressed, that it continues to answer correctly questions that it answered correctly before your changes.

This sample notebook demonstrates a simple approach this problem: comparing generated output with expected output using a variety of techniques.

**Contents**
1. Regressions test question-answer pairs
2. Comparing generated output with expected answers

## 1. Regression test question-answer pairs
Imagine you have collected the following test data:
- Historical questions
- Corresponding answers that have been evaluated as correct
- Articles upon which the answers were grounded

In [1]:
g_test_questions = [
    {
        "id" : "test_01",
        "question_txt" : "How tall do cucumbers grow?",
        "expected_answer_txt" : "Cucumber plants can grow as high as 6 feet",
        "grounding_article_txt" : """## All things cucumber 
Cucumbers are popular for gardeners - beginners and advanced alike. 
They grow well in traditional garden beds, raised beds, and even containers on decks or balconies. 
Cucumber plants like to climb, and can grow as high as 6 feet. 
"""
    },
    {
        "id" : "test_02",
        "question_txt" : "Can I grow tomatoes in containers",
        "expected_answer_txt" : "Most tomato plants do well in containers.",
        "grounding_article_txt" : """## Growing tomatoes in pots 
Most tomato plants do well in containers. 
Determinate varieties, don't grow as large as indeterminate varieties. 
For anything other than compact determinate varieties, use a 5 gallon container at a minimum. 
"""
    }
]

And imagine you want to test several new model-prompt parameters combinations in your RAG solution:
- Version 1: Model with a tendency to be terse, very conservative prompt parameters
- Version 2: Model with a tendency to repeat given text (not "creative"), conservative parameters
- Version 3: Model with a tendency to be verbose, sampling decoding (risk of hallucination)

After running the test questions through the three versions of your RAG solution, you have the following results:

In [2]:
g_test_results_1 = [
    {
        "id" : "test_01",
        "run_time_answer_txt" : "6 feet"
    },
    {
        "id" : "test_02",
        "run_time_answer_txt" : "Yes"
    }
]

g_test_results_2 = [
    {
        "id" : "test_01",
        "run_time_answer_txt" : "Cucumber plants like to climb, and can grow as high as 6 feet."
    },
    {
        "id" : "test_02",
        "run_time_answer_txt" : "Most tomato plants do well in containers."
    }
]

g_test_results_3 = [
    {
        "id" : "test_01",
        "run_time_answer_txt" : "6 feet. But note that it's actually the cucumber plants, not cucumbers, that can grow as high as 6 feet. Cucumbers are typically much shorter, roughly the length of the cucumber itself."
    },
    {
        "id" : "test_02",
        "run_time_answer_txt" : "Yes   ( because the article explicitly says \" Most tomato plants do well in containers.\" )"
    }
]

## 2. Comparing generated output with expected answers
There are several ways to compare strings:
- 2.1 Fuzzy string matching 
- 2.2 Semantic similarity
- 2.3 Text distance
- 2.4 BLEU
- 2.5 ROUGE
- 2.6 METEOR
- 2.7 BertScore
- 2.8 Large language model

### 2.1 Fuzzy string matching

In [3]:
!pip install thefuzz | tail -n 1

Successfully installed rapidfuzz-3.9.6 thefuzz-0.22.1


In [4]:
from thefuzz import fuzz
import statistics

def getFuzzScores( test_results_arr, expected_answers_arr ):
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        s1 = fuzz.ratio( run_time_answer, expected_answer )
        s2 = fuzz.partial_ratio( run_time_answer, expected_answer )
        s3 = fuzz.token_sort_ratio( run_time_answer, expected_answer )
        s4 = fuzz.token_set_ratio( run_time_answer, expected_answer )
        s5 = fuzz.partial_token_sort_ratio( run_time_answer, expected_answer )
        scores = [ s1, s2, s3, s4, s5 ]
        ave = statistics.mean( scores )
        all_scores_arr.append( { "id"       : id,
                                 "question" : question_txt,
                                 "expected_answer" : expected_answer,
                                 "run_time_answer" : run_time_answer,
                                 "s1" : s1,
                                 "s2" : s2,
                                 "s3" : s3,
                                 "s4" : s4,
                                 "s5" : s5,
                                 "ave_score" : ave } )        
    return all_scores_arr

In [5]:
import pandas as pd

results = getFuzzScores( g_test_results_1, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,s1,s2,s3,s4,s5,ave_score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,25,100,25,100,83,66.6
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,9,67,9,9,67,32.2


In [6]:
results = getFuzzScores( g_test_results_2, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,s1,s2,s3,s4,s5,ave_score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",81,83,82,100,86,86.4
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,100,100,100,100,100,100.0


In [7]:
results = getFuzzScores( g_test_results_3, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,s1,s2,s3,s4,s5,ave_score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,36,88,38,100,67,65.8
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",63,100,67,100,72,80.4


### 2.2 Semantic similarity

In [8]:
!pip install sentence-transformers | tail -n 1

Successfully installed huggingface-hub-0.24.6 regex-2024.7.24 safetensors-0.4.4 sentence-transformers-3.0.1 tokenizers-0.19.1 transformers-4.44.2


In [9]:
from sentence_transformers import SentenceTransformer, util

  from tqdm.autonotebook import tqdm, trange


In [10]:
import numpy as np

st_model = SentenceTransformer( "all-MiniLM-L6-v2" )

def getSentenceTransformerScores( test_results_arr, expected_answers_arr ):
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        run_time_answer_embeddings  = st_model.encode( [ run_time_answer ],  convert_to_tensor=True )
        expected_answer_embeddings = st_model.encode( [ expected_answer ], convert_to_tensor=True )
        cosine_scores = util.cos_sim( run_time_answer_embeddings, expected_answer_embeddings )
        sentence_transformers_score_arr = [ round( float( x ), 2 ) for x in cosine_scores[0] ]
        all_scores_arr.append( { "id" : id,
                                 "question" : question_txt,
                                 "expected_answer" : expected_answer,
                                 "run_time_answer" : run_time_answer,
                                 "score" : int( 100*sentence_transformers_score_arr[0] ) } )
    return all_scores_arr

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]



1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [11]:
results = getSentenceTransformerScores( g_test_results_1, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,49
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,-4


In [12]:
results = getSentenceTransformerScores( g_test_results_2, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",94
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,100


In [13]:
results = getSentenceTransformerScores( g_test_results_3, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,86
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",92


### 2.3 Text distance

In [14]:
!pip install textdistance | tail -n 1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Successfully installed textdistance-4.6.3


In [15]:
import textdistance as td

def gettextdistanceScores( test_results_arr, expected_answers_arr ):
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        s01 = td.hamming.normalized_similarity( expected_answer, run_time_answer )
        #s02 = td.mlipns.normalized_similarity( expected_answer, run_time_answer )
        s03 = td.levenshtein.normalized_similarity( expected_answer, run_time_answer )
        s04 = td.damerau_levenshtein.normalized_similarity( expected_answer, run_time_answer )
        s05 = td.jaro_winkler.normalized_similarity( expected_answer, run_time_answer )
        s06 = td.jaro.normalized_similarity( expected_answer, run_time_answer )
        s07 = td.strcmp95.normalized_similarity( expected_answer, run_time_answer )
        s08 = td.needleman_wunsch.normalized_similarity( expected_answer, run_time_answer )
        s09 = td.gotoh.normalized_similarity( expected_answer, run_time_answer )
        s10 = td.smith_waterman.normalized_similarity( expected_answer, run_time_answer )
        s11 = td.jaccard.normalized_similarity( expected_answer, run_time_answer )
        s12 = td.sorensen.normalized_similarity( expected_answer, run_time_answer )
        s13 = td.sorensen_dice.normalized_similarity( expected_answer, run_time_answer )
        #s14 = td.dice.normalized_similarity( expected_answer, run_time_answer )
        s15 = td.tversky.normalized_similarity( expected_answer, run_time_answer )
        s16 = td.overlap.normalized_similarity( expected_answer, run_time_answer )
        #s17 = td.tanimoto.normalized_similarity( expected_answer, run_time_answer )
        s18 = td.cosine.normalized_similarity( expected_answer, run_time_answer )
        #s19 = td.monge_elkan.normalized_similarity( expected_answer, run_time_answer )
        s20 = td.bag.normalized_similarity( expected_answer, run_time_answer )
        s21 = td.lcsseq.normalized_similarity( expected_answer, run_time_answer )
        s22 = td.lcsstr.normalized_similarity( expected_answer, run_time_answer )
        s23 = td.ratcliff_obershelp.normalized_similarity( expected_answer, run_time_answer )
        #s24 = td.arith_ncd.normalized_similarity( expected_answer, run_time_answer )
        #s25 = td.rle_ncd.normalized_similarity( expected_answer, run_time_answer )
        #s26 = td.bwtrle_ncd.normalized_similarity( expected_answer, run_time_answer )
        s27 = td.sqrt_ncd.normalized_similarity( expected_answer, run_time_answer )
        s28 = td.entropy_ncd.normalized_similarity( expected_answer, run_time_answer )
        s29 = td.bz2_ncd.normalized_similarity( expected_answer, run_time_answer )
        s30 = td.lzma_ncd.normalized_similarity( expected_answer, run_time_answer )
        s31 = td.zlib_ncd.normalized_similarity( expected_answer, run_time_answer )
        s32 = td.mra.normalized_similarity( expected_answer, run_time_answer )
        s33 = td.editex.normalized_similarity( expected_answer, run_time_answer )
        scores = [      s01,      s03, s04, s05, s06, s07, s08, s09,
                   s10, s11, s12, s13,      s15, s16,      s18,
                   s20, s21, s22, s23,                s27, s28, s29,
                   s30, s31, s32, s33
                 ]
        scores = [ round( float( x ), 2 ) for x in scores ]
        ave = round( statistics.mean( scores ), 2 )
        all_scores_arr.append( { "id" : id,
                                 "question" : question_txt,
                                 "expected_answer" : expected_answer,
                                 "run_time_answer" : run_time_answer,
                                 "s01"        : int( 100*s01 ),
                                 "s03"        : int( 100*s03 ),
                                 "s04"        : int( 100*s04 ),
                                 "s05"        : int( 100*s05 ),
                                 "s06"        : int( 100*s06 ),
                                 "s07"        : int( 100*s07 ),
                                 "s08"        : int( 100*s08 ),
                                 "s09"        : int( 100*s09 ),
                                 "s10"        : int( 100*s10 ),
                                 "s11"        : int( 100*s11 ),
                                 "s12"        : int( 100*s12 ),
                                 "s13"        : int( 100*s13 ),
                                 "s15"        : int( 100*s15 ),
                                 "s16"        : int( 100*s16 ),
                                 "s18"        : int( 100*s18 ),
                                 "s20"        : int( 100*s20 ),
                                 "s21"        : int( 100*s21 ),
                                 "s22"        : int( 100*s22 ),
                                 "s23"        : int( 100*s23 ),
                                 "s27"        : int( 100*s27 ),
                                 "s28"        : int( 100*s28 ),
                                 "s29"        : int( 100*s29 ),
                                 "s30"        : int( 100*s30 ),
                                 "s31"        : int( 100*s31 ),
                                 "s32"        : int( 100*s32 ),
                                 "s33"        : int( 100*s33 ),
                                 "ave_score"  : int( 100*ave ) } )
    return all_scores_arr

In [16]:
results = gettextdistanceScores( g_test_results_1, g_test_questions )
df = pd.DataFrame( results )
df.iloc[:, 0:16]

Unnamed: 0,id,question,expected_answer,run_time_answer,s01,s03,s04,s05,s06,s07,s08,s09,s10,s11,s12,s13
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,0,14,14,41,41,45,14,-25,100,14,25,25
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,2,4,4,45,45,52,6,-196,0,4,9,9


In [17]:
df.iloc[:, 16:]

Unnamed: 0,s15,s16,s18,s20,s21,s22,s23,s27,s28,s29,s30,s31,s32,s33,ave_score
0,14,100,37,14,14,14,25,12,64,43,48,22,0,17,28
1,4,66,18,4,4,2,4,5,52,38,53,12,0,10,10


In [18]:
results = gettextdistanceScores( g_test_results_2, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,s01,s03,s04,s05,s06,s07,...,s22,s23,s27,s28,s29,s30,s31,s32,s33,ave_score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",27,67,67,88,80,88,...,43,80,47,97,58,73,61,50,67,71
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,100,100,100,100,100,100,...,100,100,58,100,80,90,93,100,100,97


In [19]:
results = gettextdistanceScores( g_test_results_3, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,s01,s03,s04,s05,s06,s07,...,s22,s23,s27,s28,s29,s30,s31,s32,s33,ave_score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,1,22,22,57,57,57,...,14,36,33,97,30,42,31,16,27,35
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",5,45,45,59,59,61,...,45,62,37,94,44,58,53,16,50,55


### 2.4 BLEU
See: https://www.nltk.org/howto/bleu.html

In [20]:
!pip install nltk | tail -n 1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Successfully installed nltk-3.9.1


In [21]:
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction

In [22]:
chencherry = SmoothingFunction()

def getBLEUScores( test_results_arr, expected_answers_arr ):
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        bleu_score = sentence_bleu( [ expected_answer.split() ], run_time_answer.split(), smoothing_function=chencherry.method2 )
        all_scores_arr.append( { "id" : id,
                                 "question" : question_txt,
                                 "expected_answer" : expected_answer,
                                 "run_time_answer" : run_time_answer,
                                 "score" : int( 100*bleu_score ) } )
    return all_scores_arr

In [23]:
results = getBLEUScores( g_test_results_1, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,2
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,0


In [24]:
results = getBLEUScores( g_test_results_2, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",47
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,100


In [25]:
results = getBLEUScores( g_test_results_3, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,16
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",34


### 2.5 ROUGE
See: https://pypi.org/project/rouge-score

In [26]:
!pip install rouge-score | tail -n 1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Successfully installed rouge-score-0.1.2


In [38]:
from rouge_score import rouge_scorer

def getROUGEScores( test_results_arr, expected_answers_arr ):
    scorer = rouge_scorer.RougeScorer( ["rouge1", "rougeL" ], use_stemmer=True )
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        rouge_score = scorer.score( expected_answer, run_time_answer )
        all_scores_arr.append( { "id" : id,
                                 "question" : question_txt,
                                 "expected_answer" : expected_answer,
                                 "run_time_answer" : run_time_answer,
                                 "score" : int( 100*rouge_score["rougeL"].fmeasure ) } )
    return all_scores_arr

In [39]:
results = getROUGEScores( g_test_results_1, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,36
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,0


In [40]:
results = getROUGEScores( g_test_results_2, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",81
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,100


In [41]:
results = getROUGEScores( g_test_results_3, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,42
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",70


### 2.6 METEOR
See: https://www.nltk.org/howto/meteor.html

In [52]:
from nltk.translate.meteor_score import single_meteor_score

def getMETEORScores( test_results_arr, expected_answers_arr ):
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        meteor_score = single_meteor_score( expected_answer.split(), run_time_answer.split() )
        all_scores_arr.append( { "id" : id,
                                 "question" : question_txt,
                                 "expected_answer" : expected_answer,
                                 "run_time_answer" : run_time_answer,
                                 "score" : int( 100*meteor_score ) } )
    return all_scores_arr

In [53]:
results = getMETEORScores( g_test_results_1, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,22
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,0


In [54]:
results = getMETEORScores( g_test_results_2, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",84
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,99


In [55]:
results = getMETEORScores( g_test_results_3, g_test_questions )
df = pd.DataFrame( results )
df

Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,61
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",75


### 2.7 BertScore
See: https://pypi.org/project/bert-score

In [56]:
!pip install bert-score | tail -n 1

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Successfully installed bert-score-0.3.13


In [66]:
from bert_score import score

def getBertScores( test_results_arr, expected_answers_arr ):
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        P, R, F1 = score( [ expected_answer ], [ run_time_answer ], lang="en" )
        all_scores_arr.append( { "id" : id,
                                 "question" : question_txt,
                                 "expected_answer" : expected_answer,
                                 "run_time_answer" : run_time_answer,
                                 "score" : int( 100*F1 ) } )
    return all_scores_arr

In [67]:
results = getBertScores( g_test_results_1, g_test_questions )
df = pd.DataFrame( results )
df

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,82
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,83


In [68]:
results = getBertScores( g_test_results_2, g_test_questions )
df = pd.DataFrame( results )
df

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",97
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,100


In [69]:
results = getBertScores( g_test_results_3, g_test_questions )
df = pd.DataFrame( results )
df

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Unnamed: 0,id,question,expected_answer,run_time_answer,score
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,88
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",91


### 2.8 Large language model

See: [Foundation models Python library](https://ibm.github.io/watson-machine-learning-sdk/foundation_models.html)

### Prerequisites
Before you can prompt a foundation model in watsonx.ai, you must perform the following setup tasks:
- 2.1 Create an instance of the Watson Machine Learning service
- 2.2 Associate the Watson Machine Learning instance with the current project
- 2.3 Create an IBM Cloud API key
- 2.4 Look up the current project ID

#### 2.1 Create an instance of the Watson Machine Learning service
If you don't already have an instance of the IBM Watson Machine Learning service, you can create an instance of the service from the IBM Cloud catalog: [Watson Machine Learning service](https://cloud.ibm.com/catalog/services/watson-machine-learning)

#### 2.2 Associate an instance of the Watson Machine Learning service with the current project
The current project is the project in which you are running this notebook.

If an instance of Watson Machine Learning is not already associated with the current project, follow the instructions in this topic to do so: [Adding associated services to a project](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/assoc-services.html?context=wx&audience=wdp)

#### 2.3 Create an IBM Cloud API key
Create an IBM Cloud API key by following these instruction: [Creating an IBM Cloud API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui#create_user_key)

Then paste your new IBM Cloud API key in the code cell below.

In [89]:
cloud_apikey = ""

g_wml_credentials = { 
    "url"    : "https://us-south.ml.cloud.ibm.com", 
    "apikey" : cloud_apikey
}

#### 2.4 Look up the current project ID
The current project is the project in which you are running this notebook. You can get the ID of the current project programmatically by running the following cell.

In [78]:
import os

g_project_id = os.environ["PROJECT_ID"]

Now prompt a model to evaluate the regression test results ...

In [76]:
g_template_1 = """Determine whether the new answer is missing any relevant information found in the original answer. 

Question:
%s

Original answer:
%s

New answer:
%s

Is any relevant information found in the original answer missing from the new answer?
"""

g_template_2 = """Determine whether the new answer contains extranious, potentially confusing information not found in the original answer. 

Question:
%s

Original answer:
%s

New answer:
%s

Is any extranious, potentially confusing information found in the new answer that was absent from the original answer?
"""

g_template_3 = """Determine whether the new answer means the same thing as the original answer, even if they are worded differently. 

Question:
%s

Original answer:
%s

New answer:
%s

Does the new answer convey the same essential information as the original answer, even if they are worded differently?
"""

In [81]:
from ibm_watson_machine_learning.foundation_models import Model
import json
import re

def llmCheckAnswer( model_id, prompt_parameters, prompt_template, question_txt, answer_org, answer_new, b_debug=False ):
    model = Model( model_id, g_wml_credentials, prompt_parameters, g_project_id )
    prompt_text = prompt_template % ( question_txt, answer_org, answer_new )
    raw_response = model.generate( prompt_text )
    if b_debug:
        print( "prompt_text:\n'" + prompt_text + "'\n" )
        print( "raw_response:\n" + json.dumps( raw_response, indent=3 ) )
    if ( "results" in raw_response ) \
       and ( len( raw_response["results"] ) > 0 ) \
       and ( "generated_text" in raw_response["results"][0] ):
        output = raw_response["results"][0]["generated_text"]
        b_equivalent = True if re.search( r"yes", output, re.IGNORECASE ) else False
        return output, b_equivalent
    else:
        return "", None

In [83]:
g_model_id = "google/flan-t5-xxl"

g_prompt_parameters = {
    "decoding_method" : "greedy",
    "min_new_tokens"  : 0,
    "max_new_tokens"  : 20
}

question_txt = g_test_questions[0]["question_txt"]
answer_org = g_test_questions[0]["expected_answer_txt"]
answer_new = g_test_results_1[0]["run_time_answer_txt"]

output, b_equivalent = equivalentAnswers( g_model_id, g_prompt_parameters, g_template_1, question_txt, answer_org, answer_new )
print( "Question:\n" + question_txt + "\n" )
print( "Expected answer:\n" + answer_org + "\n" )
print( "Run time answer:\n" + answer_new + "\n" )
print( "Run-time answer missing info: " + str( b_equivalent ) )

Question:
How tall do cucumbers grow?

Expected answer:
Cucumber plants can grow as high as 6 feet

Run time answer:
6 feet

Run-time answer missing info: True


In [84]:
def getLLMScores( model_id, prompt_parameters, test_results_arr, expected_answers_arr ):
    all_scores_arr = []
    for i in range( len( test_results_arr ) ):
        id = test_results_arr[i]["id"]
        question_txt = expected_answers_arr[i]["question_txt"]
        run_time_answer = test_results_arr[i]["run_time_answer_txt"]
        expected_answer = expected_answers_arr[i]["expected_answer_txt"]
        output_1, b_equivalent_1 = equivalentAnswers( model_id, prompt_parameters, g_template_1, question_txt, expected_answer, run_time_answer )
        output_2, b_equivalent_2 = equivalentAnswers( model_id, prompt_parameters, g_template_2, question_txt, expected_answer, run_time_answer )
        output_3, b_equivalent_3 = equivalentAnswers( model_id, prompt_parameters, g_template_3, question_txt, expected_answer, run_time_answer )
        all_scores_arr.append( { "id"       : id,
                                 "question" : question_txt,
                                 "expected_answer"   : expected_answer,
                                 "run_time_answer"   : run_time_answer,
                                 "missing_info"      : "FAIL" if b_equivalent_1 else "PASS",
                                 "irrelivant_info"   : "FAIL" if b_equivalent_2 else "PASS",
                                 "different_meaning" : "FAIL" if not b_equivalent_1 else "PASS" } )        
    return all_scores_arr

In [86]:
results = getLLMScores( g_model_id, g_prompt_parameters, g_test_results_1, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,missing_info,irrelivant_info,different_meaning
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet,FAIL,PASS,PASS
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Yes,PASS,PASS,FAIL


In [87]:
results = getLLMScores( g_model_id, g_prompt_parameters, g_test_results_2, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,missing_info,irrelivant_info,different_meaning
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,"Cucumber plants like to climb, and can grow as...",FAIL,FAIL,PASS
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,Most tomato plants do well in containers.,PASS,PASS,FAIL


In [88]:
results = getLLMScores( g_model_id, g_prompt_parameters, g_test_results_3, g_test_questions )
pd.DataFrame( results )

Unnamed: 0,id,question,expected_answer,run_time_answer,missing_info,irrelivant_info,different_meaning
0,test_01,How tall do cucumbers grow?,Cucumber plants can grow as high as 6 feet,6 feet. But note that it's actually the cucumb...,FAIL,FAIL,PASS
1,test_02,Can I grow tomatoes in containers,Most tomato plants do well in containers.,"Yes ( because the article explicitly says "" ...",PASS,PASS,FAIL
