# Content Design for RAG
This notebook is part of a collection of material related to content design principles for retrieval-augmented generation (RAG).

You can explore the complete collection here: [Content Design for RAG on GitHub](https://github.com/spackows/ICAAI-2024_RAG-CD/blob/main/README.md)

**Example scenario**

Imagine your company sells seeds and gardening supplies online.  On your website, you have articles with gardening information and advice.  You are building a RAG solution for your company website that can answer customer questions about your products, using your website articles as a knowledge base.

# Test topics
Your RAG solution cannot work as well as possible if the articles in the knowledge are not optimized for RAG.  One way to optimize content for RAG is to systematicaly test whether the content answers known user questions.

This sample notebook demonstrates a simple approach this problem: generate questions from the knowledge base articles and compare those generated questions with the known user questions.

**Contents**
1. Known user questions and relevant knowledge base article
2. Generate questions answered by the articles
3. Compare generated questions with user questions

## 1. Known user questions and relevant knowledge base article
Imagine you have collected the following real user questions:

In [1]:
g_user_questions_arr = [
    { "id" : "q_123", "txt" : "What are the different cucumber varieties?" },
    { "id" : "q_456", "txt" : "How tall do cucumbers grow?" },
    { "id" : "q_789", "txt" : "Can beginners grow cucumbers?" },
    { "id" : "q_012", "txt" : "Can you grow cucumbers in containers?" }
]

Imagine this is the content of the relevant article:

In [2]:
g_article_txt = """
## All things cucumber 
Cucumbers are popular for gardeners - beginners and advanced alike. 
They grow well in traditional garden beds, raised beds, and even containers on decks or balconies. 
Cucumber plants like to climb, and can grow as high as 6 feet. 
"""

## 2. Generate questions answered by the articles

See: [Foundation models Python library](https://ibm.github.io/watson-machine-learning-sdk/foundation_models.html)

### Prerequisites
Before you can prompt a foundation model in watsonx.ai, you must perform the following setup tasks:
- 2.1 Create an instance of the Watson Machine Learning service
- 2.2 Associate the Watson Machine Learning instance with the current project
- 2.3 Create an IBM Cloud API key
- 2.4 Look up the current project ID

#### 2.1 Create an instance of the Watson Machine Learning service
If you don't already have an instance of the IBM Watson Machine Learning service, you can create an instance of the service from the IBM Cloud catalog: [Watson Machine Learning service](https://cloud.ibm.com/catalog/services/watson-machine-learning)

#### 2.2 Associate an instance of the Watson Machine Learning service with the current project
The current project is the project in which you are running this notebook.

If an instance of Watson Machine Learning is not already associated with the current project, follow the instructions in this topic to do so: [Adding associated services to a project](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/assoc-services.html?context=wx&audience=wdp)

#### 2.3 Create an IBM Cloud API key
Create an IBM Cloud API key by following these instruction: [Creating an IBM Cloud API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui#create_user_key)

Then paste your new IBM Cloud API key in the code cell below.

In [1]:
cloud_apikey = ""

g_wml_credentials = { 
    "url"    : "https://us-south.ml.cloud.ibm.com", 
    "apikey" : cloud_apikey
}

#### 2.4 Look up the current project ID
The current project is the project in which you are running this notebook. You can get the ID of the current project programmatically by running the following cell.

In [2]:
import os

g_project_id = os.environ["PROJECT_ID"]

Now prompt a model to evaluate the regression test results ...

In [25]:
g_template = """Article
----
## Growing tomatoes in pots 
Most tomato plants do well in containers. 
Determinate varieties, don't grow as large as indeterminate varieties. 
For anything other than compact determinate varieties, use a 5 gallon container at a minimum. 
----

Five (5) questions answered by the article:
1. Can you grow tomatoes in containers?
2. Which is larger, determinate varieties or indeterminate varieties?
3. Do tomatoes do well in pots?
4. Do determinate varieties grow as large as indeterminate ones?
5. What size of container is right for tomatoes?


Article
----
## Growing tomatoes in pots 
Most tomato plants do well in containers. 
Determinate varieties, don't grow as large as indeterminate varieties. 
For anything other than compact determinate varieties, use a 5 gallon container at a minimum. 
----

Five (5) questions answered by the article:
1. How large a pot do tomatoes require?
2. What is the minimum size of pot required to grow tomatoes?
3. Is a 8 gallon container large enough for tomatoes?
4. Can tomatoes grow in pots?
5. How well do tomatoes grow in containers?


Article
----
%s
----

Five (5) questions answered by the article:
"""

In [32]:
from ibm_watson_machine_learning.foundation_models import Model
import json
import re

def generateQuestions( model_id, prompt_parameters, prompt_template, article_txt, b_debug=False ):
    model = Model( model_id, g_wml_credentials, prompt_parameters, g_project_id )
    prompt_text = prompt_template % ( article_txt )
    raw_response = model.generate( prompt_text )
    if b_debug:
        print( "prompt_text:\n'" + prompt_text + "'\n" )
        print( "raw_response:\n" + json.dumps( raw_response, indent=3 ) )
    if ( "results" in raw_response ) \
       and ( len( raw_response["results"] ) > 0 ) \
       and ( "generated_text" in raw_response["results"][0] ):
        output = raw_response["results"][0]["generated_text"]
        questions_arr = re.findall( r"(\d\..*)", output )
        questions_arr = [ re.sub( r"^.*?\d+\.\s*", "", q ) for q in questions_arr ]
        return output, questions_arr
    else:
        return "", []

In [43]:
g_model_id = "meta-llama/llama-3-70b-instruct"

g_prompt_parameters = {
    "decoding_method" : "sample",
    "temperature"     : 0.7,
    "top_p"           : 1,
    "top_k"           : 50,
    "min_new_tokens"  : 0,
    "max_new_tokens"  : 200,
    "stop_sequences"  : [ "\n\n" ],
    "random_seed"     : 1131207194
}

output, gen_q_arr_1 = generateQuestions( g_model_id, g_prompt_parameters, g_template, g_article_txt )
print( "\nArticle:\n" + g_article_txt );
print( "Questions:\n\n" + json.dumps( gen_q_arr_1, indent=3 ) + "\n" )


Article:

## All things cucumber 
Cucumbers are popular for gardeners - beginners and advanced alike. 
They grow well in traditional garden beds, raised beds, and even containers on decks or balconies. 
Cucumber plants like to climb, and can grow as high as 6 feet. 

Questions:

[
   "Can cucumbers grow in traditional garden beds?",
   "Can cucumbers be grown in containers?",
   "Do cucumber plants grow in raised beds?",
   "Can cucumber plants climb?",
   "What is the maximum height cucumber plants can grow to?"
]



In [44]:
g_prompt_parameters["random_seed"] = 4014796558
output, gen_q_arr_2 = generateQuestions( g_model_id, g_prompt_parameters, g_template, g_article_txt )

g_prompt_parameters["random_seed"] = 383630822
output, gen_q_arr_3 = generateQuestions( g_model_id, g_prompt_parameters, g_template, g_article_txt )

g_prompt_parameters["random_seed"] = 2927515183
output, gen_q_arr_4 = generateQuestions( g_model_id, g_prompt_parameters, g_template, g_article_txt )

In [45]:
generated_questions_arr = [ gen_q_arr_1 + gen_q_arr_2 + gen_q_arr_3 + gen_q_arr_4 ]
print( json.dumps( generated_questions_arr, indent=3 ) )

[
   [
      "Can cucumbers grow in traditional garden beds?",
      "Can cucumbers be grown in containers?",
      "Do cucumber plants grow in raised beds?",
      "Can cucumber plants climb?",
      "What is the maximum height cucumber plants can grow to?",
      "Can you grow cucumbers in containers?",
      "Can cucumbers grow in raised beds?",
      "Can beginners grow cucumbers?",
      "Can cucumbers climb?",
      "What is the maximum height of cucumber plants?",
      "Can cucumbers be grown on decks or balconies?",
      "How tall can cucumber plants grow?",
      "Do cucumbers grow in traditional garden beds?",
      "Do cucumbers grow well in raised beds?",
      "Can cucumber plants climb?",
      "Are cucumbers easy to grow?",
      "Where can you grow cucumbers?",
      "Do cucumbers grow in containers?",
      "How tall can cucumber plants grow?",
      "Do cucumber plants like to climb?"
   ]
]


## 3. Compare generated questions with user questions

In [None]:
!pip install sentence-transformers | tail -n 1

In [54]:
from sentence_transformers import SentenceTransformer, util

In [55]:
import numpy as np

st_model = SentenceTransformer( "all-MiniLM-L6-v2" )

def sentenceTransformerScore( txt1, txt2 ):
    txt1_embeddings  = st_model.encode( [ txt1 ],  convert_to_tensor=True )
    txt2_embeddings = st_model.encode( [ txt2 ], convert_to_tensor=True )
    cosine_scores = util.cos_sim( txt1_embeddings, txt2_embeddings )
    sentence_transformers_score_arr = [ round( float( x ), 2 ) for x in cosine_scores[0] ]
    score = int( 100*sentence_transformers_score_arr[0] )
    return score

In [62]:
def compareQuestions( user_questions, generated_questions ):
    high_scores = []
    for user_question in user_questions:
        user_question_id = user_question["id"]
        user_question_txt = user_question["txt"]
        high_score = 0
        for generated_question in generated_questions:
            score = sentenceTransformerScore( user_question_txt, generated_question )
            if( score > high_score ):
                high_score = score
        answered = "YES" if high_score > 60 else "NO"
        high_scores.append( { "user_question_id"  : user_question_id,
                              "user_question_txt" : user_question_txt,
                              "answered"          : answered,
                              "high_score"        : high_score } )
    high_scores.sort( key=lambda s: s["high_score"] )
    return high_scores

In [63]:
import pandas as pd

results = compareQuestions( g_user_questions_arr, generated_questions_arr )
print( "\nArticle:\n" + g_article_txt + "\n" )
print( "User questions answered:\n" )
pd.DataFrame( results )


Article:

## All things cucumber 
Cucumbers are popular for gardeners - beginners and advanced alike. 
They grow well in traditional garden beds, raised beds, and even containers on decks or balconies. 
Cucumber plants like to climb, and can grow as high as 6 feet. 


User questions answered:



Unnamed: 0,user_question_id,user_question_txt,answered,high_score
0,q_123,What are the different cucumber varieties?,NO,57
1,q_456,How tall do cucumbers grow?,YES,64
2,q_789,Can beginners grow cucumbers?,YES,73
3,q_012,Can you grow cucumbers in containers?,YES,89
