# Content Design for RAG
This notebook is part of a collection of material related to content design principles for retrieval-augmented generation (RAG).

You can explore the complete collection here: [Content Design for RAG on GitHub](https://github.com/spackows/ICAAI-2024_RAG-CD/blob/main/README.md)

**Example scenario**

Imagine your company sells seeds and gardening supplies online.  On your website, you have articles with gardening information and advice.  You are building a RAG solution for your company website that can answer customer questions about your products, using your website articles as a knowledge base.

# Boost search
Your customers will often use "incorrect" terms in their submitted questions.  For example, they'll use jargon, terms from your competitors or other domains, or they'll misspell words.  Also, when a user's mental model or assumptions don't match the information in your articles, the way the person asks their question will not match the content in your articles.  All of these factors can cause search to underperform.

This sample notebook demonstrates two simple approaches to this problem: adding synonyms and rephrasing the user input.

**Contents**
1. Sample knowledge base
2. Simple search function
3. Adding synonyms
4. Rephrasing the query

## 1. Sample knowledge base
Imagine your knowledge base contains the following four articles:

In [155]:
g_articles = [
    { "id" : "123", "title" : "Container gardening"          },
    { "id" : "456", "title" : "For the love of tomatoes"     },
    { "id" : "789", "title" : "Cultivating tomatoes in pots" },
    { "id" : "012", "title" : "All things cucumber"          }
]

## 2. Simple search function
To keep things simple, imagine your search component is simply title-based string matching:

In [None]:
!pip install thefuzz | tail -n 1

In [157]:
from thefuzz import fuzz

def search( txt ):
    matches = []
    for article in g_articles:
        score = fuzz.token_set_ratio( txt, article["title"] )
        matches.append( { "score" : score, "title" : article["title"] } )
    matches.sort( key = lambda match: match["score"], reverse=True )
    return matches

In [158]:
search( "tomato" )

[{'score': 40, 'title': 'For the love of tomatoes'},
 {'score': 35, 'title': 'Cultivating tomatoes in pots'},
 {'score': 16, 'title': 'Container gardening'},
 {'score': 16, 'title': 'All things cucumber'}]
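
To see why the search function above relies on fuzzy matching rather than exact word matching, consider a stdlib-only sketch of the strictest possible title matcher. The `token_overlap` and `rank` helpers are hypothetical and are not part of `thefuzz`; they only count exact word matches.

```python
# Hypothetical exact-match scorer, for contrast with fuzz.token_set_ratio.
titles = [
    "Container gardening",
    "For the love of tomatoes",
    "Cultivating tomatoes in pots",
    "All things cucumber",
]

def token_overlap(query, title):
    """Return the share of query words that appear verbatim in the title."""
    query_words = set(query.lower().split())
    title_words = set(title.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & title_words) / len(query_words)

def rank(query):
    """Sort titles from best to worst exact-word overlap."""
    return sorted(titles, key=lambda t: token_overlap(query, t), reverse=True)

# "tomato" is not the same word as "tomatoes", so exact matching finds nothing.
print(token_overlap("tomato", "For the love of tomatoes"))                # 0.0
print(token_overlap("tomatoes in pots", "Cultivating tomatoes in pots"))  # 1.0
print(rank("tomatoes in pots")[0])  # Cultivating tomatoes in pots
```

With exact matching, the singular "tomato" scores zero against every title, whereas the fuzzy scorer used in this notebook still ranks the two tomato articles first.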

## 3. Adding synonyms
As you systematically review questions being submitted to your RAG solution, you can collect synonyms people are using.

Imagine you have collected synonyms from the following user questions:

In [159]:
g_historical_questions = [
    { 
        "id" : "q01", 
        "question_txt" : "Can you grow tomatoes in pots?", 
        "synonyms" : {
            "pot"  : "container",
            "pots" : "containers"
        }
    },
    {
        "id" : "q02", 
        "question_txt" : "I want to grow veggies on my deck", 
        "synonyms" : {
            "veggie"  : "vegetable",
            "veggies" : "vegetables"
        }
    },
    { 
        "id" : "q03", 
        "question_txt" : "Do cukes do well in shade?",
        "synonyms" : {
            "cuke"  : "cucumber",
            "cukes" : "cucumbers"
        }
    }
]

In [160]:
def collectSynonyms( questions ):
    synonyms = {}
    for question in questions:
        for term in question["synonyms"].keys():
            if term not in synonyms:
                synonyms[ term ] = question["synonyms"][ term ]
    return synonyms

In [161]:
g_synonyms = collectSynonyms( g_historical_questions )
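
As a quick check of what the merge produces, the following self-contained sketch repeats the same first-definition-wins logic on a trimmed copy of the data. The `collect_synonyms` name and the conflicting `q03` entry are illustrative assumptions, not part of the notebook.

```python
questions = [
    {"id": "q01", "synonyms": {"pot": "container", "pots": "containers"}},
    {"id": "q02", "synonyms": {"veggies": "vegetables"}},
    # Hypothetical conflict: a later question maps "pot" differently.
    {"id": "q03", "synonyms": {"pot": "planter"}},
]

def collect_synonyms(questions):
    """Merge per-question synonym maps into one dictionary."""
    merged = {}
    for question in questions:
        for term, replacement in question["synonyms"].items():
            # setdefault keeps the first mapping seen for a term.
            merged.setdefault(term, replacement)
    return merged

synonyms = collect_synonyms(questions)
print(synonyms)
# {'pot': 'container', 'pots': 'containers', 'veggies': 'vegetables'}
```

Because the first mapping wins, the order in which questions are reviewed determines which synonym is kept when two reviewers disagree.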

The following function simply adds synonyms to questions to increase the chance of search finding related articles:

In [162]:
import re
    
def addSynonyms( txt, synonyms ):
    words_arr = txt.split()
    final_words_arr = []
    for word in words_arr:
        final_words_arr.append( word )
        if( word in synonyms ):
            extra_txt = "( " + synonyms[ word ] + " )"
            final_words_arr.append( extra_txt )
    return " ".join( final_words_arr )
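
A minimal usage sketch for this kind of expansion, written to be self-contained: `expand_with_synonyms` is a hypothetical stand-in that mirrors the logic of `addSynonyms` above, and the small dictionary is a subset of the synonyms collected earlier.

```python
def expand_with_synonyms(txt, synonyms):
    """Append '( synonym )' after every word that has a known synonym."""
    expanded = []
    for word in txt.split():
        expanded.append(word)
        if word in synonyms:
            expanded.append("( " + synonyms[word] + " )")
    return " ".join(expanded)

synonyms = {"pots": "containers", "cukes": "cucumbers"}

print(expand_with_synonyms("can you grow tomatoes in pots", synonyms))
# can you grow tomatoes in pots ( containers )

# The lookup is exact, so capitals and attached punctuation block it.
print(expand_with_synonyms("Do Cukes like pots?", synonyms))
# Do Cukes like pots?
```

The second call shows why the notebook lowercases the text and strips punctuation before adding synonyms.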

Compare searching with the original text and with synonyms added:

In [163]:
import json
import pandas as pd

    
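def cleanText( txt ):
    # Lowercase, replace anything other than letters, digits, and hyphens with a space,
    # then collapse repeated spaces and trim the ends
    txt = txt.lower()
    txt = re.sub( r"[^a-z0-9\-]", " ", txt )
    txt = re.sub( r" +", " ", txt ).strip()
    #txt = removeStopWords( txt )
    return txt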
    
def searchQuestions( questions ):
    all_search_results = []
    for question in questions:
        question_txt_org = cleanText( question["question_txt"] )
        search_result_org = search( question_txt_org )
        question_txt_syn = addSynonyms( question_txt_org, g_synonyms )
        search_result_syn = search( question_txt_syn )
        all_search_results.append( { "q_org"    : question_txt_org, 
                                     "hits_org" : search_result_org,
                                     "q_syn"    : question_txt_syn, 
                                     "hits_syn" : search_result_syn } )
    return all_search_results
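
To illustrate the normalization step that runs before search, here is a self-contained sketch. The `normalize` name is hypothetical; it applies the same kind of regular-expression cleanup as `cleanText` and trims the result.

```python
import re

def normalize(txt):
    """Lowercase, keep only a-z, 0-9, and hyphens, and tidy the spacing."""
    txt = txt.lower()
    txt = re.sub(r"[^a-z0-9\-]", " ", txt)   # everything else becomes a space
    txt = re.sub(r" +", " ", txt)            # collapse runs of spaces
    return txt.strip()

print(repr(normalize("Can you grow tomatoes in pots?")))
# 'can you grow tomatoes in pots'
print(repr(normalize("Is a 12-inch pot OK?")))
# 'is a 12-inch pot ok'
print(repr(normalize("Jalapeño seeds")))
# 'jalape o seeds'
```

The last call shows a limitation of an ASCII-only character class: accented letters are replaced with spaces, which may matter if your customers write in other languages.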

In [164]:

def hitsList( hits_arr ):
    ul_html = "<ul style='margin: 0px;'>"
    for hit in hits_arr:
        ul_html += "<li>[ " + str( hit["score"] ) + " ] " + hit["title"] + "</li>"
    ul_html += "</ul>"
    return ul_html
    
def resultsTable( search_results ):
    css = "style='text-align: left; vertical-align: top; margin: 0px 20px 0px 20px;'"
    html = """<table>
<tr>
<th style="text-align: left;">Question (org)</th>
<th style="text-align: left;">Search results (org)</th>
<th style="text-align: left;">Question (synonyms)</th>
<th style="text-align: left;">Search results (synonyms)</th>
</tr>"""
    for result in search_results:
        html += "<tr>" + \
                "<td " + css + ">" + result["q_org"] + "</td>" + \
                "<td " + css + ">" + hitsList( result["hits_org"] ) + "</td>" + \
                "<td " + css + ">" + result["q_syn"] + "</td>" + \
                "<td " + css + ">" + hitsList( result["hits_syn"] ) + "</td>" + \
                "</tr>"
    html += "</table>"
    return html
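
The helpers above insert titles and question text into HTML as-is. If your real article titles could contain characters such as `&` or `<`, one option is to escape them first. The sketch below assumes a hypothetical `hits_list` helper and a made-up title; it uses `html.escape` from the Python standard library.

```python
import html

def hits_list(hits):
    """Build an HTML list of hits, escaping each title."""
    items = "".join(
        "<li>[ " + str(hit["score"]) + " ] " + html.escape(hit["title"]) + "</li>"
        for hit in hits
    )
    return "<ul style='margin: 0px;'>" + items + "</ul>"

# Made-up title containing characters that are special in HTML.
sample_hits = [{"score": 74, "title": "Herbs & spices <indoors>"}]
print(hits_list(sample_hits))
# <ul style='margin: 0px;'><li>[ 74 ] Herbs &amp; spices &lt;indoors&gt;</li></ul>
```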

In [165]:
search_results = searchQuestions( g_historical_questions )

html = resultsTable( search_results )

In [166]:
from IPython.display import display, HTML

HTML( html )

| Question (org) | Search results (org) | Question (synonyms) | Search results (synonyms) |
|---|---|---|---|
| can you grow tomatoes in pots | [ 74 ] Cultivating tomatoes in pots<br>[ 53 ] For the love of tomatoes<br>[ 33 ] Container gardening<br>[ 29 ] All things cucumber | can you grow tomatoes in pots ( containers ) | [ 73 ] Cultivating tomatoes in pots<br>[ 50 ] For the love of tomatoes<br>[ 47 ] Container gardening<br>[ 31 ] All things cucumber |
| i want to grow veggies on my deck | [ 36 ] Cultivating tomatoes in pots<br>[ 35 ] Container gardening<br>[ 35 ] For the love of tomatoes<br>[ 27 ] All things cucumber | i want to grow veggies ( vegetables ) on my deck | [ 33 ] Cultivating tomatoes in pots<br>[ 32 ] Container gardening<br>[ 32 ] For the love of tomatoes<br>[ 22 ] All things cucumber |
| do cukes do well in shade | [ 40 ] Cultivating tomatoes in pots<br>[ 39 ] Container gardening<br>[ 35 ] For the love of tomatoes<br>[ 34 ] All things cucumber | do cukes ( cucumbers ) do well in shade | [ 47 ] All things cucumber<br>[ 37 ] Cultivating tomatoes in pots<br>[ 36 ] For the love of tomatoes<br>[ 31 ] Container gardening |


## 4. Rephrasing the query
As you systematically review questions being submitted to your RAG solution, you can collect examples of common misunderstandings that could be clarified or rephrased to improve search results.

This section demonstrates a simple method for using a large language model (LLM) to rewrite questions to improve search.

- 4.1 Write prompt text
- 4.2 Prompt an LLM
- 4.3 Test searching rewritten queries

### 4.1 Write prompt text
The following prompt describes two common points of confusion and instructs a large language model to identify the concept being mentioned in the given question:

In [167]:
g_template = """Identify whether the given user question is about one of the following concepts:

Concept: container gardening
Description: Growing plants some place other than an in-ground or raised garden bed.  
For example: growing on a balcony or deck and using hanging baskets, planters, pots, or containers.

Concept: soil composition
Description: The material plants grow in, including soil as well as additions like peat moss, bark, 
and vermiculite.  Features include soil type (sand, silt, and clay), whether the soil is compact or loose, 
pH (how acidic or alkaline), nutrients, and drainage (ability to retain moisture).

If the user question is not about one of these concepts, say "none"

User input: How large are sunflowers?
none

User input: What can I grow on my balcany?
container gardening

User input: Are cucumbers annuals?
none

User input: What kind of dirt do I need?
soil composition

User input: %s
"""

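The `%s` at the end of the template is a placeholder for Python's `%` string formatting. The sketch below uses a shortened, hypothetical template to show the substitution; it does not call a model.

```python
# Shortened stand-in for the full template, for illustration only.
short_template = """If the user question is not about one of these concepts, say "none"

User input: What kind of dirt do I need?
soil composition

User input: %s
"""

question = "how large a pot do I need for growing peppers"

# The % operator replaces %s with the question text.
prompt_text = short_template % question

print(prompt_text)
```

The filled-in prompt ends with the new question on its own `User input:` line, so the model's continuation is the concept label. With this formatting style, a literal percent sign elsewhere in the template would need to be written as `%%`.
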
### 4.2 Prompt an LLM

See: [Foundation models Python library](https://ibm.github.io/watson-machine-learning-sdk/foundation_models.html)

### Prerequisites
Before you can prompt a foundation model in watsonx.ai, you must complete setup tasks 4.2.1 through 4.2.4. Step 4.2.5 then prompts the model:
- 4.2.1 Create an instance of the Watson Machine Learning service
- 4.2.2 Associate an instance of the Watson Machine Learning service with the current project
- 4.2.3 Create an IBM Cloud API key
- 4.2.4 Look up the current project ID
- 4.2.5 Prompt an LLM

#### 4.2.1 Create an instance of the Watson Machine Learning service
If you don't already have an instance of the IBM Watson Machine Learning service, you can create an instance of the service from the IBM Cloud catalog: [Watson Machine Learning service](https://cloud.ibm.com/catalog/services/watson-machine-learning)

#### 4.2.2 Associate an instance of the Watson Machine Learning service with the current project
The current project is the project in which you are running this notebook.

If an instance of Watson Machine Learning is not already associated with the current project, follow the instructions in this topic to do so: [Adding associated services to a project](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/assoc-services.html?context=wx&audience=wdp)

#### 4.2.3 Create an IBM Cloud API key
Create an IBM Cloud API key by following these instructions: [Creating an IBM Cloud API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui#create_user_key)

Then paste your new IBM Cloud API key in the code cell below.

In [168]:
cloud_apikey = ""

g_wml_credentials = { 
    "url"    : "https://us-south.ml.cloud.ibm.com", 
    "apikey" : cloud_apikey
}

#### 4.2.4 Look up the current project ID
The current project is the project in which you are running this notebook. You can get the ID of the current project programmatically by running the following cell.

In [169]:
import os

g_project_id = os.environ["PROJECT_ID"]
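
If you run this notebook somewhere the `PROJECT_ID` environment variable is not set, the lookup above raises a `KeyError`, and pasting an API key into a notebook makes it easy to share by accident. One possible alternative, sketched below, reads required values from environment variables and fails with a clear message. The `read_setting` helper and the `IBM_CLOUD_APIKEY` variable name are assumptions for illustration, not names defined by IBM.

```python
import os

def read_setting(name, env=os.environ):
    """Return a required setting, or raise an error that names what is missing."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError("Missing required setting: " + name)
    return value

# Demonstrated with a plain dict so the sketch runs anywhere.
fake_env = {"IBM_CLOUD_APIKEY": "example-key", "PROJECT_ID": ""}

print(read_setting("IBM_CLOUD_APIKEY", fake_env))  # example-key

try:
    read_setting("PROJECT_ID", fake_env)
except RuntimeError as err:
    print(err)  # Missing required setting: PROJECT_ID
```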

For reference, you can list the supported models:

In [170]:
from ibm_watson_machine_learning.foundation_models.utils.enums import ModelTypes

model_ids = list( map( lambda e: e.value, ModelTypes._member_map_.values() ) )
model_ids

['google/flan-t5-xxl',
 'google/flan-ul2',
 'bigscience/mt0-xxl',
 'eleutherai/gpt-neox-20b',
 'ibm/mpt-7b-instruct2',
 'bigcode/starcoder',
 'meta-llama/llama-2-70b-chat',
 'meta-llama/llama-2-13b-chat',
 'ibm/granite-13b-instruct-v1',
 'ibm/granite-13b-chat-v1',
 'google/flan-t5-xl',
 'ibm/granite-13b-chat-v2',
 'ibm/granite-13b-instruct-v2',
 'elyza/elyza-japanese-llama-2-7b-instruct',
 'ibm-mistralai/mixtral-8x7b-instruct-v01-q',
 'codellama/codellama-34b-instruct-hf',
 'ibm/granite-20b-multilingual']

#### 4.2.5 Prompt an LLM

In [171]:
from ibm_watson_machine_learning.foundation_models import Model
import json

g_model_id = "google/flan-t5-xxl"

g_prompt_parameters = {
    "decoding_method" : "greedy",
    "min_new_tokens"  : 0,
    "max_new_tokens"  : 20
}

def concept( prompt_template, input_txt, b_debug=False ):
    model = Model( g_model_id, g_wml_credentials, g_prompt_parameters, g_project_id )
    prompt_text = prompt_template % input_txt
    raw_response = model.generate( prompt_text )
    if b_debug:
        print( "prompt_text:\n'" + prompt_text + "'\n" )
        print( "raw_response:\n" + json.dumps( raw_response, indent=3 ) )
    if ( "results" in raw_response ) \
       and ( len( raw_response["results"] ) > 0 ) \
       and ( "generated_text" in raw_response["results"][0] ):
        return raw_response["results"][0]["generated_text"]
    else:
        return ""
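
The response handling at the end of `concept` can be exercised without calling a model. The sketch below uses hand-written sample dictionaries containing only the keys that the code above checks for (`results` and `generated_text`); a real response may carry additional fields. The `extract_text` helper is hypothetical.

```python
def extract_text(raw_response):
    """Return the first generated text, or "" if the expected keys are absent."""
    results = raw_response.get("results") or []
    if results and "generated_text" in results[0]:
        # Strip whitespace so later comparisons are not thrown off by padding.
        return results[0]["generated_text"].strip()
    return ""

# Hand-written samples, not real model output.
sample_ok = {"results": [{"generated_text": " container gardening\n"}]}
sample_empty = {"results": []}

print(repr(extract_text(sample_ok)))     # 'container gardening'
print(repr(extract_text(sample_empty)))  # ''
print(repr(extract_text({})))            # ''
```

Returning an empty string for unexpected responses keeps the calling code simple: anything that is not a known concept label is treated the same way as "none".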

In [172]:
txt = "how large a pot do I need for growing peppers"
concept_type = concept( g_template, txt )
print( "Concept: " + concept_type )

Concept: container gardening


### 4.3 Test searching rewritten queries

In [174]:
def searchQuestions( questions ):
    all_search_results = []
    for question in questions:
        question_txt_org = cleanText( question["question_txt"] )
        search_result_org = search( question_txt_org )
        question_txt_syn = addSynonyms( question_txt_org, g_synonyms )
        concept_type = concept( g_template, question_txt_syn )
        if( re.match( r"container gardening|soil composition", concept_type, re.IGNORECASE ) ):
            question_txt_syn = "[ " + concept_type + " ] " + question_txt_syn
        search_result_syn = search( question_txt_syn )
        all_search_results.append( { "q_org"    : question_txt_org, 
                                     "hits_org" : search_result_org,
                                     "q_syn"    : question_txt_syn, 
                                     "hits_syn" : search_result_syn } )
    return all_search_results

def resultsTable( search_results ):
    css = "style='text-align: left; vertical-align: top; margin: 0px 20px 0px 20px;'"
    html = """<table>
<tr>
<th style="text-align: left;">Question (org)</th>
<th style="text-align: left;">Search results (org)</th>
<th style="text-align: left;">Question (synonyms)</th>
<th style="text-align: left;">Search results (synonyms, rewritten)</th>
</tr>"""
    for result in search_results:
        html += "<tr>" + \
                "<td " + css + ">" + result["q_org"] + "</td>" + \
                "<td " + css + ">" + hitsList( result["hits_org"] ) + "</td>" + \
                "<td " + css + ">" + result["q_syn"] + "</td>" + \
                "<td " + css + ">" + hitsList( result["hits_syn"] ) + "</td>" + \
                "</tr>"
    html += "</table>"
    return html
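
The tagging step in `searchQuestions` can be tried without a model by passing in the concept label directly. In this sketch, `tag_query` is a hypothetical helper. Unlike the notebook's `re.match` call, which only anchors at the start of the string, it strips whitespace and uses `re.fullmatch` so that only an exact label triggers the prefix.

```python
import re

KNOWN_CONCEPTS = r"container gardening|soil composition"

def tag_query(concept_type, query_txt):
    """Prefix the query with the concept label when the label is recognized."""
    label = concept_type.strip()
    if re.fullmatch(KNOWN_CONCEPTS, label, re.IGNORECASE):
        return "[ " + label + " ] " + query_txt
    return query_txt

print(tag_query("container gardening", "i want to grow veggies ( vegetables ) on my deck"))
# [ container gardening ] i want to grow veggies ( vegetables ) on my deck

print(tag_query("none", "do cukes ( cucumbers ) do well in shade"))
# do cukes ( cucumbers ) do well in shade
```

Because the label text is added to the query, an article whose title matches the concept name, such as "Container gardening", is likely to score much higher in title-based search.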

In [175]:
search_results = searchQuestions( g_historical_questions )

html = resultsTable( search_results )

HTML( html )

| Question (org) | Search results (org) | Question (synonyms) | Search results (synonyms, rewritten) |
|---|---|---|---|
| can you grow tomatoes in pots | [ 74 ] Cultivating tomatoes in pots<br>[ 53 ] For the love of tomatoes<br>[ 33 ] Container gardening<br>[ 29 ] All things cucumber | [ container gardening ] can you grow tomatoes in pots ( containers ) | [ 100 ] Container gardening<br>[ 73 ] Cultivating tomatoes in pots<br>[ 50 ] For the love of tomatoes<br>[ 28 ] All things cucumber |
| i want to grow veggies on my deck | [ 36 ] Cultivating tomatoes in pots<br>[ 35 ] Container gardening<br>[ 35 ] For the love of tomatoes<br>[ 27 ] All things cucumber | [ container gardening ] i want to grow veggies ( vegetables ) on my deck | [ 100 ] Container gardening<br>[ 37 ] Cultivating tomatoes in pots<br>[ 30 ] For the love of tomatoes<br>[ 24 ] All things cucumber |
| do cukes do well in shade | [ 40 ] Cultivating tomatoes in pots<br>[ 39 ] Container gardening<br>[ 35 ] For the love of tomatoes<br>[ 34 ] All things cucumber | do cukes ( cucumbers ) do well in shade | [ 47 ] All things cucumber<br>[ 37 ] Cultivating tomatoes in pots<br>[ 36 ] For the love of tomatoes<br>[ 31 ] Container gardening |
