# Content Design for RAG
This notebook is part of a collection of material related to content design principles for retrieval-augmented generation (RAG).

You can explore the complete collection here: [Content Design for RAG on GitHub](https://github.com/spackows/ICAAI-2024_RAG-CD/blob/main/README.md)

## Answer Natural Questions
This notebook demonstrates how to improve results on a subset of the [Natural Questions](https://ai.google.com/research/NaturalQuestions) benchmark by rewriting knowledge base content.

**Contents**
1. Download source files
2. Split articles into chunks
3. Create search component
4. Identify relevant chunks
5. Prompt LLM to answer questions
6. Evaluate results
7. Compile all results
8. Save results

## 1. Download source files

In [1]:
g_file_names_arr = [
"Abundance-of-elements-in-Earths-crust.org.txt",
"Atmosphere-of-Earth.org.txt",
"Axial-precession.org.txt",
"Axial-tilt.org.txt",
"Carbon-cycle.org.txt",
"Carbon-dioxide-in-Earths-atmosphere.org.txt",
"Continent.org.txt",
"Crust-geology.org.txt",
"Earth.org.txt",
"Earths-energy-budget.org.txt",
"Earths-internal-heat-budget.org.txt",
"Earths-magnetic-field.org.txt",
"Earths-orbit.org.txt",
"Earths-rotation.org.txt",
"Inner-core.org.txt",
"Mantle-geology.org.txt",
"Mantle-convection.org.txt",
"Plate-tectonics.org.txt",
"Structure-of-the-Earth.org.txt"
]

In [2]:
g_base_url = "https://raw.githubusercontent.com/spackows/ICAAI-2024_RAG-CD/main/Natural-Questions/"

In [3]:
!pip install wget | tail -n 1

Successfully installed wget-3.2


In [4]:
import os
import wget

g_qa_file_name = "questions-and-answers.jsonl"
url = g_base_url + g_qa_file_name
if not os.path.isfile( g_qa_file_name ):
    wget.download( url, out = g_qa_file_name )

In [5]:
!mkdir -p txt_org
#!mkdir -p txt_updated

In [6]:
import re

for file_name in g_file_names_arr:
    full_file_name = "txt_org/" + file_name
    #full_file_name = "txt_updated/" + re.sub( r"\.org", ".updated", file_name )
    url = g_base_url + full_file_name
    if not os.path.isfile( full_file_name ):
        wget.download( url, out = full_file_name )

In [None]:
!ls

In [None]:
!ls txt_org
#!ls txt_updated

## 2. Split articles into chunks

In [None]:
!pip install langchain_community | tail -n 1

In [None]:
!pip install langchain_text_splitters | tail -n 1

In [None]:
!pip install unstructured | tail -n 1

In [None]:
from langchain_community.document_loaders import DirectoryLoader

txt_dir_name = "txt_org"
#txt_dir_name = "txt_updated"
txt_loader = DirectoryLoader( txt_dir_name )
g_txt_docs_arr = txt_loader.load()

In [13]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

g_text_splitter_md = MarkdownHeaderTextSplitter( [ ( "#", "Header 1" ), ( "##", "Header 2" ), ( "###", "Header 3" ), ( "####", "Header 4" ) ], strip_headers=False )

In [14]:
g_txt_chunks = []

for doc in g_txt_docs_arr:
    chunks = g_text_splitter_md.split_text( doc.page_content )
    for chunk in chunks:
        chunk.metadata["source"] = doc.metadata["source"]
    g_txt_chunks.extend( chunks )

In [None]:
print( g_txt_chunks[0] )
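
To make the splitting step concrete, here is a minimal sketch of heading-based chunking that uses only the standard library. It is not how `MarkdownHeaderTextSplitter` is implemented: the function name `split_on_headings` and the sample text are assumptions made for illustration. Like the `strip_headers=False` setting above, the sketch keeps each heading line inside its chunk.

```python
import re

def split_on_headings(text, max_level=4):
    """Split text into chunks that each start at a Markdown heading (# to ####)."""
    heading_re = re.compile(r"^#{1,%d} " % max_level)
    chunks, current = [], []
    for line in text.splitlines():
        # A heading closes the chunk collected so far and starts a new one
        if heading_re.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

# Made-up sample article, for illustration only
sample = (
    "# Earth\nEarth is the third planet.\n"
    "## Orbit\nEarth orbits the Sun.\n"
    "## Rotation\nEarth rotates once a day."
)
sample_chunks = split_on_headings(sample)
for chunk in sample_chunks:
    print(repr(chunk))
```

Splitting at headings means that each chunk covers one topic as the author organized it, which is why the structure of the source articles matters for retrieval.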

## 3. Create search component

See: [Chroma vector store (LangChain documentation)](https://python.langchain.com/v0.1/docs/integrations/vectorstores/chroma)

In [None]:
!pip install langchain_chroma | tail -n 1

In [31]:
import re
import chromadb
from langchain_chroma import Chroma

def createDocsMetadata( chunks ):
    ids_arr = []
    txt_arr = []
    metadata_arr = []
    current_source = ""
    chunk_counter = 0
    for chunk in chunks:
        txt = chunk.page_content
        source = chunk.metadata["source"]
        title = re.sub( r"^.*\/", "", source )
        title = re.sub( r"\..*$", "", title )
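        # Restore the apostrophe wherever "Earths-" appears, not only at the start
        title = re.sub( r"Earths\-", "Earth's-", title )
        # With the extension removed, "-geology" is at the end of the name
        title = re.sub( r"\-geology$", "-(geology)", title )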
        title = re.sub( r"\-", " ", title )
        if( source != current_source ):
            current_source = source
            chunk_counter = 0
        # Zero-pad the chunk number to two digits and avoid shadowing the built-in id()
        num_str = str( chunk_counter ).zfill( 2 )
        chunk_id = source + "_" + num_str
        ids_arr.append( chunk_id )
        txt_arr.append( txt )
        metadata_arr.append( { "source" : source, "chunk_num" : num_str, "title" : title } )
        chunk_counter += 1
    return ids_arr, txt_arr, metadata_arr

def createMDSimilarityRetriever( chunks_arr, chroma_client, collection_name ):
    ids_arr, txt_arr, metadata_arr = createDocsMetadata( chunks_arr )
    collection = chroma_client.create_collection( collection_name )
    collection.add( ids=ids_arr, documents=txt_arr, metadatas = metadata_arr )
    return collection
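
The title clean-up inside `createDocsMetadata` can be tried on its own. The following is a minimal sketch, assuming the goal is to recover Wikipedia-style article titles (for example, `Crust (geology)`) from file names like those in `g_file_names_arr`. The helper name `file_name_to_title` is hypothetical, and its patterns are written for this sketch rather than taken as the definitive ones.

```python
import re

def file_name_to_title(source):
    """Turn a path such as 'txt_org/Crust-geology.org.txt' into an article title."""
    title = re.sub(r"^.*/", "", source)                # drop any directory prefix
    title = re.sub(r"\..*$", "", title)                # drop ".org.txt" or ".updated.txt"
    title = re.sub(r"Earths-", "Earth's-", title)      # restore the apostrophe
    title = re.sub(r"-geology$", "-(geology)", title)  # restore the parentheses
    return title.replace("-", " ")

for name in [
    "txt_org/Carbon-cycle.org.txt",
    "txt_org/Crust-geology.org.txt",
    "txt_org/Abundance-of-elements-in-Earths-crust.org.txt",
]:
    print(name, "->", file_name_to_title(name))
```

Titles matter later in the notebook because the title stored in each chunk's metadata is compared with the expected article title for each question.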

In [32]:
g_chroma_client = chromadb.Client()

In [None]:
collection_name = "txt_md_collection"

g_txt_md_similarity_db = createMDSimilarityRetriever( g_txt_chunks, g_chroma_client, collection_name )

## 4. Identify relevant chunks

### 4.1 Read questions

In [20]:
import json

def readQuestionsAndAnswers( file_name ):
    # A context manager closes the file even if reading fails
    with open( file_name, "r", encoding="utf-8" ) as f:
        content = f.read()
    questions_json = {}
    lines_arr = content.splitlines()
    for line in lines_arr:
        json_line = json.loads( line )
        example_id = json_line["example_id"]
        question_txt = json_line["question_txt"]
        answers_arr = json_line["answers_arr"]
        article_title = json_line["article_title"]
        questions_json[ example_id ] = { "question" : question_txt, 
                                         "expected_article_title" : article_title, 
                                         "expected_answers_arr"   : answers_arr }
    return questions_json
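
The reader above expects one JSON object per line, with the fields `example_id`, `question_txt`, `answers_arr`, and `article_title`. The following sketch parses a single hand-built line in that shape. The sample values are taken from the first example printed in this notebook; treating `example_id` as a string and skipping blank lines are assumptions of the sketch.

```python
import json

# One record in the shape that readQuestionsAndAnswers expects
sample_line = json.dumps({
    "example_id": "-1368633715963532113",
    "question_txt": "where can carbon be found in the biosphere",
    "answers_arr": ["The terrestrial biosphere"],
    "article_title": "Carbon cycle",
})
sample_jsonl = sample_line + "\n\n"  # trailing blank line, to show it is skipped

records = {}
for line in sample_jsonl.splitlines():
    if not line.strip():
        continue  # json.loads would raise an error on an empty line
    row = json.loads(line)
    records[row["example_id"]] = {
        "question": row["question_txt"],
        "expected_article_title": row["article_title"],
        "expected_answers_arr": row["answers_arr"],
    }

print(json.dumps(records, indent=3))
```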

In [21]:
g_qa_json = readQuestionsAndAnswers( g_qa_file_name )

In [23]:
example_id = list( g_qa_json.keys() )[0]

print( "example_id: " + example_id + "\n" )
print( json.dumps( g_qa_json[ example_id ], indent=3 ) )

example_id: -1368633715963532113

{
   "question": "where can carbon be found in the biosphere",
   "expected_article_title": "Carbon cycle",
   "expected_answers_arr": [
      "in all land - living organisms , both alive and dead , as well as carbon stored in soils",
      "in all land - living organisms , both alive and dead , as well as carbon stored in soils",
      "plants other living organisms soil",
      "The terrestrial biosphere"
   ]
}


### 4.2 Find relevant chunks for each question

In [24]:
def searchArticles( question_txt ):

    search_results = []
    
    raw_search_results = g_txt_md_similarity_db.query( query_texts = [ question_txt ], n_results = 2 )
    
    num_results = len( raw_search_results["distances"][0] ) if ( ( "distances" in raw_search_results ) and ( len( raw_search_results["distances"] ) > 0 ) ) else 0
    
    for i in range( num_results ):

        score     = raw_search_results["distances"][0][i]
        file_name = raw_search_results["metadatas"][0][i]["source"]
        chunk_num = raw_search_results["metadatas"][0][i]["chunk_num"]
        title     = raw_search_results["metadatas"][0][i]["title"]
        txt       = raw_search_results["documents"][0][i]
        
        search_results.append( { "search_diff" : round( 100 * score ) / 100, 
                                 "file_name"   : file_name,
                                 "chunk_num"   : chunk_num,
                                 "title"       : title,
                                 "chunk"       : txt } )
    
    return search_results

In [None]:
g_txt_md_similarity_db.query( query_texts = [ "what is found in the earth's crust" ], n_results = 2 )
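
A Chroma query result is a dictionary of nested lists: the outer list has one entry per query text, and the inner list has one entry per returned chunk. That is why `searchArticles` indexes with `[0][i]`. The sketch below flattens a hand-written stand-in for such a result. All values in `fake_results` are made up, and `flatten_first_query` is a hypothetical helper, not part of Chroma.

```python
# Hand-written stand-in for a Chroma query result (all values are made up)
fake_results = {
    "ids": [["txt_org/Crust-geology.org.txt_00", "txt_org/Earth.org.txt_03"]],
    "distances": [[0.4213, 0.5178]],
    "metadatas": [[
        {"source": "txt_org/Crust-geology.org.txt", "chunk_num": "00", "title": "Crust (geology)"},
        {"source": "txt_org/Earth.org.txt", "chunk_num": "03", "title": "Earth"},
    ]],
    "documents": [["placeholder text of the first chunk", "placeholder text of the second chunk"]],
}

def flatten_first_query(results):
    """Return one dictionary per chunk for the first (and only) query text."""
    rows = zip(results["distances"][0], results["metadatas"][0], results["documents"][0])
    return [
        {"search_diff": round(distance, 2),
         "title": metadata["title"],
         "chunk_num": metadata["chunk_num"],
         "chunk": document}
        for distance, metadata, document in rows
    ]

flat_rows = flatten_first_query(fake_results)
for row in flat_rows:
    print(row["search_diff"], row["title"], row["chunk_num"])
```

The values are distances, so a smaller `search_diff` means a closer match.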

In [26]:
def findRelevantChunks( qa_json ):
    relevant_chunks = {}
    example_ids_arr = qa_json.keys()
    for example_id in example_ids_arr:
        # Use the qa_json parameter rather than the global g_qa_json
        question_json = qa_json[ example_id ]
        question_txt = question_json["question"]
        relevant_chunks[ example_id ] = searchArticles( question_txt )
    return relevant_chunks

In [27]:
g_relevant_chunks = findRelevantChunks( g_qa_json )

In [None]:
example_id = list( g_qa_json.keys() )[0]

print( "example_id: " + example_id + "\n" )
print( json.dumps( g_relevant_chunks[ example_id ], indent=3 ) )

## 5. Prompt LLM to answer questions

See: [Foundation models Python library](https://ibm.github.io/watson-machine-learning-sdk/foundation_models.html)

### Prerequisites
Before you can prompt a foundation model in watsonx.ai, you must perform the following setup tasks:
- 5.1 Create an instance of the Watson Machine Learning service
- 5.2 Associate an instance of the Watson Machine Learning service with the current project
- 5.3 Create an IBM Cloud API key
- 5.4 Look up the current project ID

#### 5.1 Create an instance of the Watson Machine Learning service
If you don't already have an instance of the IBM Watson Machine Learning service, you can create an instance of the service from the IBM Cloud catalog: [Watson Machine Learning service](https://cloud.ibm.com/catalog/services/watson-machine-learning)

#### 5.2 Associate an instance of the Watson Machine Learning service with the current project
The current project is the project in which you are running this notebook.

If an instance of Watson Machine Learning is not already associated with the current project, follow the instructions in this topic to do so: [Adding associated services to a project](https://dataplatform.cloud.ibm.com/docs/content/wsj/getting-started/assoc-services.html?context=wx&audience=wdp)

#### 5.3 Create an IBM Cloud API key
Create an IBM Cloud API key by following these instructions: [Creating an IBM Cloud API key](https://cloud.ibm.com/docs/account?topic=account-userapikey&interface=ui#create_user_key)

Then paste your new IBM Cloud API key in the code cell below.

In [34]:
cloud_apikey = ""

g_wml_credentials = { 
    "url"    : "https://us-south.ml.cloud.ibm.com", 
    "apikey" : cloud_apikey
}

#### 5.4 Look up the current project ID
The current project is the project in which you are running this notebook. You can get the ID of the current project programmatically by running the following cell.

In [35]:
import os

g_project_id = os.environ["PROJECT_ID"]
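
If the notebook runs somewhere that does not set `PROJECT_ID`, the lookup above fails with a bare `KeyError`. The following sketch shows one way to fail with a clearer message. The helper name `get_project_id` and the example ID are assumptions for illustration.

```python
import os

def get_project_id(env=os.environ):
    """Return the project ID from the environment, or raise a readable error."""
    project_id = env.get("PROJECT_ID", "")
    if not project_id:
        raise RuntimeError(
            "PROJECT_ID is not set. Run this notebook inside a project, "
            "or set the variable yourself."
        )
    return project_id

# Demonstrated with a stand-in environment rather than the real one
print(get_project_id({"PROJECT_ID": "example-project-id"}))
```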

Now prompt a model to answer the questions ...

In [36]:
g_prompt_template = """
Article:
###
%s
###

Answer the following question using only information from the article. 
Answer in a complete sentence, with proper capitalization and punctuation. 
If there is no good answer in the article, say "I don't know".

Question: %s
Answer: 
"""

In [37]:
from ibm_watson_machine_learning.foundation_models import Model

g_model_id = "google/flan-t5-xxl"

g_prompt_parameters = {
    "decoding_method" : "greedy",
    "min_new_tokens"  : 0,
    "max_new_tokens"  : 300
}

g_model = Model( g_model_id, g_wml_credentials, g_prompt_parameters, g_project_id )

In [38]:
import math
import re

def countTokens( chunk_txt, model ):
    tokenized_response = model.tokenize( chunk_txt )
    return tokenized_response["result"]["token_count"]

def chopTxt( chunk_txt_org, model, max_tokens ):
    num_words = len( chunk_txt_org.split() )
    num_tokens = countTokens( chunk_txt_org, model )
    max_words = math.floor( max_tokens * num_words / num_tokens )
    chunk_txt = re.sub( r"\n", " __NEWLINE__ ", chunk_txt_org )
    num_prompt_template_words = 50
    chopped_txt = " ".join( chunk_txt.split()[ 0 : ( max_words - num_prompt_template_words ) ] )
    chopped_txt = re.sub( r"__NEWLINE__", "\n", chopped_txt )
    num_tokens = countTokens( chopped_txt, model )
    safety = 75
    while( ( num_tokens + safety ) > max_tokens ):
        max_words = max_words - 50
        chopped_txt = " ".join( chunk_txt.split()[ 0 : ( max_words - num_prompt_template_words ) ] )
        chopped_txt = re.sub( r"__NEWLINE__", "\n", chopped_txt )
        num_tokens = countTokens( chopped_txt, model )
    print( "num_tokens: " + str( num_tokens ) )
    return chopped_txt
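
The idea behind `chopTxt` is to estimate a word budget from the ratio of words to tokens, then shrink the text in steps until the measured token count fits. The sketch below shows that idea in isolation. It is a simplification, not a drop-in replacement: `count_tokens_stub` is a stand-in tokenizer that pretends every word is two tokens, `chop_to_budget` is a hypothetical name, and the sketch collapses newlines instead of preserving them.

```python
import math

def count_tokens_stub(text):
    """Stand-in tokenizer (assumption): pretend every word is 2 tokens."""
    return 2 * len(text.split())

def chop_to_budget(text, max_tokens, count_tokens=count_tokens_stub, step=5):
    """Truncate text to at most max_tokens, as measured by count_tokens."""
    words = text.split()
    num_tokens = count_tokens(text)
    if num_tokens <= max_tokens:
        return text
    # First guess: scale the word count by the share of tokens that fits
    max_words = math.floor(max_tokens * len(words) / num_tokens)
    chopped = " ".join(words[:max_words])
    # Then shrink in small steps until the measured count fits
    while max_words > 0 and count_tokens(chopped) > max_tokens:
        max_words = max(0, max_words - step)
        chopped = " ".join(words[:max_words])
    return chopped

long_text = " ".join("word%d" % i for i in range(100))
short_text = chop_to_budget(long_text, max_tokens=50)
print(len(short_text.split()), count_tokens_stub(short_text))
```

With a real tokenizer the first guess is only approximate, which is why the loop re-measures after every cut.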

In [39]:
def answerQuestion( chunk_txt, question_txt, b_debug=False ):
    chunk_txt = chopTxt( chunk_txt, g_model, 4000 )
    prompt_text = g_prompt_template % ( chunk_txt, question_txt )
    raw_response = g_model.generate( prompt_text )
    if b_debug:
        print( "prompt_text:\n'" + prompt_text + "'\n" )
        print( "raw_response:\n" + json.dumps( raw_response, indent=3 ) )
    if ( "results" in raw_response ) \
       and ( len( raw_response["results"] ) > 0 ) \
       and ( "generated_text" in raw_response["results"][0] ):
        output = raw_response["results"][0]["generated_text"]
        return output
    else:
        return ""

In [40]:
def answerQuestions( qa_json, relevant_chunks, b_debug=False ):
    generated_answers = {}
    example_ids_arr = qa_json.keys()
    for example_id in example_ids_arr:
        # Join every retrieved chunk, so a question with fewer than two search results does not raise an IndexError
        chunk_txt = "\n\n".join( [ result["chunk"] for result in relevant_chunks[ example_id ] ] )
        question_txt = qa_json[ example_id ]["question"]
        answer_txt = answerQuestion( chunk_txt, question_txt, b_debug )
        generated_answers[ example_id ] = answer_txt
    return generated_answers
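
Each prompt is the template with the retrieved text and the question substituted for the two `%s` placeholders. The sketch below assembles one prompt without calling a model. The template is an abbreviated copy of `g_prompt_template`, the chunk texts are placeholders, and `build_prompt` is a hypothetical helper.

```python
# Abbreviated copy of the notebook's prompt template
prompt_template = """
Article:
###
%s
###

Answer the following question using only information from the article.

Question: %s
Answer: 
"""

def build_prompt(chunks, question, template=prompt_template):
    """Join the retrieved chunks and substitute them into the template."""
    article_txt = "\n\n".join(chunks)
    return template % (article_txt, question)

# Placeholder chunk texts; the percent sign shows that only the template,
# not the substituted text, is interpreted by the % operator
sample_chunks = [
    "First retrieved chunk (placeholder text, 50% of the context).",
    "Second retrieved chunk (placeholder text).",
]
prompt = build_prompt(sample_chunks, "where can carbon be found in the biosphere")
print(prompt)
```

Printing the assembled prompt before sending it is a quick way to confirm that the retrieved chunks contain the information the question needs.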

In [None]:
g_generated_answers = answerQuestions( g_qa_json, g_relevant_chunks, True )

In [None]:
print( json.dumps( g_generated_answers, indent=3 ) )

## 6. Evaluate results

In [None]:
!pip install sentence-transformers | tail -n 1

In [44]:
from sentence_transformers import SentenceTransformer, util

In [None]:
import numpy as np

st_model = SentenceTransformer( "all-MiniLM-L6-v2" )

def sentenceTransformerScore( txt1, txt2 ):
    # Embed both texts and compare them with cosine similarity
    txt1_embeddings = st_model.encode( [ txt1 ], convert_to_tensor=True )
    txt2_embeddings = st_model.encode( [ txt2 ], convert_to_tensor=True )
    cosine_scores = util.cos_sim( txt1_embeddings, txt2_embeddings )
    # Scale to a whole-number percentage. Using round() rather than int()
    # avoids truncating values such as 100 * 0.29 = 28.999... down to 28
    score = round( 100 * float( cosine_scores[0][0] ) )
    return score
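
The score above is a cosine similarity between two embedding vectors, scaled to a whole number. To show what that measures, here is a minimal sketch in plain Python using tiny hand-made vectors instead of real embeddings. The function names `cosine_similarity` and `to_percent` are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def to_percent(similarity):
    """Scale a similarity to a whole-number percentage."""
    return round(100 * similarity)

same_direction = to_percent(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
orthogonal = to_percent(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
in_between = to_percent(cosine_similarity([1.0, 1.0], [1.0, 0.0]))
print(same_direction, orthogonal, in_between)
```

Because the measure depends only on direction, two answers can score high even when they differ in length, as long as their embeddings point the same way.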

In [46]:
def compareAnswers( expected_answers, run_time_answer ):
    high_score = 0
    for answer_txt in expected_answers:
        score = sentenceTransformerScore( answer_txt, run_time_answer )
        if( score > high_score ):
            high_score = score
    return high_score

In [49]:
def evaluateAnswers( qa_json, generated_answers ):
    eval_json = {}
    example_ids_arr = qa_json.keys()
    for example_id in example_ids_arr:
        expected_answers = qa_json[ example_id ]["expected_answers_arr"]
        run_time_answer = generated_answers[ example_id ]
        highest_score = compareAnswers( expected_answers, run_time_answer )
        eval_json[ example_id ] = highest_score
    return eval_json

In [50]:
g_eval_json = evaluateAnswers( g_qa_json, g_generated_answers )

In [None]:
print( json.dumps( g_eval_json, indent=3 ) )

## 7. Compile all results

In [None]:
g_results = []
for example_id in g_qa_json:
    question = g_qa_json[ example_id ]["question"]
    expected_answers_arr = g_qa_json[ example_id ]["expected_answers_arr"]
    expected_article = g_qa_json[ example_id ]["expected_article_title"]
    relevant_article = g_relevant_chunks[ example_id ][0]["title"]
    relevant_chunk = g_relevant_chunks[ example_id ][0]["chunk"]
    run_time_answer = g_generated_answers[ example_id ]
    score = g_eval_json[ example_id ]
    g_results.append( { "example_id" : "'" + str( example_id ),
                        "Expected article" : expected_article,
                        "Relevant article" : relevant_article,
                        "Chunk" : relevant_chunk,
                        "Question" : question,
                        "Expected answers" : "- " + "\n- ".join( expected_answers_arr ),
                        "Run-time answer" : run_time_answer,  
                        "Score" : score } )
g_results = sorted( g_results, key=lambda d: d["Score"] )
print( json.dumps( g_results[0], indent=3 ) )

In [None]:
import pandas as pd

df = pd.DataFrame( g_results )
df
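
Two summary numbers are often useful when comparing runs: how often the top retrieved chunk came from the expected article, and the mean answer score. The sketch below computes both on made-up rows that use the same keys as `g_results`. The helper name `summarize` is hypothetical, and exact string matching of titles is an assumption that only holds if both titles are formatted the same way.

```python
# Made-up rows in the same shape as g_results (values are for illustration only)
sample_results = [
    {"Expected article": "Carbon cycle", "Relevant article": "Carbon cycle", "Score": 82},
    {"Expected article": "Crust (geology)", "Relevant article": "Earth", "Score": 35},
    {"Expected article": "Earth's rotation", "Relevant article": "Earth's rotation", "Score": 64},
]

def summarize(results):
    """Return the retrieval hit rate and the mean answer score."""
    num_questions = len(results)
    hits = sum(1 for r in results if r["Expected article"] == r["Relevant article"])
    mean_score = sum(r["Score"] for r in results) / num_questions
    return {
        "num_questions": num_questions,
        "retrieval_hit_rate": hits / num_questions,
        "mean_score": mean_score,
    }

summary = summarize(sample_results)
print(summary)
```

Separating the two numbers helps show whether a low score comes from search (the wrong article was retrieved) or from generation (the right article was retrieved but the answer was poor).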

## 8. Save results
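**Note:** To use `wslib` to save results to your project, you must first add a project token to this notebook. The cell below assumes that a `wslib` object has already been created that way; it is not defined anywhere else in this notebook.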

See: [ibm-watson-studio-lib for Python](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/ws-lib-python.html?context=wx&audience=wdp)

In [None]:
wslib.save_data( "results.csv", df.to_csv( index=False ).encode() )