# Implementation of AI using RAG, Vector Search, and OCI AI Inference Model
## Summarizing a Shooter's Description

## Task 1: Loading the text sources from a file

The function opens every .txt file in the specified folder, reads its contents, and splits the text on the "=====" separator. The resulting chunks are collected into a list, and each list is stored in a dictionary keyed by the file name (without its extension).

In [12]:
import os

def loadShooterDesc(directory_path):
   shooter_desc = {}
   
   for filename in os.listdir(directory_path):
      if filename.endswith(".txt"):  # assuming shooter_desc are in .txt files
         file_path = os.path.join(directory_path, filename)

         with open(file_path) as f:
            raw_shooter_desc = f.read()

         filename_without_ext = os.path.splitext(filename)[0]  # remove .txt extension
         shooter_desc[filename_without_ext] = [text.strip() for text in raw_shooter_desc.split('=====')]

   return shooter_desc


In [13]:
shooter_desc = loadShooterDesc('.')
shooter_desc

{'shooter_desc': ["These are descriptions of a shooter or shooters.\nwearing sunglasses\nSunglasses are on his head.\nIt's a guy.\nWearing black jacket\nhas a gun\nhe is wearing a blue cap and a mask"]}
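The sample file above contains a single section, so its list has one entry. As a minimal sketch of how the separator behaves with several sections, the snippet below applies the same split-and-strip logic to an in-memory string. The sample text is made up for illustration and is not one of the lab files.

```python
# Hypothetical file content with two sections (illustration only)
raw_text = """wearing sunglasses
has a gun
=====
blue cap and a mask
"""

# Same logic as loadShooterDesc: split on the separator, then strip whitespace
sections = [text.strip() for text in raw_text.split('=====')]
print(sections)
```

Each section becomes one list entry, with the surrounding newlines removed.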

The final step in preparing the source data is to organize the dictionary for easy ingestion into the vector database.

In [14]:
docs = [{'text': filename + ' | ' + section, 'path': filename} for filename, sections in shooter_desc.items() for section in sections]

# Sample the resulting data
docs[:2]

[{'text': "shooter_desc | These are descriptions of a shooter or shooters.\nwearing sunglasses\nSunglasses are on his head.\nIt's a guy.\nWearing black jacket\nhas a gun\nhe is wearing a blue cap and a mask",
  'path': 'shooter_desc'}]
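With only one file and one section, the flattening step is hard to see. The following sketch runs the same comprehension on a hypothetical dictionary with two files; the names witness_a and witness_b and their contents are assumptions for illustration only.

```python
# Hypothetical dictionary shaped like the output of loadShooterDesc
sample_desc = {
    "witness_a": ["wearing sunglasses", "has a gun"],
    "witness_b": ["blue cap and a mask"],
}

# One record per section; the file name is prefixed to the text and kept as 'path'
sample_docs = [
    {"text": filename + " | " + section, "path": filename}
    for filename, sections in sample_desc.items()
    for section in sections
]

for doc in sample_docs:
    print(doc)
```

Prefixing the file name to the text means the source name is embedded together with the chunk, so it also influences the vector.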

## Task 2: Loading the Shooter Description chunks into the vector database

### Step 1: Create a database connection

In [1]:
# Choose any user name
un = "user1"
pw = "<<password>>"
cs = "localhost/FREEPDB1"

In [21]:
import oracledb
connection = oracledb.connect(user=un, password=pw, dsn=cs)

### Step 2: Create the shooter_desc table

We need a table in the database to store the vectors and their metadata.

In [22]:
table_name = 'shooter_desc'

with connection.cursor() as cursor:
    # Create the table if it does not already exist
    create_table_sql = f"""
        CREATE TABLE IF NOT EXISTS {table_name} (
            id NUMBER PRIMARY KEY,
            payload CLOB CHECK (payload IS JSON),
            vector VECTOR
        )"""
    cursor.execute(create_table_sql)

# Commit each subsequent statement automatically
connection.autocommit = True

### Step 3: Vectorize the text chunks

We need an encoder to handle the vectorization.

In [27]:
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L12-v2')

Process each chunk stored in the docs list by encoding its text content. The result is a structured list containing every chunk, its context (the source file name in this case), and the vector representation of the chunk.

In [28]:
import array

# Define a list to store the data
data = [
    {"id": idx, "vector_source": row['text'], "payload": row}
    for idx, row in enumerate(docs)
]

# Collect all texts for batch encoding
texts = [row['vector_source'] for row in data]

# Encode all texts in a batch
embeddings = encoder.encode(texts, batch_size=32, show_progress_bar=True)

# Assign the embeddings back to your data structure
for row, embedding in zip(data, embeddings):
    row['vector'] = array.array("f", embedding)

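The embeddings are wrapped in a Python array of 32-bit floats before they are sent to the database. The sketch below shows that conversion on a short made-up embedding. A real all-MiniLM-L12-v2 vector has 384 dimensions; the three values here are assumptions for illustration.

```python
import array

# Made-up three-dimensional embedding (illustration only)
fake_embedding = [0.25, -0.5, 0.125]

# Typecode "f" stores each value as a single-precision float
vector = array.array("f", fake_embedding)

print(vector.typecode, vector.itemsize, len(vector))
print(list(vector))
```

Single precision halves the storage compared with Python's native double-precision floats, at the cost of some rounding in the stored values.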

### Step 4: Insert the chunks + vectors in the database

We use a cursor from the established database connection to truncate the table. This clears all existing rows and leaves the table empty, ready for new data.

Next, the code prepares a list of tuples containing the new data. Each tuple holds an ID, a JSON-encoded payload, and a vector.

The json.dumps function serializes the payload dictionary into a JSON string so that it can be stored in the CLOB column.

We then call "cursor.executemany" to insert all prepared tuples into the table in a single batch operation, which needs far fewer round trips to the database than inserting the rows one at a time. Finally, connection.commit commits the transaction so that the changes are made permanent in the database.

In [29]:
import json

with connection.cursor() as cursor:
    # Truncate the table
    cursor.execute(f"truncate table {table_name}")
    prepared_data = [(row['id'], json.dumps(row['payload']), row['vector']) for row in data]

    # Insert the data
    cursor.executemany(
        f"""INSERT INTO {table_name} (id, payload, vector)
            VALUES (:1, :2, :3)""", prepared_data)
    
    connection.commit()
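To make the shape of one prepared tuple concrete, here is a minimal sketch that needs no database. The row below is a hypothetical stand-in for one entry of `data`, with a shortened text and a two-value vector chosen purely for illustration.

```python
import array
import json

# Hypothetical row shaped like the entries of `data` above
row = {
    "id": 0,
    "payload": {"text": "shooter_desc | has a gun", "path": "shooter_desc"},
    "vector": array.array("f", [0.25, -0.5]),
}

# The tuple order matches the bind positions :1, :2, :3
prepared = (row["id"], json.dumps(row["payload"]), row["vector"])

# The payload is now a plain string that decodes back to the same dictionary
print(type(prepared[1]).__name__)
print(json.loads(prepared[1]) == row["payload"])
```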

In [30]:
with connection.cursor() as cursor:
    # Define the query to select all rows from a table
    query = f"SELECT * FROM {table_name}"

    # Execute the query
    cursor.execute(query)

    # Fetch all rows
    rows = cursor.fetchall()

    # Print the rows
    for row in rows[:5]:
        print(row)

(0, {'text': "shooter_desc | These are descriptions of a shooter or shooters.\nwearing sunglasses\nSunglasses are on his head.\nIt's a guy.\nWearing black jacket\nhas a gun\nhe is wearing a blue cap and a mask", 'path': 'shooter_desc'}, array('f', [-0.017050394788384438, 0.05293922498822212, -0.04627383127808571, -0.04897970333695412, ...]))

# Vector retrieval and Large Language Model generation

With the chunks stored in the vector database (Oracle Database 23ai), we can retrieve the text chunks closest to the "question" in vector space, use them to build an LLM prompt, and ask the Oracle Generative AI Service for a well-worded response. This Retrieval-Augmented Generation (RAG) approach combines retrieval and generation. The retriever finds relevant documents by embedding queries and documents into the same vector space, which provides context for the generator. The generator, a model such as LLaMA 2, uses these documents to produce an accurate and detailed response.

## Task 1: Vectorize the "question"

We will take the user's question, convert it into a vector, and feed this vector to the database. The database will then retrieve similar vectors and their associated metadata stored within it.

### Step 1: Define the SQL script used to retrieve the chunks

In [11]:
# topK represents the number of top results to retrieve
topK = 3

sql = f"""select payload, vector_distance(vector, :vector, COSINE) as score
from {table_name}
order by score
fetch approx first {topK} rows only"""
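The COSINE metric returns a distance, so smaller scores mean closer matches, which is why the query sorts in ascending order. As a rough pure-Python sketch of the quantity being computed (one minus the cosine similarity), consider the helper below. The name cosine_distance and the sample vectors are made up for illustration; the database computes this internally.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```

The distance ranges from 0 for vectors pointing in the same direction to 2 for vectors pointing in opposite directions.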

### Step 2: Define the question

In [12]:
question = "Combine all the descriptions of the shooter into 2 sentences"

### Step 3: Executing the query

Next, write the retrieval code. Use the same encoder that vectorized the text chunks to generate a vector representation of the question.

In [13]:
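with connection.cursor() as cursor:
    # Encode the question with the same model used for the chunks
    embedding = list(encoder.encode(question))
    vector = array.array("f", embedding)

    results = []

    for info, score in cursor.execute(sql, vector=vector):
        # Assumption: depending on the driver version, the payload may arrive
        # as a LOB, a string, or an already-decoded dict
        if hasattr(info, "read"):
            info = info.read()
        payload = json.loads(info) if isinstance(info, str) else info
        results.append((score, payload))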

The SQL query is executed with the question vector as its bind parameter and returns the closest chunks from the database. For each row, the code reads the JSON payload, decodes it, and appends it to the results list together with its distance score.

If we print the results, we obtain a list showing the "score" of each hit, which represents the distance in vector space between the question and the text chunk, as well as the metadata JSON embedded in each chunk.


In [14]:
import pprint
pprint.pp(results)

[(0.4276947372566744,
  {'text': 'shooter_desc | These are descriptions of a shooter or shooters.\n'
           'wearing sunglasses\n'
           'Sunglasses are on his head.\n'
           "It's a guy.\n"
           'Wearing black jacket\n'
           'has a gun\n'
           'he is wearing a blue cap and a mask',
   'path': 'shooter_desc'})]
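Because each hit carries its distance, the list can be filtered before it is used as context for the LLM. A minimal sketch, assuming a hypothetical cut-off of 0.6: the threshold, the second sample hit, and the variable names are illustrative and not part of the lab.

```python
# Sample hits in the same (score, payload) shape as `results`
sample_results = [
    (0.43, {"text": "shooter_desc | has a gun", "path": "shooter_desc"}),
    (0.91, {"text": "other_file | unrelated note", "path": "other_file"}),
]

MAX_DISTANCE = 0.6  # hypothetical cut-off; tune it for your own data

# Keep only the hits that are close enough to the question
relevant = [payload for score, payload in sample_results if score <= MAX_DISTANCE]

# Join the surviving chunks into one context string
context = "\n=========\n".join(doc["text"] for doc in relevant)
print(context)
```

A suitable threshold depends on the encoder and the data, so it is best chosen by inspecting the scores of known good and bad matches.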


## Task 2: Create the LLM prompt

Before sending anything to the LLM, we must ensure that our prompt does not exceed the maximum context length of the model. For LLaMA 2, this limit is 4,096 tokens, which includes both the input tokens (the prompt) and the response.

In [15]:
from transformers import LlamaTokenizerFast
import sys

tokenizer = LlamaTokenizerFast.from_pretrained("hf-internal-testing/llama-tokenizer")
tokenizer.model_max_length = sys.maxsize

def truncate_string(string, max_tokens):
    # Tokenize the text; add_special_tokens=True prepends the BOS token,
    # which decode() renders as "<s>" at the start of the result
    tokens = tokenizer.encode(string, add_special_tokens=True)
    # Truncate the tokens to a maximum length
    truncated_tokens = tokens[:max_tokens]
    # Transform the tokens back to text
    truncated_text = tokenizer.decode(truncated_tokens)
    return truncated_text



This code uses the Hugging Face Transformers library to tokenize text with the LlamaTokenizerFast class. The tokenizer is loaded from the pre-trained "hf-internal-testing/llama-tokenizer" repository, and its "model_max_length" attribute is set to "sys.maxsize" so that very large inputs are accepted without length constraints.

The "truncate_string" function takes a string and a maximum token count as inputs. It tokenizes the input string, truncates the tokenized sequence to the specified maximum length, and then decodes the truncated tokens back into a string. This function effectively shortens the text to a specified token limit while preserving its readable format, which is useful for tasks requiring length constraints on input text.
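To see the truncate-then-decode pattern without downloading a tokenizer, here is a minimal sketch that swaps in whitespace splitting for the Llama tokenizer. This is a stand-in only: real token counts differ from word counts, and truncate_words is a hypothetical name.

```python
def truncate_words(string, max_tokens):
    # "Tokenize" by splitting on whitespace (stand-in for tokenizer.encode)
    tokens = string.split()
    # Keep at most max_tokens tokens
    truncated_tokens = tokens[:max_tokens]
    # "Decode" by joining with single spaces (stand-in for tokenizer.decode)
    return " ".join(truncated_tokens)

print(truncate_words("he is wearing a blue cap and a mask", 4))
```

Text that is already shorter than the limit passes through with its words intact.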

To ensure sufficient space for the rest of the prompt and the answer, we will truncate our chunks to 1000 tokens.

In [16]:
# Join the "text" field of every doc into a single string
docs_as_one_string = "\n=========\n".join([doc["text"] for doc in docs])
docs_truncated = truncate_string(docs_as_one_string, 1000)
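The 1000-token cap can be checked against the model's context window with simple arithmetic. The sketch below combines the 4,096-token limit and the 1000-token source cap from this section with the max_tokens value of 1500 used in the request in Task 3. The variable names are made up for illustration.

```python
CONTEXT_LIMIT = 4096        # LLaMA 2 context window (prompt + response)
SOURCE_TOKENS = 1000        # cap applied to the source chunks
MAX_RESPONSE_TOKENS = 1500  # max_tokens requested from the model

# Tokens left over for the instructions and the question
remaining_for_instructions = CONTEXT_LIMIT - SOURCE_TOKENS - MAX_RESPONSE_TOKENS
print(remaining_for_instructions)
```

Under these numbers the instructions and the question must fit in the remaining 1,596 tokens.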

In [17]:
prompt = f"""\
    <s>[INST] <<SYS>>
    You are a helpful assistant named Oracle chatbot.
    USE ONLY the sources below and ABSOLUTELY IGNORE any previous knowledge.
    Use Markdown if appropriate.
    Assume the customer needs a clear description. Do not add additional details if it is not on the text file. Do not repeat the same information.
    <</SYS>> [/INST]

    [INST]
    Respond PRECISELY to this question: "{question}.",  USING ONLY the following information and IGNORING ANY PREVIOUS KNOWLEDGE.
    Include code snippets and commands where necessary.
    NEVER mention the sources, always respond as if you have that knowledge yourself. Do NOT provide warnings or disclaimers.
    =====
    Sources: {docs_truncated}
    =====
    Answer (One paragraph, maximum 2 sentences, maximum 10 words, 90% spartan):
    [/INST]
    """
print(prompt)  # Print the prompt to verify its formatting

    <s>[INST] <<SYS>>
    You are a helpful assistant named Oracle chatbot.
    USE ONLY the sources below and ABSOLUTELY IGNORE any previous knowledge.
    Use Markdown if appropriate.
    Assume the customer needs a clear description. Do not add additional details if it is not on the text file. Do not repeat the same information.
    <</SYS>> [/INST]

    [INST]
    Respond PRECISELY to this question: "Combine all the descriptions of the shooter into 2 sentences.",  USING ONLY the following information and IGNORING ANY PREVIOUS KNOWLEDGE.
    Include code snippets and commands where necessary.
    NEVER mention the sources, always respond as if you have that knowledge yourself. Do NOT provide warnings or disclaimers.
    =====
    Sources: <s> shooter_desc | These are descriptions of a shooter or shooters.
wearing sunglasses
Sunglasses are on his head.
It's a guy.
Wearing black jacket
has a gun
he is wearing a blue cap and a mask
    =====
    Answer (One paragraph, maximum 2 sentences, maximum 10 words, 90% spartan):
    [/INST]
    


## Task 3: Call the Generative AI Service LLM

### Initialize the OCI client

In [18]:
import oci
import logging
from oci.generative_ai_inference.models import LlamaLlmInferenceRequest, GenerateTextDetails, OnDemandServingMode

logging.basicConfig(level=logging.INFO)

compartment_id = '<<compartment id>>'
CONFIG_PROFILE = "DEFAULT"
config = oci.config.from_file('config', CONFIG_PROFILE)

# Service endpoint
endpoint = "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

# GenAI client
generative_ai_inference_client = oci.generative_ai_inference.GenerativeAiInferenceClient(config=config, service_endpoint=endpoint, retry_strategy=oci.retry.NoneRetryStrategy(), timeout=(10,240))

### Make the call

This code uses Oracle Cloud Infrastructure (OCI) to generate text with the “meta.llama-2-70b-chat” model. It creates an inference request with parameters like the input prompt, maximum tokens, and settings for output randomness (temperature and top_p). The is_stream attribute is set to False.

The code specifies the serving mode, model ID, and compartment ID for the text generation request. This setup directs OCI on which model to use and how to process the request.

Finally, the request is sent to OCI’s Generative AI inference client. The client processes it and returns the generated text, which is then stripped of surrounding whitespace and printed.

In [19]:
generate_text_request = LlamaLlmInferenceRequest()

generate_text_request.prompt = prompt
generate_text_request.is_stream = False  # SDK doesn't support streaming responses, feature is under development
generate_text_request.max_tokens = 1500
generate_text_request.temperature = 0.1
generate_text_request.top_p = 0.7
generate_text_request.frequency_penalty = 0.0

# Specify the serving mode, model, compartment, and request in one place
generate_text_detail = GenerateTextDetails(
    serving_mode=OnDemandServingMode(model_id="meta.llama-2-70b-chat"),
    compartment_id=compartment_id,
    inference_request=generate_text_request,
)

In [20]:
try:
    generate_text_response = generative_ai_inference_client.generate_text(generate_text_detail)
    response = generate_text_response.data.inference_response.choices[0].text
except Exception as e:
    logging.error("Error occurred: %s", str(e))
    response = None

if response:
    print(response.strip())
else:
    print("No response received or response is empty.")

The shooter is a male wearing a black jacket, blue cap, and a mask, holding a gun. He has sunglasses on his head.
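The attribute chain used to read the answer (data.inference_response.choices[0].text) can be wrapped in a small helper that returns None when any part is missing. This is a sketch only: extract_text is a hypothetical name, and the SimpleNamespace object below merely mimics the shape of the response accessed above. It is not an OCI class.

```python
from types import SimpleNamespace

def extract_text(generate_text_response):
    """Return the first choice's stripped text, or None if unavailable."""
    try:
        choices = generate_text_response.data.inference_response.choices
        return choices[0].text.strip() or None
    except (AttributeError, IndexError, TypeError):
        return None

# Stand-in object that mimics the response shape (not a real OCI response)
fake_response = SimpleNamespace(
    data=SimpleNamespace(
        inference_response=SimpleNamespace(
            choices=[SimpleNamespace(text="  The shooter is a male.  ")]
        )
    )
)

print(extract_text(fake_response))
print(extract_text(SimpleNamespace(data=None)))
```

Returning None for every failure mode keeps the calling code to a single check, at the cost of hiding which part of the response was missing.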
