# Retrieval Augmented Generation (RAG) with Azure AI Search and OpenAI

This notebook demonstrates Retrieval Augmented Generation (RAG): retrieving relevant document chunks and passing them to a language model (LLM or SLM) as extra context so that its answers are more accurate. It uses Azure AI Search to index the documents and an Azure OpenAI embedding model to generate the embedding vectors for them.

## Install Python packages

In [25]:
%pip install python-dotenv
%pip install tiktoken
%pip install azure-search-documents
%pip install azure-identity
%pip install openai
%pip install PyPDF2
%pip install python-docx
%pip install pandas
%pip install openpyxl


Note: you may need to restart the kernel to use updated packages.

## Connect to the Azure AI Search and OpenAI

Load environment variables from the `.env` file

In [26]:
import os
import re
from openai import AzureOpenAI
from dotenv import load_dotenv
from dotenv import dotenv_values

if os.path.exists(".env"):
    load_dotenv(override=True)
    config = dotenv_values(".env")

azure_openai_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
azure_openai_api_key = os.getenv("AZURE_OPENAI_API_KEY")
azure_openai_chat_completions_deployment_name = os.getenv("AZURE_OPENAI_CHAT_COMPLETIONS_DEPLOYMENT_NAME")

azure_openai_embedding_model = os.getenv("AZURE_OPENAI_EMBEDDING_MODEL")
embedding_vector_dimensions = os.getenv("EMBEDDING_VECTOR_DIMENSIONS")

azure_search_service_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
azure_search_service_admin_key = os.getenv("AZURE_SEARCH_SERVICE_ADMIN_KEY")
search_index_name = os.getenv("SEARCH_INDEX_NAME_1")

openai_client = AzureOpenAI(
    azure_endpoint=azure_openai_endpoint,
    api_key=azure_openai_api_key,
    api_version="2024-06-01"
)

# Test the connection to the Azure OpenAI chat completions deployment
completion = openai_client.chat.completions.create(
    model=azure_openai_chat_completions_deployment_name,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you ?"}
    ])
print(completion.to_json())

{
  "id": "chatcmpl-A3saLi51YDKwT6uqrSUo7ViTFx8Uv",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "I am an AI assistant designed to help you with a wide range of tasks and answer your questions. How can I assist you today?",
        "role": "assistant"
      },
      "content_filter_results": {
        "hate": {
          "filtered": false,
          "severity": "safe"
        },
        "self_harm": {
          "filtered": false,
          "severity": "safe"
        },
        "sexual": {
          "filtered": false,
          "severity": "safe"
        },
        "violence": {
          "filtered": false,
          "severity": "safe"
        }
      }
    }
  ],
  "created": 1725488173,
  "model": "gpt-4o-2024-05-13",
  "object": "chat.completion",
  "system_fingerprint": "fp_80a1bad4c7",
  "usage": {
    "completion_tokens": 27,
    "prompt_tokens": 21,
    "total_tokens": 48
  },
  "prompt_filter_

## Count the number of tokens in a text

Like LLMs, embedding models define a maximum input size, measured in tokens. For `text-embedding-3-large` the maximum input is 8191 tokens, so the text must be split into chunks of 8191 tokens or fewer. To do that, we first need to count the tokens in a text string.

In [27]:
import tiktoken

def num_tokens_from_string(string: str) -> int:
    encoding = tiktoken.get_encoding(encoding_name="cl100k_base")
    num_tokens = len(encoding.encode(string, disallowed_special=()))
    return num_tokens

# Test the function
num_tokens_from_string("tiktoken is great!")

6

Because the `text-embedding-3-large` model accepts at most `8191` tokens per input, each file must be split into chunks of at most `8191` tokens before it is sent to the model.
The next cell counts the tokens in each sample file and flags the files that exceed `8191` tokens.

In [28]:
import os
import csv
import docx
import pandas as pd
from PyPDF2 import PdfReader

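def num_tokens_from_string(content):
    # NOTE: this placeholder counts whitespace-separated words, not tokens, and it
    # shadows the tiktoken-based function defined earlier, so later cells use this
    # version. Delete this definition to use the tiktoken-based count instead.
    return len(content.split())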

input_directory = './data/myDocuments/'

for filename in os.listdir(input_directory):
    file_path = os.path.join(input_directory, filename)
    content = ''

    if filename.endswith('.pdf'):
        with open(file_path, 'rb') as file:
            reader = PdfReader(file)
            for page in reader.pages:
                content += page.extract_text()

    elif filename.endswith('.docx'):
        doc = docx.Document(file_path)
        for paragraph in doc.paragraphs:
            content += paragraph.text + '\n'

    elif filename.endswith('.csv'):
        with open(file_path, newline='', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile)
            for row in reader:
                content += ' '.join(row) + '\n'

    elif filename.endswith('.xlsx'):
        df = pd.read_excel(file_path)
        content += df.to_string(index=False)

    # Add more elif statements if needed for .doc or other formats

    tokens = num_tokens_from_string(content)
    if tokens > 8191:
        print(f'File {filename} has {tokens} tokens, which is more than 8191 (max) tokens.')
    else:
        print(f'File {filename} has {tokens} tokens.')


File BUDGET CIRCULAR NO. 2024-1 DATED APRIL 4 2024.pdf has 1973 tokens.
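
When a file does exceed the limit, one option is to pack consecutive paragraphs into chunks that stay under the token budget. The sketch below illustrates that idea. `pack_paragraphs` and `count_words` are hypothetical helpers, not part of this notebook, and the default counter is a word-count stand-in that you would replace with a tokenizer-based function such as the tiktoken one defined earlier.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(text.split())


def pack_paragraphs(
    paragraphs: List[str],
    max_tokens: int = 8191,
    count_tokens: Callable[[str], int] = count_words,
) -> List[str]:
    """Greedily merge consecutive paragraphs into chunks of at most max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        if n > max_tokens:
            raise ValueError("A single paragraph exceeds max_tokens; split it further first.")
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny budget so the behavior is easy to follow.
print(pack_paragraphs(["a b c", "d e", "f g h i"], max_tokens=5))
# ['a b c\n\nd e', 'f g h i']
```

With a real tokenizer the `\n\n` separators can add a few tokens to each merged chunk, so it is safer to set `max_tokens` somewhat below the model limit.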


## Extract document text

Create a function that extracts the text from the supported document types (`.pdf`, `.docx`, `.csv`, and `.xlsx`).

In [29]:

import os
import csv
import docx
import pandas as pd
from PyPDF2 import PdfReader
from typing import List, Optional

def extract_text_from_file(file_path: str) -> Optional[List[str]]:
    """
    Extract text content from a supported document file.

    Parameters:
    - file_path: str - The path to the document file.

    Returns:
    - Optional[List[str]]: The extracted text split on blank lines,
      or None if no text was extracted.

    Raises:
    - FileNotFoundError: If the file does not exist.
    - ValueError: If the file type is unsupported.
    """
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"The file {file_path} does not exist.")

    content = ''

    # Extract text from a PDF file
    if file_path.endswith('.pdf'):
        with open(file_path, 'rb') as file:
            reader = PdfReader(file)
            for page in reader.pages:
                content += page.extract_text()

    # Extract text from a DOCX file
    elif file_path.endswith('.docx'):
        doc = docx.Document(file_path)
        for paragraph in doc.paragraphs:
            content += paragraph.text

    # Extract text from a CSV file
    elif file_path.endswith('.csv'):
        with open(file_path, newline='', encoding='utf-8') as csvfile:
            reader = csv.reader(csvfile)
            for row in reader:
                content += ' '.join(row)

    # Extract text from an Excel file
    elif file_path.endswith('.xlsx'):
        excel_file = pd.ExcelFile(file_path)
        for sheet_name in excel_file.sheet_names:
            df = pd.read_excel(file_path, sheet_name=sheet_name)
            content += df.to_string(index=False) + '\n\n'

    # Add more elif statements if needed for other file types (.doc, .txt, etc.)
    else:
        raise ValueError(f"Unsupported file type: {file_path}")

    return content.split('\n\n') if content else None


## Transforming/cleaning the documents

Before embedding the text, we strip markdown syntax (links, images, and bold markers) and newlines from it. The function `clean_markdown_content()` does this.

In [30]:
def clean_markdown_content(content):
    # Remove images first: the link pattern below would otherwise match the
    # [alt](url) part of ![alt](url) and leave a stray "!alt" behind
    image_pattern = r'\!\[([^\[]*)\]\(([^\)]+)\)'
    content = re.sub(image_pattern, '', content)

    # Replace links with their link text
    link_pattern = r'\[([^\[]+)\]\(([^\)]+)\)'
    content = re.sub(link_pattern, r'\1', content)

    # Remove bold markers (**) and newlines
    content = content.replace('**', '')
    content = content.replace('\n', '')

    return content

## Get the vector embedding for an input text

In [31]:
def get_embeddings_vector(text):

    response = openai_client.embeddings.create(
        input=text,
        model=azure_openai_embedding_model,
    )

    embedding = response.data[0].embedding

    return embedding

# Test the function
vector = get_embeddings_vector("Sample text")
print(vector)

[-0.012445454485714436, -0.04316772520542145, -0.009896679781377316, 0.011555095203220844, 0.006628769915550947, -0.013394517824053764, -0.041641395539045334, 0.059996481984853745, -0.019362857565283775, 0.0006653230520896614, 0.028941551223397255, 0.007895818911492825, 0.008854666724801064, -0.05162123963236809, 0.013981567695736885, 0.01329667679965496, -0.010253801941871643, 0.004921433515846729, 0.008018121123313904, -0.023071054369211197, -0.002494961256161332, 0.004640138708055019, -0.02659335359930992, 0.051856059581041336, 0.007440855260938406, -0.006550496444106102, -0.01611451432108879, 0.01270962692797184, 0.007949631661176682, 0.024616951122879982, 0.008957399986684322, 0.03972369804978371, -0.005596540868282318, -0.028334934264421463, 0.014862142503261566, 0.013736963272094727, 0.030917951837182045, 0.020253214985132217, 0.027826156467199326, 0.007920279167592525, 0.026025870814919472, 0.01541005540639162, -0.0445375069975853, -0.015449192374944687, -0.0200966689735651, 0.

## Create file chunks

This is where we split the documents in the `./data/myDocuments` folder into chunks, embed each chunk, and save it as a JSON file in `./data/chunks`.

In [32]:
import uuid
import os
import json

input_directory = './data/myDocuments/'
output_directory = './data/chunks/'
supported_file_types = ('.pdf', '.docx', '.csv', '.xlsx')

# Create output directory if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

chunk_index = 0

# Loop through each file in the directory
for filename in os.listdir(input_directory):
    # Only process supported file types
    if filename.endswith(supported_file_types):
        # Use the filename (without its extension) as the page title
        page_title = os.path.splitext(filename)[0]

        # Extract the text as a list of chunks (None means no text was extracted)
        extracted_paragraphs = extract_text_from_file(os.path.join(input_directory, filename)) or []

        # Process each chunk
        for chunk in extracted_paragraphs:
            chunk_index += 1
            chunk_content = clean_markdown_content(chunk.strip())

            token_count = num_tokens_from_string(chunk_content)
            if token_count > 8191:
                # Too large for the embedding model: stop processing this file
                print(f'Chunk {chunk_index} in file {filename} has more than 8191 tokens')
                break
            print(f'Chunk {chunk_index} in file {filename} has {token_count} tokens')

            vector = get_embeddings_vector(chunk_content)

            # Use the text up to the first '.\n' as the chunk title. Newlines were already
            # removed by clean_markdown_content(), so in practice this is the whole chunk,
            # truncated to 200 characters below
            chunk_title = chunk_content.split('.\n')[0].strip()
            if len(chunk_title) > 200:  # Limiting title length for practicality
                chunk_title = chunk_title[:200] + '...'

            chunk_data = {
                "id": str(uuid.uuid4()),
                'page_title': page_title,
                'chunk_title': chunk_title,  # The first line is the title of the chunk
                'chunk_content': chunk_content,
                'vector': vector
            }
            print(chunk_title)

            chunk_file_name = f'chunk_{chunk_index}_{page_title}.json'.replace('?', '').replace(':', '').replace("'", '').replace('|', '').replace('/', '').replace('\\', '')

            # Write chunk into JSON file into output directory
            with open(f'{output_directory}/{chunk_file_name}', 'w') as f:
                json.dump(chunk_data, f)


Chunk 1 in file BUDGET CIRCULAR NO. 2024-1 DATED APRIL 4 2024.pdf has 1860 tokens
BUDGET  CIRCULAR1TOSUBJECT  :Background 1.02.0 Purpose3.0 CoveragePage 1 of 8This Circular  is issued  to prescribe  the updated  rules and regulations  on the grant  of the U/CA to civilian  personne...


By default, the embedding vector has `1536` dimensions for `text-embedding-3-small` and `3072` for `text-embedding-3-large`. You can reduce the number of dimensions by passing the `dimensions` parameter, without the embedding losing its concept-representing properties.
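
As a rough illustration of the `dimensions` parameter, the sketch below requests a shortened embedding and checks its length. `get_reduced_embedding` is a hypothetical helper, not part of this notebook. It assumes `client` is an `AzureOpenAI` client like `openai_client`, and that the deployed model and API version accept `dimensions` (the `text-embedding-3` models do in the OpenAI API).

```python
from typing import Any, List


def get_reduced_embedding(client: Any, model: str, text: str, dimensions: int) -> List[float]:
    """Request an embedding shortened to `dimensions` values."""
    response = client.embeddings.create(input=text, model=model, dimensions=dimensions)
    embedding = response.data[0].embedding
    # Guard against a mismatch with the vector size expected downstream.
    if len(embedding) != dimensions:
        raise ValueError(f"Expected {dimensions} dimensions, got {len(embedding)}")
    return embedding


# Example call (requires a configured client, so it is left commented out):
# vector = get_reduced_embedding(openai_client, azure_openai_embedding_model, "Sample text", 1024)
```

If you shorten the vectors this way, the `vector_search_dimensions` value of the index field has to be set to the same number.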

## Create an index in Azure AI Search

In [33]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    ComplexField,
    CorsOptions,
    SearchIndex,
    SearchField,
    ScoringProfile,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticSearch,
    SemanticField
)

credential = AzureKeyCredential(azure_search_service_admin_key)

# SearchIndexClient manages indexes at the service level, so it takes no index name
search_index_client = SearchIndexClient(
    endpoint=azure_search_service_endpoint,
    credential=credential
)

# create search index
fields = [
    SimpleField(
        name="id",
        type=SearchFieldDataType.String,
        key=True,
        sortable=True,
        filterable=True,
        facetable=True,
    ),
    SearchableField(name="page_title", type=SearchFieldDataType.String),
    SearchableField(name="chunk_title", type=SearchFieldDataType.String),
    SearchableField(name="chunk_content", type=SearchFieldDataType.String),
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=3072,  # 3072 for text-embedding-3-large; use 1536 for text-embedding-3-small
        vector_search_profile_name="myHnswProfile",
    ),
]

# Configure the vector search configuration
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="page_title"),
        # keywords_fields=[SemanticField(field_name="category")],
        content_fields=[SemanticField(field_name="chunk_content")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])
# Create the search index with the semantic settings
search_index = SearchIndex(name=search_index_name, fields=fields,
                    vector_search=vector_search, semantic_search=semantic_search)
result = search_index_client.create_or_update_index(search_index)
print(f' {result.name} created')

 index-internal-doc created


If you need to delete an index, you can use the following code.

In [34]:
# delete index
# search_index_client.delete_index(search_index_name)

## Upload chunks/documents to Azure AI Search

In [35]:
import uuid
from azure.search.documents import SearchClient

search_client = SearchClient(endpoint=azure_search_service_endpoint, index_name=search_index_name, credential=credential)

# for each json file in ./data/chunks/ folder, load the json document and upload it to the search index

for filename in os.listdir(output_directory):
    if filename.endswith('.json'):
        with open(os.path.join(output_directory, filename), 'r') as file:
            document = json.load(file)

            # upload_documents expects a list of documents
            result = search_client.upload_documents(documents=[document])
            print(f"Upload of {filename} succeeded: { result[0].succeeded }")

Upload of chunk_1_BUDGET CIRCULAR NO. 2024-1 DATED APRIL 4 2024.json succeeded: True
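
The loop above sends one request per chunk. With many chunks it may be worth grouping them, since `upload_documents` accepts a list. The sketch below shows one way to do that. `batched` and `upload_in_batches` are hypothetical helpers, the batch size of 100 is an arbitrary assumption, and `client` is assumed to behave like the `search_client` created above.

```python
from typing import Any, Dict, Iterator, List


def batched(items: List[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upload_in_batches(client: Any, documents: List[Dict[str, Any]], batch_size: int = 100) -> int:
    """Upload documents batch by batch and return how many succeeded."""
    uploaded = 0
    for batch in batched(documents, batch_size):
        results = client.upload_documents(documents=batch)
        # Each result reports whether its document was indexed successfully.
        uploaded += sum(1 for result in results if result.succeeded)
    return uploaded


# Example call (requires a configured client, so it is left commented out):
# count = upload_in_batches(search_client, all_chunk_documents)
```

Here `all_chunk_documents` stands for a list built by loading every JSON file in `./data/chunks/`.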


## Perform a vector similarity search

This example shows a pure vector search. The query text is embedded on the client with `get_embeddings_vector()`, and the resulting vector is passed to Azure AI Search as a `VectorizedQuery` that returns the 3 nearest chunks.

In [36]:
from azure.search.documents.models import VectorizedQuery

# Pure Vector Search
query = "iot"

embedding = get_embeddings_vector(query)

vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="vector")

results = search_client.search(
    search_text=None,
    vector_queries= [vector_query],
    select=["page_title", "chunk_title", "chunk_content"],
)

for result in results:
    print(f"Page Title: {result['page_title']}")
    print(f"Chunk Title: {result['chunk_title']}")
    print(f"Chunk Content: {result['chunk_content']}")
    print(f"Score: {result['@search.score']}")


Page Title: BUDGET CIRCULAR NO. 2024-1 DATED APRIL 4 2024
Chunk Title: BUDGET  CIRCULAR1TOSUBJECT  :Background 1.02.0 Purpose3.0 CoveragePage 1 of 8This Circular  is issued  to prescribe  the updated  rules and regulations  on the grant  of the U/CA to civilian  personne...
Chunk Content: BUDGET  CIRCULAR1TOSUBJECT  :Background 1.02.0 Purpose3.0 CoveragePage 1 of 8This Circular  is issued  to prescribe  the updated  rules and regulations  on the grant  of the U/CA to civilian  personnel.Section  58 of the General  Provisions  of Republic  Act (RA) No. 11975 or the Fiscal  Year (FY) 2024  General  Appropriations  Act (GAA)  authorizes  the payment  of the U/CA not exceeding  Seven  Thousand  Pesos  (P7,000)  per annum  for each qualified  government  employee,  subject  to the guidelines,  rules, and regulations  issued  by the Department  of Budget  and Management  (DBM).These Circular covers civilian government  personnel  occupying  regular,  contractual,  or casual positions;  appoi
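
To make the ranking step concrete, the following sketch runs an exact (brute-force) nearest-neighbor search over chunk vectors held in memory. It assumes cosine similarity as the metric and uses tiny made-up 2-dimensional vectors. `cosine_similarity` and `top_k` are illustrative helpers, and the HNSW index configured above performs an approximate version of this kind of search at scale.

```python
import math
from typing import Any, Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector: List[float], chunks: List[Dict[str, Any]], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (similarity, chunk_title) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, c["vector"]), c["chunk_title"]) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Made-up vectors for illustration only.
chunks = [
    {"chunk_title": "a", "vector": [1.0, 0.0]},
    {"chunk_title": "b", "vector": [0.0, 1.0]},
    {"chunk_title": "c", "vector": [1.0, 1.0]},
]
print([title for _, title in top_k([1.0, 0.0], chunks, k=2)])  # ['a', 'c']
```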

## Simulate a user query

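This is where we answer a user question with RAG. The chat completion request passes the Azure AI Search index as a data source (`data_sources` in `extra_body`), so the service retrieves the chunks that match the question and grounds the answer in them.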

In [37]:
def get_response(user_query):

    with open('safety_prompt.txt', 'r') as file:
        safety_prompt = file.read()

    # Combine the base instruction with the safety prompt
    system_prompt = "You are a friendly and helpful assistant. " + safety_prompt

    response = openai_client.chat.completions.create(
        model=azure_openai_chat_completions_deployment_name,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query}
        ],
        max_tokens=300,
        extra_body={
            "data_sources": [
                {
                    "type": "azure_search",
                    "parameters": {
                        "endpoint": azure_search_service_endpoint,
                        "index_name": search_index_name,
                        "authentication": {
                            "type": "api_key",
                            "key": azure_search_service_admin_key,
                        }
                    }
                }
            ]
        },
    )
    return response


In [38]:
user_query = input("Enter your question: ")
response = get_response(user_query)
print(response.to_json())


{
  "id": "b93c14f5-aebd-4971-8a3e-fb87b3fe4ffe",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The 2024 Uniform/Clothing Allowance (U/CA) budget for civilian government personnel is authorized under Section 58 of the General Provisions of Republic Act (RA) No. 11975 or the Fiscal Year (FY) 2024 General Appropriations Act (GAA). The U/CA is set at a maximum of Seven Thousand Pesos (P7,000) per annum for each qualified government employee, subject to the guidelines, rules, and regulations issued by the Department of Budget and Management (DBM) [doc4][doc5].",
        "role": "assistant",
        "end_turn": true,
        "context": {
          "citations": [
            {
              "content": "1 Agency  Specific  Budgets  - For the requirements  based on the previous  rate of F6,000  per employee;  and10.1.2 Miscellaneous  Personnel  Benefits  Fund  (MPBF)  - For the Pl,000 additional  requirement  per employee.For LGUs,

In [39]:
print(response.choices[0].message.content)

The 2024 Uniform/Clothing Allowance (U/CA) budget for civilian government personnel is authorized under Section 58 of the General Provisions of Republic Act (RA) No. 11975 or the Fiscal Year (FY) 2024 General Appropriations Act (GAA). The U/CA is set at a maximum of Seven Thousand Pesos (P7,000) per annum for each qualified government employee, subject to the guidelines, rules, and regulations issued by the Department of Budget and Management (DBM) [doc4][doc5].
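
The `data_sources` option lets the service handle retrieval. For comparison, here is a minimal sketch of doing the augmentation step by hand: the chunks returned by a search are formatted into the system prompt. `build_grounded_messages` is a hypothetical helper, and the prompt wording is an assumption, not the prompt the service uses.

```python
from typing import Dict, List


def build_grounded_messages(question: str, chunks: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Build chat messages whose system prompt contains the retrieved chunks."""
    # Label each chunk so the model can cite it as [doc1], [doc2], ...
    sources = "\n\n".join(
        f"[doc{i}] {chunk['page_title']}\n{chunk['chunk_content']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    system_prompt = (
        "You are a friendly and helpful assistant. "
        "Answer only from the sources below and cite them as [docN]. "
        "If the answer is not in the sources, say that you do not know.\n\n"
        f"Sources:\n{sources}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ]


# Made-up chunk for illustration only.
example_messages = build_grounded_messages(
    "What does the circular cover?",
    [{"page_title": "Sample circular", "chunk_content": "This circular covers sample rules."}],
)
print(example_messages[0]["content"])
```

The returned list could then be passed as `messages` to `openai_client.chat.completions.create(...)` without the `extra_body` argument, using the results of the vector search as `chunks`.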
