## Retrieval Augmented Generation in Action

All steps of the RAG pattern are implemented in this notebook, using the same dataset as in the previous notebook.

Retrieval Augmented Generation (RAG) is an architecture that augments the capabilities of a Large Language Model (LLM) like ChatGPT by adding an information retrieval system that provides the data. Adding an information retrieval system gives you control over the data used by an LLM when it formulates a response. For an enterprise solution, a RAG architecture means that you can constrain natural language processing to your enterprise content sourced from vectorized documents, images, audio, and video.

![RAG with Azure Cognitive Search](RAG.png)


https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview
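
Before wiring up the real services, the retrieve, augment, generate flow described above can be sketched with toy stand-ins. Everything in this block is hypothetical: the three-document corpus, the word-overlap scoring (a crude substitute for vector search), and the stub `generate` function that takes the place of the LLM call.

```python
# Toy illustration of the RAG flow: retrieve, augment the prompt, generate.
# The corpus, the word-overlap scoring, and the stub generator are all
# hypothetical stand-ins, not the Azure services used in this notebook.

CORPUS = [
    "Semantic Kernel is an open-source SDK from Microsoft.",
    "Azure Cognitive Search supports vector and keyword queries.",
    "Bananas are rich in potassium.",
]


def retrieve(question, corpus, top_k=2):
    """Rank documents by the number of words they share with the question."""
    question_words = set(question.lower().split())
    scored = [
        (len(question_words & set(doc.lower().split())), doc) for doc in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most top_k documents, dropping those with no overlap at all
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(question, sources):
    """Augment the user question with the retrieved sources."""
    joined = "\n\n".join(sources)
    return f"Answer ONLY from these sources:\n{joined}\n\nQuestion: {question}"


def generate(prompt):
    """Stub standing in for the LLM call; it only reports the prompt size."""
    return f"(an LLM would answer here, given a {len(prompt)}-character prompt)"


question = "what is semantic kernel"
sources = retrieve(question, CORPUS)
prompt = build_prompt(question, sources)
print(generate(prompt))
```

The rest of the notebook replaces each stand-in with a real component: Azure Cognitive Search for retrieval and Azure OpenAI for generation.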



In [13]:
# Import required libraries
import os
import json
import openai
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
#from langchain.llms import AzureOpenAI
from dotenv import load_dotenv
from tenacity import retry, wait_random_exponential, stop_after_attempt
from IPython.display import display, HTML, JSON, Markdown

from azure.search.documents.models import (
    QueryAnswerType,
    QueryCaptionType,
    QueryCaptionResult,
    QueryAnswerResult,
    SemanticErrorMode,
    SemanticErrorReason,
    SemanticSearchResultsType,
    QueryType,
    VectorizedQuery,
    VectorQuery,
    VectorFilterMode,    
)

# Configure environment variables
load_dotenv()

True

In [14]:
service_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT")
#replace this name by your index name
index_name = "sk-cogsrch-vector-index-3"
key = os.getenv("AZURE_SEARCH_ADMIN_KEY")

# env variables that are used by LangChain
os.environ['OPENAI_API_KEY'] = os.getenv("OPENAI_API_KEY")
os.environ['OPENAI_API_TYPE'] = "azure"
os.environ['OPENAI_API_VERSION'] = os.getenv("OPENAI_DEPLOYMENT_VERSION")
os.environ['OPENAI_API_BASE'] = os.getenv("OPENAI_DEPLOYMENT_ENDPOINT")

OPENAI_DEPLOYMENT_ENDPOINT = os.getenv("OPENAI_DEPLOYMENT_ENDPOINT")
OPENAI_DEPLOYMENT_NAME = os.getenv("OPENAI_DEPLOYMENT_NAME")
OPENAI_MODEL_NAME = os.getenv("OPENAI_MODEL_NAME")

OPENAI_ADA_EMBEDDING_DEPLOYMENT_NAME = os.getenv("OPENAI_ADA_EMBEDDING_DEPLOYMENT_NAME")
OPENAI_ADA_EMBEDDING_MODEL_NAME = os.getenv("OPENAI_ADA_EMBEDDING_MODEL_NAME")

# Configure OpenAI API
openai.api_type = "azure"
openai.api_version = os.getenv("OPENAI_DEPLOYMENT_VERSION")
openai.api_base = os.getenv("OPENAI_DEPLOYMENT_ENDPOINT")
openai.api_key = os.getenv("OPENAI_API_KEY")
# ---
credential = AzureKeyCredential(key)


In [3]:
def init_llm(model=OPENAI_MODEL_NAME,
             deployment_name=OPENAI_DEPLOYMENT_NAME,
             temperature=0,
             max_tokens=500,
             ):

    llm = AzureChatOpenAI(deployment_name=deployment_name,
                      model=model,
                      temperature=temperature,
                      max_tokens=max_tokens,
                      model_kwargs={"stop": ["<|im_end|>"]}
                      )
    return llm

In [4]:
# Generate embeddings using OpenAI Ada 002.
# Used for the title and content fields, and also for query embeddings.
# On failure, retries up to 6 times with random exponential backoff (1 to 20 seconds).
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def generate_embeddings(page):
    # With api_type "azure", `engine` is the deployment name; this assumes the
    # embedding deployment is named after the model
    response = openai.Embedding.create(
        input=page, engine="text-embedding-ada-002")

    embeddings = response['data'][0]['embedding']
    return embeddings
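
The `@retry` decorator above comes from tenacity. As a rough, standard-library-only sketch of the same idea (retry a flaky call, waiting a random and exponentially growing time between attempts), the helper below may help. The name `retry_with_backoff`, its parameters, and the injectable `sleep` argument are assumptions made for illustration; tenacity's exact wait distribution may differ.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=20.0,
                       sleep=time.sleep):
    """Call func(); on failure wait a random, exponentially growing time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            # Upper bound doubles with each attempt, capped at max_wait
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


# Demo: a function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky, sleep=waits.append)
print(result, len(waits))
```

Passing `sleep=waits.append` keeps the demo instant; in real use the default `time.sleep` applies.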

In [5]:
generate_embeddings("just testing")

[-0.028509432449936867,
 -0.004917479120194912,
 -0.0009099990711547434,
 -0.02553638257086277,
 -0.018886836245656013,
 0.0037494946736842394,
 -0.019417736679315567,
 -0.014042356051504612,
 -0.0028486205264925957,
 -0.043056145310401917,
 0.01832938939332962,
 0.011474117636680603,
 -0.0011804272653535008,
 -0.0057138316333293915,
 -0.002528420416638255,
 0.0008834539912641048,
 0.022204972803592682,
 -0.018594838678836823,
 0.008494430221617222,
 -0.01258901134133339,
 0.009622597135603428,
 -0.00461221020668745,
 -0.014705982990562916,
 0.01826302520930767,
 -0.030208319425582886,
 -0.02349240891635418,
 0.009476599283516407,
 -0.03119048662483692,
 0.003971809986978769,
 -0.0008461249526590109,
 0.027659989893436432,
 -0.014214898459613323,
 -0.002727508544921875,
 -0.0216873437166214,
 -0.019603552296757698,
 -0.017227767035365105,
 -0.004804662428796291,
 -0.021103350445628166,
 0.022457150742411613,
 -0.001308175502344966,
 -0.007452535443007946,
 0.013710541650652885,
 -0.006
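
Vectors like the one above are compared by a similarity metric during vector search. The sketch below shows cosine similarity, a common choice for embeddings. The three-dimensional vectors are made up for readability, and the metric actually used by the index depends on how the index was configured, which is not shown in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings
query_vec = [0.1, 0.3, 0.5]
docs = {
    "doc_close": [0.2, 0.6, 1.0],   # same direction as the query
    "doc_far": [0.5, -0.3, 0.0],    # points elsewhere
}

# Rank documents from most to least similar to the query
ranked = sorted(docs, key=lambda name: cosine_similarity(query_vec, docs[name]),
                reverse=True)
print(ranked)
```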

In [7]:
from langchain.prompts import ChatPromptTemplate
from langchain.prompts.chat import SystemMessagePromptTemplate, HumanMessagePromptTemplate

# init model
llm = init_llm()

# create template for prompt
template = ChatPromptTemplate.from_messages(
    [
        SystemMessagePromptTemplate.from_template(
            (""" 
                You are an assistant helping the company's technical support users with their questions about different product features. 
                Answer ONLY with the facts listed in the sources below delimited by triple backticks.
                If there isn't enough information in the Sources, say you don't know and ask a user to provide more details. 
                Do not generate answers that don't use the Sources below. 
                If asking a clarifying question to the user would help, ask the question. 

                Sources:
                ```{sources}```
            """    
            )
        ),
        HumanMessagePromptTemplate.from_template('''User Question: {question}'''),
    ]
)

#answer = llm(template.format_messages(profession="Financial Trading Consultant",  expertise="Risk Management",
#                                      question="How do you assess the risk tolerance of a new client?"))
#display(Markdown("gpt-35-turbo: " + answer.content))
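
To make the template's two placeholders concrete, here is a minimal sketch of how `{sources}` and `{question}` end up in a system message and a user message. It uses plain `str.format` and role/content dictionaries rather than LangChain's classes; the function name `format_messages` and the shortened prompt text are assumptions for illustration only.

```python
# Shortened, hypothetical version of the prompt used in this notebook
SYSTEM_TEMPLATE = (
    "Answer ONLY with the facts listed in the sources below.\n"
    "Sources:\n{sources}"
)
HUMAN_TEMPLATE = "User Question: {question}"


def format_messages(sources, question):
    """Fill both placeholders and return chat-style messages."""
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE.format(sources=sources)},
        {"role": "user", "content": HUMAN_TEMPLATE.format(question=question)},
    ]


messages = format_messages("Semantic Kernel is an SDK.", "what is it?")
for message in messages:
    print(message["role"], "->", message["content"])
```

Keeping the retrieved text in the system message and the question in the user message mirrors the structure of the template defined above.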

### Init LLM model 

Here we use LangChain's ConversationChain with ConversationBufferWindowMemory to keep the context of the conversation. With k=1, only the most recent exchange is retained.

In [21]:
from langchain.memory import ConversationBufferWindowMemory
from langchain.chains import ConversationChain
from langchain.prompts import ChatPromptTemplate

llm = init_llm()
# ConversationBufferWindowMemory stores only the last k exchanges of the conversation
# (ConversationBufferMemory would keep the full history instead)
memory = ConversationBufferWindowMemory(k=1)
# set verbose=True to see more details
conversation = ConversationChain(llm=llm, memory=memory, verbose=False)
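
The effect of a window of size `k=1` can be illustrated without LangChain. The class below is a hypothetical stand-in, not LangChain's implementation: it keeps the last `k` exchanges in a bounded deque and silently drops older ones.

```python
from collections import deque


class WindowMemory:
    """Keep only the last k (human, ai) exchanges; hypothetical helper."""

    def __init__(self, k=1):
        self.turns = deque(maxlen=k)  # older turns fall off automatically

    def save(self, human, ai):
        self.turns.append((human, ai))

    def history(self):
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


window = WindowMemory(k=1)
window.save("what is RAG?", "Retrieval plus generation.")
window.save("which index?", "The vector index.")
print(window.history())  # only the second exchange remains
```

A small window keeps the prompt short, which matters here because each turn already carries the retrieved sources.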

### Init Azure Cognitive Search client

In [22]:

search_client = SearchClient(
    service_endpoint, index_name=index_name, credential=credential)

#### Searching in Azure Cognitive Search, using hybrid (text + vector) search with semantic ranking

In [23]:

def search(question, top_k=3):
    # Embed the question and request the top_k nearest neighbors on the vector field
    vector_query = VectorizedQuery(vector=generate_embeddings(question), k_nearest_neighbors=top_k, fields="contentVector")

    results = search_client.search(
        search_text=question,
        vector_queries=[vector_query],
        select=["title", "content"],
        # adding semantic search configuration
        query_type=QueryType.SEMANTIC,
        semantic_configuration_name='sk-semantic-config',
        query_caption=QueryCaptionType.EXTRACTIVE,
        query_answer=QueryAnswerType.EXTRACTIVE,
        top=top_k,
    )

    # Collect the content of each hit and join them into a single sources string
    result_pages = [result['content'] for result in results]
    sources = "\n\n".join(result_pages)
    return sources
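
Because the query above passes both `search_text` and a vector query, it is a hybrid query. Azure's documentation describes merging the keyword and vector rankings with Reciprocal Rank Fusion (RRF) before the semantic ranker reorders the results. The block below is a simplified sketch of RRF under stated assumptions, not the service's implementation: the document ids are invented, and the constant `c=60` is the commonly cited default rather than a value taken from this notebook.

```python
def reciprocal_rank_fusion(rankings, c=60):
    """Fuse several ranked lists: each hit contributes 1 / (c + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings from the keyword and the vector side of a hybrid query
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that appear in both lists (`doc_a`, `doc_b`) rise above those found by only one side, which is the practical benefit of combining text and vector search.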

#### Calling OpenAI with the retrieved documents from Azure Cognitive Search

In [24]:

def ask_openai(sources, question):
    response = conversation.run(input=template.format(sources=sources,
                                                    question=question))
    return response

## Put all RAG phases together

In [25]:
question = "what's semantic kernel?"
# 1. Retrieval. Call Azure Cognitive Search to retrieve relevant documents
sources = search(question, top_k=3)  # retrieval
#print(sources)
# 2. Generation. Call OpenAI to generate a final answer
response = ask_openai(sources, question)
display(Markdown(response))

Semantic Kernel is an open-source SDK (Software Development Kit) developed by Microsoft. It allows developers to easily combine AI services like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C# and Python. By using Semantic Kernel, developers can create AI apps that combine the best of both worlds. It is designed to support enterprise app developers who want to integrate AI into their existing apps. Semantic Kernel enables the creation of sophisticated pipelines that automate complex tasks for users by using multiple AI models, plugins, and memory together. For example, it can help automate the process of sending an email by retrieving information about the project, generating a response, and sending the email. Semantic Kernel is a key component in Microsoft's Copilot system, which combines AI models and plugins to create new experiences for users.

In [26]:
# ask more questions
question = "which programming languages are supported by semantic kernel?"
# 1. Retrieval. Call Azure Cognitive Search to retrieve relevant documents
sources = search(question, top_k=3)
#print(sources)
# 2. Generation. Call OpenAI to generate a final answer
response = ask_openai(sources, question)
display(Markdown(response))

Semantic Kernel currently supports the following programming languages: C# and Python. However, Java and Typescript are actively being developed and are on the roadmap for future support. The SDK for each language follows common paradigms and styles to make it feel native and easy to use.
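
The two cells above repeat the same retrieve-then-generate steps, so they could be wrapped in one helper. The sketch below is a suggestion, not part of the original notebook: `answer_question` is a hypothetical name, and the two stub functions exist only so the block runs on its own. In the notebook you would pass the real `search` and `ask_openai` functions instead.

```python
def answer_question(question, search_fn, ask_fn, top_k=3):
    """Run retrieval, then generation, and return the model's answer."""
    sources = search_fn(question, top_k=top_k)  # 1. Retrieval
    return ask_fn(sources, question)            # 2. Generation


# Stubs standing in for `search` and `ask_openai`
def fake_search(question, top_k=3):
    return f"source text about: {question}"


def fake_ask(sources, question):
    return f"Answer to '{question}' using [{sources}]"


print(answer_question("what's semantic kernel?", fake_search, fake_ask))
```

Passing the two functions as arguments keeps the helper easy to test without calling Azure.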

### Vector Search / RAG is multilingual

In this example we ask a question in Spanish about the text, which is in English.

The answer is in Spanish because the question is in Spanish: OpenAI generates the answer in the same language as the question.

In [27]:
question = "¿Qué son el planificador semántico del kernel y el kernel?"
sources = search(question, top_k=3)
#print(sources)
response = ask_openai(sources, question)
display(Markdown(response))

El planificador semántico del kernel y el kernel son componentes del proyecto Semantic Kernel de Microsoft. El kernel es el encargado de orquestar las solicitudes de los usuarios y desarrolladores, ejecutando una cadena de funciones definida por el desarrollador. El planificador semántico utiliza los plugins registrados en el kernel para crear planes y abordar las necesidades de los usuarios. Estos componentes permiten crear pipelines sofisticados que automatizan tareas complejas utilizando modelos de IA, plugins y memoria.

In [28]:

question = "explain how to deploy Semantic Kernel to Azure as a web app service"
sources = search(question, top_k=3)
#print(sources)
response = ask_openai(sources, question)
display(Markdown(response))

To deploy Semantic Kernel to Azure as a web app service, you can follow these steps:

1. Make sure you have sufficient permissions to create resources in the target subscription.
2. Choose the deployment option based on your use case and preference:
   - Use existing Azure OpenAI Resources: This option allows you to use an existing Azure OpenAI instance and connect the Semantic Kernel web API to it. You can use the provided PowerShell or Bash script to deploy.
   - Create new Azure OpenAI Resources: This option deploys Semantic Kernel in a web app service and uses a new instance of Azure OpenAI. Note that access to new Azure OpenAI resources is currently limited due to high demand. You can use the provided PowerShell or Bash script to deploy.
   - Use existing OpenAI Resources: This option allows you to use your OpenAI account and connect the Semantic Kernel web API to it. You can use the provided PowerShell or Bash script to deploy.

3. Run the appropriate PowerShell or Bash script with the required parameters to deploy Semantic Kernel.
4. After the deployment is complete, you can access your Semantic Kernel web app service by clicking on the "Deploy to Azure" button and then selecting the resource whose name ends with "-skweb". The URL of your instance can be found next to the "Default domain" field on the Overview page.
5. To change the configuration of your deployment, go to the Azure Portal and click on the "Configuration" item in the "Settings" section of the left pane on the Semantic Kernel web app service page.
6. In the same pane, you can also access the "Monitoring" section to monitor your deployment and troubleshoot any issues.
7. If you encounter errors when making calls to the Semantic Kernel, make sure you have correctly entered the values for the following settings: AIService:AzureOpenAI, AIService:Endpoint, AIService:Models:Completion, AIService:Models:Embedding, AIService:Models:Planner. Note that AIService:Endpoint is ignored for OpenAI instances from openai.com but must be properly populated when using Azure OpenAI instances.
8. When you want to clean up the resources from this deployment, you can use the Azure portal or run the appropriate Azure CLI command, such as "az group delete --name YOUR_RESOURCE_GROUP".

Please note that the provided steps are based on the information available in the sources. If you need more specific details or encounter any issues during the deployment process, it is

In [29]:
#search example (not a question)
question = "semantic kernel planner"
sources = search(question, top_k=3)
#print(sources)
response = ask_openai(sources, question)
display(Markdown(response))

Semantic Kernel provides several planners that you can choose from. Here are the out-of-the-box planners provided by Semantic Kernel and their language support:

- BasicPlanner: A simplified version of SequentialPlanner that strings together a set of functions. Language support: C# (❌), Python (✅), Java (❌).
- ActionPlanner: Creates a plan with a single step. Language support: C# (✅), Python (❌), Java (❌).
- SequentialPlanner: Creates a plan with a series of steps that are interconnected with custom generated input and output variables. Language support: C# (✅), Python (❌), Java (❌).

If you want to use the SequentialPlanner instead of the ActionPlanner, you can update the appsettings.js file to configure the app to use SequentialPlanner. Additionally, if you are using gpt-3.5-turbo, it is recommended to initialize SequentialPlanner with a RelevancyThreshold in the CopilotChatPlanner.cs file.

Please note that if you have specific needs, you can also create a custom planner in Semantic Kernel.