# Phase 4 - Efficiency

This lab is about using fewer tokens: tokens are the "currency" of AI requests, and we want to save money!
We tackle this by caching questions and answers, so you don't need to ask your LLM every time. To make this a little more sophisticated, we generate vectors for the questions and therefore cache the semantics of each question, not just its string.

Below you will find a sample of how to do this.
It's your job to build this caching into your API so you can reduce the number of tokens used. Are there other optimizations you can think of?
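
To make the idea of caching semantics concrete, here is a minimal in-memory sketch. It is not how Azure AI Search works internally. The vectors are made-up 3-dimensional toy values (real embeddings have many more dimensions), and `InMemorySemanticCache` is a hypothetical helper name used only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemorySemanticCache:
    """Hypothetical toy cache: stores (vector, answer) pairs in a list."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []

    def put(self, vector, answer):
        self.entries.append((list(vector), answer))

    def get(self, vector):
        # Find the most similar cached vector and accept it only
        # if it is similar enough to count as "the same question".
        best_score, best_answer = 0.0, None
        for cached_vector, answer in self.entries:
            score = cosine_similarity(vector, cached_vector)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


cache = InMemorySemanticCache(threshold=0.95)
cache.put([0.9, 0.1, 0.0], "Robert Downey Jr.")  # toy vector, not a real embedding

hit = cache.get([0.88, 0.12, 0.01])  # points in almost the same direction
miss = cache.get([0.0, 0.2, 0.9])    # points somewhere else entirely
print(hit, miss)
```

Two differently worded questions with the same meaning should produce vectors that point in nearly the same direction, which is why the first lookup is treated as a cache hit and the second is not.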



If you haven't done so already, run this in the top-level folder:
```
pip install -r requirements.txt
```




In [2]:
import os
import tiktoken

from openai import AzureOpenAI
from dotenv import load_dotenv

# Load environment variables
if load_dotenv():
    # f-string avoids a TypeError if the variable is missing from the .env file
    print(f"Found Azure OpenAI API Base Endpoint: {os.getenv('AZURE_OPENAI_ENDPOINT')}")
else:
    print("Azure OpenAI API Base Endpoint not found. Have you configured the .env file?")
    
API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
API_VERSION = os.getenv("OPENAI_API_VERSION")
RESOURCE_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")


client = AzureOpenAI(
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_key = os.getenv("AZURE_OPENAI_API_KEY"),
    api_version = os.getenv("AZURE_OPENAI_VERSION")
)
deployment_name = os.getenv("AZURE_OPENAI_COMPLETION_DEPLOYMENT_NAME")
model_name = os.getenv("AZURE_OPENAI_COMPLETION_MODEL")

Found Azure OpenAI API Base Endpoint: https://cog-mdhi5lxwmoqli.openai.azure.com/


This is the object model for receiving questions.

In [3]:
from enum import Enum
from pydantic import BaseModel

class QuestionType(str, Enum):
    multiple_choice = "multiple_choice"
    true_or_false = "true_or_false"
    popular_choice = "popular_choice"
    estimation = "estimation"

class Ask(BaseModel):
    question: str | None = None
    type: QuestionType
    correlationToken: str | None = None

class Answer(BaseModel):
    answer: str
    correlationToken: str | None = None
    promptTokensUsed: int | None = None
    completionTokensUsed: int | None = None


## Get the number of tokens
tiktoken is a library that lets you count tokens, so you can check how many tokens you have been using.
Ensure you pick the correct encoding for your model based on this list: https://github.com/openai/tiktoken/blob/c0ba74c238d18b4824c25f3c27fc8698055b9a76/tiktoken/model.py#L20


In [4]:
def get_num_tokens_from_string(string: str, encoding_name: str='p50k_base') -> int:
    """Returns the number of tokens in a text by a given encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

number_of_tokens=get_num_tokens_from_string("Hello, Azure AI Adventure Day!")
print(f"Number of tokens in the string: {number_of_tokens}")


Number of tokens in the string: 7


# Use AI Search for semantic caching
The snippets below show you how we cache the semantic meaning of questions into Azure AI Search.

In [5]:
# use an embedding model to create embeddings
def get_embedding(text, model=os.getenv("AZURE_OPENAI_EMBEDDING_MODEL")):
    return client.embeddings.create(input = [text], model=model).data[0].embedding

In [6]:
from azure.identity import DefaultAzureCredential
from azure.core.credentials import AzureKeyCredential

from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex

)

credential = AzureKeyCredential(os.environ["AZURE_AI_SEARCH_KEY"]) if len(os.environ["AZURE_AI_SEARCH_KEY"]) > 0 else DefaultAzureCredential()

We create a new index to store questions and answers, along with the vector that represents each question's semantics.

In [7]:

index_name = "question-semantic-index"

index_client = SearchIndexClient(
    endpoint=os.environ["AZURE_AI_SEARCH_ENDPOINT"], 
    credential=credential
)

# Create a search index with the fields and a vector field which we will fill with a vector based on the question field
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="question", type=SearchFieldDataType.String),
    SearchableField(name="answer", type=SearchFieldDataType.String),
    SearchField(name="vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_profile_name="myHnswProfile"),
]

# Configure the vector search configuration  
vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="myHnsw"
        )
    ],
    profiles=[
        VectorSearchProfile(
            name="myHnswProfile",
            algorithm_configuration_name="myHnsw",
        )
    ]
)

# Configure the semantic search configuration 
semantic_config = SemanticConfiguration(
    name="question-semantic-config",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="question"),
        keywords_fields=[SemanticField(field_name="answer")],
        content_fields=[SemanticField(field_name="question")]
    )
)

# Create the semantic settings with the configuration
semantic_search = SemanticSearch(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search, semantic_search=semantic_search)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')

 question-semantic-index created


Now let's import some test data. As you can see below, the test data set contains 3 questions that all ask for the same thing in different words. So the semantics are the same, but it's not a word-for-word match.

In [8]:
import json

questions = [{
        "question": "Which actor plays Tony Stark in the Marvel movies?",
        "answer": "Robert Downey Jr."
    },
    {
        "question": "In the Marvel Cinematic Universe, who is the actor that brings Tony Stark to life?",
        "answer": "Robert Downey Jr."
    },
    {
        "question": "Who brings the character of Tony Stark to life in the Marvel Cinematic Universe?",
        "answer": "Robert Downey Jr."
    }]

For testing purposes, we put those questions into the index. To do this, we create a vector for each question that represents its meaning.

In [9]:
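# generate an id and an embedding for each question
# (enumerate avoids reusing the name `index`, which holds the SearchIndex above)
for position, question in enumerate(questions, start=1):
    question["id"] = str(position)
    question["vector"] = get_embedding(question["question"])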

Now let's put all those questions into our new index of Azure AI Search.

In [10]:
from azure.search.documents import SearchClient

# create new searchclient using our new index for the questions
search_client = SearchClient(
    endpoint=os.environ["AZURE_AI_SEARCH_ENDPOINT"], 
    index_name=index_name,
    credential=credential
)

# upload question to vector store
result = search_client.upload_documents(questions)
print(f"Successfully loaded {len(questions)} questions into Azure AI Search index.")

Successfully loaded 3 questions into Azure AI Search index.


Let's give it a try: ask the same question again, this time phrased yet another way but with the same semantic meaning.

In [11]:
question = "Who is Tony Stark in the MCU?"

Now we don't just ask the LLM. Instead, we generate an embedding, search for the vector in our new index, and print the top 5 questions and answers.

In [13]:
from azure.search.documents.models import (
    VectorizedQuery
)

# create a vectorized query based on the question
vector = VectorizedQuery(vector=get_embedding(question), k_nearest_neighbors=5, fields="vector")

# run the vector query against the index to retrieve similar cached questions
found_questions = list(search_client.search(
    search_text=None,
    query_type="semantic",
    semantic_configuration_name="question-semantic-config",
    vector_queries=[vector],
    select=["question", "answer"],
    top=5
))

# print the found documents and the fields that were selected
for result in found_questions:
    print("Question: {}".format(result["question"]))
    print("Answer: {}".format(result["answer"]))
    print("----------")
    print(result)

Question: Which actor plays Tony Stark in the Marvel movies?
Answer: Robert Downey Jr.
----------
{'answer': 'Robert Downey Jr.', 'question': 'Which actor plays Tony Stark in the Marvel movies?', '@search.score': 0.9419676, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}
Question: In the Marvel Cinematic Universe, who is the actor that brings Tony Stark to life?
Answer: Robert Downey Jr.
----------
{'answer': 'Robert Downey Jr.', 'question': 'In the Marvel Cinematic Universe, who is the actor that brings Tony Stark to life?', '@search.score': 0.9334844, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}
Question: Who brings the character of Tony Stark to life in the Marvel Cinematic Universe?
Answer: Robert Downey Jr.
----------
{'answer': 'Robert Downey Jr.', 'question': 'Who brings the character of Tony Stark to life in the Marvel Cinematic Universe?', '@search.score': 0.930987, '@search.reranker_score': None, '

As you have seen, caching questions and answers lets us return the correct answer to a question that has never been asked in exactly this way before.
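
One way to turn search results like the ones printed above into a cache decision is to take the highest-scoring hit and accept it only above a similarity threshold. The sketch below assumes results shaped like those dictionaries. `best_cached_answer` and the threshold values are illustrative assumptions, not part of the Azure SDK.

```python
def best_cached_answer(results, min_score=0.95):
    """Return the answer of the highest-scoring hit, or None on a cache miss.

    `results` is a list of dicts with "@search.score" and "answer" keys.
    """
    best = max(results, key=lambda r: r["@search.score"], default=None)
    if best is None or best["@search.score"] <= min_score:
        return None
    return best["answer"]


# Scores copied from the search output shown above.
sample_results = [
    {"answer": "Robert Downey Jr.", "@search.score": 0.9419676},
    {"answer": "Robert Downey Jr.", "@search.score": 0.9334844},
    {"answer": "Robert Downey Jr.", "@search.score": 0.930987},
]

strict = best_cached_answer(sample_results, min_score=0.95)
relaxed = best_cached_answer(sample_results, min_score=0.90)
print(strict, relaxed)
```

With these scores, a threshold of 0.95 counts as a miss while 0.90 counts as a hit. A lower threshold saves more tokens but raises the risk of returning the cached answer of a merely similar question, so the value is worth tuning against your own questions.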

# YOUR Mission: 
Adjust the function below and reuse it in the main.py file later to deploy to Azure and to update your service. 
Ensure the answers provided are correct and in the correct format.

- for incoming questions, create a vector embedding
- check whether the answer is already in the cache
- if yes, 
    - return the answer
    - put the new question & answer in the cache as well
- if no, 
    - reach out to the LLM to get the answer
    - then put the question & answer in the cache in case a similar question comes up again
- measure the tokens used 
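
For the last step, a chat completion response from the `openai` Python SDK carries a `usage` object with `prompt_tokens` and `completion_tokens`. Below is a minimal sketch of copying these into values for the `promptTokensUsed` and `completionTokensUsed` fields of `Answer`. `usage_to_fields` is a hypothetical helper, and the `SimpleNamespace` object merely stands in for a real response so the snippet runs without calling the service.

```python
from types import SimpleNamespace


def usage_to_fields(response):
    """Map a chat completion response's usage onto the Answer field names.

    Returns zeros when no usage is available, e.g. when the answer was
    served from the cache and no chat completion request was made.
    """
    usage = getattr(response, "usage", None)
    if usage is None:
        return {"promptTokensUsed": 0, "completionTokensUsed": 0}
    return {
        "promptTokensUsed": usage.prompt_tokens,
        "completionTokensUsed": usage.completion_tokens,
    }


# Stand-in for a real response; the numbers are made up for illustration.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=42, completion_tokens=5, total_tokens=47)
)

fields = usage_to_fields(fake_response)
cache_hit_fields = usage_to_fields(None)
print(fields, cache_hit_fields)
```

The resulting dictionary could be passed along when building the answer, for example `Answer(answer=..., **fields)`. Note that a cache hit still spends tokens on the embedding request, which this sketch does not count.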



In [30]:
from azure.search.documents.models import (
    VectorizedQuery
)
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential(os.environ["AZURE_AI_SEARCH_KEY"]) if len(os.environ["AZURE_AI_SEARCH_KEY"]) > 0 else DefaultAzureCredential()

async def ask_question(ask: Ask):
    """
    Ask a question
    """   
    print (ask.question)
    index_name = "question-semantic-index"

    # create new searchclient using our new index for the questions
    search_client = SearchClient(
        endpoint=os.environ["AZURE_AI_SEARCH_ENDPOINT"], 
        index_name=index_name,
        credential=credential
    )

    # create a vectorized query based on the question
    vector = VectorizedQuery(vector=get_embedding(ask.question), k_nearest_neighbors=5, fields="vector")

    # run the vector query against the index to retrieve similar cached questions
    found_questions = list(search_client.search(
        search_text=None,
        query_type="semantic",
        semantic_configuration_name="question-semantic-config",
        vector_queries=[vector],
        select=["question", "answer"],
        top=5
    ))


    # check if the question &  answer is in the cache already
    questionMatchCount = len(found_questions)
    tmp_score = 0
    cached_answer = ""

    print("Found questions: ", found_questions)
    for result in found_questions:
        if result["@search.score"] > tmp_score:
            cached_answer = result["answer"]
            tmp_score = result["@search.score"]

    print(tmp_score)
    
    if(questionMatchCount>0 and tmp_score > 0.95):
        print ("Found a match in the cache.")
        questions = [{
            "question": ask.question,
            "answer": cached_answer
        }]
        # generate embedding for the question
        index = search_client.get_document_count()  + 1
        for question in questions:
            question["id"] = str(index)
            question["vector"] = get_embedding(question["question"])
            index = index + 1
        # upload question to vector store
        result = search_client.upload_documents(questions)
        print(f"Successfully loaded {len(questions)} questions into Azure AI Search index.")
        answer = Answer(answer=cached_answer)
        return answer
    else:
        print("No match found in the cache.")        
        
        #   reach out to the llm to get the answer. 
        print('Sending a request to LLM')
        start_phrase = ask.question
        # the instruction goes in the system message, the question in the user message
        messages = [{"role": "system", "content": "Answer this question with a very short answer. Don't answer with a full sentence, and do not format the answer."},
                    {"role": "user", "content": start_phrase}]
        
        response = client.chat.completions.create(
            model=deployment_name,
            messages=messages,
        )
        answer = Answer(answer=response.choices[0].message.content)

        #  put the new question & answer in the cache as well
        questions = [{
            "question": ask.question,
            "answer": answer.answer
        }]
        # generate embedding for the question
        index = search_client.get_document_count()  + 1
        for question in questions:
            question["id"] = str(index)
            question["vector"] = get_embedding(question["question"])
            index = index + 1
        # upload question to vector store
        result = search_client.upload_documents(questions)
        print(f"Successfully loaded {len(questions)} new questions into Azure AI Search index.")
        # index was already incremented in the loop, so the stored id is index - 1
        print("Added a new answer and question to the cache: " + answer.answer + " in position " + str(index - 1))
        return answer
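
One possible refinement, offered as a suggestion rather than as part of the lab's reference solution: deriving the document id from `get_document_count() + 1` can collide when two requests arrive at the same time or when the count lags behind recent uploads. A sketch of an alternative is to hash the normalized question text, so the same question always maps to the same key. `question_key` is a hypothetical helper, and the normalization rule (lowercasing and collapsing whitespace) is an assumption you may want to adjust.

```python
import hashlib


def question_key(question: str) -> str:
    """Derive a stable document key from the question text.

    The hex digest contains only digits and the letters a-f, which are
    valid characters for an Azure AI Search document key.
    """
    normalized = " ".join(question.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


key_a = question_key("Who is Tony Stark in the MCU?")
key_b = question_key("  who is  Tony Stark in the MCU? ")
key_c = question_key("Who is Thor in the MCU?")
print(key_a == key_b, key_a == key_c)
```

Because uploading a document with an existing key replaces that document, identical questions would then be stored once instead of piling up as duplicates.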

Use this snippet to try your function with several questions.

In [31]:

ask = Ask(question="Who brings the character of Tony Stark to life in the Marvel Cinematic Universe? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo", type=QuestionType.multiple_choice)
answer = await ask_question(ask)
print('Answer:', answer)

ask = Ask(question="In the Marvel Cinematic Universe, who is the actor that brings Tony Stark to life? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo", type=QuestionType.multiple_choice)
answer = await ask_question(ask)
print('Answer:', answer)

ask = Ask(question="Which actor plays Tony Stark in the Marvel movies? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo", type=QuestionType.multiple_choice)
answer = await ask_question(ask)
print('Answer:', answer)

ask = Ask(question="Which actor plays Thor in the Marvel movies? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo", type=QuestionType.multiple_choice)
answer = await ask_question(ask)
print('Answer:', answer)

ask = Ask(question="In the Marvel Cinematic Universe, who is the actor that brings Thor to life? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo", type=QuestionType.multiple_choice)
answer = await ask_question(ask)
print('Answer:', answer)

ask = Ask(question="In the Marvel Cinematic Universe, who is the actor that brings Captain America to life? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo", type=QuestionType.multiple_choice)
answer = await ask_question(ask)
print('Answer:', answer)

Who brings the character of Tony Stark to life in the Marvel Cinematic Universe? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo
Found questions:  [{'answer': 'Robert Downey Jr.', 'question': 'In the Marvel Cinematic Universe, who is the actor that brings Tony Stark to life? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo', '@search.score': 0.9827833, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}, {'answer': 'Robert Downey Jr.', 'question': 'Which actor plays Tony Stark in the Marvel movies? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo', '@search.score': 0.965675, '@search.reranker_score': None, '@search.highlights': None, '@search.captions': None}, {'answer': 'Robert Downey Jr.', 'question': 'In the Marvel Cinematic Universe, who is the actor that brings Thor to life? Robert Downey Jr., Chris Hemsworth, Chris Evans, Mark Ruffalo', '@search.score': 0.96238065, '@search.reranker_score': None, '@search.h

Make sure you transfer your code changes into main.py (or additional files). Then redeploy your container using this command.
```
bash ./azd-hooks/deploy.sh phase4 $AZURE_ENV_NAME
```
Make sure to provide the URL of your endpoint in the team portal!