# Queries with and without Azure OpenAI

So far, you have loaded your search engine **from two different data sources into two different indexes**. In this notebook we will try some example queries, then use the Azure OpenAI service to see whether we can get a good answer to the user's query.

The idea is that a user can ask a question about Computer Science (first datasource/index) or about Covid (second datasource/index), and the engine will respond accordingly.
This **Multi-Index** demo mimics the scenario where a company loads many types of documents about completely different topics, and the search engine must respond with the most relevant results.

## Set up variables

In [1]:
import os
import urllib
import requests
import random
import json
from collections import OrderedDict
from IPython.display import display, HTML, Markdown
from typing import List
from operator import itemgetter

# LangChain Imports needed
from langchain_openai import AzureChatOpenAI
from langchain_openai import AzureOpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.retrievers import BaseRetriever
from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.runnables import ConfigurableField


# Our own libraries needed
from common.prompts import DOCSEARCH_PROMPT
from common.utils import get_search_results

from dotenv import load_dotenv
load_dotenv("credentials.env")

True

In [2]:
# Setup the Payloads header
headers = {'Content-Type': 'application/json','api-key': os.environ['AZURE_SEARCH_KEY']}
params = {'api-version': os.environ['AZURE_SEARCH_API_VERSION']}

## Multi-Index Search queries

In [3]:
# Text-based Indexes that we are going to query (from Notebook 01 and 02)
index1_name = "cogsrch-index-files-ykilinc"
index2_name = "cogsrch-index-csv-ykilinc"
indexes = [index2_name, index1_name]

Try questions that you think might be answered by computer science papers from 2020-2021 or by medical publications about COVID from 2020-2021. Try comparing the results with the open version of ChatGPT.
The idea is that answers generated with Azure OpenAI only look at the information contained in these publications.

**Example Questions you can ask**:
- What is CLP?
- How do Markov chains work?
- What are some examples of reinforcement learning?
- What are the main risk factors for Covid-19?
- What medicine reduces inflammation in the lungs?
- Why doesn't Covid affect kids as much as adults?
- Does chloroquine really work against Covid?
- Who won the 1994 soccer World Cup? (This question should yield no answer if the system is correctly grounded.)

In [4]:
QUESTION = "Tell me about large stable models problem??"

### Search on both indexes individually and aggregate results

#### **Note**: 
In order to standardize the indexes, **there must be 6 mandatory fields present in each index**: `id, title, name, location, chunk, chunkVector`. This way every document can be treated the same throughout the code. Also, **all indexes must have a semantic configuration**.
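As a sanity check before querying, both requirements can be verified against an index definition. This is a minimal sketch, not part of the accelerator: `check_index_schema` is a hypothetical helper, and it assumes the JSON returned by `GET {endpoint}/indexes/{name}` in a recent REST API version, where semantic settings live under a `semantic` key with a `configurations` list (older preview versions used a different key).

```python
from typing import Any, Dict, List

# The six fields this notebook expects in every index (from the note above).
REQUIRED_FIELDS = ["id", "title", "name", "location", "chunk", "chunkVector"]


def check_index_schema(index_definition: Dict[str, Any]) -> List[str]:
    """Return the problems found in an index definition (an empty list means OK).

    Assumes a dict with a "fields" list of {"name": ...} entries and an
    optional "semantic" section, as in recent Azure AI Search REST versions.
    """
    problems = []
    field_names = {field["name"] for field in index_definition.get("fields", [])}
    for required in REQUIRED_FIELDS:
        if required not in field_names:
            problems.append(f"missing field: {required}")

    # Every index must also expose at least one semantic configuration.
    semantic = index_definition.get("semantic") or {}
    if not semantic.get("configurations"):
        problems.append("no semantic configuration")
    return problems


# Hypothetical, hand-written index definition used for illustration.
example = {
    "fields": [{"name": n} for n in REQUIRED_FIELDS],
    "semantic": {"configurations": [{"name": "my-semantic-config"}]},
}
print(check_index_schema(example))  # []
```

Against a live index you could pass `requests.get(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index1_name, headers=headers, params=params).json()` instead of the hand-written example.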

We are going to use Hybrid Queries: Text + Vector Search combined for optimal results!
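In a hybrid query, Azure AI Search runs the text and vector searches separately and merges the two ranked lists with Reciprocal Rank Fusion (RRF) before the semantic reranker reorders the top results. The sketch below illustrates RRF on plain lists of document ids. It is a simplified illustration of the technique, not the service's internal code, and the constant `k=60` is an assumption borrowed from the commonly cited RRF formulation.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over every list it appears in
    (rank is 1-based), so documents found by both searches tend to rise.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ids returned by the text query and by the vector query.
text_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
print(reciprocal_rank_fusion([text_hits, vector_hits]))  # ['a', 'c', 'b', 'd']
```

Because RRF only uses ranks, it does not need the text and vector scores to be on the same scale, which is why it is a convenient way to combine two very different retrieval methods.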

In [5]:
agg_search_results = dict()
k = 10

for index in indexes:
    search_payload = {
        "search": QUESTION, # Text query
        "select": "id, title, name, location, chunk",
        "queryType": "semantic",
        "vectorQueries": [{"text": QUESTION, "fields": "chunkVector", "kind": "text", "k": k}], # Vector query
        "semanticConfiguration": "my-semantic-config",
        "captions": "extractive",
        "answers": "extractive",
        "count": True,  # JSON boolean: ask the service to return @odata.count
        "top": k
    }

    r = requests.post(os.environ['AZURE_SEARCH_ENDPOINT'] + "/indexes/" + index + "/docs/search",
                     data=json.dumps(search_payload), headers=headers, params=params)
    print(r.status_code)
    r.raise_for_status()  # fail fast on 4xx/5xx responses instead of on a missing key below

    search_results = r.json()
    agg_search_results[index]=search_results
    print("Index:", index, "Results Found: {}, Results Returned: {}".format(search_results['@odata.count'], len(search_results['value'])))

200
Index: cogsrch-index-csv-ykilinc Results Found: 24, Results Returned: 10
200
Index: cogsrch-index-files-ykilinc Results Found: 38, Results Returned: 10


In [6]:
# agg_search_results

### Display the top results (from both searches) based on the score

In [9]:
display(HTML('<h4>Top Answers</h4>'))

for index,search_results in agg_search_results.items():

    for result in search_results['@search.answers']:
        if result['score'] > 0.5: # Show answers that are at least 50% of the max possible score=1
            display(HTML('<h5>' + 'Answer - score: ' + str(round(result['score'],2)) + '</h5>'))
            display(HTML(result['text']))

            
print("\n\n")
display(HTML('<h4>Top Results</h4>'))

content = dict()
ordered_content = OrderedDict()


for index,search_results in agg_search_results.items():
    for result in search_results['value']:
        if result['@search.rerankerScore'] > 1:  # Keep results scoring above 1, i.e. 25% of the max possible reranker score of 4
            content[result['id']]={
                                    "title": result['title'],
                                    "chunk": result['chunk'], 
                                    "name": result['name'], 
                                    "location": result['location'] ,
                                    "caption": result['@search.captions'][0]['text'],
                                    "score": result['@search.rerankerScore'],
                                    "index": index
                                    }
    
# After the results have been filtered, sort them by score and store them in an ordered dict
for doc_id in sorted(content, key=lambda x: content[x]["score"], reverse=True):
    ordered_content[doc_id] = content[doc_id]
    url = str(ordered_content[doc_id]['location']) + os.environ['BLOB_SAS_TOKEN']
    title = str(ordered_content[doc_id]['title']) if ordered_content[doc_id]['title'] else ordered_content[doc_id]['name']
    score = str(round(ordered_content[doc_id]['score'], 2))
    display(HTML('<h5><a href="' + url + '">' + title + '</a> - score: ' + score + '</h5>'))
    display(HTML(ordered_content[doc_id]['caption']))






In [8]:
ordered_content

OrderedDict([('ae98cb3d4f4c_aHR0cHM6Ly9ibG9ic3RvcmFnZWRqeW02ZWl6MmpobGsuYmxvYi5jb3JlLndpbmRvd3MubmV0L3BkZnMvMDAwMjAwMXYxMXlrLnBkZg2_chunks_0',
              {'title': 'arXiv:cs/0002001v1  [cs.LO]  3 Feb 2000',
               'chunk': 'ar\nX\n\niv\n:c\n\ns/\n00\n\n02\n00\n\n1v\n1 \n\n [\ncs\n\n.L\nO\n\n] \n 3\n\n F\neb\n\n 2\n00\n\n0\n\nComputing large and small stable models1\n\nMiros law Truszczyński\n\nUniversity of Kentucky\n\nLexington, KY 40506-0046, USA\n\nmirek@cs.uky.edu\n\nAbstract\n\nIn this paper, we focus on the problem of existence and computing of small and large stable\nmodels. We show that for every fixed integer k, there is a linear-time algorithm to decide\nthe problem LSM (large stable models problem): does a logic program P have a stable\nmodel of size at least |P | − k. In contrast, we show that the problem SSM (small stable\nmodels problem) to decide whether a logic program P has a stable model of size at most\nk is much harder. We present two algorithms for this

### Comments on Query results

As seen above, the semantic re-ranking feature of Azure AI Search works well. It sometimes returns direct answers, and it also returns the top results along with the corresponding file and the passage where the answer is likely located.

Let's see if we can make this better with Azure OpenAI

# Using Azure OpenAI

To use OpenAI to get a better answer to our question, the thought process is simple: let's **give the answers and the document content from the search results to the GPT model as context, and let it produce a better response**. This is what RAG (Retrieval Augmented Generation) is about.
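The retrieve-then-generate flow fits in a few lines. In this minimal sketch, `rag_answer`, `retrieve`, and `generate` are hypothetical names: `retrieve` stands in for the Azure AI Search query and `generate` for the Azure OpenAI chat model, so the structure can run without any service.

```python
from typing import Callable, List


def rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> str:
    """Retrieval Augmented Generation in its simplest form."""
    chunks = retrieve(question)  # 1. search for relevant chunks
    context = "\n\n".join(chunks)  # 2. put them into the prompt as context
    prompt = (
        "Answer the question based ONLY on the following context:\n"
        f"{context}\n\nQuestion: {question}"
    )
    return generate(prompt)  # 3. let the model answer from that context


# Fakes for illustration: a one-chunk "search engine" and a "model" that echoes its prompt.
def fake_retrieve(question: str) -> List[str]:
    return ["LSM: does a logic program P have a stable model of size at least |P| - k?"]


def echo_model(prompt: str) -> str:
    return prompt


print(rag_answer("What is the LSM problem?", fake_retrieve, echo_model))
```

Echoing the prompt is a cheap way to inspect exactly what the model would receive before wiring in a real LLM.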

Now, before we do this, we need to understand a few things first:

1) Chaining and Prompt Engineering
2) Embeddings

We will use a library called **LangChain** that wraps a lot of boilerplate code.
LangChain does much of the prompt engineering for us under the hood; for more information see [here](https://python.langchain.com/en/latest/index.html).

In [10]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]

**Important Note**: Starting now, we will utilize OpenAI models. Please ensure that you have deployed the following models within the Azure OpenAI portal:

- text-embedding-ada-002 (or newer)
- gpt-35-turbo (1106 or newer)
- gpt-4-turbo (1106 or newer)

Reference for Azure OpenAI models (regions, limits, dimensions, etc): [HERE](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/models)

## A gentle intro to chaining LLMs and prompt engineering

Chains refer to sequences of calls - whether to an LLM, a tool, or a data preprocessing step.

Azure OpenAI is one LLM provider you can use, but there are others, such as Cohere, Hugging Face, etc.

Chains can be simple (i.e. Generic) or specialized (i.e. Utility).

* Generic: A single LLM is the simplest chain. It takes an input prompt and the name of the LLM, and then uses the LLM for text generation (i.e. output for the prompt).

Here’s an example:

In [12]:
COMPLETION_TOKENS = 2000
llm = AzureChatOpenAI(deployment_name=os.environ["GPT4_DEPLOYMENT_NAME"], 
                      temperature=0, 
                      max_tokens=COMPLETION_TOKENS)

In [13]:
output_parser = StrOutputParser()
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an assistant that gives thorough responses to users."),
    ("user", "{input}. Give your response in {language}")
])

The `|` symbol works like a Unix pipe operator: it chains the components together, feeding the output of one component as input into the next.
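To see what `|` composition does, here is a toy re-creation in plain Python. It is only an analogy: `Step` is a hypothetical class, not LangChain's implementation. In LangChain, piping runnables produces a `RunnableSequence`, which also supports batching, streaming, and async calls that this toy omits.

```python
from typing import Any, Callable


class Step:
    """Toy stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func: Callable[[Any], Any]):
        self.func = func

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that runs a, then feeds its output into b.
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value: Any) -> Any:
        return self.func(value)


# Hypothetical three-stage pipeline mirroring prompt | llm | output_parser.
fill_prompt = Step(lambda inputs: f"{inputs['input']} (answer in {inputs['language']})")
fake_llm = Step(lambda text: {"content": text.upper()})  # pretends to be a chat model
parse = Step(lambda message: message["content"])  # pulls out the text, like StrOutputParser

toy_chain = fill_prompt | fake_llm | parse
print(toy_chain.invoke({"input": "hello", "language": "Spanish"}))  # HELLO (ANSWER IN SPANISH)
```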

In [12]:
pip show openai

Name: openai
Version: 1.30.1
Summary: The official Python library for the openai API
Home-page: 
Author: 
Author-email: OpenAI <support@openai.com>
License: 
Location: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages
Requires: anyio, distro, httpx, pydantic, sniffio, tqdm, typing-extensions
Required-by: langchain-openai, semantic-kernel
Note: you may need to restart the kernel to use updated packages.


In [14]:
pip show langchain_openai

Name: langchain-openai
Version: 0.1.7
Summary: An integration package connecting OpenAI and LangChain
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages
Requires: langchain-core, openai, tiktoken
Required-by: 
Note: you may need to restart the kernel to use updated packages.


In [15]:
chain = prompt | llm | output_parser

In [16]:
%%time
display(Markdown(chain.invoke({"input": QUESTION, "language": "Spanish"})))

El problema de los modelos grandes y estables se refiere a un desafío en el campo de la inteligencia artificial y el aprendizaje automático. En términos generales, se refiere a la dificultad de construir y entrenar modelos de aprendizaje automático que sean tanto grandes (es decir, capaces de manejar y procesar una gran cantidad de datos) como estables (es decir, capaces de producir resultados consistentes y confiables a lo largo del tiempo).

Este problema surge debido a varias razones. Primero, a medida que un modelo se vuelve más grande y complejo, también se vuelve más difícil de entrenar de manera efectiva. Esto se debe a que los modelos más grandes a menudo requieren más datos para entrenar y más tiempo y recursos computacionales para procesar esos datos.

En segundo lugar, a medida que un modelo se vuelve más grande, también puede volverse menos estable. Esto se debe a que los modelos más grandes pueden ser más susceptibles a problemas como el sobreajuste, donde el modelo se ajusta tan estrechamente a los datos de entrenamiento que se desempeña mal en los datos de prueba o en los datos nuevos.

Por último, los modelos grandes y estables también pueden ser difíciles de mantener y actualizar. A medida que se recopilan y procesan más datos, el modelo puede necesitar ser reentrenado o ajustado para mantener su precisión y eficacia.

En resumen, el problema de los modelos grandes y estables es un desafío importante en el campo de la inteligencia artificial y el aprendizaje automático, y los investigadores están trabajando constantemente para desarrollar nuevas técnicas y enfoques para abordarlo.

CPU times: user 53.9 ms, sys: 4.05 ms, total: 57.9 ms
Wall time: 11.9 s


**Note**: this is the first time you use OpenAI in this Accelerator, so if you get a "Resource not found" error, it is most likely because the name of your OpenAI model deployment differs from the environment variable used above, `os.environ["GPT4_DEPLOYMENT_NAME"]`.

Great! Now you know how to create a simple prompt and use a chain to answer a general question using only the knowledge of the GPT model.

It is important to note that generic chains are rarely used on their own. More often they serve as building blocks for Utility chains (as we will see next). Also note that we are NOT using our documents or the Azure AI Search results yet, only the knowledge the GPT model acquired from its training data. This is why the answer above treats the large stable models problem as a general machine-learning challenge rather than the logic-programming problem described in the indexed paper.

**The second type of Chains are Utility:**

* Utility: These are specialized chains, composed of many building blocks to help solve a specific task. For example, LangChain supports some end-to-end chains (such as `create_retrieval_chain` for QnA document retrieval, summarization, etc.).

In this workshop we will build our own chain, to dig deeper and solve our use case of enhancing the results of Azure AI Search.


But before building that utility chain, let's first review the concepts of Embeddings, Vector Search, and RAG.

## Embeddings and Vector Search

From the Azure OpenAI documentation ([HERE](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/embeddings?tabs=python)), an embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information-dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with the semantic similarity between the two inputs in their original format. For example, if two texts are similar, their vector representations should also be similar.
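To make "distance correlates with similarity" concrete, the sketch below computes cosine similarity, a common way to compare embeddings, on made-up three-dimensional vectors. The vectors and their labels are invented for illustration; real `text-embedding-ada-002` embeddings have 1,536 dimensions and come from the embeddings API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" of three text snippets.
covid_risk = [0.9, 0.1, 0.0]  # "main risk factors for Covid-19"
covid_age = [0.8, 0.2, 0.1]   # "why Covid affects adults more than kids"
markov = [0.0, 0.2, 0.9]      # "how Markov chains work"

print(round(cosine_similarity(covid_risk, covid_age), 3))  # 0.984
print(round(cosine_similarity(covid_risk, markov), 3))     # 0.024
```

The two Covid snippets point in nearly the same direction, while the Markov snippet is almost orthogonal to them, which is the property vector search relies on.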

### Why Do We Need Vectors?

Vectors are essential for several reasons:

- **Semantic Richness**: They convert the semantic meaning of text into mathematical vectors, capturing nuances that simple keyword searches miss. This makes them incredibly powerful for understanding and processing language.
- **Human-like Searching**: Searching using vector distances mimics the human approach to finding information based on context and meaning, rather than relying solely on exact word matches.
- **Efficiency in Scale**: Vector representations allow for efficient handling and searching of large datasets. By reducing complex text to numerical vectors, algorithms can quickly sift through vast amounts of information.
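The sketch below shows the simplest form of vector retrieval: score every stored chunk against the query and keep the `k` best. Normalizing vectors to unit length first makes the dot product equal to cosine similarity. This brute-force loop is only for illustration; Azure AI Search can instead use an approximate nearest-neighbour index (HNSW) so it does not have to score every vector, which is what keeps vector search fast on large datasets. The corpus and chunk ids are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def top_k(query: Sequence[float], corpus: Dict[str, Sequence[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Brute-force nearest-neighbour search: score every chunk, keep the k best."""
    q = normalize(query)
    scored = (
        (doc_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for doc_id, vec in corpus.items()
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical chunk embeddings keyed by chunk id.
corpus = {
    "covid-chunk-1": [0.9, 0.1, 0.0],
    "covid-chunk-2": [0.8, 0.2, 0.1],
    "cs-chunk-1": [0.0, 0.2, 0.9],
}
for doc_id, score in top_k([1.0, 0.1, 0.0], corpus, k=2):
    print(doc_id, round(score, 3))
```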

### Understanding LLM Tokens' Context Limitation

Large Language Models (LLMs) like GPT come with a token limit for each input, which poses a challenge when dealing with lengthy documents or extensive data sets. This limitation restricts the model's ability to understand and generate responses based on the full context of the information provided. It becomes crucial, therefore, to devise strategies that can effectively manage and circumvent this limitation to leverage the full power of LLMs.

To address this challenge, the solution incorporates several key steps:

1. **Segmenting Documents**: Breaking down large documents into smaller, manageable segments.
2. **Vectorization of Chunks**: Converting these segments into vectors, making them compatible with vector-based search techniques.
3. **Hybrid Search**: Employing both vector and text search methods to pinpoint the most relevant segments in relation to the query.
4. **Optimal Context Provision**: Presenting the LLM with the most pertinent segments, ensuring a balance between detail and brevity to stay within token limits.
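Step 1 can be illustrated with a simple word-based splitter that overlaps consecutive chunks, so text cut at a boundary keeps some of its surrounding context. This is a hypothetical sketch: the indexes in this notebook were chunked by Azure AI Search itself, and production splitters usually count tokens or characters rather than words.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 100, overlap: int = 20) -> List[str]:
    """Split text into chunks of `chunk_size` words, each sharing `overlap`
    words with the previous chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))  # ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Larger overlaps preserve more context across boundaries at the cost of storing and embedding more duplicated text.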


Our ultimate goal is to rely solely on vector indexes and hybrid searches (vector + text). While it is possible to manually code parsers with OCR for various file types and develop a scheduler to synchronize data with the index, there is a more efficient alternative: **Azure AI Search has automated chunking strategies and vectorization**.

It's important to note that **document segmentation and vectorization have already been completed in Azure AI Search**, as seen in the `ordered_content` dictionary. This pre-processing step simplifies subsequent operations, ensuring rapid response times and adherence to the token limits of the chosen OpenAI model.


So really, our only job now is to make sure that the results from the Azure AI Search queries fit on the LLM context size, and then let it do its magic.
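Making the results fit the LLM context size can be as simple as taking chunks in ranked order until a token budget is used up. In this minimal sketch, `fit_to_budget` and `approx_tokens` are hypothetical helpers, and the 4-characters-per-token estimate is a rough rule of thumb for English text; a real tokenizer such as `tiktoken` gives exact counts for OpenAI models.

```python
from typing import Any, Dict, List


def approx_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token); an assumption, not exact."""
    return max(1, len(text) // 4)


def fit_to_budget(chunks: List[Dict[str, Any]], max_tokens: int) -> List[Dict[str, Any]]:
    """Keep the highest-ranked chunks (input assumed sorted best-first)
    until the next one would exceed the token budget."""
    selected, used = [], 0
    for chunk in chunks:
        cost = approx_tokens(chunk["chunk"])
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected


# Hypothetical ranked chunks of ~100 estimated tokens each.
ranked = [{"chunk": "a" * 400}, {"chunk": "b" * 400}, {"chunk": "c" * 400}]
print(len(fit_to_budget(ranked, max_tokens=250)))  # 2
```

Stopping at the first chunk that does not fit, rather than skipping ahead to smaller ones, keeps the context strictly in relevance order. In practice the budget should also leave room for the prompt template, the question, and the `COMPLETION_TOKENS` reserved for the answer.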

In [17]:
index1_name = "cogsrch-index-files-ykilinc"
index2_name = "cogsrch-index-csv-ykilinc"
indexes = [index1_name, index2_name]

To avoid duplicating code, we have moved much of the code used above into functions in the `common/utils.py` and `common/prompts.py` files. This way we can reuse these functions in the app that we will build later.

`get_search_results()` performs the multi-index search and returns the combined, ordered list of documents/chunks.
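The body of `get_search_results()` is not shown in this notebook, but the merge step it describes resembles the display cell earlier: filter each index's hits by reranker score, remember which index each hit came from, and sort across indexes. The function below is a hypothetical re-implementation of only that merge step, operating on already-fetched `value` lists like those stored in `agg_search_results`; it is not the code in `common/utils.py`.

```python
from collections import OrderedDict
from typing import Any, Dict, List


def merge_index_results(
    results_by_index: Dict[str, List[Dict[str, Any]]],
    reranker_threshold: float = 1.0,
) -> Dict[str, Dict[str, Any]]:
    """Keep hits whose semantic reranker score exceeds the threshold, tag each
    with its source index, and order everything by score (highest first)."""
    kept: Dict[str, Dict[str, Any]] = {}
    for index_name, hits in results_by_index.items():
        for hit in hits:
            score = hit.get("@search.rerankerScore", 0.0)
            if score > reranker_threshold:
                kept[hit["id"]] = {"chunk": hit.get("chunk"), "score": score, "index": index_name}
    ordered = sorted(kept.items(), key=lambda item: item[1]["score"], reverse=True)
    return OrderedDict(ordered)


# Hypothetical hits from two indexes.
hits = {
    "idx-csv": [
        {"id": "c1", "chunk": "covid text", "@search.rerankerScore": 2.5},
        {"id": "c2", "chunk": "weak match", "@search.rerankerScore": 0.7},
    ],
    "idx-files": [{"id": "f1", "chunk": "LSM text", "@search.rerankerScore": 3.1}],
}
merged = merge_index_results(hits, reranker_threshold=1.0)
print(list(merged))  # ['f1', 'c1']
```

Sorting by the reranker score across indexes works because that score is on the same 0 to 4 scale regardless of which index a result came from.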

In [18]:
k = 10  # play with this parameter and see the quality of the final answer
ordered_results = get_search_results(QUESTION, indexes, k=k, reranker_threshold=1)
print("Number of results:",len(ordered_results))

Number of results: 10


In [19]:
# Inspect the ordered results (comment out the line below to hide them)
ordered_results

OrderedDict([('b736df85507f_aHR0cHM6Ly9ibG9ic3RvcmFnZWRqeW02ZWl6MmpobGsuYmxvYi5jb3JlLndpbmRvd3MubmV0L3BkZnMvMDAwMjAwMXYxMXlrLnBkZg2_chunks_3',
              {'title': 'arXiv:cs/0002001v1  [cs.LO]  3 Feb 2000',
               'name': '0002001v11yk.pdf',
               'chunk': 'f(|y|)|x|p (|z| stands for the length of a string\nz ∈ Σ∗). The class of fixed-parameter tractable problems will be denoted by FPT. Clearly,\nif a parametrized problem L is in FPT, each of the associated fixed-parameter problems\nLy is solvable in polynomial time by an algorithm whose exponent does not depend on the\nvalue of the parameter y. It is known (see [7]) that the vertex cover problem is in FPT.\n\nThere is substantial evidence available now to support a conjecture that some parametrized\nproblems whose fixed-parameter versions are in P are not fixed-parameter tractable. To\nstudy and compare complexity of parametrized problems Downey and Fellows proposed\nthe following notion of reducibility2. A paramet

Now let's create a Prompt Template that grounds the response only in the chunks retrieved by our hybrid Azure AI Search query.

In [20]:
template = """Answer the question thoroughly, based **ONLY** on the following context:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

In [21]:
%%time 
# Creation of our custom chain
chain = prompt | llm | output_parser

try:
    display(Markdown(chain.invoke({"question": QUESTION, "context": ordered_results})))
except Exception as e:
    print(e)

The Large Stable Models problem (LSM) is a computational problem that focuses on the existence and computation of large stable models in a logic program. Given a finite propositional logic program P and an integer k, the LSM problem is to decide whether there is a stable model of P of size at least |P| - k. The LSM problem is considered fixed-parameter tractable. This means that for every fixed integer k, there is a linear-time algorithm to decide the LSM problem. This is a significant improvement over the straightforward algorithm. The LSM problem is related to the study of fixed-parameter tractability of problems occurring in the area of nonmonotonic reasoning.

CPU times: user 21.9 ms, sys: 279 µs, total: 22.2 ms
Wall time: 9.6 s
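The chain above passes `ordered_results` straight into `{context}`, so the model sees the `repr` of an `OrderedDict`, including document ids and escaped newlines. One alternative, sketched below with a hypothetical `format_context` helper, is to render each chunk as a numbered passage labelled with its source file, which may make it easier for the model to cite sources. Whether it improves answer quality for these indexes is untested here.

```python
from typing import Any, Dict


def format_context(ordered_results: Dict[str, Dict[str, Any]]) -> str:
    """Render retrieved chunks as numbered, source-labelled passages."""
    passages = []
    for position, (doc_id, doc) in enumerate(ordered_results.items(), start=1):
        # Prefer the file name, then the title, then fall back to the chunk id.
        source = doc.get("name") or doc.get("title") or doc_id
        passages.append(f"[{position}] Source: {source}\n{doc['chunk']}")
    return "\n\n".join(passages)


# Hypothetical results shaped like the entries of ordered_results.
sample = {
    "id1": {"name": "0002001v11yk.pdf", "chunk": "Computing large and small stable models"},
    "id2": {"title": "Some title", "chunk": "Another passage"},
}
print(format_context(sample))
```

With the notebook's objects, the call would become `chain.invoke({"question": QUESTION, "context": format_context(ordered_results)})`.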


# Summary
##### By using OpenAI, the answers to user questions are much better than taking just the results from Azure AI Search. In summary:
- Utilizing Azure AI Search, we conduct a multi-index hybrid search that identifies the top chunks of documents from each index.
- Subsequently, Azure OpenAI utilizes these extracted chunks as context, comprehends the content, and employs it to deliver optimal answers.
- Best of two worlds!

##### Important observations on this notebook:

1) Answers with GPT-3.5 are lower quality but faster
2) Answers with GPT-3.5 sometimes failed to provide citations in the right format
3) Answers with GPT-4 are high quality but slower
4) Answers with GPT-4 always provide good and diverse citations in the right format
5) Streaming the answers greatly improves the user experience!
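Observation 5 refers to streaming. LangChain runnables, including the `chain` built above, expose a `.stream()` method that yields the answer in fragments as the model generates them. The sketch below shows the consuming side with a hypothetical `render_stream` helper and a fake stream, so it runs offline. Writing each fragment immediately is what makes the response feel faster to users.

```python
import sys
from typing import Iterable, TextIO


def render_stream(chunks: Iterable[str], out: TextIO = sys.stdout) -> str:
    """Write each text fragment as soon as it arrives and return the full answer."""
    pieces = []
    for piece in chunks:
        out.write(piece)
        out.flush()  # push the fragment to the screen immediately
        pieces.append(piece)
    out.write("\n")
    return "".join(pieces)


# Fake stream for illustration; with the notebook's chain you would pass
# chain.stream({"question": QUESTION, "context": ordered_results}) instead.
answer = render_stream(iter(["The LSM problem ", "asks whether ", "..."]))
```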