## Lesson 6: Semantic Search, Building a Q&A System

#### Project environment setup

- Load credentials and relevant Python Libraries

## What if we need something that is very domain-specific?
## What if a general-purpose LLM cannot give us the answer? Typical reasons: we need information outside of the training data; we need to integrate with existing IT systems, databases, and business data; we want to mitigate the risk of hallucinations. (The term for this is grounding LLMs.)

## Naturally, we can add sources to the knowledge base, such as a CSV file or an external database.

In [7]:
# !pip install mplcursors
# !pip install google-cloud-bigquery google-auth
# !pip install python-dotenv
# !pip install ipympl
# !pip install scann

In [8]:
from utilss_6 import authenticate
credentials, PROJECT_ID = authenticate()

#### Enter project details

In [9]:
REGION = 'us-central1'

In [2]:
import os
# API stuff: read the settings from environment variables
# Do not forget to erase
project = os.environ.get("PROJECT_ID")
REGION = 'us-central1'
# project = project
credentials = os.environ.get("credentials")

In [46]:
# have to add this part back too. Important
# from vertexai.language_models import TextEmbeddingModel
from google.cloud import aiplatform
from google.oauth2 import service_account

# Load the credentials explicitly
credentials = service_account.Credentials.from_service_account_file(
    'google_api.json')

# Initialize Vertex AI with the correct credentials
aiplatform.init(project=project, location="us-central1", credentials=credentials)

In [16]:
import vertexai
vertexai.init(project=PROJECT_ID, location=REGION, credentials = credentials)

## Load Stack Overflow questions and answers from BigQuery

In [17]:
import pandas as pd

In [18]:
so_database = pd.read_csv('so_database_app.csv')

In [20]:
print("Shape: " + str(so_database.shape))
so_database

Shape: (2000, 3)


Unnamed: 0,input_text,output_text,category
0,"python's inspect.getfile returns ""<string>""<p>...",<p><code>&lt;string&gt;</code> means that the ...,python
1,Passing parameter to function while multithrea...,<p>Try this and note the difference:</p>\n<pre...,python
2,How do we test a specific method written in a ...,"<p>Duplicate of <a href=""https://stackoverflow...",python
3,how can i remove the black bg color of an imag...,<p>The alpha channel &quot;disappears&quot; be...,python
4,How to extract each sheet within an Excel file...,<p>You need to specify the <code>index</code> ...,python
...,...,...,...
1995,Is it possible to made inline-block elements l...,<p>If this is only for the visual purpose then...,css
1996,Flip Clock code works on Codepen and doesn't w...,<p>You forgot to attach the CSS file for the f...,css
1997,React Native How can I put one view in front o...,<p>You can do it using zIndex for example:</p>...,css
1998,setting fixed width with 100% height of the pa...,<p>You can use <code>width: calc(100% - 100px)...,css


## Load the question embeddings

In [21]:
from vertexai.language_models import TextEmbeddingModel

In [56]:
embedding_model = TextEmbeddingModel.from_pretrained(
    "textembedding-gecko@001")

# managing rate limits

In [24]:
import numpy as np
from utilss_6 import encode_text_to_embedding_batched

- Here is the code that embeds the text.  You can adapt it for use in your own projects.  
- To save on API calls, we've embedded the text already, so you can load it from the saved file in the next cell.

```Python
# Encode the stack overflow data

so_questions = so_database.input_text.tolist()
question_embeddings = encode_text_to_embedding_batched(
            sentences = so_questions,
            api_calls_per_second = 20/60,
            batch_size = 5)
```
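The helper `encode_text_to_embedding_batched` comes from `utilss_6`, and its source is not shown in these notes. As a rough sketch of what such a helper might do, the function below splits the sentences into batches and pauses between calls to stay under a request rate. The names `embed_in_batches` and `fake_embed` are hypothetical stand-ins; in this notebook the real call would presumably be something like `embedding_model.get_embeddings(batch)`.

```python
import time
import numpy as np

def embed_in_batches(sentences, embed_fn, api_calls_per_second=10, batch_size=5):
    """Embed sentences batch by batch, pausing between calls to respect a rate limit.

    embed_fn is an assumed callable: list[str] -> list of equal-length vectors.
    """
    min_interval = 1.0 / api_calls_per_second  # seconds between consecutive calls
    embeddings = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        call_started = time.time()
        embeddings.extend(embed_fn(batch))
        # Sleep for whatever is left of the interval, but not after the last batch.
        if start + batch_size < len(sentences):
            elapsed = time.time() - call_started
            time.sleep(max(0.0, min_interval - elapsed))
    return np.array(embeddings)

def fake_embed(batch):
    # Hypothetical stand-in for a real embedding API: 3 numbers per sentence.
    return [[float(len(s)), float(s.count(" ")), 1.0] for s in batch]

demo_sentences = ["how to concat dataframes", "css flexbox", "python inspect",
                  "sql join", "react native view", "excel sheets", "alpha channel"]
demo_embeddings = embed_in_batches(demo_sentences, fake_embed,
                                   api_calls_per_second=1000, batch_size=3)
print(demo_embeddings.shape)
```

With `api_calls_per_second = 20/60`, as in the cell above, the minimum interval works out to 3 seconds per call, which is why embedding 2000 questions in batches of 5 takes a while.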

In [26]:
# # Encode the stack overflow data

# so_questions = so_database.input_text.tolist()
# question_embeddings = encode_text_to_embedding_batched(
#             sentences = so_questions,
#             api_calls_per_second = 20/60,
#             batch_size = 5)

# Not needed if you run the cells below, which load the precomputed embeddings.
# This computes the embeddings in batches. To save API calls, do not run it.

In [57]:
import pickle
with open('question_embeddings_app.pkl', 'rb') as file:

    # Call the load method to deserialize
    question_embeddings = pickle.load(file)

    print(question_embeddings)


# no need to do this if already done.
# precalculated embeddings

[[-0.03571156 -0.00240684  0.05860338 ... -0.03100227 -0.00855574
  -0.01997405]
 [-0.02024316 -0.0026255   0.01940405 ... -0.02158143 -0.05655403
  -0.01040497]
 [-0.05175979 -0.03712264  0.02699278 ... -0.07055898 -0.0402537
   0.00092099]
 ...
 [-0.00580394 -0.01621097  0.05829635 ... -0.03350992 -0.05343556
  -0.06016821]
 [-0.00436622 -0.02692963  0.03363771 ... -0.01686567 -0.03812337
  -0.02329491]
 [-0.04240424 -0.01633749  0.05516777 ... -0.02697376 -0.01751165
  -0.04558187]]


In [58]:
so_database['embeddings'] = question_embeddings.tolist()

In [59]:
so_database

Unnamed: 0,input_text,output_text,category,embeddings
0,"python's inspect.getfile returns ""<string>""<p>...",<p><code>&lt;string&gt;</code> means that the ...,python,"[-0.03571155667304993, -0.0024068362545222044,..."
1,Passing parameter to function while multithrea...,<p>Try this and note the difference:</p>\n<pre...,python,"[-0.020243162289261818, -0.002625499852001667,..."
2,How do we test a specific method written in a ...,"<p>Duplicate of <a href=""https://stackoverflow...",python,"[-0.05175979062914848, -0.03712264448404312, 0..."
3,how can i remove the black bg color of an imag...,<p>The alpha channel &quot;disappears&quot; be...,python,"[0.02206624671816826, -0.028208276256918907, 0..."
4,How to extract each sheet within an Excel file...,<p>You need to specify the <code>index</code> ...,python,"[-0.05498068407177925, -0.0032414537854492664,..."
...,...,...,...,...
1995,Is it possible to made inline-block elements l...,<p>If this is only for the visual purpose then...,css,"[-0.009190441109240055, -0.01732615754008293, ..."
1996,Flip Clock code works on Codepen and doesn't w...,<p>You forgot to attach the CSS file for the f...,css,"[-0.009033069014549255, -0.0009270847076550126..."
1997,React Native How can I put one view in front o...,<p>You can do it using zIndex for example:</p>...,css,"[-0.005803938489407301, -0.016210969537496567,..."
1998,setting fixed width with 100% height of the pa...,<p>You can use <code>width: calc(100% - 100px)...,css,"[-0.004366223234683275, -0.02692963369190693, ..."


# Q&A over Stack Overflow data

### Question: Combine 3 dataframes in pandas. I have a dataframe like this:

### Accepted answer: You can use pd.concat here.
### Just define and pass all 3 dataframes to...

## Embedded user query (one) --> compute similarity --> Embedded Stack Overflow questions (many)


## Measuring similarity using embeddings:

### Euclidean Distance, Cosine similarity, Dot product

### Euclidean Distance (L2 Distance): Distance between the ends of the vectors
### Cosine Similarity: Cosine of the angle between the vectors
### Dot Product: Cosine multiplied by the lengths of both vectors
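To make the three measures concrete, here is a small NumPy sketch on two made-up vectors (the values are illustrative only, not taken from the embeddings in this lesson). `b` points in the same direction as `a` but is twice as long.

```python
import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])   # same direction as a, twice the length

# Euclidean (L2) distance: distance between the ends of the vectors
euclidean = np.linalg.norm(a - b)

# Dot product: cosine multiplied by the lengths of both vectors
dot = np.dot(a, b)

# Cosine similarity: dot product divided by both lengths
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, dot, cosine)
```

The cosine similarity is 1 because the vectors point the same way, even though the Euclidean distance between them is not zero. For vectors of unit length the dot product equals the cosine similarity, and ranking by smallest Euclidean distance gives the same order as ranking by largest cosine.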





## Semantic Search

When a user asks a question, we can embed their query on the fly and search over all of the Stack Overflow question embeddings to find the most similar datapoint.

In [60]:
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances_argmin as distances_argmin

In [61]:
query = ['How to concat dataframes pandas']

In [78]:
# If the Google API does not work, wait for a while.
# Check billing, even if on a free trial.
# Set up payment regardless.


# Make the query embedding:
query_embedding = embedding_model.get_embeddings(query)[0].values


In [117]:
# Look inside the returned objects to check them:

# embedding_model.get_embeddings(query)
# query_embedding = embedding_model.get_embeddings(query)[0]

# query_embedding_all = embedding_model.get_embeddings(query)[0]
# query_embedding_all.values

# query_embedding_inside = embedding_model.get_embeddings(query)
# type(query_embedding_inside), len(query_embedding_inside), \
# len(query_embedding), query_embedding[0], \
# query_embedding_inside

In [127]:
# get the cosine value for query_embedding and so_database.embeddings
cos_sim_array = cosine_similarity([query_embedding],
                                  list(so_database.embeddings.values))

In [126]:
list(so_database.embeddings.values[0])[0]

-0.03571155667304993

In [128]:
cos_sim_array, cos_sim_array.shape, type(cos_sim_array)

(array([[0.58799547, 0.59202437, 0.56660594, ..., 0.49681789, 0.5201778 ,
         0.48240689]]),
 (1, 2000),
 numpy.ndarray)

In [129]:
cos_sim_array.shape

(1, 2000)

Once we have a similarity value between our query embedding and each of the database embeddings, we can extract the index with the highest value. This embedding corresponds to the Stack Overflow post that is most similar to the question "How to concat dataframes pandas".
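Taking only the single best index can hide near-ties, so it is sometimes useful to look at the top few matches instead. The sketch below shows one way to do that with `np.argsort`; the helper name `top_k_indices` and the scores are made up for illustration and are not part of the lesson's code.

```python
import numpy as np

def top_k_indices(similarities, k=3):
    """Return the indices of the k largest scores, best first."""
    scores = np.asarray(similarities).ravel()   # accept shape (1, n) or (n,)
    k = min(k, scores.size)
    return np.argsort(scores)[::-1][:k]

toy_scores = np.array([[0.58, 0.59, 0.56, 0.77, 0.49]])   # made-up values
best = top_k_indices(toy_scores, k=2)
print(best)
```

The first entry of the result is the same index that `np.argmax` would return on the same array.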

In [130]:
# cosine similarity
index_doc_cosine = np.argmax(cos_sim_array)

In [131]:
# distances
index_doc_distances = distances_argmin([query_embedding],
                                       list(so_database.embeddings.values))[0]

In [123]:
so_database.input_text[index_doc_cosine]

'Add a column name to a panda dataframe (multi index)<p>I concatenate series objects, with existing column names together to a DataFrame in Pandas. The result looks like this:</p>\n<pre><code>pd.concat([x, y, z], axis=1)\n\n\n   X   |  Y   |   Z\n  -------------------\n  data | data | data\n</code></pre>\n<p>Now I want to insert another column name A above the column names X, Y, Z, for the whole DataFrame. This should look like this at the end:</p>\n<pre><code>   A                  # New Column Name\n  ------------------- \n   X   |  Y   |   Z   # Old Column Names\n  -------------------\n  data | data | data \n</code></pre>\n<p>So far I did not find a solution how to insert a column name A above the existing columns names X, Y, Z for the complete DataFrame. I would be grateful for any help. :)</p>'

In [91]:
so_database.output_text[index_doc_cosine]
# unfinished product

"<p>Let's try with <code>MultiIndex.from_product</code> to create <code>MultiIndex</code> columns:</p>\n<pre><code>df = pd.concat([x, y, z], axis=1)\ndf.columns = pd.MultiIndex.from_product([['A'], df.columns])\n</code></pre>\n<hr />\n<pre><code>A            \nX     Y     Z\ndata  data  data\n</code></pre>"

## Question answering with relevant context

Now that we have found the most similar Stack Overflow question, we can take the corresponding answer and use an LLM to produce a more conversational response.
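Building the context and the prompt happens more than once in this lesson, so the two steps could be wrapped in a small helper. This is only a sketch: `build_prompt` and `NO_MATCH` are hypothetical names, and the wording simply mirrors the prompt used in these notes.

```python
NO_MATCH = "[I couldn't find a good match in the document database for your query]"

def build_prompt(question_text, answer_text, query):
    """Assemble a grounded prompt from one retrieved question/answer pair."""
    context = f"Question: {question_text}\n Answer: {answer_text}"
    return (
        f"Here is the context: {context}\n"
        "Using the relevant information from the context, "
        f"provide an answer to the query: {query}.\n"
        "If the context doesn't provide any relevant information, "
        f"answer with {NO_MATCH}"
    )

demo_prompt = build_prompt(
    question_text="Combine 3 dataframes in pandas",
    answer_text="You can use pd.concat here.",
    query="How to concat dataframes pandas",
)
print(demo_prompt)
```

In the notebook the first two arguments would be `so_database.input_text[index_doc_cosine]` and `so_database.output_text[index_doc_cosine]`.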


### Now, introducing the LLM

In [137]:
from vertexai.language_models import TextGenerationModel

## Google LLM: text-bison, from the PaLM 2 family of models

In [138]:
generation_model = TextGenerationModel.from_pretrained(
    "text-bison@001")



In [139]:
context = "Question: " + so_database.input_text[index_doc_cosine] +\
"\n Answer: " + so_database.output_text[index_doc_cosine]

In [140]:
prompt = f"""Here is the context: {context}
             Using the relevant information from the context,
             provide an answer to the query: {query[0]}.
             If the context doesn't provide \
             any relevant information, \
             answer with \
             [I couldn't find a good match in the \
             document database for your query]
             """

In [142]:
from IPython.display import Markdown, display

t_value = 0.2
response = generation_model.predict(prompt = prompt,
                                    temperature = t_value,
                                    max_output_tokens = 1024)

display(Markdown(response.text))

# The LLM takes the answer from Stack Overflow and uses it to write the response

To concatenate dataframes in pandas, you can use the `pd.concat()` function. This function takes a list of dataframes as input and concatenates them along the specified axis. The default axis is 0, which means that the dataframes will be concatenated along the rows. To concatenate dataframes along the columns, you can specify the `axis` argument to be 1.

For example, the following code concatenates two dataframes, `df1` and `df2`, along the rows:

```
df = pd.concat([df1, df2])
```

The following code concatenates `df1` and `df2` along the columns:

```
df = pd.concat([df1, df2], axis=1)
```

You can also use the `pd.concat()` function to concatenate dataframes with different column names. To do this, you can specify the `ignore_index` argument to be True. This will cause the index of the concatenated dataframe to be reset to start from 0.

For example, the following code concatenates two dataframes, `df1` and `df2`, with different column names:

```
df = pd.concat([df1, df2], ignore_index=True)
```

The resulting dataframe will have the following columns:

```
0   1
0   a   b
1   c   d
```

## When the documents don't provide useful information

Our current workflow returns the most similar question from our embeddings database. But what do we do when that question isn't actually relevant when answering the user query? In other words, we don't have a good match in our database.

In addition to providing a more conversational response, LLMs can help us handle these cases where the most similar document isn't actually a reasonable answer to the user's query.

In [143]:
query = ['How to make the perfect lasagna']

In [144]:
query_embedding = embedding_model.get_embeddings(query)[0].values

In [145]:
cos_sim_array = cosine_similarity([query_embedding],
                                  list(so_database.embeddings.values))

In [146]:
cos_sim_array

array([[0.52794666, 0.49346994, 0.50784253, ..., 0.45782265, 0.5483748 ,
        0.49693718]])

In [147]:
index_doc = np.argmax(cos_sim_array)

In [148]:
context = "Question: " + so_database.input_text[index_doc] + \
"\n Answer: " + so_database.output_text[index_doc]

In [149]:
prompt = f"""Here is the context: {context}
             Using the relevant information from the context,
             provide an answer to the query: {query[0]}.
             If the context doesn't provide \
             any relevant information, \
             answer with \
             [I couldn't find a good match in the \
             document database for your query]
             """

In [150]:
t_value = 0.2
response = generation_model.predict(prompt = prompt,
                                    temperature = t_value,
                                    max_output_tokens = 1024)
display(Markdown(response.text))

# just one sentence answer

I couldn't find a good match in the document database for your query.

## Exhaustive search over a collection of embedding vectors is infeasible for large datasets.

## Instead, use algorithms that perform approximate matches.

## ScaNN (Scalable Nearest Neighbors) is a method for efficient vector similarity search at scale.

## Scale with approximate nearest neighbor search

When dealing with a large dataset, computing the similarity between the query and each original embedded document in the database might be too expensive. Instead of doing that, you can use approximate nearest neighbor algorithms that find the most similar documents in a more efficient way.

These algorithms usually work by creating an index for your data, and using that index to find the most similar documents for your queries. In this notebook, we will use ScaNN to demonstrate the benefits of efficient vector similarity search. First, you have to create an index for your embedded dataset.
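To give a feel for what "creating an index" can mean, here is a toy partition-based search in plain NumPy. It is not ScaNN's actual algorithm, which is considerably more sophisticated; it only illustrates the idea behind parameters like `num_leaves` and `num_leaves_to_search`: group the vectors into leaves once, then score only the leaves closest to the query. All names and the random data are assumptions made for this sketch.

```python
import numpy as np

def nearest_leaf(data, centroids):
    """Index of the closest centroid for every row of data."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def build_partitions(data, num_leaves, iterations=10, seed=0):
    """Group the rows of data into num_leaves leaves with a few k-means steps."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=num_leaves, replace=False)].copy()
    for _ in range(iterations):
        assignment = nearest_leaf(data, centroids)
        # Move each centroid to the mean of its members (skip empty leaves).
        for leaf in range(num_leaves):
            members = data[assignment == leaf]
            if len(members) > 0:
                centroids[leaf] = members.mean(axis=0)
    # Final assignment against the final centroids.
    return centroids, nearest_leaf(data, centroids)

def search(query, data, centroids, assignment, num_leaves_to_search):
    """Score only the vectors stored in the leaves closest to the query."""
    leaf_order = np.linalg.norm(centroids - query, axis=1).argsort()
    chosen = leaf_order[:num_leaves_to_search]
    candidates = np.flatnonzero(np.isin(assignment, chosen))
    if candidates.size == 0:                     # all chosen leaves were empty
        candidates = np.arange(len(data))
    scores = data[candidates] @ query            # dot product = cosine for unit vectors
    return int(candidates[scores.argmax()]), len(candidates)

rng = np.random.default_rng(42)
vectors = rng.normal(size=(500, 16))             # random stand-in for embeddings
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # unit length

centroids, assignment = build_partitions(vectors, num_leaves=10)
query_vec = vectors[123]   # identical to a stored vector, so the exact answer is known
found, scanned = search(query_vec, vectors, centroids, assignment, num_leaves_to_search=3)
exact = int((vectors @ query_vec).argmax())      # exhaustive search for comparison
print(found, exact, scanned)
```

The trade-off is that a true nearest neighbor sitting in a leaf that was not searched will be missed, which is why the result is called approximate. Searching more leaves raises accuracy and cost together.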

In [155]:
import scann
from utilss_6 import create_index

#Create index using scann
index = create_index(embedded_dataset = question_embeddings,
                     num_leaves = 25,
                     num_leaves_to_search = 10,
                     training_sample_size = 2000)

In [156]:
query = "how to concat dataframes pandas"

In [159]:
import time

start = time.time()
query_embedding = embedding_model.get_embeddings([query])[0].values
neighbors, distances = index.search(query_embedding, final_num_neighbors = 1)
end = time.time()

for id, dist in zip(neighbors, distances):
    print(f"[docid:{id}] [{dist}] -- {so_database.input_text[int(id)][:125]}...")

print("Latency (ms):", 1000 * (end - start))

[docid:1143] [0.778209388256073] -- Add a column name to a panda dataframe (multi index)<p>I concatenate series objects, with existing column names together to a...
Latency (ms): 135.92839241027832


In [161]:
start = time.time()
query_embedding = embedding_model.get_embeddings([query])[0].values
cos_sim_array = cosine_similarity([query_embedding], list(so_database.embeddings.values))
index_doc = np.argmax(cos_sim_array)
end = time.time()

print(f"[docid:{index_doc}] [{np.max(cos_sim_array)}] -- {so_database.input_text[int(index_doc)][:125]}...")

print("Latency (ms):", 1000 * (end - start))


# the same result with exhaustive search and ScaNN
# same file identified

[docid:1143] [0.7782096307231192] -- Add a column name to a panda dataframe (multi index)<p>I concatenate series objects, with existing column names together to a...
Latency (ms): 228.19232940673828


## Summary:

Q&A with semantic search

Stack Overflow questions
+
User query
-----> Embedding model --> Embedded query + Embedded questions

-----> Embedded questions --> Nearest Neighbor Search <------ Embedded query

Nearest Neighbor Search --> Stack Overflow answers (or any CSV or external database)