### Metadata Retrieval
> Metadata can be used effectively to retrieve data from a knowledge base.  
> It offers another search dimension for pulling out relevant information.  
> Semantic search and metadata retrieval can also be combined in hybrid mode to improve the context.

#### Imports

In [40]:
import lancedb

# Sentence transformers to use the embedding models locally
from sentence_transformers import SentenceTransformer, util
import pandas as pd

# Import required class from Google
from google import genai
from google.genai import types
from dotenv import load_dotenv

# Load the API key from .env and initialise a client object
load_dotenv ()
client = genai.Client()

**Embedding Models**  
Two different models are used for embedding.  
The query embedding used for vector search must come from the same model that was used to create the vector DB.
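
A quick sanity check can catch the most obvious mismatch. The sketch below is a minimal illustration, not part of the notebook's pipeline: the helper name `check_query_vector` is hypothetical, and it assumes the index was built with `all-mpnet-base-v2`, as the later use of `Embedder_2` for the query suggests. The two models produce vectors of different sizes (384 and 768), so a length check rejects a query embedded with the wrong one.

```python
# Output sizes of the two models used in this notebook
EMBEDDING_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
}

def check_query_vector(query_vector, index_model):
    """Raise if the query vector cannot have come from the model that built the index."""
    expected = EMBEDDING_DIMS[index_model]
    if len(query_vector) != expected:
        raise ValueError(
            f"query vector has {len(query_vector)} dimensions, index expects {expected}"
        )
    return True

# A 384-dimensional (MiniLM-sized) vector against an mpnet-built index is rejected
try:
    check_query_vector([0.0] * 384, "sentence-transformers/all-mpnet-base-v2")
except ValueError as err:
    print(err)
```

A matching length is necessary but not sufficient: two different models with the same output size would pass this check and still give meaningless distances.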

In [3]:
Embedder_1 = SentenceTransformer ("sentence-transformers/all-MiniLM-L6-v2")
Embedder_2 = SentenceTransformer ("sentence-transformers/all-mpnet-base-v2")

**Utility**  
> Function to match the query against all metadata values present in the vector DB and return the most relevant ones

In [17]:
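def Match_Meta_Data (query, meta_data, meta_data_emb, top_k=2):
    """Return the top_k entries of meta_data that are semantically closest to query."""

    # Embed the query with the same model used for the meta data embeddings
    query_emb = Embedder_1.encode(query, normalize_embeddings=True)

    # Top-k semantic search (cosine similarity) over the meta data embeddings
    hits = util.semantic_search (query_embeddings=query_emb, corpus_embeddings=meta_data_emb, top_k=top_k)

    # hits[0] holds the matches for the single query; map corpus ids back to values
    Matches = [meta_data[hit['corpus_id']] for hit in hits[0]]

    return Matches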

> Two different embedding models from Sentence Transformers are used.  
> One is used for searching the vector DB (the vectors in the DB and the query vector must come from the same embedding model).  
> The other is used for handling metadata.

> Connect to the Database that was created

In [None]:
# Connect to the existing LanceDB vector DB
DB = lancedb.connect ('Vector_DB')

# Open the table that holds the chunk data
table = DB.open_table ("tech_ref")
print (table.schema)

In [35]:
# Query a vector
# Query = "There are many service providers"
# Query = "Where are the servers located?"
Query = "What shall be the deciding factor for my embedded system?"

Query_Vector = Embedder_2.encode (Query).tolist ()

**Different Search**  
A similarity search with only top_k always returns results, because the only criterion is the number of rows to return.  
Setting a distance threshold, in contrast, keeps only matches that are actually close to the query.
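
The difference can be seen without a database. The sketch below is a pure-Python illustration with made-up 2-D vectors; the helper names `cosine_distance`, `top_k`, and `within_threshold` are hypothetical and are not LanceDB API. Top-k returns k rows even when nothing is close to the query, while the threshold variant may return nothing at all.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query, corpus, k):
    """Return the k nearest (distance, label) pairs, however far away they are."""
    scored = sorted((cosine_distance(query, vec), label) for label, vec in corpus.items())
    return scored[:k]

def within_threshold(query, corpus, k, upper_bound):
    """Keep only rows with distance below upper_bound, then take the k nearest."""
    ranked = top_k(query, corpus, len(corpus))
    return [(d, label) for d, label in ranked if d < upper_bound][:k]

# Toy corpus: every vector is orthogonal or roughly opposite to the query
corpus = {"a": [0.0, 1.0], "b": [-1.0, 0.1], "c": [-1.0, -1.0]}
query = [1.0, 0.0]

print(top_k(query, corpus, 2))                  # two rows, both far from the query
print(within_threshold(query, corpus, 2, 0.6))  # empty list: nothing is close enough
```

An empty result from the threshold search is useful information in itself: it signals that the knowledge base probably has nothing relevant to the query.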

In [None]:
print ("\nSimilarity : Top k :")
Results = table.search(Query_Vector).distance_type("cosine").limit(5).to_list ()

for Rs in Results :

    print (Rs['_distance'],Rs['source']," ## ",Rs ['text'])

print ("\nSimilarity : Distance threshold then Top k :")
Results = table.search(Query_Vector).distance_type("cosine").distance_range(upper_bound=0.6).limit(5).to_list ()

for Rs in Results :

    print (Rs['_distance'],Rs['source']," ## ",Rs ['text'])

**Metadata Filtering**  
> The metadata fields present in the vector DB can be used to pre-filter the content. Since the metadata captures the broad meaning of the content, it is a good reference for narrowing down the search.  
> One possibility for metadata filtering is to match the query against the topic values and keep only chunks from the closest topics.

In [57]:
# Get all the sources and topics.
DF = table.to_pandas ()
Sources = DF['source'].unique ().tolist ()
Topics = DF['topic'].unique ().tolist ()
Topics_Emb = Embedder_1.encode(Topics, normalize_embeddings=True)

> Formulate the search query that can be used in vector DB filtering

In [None]:
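# Match the query against the topics and build a SQL filter (single quotes in topics are doubled)
Matched_Topics = Match_Meta_Data (Query, Topics, Topics_Emb, 3)
filter_text = "(topic IN (" + (",".join("'" + x.replace("'", "''") + "'" for x in Matched_Topics)) + "))"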
filter_text
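
Because the filter is assembled as a SQL string, values that contain a single quote (for example a topic such as "Moore's Law") would otherwise produce an invalid expression. The sketch below shows one way to wrap the string building in a small helper; `build_in_filter` is a hypothetical name, and the quote doubling assumes standard SQL string-literal escaping.

```python
def build_in_filter(column, values):
    """Build a SQL `column IN (...)` clause, doubling single quotes inside values."""
    if not values:
        # `IN ()` is not valid SQL, so fail early instead of sending a broken filter
        raise ValueError("at least one value is required for an IN filter")
    quoted = ", ".join("'" + str(v).replace("'", "''") + "'" for v in values)
    return f"({column} IN ({quoted}))"

print(build_in_filter("topic", ["Cloud Computing", "Moore's Law"]))
# (topic IN ('Cloud Computing', 'Moore''s Law'))
```

The resulting string can be passed to `.where(...)` in the same way as `filter_text`.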

**Filter + Search**  
> The filtering criteria are applied first. Semantic search is then run on the chunks that pass the filter.

In [None]:
print ("\n Meta Data Filtered : Top k :")

Results = table.search(Query_Vector).where (filter_text).distance_type("cosine").limit(5).to_list ()

for Rs in Results :

    print (Rs['_distance'],Rs['source']," ## ",Rs ['text'])

In [None]:
Results

#### Augmentation + Generation

**Metadata Overview**  
Use an LLM to create a summary of the content described by the metadata

In [None]:
# Instruction for the LLM
Instruction = """You will be provided a list of topics. Summarize what is being discussed, based on the list of topics.
                 Write the summary in 100 words. Just provide the summary text. **No additional text**
            """
Topic_List  = "Topics : \n"+str(Topics)

   
response = client.models.generate_content(
                model="gemini-2.0-flash",
                config =types.GenerateContentConfig(
                            system_instruction=Instruction,
                            # temperature=0.0
                            ),
                contents =Topic_List
)

print(response.text)

**Prepare Context**  
Find information relevant to the query in the vector DB. The search is run with multiple methods and the results are consolidated.  
The consolidated results then become the context information the LLM uses to answer.

In [None]:
# Query = "What is Cloud Computing?"
# Query = "How does IoT get benefitted by Edge Technology?"
# Query = "Is GPU Mandatory for AI?"
Query = "How can I estimate the Cloud infra cost for my project?"
# Query = "Is Cloud computing cost effective?"

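# Match the query against the topics and build a SQL filter (single quotes in topics are doubled)
Matched_Topics = Match_Meta_Data (Query, Topics, Topics_Emb, 3)
filter_text = "(topic IN (" + (",".join("'" + x.replace("'", "''") + "'" for x in Matched_Topics)) + "))"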

Query_Vector = Embedder_2.encode (Query).tolist ()

# Typical similarity search with a distance threshold
Results_1 = table.search(Query_Vector).distance_type("cosine").distance_range(upper_bound=0.6).limit(5).to_list ()
print (len(Results_1))

# Search with meta data filtering
Results_2 = table.search(Query_Vector).where (filter_text).distance_type("cosine").limit(5).to_list ()
print (len(Results_2))

# Combine both result sets and remove duplicates while preserving retrieval order
Context = [d['text'] for d in Results_1]
Context = Context + [d['text'] for d in Results_2]

Context = list(dict.fromkeys(Context))
print (len(Context))
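
When results from several searches are combined, it can help to keep the chunks ordered by distance so that the most relevant text appears first in the context. The sketch below is a minimal illustration with made-up rows; `merge_results` is a hypothetical helper, and it only assumes that each row carries the `text` and `_distance` keys printed earlier in this notebook.

```python
def merge_results(*result_lists):
    """Merge result rows by 'text', keeping the smallest '_distance' seen for each chunk."""
    best = {}
    for rows in result_lists:
        for row in rows:
            text = row["text"]
            if text not in best or row["_distance"] < best[text]["_distance"]:
                best[text] = row
    # Closest chunks first
    return sorted(best.values(), key=lambda r: r["_distance"])

# Made-up rows standing in for Results_1 and Results_2
results_1 = [{"text": "A", "_distance": 0.2}, {"text": "B", "_distance": 0.5}]
results_2 = [{"text": "B", "_distance": 0.5}, {"text": "C", "_distance": 0.3}]

context = [row["text"] for row in merge_results(results_1, results_2)]
print(context)  # ['A', 'C', 'B']
```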

In [None]:
# Instruction for the LLM
Instruction = """You will be given context information and a user query. You have to provide an answer to user query based on information provided in context.
                Answer **ONLY** based on context. If sufficient details are not in context, respond as "No Sufficient Details"
            """
   
response = client.models.generate_content(
                model="gemini-2.0-flash",
                config =types.GenerateContentConfig(
                            system_instruction=Instruction,
                            # temperature=0.0
                            ),
                contents = ["Context : \n"+str(Context), "User Query : \n"+Query]
)

print(response.text)
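
Passing `str(Context)` sends the Python list representation (brackets, quotes, escaped newlines) to the model. One possible alternative is to join the chunks into plain numbered text first. The sketch below uses a hypothetical helper `format_context` and made-up chunks; whether this changes answer quality has not been measured here.

```python
def format_context(chunks):
    """Join chunks into numbered plain text, one blank line between chunks."""
    return "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )

# Made-up chunks standing in for the consolidated Context list
chunks = ["Cloud cost depends on compute hours. ", "Storage is billed per GB."]
prompt_context = "Context : \n" + format_context(chunks)
print(prompt_context)
```

The numbering also gives the model a simple way to refer back to individual chunks if the instruction asks it to cite its sources.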

In [None]:
  
# Baseline for comparison: the same query sent without any retrieved context
response = client.models.generate_content(
                model="gemini-2.0-flash",
                config =types.GenerateContentConfig(
                            system_instruction="Respond to User query",
                            ),
                contents = "User Query : \n"+Query
)

print(response.text)