

# RAG Methods - Metadata

Many tasks come within reach once we have quantified our text as vector representations. There are many excellent sentence embedding models available on Hugging Face, so we are not obliged to go to OpenAI for their embedding models (although they offer an acceptable solution): comparable models are available for free.

This notebook starts from the task of classification, once some text has been converted to vector representations. Beyond that, once a vector store of text embeddings has been established, we need to be able to filter it.

Each text is associated with its metadata, which can be used to narrow down the search, for example by date.

This code leans on LangChain's massive library, which wraps not only our database but also our embedding model. The initial phase is to get the data into tabular format before converting it to LangChain Document objects. A very convenient solution is the LangChain DataFrameLoader: we simply specify the text column from which to make the embeddings, and all the other columns become metadata. The fact that all this takes one line of code highlights the use case for LangChain.
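
To make the loader's behaviour concrete, here is a minimal pure-Python sketch of the row-to-document mapping described above. The `Doc` class and the `rows_to_docs` function are hypothetical stand-ins written for illustration, not LangChain's implementation; the real loader returns LangChain `Document` objects.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_docs(rows, page_content_column):
    """Turn each row (a dict) into a Doc: one column is the text, the rest is metadata."""
    docs = []
    for row in rows:
        metadata = {k: v for k, v in row.items() if k != page_content_column}
        docs.append(Doc(page_content=row[page_content_column], metadata=metadata))
    return docs


# Illustrative rows shaped like the ticket table used later in this notebook
rows = [
    {"t_cat": "Azure SQL", "t_query": "Cannot connect to server", "resolved": True},
    {"t_cat": "GCP Cloud Run", "t_query": "Deployment fails", "resolved": False},
]
docs = rows_to_docs(rows, page_content_column="t_query")
print(docs[0].page_content)  # the text that would be embedded
print(docs[0].metadata)      # everything else travels along as metadata
```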


## $\color{blue}{Sections:}$
* Admin
* Data
* DataLoader
* Vector Store
* Query


---
## $\color{blue}{Admin}$
---

In [None]:
%%capture
!pip install langchain

In [None]:
%%capture
!pip install langchain-community

In [None]:
%%capture
!pip install chromadb

In [None]:
%%capture
!pip install sentence_transformers

In [None]:
%%capture
!pip install datasets

In [None]:
# native
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

# admin
from google.colab import drive
import os
import getpass

# ai
from huggingface_hub import login
from datasets import load_dataset

In [None]:
drive.mount("/content/drive")
%cd '/content/drive/MyDrive/'

---
## $\color{blue}{Data}$
---

The data for our toy dataset is imported from Hugging Face with the datasets library. There are only three columns, but we can simulate a couple more to demonstrate filtering.

This small synthetic dataset imagines an IT service desk.

In [None]:
dataset = load_dataset("SparkExpedition/TicketsData", split='train')

In [None]:
D_master = {'t_cat': dataset['TECHNOLOGY'],
            't_query': dataset['QUESTION'],
            't_response': dataset['SOLUTION']}
df = pd.DataFrame(D_master)

In [None]:
df.head()

Unnamed: 0,t_cat,t_query,t_response,resolved,days,id
0,Azure SQL,Which SQL cloud database deployment options ar...,Azure SQL Database is available as a single da...,True,-17,0
1,Azure SQL,Error message: Conversion failed when converti...,"In the copy activity sink, under PolyBase sett...",False,-50,1
2,Azure SQL,"Cannot open database ""master"" requested by the...","1. On the login screen of SSMS, select Options...",True,-19,2
3,Azure SQL,Error 40552: The session has been terminated b...,The issue can occur in any DML operation such ...,True,-47,3
4,Azure SQL,Error 5: Cannot connect to < servername >,"To resolve this issue, make sure that port 143...",True,-26,4


In [None]:
df.shape

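**Now we have categories, questions, and responses. Let's add two more columns:**

*  Resolved - whether the ticket is resolved or not
*  Days - the ticket's age as an integer number of days, which we can read as 0 for today and -50 for fifty days ago; we simulate dates with numbers because Chroma doesn't handle datetimes.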
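
With real tickets, a timestamp can be turned into such an integer before loading. A minimal sketch, assuming a hypothetical helper `days_offset` and a fixed reference date, neither of which appears elsewhere in this notebook:

```python
import datetime


def days_offset(created, today):
    """Return the ticket age as a signed integer: 0 is today, -7 is a week ago."""
    return (created - today).days


today = datetime.date(2024, 6, 30)    # assumed reference date, for illustration only
created = datetime.date(2024, 6, 13)  # hypothetical ticket creation date
print(days_offset(created, today))    # -17
```

Storing the offset as a plain `int` keeps it compatible with the numeric comparison filters used later.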


In [None]:
# resolved
np.random.seed(101)
status = [True, False]
resolved = list(np.random.choice(status, df.shape[0], replace=True, p=[.9,.1]))
df['resolved'] = resolved

# days
day_range = list(range(-50, 1))
days = list(np.random.choice(day_range, df.shape[0], replace=True))
df['days'] = days

In [None]:
df.columns

Index(['t_cat', 't_query', 't_response', 'resolved', 'days', 'id'], dtype='object')

In [None]:
df.t_cat.value_counts()

t_cat
Azure AKS             21
GCP Security IAM      21
GCP Functions         21
Azure SQL             20
Azure Security IAM    20
GCP Cloud Storage     20
Azure - AML           20
Azure Functions       20
GCP Cloud SQL         20
Azure Synapse         20
GCP Cloud Run         20
GCP Big Query         20
GCP Fire Store        20
Name: count, dtype: int64

**Keep only categories with at least 20 examples**

In [None]:
# every category has at least 20 tickets here, so all 13 are kept
cats = list(df.t_cat.value_counts().loc[lambda s: s >= 20].index)

In [None]:
cats

In [None]:
# filter and reset the index in one step to avoid modifying a slice in place
df = df[df.t_cat.isin(cats)].reset_index(drop=True)
df['id'] = df.index

---
## $\color{blue}{DataLoader}$
---

Import LangChain's DataFrameLoader and create the documents.

In [None]:
# langchain
from langchain_community.document_loaders.dataframe import DataFrameLoader

In [None]:
loader = DataFrameLoader(df,page_content_column="t_query")

In [None]:
data = loader.load()

**data is now a list of Langchain documents**

In [None]:
len(data)

263

In [None]:
type(data)

list

In [None]:
a = data[0]

In [None]:
type(a)

In [None]:
a.page_content

'Which SQL cloud database deployment options are \navailable?'

In [None]:
a.metadata

{'t_cat': 'Azure SQL',
 't_response': 'Azure SQL Database is available as a single database with \nits own set of resources managed via a logical server,and\n as a pooled database in an elastic pool, with a shared set of resources managed through a logical server. In general, elastic pools are designed for a typical software-as-a-service (SaaS) application pattern, with one database per custtomer or tenant. With pools, you manage the collective performance, and the databases scale up or down automatically.',
 'resolved': True,
 'days': -17,
 'id': 0}

---
## $\color{blue}{Vector Store}$
---

A Chroma vector store is used to save the data locally. This also lets us add data to, or delete data from, the db, and it makes metadata filtering possible.

In [None]:
persist_directory = "RAG_tutorial/dbs/toy_metadata.1.db"

In [None]:
HF_TOKEN = getpass.getpass("Hugging Face token please: ")

In [None]:
login(token=HF_TOKEN)
os.environ['HUGGINGFACEHUB_API_TOKEN'] = HF_TOKEN

**Now we need to download the model from Hugging Face via Sentence Transformers**

In [None]:
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)

In [None]:
embeddings = SentenceTransformerEmbeddings(model_name="BAAI/bge-m3")


**We now need the langchain wrapper over the chromadb vector store**

In [None]:
from langchain_community.vectorstores.chroma import Chroma

In [None]:
db = Chroma.from_documents(
    documents=data,
    embedding=embeddings,
    persist_directory=persist_directory,
    collection_name='toy_db'
)

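# With chromadb >= 0.4 the data is persisted automatically and this call is no longer needed;
# it is kept here for older versions.
db.persist()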

**We don't need to rebuild the db every time. If we load the previously saved db, we can run this on the CPU**

In [None]:
db = Chroma( persist_directory=persist_directory, embedding_function=embeddings, collection_name='toy_db')

**Check if the data is in the database**

We can access the items of the database directly, with filtering options. [Read the docs.](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html)

In [None]:
def peek(db, ticket_id=None):
    """
    Check the db for data, optionally specifying a ticket id.

    Parameters
    ----------
    db : Chroma
        The LangChain Chroma vector store to inspect.
    ticket_id : int, optional
        Ticket ID. The default is None, which picks a random ticket.

    Returns
    -------
    res : dict or None
        The matching record(s) as returned by `db.get`, or None if the
        ticket id is not in the dataframe.

    """
    # `is None` rather than `not ticket_id`, so that ticket 0 is accepted
    if ticket_id is None:
        ticket_id = int(np.random.choice(df['id']))

    if ticket_id not in df['id'].values:
        print(f'ticket_id {ticket_id} is not in the database.')
        return None

    res = db.get(include=['metadatas', 'documents'], where={'id': ticket_id})
    return res

In [None]:
a = peek(db,20)

In [None]:
a

{'ids': ['16536230-86a8-4248-9558-156abe258d3b'],
 'embeddings': None,
 'metadatas': [{'days': -14,
   'id': 20,
   'resolved': True,
   't_cat': 'Azure AKS',
   't_response': "Ensure that your client's IP address is within the ranges authorized by the cluster's API server:\n\n1. Find your local IP address. For information on how to find it on Windows and Linux, see How to find my IP.\n\n2. Update the range that's authorized by the API server by using the az aks update command in Azure CLI. Authorize your client's IP address."}],
 'documents': ["Client can't reach an Azure Kubernetes Service (AKS) cluster's API "],
 'uris': None,
 'data': None}

In [None]:
# a = db.get(include=['metadatas', 'documents', 'embeddings']) # To return the embeddings as well

**We can now query the database to find the closest matches to an input query.**

In [None]:
query = "I want to convert my string to an id"

In [None]:
res = db.similarity_search_with_score(query, k = 2)

In [None]:
res

[(Document(page_content='Error message: Conversion failed when converting from a \ncharacter string to uniqueidentifier', metadata={'days': -50, 'id': 1, 'resolved': False, 't_cat': 'Azure SQL', 't_response': 'In the copy activity sink, under PolyBase settings, set the use type \ndefault option to false.'}),
  0.7504672408103943),
 (Document(page_content=' I get an error when I attempt to make my data public', metadata={'days': -36, 'id': 84, 'resolved': False, 't_cat': 'GCP Cloud Storage', 't_response': 'Make sure that you have the setIamPolicy permission for your object or bucket. This permission is granted, for example, in the Storage Admin role. If you have the setIamPolicy permission and you still get an error, your bucket might be subject to public access prevention, which does not allow access to allUsers or allAuthenticatedUsers. Public access prevention might be set on the bucket directly, or it might be enforced through an organization policy that is set at a higher level.\n'

---
## $\color{blue}{Query}$
---

We now have a [Chroma db](https://api.python.langchain.com/en/latest/vectorstores/langchain_community.vectorstores.chroma.Chroma.html), LangChain's wrapper around chromadb, with enough functionality to modify and query the underlying vector store, which is built on SQLite.

Now we can demonstrate filtering and ultimately build a wrapper for filtering the category and the date.
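
One possible shape for such a wrapper is a small function that assembles the filter dict from optional arguments. This is a sketch under stated assumptions: `build_filter` and its parameters are hypothetical names, and it relies on the filter syntax used in the cells of this section, where Chroma's `$and` is given a list of at least two expressions.

```python
def build_filter(category=None, min_days=None, resolved=None):
    """Build a Chroma-style filter dict from optional conditions.

    Hypothetical helper: returns None when no condition is given, the bare
    condition when there is exactly one, and an "$and" list otherwise.
    """
    conditions = []
    if category is not None:
        conditions.append({"t_cat": category})
    if min_days is not None:
        conditions.append({"days": {"$gte": min_days}})
    if resolved is not None:
        conditions.append({"resolved": resolved})

    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}


print(build_filter(category="Azure SQL"))
print(build_filter(category="Azure SQL", min_days=-21, resolved=False))
```

The returned value can be passed as the `filter` argument of `db.similarity_search_with_score`.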

---
#### $\color{red}{Simple}$
---

In [None]:
query = "I want to convert my string to an id"

In [None]:
res = db.similarity_search_with_score(query, k = 4)

In [None]:
res[0][1]

In [None]:
for el in res:
  print('#######################')
  print('\n##### New Point #####\n')
  print(el[0].page_content, '\n')
  print(el[0].metadata['t_cat'])

#######################

##### New Point #####

Error message: Conversion failed when converting from a 
character string to uniqueidentifier 

Azure SQL
#######################

##### New Point #####

 I get an error when I attempt to make my data public 

GCP Cloud Storage
#######################

##### New Point #####

Unable to register the self-hosted IR  

Azure Security IAM
#######################

##### New Point #####

Getting the error message Bad syntax for dict arg when trying to set a 
flag. 

GCP Cloud SQL


---
#### $\color{red}{Metadata-string-match}$
---

In [None]:
df.t_cat.value_counts()

**Filter to only the `Azure SQL` category**

In [None]:
res = db.similarity_search_with_score(query, k = 4, filter= {"t_cat": 'Azure SQL'} )

In [None]:
for el in res:
  print('#######################')
  print('\n##### New Point #####\n')
  print(el[0].page_content, '\n')
  print(el[0].metadata['t_cat'])

#######################

##### New Point #####

Error message: Conversion failed when converting from a 
character string to uniqueidentifier 

Azure SQL
#######################

##### New Point #####

c# error when connect to mysql "Object cannot be cast from 
DBNull to other types" (mariadb 10.3) 

Azure SQL
#######################

##### New Point #####

Error code: 2056 - SqlInfoValidationFailed 

Azure SQL
#######################

##### New Point #####

Error 5: Cannot connect to < servername > 

Azure SQL


---
#### $\color{red}{Metadata->=/<=}$
---

**Note the result ids**

In [None]:
res = db.similarity_search_with_score(query, k = 4)

In [None]:
res[0][0].metadata

{'days': -50,
 'id': 1,
 'resolved': False,
 't_cat': 'Azure SQL',
 't_response': 'In the copy activity sink, under PolyBase settings, set the use type \ndefault option to false.'}

In [None]:
for el in res:
  print(el[0].metadata['id'])

1
84
51
107


In [None]:
res = db.similarity_search_with_score(query, k = 4, filter = {'id':{'$gt':51}} )

In [None]:
for el in res:
  print(el[0].metadata['id'])

84
107
115
245


**in between two values**

In [None]:
res = db.similarity_search_with_score(
    query, k = 4, filter = {"$and": [{'id':{'$gt':110}}, {'id': {'$lte':135}}]}
)

In [None]:
for el in res:
  print(el[0].metadata['id'])

115
119
128
125
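
To make the operator semantics explicit, here is a small pure-Python sketch that evaluates a filter of this shape against a metadata dict. The `matches` function is a hypothetical illustration covering only a handful of operators; it is not how Chroma implements filtering.

```python
import operator

# Comparison operators used in this notebook's filters, plus their siblings
OPS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a (simplified) filter dict."""
    for key, cond in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in cond):
                return False
        elif isinstance(cond, dict):
            # e.g. {"id": {"$gt": 110}}
            if not all(OPS[op](metadata[key], val) for op, val in cond.items()):
                return False
        elif metadata[key] != cond:
            # e.g. {"t_cat": "Azure SQL"} is shorthand for equality
            return False
    return True


where = {"$and": [{"id": {"$gt": 110}}, {"id": {"$lte": 135}}]}
for ticket_id in (110, 111, 135, 136):
    print(ticket_id, matches({"id": ticket_id}, where))
```

The lower bound is exclusive (`$gt`) and the upper bound inclusive (`$lte`), so 110 is rejected while 135 is accepted.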


---
#### $\color{red}{Metadata-date}$
---



In [None]:
res = db.similarity_search_with_score(query, k = 4 )

In [None]:
for el in res:
  print('#######################')
  print('\n##### New Point #####\n')
  print(el[0].page_content, '\n')
  print(el[0].metadata['t_cat'])
  print(el[0].metadata['days'])
  print(f"resolved: {el[0].metadata['resolved']}.")

#######################

##### New Point #####

Error message: Conversion failed when converting from a 
character string to uniqueidentifier 

Azure SQL
-50
resolved: False.
#######################

##### New Point #####

 I get an error when I attempt to make my data public 

GCP Cloud Storage
-36
resolved: False.
#######################

##### New Point #####

Unable to register the self-hosted IR  

Azure Security IAM
-3
resolved: True.
#######################

##### New Point #####

Getting the error message Bad syntax for dict arg when trying to set a 
flag. 

GCP Cloud SQL
-22
resolved: True.


In [None]:
res = db.similarity_search_with_score(query, k = 4, filter= {"t_cat": 'Azure SQL'} )

In [None]:
for el in res:
  print('#######################')
  print('\n##### New Point #####\n')
  print(el[0].page_content, '\n')
  print(el[0].metadata['t_cat'])
  print(el[0].metadata['days'])
  print(f"resolved: {el[0].metadata['resolved']}.")

In [None]:
res = db.similarity_search_with_score(
    query, k = 4, filter = {"$and": [{'resolved': False}, {'days': {'$gte':-21}}]}
)


In [None]:
for el in res:
  print('#######################')
  print('\n##### New Point #####\n')
  print(el[0].page_content, '\n')
  print(el[0].metadata['t_cat'])
  print(el[0].metadata['days'])
  print(f"resolved: {el[0].metadata['resolved']}.")

#######################

##### New Point #####

How do I add or access an app.config file in Azure functions to add a 
database connection string? 

Azure Functions
-13
resolved: False.
#######################

##### New Point #####

How can I manage who can access my instances? 

GCP Security IAM
-13
resolved: False.
#######################

##### New Point #####

How do I set entry point in cloud function? 

GCP Functions
-2
resolved: False.
#######################

##### New Point #####

Deployment failure: Insufficient permissions to (re)configure a trigger
(permission denied for bucket <BUCKET_ID>). Please, give owner permissions to the editor role of the bucket and try again. 

GCP Functions
-10
resolved: False.


**Similarity score**

In [None]:
res = db.similarity_search_with_score(query, k = 4 )

In [None]:
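for el in res:
  print('-' * 20)
  print(el[0].page_content)
  # the collection above uses Chroma's default metric, which is squared L2 rather than cosine
  print(f"squared L2 distance: {el[1]}")  # zero is the best
  # for unit-length embeddings: squared L2 = 2 - 2 * cosine similarity
  print(f"cosine similarity (if embeddings are normalised): {1 - el[1] / 2}")  # -1 to 1 one is the best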
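
How this score should be read depends on the collection's distance metric. A hedged note: as far as we are aware, a Chroma collection created without an explicit `hnsw:space` setting, as above, uses squared L2 distance by default. For unit-length embeddings, squared L2 distance $d$ and cosine similarity are tied together by $\cos = 1 - d/2$. The sketch below checks that identity on two made-up vectors in plain Python; the helper names are illustrative and not part of any library.

```python
import math


def normalise(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Two made-up vectors, normalised to unit length
a = normalise([1.0, 2.0, 3.0])
b = normalise([2.0, 1.0, 0.5])

d = squared_l2(a, b)
print(f"squared L2 distance:         {d:.4f}")
print(f"cosine similarity (direct):  {cosine_similarity(a, b):.4f}")
print(f"cosine similarity (1 - d/2): {1 - d / 2:.4f}")
```

If the embeddings are not normalised, the identity does not hold and the cosine similarity has to be computed from the vectors themselves.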
