Azure CosmosDB Extension and Tabular Data #1058
TrueCodePoet started this conversation in Misc
I have been working on a custom extension for importing into Azure CosmosDB. I have this working just fine, but as an extra I am also building a custom tabular extension.
This is obviously more complex, but it lets the indexer create a schema at ingestion time, and it requires additional calls to the LLM to build the query. In exchange, the LLM can build a query that returns multiple rows at a time for it to use. My issue is that chunking in the traditional unstructured way does not provide all of the data needed for tabular processing.
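For context, here is a minimal sketch of the two document shapes involved, assuming a C# implementation. The field names on the row document come straight from the query further down; everything else (class names, the schema document layout) is a hypothetical illustration, not the actual extension code.

```csharp
using System.Collections.Generic;

// Hypothetical schema document, one per table, written at ingestion time so the
// LLM can map a natural-language question onto known columns.
public class TableSchemaDocument
{
    public string id { get; set; }                            // e.g. "schema-<file>"
    public string schemaId { get; set; }                      // referenced by each row document
    public Dictionary<string, string> columns { get; set; }   // column name -> type/description
}

// Hypothetical row document, one per Excel/CSV row, with the original cells in
// "data" and an embedding in "vector" (field names taken from the query below).
public class TabularRowDocument
{
    public string id { get; set; }
    public string file { get; set; }
    public List<string> tags { get; set; }
    public Dictionary<string, object> data { get; set; }      // e.g. data.server_purpose
    public string source { get; set; }
    public string payload { get; set; }
    public string schemaId { get; set; }
    public string importBatchId { get; set; }
    public float[] vector { get; set; }
}
```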
Here is an example process. The user asks a question, and it is fed to the LLM, something like:
"Give me a list of all server names with a Server Purpose of 'Corelight Network monitoring sensor'."
The LLM compares the request with the known schema stored in CosmosDB during ingestion.
The LLM returns a potential filter suggestion:
LLM Normalized Filter Suggestion: [
]
This is converted into a query like this:
SQL Query:

```sql
SELECT TOP @limit
    c.id, c.file, c.tags, c.data, c.source, c.payload, c.schemaId, c.importBatchId,
    VectorDistance(c.vector, @queryEmbedding) AS SimilarityScore
FROM c
WHERE ((LOWER(c.data.server_purpose) LIKE @p_0 OR LOWER(c.data.server_purpose) LIKE @p_1))
ORDER BY VectorDistance(c.vector, @queryEmbedding)
```

Parameters:

```
@limit: 100
@queryEmbedding: [Vector with 1536 dimensions]
@p_0: %corelight%
@p_1: %sensor%
```
After the query runs, the results are returned to the LLM for the final response, just like chunks from other documents.
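For anyone wanting to reproduce this step, here is a minimal sketch of running that parameterized query with the CosmosDB .NET SDK (Microsoft.Azure.Cosmos) and flattening the rows into one text chunk for the LLM. The container names and the GetQueryEmbeddingAsync helper are assumptions, not part of the extension.

```csharp
using System.Text;
using Microsoft.Azure.Cosmos;

// Assumptions: 'client' is an existing CosmosClient, 'GetQueryEmbeddingAsync' is a
// hypothetical helper returning the 1536-dimension embedding of the user question.
Container container = client.GetContainer("memory", "tabular");
float[] queryEmbedding = await GetQueryEmbeddingAsync(userQuestion);

var query = new QueryDefinition(@"
    SELECT TOP @limit
        c.id, c.file, c.tags, c.data, c.source, c.payload, c.schemaId, c.importBatchId,
        VectorDistance(c.vector, @queryEmbedding) AS SimilarityScore
    FROM c
    WHERE ((LOWER(c.data.server_purpose) LIKE @p_0 OR LOWER(c.data.server_purpose) LIKE @p_1))
    ORDER BY VectorDistance(c.vector, @queryEmbedding)")
    .WithParameter("@limit", 100)
    .WithParameter("@queryEmbedding", queryEmbedding)
    .WithParameter("@p_0", "%corelight%")
    .WithParameter("@p_1", "%sensor%");

// Each matching row becomes one line of "chunk" text, exactly like a chunk from an
// unstructured document, so the final LLM call does not need to change.
var sb = new StringBuilder();
FeedIterator<dynamic> iterator = container.GetItemQueryIterator<dynamic>(query);
while (iterator.HasMoreResults)
{
    foreach (var row in await iterator.ReadNextAsync())
    {
        sb.AppendLine(row.ToString());
    }
}
string chunkForLlm = sb.ToString();
```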
The largest issue I am finding is that ingestion through Kernel Memory is losing data: I am missing rows. If I ingest with my own code instead of going through Kernel Memory, everything is fine, so I am still working that out.
I posted this because unstructured vector searches against tabular data will not return the results you are looking for. You need a tabular approach, unless you want to include the complete document in the LLM prompt, which is usually a lot of data. This is why I am using a tabular query: it reduces the results to just what the user needs.
Everything I am working on only requires the user to provide an Excel or CSV file; the system handles the rest, from ingestion to query results, using Kernel Memory. I just need to figure out the missing rows, which I think comes down to how Kernel Memory handles chunking: it is potentially not committing some records, or at least overwriting records during ingestion. If I use my own ingestion, it works just fine.
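In case it helps with the comparison, here is the kind of row-per-record ingestion I mean by "my own ingestion": the deterministic id (import batch plus row index) is what keeps one row from ever overwriting another. Method and variable names here are assumptions for illustration, not the actual extension code.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Sketch: write one CosmosDB document per Excel/CSV row. Because the id is derived
// from the import batch and the row index, a retry upserts the same document instead
// of colliding with, or overwriting, a different row.
async Task IngestRowsAsync(Container container, string importBatchId, string fileName,
                           IReadOnlyList<Dictionary<string, object>> rows)
{
    for (int i = 0; i < rows.Count; i++)
    {
        var doc = new TabularRowDocument
        {
            id = $"{importBatchId}-{i}",   // deterministic and unique per row
            file = fileName,
            data = rows[i],
            importBatchId = importBatchId,
            // schemaId, vector, payload, tags would be filled the same way as above
        };

        await container.UpsertItemAsync(doc, new PartitionKey(doc.file));
    }
}
```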
Any thoughts on this would be appreciated.