Fix multi database issues #214

rvguha · 2025-06-16T22:43:36Z

Fixes issues with multi-db requests
Introduces tests to make sure things do work


Copilot

Pull Request Overview

This PR fixes issues related to multi-database requests and improves test coverage for database operations. Key changes include refactoring test scripts across multiple modules to use the new search interface instead of direct client calls, updating configuration files to include an "enabled" flag and write endpoint, and enhancing logging and error handling across the codebase.

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.


File	Description
code/testing/*	Refactored test scripts to verify multi-db requests and RSS feed loads
code/snowflake-connectivity.py	Updated to use search_all_sites and get_sites_wrapper for Snowflake
code/scraping/incrementalCrawlAndLoad.py	Replaced direct client usage with upload_documents and updated logging
code/pre_retrieval/* and code/core/*	Replaced get_vector_db_client with search and updated trim imports
code/config/*	Updated retrieval configuration with write_endpoint and "enabled" flag

Comments suppressed due to low confidence (2)

code/config/config.py:247

Ensure that the new 'enabled' field and 'write_endpoint' in the retrieval configuration are clearly documented and consistently handled in the codebase to avoid confusion during maintenance.

db_type=self._get_config_value(cfg.get("db_type")),  # Add db_type
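One way to handle the two new settings consistently is to parse them in a single place and validate them at load time. The sketch below assumes a config shape with a top-level write_endpoint and a per-endpoint enabled flag; EndpointConfig, RetrievalConfig, load_retrieval_config, and the "endpoints" key are hypothetical names for illustration, not the actual contents of config.py.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EndpointConfig:
    # Hypothetical per-endpoint settings; only fields mentioned in this PR.
    db_type: str
    enabled: bool = False


@dataclass
class RetrievalConfig:
    write_endpoint: str
    endpoints: Dict[str, EndpointConfig] = field(default_factory=dict)

    def enabled_endpoints(self) -> List[str]:
        """Names of the endpoints that reads should fan out to."""
        return [name for name, ep in self.endpoints.items() if ep.enabled]

    def validate(self) -> None:
        """Fail early if write_endpoint names an endpoint that is not configured."""
        if self.write_endpoint not in self.endpoints:
            raise ValueError(f"write_endpoint '{self.write_endpoint}' is not configured")


def load_retrieval_config(raw: dict) -> RetrievalConfig:
    """Build a RetrievalConfig from an already-parsed YAML mapping (assumed shape)."""
    endpoints = {
        name: EndpointConfig(
            db_type=cfg.get("db_type", ""),
            enabled=bool(cfg.get("enabled", False)),
        )
        for name, cfg in raw.get("endpoints", {}).items()
    }
    config = RetrievalConfig(write_endpoint=raw.get("write_endpoint", ""), endpoints=endpoints)
    config.validate()
    return config
```

Treating a missing "enabled" key as False is a design choice in this sketch; the PR does not say what the real default is.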

code/core/generate_answer.py:122

Since the search() function is now used in place of get_vector_db_client(), verify that it correctly supports the 'query_params' argument to ensure consistent behavior across modules.

top_embeddings = await search(self.decontextualized_query, self.site, query_params=self.query_params)

Copilot · 2025-06-16T23:03:29Z

code/snowflake-connectivity.py

@@ -42,13 +42,11 @@ async def check_complete() -> bool:
    return resp.get("answer", None) is not None

 async def check_search() -> bool:
-    client = retriever.get_vector_db_client("snowflake_cortex_search_1")
-    resp = await client.search_all_sites("funny movies", top_n=1)
+    resp = await search_all_sites("funny movies", top_n=1, endpoint_name="snowflake_cortex_search_1")
    return len(resp) > 0 and len(resp[0]) == 4


The check 'len(resp[0]) == 4' assumes that each result always has 4 elements. Consider adopting a more flexible data validation approach to accommodate potential variations in response structure.

Suggested change

-    return len(resp) > 0 and len(resp[0]) == 4
+    if isinstance(resp, list) and len(resp) > 0 and isinstance(resp[0], (list, tuple)) and len(resp[0]) == 4:
+        return True
+    else:
+        print(f"❌ check_search: Unexpected response structure: {resp}")
+        return False
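A somewhat more flexible variant of this check could report why a response was rejected instead of only printing it. The helper below is a sketch, not code from the PR: first_result_problem is a hypothetical name, and accepting four or more fields (rather than exactly four) is an assumption about how strict the connectivity check needs to be.

```python
from collections.abc import Sequence
from typing import Any


def first_result_problem(resp: Any, min_fields: int = 4) -> str:
    """Return '' if the response looks usable, otherwise a short reason."""
    # Strings are sequences too, so exclude them explicitly.
    if not isinstance(resp, Sequence) or isinstance(resp, (str, bytes)):
        return f"expected a list of results, got {type(resp).__name__}"
    if len(resp) == 0:
        return "no results returned"
    first = resp[0]
    if not isinstance(first, Sequence) or isinstance(first, (str, bytes)):
        return f"expected a list or tuple per result, got {type(first).__name__}"
    if len(first) < min_fields:
        return f"expected at least {min_fields} fields, got {len(first)}"
    return ""
```

A caller such as check_search could then return `first_result_problem(resp) == ""` and log the reason when it is non-empty.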

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.


Copilot

Pull Request Overview

This PR fixes issues related to multi-database requests and improves consistency in database interactions while introducing a comprehensive suite of tests for database operations.

- Updated configuration and documentation to support multi-db setups and added a new write_endpoint field.
- Refactored database load and search functions to use asynchronous wrapper functions, removing direct Qdrant client calls.
- Introduced and expanded test scripts for database operations, endpoint statistics, and query deduplication analysis.

Reviewed Changes

Copilot reviewed 24 out of 27 changed files in this pull request and generated 1 comment.


File	Description
docs/retrieval-config.md	Added detailed configuration guide for retrieval endpoints and new write_endpoint option.
code/utils/json_utils.py	Introduced JSON trimming and merging functions, replacing old trim utilities.
code/tools/qdrant_load.py	Changed to async operations and switched from direct QdrantClient usage to async wrapper.
code/tools/db_load.py	Refactored upload and deletion functions to use wrapper functions and query_params.
Various test files in code/testing/	Added comprehensive asynchronous tests for database operations, search, and endpoint statistics.
Code in core modules (e.g. generate_answer.py, baseHandler.py, etc.)	Replaced direct client calls with search wrapper for consistency.
code/config/config_retrieval.yaml & code/config/config.py	Updated configuration to include an enabled flag per endpoint and a dedicated write_endpoint field.

Comments suppressed due to low confidence (1)

code/tools/db_load.py:117

The delete_site_from_database function still uses CONFIG.preferred_retrieval_endpoint to determine the endpoint if no database is provided. Since the configuration now switches to using a dedicated write_endpoint, update the reference to use CONFIG.write_endpoint for consistency with the new configuration design.

    query_params = {"db": [endpoint_name]} if database else None
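The fallback this comment asks for can be sketched as a small helper that always resolves to one explicit endpoint. resolve_write_params is a hypothetical name, and passing the write endpoint in as a plain string stands in for reading CONFIG.write_endpoint; unlike the line above, this sketch never returns None, which is a deliberate difference and not something the PR confirms.

```python
from typing import Dict, List, Optional


def resolve_write_params(database: Optional[str], write_endpoint: str) -> Dict[str, List[str]]:
    """Build query_params so a delete always targets one explicit endpoint."""
    # An explicitly requested database wins; otherwise fall back to the
    # configured write endpoint rather than a preferred read endpoint.
    endpoint_name = database or write_endpoint
    if not endpoint_name:
        raise ValueError("no database given and no write_endpoint configured")
    return {"db": [endpoint_name]}
```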

Copilot · 2025-06-18T19:12:37Z

code/utils/json_utils.py

+                    js[attr] = []
+                js[attr].append(items[attr]["name"])
+        elif (attr == "review"):
+            items['review'] = []


In the trim_movie function, the 'review' attribute is handled by resetting items['review'] to an empty list and then attempting to append to js[attr] without first initializing js['review']. Consider initializing js['review'] as an empty list before the loop to properly capture review bodies.

Suggested change

-            items['review'] = []
+            if attr not in js:
+                js[attr] = []
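To make the intent of the suggestion concrete, here is a minimal sketch of collecting review text into the output dict without overwriting the input. It is not the actual trim_movie code: collect_review_bodies is a hypothetical helper, and the "reviewBody" key is assumed from the schema.org Review type.

```python
from typing import Any, Dict


def collect_review_bodies(items: Dict[str, Any], js: Dict[str, Any], attr: str = "review") -> None:
    """Copy review text from items[attr] into js[attr], leaving items untouched."""
    reviews = items.get(attr) or []
    if isinstance(reviews, dict):
        reviews = [reviews]  # a single review may arrive as a bare object
    if attr not in js:
        js[attr] = []  # initialize the output list before appending to it
    for review in reviews:
        # "reviewBody" is an assumed field name; entries without it are skipped.
        body = review.get("reviewBody") if isinstance(review, dict) else None
        if body:
            js[attr].append(body)
```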

@rvguha I haven't looked at this one yet either, in case you want to investigate this comment from the Copilot code review.


jennifermarsman · 2025-06-18T19:47:48Z

@rvguha I pushed one round of fixes but still testing. We will need changes to check_connectivity too - to check all enabled databases rather than just the write endpoint now. Working on it.
Update: this has been fixed in commit d4ba3c8.

code/config/config_retrieval.yaml


jennifermarsman · 2025-06-18T20:24:24Z

code/retrieval/retriever.py

@@ -160,18 +162,21 @@ async def search_all_sites(self, query: str, num_results: int = 50, **kwargs) ->
        """
        pass



Why did you remove the @abstractmethod decorator? Was this intentional or a mistake?

rvguha · 2025-06-18T20:26:08Z

The abstract methods work at a single provider level. This function (not method) works against all the available providers.
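The distinction can be illustrated with a short sketch: an abstract method that each provider must implement, plus a module-level function that queries every provider. The class names, the providers argument, and the de-duplication on the first field (assumed here to be the URL) are all hypothetical and are not taken from retriever.py.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict, List


class RetrievalProvider(ABC):
    """Hypothetical single-provider interface."""

    @abstractmethod
    async def search_all_sites(self, query: str, num_results: int = 50) -> List[list]:
        ...


class InMemoryProvider(RetrievalProvider):
    """Toy provider that returns canned rows, used only for illustration."""

    def __init__(self, rows: List[list]):
        self.rows = rows

    async def search_all_sites(self, query: str, num_results: int = 50) -> List[list]:
        return self.rows[:num_results]


async def search_all_sites(query: str, providers: Dict[str, RetrievalProvider],
                           num_results: int = 50) -> List[list]:
    """Module-level function: query every provider concurrently and merge."""
    batches = await asyncio.gather(
        *(p.search_all_sites(query, num_results) for p in providers.values())
    )
    merged, seen = [], set()
    for batch in batches:
        for row in batch:
            if row[0] not in seen:  # assumption: the first field is the URL
                seen.add(row[0])
                merged.append(row)
    return merged[:num_results]
```

Because the function is not a method on the provider class, there is nothing for @abstractmethod to enforce; only the per-provider method needs it.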


jennifermarsman · 2025-06-18T20:35:34Z

code/retrieval/retriever.py

            query_params: Optional query parameters for overriding endpoint
        """
        self.query_params = query_params or {}
+        self.endpoint_name = endpoint_name  # Store the endpoint name


It looks like you still expose the query_params in the init method but you don't use them anywhere or override the defaults with them anymore, which is a functionality change. Can you please fix this? Either use them or get rid of them.

Good catch. If the params specify an endpoint, only that one should be used.
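The rule agreed on here could look roughly like the following. select_endpoints is a hypothetical helper; the "db" key mirrors the query_params shape used in db_load.py in this PR, and falling back to all enabled endpoints when nothing is specified is an assumption.

```python
from typing import Dict, List, Optional


def select_endpoints(query_params: Optional[Dict[str, List[str]]],
                     enabled_endpoints: List[str]) -> List[str]:
    """Return the endpoints a search should use."""
    requested = (query_params or {}).get("db")
    if requested:
        # An explicit override wins: search only what the caller asked for.
        if isinstance(requested, str):
            requested = [requested]
        return list(requested)
    # No override: fall back to every enabled endpoint.
    return list(enabled_endpoints)
```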

code/retrieval/retriever.py

jennifermarsman · 2025-06-18T21:06:42Z

code/retrieval/retriever.py

@@ -344,16 +619,17 @@ async def search(self, query: str, site: Union[str, List[str]],
        Returns:
            List of search results
        """


Jen start here - this is where you left off

i assume this is a note to yourself

yes it is sorry


rvguha added 2 commits June 16, 2025 12:51

Revert "Merge pull request #213 from microsoft/revert-182-Enable-Multiple-Simultaneous-Retrieval-Backends"

adcf9e5

This reverts commit ffa1db3, reversing changes made to 6fbdbc6.

Fixes for multi database issues

64c233e

- also provides better abstraction
- adds a number of tests

rvguha requested a review from jennifermarsman June 16, 2025 22:43

jennifermarsman requested review from Copilot June 16, 2025 23:02

Copilot AI reviewed Jun 16, 2025


Merge main into Fix-multi-database-issues branch

18fcbb9

- Resolved conflict in code/retrieval/retriever.py
- Kept dynamic imports with auto-installation from main
- Fixed variable references (db_type instead of self.db_type)

jennifermarsman self-assigned this Jun 18, 2025

jennifermarsman requested a review from Copilot June 18, 2025 19:12

Copilot AI reviewed Jun 18, 2025


Replacing all occurrences of preferred_retrieval_endpoint with write_endpoint

4d5db7a

jennifermarsman reviewed Jun 18, 2025


Setting defaults in config_retrieval.yaml that will be best for end users

fe6522a

jennifermarsman reviewed Jun 18, 2025


jennifermarsman reviewed Jun 18, 2025


Fixed logic to verify valid credentials for retrievers

1cf9c96

jennifermarsman reviewed Jun 18, 2025


Fixed check connectivity script to work with multiple retrieval provider backend

d4ba3c8

