Fix multi database issues #214
- also provides better abstraction
- adds a number of tests
Pull Request Overview
This PR fixes issues related to multi-database requests and improves test coverage for database operations. Key changes include refactoring test scripts across multiple modules to use the new search interface instead of direct client calls, updating configuration files to include an "enabled" flag and write endpoint, and enhancing logging and error handling across the codebase.
Reviewed Changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| code/testing/* | Refactored test scripts to verify multi-db requests and RSS feed loads |
| code/snowflake-connectivity.py | Updated to use search_all_sites and get_sites_wrapper for Snowflake |
| code/scraping/incrementalCrawlAndLoad.py | Replaced direct client usage with upload_documents and updated logging |
| code/pre_retrieval/* and code/core/* | Replaced get_vector_db_client with search and updated trim imports |
| code/config/* | Updated retrieval configuration with write_endpoint and "enabled" flag |
Comments suppressed due to low confidence (2)
code/config/config.py:247
- Ensure that the new 'enabled' field and 'write_endpoint' in the retrieval configuration are clearly documented and consistently handled in the codebase to avoid confusion during maintenance.
db_type=self._get_config_value(cfg.get("db_type")), # Add db_type
code/core/generate_answer.py:122
- Since the search() function is now used in place of get_vector_db_client(), verify that it correctly supports the 'query_params' argument to ensure consistent behavior across modules.
top_embeddings = await search(self.decontextualized_query, self.site, query_params=self.query_params)
code/snowflake-connectivity.py
Outdated
@@ -42,13 +42,11 @@ async def check_complete() -> bool:
     return resp.get("answer", None) is not None

 async def check_search() -> bool:
-    client = retriever.get_vector_db_client("snowflake_cortex_search_1")
-    resp = await client.search_all_sites("funny movies", top_n=1)
+    resp = await search_all_sites("funny movies", top_n=1, endpoint_name="snowflake_cortex_search_1")
     return len(resp) > 0 and len(resp[0]) == 4
The check 'len(resp[0]) == 4' assumes that each result always has 4 elements. Consider adopting a more flexible data validation approach to accommodate potential variations in response structure.
Suggested change:
-    return len(resp) > 0 and len(resp[0]) == 4
+    if isinstance(resp, list) and len(resp) > 0 and isinstance(resp[0], (list, tuple)) and len(resp[0]) == 4:
+        return True
+    else:
+        print(f"❌ check_search: Unexpected response structure: {resp}")
+        return False
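One way to make this check less brittle is to move the shape validation into a small helper that inspects every result rather than only the first. The sketch below is an illustration, not code from this PR: `is_valid_result` and its `expected_len` parameter are hypothetical names, and the four-field layout is inferred only from the `len(resp[0]) == 4` check above.

```python
from typing import Any


def is_valid_result(resp: Any, expected_len: int = 4) -> bool:
    """Return True if resp is a non-empty list of list/tuple items,
    each holding exactly expected_len fields.

    Hypothetical helper; the four-field default mirrors the original check.
    """
    if not isinstance(resp, list) or not resp:
        return False
    return all(
        isinstance(item, (list, tuple)) and len(item) == expected_len
        for item in resp
    )
```

With a helper like this, `check_search` could return `is_valid_result(resp)` and log the unexpected structure only when the helper returns `False`.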
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
- Resolved conflict in code/retrieval/retriever.py
- Kept dynamic imports with auto-installation from main
- Fixed variable references (db_type instead of self.db_type)
Pull Request Overview
This PR fixes issues related to multi-database requests and improves consistency in database interactions while introducing a comprehensive suite of tests for database operations.
- Updated configuration and documentation to support multi-db setups and added a new write_endpoint field.
- Refactored database load and search functions to use asynchronous wrapper functions, removing direct Qdrant client calls.
- Introduced and expanded test scripts for database operations, endpoint statistics, and query deduplication analysis.
Reviewed Changes
Copilot reviewed 24 out of 27 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| docs/retrieval-config.md | Added detailed configuration guide for retrieval endpoints and new write_endpoint option. |
| code/utils/json_utils.py | Introduced JSON trimming and merging functions, replacing old trim utilities. |
| code/tools/qdrant_load.py | Changed to async operations and switched from direct QdrantClient usage to async wrapper. |
| code/tools/db_load.py | Refactored upload and deletion functions to use wrapper functions and query_params. |
| Various test files in code/testing/ | Added comprehensive asynchronous tests for database operations, search, and endpoint statistics. |
| Code in core modules (e.g. generate_answer.py, baseHandler.py, etc.) | Replaced direct client calls with search wrapper for consistency. |
| code/config/config_retrieval.yaml & code/config/config.py | Updated configuration to include an enabled flag per endpoint and a dedicated write_endpoint field. |
Comments suppressed due to low confidence (1)
code/tools/db_load.py:117
- The delete_site_from_database function still uses CONFIG.preferred_retrieval_endpoint to determine the endpoint if no database is provided. Since the configuration now switches to using a dedicated write_endpoint, update the reference to use CONFIG.write_endpoint for consistency with the new configuration design.
query_params = {"db": [endpoint_name]} if database else None
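The fallback that this suppressed comment asks for can be sketched as follows. This is a minimal illustration under stated assumptions: `RetrievalConfig` is a hypothetical stand-in for the project's `CONFIG` object, and `resolve_delete_endpoint` is not a function from the PR. Only the field names `write_endpoint` and `preferred_retrieval_endpoint` come from the comment above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalConfig:
    """Hypothetical stand-in for the project's CONFIG object."""
    write_endpoint: Optional[str] = None
    # Kept only to show that it is deliberately not consulted for writes.
    preferred_retrieval_endpoint: Optional[str] = None


def resolve_delete_endpoint(config: RetrievalConfig,
                            database: Optional[str] = None) -> str:
    """Pick the endpoint a delete should target.

    An explicitly passed database wins; otherwise fall back to write_endpoint.
    """
    endpoint = database or config.write_endpoint
    if not endpoint:
        raise ValueError("No database given and no write_endpoint configured")
    return endpoint
```

Raising when neither value is set avoids silently deleting from a read-oriented endpoint; whether the real function should raise or log is a design choice left to the maintainers.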
        js[attr] = []
    js[attr].append(items[attr]["name"])
elif (attr == "review"):
    items['review'] = []
In the trim_movie function, the 'review' attribute is handled by resetting items['review'] to an empty list and then attempting to append to js[attr] without first initializing js['review']. Consider initializing js['review'] as an empty list before the loop to properly capture review bodies.
Suggested change:
-    items['review'] = []
+    if attr not in js:
+        js[attr] = []
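The pattern behind this suggestion is to initialise the output list before appending to it. The sketch below illustrates that in isolation. It does not reproduce `trim_movie`: `collect_review_bodies` is a hypothetical helper, and both the `reviewBody` key and the list-or-single-object shape of `review` are assumptions based on common schema.org markup.

```python
from typing import Any, Dict, List


def collect_review_bodies(item: Dict[str, Any]) -> List[str]:
    """Return the reviewBody text of each review on a schema.org-style item.

    Hypothetical helper; key names are assumptions, not taken from the PR.
    """
    reviews = item.get("review", [])
    if isinstance(reviews, dict):
        # A single review may appear as a bare object instead of a list.
        reviews = [reviews]
    bodies: List[str] = []  # initialise the target list before appending
    for review in reviews:
        if isinstance(review, dict) and "reviewBody" in review:
            bodies.append(review["reviewBody"])
    return bodies
```

A caller could then assign `js["review"] = collect_review_bodies(items)` in one step, leaving the source item unmodified.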
@rvguha I haven't looked at this yet either, in case you want to investigate this comment from the Copilot code review.
@@ -160,18 +162,21 @@ async def search_all_sites(self, query: str, num_results: int = 50, **kwargs) ->
        """
        pass
Why did you remove the @abstractmethod? Was this intentional or a mistake?
The abstract methods work at a single provider level. This function (not method) works against all the available providers.
On Wed, Jun 18, 2025 at 1:24 PM Jennifer Marsman commented on code/retrieval/retriever.py:
> Why did you remove the @abstractmethod?
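The distinction drawn in the reply, abstract methods per provider versus a plain function spanning all providers, can be sketched as below. All names here are assumptions for illustration: `VectorDBClient`, `InMemoryClient`, and the `clients` argument are hypothetical, and the real `search_all_sites` in code/retrieval/retriever.py may merge results differently.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict, List


class VectorDBClient(ABC):
    """Hypothetical per-provider interface; each backend must implement it."""

    @abstractmethod
    async def search_all_sites(self, query: str, num_results: int = 50) -> List[str]:
        ...


class InMemoryClient(VectorDBClient):
    """Toy provider used only for this sketch."""

    def __init__(self, docs: List[str]) -> None:
        self.docs = docs

    async def search_all_sites(self, query: str, num_results: int = 50) -> List[str]:
        return [d for d in self.docs if query in d][:num_results]


async def search_all_sites(query: str, clients: Dict[str, VectorDBClient],
                           num_results: int = 50) -> List[str]:
    """Module-level function: query every provider concurrently, then merge."""
    per_client = await asyncio.gather(
        *(c.search_all_sites(query, num_results) for c in clients.values())
    )
    merged: List[str] = []
    for results in per_client:
        for r in results:
            if r not in merged:  # order-preserving de-duplication
                merged.append(r)
    return merged[:num_results]
```

Because the module-level function is not part of any class, `@abstractmethod` has nothing to attach to there; the decorator stays meaningful only on the per-provider methods.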
            query_params: Optional query parameters for overriding endpoint
        """
        self.query_params = query_params or {}
        self.endpoint_name = endpoint_name  # Store the endpoint name
It looks like you still expose the query_params in the init method but you don't use them anywhere or override the defaults with them anymore, which is a functionality change. Can you please fix this? Either use them or get rid of them.
Good catch. If the params specify an endpoint, only that one should be used.
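The behaviour agreed on here could look roughly like the following. This is a sketch under assumptions: `select_endpoints` is a hypothetical helper, and the `"db"` key is borrowed from the `query_params = {"db": [endpoint_name]}` line quoted earlier in this review; the actual parameter handling in the retriever may differ.

```python
from typing import Any, Dict, List, Optional


def select_endpoints(enabled_endpoints: List[str],
                     query_params: Optional[Dict[str, Any]] = None) -> List[str]:
    """Return the endpoints a search should hit.

    If query_params names endpoints under the "db" key, only those are used;
    otherwise every enabled endpoint is searched.
    """
    requested = (query_params or {}).get("db")
    if not requested:
        return list(enabled_endpoints)
    if isinstance(requested, str):
        requested = [requested]  # accept a single name as well as a list
    unknown = [name for name in requested if name not in enabled_endpoints]
    if unknown:
        raise ValueError(f"Unknown or disabled endpoint(s): {unknown}")
    return list(requested)
```

Rejecting unknown names up front keeps a typo in `query_params` from silently falling back to searching everything.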
@@ -344,16 +619,17 @@ async def search(self, query: str, site: Union[str, List[str]],
        Returns:
            List of search results
        """
Jen start here - this is where you left off
i assume this is a note to yourself
yes it is sorry
Fixes issues with multi-db requests
Introduces tests to make sure things do work