
Fix multi database issues #214


Open
wants to merge 7 commits into
base: main

Conversation

rvguha
Collaborator

@rvguha rvguha commented Jun 16, 2025

Fixes issues with multi-db requests
Introduces tests to make sure things do work

rvguha added 2 commits June 16, 2025 12:51
…iple-Simultaneous-Retrieval-Backends"

This reverts commit ffa1db3, reversing
changes made to 6fbdbc6.
- also provides better abstraction
- adds a number of tests
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes issues related to multi-database requests and improves test coverage for database operations. Key changes include refactoring test scripts across multiple modules to use the new search interface instead of direct client calls, updating configuration files to include an "enabled" flag and write endpoint, and enhancing logging and error handling across the codebase.

Reviewed Changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| code/testing/* | Refactored test scripts to verify multi-db requests and RSS feed loads |
| code/snowflake-connectivity.py | Updated to use search_all_sites and get_sites_wrapper for Snowflake |
| code/scraping/incrementalCrawlAndLoad.py | Replaced direct client usage with upload_documents and updated logging |
| code/pre_retrieval/* and code/core/* | Replaced get_vector_db_client with search and updated trim imports |
| code/config/* | Updated retrieval configuration with write_endpoint and "enabled" flag |
Comments suppressed due to low confidence (2)

code/config/config.py:247

  • Ensure that the new 'enabled' field and 'write_endpoint' in the retrieval configuration are clearly documented and consistently handled in the codebase to avoid confusion during maintenance.
db_type=self._get_config_value(cfg.get("db_type")),  # Add db_type

code/core/generate_answer.py:122

  • Since the search() function is now used in place of get_vector_db_client(), verify that it correctly supports the 'query_params' argument to ensure consistent behavior across modules.
top_embeddings = await search(self.decontextualized_query, self.site, query_params=self.query_params)

@@ -42,13 +42,11 @@ async def check_complete() -> bool:
     return resp.get("answer", None) is not None

 async def check_search() -> bool:
-    client = retriever.get_vector_db_client("snowflake_cortex_search_1")
-    resp = await client.search_all_sites("funny movies", top_n=1)
+    resp = await search_all_sites("funny movies", top_n=1, endpoint_name="snowflake_cortex_search_1")
     return len(resp) > 0 and len(resp[0]) == 4

Copilot AI Jun 16, 2025


The check 'len(resp[0]) == 4' assumes that each result always has 4 elements. Consider adopting a more flexible data validation approach to accommodate potential variations in response structure.

Suggested change
-    return len(resp) > 0 and len(resp[0]) == 4
+    if isinstance(resp, list) and len(resp) > 0 and isinstance(resp[0], (list, tuple)) and len(resp[0]) == 4:
+        return True
+    else:
+        print(f"❌ check_search: Unexpected response structure: {resp}")
+        return False
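
The suggestion above only inspects the first result. A slightly more general check could look like the sketch below. The helper name `is_valid_search_response` and the `EXPECTED_FIELDS` constant are hypothetical and are not part of this PR; the 4-element shape is taken from the original `len(resp[0]) == 4` test.

```python
from typing import Any

# Assumption: each hit is a 4-element list or tuple, as the original check implies.
EXPECTED_FIELDS = 4


def is_valid_search_response(resp: Any, expected_fields: int = EXPECTED_FIELDS) -> bool:
    """Return True when resp is a non-empty list and every item has the expected length."""
    if not isinstance(resp, list) or not resp:
        return False
    return all(
        isinstance(item, (list, tuple)) and len(item) == expected_fields
        for item in resp
    )
```

Checking every item instead of only `resp[0]` catches a malformed hit later in the list, at the cost of a full pass over the results.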


Contributor

@Copilot Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

This comment was marked as outdated.

- Resolved conflict in code/retrieval/retriever.py
- Kept dynamic imports with auto-installation from main
- Fixed variable references (db_type instead of self.db_type)
@jennifermarsman jennifermarsman self-assigned this Jun 18, 2025
@jennifermarsman jennifermarsman requested a review from Copilot June 18, 2025 19:12
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes issues related to multi-database requests and improves consistency in database interactions while introducing a comprehensive suite of tests for database operations.

  • Updated configuration and documentation to support multi-db setups and added a new write_endpoint field.
  • Refactored database load and search functions to use asynchronous wrapper functions, removing direct Qdrant client calls.
  • Introduced and expanded test scripts for database operations, endpoint statistics, and query deduplication analysis.

Reviewed Changes

Copilot reviewed 24 out of 27 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| docs/retrieval-config.md | Added detailed configuration guide for retrieval endpoints and new write_endpoint option. |
| code/utils/json_utils.py | Introduced JSON trimming and merging functions, replacing old trim utilities. |
| code/tools/qdrant_load.py | Changed to async operations and switched from direct QdrantClient usage to async wrapper. |
| code/tools/db_load.py | Refactored upload and deletion functions to use wrapper functions and query_params. |
| Various test files in code/testing/ | Added comprehensive asynchronous tests for database operations, search, and endpoint statistics. |
| Code in core modules (e.g. generate_answer.py, baseHandler.py, etc.) | Replaced direct client calls with search wrapper for consistency. |
| code/config/config_retrieval.yaml & code/config/config.py | Updated configuration to include an enabled flag per endpoint and a dedicated write_endpoint field. |
Comments suppressed due to low confidence (1)

code/tools/db_load.py:117

  • The delete_site_from_database function still uses CONFIG.preferred_retrieval_endpoint to determine the endpoint if no database is provided. Since the configuration now switches to using a dedicated write_endpoint, update the reference to use CONFIG.write_endpoint for consistency with the new configuration design.
    query_params = {"db": [endpoint_name]} if database else None
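
One way to read this suggestion is that an explicitly named database wins and the configured write endpoint is the fallback. The sketch below illustrates that rule only. `resolve_delete_target` is a hypothetical helper, and the endpoint names in the usage note are placeholders, not values from this repository.

```python
from typing import Dict, List, Optional, Tuple


def resolve_delete_target(
    database: Optional[str], write_endpoint: str
) -> Tuple[str, Optional[Dict[str, List[str]]]]:
    """Pick the endpoint to delete from and build the matching query_params.

    Assumption: write_endpoint is the value of CONFIG.write_endpoint and
    database is the optional name passed by the caller.
    """
    endpoint_name = database or write_endpoint
    # Mirrors the line quoted above: only pin the db when one was given explicitly.
    query_params = {"db": [endpoint_name]} if database else None
    return endpoint_name, query_params
```

For example, `resolve_delete_target(None, "local_store")` falls back to `"local_store"` with no `query_params`, while `resolve_delete_target("remote_store", "local_store")` pins the request to `"remote_store"`.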

    js[attr] = []
    js[attr].append(items[attr]["name"])
elif (attr == "review"):
    items['review'] = []

Copilot AI Jun 18, 2025


In the trim_movie function, the 'review' attribute is handled by resetting items['review'] to an empty list and then attempting to append to js[attr] without first initializing js['review']. Consider initializing js['review'] as an empty list before the loop to properly capture review bodies.

Suggested change
-    items['review'] = []
+    if attr not in js:
+        js[attr] = []
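
For context, a self-contained way to collect review bodies might look like the sketch below. It is not the PR's `trim_movie` code. The function name `collect_review_bodies` is hypothetical, and it assumes reviews follow the schema.org `Review` shape with a `reviewBody` property.

```python
from typing import Any, Dict, List


def collect_review_bodies(items: Dict[str, Any]) -> List[str]:
    """Return the reviewBody strings found under items['review'].

    Assumption: 'review' holds either one review object or a list of them.
    """
    reviews = items.get("review") or []
    if isinstance(reviews, dict):
        # A single review may appear as a bare object instead of a list.
        reviews = [reviews]
    bodies: List[str] = []
    for review in reviews:
        if isinstance(review, dict) and "reviewBody" in review:
            bodies.append(review["reviewBody"])
    return bodies
```

The caller can then assign the result to `js["review"]` in one step, which avoids appending to a key that was never initialized.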


Member


@rvguha I haven't looked at this one yet either, in case you want to investigate this comment from the Copilot code review.

@jennifermarsman
Member

jennifermarsman commented Jun 18, 2025

@rvguha I pushed one round of fixes but still testing. We will need changes to check_connectivity too - to check all enabled databases rather than just the write endpoint now. Working on it.
Update: this has been fixed in commit d4ba3c8.
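
A rough sketch of what checking every enabled endpoint could look like follows. It is not the code from commit d4ba3c8. `EndpointConfig`, `check_enabled_endpoints`, and the demo probe are hypothetical names, and the only detail taken from this PR is the per-endpoint `enabled` flag.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List


@dataclass
class EndpointConfig:
    """Hypothetical stand-in for one retrieval endpoint entry."""
    name: str
    enabled: bool = True


async def check_enabled_endpoints(
    endpoints: List[EndpointConfig],
    probe: Callable[[str], Awaitable[bool]],
) -> Dict[str, bool]:
    """Probe every enabled endpoint concurrently and report pass/fail per name."""
    enabled = [ep for ep in endpoints if ep.enabled]
    # return_exceptions=True keeps one failing endpoint from hiding the others.
    results = await asyncio.gather(
        *(probe(ep.name) for ep in enabled), return_exceptions=True
    )
    return {ep.name: (res is True) for ep, res in zip(enabled, results)}


async def _demo_probe(name: str) -> bool:
    """Illustrative probe: one endpoint name is treated as unreachable."""
    if name == "broken":
        raise ConnectionError("unreachable")
    return True


demo_endpoints = [
    EndpointConfig("a"),
    EndpointConfig("b", enabled=False),
    EndpointConfig("broken"),
]
demo_result = asyncio.run(check_enabled_endpoints(demo_endpoints, _demo_probe))
```

Disabled endpoints are skipped entirely, and an endpoint whose probe raises is reported as failed instead of aborting the whole check.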

@@ -160,18 +162,21 @@ async def search_all_sites(self, query: str, num_results: int = 50, **kwargs) ->
"""
pass

Member

@jennifermarsman jennifermarsman Jun 18, 2025


Why did you remove the @abstractmethod decorator? Was this intentional or a mistake?
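
For readers unfamiliar with the decorator, the sketch below shows what it enforces. The class names are hypothetical and the method's return annotation is omitted because the diff hunk truncates it. With `@abstractmethod` in place, a subclass that forgets to implement `search_all_sites` cannot be instantiated; without it, the base `pass` body would silently return `None`.

```python
from abc import ABC, abstractmethod


class RetrievalClientBase(ABC):
    """Hypothetical base class standing in for the retrieval client interface."""

    @abstractmethod
    async def search_all_sites(self, query: str, num_results: int = 50, **kwargs):
        """Search across all sites; concrete clients must override this."""


class IncompleteClient(RetrievalClientBase):
    """Forgets to implement search_all_sites."""


class CompleteClient(RetrievalClientBase):
    async def search_all_sites(self, query: str, num_results: int = 50, **kwargs):
        return []


def can_instantiate(cls) -> bool:
    """Return False when Python refuses to instantiate an abstract class."""
    try:
        cls()
        return True
    except TypeError:
        return False
```

Here `can_instantiate(IncompleteClient)` is false and `can_instantiate(CompleteClient)` is true, which is the safety net that dropping the decorator removes.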

@rvguha
Collaborator Author

rvguha commented Jun 18, 2025 via email

query_params: Optional query parameters for overriding endpoint
"""
self.query_params = query_params or {}
self.endpoint_name = endpoint_name # Store the endpoint name
Member


It looks like you still expose the query_params in the init method but you don't use them anywhere or override the defaults with them anymore, which is a functionality change. Can you please fix this? Either use them or get rid of them.

Collaborator Author


Good catch. If the params specify an endpoint, only that one should be used.
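
The rule described in this reply could be sketched as follows. `select_endpoints` is a hypothetical helper, not the PR's implementation. The `"db"` key holding a list of endpoint names follows the `query_params = {"db": [endpoint_name]}` line quoted earlier in this review; accepting a bare string as well is an added assumption.

```python
from typing import Any, Dict, List, Optional


def select_endpoints(
    enabled_endpoints: List[str],
    query_params: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Return the endpoints a request should hit.

    If query_params names endpoints under "db", use only those;
    otherwise fall back to every enabled endpoint.
    """
    requested = (query_params or {}).get("db")
    if not requested:
        return list(enabled_endpoints)
    if isinstance(requested, str):
        # Assumption: tolerate a single name as well as a list of names.
        requested = [requested]
    return list(requested)
```

Whether a requested endpoint must also be enabled is left open here; a stricter variant could intersect `requested` with `enabled_endpoints`.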

@@ -344,16 +619,17 @@ async def search(self, query: str, site: Union[str, List[str]],
Returns:
List of search results
"""
Member


Jen start here - this is where you left off

Collaborator Author


I assume this is a note to yourself.

Member


Yes, it is. Sorry.
