Add google embedder support #192
base: main
…t, integration, and API tests
…e backward compatibility
Hello @kuarcis, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary for this pull request titled "Add google embedder support".
The primary goal of this PR is to integrate support for Google AI's embedding models into the DeepWiki project. This allows users to utilize Google's embedding capabilities, particularly the `text-embedding-004` model, and simplifies setup for users already leveraging Google Gemini models for text generation, as it uses the same `GOOGLE_API_KEY`.
The changes involve adding a new client for Google embeddings, updating the configuration system to allow selecting the embedder type via the `DEEPWIKI_EMBEDDER_TYPE` environment variable, modifying the data processing pipeline and RAG components to be aware of the different embedder types, enhancing documentation, and adding comprehensive tests for the new functionality and the overall embedder system.
This enhancement provides users with more flexibility in choosing their embedding provider based on their existing infrastructure and preferences.
Highlights
- Google Embedder Support: Adds a new `GoogleEmbedderClient` to integrate with Google AI's embedding API, specifically supporting the `text-embedding-004` model for tasks like semantic similarity.
- Flexible Embedder Selection: Introduces the `DEEPWIKI_EMBEDDER_TYPE` environment variable (`openai`, `google`, `ollama`) to easily switch between different embedding providers without code changes. Defaults to `openai`.
- Updated Data Pipeline & RAG: Modifies core components like token counting, document reading, data pipeline preparation, and the RAG class to correctly handle and utilize the selected embedder type, including backward compatibility for the old `is_ollama_embedder` parameter.
- Enhanced Documentation: Updates the `README.md` to clearly explain how to enable and use Google AI embeddings, including configuration options for `.env`, Docker, and Docker Compose, and provides a table of available embedder types.
- Comprehensive Testing: Adds new unit and integration tests (`test_google_embedder.py`, `test_full_integration.py`, `test_all_embedders.py`) to verify the Google embedder client, AdalFlow integration, embedder selection logic, environment variable handling, and data pipeline compatibility across different embedder types.
- Minor API & Main Updates: Includes a small fix in the API's cache deletion endpoint for authorization code validation and refines the `uvicorn` reload configuration in `main.py` for development mode using `watchfiles`.
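The embedder selection described in the highlights can be pictured with a minimal sketch. It is not the PR's code: the helper names, the fallback to `openai` for unrecognized values, and the `embedder_ollama` config key are assumptions made for illustration; only `DEEPWIKI_EMBEDDER_TYPE`, the three type names, the `openai` default, and the `embedder` and `embedder_google` keys come from the summary.

```python
import os

# Embedder types named in the PR summary.
VALID_TYPES = ("openai", "google", "ollama")


def get_embedder_type(env=None):
    """Read DEEPWIKI_EMBEDDER_TYPE, defaulting to 'openai'.

    Falling back to 'openai' for unknown values is an assumption of this
    sketch, not confirmed behavior of the PR.
    """
    env = os.environ if env is None else env
    value = env.get("DEEPWIKI_EMBEDDER_TYPE", "openai").strip().lower()
    return value if value in VALID_TYPES else "openai"


def config_key_for(embedder_type):
    """Map an embedder type to a config section name.

    'embedder' and 'embedder_google' appear in the changelog;
    'embedder_ollama' is a hypothetical name used here for symmetry.
    """
    keys = {"google": "embedder_google", "ollama": "embedder_ollama"}
    return keys.get(embedder_type, "embedder")
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real process environment.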
Changelog
- README.md
  - Added 'Flexible Embeddings' feature to the main list.
  - Added `DEEPWIKI_EMBEDDER_TYPE=google` to the Quick Start `.env` example.
  - Added `DEEPWIKI_EMBEDDER_TYPE=google` to the Manual Setup `.env` example.
  - Added a new section '🧠 Using Google AI Embeddings' detailing features, how to enable (env var, Docker, Docker Compose), available types table, reasons to use it, and switching instructions.
  - Updated the Environment Variables table to include `DEEPWIKI_EMBEDDER_TYPE` and clarify API key requirements based on embedder type.
- api/api.py
  - Modified `delete_wiki_cache` endpoint to check if `authorization_code` is not empty before comparing it to `WIKI_AUTH_CODE`.
- api/config.py
  - Imported `GoogleEmbedderClient`.
  - Added `EMBEDDER_TYPE` environment variable (`DEEPWIKI_EMBEDDER_TYPE`, default 'openai').
  - Added `GoogleEmbedderClient` to the `CLIENT_CLASSES` mapping.
  - Updated `load_embedder_config` to include the `embedder_google` key when processing client classes.
  - Modified `get_embedder_config` to return the configuration based on the `EMBEDDER_TYPE` environment variable ('google', 'ollama', or default 'embedder').
  - Added `is_google_embedder` function to check if the current embedder is Google.
  - Added `get_embedder_type` function to return the current embedder type string ('ollama', 'google', 'openai').
  - Updated the loop in the main config loading section to include `embedder_google` when updating configs.
- api/config/embedder.json
  - Added a new configuration section `embedder_google` specifying `GoogleEmbedderClient`, `batch_size`, and `model_kwargs` (`text-embedding-004`, `SEMANTIC_SIMILARITY`).
- api/data_pipeline.py
  - Modified `count_tokens` to accept optional `embedder_type` parameter (with backward compatibility for `is_ollama_embedder`) and use it to determine the encoding.
  - Modified `read_all_documents` to accept optional `embedder_type` parameter (with backward compatibility for `is_ollama_embedder`) and pass it to `count_tokens`.
  - Modified `prepare_data_pipeline` to accept optional `embedder_type` parameter (with backward compatibility for `is_ollama_embedder`) and use it to select the embedder and the appropriate document processor (`OllamaDocumentProcessor` for ollama, `ToEmbeddings` for others including google).
- api/google_embedder_client.py
  - Added a new file implementing `GoogleEmbedderClient` inheriting from `adalflow.core.model_client.ModelClient`.
  - Includes methods for initializing the client with `GOOGLE_API_KEY`, parsing Google AI embedding responses, converting inputs to API kwargs (handling single and batch), and calling the Google AI embedding API (`genai.embed_content`).
  - Adds backoff for API calls.
  - Notes the lack of async support in the current Google AI Python client.
- api/main.py
  - Removed unused `uvicorn` import at the top.
  - Added configuration for `watchfiles` logger to show file paths in development.
  - Implemented `watchfiles` monkey patch to specifically watch `api` subdirectories (excluding `logs`) and `.py` files in the `api` root during development reload.
  - Updated `uvicorn.run` call to include `reload_excludes` for `logs`, `__pycache__`, and `.pyc` files when `reload` is enabled.
- api/rag.py
  - Modified `RAG` class initialization to use `api.config.get_embedder_type()` to determine the embedder type and pass it to `get_embedder`.
  - Updated `prepare_retriever` method to pass the detected `embedder_type` to `prepare_database`.
- api/tools/embedder.py
  - Modified `get_embedder` function to accept `embedder_type` and `use_google_embedder` (legacy) parameters.
  - Updated logic to select the embedder configuration based on `embedder_type`, legacy parameters, or auto-detection via `api.config.get_embedder_type()`.
  - Added logic to set the `batch_size` attribute on the returned `adal.Embedder` instance if it's present in the configuration.
- tests/README.md
  - Added mention of Google AI embedder tests.
  - Updated Environment Variables section to include `DEEPWIKI_EMBEDDER_TYPE` requirement for Google tests.
  - Added `test_google_embedder.py` and `test_google_embedder_fix.py` to the Unit Tests section.
  - Added `test_full_integration.py` to the Integration Tests section.
  - Updated descriptions for test categories.
  - Added troubleshooting tips for API Key Issues and Server Dependencies.
- tests/`__init__.py`
  - Added a comment.
- tests/api/`__init__.py`
  - Added a comment.
- tests/integration/`__init__.py`
  - Added a comment.
- tests/integration/test_full_integration.py
  - Added a new file with integration tests specifically for Google AI embeddings, including tests for configuration loading, embedder selection, and environment variable handling.
- tests/run_tests.py
  - Updated the test runner script to include the new test directories (`unit`, `integration`, `api`).
  - Added `check_environment` function to verify required API keys (`GOOGLE_API_KEY`, `OPENAI_API_KEY`) and dependencies (`adalflow`, `google-generativeai`, `requests`).
  - Modified `run_tests` to iterate through specified directories and run `test_*.py` files.
  - Added argument parsing for `--unit`, `--integration`, `--api`, and `--check-env`.
- tests/unit/`__init__.py`
  - Added a comment.
- tests/unit/test_all_embedders.py
  - Added a new file with comprehensive unit tests for the embedder system.
  - Includes tests for configuration loading, embedder type detection (`is_ollama_embedder`, `is_google_embedder`, `get_embedder_type`), `get_embedder_config`, `get_embedder` factory function (with explicit types, legacy params, and auto-detection), direct client tests (`GoogleEmbedderClient`, `OpenAIClient` via AdalFlow), data pipeline functions (`count_tokens`, `prepare_data_pipeline`), RAG integration, and environment variable handling.
  - Uses a simple custom test runner.
- tests/unit/test_google_embedder.py
  - Added a new file with unit tests specifically for the `GoogleEmbedderClient`.
  - Includes tests for the client's `call` and `parse_embedding_response` methods for both single and batch embeddings, and tests its integration with the AdalFlow `Embedder` and `ToEmbeddings` components.
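Several changelog entries mention accepting a new `embedder_type` parameter while keeping the legacy `is_ollama_embedder` flag working. One way such precedence could be resolved is sketched below. The function name `resolve_embedder_type` and its exact rules are hypothetical; the `default` argument stands in for the auto-detection that the changelog attributes to `api.config.get_embedder_type()`.

```python
def resolve_embedder_type(embedder_type=None, is_ollama_embedder=None,
                          default="openai"):
    """Pick an embedder type from new-style and legacy arguments.

    Assumed precedence (illustrative only):
      1. an explicit embedder_type string wins;
      2. otherwise a legacy is_ollama_embedder=True maps to 'ollama';
      3. otherwise fall back to `default`, which stands in for auto-detection.
    """
    if embedder_type is not None:
        return embedder_type
    if is_ollama_embedder:
        return "ollama"
    return default
```

Existing callers that only pass `is_ollama_embedder=True` keep their behavior, while new callers can pass `embedder_type="google"` directly.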
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description |
---|---|---|
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Google's vectors now arrive,
Making DeepWiki feel alive.
With keys in place,
We embed with grace,
And watch our RAG thrive.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request effectively introduces support for Google Embedders, which is a valuable addition to the DeepWiki Open project. The refactoring from boolean flags to a string-based `embedder_type` for selecting embedders is a significant improvement for maintainability and future extensibility.
The new `GoogleEmbedderClient` is well-implemented, and the configuration changes are clear and appropriate. The updates to the `README.md` are particularly commendable, providing comprehensive documentation for the new feature, including setup, benefits, and switching between embedders. The backward compatibility for existing embedder configurations has also been thoughtfully handled.
Overall, the changes are well-executed and enhance the flexibility of the embedding system. The inclusion of new tests and the statement that they pass is also a positive indicator of quality.
Summary of Findings
- Refactoring for Embedder Type: The refactoring across multiple files (e.g., `api/config.py`, `api/data_pipeline.py`, `api/rag.py`, `api/tools/embedder.py`) to use a string-based `embedder_type` instead of boolean flags (like `is_ollama_embedder`) is a well-executed improvement. This makes the system more extensible for adding new embedder types in the future and improves code clarity.
- GoogleEmbedderClient Implementation: The new `api/google_embedder_client.py` provides a solid integration with the Google AI Embeddings API, correctly handling API key configuration, input conversion for single and batch embeddings, and response parsing. The use of `backoff` for API calls is also a good practice.
- Documentation (README.md): The updates to `README.md` are comprehensive and very helpful for users. The new section on 'Using Google AI Embeddings' clearly explains its features, setup, and benefits. The updated environment variable table and API key requirements are also much clearer.
- Configuration Management: The changes in `api/config.py` and `api/config/embedder.json` correctly incorporate the Google embedder, allowing for easy configuration and selection via the `DEEPWIKI_EMBEDDER_TYPE` environment variable.
- Backward Compatibility: The handling of backward compatibility for functions previously using `is_ollama_embedder` by introducing logic to interpret it in the context of the new `embedder_type` system is well done, ensuring a smoother transition.
- Minor: Unused Parameter in `get_embedder`: In `api/tools/embedder.py`, the `get_embedder` function signature was updated to include `use_google_embedder: bool = False`. While this doesn't cause issues, this specific parameter is not utilized by any call sites within the changes of this PR. This was noted but not commented on due to review settings (low severity).
- Minor: Potentially Unreachable Code in `GoogleEmbedderClient`: In `api/google_embedder_client.py`, the `parse_embedding_response` method includes a condition `elif hasattr(response, 'embeddings'):`. Given the standard responses from the Google AI SDK's `embed_content` method, this specific branch might be unreachable. This was noted but not commented on due to review settings (low severity).
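To make the response-parsing finding more concrete, here is a rough sketch of how single and batch embedding responses might be normalized into one shape. It assumes the response is a dict whose `embedding` key holds either one flat vector (single input) or a list of vectors (batch input); that shape and the helper name `extract_embeddings` are assumptions of this sketch, not the PR's actual `parse_embedding_response`.

```python
def extract_embeddings(response):
    """Return a list of vectors from a dict-shaped embedding response.

    Assumed input shapes (illustrative):
      {"embedding": [0.1, 0.2]}        -> one vector (single input)
      {"embedding": [[0.1], [0.2]]}    -> several vectors (batch input)
    """
    embedding = response.get("embedding")
    if not embedding:
        # Missing or empty payload: nothing to return.
        return []
    if isinstance(embedding[0], (list, tuple)):
        # Batch case: already a sequence of vectors.
        return [list(vec) for vec in embedding]
    # Single case: wrap the flat vector so callers always get a list of vectors.
    return [list(embedding)]
```

Returning a list of vectors in both cases lets downstream code treat single and batch calls uniformly.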
Merge Readiness
The pull request is in excellent shape and significantly enhances the project's embedding capabilities. The code is well-structured, the new feature is thoroughly documented, and backward compatibility has been maintained. I believe these changes are ready for merging after any standard final checks by the maintainers. As an AI assistant, I am not authorized to approve pull requests, so please ensure it undergoes the necessary human review and approval process.
Fix the conflicts and simplify it.
Thank you for your feedback. I've resolved the conflicts as requested. As for simplifying the PR: supporting multiple embedders while preserving backward compatibility is inherently complex, so the changes are already as streamlined as possible without sacrificing reliability or flexibility.
Summary
This PR adds support for the Google Embedder to the DeepWiki Open project.
Changes Introduced
Motivation
Adding Google Embedder support allows users to leverage Google's embedding capabilities and to run the backend with only a Google API key.
Testing
Related Issues
Checklist
Let me know if you need to customize or expand this further!