Add google embedder support #192

Open: kuarcis wants to merge 7 commits into base: main

@kuarcis kuarcis commented Jun 5, 2025

Summary

This PR adds support for the Google Embedder to the DeepWiki Open project.

Changes Introduced

  • Integrates Google Embedder functionality.
  • Updates relevant modules and configuration to support Google Embedder.
  • Ensures compatibility with existing embedder interfaces and workflows.
  • Adds documentation and usage examples for the new embedder.

Motivation

Adding Google Embedder support lets users leverage Google’s embedding capabilities and run the backend with only a Google API key.

Testing

  • All unit and integration tests pass.
  • Manual testing performed to verify embedding through Google.
  • No breaking changes observed with existing embedders.

Related Issues

  • N/A (please link any related issues if applicable)

Checklist

  • Added/updated unit and integration tests
  • Updated documentation
  • Verified backward compatibility

Let me know if you need to customize or expand this further!

@gemini-code-assist gemini-code-assist bot left a comment

Hello @kuarcis, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello team,

Gemini here, providing a summary for this pull request titled "Add google embedder support".

The primary goal of this PR is to integrate support for Google AI's embedding models into the DeepWiki project. This allows users to utilize Google's embedding capabilities, particularly the text-embedding-004 model, and simplifies setup for users already leveraging Google Gemini models for text generation, as it uses the same GOOGLE_API_KEY.

The changes involve adding a new client for Google embeddings, updating the configuration system to allow selecting the embedder type via the DEEPWIKI_EMBEDDER_TYPE environment variable, modifying the data processing pipeline and RAG components to be aware of the different embedder types, enhancing documentation, and adding comprehensive tests for the new functionality and the overall embedder system.

This enhancement provides users with more flexibility in choosing their embedding provider based on their existing infrastructure and preferences.

Highlights

  • Google Embedder Support: Adds a new GoogleEmbedderClient to integrate with Google AI's embedding API, specifically supporting the text-embedding-004 model for tasks like semantic similarity.
  • Flexible Embedder Selection: Introduces the DEEPWIKI_EMBEDDER_TYPE environment variable (openai, google, ollama) to easily switch between different embedding providers without code changes. Defaults to openai.
  • Updated Data Pipeline & RAG: Modifies core components like token counting, document reading, data pipeline preparation, and the RAG class to correctly handle and utilize the selected embedder type, including backward compatibility for the old is_ollama_embedder parameter.
  • Enhanced Documentation: Updates the README.md to clearly explain how to enable and use Google AI embeddings, including configuration options for .env, Docker, and Docker Compose, and provides a table of available embedder types.
  • Comprehensive Testing: Adds new unit and integration tests (test_google_embedder.py, test_full_integration.py, test_all_embedders.py) to verify the Google embedder client, AdalFlow integration, embedder selection logic, environment variable handling, and data pipeline compatibility across different embedder types.
  • Minor API & Main Updates: Includes a small fix in the API's cache deletion endpoint for authorization code validation and refines the uvicorn reload configuration in main.py for development mode using watchfiles.
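
As a rough illustration of the embedder selection described in the highlights above, the following sketch reads DEEPWIKI_EMBEDDER_TYPE and maps it to a configuration section. It is not the PR's code: the `CONFIG_KEYS` table, the `env` parameter, the fallback for unknown values, and the `embedder_ollama` section name are assumptions made for the example (only `embedder` and `embedder_google` are named in the changelog).

```python
import os

# Hypothetical mapping from embedder type to config section name.
# "embedder" and "embedder_google" appear in the changelog;
# "embedder_ollama" is an assumed name for the Ollama section.
CONFIG_KEYS = {
    "openai": "embedder",
    "google": "embedder_google",
    "ollama": "embedder_ollama",
}


def get_embedder_type(env=None):
    """Return the embedder type from DEEPWIKI_EMBEDDER_TYPE (default: 'openai')."""
    env = os.environ if env is None else env
    value = env.get("DEEPWIKI_EMBEDDER_TYPE", "openai").strip().lower()
    # Assumption: unknown values fall back to the default instead of raising.
    return value if value in CONFIG_KEYS else "openai"


def get_embedder_config_key(env=None):
    """Return the config section name for the selected embedder type."""
    return CONFIG_KEYS[get_embedder_type(env)]
```

Passing a plain dict as `env` makes the lookup easy to exercise without touching the real environment, for example `get_embedder_config_key({"DEEPWIKI_EMBEDDER_TYPE": "google"})`.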

Changelog

  • README.md
    • Added 'Flexible Embeddings' feature to the main list.
    • Added DEEPWIKI_EMBEDDER_TYPE=google to the Quick Start .env example.
    • Added DEEPWIKI_EMBEDDER_TYPE=google to the Manual Setup .env example.
    • Added a new section '🧠 Using Google AI Embeddings' detailing features, how to enable (env var, Docker, Docker Compose), available types table, reasons to use it, and switching instructions.
    • Updated the Environment Variables table to include DEEPWIKI_EMBEDDER_TYPE and clarify API key requirements based on embedder type.
  • api/api.py
    • Modified delete_wiki_cache endpoint to check if authorization_code is not empty before comparing it to WIKI_AUTH_CODE.
  • api/config.py
    • Imported GoogleEmbedderClient.
    • Added EMBEDDER_TYPE environment variable (DEEPWIKI_EMBEDDER_TYPE, default 'openai').
    • Added GoogleEmbedderClient to the CLIENT_CLASSES mapping.
    • Updated load_embedder_config to include the embedder_google key when processing client classes.
    • Modified get_embedder_config to return the configuration based on the EMBEDDER_TYPE environment variable ('google', 'ollama', or default 'embedder').
    • Added is_google_embedder function to check if the current embedder is Google.
    • Added get_embedder_type function to return the current embedder type string ('ollama', 'google', 'openai').
    • Updated the loop in the main config loading section to include embedder_google when updating configs.
  • api/config/embedder.json
    • Added a new configuration section embedder_google specifying GoogleEmbedderClient, batch_size, and model_kwargs (text-embedding-004, SEMANTIC_SIMILARITY).
  • api/data_pipeline.py
    • Modified count_tokens to accept optional embedder_type parameter (with backward compatibility for is_ollama_embedder) and use it to determine the encoding.
    • Modified read_all_documents to accept optional embedder_type parameter (with backward compatibility for is_ollama_embedder) and pass it to count_tokens.
    • Modified prepare_data_pipeline to accept optional embedder_type parameter (with backward compatibility for is_ollama_embedder) and use it to select the embedder and the appropriate document processor (OllamaDocumentProcessor for ollama, ToEmbeddings for others including google).
  • api/google_embedder_client.py
    • Added a new file implementing GoogleEmbedderClient inheriting from adalflow.core.model_client.ModelClient.
    • Includes methods for initializing the client with GOOGLE_API_KEY, parsing Google AI embedding responses, converting inputs to API kwargs (handling single and batch), and calling the Google AI embedding API (genai.embed_content).
    • Adds backoff for API calls.
    • Notes the lack of async support in the current Google AI Python client.
  • api/main.py
    • Removed unused uvicorn import at the top.
    • Added configuration for watchfiles logger to show file paths in development.
    • Implemented watchfiles monkey patch to specifically watch api subdirectories (excluding logs) and .py files in the api root during development reload.
    • Updated uvicorn.run call to include reload_excludes for logs, __pycache__, and .pyc files when reload is enabled.
  • api/rag.py
    • Modified RAG class initialization to use api.config.get_embedder_type() to determine the embedder type and pass it to get_embedder.
    • Updated prepare_retriever method to pass the detected embedder_type to prepare_database.
  • api/tools/embedder.py
    • Modified get_embedder function to accept embedder_type and use_google_embedder (legacy) parameters.
    • Updated logic to select the embedder configuration based on embedder_type, legacy parameters, or auto-detection via api.config.get_embedder_type().
    • Added logic to set the batch_size attribute on the returned adal.Embedder instance if it's present in the configuration.
  • tests/README.md
    • Added mention of Google AI embedder tests.
    • Updated Environment Variables section to include DEEPWIKI_EMBEDDER_TYPE requirement for Google tests.
    • Added test_google_embedder.py and test_google_embedder_fix.py to the Unit Tests section.
    • Added test_full_integration.py to the Integration Tests section.
    • Updated descriptions for test categories.
    • Added troubleshooting tips for API Key Issues and Server Dependencies.
  • tests/__init__.py
    • Added a comment.
  • tests/api/__init__.py
    • Added a comment.
  • tests/integration/__init__.py
    • Added a comment.
  • tests/integration/test_full_integration.py
    • Added a new file with integration tests specifically for Google AI embeddings, including tests for configuration loading, embedder selection, and environment variable handling.
  • tests/run_tests.py
    • Updated the test runner script to include the new test directories (unit, integration, api).
    • Added check_environment function to verify required API keys (GOOGLE_API_KEY, OPENAI_API_KEY) and dependencies (adalflow, google-generativeai, requests).
    • Modified run_tests to iterate through specified directories and run test_*.py files.
    • Added argument parsing for --unit, --integration, --api, and --check-env.
  • tests/unit/__init__.py
    • Added a comment.
  • tests/unit/test_all_embedders.py
    • Added a new file with comprehensive unit tests for the embedder system.
    • Includes tests for configuration loading, embedder type detection (is_ollama_embedder, is_google_embedder, get_embedder_type), get_embedder_config, get_embedder factory function (with explicit types, legacy params, and auto-detection), direct client tests (GoogleEmbedderClient, OpenAIClient via AdalFlow), data pipeline functions (count_tokens, prepare_data_pipeline), RAG integration, and environment variable handling.
    • Uses a simple custom test runner.
  • tests/unit/test_google_embedder.py
    • Added a new file with unit tests specifically for the GoogleEmbedderClient.
    • Includes tests for the client's call and parse_embedding_response methods for both single and batch embeddings, and tests its integration with the AdalFlow Embedder and ToEmbeddings components.
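
The changelog entry for api/google_embedder_client.py mentions converting single and batch inputs to API kwargs and parsing the embedding response. A minimal sketch of that idea follows, under stated assumptions: the function names `to_embed_kwargs` and `parse_embeddings` are hypothetical, the `model` and `task_type` key names are assumed, and the response is assumed to be a dict whose `embedding` key holds one vector for a single input or a list of vectors for a batch. The actual client in the PR may be structured differently.

```python
from typing import Any, Dict, List, Sequence, Union


def to_embed_kwargs(
    inputs: Union[str, Sequence[str]], model_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Build keyword arguments for an embedding call from a string or a batch."""
    if isinstance(inputs, str):
        content: Union[str, List[str]] = inputs
    else:
        content = list(inputs)
        if not all(isinstance(item, str) for item in content):
            raise TypeError("every batch item must be a string")
    kwargs = dict(model_kwargs)  # copy so the caller's config is not mutated
    kwargs["content"] = content
    return kwargs


def parse_embeddings(response: Dict[str, Any]) -> List[List[float]]:
    """Normalize a single or batch response into a list of vectors."""
    embedding = response["embedding"]
    # Assumed shapes: a batch holds a list of vectors, a single input one flat vector.
    if embedding and isinstance(embedding[0], (list, tuple)):
        return [list(vector) for vector in embedding]
    return [list(embedding)]
```

With the model settings named in the changelog, a call might look like `to_embed_kwargs("hello", {"model": "text-embedding-004", "task_type": "SEMANTIC_SIMILARITY"})`. Returning a list of vectors in both cases keeps downstream code from branching on input shape.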
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Google's vectors now arrive,
Making DeepWiki feel alive.
With keys in place,
We embed with grace,
And watch our RAG thrive.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively introduces support for Google Embedders, which is a valuable addition to the DeepWiki Open project. The refactoring from boolean flags to a string-based embedder_type for selecting embedders is a significant improvement for maintainability and future extensibility.

The new GoogleEmbedderClient is well-implemented, and the configuration changes are clear and appropriate. The updates to the README.md are particularly commendable, providing comprehensive documentation for the new feature, including setup, benefits, and switching between embedders. The backward compatibility for existing embedder configurations has also been thoughtfully handled.

Overall, the changes are well-executed and enhance the flexibility of the embedding system. The inclusion of new tests and the statement that they pass is also a positive indicator of quality.

Summary of Findings

  • Refactoring for Embedder Type: The refactoring across multiple files (e.g., api/config.py, api/data_pipeline.py, api/rag.py, api/tools/embedder.py) to use a string-based embedder_type instead of boolean flags (like is_ollama_embedder) is a well-executed improvement. This makes the system more extensible for adding new embedder types in the future and improves code clarity.
  • GoogleEmbedderClient Implementation: The new api/google_embedder_client.py provides a solid integration with the Google AI Embeddings API, correctly handling API key configuration, input conversion for single and batch embeddings, and response parsing. The use of backoff for API calls is also a good practice.
  • Documentation (README.md): The updates to README.md are comprehensive and very helpful for users. The new section on 'Using Google AI Embeddings' clearly explains its features, setup, and benefits. The updated environment variable table and API key requirements are also much clearer.
  • Configuration Management: The changes in api/config.py and api/config/embedder.json correctly incorporate the Google embedder, allowing for easy configuration and selection via the DEEPWIKI_EMBEDDER_TYPE environment variable.
  • Backward Compatibility: The handling of backward compatibility for functions previously using is_ollama_embedder by introducing logic to interpret it in the context of the new embedder_type system is well done, ensuring a smoother transition.
  • Minor: Unused Parameter in get_embedder: In api/tools/embedder.py, the get_embedder function signature was updated to include use_google_embedder: bool = False. While this doesn't cause issues, this specific parameter is not utilized by any call sites within the changes of this PR. This was noted but not commented on due to review settings (low severity).
  • Minor: Potentially Unreachable Code in GoogleEmbedderClient: In api/google_embedder_client.py, the parse_embedding_response method includes a condition elif hasattr(response, 'embeddings'):. Given the standard responses from the Google AI SDK's embed_content method, this specific branch might be unreachable. This was noted but not commented on due to review settings (low severity).
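
The backward-compatibility point in the findings above can be pictured with a small sketch. The helper name `resolve_embedder_type` and its precedence rules are assumptions for illustration, not the PR's actual implementation: an explicit `embedder_type` wins, the legacy `is_ollama_embedder` flag is consulted next, and otherwise a default applies.

```python
from typing import Optional


def resolve_embedder_type(
    embedder_type: Optional[str] = None,
    is_ollama_embedder: Optional[bool] = None,
    default: str = "openai",
) -> str:
    """Pick an embedder type, honoring the legacy boolean flag if it is given."""
    if embedder_type is not None:
        # The new string-based parameter takes precedence.
        return embedder_type
    if is_ollama_embedder is not None:
        # Legacy callers only distinguished Ollama from everything else.
        return "ollama" if is_ollama_embedder else default
    # Assumption: the real code may auto-detect from configuration here.
    return default
```

Existing call sites that pass only `is_ollama_embedder=True` keep working under this scheme, while new call sites can pass `embedder_type="google"` directly.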

Merge Readiness

The pull request is in excellent shape and significantly enhances the project's embedding capabilities. The code is well-structured, the new feature is thoroughly documented, and backward compatibility has been maintained. I believe these changes are ready for merging after any standard final checks by the maintainers. As an AI assistant, I am not authorized to approve pull requests, so please ensure it undergoes the necessary human review and approval process.

@sng-asyncfunc
Contributor

Fix the conflicts and simplify it.

@kuarcis
Author

kuarcis commented Jun 8, 2025

Thank you for your feedback. I’ve resolved the conflicts as requested. As for simplifying the PR: because it has to support multiple embedders while preserving backward compatibility, the changes are already as streamlined as possible without sacrificing reliability or flexibility.
