Cassandra database tool #13423

Merged
merged 16 commits into from
May 22, 2024
Changes from 9 commits
396 changes: 396 additions & 0 deletions docs/docs/examples/tools/casssandra.ipynb

Large diffs are not rendered by default.

153 changes: 153 additions & 0 deletions llama-index-integrations/tools/llama-index-tools-cassandra/.gitignore
@@ -0,0 +1,153 @@
llama_index/_static
.DS_Store
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
bin/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
etc/
include/
lib/
lib64/
parts/
sdist/
share/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
.ruff_cache

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints
notebooks/

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
pyvenv.cfg

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# Jetbrains
.idea
modules/
*.swp

# VsCode
.vscode

# pipenv
Pipfile
Pipfile.lock

# pyright
pyrightconfig.json
@@ -0,0 +1 @@
python_sources()
@@ -0,0 +1,17 @@
GIT_ROOT ?= $(shell git rev-parse --show-toplevel)

help: ## Show all Makefile targets.
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[33m%-30s\033[0m %s\n", $$1, $$2}'

format: ## Run code autoformatters (black).
	pre-commit install
	git ls-files | xargs pre-commit run black --files

lint: ## Run linters: pre-commit (black, ruff, codespell) and mypy
	pre-commit install && git ls-files | xargs pre-commit run --show-diff-on-failure --files

test: ## Run tests via pytest.
	pytest tests

watch-docs: ## Build and watch documentation.
	sphinx-autobuild docs/ docs/_build/html --open-browser --watch $(GIT_ROOT)/llama_index/
@@ -0,0 +1,41 @@
# Cassandra Database Tools

## Overview

The Cassandra Database Tools project helps AI engineers integrate Large Language Models (LLMs) with Apache Cassandra® data. It provides optimized, safe interactions with Cassandra databases and supports deployments such as Apache Cassandra®, DataStax Enterprise™, and DataStax Astra™.

## Key Features

- **Fast Data Access:** Optimized queries ensure most operations complete in milliseconds.
- **Schema Introspection:** Enhances the reasoning capabilities of LLMs by providing detailed schema information.
- **Compatibility:** Supports various Cassandra deployments, ensuring wide applicability.
- **Safety Measures:** Limits operations to SELECT queries and schema introspection to prioritize data integrity.
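
The SELECT-only restriction listed above can be illustrated with a small guard. This is a minimal sketch, not the toolkit's actual implementation: the `is_read_only` helper and its rules are assumptions made for illustration.

```python
import re

# Hypothetical helper for illustration; not part of the toolkit's API.
_SELECT_RE = re.compile(r"^\s*SELECT\b", re.IGNORECASE)


def is_read_only(cql: str) -> bool:
    """Return True only when the input is a single SELECT statement."""
    # Drop surrounding whitespace and one or more trailing semicolons.
    statement = cql.strip().rstrip(";").strip()
    # Any remaining semicolon suggests a second statement, so reject it.
    if ";" in statement:
        return False
    return bool(_SELECT_RE.match(statement))


print(is_read_only("SELECT * FROM demo.users WHERE id = 1;"))
print(is_read_only("DROP TABLE demo.users"))
```

The check is deliberately conservative: a semicolon inside a string literal would also be rejected, which trades some flexibility for a simpler rule.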

## Installation

Ensure Python is installed, then install the required packages with pip:

```bash
pip install python-dotenv cassio llama-index-tools-cassandra
```

Create a `.env` file for the environment variables related to Cassandra and Astra configuration, following the example structure provided in the notebook.

## Environment Setup

- For Cassandra: Configure `CASSANDRA_CONTACT_POINTS`, `CASSANDRA_USERNAME`, `CASSANDRA_PASSWORD`, and `CASSANDRA_KEYSPACE`.
- For DataStax Astra: Set `ASTRA_DB_APPLICATION_TOKEN`, `ASTRA_DB_DATABASE_ID`, and `ASTRA_DB_KEYSPACE`.
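
As a rough illustration, the two variable sets above could be checked before connecting. The sketch below uses only the standard library; the `detect_deployment` and `missing_vars` helpers are hypothetical, and treating every listed variable as required is an assumption (a real deployment may not need all of them).

```python
import os
from typing import List, Mapping, Optional, Sequence

# Variable names taken from the list above.
CASSANDRA_VARS = (
    "CASSANDRA_CONTACT_POINTS",
    "CASSANDRA_USERNAME",
    "CASSANDRA_PASSWORD",
    "CASSANDRA_KEYSPACE",
)
ASTRA_VARS = (
    "ASTRA_DB_APPLICATION_TOKEN",
    "ASTRA_DB_DATABASE_ID",
    "ASTRA_DB_KEYSPACE",
)


def missing_vars(names: Sequence[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in env."""
    return [name for name in names if not env.get(name)]


def detect_deployment(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return "astra" or "cassandra" when a full variable set is present."""
    if not missing_vars(ASTRA_VARS, env):
        return "astra"
    if not missing_vars(CASSANDRA_VARS, env):
        return "cassandra"
    return None


print(detect_deployment() or "no complete configuration found")
```

Passing the environment as a mapping keeps the helpers easy to test without modifying the real process environment.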

## How It Works

The toolkit uses the Cassandra Query Language (CQL) and exposes its operations to the LLM as callable functions. Rather than composing custom queries, the LLM's decision-making can invoke a tool that determines an efficient query path for the user's request and follows querying best practices. The result is faster, more efficient access to Cassandra data for agents.
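
The function-calling flow can be pictured as a lookup from a tool name to a Python callable. The sketch below is only an illustration: the JSON shape of the tool call, the `dispatch` helper, and the stubbed `cassandra_db_schema` function are assumptions, and agent frameworks handle this step internally.

```python
import json
from typing import Callable, Dict


def cassandra_db_schema(keyspace: str) -> str:
    """Stub standing in for the real schema tool (no database involved)."""
    return f"schema for {keyspace}"


# Registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., str]] = {
    "cassandra_db_schema": cassandra_db_schema,
}


def dispatch(tool_call_json: str) -> str:
    """Run the tool named in a JSON tool call and return its text result."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        # Return the error as text so the model can react to it.
        return f"Error: unknown tool {name!r}"
    return TOOLS[name](**call.get("arguments", {}))


print(dispatch('{"name": "cassandra_db_schema", "arguments": {"keyspace": "demo"}}'))
```

Returning errors as text rather than raising mirrors the `_no_throw` naming used by the wrapper methods, so the agent receives a message it can act on.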

## Tools Included

- **`cassandra_db_schema`**: Fetches schema information, essential for the agent’s operation.
- **`cassandra_db_select_table_data`**: Allows selection of data from a specific keyspace and table.
- **`cassandra_db_query`**: An experimental tool that accepts fully formed query strings from the agent.
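
To make the inputs of `cassandra_db_select_table_data` concrete, the sketch below shows one way a keyspace, table, predicate, and limit could be combined into a CQL statement. The `build_select` helper is hypothetical and the wrapper's actual query construction may differ; restricting names to simple unquoted identifiers is also an assumption.

```python
import re
from typing import Optional

# Assumption: only simple unquoted identifiers are accepted.
_IDENTIFIER = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def build_select(
    keyspace: str,
    table: str,
    predicate: Optional[str] = None,
    limit: Optional[int] = None,
) -> str:
    """Assemble a SELECT statement from the tool's four inputs."""
    for name in (keyspace, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Invalid identifier: {name!r}")
    query = f"SELECT * FROM {keyspace}.{table}"
    if predicate:
        # The predicate should filter on the primary key.
        query += f" WHERE {predicate}"
    if limit is not None:
        query += f" LIMIT {int(limit)}"
    return query + ";"


print(build_select("demo", "users", "id = 1", 10))
```

Omitting both the predicate and the limit produces a full table scan, which is the case the tool's description advises avoiding.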

## Example Usage

Initialize `CassandraDatabase` and set up the agent with the tools provided. Query the database by interacting with the agent, as shown in the example [notebook](examples/casssandra.ipynb).
@@ -0,0 +1,4 @@
from llama_index.tools.cassandra.base import CassandraDatabaseToolSpec


__all__ = ["CassandraDatabaseToolSpec"]
@@ -0,0 +1,84 @@
"""Tools for interacting with an Apache Cassandra database."""
from typing import List

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document
from llama_index.core.tools.tool_spec.base import BaseToolSpec


from pydantic import Field

from llama_index.tools.cassandra.cassandra_database_wrapper import (
    CassandraDatabase,
)


class CassandraDatabaseToolSpec(BaseToolSpec, BaseReader):
    """Base tool for interacting with an Apache Cassandra database."""

    db: CassandraDatabase = Field(exclude=True)

    spec_functions = [
        "cassandra_db_query",
        "cassandra_db_schema",
        "cassandra_db_select_table_data",
    ]

    def __init__(self, db: CassandraDatabase) -> None:
        """DB session in context."""
        self.db = db

    def cassandra_db_query(self, query: str) -> List[Document]:
        """Execute a CQL query and return the results as a list of Documents.

        Args:
            query (str): A CQL query to execute.

        Returns:
            List[Document]: A list of Document objects, each containing data from a row.
        """
        documents = []
        result = self.db.run_no_throw(query, fetch="Cursor")
        for row in result:
            doc_str = ", ".join([str(value) for value in row])
            documents.append(Document(text=doc_str))
        return documents

    def cassandra_db_schema(self, keyspace: str) -> List[Document]:
        """Input to this tool is a keyspace name, output is a table description
        of Apache Cassandra tables.
        If the query is not correct, an error message will be returned.
        If an error is returned, report back to the user that the keyspace
        doesn't exist and stop.

        Args:
            keyspace (str): The name of the keyspace for which to return the schema.

        Returns:
            List[Document]: A list of Document objects, each containing a table description.
        """
        return [Document(text=self.db.get_keyspace_tables_str_no_throw(keyspace))]

    def cassandra_db_select_table_data(
        self, keyspace: str, table: str, predicate: str, limit: int
    ) -> List[Document]:
        """Tool for getting data from a table in an Apache Cassandra database.
        Use the WHERE clause to specify the predicate for the query that uses the
        primary key. A blank predicate will return all rows. Avoid this if possible.
        Use the limit to specify the number of rows to return. A blank limit will
        return all rows.

        Args:
            keyspace (str): The name of the keyspace containing the table.
            table (str): The name of the table for which to return data.
            predicate (str): The predicate for the query that uses the primary key.
            limit (int): The maximum number of rows to return.

        Returns:
            List[Document]: A list of Document objects, each containing a row of data.
        """
        return [
            Document(
                text=self.db.get_table_data_no_throw(keyspace, table, predicate, limit)
            )
        ]