# Autoflow

Autoflow is a RAG framework that supports:

- Vector Search Based RAG
- Knowledge Graph Based RAG (also known as GraphRAG)
- Knowledge Base and Document Management

## Installation

In [2]:
%pip install uv

Collecting uv
  Downloading uv-0.6.14-py3-none-macosx_11_0_arm64.whl.metadata (11 kB)
Downloading uv-0.6.14-py3-none-macosx_11_0_arm64.whl (15.1 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m15.1/15.1 MB[0m [31m40.2 MB/s[0m eta [36m0:00:00[0m [36m0:00:01[0m
[?25hInstalling collected packages: uv
Successfully installed uv-0.6.14
Note: you may need to restart the kernel to use updated packages.


In [4]:
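%pip install autoflow-ai==0.0.1.dev33
# Quote the extras specifier: unquoted, zsh treats [experiment] as a glob
# pattern and fails with "no matches found".
%pip install "autoflow-ai[experiment]==0.0.1.dev33"
%pip install ipywidgets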

Collecting autoflow-ai==0.0.1.dev33
  Downloading autoflow_ai-0.0.1.dev33-py3-none-any.whl.metadata (885 bytes)
Downloading autoflow_ai-0.0.1.dev33-py3-none-any.whl (62 kB)
Installing collected packages: autoflow-ai
  Attempting uninstall: autoflow-ai
    Found existing installation: autoflow-ai 0.0.1.dev32
    Uninstalling autoflow-ai-0.0.1.dev32:
      Successfully uninstalled autoflow-ai-0.0.1.dev32
Successfully installed autoflow-ai-0.0.1.dev33
Note: you may need to restart the kernel to use updated packages.
zsh:1: no matches found: autoflow-ai[experiment]==0.0.1.dev33
Note: you may need to restart the kernel to use updated packages.
Collecting ipywidgets
  Using cached ipywidgets-8.1.5-py3-none-any.whl.metadata (2.3 kB)
Collecting widgetsnbextension~=4.0.12 (from ipywidgets)
  Using cached widgetsnbextension-4.0.13-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab-widgets~=3.0.12 (from ipywidgets)
  Using cached jupyterlab_widgets-3.0.13-py3-none-any.whl.met

## Prerequisites

- Go to [tidbcloud.com](https://tidbcloud.com/) or use [tiup playground](https://docs.pingcap.com/tidb/stable/tiup-playground/) to create a free TiDB database cluster.
- Go to the [OpenAI platform](https://platform.openai.com/api-keys) to create your API key.

#### For Jupyter Notebook

Provide the configuration through environment variables or a `.env` file:

In [None]:
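# Copy the example file, then edit .env with your own values, for example:
# $ cat .env
# TIDB_HOST=localhost
# TIDB_PORT=4000
# TIDB_USERNAME=root
# TIDB_PASSWORD=
# TIDB_DATABASE=test
# OPENAI_API_KEY='your_openai_api_key'
# (TIDB_DATABASE is read in the Quickstart below; `test` is only an example
# value, replace it with your own database name.)
%cp .env.example .env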

In [1]:
import os
import dotenv

dotenv.load_dotenv()

True
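
`load_dotenv()` returns `True` as soon as it finds a `.env` file, even if some values are still empty. The following is a minimal sketch, not part of Autoflow, for checking that the variables from the example above are actually set. The helper name `missing_env_vars` and the choice of required names are assumptions made for illustration.

```python
import os

# Variable names taken from the .env example above. TIDB_PASSWORD is left out
# on purpose, because the example shows it as empty for a local cluster.
REQUIRED_VARS = ["TIDB_HOST", "TIDB_PORT", "TIDB_USERNAME", "OPENAI_API_KEY"]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing configuration:", ", ".join(missing))
```

Running this right after `dotenv.load_dotenv()` surfaces a missing value early, before it turns into a connection or authentication error later in the notebook.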

#### For Google Colab

In [None]:
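import os

from google.colab import userdata

# Copy the Colab secrets into environment variables so the cells below can read them.
os.environ["TIDB_HOST"] = userdata.get("TIDB_HOST")
os.environ["TIDB_PORT"] = userdata.get("TIDB_PORT")
os.environ["TIDB_USERNAME"] = userdata.get("TIDB_USERNAME")
os.environ["TIDB_PASSWORD"] = userdata.get("TIDB_PASSWORD")
# TIDB_DATABASE is read in the Quickstart below; add it as a Colab secret as well.
os.environ["TIDB_DATABASE"] = userdata.get("TIDB_DATABASE")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")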

## Quickstart

### Initialize Autoflow

In [2]:
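import os
from autoflow import Autoflow
from autoflow.configs.db import DatabaseConfig
from autoflow.configs.main import Config

# Build the database connection from the environment variables set in the Prerequisites section.
af = Autoflow.from_config(
    config=Config(
        db=DatabaseConfig(
            host=os.getenv("TIDB_HOST"),
            # Fall back to 4000, the port shown in the .env example, when TIDB_PORT is unset;
            # int(None) would otherwise raise a TypeError.
            port=int(os.getenv("TIDB_PORT", "4000")),
            username=os.getenv("TIDB_USERNAME"),
            password=os.getenv("TIDB_PASSWORD"),
            database=os.getenv("TIDB_DATABASE"),
            enable_ssl=False,
        )
    )
)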

### Create knowledge base

In [3]:
from autoflow.configs.knowledge_base import IndexMethod
from autoflow.models.llms import LLM
from autoflow.models.embedding_models import EmbeddingModel

# LLM and embedding model that the knowledge base will use.
llm = LLM("gpt-4o-mini")
embed_model = EmbeddingModel("text-embedding-3-small")

kb = af.create_knowledge_base(
    namespace="quickstart",
    name="New KB",
    description="This is a knowledge base for testing",
    index_methods=[IndexMethod.VECTOR_SEARCH, IndexMethod.KNOWLEDGE_GRAPH],
    llm=llm,
    embedding_model=embed_model,
)
kb.model_dump_json()

'{"namespace":"quickstart","name":"New KB","description":"This is a knowledge base for testing","index_methods":["VECTOR_SEARCH","KNOWLEDGE_GRAPH"],"class_name":"KnowledgeBase"}'
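
`model_dump_json()` returns a plain JSON string, so it can be inspected with the standard library. A small sketch using the exact string printed above (no Autoflow import is needed for this step):

```python
import json

# The string below is the output of kb.model_dump_json() shown above.
kb_json = (
    '{"namespace":"quickstart","name":"New KB",'
    '"description":"This is a knowledge base for testing",'
    '"index_methods":["VECTOR_SEARCH","KNOWLEDGE_GRAPH"],'
    '"class_name":"KnowledgeBase"}'
)

kb_info = json.loads(kb_json)
print(kb_info["name"])
print(kb_info["index_methods"])
```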

### Import documents from files

In [4]:
from autoflow.chunkers.text import TextChunker
from autoflow.configs.chunkers.text import TextChunkerConfig

text_chunker = TextChunker(config=TextChunkerConfig(chunk_size=512, chunk_overlap=20))
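
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker. It is an illustrative sketch only, not Autoflow's implementation: it counts characters, whereas `TextChunker` may count tokens and respect sentence boundaries (in the chunk table further down, chunk 1 begins by repeating the last sentence of chunk 0). The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of at most `chunk_size` characters.

    Each window starts `chunk_size - chunk_overlap` characters after the
    previous one, so consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means that content cut at a chunk boundary still appears intact in at least one neighboring chunk, at the cost of storing some text twice.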

In [7]:
from pandas import DataFrame
from pandas import set_option

set_option("display.max_colwidth", None)

In [8]:
docs = kb.add("./fixtures/tidb-overview.md", chunker=text_chunker)

DataFrame(
    [(c.id, c.text) for c in docs[0].chunks],
    columns=["id", "text"],
)

  Expected `list[float]` but got `ndarray` with value `array([-0.04310741, -0.02...2691607], dtype=float32)` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


| | id | text |
|---|---|---|
| 0 | 019634ed-f4f3-7a72-ac12-60988c00cf1c | `---\ntitle: What is TiDB Self-Managed\nsummary: Learn about the key features and usage scenarios of TiDB.\naliases: ['/docs/dev/key-features/','/tidb/dev/key-features','/docs/dev/overview/']\n---\n\n# What is TiDB Self-Managed\n\n<!-- Localization note for TiDB:\n\n- English: use distributed SQL, and start to emphasize HTAP\n- Chinese: can keep "NewSQL" and emphasize one-stop real-time HTAP ("一栈式实时 HTAP")\n- Japanese: use NewSQL because it is well-recognized\n\n-->\n\n[TiDB](https://github.com/pingcap/tidb) (/'taɪdiːbi:/, "Ti" stands for Titanium) is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. The goal of TiDB is to provide users with a one-stop database solution that covers OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP services. TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.\n\nTiDB Self-Managed is a product option of TiDB, where users or organizations can deploy and manage TiDB on their own infrastructure with complete flexibility. With TiDB Self-Managed, you can enjoy the power of open source, distributed SQL while retaining full control over your environment.\n\nThe following video introduces key features of TiDB.\n\n<iframe width="600" height="450" src="https://www.youtube.com/embed/aWBNNPm21zg?enablejsapi=1" title="Why TiDB?" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>\n\n## Key features\n\n- **Easy horizontal scaling**\n\n The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.\n\n- **Financial-grade high availability**\n\n Data is stored in multiple replicas, and the Multi-Raft protocol is used to obtain the transaction log. A transaction can only be committed when data has been successfully written into the majority of replicas. This guarantees strong consistency and availability when a minority of replicas go down.` |
| 1 | 019634ed-f4f3-7aed-a440-118a911905db | `This guarantees strong consistency and availability when a minority of replicas go down. You can configure the geographic location and number of replicas as needed to meet different disaster tolerance levels.\n\n- **Real-time HTAP**\n\n TiDB provides two storage engines: [TiKV](/tikv-overview.md), a row-based storage engine, and [TiFlash](/tiflash/tiflash-overview.md), a columnar storage engine. \n\n TiFlash uses the Multi-Raft Learner protocol to replicate data from TiKV in real time, ensuring consistent data between the TiKV row-based storage engine and the TiFlash columnar storage engine. TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.\n\n- **Cloud-native distributed database**\n\n TiDB is a distributed database designed for the cloud, providing flexible scalability, reliability, and security on the cloud platform. Users can elastically scale TiDB to meet the requirements of their changing workloads. In TiDB, each piece of data has at least 3 replicas, which can be scheduled in different cloud availability zones to tolerate the outage of a whole data center. [TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-operator-overview) helps manage TiDB on Kubernetes and automates tasks related to operating the TiDB cluster, making TiDB easier to deploy on any cloud that provides managed Kubernetes. [TiDB Cloud](https://pingcap.com/tidb-cloud/), the fully-managed TiDB service, is the easiest, most economical, and most resilient way to unlock the full power of [TiDB in the cloud](https://docs.pingcap.com/tidbcloud/), allowing you to deploy and run TiDB clusters with just a few clicks.\n\n- **Compatible with the MySQL protocol and MySQL ecosystem**\n\n TiDB is compatible with the MySQL protocol, common features of MySQL, and the MySQL ecosystem. To migrate applications to TiDB, you do not need to change a single line of code in many cases, or only need to modify a small amount of code. In addition, TiDB provides a series of [data migration tools](/ecosystem-tool-user-guide.md) to help easily migrate application data into TiDB.` |
| 2 | 019634ed-f4f3-7b02-8321-ea37b59f666b | `## See also\n\n- [TiDB Architecture](/tidb-architecture.md)\n- [TiDB Storage](/tidb-storage.md)\n- [TiDB Computing](/tidb-computing.md)\n- [TiDB Scheduling](/tidb-scheduling.md)` |


### Search Documents

In [9]:
result = kb.search_documents(
    query="What is TiDB?",
    top_k=2,
)

DataFrame(
    [(c.score, c.text) for c in result.chunks],
    columns=["score", "text"],
)

| | score | text |
|---|---|---|
| 0 | 0.687499 | `## See also\n\n- [TiDB Architecture](/tidb-architecture.md)\n- [TiDB Storage](/tidb-storage.md)\n- [TiDB Computing](/tidb-computing.md)\n- [TiDB Scheduling](/tidb-scheduling.md)` |
| 1 | 0.687499 | `## See also\n\n- [TiDB Architecture](/tidb-architecture.md)\n- [TiDB Storage](/tidb-storage.md)\n- [TiDB Computing](/tidb-computing.md)\n- [TiDB Scheduling](/tidb-scheduling.md)` |
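
Conceptually, vector search embeds the query, scores every stored chunk embedding against it, and returns the `top_k` best matches. The sketch below illustrates that idea in plain Python with cosine similarity. It is not Autoflow's implementation: the similarity metric is an assumption, and the three-dimensional vectors and chunk names are made up for the example (real embeddings have far more dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, top_k):
    """Return (score, chunk_id) pairs for the top_k most similar chunks."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings, keyed by a made-up chunk name.
chunk_vecs = {
    "overview": [0.9, 0.1, 0.0],
    "features": [0.5, 0.5, 0.0],
    "see-also": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

results = top_k_chunks(query_vec, chunk_vecs, top_k=2)
for score, chunk_id in results:
    print(f"{score:.3f}  {chunk_id}")
```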


### Search Knowledge Graph

In [16]:
kg = kb.search_knowledge_graph(
    query="What is TiDB?",
)

DataFrame(
    [(r.description, r.score) for r in kg.relationships],
    columns=["relation", "score"],
)

  Expected `list[float]` but got `ndarray` with value `array([-0.05912   , -0.01...0282664], dtype=float32)` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


| | relation | score |
|---|---|---|
| 0 | TiDB provides TiKV as a row-based storage engine to ensure strong consistency and availability. | 6.084352 |
| 1 | TiDB also supports Online Analytical Processing (OLAP) as part of its comprehensive database services. | 5.229995 |
| 2 | TiDB uses TiFlash as its columnar storage engine. | 5.12296 |
| 3 | TiDB uses TiFlash as its columnar storage engine. | 5.12296 |
| 4 | TiDB provides Online Analytical Processing (OLAP) services. | 5.369752 |
| 5 | TiDB provides OLTP services as part of its database solution. | 5.263002 |
| 6 | TiDB supports HTAP workloads, allowing it to handle both transactional and analytical processing. | 5.05726 |
| 7 | TiDB provides services for Online Analytical Processing (OLAP). | 5.380857 |
| 8 | TiDB uses the Multi-Raft protocol to ensure strong consistency and availability through transaction logging. | 4.99767 |
| 9 | TiDB provides the TiFlash columnar storage engine to complement the TiKV storage engine. | 5.156526 |
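
The result above contains a repeated relationship (rows 2 and 3), and the scores are not in descending order. If a ranked, de-duplicated list is needed, one option is to post-process the `(description, score)` pairs. The helper below is a hypothetical sketch that works on plain tuples rather than Autoflow objects; the sample rows are copied from the table above.

```python
def dedupe_and_rank(relationships):
    """Keep the highest score per description, sorted by score descending."""
    best = {}
    for description, score in relationships:
        if description not in best or score > best[description]:
            best[description] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# A few (description, score) rows taken from the output above.
rows = [
    ("TiDB provides TiKV as a row-based storage engine to ensure strong consistency and availability.", 6.084352),
    ("TiDB uses TiFlash as its columnar storage engine.", 5.12296),
    ("TiDB uses TiFlash as its columnar storage engine.", 5.12296),
    ("TiDB provides services for Online Analytical Processing (OLAP).", 5.380857),
]

ranked = dedupe_and_rank(rows)
for description, score in ranked:
    print(f"{score:.6f}  {description}")
```

With the notebook objects, the same pairs could be built as `[(r.description, r.score) for r in kg.relationships]`, which is the expression already used in the cell above.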
