# Autoflow

Autoflow is a RAG framework that supports:

- Vector Search Based RAG
- Knowledge Graph Based RAG (aka. GraphRAG)
- Knowledge Base and Document Management

## Installation

In [1]:
%pip install -q autoflow-ai==0.0.2.dev4 ipywidgets

Note: you may need to restart the kernel to use updated packages.


## Prerequisites

- Go to [tidbcloud.com](https://tidbcloud.com/) or use [tiup playground](https://docs.pingcap.com/tidb/stable/tiup-playground/) to create a free TiDB database cluster.
- Go to the [OpenAI platform](https://platform.openai.com/api-keys) to create your API key.

### For Jupyter Notebook

Provide the configuration through environment variables or a `.env` file:

In [2]:
%%bash

# Do nothing if the .env file already exists.
if [ -f .env ]; then
    exit 0
fi

# Create .env file with your configuration.
cat > .env <<EOF
TIDB_HOST=localhost
TIDB_PORT=4000
TIDB_USERNAME=root
TIDB_PASSWORD=
TIDB_DATABASE=test
OPENAI_API_KEY='your_openai_api_key'
EOF

In [3]:
import os
import dotenv

dotenv.load_dotenv()

True
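
Because the `.env` template ships with a placeholder API key and an empty password, it can help to confirm which settings are actually present before connecting. The following is a minimal sketch, not part of Autoflow: the `missing_settings` helper and the `REQUIRED_SETTINGS` list are hypothetical names, and the assumption that only `TIDB_PASSWORD` may be blank comes from the template above.

```python
import os

# Names taken from the .env template above. TIDB_PASSWORD is left out on the
# assumption that an empty password is valid for a local playground cluster.
REQUIRED_SETTINGS = [
    "TIDB_HOST",
    "TIDB_PORT",
    "TIDB_USERNAME",
    "TIDB_DATABASE",
    "OPENAI_API_KEY",
]


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required names that are unset or blank in `env`."""
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_settings(os.environ)
if missing:
    print("Missing settings:", ", ".join(missing))
```

Running this right after `dotenv.load_dotenv()` surfaces a missing value early instead of as a connection error later.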

In [4]:
from pandas import DataFrame
from pandas import set_option

set_option("display.max_colwidth", None)

## Quickstart

### Init Autoflow

In [5]:
from autoflow import Autoflow
from autoflow.configs.db import DatabaseConfig
from autoflow.configs.main import Config

af = Autoflow.from_config(
    config=Config(
        db=DatabaseConfig(
            host=os.getenv("TIDB_HOST"),
            # Fall back to 4000 (the port in the .env template) so int() never receives None.
            port=int(os.getenv("TIDB_PORT", "4000")),
            username=os.getenv("TIDB_USERNAME"),
            password=os.getenv("TIDB_PASSWORD"),
            database=os.getenv("TIDB_DATABASE"),
            enable_ssl=False,
        )
    )
)

### Create knowledge base

In [6]:
from autoflow.configs.knowledge_base import IndexMethod
from autoflow.models.llms import LLM
from autoflow.models.embedding_models import EmbeddingModel
from IPython.display import JSON

llm = LLM("gpt-4o-mini")
embed_model = EmbeddingModel("text-embedding-3-small")

kb = af.create_knowledge_base(
    namespace="quickstart",
    name="New KB",
    description="This is a knowledge base for testing",
    index_methods=[IndexMethod.VECTOR_SEARCH, IndexMethod.KNOWLEDGE_GRAPH],
    llm=llm,
    embedding_model=embed_model,
)
JSON(kb.model_dump())

<IPython.core.display.JSON object>

In [7]:
# Remove all existing data from the knowledge base.
kb.reset()

### Custom Chunker

In [8]:
from autoflow.chunkers.text import TextChunker
from autoflow.configs.chunkers.text import TextChunkerConfig

text_chunker = TextChunker(config=TextChunkerConfig(chunk_size=256, chunk_overlap=20))
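
To build intuition for how `chunk_size` and `chunk_overlap` interact, here is a rough sliding-window sketch in plain Python. It is not Autoflow's `TextChunker`: the source does not say which unit the sizes are measured in or where the chunker places boundaries, so this example assumes whitespace-separated words, and `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size=256, chunk_overlap=20):
    """Split text into windows of at most chunk_size words, where each window
    repeats the last chunk_overlap words of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, chunk_overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

A larger overlap keeps more shared context between neighboring chunks at the cost of storing and embedding more duplicated text.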

### Import documents from files

In [9]:
docs = kb.add("./fixtures/tidb-overview.md", chunker=text_chunker)

DataFrame(
    [(c.id, c.text) for c in docs[0].chunks],
    columns=["id", "text"],
)

Unnamed: 0,id,text
0,01963849-add8-7f4b-b190-095d7a6ea80b,"---\ntitle: What is TiDB Self-Managed\nsummary: Learn about the key features and usage scenarios of TiDB.\naliases: ['/docs/dev/key-features/','/tidb/dev/key-features','/docs/dev/overview/']\n---\n\n# What is TiDB Self-Managed\n\n<!-- Localization note for TiDB:\n\n- English: use distributed SQL, and start to emphasize HTAP\n- Chinese: can keep ""NewSQL"" and emphasize one-stop real-time HTAP (""一栈式实时 HTAP"")\n- Japanese: use NewSQL because it is well-recognized\n\n-->\n\n[TiDB](https://github.com/pingcap/tidb) (/'taɪdiːbi:/, ""Ti"" stands for Titanium) is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. The goal of TiDB is to provide users with a one-stop database solution that covers OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP services. TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data."
1,01963849-add8-7f8d-93ec-cf76897f6b70,"TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.\n\nTiDB Self-Managed is a product option of TiDB, where users or organizations can deploy and manage TiDB on their own infrastructure with complete flexibility. With TiDB Self-Managed, you can enjoy the power of open source, distributed SQL while retaining full control over your environment.\n\nThe following video introduces key features of TiDB.\n\n<iframe width=""600"" height=""450"" src=""https://www.youtube.com/embed/aWBNNPm21zg?enablejsapi=1"" title=""Why TiDB?"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"" allowfullscreen></iframe>\n\n## Key features\n\n- **Easy horizontal scaling**\n\n The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.\n\n- **Financial-grade high availability**\n\n Data is stored in multiple replicas, and the Multi-Raft protocol is used to obtain the transaction log."
2,01963849-add8-7fa1-b2a5-eee60257a8ae,"A transaction can only be committed when data has been successfully written into the majority of replicas. This guarantees strong consistency and availability when a minority of replicas go down. You can configure the geographic location and number of replicas as needed to meet different disaster tolerance levels.\n\n- **Real-time HTAP**\n\n TiDB provides two storage engines: [TiKV](/tikv-overview.md), a row-based storage engine, and [TiFlash](/tiflash/tiflash-overview.md), a columnar storage engine. \n\n TiFlash uses the Multi-Raft Learner protocol to replicate data from TiKV in real time, ensuring consistent data between the TiKV row-based storage engine and the TiFlash columnar storage engine. TiKV and TiFlash can be deployed on different machines as needed to solve the problem of HTAP resource isolation.\n\n- **Cloud-native distributed database**\n\n TiDB is a distributed database designed for the cloud, providing flexible scalability, reliability, and security on the cloud platform. Users can elastically scale TiDB to meet the requirements of their changing workloads."
3,01963849-add8-7fae-853c-6b98b510c774,"Users can elastically scale TiDB to meet the requirements of their changing workloads. In TiDB, each piece of data has at least 3 replicas, which can be scheduled in different cloud availability zones to tolerate the outage of a whole data center. [TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-operator-overview) helps manage TiDB on Kubernetes and automates tasks related to operating the TiDB cluster, making TiDB easier to deploy on any cloud that provides managed Kubernetes. [TiDB Cloud](https://pingcap.com/tidb-cloud/), the fully-managed TiDB service, is the easiest, most economical, and most resilient way to unlock the full power of [TiDB in the cloud](https://docs.pingcap.com/tidbcloud/), allowing you to deploy and run TiDB clusters with just a few clicks.\n\n- **Compatible with the MySQL protocol and MySQL ecosystem**\n\n TiDB is compatible with the MySQL protocol, common features of MySQL, and the MySQL ecosystem. To migrate applications to TiDB, you do not need to change a single line of code in many cases, or only need to modify a small amount of code."
4,01963849-add8-7fba-979a-b216e5d45c36,"In addition, TiDB provides a series of [data migration tools](/ecosystem-tool-user-guide.md) to help easily migrate application data into TiDB.\n\n## See also\n\n- [TiDB Architecture](/tidb-architecture.md)\n- [TiDB Storage](/tidb-storage.md)\n- [TiDB Computing](/tidb-computing.md)\n- [TiDB Scheduling](/tidb-scheduling.md)"


### Search Documents

In [10]:
result = kb.search_documents(
    query="What is TiDB?",
    top_k=3,
)

DataFrame(
    [(c.text, c.score) for c in result.chunks],
    columns=["text", "score"],
)

Unnamed: 0,text,score
0,"---\ntitle: What is TiDB Self-Managed\nsummary: Learn about the key features and usage scenarios of TiDB.\naliases: ['/docs/dev/key-features/','/tidb/dev/key-features','/docs/dev/overview/']\n---\n\n# What is TiDB Self-Managed\n\n<!-- Localization note for TiDB:\n\n- English: use distributed SQL, and start to emphasize HTAP\n- Chinese: can keep ""NewSQL"" and emphasize one-stop real-time HTAP (""一栈式实时 HTAP"")\n- Japanese: use NewSQL because it is well-recognized\n\n-->\n\n[TiDB](https://github.com/pingcap/tidb) (/'taɪdiːbi:/, ""Ti"" stands for Titanium) is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. The goal of TiDB is to provide users with a one-stop database solution that covers OLTP (Online Transactional Processing), OLAP (Online Analytical Processing), and HTAP services. TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.",0.726047
1,"TiDB is suitable for various use cases that require high availability and strong consistency with large-scale data.\n\nTiDB Self-Managed is a product option of TiDB, where users or organizations can deploy and manage TiDB on their own infrastructure with complete flexibility. With TiDB Self-Managed, you can enjoy the power of open source, distributed SQL while retaining full control over your environment.\n\nThe following video introduces key features of TiDB.\n\n<iframe width=""600"" height=""450"" src=""https://www.youtube.com/embed/aWBNNPm21zg?enablejsapi=1"" title=""Why TiDB?"" frameborder=""0"" allow=""accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"" allowfullscreen></iframe>\n\n## Key features\n\n- **Easy horizontal scaling**\n\n The TiDB architecture design separates computing from storage, letting you scale out or scale in the computing or storage capacity online as needed. The scaling process is transparent to application operations and maintenance staff.\n\n- **Financial-grade high availability**\n\n Data is stored in multiple replicas, and the Multi-Raft protocol is used to obtain the transaction log.",0.669803
2,"Users can elastically scale TiDB to meet the requirements of their changing workloads. In TiDB, each piece of data has at least 3 replicas, which can be scheduled in different cloud availability zones to tolerate the outage of a whole data center. [TiDB Operator](https://docs.pingcap.com/tidb-in-kubernetes/stable/tidb-operator-overview) helps manage TiDB on Kubernetes and automates tasks related to operating the TiDB cluster, making TiDB easier to deploy on any cloud that provides managed Kubernetes. [TiDB Cloud](https://pingcap.com/tidb-cloud/), the fully-managed TiDB service, is the easiest, most economical, and most resilient way to unlock the full power of [TiDB in the cloud](https://docs.pingcap.com/tidbcloud/), allowing you to deploy and run TiDB clusters with just a few clicks.\n\n- **Compatible with the MySQL protocol and MySQL ecosystem**\n\n TiDB is compatible with the MySQL protocol, common features of MySQL, and the MySQL ecosystem. To migrate applications to TiDB, you do not need to change a single line of code in many cases, or only need to modify a small amount of code.",0.656657
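
The scores above rank chunks by how close their embeddings are to the query embedding. The sketch below illustrates that idea with cosine similarity on toy vectors. The source does not state which distance metric Autoflow or TiDB uses here, so cosine similarity is an assumption, and `cosine_similarity` and `top_k` are hypothetical helpers rather than library calls.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunk_vecs, k=3):
    """Return the k (chunk_id, score) pairs with the highest similarity."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2-dimensional "embeddings"; real embedding vectors have many more dimensions.
chunks = {"a": [1.0, 0.0], "b": [0.6, 0.8], "c": [0.0, 1.0]}
print(top_k([1.0, 0.0], chunks, k=2))
```

In a real deployment the ranking would typically run inside the database against a vector index rather than in Python.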


### Search Knowledge Graph

In [11]:
kg = kb.search_knowledge_graph(
    query="What is TiDB?",
)

# Note: the score is computed by a weighted formula.

DataFrame(
    [
        (r.source_entity.name, r.description, r.target_entity.name, r.score)
        for r in kg.relationships
    ],
    columns=["source_entity", "relation", "target_entity", "score"],
)

Unnamed: 0,source_entity,relation,target_entity,score
0,TiDB,TiDB provides TiKV as a row-based storage engine.,TiKV,6.36816
1,TiDB,TiDB has key features such as easy horizontal scaling and financial-grade high availability.,Key features of TiDB,5.795406
2,TiDB,TiDB provides OLAP services as part of its one-stop database solution.,OLAP (Online Analytical Processing),5.739264
3,TiDB,TiDB Storage is a component of TiDB that is essential for data management.,TiDB Storage,5.477352
4,TiDB,TiDB provides TiFlash as a columnar storage engine.,TiFlash,5.024619
5,TiDB,TiDB provides OLTP services as part of its one-stop database solution.,OLTP (Online Transactional Processing),4.989751
6,TiDB,TiDB supports Hybrid Transactional and Analytical Processing (HTAP) workloads.,Hybrid Transactional and Analytical Processing (HTAP),4.939893
7,TiDB,TiDB Computing is a crucial aspect of TiDB that enables data processing and query execution.,TiDB Computing,4.832199
8,TiDB,TiDB Architecture is a related concept that describes the structural design of the TiDB database.,TiDB Architecture,4.772436
9,TiDB,TiDB Cloud is the fully-managed service that allows users to deploy and run TiDB clusters with ease.,TiDB Cloud,4.656846
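
Once the relationships are in hand, a common next step is to keep only the stronger edges and group them by entity. The following sketch does that on a few rows copied from the output above. The `neighbors_above` helper and the `5.0` cutoff are hypothetical, chosen only for illustration; the source does not define what a "good" score is.

```python
from collections import defaultdict

# (source_entity, target_entity, score) rows copied from the output above.
relationships = [
    ("TiDB", "TiKV", 6.36816),
    ("TiDB", "Key features of TiDB", 5.795406),
    ("TiDB", "TiFlash", 5.024619),
    ("TiDB", "TiDB Cloud", 4.656846),
]


def neighbors_above(rows, min_score):
    """Group target entities by source entity, keeping edges scored >= min_score."""
    graph = defaultdict(list)
    for source, target, score in rows:
        if score >= min_score:
            graph[source].append(target)
    return dict(graph)


print(neighbors_above(relationships, min_score=5.0))
# {'TiDB': ['TiKV', 'Key features of TiDB', 'TiFlash']}
```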


### Ask question

In [12]:
from IPython.display import Markdown

res = kb.ask("What is TiDB?")
Markdown(res.message.content)

TiDB is an open-source distributed SQL database designed to support Hybrid Transactional and Analytical Processing (HTAP) workloads. It is compatible with the MySQL protocol, which allows for easy migration of applications without significant code changes. TiDB features horizontal scalability, strong consistency, and high availability, making it suitable for various use cases that require reliable data management with large-scale data.

### Key Features of TiDB:
- **Hybrid Transactional and Analytical Processing (HTAP)**: TiDB can handle both transactional (OLTP) and analytical (OLAP) workloads simultaneously, providing real-time insights and data processing.
- **Easy Horizontal Scaling**: The architecture separates computing from storage, allowing users to scale resources up or down as needed without disrupting application operations.
- **Financial-grade High Availability**: Data is stored in multiple replicas, and the Multi-Raft protocol ensures that transaction logs are maintained for high availability.
- **Compatibility with MySQL**: TiDB supports the MySQL protocol and ecosystem, enabling seamless migration of applications with minimal code changes.
- **Data Migration Tools**: TiDB provides tools to assist users in migrating application data efficiently into the TiDB database.
- **Cloud-native Design**: TiDB is designed to operate in cloud environments, offering features like scalability, reliability, and security.

### Storage Engines:
TiDB utilizes two storage engines:
- **TiKV**: A row-based storage engine that handles data in a row-oriented format.
- **TiFlash**: A columnar storage engine that replicates data from TiKV in real time using the Multi-Raft Learner protocol, ensuring consistent data across both storage engines.

### Deployment Options:
TiDB can be deployed in various ways:
- **TiDB Self-Managed**: Users can deploy and manage TiDB on their own infrastructure, providing complete control over their environment.
- **TiDB Cloud**: A fully-managed service that simplifies the deployment and management of TiDB clusters in the cloud.

Overall, TiDB aims to provide a one-stop database solution that meets the needs of modern applications requiring high availability, strong consistency, and the ability to process large volumes of data efficiently.
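
An answer like the one above is typically produced by placing retrieved context in front of the question before calling the LLM. The sketch below shows that generic RAG pattern. It is not how `kb.ask` is implemented (the source does not describe its prompt), and `build_prompt` and its wording are hypothetical.

```python
def build_prompt(question, chunks, max_chunks=3):
    """Join the top retrieved chunks into a numbered context block,
    then append the question."""
    context = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(chunks[:max_chunks], start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_prompt(
    "What is TiDB?",
    [
        "TiDB is an open-source distributed SQL database.",
        "TiKV is a row-based storage engine.",
    ],
)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which passage supports each statement.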

### Reset the knowledge base

In [13]:
# kb.reset()