Claw - AI Chatbot Maker with TiDB Vector Search

This project is an AI-powered chatbot builder that allows users to create custom chatbots from documentation sites or GitHub repositories using TiDB Vector Search. It was built for the TiDB Cloud Vector Search Hackathon.

Features

Data Ingestion: Import content from documentation websites or GitHub repositories
Vector Search: Leverages TiDB's native vector search capabilities for semantic similarity
Custom Chatbots: Create multiple chatbots for different use cases
Embeddable Widget: Generate a script tag to embed your chatbot on any website
Dashboard: Manage your chatbots through an intuitive interface

Tech Stack

Frontend: Next.js 15, React 19, TypeScript, Tailwind CSS
Backend: Next.js API Routes
Database: TiDB Serverless with Vector Search
Authentication: Clerk
ORM: Drizzle ORM
AI Services: OpenAI API and Google Gemini for embeddings and completions
UI Components: Radix UI, Shadcn UI

TiDB Architecture Overview

Advanced RAG (Retrieval-Augmented Generation) System

graph TB
    subgraph "User Interaction"
        A[User Query] --> B[Chat Interface]
    end
    
    subgraph "Query Processing"
        B --> C[Query Preprocessing]
        C --> D[Embedding Generation<br/>OpenAI text-embedding-3-small<br/>OR Google text-embedding-004]
        D --> E[Query Vector<br/>768 dimensions]
    end
    
    subgraph "TiDB Vector Search Engine"
        E --> F[TiDB Vector Search<br/>VEC_COSINE_DISTANCE]
        F --> G[Similarity Search<br/>Full scan with distance calculation]
        G --> H[Relevant Document Chunks]
    end
    
    subgraph "Context Assembly"
        H --> I[Context Ranking<br/>Similarity Scores]
        I --> J[Top-K Selection<br/>k=3]
        J --> K[Context Window<br/>Token Limit Enforcement]
    end
    
    subgraph "Response Generation"
        K --> L[LLM Prompt Construction<br/>System + Context + Query]
        L --> M[LLM Generation<br/>OpenAI GPT-4o-mini<br/>OR Google Gemini 2.0 Flash]
        M --> N[Response Postprocessing<br/>Markdown Formatting]
        N --> O[Formatted Response]
    end
    
    subgraph "Feedback Loop"
        O --> P[Response Storage<br/>Chat History]
        P --> Q[Performance Metrics]
        Q --> R[Continuous Improvement]
    end
    
    O --> B
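
The context assembly and prompt construction stages above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `RetrievedChunk`, `assembleContext`, and `buildPrompt`, the 2000-token budget, and the 4-characters-per-token estimate are all assumptions made for the example. Only `k = 3` and the "System + Context + Query" layout come from the diagram.

```typescript
// One retrieved chunk plus its cosine distance to the query (lower is closer).
interface RetrievedChunk {
  content: string;
  distance: number;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

// Assumption: a rough estimate of 4 characters per token, not a real tokenizer.
const CHARS_PER_TOKEN = 4;

// Rank by distance, keep the k closest chunks, and stop adding chunks
// once the next one would exceed the (hypothetical) token budget.
function assembleContext(
  chunks: RetrievedChunk[],
  k = 3,
  maxTokens = 2000,
): string[] {
  const ranked = [...chunks].sort((a, b) => a.distance - b.distance).slice(0, k);
  const selected: string[] = [];
  let usedChars = 0;
  for (const chunk of ranked) {
    if (usedChars + chunk.content.length > maxTokens * CHARS_PER_TOKEN) break;
    selected.push(chunk.content);
    usedChars += chunk.content.length;
  }
  return selected;
}

// Build the "System + Context + Query" prompt as a chat-style message list.
function buildPrompt(
  systemPrompt: string,
  context: string[],
  query: string,
): ChatMessage[] {
  return [
    {
      role: "system",
      content: `${systemPrompt}\n\nContext:\n${context.join("\n---\n")}`,
    },
    { role: "user", content: query },
  ];
}
```

Sorting before truncating means the budget is always spent on the closest chunks first, which is one reasonable way to combine top-k selection with a context window limit.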

Comprehensive Data Pipeline

flowchart TB
    subgraph "Data Sources"
        A1[Documentation Websites] --> B[Ingestion Service]
        A2[GitHub Repositories] --> B
        A3[Text Documents] --> B
    end
    
    subgraph "Document Processing Pipeline"
        B --> C[Content Extraction<br/>HTML/PDF Parsing]
        C --> D[Text Cleaning<br/>Noise Removal]
        D --> E[Metadata Extraction<br/>Title, URL, Timestamp]
    end
    
    subgraph "Chunking Strategy"
        E --> F[Sentence-based Splitting<br/>Chunk Size: 1000 chars<br/>Context-aware Overlap]
        F --> G[Chunk Metadata<br/>Source, Position, Hash]
    end
    
    subgraph "Embedding Generation"
        G --> H["Embedding Generation<br/>OpenAI text-embedding-3-small (primary)<br/>Google text-embedding-004 (fallback)"]
        H --> I[Vector Representation<br/>768 dimensions]
    end
    
    subgraph "TiDB Storage Layer"
        I --> J[Vector Storage<br/>Native TiDB VECTOR type]
        J --> K[Vector Table Insertion<br/>vectors_new Table]
        K --> L[Document Table Update<br/>documents Table]
    end
    
    subgraph "Monitoring & Analytics"
        L --> M[Processing Metrics<br/>Throughput, Latency]
        M --> N[Data Quality Checks<br/>Validation, Deduplication]
        N --> O[Performance Dashboard<br/>Real-time Monitoring]
    end
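
The chunking stage in the pipeline above (sentence-based splitting, 1000-character chunks, overlap between neighbours) could look roughly like the sketch below. The function name `chunkBySentence`, the naive punctuation-based sentence split, and the choice to overlap by exactly one sentence are assumptions for illustration; the project's own chunker may behave differently.

```typescript
// Split text into chunks of at most maxChars, breaking only between sentences.
// Assumption: each new chunk repeats the last sentence of the previous chunk
// (when it fits) so that neighbouring chunks share some context.
function chunkBySentence(text: string, maxChars = 1000): string[] {
  // Naive sentence split: runs of non-terminators followed by ., ! or ?.
  const matches: string[] = text.match(/[^.!?]+[.!?]*/g) ?? [];
  const sentences = matches.map((s) => s.trim()).filter((s) => s.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let length = 0;

  for (const sentence of sentences) {
    const needed = sentence.length + (current.length > 0 ? 1 : 0);
    if (current.length > 0 && length + needed > maxChars) {
      chunks.push(current.join(" "));
      // Carry the previous sentence over only if it fits next to the new one.
      const last = current[current.length - 1] ?? "";
      const overlapFits = last.length + 1 + sentence.length <= maxChars;
      current = overlapFits ? [last] : [];
      length = overlapFits ? last.length : 0;
    }
    length += sentence.length + (current.length > 0 ? 1 : 0);
    // A single sentence longer than maxChars is kept whole rather than cut.
    current.push(sentence);
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

For example, `chunkBySentence("One. Two. Three.", 12)` yields two chunks that share the sentence `Two.`.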

Detailed Database Schema with TiDB Features

erDiagram
    USERS ||--o{ CHATBOTS : owns
    CHATBOTS ||--o{ DOCUMENTS : contains
    DOCUMENTS ||--o{ VECTORS_NEW : embeds
    CHATBOTS ||--o{ CHAT_HISTORY : has
    
    USERS {
        string id PK
        string email
        string name
        string image
        timestamp created_at
        timestamp updated_at
    }
    
    CHATBOTS {
        int id PK
        string user_id FK
        string name
        json config
        json script_config
        boolean is_public
        timestamp created_at
        timestamp updated_at
    }
    
    DOCUMENTS {
        int id PK
        int chatbot_id FK
        string url
        longtext content
        string content_type
        json metadata
        timestamp created_at
        timestamp updated_at
    }
    
    VECTORS_NEW {
        int id PK
        int document_id FK
        longtext content
        vector(768) embedding "Native VECTOR type"
        json metadata
        timestamp created_at
    }
    
    CHAT_HISTORY {
        int id PK
        int chatbot_id FK
        string session_id
        text content
        string role
        json metadata
        timestamp created_at
    }

Intelligent AI Agent Workflow

flowchart TB
    subgraph "Input Processing"
        A[User Message] --> B[Message Validation]
        B --> C[Session Context Retrieval]
    end
    
    subgraph "Semantic Analysis"
        C --> D[Query Enhancement]
        D --> E[Embedding Generation]
    end
    
    subgraph "Vector Retrieval System"
        E --> F[TiDB Vector Search]
        F --> G[Similarity Scoring]
        G --> H[Result Ranking]
        H --> I[Top-K Selection]
        I --> J[Context Aggregation]
    end
    
    subgraph "Response Synthesis"
        J --> K[Context Window Construction]
        K --> L[Prompt Engineering]
        L --> M[LLM Generation]
        M --> N[Response Validation]
    end
    
    subgraph "Output Processing"
        N --> O[Response Formatting]
        O --> P[Token Usage Tracking]
        P --> Q[Response Storage]
        Q --> R[Streaming Response]
    end
    
    R --> S[User Interface]

Current Vector Implementation Process

flowchart TB
    subgraph "Document Ingestion"
        A[Document Content] --> B[Text Preprocessing]
        B --> C[Chunking Algorithm]
    end
    
    subgraph "Embedding Pipeline"
        C --> D[Primary Embedding]
        D --> E[Fallback Mechanism]
        E --> F[Vector Validation]
    end
    
    subgraph "Vector Storage"
        F --> G["CAST to VECTOR(768)"]
        G --> H[Vector Table Insertion]
        H --> I[Document Metadata Update]
    end
    
    subgraph "Search Implementation"
        I --> J[Query Embedding]
        J --> K["CAST to VECTOR(768)"]
        K --> L[VEC_COSINE_DISTANCE Search]
        L --> M[Result Filtering]
        M --> N[Result Ranking]
        N --> O[Top-K Selection]
    end
    
    subgraph "Performance Monitoring"
        O --> P[Query Performance]
        P --> Q[System Optimization]
    end
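
The "Primary Embedding, Fallback Mechanism, Vector Validation" steps above might be wired together as in this sketch. The names `EmbedFn`, `validateEmbedding`, and `embedWithFallback` are hypothetical, and the providers are passed in as plain async functions so the example does not assume any particular SDK call; in the real project they would wrap the OpenAI and Google embedding clients.

```typescript
// Any async function that turns text into an embedding vector.
type EmbedFn = (text: string) => Promise<number[]>;

// Matches the VECTOR(768) column described in this README.
const EXPECTED_DIMENSIONS = 768;

// Reject vectors that would not fit the VECTOR(768) column.
function validateEmbedding(vector: number[]): number[] {
  if (vector.length !== EXPECTED_DIMENSIONS) {
    throw new Error(
      `Expected ${EXPECTED_DIMENSIONS} dimensions, got ${vector.length}`,
    );
  }
  if (!vector.every((value) => Number.isFinite(value))) {
    throw new Error("Embedding contains non-finite values");
  }
  return vector;
}

// Try each provider in order (primary first). A provider that throws or
// returns an invalid vector is skipped and the next one is tried.
async function embedWithFallback(
  text: string,
  providers: EmbedFn[],
): Promise<number[]> {
  let lastError: unknown = new Error("No embedding providers configured");
  for (const provider of providers) {
    try {
      return validateEmbedding(await provider(text));
    } catch (error) {
      lastError = error;
    }
  }
  throw lastError;
}
```

Validating inside the loop means a provider that returns the wrong dimensionality is treated the same as one that is unavailable, so a mismatched vector never reaches the database.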

TiDB Vector Search Implementation

Current Vector Storage

CREATE TABLE vectors_new (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  document_id BIGINT,
  content TEXT,
  embedding VECTOR(768),  -- Native VECTOR type
  metadata JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Vector Search Query

SELECT 
  v.id,
  v.document_id as documentId,
  v.content,
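  -- The alias holds a cosine distance: smaller values mean closer matches,
  -- so the ascending ORDER BY below returns the best chunks first.
  VEC_COSINE_DISTANCE(v.embedding, CAST('[1, 2, 3, ...]' AS VECTOR(768))) as similarity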
FROM vectors_new v
INNER JOIN documents d ON v.document_id = d.id
WHERE d.chatbot_id = 1
ORDER BY similarity
LIMIT 3;
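
From application code, the query above needs the embedding serialized into the bracketed text form that `CAST(... AS VECTOR(768))` parses. The sketch below shows one way to build a parameterized version of it. `toVectorLiteral`, `buildSearchQuery`, and the `SqlQuery` shape are hypothetical helpers, and the `?` placeholder style is an assumption about the database driver in use.

```typescript
interface SqlQuery {
  sql: string;
  params: (string | number)[];
}

// Serialize an embedding as '[x,y,z]', the text form passed to CAST.
function toVectorLiteral(embedding: number[]): string {
  if (!embedding.every((value) => Number.isFinite(value))) {
    throw new Error("Embedding contains non-finite values");
  }
  return `[${embedding.join(",")}]`;
}

// Build the search statement with placeholders instead of inlined values.
function buildSearchQuery(
  chatbotId: number,
  embedding: number[],
  limit = 3,
): SqlQuery {
  const sql = `
    SELECT v.id, v.document_id AS documentId, v.content,
           VEC_COSINE_DISTANCE(v.embedding, CAST(? AS VECTOR(768))) AS similarity
    FROM vectors_new v
    INNER JOIN documents d ON v.document_id = d.id
    WHERE d.chatbot_id = ?
    ORDER BY similarity
    LIMIT ?`;
  return { sql, params: [toVectorLiteral(embedding), chatbotId, limit] };
}
```

Passing the vector as a bound parameter rather than concatenating it into the SQL string keeps the statement text constant across queries.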

Current Implementation Details

Vector Dimensions: 768 (OpenAI text-embedding-3-small reduced from its default 1536 via the dimensions parameter, matching the 768-dimensional output of Google text-embedding-004)
Storage Format: Native VECTOR(768) type
Search Method: VEC_COSINE_DISTANCE with CAST for proper type conversion
Indexing: No native HNSW index yet
Future Plans: Implement HNSW indexing for improved search performance
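
To make the search method concrete: cosine distance is one minus the cosine similarity of two vectors, so 0 means the vectors point the same way and larger values mean less similar. The sketch below computes the same quantity in TypeScript for illustration only; the function name `cosineDistance` and the decision to throw on zero vectors are choices made for this example, and the database performs this calculation itself at query time.

```typescript
// Cosine distance = 1 - (a . b) / (|a| * |b|).
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  // The angle to a zero vector is undefined; this sketch treats it as an error.
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine distance is undefined for zero vectors");
  }
  return 1 - dot / Math.sqrt(normA * normB);
}
```

Without an index, every stored vector for the chatbot is compared against the query this way, which is why the cost of a search grows with the number of chunks.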

Architecture

Database Schema

The application uses TiDB with the following tables:

users: User accounts from Clerk
chatbots: User-created chatbots with configuration
documents: Ingested content from websites or repositories
vectors_new: Vector embeddings of document chunks (using native VECTOR type)
chat_history: Conversation history for each chatbot

Vector Search Implementation

Uses TiDB's native VEC_COSINE_DISTANCE function for similarity calculations
Implements fallback mechanisms for both embedding generation and LLM responses
Properly uses CAST to convert arrays to VECTOR type for storage and search
Plans to implement native HNSW indexes for improved performance

Multi-Cluster Placement

Implements data placement policies for performance optimization
Distributes data across regions for disaster recovery

Getting Started

Prerequisites

Node.js 18+
Bun (used by the install, migration, and dev commands in the Installation section)
TiDB Cloud account
OpenAI API key (primary) OR Google Generative AI API key (fallback)
Clerk account
GitHub Personal Access Token (for GitHub repository ingestion - optional)

Installation

Clone the repository:
```
git clone <repository-url>
cd ai-chatbot-maker
```

Install dependencies:
```
bun install
```
Set up environment variables:
```
cp .env.example .env
```
Update the .env file with your:
- Clerk credentials
- TiDB connection string
- OpenAI API key (or Google Generative AI API key)
- GitHub Personal Access Token (optional, for GitHub repository ingestion)
Run database migrations:
```
bun run db:push
```
Start the development server:
```
bun run dev
```

Usage

Create a Chatbot: Use the dashboard to create a new chatbot
Ingest Data: Add documentation websites or GitHub repositories
Configure Widget: Customize the appearance and behavior of your chatbot widget
Embed on Website: Copy the generated script tag and add it to your website

API Endpoints

POST /api/chatbots - Create a new chatbot
GET /api/chatbots - List all chatbots for the user
GET /api/chatbots/[id] - Get details for a specific chatbot
PUT /api/chatbots/[id] - Update a chatbot
DELETE /api/chatbots/[id] - Delete a chatbot
POST /api/chatbots/[id]/ingest - Ingest a document for a chatbot
GET /api/chatbots/[id]/ingest - List documents for a chatbot
POST /api/chatbots/[id]/chat - Chat with a chatbot
GET /api/chatbots/[id]/chat - Get chat history
GET /api/chatbots/[id]/script - Get the embed script for a chatbot
PUT /api/chatbots/[id]/script - Update script configuration
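
As an illustration of calling the chat endpoint listed above, the sketch below builds the request for `POST /api/chatbots/[id]/chat`. The helper name `buildChatRequest` and the JSON body fields `message` and `sessionId` are assumptions, since the request schema is not documented here; check the route handler for the real field names.

```typescript
interface ChatRequest {
  url: string;
  init: {
    method: string;
    headers: Record<string, string>;
    body: string;
  };
}

// Build the URL and fetch options for a chat call against a given deployment.
function buildChatRequest(
  baseUrl: string,
  chatbotId: number,
  message: string,
  sessionId: string,
): ChatRequest {
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: `${baseUrl.replace(/\/+$/, "")}/api/chatbots/${chatbotId}/chat`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Hypothetical body shape, not confirmed by this README.
      body: JSON.stringify({ message, sessionId }),
    },
  };
}
```

The result can be passed straight to `fetch(url, init)`. Keeping request construction separate from the network call makes it easy to check the URL and payload without a running server.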

Hackathon Requirements Fulfillment

This project fulfills all hackathon requirements:

Integrates TiDB Vector: Uses TiDB's native vector search capabilities with VEC_COSINE_DISTANCE
Combines at least two components:
- Data ingestion (websites and GitHub repositories)
- Vector search with TiDB
- LLM calls for chat completions
Real-world problem solving: Enables businesses to create AI assistants from their existing documentation

Future Enhancements

Implement HNSW indexing for improved search performance
Add GitHub repository integration
Add support for PDF and other document types
Improve chunking algorithms for better context
Add analytics dashboard
Implement domain verification for widget security
Add more customization options for the chat widget

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.
