This project is an AI-powered chatbot builder that allows users to create custom chatbots from documentation sites or GitHub repositories using TiDB Vector Search. It was built for the TiDB Cloud Vector Search Hackathon.
- Data Ingestion: Import content from documentation websites or GitHub repositories
- Vector Search: Leverages TiDB's native vector search capabilities for semantic similarity
- Custom Chatbots: Create multiple chatbots for different use cases
- Embeddable Widget: Generate a script tag to embed your chatbot on any website
- Dashboard: Manage your chatbots through an intuitive interface
- Frontend: Next.js 15, React 19, TypeScript, Tailwind CSS
- Backend: Next.js API Routes
- Database: TiDB Serverless with Vector Search
- Authentication: Clerk
- ORM: Drizzle ORM
- AI Services: OpenAI API and Google Gemini for embeddings and completions
- UI Components: Radix UI, Shadcn UI
```mermaid
graph TB
    subgraph "User Interaction"
        A[User Query] --> B[Chat Interface]
    end
    subgraph "Query Processing"
        B --> C[Query Preprocessing]
        C --> D[Embedding Generation<br/>OpenAI text-embedding-3-small<br/>OR Google text-embedding-004]
        D --> E[Query Vector<br/>768 dimensions]
    end
    subgraph "TiDB Vector Search Engine"
        E --> F[TiDB Vector Search<br/>VEC_COSINE_DISTANCE]
        F --> G[Similarity Search<br/>Full scan with distance calculation]
        G --> H[Relevant Document Chunks]
    end
    subgraph "Context Assembly"
        H --> I[Context Ranking<br/>Similarity Scores]
        I --> J[Top-K Selection<br/>k=3]
        J --> K[Context Window<br/>Token Limit Enforcement]
    end
    subgraph "Response Generation"
        K --> L[LLM Prompt Construction<br/>System + Context + Query]
        L --> M[LLM Generation<br/>OpenAI GPT-4o-mini<br/>OR Google Gemini 2.0 Flash]
        M --> N[Response Postprocessing<br/>Markdown Formatting]
        N --> O[Formatted Response]
    end
    subgraph "Feedback Loop"
        O --> P[Response Storage<br/>Chat History]
        P --> Q[Performance Metrics]
        Q --> R[Continuous Improvement]
    end
    O --> B
```
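The retrieval steps in the diagram above (cosine-distance scoring plus Top-K selection, k=3) can be sketched in plain TypeScript. This is an in-memory illustration of what TiDB's `VEC_COSINE_DISTANCE` plus `ORDER BY … LIMIT k` computes server-side; the `Chunk` shape and function names are illustrative, not the project's actual code.

```typescript
interface Chunk {
  id: number;
  content: string;
  embedding: number[];
}

// Cosine distance = 1 - cosine similarity; lower means more similar.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Top-K selection, mirroring `ORDER BY distance LIMIT k`.
function topK(query: number[], chunks: Chunk[], k = 3): Chunk[] {
  return [...chunks]
    .sort(
      (x, y) =>
        cosineDistance(query, x.embedding) - cosineDistance(query, y.embedding)
    )
    .slice(0, k);
}
```

In production the scoring runs inside TiDB rather than in application code; this sketch only shows the ranking semantics.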
```mermaid
flowchart TB
    subgraph "Data Sources"
        A1[Documentation Websites] --> B[Ingestion Service]
        A2[GitHub Repositories] --> B
        A3[Text Documents] --> B
    end
    subgraph "Document Processing Pipeline"
        B --> C[Content Extraction<br/>HTML/PDF Parsing]
        C --> D[Text Cleaning<br/>Noise Removal]
        D --> E[Metadata Extraction<br/>Title, URL, Timestamp]
    end
    subgraph "Chunking Strategy"
        E --> F[Sentence-based Splitting<br/>Chunk Size: 1000 chars<br/>Context-aware Overlap]
        F --> G[Chunk Metadata<br/>Source, Position, Hash]
    end
    subgraph "Embedding Generation"
        G --> H[Embedding Generation<br/>OpenAI text-embedding-3-small (primary)<br/>Google text-embedding-004 (fallback)]
        H --> I[Vector Representation<br/>768 dimensions]
    end
    subgraph "TiDB Storage Layer"
        I --> J[Vector Storage<br/>Native TiDB VECTOR type]
        J --> K[Vector Table Insertion<br/>vectors_new Table]
        K --> L[Document Table Update<br/>documents Table]
    end
    subgraph "Monitoring & Analytics"
        L --> M[Processing Metrics<br/>Throughput, Latency]
        M --> N[Data Quality Checks<br/>Validation, Deduplication]
        N --> O[Performance Dashboard<br/>Real-time Monitoring]
    end
```
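The chunking stage above (sentence-based splitting into ~1000-character chunks with context-aware overlap) can be sketched as follows. The naive sentence regex and the one-sentence overlap are illustrative assumptions, not the project's exact algorithm.

```typescript
function chunkText(text: string, maxChars = 1000, overlapSentences = 1): string[] {
  // Naive sentence split on terminal punctuation followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);
  const chunks: string[] = [];
  let current: string[] = [];
  let length = 0;

  for (const sentence of sentences) {
    if (length + sentence.length > maxChars && current.length > 0) {
      chunks.push(current.join(" "));
      // Context-aware overlap: carry the tail sentence(s) into the next chunk.
      current = current.slice(-overlapSentences);
      length = current.join(" ").length;
    }
    current.push(sentence);
    length += sentence.length + 1;
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

Sentence-aware boundaries keep each chunk self-contained, and the overlap prevents an answer from being split across two chunks with neither containing enough context.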
```mermaid
erDiagram
    USERS ||--o{ CHATBOTS : owns
    CHATBOTS ||--o{ DOCUMENTS : contains
    DOCUMENTS ||--o{ VECTORS_NEW : embeds
    CHATBOTS ||--o{ CHAT_HISTORY : has
    USERS {
        string id PK
        string email
        string name
        string image
        timestamp created_at
        timestamp updated_at
    }
    CHATBOTS {
        int id PK
        string user_id FK
        string name
        json config
        json script_config
        boolean is_public
        timestamp created_at
        timestamp updated_at
    }
    DOCUMENTS {
        int id PK
        int chatbot_id FK
        string url
        longtext content
        string content_type
        json metadata
        timestamp created_at
        timestamp updated_at
    }
    VECTORS_NEW {
        int id PK
        int document_id FK
        longtext content
        vector embedding "native VECTOR(768) type"
        json metadata
        timestamp created_at
    }
    CHAT_HISTORY {
        int id PK
        int chatbot_id FK
        string session_id
        text content
        string role
        json metadata
        timestamp created_at
    }
```
```mermaid
flowchart TB
    subgraph "Input Processing"
        A[User Message] --> B[Message Validation]
        B --> C[Session Context Retrieval]
    end
    subgraph "Semantic Analysis"
        C --> D[Query Enhancement]
        D --> E[Embedding Generation]
    end
    subgraph "Vector Retrieval System"
        E --> F[TiDB Vector Search]
        F --> G[Similarity Scoring]
        G --> H[Result Ranking]
        H --> I[Top-K Selection]
        I --> J[Context Aggregation]
    end
    subgraph "Response Synthesis"
        J --> K[Context Window Construction]
        K --> L[Prompt Engineering]
        L --> M[LLM Generation]
        M --> N[Response Validation]
    end
    subgraph "Output Processing"
        N --> O[Response Formatting]
        O --> P[Token Usage Tracking]
        P --> Q[Response Storage]
        Q --> R[Streaming Response]
    end
    R --> S[User Interface]
```
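The "Context Window Construction" and "Prompt Engineering" steps above can be sketched as a single function: retrieved chunks are appended to the prompt until a budget is exhausted, then system prompt, context, and question are assembled. The character budget here is a stand-in for real token counting, and the prompt template is illustrative.

```typescript
function buildPrompt(
  system: string,
  chunks: string[], // assumed pre-sorted by relevance (best first)
  query: string,
  maxContextChars = 4000
): string {
  const context: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    // Enforce the context-window limit: stop before exceeding the budget.
    if (used + chunk.length > maxContextChars) break;
    context.push(chunk);
    used += chunk.length;
  }
  return `${system}\n\nContext:\n${context.join("\n---\n")}\n\nQuestion: ${query}`;
}
```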
```mermaid
flowchart TB
    subgraph "Document Ingestion"
        A[Document Content] --> B[Text Preprocessing]
        B --> C[Chunking Algorithm]
    end
    subgraph "Embedding Pipeline"
        C --> D[Primary Embedding]
        D --> E[Fallback Mechanism]
        E --> F[Vector Validation]
    end
    subgraph "Vector Storage"
        F --> G[CAST to VECTOR(768)]
        G --> H[Vector Table Insertion]
        H --> I[Document Metadata Update]
    end
    subgraph "Search Implementation"
        I --> J[Query Embedding]
        J --> K[CAST to VECTOR(768)]
        K --> L[VEC_COSINE_DISTANCE Search]
        L --> M[Result Filtering]
        M --> N[Result Ranking]
        N --> O[Top-K Selection]
    end
    subgraph "Performance Monitoring"
        O --> P[Query Performance]
        P --> Q[System Optimization]
    end
```
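The "Fallback Mechanism" step above applies the same primary/fallback pattern used for both embeddings (OpenAI, then Google) and completions. A minimal sketch, with placeholder provider functions standing in for the real OpenAI and Gemini calls:

```typescript
type Embedder = (text: string) => Promise<number[]>;

// Try the primary provider; on any failure (rate limit, outage), fall back.
async function embedWithFallback(
  text: string,
  primary: Embedder,
  fallback: Embedder
): Promise<number[]> {
  try {
    return await primary(text);
  } catch (err) {
    console.warn("primary embedder failed, using fallback:", err);
    return fallback(text);
  }
}
```

Note that mixing providers only works because both models are used at the same dimensionality (768); vectors from different dimensions would not be comparable in one table.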
```sql
CREATE TABLE vectors_new (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  document_id BIGINT,
  content TEXT,
  embedding VECTOR<FLOAT>(768), -- Native VECTOR type
  metadata JSON,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

```sql
SELECT
  v.id,
  v.document_id AS documentId,
  v.content,
  VEC_COSINE_DISTANCE(v.embedding, CAST('[1, 2, 3, ...]' AS VECTOR(768))) AS distance
FROM vectors_new v
INNER JOIN documents d ON v.document_id = d.id
WHERE d.chatbot_id = 1
ORDER BY distance -- ascending: lower cosine distance = more similar
LIMIT 3;
```

- Vector Dimensions: 768 (OpenAI text-embedding-3-small with the `dimensions` parameter)
- Storage Format: Native VECTOR(768) type
- Search Method: VEC_COSINE_DISTANCE with CAST for proper type conversion
- Indexing: No native HNSW index yet
- Future Plans: Implement HNSW indexing for improved search performance
The application uses TiDB with the following tables:
- users: User accounts from Clerk
- chatbots: User-created chatbots with configuration
- documents: Ingested content from websites or repositories
- vectors_new: Vector embeddings of document chunks (using native VECTOR type)
- chat_history: Conversation history for each chatbot
- Uses TiDB's native `VEC_COSINE_DISTANCE` function for similarity calculations
- Implements fallback mechanisms for both embedding generation and LLM responses
- Properly uses CAST to convert arrays to VECTOR type for storage and search
- Plans to implement native HNSW indexes for improved performance
- Implements data placement policies for performance optimization
- Distributes data across regions for disaster recovery
- Node.js 18+
- TiDB Cloud account
- OpenAI API key (primary) OR Google Generative AI API key (fallback)
- Clerk account
- GitHub Personal Access Token (for GitHub repository ingestion - optional)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd ai-chatbot-maker
  ```

- Install dependencies:

  ```bash
  bun install
  ```

- Set up environment variables:

  ```bash
  cp .env.example .env
  ```

  Update the `.env` file with your:
  - Clerk credentials
  - TiDB connection string
  - OpenAI API key (or Google Generative AI API key)
  - GitHub Personal Access Token (optional, for GitHub repository ingestion)

- Run database migrations:

  ```bash
  bun run db:push
  ```

- Start the development server:

  ```bash
  bun run dev
  ```
- Create a Chatbot: Use the dashboard to create a new chatbot
- Ingest Data: Add documentation websites or GitHub repositories
- Configure Widget: Customize the appearance and behavior of your chatbot widget
- Embed on Website: Copy the generated script tag and add it to your website
- `POST /api/chatbots` - Create a new chatbot
- `GET /api/chatbots` - List all chatbots for the user
- `GET /api/chatbots/[id]` - Get details for a specific chatbot
- `PUT /api/chatbots/[id]` - Update a chatbot
- `DELETE /api/chatbots/[id]` - Delete a chatbot
- `POST /api/chatbots/[id]/ingest` - Ingest a document for a chatbot
- `GET /api/chatbots/[id]/ingest` - List documents for a chatbot
- `POST /api/chatbots/[id]/chat` - Chat with a chatbot
- `GET /api/chatbots/[id]/chat` - Get chat history
- `GET /api/chatbots/[id]/script` - Get the embed script for a chatbot
- `PUT /api/chatbots/[id]/script` - Update script configuration
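As a usage sketch, a client could call the chat endpoint like this. The request body shape (`sessionId`, `message`) is an assumption for illustration, not a documented API contract.

```typescript
// Hypothetical helper that prepares a fetch() call to the chat endpoint.
function buildChatRequest(chatbotId: number, sessionId: string, message: string) {
  return {
    url: `/api/chatbots/${chatbotId}/chat`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ sessionId, message }),
    },
  };
}

// Usage:
//   const { url, init } = buildChatRequest(7, "sess-1", "What is TiDB?");
//   const res = await fetch(url, init);
```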
This project fulfills all hackathon requirements:
- Integrates TiDB Vector: Uses TiDB's native vector search capabilities with VEC_COSINE_DISTANCE
- Combines at least two components:
- Data ingestion (websites and GitHub repositories)
- Vector search with TiDB
- LLM calls for chat completions
- Real-world problem solving: Enables businesses to create AI assistants from their existing documentation
- Implement HNSW indexing for improved search performance
- Add GitHub repository integration
- Add support for PDF and other document types
- Improve chunking algorithms for better context
- Add analytics dashboard
- Implement domain verification for widget security
- Add more customization options for the chat widget
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License.