A hybrid Go + Python backend system for intelligent prompt compression and LLM generation, designed to demonstrate modern microservice architecture patterns and efficient ML model serving.
TokenFlow uses a multi-language microservice architecture that leverages the strengths of both Go and Python:
- Go Backend: High-performance HTTP server acting as the main API gateway and orchestrator
- Python Backend: Specialized ML model server for prompt compression and classification
- External LLM APIs: Integration with Groq for powerful text generation
- ClassificationService: Dedicated to prompt classification using Python's Hugging Face Transformers
- CompressionService: Handles text compression via Python's LLMLingua library
- ModelService: Manages direct LLM generation and model selection logic
Each service follows Go's standard constructor pattern:
```go
type ModelService struct { /* fields */ }           // Struct definition
func NewModelService() *ModelService { /* init */ } // Constructor function
```

Services are initialized once and injected into handlers:

```go
modelService := services.NewModelService()
classificationService := services.NewClassificationService()
modelHandler := handlers.NewModelHandler(modelService, classificationService)
```

- Framework: Gin (high-performance HTTP router)
- HTTP Client: Standard `net/http` for Python service calls
- Streaming: Server-Sent Events (SSE) for real-time LLM output
- Configuration: Environment variables with `godotenv`
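A minimal sketch of how these pieces come together at startup, assuming the package layout shown in the project structure below (the module path and handler method names are illustrative, not taken from the repo):

```go
package main

import (
	"log"

	"github.com/gin-gonic/gin"
	"github.com/joho/godotenv"

	// Hypothetical import paths; the real module path lives in backend_go/go.mod.
	"tokenflow/backend_go/pkg/handlers"
	"tokenflow/backend_go/pkg/services"
)

func main() {
	// Load environment variables (e.g. API keys) from .env.
	if err := godotenv.Load(); err != nil {
		log.Println("no .env file found, falling back to process env")
	}

	// Construct services once and inject them into handlers.
	modelService := services.NewModelService()
	classificationService := services.NewClassificationService()
	modelHandler := handlers.NewModelHandler(modelService, classificationService)

	r := gin.Default()
	api := r.Group("/api")
	api.POST("/generate", modelHandler.Generate)          // handler method names assumed
	api.POST("/models/select", modelHandler.SelectAndStream)
	api.GET("/model-rankings", modelHandler.ModelRankings)

	log.Fatal(r.Run(":8000")) // the Go backend listens on :8000
}
```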
- Framework: FastAPI (async Python web framework)
- Package Management: uv (fast Python package installer and resolver)
- Environment Management: Hermit (reproducible development environments)
- ML Libraries:
  - LLMLingua (prompt compression)
  - Hugging Face Transformers (classification)
- Models:
  - BART-MNLI for zero-shot classification
  - BERT multilingual for compression
- Groq: High-performance LLM inference
- OpenRouter: Alternative LLM provider (configured but not actively used)
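Groq's API is OpenAI-compatible, so the Go side can consume its streaming endpoint with plain `net/http`. A hedged sketch of what a Groq call might look like (the model id, the `GROQ_API_KEY` variable name, and the error handling are assumptions):

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func streamGroq(prompt string) error {
	body, _ := json.Marshal(map[string]any{
		"model":  "llama-3.1-8b-instant", // example model id; real selection is dynamic
		"stream": true,
		"messages": []map[string]string{{"role": "user", "content": prompt}},
	})
	req, err := http.NewRequest(http.MethodPost,
		"https://api.groq.com/openai/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GROQ_API_KEY"))
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	// The streaming body is a sequence of "data: {...}" SSE lines.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := strings.TrimPrefix(scanner.Text(), "data: ")
		if line == "" || line == "[DONE]" {
			continue
		}
		var chunk struct {
			Choices []struct {
				Delta struct {
					Content string `json:"content"`
				} `json:"delta"`
			} `json:"choices"`
		}
		if json.Unmarshal([]byte(line), &chunk) == nil && len(chunk.Choices) > 0 {
			fmt.Print(chunk.Choices[0].Delta.Content) // emit each token as it arrives
		}
	}
	return scanner.Err()
}
```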
1. Frontend → `POST /api/compress` → Go CompressionHandler
2. Go CompressionService → `POST http://localhost:8001/compress` → Python FastAPI (see the sketch below)
3. Python LLMLingua processing → JSON response → Go
4. Go → Frontend (compressed text + metrics)
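A sketch of step 2, the Go-to-Python hop. The request fields match the Compression Request example further down; the response field names are assumptions about the Python service's JSON:

```go
package services

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// CompressionService proxies compression requests to the Python backend.
type CompressionService struct {
	pythonURL string // e.g. http://localhost:8001
	client    *http.Client
}

func NewCompressionService() *CompressionService {
	return &CompressionService{
		pythonURL: "http://localhost:8001",
		client:    &http.Client{Timeout: 30 * time.Second},
	}
}

// CompressRequest mirrors the JSON accepted by the Python /compress endpoint.
type CompressRequest struct {
	Text  string  `json:"text"`
	Ratio float64 `json:"ratio"`
}

// CompressResponse fields are assumptions about the Python service's reply.
type CompressResponse struct {
	CompressedText   string `json:"compressed_text"`
	OriginalTokens   int    `json:"original_tokens"`
	CompressedTokens int    `json:"compressed_tokens"`
}

func (s *CompressionService) Compress(text string, ratio float64) (*CompressResponse, error) {
	body, err := json.Marshal(CompressRequest{Text: text, Ratio: ratio})
	if err != nil {
		return nil, err
	}
	resp, err := s.client.Post(s.pythonURL+"/compress", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("python compress service returned %d", resp.StatusCode)
	}
	var out CompressResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return &out, nil
}
```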
1. Frontend → `POST /api/classify` → Go ClassificationHandler
2. Go ClassificationService → `POST http://localhost:8001/classify` → Python FastAPI (payload types sketched below)
3. Python Transformers pipeline → JSON response → Go
4. Go → Frontend (categories + confidence scores)
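The classification hop is shaped the same way; only the payload types differ. The request fields mirror the Classification Request example below, while the response fields assume the labels/scores shape of a Transformers zero-shot pipeline:

```go
// ClassifyRequest mirrors the JSON accepted by the Python /classify endpoint.
type ClassifyRequest struct {
	Prompt             string   `json:"prompt"`
	PossibleCategories []string `json:"possible_categories"`
	MultiLabel         bool     `json:"multi_label"`
}

// ClassifyResponse assumes the zero-shot pipeline's result shape:
// labels sorted by descending confidence, with matching scores.
type ClassifyResponse struct {
	Labels []string  `json:"labels"`
	Scores []float64 `json:"scores"`
}
```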
1. Frontend → `POST /api/generate` → Go ModelHandler
2. Go ModelService → Groq API (streaming)
3. Groq → Go (token-by-token via SSE) → Frontend (forwarding loop sketched below)
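On the final Go → Frontend leg, the handler relays each token as an SSE event. A minimal Gin sketch of that forwarding loop, written as a method on the (assumed) ModelHandler type; the event names match the sequence diagram below, and the channel wiring is simplified:

```go
func (h *ModelHandler) streamToClient(c *gin.Context, streamChan <-chan string, errorChan <-chan error) {
	// SSE headers so the browser keeps the connection open.
	c.Writer.Header().Set("Content-Type", "text/event-stream")
	c.Writer.Header().Set("Cache-Control", "no-cache")
	c.Writer.Header().Set("Connection", "keep-alive")

	for {
		select {
		case chunk, ok := <-streamChan:
			if !ok { // producer closed the channel: stream finished
				c.SSEvent("end_stream", "done")
				return
			}
			c.SSEvent("text_chunk", chunk)
			c.Writer.Flush() // push the event to the client immediately
		case err := <-errorChan:
			c.SSEvent("error", err.Error())
			return
		case <-c.Request.Context().Done(): // client disconnected
			return
		}
	}
}
```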
```
tokenflow/
├── frontend/ # React/Next.js UI
├── backend_go/ # Go API Gateway
│ ├── cmd/server/ # Application entry point
│ ├── pkg/
│ │ ├── handlers/ # HTTP request handlers
│ │ ├── services/ # Business logic services
│ │ ├── models/ # Data structures
│ │ └── config/ # Configuration management
│ └── go.mod
├── backend_python/ # Python ML Server
│ └── src/tokenflow_python/
│ └── main.py # FastAPI application
└── README.md # This file
```
- Go 1.24+
- Python 3.9+ (managed via Hermit)
- uv (Python package manager - installed via Hermit)
- Hermit (for reproducible development environment)
- Environment variables configured in `.env`
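An example `.env` for the Go backend (the variable names are assumptions inferred from the providers above; `pkg/config` is the authoritative source):

```
# backend_go/.env — example only; variable names assumed
GROQ_API_KEY=your_groq_api_key
OPENROUTER_API_KEY=your_openrouter_api_key   # configured but not actively used
PYTHON_BACKEND_URL=http://localhost:8001
```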
1. Install Hermit (if not already installed):

   ```bash
   curl -fsSL https://github.com/cashapp/hermit/releases/latest/download/install.sh | bash
   ```

2. Activate Hermit environment (from project root):

   ```bash
   . bin/activate-hermit
   ```

   This automatically installs the correct Python version and uv.
Why Hermit + uv?
- Hermit ensures reproducible development environments across machines
- uv provides fast Python package installation and dependency resolution
- Together they eliminate "works on my machine" issues and speed up development
1. Python Backend (Terminal 1):

   ```bash
   cd backend_python
   uv run --active python -m src.tokenflow_python.main  # Runs on http://localhost:8001
   ```

2. Go Backend (Terminal 2):

   ```bash
   cd backend_go
   go run cmd/server/main.go  # Runs on http://localhost:8000
   ```

3. Frontend (Terminal 3):

   ```bash
   cd frontend
   npm run dev  # Runs on http://localhost:3000
   ```
- `POST /api/compress` - Compress text using LLMLingua (ratio-based)
- `POST /api/classify` - Classify prompts into categories
- `POST /api/generate` - Stream LLM generation from a specified model
- `POST /api/models/select` - Auto-select and stream from the best model
- `GET /api/model-rankings` - Available model information
Compression Request:

```json
{
  "text": "Long text to be compressed...",
  "ratio": 0.5
}
```

Classification Request:

```json
{
  "prompt": "Write a function to calculate fibonacci",
  "possible_categories": ["reasoning", "function-calling", "text-to-text"],
  "multi_label": false
}
```

The following diagram illustrates the request lifecycle for the `/api/models/select` endpoint, which automatically selects the best model based on prompt classification and streams the LLM response:
```mermaid
sequenceDiagram
participant Client
participant ModelHandler
participant ClassificationService
participant ModelService
participant External APIs
Client->>+ModelHandler: POST /api/models/select
Note over Client,ModelHandler: JSON: {prompt, temperature, topP, maxTokens, possibleCategories}
ModelHandler->>ModelHandler: Setup SSE headers & create Go channels (streamChan, errorChan)
par Background Goroutine
ModelHandler->>+ClassificationService: ClassifyPrompt() - calls Python service
ClassificationService-->>-ModelHandler: {topCategory, confidenceScore}
ModelHandler->>+ModelService: SelectModelForCategory(topCategory)
ModelService-->>-ModelHandler: selectedModel
ModelHandler->>Client: SSE metadata event
Note over ModelHandler,Client: Go struct serialized to JSON
ModelHandler->>+ModelService: StreamCompletion() - writes to streamChan
ModelService->>+External APIs: Groq/OpenRouter API calls
loop Stream chunks via Go channel
External APIs-->>ModelService: Response chunks
ModelService->>ModelHandler: streamChan <- chunk
end
External APIs-->>-ModelService: Stream complete
ModelService-->>-ModelHandler: Close streamChan
and Main Goroutine
loop Listen on Go channels with select{}
alt Data from streamChan
ModelHandler->>Client: SSE text_chunk event
else Error from errorChan
ModelHandler->>Client: SSE error event
else Client context cancelled
ModelHandler->>ModelHandler: Cleanup & return
end
end
end
ModelHandler->>Client: SSE end_stream event
ModelHandler-->>-Client: Close connection
```
1. Client Request: Client sends `POST /api/models/select` with a JSON payload containing prompt, temperature, topP, maxTokens, and possibleCategories

2. Handler Setup: ModelHandler validates the request and sets up Server-Sent Events (SSE) headers for the streaming response

3. Go Channel Creation: ModelHandler creates two Go channels:
   - `streamChan` - for streaming LLM response chunks
   - `errorChan` - for error handling

4. Concurrent Processing: Two goroutines run in parallel.

   Background Goroutine (ML Processing):
   - Classification: Calls ClassificationService, which makes an HTTP request to the Python backend
   - Model Selection: Based on the classification result, selects the appropriate LLM model
   - Metadata Event: Sends an SSE event with the selected model and classification info
   - Stream Setup: Calls ModelService.StreamCompletion(), which writes to `streamChan`
   - LLM API Calls: Makes streaming requests to the Groq/OpenRouter APIs
   - Channel Operations: Each response chunk is sent via `streamChan <- chunk`
   - Cleanup: Closes `streamChan` when the stream completes

   Main Goroutine (SSE Management):
   - Channel Listening: Uses Go's `select {}` statement to listen on multiple channels
   - Stream Forwarding: Forwards data from `streamChan` as SSE text_chunk events
   - Error Handling: Forwards errors from `errorChan` as SSE error events
   - Context Cancellation: Handles client disconnection and cleanup

5. Stream Completion: Sends the final SSE end_stream event and closes the connection (a sketch of the full handler follows this list)
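Putting the steps together, the handler body might look roughly like this. The `GenerateRequest` type and the service method signatures are assumptions; the metadata SSE event from step 4 is omitted to keep the sketch race-free, and `streamToClient` is the forwarding loop sketched in the generation flow above:

```go
func (h *ModelHandler) SelectAndStream(c *gin.Context) {
	// GenerateRequest is an assumed type bundling prompt, temperature,
	// topP, maxTokens, and possibleCategories.
	var req GenerateRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}

	streamChan := make(chan string)
	errorChan := make(chan error, 1) // buffered so the producer never blocks on an error

	// Background goroutine: classification, model selection, then streaming.
	go func() {
		defer close(streamChan)

		// Classification: HTTP call to the Python backend.
		result, err := h.classificationService.ClassifyPrompt(req.Prompt, req.PossibleCategories)
		if err != nil {
			errorChan <- err
			return
		}

		// Model selection based on the top category.
		model := h.modelService.SelectModelForCategory(result.TopCategory)

		// StreamCompletion writes each LLM chunk into streamChan and
		// respects the request context for client cancellation.
		if err := h.modelService.StreamCompletion(c.Request.Context(), model, req, streamChan); err != nil {
			errorChan <- err
		}
	}()

	// Main goroutine: forward chunks and errors to the client as SSE events.
	h.streamToClient(c, streamChan, errorChan)
}
```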
Key points:
- Concurrent Architecture: The main HTTP goroutine handles SSE streaming while a background goroutine processes ML tasks
- Service Integration: Classification service calls Python backend, model service integrates with Groq API
- Real-time Streaming: Server-Sent Events provide immediate feedback to client
- Error Handling: Errors are streamed as events, allowing graceful frontend handling
- Resource Management: Proper channel cleanup and context cancellation
Why the hybrid Go + Python split?
- Performance: Go handles concurrent HTTP requests efficiently
- Specialization: Python excels at ML model serving
- Modularity: Services can be developed, deployed, and scaled independently
- Type Safety: Go's strong typing catches errors at compile time
- Maintainability: Clear separation of concerns between services
- Scalability: Each service can be horizontally scaled based on demand
Possible future enhancements:
- Service discovery for dynamic Python service URLs
- Health checks and circuit breakers
- Metrics and observability (Prometheus/Grafana)
- Testing and CI/CD
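As a flavor of the health-check item, a minimal liveness endpoint that also probes the Python backend might look like this (the endpoint paths and response shape are hypothetical, not part of the current codebase):

```go
func healthHandler(c *gin.Context) {
	status := gin.H{"go": "ok"}

	// Probe the Python ML server; degrade gracefully if it is down.
	resp, err := http.Get("http://localhost:8001/health") // endpoint name assumed
	if err == nil {
		defer resp.Body.Close()
	}
	if err != nil || resp.StatusCode != http.StatusOK {
		status["python"] = "unreachable"
		c.JSON(http.StatusServiceUnavailable, status)
		return
	}

	status["python"] = "ok"
	c.JSON(http.StatusOK, status)
}
```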
