An end-to-end AI-powered Data Engineering pipeline. Upload any CSV — watch it flow through Bronze → Silver → Gold.
DataFlow Studio is a full-stack web application that runs any CSV file through a professional medallion architecture pipeline (Bronze → Silver → Gold), powered by Claude AI for intelligent data analysis and production-ready code generation.
It was built as a portfolio project to demonstrate end-to-end data engineering skills interactively — not just on a resume.
🔗 dataflow-studio-seven.vercel.app
Upload any CSV and watch the full pipeline run. Try it with:
- Sales data
- Financial records
- Survey responses
- Any messy real-world dataset
```
Upload CSV → Bronze Layer  → Silver Layer   → Gold Layer → Export
             (raw profiling)  (AI transforms)  (analytics)  (code + CSV)
```
- Schema detection (column names + data types)
- Null value analysis (count + percentage per column)
- Duplicate row detection
- Data quality scoring
- Live terminal animation showing what's happening
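The profiling steps above can be sketched with pandas (already a backend dependency). This is an illustrative sketch, not the project's actual code: the function name, result shape, and the toy quality-score formula are all invented here.

```python
import io
import pandas as pd

def profile_csv(raw: str) -> dict:
    """Bronze-style profile: schema, nulls, duplicates, and a toy quality score."""
    df = pd.read_csv(io.StringIO(raw))
    nulls = df.isna().sum()  # null count per column
    return {
        # schema detection: column name -> inferred dtype
        "schema": {c: str(t) for c, t in df.dtypes.items()},
        # null analysis: count and percentage per column
        "nulls": {c: {"count": int(nulls[c]),
                      "pct": round(100 * nulls[c] / len(df), 1)}
                  for c in df.columns},
        "duplicate_rows": int(df.duplicated().sum()),
        # toy quality score: penalize null cells and duplicate rows
        "quality_score": round(100 * (1 - nulls.sum() / df.size)
                               - 5 * int(df.duplicated().sum()), 1),
    }
```

A duplicate row or a null-heavy column immediately drags the score down, which is the kind of signal the Bronze terminal surfaces.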
- Intelligent issue detection
- AI-generated data cleaning transformations
- Real PySpark code for every transformation
- Equivalent SQL for every transformation
- PySpark ↔ SQL tab switcher
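Each Silver transformation is rendered in both dialects. As a rough illustration of that equivalence, here is the SQL counterpart of a `fillna` step, executed against SQLite purely for demonstration (the app only renders SQL text; the table and column names here are invented):

```python
import sqlite3

# The SQL analogue of df.fillna({"postal_code": "UNKNOWN"}):
# COALESCE replaces NULLs with a sentinel value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, postal_code TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, "94103"), (2, None)])

rows = conn.execute(
    "SELECT order_id, COALESCE(postal_code, 'UNKNOWN') AS postal_code "
    "FROM sales ORDER BY order_id"
).fetchall()
```

Row 2's NULL postal code comes back as `'UNKNOWN'`, matching what the PySpark tab would produce.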
- Auto-generated KPIs with color coding
- Charts: Bar, Line, Pie (auto-selected by Claude based on data)
- Business insights — actionable observations from the data
- Production-ready PySpark Gold layer code
- Cleaned CSV — Silver-transformed data ready for analysis
- Pipeline Code (.py) — Complete Bronze→Silver→Gold PySpark file
- Pipeline Report (.txt) — Full summary of schema, insights, and transformations
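The KPI cards above can be approximated server-side with pandas. A minimal sketch, with the function name and the KPI set invented for illustration (the app derives the real KPI set per dataset via Claude):

```python
import pandas as pd

def compute_kpis(df: pd.DataFrame, value_col: str) -> dict:
    # Simple numeric KPIs over one value column; labeling and
    # color coding happen downstream in the Gold dashboard.
    return {
        "total": float(df[value_col].sum()),
        "average": round(float(df[value_col].mean()), 2),
        "row_count": int(len(df)),
    }

kpis = compute_kpis(pd.DataFrame({"Revenue": [100.0, 250.0, 150.0]}), "Revenue")
```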
| Layer | Technology |
|---|---|
| Frontend | React 18, Tailwind CSS v3, Recharts, React Dropzone |
| Backend | Python, FastAPI, Uvicorn, Pandas |
| AI | Anthropic Claude API (claude-sonnet-4-20250514) |
| Deploy — Frontend | Vercel |
| Deploy — Backend | Render |
| Version Control | GitHub |
```
dataflow-studio/
├── backend/
│   ├── main.py              # FastAPI app, CORS, router setup
│   ├── requirements.txt     # Python dependencies
│   ├── Procfile             # Render start command
│   └── routes/
│       ├── bronze.py        # CSV upload + data profiling
│       ├── silver.py        # Claude AI transformations
│       └── gold.py          # Analytics + download endpoints
└── frontend/
    └── src/
        ├── App.js           # Stage machine (landing→upload→bronze→silver→gold)
        ├── components/
        │   └── Layout.jsx   # Shared navbar + background
        └── pages/
            ├── Landing.jsx  # Hero page with animated pipeline diagram
            ├── Upload.jsx   # CSV drag and drop
            ├── Bronze.jsx   # Data profile + live terminal
            ├── Silver.jsx   # AI transforms + code tabs
            └── Gold.jsx     # Dashboard + charts + exports
```
- Node.js 18+
- Python 3.10+
- An Anthropic API key
```bash
# Clone the repo
git clone https://github.com/rohith-66/dataflow-studio.git
cd dataflow-studio/backend

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Create the .env file
echo "ANTHROPIC_API_KEY=your_key_here" > .env

# Start the backend
uvicorn main:app --reload
```

Backend runs at: http://localhost:8000
API docs at: http://localhost:8000/docs
```bash
cd ../frontend

# Install dependencies
npm install

# Start the frontend
npm start
```

Frontend runs at: http://localhost:3000
| Method | Endpoint | Description |
|---|---|---|
| POST | /bronze/upload | Upload CSV, returns schema + quality profile |
| POST | /silver/transform | Claude AI analysis + transformations |
| POST | /gold/analyze | KPIs, charts, insights generation |
| POST | /gold/download/csv | Download cleaned CSV |
| POST | /gold/download/code | Download PySpark pipeline (.py) |
| POST | /gold/download/report | Download pipeline report (.txt) |
DataFlow Studio generates real, production-ready PySpark code like this:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, when, regexp_extract

spark = SparkSession.builder \
    .appName("DataFlowStudio_sales_data") \
    .getOrCreate()

# Bronze — raw ingestion
df = spark.read.csv("s3://your-bucket/raw/sales_data.csv", header=True, inferSchema=True)

# Silver — AI-generated transformations
# Fix date columns
df = df.withColumn("Order Date", to_date(col("Order Date"), "dd/MM/yyyy"))

# Handle nulls
df = df.fillna({"Postal Code": "UNKNOWN"})

# Fix data types
df = df.withColumn("Revenue", col("Revenue").cast("double"))

df.write.mode("overwrite").parquet("s3://your-bucket/silver/")

# Gold — aggregations
df_gold = df.groupBy("Region").agg({"Revenue": "sum", "Order ID": "count"})
df_gold.write.mode("overwrite").parquet("s3://your-bucket/gold/")
```

Why a stage machine instead of React Router?
The pipeline is a strict linear flow. `useState` with a stage string is simpler, requires no URL configuration, and makes the data flow between stages explicit.
Why FastAPI over Flask or Django?
FastAPI's automatic request validation via Pydantic, async support, and auto-generated Swagger docs make it ideal for a data API. The /docs endpoint is great for testing during development.
Why Claude API for code generation?
Claude understands context — it reads the actual column names, data types, and quality issues before generating transformations. This produces dataset-specific PySpark code, not generic templates.
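A hedged sketch of what feeding that context to the model can look like. The prompt wording and function name here are assumptions for illustration, not the project's actual prompt:

```python
import json

def build_silver_prompt(profile: dict) -> str:
    # Embed the real schema and quality issues so the generated
    # PySpark is specific to this dataset, not a generic template.
    return (
        "You are a data engineer. Given this CSV profile:\n"
        + json.dumps(profile, indent=2)
        + "\nReturn ONLY a JSON object with keys 'issues' and 'transformations', "
          "where each transformation includes equivalent PySpark and SQL."
    )

prompt = build_silver_prompt({"columns": {"Revenue": "object"},
                              "nulls": {"Revenue": 3}})
```

Insisting on JSON-only output is what makes the response machine-parseable by the Silver route.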
Why max_tokens: 4000 for Silver?
Early builds used 2000 tokens, which caused JSON truncation on larger datasets. 4000 tokens ensures complete JSON responses, and an auto-retry mechanism handles any remaining parse failures.
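The auto-retry idea can be sketched as a small wrapper around JSON parsing. The real logic in `silver.py` may differ; the function names here are invented:

```python
import json

def call_with_json_retry(call_model, max_attempts: int = 3) -> dict:
    """call_model() returns raw model text; retry until it parses as JSON."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # truncated or invalid JSON: call the model again
    raise RuntimeError(f"no valid JSON after {max_attempts} attempts") from last_error
```

A truncated response fails `json.loads` cleanly, so the wrapper simply asks the model again instead of surfacing a parse error to the user.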
This project showcases the following data engineering skills:
- Medallion Architecture — Bronze/Silver/Gold lakehouse pattern
- Data Quality — Null detection, type inference, duplicate analysis
- PySpark — Transformations, type casting, date parsing, null handling
- SQL — Equivalent queries for every transformation
- REST API Design — FastAPI with Pydantic validation, streaming downloads
- AI Integration — Prompt engineering, structured JSON output, error handling
- Full Stack — React frontend, Python backend, deployed end to end
Vercel (frontend):
- Root Directory: frontend
- Framework: Create React App
- Build Command: npm run build

Render (backend):
- Root Directory: backend
- Build Command: pip install -r requirements.txt
- Start Command: uvicorn main:app --host 0.0.0.0 --port $PORT
- Environment: ANTHROPIC_API_KEY=your_key
| Variable | Where | Description |
|---|---|---|
| ANTHROPIC_API_KEY | backend/.env | Your Anthropic API key |
Never commit .env to GitHub. It is gitignored by default.
- Anthropic Claude API — AI brain for transformations and analytics
- FastAPI — Python backend framework
- React — Frontend framework
- Tailwind CSS — Utility-first styling
- Recharts — Chart components
- Vercel — Frontend hosting
- Render — Backend hosting
Rohith Srinivasa
MS Data Science — Arizona State University (May 2026)
2 years of experience as IoT & Data Analyst at Software AG, Bangalore
MIT License — feel free to use, modify, and build on this project.
Built for the Codex Creator Challenge — April 2026