
⚙️ MLForge

End-to-end MLOps platform for model lifecycle management

100% FREE · No Paid APIs · Python 3.11 · MLflow · License: MIT

Upload a CSV → get a deployed, monitored ML model. Everything runs locally or on free hosting tiers. No credit card required, ever.


Architecture

┌────────────────────────────────────────────────────────────────────┐
│                          MLForge Platform                          │
│                                                                    │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────┐  │
│  │  Upload  │───▶│ Validate │───▶│ Profile  │───▶│  Feature Eng │  │
│  │  CSV     │    │ (Rules)  │    │ (Stats)  │    │ (Scale/OHE)  │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────┬───────┘  │
│                                                         │          │
│  ┌──────────────────────────────────────────────────────▼───────┐  │
│  │                        Train 4 Models                        │  │
│  │   LogisticRegression  │  RandomForest  │  XGBoost  │  LGBM   │  │
│  └──────────────────────────────┬───────────────────────────────┘  │
│                                 │                                  │
│  ┌──────────────────────────────▼───────────────────────────────┐  │
│  │           Evaluate & Compare (Acc / F1 / AUC-ROC)            │  │
│  │                       Pick Best Model                        │  │
│  └──────────────────────────────┬───────────────────────────────┘  │
│                                 │                                  │
│         ┌───────────────────────┼──────────────────────┐           │
│         ▼                       ▼                      ▼           │
│  ┌─────────────┐    ┌──────────────────┐    ┌──────────────────┐   │
│  │   MLflow    │    │ FastAPI Serving  │    │   Drift Monitor  │   │
│  │  Registry   │    │  /predict        │    │  KS-test + PSI   │   │
│  │  (local)    │    │  /health         │    │  → Alert/Retrain │   │
│  └─────────────┘    └──────────────────┘    └──────────────────┘   │
│                                                                    │
│  ┌─────────────────────────────────────────────────────────────┐   │
│  │           Streamlit Dashboard (upload → monitor)            │   │
│  └─────────────────────────────────────────────────────────────┘   │
│                                                                    │
│  Storage: SQLite (metadata) + Local filesystem (models + MLruns)   │
└────────────────────────────────────────────────────────────────────┘

Quick Start (3 commands)

git clone https://github.com/yprashanna/MLForge.git && cd MLForge
pip install -r requirements.txt && python data/generate_sample.py
streamlit run ui/app.py

Open http://localhost:8501 → upload CSV → train → deploy → monitor.


Detailed Setup

1. Clone & Install

git clone https://github.com/yprashanna/MLForge.git
cd MLForge
pip install -r requirements.txt

2. Generate Sample Data

python data/generate_sample.py
# → creates data/sample.csv (1000 rows, 15 columns, credit scoring)

Or use make data.

3. Start MLflow Server (free, local)

In a separate terminal:

pip install mlflow  # already in requirements.txt
mlflow server \
  --host 0.0.0.0 \
  --port 5000 \
  --backend-store-uri sqlite:///mlforge_meta.db \
  --default-artifact-root ./mlruns

Or use make mlflow.

MLflow UI → http://localhost:5000 (free, runs entirely on your machine)

4. Start the Dashboard

make ui
# or
streamlit run ui/app.py

5. Start the API Server

make serve
# or
uvicorn serving.app:app --host 0.0.0.0 --port 8000 --reload

6. Configure Environment (optional)

cp .env.example .env
# Edit .env for Slack alerts, email alerts, etc.
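
A hypothetical sketch of what .env might contain. The variable names below are illustrative assumptions, not confirmed keys; check .env.example for the real ones:

SLACK_WEBHOOK_URL=https://hooks.slack.com/services/XXX   # hypothetical key
SMTP_HOST=smtp.example.com                               # hypothetical key
SMTP_PORT=587                                            # hypothetical key
ALERT_EMAIL=alerts@example.com                           # hypothetical key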

How To Use

Full Workflow

Step 1 — Upload & Profile

  1. Open http://localhost:8501
  2. Navigate to Data Profiling
  3. Upload your CSV or check "Use built-in sample dataset"
  4. Set your target column name (defaults to default, the label column in the sample dataset)
  5. Click Run Full Profile to see stats, missing values, correlations

Step 2 — Train Models

  1. Navigate to Train Models
  2. Adjust test set size and CV folds
  3. Click Start Training — trains all 4 models with cross-validation (5-fold by default)
  4. Models are automatically saved to models/ and logged to MLflow
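
To make the training-and-logging step concrete, here is a minimal, self-contained sketch of the generic scikit-learn + MLflow pattern described above. It uses synthetic data and assumed parameters; the project's actual logic lives in pipeline/trainer.py:

# Generic sklearn + MLflow sketch, not MLForge's trainer itself
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for data/sample.csv (1000 rows, 15 features)
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

with mlflow.start_run(run_name="RandomForest"):
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    mlflow.log_metric("f1", f1_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")  # stores the model as a run artifact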

Step 3 — Evaluate

  1. Navigate to Evaluate
  2. See side-by-side comparison: Accuracy, F1, ROC-AUC, Avg Precision
  3. View confusion matrix for the best model
  4. Check cross-validation scores for reliability

Step 4 — Predict

  1. Navigate to Deploy & Predict
  2. Use the API (if make serve is running) or in-memory prediction
  3. Send JSON payload, get back predictions + probabilities + latency

Step 5 — Monitor

  1. Navigate to Monitor Drift
  2. Upload a sample of recent production data
  3. Run drift check → see per-feature KS statistics and PSI values
  4. If drift detected → alerts fire + retrain recommendation shown

API Documentation

GET /health

Health check. Returns model load status.

{
  "status": "ok",
  "model_loaded": true,
  "model_loaded_at": 1714156800.0
}
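
For example:

curl http://localhost:8000/health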

POST /predict

Run inference on one or more rows.

Request:

{
  "data": [
    {
      "age": 35,
      "annual_income": 65000,
      "loan_amount": 12000,
      "loan_term_months": 36,
      "credit_score": 720,
      "num_credit_lines": 5,
      "debt_to_income_ratio": 0.25,
      "employment_years": 7.5,
      "num_late_payments": 0,
      "num_inquiries": 2,
      "home_ownership": "MORTGAGE",
      "employment_status": "EMPLOYED",
      "loan_purpose": "DEBT_CONSOLIDATION",
      "has_cosigner": 0
    }
  ]
}

Response:

{
  "predictions": [0],
  "probabilities": [0.0823],
  "model_name": "RandomForestClassifier",
  "latency_ms": 2.4
}
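
A quick way to exercise the endpoint, assuming the API is running locally on port 8000 and the request body above is saved as payload.json:

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d @payload.json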

GET /model/info

Returns info about the currently loaded model: type, parameters, selected features.


Drift Detection

MLForge uses two complementary statistical tests to detect when your production data has drifted away from the training distribution.

KS-Test (Kolmogorov-Smirnov)

Compares the empirical cumulative distribution functions of two samples.

  • p-value < 0.05 → distributions are significantly different → drift detected
  • Fast, non-parametric, no assumptions about distribution shape
  • Implemented via scipy.stats.ks_2samp
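
A minimal, self-contained sketch of this per-feature check on synthetic data (the project's own logic lives in pipeline/drift_detector.py):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)   # training-time sample
production = rng.normal(loc=0.5, scale=1.0, size=1000)  # shifted production sample

result = ks_2samp(reference, production)
print(f"KS={result.statistic:.3f}  p={result.pvalue:.4f}  drift={result.pvalue < 0.05}")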

PSI (Population Stability Index)

Industry standard metric from banking/insurance. Measures the magnitude of distribution shift.

PSI = Σ (Prod% - Ref%) × ln(Prod% / Ref%)
PSI Value   Interpretation
< 0.1       No significant change
0.1–0.2     Moderate change — monitor
> 0.2       Significant drift — retrain
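
A minimal PSI sketch matching the formula above. The 10-bin default and the clipping floor are assumptions, not necessarily what pipeline/drift_detector.py does:

import numpy as np

def psi(reference: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    # Bin edges come from the reference (training) distribution
    edges = np.histogram_bin_edges(reference, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Floor the proportions to avoid log(0) and division by zero
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))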

Overall Drift Decision

MLForge flags overall drift if >20% of features show drift in either test. This avoids false alarms from a single noisy feature.
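
Expressed as code, that decision rule is just a fraction check (a sketch; the names are illustrative):

def overall_drift(per_feature_drift: dict[str, bool], threshold: float = 0.20) -> bool:
    # per_feature_drift maps feature name -> True if either test flagged it
    return sum(per_feature_drift.values()) / len(per_feature_drift) > threshold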

Retraining Trigger

When overall drift is detected:

  1. Streamlit dashboard shows 🚨 DRIFT DETECTED banner
  2. Console alert is logged
  3. Slack webhook fires (if configured)
  4. Email alert sent (if SMTP configured)
  5. "Retrain" recommendation shown with link to Train Models tab

Running Tests

# all tests
make test

# specific test file
pytest tests/test_drift.py -v

# with coverage
make coverage
# → opens htmlcov/index.html

# fast tests only (skips CSV I/O)
make test-fast

Test coverage:

  • tests/test_pipeline.py — ingestion, profiling, feature engineering, training, evaluation
  • tests/test_serving.py — FastAPI endpoints (with mocked model)
  • tests/test_drift.py — KS-test and PSI correctness
  • tests/test_validation.py — data validation rules

Docker

Using docker-compose (recommended)

# start everything: MLflow + API + Streamlit
make docker-up

# check logs
make docker-logs

# stop everything
make docker-down

Services: MLflow UI at http://localhost:5000, FastAPI at http://localhost:8000, Streamlit dashboard at http://localhost:8501 (see docker-compose.yml for the authoritative port mapping).

Manual Docker

docker build --target runtime -t mlforge:latest .
docker run -p 8000:8000 -v $(pwd)/models:/app/models mlforge:latest

Deployment Guide (Free Hosting)

FastAPI → Render Free Tier

  1. Push to GitHub
  2. Create account at https://render.com (free tier available)
  3. New → Web Service → connect your GitHub repo
  4. Settings:
    • Build command: pip install -r requirements.txt
    • Start command: uvicorn serving.app:app --host 0.0.0.0 --port $PORT
    • Environment: Add MLFLOW_TRACKING_URI=file:///app/mlruns
  5. Deploy → get a public HTTPS URL

Note: Render's free tier sleeps after 15 minutes of inactivity. For always-on hosting, use Railway's free tier (500 hrs/month).

Streamlit UI → Streamlit Cloud (Free)

  1. Push to GitHub (include data/sample.csv — untrack models in .gitignore)
  2. Go to https://share.streamlit.io
  3. Click New app → connect repo
  4. Main file path: ui/app.py
  5. Set environment variables in the Secrets section (same as .env)
  6. Deploy → free subdomain at yourapp.streamlit.app

MLflow → Local Only

The MLflow tracking server runs locally in this setup. For a team, you can host it on any free VPS (Oracle Cloud Free Tier has always-free VMs) and point MLFLOW_TRACKING_URI at that server.
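
Pointing clients at a remote tracking server (the hostname is a placeholder):

export MLFLOW_TRACKING_URI=http://your-vps-host:5000

Or from Python:

import mlflow
mlflow.set_tracking_uri("http://your-vps-host:5000")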


Tech Stack

Component            Library                          Version   Cost
ML Models            scikit-learn, XGBoost, LightGBM  Latest    Free
Experiment Tracking  MLflow                           ≥2.12     Free
API Serving          FastAPI + Uvicorn                Latest    Free
Dashboard            Streamlit                        ≥1.34     Free
Data Processing      Pandas, NumPy, SciPy             Latest    Free
Storage              SQLite                           Built-in  Free
Containerization     Docker                           Latest    Free
CI/CD                GitHub Actions                   Latest    Free
Testing              pytest                           ≥8.1      Free

Total infrastructure cost: $0.00


Makefile Commands

make install       # install dependencies
make data          # generate sample.csv
make mlflow        # start MLflow tracking server
make train         # train models on sample.csv
make serve         # start FastAPI server
make ui            # start Streamlit dashboard
make monitor       # run drift check (simulates drift on sample data)
make test          # run all tests
make test-fast     # run tests (skip I/O-heavy ones)
make coverage      # tests with HTML coverage report
make lint          # flake8 linting
make format        # black + isort formatting
make docker-up     # start all services via docker-compose
make docker-down   # stop all docker services
make clean         # remove __pycache__ and build artifacts

Project Structure

mlforge/
├── pipeline/
│   ├── ingestion.py          # CSV loading, column type inference
│   ├── profiler.py           # data statistics, missing values, correlations
│   ├── feature_engineering.py # scaling, OHE, label encoding, feature selection
│   ├── trainer.py            # trains LR, RF, XGBoost, LightGBM
│   ├── evaluator.py          # accuracy, F1, AUC-ROC, confusion matrix
│   └── drift_detector.py     # KS-test + PSI drift detection
├── serving/
│   ├── app.py                # FastAPI /predict and /health endpoints
│   └── model_loader.py       # loads model from MLflow or local fallback
├── registry/
│   └── mlflow_manager.py     # MLflow experiment logging and registry
├── monitoring/
│   ├── drift_monitor.py      # SQLite-backed drift check history
│   └── alerts.py             # console / Slack / email alerts
├── validation/
│   └── data_validator.py     # data quality rules engine
├── ui/
│   └── app.py                # Streamlit dashboard (all 5 pages)
├── config/
│   └── settings.py           # all configuration in one place
├── tests/                    # pytest test suite
├── data/
│   ├── sample.csv            # 1000-row credit scoring dataset
│   └── generate_sample.py    # script to regenerate sample data
├── Dockerfile                # multi-stage Docker build
├── docker-compose.yml        # MLflow + API + UI
├── .github/workflows/ci.yml  # GitHub Actions: test on push
├── Makefile                  # make install/train/serve/test/...
└── requirements.txt          # all free, open-source dependencies

License

MIT License — see LICENSE

Copyright (c) 2024 MLForge Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction...

Contributing

  1. Fork the repo
  2. Create a feature branch: git checkout -b feature/my-feature
  3. Run tests: make test
  4. Submit a PR

Ideas welcome: Optuna hyperparameter tuning, multi-class support, categorical drift detection, SHAP explainability, model A/B testing.
