Note: This project was originally developed at the SRH-Heidelberg-University-ADSA organization as a Proof of Concept for Deutsche Bank AG. This is a personal copy for portfolio purposes. Original repo: SRH-Heidelberg-University-ADSA/SAS2Py

SAS2Py — Multi-Model SAS to Python Translator

AI-powered tool for translating legacy SAS code into modern Python using fine-tuned Large Language Models and Retrieval-Augmented Generation (RAG).

University: SRH Heidelberg University
Program: Applied Data Science & Analytics
Industry Partner: Deutsche Bank AG


Project Overview

Enterprise organizations running legacy SAS analytics face increasing pressure to modernize. Manual code migration is slow, error-prone, and expensive. SAS2Py automates this translation using a combination of fine-tuned LLMs and a hybrid RAG pipeline that retrieves semantically similar examples from a curated dataset of 164 SAS→Python translation pairs.
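The hybrid retrieval idea can be sketched in a few lines: blend an embedding-similarity score with a keyword-overlap score and rank by the weighted sum. The real pipeline uses pgvector and PostgreSQL full-text search; the toy `hybrid_rank` below, including the `alpha` weight, only illustrates the scoring idea and is not the project's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    """Fraction of query tokens that appear in the candidate text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, corpus, alpha=0.7):
    """Rank (text, embedding) pairs by a weighted vector + keyword score."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text),
         text)
        for text, vec in corpus
    ]
    return [text for _, text in sorted(scored, reverse=True)]
```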

The system supports three fine-tuned models that users can switch between at runtime, each optimized for different translation scenarios:

| Model | Parameters | VRAM | Best For |
| --- | --- | --- | --- |
| Qwen2.5-Coder 1.5B | 1.5B | ~3 GB | Fast translations, low-resource environments |
| DeepSeek Coder 6.7B | 6.7B | ~13 GB | Balanced accuracy and speed |
| Llama 3.1 8B Instruct | 8B | ~15 GB | Complex multi-step SAS programs |

Key Features

  • Multi-model translation with runtime hot-swapping via REST API
  • Hybrid RAG combining vector similarity search (pgvector) with full-text keyword search
  • 8-step post-processing pipeline that cleans FIM tokens, SAS code leaks, and model artifacts
  • Dependency analysis detecting macros, libraries, remote execution, and unsupported PROCs
  • Syntax validation via AST parsing with missing import detection
  • Web interface (Streamlit) with file upload, model selection, and user feedback
  • Audit logging for compliance tracking and feedback collection
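The syntax-validation feature can be illustrated with Python's built-in ast module: parse the generated code, then flag names that look like common module aliases but were never imported. The KNOWN_MODULES table and validate_python helper below are an illustrative sketch, not the project's actual validator.

```python
import ast

# Common aliases a translation might reference; anything used but not
# imported gets flagged. (Illustrative list, not the project's table.)
KNOWN_MODULES = {"pd": "pandas", "np": "numpy", "os": "os", "re": "re"}

def validate_python(source: str):
    """Parse generated code; report syntax errors and missing imports."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return {"valid": False, "error": str(exc), "missing_imports": []}

    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imported.update(a.asname or a.name for a in node.names)

    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    missing = sorted(
        KNOWN_MODULES[name] for name in referenced
        if name in KNOWN_MODULES and name not in imported
    )
    return {"valid": True, "error": None, "missing_imports": missing}
```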

Architecture

SAS2Py follows a 4-layer architecture with clear separation of concerns:

Layer 1: Presentation    → Streamlit UI (port 8501)
Layer 2: Business Logic  → FastAPI + Translation Pipeline (port 8000)
Layer 3: Persistence     → Connection Pool + Query Layer
Layer 4: Database        → PostgreSQL 16 + pgvector

Translation Pipeline:

User SAS Code
    → Dependency Analyzer (detect macros, datasets, warnings)
    → RAG Service (hybrid vector + full-text retrieval)
    → Prompt Builder (Full2ex for small models, Full for large)
    → Model Service (generate Python via LLM)
    → Post-Processor (clean FIM tokens, SAS leaks)
    → Validator (check Python syntax via AST)
    → Audit Log
    → Return result
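The post-processing step can be pictured as follows. The real 8-step pipeline lives in backend/services/post_processor.py; the token names and rules here are typical artifacts of FIM-trained models and are only an assumed illustration:

```python
import re

# Typical fill-in-the-middle / chat-template tokens a model may leak.
# (Illustrative names -- the project's actual token list may differ.)
ARTIFACT_TOKENS = [
    "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>",
    "<|endoftext|>", "<|im_end|>",
]

def clean_output(text: str) -> str:
    # 1. Strip special tokens leaked into the output.
    for token in ARTIFACT_TOKENS:
        text = text.replace(token, "")
    # 2. Drop echoed SAS lines (e.g. a trailing "run;" copied from input).
    lines = [l for l in text.splitlines()
             if not re.match(r"\s*(proc\s|data\s|run\s*;|%macro)", l, re.I)]
    # 3. Collapse runs of blank lines left behind by the removals.
    cleaned, prev_blank = [], False
    for l in lines:
        blank = not l.strip()
        if not (blank and prev_blank):
            cleaned.append(l)
        prev_blank = blank
    return "\n".join(cleaned).strip()
```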

See docs/ARCHITECTURE.md for the full architecture documentation.

The architecture evolved through multiple design iterations — from a microservices design (Redis, Celery, Nginx, Ollama) to the current streamlined 4-layer approach. See figures/architecture_diagrams/README.md for the full evolution with all 6 diagrams.


Quick Start

Prerequisites

This system is designed for GPU servers. It was developed and tested on an NVIDIA H200 NVL server (141 GB VRAM, 256 GB RAM). A GPU server is required for running all three models and for production deployment.

Recommended: GPU Server (for all models and production use)

| Requirement | Details | How to check |
| --- | --- | --- |
| NVIDIA GPU | 16+ GB VRAM (A100, H100, H200, etc.) | nvidia-smi |
| CUDA | 12.1+ drivers | nvidia-smi (shows CUDA version) |
| RAM | 32 GB+ | free -h |
| Disk Space | 50+ GB free (31 GB models + dependencies) | df -h |
| Python | 3.11+ (tested on 3.12) | python --version |
| PostgreSQL | 16+ with pgvector | psql --version |
| Docker | For frontend deployment (can be on any machine, not necessarily the GPU server) | docker --version |
| OS | Linux (Ubuntu 22.04+ recommended) | cat /etc/os-release |

Installing prerequisites on Ubuntu/Debian:

sudo apt install postgresql postgresql-16-pgvector docker.io

Possible but not recommended: High-end laptop/workstation

Running on a laptop is technically possible but comes with significant limitations:

| Model | Min VRAM | Min RAM | Laptop GPU examples | Experience |
| --- | --- | --- | --- | --- |
| Qwen 1.5B only | 3 GB | 16 GB | RTX 3060, RTX 4060 | Workable, 5-10 sec/translation |
| + DeepSeek 6.7B | 13 GB | 32 GB | RTX 4080, RTX 4090 | Slow model switching |
| + Llama 3.1 8B | 16 GB | 32 GB | RTX 4090 (24 GB) | Very slow, may run out of memory |

Laptops without a dedicated NVIDIA GPU cannot run the backend (Steps 1-8). However, they can still run the frontend Docker container (Steps 9.3-9.5) and connect to a remote backend via the Cloudflare Tunnel.

Per-model requirements:

| Model | Parameters | VRAM | Disk Space | Speed |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder 1.5B | 1.5B | ~3 GB | ~2.9 GB | ~2-4 sec |
| DeepSeek Coder 6.7B | 6.7B | ~13 GB | ~13 GB | ~5-10 sec |
| Llama 3.1 8B Instruct | 8B | ~16 GB | ~15 GB | ~8-15 sec |
| All three | - | ~16 GB (one at a time) | ~31 GB | - |

The Qwen 1.5B model is recommended for getting started — smallest footprint, fastest inference, best syntax validity (100%).


Step-by-Step Installation

Step 1 — Clone the repository

git clone https://github.com/SRH-Heidelberg-University-ADSA/SAS2Py.git
cd SAS2Py

Step 2 — Create a virtual environment

python -m venv venv
source venv/bin/activate

Step 3 — Install dependencies

pip install -r requirements.txt

Note (PyTorch with CUDA): If torch installs without GPU support, install it separately first:

pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Step 4 — Configure your environment

cp .env.example .env

Now open .env in a text editor:

nano .env

Update only these values — choose any username and password you want (you will create this user in PostgreSQL during Step 5):

# Change these to YOUR database credentials:
DB_USER=your_postgres_username
DB_PASSWORD=your_postgres_password

# These defaults usually work — only change if your setup is different:
DB_HOST=localhost
DB_PORT=5432
DB_NAME=sas_translator

Save and exit: Ctrl+X, then Y, then Enter.

Example: If you set DB_USER=sarvesh and DB_PASSWORD=sarvesh14, you will run CREATE USER sarvesh WITH PASSWORD 'sarvesh14' SUPERUSER; in Step 5.

All other values in .env work with defaults. No code changes are needed anywhere.
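For reference, .env follows the plain KEY=VALUE format. The project reads these values through its own configuration layer (backend/config.py); the parse_env helper below is only a sketch of the file format, not the project's loader:

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings
```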

Step 5 — Set up the database

Install PostgreSQL + pgvector (if not already installed):

# Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y postgresql postgresql-16-pgvector

# Start PostgreSQL:
sudo pg_ctlcluster 16 main start
# Or: sudo service postgresql start

Verify PostgreSQL is running:

sudo pg_isready

Should show: /var/run/postgresql:5432 - accepting connections

Create a database user and the database:

Replace your_db_user and your_db_password with the same values you set in .env (Step 4):

sudo -u postgres psql -c "CREATE USER your_db_user WITH PASSWORD 'your_db_password' SUPERUSER;"
sudo -u postgres psql -c "CREATE DATABASE sas_translator OWNER your_db_user;"
sudo -u postgres psql -d sas_translator -c "CREATE EXTENSION IF NOT EXISTS vector;"
sudo -u postgres psql -d sas_translator -f database/migrations/001_initial_schema.sql

Example (if your .env has DB_USER=sarvesh14 and DB_PASSWORD=sarvesh14):

sudo -u postgres psql -c "CREATE USER sarvesh14 WITH PASSWORD 'sarvesh14' SUPERUSER;"
sudo -u postgres psql -c "CREATE DATABASE sas_translator OWNER sarvesh14;"
sudo -u postgres psql -d sas_translator -c "CREATE EXTENSION IF NOT EXISTS vector;"
sudo -u postgres psql -d sas_translator -f database/migrations/001_initial_schema.sql

Alternative — Automated setup (if you have password-based PostgreSQL auth):

bash scripts/setup_database.sh

This reads credentials from .env automatically. If it asks for a password, use the manual method above instead.

Step 6 — Load training data into the database

Option A — Load pre-built SQL dump (faster, includes embeddings):

sudo -u postgres psql -d sas_translator -f database/code_translations.sql

Option B — Generate fresh embeddings (takes ~30 seconds):

python scripts/load_data.py

Both options give you the same result: 164 SAS-to-Python examples with vector embeddings ready for RAG retrieval. Verify with:

sudo -u postgres psql -d sas_translator -c "SELECT COUNT(*) FROM code_translations;"

Should show 164.

Step 7 — Download model weights

Models are not included in the repository (~31 GB total). You have three options:

Option A — Download fine-tuned models (recommended, same results as live demo):

Download the fine-tuned model weights (~26 GB) from Google Drive:

https://drive.google.com/file/d/1CO4jdgd9uhiNW4075TBehrwV6kwmf4uI/view

After downloading SAS2Py_model_weights.zip, extract the model folders:

unzip SAS2Py_model_weights.zip
cp -r backend/models/qwen_v6_merged models/
cp -r backend/models/deepseek_merged models/
cp -r backend/models/llama_8b_merged models/
rm -rf backend/models

Option B — Download base models from Hugging Face (requires re-training):

python scripts/download_models.py --model qwen-1.5b   # ~3 GB
python scripts/download_models.py --model all          # ~31 GB (all 3)

These are the original base models without fine-tuning applied. You will need to run the training scripts to get comparable results. See docs/REPRODUCIBILITY.md.

Option C — Start with Qwen only (quickest, ~3 GB):

If you just want to test the system, download only the Qwen model (Option A or B). It is the smallest (2.9 GB) and loads fastest.

After downloading, your models/ directory should look like:

models/
├── qwen_v6_merged/
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── deepseek_merged/          # (optional, 13 GB)
└── llama_8b_merged/          # (optional, 15 GB)

Step 8 — Start the backend

cd SAS2Py
source venv/bin/activate
uvicorn backend.main:app --host 0.0.0.0 --port 8000

Wait until you see API ready! in the logs (~10-30 seconds for model loading). Keep this terminal open.

Verify in another terminal:

curl http://localhost:8000/health

Step 9 — Deploy the web interface (Docker + Cloudflare Tunnel)

The backend runs on your GPU server, but users need a web interface to interact with it. The frontend (Streamlit) runs inside a Docker container and connects to the backend through a Cloudflare Tunnel.

Why a tunnel? The Docker container is an isolated environment. Inside the container, localhost means the container itself — not your server. So the frontend inside Docker cannot reach the backend at localhost:8000. The Cloudflare Tunnel creates a public URL that forwards to your backend, and the container uses that URL to communicate.

How it works:

User's Browser
    |
    v
Docker Container (Streamlit frontend on port 8501)
    |  reads SAS2PY_API environment variable
    |  sends translation requests to that URL
    v
Cloudflare Tunnel (public URL)
    |  forwards requests to localhost:8000
    v
GPU Server (FastAPI backend on port 8000)
    |  loads AI model, queries database
    v
Returns Python translation

What is SAS2PY_API? It is an environment variable that tells the frontend where the backend API is running. In the code (frontend/api_client.py):

API_BASE = os.getenv("SAS2PY_API", "http://localhost:8000")

When running inside Docker, you must set this to the tunnel URL. Otherwise the frontend cannot reach the backend.
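Concretely, every backend call is built from that base URL. Only the API_BASE line above comes from frontend/api_client.py; the api_url helper here is a hypothetical illustration of the lookup:

```python
import os

def api_url(path: str) -> str:
    """Join SAS2PY_API (or the localhost fallback) with an endpoint path.
    Hypothetical helper -- only the API_BASE lookup mirrors the repo."""
    base = os.getenv("SAS2PY_API", "http://localhost:8000").rstrip("/")
    return f"{base}/{path.lstrip('/')}"

# With SAS2PY_API set to the tunnel URL, api_url("/api/translate")
# resolves to https://<tunnel>.trycloudflare.com/api/translate.
```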

Step 9.1 — Install Cloudflare Tunnel on the GPU server:

Run these commands on the GPU server (where the backend is running):

sudo wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
  -O /usr/local/bin/cloudflared && sudo chmod +x /usr/local/bin/cloudflared

Step 9.2 — Start the tunnel (on the GPU server):

nohup cloudflared tunnel --url http://localhost:8000 \
  --proxy-connect-timeout 600s \
  --proxy-keepalive-timeout 600s \
  > /tmp/sas2py_tunnel.log 2>&1 &

sleep 10
grep -o 'https://[a-z0-9-]*\.trycloudflare\.com' /tmp/sas2py_tunnel.log

Save the URL that appears (e.g., https://random-words.trycloudflare.com). You need it in Step 9.4.

Why 600s timeout? Model switching (loading Llama 8B from disk) can take 2+ minutes. The default timeout would drop the connection before the switch completes.

Step 9.3 — Build the frontend Docker image:

On any machine with Docker installed (your laptop, the GPU server, or a separate server):

docker build -t sas2py-frontend frontend/

Step 9.4 — Run the frontend container:

Replace YOUR_TUNNEL_URL with the URL from Step 9.2:

docker run -d \
  --name sas2py-frontend \
  -p 8501:8501 \
  -e SAS2PY_API=YOUR_TUNNEL_URL \
  --restart unless-stopped \
  sas2py-frontend

Example:

docker run -d \
  --name sas2py-frontend \
  -p 8501:8501 \
  -e SAS2PY_API=https://random-words.trycloudflare.com \
  --restart unless-stopped \
  sas2py-frontend

Using Portainer.io? Instead of docker run, you can deploy via Portainer:

  1. Go to Stacks and click Add Stack
  2. Paste the contents of docker-compose.yml
  3. Add environment variable: SAS2PY_API = your tunnel URL
  4. Click Deploy the stack

Portainer is a web-based GUI for managing Docker containers. It is optional — the docker run command above does the same thing.

Step 9.5 — Open the application:

Open your browser and go to:

  • If Docker runs on your local machine: http://localhost:8501
  • If Docker runs on a remote server: http://<server-ip>:8501 (replace <server-ip> with the actual IP address of the server running Docker)

You should see the SAS2Py translator interface with:

  • SAS code input area
  • Model selector in the sidebar (Qwen 1.5B, DeepSeek 6.7B, Llama 3.1 8B)
  • "Backend Online" indicator in the sidebar
  • Python code output area

Try translating:

proc sort data=mydata; by name; run;
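A correct translation of that one-liner looks like the pandas snippet below. This is illustrative of what the system aims to produce; the exact output depends on the selected model:

```python
import pandas as pd

# SAS: proc sort data=mydata; by name; run;
# pandas equivalent: sort the frame by the BY variable and reset the index.
mydata = pd.DataFrame({"name": ["Carol", "Alice", "Bob"],
                       "score": [3, 1, 2]})
mydata = mydata.sort_values(by="name").reset_index(drop=True)
```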

Note on model switching: When switching between models in the sidebar, wait for the switch to complete before clicking another model. Loading DeepSeek 6.7B takes ~2 minutes and Llama 3.1 8B takes ~3 minutes on first load (faster on subsequent loads). The Qwen 1.5B model loads in ~10 seconds.

When the tunnel URL changes:

The Cloudflare quick tunnel generates a new URL each time it restarts. When this happens:

  1. Start a new tunnel (repeat Step 9.2)
  2. Stop the old container: docker stop sas2py-frontend && docker rm sas2py-frontend
  3. Start a new container with the new URL (repeat Step 9.4)

The tunnel stays alive as long as the server is running and the process is not killed. It only changes on server restart.

Want a permanent URL? Set up a Cloudflare named tunnel with your own domain. See docs/DEPLOYMENT.md for details.

Local development on a machine with a GPU? If backend, frontend, and browser are all on the same machine, you can skip Steps 9.1-9.5 entirely. Just run streamlit run frontend/app.py in a second terminal and open http://localhost:8501 in your browser. No Docker or tunnel needed — the frontend defaults to http://localhost:8000 which already points to the backend.


Verify everything works

# Check backend health
curl http://localhost:8000/health

# Open Swagger API docs
# Visit: http://localhost:8000/docs

Running Tests

# Unit tests (no GPU, no database needed — runs anywhere)
python -m pytest tests/unit/ -v

# Integration tests (needs database connection)
python -m pytest tests/integration/ -v

# End-to-end tests (needs running backend on port 8000)
SAS2PY_TEST_URL=http://localhost:8000 python -m pytest tests/e2e/ -v

# All tests
python -m pytest tests/ -v

Expected results: 46 unit + 15 integration + 14 e2e = 75 tests total.

See tests/README.md for details on what each test suite covers.


What you need to change (summary)

| What | Where | Example |
| --- | --- | --- |
| Database username | .env (DB_USER) | DB_USER=postgres |
| Database password | .env (DB_PASSWORD) | DB_PASSWORD=mypassword |
| Database port (if non-default) | .env (DB_PORT) | DB_PORT=5432 |
| Nothing else | All other config has working defaults | |

You do NOT need to change any Python code, any file paths, or any imports. Everything is configured through the .env file.


Repository Structure

SAS2Py/
├── backend/                     # FastAPI backend (Layer 2 + 3)
│   ├── main.py                  # Application entry point
│   ├── config.py                # Centralized configuration
│   ├── api/routes.py            # REST API endpoints
│   ├── schemas/                 # Pydantic request/response models
│   ├── services/                # Business logic services
│   │   ├── translation_service.py   # Pipeline orchestrator
│   │   ├── model_service.py         # Multi-model management
│   │   ├── rag_service.py           # Hybrid RAG retrieval
│   │   ├── post_processor.py        # 8-step output cleaning
│   │   ├── validator.py             # Python syntax validation
│   │   └── dependency_analyzer.py   # SAS dependency detection
│   └── database/                # Database connection and queries
│
├── frontend/                    # Streamlit web interface
│   ├── app.py                   # Main UI application
│   ├── api_client.py            # Backend API connector
│   └── Dockerfile               # Frontend container config
│
├── docker-compose.yml           # Docker deployment for frontend
│
├── models/                      # AI model weights (git-ignored)
│   ├── configs/                 # Model configuration files
│   └── model_cards/             # Per-model documentation
│
├── training/                    # Training data and configs
│   ├── datasets/                # SAS→Python translation pairs
│   │   ├── training.json        # 164 training examples
│   │   ├── test.json            # 20 test examples
│   │   └── validation.json      # 31 validation examples
│   ├── scripts/                 # Model fine-tuning scripts
│   │   └── train_qwen_v6.py     # Qwen 1.5B LoRA training
│   └── hyperparameters/         # Training configurations and progression
│
├── benchmarks/                  # Benchmark evaluation data
│   ├── sas_programs/            # Test SAS inputs
│   └── python_reference/        # Expected Python outputs
│
├── experiments/                 # Experiment tracking
│   ├── experiment_logs/         # Training run logs
│   └── experiment_configs/      # Hyperparameter variations
│
├── results/                     # Evaluation results
│   ├── evaluation_metrics/      # CodeBLEU, syntax scores
│   └── comparison_tables/       # Model comparison data
│
├── figures/                     # Diagrams and plots
│   ├── architecture_diagrams/   # 6 diagrams showing architecture evolution
│   └── evaluation_plots/
│
├── research/                    # Research artefacts
│   ├── related_papers/          # Reference papers (LoRA, RAG, CodeBLEU, etc.)
│   ├── competitor_analysis/     # Competitor tools comparison
│   └── README.md                # Literature review and competitor analysis
│
├── notebooks/                   # Jupyter notebooks
├── resources/                   # Project artefacts
│   ├── presentations/           # Meeting and demo presentations
│   ├── meeting_minutes/         # Meeting notes
│   └── demo_videos/             # Demo video links (hosted on Google Drive)
├── database/migrations/         # SQL schema files
├── scripts/                     # Setup and utility scripts
├── tests/                       # Test suite (unit, integration, e2e)
└── docs/                        # Documentation
    ├── ARCHITECTURE.md
    ├── SETUP.md
    ├── DEPLOYMENT.md
    ├── API.md
    ├── MODEL_CARD.md
    ├── DATASET.md
    ├── LEARNINGS.md
    ├── TROUBLESHOOTING.md
    └── FAQ.md

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check and model status |
| POST | /api/translate | Translate SAS code to Python |
| GET | /api/models | List available models |
| POST | /api/switch-model | Switch the active model |
| POST | /api/feedback | Submit translation feedback |
| GET | /api/stats | System statistics |
| GET | /docs | Interactive Swagger UI |

See docs/API.md for complete API documentation.
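As a sketch of a client call, the request below is built with the standard library. The field names "sas_code" and "model" are assumptions for illustration; check docs/API.md for the actual request schema:

```python
import json
import urllib.request

def build_translate_request(base_url: str, sas_code: str,
                            model: str = "qwen-1.5b") -> urllib.request.Request:
    """Build a POST /api/translate request. Payload field names are
    illustrative -- see docs/API.md for the real schema."""
    payload = json.dumps({"sas_code": sas_code, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/translate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_translate_request(
    "http://localhost:8000", "proc sort data=mydata; by name; run;")
# urllib.request.urlopen(req) sends it once the backend is running.
```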


Dataset

The training dataset contains 164 curated SAS→Python translation pairs derived from real-world Deutsche Bank analytics workflows. Examples span:

  • PROC SQL queries, PROC SORT, PROC EXPORT/IMPORT
  • DATA step operations with merges and transformations
  • SAS macro definitions and macro variable handling
  • Remote execution wrappers (rsubmit/endrsubmit)

All examples are normalized to remove infrastructure wrappers before embedding generation. See docs/DATASET.md for dataset versioning, format details, and category distribution.
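The wrapper-stripping step can be pictured as follows. The exact rules live in the project's data-preparation scripts; the regexes below are an illustrative sketch of removing remote-execution scaffolding so embeddings capture the analytic logic rather than the infrastructure around it:

```python
import re

def strip_wrappers(sas: str) -> str:
    """Remove rsubmit/endrsubmit and signon/signoff lines (illustrative)."""
    sas = re.sub(r"(?im)^\s*(rsubmit|endrsubmit)\s*;\s*$", "", sas)
    sas = re.sub(r"(?im)^\s*(signon|signoff)\b[^;]*;\s*$", "", sas)
    # Drop the blank lines left where wrappers were removed.
    return "\n".join(l for l in sas.splitlines() if l.strip())
```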


Deployed Version

A live deployment of SAS2Py is available at:

https://sas2py-translator.app.data-lab.site/

This instance runs on the SRH University compute server (NVIDIA H200 NVL, 141 GB VRAM) with all three models available for hot-swapping.

Note: Availability depends on the university server being online. If the link is down, follow the installation instructions above to run a local instance.


Technology Stack

| Component | Technology |
| --- | --- |
| Backend API | FastAPI, Uvicorn |
| Frontend | Streamlit |
| Database | PostgreSQL 16, pgvector |
| LLMs | Qwen2.5-Coder, DeepSeek Coder, Llama 3.1 |
| Embeddings | Sentence Transformers (all-mpnet-base-v2) |
| ML Framework | PyTorch, Transformers, PEFT |
| Deployment | Docker, Portainer.io, Cloudflare Tunnel |
| Testing | pytest |

Documentation

| Document | Description |
| --- | --- |
| docs/ARCHITECTURE.md | System architecture and translation pipeline |
| docs/SETUP.md | Detailed installation guide |
| docs/DEPLOYMENT.md | Docker and Portainer deployment guide |
| docs/API.md | REST API endpoint documentation |
| docs/MODEL_CARD.md | Model details, training, and performance |
| docs/DATASET.md | Dataset versioning and format |
| docs/LEARNINGS.md | Key learnings and future scope |
| docs/TROUBLESHOOTING.md | Problems faced and solutions |
| docs/FAQ.md | Frequently asked questions |
| docs/REPRODUCIBILITY.md | Step-by-step guide to replicate training and results |
| docs/LIMITATIONS.md | Known limitations across dataset, models, RAG, infrastructure |
| training/hyperparameters/README.md | Exact training parameters with progression |
| figures/architecture_diagrams/README.md | Architecture evolution (6 diagrams) |
| research/README.md | Related papers and competitor analysis |
| resources/demo_videos/README.md | Demo video links |

Demo Videos

Video demonstrations are hosted on Google Drive. See resources/demo_videos/README.md for links.


Research and Competitor Analysis

See research/README.md for:

  • Related papers (LoRA, RAG, CodeBLEU, Sentence-BERT, StarCoder)
  • Competitor analysis (SAS CodeConverter, SPROCKET, ChatGPT prompting, saspy)
  • How SAS2Py differs from existing approaches

Team

| Name | Role |
| --- | --- |
| Sarvesh Uday Mokal | Machine Learning Engineer, MLOps & Backend Developer |
| Swathi Chandrashekargiri | Machine Learning Engineer, Research Analyst |
| Kotrashetti Basavraj C. | Frontend Developer, QA Engineer |
| Tanuj Agarwal | Machine Learning Engineer, Data Curator |
| Akshata Hate | Data Analyst & QA |

Supervisors:

Industry Partner: Deutsche Bank AG

See CONTRIBUTING.md for detailed contributions per team member.


License

This project is developed as part of an academic program at SRH University Heidelberg. See LICENSE for details.


Contributing

See CONTRIBUTING.md for guidelines on how to contribute to this project.
