Note: This project was originally developed at the SRH-Heidelberg-University-ADSA organization as a Proof of Concept for Deutsche Bank AG. This is a personal copy for portfolio purposes. Original repo: SRH-Heidelberg-University-ADSA/SAS2Py

SAS2Py — Multi-Model SAS to Python Translator

AI-powered tool for translating legacy SAS code into modern Python using fine-tuned Large Language Models and Retrieval-Augmented Generation (RAG).

University: SRH Heidelberg University
Program: Applied Data Science & Analytics
Industry Partner: Deutsche Bank AG


Project Overview

Enterprise organizations running legacy SAS analytics face increasing pressure to modernize. Manual code migration is slow, error-prone, and expensive. SAS2Py automates this translation using a combination of fine-tuned LLMs and a hybrid RAG pipeline that retrieves semantically similar examples from a curated dataset of 164 SAS→Python translation pairs.
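The hybrid retrieval idea can be sketched in a few lines: blend an embedding-similarity score with a keyword-overlap score and rank by the weighted sum. The real pipeline uses pgvector and PostgreSQL full-text search; the toy `hybrid_rank` below, including the `alpha` weight, only illustrates the scoring idea and is not the project's implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query, text):
    """Fraction of query tokens that appear in the candidate text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, corpus, alpha=0.7):
    """Rank (text, embedding) pairs by a weighted vector + keyword score."""
    scored = [
        (alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text),
         text)
        for text, vec in corpus
    ]
    return [text for _, text in sorted(scored, reverse=True)]
```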

The system supports three fine-tuned models that users can switch between at runtime, each optimized for different translation scenarios:

| Model | Parameters | VRAM | Best For |
| --- | --- | --- | --- |
| Qwen2.5-Coder 1.5B | 1.5B | ~3 GB | Fast translations, low-resource environments |
| DeepSeek Coder 6.7B | 6.7B | ~13 GB | Balanced accuracy and speed |
| Llama 3.1 8B Instruct | 8B | ~15 GB | Complex multi-step SAS programs |

Key Features

  • Multi-model translation with runtime hot-swapping via REST API
  • Hybrid RAG combining vector similarity search (pgvector) with full-text keyword search
  • 8-step post-processing pipeline that cleans FIM tokens, SAS code leaks, and model artifacts
  • Dependency analysis detecting macros, libraries, remote execution, and unsupported PROCs
  • Syntax validation via AST parsing with missing import detection
  • Web interface (Streamlit) with file upload, model selection, and user feedback
  • Audit logging for compliance tracking and feedback collection
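The syntax-validation feature can be illustrated with Python's built-in ast module: parse the generated code, then flag names that look like common module aliases but were never imported. The KNOWN_MODULES table and validate_python helper below are an illustrative sketch, not the project's actual validator.

```python
import ast

# Common aliases a translation might reference; anything used but not
# imported gets flagged. (Illustrative list, not the project's table.)
KNOWN_MODULES = {"pd": "pandas", "np": "numpy", "os": "os", "re": "re"}

def validate_python(source: str):
    """Parse generated code; report syntax errors and missing imports."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return {"valid": False, "error": str(exc), "missing_imports": []}

    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imported.update(a.asname or a.name for a in node.names)

    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    missing = sorted(
        KNOWN_MODULES[name] for name in referenced
        if name in KNOWN_MODULES and name not in imported
    )
    return {"valid": True, "error": None, "missing_imports": missing}
```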

Architecture

SAS2Py follows a 4-layer architecture with clear separation of concerns:

Layer 1: Presentation    → Streamlit UI (port 8501)
Layer 2: Business Logic  → FastAPI + Translation Pipeline (port 8000)
Layer 3: Persistence     → Connection Pool + Query Layer
Layer 4: Database        → PostgreSQL 16 + pgvector

Translation Pipeline:

User SAS Code
    → Dependency Analyzer (detect macros, datasets, warnings)
    → RAG Service (hybrid vector + full-text retrieval)
    → Prompt Builder (Full2ex for small models, Full for large)
    → Model Service (generate Python via LLM)
    → Post-Processor (clean FIM tokens, SAS leaks)
    → Validator (check Python syntax via AST)
    → Audit Log
    → Return result
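The post-processing step can be pictured as follows. The real 8-step pipeline lives in backend/services/post_processor.py; the token names and rules here are typical artifacts of FIM-trained models and are only an assumed illustration:

```python
import re

# Typical fill-in-the-middle / chat-template tokens a model may leak.
# (Illustrative names -- the project's actual token list may differ.)
ARTIFACT_TOKENS = [
    "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>",
    "<|endoftext|>", "<|im_end|>",
]

def clean_output(text: str) -> str:
    # 1. Strip special tokens leaked into the output.
    for token in ARTIFACT_TOKENS:
        text = text.replace(token, "")
    # 2. Drop echoed SAS lines (e.g. a trailing "run;" copied from input).
    lines = [l for l in text.splitlines()
             if not re.match(r"\s*(proc\s|data\s|run\s*;|%macro)", l, re.I)]
    # 3. Collapse runs of blank lines left behind by the removals.
    cleaned, prev_blank = [], False
    for l in lines:
        blank = not l.strip()
        if not (blank and prev_blank):
            cleaned.append(l)
        prev_blank = blank
    return "\n".join(cleaned).strip()
```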

See docs/ARCHITECTURE.md for the full architecture documentation.

The architecture evolved through multiple design iterations — from a microservices design (Redis, Celery, Nginx, Ollama) to the current streamlined 4-layer approach. See figures/architecture_diagrams/README.md for the full evolution with all 6 diagrams.


Quick Start

Prerequisites

This system is designed for GPU servers. It was developed and tested on an NVIDIA H200 NVL server (141 GB VRAM, 256 GB RAM). A GPU server is required for running all three models and for production deployment.

Recommended: GPU Server (for all models and production use)

| Requirement | Details | How to check |
| --- | --- | --- |
| NVIDIA GPU | 16+ GB VRAM (A100, H100, H200, etc.) | nvidia-smi |
| CUDA | 12.1+ drivers | nvidia-smi (shows CUDA version) |
| RAM | 32 GB+ | free -h |
| Disk Space | 50+ GB free (31 GB models + dependencies) | df -h |
| Python | 3.11+ (tested on 3.12) | python --version |
| PostgreSQL | 16+ with pgvector | psql --version |
| Docker | For frontend deployment (can be on any machine, not necessarily the GPU server) | docker --version |
| OS | Linux (Ubuntu 22.04+ recommended) | cat /etc/os-release |

Installing prerequisites on Ubuntu/Debian:

sudo apt install postgresql postgresql-16-pgvector docker.io

Possible but not recommended: High-end laptop/workstation

Running on a laptop is technically possible but comes with significant limitations:

| Model | Min VRAM | Min RAM | Laptop GPU examples | Experience |
| --- | --- | --- | --- | --- |
| Qwen 1.5B only | 3 GB | 16 GB | RTX 3060, RTX 4060 | Workable, 5-10 sec/translation |
| + DeepSeek 6.7B | 13 GB | 32 GB | RTX 4080, RTX 4090 | Slow model switching |
| + Llama 3.1 8B | 16 GB | 32 GB | RTX 4090 (24 GB) | Very slow, may run out of memory |

Laptops without a dedicated NVIDIA GPU cannot run the backend (Steps 1-8). However, they can still run the frontend Docker container (Steps 9.3-9.5) and connect to a remote backend via the Cloudflare Tunnel.

Per-model requirements:

| Model | Parameters | VRAM | Disk Space | Speed |
| --- | --- | --- | --- | --- |
| Qwen2.5-Coder 1.5B | 1.5B | ~3 GB | ~2.9 GB | ~2-4 sec |
| DeepSeek Coder 6.7B | 6.7B | ~13 GB | ~13 GB | ~5-10 sec |
| Llama 3.1 8B Instruct | 8B | ~16 GB | ~15 GB | ~8-15 sec |
| All three | - | ~16 GB (one at a time) | ~31 GB | - |

The Qwen 1.5B model is recommended for getting started — smallest footprint, fastest inference, best syntax validity (100%).


Step-by-Step Installation

Step 1 — Clone the repository

git clone https://github.com/SRH-Heidelberg-University-ADSA/SAS2Py.git
cd SAS2Py

Step 2 — Create a virtual environment

python -m venv venv
source venv/bin/activate

Step 3 — Install dependencies

pip install -r requirements.txt

Note (PyTorch with CUDA): If torch installs without GPU support, install it separately first:

pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Step 4 — Configure your environment

cp .env.example .env

Now open .env in a text editor:

nano .env

Update only these values — choose any username and password you want (you will create this user in PostgreSQL during Step 5):

# Change these to YOUR database credentials:
DB_USER=your_postgres_username
DB_PASSWORD=your_postgres_password

# These defaults usually work — only change if your setup is different:
DB_HOST=localhost
DB_PORT=5432
DB_NAME=sas_translator

Save and exit: Ctrl+X, then Y, then Enter.

Example: If you set DB_USER=sarvesh and DB_PASSWORD=sarvesh14, you will run CREATE USER sarvesh WITH PASSWORD 'sarvesh14' SUPERUSER; in Step 5.

All other values in .env work with defaults. No code changes are needed anywhere.
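For reference, .env follows the plain KEY=VALUE format. The project reads these values through its own configuration layer (backend/config.py); the parse_env helper below is only a sketch of the file format, not the project's loader:

```python
def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings
```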

Step 5 — Set up the database

Install PostgreSQL + pgvector (if not already installed):

# Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y postgresql postgresql-16-pgvector

# Start PostgreSQL:
sudo pg_ctlcluster 16 main start
# Or: sudo service postgresql start

Verify PostgreSQL is running:

sudo pg_isready

Should show: /var/run/postgresql:5432 - accepting connections

Create a database user and the database:

Replace your_db_user and your_db_password with the same values you set in .env (Step 4):

sudo -u postgres psql -c "CREATE USER your_db_user WITH PASSWORD 'your_db_password' SUPERUSER;"
sudo -u postgres psql -c "CREATE DATABASE sas_translator OWNER your_db_user;"
sudo -u postgres psql -d sas_translator -c "CREATE EXTENSION IF NOT EXISTS vector;"
sudo -u postgres psql -d sas_translator -f database/migrations/001_initial_schema.sql

Example (if your .env has DB_USER=sarvesh14 and DB_PASSWORD=sarvesh14):

sudo -u postgres psql -c "CREATE USER sarvesh14 WITH PASSWORD 'sarvesh14' SUPERUSER;"
sudo -u postgres psql -c "CREATE DATABASE sas_translator OWNER sarvesh14;"
sudo -u postgres psql -d sas_translator -c "CREATE EXTENSION IF NOT EXISTS vector;"
sudo -u postgres psql -d sas_translator -f database/migrations/001_initial_schema.sql

Alternative — Automated setup (if you have password-based PostgreSQL auth):

bash scripts/setup_database.sh

This reads credentials from .env automatically. If it asks for a password, use the manual method above instead.

Step 6 — Load training data into the database

Option A — Load pre-built SQL dump (faster, includes embeddings):

sudo -u postgres psql -d sas_translator -f database/code_translations.sql

Option B — Generate fresh embeddings (takes ~30 seconds):

python scripts/load_data.py

Both options give you the same result: 164 SAS-to-Python examples with vector embeddings ready for RAG retrieval. Verify with:

sudo -u postgres psql -d sas_translator -c "SELECT COUNT(*) FROM code_translations;"

Should show 164.

Step 7 — Download model weights

Models are not included in the repository (~31 GB total). You have three options:

Option A — Download fine-tuned models (recommended, same results as live demo):

Download the fine-tuned model weights (~26 GB) from Google Drive:

https://drive.google.com/file/d/1CO4jdgd9uhiNW4075TBehrwV6kwmf4uI/view

After downloading SAS2Py_model_weights.zip, extract the model folders:

unzip SAS2Py_model_weights.zip
cp -r backend/models/qwen_v6_merged models/
cp -r backend/models/deepseek_merged models/
cp -r backend/models/llama_8b_merged models/
rm -rf backend/models

Option B — Download base models from Hugging Face (requires re-training):

python scripts/download_models.py --model qwen-1.5b   # ~3 GB
python scripts/download_models.py --model all          # ~31 GB (all 3)

These are the original base models without fine-tuning applied. You will need to run the training scripts to get comparable results. See docs/REPRODUCIBILITY.md.

Option C — Start with Qwen only (quickest, ~3 GB):

If you just want to test the system, download only the Qwen model (Option A or B). It is the smallest (2.9 GB) and loads fastest.

After downloading, your models/ directory should look like:

models/
├── qwen_v6_merged/
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── deepseek_merged/          # (optional, 13 GB)
└── llama_8b_merged/          # (optional, 15 GB)

Step 8 — Start the backend

cd SAS2Py
source venv/bin/activate
uvicorn backend.main:app --host 0.0.0.0 --port 8000

Wait until you see API ready! in the logs (~10-30 seconds for model loading). Keep this terminal open.

Verify in another terminal:

curl http://localhost:8000/health

Step 9 — Deploy the web interface (Docker + Cloudflare Tunnel)

The backend runs on your GPU server, but users need a web interface to interact with it. The frontend (Streamlit) runs inside a Docker container and connects to the backend through a Cloudflare Tunnel.

Why a tunnel? The Docker container is an isolated environment. Inside the container, localhost means the container itself — not your server. So the frontend inside Docker cannot reach the backend at localhost:8000. The Cloudflare Tunnel creates a public URL that forwards to your backend, and the container uses that URL to communicate.

How it works:

User's Browser
    |
    v
Docker Container (Streamlit frontend on port 8501)
    |  reads SAS2PY_API environment variable
    |  sends translation requests to that URL
    v
Cloudflare Tunnel (public URL)
    |  forwards requests to localhost:8000
    v
GPU Server (FastAPI backend on port 8000)
    |  loads AI model, queries database
    v
Returns Python translation

What is SAS2PY_API? It is an environment variable that tells the frontend where the backend API is running. In the code (frontend/api_client.py):

API_BASE = os.getenv("SAS2PY_API", "http://localhost:8000")

When running inside Docker, you must set this to the tunnel URL. Otherwise the frontend cannot reach the backend.
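Concretely, every backend call is built from that base URL. Only the API_BASE line above comes from frontend/api_client.py; the api_url helper here is a hypothetical illustration of the lookup:

```python
import os

def api_url(path: str) -> str:
    """Join SAS2PY_API (or the localhost fallback) with an endpoint path.
    Hypothetical helper -- only the API_BASE lookup mirrors the repo."""
    base = os.getenv("SAS2PY_API", "http://localhost:8000").rstrip("/")
    return f"{base}/{path.lstrip('/')}"

# With SAS2PY_API set to the tunnel URL, api_url("/api/translate")
# resolves to https://<tunnel>.trycloudflare.com/api/translate.
```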

Step 9.1 — Install Cloudflare Tunnel on the GPU server:

Run these commands on the GPU server (where the backend is running):

sudo wget -q https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64 \
  -O /usr/local/bin/cloudflared && sudo chmod +x /usr/local/bin/cloudflared

Step 9.2 — Start the tunnel (on the GPU server):

nohup cloudflared tunnel --url http://localhost:8000 \
  --proxy-connect-timeout 600s \
  --proxy-keepalive-timeout 600s \
  > /tmp/sas2py_tunnel.log 2>&1 &

sleep 10
grep -o 'https://[a-z0-9-]*\.trycloudflare\.com' /tmp/sas2py_tunnel.log

Save the URL that appears (e.g., https://random-words.trycloudflare.com). You need it in Step 9.4.

Why 600s timeout? Model switching (loading Llama 8B from disk) can take 2+ minutes. The default timeout would drop the connection before the switch completes.

Step 9.3 — Build the frontend Docker image:

On any machine with Docker installed (your laptop, the GPU server, or a separate server):

docker build -t sas2py-frontend frontend/

Step 9.4 — Run the frontend container:

Replace YOUR_TUNNEL_URL with the URL from Step 9.2:

docker run -d \
  --name sas2py-frontend \
  -p 8501:8501 \
  -e SAS2PY_API=YOUR_TUNNEL_URL \
  --restart unless-stopped \
  sas2py-frontend

Example:

docker run -d \
  --name sas2py-frontend \
  -p 8501:8501 \
  -e SAS2PY_API=https://random-words.trycloudflare.com \
  --restart unless-stopped \
  sas2py-frontend

Using Portainer.io? Instead of docker run, you can deploy via Portainer:

  1. Go to Stacks and click Add Stack
  2. Paste the contents of docker-compose.yml
  3. Add environment variable: SAS2PY_API = your tunnel URL
  4. Click Deploy the stack

Portainer is a web-based GUI for managing Docker containers. It is optional — the docker run command above does the same thing.

Step 9.5 — Open the application:

Open your browser and go to:

  • If Docker runs on your local machine: http://localhost:8501
  • If Docker runs on a remote server: http://<server-ip>:8501 (replace <server-ip> with the actual IP address of the server running Docker)

You should see the SAS2Py translator interface with:

  • SAS code input area
  • Model selector in the sidebar (Qwen 1.5B, DeepSeek 6.7B, Llama 3.1 8B)
  • "Backend Online" indicator in the sidebar
  • Python code output area

Try translating:

proc sort data=mydata; by name; run;
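A correct translation of that one-liner looks like the pandas snippet below. This is illustrative of what the system aims to produce; the exact output depends on the selected model:

```python
import pandas as pd

# SAS: proc sort data=mydata; by name; run;
# pandas equivalent: sort the frame by the BY variable and reset the index.
mydata = pd.DataFrame({"name": ["Carol", "Alice", "Bob"],
                       "score": [3, 1, 2]})
mydata = mydata.sort_values(by="name").reset_index(drop=True)
```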

Note on model switching: When switching between models in the sidebar, wait for the switch to complete before clicking another model. Loading DeepSeek 6.7B takes ~2 minutes and Llama 3.1 8B takes ~3 minutes on first load (faster on subsequent loads). The Qwen 1.5B model loads in ~10 seconds.

When the tunnel URL changes:

The Cloudflare quick tunnel generates a new URL each time it restarts. When this happens:

  1. Start a new tunnel (repeat Step 9.2)
  2. Stop the old container: docker stop sas2py-frontend && docker rm sas2py-frontend
  3. Start a new container with the new URL (repeat Step 9.4)

The tunnel stays alive as long as the server is running and the process is not killed. It only changes on server restart.

Want a permanent URL? Set up a Cloudflare named tunnel with your own domain. See docs/DEPLOYMENT.md for details.

Local development on a machine with a GPU? If backend, frontend, and browser are all on the same machine, you can skip Steps 9.1-9.5 entirely. Just run streamlit run frontend/app.py in a second terminal and open http://localhost:8501 in your browser. No Docker or tunnel needed — the frontend defaults to http://localhost:8000 which already points to the backend.


Verify everything works

# Check backend health
curl http://localhost:8000/health

# Open Swagger API docs
# Visit: http://localhost:8000/docs

Running Tests

# Unit tests (no GPU, no database needed — runs anywhere)
python -m pytest tests/unit/ -v

# Integration tests (needs database connection)
python -m pytest tests/integration/ -v

# End-to-end tests (needs running backend on port 8000)
SAS2PY_TEST_URL=http://localhost:8000 python -m pytest tests/e2e/ -v

# All tests
python -m pytest tests/ -v

Expected results: 46 unit + 15 integration + 14 e2e = 75 tests total.

See tests/README.md for details on what each test suite covers.


What you need to change (summary)

| What | Where | Example |
| --- | --- | --- |
| Database username | .env (DB_USER) | DB_USER=postgres |
| Database password | .env (DB_PASSWORD) | DB_PASSWORD=mypassword |
| Database port (if non-default) | .env (DB_PORT) | DB_PORT=5432 |
| Nothing else | All other config has working defaults | |

You do NOT need to change any Python code, any file paths, or any imports. Everything is configured through the .env file.


Repository Structure

SAS2Py/
├── backend/                     # FastAPI backend (Layer 2 + 3)
│   ├── main.py                  # Application entry point
│   ├── config.py                # Centralized configuration
│   ├── api/routes.py            # REST API endpoints
│   ├── schemas/                 # Pydantic request/response models
│   ├── services/                # Business logic services
│   │   ├── translation_service.py   # Pipeline orchestrator
│   │   ├── model_service.py         # Multi-model management
│   │   ├── rag_service.py           # Hybrid RAG retrieval
│   │   ├── post_processor.py        # 8-step output cleaning
│   │   ├── validator.py             # Python syntax validation
│   │   └── dependency_analyzer.py   # SAS dependency detection
│   └── database/                # Database connection and queries
│
├── frontend/                    # Streamlit web interface
│   ├── app.py                   # Main UI application
│   ├── api_client.py            # Backend API connector
│   └── Dockerfile               # Frontend container config
│
├── docker-compose.yml           # Docker deployment for frontend
│
├── models/                      # AI model weights (git-ignored)
│   ├── configs/                 # Model configuration files
│   └── model_cards/             # Per-model documentation
│
├── training/                    # Training data and configs
│   ├── datasets/                # SAS→Python translation pairs
│   │   ├── training.json        # 164 training examples
│   │   ├── test.json            # 20 test examples
│   │   └── validation.json      # 31 validation examples
│   ├── scripts/                 # Model fine-tuning scripts
│   │   └── train_qwen_v6.py     # Qwen 1.5B LoRA training
│   └── hyperparameters/         # Training configurations and progression
│
├── benchmarks/                  # Benchmark evaluation data
│   ├── sas_programs/            # Test SAS inputs
│   └── python_reference/        # Expected Python outputs
│
├── experiments/                 # Experiment tracking
│   ├── experiment_logs/         # Training run logs
│   └── experiment_configs/      # Hyperparameter variations
│
├── results/                     # Evaluation results
│   ├── evaluation_metrics/      # CodeBLEU, syntax scores
│   └── comparison_tables/       # Model comparison data
│
├── figures/                     # Diagrams and plots
│   ├── architecture_diagrams/   # 6 diagrams showing architecture evolution
│   └── evaluation_plots/
│
├── research/                    # Research artefacts
│   ├── related_papers/          # Reference papers (LoRA, RAG, CodeBLEU, etc.)
│   ├── competitor_analysis/     # Competitor tools comparison
│   └── README.md                # Literature review and competitor analysis
│
├── notebooks/                   # Jupyter notebooks
├── resources/                   # Project artefacts
│   ├── presentations/           # Meeting and demo presentations
│   ├── meeting_minutes/         # Meeting notes
│   └── demo_videos/             # Demo video links (hosted on Google Drive)
├── database/migrations/         # SQL schema files
├── scripts/                     # Setup and utility scripts
├── tests/                       # Test suite (unit, integration, e2e)
└── docs/                        # Documentation
    ├── ARCHITECTURE.md
    ├── SETUP.md
    ├── DEPLOYMENT.md
    ├── API.md
    ├── MODEL_CARD.md
    ├── DATASET.md
    ├── LEARNINGS.md
    ├── TROUBLESHOOTING.md
    └── FAQ.md

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Health check and model status |
| POST | /api/translate | Translate SAS code to Python |
| GET | /api/models | List available models |
| POST | /api/switch-model | Switch the active model |
| POST | /api/feedback | Submit translation feedback |
| GET | /api/stats | System statistics |
| GET | /docs | Interactive Swagger UI |

See docs/API.md for complete API documentation.
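As a sketch of a client call, the request below is built with the standard library. The field names "sas_code" and "model" are assumptions for illustration; check docs/API.md for the actual request schema:

```python
import json
import urllib.request

def build_translate_request(base_url: str, sas_code: str,
                            model: str = "qwen-1.5b") -> urllib.request.Request:
    """Build a POST /api/translate request. Payload field names are
    illustrative -- see docs/API.md for the real schema."""
    payload = json.dumps({"sas_code": sas_code, "model": model}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/translate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_translate_request(
    "http://localhost:8000", "proc sort data=mydata; by name; run;")
# urllib.request.urlopen(req) sends it once the backend is running.
```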


Dataset

The training dataset contains 164 curated SAS→Python translation pairs derived from real-world Deutsche Bank analytics workflows. Examples span:

  • PROC SQL queries, PROC SORT, PROC EXPORT/IMPORT
  • DATA step operations with merges and transformations
  • SAS macro definitions and macro variable handling
  • Remote execution wrappers (rsubmit/endrsubmit)

All examples are normalized to remove infrastructure wrappers before embedding generation. See docs/DATASET.md for dataset versioning, format details, and category distribution.
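The wrapper-stripping step can be pictured as follows. The exact rules live in the project's data-preparation scripts; the regexes below are an illustrative sketch of removing remote-execution scaffolding so embeddings capture the analytic logic rather than the infrastructure around it:

```python
import re

def strip_wrappers(sas: str) -> str:
    """Remove rsubmit/endrsubmit and signon/signoff lines (illustrative)."""
    sas = re.sub(r"(?im)^\s*(rsubmit|endrsubmit)\s*;\s*$", "", sas)
    sas = re.sub(r"(?im)^\s*(signon|signoff)\b[^;]*;\s*$", "", sas)
    # Drop the blank lines left where wrappers were removed.
    return "\n".join(l for l in sas.splitlines() if l.strip())
```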


Deployed Version

A live deployment of SAS2Py is available at:

https://sas2py-translator.app.data-lab.site/

This instance runs on the SRH University compute server (NVIDIA H200 NVL, 141 GB VRAM) with all three models available for hot-swapping.

Note: Availability depends on the university server being online. If the link is down, follow the installation instructions above to run a local instance.


Technology Stack

| Component | Technology |
| --- | --- |
| Backend API | FastAPI, Uvicorn |
| Frontend | Streamlit |
| Database | PostgreSQL 16, pgvector |
| LLMs | Qwen2.5-Coder, DeepSeek Coder, Llama 3.1 |
| Embeddings | Sentence Transformers (all-mpnet-base-v2) |
| ML Framework | PyTorch, Transformers, PEFT |
| Deployment | Docker, Portainer.io, Cloudflare Tunnel |
| Testing | pytest |

Documentation

| Document | Description |
| --- | --- |
| docs/ARCHITECTURE.md | System architecture and translation pipeline |
| docs/SETUP.md | Detailed installation guide |
| docs/DEPLOYMENT.md | Docker and Portainer deployment guide |
| docs/API.md | REST API endpoint documentation |
| docs/MODEL_CARD.md | Model details, training, and performance |
| docs/DATASET.md | Dataset versioning and format |
| docs/LEARNINGS.md | Key learnings and future scope |
| docs/TROUBLESHOOTING.md | Problems faced and solutions |
| docs/FAQ.md | Frequently asked questions |
| docs/REPRODUCIBILITY.md | Step-by-step guide to replicate training and results |
| docs/LIMITATIONS.md | Known limitations across dataset, models, RAG, infrastructure |
| training/hyperparameters/README.md | Exact training parameters with progression |
| figures/architecture_diagrams/README.md | Architecture evolution (6 diagrams) |
| research/README.md | Related papers and competitor analysis |
| resources/demo_videos/README.md | Demo video links |

Demo Videos

Video demonstrations are hosted on Google Drive. See resources/demo_videos/README.md for links.


Research and Competitor Analysis

See research/README.md for:

  • Related papers (LoRA, RAG, CodeBLEU, Sentence-BERT, StarCoder)
  • Competitor analysis (SAS CodeConverter, SPROCKET, ChatGPT prompting, saspy)
  • How SAS2Py differs from existing approaches

Team

| Name | Role |
| --- | --- |
| Sarvesh Uday Mokal | Machine Learning Engineer, MLOps & Backend Developer |
| Swathi Chandrashekargiri | Machine Learning Engineer, Research Analyst |
| Kotrashetti Basavraj C. | Frontend Developer, QA Engineer |
| Tanuj Agarwal | Machine Learning Engineer, Data Curator |
| Akshata Hate | Data Analyst & QA |

Supervisors:

Industry Partner: Deutsche Bank AG

See CONTRIBUTING.md for detailed contributions per team member.


License

This project is developed as part of an academic program at SRH University Heidelberg. See LICENSE for details.


Contributing

See CONTRIBUTING.md for guidelines on how to contribute to this project.
