StatLang

An open-source, Python-based statistical scripting language

Write and run statistical scripts with full syntax highlighting and a Python backend.

Overview

StatLang provides an open-source environment for statistical analysis by offering:

Expressive scripting syntax for data manipulation and analysis
Python backend for execution and performance
Jupyter notebook support with a StatLang kernel
VS Code extension with syntax highlighting and execution
Cross-platform compatibility (Windows, macOS, Linux)
Open source and free to use

🌟 What Makes StatLang Special?

🤖 AI Integration: Built-in PROC LANGUAGE with LLM capabilities for intelligent data analysis
🧠 Complete ML Pipeline: From data exploration to model deployment using familiar, concise syntax
💾 Modern SQL: PROC SQL powered by DuckDB for high-performance data querying
🔧 Robust language features: Macro system, format system, and statistical procedures
📊 Rich Visualizations: Professional output formatting with TITLE statements and structured results

Features

Core Interpreter

Scripting-based DATA step functionality with inline data support
Statistical procedures (MEANS, FREQ, SORT, PRINT)
Concise data manipulation and analysis syntax
Python pandas/numpy backend for performance
Clean, professional output with familiar formatting

Jupyter Notebook Support

StatLang kernel for Jupyter notebooks
Interactive statistical programming in notebook environment
Rich output display with formatted tables
Dataset visualization and exploration
VS Code Extension
Syntax highlighting for .statlang files
Code snippets for common statistical analysis patterns
File execution directly from VS Code
Notebook support for interactive analysis

Supported Features

📊 Statistical Procedures

PROC MEANS: Descriptive statistics with CLASS variables and OUTPUT statements
PROC FREQ: Frequency tables and cross-tabulations with options
PROC SORT: Data sorting with ascending/descending order
PROC PRINT: Data display and formatting
PROC REG: Linear regression analysis with MODEL, OUTPUT, and SCORE statements
PROC UNIVARIATE: Detailed univariate analysis with distribution diagnostics
PROC CORR: Correlation analysis (Pearson, Spearman)
PROC FACTOR: Principal component analysis and factor analysis
PROC CLUSTER: Clustering methods (k-means, hierarchical)
PROC NPAR1WAY: Nonparametric tests (Mann-Whitney, Kruskal-Wallis)
PROC TTEST: T-tests (independent and paired)
PROC LOGIT: Logistic regression modeling
PROC TIMESERIES: Time series analysis and seasonal decomposition
PROC SURVEYSELECT: Random sampling with SRS method, SAMPRATE/N options, and OUTALL flag

🤖 Machine Learning Procedures

PROC TREE: Decision trees for classification and regression
PROC FOREST: Random forests for ensemble learning
PROC BOOST: Gradient boosting for advanced modeling

💻 Advanced Features

PROC SQL: SQL query processing with DuckDB backend
PROC LANGUAGE: Built-in LLM integration for text generation, Q&A, and data analysis
Macro System: Complete macro facility with %MACRO/%MEND, %LET, & substitution, %PUT, %IF/%THEN/%ELSE, %DO/%END
Format System: Built-in date/time, numeric, and currency formats with metadata persistence
TITLE Statements: Professional output formatting

🔧 Core Data Processing

DATA Steps: Variable creation, conditional logic, DATALINES input
Macro variables: %LET, %PUT statements
Libraries: LIBNAME functionality
NOPRINT option: Silent execution for procedures

Installation

Python Package

pip install statlang

Jupyter Kernel Installation

# Install the StatLang kernel
python -m statlang.kernel install

# List available kernels
jupyter kernelspec list

VS Code Extension

Install from VS Code Marketplace: "StatLang" by RyanBlakeStory
Or install from source (see Development section)

🚀 Exciting New Features

🤖 LANGUAGE - AI-Powered Analysis

language prompt="Analyze the correlation between income and spending in our dataset";
run;

Built-in LLM integration for text generation, Q&A, and intelligent data analysis using Hugging Face transformers!

🧠 Complete Machine Learning Workflow

Check out our ML Project Demo - a comprehensive regression analysis project showcasing:

PROC UNIVARIATE for distribution exploration
PROC SURVEYSELECT for train/test splitting
PROC REG with MODEL, OUTPUT, and SCORE statements
Macro system for reusable analysis workflows
Complete ML pipeline in pure StatLang syntax

💾 SQL - Modern Data Querying

sql;
  select age, income, spend,
         case when income > 60000 then 'High' else 'Low' end as income_group
  from work.customers
  where age between 25 and 50
  order by income desc;
quit;

DuckDB-powered SQL processing with full dataset integration!

Quick Start

1. Interactive Python Usage

from statlang import StatLangInterpreter

# Create interpreter
interpreter = StatLangInterpreter()

# Create sample data using StatLang syntax
interpreter.run_code('''
data work.employees;
    input employee_id name $ department $ salary;
    datalines;
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
;
run;
''')

# Run statistical analysis
interpreter.run_code('''
proc means data=work.employees;
    class department;
    var salary;
run;
''')

2. Jupyter Notebook Usage

Install the StatLang kernel:
```
python -m statlang.kernel install
```
Create a new Jupyter notebook (.ipynb)
Select "statlang" as the kernel
Write StatLang code in cells and execute

3. VS Code Usage

Install the StatLang extension from the marketplace
Create a new file with .statlang extension
Write your StatLang code
Use Ctrl+Shift+P → "StatLang: Run File" to execute

4. Command Line Usage

# Run StatLang code from file
python -m statlang.cli run example.statlang

# Interactive mode
python -m statlang.cli interactive

📚 Examples & Demos

🎯 Complete ML Project

ML Project Demo - A comprehensive machine learning workflow:

Synthetic dataset creation with 30 observations
PROC UNIVARIATE for distribution analysis
PROC SURVEYSELECT for train/test splitting (70/30)
PROC REG with MODEL, OUTPUT, and SCORE statements
Macro-based reusable analysis functions
Complete regression analysis pipeline

📊 Comprehensive Walkthrough

StatLang Walkthrough - Complete feature demonstration:

All statistical procedures with examples
Macro system demonstrations
Format system usage
Advanced data manipulation techniques
Real-world analysis scenarios

Project Structure

StatLang/
├── stat_lang/                # Core Python package
│   ├── __init__.py
│   ├── interpreter.py        # Main statistical interpreter
│   ├── cli.py               # Command line interface
│   ├── kernel/              # Jupyter kernel implementation
│   │   ├── statlang_kernel.py   # Main kernel
│   │   └── install.py       # Kernel installation
│   ├── parser/              # Syntax parser
│   │   ├── data_step_parser.py
│   │   ├── proc_parser.py
│   │   └── macro_parser.py
│   ├── procs/               # Statistical procedure implementations
│   │   ├── proc_means.py
│   │   ├── proc_freq.py
│   │   ├── proc_sort.py
│   │   └── proc_print.py
│   └── utils/               # Utility functions
│       ├── expression_evaluator.py
│       ├── data_utils.py
│       └── libname_manager.py
├── vscode-extension/         # VS Code extension
├── examples/                # Example files and demo notebook
├── media/                   # Logo and icons
├── setup.py                 # Package setup
└── README.md

Development

Setup Development Environment

git clone https://github.com/ryan-story/StatLang.git
cd StatLang
pip install -e .

Running Tests

# Run basic functionality tests
python -c "from statlang import StatLangInterpreter; print('StatLang loaded successfully')"

Key Features Implemented

✅ Completed Features

🚧 Future Enhancements

Additional statistical procedures (SQL queries, advanced regression, etc.)
Advanced macro functionality
Performance optimizations
Enhanced data connectivity options

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

Additional statistical procedures
Macro functionality enhancements
Performance optimizations
VS Code extension features
Documentation and examples

License

MIT License - see LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
examples		examples
media		media
stat_lang		stat_lang
vscode-extension		vscode-extension
.gitignore		.gitignore
.gitignore.public		.gitignore.public
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
delete_open_sas.py		delete_open_sas.py
pyproject.toml		pyproject.toml
setup.py		setup.py

License

ryan-story/StatLang

Folders and files

Latest commit

History

Repository files navigation

StatLang

An open-source, Python-based statistical scripting language

Overview

🌟 What Makes StatLang Special?

Features

Core Interpreter

Jupyter Notebook Support

VS Code Extension

Supported Features

📊 Statistical Procedures

🤖 Machine Learning Procedures

💻 Advanced Features

🔧 Core Data Processing

Installation

Python Package

Jupyter Kernel Installation

VS Code Extension

🚀 Exciting New Features

🤖 LANGUAGE - AI-Powered Analysis

🧠 Complete Machine Learning Workflow

💾 SQL - Modern Data Querying

Quick Start

1. Interactive Python Usage

2. Jupyter Notebook Usage

3. VS Code Usage

4. Command Line Usage

📚 Examples & Demos

🎯 Complete ML Project

📊 Comprehensive Walkthrough

Project Structure

Development

Setup Development Environment

Running Tests

Key Features Implemented

✅ Completed Features

🚧 Future Enhancements

Contributing

Areas for Contribution

License

Support

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages