StatLang

An open-source, Python-based statistical scripting language

Write and run statistical scripts with full syntax highlighting and a Python backend.

Overview

StatLang provides an open-source environment for statistical analysis by offering:

  • Expressive scripting syntax for data manipulation and analysis
  • Python backend for execution and performance
  • Jupyter notebook support with a StatLang kernel
  • VS Code extension with syntax highlighting and execution
  • Cross-platform compatibility (Windows, macOS, Linux)
  • Open source and free to use

🌟 What Makes StatLang Special?

  • 🤖 AI Integration: Built-in PROC LANGUAGE with LLM capabilities for intelligent data analysis
  • 🧠 Complete ML Pipeline: From data exploration to model deployment using familiar, concise syntax
  • 💾 Modern SQL: PROC SQL powered by DuckDB for high-performance data querying
  • 🔧 Robust language features: Macro system, format system, and statistical procedures
  • 📊 Rich Visualizations: Professional output formatting with TITLE statements and structured results

Features

Core Interpreter

  • Scripting-based DATA step functionality with inline data support
  • Statistical procedures (MEANS, FREQ, SORT, PRINT)
  • Concise data manipulation and analysis syntax
  • Python pandas/numpy backend for performance
  • Clean, professional output with familiar formatting

Jupyter Notebook Support

  • StatLang kernel for Jupyter notebooks
  • Interactive statistical programming in notebook environment
  • Rich output display with formatted tables
  • Dataset visualization and exploration

VS Code Extension

  • Syntax highlighting for .statlang files
  • Code snippets for common statistical analysis patterns
  • File execution directly from VS Code
  • Notebook support for interactive analysis

Supported Features

📊 Statistical Procedures

  • PROC MEANS: Descriptive statistics with CLASS variables and OUTPUT statements
  • PROC FREQ: Frequency tables and cross-tabulations with options
  • PROC SORT: Data sorting with ascending/descending order
  • PROC PRINT: Data display and formatting
  • PROC REG: Linear regression analysis with MODEL, OUTPUT, and SCORE statements
  • PROC UNIVARIATE: Detailed univariate analysis with distribution diagnostics
  • PROC CORR: Correlation analysis (Pearson, Spearman)
  • PROC FACTOR: Principal component analysis and factor analysis
  • PROC CLUSTER: Clustering methods (k-means, hierarchical)
  • PROC NPAR1WAY: Nonparametric tests (Mann-Whitney, Kruskal-Wallis)
  • PROC TTEST: T-tests (independent and paired)
  • PROC LOGIT: Logistic regression modeling
  • PROC TIMESERIES: Time series analysis and seasonal decomposition
  • PROC SURVEYSELECT: Random sampling with SRS method, SAMPRATE/N options, and OUTALL flag
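
As a rough illustration of what the PROC FREQ entry above refers to, the following plain-Python sketch computes a one-way frequency table and a two-way cross-tabulation. It is not StatLang's implementation; the sample rows, column names, and the helper names `one_way_freq` and `cross_tab` are all hypothetical.

```python
from collections import Counter

# Hypothetical sample rows; the column names are made up for illustration.
rows = [
    {"department": "Engineering", "level": "Senior"},
    {"department": "Engineering", "level": "Junior"},
    {"department": "Marketing", "level": "Senior"},
    {"department": "Engineering", "level": "Senior"},
]


def one_way_freq(rows, var):
    """Return {value: (count, percent)} for each distinct value of `var`."""
    counts = Counter(row[var] for row in rows)
    total = sum(counts.values())
    return {value: (n, 100.0 * n / total) for value, n in counts.items()}


def cross_tab(rows, row_var, col_var):
    """Return cell counts keyed by (row value, column value)."""
    return Counter((row[row_var], row[col_var]) for row in rows)


freq = one_way_freq(rows, "department")
table = cross_tab(rows, "department", "level")
print(freq)
print(dict(table))
```

A frequency procedure would normally add formatted output and options on top of counts like these; the sketch covers only the counting step.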

🤖 Machine Learning Procedures

  • PROC TREE: Decision trees for classification and regression
  • PROC FOREST: Random forests for ensemble learning
  • PROC BOOST: Gradient boosting for advanced modeling

💻 Advanced Features

  • PROC SQL: SQL query processing with DuckDB backend
  • PROC LANGUAGE: Built-in LLM integration for text generation, Q&A, and data analysis
  • Macro System: Complete macro facility with %MACRO/%MEND, %LET, & substitution, %PUT, %IF/%THEN/%ELSE, %DO/%END
  • Format System: Built-in date/time, numeric, and currency formats with metadata persistence
  • TITLE Statements: Professional output formatting
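
To give a feel for the %LET and & substitution items in the macro bullet above, here is a toy Python sketch of macro-variable expansion. It is only an illustration under simplifying assumptions (all %LET statements are collected first, there is no %MACRO, %IF, or %DO handling) and is not how StatLang's macro facility is implemented. The function name `expand_macro_vars` is hypothetical.

```python
import re


def expand_macro_vars(source):
    """Toy expansion of `%let name = value;` assignments and `&name` references."""
    variables = {}

    def record_let(match):
        # Store the value under a case-insensitive name.
        variables[match.group(1).lower()] = match.group(2).strip()
        return ""  # the %LET statement itself contributes no output text

    # Collect every %LET statement and strip it from the script.
    body = re.sub(
        r"%let\s+(\w+)\s*=\s*([^;]*);\s*",
        record_let,
        source,
        flags=re.IGNORECASE,
    )

    def substitute(match):
        name = match.group(1).lower()
        # Unknown references are left untouched rather than guessed at.
        return variables.get(name, match.group(0))

    # Replace `&name` (with an optional trailing dot) by its stored value.
    return re.sub(r"&(\w+)\.?", substitute, body)


script = """%let dsn = work.employees;
%let var = salary;
proc means data=&dsn;
    var &var;
run;
"""
expanded = expand_macro_vars(script)
print(expanded)
```

The expanded text is what would then be handed to the rest of the interpreter, which is the general idea behind text-substitution macro systems.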

🔧 Core Data Processing

  • DATA Steps: Variable creation, conditional logic, DATALINES input
  • Macro variables: %LET, %PUT statements
  • Libraries: LIBNAME functionality
  • NOPRINT option: Silent execution for procedures

Installation

Python Package

pip install statlang

Jupyter Kernel Installation

# Install the StatLang kernel
python -m statlang.kernel install

# List available kernels
jupyter kernelspec list

VS Code Extension

  1. Install from VS Code Marketplace: "StatLang" by RyanBlakeStory
  2. Or install from source (see Development section)

🚀 Exciting New Features

🤖 LANGUAGE - AI-Powered Analysis

proc language prompt="Analyze the correlation between income and spending in our dataset";
run;

Built-in LLM integration for text generation, Q&A, and intelligent data analysis using Hugging Face transformers!

🧠 Complete Machine Learning Workflow

Check out our ML Project Demo - a comprehensive regression analysis project showcasing:

  • PROC UNIVARIATE for distribution exploration
  • PROC SURVEYSELECT for train/test splitting
  • PROC REG with MODEL, OUTPUT, and SCORE statements
  • Macro system for reusable analysis workflows
  • Complete ML pipeline in pure StatLang syntax
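
For readers unfamiliar with the fit-then-score pattern that the PROC REG bullet describes, the sketch below shows the underlying arithmetic for a one-predictor least-squares fit in plain Python. It is an assumption-laden illustration, not StatLang's PROC REG: the data are invented, and the helpers `fit_simple_ols` and `score` are hypothetical names.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def score(intercept, slope, x_new):
    """Apply the fitted line to new predictor values."""
    return [intercept + slope * xi for xi in x_new]


# Hypothetical training data lying exactly on y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_ols(train_x, train_y)
predictions = score(intercept, slope, [5.0, 6.0])
print(intercept, slope, predictions)
```

Fitting on one dataset and scoring another is the same split of responsibilities that the MODEL and SCORE statements suggest.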

💾 SQL - Modern Data Querying

proc sql;
  select age, income, spend,
         case when income > 60000 then 'High' else 'Low' end as income_group
  from work.customers
  where age between 25 and 50
  order by income desc;
quit;

DuckDB-powered SQL processing with full dataset integration!
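
To show what the query above returns, here is a small sketch that runs the same SELECT in Python. It uses the standard-library `sqlite3` module purely as a stand-in so the example runs without extra installs; StatLang itself is described as using DuckDB. The table is named `customers` rather than `work.customers`, and the four sample rows are invented.

```python
import sqlite3

# In-memory database with a few hypothetical customer rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (age INTEGER, income INTEGER, spend INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (23, 40000, 1200),  # filtered out: age below 25
        (30, 72000, 3100),
        (45, 58000, 2000),
        (61, 90000, 4000),  # filtered out: age above 50
    ],
)

# Same logic as the PROC SQL example: filter by age, label income, sort.
query = """
    SELECT age, income, spend,
           CASE WHEN income > 60000 THEN 'High' ELSE 'Low' END AS income_group
    FROM customers
    WHERE age BETWEEN 25 AND 50
    ORDER BY income DESC
"""
result = conn.execute(query).fetchall()
conn.close()

for row in result:
    print(row)
```

Only the two customers aged 25 to 50 survive the filter, ordered from highest to lowest income.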

Quick Start

1. Interactive Python Usage

from statlang import StatLangInterpreter

# Create interpreter
interpreter = StatLangInterpreter()

# Create sample data using StatLang syntax
interpreter.run_code('''
data work.employees;
    input employee_id name $ department $ salary;
    datalines;
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
;
run;
''')

# Run statistical analysis
interpreter.run_code('''
proc means data=work.employees;
    class department;
    var salary;
run;
''')
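
For comparison, the following plain-Python sketch reproduces the kind of numbers the PROC MEANS step above is expected to summarize: it parses the same four data lines and computes a few statistics per department. This is an illustration only. The exact statistics and layout StatLang prints are not shown in this README, and the helpers `read_employees` and `class_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# The same rows as the DATALINES block above.
datalines = """\
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
"""


def read_employees(text):
    """Parse whitespace-delimited lines into dictionaries."""
    records = []
    for line in text.strip().splitlines():
        employee_id, name, department, salary = line.split()
        records.append(
            {
                "employee_id": int(employee_id),
                "name": name,
                "department": department,
                "salary": float(salary),
            }
        )
    return records


def class_means(records, class_var, analysis_var):
    """Group by `class_var` and summarize `analysis_var` within each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[class_var]].append(record[analysis_var])
    return {
        level: {"n": len(v), "mean": mean(v), "min": min(v), "max": max(v)}
        for level, v in groups.items()
    }


summary = class_means(read_employees(datalines), "department", "salary")
for level, stats in summary.items():
    print(level, stats)
```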

2. Jupyter Notebook Usage

  1. Install the StatLang kernel:
    python -m statlang.kernel install
  2. Create a new Jupyter notebook (.ipynb)
  3. Select "statlang" as the kernel
  4. Write StatLang code in cells and execute

3. VS Code Usage

  1. Install the StatLang extension from the marketplace
  2. Create a new file with .statlang extension
  3. Write your StatLang code
  4. Use Ctrl+Shift+P → "StatLang: Run File" to execute

4. Command Line Usage

# Run StatLang code from file
python -m statlang.cli run example.statlang

# Interactive mode
python -m statlang.cli interactive

📚 Examples & Demos

🎯 Complete ML Project

ML Project Demo - A comprehensive machine learning workflow:

  • Synthetic dataset creation with 30 observations
  • PROC UNIVARIATE for distribution analysis
  • PROC SURVEYSELECT for train/test splitting (70/30)
  • PROC REG with MODEL, OUTPUT, and SCORE statements
  • Macro-based reusable analysis functions
  • Complete regression analysis pipeline
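
The 70/30 split on 30 observations mentioned above can be pictured with a short simple-random-sampling sketch in plain Python. It is a hedged illustration rather than PROC SURVEYSELECT itself: the function name `srs_split`, the seed, and the rule of rounding the sample size to the nearest integer are all assumptions.

```python
import random


def srs_split(observations, samprate, seed):
    """Simple random sample without replacement; returns (selected, remaining)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    n_selected = round(len(observations) * samprate)  # rounding rule is an assumption
    selected_idx = set(rng.sample(range(len(observations)), n_selected))
    selected = [obs for i, obs in enumerate(observations) if i in selected_idx]
    remaining = [obs for i, obs in enumerate(observations) if i not in selected_idx]
    return selected, remaining


# 30 placeholder observation IDs standing in for the demo dataset.
observations = list(range(1, 31))
train, test = srs_split(observations, samprate=0.7, seed=42)
print(len(train), len(test))
```

With 30 observations and a sampling rate of 0.7, this yields 21 training rows and 9 test rows, and every observation lands in exactly one of the two sets.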

📊 Comprehensive Walkthrough

StatLang Walkthrough - Complete feature demonstration:

  • All statistical procedures with examples
  • Macro system demonstrations
  • Format system usage
  • Advanced data manipulation techniques
  • Real-world analysis scenarios

Project Structure

StatLang/
├── stat_lang/                # Core Python package
│   ├── __init__.py
│   ├── interpreter.py        # Main statistical interpreter
│   ├── cli.py                # Command line interface
│   ├── kernel/               # Jupyter kernel implementation
│   │   ├── statlang_kernel.py   # Main kernel
│   │   └── install.py        # Kernel installation
│   ├── parser/               # Syntax parser
│   │   ├── data_step_parser.py
│   │   ├── proc_parser.py
│   │   └── macro_parser.py
│   ├── procs/                # Statistical procedure implementations
│   │   ├── proc_means.py
│   │   ├── proc_freq.py
│   │   ├── proc_sort.py
│   │   └── proc_print.py
│   └── utils/                # Utility functions
│       ├── expression_evaluator.py
│       ├── data_utils.py
│       └── libname_manager.py
├── vscode-extension/         # VS Code extension
├── examples/                 # Example files and demo notebook
├── media/                    # Logo and icons
├── setup.py                  # Package setup
└── README.md

Development

Setup Development Environment

git clone https://github.com/ryan-story/StatLang.git
cd StatLang
pip install -e .

Running Tests

# Smoke test: confirm the package imports cleanly
python -c "from statlang import StatLangInterpreter; print('StatLang loaded successfully')"

Key Features Implemented

✅ Completed Features

  • Core DATA step implementation with DATALINES
  • Statistical procedures with CLASS variables and OUTPUT statements
  • Frequency analysis with cross-tabulations and options
  • Data sorting with ascending/descending order
  • Data display and formatting
  • Linear regression analysis with PROC REG
  • Random sampling with PROC SURVEYSELECT
  • Silent execution options
  • Jupyter notebook kernel
  • VS Code extension with syntax highlighting
  • Clean, professional output
  • Concise behavior and syntax

🚧 Future Enhancements

  • Additional statistical procedures (advanced regression, etc.)
  • Advanced macro functionality
  • Performance optimizations
  • Enhanced data connectivity options

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

  • Additional statistical procedures
  • Macro functionality enhancements
  • Performance optimizations
  • VS Code extension features
  • Documentation and examples

License

MIT License - see LICENSE for details.

Support
