StatLang provides an open-source environment for statistical analysis by offering:
- Expressive scripting syntax for data manipulation and analysis
- Python backend for execution and performance
- Jupyter notebook support with a StatLang kernel
- VS Code extension with syntax highlighting and execution
- Cross-platform compatibility (Windows, macOS, Linux)
- Open source and free to use
- AI Integration: Built-in PROC LANGUAGE with LLM capabilities for intelligent data analysis
- Complete ML Pipeline: From data exploration to model deployment using familiar, concise syntax
- Modern SQL: PROC SQL powered by DuckDB for high-performance data querying
- Robust language features: Macro system, format system, and statistical procedures
- Rich Visualizations: Professional output formatting with TITLE statements and structured results
- Scripting-based DATA step functionality with inline data support
- Statistical procedures (MEANS, FREQ, SORT, PRINT)
- Concise data manipulation and analysis syntax
- Python pandas/numpy backend for performance
- Clean, professional output with familiar formatting
- StatLang kernel for Jupyter notebooks
- Interactive statistical programming in notebook environment
- Rich output display with formatted tables
- Dataset visualization and exploration
- Syntax highlighting for .statlang files
- Code snippets for common statistical analysis patterns
- File execution directly from VS Code
- Notebook support for interactive analysis
- PROC MEANS: Descriptive statistics with CLASS variables and OUTPUT statements
- PROC FREQ: Frequency tables and cross-tabulations with options
- PROC SORT: Data sorting with ascending/descending order
- PROC PRINT: Data display and formatting
- PROC REG: Linear regression analysis with MODEL, OUTPUT, and SCORE statements
- PROC UNIVARIATE: Detailed univariate analysis with distribution diagnostics
- PROC CORR: Correlation analysis (Pearson, Spearman)
- PROC FACTOR: Principal component analysis and factor analysis
- PROC CLUSTER: Clustering methods (k-means, hierarchical)
- PROC NPAR1WAY: Nonparametric tests (Mann-Whitney, Kruskal-Wallis)
- PROC TTEST: T-tests (independent and paired)
- PROC LOGIT: Logistic regression modeling
- PROC TIMESERIES: Time series analysis and seasonal decomposition
- PROC SURVEYSELECT: Random sampling with SRS method, SAMPRATE/N options, and OUTALL flag
- PROC TREE: Decision trees for classification and regression
- PROC FOREST: Random forests for ensemble learning
- PROC BOOST: Gradient boosting for advanced modeling
- PROC SQL: SQL query processing with DuckDB backend
- PROC LANGUAGE: Built-in LLM integration for text generation, Q&A, and data analysis
- Macro System: Complete macro facility with %MACRO/%MEND, %LET, & substitution, %PUT, %IF/%THEN/%ELSE, %DO/%END
- Format System: Built-in date/time, numeric, and currency formats with metadata persistence
- TITLE Statements: Professional output formatting
- DATA Steps: Variable creation, conditional logic, DATALINES input
- Macro variables: %LET, %PUT statements
- Libraries: LIBNAME functionality
- NOPRINT option: Silent execution for procedures
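To make the CLASS option of PROC MEANS concrete, here is a minimal Python sketch of the grouped summary it describes: one set of descriptive statistics per level of the class variable. It uses only the standard library and the four employee rows from this README's Python API example. The function name `means_by_class` and the choice of statistics (N, mean, min, max) are assumptions made for illustration; this is not StatLang's implementation and may not match its default output columns.

```python
from collections import defaultdict
from statistics import mean

# (department, salary) pairs taken from the README's employees example
EMPLOYEES = [
    ("Engineering", 75000),
    ("Marketing", 55000),
    ("Engineering", 80000),
    ("Sales", 45000),
]


def means_by_class(rows):
    """Group values by class level and summarize each group.

    Hypothetical helper: the statistics shown are illustrative only.
    """
    groups = defaultdict(list)
    for level, value in rows:
        groups[level].append(value)
    # Sort the class levels so the summary has a stable order
    return {
        level: {
            "n": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for level, values in sorted(groups.items())
    }


summary = means_by_class(EMPLOYEES)
for level, stats in summary.items():
    print(level, stats)
```

Each key of `summary` corresponds to one CLASS level, which is the shape to expect from a grouped PROC MEANS table.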
pip install statlang
# Install the StatLang kernel
python -m statlang.kernel install
# List available kernels
jupyter kernelspec list
- Install from VS Code Marketplace: "StatLang" by RyanBlakeStory
- Or install from source (see Development section)
proc language prompt="Analyze the correlation between income and spending in our dataset";
run;
Built-in LLM integration for text generation, Q&A, and intelligent data analysis using Hugging Face transformers!
Check out our ML Project Demo - a comprehensive regression analysis project showcasing:
- PROC UNIVARIATE for distribution exploration
- PROC SURVEYSELECT for train/test splitting
- PROC REG with MODEL, OUTPUT, and SCORE statements
- Macro system for reusable analysis workflows
- Complete ML pipeline in pure StatLang syntax
proc sql;
select age, income, spend,
case when income > 60000 then 'High' else 'Low' end as income_group
from work.customers
where age between 25 and 50
order by income desc;
quit;
DuckDB-powered SQL processing with full dataset integration!
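To show what the query above returns, the following sketch runs the same SELECT against a few made-up rows. It uses Python's built-in `sqlite3` module as a stand-in for DuckDB, so it illustrates the SQL semantics (BETWEEN filter, CASE expression, descending sort) rather than StatLang's actual engine. The sample rows and the `run_query` helper are hypothetical, and the `work` schema is emulated here by attaching a second in-memory database.

```python
import sqlite3

# Hypothetical sample rows: (age, income, spend)
ROWS = [
    (23, 72000, 3100),   # filtered out: age below 25
    (31, 58000, 2400),
    (45, 91000, 5200),
    (38, 60000, 2900),   # income is not > 60000, so it is labeled 'Low'
    (52, 120000, 7000),  # filtered out: age above 50
]

QUERY = """
select age, income, spend,
       case when income > 60000 then 'High' else 'Low' end as income_group
from work.customers
where age between 25 and 50
order by income desc
"""


def run_query(rows):
    """Load rows into work.customers and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        # Emulate a 'work' library with an attached in-memory database
        conn.execute("ATTACH DATABASE ':memory:' AS work")
        conn.execute(
            "CREATE TABLE work.customers (age INTEGER, income INTEGER, spend INTEGER)"
        )
        conn.executemany("INSERT INTO work.customers VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = run_query(ROWS)
for row in result:
    print(row)
```

Three of the five rows pass the age filter, and only the 91000 income is labeled 'High' because the comparison is strict.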
from statlang import StatLangInterpreter
# Create interpreter
interpreter = StatLangInterpreter()
# Create sample data using StatLang syntax
interpreter.run_code('''
data work.employees;
input employee_id name $ department $ salary;
datalines;
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
;
run;
''')
# Run statistical analysis
interpreter.run_code('''
proc means data=work.employees;
class department;
var salary;
run;
''')
- Install the StatLang kernel:
python -m statlang.kernel install
- Create a new Jupyter notebook (.ipynb)
- Select "statlang" as the kernel
- Write StatLang code in cells and execute
- Install the StatLang extension from the marketplace
- Create a new file with the .statlang extension
- Write your StatLang code
- Use Ctrl+Shift+P → "StatLang: Run File" to execute
# Run StatLang code from file
python -m statlang.cli run example.statlang
# Interactive mode
python -m statlang.cli interactive
ML Project Demo - A comprehensive machine learning workflow:
- Synthetic dataset creation with 30 observations
- PROC UNIVARIATE for distribution analysis
- PROC SURVEYSELECT for train/test splitting (70/30)
- PROC REG with MODEL, OUTPUT, and SCORE statements
- Macro-based reusable analysis functions
- Complete regression analysis pipeline
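The 70/30 train/test split in this demo can be pictured with a short Python sketch of simple random sampling without replacement. It keeps all 30 observations and marks each one with a selection flag, which is the idea behind an OUTALL-style output. The function `srs_split`, the `selected` field name, the seed, and the rounding of the sample size are all assumptions for illustration; PROC SURVEYSELECT in StatLang may differ in these details.

```python
import random


def srs_split(n_obs, samprate, seed):
    """Flag a simple random sample of the observations.

    Hypothetical helper: selects round(n_obs * samprate) observations
    without replacement and returns every observation with a 0/1 flag.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    n_selected = round(n_obs * samprate)
    chosen = set(rng.sample(range(n_obs), n_selected))
    return [{"obs": i + 1, "selected": int(i in chosen)} for i in range(n_obs)]


rows = srs_split(n_obs=30, samprate=0.7, seed=42)
train = [r for r in rows if r["selected"] == 1]
holdout = [r for r in rows if r["selected"] == 0]
print(len(train), len(holdout))
```

With 30 observations and a 0.7 sampling rate, 21 rows are flagged for training and 9 are left for testing.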
StatLang Walkthrough - Complete feature demonstration:
- All statistical procedures with examples
- Macro system demonstrations
- Format system usage
- Advanced data manipulation techniques
- Real-world analysis scenarios
StatLang/
├── stat_lang/                      # Core Python package
│   ├── __init__.py
│   ├── interpreter.py              # Main statistical interpreter
│   ├── cli.py                      # Command line interface
│   ├── kernel/                     # Jupyter kernel implementation
│   │   ├── statlang_kernel.py      # Main kernel
│   │   └── install.py              # Kernel installation
│   ├── parser/                     # Syntax parser
│   │   ├── data_step_parser.py
│   │   ├── proc_parser.py
│   │   └── macro_parser.py
│   ├── procs/                      # Statistical procedure implementations
│   │   ├── proc_means.py
│   │   ├── proc_freq.py
│   │   ├── proc_sort.py
│   │   └── proc_print.py
│   └── utils/                      # Utility functions
│       ├── expression_evaluator.py
│       ├── data_utils.py
│       └── libname_manager.py
├── vscode-extension/               # VS Code extension
├── examples/                       # Example files and demo notebook
├── media/                          # Logo and icons
├── setup.py                        # Package setup
└── README.md
git clone https://github.com/ryan-story/StatLang.git
cd StatLang
pip install -e .
# Run basic functionality tests
python -c "from statlang import StatLangInterpreter; print('StatLang loaded successfully')"
- Core DATA step implementation with DATALINES
- Statistical procedures with CLASS variables and OUTPUT statements
- Frequency analysis with cross-tabulations and options
- Data sorting with ascending/descending order
- Data display and formatting
- Linear regression analysis with PROC REG
- Random sampling with PROC SURVEYSELECT
- Silent execution options
- Jupyter notebook kernel
- VS Code extension with syntax highlighting
- Clean, professional output
- Concise behavior and syntax
- Additional statistical procedures (SQL queries, advanced regression, etc.)
- Advanced macro functionality
- Performance optimizations
- Enhanced data connectivity options
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
- Additional statistical procedures
- Macro functionality enhancements
- Performance optimizations
- VS Code extension features
- Documentation and examples
MIT License - see LICENSE for details.
- Documentation
- Issue Tracker
- Discussions
