# Open-SAS v2.0 Comprehensive Walkthrough

This notebook demonstrates the functionality of Open-SAS v2.0: the core DATA step features, followed by the 13 major statistical procedures and the experimental PROC LANGUAGE:

## Core Features:
- ✅ **DATA Steps**: Creating and manipulating datasets with DATALINES
- ✅ **Data Input**: External files, library management, variable creation
- ✅ **Data Manipulation**: Conditions, loops, transformations

## Statistical Procedures (v2.0):
- ✅ **PROC MEANS**: Descriptive statistics with CLASS variables
- ✅ **PROC FREQ**: Frequency tables with Chi-square tests
- ✅ **PROC UNIVARIATE**: Distribution diagnostics and normality tests
- ✅ **PROC CORR**: Correlation analysis (Pearson, Spearman, Kendall)
- ✅ **PROC FACTOR**: Principal Component Analysis and Factor Analysis
- ✅ **PROC CLUSTER**: K-means, Hierarchical, and HDBSCAN clustering
- ✅ **PROC NPAR1WAY**: Nonparametric tests (Mann-Whitney, Kruskal-Wallis)
- ✅ **PROC TTEST**: T-tests (independent, paired, one-sample)
- ✅ **PROC LOGIT**: Logistic regression with statsmodels
- ✅ **PROC TIMESERIES**: Time series decomposition and ARIMA modeling
- ✅ **PROC TREE**: Decision tree classification and regression
- ✅ **PROC FOREST**: Random forest ensemble methods
- ✅ **PROC BOOST**: Gradient boosting with scikit-learn
- ✅ **PROC LANGUAGE**: LLM integration with Ollama (experimental)

**Kernel**: Make sure to select "osas" as the kernel!


In [1]:
/* Test 1: Basic DATA Step with DATALINES */
data work.employees;
    input employee_id name $ department $ salary;
    datalines;
1 Alice Engineering 75000
2 Bob Marketing 55000
3 Carol Engineering 80000
4 David Sales 45000
5 Eve Engineering 70000
6 Frank Marketing 60000
7 Grace Sales 50000
8 Henry Engineering 85000
;
run;


Saved dataset work.employees to library work


employee_id,name,department,salary
1.0,Alice,Engineering,75000.0
2.0,Bob,Marketing,55000.0
3.0,Carol,Engineering,80000.0
4.0,David,Sales,45000.0
5.0,Eve,Engineering,70000.0


In [1]:
/* Test 2: PROC PRINT - Display Dataset */
proc print data=work.employees;
run;


PROC PRINT - Dataset Contents
Observations: 8
Variables: 4

employee_id | name     | department  | salary  
-----------------------------------------------
1.0         | Alice    | Engineering | 75000.0 
2.0         | Bob      | Marketing   | 55000.0 
3.0         | Carol    | Engineering | 80000.0 
4.0         | David    | Sales       | 45000.0 
5.0         | Eve      | Engineering | 70000.0 
6.0         | Frank    | Marketing   | 60000.0 
7.0         | Grace    | Sales       | 50000.0 
8.0         | Henry    | Engineering | 85000.0 


In [1]:
/* Test 3: PROC MEANS - Descriptive Statistics */
proc means data=work.employees;
    var salary;
    class department;
run;


PROC MEANS - Grouped Analysis
Analysis Variables: salary
Grouping Variables: department

department  | salary_count | salary_mean | salary_std         | salary_min | salary_max
---------------------------------------------------------------------------------------
Engineering | 4            | 77500.0     | 6454.972243679028  | 70000.0    | 85000.0   
Marketing   | 2            | 57500.0     | 3535.5339059327375 | 55000.0    | 60000.0   
Sales       | 2            | 47500.0     | 3535.5339059327375 | 45000.0    | 50000.0   
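
The `salary_std` values above match the sample standard deviation (n - 1 denominator). As a cross-check outside Open-SAS, the grouped statistics can be recomputed in plain Python. This is a minimal sketch using only the standard library, not the Open-SAS implementation; the names `employees`, `by_dept`, and `summary` are illustrative.

```python
from collections import defaultdict
from statistics import mean, stdev

# (department, salary) pairs copied from the DATALINES block in Test 1.
employees = [
    ("Engineering", 75000), ("Marketing", 55000), ("Engineering", 80000),
    ("Sales", 45000), ("Engineering", 70000), ("Marketing", 60000),
    ("Sales", 50000), ("Engineering", 85000),
]

# Group salaries by department, mirroring the CLASS statement.
by_dept = defaultdict(list)
for dept, salary in employees:
    by_dept[dept].append(salary)

# stdev() is the sample standard deviation (n - 1 denominator).
summary = {
    dept: {
        "count": len(values),
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
    }
    for dept, values in sorted(by_dept.items())
}

for dept, stats in summary.items():
    print(dept, stats)
```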


In [1]:
/* Test 4: PROC FREQ - Frequency Analysis */
proc freq data=work.employees;
    tables department;
run;


PROC FREQ - Frequency Table for department

Value                Frequency    Percent    Cumulative Percent
------------------------------------------------------------
Engineering          4            50.0       50.0              
Marketing            2            25.0       75.0              
Sales                2            25.0       100.0             
------------------------------------------------------------
Total                8            100.0      100.0             
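
The frequency, percent, and cumulative percent columns above follow from simple counting. The sketch below reproduces them in plain Python as an illustration of the arithmetic; it is not how Open-SAS computes the table, and `freq_table` is an illustrative name.

```python
from collections import Counter

# Department column from the Test 1 data, in input order.
departments = [
    "Engineering", "Marketing", "Engineering", "Sales",
    "Engineering", "Marketing", "Sales", "Engineering",
]

counts = Counter(departments)
total = sum(counts.values())

# Build (value, frequency, percent, cumulative percent) rows in sorted order.
freq_table = []
cumulative = 0.0
for value in sorted(counts):
    percent = 100.0 * counts[value] / total
    cumulative += percent
    freq_table.append((value, counts[value], percent, cumulative))

for row in freq_table:
    print(row)
```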


In [1]:
/* Test 5: Data Manipulation - Create New Variables */
data work.enhanced_employees;
    set work.employees;
    salary_category = ifn(salary > 70000, 'High', ifn(salary > 55000, 'Medium', 'Low'));
    annual_bonus = salary * 0.1;
    total_compensation = salary + annual_bonus;
run;


Evaluating IFN expression: ifn(salary > 70000, 'High', ifn(salary > 55000, 'Medium', 'Low'))
IFN parsed - condition: salary > 70000, true: High, false: ifn(salary > 55000, 'Medium', 'Low'
Condition result: 0     True
1    False
2     True
3    False
4    False
5    False
6    False
7     True
Name: salary, dtype: bool
Handling nested IFN
Nested IFN - cond2: salary > 55000, val2: Medium, val3: Low
Final IFN result: 0      High
1       Low
2      High
3       Low
4    Medium
5    Medium
6       Low
7      High
dtype: object
Saved dataset work.enhanced_employees to library work


employee_id,name,department,salary,salary_category,annual_bonus,total_compensation
1.0,Alice,Engineering,75000.0,High,7500.0,82500.0
2.0,Bob,Marketing,55000.0,Low,5500.0,60500.0
3.0,Carol,Engineering,80000.0,High,8000.0,88000.0
4.0,David,Sales,45000.0,Low,4500.0,49500.0
5.0,Eve,Engineering,70000.0,Medium,7000.0,77000.0
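
Both comparisons in the nested `ifn` call are strict, so a salary of exactly 70000 falls into `Medium` and exactly 55000 into `Low`, as the output above shows. The same branching can be written as an ordinary Python function; this is a sketch of the logic only, and `salary_category` is an illustrative name. (In SAS itself, `IFN` returns a numeric value and `IFC` is the character-returning variant; the cell above shows Open-SAS accepting character results through `ifn`.)

```python
def salary_category(salary):
    """Mirror ifn(salary > 70000, 'High', ifn(salary > 55000, 'Medium', 'Low'))."""
    if salary > 70000:
        return "High"
    # The inner ifn is only reached when the outer condition is false.
    return "Medium" if salary > 55000 else "Low"


# Salaries from Test 1, in input order.
salaries = [75000, 55000, 80000, 45000, 70000, 60000, 50000, 85000]
categories = [salary_category(s) for s in salaries]
print(categories)
```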


In [1]:
/* Test 6: Display Enhanced Dataset */
proc print data=work.enhanced_employees;
run;


PROC PRINT - Dataset Contents
Observations: 8
Variables: 7

employee_id | name     | department  | salary   | salary_category | annual_bonus | total_compensation
-----------------------------------------------------------------------------------------------------
1.0         | Alice    | Engineering | 75000.0  | High            | 7500.0       | 82500.0           
2.0         | Bob      | Marketing   | 55000.0  | Low             | 5500.0       | 60500.0           
3.0         | Carol    | Engineering | 80000.0  | High            | 8000.0       | 88000.0           
4.0         | David    | Sales       | 45000.0  | Low             | 4500.0       | 49500.0           
5.0         | Eve      | Engineering | 70000.0  | Medium          | 7000.0       | 77000.0           
6.0         | Frank    | Marketing   | 60000.0  | Medium          | 6000.0       | 66000.0           
7.0         | Grace    | Sales       | 50000.0  | Low             | 5000.0       | 55000.0           
8.0         | Henry    | Engineering | 85000.0  | High            | 8500.0       | 93500.0           

In [1]:
/* Test 7: Conditional Processing - WHERE Clause */
data work.high_earners;
    set work.enhanced_employees;
    where salary > 70000;
run;


Saved dataset work.high_earners to library work


employee_id,name,department,salary,salary_category,annual_bonus,total_compensation
1.0,Alice,Engineering,75000.0,High,7500.0,82500.0
3.0,Carol,Engineering,80000.0,High,8000.0,88000.0
8.0,Henry,Engineering,85000.0,High,8500.0,93500.0


In [1]:
/* Test 8: Display Filtered Results */
proc print data=work.high_earners;
run;


PROC PRINT - Dataset Contents
Observations: 3
Variables: 7

employee_id | name     | department  | salary   | salary_category | annual_bonus | total_compensation
-----------------------------------------------------------------------------------------------------
1.0         | Alice    | Engineering | 75000.0  | High            | 7500.0       | 82500.0           
3.0         | Carol    | Engineering | 80000.0  | High            | 8000.0       | 88000.0           
8.0         | Henry    | Engineering | 85000.0  | High            | 8500.0       | 93500.0           


In [1]:
/* Test 9: PROC SORT - Sort Data */
proc sort data=work.enhanced_employees;
    by department descending salary;
run;


PROC SORT - Dataset Sorted
BY Variables: department (ASC), salary (DESC)
Observations: 8
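
The ordering requested by `by department descending salary` is a two-key sort: ascending on the first key and descending on the second. A minimal Python sketch of the same ordering, using a negated numeric key for the descending part (the names `rows` and `sorted_rows` are illustrative):

```python
# (employee_id, name, department, salary) rows from Test 1.
rows = [
    (1, "Alice", "Engineering", 75000),
    (2, "Bob", "Marketing", 55000),
    (3, "Carol", "Engineering", 80000),
    (4, "David", "Sales", 45000),
    (5, "Eve", "Engineering", 70000),
    (6, "Frank", "Marketing", 60000),
    (7, "Grace", "Sales", 50000),
    (8, "Henry", "Engineering", 85000),
]

# Department ascending, then salary descending within each department.
sorted_rows = sorted(rows, key=lambda r: (r[2], -r[3]))
names = [r[1] for r in sorted_rows]
print(names)
```

The resulting order agrees with the Test 10 listing later in this notebook.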



## Open-SAS v2.0 New Statistical Procedures

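The following sections demonstrate the statistical procedures added in v2.0. Several of the cells below reference an `age` variable, which the `work.employees` dataset created in Test 1 does not contain; add a numeric `age` column to that dataset before running them.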


### PROC UNIVARIATE - Distribution Diagnostics

PROC UNIVARIATE provides comprehensive univariate analysis including descriptive statistics, distribution shape, and normality tests.


In [None]:
/* PROC UNIVARIATE - Detailed distribution analysis */
proc univariate data=work.employees;
    var salary age;
run;
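
Two of the shape statistics that a distribution analysis typically reports are skewness and excess kurtosis. The sketch below computes the plain moment-based (population) versions for `salary`, the one analysis variable present in the Test 1 data. It is an illustration under that assumption; statistical packages often apply small-sample corrections, so reported values may differ.

```python
# Salaries from Test 1.
salaries = [75000, 55000, 80000, 45000, 70000, 60000, 50000, 85000]

n = len(salaries)
mean = sum(salaries) / n


def central_moment(k):
    """k-th central moment with an n denominator (population form)."""
    return sum((x - mean) ** k for x in salaries) / n


m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)

skewness = m3 / m2 ** 1.5          # 0 for a symmetric sample
excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution

print(mean, skewness, excess_kurtosis)
```

The salaries happen to be symmetric around their mean of 65000, so the moment-based skewness is zero.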


### PROC CORR - Correlation Analysis

PROC CORR computes correlation matrices with options for Pearson, Spearman, and Kendall correlations.


In [None]:
/* PROC CORR - Correlation analysis */
proc corr data=work.employees method=pearson;
    var salary age;
run;
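
For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations. The sketch below implements that definition directly. Because the Test 1 data has no second continuous variable, it uses the `sales` and `marketing_spend` columns from the time series example later in this notebook; that pairing is an assumption made for illustration, and `pearson` is a hypothetical helper, not an Open-SAS function.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation computed from sums of centered products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Columns from the work.sales DATALINES block.
sales = [1000, 1100, 1200, 1300, 1250, 1400, 1350, 1500, 1450, 1600, 1550, 1700]
marketing_spend = [5000, 5200, 4800, 5500, 5100, 5300, 4900, 5600, 5200, 5400, 5000, 5800]

r = pearson(sales, marketing_spend)
print(round(r, 4))
```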


### PROC FACTOR - Principal Component Analysis

PROC FACTOR performs PCA and factor analysis for dimensionality reduction and exploratory data analysis.


In [None]:
/* PROC FACTOR - Principal Component Analysis */
proc factor data=work.employees method=pca nfactors=2;
    var salary age;
run;


### PROC CLUSTER - Clustering Analysis

PROC CLUSTER provides k-means, hierarchical, and HDBSCAN clustering methods for data segmentation.


In [None]:
/* PROC CLUSTER - K-means clustering */
proc cluster data=work.employees method=kmeans nclusters=3;
    var salary age;
run;


### PROC NPAR1WAY - Nonparametric Tests

PROC NPAR1WAY performs nonparametric tests including Mann-Whitney U test and Kruskal-Wallis test for comparing groups.


In [None]:
/* PROC NPAR1WAY - Nonparametric analysis */
proc npar1way data=work.employees;
    class department;
    var salary;
run;
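
With three departments, the relevant test here is Kruskal-Wallis, which ranks all observations together and compares the rank sums per group. The sketch below computes the H statistic for the Test 1 salaries. It assumes there are no tied values (true for this data) and omits the tie correction, so it is an illustration rather than a general implementation.

```python
from collections import defaultdict
from math import exp

# (department, salary) pairs from Test 1.
data = [
    ("Engineering", 75000), ("Marketing", 55000), ("Engineering", 80000),
    ("Sales", 45000), ("Engineering", 70000), ("Marketing", 60000),
    ("Sales", 50000), ("Engineering", 85000),
]

# Rank all salaries together (1 = smallest). Assumes no ties.
ordered = sorted(salary for _, salary in data)
rank = {salary: i + 1 for i, salary in enumerate(ordered)}

rank_sums = defaultdict(float)
sizes = defaultdict(int)
for dept, salary in data:
    rank_sums[dept] += rank[salary]
    sizes[dept] += 1

n = len(data)
h_stat = 12.0 / (n * (n + 1)) * sum(
    rank_sums[d] ** 2 / sizes[d] for d in sizes
) - 3 * (n + 1)

# With 3 groups, H is compared to a chi-square distribution with 2 df,
# whose survival function is exp(-x / 2).
p_value = exp(-h_stat / 2)

print(h_stat, p_value)
```

The chi-square approximation is asymptotic, so with groups this small the p-value should be read with caution.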


### PROC TTEST - T-Tests

PROC TTEST performs independent samples, paired samples, and one-sample t-tests with comprehensive statistical output.


In [None]:
/* Create a binary variable for t-test */
data work.employees;
    set work.employees;
    high_salary = (salary > 70000);
run;

/* PROC TTEST - Independent samples t-test */
proc ttest data=work.employees;
    class high_salary;
    var age;
run;
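
The arithmetic behind an independent-samples t-test with pooled variance can be sketched in a few lines. Because the Test 1 data has no `age` column, this sketch compares `salary` between the Engineering and Marketing departments instead; that choice of variables is an assumption made purely for illustration.

```python
from math import sqrt
from statistics import mean, variance

# Salaries by department, from Test 1.
engineering = [75000, 80000, 70000, 85000]
marketing = [55000, 60000]

n1, n2 = len(engineering), len(marketing)

# Pooled variance weights each sample variance by its degrees of freedom.
pooled_var = (
    (n1 - 1) * variance(engineering) + (n2 - 1) * variance(marketing)
) / (n1 + n2 - 2)

t_stat = (mean(engineering) - mean(marketing)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(t_stat, 4), df)
```

The pooled form assumes equal variances in the two groups; when that is doubtful, the Welch (unequal-variance) form is the usual alternative.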


### PROC LOGIT - Logistic Regression

PROC LOGIT performs logistic regression for binary and multinomial outcomes with comprehensive model diagnostics.


In [None]:
/* PROC LOGIT - Logistic regression */
proc logit data=work.employees;
    model high_salary = age;
run;


### PROC FREQ Enhanced - Chi-Square Tests

Enhanced PROC FREQ now includes Chi-square tests of independence and Fisher's exact test for contingency tables.


In [None]:
/* PROC FREQ Enhanced - Chi-square test */
proc freq data=work.employees;
    tables department*high_salary / chisq;
run;
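
The chi-square statistic compares each observed cell count with the count expected under independence (row total times column total, divided by the grand total). The sketch below works this out for `department` by `high_salary` using the Test 1 data; it illustrates the formula and is not the Open-SAS implementation.

```python
from collections import Counter

# (department, salary) pairs from Test 1.
rows = [
    ("Engineering", 75000), ("Marketing", 55000), ("Engineering", 80000),
    ("Sales", 45000), ("Engineering", 70000), ("Marketing", 60000),
    ("Sales", 50000), ("Engineering", 85000),
]

# Observed counts per (department, high_salary) cell.
observed = Counter((dept, int(salary > 70000)) for dept, salary in rows)

row_totals = Counter()
col_totals = Counter()
for (dept, flag), count in observed.items():
    row_totals[dept] += count
    col_totals[flag] += count

n = len(rows)
chi2 = 0.0
for dept in row_totals:
    for flag in col_totals:
        expected = row_totals[dept] * col_totals[flag] / n
        # Counter returns 0 for cells that never occurred.
        chi2 += (observed[(dept, flag)] - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(chi2, df)
```

Every expected count in this table is below 5, which is the situation where the chi-square approximation is weakest and an exact test is usually preferred.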


### PROC TIMESERIES - Time Series Analysis

PROC TIMESERIES provides seasonal decomposition, ARIMA modeling, and autocorrelation analysis for time series data.


In [None]:
/* Create time series data */
data work.sales;
    input month sales marketing_spend;
    datalines;
1 1000 5000
2 1100 5200
3 1200 4800
4 1300 5500
5 1250 5100
6 1400 5300
7 1350 4900
8 1500 5600
9 1450 5200
10 1600 5400
11 1550 5000
12 1700 5800
;
run;

/* PROC TIMESERIES - Seasonal decomposition */
proc timeseries data=work.sales type=decompose;
    var sales;
run;
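
Classical decomposition starts by estimating the trend with a centered moving average and then works with what remains. The sketch below shows only that first step on the `sales` series above. The 3-month window is a hypothetical choice for illustration; with 12 monthly observations there is less than two full yearly cycles, so a 12-month seasonal component cannot be estimated reliably from this data.

```python
# Sales column from the work.sales DATALINES block.
sales = [1000, 1100, 1200, 1300, 1250, 1400, 1350, 1500, 1450, 1600, 1550, 1700]

window = 3          # hypothetical window length, chosen for illustration
half = window // 2

# Centered moving average: defined only where a full window is available.
trend = [
    sum(sales[i - half:i + half + 1]) / window
    for i in range(half, len(sales) - half)
]

# Remainder after removing the trend, aligned with the same months.
detrended = [sales[i + half] - t for i, t in enumerate(trend)]

print(trend[:3])
print(detrended[:3])
```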


### PROC TREE - Decision Tree Analysis

PROC TREE builds decision trees for classification and regression with feature importance analysis.


In [None]:
/* PROC TREE - Decision tree classification */
proc tree data=work.employees maxdepth=3;
    model high_salary = age salary;
run;
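
A classification tree chooses splits that reduce node impurity; Gini impurity is a common criterion and is the default in scikit-learn's `DecisionTreeClassifier` (whether PROC TREE uses that default is an assumption here). The sketch below computes Gini impurity before and after splitting on `salary <= 70000`. Because `high_salary` is derived from `salary`, this single split separates the classes perfectly, so listing `salary` as a predictor leaks the target into the model.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Salaries from Test 1 and the derived binary target.
salaries = [75000, 55000, 80000, 45000, 70000, 60000, 50000, 85000]
high_salary = [int(s > 70000) for s in salaries]

parent_impurity = gini(high_salary)

# Split on the same threshold that defines the target.
left = [h for s, h in zip(salaries, high_salary) if s <= 70000]
right = [h for s, h in zip(salaries, high_salary) if s > 70000]
child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(salaries)

print(parent_impurity, child_impurity)
```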


### PROC FOREST - Random Forest Analysis

PROC FOREST builds random forest ensembles for improved prediction accuracy and feature importance.


In [None]:
/* PROC FOREST - Random forest classification */
proc forest data=work.employees ntrees=50;
    model high_salary = age salary;
run;


### PROC BOOST - Gradient Boosting

PROC BOOST implements gradient boosting for powerful ensemble learning with configurable parameters.


In [None]:
/* PROC BOOST - Gradient boosting */
proc boost data=work.employees ntrees=50 learningrate=0.1;
    model high_salary = age salary;
run;


### PROC LANGUAGE - LLM Integration (Experimental)

PROC LANGUAGE integrates with Ollama for open-source LLM access, enabling AI-assisted data analysis and text generation.

**Note**: Requires Ollama to be installed and running locally. Install from https://ollama.ai/


In [None]:
/* PROC LANGUAGE - Text generation (requires Ollama) */
proc language model=llama2 mode=generate;
    prompt "Explain the concept of statistical significance in data analysis";
run;


## Summary

Open-SAS v2.0 now provides a comprehensive statistical analysis platform with:

- **13 Major Procedures**: From basic descriptive statistics to advanced machine learning
- **Professional Output**: SAS-inspired formatting with proper statistical interpretation
- **Modern Backend**: Python-powered with scikit-learn, statsmodels, and scipy
- **Extensible Architecture**: Easy to add new procedures and capabilities
- **Open Source**: Free and open-source alternative to proprietary statistical software

### Key Advantages:
- ✅ **No License Costs**: Completely free to use
- ✅ **Cross-Platform**: Works on Windows, macOS, and Linux
- ✅ **Jupyter Integration**: Interactive analysis in notebooks
- ✅ **VS Code Support**: Syntax highlighting and execution
- ✅ **Python Ecosystem**: Access to thousands of Python packages
- ✅ **Community Driven**: Open source with active development

**Ready for production use in research, education, and data analysis!**


In [1]:
/* Test 10: Display Sorted Data */
proc print data=work.enhanced_employees;
run;


PROC PRINT - Dataset Contents
Observations: 8
Variables: 7

employee_id | name     | department  | salary   | salary_category | annual_bonus | total_compensation
-----------------------------------------------------------------------------------------------------
8.0         | Henry    | Engineering | 85000.0  | High            | 8500.0       | 93500.0           
3.0         | Carol    | Engineering | 80000.0  | High            | 8000.0       | 88000.0           
1.0         | Alice    | Engineering | 75000.0  | High            | 7500.0       | 82500.0           
5.0         | Eve      | Engineering | 70000.0  | Medium          | 7000.0       | 77000.0           
6.0         | Frank    | Marketing   | 60000.0  | Medium          | 6000.0       | 66000.0           
2.0         | Bob      | Marketing   | 55000.0  | Low             | 5500.0       | 60500.0           
7.0         | Grace    | Sales       | 50000.0  | Low             | 5000.0       | 55000.0           
4.0         | David    | Sales       | 45000.0  | Low             | 4500.0       | 49500.0           

In [1]:
/* Test 11: Advanced Statistics - PROC MEANS with printed results and no OUT= dataset */
proc means data=work.enhanced_employees;
    class department;
    var salary total_compensation;
    output 
           mean=avg_salary avg_compensation 
           min=min_salary min_compensation 
           max=max_salary max_compensation;
run;


PROC MEANS - Grouped Analysis
Analysis Variables: salary, total_compensation
Grouping Variables: department

department  | salary_count | salary_mean | salary_std         | salary_min | salary_max | total_compensation_count | total_compensation_mean | total_compensation_std | total_compensation_min | total_compensation_max
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Engineering | 4            | 77500.0     | 6454.972243679028  | 70000.0    | 85000.0    | 4                        | 85250.0                 | 7100.469468046931      | 77000.0                | 93500.0               
Marketing   | 2            | 57500.0     | 3535.5339059327375 | 55000.0    | 60000.0    | 2                        | 63250.0                 | 3889.0872965260114     | 60500.0                | 66000.0               
Sales       | 2            

In [1]:
/* Test 12: Advanced Statistics - PROC MEANS with NOPRINT and an OUT= dataset */
proc means data=work.enhanced_employees noprint;
    class department;
    var salary total_compensation;
    output out=work.dept_summary 
           mean=avg_salary avg_compensation 
           min=min_salary min_compensation 
           max=max_salary max_compensation;
run;


In [1]:
/* Test 12 (continued): Display Summary Statistics */
proc print data=work.dept_summary;
run;


PROC PRINT - Dataset Contents
Observations: 3
Variables: 7

avg_salary | avg_compensation | max_salary | max_compensation | department  | salary_min | total_compensation_min
-----------------------------------------------------------------------------------------------------------------
77500.0    | 85250.0          | 85000.0    | 93500.0          | Engineering | 70000.0    | 77000.0               
57500.0    | 63250.0          | 60000.0    | 66000.0          | Marketing   | 55000.0    | 60500.0               
47500.0    | 52250.0          | 50000.0    | 55000.0          | Sales       | 45000.0    | 49500.0               


In [1]:
/* Test 13: Cross-tabulation Analysis */
proc freq data=work.enhanced_employees;
    tables department * salary_category / nocol nopercent;
run;


PROC FREQ - Cross-tabulation: department * salary_category
Options: nocol nopercent

department      | High | Low | Medium | Total
---------------------------------------------
Engineering     | 3    | 0   | 1      | 4    
Marketing       | 0    | 1   | 1      | 2    
Sales           | 0    | 2   | 0      | 2    
Total           | 3    | 3   | 2      | 8    
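
A cross-tabulation is a count of each (row value, column value) pair, laid out as a grid. The sketch below rebuilds the cell counts above in plain Python from the `department` and `salary_category` columns of `work.enhanced_employees`; the names `cells` and `table` are illustrative.

```python
from collections import Counter

# (department, salary_category) pairs from work.enhanced_employees.
pairs = [
    ("Engineering", "High"), ("Marketing", "Low"), ("Engineering", "High"),
    ("Sales", "Low"), ("Engineering", "Medium"), ("Marketing", "Medium"),
    ("Sales", "Low"), ("Engineering", "High"),
]

cells = Counter(pairs)
departments = sorted({dept for dept, _ in pairs})
categories = sorted({cat for _, cat in pairs})  # High, Low, Medium

# One row of counts per department, columns in sorted category order.
table = {dept: [cells[(dept, cat)] for cat in categories] for dept in departments}
column_totals = [sum(table[dept][i] for dept in departments) for i in range(len(categories))]

for dept, counts in table.items():
    print(dept, counts, sum(counts))
print("Total", column_totals, sum(column_totals))
```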


In [1]:
/* Test 14: PROC CONTENTS - Dataset Information */
proc contents data=work.enhanced_employees;
run;


PROC CONTENTS - Dataset Information

Dataset Information:
  Observations: 8
  Variables: 7

Variable Information:
--------------------------------------------------------------------------------
#   Variable             Type       Length   Non-Null   Null    
--------------------------------------------------------------------------------
1   employee_id          Numeric    8        8          0       
2   name                 Character  Variable 8          0       
3   department           Character  Variable 8          0       
4   salary               Numeric    8        8          0       
5   salary_category      Character  Variable 8          0       
6   annual_bonus         Numeric    8        8          0       
7   total_compensation   Numeric    8        8          0       
