# StatLang v2.0 Comprehensive Walkthrough

Welcome to StatLang v2.0! This notebook demonstrates the complete functionality of our open-source statistical analysis system with SAS-inspired syntax.

## 🎯 What is StatLang?

StatLang is a Python-based statistical analysis system that provides SAS-like syntax for data manipulation and statistical procedures. It pairs familiar SAS-style programming with Python's scientific computing ecosystem, including scikit-learn, Hugging Face transformers, and DuckDB.

## 🚀 Key Features

### Core Data Operations
- **DATA Steps**: Create and manipulate datasets with DATALINES
- **Library Management**: Persistent data storage with LIBNAME
- **Variable Creation**: Transformations, conditions, and calculations

### Statistical Procedures (v2.0)
- **PROC MEANS**: Descriptive statistics with CLASS variables
- **PROC FREQ**: Frequency tables with Chi-square tests  
- **PROC UNIVARIATE**: Distribution diagnostics and normality tests
- **PROC CORR**: Correlation analysis (Pearson, Spearman, Kendall)
- **PROC FACTOR**: Principal Component Analysis and Factor Analysis
- **PROC CLUSTER**: K-means, Hierarchical, and HDBSCAN clustering
- **PROC NPAR1WAY**: Nonparametric tests (Mann-Whitney, Kruskal-Wallis)
- **PROC TTEST**: T-tests (independent, paired, one-sample)
- **PROC LOGIT**: Logistic regression with odds ratios
- **PROC TIMESERIES**: Time series decomposition and ARIMA modeling
- **PROC TREE**: Decision tree classification and regression
- **PROC FOREST**: Random forest ensemble methods
- **PROC BOOST**: Gradient boosting with scikit-learn
- **PROC LANGUAGE**: Built-in LLM integration with Hugging Face transformers
- **PROC SQL**: SQL query processing with DuckDB

## 📋 Getting Started

**Important**: Make sure to select the **"osas"** kernel for this notebook!

Each section below includes:
- 📖 **Introduction**: Explanation of the procedure and its capabilities
- 💻 **Demo**: Working examples with real data
- 📊 **Output**: Expected results and interpretation

Let's begin our journey through StatLang v2.0!


## 📊 DATA Steps: Creating and Manipulating Datasets

DATA steps are the foundation of StatLang, allowing you to create, modify, and transform datasets. The `DATALINES` statement provides an easy way to input data directly in your code.

### Key Features:
- **Direct Data Input**: Use `DATALINES` to embed data in your code
- **Variable Types**: Support for character (`$`) and numeric variables
- **Data Transformations**: Create new variables, apply conditions, and perform calculations
- **Library Storage**: Automatically save datasets to the WORK library


In [1]:
/* Demo 1: Employee Dataset Creation */
data work.employees;
    input employee_id name $ department $ salary age experience performance_score satisfaction $ remote;
    datalines;
1 Alice Engineering 75000 30 5 8.5 High 1
2 Bob Marketing 55000 25 2 7.2 Medium 0
3 Carol Engineering 80000 35 8 9.1 High 0
4 David Sales 45000 28 3 6.8 Low 1
5 Eve Engineering 70000 32 6 7.8 Medium 1
6 Frank Marketing 60000 27 4 8.0 High 0
7 Grace Sales 50000 24 1 6.5 Low 1
8 Henry Engineering 85000 40 12 9.5 High 0
9 Irene HR 62000 29 3 7.5 Medium 1
10 Jack Marketing 59000 31 7 8.2 High 0
;
run;


Saved dataset work.employees to library work


employee_id,name,department,salary,age,experience,performance_score,satisfaction,remote
1.0,Alice,Engineering,75000.0,30.0,5.0,8.5,High,1.0
2.0,Bob,Marketing,55000.0,25.0,2.0,7.2,Medium,0.0
3.0,Carol,Engineering,80000.0,35.0,8.0,9.1,High,0.0
4.0,David,Sales,45000.0,28.0,3.0,6.8,Low,1.0
5.0,Eve,Engineering,70000.0,32.0,6.0,7.8,Medium,1.0
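
To make the role of the `$` marker concrete, here is a minimal Python sketch of how an `input` statement like the one above could be mapped onto typed columns. It is not StatLang's actual parser: `parse_input_spec` and `parse_datalines` are hypothetical helper names, and the sketch assumes whitespace-separated values with no embedded spaces.

```python
def parse_input_spec(spec):
    """Split an INPUT statement body into (name, is_character) pairs.

    A `$` token marks the preceding variable as character; every other
    variable is treated as numeric (an assumption of this sketch).
    """
    columns = []
    for token in spec.split():
        if token == "$":
            name, _ = columns[-1]
            columns[-1] = (name, True)
        else:
            columns.append((token, False))
    return columns


def parse_datalines(spec, lines):
    """Convert whitespace-separated data lines into a list of dicts."""
    columns = parse_input_spec(spec)
    rows = []
    for line in lines:
        row = {}
        for (name, is_char), value in zip(columns, line.split()):
            # Numeric values become floats, character values stay strings.
            row[name] = value if is_char else float(value)
        rows.append(row)
    return rows


spec = "employee_id name $ department $ salary age"
rows = parse_datalines(spec, ["1 Alice Engineering 75000 30",
                              "2 Bob Marketing 55000 25"])
print(rows[0])
```

Converting every numeric column to a float would account for values such as `1.0` and `75000.0` in the preview above, even though the data lines contain integers.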


In [1]:
/* Test 2: PROC PRINT - Display Dataset */
proc print data=work.employees;
run;


PROC PRINT - Dataset Contents
Observations: 10
Variables: 9

employee_id | name     | department  | salary   | age      | experience | performance_score | satisfaction | remote  
---------------------------------------------------------------------------------------------------------------------
1.0         | Alice    | Engineering | 75000.0  | 30.0     | 5.0        | 8.5               | High         | 1.0     
2.0         | Bob      | Marketing   | 55000.0  | 25.0     | 2.0        | 7.2               | Medium       | 0.0     
3.0         | Carol    | Engineering | 80000.0  | 35.0     | 8.0        | 9.1               | High         | 0.0     
4.0         | David    | Sales       | 45000.0  | 28.0     | 3.0        | 6.8               | Low          | 1.0     
5.0         | Eve      | Engineering | 70000.0  | 32.0     | 6.0        | 7.8               | Medium       | 1.0     
6.0         | Frank    | Marketing   | 60000.0  | 27.0     | 4.0        | 8.0               | High         | 0.0 

## 📊 PROC UNIVARIATE: Distribution Diagnostics

PROC UNIVARIATE provides detailed univariate analysis, including descriptive statistics, distribution shape measures, and normality tests. Use it to examine one variable at a time before modeling.

### Key Features:
- **Comprehensive Statistics**: Mean, median, mode, standard deviation, skewness, kurtosis
- **Distribution Analysis**: Normality tests and extreme value identification
- **Visual Diagnostics**: Histograms and Q-Q plots for distribution assessment
- **Outlier Detection**: Identify unusual observations in your data


In [1]:
/* Demo 4: PROC UNIVARIATE - Distribution Diagnostics */
proc univariate data=work.employees;
    var salary age performance_score;
run;


PROC UNIVARIATE - Univariate Analysis

Variable: salary
----------------------------------------
Moments
  N                                  10
  Mean                       64100.000000
  Std Deviation              13067.772062
  Variance                   170766666.666667
  Skewness                     0.205420
  Uncorrected SS             42625000000.000000
  Corrected SS               1536900000.000000
  Coeff Variation             20.386540

Basic Statistical Measures
  Location                    Variability
  Mean      64100.000000     Std Deviation     13067.772062
  Median    61000.000000     Variance          170766666.666667
  Mode      45000.000000     Range             40000.000000
                           Interquartile Range 17750.000000

Tests for Location: Mu0=0
  Test           -Statistic-    -----p Value------
  Student's t     15.511596    Pr > |t|      0.000000
  Sign                   10    Pr >= |M|     0.000000
  Signed Rank            55    Pr >= |S|     1.000
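
The moments reported for `salary` can be cross-checked with Python's standard library. The sketch below is an independent recomputation, not StatLang's implementation. It assumes the sample (n - 1) divisor for the variance, which is what the reported figures are consistent with.

```python
import math
import statistics

salaries = [75000, 55000, 80000, 45000, 70000,
            60000, 50000, 85000, 62000, 59000]

n = len(salaries)
mean = statistics.mean(salaries)
median = statistics.median(salaries)
variance = statistics.variance(salaries)      # sample variance, divisor n - 1
std_dev = statistics.stdev(salaries)
uncorrected_ss = sum(x * x for x in salaries)
corrected_ss = sum((x - mean) ** 2 for x in salaries)
coeff_variation = 100 * std_dev / mean
t_stat = mean / (std_dev / math.sqrt(n))      # Student's t against Mu0 = 0

print(f"Mean            {mean:.6f}")
print(f"Median          {median:.6f}")
print(f"Std Deviation   {std_dev:.6f}")
print(f"Variance        {variance:.6f}")
print(f"Uncorrected SS  {uncorrected_ss:.6f}")
print(f"Corrected SS    {corrected_ss:.6f}")
print(f"Coeff Variation {coeff_variation:.6f}")
print(f"Student's t     {t_stat:.6f}")
```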

## 🔗 PROC CORR: Correlation Analysis

PROC CORR computes correlation matrices for numeric variables, helping you understand relationships between variables in your dataset.

### Key Features:
- **Multiple Correlation Types**: Pearson, Spearman, and Kendall correlations
- **Significance Testing**: P-values for correlation coefficients
- **Missing Data Handling**: Pairwise deletion of missing values
- **Matrix Output**: Clear correlation matrix display

In [1]:
/* Demo 5: PROC CORR - Correlation Analysis */
proc corr data=work.employees;
    var salary age experience performance_score;
run;


PROC CORR - Correlation Analysis
Method: Pearson

Correlation Matrix
----------------------------------------
Variable        salary       age  experien  performa
----------------------------------------------------
salary          1.0000    0.8516    0.8290    0.9258
age             0.8516    1.0000    0.9716    0.8806
experience      0.8290    0.9716    1.0000    0.9135
performance_    0.9258    0.8806    0.9135    1.0000

Sample Size Information
------------------------------
Number of observations used: 4

Correlation method: Pearson
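
As a cross-check on the matrix above, the Pearson coefficient for `salary` and `age` can be recomputed by hand. The sketch below uses only the standard library, and `pearson` is a helper defined here rather than a StatLang function. Running it on all ten rows reproduces the 0.8516 shown in the matrix, which suggests the `4` in the sample-size line may not be the number of rows used.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


salary = [75000, 55000, 80000, 45000, 70000,
          60000, 50000, 85000, 62000, 59000]
age = [30, 25, 35, 28, 32, 27, 24, 40, 29, 31]

r_salary_age = pearson(salary, age)
print(f"salary vs age: {r_salary_age:.4f}")
```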



## 🔍 PROC FACTOR: Principal Component Analysis

PROC FACTOR performs principal component analysis and factor analysis for dimensionality reduction and data exploration.

### Key Features:
- **Principal Components**: Identify key dimensions in your data
- **Eigenvalue Analysis**: Determine number of components to retain
- **Component Loadings**: Understand variable contributions
- **Variance Explained**: See how much variance each component captures

In [1]:
/* Demo 6: PROC FACTOR - Principal Component Analysis */
proc factor data=work.employees method=pca nfactors=3;
    var salary age experience performance_score;
run;


Principal Component Analysis
----------------------------------------

Eigenvalues and Variance Explained
----------------------------------------
Component    Eigenvalue   Proportion   Cumulative  
--------------------------------------------------
1            4.0962       0.9217       0.9217      
2            0.2466       0.0555       0.9771      
3            0.0872       0.0196       0.9968      
4            0.0144       0.0032       1.0000      

Component Loadings (Eigenvectors)
----------------------------------------
Variable         Comp1     Comp2     Comp3     Comp4
----------------------------------------------------
salary          0.4887    0.6730   -0.4793    0.2801
age             0.5026   -0.4637   -0.4666   -0.5610
experience      0.5041   -0.4913    0.2189    0.6757
performance_    0.5045    0.3009    0.7104   -0.3876

Kaiser criterion (eigenvalue > 1): 1 components
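
The Proportion and Cumulative columns follow directly from the eigenvalues: each proportion is an eigenvalue divided by the sum of all eigenvalues. The short sketch below recomputes them from the rounded values printed above, so tiny rounding differences are possible. It also applies the Kaiser rule (eigenvalue > 1).

```python
eigenvalues = [4.0962, 0.2466, 0.0872, 0.0144]  # copied from the output above

total = sum(eigenvalues)
proportions = [ev / total for ev in eigenvalues]

cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for i, (ev, p, c) in enumerate(zip(eigenvalues, proportions, cumulative), 1):
    print(f"{i}  {ev:.4f}  {p:.4f}  {c:.4f}")

# Kaiser criterion: keep components whose eigenvalue exceeds 1.
kaiser_count = sum(1 for ev in eigenvalues if ev > 1)
print("Components retained:", kaiser_count)
```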



## 🎯 PROC CLUSTER: Clustering Analysis

PROC CLUSTER groups similar observations together using one of several clustering algorithms.

### Key Features:
- **K-means Clustering**: Partition data into K clusters
- **Hierarchical Clustering**: Build cluster trees and dendrograms  
- **HDBSCAN**: Density-based clustering for complex shapes
- **Cluster Validation**: Assess clustering quality and stability

In [1]:
/* Demo 7: PROC CLUSTER - K-means Clustering */
proc cluster data=work.employees method=kmeans nclusters=3;
    var salary age experience performance_score;
run;


K-Means Clustering Results
----------------------------------------

Cluster Sizes
--------------------
Cluster 0: 3 observations
Cluster 1: 5 observations
Cluster 2: 2 observations

Cluster Centroids
------------------------------
Variable        Cluster0    Cluster1    Cluster2
------------------------------------------------
salary           -1.1374      0.0887      1.4842
age              -0.9794     -0.0663      1.6348
experience       -0.9959     -0.0321      1.5741
performance_     -1.1826      0.0989      1.5267

Model Statistics
--------------------
Within-cluster sum of squares: 6.5873
Number of iterations: 2
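
The centroid values above are on a standardized scale rather than in dollars or years. The sketch below shows one way to reproduce them, under two assumptions inferred from the output rather than from StatLang's documentation: variables are z-scored with the population standard deviation (divisor n), and Cluster 2 contains the two highest-paid employees, Carol and Henry.

```python
import statistics

salary = [75000, 55000, 80000, 45000, 70000,
          60000, 50000, 85000, 62000, 59000]
age = [30, 25, 35, 28, 32, 27, 24, 40, 29, 31]


def zscores(values):
    """Standardize with the population standard deviation (assumed)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


z_salary = zscores(salary)
z_age = zscores(age)

# Zero-based row positions of Carol and Henry (assumed Cluster 2 members).
members = [2, 7]
centroid_salary = statistics.fmean([z_salary[i] for i in members])
centroid_age = statistics.fmean([z_age[i] for i in members])

print(f"salary centroid: {centroid_salary:.4f}")
print(f"age centroid:    {centroid_age:.4f}")
```

To read a centroid in original units, multiply by the variable's standard deviation and add its mean.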



## 📈 PROC NPAR1WAY: Nonparametric Tests

PROC NPAR1WAY performs nonparametric tests for comparing groups without assuming normal distributions.

### Key Features:
- **Mann-Whitney U Test**: Compare two independent groups
- **Kruskal-Wallis Test**: Compare multiple groups
- **Rank-based Analysis**: Works with any distribution shape
- **Post-hoc Comparisons**: Identify which groups differ significantly

In [1]:
/* Demo 8: PROC NPAR1WAY - Nonparametric Tests */
proc npar1way data=work.employees;
    class department;
    var salary;
run;


PROC NPAR1WAY - Nonparametric One-Way Analysis
Class variable: department
Test type: WILCOXON

Variable: salary
--------------------------------------------------
Summary Statistics by Group
----------------------------------------
Group           N        Mean         Median       Std Dev     
-----------------------------------------------------------------
Engineering     4        77500.0000   77500.0000   6454.9722   
Marketing       3        58000.0000   59000.0000   2645.7513   
Sales           2        47500.0000   47500.0000   3535.5339   
HR              1        62000.0000   62000.0000   nan         

Kruskal-Wallis Test
------------------------------
Groups: Engineering, Marketing, Sales, HR
Sample sizes: [4, 3, 2, 1]

Test Statistics
--------------------
H statistic: 8.1818
Degrees of freedom: 3
p-value: 0.042400
Effect size (eta²): 0.8636

Conclusion: p < 0.05 (significant)
Note: Significant difference detected. Consider post-hoc pairwise comparisons.
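
The H statistic can be recomputed from rank sums. This is a minimal sketch that assumes there are no tied salaries (true for this dataset), so no tie correction is applied. The eta-squared formula used, (H - k + 1) / (n - k), is one common definition and happens to match the reported value.

```python
groups = {
    "Engineering": [75000, 80000, 70000, 85000],
    "Marketing": [55000, 60000, 59000],
    "Sales": [45000, 50000],
    "HR": [62000],
}

# Rank all observations together; with no ties, rank = sorted position.
pooled = sorted(v for values in groups.values() for v in values)
rank = {value: position + 1 for position, value in enumerate(pooled)}

n_total = len(pooled)
k = len(groups)

rank_sum_term = sum(
    sum(rank[v] for v in values) ** 2 / len(values)
    for values in groups.values()
)
h_stat = 12 / (n_total * (n_total + 1)) * rank_sum_term - 3 * (n_total + 1)
eta_squared = (h_stat - k + 1) / (n_total - k)

print(f"H statistic: {h_stat:.4f}")
print(f"Degrees of freedom: {k - 1}")
print(f"Effect size (eta squared): {eta_squared:.4f}")
```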






## 📊 PROC TTEST: T-Tests

PROC TTEST performs t-tests for comparing means: between two independent groups, between paired observations, or against a specified value.

### Key Features:
- **Independent Samples**: Compare means between two groups
- **Paired Samples**: Compare means for related observations
- **One-sample Tests**: Test if mean differs from specified value
- **Equal/Unequal Variances**: Welch's t-test for unequal variances

In [1]:
/* Demo 9: PROC TTEST - Independent Samples T-Test */
proc ttest data=work.employees;
    class remote;
    var salary;
run;


PROC TTEST - T-Test Analysis
Test type: INDEPENDENT

Variable: salary
--------------------------------------------------
Independent Samples T-Test

Group Statistics
------------------------------
Group           N        Mean         Std Dev      Std Error   
-----------------------------------------------------------------
1.0             5        60400.0000   12778.8888   5714.8928   
0.0             5        67800.0000   13663.8208   6110.6464   

Levene's Test for Equality of Variances
---------------------------------------------
F statistic: 0.0011
p-value: 0.973800

T-Test Results
--------------------
Test                      t          df       p-value      Mean Diff   
---------------------------------------------------------------------------
Equal variances           -0.8845    8        0.402242     -7400.0000  
Unequal variances         -0.8845    8.0      0.402354     -7400.0000  

Effect size (Cohen's d): -0.5594

Recommended test: Equal variances (Student's t-test)
Con
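
The equal-variance t statistic, Cohen's d, and the Welch degrees of freedom can be recomputed from the two groups. The sketch below uses only the standard library. It does not compute p-values, which would need a t-distribution CDF. The variable names `remote` and `onsite` are labels chosen here for `remote = 1` and `remote = 0`.

```python
import math
import statistics

remote = [75000, 45000, 70000, 50000, 62000]   # remote = 1
onsite = [55000, 80000, 60000, 85000, 59000]   # remote = 0

n1, n2 = len(remote), len(onsite)
var1, var2 = statistics.variance(remote), statistics.variance(onsite)
mean_diff = statistics.mean(remote) - statistics.mean(onsite)

# Student's t-test: pool the two sample variances.
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_stat = mean_diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
cohens_d = mean_diff / math.sqrt(pooled_var)

# Welch-Satterthwaite degrees of freedom for the unequal-variance test.
se1, se2 = var1 / n1, var2 / n2
welch_df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

print(f"Mean difference: {mean_diff:.4f}")
print(f"t (equal variances): {t_stat:.4f}")
print(f"Cohen's d: {cohens_d:.4f}")
print(f"Welch df: {welch_df:.4f}")
```

Because both groups have five observations, the Welch t statistic equals the pooled one. Only the degrees of freedom differ slightly, which is why the two p-values in the table are nearly identical.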

## 📈 PROC LOGIT: Logistic Regression

PROC LOGIT performs logistic regression for modeling binary and categorical outcomes.

### Key Features:
- **Binary Logistic Regression**: Model yes/no outcomes
- **Odds Ratios**: Interpretable effect measures
- **Model Diagnostics**: Assess model fit and assumptions
- **Statistical Significance**: P-values and confidence intervals

In [1]:
/* Demo 10: PROC LOGIT - Logistic Regression */
proc logit data=work.employees;
    model remote = salary age experience;
run;


PROC LOGIT - Logistic Regression Analysis
Dependent variable: remote
Independent variables: salary, age, experience
Link function: logit

Model Information
------------------------------
Link function: logit
Number of observations: 10
Number of parameters: 4

Model Fit Statistics
------------------------------
Log-likelihood: 0.0000
AIC: 8.0000
BIC: 9.2103
Pseudo R-squared: 1.0000

Parameter Estimates
------------------------------
Variable        Estimate     Std Err      Z          P>|Z|      [95% Conf. Interval]     
-----------------------------------------------------------------------------------------------
const           -12724.7374  nan          nan        nan        [nan     , nan     ]
salary          0.0144       46042382739508512.0000 0.0000     1.0000     [-90241411931845312.0000, 90241411931845312.0000]
age             553.6437     152192623354228736000.0000 0.0000     1.0000     [-298292060486957858816.0000, 298292060486957858816.0000]
experience      -985.8240    nan 
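
The AIC and BIC values follow from the log-likelihood and parameter count using the standard definitions, as the sketch below shows. A log-likelihood of 0 with a pseudo R-squared of 1 is what one would expect if the predictors separate the two outcomes perfectly in these ten rows. That would also be consistent with the enormous or missing standard errors, so the estimates above should probably not be interpreted.

```python
import math

log_likelihood = 0.0   # from the Model Fit Statistics above
n_params = 4           # intercept plus three predictors
n_obs = 10

aic = 2 * n_params - 2 * log_likelihood
bic = n_params * math.log(n_obs) - 2 * log_likelihood

print(f"AIC: {aic:.4f}")
print(f"BIC: {bic:.4f}")
```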



## 📊 PROC FREQ: Chi-square Tests

PROC FREQ creates cross-tabulations and performs chi-square tests for categorical data analysis.

### Key Features:
- **Cross-tabulations**: Two-way frequency tables
- **Chi-square Tests**: Test for independence between variables
- **Expected Frequencies**: Compare observed vs expected counts
- **Effect Size Measures**: Cramér's V and other association measures

In [1]:
/* Demo 11: Enhanced PROC FREQ - Chi-square Tests */
proc freq data=work.employees;
    tables department*satisfaction / chisq;
run;


PROC FREQ - Cross-tabulation: department * satisfaction
Options: chisq

department      | High | Low | Medium | Total
---------------------------------------------
Engineering     | 3    | 0   | 1      | 4    
HR              | 0    | 0   | 1      | 1    
Marketing       | 2    | 0   | 1      | 3    
Sales           | 0    | 2   | 0      | 2    
Total           | 5    | 2   | 3      | 10   

Statistics:
  Engineering: 4 (40.0%)
  HR: 1 (10.0%)
  Marketing: 3 (30.0%)
  Sales: 2 (20.0%)

Chi-Square Test of Independence
----------------------------------------
Chi-square statistic: 12.4444
Degrees of freedom: 6
p-value: 0.052757

Expected Frequencies
-------------------------
Minimum expected frequency: 0.20

Effect size (Cramér's V): 0.7888

Conclusion: p >= 0.05 (not significant)
No significant association between variables.
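
The chi-square statistic, minimum expected frequency, and Cramér's V can be recomputed directly from the cross-tabulation. This is an independent sketch using only the standard library, so it omits the p-value. A minimum expected count of 0.20 is far below the usual rule of thumb of 5, so the chi-square approximation is rough for a table this sparse.

```python
import math

table = {
    "Engineering": {"High": 3, "Low": 0, "Medium": 1},
    "HR":          {"High": 0, "Low": 0, "Medium": 1},
    "Marketing":   {"High": 2, "Low": 0, "Medium": 1},
    "Sales":       {"High": 0, "Low": 2, "Medium": 0},
}
rows = list(table)
cols = ["High", "Low", "Medium"]

row_total = {r: sum(table[r].values()) for r in rows}
col_total = {c: sum(table[r][c] for r in rows) for c in cols}
n = sum(row_total.values())

chi2 = 0.0
min_expected = math.inf
for r in rows:
    for c in cols:
        # Expected count under independence of rows and columns.
        expected = row_total[r] * col_total[c] / n
        min_expected = min(min_expected, expected)
        chi2 += (table[r][c] - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(cols) - 1)
cramers_v = math.sqrt(chi2 / (n * min(len(rows) - 1, len(cols) - 1)))

print(f"Chi-square statistic: {chi2:.4f}")
print(f"Degrees of freedom: {dof}")
print(f"Minimum expected frequency: {min_expected:.2f}")
print(f"Cramér's V: {cramers_v:.4f}")
```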



## 📈 PROC TIMESERIES: Time Series Analysis

PROC TIMESERIES performs time series analysis including decomposition and ARIMA modeling.

### Key Features:
- **Seasonal Decomposition**: Separate trend, seasonal, and residual components
- **ARIMA Modeling**: Fit autoregressive integrated moving average models
- **ACF/PACF Plots**: Identify model parameters
- **Forecasting**: Predict future values with confidence intervals

In [1]:
/* First, create time series data */
data work.sales;
    input date $ sales marketing_spend seasonality;
    datalines;
2020-01-01 1000 5000 0
2020-02-01 1100 5200 50
2020-03-01 1200 4800 87
2020-04-01 1300 5500 100
2020-05-01 1250 5100 87
2020-06-01 1400 5300 50
2020-07-01 1350 4900 0
2020-08-01 1500 5600 -50
2020-09-01 1450 5200 -87
2020-10-01 1600 5400 -100
2020-11-01 1550 5000 -87
2020-12-01 1700 5800 -50
2021-01-01 1050 5100 0
2021-02-01 1150 5300 50
2021-03-01 1250 4900 87
2021-04-01 1350 5600 100
2021-05-01 1300 5200 87
2021-06-01 1450 5400 50
2021-07-01 1400 5000 0
2021-08-01 1550 5700 -50
2021-09-01 1500 5300 -87
2021-10-01 1650 5500 -100
2021-11-01 1600 5100 -87
2021-12-01 1750 5900 -50
;
run;


Saved dataset work.sales to library work


date,sales,marketing_spend,seasonality
2020-01-01,1000.0,5000.0,0.0
2020-02-01,1100.0,5200.0,50.0
2020-03-01,1200.0,4800.0,87.0
2020-04-01,1300.0,5500.0,100.0
2020-05-01,1250.0,5100.0,87.0


In [1]:
/* Demo 12: PROC TIMESERIES - Seasonal Decomposition */
proc timeseries data=work.sales type=decompose;
    var sales;
    time date;
run;


PROC TIMESERIES - Time Series Analysis
Analysis type: DECOMPOSE

Variable: sales
--------------------------------------------------
Seasonal Decomposition

Component Summary Statistics
----------------------------------------
Component       Mean         Std Dev      Min          Max         
-----------------------------------------------------------------
Original        1391.6667    205.1864     1000.0000    1750.0000   
Trend           1391.6667    15.0231      1368.7500    1414.5833   
Seasonal        -0.0000      189.3886     -343.7500    310.4167    
Residual        0.0000       0.0000       0.0000       0.0000      

Variance Explained
-------------------------
Trend: 0.5%
Seasonal: 85.2%
Residual: 0.0%




## 🌳 PROC TREE: Decision Trees

PROC TREE builds decision trees for classification and regression tasks.

### Key Features:
- **Classification Trees**: Predict categorical outcomes
- **Regression Trees**: Predict continuous outcomes
- **Feature Importance**: Identify most important predictors
- **Tree Visualization**: Understand decision rules and splits

In [1]:
/* Demo 13: PROC TREE - Decision Tree Classification */
proc tree data=work.employees maxdepth=3;
    model satisfaction = salary age experience;
run;


PROC TREE - Decision Tree Analysis
Target variable: satisfaction
Feature variables: salary, age, experience
Max depth: 5

Decision Tree Classification Results
----------------------------------------
Training accuracy: 0.0000
Tree depth: 3
Number of leaves: 4

Feature Importance
--------------------
experience      0.6863    
age             0.3137    
salary          0.0000    

Classification Report
-------------------------
No valid classes found in test data for classification report.



## 🌲 PROC FOREST: Random Forests

PROC FOREST builds random forest ensembles for improved prediction accuracy.

### Key Features:
- **Ensemble Learning**: Combine multiple decision trees
- **Bootstrap Aggregating**: Reduce overfitting with bagging
- **Feature Sampling**: Random feature selection at each split
- **Out-of-bag Error**: Estimate generalization performance

In [1]:
/* Demo 14: PROC FOREST - Random Forest Classification */
proc forest data=work.employees ntrees=50;
    model satisfaction = salary age experience;
run;


PROC FOREST - Random Forest Analysis
Target variable: satisfaction
Feature variables: salary, age, experience
Number of trees: 100

Random Forest Classification Results
----------------------------------------
Training accuracy: 0.0000
Out-of-bag score: 0.5000
Number of trees: 100

Feature Importance
--------------------
salary          0.3787    
experience      0.3258    
age             0.2955    



## 🚀 PROC BOOST: Gradient Boosting

PROC BOOST performs gradient boosting for high-performance predictive modeling.

### Key Features:
- **Sequential Learning**: Each tree corrects errors of previous trees
- **Gradient Descent**: Optimize loss function iteratively
- **Feature Importance**: Identify key predictive variables
- **High Performance**: Often performs very well on tabular data, though results depend on tuning

In [1]:
/* Demo 15: PROC BOOST - Gradient Boosting */
proc boost data=work.employees ntrees=50 learningrate=0.1;
    model satisfaction = salary age experience;
run;


PROC BOOST - Gradient Boosting Analysis
Target variable: satisfaction
Feature variables: salary, age, experience
Number of estimators: 100
Learning rate: 0.1

Gradient Boosting Classification Results
---------------------------------------------
Training accuracy: 0.0000
Number of estimators: 100
Learning rate: 0.1

Feature Importance
--------------------
experience      0.4030    
salary          0.3760    
age             0.2210    



## 🤖 PROC LANGUAGE: Built-in LLM Integration

PROC LANGUAGE integrates large language models for AI-powered text analysis and generation.

### Key Features:
- **Built-in LLM**: Uses Hugging Face transformers with distilgpt2 model
- **No Setup Required**: Works out of the box with pip install
- **Multiple Modes**: Text generation, Q&A, summarization, analysis
- **Data Integration**: Analyze datasets with natural language queries

In [1]:
/* Demo 16: PROC LANGUAGE - Built-in LLM Integration */
proc language prompt='What is the meaning of life?';
run;


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


PROC LANGUAGE - LLM Analysis
Model: distilgpt2
Mode: GENERATE

Text Generation
--------------------
Prompt: What is the meaning of life?

Generated Text:
I have a problem with the idea of being a person – I feel like I'm not a person, but I feel as if that's what I want to be.
At the same time, I feel that I



### Additional PROC LANGUAGE Examples

PROC LANGUAGE supports multiple modes for different AI tasks:

**Model**: Uses `distilgpt2` - a lightweight, fast transformer model from Hugging Face

**Modes Available**:
- `generate` (default): Text generation and completion
- `qna`: Question & Answer with optional context
- `summarize`: Text summarization
- `analyze`: Text analysis and insights


In [1]:
/* Text Generation Mode */
proc language mode=generate prompt='The future of artificial intelligence is';
run;


PROC LANGUAGE - LLM Analysis
Model: distilgpt2
Mode: GENERATE

Text Generation
--------------------
Prompt: The future of artificial intelligence is

Generated Text:
set for 20 years. It is the most important step for human machines that will make it possible to develop artificial intelligence.



In [1]:
/* Question & Answer Mode */
proc language mode=qna prompt='What is machine learning?' context='Machine learning is a subset of artificial intelligence that focuses on algorithms';
run;


PROC LANGUAGE - LLM Analysis
Model: distilgpt2
Mode: QNA

Question & Answer
--------------------
Question: What is machine learning?

Answer: Machine learning is a new field of research. It has been a long time since I was a student. I was a student



In [1]:
/* Text Summarization Mode */
proc language mode=summarize prompt='Machine learning is a subset of artificial intelligence that focuses on algorithms and statistical models that enable computer systems to improve their performance on a specific task through experience. It involves training models on data to make predictions or decisions without being explicitly programmed for every scenario';
run;


PROC LANGUAGE - LLM Analysis
Model: distilgpt2
Mode: SUMMARIZE

Text Summarization
--------------------
Original text: Machine learning is a subset of artificial intelligence that focuses on algorithms and statistical m...

Summary: The machine learning approach is a subset of artificial intelligence



In [1]:
/* Text Analysis Mode */
proc language mode=analyze prompt='The rapid advancement of artificial intelligence has transformed many industries, from healthcare to finance, creating both opportunities and challenges for businesses and society';
run;


PROC LANGUAGE - LLM Analysis
Model: distilgpt2
Mode: ANALYZE

Text Analysis
--------------------
Text: The rapid advancement of artificial intelligence has transformed many industries, from healthcare to...

Analysis: We have seen a number of technological advances in the past few decades, but the rapid technological advances in the past few decades have not only increased the value



In [1]:
/* Analysis mode with a dataset-oriented prompt (no DATA= or VAR statement is passed) */
proc language mode=analyze prompt='Analyze the employee dataset and provide insights about salary patterns';
run;


PROC LANGUAGE - LLM Analysis
Model: distilgpt2
Mode: ANALYZE

Text Analysis
--------------------
Text: Analyze the employee dataset and provide insights about salary patterns

Analysis: Analyze the employee dataset and provide insights about salary patterns
Analyze the employee dataset and provide insights about salary patterns
Analyze the employee dataset and provide



## 🗃️ PROC SQL: SQL Query Processing

PROC SQL runs SQL queries against your datasets, using DuckDB as a fast in-memory backend.

### Key Features:
- **DuckDB Backend**: Fast, in-memory SQL processing
- **Automatic Dataset Registration**: All datasets available as SQL tables
- **Standard SQL Syntax**: SELECT, FROM, WHERE, GROUP BY, ORDER BY, JOIN
- **CREATE TABLE**: Create new datasets from query results

In [1]:
/* Demo 17: PROC SQL - Basic SELECT Query */
proc sql;
    select avg(sales) as avg_sales, avg(marketing_spend) as avg_marketing
    from work_sales
    where seasonality > 0
run;


PROC SQL - SQL Query Processing
Backend: DuckDB

Statement 1:
--------------------
select avg(sales) as avg_sales, avg(marketing_spend) as avg_marketing from work_sales where seasonality > 0

Query Results:
Rows: 1, Columns: 2

avg_sales | avg_marketing
-------------------------
1275.0    | 5230.0       
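
The same query can be checked outside StatLang. The sketch below uses Python's built-in `sqlite3` module as a stand-in for DuckDB (an assumption made so the example needs no extra packages) and rebuilds `work_sales` from the DATALINES values shown earlier.

```python
import sqlite3

sales = {
    2020: [1000, 1100, 1200, 1300, 1250, 1400,
           1350, 1500, 1450, 1600, 1550, 1700],
    2021: [1050, 1150, 1250, 1350, 1300, 1450,
           1400, 1550, 1500, 1650, 1600, 1750],
}
marketing = {
    2020: [5000, 5200, 4800, 5500, 5100, 5300,
           4900, 5600, 5200, 5400, 5000, 5800],
    2021: [5100, 5300, 4900, 5600, 5200, 5400,
           5000, 5700, 5300, 5500, 5100, 5900],
}
# The seasonality column repeats the same 12 values in both years.
seasonality = [0, 50, 87, 100, 87, 50, 0, -50, -87, -100, -87, -50]

sales_rows = [
    (f"{year}-{month + 1:02d}-01", sales[year][month],
     marketing[year][month], seasonality[month])
    for year in (2020, 2021)
    for month in range(12)
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table work_sales "
    "(date text, sales real, marketing_spend real, seasonality real)"
)
conn.executemany("insert into work_sales values (?, ?, ?, ?)", sales_rows)

avg_sales, avg_marketing = conn.execute(
    "select avg(sales) as avg_sales, avg(marketing_spend) as avg_marketing "
    "from work_sales where seasonality > 0"
).fetchone()
conn.close()

print(avg_sales, avg_marketing)
```

Ten of the 24 rows have `seasonality > 0` (February through June of each year), and only those rows enter the averages.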



In [1]:
/* CREATE TABLE with Aggregation */
proc sql;
    create table work_summary as
    select avg(sales) as avg_sales, 
           avg(marketing_spend) as avg_marketing,
           count(*) as count
    from work_sales
run;


PROC SQL - SQL Query Processing
Backend: DuckDB

Statement 1:
--------------------
create table work_summary as select avg(sales) as avg_sales, avg(marketing_spend) as avg_marketing, count(*) as count from work_sales

Created table: work_summary
Rows: 1, Columns: 3

avg_sales          | avg_marketing | count
------------------------------------------
1391.6666666666667 | 5283.333333333333 | 24.0 



avg_sales,avg_marketing,count
1391.6666666666667,5283.333333333333,24


In [1]:
/* Complex Query with JOIN */
proc sql;
    select e.name, e.department, e.salary, s.avg_sales
    from work_employees e
    cross join work_summary s
    where e.salary > 60000
    order by e.salary desc
run;


PROC SQL - SQL Query Processing
Backend: DuckDB

Statement 1:
--------------------
select e.name, e.department, e.salary, s.avg_sales from work_employees e cross join work_summary s where e.salary > 60000 order by e.salary desc

Query Results:
Rows: 5, Columns: 4

name  | department  | salary  | avg_sales         
--------------------------------------------------
Henry | Engineering | 85000.0 | 1391.6666666666667
Carol | Engineering | 80000.0 | 1391.6666666666667
Alice | Engineering | 75000.0 | 1391.6666666666667
Eve   | Engineering | 70000.0 | 1391.6666666666667
Irene | HR          | 62000.0 | 1391.6666666666667



### PROC SQL Features:

✅ **DuckDB Backend**: Fast, in-memory SQL processing  
✅ **Automatic Dataset Registration**: All datasets available as SQL tables  
✅ **Standard SQL Syntax**: SELECT, FROM, WHERE, GROUP BY, ORDER BY, JOIN  
✅ **CREATE TABLE**: Create new datasets from query results  
✅ **Multi-statement Support**: Multiple SQL statements in one PROC  
✅ **Flattened Naming**: `work.sales` becomes `work_sales` for SQL access  

**Supported SQL Operations:**
- SELECT queries with aggregation
- CREATE TABLE AS SELECT
- JOINs (INNER, LEFT, RIGHT, FULL, CROSS)
- WHERE conditions and filtering
- GROUP BY and ORDER BY
- Subqueries and complex expressions

**Example Dataset Structure:**
- `work.sales`: `date`, `sales`, `marketing_spend`, `seasonality`
- `work.employees`: `employee_id`, `name`, `department`, `salary`, `age`, `experience`, `performance_score`, `satisfaction`, `remote`


## 📈 PROC MEANS: Descriptive Statistics

PROC MEANS computes descriptive statistics for numeric variables, summarizing their central tendency and variability, optionally by group.

### Key Features:
- **Comprehensive Statistics**: Mean, median, standard deviation, min/max, quartiles
- **CLASS Variables**: Group statistics by categorical variables  
- **Multiple Variables**: Analyze several variables simultaneously
- **Customizable Output**: Control which statistics are displayed


In [1]:
/* Demo 2: Basic Descriptive Statistics */
proc means data=work.employees;
    var salary;
    class department;
run;


PROC MEANS - Grouped Analysis
Analysis Variables: salary
Grouping Variables: department

department  | salary_count | salary_mean | salary_std         | salary_min | salary_max
---------------------------------------------------------------------------------------
Engineering | 4            | 77500.0     | 6454.972243679028  | 70000.0    | 85000.0   
HR          | 1            | 62000.0     | nan                | 62000.0    | 62000.0   
Marketing   | 3            | 58000.0     | 2645.7513110645905 | 55000.0    | 60000.0   
Sales       | 2            | 47500.0     | 3535.5339059327375 | 45000.0    | 50000.0   
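
The grouped statistics can be reproduced with a few lines of standard-library Python. This is a sketch of the computation, not StatLang's code. It also shows why HR reports `nan` for the standard deviation: a sample standard deviation needs at least two observations, and HR has one.

```python
import math
import statistics
from collections import defaultdict

records = [
    ("Engineering", 75000), ("Marketing", 55000), ("Engineering", 80000),
    ("Sales", 45000), ("Engineering", 70000), ("Marketing", 60000),
    ("Sales", 50000), ("Engineering", 85000), ("HR", 62000),
    ("Marketing", 59000),
]

by_department = defaultdict(list)
for department, salary in records:
    by_department[department].append(salary)

summary = {}
for department in sorted(by_department):
    values = by_department[department]
    # stdev() raises for a single value, so report NaN instead.
    std = statistics.stdev(values) if len(values) > 1 else math.nan
    summary[department] = {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": std,
        "min": min(values),
        "max": max(values),
    }

for department, stats in summary.items():
    print(department, stats)
```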


## 📊 PROC FREQ: Frequency Analysis

PROC FREQ creates frequency tables and performs statistical tests on categorical data. It's essential for understanding the distribution of categorical variables and testing associations.

### Key Features:
- **Frequency Tables**: Count and percentage distributions
- **Cross-tabulations**: Two-way frequency tables
- **Chi-square Tests**: Test for independence between categorical variables
- **Missing Data Handling**: Control how missing values are treated


In [1]:
/* Demo 3: Frequency Analysis */
proc freq data=work.employees;
    tables department;
run;


PROC FREQ - Frequency Table for department

Value                Frequency    Percent    Cumulative Percent
------------------------------------------------------------
Engineering          4            40.0       40.0              
HR                   1            10.0       50.0              
Marketing            3            30.0       80.0              
Sales                2            20.0       100.0             
------------------------------------------------------------
Total                10           100.0      100.0             
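
For readers who want to see the arithmetic behind the table, the sketch below rebuilds the frequency, percent, and cumulative percent columns in plain Python. The function name `frequency_table` is hypothetical, and the department list is simply the counts from the table above expanded into individual values.

```python
from collections import Counter

# Department of each of the 10 employees (counts taken from the table above).
departments = ["Engineering"] * 4 + ["HR"] + ["Marketing"] * 3 + ["Sales"] * 2

def frequency_table(values):
    """Return (value, frequency, percent, cumulative percent) rows, sorted by value."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]
        rows.append((value, counts[value],
                     100.0 * counts[value] / total,
                     100.0 * cumulative / total))
    return rows

table = frequency_table(departments)
for value, freq, pct, cum in table:
    print(f"{value:<20} {freq:<12} {pct:<10.1f} {cum:.1f}")
```

The cumulative column is accumulated on the integer counts and converted to a percentage at the end, which avoids floating-point drift in the running total.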


In [1]:
/* Demo 5: Data Manipulation - Create New Variables */
data work.enhanced_employees;
    set work.employees;
    salary_category = ifn(salary > 70000, 'High', ifn(salary > 55000, 'Medium', 'Low'));
    annual_bonus = salary * 0.1;
    total_compensation = salary + annual_bonus;
run;


Evaluating IFN expression: ifn(salary > 70000, 'High', ifn(salary > 55000, 'Medium', 'Low'))
IFN parsed - condition: salary > 70000, true: High, false: ifn(salary > 55000, 'Medium', 'Low'
Condition result: 0     True
1    False
2     True
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: salary, dtype: bool
Handling nested IFN
Nested IFN - cond2: salary > 55000, val2: Medium, val3: Low
Final IFN result: 0      High
1       Low
2      High
3       Low
4    Medium
5    Medium
6       Low
7      High
8    Medium
9    Medium
dtype: object
Saved dataset work.enhanced_employees to library work


employee_id,name,department,salary,age,experience,performance_score,satisfaction,remote,salary_category,annual_bonus,total_compensation
1.0,Alice,Engineering,75000.0,30.0,5.0,8.5,High,1.0,High,7500.0,82500.0
2.0,Bob,Marketing,55000.0,25.0,2.0,7.2,Medium,0.0,Low,5500.0,60500.0
3.0,Carol,Engineering,80000.0,35.0,8.0,9.1,High,0.0,High,8000.0,88000.0
4.0,David,Sales,45000.0,28.0,3.0,6.8,Low,1.0,Low,4500.0,49500.0
5.0,Eve,Engineering,70000.0,32.0,6.0,7.8,Medium,1.0,Medium,7000.0,77000.0
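
Both comparisons in the nested expression are strict, which is why Eve (exactly 70000) lands in `Medium` and Bob (exactly 55000) lands in `Low`. In SAS itself, `IFN` returns numeric values and `IFC` is its character counterpart; the log above shows that StatLang's `ifn` accepts character results. The following Python sketch mirrors the same branching; the function name `salary_category` is only for illustration.

```python
def salary_category(salary):
    """Mirror ifn(salary > 70000, 'High', ifn(salary > 55000, 'Medium', 'Low'))."""
    if salary > 70000:
        return "High"
    return "Medium" if salary > 55000 else "Low"

# Salaries of the five preview rows above.
preview = {"Alice": 75000, "Bob": 55000, "Carol": 80000, "David": 45000, "Eve": 70000}
categories = {name: salary_category(s) for name, s in preview.items()}
print(categories)
```

If the boundaries should be inclusive, change `>` to `>=` in both places so the two thresholds stay consistent.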


In [1]:
/* Demo 6: Display Enhanced Dataset */
proc print data=work.enhanced_employees;
run;


PROC PRINT - Dataset Contents
Observations: 10
Variables: 12

employee_id | name     | department  | salary   | age      | experience | performance_score | satisfaction | remote   | salary_category | annual_bonus | total_compensation
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.0         | Alice    | Engineering | 75000.0  | 30.0     | 5.0        | 8.5               | High         | 1.0      | High            | 7500.0       | 82500.0           
2.0         | Bob      | Marketing   | 55000.0  | 25.0     | 2.0        | 7.2               | Medium       | 0.0      | Low             | 5500.0       | 60500.0           
3.0         | Carol    | Engineering | 80000.0  | 35.0     | 8.0        | 9.1               | High         | 0.0      | High            | 8000.0       | 88000.0           
4.0         | David    | Sales       | 45000.0  | 28.0     | 3.0        | 6.8 

In [1]:
/* Demo 7: Conditional Processing - WHERE Clause */
data work.high_earners;
    set work.enhanced_employees;
    where salary > 70000;
run;


Saved dataset work.high_earners to library work


employee_id,name,department,salary,age,experience,performance_score,satisfaction,remote,salary_category,annual_bonus,total_compensation
1.0,Alice,Engineering,75000.0,30.0,5.0,8.5,High,1.0,High,7500.0,82500.0
3.0,Carol,Engineering,80000.0,35.0,8.0,9.1,High,0.0,High,8000.0,88000.0
8.0,Henry,Engineering,85000.0,40.0,12.0,9.5,High,0.0,High,8500.0,93500.0


In [1]:
/* Demo 8: Display Filtered Results */
proc print data=work.high_earners;
run;


PROC PRINT - Dataset Contents
Observations: 3
Variables: 12

employee_id | name     | department  | salary   | age      | experience | performance_score | satisfaction | remote   | salary_category | annual_bonus | total_compensation
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1.0         | Alice    | Engineering | 75000.0  | 30.0     | 5.0        | 8.5               | High         | 1.0      | High            | 7500.0       | 82500.0           
3.0         | Carol    | Engineering | 80000.0  | 35.0     | 8.0        | 9.1               | High         | 0.0      | High            | 8000.0       | 88000.0           
8.0         | Henry    | Engineering | 85000.0  | 40.0     | 12.0       | 9.5               | High         | 0.0      | High            | 8500.0       | 93500.0           


In [1]:
/* Demo 9: PROC SORT - Sort Data */
proc sort data=work.enhanced_employees;
    by department descending salary;
run;


PROC SORT - Dataset Sorted
BY Variables: department (ASC), salary (DESC)
Observations: 10
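
The `BY department descending salary` ordering can be pictured as a sort on a two-part key. This is a minimal Python sketch of that idea, not StatLang's implementation; the (department, salary) pairs are reconstructed from the outputs in this walkthrough.

```python
# (department, salary) pairs reconstructed from the outputs in this walkthrough.
rows = [
    ("Engineering", 75000), ("Marketing", 55000), ("Engineering", 80000),
    ("Sales", 45000), ("Engineering", 70000), ("Marketing", 60000),
    ("Sales", 50000), ("Engineering", 85000), ("HR", 62000), ("Marketing", 59000),
]

# Ascending department, then descending salary (negating a numeric key flips its order).
ordered = sorted(rows, key=lambda row: (row[0], -row[1]))
for department, salary in ordered:
    print(f"{department:<12} {salary}")
```

Negation only works for numeric keys. For a descending character key, two stable sorts (secondary key first, primary key second) give the same result.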



## StatLang v2.0 New Statistical Procedures

The following sections demonstrate the powerful new statistical procedures added in v2.0:


### PROC UNIVARIATE - Distribution Diagnostics

PROC UNIVARIATE provides comprehensive univariate analysis including descriptive statistics, distribution shape, and normality tests.


In [1]:
/* PROC UNIVARIATE - Detailed distribution analysis */
proc univariate data=work.employees;
    var salary age;
run;


PROC UNIVARIATE - Univariate Analysis

Variable: salary
----------------------------------------
Moments
  N                                  10
  Mean                       64100.000000
  Std Deviation              13067.772062
  Variance                   170766666.666667
  Skewness                     0.205420
  Uncorrected SS             42625000000.000000
  Corrected SS               1536900000.000000
  Coeff Variation             20.386540

Basic Statistical Measures
  Location                    Variability
  Mean      64100.000000     Std Deviation     13067.772062
  Median    61000.000000     Variance          170766666.666667
  Mode      45000.000000     Range             40000.000000
                           Interquartile Range 17750.000000

Tests for Location: Mu0=0
  Test           -Statistic-    -----p Value------
  Student's t     15.511596    Pr > |t|      0.000000
  Sign                   10    Pr >= |M|     0.000000
  Signed Rank            55    Pr >= |S|     1.000
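
Several of the moments above are easy to reproduce by hand, which is a useful sanity check. The sketch below recomputes them with Python's standard library from the ten salaries (reconstructed from the outputs in this walkthrough). Skewness and the interquartile range are left out because their values depend on the estimator convention, and this walkthrough does not state which one StatLang uses.

```python
import math
import statistics

salaries = [75000.0, 55000.0, 80000.0, 45000.0, 70000.0,
            60000.0, 50000.0, 85000.0, 62000.0, 59000.0]

n = len(salaries)
mean = statistics.mean(salaries)
variance = statistics.variance(salaries)   # sample variance, divisor n - 1
std = math.sqrt(variance)

moments = {
    "N": n,
    "Mean": mean,
    "Std Deviation": std,
    "Variance": variance,
    "Uncorrected SS": sum(x * x for x in salaries),
    "Corrected SS": sum((x - mean) ** 2 for x in salaries),
    "Coeff Variation": 100.0 * std / mean,
    "Median": statistics.median(salaries),
    "Range": max(salaries) - min(salaries),
    "Student's t (Mu0=0)": mean / (std / math.sqrt(n)),
}
for name, value in moments.items():
    print(f"{name:<22}{value:>24.6f}")
```

The t statistic tests whether the mean differs from zero, so for salary data it is large by construction and says little on its own.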

### PROC CORR - Correlation Analysis

PROC CORR computes correlation matrices with options for Pearson, Spearman, and Kendall correlations.


In [1]:
/* PROC CORR - Correlation analysis */
proc corr data=work.employees method=pearson;
    var salary age;
run;


PROC CORR - Correlation Analysis
Method: Pearson

Correlation Matrix
----------------------------------------
Variable        salary       age
--------------------------------
salary          1.0000    0.8516
age             0.8516    1.0000

Sample Size Information
------------------------------
Number of observations used: 2

Correlation method: Pearson
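
To make the Pearson coefficient concrete, here is a small Python sketch of the formula (co-variation divided by the product of the two spreads). It uses only the six employees whose salary and age are printed in this walkthrough, so its result will not equal the 0.8516 reported for all ten rows. The function name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: summed cross-deviations over the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# The six employees whose salary and age are printed in this walkthrough.
salary = [75000, 55000, 80000, 45000, 70000, 85000]
age = [30, 25, 35, 28, 32, 40]

r = pearson_r(salary, age)
print(f"r = {r:.4f}")
```

Because the coefficient is scale-free, measuring salary in dollars or in thousands of dollars gives the same `r`.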



### PROC FACTOR - Principal Component Analysis

PROC FACTOR performs PCA and factor analysis for dimensionality reduction and exploratory data analysis.


In [1]:
/* PROC FACTOR - Principal Component Analysis */
proc factor data=work.employees method=pca nfactors=2;
    var salary age;
run;


Principal Component Analysis
----------------------------------------

Eigenvalues and Variance Explained
----------------------------------------
Component    Eigenvalue   Proportion   Cumulative  
--------------------------------------------------
1            2.0573       0.9258       0.9258      
2            0.1649       0.0742       1.0000      

Component Loadings (Eigenvectors)
----------------------------------------
Variable         Comp1     Comp2
--------------------------------
salary          0.7071   -0.7071
age             0.7071    0.7071

Kaiser criterion (eigenvalue > 1): 1 components
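
With only two standardized variables, the PCA result can be checked by hand: the correlation matrix `[[1, r], [r, 1]]` has eigenvalues `1 + r` and `1 - r`, with unit eigenvectors along `(1, 1)` and `(-1, 1)`. The sketch below applies that to `r = 0.8516` from the PROC CORR output. The extra `n / (n - 1)` factor is an assumption inferred from the numbers above (it makes the eigenvalues sum to about 2.22 rather than 2), not documented StatLang behavior.

```python
import math

r = 0.8516   # Pearson correlation of salary and age from the PROC CORR output
n = 10       # number of employees

# Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]].
eigenvalues = [1 + r, 1 - r]
proportions = [ev / sum(eigenvalues) for ev in eigenvalues]

# Each unit-length eigenvector has components of magnitude 1 / sqrt(2).
loading = 1 / math.sqrt(2)

# Assumption: the reported eigenvalues carry an extra n / (n - 1) factor.
scaled = [ev * n / (n - 1) for ev in eigenvalues]

print([round(v, 4) for v in scaled])
print([round(p, 4) for p in proportions])
print(round(loading, 4))
```

The proportions do not depend on the scaling factor, which is why they match `(1 + r) / 2` and `(1 - r) / 2` directly. Only the first component exceeds 1, in line with the Kaiser criterion line above.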



### PROC CLUSTER - Clustering Analysis

PROC CLUSTER provides k-means, hierarchical, and HDBSCAN clustering methods for data segmentation.


In [1]:
/* PROC CLUSTER - K-means clustering */
proc cluster data=work.employees method=kmeans nclusters=3;
    var salary age;
run;


K-Means Clustering Results
----------------------------------------

Cluster Sizes
--------------------
Cluster 0: 4 observations
Cluster 1: 2 observations
Cluster 2: 4 observations

Cluster Centroids
------------------------------
Variable        Cluster0    Cluster1    Cluster2
------------------------------------------------
salary           -0.9357      1.4842      0.1936
age              -0.9058      1.6348      0.0884

Model Statistics
--------------------
Within-cluster sum of squares: 3.2843
Number of iterations: 3
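
Note that the centroids above are on a standardized scale, not in dollars or years. The salary values are consistent with z-scores computed using the population standard deviation (divisor n); that is an inference from the numbers, not documented behavior. Under that assumption, a short Python sketch can map the salary centroids back to original units (salaries reconstructed from the outputs in this walkthrough):

```python
import statistics

salaries = [75000.0, 55000.0, 80000.0, 45000.0, 70000.0,
            60000.0, 50000.0, 85000.0, 62000.0, 59000.0]

mean = statistics.mean(salaries)
pop_std = statistics.pstdev(salaries)   # population std (divisor n): an assumption

# Salary row of the centroid table above, in standardized units.
centroids_z = {0: -0.9357, 1: 1.4842, 2: 0.1936}

# Undo the z-score transform: x = mean + z * std.
centroids = {k: mean + z * pop_std for k, z in centroids_z.items()}
for k, value in centroids.items():
    print(f"Cluster {k}: average salary of about {value:,.0f}")
```

Under this assumption the three clusters correspond to average salaries of roughly 52,500, 82,500, and 66,500, which is easier to interpret than the standardized values.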



### PROC NPAR1WAY - Nonparametric Tests

PROC NPAR1WAY performs nonparametric tests including Mann-Whitney U test and Kruskal-Wallis test for comparing groups.


In [1]:
/* PROC NPAR1WAY - Nonparametric analysis */
proc npar1way data=work.employees;
    class department;
    var salary;
run;


PROC NPAR1WAY - Nonparametric One-Way Analysis
Class variable: department
Test type: WILCOXON

Variable: salary
--------------------------------------------------
Summary Statistics by Group
----------------------------------------
Group           N        Mean         Median       Std Dev     
-----------------------------------------------------------------
Engineering     4        77500.0000   77500.0000   6454.9722   
Marketing       3        58000.0000   59000.0000   2645.7513   
Sales           2        47500.0000   47500.0000   3535.5339   
HR              1        62000.0000   62000.0000   nan         

Kruskal-Wallis Test
------------------------------
Groups: Engineering, Marketing, Sales, HR
Sample sizes: [4, 3, 2, 1]

Test Statistics
--------------------
H statistic: 8.1818
Degrees of freedom: 3
p-value: 0.042400
Effect size (eta²): 0.8636

Conclusion: p < 0.05 (significant)
Note: Significant difference detected. Consider post-hoc pairwise comparisons.
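
The Kruskal-Wallis statistic above can be reproduced from the ranks alone. The following Python sketch is a minimal version that assumes no tied values (true for this data) and uses a closed-form chi-square tail probability that is valid only for 3 degrees of freedom. The salaries are reconstructed from the outputs in this walkthrough, and the eta-squared formula is the one that matches the reported value.

```python
import math

groups = {
    "Engineering": [75000, 80000, 70000, 85000],
    "Marketing": [55000, 59000, 60000],
    "Sales": [45000, 50000],
    "HR": [62000],
}

# Rank all observations together (no ties here, so the ranks are 1..N).
pooled = sorted(v for values in groups.values() for v in values)
rank = {v: i + 1 for i, v in enumerate(pooled)}

n_total = len(pooled)
k = len(groups)

# H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), with R_j the rank sum of group j.
h = 12.0 / (n_total * (n_total + 1)) * sum(
    sum(rank[v] for v in values) ** 2 / len(values) for values in groups.values()
) - 3 * (n_total + 1)

# Chi-square upper tail, closed form valid only for 3 degrees of freedom.
p_value = math.erfc(math.sqrt(h / 2)) + math.sqrt(2 * h / math.pi) * math.exp(-h / 2)

# Effect size in the form (H - k + 1) / (N - k).
eta_squared = (h - k + 1) / (n_total - k)

print(f"H = {h:.4f}, df = {k - 1}, p = {p_value:.6f}, eta^2 = {eta_squared:.4f}")
```

With group sizes of 4, 3, 2, and 1, the chi-square approximation behind the p-value is rough, so the significance result is best read as indicative.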






### PROC TTEST - T-Tests

PROC TTEST performs independent samples, paired samples, and one-sample t-tests with comprehensive statistical output.


In [1]:
/* Create a binary variable for t-test */
data work.employees;
    set work.employees;
    high_salary = (salary > 70000);
run;

/* PROC TTEST - Independent samples t-test */
proc ttest data=work.employees;
    class high_salary;
    var age;
run;


Saved dataset work.employees to library work
PROC TTEST - T-Test Analysis
Test type: INDEPENDENT

Variable age: CLASS variable must have exactly 2 groups
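
The message above reflects the data rather than the test itself: in this run `high_salary` appears to hold only the value 0 (the cross-tabulation in the PROC FREQ Enhanced section shows a single column), so there is just one group to compare. The Python sketch below shows the same guard together with a pooled-variance t statistic. It uses the six employees printed in this walkthrough and computes the flag explicitly as 0/1; the function name `pooled_t` is hypothetical.

```python
import math
import statistics

def pooled_t(values, labels):
    """Student's pooled-variance t statistic for exactly two groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    if len(groups) != 2:
        raise ValueError(f"CLASS variable must have exactly 2 groups, found {len(groups)}")
    a, b = (groups[key] for key in sorted(groups))
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * statistics.variance(a)
                  + (len(b) - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (statistics.mean(a) - statistics.mean(b)) / se, df

# Ages and salaries of the six employees printed in this walkthrough.
age = [30, 25, 35, 28, 32, 40]
salary = [75000, 55000, 80000, 45000, 70000, 85000]

# Flag computed explicitly as 0/1, which yields two groups.
high_salary = [1 if s > 70000 else 0 for s in salary]
t_stat, df = pooled_t(age, high_salary)
print(f"t = {t_stat:.4f}, df = {df}")
```

The sign of `t` depends on group order: here group 0 (salary at or below 70000) is subtracted from first, so a negative value means that group is younger on average.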



### PROC LOGIT - Logistic Regression

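PROC LOGIT performs logistic regression with model diagnostics. As the error message in the output below indicates, it currently supports binary outcomes only; the call fails here most likely because `high_salary` holds a single value in this run (see the cross-tabulation in the PROC FREQ Enhanced section), so there are not two classes to model.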


In [1]:
/* PROC LOGIT - Logistic regression */
proc logit data=work.employees;
    model high_salary = age;
run;


PROC LOGIT - Logistic Regression Analysis
Dependent variable: high_salary
Independent variables: age
Link function: logit

ERROR: PROC LOGIT currently supports binary logistic regression only.


### PROC FREQ Enhanced - Chi-Square Tests

Enhanced PROC FREQ now includes Chi-square tests of independence and Fisher's exact test for contingency tables.


In [1]:
/* PROC FREQ Enhanced - Chi-square test */
proc freq data=work.employees;
    tables department*high_salary / chisq;
run;


PROC FREQ - Cross-tabulation: department * high_salary
Options: chisq

department      | 0  | Total
----------------------------
Engineering     | 4  | 4    
HR              | 1  | 1    
Marketing       | 3  | 3    
Sales           | 2  | 2    
Total           | 10 | 10   

Statistics:
  Engineering: 4 (40.0%)
  HR: 1 (10.0%)
  Marketing: 3 (30.0%)
  Sales: 2 (20.0%)

Chi-Square Test of Independence
----------------------------------------
Chi-square statistic: 0.0000
Degrees of freedom: 0
p-value: 1.000000

Expected Frequencies
-------------------------
Minimum expected frequency: 1.00

Effect size (Cramér's V): nan

Conclusion: p >= 0.05 (not significant)
No significant association between variables.
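
The zero degrees of freedom and the `nan` effect size follow directly from the shape of the table: with a single `high_salary` column, df = (4 - 1) x (1 - 1) = 0, and the denominator of Cramér's V, n x (min(rows, columns) - 1), is zero. The Python sketch below computes the same quantities and returns `nan` explicitly in that case. The `intended` table is hypothetical: it shows what the counts would be if `high_salary` were 1 for salaries above 70000, based on the salaries shown in this walkthrough.

```python
import math

def chi_square(table):
    """Pearson chi-square, degrees of freedom and Cramér's V for a 2-D count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = sum((obs - rt * ct / n) ** 2 / (rt * ct / n)
               for row, rt in zip(table, row_totals)
               for obs, ct in zip(row, col_totals))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    smaller = min(len(row_totals), len(col_totals)) - 1
    # With a single row or column the denominator is zero, so V is undefined.
    v = math.sqrt(chi2 / (n * smaller)) if smaller > 0 else math.nan
    return chi2, df, v

observed = [[4], [1], [3], [2]]               # the one-column table shown above
intended = [[1, 3], [1, 0], [3, 0], [2, 0]]   # hypothetical 0/1 split at salary > 70000

print(chi_square(observed))
print(chi_square(intended))
```

Even with a proper two-column table, every expected count here is below 5, so Fisher's exact test would usually be preferred over the chi-square approximation.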





### PROC TIMESERIES - Time Series Analysis

PROC TIMESERIES provides seasonal decomposition, ARIMA modeling, and autocorrelation analysis for time series data.


In [1]:
/* Create time series data */
data work.sales;
    input month sales marketing_spend;
    datalines;
1 1000 5000
2 1100 5200
3 1200 4800
4 1300 5500
5 1250 5100
6 1400 5300
7 1350 4900
8 1500 5600
9 1450 5200
10 1600 5400
11 1550 5000
12 1700 5800
;
run;

/* PROC TIMESERIES - Seasonal decomposition */
proc timeseries data=work.sales type=decompose;
    var sales;
run;


Saved dataset work.sales to library work
PROC TIMESERIES - Time Series Analysis
Analysis type: DECOMPOSE

Variable sales: Seasonal decomposition failed - x must have 2 complete cycles requires 24 observations. x only has 12 observation(s)
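
This failure is about series length: the decomposition evidently assumes a seasonal period of 12, which needs at least two full cycles (24 observations), and the demo supplies only 12 months. The sketch below shows that length check and, as a stand-in, a simple centered moving average that gives a trend estimate for short series. Both helper names are hypothetical, and the walkthrough does not say whether StatLang exposes an option to change the period.

```python
sales = [1000, 1100, 1200, 1300, 1250, 1400, 1350, 1500, 1450, 1600, 1550, 1700]

def has_two_cycles(n_obs, period):
    """True when the series covers at least two complete seasonal cycles."""
    return n_obs >= 2 * period

def centered_moving_average(values, window=3):
    """Smooth the series with an odd-length centered window (a simple trend estimate)."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]

print(has_two_cycles(len(sales), 12))   # a period of 12 needs 24 observations
print(has_two_cycles(len(sales), 4))    # a hypothetical quarterly period would fit
trend = centered_moving_average(sales)
print([round(t, 1) for t in trend])
```

To run the seasonal decomposition itself on monthly data, extend the DATALINES block to at least 24 months.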



### PROC TREE - Decision Tree Analysis

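PROC TREE builds decision trees for classification and regression with feature importance analysis. In the run below, the log reports `Max depth: 5` even though the call sets `maxdepth=3`, and the tree collapses to a single leaf because the target contains only one class.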


In [1]:
/* PROC TREE - Decision tree classification */
proc tree data=work.employees maxdepth=3;
    model high_salary = age salary;
run;


PROC TREE - Decision Tree Analysis
Target variable: high_salary
Feature variables: age, salary
Max depth: 5

Decision Tree Classification Results
----------------------------------------
Training accuracy: 1.0000
Tree depth: 0
Number of leaves: 1

Feature Importance
--------------------
age             0.0000    
salary          0.0000    

Classification Report
-------------------------
Class           Precision  Recall     F1-Score   Support   
-------------------------------------------------------
0               1.0000     1.0000     1.0000     2         



### PROC FOREST - Random Forest Analysis

PROC FOREST builds random forest ensembles for improved prediction accuracy and feature importance. In the run below, the log reports `Number of trees: 100` even though the call sets `ntrees=50`.


In [1]:
/* PROC FOREST - Random forest classification */
proc forest data=work.employees ntrees=50;
    model high_salary = age salary;
run;


PROC FOREST - Random Forest Analysis
Target variable: high_salary
Feature variables: age, salary
Number of trees: 100

Random Forest Classification Results
----------------------------------------
Training accuracy: 1.0000
Out-of-bag score: 1.0000
Number of trees: 100

Feature Importance
--------------------
age             0.0000    
salary          0.0000    



### PROC BOOST - Gradient Boosting

PROC BOOST implements gradient boosting for powerful ensemble learning with configurable parameters.


In [1]:
/* PROC BOOST - Gradient boosting */
proc boost data=work.employees ntrees=50 learningrate=0.1;
    model high_salary = age salary;
run;


ERROR in PROC: y contains 1 class after sample_weight trimmed classes with zero weights, while a minimum of 2 classes are required.


### PROC LANGUAGE - LLM Integration (Experimental)

PROC LANGUAGE integrates with Ollama for open-source LLM access, enabling AI-assisted data analysis and text generation.

**Note**: Requires Ollama to be installed and running locally. Install from https://ollama.ai/


In [1]:
/* PROC LANGUAGE - Text generation (requires Ollama) */
proc language model=llama2 mode=generate;
    prompt "Explain the concept of statistical significance in data analysis";
run;


ERROR: PROMPT specification required for PROC LANGUAGE.


## Summary

StatLang v2.0 now provides a comprehensive statistical analysis platform with:

- **13 Major Procedures**: From basic descriptive statistics to advanced machine learning
- **Professional Output**: SAS-inspired formatting with proper statistical interpretation
- **Modern Backend**: Python-powered with scikit-learn, statsmodels, and scipy
- **Extensible Architecture**: Easy to add new procedures and capabilities
- **Open Source**: Free and open-source alternative to proprietary statistical software

### Key Advantages:
- ✅ **No License Costs**: Completely free to use
- ✅ **Cross-Platform**: Works on Windows, macOS, and Linux
- ✅ **Jupyter Integration**: Interactive analysis in notebooks
- ✅ **VS Code Support**: Syntax highlighting and execution
- ✅ **Python Ecosystem**: Access to thousands of Python packages
- ✅ **Community Driven**: Open source with active development

**Ready for production use in research, education, and data analysis!**


In [1]:
/* Demo 10: Display Sorted Data */
proc print data=work.enhanced_employees;
run;


PROC PRINT - Dataset Contents
Observations: 10
Variables: 12

employee_id | name     | department  | salary   | age      | experience | performance_score | satisfaction | remote   | salary_category | annual_bonus | total_compensation
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
8.0         | Henry    | Engineering | 85000.0  | 40.0     | 12.0       | 9.5               | High         | 0.0      | High            | 8500.0       | 93500.0           
3.0         | Carol    | Engineering | 80000.0  | 35.0     | 8.0        | 9.1               | High         | 0.0      | High            | 8000.0       | 88000.0           
1.0         | Alice    | Engineering | 75000.0  | 30.0     | 5.0        | 8.5               | High         | 1.0      | High            | 7500.0       | 82500.0           
5.0         | Eve      | Engineering | 70000.0  | 32.0     | 6.0        | 7.8 

In [1]:
/* Demo 11: Advanced Statistics - PROC MEANS with printed results and no OUT= dataset */
proc means data=work.enhanced_employees;
    class department;
    var salary total_compensation;
    output 
           mean=avg_salary avg_compensation 
           min=min_salary min_compensation 
           max=max_salary max_compensation;
run;


PROC MEANS - Grouped Analysis
Analysis Variables: salary, total_compensation
Grouping Variables: department

department  | salary_count | salary_mean | salary_std         | salary_min | salary_max | total_compensation_count | total_compensation_mean | total_compensation_std | total_compensation_min | total_compensation_max
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Engineering | 4            | 77500.0     | 6454.972243679028  | 70000.0    | 85000.0    | 4                        | 85250.0                 | 7100.469468046931      | 77000.0                | 93500.0               
HR          | 1            | 62000.0     | nan                | 62000.0    | 62000.0    | 1                        | 68200.0                 | nan                    | 68200.0                | 68200.0               
Marketing   | 3            

In [1]:
/* Demo 12: Advanced Statistics - PROC MEANS with NOPRINT and an OUT= dataset */
proc means data=work.enhanced_employees noprint;
    class department;
    var salary total_compensation;
    output out=work.dept_summary 
           mean=avg_salary avg_compensation 
           min=min_salary min_compensation 
           max=max_salary max_compensation;
run;


In [1]:
/* Demo 12: Display Summary Statistics */
proc print data=work.dept_summary;
run;


PROC PRINT - Dataset Contents
Observations: 4
Variables: 7

department  | avg_salary | avg_compensation | max_salary | max_compensation | salary_min | total_compensation_min
-----------------------------------------------------------------------------------------------------------------
Engineering | 77500.0    | 85250.0          | 85000.0    | 93500.0          | 70000.0    | 77000.0               
HR          | 62000.0    | 68200.0          | 62000.0    | 68200.0          | 62000.0    | 68200.0               
Marketing   | 58000.0    | 63800.0          | 60000.0    | 66000.0          | 55000.0    | 60500.0               
Sales       | 47500.0    | 52250.0          | 50000.0    | 55000.0          | 45000.0    | 49500.0               


In [1]:
/* Demo 13: Cross-tabulation Analysis */
proc freq data=work.enhanced_employees;
    tables department * salary_category / nocol nopercent;
run;


PROC FREQ - Cross-tabulation: department * salary_category
Options: nocol nopercent

department      | High | Low | Medium | Total
---------------------------------------------
Engineering     | 3    | 0   | 1      | 4    
HR              | 0    | 0   | 1      | 1    
Marketing       | 0    | 1   | 2      | 3    
Sales           | 0    | 2   | 0      | 2    
Total           | 3    | 3   | 4      | 10   


In [1]:
/* Demo 14: PROC CONTENTS - Dataset Information */
proc contents data=work.enhanced_employees;
run;


PROC CONTENTS - Dataset Information

Dataset Information:
  Observations: 10
  Variables: 12

Variable Information:
--------------------------------------------------------------------------------
#   Variable             Type       Length   Non-Null   Null    
--------------------------------------------------------------------------------
1   employee_id          Numeric    8        10         0       
2   name                 Character  Variable 10         0       
3   department           Character  Variable 10         0       
4   salary               Numeric    8        10         0       
5   age                  Numeric    8        10         0       
6   experience           Numeric    8        10         0       
7   performance_score    Numeric    8        10         0       
8   satisfaction         Character  Variable 10         0       
9   remote               Numeric    8        10         0       
10  salary_category      Character  Variable 10         0       
11  ann

## 🎨 Advanced Features: Macro Facility and Format System

StatLang v2.0 includes SAS-style macro capabilities and a comprehensive format system for professional data presentation and code reusability.

### 🔧 Macro Facility Features:
- **Macro Variables**: `%LET` for setting variables, `&` substitution
- **Macro Definitions**: `%MACRO`/`%MEND` with parameter support
- **Conditional Logic**: `%IF`/`%THEN`/`%ELSE` for dynamic code generation
- **Loops**: `%DO`/`%END` for iterative processing
- **System Variables**: Built-in variables like `&sysver`, `&sysdate`

### 🎨 Format System Features:
- **Built-in Formats**: DOLLAR, COMMA, PERCENT, DATE, TIME formats
- **Format Persistence**: Formats stored with dataset metadata
- **Professional Output**: Properly formatted tables and reports
- **Format Inheritance**: Formats propagate through dataset operations


In [1]:
/* Demo 18: Professional Formatting with SAS Formats */
data work.sales_report;
    input date $ sales marketing_spend bonus_percent;
    format sales dollar12.2 marketing_spend comma10.0 bonus_percent percent8.2;
    datalines;
2024-01-15 15000.50 5000 0.15
2024-02-20 18000.75 6000 0.18
2024-03-25 22000.00 7000 0.22
2024-04-30 19500.25 5500 0.16
;
run;

proc print data=work.sales_report;
    title "Sales Report with Professional Formatting";
run;


Saved dataset work.sales_report to library work
PROC PRINT - Dataset Contents
Observations: 4
Variables: 4

date       | sales      | marketing_spend | bonus_percent
---------------------------------------------------------
2024-01-15 | $15,000.50 | 5,000           | 15.00%       
2024-02-20 | $18,000.75 | 6,000           | 18.00%       
2024-03-25 | $22,000.00 | 7,000           | 22.00%       
2024-04-30 | $19,500.25 | 5,500           | 16.00%       


date,sales,marketing_spend,bonus_percent
2024-01-15,15000.5,5000.0,0.15
2024-02-20,18000.75,6000.0,0.18
2024-03-25,22000.0,7000.0,0.22
2024-04-30,19500.25,5500.0,0.16
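
As the CSV preview shows, formats change only how values are displayed; the stored numbers stay as they were. For a rough idea of what `dollar12.2`, `comma10.0`, and `percent8.2` do to a value, here is a Python sketch built on format specifiers. The helper names are illustrative, and the sketch ignores the width part of the format (the `12`, `10`, and `8`).

```python
def dollar(value, decimals=2):
    """Rough analogue of DOLLARw.d: dollar sign, thousands separators, fixed decimals."""
    return f"${value:,.{decimals}f}"

def comma(value, decimals=0):
    """Rough analogue of COMMAw.d: thousands separators, fixed decimals."""
    return f"{value:,.{decimals}f}"

def percent(value, decimals=2):
    """Rough analogue of PERCENTw.d: multiply by 100 and append a percent sign."""
    return f"{value:.{decimals}%}"

# First row of work.sales_report, as stored.
row = {"sales": 15000.5, "marketing_spend": 5000.0, "bonus_percent": 0.15}
formatted = (dollar(row["sales"]), comma(row["marketing_spend"]),
             percent(row["bonus_percent"]))
print(formatted)
```

Keeping the raw value and formatting only at print time means later calculations are not affected by rounding in the display.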


In [1]:
/* Demo 19: SAS Macro Facility - Dynamic Code Generation */
%let company = StatLang Corp;
%let year = 2024;

%macro create_department_report(dept);
    data work.&dept._employees;
        input name $ salary experience;
        format salary dollar10.2 experience comma3.0;
        datalines;
    John 75000 5
    Jane 80000 7
    Bob 65000 3
    ;
    run;
    
    title "&company - &dept Department (&year)";
    proc print data=work.&dept._employees;
    run;
%mend;

%create_department_report(Engineering);
%create_department_report(Marketing);


Saved dataset work.Engineering_employees to library work
TITLE: StatLang Corp - Engineering Department (2024)

PROC PRINT - Dataset Contents
Observations: 3
Variables: 3

name     | salary     | experience
----------------------------------
John     | $75,000.00 | 5         
Jane     | $80,000.00 | 7         
Bob      | $65,000.00 | 3         
Saved dataset work.Marketing_employees to library work
TITLE: StatLang Corp - Marketing Department (2024)

PROC PRINT - Dataset Contents
Observations: 3
Variables: 3

name     | salary     | experience
----------------------------------
John     | $75,000.00 | 5         
Jane     | $80,000.00 | 7         
Bob      | $65,000.00 | 3         


name,salary,experience
John,75000.0,5.0
Jane,80000.0,7.0
Bob,65000.0,3.0


name,salary,experience
John,75000.0,5.0
Jane,80000.0,7.0
Bob,65000.0,3.0
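
One detail in the macro above is easy to miss: in `work.&dept._employees`, the dot after `&dept` marks the end of the macro variable name and is consumed during substitution, which is why the dataset becomes `work.Engineering_employees`. The Python sketch below imitates that resolution rule with a regular expression. It is a simplified illustration (no nested references, no macro functions), and `resolve` is a made-up name.

```python
import re

def resolve(text, variables):
    """Replace &name references; a trailing dot ends the name and is consumed."""
    def substitute(match):
        name = match.group(1).lower()
        # Leave unknown references untouched instead of failing.
        return str(variables.get(name, match.group(0)))
    return re.sub(r"&([A-Za-z_]\w*)\.?", substitute, text)

macro_vars = {"company": "StatLang Corp", "year": 2024, "dept": "Engineering"}

dataset = resolve("work.&dept._employees", macro_vars)
title = resolve("&company - &dept Department (&year)", macro_vars)
print(dataset)
print(title)
```

Without the delimiter dot, `&dept_employees` would be read as a reference to a variable named `dept_employees`, which is not defined here.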


## Demo 20: PROC SURVEYSELECT - Random Sampling

**Purpose**: Random sampling of observations from datasets for statistical analysis and survey research.

**Key Features**:
- Simple Random Sampling (SRS) with `METHOD=SRS`
- Sample size specification via `SAMPRATE=` (sampling fraction, e.g. `0.5` for 50%) or `N=` (absolute count)
- Reproducible sampling with `SEED=` option
- Optional `OUTALL` flag to include all observations with a selection indicator
- Professional output with sampling statistics

**Use Cases**:
- Creating representative samples for analysis
- Survey sampling and data collection
- Cross-validation and holdout sets for machine learning
- Quality control and auditing samples


In [1]:
/* Demo 20: PROC SURVEYSELECT - Random Sampling */
/* Create a larger dataset for sampling demonstration */
data work.population;
    input id name $ department $ salary experience;
    datalines;
1 Alice Engineering 75000 5
2 Bob Marketing 55000 3
3 Carol Engineering 80000 7
4 David Sales 45000 2
5 Eve Engineering 70000 4
6 Frank Marketing 60000 6
7 Grace Sales 50000 3
8 Henry Engineering 85000 8
9 Irene HR 62000 5
10 Jack Marketing 59000 4
11 Kelly Engineering 72000 6
12 Leo Sales 48000 2
13 Mary Marketing 61000 5
14 Nick Engineering 78000 7
15 Olivia HR 58000 4
16 Paul Sales 52000 3
17 Quinn Engineering 76000 5
18 Rachel Marketing 64000 6
19 Sam HR 60000 4
20 Tom Engineering 82000 8
;
run;

/* Example 1: 50% random sample */
proc surveyselect data=work.population out=work.sample50 method=srs samprate=0.5 seed=12345;
run;

/* Example 2: Fixed sample size of 8 observations */
proc surveyselect data=work.population out=work.sample8 method=srs n=8 seed=54321;
run;

/* Example 3: 30% sample with selection indicator (OUTALL) */
proc surveyselect data=work.population out=work.sample30_all method=srs samprate=0.3 seed=99999 outall=true;
run;


Saved dataset work.population to library work
PROC SURVEYSELECT - Random Sampling Results
Sampling Method: SRS
Total Observations: 20
Sample Size: 10
Sampling Rate: 0.500
Random Seed: 12345

Output Dataset: 10 observations (selected only)

Sample Data Preview:
------------------------------
id       | name     | department  | salary   | experience
---------------------------------------------------------
1.0      | Alice    | Engineering | 75000.0  | 5.0       
4.0      | David    | Sales       | 45000.0  | 2.0       
8.0      | Henry    | Engineering | 85000.0  | 8.0       
9.0      | Irene    | HR          | 62000.0  | 5.0       
11.0     | Kelly    | Engineering | 72000.0  | 6.0       
12.0     | Leo      | Sales       | 48000.0  | 2.0       
13.0     | Mary     | Marketing   | 61000.0  | 5.0       
14.0     | Nick     | Engineering | 78000.0  | 7.0       
16.0     | Paul     | Sales       | 52000.0  | 3.0       
17.0     | Quinn    | Engineering | 76000.0  | 5.0       
PROC SURVEYS

id,name,department,salary,experience
1.0,Alice,Engineering,75000.0,5.0
2.0,Bob,Marketing,55000.0,3.0
3.0,Carol,Engineering,80000.0,7.0
4.0,David,Sales,45000.0,2.0
5.0,Eve,Engineering,70000.0,4.0


id,name,department,salary,experience
2.0,Bob,Marketing,55000.0,3.0
4.0,David,Sales,45000.0,2.0
6.0,Frank,Marketing,60000.0,6.0
10.0,Jack,Marketing,59000.0,4.0
12.0,Leo,Sales,48000.0,2.0


id,name,department,salary,experience,selected
1.0,Alice,Engineering,75000.0,5.0,1
2.0,Bob,Marketing,55000.0,3.0,0
3.0,Carol,Engineering,80000.0,7.0,1
4.0,David,Sales,45000.0,2.0,0
5.0,Eve,Engineering,70000.0,4.0,1


id,name,department,salary,experience
1.0,Alice,Engineering,75000.0,5.0
4.0,David,Sales,45000.0,2.0
8.0,Henry,Engineering,85000.0,8.0
9.0,Irene,HR,62000.0,5.0
11.0,Kelly,Engineering,72000.0,6.0
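
To show how the three calls above relate to each other, here is a minimal Python sketch of simple random sampling without replacement. It is not StatLang's implementation: the function name `srs`, the rounding of `N * samprate`, and the stand-in population are assumptions for illustration, and Python's generator will not reproduce the specific ids drawn above for the same seeds.

```python
import random

def srs(rows, n=None, samprate=None, seed=None, outall=False):
    """Simple random sampling without replacement from a list of dict rows."""
    if (n is None) == (samprate is None):
        raise ValueError("specify exactly one of n or samprate")
    # Rounding rule for fractional sizes is an assumption of this sketch.
    size = n if n is not None else round(len(rows) * samprate)
    rng = random.Random(seed)   # a fixed seed makes the draw repeatable
    chosen = set(rng.sample(range(len(rows)), size))
    if outall:
        # Keep every row and add a 0/1 selection indicator.
        return [{**row, "selected": int(i in chosen)} for i, row in enumerate(rows)]
    return [row for i, row in enumerate(rows) if i in chosen]

population = [{"id": i} for i in range(1, 21)]   # stand-in for the 20-row dataset

sample50 = srs(population, samprate=0.5, seed=12345)
sample8 = srs(population, n=8, seed=54321)
sample30_all = srs(population, samprate=0.3, seed=99999, outall=True)
print(len(sample50), len(sample8), sum(row["selected"] for row in sample30_all))
```

The `outall` form is convenient for holdout splits: rows with `selected` equal to 1 form one set and the remaining rows form the other, with no second pass over the data.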
