# 🎯 StatLang Regression Demo

This example project demonstrates how to perform a complete regression analysis workflow using **StatLang**, a Python-powered statistical scripting environment.

We will walk through the following steps using real StatLang syntax:

1. 📊 Create a synthetic dataset with 30 observations
2. 📈 Use `PROC UNIVARIATE` to explore variable distributions
3. 🔀 Split the dataset into training and testing sets using `PROC SURVEYSELECT`
4. 🧠 Train a linear regression model using `PROC REG`
5. 🔍 Generate predictions and perform error analysis on the test set
6. 🧩 Wrap part of the workflow inside a `%MACRO` to demonstrate macro capabilities

This notebook is a starting point for data scientists who are moving from SAS or evaluating StatLang for statistical modeling.

> ✅ All code uses StatLang syntax and is fully executable in a `.statlang` environment.

Happy modeling!


In [1]:
libname work './examples/work';

data work.customers;
  input age income spend;
  datalines;
25 40000 5500
32 48000 6200
29 46000 5900
45 75000 8700
41 69000 8100
36 58000 7200
52 88000 9500
48 79000 8800
30 50000 6000
27 42000 5700
33 47000 6100
38 60000 7400
44 72000 8600
39 61000 7500
26 39000 5300
34 55000 7000
40 68000 8000
46 76000 8800
50 83000 9100
43 71000 8500
31 49000 6200
28 43000 5800
35 57000 7100
42 70000 8400
49 80000 8900
37 59000 7300
47 78000 8700
51 85000 9400
53 89000 9600
55 91000 9800
;
run;

Library work created and mapped to ./examples/work
Saved dataset work.customers to library work


age,income,spend
25.0,40000.0,5500.0
32.0,48000.0,6200.0
29.0,46000.0,5900.0
45.0,75000.0,8700.0
41.0,69000.0,8100.0


In [1]:
proc univariate data=work.customers normal plot;
  var age income spend;
run;

PROC UNIVARIATE - Univariate Analysis

Variable: age
----------------------------------------
Moments
  N                                  30
  Mean                        39.533333
  Std Deviation                8.861903
  Variance                    78.533333
  Skewness                     0.022552
  Uncorrected SS             49164.000000
  Corrected SS               2277.466667
  Coeff Variation             22.416282

Basic Statistical Measures
  Location                    Variability
  Mean       39.533333     Std Deviation       8.861903
  Median     39.500000     Variance           78.533333
  Mode       25.000000     Range              30.000000
                           Interquartile Range  14.500000

Tests for Location: Mu0=0
  Test           -Statistic-    -----p Value------
  Student's t     24.434139    Pr > |t|      0.000000
  Sign                   30    Pr >= |M|     0.000000
  Signed Rank           465    Pr >= |S|     1.000000

Quantiles (Definition 5)
  Level      
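
As a cross-check on the `Moments` table for `age`, the same figures can be recomputed outside StatLang. The sketch below is plain Python using only the standard library. It is not how StatLang computes them internally, and the variable names are illustrative.

```python
import math
import statistics

# The 30 `age` values from work.customers, in dataline order.
ages = [25, 32, 29, 45, 41, 36, 52, 48, 30, 27,
        33, 38, 44, 39, 26, 34, 40, 46, 50, 43,
        31, 28, 35, 42, 49, 37, 47, 51, 53, 55]

n = len(ages)
mean = statistics.mean(ages)
variance = statistics.variance(ages)               # sample variance, divisor n - 1
std_dev = math.sqrt(variance)
uncorrected_ss = sum(a * a for a in ages)          # raw sum of squares
corrected_ss = sum((a - mean) ** 2 for a in ages)  # sum of squared deviations from the mean
coeff_variation = 100 * std_dev / mean             # std deviation as a percentage of the mean
t_stat = mean / (std_dev / math.sqrt(n))           # Student's t for H0: mu = 0

print(f"N={n} mean={mean:.6f} std={std_dev:.6f} var={variance:.6f}")
print(f"USS={uncorrected_ss} CSS={corrected_ss:.6f} "
      f"CV={coeff_variation:.6f} t={t_stat:.6f}")
```

The printed values should agree with the N, Mean, Std Deviation, Variance, Uncorrected SS, Corrected SS, Coeff Variation, and Student's t entries reported above, up to rounding.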

In [1]:
proc surveyselect data=work.customers out=work.split method=srs samprate=0.7 seed=12345 outall;
run;

data work.train;
  set work.split;
  if selected = 1;
run;

data work.test;
  set work.split;
  if selected = 0;
run;

PROC SURVEYSELECT - Random Sampling Results
Sampling Method: SRS
Total Observations: 30
Sample Size: 21
Sampling Rate: 0.700
Random Seed: 12345

Output Dataset: 30 observations (all with selection indicator)
Selected Observations: 21

Sample Data Preview:
------------------------------
age      | income   | spend    | selected
-----------------------------------------
25.0     | 40000.0  | 5500.0   | 1.0     
32.0     | 48000.0  | 6200.0   | 0.0     
29.0     | 46000.0  | 5900.0   | 0.0     
45.0     | 75000.0  | 8700.0   | 1.0     
41.0     | 69000.0  | 8100.0   | 0.0     
36.0     | 58000.0  | 7200.0   | 0.0     
52.0     | 88000.0  | 9500.0   | 1.0     
48.0     | 79000.0  | 8800.0   | 1.0     
30.0     | 50000.0  | 6000.0   | 1.0     
27.0     | 42000.0  | 5700.0   | 0.0     
... and 20 more observations
Saved dataset work.train to library work
Saved dataset work.test to library work


age,income,spend,selected
25.0,40000.0,5500.0,1
32.0,48000.0,6200.0,0
29.0,46000.0,5900.0,0
45.0,75000.0,8700.0,1
41.0,69000.0,8100.0,0
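
To make the sampling step concrete, here is a minimal Python sketch of simple random sampling without replacement at a 70% rate. It illustrates the idea behind `method=srs` with `outall`, not StatLang's implementation. The function name `srs_split` and the rounding rule for the sample size are assumptions, and Python's generator will not reproduce the rows StatLang picks for `seed=12345`.

```python
import random


def srs_split(n_obs, samprate, seed):
    """Return a 0/1 flag per row, with round(n_obs * samprate) rows flagged 1."""
    sample_size = round(n_obs * samprate)
    rng = random.Random(seed)  # seeded so the split is repeatable
    chosen = set(rng.sample(range(n_obs), sample_size))  # draw without replacement
    # 1 marks a training row and 0 a test row, mirroring the `selected` column.
    return [1 if i in chosen else 0 for i in range(n_obs)]


selected = srs_split(n_obs=30, samprate=0.7, seed=12345)
train_idx = [i for i, flag in enumerate(selected) if flag == 1]
test_idx = [i for i, flag in enumerate(selected) if flag == 0]
print(len(train_idx), len(test_idx))  # 21 9
```

Keeping every row and adding a flag, rather than writing only the sampled rows, is what lets the two follow-up `data` steps derive both `work.train` and `work.test` from a single output dataset.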




In [1]:
%macro train_and_evaluate(depvar, indepvars);
  proc reg data=work.train;
    model &depvar = &indepvars;
    output out=work.train_results p=predicted_&depvar r=residuals;
  run;

  proc reg data=work.train;
    model &depvar = &indepvars;
    score data=work.test out=work.test_results predicted=predicted_&depvar;
  run;

  data work.error_analysis;
    set work.test_results;
    error = &depvar - predicted_&depvar;
    abs_error = abs(error);
    squared_error = error**2;
  run;

  proc means data=work.error_analysis mean;
    var abs_error squared_error;
  run;
%mend;

%train_and_evaluate(spend, age income);

PROC REG - Linear Regression Analysis
Dependent Variable: spend
Independent Variables: age, income
Number of Observations: 30

Model Summary
--------------------
R-Square: 0.986681
Adjusted R-Square: 0.985695
Root MSE: 158.838213

Parameter Estimates
------------------------------
Variable        DF    Estimate     Std Error    t Value    Pr > |t|  
--------------------------------------------------------------------------------
Intercept       2     2022.993160  N/A          N/A        N/A       
age             1     12.732970    28.999757    0.439      0.1000    
income          1     0.079517     28.999757    0.003      0.1000    

Output dataset 'work.train_results' created with predictions and residuals
PROC REG - Linear Regression Analysis
Dependent Variable: spend
Independent Variables: age, income
Number of Observations: 30

Model Summary
--------------------
R-Square: 0.986681
Adjusted R-Square: 0.985695
Root MSE: 158.838213

Parameter Estimates
------------------------------

spend,age,income,predicted_spend,residuals
5500.0,25.0,40000.0,5522.000040285902,-22.000040285902287
6200.0,32.0,48000.0,6247.267357018326,-47.26735701832604
5900.0,29.0,46000.0,6050.034314954452,-150.03431495445147
8700.0,45.0,75000.0,8559.756741503723,140.24325849627712
8100.0,41.0,69000.0,8031.722466835173,68.27753316482722
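
The `predicted_spend` and `residuals` columns follow directly from the fitted equation. As a rough check, the Python sketch below plugs the rounded estimates from the Parameter Estimates table into `intercept + b_age * age + b_income * income` and compares the result with the previewed rows. The helper names are illustrative, and the adjusted R-square formula is the standard textbook one, assumed (not confirmed) to be what StatLang uses.

```python
# Estimates copied from the Parameter Estimates table (rounded to 6 decimals).
INTERCEPT = 2022.993160
B_AGE = 12.732970
B_INCOME = 0.079517


def predict_spend(age, income):
    """Evaluate the fitted linear equation for one observation."""
    return INTERCEPT + B_AGE * age + B_INCOME * income


def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R-square: penalizes R-square for the number of predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (spend, age, income, predicted_spend) from the work.train_results preview.
preview = [
    (5500.0, 25.0, 40000.0, 5522.000040285902),
    (6200.0, 32.0, 48000.0, 6247.267357018326),
    (5900.0, 29.0, 46000.0, 6050.034314954452),
    (8700.0, 45.0, 75000.0, 8559.756741503723),
    (8100.0, 41.0, 69000.0, 8031.722466835173),
]

for spend, age, income, reported in preview:
    predicted = predict_spend(age, income)
    residual = spend - predicted  # observed minus predicted
    print(f"predicted={predicted:.3f} reported={reported:.3f} residual={residual:.3f}")

print(f"adjusted R-square = {adjusted_r2(0.986681, n_obs=30, n_predictors=2):.6f}")
```

Because the coefficients are rounded in the printed table, the recomputed predictions differ from the stored ones in the third decimal place or so. The adjusted R-square recomputed from the rounded R-square likewise lands within rounding distance of the reported 0.985695.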


age,income,spend,selected,predicted_spend
25.0,40000.0,5500.0,1,5522.000040285902
32.0,48000.0,6200.0,0,6247.267357018326
29.0,46000.0,5900.0,0,6050.034314954452
45.0,75000.0,8700.0,1,8559.756741503723
41.0,69000.0,8100.0,0,8031.722466835173


age,income,spend,selected,predicted_spend,error,abs_error,squared_error
25.0,40000.0,5500.0,1,5522.000040285902,-22.000040285902287,22.000040285902287,484.0017725813236
32.0,48000.0,6200.0,0,6247.267357018326,-47.26735701832604,47.26735701832604,2234.203039497896
29.0,46000.0,5900.0,0,6050.034314954452,-150.03431495445147,150.03431495445147,22510.295663851542
45.0,75000.0,8700.0,1,8559.756741503723,140.24325849627712,140.24325849627712,19668.171553653607
41.0,69000.0,8100.0,0,8031.722466835173,68.27753316482722,68.27753316482722,4661.821535074081
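
The derived columns in `work.error_analysis` and the means requested from `PROC MEANS` correspond to the usual error metrics: the mean of `abs_error` is the mean absolute error (MAE) and the mean of `squared_error` is the mean squared error (MSE). The Python sketch below recomputes them for the five previewed rows only, so the averages are illustrative and are not the `PROC MEANS` result for the full dataset. Variable names are assumptions chosen to mirror the data step.

```python
import math

# (spend, predicted_spend) for the five rows in the work.error_analysis preview.
rows = [
    (5500.0, 5522.000040285902),
    (6200.0, 6247.267357018326),
    (5900.0, 6050.034314954452),
    (8700.0, 8559.756741503723),
    (8100.0, 8031.722466835173),
]

errors = [actual - predicted for actual, predicted in rows]  # error = spend - predicted_spend
abs_errors = [abs(e) for e in errors]                        # abs_error = abs(error)
squared_errors = [e ** 2 for e in errors]                    # squared_error = error**2

mae = sum(abs_errors) / len(abs_errors)          # mean absolute error
mse = sum(squared_errors) / len(squared_errors)  # mean squared error
rmse = math.sqrt(mse)                            # back on the scale of `spend`

print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f}")
```

MSE weights large misses more heavily than MAE does, so reporting both gives a quick sense of whether a few observations dominate the error.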
