# SAS PROC Functions
This section covers a few basic `PROC` steps (procedures): `PRINT`, `MEANS`, `UNIVARIATE`, and `FREQ`. The general code structure for each is shown below:

In [None]:
/* prints the first n rows of the listed columns */
PROC PRINT DATA=input-table(OBS=n);
    VAR col-name(s);
RUN;

/* provides summary statistics for the listed columns */
PROC MEANS DATA=input-table;
    VAR col-name(s);
RUN;

/* provides (more detailed) summary statistics for the listed columns */
PROC UNIVARIATE DATA=input-table;
    VAR col-name(s);
RUN;

/* provides a frequency table for each listed column */
PROC FREQ DATA=input-table;
    TABLES col-name(s);
RUN;

## PROC PRINT
We can refine the `proc print` step by specifying the number of rows to print and which columns to include. By default, `PROC PRINT` lists all rows and columns of the specified table. The `(OBS=n)` dataset option sets the last observation (row) to read, and the `VAR` statement selects the columns. The code below reads the `cars` table from the automatic `sashelp` library and prints the first 10 observations of the columns Make, Model, Type, and MSRP.

In [3]:
PROC PRINT data=sashelp.cars (obs=10);
    var Make Model Type MSRP;
run;

Obs,Make,Model,Type,MSRP
1,Acura,MDX,SUV,"$36,945"
2,Acura,RSX Type S 2dr,Sedan,"$23,820"
3,Acura,TSX 4dr,Sedan,"$26,990"
4,Acura,TL 4dr,Sedan,"$33,195"
5,Acura,3.5 RL 4dr,Sedan,"$43,755"
6,Acura,3.5 RL w/Navigation 4dr,Sedan,"$46,100"
7,Acura,NSX coupe 2dr manual S,Sports,"$89,765"
8,Audi,A4 1.8T 4dr,Sedan,"$25,940"
9,Audi,A41.8T convertible 2dr,Sedan,"$35,940"
10,Audi,A4 3.0 4dr,Sedan,"$31,840"


## PROC MEANS
The `proc means` step calculates default statistics: frequency count (N), mean, standard deviation, minimum, and maximum. As with `proc print`, we can specify which variables to summarize. The code below computes these statistics for the variables EngineSize and Horsepower in the same `cars` dataset.

In [7]:
PROC MEANS data=sashelp.cars;
    var EngineSize Horsepower;
run;

Variable,Label,N,Mean,Std Dev,Minimum,Maximum
EngineSize,Engine Size (L),428,3.1967290,1.1085947,1.3000000,8.3000000
Horsepower,,428,215.8855140,71.8360316,73.0000000,500.0000000


Please note that if we add the `(OBS=n)` dataset option to our `proc means` step, the summary statistics are computed from only the first `n` rows. Notice the difference in the output values below:

In [10]:
PROC MEANS data=sashelp.cars (OBS=20);
    var EngineSize Horsepower;
run;

Variable,Label,N,Mean,Std Dev,Minimum,Maximum
EngineSize,Engine Size (L),20,3.0600000,0.7036746,1.8000000,4.2000000
Horsepower,,20,238.7500000,47.0407158,170.0000000,340.0000000
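
The default statistics can be replaced by naming statistic keywords in the `PROC MEANS` statement. The sketch below is one possible variation on the step above; the choice of keywords (`N MEAN MEDIAN MIN MAX`) and the `MAXDEC=2` option are illustrative assumptions, not part of the original notes.

```sas
/* Request specific statistics instead of the defaults.
   MAXDEC=2 limits the printed values to two decimal places. */
proc means data=sashelp.cars n mean median min max maxdec=2;
    var EngineSize Horsepower;
run;
```

Listing keywords this way replaces the default set, so any default statistic you still want (such as `STD`) has to be named explicitly.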


## PROC UNIVARIATE
The `proc univariate` step generates more detailed summary statistics than the `proc means` step. By default (with no `VAR` statement), it analyzes every numeric column in the input data. For each variable, it describes the data and its distribution in these 5 tables:
1. Moments
   - a table with basic descriptive moments for the variable (for example, mean, variance, skewness, and kurtosis)
2. Basic Statistical Measures
   - a table with the core measures of location and variability for the variable
3. Tests for Location
   - a table with hypothesis tests that assess whether the mean or median equals a hypothesized value (`Mu0`, which is 0 by default)
4. Quantiles (Percentiles)
   - a table of selected percentiles, from the minimum (0%) to the maximum (100%)
5. Extreme Observations
   - a table of the five lowest and five highest values, with their observation numbers

In [16]:
PROC UNIVARIATE data=sashelp.cars;
    var MPG_Highway;
run;

Moments,Moments.1,Moments.2,Moments.3
N,428.0,Sum Weights,428.0
Mean,26.8434579,Sum Observations,11489.0
Std Deviation,5.74120072,Variance,32.9613857
Skewness,1.25239527,Kurtosis,6.04561068
Uncorrected SS,322479.0,Corrected SS,14074.5117
Coeff Variation,21.3877092,Std Error Mean,0.27751141

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,26.84346,Std Deviation,5.7412
Median,26.0,Variance,32.96139
Mode,26.0,Range,54.0
,,Interquartile Range,5.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,96.7292,Pr > |t|,<.0001
Sign,M,214.0,Pr >= |M|,<.0001
Signed Rank,S,45903.0,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,66
99%,44
95%,36
90%,34
75% Q3,29
50% Median,26
25% Q1,24
10%,20
5%,18
1%,16

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
12,167,44,156
13,119,46,405
14,252,51,150
16,217,51,374
16,216,66,151


## PROC FREQ
The `proc freq` step produces frequency tables. The `TABLES` statement lists the columns to tabulate; if it is omitted, the step creates a frequency table for every column in the dataset. The code below creates a frequency table for the `Origin` column in the `cars` dataset.

In [20]:
PROC FREQ data=sashelp.cars;
    tables Origin;
run;

Origin,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Asia,158,36.92,158,36.92
Europe,123,28.74,281,65.65
USA,147,34.35,428,100.0
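
The `TABLES` statement can also request a two-way (cross-tabulation) table by joining two columns with an asterisk. The following is a minimal sketch using the same `cars` dataset; the pairing of `Origin` with `Type` and the `NOROW`/`NOCOL` options are illustrative choices.

```sas
/* Two-way frequency table: rows are Origin, columns are Type.
   NOROW and NOCOL suppress the row and column percentages
   so that each cell shows only the count and overall percent. */
proc freq data=sashelp.cars;
    tables Origin*Type / norow nocol;
run;
```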


## SAS Output Delivery System
The SAS Output Delivery System (ODS) controls how output is generated, formatted, and delivered. ODS allows customization and can direct your output to a variety of destinations (HTML, PDF, Excel, etc.). For now, we will focus only on selecting which output to display.

Adding `ods trace on;` before a step writes detailed information about each output object the procedure generates to the log (the `log` tab). This identifies the name of each output table, which you can then list in an `ods select` statement to display only the outputs you want.
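
As a sketch of this workflow, the code below first traces the `PROC UNIVARIATE` step from earlier and then reruns it showing only two of its five tables. The table names `BasicMeasures` and `Quantiles` are the names the trace is expected to report for this procedure; confirm them in your own log before relying on them.

```sas
/* Step 1: write the name of every output object to the log */
ods trace on;
proc univariate data=sashelp.cars;
    var MPG_Highway;
run;
ods trace off;

/* Step 2: display only the selected tables.
   ODS SELECT applies to the next procedure step. */
ods select BasicMeasures Quantiles;
proc univariate data=sashelp.cars;
    var MPG_Highway;
run;
```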

## Filtering Data in the PROC Step
The `WHERE` statement in a `PROC` step filters rows based on specific conditions. This lets you focus on the subset of your data that meets certain criteria without creating a new dataset. The `WHERE` statement can be used in all of the `PROC` steps described above.

In [None]:
PROC procedure-name DATA=input-table;
    WHERE expression;
RUN;

### Operators
| Operator                   | Symbol              |
|-----------------------------|---------------------|
| Equals                      | `=` or `EQ`         |
| Not Equal                   | `^=` or `~=` or `NE`|
| Greater Than                | `>` or `GT`         |
| Less Than                   | `<` or `LT`         |
| Greater Than or Equal to    | `>=` or `GE`        |
| Less Than or Equal to       | `<=` or `LE`        |


The table above summarizes common comparison operators used in SAS for filtering data. When comparing character values, the string must be enclosed in double or single quotation marks, and the comparison is case-sensitive; for example, `"p1" = "P1"` is false. Numeric values must be written as plain numbers, with no currency symbols or thousands separators (for example, `30000`, not `$30,000`).

If you are comparing date values, write the date as a *SAS date constant*: a quoted date in `DDMMMYYYY` form followed by `d`, for example `WHERE date > "01NOV2016"d;`. Dates are stored as numeric values, so the date constant converts the quoted date into its numeric equivalent before the expression is evaluated.
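
The `cars` table has no date column, so the sketch below assumes the `sashelp.stocks` sample table and its `Stock`, `Date`, and `Close` columns instead. The cutoff date is an arbitrary illustrative value.

```sas
/* Keep only rows dated on or after 1 January 2005.
   The trailing d turns the quoted text into a numeric SAS date value. */
proc print data=sashelp.stocks (obs=5);
    where Date >= "01JAN2005"d;
    var Stock Date Close;
run;
```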

### Using AND or OR

`WHERE` expressions can be combined with `AND` or `OR`.

In [3]:
PROC PRINT data = sashelp.cars (OBS=10);
    WHERE TYPE = "SUV" and MSRP <= 30000;
run;

Obs,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
48,Buick,Rendezvous CX,SUV,USA,Front,"$26,545","$24,085",3.4,6,185,19,26,4024,112,187
67,Chevrolet,Tracker,SUV,USA,Front,"$20,255","$19,108",2.5,6,165,19,22,2866,98,163
121,Ford,Explorer XLT V6,SUV,USA,All,"$29,670","$26,983",4.0,6,210,15,20,4463,114,190
122,Ford,Escape XLS,SUV,USA,All,"$22,515","$20,907",3.0,6,201,18,23,3346,103,173
152,Honda,Pilot LX,SUV,Asia,All,"$27,560","$24,843",3.5,6,240,17,22,4387,106,188
153,Honda,CR-V LX,SUV,Asia,All,"$19,860","$18,419",2.4,4,160,21,25,3258,103,179
154,Honda,Element LX,SUV,Asia,All,"$18,690","$17,334",2.4,4,160,21,24,3468,101,167
168,Hyundai,Santa Fe GLS,SUV,Asia,Front,"$21,589","$20,201",2.7,6,173,20,26,3549,103,177
189,Isuzu,Rodeo S,SUV,Asia,Front,"$20,449","$19,261",3.2,6,193,17,21,3836,106,178
202,Jeep,Grand Cherokee Laredo,SUV,USA,Front,"$27,905","$25,686",4.0,6,195,16,21,3790,106,181
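
When `AND` and `OR` appear in the same expression, `AND` is evaluated before `OR`, so parentheses are worth adding whenever the intended grouping differs or might be unclear. A minimal sketch, with illustrative filter values:

```sas
/* SUVs that come from either Asia or Europe.
   Without the parentheses, the condition would be read as
   (Type = "SUV" and Origin = "Asia") or Origin = "Europe". */
proc print data=sashelp.cars (obs=10);
    where Type = "SUV" and (Origin = "Asia" or Origin = "Europe");
    var Make Model Type Origin MSRP;
run;
```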


### IN or NOT IN
When a column should be compared against several values, the `IN` (or `NOT IN`) operator is more concise than chaining many `OR` conditions. The `IN` operator works with both numeric and character values. Character values must be enclosed in quotation marks and are still case-sensitive.

In [6]:
PROC PRINT data = sashelp.cars (obs=10);
    WHERE TYPE in ("SUV", "Truck");
run;

Obs,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
1,Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6,265,17,23,4451,106,189
27,BMW,X3 3.0i,SUV,Europe,All,"$37,000","$33,873",3.0,6,225,16,23,4023,110,180
28,BMW,X5 4.4i,SUV,Europe,All,"$52,195","$47,720",4.4,8,325,16,22,4824,111,184
47,Buick,Rainier,SUV,USA,All,"$37,895","$34,357",4.2,6,275,15,21,4600,113,193
48,Buick,Rendezvous CX,SUV,USA,Front,"$26,545","$24,085",3.4,6,185,19,26,4024,112,187
56,Cadillac,Escalade,SUV,USA,Front,"$52,795","$48,377",5.3,8,295,14,18,5367,116,199
57,Cadillac,SRX V8,SUV,USA,Front,"$46,995","$43,523",4.6,8,320,16,21,4302,116,195
63,Cadillac,Escalade EXT,Truck,USA,All,"$52,975","$48,541",6.0,8,345,13,17,5879,130,221
64,Chevrolet,Suburban 1500 LT,SUV,USA,Front,"$42,735","$37,422",5.3,8,295,14,18,4947,130,219
65,Chevrolet,Tahoe LT,SUV,USA,All,"$41,465","$36,287",5.3,8,295,14,18,5050,116,197


### IS MISSING or IS NOT MISSING
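Another special operator is `IS MISSING` (or its negation, `IS NOT MISSING`). `IS MISSING` selects rows where the column has a missing (null) value, and it works for both character and numeric columns. The log below reports that no observations were selected, which tells us that the column `Type` has no missing values.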

In [47]:
PROC PRINT data = sashelp.cars;
    WHERE TYPE is missing;
run;

97                                                         The SAS System                             17:59 Monday, October 21, 2024

2534       ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
2534     ! ods graphics on / outputfmt=png;
[38;5;21mNOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1[0m
2535       
2536       PROC PRINT data = sashelp.cars;
2537           WHERE TYPE is missing;
2538       run;

[38;5;21mNOTE: No observations were selected from data set SASHELP.CARS.[0m
[38;5;21mNOTE: There were 0 observations read from the data set SASHELP.CARS.
      WHERE TYPE is null;[0m
[38;5;21mNOTE: PROCEDURE PRINT used (Total process time):
      real time           0.06 seconds
      cpu time            0.03 seconds
      [0m

2539       
2540       
2541       ods html5 (id=saspy_internal) close;ods listing;
2542       
98                                                         The SAS System     

### BETWEEN
We can also use `BETWEEN ... AND` to select values within a specific range. The range is inclusive: rows whose values equal either endpoint are also selected.

In [53]:
PROC PRINT data = sashelp.cars (obs=5);
    WHERE MSRP between 36000 and 50000;
run;

Obs,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
1,Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6,265,17,23,4451,106,189
5,Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",3.5,6,225,18,24,3880,115,197
6,Acura,3.5 RL w/Navigation 4dr,Sedan,Asia,Front,"$46,100","$41,100",3.5,6,225,18,24,3893,115,197
13,Audi,A6 3.0 4dr,Sedan,Europe,Front,"$36,640","$33,129",3.0,6,220,20,27,3561,109,192
14,Audi,A6 3.0 Quattro 4dr,Sedan,Europe,All,"$39,640","$35,992",3.0,6,220,18,25,3880,109,192


### LIKE
We can also use the `LIKE` operator to match character values against a pattern. The `%` wildcard matches any number of characters (including none), and the `_` wildcard matches exactly one character. The code below prints rows where `Make` starts with "Ch". (This matches both Chevrolet and Chrysler, but the output is limited to 5 rows for the purposes of these notes.)

In [62]:
PROC PRINT data = sashelp.cars (OBS=5);
    WHERE Make like "Ch%";
run;

Obs,Make,Model,Type,Origin,DriveTrain,MSRP,Invoice,EngineSize,Cylinders,Horsepower,MPG_City,MPG_Highway,Weight,Wheelbase,Length
64,Chevrolet,Suburban 1500 LT,SUV,USA,Front,"$42,735","$37,422",5.3,8,295,14,18,4947,130,219
65,Chevrolet,Tahoe LT,SUV,USA,All,"$41,465","$36,287",5.3,8,295,14,18,5050,116,197
66,Chevrolet,TrailBlazer LT,SUV,USA,Front,"$30,295","$27,479",4.2,6,275,16,21,4425,113,192
67,Chevrolet,Tracker,SUV,USA,Front,"$20,255","$19,108",2.5,6,165,19,22,2866,98,163
68,Chevrolet,Aveo 4dr,Sedan,USA,Front,"$11,690","$10,965",1.6,4,103,28,34,2370,98,167
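
The `_` wildcard can be sketched in the same way. The pattern `"K_a"` below is an illustrative choice: it matches three-character values that start with "K" and end with "a", with any single character in between.

```sas
/* _ stands for exactly one character, so "K_a" can match a value
   such as Kia but not a longer value that merely starts with K. */
proc print data=sashelp.cars (obs=5);
    where Make like "K_a";
    var Make Model Type;
run;
```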


## SAS Macro Variable
A SAS macro variable is a symbolic placeholder that stores a text value (numbers are stored as text too) and can be used throughout your SAS code to make it more dynamic and efficient. Macro variables let you substitute values into your code, which makes large portions of code easier to update, reuse, and control.

- A macro variable is defined with the `%LET` statement.
- In your code, it is referenced by prefixing its name with an ampersand (`&`). Inside a quoted string, use double quotation marks; macro variables do not resolve inside single quotation marks.


In [70]:
%let CarMake=Kia;

PROC PRINT data=sashelp.cars;
    where Make="&CarMake";
    var Type Make Model MSRP;
run;

PROC freq data=sashelp.cars;
    where Make="&CarMake";
    tables Origin Type;
run;

Obs,Type,Make,Model,MSRP
205,SUV,Kia,Sorento LX,"$19,635"
206,Sedan,Kia,Optima LX 4dr,"$16,040"
207,Sedan,Kia,Rio 4dr manual,"$10,280"
208,Sedan,Kia,Rio 4dr auto,"$11,155"
209,Sedan,Kia,Spectra 4dr,"$12,360"
210,Sedan,Kia,Spectra GS 4dr hatch,"$13,580"
211,Sedan,Kia,Spectra GSX 4dr hatch,"$14,630"
212,Sedan,Kia,Optima LX V6 4dr,"$18,435"
213,Sedan,Kia,Amanti 4dr,"$26,000"
214,Sedan,Kia,Sedona LX,"$20,615"

Origin,Frequency,Percent,Cumulative Frequency,Cumulative Percent
Asia,11,100.0,11,100.0

Type,Frequency,Percent,Cumulative Frequency,Cumulative Percent
SUV,1,9.09,1,9.09
Sedan,9,81.82,10,90.91
Wagon,1,9.09,11,100.0
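
A macro variable can also stand in for a numeric value, in which case it is referenced without quotation marks. In the sketch below, the name `MaxPrice` and its value are hypothetical, and `%PUT` is used only to write the resolved value to the log.

```sas
/* Hypothetical macro variable holding a price cutoff */
%let MaxPrice=20000;

/* Write the resolved value to the log to check it */
%put MaxPrice resolves to &MaxPrice;

/* No quotation marks: the text 20000 is substituted as a number */
proc print data=sashelp.cars (obs=5);
    where MSRP <= &MaxPrice;
    var Make Model MSRP;
run;
```

Changing the single `%let` line is then enough to rerun the step with a different cutoff.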


## FORMAT
If you would like to make the values in your SAS table more readable, you can do so with the `FORMAT` statement within a `proc print` step. For instance, a cell containing 12345.678 can be displayed as 12,345.7, which is easier to read. A format changes only how a value is displayed, not the stored value.

```
PROC PRINT DATA=input-table;
    FORMAT col-name(s) format;
RUN;
```

In the code template above, the `format` portion has the general form `<$>format-name<w>.<d>`, where
- `<$>`: indicates a character format
- `<w>`: the total width of the formatted value, including decimal points, commas, and other special characters
- `<d>`: the number of decimal places for numeric values
- `.` (the period): a required part of every format name

Here are some example formats for numeric and date values: 
#### Example formats for Numeric Values
| Format Name | Example Value | Format Applied | Formatted Value |
|-------------|---------------|----------------|-----------------|
| w.d         | 20435.86      | 5.             | 20436           |
| COMMAw.d    | 20435.86      | COMMA8.1       | 20,435.9        |
| DOLLARw.d   | 20435.86      | DOLLAR10.      | $20,436         |

#### Example formats for Date Values

| Example Value | Format Applied | Formatted Value            |
|---------------|----------------|----------------------------|
| 21199         | DATE7.         | 15JAN18                    |
| 21199         | MMDDYY10.      | 01/15/2018                 |
| 21199         | WEEKDATE.      | Monday, January 15, 2018   |

Note: Date values in SAS are stored as the number of days between January 1, 1960 and the specific date (which is why at first glance, the example value does not look like a Date variable).

You can find more documentation about the several formatting names you can use here:  https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/allprodsle/syntaxByType-format.html 

The code below shows the `EngineSize` and `Weight` values of the `cars` dataset before and after formatting.


In [18]:
proc print data=sashelp.cars (obs=2);
    var Make Model EngineSize Weight;
run;

Obs,Make,Model,EngineSize,Weight
1,Acura,MDX,3.5,4451
2,Acura,RSX Type S 2dr,2.0,2778


In [24]:
proc print data=sashelp.cars (obs=2);
/* EngineSize: the 1. format uses a width of 1 with no decimals, so values are rounded to whole numbers.
   Weight: COMMA10. allows a total width of 10 characters (including commas) with no decimals. */
    format EngineSize 1. Weight COMMA10.; 
    var Make Model EngineSize Weight;
run;

Obs,Make,Model,EngineSize,Weight
1,Acura,MDX,4,"4,451"
2,Acura,RSX Type S 2dr,2,"2,778"
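
Several formats, including date formats, can be applied in one `FORMAT` statement. Because the `cars` table has no date column, this sketch assumes the `sashelp.stocks` sample table and its `Stock`, `Date`, `Close`, and `Volume` columns; the widths chosen are illustrative.

```sas
/* Date   -> MMDDYY10.  displays the date as mm/dd/yyyy
   Close  -> DOLLAR10.2 adds a dollar sign, commas, and two decimals
   Volume -> COMMA15.   adds commas, no decimals */
proc print data=sashelp.stocks (obs=3);
    format Date MMDDYY10. Close DOLLAR10.2 Volume COMMA15.;
    var Stock Date Close Volume;
run;
```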


## SORT
Sorting is a useful tool when exploring data. The `PROC SORT` step rearranges the rows of a dataset, making it easier to view the top or bottom values. If you specify `OUT=`, SAS writes the sorted rows to a new dataset and leaves the input dataset unchanged; if you omit it, SAS overwrites the input dataset with the sorted rows. By default, sorting is done in **ascending** order. To sort in **descending** order, place the `DESCENDING` keyword before each variable that should be sorted that way.

`proc sort` template:
```
PROC SORT DATA = input <OUT=output>;
    BY <DESCENDING> cols;
RUN;
```

In the code below, we sort the data by `Age` in descending order. Within each age group, the rows are then sorted by `Name` in ascending order.

In [33]:
PROC SORT data=sashelp.class out=test_sort;
    by DESCENDING age Name;
run;

PROC print data=test_sort (obs=10);
run;

Obs,Name,Sex,Age,Height,Weight
1,Philip,M,16,72.0,150.0
2,Janet,F,15,62.5,112.5
3,Mary,F,15,66.5,112.0
4,Ronald,M,15,67.0,133.0
5,William,M,15,66.5,112.0
6,Alfred,M,14,69.0,112.5
7,Carol,F,14,62.8,102.5
8,Henry,M,14,63.5,102.5
9,Judy,F,14,64.3,90.0
10,Alice,F,13,56.5,84.0
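
Because `DESCENDING` applies only to the variable that immediately follows it, the keyword has to be repeated to sort more than one variable in descending order. A minimal sketch, where the output dataset name `work.class_sorted` is a hypothetical choice:

```sas
/* Oldest students first; within each age, tallest first.
   DESCENDING is repeated because it affects only the next variable. */
proc sort data=sashelp.class out=work.class_sorted;
    by descending Age descending Height;
run;

proc print data=work.class_sorted (obs=5);
run;
```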


### Removing Duplicate Values with Sort
We can also use `PROC SORT` to remove duplicate rows, as sorting helps to identify and eliminate duplicates more easily.
```
PROC SORT DATA = input-table <OUT=output-table>
    NODUPKEY <DUPOUT=output-table>;
BY _ALL_;
RUN;
```

In the sort step above, `BY _ALL_` sorts by all columns so that fully duplicated rows end up adjacent. `NODUPKEY` then removes rows whose BY values duplicate those of the previous row, keeping only the first occurrence of each unique combination. `DUPOUT=output-table` writes the removed duplicate rows to a separate table.

Our code below prints three tables: the original `class` dataset, the cleaned dataset (duplicates removed), and the duplicates that were found. (Note: each is titled with the global `title` statement before its `proc print` step.)

If you look closely at the original dataset, you will see that observations 7 and 16 are duplicates, as are observations 12 and 19. Our `proc sort` step sorts the `class` dataset from the `ma505` library by all columns, saves the de-duplicated rows to `ma505.clean`, and saves the removed duplicates to `ma505.duplicates`. In the `Duplicates Removed` table, the number of observations drops from 21 to 19 because 2 rows were removed.

In [63]:
title 'Original Dataset';
proc print data=ma505.class;
run;

proc sort data=ma505.class
    out=ma505.clean
    nodupkey
    dupout=ma505.duplicates;
by _all_;
run;

title 'Duplicates Removed';
proc print data=ma505.clean;
run;

title 'Duplicates Found';
proc print data=ma505.duplicates;
run;

Obs,Name,Sex,Age,Height,Weight
1,Alfred,M,14,69.0,112.5
2,Alice,F,13,56.5,84.0
3,Barbara,F,13,65.3,98.0
4,Carol,F,14,62.8,102.5
5,Henry,M,14,63.5,102.5
6,James,M,12,57.3,83.0
7,Jane,F,12,59.8,84.5
8,Janet,F,15,62.5,112.5
9,Jeffrey,M,13,62.5,84.0
10,John,M,12,59.0,99.5

Obs,Name,Sex,Age,Height,Weight
1,Alfred,M,14,69.0,112.5
2,Alice,F,13,56.5,84.0
3,Barbara,F,13,65.3,98.0
4,Carol,F,14,62.8,102.5
5,Henry,M,14,63.5,102.5
6,James,M,12,57.3,83.0
7,Jane,F,12,59.8,84.5
8,Janet,F,15,62.5,112.5
9,Jeffrey,M,13,62.5,84.0
10,John,M,12,59.0,99.5

Obs,Name,Sex,Age,Height,Weight
1,Jane,F,12,59.8,84.5
2,Judy,F,14,64.3,90.0
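
`NODUPKEY` compares only the BY variables, so naming specific columns instead of `_ALL_` keeps one row per BY value even when the other columns differ. The sketch below uses `sashelp.class`; the output dataset names `work.first_per_age` and `work.other_rows` are hypothetical.

```sas
/* Keep the first row encountered for each distinct Age.
   Rows that repeat an Age already seen are written to DUPOUT=. */
proc sort data=sashelp.class
    out=work.first_per_age
    nodupkey
    dupout=work.other_rows;
    by Age;
run;

title 'First Row per Age';
proc print data=work.first_per_age;
run;
title;  /* clear the title so it does not carry over to later steps */
```

This is why `BY _ALL_` is used when the goal is to remove only rows that are identical in every column.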
