# DATA Steps
In SAS, the **DATA step** is one of the foundational components used for data manipulation and processing. It allows you to read, modify, transform, and create datasets. A DATA step processes observations from a source, such as a raw data file or an existing SAS dataset, and can apply calculations, conditional logic, or transformations to the data. The result of a DATA step is typically a new or updated SAS dataset. 

DATA steps are highly flexible and are commonly used for tasks like data cleaning, filtering, creating new variables, merging datasets, or reading data from external sources.

The code template below shows how to use the DATA step to create a SAS dataset.
```
DATA output-table;
    SET input-table;
RUN;
```

When you work with data, it is good practice to preserve the existing data and work on a copy. The `output-table` is that copy: given a one-level name, it becomes a temporary table in our WORK library by default. The `input-table` in the `SET` statement is the dataset we are reading from.

The code below takes our clean dataset from the section on removing duplicates and creates a temporary copy of it in our WORK library.

```
DATA cln_class;
    SET ma505.clean;
run;
```

39                                                         The SAS System                          20:50 Wednesday, October 23, 2024

1028       ods listing close;ods html5 (id=saspy_internal) file=_tomods1 options(bitmap_mode='inline') device=svg style=HTMLBlue;
1028     ! ods graphics on / outputfmt=png;
[38;5;21mNOTE: Writing HTML5(SASPY_INTERNAL) Body file: _TOMODS1[0m
1029       
1030       DATA cln_class;
1031           SET ma505.clean;
1032       run;

[38;5;21mNOTE: There were 19 observations read from the data set MA505.CLEAN.[0m
[38;5;21mNOTE: The data set WORK.CLN_CLASS has 19 observations and 5 variables.[0m
[38;5;21mNOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
      [0m

1033       
1034       
1035       ods html5 (id=saspy_internal) close;ods listing;
1036       
40                                                         The SAS System                          20:50 Wednesday, Octo

## Filtering and Subsetting
We can filter rows in the DATA step with the same syntax we use in a PROC step, i.e. the `WHERE` statement. To subset the data by columns, we can add a `KEEP` or `DROP` statement (i.e. `KEEP` these columns or `DROP` these columns).

The code below filters our clean class dataset to the rows where `Age` is greater than 14. Note that our table `cln_class` will now contain ONLY those rows. The `KEEP` statement drops every column other than `Name`, `Sex`, and `Age`.

```
DATA cln_class;
    SET ma505.clean;
    WHERE age > 14;
    KEEP Name Sex Age;
RUN;

PROC PRINT data=cln_class;
RUN;
```

| Obs | Name    | Sex | Age |
|-----|---------|-----|-----|
| 1   | Janet   | F   | 15  |
| 2   | Mary    | F   | 15  |
| 3   | Philip  | M   | 16  |
| 4   | Ronald  | M   | 15  |
| 5   | William | M   | 15  |
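
The same column subset can also be written with `DROP` instead of `KEEP`. The sketch below assumes that `ma505.clean` holds exactly the five columns `Name`, `Sex`, `Age`, `Height`, and `Weight`; the output name `cln_class_drop` is hypothetical and chosen only so the earlier table is not overwritten.

```sas
/* Sketch: same subset as above, but naming the columns to remove.
   Assumes ma505.clean has the columns Name, Sex, Age, Height, Weight. */
DATA cln_class_drop;        /* hypothetical output name */
    SET ma505.clean;
    WHERE age > 14;         /* same row filter as before */
    DROP Height Weight;     /* leaves Name, Sex, and Age */
RUN;

PROC PRINT data=cln_class_drop;
RUN;
```

`KEEP` tends to be the shorter choice when only a few columns survive, and `DROP` when only a few are removed.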


## Computing New Columns
In data manipulation, it's common to create new columns based on existing ones. This is done in the `DATA` step with an assignment statement. The code below performs several tasks: it keeps only the rows whose `Origin` is not `USA`, creates a new column called `Profit` (`MSRP` minus `Invoice`), and creates a new column, `Source`, which assigns the string `Non-US Cars` to each row. The `FORMAT` statement displays `Profit` as a dollar amount.

```
data cars_new;
    set sashelp.cars;
    where Origin ~= "USA";
    Profit = MSRP-Invoice;
    Source = "Non-US Cars";
    format Profit dollar10.;
    keep Model Profit Source;
run;

proc print data=cars_new (obs=5);
run;
```

| Obs | Model          | Profit | Source      |
|-----|----------------|--------|-------------|
| 1   | MDX            | $3,608 | Non-US Cars |
| 2   | RSX Type S 2dr | $2,059 | Non-US Cars |
| 3   | TSX 4dr        | $2,343 | Non-US Cars |
| 4   | TL 4dr         | $2,896 | Non-US Cars |
| 5   | 3.5 RL 4dr     | $4,741 | Non-US Cars |


## Functions 
SAS offers many functions that can make your data manipulation a bit more flexible!
```
function(arg1, arg2,...);
```

```
DATA output-table;
    SET input-table;
    new_col = function(args);
RUN;
```

The table below shows a few numeric functions that you can use for your data:

| Functions  | Syntax         |
|------------|----------------|
| SUM        | SUM(...)       |
| MEAN       | MEAN(...)      |
| MEDIAN     | MEDIAN(...)    |
| RANGE      | RANGE(...)     |
| MIN        | MIN(...)       |
| MAX        | MAX(...)       |
| N          | N(...)         |
| NMISS      | NMISS(...)     |

These functions accept any number of arguments. An important thing to note is that **missing values are ignored**!
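
To make the missing-value behaviour concrete, here is a small sketch with made-up values (the variables `a`, `b`, and `c` are illustrative and not part of the course data). It contrasts the `SUM` function with the `+` operator, which does propagate missing values.

```sas
/* Sketch with illustrative values; nothing is written to a dataset. */
data _null_;
    a = 10;
    b = .;                        /* a missing numeric value */
    c = 20;
    total_fn  = sum(a, b, c);     /* 30: the missing value is ignored */
    total_op  = a + b + c;        /* missing: the + operator propagates it */
    average   = mean(a, b, c);    /* 15: mean of the two nonmissing values */
    n_present = n(a, b, c);       /* 2 nonmissing arguments */
    n_absent  = nmiss(a, b, c);   /* 1 missing argument */
    put total_fn= total_op= average= n_present= n_absent=;
run;
```

The `PUT` statement writes the values to the log, where `total_op` appears as a period, the SAS symbol for a missing numeric value.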


This table below shows a few character functions:
| CHARACTER | Description                                                     |
|-----------|-----------------------------------------------------------------|
| UPCASE    | changes letters to uppercase                                    |
| LOWCASE   | changes letters to lowercase                                    |
| PROPCASE  | changes the first letter of each word to uppercase and others to lowercase |
| CATS      | concatenates character strings, removing leading and trailing blanks |
| SUBSTR    | returns a substring from a character string                     |
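
As a quick illustration of the character functions in the table, the sketch below applies them to a made-up string (the variable `name` and its value are hypothetical) and writes the results to the log.

```sas
/* Sketch with an illustrative string; nothing is written to a dataset. */
data _null_;
    name   = "acura mdx";
    upper  = upcase(name);                  /* ACURA MDX */
    lower  = lowcase(upper);                /* acura mdx */
    proper = propcase(name);                /* Acura Mdx */
    first5 = substr(name, 1, 5);            /* acura: 5 characters from position 1 */
    joined = cats("  A4 ", " Quattro  ");   /* A4Quattro: blanks are stripped */
    put upper= lower= proper= first5= joined=;
run;
```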


And this table below shows a few DATE functions:
| DATE    | Description                                                        |
|---------|--------------------------------------------------------------------|
| MONTH   | Returns a number from 1 through 12 that represents the month       |
| YEAR    | Returns the four-digit year                                        |
| DAY     | Returns a number from 1 through 31 that represents the day of the month |
| WEEKDAY | Returns a number from 1 through 7 that represents the day of the week |
| QTR     | Returns a number from 1 through 4 that represents the quarter      |
| TODAY   | Returns the current date as numeric SAS value                      |
| MDY     | Returns a SAS date value from month, day, and year                 |
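
The date functions can be tried out the same way. The sketch below builds one illustrative date with `MDY` (the date itself is an arbitrary choice) and takes it apart again; `WEEKDAY` counts from 1 for Sunday.

```sas
/* Sketch with an illustrative date; nothing is written to a dataset. */
data _null_;
    d       = mdy(10, 23, 2024);   /* SAS date value for 23 October 2024 */
    mon     = month(d);            /* 10 */
    yr      = year(d);             /* 2024 */
    dom     = day(d);              /* 23 */
    quarter = qtr(d);              /* 4 */
    dow     = weekday(d);          /* 4: a Wednesday, with 1 = Sunday */
    put d= date9. mon= yr= dom= quarter= dow=;
run;
```

Without the `date9.` format, `d` would print as a plain number, because SAS stores dates as a count of days.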


The example code below uses three functions (`SUM`, `MEAN`, and `UPCASE`) to create two new columns, `MPG` and `MPGavg`, and to overwrite the existing `Make` and `Type` columns with uppercase values.


```
data cars_new;
    set sashelp.cars;
    MPG = SUM(MPG_City, MPG_Highway);
    MPGavg = MEAN(MPG_City, MPG_Highway);
    Make = UPCASE(Make);
    Type = UPCASE(Type);
run;

proc print data=cars_new (obs=5);
run;
```

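| Obs | Make  | Model          | Type  | Origin | DriveTrain | MSRP    | Invoice | EngineSize | Cylinders | Horsepower | MPG_City | MPG_Highway | Weight | Wheelbase | Length | MPG | MPGavg |
|-----|-------|----------------|-------|--------|------------|---------|---------|------------|-----------|------------|----------|-------------|--------|-----------|--------|-----|--------|
| 1   | ACURA | MDX            | SUV   | Asia   | All        | $36,945 | $33,337 | 3.5        | 6         | 265        | 17       | 23          | 4451   | 106       | 189    | 40  | 20.0   |
| 2   | ACURA | RSX Type S 2dr | SEDAN | Asia   | Front      | $23,820 | $21,761 | 2.0        | 4         | 200        | 24       | 31          | 2778   | 101       | 172    | 55  | 27.5   |
| 3   | ACURA | TSX 4dr        | SEDAN | Asia   | Front      | $26,990 | $24,647 | 2.4        | 4         | 200        | 22       | 29          | 3230   | 105       | 183    | 51  | 25.5   |
| 4   | ACURA | TL 4dr         | SEDAN | Asia   | Front      | $33,195 | $30,299 | 3.2        | 6         | 270        | 20       | 28          | 3575   | 108       | 186    | 48  | 24.0   |
| 5   | ACURA | 3.5 RL 4dr     | SEDAN | Asia   | Front      | $43,755 | $39,014 | 3.5        | 6         | 225        | 18       | 24          | 3880   | 115       | 197    | 42  | 21.0   |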


## IF/THEN/ELSE: Conditional Processing
Sometimes we want to act only when a condition is true. This is done with an `IF/THEN` statement, a core conditional structure in SAS that lets you perform different actions depending on whether a condition is true or false. It is particularly useful when you want to make data-driven decisions during data manipulation.

The basic syntax evaluates an expression and performs a specified action if the expression is true. If the expression is false and you have an ELSE IF or ELSE block, SAS will evaluate the next condition or execute the ELSE block if none of the conditions are met.

Here’s the general syntax for an IF/THEN/ELSE structure:

```
DATA output-table;
    SET input-table;
    IF expression THEN statement;
    <ELSE IF expression THEN statement;>
    <ELSE statement;>
RUN;
```

This next code uses only the `IF/THEN` statement and defines a new column called `Cost` based on the value of `MSRP`.

```
data cars_new;
    set sashelp.cars;
    length Cost $ 10; /* this ensures our character col Cost is of length 10 */
    if MSRP<30000 then Cost = "Okay";
    if MSRP>=30000 then Cost = "Too High";
    keep Make Model Type MSRP Cost;
run;

proc print data=cars_new (obs=5);
run;
```

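| Obs | Make  | Model          | Type  | MSRP    | Cost     |
|-----|-------|----------------|-------|---------|----------|
| 1   | Acura | MDX            | SUV   | $36,945 | Too High |
| 2   | Acura | RSX Type S 2dr | Sedan | $23,820 | Okay     |
| 3   | Acura | TSX 4dr        | Sedan | $26,990 | Okay     |
| 4   | Acura | TL 4dr         | Sedan | $33,195 | Too High |
| 5   | Acura | 3.5 RL 4dr     | Sedan | $43,755 | Too High |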
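
Because the two conditions above are complements of each other, the same column can also be built with a single `IF/THEN/ELSE`. This is a minimal sketch of that variant, not a replacement for the cell above; the output name `cars_else` is hypothetical.

```sas
/* Sketch: the Cost column from above, written with ELSE. */
data cars_else;                        /* hypothetical output name */
    set sashelp.cars;
    length Cost $ 10;                  /* without this, Cost would take length 4
                                          from "Okay" and truncate "Too High" */
    if MSRP < 30000 then Cost = "Okay";
    else Cost = "Too High";            /* runs only when the IF condition is false */
    keep Make Model Type MSRP Cost;
run;

proc print data=cars_else (obs=5);
run;
```

With `ELSE`, SAS skips the second branch for any row that already satisfied the first condition, and the two branches cannot accidentally overlap or leave a gap.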


In SAS, when you need to execute multiple statements in response to a condition, you can use the DO and END statements within an IF/THEN structure. This allows you to group several actions together, ensuring that all of them are executed when the condition is true.

```
DATA output-table;
    SET input-table;

    IF expression THEN DO;
        /* Multiple statements go here */
        statement1;
        statement2;
        /* Add as many statements as needed */
    END;
    ELSE IF expression THEN DO;
        /* Multiple statements for another condition */
        statement3;
        statement4;
    END;
    ELSE DO;
        /* Multiple statements for the else condition */
        statement5;
        statement6;
    END;
RUN;
```




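The example code below splits the `sashelp.cars` dataset into two datasets, `under40` and `over40`, based on the value of the `MSRP` (Manufacturer's Suggested Retail Price) variable. It also creates a new variable called `Cost` that codes each car's price range: 1 for an `MSRP` below 20000, 2 for an `MSRP` from 20000 up to but not including 40000, and 3 for everything else. Each `OUTPUT` statement writes the current row to the named dataset.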

```
data under40 over40;
    set sashelp.cars;
    keep Make Model MSRP Cost;
    if MSRP < 20000 then do;
        Cost = 1;
        output under40;
    end;
    else if MSRP < 40000 then do;
        Cost = 2;
        output under40;
    end;
    else do;
        Cost = 3;
        output over40;
    end;
run;

title 'Under 40';
proc print data=under40 (obs=5);
run;

title 'Over 40';
proc print data=over40 (obs=5);
run;
```

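Under 40

| Obs | Make  | Model          | MSRP    | Cost |
|-----|-------|----------------|---------|------|
| 1   | Acura | MDX            | $36,945 | 2    |
| 2   | Acura | RSX Type S 2dr | $23,820 | 2    |
| 3   | Acura | TSX 4dr        | $26,990 | 2    |
| 4   | Acura | TL 4dr         | $33,195 | 2    |
| 5   | Audi  | A4 1.8T 4dr    | $25,940 | 2    |

Over 40

| Obs | Make  | Model                          | MSRP    | Cost |
|-----|-------|--------------------------------|---------|------|
| 1   | Acura | 3.5 RL 4dr                     | $43,755 | 3    |
| 2   | Acura | 3.5 RL w/Navigation 4dr        | $46,100 | 3    |
| 3   | Acura | NSX coupe 2dr manual S         | $89,765 | 3    |
| 4   | Audi  | A4 3.0 convertible 2dr         | $42,490 | 3    |
| 5   | Audi  | A4 3.0 Quattro convertible 2dr | $44,240 | 3    |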
