# Manipulating Data - Part 1

## Creating Variables

We created variables in an earlier example. Let's use a new data set and be more thorough here. 

The following raw data give information about hotels in Kyoto, Japan. The hotel name is followed by the nightly rate for two people in yen and the distance from Kyoto Station in kilometers.

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

As I read the data, I want to create a column called USD that converts Yen to US dollars. 

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

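Note that the assignment statement is placed before the DATALINES statement. Everything after DATALINES, up to the line containing the semicolon, is treated as data. 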

SAS Programming Good Rule 1: use as few DATA steps as possible (in most cases one step is enough!)

The reason is that every DATA step creates a new data set, which means reading and writing every observation again. On large data this slows the program down and can exhaust your computer's resources. 

Let's create another variable, miles. 

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

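SAS will overwrite the values of existing variables or data sets if you reuse the name, ***without giving an error***.
* SAS won't change the type of an existing variable from numeric to character or vice versa. If you assign a value of the other type, SAS converts the value where it can and only writes a NOTE to the log, so it is easy to miss. 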

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels; /* The data named kyotohotels will be replaced by the new one */
    SET kyotohotels;
    Miles = Kilometers * 0;  /* The column named Miles will be replaced by the new values */
RUN;
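
On the type point above, here is a minimal sketch of what happens when a character value meets arithmetic. The data set name `type_demo` and the variable `Yen_Text` are made up for illustration.

```sas
DATA type_demo;
    Yen_Text = "32200";        /* created as a character variable */
    USD = Yen_Text * 0.0089;   /* SAS converts the text to a number for this calculation */
RUN;
```

The step runs without an error, and `Yen_Text` stays a character variable. The only trace of the conversion is a NOTE in the log, so it is worth reading the log after every step.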

## Renaming and Labeling

You can rename variables using the RENAME statement. For example, we believe we may get confused about USD, and we want to give it a more explicit name, US_Dollar.

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels; 
    SET kyotohotels;
    RENAME USD = US_Dollar; 
RUN;

You can label variables using the LABEL statement. Each label can be up to 256 characters long. For example, we believe we may get confused about USD, and we want to give it the label "US$". You still refer to the variable by its name, but the label appears in place of the name in the output of most reporting procedures. Think of the label as the more informative text you want to present. 

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels;
    SET kyotohotels;
    LABEL USD = "US$" /* Note that US$ is in quotes. */
            Hotel = "Name of Hotel" /* You can label multiple variables like this. */
            Miles = "Distance to Kyoto Station in Miles"; 
RUN;
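
To see the labels, print the data. This is only a sketch, assuming the labeled `kyotohotels` data set from the step above. Many procedures pick up labels automatically, but PROC PRINT shows variable names unless you add the LABEL option, so the option is included here.

```sas
PROC PRINT DATA = kyotohotels LABEL; /* LABEL makes PROC PRINT use labels as column headings */
RUN;
```

Variables without a label, such as Yen and Kilometers, are still shown under their names.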

## Subsetting Data

### Subsetting Rows

#### Subsetting Rows using Conditions

In SAS, you can subset data using conditions. For example, we want a subset of kyotohotels that only contains hotels within 2.5 miles of Kyoto Station. You can do this in two steps, or combine the two steps into one. 

In [None]:
/* First read the data */
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

/* Second subset the data using if statement */
DATA kyotohotels_subset;
SET kyotohotels;
if Miles <= 2.5;
RUN;

/* Or you can combine them. */
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
IF Miles <= 2.5;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

Note that the IF statement comes after the definition of Miles. Try and see what happens if you put it before. Why the difference? 

Alternatively, you can use the WHERE statement. 

In [None]:
/* First read the data */
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

/* Second subset the data using where statement */
DATA kyotohotels_subset;
SET kyotohotels;
where Miles <= 2.5;
RUN;

* What's the difference between IF and WHERE? 
    * How they are executed.
        * A subsetting IF is executed row by row inside the DATA step. SAS reads an observation, checks the IF condition, decides whether to keep it, and then moves on to the next observation.
            * This means SAS has to read every observation from the SET data set before it can check the condition.
        * WHERE is applied as the data are read: SAS only brings in the observations that satisfy the condition. Hence WHERE is usually faster.
    * Where they can be used.
        * The WHERE statement can be used in procedures to subset data, while the IF statement cannot.
        * WHERE can be used as a **data set option**, while IF cannot.
        * IF can be used in a DATA step that reads raw data with an INPUT statement, but WHERE cannot, because WHERE only works on existing SAS data sets.
        * Because IF runs inside the DATA step, it can test variables created in that same step as well as automatic variables, while WHERE can only test variables that already exist in the input data set. More later. 

In [None]:
/* Example of a data set option */
DATA kyotohotels_subset;
SET kyotohotels (where = (Miles <= 2.5));
RUN;
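
The list above notes that WHERE also works inside procedures. As a small sketch, assuming the `kyotohotels` data set with the `Miles` variable created earlier, the following prints only the nearby hotels without creating a new data set:

```sas
PROC PRINT DATA = kyotohotels;
    WHERE Miles <= 2.5;   /* only observations meeting the condition are printed */
RUN;
```

Replacing WHERE with IF here would produce an error, because IF is a DATA step statement.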

#### Subsetting Rows using SAS Automatic Variables

* SAS has built-in variables that it maintains while it executes a DATA step. We call them SAS automatic variables.
    * ***\_N\_*** indicates the number of times SAS has looped through the DATA step.
    * ***\_ERROR\_*** has a value of 1 if there is a data error for that observation and 0 if there isn't.
        * Things that can cause data errors include invalid data (such as characters in a numeric field), conversion errors (like division by zero), and illegal arguments in functions (including log of zero).
    * ***FIRST.variable*** and ***LAST.variable*** are available when you are using a BY statement in a DATA step.
        * The FIRST.variable will have a value of 1 when SAS is processing an observation with the first occurrence of a new value for that variable and a value of 0 for the other observations.
        * The LAST.variable will have a value of 1 for an observation with the last occurrence of a value for that variable and the value 0 for the other observations.

* For example, what if we want to select the first observation of our data?
    * There is more than one way to do it. 

In [None]:
DATA kyotohotels_subset;
SET kyotohotels (obs = 1); /* <-- Using data set options */
RUN;

DATA kyotohotels_subset;
SET kyotohotels; 
if _N_ = 1; /* <-- Using automatic variable */
RUN;

* Look at this data. What if we want to select the first observation in each group (town)? 

In [None]:
DATA Town;
INFILE datalines;
INPUT Town $ 1-11 Rooms Price;
datalines;
Princeton  5 5.2
Princeton  2 0.7 
Princeton  1 0.55 
Hopewell   3 0.4 
Hopewell   4 0.9 
Hopewell   5 1.1 
Burlington 6 0.7
Burlington 7 0.8
Burlington 8 0.9
;
RUN;

In [None]:
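DATA Town_subset;
SET Town; 
BY Town NOTSORTED; /* <-- FIRST.Var and LAST.Var only exist when a BY statement is used. NOTSORTED is needed because the towns are grouped but not in alphabetical order. */
IF FIRST.Town; /* Selecting the first obs in each group. */ /* We cannot use WHERE here. */
RUN;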
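
LAST.variable works the same way. A minimal sketch, assuming the `Town` data set created above (the output name `Town_last` is made up for illustration):

```sas
DATA Town_last;
    SET Town;
    BY Town NOTSORTED;   /* NOTSORTED: the towns are grouped but not in alphabetical order */
    IF LAST.Town;        /* keep the last observation in each group */
RUN;
```

With the data above this keeps one row per town: the Princeton row with 1 room, the Hopewell row with 5 rooms, and the Burlington row with 8 rooms.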

### Subsetting Columns

Subsetting of columns is straightforward. You can use the KEEP or DROP statements or KEEP or DROP **data set options**.

Let's keep the American columns (USD and Miles) and drop the international columns (Kilometers and Yen). 

In [None]:
/* First read the data */
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
IF Miles <= 2.5;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

/* Then subset the columns */
DATA kyotohotels_subset;
SET kyotohotels;
KEEP Hotel Miles USD;
RUN;

/* Alternatively, use the drop statement */
DATA kyotohotels_subset;
SET kyotohotels;
DROP Yen Kilometers;
RUN;

The syntax for KEEP or DROP **data set options** is like this: 

In [None]:
DATA kyotohotels_subset;
SET kyotohotels (keep = Hotel Miles USD);
RUN;

DATA kyotohotels_subset;
SET kyotohotels (drop = Yen Kilometers);
RUN;

You can also do the following to yield the same result. 

In [None]:
DATA kyotohotels_subset (keep = Hotel Miles USD);
SET kyotohotels;
RUN;

DATA kyotohotels_subset (drop = Yen Kilometers);
SET kyotohotels;
RUN;

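What is the difference? 
* A data set option applies to the data set it is attached to. On the SET statement it controls which variables are read from the input data set; on the DATA statement it controls which variables are written to the output data set.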

Try the following two code blocks. What do they do? Why the difference? 

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels_subset (keep = Hotel Miles USD);
SET kyotohotels;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
RUN;

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels_subset;
SET kyotohotels (keep = Hotel Miles USD);
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
RUN;

## IF ... THEN ... ELSE ...

We used the IF statement when subsetting data. We can also create variables based on some conditions using IF THEN ELSE. For example, let's create a variable that tells us whether a hotel is close or far from the Kyoto station. Note that you are still creating a variable, but the statement to create the variable is under a certain condition. 

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels;
    SET kyotohotels;
    IF miles <= 2.5 then Close_or_Far = "Close";
    IF miles > 2.5 then Close_or_Far = "Far";
RUN;

Your turn. Try to create a variable called Cheap_or_Pricy. If the cost is not higher than \$200 then it is cheap. Otherwise it is pricy. 

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels;
    SET kyotohotels;
    IF miles <= 2.5 THEN Close_or_Far = "Close";
    IF miles > 2.5 THEN Close_or_Far = "Far";
    IF USD <= 200 THEN Cheap_or_Pricy = "Cheap";
    IF USD > 200 THEN Cheap_or_Pricy = "Pricy";
RUN;

There are some additional notes I want to add to the IF statements. 

### Use ELSE or ELSE IF

Because we have only two groups, when Miles is not less than or equal to 2.5 it can only be greater than 2.5. Instead of spelling out the miles > 2.5 condition, we can use ELSE. 

In [None]:
DATA kyotohotels;
    SET kyotohotels;
    IF miles <= 2.5 THEN Close_or_Far = "Close";
    ELSE Close_or_Far = "Far";
    IF USD <= 200 THEN Cheap_or_Pricy = "Cheap";
    ELSE Cheap_or_Pricy = "Pricy";
RUN;

If we want to create more complicated distance groups, we can use multiple IF statements or ELSE IF. Let's say we want to create a group called really close. A hotel is really close if it is within 2 miles of the station, close if it is within 2.5 miles of the station, and far if it is more than 2.5 miles away. Try the following code. Does it do the job?

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels;
    SET kyotohotels;
    IF miles < 2 THEN Close_or_Far = "Really Close";
    IF miles <= 2.5 THEN Close_or_Far = "Close";
    IF miles > 2.5 THEN Close_or_Far = "Far";
RUN;

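Because a DATA step executes its statements in order for every row, the second IF statement overwrites whatever the first IF statement accomplished: any hotel closer than 2 miles is also within 2.5 miles, so "Really Close" is replaced by "Close". 

To avoid things like this, we should always try to use ELSE IF. Plus, ELSE IF is faster, because SAS skips the remaining conditions once one of them is true. 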

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels;
    SET kyotohotels;
    IF miles < 2 THEN Close_or_Far = "Really Close";
    ELSE IF miles <= 2.5 THEN Close_or_Far = "Close";
    ELSE IF miles > 2.5 THEN Close_or_Far = "Far";
RUN;

### Use IF THEN DO END

When you want to do more than one thing after a condition check, you can use IF ... THEN DO; ... END;

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels;
    SET kyotohotels;
    LENGTH Close_or_Far $12. Need_Taxi $5.; /* Specifying the length of new character variables */
    IF miles < 2 THEN DO;
        Close_or_Far = "Really Close";
        Need_Taxi = "No";
    END;
    ELSE IF miles <= 2.5 THEN DO;
        Close_or_Far = "Close";
        Need_Taxi = "Maybe";
    END;
    ELSE IF miles > 2.5 THEN DO;
        Close_or_Far = "Far";
        Need_Taxi = "Yes";
    END;
RUN;

<center><font size="+2">Always close a <b>DO</b> with an <b>END</b>!</font></center>

### Conditional Expressions

To describe the conditions, you can use symbolic expressions like "<=", ">=", "&", "|"... These are pretty intuitive from our math classes. You can also use mnemonic expressions such as "LE", "GE", "AND", "OR".... I always prefer the mnemonic expressions because they are faster to type. 

Here is a table summarizing these operators. There is not much to understand about them; you just have to memorize them. Fortunately, they make sense. 

| Symbolic | Mnemonic | Description |
| :-: | :-: | :- |
| = | EQ | equal |
| ^=, or ~= | NE | not equal |
| > | GT | greater than |
| < | LT | less than |
| >= | GE | greater than or equal to |
| <= | LE | less than or equal to |
| & | AND | all conditions must be true |
| \| , ¦ , or ! | OR | at least one condition must be true |

Let's use the previous definitions of cheap and close, and create another variable called Rank. If a hotel is cheap and close, we rank it 1; if it is either cheap or close, we rank it 2; and if it is neither cheap nor close, we rank it 3. Try it yourself. Here are the conditions we were using. 

In [None]:
miles <= 2.5 /* close */
USD <= 200 /* cheap */

Here is the full code. 

In [None]:
DATA kyotohotels;
INFILE datalines;
INPUT Hotel $ 1-25 Yen Kilometers;
USD = Yen * 0.0089;
Miles = Kilometers * 0.62;
datalines;
The Grand West Arashiyama 32200 9.5
Kyoto Sharagam            48000 3.3
The Palace Side Hotel     10200 3.8
Rinn Fushimiinari         41000 2.9
Rinn Nijo Castle          18000 3.3
Suiran Kyoto              102000 11.0
;
RUN;

DATA kyotohotels;
    SET kyotohotels;
    IF miles <= 2.5 and USD <= 200 THEN Rank = 1;
    ELSE IF miles <= 2.5 or USD <= 200 THEN rank = 2;
    ELSE Rank = 3; /* Neither close nor cheap: any row that reaches this line has miles > 2.5 and USD > 200 */
RUN;
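
For comparison, here is a sketch of the same ranking written with the mnemonic operators from the table. It assumes the `kyotohotels` data set with `Miles` and `USD` from the step above; the output name `kyotohotels_ranked` is made up for illustration.

```sas
DATA kyotohotels_ranked;
    SET kyotohotels;
    IF Miles LE 2.5 AND USD LE 200 THEN Rank = 1;      /* close and cheap */
    ELSE IF Miles LE 2.5 OR USD LE 200 THEN Rank = 2;  /* close or cheap, but not both */
    ELSE Rank = 3;                                     /* neither */
RUN;
```

The symbolic and mnemonic forms are interchangeable, so pick one style and use it consistently within a program.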

## SAS Functions

* Functions are important components of any programming language.
* To understand a function, we need to know three things:
    1. What to feed the function
    2. What the function does
    3. What the function returns

### Functions for numbers

The SAS functions for numbers are not so special. There is one thing you really want to keep in mind: each time you call a function, you pass it columns (variables), but it is executed row by row. Let's look at the following example. Suppose I have the test scores of each student and I want to calculate the average of the 5 scores for each student. 


In [None]:
DATA class102;
INFILE DATALINES MISSOVER;
INPUT Name $ Test1 Test2 Test3 Test4 Test5;
datalines;
Nguyen 89 76 91 82 85
Ramos 67 72 80 76 86
Robbins 76 65 79 60 55
;
RUN;

DATA class102;
    SET class102;
    average_score = MEAN(Test1, Test2, Test3, Test4, Test5);
RUN;

SAS will look at the values of Test1 to Test5 in the first row, calculate their average, and then move on to the next row. If you want the average of each test across the 3 students, that is, the average of a column, a DATA step function will not do it on data laid out this way. Column summaries are the job of procedures such as PROC MEANS. 

You can also use a plain arithmetic expression to calculate the average, like this: 

In [None]:
DATA class102;
INFILE DATALINES MISSOVER;
INPUT Name $ Test1 Test2 Test3 Test4 Test5;
datalines;
Nguyen 89 76 91 82 85
Ramos 67 72 80 76 86
Robbins 76 65 79 60 55
;
RUN;

DATA class102;
    SET class102;
    average_score1 = MEAN(Test1, Test2, Test3, Test4, Test5);
    average_score2 = (Test1 + Test2 + Test3 + Test4 + Test5) / 5;
RUN;

The functions and math expressions handle missing values differently. Try this. What do you notice? Why the difference? 

In [None]:
DATA class102;
INFILE DATALINES MISSOVER;
INPUT Name $ Test1 Test2 Test3 Test4 Test5;
datalines;
Nguyen 89 76 91 82 85
Ramos 67 72 80 76 86
Robbins 76 65 79 60
;
RUN;

DATA class102;
    SET class102;
    average_score1 = MEAN(Test1, Test2, Test3, Test4, Test5);
    average_score2 = (Test1 + Test2 + Test3 + Test4 + Test5) / 5;
    average_score3 = SUM(Test1, Test2, Test3, Test4, Test5) / 5;
RUN;
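
One way to see what is going on is to try the three approaches on a tiny made-up example. The data set `missing_demo` and its variables are hypothetical; the N function used at the end counts nonmissing arguments.

```sas
DATA missing_demo;
    a = 10;
    b = .;                          /* a missing numeric value */
    c = 20;
    mean_fn = MEAN(a, b, c);        /* 15: MEAN ignores the missing value and divides by 2 */
    plus_op = (a + b + c) / 3;      /* missing: the + operator propagates missing values */
    sum_fn  = SUM(a, b, c) / 3;     /* 10: SUM ignores the missing value, but we still divide by 3 */
    n_valid = N(a, b, c);           /* 2: number of nonmissing arguments */
RUN;
```

Functions skip missing arguments, while arithmetic operators return missing as soon as one operand is missing. Which behavior is right depends on what a missing score should mean in your analysis.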

There are too many functions to list here. Just keep in mind that all common math operations have a function in SAS. If you need one, search for it. For example, searching for "remainder of a number in SAS" leads to the documentation page below, which is where you will usually end up. 

[SAS Help Center](https://documentation.sas.com/doc/en/vdmmlcdc/8.1/ds2ref/n0t9j8b09x4uphn1kl1i70x63z19.htm)
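
For the remainder example, the function in question is MOD. A minimal sketch (the data set name `mod_demo` is made up):

```sas
DATA mod_demo;
    remainder = MOD(17, 5);   /* 17 = 3 * 5 + 2, so the remainder is 2 */
RUN;
```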

### Functions for Characters

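Functions for characters are much more complicated. I will introduce a few that I have found useful here. 
* SUBSTR: Extracts part of a string, given a starting position and a length.
* FIND: Returns the position of a substring within a string, with optional modifiers and starting position.
* INDEX: Returns the position of a substring within a string.
* SCAN: Returns the n-th word of a string.
* CAT: Concatenates strings without removing leading or trailing blanks.
* CATS: Concatenates strings after removing leading and trailing blanks.
* CATX: Concatenates strings after removing leading and trailing blanks, inserting a delimiter between them.
* ||: The concatenation operator, which behaves like CAT.
* COMPRESS: Removes specified characters (blanks by default) from a string.
* TRIM: Removes trailing blanks.
* TRANSLATE: Replaces specific characters in a character expression.
* TRANWRD: Replaces every occurrence of a substring with another substring.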


In [None]:
/* Substr */
data firstlast;
    input string $60.;
    WORD1=substr(string, 1,4);
    WORD2=substr(string, 2,4);
    WORD3=substr(string, 2,3);
    datalines;
    Jack and Jill
    & Bob & Carol & Ted & Alice &
    Leonardo
    ! $ % & ( ) * + , - . / 
;;;;
RUN;

In [None]:
/* Index and Find */
/* Find */
data firstlast;
    input string $60.;
    find1=find(string,"Jack");
    find2=find(string,"Bob");
    find3=find(string,"&");
    find4=find(string,"Carol");
    datalines;
    Jack and Jill
    & Bob & Carol & Ted & Alice &
    Leonardo
    ! $ % & ( ) * + , - . / 
;;;;
RUN;

/* Index */
data firstlast;
    input string $60.;
    index1=index(string,"Jack");
    index2=index(string,"Bob");
    index3=index(string,"&");
    index4=index(string,"Carol");
    datalines;
    Jack and Jill
    & Bob & Carol & Ted & Alice &
    Leonardo
    ! $ % & ( ) * + , - . / 
;;;;
RUN;
/* Index and Find are very similar. In Find you can have modifiers and specify the starting position, but you cannot do that in Index. */
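
The comment above mentions modifiers and starting positions in FIND. Here is a small sketch of both, using a made-up data set `find_demo`; the `"i"` modifier asks FIND to ignore case.

```sas
DATA find_demo;
    string = "Jack and Jill";
    pos_default  = FIND(string, "jack");        /* 0: the search is case sensitive by default */
    pos_nocase   = FIND(string, "jack", "i");   /* 1: the "i" modifier ignores case */
    pos_second_j = FIND(string, "J", 2);        /* 10: start searching at position 2, so the J in Jill is found */
RUN;
```

INDEX accepts only the string and the substring, so it would behave like `pos_default` here.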

In [None]:
/* Scan */
data firstlast;
    input string $60.;
    First_Word=scan(string, 1);
    Last_Word=scan(string, -1);
    WORD1=scan(string, 1);
    WORD2=scan(string, 2);
    WORD3=scan(string, 3);
    datalines;
    Jack and Jill
    & Bob & Carol & Ted & Alice &
    Leonardo
    ! $ % & ( ) * + , - . / 
;;;;
RUN;

In [None]:
/* CAT CATS and CATX */
data temp;
      x='  The 2012 Olym'; 
      y='pic Arts Festi';
      z='  val included works by D  ';
      a='ale Chihuly.';
      result1=cat(x,y,z,a); /*Does not remove leading or trailing blanks, and returns a concatenated character string.*/
      result2=cats(x,y,z,a); /*Removes leading and trailing blanks, and returns a concatenated character string.*/
      result3=catx("$$",x,y,z,a); /*Removes leading and trailing blanks, inserts delimiters, and returns a concatenated character string.*/
      result4=x||y||z||a; /*The same as CAT*/
RUN;

In [None]:
/* Compress and Trim */
data temp;
      x='The 2012 Olym pic Arts Festi val included works by D ale Chihuly.'; 
      
      x1=compress(x); /*Remove spaces*/
      x2=compress(x,"O "); /*Remove spaces and O*/
      x3=compress(x,"aeiou"); /*Remove vowels*/ /*note that these are not treated as a whole thing*/
RUN;

data temp;
      x='The 2012 Olym pic Arts Festi val included works by D ale Chihuly.'; 
      
      x1=trim(x); /*Remove trailing blanks*/
RUN;

In [None]:
/* Translate and Tranwrd */
data temp;
      x='The 2012 Olym pic Arts Festi val included works by D ale Chihuly.'; 
      
      x1=translate(x,"*****","aeiou"); /*Replace vowels with *s*/ /*note that aeiou are not treated as a whole thing*/
      x2=translate(x,"^@107","aeiou"); /*Replace vowels with symbols*/ /*note that aeiou are not treated as a whole thing*/
RUN;

data temp;
      x='The 2012 Olym pic Arts Festi val included works by D ale Chihuly.'; 
      
      x1=Tranwrd(x,"aeiou","*****"); /*Nothing happens*/
      x2=Tranwrd(x,"Olym pic","Olympic"); /*Olym pic is replaced to be one word*/
RUN;
/* The TRANWRD function differs from TRANSLATE in that it scans for words (or patterns of characters) and replaces 
those words with a second word (or pattern of characters). */
/* As you can see, the argument order of translate and tranwrd can get confusing. In translate, the replacement is the second
argument, but in tranwrd, the replacement is the third argument.*/

### Converting between Numerics and Characters

To convert a character value to numerics, use the input function:

In [None]:
DATA temp;
    x = "0123";
    y = input(x, 4.); /*Note that I specify the informat; the INPUT function requires one*/
run;

To convert a numeric value to character, use the PUT function:

In [None]:
DATA temp;
    x = 123;
    y = put(x,4.); /*Note that I specify the format*/
run;

DATA temp;
    x = 123;
    y = put(x,z4.); /*Use the Zw. format to get leading 0s*/
run;

### Functions for Dates

#### How dates are handled in SAS

Like many programming languages, SAS stores dates as numbers. Run the following code and see what you get. What's the type of "Date"?

In [None]:
DATA contest;
INPUT Name $16. +1 Date MMDDYY10.;
DATALINES;
Alicia Grossman  10-28-2020
Matthew Lee      10-30-2020
Elizabeth Garcia 10-29-2020
Lori Newcombe    10-30-2020
Jose Martinez    10-31-2020
Brian Williams   10-29-2020
Brony Williams   10-29-1955
;
RUN;

10-28-2020 becomes 22216! 

That is because 10-28-2020 is exactly 22216 days after 01-01-1960, which is the base date in SAS (day 0). A date before the base date is stored as a negative number. 
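
You can check the base date yourself with date literals. A date literal such as `'28OCT2020'd` is SAS syntax for writing a date value directly in code. The data set and variable names below are made up for illustration.

```sas
DATA date_demo;
    base_date   = '01JAN1960'd;                  /* stored as 0 */
    day_before  = '31DEC1959'd;                  /* stored as -1 */
    contest_day = '28OCT2020'd;                  /* stored as 22216 */
    days_since_base = contest_day - base_date;   /* dates are numbers, so subtraction gives days: 22216 */
RUN;
```

No FORMAT statement is used here on purpose, so the variables show up as plain numbers.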

#### Date format and date informat

I cannot understand 22216 without the help of a calculator. How can we see the actual date? We need to ***FORMAT*** the variable. 

In [None]:
DATA contest;
INPUT Name $16. +1 Date MMDDYY10.; /* This MMDDYY10. is INFORMAT */
FORMAT Date MMDDYY10.; /* This MMDDYY10. is FORMAT */
DATALINES;
Alicia Grossman  10-28-2020
Matthew Lee      10-30-2020
Elizabeth Garcia 10-29-2020
Lori Newcombe    10-30-2020
Jose Martinez    10-31-2020
Brian Williams   10-29-2020
Brony Williams   10-29-1955
;
RUN;

INFORMAT tells SAS how to read the raw data. FORMAT tells SAS how to display the data. 

Try the following and see the difference. 

In [None]:
DATA contest;
INPUT Name $16. +1 Date MMDDYY10.;
FORMAT Date date9.;
DATALINES;
Alicia Grossman  10-28-2020
Matthew Lee      10-30-2020
Elizabeth Garcia 10-29-2020
Lori Newcombe    10-30-2020
Jose Martinez    10-31-2020
Brian Williams   10-29-2020
Brony Williams   10-29-1955
;
RUN;

DATA contest;
INPUT Name $16. +1 Date date9.;
FORMAT Date MMDDYY10.;
DATALINES;
Alicia Grossman  10-28-2020
Matthew Lee      10-30-2020
Elizabeth Garcia 10-29-2020
Lori Newcombe    10-30-2020
Jose Martinez    10-31-2020
Brian Williams   10-29-2020
Brony Williams   10-29-1955
;
RUN;

#### Date as strings

Sometimes you may get a data set where the date is loaded as a string. 

In [None]:
DATA contest;
INPUT Name $16. +1 Date $10.; /* Here the informat is not a date informat, so SAS will take it as a 10-character string. */
DATALINES;
Alicia Grossman  10-28-2020
Matthew Lee      10-30-2020
Elizabeth Garcia 10-29-2020
Lori Newcombe    10-30-2020
Jose Martinez    10-31-2020
Brian Williams   10-29-2020
Brony Williams   10-29-1955
;
RUN;

What can we do? We can extract month, day, and year from the string using SAS character functions. 

Which one to use? 
* Do you recall which function gives a part of a string? 

In [None]:
DATA contest;
SET contest;
Month = substr(Date,1,2); /* Note that the output of substr is also a string. */
Day = substr(Date,4,2);
Year = substr(Date,7,4); /* The year occupies positions 7 through 10. */
RUN;

How do we turn this into a date? We have to convert the strings into numerics and then use the mdy() function. 

The MDY function takes Month, Day, and Year as 3 inputs and returns a numeric date, but still without formatting. 

In [None]:
DATA contest;
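SET contest (keep = Name Date); /* Leave behind any character Month, Day, Year from the previous step so they can be created as numerics. */
Month = input(substr(Date,1,2), 4.); 
Day = input(substr(Date,4,2), 4.);
Year = input(substr(Date,7,4), 4.); /* The year occupies positions 7 through 10. */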

Num_Date = mdy(month, day, year);
format num_date date9.;
RUN;

Alternatively, we can use the PUT and INPUT functions. 
* ***PUT***: Returns a value using a specified **format**.
* ***INPUT***: Returns the value produced when SAS reads an expression using a specified **informat**.

As you can see, the difference is format vs. informat. What does that mean, though? Let's look at an example.

In [None]:
data temp;
    numeric = 1234; 
    character=put(numeric,6.); /*The 6. here is the specified FORMAT*//*Converting Numeric Values to Character Value*/
run;

In [None]:
data temp;
    character = "001234"; 
    numeric = input(character,6.); /*The 6. here is the specified INFORMAT*/ /*Converting Character Values to Numeric Value*/
run;

If the date is loaded as a string, how do we turn it into a number? 
* Right! Use INPUT function and INFORMAT. 

In [None]:
data temp;
    char_date = "1999-07-09"; 
    num_date = input(char_date,YYMMDD10.); /*Converting Character Values to Numeric Value*/
    format num_date YYMMDD10.;
run;

DATETIME format
* DATETIMEw.d
    * **w** specifies the width of the output field.
        * Default: 16
        * Range: 7–40
        * Tip: SAS requires a minimum w value of 16 to write a SAS datetime value with the date, hour, and seconds. Add an additional two places to w and a value to d to return values with optional decimal fractions of seconds.
    * **d** specifies the number of digits to the right of the decimal point in the seconds value. This argument is optional.
        * Range: 0–39
        * Requirement: must be less than w

Common DATETIME INFORMATS

You can find more here: [Date and Datetime Informats](https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/etsug/etsug_intervals_sect009.htm)

| Informat/Example | Description | Width Range | Default Width |
| --- | --- | --- | --- |
| ANYDTDTEw. | Reads and extracts the date value from any of the following: DATE, DATETIME, DDMMYY, JULIAN, MDYAMPM, MMDDYY, MMxYY*, MONYY, TIME, YMDDTTM, YYMMDD, YYQ, YYxMM*, month-day-year | 5–32 | 9 |
| ANYDTDTMw. | Reads and extracts the datetime value from any of the following: DATE, DATETIME, DDMMYY, JULIAN, MMDDYY, MMxYY*, MONYY, TIME, YYMMDD, YYQ, YYxMM*, month-day-year | 1–32 | 19 |
| ANYDTTMEw. | Reads and extracts the time value from any of the following: DATE, DATETIME, DDMMYY, JULIAN, MMDDYY, MONYY, TIME, YYMMDD, YYQ, month-day-year | 1–32 | 8 |
| DATEw. | Day, month abbreviation, and year | 7–32 | 7 |
| DATETIMEw.d | Date and time: ddmonyy:hh:mm:ss | 13–40 | 18 |

In [None]:
data temp;
    char_date = "1999-07-09:11:20:23"; 
    num_date = input(char_date,ANYDTDTM19.); /*Converting Character Values to Numeric Value*/
    format num_date datetime16.;
run;

#### Commonly used date functions

Again, there are many date functions. I will list some commonly used ones here. 
* INTCK: Returns the number of intervals (such as days, months, or years) between two dates
* INTNX: Increments a date by a given number of intervals

In [None]:
data b;
    FORMAT WeddingDay date9. today date9.;
    WeddingDay='14feb2021'd;
    Today=today();
    YearsMarried=intck('YEAR', WeddingDay, today(), 'C');
    MonthsMarried=intck('MONTH', WeddingDay, today(), 'C');
    DaysMarried=intck('DAY', WeddingDay, today(), 'C');
run;
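
The `'C'` argument above asks INTCK for the continuous method. Without it, INTCK uses the discrete method and counts how many interval boundaries are crossed. A small sketch of the difference, with made-up names:

```sas
data intck_demo;
    FORMAT start stop date9.;
    start = '31dec2020'd;
    stop  = '01jan2021'd;
    years_discrete   = intck('YEAR', start, stop);        /* 1: one January 1 boundary is crossed */
    years_continuous = intck('YEAR', start, stop, 'C');   /* 0: a full year has not elapsed */
run;
```

For questions like "how long have they been married", the continuous method is usually the one that matches intuition.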

In [None]:
data b;
    FORMAT WeddingDay date9. today date9. FirstAnn date9. SecondAnn date9. SixMonthAnn date9.;
    WeddingDay='14feb2021'd;
    Today=today();
    FirstAnn=intnx('YEAR', WeddingDay, 1, 'SAME');
    SecondAnn=intnx('YEAR', WeddingDay, 2, 'SAME');
    SixMonthAnn=intnx('MONTH', WeddingDay, 6, 'SAME');
run;
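
The last argument of INTNX controls where in the interval the result lands. `'SAME'` keeps the same day of the month, as in the example above; `'BEGINNING'` and `'END'` move to the edges of the interval. A sketch with made-up variable names:

```sas
data intnx_demo;
    FORMAT WeddingDay MonthStart MonthEnd NextMonthSame date9.;
    WeddingDay    = '14feb2021'd;
    MonthStart    = intnx('MONTH', WeddingDay, 0, 'BEGINNING');   /* 01FEB2021 */
    MonthEnd      = intnx('MONTH', WeddingDay, 0, 'END');         /* 28FEB2021 */
    NextMonthSame = intnx('MONTH', WeddingDay, 1, 'SAME');        /* 14MAR2021 */
run;
```

An increment of 0 stays inside the current interval, which is a handy way to find the first or last day of a month.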