It might be hard imagining a scenario when you’d use the Cartesian product in Education Data Scieince (EDS). However, SAS programmers can use the Cartesian product for solving a number of analytical problems. Here I discuss solving one of those problems.
- How has a school district's determination level (DL) changed from one year to the next?
This article is structured into six parts: introduction, keywords, Cartesian product, scenario, SAS programming, and determination level change.
Below is a list of keywords and their definitions:
- SAS Language - is a computer programming language used for statistical analysis based on SAS (Statistical Analysis System), a statistical software suite developed by the SAS Institute.
- Ordered Pair - a pair of objects (denoted by: a,b) where one element is designated first and the other element is designated second.
- Cartesian Product - the Cartesian product of two sets A and B, denoted A × B, is the set of all possible ordered pairs where the elements of A are first and the elements of B are second.
- Determination Level - Each US school district is assigned one of four determination levels (DLs) for special education by their state education agency: Meets Requirements = 1, Needs Assistance = 2, Needs Intervention = 3, and Needs Substantial Intervention = 4.
What is the Cartesian product? The essence of the Cartesian product is that we can use sets to make a new set that contain all the paired elements from the initial sets. What does that mean? Let's break it down. A set is a collection of things with a common property. The things that make up a set are called elements. Supposed we have two sets: A and B. And, each set contains a series of elements. Let's use lower case letters, where set A has three elements: a, b, and c. And, set B has three elements: d, e, and f (see set notation below).
- A = {a,b,c}
- B = {d,e,f}
The Cartesian product allows us to find all combinations of the elements in both set A and set B. Given sets A and B, we can multiply them together to produce a new set denoted as A x B. This operation is called the Cartesian product. This new set contains all the paired combinations (or ordered pairs) from set A and set B. Because set A has 3 elements and set B has 3 elements, the Cartesian product is 9 ordered pairs: 3 x 3 = 9 (see notation for the Cartesian product below).
- A x B = {(a,d),(a,e),(a,f),(b,d),(b,e),(b,f),(c,d),(c,e),(c,f)}
So, from set A and set B we've created the Cartesian product A x B with nine ordered pairs. Those ordered pairs give all of the possible combinations of elements from both set A and set B.
Now let's look at a scenario of when we would use the Cartesian product.
Each K-12 U.S. school district that has a special education program is assigned one of four determination levels (DLs) annually by their state education agency:
- Meets Requirements (DL 1)
- Needs Assistance (DL 2)
- Needs Intervention (DL 3)
- Needs Substantial Intervention (DL 4)
Such DL assignment denotes the degree to which a school district has implemented the federal Individuals with Disabilities Education Act (IDEA). For example, in Texas each DL is based on an equation that combines different indicators like graduation, dropout, etc. However, that's not relevant here but you can read more about the RDA system.
This scenario is about how a school district's DL has changed from one year to the next.
Of course, if we're interested in one school district, we could look at their 2019 DL and their 2020 DL to know how their DL changed. But, when dealing with over a thousand districts, this becomes a task better suited for SAS programming, the Cartesian product, and conditional processing.
The first step is to use SAS to create a data table of the 2019 DL values. The values ranged from DL 0 to DL 4. The 2019 DL set and its elements can be expressed as:
- 19DL = {0,1,2,3,4}
The SAS code below starts with a DATA
step and creates a dataset called a_table
. The INPUT
statement tells SAS to create a DL
variable that's a character data type, denoted by $
, and a numeric variable called DL2019
. The DATALINES
statement reads the subsequent lines of data directly into the SAS program (rather than coming from an external data source). Each of the five lines starting with DL1
1
contain the data values for the two variables in the INPUT
statement. The RUN
statement tells SAS to execute the preceding block of code to generate the new SAS dataset.
data a_table;
input DL $ DL2019;
datalines;
DL0 0
DL1 1
DL2 2
DL3 3
DL4 4
;
run;
The screenshot below is the output from executing the preceding block of SAS code:
The second step is to use SAS to create a data table of the 2020 DL values. However, Unlike the 2019 DL values that ranged from DL 0 to DL 4, the 2020 DL values ranged from DL 1 to DL 4. The 2020 DL set and its elements can be expressed as:
- 20DL = {1,2,3,4}
The SAS code below starts with a DATA
step and creates a dataset called b_table
. The INPUT
statement tells SAS to create a DL
$
character variable and a DL2020
numeric variable. The DATALINES
statement reads the subsequent lines of data directly into the SAS program (rather than coming from an external data source). Each of the five lines starting with DL1
1
contain the data values for the two variables in the INPUT
statement. The RUN
statement tells SAS to execute the preceding block of code to generate the new SAS dataset.
data b_table;
input DL $ DL2020;
datalines;
DL1 1
DL2 2
DL3 3
DL4 4
;
run;
The screenshot below is the output from executing the preceding block of SAS code:
Now we have two datasets containing all DL values from 2019 and 2020. We'll use those datasets to create a new dataset of the Cartesian product of DL change from year-to-year. While a_table
has five elements, b_table
has four elements. Therefore, the Cartesian product of a_table
and b_table
is a_table
x b_table
or 5 x 4 = 20. So, there are 20 ordered pair combinations of DL change:
- 19DL x 20DL = {(0,1),(0,2),(0,3),(0,4),(1,1),(1,2),(1,3),(1,4),(2,1),(2,2),(2,3),(2,4),(3,1),(3,2),(3,3),(3,4),(4,1),(4,2),(4,3),(4,4)}
The SAS code below starts with a PROC SQL
step to join the two datasets together from a query. The CREATE TABLE
statement with the AS
keyword creates a dataset called cartesian_dl
. The asterisk *
in the SELECT
statement tells SAS to include all columns from both the a_table
and the b_table
datasets in the FROM
clause. The DROP=
option is associated with the output dataset cartesian_dl
, which means that SAS will not write the DL
variable to the output cartesian_dl
dataset because for our purposes we're only interested in the DL2019 and the DL2020 columns. The QUIT
statement ends the PROC SQL
procedure.
proc sql;
create table cartesian_dl (drop=DL) as
select *
from
a_table,
b_table
;
quit;
The screenshot below is the output from executing the preceding block of SAS code:
The Cartesian product allows us to see all the possible ways the DL values may have changed for a school district from one year to the next. Each school district with a DL in both years must meet one of the ordered pair combinations from the Cartesian product dataset. Then we can assess if the school district's DL increased, decreased, or didn't change from year-to-year.
We can use SAS programming and conditional logic to create a new variable DeterminationLevelChange
, and we can use the Cartesian product to account for every possible DL change combination, which I'll cover in my next GitHub article when time permits.
- PennState - "Reading Instream Data"
- Math LibreTexts, "The Cartesian Product"
- MathisFun, "Introduction to Sets"
- SAS Documentation, "Asterisk ("*") Notation"
- SAS Documentation, "About Creating a SAS Data Set with a DATA Step"
- SAS Documentation, "Creating Tables"
- SAS Documentation, "DROP="
- SAS Documentation, "INFORMAT Statement"
- SAS Documentation, "The HPDS2 Procedure"
- SAS Documentation, "Using the DROP= and KEEP= Data Set Options for Efficiency"
- SAS Support, "MERGING vs. JOINING: Comparing the DATA Step with SQL"
- SAS Support, "Sample 25270: Using PROC SQL to generate the Cartesian Product"
- SASnrd, "SAS Cartesian Product with PROC SQL and the Data Step"
- UCLA, "Inputing Data into SAS"
- web.mnstate.edu, "The Language of Sets — Cartesian Product"
- Wikipedia, "Cartesian product"