# V. Split-apply-combine

The split-apply-combine strategy partitions data into groups, performs a computation on each group, and then combines the results. The original dataframe typically has one or more categorical variables whose values define the groups.
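
As a minimal sketch of the idea without any dataframe library, the three steps can be written out by hand in plain Julia. The vectors `categories` and `weights` are made-up illustration data, not the dataframe used in this lesson.

```julia
using Statistics

# Hypothetical data: one category label and one weight per observation.
categories = ["Low", "High", "Low", "Medium", "High", "Low"]
weights    = [130, 150, 140, 145, 160, 120]

# Split: collect the weights that belong to each category.
groups = Dict{String, Vector{Int}}()
for (c, w) in zip(categories, weights)
    push!(get!(groups, c, Int[]), w)
end

# Apply + combine: compute one summary per group and gather the results.
group_means = Dict(c => mean(ws) for (c, ws) in groups)
```

The dataframe functions in this lesson do the same bookkeeping, with the grouping columns playing the role of the dictionary keys.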

We'll cover the primary dataframe functions for this type of workflow:

`by(df, cols, f)`

`aggregate(df, cols, f)`

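`groupby(df, cols; skipmissing=false)`

Note that `by` and `aggregate` were deprecated in DataFrames.jl 0.21 in favor of calling `combine` on the result of `groupby`.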

In [1]:
using DataFrames, Distributions, Random, Statistics, CategoricalArrays

In [2]:
#create a dataframe

Random.seed!(1234)

N = 20
df = DataFrame(ID = 1:N,
                Category = wsample(["Low", "Medium", "High"], [1/3, 1/3, 1/3], N),
                Weight = rand(120:170, N),
                Age = rand(20:80, N),
                IndVar = wsample([0, 1], [0.5, 0.5], N),
                RandNum = randn(N))

categorical!(df, [:Category, :IndVar]);
levels!(df.Category, ["Low", "Medium", "High"]);

In [3]:
df

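| Row | ID | Category | Weight | Age | IndVar | RandNum |
|-----|----|----------|--------|-----|--------|---------|
|  | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 1 | Medium | 152 | 33 | 0 | -0.981132 |
| 2 | 2 | High | 159 | 47 | 1 | -0.316387 |
| 3 | 3 | Medium | 140 | 73 | 1 | 0.265743 |
| 4 | 4 | Medium | 145 | 47 | 0 | 1.06561 |
| 5 | 5 | High | 121 | 49 | 1 | 1.38501 |
| 6 | 6 | High | 149 | 27 | 0 | 0.0799514 |
| 7 | 7 | Low | 138 | 26 | 1 | -0.833369 |
| 8 | 8 | Low | 126 | 20 | 1 | -0.443247 |
| 9 | 9 | Low | 125 | 35 | 0 | -1.66323 |
| 10 | 10 | Medium | 143 | 54 | 1 | -0.521229 |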


The `by` function applies <i>f</i> to the input dataframe, grouped by the indicated columns. The argument <i>f</i> is a pair: its first element names the column variable(s) to operate on, and its second element is the function to apply to them.
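
A pair is an ordinary Julia value, independent of any dataframe package. The short sketch below builds the same kind of pair by hand and applies its function to some made-up column values.

```julia
# A Pair is written with `=>` and simply holds two values.
p = :ID => length

p.first    # the column name, :ID
p.second   # the function to apply, length

# Applying the second element to some made-up column values:
n = p.second([10, 20, 30])
```

Here `n` is 3, the number of values in the vector.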

Let's calculate the number of samples for each value of the <i>Category</i> variable. The third argument to `by` applies the `length` function to the <i>ID</i> variable, and the second argument says to do this calculation separately for each <i>Category</i> value.

In [4]:
by(df, :Category, :ID => length)



| Row | Category | ID_length |
|-----|----------|-----------|
|  | Cat… | Int64 |
| 1 | Low | 8 |
| 2 | Medium | 7 |
| 3 | High | 5 |


The default column name combines the name of the column the function was applied to, an underscore, and the name of the function itself. You can specify a name for the column if you don't want the default name.

In [5]:
by(df, :Category, IDCountByCat = :ID => length)



| Row | Category | IDCountByCat |
|-----|----------|--------------|
|  | Cat… | Int64 |
| 1 | Low | 8 |
| 2 | Medium | 7 |
| 3 | High | 5 |
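
Newer DataFrames.jl releases express the same counts by calling `combine` on a grouped dataframe. The sketch below assumes DataFrames.jl 1.x and uses a small made-up dataframe `toy` in place of `df`; it shows the default output name, a custom output name, and the `nrow` shortcut.

```julia
using DataFrames

# Small made-up dataframe standing in for `df`.
toy = DataFrame(ID = 1:6,
                Category = ["Low", "High", "Low", "Medium", "High", "Low"])

gd = groupby(toy, :Category)

# Default output name: column name, underscore, function name.
default_name = combine(gd, :ID => length)

# Custom output name: add a third element to the pair.
custom_name = combine(gd, :ID => length => :IDCountByCat)

# `nrow` counts the rows of each group without needing an input column.
with_nrow = combine(gd, nrow => :IDCountByCat)
```

The three results hold the same counts; only the name of the count column differs in the first one.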


To compute additional summaries, pass additional pairs as arguments to `by`. Here the total weight by category is computed with an anonymous function.

In [6]:
by(df, :Category, a = :ID => length, b = :Weight => x -> sum(x))

| Row | Category | a | b |
|-----|----------|---|---|
|  | Cat… | Int64 | Int64 |
| 1 | Low | 8 | 1089 |
| 2 | Medium | 7 | 1023 |
| 3 | High | 5 | 742 |


If you want each column to have a custom name, you'll need to pass the multiple functions as a single argument.


Here is a slightly more complicated example. We calculate the total weight and the average weight by _Category_. Here we have four function pairs. The average weight by category is calculated two different ways: one way is using the `mean` function and the other manually calculates the average using the `sum` and `length` functions.

In [7]:
by(df, :Category, IDCountByCat = :ID => length,
                  TotWeightByCat = :Weight => sum,
                  AvgWeightByCat1 = :Weight => mean,
                  AvgWeightByCat2 = [:ID, :Weight] => x -> sum(x.Weight)/length(x.ID))

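| Row | Category | IDCountByCat | TotWeightByCat | AvgWeightByCat1 | AvgWeightByCat2 |
|-----|----------|--------------|----------------|-----------------|-----------------|
|  | Cat… | Int64 | Int64 | Float64 | Float64 |
| 1 | Low | 8 | 1089 | 136.125 | 136.125 |
| 2 | Medium | 7 | 1023 | 146.143 | 146.143 |
| 3 | High | 5 | 742 | 148.4 | 148.4 |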
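
The same four summaries can be sketched with `combine` and `groupby`, assuming DataFrames.jl 1.x. The dataframe `toy` is made-up illustration data. One difference from the `by` call above is worth noting: when a pair names several input columns, `combine` passes each column to the function as a separate positional argument.

```julia
using DataFrames, Statistics

# Small made-up dataframe standing in for `df`.
toy = DataFrame(ID = 1:6,
                Category = ["Low", "High", "Low", "Medium", "High", "Low"],
                Weight = [130, 150, 140, 145, 160, 120])

stats = combine(groupby(toy, :Category),
                :ID => length => :IDCountByCat,
                :Weight => sum => :TotWeightByCat,
                :Weight => mean => :AvgWeightByCat1,
                # Two input columns arrive as two arguments, `id` and `w`.
                [:ID, :Weight] => ((id, w) -> sum(w) / length(id)) => :AvgWeightByCat2)
```

The anonymous function is wrapped in parentheses so that the trailing `=> :AvgWeightByCat2` is read as the output name rather than as part of the function body.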


The following is a different way to do the same thing as the cell above. Here the first element of the pair is a vector of column names and the second element is an anonymous function. The selected columns are passed to the function as a __named tuple__ <i>x</i>, and the function itself returns a named tuple whose field names become the output column names.
A named tuple is a tuple whose values can be referred to by name instead of by numeric index.

The named tuple _x_ has two elements with names ID and Weight. In this case __x.ID__ maps to the values in the _ID_ column and __x.Weight__ maps to the values in the _Weight_ column.

In [8]:
by(df, :Category, [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                                          TotWeightByCat = sum(x.Weight),
                                          AvgWeightByCat1 = mean(x.Weight),
                                          AvgWeightByCat2 = sum(x.Weight)/length(x.ID)))



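| Row | Category | IDCountByCat | TotWeightByCat | AvgWeightByCat1 | AvgWeightByCat2 |
|-----|----------|--------------|----------------|-----------------|-----------------|
|  | Cat… | Int64 | Int64 | Float64 | Float64 |
| 1 | Low | 8 | 1089 | 136.125 | 136.125 |
| 2 | Medium | 7 | 1023 | 146.143 | 146.143 |
| 3 | High | 5 | 742 | 148.4 | 148.4 |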
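
For readers on DataFrames.jl 1.x, a rough equivalent of the named-tuple form uses `AsTable` on both sides of the pair. This is a sketch under that version assumption, with a made-up dataframe `toy`; the first lines also show a named tuple on its own.

```julia
using DataFrames, Statistics

# Small made-up dataframe standing in for `df`.
toy = DataFrame(ID = 1:6,
                Category = ["Low", "High", "Low", "Medium", "High", "Low"],
                Weight = [130, 150, 140, 145, 160, 120])

# A named tuple: values are accessed by name rather than by position.
nt = (ID = [1, 3], Weight = [130, 140])
nt.Weight        # the vector [130, 140]

# `AsTable` on the input side hands the selected columns to the function as a
# named tuple; `AsTable` on the output side spreads the returned named tuple
# across one output column per field.
result = combine(groupby(toy, :Category),
                 AsTable([:ID, :Weight]) =>
                     (x -> (IDCountByCat = length(x.ID),
                            TotWeightByCat = sum(x.Weight),
                            AvgWeightByCat = mean(x.Weight))) =>
                     AsTable)
```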


You can specify multiple columns by which to do the calculations. Above we did them by just the <i>Category</i> variable but below we do them by <i>Category</i> and <i>IndVar</i> so we get six groupings.

In [9]:
by(df, 
   [:Category, :IndVar], 
   [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                           TotWeightByCat = sum(x.Weight),
                           AvgWeightByCat = mean(x.Weight)
                           ))



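| Row | Category | IndVar | IDCountByCat | TotWeightByCat | AvgWeightByCat |
|-----|----------|--------|--------------|----------------|----------------|
|  | Cat… | Cat… | Int64 | Int64 | Float64 |
| 1 | Medium | 0 | 5 | 740 | 148.0 |
| 2 | High | 1 | 3 | 437 | 145.667 |
| 3 | Medium | 1 | 2 | 283 | 141.5 |
| 4 | High | 0 | 2 | 305 | 152.5 |
| 5 | Low | 1 | 4 | 549 | 137.25 |
| 6 | Low | 0 | 4 | 540 | 135.0 |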


You can get the output of __by__ sorted by the partition columns by passing **true** to the `sort` keyword argument. The above output will now be sorted by _Category_ and _IndVar_.

In [10]:
by(df, 
   [:Category, :IndVar], 
   [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                           TotWeightByCat = sum(x.Weight),
                           AvgWeightByCat = mean(x.Weight)),
   sort = true)

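| Row | Category | IndVar | IDCountByCat | TotWeightByCat | AvgWeightByCat |
|-----|----------|--------|--------------|----------------|----------------|
|  | Cat… | Cat… | Int64 | Int64 | Float64 |
| 1 | Low | 0 | 4 | 540 | 135.0 |
| 2 | Low | 1 | 4 | 549 | 137.25 |
| 3 | Medium | 0 | 5 | 740 | 148.0 |
| 4 | Medium | 1 | 2 | 283 | 141.5 |
| 5 | High | 0 | 2 | 305 | 152.5 |
| 6 | High | 1 | 3 | 437 | 145.667 |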
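
With `groupby`, the `sort` keyword belongs to the grouping step. The following sketch assumes DataFrames.jl 1.x and a made-up dataframe `toy` with plain string and integer grouping columns, so the groups sort alphabetically and numerically rather than by categorical level order as in the output above.

```julia
using DataFrames, Statistics

# Small made-up dataframe with two grouping columns.
toy = DataFrame(Category = ["Low", "High", "Low", "High", "Low", "High"],
                IndVar   = [1, 0, 0, 1, 1, 0],
                Weight   = [130, 150, 140, 145, 160, 120])

# `sort = true` orders the groups by the grouping columns.
gd = groupby(toy, [:Category, :IndVar]; sort = true)

# One output row per (Category, IndVar) combination, in group order.
result = combine(gd, nrow => :Count, :Weight => mean => :AvgWeight)
```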


In [11]:
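# keep only the grouping columns and the numeric columns to summarize;
# df[:, cols] already returns a copy, so no extra copy() is needed
smalldf = df[:, [:Category, :Weight, :Age, :IndVar]];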

The __aggregate__ function takes arguments similar to those of the __by__ function. However, it applies every function in the list to every variable other than the column variables used for the partitioning.

In [12]:
agg = aggregate(smalldf, [:Category, :IndVar], [length, mean, median, minimum, maximum], sort=true);



In [13]:
show(agg, allcols=true)

6×12 DataFrame
│ Row │ Category │ IndVar │ Weight_length │ Age_length │ Weight_mean │
│     │ Cat…     │ Cat…   │ Int64         │ Int64      │ Float64     │
├─────┼──────────┼────────┼───────────────┼────────────┼─────────────┤
│ 1   │ Low      │ 0      │ 4             │ 4          │ 135.0       │
│ 2   │ Low      │ 1      │ 4             │ 4          │ 137.25      │
│ 3   │ Medium   │ 0      │ 5             │ 5          │ 148.0       │
│ 4   │ Medium   │ 1      │ 2             │ 2          │ 141.5       │
│ 5   │ High     │ 0      │ 2             │ 2          │ 152.5       │
│ 6   │ High     │ 1      │ 3             │ 3          │ 145.667     │

│ Row │ Age_mean │ Weight_median │ Age_median │ Weight_minimum │ Age_minimum │
│     │ Float64  │ Float64       │ Float64    │ Int64          │ Int64       │
├─────┼──────────┼───────────────┼────────────┼────────────────┼─────────────┤
│ 1   │ 
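
One way to approximate `aggregate` with `combine`, assuming DataFrames.jl 1.x, is to build one `column => function` pair for every function and column combination. The dataframe `toy` and the helper names `value_cols`, `funs`, and `ops` are illustrative only.

```julia
using DataFrames, Statistics

# Small made-up dataframe standing in for `smalldf`.
toy = DataFrame(Category = ["Low", "High", "Low", "High", "Low", "High"],
                Weight   = [130, 150, 140, 145, 160, 120],
                Age      = [30, 40, 50, 60, 70, 80])

value_cols = [:Weight, :Age]
funs = [length, mean, maximum]

# One `column => function` pair for every function/column combination,
# ordered function by function.
ops = [col => f for f in funs for col in value_cols]

agg_like = combine(groupby(toy, :Category; sort = true), ops...)
```

The output columns get the default names (`Weight_length`, `Age_length`, `Weight_mean`, and so on), in the same function-by-function order as the `aggregate` output.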

The last split-apply-combine function we'll discuss is `groupby`. This function breaks the dataframe out into separate sub-dataframes based on a column variable, which is useful if you want to work with the sub-dataframes individually.

Here we'll break the dataframe into three sub-dataframes based on the <i>Category</i> column variable. So we'll get three sub-dataframes: one for where <i>Category</i> equals "Low", another for where <i>Category</i> equals "Medium", and then a third for where <i>Category</i> equals "High."

In [14]:
grouped = groupby(df, :Category)

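First group (<i>Category</i> = "Low", 8 rows):

| Row | ID | Category | Weight | Age | IndVar | RandNum |
|-----|----|----------|--------|-----|--------|---------|
|  | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 7 | Low | 138 | 26 | 1 | -0.833369 |
| 2 | 8 | Low | 126 | 20 | 1 | -0.443247 |
| 3 | 9 | Low | 125 | 35 | 0 | -1.66323 |
| 4 | 12 | Low | 166 | 51 | 0 | -1.27635 |
| 5 | 13 | Low | 156 | 29 | 1 | 1.03132 |
| 6 | 16 | Low | 121 | 40 | 0 | -1.29475 |
| 7 | 17 | Low | 129 | 62 | 1 | -0.308944 |
| 8 | 19 | Low | 128 | 62 | 0 | 1.44886 |

Last group (<i>Category</i> = "High", 5 rows); the "Medium" group is not shown in this display:

| Row | ID | Category | Weight | Age | IndVar | RandNum |
|-----|----|----------|--------|-----|--------|---------|
|  | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 2 | High | 159 | 47 | 1 | -0.316387 |
| 2 | 5 | High | 121 | 49 | 1 | 1.38501 |
| 3 | 6 | High | 149 | 27 | 0 | 0.0799514 |
| 4 | 14 | High | 156 | 60 | 0 | -0.910805 |
| 5 | 20 | High | 157 | 72 | 1 | 0.778151 |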


You can see the result of `groupby` is a grouped dataframe.

In [15]:
typeof(grouped)

GroupedDataFrame{DataFrame}
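
A grouped dataframe behaves like a collection of sub-dataframes. The sketch below, assuming DataFrames.jl 1.x and a made-up dataframe `toy`, counts the groups and walks over their keys.

```julia
using DataFrames

# Small made-up dataframe standing in for `df`.
toy = DataFrame(ID = 1:6,
                Category = ["Low", "High", "Low", "Medium", "High", "Low"])

gd = groupby(toy, :Category)

n_groups = length(gd)   # number of groups

# Each group key exposes the grouping values by column name, and indexing the
# grouped dataframe with a key returns that group's sub-dataframe, which is a
# view (SubDataFrame) into the parent rather than a copy.
group_sizes = Dict(key.Category => nrow(gd[key]) for key in keys(gd))
```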

Under the hood, `groupby` uses indexing that makes it fast to retrieve the individual subframes. For example, to get the rows of data for _Category_ "Low" we can look at the subframe for "Low":

In [16]:
grouped[("Low",)]

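| Row | ID | Category | Weight | Age | IndVar | RandNum |
|-----|----|----------|--------|-----|--------|---------|
|  | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 7 | Low | 138 | 26 | 1 | -0.833369 |
| 2 | 8 | Low | 126 | 20 | 1 | -0.443247 |
| 3 | 9 | Low | 125 | 35 | 0 | -1.66323 |
| 4 | 12 | Low | 166 | 51 | 0 | -1.27635 |
| 5 | 13 | Low | 156 | 29 | 1 | 1.03132 |
| 6 | 16 | Low | 121 | 40 | 0 | -1.29475 |
| 7 | 17 | Low | 129 | 62 | 1 | -0.308944 |
| 8 | 19 | Low | 128 | 62 | 0 | 1.44886 |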
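
The key passed in the square brackets is a tuple with one value per grouping column, which is why a single grouping column needs the trailing comma in `("Low",)`. As a small sketch on a made-up dataframe `toy`, the key can also be a named tuple that spells out the grouping column.

```julia
using DataFrames

# Small made-up dataframe standing in for `df`.
toy = DataFrame(ID = 1:6,
                Category = ["Low", "High", "Low", "Medium", "High", "Low"])

gd = groupby(toy, :Category)

# Look a group up by the values of the grouping columns, given as a tuple ...
low_by_tuple = gd[("Low",)]

# ... or as a named tuple that names the grouping column explicitly.
low_by_name = gd[(Category = "Low",)]
```

Both lookups return the same sub-dataframe, here the rows with `ID` 1, 3, and 6.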


You can use the `combine` function with a grouped dataframe to calculate summary statistics for each subgroup:

In [17]:
combine(grouped, :Weight => mean, :Age => mean)

| Row | Category | Weight_mean | Age_mean |
|-----|----------|-------------|----------|
|  | Cat… | Float64 | Float64 |
| 1 | Low | 136.125 | 40.625 |
| 2 | Medium | 146.143 | 46.5714 |
| 3 | High | 148.4 | 51.0 |


In this lesson we covered:
* the `by`, `aggregate`, `groupby`, and `combine` functions.