# V. Split-apply-combine

The split-apply-combine strategy refers to the notion of partitioning data into groups, doing computations on these groups, and then combining results. The original dataframe will typically have categorical variables that allow the data to be separated into different groups based on the values of these categorical variables.
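Before turning to the dataframe functions, the three steps can be sketched with nothing but base Julia. This is only an illustration of the idea, not how DataFrames implements it; the `categories` and `weights` vectors and the `groups` and `stats` names are made up for this example.

```julia
using Statistics  # standard library; provides mean

# Hypothetical toy data: one categorical vector and one numeric vector
categories = ["Low", "High", "Low", "Medium", "High", "Low"]
weights    = [150, 130, 160, 145, 170, 140]

# Split: collect the weights that belong to each category
groups = Dict{String, Vector{Int}}()
for (c, w) in zip(categories, weights)
    push!(get!(groups, c, Int[]), w)
end

# Apply: compute summaries for each group
# Combine: gather the per-group results into a single collection
stats = Dict(c => (n = length(ws), total = sum(ws), avg = mean(ws))
             for (c, ws) in groups)
```

Each entry of `stats` holds the count, total, and average weight for one category, which is the same kind of per-group summary the dataframe functions below produce as rows.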

We'll cover the primary functions for this type of workflow for use with dataframes:

`by(df, cols, f)`

`aggregate(df, cols, f)`

`groupby(df, cols; skipmissing=false)`

In [None]:
using DataFrames, Distributions, Random, Statistics, CategoricalArrays

In [None]:
#create a dataframe

Random.seed!(1234)

N = 20
df = DataFrame(ID = 1:N,
                Category = wsample(["Low", "Medium", "High"], [1/3, 1/3, 1/3], N),
                Weight = rand(120:170, N),
                Age = rand(20:80, N),
                IndVar = wsample([0, 1], [0.5, 0.5], N),
                RandNum = randn(N))

categorical!(df, [:Category, :IndVar]);
levels!(df.Category, ["Low", "Medium", "High"]);

In [None]:
df

The __by__ function applies the function <i>f</i> to the input dataframe, grouped by the indicated columns. The third argument is a pair: the first element of the pair is the column variable(s) to operate on and the second element is the function to be applied to the column variable(s).

Let's count the rows in each <i>Category</i> by applying `length` to the <i>ID</i> variable. The third argument to **by** specifies that the `length` function is to be applied to the <i>ID</i> column, and the second argument indicates that the calculation is done separately for each <i>Category</i> value.

In [None]:
by(df, :Category, :ID => length)

The default column name is a combination of the name of the column the function was applied to, an underscore, and the name of the function itself. You can specify a name for the column if you don't want the default name.

In [None]:
by(df, :Category, IDCountByCat = :ID => length)

To calculate additional summaries, pass additional pairs as arguments to __by__. To calculate the total weight by category we use an anonymous function here (passing `sum` directly would also work).

In [None]:
by(df, :Category, :ID => length, :Weight => x -> sum(x))

If you want each column to have a custom name, you'll need to pass the multiple functions as a single argument.

The third argument to __by__ is again a pair. Here the first element of the pair is a tuple of column names and the second element is an anonymous function. The <i>x</i> that the function receives is a named tuple with two elements, one for each selected column. A named tuple is similar to a tuple, but its elements have assigned names. The function returns another named tuple, whose names become the names of the output columns.

In [None]:
by(df, :Category, (:ID, :Weight) => x -> (IDCountByCat = length(x.ID),
                                          TotWeightByCat = sum(x.Weight),
                                          AvgWeightByCat1 = mean(x.Weight),
                                          AvgWeightByCat2 = sum(x.Weight)/length(x.ID)))
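Since named tuples carry the whole example above, here is a minimal base-Julia sketch of how they behave. The values are invented for illustration, and the `x` built here only mimics the shape of what the anonymous function receives; it is not produced by __by__.

```julia
# A plain tuple: elements are accessed by position only
t = (3, 450)

# A named tuple: the same values, but each element has a name
nt = (IDCountByCat = 3, TotWeightByCat = 450)

nt.TotWeightByCat   # access by name
nt[2]               # positional access still works
keys(nt)            # the element names, as symbols

# A named tuple of column vectors (hypothetical values),
# similar in shape to the `x` passed to the anonymous function
x = (ID = [1, 2, 3], Weight = [150, 160, 140])

# Build a result named tuple from it, as the anonymous function does
result = (IDCountByCat = length(x.ID), TotWeightByCat = sum(x.Weight))
```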

You can specify multiple columns by which to do the calculations. Above we did them by just the <i>Category</i> variable, but below we do them by <i>Category</i> and <i>IndVar</i>, so we'll get up to six groupings (three categories times two indicator values).

In [None]:
by(df, 
   [:Category, :IndVar], 
   [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                           TotWeightByCat = sum(x.Weight),
                           AvgWeightByCat = mean(x.Weight)
                           ))

You can get the output of __by__ sorted by the partition columns by passing **true** to the `sort` keyword argument.

In [None]:
by(df, 
   [:Category, :IndVar], 
   [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                           TotWeightByCat = sum(x.Weight),
                           AvgWeightByCat = mean(x.Weight)),
   sort = true)

In [None]:
smalldf = copy(df[:, [2, 3, 4, 5]]);

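The __aggregate__ function takes arguments similar to those of the __by__ function. However, it applies every function in the list to all variables other than the column variables used for the partitioning. That is why `smalldf` keeps only the partitioning columns plus <i>Weight</i> and <i>Age</i>.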

In [None]:
agg = aggregate(smalldf, [:Category, :IndVar], [length, mean, median, minimum, maximum], sort=true);

In [None]:
show(agg, allcols=true)

The last split-apply-combine function we'll discuss is __groupby__. This function is useful if you want to break the dataframe into separate sub-dataframes based on a column variable, for example to work with the sub-dataframes individually.

Here we'll break the dataframe into three sub-dataframes based on the <i>Category</i> column variable. So we'll get three sub-dataframes: one for where <i>Category</i> equals "Low", another for where <i>Category</i> equals "Medium", and then a third for where <i>Category</i> equals "High."

In [None]:
grouped = groupby(df, :Category)

You can see the result of __groupby__ is a grouped dataframe.

In [None]:
typeof(grouped)

You can use indexing to retrieve the individual subframes. For example, to get the subframe for the "Low" group:

In [None]:
grouped[1]
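A grouped dataframe can also be iterated over, and in DataFrames.jl 1.x, where `by` and `aggregate` are no longer available, per-group summaries are written as `combine` applied to the result of `groupby`. The sketch below assumes DataFrames 1.x is installed; the `toy` dataframe and the output column names are made up for this example.

```julia
using DataFrames, Statistics

# Hypothetical small dataframe standing in for `df`
toy = DataFrame(Category = ["Low", "High", "Low", "Medium", "High"],
                Weight   = [150, 130, 160, 145, 170])

gdf = groupby(toy, :Category)

# Work with each sub-dataframe individually
for sub in gdf
    println(sub.Category[1], ": ", nrow(sub), " rows")
end

# Summarize every group in one call;
# `source => function => destination` sets the output column name
res = combine(gdf,
              nrow => :Count,
              :Weight => sum => :TotWeight,
              :Weight => mean => :AvgWeight)
```

Here `res` has one row per category, playing the same role as the output of the earlier `by` calls.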

In this lesson we covered:
* by, groupby, and aggregate functions.