# V. Split-apply-combine

The split-apply-combine strategy refers to the notion of partitioning data into groups, doing computations on these groups, and then combining results. The original dataframe will typically have categorical variables that allow the data to be separated into different groups based on the values of these categorical variables.
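To make the three steps concrete before turning to dataframes, here is a minimal sketch that performs them by hand with a `Dict`. The records and the names `records`, `groups`, and `mean_weight` are made up for illustration and are not part of the lesson's dataset.

```julia
using Statistics

# Toy records: (category, weight) pairs. The values are made up for illustration.
records = [("Low", 150), ("High", 130), ("Low", 170), ("Medium", 140), ("High", 160)]

# Split: collect the weights that belong to each category.
groups = Dict{String, Vector{Int}}()
for (category, weight) in records
    push!(get!(groups, category, Int[]), weight)
end

# Apply: compute a summary statistic for each group.
# Combine: gather the per-group results into a single collection.
mean_weight = Dict(category => mean(weights) for (category, weights) in groups)
```

The dataframe functions in this lesson do the same thing, but they handle the bookkeeping and return the combined result as a dataframe.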

We'll cover the primary functions for this type of workflow for use with dataframes:

`by(df, cols, f)`

`aggregate(df, cols, f)`

`groupby(df, cols, skipmissing=false)`

In [None]:
using DataFrames, Distributions, Random, Statistics, CategoricalArrays

In [None]:
# Change jupyter settings to show wide columns and fewer rows

ENV["COLUMNS"] = 200;
ENV["LINES"] = 25;

In [None]:
#create a dataframe

Random.seed!(1234)

N = 20
df = DataFrame(ID = 1:N,
                Category = wsample(["Low", "Medium", "High"], [1/3, 1/3, 1/3], N),
                Weight = rand(120:170, N),
                Age = rand(20:80, N),
                IndVar = wsample([0, 1], [0.5, 0.5], N),
                RandNum = randn(N))

categorical!(df, [:Category, :IndVar]);
levels!(df.Category, ["Low", "Medium", "High"]);

In [None]:
df

The `by` function applies a computation to the input dataframe separately for each group defined by the indicated columns. The computation _f_ is a pair: the first element names the column(s) to operate on and the second element is the function to apply to them.

Let's calculate the number of samples for each value of the <i>Category</i> variable. The third argument tells `by` to apply the `length` function to the <i>ID</i> column, and the second argument tells it to do this separately for each <i>Category</i> value.

In [None]:
by(df, :Category, :ID => length)

The default column name combines the name of the column the function was applied to, an underscore, and the name of the function itself (here, `ID_length`). You can specify a name for the column if you don't want the default name.

In [None]:
by(df, :Category, IDCountByCat = :ID => length)
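Newer DataFrames.jl releases deprecate `by` in favor of `combine(groupby(...))`. The sketch below shows how the default and custom column names look in that style, assuming a recent DataFrames.jl release. The `toy` dataframe is made up for illustration.

```julia
using DataFrames

# Toy data, made up for illustration.
toy = DataFrame(ID = 1:6, Category = ["Low", "High", "Low", "Medium", "High", "Low"])
gdf = groupby(toy, :Category)

# Default name: source column, underscore, function name (ID_length).
default_name = combine(gdf, :ID => length)

# Custom name: add a third element to the pair chain.
custom_name = combine(gdf, :ID => length => :IDCountByCat)
```

In this style the output name is written as the last element of `source => function => name` rather than as a keyword argument.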

To compute several statistics at once, pass additional pairs as arguments to `by`. Here we calculate the total weight by category with an anonymous function.

In [None]:
by(df, :Category, a = :ID => length, b = :Weight => x -> sum(x))

# by(df, :Category, a = :ID => length, b = :Weight => sum) also works

Here is a slightly more complicated example. We calculate the total weight and the average weight by _Category_. Here we have four function pairs. The average weight by category is calculated two different ways: one way is using the `mean` function and the other manually calculates the average using the `sum` and `length` functions.

In [None]:
by(df, :Category, IDCountByCat = :ID => length,
                  TotWeightByCat = :Weight => sum,
                  AvgWeightByCat1 = :Weight => mean,
                  AvgWeightByCat2 = [:ID, :Weight] => x -> sum(x.Weight)/length(x.ID))

The following is a different way to get the same result as the cell above. Here the first element of the pair selects the columns used in the computation, and the function returns a named tuple with one field per output column. Inside the function, __x.ID__ holds the values of the _ID_ column and __x.Weight__ holds the values of the _Weight_ column for the current _Category_ group.

In [None]:
by(df, :Category, [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                                          TotWeightByCat = sum(x.Weight),
                                          AvgWeightByCat1 = mean(x.Weight),
                                          AvgWeightByCat2 = sum(x.Weight)/length(x.ID)) )
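The same "one function, several output columns" pattern can be sketched with `combine` and a `do` block, assuming a recent DataFrames.jl release. The function receives each group as a sub-dataframe (called `sdf` here) and returns a named tuple. The `toy` data and the name `stats` are made up for illustration.

```julia
using DataFrames, Statistics

# Toy data, made up for illustration.
toy = DataFrame(ID = 1:6,
                Category = ["Low", "High", "Low", "Medium", "High", "Low"],
                Weight = [150, 130, 170, 140, 160, 160])

# Each field of the returned named tuple becomes one output column.
stats = combine(groupby(toy, :Category)) do sdf
    (IDCountByCat = nrow(sdf),
     TotWeightByCat = sum(sdf.Weight),
     AvgWeightByCat = mean(sdf.Weight))
end
```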

You can specify multiple columns by which to do the calculations. Above we grouped by just the <i>Category</i> variable; below we group by <i>Category</i> and <i>IndVar</i>, which gives up to six groups (three categories times two indicator values).

In [None]:
by(df, 
   [:Category, :IndVar], 
   [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                           TotWeightByCat = sum(x.Weight),
                           AvgWeightByCat = mean(x.Weight)
                           ))

You can get the output of __by__ sorted by the partition columns by passing **true** to the `sort` keyword argument. The above output will now be sorted by _Category_ and _IndVar_.

In [None]:
by(df, 
   [:Category, :IndVar], 
   [:ID, :Weight] => x -> (IDCountByCat = length(x.ID),
                           TotWeightByCat = sum(x.Weight),
                           AvgWeightByCat = mean(x.Weight)),
   sort = true)
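In the `combine(groupby(...))` style, the `sort` keyword belongs to `groupby`. A minimal sketch, assuming a recent DataFrames.jl release and a made-up `toy` dataframe:

```julia
using DataFrames, Statistics

# Toy data, made up for illustration; rows are deliberately out of order.
toy = DataFrame(Category = ["b", "a", "b", "a"],
                IndVar = [1, 1, 0, 0],
                Weight = [150, 130, 170, 140])

# sort = true orders the groups by the grouping columns, left to right.
gdf = groupby(toy, [:Category, :IndVar], sort = true)
sorted_means = combine(gdf, :Weight => mean => :AvgWeight)
```

Plain strings sort alphabetically here. A categorical column such as _Category_ in the lesson's dataframe should instead follow its level order ("Low", "Medium", "High").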

In [None]:
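# Keep the grouping columns and the numeric columns to aggregate (drops ID and RandNum).
# Indexing with df[:, cols] already returns a copy, so no extra copy() is needed.
smalldf = df[:, [:Category, :Weight, :Age, :IndVar]];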

The __aggregate__ function takes arguments similar to those of the __by__ function. However, it applies every function in the list to every column other than the columns used for the partitioning.

In [None]:
agg = aggregate(smalldf, [:Category, :IndVar], [length, mean, median, minimum, maximum], sort=true)
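`aggregate` is also deprecated in newer DataFrames.jl releases. One way to approximate its "every function on every column" behavior is to build the column-function pairs explicitly and splat them into `combine`. This is a sketch under that assumption, not the library's own replacement. The `toy` data and the names `value_cols`, `funs`, and `agg_like` are made up for illustration.

```julia
using DataFrames, Statistics

# Toy data, made up for illustration.
toy = DataFrame(Category = ["Low", "High", "Low", "High"],
                Weight = [150, 130, 170, 160],
                Age = [30, 40, 50, 60])

value_cols = [:Weight, :Age]   # columns to summarize
funs = [mean, maximum]         # functions to apply to each of them

# One pair per (column, function) combination, e.g. :Weight => mean.
pairs_to_apply = [col => f for col in value_cols for f in funs]

agg_like = combine(groupby(toy, :Category, sort = true), pairs_to_apply...)
```

The output columns follow the default naming rule, so this produces `Weight_mean`, `Weight_maximum`, `Age_mean`, and `Age_maximum`.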

The last split-apply-combine function we'll discuss is `groupby`. This function splits the dataframe into separate sub-dataframes based on a column variable, which is useful if you want to work with the sub-dataframes individually.

Here we'll break the dataframe into three sub-dataframes based on the <i>Category</i> column variable. So we'll get three sub-dataframes: one for where <i>Category</i> equals "Low", another for where <i>Category</i> equals "Medium", and then a third for where <i>Category</i> equals "High."

In [None]:
grouped = groupby(df, :Category)

The result of `groupby` is a `GroupedDataFrame`, as `typeof` confirms.

In [None]:
typeof(grouped)

Under the hood `groupby` uses indexing that makes it fast to retrieve the individual subframes. For example, to get the rows of data for _Category_ "Low" we can look at the subframe for "Low":

In [None]:
grouped[("Low",)]
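Besides looking up one group by its key, you can loop over all groups. The sketch below uses `pairs` to visit each group key together with its sub-dataframe, assuming a recent DataFrames.jl release. The `toy` data and the name `rows_per_group` are made up for illustration.

```julia
using DataFrames

# Toy data, made up for illustration.
toy = DataFrame(Category = ["Low", "High", "Low", "Medium"],
                Weight = [150, 130, 170, 140])
gdf = groupby(toy, :Category)

n_groups = length(gdf)   # number of sub-dataframes
low = gdf[("Low",)]      # look up one group by its key tuple

# pairs(gdf) yields each group key together with its sub-dataframe.
rows_per_group = Dict(key.Category => nrow(sdf) for (key, sdf) in pairs(gdf))
```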

You can use the `combine` function with a grouped dataframe to calculate summary statistics for each subgroup:

In [None]:
combine(grouped, :Weight => mean, :Age => mean)

In this lesson we covered:
* the `by`, `aggregate`, `groupby`, and `combine` functions.