# VI. Using Pipe.jl and Query.jl

The Pipe.jl package lets you chain operations together with a convenient syntax, and it supports more advanced piping than Julia's default piping syntax (`|>`). The typical usage is:

`@pipe in |> f(x, _)`

`@pipe` is a macro provided by the Pipe.jl package; it is not built into Julia. Here you take the result of `in` and pass it as input to `f`. The underscore on the right-hand side marks the position in the call to `f` where the result of `in` is substituted.
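
To make the substitution rule concrete, here is a minimal sketch on plain vectors. The variable names (`total`, `evens`, `scaled`) are made up for illustration and are not part of the lesson's data.

```julia
using Pipe

# Base piping passes the left-hand value as the only argument of the function.
total = [1, 2, 3] |> sum

# With @pipe, the underscore marks where the left-hand value is placed,
# so it can be the second argument of filter.
evens = @pipe [1, 2, 3, 4] |> filter(iseven, _)

# Stages can be chained; each underscore receives the previous stage's result.
scaled = @pipe [1, 2, 3, 4] |> filter(iseven, _) |> map(x -> 10x, _)
```

The second line is equivalent to writing `filter(iseven, [1, 2, 3, 4])` directly; the pipe form mainly helps readability once several stages are chained.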

In [None]:
using DataFrames, Distributions, Random, Statistics, CategoricalArrays, Query, Pipe

In [None]:
#create a dataframe

Random.seed!(1234)

N = 20
dfa = DataFrame(ID = 1:N,
                Category = wsample(["Low", "Medium", "High"], [1/3, 1/3, 1/3], N),
                Weight = rand(120:170, N),
                Age = rand(20:80, N),
                IndVar = wsample([0, 1], [0.5, 0.5], N),
                RandNum = randn(N))

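# categorical! was removed in DataFrames.jl 1.0; this converts the same columns in place
transform!(dfa, [:Category, :IndVar] .=> categorical, renamecols = false);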
levels!(dfa.Category, ["Low", "Medium", "High"]);

In [None]:
dfa

Let's first look at a simple example using `filter`. The first argument to `filter` is a pair and the second argument is the dataframe. Here the underscore in `filter` is replaced by the result of the left-hand side. The left-hand side is just the dataframe __dfa__, so that is what gets passed into the `filter` call on the right in place of the underscore. The result contains the rows where the sum of _Weight_ and _Age_ is greater than 180.

In [None]:
filtered_gt180 = @pipe dfa |> filter([:Weight, :Age] => (x, y) -> x + y > 180, _)

We can also use Pipe.jl to do grouping:

In [None]:
grouped = @pipe dfa |> groupby(_, :Category)

As we did before you can calculate the mean _Weight_ and _Age_ within each subgroup. Here the output of the `groupby` is passed as input to `combine`. Specifically the output of `groupby`, which is a grouped dataframe, is substituted into the underscore in the call to `combine`.

In [None]:
grouped_mean = @pipe dfa |> groupby(_, :Category) |> combine(_, :Weight => mean, :Age => mean)
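
As a quick check of what the chained pipe does, the sketch below compares it with the equivalent nested call on a small made-up dataframe (`toy` and its values are assumptions for illustration only).

```julia
using DataFrames, Statistics, Pipe

# Small illustrative dataframe, not the lesson's dfa
toy = DataFrame(Category = ["Low", "High", "Low", "High"],
                Weight = [120, 150, 130, 170])

# Piped form: each underscore receives the result of the previous stage
piped = @pipe toy |> groupby(_, :Category) |> combine(_, :Weight => mean)

# Nested form: the same calls written inside out
nested = combine(groupby(toy, :Category), :Weight => mean)
```

Both forms should produce the same dataframe, with the aggregated column named `Weight_mean`.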

The Query.jl package in Julia can be used to query data sources using query expressions. Typical operations include filtering, projecting, joining, sorting, and grouping. Supported data sources include CSV files, arrays, dictionaries, databases (SQLite), dataframes, etc. (basically any iterable data source). The basic syntax has this structure:

`myq = @from <range_var> in <source> begin
    <query_statements>
end`

The range variable iterates over the data source, and the query statements are the query commands that get executed. `@from` is a macro provided by the Query.jl package.
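
Because any iterable can serve as the source, the structure can be tried on a plain array first. This is a minimal sketch; `src` and `squares` are illustrative names.

```julia
using Query

src = [1, 2, 3, 4, 5, 6]

# x is the range variable; it takes each element of src in turn
squares = @from x in src begin
    @where x % 2 == 0   # keep the even elements
    @select x^2         # project each kept element to its square
    @collect            # materialize the result as an array
end
```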

In [None]:
dfa

For a first simple example, let's create a new dataframe based on **dfa** where we filter on `Weight > 130` and only keep the columns <i>ID</i>, <i>Weight</i>, <i>Age</i>, and sort the resulting dataframe in descending order by <i>Age</i>.

In [None]:
ex = @from i in dfa begin
     @where i.Weight > 130
     @orderby descending(i.Age)
     @select {PatientID = i.ID, PatientWeight = i.Weight, PatientAge = i.Age}
     @collect DataFrame
end

In the above code:

* `@where` is doing the filtering operation based on the <i>Weight</i> variable.
* `@orderby` does the sorting in descending order using the <i>Age</i> variable.
* `@select` selects the columns to keep and optionally renames them in the resulting object.
* `@collect` indicates that the resulting object **ex** should be returned as a dataframe. If `@collect` is given no argument, the result is an array; if the `@collect` statement is omitted entirely, the query returns a lazy iterable.

Note the use of the range variable __i__ to reference the columns in the dataframe.
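
The three ways of ending a query can be compared side by side. This is a sketch on a small made-up dataframe (`people` is an assumption for illustration).

```julia
using Query, DataFrames

people = DataFrame(ID = 1:3, Age = [30, 45, 60])

# No @collect: the query is a lazy iterable of named tuples
lazy = @from i in people begin
    @where i.Age > 40
    @select {i.ID, i.Age}
end
rows = collect(lazy)

# @collect with no argument: materialized as an array
as_array = @from i in people begin
    @where i.Age > 40
    @select {i.ID, i.Age}
    @collect
end

# @collect DataFrame: materialized as a dataframe
as_df = @from i in people begin
    @where i.Age > 40
    @select {i.ID, i.Age}
    @collect DataFrame
end
```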

Next let's do an example where we group the data into subgroups based on the values of <i>Category</i> ("Low", "Medium", "High") and <i>IndVar</i> (0, 1). 

In [None]:
ex = @from i in dfa begin
     @where i.RandNum > 0 && i.Age > 25
     @group i by i.Category, i.IndVar into c
     @orderby key(c)
     @select {Grouping = key(c), AvgAge=mean(c.Age), MaxWeight = maximum(c.Weight), Count = length(c.Age)}
     @collect DataFrame
end

The `@group` statement groups the data into a new range variable (the new range variable is called <i>c</i> in our case) based on the levels of the column variables; this new range variable is then used to aggregate the data. 

The `key` function gives the values used to group the data. The other functions used in `@select` are calculated based on the grouped data via the new range variable <i>c</i>.

To restrict the output to groups that satisfy a condition on an aggregate value, add another `@where` statement after the `@group` statement.

In [None]:
ex = @from i in dfa begin
     @where i.RandNum > 0 && i.Age > 25
     @group i by i.Category, i.IndVar into c
     @where maximum(c.Weight) > 150
     @orderby key(c)
     @select {Group = key(c), AvgAge=mean(c.Age), MaxWeight = maximum(c.Weight), Count = length(c.Age)}
     @collect DataFrame
end
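
The difference between the two `@where` statements is easier to see with a single grouping key. In the sketch below (illustrative data, not **dfa**), the `@where` after `@group` filters whole groups by their size rather than filtering individual rows.

```julia
using Query, DataFrames

toy = DataFrame(Category = ["Low", "High", "Low", "Medium", "High", "Low"],
                Weight = [120, 150, 130, 140, 170, 125])

counts = @from i in toy begin
    @group i by i.Category into c
    @where length(c) > 1          # keep only groups with more than one row
    @orderby key(c)
    @select {Category = key(c), Count = length(c)}
    @collect DataFrame
end
```

Here the "Medium" group has a single row, so it is dropped after grouping.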

You can use the Query.jl `@let` macro to introduce new range variables into a query. Here we introduce a new range variable <i>ExpRandNum</i> as a function of another column variable (__RandNum__).

In [None]:
ex = @from i in dfa begin
     @let ExpRandNum = exp(i.RandNum)
     @orderby i.Category
     @where ExpRandNum > 0.9
     @select {i.ID, i.Category, i.IndVar, i.RandNum, ExpRandNum}
     @collect DataFrame
end
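
A smaller sketch of the same pattern, using made-up values, shows that a variable introduced by `@let` can be used both for filtering and in the output.

```julia
using Query, DataFrames

nums = DataFrame(x = [-1.0, 0.0, 1.0])

res = @from i in nums begin
    @let ex = exp(i.x)        # new range variable derived from column x
    @where ex >= 1            # filter on the derived value
    @select {i.x, ExpX = ex}  # include it in the output under a new name
    @collect DataFrame
end
```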

The last thing we'll cover is doing join operations in Query.jl. We'll create another dataframe to join with __dfa__.

In [None]:
N = 15
dfb = DataFrame(IDNum = 1:N, 
                Color = ["blue", "orange", "orange", "black", "black", "red", "white", "purple", "yellow",
                         "green", "brown", "grey", "blue", "red", "white"]);

We'll do a simple inner join on <i>ID</i> from __dfa__ with <i>IDNum</i> from  __dfb__.

In [None]:
ex = @from i in dfa begin
     @join j in dfb on i.ID equals j.IDNum
     @select {SubjID = i.ID, i.Category, i.IndVar, j.Color}
     @collect DataFrame
end
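
Since __dfb__ only has IDs 1 through 15, the inner join above keeps only the 15 rows of __dfa__ that have a match. The sketch below shows the same behavior on two small made-up dataframes (`left` and `right` are illustrative names).

```julia
using Query, DataFrames

left = DataFrame(ID = 1:4, Score = [10, 20, 30, 40])
right = DataFrame(IDNum = [2, 4, 5], Color = ["red", "blue", "green"])

# Only IDs present in both sources (2 and 4) appear in the result
joined = @from i in left begin
    @join j in right on i.ID equals j.IDNum
    @select {i.ID, i.Score, j.Color}
    @collect DataFrame
end
```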

In this lesson we covered:
* Using Pipe.jl to chain dataframe operations such as `filter`, `groupby`, and `combine`.
* Using Query.jl to execute query expressions on dataframes.