# VI. Using Pipe.jl and Query.jl

The Pipe.jl package lets you chain operations together with a convenient syntax, and it supports more advanced piping than Julia's default `|>` syntax. The typical form is:

`@pipe in |> f(x, _)`

`@pipe` is a Julia macro. Here you take the result of `in` and pass it as input to `f`. The underscore on the right-hand side marks the position in the call to `f` where the result of `in` is substituted.
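To make the difference from Julia's built-in piping concrete, here is a minimal sketch using only Base functions and a toy vector (the vector and the names `r1` and `r2` are illustrative and not part of the lesson data). Plain `|>` passes the value as the single argument of a function, while `@pipe` lets it land in any argument position.

```julia
using Pipe

# Built-in piping: the left-hand value is the only argument of `sort`.
r1 = [3, 1, 2] |> sort

# With @pipe, the underscore marks where the left-hand value is placed,
# so extra arguments and different argument positions are possible.
r2 = @pipe [3, 1, 2] |> sort(_, rev = true) |> map(x -> x^2, _)
```

In this sketch `r1` is the sorted vector, and `r2` is the vector sorted in descending order with each element squared.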

In [1]:
using DataFrames, Distributions, Random, Statistics, CategoricalArrays, Query, Pipe

In [2]:
#create a dataframe

Random.seed!(1234)

N = 20
dfa = DataFrame(ID = 1:N,
                Category = wsample(["Low", "Medium", "High"], [1/3, 1/3, 1/3], N),
                Weight = rand(120:170, N),
                Age = rand(20:80, N),
                IndVar = wsample([0, 1], [0.5, 0.5], N),
                RandNum = randn(N))

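# Convert both columns to categorical in place. Older DataFrames.jl releases
# used categorical!(dfa, [:Category, :IndVar]) for this.
transform!(dfa, [:Category, :IndVar] .=> categorical, renamecols = false);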
levels!(dfa.Category, ["Low", "Medium", "High"]);

In [3]:
dfa

| Row | ID | Category | Weight | Age | IndVar | RandNum |
|---|---|---|---|---|---|---|
| | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 1 | Medium | 152 | 33 | 0 | -0.981132 |
| 2 | 2 | High | 159 | 47 | 1 | -0.316387 |
| 3 | 3 | Medium | 140 | 73 | 1 | 0.265743 |
| 4 | 4 | Medium | 145 | 47 | 0 | 1.06561 |
| 5 | 5 | High | 121 | 49 | 1 | 1.38501 |
| 6 | 6 | High | 149 | 27 | 0 | 0.0799514 |
| 7 | 7 | Low | 138 | 26 | 1 | -0.833369 |
| 8 | 8 | Low | 126 | 20 | 1 | -0.443247 |
| 9 | 9 | Low | 125 | 35 | 0 | -1.66323 |
| 10 | 10 | Medium | 143 | 54 | 1 | -0.521229 |


Let's first look at a simple example using `filter`. The first argument to `filter` is a pair (the columns and a predicate on them) and the second argument is the dataframe. The left-hand side of the pipe is just the dataframe __dfa__, so that is what replaces the underscore in the `filter` call on the right. The result contains the rows where the sum of _Weight_ and _Age_ is greater than 180.

In [4]:
filtered_gt180 = @pipe dfa |> filter([:Weight, :Age] => (x, y) -> x + y > 180, _)

| Row | ID | Category | Weight | Age | IndVar | RandNum |
|---|---|---|---|---|---|---|
| | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 1 | Medium | 152 | 33 | 0 | -0.981132 |
| 2 | 2 | High | 159 | 47 | 1 | -0.316387 |
| 3 | 3 | Medium | 140 | 73 | 1 | 0.265743 |
| 4 | 4 | Medium | 145 | 47 | 0 | 1.06561 |
| 5 | 10 | Medium | 143 | 54 | 1 | -0.521229 |
| 6 | 12 | Low | 166 | 51 | 0 | -1.27635 |
| 7 | 13 | Low | 156 | 29 | 1 | 1.03132 |
| 8 | 14 | High | 156 | 60 | 0 | -0.910805 |
| 9 | 15 | Medium | 169 | 27 | 0 | 0.754603 |
| 10 | 17 | Low | 129 | 62 | 1 | -0.308944 |
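As a rough check on what the underscore substitution does, the sketch below compares a piped `filter` with the equivalent direct call. It uses a small made-up dataframe (`df`, with hypothetical values) rather than __dfa__, so the result can be verified by hand.

```julia
using DataFrames, Pipe

# Toy data (not the lesson's dataframe): the row sums are 190, 150, and 185.
df = DataFrame(Weight = [150, 120, 165], Age = [40, 30, 20])

# Piped form: `df` is substituted for the underscore.
piped = @pipe df |> filter([:Weight, :Age] => (w, a) -> w + a > 180, _)

# Direct form: the same call written without the macro.
direct = filter([:Weight, :Age] => (w, a) -> w + a > 180, df)
```

Both forms keep the first and third rows, which suggests the macro is simply a more readable way of writing the nested call.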


We can also use Pipe.jl to do grouping:

In [5]:
grouped = @pipe dfa |> groupby(_, :Category)

| Row | ID | Category | Weight | Age | IndVar | RandNum |
|---|---|---|---|---|---|---|
| | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 7 | Low | 138 | 26 | 1 | -0.833369 |
| 2 | 8 | Low | 126 | 20 | 1 | -0.443247 |
| 3 | 9 | Low | 125 | 35 | 0 | -1.66323 |
| 4 | 12 | Low | 166 | 51 | 0 | -1.27635 |
| 5 | 13 | Low | 156 | 29 | 1 | 1.03132 |
| 6 | 16 | Low | 121 | 40 | 0 | -1.29475 |
| 7 | 17 | Low | 129 | 62 | 1 | -0.308944 |
| 8 | 19 | Low | 128 | 62 | 0 | 1.44886 |

| Row | ID | Category | Weight | Age | IndVar | RandNum |
|---|---|---|---|---|---|---|
| | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 2 | High | 159 | 47 | 1 | -0.316387 |
| 2 | 5 | High | 121 | 49 | 1 | 1.38501 |
| 3 | 6 | High | 149 | 27 | 0 | 0.0799514 |
| 4 | 14 | High | 156 | 60 | 0 | -0.910805 |
| 5 | 20 | High | 157 | 72 | 1 | 0.778151 |


As we did before, you can calculate the mean _Weight_ and _Age_ within each subgroup. Here the output of `groupby`, which is a grouped dataframe, is substituted for the underscore in the call to `combine`.

In [6]:
grouped_mean = @pipe dfa |> groupby(_, :Category) |> combine(_, :Weight => mean, :Age => mean)

| Row | Category | Weight_mean | Age_mean |
|---|---|---|---|
| | Cat… | Float64 | Float64 |
| 1 | Low | 136.125 | 40.625 |
| 2 | Medium | 146.143 | 46.5714 |
| 3 | High | 148.4 | 51.0 |
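The chained `groupby` and `combine` steps can also be written as one nested call. The following sketch, on a toy dataframe with made-up values, shows both forms side by side. The output column name `AvgWeight` is an illustrative choice, set with the `source => function => target` form that `combine` accepts.

```julia
using DataFrames, Statistics, Pipe

# Toy data (hypothetical values, not the lesson's dataframe).
df = DataFrame(Category = ["Low", "High", "Low", "High"],
               Weight = [120, 160, 130, 150])

# Nested form: read from the inside out.
nested = combine(groupby(df, :Category), :Weight => mean => :AvgWeight)

# Piped form: read from left to right.
piped = @pipe df |> groupby(_, :Category) |> combine(_, :Weight => mean => :AvgWeight)

# Look up the group means by category name.
avg_by_cat = Dict(piped.Category .=> piped.AvgWeight)
```

Here `avg_by_cat["Low"]` is 125.0 and `avg_by_cat["High"]` is 155.0, and the two forms return the same table.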


The Query.jl package in Julia can be used to query data sources using query expressions. Typical operations include filtering, projecting, joining, sorting, and grouping. Supported data sources include CSV files, arrays, dictionaries, databases (SQLite), dataframes, etc. (basically any iterable data source). The basic syntax has this structure:

`myq = @from <range_var> in <source> begin
    <query_statements>
end`

The range variable iterates over the data source, and the query statements are the query commands that get executed. `@from` is a Julia macro provided by the Query.jl package.
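Since any iterable can serve as a source, the structure is perhaps easiest to see on a plain array. This is a minimal sketch with a made-up vector, where the range variable `i` takes each element in turn.

```julia
using Query

# Toy source: a plain vector instead of a dataframe.
squares = @from i in [1, 2, 3, 4, 5] begin
    @where i > 2      # keep elements greater than 2
    @select i^2       # project each kept element to its square
    @collect          # materialize the result as an array
end
```

With this input the query keeps 3, 4, and 5 and returns their squares.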

In [7]:
dfa

| Row | ID | Category | Weight | Age | IndVar | RandNum |
|---|---|---|---|---|---|---|
| | Int64 | Cat… | Int64 | Int64 | Cat… | Float64 |
| 1 | 1 | Medium | 152 | 33 | 0 | -0.981132 |
| 2 | 2 | High | 159 | 47 | 1 | -0.316387 |
| 3 | 3 | Medium | 140 | 73 | 1 | 0.265743 |
| 4 | 4 | Medium | 145 | 47 | 0 | 1.06561 |
| 5 | 5 | High | 121 | 49 | 1 | 1.38501 |
| 6 | 6 | High | 149 | 27 | 0 | 0.0799514 |
| 7 | 7 | Low | 138 | 26 | 1 | -0.833369 |
| 8 | 8 | Low | 126 | 20 | 1 | -0.443247 |
| 9 | 9 | Low | 125 | 35 | 0 | -1.66323 |
| 10 | 10 | Medium | 143 | 54 | 1 | -0.521229 |


For a first simple example, let's create a new dataframe based on **dfa** where we filter on `Weight > 130`, keep only the columns <i>ID</i>, <i>Weight</i>, and <i>Age</i>, and sort the result in descending order by <i>Age</i>.

In [8]:
ex = @from i in dfa begin
     @where i.Weight > 130
     @orderby descending(i.Age)
     @select {PatientID = i.ID, PatientWeight = i.Weight, PatientAge = i.Age}
     @collect DataFrame
end

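| Row | PatientID | PatientWeight | PatientAge |
|---|---|---|---|
| | Int64 | Int64 | Int64 |
| 1 | 3 | 140 | 73 |
| 2 | 20 | 157 | 72 |
| 3 | 18 | 137 | 61 |
| 4 | 14 | 156 | 60 |
| 5 | 10 | 143 | 54 |
| 6 | 12 | 166 | 51 |
| 7 | 2 | 159 | 47 |
| 8 | 4 | 145 | 47 |
| 9 | 1 | 152 | 33 |
| 10 | 11 | 137 | 31 |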


In the above code:

* `@where` filters the rows based on the <i>Weight</i> variable.
* `@orderby` sorts in descending order using the <i>Age</i> variable.
* `@select` chooses the columns to keep and optionally renames them in the resulting object.
* `@collect` materializes the result. With `DataFrame` as its argument, the resulting object **ex** is a dataframe; with no argument, the resulting object is an array. If `@collect` is left out entirely, the query returns a lazy iterator.

Note the use of the range variable __i__ to reference the columns in the dataframe.
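The effect of the `@collect` argument can be seen by running the same query twice. The sketch below uses a small made-up dataframe (`df`, hypothetical values) so the expected rows are easy to work out by hand.

```julia
using DataFrames, Query

# Toy data (not the lesson's dataframe).
df = DataFrame(ID = 1:4, Weight = [125, 140, 160, 135], Age = [30, 50, 25, 60])

# Collect into a dataframe.
as_df = @from i in df begin
    @where i.Weight > 130
    @orderby descending(i.Age)
    @select {PatientID = i.ID, PatientAge = i.Age}
    @collect DataFrame
end

# Same query, collected into an array of named tuples.
as_array = @from i in df begin
    @where i.Weight > 130
    @orderby descending(i.Age)
    @select {PatientID = i.ID, PatientAge = i.Age}
    @collect
end
```

Rows 2, 3, and 4 pass the filter and are ordered by age as IDs 4, 2, 3. In the array form each element is a named tuple, so fields are read as `as_array[1].PatientID`.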

Next let's do an example where we group the data into subgroups based on the values of <i>Category</i> ("Low", "Medium", "High") and <i>IndVar</i> (0, 1), after first keeping only the rows with `RandNum > 0` and `Age > 25`.

In [9]:
ex = @from i in dfa begin
     @where i.RandNum > 0 && i.Age > 25
     @group i by i.Category, i.IndVar into c
     @orderby key(c)
     @select {Grouping = key(c), AvgAge=mean(c.Age), MaxWeight = maximum(c.Weight), Count = length(c.Age)}
     @collect DataFrame
end

| Row | Grouping | AvgAge | MaxWeight | Count |
|---|---|---|---|---|
| | Tuple… | Float64 | Int64 | Int64 |
| 1 | ("Low", 0) | 62.0 | 128 | 1 |
| 2 | ("Low", 1) | 29.0 | 156 | 1 |
| 3 | ("Medium", 0) | 41.5 | 169 | 4 |
| 4 | ("Medium", 1) | 73.0 | 140 | 1 |
| 5 | ("High", 0) | 27.0 | 149 | 1 |
| 6 | ("High", 1) | 60.5 | 157 | 2 |


The `@group` statement groups the data into a new range variable (the new range variable is called <i>c</i> in our case) based on the levels of the column variables; this new range variable is then used to aggregate the data. 

The `key` function gives the values used to group the data. The other functions used in `@select` are calculated based on the grouped data via the new range variable <i>c</i>.
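A stripped-down version of the grouping pattern, on a made-up three-row dataframe with a single grouping column, may help show how `key` and the group variable fit together. The names `df`, `g`, and `res` are illustrative.

```julia
using DataFrames, Query, Statistics

# Toy data (hypothetical values): two "Low" rows and one "High" row.
df = DataFrame(Category = ["Low", "Low", "High"], Age = [20, 40, 50])

res = @from i in df begin
    @group i by i.Category into g
    # key(g) is the grouping value; g.Age holds the ages within that group.
    @select {Group = key(g), AvgAge = mean(g.Age), Count = length(g)}
    @collect DataFrame
end

# Look up the aggregates by group name.
avg_by_group = Dict(res.Group .=> res.AvgAge)
count_by_group = Dict(res.Group .=> res.Count)
```

In this sketch the "Low" group averages 30.0 over two rows and the "High" group averages 50.0 over one row.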

If you want to restrict the output to groups that satisfy a condition on an aggregated value, you can add another `@where` statement after the `@group` statement. Here we keep only the groups whose maximum <i>Weight</i> exceeds 150.

In [10]:
ex = @from i in dfa begin
     @where i.RandNum > 0 && i.Age > 25
     @group i by i.Category, i.IndVar into c
     @where maximum(c.Weight) > 150
     @orderby key(c)
     @select {Group = key(c), AvgAge=mean(c.Age), MaxWeight = maximum(c.Weight), Count = length(c.Age)}
     @collect DataFrame
end

| Row | Group | AvgAge | MaxWeight | Count |
|---|---|---|---|---|
| | Tuple… | Float64 | Int64 | Int64 |
| 1 | ("Low", 1) | 29.0 | 156 | 1 |
| 2 | ("Medium", 0) | 41.5 | 169 | 4 |
| 3 | ("High", 1) | 60.5 | 157 | 2 |


You can use the Query.jl `@let` macro to introduce new range variables into a query. Here we introduce a new range variable <i>ExpRandNum</i> as a function of another column variable (__RandNum__).

In [11]:
ex = @from i in dfa begin
     @let ExpRandNum = exp(i.RandNum)
     @orderby i.Category
     @where ExpRandNum > 0.9
     @select {i.ID, i.Category, i.IndVar, i.RandNum, ExpRandNum}
     @collect DataFrame
end

| Row | ID | Category | IndVar | RandNum | ExpRandNum |
|---|---|---|---|---|---|
| | Int64 | Cat… | Cat… | Float64 | Float64 |
| 1 | 13 | Low | 1 | 1.03132 | 2.80477 |
| 2 | 19 | Low | 0 | 1.44886 | 4.25826 |
| 3 | 3 | Medium | 1 | 0.265743 | 1.3044 |
| 4 | 4 | Medium | 0 | 1.06561 | 2.90261 |
| 5 | 11 | Medium | 0 | 0.183976 | 1.20199 |
| 6 | 15 | Medium | 0 | 0.754603 | 2.12677 |
| 7 | 18 | Medium | 0 | 1.30668 | 3.6939 |
| 8 | 5 | High | 1 | 1.38501 | 3.99487 |
| 9 | 6 | High | 0 | 0.0799514 | 1.08323 |
| 10 | 20 | High | 1 | 0.778151 | 2.17744 |
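The same `@let` pattern on a tiny made-up dataframe makes the derived variable easy to check by hand. In this sketch the column `X` and the name `ex` are illustrative, and the threshold of 1.0 is chosen so that only non-negative values of `X` pass.

```julia
using DataFrames, Query

# Toy data (hypothetical values): exp(-1) < 1, exp(0) == 1, exp(1) > 1.
df = DataFrame(ID = 1:3, X = [-1.0, 0.0, 1.0])

res = @from i in df begin
    @let ex = exp(i.X)      # derived range variable, computed once per row
    @where ex >= 1.0        # the derived variable can be used in later statements
    @select {i.ID, i.X, ExpX = ex}
    @collect DataFrame
end
```

Only the rows with IDs 2 and 3 remain, and `ExpX` carries the computed values into the output.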


The last thing we'll cover is doing join operations in Query.jl. We'll create another dataframe to join with __dfa__.

In [12]:
N = 15
dfb = DataFrame(IDNum = 1:N, 
                Color = ["blue", "orange", "orange", "black", "black", "red", "white", "purple", "yellow",
                         "green", "brown", "grey", "blue", "red", "white"]);

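We'll do a simple inner join on <i>ID</i> from __dfa__ with <i>IDNum</i> from __dfb__. Because __dfb__ has only 15 rows, the rows of __dfa__ with <i>ID</i> greater than 15 have no match and are dropped from the result.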

In [13]:
ex = @from i in dfa begin
     @join j in dfb on i.ID equals j.IDNum
     @select {SubjID = i.ID, i.Category, i.IndVar, j.Color}
     @collect DataFrame
end

| Row | SubjID | Category | IndVar | Color |
|---|---|---|---|---|
| | Int64 | Cat… | Cat… | String |
| 1 | 1 | Medium | 0 | blue |
| 2 | 2 | High | 1 | orange |
| 3 | 3 | Medium | 1 | orange |
| 4 | 4 | Medium | 0 | black |
| 5 | 5 | High | 1 | black |
| 6 | 6 | High | 0 | red |
| 7 | 7 | Low | 1 | white |
| 8 | 8 | Low | 1 | purple |
| 9 | 9 | Low | 0 | yellow |
| 10 | 10 | Medium | 1 | green |
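For comparison, here is a sketch of the same kind of inner join on two small made-up dataframes (`left` and `right`, hypothetical values), written once with Query.jl and once with `innerjoin` from DataFrames.jl. The fourth row of `left` has no partner in `right`, so it should not appear in either result.

```julia
using DataFrames, Query

# Toy data: ID 4 has no matching IDNum.
left = DataFrame(ID = 1:4, Category = ["Low", "High", "Low", "Medium"])
right = DataFrame(IDNum = [1, 2, 3], Color = ["blue", "red", "green"])

# Query.jl inner join on ID == IDNum.
joined = @from i in left begin
    @join j in right on i.ID equals j.IDNum
    @select {i.ID, i.Category, j.Color}
    @collect DataFrame
end

# DataFrames.jl equivalent; sorted by ID so the row order is well defined.
via_df = sort(innerjoin(left, right, on = :ID => :IDNum), :ID)
```

Both results contain three rows, one for each ID that appears in both tables.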


In this lesson we covered:
* Using Pipe.jl to chain operations on dataframes.
* Using Query.jl to execute query expressions on dataframes.