# VI. Using Query.jl

The Query.jl package in Julia can be used to query data sources using query expressions. Typical operations include filtering, projecting, joining, sorting, and grouping. Supported data sources include CSV files, arrays, dictionaries, databases (SQLite), dataframes, and more generally any iterable data source. The basic syntax has this structure:

`myq = @from <range_var> in <source> begin
    <query_statements>
end`

The range variable takes on each element of the data source in turn, and the query statements are the query commands that get executed against it. `@from` is a Julia macro provided by the **Query.jl** package.
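
As a minimal sketch of this structure on a plain array (the array `numbers` and the range variable `n` are made up for illustration and are not part of this notebook's data), the query below keeps the even values and projects their squares. `@collect` with no argument returns an array.

```julia
using Query

# Hypothetical toy data, chosen only for illustration
numbers = [3, 8, 1, 10, 5]

# n is the range variable; it takes each element of numbers in turn
evens_squared = @from n in numbers begin
    @where n % 2 == 0      # filter: keep even values
    @select n^2            # project: square each remaining value
    @collect               # materialize the result as an array
end
```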

In [None]:
using DataFrames, Distributions, Random, Statistics, CategoricalArrays, Query

Let's use the same dataframe from the split-apply-combine notebook.

In [None]:
#create a dataframe

Random.seed!(1234)

N = 20
dfa = DataFrame(ID = 1:N,
                Category = wsample(["Low", "Medium", "High"], [1/3, 1/3, 1/3], N),
                Weight = rand(120:170, N),
                Age = rand(20:80, N),
                IndVar = wsample([0, 1], [0.5, 0.5], N),
                RandNum = randn(N))

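# categorical! was removed in DataFrames 1.0; transform! with categorical is the replacement
transform!(dfa, [:Category, :IndVar] .=> categorical, renamecols = false);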
levels!(dfa.Category, ["Low", "Medium", "High"]);

In [None]:
dfa

For a first example, let's create a new dataframe from **dfa** that keeps only the rows where `Weight > 130`, keeps only the columns <i>ID</i>, <i>Weight</i>, and <i>Age</i>, and sorts the result in descending order by <i>Age</i>.

In [None]:
ex = @from i in dfa begin
     @where i.Weight > 130
     @orderby descending(i.Age)
     @select {PatientID = i.ID, PatientWeight = i.Weight, PatientAge = i.Age}
     @collect DataFrame
end

In the above code:

* `@where` filters the rows based on the __Weight__ variable.
* `@orderby` sorts the rows in descending order using the __Age__ variable.
* `@select` picks the columns to keep and optionally renames them in the resulting object.
* `@collect` materializes the result; here it returns __ex__ as a dataframe. If `@collect` is given no argument, the result is an array, and if `@collect` is omitted entirely, the query returns a lazy iterable.

Note the use of the range variable __i__ to reference the columns in the dataframe.
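
The effect of `@collect` can be sketched on a small, made-up dataframe (`people` and its values are hypothetical and only serve as an illustration). With `@collect` and no sink, the query should return an array of named tuples rather than a dataframe.

```julia
using DataFrames, Query

# Hypothetical toy dataframe, not the dfa used in this notebook
people = DataFrame(ID = 1:4,
                   Weight = [125, 140, 155, 132],
                   Age = [30, 52, 41, 67])

rows = @from p in people begin
    @where p.Weight > 130          # drops ID 1
    @orderby descending(p.Age)     # oldest first
    @select {p.ID, p.Age}          # field names are taken from the column names
    @collect                       # no sink given, so an array is returned
end

# Each element is a named tuple, so fields are accessed with dot syntax
ids_by_age = [r.ID for r in rows]
```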

Next let's do an example where we group the data into subgroups based on the values of __Category__ ("Low", "Medium", "High") and __IndVar__ (0, 1).

In [None]:
ex = @from i in dfa begin
     @where i.RandNum > 0 && i.Age > 25
     @group i by i.Category, i.IndVar into c
     @orderby key(c)
     @select {Grouping = key(c), AvgAge = mean(c.Age), MaxWeight = maximum(c.Weight), Count = length(c.Age)}
     @collect DataFrame
end

The `@group` statement groups the data into a new range variable (the new range variable is called __c__ in our case) based on the levels of the column variables; this new range variable is then used to aggregate the data. 

The __key__ function gives the values used to group the data. The other functions used in `@select` are calculated based on the grouped data via the new range variable __c__.
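
A related form, sketched here on made-up data (the dataframe `toy` and its values are hypothetical), groups only a single expression instead of the whole row. With `@group i.Age by i.Category into g`, each group `g` behaves like an array of ages, so aggregate functions can be applied to `g` directly.

```julia
using DataFrames, Statistics, Query

# Hypothetical toy dataframe, not the dfa used in this notebook
toy = DataFrame(Category = ["Low", "High", "Low", "High", "Low"],
                Age = [30, 40, 50, 60, 40])

stats = @from i in toy begin
    @group i.Age by i.Category into g   # g holds the ages of one category
    @orderby key(g)                     # sort groups by their key
    @select {Category = key(g), AvgAge = mean(g), Count = length(g)}
    @collect DataFrame
end
```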

To restrict the output to groups that satisfy a condition on an aggregate value (similar to a `HAVING` clause in SQL), add another `@where` statement after the `@group` statement.

In [None]:
ex = @from i in dfa begin
     @where i.RandNum > 0 && i.Age > 25
     @group i by i.Category, i.IndVar into c
     @where maximum(c.Weight) > 140
     @orderby key(c)
     @select {Group = key(c), AvgAge = mean(c.Age), MaxWeight = maximum(c.Weight), Count = length(c.Age)}
     @collect DataFrame
end

You can use the Query.jl `@let` statement to introduce new range variables into a query. Here we introduce a new range variable (__ExpRandNum__) computed from another column variable (__RandNum__).

In [None]:
ex = @from i in dfa begin
     @let ExpRandNum = exp(i.RandNum)
     @orderby i.Category
     @where ExpRandNum > 0.5
     @select {i.ID, i.Category, i.IndVar, i.RandNum, ExpRandNum}
     @collect DataFrame
end
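
The same idea can be sketched on a plain array, where the arithmetic is easy to check by hand. The names `xs`, `x`, and `e` are made up for illustration. Since `exp(-2.0)` is about 0.135, that element should be filtered out, while `exp(0.0) = 1.0` and `exp(1.0)` pass the filter.

```julia
using Query

# Hypothetical toy data, chosen only for illustration
xs = [-2.0, 0.0, 1.0]

kept = @from x in xs begin
    @let e = exp(x)                    # new range variable derived from x
    @where e > 0.5                     # the derived variable can be used in later statements
    @select {Value = x, ExpValue = e}
    @collect
end
```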

The last thing we'll cover is doing join operations in Query.jl. We'll create another dataframe to join with __dfa__.

In [None]:
N = 15
dfb = DataFrame(IDNum = 1:N, 
                Color = ["blue", "orange", "orange", "black", "black", "red", "white", "purple", "yellow",
                         "green", "brown", "grey", "blue", "red", "white"]);

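We'll do a simple inner join on _ID_ from __dfa__ with _IDNum_ from __dfb__. Because __dfb__ only contains IDs 1 through 15, the result has 15 rows; the rows of __dfa__ with no match are dropped.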

In [None]:
ex = @from i in dfa begin
     @join j in dfb on i.ID equals j.IDNum
     @select {SubjID = j.IDNum, i.Category, i.IndVar, j.Color}
     @collect DataFrame
end
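
To make the inner-join behavior easy to see, here is a small sketch on made-up dataframes (`left` and `right` and their values are hypothetical). Only the IDs present in both tables should appear in the result.

```julia
using DataFrames, Query

# Hypothetical toy dataframes, not the dfa and dfb used in this notebook
left = DataFrame(ID = 1:4, Category = ["Low", "High", "Medium", "Low"])
right = DataFrame(IDNum = [2, 4, 5], Color = ["red", "blue", "green"])

joined = @from i in left begin
    @join j in right on i.ID equals j.IDNum   # keep only matching key pairs
    @select {i.ID, i.Category, j.Color}
    @collect DataFrame
end

# IDs 1 and 3 have no match in right, and IDNum 5 has no match in left
matched_ids = sort(joined.ID)
```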

In this lesson we covered:
* Using Query.jl to execute query expressions on dataframes.