# SciencesPo Computational Economics 2018

* Welcome!
* Why do Economists (and Social Scientists in general) have to talk about Computation?
* [This](https://web.stanford.edu/%7Egentzkow/research/CodeAndData.pdf) by Gentzkow and Shapiro is still relevant, even if in *Computing Time*, 2014 is a long time ago. I highly recommend you read this.

## Gentzkow and Shapiro

> Here is a good rule of thumb: If you are trying to solve a problem, and there are multi-billion dollar firms whose entire business model depends on solving the same problem, and there are whole courses at your university devoted to how to solve that problem, you might want to figure out what the experts do and see if you can’t learn something from it.

## Gentzkow and Shapiro

* This is the spirit of this course. 
* We want to learn computation as practised by *the experts*.
* We don't want to - **ever** - reinvent the wheel. 


# Computation Basics

* It is important that we understand some basics about computers.
* Even though software (and computers) get more and more sophisticated, there is still a considerable margin for "human error". This doesn't necessarily mean that something is wrong, but certain ways of doing things may have severe performance implications.
* Whatever else happens, *you* write the code, and one way of writing code is different from another.

![test](../assets/figs/BasicComputing/picnic.jpeg)

## Julia? Why Julia?

* The *best* software doesn't exist. It all depends on:
	1. The problem at hand. 
		* You are fine with Stata if you need to run a probit.
		* Languages have different comparative advantages with regard to different tasks.
	1. Preferences of the analyst. Some people just *love* their software.
* That said, there are some general themes we should keep in mind when choosing software.
* [Steven Johnson at MIT has a good pitch.](https://github.com/stevengj/julia-mit)

## High versus Low Level Languages

* High-level languages for technical computing: Matlab, Python, R, ...
	* you get going immediately
	* very important for exploratory coding or data analysis
	* You don't want to worry about type declarations and compilers at the exploratory stage
* High-level languages are slow.
	* Traditional Solutions to this: Passing the high-speed threshold.
	* Using `Rcpp` or `Cython` etc is a bit like [Stargate](https://en.wikipedia.org/wiki/Stargate_SG-1). You lose control for a little while the moment you pass the barrier to `C++`. (Even though those are great solutions.) If the `C++` part of your code becomes large, testing this code becomes increasingly difficult.
	* You end up spending your time coding `C++`. But that has its own drawbacks.

## Julia is Fast

* Julia is [fast](http://julialang.org/benchmarks/).
	* But julia is also a high-level dynamic language. How come? 
	* The JIT compiler.
	* The [LLVM project](https://en.wikipedia.org/wiki/LLVM).
* Julia is open source (and it's for free)
	* It's for free. Did I say that it's for free?
	* You will never again worry about licenses. Want to run 1000 instances of julia? Do it.
	* Almost the entire standard library of julia is written in julia (and not in `C`, e.g., as is the case in R, matlab or python). It's easy to look at and understand how things work.
* Julia is a very modern language, combining the best features of many other languages.

## What does Julia want to achieve?

* There is a *wall* built into the scientific software stack
* Julia cofounder Stefan Karpinski [talks about **the 2 languages problem**](https://opendatascience.com/conferences/odsc-east-2016-stefan-karpinski-solving-the-two-language-problem/)
* **key: the wall creates a social barrier.** Developer and User are different people.

## The Wall in the scientific software stack

![test](../assets/figs/BasicComputing/stack.png)

## Who is using Julia?

* [Case Studies](https://juliacomputing.com/case-studies/)
    * [One of the top 10 data sciences problems solved with Celeste.jl](https://youtu.be/uecdcADM3hY)
        * [next platform article](https://www.nextplatform.com/2017/11/28/julia-language-delivers-petascale-hpc-performance/)
    * [US Federal Aviation Administration](https://juliacomputing.com/case-studies/lincoln-labs.html) builds their Airborne Collision Avoidance System with julia
    * [The NY Fed runs their DSGE model in julia](https://juliacomputing.com/case-studies/ny-fed.html)
    * and more

## Economists and Their Software

* In [*A Comparison of Programming Languages in Economics*](http://economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf)<cite data-cite=jesuscomputing></cite>, the authors compare some widely used languages on a close to [identical piece of code](https://github.com/jesusfv/Comparison-Programming-Languages-Economics).
* It can be quite contentious to talk about Software to Economists.
	* Religious War. 
	* Look at the [comments on this blog post regarding the paper](http://marginalrevolution.com/marginalrevolution/2014/07/a-comparison-of-programming-languages-in-economics.html).
	* There *are* switching costs from one language to another.
	* Network effects (Seniors handing down their software to juniors etc)
* Takeaway from that paper: 
	* There are some very good alternatives to `fortran`
	* `fortran` is **not** faster than `C++`
	* It seems pointless to invest either money or time in `matlab`, given the many good options that are available for free.


## The Fundamental Tradeoff

#### Developer Time (Your Time) is Much More Expensive than Computing Time

* It may well be that the runtime of a fortran program is one third of the time it takes to run the program in julia, or anything else for that matter.
* However, the time it takes to **develop** that program is very likely to be (much) longer in fortran. 
* Particularly if you want to hold your program to the same quality standards.

#### Takeaway

* Given my personal experience with many of the above languages, I think `julia` is a very good tool for economists with non-trivial computational tasks.
* This is why I am using it for demonstrations in this course.



## A Second Fundamental Tradeoff

* Regardless of the software you use, there is one main problem with computation.
* It concerns **speed vs accuracy**.
* You may be able to do something very fast, but with very low accuracy (i.e. with a high numerical margin of error).
* On the other hand, you may be able to get a very accurate solution, but it may take you an unrealistic amount of time to get there.
* You have to face that tradeoff and decide for yourself what's best.
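
To make the tradeoff concrete, here is a minimal sketch using the trapezoid rule to approximate the integral of `exp` over [0, 1], whose exact value is `e - 1`. The function `trapezoid` and the two grid sizes are illustrative choices, not part of the course material: more grid points cost more function evaluations but shrink the error.

```julia
# Trapezoid rule for the integral of f over [a, b] using n subintervals.
function trapezoid(f, a, b, n)
    h = (b - a) / n
    total = (f(a) + f(b)) / 2
    for k in 1:(n - 1)
        total += f(a + k * h)
    end
    return h * total
end

exact  = exp(1) - 1                        # integral of exp(x) over [0, 1]
coarse = trapezoid(exp, 0.0, 1.0, 10)      # fast, less accurate
fine   = trapezoid(exp, 0.0, 1.0, 10_000)  # slower, more accurate

println("error with n = 10:    ", abs(coarse - exact))
println("error with n = 10000: ", abs(fine - exact))
```

Which `n` is "right" depends on how much error your application can tolerate and how often the integral has to be computed.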

## A Warning about Optimizing your Code!

In Donald Knuth's paper "Structured Programming With GoTo Statements", he wrote:   

>"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: **premature optimization is the root of all evil**. Yet we should not pass up our opportunities in that critical 3%."





## Julia Workflow

* We will use the latest stable version of julia. Today that is `v0.6.2`.
* For today, stay within this notebook.
* In general, you will want to install julia on your computer.
* The most stable workflow involves a text editor and a julia terminal next to it.
* You would develop your code in a text file (say "develop.jl"), and then just do `include("develop.jl")` in the terminal.
* I use Sublime Text as an editor.
* Atom is also very good.


## Some Numerical Concepts and `Julia`

* Machine epsilon: The distance between 1.0 and the next larger representable floating point number, type `eps()`. It is *not* the smallest number your computer can represent.
* Infinity: A number greater than all representable numbers on your computer. [Obeys some arithmetic rules](http://docs.julialang.org/en/release-0.4/manual/integers-and-floating-point-numbers/?highlight=infinity#special-floating-point-values)
	* Overflow: If you perform an operation where the result is greater than the largest representable number.
	* Underflow: You take two (very small) representable numbers, but the result is smaller in magnitude than the smallest positive representable number.
	* In Julia, integer arithmetic wraps around the end of the representable space:
	```julia
	x = typemax(Int64)
	x + 1
	```
* Integers and Floating Point Numbers.
* Single and Double Precision.
* In Julia, all of these are different [*numeric primitive types (head over to julia manual for a second)*](https://docs.julialang.org/en/latest/manual/integers-and-floating-point-numbers/).
* Julia also supports *Arbitrary Precision Arithmetic* through the `BigInt` and `BigFloat` types. With those, overflow is no longer an issue, at the price of slower arithmetic.
* See min and max for different types:

In [1]:
for T in [Int8,Int16,Int32,Int64,Int128,UInt8,UInt16,UInt32,
		  UInt64,UInt128,Float32,Float64]
         println("$(lpad(T,7)): [$(typemin(T)),$(typemax(T))]")
end

   Int8: [-128,127]
  Int16: [-32768,32767]
  Int32: [-2147483648,2147483647]
  Int64: [-9223372036854775808,9223372036854775807]
 Int128: [-170141183460469231731687303715884105728,170141183460469231731687303715884105727]
  UInt8: [0,255]
 UInt16: [0,65535]
 UInt32: [0,4294967295]
 UInt64: [0,18446744073709551615]
UInt128: [0,340282366920938463463374607431768211455]
Float32: [-Inf,Inf]
Float64: [-Inf,Inf]
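
The concepts from this slide can be checked directly at the REPL. The following is a small sketch using only functions from Base; the particular numbers (`1e-300`, `1e308`, ...) are arbitrary illustrations.

```julia
# Machine epsilon: spacing between 1.0 and the next larger Float64
println(eps())                        # 2.220446049250313e-16
println(1.0 + eps() / 2 == 1.0)       # true: the increment is lost to rounding

# There are representable numbers far smaller than eps()
println(1e-300 > 0.0)                 # true

# Floating point overflow gives Inf, underflow gives 0.0
println(1e308 * 10)                   # Inf
println(1e-308 * 1e-308)              # 0.0

# Fixed-size integers wrap around
println(typemax(Int64) + 1 == typemin(Int64))   # true

# Arbitrary precision integers do not
println(big(typemax(Int64)) + 1)      # 9223372036854775808
```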



## Interacting with the `Julia REPL`

* REPL?
* different modes: command, help, search, shell
* incremental search with `CTRL r`
* documented in the [manual](https://docs.julialang.org/en/stable/manual/interacting-with-julia/)




# `Julia` Primer: Types

* Types are at the core of what makes julia a great language. 
* *Everything* in julia is represented as a datatype. 
* Remember the different numeric *types* from before? Those are types.
* The [manual](https://docs.julialang.org/en/stable/manual/types/), as usual, is very informative on this.
* From the [wikibook on julia](https://en.wikibooks.org/wiki/Introducing_Julia/Types), here is a representation of the numeric type graph:

![](../assets/figs/BasicComputing/Type-hierarchy-for-julia-numbers.png)



## `Julia` Primer: Custom Types

* The great thing is that you can create your own types. 
* Going with the example from the wikibook, we could have types `Jaguar` and `Cat` as subtypes of `Feline`:

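```julia
# supertypes must be abstract: a concrete `struct` cannot be subtyped
abstract type Feline end

struct Jaguar <: Feline
	weight::Float64
	sound::String
end
struct Cat <: Feline
	weight::Float64
	sound::String
end
```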


In [2]:
abstract type Feline end

struct Jaguar <: Feline
	weight::Float64
	sound::String
end
struct Cat <: Feline
	weight::Float64
	sound::String
end

# is a cat a Feline?
Cat <: Feline

true

In [3]:
# create a cat and a jaguar
c = Cat(15.2,"miauu")
j = Jaguar(95.1,"ROARRRRRR!!!!!")

# is c an instance of type Cat?
isa(c,Cat)

# methods
function behave(c::Cat)
    println(c.sound)
    println("my weight is $(c.weight) kg! should go on a diet")
end
function behave(j::Jaguar)
    println(j.sound)
    println("Step back! I'm a $(j.weight) kg jaguar.")
end

behave (generic function with 2 methods)

In [4]:
# make a cat behave:
behave(c)

# and a jaguar
behave(j)

miauu
my weight is 15.2 kg! should go on a diet
ROARRRRRR!!!!!
Step back! I'm a 95.1 kg jaguar.



## Julia Primer: Multiple Dispatch

* You have just learned `multiple dispatch`. The same function name dispatches to different *methods*, depending on the types of the input arguments (all of them, not just the first).
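
The "multiple" in multiple dispatch is easier to see with more than one argument. Here is a minimal sketch; the types `Animal`, `Lion`, `Zebra` and the function `meet` are made up for illustration.

```julia
abstract type Animal end
struct Lion  <: Animal end
struct Zebra <: Animal end

# one generic function, several methods; the most specific match wins
meet(a::Animal, b::Animal) = "ignores"
meet(a::Lion,   b::Zebra)  = "chases"
meet(a::Zebra,  b::Lion)   = "runs from"

println(meet(Lion(), Zebra()))    # chases
println(meet(Zebra(), Lion()))    # runs from
println(meet(Lion(), Lion()))     # ignores
```

The method is selected by the combination of both argument types, which is what distinguishes this from the single dispatch of most object-oriented languages.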


## Julia Primer: Important performance lesson - Type Stability

* If you don't declare types, julia will try to infer them for you.
* DANGER: don't change types along the way. 
	* julia optimizes your code for a specific type configuration.
	* it's not the same CPU operation to add two `Int`s and two `Float`s. The difference matters.
* Example

In [5]:
function t1(n)
    s  = 0  # typeof(s) = Int
    for i in 1:n
        s += s/i   # s/i is a Float64, so s changes type from Int to Float64
    end
end
function t2(n)
    s  = 0.0   # typeof(s) = Float64
    for i in 1:n
        s += s/i
    end
end
@time t1(10000000)
@time t2(10000000)

  0.586747 seconds (60.00 M allocations: 915.613 MiB, 7.64% gc time)
  0.007323 seconds (1.09 k allocations: 59.092 KiB)
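
One way to spot such a problem without timing anything is to ask the compiler which return type it infers. This is a sketch under assumptions: `unstable` and `stable` are hypothetical variants of `t1` and `t2` that return the accumulator, and `Base.return_types` is an unexported helper whose printed output may differ between julia versions. The macro `@code_warntype` gives similar information interactively.

```julia
# hypothetical variants of t1 and t2 that return the accumulator
function unstable(n)
    s = 0            # starts as an Int ...
    for i in 1:n
        s += 1 / i   # ... and becomes a Float64 in the first iteration
    end
    return s
end

function stable(n)
    s = 0.0          # Float64 from start to finish
    for i in 1:n
        s += 1 / i
    end
    return s
end

# ask the compiler what it infers for an Int argument
println(Base.return_types(unstable, (Int,)))   # a Union of Int and Float64
println(Base.return_types(stable, (Int,)))     # Float64 only
```

A `Union` (or `Any`) in the inferred type is the signal that julia cannot compile one specialised loop and has to check the type of `s` at runtime.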


## Julia Modules 

* A module is a new workspace - a new *global scope*
* A module defines a separate *namespace*
* There is an illustrative example in the julia manual, let's look at it.

### An example Module

```julia
module MyModule
    # which other modules to use: imports
    using Lib
    using BigLib: thing1, thing2
    import Base.show
    importall OtherLib
    
    # what to export from this module
    export MyType, foo

    # type defs
    struct MyType
        x
    end

    # methods
    bar(x) = 2x
    foo(a::MyType) = bar(a.x) + 1

    show(io::IO, a::MyType) = print(io, "MyType $(a.x)")
end
```
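
The manual's example refers to placeholder modules (`Lib`, `BigLib`, `OtherLib`) and will not run as it stands. Here is a small self-contained sketch of the namespace idea; the module `Shapes` and its contents are hypothetical.

```julia
module Shapes                      # hypothetical module name

export area                        # only `area` is exported

struct Square
    side::Float64
end

area(s::Square) = s.side^2
perimeter(s::Square) = 4 * s.side  # defined, but not exported

end

# qualified access always works, exported or not
sq = Shapes.Square(2.0)
println(Shapes.area(sq))           # 4.0
println(Shapes.perimeter(sq))      # 8.0
println(names(Shapes))             # the module name and its exports
```

Exporting only controls which names become visible without the `Shapes.` prefix once the module is brought in with `using`.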

### Modules and files

* You can easily spread a module over several files to organize your code.
* For example, you could `include` other files like this

```julia
module Foo

include("file1.jl")
include("file2.jl")

end
```

### Working with Modules

* Look at the example at [the manual!](https://docs.julialang.org/en/stable/manual/modules/)
* **Location of Modules**: Julia stores packages in a hidden folder `~/.julia/v0.6` (system-dependent)
* You can develop your own modules in a different location if you want.
* Julia reads the file `~/.juliarc.jl` on each startup. Modify the `LOAD_PATH` variable:

```julia
# add this to ~/.juliarc.jl
push!(LOAD_PATH, "/Path/To/My/Module/")
```

## Unit Testing and Code Quality


![](../assets/figs/BasicComputing/phd033114s.png)

## What is Unit Testing? Why should you test your code?

* Bugs are very hard to find just by *looking* at your code.
* Bugs hide.
* From this very instructive [MIT software construction class](http://web.mit.edu/6.005/www/fa15/classes/03-testing/#unit_testing_and_stubs):

> Even with the best validation, it’s very hard to achieve perfect quality in software. Here are some typical residual defect rates (bugs left over after the software has shipped) per kloc (one thousand lines of source code):
>
> * 1 - 10 defects/kloc: Typical industry software.
> * 0.1 - 1 defects/kloc: High-quality validation. The Java libraries might achieve this level of correctness.
> * 0.01 - 0.1 defects/kloc: The very best, safety-critical validation. NASA and companies like Praxis can achieve this level.
>
> This can be discouraging for large systems. For example, if you have shipped a million lines of typical industry source code (1 defect/kloc), it means you missed 1000 bugs!



## Unit Testing in Science

* One widely-used way to prevent your code from having too many errors is to continuously test it.
* This issue is widely neglected in Economics as well as other sciences.
	* If the resulting graph looks right, the code should be alright, shouldn't it?
	* Well, should it?
* It is regrettable that so little effort is put into verifying the proper functioning of scientific code. 


* Referees in general don't have access to the computing code for a paper that is submitted to a journal for publication.
* How should they be able to tell whether what they see in black on white on paper is the result of the actual computation that was proposed, rather than the result of chance (a.k.a. a bug)?
	* Increasingly papers do post the source code *after* publication.
	* The scientific method is based on the principle of **reproducibility** of results. 
		* Notice that having something reproducible is only a first step, since you can reproduce with your buggy code the same nice graph. 
		* But from where we are right now, it's an important first step.
	* This is an issue that is detrimental to credibility of Economics, and Science, as a whole. 
* Extensively testing your code will guard you against this.

## Best Practice

* You want to be in **maximum control** over your code at all times:
	* You want to be **as sure as possible** that a certain piece of code is doing what it is actually meant to do.
	* This sounds trivial (and it is), yet very few people engage in unit testing.
* Things are slowly changing. See [http://www.runmycode.org](http://www.runmycode.org) for example.
* **You** are the generation that is going to change this. Do it.
* Let's look at some real world Examples.



## Ariane 5 blows up because of a bug

> It took the European Space Agency 10 years and $7 billion to produce Ariane 5, a giant rocket capable of hurling a pair of three-ton satellites into orbit with each launch and intended to give Europe overwhelming supremacy in the commercial space business. 
All it took to explode that rocket less than a minute into its maiden voyage last June, scattering fiery rubble across the mangrove swamps of French Guiana, was a small computer program trying to stuff a 64-bit number into a 16-bit space. This shutdown occurred 36.7 seconds after launch, when the guidance system's own computer tried to convert one piece of data -- the sideways velocity of the rocket -- from a 64-bit format to a 16-bit format. 
**The number was too big, and an overflow error resulted**. When the guidance system shut down, it passed control to an identical, redundant unit, which was there to provide backup in case of just such a failure. But the second unit had failed in the identical manner a few milliseconds before. And why not? It was running the same software.
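
For comparison, this is how the same kind of conversion behaves in julia. It is only a sketch: the value of `velocity` is made up, and nothing here reflects the actual Ariane flight software.

```julia
velocity = 40000.0          # hypothetical Float64 reading, too large for Int16

println(typemax(Int16))     # 32767

# julia refuses the conversion instead of silently truncating
result = try
    Int16(velocity)
catch err
    err
end
println(isa(result, InexactError))   # true
```

An error at the point of conversion is not a solution by itself (the rocket's software also raised an error), but it can be caught by a test long before launch.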

## NASA Mars Orbiter crashes because of a bug

> For nine months, the Mars Climate Orbiter was speeding through space and speaking to NASA in **metric**. But the engineers on the ground were replying **in non-metric English**. 
It was a mathematical mismatch that was not caught until after the $125-million spacecraft, a key part of NASA's Mars exploration program, was sent crashing too low and too fast into the Martian atmosphere. The craft has not been heard from since.
Noel Hinners of Lockheed Martin Astronautics, the prime contractor for the Mars craft, said at a news conference it was up to his company's engineers to assure the metric systems used in one computer program were compatible with the English system used in another program. The simple conversion check was not done, he said.
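
Types and dispatch, as introduced earlier, offer one way to make such a mix-up impossible to miss. A minimal sketch, assuming hypothetical wrapper types `NewtonSeconds` and `PoundForceSeconds` and hypothetical functions `to_metric` and `apply_impulse`; the only hard fact used is the conversion 1 lbf = 4.4482216152605 N.

```julia
# Hypothetical wrapper types: the unit becomes part of the type
struct NewtonSeconds
    value::Float64
end
struct PoundForceSeconds
    value::Float64
end

# 1 lbf = 4.4482216152605 N
const N_PER_LBF = 4.4482216152605
to_metric(x::PoundForceSeconds) = NewtonSeconds(x.value * N_PER_LBF)
to_metric(x::NewtonSeconds) = x

# The navigation routine only accepts metric impulses
apply_impulse(x::NewtonSeconds) = println("applying ", x.value, " N s")

reading = PoundForceSeconds(10.0)
apply_impulse(to_metric(reading))      # converted first, then accepted
# apply_impulse(reading)               # MethodError: no method for PoundForceSeconds
```

With bare `Float64` values both calls would run and one of them would be silently wrong.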



## LA Airport Air Traffic Control shuts down because of a bug

>(IEEE Spectrum) -- It was an air traffic controller's worst nightmare. Without warning, on Tuesday, 14 September, at about 5 p.m. Pacific daylight time, air traffic controllers lost voice contact with 400 airplanes they were tracking over the southwestern United States. Planes started to head toward one another, something that occurs routinely under careful control of the air traffic controllers, who keep airplanes safely apart. But now the controllers had no way to redirect the planes' courses.
The controllers lost contact with the planes when the main voice communications system (VCS) shut down unexpectedly. To make matters worse, a backup system that was supposed to take over in such an event crashed within a minute after it was turned on. The outage disrupted about 800 flights across the country.
Inside the control system unit (VCSU) is a countdown timer that ticks off time in milliseconds. The VCSU uses the timer as a pulse to send out periodic queries to the VSCS. It starts out at the highest possible number that the system's server and its software can handle: 2^32. It's a number just over 4 billion milliseconds. When the counter reaches zero, the system runs out of ticks and can no longer time itself. **So it shuts down**.
*Counting down from 2^32 to zero in milliseconds takes just under 50 days*. The FAA procedure of having a technician reboot the VSCS every 30 days resets the timer to 2^32 almost three weeks before it runs out of digits.
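
The "just under 50 days" figure is easy to verify. A two-line sketch of the arithmetic:

```julia
ticks = Int64(2)^32                   # 4294967296 milliseconds
ms_per_day = 1000 * 60 * 60 * 24      # 86400000
println(ticks / ms_per_day)           # about 49.7 days
```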



## Automated Testing

* You should try to minimize the effort of writing tests.
* Using an automated test suite is very helpful here.
* In Julia, we have got the `Base.Test` module (from v0.7 onwards it is the standard library `Test`)
* Julia unit testing is described [here](http://docs.julialang.org/en/stable/stdlib/test/)

## Automated Testing on Travis

* [https://travis-ci.org](https://travis-ci.org) is a continuous integration service.
* It runs your tests on their machines and notifies you of the result, every time you push a commit to github.
* If the repository is public on github, the service is free.
* Many julia packages are tested on Travis.
* You should look out for this green badge: [![Example](../assets/figs/BasicComputing/Example_0.6.png)](http://pkg.julialang.org/?pkg=Example)
* You can run the tests for a package with `Pkg.test("Package_name")`
* You can run the tests for julia itself with `Base.runtests()`


In [6]:
# let's do some simple testing
using Base.Test

@test 1==1
@test pi ≈ 3.14159 atol=1e-4

Test Passed

In [7]:
@test 2>3

Test Failed
  Expression: 2 > 3
   Evaluated: 2 > 3

LoadError: There was an error during testing
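
In practice you group related tests with `@testset`, which runs all of them and reports a summary instead of stopping at the first failure. A minimal sketch: the function `present_value` is a made-up example of code under test, and the `@static if` switch is only there so that the snippet loads the test module under either of its names (`Base.Test` in v0.6, `Test` afterwards).

```julia
# Load the test module under either name
@static if VERSION < v"0.7.0-DEV.2005"
    using Base.Test
else
    using Test
end

# hypothetical function under test: present value of a constant payment stream
present_value(payment, rate, periods) =
    sum(payment / (1 + rate)^t for t in 1:periods)

@testset "present_value" begin
    # with a zero interest rate nothing is discounted
    @test present_value(100.0, 0.0, 3) == 300.0
    # one period: a single discounted payment
    @test present_value(100.0, 0.05, 1) ≈ 100 / 1.05
    # discounting must reduce the value
    @test present_value(100.0, 0.05, 2) < 200.0
end
```

Tests like the first one, where you know the answer analytically, are the cheapest way to catch a wrong formula.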

## Debugging Julia

There are at least 2 ways to chase down a bug.

1. Use logging facilities. 
    * poor man's logger: write `println` statements at various points in your code.
    * better: use `Base.logging`
        * this will really become useful in v1.0, when you can choose the logging level in a better way.
1. Use the debugger:
    * The Julia debugger used to be [Gallium.jl](https://github.com/Keno/Gallium.jl)
        * There is a lot of work going on in that repo. In the meantime:
    * [https://github.com/Keno/ASTInterpreter2.jl](https://github.com/Keno/ASTInterpreter2.jl) works very well.
    * Debugging works well in Juno. 
    * We will try this out next time, when you start to work with actual code.

## Julia and Data

* There is a github org for [julia and data](https://github.com/JuliaData)
* Some prominent packages for working with data are
    * [DataFrames.jl](https://dataframesjl.readthedocs.io)
    * [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl)
    * [Query.jl](https://github.com/davidanthoff/Query.jl)

### What's special about Data?

* There are several issues when working with *data*:
    * A typical dataset might be delivered to you in tabular form, for example as a comma separated file or a spreadsheet.
    * R, julia and python share the concept of a `DataFrame`: a tabular dataset with column names.
    * That means in particular that each column could have a different datatype.
    * For a language that optimizes on efficiently computing with different datatypes, that is a challenge.
    * Importantly: data can be **missing**, i.e. for several reasons there is a record that was not, well, recorded.
    * Julia has made a lot of progress here. We now have the `Missing` data type, provided in `Missings.jl`.
    

In [8]:
using DataFrames  # DataFrames re-exports Missings.jl
m = missings(Float64,3)  # you choose a datatype, and the dims for an array

3-element Array{Union{Float64, Missings.Missing},1}:
 missing
 missing
 missing

In [9]:
m[1:2] = ones(2)
sum(m)  # => missing. because Float64 + missing = missing
prod(m)  # => missing. because Float64 * missing = missing
sum(skipmissing(m)) # => 2

2.0

In [10]:
# you can replace missing values
println(typeof(Missings.replace(m,3)))
collect(Missings.replace(m,3))

Missings.EachReplaceMissing{Array{Union{Float64, Missings.Missing},1},Float64}


3-element Array{Float64,1}:
 1.0
 1.0
 3.0

### DataFrames

* A dataframe is a tabular dataset: a spreadsheet
* columns can be of different data type. very convenient, very hard to optimize.

In [11]:
df = DataFrame(nums = rand(3),words=["little","brown","dog"])

│ Row │ nums     │ words  │
├─────┼──────────┼────────┤
│ 1   │ 0.526495 │ little │
│ 2   │ 0.607385 │ brown  │
│ 3   │ 0.502525 │ dog    │


In [12]:
# there is a lot of functionality. please consult the manual.
df[:nums]
df[:words][1]  # "little"
df[:nums][2:3] = [1,1]
df

│ Row │ nums     │ words  │
├─────┼──────────┼────────┤
│ 1   │ 0.526495 │ little │
│ 2   │ 1.0      │ brown  │
│ 3   │ 1.0      │ dog    │


In [1]:
using RDatasets   # popular Datasets from R
iris = dataset("datasets", "iris")
head(iris)  # get the first 6 rows

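│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1   │ 5.1         │ 3.5        │ 1.4         │ 0.2        │ setosa  │
│ 2   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
│ 3   │ 4.7         │ 3.2        │ 1.3         │ 0.2        │ setosa  │
│ 4   │ 4.6         │ 3.1        │ 1.5         │ 0.2        │ setosa  │
│ 5   │ 5.0         │ 3.6        │ 1.4         │ 0.2        │ setosa  │
│ 6   │ 5.4         │ 3.9        │ 1.7         │ 0.4        │ setosa  │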


In [2]:
println(iris[:SepalLength][1:6])  # get a column

show(iris[2,:])   # get a row

describe(iris);   # get a description

[5.1, 4.9, 4.7, 4.6, 5.0, 5.4]
1×5 DataFrames.DataFrame
│ Row │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ Species │
├─────┼─────────────┼────────────┼─────────────┼────────────┼─────────┤
│ 1   │ 4.9         │ 3.0        │ 1.4         │ 0.2        │ setosa  │
SepalLength
Summary Stats:
Mean:           5.843333
Minimum:        4.300000
1st Quartile:   5.100000
Median:         5.800000
3rd Quartile:   6.400000
Maximum:        7.900000
Length:         150
Type:           Float64

SepalWidth
Summary Stats:
Mean:           3.057333
Minimum:        2.000000
1st Quartile:   2.800000
Median:         3.000000
3rd Quartile:   3.300000
Maximum:        4.400000
Length:         150
Type:           Float64

PetalLength
Summary Stats:
Mean:           3.758000
Minimum:        1.000000
1st Quartile:   1.600000
Median:         4.350000
3rd Quartile:   5.100000
Maximum:        6.900000
Length:         150
Type:           Float64

PetalWidth
Summary Stats:
Mean:           1.199333
Minimum:     

### Working with DataFrames

* [RTFM :-)](http://juliastats.github.io/DataFrames.jl/stable/)
* We can sort, join (i.e. merge), split-apply-combine and reshape dataframes. We can do most things one can do in base R with a `data.frame`.

In [15]:
d = DataFrame(A = rand(10),B=rand([1,2],10),C=["word $i" for i in 1:10])
by(d,:B,x->mean(x[:A]))

│ Row │ B │ x1       │
├─────┼───┼──────────┤
│ 1   │ 1 │ 0.605719 │
│ 2   │ 2 │ 0.382219 │
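
What `by` does here is usually called split-apply-combine. To see what is going on under the hood, here is a sketch of the same idea in plain julia, without DataFrames; the vectors `A` and `B` are made-up data and the variable names are only illustrative.

```julia
B = [1, 2, 1, 2, 2]                 # grouping variable (made-up data)
A = [0.5, 1.0, 1.5, 2.0, 3.0]       # values to summarise

# split: collect the values of A for each distinct value of B
groups = Dict{Int,Vector{Float64}}()
for (key, value) in zip(B, A)
    push!(get!(groups, key, Float64[]), value)
end

# apply and combine: one mean per group
group_means = Dict(key => sum(vals) / length(vals) for (key, vals) in groups)

println(group_means[1])   # 1.0
println(group_means[2])   # 2.0
```

The dataframe packages do the same thing, with far better performance and with column names attached.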


In [16]:
# subsetting a dataframe can become cumbersome
d = DataFrame(A = rand(10),B=rand([1,2],10),C=["word $i" for i in 1:10])
d[(d[:A].>0.1) .& (d[:B].==2),:]  # approach 1
using DataFramesMeta
@where(d,(:A .>0.1) .& (:B.==2))  # approach 2

│ Row │ A        │ B │ C      │
├─────┼──────────┼───┼────────┤
│ 1   │ 0.134315 │ 2 │ word 2 │
│ 2   │ 0.929985 │ 2 │ word 3 │
│ 3   │ 0.84077  │ 2 │ word 5 │
│ 4   │ 0.831327 │ 2 │ word 8 │
│ 5   │ 0.607283 │ 2 │ word 9 │



### [DataFramesMeta.jl](https://github.com/JuliaStats/DataFramesMeta.jl)

* DataFramesMeta makes subsetting much easier. 
* It makes heavy use of macros.
* The same thing from before is now

```julia
using DataFramesMeta
@where(d,(:A .>0.1) .& (:B.==2))  # elementwise `.&`, as in approach 2 above
```

* We can access column names as *symbols* inside an expression.



In [17]:
# the previous operation would become
@by(d,:B,m=mean(:A))

│ Row │ B │ m        │
├─────┼───┼──────────┤
│ 1   │ 1 │ 0.418524 │
│ 2   │ 2 │ 0.668736 │


### Chaining operations

* Very often we have to do a chain of operations on a set of data.
* Readability is a big concern here.
* Here we can use the `@linq` macro together with a pipe operator.
* This is inspired by `LINQ` (language integrated query) from Microsoft .NET

In [18]:
df = DataFrame(a = 1:20,b = rand([2,5,10],20), x = rand(20))
x_thread = @linq df |>
    transform(y = 10 * :x) |>
    where(:a .> 2) |>
    by(:b, meanX = mean(:x), meanY = mean(:y)) |>
    orderby(:meanX) |>
    select(meanx=:meanX,meany= :meanY, var = :b)

# or
# using Lazy
# x_thread = @> begin
#     df
#     @transform(y = 10 * :x)
#     @where(:a .> 2)
#     @by(:b, meanX = mean(:x), meanY = mean(:y))
#     @orderby(:meanX)
#     @select(:meanX, :meanY, var = :b)
# end


│ Row │ meanx    │ meany   │ var │
├─────┼──────────┼─────────┼─────┤
│ 1   │ 0.400537 │ 4.00537 │ 5   │
│ 2   │ 0.516743 │ 5.16743 │ 10  │
│ 3   │ 0.70122  │ 7.0122  │ 2   │
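
The pipe operator `|>` itself is part of Base julia and works with any one-argument function. A small sketch of chaining on a plain array; the helper functions `keep_large` and `times_ten` are hypothetical.

```julia
# each step takes one collection and returns a new one
keep_large(v) = filter(x -> x > 2, v)
times_ten(v)  = map(x -> 10x, v)

# read left to right: data, then filter, then transform, then summarise
result = collect(1:5) |> keep_large |> times_ten |> sum

println(result)   # 120
```

This reads in the order the operations happen, as opposed to the nested form `sum(times_ten(keep_large(collect(1:5))))`.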


An alternative:

### [Query.jl](https://github.com/davidanthoff/Query.jl)

* Much in the same spirit. But can query almost any data source, not only dataframes.
    * Data Sources: DataFrames, Dicts, Arrays, TypedTables, DataStreams,...
    * Data Sinks: DataFrames, Dicts, CSV
* It is not as convenient to summarise data, however.
* It is one of the [best documented packages](http://www.david-anthoff.com/Query.jl/stable/) I know.
* In general, a query looks like this:
```julia
q = @from <range variable> in <source> begin
    <query statements>
end
```
* Here is an example:

In [21]:
using Query

df = DataFrame(name=["John", "Sally", "Kirk"], age=[23., 42., 59.], children=[3,5,2])

x = @from i in df begin
    @where i.age>50
    @select {i.name, i.children}
    @collect DataFrame
end
println(df)
println(x)

3×3 DataFrames.DataFrame
│ Row │ name  │ age  │ children │
├─────┼───────┼──────┼──────────┤
│ 1   │ John  │ 23.0 │ 3        │
│ 2   │ Sally │ 42.0 │ 5        │
│ 3   │ Kirk  │ 59.0 │ 2        │
1×2 DataFrames.DataFrame
│ Row │ name │ children │
├─────┼──────┼──────────┤
│ 1   │ Kirk │ 2        │


You can do [several things in a query](http://www.david-anthoff.com/Query.jl/stable/querycommands.html):

* sort
* filter
* project
* flatten
* join
* split-apply-combine (dplyr)



In [22]:
df = DataFrame(name=repeat(["John", "Sally", "Kirk"],inner=[1],outer=[2]), 
     age=vcat([10., 20., 30.],[10., 20., 30.].+3), 
     children=repeat([3,2,2],inner=[1],outer=[2]),state=[:a,:a,:a,:b,:b,:b])

x = @from i in df begin
    @group i by i.state into g
    @select {group=g.key,mage=mean(g..age), oldest=maximum(g..age), youngest=minimum(g..age)}
    @collect DataFrame
end

println(x)


2×4 DataFrames.DataFrame
│ Row │ group │ mage │ oldest │ youngest │
├─────┼───────┼──────┼────────┼──────────┤
│ 1   │ a     │ 20.0 │ 30.0   │ 10.0     │
│ 2   │ b     │ 23.0 │ 33.0   │ 13.0     │
