# First steps with DataFrames.jl

## Learn
```julia
extrema()
round()
string.(["x", "y"], [1 2 3 4])
vec()
```

In this notebook we will reproduce the classic Anscombe's quartet plot.

Our objective is to produce a figure similar to this one (the plot is taken from [here](https://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg)).

<img src="https://upload.wikimedia.org/wikipedia/commons/e/ec/Anscombe%27s_quartet_3.svg" style="height: 400px; width:400px;" />

We start by loading the required packages.

In [None]:
using DataFrames
using Statistics
using PyPlot
using GLM

This is a matrix in which we store the 8 columns of Anscombe's quartet data (an x and a y column for each of the four data sets).

In [None]:
aq = [10.0   8.04  10.0  9.14  10.0   7.46   8.0   6.58
       8.0   6.95   8.0  8.14   8.0   6.77   8.0   5.76
      13.0   7.58  13.0  8.74  13.0  12.74   8.0   7.71
       9.0   8.81   9.0  8.77   9.0   7.11   8.0   8.84
      11.0   8.33  11.0  9.26  11.0   7.81   8.0   8.47
      14.0   9.96  14.0  8.1   14.0   8.84   8.0   7.04
       6.0   7.24   6.0  6.13   6.0   6.08   8.0   5.25
       4.0   4.26   4.0  3.1    4.0   5.39  19.0  12.50 
      12.0  10.84  12.0  9.13  12.0   8.15   8.0   5.56
       7.0   4.82   7.0  7.26   7.0   6.42   8.0   7.91
       5.0   5.68   5.0  4.74   5.0   5.73   8.0   6.89]

We can convert a matrix to a `DataFrame` by calling its constructor; passing `:auto` as the second argument asks for automatically generated column names.

In [None]:
df = DataFrame(aq, :auto)

Note that the auto-generated column names are `x1`, `x2`, etc.
Next we replace the automatically generated column names with proper ones.

In [None]:
# See broadcast() to understand how this works.
# The first array is a 2-element vector (treated as a column) and the second is a 1x4 matrix,
# so broadcasting produces a 2x4 matrix.
newname_mat = string.(["x", "y"], [1 2 3 4])

In [None]:
newnames = vec(newname_mat) # vec() flattens the matrix into a vector, column by column
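
The two steps above can be checked in isolation. The following is a minimal sketch that uses only Base Julia; the variable names `prefixes`, `suffixes`, `name_mat`, and `flat` are made up for this illustration.

```julia
# A 2-element vector behaves like a 2x1 column when broadcast against a 1x4 row.
prefixes = ["x", "y"]
suffixes = [1 2 3 4]          # spaces (no commas) build a 1x4 Matrix

name_mat = string.(prefixes, suffixes)
@show size(name_mat)          # (2, 4)

# vec() walks the matrix column by column (column-major order).
flat = vec(name_mat)
@show flat                    # ["x1", "y1", "x2", "y2", "x3", "y3", "x4", "y4"]
```

Because the flattening is column-major, the names alternate between "x" and "y", which matches the column layout of `aq`.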

In [None]:
rename!(df, newnames)

We could have also assigned the names to columns at the moment of data frame creation like this:

In [None]:
DataFrame(aq, [:x1, :y1, :x2, :y2, :x3, :y3, :x4, :y4])

> You might have noticed that in the first example we used strings (e.g. `"x1"`) as column names
> and in the second one we used `Symbol`s (e.g. `:x1`).  This was intentional.  `DataFrames.jl` allows you to use either
> of them for column indexing.

To see the above rule at work let us extract the second column `:y1` from the data frame.  Here are several ways to
do it:

In [None]:
df.y1

In [None]:
df."y1"

In [None]:
df[:, :y1]

In [None]:
df[:, "y1"]

Assume that we now want to reorder the columns of the data frame `df` in place, putting the "x" columns first and the
"y" columns after them.

This can be easily achieved with the `select!` function.

Note that in column selection we can use regular expressions like `r"x"` (matching all columns that have "x"
in their name) and `:`, which matches all columns. Columns that were already selected are not repeated, so in this case `:` only adds the columns without "x" in their name.

In [None]:
select!(df, r"x", :)

Note that we could have used `select` instead of `select!` to create a new data frame (instead of mutating the data
frame in place).
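
To make the difference concrete, here is a small sketch on a hypothetical toy data frame (`toy` and `x_first` are illustrative names, not part of the notebook's data). It also uses `names(toy, r"x")` to list which columns the regular expression matches.

```julia
using DataFrames

# Hypothetical toy data frame, used only for this illustration.
toy = DataFrame(y1 = [1, 2], x1 = [3, 4], y2 = [5, 6], x2 = [7, 8])

# select returns a new data frame: "x" columns first, then the remaining ones.
x_first = select(toy, r"x", :)

@show names(toy, r"x")   # columns matched by the regular expression
@show names(x_first)     # reordered columns in the new data frame
@show names(toy)         # the original column order is untouched
```

Calling `select!(toy, r"x", :)` instead would reorder `toy` itself.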

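An interesting feature of Anscombe's quartet is that the four "x" variables share the same mean and variance, and the four "y" variables do so almost exactly.

We can easily check this using the `describe` function (it reports the standard deviation, which is the square root of the variance).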

In [None]:
describe(df, mean=>:mean, std=>:std)
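
As a cross-check that does not depend on `describe`, the same statistics can be computed directly with the `Statistics` standard library. This sketch copies the "x" values of the first and fourth data sets from the matrix `aq`; the vector names `x1` and `x4` are chosen here for illustration.

```julia
using Statistics

# x values of the first and fourth data sets, copied from the matrix `aq`.
x1 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
x4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]

@show mean(x1), mean(x4)        # both means
@show var(x1), var(x4)          # both sample variances
@show std(x1) ≈ sqrt(var(x1))   # std is the square root of the sample variance
```

Although the two vectors look very different, both have mean 9 and sample variance 11.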

Now let us add a new column `id` to the data frame that simply numbers its rows from 1 to the number of rows.

In [None]:
df.id = axes(df,1) # or 1:nrow(df)
df

Similar to `nrow`, which gives the number of rows in a data frame, `ncol` gives the number of columns.

In [None]:
ncol(df)

Move the "id" column to the front.

In [None]:
select(df, "id", :) # returns a copy; df itself is not changed

Get data in matrix form.

In [None]:
Matrix(df)

In [None]:
collect(extrema(Matrix(select(df, r"x"))))

In [None]:
extrema(Matrix(select(df, r"x"))) .+ (-1,1) # padding to enlarge the range for plotting
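
The padding trick relies on broadcasting over two tuples. The sketch below shows each intermediate value on a hypothetical 2x2 matrix `m` that stands in for `Matrix(select(df, r"x"))`; the names `lo_hi`, `padded`, and `limits` are illustrative.

```julia
# Hypothetical 2x2 matrix standing in for Matrix(select(df, r"x")).
m = [4.0 8.0; 14.0 19.0]

lo_hi = extrema(m)           # (minimum, maximum) over all entries, as a tuple
padded = lo_hi .+ (-1, 1)    # element-wise: minimum - 1 and maximum + 1
limits = collect(padded)     # tuple -> Vector, convenient for plotting calls

@show lo_hi padded limits
```

Here `lo_hi` is `(4.0, 19.0)`, so `limits` becomes `[3.0, 20.0]`.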

In [None]:
xlim = collect(extrema(Matrix(select(df, r"x"))) .+ (-1,1))

In [None]:
ylim = collect(extrema(Matrix(select(df, r"y"))) .+ (-1,1))

In [None]:
#plt.rcParams["figure.figsize"] = (400,400)
# axs is a 2x2 array of axes; axs[i] below indexes it linearly, i.e. column by column
fig, axs = plt.subplots(2,2)
fig.tight_layout(pad=4.0)
for i in 1:4
    x = Symbol("x", i) # x1, x2 ...
    y = Symbol("y", i)
    
    model = lm(term(y)~term(x), df)
    axs[i].plot(xlim, predict(model, DataFrame(x=>xlim)), color="orange")
    axs[i].scatter(df[:,x], df[:,y])
    axs[i].set_xlim(xlim)
    axs[i].set_ylim(ylim)
    axs[i].set_xlabel("x$i")
    axs[i].set_ylabel("y$i")
    a, b = round.(coef(model), digits=2)
    c = round(100 * r2(model), digits=2)
    axs[i].set_title("R²=$c%, $y=$a+$b$x")
end
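
For a single predictor, the coefficients and R² shown in the panel titles can also be obtained from textbook closed-form formulas. The sketch below is not how `GLM.jl` fits the model internally; it is a hand computation for the first data set (values copied from `aq`) that uses only the `Statistics` standard library, with illustrative variable names.

```julia
using Statistics

# First data set (columns 1 and 2 of `aq`).
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

# Ordinary least squares for y = intercept + slope * x.
slope = cov(x, y) / var(x)
intercept = mean(y) - slope * mean(x)

# With one predictor and an intercept, R² equals the squared correlation.
r_squared = cor(x, y)^2

@show round(intercept, digits=2) round(slope, digits=2)
@show round(100 * r_squared, digits=2)
```

The fitted line is approximately y = 3.0 + 0.5x, with R² close to two thirds.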

It is easy to create a data frame from variables holding column names and values using `=>`.

In [None]:
x = :var1
y = :var2
xc = 1:3
yc = 4:6
DataFrame(x=>xc, y=>yc)

In [None]:
# direct access to the column stored in `df`
df.x1

In [None]:
# copy a column
df[:, :x1]

In [None]:
# the special row selector `!` returns the stored column without copying it
v = df[!, :x1]

In [None]:
v === df.x1
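
The practical consequence of copying versus not copying shows up when the extracted vector is mutated. Here is a minimal sketch on a hypothetical toy data frame (`toy`, `no_copy`, and `copied` are illustrative names):

```julia
using DataFrames

toy = DataFrame(x1 = [1.0, 2.0, 3.0])   # hypothetical toy data

no_copy = toy[!, :x1]   # the stored column itself
copied  = toy[:, :x1]   # a freshly allocated copy

@show no_copy === toy.x1   # same object
@show copied === toy.x1    # different object ...
@show copied == toy.x1     # ... with equal contents

no_copy[1] = 100.0         # writes through to the data frame
copied[2] = -1.0           # only changes the copy
@show toy.x1
```

After these assignments `toy.x1` is `[100.0, 2.0, 3.0]`: the write through `no_copy` is visible in the data frame, while the write to `copied` is not.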