# Basic data analysis tasks

- Data: Fitness in Arabidopsis recombinant inbred lines
- Data input: CSV files, missing data
- Data description: Summary statistics, plots
- Modeling: Linear regression

## Example: Fitness measured in Arabidopsis recombinant inbred lines

Topics: Reading data, JIT compiler, missing data, summary statistics, plots, linear regression 

In [None]:
using Statistics, CSV, Plots, DataFrames, GLM

## Reading data

In [None]:
agrenURL = "https://raw.githubusercontent.com/sens/smalldata/master/arabidopsis/agren2013.csv"
agren = CSV.read(download(agrenURL),DataFrame,missingstring="NA");
first(agren,10)

## Data description

In [None]:
describe(agren)

## Calculating summary statistics

In [None]:
mean(skipmissing(agren.it09))

In [None]:
mean.(skipmissing.(eachcol(agren)))

In [None]:
?skipmissing
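
To see why `skipmissing` is needed, here is a minimal sketch on a made-up vector (the values are illustrative and not taken from the data). Arithmetic involving `missing` propagates `missing`, while `skipmissing` iterates lazily over the observed values only.

```julia
using Statistics

# Toy vector with one missing value (illustrative values only)
v = [1.0, missing, 3.0]

naive = mean(v)                 # missing propagates through the sum
clean = mean(skipmissing(v))    # averages only the observed values 1.0 and 3.0
nobs  = count(!ismissing, v)    # number of observed entries
```

Here `naive` is `missing`, which is why the column means above wrap each column in `skipmissing`.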

## Visualization: histogram

In [None]:
histogram(Float64.(skipmissing(agren.it09)),lab="")
# display.(histogram.(eachcol(agren)))

## Visualization: scatterplot

In [None]:
scatter(log2.(agren.it09),log2.(agren.it10),lab="")

## Modeling: linear regression

In [None]:
out0 = lm(@formula(it11~it09+flc),agren)

## Extracting information

In [None]:
coef(out0)

In [None]:
vcov(out0)
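
The variance-covariance matrix is most often used to get standard errors of the coefficients. A minimal sketch on simulated data (the data frame `df`, its columns, and the seed are assumptions made for illustration, not part of the Arabidopsis data):

```julia
using DataFrames, GLM, LinearAlgebra, Random

# Simulated data (hypothetical, for illustration only)
rng = MersenneTwister(1)
df = DataFrame(x = randn(rng, 50))
df.y = 1.0 .+ 2.0 .* df.x .+ randn(rng, 50)

model = lm(@formula(y ~ x), df)

# Standard errors are the square roots of the diagonal of vcov
se = sqrt.(diag(vcov(model)))

# The built-in accessor should give the same values
se_builtin = stderror(model)
```

The same calls should work on `out0`, where `stderror(out0)` is the shorter route.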

## Residual plots

In [None]:
# fitted values on the x-axis, residuals on the y-axis
scatter(predict(out0),residuals(out0),lab="")

## GLM

In [None]:
out1 = glm(@formula( it11 ~ log(it09) ),agren,Normal(),LogLink())

## Chaining and automation of model fitting

In [None]:
yr = ["09","10","11"]
sw = ("sw" .* yr);
it = ("it" .* yr);
models = Term.(Symbol.(sw)) .~ Term.(Symbol.(it)) .+ Term.(:flc)

broadcast( m->glm(m,agren,Normal(),LogLink()), models ) .|> coeftable

In [None]:
( Term.(Symbol.(sw)) .~ permutedims(Term.(Symbol.(it))) .+ Term.(:flc) )  .|> 
                     ( m->glm(m,agren,Normal(),LogLink()) ) .|>
                     print;
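
The two cells above differ only in the shapes being broadcast. A small sketch with plain strings in place of model terms (the `" ~ "` separator and the variable names `paired`, `crossed`, and `lengths` are illustrative assumptions): a vector broadcast against a vector pairs elements one to one, while a vector broadcast against a row matrix from `permutedims` gives every combination.

```julia
yr = ["09", "10", "11"]
sw = "sw" .* yr                  # ["sw09", "sw10", "sw11"]
it = "it" .* yr                  # ["it09", "it10", "it11"]

# Vector with vector: elementwise pairing, 3 combinations
paired = sw .* " ~ " .* it

# Vector with 1x3 row matrix: all 3x3 combinations
crossed = sw .* " ~ " .* permutedims(it)

# .|> pipes each element into a function
lengths = paired .|> length
```

This is why the first cell fits three models (same year for both sites) and the second fits nine.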

## Generating random numbers

In [None]:
using Random
using Distributions

In [None]:
# initialize random number generator with a fixed seed
rnd = MersenneTwister(100)
# Draw uniform (0,1) numbers in a 4x4 matrix
rand(rnd,4,4)
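
Passing a seeded generator as the first argument to `rand` makes the draws reproducible. A minimal sketch (the names `rng1`, `rng2`, `a`, and `b` are illustrative):

```julia
using Random

# Two generators with the same seed (100, as used above)
rng1 = MersenneTwister(100)
rng2 = MersenneTwister(100)

a = rand(rng1, 4, 4)   # 4x4 matrix of uniform draws in [0, 1)
b = rand(rng2, 4, 4)   # identical draws, because the seed is identical

same = (a == b)
in_unit_interval = all(0 .<= a .< 1)
```

Calling `rand` without a generator uses the global one, so results change from run to run unless it is seeded too.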

## Calculating probabilities and densities

Calculating the CDF of the standard normal distribution.

In [None]:
cdf(Normal(0,1),1.96)

Quantiles of the normal distribution.

In [None]:
quantile(Normal(),0.95)
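
`cdf` and `quantile` are inverses of each other for a continuous distribution. A short sketch checking this for the standard normal (variable names are illustrative):

```julia
using Distributions

d = Normal()              # standard normal
q = quantile(d, 0.95)     # point with 95% of the mass below it
p = cdf(d, q)             # recovers 0.95 up to rounding
tail = ccdf(d, q)         # upper tail probability, 1 - cdf
```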

Generating random normal variables with mean 0.3 and standard deviation 0.5.

In [None]:
x = rand(Normal(0.3,0.5),1000)

We now fit the MLE assuming a normal distribution.

In [None]:
normFit = fit_mle(Normal,x)
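
For the normal distribution the MLE has a closed form: the sample mean and the standard deviation computed with divisor `n` rather than `n-1`. A minimal sketch that checks this against `fit_mle` (the seed and the variable names are arbitrary choices for illustration):

```julia
using Distributions, Statistics, Random

rng = MersenneTwister(1)                  # seed chosen arbitrarily
x = rand(rng, Normal(0.3, 0.5), 1000)

est = fit_mle(Normal, x)
mu_hat, sigma_hat = params(est)           # estimated mean and standard deviation

# Closed-form MLE: sample mean and uncorrected standard deviation
mu_check = mean(x)
sigma_check = std(x; corrected = false)
```

With 1000 draws the estimates should land close to the true values 0.3 and 0.5, though not exactly on them.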

We generate normal variables with parameters equal to the estimated ones.

In [None]:
rand(normFit,10)