## Working with data

## In this lecture

- [Introduction](#Introduction)
- [Distributions](#Distributions)
    - [The normal distribution](#The-normal-distribution)
    - [Other distributions](#Other-distributions)
- [DataFrames](#DataFrames)
    - [Combining dataframes](#Combining-dataframes)
    - [Grouping](#Grouping)
    - [Sorting](#Sorting)
    - [Unique rows only](#Unique-rows-only)
    - [Deleting rows](#Deleting-rows)

## Introduction

The ability to work with data is fundamental to most modern programming tasks. This lecture gives a brief introduction to how Julia handles data through the `Distributions.jl` and `DataFrames.jl` packages.

[Back to the top](#In-this-lecture)

## Distributions

The data point values of a variable usually follow a pattern. Such patterns are called distributions, and they are either discrete or continuous. The `Distributions.jl` package contains most of the common distributions.

We will also use the `Random` standard library module to seed the pseudo-random number generator, so that the random values used in this lecture can be reproduced.

In [None]:
using Distributions
using Random

[Back to the top](#In-this-lecture)

### The normal distribution

The normal distribution is the famous bell-shaped curve that we are familiar with.  Values around the mean occur most frequently and as values get progressively further away from the mean, they occur less frequently.

In [None]:
# Seed the pseudo-random number generator
Random.seed!(1234)

# Saving the standard normal distribution as an object
n = Distributions.Normal()  # This function is from the Distributions package

# Parameter values of the standard normal distribution
params(n)

Using the `params()` function, we note a mean of $0$ and a standard deviation of $1$. A normal distribution with these parameters is called the _standard normal distribution_.

The `fieldnames()` function returns the names of the fields that store the parameters of a distribution type. In the case of the normal distribution, these are the mean and the standard deviation, namely $\mu$ and $\sigma$.

In [None]:
# Returning the parameters of the normal distribution
fieldnames(Normal)

Now we create a variable called `var1` and use the `rand()` function to select $10$ random values from the standard normal distribution.

In [None]:
# Seed the pseudo-random number generator
Random.seed!(1234)

# Select 10 elements at random from n
var1 = rand(n, 10);

We can calculate the average and standard deviation of our randomly selected values.

In [None]:
# Average
mean(var1)

In [None]:
# Standard deviation
std(var1)

The `pdf()` function evaluates the probability density function of a given distribution at a specified point. For a continuous distribution this is the height of the density curve at that point, not a probability.

In [None]:
# Probability density function value at x = 0.3
pdf(Normal(), 0.3)

The `cdf()` function calculates the cumulative distribution function value of a given distribution, that is, the probability accumulated from $- \infty$ up to a specified point.

In [None]:
# Cumulative distribution function at x = 0.25
cdf(Normal(), 0.25)
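
The difference between the two functions is easiest to see side by side. The following is a minimal sketch, assuming only the `Distributions.jl` package; the variable names are chosen for illustration.

```julia
using Distributions

d = Normal()  # standard normal distribution

# Height of the density curve at a single point (equals 1 / sqrt(2π) at x = 0)
density_at_0 = pdf(d, 0.0)

# Probability accumulated from -∞ up to the point (0.5 at x = 0, by symmetry)
prob_below_0 = cdf(d, 0.0)

# The probability of falling inside an interval is a difference of cdf values
prob_within_1sd = cdf(d, 1.0) - cdf(d, -1.0)
```

The last value is roughly $0.68$, the familiar share of values within one standard deviation of the mean.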

The values for the average and standard deviation can be specified.

In [None]:
# Creating 100 data point values from a normal distribution
# with a mean of 100 and a standard deviation of 10
Random.seed!(1234)
var2 = rand(Normal(100, 10), 100);

In [None]:
# Calculating the mean of var2
mean(var2)

In [None]:
# Calculating the standard deviation of var2
std(var2)

Given a set of values, the `fit()` function estimates the parameters of a specified distribution type from those values.

In [None]:
# Using fit() to calculate the parameters of a distribution
fit(Normal, var2)
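
The fitted parameters will be close to, but not exactly the same as, the results of `mean()` and `std()`. The sketch below assumes that `fit()` uses maximum likelihood estimation for the normal distribution, in which case the fitted $\sigma$ divides by $n$ rather than $n - 1$. The names `x`, `mu_hat`, and `sigma_hat` are chosen for illustration.

```julia
using Distributions
using Random
using Statistics

Random.seed!(1234)
x = rand(Normal(100, 10), 100)

# Estimate the parameters and unpack them
fitted = fit(Normal, x)
mu_hat, sigma_hat = params(fitted)

# The fitted mean agrees with the sample mean
mean_matches = mu_hat ≈ mean(x)

# The fitted σ agrees with the uncorrected standard deviation (divides by n),
# which is slightly smaller than the default std(x) (divides by n - 1)
sigma_matches = sigma_hat ≈ std(x, corrected = false)
sigma_smaller = sigma_hat < std(x)
```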

The `quantile()` function returns the value at a specific percentile (provided as a fraction). Below we calculate the 2.5th and 97.5th percentile values of the standard normal distribution.

In [None]:
# Quantiles
quantile(Normal(), 0.025)

In [None]:
quantile(Normal(), 0.975)
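
Because `quantile()` is the inverse of `cdf()`, the two values above enclose the central $95$% of the distribution. A short check, assuming only `Distributions.jl` and with illustrative variable names:

```julia
using Distributions

d = Normal()

lower = quantile(d, 0.025)
upper = quantile(d, 0.975)

# Probability mass between the two quantiles
central_mass = cdf(d, upper) - cdf(d, lower)

# Applying cdf() to a quantile recovers the original fraction
round_trip = cdf(d, quantile(d, 0.3))
```

Here `central_mass` is approximately $0.95$ and `round_trip` is approximately $0.3$.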

[Back to the top](#In-this-lecture)

### Other distributions

There are many distributions in the `Distributions.jl` package. In the code below, a few of these are showcased by setting parameters, selecting random values, and fitting those values back to the distribution or returning the parameter field names.

In [None]:
# Beta distribution
b = Beta(1, 1)
params(b)
Random.seed!(1234)
var3 = rand(b, 100);
fit(Beta, var3)

In [None]:
# χ2 distribution
c = Chisq(1)
Random.seed!(1234)
var4 = rand(c, 100)
fieldnames(Chisq) # Degrees of freedom

[Back to the top](#In-this-lecture)

## DataFrames

The `DataFrames.jl` package allows for the creation of a flat data structure (rows and columns). Columns are variables and rows are subjects (examples).

In [None]:
using DataFrames

Below, we create an empty dataframe object that we call `df`.

In [None]:
# Create an empty DataFrame
df = DataFrame();

Column headers representing statistical variable names are entered in square brackets as symbols, i.e. preceded by a colon. We will attach the `var2` set of values as the data point entries for this statistical variable.

In [None]:
# Add a column with data point values (rows)
df[:, :Var2] = var2;

We can print the first $5$ rows to the screen with the `first()` function.

In [None]:
# View first five rows
first(df, 5)

Below, we create another statistical variable from the data point values in `var3`, which we created earlier.

In [None]:
# Add another column
df[:, :Var3] = var3;

The `last()` function shows the specified number of rows from the end of the dataframe.

In [None]:
# View last three rows
last(df, 3)

The `size()` function returns a tuple with the number of rows and the number of columns.

In [None]:
# Dimensions of a DataFrame
size(df)

The `describe()` function attempts to provide summary statistics of the variables.

In [None]:
# Summarize the content
describe(df)

The data type for each variable can be returned.

In [None]:
# Data type only
eltype.(eachcol(df))

Below we create a new instance of a dataframe object called `df2`.  It contains four statistical variables.  Note the use of symbol notation in creating the names of these variables.

In [None]:
# Create a bigger DataFrame
df2 = DataFrame()
df2[:, :A] = 1:10
df2[:, :B] = ["I", "II", "II", "I", "II","I", "II", "II", "I", "II"]
Random.seed!(1234)
df2[:, :C] = rand(Normal(), 10)
df2[:, :D] = rand(Chisq(1), 10);

By using indexing (in square brackets), we can refer to row and column values (i.e. _row, column_). Below is an example of selecting data point values for rows one through three, showing all the columns. The colon symbol serves as shortcut syntax for selecting everything along that dimension.

In [None]:
# First three rows with all the columns
df2[1:3, :]

If we want only specific columns rather than a range, for example columns one and three, we list them in a vector.

In [None]:
# All rows, columns 1 and 3
df2[:, [1, 3]]

Instead of indicating the column numbers, we can also reference the actual column names (statistical variable names), using symbol notation, i.e. `:A`.

In [None]:
# Different notation
df2[:, [:A, :C]]
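
When a single column is selected, the choice between `:` and `!` as the row index matters. The sketch below uses a small hypothetical dataframe called `demo` to show that `demo[:, :C]` returns a copy of the column, while `demo[!, :C]` (or the equivalent `demo.C`) returns the stored column itself.

```julia
using DataFrames

# A small illustrative dataframe (not part of the lecture data)
demo = DataFrame(A = 1:3, C = [0.5, 1.5, 2.5])

col_copy = demo[:, :C]    # a copy of the column
col_ref  = demo[!, :C]    # the column stored in the dataframe

# Changing the copy leaves the dataframe untouched
col_copy[1] = 99.0
after_copy_edit = demo.C[1]    # still 0.5

# Changing the reference changes the dataframe
col_ref[1] = -1.0
after_ref_edit = demo.C[1]     # now -1.0
```

Using `:` is the safer default when the selected values will be modified independently of the dataframe.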

The `CSV.read()` function from the `CSV.jl` package imports a comma-separated values (CSV) file. Passing `DataFrame` as the second argument returns the data as a dataframe.

In [None]:
# Install the CSV package first (this only needs to be done once)
import Pkg

Pkg.add("CSV")

In [None]:
using CSV

The file is saved in the same directory / folder as this notebook file.

In [None]:
# Import csv file (in same directory / folder)
data1 = CSV.read("CCS.csv", DataFrame);

Using the `typeof()` function, we note that we now have an instance of a dataframe object.

In [None]:
typeof(data1)

Let's view the first five rows of data.

In [None]:
first(data1, 5)

The `describe()` function will attempt to summarize all the variables. For categorical variables stored as strings, the minimum and maximum are the first and last values in alphabetical order.

In [None]:
describe(data1)

[Back to the top](#In-this-lecture)

### Combining dataframes

Combining dataframes on a common variable is a very useful operation.  Below we create two dataframe instances.  Note that both have a `Number` variable.

In [None]:
# Creating DataFrames
subjects = DataFrame(Number = [100, 101, 102, 103], Stage = ["I", "III", "II", "I"])
treatment  = DataFrame(Number = [103, 102, 101, 100], Treatment = ["A", "B", "A", "B"]);

In [None]:
# Joining the two dataframes based on Number column
df3 = innerjoin(subjects, treatment, on = :Number);
df3

In [None]:
# Adding a longer list of subjects
subjects = DataFrame(Number = [100, 101, 102, 103, 104, 105], Stage = ["I", "III", "II", "I", "II", "II"]);

An outer join keeps every row from both dataframes. Where a row has no match in the other dataframe, the empty fields are filled with `missing`. The `outerjoin()` function performs this operation.

In [None]:
# Outer join: empty fields filled with missing
df4  = outerjoin(subjects, treatment, on = :Number);
df4
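
Other join types follow the same pattern. As a brief sketch, `leftjoin()` keeps every row of the first dataframe only, and `dropmissing()` removes rows that contain `missing` values. The dataframes are recreated here so that the example runs on its own; the names `left` and `complete` are chosen for illustration.

```julia
using DataFrames

subjects = DataFrame(Number = [100, 101, 102, 103, 104, 105],
                     Stage = ["I", "III", "II", "I", "II", "II"])
treatment = DataFrame(Number = [103, 102, 101, 100],
                      Treatment = ["A", "B", "A", "B"])

# Keep all subjects; subjects 104 and 105 have no treatment and get missing
left = leftjoin(subjects, treatment, on = :Number)

# Number of missing entries in the Treatment column
n_missing = count(ismissing, left.Treatment)

# Keep only the rows without any missing values
complete = dropmissing(left)
```

Here `left` has six rows, two of which have a `missing` treatment, and `complete` has the four fully populated rows.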

[Back to the top](#In-this-lecture)

### Grouping

A dataframe can be _split_ into groups of rows according to the values in a variable.

In [None]:
# Creating a new DataFrame
df5 = DataFrame(Group = rand(["A", "B", "C"], 15), Variable1 = randn(15), Variable2 = rand(15));

# Show first 5 rows
first(df5, 5)

We first group the rows with `groupby()` and then use `combine()` to apply a function to each group. The `combine()` function can receive the function to apply as its first argument and a grouped dataframe as its second argument. For more information, please run `?combine`. Below we pass `size` as the function, which returns the number of rows and columns for each unique value found in the grouping variable.

In [None]:
# We define grouped dataframe first
group_df = groupby(df5, :Group);

In [None]:
# Grouping using combine()
combine(size, group_df)

Each tuple returned above holds the number of rows in the group followed by the number of columns. The first value is the number of instances of that unique value in the grouping variable. Since the dataframe has three columns, the second value is always $3$.

Below we create a dataframe instance that shows only the count of the unique values.

In [None]:
# Count unique data point values in :Group column
combine(group_df, nrow => :Count)

We can use the grouped dataframe to calculate the mean and the standard deviation of each variable within each of the groups (A, B, and C).

In [None]:
# 1st way
combine(group_df, [:Variable1, :Variable2] .=> mean, [:Variable1, :Variable2] .=> std)

A generator expression that loops over the functions can shorten the code above.

In [None]:
# 2nd way - shortened
combine(group_df, ([:Variable1, :Variable2] .=> f for f in (mean, std))...)

The `groupby()` function actually creates sub-dataframes based on the unique values found in the specified variable.

In [None]:
# Group
groupby(df5, :Group)

By calling the `length()` function, we note that there are indeed three sub-dataframes.

In [None]:
length(groupby(df5, :Group))

Using indexing, we can select any of the three sub-dataframes. For example below:

In [None]:
groupby(df5, :Group)[2]

[Back to the top](#In-this-lecture)

### Sorting

The `sort!()` function sorts a dataframe based on the columns we specify. A vector of column names can be provided to sort by more than one variable. Note that the bang (`!`) version sorts in place, so the original `df5` is permanently reordered and `df5_sorted` refers to the same dataframe.

In [None]:
df5_sorted = sort!(df5, [:Group, :Variable1]);
first(df5_sorted, 7)
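
When the original order should be preserved, the non-mutating `sort()` function returns a sorted copy instead. The sketch below uses a small hypothetical dataframe called `demo`, and also shows the `rev` keyword and the `order()` helper for descending order.

```julia
using DataFrames

# A small illustrative dataframe (not part of the lecture data)
demo = DataFrame(Group = ["B", "A", "C", "A"],
                 Variable1 = [0.3, -1.2, 2.0, 0.7])

# Sorted copy in descending order; demo itself is not changed
sorted_desc = sort(demo, :Variable1, rev = true)

# Ascending by Group, then descending by Variable1 within each group
mixed = sort(demo, [order(:Group), order(:Variable1, rev = true)])
```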

[Back to the top](#In-this-lecture)

### Unique rows only

Below we create a dataframe with two identical rows.

In [None]:
# Creating a DataFrame with an obvious duplicate row
df6 = DataFrame(A = [1, 2, 2, 3, 4, 5],  B = [11, 12, 12, 13, 14, 15], C = ["A", "B", "B", "C", "D", "E"]);
df6

The `unique()` function returns a new dataframe without the duplicate row. The original dataframe is left unchanged.

In [None]:
# Only unique rows
unique(df6)

As always, the bang version makes the change permanent.

In [None]:
# Permanent change
unique!(df6)
df6

[Back to the top](#In-this-lecture)

### Deleting rows

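The `deleteat!()` function (permanent bang version used here) deletes the specified rows. Older releases of `DataFrames.jl` named this function `delete!()`, which recent releases deprecate.

In [None]:
# Permanently delete rows 1 and 5
deleteat!(df6, [1, 5])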
df6
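
Rows can also be removed by condition rather than by position. As a minimal sketch on a hypothetical dataframe called `demo`, both a Boolean row index and the `filter()` function return a new dataframe containing only the rows that satisfy the condition.

```julia
using DataFrames

# A small illustrative dataframe (not part of the lecture data)
demo = DataFrame(A = [1, 2, 3, 4, 5], B = [11, 12, 13, 14, 15])

# Keep the rows where A is greater than 2, using a Boolean index
kept_by_index = demo[demo.A .> 2, :]

# The same selection with filter(); demo is left unchanged
kept_by_filter = filter(:A => >(2), demo)
```

Both results contain the three rows with `A` equal to $3$, $4$, and $5$.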

[Back to the top](#In-this-lecture)