## DataFrames and broadcasting in Julia

In Julia, arrays are the fundamental way to store and manipulate collections of data. Arrays can have any number of dimensions and can hold elements of various types, such as integers, floating-point numbers, and strings. In addition to arrays, the DataFrames.jl package provides a tabular data structure similar to a spreadsheet or database table.


## DataFrames

A `DataFrame` stores data in a tabular format with rows and columns, much like a spreadsheet or database table. Each column has a name and an element type, and each row holds one observation. Before creating a `DataFrame`, we activate the project environment located in the current directory:

In [46]:
cd(@__DIR__)
using Pkg; Pkg.activate(".");

[32m[1m  Activating[22m[39m project at `~/ETHZ/PostDoc_ELE/teaching/WSL_workshop_Julia/material/Day1/32_dataframe_tuto`


### Constructing `DataFrame`s

In [47]:
using DataFrames
using Statistics

df = DataFrame(grp=repeat(1:2, 3), x=6:-1:1, y=4:9, z=[3:7; missing], id='a':'f')

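| Row | grp | x | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 2 | 5 | 5 | 4 | b |
| 3 | 1 | 4 | 6 | 5 | c |
| 4 | 2 | 3 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 1 | 9 | missing | f |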


In [48]:
df2 = DataFrame(grp=[1, 3], w=[10, 11])

| Row | grp | w |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 10 |
| 2 | 3 | 11 |


In [49]:
names(df)

5-element Vector{String}:
 "grp"
 "x"
 "y"
 "z"
 "id"

Constructing row by row

In [50]:
df3 = DataFrame(A=Int[], B=String[])
push!(df3, (1, "M"))

| Row | A | B |
|---|---|---|
| | Int64 | String |
| 1 | 1 | M |
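
Rows can be appended in more than one form. The following is a minimal sketch, assuming an empty two-column data frame with the hypothetical name `people`; it appends a row as a tuple, as a `NamedTuple`, and as a `Dict`. The number of values must match the number of columns, otherwise `push!` raises a `DimensionMismatch`.

```julia
using DataFrames

# `people` is a hypothetical name used only for this sketch
people = DataFrame(A=Int[], B=String[])

push!(people, (1, "M"))                   # tuple: values follow the column order
push!(people, (A=2, B="F"))               # NamedTuple: values are matched by column name
push!(people, Dict(:A => 3, :B => "M"))   # Dict keyed by column name

size(people)   # (3, 2)
```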

### Accessing data

Cell indexing by location

In [51]:
df[2, 2]

5

Row slicing by location

In [52]:
df[2:3, :]

| Row | grp | x | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 2 | 5 | 5 | 4 | b |
| 2 | 1 | 4 | 6 | 5 | c |


Column indexing

In [53]:
df[:, :x]

6-element Vector{Int64}:
 6
 5
 4
 3
 2
 1

In [54]:
df.x

6-element Vector{Int64}:
 6
 5
 4
 3
 2
 1

In [55]:
df[:, [:x, :z]]

| Row | x | z |
|---|---|---|
| | Int64 | Int64? |
| 1 | 6 | 3 |
| 2 | 5 | 4 |
| 3 | 4 | 5 |
| 4 | 3 | 6 |
| 5 | 2 | 7 |
| 6 | 1 | missing |


Row indexing by label

In [56]:
df[df.id .== 'c', :]

| Row | grp | x | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 4 | 6 | 5 | c |


Notice the `.` in front of the `==`. More on that in a minute!
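
The same row selection can be written in several ways. The sketch below uses a small hypothetical data frame named `demo` (not the `df` of this tutorial) and compares the Boolean mask used above with the `filter` and `subset` functions of DataFrames.jl. It is meant as an illustration rather than a recommendation of one style.

```julia
using DataFrames

# `demo` is a hypothetical data frame used only for this sketch
demo = DataFrame(grp=repeat(1:2, 3), x=6:-1:1, id='a':'f')

mask = demo.id .== 'c'        # element-wise comparison gives a BitVector
by_mask = demo[mask, :]       # keep the rows where the mask is true

by_filter = filter(:id => ==('c'), demo)          # predicate applied to each value of :id
by_subset = subset(demo, :id => ByRow(==('c')))   # same condition, written with ByRow

by_mask == by_filter == by_subset   # true
```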

### Changing the data stored in a dataframe

In [59]:
df.x = rand([1,2,3,4],length(df.x))
df

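| Row | grp | x | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 2 | 4 | 3 | a |
| 2 | 2 | 3 | 5 | 4 | b |
| 3 | 1 | 1 | 6 | 5 | c |
| 4 | 2 | 2 | 7 | 6 | d |
| 5 | 1 | 4 | 8 | 7 | e |
| 6 | 2 | 2 | 9 | missing | f |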


Equivalently, one can use the syntax

In [60]:
df[!,:x] = rand([1,2,3,4],length(df.x))

6-element Vector{Int64}:
 1
 1
 3
 1
 2
 3

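As a special rule, using `!` as the row selector replaces the whole column with the new vector, without copying it.

In contrast, using `:` as the row selector writes the new values into the existing column in place, so they must be convertible to the column's element type:
```julia
df[:,:x] = rand([1,2,3,4],length(df.x))
```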
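
The difference between the `!` and `:` row selectors can be checked directly. The following is a small sketch on a hypothetical one-column data frame named `demo`; it looks at whether reading a column copies it, and at what assignment does to the vector that was stored before.

```julia
using DataFrames

# `demo` is a hypothetical data frame used only for this sketch
demo = DataFrame(x=[6, 5, 4])

no_copy = demo[!, :x] === demo.x      # true: `!` returns the stored vector itself
copied  = demo[:, :x] !== demo.x      # true: `:` returns a fresh copy

old = demo.x                          # keep a handle on the current column vector
demo[:, :x] = [1, 2, 3]               # in place: overwrites the values inside `old`
updated_in_place = old == [1, 2, 3]   # true

demo[!, :x] = [1.5, 2.5, 3.5]         # replaces the column, so the element type can change
new_eltype = eltype(demo.x)           # Float64
old_untouched = old == [1, 2, 3]      # true: `old` is no longer a column of `demo`
```

Because `:` writes into the existing vector, assigning `[1.5, 2.5, 3.5]` with `demo[:, :x] = ...` to an integer column would fail, while the `!` form simply swaps in the new vector.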

### Common operations

In [73]:
describe(df)

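| Row | variable | mean | min | median | max | nmissing | eltype |
|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | grp | 1.5 | 1 | 1.5 | 2 | 0 | Int64 |
| 2 | x | 1.83333 | 1 | 1.5 | 3 | 0 | Int64 |
| 3 | y | 6.5 | 4 | 6.5 | 9 | 0 | Int64 |
| 4 | z | 5.0 | 3 | 5.0 | 7 | 1 | Union{Missing, Int64} |
| 5 | id | | a | | f | 0 | Char |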


Reduce multiple values

In [19]:
mean(skipmissing(df.z))

5.0
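
The call to `skipmissing` matters because `missing` propagates through reductions. A minimal sketch, using a standalone vector `z` with the same values as the `z` column:

```julia
using Statistics

z = [3, 4, 5, 6, 7, missing]

m_raw  = mean(z)                 # missing: a single `missing` makes the result missing
m_skip = mean(skipmissing(z))    # 5.0: missing entries are skipped
total  = sum(coalesce.(z, 0))    # 25: replace `missing` by 0 before summing
n_miss = count(ismissing, z)     # 1
```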

Rename columns

In [20]:
rename(df, :x => :x_new)

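| Row | grp | x_new | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 2 | 5 | 5 | 4 | b |
| 3 | 1 | 4 | 6 | 5 | c |
| 4 | 2 | 3 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 1 | 9 | missing | f |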


`rename` returns a new data frame and leaves `df` unchanged. The variant whose name ends in `!` modifies `df` in place:

In [21]:
rename!(df, :x => :x_new)

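| Row | grp | x_new | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 2 | 5 | 5 | 4 | b |
| 3 | 1 | 4 | 6 | 5 | c |
| 4 | 2 | 3 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 1 | 9 | missing | f |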
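
The two calls print the same table, so the difference only shows up when inspecting the original data frame afterwards. A short sketch on a hypothetical data frame `demo`:

```julia
using DataFrames

# `demo` is a hypothetical data frame used only for this sketch
demo = DataFrame(x=1:3)

renamed = rename(demo, :x => :x_new)   # returns a new data frame
before = names(demo)                   # ["x"]: `demo` is unchanged
after_copy = names(renamed)            # ["x_new"]

rename!(demo, :x => :x_new)            # modifies `demo` itself
after_bang = names(demo)               # ["x_new"]
```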


Drop rows with missing values

In [23]:
dropmissing(df)

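| Row | grp | x_new | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 2 | 5 | 5 | 4 | b |
| 3 | 1 | 4 | 6 | 5 | c |
| 4 | 2 | 3 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |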


Select unique rows

In [24]:
unique(df)

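| Row | grp | x_new | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 2 | 5 | 5 | 4 | b |
| 3 | 1 | 4 | 6 | 5 | c |
| 4 | 2 | 3 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 1 | 9 | missing | f |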
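
Since every row of `df` is already distinct, `unique(df)` returns it unchanged. The sketch below uses a hypothetical data frame `demo` that does contain a duplicated row, and also shows the form that restricts the comparison to selected columns.

```julia
using DataFrames

# `demo` is a hypothetical data frame with one duplicated row
demo = DataFrame(grp=[1, 1, 2, 2], y=[4, 4, 5, 6])

all_cols = unique(demo)         # the repeated (1, 4) row is kept only once
by_grp   = unique(demo, :grp)   # first row for each distinct value of :grp

nrow(all_cols), nrow(by_grp)    # (3, 2)
```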


### Grouping data and aggregation
DataFrames.jl provides a `groupby` function to apply operations over each group independently. The result of `groupby` is a `GroupedDataFrame` object, which may be processed using the `combine`, `transform`, or `select` functions. The following examples illustrate some common grouping and aggregation usages.

In [27]:
dfg = groupby(df, :grp)

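| Row | grp | x_new | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 1 | 4 | 6 | 5 | c |
| 3 | 1 | 2 | 8 | 7 | e |

| Row | grp | x_new | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 2 | 5 | 5 | 4 | b |
| 2 | 2 | 3 | 7 | 6 | d |
| 3 | 2 | 1 | 9 | missing | f |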


In [28]:
dfg[1]

| Row | grp | x_new | y | z | id |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 1 | 4 | 6 | 5 | c |
| 3 | 1 | 2 | 8 | 7 | e |


Aggregate by groups

In [30]:
combine(groupby(df, :grp), :y => mean)

| Row | grp | y_mean |
|---|---|---|
| | Int64 | Float64 |
| 1 | 1 | 6.0 |
| 2 | 2 | 7.0 |
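
`combine` accepts several `source => function => target` pairs at once. The following sketch, on a hypothetical data frame `demo` with the same `grp`, `y` and `z` values as above, computes the group size and two means in one call. The output column names `n`, `y_mean` and `z_mean` are chosen here for illustration.

```julia
using DataFrames, Statistics

# `demo` is a hypothetical data frame used only for this sketch
demo = DataFrame(grp=repeat(1:2, 3), y=4:9, z=[3:7; missing])

stats = combine(groupby(demo, :grp),
    nrow => :n,                                      # number of rows per group
    :y => mean => :y_mean,                           # mean of y per group
    :z => (v -> mean(skipmissing(v))) => :z_mean)    # mean of z, ignoring missing values
```

The result has one row per group, with `n = 3` for both groups, `y_mean` equal to 6.0 and 7.0, and `z_mean` equal to 5.0 in both groups.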


### Reading a CSV file
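
A common way to read CSV files into a `DataFrame` is the CSV.jl package. The sketch below assumes that CSV.jl is installed in the active environment (it is not used elsewhere in this tutorial). It writes a small hypothetical data frame to a temporary file and reads it back.

```julia
using CSV, DataFrames

# `demo` and `path` are hypothetical names used only for this sketch
demo = DataFrame(grp=[1, 2], id=["a", "b"])
path = joinpath(mktempdir(), "demo.csv")   # file inside a fresh temporary directory

CSV.write(path, demo)                  # write the data frame to disk
loaded = CSV.read(path, DataFrame)     # read it back; column types are inferred

names(loaded)   # ["grp", "id"]
```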

## Broadcasting

Broadcasting is a powerful feature in Julia that allows for the application of operations to arrays or collections in a way that is both concise and efficient. Broadcasting can be thought of as a way to extend scalar operations to arrays or collections without having to use explicit loops.

In Julia, something like this fails

In [62]:
a = [1, 2, 3, 4]
b = a + 1

MethodError: MethodError: no method matching +(::Vector{Int64}, ::Int64)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar
Closest candidates are:
  +(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:591
  +(!Matched::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:87
  +(!Matched::Base.TwicePrecision, ::Number) at twiceprecision.jl:290
  ...

That's where we need broadcasting! Broadcasting is the application of an operation to each element of an array or collection.


The syntax for broadcasting is the use of the dot notation `.`, followed by the operator.

In [64]:
b = a .+ 1

4-element Vector{Int64}:
 2
 3
 4
 5

The dot notation tells Julia to apply the `+` operator to each element of `a` and the scalar value `1`.


Broadcasting can also be used with functions. For example, to apply the sin function to each element of an array, we can write:

In [69]:
a = [0, π/2, π, 3π/2, 2π]
b = sin.(a)

5-element Vector{Float64}:
  0.0
  1.0
  1.2246467991473532e-16
 -1.0
 -2.4492935982947064e-16
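
For a single function, the dot call gives the same result as `map` or a comprehension. The sketch below checks this and also shows the `@.` macro, which adds a dot to every function call and operator in an expression.

```julia
a = [0, π/2, π]

b_dot  = sin.(a)                 # broadcast call
b_map  = map(sin, a)             # same result with map
b_comp = [sin(x) for x in a]     # same result with a comprehension

same = b_dot == b_map == b_comp  # true

c = @. sin(a)^2 + cos(a)^2       # expands to sin.(a).^2 .+ cos.(a).^2
all(c .≈ 1)                      # true
```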

#### Comparison with R

In R, the `apply` family of functions can be used for similar purposes. For example, to apply the sin function to each element of a vector in R, we can write:

```R
a <- c(0, pi/2, pi, 3*pi/2, 2*pi)
b <- sapply(a, sin)
```

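But the dot notation makes it cleaner and less confusing (at least for me!). It also makes the difference between matrix multiplication and element-wise multiplication explicit. Without a dot, `*` is matrix multiplication: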

In [71]:
a = [1 2; 3 4]
b = [2, 2] 

a * b

2-element Vector{Int64}:
  6
 14

In R this is equivalent to something like

```R
a %*% b
```

while this

In [72]:
a .* b

2×2 Matrix{Int64}:
 2  4
 6  8

is equivalent in R to
```R
a * b
```
R also provides a number of built-in vectorized functions, such as `sqrt`, `exp`, and `sin`, which can be used to operate on arrays in a similar way to broadcasting in Julia.
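
When the arguments have different shapes, broadcasting expands every dimension of length 1 to match the other argument. A minimal sketch with a column vector, a row matrix, and the 2×2 matrix from above (the names `col`, `row` and `m` are chosen for illustration):

```julia
col = [1, 2, 3]       # 3-element Vector, treated as a column
row = [10 20]         # 1×2 Matrix, a single row

grid = col .+ row     # 3×2 Matrix: [11 21; 12 22; 13 23]

m = [1 2; 3 4]
by_rows = m .* [2, 2]      # row i is multiplied by the i-th entry of the vector: [2 4; 6 8]
by_cols = m .* [10 100]    # column j is multiplied by the j-th entry of the row: [10 200; 30 400]
```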

### Benefits of broadcasting in Julia

- Allows for **concise and readable code**. Without broadcasting, we would need to use a loop or comprehension to apply a function element-wise to an array, which can be more verbose and harder to read. Broadcasting can also be efficient: chained dotted operations are fused into a single loop, which avoids allocating temporary arrays for intermediate results.
- Allows for more flexible code! Broadcasting can be used with any function, not just arithmetic operators, which means that it can be used with **user-defined functions** as well. This makes it easy to write code that operates on arrays in a flexible and customizable way.

In [83]:
function my_function(x, y)
    return x + y^2
end

a = [1, 2, 3]
b = 2
c = my_function.(a, b)

3-element Vector{Int64}:
 5
 6
 7

#### Use of Ref

In Julia, writing `r = Ref(x)` creates a 0-dimensional container storing `x` as its only element. You can retrieve `x` from `r` by writing `r[]` (no indices are passed, since `r` is 0-dimensional). The type is named `Ref` because `r` can be thought of as a reference to `x`.

Since `Ref` objects are 0-dimensional and store exactly one element, broadcasting treats them like a scalar: the value `x` stored in `r` is reused for every element of the other arguments, following the usual expansion rules. This is useful when a whole collection should be treated as a single value in a broadcast, as in the membership test below.

In [75]:
mycountries = ["USA" 
                "CHN" 
                "JPN" 
                "DEU" 
                "BRA" 
                "FRA" 
                "IRA" 
                "RUS" 
                "GBR" 
                "AUS" 
                "CAN" 
                "IND" 
                "MEX" 
                "KOR" 
                "ESP" 
                "IDN" 
                "TUR" 
                "NLD" 
                "SAU"]
"CHE" ∈ mycountries

false

In [81]:

mycountries2 = ["BRA" 
                "FRA" 
                "IRA" 
                "RUS"
                "CHE"]

mycountries2 .∈ mycountries

DimensionMismatch: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 5 and 19

The broadcast fails because Julia tries to pair the elements of `mycountries2` with those of `mycountries` one by one, and the lengths 5 and 19 do not match. Wrapping `mycountries` in `Ref` makes broadcasting treat it as a single value, so each element of `mycountries2` is tested against the whole vector:

In [82]:
mycountries2 .∈ Ref(mycountries)

5-element BitVector:
 1
 1
 1
 1
 0
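
`Ref` is one of several ways to protect a collection from being broadcast over. The sketch below, on two short hypothetical vectors `known` and `queries`, lists a few equivalent spellings; which one reads best is a matter of taste.

```julia
# `known` and `queries` are hypothetical vectors used only for this sketch
known = ["BRA", "FRA", "RUS"]
queries = ["FRA", "CHE"]

r1 = queries .∈ Ref(known)             # Ref: `known` is treated as a single value
r2 = queries .∈ (known,)               # a 1-element tuple has the same effect
r3 = in.(queries, Ref(known))          # function-call form of the same broadcast
r4 = in(known).(queries)               # `in(known)` is a one-argument predicate
r5 = [q in known for q in queries]     # explicit comprehension

r1 == r2 == r3 == r4 == r5             # true, all equal [true, false]
Ref(known)[] === known                 # true: `[]` retrieves the wrapped object
```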

## Acknowledgement and additional resources
- [Bits and pieces of this tutorial have been inspired by the DataFrames.jl documentation](https://dataframes.juliadata.org/stable/)
- [Here is a very detailed tutorial for DataFrames.jl](https://dataframes.juliadata.org/stable/man/basics/#Changing-the-Data-Stored-in-a-Data-Frame)