# DataFrames and broadcasting in Julia

## DataFrames

![](https://dataframes.juliadata.org/stable/assets/logo.png)


- [`DataFrames.jl`](https://github.com/JuliaData/DataFrames.jl) is the primary package for working with tabular data in Julia. 

- It provides functionality similar to that of `pandas` in Python or `data.frame` in R.

- There is a lot of correspondence between these three tools!



Let's get started!

### Installation

`DataFrames.jl` is not shipped by default with Julia: you need to install it. In the Julia REPL, press `]` to enter package mode, then run:

```julia
pkg> add DataFrames
```

In [4]:
cd(@__DIR__)
using Pkg; Pkg.activate(".");
Pkg.instantiate()

[32m[1m  Activating[22m[39m project at `~/Academia/Postdoc_S2z/teaching/iDiv_Julia_workshop/materials/Day1/32_dataframe_tuto`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/VBoussangeRegistry`
[32m[1m    Updating[22m[39m git-repo `https://github.com/vboussange/VBoussangeRegistry.git`
[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General.toml`
[32m[1m   Installed[22m[39m GBIF2 ─ v0.2.1
[32m[1m    Updating[22m[39m `~/Academia/Postdoc_S2z/teaching/iDiv_Julia_workshop/materials/Day1/32_dataframe_tuto/Project.toml`
  [90m[336ed68f] [39m[92m+ CSV v0.10.14[39m
  [90m[a93c6f00] [39m[92m+ DataFrames v1.6.1[39m
  [90m[dedd4f52] [39m[92m+ GBIF2 v0.2.1[39m
  [90m[91a5bcdd] [39m[92m+ Plots v1.40.5[39m
  [90m[274fc56d] [39m[92m+ PythonPlot v1.0.5[39m
  [90m[f43a241f] [39m[93m~ Downloads ⇒ v1.6.0[39m
[32m[1m    Updating[22m[39m `~/Academia/Postdoc_S2z/teaching/iDiv_Julia_workshop/materials/Day1/32_dataframe_tuto/Manifest

### Constructing `DataFrame`s

In [7]:
using DataFrames

df = DataFrame(grp=repeat(1:2, 3), 
                x=6:-1:1, 
                y=4:9, 
                z=[3:7; missing], 
                id='a':'f')

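| Row | grp | x | y | z | id |
|-----|-----|---|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 6 | 4 | 3 | a |
| 2 | 2 | 5 | 5 | 4 | b |
| 3 | 1 | 4 | 6 | 5 | c |
| 4 | 2 | 3 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 1 | 9 | missing | f |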


In [6]:
df2 = DataFrame(grp=[1, 3], 
                w=[10, 11])

| Row | grp | w |
|-----|-----|---|
|  | Int64 | Int64 |
| 1 | 1 | 10 |
| 2 | 3 | 11 |


In [4]:
names(df)

5-element Vector{String}:
 "grp"
 "x"
 "y"
 "z"
 "id"

Constructing row by row

In [5]:
df3 = DataFrame(A=Int[], B=String[])
push!(df3, (1, "M"))

| Row | A | B |
|-----|---|---|
|  | Int64 | String |
| 1 | 1 | M |
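
Rows can also be pushed as named tuples, which makes explicit which column each value belongs to. The following is a minimal sketch; the name `df_rows` and the values are illustrative and not part of the tutorial data.

```julia
using DataFrames

# Start from an empty DataFrame with typed columns
df_rows = DataFrame(A=Int[], B=String[])

# Push a row as a tuple (values are matched to columns by position)
push!(df_rows, (1, "M"))

# Push a row as a NamedTuple (values are matched to columns by name)
push!(df_rows, (A=2, B="F"))

println(nrow(df_rows))   # 2
println(df_rows.B)       # contents of column B
```

Declaring the element types up front (`Int[]`, `String[]`) means that a value of the wrong type is rejected when it is pushed.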


### Accessing data

Cell indexing by location

In [6]:
df[2, 2]

5

Row slicing by location

In [7]:
df[2:3, :]

| Row | grp | x | y | z | id |
|-----|-----|---|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 2 | 5 | 5 | 4 | b |
| 2 | 1 | 4 | 6 | 5 | c |


Column indexing

In [8]:
df[:, :x]

6-element Vector{Int64}:
 6
 5
 4
 3
 2
 1

In [8]:
df.x

6-element Vector{Int64}:
 6
 5
 4
 3
 2
 1

In [9]:
df[:, [:x, :z]]

| Row | x | z |
|-----|---|---|
|  | Int64 | Int64? |
| 1 | 6 | 3 |
| 2 | 5 | 4 |
| 3 | 4 | 5 |
| 4 | 3 | 6 |
| 5 | 2 | 7 |
| 6 | 1 | missing |


Row selection by value (boolean indexing)

In [10]:
df[df.id .== 'c', :]

| Row | grp | x | y | z | id |
|-----|-----|---|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 4 | 6 | 5 | c |


Notice the `.` in front of the `==`. More on that in a minute!
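
The boolean mask used above is an ordinary vector with one `Bool` per row. As a hedged sketch, the same selection can also be written with `filter` or `subset` from `DataFrames.jl`; the small `df_demo` table below is made up for illustration.

```julia
using DataFrames

df_demo = DataFrame(id='a':'f', x=6:-1:1)

# Boolean mask: one Bool per row, built with a broadcasted comparison
mask = df_demo.x .> 3
rows_mask = df_demo[mask, :]

# Same selection with `filter` (the predicate receives one row at a time)
rows_filter = filter(row -> row.x > 3, df_demo)

# Same selection with `subset` (ByRow applies the test to each element of :x)
rows_subset = subset(df_demo, :x => ByRow(>(3)))
```

All three forms return a new `DataFrame` and leave `df_demo` untouched.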

### Changing the data stored in a dataframe

In [11]:
df.x = rand([1,2,3,4],length(df.x))
df

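| Row | grp | x | y | z | id |
|-----|-----|---|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 2 | 4 | 3 | a |
| 2 | 2 | 3 | 5 | 4 | b |
| 3 | 1 | 2 | 6 | 5 | c |
| 4 | 2 | 2 | 7 | 6 | d |
| 5 | 1 | 4 | 8 | 7 | e |
| 6 | 2 | 2 | 9 | missing | f |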


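Almost equivalently, one can use the indexing syntax below. The difference is that `df[:, :x] = ...` overwrites the existing column in place, whereas `df.x = ...` replaces the column with the new vector.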

In [14]:
df[:,:x] = rand([1,2,3,4],length(df.x))

6-element Vector{Int64}:
 3
 4
 2
 3
 2
 2
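
The distinction between updating a column in place and replacing it only matters when something else still refers to the old column vector. A minimal sketch, using a made-up `df_demo` and assuming the `DataFrames.jl` 1.x indexing rules:

```julia
using DataFrames

df_demo = DataFrame(x=[1, 2, 3])
old_x = df_demo.x              # reference to the vector currently stored in column x

# In-place update: the existing vector is overwritten, so old_x sees the change
df_demo[:, :x] = [10, 20, 30]
println(old_x)                 # [10, 20, 30]

# Replacement: column x now holds a new vector, old_x is left untouched
df_demo.x = [7, 8, 9]
println(df_demo.x === old_x)   # false
```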

### Common operations

In [14]:
describe(df)

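| Row | variable | mean | min | median | max | nmissing | eltype |
|-----|----------|------|-----|--------|-----|----------|--------|
|  | Symbol | Union… | Any | Union… | Any | Int64 | Type |
| 1 | grp | 1.5 | 1 | 1.5 | 2 | 0 | Int64 |
| 2 | x | 2.16667 | 2 | 2.0 | 3 | 0 | Int64 |
| 3 | y | 6.5 | 4 | 6.5 | 9 | 0 | Int64 |
| 4 | z | 5.0 | 3 | 5.0 | 7 | 1 | Union{Missing, Int64} |
| 5 | id |  | a |  | f | 0 | Char |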


Ignore missing values in a reduction with `skipmissing`, which returns a memory-efficient iterator over the non-missing values


In [16]:
using Statistics
mean(skipmissing(df.z))

5.0
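
To see why `skipmissing` is needed, it helps to compare the result with and without it. The sketch below uses a plain vector with the same values as column `z`, so it does not depend on `DataFrames.jl`.

```julia
using Statistics

z = [3, 4, 5, 6, 7, missing]

# Any arithmetic involving `missing` propagates it
println(mean(z))                 # missing

# skipmissing wraps z lazily; no intermediate array is allocated
println(mean(skipmissing(z)))    # 5.0

# collect materializes the non-missing values when an array is needed
z_clean = collect(skipmissing(z))
```

`z_clean` has element type `Int64`, since the `Missing` part of the union is dropped.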

Rename columns

In [17]:
rename(df, :x => :x_new)

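| Row | grp | x_new | y | z | id |
|-----|-----|-------|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 3 | 4 | 3 | a |
| 2 | 2 | 2 | 5 | 4 | b |
| 3 | 1 | 2 | 6 | 5 | c |
| 4 | 2 | 2 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 2 | 9 | missing | f |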


With the `!` suffix (the "bang" convention), the function modifies `df` in place instead of returning a new `DataFrame`

In [18]:
rename!(df, :x => :x_new)

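| Row | grp | x_new | y | z | id |
|-----|-----|-------|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 3 | 4 | 3 | a |
| 2 | 2 | 2 | 5 | 4 | b |
| 3 | 1 | 2 | 6 | 5 | c |
| 4 | 2 | 2 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 2 | 9 | missing | f |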


Drop rows containing missing values

In [19]:
dropmissing(df)

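| Row | grp | x_new | y | z | id |
|-----|-----|-------|---|---|----|
|  | Int64 | Int64 | Int64 | Int64 | Char |
| 1 | 1 | 3 | 4 | 3 | a |
| 2 | 2 | 2 | 5 | 4 | b |
| 3 | 1 | 2 | 6 | 5 | c |
| 4 | 2 | 2 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |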


Select unique rows

In [20]:
unique(df)

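| Row | grp | x_new | y | z | id |
|-----|-----|-------|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 3 | 4 | 3 | a |
| 2 | 2 | 2 | 5 | 4 | b |
| 3 | 1 | 2 | 6 | 5 | c |
| 4 | 2 | 2 | 7 | 6 | d |
| 5 | 1 | 2 | 8 | 7 | e |
| 6 | 2 | 2 | 9 | missing | f |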


### Grouping data and aggregation
`DataFrames.jl` provides a `groupby` function to apply operations over each group independently. The result of `groupby` is a `GroupedDataFrame` object, which may be processed using the `combine`, `transform`, or `select` functions. The following examples illustrate some common grouping and aggregation usages.

In [21]:
dfg = groupby(df, :grp)

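| Row | grp | x_new | y | z | id |
|-----|-----|-------|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 3 | 4 | 3 | a |
| 2 | 1 | 2 | 6 | 5 | c |
| 3 | 1 | 2 | 8 | 7 | e |

| Row | grp | x_new | y | z | id |
|-----|-----|-------|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 2 | 2 | 5 | 4 | b |
| 2 | 2 | 2 | 7 | 6 | d |
| 3 | 2 | 2 | 9 | missing | f |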


In [22]:
dfg[1]

| Row | grp | x_new | y | z | id |
|-----|-----|-------|---|---|----|
|  | Int64 | Int64 | Int64 | Int64? | Char |
| 1 | 1 | 3 | 4 | 3 | a |
| 2 | 1 | 2 | 6 | 5 | c |
| 3 | 1 | 2 | 8 | 7 | e |


Aggregate by groups

In [23]:
combine(groupby(df, :grp), :y => mean)

| Row | grp | y_mean |
|-----|-----|--------|
|  | Int64 | Float64 |
| 1 | 1 | 6.0 |
| 2 | 2 | 7.0 |
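
Several aggregations can be requested in a single `combine` call. The sketch below rebuilds three of the columns of `df` under the illustrative name `df_demo`; the output column names `n`, `y_mean` and `z_mean` are choices made here, not names from the tutorial.

```julia
using DataFrames, Statistics

df_demo = DataFrame(grp=repeat(1:2, 3), y=4:9, z=[3:7; missing])

summary_df = combine(groupby(df_demo, :grp),
                     nrow => :n,                                    # rows per group
                     :y => mean => :y_mean,                         # explicit output name
                     :z => (v -> mean(skipmissing(v))) => :z_mean)  # ignore missing values
```

The anonymous function is wrapped in parentheses so that the trailing `=> :z_mean` is read as the output name and not as part of the function body.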


### Reading a CSV file

To read and parse a CSV file into a `DataFrame`, you'll need to install the [`CSV.jl`](https://github.com/JuliaData/CSV.jl) package. `CSV.jl` is *a fast, flexible delimited file reader/writer for Julia.*

In [24]:
using CSV
iris_data_filename = "iris_data.csv"
CSV.read(iris_data_filename, DataFrame)

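| Row | sepal_length | sepal_width | petal_length | petal_width | species |
|-----|--------------|-------------|--------------|-------------|---------|
|  | Float64 | Float64 | Float64 | Float64 | String15 |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |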


Check out the [`CSV.jl` documentation](https://csv.juliadata.org/stable/index.html#Overview) to learn more.
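
Writing works the same way in the other direction with `CSV.write`. Here is a minimal round-trip sketch; the two-row table and the temporary file path are made up for illustration.

```julia
using CSV, DataFrames

df_demo = DataFrame(species=["setosa", "virginica"], petal_length=[1.4, 6.0])

path = tempname() * ".csv"            # temporary file, illustrative only
CSV.write(path, df_demo)              # write the DataFrame to disk

df_back = CSV.read(path, DataFrame)   # read it back into a DataFrame
```

Column types are inferred again on reading, so short strings may come back as a fixed-width string type such as `String15`, as in the iris output above.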

## Broadcasting

In R, something like this works perfectly:
```R
a <- c(0, pi/2, pi, 3*pi/2, 2*pi)
b <- sin(a)
```



But in Julia, the following fails!

In [25]:
a = [0, π/2, π, 3π/2, 2π]
b = sin(a)

LoadError: MethodError: no method matching sin(::Vector{Float64})
Closest candidates are:
  sin(::T) where T<:Union{Float32, Float64} at special/trig.jl:29
  sin(::LinearAlgebra.UniformScaling) at ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/uniformscaling.jl:173
  sin(::LinearAlgebra.Diagonal) at ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/diagonal.jl:674
  ...

- In R, `sin` works on vectors because the built-in `sin` function is vectorized.

- In Julia, you obtain this behavior by **broadcasting**, which allows for **the application of operations to arrays or collections** in a way that is both concise and efficient.




The syntax for broadcasting is the dot notation: a `.` placed after a function name (`f.(x)`) or before an operator (`x .+ y`).

In [26]:
b = sin.(a)

5-element Vector{Float64}:
  0.0
  1.0
  1.2246467991473532e-16
 -1.0
 -2.4492935982947064e-16

- The dot notation tells Julia to apply the `sin` function to each element of `a`.
- This is more natural, because the $\sin$ function is mathematically only defined for scalar values.
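
Dots can be combined within one expression, and the `@.` macro saves typing them one by one. A small sketch with illustrative values:

```julia
a = [0, π/2, π]

# Dots can be chained: the whole expression is fused into a single loop
b = sin.(a) .^ 2 .+ cos.(a) .^ 2

# The @. macro adds a dot to every function call and operator in the expression
c = @. sin(a)^2 + cos(a)^2
```

Both `b` and `c` hold values equal to 1 up to floating-point rounding, as expected from the identity $\sin^2 + \cos^2 = 1$.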



Similarly, in R you could write something like

```R
b = a + 1
```

But mathematically this is odd, since adding a scalar to a vector is not a defined operation. Accordingly, this fails in Julia:

In [27]:
b = a + 1

LoadError: MethodError: no method matching +(::Vector{Float64}, ::Int64)
For element-wise addition, use broadcasting with dot syntax: array .+ scalar
Closest candidates are:
  +(::Any, ::Any, ::Any, ::Any...) at operators.jl:591
  +(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:87
  +(::Base.TwicePrecision, ::Number) at twiceprecision.jl:290
  ...

But this works!

In [28]:
b = a .+ 1

5-element Vector{Float64}:
 1.0
 2.5707963267948966
 4.141592653589793
 5.71238898038469
 7.283185307179586

Note that the following R user-defined function

```R
# A function that returns the square of a number if it is even, and -1 if odd
even_square <- function(x){
  if(x %% 2 == 0){
    return(x^2)
  } else {
    return(-1)
  }
}
```

would fail with an R vector:

```R
# A vector of numbers
nums <- c(1, 2, 3, 4, 5)

even_square(nums) # Error in if (x%%2 == 0) { : the condition has length > 1
```



For the above code to work, you'd need to apply `even_square` element-wise with a function from the `apply` family, in much the same way as you would use the dot syntax in Julia:

```R
sapply(nums, even_square)
```
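
For comparison, here is a sketch of what a Julia counterpart could look like. The function is written for a single number, and the dot takes care of applying it to each element.

```julia
# Julia counterpart of the R function above: square if even, -1 if odd
even_square(x) = iseven(x) ? x^2 : -1

nums = [1, 2, 3, 4, 5]

# The function is defined for scalars; the dot applies it element-wise
result = even_square.(nums)
println(result)   # [-1, 4, -1, 16, -1]
```

`map(even_square, nums)` gives the same result here; the dot syntax becomes more convenient when several arrays and scalars are mixed in one expression.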

#### Matrix multiplication in R and Julia

In [29]:
a = [1 2; 3 4]
b = [2, 2] 

a * b

2-element Vector{Int64}:
  6
 14

In R this is equivalent to something like

```R
a %*% b
```

while this

In [30]:
a .* b

2×2 Matrix{Int64}:
 2  4
 6  8

is equivalent in R to
```R
a * b
```
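
Broadcasting also expands dimensions of size 1, which is why a vector and a matrix can be combined element-wise. A minimal sketch with made-up values:

```julia
col = [1, 2, 3]      # 3-element Vector (treated as a column)
row = [10 20]        # 1×2 Matrix (a row)

# Dimensions of size 1 are expanded to match: the result is a 3×2 Matrix
table = col .+ row
```

Each entry `table[i, j]` equals `col[i] + row[1, j]`.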


#### Ref

In Julia, when you write `r = Ref(x)`, you create a 0-dimensional container storing the value `x` as its only element. You can retrieve `x` from `r` by writing `r[]` (notice that we do not pass any indices in the indexing syntax, as `r` is 0-dimensional). The type is named `Ref` because you can think of `r` as a reference to `x`.

Since `Ref` objects are 0-dimensional and store exactly one element, they have length 1 in every dimension. As a consequence, if you use `r` in broadcasting, the value `x` stored in it is used in all required dimensions, following the expansion rules.
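
The effect of `Ref` is easiest to see by comparing a broadcast with and without it. The vectors below are illustrative.

```julia
x = [1, 2, 3]
r = Ref(x)

# r[] retrieves the wrapped object (the very same array, not a copy)
println(r[] === x)          # true

# Without Ref, broadcasting pairs elements: 3 with 1, 5 with 2, 1 with 3
pairwise = [3, 5, 1] .== x

# With Ref, x is treated as a single value: each element of the outer
# vector is tested for membership in the whole of x
membership = in.([3, 5, 1], Ref(x))
```

`pairwise` is all `false`, whereas `membership` is `[true, false, true]`.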



`Ref` can be useful in certain cases. Here is one: assume that you have an array of strings, and you want to check whether each of its elements is in another array of strings. One option is to use a loop:

In [1]:
mycountries = ["USA" 
                "CHN" 
                "JPN" 
                "DEU" 
                "BRA" 
                "FRA" 
                "IRA" 
                "RUS" 
                "GBR" 
                "AUS" 
                "CAN" 
                "IND" 
                "MEX" 
                "KOR" 
                "ESP" 
                "IDN" 
                "TUR" 
                "NLD" 
                "SAU"]
mycountries2 = ["BRA" 
                "FRA" 
                "IRA" 
                "RUS"
                "CHE"]
x = Bool[]
for c in mycountries2
    if c ∈ mycountries
        push!(x, true)
    else
        push!(x, false)
    end
end
println(x)

Bool[1, 1, 1, 1, 0]


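That's where `Ref` is useful! Wrapping `mycountries` in `Ref` makes broadcasting treat it as a single value, so each element of `mycountries2` is checked against the whole array: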

In [33]:
mycountries2 .∈ Ref(mycountries)

5-element BitVector:
 1
 1
 1
 1
 0
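
`Ref` is not the only way to protect an array from being broadcast over. The sketch below shows two alternatives on shortened, illustrative arrays (`haystack` and `needles` are names chosen here).

```julia
haystack = ["BRA", "FRA", "RUS"]
needles  = ["FRA", "CHE"]

with_ref   = needles .∈ Ref(haystack)     # Ref protects haystack from broadcasting
with_tuple = needles .∈ (haystack,)       # a 1-element tuple has the same effect
with_map   = map(in(haystack), needles)   # in(haystack) is the function x -> x in haystack
```

All three return `[true, false]`. Writing `needles .∈ haystack` without any wrapper would instead try to pair the elements of the two arrays, which raises an error when their lengths differ.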

## Your turn!

Try out [some exercises](33_dataframe_exercises.md) with the iris dataset and GBIF 🤩!

Leave out the plotting section; we'll do that after you get a proper introduction to plotting in Julia!

## Acknowledgement and additional resources
- [Bits and pieces of this tutorial were inspired by the DataFrames.jl documentation](https://dataframes.juliadata.org/stable/)
- [Here is a very detailed tutorial for DataFrames.jl](https://dataframes.juliadata.org/stable/man/basics/#Changing-the-Data-Stored-in-a-Data-Frame)
- [`CSV.jl` documentation](https://csv.juliadata.org/stable/index.html#Overview)