# Chapter Four: DataFrames

In [1]:
using Pkg
Pkg.activate("juliadatascience")

[32m[1m  Activating[22m[39m project at `~/research/JuliaDataScience/notebooks/juliadatascience`


Before we begin, make sure the `DataFrames` package is installed. From the Julia REPL, run `using Pkg; Pkg.add("DataFrames")`. I load all packages used in a given chapter at the top of the notebook.

In [2]:
using DataFrames

A note about `DataFrames.jl`: if you are an R person (like me) and want something similar to `dplyr` or `data.table`, here's a handy [website](https://dataframes.juliadata.org/latest/man/comparisons/#Comparison-with-the-R-package-dplyr) that compares those two packages with Julia's `DataFrames`. I expect to rely on it quite a bit. There is also a `Pandas` package if you prefer that API, but I think the `DataFrames` package is used more often in Julia.

Let's consider an example using base Julia functionality. 

In [3]:
function grades_array()
    name = ["Bob", "Sally", "Alice", "Hank"]
    age = [17, 18, 20, 19]
    grade_2020 = [5.0, 1.0, 8.5, 4.0]
    (; name, age, grade_2020)
end

grades_array (generic function with 1 method)

How do we access the individual columns?

In [4]:
grades_array().age

4-element Vector{Int64}:
 17
 18
 20
 19

In [5]:
grades_array().grade_2020

4-element Vector{Float64}:
 5.0
 1.0
 8.5
 4.0
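
The `(; name, age, grade_2020)` syntax in `grades_array` builds a named tuple whose field names are taken from the variable names, which is why `.age` and `.grade_2020` work. Here is a minimal sketch in base Julia; the two-element vectors and the variable `nt` are my own, for illustration only.

```julia
# Illustrative data, not from the chapter
name = ["Bob", "Sally"]
age = [17, 18]

# Shorthand: field names are taken from the variable names
nt = (; name, age)

# The long form builds an equal named tuple
same = nt == (name = name, age = age)

# A field can be reached by property, by symbol, or by position
by_property = nt.age
by_symbol = nt[:age]
by_position = nt[2]

fields = keys(nt)   # (:name, :age)
```

All three access styles return the same vector, so `.age` is just the most readable option.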

Suppose we want a function that extracts the second row of `grades_array()`.

In [6]:
function second_row()
    name, age, grade_2020 = grades_array()
    i = 2
    return (name[i], age[i], grade_2020[i])
end

second_row (generic function with 1 method)

In [7]:
second_row()

("Sally", 18, 1.0)
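
Note that `name, age, grade_2020 = grades_array()` unpacks the named tuple by position, not by field name, so the variable order has to match the field order. One way around that, which also lets the row index be an argument, is to `map` over the named tuple. This is only a sketch; `nth_row` is a hypothetical helper name of mine, not something from the book.

```julia
function grades_array()
    name = ["Bob", "Sally", "Alice", "Hank"]
    age = [17, 18, 20, 19]
    grade_2020 = [5.0, 1.0, 8.5, 4.0]
    (; name, age, grade_2020)
end

# Hypothetical helper: take element i from every column.
# `map` over a named tuple returns a named tuple with the same field names.
nth_row(i) = map(col -> col[i], grades_array())

row = nth_row(2)           # (name = "Sally", age = 18, grade_2020 = 1.0)
row_name = row.name        # "Sally"
row_values = values(row)   # ("Sally", 18, 1.0)
```

The row keeps its field names, so `row.name` reads better than remembering that the name sits in position 1.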

Or how about the row with Alice? We need to find the row that contains Alice and then extract it. Let's build up the function that does this step by step. First, we'll get the individual names.

In [8]:
names = grades_array().name

4-element Vector{String}:
 "Bob"
 "Sally"
 "Alice"
 "Hank"

Next, find the index of the first element in the `names` vector that equals `"Alice"`.

In [9]:
findfirst(names .== "Alice")

3
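
`names .== "Alice"` first builds a Boolean vector and then searches it. `findfirst` also accepts a predicate as its first argument, which skips the temporary vector, and it returns `nothing` when there is no match. A small sketch: I call the vector `students` here rather than `names`, because `names` is also a Base function that `DataFrames` extends, and `"Zed"` is a made-up name that is not in the data.

```julia
students = ["Bob", "Sally", "Alice", "Hank"]

# Predicate form: ==("Alice") is a function that tests equality with "Alice"
i = findfirst(==("Alice"), students)    # 3

# No match returns `nothing` rather than throwing an error
j = findfirst(==("Zed"), students)

# Guard before indexing, since students[nothing] would fail
found = isnothing(j) ? "not found" : students[j]
```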

Finally, let's wrap this in a function. Note that my function is slightly different from the one given in the book: I'm not sure why they define the variable `i` inside the function, so I return the result directly. In any case, I think this works.

In [10]:
function row_alice()
    names = grades_array().name
    return findfirst(names .== "Alice")
end
row_alice()

3

Now let's get Alice's grade. This is a very cumbersome way of filtering and selecting from a data frame if you are used to `dplyr`, `data.table`, or `pandas`. Hopefully things get better once we use `DataFrames.jl`.

In [11]:
function value_alice()
    grades = grades_array().grade_2020
    i = row_alice()
    return grades[i]
end
value_alice()

8.5

OK, now let's use a data frame. 

In [12]:
names = ["Sally", "Bob", "Alice", "Hank"]
grades = [1, 5, 8.5, 4]
df = DataFrame(; name=names, grade_2020=grades)

4×2 DataFrame
 Row │ name    grade_2020
     │ String  Float64
─────┼────────────────────
   1 │ Sally          1.0
   2 │ Bob            5.0
   3 │ Alice          8.5
   4 │ Hank           4.0
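
The `grade_2020` column comes out as `Float64` even though three of the four grades were written as integers. That happens before `DataFrame` is involved: a Julia array literal promotes mixed numeric elements to a common type. A quick check in base Julia (the all-integer vector `ints` is my own, added for contrast):

```julia
grades = [1, 5, 8.5, 4]    # mixed Int and Float64 literals
grades_type = eltype(grades)   # Float64

ints = [1, 5, 8, 4]        # all integers, so no promotion
ints_type = eltype(ints)       # Int (Int64 on a 64-bit machine)
```

So the column type in the data frame is simply the element type of the vector that was passed in.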


In [14]:
function grades_2020()
    name = ["Sally", "Bob", "Alice", "Hank"]
    grade_2020 = [1, 5, 8.5, 4]
    DataFrame(; name, grade_2020)
end
grades_2020()

4×2 DataFrame
 Row │ name    grade_2020
     │ String  Float64
─────┼────────────────────
   1 │ Sally          1.0
   2 │ Bob            5.0
   3 │ Alice          8.5
   4 │ Hank           4.0


In [15]:
df = DataFrame(name = ["Malice"], grade_2020 = ["10"])

1×2 DataFrame
 Row │ name    grade_2020
     │ String  String
─────┼────────────────────
   1 │ Malice  10


In [17]:
df = grades_2020()

4×2 DataFrame
 Row │ name    grade_2020
     │ String  Float64
─────┼────────────────────
   1 │ Sally          1.0
   2 │ Bob            5.0
   3 │ Alice          8.5
   4 │ Hank           4.0
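
With the data in a `DataFrame`, the row and value lookups that took several helper functions earlier each become a single expression. This is a sketch of how I would do it with standard `DataFrames.jl` indexing; the variable names `row2`, `alice`, and `alice_grade` are mine, and it assumes the `DataFrames` package is installed.

```julia
using DataFrames

df = DataFrame(name = ["Sally", "Bob", "Alice", "Hank"],
               grade_2020 = [1, 5, 8.5, 4])

# Second row, all columns (returns a DataFrameRow)
row2 = df[2, :]

# All rows where name equals "Alice" (returns a DataFrame)
alice = filter(:name => ==("Alice"), df)

# Alice's grade as a plain number; `only` errors unless there is exactly one match
alice_grade = only(df[df.name .== "Alice", :grade_2020])
```

Using `only` is a deliberate choice: if the name were missing or duplicated, I would rather get an error than silently take the first match.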
