# Chapter Four: DataFrames

In [1]:
using Pkg
Pkg.activate("juliadatascience")

[32m[1m  Activating[22m[39m project at `~/research/JuliaDataScience/notebooks/juliadatascience`


Before we begin, make sure the `DataFrames` package is installed. From the Julia REPL, run `using Pkg; Pkg.add("DataFrames")`. I will load all packages used in a given chapter at the top of the notebook.

In [2]:
using DataFrames

A note about `DataFrames.jl`: if you are an R person (like me) and want something similar to `dplyr` or `data.table`, here's a handy [website](https://dataframes.juliadata.org/latest/man/comparisons/#Comparison-with-the-R-package-dplyr) that compares those two packages with Julia's `DataFrames`. I expect to rely on it quite a bit. There is also a `Pandas` package if you prefer that API, but I think the `DataFrames` package is used more often in Julia.

Let's first consider an example using only base Julia functionality: a function that returns three columns as a named tuple.

In [3]:
function grades_array()
    name = ["Bob", "Sally", "Alice", "Hank"]
    age = [17, 18, 20, 19]
    grade_2020 = [5.0, 1.0, 8.5, 4.0]
    (; name, age, grade_2020)
end

grades_array (generic function with 1 method)

How do we access the individual columns?

In [4]:
grades_array().age

4-element Vector{Int64}:
 17
 18
 20
 19

In [5]:
grades_array().grade_2020

4-element Vector{Float64}:
 5.0
 1.0
 8.5
 4.0
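The dot syntax above works because `grades_array()` returns a named tuple. As a minimal sketch (the variable `grades` is my own name, and the data is repeated so the snippet runs on its own), here are a few other base Julia ways to reach the same columns:

```julia
# Same data as grades_array() returns, repeated so this snippet is self-contained.
# `grades` is a hypothetical variable name, not part of the original notebook.
grades = (; name = ["Bob", "Sally", "Alice", "Hank"],
            age = [17, 18, 20, 19],
            grade_2020 = [5.0, 1.0, 8.5, 4.0])

keys(grades)                      # column names as a tuple of Symbols
grades[:age]                      # index by Symbol; same vector as grades.age
grades[2]                         # index by position; the second column is age
getproperty(grades, :grade_2020)  # function form of grades.grade_2020
```

Indexing by `Symbol` is handy when the column name is only known at run time.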

Suppose we want a function that extracts the second row of the data returned by `grades_array()`.

In [6]:
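function second_row()
    # Destructure the named tuple by position into its three columns.
    name, age, grade_2020 = grades_array()
    i = 2
    # Return the i-th entry of each column as a tuple.
    (name[i], age[i], grade_2020[i])
end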

second_row (generic function with 1 method)

In [7]:
second_row()

("Sally", 18, 1.0)
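Hard-coding `i = 2` means a new function for every row. One possible generalization, sketched here with a hypothetical helper `nth_row` that is not part of the original notebook, is to take the index as an argument and `map` over the named tuple so the column names are preserved:

```julia
# Repeated from above so this snippet runs on its own.
function grades_array()
    name = ["Bob", "Sally", "Alice", "Hank"]
    age = [17, 18, 20, 19]
    grade_2020 = [5.0, 1.0, 8.5, 4.0]
    (; name, age, grade_2020)
end

# Hypothetical helper: pick entry i from every column.
# `map` over a named tuple returns a named tuple with the same keys.
nth_row(i) = map(col -> col[i], grades_array())

nth_row(2)
```

Unlike the plain tuple returned by `second_row()`, the result keeps its labels, so `nth_row(2).age` works.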

Or how about the row with Alice? We first need to find the index of the row that contains Alice; that index can then be used to extract the row.

In [8]:
function row_alice()
    names = grades_array().name
    i = findfirst(names .== "Alice")
end
row_alice()

3
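`row_alice()` gives us the index, not the row itself. As a rough sketch of the remaining step (the helper `row_by_name` is my own, hypothetical name), we can look up the index and then pull that entry from each column, returning `nothing` when the name is absent:

```julia
# Repeated from above so this snippet runs on its own.
function grades_array()
    name = ["Bob", "Sally", "Alice", "Hank"]
    age = [17, 18, 20, 19]
    grade_2020 = [5.0, 1.0, 8.5, 4.0]
    (; name, age, grade_2020)
end

# Hypothetical helper: return the row for `target` as a tuple,
# or `nothing` if no student has that name.
function row_by_name(target)
    cols = grades_array()
    i = findfirst(==(target), cols.name)  # index of the first match, or nothing
    i === nothing && return nothing
    (cols.name[i], cols.age[i], cols.grade_2020[i])
end

row_by_name("Alice")
```

Even this small lookup takes a fair amount of bookkeeping in base Julia.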