# Chapter 2. Loading Data in Julia

### 1. Load common datasets

First, we need some sample data. The `RDatasets` package bundles many common datasets, so we install it for convenience:

In [23]:
using Pkg
Pkg.add("RDatasets")

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`


In [24]:
using RDatasets
df = dataset("datasets", "iris")
first(df, 5)

| Row | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | Cat… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


Here, `first(df, 5)` returns the first five rows of the data frame.

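Beyond `first()`, a few other functions give a quick overview of a data frame. The sketch below is only illustrative: `toy` is a hypothetical hand-built stand-in for the iris data (values copied from the rows shown above), so that the example runs without downloading anything.

```julia
using DataFrames

# Hypothetical stand-in for the iris data, built by hand
toy = DataFrame(
    SepalLength = [5.1, 4.9, 4.7, 4.6, 5.0],
    SepalWidth  = [3.5, 3.0, 3.2, 3.1, 3.6],
    Species     = fill("setosa", 5),
)

nrows, ncols = size(toy)     # number of rows and columns
colnames = names(toy)        # column names as a vector of strings
head = first(toy, 2)         # first two rows
tail = last(toy, 2)          # last two rows
summary_df = describe(toy)   # one summary row per column
```

The same calls should work unchanged on the `df` loaded from `RDatasets`.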
### 2. Load `*.csv` files locally

In [25]:
Pkg.add(["CSV", "DataFrames"])
using CSV, DataFrames  # `DataFrame` is the sink type passed to `CSV.read`

df = CSV.read("./res/data/iris.csv", DataFrame)
first(df, 3)

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`


| Row | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String15 |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |

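To try `CSV.read` without the `./res/data/iris.csv` file at hand, one option is a round trip through a temporary file. This is a minimal sketch: the `toy` frame and the temporary `toy.csv` path are hypothetical, and `select` and `limit` are two of the keyword arguments `CSV.read` accepts for narrowing what is read.

```julia
using CSV, DataFrames

# Hypothetical two-row frame standing in for the iris data
toy = DataFrame(sepal_length = [5.1, 4.9], species = ["setosa", "setosa"])

path = joinpath(mktempdir(), "toy.csv")  # temporary file, removed when Julia exits
CSV.write(path, toy)                     # write the frame to disk
back = CSV.read(path, DataFrame)         # read it back as a DataFrame

# Read only the `species` column and only the first row
partial = CSV.read(path, DataFrame; select = [:species], limit = 1)
```

Passing `DataFrame` as the second argument tells `CSV.read` which table type to build, which is why `DataFrames` has to be loaded as well.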

### 3. Load datasets online

In [26]:
Pkg.add("HTTP")
using HTTP

url = "https://github.com/mwaskom/seaborn-data/raw/master/iris.csv"
response = HTTP.get(url)
df = CSV.read(IOBuffer(response.body), DataFrame)
first(df, 3)

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`


| Row | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String15 |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |


### 4. Create a data frame from scratch

In [27]:
Pkg.add("DataFrames")
using DataFrames

df2 = DataFrame(
  title = ["A", "B", "C"],
  published = [1, 2, 3],
  author = "Rongxin"  # a scalar is repeated for every row
)
first(df2, 3)

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`


| Row | title | published | author |
| --- | --- | --- | --- |
|  | String | Int64 | String |
| 1 | A | 1 | Rongxin |
| 2 | B | 2 | Rongxin |
| 3 | C | 3 | Rongxin |

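A data frame built this way can keep growing afterwards. The following sketch assumes a hypothetical `books` frame shaped like `df2` above; the `is_recent` column is an invented example, not part of the original data.

```julia
using DataFrames

# Start with two rows; the scalar author is repeated for each row
books = DataFrame(title = ["A", "B"], published = [1, 2], author = "Rongxin")

# Append one row, given as a tuple in column order
push!(books, ("C", 3, "Rongxin"))

# Add a derived column (hypothetical) with a broadcast comparison
books.is_recent = books.published .>= 2
```

Note that `push!` modifies `books` in place, so re-running the cell appends the row again.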

## Selecting Data in Julia

### 1. Indexing a subset

![](./res/img/selection/index.png)

We can select a subset with a pair of row and column indices. For example, to select the first two rows with all columns:

In [29]:
df[1:2, :]

| Row | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String15 |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |

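The same row-column pattern covers single cells, single rows, and single columns. Here is a short sketch on a hypothetical three-row `toy` frame; the main thing it illustrates is that `:` as the row index returns a copy of a column, while `!` returns the stored column itself.

```julia
using DataFrames

# Hypothetical stand-in for the iris data
toy = DataFrame(
    sepal_length = [5.1, 4.9, 4.7],
    species      = ["setosa", "setosa", "setosa"],
)

top2   = toy[1:2, :]            # rows 1 to 2, all columns (a new DataFrame)
cell   = toy[2, :sepal_length]  # a single value
row3   = toy[3, :]              # one row, as a DataFrameRow
col    = toy[:, :sepal_length]  # a copy of the column vector
colref = toy[!, :sepal_length]  # the stored column itself, no copy
```

Because `colref` is the stored column, mutating it changes `toy`; mutating `col` does not.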

### 2. Select by column names

![](./res/img/selection/columns.png)

In [38]:
df[:, [:sepal_width, :petal_length]]

| Row | sepal_width | petal_length |
| --- | --- | --- |
|  | Float64 | Float64 |
| 1 | 3.5 | 1.4 |
| 2 | 3.0 | 1.4 |
| 3 | 3.2 | 1.3 |
| 4 | 3.1 | 1.5 |
| 5 | 3.6 | 1.4 |
| 6 | 3.9 | 1.7 |
| 7 | 3.4 | 1.4 |
| 8 | 3.4 | 1.5 |
| 9 | 2.9 | 1.4 |
| 10 | 3.1 | 1.5 |


Even better, we can select columns directly with a regular expression.

For instance, to keep only the columns whose names end with `length`:

![](./res/img/selection/col.reg.png)

In [39]:
df[:, r".*length$"]

| Row | sepal_length | petal_length |
| --- | --- | --- |
|  | Float64 | Float64 |
| 1 | 5.1 | 1.4 |
| 2 | 4.9 | 1.4 |
| 3 | 4.7 | 1.3 |
| 4 | 4.6 | 1.5 |
| 5 | 5.0 | 1.4 |
| 6 | 5.4 | 1.7 |
| 7 | 4.6 | 1.4 |
| 8 | 5.0 | 1.5 |
| 9 | 4.4 | 1.4 |
| 10 | 4.9 | 1.5 |

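Regular expressions are one of several column selectors that `DataFrames` accepts. As a rough sketch on a hypothetical one-row `toy` frame, `Not` and `Between` can be used in the same position, and `select` combines several selectors in one call.

```julia
using DataFrames

# Hypothetical one-row stand-in with the iris column names
toy = DataFrame(
    sepal_length = [5.1], sepal_width = [3.5],
    petal_length = [1.4], petal_width = [0.2],
    species      = ["setosa"],
)

lengths  = toy[:, r"length$"]                            # names ending in "length"
no_label = toy[:, Not(:species)]                         # everything except `species`
sepals   = toy[:, Between(:sepal_length, :sepal_width)]  # a contiguous range of columns
mixed    = select(toy, :species, r"^petal")              # `species` first, then petal columns
```

The order of the selectors passed to `select` determines the column order of the result.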

### 3. Conditional filtering

In data analysis, we often want to subset a data frame according to a condition.

To do so, we define a condition, e.g., the rows whose `species` is `virginica`, and use it as the row index:

![img](./res/img/selection/col.if.png)

In [34]:
condition = df.species .== "virginica"
df[condition, :]

| Row | sepal_length | sepal_width | petal_length | petal_width | species |
| --- | --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 | String15 |
| 1 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
| 2 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 3 | 7.1 | 3.0 | 5.9 | 2.1 | virginica |
| 4 | 6.3 | 2.9 | 5.6 | 1.8 | virginica |
| 5 | 6.5 | 3.0 | 5.8 | 2.2 | virginica |
| 6 | 7.6 | 3.0 | 6.6 | 2.1 | virginica |
| 7 | 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 8 | 7.3 | 2.9 | 6.3 | 1.8 | virginica |
| 9 | 6.7 | 2.5 | 5.8 | 1.8 | virginica |
| 10 | 7.2 | 3.6 | 6.1 | 2.5 | virginica |

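A Boolean mask is not the only way to express a condition. The sketch below, on a hypothetical three-row `toy` frame, shows the mask alongside `filter` and `subset` from `DataFrames`, plus a mask that combines two conditions with `.&`; the 6.0 threshold is an arbitrary value chosen for illustration.

```julia
using DataFrames

# Hypothetical stand-in for the iris data
toy = DataFrame(
    sepal_length = [5.1, 6.3, 5.8],
    species      = ["setosa", "virginica", "virginica"],
)

# 1. Boolean mask, as in the cell above
mask    = toy.species .== "virginica"
by_mask = toy[mask, :]

# 2. `filter` applies a predicate to each value of the named column
by_filter = filter(:species => ==("virginica"), toy)

# 3. `subset` keeps the rows where every condition holds
by_subset = subset(
    toy,
    :species      => ByRow(==("virginica")),
    :sepal_length => ByRow(>(6.0)),
)

# 4. The same two conditions as a combined mask
both = toy[(toy.species .== "virginica") .& (toy.sepal_length .> 6.0), :]
```

The parentheses around each comparison in the combined mask are needed because `.&` binds more tightly than `.==` and `.>`.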

Now that you know how to load and select data frames, it's time to learn how to [transform your data and calculate your variables](https://reynards-org.gitbook.io/data-analysis-in-julia/3.transform.calculate.jl).