# Introduction to Julia DataFrames.jl (1): Getting Started

![](https://juliadata.github.io/DataFrames.jl/stable/assets/logo.png)

DataFrames.jl official site: [https://juliadata.github.io/DataFrames.jl/stable/](https://juliadata.github.io/DataFrames.jl/stable/)

DataFrames.jl GitHub: [https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/index.md](https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/index.md)

## 0. Installation

If DataFrames.jl is not installed yet, run `Pkg.add()` to install it.

In [1]:
using Pkg
Pkg.add(PackageSpec(name="DataFrames", version="0.20.2"))

   Updating registry at `C:\Users\kai\.julia\registries\General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Resolving package versions...
   Updating `C:\Users\kai\.julia\environments\v1.4\Project.toml`
 [no changes]
   Updating `C:\Users\kai\.julia\environments\v1.4\Manifest.toml`
 [no changes]


Check the installed DataFrames version:

In [2]:
Pkg.installed()["DataFrames"]

└ @ Pkg D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Pkg.jl:531


v"0.20.2"

## 1. Creating a DataFrame

In [35]:
using DataFrames

### 1.1 Creating a DataFrame from vectors

In [4]:
df = DataFrame(col1 = 1:5, col2 = ["M", "F", "F", missing, "M"])

| Row | col1 | col2 |
| --- | --- | --- |
|  | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |


The second column, `:col2`, can hold either `String` or `Missing` values, which is indicated by the ⍰ printed after the type name.
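
The ⍰ marker corresponds to a `Union{Missing, String}` element type. As a small sketch using only Base Julia (the variable name `col2` simply mirrors the column above), the same element type appears for a plain vector that mixes strings and `missing`:

```julia
# A vector mixing strings and `missing` gets a Union element type.
col2 = ["M", "F", "F", missing, "M"]

println(eltype(col2))                 # Union{Missing, String}
println(count(ismissing, col2))       # number of missing entries
println(collect(skipmissing(col2)))   # only the non-missing values
```

`skipmissing` is handy when a later computation should ignore the missing entries.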

In [5]:
typeof(df)

DataFrame

In [6]:
dump(df)

DataFrame
  columns: Array{AbstractArray{T,1} where T}((2,))
    1: Array{Int64}((5,)) [1, 2, 3, 4, 5]
    2: Array{Union{Missing, String}}((5,))
      1: String "M"
      2: String "F"
      3: String "F"
      4: Missing missing
      5: String "M"
  colindex: DataFrames.Index
    lookup: Dict{Symbol,Int64}
      slots: Array{UInt8}((16,)) UInt8[0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00]
      keys: Array{Symbol}((16,))
        1: #undef
        2: Symbol col2
        3: #undef
        4: #undef
        5: #undef
        ...
        12: #undef
        13: #undef
        14: Symbol col1
        15: #undef
        16: #undef
      vals: Array{Int64}((16,)) [4853, 2, 3, 818813136, 818812944, 384234704, 384233808, 14, 13, 162350368, 818813520, 0, 818813328, 1, 812933904, 338560864]
      ndel: Int64 0
      count: Int64 2
      age: UInt64 0x0000000000000002
      idxfloor: Int64 1
      maxprobe: Int64 0
    names: Array{Symbol}((2,))

Columns can be directly (i.e. without copying) accessed via `df.col` or `df[!, :col]`. **The latter syntax is more flexible as it allows passing a variable holding the name of the column, and not only a literal name.** Note that column names are symbols (`:col` or `Symbol("col")`) rather than strings (`"col"`). Columns can also be accessed using an integer index specifying their position.

Since **`df[!, :col]` does not make a copy**, changing the elements of the column vector returned by this syntax will affect the values stored in the original df. **To get a copy of the column use `df[:, :col]`**: changing the vector returned by this syntax does not change df.

In [8]:
df.col1

5-element Array{Int64,1}:
 1
 2
 3
 4
 5

In [9]:
df.col2

5-element Array{Union{Missing, String},1}:
 "M"
 "F"
 "F"
 missing
 "M"

In [12]:
df.col1 === df[!, :col1]

true

In [14]:
df.col1 === df[:, :col1]

false

In [16]:
df.col1 == df[:, :col1]

true

In [18]:
df.col1 === df[!, 1]

true

In [19]:
firstcolumn = :col1

:col1

In [20]:
df[!, firstcolumn] === df.col1

true

In [21]:
df[:, firstcolumn] === df.col1

false

In [22]:
df[:, firstcolumn] == df.col1

true
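
The identity checks above show which form copies; the practical consequence appears when the returned vector is modified. The following is a minimal sketch, assuming DataFrames.jl is installed (the name `df_demo` is made up for illustration and is not part of the notebook):

```julia
using DataFrames

df_demo = DataFrame(col1 = [1, 2, 3, 4, 5])

# df_demo[!, :col1] returns the stored column itself, so writes show up in df_demo.
stored = df_demo[!, :col1]
stored[1] = 100

# df_demo[:, :col1] returns a fresh copy, so writes to it leave df_demo untouched.
copied = df_demo[:, :col1]
copied[2] = -1

println(df_demo.col1)   # first element is now 100, second is still 2
```

Use `!` when in-place updates are intended and `:` when the original data must stay protected.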

Column names can be obtained using the `names` function:

In [23]:
names(df)

2-element Array{Symbol,1}:
 :col1
 :col2

In [24]:
typeof(:col1)

Symbol

In [27]:
?Symbol

search: Symbol



```
Symbol
```

The type of object used to represent identifiers in parsed julia code (ASTs). Also often used as a name or label to identify an entity (e.g. as a dictionary key). `Symbol`s can be entered using the `:` quote operator:

```jldoctest
julia> :name
:name

julia> typeof(:name)
Symbol

julia> x = 42
42

julia> eval(:x)
42
```

`Symbol`s can also be constructed from strings or other values by calling the constructor `Symbol(x...)`.

`Symbol`s are immutable and should be compared using `===`. The implementation re-uses the same object for all `Symbol`s with the same name, so comparison tends to be efficient (it can just compare pointers).

Unlike strings, `Symbol`s are "atomic" or "scalar" entities that do not support iteration over characters.

---

```
Symbol(x...) -> Symbol
```

Create a `Symbol` by concatenating the string representations of the arguments together.

**Examples**

```jldoctest
julia> Symbol("my", "name")
:myname

julia> Symbol("day", 4)
:day4
```


### 1.2 Creating a DataFrame column by column

In [28]:
# Create an empty DataFrame with the constructor
df = DataFrame()

In [29]:
# Assign each column and its values to the DataFrame
df.col1 = 1:5
df.col2 = ["M", "F", "F", missing, "M"]

# Display the DataFrame with DataFrames.show(), whose signature is:
# show([io::IO,] df::AbstractDataFrame;
#        allrows::Bool = !get(io, :limit, false),
#        allcols::Bool = !get(io, :limit, false),
#        allgroups::Bool = !get(io, :limit, false),
#        splitcols::Bool = get(io, :limit, false),
#        rowlabel::Symbol = :Row,
#        summary::Bool = true)
show(df)

5×2 DataFrame
│ Row │ col1  │ col2    │
│     │ Int64 │ String⍰ │
├─────┼───────┼─────────┤
│ 1   │ 1     │ M       │
│ 2   │ 2     │ F       │
│ 3   │ 3     │ F       │
│ 4   │ 4     │ missing │
│ 5   │ 5     │ M       │

Use the `size` function to check the number of rows and columns:

In [30]:
size(df, 1)

5

In [31]:
size(df, 2)

2

In [32]:
size(df)

(5, 2)

### 1.3 Adding rows to a DataFrame

A row can be appended to a DataFrame with `push!`; the values may be given as a tuple, a vector, or a dictionary.

In [33]:
# Using a tuple
push!(df, (1, "M"))

| Row | col1 | col2 |
| --- | --- | --- |
|  | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |


In [34]:
# Using a vector
push!(df, [2, "f"])

| Row | col1 | col2 |
| --- | --- | --- |
|  | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |
| 7 | 2 | f |


In [35]:
# Using a Dict
push!(df, Dict(:col2 => "F", :col1 => 2))

| Row | col1 | col2 |
| --- | --- | --- |
|  | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |
| 7 | 2 | f |
| 8 | 2 | F |


### 1.4 Deleting rows

Call `deleterows!()` to delete the specified rows from a DataFrame in place.

In [36]:
deleterows!(df, 7:8)

| Row | col1 | col2 |
| --- | --- | --- |
|  | Int64 | String⍰ |
| 1 | 1 | M |
| 2 | 2 | F |
| 3 | 3 | F |
| 4 | 4 | missing |
| 5 | 5 | M |
| 6 | 1 | M |
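
`deleterows!` belongs to the DataFrames.jl 0.20 API used in this notebook. As far as I know, later releases renamed it (first to `delete!`, then to `deleteat!`), so check the documentation of the installed version. A sketch that avoids the version-specific name is to build a new DataFrame from the rows to keep, using plain row indexing (`df_rows`, `keep`, and `trimmed` are illustrative names, not part of the notebook):

```julia
using DataFrames

df_rows = DataFrame(col1 = [1, 2, 3, 4, 5, 1, 2, 2],
                    col2 = ["M", "F", "F", missing, "M", "M", "f", "F"])

# Keep every row index except 7 and 8, then index with the remaining rows.
keep = setdiff(1:nrow(df_rows), 7:8)
trimmed = df_rows[keep, :]

println(size(trimmed))   # (6, 2)
println(size(df_rows))   # (8, 2): the original is unchanged
```

Unlike `deleterows!`, this does not modify the original DataFrame, which can be preferable when the full data is still needed.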


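### 1.5 Loading a dataset

Continuing from the previous examples, we use CSV.jl to load the Auto MPG Data Set from the UCI Machine Learning Repository. The loaded dataset is a `DataFrame`.

If CSV.jl is not installed yet, install it first.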

In [37]:
Pkg.update()
Pkg.add("CSV")

   Updating registry at `C:\Users\kai\.julia\registries\General`
   Updating git-repo `https://github.com/JuliaRegistries/General.git`
  Installed DataAPI ───────── v1.2.0
  Installed ZipFile ───────── v0.9.1
  Installed RecipesPipeline ─ v0.1.4
  Installed IJulia ────────── v1.21.2
  Installed MbedTLS ───────── v1.0.2
  Installed MbedTLS_jll ───── v2.16.0+2
  Installed SQLite_jll ────── v3.31.1+0
  Installed Plots ─────────── v1.0.12
  Installed DataStructures ── v0.17.13
  Installed SQLite ────────── v1.0.3
  Installed Parsers ───────── v1.0.2
  Installed DBInterface ───── v2.0.0
  Installed CSV ───────────── v0.6.1
   Updating `C:\Users\kai\.julia\environments\v1.4\Project.toml`
  [336ed68f] ↑ CSV v0.5.26 ⇒ v0.6.1
  [7073ff75] ↑ IJulia v1.21.1 ⇒ v1.21.2
  [91a5bcdd] ↑ Plots v1.0.9 ⇒ v1.0.12

In [2]:
Pkg.installed()["CSV"]

└ @ Pkg D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.4\Pkg\src\Pkg.jl:531


v"0.6.1"

In [3]:
using CSV

┌ Info: Precompiling CSV [336ed68f-0bac-5ca0-87d4-7b16caf5d00b]
└ @ Base loading.jl:1260


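Use the `read()` function of CSV.jl to load the CSV data as a DataFrame. In the CSV.jl version used here (0.6.1), `CSV.read()` returns a `DataFrame`.

Warning messages are expected at this step: the dataset contains missing values, so warnings are printed while the CSV file is loaded.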
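
Missing values in a file can also be declared explicitly when reading. The following is a minimal sketch, assuming CSV.jl and DataFrames.jl are installed; the sample rows are made up, and `path` and `sample` are illustrative names. It builds the DataFrame from `CSV.File`, which sidesteps the fact that the calling convention of `CSV.read` differs between CSV.jl versions:

```julia
using CSV, DataFrames

# Write a tiny comma-separated sample to a temporary file (made-up rows).
path = tempname() * ".csv"
write(path, """
mpg,horsepower,model year
18.0,130.0,70
25.0,?,71
""")

# Treat "?" as the marker for a missing value while parsing.
sample = DataFrame(CSV.File(path; missingstring = "?"))

println(sample.horsepower)                 # second entry is missing
println(sample[!, Symbol("model year")])   # a name with a space, addressed via Symbol("...")
```

With the marker declared, `horsepower` is parsed as a numeric column with a missing entry instead of falling back to strings.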

In [4]:
df = CSV.read("auto-mpg.data", delim=',')



| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |
| 6 | 15.0 | 8 | 429.0 | 198.0 | 4341.0 | 10.0 | 70.0 |
| 7 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70.0 |
| 8 | 14.0 | 8 | 440.0 | 215.0 | 4312.0 | 8.5 | 70.0 |
| 9 | 14.0 | 8 | 455.0 | 225.0 | 4425.0 | 10.0 | 70.0 |
| 10 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70.0 |


In [5]:
# To display every column or row, use the `show()` function.
# The following shows all columns for rows 1 to 5.
show(df[1:5, :], allcols=true)

5×9 DataFrames.DataFrame
│ Row │ mpg     │ cylinders │ displacement │ horsepower │ weight   │
│     │ Float64 │ Int64⍰    │ Float64      │ String     │ Float64⍰ │
├─────┼─────────┼───────────┼──────────────┼────────────┼──────────┤
│ 1   │ 18.0    │ 8         │ 307.0        │ 130.0      │ 3504.0   │
│ 2   │ 15.0    │ 8         │ 350.0        │ 165.0      │ 3693.0   │
│ 3   │ 18.0    │ 8         │ 318.0        │ 150.0      │ 3436.0   │
│ 4   │ 16.0    │ 8         │ 304.0        │ 150.0      │ 3433.0   │
│ 5   │ 17.0    │ 8         │ 302.0        │ 140.0      │ 3449.0   │

│ Row │ acceleration │ model year │ origin  │ car name                  │
│     │ Float64⍰     │ Float64    │ Float64 │ String                    │
├─────┼──────────────┼────────────┼─────────┼───────────────────────────┤
│ 1   │ 12.0         │ 70.0       │ 1.0     │ chevrolet chevelle malibu │
│ 2   │ 11.5         │ 70.0       │ 

In [8]:
df[[1, 5, 10], :]

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |
| 3 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70.0 |


In [12]:
names(df)

9-element Array{Symbol,1}:
 :mpg
 :cylinders
 :displacement
 :horsepower
 :weight
 :acceleration
 Symbol("model year")
 :origin
 Symbol("car name")

In [14]:
df[1:5, [Symbol("model year")]]  # return DataFrame

| Row | model year |
| --- | --- |
|  | Float64 |
| 1 | 70.0 |
| 2 | 70.0 |
| 3 | 70.0 |
| 4 | 70.0 |
| 5 | 70.0 |


In [15]:
df[1:5, Symbol("model year")]  # return Array

5-element Array{Float64,1}:
 70.0
 70.0
 70.0
 70.0
 70.0

In [16]:
df[1:5, [Symbol("model year"), :horsepower]]  # return DataFrame

| Row | model year | horsepower |
| --- | --- | --- |
|  | Float64 | String |
| 1 | 70.0 | 130.0 |
| 2 | 70.0 | 165.0 |
| 3 | 70.0 | 150.0 |
| 4 | 70.0 | 150.0 |
| 5 | 70.0 | 140.0 |


In [31]:
df[df.mpg .> 36, :]

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 43.1 | 4 | 90.0 | 48.00 | 1985.0 | 21.5 | 78.0 |
| 2 | 36.1 | 4 | 98.0 | 66.00 | 1800.0 | 14.4 | 78.0 |
| 3 | 39.4 | 4 | 85.0 | 70.00 | 2070.0 | 18.6 | 78.0 |
| 4 | 36.1 | 4 | 91.0 | 60.00 | 1800.0 | 16.4 | 78.0 |
| 5 | 37.3 | 4 | 91.0 | 69.00 | 2130.0 | 14.7 | 79.0 |
| 6 | 41.5 | 4 | 98.0 | 76.00 | 2144.0 | 14.7 | 80.0 |
| 7 | 38.1 | 4 | 89.0 | 60.00 | 1968.0 | 18.8 | 80.0 |
| 8 | 37.2 | 4 | 86.0 | 65.00 | 2019.0 | 16.4 | 80.0 |
| 9 | 37.0 | 4 | 119.0 | 92.00 | 2434.0 | 15.0 | 80.0 |
| 10 | 46.6 | 4 | 86.0 | 65.00 | 2110.0 | 17.9 | 80.0 |


In [30]:
df[(df.mpg .> 36) .& (90 .< df.displacement .< 100), :]

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 36.1 | 4 | 98.0 | 66.0 | 1800.0 | 14.4 | 78.0 |
| 2 | 36.1 | 4 | 91.0 | 60.0 | 1800.0 | 16.4 | 78.0 |
| 3 | 37.3 | 4 | 91.0 | 69.0 | 2130.0 | 14.7 | 79.0 |
| 4 | 41.5 | 4 | 98.0 | 76.0 | 2144.0 | 14.7 | 80.0 |
| 5 | 44.6 | 4 | 91.0 | 67.0 | 1850.0 | 13.8 | 80.0 |
| 6 | 37.0 | 4 | 91.0 | 68.0 | 2025.0 | 18.2 | 82.0 |
| 7 | 38.0 | 4 | 91.0 | 67.0 | 1965.0 | 15.0 | 82.0 |
| 8 | 38.0 | 4 | 91.0 | 67.0 | 1995.0 | 16.2 | 82.0 |
| 9 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82.0 |


### 1.6 Copying a DataFrame

Call `copy()` to create a new DataFrame that is a copy of the original.

In [32]:
df2 = copy(df)

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |
| 6 | 15.0 | 8 | 429.0 | 198.0 | 4341.0 | 10.0 | 70.0 |
| 7 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70.0 |
| 8 | 14.0 | 8 | 440.0 | 215.0 | 4312.0 | 8.5 | 70.0 |
| 9 | 14.0 | 8 | 455.0 | 225.0 | 4425.0 | 10.0 | 70.0 |
| 10 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70.0 |


In [50]:
# df2 and df have the same contents
isequal(df2, df) 

true

In [49]:
# The contents are the same, but == returns missing because the data contains missing values
df2 == df  # see References

missing
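
The `missing` result follows from Julia's three-valued logic and is not specific to DataFrames. A short sketch with plain vectors (Base Julia only; `a` and `b` are illustrative names):

```julia
a = [1, missing, 3]
b = [1, missing, 3]

println(a == b)          # missing: comparing missing with missing is undecided
println(isequal(a, b))   # true: isequal treats missing as equal to missing
println([1, missing] == [2, missing])   # false: a definite mismatch decides the result
```

For a yes/no answer on data that may contain missing values, `isequal` is therefore the safer choice.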

## 2. Saving a DataFrame to a CSV file

In [51]:
CSV.write("a.csv", df)

"a.csv"

The directory listing shows that the CSV file has been written:

In [52]:
println(readdir())
"a.csv" in readdir()

[".ipynb_checkpoints", "04-02-2020.csv", "a.csv", "auto-mpg.data", "julia_017_hw.ipynb", "julia_017_practice.ipynb"]


true

In [53]:
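# Use Julia's built-in DelimitedFiles library
# to check the file header and the first 5 data rows.
# Note: readdlm splits on whitespace by default, so each element below
# is only the part of a line before its first space.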
using DelimitedFiles

readdlm("a.csv")[1:6]

6-element Array{Any,1}:
 "mpg,cylinders,displacement,horsepower,weight,acceleration,model"
 "18.0,8,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet"
 "15.0,8,350.0,165.0,3693.0,11.5,70.0,1.0,buick"
 "18.0,8,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth"
 "16.0,8,304.0,150.0,3433.0,12.0,70.0,1.0,amc"
 "17.0,8,302.0,140.0,3449.0,10.5,70.0,1.0,ford"

## 3. Working with DataFrames

### 3.1 Inspecting a DataFrame

In [54]:
# Check the dimensions of the DataFrame
size(df)

(398, 9)

In [55]:
# Summarize the DataFrame
describe(df)

| Row | variable | mean | min | median | max | nunique | nmissing | eltype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | Symbol | Union… | Any | Union… | Any | Union… | Union… | Type |
| 1 | mpg | 23.5146 | 9.0 | 23.0 | 46.6 |  |  | Float64 |
| 2 | cylinders | 5.44836 | 3.0 | 4.0 | 8 |  | 1.0 | Union{Missing, Int64} |
| 3 | displacement | 192.682 | 8.0 | 146.0 | 455.0 |  |  | Float64 |
| 4 | horsepower |  | 304.0 |  | ? | 94.0 |  | String |
| 5 | weight | 2966.01 | 193.0 | 2797.5 | 5140.0 |  | 6.0 | Union{Missing, Float64} |
| 6 | acceleration | 27.5656 | 8.0 | 15.5 | 4732.0 |  | 6.0 | Union{Missing, Float64} |
| 7 | model year | 112.433 | 18.5 | 76.0 | 3035.0 |  |  | Float64 |
| 8 | origin | 1.98719 | 1.0 | 1.0 | 70.0 |  |  | Float64 |
| 9 | car name |  | 71.0 |  | vw rabbit custom | 306.0 |  | String |


Each of the following three expressions lists all rows and columns:

In [21]:
df
df[!, :]
df[:, :]

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |
| 6 | 15.0 | 8 | 429.0 | 198.0 | 4341.0 | 10.0 | 70.0 |
| 7 | 14.0 | 8 | 454.0 | 220.0 | 4354.0 | 9.0 | 70.0 |
| 8 | 14.0 | 8 | 440.0 | 215.0 | 4312.0 | 8.5 | 70.0 |
| 9 | 14.0 | 8 | 455.0 | 225.0 | 4425.0 | 10.0 | 70.0 |
| 10 | 15.0 | 8 | 390.0 | 190.0 | 3850.0 | 8.5 | 70.0 |


In [57]:
df === df[!, :]

false

In [58]:
df === df[:, :]

false

In [59]:
df[:, !]  # column does not support `!`

MethodError: MethodError: no method matching getindex(::DataFrame, ::Colon, ::typeof(!))
Closest candidates are:
  getindex(::DataFrame, ::Colon) at deprecated.jl:65
  getindex(::DataFrame, ::Colon, !Matched::Union{Signed, Symbol, Unsigned}) at C:\Users\kai\.julia\packages\DataFrames\S3ZFo\src\dataframe\dataframe.jl:346
  getindex(::DataFrame, ::Colon, !Matched::Union{Colon, Regex, AbstractArray{T,1} where T, All, Between, InvertedIndex}) at C:\Users\kai\.julia\packages\DataFrames\S3ZFo\src\dataframe\dataframe.jl:395
  ...

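In a Jupyter Notebook, the default display shows only as much data as fits on the screen, so not every column and row may appear. The `show()` function gives explicit control over the display. The example below sets `allcols=true` and `allrows=true` to show all columns and rows.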

In [60]:
show(df, allcols=true, allrows=true)

398×9 DataFrame
│ Row │ mpg     │ cylinders │ displacement │ horsepower │ weight   │
│     │ Float64 │ Int64⍰    │ Float64      │ String     │ Float64⍰ │
├─────┼─────────┼───────────┼──────────────┼────────────┼──────────┤
│ 1   │ 18.0    │ 8         │ 307.0        │ 130.0      │ 3504.0   │
│ 2   │ 15.0    │ 8         │ 350.0        │ 165.0      │ 3693.0   │
│ 3   │ 18.0    │ 8         │ 318.0        │ 150.0      │ 3436.0   │
│ 4   │ 16.0    │ 8         │ 304.0        │ 150.0      │ 3433.0   │
│ 5   │ 17.0    │ 8         │ 302.0        │ 140.0      │ 3449.0   │
│ 6   │ 15.0    │ 8         │ 429.0        │ 198.0      │ 4341.0   │
│ 7   │ 14.0    │ 8         │ 454.0        │ 220.0      │ 4354.0   │
│ 8   │ 14.0    │ 8         │ 440.0        │ 215.0      │ 4312.0   │
│ 9   │ 14.0    │ 8         │ 455.0        │ 225.0      │ 4425.0   │
│ 10  │ 15.0    │ 8         │ 390.0        │ 190.0      │ 3850.0   │
│ 11  │ 15.0    │ 8         │ 383.0  

│ 164 │ 18.0    │ 6         │ 225.0        │ 95.00      │ 3785.0   │
│ 165 │ 21.0    │ 6         │ 231.0        │ 110.0      │ 3039.0   │
│ 166 │ 20.0    │ 8         │ 262.0        │ 110.0      │ 3221.0   │
│ 167 │ 13.0    │ 8         │ 302.0        │ 129.0      │ 3169.0   │
│ 168 │ 29.0    │ 4         │ 97.0         │ 75.00      │ 2171.0   │
│ 169 │ 23.0    │ 4         │ 140.0        │ 83.00      │ 2639.0   │
│ 170 │ 20.0    │ 6         │ 232.0        │ 100.0      │ 2914.0   │
│ 171 │ 23.0    │ 4         │ 140.0        │ 78.00      │ 2592.0   │
│ 172 │ 24.0    │ 4         │ 134.0        │ 96.00      │ 2702.0   │
│ 173 │ 25.0    │ 4         │ 90.0         │ 71.00      │ 2223.0   │
│ 174 │ 24.0    │ 4         │ 119.0        │ 97.00      │ 2545.0   │
│ 175 │ 18.0    │ 6         │ 171.0        │ 97.00      │ 2984.0   │
│ 176 │ 29.0    │ 4         │ 90.0         │ 70.00      │ 1937.0   │
│ 177 │ 19.0    │ 6         │ 232.0        │ 90.00      │ 3211.0   │
│ 178 │ 23.0    │ 4         │ 115.

│ 284 │ 20.2    │ 6         │ 232.0        │ 90.00      │ 3265.0   │
│ 285 │ 20.6    │ 6         │ 225.0        │ 110.0      │ 3360.0   │
│ 286 │ 17.0    │ 8         │ 305.0        │ 130.0      │ 3840.0   │
│ 287 │ 17.6    │ 8         │ 302.0        │ 129.0      │ 3725.0   │
│ 288 │ 16.5    │ 8         │ 351.0        │ 138.0      │ 3955.0   │
│ 289 │ 18.2    │ 8         │ 318.0        │ 135.0      │ 3830.0   │
│ 290 │ 16.9    │ 8         │ 350.0        │ 155.0      │ 4360.0   │
│ 291 │ 15.5    │ 8         │ 351.0        │ 142.0      │ 4054.0   │
│ 292 │ 19.2    │ 8         │ 267.0        │ 125.0      │ 3605.0   │
│ 293 │ 18.5    │ 8         │ 360.0        │ 150.0      │ 3940.0   │
│ 294 │ 31.9    │ 4         │ 89.0         │ 71.00      │ 1925.0   │
│ 295 │ 34.1    │ 4         │ 86.0         │ 65.00      │ 1975.0   │
│ 296 │ 35.7    │ 4         │ 98.0         │ 80.00      │ 1915.0   │
│ 297 │ 27.4    │ 4         │ 121.0        │ 80.00      │ 2670.0   │
│ 298 │ 25.4    │ 5         │ 183.

│ 100 │ 16.0         │ 73.0       │ 1.0     │
│ 101 │ 16.5         │ 73.0       │ 1.0     │
│ 102 │ 16.0         │ 73.0       │ 1.0     │
│ 103 │ 21.0         │ 73.0       │ 2.0     │
│ 104 │ 14.0         │ 73.0       │ 1.0     │
│ 105 │ 12.5         │ 73.0       │ 1.0     │
│ 106 │ 13.0         │ 73.0       │ 1.0     │
│ 107 │ 12.5         │ 73.0       │ 1.0     │
│ 108 │ 15.0         │ 73.0       │ 1.0     │
│ 109 │ 19.0         │ 73.0       │ 3.0     │
│ 110 │ 19.5         │ 73.0       │ 1.0     │
│ 111 │ 16.5         │ 73.0       │ 3.0     │
│ 112 │ 13.5         │ 73.0       │ 3.0     │
│ 113 │ 18.5         │ 73.0       │ 1.0     │
│ 114 │ 14.0         │ 73.0       │ 1.0     │
│ 115 │ 15.5         │ 73.0       │ 2.0     │
│ 116 │ 13.0         │ 73.0       │ 1.0     │
│ 117 │ 9.5          │ 73.0       │ 1.0     │
│ 118 │ 19.5         │ 73.0       │ 2.0     │
│ 119 │ 15.5         │ 73.0       │ 2.0     │
│ 120 │ 14.0         │ 73.0       │ 2.0     │
│ 121 │ 15.5         │ 73.0       

│ 298 │ 20.1         │ 79.0       │ 2.0     │
│ 299 │ 17.4         │ 79.0       │ 1.0     │
│ 300 │ 24.8         │ 79.0       │ 2.0     │
│ 301 │ 22.2         │ 79.0       │ 1.0     │
│ 302 │ 13.2         │ 79.0       │ 1.0     │
│ 303 │ 14.9         │ 79.0       │ 1.0     │
│ 304 │ 19.2         │ 79.0       │ 3.0     │
│ 305 │ 14.7         │ 79.0       │ 2.0     │
│ 306 │ 16.0         │ 79.0       │ 1.0     │
│ 307 │ 11.3         │ 79.0       │ 1.0     │
│ 308 │ 12.9         │ 79.0       │ 1.0     │
│ 309 │ 13.2         │ 79.0       │ 1.0     │
│ 310 │ 14.7         │ 80.0       │ 2.0     │
│ 311 │ 18.8         │ 80.0       │ 3.0     │
│ 312 │ 15.5         │ 80.0       │ 1.0     │
│ 313 │ 16.4         │ 80.0       │ 3.0     │
│ 314 │ 16.5         │ 80.0       │ 1.0     │
│ 315 │ 18.1         │ 80.0       │ 1.0     │
│ 316 │ 20.1         │ 80.0       │ 1.0     │
│ 317 │ 18.7         │ 80.0       │ 1.0     │
│ 318 │ 15.8         │ 80.0       │ 2.0     │
│ 319 │ 15.5         │ 80.0       

│ 111 │ datsun 610                           │
│ 112 │ maxda rx3                            │
│ 113 │ ford pinto                           │
│ 114 │ mercury capri v6                     │
│ 115 │ fiat 124 sport coupe                 │
│ 116 │ chevrolet monte carlo s              │
│ 117 │ pontiac grand prix                   │
│ 118 │ fiat 128                             │
│ 119 │ opel manta                           │
│ 120 │ audi 100ls                           │
│ 121 │ volvo 144ea                          │
│ 122 │ dodge dart custom                    │
│ 123 │ saab 99le                            │
│ 124 │ toyota mark ii                       │
│ 125 │ oldsmobile omega                     │
│ 126 │ plymouth duster                      │
│ 127 │  74                                  │
│ 128 │ amc hornet                           │
│ 129 │ chevrolet nova                       │
│ 130 │ datsun b210                          │
│ 131 │ ford pinto                           │
│ 132 │ toyot

│ 343 │ plymouth reliant                     │
│ 344 │ toyota starlet                       │
│ 345 │ plymouth champ                       │
│ 346 │ honda civic 1300                     │
│ 347 │ subaru                               │
│ 348 │ datsun 210 mpg                       │
│ 349 │ toyota tercel                        │
│ 350 │ mazda glc 4                          │
│ 351 │ plymouth horizon 4                   │
│ 352 │ ford escort 4w                       │
│ 353 │ ford escort 2h                       │
│ 354 │ volkswagen jetta                     │
│ 355 │  81                                  │
│ 356 │ honda prelude                        │
│ 357 │ toyota corolla                       │
│ 358 │ datsun 200sx                         │
│ 359 │ mazda 626                            │
│ 360 │ peugeot 505s turbo diesel            │
│ 361 │ volvo diesel                         │
│ 362 │ toyota cressida                      │
│ 363 │ datsun 810 maxima                    │
│ 364 │ buick

The `first()` and `last()` functions return the first n or the last n rows of a DataFrame.

In [61]:
first(df, 5)

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |


In [62]:
last(df, 10)

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 26.0 | 4 | 156.0 | 92.0 | 2585.0 | 14.5 | 82.0 |
| 2 | 22.0 | 6 | 232.0 | 112.0 | 2835.0 | 14.7 | 82.0 |
| 3 | 32.0 | 4 | 144.0 | 96.0 | 2665.0 | 13.9 | 82.0 |
| 4 | 36.0 | 4 | 135.0 | 84.0 | 2370.0 | 13.0 | 82.0 |
| 5 | 27.0 | 4 | 151.0 | 90.0 | 2950.0 | 17.3 | 82.0 |
| 6 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82.0 |
| 7 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82.0 |
| 8 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82.0 |
| 9 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82.0 |
| 10 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82.0 |


### 3.2 Subsetting a DataFrame

To view a subset of a DataFrame, use `df[<row index>, <column index>]`.

In [63]:
df[1:5, :]

| Row | mpg | cylinders | displacement | horsepower | weight | acceleration | model year |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 | String | Float64⍰ | Float64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70.0 |
| 2 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70.0 |
| 3 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70.0 |
| 4 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70.0 |
| 5 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70.0 |


In [64]:
# As before, show all columns
show(df[1:5, :], allcols=true)

5×9 DataFrame
│ Row │ mpg     │ cylinders │ displacement │ horsepower │ weight   │
│     │ Float64 │ Int64⍰    │ Float64      │ String     │ Float64⍰ │
├─────┼─────────┼───────────┼──────────────┼────────────┼──────────┤
│ 1   │ 18.0    │ 8         │ 307.0        │ 130.0      │ 3504.0   │
│ 2   │ 15.0    │ 8         │ 350.0        │ 165.0      │ 3693.0   │
│ 3   │ 18.0    │ 8         │ 318.0        │ 150.0      │ 3436.0   │
│ 4   │ 16.0    │ 8         │ 304.0        │ 150.0      │ 3433.0   │
│ 5   │ 17.0    │ 8         │ 302.0        │ 140.0      │ 3449.0   │

│ Row │ acceleration │ model year │ origin  │ car name                  │
│     │ Float64⍰     │ Float64    │ Float64 │ String                    │
├─────┼──────────────┼────────────┼─────────┼───────────────────────────┤
│ 1   │ 12.0         │ 70.0       │ 1.0     │ chevrolet chevelle malibu │
│ 2   │ 11.5         │ 70.0       │ 1.0     │ b

In [65]:
# Specific rows and columns can be selected
df[[1, 3, 5], [1, 2, 9]]

| Row | mpg | cylinders | car name |
| --- | --- | --- | --- |
|  | Float64 | Int64⍰ | String |
| 1 | 18.0 | 8 | chevrolet chevelle malibu |
| 2 | 18.0 | 8 | plymouth satellite |
| 3 | 17.0 | 8 | ford torino |


Columns can be specified by index or by name. A name is written as `:` followed by the column name, as shown below.

An identifier prefixed with `:` has type `Symbol`.

In [66]:
df[1:5, [:mpg, :displacement, :horsepower]]

| Row | mpg | displacement | horsepower |
| --- | --- | --- | --- |
|  | Float64 | Float64 | String |
| 1 | 18.0 | 307.0 | 130.0 |
| 2 | 15.0 | 350.0 | 165.0 |
| 3 | 18.0 | 318.0 | 150.0 |
| 4 | 16.0 | 304.0 | 150.0 |
| 5 | 17.0 | 302.0 | 140.0 |
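
The `:name` shorthand only works for names that are valid Julia identifiers; a column such as `model year` has to be built with the `Symbol` constructor. A small sketch using only Base Julia (the variable `name` is illustrative):

```julia
name = Symbol("model year")   # not a valid identifier, so the constructor is required

println(:mpg isa Symbol)          # true
println(:mpg === Symbol("mpg"))   # true: both spellings give the same Symbol
println(String(name))             # model year
println(Symbol("col", 1))         # col1
```

The last line shows that the constructor concatenates its arguments, which is convenient for generating column names in a loop.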


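### 3.3 `select()` and `select!()`

To select columns of a DataFrame, use `select()` or `select!()`. The difference is that `select()` leaves the original DataFrame unchanged and returns a new DataFrame containing the selected columns, whereas `select!()` modifies the original DataFrame in place.

In [67]:
select(df2, 1:3)

| Row | mpg | cylinders | displacement |
| --- | --- | --- | --- |
|  | Float64 | Int64⍰ | Float64 |
| 1 | 18.0 | 8 | 307.0 |
| 2 | 15.0 | 8 | 350.0 |
| 3 | 18.0 | 8 | 318.0 |
| 4 | 16.0 | 8 | 304.0 |
| 5 | 17.0 | 8 | 302.0 |
| 6 | 15.0 | 8 | 429.0 |
| 7 | 14.0 | 8 | 454.0 |
| 8 | 14.0 | 8 | 440.0 |
| 9 | 14.0 | 8 | 455.0 |
| 10 | 15.0 | 8 | 390.0 |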


In [68]:
# df2 is unchanged
size(df2)

(398, 9)

In [69]:
# After calling select!(), df2 is modified and keeps only the 3 selected columns
select!(df2, 1:3)
size(df2)

(398, 3)

### 3.4 Column operations

#### Aggregate

The `aggregate` function applies one or more functions to each column. For example, the demonstration below computes the mean and median of fuel economy (mpg) and engine displacement. The `mean` and `median` functions come from the Statistics module.

In [70]:
df3 = df[1:5, [:mpg, :displacement]]

| Row | mpg | displacement |
| --- | --- | --- |
|  | Float64 | Float64 |
| 1 | 18.0 | 307.0 |
| 2 | 15.0 | 350.0 |
| 3 | 18.0 | 318.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |


In [71]:
using Statistics

aggregate(df3, [mean, median])

| Row | mpg_mean | displacement_mean | mpg_median | displacement_median |
| --- | --- | --- | --- | --- |
|  | Float64 | Float64 | Float64 | Float64 |
| 1 | 16.8 | 316.2 | 17.0 | 307.0 |
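
As far as I know, `aggregate` is specific to older DataFrames.jl releases such as the 0.20.2 used here and is not available in the 1.x series. The same numbers can be cross-checked without DataFrames at all. The sketch below copies the five rows of `df3` into plain vectors and applies each function to each column, using only the Statistics standard library (`summary_row` is an illustrative name):

```julia
using Statistics

mpg = [18.0, 15.0, 18.0, 16.0, 17.0]
displacement = [307.0, 350.0, 318.0, 304.0, 302.0]

# Apply each function to each column, mirroring the column names of the output above.
summary_row = (mpg_mean = mean(mpg),
               displacement_mean = mean(displacement),
               mpg_median = median(mpg),
               displacement_median = median(displacement))

println(summary_row)
```

The printed values agree with the `aggregate` output: 16.8, 316.2, 17.0, and 307.0.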


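#### A brief introduction to sorting

Sorting is covered in more detail in later material.

`sort()` returns a sorted copy and does not modify the original DataFrame.

In [72]:
sort(df3)

| Row | mpg | displacement |
| --- | --- | --- |
|  | Float64 | Float64 |
| 1 | 15.0 | 350.0 |
| 2 | 16.0 | 304.0 |
| 3 | 17.0 | 302.0 |
| 4 | 18.0 | 307.0 |
| 5 | 18.0 | 318.0 |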


In [73]:
df3

| Row | mpg | displacement |
| --- | --- | --- |
|  | Float64 | Float64 |
| 1 | 18.0 | 307.0 |
| 2 | 15.0 | 350.0 |
| 3 | 18.0 | 318.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |


`sort!()` sorts the original DataFrame in place.

The following example sorts by displacement in descending order.

In [74]:
sort!(df3, :displacement, rev=true)

| Row | mpg | displacement |
| --- | --- | --- |
|  | Float64 | Float64 |
| 1 | 15.0 | 350.0 |
| 2 | 18.0 | 318.0 |
| 3 | 18.0 | 307.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |


In [75]:
df3

| Row | mpg | displacement |
| --- | --- | --- |
|  | Float64 | Float64 |
| 1 | 15.0 | 350.0 |
| 2 | 18.0 | 318.0 |
| 3 | 18.0 | 307.0 |
| 4 | 16.0 | 304.0 |
| 5 | 17.0 | 302.0 |


## References
- Marathon example notebook
- [DataFrames.jl Documentation](https://juliadata.github.io/DataFrames.jl/stable/)
- [DataFrames.jl GitHub](https://github.com/JuliaData/DataFrames.jl/blob/master/docs/src/index.md)
- [Statistics](https://docs.julialang.org/en/v1/stdlib/Statistics/)
- [CSV.jl Documentation](https://juliadata.github.io/CSV.jl/stable/)
- [Various constructors and equality for DataFrame](https://discourse.julialang.org/t/various-constructors-and-equality-for-dataframe/1551)
- [Comparing two dataframes with missing values](https://github.com/JuliaData/DataFrames.jl/issues/1420)