In [8]:
# To ensure Chinese characters are displayed correctly
options(encoding = "UTF-8")
Sys.setlocale("LC_CTYPE", "zh_TW.UTF-8")

# Read progress file

In [34]:
flights = readRDS("data/flights.rds")

In [3]:
str(flights)

List of 1
 $ data:List of 1
  ..$ :List of 4
  .. ..$ file      : chr "data/international_flights.json"
  .. ..$ meta      :List of 2
  .. .. ..$ name       : chr "國際航空定期時刻表"
  .. .. ..$ source_link: chr "https://data.gov.tw/dataset/161167"
  .. ..$ data_frame:'data.frame':	4941 obs. of  20 variables:
  .. .. ..$ AirlineID         : Factor w/ 74 levels "3U","5J","7C",..: 1 1 1 1 1 1 1 1 1 1 ...
  .. .. ..$ ScheduleStartDate : chr [1:4941] "2023-10-13" "2023-10-20" "2023-10-27" "2023-10-13" ...
  .. .. ..$ ScheduleEndDate   : chr [1:4941] "2023-10-15" "2023-10-22" "2023-10-27" "2023-10-15" ...
  .. .. ..$ FlightNumber      : chr [1:4941] "3U3783" "3U3783" "3U3783" "3U3784" ...
  .. .. ..$ DepartureAirportID: chr [1:4941] "CKG" "CKG" "CKG" "TSA" ...
  .. .. ..$ DepartureTime     : chr [1:4941] "15:00" "15:00" "15:00" "19:00" ...
  .. .. ..$ CodeShare         :List of 4941
  .. .. .. ..$ :'data.frame':	0 obs. of  0 variables
 

In [14]:
flights$data[[1]]$meta$name

In [4]:
flightsData <- flights$data[[1]]$data_frame

In [None]:
flightsData$AirlineID

# Preliminary data observations

We need to describe the variables in `flightsData` that we want to use in our analysis.

## AirlineID

- factor data type
  - describe levels
  - table
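
As a minimal sketch of what "describe levels" and "table" mean for a factor, the following uses a small made-up vector. The name `toy_airline` and its values are hypothetical and are not taken from `flightsData`.

```r
# Hypothetical toy data, not from flightsData
toy_airline <- factor(c("CI", "BR", "CI", "JX", "CI", "BR"))

# levels(): the distinct categories stored in the factor
toy_levels <- levels(toy_airline)
print(toy_levels)

# table(): how many observations fall into each level
toy_counts <- table(toy_airline)
print(toy_counts)
```

The same two calls applied to `flightsData$AirlineID` give the number of airlines and the number of scheduled flights per airline.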

In [11]:
levels(flightsData$AirlineID) |> length()





There are 74 different airlines.
The top 10 operators are:


In [12]:
table(flightsData$AirlineID) |> sort(decreasing = TRUE) |> head(10)


 CI  BR  JX  IT  TG  SQ  NH  CA  NZ  JL 
801 730 304 277 220 143 116 112 111 101 

About the `|>` pipe operator:

- `a |> fun()` is the same as `fun(a)`
- `a |> fun(b)` is the same as `fun(a, b)`

By default, the pipe passes its left-hand side as the first argument of the function call on its right.
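
The equivalences above can be checked on a toy vector. This is only a sketch: `x` is a hypothetical example, and the native pipe `|>` assumes R 4.1 or later.

```r
# Hypothetical toy vector
x <- c(3, 1, 2)

# a |> fun() is fun(a)
r1 <- x |> sort()

# a |> fun(b) is fun(a, b); here b is the argument decreasing = TRUE
r2 <- x |> sort(decreasing = TRUE)

# Chained pipes read left to right: head(sort(x, decreasing = TRUE), 2)
r3 <- x |> sort(decreasing = TRUE) |> head(2)

print(r1)
print(r2)
print(r3)
```

Reading the chain from left to right ("take `x`, sort it, keep the first two") is usually easier than reading the nested call from the inside out.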

What are those airline names? 

In [16]:

data2 <- list(
  meta = list(
    name = "航空公司統一代碼",
    source_link ="https://data.gov.tw/dataset/8088"
  ),
  file = "data/airlines.json"
)

flights$data[[2]] <- data2

airlines <-
  jsonlite::fromJSON(
    flights$data[[2]]$file
  )

flights$data[[2]]$data_frame <- airlines

# save back to the progress file that was read at the start
saveRDS(flights, file = "data/flights.rds")

In [15]:
dplyr::glimpse(airlines)

Rows: 850
Columns: 2
$ AirlineName <chr> "Executive Aviation Taiwan Corp.", "GREAT  WING  AIRLINES"~
$ AirlineID   <chr> NA, NA, NA, NA, "00", "02", "0B", "0D", "0V", "1B", "1L", ~


# Join two data frames

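We can join the `airlines` data into `flightsData` using the `AirlineID` variable, which both data frames share. The `dplyr::left_join` function does this: it keeps every row of `flightsData` and adds the matching `AirlineName`.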


In [18]:
# dplyr::left_join example

# two data frames
df1 <- data.frame(
  id = c(1, 2, 3, 4, 5),
  name = c("A", "B", "C", "D", "E")
)

df2 <- data.frame(
  id = c(1, 2, 3, 4, 5, 7),
  score = c(90, 80, 70, 60, 50, 40)
)

# join by id
dplyr::left_join(df1, df2, by = "id")

# same scores, but the key column is named ID instead of id
df3 <- data.frame(
  ID = c(1, 2, 3, 4, 5, 7),
  score = c(90, 80, 70, 60, 50, 40)
)

# join by df1$id and df3$ID
dplyr::left_join(df1, df3, by = c("id" = "ID"))

id,name,score
<dbl>,<chr>,<dbl>
1,A,90
2,B,80
3,C,70
4,D,60
5,E,50


id,name,score
<dbl>,<chr>,<dbl>
1,A,90
2,B,80
3,C,70
4,D,60
5,E,50
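
In both outputs above, `id = 7` from the right data frame is missing, because a left join keeps only the rows of the left data frame. The reverse case, a left row with no match on the right, is sketched below. It assumes the `dplyr` package is installed; `left_df` and `right_df` are hypothetical toy data.

```r
library(dplyr)

# Hypothetical toy data: id 6 exists only on the left, id 7 only on the right
left_df <- data.frame(id = c(1, 2, 6), name = c("A", "B", "F"))
right_df <- data.frame(id = c(1, 2, 7), score = c(90, 80, 40))

joined <- dplyr::left_join(left_df, right_df, by = "id")
print(joined)

# The unmatched left row is kept, with NA in the columns from the right
unmatched <- joined[is.na(joined$score), ]
print(unmatched)
```

After a join on real data, counting the `NA` values in the newly added column is a quick way to see how many keys found no match.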


In [19]:
dplyr::left_join(
  flightsData, airlines,
  by="AirlineID"
) -> flightsData

flightsData$AirlineName <-
  factor(
    flightsData$AirlineName
  )


# Departure and Arrival Flights


In [20]:
names(flightsData)

If `DepartureAirportID` belongs to a Taiwan airport, it is a departure flight. If `ArrivalAirportID` belongs to a Taiwan airport, it is an arrival flight.

What are Taiwan's airport IDs?

In [None]:
# the airportr package provides an airports data frame
install.packages("airportr")

In [22]:
library(airportr)
dplyr::glimpse(airportr::airports)

Rows: 7,698
Columns: 17
$ `OpenFlights ID`         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14~
$ Name                     <chr> "Goroka Airport", "Madang Airport", "Mount Ha~
$ City                     <chr> "Goroka", "Madang", "Mount Hagen", "Nadzab", ~
$ IATA                     <chr> "GKA", "MAG", "HGU", "LAE", "POM", "WWK", "UA~
$ ICAO                     <chr> "AYGA", "AYMD", "AYMH", "AYNZ", "AYPY", "AYWK~
$ Country                  <chr> "Papua New Guinea", "Papua New Guinea", "Papu~
$ `Country Code`           <chr> "598", "598", "598", "598", "598", "598", "30~
$ `Country Code (Alpha-2)` <chr> "PG", "PG", "PG", "PG", "PG", "PG", "GL", "GL~
$ `Country Code (Alpha-3)` <chr> "PNG", "PNG", "PNG", "PNG", "PNG", "PNG", "GR~
$ Latitude                 <dbl> -6.081690, -5.207080, -5.826790, 

# dplyr::filter

`dplyr::filter` keeps the rows of a data frame that satisfy a condition.

In [24]:
dplyr::filter(
  airports,
  Country == "Taiwan"
) -> airports_taiwan

head(airports_taiwan)

OpenFlights ID,Name,City,IATA,ICAO,Country,Country Code,Country Code (Alpha-2),Country Code (Alpha-3),Latitude,Longitude,Altitude,UTC,DST,Timezone,Type,Source
<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>
2259,Kinmen Airport,Kinmen,KNH,RCBS,Taiwan,158,TW,TWN,24.4279,118.359,93,8,U,Asia/Taipei,airport,OurAirports
2260,Pingtung South Airport,Pingtung,\N,RCDC,Taiwan,158,TW,TWN,22.6724,120.462,78,8,U,Asia/Taipei,airport,OurAirports
2261,Longtan Air Base,Longtang,\N,RCDI,Taiwan,158,TW,TWN,24.8551,121.238,790,8,U,Asia/Taipei,airport,OurAirports
2262,Taitung Airport,Fengnin,TTT,RCFN,Taiwan,158,TW,TWN,22.755,121.102,143,8,U,Asia/Taipei,airport,OurAirports
2263,Lyudao Airport,Green Island,GNI,RCGI,Taiwan,158,TW,TWN,22.6739,121.466,28,8,U,Asia/Taipei,airport,OurAirports
2264,Kaohsiung International Airport,Kaohsiung,KHH,RCKH,Taiwan,158,TW,TWN,22.5771,120.35,31,8,U,Asia/Taipei,airport,OurAirports


 - `Country == "Taiwan"` tests, row by row, whether the `Country` variable equals `"Taiwan"`; `filter` keeps the rows where the test is `TRUE`.

 - [relational operators](https://tpemartin.github.io/NTPU-R-for-Data-Science-EN/operations-on-atomic-vectors.html#operations-on-atomic-vectors-1)
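
A relational operator such as `==` returns a logical vector with one `TRUE` or `FALSE` per element. A minimal base R sketch, with hypothetical toy vectors `country` and `iata`:

```r
# Hypothetical toy data
country <- c("Taiwan", "Japan", "Taiwan")
iata <- c("TPE", "NRT", "KHH")

# == compares element by element and returns a logical vector
is_taiwan <- country == "Taiwan"
print(is_taiwan)

# Subsetting with the logical vector keeps the TRUE positions,
# the same idea dplyr::filter applies to the rows of a data frame
taiwan_iata <- iata[is_taiwan]
print(taiwan_iata)
```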

In [28]:
airports_taiwan$IATA

# keep only unique values
unique(airports_taiwan$IATA)

In [37]:
flightsData |>
  dplyr::filter(
    DepartureAirportID %in% unique(airports_taiwan$IATA)
  ) -> departure_flightsData

flightsData |>
  dplyr::filter(
    ArrivalAirportID %in% unique(airports_taiwan$IATA)
  ) -> arrival_flightsData


flights$data[[3]] <- list(
  departure_flightsData= departure_flightsData,
  arrival_flightsData = arrival_flightsData
)

# save back to the progress file that was read at the start
saveRDS(flights, file = "data/flights.rds")

- `DepartureAirportID %in% unique(airports_taiwan$IATA)` uses the `%in%` operator, which checks, for each value on the left, whether it appears in the vector on the right.
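
A small base R sketch of `%in%`. The vectors `taiwan_codes` and `departure` are hypothetical toy data, not the full list from `airports_taiwan`.

```r
# Hypothetical subset of Taiwan IATA codes
taiwan_codes <- c("TPE", "KHH", "TSA")

# Hypothetical departure airports of four flights
departure <- c("CKG", "TSA", "NRT", "TPE")

# One TRUE/FALSE per element of the left-hand side
from_taiwan <- departure %in% taiwan_codes
print(from_taiwan)

# Keep only the departures that are in the Taiwan list
print(departure[from_taiwan])
```

Unlike `==`, which compares against a single value (or element by element), `%in%` tests membership in a whole set, so it suits a condition like "is this airport one of Taiwan's airports".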

In [33]:
# number of departure flights
nrow(departure_flightsData)

# number of arrival flights
nrow(arrival_flightsData)