In [1]:
# To ensure Chinese characters are displayed correctly
options(encoding = "UTF-8")
Sys.setlocale("LC_CTYPE", "zh_TW.UTF-8")

# Read progress file

In [2]:
flights = readRDS("data/flights.rds")

In [3]:
# check the progress object structure
str(flights)

List of 1
 $ data:List of 1
  ..$ :List of 3
  .. ..$ file      : chr "data/international_flights.json"
  .. ..$ meta      :List of 2
  .. .. ..$ name       : chr "國際航空定期時刻表"
  .. .. ..$ source_link: chr "https://data.gov.tw/dataset/161167"
  .. ..$ data_frame:'data.frame':	4941 obs. of  20 variables:
  .. .. ..$ AirlineID         : chr [1:4941] "3U" "3U" "3U" "3U" ...
  .. .. ..$ ScheduleStartDate : chr [1:4941] "2023-10-13" "2023-10-20" "2023-10-27" "2023-10-13" ...
  .. .. ..$ ScheduleEndDate   : chr [1:4941] "2023-10-15" "2023-10-22" "2023-10-27" "2023-10-15" ...
  .. .. ..$ FlightNumber      : chr [1:4941] "3U3783" "3U3783" "3U3783" "3U3784" ...
  .. .. ..$ DepartureAirportID: chr [1:4941] "CKG" "CKG" "CKG" "TSA" ...
  .. .. ..$ DepartureTime     : chr [1:4941] "15:00" "15:00" "15:00" "19:00" ...
  .. .. ..$ CodeShare         :List of 4941
  .. .. .. ..$ :'data.frame':	0 obs. of  0 variables
  .. .. .. ..$ :'data.frame

![](../img/str_flight.png)

- `flights` is a list with one element, `data`, which we can extract with `$data`.
- `data` is itself a list with a single unnamed element, which we can extract with `[[1]]`.
- That first element of `data` is a list of three elements named `file`, `meta`, and `data_frame`. We can extract the data frame with `$data_frame`.

In [4]:
flightsData <- flights$data[[1]]$data_frame

> You can also explore a list interactively to find out how to retrieve an element inside it, as in the following steps:
> #### Step 1:
> ![observe flights1](img/observe_flights1.png)
> #### Step 2:
> ![observe flights2](img/observe_flights2.png)
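
The same exploration can be done in code. The sketch below uses a small made-up list, `demo`, that only mimics the shape of `flights` (all values are placeholders, not the real data) and peels off one layer at a time with `names()` and `str()`.

```r
# A toy list shaped like `flights`; every value here is a placeholder
demo <- list(
    data = list(
        list(
            file = "some_file.json",
            meta = list(name = "demo", source_link = "https://example.com"),
            data_frame = data.frame(id = 1:3, code = c("a", "b", "c"))
        )
    )
)

names(demo)                          # names at the top level
names(demo$data)                     # NULL, because the element has no name
names(demo$data[[1]])                # names one layer further down
str(demo$data[[1]], max.level = 1)   # show one layer of structure only

demo_df <- demo$data[[1]]$data_frame # the full retrieval path
```

Checking `names()` at each layer tells us whether the next step should use `$name` or a position such as `[[1]]`.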

## Analysis on Data Frame

### Recap

 - `[[...]]`, `$...`, and `[... , ...]` retrieval on data frame

A data frame is a class that extends list, so `[[...]]` and `$...` work on it just as they do on a list. In addition, it has one more retrieval tool of its own: `[... , ...]`.

In [5]:
# Consider the following data frame of student grades

grades <- data.frame(
    student = c("Alice", "Bob", "Charlie", "David", "Eve"),
    midterm = c(95, 80, 70, 60, 75),
    final = c(90, 85, 75, 95, 80)
)


In [6]:
# a data frame extends a list like the following one

grades_list <- list(
    student = c("Alice", "Bob", "Charlie", "David", "Eve"),
    midterm = c(95, 80, 70, 60, 75),
    final = c(90, 85, 75, 95, 80)
)

- It is like a list with named elements. Each element can be retrieved via `[[...]]` or `$...`.

In [7]:
grades[["student"]]
grades[["midterm"]]

In [8]:
grades$student
grades$midterm
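
As a quick check of the claim that a data frame behaves like a list of columns, here is a minimal sketch on a shortened copy of `grades` (three students only, to keep it small):

```r
grades <- data.frame(
    student = c("Alice", "Bob", "Charlie"),
    midterm = c(95, 80, 70)
)

is.list(grades)   # a data frame is built on a list
length(grades)    # one list element per column
names(grades)     # the element names are the column names

# `[[...]]` and `$...` return the same column vector
identical(grades[["midterm"]], grades$midterm)
```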

- It has an extra retrieval tool `[..., ...]` which allows you to retrieve a subset of the data frame.

In [9]:
grades[, "student"] # retrieve student column
grades[, c("student", "midterm")] # retrieve student and midterm columns
grades[c(1,3), "student"] # retrieve student column for rows 1 and 3
grades[c(1,3), c("student", "midterm")] # retrieve student and midterm columns for rows 1 and 3

| student | midterm |
|---|---|
| `<chr>` | `<dbl>` |
| Alice | 95 |
| Bob | 80 |
| Charlie | 70 |
| David | 60 |
| Eve | 75 |

|   | student | midterm |
|---|---|---|
|   | `<chr>` | `<dbl>` |
| 1 | Alice | 95 |
| 3 | Charlie | 70 |
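
One detail worth knowing about `[..., ...]`: when a single column is selected from a plain `data.frame`, the result is a vector rather than a data frame. The sketch below shows this, together with `drop = FALSE` to keep the data frame shape and a logical condition to pick rows. The object names `one_col_vec`, `one_col_df`, and `improved` are made up for this illustration.

```r
grades <- data.frame(
    student = c("Alice", "Bob", "Charlie", "David", "Eve"),
    midterm = c(95, 80, 70, 60, 75),
    final = c(90, 85, 75, 95, 80)
)

one_col_vec <- grades[, "student"]                # a character vector
one_col_df  <- grades[, "student", drop = FALSE]  # still a data frame

class(one_col_vec)
class(one_col_df)

# Rows can also be selected with a logical condition
improved <- grades[grades$final > grades$midterm, c("student", "midterm", "final")]
improved
```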


In [10]:
dplyr::glimpse(flights$data[[1]]$data_frame)

Rows: 4,941
Columns: 20
$ AirlineID          <chr> "3U", "3U", "3U", "3U", "3U", "3U", "3U", "3U", "3U…
$ ScheduleStartDate  <chr> "2023-10-13", "2023-10-20", "2023-10-27", "2023-10-…
$ ScheduleEndDate    <chr> "2023-10-15", "2023-10-22", "2023-10-27", "2023-10-…
$ FlightNumber       <chr> "3U3783", "3U3783", "3U3783", "3U3784", "3U3784", "…
$ DepartureAirportID <chr> "CKG", "CKG", "CKG", "TSA", "TSA", "TSA", "TFU", "T…
$ DepartureTime      <chr> "15:00", "15:00", "15:00", "19:00", "19:00", "19:00…
$ CodeShare          <list> [<data.frame[0 x 0]>], [<data.frame[0 x 0]>], [<da…
$ ArrivalAirportID   <chr> "TSA", "TSA", "TSA", "CKG", "CKG", "CKG", "TSA", "T…
$ ArrivalTime        <chr> "18:00", "18:00", "18:00", "22:15", "22:15", "22:15…
$ Monday             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALS

> If you get an error saying that there is no package called `dplyr`, install it with `install.packages("dplyr")`.

### Column element's class

In [11]:
# check column names 
names(flightsData)

In [12]:
class(flightsData$AirlineID)
class(flightsData$ScheduleStartDate)
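
Checking columns one at a time gets tedious with 20 variables. A possible shortcut, sketched here on a made-up three-column stand-in called `demo_flights` (not the real flight data), is to apply `class()` to every column at once with `sapply()`:

```r
# A small stand-in for flightsData; the values are made up
demo_flights <- data.frame(
    AirlineID = c("3U", "BR", "CI"),
    ScheduleStartDate = c("2023-10-13", "2023-10-20", "2023-10-27"),
    Monday = c(FALSE, TRUE, TRUE),
    stringsAsFactors = FALSE
)

# Apply class() to each column and collect the results in a named vector
col_classes <- sapply(demo_flights, class)
col_classes

# Names of the columns that are still plain character
names(col_classes)[col_classes == "character"]
```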

## Factor class


By default, text is imported as character class values. Plain character values are of limited use for analysis: beyond reading them, there is little we can do with them.

Some character values are actually categorical values. For example, the `AirlineID` column in `flightsData` data frame is a categorical value. 

Categorical values: 

- have a limited number of possible values, which raises two natural questions:
  - How many distinct values are there?
  - For each categorical value, how many rows are there?

Some categorical values can also be compared with one another. For example, suppose we have data on 12 households and categorise each of them as "low income", "middle income", or "high income".

- "low income" < "middle income" < "high income"

In [13]:
householdIncomes <- c(
    "middle income", "high income", "middle income", "low income",
    "middle income", "high income", "high income", "low income",
    "high income", "middle income", "high income", "middle income"
)


In R, categorical values should be parsed into `factor` or `ordered factor` class. `ordered factor` is for categorical values which can be compared.

In [14]:
# we want it to be factor class
class(flightsData$AirlineID)


- [4.2.2 Factor](https://tpemartin.github.io/NTPU-R-for-Data-Science-EN/operations-on-atomic-vectors.html#factor)

How to parse into (ordered) factor:

In [15]:
# parse to factor
# (x is a placeholder for an object of class character)
# factor(x)

# parse to ordered factor
# (lv is a placeholder for a vector of levels from smallest to largest)
# ordered(x, levels = lv)


> If the object to be parsed is not of character class, it is implicitly converted to character first.
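
To see what this implicit conversion means in practice, here is a small sketch with a made-up numeric vector `scores`. The levels come out as character strings, and converting the factor straight to integer returns the internal level codes, not the original numbers.

```r
scores <- c(3, 1, 2, 3, 10)   # made-up numeric values
fct_scores <- factor(scores)

levels(fct_scores)       # the levels are character strings
as.integer(fct_scores)   # internal level codes, not the original numbers

# To get the original numbers back, go through character first
recovered <- as.numeric(as.character(fct_scores))
recovered
```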

In [None]:
fct_householdIncome <- factor(householdIncomes)

ord_fct_householdIncome <- ordered(householdIncomes, levels = c("low income", "middle income", "high income"))
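
A short sketch of how to confirm which kind of factor we ended up with. It uses a shortened income vector and the made-up names `fct` and `ord` so that it runs on its own.

```r
incomes <- c("middle income", "high income", "low income", "middle income")

fct <- factor(incomes)
ord <- ordered(incomes, levels = c("low income", "middle income", "high income"))

class(fct)        # "factor"
class(ord)        # "ordered" "factor"
is.ordered(fct)   # FALSE
is.ordered(ord)   # TRUE
nlevels(ord)      # number of distinct categories
```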

### Frequency table

How many distinct values? For each categorical value, how many of them are there?

In [None]:
# Counts on each level

table(fct_householdIncome)
table(ord_fct_householdIncome)

fct_householdIncome
  high income    low income middle income 
            5             2             5 

ord_fct_householdIncome
   low income middle income   high income 
            2             5             5 

For each categorical value, what proportion of the observations does it account for?

In [None]:
# Proportion on each level
tb_fct_householdIncome <- table(fct_householdIncome)
prop.table(tb_fct_householdIncome)

tb_ord_fct_householdIncome <- table(ord_fct_householdIncome)
prop.table(tb_ord_fct_householdIncome)

fct_householdIncome
  high income    low income middle income 
    0.4166667     0.1666667     0.4166667 

ord_fct_householdIncome
   low income middle income   high income 
    0.1666667     0.4166667     0.4166667 
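
Proportions are often easier to read as percentages. The following sketch rebuilds the household income vector so that it runs on its own, and uses the made-up names `counts` and `shares`. It also shows that `prop.table()` is simply the counts divided by their total.

```r
householdIncomes <- c(
    "middle income", "high income", "middle income", "low income",
    "middle income", "high income", "high income", "low income",
    "high income", "middle income", "high income", "middle income"
)
ord <- ordered(householdIncomes, levels = c("low income", "middle income", "high income"))

counts <- table(ord)
shares <- prop.table(counts)

sum(shares)              # the proportions add up to 1
round(100 * shares, 1)   # percentages with one decimal place
counts / sum(counts)     # the same numbers as prop.table(counts)
```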

### Levels

What are those distinct categorical values?

In [None]:
levels(fct_householdIncome)

### What do levels do?

> The sequence of levels determines how summaries and plots are ordered, unless otherwise specified.

In [None]:
table(fct_householdIncome)

fct_householdIncome
  high income    low income middle income 
            5             2             5 

In [None]:
fct_householdIncome2 <- 
    factor(householdIncomes, 
    levels = c("high income", "middle income", "low income"))

levels(fct_householdIncome2)

table(fct_householdIncome2)

fct_householdIncome2
  high income middle income    low income 
            5             5             2 
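
Levels also decide which categories exist at all. In the sketch below, "no income" is a hypothetical extra category that never occurs in the data; because it is declared as a level, `table()` still reports it with a count of zero, and `sort()` arranges the values in level order rather than alphabetically.

```r
incomes <- c("middle income", "high income", "low income", "middle income")

# "no income" is a hypothetical category that never occurs in `incomes`
fct_with_empty <- factor(
    incomes,
    levels = c("no income", "low income", "middle income", "high income")
)

table(fct_with_empty)   # "no income" is listed with a count of 0
sort(fct_with_empty)    # values follow the level order
```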

> Besides the order of summaries and plots, the levels of an **ordered factor** also determine how its values are compared.

In [None]:
levels(ord_fct_householdIncome)



> The values of an ordered factor can be compared. For example, "low income" < "middle income" < "high income". `LHS < RHS` checks, element by element, whether the values in `LHS` are smaller than the values in `RHS`.

In [None]:
# Avoid comparing two strings with <, >, <=, >=
"high income" < "middle income"

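> When two strings are compared, they are compared alphabetically, character by character, starting from the first letter (a is the smallest, z is the largest; the exact order depends on the locale). Here "h" is smaller than "m" because "h" is the 8th letter of the alphabet and "m" is the 13th, so the result says nothing about income.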
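
A few more comparisons make the pitfall concrete. This is only a sketch; the exact collation can vary with the locale, but for plain lowercase words like these the alphabetical order is the usual one.

```r
# Alphabetical order happens to agree with the income ranking here
"low income" < "middle income"

# Here it does not: "h" comes before "l", so "high income" counts as smaller
"high income" < "low income"

# When the first letters are the same, later letters decide
"high" < "hit"
```

All three comparisons evaluate to `TRUE`, although only the first one matches what we mean by income levels.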

In [None]:
# values in ordered factor can be compared to a level string
head(ord_fct_householdIncome)
head(ord_fct_householdIncome) < "high income"
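
Because comparison returns a logical vector, it can be combined with the usual tools for counting and filtering. A minimal sketch on a shortened income vector (the names `incomes` and `ord` are made up for this illustration):

```r
incomes <- c("middle income", "high income", "low income", "middle income")
ord <- ordered(incomes, levels = c("low income", "middle income", "high income"))

ord < "high income"           # element-wise comparison against a level
sum(ord >= "middle income")   # how many are middle income or above
max(ord)                      # the highest level present in the data
ord[ord < "high income"]      # keep only the values below "high income"
```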


## Flight data

In [16]:
flightsData[["AirlineID"]] <- factor(flightsData[["AirlineID"]])

class(flightsData$AirlineID)

> `factor(flightsData$AirlineID)` involves two steps:
> 1. The name call `flightsData[["AirlineID"]]` or `flightsData$AirlineID` gets the column's values and hands over a copy of them, so the original column is not changed.
> 2. `factor(...)` parses the copied values into factor class values.
>
> The entire process works on a copy of the `flightsData$AirlineID` column; the original column is left unchanged.
> To make the conversion **permanent**, we need to assign the result of `factor(...)` back to `flightsData$AirlineID`.

> In R, most **name calls** work on a copy of the original values, which are left unchanged. To change the original values, assign the result back to the original name.
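
The difference between calling `factor(...)` and assigning its result back can be seen on a tiny made-up data frame, here called `demo`:

```r
# A made-up one-column data frame standing in for flightsData
demo <- data.frame(AirlineID = c("BR", "CI", "BR"), stringsAsFactors = FALSE)

factor(demo$AirlineID)                    # prints a factor, but nothing is stored
class_before <- class(demo$AirlineID)     # the column is still character

demo$AirlineID <- factor(demo$AirlineID)  # assign the result back
class_after <- class(demo$AirlineID)      # now the column is a factor

c(before = class_before, after = class_after)
```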

In [39]:
tb_airlineID <-  table(flightsData$AirlineID)
tb_airlineID


 3U  5J  7C  9C  AA  AC  AE  AF  AI  AK  AV  AZ  B7  BI  BR  BX  CA  CI  CM  CX 
 12  40  35   6   6  27  40   3  30  12  36   6  83  60 730  14 112 801  39  86 
 CZ  D7  DL  EK  FD  FM  GA  GK  HA  HB  HO  HU  HX  IT  JL  JX  KE  KL  LH  LJ 
 62  16  75  32  21  30   6   8  24  18  43  21  60 277 101 304  42  33   6  14 
 LY  MF  MH  MM  MU  NH  NU  NX  NZ  OD  OZ  PG  PR  QF  QH  RF  RW  SC  SL  SQ 
 10  52  49  80  85 116   2  68 111  42  35  24  41  21  14  12   6   6  26 143 
 TG  TK  TR  TW  UA  UK  UO  UX  VJ  VN  VZ  WE  Z2  ZH 
220  43  53  26   6  12  44  12  67  51  39  12  24  18 

In [41]:
typeof(tb_airlineID)

Values of `integer` type can be sorted with `sort`.

In [42]:
sort(tb_airlineID)

sort(tb_airlineID, decreasing = TRUE)


 NU  AF  9C  AA  AZ  GA  LH  RW  SC  UA  GK  LY  3U  AK  RF  UK  UX  WE  BX  LJ 
  2   3   6   6   6   6   6   6   6   6   8  10  12  12  12  12  12  12  14  14 
 QH  D7  HB  ZH  FD  HU  QF  HA  PG  Z2  SL  TW  AC  AI  FM  EK  KL  7C  OZ  AV 
 14  16  18  18  21  21  21  24  24  24  26  26  27  30  30  32  33  35  35  36 
 CM  VZ  5J  AE  PR  KE  OD  HO  TK  UO  MH  VN  MF  TR  BI  HX  CZ  VJ  NX  DL 
 39  39  40  40  41  42  42  43  43  44  49  51  52  53  60  60  62  67  68  75 
 MM  B7  MU  CX  JL  NZ  CA  NH  SQ  TG  IT  JX  BR  CI 
 80  83  85  86 101 111 112 116 143 220 277 304 730 801 


 CI  BR  JX  IT  TG  SQ  NH  CA  NZ  JL  CX  MU  B7  MM  DL  NX  VJ  CZ  BI  HX 
801 730 304 277 220 143 116 112 111 101  86  85  83  80  75  68  67  62  60  60 
 TR  MF  VN  MH  UO  HO  TK  KE  OD  PR  5J  AE  CM  VZ  AV  7C  OZ  KL  EK  AI 
 53  52  51  49  44  43  43  42  42  41  40  40  39  39  36  35  35  33  32  30 
 FM  AC  SL  TW  HA  PG  Z2  FD  HU  QF  HB  ZH  D7  BX  LJ  QH  3U  AK  RF  UK 
 30  27  26  26  24  24  24  21  21  21  18  18  16  14  14  14  12  12  12  12 
 UX  WE  LY  GK  9C  AA  AZ  GA  LH  RW  SC  UA  AF  NU 
 12  12  10   8   6   6   6   6   6   6   6   6   3   2 
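
With more than 70 airlines, the full sorted table is long. Here is a sketch of how to pull out just the top entries, using a made-up sample of airline IDs rather than the real column:

```r
# Made-up sample of airline IDs, not the real data
airline_ids <- factor(c("CI", "BR", "CI", "JX", "BR", "CI", "IT"))
tb <- table(airline_ids)

typeof(tb)   # the counts are stored as integers
class(tb)    # but the object itself is of class "table"

sorted_tb <- sort(tb, decreasing = TRUE)
sorted_tb[1:2]        # the two most frequent airlines with their counts
names(sorted_tb)[1]   # ID of the most frequent airline
```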

- [Airports and airlines search](https://www.iata.org/en/publications/directories/code-search/)

## Update your progress

In [45]:
flights$data[[1]]$data_frame <- flightsData

flights$data[[1]][["analysis"]][["airlineID"]][["frequency_count"]] <- sort(tb_airlineID, decreasing = TRUE)

saveRDS(flights, "data/flights.rds")
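
To convince ourselves that nothing is lost when saving, we can read the file back and compare. The sketch below does this with a toy object called `progress` and a temporary file, so it does not touch `data/flights.rds`; both names are placeholders.

```r
# A toy progress object; the structure and values are placeholders
progress <- list(
    data = list(
        list(analysis = list(frequency_count = c(CI = 3L, BR = 2L)))
    )
)

tmp_path <- tempfile(fileext = ".rds")   # temporary file, not the real data path
saveRDS(progress, tmp_path)

restored <- readRDS(tmp_path)
identical(progress, restored)            # TRUE when the round trip is lossless
```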