In [1]:
# To ensure Chinese characters are displayed correctly
options(encoding = "UTF-8")
Sys.setlocale("LC_CTYPE", "en_US.UTF-8")

# Rule on names

  * [Rules](https://hyp.is/QQ4DmGflEe6G8SsIYP70Gg/tpemartin.github.io/NTPU-R-for-Data-Science-EN/r-basics.html):  
    * starts only with a letter (`a` to `z`, `A` to `Z`) or a dot (`.`); a starting dot must not be followed by a digit  
    * later positions can contain letters, digits (`0` to `9`), dots (`.`), or underscores (`_`)

  * [Common styles](https://hyp.is/qUJrrGflEe67khOW7wxCXw/tpemartin.github.io/NTPU-R-for-Data-Science-EN/r-basics.html)
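
The naming rules can be checked from within R. The sketch below relies on base R's `make.names()`, which rewrites an invalid name into a valid one; a name that comes back unchanged was already valid. The helper `is_valid_name()` and the candidate names are hypothetical, introduced only for illustration.

```r
# Hypothetical helper (not part of base R): TRUE when `x` is already a
# syntactically valid name, i.e. make.names() leaves it unchanged.
is_valid_name <- function(x) make.names(x) == x

# Made-up candidate names: the first three follow the rules, the last three do not.
candidates <- c("flights", ".hidden", "flight_data2", "2flights", "_x", ".2way")

valid <- is_valid_name(candidates)
names(valid) <- candidates
print(valid)
```

Reserved words such as `if` or `TRUE` are also reported as invalid by this check, because `make.names()` alters them as well.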

# Progress storage

During the research process, you will need to save your progress and come back the next day to pick up where you left off. Saving regularly is good practice because it keeps you from losing your work. In this section, we will learn how to save and load data in R.

  - Download `international_flights.json` from [here](https://classroom.google.com/c/NTg5NjUwNjM0MjA4/m/NjMwNDMyNzc5MDgz/details) and
  - Put it in the `data` folder of your project folder

In [5]:
data1 <- list(
            file = "data/international_flights.json",
            meta = list(
                name = "國際航空定期時刻表",
                source_link = "https://data.gov.tw/dataset/161167")
            )

flights <- list(
    data = list(data1)
)

- We will keep saving our research progress in the `flights` list.  
- To save our progress, run:

In [6]:
saveRDS(flights, "data/flights.rds")

- if you turn off your computer and want to come back the next day, run:

In [7]:
flights = readRDS("data/flights.rds")
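
To see that `saveRDS()` and `readRDS()` really give back the same object, here is a minimal round-trip sketch. It writes to a temporary file instead of `data/flights.rds`, and the `progress` list is a made-up stand-in for `flights`.

```r
# Made-up stand-in for the `flights` list, just for this check.
progress <- list(
  data = list(
    list(file = "data/example.json", meta = list(name = "example"))
  )
)

# Save to a temporary file, then load it back.
tmp <- tempfile(fileext = ".rds")
saveRDS(progress, tmp)
restored <- readRDS(tmp)

# TRUE when the restored object matches the original exactly.
same <- identical(progress, restored)
print(same)

# Remove the temporary file.
unlink(tmp)
```

Note that `saveRDS()` does not create missing folders, so the `data` folder has to exist before saving to `data/flights.rds`.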

# Exercise: understand your data

- what's the name of the dataset? where is it from? 

# Import data

In [9]:
# Read JSON file

filepath = flights$data[[1]]$file
flightsData <- jsonlite::fromJSON(filepath)

Let's save the `flightsData` object inside the `flights` list so that in the future we can retrieve its value from:

```
flights$data[[1]]$data_frame
```

We can do it through:

In [10]:
flights$data[[1]]$data_frame <- flightsData

saveRDS(flights, "data/flights.rds")

# Data acquaintance

## Data structure

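- Type of storage (`typeof`): how the value is stored internally, such as an atomic vector type (`double`, `integer`, `character`, `logical`) or a `list`.  
- Class (`class`): how the value is managed, such as `numeric`, `character`, `integer`, `list`, `data.frame`, or `matrix`.  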

In [14]:
typeof(flightsData)
class(flightsData)

- The imported value is stored as a list (`typeof`).  
- The imported value can be managed as a data frame (`class`).   

"Stored" means the value is saved in the memory of your computer. The storage type gives you a basic idea of what you can do with it, such as retrieving element values, adding new elements, or deleting elements. The class gives you a more specific idea of what you can do with it.
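
The difference between storage type and class can be seen on a small example. The sketch below uses a made-up data frame `toy` (not the flights data) and shows that list tools still work on it, because it is stored as a list.

```r
# Made-up data frame for illustration only.
toy <- data.frame(
  id = 1:2,
  code = c("TPE", "HAN"),
  stringsAsFactors = FALSE
)

storage <- typeof(toy)  # how it is stored
managed <- class(toy)   # how it is managed
print(storage)
print(managed)

# Because a data frame is stored as a list, list tools apply to it:
second_column <- toy[[2]]  # retrieve the second element (a column)
n_columns <- length(toy)   # number of elements (columns)
print(second_column)
print(n_columns)
```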

### Two classes of collective data 

In our previous example, the data structure was determined by how the data file was imported. Sometimes, however, you construct your data within the program instead of importing it from outside. In that case, you need to choose the data structure yourself.

- Observation by observation (obo): mostly a `list` class.
- Feature by feature (fbf): mostly a `data.frame` class.

[3.1.1 Json data](https://tpemartin.github.io/NTPU-R-for-Data-Science-EN/element-values.html#json-data)  
[4.2.4 Data frame](https://tpemartin.github.io/NTPU-R-for-Data-Science-EN/operations-on-atomic-vectors.html#data-frame)


In [53]:
person1 <- list(
    name = "John",
    age = 30,
    married = TRUE
)
person2 <- list(
    name = "Mary",
    age = 25,
    married = FALSE
)
person3 <- list(
    name = "Tom",
    age = 35,
    married = TRUE
)

# observation by observation stacking
data_obo <- list(person1, person2, person3)

In [54]:
names = c("John", "Mary", "Tom")
ages = c(30, 25, 35)
isMarried = c(TRUE, FALSE, TRUE)

# feature by feature stacking
data_fbf <- list(
    name = names, 
    age = ages, 
    married = isMarried)

- Feature by feature: each element has the same length.

In [None]:
length(data_fbf$name)
length(data_fbf$age)
length(data_fbf$married)

- Within each feature, its values are of the same type.

In [57]:
typeof(data_fbf$name)
typeof(data_fbf$age)
typeof(data_fbf$married)

For fbf stacking, we will mostly use a dedicated way to store data: the `data.frame` class.

### Data frame

In [56]:
df <- list2DF(data_fbf) # use list2DF to convert a list fbf into a data frame


| name | age | married |
|---|---|---|
| `<chr>` | `<dbl>` | `<lgl>` |
| John | 30 | True |
| Mary | 25 | False |
| Tom | 35 | True |


In [None]:

df <- data.frame(
    name = names, 
    age = ages, 
    married = isMarried
) # directly forming a data frame

> Our flights data can also be imported in observation-by-observation format by passing the `simplifyDataFrame = FALSE` option when importing the data.


In [11]:
flights_obo <- jsonlite::fromJSON(flights$data[[1]]$file, simplifyDataFrame = FALSE)

In [13]:
head(flights_obo, 3)

- `head(x)` shows the first 6 elements of `x`; `head(x, 3)` shows the first 3.
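
The two layouts differ in how a single value is reached, and one can be rebuilt from the other. The sketch below uses small made-up lists (`people_obo`, `people_fbf`) rather than the flights data; using `vapply()` for the conversion is just one possible approach.

```r
# Made-up observation-by-observation data.
people_obo <- list(
  list(name = "John", age = 30),
  list(name = "Mary", age = 25)
)

# The same data, feature by feature.
people_fbf <- list(
  name = c("John", "Mary"),
  age = c(30, 25)
)

# Retrieval: pick the observation first (obo) or the feature first (fbf).
name_from_obo <- people_obo[[2]]$name
name_from_fbf <- people_fbf$name[2]

# obo -> fbf: pull each feature out of every observation.
rebuilt <- list(
  name = vapply(people_obo, function(p) p$name, character(1)),
  age = vapply(people_obo, function(p) p$age, numeric(1))
)

print(name_from_obo)
print(identical(rebuilt, people_fbf))
```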

## Extra Management tools on data frame

A data frame offers an extra tool for element retrieval: `[row(s), column(s)]` indexing.

[Benefit of data frame](https://hyp.is/K1GeGme-Ee6m2vNkdp7lnw/tpemartin.github.io/NTPU-R-for-Data-Science-EN/operations-on-atomic-vectors.html)
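
A minimal sketch of `[row(s), column(s)]` retrieval, using a made-up data frame `toy` with the same three people as in the earlier example:

```r
# Made-up data frame for illustration only.
toy <- data.frame(
  name = c("John", "Mary", "Tom"),
  age = c(30, 25, 35),
  married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# One row, one column: a single value.
mary_age <- toy[2, "age"]

# Several rows and several columns: a smaller data frame.
subset_df <- toy[c(1, 3), c("name", "married")]

# Leave the column slot empty to keep all columns;
# a logical condition in the row slot filters rows.
older <- toy[toy$age > 28, ]

print(mary_age)
print(subset_df)
print(older)
```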

## Names of columns

In [13]:
names(flightsData)

Each name represents:

- `AirlineID`: an identification code assigned by IATA to identify a unique airline (carrier).
- `ScheduleStartDate`: the start date of the flight schedule season to which the row of data applies.
- `ScheduleEndDate`: the end date of the flight schedule season to which the row of data applies.
- `FlightNumber`: the flight number assigned by the carrier.
- `DepartureAirportID`: an identification code assigned by IATA to identify the departure airport.
- `ArrivalAirportID`: an identification code assigned by IATA to identify the arrival airport.
- `DepartureTime`: the scheduled departure time of the flight.
- `ArrivalTime`: the scheduled arrival time of the flight.
- `Monday` to `Sunday`: the days of the week on which the flight operates, based on the departure date.
- `CodeShare`: a code share flight is a flight booked through one airline but operated by another airline (as indicated by the carrier code).



## Check the first few rows

In [14]:
head(flightsData)

| | AirlineID | ScheduleStartDate | ScheduleEndDate | FlightNumber | DepartureAirportID | DepartureTime | CodeShare | ArrivalAirportID | ArrivalTime | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday | UpdateTime | VersionID | Terminal | num_codeShare |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<list>` | `<int>` |
| 1 | 3U | 2023-10-13 | 2023-10-15 | 3U3783 | CKG | 15:00 | | TSA | 18:00 | False | False | False | False | True | False | True | 2023-10-10T08:26:07+08:00 | 1111 | | 0 |
| 2 | 3U | 2023-10-20 | 2023-10-22 | 3U3783 | CKG | 15:00 | | TSA | 18:00 | False | False | False | False | True | False | True | 2023-10-10T08:26:07+08:00 | 1111 | | 0 |
| 3 | 3U | 2023-10-27 | 2023-10-27 | 3U3783 | CKG | 15:00 | | TSA | 18:00 | False | False | False | False | True | False | False | 2023-10-10T08:26:07+08:00 | 1111 | | 0 |
| 4 | 3U | 2023-10-13 | 2023-10-15 | 3U3784 | TSA | 19:00 | | CKG | 22:15 | False | False | False | False | True | False | True | 2023-10-10T08:26:07+08:00 | 1111 | | 0 |
| 5 | 3U | 2023-10-20 | 2023-10-22 | 3U3784 | TSA | 19:00 | | CKG | 22:15 | False | False | False | False | True | False | True | 2023-10-10T08:26:07+08:00 | 1111 | | 0 |
| 6 | 3U | 2023-10-27 | 2023-10-27 | 3U3784 | TSA | 19:00 | | CKG | 22:15 | False | False | False | False | True | False | False | 2023-10-10T08:26:07+08:00 | 1111 | | 0 |


In [30]:
# Count how often each arrival airport appears, most frequent first
flightsData$ArrivalAirportID |> unlist() |> table() |> sort(decreasing = TRUE)


 TPE  PVG  KHH  HKG  NRT  BKK  KIX  ICN  TSA  SIN  MFM  FUK  HAN  BWN  LAX  SGN 
2268  156  140  138  134  120  120  108   99   82   70   66   63   60   60   59 
 SFO  MNL  KUL  OKA  PEK  CEB  PUS  SDJ  DAD  CTS  HND  SZX  CAN  KMQ  SHA  HGH 
  57   54   50   49   42   41   37   36   35   33   33   33   27   27   24   23 
 XMN  DMK  RMQ  GMP  NGO  PEN  BKI  JFK  DXB  NKG  FOC  TFU  YVR  CGK  CKG  CNX 
  23   22   22   21   21   21   19   18   16   16   15   15   15   12   12   12 
 MEL  ORD  PNH  TAO  YYZ  BNE  TAE  CRK  DPS  LHR  SEA  SYD  TAK  WUH  VIE  AKL 
  12   12   12   12   12   11   10    9    9    9    9    9    9    9    8    6 
 AMS  CJJ  CJU  FCO  FRA  GAJ  HIJ  IAH  IST  KMJ  MUC  NGB  KIJ  MXP  TKS  CXR 
   6    6    6    6    6    6    6    6    6    6    6    6    5    5    5    4 
 PRG  AKJ  CDG  CGO  HKD  HKT  HNA  HSG  HUI  IBR  KCZ  KLO  MPH  OKJ  ONT  PPS 
   4    3    3    3    3    3    3    3    3    3    3    3    3    3    3    3 
 RGN  ROR  TNN  AOJ  IZO  T
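
The counting pipeline above can be followed step by step on a small input. The sketch below uses a made-up list of six airport codes in place of the `ArrivalAirportID` column; like the cell above, it uses the native pipe `|>`, which requires R 4.1 or later.

```r
# Made-up list column standing in for flightsData$ArrivalAirportID.
arrivals <- list("TPE", "HAN", "TPE", "NRT", "TPE", "HAN")

counts <- arrivals |>
  unlist() |>               # list -> character vector
  table() |>                # count occurrences of each code
  sort(decreasing = TRUE)   # most frequent first

print(counts)

# The name of the first element is the most frequent code.
most_common <- names(counts)[1]
print(most_common)
```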

## Understand column values

### `...AirportID`

In [None]:
install.packages("airportr")


In [49]:
library(airportr)
han <- airport_lookup("HAN", output_type=c("city"))

han 

In [48]:
tpe_han <- airport_distance("TPE","HAN")

tpe_han

> The warning message can usually be ignored here, since the command still runs to completion.