
  Today's goals:
  
  * learn to import .csv data

  * change feature names of a dataframe

  * learn **class** and its difference from **type**

  * what does parsing do?

  * how to fix wrong classes? 

  * check NA

  * factor class summary


# University registration rate

Aging society: too many schools, too few students. What is the current enrollment rate at each university in Taiwan?

## Data

  * [Open data](https://data.gov.tw/)

    * <https://data.gov.tw/dataset/26228>

### Format

The data is like a spreadsheet. When we import it, it is stored as a special list structure called a **data frame**, which is a collection of data with the following properties:

  * feature-by-feature structure

  * each feature is an **ATOMIC** vector (Within a feature, element values are all of the same type.)

  * feature vectors have the same length, which is equal to the number of observations (觀測值個數), also called the sample size (樣本數).
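
As a minimal sketch of this structure, the toy data frame below shows that a data frame is a list whose elements are atomic vectors of equal length. The names `toy`, `school`, and `rate` and all values are made up for illustration.

```r
# A toy data frame: two features, three observations (hypothetical values)
toy <- data.frame(
  school = c("A", "B", "C"),    # character feature
  rate   = c(98.0, 100, 91.3),  # numeric (double) feature
  stringsAsFactors = FALSE      # keep strings as character vectors
)

is.list(toy)          # a data frame is a special kind of list
sapply(toy, typeof)   # each feature has a single type
sapply(toy, length)   # every feature has the same length
nrow(toy)             # ... which equals the sample size
```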

### Import .csv data

  * source: <https://udb.moe.edu.tw/download/udata/data_gov/學生類/學12-1.新生註冊率-以「系所」統計.csv>

     * as a URL: `"https://udb.moe.edu.tw/download/udata/data_gov/學生類/學12-1.新生註冊率-以「系所」統計.csv"`

     * as a local file: `"./data/學生類/學12-1.新生註冊率-以「系所」統計.csv"`



### Import to global environment

The data must be bound to an object in the global environment before we can use it.

```
enrollmentRate_department <- ...
```

   * csv

In the **Environment** tab, click **Import Dataset** > **From Text (readr)**.

Copy the code from **Code Preview**.


In [68]:
library(readr)
enrollmentRate_department <- read_csv("data/學12-1.新生註冊率-以「系所」統計.csv")
# View(enrollmentRate_department)


Rows: 42900 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): 設立別, 學校類別, 學校統計處代碼, 學校名稱, 系所代碼, 系所名稱, 日間/進修, 學制班別, 當學年度各學系境外(新生)學生...
dbl  (5): 學年度, 當學年度總量內核定新生招生名額(A), 當學年度新生保留入學資格人數(B), 當學年度總量內新生招生核定名額之實際註冊人數...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
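
The message above suggests declaring column types explicitly. The sketch below shows one way to do that with `readr`, assuming the package is installed. The temporary file and the columns `schoolYear` and `school` are made up for illustration and are not the real data set.

```r
library(readr)

# Write a tiny CSV to a temporary file (made-up data, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("schoolYear,school", "106,A", "107,B"), tmp)

# Declaring column types up front documents intent and quiets the message
demo <- read_csv(
  tmp,
  col_types = cols(
    schoolYear = col_double(),
    school     = col_character()
  )
)

spec(demo)  # the column specification that was used
```

Declaring types at import time is also a first chance to catch a feature that was guessed with the wrong class.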


## Preliminary Exploration



### the basics

  * data source:

  * sample size

  * features

  * entity unit of an observation

In [None]:
# sample size
nrow(enrollmentRate_department) 
# number of features
ncol(enrollmentRate_department)
# feature names
names(enrollmentRate_department)

In [None]:
library(dplyr) # package for working with data frames

dplyr::glimpse(enrollmentRate_department)

The entity unit is one university's degree program. Each program's freshman enrollment rate is tracked across several **school years**.


### rename features

Names that do not follow a regular pattern are hard to type in code. It is common to change them to names that are easy to type.



In [None]:
names(enrollmentRate_department)

 [1] "學年度"                                               
 [2] "設立別"                                               
 [3] "學校類別"                                             
 [4] "學校統計處代碼"                                       
 [5] "學校名稱"                                             
 [6] "系所代碼"                                             
 [7] "系所名稱"                                             
 [8] "日間/進修"                                            
 [9] "學制班別"                                             
[10] "當學年度總量內核定新生招生名額(A)"                    
[11] "當學年度新生保留入學資格人數(B)"                      
[12] "當學年度總量內新生招生核定名額之實際註冊人數(C)"      
[13] "當學年度各學系境外(新生)學生實際註冊人數 (E)"         
[14] "當學年度新生註冊率(%)\n \n D=〔(C+E)/(A-B+E)〕＊100％"

In [None]:
names(enrollmentRate_department)[c(1,2,5,7,8,10:14)]

names(enrollmentRate_department)[c(1,2,5,7,8,10:14)] <-  # 10:14 means positions 10 through 14
  c("schoolYear", "typeByFundingSource","school", "department","day_night", 
  "freshmenFresh","freshmenOld","freshmenFreshRegistered",
  "foreignStudentsRegistered", "netFreshmenRegistrationRate")
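
Renaming by position mislabels features without any warning if the column order ever changes. As a hedged alternative, old names can be looked up by value instead. The toy data frame and all of its names below are hypothetical.

```r
# Toy data frame with hard-to-type names (hypothetical)
toy <- data.frame(
  `school year` = c(106, 107),
  `rate (%)`    = c(98, 100),
  check.names   = FALSE  # keep the awkward names exactly as written
)

# Look up each old name and replace it with an easy-to-type one
oldNames <- c("school year", "rate (%)")
newNames <- c("schoolYear", "rate")
names(toy)[match(oldNames, names(toy))] <- newNames

names(toy)
```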

## Checking

### Class

  * Class determines how much the computer understands about the data, and therefore what it can do with the data.



In [61]:
class("2022-12-31")
typeof("2022-12-31")

class(TRUE)
typeof(TRUE)

class(2)
typeof(2)

In [None]:
string2 <- lubridate::ymd("2022-12-31")
print(string2) 
class(string2)
string2+lubridate::days(3)

In [None]:
string1 <- "2022-12-31"
print(string1)
class(string1)
string1+lubridate::days(3) # fails: to R this is just a string, so it does not know how to add days to it


### Are feature [classes](https://tpemartin.github.io/NTPU-R-for-Data-Science-EN/operations-on-atomic-vectors.html#class) appropriate?


In [None]:
class(enrollmentRate_department$schoolYear)
class(enrollmentRate_department$netFreshmenRegistrationRate)

In [None]:
# Since **schoolYear** is treated as numeric by the computer,
# it accepts the following (nonsensical) operation
enrollmentRate_department$schoolYear /3  



In [None]:
enrollmentRate_department$schoolYear <- factor(enrollmentRate_department$schoolYear, ordered = TRUE)
enrollmentRate_department$schoolYear / 3 # now R warns that "/" is not meaningful and returns NA
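
An ordered factor rules out arithmetic but still supports the operations that make sense for ordered categories. A small sketch with made-up school years (the object `yr` is hypothetical):

```r
# A short vector of school years (made-up values)
yr <- factor(c(106, 107, 108), ordered = TRUE)

levels(yr)   # the categories, in order
yr < "108"   # order comparisons are meaningful for ordered factors
max(yr)      # so are min() and max()
```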


> `factor()` and `lubridate::ymd()` are parsing functions: they tell the computer what human-readable values (stored as strings or numbers) actually mean.

> A value that has not been parsed takes its class from its type.
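
To make the difference concrete, the sketch below compares `class()` and `typeof()` before and after parsing, using base R only. The object names are made up for illustration. The type describes how a value is stored; the class describes what it means.

```r
raw    <- "2022-12-31"                     # not parsed: just a string
parsed <- as.Date("2022-12-31")            # parsed with base R's as.Date()
fct    <- factor(c("day", "night", "day")) # parsed as categories

class(raw);    typeof(raw)     # "character" / "character"
class(parsed); typeof(parsed)  # "Date"      / "double"
class(fct);    typeof(fct)     # "factor"    / "integer"
```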

### classes

  * factor class: parse with `factor()` or `as.factor()`

  * date class: parse with `as.Date()`, `lubridate::ymd()`, or a related function, depending on the format of the stored date/time string.
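
As a sketch of why the stored format matters, the base R example below parses the same date from two differently formatted strings. The strings are made up; `format =` tells `as.Date()` how to read the text.

```r
# ISO-style strings (year-month-day) need no format
d1 <- as.Date("2022-12-31")

# Other layouts need a format string describing the stored text
d2 <- as.Date("31/12/2022", format = "%d/%m/%Y")

d1 == d2   # both parse to the same date
d1 + 3     # date arithmetic now works: three days later
```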

In [72]:
dfPartial <- enrollmentRate_department[c("schoolYear", "typeByFundingSource","school", "department","day_night", 
  "freshmenFresh","freshmenOld","freshmenFreshRegistered",
  "foreignStudentsRegistered", "netFreshmenRegistrationRate")] 
  
dfPartial |> dplyr::glimpse()

Rows: 42,900
Columns: 10
$ schoolYear                  <dbl> 106, 106, 106, 106, 106, 106, 106, 106, 10…
$ typeByFundingSource         <chr> "公立", "公立", "公立", "公立", "公立", "…
$ school                      <chr> "國立政治大學", "國立政治大學", "國立政治…
$ department                  <chr> "教育學系", "教育學系", "教育學系", "教育…
$ day_night                   <chr> "日間", "日間", "日間", "日間", "在職", "…
$ freshmenFresh               <dbl> 50, 11, 16, 14, 24, 13, 2, 15, 11, 5, 44, …
$ freshmenOld                 <dbl> 0, 3, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 1, 1, …
$ freshmenFreshRegistered     <dbl> 49, 8, 16, 14, 21, 10, 2, 14, 9, 2, 43, 11…
$ foreignStudentsRegistered   <chr> "...", "...", "...", "...", "...", "...", …
$ netFreshmenRegistrationRate <dbl> 98.00, 100.00, 100.00, 100.00, 91.30, 100.…


>  pipe operator `|>`

```
value |> doThis() # reads closer to natural language: take value, then do this
```
is the same as 
```
doThis(value)
```
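
The benefit shows when several steps are chained. A minimal sketch with a made-up vector `x` (the native pipe `|>` requires R 4.1 or later):

```r
x <- c(1, 4, 9)

# Nested call: read from the inside out
sum(sqrt(x))

# Piped call: read from left to right
x |> sqrt() |> sum()
```

Both lines compute the same value; only the reading order differs.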

### fix classes

In [74]:
dfPartial$typeByFundingSource |> factor() -> dfPartial$typeByFundingSource

# the same as 
library(magrittr)
dfPartial$typeByFundingSource %<>% factor() 

dfPartial$department %<>% factor()
dfPartial$day_night %<>% factor()

dplyr::glimpse(dfPartial)

Rows: 42,900
Columns: 10
$ schoolYear                  <dbl> 106, 106, 106, 106, 106, 106, 106, 106, 10…
$ typeByFundingSource         <fct> 公立, 公立, 公立, 公立, 公立, 公立, 公立, …
$ school                      <chr> "國立政治大學", "國立政治大學", "國立政治…
$ department                  <fct> 教育學系, 教育學系, 教育學系, 教育行政與政…
$ day_night                   <fct> 日間, 日間, 日間, 日間, 在職, 日間, 日間, …
$ freshmenFresh               <dbl> 50, 11, 16, 14, 24, 13, 2, 15, 11, 5, 44, …
$ freshmenOld                 <dbl> 0, 3, 0, 0, 1, 3, 0, 0, 0, 0, 0, 1, 1, 1, …
$ freshmenFreshRegistered     <dbl> 49, 8, 16, 14, 21, 10, 2, 14, 9, 2, 43, 11…
$ foreignStudentsRegistered   <chr> "...", "...", "...", "...", "...", "...", …
$ netFreshmenRegistrationRate <dbl> 98.00, 100.00, 100.00, 100.00, 91.30, 100.…
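
When many features need the same fix, converting them one line at a time gets repetitive. The sketch below shows one base R way to convert several features at once. The data frame `toy` and the vector `toFactor` are hypothetical.

```r
toy <- data.frame(
  type      = c("public", "private", "public"),
  day_night = c("day", "night", "day"),
  rate      = c(98, 100, 91.3),
  stringsAsFactors = FALSE
)

# Features that should be categorical
toFactor <- c("type", "day_night")

# lapply() applies factor() to each selected feature
toy[toFactor] <- lapply(toy[toFactor], factor)

sapply(toy, class)  # type and day_night are now factors; rate is untouched
```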


## Summarise data

  * Discrete data:

    * How many categories?

    * Its distribution.


  * Continuous data:

    * Central tendency (e.g. mean), dispersion (e.g. standard deviation).


  * NA 
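
The checklist above can be sketched on a toy data frame. The names `toy`, `type`, and `rate` and all values are made up for illustration.

```r
toy <- data.frame(
  type = factor(c("public", "private", "public", "public")),
  rate = c(98, 100, 91.3, NA)
)

# Discrete feature: number of categories and their distribution
nlevels(toy$type)
table(toy$type) |> prop.table()

# Continuous feature: central tendency and dispersion
# (na.rm = TRUE drops missing values before computing)
mean(toy$rate, na.rm = TRUE)
sd(toy$rate, na.rm = TRUE)

# Missing values: how many NA are in the feature
sum(is.na(toy$rate))
```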


### NA

  * Any NA?

In [None]:
# check each feature
dfPartial[[1]] |> anyNA()
dfPartial[[2]] |> anyNA()

# In practice, check the whole data frame first, then check individual features
dfPartial |> anyNA()
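
`anyNA()` only answers yes or no. The sketch below counts NA per feature on a toy data frame (all names and values are made up). It also illustrates a caveat: in the `glimpse()` output, `foreignStudentsRegistered` holds the string `"..."`, which R does not treat as NA. Reading `"..."` as a missing value is an assumption here, not something the data source confirms.

```r
toy <- data.frame(
  rate    = c(98, NA, 91.3),
  foreign = c("...", "3", "5"),
  stringsAsFactors = FALSE
)

# Count NA per feature instead of checking one feature at a time
colSums(is.na(toy))

# "..." is an ordinary string, so it is not reported as NA
anyNA(toy$foreign)

# Assumption: "..." stands for a missing value; recode it, then convert
toy$foreign[toy$foreign == "..."] <- NA
toy$foreign <- as.numeric(toy$foreign)
anyNA(toy$foreign)
```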

## Discrete data



### Factor

#### Levels

In [None]:
levels(dfPartial$typeByFundingSource)

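# change to English (new names are matched to levels by position,
# so check the order printed by levels() above first)
levels(dfPartial$typeByFundingSource) <- c("public", "private")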
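
Assigning new level names by position relies on the order that `levels()` prints, which can vary with the sorting rules of the system. As a hedged alternative, base R also accepts a named list that maps each old level to its new name. The toy factor `type` below is hypothetical.

```r
type <- factor(c("公立", "私立", "公立"))

# new name = old level; the mapping does not depend on level order
levels(type) <- list(public = "公立", private = "私立")

levels(type)
table(type)
```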


In [None]:
levels(dfPartial$typeByFundingSource)


tb <- table(dfPartial$typeByFundingSource)
tb
prop.table(tb) # shows the distribution

In [None]:
dfPartial$typeByFundingSource |>
  table() |>
  prop.table()