# Data Sources

### Zhentao Shi

R 4.3.1

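*Book of Han*, "Treatise on Geography" (《汉书·地理志》): Jingzhaoyin, in the second year of Yuanshi (2 CE), had 195,702 households and 682,468 persons. It comprised twelve counties: Chang'an, Xinfeng, Chuansikong, Lantian, Huayin, Zheng, Hu, Xiagui, Nanling, Fengming, Baling, and Duling.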

<!-- code is tested on SCRP -->

* Online archives
* API (Application Programming Interface)
* Proprietary data
* Survey data

## Public Data

* Time Series
  - Natural ordering of observations
  - A single realization in history
  - e.g., GDP, stock prices

* Microeconomic data
  - No natural ordering
  - Collected at the same time, or time does not matter
  - [Auction data](https://capcp.la.psu.edu/data-and-software/alaska-oil-and-gas-auction-data/)

* Aggregate panel data
  - [Penn World Table](https://www.rug.nl/ggdc/productivity/pwt/?lang=en)
  - [Atlas Trade Data](https://atlas.cid.harvard.edu/about-data)
  - [IMF databases](https://data.imf.org/?sk=388DFA60-1D26-4ADE-B505-A05A558D9A42&sId=1479329132316)
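
As a rough illustration of how these three data shapes are usually held in R, the sketch below builds a toy time series, a cross-section, and a panel. Every value and name (`gdp`, `bids`, `panel`) is invented for illustration and does not come from the linked datasets.

```r
# Toy examples only: every number below is invented for illustration.

# Time series: observations carry a natural time ordering.
gdp <- ts(c(100, 103, 105, 110), start = 2019, frequency = 1)

# Cross-section: rows can be shuffled without losing information.
bids <- data.frame(bidder = c("A", "B", "C"), bid = c(12.5, 9.0, 15.2))

# Panel: several units (countries), each observed over several periods.
panel <- expand.grid(country = c("X", "Y"), year = 2019:2021)
panel$value <- c(1.0, 2.0, 1.1, 2.1, 1.3, 2.2)

time(gdp)              # the time index attached to the series
table(panel$country)   # number of periods observed for each unit
```

The time index lives inside the `ts` object, whereas the panel keeps unit and period as ordinary columns, which is the layout that the `dplyr` verbs later in these notes expect.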

## Gated Data

* Applications needed
  * Chinese Longitudinal Healthy Longevity Survey [link](https://www.icpsr.umich.edu/web/NACDA/studies/36692)
  * China Household Finance Survey [link](https://chfs.swufe.edu.cn/)


## CUHK Library

* [Refinitiv](http://easyaccess1.lib.cuhk.edu.hk/limited/refinitiv.html)
* [WRDS](http://easyaccess1.lib.cuhk.edu.hk/limited/wrds.htm)
* [CEIC](https://cas-ceicdata-com.easyaccess1.lib.cuhk.edu.hk/login#)


## Econ Department

* WIND
* CEIC
* Bloomberg
* China Census

In [None]:
library(magrittr)
library(dplyr)
library(zoo)
# library(rvest)


## Time Series

- Macroeconomics
  - 国家统计局 [National Bureau of Statistics](https://data.stats.gov.cn/easyquery.htm?cn=B01)
  - Federal Reserve [FRED database](https://research.stlouisfed.org/econ/mccracken/fred-databases/)
- Financial
  - [Yahoo Finance](https://finance.yahoo.com/)

## HK GDP

* [Census and Statistics Department](https://www.censtatd.gov.hk/en/web_table.html?id=33#)
  * Web page with an API

In [None]:
library(jsonlite) # parses the downloaded JSON file

url <- "https://www.censtatd.gov.hk/api/get.php?id=310-31003&lang=en&param=N4KABGBEDGBukC4yghSBxAIgBQPoGEB5AWW0IDkBRcgFUTAG1xU0AxTSAGmZckw+4s02fJS49UGUeKFQsrfKxlDI+AJIA1ZbwAa6AMrbJxA0bR6zUHYcEqTlyMUMSAuswC+tyAGd4SFJJE5PRMspAASgCGAO64xLgAFgDWACa4KQ4AmgD2mbgAjCkADrgApLjekK4eXkUApgBOAJbZGf4SPgAukQ2d9JAATAAMA-lD+VUQnsyQTW1QAMxjALRLQ0MLypAANpEAdgDm-XV7Ve5AA"
download.file(url, destfile="input.json")
result <- jsonlite::fromJSON("input.json")

data.frame(result$dataSet) %>% head() 
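
To see what `jsonlite::fromJSON` does with a response of this kind without touching the network, here is a minimal sketch on a hand-written JSON string. The field names `period` and `figure` are placeholders and are not claimed to match the actual schema of the Census and Statistics Department API; only the top-level `dataSet` element mirrors the code above.

```r
library(jsonlite)

# Hypothetical JSON text mimicking an API response; "period" and "figure"
# are placeholder field names, not the real schema.
txt <- '{"header": {"title": "toy"},
         "dataSet": [{"period": "2021", "figure": 1.5},
                     {"period": "2022", "figure": 2.5}]}'

res <- jsonlite::fromJSON(txt)

str(res, max.level = 1)   # inspect the top-level elements first
toy <- res$dataSet        # an array of records is simplified to a data frame
class(toy)
toy$figure
```

Inspecting the parsed list with `str()` before extracting an element is a cheap way to find where the actual observations sit in an unfamiliar response.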

## Federal Reserve

* China GDP [link](https://fred.stlouisfed.org/series/MKTGDPCNA646NWDB)

In [None]:
quantmod::getSymbols.FRED(Symbols = "MKTGDPCNA646NWDB", env = .GlobalEnv) 
plot(MKTGDPCNA646NWDB)
MKTGDPCNA646NWDB


Another example: [Quarterly US Industrial Production Index](https://fred.stlouisfed.org/series/IPB50001SQ)


In [None]:
quantmod::getSymbols.FRED(Symbols = c("IPB50001SQ"), env = .GlobalEnv)
plot(IPB50001SQ)
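
A quarterly index is often easier to read as a year-on-year growth rate. The sketch below shows one way to compute it, assuming a base R `ts` object with invented values. The same `diff(log(x), lag = 4)` call should also work on the `xts` object returned by `quantmod`, though that is not run here.

```r
# Made-up quarterly index values, for illustration only.
x <- ts(c(100, 101, 102, 103, 110, 111.1, 112.2, 113.3),
        start = c(2020, 1), frequency = 4)

# Year-on-year log growth: compare each quarter with the same quarter one year earlier.
yoy <- diff(log(x), lag = 4)

yoy          # quarterly series of year-on-year log growth
start(yoy)   # first period for which the growth rate is defined
```

Using a lag of four quarters compares like with like, so a regular seasonal pattern does not show up as spurious growth.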

## Finance Data


* [Yahoo Finance](https://finance.yahoo.com/)

* Ticker symbol `AAPL` for *Apple Inc.*
  * Package `quantmod`

In [None]:
quantmod::getSymbols("AAPL", src = "yahoo")
tail(AAPL)
plot(AAPL$AAPL.Close)

### Shanghai Composite Index

In [None]:
tick = "000001.SS" # look up the ticker symbol on Yahoo Finance first
SH <- quantmod::getSymbols(tick, auto.assign = FALSE, 
         from = "2000-01-01")[, paste0(tick,".Close")] # keep the closing price only

plot(SH)

### 000001.SS Return

In [None]:
diff(log(SH)) %>% plot()
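
The expression `diff(log(SH))` computes the log return $r_t = \log P_t - \log P_{t-1}$. The following sketch, on a few invented prices, compares it with the simple return and checks that log returns add up over time.

```r
# Hypothetical closing prices, for illustration only.
p <- c(100, 102, 101, 105)

log_ret    <- diff(log(p))               # r_t = log(P_t) - log(P_{t-1})
simple_ret <- p[-1] / p[-length(p)] - 1  # (P_t - P_{t-1}) / P_{t-1}

# The two measures are close when price changes are small.
round(cbind(log_ret, simple_ret), 4)

# Log returns are additive: their sum equals the whole-period log return.
all.equal(sum(log_ret), log(p[length(p)] / p[1]))
```

Additivity is the main practical reason to work with log returns: a multi-period return is just a sum, with no compounding to keep track of.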

### Cryptocurrencies

In [None]:
# column 4 of the downloaded object is the closing price
BTC <- quantmod::getSymbols("BTC-USD",auto.assign = FALSE, from = "2021-07-01")[,4]
plot(BTC)

ETH <- quantmod::getSymbols("ETH-USD",auto.assign = FALSE, from = "2021-07-01")[,4]
plot(ETH)

plot( x = as.vector(ETH), y = as.vector(BTC), type = "l")
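
Plotting one price vector against another implicitly assumes that both series cover exactly the same dates. When that is in doubt, the series can be aligned first. The sketch below does so with base R data frames and invented prices. For `xts` objects, `merge(..., join = "inner")` plays the same role.

```r
# Made-up daily closes; series `b` has no record for 2021-07-02.
a <- data.frame(date = as.Date(c("2021-07-01", "2021-07-02",
                                 "2021-07-03", "2021-07-04")),
                close_a = c(10, 11, 12, 12.6))
b <- data.frame(date = as.Date(c("2021-07-01", "2021-07-03", "2021-07-04")),
                close_b = c(100, 120, 126))

# An inner join on the date column keeps only dates present in both series.
both <- merge(a, b, by = "date")

nrow(both)   # number of common dates
plot(both$close_a, both$close_b, xlab = "close_a", ylab = "close_b")
```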

### Bitcoin Return

In [None]:
diff( log(BTC) ) %>% plot( )

## Automated Data Download

* Example: HKMA [API](https://apidocs.hkma.gov.hk/documentation/market-data-and-statistics/daily-monetary-statistics/daily-figures-interbank-liquidity/)
  * Save the downloaded data as a CSV file
  * Schedule the job to run regularly with `cron` on Linux
  * `cronR` provides an R interface to `cron`

In [None]:
library(jsonlite) # parses the downloaded JSON file
url <- "https://api.hkma.gov.hk/public/market-data-and-statistics/daily-monetary-statistics/daily-figures-interbank-liquidity"
download.file(url,destfile="input.json")
result <- jsonlite::fromJSON("input.json")
hkma <- data.frame(result$result)
write.csv(hkma,"hkma.csv")
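
When a download like this runs every day, writing to the same file overwrites the earlier copy, and naive appending can store the same record twice. The sketch below shows one possible way to append only unseen rows. The helper `append_new_rows()` and the key column `date` are hypothetical, and the toy records do not come from HKMA.

```r
# Append only the rows whose key is not yet stored in the CSV file.
# `key` names the column that identifies a record (an assumption about the data).
append_new_rows <- function(new_data, path, key) {
  if (file.exists(path)) {
    old <- read.csv(path, stringsAsFactors = FALSE)
    new_data <- new_data[!(new_data[[key]] %in% old[[key]]), , drop = FALSE]
    combined <- rbind(old, new_data)
  } else {
    combined <- new_data
  }
  write.csv(combined, path, row.names = FALSE)
  invisible(combined)
}

# Toy demonstration with a temporary file and made-up records.
path <- tempfile(fileext = ".csv")
day1 <- data.frame(date = c("2024-01-02", "2024-01-03"), value = c(1.0, 1.1))
day2 <- data.frame(date = c("2024-01-03", "2024-01-04"), value = c(1.1, 1.2))

append_new_rows(day1, path, key = "date")
out <- append_new_rows(day2, path, key = "date")
nrow(out)   # the overlapping date is stored only once
```

In a scheduled script, the data frame returned by the API would take the place of `day1` and `day2`, with `key` set to whichever column uniquely identifies an observation.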

```
library("cronR")

cron_ls() # list existing cron tasks

# define the task: run the script and write its log to a text file
cmd <- cron_rscript("HKMA_API.r", rscript_log = "/root/HKMAdata.txt")

# register the task to run daily, without an interactive confirmation prompt
cron_add(cmd, frequency = "daily", ask = FALSE, id = "zt_econ5821")
```

## Access Database

* API for CEIC

```
# Dai Qiyu (Jan, 2023)

# Step 1: load dependencies, install the `ceic` package, and log in
PackageList <- c("R6", "xml2", "zoo", "httr", "getPass")
invisible(lapply(PackageList, require, character.only = TRUE))

install.packages("ceic", repos = "https://downloads.ceicdata.com/R/",
                 type = "source")
library(ceic)

# log in with your own credentials
ceic.login(username = "cuhk_student_id@link.cuhk.edu.hk",
           password = "your_password")
```

```
# Step 2: obtain China's quarterly GDP

CN_GDP_list <-
  ceic.series(c("2609f72b-16b5-4799-a347-35d8ac05e585"), format = "ts",
              lang = "zh")
CN_GDP <- CN_GDP_list$timepoints
CN_GDP


# Step 3: log out
ceic.logout()
```

### Bank Marketing Data


* Direct marketing campaigns of a Portuguese banking institution. 


* S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.


*  **Data Import**: `readr::read_delim`

In [None]:
library(tidyverse)

# The `readr` package provides flexible functions for importing data
d0 = readr::read_delim("data_example/bank-full.csv", delim = ";", col_names = TRUE,
                       col_types = cols(
                         age = "i",       # integer
                         job = "c",       # character
                         marital = "f",   # factor
                         education = "f",
                         balance = "i"
                       ))

head(d0)
colnames(d0)
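
The compact codes in `col_types` can be tried out without the data file. This sketch parses a few invented records given as a literal string. It assumes `readr` 2.0 or later, where `I()` marks a string as literal data rather than a file path.

```r
library(readr)

# Made-up records in a semicolon-delimited layout.
# I() tells readr to treat the string as literal data, not as a file path.
txt <- I("age;job;marital;balance\n30;teacher;single;100\n45;clerk;married;-20\n")

toy <- read_delim(txt, delim = ";",
                  col_types = cols(
                    age = "i",       # integer
                    job = "c",       # character
                    marital = "f",   # factor
                    balance = "i"
                  ))

sapply(toy, class)   # confirm each column was parsed as requested
```

Declaring the types up front makes the import fail loudly, or at least warn, when a column does not contain what is expected, rather than silently guessing.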


## Data Transformation

* `filter`: pick out a subset of rows that satisfy some conditions
* `select`: pick out a subset of columns
* `arrange`: sort rows, in ascending order by default
* `mutate`: add new columns computed from existing ones

In [None]:
# select columns
d1 = select(d0, age:loan)
head(d1)

In [None]:
# select a subset by conditions
filter(d1, job == "blue-collar", age > 50) 

In [None]:
filter(d1, job == "blue-collar", (age > 20 & age <= 30) ) 

In [None]:
# (re)arrange rows
arrange(d1, age, education)

In [None]:
arrange(d1, desc(age), education)

In [None]:
# add a generated column (appended as the last column);
# as.numeric() on a factor returns its integer level codes
mutate(d1, edu_f = as.numeric(education) )

In [None]:
transmute(d1, 
          age = age,
          marital = as.numeric(marital), 
          education = as.numeric(education))
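
The verbs shown one at a time above can also be chained with the pipe into a single pass over the data. Here is a minimal sketch on an invented data frame whose column names mirror those of the bank data. The derived column `in_debt` is made up for illustration.

```r
library(dplyr)

# Made-up records; column names mirror those used above.
toy <- data.frame(
  age       = c(25, 52, 61, 33),
  job       = c("blue-collar", "blue-collar", "blue-collar", "admin."),
  education = c("primary", "secondary", "primary", "tertiary"),
  balance   = c(100, 2500, -50, 800)
)

res <- toy %>%
  filter(job == "blue-collar", age > 50) %>%   # keep matching rows
  select(age, education, balance) %>%          # keep three columns
  mutate(in_debt = balance < 0) %>%            # add a derived column
  arrange(desc(age))                           # oldest first

res
```

Each verb takes a data frame as its first argument and returns a data frame, which is what lets them compose in this way.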

## Summarize

* `group_by`: split the data into groups so that `summarize` computes statistics within each group

In [None]:
# overall mean balance 
summarize(d1, mean_b = mean(balance))

In [None]:
# mean balance by groups
group_by(d1, education) %>%
  summarize( mean_b = mean(balance))

In [None]:
d1 %>%
  group_by(education, marital) %>%
  summarize( mean_b = mean(balance),
             sd_b = sd(balance),
             count = n())
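
With more than one grouping variable, `summarize` by default returns a result that is still grouped by all but the last variable. The sketch below, on invented data, uses the `.groups = "drop"` argument (available in `dplyr` 1.0 and later) to return an ungrouped table instead.

```r
library(dplyr)

# Made-up records, for illustration only.
toy <- data.frame(
  education = c("primary", "primary", "tertiary", "tertiary"),
  marital   = c("single", "married", "single", "single"),
  balance   = c(100, 300, 500, 700)
)

by_two <- toy %>%
  group_by(education, marital) %>%
  summarize(mean_b = mean(balance),
            count  = n(),
            .groups = "drop")   # return an ungrouped result

by_two
```

Dropping the grouping explicitly avoids surprises when the summary table is passed on to further `mutate` or `summarize` calls.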

## Data Scraping

* [Lianjia data](https://github.com/zhentaoshi/econ_data_science/blob/master/data_example/Scrape_Lianjia.ipynb) (by Wang Yishu)
* [Beijing Housing paper](https://github.com/zhentaoshi/Econ5821/blob/main/data_example/2022%20Lin%20Shi%20Wang%20Yan%20Computational_Economics.pdf)
* We can test it on SCRP.
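
For a flavor of scraping in R, the sketch below uses `rvest` to extract fields from a small hand-written HTML fragment. The markup, class names, and values are all invented and are unrelated to the Lianjia notebook linked above. On a live page, `read_html(url)` would replace `minimal_html()`.

```r
library(rvest)

# A made-up HTML fragment standing in for a downloaded listing page.
page <- minimal_html('
  <div class="listing"><span class="title">Flat A</span><span class="price">500</span></div>
  <div class="listing"><span class="title">Flat B</span><span class="price">650</span></div>
')

# CSS selectors pick out the nodes; html_text2() returns their text.
titles <- page %>% html_elements(".listing .title") %>% html_text2()
prices <- page %>% html_elements(".listing .price") %>% html_text2() %>% as.numeric()

data.frame(title = titles, price = prices)
```

Most of the work in a real scraper lies in finding selectors that stay stable across pages, so it helps to test them on a saved copy of one page before looping over many.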

## Reading

* [Wickham and Grolemund](https://r4ds.hadley.nz/)
  * Ch.7: data import
  * Ch.3: data transformation
