# Economic Data Science

@ Central University of Finance and Economics


**Zhentao Shi**

Department of Economics

The Chinese University of Hong Kong

### Why data

* Scientific research, e.g. physics
  * Theory and experimental evidence.

* Economics research.
  * Modeling causality
  * Experimental causality

* Business analytics
  * Descriptive statistics
  * Prediction
  * Inference

## Programming

* Essential skill
* Low-level languages: C, Fortran, ...
* High-level languages: Python, R, Matlab, Stata, ...

* R is used for demonstration
* You choose your languages

## Data Sources


* Online archives
* API (Application Programming Interface)
* Proprietary data
* Survey data

## Public data

* Time Series
  - Natural ordering of observations
  - A single realization in history
  - e.g. GDP, stock prices

* Microeconomic data
  - No natural ordering
  - Collected at the same time, or time does not matter

* Aggregate panel data
  - [Penn World Table](https://www.rug.nl/ggdc/productivity/pwt/?lang=en)
  - [Atlas Trade Data](https://atlas.cid.harvard.edu/about-data)
  - [IMF databases](https://data.imf.org/?sk=388DFA60-1D26-4ADE-B505-A05A558D9A42&sId=1479329132316)

## Gated Data

* Applications needed
  * Chinese Longitudinal Healthy Longevity Survey [link](https://www.icpsr.umich.edu/web/NACDA/studies/36692)
  * China Household Finance Survey [link](https://chfs.swufe.edu.cn/)


In [None]:
library(magrittr)
library(dplyr)
library(zoo)
library(rvest)
library(Quandl)


## Time Series

- Macroeconomics
  - 国家统计局 [National Bureau of Statistics](https://data.stats.gov.cn/easyquery.htm?cn=B01)
  - Federal Reserve [FRED database](https://research.stlouisfed.org/econ/mccracken/fred-databases/)
- Financial
  - [Yahoo Finance](https://finance.yahoo.com/)

## HK GDP

* [Census and Statistics Department](https://www.censtatd.gov.hk/en/web_table.html?id=33#)
  * Webpage with API

In [None]:
library(httr)
library(jsonlite) # rjson is not loaded: it would mask jsonlite's fromJSON() and toJSON()

url <- "https://www.censtatd.gov.hk/api/get.php?id=310-31003&lang=en&param=N4KABGBEDGBukC4yghSBxAIgBQPoGEB5AWW0IDkBRcgFUTAG1xU1j0BlSAGmZcgA0O3Xqkj4AkgDVhLNFgBi+eTNkZ8lFX2zrNozJl1p5BnrKjFOp1W0NR+lkWkGQRAXWYBfK5ADO8JCiiROT0TGaQAEoAhgDuuMS4ABYA1gAmuKm2kACaAPbZuACMqQAOuACkuD4uLO4QXsyQJQCmAE4AlrmZASK+AC5RrX30kABMAAyjheOFNWANaO3dUADM0wC0a+PjKyqQADZRAHYA5iPNRy4eQA"
download.file(url, destfile="input.json")
result <- jsonlite::fromJSON("input.json")

In [None]:
data.frame(result$dataSet) 
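
To see what `fromJSON` does without contacting the server, the sketch below parses a small hand-written JSON string. The payload and its field names (`period`, `figure`) are invented for illustration; they are an assumption, not the actual schema returned by the Census and Statistics Department API.

```r
library(jsonlite)

# Hypothetical payload that mimics an API response: some metadata plus a
# "dataSet" array of records (all field names and values are made up).
payload <- '{
  "header": {"title": "toy example"},
  "dataSet": [
    {"period": "2020", "figure": 100.5},
    {"period": "2021", "figure": 107.2},
    {"period": "2022", "figure": 103.9}
  ]
}'

# By default fromJSON simplifies an array of records into a data frame
result_toy <- jsonlite::fromJSON(payload)
df_toy <- data.frame(result_toy$dataSet)
str(df_toy)
```

The nested list returned by `fromJSON` keeps the metadata in `result_toy$header`, so only the `dataSet` element needs to be converted for analysis.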

## Federal Reserve

* China GDP [link](https://fred.stlouisfed.org/series/MKTGDPCNA646NWDB)

In [None]:
quantmod::getSymbols.FRED(Symbols = "MKTGDPCNA646NWDB", env = .GlobalEnv) 
plot(MKTGDPCNA646NWDB)


Another example: [Quarterly US Industrial Production Index](https://fred.stlouisfed.org/series/IPB50001SQ)


In [None]:
quantmod::getSymbols.FRED(Symbols = c("IPB50001SQ"), env = .GlobalEnv)
plot(IPB50001SQ)

## Finance Data


* [Yahoo Finance](https://finance.yahoo.com/)

* Ticker `AAPL` for *Apple Inc.*
  * Package `quantmod`

In [None]:
quantmod::getSymbols("AAPL", src = "yahoo")
tail(AAPL)
plot(AAPL$AAPL.Close)

### Shanghai Composite Index

In [None]:
tick = "000001.SS" # ticker symbol; look it up on Yahoo Finance
SH <- quantmod::getSymbols(tick, auto.assign = FALSE, 
         from = "2000-01-01")[, paste0(tick,".Close")]

plot(SH)

### 000001.SS Return

In [None]:
diff(log(SH)) %>% plot()
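
The expression `diff(log(SH))` computes log returns, $r_t = \log P_t - \log P_{t-1}$. A minimal sketch on made-up prices (not market data) shows the calculation on a plain numeric vector and how it relates to the simple return $P_t / P_{t-1} - 1$.

```r
# Made-up closing prices, for illustration only
price <- c(100, 102, 101, 105)

# Log return: r_t = log(P_t) - log(P_{t-1})
log_ret <- diff(log(price))

# Simple return: P_t / P_{t-1} - 1
simple_ret <- price[-1] / price[-length(price)] - 1

# The two are linked by log_ret = log(1 + simple_ret)
all.equal(log_ret, log(1 + simple_ret))

# Log returns add up across periods to the total log return
all.equal(sum(log_ret), log(price[4] / price[1]))
```

The additivity over time is one common reason for working with log returns rather than simple returns.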

### Bank Marketing Data


* Direct marketing campaigns of a Portuguese banking institution. 


* S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.


*  **Data Import**: `readr::read_delim`

In [None]:
library(tidyverse)

# The `readr` package has more flexible functions for importing data
d0 = readr::read_delim("data_example/bank-full.csv", delim = ";", col_names = TRUE,
                       col_types = cols(
                         age = "i",       # integer
                         job = "c",       # character
                         marital = "f",   # factor
                         education = "f", # factor
                         balance = "i"    # integer
                         )
                       )

head(d0)
colnames(d0)
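
For readers without the data file at hand, the following sketch writes a tiny semicolon-delimited file to a temporary location and reads it back with the same kind of `col_types` specification. The rows are invented and only a few of the column names from the example above are used.

```r
library(readr)

# Write a small semicolon-delimited file; all values are made up
tmp <- tempfile(fileext = ".csv")
writeLines(c("age;job;marital;balance",
             "41;clerk;married;150",
             "29;teacher;single;-20",
             "53;farmer;married;980"), tmp)

# One-letter abbreviations set the type of each column
d_toy <- readr::read_delim(tmp, delim = ";", col_names = TRUE,
                           col_types = cols(
                             age = "i",      # integer
                             job = "c",      # character
                             marital = "f",  # factor
                             balance = "i"   # integer
                           ))
str(d_toy)
```

Declaring the types up front avoids relying on type guessing, which matters when a categorical column should be a factor from the start.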


## Data Transformation

* `filter`: pick out a subset of rows that satisfy some conditions
* `select`: pick out a subset of columns
* `arrange`: order rows; the default order is ascending (low to high)
* `mutate`: add columns computed from the existing ones

In [None]:
# select columns
d1 = select(d0, age:loan)
head(d1)

In [None]:
# select a subset of rows by conditions
filter(d1, job == "blue-collar", age > 50) 
filter(d1, job == "blue-collar", (age > 20 & age <= 30) ) 

In [None]:
# (re)arrange rows
arrange(d1, age, education)
arrange(d1, desc(age), education)

In [None]:
# add a generated column (appended as the last column)
mutate(d1, edu_f = as.numeric(education) )

In [None]:
# keep only the listed columns, converting factors to their numeric codes
transmute(d1, 
          age = age,
          marital = as.numeric(marital), 
          education = as.numeric(education))
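
The verbs above can also be chained with the pipe `%>%`, so that each step takes the result of the previous one. The sketch below uses a small invented tibble (`toy` is a hypothetical name, not part of the bank data).

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  age = c(25L, 52L, 61L, 38L),
  job = c("blue-collar", "blue-collar", "services", "blue-collar"),
  balance = c(300L, 1200L, -50L, 800L)
)

out <- toy %>%
  filter(job == "blue-collar") %>%        # keep rows that satisfy the condition
  select(age, balance) %>%                # keep two columns
  mutate(balance_k = balance / 1000) %>%  # add a derived column
  arrange(desc(age))                      # sort by age, high to low
out
```

None of the verbs modifies `toy` in place; the result exists only because it is assigned to `out`.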

## Summarize

* `group_by`

In [None]:
# overall mean balance 
summarize(d1, mean_b = mean(balance))

In [None]:
# mean balance by groups
group_by(d1, education) %>%
  summarize( mean_b = mean(balance))

In [None]:
d1 %>%
  group_by(education, marital) %>%
  summarize( mean_b = mean(balance),
             sd_b = sd(balance),
             count = n())
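
With two grouping variables, `summarize` returns one row per combination that appears in the data. A minimal sketch on invented data makes the arithmetic easy to check by hand; the `.groups = "drop"` argument is added here so that the result is an ungrouped tibble.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  education = c("primary", "primary", "tertiary", "tertiary", "tertiary"),
  marital   = c("single", "single", "married", "single", "married"),
  balance   = c(100, 300, 1000, 500, 2000)
)

by_group <- toy %>%
  group_by(education, marital) %>%
  summarize(mean_b = mean(balance),  # group mean
            count = n(),             # group size
            .groups = "drop")        # return an ungrouped tibble
by_group
```

Three combinations occur in the toy data, so `by_group` has three rows; the combination primary/married never appears and therefore gets no row.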

## Data Scraping

* [Lianjia Shenzhen data](https://github.com/zhentaoshi/econ_data_science/blob/master/data_example/Scrape_Lianjia.ipynb) (by Wang Yishu)

## Version Control

* [Git](https://git-scm.com/)
  - coding projects
  - long documents
  
* [GitHub](https://github.com/)
  - online copy
  - collaboration


Collaboration typically works as follows:
1. Fetch and merge changes from a remote repository;
2. Create a branch to work on a new project feature;
3. Develop the feature on your branch and commit your work;
4. Fetch and merge from the remote again (in case other collaborators have uploaded new commits while you were working);
5. Push your branch up to the remote for review.

## Markdown

* Text only. 
* Simple syntax.
* [Cheat sheet](https://www.markdownguide.org/cheat-sheet/)