Skip to content

Dataset

Dat Nguyen edited this page Feb 8, 2021 · 1 revision

In this section, we describe in detail how to get data from data sources.

We found that pandas datareader is a great tool to get various data sources.

Detail on pandas Datareader remote resources could be found here.

Below is the summary of data sources in this project:

Method Source Categories Feature/Key Frequency Unit Description
API JHU CSSE Covid Confirmed cases US daily case Covid-19 confirmed cases in the US
API JHU CSSE Covid Deaths US daily case Covid-19 deaths in the US
API JHU CSSE Covid Confirmed cases Global daily case Covid-19 confirmed in the world
API JHU CSSE Covid Deaths cases Global daily case Covid-19 deaths in the world
PD Datareader fred Financial nasdaq100 daily N/A NASDAQ 100 Index
PD Datareader fred Financial sp500 daily N/A S&P 500 Index
PD Datareader fred Financial djia daily N/A Dow Jones Industrial Average
PD Datareader fred Labor market payems monthly Thousands of person All Employees, Total Nonfarm
PD Datareader fred Labor market CES4348100001 monthly Thousands of person All Employees, Air Transportation
PD Datareader fred Labor market CES6562000101 monthly Thousands of person All Employees, Health Care
PD Datareader fred Labor market CES6561000001 monthly Thousands of person All Employees, Educational Services
PD Datareader fred Labor market CES7071000001 monthly Thousands of person All Employees, Arts, Entertainment, and Recreation
PD Datareader fred Labor market unrate monthly Percent Unemployment Rate
PD Datareader fred Labor market jtsjol monthly Level in Thousands Job Openings: Total Nonfarm
PD Datareader fred Labor market LNS13023653 monthly Thousands of Persons Unemployment Level - Job Losers on Layoff
PD Datareader fred Production & Business Activity VMTD11 monthly Millions Vehicle Miles Traveled
PD Datareader fred Production & Business Activity AIRRPMTSID11 monthly Thousands Air Revenue Passenger Miles
PD Datareader fred Production & Business Activity MRTSSM7225USN monthly Millions of Dollars Retail Sales: Restaurants and Other Eating Places
PD Datareader fred Production & Business Activity MRTSSM4541USS monthly Millions of Dollars Retail Sales: Electronic Shopping and Mail-order Houses
PD Datareader fred Production & Business Activity MRTSSM4451USS monthly Millions of Dollars Retail Sales: Grocery Stores
PD Datareader fred Production & Business Activity MRTSSM446USS monthly Millions of Dollars Retail Sales: Health and Personal Care Stores

Covid-19

import pandas as pd
confirmed_case_global_url='https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv'
confirmed_df = pd.read_csv(confirmed_case_global_url)
print(confirmed_df.shape)
print(confirmed_df.head())
  • US Raw data sample:
  UID iso2 iso3  code3  ...  10/15/20 10/16/20 10/17/20 10/18/20
0  84001001   US  USA    840  ...      1949     1966     1983     1989
1  84001003   US  USA    840  ...      6285     6333     6350     6369
2  84001005   US  USA    840  ...       965      968      977      981
3  84001007   US  USA    840  ...       761      771      775      785
4  84001009   US  USA    840  ...      1768     1783     1807     1827
  • Global Raw data sample:
Province/State Country/Region       Lat  ...  10/16/20  10/17/20  10/18/20
0            NaN    Afghanistan  33.93911  ...     40073     40141     40200
1            NaN        Albania  41.15330  ...     16501     16774     17055
2            NaN        Algeria  28.03390  ...     53998     54203     54402
3            NaN        Andorra  42.50630  ...      3377      3377      3377
4            NaN         Angola -11.20270  ...      7222      7462      7622

Important Note The raw data is in coloumn-oriented format. Data of a new day is append as new column.

US:

UID 84001001
iso2 US
iso3 USA
code3 840
FIPS 1001.0
Admin2 Autauga
Province_State Alabama
Country_Region US
Lat 32.53952745
Long_ -86.64408227
Combined_Key Autauga, Alabama, US
day1
day2
...
currentday

Global:

Province/State NaN
Country/Region Albania
Lat 41.1533
Long 20.1683
day1
day2
...
currentday

Stock Prices

  • Currently, there are three methods to get stock prices: Yahoo Finance API, Google Finance API, and pandas_datareader. Only the last one work.

  • Type: Time series

  • Supported methods: API, json

  • Frequency: daily

  • We focus on popular stock market in the US such as S&P500, Dow Jones, and Nasdaq

  • stock_ticker_raw table:

CREATE TABLE IF NOT EXISTS stock_ticker_raw(
    ticker VARCHAR(16) UNIQUE NOT NULL,
    name VARCHAR(128), -- full name of the stock ticker
    industry VARCHAR(64) NULL,
    subindustry VARCHAR(64) NULL,
    hq_location VARCHAR(64) NULL,
    date_first_added datetime NULL,
    cik VARCHAR(10) NULL, -- A Central Index Key or CIK number
    founded_year int NULL,
    PRIMARY KEY(ticker)
);
  • stock_price_raw table:
CREATE TABLE IF NOT EXISTS stock_price_raw(
    stock_ticker VARCHAR(16) NOT NULL,
    date datetime NOT NULL,
    High double NOT NULL,
    Low double NOT NULL,
    Open double NOT NULL,
    Close double NOT NULL,
    Volume double NOT NULL,
    adj_close double NOT NULL
);
  • How to extract data
    • First, we get the master list of all stock stickers using the API.
    • Next, for each stock sticker, we get the stock prices data using Yahoo's API.
    • Extract data into the raw table.
pip install pandas-datareader
from pandas_datareader import data

tickers = ['AAPL', 'MSFT', '^GSPC']
start_date = '2020-01-01'
end_date = '2020-09-30'

for ticker in tickers:
    panel_data = data.DataReader(ticker, 'yahoo', start_date, end_date)
    print(type(panel_data))
    print("ticker: ", ticker)
    print(panel_data)
  • Stock price data sample
Date          High        Low       Open      Close       Volume     Adj Close
                                                                          
2020-01-02  75.150002  73.797501  74.059998  75.087502  135480400.0  74.573036
2020-01-03  75.144997  74.125000  74.287498  74.357498  146322800.0  73.848030
2020-01-06  74.989998  73.187500  73.447502  74.949997  118387200.0  74.436470
2020-01-07  75.224998  74.370003  74.959999  74.597504  108872000.0  74.086395
2020-01-08  76.110001  74.290001  74.290001  75.797501  132079200.0  75.278160

Unemployment rate

  • Data source: U.S Bureau of Labor
  • Type: Time series
  • Supported methods: API, json
  • Frequency: monthly
  • All data from Bureau of Labor has unique series id for each feature. We list out desired features and their corresponding series_id as below
  • Interested features:
    • Unemployment Rate (overall): LNS14000000

    • Unemployment Rate Races:

      • Black or African American: LNS14000006
      • Hispanic or Latino: LNS14000009
      • White: LNS14000003
      • Asian: LNS14032183
    • Unemployment Rate - Occupations source

      • Management, Professional, and Related Occupations: LNU04032215
      • Service Occupations: LNU04032218
      • Sales and Office Occupations: LNU04032219
      • Natural Resources, Construction, and Maintenance Occupations: LNU04032222
      • Production, Transportation and Material Moving Occupations: LNU04032226
  • How to extract data:
    • First, we build the master list of interested series (e.g., LNS14000000, LNS14000009, LNS14000003, etc) and keep the data in bol_series_dim.
CREATE TABLE IF NOT EXISTS bol_series_dim(
    series_id VARCHAR(64) UNIQUE NOT NULL, -- matched with series_id from bol_raw
    category VARCHAR(256) NOT NULL, -- main category
    subcat1 VARCHAR(256), -- subcategory 1
    subcat2 VARCHAR(256), -- subcategory 1
    PRIMARY KEY(series_id)
);
INSERT INTO BOL_series_dim VALUES('LNS14000000', 'Unemployment Rate', 'overall', '');
INSERT INTO BOL_series_dim VALUES('LNS14000006', 'Unemployment Rate', 'race', 'Black or African American');
INSERT INTO BOL_series_dim VALUES('LNS14000009', 'Unemployment Rate', 'race', 'Hispanic or Latino');
INSERT INTO BOL_series_dim VALUES('LNS14000003', 'Unemployment Rate', 'race', 'White');
INSERT INTO BOL_series_dim VALUES('LNS14000003', 'Unemployment Rate', 'race', 'Asian');
INSERT INTO BOL_series_dim VALUES('LNU04032215', 'Unemployment Rate', 'occupation', 'Management, Professional, and Related Occupations');
INSERT INTO BOL_series_dim VALUES('LNU04032218', 'Unemployment Rate', 'occupation', 'Service');
INSERT INTO BOL_series_dim VALUES('LNU04032219', 'Unemployment Rate', 'occupation', 'Sales and Office Occupations');
INSERT INTO BOL_series_dim VALUES('LNU04032222', 'Unemployment Rate', 'occupation', 'Natural Resources, Construction, and Maintenance Occupations');
INSERT INTO BOL_series_dim VALUES('LNU04032226', 'Unemployment Rate', 'occupation', 'Production, Transportation and Material Moving Occupations');
  • Next, for each serie, we extract data from data source using HTTP request.
  • We extract data to raw tables.
import requests
import json
import prettytable
headers = {'Content-type': 'application/json'}
data = json.dumps({"seriesid": ['CUUR0000SA0','SUUR0000SA0'],"startyear":"2011", "endyear":"2014"})
p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
json_data = json.loads(p.text)
for series in json_data['Results']['series']:
    x=prettytable.PrettyTable(["series id","year","period","value","footnotes"])
    seriesId = series['seriesID']
    for item in series['data']:
        year = item['year']
        period = item['period']
        value = item['value']
        footnotes=""
        for footnote in item['footnotes']:
            if footnote:
                footnotes = footnotes + footnote['text'] + ','
       'if 'M01' <= period <= 'M12':'
            x.add_row([seriesId,year,period,value,footnotes[0:-1]])
    output = open(seriesId + '.txt','w')
    output.write (x.get_string())
    output.close()
  • Sample data
CUUR0000SA0.txt
+-------------+------+--------+---------+-----------+
|  series id  | year | period |  value  | footnotes |
+-------------+------+--------+---------+-----------+
| CUUR0000SA0 | 2020 |  M09   | 260.280 |           |
| CUUR0000SA0 | 2020 |  M08   | 259.918 |           |
| CUUR0000SA0 | 2020 |  M07   | 259.101 |           |
| CUUR0000SA0 | 2020 |  M06   | 257.797 |           |
| CUUR0000SA0 | 2020 |  M05   | 256.394 |           |
| CUUR0000SA0 | 2020 |  M04   | 256.389 |           |
| CUUR0000SA0 | 2020 |  M03   | 258.115 |           |
| CUUR0000SA0 | 2020 |  M02   | 258.678 |           |
| CUUR0000SA0 | 2020 |  M01   | 257.971 |           |
| CUUR0000SA0 | 2019 |  M12   | 256.974 |           |
| CUUR0000SA0 | 2019 |  M11   | 257.208 |           |
| CUUR0000SA0 | 2019 |  M10   | 257.346 |           |
| CUUR0000SA0 | 2019 |  M09   | 256.759 |           |
| CUUR0000SA0 | 2019 |  M08   | 256.558 |           |
| CUUR0000SA0 | 2019 |  M07   | 256.571 |           |
| CUUR0000SA0 | 2019 |  M06   | 256.143 |           |
| CUUR0000SA0 | 2019 |  M05   | 256.092 |           |
| CUUR0000SA0 | 2019 |  M04   | 255.548 |           |
| CUUR0000SA0 | 2019 |  M03   | 254.202 |           |
| CUUR0000SA0 | 2019 |  M02   | 252.776 |           |
| CUUR0000SA0 | 2019 |  M01   | 251.712 |           |
+-------------+------+--------+---------+-----------+

SUUR0000SA0.txt
+-------------+------+--------+---------+-----------+
|  series id  | year | period |  value  | footnotes |
+-------------+------+--------+---------+-----------+
| SUUR0000SA0 | 2020 |  M09   | 146.072 |  Initial  |
| SUUR0000SA0 | 2020 |  M08   | 145.853 |  Initial  |
| SUUR0000SA0 | 2020 |  M07   | 145.405 |  Initial  |
| SUUR0000SA0 | 2020 |  M06   | 144.651 |  Interim  |
| SUUR0000SA0 | 2020 |  M05   | 143.800 |  Interim  |
| SUUR0000SA0 | 2020 |  M04   | 143.847 |  Interim  |
| SUUR0000SA0 | 2020 |  M03   | 145.005 |  Interim  |
| SUUR0000SA0 | 2020 |  M02   | 145.390 |  Interim  |
| SUUR0000SA0 | 2020 |  M01   | 144.995 |  Interim  |
| SUUR0000SA0 | 2019 |  M12   | 144.437 |  Interim  |
| SUUR0000SA0 | 2019 |  M11   | 144.613 |  Interim  |
| SUUR0000SA0 | 2019 |  M10   | 144.722 |  Interim  |
| SUUR0000SA0 | 2019 |  M09   | 144.428 |           |
| SUUR0000SA0 | 2019 |  M08   | 144.388 |           |
| SUUR0000SA0 | 2019 |  M07   | 144.409 |           |
| SUUR0000SA0 | 2019 |  M06   | 144.243 |           |
| SUUR0000SA0 | 2019 |  M05   | 144.183 |           |
| SUUR0000SA0 | 2019 |  M04   | 143.926 |           |
| SUUR0000SA0 | 2019 |  M03   | 143.297 |           |
| SUUR0000SA0 | 2019 |  M02   | 142.571 |           |
| SUUR0000SA0 | 2019 |  M01   | 142.001 |           |
+-------------+------+--------+---------+-----------+

Transportation

Environment:

Businesses closed/bankruptcy

Healthcare index

Clone this wiki locally