This is a Singer tap that produces JSON-formatted data following the Singer spec.
This tap:
- Pulls CSV files from the GitHub v3 API.
- Extracts the following resources:
  - CSV data files: Git API search with filename and extension filters across the COVID-19 repositories listed below, streaming in new/changed files.
- Outputs the schema for each resource.
- Incrementally pulls data based on the input state (file last-modified in GitHub).
- Repository: covid19-eu-zh/covid19-eu-data
- Folder: dataset/daily/
- Search Endpoint: https://api.github.com/search/code?q=-filename:ecdc+path:dataset/daily+extension:csv+repo:covid19-eu-zh/covid19-eu-data&sort=indexed&order=desc
- Exclude: ecdc folder/files
- File Endpoint: https://api.github.com/repos/covid19-eu-zh/covid19-eu-data/contents/[GIT_FILE_PATH]
- Primary key fields: git_path, __sdc_row_number
- Replication strategy: Many files w/ new files each day. Use INCREMENTAL replication only (NOT activate_version).
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse eu_daily_file content, get date from table datetime, merge differing column sets, convert to JSON
- Notes: source = country
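The search and file endpoints listed for each stream share one shape, so they can be assembled from a repository name and a few search qualifiers. The following is a minimal sketch, not the tap's actual code: the helper names `build_search_url` and `build_file_url` are hypothetical, and the file path `dataset/daily/at.csv` is only an illustrative placeholder.

```python
from urllib.parse import quote

API_BASE = "https://api.github.com"


def build_search_url(repo, qualifiers, order="desc"):
    """Join search qualifiers with '+' and append the repo qualifier."""
    query = "+".join(list(qualifiers) + [f"repo:{repo}"])
    return f"{API_BASE}/search/code?q={query}&sort=indexed&order={order}"


def build_file_url(repo, git_path):
    """Contents endpoint for one file; '/' is kept, other characters are quoted."""
    return f"{API_BASE}/repos/{repo}/contents/{quote(git_path)}"


search_url = build_search_url(
    "covid19-eu-zh/covid19-eu-data",
    ["-filename:ecdc", "path:dataset/daily", "extension:csv"],
)
# Hypothetical file path, used only to show the URL shape.
file_url = build_file_url("covid19-eu-zh/covid19-eu-data", "dataset/daily/at.csv")
```

The resulting `search_url` matches the Search Endpoint listed above for this folder.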
- Repository: covid19-eu-zh/covid19-eu-data
- Folder: dataset/daily/ecdc
- Search Endpoint: https://api.github.com/search/code?q=filename:ecdc+path:dataset/daily/ecdc+extension:csv+repo:covid19-eu-zh/covid19-eu-data&sort=indexed&order=desc
- File Endpoint: https://api.github.com/repos/covid19-eu-zh/covid19-eu-data/contents/[GIT_FILE_PATH]
- Primary key fields: git_path, __sdc_row_number
- Replication strategy: Many files w/ new file each day. Use INCREMENTAL replication only (NOT activate_version).
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse eu_daily_file content, get date from table datetime, merge differing column sets, convert to JSON
- Notes: source = ecdc
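The "merge differing column sets" transformation can be pictured as aligning every row on the union of the columns seen across files. This is a rough sketch under that assumption, not the tap's implementation: `merge_column_sets` is a hypothetical helper and the column names are invented for illustration.

```python
import csv
import io


def merge_column_sets(csv_texts):
    """Parse several CSV texts and align all rows on the union of their columns."""
    rows, columns = [], []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        # Record each column the first time it is seen, preserving order.
        for name in reader.fieldnames or []:
            if name not in columns:
                columns.append(name)
        rows.extend(reader)
    # Columns missing from a given file are filled with None.
    return [{col: row.get(col) for col in columns} for row in rows]


merged = merge_column_sets([
    "country,cases\nAT,10\n",           # illustrative data only
    "country,cases,deaths\nDE,20,1\n",  # a later file with an extra column
])
```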
- Repository: pcm-dpc/COVID-19
- Folder: dati-andamento-nazionale
- Search Endpoint: https://api.github.com/search/code?q=path:dati-andamento-nazionale+extension:csv+repo:pcm-dpc/COVID-19&sort=indexed&order=asc
- Exclude: current files
- File Endpoint: https://api.github.com/repos/pcm-dpc/COVID-19/contents/[GIT_FILE_PATH]
- Primary key fields: git_path, __sdc_row_number
- Replication strategy: Many files w/ new file each day. Use INCREMENTAL replication only (NOT activate_version).
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse italy_national_daily content, cleanse location fields, and convert to JSON
- Repository: pcm-dpc/COVID-19
- Folder: dati-regioni
- Search Endpoint: https://api.github.com/search/code?q=path:dati-regioni+extension:csv+repo:pcm-dpc/COVID-19&sort=indexed&order=asc
- Exclude: current files
- File Endpoint: https://api.github.com/repos/pcm-dpc/COVID-19/contents/[GIT_FILE_PATH]
- Primary key fields: git_path, __sdc_row_number
- Replication strategy: Many files w/ new files each day. Use INCREMENTAL replication only (NOT activate_version).
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse italy_regional_daily content, cleanse location fields, and convert to JSON
- Repository: pcm-dpc/COVID-19
- Folder: dati-province
- Search Endpoint: https://api.github.com/search/code?q=path:dati-province+extension:csv+repo:pcm-dpc/COVID-19&sort=indexed&order=asc
- Exclude: current files
- File Endpoint: https://api.github.com/repos/pcm-dpc/COVID-19/contents/[GIT_FILE_PATH]
- Primary key fields: git_path, __sdc_row_number
- Replication strategy: Many files w/ new files each day. Use INCREMENTAL replication only (NOT activate_version).
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse italy_provincial_daily content, cleanse location fields, and convert to JSON
- Repository: CSSEGISandData/COVID-19
- Folder: csse_covid_19_data/csse_covid_19_daily_reports
- Search Endpoint: https://api.github.com/search/code?q=path:csse_covid_19_data/csse_covid_19_daily_reports+extension:csv+repo:CSSEGISandData/COVID-19&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/CSSEGISandData/COVID-19/contents/[GIT_FILE_PATH]
- Primary key fields: git_path, __sdc_row_number
- Replication strategy: Many files w/ new file each day. Use INCREMENTAL replication only (NOT activate_version).
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse jh_daily_file content, cleanse location fields, and convert to JSON
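The shared steps in these transformation lists (remove the content node, decode, parse, convert to JSON-ready records) can be sketched as follows. The GitHub contents endpoint returns file bodies base64-encoded in a `content` field. Everything else here is an assumption made for illustration: the helper name `content_to_records`, numbering rows from 1, and the sample path and columns.

```python
import base64
import csv
import io


def content_to_records(file_response):
    """Decode a contents-API payload and yield one dict per CSV row."""
    # Keep the file metadata but drop the bulky base64 "content" node.
    meta = {k: v for k, v in file_response.items() if k != "content"}
    text = base64.b64decode(file_response["content"]).decode("utf-8")
    for row_number, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        record = dict(row)
        record["git_path"] = meta["path"]        # primary key part 1
        record["__sdc_row_number"] = row_number  # primary key part 2 (assumed 1-based)
        yield record


# Illustrative payload; the path and columns are hypothetical.
sample = {
    "path": "reports/01-22-2020.csv",
    "content": base64.b64encode(b"Province,Confirmed\nHubei,444\n").decode("ascii"),
}
records = list(content_to_records(sample))
```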
- Repository: nytimes/covid-19-data
- Folder: . (root folder)
- Search Endpoint: https://api.github.com/search/code?q=filename:us-states+extension:csv+repo:nytimes/covid-19-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/nytimes/covid-19-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ daily updates (additional rows). Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse us-states content and convert to JSON, lookup state_name
- Repository: nytimes/covid-19-data
- Folder: . (root folder)
- Search Endpoint: https://api.github.com/search/code?q=filename:us-counties+extension:csv+repo:nytimes/covid-19-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/nytimes/covid-19-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ daily updates (additional rows). Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse us-counties content and convert to JSON, lookup state_name
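For the single-file streams, FULL_TABLE replication with activate_version means every record in a batch carries the same version number, and an ACTIVATE_VERSION message follows the batch so the target can drop rows left over from older versions. A minimal sketch of that message sequence, assuming a millisecond timestamp as the version (the function name `full_table_messages` is hypothetical and this is not the tap's own code):

```python
import json
import time


def full_table_messages(stream, rows, version=None):
    """Build versioned RECORD messages followed by one ACTIVATE_VERSION message."""
    if version is None:
        version = int(time.time() * 1000)  # assumption: epoch milliseconds
    messages = [
        {"type": "RECORD", "stream": stream, "version": version,
         "record": {"__sdc_row_number": i, **row}}
        for i, row in enumerate(rows, start=1)
    ]
    messages.append({"type": "ACTIVATE_VERSION", "stream": stream, "version": version})
    return messages


# Illustrative row only.
for message in full_table_messages("nytimes_us_states", [{"state": "Washington", "cases": 1}], version=1):
    print(json.dumps(message))
```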
- Repository: COVID19Tracking/covid-tracking-data
- Folder: data
- Search Endpoint: https://api.github.com/search/code?q=path:data+filename:state_current+extension:csv+repo:COVID19Tracking/covid-tracking-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/COVID19Tracking/covid-tracking-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ daily updates (additional rows). Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse states_current content and convert to JSON, camelCase to snake_case field keys
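The camelCase to snake_case step can be done with two regular-expression passes. This is one common approach, shown as a sketch; the tap may implement it differently, and the field names below are only examples.

```python
import re


def camel_to_snake(name):
    """Convert a camelCase key such as 'totalTestResults' to snake_case."""
    partial = re.sub(r"(.)([A-Z][a-z]+)", r"\1_\2", name)
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", partial).lower()


def snake_case_keys(record):
    """Return a copy of the record with every key converted."""
    return {camel_to_snake(key): value for key, value in record.items()}


converted = snake_case_keys({"totalTestResults": 100, "dateModified": "2020-03-31"})
```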
- Repository: COVID19Tracking/covid-tracking-data
- Folder: data
- Search Endpoint: https://api.github.com/search/code?q=path:data+filename:states_daily_4pm_et+extension:csv+repo:COVID19Tracking/covid-tracking-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/COVID19Tracking/covid-tracking-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ daily updates (updated rows). Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse states_daily content and convert to JSON, camelCase to snake_case field keys
- Repository: COVID19Tracking/covid-tracking-data
- Folder: data
- Search Endpoint: https://api.github.com/search/code?q=path:data+filename:states_info+extension:csv+repo:COVID19Tracking/covid-tracking-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/COVID19Tracking/covid-tracking-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ occasional updates (updated rows). Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse states_info content and convert to JSON, camelCase to snake_case field keys
- Repository: COVID19Tracking/associated-data
- Folder: us_census_data
- Search Endpoint: https://api.github.com/search/code?q=path:us_census_data+filename:us_census_2018_population_estimates_states+extension:csv+repo:COVID19Tracking/associated-data&sort=indexed&order=asc
- Exclude: agegroups file
- File Endpoint: https://api.github.com/repos/COVID19Tracking/associated-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ minimal updates. Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse us_population_states content and convert to JSON, camelCase to snake_case field keys
c19_trk_us_population_states_age_groups
- Repository: COVID19Tracking/associated-data
- Folder: us_census_data
- Search Endpoint: https://api.github.com/search/code?q=path:us_census_data+filename:us_census_2018_population_estimates_states_agegroups+extension:csv+repo:COVID19Tracking/associated-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/COVID19Tracking/associated-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ minimal updates. Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse us_population_states_age_groups content and convert to JSON, camelCase to snake_case field keys
c19_trk_us_population_counties
- Repository: COVID19Tracking/associated-data
- Folder: us_census_data
- Search Endpoint: https://api.github.com/search/code?q=path:us_census_data+filename:us_census_2018_population_estimates_counties+extension:csv+repo:COVID19Tracking/associated-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/COVID19Tracking/associated-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ minimal updates. Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse us_population_counties content and convert to JSON, camelCase to snake_case field keys
c19_trk_us_states_acs_health_insurance
- Repository: COVID19Tracking/associated-data
- Folder: acs_health_insurance
- Search Endpoint: https://api.github.com/search/code?q=path:acs_health_insurance+filename:acs_2018_health_insurance_coverage_estimates+extension:csv+repo:COVID19Tracking/associated-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/COVID19Tracking/associated-data/contents/[GIT_FILE_PATH]
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ minimal updates. Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse us_acs_health_insurance content and convert to JSON, camelCase to snake_case field keys
c19_trk_us_states_kff_hospital_beds_files (per 1000 population)
- Repository: COVID19Tracking/associated-data
- Folder: kff_hospital_beds
- Search Endpoint: https://api.github.com/search/code?q=path:kff_hospital_beds+filename:kff_usa_hospital_beds_per_capita_2018+extension:csv+repo:COVID19Tracking/associated-data&sort=indexed&order=asc
- File Endpoint: https://api.github.com/repos/COVID19Tracking/associated-data/contents/[GIT_FILE_PATH]
- Original Source: KFF (Kaiser Family Foundation)
- Primary key fields: __sdc_row_number
- Replication strategy: Single file w/ minimal updates. Use FULL_TABLE replication w/ activate_version.
- Bookmark field: git_last_modified
- Transformations: Remove content node, add repository fields, decode, parse us_kff_hospital_beds content and convert to JSON, camelCase to snake_case field keys
This tap requires a GitHub API Token. See Step 3 below. Even though this tap pulls from public GitHub repositories, API request limits are much lower without a token.
- Install
Clone this repository, and then install using setup.py. We recommend using a virtualenv:
> virtualenv -p python3 venv
> source venv/bin/activate
> python setup.py install
OR
> cd .../tap-covid-19
> pip install .
- Dependent libraries
The following dependent libraries were installed:
> pip install singer-python
> pip install singer-tools
> pip install target-stitch
> pip install target-json
- Create your tap's config.json file. This tap connects to GitHub with a GitHub OAuth2 token. This may be a Personal Access Token, or you can create an authorization for an App.
{
  "api_token": "YOUR_GITHUB_API_TOKEN",
  "start_date": "2019-01-01T00:00:00Z",
  "user_agent": "tap-covid-19 <api_user_email@your_company.com>"
}
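To show how these config values are typically used, here is a small sketch that turns a loaded config into GitHub request headers. The `Authorization: token ...` scheme and the v3 `Accept` media type are standard for the GitHub API; the helper name `build_headers` is hypothetical and the tap's own client may differ.

```python
import json


def build_headers(config):
    """Map config.json values onto GitHub API request headers."""
    return {
        "Authorization": f"token {config['api_token']}",
        "User-Agent": config["user_agent"],
        "Accept": "application/vnd.github.v3+json",
    }


config = json.loads("""
{
  "api_token": "YOUR_GITHUB_API_TOKEN",
  "start_date": "2019-01-01T00:00:00Z",
  "user_agent": "tap-covid-19 <api_user_email@your_company.com>"
}
""")
headers = build_headers(config)
```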
Optionally, also create a state.json file. currently_syncing is an optional attribute used for identifying the last object to be synced in case the job is interrupted mid-stream. The next run would begin where the last job left off. The ...files streams use a datetime bookmark based on the GitHub last_modified datetime of the file that is returned in the GET header response. The csv-data streams use an integer bookmark based on the UNIX epoch time when the file batch was last sent. This is used with the __sdc_row_number as a part of the Singer.io Activate Version logic to insert/update and delete the delta (when the new batch has fewer records).
{
  "currently_syncing": "eu_daily",
  "bookmarks": {
    "italy_national_daily": "2020-03-31T16:04:30Z",
    "c19_trk_us_states_info": "2020-03-31T22:00:05Z",
    "eu_daily": "2020-03-31T23:47:48Z",
    "eu_ecdc_daily": "2020-03-31T12:38:47Z",
    "c19_trk_us_states_kff_hospital_beds": "2020-03-19T20:09:37Z",
    "neherlab_country_codes": "2020-03-21T22:20:39Z",
    "nytimes_us_counties": "2020-03-31T13:58:59Z",
    "c19_trk_us_states_current": "2020-03-31T22:00:05Z",
    "neherlab_population": "2020-03-30T15:30:29Z",
    "italy_regional_daily": "2020-03-31T16:36:40Z",
    "c19_trk_us_daily": "2020-03-31T22:00:05Z",
    "c19_trk_us_states_daily": "2020-03-31T22:00:05Z",
    "c19_trk_us_population_states": "2020-03-19T20:09:37Z",
    "neherlab_case_counts": "2020-03-30T19:47:05Z",
    "c19_trk_us_population_counties": "2020-03-19T20:09:37Z",
    "nytimes_us_states": "2020-03-31T13:58:59Z",
    "italy_provincial_daily": "2020-03-31T16:28:32Z",
    "c19_trk_us_states_acs_health_insurance": "2020-03-19T20:09:37Z",
    "jh_csse_daily": "2020-04-01T00:08:36Z",
    "c19_trk_us_population_states_age_groups": "2020-03-19T20:09:37Z"
  }
}
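The datetime bookmarks described above can be compared against the Last-Modified value from a file's GET response to decide whether the file is new or changed. The sketch below assumes the header arrives in the usual HTTP date format and that bookmarks use the `YYYY-MM-DDTHH:MM:SSZ` form shown in the sample state; the function names are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

BOOKMARK_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_bookmark(value):
    """Parse a state.json bookmark string into an aware UTC datetime."""
    return datetime.strptime(value, BOOKMARK_FORMAT).replace(tzinfo=timezone.utc)


def is_new_or_changed(last_modified_header, state, stream):
    """True when the file was modified after the stream's bookmark (or none exists)."""
    modified = parsedate_to_datetime(last_modified_header)
    bookmark = state.get("bookmarks", {}).get(stream)
    return bookmark is None or modified > parse_bookmark(bookmark)


def update_bookmark(state, stream, last_modified_header):
    """Advance the stream's bookmark to the file's Last-Modified time."""
    modified = parsedate_to_datetime(last_modified_header).astimezone(timezone.utc)
    state.setdefault("bookmarks", {})[stream] = modified.strftime(BOOKMARK_FORMAT)
    return state


state = {"bookmarks": {"eu_daily": "2020-03-31T23:47:48Z"}}
changed = is_new_or_changed("Wed, 01 Apr 2020 00:00:00 GMT", state, "eu_daily")
```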
- Run the Tap in Discovery Mode
This creates a catalog.json for selecting objects/fields to integrate:
> tap-covid-19 --config config.json --discover > catalog.json
See the Singer docs on discovery mode here.
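Once catalog.json exists, streams are usually selected by setting `selected` in each stream's top-level metadata entry (the one with an empty breadcrumb). The sketch below assumes the catalog follows that common Singer layout; `select_streams` is a hypothetical helper and the sample catalog is trimmed for illustration.

```python
def select_streams(catalog, stream_names):
    """Mark the named streams as selected in a Singer catalog dict."""
    for stream in catalog["streams"]:
        for entry in stream.get("metadata", []):
            # The empty breadcrumb identifies stream-level metadata.
            if entry["breadcrumb"] == []:
                entry["metadata"]["selected"] = stream["tap_stream_id"] in stream_names
    return catalog


# Trimmed, illustrative catalog; real catalogs also carry schemas and field metadata.
catalog = {
    "streams": [
        {"tap_stream_id": "eu_daily", "metadata": [{"breadcrumb": [], "metadata": {}}]},
        {"tap_stream_id": "jh_csse_daily", "metadata": [{"breadcrumb": [], "metadata": {}}]},
    ]
}
select_streams(catalog, {"eu_daily"})
```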
- Run the Tap in Sync Mode (with catalog) and write out to state file
For Sync mode:
> tap-covid-19 --config tap_config.json --catalog catalog.json > state.json
> tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
To load to json files to verify outputs:
> tap-covid-19 --config tap_config.json --catalog catalog.json | target-json > state.json
> tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
To pseudo-load to Stitch Import API with dry run:
> tap-covid-19 --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
> tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
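The `tail -1` step keeps only the final line of the captured output as the new state. As a sketch of the same idea in Python, assuming each output line is a JSON document and the last non-empty line is the one to keep (`last_state` is a hypothetical helper):

```python
import json


def last_state(lines):
    """Return the last non-empty line parsed as JSON, or {} if there is none."""
    for line in reversed(list(lines)):
        if line.strip():
            return json.loads(line)
    return {}


# Illustrative output lines only.
output_lines = [
    '{"bookmarks": {"eu_daily": "2020-03-30T00:00:00Z"}}',
    '{"bookmarks": {"eu_daily": "2020-03-31T23:47:48Z"}}',
    "",
]
final_state = last_state(output_lines)
```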
- Test the Tap
While developing the COVID-19 tap, the following utilities were run in accordance with Singer.io best practices:
Pylint to improve code quality:
> pylint tap_covid_19 -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments
Pylint test resulted in the following score:
Your code has been rated at 9.44/10
To check the tap and verify working:
> tap-covid-19 --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
> tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
Check tap resulted in the following:
The output is valid.
It contained 101142 messages for 19 streams.
19 schema messages
101066 record messages
57 state messages
Details by stream:
+-----------------------------------------+---------+---------+
| stream                                  | records | schemas |
+-----------------------------------------+---------+---------+
| c19_trk_us_population_states_age_groups | 936     | 1       |
| c19_trk_us_population_counties          | 3220    | 1       |
| italy_provincial_daily                  | 4480    | 1       |
| neherlab_country_codes                  | 250     | 1       |
| italy_regional_daily                    | 735     | 1       |
| nytimes_us_counties                     | 17731   | 1       |
| nytimes_us_states                       | 1437    | 1       |
| c19_trk_us_states_current               | 56      | 1       |
| c19_trk_us_states_kff_hospital_beds     | 51      | 1       |
| c19_trk_us_states_acs_health_insurance  | 1768    | 1       |
| c19_trk_us_population_states            | 988     | 1       |
| neherlab_population                     | 237     | 1       |
| c19_trk_us_states_daily                 | 1317    | 1       |
| neherlab_case_counts                    | 14660   | 1       |
| eu_daily                                | 18109   | 1       |
| eu_ecdc_daily                           | 416     | 1       |
| c19_trk_us_daily                        | 28      | 1       |
| italy_national_daily                    | 35      | 1       |
| c19_trk_us_states_info                  | 56      | 1       |
| jh_csse_daily                           | 35000   | 1       |
+-----------------------------------------+---------+---------+
Copyright © 2020 Stitch