
singer-io/tap-covid-19

 
 


tap-covid-19

This is a Singer tap that produces JSON-formatted data following the Singer spec.

This tap:

Streams

eu_daily

eu_ecdc_daily

italy_national_daily

  • Repository: pcm-dpc/COVID-19
  • Folder: dati-andamento-nazionale
  • Search Endpoint: https://api.github.com/search/code?q=path:dati-andamento-nazionale+extension:csv+repo:pcm-dpc/COVID-19&sort=indexed&order=asc
    • exclude: current files
  • File Endpoint: https://api.github.com/repos/pcm-dpc/COVID-19/contents/[GIT_FILE_PATH]
  • Primary key fields: git_path, __sdc_row_number
  • Replication strategy: Many files, with a new file added each day. Use INCREMENTAL replication only (NOT activate_version).
    • Bookmark field: git_last_modified
  • Transformations: Remove content node, add repository fields, decode, parse italy_national_daily content, cleanse location fields, and convert to JSON
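The transformation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the tap's actual code: it assumes the File Endpoint response is a JSON object whose `content` field holds the base64-encoded CSV (as the GitHub contents API returns it), and the function name `parse_contents_response`, the 1-based row numbering, and the sample column names are hypothetical. Location cleansing is omitted.

```python
import base64
import csv
import io


def parse_contents_response(file_json):
    """Turn one GitHub contents API response into a list of flat records.

    Hypothetical helper: the real tap may structure this differently.
    """
    # Work on a copy and remove the bulky content node from the metadata.
    meta = dict(file_json)
    encoded = meta.pop("content", "")
    # The body arrives base64-encoded; "utf-8-sig" also drops a leading BOM.
    text = base64.b64decode(encoded).decode("utf-8-sig")
    records = []
    # Assumption: __sdc_row_number counts data rows starting at 1.
    for row_number, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        record = dict(row)
        record["git_path"] = meta.get("path")   # repository field
        record["__sdc_row_number"] = row_number  # second primary key field
        records.append(record)
    return records


# Made-up sample data, only to show the shape of the output.
sample = {
    "path": "dati-andamento-nazionale/example.csv",
    "content": base64.encodebytes(b"date,total\n2020-03-01,10\n2020-03-02,20\n").decode("ascii"),
}
rows = parse_contents_response(sample)
```

Each resulting dict carries both primary key fields (`git_path`, `__sdc_row_number`) and is ready to serialize as JSON.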

italy_regional_daily

italy_provincial_daily

jh_csse_daily

nytimes_us_states

nytimes_us_counties

c19_trk_us_states_current

c19_trk_us_states_daily

c19_trk_us_states_info

c19_trk_us_population_states

c19_trk_us_population_states_age_groups

c19_trk_us_population_counties

c19_trk_us_states_acs_health_insurance

c19_trk_us_states_kff_hospital_beds_files (per 1000 population)

Authentication

This tap requires a GitHub API Token; see step 3 of the Quick Start below. Although this tap pulls only from public GitHub repositories, GitHub's API request limits are much lower for requests made without a token.
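As a rough sketch of what authenticated requests look like, the helper below builds request headers from the tap's config values. The function name `build_headers` is hypothetical, and how the tap itself attaches the token is not shown in this README. The `Authorization: token ...` scheme and the `application/vnd.github.v3+json` media type are standard for the GitHub REST API.

```python
def build_headers(config):
    """Build HTTP headers for GitHub API calls from a tap config dict."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    # Without a token the request is still valid but is rate-limited harder.
    if config.get("api_token"):
        headers["Authorization"] = "token {}".format(config["api_token"])
    if config.get("user_agent"):
        headers["User-Agent"] = config["user_agent"]
    return headers


headers = build_headers({
    "api_token": "YOUR_GITHUB_API_TOKEN",
    "user_agent": "tap-covid-19 <api_user_email@your_company.com>",
})
```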

Quick Start

  1. Install

    Clone this repository, and then install using setup.py. We recommend using a virtualenv:

    > virtualenv -p python3 venv
    > source venv/bin/activate
    > python setup.py install
    OR
    > cd .../tap-covid-19
    > pip install .
  2. Dependent libraries. The following dependent libraries were installed:

    > pip install singer-python
    > pip install singer-tools
    > pip install target-stitch
    > pip install target-json
    
  3. Create your tap's config.json file. This tap connects to GitHub with a GitHub OAuth2 token. This may be a Personal Access Token or a token obtained by creating an authorization for an app.

    {
        "api_token": "YOUR_GITHUB_API_TOKEN",
        "start_date": "2019-01-01T00:00:00Z",
        "user_agent": "tap-covid-19 <api_user_email@your_company.com>"
    }
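    A quick sanity check of the config before running the tap might look like the sketch below. It assumes all three keys shown above are required and that start_date always uses the `YYYY-MM-DDTHH:MM:SSZ` form. Both are assumptions, and `load_config` is a hypothetical helper, not part of the tap.

```python
import json
from datetime import datetime

# Assumption: every key from the example config is required.
REQUIRED_KEYS = ("api_token", "start_date", "user_agent")


def load_config(text):
    """Parse config JSON text and fail early on missing or malformed values."""
    config = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("config is missing: " + ", ".join(missing))
    # Raises ValueError if start_date does not match the expected format.
    datetime.strptime(config["start_date"], "%Y-%m-%dT%H:%M:%SZ")
    return config


config = load_config(
    '{"api_token": "YOUR_GITHUB_API_TOKEN",'
    ' "start_date": "2019-01-01T00:00:00Z",'
    ' "user_agent": "tap-covid-19 <api_user_email@your_company.com>"}'
)
```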

    Optionally, also create a state.json file. currently_syncing is an optional attribute that identifies the last object being synced, so that if a job is interrupted mid-stream the next run begins where the last one left off. The ...files streams use a datetime bookmark based on the GitHub last_modified datetime of the file, which is returned in the header of the GET response. The csv-data streams use an integer bookmark based on the UNIX epoch time when the file batch was last sent. This is used with __sdc_row_number as part of the Singer.io Activate Version logic to insert/update records and delete the delta (when the new batch has fewer records).

    {
      "currently_syncing": "eu_daily",
      "bookmarks": {
        "italy_national_daily": "2020-03-31T16:04:30Z",
        "c19_trk_us_states_info": "2020-03-31T22:00:05Z",
        "eu_daily": "2020-03-31T23:47:48Z",
        "eu_ecdc_daily": "2020-03-31T12:38:47Z",
        "c19_trk_us_states_kff_hospital_beds": "2020-03-19T20:09:37Z",
        "neherlab_country_codes": "2020-03-21T22:20:39Z",
        "nytimes_us_counties": "2020-03-31T13:58:59Z",
        "c19_trk_us_states_current": "2020-03-31T22:00:05Z",
        "neherlab_population": "2020-03-30T15:30:29Z",
        "italy_regional_daily": "2020-03-31T16:36:40Z",
        "c19_trk_us_daily": "2020-03-31T22:00:05Z",
        "c19_trk_us_states_daily": "2020-03-31T22:00:05Z",
        "c19_trk_us_population_states": "2020-03-19T20:09:37Z",
        "neherlab_case_counts": "2020-03-30T19:47:05Z",
        "c19_trk_us_population_counties": "2020-03-19T20:09:37Z",
        "nytimes_us_states": "2020-03-31T13:58:59Z",
        "italy_provincial_daily": "2020-03-31T16:28:32Z",
        "c19_trk_us_states_acs_health_insurance": "2020-03-19T20:09:37Z",
        "jh_csse_daily": "2020-04-01T00:08:36Z",
        "c19_trk_us_population_states_age_groups": "2020-03-19T20:09:37Z"
      }
    }
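    The datetime bookmarks in the state file above could drive incremental syncing along these lines. This is a sketch under assumptions, not the tap's implementation: the helper names are hypothetical, a stream with no bookmark is assumed to fall back to start_date from the config, and a file is assumed to qualify only when its last-modified time is strictly later than the bookmark.

```python
from datetime import datetime, timezone

TS_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_ts(value):
    """Parse a bookmark-style UTC timestamp into an aware datetime."""
    return datetime.strptime(value, TS_FORMAT).replace(tzinfo=timezone.utc)


def get_bookmark(state, stream, start_date):
    """Return the stream's bookmark, or start_date if none is stored yet."""
    return state.get("bookmarks", {}).get(stream, start_date)


def should_sync(state, stream, start_date, git_last_modified):
    """True when the file changed after the stream's current bookmark."""
    return parse_ts(git_last_modified) > parse_ts(get_bookmark(state, stream, start_date))


def advance_bookmark(state, stream, git_last_modified):
    """Move the bookmark forward, never backward, and return the state."""
    bookmarks = state.setdefault("bookmarks", {})
    current = bookmarks.get(stream)
    if current is None or parse_ts(git_last_modified) > parse_ts(current):
        bookmarks[stream] = git_last_modified
    return state


state = {"bookmarks": {"italy_national_daily": "2020-03-31T16:04:30Z"}}
start = "2019-01-01T00:00:00Z"
changed = should_sync(state, "italy_national_daily", start, "2020-04-01T16:00:00Z")
```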
    
  4. Run the Tap in Discovery Mode. This creates a catalog.json for selecting objects/fields to integrate:

    tap-covid-19 --config config.json --discover > catalog.json

    See the Singer docs on discovery mode for details.
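    Selecting streams in the generated catalog is usually done by setting `selected` to true in each stream's top-level metadata entry (the one with an empty breadcrumb), following the general Singer catalog convention. The sketch below assumes this tap's catalog follows that layout; `select_streams` is a hypothetical helper.

```python
def select_streams(catalog, stream_names):
    """Mark the named streams as selected in a Singer catalog dict."""
    for stream in catalog.get("streams", []):
        if stream.get("tap_stream_id") not in stream_names:
            continue
        metadata = stream.setdefault("metadata", [])
        for entry in metadata:
            # The stream-level entry is the one with an empty breadcrumb.
            if entry.get("breadcrumb") == []:
                entry.setdefault("metadata", {})["selected"] = True
                break
        else:
            # No stream-level entry yet: add one.
            metadata.append({"breadcrumb": [], "metadata": {"selected": True}})
    return catalog


# Tiny hand-written catalog, only to show the shape being edited.
catalog = {
    "streams": [
        {"tap_stream_id": "eu_daily", "metadata": [{"breadcrumb": [], "metadata": {}}]},
        {"tap_stream_id": "jh_csse_daily"},
        {"tap_stream_id": "nytimes_us_states", "metadata": []},
    ]
}
select_streams(catalog, {"eu_daily", "jh_csse_daily"})
```

    In practice the catalog would be read from catalog.json with `json.load` and written back with `json.dump` after selection.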

  5. Run the Tap in Sync Mode (with catalog) and write out to state file

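    For Sync mode (the commands below name the config file tap_config.json; substitute config.json if that is the name you used in step 3):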

    > tap-covid-19 --config tap_config.json --catalog catalog.json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    To load to JSON files to verify outputs:

    > tap-covid-19 --config tap_config.json --catalog catalog.json | target-json > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    To pseudo-load to Stitch Import API with dry run:

    > tap-covid-19 --config tap_config.json --catalog catalog.json | target-stitch --config target_config.json --dry-run > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json
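    The `tail -1` step keeps only the last line of the captured output. For raw tap output, a more defensive variant is to scan for the last STATE message and keep its value, as sketched below. This assumes the standard Singer message shape (`{"type": "STATE", "value": {...}}`); `last_state` is a hypothetical helper and the sample lines are made up.

```python
import json


def last_state(lines):
    """Return the value of the final STATE message in Singer output, or None."""
    state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except ValueError:
            # Skip anything that is not a JSON line (for example stray logs).
            continue
        if isinstance(message, dict) and message.get("type") == "STATE":
            state = message.get("value")
    return state


output = [
    '{"type": "SCHEMA", "stream": "eu_daily", "schema": {}, "key_properties": []}',
    '{"type": "STATE", "value": {"bookmarks": {"eu_daily": "2020-03-30T00:00:00Z"}}}',
    '{"type": "RECORD", "stream": "eu_daily", "record": {"x": 1}}',
    '{"type": "STATE", "value": {"bookmarks": {"eu_daily": "2020-03-31T23:47:48Z"}}}',
]
state = last_state(output)
```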
  6. Test the Tap

    While developing the COVID-19 tap, the following utilities were run in accordance with Singer.io best practices.

    Pylint, to improve code quality:

    > pylint tap_covid_19 -d missing-docstring -d logging-format-interpolation -d too-many-locals -d too-many-arguments

    Pylint test resulted in the following score:

    Your code has been rated at 9.44/10

    To check the tap and verify that it works:

    > tap-covid-19 --config tap_config.json --catalog catalog.json | singer-check-tap > state.json
    > tail -1 state.json > state.json.tmp && mv state.json.tmp state.json

    Check tap resulted in the following:

    The output is valid.
    It contained 101142 messages for 19 streams.
    
        19 schema messages
    101066 record messages
        57 state messages
    
    Details by stream:
    +-----------------------------------------+---------+---------+
    | stream                                  | records | schemas |
    +-----------------------------------------+---------+---------+
    | c19_trk_us_population_states_age_groups | 936     | 1       |
    | c19_trk_us_population_counties          | 3220    | 1       |
    | italy_provincial_daily                  | 4480    | 1       |
    | neherlab_country_codes                  | 250     | 1       |
    | italy_regional_daily                    | 735     | 1       |
    | nytimes_us_counties                     | 17731   | 1       |
    | nytimes_us_states                       | 1437    | 1       |
    | c19_trk_us_states_current               | 56      | 1       |
    | c19_trk_us_states_kff_hospital_beds     | 51      | 1       |
    | c19_trk_us_states_acs_health_insurance  | 1768    | 1       |
    | c19_trk_us_population_states            | 988     | 1       |
    | neherlab_population                     | 237     | 1       |
    | c19_trk_us_states_daily                 | 1317    | 1       |
    | neherlab_case_counts                    | 14660   | 1       |
    | eu_daily                                | 18109   | 1       |
    | eu_ecdc_daily                           | 416     | 1       |
    | c19_trk_us_daily                        | 28      | 1       |
    | italy_national_daily                    | 35      | 1       |
    | c19_trk_us_states_info                  | 56      | 1       |
    | jh_csse_daily                           | 35000   | 1       |
    +-----------------------------------------+---------+---------+
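    The kind of tally shown in the summary above can be approximated with a few lines of Python. This is only an illustrative sketch of counting Singer messages by type and records by stream; it is not how singer-check-tap is implemented, it performs no schema validation, and the sample lines are made up.

```python
import json
from collections import Counter


def summarize(lines):
    """Count Singer messages by type and RECORD messages by stream."""
    by_type = Counter()
    records_by_stream = Counter()
    for line in lines:
        if not line.strip():
            continue
        message = json.loads(line)
        by_type[message["type"]] += 1
        if message["type"] == "RECORD":
            records_by_stream[message["stream"]] += 1
    return by_type, records_by_stream


sample_output = [
    '{"type": "SCHEMA", "stream": "eu_daily", "schema": {}, "key_properties": []}',
    '{"type": "RECORD", "stream": "eu_daily", "record": {"a": 1}}',
    '{"type": "RECORD", "stream": "eu_daily", "record": {"a": 2}}',
    '{"type": "STATE", "value": {"bookmarks": {}}}',
]
by_type, records_by_stream = summarize(sample_output)
```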
    

Copyright © 2020 Stitch
