# Importing from the COVID Tracking Project

This script pulls data from the API provided by the [COVID Tracking Project](https://covidtracking.com/). The project collects data from the 50 US states, the District of Columbia, and five US territories, aiming to provide the most comprehensive testing data available. Where a state or district reports them, it includes positive and negative results, pending tests, and the total number of people tested.

In [2]:
import pandas as pd
import requests
import json
import datetime
!pip install pycountry
import pycountry


In [3]:
# papermill parameters
output_folder = '../output/'

In [4]:
response = requests.get("https://covidtracking.com/api/states/daily")
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
raw_data = pd.DataFrame(response.json())

## Data Quality
1. Replace empty values with zero
2. Convert the integer "date" column to a datetime "Date" column
3. Rename columns to match the other sources
4. Drop unnecessary columns
5. Add a "Country/Region" column; because the source only contains data from US states, the value can be hardcoded

In [5]:
data = raw_data.fillna(0)
data['Date'] = pd.to_datetime(data['date'].astype(str), format='%Y%m%d')
data = data.rename(
    columns={
        "state": "ISO3166-2",
        "positive": "Positive",
        "negative": "Negative",
        "pending": "Pending",
        "death": "Death",
        "totalTestResults": "Total",
        "hospitalized": "Hospitalized"
    })
data = data.drop(labels=['dateChecked', "date"], axis='columns')
data['Country/Region'] = "United States"
data['ISO3166-1'] = "US"
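To make the cleaning steps above concrete, here is a minimal sketch on a two-row sample. The `sample` frame and its values are made up for illustration and only cover a few of the real columns; the calls mirror the ones used in the cell above.

```python
import pandas as pd

# Hypothetical sample shaped like the API response (values are made up).
sample = pd.DataFrame({
    "date": [20200307, 20200308],
    "state": ["AL", "AL"],
    "positive": [None, 4.0],
})

# Empty values become 0.
cleaned = sample.fillna(0)

# The integer 20200307 is read as the date 2020-03-07.
cleaned["Date"] = pd.to_datetime(cleaned["date"].astype(str), format="%Y%m%d")

# Rename to the shared column names, then drop the raw integer date.
cleaned = cleaned.rename(columns={"state": "ISO3166-2", "positive": "Positive"})
cleaned = cleaned.drop(columns=["date"])

# Every row comes from the United States, so the country is a constant.
cleaned["Country/Region"] = "United States"

print(cleaned)
```

Note that filling with zero makes "not reported" indistinguishable from a reported zero, which matters when reading the earliest rows for each state.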

In [6]:
states = {k.code.replace("US-", ""): k.name for k in pycountry.subdivisions.get(country_code="US")}

In [7]:
data["Province/State"] = data["ISO3166-2"].apply(lambda x: states[x])
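The lookup above raises a `KeyError` if the API ever returns a code that is missing from the mapping. A minimal sketch of a more forgiving lookup, assuming that falling back to the raw code is acceptable for downstream consumers; `state_name` and `states_subset` are hypothetical names used only for this illustration.

```python
# Hypothetical subset of the code-to-name mapping built from pycountry.
states_subset = {"AL": "Alabama", "NY": "New York"}


def state_name(code, lookup):
    """Return the subdivision name, or the code itself when it is unknown."""
    return lookup.get(code, code)


print(state_name("AL", states_subset))  # known code
print(state_name("ZZ", states_subset))  # unknown code falls back to itself
```

With the real mapping, this could be applied as `data["ISO3166-2"].apply(lambda x: state_name(x, states))`.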

## Sort data by Province/State before calculating the daily differences

In [8]:
sorted_data = data.sort_values(by=['Province/State', 'Date'], ascending=True)

In [9]:
# For each cumulative column, subtract the previous day's value within the same state.
# The first row of each state has no previous day, so it is compared against 0.
for column in ['Positive', 'Total', 'Negative', 'Pending', 'Death', 'Hospitalized']:
    previous_day = sorted_data.groupby(['Province/State'])[column].shift(1, fill_value=0)
    sorted_data[column + '_Since_Previous_Day'] = sorted_data[column] - previous_day
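The daily-difference calculation relies on the rows being sorted by state and date, and on the shift happening inside each state group. The following sketch shows the effect on a toy frame; the `toy` data is made up for illustration.

```python
import pandas as pd

# Hypothetical cumulative counts for two states (values are made up).
toy = pd.DataFrame({
    "Province/State": ["Alabama", "Alabama", "Alabama", "Alaska", "Alaska"],
    "Date": pd.to_datetime(
        ["2020-03-07", "2020-03-08", "2020-03-09", "2020-03-07", "2020-03-08"]
    ),
    "Positive": [2, 5, 9, 1, 1],
}).sort_values(by=["Province/State", "Date"])

# Shift within each state so the first Alaska row is not compared with Alabama.
previous = toy.groupby("Province/State")["Positive"].shift(1, fill_value=0)
toy["Positive_Since_Previous_Day"] = toy["Positive"] - previous

print(toy["Positive_Since_Previous_Day"].tolist())  # → [2, 3, 4, 1, 0]
```

Because of `fill_value=0`, the first day of each state reports its full cumulative count as that day's change.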

## Rearrange columns

In [10]:
rearranged_data = sorted_data.filter(items=['Country/Region', 'Province/State', 'Date',
                               'Positive', 'Positive_Since_Previous_Day',
                               'Negative', 'Negative_Since_Previous_Day',
                               'Pending', 'Pending_Since_Previous_Day',
                               'Death', 'Death_Since_Previous_Day',
                               'Hospitalized', 'Hospitalized_Since_Previous_Day',
                               'Total', 'Total_Since_Previous_Day',
                               'ISO3166-1', 'ISO3166-2'])

## Add `Last_Update_Date`

In [13]:
rearranged_data.loc[:, "Last_Update_Date"] = datetime.datetime.utcnow()
rearranged_data.head()

|  | Country/Region | Province/State | Date | Positive | Positive_Since_Previous_Day | Negative | Negative_Since_Previous_Day | Pending | Pending_Since_Previous_Day | Death | Death_Since_Previous_Day | Hospitalized | Hospitalized_Since_Previous_Day | Total | Total_Since_Previous_Day | ISO3166-1 | ISO3166-2 | Last_Update_Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1585 | United States | Alabama | 2020-03-07 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | US | AL | 2020-04-06 18:22:38.790648 |
| 1534 | United States | Alabama | 2020-03-08 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | US | AL | 2020-04-06 18:22:38.790648 |
| 1483 | United States | Alabama | 2020-03-09 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | US | AL | 2020-04-06 18:22:38.790648 |
| 1432 | United States | Alabama | 2020-03-10 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | US | AL | 2020-04-06 18:22:38.790648 |
| 1381 | United States | Alabama | 2020-03-11 | 0.0 | 0.0 | 10.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10 | 10 | US | AL | 2020-04-06 18:22:38.790648 |


## Export to CSV

In [12]:
import os
os.makedirs(output_folder, exist_ok=True)  # create the output folder if it does not exist yet
rearranged_data.to_csv(os.path.join(output_folder, "CT_US_COVID_TESTS.csv"), index=False)