# "The COVID Tracking Project"
The script pulls data from the API of "The COVID Tracking Project". The project collects data from 50 U.S. states, the District of Columbia, and 5 other U.S. territories, with the aim of providing the most comprehensive testing data available. It attempts to include positive and negative results, pending tests, and total people tested for each state or district currently reporting that data.
Website: https://covidtracking.com/

In [None]:
import pandas as pd
import requests
import json
import datetime

In [None]:
# papermill parameters
output_folder = '../output/'

In [None]:
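response = requests.get("https://covidtracking.com/api/states/daily", timeout=30)
response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
raw_response = response.text
raw_data = pd.DataFrame.from_dict(json.loads(raw_response))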
raw_data.head(5)

### Data Quality
1. Replace empty values with zero
2. Convert the integer "date" column (YYYYMMDD) to a "Date" datetime column
3. Rename columns to match the other sources
4. Drop unnecessary columns
5. Add a "Country/Region" column; since the source only contains data from US states, it can be hardcoded

In [None]:
data = raw_data.fillna(0)
data['Date'] = pd.to_datetime(data['date'].astype(str),format='%Y%m%d')
data = data.rename(columns = {"state":"Province/State","positive":"Positive", "negative": "Negative", "pending": "Pending", "death":"Death", "total":"Total"})
data = data.drop(labels = ['dateChecked', "date"], axis = 'columns')
data['Country/Region'] = "US"
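
To illustrate the first two steps in isolation, here is a minimal sketch on a made-up frame. The sample values are hypothetical and only mimic the shape of the API's "date" and "positive" columns; they are not real API output.

```python
import pandas as pd

# Hypothetical sample: integer dates in YYYYMMDD form and one missing count
sample = pd.DataFrame({"date": [20200301, 20200302], "positive": [5.0, None]})

# Step 1: missing values become 0
sample = sample.fillna(0)

# Step 2: the integer is cast to a string and parsed with an explicit format
sample["Date"] = pd.to_datetime(sample["date"].astype(str), format="%Y%m%d")

print(sample["Date"].dt.strftime("%Y-%m-%d").tolist())  # ['2020-03-01', '2020-03-02']
print(sample["positive"].tolist())  # [5.0, 0.0]
```

Note that filling with zero makes "not reported" indistinguishable from a true zero, which is worth keeping in mind when reading the daily differences later on.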

1. Sort the data by Province/State and Date before calculating the daily differences

In [None]:
sorted_data = data.sort_values(by=['Province/State', 'Date'], ascending=True)

In [None]:
# Daily change per state: cumulative value minus the previous day's value.
# The first day of each state has no predecessor, so it is compared against 0.
for column in ['Positive', 'Total', 'Negative', 'Pending', 'Death']:
    previous_day = sorted_data.groupby(['Province/State'])[column].shift(1, fill_value=0)
    sorted_data[column + '_Since_Previous_Day'] = sorted_data[column] - previous_day
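
The per-state difference can be checked on a small example. The sketch below uses a hypothetical frame with cumulative counts for two states, already sorted by state and date, and applies the same `groupby(...).shift(1, fill_value=0)` pattern.

```python
import pandas as pd

# Hypothetical cumulative counts, sorted by state and then by day
toy = pd.DataFrame({
    "Province/State": ["AK", "AK", "AK", "NY", "NY"],
    "Positive": [1, 3, 6, 10, 25],
})

# Previous day's value within each state; the first day is filled with 0
previous = toy.groupby("Province/State")["Positive"].shift(1, fill_value=0)
toy["Positive_Since_Previous_Day"] = toy["Positive"] - previous

print(toy["Positive_Since_Previous_Day"].tolist())  # [1, 2, 3, 10, 15]
```

Because the shift happens inside each group, the first NY row is compared against 0 rather than against the last AK row. This is also why the data has to be sorted by date within each state first.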

1. Rearrange columns
2. Add "Last_Update_Date" column
3. Write the result to a CSV file

In [None]:
rearranged_data = sorted_data[['Country/Region', 'Province/State', 'Date',
                               'Positive', 'Positive_Since_Previous_Day',
                               'Negative', 'Negative_Since_Previous_Day',
                               'Pending', 'Pending_Since_Previous_Day',
                               'Death', 'Death_Since_Previous_Day',
                               'Total', 'Total_Since_Previous_Day']].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
rearranged_data["Last_Update_Date"] = datetime.datetime.utcnow()
rearranged_data.to_csv(output_folder + "CT_US_COVID_TESTS.csv", index=False)
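
A quick way to confirm that the written file can be consumed downstream is a round trip through `pd.read_csv`. The following is only a sketch: it uses a hypothetical one-row frame and a temporary directory instead of the notebook's `output_folder`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row frame standing in for the notebook's final table
frame = pd.DataFrame({
    "Province/State": ["AK"],
    "Date": pd.to_datetime(["2020-03-01"]),
    "Positive": [1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "CT_US_COVID_TESTS.csv")
    frame.to_csv(path, index=False)
    # parse_dates restores the "Date" column as datetime instead of plain text
    restored = pd.read_csv(path, parse_dates=["Date"])

print(restored.columns.tolist())
```

Without `parse_dates`, the "Date" column would come back as strings, since CSV carries no type information.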