# Importing from the COVID Tracking Project

This script pulls data from the API provided by the [COVID Tracking Project](https://covidtracking.com/). The project collects data from the 50 US states, the District of Columbia, and five US territories to provide the most comprehensive testing data available. It attempts to include positive and negative results, pending tests, and total people tested for each state or district currently reporting that data.

In [10]:
import pandas as pd
import requests
import json
import datetime
import pygsheets

In [2]:
# papermill parameters
output_folder = '../output/'

In [3]:
response = requests.get("https://covidtracking.com/api/states/daily", timeout=30)
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error page
raw_data = pd.DataFrame.from_dict(json.loads(response.text))
raw_data.head(5)

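|   | date | state | positive | negative | pending | death | total | dateChecked |
|---|------|-------|----------|----------|---------|-------|-------|-------------|
| 0 | 20200318 | AK | 6 | 406.0 | NaN | NaN | 412 | 2020-03-18T20:00:00Z |
| 1 | 20200318 | AL | 46 | 28.0 | NaN | 0.0 | 74 | 2020-03-18T20:00:00Z |
| 2 | 20200318 | AR | 33 | 236.0 | 50.0 | NaN | 319 | 2020-03-18T20:00:00Z |
| 3 | 20200318 | AZ | 28 | 148.0 | 102.0 | 0.0 | 278 | 2020-03-18T20:00:00Z |
| 4 | 20200318 | CA | 611 | 7981.0 | NaN | 13.0 | 8592 | 2020-03-18T20:00:00Z |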


## Data Quality
1. Replace empty values with zero
2. Convert the integer "date" column to a datetime "Date" column
3. Rename columns to match the other sources
4. Drop unnecessary columns
5. Add a "Country/Region" column; since the source only contains data from US states and territories, the value can be hardcoded

In [4]:
data = raw_data.fillna(0)
data['Date'] = pd.to_datetime(data['date'].astype(str),format='%Y%m%d')
data = data.rename(columns = {"state":"Province/State","positive":"Positive", "negative": "Negative", "pending": "Pending", "death":"Death", "total":"Total"})
data = data.drop(labels = ['dateChecked', "date"], axis = 'columns')
data['Country/Region'] = "US"
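To make the cleaning steps easier to check in isolation, here is a minimal sketch that bundles them into one function and runs it on two rows copied from the preview above. The function name `clean_states_daily` is hypothetical and not part of the original notebook; the steps inside mirror the cell above.

```python
import pandas as pd


def clean_states_daily(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper that applies the data quality steps in order."""
    # 1. Replace empty values with zero.
    data = raw.fillna(0)
    # 2. Convert the integer "date" column (e.g. 20200318) to a datetime column.
    data["Date"] = pd.to_datetime(data["date"].astype(str), format="%Y%m%d")
    # 3. Rename columns to match the other sources.
    data = data.rename(columns={
        "state": "Province/State",
        "positive": "Positive",
        "negative": "Negative",
        "pending": "Pending",
        "death": "Death",
        "total": "Total",
    })
    # 4. Drop columns that are no longer needed.
    data = data.drop(columns=["dateChecked", "date"])
    # 5. Hardcode the country.
    data["Country/Region"] = "US"
    return data


# Two rows taken from the preview of raw_data shown earlier.
sample = pd.DataFrame({
    "date": [20200318, 20200318],
    "state": ["AK", "AL"],
    "positive": [6, 46],
    "negative": [406.0, 28.0],
    "pending": [float("nan"), float("nan")],
    "death": [float("nan"), 0.0],
    "total": [412, 74],
    "dateChecked": ["2020-03-18T20:00:00Z", "2020-03-18T20:00:00Z"],
})

cleaned = clean_states_daily(sample)
print(cleaned)
```

Note that after `fillna(0)` a value that was never reported is indistinguishable from a reported zero, which matters when reading the `Pending` and `Death` columns.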

## Sorting data by Province/State before calculating the daily differences

In [5]:
sorted_data = data.sort_values(by=['Province/State', 'Date'], ascending=True)

In [6]:
# Daily change = the cumulative value minus the previous day's value within the same state.
# With fill_value=0, the first reported day of each state equals its cumulative value.
by_state = sorted_data.groupby('Province/State')
for column in ['Positive', 'Total', 'Negative', 'Pending', 'Death']:
    sorted_data[column + '_Since_Previous_Day'] = sorted_data[column] - by_state[column].shift(1, fill_value=0)
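The daily difference logic can be traced on a tiny example. The sketch below uses made-up states `X` and `Y` with illustrative counts (not real data) and starts from unsorted rows to show why sorting by state and date has to happen before the shift.

```python
import pandas as pd

# Illustrative, made-up cumulative counts; rows are deliberately out of order.
toy = pd.DataFrame({
    "Province/State": ["Y", "X", "X", "Y", "X"],
    "Date": pd.to_datetime(
        ["2020-03-18", "2020-03-17", "2020-03-16", "2020-03-17", "2020-03-18"]
    ),
    "Positive": [15, 3, 1, 10, 6],
})

# Sort so that, within each state, rows run from oldest to newest.
toy = toy.sort_values(by=["Province/State", "Date"])

# shift(1) moves each state's values down one row, so every row sees the
# previous day's cumulative count; the first row of each state gets 0.
previous_day = toy.groupby("Province/State")["Positive"].shift(1, fill_value=0)
toy["Positive_Since_Previous_Day"] = toy["Positive"] - previous_day

print(toy)
```

For state `X` the cumulative counts 1, 3, 6 become daily changes 1, 2, 3. For state `Y` the counts 10, 15 become 10, 5: the first reported day carries the whole cumulative value because there is no earlier row to subtract.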

## Rearrange columns

In [7]:
rearranged_data = sorted_data[['Country/Region', 'Province/State', 'Date',
                               'Positive', 'Positive_Since_Previous_Day',
                               'Negative', 'Negative_Since_Previous_Day',
                               'Pending', 'Pending_Since_Previous_Day',
                               'Death', 'Death_Since_Previous_Day',
                               'Total', 'Total_Since_Previous_Day']]

## Add `Last_Update_Date`

In [20]:
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# raised when writing with .loc to a column selection taken from sorted_data.
rearranged_data = rearranged_data.assign(Last_Update_Date=datetime.datetime.utcnow())


## Export to CSV

In [None]:
rearranged_data.to_csv(output_folder + "CT_US_COVID_TESTS.csv", index=False)
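The export above assumes that `output_folder` already exists and ends with a path separator. As a minimal sketch, assuming neither is guaranteed, the hypothetical helper `export_csv` below (not part of the original notebook) builds the path with `pathlib` and creates the folder first. The demo writes to a temporary directory so it can be run anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_csv(frame: pd.DataFrame, folder, filename="CT_US_COVID_TESTS.csv") -> Path:
    """Hypothetical helper: create the folder if needed, write the CSV, return its path."""
    folder_path = Path(folder)
    folder_path.mkdir(parents=True, exist_ok=True)
    target = folder_path / filename
    frame.to_csv(target, index=False)
    return target


# Made-up single-row frame, only used to exercise the helper.
demo = pd.DataFrame({"Province/State": ["X"], "Positive": [1]})

with tempfile.TemporaryDirectory() as tmp:
    # The "output" subfolder does not exist yet; export_csv creates it.
    written = export_csv(demo, Path(tmp) / "output")
    round_trip = pd.read_csv(written)

print(written.name)
```

In the notebook this would be called as `export_csv(rearranged_data, output_folder)`, which keeps the papermill parameter unchanged.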