This notebook walks through the steps needed to prepare a time-series dataset of JSON files to feed into the Metrics Advisor workspace. Each JSON file contains one day of data: the count of COVID-positive cases by age group and hospitalization status.

First, let's import the required libraries and namespaces.

In [None]:
import pandas as pd
from pandas import json_normalize  # the pandas.io.json import path is deprecated since pandas 1.0
import numpy as np
import datetime
import os, shutil
import math
import timeit
from io import StringIO
import re
import urllib.request, json

print("pandas version: {} numpy version: {}".format(pd.__version__, np.__version__))

import os
import azureml.core
from azureml.core import Workspace, Datastore, Dataset
# Check core SDK version number
print("azureml SDK version:", azureml.core.VERSION)

Replace the `<BLOBSTORAGE_ACCOUNT_NAME>` and `<BLOBSTORAGE_ACCOUNT_KEY>` placeholders below with the values you noted down in a previous step.

In [None]:
#Provide values for the existing blob storage account name and key
blob_account_name = "<BLOBSTORAGE_ACCOUNT_NAME>"
blob_account_key = "<BLOBSTORAGE_ACCOUNT_KEY>"
blob_datastore_name = 'covid_datastore' # Name under which the datastore is registered in the workspace
container_name = "jsonmetrics" # Name of Azure blob container

Connect to the Azure Machine Learning workspace and register the `jsonmetrics` blob container in the workspace as a datastore named `covid_datastore`. This is where the input data for Metrics Advisor will be saved.

In [None]:
ws = Workspace.from_config()

#register the datastore where the Metrics Advisor data feed will be generated
blob_datastore = Datastore.register_azure_blob_container(
    workspace=ws, 
    datastore_name=blob_datastore_name, 
    container_name=container_name, 
    account_name=blob_account_name,
    account_key=blob_account_key)

Load the COVID-19 case surveillance dataset and inspect its first 10 rows.

In [None]:
df = pd.read_csv(
    'https://solliancepublicdata.blob.core.windows.net/ai-in-a-day/shared/' +
    'COVID19_Case_Surveillance_Data/COVID-19_Case_Surveillance_Public_Use_Data.csv')
df.head(10)

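Prepare the timestamp column: shift the report dates forward by 13 months and store them as `YYYY-MM-DD` strings in a new `datekey` column, matching the format required by the Metrics Advisor ingestion process.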

In [None]:
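# Shift the report dates forward by 13 calendar months.
# pd.to_timedelta(13, unit='M') is rejected by recent pandas releases because
# a month is not a fixed-length duration; DateOffset handles calendar months.
df['cdc_report_dt'] = pd.to_datetime(df['cdc_report_dt']) + pd.DateOffset(months=13)
# Date-only string key used to name and populate the daily files
df['datekey'] = df['cdc_report_dt'].dt.strftime('%Y-%m-%d')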
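
As a quick sanity check of the date handling, the sketch below applies a 13-month shift and the `%Y-%m-%d` formatting to a tiny hand-made frame. It uses `pd.DateOffset`, which is one way to express a calendar-month shift in pandas. The sample dates and the `sample` and `shifted` names are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical sample rows; only the column name comes from the dataset.
sample = pd.DataFrame({'cdc_report_dt': ['2020/03/15', '2020/12/31']})

# Parse the strings, then shift each date by 13 calendar months.
shifted = pd.to_datetime(sample['cdc_report_dt']) + pd.DateOffset(months=13)

# Format as the date-only string stored in the datekey column.
sample['datekey'] = shifted.dt.strftime('%Y-%m-%d')
print(sample['datekey'].tolist())
```

The day of the month is preserved by the shift, so 15 March 2020 becomes 15 April 2021.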

Group data by date, age group, and hospitalization status.

In [None]:
# Count the rows in each (date, age group, hospitalization status) combination
dfgroup = df.groupby(['datekey','age_group','hosp_yn']).size().to_frame('count')
dfgroup.head(10)
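
To see what the grouping produces, here is a minimal sketch on a few hand-made rows. The values and the `rows` and `counts` names are hypothetical; the column names match the ones used above.

```python
import pandas as pd

# Hypothetical case rows; the column names match the notebook.
rows = pd.DataFrame({
    'datekey':   ['2021-04-15', '2021-04-15', '2021-04-15', '2021-04-16'],
    'age_group': ['0 - 9 Years', '0 - 9 Years', '10 - 19 Years', '0 - 9 Years'],
    'hosp_yn':   ['No', 'No', 'Yes', 'No'],
})

# size() counts the rows in each (datekey, age_group, hosp_yn) combination.
counts = rows.groupby(['datekey', 'age_group', 'hosp_yn']).size().to_frame('count')
print(counts)
```

Only combinations that actually occur in the data get a row, and by default `groupby` drops rows where any grouping column is missing.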

Reset the hierarchical index produced by the group-by operation to flatten the dataset.

In [None]:
dfflat = dfgroup.reset_index()
dfflat.head(10)
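
The effect of flattening can be illustrated on a small frame with a three-level index. This is only a sketch: the `grouped` and `flat` names and the values are hypothetical, while the level names mirror the grouping columns above.

```python
import pandas as pd

# Hypothetical grouped counts with a three-level index, shaped like a groupby result.
index = pd.MultiIndex.from_tuples(
    [('2021-04-15', '0 - 9 Years', 'No'), ('2021-04-15', '10 - 19 Years', 'Yes')],
    names=['datekey', 'age_group', 'hosp_yn'],
)
grouped = pd.DataFrame({'count': [2, 1]}, index=index)

# reset_index() moves the index levels back into ordinary columns
# and replaces the index with a default 0..n-1 range.
flat = grouped.reset_index()
print(flat.columns.tolist())
```

After the reset, every row carries its own date, age group, and hospitalization status as plain columns, which is what makes filtering by `datekey` straightforward.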

Get the list of dates for which data is available in the original dataset.

In [None]:
dates = df['datekey'].unique()

Create the daily JSON files to be ingested by Metrics Advisor.

In [None]:
if os.path.exists('covid_age_hosp'):
    shutil.rmtree('covid_age_hosp')
    
os.mkdir('covid_age_hosp')

for row in dates:
    print(row)
    is_date = dfflat['datekey']==row
    df_date = dfflat[is_date]
    # datekey is already a string, so no date_format is needed
    # (to_json only accepts 'epoch' or 'iso' for that argument)
    resultJSON = df_date.to_json(orient='records')
    filename_processed_json =  f'covid_age_hosp/{row}.json'
    with open(filename_processed_json, 'w') as f:
        f.write(resultJSON)
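
To check the shape of a generated file, the following sketch serializes a hand-made single-day frame with `orient='records'` and parses the result back. The values and the `day`, `payload`, and `records` names are hypothetical; the column names match the flattened dataset.

```python
import json
import pandas as pd

# Hypothetical flattened rows for a single day.
day = pd.DataFrame({
    'datekey':   ['2021-04-15', '2021-04-15'],
    'age_group': ['0 - 9 Years', '10 - 19 Years'],
    'hosp_yn':   ['No', 'Yes'],
    'count':     [2, 1],
})

# orient='records' serializes each row as one JSON object inside a JSON array.
payload = day.to_json(orient='records')

# Parse the string back to confirm it is valid JSON with one object per row.
records = json.loads(payload)
print(records[0])
```

Each daily file is therefore a JSON array with one object per age group and hospitalization status combination.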

Upload the local folder containing the generated JSON files to the blob storage container.

In [None]:
blob_datastore.upload('./covid_age_hosp', 
                 target_path = '', 
                 overwrite = True, 
                 show_progress = True)