# Machine Learning Engineer Nanodegree: Forecasting COVID-19 Cases – A `Time Series Forecasting` Model

### Domain Background
On December 31, 2019, the World Health Organization (WHO) was informed of an outbreak of “pneumonia of unknown cause” detected in Wuhan City, Hubei Province, China. Identified as coronavirus disease 2019, it quickly came to be known as COVID-19 and has resulted in an ongoing global pandemic. As of 20 June 2020, more than 8.74 million cases have been reported across 188 countries and territories, resulting in more than 462,000 deaths. More
than 4.31 million people have recovered.[^1]

In response to this ongoing public health emergency, Johns Hopkins University (JHU), a private research university in Maryland, USA, developed an interactive web-based dashboard hosted by their Center for Systems Science and Engineering (CSSE). The dashboard visualizes and tracks reported cases in real-time, illustrating the location and number of confirmed COVID-19 cases, deaths and recoveries for all affected countries. It is used by researchers, public health authorities, news
agencies and the general public. All the data collected and displayed is made freely available in a [GitHub repository](https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data).

### Problem Statement
This project seeks to forecast the number of people infected with COVID-19 and the number of deaths it causes over a 14-day horizon, based on
historical data from JHU. I will be using the Amazon SageMaker DeepAR forecasting algorithm, a supervised learning algorithm for forecasting
scalar (one-dimensional) time series using recurrent neural networks (RNNs) to produce both point and probabilistic forecasts[^2].
DeepAR is an underutilized approach in this area.[^3] The dataset contains hundreds of related time series, a setting in which DeepAR can outperform
classical forecasting methods including, but not limited to, autoregressive integrated moving average (ARIMA), exponential smoothing (ETS), and Time Series
Forecasting with Linear Learner. I will be using [DeepAR](https://github.com/sahussain/ML_SageMaker_Studies/blob/master/Time_Series_Forecasting/Energy_Consumption_Solution.ipynb) and [Time Series Forecasting with Linear Learner](https://github.com/awslabs/amazon-sagemaker-examples/blob/80333fd4632cf6d924d0b91c33bf80da3bdcf926/introduction_to_applying_machine_learning/linear_time_series_forecast/linear_time_series_forecast.ipynb).

-----------
[^1]: [COVID-19 Dashboard](https://systems.jhu.edu/research/public-health/ncov/) by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). ArcGIS. Johns Hopkins University. Retrieved 20 June 2020.

[^2]: [DeepAR Forecasting Algorithm](https://docs.aws.amazon.com/sagemaker/latest/dg/deepar.html). Amazon Web Services. Retrieved 20 June 2020.

[^3]: [Time series prediction](https://www.telesens.co/2019/06/08/time-series-prediction/). Telesens. Retrieved 20 June 2020.

-----------

## Initialization

### Loading in the resources.

### Download function

In [None]:
import os
import sys
from urllib.request import urlretrieve

def progress_report_hook(count, block_size, total_size):
    # Print progress every 50 blocks to avoid flooding the output
    mb = int(count * block_size // 1e6)
    if count % 50 == 0:
        sys.stdout.write("\r{} MB downloaded".format(mb))
        sys.stdout.flush()

def download(DATA_HOST, DATA_PATH, FILE_NAME, OVERRIDE=1, reporthook=progress_report_hook):
    # Download when forced, or when no local copy exists yet
    if OVERRIDE or not os.path.exists(FILE_NAME):
        print("Downloading dataset; this can take a few minutes depending on your connection")
        urlretrieve(DATA_HOST + DATA_PATH + FILE_NAME, FILE_NAME, reporthook=reporthook)
    else:
        print("File found, skipping download")

### Clean-up functions

In [None]:
import boto3

#Convenience function (from Udacity) to delete prediction endpoints once we're done with them
def delete_endpoint(predictor):
    try:
        boto3.client('sagemaker').delete_endpoint(EndpointName=predictor.endpoint)
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        # Most likely the endpoint no longer exists
        print('Already deleted: {}'.format(predictor.endpoint))

-----------

## Load and Explore the Data

### Downloading data

In [None]:
DATA_HOST = "https://raw.githubusercontent.com"
DATA_PATH = "/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
FILE_NAME = "time_series_covid19_confirmed_US.csv"

In [None]:
download(DATA_HOST, DATA_PATH, FILE_NAME, OVERRIDE=0)

### Loading data into pandas

In [None]:
import pandas as pd

csv_file = 'time_series_covid19_confirmed_US.csv'
covid_df = pd.read_csv(csv_file)

### Examining the Data

In [None]:
covid_df.head()



## Datasets and Inputs
The datasets are accessed from files provided by the JHU GitHub
repository [time_series_covid19_confirmed_US.csv](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv)

The file has the following columns:
* UID - UID = 840 (country code3) + 000XX (state FIPS code), ranging from 84000001 to 84000056.
* iso2 - Officially assigned two-character country code identifier (US, CA, ...).
* iso3 - Officially assigned three-character country code identifier (USA, CAN, ...).
* code3 - Numeric country code (USA = 840).
* FIPS - Federal Information Processing Standards code that uniquely identifies counties within the USA.
* Admin2 - County name. US only.
* Province_State - The name of the state within the USA.
* Country_Region - The name of the country (US).
* Combined_Key - Province_State + Country_Region
* Population - Population
* Case counts - The remaining columns, one column per day.


In [None]:
#Getting column names
list(covid_df.columns)

### Getting State list
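
One possible way to list the states, assuming the `Province_State` column described earlier. The frame below is a toy stand-in for `covid_df` with made-up rows, so the snippet runs without the download.

```python
import pandas as pd

# Toy frame mimicking the JHU layout; the rows are illustrative, not real data
sample_df = pd.DataFrame({
    "Province_State": ["Alabama", "Alabama", "Alaska"],
    "Admin2": ["Autauga", "Baldwin", "Anchorage"],
})

# unique() drops repeated state names; sorted() gives a stable alphabetical list
states = sorted(sample_df["Province_State"].unique())
print(states)
```

On the real data the same expression would be applied to `covid_df["Province_State"]`.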

In [None]:
covid_df.describe()

### Getting rid of unnecessary columns: we only need the 'Combined_Key', 'Date', 'Case' columns
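
A minimal sketch of this step, assuming the wide layout with one column per day. The frame and its counts are invented for illustration, and treating every non-metadata column as a date column is an assumption about the file rather than the notebook's confirmed method.

```python
import pandas as pd

# Toy wide-format frame in the JHU layout; the counts are made up for illustration
wide_df = pd.DataFrame({
    "UID": [84001001, 84001003],
    "Province_State": ["Alabama", "Alabama"],
    "Combined_Key": ["Autauga, Alabama, US", "Baldwin, Alabama, US"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 0],
    "1/24/20": [3, 2],
})

# Assumption: every column that is not metadata is a daily count column
meta_cols = ["UID", "Province_State", "Combined_Key"]
date_cols = [c for c in wide_df.columns if c not in meta_cols]

# melt() turns one column per day into one row per (county, day),
# keeping only Combined_Key plus the new Date and Case columns
long_df = wide_df.melt(
    id_vars=["Combined_Key"],
    value_vars=date_cols,
    var_name="Date",
    value_name="Case",
)
print(long_df.shape)
```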

### Convert Date into time-series
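
A sketch of the conversion, assuming the `Date` values are the original column headers in month/day/two-digit-year form (for example `1/22/20`). The sample rows are invented.

```python
import pandas as pd

# Toy long-format frame; the counts are made up for illustration
long_df = pd.DataFrame({
    "Combined_Key": ["Autauga, Alabama, US"] * 3,
    "Date": ["1/22/20", "1/23/20", "1/24/20"],
    "Case": [0, 1, 3],
})

# Parse the header-style strings into proper timestamps
long_df["Date"] = pd.to_datetime(long_df["Date"], format="%m/%d/%y")

# A sorted DatetimeIndex makes slicing and resampling by date straightforward
ts_df = long_df.set_index("Date").sort_index()
print(ts_df.index.min(), ts_df.index.max())
```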

### Handling Missing Values

In [None]:
covid_df.isna().sum()

### Now we have a clean dataset

In [None]:
covid_df.head()

In [None]:
print(covid_df.shape)

In [None]:
covid_df.describe()

-----

## Plotting the Data

-----

## Analyzing the data

### Cumulative cases

### Total cases

### New cases
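
Assuming the source counts are cumulative, daily new cases can be derived by differencing consecutive days. The series below is invented, and clipping negative differences to zero is one possible way to handle downward revisions, not necessarily the approach used here.

```python
import pandas as pd

# Hypothetical cumulative counts for a single county
cumulative = pd.Series(
    [0, 1, 3, 3, 7],
    index=pd.date_range("2020-01-22", periods=5, freq="D"),
    name="Case",
)

# diff() gives the day-over-day change; the first day has no predecessor,
# so it is filled with its own cumulative count
new_cases = cumulative.diff().fillna(cumulative.iloc[0]).astype(int)

# Assumption: downward revisions in the source are clipped to zero
new_cases = new_cases.clip(lower=0)
print(new_cases.tolist())  # [0, 1, 2, 0, 4]
```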

-----

## Model Design & Testing
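
DeepAR reads training data as JSON Lines, one object per series with a `start` timestamp and a `target` array. The sketch below shows one way to prepare a series for the 14-day horizon from the problem statement by holding out the last 14 points for testing. The helper name `to_deepar_json` and the sample values are hypothetical.

```python
import json

PREDICTION_LENGTH = 14  # forecast horizon in days, from the problem statement

def to_deepar_json(start, target):
    """Serialize one series as a DeepAR JSON Lines record (hypothetical helper)."""
    return json.dumps({"start": start, "target": list(target)})

# Hypothetical cumulative counts for one county over 30 days
full_series = [i * 2 for i in range(30)]
start = "2020-01-22 00:00:00"

# The training series omits the final 14 days; the test series keeps them
# so that forecasts can be scored against the held-out tail
train_line = to_deepar_json(start, full_series[:-PREDICTION_LENGTH])
test_line = to_deepar_json(start, full_series)
print(train_line)
```

Each county would contribute one such line to the training file and one to the test file.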

### Total cases

### New cases