# 1: Validating and Importing Target Time Series Data

## Obtaining Your Data

A critical requirement to use Amazon Forecast is to have access to time-series (or meta) data for your selected use case. To learn more about time-series data:

1. [Wikipedia](https://en.wikipedia.org/wiki/Time_series)
1. [Towards Data Science Primer](https://towardsdatascience.com/the-complete-guide-to-time-series-analysis-and-forecasting-70d476bfe775)
1. [O'Reilly Book](https://www.amazon.com/gp/product/1492041653/ref=ppx_yo_dt_b_search_asin_title?ie=UTF8&psc=1)

In this POC, we are going to select a dataset from the UCI repository of machine learning datasets. This is an excellent tool for finding datasets for various problems. In this case, it is traffic data for a given section of interstate highway. More information on the dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume).

Your specific data may come from a DB export or an existing spreadsheet; the source really doesn't matter. Going forward, the files should be uploaded into this notebook and stored in CSV format.

To begin, the cell below will do the following:

1. Create a directory for the data files.
1. Download the sample data into the directory.
1. Extract the archive file into the directory.


> Putting `!` before a command runs it in a subshell instead of in the Python kernel.

In [None]:
data_dir = "data"
!rm -rf $data_dir
!mkdir -p $data_dir
!cd $data_dir && wget https://archive.ics.uci.edu/ml/machine-learning-databases/00492/Metro_Interstate_Traffic_Volume.csv.gz
!gunzip $data_dir/Metro_Interstate_Traffic_Volume.csv.gz

With the data downloaded, we now import the Pandas library as well as a few other data science tools:

1. [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html#) - the AWS SDK for Python
2. [pandas](https://pandas.pydata.org) - data analysis and manipulation framework operating on data frames
3. [numpy](https://numpy.org) - scientific computation tool
4. [matplotlib](https://matplotlib.org) - tool to produce plots, graphs, charts, etc.

In [None]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import uuid

Next, open the file with Pandas and take a look at the contents:

In [None]:
original_data = pd.read_csv(data_dir + '/Metro_Interstate_Traffic_Volume.csv')
original_data.head(5)

We can see a few things about the data:

* Holidays seem to be specified.
* There is a value for temp, rainfall, snowfall, and a few other weather metrics.
* The time series is hourly.
* Our value to predict is `traffic_volume`, the last column.

Amazon Forecast relies on a concept called the target-time-series:
- To start making predictions, it needs a timestamp, an item identifier, and a value.
- The timestamp is pretty self-explanatory, and the value to predict will be `traffic_volume`. Because this is a single time series, an arbitrary `item_id` of `1` will be applied later to all entries in the time series file.

The other attributes can serve as a basis for related time series components, when we get to that much later.

Amazon Forecast also works well to fill in gaps for the target-time-series, but not the related data.
So before we input our data and get a prediction, we should look to see where gaps are, and how we want to structure both inputs to address this issue. 

To get started, we will manipulate our starting data frame to determine the quality and consistency of this dataset.

In [None]:
target_df = original_data.copy()
target_df.plot()
print("Start Date: ", min(target_df['date_time']))
print("End Date: ", max(target_df['date_time']))

Interestingly at this point, we do not see obvious gaps in this plot, but we should still check a bit deeper to confirm this. The next cell gives some necessary information on the dataset size.

In [None]:
target_df.info()

In the cell above, we now see a range of October 2012 to nearly October 2018, almost 6 years of hourly data:
- Given there are around 8,760 hours in a year, we would expect to see roughly 52,000 data points.
- Instead, we see only 48,204.
- It looks like some data points are missing. Next, let us define the index, drop the duplicates, and see where that leaves us.
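
The expected count can also be worked out directly rather than estimated. The sketch below assumes the series runs hourly from midnight on 2012-10-02 to midnight on 2018-09-30 (the same bounds used for the index later in this notebook) and takes the 48,204 rows reported by `info()` as given.

```python
import pandas as pd

# Assumed bounds of the series (midnight to midnight)
start = pd.Timestamp("2012-10-02 00:00:00")
end = pd.Timestamp("2018-09-30 00:00:00")

# Number of hourly timestamps in the range, counting both endpoints
expected_hours = int((end - start) / pd.Timedelta(hours=1)) + 1

# Row count reported by target_df.info() before any clean-up
observed_rows = 48204

missing_at_least = expected_hours - observed_rows
print("Expected hourly points:", expected_hours)
print("Missing (at least):", missing_at_least)
```

Because the raw rows may still contain duplicates at this stage, the difference is best read as a lower bound on the number of missing hours.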

In [None]:
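# Parse the timestamps so the index is a proper DatetimeIndex
# (needed for the join against the hourly index and for date slicing later)
target_df['date_time'] = pd.to_datetime(target_df['date_time'])
target_df.set_index('date_time', inplace=True)
target_df = target_df.drop_duplicates(keep='first')
target_df.info()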

That change dropped us to 48,175 unique entries:
- Given this is traffic data, we could be dealing with a missing sensor, construction causing outages, or even severe weather damaging the recording equipment.
- Before we decide on how to fill any gaps, let us first take a look to see where they are, and how large the gaps themselves may be.

We will do this by creating a new data frame that covers the entire length of the dataset with no missing timestamps, then joining our data to it, which leaves `NaN` wherever anything is missing.

*Note:* `total_days` below records the number of days in the range. I cheated and used WolframAlpha to sort it out: https://www.wolframalpha.com/input/?i=days+from+2012-10-02+to+2018-09-30. The index itself is built from the start and end dates, so no `periods` value needs to be passed.

In [None]:
total_days = 2190
# Build the index first
idx = pd.date_range(start='10/02/2012', end='09/30/2018', freq='H')

In [None]:
full_df = pd.DataFrame(index=idx)
full_df.head(3)

In [None]:
print (full_df.index.min())
print (full_df.index.max())

In [None]:
# Now perform the join
full_historical_df = full_df.join(target_df, how='outer')
print (full_historical_df.index.min())
print (full_historical_df.index.max())

In [None]:
# Take a look at 10 random entries
full_historical_df.sample(10)

The sample **may** or **may not** show `NaN` values, depending on which rows were drawn. In this instance it did, but either way we will want to look for these `NaN` entries to confirm whether they exist and where they are.
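One way to look for missing entries without relying on a random sample is to count them and measure the runs of consecutive `NaN`s. The sketch below uses a small made-up hourly frame, not the traffic data; the same steps should apply to `full_historical_df['traffic_volume']`.

```python
import pandas as pd

# Toy hourly series with deliberate holes (hypothetical values)
idx = pd.date_range("2017-01-01", periods=8, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    {"traffic_volume": [10, 12, None, None, None, 15, None, 18]}, index=idx
)

missing = toy["traffic_volume"].isna()

# Give every run of consecutive equal values (missing or present) its own id
run_id = missing.ne(missing.shift()).cumsum()

# Keep only the missing rows and count how long each run is
gap_lengths = missing[missing].groupby(run_id[missing]).size()

n_missing = int(missing.sum())
longest_gap = int(gap_lengths.max())
print("Missing rows:", n_missing)
print("Longest gap (hours):", longest_gap)
```

A long maximum gap is a hint that forward filling would stretch a single stale value over a large window.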

At this point, we have done enough work to see where we may have any large portions of missing data. To that end, we can plot the data below and see any gaps that may crop up.

In [None]:
full_historical_df.plot()

This shows a large gap of missing data from late 2014 until mid-2016. If we simply filled in the previously known value, we would end up with a long stretch of data that does not reflect the real problem.

Before making any decisions, we will now step through each year and see what the gaps look like starting in 2013 as it is the first full year.

In [None]:
df_2013 = full_historical_df.loc['2013-01-01':'2013-12-31']
print (df_2013.index.min())
print (df_2013.index.max())
df_2013.plot()

In [None]:
df_2014 = full_historical_df.loc['2014-01-01':'2014-12-31']
print (df_2014.index.min())
print (df_2014.index.max())
df_2014.plot()

In [None]:
df_2015 = full_historical_df.loc['2015-01-01':'2015-12-31']
print (df_2015.index.min())
print (df_2015.index.max())
df_2015.plot()

In [None]:
df_2016 = full_historical_df.loc['2016-01-01':'2016-12-31']
print (df_2016.index.min())
print (df_2016.index.max())
df_2016.plot()

In [None]:
df_2017 = full_historical_df.loc['2017-01-01':'2017-12-31']
print (df_2017.index.min())
print (df_2017.index.max())
df_2017.plot()

In [None]:
df_2018 = full_historical_df.loc['2018-01-01':'2018-12-31']
print (df_2018.index.min())
print (df_2018.index.max())
df_2018.plot()

A few things to note here: we are clearly missing a large volume of data in 2014 and 2015, and there are some missing patches in 2013 as well. 2016 had spotty data initially, but 2017 and 2018 look pretty good.

Given that the data is hourly, we still have plenty of it within a single year, and an additional 10 months to use for broader validation if we choose to do that. 

To note, it seems approaches like DeepAR+ and Prophet work very well with > 1k measurements on a given time series. Assuming hourly data (24 measurements per day), that yields around 42 days before we have a solid base of data. Learning over an entire year should be plenty.

Also, we need to think about a Forecast horizon or how far into the future we are going to predict at once. Forecast currently limits us to 500 intervals of whatever granularity we have selected. For this exercise, we will keep the data hourly and predict 480 hours into the future, or exactly 20 days.
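The arithmetic behind these two paragraphs is small enough to check in a few lines. This is only a sketch: the constant names are made up for illustration, and the 1,000-point and 500-interval figures are taken from the text above.

```python
import math

POINTS_PER_DAY = 24            # hourly data
MIN_POINTS = 1000              # rough "solid base" of measurements
MAX_HORIZON_INTERVALS = 500    # horizon limit quoted above
FORECAST_HORIZON = 480         # hours we plan to predict

# Days of hourly data needed to collect at least MIN_POINTS measurements
days_for_min_points = math.ceil(MIN_POINTS / POINTS_PER_DAY)

# Length of the forecast horizon in days
horizon_days = FORECAST_HORIZON / POINTS_PER_DAY

# The chosen horizon has to stay within the limit
horizon_ok = FORECAST_HORIZON <= MAX_HORIZON_INTERVALS

print(days_for_min_points, horizon_days, horizon_ok)
```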

## Building Data Files

Knowing that our data frame `full_historical_df` covers the entire time period we care about, we start there and reduce it to 2017 onward. Then we will use forward fill to plug any remaining holes before splitting the data into the files built below.

More info on techniques to patch missing information can be found [here](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.fillna.html)

The risk of filling in values like this is that smoothing out the data may cause our predictions to resemble the smoothed curve more than our actual historical data. This is why we selected 2017 to 2018, based on the lack of large gaps in the data.

In [None]:
# Create a copy
target_df = full_historical_df.copy()
# Slice to only 2017 onward
target_df = target_df.loc['2017-01-01':]
# Validate the dates
print (target_df.index.min())
print (target_df.index.max())

Forward fill (`ffill`) fills each missing value with the value from the same column in the previous row.
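A tiny made-up example shows the behavior, including two details that are easy to miss: `ffill()` returns a new data frame rather than changing the original, and a missing value in the very first row stays missing because there is nothing before it to copy.

```python
import pandas as pd

# Hypothetical values, not the traffic data
toy = pd.DataFrame({"traffic_volume": [None, 100.0, None, None, 130.0]})

# ffill() returns a filled copy; `toy` itself is left untouched
filled = toy.ffill()

print(filled["traffic_volume"].tolist())
print("NaNs before:", int(toy["traffic_volume"].isna().sum()))
print("NaNs after:", int(filled["traffic_volume"].isna().sum()))
```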

> Read [more](https://www.geeksforgeeks.org/python-pandas-dataframe-ffill/)

In [None]:
# Fill in any missing data with the method ffill
# ffill() returns a new data frame, so assign the result back
target_df = target_df.ffill()
target_df.head()

At this point, we have all the data needed to make our target time series file and dataset. While we are doing this, we will also make a validation file for later use as well.

### Building The Target Time Series File

In [None]:
target_time_series_df = target_df.copy()
target_time_series_df = target_time_series_df.loc['2017-01-01':'2017-12-31']
# Validate the date range
print(target_time_series_df.index.min())
print(target_time_series_df.index.max())

`item_id` is the key that identifies an item; it is neither the target nor the timestamp. Here, we use the same identifier for every row because we only have one type of entity (traffic volume).

If, for example, we had to predict traffic volume at several different metro stations, each station would need its own item identifier.
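As a rough sketch of that multi-item case, the frames for each station can be labeled and stacked into one long table. The station names and values below are invented for illustration.

```python
import pandas as pd

idx = pd.date_range("2017-01-01", periods=3, freq=pd.Timedelta(hours=1))

# Hypothetical per-station series
stations = {
    "station_a": pd.DataFrame({"traffic_volume": [100, 120, 90]}, index=idx),
    "station_b": pd.DataFrame({"traffic_volume": [40, 55, 60]}, index=idx),
}

frames = []
for item_id, frame in stations.items():
    labelled = frame.copy()
    labelled["item_id"] = item_id  # one identifier per station
    frames.append(labelled)

# One long table: timestamp index, value, item identifier
combined = pd.concat(frames)
print(combined)
```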

In [None]:
# Restrict the columns to timestamp, traffic_volume
target_time_series_df = target_time_series_df[['traffic_volume']]
# Add in item_id
target_time_series_df['item_ID'] = "1"
# Validate the structure
target_time_series_df.head()

In [None]:
# With the data in a great state, save it off as a CSV
target_time_series_filename = "target_time_series.csv"
target_time_series_path = data_dir + "/" + target_time_series_filename
target_time_series_df.to_csv(target_time_series_path, header=False)

### Building The Validation File

This is the last file we need to build before getting started with Forecast itself. It has the same structure as our target-time-series file but covers only 2018, so it includes none of the historical data used for training.
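The split relies on date-string slicing of a `DatetimeIndex`, where an end label such as `'2017-12-31'` includes the whole of that day. A small sketch with made-up values around the year boundary shows that the two slices do not overlap:

```python
import pandas as pd

# Four hypothetical hourly rows straddling New Year
idx = pd.date_range("2017-12-31 22:00:00", periods=4, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"traffic_volume": [1, 2, 3, 4]}, index=idx)

train = toy.loc["2017-01-01":"2017-12-31"]   # includes all of Dec 31
validation = toy.loc["2018-01-01":]          # everything from Jan 1 onward

no_overlap = train.index.max() < validation.index.min()
print(len(train), len(validation), no_overlap)
```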

In [None]:
validation_time_series_df = target_df.copy()
validation_time_series_df = validation_time_series_df.loc['2018-01-01':]
# Validate the date range
print(validation_time_series_df.index.min())
print(validation_time_series_df.index.max())

In [None]:
# Restrict the columns to timestamp, traffic_volume
validation_time_series_df = validation_time_series_df[['traffic_volume']]
# Add in item_id
validation_time_series_df['item_ID'] = "1"
# Validate the structure
validation_time_series_df.head()

In [None]:
# With the data in a great state, save it off as a CSV
validation_time_series_filename = "validation_time_series.csv"
validation_time_series_path = data_dir + "/" + validation_time_series_filename
validation_time_series_df.to_csv(validation_time_series_path, header=False)

## Getting Started With Forecast

Now that all of the required data to get started exists, our next step is to build the dataset groups and datasets required for our problem. Inside Amazon Forecast a DatasetGroup is an abstraction that contains all the datasets for a particular collection of Forecasts. There is no information sharing between DatasetGroups, so if you'd like to try out various alternatives to the schemas we create below, you could create a new DatasetGroup and make your changes inside its corresponding Datasets.

The order of the process below will be as follows:

1. Create a DatasetGroup for our POC.
1. Create a `Target-Time-Series` Dataset.
1. Attach the Dataset to the DatasetGroup.
1. Import the data into the Dataset.

Later you can use the other notebooks to build Predictors based on this information or to add related time-series data as well.

The cell immediately below defines a few core aspects of our Dataset Group and info on our data. For example, the timestamp format, the project name, and how frequent our time series data is.

In [None]:
DATASET_FREQUENCY = "H" 
TIMESTAMP_FORMAT = "yyyy-MM-dd hh:mm:ss"

project = 'forecast_poc_'+str(uuid.uuid4()).replace("-", "_")
datasetName= project+'_ds'
datasetGroupName= project +'_dsg_'

Now, using the metadata stored on this SageMaker Notebook instance, determine the region we are operating in. If you are using a Jupyter Notebook outside of SageMaker, simply define `region` as the string that indicates the region you would like to use for Forecast and S3.
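For notebooks that may run in either environment, one option is a small helper that reads the SageMaker metadata file when it exists and otherwise falls back to a default. This is only a sketch: `detect_region` and `region_from_arn` are hypothetical helper names, and the `AWS_DEFAULT_REGION` fallback and `us-east-1` default are assumptions to adjust to your setup.

```python
import json
import os


def region_from_arn(arn):
    """Return the region field of an ARN (arn:partition:service:region:...)."""
    return arn.split(":")[3]


def detect_region(metadata_path="/opt/ml/metadata/resource-metadata.json",
                  default="us-east-1"):
    """Use the SageMaker metadata file if present, else an env var or default."""
    if os.path.exists(metadata_path):
        with open(metadata_path) as notebook_info:
            return region_from_arn(json.load(notebook_info)["ResourceArn"])
    return os.environ.get("AWS_DEFAULT_REGION", default)


region = detect_region()
print(region)
```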


In [None]:
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)

Configure your AWS API clients:

In [None]:
session = boto3.Session(region_name=region)
forecast = session.client(service_name='forecast')
forecast_query = session.client(service_name='forecastquery')

Create the Dataset Group. This is the largest abstraction when using Forecast. There is no information sharing between Dataset Groups, so if you want to try out new schemas or completely different datasets for a problem, this is a great isolation layer to use.

In [None]:
# Create the DatasetGroup
create_dataset_group_response = forecast.create_dataset_group(
    DatasetGroupName=datasetGroupName,
    Domain="CUSTOM",
)
datasetGroupArn = create_dataset_group_response['DatasetGroupArn']

In [None]:
forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)

Assuming you made no initial schema changes, the cell below should work as is. If you have made any alterations, update the cell accordingly, then execute it.

The schema defines the shape of the data, where:
- `AttributeName` is the column name.
- `AttributeType` is the data type. If an incorrect type is provided, Forecast returns an error.

> Read more about allowed values [here](https://docs.aws.amazon.com/forecast/latest/dg/API_SchemaAttribute.html)
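
Since the CSV files in this notebook are written without a header row, the column order has to line up with the schema. A quick local check before uploading can catch mismatches early. The sketch below uses a two-row inline stand-in for the CSV (made-up values) and the same three attributes as the schema in this notebook; it is a sanity check under those assumptions, not a replacement for Forecast's own validation.

```python
import io

import pandas as pd

schema = {
    "Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "target_value", "AttributeType": "float"},
        {"AttributeName": "item_id", "AttributeType": "string"},
    ]
}

# Hypothetical stand-in for a headerless target time series CSV
csv_text = "2017-01-01 00:00:00,1848.0,1\n2017-01-01 01:00:00,1806.0,1\n"

names = [attr["AttributeName"] for attr in schema["Attributes"]]
df = pd.read_csv(io.StringIO(csv_text), header=None, names=names,
                 dtype={"item_id": str})

# Raises if any timestamp does not match the expected layout
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%Y-%m-%d %H:%M:%S")

column_count_matches = df.shape[1] == len(names)
target_is_numeric = pd.api.types.is_numeric_dtype(df["target_value"])
print(column_count_matches, target_is_numeric)
```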

In [None]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"target_value",
         "AttributeType":"float"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }
   ]
}

Inside every DatasetGroup you can have 3 types of datasets:

1. Target Time Series
1. Related Time Series
1. Item Metadata

In this guide we are really only focusing on the target-time-series bit. The cells below will create this container for you and then add it to your DatasetGroup.

In [None]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=datasetName,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = schema
)

Here, we are using the `CUSTOM` domain because our subject field doesn't fit well into any of the predefined ones:
- Retail
- Inventory planning
- EC2 capacity
- Work force
- Web traffic
- Metrics

The choice of domain defines the required and optional features in the data as well as the hyperparameters.
[Hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) are settings of the model and its training process, like the learning rate.

Read more about different allowed [domains](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html).

In [None]:
target_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=target_datasetArn)

In [None]:
# Attach the Dataset to the Dataset Group:
forecast.update_dataset_group(DatasetGroupArn=datasetGroupArn, DatasetArns=[target_datasetArn])

We will also need a Role to interact with S3 and Forecast on our behalf going forward. This cell creates that role. Note that it does sleep for 60 seconds to ensure that the process has completed and all permissions have propagated before going forward.

In [None]:
iam = boto3.client("iam")

role_name = "ForecastRolePOC_"+str(uuid.uuid4()).replace("-", "_")
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "forecast.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonForecastFullAccess grants access to the Forecast APIs.
# S3 access is attached separately below. If you would rather not grant AmazonS3FullAccess,
# consider attaching a policy that only provides read access to your bucket, or AmazonS3ReadOnlyAccess.
policy_arn = "arn:aws:iam::aws:policy/AmazonForecastFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)

At this point, the next thing to do is import a file into Amazon Forecast. However, we do not yet have anything in S3, so we will create a bucket and upload our target file there. Note this is only the target file.
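The bucket name in the next cell is assembled from the account ID, a UUID, and a fixed suffix. S3 bucket names must be 3 to 63 characters long and use only lowercase letters, digits, dots, and hyphens, so it can be worth checking a generated name before calling the API. The sketch below uses a placeholder account ID and covers only those basic rules.

```python
import re
import uuid

account_id = "123456789012"  # placeholder; the notebook reads the real one from STS
bucket_name = account_id + "-" + str(uuid.uuid4()) + "-forecastpoc"

# 3-63 chars, lowercase letters/digits/dots/hyphens, starting and ending
# with a letter or digit
is_valid = re.fullmatch(r"[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]", bucket_name) is not None

print(bucket_name, len(bucket_name), is_valid)
```

With a 12-digit account ID the name comes to 61 characters, which leaves very little room for a longer suffix.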

In [None]:
print(region)
s3 = boto3.client('s3')
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-" + str(uuid.uuid4()) + "-forecastpoc"
print(bucket_name)
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)

In [None]:
# Upload Target File
boto3.Session().resource('s3').Bucket(bucket_name).Object(target_time_series_filename).upload_file(target_time_series_path)
target_s3DataPath = "s3://"+bucket_name+"/"+target_time_series_filename

At this point your data is formatted correctly for Forecast and exists within S3. The last thing to do is to import it so you can get started actually generating models!

In [None]:
# Finally, we can import the dataset
datasetImportJobName = 'DSIMPORT_JOB_TARGET_POC'
ds_import_job_response=forecast.create_dataset_import_job(
    DatasetImportJobName=datasetImportJobName,
    DatasetArn=target_datasetArn,
    DataSource= {
      "S3Config" : {
         "Path":target_s3DataPath,
         "RoleArn": role_arn
      } 
    },
    TimestampFormat=TIMESTAMP_FORMAT
)

In [None]:
ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_import_job_arn)

The cell below will run and poll every 30 seconds until the import process has completed. From there we will be able to view the metrics on the data and see that it is valid and ready for use.

In [None]:
while True:
    dataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']
    print(dataImportStatus)
    if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':
        sleep(30)
    else:
        break

Once the import shows a state of `ACTIVE` we are then ready to evaluate the data that exists within the system and call the importing process complete.
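If you reuse this polling pattern for other long-running jobs, it can be wrapped in a helper that also gives up after a timeout. This is a sketch, not part of the Forecast SDK: `wait_for_status` is a hypothetical name, and the status source is passed in as a function so the helper can be tried without any AWS calls.

```python
import time


def wait_for_status(get_status, done=("ACTIVE", "CREATE_FAILED"),
                    interval=30, timeout=3600, sleep=time.sleep):
    """Call get_status() every `interval` seconds until it returns a value in `done`."""
    waited = 0
    while True:
        status = get_status()
        print(status)
        if status in done:
            return status
        if waited >= timeout:
            raise TimeoutError("Gave up after {} seconds".format(waited))
        sleep(interval)
        waited += interval
```

With the variables from this notebook, it could be called as `wait_for_status(lambda: forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status'])`.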

## Evaluating the Target Time Series Data

First let us take a look at the information provided in our target time series file:

In [None]:
# Validate the date range
print (target_time_series_df.index.min())
print (target_time_series_df.index.max())

In [None]:
# Take a look at high level metrics:
target_time_series_df.info()

There are exactly 10,642 entries in this file with no null values at all. Let us now look at the metrics from the import.

In [None]:
forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)

At long last, we see the same metrics from our import that we saw from our data frame. From here, we can now consider our work on target-time-series done. 

If you are running the POC process, this is a great time to now explore sorting out the related data bits as well. If you are not, you can move on to just building Predictors.

The final cell below will use the store function of Jupyter to save off a few variables for use in other notebooks.

In [None]:
%store full_historical_df
%store target_time_series_df
%store validation_time_series_df
%store datasetName
%store bucket_name
%store datasetGroupName
%store datasetGroupArn
%store target_datasetArn
%store role_arn
%store region
%store target_time_series_filename
%store target_df
%store full_df
%store data_dir
%store DATASET_FREQUENCY
%store TIMESTAMP_FORMAT
%store project