# 3: Validating and Importing Related Time Series Data

## Obtaining Your Data

This notebook picks up where you left off with your target time series data. In this particular example, one master file contained both the target and the related time series information; that may or may not be the case for your problem. The goal here is to produce a file that contains the following two required attributes:

1. Timestamp - Must use the same format and cover the same total range as the target time series data, and must also extend through the dates you want to forecast.
1. Item_ID - Must be present for every timestamp of each item in your time series dataset.
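
Both requirements can be checked before anything is uploaded. The sketch below is one possible way to do that with pandas; the function name `check_related_coverage`, its parameters, and the toy frames are hypothetical and not part of the original notebook. It assumes both frames carry `timestamp` and `item_id` columns and that the data arrives at a fixed interval.

```python
import pandas as pd


def check_related_coverage(target, related, horizon_steps,
                           step=pd.Timedelta(hours=1)):
    """Return a list of problems; an empty list means both checks passed.

    Assumes both frames have `timestamp` and `item_id` columns and that
    the data arrives at a fixed interval of `step`.
    """
    problems = []
    start = target["timestamp"].min()
    # Related data must reach `horizon_steps` past the last target point.
    end = target["timestamp"].max() + horizon_steps * step
    expected = pd.date_range(start, end, freq=step)
    for item, group in related.groupby("item_id"):
        missing = expected.difference(pd.DatetimeIndex(group["timestamp"]))
        if len(missing) > 0:
            problems.append(f"item {item}: {len(missing)} missing timestamps")
    # Items present in the target but absent from the related data.
    for item in sorted(set(target["item_id"]) - set(related["item_id"])):
        problems.append(f"item {item}: absent from related data")
    return problems


# Made-up frames: 3 hourly target points and a 2-step horizon.
hourly = pd.Timedelta(hours=1)
target = pd.DataFrame(
    {"timestamp": pd.date_range("2018-01-01", periods=3, freq=hourly),
     "item_id": "1"}
)
related = pd.DataFrame(
    {"timestamp": pd.date_range("2018-01-01", periods=5, freq=hourly),
     "item_id": "1"}
)
print(check_related_coverage(target, related, horizon_steps=2))
```

Returning a list of messages rather than raising on the first problem makes it easier to see every gap in one pass.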

In addition to those attributes, we are looking for variables that change over time and that have some influence on what we want to predict, which here is traffic volume.

Again, the data was already bundled together for us in this sample, so we will skip obtaining it a second time, but that is where you would start otherwise.

With the data ready to go, skip the blank cell (feel free to add to it if you need to manipulate your own data) and execute the cells that handle our imports and retrieve our stored values from the previous notebook.


In [None]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time
import pprint
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter
import matplotlib.dates as mdates
import uuid
%store -r

## Building The Related Time Series File

The challenge here is to make sure that we leave no entries with NaN values; otherwise the service will throw an error when building a Predictor. Every value must be present in order to make assumptions about its overall impact.
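
One thing to keep in mind is that a forward fill cannot repair a gap at the very start of the index, because there is no earlier value to carry forward. The sketch below shows one way to handle that by adding a back fill as a fallback. The helper name `fill_gaps` and the sample frame are made up for illustration; this is not the notebook's own method.

```python
import numpy as np
import pandas as pd


def fill_gaps(df):
    """Replace empty strings with NaN, then forward- and back-fill.

    Forward fill cannot repair a gap at the very start of the index,
    so a back fill is applied afterwards as a fallback.
    """
    cleaned = df.replace("", np.nan)
    return cleaned.ffill().bfill()


# Made-up sample: a leading NaN and an empty string in the middle.
sample = pd.DataFrame(
    {"temp": [np.nan, 280.5, np.nan], "weather": ["clear", "", "rain"]},
    index=pd.date_range("2017-01-01", periods=3, freq=pd.Timedelta(hours=1)),
)
filled = fill_gaps(sample)
print(filled)
print(filled.isnull().any().any())
```

Back filling borrows values from later timestamps, so whether it is acceptable depends on your data; dropping the leading rows is an alternative.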

In [None]:
related_time_series_df = target_df.copy()
# dropna() returns a new DataFrame, so the result has to be assigned back
related_time_series_df = related_time_series_df.dropna()
related_time_series_df = full_df.join(related_time_series_df, how='outer')
cols = related_time_series_df.columns.tolist()
related_time_series_df[cols] = related_time_series_df[cols].replace('', np.nan).ffill()
related_time_series_df = related_time_series_df.loc['2017-01-01':]
print(related_time_series_df.index.min())
print(related_time_series_df.index.max())

We can now see that the data runs from the start of 2017, matching our target time series, to the end of our known data in 2018. We have not yet defined a forecast horizon, but it is important to note here that the related data needs to cover that time span as well. To spoil later work, our horizon is 480 hours, or 20 days, which leaves plenty of room given 9 months of validation data.
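
The horizon arithmetic can be sketched with the standard library. The helper `required_related_end` and the example timestamp are hypothetical; only the 480-hour horizon comes from the text above.

```python
from datetime import datetime, timedelta

FORECAST_HORIZON_HOURS = 480  # value quoted in the text above


def required_related_end(last_target_timestamp, horizon_hours):
    """Last timestamp the related data must cover for a given horizon."""
    return last_target_timestamp + timedelta(hours=horizon_hours)


# Hypothetical last target timestamp, for illustration only.
last_target = datetime(2018, 1, 1, 0, 0)
needed_end = required_related_end(last_target, FORECAST_HORIZON_HOURS)
print(FORECAST_HORIZON_HOURS / 24)  # 20.0 days
print(needed_end)                   # 2018-01-21 00:00:00
```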

Lastly, to finish prepping the base data set, we validate that there are no blanks or NaNs. The cell below should return an empty DataFrame.

In [None]:
related_time_series_df[related_time_series_df.isnull().any(axis=1)]

### Look at the columns and decide what we should keep:


In [None]:
related_time_series_df.sample(3)

A few things to note here:

* Holidays are not needed: given this data is in the US, we can just use the Holidays [feature](https://docs.aws.amazon.com/forecast/latest/dg/API_SupplementaryFeature.html) within Forecast.
* Weather description seems to have more variety, so we keep it.
* Traffic volume will be removed here because it is the target value, not related data.
* We still need to add back the item_id field.

This leaves us with the following schema:

* `timestamp` - The Index
* `temp` - float
* `rain_1h` - float
* `snow_1h` - float
* `clouds_all` - float
* `weather_description` - string
* `item_ID` - string

The cell below will build that file for us.


In [None]:
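# Restrict the columns to the weather fields (the timestamp stays as the index).
# .copy() avoids a SettingWithCopyWarning when item_ID is added below.
related_time_series_df = related_time_series_df[['temp', 'rain_1h', 'snow_1h', 'clouds_all', 'weather_description']].copy()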
# Add in item_id
related_time_series_df['item_ID'] = "1"
# Validate the structure
related_time_series_df.head()


In [None]:
# Save it off as a file:
related_time_series_filename = "related_time_series.csv"
related_time_series_path = data_dir + "/" + related_time_series_filename
related_time_series_df.to_csv(related_time_series_path, header=False)
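
Because the file is written without a header, a quick read-back is a cheap way to confirm that it has the expected number of columns and no empty fields. The following is a minimal sketch under that assumption; `validate_headerless_csv` and the sample rows are made up, and an in-memory buffer stands in for the real file path.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = 7  # timestamp + 5 weather fields + item_id


def validate_headerless_csv(source, expected_columns):
    """Load a headerless CSV and check its shape and completeness."""
    frame = pd.read_csv(source, header=None)
    if frame.shape[1] != expected_columns:
        raise ValueError(
            f"expected {expected_columns} columns, found {frame.shape[1]}"
        )
    if frame.isnull().any().any():
        raise ValueError("file contains empty fields")
    return frame


# Made-up rows in the same column order as the file written above.
sample_csv = io.StringIO(
    "2017-01-01 00:00:00,269.75,0.0,0.0,58,sky is clear,1\n"
    "2017-01-01 01:00:00,269.95,0.0,0.0,40,scattered clouds,1\n"
)
checked = validate_headerless_csv(sample_csv, EXPECTED_COLUMNS)
print(checked.shape)
```

In the notebook you would pass `related_time_series_path` instead of the in-memory buffer.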

## Adding Related Data to the DatasetGroup

Next, we are going to create a related-time-series dataset, then add it to our dataset group and finally import our information and validate that it looks good. We will also delete this dataset import after we are done so that the first models do not yet receive any extra info from the related data.

You can, of course, choose not to delete it and get started right away with models that use the related data.

In [None]:
session = boto3.Session(region_name=region)
forecast = session.client(service_name='forecast')
forecast_query = session.client(service_name='forecastquery')

In [None]:
# Upload Related File
boto3.Session().resource('s3').Bucket(bucket_name).Object(related_time_series_filename).upload_file(related_time_series_path)
related_s3DataPath = "s3://"+bucket_name+"/"+related_time_series_filename

In [None]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
related_schema = {
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"temperature",
         "AttributeType":"float"
      },
      {
         "AttributeName":"rain_1h",
         "AttributeType":"float"
      },
      {
         "AttributeName":"snow_1h",
         "AttributeType":"float"
      },
      {
         "AttributeName":"clouds_all",
         "AttributeType":"float"
      },
      {
         "AttributeName":"weather",
         "AttributeType":"string"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }
   ]
}
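
Since the CSV has no header, the schema is matched to the data purely by position, and the attribute names do not have to equal the DataFrame column names. The sketch below illustrates a positional check of a schema against a frame. The function `schema_problems` and the toy inputs are hypothetical, and it assumes the frame's index holds the timestamp.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def schema_problems(schema, frame):
    """Compare a schema with a frame position by position.

    Assumes the frame's index holds the timestamp, so the first schema
    attribute maps to the index and the rest map to the columns in order.
    """
    attributes = schema["Attributes"]
    problems = []
    if attributes[0]["AttributeType"] != "timestamp":
        problems.append("first attribute is not a timestamp")
    fields = attributes[1:]
    if len(fields) != len(frame.columns):
        problems.append(
            f"{len(fields)} schema fields but {len(frame.columns)} columns"
        )
        return problems
    for attribute, column in zip(fields, frame.columns):
        wants_number = attribute["AttributeType"] in ("float", "integer")
        if wants_number and not is_numeric_dtype(frame[column]):
            problems.append(
                f"{attribute['AttributeName']} expects a number, "
                f"column {column!r} is {frame[column].dtype}"
            )
    return problems


# Made-up two-column example in the same style as the schema above.
toy_schema = {"Attributes": [
    {"AttributeName": "timestamp", "AttributeType": "timestamp"},
    {"AttributeName": "temperature", "AttributeType": "float"},
    {"AttributeName": "weather", "AttributeType": "string"},
]}
toy_frame = pd.DataFrame(
    {"temp": [269.75, 269.95], "weather_description": ["clear", "clouds"]},
    index=pd.to_datetime(["2017-01-01 00:00", "2017-01-01 01:00"]),
)
print(schema_problems(toy_schema, toy_frame))
```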

In [None]:
related_DSN = datasetName + "_related_"
response = forecast.create_dataset(
    Domain="CUSTOM",
    DatasetType='RELATED_TIME_SERIES',
    DatasetName=related_DSN,
    DataFrequency=DATASET_FREQUENCY,
    Schema=related_schema
)

In [None]:
related_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=related_datasetArn)

In [None]:
datasetImportJobName = 'DSIMPORT_JOB_RELATEDPOC_'+str(uuid.uuid4()).replace("-", "_")
related_ds_import_job_response=forecast.create_dataset_import_job(
    DatasetImportJobName=datasetImportJobName,
    DatasetArn=related_datasetArn,
    DataSource={
        "S3Config": {
            "Path": related_s3DataPath,
            "RoleArn": role_arn
        }
    },
    TimestampFormat=TIMESTAMP_FORMAT
)

In [None]:
rel_ds_import_job_arn=related_ds_import_job_response['DatasetImportJobArn']
print(rel_ds_import_job_arn)

The cell below will poll until the import process has completed. Once it has, we can review the metrics and decide whether or not to delete the data.

In [None]:
while True:
    dataImportStatus = forecast.describe_dataset_import_job(
        DatasetImportJobArn=rel_ds_import_job_arn
    )['Status']
    print(dataImportStatus)
    if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':
        sleep(30)
    else:
        break
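
The loop above keeps polling for as long as the status is neither `ACTIVE` nor `CREATE_FAILED`. If you would prefer an upper bound on the wait, the sketch below shows one way to add a timeout. The helper `wait_for_status` and its parameters are hypothetical; it takes any zero-argument callable that returns the current status, so it does not depend on boto3 itself.

```python
import time


def wait_for_status(get_status, terminal=("ACTIVE", "CREATE_FAILED"),
                    poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll `get_status()` until it returns a terminal status or time runs out."""
    waited = 0
    while True:
        status = get_status()
        if status in terminal:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"still {status} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# In the notebook, the callable could wrap the describe call used above:
# status = wait_for_status(
#     lambda: forecast.describe_dataset_import_job(
#         DatasetImportJobArn=rel_ds_import_job_arn
#     )["Status"]
# )
```

Passing `sleep` as a parameter is only there so the helper can be exercised without actually waiting.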

## Evaluating the Related Time Series Data

First let us examine the dataframe that we provided to Forecast:

In [None]:
related_time_series_df.info()

In [None]:
related_time_series_df.sample(3)

Above we see 18,609 entries and not one is a NaN value! This is perfect. Now to double check what we imported:

In [None]:
forecast.describe_dataset_import_job(DatasetImportJobArn=rel_ds_import_job_arn)

Let's look at the `CountNull` and `CountNan` values in the output. They should be `0`.
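
Rather than scanning the output by eye, the same check can be sketched in code. This assumes the response exposes per-attribute statistics under a `FieldStatistics` key, each with `CountNull` and `CountNan` entries; the helper name `fields_with_missing_values` and the sample response are made up for illustration.

```python
def fields_with_missing_values(import_job_description):
    """Return {attribute: (CountNull, CountNan)} for attributes with gaps."""
    stats = import_job_description.get("FieldStatistics", {})
    flagged = {}
    for name, field in stats.items():
        nulls = field.get("CountNull", 0)
        nans = field.get("CountNan", 0)
        if nulls or nans:
            flagged[name] = (nulls, nans)
    return flagged


# Made-up response fragment, not real service output.
sample_description = {
    "Status": "ACTIVE",
    "FieldStatistics": {
        "temperature": {"Count": 10, "CountNull": 0, "CountNan": 0},
        "weather": {"Count": 10, "CountNull": 2, "CountNan": 0},
    },
}
print(fields_with_missing_values(sample_description))
```

An empty dictionary means every attribute came through without nulls or NaNs.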

Fantastic! No NaNs or nulls, and the entire dataset is ready to go.

Before moving forward with building your models in Amazon Forecast, we have to update the dataset group to include the related dataset.

In [None]:
forecast.update_dataset_group(DatasetGroupArn=datasetGroupArn, DatasetArns=[target_datasetArn, related_datasetArn])
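
To confirm that the update took effect, one option is to describe the dataset group and look for the related dataset's ARN in its `DatasetArns` list. The sketch below assumes a client that offers `describe_dataset_group`, as the Forecast client does; the helper name, the stand-in client class, and the ARN strings are hypothetical.

```python
def related_dataset_attached(client, dataset_group_arn, dataset_arn):
    """True if `dataset_arn` is listed in the dataset group's DatasetArns."""
    description = client.describe_dataset_group(DatasetGroupArn=dataset_group_arn)
    return dataset_arn in description.get("DatasetArns", [])


class FakeForecastClient:
    """Hypothetical stand-in so the sketch runs without AWS credentials."""

    def describe_dataset_group(self, DatasetGroupArn):
        return {
            "DatasetGroupArn": DatasetGroupArn,
            "DatasetArns": ["arn:example:target", "arn:example:related"],
        }


fake_client = FakeForecastClient()
print(related_dataset_attached(fake_client, "arn:example:group", "arn:example:related"))
```

In the notebook, the call would use the real `forecast` client with `datasetGroupArn` and `related_datasetArn`.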

If you'd like to disassociate this information from the dataset group so you can build your models without related data, simply uncomment the cell below and execute it.

In [None]:
#forecast.delete_dataset_import_job(DatasetImportJobArn=rel_ds_import_job_arn)