# Getting Data Ready

The overall process for using Amazon Forecast is the following:

1. Create a Dataset Group. This is the container that isolates models, and the data they are trained on, from one another.
1. Create a Dataset. Forecast has 3 dataset types: Target Time Series, Related Time Series, and Item Metadata. The Target Time Series is required; the others provide additional context for certain algorithms.
1. Import data. This moves the information from S3 into a storage volume where it can be used for training and validation.
1. Train a model. Forecast automates this process for you, but you can also select particular algorithms and either provide your own hyperparameters or use Hyperparameter Optimization (HPO) to determine the most performant values.
1. Deploy a Predictor. Here you deploy your model so you can use it to generate a forecast.
1. Query the Forecast. Given a request for an item bounded by time, Forecast returns the forecast for it. Once you have this, you can evaluate its performance or use it to guide your decisions about the future.

In this notebook we walk through the first 3 steps outlined above. One additional task is to hold out part of the data so that we can later measure the accuracy of a forecast against the actual values.


## Table Of Contents
* Setup
* Data Preparation
* Creating the Dataset Group and Dataset
* Next Steps


**Read Every Cell FULLY before executing it**

For more information about the APIs, please check the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/what-is-forecast.html).

## Setup

Import the standard Python libraries that are used in this lesson.

In [1]:
import sys
import os
import json
import time

import pandas as pd
import boto3

import util

Configure the S3 bucket name and region name for this lesson.

- If you don't have an S3 bucket, create one in S3 first. If you used the CloudFormation Wizard to set up the environment, use the same bucket name that you specified during setup.
- The cell below takes the bucket from the SageMaker session's default bucket and the region from the current boto3 session. You can choose any region in which the service is available.

The last part of the setup process is to create the Amazon Forecast clients that the rest of the notebook uses; the cell below does just that.

In [2]:
import sagemaker
import boto3

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()
region = boto3.Session().region_name

session = boto3.Session(region_name=region) 
forecast = session.client(service_name='forecast') 
forecastquery = session.client(service_name='forecastquery')

## Data Preparation<a class="anchor" id="DataPrep"></a>

For this exercise, we use the individual household electric power consumption dataset. (Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.) We aggregate the usage data hourly. 

To begin, use Pandas to read the CSV and to show a sample of the data.

In [3]:
df = pd.read_csv("./data/item-demand-time.csv", dtype = object, names=['timestamp','value','item'])
df.head(3)

| | timestamp | value | item |
|---|---|---|---|
| 0 | 2014-01-01 01:00:00 | 38.34991708126038 | client_12 |
| 1 | 2014-01-01 02:00:00 | 33.5820895522388 | client_12 |
| 2 | 2014-01-01 03:00:00 | 34.41127694859037 | client_12 |


Notice in the output above there are 3 columns of data:

1. The Timestamp
1. A Value
1. An Item

These are the 3 key required pieces of information to generate a forecast with Amazon Forecast. More can be added but these 3 must always remain present.

The dataset happens to span January 01, 2014 to December 31, 2014. For our testing we keep the last month of the data we use in a separate CSV: the cell below saves January through October as the training set and the month that follows, through the end of November, as the validation set.

You may notice a variable named `df`. This is a popular convention when using the Pandas dataframe object, which is similar to a table in a database. You can learn more here: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html


In [4]:
# Select January through October for the training dataframe.
# The timestamps are strings, so these comparisons are lexicographic: the upper
# bound '2014-10-31' excludes rows stamped on October 31 itself.
jan_to_oct = df[(df['timestamp'] >= '2014-01-01') & (df['timestamp'] <= '2014-10-31')]

# Select the following month (October 31 through November 30) for the validation dataframe.
remaining_df = df[(df['timestamp'] >= '2014-10-31') & (df['timestamp'] <= '2014-12-01')]
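
Because the CSV was read with `dtype = object`, the timestamps are strings and the range filters above compare them lexicographically. The sketch below shows what that means at a range boundary, using plain Python strings and `datetime` values. The sample timestamps are made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

samples = ["2014-10-30 23:00:00", "2014-10-31 00:00:00", "2014-10-31 05:00:00"]

# Lexicographic comparison: "2014-10-31 00:00:00" is longer than the bare date
# "2014-10-31" and therefore sorts after it, so every row on that day is excluded.
by_string = [ts for ts in samples if ts <= "2014-10-31"]

# Parsed comparison: the bound is midnight on October 31, so the midnight row
# is included and later rows on that day are excluded.
upper = datetime(2014, 10, 31)
by_datetime = [ts for ts in samples if datetime.strptime(ts, FMT) <= upper]

print(by_string)
print(by_datetime)
```

If you prefer the parsed behavior in Pandas, converting the column with `pd.to_datetime` before filtering is one way to get it.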

Now export them to CSV files and place them into your `data` folder.

In [5]:
jan_to_oct.to_csv("./data/item-demand-time-train.csv", header=False, index=False)
remaining_df.to_csv("./data/item-demand-time-validation.csv", header=False, index=False)

At this point the data is ready to be sent to S3, where Forecast will use it later. The following cell uploads the training data to S3.

In [6]:
key="elec_data/item-demand-time-train.csv"

boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_file("data/item-demand-time-train.csv")

## Creating the Dataset Group and Dataset <a class="anchor" id="dataset"></a>

In Amazon Forecast, a dataset is a collection of one or more files containing data that is relevant for a forecasting task. A dataset must conform to a schema provided by Amazon Forecast.

More details about `Domain` and dataset type can be found in the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html). For this example, we are using the [CUSTOM](https://docs.aws.amazon.com/forecast/latest/dg/custom-domain.html) domain with 3 required attributes: `timestamp`, `target_value`, and `item_id`.


It is also important to tell Amazon Forecast how to interpret your time-series information. The cell immediately below does that; the next one configures your variable names for the Project, DatasetGroup, and Dataset.

In [7]:
DATASET_FREQUENCY = "H" 
TIMESTAMP_FORMAT = "yyyy-MM-dd hh:mm:ss"
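
Before importing, it can help to confirm locally that the timestamps parse and are spaced one hour apart, matching the `H` frequency. The following is a minimal sketch using only the standard library. The helper name `find_gaps` and the sample values are hypothetical, and the Python format string `%Y-%m-%d %H:%M:%S` is an assumption about the file's layout, not Forecast's own parser.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, step=timedelta(hours=1), fmt="%Y-%m-%d %H:%M:%S"):
    """Return (previous, current) pairs whose spacing differs from `step`."""
    parsed = sorted(datetime.strptime(ts, fmt) for ts in timestamps)
    return [(a, b) for a, b in zip(parsed, parsed[1:]) if b - a != step]


# Made-up sample with the 03:00 reading missing.
sample = ["2014-01-01 01:00:00", "2014-01-01 02:00:00", "2014-01-01 04:00:00"]
gaps = find_gaps(sample)
print(gaps)
```

Since several items share the same timestamps, a check like this would be run separately on each item's rows.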

In [8]:
forecast_project_name = 'util_power_forecast'
forecast_dataset_name = forecast_project_name +'_ds'
forecast_dataset_group_name= forecast_project_name +'_dsg'
s3_data_path = "s3://" + bucket + "/" + key

In [9]:
# Now save things 
%store forecast_project_name

Stored 'forecast_project_name' (str)


### Create the Dataset Group

In [10]:
create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=forecast_dataset_group_name,
                                                              Domain="CUSTOM")


In [11]:
forecast_dataset_group_arn = create_dataset_group_response['DatasetGroupArn']

In [12]:
forecast.describe_dataset_group(DatasetGroupArn=forecast_dataset_group_arn)

{'DatasetGroupName': 'util_power_forecast_dsg',
 'DatasetGroupArn': 'arn:aws:forecast:us-east-1:992382405090:dataset-group/util_power_forecast_dsg',
 'DatasetArns': [],
 'Domain': 'CUSTOM',
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2024, 7, 8, 4, 27, 1, 801000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2024, 7, 8, 4, 27, 1, 801000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'c50425f4-221a-471f-9e12-21aefb25ab38',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 08 Jul 2024 04:27:02 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '269',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'c50425f4-221a-471f-9e12-21aefb25ab38'},
  'RetryAttempts': 0}}

### Create the Schema

In [13]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
forecast_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"target_value",
         "AttributeType":"float"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      }
   ]
}
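
Because the training CSV is written without a header row, the import depends on column order. A quick local check that each row lines up with the schema can catch ordering mistakes early. This is a rough sketch that repeats the schema layout above so it runs on its own; the `row_matches_schema` helper, the `PARSERS` table, and the inline sample row are hypothetical, and the checks are illustrative only, not the validation Amazon Forecast performs.

```python
import csv
import io
from datetime import datetime

schema = {"Attributes": [
    {"AttributeName": "timestamp", "AttributeType": "timestamp"},
    {"AttributeName": "target_value", "AttributeType": "float"},
    {"AttributeName": "item_id", "AttributeType": "string"},
]}

# One local parser per attribute type used in this schema.
PARSERS = {
    "timestamp": lambda v: datetime.strptime(v, "%Y-%m-%d %H:%M:%S"),
    "float": float,
    "string": str,
}


def row_matches_schema(row, schema):
    """Return True if the row has one parseable value per schema attribute."""
    attrs = schema["Attributes"]
    if len(row) != len(attrs):
        return False
    try:
        for value, attr in zip(row, attrs):
            PARSERS[attr["AttributeType"]](value)
    except ValueError:
        return False
    return True


sample_csv = "2014-01-01 01:00:00,38.34991708126038,client_12\n"
rows = list(csv.reader(io.StringIO(sample_csv)))
print(all(row_matches_schema(r, schema) for r in rows))
```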

### Create the Dataset

In [14]:
response=forecast.create_dataset(
                    Domain="CUSTOM",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=forecast_dataset_name,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = forecast_schema
)

In [15]:
forecast_dataset_arn = response['DatasetArn']

In [16]:
forecast.describe_dataset(DatasetArn=forecast_dataset_arn)

{'DatasetArn': 'arn:aws:forecast:us-east-1:992382405090:dataset/util_power_forecast_ds',
 'DatasetName': 'util_power_forecast_ds',
 'Domain': 'CUSTOM',
 'DatasetType': 'TARGET_TIME_SERIES',
 'DataFrequency': 'H',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'target_value', 'AttributeType': 'float'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2024, 7, 8, 4, 27, 8, 700000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2024, 7, 8, 4, 27, 8, 700000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'b03bec47-a66f-4d38-944c-85fc5c645b4e',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 08 Jul 2024 04:27:10 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '501',
   'connection': 'keep-alive',
   'x-amzn-requestid': 'b03bec47-a66f-4d38-944c-85fc5c645b4e'},
  'RetryAttempt

### Add Dataset to Dataset Group

In [17]:
forecast.update_dataset_group(DatasetGroupArn=forecast_dataset_group_arn, DatasetArns=[forecast_dataset_arn])

{'ResponseMetadata': {'RequestId': '6252cb38-5e47-489b-ab24-eb41f28423e3',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'date': 'Mon, 08 Jul 2024 04:27:31 GMT',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '2',
   'connection': 'keep-alive',
   'x-amzn-requestid': '6252cb38-5e47-489b-ab24-eb41f28423e3'},
  'RetryAttempts': 0}}

### Create IAM Role for Forecast

Like many AWS services, Forecast will need to assume an IAM role in order to interact with your S3 resources securely. In the sample notebooks, we use the get_or_create_iam_role() utility function to create an IAM role. Please refer to ["notebooks/common/util/fcst_utils.py"](../../common/util/fcst_utils.py) for implementation.

In [18]:
# Create the role to provide to Amazon Forecast.
forecast_role_name = "ForecastNotebookRole"
forecast_role_arn = util.get_or_create_iam_role(role_name=forecast_role_name)

Created arn:aws:iam::992382405090:role/ForecastNotebookRole
Attaching policies
Waiting for a minute to allow IAM role policy attachment to propagate
Done.
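
For context, a role that Forecast can assume needs a trust policy naming the Forecast service principal. The sketch below builds such a policy document. It illustrates the general shape only and is not the contents of `get_or_create_iam_role()`, whose implementation is not shown here; the function name `build_forecast_trust_policy` is hypothetical.

```python
import json


def build_forecast_trust_policy():
    """Trust policy that lets the Amazon Forecast service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "forecast.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


policy_json = json.dumps(build_forecast_trust_policy(), indent=2)
print(policy_json)
```

A JSON string like this is what the `AssumeRolePolicyDocument` parameter of the IAM `create_role` call expects. The role also needs permissions that allow it to read the S3 bucket holding your data.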


### Create Data Import Job


Now that Forecast knows how to understand the CSV we are providing, the next step is to import the data from S3 into Amazon Forecast.

In [19]:
forecast_dataset_import_job_name = 'EP_DSIMPORT_JOB_TARGET'
forecast_ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=forecast_dataset_import_job_name,
                                                          DatasetArn=forecast_dataset_arn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path": s3_data_path,
                                                                 "RoleArn": forecast_role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [20]:
forecast_ds_import_job_arn = forecast_ds_import_job_response['DatasetImportJobArn']
print(forecast_ds_import_job_arn)

arn:aws:forecast:us-east-1:992382405090:dataset-import-job/util_power_forecast_ds/EP_DSIMPORT_JOB_TARGET


Check the status of the dataset import job. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on the data size, this process can take 5 to 10 minutes.

In [21]:
status_indicator = util.StatusIndicator()

while True:
    status = forecast.describe_dataset_import_job(DatasetImportJobArn=forecast_ds_import_job_arn)['Status']
    status_indicator.update(status)
    if status in ('ACTIVE', 'CREATE_FAILED'): break
    time.sleep(10)

status_indicator.end()

CREATE_IN_PROGRESS .......................................................................
ACTIVE 
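
The loop above runs until the job reaches a terminal status. A variant with an upper bound on the waiting time might look like the sketch below. The function `wait_for_status`, its parameters, and the canned statuses are hypothetical and only illustrate the pattern.

```python
import time


def wait_for_status(describe, terminal=("ACTIVE", "CREATE_FAILED"),
                    timeout_s=1800, poll_s=10, sleep=time.sleep):
    """Poll `describe()` until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = describe()
        if status in terminal:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status} after {timeout_s} seconds")
        sleep(poll_s)


# Offline demonstration with canned statuses instead of a real Forecast call.
_statuses = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
result = wait_for_status(lambda: next(_statuses), sleep=lambda s: None)
print(result)
```

In the notebook, the `describe` argument would be a small function that returns `forecast.describe_dataset_import_job(DatasetImportJobArn=forecast_ds_import_job_arn)['Status']`.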


In [22]:
forecast.describe_dataset_import_job(DatasetImportJobArn=forecast_ds_import_job_arn)

{'DatasetImportJobName': 'EP_DSIMPORT_JOB_TARGET',
 'DatasetImportJobArn': 'arn:aws:forecast:us-east-1:992382405090:dataset-import-job/util_power_forecast_ds/EP_DSIMPORT_JOB_TARGET',
 'DatasetArn': 'arn:aws:forecast:us-east-1:992382405090:dataset/util_power_forecast_ds',
 'TimestampFormat': 'yyyy-MM-dd hh:mm:ss',
 'UseGeolocationForTimeZone': False,
 'DataSource': {'S3Config': {'Path': 's3://sagemaker-us-east-1-992382405090/elec_data/item-demand-time-train.csv',
   'RoleArn': 'arn:aws:iam::992382405090:role/ForecastNotebookRole'}},
 'FieldStatistics': {'item_id': {'Count': 21813,
   'CountDistinct': 3,
   'CountNull': 0,
   'CountLong': 21813,
   'CountDistinctLong': 3,
   'CountNullLong': 0},
  'target_value': {'Count': 21813,
   'CountDistinct': 4630,
   'CountNull': 0,
   'CountNan': 0,
   'Min': '0.0',
   'Max': '209.99170812603649',
   'Avg': 50.0805953246442,
   'Stddev': 38.44386200710882,
   'CountLong': 21813,
   'CountDistinctLong': 4630,
   'CountNullLong': 0,
   'CountNanLo
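
The `FieldStatistics` section of the response gives a quick way to confirm that the import saw what you expected. The following small sketch flags fields that contain nulls and computes the average number of rows per item, using a hand-copied subset of the values shown above. The helper name `fields_with_nulls` is hypothetical.

```python
def fields_with_nulls(field_stats):
    """Return the names of fields whose CountNull is greater than zero."""
    return sorted(name for name, stats in field_stats.items()
                  if stats.get("CountNull", 0) > 0)


# Subset of the FieldStatistics values from the response above.
field_stats = {
    "item_id": {"Count": 21813, "CountDistinct": 3, "CountNull": 0},
    "target_value": {"Count": 21813, "CountDistinct": 4630, "CountNull": 0},
}

print(fields_with_nulls(field_stats))

# Average number of rows per distinct item.
item_stats = field_stats["item_id"]
rows_per_item = item_stats["Count"] / item_stats["CountDistinct"]
print(rows_per_item)
```

An average of 7271 rows per item is roughly what ten months of hourly readings would produce.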

## Next Steps

At this point you have successfully imported your data into Amazon Forecast, and it is time to move on to the next notebook to build your first model. To continue, execute the cell below to store important variables where they can be used in the next notebook, then open `2.Building_Your_Predictor.ipynb`.

In [23]:
# The S3 object key was defined earlier as `key`; keep it under the name stored below.
forecast_key = key

%store forecast_dataset_group_arn
%store forecast_dataset_arn
%store forecast_role_name
%store forecast_key
%store forecast_ds_import_job_arn

Stored 'forecast_dataset_group_arn' (str)
Stored 'forecast_dataset_arn' (str)
Stored 'forecast_role_name' (str)
Stored 'forecast_key' (str)
Stored 'forecast_ds_import_job_arn' (str)
