## Setup
Import the standard Python libraries used in this lab.


In [1]:
import boto3
from time import sleep
import subprocess
import pandas as pd
import json
import time

Import `sagemaker` and retrieve the ARN of the notebook's execution role.

In [2]:
import sagemaker
region = boto3.Session().region_name    
smclient = boto3.Session().client('sagemaker')
from sagemaker import get_execution_role

role_arn = get_execution_role()
print(role_arn)

# Make sure this role has Amazon Forecast permissions and is able to use S3

arn:aws:iam::457927431838:role/service-role/AmazonSageMaker-ExecutionRole-20190920T105733


The last part of the setup process is to validate that your account can communicate with Amazon Forecast. The cell below does just that.

In [3]:
session = boto3.Session(region_name='us-east-1') 
forecast = session.client(service_name='forecast') 
forecastquery = session.client(service_name='forecastquery')
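
Creating the clients does not by itself send a request to the service, so a cheap read-only call is one way to confirm connectivity. The following is a minimal sketch, assuming the `forecast` client from the cell above. `can_reach_forecast` is a hypothetical helper name, and `list_dataset_groups` is chosen only because it needs no existing resources.

```python
def can_reach_forecast(client):
    """Return True if a read-only Amazon Forecast call succeeds.

    `client` is expected to be a boto3 'forecast' client. Any exception
    (missing credentials, denied permissions, network errors) is printed
    and treated as "not reachable".
    """
    try:
        client.list_dataset_groups(MaxResults=1)
    except Exception as error:  # broad on purpose: this is only a smoke test
        print(f"Amazon Forecast is not reachable: {error}")
        return False
    return True

# Usage in the notebook (requires AWS credentials):
# can_reach_forecast(forecast)
```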

## Data Preparation

In [4]:
df = pd.read_csv("../data/COF_yearly_Revenue_Data.csv", dtype = object, names=['metric_name','timestamp','metric_value'])
df.head(3)

Unnamed: 0,metric_name,timestamp,metric_value
0,metric_name,timestamp,metric_value
1,Revenue,2018-12-31,28076000000
2,Revenue,2017-12-31,27237000000


Create the training set and the validation set. Use the last year's revenue (2018) as the validation set.

In [5]:
# Select 1996 to 2017 in one data frame
df_1996_2017 = df[(df['timestamp'] >= '1995-12-31') & (df['timestamp'] <= '2017-12-31')]

# Select the year 2018 separately for validation
df = pd.read_csv("../data/COF_yearly_Revenue_Data.csv", dtype = object, names=['metric_name','timestamp','metric_value'])
df_2018 = df[(df['timestamp'] >= '2018-12-31')]

In [6]:
df_1996_2017


Unnamed: 0,metric_name,timestamp,metric_value
2,Revenue,2017-12-31,27237000000
3,Revenue,2016-12-31,25816000000
4,Revenue,2015-12-31,23413000000
5,Revenue,2014-12-31,22290000000
6,Revenue,2013-12-31,22384000000
7,Revenue,2012-12-31,21396000000
8,Revenue,2011-12-31,16279000000
9,Revenue,2010-12-31,16171000000
10,Revenue,2009-12-31,12983267000
11,Revenue,2008-12-31,13892686000


In [7]:
df_2018

Unnamed: 0,metric_name,timestamp,metric_value
0,metric_name,timestamp,metric_value
1,Revenue,2018-12-31,28076000000
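
The outputs above show the file's own header line as data row 0, in both `df` and `df_2018`. This happens because `names=` is passed without `header=0`, so pandas treats the first line as data, and the string comparison `>= '2018-12-31'` then also matches the literal text `timestamp`. The following is a minimal sketch of one way to avoid this. It uses a small in-memory CSV as a stand-in for the real file (the three data rows are copied from the outputs above), and `clean`, `train`, and `validation` are illustrative names.

```python
import io
import pandas as pd

# In-memory stand-in for ../data/COF_yearly_Revenue_Data.csv (assumption:
# the real file starts with the same header line).
raw_csv = io.StringIO(
    "metric_name,timestamp,metric_value\n"
    "Revenue,2018-12-31,28076000000\n"
    "Revenue,2017-12-31,27237000000\n"
    "Revenue,2016-12-31,25816000000\n"
)

# header=0 marks the first line as a header, which `names` then replaces,
# so the header text never appears as a data row.
clean = pd.read_csv(
    raw_csv,
    header=0,
    dtype=object,
    names=["metric_name", "timestamp", "metric_value"],
)

# ISO-formatted date strings sort chronologically, so string comparison works
# once the header row is gone.
train = clean[clean["timestamp"] <= "2017-12-31"]
validation = clean[clean["timestamp"] == "2018-12-31"]

print(len(train), len(validation))
```

With the header row excluded at read time, a later `to_csv(..., header=False)` writes data rows only.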


Now export them to CSV files and place them into your data folder.

In [8]:
df_1996_2017.to_csv("../data/cof-revenue-train.csv", header=False, index=False)
df_2018.to_csv("../data/cof-revenue-validation.csv", header=False, index=False)

Define the S3 bucket and folder where we will upload the data. Amazon Forecast will read the data from this location later.

In [10]:
bucket_name = "sagemaker-capone-forecast-useast1-02"  # Remember to change this to the correct bucket name used for Capital One
folder_name = "cone"  # change this to the folder name of the user.

Upload the training data to S3.

In [11]:
s3 = session.client('s3')
key=folder_name+"/cof-revenue-train.csv"
s3.upload_file(Filename="../data/cof-revenue-train.csv", Bucket=bucket_name, Key=key)
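
`upload_file` returns nothing on success, so it can be useful to confirm that the object is visible before pointing Amazon Forecast at it. The following is a minimal sketch, assuming a boto3 S3 client such as `s3` above. `upload_and_verify` is a hypothetical helper, not part of the original notebook.

```python
def upload_and_verify(s3_client, filename, bucket, key):
    """Upload a local file to S3 and return the stored object's size in bytes.

    `s3_client` is expected to be a boto3 's3' client. `head_object` raises
    an error if the object is missing, so a returned size means the upload
    is visible in the bucket.
    """
    s3_client.upload_file(Filename=filename, Bucket=bucket, Key=key)
    response = s3_client.head_object(Bucket=bucket, Key=key)
    return response["ContentLength"]

# Usage in the notebook (requires AWS credentials):
# upload_and_verify(s3, "../data/cof-revenue-train.csv", bucket_name, key)
```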

In [None]:
iam = boto3.client("iam")

role_name = "C1ForecastRoleDemo"
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "forecast.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam.create_role(
    RoleName = role_name,
    AssumeRolePolicyDocument = json.dumps(assume_role_policy_document)
)

# AmazonForecastFullAccess grants access to the Amazon Forecast APIs; S3 access is attached separately below.
# If you prefer narrower permissions, please consider creating and attaching a new policy
# that provides read access to your bucket or attaching the AmazonS3ReadOnlyAccess policy to the role
policy_arn = "arn:aws:iam::aws:policy/AmazonForecastFullAccess"
iam.attach_role_policy(
    RoleName = role_name,
    PolicyArn = policy_arn
)

# Now add S3 support
iam.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonS3FullAccess',
    RoleName=role_name
)
time.sleep(60) # wait for a minute to allow IAM role policy attachment to propagate

role_arn = create_role_response["Role"]["Arn"]
print(role_arn)
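
The cell above attaches `AmazonS3FullAccess`, which is broader than an import needs. The following is a minimal sketch of a narrower policy document scoped to one bucket and folder. `s3_read_policy` and the policy name `ForecastS3Read` are hypothetical, and whether these two actions are sufficient for your import is an assumption to verify against the Amazon Forecast documentation.

```python
import json

def s3_read_policy(bucket, prefix):
    """Build an IAM policy document allowing read access to one S3 prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # Reading targets the objects stored under the prefix.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }

policy_json = json.dumps(s3_read_policy("sagemaker-capone-forecast-useast1-02", "cone"))

# Possible use in place of the AmazonS3FullAccess attachment:
# iam.put_role_policy(RoleName=role_name, PolicyName="ForecastS3Read",
#                     PolicyDocument=policy_json)
```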

## Creating the Dataset Group and Dataset <a class="anchor" id="dataset"></a>

In Amazon Forecast, a dataset is a collection of one or more files that contain data relevant to a forecasting task. A dataset must conform to a schema provided by Amazon Forecast.

More details about `Domain` and dataset types can be found in the [documentation](https://docs.aws.amazon.com/forecast/latest/dg/howitworks-domains-ds-types.html). For this example, we use the [METRICS](https://docs.aws.amazon.com/forecast/latest/dg/metrics-domain.html) domain with its 3 required attributes: `metric_name`, `timestamp`, and `metric_value`.


It is also important to tell Amazon Forecast how to interpret your time-series information. The cell immediately below does that; the next one configures the variable names for the Project, DatasetGroup, and Dataset.

In [12]:
DATASET_FREQUENCY = "Y" 
TIMESTAMP_FORMAT = "yyyy-mm-dd"
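
Before importing, it can be worth checking locally that every timestamp in the training file matches the declared year-month-day pattern. The following is a minimal sketch using only the standard library. The Python pattern `%Y-%m-%d` is an assumed equivalent of the format declared above, and `unparseable_timestamps` is a hypothetical helper.

```python
from datetime import datetime

def unparseable_timestamps(values, python_format="%Y-%m-%d"):
    """Return the values that do NOT parse with the given strptime format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, python_format)
        except ValueError:
            bad.append(value)
    return bad

# A stray header cell such as 'timestamp' is reported; valid dates are not.
print(unparseable_timestamps(["2017-12-31", "2016-12-31", "timestamp"]))
```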

In [13]:
project = 'cof_revenue_forecastdemo'
datasetName= project+'_ds'
datasetGroupName= project +'_dsg'
s3DataPath = "s3://"+bucket_name+"/"+key

### Create the Dataset Group

In [15]:
create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,
                                                              Domain="METRICS",
                                                             )
datasetGroupArn = create_dataset_group_response['DatasetGroupArn']

In [16]:
forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)

{'DatasetGroupName': 'cof_revenue_forecastdemo_dsg',
 'DatasetGroupArn': 'arn:aws:forecast:us-east-1:457927431838:dataset-group/cof_revenue_forecastdemo_dsg',
 'DatasetArns': [],
 'Domain': 'METRICS',
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2019, 10, 9, 6, 1, 58, 826000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2019, 10, 9, 6, 1, 58, 826000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '9fca26de-9da0-4954-b5c9-6fabecfeffd5',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 09 Oct 2019 06:02:00 GMT',
   'x-amzn-requestid': '9fca26de-9da0-4954-b5c9-6fabecfeffd5',
   'content-length': '280',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}
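
Re-running the create cell with the same name is expected to fail because the dataset group already exists. The following is a minimal sketch of looking up an existing group by name so the notebook can reuse its ARN. `find_dataset_group_arn` is a hypothetical helper, and it assumes the `DatasetGroups` and `NextToken` fields of the `list_dataset_groups` response.

```python
def find_dataset_group_arn(client, name):
    """Return the ARN of the dataset group called `name`, or None if absent.

    `client` is expected to be a boto3 'forecast' client. Results are read
    page by page using the NextToken returned by list_dataset_groups.
    """
    kwargs = {}
    while True:
        page = client.list_dataset_groups(**kwargs)
        for group in page.get("DatasetGroups", []):
            if group.get("DatasetGroupName") == name:
                return group["DatasetGroupArn"]
        token = page.get("NextToken")
        if not token:
            return None
        kwargs = {"NextToken": token}

# Usage in the notebook (requires AWS credentials):
# datasetGroupArn = find_dataset_group_arn(forecast, datasetGroupName)
```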

### Create the Schema

In [17]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
schema ={
   "Attributes":[
      {
         "AttributeName":"metric_name",
         "AttributeType":"string"
      },
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"metric_value",
         "AttributeType":"float"
      }
   ]
}
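
Because the column order must match the raw file, a quick local check of the CSV against the schema can catch problems before the import job runs. The following is a minimal sketch that checks the field count per row and whether `float` attributes parse as numbers. `find_schema_problems` is a hypothetical helper, the schema is repeated so the example is self-contained, and the sample text is illustrative.

```python
import csv
import io

schema = {
    "Attributes": [
        {"AttributeName": "metric_name", "AttributeType": "string"},
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "metric_value", "AttributeType": "float"},
    ]
}

def find_schema_problems(schema, csv_text):
    """Return (row_number, message) pairs for rows that do not fit the schema."""
    attributes = schema["Attributes"]
    problems = []
    for row_number, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(row) != len(attributes):
            problems.append((row_number, f"expected {len(attributes)} fields, got {len(row)}"))
            continue
        for attribute, field in zip(attributes, row):
            if attribute["AttributeType"] == "float":
                try:
                    float(field)
                except ValueError:
                    name = attribute["AttributeName"]
                    problems.append((row_number, f"{name} is not numeric: {field!r}"))
    return problems

# The second line imitates a header row that slipped into the data.
sample = "Revenue,2017-12-31,27237000000\nmetric_name,timestamp,metric_value\n"
print(find_schema_problems(schema, sample))
```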

### Create the Dataset

In [18]:
response=forecast.create_dataset(
                    Domain="METRICS",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=datasetName,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = schema
)

In [19]:
datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=datasetArn)

{'DatasetArn': 'arn:aws:forecast:us-east-1:457927431838:dataset/cof_revenue_forecastdemo_ds',
 'DatasetName': 'cof_revenue_forecastdemo_ds',
 'Domain': 'METRICS',
 'DatasetType': 'TARGET_TIME_SERIES',
 'DataFrequency': 'Y',
 'Schema': {'Attributes': [{'AttributeName': 'metric_name',
    'AttributeType': 'string'},
   {'AttributeName': 'timestamp', 'AttributeType': 'timestamp'},
   {'AttributeName': 'metric_value', 'AttributeType': 'float'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2019, 10, 9, 6, 2, 18, 945000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2019, 10, 9, 6, 2, 18, 945000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'dff13de6-b420-48ad-afd9-8ff6cc14a2c7',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 09 Oct 2019 06:02:20 GMT',
   'x-amzn-requestid': 'dff13de6-b420-48ad-afd9-8ff6cc14a2c7',
   'content-length': '520',
   'connection': 'keep-alive'}

### Add Dataset to Dataset Group

In [20]:
forecast.update_dataset_group(DatasetGroupArn=datasetGroupArn, DatasetArns=[datasetArn])

{'ResponseMetadata': {'RequestId': 'd3690d18-636d-4c63-a472-dbf83410c9f2',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Wed, 09 Oct 2019 06:02:24 GMT',
   'x-amzn-requestid': 'd3690d18-636d-4c63-a472-dbf83410c9f2',
   'content-length': '2',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}
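
The update response above carries only request metadata, and the earlier `describe_dataset_group` output showed an empty `DatasetArns` list. The following is a minimal sketch of confirming that the dataset is now attached. `dataset_is_attached` is a hypothetical helper.

```python
def dataset_is_attached(client, dataset_group_arn, dataset_arn):
    """Return True if `dataset_arn` is listed in the dataset group's DatasetArns.

    `client` is expected to be a boto3 'forecast' client.
    """
    group = client.describe_dataset_group(DatasetGroupArn=dataset_group_arn)
    return dataset_arn in group.get("DatasetArns", [])

# Usage in the notebook (requires AWS credentials):
# dataset_is_attached(forecast, datasetGroupArn, datasetArn)
```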

### Create Data Import Job


Now that Forecast knows how to interpret the CSV we are providing, the next step is to import the data from S3 into Amazon Forecast.

In [21]:
datasetImportJobName = 'EP_DSIMPORT_JOB_TARGET'
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":s3DataPath,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [22]:
ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_import_job_arn)

arn:aws:forecast:us-east-1:457927431838:dataset-import-job/cof_revenue_forecastdemo_ds/EP_DSIMPORT_JOB_TARGET


Check the status of the dataset import job. When the status changes from **CREATE_IN_PROGRESS** to **ACTIVE**, we can continue to the next steps. Depending on the data size, this typically takes 5 to 10 minutes.

In [23]:
while True:
    dataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']
    print(dataImportStatus)
    if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':
        sleep(30)
    else:
        break

CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
CREATE_IN_PROGRESS
ACTIVE
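
The loop above runs until a terminal status appears, however long that takes. The following is a minimal sketch of the same polling pattern with a timeout. `wait_for_status` is a hypothetical helper, and the default interval and timeout values are assumptions, not values from the original notebook.

```python
import time

def wait_for_status(get_status, done=("ACTIVE", "CREATE_FAILED"),
                    interval=30, timeout=1800, sleep=time.sleep):
    """Poll `get_status()` until it returns a value in `done`.

    Raises TimeoutError if no terminal status is seen within `timeout`
    seconds. `sleep` is injectable so the loop can be tested without waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        print(status)
        if status in done:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status} after {timeout} seconds")
        sleep(interval)

# Usage in the notebook (requires AWS credentials):
# wait_for_status(lambda: forecast.describe_dataset_import_job(
#     DatasetImportJobArn=ds_import_job_arn)["Status"])
```

Passing the status lookup as a callable keeps the helper reusable for other Amazon Forecast resources that report a `Status` field.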


In [24]:
forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)

{'DatasetImportJobName': 'EP_DSIMPORT_JOB_TARGET',
 'DatasetImportJobArn': 'arn:aws:forecast:us-east-1:457927431838:dataset-import-job/cof_revenue_forecastdemo_ds/EP_DSIMPORT_JOB_TARGET',
 'DatasetArn': 'arn:aws:forecast:us-east-1:457927431838:dataset/cof_revenue_forecastdemo_ds',
 'TimestampFormat': 'yyyy-mm-dd',
 'DataSource': {'S3Config': {'Path': 's3://sagemaker-capone-forecast-useast1-02/cone/cof-revenue-train.csv',
   'RoleArn': 'arn:aws:iam::457927431838:role/service-role/AmazonSageMaker-ExecutionRole-20190920T105733'}},
 'FieldStatistics': {'date': {'Count': 22,
   'CountDistinct': 22,
   'CountNull': 0,
   'Min': '1996-12-31T00:00:00Z',
   'Max': '2017-12-31T00:00:00Z'},
  'item': {'Count': 22, 'CountDistinct': 1, 'CountNull': 0},
  'target': {'Count': 22,
   'CountDistinct': 22,
   'CountNull': 0,
   'CountNan': 0,
   'Min': '1.203046E9',
   'Max': '2.7237E10',
   'Avg': 12856397727.272728,
   'Stddev': 8210568656.541229}},
 'DataSize': 6.26780092716217e-07,
 'Status': 'ACTIV
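
The `FieldStatistics` block above offers a quick sanity check: 22 rows were imported, which matches one row per year from 1996 through 2017. The following is a minimal sketch of reading that count programmatically. `imported_row_count` is a hypothetical helper, and the sample dictionary is a trimmed copy of the response shown above.

```python
def imported_row_count(describe_response, field="target"):
    """Return the row count Forecast reports for one field of an import job.

    `describe_response` is the dict returned by describe_dataset_import_job;
    the 'FieldStatistics' layout is taken from the output shown above.
    """
    return describe_response["FieldStatistics"][field]["Count"]

# Trimmed copy of the response above, used as sample input.
sample = {
    "FieldStatistics": {
        "date": {"Count": 22},
        "item": {"Count": 22},
        "target": {"Count": 22},
    }
}

expected_rows = 2017 - 1996 + 1  # one row per year, 1996 through 2017
print(imported_row_count(sample) == expected_rows)
```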

In [25]:
print("DatasetGroupArn: ")
print(datasetGroupArn)

DatasetGroupArn: 
arn:aws:forecast:us-east-1:457927431838:dataset-group/cof_revenue_forecastdemo_dsg
