# Amazon Forecast

Forecast sales through the end of December 2015.

 - 200 items sampled from the 8,230 items in dept_id FOODS_3 (Best200)
 - Target Time Series
     - From/To : 2014-01-01/2015-11-30
     - **timestamp (timestamp)**
     - **id (string)**
     - **demand (int)**
 - Related Time Series
     - From/To : 2014-01-01/2015-12-30 (Target Time Series + 30days)
     - **timestamp (timestamp)**
     - **id (string)**
     - sell_price (float)
     - snap_CA, snap_TX, snap_WI (int)
     - Easter, LaborDay, Purim_End, StPatricksDay, SuperBowl (int)
     - Black Friday (int)
 - Item meta data : id, item_id, dept_id, cat_id, store_id, state_id
 - BackTestWindows : 4
 - BackTestWindowOffset : Default (Same as ForecastHorizon)
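The date ranges and row counts implied by the list above can be checked with a little date arithmetic. This is a minimal sketch that uses only the values stated in the list (200 items, a 30-day extension of the related series); the variable names are illustrative and not part of the notebook.

```python
from datetime import date, timedelta

target_start = date(2014, 1, 1)
target_end = date(2015, 11, 30)
forecast_horizon = 30   # days added to the related time series
n_items = 200           # Best200 sample

# The related time series must extend one forecast horizon past the target end
related_end = target_end + timedelta(days=forecast_horizon)

# Both ranges are inclusive, hence the +1
target_days = (target_end - target_start).days + 1
related_days = (related_end - target_start).days + 1

print(related_end)              # 2015-12-30
print(n_items * target_days)    # 139800
print(n_items * related_days)   # 145800
```

The two row counts match the `len(df_target)` and `len(df_related)` outputs shown later in the notebook.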

<img src="../img/forecast-steps.png" align="left">

# Data Preparation

In [1]:
# Import required library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import boto3
from datetime import datetime, timedelta
import json
import time
from time import sleep
import warnings

%matplotlib inline

warnings.filterwarnings(action='ignore')

In [2]:
%store -r

In [3]:
len(df_merged) # 15,743,990

15743990

In [4]:
len(df_sales_foods_3) #8,230

8230

In [5]:
df_sales_foods_3.head()

Unnamed: 0,id,item_id,dept_id,cat_id,store_id,state_id,d_1,d_2,d_3,d_4,...,d_1904,d_1905,d_1906,d_1907,d_1908,d_1909,d_1910,d_1911,d_1912,d_1913
2226,FOODS_3_001_CA_1_validation,FOODS_3_001,FOODS_3,FOODS,CA_1,CA,1,1,1,1,...,0,0,1,2,0,0,1,0,0,1
2227,FOODS_3_002_CA_1_validation,FOODS_3_002,FOODS_3,FOODS,CA_1,CA,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2228,FOODS_3_003_CA_1_validation,FOODS_3_003,FOODS_3,FOODS,CA_1,CA,0,0,0,0,...,0,1,1,1,0,0,0,0,1,0
2229,FOODS_3_004_CA_1_validation,FOODS_3_004,FOODS_3,FOODS,CA_1,CA,0,0,0,0,...,1,2,0,0,0,0,0,2,0,1
2230,FOODS_3_005_CA_1_validation,FOODS_3_005,FOODS_3,FOODS,CA_1,CA,1,0,1,2,...,0,0,2,0,0,0,0,0,1,0


## Select the Best200

In [6]:
# Extract the columns whose names start with d_
d_cols = [c for c in df_sales_foods_3.columns if c.startswith('d_')]

# Sum the values (unit sales) of the d_ columns and store the result in a "sales_total" column
df_sales_foods_3["sales_total"] = df_sales_foods_3.loc[:,d_cols].sum(axis=1)

In [7]:
# Select the Best200 list: the 200 items with the highest total daily sales
best200 = df_sales_foods_3.sort_values(by="sales_total", ascending=False).head(200)
sampled = best200
len(sampled)

200

In [8]:
sampled[["id", "sales_total"]].head()

Unnamed: 0,id,sales_total
8412,FOODS_3_090_CA_3_validation,250502
18055,FOODS_3_586_TX_2_validation,192835
21104,FOODS_3_586_TX_3_validation,150122
8908,FOODS_3_586_CA_3_validation,134386
2314,FOODS_3_090_CA_1_validation,127203


In [9]:
sampled[["id", "sales_total"]].tail()

Unnamed: 0,id,sales_total
15101,FOODS_3_681_TX_1_validation,24273
8635,FOODS_3_313_CA_3_validation,24261
29751,FOODS_3_086_WI_3_validation,24133
29819,FOODS_3_154_WI_3_validation,24019
8699,FOODS_3_377_CA_3_validation,23976


In [10]:
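# Extract the Best200 rows (copy so that columns can be added later without modifying a view of df_merged)
df_merged_sampled = df_merged[df_merged["id"].isin(sampled.id)].copy()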

In [11]:
len(df_merged_sampled["id"].unique()) # 200

200

In [12]:
df_merged_sampled.head()

Unnamed: 0_level_0,id,item_id,dept_id,cat_id,store_id,state_id,day,sales,wm_yr_wk,weekday,...,year,d,event_name_1,event_type_1,event_name_2,event_type_2,snap_CA,snap_TX,snap_WI,sell_price
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2011-01-29,FOODS_3_064_CA_1_validation,FOODS_3_064,FOODS_3,FOODS,CA_1,CA,d_1,0,11101,Saturday,...,2011,d_1,,,,,0,0,0,
2011-01-29,FOODS_3_080_CA_1_validation,FOODS_3_080,FOODS_3,FOODS,CA_1,CA,d_1,33,11101,Saturday,...,2011,d_1,,,,,0,0,0,1.48
2011-01-29,FOODS_3_090_CA_1_validation,FOODS_3_090,FOODS_3,FOODS,CA_1,CA,d_1,107,11101,Saturday,...,2011,d_1,,,,,0,0,0,1.25
2011-01-29,FOODS_3_099_CA_1_validation,FOODS_3_099,FOODS_3,FOODS,CA_1,CA,d_1,0,11101,Saturday,...,2011,d_1,,,,,0,0,0,
2011-01-29,FOODS_3_120_CA_1_validation,FOODS_3_120,FOODS_3,FOODS,CA_1,CA,d_1,0,11101,Saturday,...,2011,d_1,,,,,0,0,0,


## Create Data Sets

### Target (df_target)

In [13]:
df_target = df_merged_sampled[["id", "sales"]]
df_target = df_target.loc["2014-01-01":"2015-11-30"]
df_target.head()

Unnamed: 0_level_0,id,sales
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2014-01-01,FOODS_3_064_CA_1_validation,5
2014-01-01,FOODS_3_080_CA_1_validation,14
2014-01-01,FOODS_3_090_CA_1_validation,58
2014-01-01,FOODS_3_099_CA_1_validation,9
2014-01-01,FOODS_3_120_CA_1_validation,23


In [14]:
len(df_target)

139800

In [15]:
df_target = df_target.sort_values(by=["id", "date"])

### Related (df_related)
- Add the day before, the day of, and the day after Black Friday to the df_related data.

In [16]:
df_merged_sampled['black_friday'] = 0

In [17]:
# Black Friday 2011 : 2011-11-25
df_merged_sampled.loc["2011-11-24":"2011-11-26", 'black_friday'] = 1

# Black Friday 2012 : 2012-11-23
df_merged_sampled.loc["2012-11-22":"2012-11-24", 'black_friday'] = 1

# Black Friday 2013 : 2013-11-29
df_merged_sampled.loc["2013-11-28":"2013-11-30", 'black_friday'] = 1

# Black Friday 2014 : 2014-11-28
df_merged_sampled.loc["2014-11-27":"2014-11-29", 'black_friday'] = 1

# Black Friday 2015 : 2015-11-27
df_merged_sampled.loc["2015-11-26":"2015-11-28", 'black_friday'] = 1
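The hard-coded dates can be cross-checked by computing Black Friday as the day after the fourth Thursday of November. The following is a small sketch using only the standard library; `black_friday` and `black_friday_window` are hypothetical helper names, not functions from the notebook.

```python
from datetime import date, timedelta

def black_friday(year):
    """Return the day after the fourth Thursday of November."""
    nov1 = date(year, 11, 1)
    # date.weekday(): Monday=0 ... Thursday=3
    first_thursday = nov1 + timedelta(days=(3 - nov1.weekday()) % 7)
    thanksgiving = first_thursday + timedelta(weeks=3)
    return thanksgiving + timedelta(days=1)

def black_friday_window(year):
    """Return (day before, day after) as ISO strings, usable as .loc slice bounds."""
    bf = black_friday(year)
    return (bf - timedelta(days=1)).isoformat(), (bf + timedelta(days=1)).isoformat()

for year in range(2011, 2016):
    print(year, black_friday(year), black_friday_window(year))
```

Each window could then be applied with something like `df_merged_sampled.loc[start:end, 'black_friday'] = 1`, which avoids typing the five date ranges by hand.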

In [18]:
df_related = df_merged_sampled[["id", "event_name_1", "snap_CA", "snap_TX", "snap_WI", "sell_price", "black_friday"]]

# The Related TS must contain data through Target TS + ForecastHorizon,
# and it must not contain missing values.
df_related = df_related.loc["2014-01-01":"2015-12-30"]

print(len(df_related))
df_related.isnull().sum()

145800


id                   0
event_name_1    134200
snap_CA              0
snap_TX              0
snap_WI              0
sell_price           0
black_friday         0
dtype: int64
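Counting nulls does not reveal whether an item is missing entire days, so a per-item coverage check may be useful as well. The sketch below assumes a frame shaped like `df_related` (a daily `DatetimeIndex` plus an `id` column); `find_coverage_gaps` is a hypothetical helper and the toy data is made up for illustration.

```python
import pandas as pd

def find_coverage_gaps(df, start, end):
    """Return {id: number of missing days} for ids that do not cover start..end daily."""
    expected = pd.date_range(start, end, freq="D")
    gaps = {}
    for item_id, group in df.groupby("id"):
        missing = expected.difference(group.index)
        if len(missing) > 0:
            gaps[item_id] = len(missing)
    return gaps

# Toy data: item "A" covers all 5 days, item "B" is missing one day
days = pd.date_range("2015-12-26", "2015-12-30", freq="D")
toy = pd.concat([
    pd.DataFrame({"id": "A", "sell_price": 1.0}, index=days),
    pd.DataFrame({"id": "B", "sell_price": 2.0}, index=days.drop(days[2])),
])

gaps = find_coverage_gaps(toy, "2015-12-26", "2015-12-30")
has_nulls = bool(toy.isnull().any().any())
print(gaps, has_nulls)
```

An empty dictionary together with `has_nulls == False` would indicate that every item has one complete row per day over the requested range.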

In [19]:
# Fill the NaN values of event_name_1 with "None"
df_related["event_name_1"] = df_related["event_name_1"].fillna("None")

# Suppose an item only went on sale from 2015-07-01: its sell_price data in df_sales then exists only from 2015-07-01 onward.
# Because df_calendar was merged into df_merged, every item's dates span 2011-01-29 to 2016-06-19,
# but that item's dates in df_sales start on 2015-07-01,
# so after merging df_merged with df_sales the item's sell_price before 2015-07-01 is NaN.
# Therefore fill the NaN values of sell_price with 0.
df_related["sell_price"] = df_related["sell_price"].fillna(0)
df_related.isnull().sum()

id              0
event_name_1    0
snap_CA         0
snap_TX         0
snap_WI         0
sell_price      0
black_friday    0
dtype: int64

In [20]:
print(len(df_related)) # 145800 = 200 items * (699 + 30) days
df_related.isnull().sum()

145800


id              0
event_name_1    0
snap_CA         0
snap_TX         0
snap_WI         0
sell_price      0
black_friday    0
dtype: int64

In [21]:
# One-hot encoding for event_name_1
df_related = pd.concat([df_related, pd.get_dummies(df_related['event_name_1'])],axis=1)

In [22]:
print(len(df_related))
df_related.isnull().sum()

145800


id                     0
event_name_1           0
snap_CA                0
snap_TX                0
snap_WI                0
sell_price             0
black_friday           0
Chanukah End           0
Christmas              0
Cinco De Mayo          0
ColumbusDay            0
Easter                 0
Eid al-Fitr            0
EidAlAdha              0
Father's day           0
Halloween              0
IndependenceDay        0
LaborDay               0
LentStart              0
LentWeek2              0
MartinLutherKingDay    0
MemorialDay            0
Mother's day           0
NBAFinalsEnd           0
NBAFinalsStart         0
NewYear                0
None                   0
OrthodoxChristmas      0
OrthodoxEaster         0
Pesach End             0
PresidentsDay          0
Purim End              0
Ramadan starts         0
StPatricksDay          0
SuperBowl              0
Thanksgiving           0
ValentinesDay          0
VeteransDay            0
dtype: int64

In [23]:
# Extract the unique values of event_name_1
all_events = df_related.event_name_1.unique()

# event_name_1 : SuperBowl, LaborDay, Purim End, Easter, StPatricksDay <- add only the five events with the highest sales to Related
chosen_events = ['SuperBowl', 'LaborDay', 'Purim End', 'Easter', 'StPatricksDay']
for event in [event for event in all_events if event not in chosen_events]:
    df_related.drop([event], axis=1, inplace=True)

df_related.drop(["event_name_1"], axis=1, inplace=True)
#df_related.head()
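An alternative to encoding every event and then dropping most of the columns is to create indicator columns only for the chosen events. This is a sketch on a toy frame, not the notebook's data; it also renames `Purim End` to `Purim_End` so the column name lines up with the attribute name used in the related schema, and iterates in sorted order so the columns come out in the same order as the notebook's result.

```python
import pandas as pd

chosen_events = ['SuperBowl', 'LaborDay', 'Purim End', 'Easter', 'StPatricksDay']

# Toy frame standing in for df_related (illustrative values only)
toy = pd.DataFrame({"event_name_1": ["None", "Easter", "Purim End", "Christmas"]})

# One 0/1 column per chosen event, in alphabetical order
for event in sorted(chosen_events):
    column = event.replace(" ", "_")   # "Purim End" -> "Purim_End"
    toy[column] = (toy["event_name_1"] == event).astype(int)

toy = toy.drop(columns=["event_name_1"])
print(toy.columns.tolist())
print(toy)
```

Events outside `chosen_events`, such as `Christmas` in the toy data, simply produce a row of zeros.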

In [24]:
print(len(df_related))
df_related.isnull().sum()

145800


id               0
snap_CA          0
snap_TX          0
snap_WI          0
sell_price       0
black_friday     0
Easter           0
LaborDay         0
Purim End        0
StPatricksDay    0
SuperBowl        0
dtype: int64

In [25]:
df_related = df_related.sort_values(by=["id", "date"])

### Item_metadata (df_item)

In [26]:
df_item = df_merged_sampled[["id", "item_id", "dept_id", "cat_id", "store_id", "state_id"]].drop_duplicates()

In [27]:
len(df_item) #200

200

In [28]:
len(df_item["id"].unique()) # 200

200

In [29]:
df_item.head()

Unnamed: 0_level_0,id,item_id,dept_id,cat_id,store_id,state_id
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2011-01-29,FOODS_3_064_CA_1_validation,FOODS_3_064,FOODS_3,FOODS,CA_1,CA
2011-01-29,FOODS_3_080_CA_1_validation,FOODS_3_080,FOODS_3,FOODS,CA_1,CA
2011-01-29,FOODS_3_090_CA_1_validation,FOODS_3_090,FOODS_3,FOODS,CA_1,CA
2011-01-29,FOODS_3_099_CA_1_validation,FOODS_3_099,FOODS_3,FOODS,CA_1,CA
2011-01-29,FOODS_3_120_CA_1_validation,FOODS_3_120,FOODS_3,FOODS,CA_1,CA


## Make CSV files

In [30]:
# Prepare csv files

!mkdir -p ./train

local_path = "./train/"

target_file_name     = "df_target.csv"
related_file_name    = "df_related.csv"
item_file_name       = "df_item.csv"

local_target     = local_path + target_file_name
local_related    = local_path + related_file_name
local_item       = local_path + item_file_name

df_target.to_csv(local_target, header=False, index=True)
df_related.to_csv(local_related, header=False, index=True)
df_item.to_csv(local_item, header=False, index=False) # exclude the index

# Getting Started with Forecast

Reference: https://github.com/chrisking/ForecastPOC/blob/master/

In [31]:
DATASET_FREQUENCY = "D" # Day
TIMESTAMP_FORMAT = "yyyy-MM-dd"

project = 'walmart_m5'
datasetName= project+'_ds'
datasetGroupName= project +'_dsg'

In [32]:
# Extract the AWS region in which this Jupyter notebook is running
with open('/opt/ml/metadata/resource-metadata.json') as notebook_info:
    data = json.load(notebook_info)
    resource_arn = data['ResourceArn']
    region = resource_arn.split(':')[3]
print(region)

us-east-1
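The region lookup relies on the ARN layout `arn:partition:service:region:account-id:resource`, where the region is the fourth colon-separated field. A minimal sketch of that parsing step follows; `region_from_arn` is a hypothetical helper and the example ARN uses a placeholder account ID and resource name.

```python
def region_from_arn(arn):
    """Return the region field of an ARN (arn:partition:service:region:account:resource)."""
    # Split at most 5 times so that colons inside the resource part are preserved
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError("not an ARN: %r" % arn)
    return parts[3]

# Placeholder ARN for illustration only
example_arn = "arn:aws:sagemaker:us-east-1:123456789012:notebook-instance/example"
print(region_from_arn(example_arn))  # us-east-1
```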


In [33]:
session = boto3.Session(region_name=region)
forecast = session.client(service_name='forecast')
forecast_query = session.client(service_name='forecastquery')

In [34]:
# Get the execution_role so that the SageMaker Jupyter notebook can call the Amazon Forecast API.

from sagemaker import get_execution_role

role_arn = get_execution_role()
print(role_arn)

arn:aws:iam::889750940888:role/forecast-sm-hol-smNotebookRole68693F67-DG5AW9BOQJXX


## 1. Create the Dataset Group

In [38]:
create_dataset_group_response = forecast.create_dataset_group(DatasetGroupName=datasetGroupName,
                                                              Domain="RETAIL",
                                                             )
datasetGroupArn = create_dataset_group_response['DatasetGroupArn']

In [39]:
forecast.describe_dataset_group(DatasetGroupArn=datasetGroupArn)

{'DatasetGroupName': 'walmart_m5_dsg',
 'DatasetGroupArn': 'arn:aws:forecast:us-east-1:889750940888:dataset-group/walmart_m5_dsg',
 'DatasetArns': [],
 'Domain': 'RETAIL',
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 11, 2, 14, 36, 21, 704000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 11, 2, 14, 36, 21, 704000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': 'c8435e4f-6002-45ba-be36-5f83d631e73d',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 02 Nov 2020 14:36:23 GMT',
   'x-amzn-requestid': 'c8435e4f-6002-45ba-be36-5f83d631e73d',
   'content-length': '251',
   'connection': 'keep-alive'},
  'RetryAttempts': 0}}

## 2a. Create the Target Time Series Dataset

In [40]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
      {
         "AttributeName":"demand",
         "AttributeType":"integer"
      }
   ]
}
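Because the CSV files are written without a header, the only link between the file and the schema is column order. The sketch below checks one CSV row against a schema by position; it repeats the target schema so that it runs on its own, and `PARSERS` and `row_matches_schema` are hypothetical names. The timestamp parser assumes the `yyyy-MM-dd` format set in `TIMESTAMP_FORMAT`.

```python
import csv
import io
from datetime import datetime

schema = {"Attributes": [
    {"AttributeName": "timestamp", "AttributeType": "timestamp"},
    {"AttributeName": "item_id", "AttributeType": "string"},
    {"AttributeName": "demand", "AttributeType": "integer"},
]}

# One parser per attribute type; each raises ValueError on a bad value
PARSERS = {
    "timestamp": lambda v: datetime.strptime(v, "%Y-%m-%d"),
    "string": str,
    "integer": int,
    "float": float,
}

def row_matches_schema(row, schema):
    """Return True if the row has one parseable value per schema attribute, in order."""
    attributes = schema["Attributes"]
    if len(row) != len(attributes):
        return False
    try:
        for value, attr in zip(row, attributes):
            PARSERS[attr["AttributeType"]](value)
    except ValueError:
        return False
    return True

# A line in the same layout as df_target.csv (timestamp, id, sales)
sample = io.StringIO("2014-01-01,FOODS_3_064_CA_1_validation,5\n")
first_row = next(csv.reader(sample))
print(row_matches_schema(first_row, schema))  # True
```

Running such a check on the first few lines of each file before uploading may catch a swapped column earlier than a failed import job would.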

In [41]:
target_DSN = datasetName + "_target"

response=forecast.create_dataset(
                    Domain="RETAIL",
                    DatasetType='TARGET_TIME_SERIES',
                    DatasetName=target_DSN,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = schema
)

In [42]:
target_datasetArn = response['DatasetArn']
forecast.describe_dataset(DatasetArn=target_datasetArn)

{'DatasetArn': 'arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_target',
 'DatasetName': 'walmart_m5_ds_target',
 'Domain': 'RETAIL',
 'DatasetType': 'TARGET_TIME_SERIES',
 'DataFrequency': 'D',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'},
   {'AttributeName': 'demand', 'AttributeType': 'integer'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 11, 2, 14, 36, 25, 922000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 11, 2, 14, 36, 25, 922000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '5e95a613-a87a-40d6-803d-e4f7ff00bd30',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 02 Nov 2020 14:36:25 GMT',
   'x-amzn-requestid': '5e95a613-a87a-40d6-803d-e4f7ff00bd30',
   'content-length': '497',
   'connection': 'keep-alive'},
  'RetryAttempts'

## 2b. Target Time Series Dataset Import

In [43]:
# Create S3 Bucket
# {Account Number}-forecastpoc

print(region)
s3 = boto3.client('s3', region_name=region)
account_id = boto3.client('sts').get_caller_identity().get('Account')
bucket_name = account_id + "-forecastpoc"
print(bucket_name)
# Create the bucket once; outside us-east-1 an explicit LocationConstraint is required
if region != "us-east-1":
    s3.create_bucket(Bucket=bucket_name, CreateBucketConfiguration={'LocationConstraint': region})
else:
    s3.create_bucket(Bucket=bucket_name)

us-east-1
889750940888-forecastpoc
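The region branch in the bucket-creation cell can be isolated into a small function that only builds the request arguments, which makes it testable without touching AWS. This is a sketch; `create_bucket_kwargs` is a hypothetical helper and the bucket name uses a placeholder account ID.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for an S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be passed as a LocationConstraint
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

print(create_bucket_kwargs("123456789012-forecastpoc", "us-east-1"))
print(create_bucket_kwargs("123456789012-forecastpoc", "ap-northeast-2"))
```

In the notebook this would be used as `s3.create_bucket(**create_bucket_kwargs(bucket_name, region))`.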


In [44]:
# Upload Target File

s3_path = "walmart"

s3_target     = "s3://" + bucket_name + "/" + s3_path + "/" + target_file_name
s3_related    = "s3://" + bucket_name + "/" + s3_path + "/" + related_file_name
s3_item       = "s3://" + bucket_name + "/" + s3_path + "/" + item_file_name

boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_path + "/" + target_file_name).upload_file(local_target)

In [46]:
# Finally, start the dataset import job (role_arn is the execution role obtained above)
datasetImportJobName = 'DSIMPORT_JOB_TARGET_POC'
ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=target_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":s3_target,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [47]:
ds_import_job_arn=ds_import_job_response['DatasetImportJobArn']
print(ds_import_job_arn)

arn:aws:forecast:us-east-1:889750940888:dataset-import-job/walmart_m5_ds_target/DSIMPORT_JOB_TARGET_POC


In [48]:
#while True:
#    dataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']
#    print(dataImportStatus)
#    if dataImportStatus != 'ACTIVE' and dataImportStatus != 'CREATE_FAILED':
#        sleep(30)
#    else:
#        break

In [49]:
# Attach the dataset just created to the dataset group.
# If it is not attached, the dataset does not appear in the Forecast dataset group.
#response = forecast.update_dataset_group(
#    DatasetGroupArn=datasetGroupArn,
#    DatasetArns=[
#        target_datasetArn
#    ]
#)

## 2c. Create the Related Time Series Dataset

Related Time Series considerations: https://docs.aws.amazon.com/ko_kr/forecast/latest/dg/related-time-series-datasets.html

<img src="../img/related-ts.png" align="left">


In [50]:
# Upload Related File
boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_path + "/" + related_file_name).upload_file(local_related)

In [51]:
df_related.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 145800 entries, 2014-01-01 to 2015-12-30
Data columns (total 11 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   id             145800 non-null  object 
 1   snap_CA        145800 non-null  int64  
 2   snap_TX        145800 non-null  int64  
 3   snap_WI        145800 non-null  int64  
 4   sell_price     145800 non-null  float64
 5   black_friday   145800 non-null  int64  
 6   Easter         145800 non-null  uint8  
 7   LaborDay       145800 non-null  uint8  
 8   Purim End      145800 non-null  uint8  
 9   StPatricksDay  145800 non-null  uint8  
 10  SuperBowl      145800 non-null  uint8  
dtypes: float64(1), int64(4), object(1), uint8(5)
memory usage: 8.5+ MB


In [52]:
# Specify the schema of your dataset here. Make sure the order of columns matches the raw data files.
related_schema ={
   "Attributes":[
      {
         "AttributeName":"timestamp",
         "AttributeType":"timestamp"
      },
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
       {
         "AttributeName":"snap_CA",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"snap_TX",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"snap_WI",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"sell_price",
         "AttributeType":"float"
      },
       {
         "AttributeName":"black_friday",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"Easter",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"LaborDay",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"Purim_End",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"StPatricksDay",
         "AttributeType":"integer"
      },
       {
         "AttributeName":"SuperBowl",
         "AttributeType":"integer"
      }
   ]
}
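Writing a twelve-attribute schema by hand makes it easy to get the order wrong, so it could instead be derived from the column names and dtypes reported by `df_related.info()`. The sketch below is one possible approach; the dtype mapping and the `build_related_schema` helper are assumptions introduced here, and spaces are replaced with underscores in the same way the notebook writes `Purim_End`.

```python
# Assumed mapping from pandas dtype names to Forecast attribute types
DTYPE_TO_ATTRIBUTE_TYPE = {
    "int64": "integer",
    "uint8": "integer",
    "float64": "float",
    "object": "string",
}

def build_related_schema(columns):
    """columns: (column name, dtype name) pairs in CSV order, excluding the date index."""
    attributes = [{"AttributeName": "timestamp", "AttributeType": "timestamp"}]
    for name, dtype in columns:
        # The notebook's "id" column plays the role of item_id in the schema
        attr_name = "item_id" if name == "id" else name.replace(" ", "_")
        attributes.append({
            "AttributeName": attr_name,
            "AttributeType": DTYPE_TO_ATTRIBUTE_TYPE[dtype],
        })
    return {"Attributes": attributes}

# Column names and dtypes copied from the df_related.info() output
columns = [
    ("id", "object"), ("snap_CA", "int64"), ("snap_TX", "int64"), ("snap_WI", "int64"),
    ("sell_price", "float64"), ("black_friday", "int64"), ("Easter", "uint8"),
    ("LaborDay", "uint8"), ("Purim End", "uint8"), ("StPatricksDay", "uint8"),
    ("SuperBowl", "uint8"),
]
generated = build_related_schema(columns)
print([a["AttributeName"] for a in generated["Attributes"]])
```

With a real frame, the pairs could be produced by `[(c, str(t)) for c, t in df_related.dtypes.items()]`.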

In [53]:
related_DSN = datasetName + "_related"
response=forecast.create_dataset(
                    Domain="RETAIL",
                    DatasetType='RELATED_TIME_SERIES',
                    DatasetName=related_DSN,
                    DataFrequency=DATASET_FREQUENCY, 
                    Schema = related_schema
)

In [54]:
related_datasetArn = response['DatasetArn']
print(related_datasetArn)
forecast.describe_dataset(DatasetArn=related_datasetArn)

arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_related


{'DatasetArn': 'arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_related',
 'DatasetName': 'walmart_m5_ds_related',
 'Domain': 'RETAIL',
 'DatasetType': 'RELATED_TIME_SERIES',
 'DataFrequency': 'D',
 'Schema': {'Attributes': [{'AttributeName': 'timestamp',
    'AttributeType': 'timestamp'},
   {'AttributeName': 'item_id', 'AttributeType': 'string'},
   {'AttributeName': 'snap_CA', 'AttributeType': 'integer'},
   {'AttributeName': 'snap_TX', 'AttributeType': 'integer'},
   {'AttributeName': 'snap_WI', 'AttributeType': 'integer'},
   {'AttributeName': 'sell_price', 'AttributeType': 'float'},
   {'AttributeName': 'black_friday', 'AttributeType': 'integer'},
   {'AttributeName': 'Easter', 'AttributeType': 'integer'},
   {'AttributeName': 'LaborDay', 'AttributeType': 'integer'},
   {'AttributeName': 'Purim_End', 'AttributeType': 'integer'},
   {'AttributeName': 'StPatricksDay', 'AttributeType': 'integer'},
   {'AttributeName': 'SuperBowl', 'AttributeType': 'integer'}]},
 'Encry

## 2d. Related Time Series Dataset Import

In [55]:
datasetImportJobName = 'DSIMPORT_JOB_RELATEDPOC'
related_ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=related_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":s3_related,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          },
                                                          TimestampFormat=TIMESTAMP_FORMAT
                                                         )

In [56]:
rel_ds_import_job_arn=related_ds_import_job_response['DatasetImportJobArn']
print(rel_ds_import_job_arn)

arn:aws:forecast:us-east-1:889750940888:dataset-import-job/walmart_m5_ds_related/DSIMPORT_JOB_RELATEDPOC


## 2e. Create the Item Metadata Dataset

In [57]:
# Upload Item Metadata File
boto3.Session().resource('s3').Bucket(bucket_name).Object(s3_path + "/" + item_file_name).upload_file(local_item)

In [58]:
df_item.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 200 entries, 2011-01-29 to 2011-01-29
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        200 non-null    object
 1   item_id   200 non-null    object
 2   dept_id   200 non-null    object
 3   cat_id    200 non-null    object
 4   store_id  200 non-null    object
 5   state_id  200 non-null    object
dtypes: object(6)
memory usage: 10.9+ KB


In [59]:
item_schema ={
   "Attributes":[
      {
         "AttributeName":"item_id",
         "AttributeType":"string"
      },
       {
         "AttributeName":"item_id_not_combined",
         "AttributeType":"string"
      },
       {
         "AttributeName":"dept_id",
         "AttributeType":"string"
      },
       {
         "AttributeName":"cat_id",
         "AttributeType":"string"
      },
       {
         "AttributeName":"store_id",
         "AttributeType":"string"
      },
       {
         "AttributeName":"state_id",
         "AttributeType":"string"
      }
   ]
}

In [60]:
item_DSN = datasetName + "_item"
response=forecast.create_dataset(
                    Domain="RETAIL",
                    DatasetType='ITEM_METADATA',
                    DatasetName=item_DSN,
                    Schema = item_schema
)

In [61]:
item_datasetArn = response['DatasetArn']
print(item_datasetArn)
forecast.describe_dataset(DatasetArn=item_datasetArn)

arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_item


{'DatasetArn': 'arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_item',
 'DatasetName': 'walmart_m5_ds_item',
 'Domain': 'RETAIL',
 'DatasetType': 'ITEM_METADATA',
 'Schema': {'Attributes': [{'AttributeName': 'item_id',
    'AttributeType': 'string'},
   {'AttributeName': 'item_id_not_combined', 'AttributeType': 'string'},
   {'AttributeName': 'dept_id', 'AttributeType': 'string'},
   {'AttributeName': 'cat_id', 'AttributeType': 'string'},
   {'AttributeName': 'store_id', 'AttributeType': 'string'},
   {'AttributeName': 'state_id', 'AttributeType': 'string'}]},
 'EncryptionConfig': {},
 'Status': 'ACTIVE',
 'CreationTime': datetime.datetime(2020, 11, 2, 14, 41, 48, 220000, tzinfo=tzlocal()),
 'LastModificationTime': datetime.datetime(2020, 11, 2, 14, 41, 48, 220000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '35d33e64-fb9d-46ef-b841-6ec1880ac1b4',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-type': 'application/x-amz-json-1.1',
   'date': 'Mon, 02 Nov 202

## 2f. Item Metadata Dataset Import

In [62]:
datasetImportJobName = 'DSIMPORT_JOB_ITEMPOC'
item_ds_import_job_response=forecast.create_dataset_import_job(DatasetImportJobName=datasetImportJobName,
                                                          DatasetArn=item_datasetArn,
                                                          DataSource= {
                                                              "S3Config" : {
                                                                 "Path":s3_item,
                                                                 "RoleArn": role_arn
                                                              } 
                                                          }
                                                         )

In [63]:
item_ds_import_job_arn=item_ds_import_job_response['DatasetImportJobArn']
print(item_ds_import_job_arn)

arn:aws:forecast:us-east-1:889750940888:dataset-import-job/walmart_m5_ds_item/DSIMPORT_JOB_ITEMPOC


## 2g. Check Dataset Import Status (takes 5-10 minutes)

In [64]:
import time

start_time = time.time()

while True:
    TargetdataImportStatus  = forecast.describe_dataset_import_job(DatasetImportJobArn=ds_import_job_arn)['Status']
    RelateddataImportStatus = forecast.describe_dataset_import_job(DatasetImportJobArn=rel_ds_import_job_arn)['Status']
    ItemdataImportStatus    = forecast.describe_dataset_import_job(DatasetImportJobArn=item_ds_import_job_arn)['Status']
    
    print("Dataset {} status : {}".format(target_datasetArn, TargetdataImportStatus))
    print("Dataset {} status : {}".format(related_datasetArn, RelateddataImportStatus))
    print("Dataset {} status : {}".format(item_datasetArn, ItemdataImportStatus))
    print("--------------------------------------------------------------------")
    
    statuses = [TargetdataImportStatus, RelateddataImportStatus, ItemdataImportStatus]
    # Stop once every import is ACTIVE, or as soon as any import has failed
    # (without the failure check, a CREATE_FAILED job would make this loop run forever)
    if all(status == 'ACTIVE' for status in statuses) or 'CREATE_FAILED' in statuses:
        break
    sleep(30)
print('Elapsed time: %f seconds' % (time.time() - start_time))

Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_target status : CREATE_PENDING
Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_related status : CREATE_PENDING
Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_item status : CREATE_PENDING
--------------------------------------------------------------------
Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_target status : CREATE_IN_PROGRESS
Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_related status : CREATE_IN_PROGRESS
Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_item status : CREATE_IN_PROGRESS
--------------------------------------------------------------------
Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_target status : CREATE_IN_PROGRESS
Dataset arn:aws:forecast:us-east-1:889750940888:dataset/walmart_m5_ds_related status : CREATE_IN_PROGRESS
Dataset arn:aws:forecast:us-east-1:
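The polling logic can also be written as a reusable helper with a timeout and an explicit failure path. In this sketch the status source is injected as a function so that it can be exercised with simulated statuses; `wait_for_imports` and its parameters are hypothetical, and only the `ACTIVE` and `CREATE_FAILED` statuses are treated as terminal.

```python
import time

def wait_for_imports(get_statuses, poll_seconds=30, timeout_seconds=3600, sleep=time.sleep):
    """Poll until every status is ACTIVE; raise on CREATE_FAILED or on timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        statuses = get_statuses()   # dict of {job label: status string}
        failed = [name for name, status in statuses.items() if status == "CREATE_FAILED"]
        if failed:
            raise RuntimeError("import failed: %s" % ", ".join(failed))
        if all(status == "ACTIVE" for status in statuses.values()):
            return statuses
        if time.monotonic() >= deadline:
            raise TimeoutError("imports still pending: %r" % statuses)
        sleep(poll_seconds)

# Simulated status sequence standing in for describe_dataset_import_job calls
fake = iter([
    {"target": "CREATE_PENDING", "related": "CREATE_PENDING"},
    {"target": "CREATE_IN_PROGRESS", "related": "ACTIVE"},
    {"target": "ACTIVE", "related": "ACTIVE"},
])
result = wait_for_imports(lambda: next(fake), sleep=lambda seconds: None)
print(result)
```

In the notebook, `get_statuses` would be a function that calls `forecast.describe_dataset_import_job` for each of the three import job ARNs and returns their `Status` values.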

In [65]:
response = forecast.update_dataset_group(
    DatasetGroupArn=datasetGroupArn,
    DatasetArns=[
        target_datasetArn,
        related_datasetArn,
        item_datasetArn
    ]
)

In the Amazon Forecast console, confirm that all three datasets have been imported, as in the screenshot below, and then move on to step "3. Create Predictor".
If the import status is "Failed", check the detailed error message.

<img src="../img/datasets.png" align="left">

# 3. Create Predictor (takes 30-40 minutes)


In [66]:
forecastHorizon = 30 # 30 days
NumberOfBacktestWindows = 4
BackTestWindowOffset = 30
ForecastFrequency = "D"
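To make the offset concrete, the sketch below works out the most recent train/test split for these settings. It assumes that `BackTestWindowOffset` is counted back in days from the end of the target time series (2015-11-30); that reading is an assumption here, and the placement of the other three backtest windows is not derived.

```python
from datetime import date, timedelta

target_end = date(2015, 11, 30)   # last day of the target time series
forecast_horizon = 30
backtest_window_offset = 30

# Training data ends `offset` days before the end of the series
train_end = target_end - timedelta(days=backtest_window_offset)
# The test window starts the next day and spans one forecast horizon
test_start = train_end + timedelta(days=1)
test_end = test_start + timedelta(days=forecast_horizon - 1)

print(train_end, test_start, test_end)  # 2015-10-31 2015-11-01 2015-11-30
```

Under this assumption, with the offset equal to the horizon, the last test window ends exactly on the final day of the target data.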

In [67]:
prophet_algorithmArn = 'arn:aws:forecast:::algorithm/Prophet'
deepAR_Plus_algorithmArn = 'arn:aws:forecast:::algorithm/Deep_AR_Plus'

## 3a. Prophet

In [68]:
# Prophet Specifics
prophet_predictorName= project+'_prophet_algo_1'

In [69]:
# Build Prophet:
prophet_create_predictor_response=forecast.create_predictor(PredictorName=prophet_predictorName, 
                                                  AlgorithmArn=prophet_algorithmArn,
                                                  ForecastHorizon=forecastHorizon,
                                                  PerformAutoML= False,
                                                  PerformHPO=False,
                                                  EvaluationParameters= {"NumberOfBacktestWindows": NumberOfBacktestWindows, 
                                                                         "BackTestWindowOffset": BackTestWindowOffset}, 
                                                  InputDataConfig= {"DatasetGroupArn": datasetGroupArn, "SupplementaryFeatures": [ 
                                                                     { 
                                                                        "Name": "holiday",
                                                                        "Value": "US"
                                                                     }
                                                                  ]},
                                                  FeaturizationConfig= {"ForecastFrequency": ForecastFrequency, 
                                                                        "Featurizations": 
                                                                        [
                                                                          {"AttributeName": "demand", 
                                                                           "FeaturizationPipeline": 
                                                                            [
                                                                              {"FeaturizationMethodName": "filling", 
                                                                               "FeaturizationMethodParameters": 
                                                                                {"frontfill": "none", 
                                                                                 "middlefill": "zero", 
                                                                                 "backfill": "zero"}
                                                                              }
                                                                            ]
                                                                          }
                                                                        ]
                                                                       }
                                                 )
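The Prophet and DeepAR Plus calls differ only in the predictor name, the algorithm ARN, and optional training parameters, so the shared arguments could be built once. The sketch below only assembles the keyword arguments and makes no API call; `build_predictor_kwargs` is a hypothetical helper and the dataset group ARN is a placeholder.

```python
def build_predictor_kwargs(name, algorithm_arn, dataset_group_arn,
                           horizon=30, windows=4, offset=30, frequency="D",
                           training_parameters=None):
    """Assemble create_predictor keyword arguments using the notebook's settings."""
    kwargs = {
        "PredictorName": name,
        "AlgorithmArn": algorithm_arn,
        "ForecastHorizon": horizon,
        "PerformAutoML": False,
        "PerformHPO": False,
        "EvaluationParameters": {"NumberOfBacktestWindows": windows,
                                 "BackTestWindowOffset": offset},
        "InputDataConfig": {"DatasetGroupArn": dataset_group_arn,
                            "SupplementaryFeatures": [{"Name": "holiday", "Value": "US"}]},
        "FeaturizationConfig": {
            "ForecastFrequency": frequency,
            "Featurizations": [{
                "AttributeName": "demand",
                "FeaturizationPipeline": [{
                    "FeaturizationMethodName": "filling",
                    "FeaturizationMethodParameters": {"frontfill": "none",
                                                      "middlefill": "zero",
                                                      "backfill": "zero"},
                }],
            }],
        },
    }
    # Algorithm-specific settings are added only when provided
    if training_parameters:
        kwargs["TrainingParameters"] = training_parameters
    return kwargs

# Placeholder ARN for illustration only
dsg_arn = "arn:aws:forecast:us-east-1:123456789012:dataset-group/example"
prophet_kwargs = build_predictor_kwargs(
    "walmart_m5_prophet_algo_1", "arn:aws:forecast:::algorithm/Prophet", dsg_arn)
deepar_kwargs = build_predictor_kwargs(
    "walmart_m5_deeparplus_algo_1", "arn:aws:forecast:::algorithm/Deep_AR_Plus", dsg_arn,
    training_parameters={"likelihood": "student-t"})
print(sorted(prophet_kwargs))
```

Each dictionary would then be passed as `forecast.create_predictor(**prophet_kwargs)`.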

## 3b. DeepAR Plus

In [70]:
# DeepAR Plus Specifics
deeparplus_predictorName= project+'_deeparplus_algo_1'

In [71]:
# Build DeepAR Plus:
deeparplus_create_predictor_response=forecast.create_predictor(PredictorName=deeparplus_predictorName, 
                                                  AlgorithmArn=deepAR_Plus_algorithmArn,
                                                  ForecastHorizon=forecastHorizon,
                                                  PerformAutoML= False,
                                                  PerformHPO=False,
                                                  #PerformHPO=True,
                                                  EvaluationParameters= {"NumberOfBacktestWindows": NumberOfBacktestWindows, 
                                                                         "BackTestWindowOffset": BackTestWindowOffset}, 
                                                  InputDataConfig= {"DatasetGroupArn": datasetGroupArn, "SupplementaryFeatures": [ 
                                                                     { 
                                                                        "Name": "holiday",
                                                                        "Value": "US"
                                                                     }
                                                                  ]},
                                                  FeaturizationConfig= {"ForecastFrequency": ForecastFrequency, 
                                                                        "Featurizations": 
                                                                        [
                                                                          {"AttributeName": "demand", 
                                                                           "FeaturizationPipeline": 
                                                                            [
                                                                              {"FeaturizationMethodName": "filling", 
                                                                               "FeaturizationMethodParameters": 
                                                                                {"frontfill": "none", 
                                                                                 "middlefill": "zero", 
                                                                                 "backfill": "zero"}
                                                                              }
                                                                            ]
                                                                          }
                                                                        ]
                                                                       },
                                                 TrainingParameters= { 
                                                          "likelihood" : "student-t" 
                                                       }
                                                 )

- Prophet predictor training generally finishes sooner than DeepAR+ training.

## 3c. Check Predictor Creation Status (Be patient)

In [None]:
import time

start_time = time.time()
ProphetArn = prophet_create_predictor_response['PredictorArn']
DeepARPlusArn = deeparplus_create_predictor_response['PredictorArn']

while True:
    ProphetStatus = forecast.describe_predictor(PredictorArn=ProphetArn)['Status']
    DeepARPlusStatus = forecast.describe_predictor(PredictorArn=DeepARPlusArn)['Status']

    print("Predictor {} status : {}".format(ProphetArn, ProphetStatus))
    print("Predictor {} status : {}".format(DeepARPlusArn, DeepARPlusStatus))
    print("--------------------------------------------------------------------")

    # Stop once both predictors are ACTIVE, or as soon as either one fails,
    # so that a CREATE_FAILED predictor does not keep the loop running forever.
    statuses = (ProphetStatus, DeepARPlusStatus)
    if all(s == 'ACTIVE' for s in statuses) or any(s.endswith('FAILED') for s in statuses):
        break
    sleep(30)
print('Elapsed time: %f seconds' % (time.time() - start_time))

Predictor arn:aws:forecast:us-east-1:889750940888:predictor/walmart_m5_prophet_algo_1 status : CREATE_PENDING
Predictor arn:aws:forecast:us-east-1:889750940888:predictor/walmart_m5_deeparplus_algo_1 status : CREATE_PENDING
--------------------------------------------------------------------
Predictor arn:aws:forecast:us-east-1:889750940888:predictor/walmart_m5_prophet_algo_1 status : CREATE_IN_PROGRESS
Predictor arn:aws:forecast:us-east-1:889750940888:predictor/walmart_m5_deeparplus_algo_1 status : CREATE_IN_PROGRESS
--------------------------------------------------------------------
Predictor arn:aws:forecast:us-east-1:889750940888:predictor/walmart_m5_prophet_algo_1 status : CREATE_IN_PROGRESS
Predictor arn:aws:forecast:us-east-1:889750940888:predictor/walmart_m5_deeparplus_algo_1 status : CREATE_IN_PROGRESS
--------------------------------------------------------------------
Predictor arn:aws:forecast:us-east-1:889750940888:predictor/walmart_m5_prophet_algo_1 status : CREATE_IN_PRO

## 3d. Examining the Predictors
- Check the accuracy metrics of each predictor created in Amazon Forecast.
- Reference: https://docs.aws.amazon.com/ko_kr/forecast/latest/dg/metrics.html
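
The metrics can also be read programmatically. Below is a minimal sketch, assuming the response has the shape returned by the boto3 `get_accuracy_metrics` call (`PredictorEvaluationResults` → `TestWindows` → `Metrics`). The helper name `accuracy_metrics_to_frame` and the numbers in `sample_response` are hypothetical and for illustration only; they are not results from this notebook.

```python
import pandas as pd

def accuracy_metrics_to_frame(response):
    """Flatten a get_accuracy_metrics-style response into one row per test window and quantile."""
    rows = []
    for result in response["PredictorEvaluationResults"]:
        for window in result["TestWindows"]:
            metrics = window["Metrics"]
            for wql in metrics.get("WeightedQuantileLosses", []):
                rows.append({
                    "AlgorithmArn": result.get("AlgorithmArn"),
                    "EvaluationType": window["EvaluationType"],
                    "RMSE": metrics.get("RMSE"),
                    "Quantile": wql["Quantile"],
                    "wQL": wql["LossValue"],
                })
    return pd.DataFrame(rows)

# Hand-written stand-in for a real response; all numbers are made up.
sample_response = {
    "PredictorEvaluationResults": [{
        "AlgorithmArn": "arn:aws:forecast:::algorithm/Prophet",
        "TestWindows": [{
            "EvaluationType": "SUMMARY",
            "Metrics": {
                "RMSE": 12.5,
                "WeightedQuantileLosses": [
                    {"Quantile": 0.9, "LossValue": 0.20},
                    {"Quantile": 0.5, "LossValue": 0.35},
                    {"Quantile": 0.1, "LossValue": 0.15},
                ],
            },
        }],
    }]
}

metrics_df = accuracy_metrics_to_frame(sample_response)
print(metrics_df)
```

In the notebook itself, something like `accuracy_metrics_to_frame(forecast.get_accuracy_metrics(PredictorArn=ProphetArn))` would produce the same table for the Prophet predictor, and likewise for `DeepARPlusArn`, which makes the two predictors easy to compare side by side.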

# 4. Create Forecast (takes about 10 minutes)
- Create a forecast for each predictor.
- Reference: "Step 3: Create a Forecast" in https://docs.aws.amazon.com/ko_kr/forecast/latest/dg/gs-console.html
- ForecastTypes : The quantiles at which probabilistic forecasts are generated. You can currently specify up to 5 quantiles per forecast. Accepted values include 0.01 to 0.99 (increments of .01 only) and mean. The mean forecast is different from the median (0.50) when the distribution is not symmetric (for example, Beta and Negative Binomial). The default value is ["0.1", "0.5", "0.9"].
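
The rules quoted above can be checked locally before calling the service. This is a minimal sketch based only on that description: the helper name `validate_forecast_types` is hypothetical, and treating `"mean"` as not counting toward the five-quantile limit is an assumption. The service itself remains the authority on what it accepts.

```python
import math

def validate_forecast_types(forecast_types, max_quantiles=5):
    """Check a ForecastTypes list against the rules quoted in the text above."""
    quantiles = [t for t in forecast_types if t != "mean"]
    if len(quantiles) > max_quantiles:
        raise ValueError("at most {} quantiles are allowed, got {}".format(max_quantiles, len(quantiles)))
    for q in quantiles:
        value = float(q)  # raises ValueError for non-numeric strings
        if not math.isfinite(value):
            raise ValueError("quantile {!r} is not a finite number".format(q))
        hundredths = round(value * 100)
        # Must lie in 0.01..0.99 and be a multiple of 0.01 (tolerating float rounding).
        if not 1 <= hundredths <= 99 or abs(value * 100 - hundredths) > 1e-9:
            raise ValueError("quantile {!r} must be 0.01 to 0.99 in steps of 0.01".format(q))
    return list(forecast_types)

# The list used in this notebook passes the check.
print(validate_forecast_types(["0.1", "0.5", "0.9", "mean"]))
```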

## 4a. Create Prophet, DeepAR+ Forecast

In [None]:
deeparplus_forecastName= project+'_deepAR_algo_forecast'
prophet_forecastname= project+'_prophet_algo_forecast'
ForecastTypes=["0.1", "0.5", "0.9", "mean"]

**DeepAR+**

In [None]:
create_forecast_response=forecast.create_forecast(ForecastName=deeparplus_forecastName,
                                                  ForecastTypes=ForecastTypes,
                                                  PredictorArn=DeepARPlusArn
                                                 )
deeparplus_forecastArn = create_forecast_response['ForecastArn']

In [None]:
deeparplus_forecastArn

**Prophet**

In [None]:
create_forecast_response=forecast.create_forecast(ForecastName=prophet_forecastname,
                                                  ForecastTypes=ForecastTypes,
                                                  PredictorArn = ProphetArn
                                                 )
prophet_forecastArn = create_forecast_response['ForecastArn']

In [None]:
prophet_forecastArn

## 4b. Check Forecast Creation Status

In [None]:
import time

start_time = time.time()
while True:
    deeparplus_forecast_status = forecast.describe_forecast(ForecastArn=deeparplus_forecastArn)['Status']
    prophet_forecast_status = forecast.describe_forecast(ForecastArn=prophet_forecastArn)['Status']

    print("Forecast {} status : {}".format(deeparplus_forecastArn, deeparplus_forecast_status))
    print("Forecast {} status : {}".format(prophet_forecastArn, prophet_forecast_status))
    print("--------------------------------------------------------------------")

    # Stop once both forecasts are ACTIVE, or as soon as either one fails,
    # so that a CREATE_FAILED forecast does not keep the loop running forever.
    statuses = (deeparplus_forecast_status, prophet_forecast_status)
    if all(s == 'ACTIVE' for s in statuses) or any(s.endswith('FAILED') for s in statuses):
        break
    sleep(30)
print('Elapsed time: %f seconds' % (time.time() - start_time))
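
The predictor and forecast status checks share the same polling pattern, so it could be factored into one helper. The following is a sketch under assumptions: `wait_for_status` is a hypothetical name, the default timeout is arbitrary, and the demo uses a simulated `describe` function instead of a real AWS call.

```python
import time

def wait_for_status(describe, poll_seconds=30, timeout_seconds=3 * 60 * 60, sleep=time.sleep):
    """Poll describe() until its 'Status' is ACTIVE; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = describe()["Status"]
        if status == "ACTIVE":
            return status
        if status.endswith("FAILED"):
            raise RuntimeError("resource ended in status {}".format(status))
        if time.monotonic() >= deadline:
            raise TimeoutError("still {} after {} seconds".format(status, timeout_seconds))
        sleep(poll_seconds)

# Simulated status sequence standing in for a real describe_* call.
_fake_statuses = iter(["CREATE_PENDING", "CREATE_IN_PROGRESS", "ACTIVE"])
print(wait_for_status(lambda: {"Status": next(_fake_statuses)}, sleep=lambda seconds: None))
```

With the real client, the call would look like `wait_for_status(lambda: forecast.describe_forecast(ForecastArn=prophet_forecastArn))`. Raising on a `*_FAILED` status and on timeout keeps a failed job from blocking the notebook indefinitely.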

## 4c. Get Forecast & Visualization
Once a forecast has been created for each predictor, the p10, p50, p90, and mean values can be inspected per id.
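
For a tabular view of those values, the prediction lists in a query response can be combined into one DataFrame. This is a minimal sketch, assuming the response shape that the plotting code in this section relies on (`Forecast` → `Predictions` → lists of `Timestamp`/`Value` entries). The helper name `predictions_to_frame` and the values in `sample_response` are hypothetical.

```python
import pandas as pd

def predictions_to_frame(response):
    """Combine the per-type prediction lists of a query_forecast-style response into one wide DataFrame."""
    predictions = response["Forecast"]["Predictions"]
    columns = {}
    for name, points in predictions.items():
        frame = pd.DataFrame(points)
        # One Series per forecast type, indexed by the parsed timestamp.
        columns[name] = pd.Series(frame["Value"].values, index=pd.to_datetime(frame["Timestamp"]))
    return pd.DataFrame(columns)

# Hand-written stand-in for a real response; all values are made up.
sample_response = {
    "Forecast": {
        "Predictions": {
            "p10":  [{"Timestamp": "2015-12-01T00:00:00", "Value": 1.0},
                     {"Timestamp": "2015-12-02T00:00:00", "Value": 2.0}],
            "p50":  [{"Timestamp": "2015-12-01T00:00:00", "Value": 3.0},
                     {"Timestamp": "2015-12-02T00:00:00", "Value": 4.0}],
            "p90":  [{"Timestamp": "2015-12-01T00:00:00", "Value": 6.0},
                     {"Timestamp": "2015-12-02T00:00:00", "Value": 7.0}],
            "mean": [{"Timestamp": "2015-12-01T00:00:00", "Value": 3.5},
                     {"Timestamp": "2015-12-02T00:00:00", "Value": 4.5}],
        }
    }
}

forecast_df = predictions_to_frame(sample_response)
print(forecast_df)
```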

In [None]:
# Top 10 of the 200 sampled items
sampled[["id", "sales_total"]].head(10)

In [None]:
# Worst 10 of the 200 sampled items
sampled[["id", "sales_total"]].tail(10)

In [None]:
def get_forecast(item_id):
    """Plot observed sales against the p10/p50/p90/mean forecasts of one item, once per predictor."""
    for forecastArn in [deeparplus_forecastArn, prophet_forecastArn]:
        forecastResponse = forecast_query.query_forecast(
                            ForecastArn=forecastArn,
                            Filters={"item_id": item_id}
                            )
        predictions = forecastResponse['Forecast']['Predictions']

        # Build one single-column DataFrame per forecast type, indexed by timestamp
        series = {}
        for name in ['mean', 'p10', 'p50', 'p90']:
            df = pd.DataFrame(predictions[name])
            df.Timestamp = pd.to_datetime(df.Timestamp)
            series[name] = df.set_index("Timestamp").rename(columns={'Value': name})

        # Show about half a month of observations on either side of the forecast horizon
        margin = timedelta(days=0.5 * 365/12)
        plot_start_date = (series['mean'].index.min() - margin).strftime('%Y-%m-%d')
        plot_end_date   = (series['mean'].index.max() + margin).strftime('%Y-%m-%d')

        observations = df_merged[df_merged["id"] == item_id].loc[plot_start_date:plot_end_date].sales

        fig = plt.figure(figsize=(20, 5))

        plt.title("Forecast for {}, Predictor : {}".format(item_id, forecastArn))
        plt.plot(observations, color='gray', linewidth=1, label="observation")
        for name in ['p90', 'mean', 'p50', 'p10']:
            plt.plot(series[name], label=name)
        plt.axvline(x=datetime(2015, 12, 25), color='r', linestyle='--', linewidth=2) # Vertical line marking Christmas
        plt.legend()

### Top 5 of the Sample

In [None]:
sampled[["id", "sales_total"]].head()

In [None]:
sampled.id.head(5)

In [None]:
for item in sampled.id.head(5):
    get_forecast(item)

### Worst 5 of the Sample

In [None]:
for item in sampled.id.tail(5):
    get_forecast(item)

# Discussion
- How can model performance be improved?
- Was the grouping approach (Best200 within dept_id = FOODS_3) appropriate?
- Do the Related Time Series actually affect the target?
- Does each Item Metadata attribute describe the item well enough?

# Resource Termination
- Delete Forecasts
- Delete Predictors
- Delete Dataset Import Jobs
- Delete Datasets
- Delete Dataset Group
- Delete CloudFormation Stack
- Delete S3 Bucket
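
The Amazon Forecast part of this list can be scripted in the same order. Below is a sketch under assumptions: the helper names are hypothetical, `client` is expected to be a boto3 `forecast` client, and the ARN lists must be supplied by the caller. Deleting the CloudFormation stack and the S3 bucket is not covered here.

```python
import time

def wait_until_gone(describe, not_found_error, poll_seconds=10, sleep=time.sleep):
    """Poll describe() until it raises not_found_error, meaning the resource is deleted."""
    while True:
        try:
            describe()
        except not_found_error:
            return
        sleep(poll_seconds)

def delete_forecast_resources(client, forecast_arns, predictor_arns, import_job_arns,
                              dataset_arns, dataset_group_arn, **wait_kwargs):
    """Delete resources in the order listed above, waiting for each level to disappear."""
    steps = [
        ("ForecastArn", forecast_arns, client.delete_forecast, client.describe_forecast),
        ("PredictorArn", predictor_arns, client.delete_predictor, client.describe_predictor),
        ("DatasetImportJobArn", import_job_arns,
         client.delete_dataset_import_job, client.describe_dataset_import_job),
        ("DatasetArn", dataset_arns, client.delete_dataset, client.describe_dataset),
        ("DatasetGroupArn", [dataset_group_arn],
         client.delete_dataset_group, client.describe_dataset_group),
    ]
    not_found = client.exceptions.ResourceNotFoundException
    for key, arns, delete, describe in steps:
        for arn in arns:
            delete(**{key: arn})
        # Deletion is asynchronous, so wait before moving on to the next level.
        for arn in arns:
            wait_until_gone(lambda: describe(**{key: arn}), not_found, **wait_kwargs)
```

In this notebook the call might look like `delete_forecast_resources(forecast, [deeparplus_forecastArn, prophet_forecastArn], [DeepARPlusArn, ProphetArn], import_job_arns, dataset_arns, datasetGroupArn)`, where `import_job_arns` and `dataset_arns` stand for the ARNs created in the earlier data import steps.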