Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation
---

This repository uses simulated orange juice sales data from [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/) to walk you through the process of training many models and forecasting on Azure Machine Learning. 

This notebook walks you through all the necessary steps to configure the data for this solution accelerator, including:

1. Download the sample data
2. Split the data into training and forecasting sets
3. Connect to your workspace and upload the data to its Datastore

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](00_Setup_AML_Workspace.ipynb) notebook, you are all set.


## 1.0 Download sample data

The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. You can learn more about the dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/sample-oj-sales-simulated/). 

The full dataset includes simulated sales for 3,991 stores with 3 orange juice brands each, thus allowing 11,973 models to be trained to showcase the power of the many models pattern. Each series contains data from '1990-06-14' to '1992-10-01'.

You'll need the `azureml-opendatasets` package to download the data. You can install it with `pip install azureml-opendatasets`.

We'll start by downloading the first 10 files but you can easily edit the code below to train all 11,973 models.
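
The download cell itself is not shown here. A minimal sketch of what it could look like, assuming the `azureml-opendatasets` package is installed and that its `OjSalesSimulated` class provides `get_file_dataset()`, with `take()` and `download()` on the returned file dataset. The function name `download_oj_sales` and its `max_files` parameter are hypothetical names chosen for illustration, and the default target folder reuses the `data` path from the split cell further down:

```python
import os

DATASET_MAXFILES = 10  # set to 0 to download every file in the dataset


def download_oj_sales(target_path="data", max_files=DATASET_MAXFILES):
    """Download the simulated OJ sales files into target_path (hypothetical helper)."""
    # Imported lazily so this module still loads where the package is not installed.
    from azureml.opendatasets import OjSalesSimulated

    oj_sales_files = OjSalesSimulated.get_file_dataset()
    if max_files:
        # Keep only the first `max_files` files.
        oj_sales_files = oj_sales_files.take(max_files)

    os.makedirs(target_path, exist_ok=True)
    oj_sales_files.download(target_path, overwrite=True)
    return target_path
```

Calling `download_oj_sales()` would fetch the first 10 files, while `download_oj_sales(max_files=0)` would fetch all of them.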

## 2.0 Split data in two sets

We will now split each dataset into two parts: one will be used for training, and the other will be used for simulating batch forecasting. The training files will contain the records dated before the `split_date` set in the cell below, and the remainder of each series will be stored in the inferencing files.

Finally, we will upload both sets of data files to the Workspace's default [Datastore](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)).

In [85]:
from scripts.helper import split_data
import pandas as pd

timestamp_column = 'WeekStarting'
split_date = '1991-05-28'
target_path = "data"

train_path, inference_path = split_data(target_path, timestamp_column, split_date)
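
The `split_data` helper lives in `scripts/helper.py` and is not shown in this notebook. To illustrate the idea of a date-based split, here is a minimal standard-library sketch. The function `split_rows_by_date` and the sample rows are made up for illustration and are not the helper's actual implementation; treating the split date as an exclusive upper bound for the training set is an assumption based on the word "before" in the description above:

```python
from datetime import datetime


def split_rows_by_date(rows, timestamp_column, split_date, date_format="%Y-%m-%d"):
    """Split rows into (train, inference) lists around split_date.

    Rows dated strictly before split_date go to train; all others go to inference.
    """
    cutoff = datetime.strptime(split_date, date_format)
    train, inference = [], []
    for row in rows:
        when = datetime.strptime(row[timestamp_column], date_format)
        (train if when < cutoff else inference).append(row)
    return train, inference


# Illustrative rows only; the quantities are invented.
rows = [
    {"WeekStarting": "1990-06-14", "Quantity": 100},
    {"WeekStarting": "1991-05-23", "Quantity": 120},
    {"WeekStarting": "1991-05-30", "Quantity": 90},
    {"WeekStarting": "1992-10-01", "Quantity": 110},
]

train_rows, inference_rows = split_rows_by_date(rows, "WeekStarting", "1991-05-28")
print(len(train_rows), len(inference_rows))  # prints: 2 2
```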

## 3.0 Upload data to Datastore in AML Workspace

In the [setup notebook](00_Setup_AML_Workspace.ipynb) you created a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace). We are going to register the data in that environment.

In [4]:
import os
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command, Input
from azure.ai.ml.entities import (
    AzureBlobDatastore,
    AzureFileDatastore,
    AzureDataLakeGen1Datastore,
    AzureDataLakeGen2Datastore,
)
from azure.ai.ml.entities import Environment

subscription_id=os.getenv("SUBSCRIPTION_ID", default="80a3336a-33ac-4098-a7e7-64eb71d80cee")
resource_group=os.getenv("RESOURCE_GROUP", default="tgrgml")

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group)

for ws in ml_client.workspaces.list():
    print(ws.name, ":", ws.location, ":", ws.description)

workspace = "mlw-basic-prod-202209110348"


mlw-basic-prod-202209110348 : australiaeast : This example shows how to create a basic workspace
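
The cell above falls back to hard-coded defaults when `SUBSCRIPTION_ID` or `RESOURCE_GROUP` is not set. If you would rather fail fast than connect with defaults that belong to another subscription, a minimal sketch follows. The helper name `require_env` and the variable name `MANY_MODELS_DEMO_UNSET_VARIABLE` are hypothetical and not part of the Azure SDK:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstrate the failure path with a variable that is guaranteed to be unset.
os.environ.pop("MANY_MODELS_DEMO_UNSET_VARIABLE", None)
try:
    require_env("MANY_MODELS_DEMO_UNSET_VARIABLE")
except RuntimeError as err:
    message = str(err)
    print(message)
```

In the notebook you could then write `subscription_id = require_env("SUBSCRIPTION_ID")` in place of the `os.getenv(..., default=...)` call.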


In [5]:
ml_client = MLClient(DefaultAzureCredential(), 
                     subscription_id, 
                     resource_group, 
                     workspace)

We will upload both sets of data files to your Workspace's default [Datastore](https://docs.microsoft.com/azure/machine-learning/how-to-access-data). 
A Datastore is a storage location whose data can then be accessed for training or forecasting. See the [Datastore documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)) for how to access data from a Datastore.

#TODO

In [None]:
# IGNORE - FOR REFERENCE USING V1 
# Connect to default datastore
# datastore = ws.get_default_datastore()

# # Upload train data
# ds_train_path = target_path + '_train'
# datastore.upload(src_dir=train_path, target_path=ds_train_path, overwrite=True)

# # Upload inference data
# ds_inference_path = target_path + '_inference'
# datastore.upload(src_dir=inference_path, target_path=ds_inference_path, overwrite=True)

## 4.0 Register dataset in AML Workspace

The last step is creating and registering [datasets](https://docs.microsoft.com/azure/machine-learning/concept-data#datasets) in Azure Machine Learning for the train and inference sets.

With the v1 SDK, using a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset) is the best way to take advantage of the many models pattern. The cells below use the v2 SDK (`azure.ai.ml`) instead, so they create `Data` assets of type `uri_file`. Creating each asset [registers](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets#register-datasets) it in your Workspace; this associates the train/inference sets with simple names that can be easily referred to later on when we train models and produce forecasts.

In [93]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Upload training data
train_data = Data(
    path="./data/upload_train_data/dataset.csv",
    type=AssetTypes.URI_FILE,
    description="Training Dataset",
    name="train",
    version="1",
)

ml_client.data.create_or_update(train_data)

Uploading dataset.csv (< 1 MB): 100%|██████████| 357k/357k [00:00<00:00, 

Data({'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'train', 'description': 'Training Dataset', 'tags': {}, 'properties': {}, 'id': '/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourceGroups/tgrgml/providers/Microsoft.MachineLearningServices/workspaces/mlw-basic-prod-202209110348/data/train/versions/1', 'base_path': './', 'creation_context': <azure.ai.ml._restclient.v2022_05_01.models._models_py3.SystemData object at 0x7faaadf43370>, 'serialize': <msrest.serialization.Serializer object at 0x7faaadf2b2b0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348/datastores/workspaceblobstore/paths/LocalUpload/47ff199392829392b91ee6967b996c54/dataset.csv', 'referenced_uris': None})

In [94]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Upload inference data
inference_data = Data(
    path="./data/upload_inference_data/dataset.csv",
    type=AssetTypes.URI_FILE,
    description="Inference Dataset",
    name="inference",
    version="1",
)

ml_client.data.create_or_update(inference_data)

Uploading dataset.csv (< 1 MB): 100%|██████████| 1.23M/1.23M [00:00<00:00, 

Data({'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'inference', 'description': 'Inference Dataset', 'tags': {}, 'properties': {}, 'id': '/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourceGroups/tgrgml/providers/Microsoft.MachineLearningServices/workspaces/mlw-basic-prod-202209110348/data/inference/versions/1', 'base_path': './', 'creation_context': <azure.ai.ml._restclient.v2022_05_01.models._models_py3.SystemData object at 0x7faaadf29ba0>, 'serialize': <msrest.serialization.Serializer object at 0x7faaae0635b0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348/datastores/workspaceblobstore/paths/LocalUpload/149e7bbf78efa09ced745dc85d7b5963/dataset.csv', 'referenced_uris': None})

In [6]:
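# Create a blob datastore without credentials. No account key or SAS token is passed,
# even though the description string below mentions a SAS token.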
blob_credless_datastore = AzureBlobDatastore(
    name="automl_datastore",
    description="Datastore pointing to a blob container using SAS token.",
    account_name="mlwbasicstoragecf2cc6d6e",
    container_name="data-container"
)
ml_client.create_or_update(blob_credless_datastore)


AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'automl_datastore', 'description': 'Datastore pointing to a blob container using SAS token.', 'tags': {}, 'properties': {}, 'id': '/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourceGroups/tgrgml/providers/Microsoft.MachineLearningServices/workspaces/mlw-basic-prod-202209110348/datastores/automl_datastore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/tgokal1/code/Users/tgokal/solution-accelerator-many-models-v2', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7ff289042d00>, 'credentials': <azure.ai.ml.entities._credentials.NoneCredentialConfiguration object at 0x7ff289046070>, 'container_name': 'data-container', 'account_name': 'mlwbasicstoragecf2cc6d6e', 'endpoint': 'core.windows.net', 'protocol': 'https'})

In [13]:
import mltable

tbl = mltable.load("./data/upload_train_data")
df = tbl.to_pandas_dataframe()
df.head(5)

| Unnamed: 0 | WeekStarting | Store | Brand | Quantity | Advert | Price | Revenue |
|---|---|---|---|---|---|---|---|
| 0 | 1990/09/30 8:26:35 AM | 3916 | minute.maid | 12923 | 1 | 2.45 | 31661.35 |
| 1 | 1990/12/25 1:37:57 PM | 1040 | minute.maid | 18841 | 1 | 2.31 | 43522.71 |
| 2 | 1990/08/21 8:07:38 PM | 1428 | minute.maid | 24185 | 1 | 2.21 | 36690.2 |
| 3 | 1990/11/10 9:38:40 AM | 1765 | tropicana | 20438 | 1 | 2.27 | 39428.9 |
| 4 | 1990/10/25 7:25:47 AM | 2826 | minute.maid | 16284 | 0 | 2.67 | 47434.3 |


In [17]:
# Register a local CSV file as a data asset and print the datastore URI it is uploaded to
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Open question: how to point this asset at the newly created datastore?

csv_train_data = Data(
    path="./data/upload_train_data/train_dataset.csv",
    type=AssetTypes.URI_FILE,
    description="CSV train data",
    name="v2_csv_train_urifile",
)

csv_train_data = ml_client.data.create_or_update(csv_train_data)
print(csv_train_data.path)
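
One way to point a `Data` asset at a specific datastore is to give `path` a URI of the form `azureml://datastores/<datastore_name>/paths/<relative_path>` instead of a local path. A minimal sketch of building such a URI follows. The helper name `datastore_uri` is hypothetical, and the example assumes the file already exists at `train/train_dataset.csv` in the container behind `automl_datastore`, which the cells above do not arrange:

```python
def datastore_uri(datastore_name, relative_path):
    """Build a short-form Azure ML datastore URI (hypothetical helper)."""
    if not datastore_name:
        raise ValueError("datastore_name must not be empty")
    # Normalise Windows separators and drop leading slashes so that
    # exactly one '/' follows 'paths'.
    clean_path = relative_path.replace("\\", "/").lstrip("/")
    return f"azureml://datastores/{datastore_name}/paths/{clean_path}"


uri = datastore_uri("automl_datastore", "/train/train_dataset.csv")
print(uri)  # prints: azureml://datastores/automl_datastore/paths/train/train_dataset.csv
```

The resulting string could then be passed as `path=` when constructing a `Data` entity.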

In [18]:
csv_inference_data = Data(
    path="./data/upload_inference_data/inference_dataset.csv",
    type=AssetTypes.URI_FILE,
    description="CSV inference data",
    name="v2_csv_inference_urifile",
)

csv_inference_data = ml_client.data.create_or_update(csv_inference_data)
print(csv_inference_data.path)

Uploading inference_dataset.csv (< 1 MB): 100%|██████████| 1.23M/1.23M [00:00<00:00, 

azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348/datastores/workspaceblobstore/paths/LocalUpload/149e7bbf78efa09ced745dc85d7b5963/inference_dataset.csv
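
The printed path is a long-form `azureml://` URI that encodes the subscription, resource group, workspace, datastore, and the file's relative path. A minimal sketch of pulling those parts out with a regular expression follows. The function name `parse_azureml_path` is hypothetical, and the pattern only covers the layout shown in the output above:

```python
import re

_AZUREML_PATH = re.compile(
    r"^azureml://subscriptions/(?P<subscription>[^/]+)"
    r"/resourcegroups/(?P<resource_group>[^/]+)"
    r"/workspaces/(?P<workspace>[^/]+)"
    r"/datastores/(?P<datastore>[^/]+)"
    r"/paths/(?P<path>.+)$",
    re.IGNORECASE,
)


def parse_azureml_path(uri):
    """Split a long-form azureml:// path into its named components."""
    match = _AZUREML_PATH.match(uri)
    if match is None:
        raise ValueError(f"Not a long-form azureml:// path: {uri!r}")
    return match.groupdict()


parts = parse_azureml_path(
    "azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml"
    "/workspaces/mlw-basic-prod-202209110348/datastores/workspaceblobstore"
    "/paths/LocalUpload/149e7bbf78efa09ced745dc85d7b5963/inference_dataset.csv"
)
print(parts["datastore"])
print(parts["path"])
```

Here `parts["datastore"]` shows that the file was uploaded to the workspace's default blob store rather than to `automl_datastore`.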


## Next Steps

Now that you have created your datasets, you are ready to move to one of the training notebooks to train and score the models:

- Automated ML: please open [02_AutoML_Training_Pipeline.ipynb](Automated_ML/02_AutoML_Training_Pipeline/02_AutoML_Training_Pipeline.ipynb).
- Custom Script: please open [02_CustomScript_Training_Pipeline.ipynb](Custom_Script/02_CustomScript_Training_Pipeline.ipynb).