Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation
---

This repository uses simulated orange juice sales data from [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/) to walk you through the process of training many models and forecasting on Azure Machine Learning. 

This notebook walks you through all the necessary steps to configure the data for this solution accelerator, including:

1. Download the sample data
2. Split it into training and forecasting sets
3. Connect to your workspace and upload the data to its Datastore

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](00_Setup_AML_Workspace.ipynb) notebook, you are all set.


## 1.0 Download sample data

The time series data used in this example was simulated based on the University of Chicago's Dominick's Finer Foods dataset, which featured two years of sales of 3 different orange juice brands for individual stores. You can learn more about the dataset [here](https://azure.microsoft.com/services/open-datasets/catalog/sample-oj-sales-simulated/). 

The full dataset includes simulated sales for 3,991 stores with 3 orange juice brands each, thus allowing 11,973 models to be trained to showcase the power of the many models pattern. Each series contains data from '1990-06-14' to '1992-10-01'.
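
The figures above can be checked with a few lines of arithmetic. The sketch below uses only the Python standard library and assumes one observation per week between the first and last date; that weekly spacing is an assumption suggested by the `WeekStarting` column used later in this notebook, not something the dataset description states.

```python
from datetime import date

n_stores = 3991
n_brands = 3
n_series = n_stores * n_brands  # one model per store/brand combination

first_week = date(1990, 6, 14)
last_week = date(1992, 10, 1)

# Assumption: one row per week from first_week to last_week, inclusive
n_weeks = (last_week - first_week).days // 7 + 1

print(f"{n_series} series, up to {n_weeks} weekly observations each")
```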

You'll need the `azureml-opendatasets` package to download the data. You can install it with the following:

In [1]:
%pip install azureml-opendatasets

Note: you may need to restart the kernel to use updated packages.


We'll start by downloading only a subset of the files (the first `dataset_maxfiles`, set to 137 in the next cell), but you can easily edit the code below to download all 11,973 files and train a model for each.

In [16]:
dataset_maxfiles = 137 # Set to 11973 or 0 to get all the files

In [17]:
import os
from azureml.opendatasets import OjSalesSimulated

# Pull all of the data
oj_sales_files = OjSalesSimulated.get_file_dataset()

# Pull only the first `dataset_maxfiles` files
if dataset_maxfiles:
    oj_sales_files = oj_sales_files.take(dataset_maxfiles)

# Create a folder to download
target_path = 'oj_sales_data' 
os.makedirs(target_path, exist_ok=True)

# Download the data
oj_sales_files.download(target_path, overwrite=True)

['/mnt/batch/tasks/shared/LS_root/mounts/clusters/tgokal1/code/Users/tgokal/solution-accelerator-many-models-v2/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1000_dominicks.csv',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/tgokal1/code/Users/tgokal/solution-accelerator-many-models-v2/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1000_minute.maid.csv',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/tgokal1/code/Users/tgokal/solution-accelerator-many-models-v2/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1000_tropicana.csv',
 '/mnt/batch/tasks/shared/LS_root/mounts/clusters/tgokal1/code/Users/tgokal/solution-accelerator-many-models-v2/oj_sales_data/https%3A/%2Fazureopendatastorage.azurefd.net/ojsales-simulatedcontainer/oj_sales_data/Store1001_dominicks.csv',
 '/mnt/batch/tasks/shared/LS_root/mounts/clust
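
The downloaded file names listed above look like `Store<id>_<brand>.csv`. As a minimal sketch, and assuming that naming pattern holds for every file (it is inferred from the listing, not documented here), the store and brand of a series could be recovered from its path as follows. The helper name `parse_series_name` is illustrative.

```python
import os

def parse_series_name(file_path):
    """Return (store_id, brand) from a path such as '.../Store1000_minute.maid.csv'."""
    # Drop the directory and the final extension; the brand itself may contain a dot
    stem, _ = os.path.splitext(os.path.basename(file_path))
    store_part, brand = stem.split("_", 1)
    if not store_part.startswith("Store"):
        raise ValueError(f"Unexpected file name: {file_path}")
    return int(store_part[len("Store"):]), brand

print(parse_series_name("oj_sales_data/Store1000_minute.maid.csv"))
```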

## 2.0 Split data in two sets

We will now split each file into two parts: one for training and one for simulating batch forecasting. The training files will contain the records before '1992-05-28', and the remainder of each series will be stored in the inference files.

Finally, we will upload both sets of data files to the Workspace's default [Datastore](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)).
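
To make the split rule concrete, here is a small sketch on generated timestamps rather than the real files. It assumes weekly data over the date range described earlier and assumes the time column is held as ISO-formatted text, as it would be when a CSV is read without date parsing. Under those assumptions, rows strictly before the split date go to training and the row on the split date itself goes to inference.

```python
from datetime import date, timedelta

# Generated weekly timestamps standing in for one series (not the real data)
first_week = date(1990, 6, 14)
weeks = [(first_week + timedelta(weeks=i)).isoformat() for i in range(121)]

split_date = "1992-05-28"

# Same rule as a boolean mask `column < split_date`
train_weeks = [w for w in weeks if w < split_date]
inference_weeks = [w for w in weeks if w >= split_date]

print(len(train_weeks), "training rows, last:", train_weeks[-1])
print(len(inference_weeks), "inference rows, first:", inference_weeks[0])

# Pitfall: text comparison is character by character, so a date written without
# zero padding sorts incorrectly. October is wrongly placed before May here.
print("1992-10-01" < "1992-5-28")
```

Because the comparison is textual under these assumptions, the split date is best written in the same zero-padded `YYYY-MM-DD` form as the data.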

In [29]:
def write_file(data, path, extension):
    if extension == ".parquet":
        data.to_parquet(path)
    else:
        data.to_csv(path, index=None, header=True)

In [28]:
def read_file(path, extension):
    if extension == ".parquet":
        return pd.read_parquet(path)
    else:
        return pd.read_csv(path)

In [30]:
def split_data(data_path, time_column_name, split_date):
    """Split every CSV file under data_path into train/ and inference/ copies."""

    train_data_path = os.path.join(data_path, "train")
    inference_data_path = os.path.join(data_path, "inference")
    os.makedirs(train_data_path, exist_ok=True)
    os.makedirs(inference_data_path, exist_ok=True)

    # Collect the source files, skipping the two output folders so reruns do not re-split them
    files_list = [os.path.join(path, f) for path, _, files in os.walk(data_path) for f in files
                  if path not in (train_data_path, inference_data_path)]

    for file in files_list:
        file_name = os.path.basename(file)
        file_extension = os.path.splitext(file_name)[1].lower()
        # Check the actual extension instead of searching the whole path for '.csv'
        if file_extension == '.csv':
            df = read_file(file, file_extension)
            # Rows strictly before split_date go to training; the rest go to inference
            before_split_date = df[time_column_name] < split_date
            train_df, inference_df = df[before_split_date], df[~before_split_date]
            write_file(train_df, os.path.join(train_data_path, file_name), file_extension)
            write_file(inference_df, os.path.join(inference_data_path, file_name), file_extension)

    return train_data_path, inference_data_path

In [34]:
import pandas as pd

timestamp_column = 'WeekStarting'
split_date = '1992-05-28'
target_path = "oj_sales_data"

train_path, inference_path = split_data(target_path, timestamp_column, split_date)
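
After the split, each source file should have one counterpart in `train` and one in `inference`. A minimal sanity check, sketched here on throwaway folders so it runs without the downloaded data (the helper name `compare_split_folders` is illustrative), could look like this:

```python
import os
import tempfile

def compare_split_folders(train_path, inference_path):
    """Return (common, only_train, only_inference) as sorted lists of file names."""
    train_files = set(os.listdir(train_path))
    inference_files = set(os.listdir(inference_path))
    return (sorted(train_files & inference_files),
            sorted(train_files - inference_files),
            sorted(inference_files - train_files))

# Demonstration on temporary folders containing empty placeholder files
with tempfile.TemporaryDirectory() as root:
    demo_train = os.path.join(root, "train")
    demo_inference = os.path.join(root, "inference")
    os.makedirs(demo_train)
    os.makedirs(demo_inference)
    for name in ("Store1000_dominicks.csv", "Store1000_tropicana.csv"):
        for folder in (demo_train, demo_inference):
            open(os.path.join(folder, name), "w").close()
    common, only_train, only_inference = compare_split_folders(demo_train, demo_inference)

print(len(common), "files in both folders;", only_train, only_inference)
```

In this notebook the same function could be called as `compare_split_folders(train_path, inference_path)`; two empty lists would indicate that no file is missing from either side.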

## 3.0 Upload data to Datastore in AML Workspace

In the [setup notebook](00_Setup_AML_Workspace.ipynb) you created a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace). We are going to register the data in that environment.

In [35]:
import os
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command, Input
from azure.ai.ml.entities import (
    AzureBlobDatastore,
    AzureFileDatastore,
    AzureDataLakeGen1Datastore,
    AzureDataLakeGen2Datastore,
)
from azure.ai.ml.entities import Environment

subscription_id=os.getenv("SUBSCRIPTION_ID", default="80a3336a-33ac-4098-a7e7-64eb71d80cee")
resource_group=os.getenv("RESOURCE_GROUP", default="tgrgml")

ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group)

for ws in ml_client.workspaces.list():
    print(ws.name, ":", ws.location, ":", ws.description)

workspace = "mlw-basic-prod-202209110348"

mlw-basic-prod-202209110348 : australiaeast : This example shows how to create a basic workspace


In [36]:
ml_client = MLClient(DefaultAzureCredential(), 
                     subscription_id, 
                     resource_group, 
                     workspace)

We will upload both sets of data files to your Workspace's default [Datastore](https://docs.microsoft.com/azure/machine-learning/how-to-access-data).
A Datastore is a storage location attached to the Workspace; data stored there can then be accessed for training or forecasting. Refer to the [Datastore documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)) for how to access data from a Datastore.
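
With the `azure.ai.ml` SDK used in this notebook, data that already sits in a datastore is usually addressed by an `azureml://` URI. The sketch below only builds such a string in its short form; it does not call the service. The helper name `datastore_uri` and the relative path are illustrative assumptions, while `workspaceblobstore` is the datastore name that appears in the upload outputs later in this notebook.

```python
def datastore_uri(datastore_name, relative_path):
    """Build a short-form URI for a path inside a datastore registered in the workspace."""
    # Strip a leading slash so the result does not contain '/paths//'
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"

# Hypothetical folder inside the default blob datastore
print(datastore_uri("workspaceblobstore", "oj_sales_data/train/"))
```

A URI of this form could then be passed as the `path` of a `Data` asset instead of a local path, assuming the files have already been copied to that location.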

#TODO

## 4.0 Upload to Datastore and register datasets

The last step is creating and registering [datasets](https://docs.microsoft.com/azure/machine-learning/concept-data#datasets) in Azure Machine Learning for the train and inference sets.

Using a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset) is currently the best way to take advantage of the many models pattern, so we create FileDatasets in the next cell. We then [register](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets#register-datasets) the FileDatasets in your Workspace; this associates the train/inference sets with simple names that can be easily referred to later on when we train models and produce forecasts.

In [37]:
# from azure.ai.ml.entities import Data
# from azure.ai.ml.constants import AssetTypes

# #Upload training data
# train_data = Data(
#     path="./data/train",
#     type=AssetTypes.URI_FILE,
#     description="Training Dataset",
#     name="train",
#     version="1",
# )

# ml_client.data.create_or_update(train_data)

#TODO: register 10 train and 10 inference files to platform. Upload through datastore

Uploading train (59.74 MBs): 100%|██████████| 59743107/59743107 [01:40<00:00, 594263.41i

HttpResponseError: (UserError) A data version with this name and version already exists. If you are trying to create a new data version, use a different name or version. If you are trying to update an existing data version, the existing asset's data type, data uri cannot be changed. Only tags, description, and isArchived can be updated.
Code: UserError
Message: A data version with this name and version already exists. If you are trying to create a new data version, use a different name or version. If you are trying to update an existing data version, the existing asset's data type, data uri cannot be changed. Only tags, description, and isArchived can be updated.
Additional Information:Type: ComponentName
Info: {
    "value": "managementfrontend"
}Type: Correlation
Info: {
    "value": {
        "operation": "1f1082f29b6516e6dea7a34ea1483ce5",
        "request": "00bca2f2c691624c"
    }
}Type: Environment
Info: {
    "value": "australiaeast"
}Type: Location
Info: {
    "value": "australiaeast"
}Type: Time
Info: {
    "value": "2022-10-16T11:06:10.3548915+00:00"
}Type: InnerError
Info: {
    "value": {
        "code": "Immutable",
        "innerError": {
            "code": "DataVersionPropertyImmutable",
            "innerError": null
        }
    }
}Type: MessageFormat
Info: {
    "value": "A data version with this name and version already exists. If you are trying to create a new data version, use a different name or version. If you are trying to update an existing data version, the existing asset's {property} cannot be changed. Only tags, description, and isArchived can be updated."
}Type: MessageParameters
Info: {
    "value": {
        "property": "data type, data uri"
    }
}
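
The error above says that a data asset named `train` with version `1` is already registered, and that the type and URI of a registered version cannot be changed. One way around it is to register the data under a new version; the later cells in this notebook omit `version` altogether, which appears to leave the choice to the service. As a minimal sketch of the first option, assuming versions are integer strings, the helper below (a hypothetical name, not part of the SDK) picks the next free one. The list of existing versions is typed in by hand here; in practice it would have to be read from the workspace.

```python
def next_version(existing_versions):
    """Return the next integer version as a string, ignoring non-numeric labels."""
    numeric = [int(v) for v in existing_versions if str(v).isdigit()]
    return str(max(numeric, default=0) + 1)

# Hand-written examples of versions that might already be registered
print(next_version(["1"]))
print(next_version([]))
```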

In [94]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

#Upload inference data
inference_data = Data(
    path="./data/upload_inference_data/dataset.csv",
    type=AssetTypes.URI_FILE,
    description="Inference Dataset",
    name="inference",
    version="1",
)

ml_client.data.create_or_update(inference_data)

Uploading dataset.csv (< 1 MB): 100%|██████████| 1.23M/1.23M [00:00<00:00, 

Data({'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'inference', 'description': 'Inference Dataset', 'tags': {}, 'properties': {}, 'id': '/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourceGroups/tgrgml/providers/Microsoft.MachineLearningServices/workspaces/mlw-basic-prod-202209110348/data/inference/versions/1', 'base_path': './', 'creation_context': <azure.ai.ml._restclient.v2022_05_01.models._models_py3.SystemData object at 0x7faaadf29ba0>, 'serialize': <msrest.serialization.Serializer object at 0x7faaae0635b0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348/datastores/workspaceblobstore/paths/LocalUpload/149e7bbf78efa09ced745dc85d7b5963/dataset.csv', 'referenced_uris': None})

In [38]:
# Creating a datastore and uploading data
blob_credless_datastore = AzureBlobDatastore(
    name="automl_datastore_scale",
    description="Datastore",
    account_name="mlwbasicstoragecf2cc6d6e",
    container_name="data-container"
)
ml_client.create_or_update(blob_credless_datastore)


AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'automl_datastore_scale', 'description': 'Datastore', 'tags': {}, 'properties': {}, 'id': '/subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourceGroups/tgrgml/providers/Microsoft.MachineLearningServices/workspaces/mlw-basic-prod-202209110348/datastores/automl_datastore_scale', 'base_path': './', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f864ca004f0>, 'credentials': <azure.ai.ml.entities._datastore.credentials.NoneCredentials object at 0x7f8642133910>, 'container_name': 'data-container', 'account_name': 'mlwbasicstoragecf2cc6d6e', 'endpoint': 'core.windows.net', 'protocol': 'https'})

In [2]:
import mltable

tbl = mltable.load("./data/train")
df = tbl.to_pandas_dataframe()
df.head(5)

Unnamed: 0,WeekStarting,Store,Brand,Quantity,Advert,Price,Age60,COLLEGE,INCOME,Hincome150,Large HH,Minorities,WorkingWoman,SSTRDIST,SSTRVOL,CPDIST5,CPWVOL5
0,1990-06-14,2,dominicks,10560,1,1.59,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.3769266129999999
1,1990-06-14,2,minute.maid,4480,0,3.17,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.3769266129999999
2,1990-06-14,2,tropicana,8256,0,3.87,0.232864734,0.248934934,10.55320518,0.463887065,0.103953406,0.114279949,0.303585347,2.110122129,1.142857143,1.927279669,0.3769266129999999
3,1990-06-14,5,dominicks,1792,1,1.59,0.117368032,0.32122573,10.92237097,0.535883355,0.103091585,0.053875277,0.410568032,3.801997814,0.681818182,1.600573425,0.736306837
4,1990-06-14,5,minute.maid,4224,0,2.99,0.117368032,0.32122573,10.92237097,0.535883355,0.103091585,0.053875277,0.410568032,3.801997814,0.681818182,1.600573425,0.736306837


In [7]:
# get datastore uri from local data path
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

#how to point to the newly created datastore?

csv_train_data = Data(
    path="./data/upload_train_data/train_dataset.csv",
    type=AssetTypes.URI_FILE,
    description="CSV train data",
    name="v2_csv_train_urifile",
)

csv_train_data = ml_client.data.create_or_update(csv_train_data)
print(csv_train_data.path)

Uploading train_dataset.csv (< 1 MB): 100%|██████████| 167k/167k [00:00<00:00, 

azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348/datastores/workspaceblobstore/paths/LocalUpload/eafe7ae8c73eece97c196bf64d73dab5/train_dataset.csv


In [8]:
csv_inference_data = Data(
    path="./data/upload_inference_data/inference_dataset.csv",
    type=AssetTypes.URI_FILE,
    description="CSV inference data",
    name="v2_csv_inference_urifile",
)

csv_inference_data = ml_client.data.create_or_update(csv_inference_data)
print(csv_inference_data.path)

Uploading inference_dataset.csv (< 1 MB): 100%|██████████| 5.98k/5.98k [00:00<00:00,

azureml://subscriptions/80a3336a-33ac-4098-a7e7-64eb71d80cee/resourcegroups/tgrgml/workspaces/mlw-basic-prod-202209110348/datastores/workspaceblobstore/paths/LocalUpload/150a448efa43dab8b2f2e277486656d8/inference_dataset.csv


## Next Steps

Now that you have created your datasets, you are ready to move to one of the training notebooks to train and score the models:

- Automated ML: please open [02_AutoML_Training_Pipeline.ipynb](Automated_ML/02_AutoML_Training_Pipeline/02_AutoML_Training_Pipeline.ipynb).
- Custom Script: please open [02_CustomScript_Training_Pipeline.ipynb](Custom_Script/02_CustomScript_Training_Pipeline.ipynb).