Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation
---
In this notebook we use a dummy sales forecasting dataset containing made-up weekly historical sales figures per SKU, covering a couple of years. The original many models solution accelerator repo uses simulated orange juice sales data. To use that dataset instead, please refer to the original many models GitHub repo: https://github.com/microsoft/solution-accelerator-many-models

The dataset is very simple: a Week column followed by one sales column per SKU (Week | Sku01 | Sku02 | Sku03 ... and so on). For the many models solution accelerator to work with this dataset, we must generate one file per SKU so that we can train one model per SKU. Each file will have the following three-column format: Week | Sku | Sales
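
The reshaping described above can be illustrated with a minimal sketch. This is not the implementation of `process_denormalized_skus` (the helper's code is not shown in this notebook); the function name `split_wide_sales` and the sample values are made up for illustration, and only the Python standard library is used.

```python
import csv
import io


def split_wide_sales(wide_csv_text, timestamp_col="Week"):
    """Turn a wide Week | Sku01 | Sku02 ... table into per-SKU rows.

    Returns a dict mapping each SKU column name to a list of
    (week, sku, sales) tuples, i.e. the Week | Sku | Sales layout.
    Hypothetical helper, for illustration only.
    """
    reader = csv.DictReader(io.StringIO(wide_csv_text))
    # Every column except the timestamp column is treated as one SKU
    sku_columns = [c for c in reader.fieldnames if c != timestamp_col]
    per_sku = {sku: [] for sku in sku_columns}
    for row in reader:
        for sku in sku_columns:
            per_sku[sku].append((row[timestamp_col], sku, row[sku]))
    return per_sku


# Made-up sample data in the wide format
wide = "Week,Sku01,Sku02\n04/05/2020,10,7\n11/05/2020,12,9\n"
per_sku = split_wide_sales(wide)
for sku, rows in per_sku.items():
    print(sku, rows)
```

In the notebook itself, each per-SKU group is written to its own file, which is what allows one model to be trained per SKU.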

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](00_Setup_AML_Workspace.ipynb) notebook you are all set.


In [9]:
import os
import sys

# Make the helper module in the 'scripts' folder importable
sys.path.append('scripts')
from helper import process_denormalized_skus
from helper import split_data

In [2]:
data_folder = 'data'
processed_folder = 'processed_data'
os.makedirs(data_folder, exist_ok=True)
os.makedirs(data_folder + "/" + processed_folder, exist_ok=True)

In [3]:
denormalized_skus = 'historical_sales.csv'
file_extension = 'csv'

In [12]:
timestamp_col = 'Week'
split_date = '18/05/2020'  # day/month/year, i.e. 18 May 2020

In [5]:
files_created = process_denormalized_skus(timestamp_col, data_folder, processed_folder, denormalized_skus, file_extension)


In [6]:
print(files_created)

{'files_created': 5}


## 2.0 Split data in two sets

We will now split each dataset in two parts: one will be used for training, and the other will be used for simulating batch forecasting. The training files will contain the records dated before the split date set above (split_date = '18/05/2020'), and the remaining, most recent part of each series will be stored in the inferencing files.
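
As a rough sketch of a date-based split like the one described above (not the actual `split_data` helper, whose code is not shown here), the rows of one series can be partitioned around a cutoff date. The function name `split_rows_by_date`, the day-first date format, the sample rows, and the choice to place the cutoff week itself in the inference set are all assumptions made for illustration.

```python
from datetime import datetime

# Assumption: dates are day-first, matching split_date = '18/05/2020'
DATE_FORMAT = "%d/%m/%Y"


def split_rows_by_date(rows, split_date, date_format=DATE_FORMAT):
    """Partition (week, sku, sales) rows into train and inference lists.

    Rows dated strictly before split_date go to train; the rest
    (including the split date itself, by assumption) go to inference.
    """
    cutoff = datetime.strptime(split_date, date_format)
    train, inference = [], []
    for week, sku, sales in rows:
        if datetime.strptime(week, date_format) < cutoff:
            train.append((week, sku, sales))
        else:
            inference.append((week, sku, sales))
    return train, inference


# Made-up sample rows for a single SKU
rows = [
    ("04/05/2020", "Sku01", 10),
    ("11/05/2020", "Sku01", 12),
    ("18/05/2020", "Sku01", 11),
    ("25/05/2020", "Sku01", 14),
]
train_rows, inference_rows = split_rows_by_date(rows, "18/05/2020")
print(len(train_rows), "training rows;", len(inference_rows), "inference rows")
```

Comparing parsed dates rather than raw strings matters here: day-first strings such as '04/05/2020' do not sort chronologically as text.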

Finally, we will upload both sets of data files to the Workspace's default [Datastore](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)).

In [7]:
target_path = data_folder + "/" + processed_folder

In [13]:
train_path, inference_path = split_data(target_path, timestamp_col, split_date)

## 3.0 Upload data to Datastore in AML Workspace

In the [setup notebook](00_Setup_AML_Workspace.ipynb) you created a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace). We are going to register the data in that environment.

In [22]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


{'id': '/subscriptions/051aa254-957d-4431-a6df-6caa8963bdd7/resourceGroups/manymodelsfatos/providers/Microsoft.MachineLearningServices/workspaces/mmaml',
 'name': 'mmaml',
 'location': 'westeurope',
 'type': 'Microsoft.MachineLearningServices/workspaces',
 'sku': 'Enterprise',
 'workspaceid': '64638bba-4983-4100-9dd1-f411313ac67a',
 'description': '',
 'friendlyName': 'mmaml',
 'creationTime': '2020-09-15T13:55:07.5489274+00:00',
 'containerRegistry': '/subscriptions/051aa254-957d-4431-a6df-6caa8963bdd7/resourceGroups/manymodelsfatos/providers/Microsoft.ContainerRegistry/registries/64638bba498341009dd1f411313ac67a',
 'keyVault': '/subscriptions/051aa254-957d-4431-a6df-6caa8963bdd7/resourcegroups/manymodelsfatos/providers/microsoft.keyvault/vaults/kvxwcqolhnxvyp6',
 'applicationInsights': '/subscriptions/051aa254-957d-4431-a6df-6caa8963bdd7/resourcegroups/manymodelsfatos/providers/microsoft.insights/components/aixwcqolhnxvyp6',
 'identityPrincipalId': '612c3456-be85-408a-9c77-e9cd62a730

In [23]:
target_path

'data/processed_data'

We will upload both sets of data files to your Workspace's default [Datastore](https://docs.microsoft.com/azure/machine-learning/how-to-access-data). 
A Datastore is a place where data can be stored that is then made accessible for training or forecasting. Please refer to [Datastore documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)) on how to access data from Datastore.

In [24]:
# Connect to default datastore
datastore = ws.get_default_datastore()

# Upload train data
ds_train_path = target_path + '_train'
datastore.upload(src_dir=train_path, target_path=ds_train_path, overwrite=True)

# Upload inference data
ds_inference_path = target_path + '_inference'
datastore.upload(src_dir=inference_path, target_path=ds_inference_path, overwrite=True)

Uploading an estimated of 5 files
Uploading data/processed_data\upload_train_data\sku01.csv
Uploaded data/processed_data\upload_train_data\sku01.csv, 1 files out of an estimated total of 5
Uploading data/processed_data\upload_train_data\sku02.csv
Uploaded data/processed_data\upload_train_data\sku02.csv, 2 files out of an estimated total of 5
Uploading data/processed_data\upload_train_data\sku03.csv
Uploaded data/processed_data\upload_train_data\sku03.csv, 3 files out of an estimated total of 5
Uploading data/processed_data\upload_train_data\sku04.csv
Uploaded data/processed_data\upload_train_data\sku04.csv, 4 files out of an estimated total of 5
Uploading data/processed_data\upload_train_data\sku05.csv
Uploaded data/processed_data\upload_train_data\sku05.csv, 5 files out of an estimated total of 5
Uploaded 5 files
Uploading an estimated of 5 files
Uploading data/processed_data\upload_inference_data\sku01.csv
Uploaded data/processed_data\upload_inference_data\sku01.csv, 1 files out of a

$AZUREML_DATAREFERENCE_2a8558320b064331a817296439e1f39a

## 4.0 Register dataset in AML Workspace

The last step is creating and registering [datasets](https://docs.microsoft.com/azure/machine-learning/concept-data#datasets) in Azure Machine Learning for the train and inference sets.

Using a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset) is currently the best way to take advantage of the many models pattern, so we create FileDatasets in the next cell. We then [register](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets#register-datasets) the FileDatasets in your Workspace; this associates the train/inference sets with simple names that can be easily referred to later on when we train models and produce forecasts.

In [25]:
from azureml.core.dataset import Dataset

# Create file datasets
ds_train = Dataset.File.from_files(path=datastore.path(ds_train_path), validate=False)
ds_inference = Dataset.File.from_files(path=datastore.path(ds_inference_path), validate=False)

# Register the file datasets
dataset_name = target_path
train_dataset_name = dataset_name + '_train'
inference_dataset_name = dataset_name + '_inference'
ds_train.register(ws, train_dataset_name, create_new_version=True)
ds_inference.register(ws, inference_dataset_name, create_new_version=True)

{
  "source": [
    "('workspaceblobstore', 'data/processed_data_inference')"
  ],
  "definition": [
    "GetDatastoreFiles"
  ],
  "registration": {
    "id": "e554bcd7-12e1-45e2-9122-8a5f36687028",
    "name": "data/processed_data_inference",
    "version": 1,
    "workspace": "Workspace.create(name='mmaml', subscription_id='051aa254-957d-4431-a6df-6caa8963bdd7', resource_group='manymodelsfatos')"
  }
}
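
Once registered, the datasets can be looked up by name from another notebook. The sketch below assumes the azureml-core SDK used in this notebook, a `Workspace` object `ws`, and the `_train` / `_inference` naming from the registration cell; the function names `registered_names` and `load_registered_datasets` are hypothetical helpers introduced only for this example.

```python
def registered_names(dataset_name):
    """Build the train/inference dataset names used at registration time."""
    return dataset_name + "_train", dataset_name + "_inference"


def load_registered_datasets(ws, dataset_name):
    """Fetch the registered FileDatasets by name (latest version by default).

    Requires the azureml-core SDK and a Workspace object; the import is
    done lazily so the naming helper above can be used without the SDK.
    """
    from azureml.core.dataset import Dataset

    train_name, inference_name = registered_names(dataset_name)
    ds_train = Dataset.get_by_name(ws, name=train_name)
    ds_inference = Dataset.get_by_name(ws, name=inference_name)
    return ds_train, ds_inference


names = registered_names("data/processed_data")
print(names)  # ('data/processed_data_train', 'data/processed_data_inference')
```

With a workspace available, `load_registered_datasets(ws, 'data/processed_data')` would return the two datasets registered above.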

## Next Steps

Now that you have created your datasets, you are ready to move to one of the training notebooks to train and score the models:

- Automated ML: please open [02_AutoML_Training_Pipeline.ipynb](Automated_ML/02_AutoML_Training_Pipeline/02_AutoML_Training_Pipeline.ipynb).
- Custom Script: please open [02_CustomScript_Training_Pipeline.ipynb](Custom_Script/02_CustomScript_Training_Pipeline.ipynb).