## INITIALIZE DATA FOR PREDICTING FLIGHT DELAYS

First, let's understand what we are going to do and what Azure ML is.

<img src="images/level1-azureml.png" width="900" height="450" />

The data is a CSV file that contains flight delay information for 2015. This CSV is stored in Azure Blob Storage, a storage service for raw data (such as a CSV file).

We have already created a Compute Instance. By executing this notebook, it will download the data from a public Azure Blob Storage account and load it into an Azure Machine Learning dataset.
We will then train a model and deploy it as an Azure service so that we can make predictions on future data. In Azure ML this service is called an Endpoint, and behind it Azure uses Kubernetes services. To deploy the model, we need a configuration file that explains how to create the service. All the configuration files here are .yaml files, and they are used to create environments (where your application runs, like a Docker environment), components (reusable pieces of code that have inputs, outputs and parameters, and that perform jobs such as training or deploying a model) or endpoints.

<br>
<br>
<br>
<br>
<br>
<br>

Let's take a closer look:

<img src="images/level2-azureml.png" width="900" height="450" />

Stay focused on the main line in the middle. We now see Compute Clusters instead of a Compute Instance. Compute clusters are like Kubernetes clusters: they allow multiple nodes and therefore multiple executions/jobs, which is why they are better suited to running components. We now also see Environments, an essential part because every job requires one.

* In this tutorial, we created a compute instance to play around with our notebook and quickly test our code

* To begin with, we are going to use this compute instance to download our data (this notebook)

* Additionally, we are going to create our own custom environment for our jobs

* After this, we will create components for analysis and training

* Ultimately, we will create our endpoint for our trained model

# Prepare the DataSet for our pipeline
The goal of this notebook is to download the data from a public Azure Blob Storage account and load it into Azure Machine Learning.

Be sure to log in the first time you open a notebook on the compute in Azure ML Studio (the prompt should appear at the top).

<img src="images/LogIn.png" width="900" height="280" />

Once you are logged in to the compute, set the values between the "" in the code below (they are used to get a handle to our workspace).

To get the values, click in the top left:

<img src="images/credentials.png" width="900" height="380" />

And get your:
* Subscription ID (for subscription_id)
* Resource Group (for resource_group_name)
* Current Workspace (for workspace_name)

<img src="images/values.png" width="582" height="911" />

The tenant ID is the Directory ID. You can find it here: https://portal.azure.com/#settings/directory

<img src="images/tenantID.png" width="600" height="185" />

Get the:
* Directory ID (for tenant_id)

# Set Up the Variables in the NoteBook (IMPORTANT)

We install the azure-ai-ml package into our Compute Instance environment.

In [1]:
%pip install azure-ai-ml

Collecting azure-ai-ml
  Downloading azure_ai_ml-1.2.0-py3-none-any.whl (4.0 MB)
Collecting pydash<6.0.0
  Downloading pydash-5.1.2-py3-none-any.whl (84 kB)
Collecting azure-storage-file-datalake<13.0.0
  Downloading azure_storage_file_datalake-12.9.1-py3-none-any.whl (238 kB)
Collecting strictyaml<2.0.0
  Downloading strictyaml-1.6.2.tar.gz (130 kB)
Collecting marshmallow<4.0.0,>=3.5
  Downloading marshmallow-3.19.0-py3-none-any.whl (49 kB)
Collecting azure-storage-file-share<13.0.0
  Downloading azure_storage_file_share-12.10.1-py3-none-any.whl (252 kB)
Collectin

TODO: Replace with your own values

In [3]:
%%writefile setenv.py
import os

# TODO: Replace with your own values
# You can find this information in Azure ML Studio and in the Azure portal, see above for details

os.environ['subscription_id'] = "" # this will look like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
os.environ['resource_group'] = "" # this will look like "rg-xxx-xxx"
os.environ['workspace_name'] = "flights-mlbox" # this will look like "flights-mlbox"

os.environ['owner'] = "user" # this is your user name or your email address
os.environ['tenant_id'] = "" # this will look like xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Writing setenv.py
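
Before going further, it can help to check that the values in setenv.py were actually filled in. The following is a minimal sketch and not part of the original notebook: the helper name `check_settings` is hypothetical, and it only assumes that subscription_id and tenant_id are GUIDs, as the placeholder comments above suggest.

```python
import uuid

# Hypothetical helper, not part of the original notebook.
REQUIRED_KEYS = ["subscription_id", "resource_group", "workspace_name", "owner", "tenant_id"]
GUID_KEYS = ["subscription_id", "tenant_id"]


def check_settings(env):
    """Return a list of human-readable problems found in a settings mapping."""
    problems = []
    # Every setting must be present and non-blank.
    for key in REQUIRED_KEYS:
        if not env.get(key, "").strip():
            problems.append(f"{key} is empty")
    # Settings documented as xxxxxxxx-xxxx-... must parse as a GUID.
    for key in GUID_KEYS:
        value = env.get(key, "").strip()
        if value:
            try:
                uuid.UUID(value)
            except ValueError:
                problems.append(f"{key} is not a valid GUID")
    return problems
```

Once the variables have been loaded into the notebook (for example with `%run setenv.py`), calling `check_settings(os.environ)` should return an empty list if everything is set.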


Gather the environment variables and get a handle to our workspace

In [4]:
# Authentication package
from azure.identity import DefaultAzureCredential
import os
from azure.ai.ml import MLClient
credential = DefaultAzureCredential()

# Execute the script to load the variables into os.environ
%run setenv.py

# Also write the variables to a shell script so they can be exported in a terminal
env_keys = ["subscription_id", "resource_group", "workspace_name", "owner", "tenant_id"]
with open("setenv.sh", "w") as file:
    for key in env_keys:
        file.write(f"export {key}={os.environ[key]}\n")

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id=os.environ['subscription_id'],
    resource_group_name=os.environ['resource_group'],
    workspace_name=os.environ['workspace_name'],
)
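
The setenv.sh file written above places each value directly after the `=` sign. That works for the example values, but a value containing a space or another shell-special character would break the `export` line. As a hedged sketch (the helper name `export_line` is hypothetical and not part of the original notebook), the standard library's `shlex.quote` can be used to quote values for a POSIX shell:

```python
import shlex


def export_line(key, value):
    """Build one shell `export` line, quoting the value for POSIX shells."""
    # shlex.quote leaves simple values untouched and wraps the others in single quotes.
    return f"export {key}={shlex.quote(value)}\n"
```

Each line of setenv.sh could then be produced with `export_line(key, os.environ[key])` instead of plain string concatenation.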

* Checking Credentials

ml_client is lazy: creating it does not connect to the workspace, so invalid credentials would go unnoticed until the first real call. Run this cell to make sure your credentials are correct:

In [5]:
# Check if credentials are valid
from IPython.display import Image, display
from colorama import Fore

try:
    # MLClient is lazy, so make a real call to Azure to exercise the credentials
    ml_client.begin_create_or_update(ml_client.workspaces.get())
    print(Fore.GREEN + "Credentials are valid")
except Exception:
    print(Fore.RED + "Credentials are invalid - please check the TODO CELL")
    print("Please check your credentials : subscription_id, resource_group_name, workspace_name must be correct")
    display(Image(filename='images/credentials.PNG'))
    print("You can find your credentials by clicking on the TOP LEFT of the Azure Portal ML Studio")
    display(Image(filename='images/values.png'))


Credentials are valid


We download all the data we need from a public blob storage account to local files, then we create a dataset that contains every file.

In [6]:
from azure.ai.ml.entities import Data
import os
from azure.storage.blob import BlobServiceClient
from azure.ai.ml.constants import AssetTypes
import shutil

dataset_dir = "./Datasets"

os.makedirs(dataset_dir, exist_ok=True)

account_url = "https://publicdataflights.blob.core.windows.net/"

blob_service_client = BlobServiceClient(account_url=account_url)


# Download every blob in the container to a local file
container_name = 'dataflights'
container_client = blob_service_client.get_container_client(container_name)
blob_list = container_client.list_blobs()
for blob in blob_list:
    print(blob.name)
    blob_client = blob_service_client.get_blob_client(container_name, blob=blob.name)
    with open(os.path.join(dataset_dir, blob.name), "wb") as my_blob:
        blob_data = blob_client.download_blob()
        blob_data.readinto(my_blob)
        print("\tBlob '{}' downloaded".format(blob.name))


# Create a dataset from the local folder

dataset_name = "dataset-delays-flights"

dataset = Data(
    name=dataset_name,
    description="Dataset for delays flights prediction 2015",
    tags={"ama_owner": "romain.caret"},
    type=AssetTypes.URI_FOLDER,
    path=dataset_dir,
)

dataset = ml_client.data.create_or_update(dataset)

print(
    f"Dataset with name {dataset.name} is registered to workspace, the dataset version is {dataset.version}"
)

# Delete the local files

if os.path.isdir(dataset_dir):
    shutil.rmtree(dataset_dir)

airline_codes.csv
	Blob 'airline_codes.csv' downloaded
airlines.csv
	Blob 'airlines.csv' downloaded
airports.csv
	Blob 'airports.csv' downloaded
flights.csv
	Blob 'flights.csv' downloaded
Dataset with name dataset-delays-flights is registered to workspace, the dataset version is 1


Your file exceeds 100 MB. If you experience low upload speeds or latency, we recommend using the AzCopy tool for this file transfer. See https://docs.microsoft.com/azure/storage/common/storage-use-azcopy-v10 for more information.
Uploading Datasets (592.49 MBs): 100%|██████████| 592487593/592487593 [00:03<00:00, 162888587.24it/s]
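
The four blobs listed in the output above have flat names, so writing each one to `os.path.join(dataset_dir, blob.name)` works as is. If a container held blob names with a `/` in them, the parent folder would not exist locally and `open` would fail. The sketch below shows one possible way to handle that case. The helper name `local_path_for_blob` is hypothetical and not part of the original notebook.

```python
import os


def local_path_for_blob(dataset_dir, blob_name):
    """Map a blob name to a local file path inside dataset_dir, creating parent folders."""
    root = os.path.abspath(dataset_dir)
    target = os.path.abspath(os.path.join(root, blob_name))
    # Refuse names that would escape the dataset folder (for example "../x.csv").
    if os.path.commonpath([root, target]) != root:
        raise ValueError(f"Blob name escapes the dataset folder: {blob_name!r}")
    # Create any intermediate folders implied by "/" in the blob name.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    return target
```

In the download loop, `open(local_path_for_blob(dataset_dir, blob.name), "wb")` could then replace the direct `os.path.join` call.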

