# Selecting Data Granularity Appropriate to the Hypothesis

## Azure Machine Learning workspace

A workspace is a logical container for your machine learning experiments, compute targets, datastores, models, Docker images, and deployed services. It keeps these resources in one place so that teams can collaborate on them.

### Create an Azure Machine Learning workspace

In [5]:
# Check core SDK version number
import azureml.core
from azureml.core import Workspace
from azureml.core import Dataset

print('SDK version:', azureml.core.VERSION)

In [6]:
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace

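try:
  ws = Workspace.create(
    name='your-workspace-name',
    subscription_id='your-subscription-id',
    resource_group='your-resource-group',
    location='your-preferred-location'
  )
  print('Workspace configuration succeeded')
  # Only read ws.name when creation succeeded; otherwise ws is undefined
  print(ws.name)
except Exception as e:
  # Creation can fail for several reasons (authentication, permissions, name already in use)
  print('Workspace creation failed:', e)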

### Use an existing workspace

In [8]:
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = 'your-subscription-id'
resource_group = 'your-resource-group'
workspace_name = 'your-workspace-name'

workspace = Workspace(subscription_id, resource_group, workspace_name)
workspace.name
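To avoid repeating the three identifiers in every notebook, they can be kept in a `config.json` file and loaded with `Workspace.from_config()`. The sketch below is one way to do that; the helper names `write_workspace_config` and `load_workspace` are hypothetical, and the file layout is assumed to match what `Workspace.write_config()` produces.

```python
import json
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three values that identify a workspace."""
    config_path = Path(directory) / "config.json"
    config_path.write_text(json.dumps({
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }, indent=2))
    return config_path


def load_workspace(config_path):
    """Connect using the saved file (requires azureml-core and valid Azure credentials)."""
    # Imported lazily so that the helper above can run without the SDK installed
    from azureml.core import Workspace
    return Workspace.from_config(path=str(config_path))
```

Calling `load_workspace(write_workspace_config('.', subscription_id, resource_group, workspace_name))` should then return the same workspace as the constructor call above, assuming the identifiers are valid.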

## Use Datasets directly in training

### Create a TabularDataset

By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.

Every workspace comes with a default datastore, backed by the Azure Blob storage account associated with the workspace; you can also register additional datastores. We can use the default datastore to upload local data to the cloud and then create a Dataset from it. We will now upload the flight delays data to the default datastore (blob) within your workspace.

In [12]:
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [13]:
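try:
  # Download the files behind the registered dataset to a local folder
  dataset = Dataset.get_by_name(workspace, name='Flight Delays Data')
  file = dataset.download(target_path='/train-dataset/flight_data', overwrite=True)
  print('File target path {}'.format(file))
except Exception as e:
  # Report the reason instead of silently swallowing the error
  print('Download not completed:', e)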

In [14]:
datastore = workspace.get_default_datastore()
datastore.upload_files(files = file,
                       target_path = '/train-dataset/Flight Delays Data/',
                       overwrite = True,
                       show_progress = True)

In [15]:
# the path must point at the CSV file as it was uploaded to the datastore
dataset = Dataset.Tabular.from_delimited_files(path = [(datastore, '/train-dataset/test_file/Flight Delays Data.csv')])

# load the dataset into a pandas DataFrame and show summary statistics
df = dataset.to_pandas_dataframe()
df.describe()
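Note that `describe()` returns summary statistics for the numeric columns, not a preview of the rows. A minimal sketch of the difference on a small hypothetical frame (the column names mirror `Carrier` and `DepDelay` from the flight data; the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "Carrier": ["AA", "AA", "DL", "DL", "DL"],
    "DepDelay": [10.0, -2.0, 0.0, 30.0, 6.0],
})

preview = toy.head(3)      # first 3 rows, all columns
summary = toy.describe()   # count, mean, std, min, quartiles, max of numeric columns

print(preview)
print(summary)
```

On a `TabularDataset`, `dataset.take(3).to_pandas_dataframe()` should give the same kind of row preview without loading every record first.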

### Basic statistical description of airlines

As a first step, we consider all the flights from all carriers. The aim here is to classify the airlines with respect to their punctuality, so we compute a few basic statistical parameters of the departure delay for each carrier:

In [18]:
#__________________________________________________________________
# function that extracts statistical parameters from a groupby object:
def calc_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}


#_______________________________________________________________
# Creation of a dataframe with statistical information on each airline:
global_stats = df['DepDelay'].groupby(df['Carrier']).apply(calc_stats).unstack()
global_stats = global_stats.sort_values('count')
global_stats
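The same four statistics can also be obtained with `agg` and a list of function names, which avoids the custom function and the `unstack()` step. A minimal sketch on an invented toy frame, not the flight data itself:

```python
import pandas as pd

# Hypothetical toy data standing in for the flight delays table
flights = pd.DataFrame({
    "Carrier": ["AA", "AA", "DL", "DL", "DL"],
    "DepDelay": [10.0, -2.0, 0.0, 30.0, 6.0],
})

# One row per carrier, one column per statistic, sorted by number of flights
stats = (
    flights.groupby("Carrier")["DepDelay"]
           .agg(["min", "max", "count", "mean"])
           .sort_values("count")
)
print(stats)
```

Here `AA` has two flights with a mean delay of 4.0 and `DL` has three with a mean of 12.0, so `AA` is listed first after sorting by `count`.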