In [None]:
%reload_azureml_ws

# 1. Azure Datasets

An Azure [***Dataset***](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py) represents a resource for 

- exploring,
- transforming, and
- managing 

data in AzureML.

Datasets are the recommended way to work with data, and are the primary mechanism for advanced Azure Machine Learning capabilities like data labeling and data drift monitoring.

>***A Dataset is a reference to data in a Datastore or behind public web URLs.***

## 1.1 Types of Azure Datasets

You can create the following types of datasets:

- ***Tabular*** : The data is read from the dataset as a table. You should use this type of dataset when your data is consistently structured and you want to work with it in common tabular data structures, such as Pandas dataframes.

- ***File*** : The dataset presents a list of file paths that can be read as though from the file system. Use this type of dataset when your data is unstructured, or when you need to process the data at the file level (for example, to train a convolutional neural network from a set of image files).

# 2. Creating and Registering Tabular Datasets

## 2.1 Create Tabular Dataset
To create a tabular dataset using the SDK, use the from_delimited_files method of the Dataset.Tabular class, like this:

In [None]:
from azureml.core import Dataset

# Get the default datastore
default_ds = ws.get_default_datastore()

# Create a tabular dataset from the path on the datastore (this may take a short while)
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds,'01_raw/*.csv'))
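
The `path` argument can also be a list of `(datastore, path)` tuples, which may help when the CSV files are spread over several folders. The sketch below only builds such a list. The helper name `build_datastore_paths` and the second folder `02_intermediate` are hypothetical and do not come from this notebook.

```python
def build_datastore_paths(datastore, patterns):
    """Pair each glob pattern with the datastore it should be read from."""
    return [(datastore, pattern) for pattern in patterns]


# Possible use in this notebook (assumes `default_ds` from the cell above
# and a hypothetical second folder '02_intermediate'):
# paths = build_datastore_paths(default_ds, ['01_raw/*.csv', '02_intermediate/*.csv'])
# tab_data_set = Dataset.Tabular.from_delimited_files(path=paths)
```

Keeping the list construction separate from the SDK call makes it easy to inspect the paths before any data is read.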


## 2.2 Register Tabular Dataset
We will register the tabular dataset under the name `Iris Dataset`:

In [None]:
# Register the tabular dataset
tab_data_set = tab_data_set.register(workspace=ws, 
                                   name='Iris Dataset',
                                   description='iris data',
                                   tags = {'format':'CSV'},
                                   create_new_version=True)

## 2.3 Creating and Registering File Datasets

To create a file dataset using the SDK, use the from_files method of the Dataset.File class, like this:

In [None]:
# Create a file dataset from the path on the datastore (this may take a short while)
file_data_set = Dataset.File.from_files(path=(default_ds, '01_raw/*.csv'))

# Get the files in the dataset
for file_path in file_data_set.to_path():
    print(file_path)

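try:
    # Register the file dataset (create_new_version=True adds a new version
    # if a dataset with this name already exists)
    file_data_set = file_data_set.register(workspace=ws,
                                           name='Iris Files Dataset',
                                           description='Iris file dataset',
                                           tags={'format': 'CSV'},
                                           create_new_version=True)
    print('Dataset registered')
except Exception as ex:
    # Report the actual failure instead of assuming a duplicate registration
    print('Dataset registration failed:', ex)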
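
Because a file dataset is consumed at the file level, a common next step is to copy its files to local storage. The following is a minimal sketch that assumes the `download` method of the file dataset returns the local paths of the downloaded files. The helper name `download_csv_files` and the target folder in the usage comment are hypothetical.

```python
import os


def download_csv_files(file_dataset, target_dir):
    """Download the dataset's files and return the local paths of the CSV files."""
    os.makedirs(target_dir, exist_ok=True)
    # overwrite=True replaces files left over from an earlier download
    local_paths = file_dataset.download(target_path=target_dir, overwrite=True)
    return [path for path in local_paths if path.lower().endswith('.csv')]


# Possible use in this notebook (assumes `file_data_set` from the cell above):
# csv_paths = download_csv_files(file_data_set, 'downloaded_iris')
```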



You can view and manage datasets on the Datasets page for your workspace in Azure ML Studio. You can also get a list of datasets from the workspace object:

In [None]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)
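
The same listing can be collected into a data structure rather than printed. This sketch assumes that each value in `ws.datasets` is the registered dataset object itself, so no extra `get_by_name` call per dataset is needed. The helper name `list_registered_datasets` is hypothetical.

```python
def list_registered_datasets(workspace):
    """Return (name, version) pairs for the datasets registered in the workspace."""
    return sorted((name, dataset.version)
                  for name, dataset in workspace.datasets.items())


# Possible use in this notebook (assumes the workspace object `ws`):
# for name, version in list_registered_datasets(ws):
#     print("\t", name, 'version', version)
```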


# 3. Retrieving a Registered Dataset
After registering a dataset, you can retrieve it by using any of the following techniques:

- The ***datasets*** dictionary attribute of a ***Workspace*** object.
- The ***get_by_name*** or ***get_by_id*** method of the ***Dataset*** class.

Both of these techniques are shown in the following code:

In [None]:
from azureml.core import Workspace, Dataset

# Get a dataset from the workspace datasets collection
ds1 = ws.datasets['Iris Dataset']

# Get a dataset by name using the Dataset class
ds2 = Dataset.get_by_name(ws, 'Iris Dataset')
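
Since registering with `create_new_version=True` can produce several versions of the same dataset, it may be useful to retrieve a specific one. The sketch below relies on the `version` parameter of `Dataset.get_by_name`, which defaults to `'latest'`. The helper name `get_dataset_version` is hypothetical, and version `1` in the usage comment is an assumption about what exists in your workspace.

```python
def get_dataset_version(workspace, name, version='latest'):
    """Fetch a registered dataset by name, optionally pinning a version."""
    # Imported lazily so the helper can be defined before the SDK is loaded
    from azureml.core import Dataset
    return Dataset.get_by_name(workspace, name=name, version=version)


# Possible use in this notebook (assumes `ws` and that version 1 exists):
# ds_v1 = get_dataset_version(ws, 'Iris Dataset', version=1)
```

Pinning a version keeps an experiment reproducible even after newer versions of the dataset are registered.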

In [None]:
# Load the full dataset into a Pandas dataframe

data = ds1.to_pandas_dataframe()

In [None]:
data.head()
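
If only the first few rows are needed, the `take` method of a tabular dataset can limit the rows before they are converted to a dataframe. This may avoid loading a large dataset in full. A minimal sketch, with the hypothetical helper name `preview_rows`:

```python
def preview_rows(tabular_dataset, count=20):
    """Return only the first `count` rows as a Pandas dataframe."""
    # take() yields a new dataset limited to `count` rows
    return tabular_dataset.take(count).to_pandas_dataframe()


# Possible use in this notebook (assumes `ds1` from the cell above):
# preview = preview_rows(ds1, count=20)
```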

In [None]:
ws.datasets.get('Iris Dataset')