In [1]:
from azureml.core import Workspace, Datastore, Dataset

# Access datastore 

Data is a vital part of the machine learning workflow. In classical software engineering, the source code is version controlled. In machine learning engineering, we additionally need to version control the data. Azure ML has two related concepts for data:
- `Datastores` are the places where data is stored in the cloud. Creating a workspace also creates some default datastores for data and artifacts. We can also attach additional datastores to the workspace.
- `Datasets` are versioned data registered in the Azure ML workspace.


In [2]:
ws = Workspace.from_config()

In [3]:
# List all available datastores in the workspace.
for ds_name in ws.datastores:
    print(ds_name)

workspaceartifactstore
azureml_globaldatasets
workspacefilestore
workspaceblobstore


In [4]:
# Access a datastore by its name.
Datastore.get(ws, datastore_name='workspaceartifactstore')

{
  "name": "workspaceartifactstore",
  "container_name": "azureml",
  "account_name": "dsdev011073180542",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [5]:
# workspaceblobstore is the default datastore in the workspace. 
ws.get_default_datastore()

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-03198395-8463-4a4d-8899-cb105b49c173",
  "account_name": "dsdev011073180542",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [6]:
# We can change the default datastore with the following code.
ws.set_default_datastore('workspaceblobstore')

```python 
# Register a new Azure Blob Storage container as a datastore.
# The newly created datastore can also be made the default datastore.
# Note: Azure container names allow only lowercase letters, digits, and hyphens.
blob_ds = Datastore.register_azure_blob_container(workspace=ws, 
                                                 datastore_name='new_blob_data', 
                                                 container_name='rk-data-container', 
                                                 account_name='name', 
                                                 account_key='key')
```
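
Hard-coding an account key in a notebook makes it easy to leak. The following is a minimal sketch of one alternative, assuming the credentials are supplied through environment variables. The variable names `AZURE_STORAGE_ACCOUNT` and `AZURE_STORAGE_KEY` and both helper functions are hypothetical and chosen only for illustration; the only SDK call used is `Datastore.register_azure_blob_container` with the same keyword arguments as above.

```python
import os


def blob_registration_kwargs(datastore_name, container_name, env=None):
    """Build keyword arguments for Datastore.register_azure_blob_container.

    The environment variable names below are illustrative assumptions,
    not an Azure ML convention.
    """
    env = os.environ if env is None else env
    required = ("AZURE_STORAGE_ACCOUNT", "AZURE_STORAGE_KEY")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "datastore_name": datastore_name,
        "container_name": container_name,
        "account_name": env["AZURE_STORAGE_ACCOUNT"],
        "account_key": env["AZURE_STORAGE_KEY"],
    }


def register_blob_datastore(ws, datastore_name, container_name):
    """Register the container with workspace `ws` (requires azureml-core)."""
    # Imported lazily so that the helper above can be used without the SDK.
    from azureml.core import Datastore

    kwargs = blob_registration_kwargs(datastore_name, container_name)
    return Datastore.register_azure_blob_container(workspace=ws, **kwargs)
```

With both variables set, `register_blob_datastore(ws, 'new_blob_data', 'rk-data-container')` would perform the same registration without the key appearing in the notebook.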

# Register data 

We register data for versioning and reproducibility. Here we first upload the data from the local machine to the `Datastore`, then register the uploaded files as a `Dataset`.

In [7]:
datastore = ws.get_default_datastore()

In [8]:
# Upload the local folder, then register a single CSV file as a dataset.
datastore.upload(src_dir='Data', target_path='data', overwrite=True)

data_path = [(datastore, 'data/iris.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=data_path)
dataset.register(workspace=ws, name='Iris Data')

Uploading an estimated of 2 files
Uploading Data\iris.csv
Uploaded Data\iris.csv, 1 files out of an estimated total of 2
Uploading Data\sample.csv
Uploaded Data\sample.csv, 2 files out of an estimated total of 2
Uploaded 2 files


{
  "source": [
    "('workspaceblobstore', 'data/iris.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "61f6e378-ca3f-48c8-bd55-6a041837d045",
    "name": "Iris Data",
    "version": 1,
    "workspace": "Workspace.create(name='ds_dev_01', subscription_id='54245888-2ffe-41fa-b080-67a29997b41c', resource_group='rg-dataservices-sandbox-01')"
  }
}

```python 
import pandas as pd

# Create a second CSV file for this demo by sampling five rows from iris.csv.
pd.read_csv('Data/iris.csv').sample(5).to_csv('Data/sample.csv')

# We can also register multiple CSV files as a single dataset.

datastore.upload(src_dir='Data', target_path='data', overwrite=True)

data_path = [(datastore, 'data/iris.csv'), 
             (datastore, 'data/sample.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=data_path)
dataset.register(workspace=ws, name='Two Irish Data')
```
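
When the folder holds more than a couple of files, writing the `(datastore, path)` tuples by hand becomes tedious. Below is a minimal sketch, assuming that `datastore.upload` places each top-level file of `src_dir` directly under `target_path`, as the upload log above suggests. The helper name `build_data_paths` and its `pattern` parameter are hypothetical.

```python
from pathlib import Path


def build_data_paths(datastore, src_dir, target_path, pattern="*.csv"):
    """Return [(datastore, 'target_path/file.csv'), ...] for files in src_dir.

    Only top-level files matching `pattern` are included; subfolders are
    ignored. Names are sorted so the result is deterministic.
    """
    names = sorted(p.name for p in Path(src_dir).glob(pattern) if p.is_file())
    prefix = target_path.strip("/")
    return [(datastore, f"{prefix}/{name}" if prefix else name) for name in names]
```

For the demo folder, `build_data_paths(datastore, 'Data', 'data')` would produce the same two-element list as `data_path` above, which can then be passed to `Dataset.Tabular.from_delimited_files(path=...)`.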

# Retrieving registered data

We can retrieve the registered data as a pandas DataFrame in either of the following ways.

In [9]:
ws.datasets['Iris Data'].to_pandas_dataframe().head()

Unnamed: 0,PL,PW,SL,SW,y
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [10]:
Dataset.get_by_name(ws, 'Iris Data').to_pandas_dataframe().head()

Unnamed: 0,PL,PW,SL,SW,y
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


# Data versioning 

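Versioning is controlled at registration time through the `create_new_version` argument of `register`. With `create_new_version=True`, registering under an existing name stores the data as a new version. With the default (`False`), the SDK is not expected to overwrite a dataset that already exists under that name; registering different data under the same name should fail instead.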

```python 
# Register a single CSV file (sample.csv) as a new version of 'Iris Data'.
datastore.upload(src_dir='Data', target_path='data', overwrite=True)

data_path = [(datastore, 'data/sample.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=data_path)

dataset.register(workspace=ws, name='Iris Data', create_new_version=True)
```
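
It can be handy to know which version number a registration produced. The sketch below wraps the `register` call shown above and returns the `version` attribute of the registered dataset; it also wraps `Dataset.get_by_name`, whose `version` argument accepts a number or `'latest'`. Both function names are hypothetical helpers, not part of the SDK.

```python
def register_new_version(dataset, workspace, name):
    """Register `dataset` under `name` as a new version; return the version number."""
    registered = dataset.register(
        workspace=workspace, name=name, create_new_version=True
    )
    return registered.version


def load_version(workspace, name, version="latest"):
    """Fetch a registered dataset by name and version (requires azureml-core)."""
    # Imported lazily so that register_new_version can be used without the SDK.
    from azureml.core import Dataset

    return Dataset.get_by_name(workspace, name, version=version)
```

For example, `v = register_new_version(dataset, ws, 'Iris Data')` followed by `load_version(ws, 'Iris Data', version=v)` would read back exactly the data that was just registered.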

# Retrieving versioned and combined data

We can retrieve a specific version of the versioned data.

In [12]:
# Versioned data 
Dataset.get_by_name(ws, 'Iris Data', version=1).to_pandas_dataframe().head()

Unnamed: 0,PL,PW,SL,SW,y
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


```python
# Data split across two files (open question: how to do this correctly?)
Dataset.get_by_name(ws, 'Two Irish Data').to_pandas_dataframe()
```
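
One possible way to tell the two files apart again, offered as a sketch rather than a definitive answer: `Dataset.Tabular.from_delimited_files` accepts `include_path=True`, which keeps the source file of each row as an extra column. The sketch assumes the dataset was created with that flag and that the column is named `Path` (worth confirming against `df.columns`); the helper `split_by_source` is hypothetical and works on plain row dictionaries, such as those returned by `df.to_dict('records')`.

```python
from collections import defaultdict


def split_by_source(records, path_column="Path"):
    """Group row dictionaries by the file they came from.

    `path_column` is assumed to hold the source file of each row; it is
    removed from the rows in the returned groups.
    """
    groups = defaultdict(list)
    for row in records:
        row = dict(row)  # copy so the caller's rows are left untouched
        source = row.pop(path_column)
        groups[source].append(row)
    return dict(groups)
```

With pandas available, `dict(tuple(df.groupby('Path')))` gives a similar mapping from source path to a DataFrame per file.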