# Exploring AML Data Capabilities

In [1]:
subscriptionID = '2213e8b1-dbc7-4d54-8aff-b5e315df5e5b'
RG = '1-87a0bbbf-playground-sandbox'
ws_name = "MLOPS101"
location = "eastus"
container_name = 'sample-datastore'
account_url = "https://mlops1019715661474.blob.core.windows.net"
storage_account_name = 'mlops1019715661474'

In [2]:
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

ws = MLClient(
    DefaultAzureCredential(),
    subscription_id = subscriptionID,
    resource_group_name = RG,
    workspace_name= ws_name,
)
print(ws)

MLClient(credential=<azure.identity._credentials.default.DefaultAzureCredential object at 0x7f521e597dc0>,
         subscription_id=2213e8b1-dbc7-4d54-8aff-b5e315df5e5b,
         resource_group_name=1-87a0bbbf-playground-sandbox,
         workspace_name=MLOPS101)


## Data concepts in AML


<b> URI </b>
<ol>
<li> Local computer - ./home/username/data/my_data
<li> Public http(s) server - https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv
<li> Blob storage - wasbs://<containername>@<accountname>.blob.core.windows.net/<folder>/
<li> Azure Data Lake (gen2) - abfss://<file_system>@<account_name>.dfs.core.windows.net/<folder>/<file>.csv
<li> Azure Machine Learning datastore - azureml://datastores/<data_store_name>/paths/<folder1>/<folder2>/<folder3>/<file>.parquet
</ol>
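The URI schemes listed above can be assembled from their parts. A minimal sketch in plain Python: the helper names (`blob_uri`, `adls_gen2_uri`, `datastore_uri`) are hypothetical and exist only for illustration, and the sample values reuse the container, account and datastore names from this notebook.

```python
def blob_uri(container, account, folder=""):
    """wasbs:// URI for a folder in an Azure Blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{folder}"


def adls_gen2_uri(file_system, account, path=""):
    """abfss:// URI for a path in an ADLS Gen2 file system."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path}"


def datastore_uri(datastore, path):
    """Short-form azureml:// URI for a path inside a registered datastore."""
    return f"azureml://datastores/{datastore}/paths/{path}"


print(blob_uri("sample-datastore", "mlops1019715661474", "raw/"))
print(adls_gen2_uri("sample-datastore", "mlopsadlsxx", "train.csv"))
print(datastore_uri("sampledatastore", "iris.csv"))
```

The last call produces the same short-form URI that is used further down when the data asset is created.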

<b> Datastores </b>

<ol>
<li> Datastores are references to storage services (for example a Blob container or an ADLS Gen2 file system) registered in the AML workspace
</ol>

<b> Datasets </b>
<ol>
<li> Datasets (data assets in SDK v2) are versioned pointers to the data itself, such as a file or folder inside a datastore
</ol>

<b> Data types in SDK v2 </b>

1. File URI (uri_file) - Points to a single file of any format: image, audio, text, CSV
2. Folder URI (uri_folder) - Points to a directory of files of any format: image, audio, text, CSV
3. ML Table (mltable) - Optimised for reading large structured datasets efficiently

An Azure Machine Learning job maps URIs to the compute target filesystem. This mapping means that in a command that consumes or produces a URI, that URI works like a file or a folder. A URI uses identity-based authentication to connect to storage services, with either your Azure Active Directory ID (default), or Managed Identity. Azure Machine Learning Datastore URIs can apply either identity-based authentication, or credential-based (for example, Service Principal, SAS token, account key) without exposure of secrets.

<b> Modes of URI </b>

<ol>
<li> ro_mount - The URI represents a storage location that is mounted read-only to the compute target filesystem.
<li> rw_mount - The URI represents a storage location that is mounted to the compute target filesystem with both read and write access.
<li> download - The URI represents a storage location containing data that is downloaded to the compute target filesystem.
<li> upload - All data written to a compute target location is uploaded to the storage location represented by the URI.
</ol>

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?view=azureml-api-2&tabs=adls
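One way to keep the four modes apart is to note, for each, whether data is mounted or copied and whether it can be read from or written to storage. The table below is a sketch derived only from the mode descriptions and the ro/rw prefixes. `MODES` and `modes_that` are hypothetical names, not part of the SDK.

```python
# mode: (how data is made available, reads from storage, writes to storage)
MODES = {
    "ro_mount": ("mount", True, False),
    "rw_mount": ("mount", True, True),
    "download": ("copy", True, False),
    "upload": ("copy", False, True),
}


def modes_that(can_write):
    """Return the mode names whose write capability equals `can_write`."""
    return sorted(name for name, (_, _, writes) in MODES.items() if writes == can_write)


print("can write back to storage:", modes_that(True))
print("read-only towards storage:", modes_that(False))
```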

In [3]:
ws.datastores

<azure.ai.ml.operations._datastore_operations.DatastoreOperations at 0x7f5212bc5810>

In [4]:
dir(ws.datastores)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_credential',
 '_enable_telemetry',
 '_fetch_and_populate_secret',
 '_init_kwargs',
 '_list_secrets',
 '_operation',
 '_operation_config',
 '_operation_scope',
 '_registry_name',
 '_resource_group_name',
 '_scope_kwargs',
 '_show_progress',
 '_subscription_id',
 '_workspace_name',
 'create_or_update',
 'delete',
 'get',
 'get_default',
 'list']

In [5]:
default_datastore = ws.datastores.get_default()
default_datastore

AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'workspaceblobstore', 'description': None, 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/datastores/workspaceblobstore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f521e597970>, 'credentials': {'type': 'account_key'}, 'container_name': 'azureml-blobstore-fa97166b-673b-4e23-8246-1b9618c363e2', 'account_name': 'mlops1019715661474', 'endpoint': 'core.windows.net', 'protocol': 'https'})

## Creating a container in Blob Storage & uploading data

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-datastore?view=azureml-api-2&tabs=sdk-identity-based-access%2Csdk-adls-sp%2Ccli-azfiles-account-key%2Ccli-adlsgen1-identity-based-access

In [6]:
!mkdir data
!wget https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv -P ./data/raw_iris
!head ./data/raw_iris/iris.csv

--2023-05-18 13:14:09--  https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3975 (3.9K) [text/plain]
Saving to: ‘./data/raw_iris/iris.csv’


2023-05-18 13:14:09 (328 KB/s) - ‘./data/raw_iris/iris.csv’ saved [3975/3975]

"sepal.length","sepal.width","petal.length","petal.width","variety"
5.1,3.5,1.4,.2,"Setosa"
4.9,3,1.4,.2,"Setosa"
4.7,3.2,1.3,.2,"Setosa"
4.6,3.1,1.5,.2,"Setosa"
5,3.6,1.4,.2,"Setosa"
5.4,3.9,1.7,.4,"Setosa"
4.6,3.4,1.4,.3,"Setosa"
5,3.4,1.5,.2,"Setosa"
4.4,2.9,1.4,.2,"Setosa"


In [7]:
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

default_credential = DefaultAzureCredential()

# Create the BlobServiceClient, then the container and a client for the target blob.
# create_container raises ResourceExistsError if the container already exists.
blob_service_client = BlobServiceClient(account_url, credential=default_credential)
container_client = blob_service_client.create_container(container_name)
blob_client = blob_service_client.get_blob_client(container=container_name, blob='iris.csv')

# Upload the downloaded file (upload_blob fails if the blob exists unless overwrite=True is passed)
with open(file='./data/raw_iris/iris.csv', mode="rb") as data:
    blob_client.upload_blob(data)

## Registering a datastore from the created container

In [8]:
from azure.ai.ml.entities import AzureBlobDatastore
from azure.ai.ml.entities import SasTokenConfiguration

storage_account_SAS = 'sp=racwdl&st=2023-05-18T13:14:37Z&se=2023-05-18T21:14:37Z&skoid=ce5e68fb-796a-43ef-b80d-e46b14e0902e&sktid=84f1e4ea-8554-43e1-8709-f0b8589ea118&skt=2023-05-18T13:14:37Z&ske=2023-05-18T21:14:37Z&sks=b&skv=2022-11-02&spr=https&sv=2022-11-02&sr=c&sig=ys8F7THmRB6GYKWLSzqqG98F7CkPTYSGDOEE2omYVYQ%3D'

sample_datastore = AzureBlobDatastore(
    name = "sampledatastore",
    description = "Datastore pointing to a blob container using SAS token.",
    account_name = storage_account_name,
    container_name = container_name,
    credentials=SasTokenConfiguration(
        sas_token= storage_account_SAS
    ),
)
ws.create_or_update(sample_datastore)

AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'sampledatastore', 'description': 'Datastore pointing to a blob container using SAS token.', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/datastores/sampledatastore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f5210a17160>, 'credentials': {'type': 'sas'}, 'container_name': 'sample-datastore', 'account_name': 'mlops1019715661474', 'endpoint': 'core.windows.net', 'protocol': 'https'})
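A SAS token is a URL query string, so its scope and lifetime can be checked before it is registered. The sketch below parses the non-secret fields of the token used above with the standard library. The signature is replaced by a placeholder, and the reading of the field names (`sp` permissions, `st` start, `se` expiry, `sr` resource type) follows the usual SAS conventions.

```python
from datetime import datetime
from urllib.parse import parse_qs

# Same non-secret fields as the token above; the signature is a placeholder.
sas = (
    "sp=racwdl&st=2023-05-18T13:14:37Z&se=2023-05-18T21:14:37Z"
    "&spr=https&sv=2022-11-02&sr=c&sig=REDACTED"
)

# parse_qs returns lists of values; each SAS field occurs once.
fields = {key: values[0] for key, values in parse_qs(sas).items()}

fmt = "%Y-%m-%dT%H:%M:%SZ"
lifetime = datetime.strptime(fields["se"], fmt) - datetime.strptime(fields["st"], fmt)

print("permissions:", fields["sp"])
print("resource   :", fields["sr"])
print("expires    :", fields["se"])
print("lifetime   :", lifetime)
```

This token is valid for eight hours, so a datastore registered with it would presumably stop authenticating once the token expires and would need a fresh token.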

In [9]:
list(ws.datastores.list())

[AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'sampledatastore', 'description': 'Datastore pointing to a blob container using SAS token.', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/datastores/sampledatastore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f5210a3c220>, 'credentials': {'type': 'sas'}, 'container_name': 'sample-datastore', 'account_name': 'mlops1019715661474', 'endpoint': 'core.windows.net', 'protocol': 'https'}),
 AzureFileDatastore({'type': <DatastoreType.AZURE_FILE: 'AzureFile'>, 'name': 'workspaceworkingdirectory', 'description': None, 'tags': {}, 'properties': {}, 'print_as_yaml': True,

## List files in a datastore

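- The datastore operations shown above (create_or_update, delete, get, get_default, list) do not list files, so list the blobs of the underlying container with the Azure Blob Storage API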

Ref : https://stackoverflow.com/questions/75870455/is-there-a-way-to-get-list-of-folders-from-a-datastore-in-azure-ml-studio-with-p

In [10]:
for file in container_client.list_blobs():
    print(file.name)

iris.csv
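In a storage account without a hierarchical namespace, blob names are flat and "folders" are only name prefixes. A sketch of how top-level folders could be derived from the names returned by `list_blobs()`: the `blob_names` list is made-up sample data and `top_level_folders` is a hypothetical helper.

```python
def top_level_folders(blob_names):
    """Return the sorted first path segments of all names that contain a '/'."""
    return sorted({name.split("/", 1)[0] for name in blob_names if "/" in name})


# Made-up sample; in the notebook this would be [b.name for b in container_client.list_blobs()]
blob_names = ["iris.csv", "raw/iris.csv", "processed/train.csv", "processed/test.csv"]
print(top_level_folders(blob_names))
```

The Blob SDK's `ContainerClient.walk_blobs` with a `/` delimiter gives a similar folder-style listing on the service side.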


## Creating a data asset

In [11]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Supported paths include:
# local: './<path>'
# blob:  'https://<account_name>.blob.core.windows.net/<container_name>/<path>'
# ADLS gen2: 'abfss://<file_system>@<account_name>.dfs.core.windows.net/<path>/'
# Datastore: 'azureml://datastores/<data_store_name>/paths/<path>'

iris_path = 'azureml://datastores/sampledatastore/paths/iris.csv'

iris_data = Data(
    path = iris_path,
    type=AssetTypes.URI_FILE,
    description = "Iris data from datastore",
    name = "iris-data-raw",
    version = '1'
)

ws.data.create_or_update(iris_data)

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'iris-data-raw', 'description': 'Iris data from datastore', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/data/iris-data-raw/versions/1', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f5210a17670>, 'serialize': <msrest.serialization.Serializer object at 0x7f5210a17fa0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourcegroups/1-87a0bbbf-playground-sandbox/workspaces/MLOPS101/datastores/sampledatastore/paths/iris.cs

In [12]:
datastore = ws.datastores.get('sampledatastore')
datastore

AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'sampledatastore', 'description': 'Datastore pointing to a blob container using SAS token.', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/datastores/sampledatastore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f5212bc7cd0>, 'credentials': {'type': 'sas'}, 'container_name': 'sample-datastore', 'account_name': 'mlops1019715661474', 'endpoint': 'core.windows.net', 'protocol': 'https'})

## Reading dataset from datastore directly

In [13]:
from pathlib import Path
import pandas as pd

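def urlExtractor(datastore):
    """Build the long-form azureml:// URI prefix from the datastore's resource ID."""
    # parts = ('/', 'subscriptions', <id>, 'resourceGroups', <rg>, 'providers', <namespace>,
    #          'workspaces', <workspace>, 'datastores', <name>)
    splits = Path(datastore.id).parts
    # The leading '/' part supplies the '//' after 'azureml:'; the provider segment is dropped.
    return f'azureml:{"/".join(splits[:5])}/{"/".join(splits[-4:])}/paths'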

print(urlExtractor(datastore))

azureml://subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/workspaces/MLOPS101/datastores/sampledatastore/paths
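Slicing path parts by position works for this ID but fails silently if the ID has a different shape. A more defensive sketch, using a regular expression with named groups: `datastore_uri_from_id` is a hypothetical helper, and the lowercase `resourcegroups` segment follows the `path` value shown in the data asset output earlier.

```python
import re

ARM_DATASTORE_ID = re.compile(
    r"^/subscriptions/(?P<sub>[^/]+)/resourceGroups/(?P<rg>[^/]+)"
    r"/providers/Microsoft\.MachineLearningServices"
    r"/workspaces/(?P<ws>[^/]+)/datastores/(?P<ds>[^/]+)$",
    re.IGNORECASE,
)


def datastore_uri_from_id(resource_id, path=""):
    """Turn a datastore resource ID into a long-form azureml:// URI for `path`."""
    match = ARM_DATASTORE_ID.match(resource_id)
    if match is None:
        raise ValueError(f"not a datastore resource ID: {resource_id!r}")
    return (
        f"azureml://subscriptions/{match['sub']}/resourcegroups/{match['rg']}"
        f"/workspaces/{match['ws']}/datastores/{match['ds']}/paths/{path}"
    )


resource_id = (
    "/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b"
    "/resourceGroups/1-87a0bbbf-playground-sandbox"
    "/providers/Microsoft.MachineLearningServices"
    "/workspaces/MLOPS101/datastores/sampledatastore"
)
print(datastore_uri_from_id(resource_id, "iris.csv"))
```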


In [14]:
pd.read_csv(urlExtractor(datastore)+'/'+'iris.csv')

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


## Reading the data from the registered data asset

In [15]:
ws.data.get('iris-data-raw', "1")

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'iris-data-raw', 'description': 'Iris data from datastore', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/data/iris-data-raw/versions/1', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f5210a3c1c0>, 'serialize': <msrest.serialization.Serializer object at 0x7f5210a17df0>, 'version': '1', 'latest_version': None, 'path': 'azureml://subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourcegroups/1-87a0bbbf-playground-sandbox/workspaces/MLOPS101/datastores/sampledatastore/paths/iris.cs

In [16]:
pd.read_csv(ws.data.get('iris-data-raw', "1").path)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica


## Creating a new version of the data asset and getting the latest version

In [17]:
iris_path = 'azureml://datastores/sampledatastore/paths/iris.csv'
iris_data = Data(
    path = iris_path,
    type=AssetTypes.URI_FILE,
    description = "Updated data from datastore",
    name = "iris-data-raw",
    version = '2'
)
ws.data.create_or_update(iris_data)

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'iris-data-raw', 'description': 'Updated data from datastore', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/data/iris-data-raw/versions/2', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f51e6d43d90>, 'serialize': <msrest.serialization.Serializer object at 0x7f51e6d82440>, 'version': '2', 'latest_version': None, 'path': 'azureml://subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourcegroups/1-87a0bbbf-playground-sandbox/workspaces/MLOPS101/datastores/sampledatastore/paths/iris

In [18]:
ws.data.get(name='iris-data-raw', label='latest')  # public alternative to the private _get_latest_version helper

Data({'skip_validation': False, 'mltable_schema_url': None, 'referenced_uris': None, 'type': 'uri_file', 'is_anonymous': False, 'auto_increment_version': False, 'name': 'iris-data-raw', 'description': 'Updated data from datastore', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/data/iris-data-raw/versions/2', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': <azure.ai.ml.entities._system_data.SystemData object at 0x7f51e6d82740>, 'serialize': <msrest.serialization.Serializer object at 0x7f51e6d83730>, 'version': '2', 'latest_version': None, 'path': 'azureml://subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourcegroups/1-87a0bbbf-playground-sandbox/workspaces/MLOPS101/datastores/sampledatastore/paths/iris

## Folder type

In [19]:
!mkdir iris-processed

In [20]:
import pandas as pd
data = pd.read_csv(ws.data.get('iris-data-raw', "1").path)

In [21]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(data.variety)

data.variety = le.transform(data.variety)

data.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
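`LabelEncoder` assigns integer codes to the sorted unique labels, so it is worth recording the mapping before the names are overwritten. A plain-Python sketch of the same mapping: the sample list is made up and assumes the three variety names in this CSV are `Setosa`, `Versicolor` and `Virginica` (only the first and last appear in the outputs above).

```python
varieties = ["Setosa", "Virginica", "Versicolor", "Setosa"]  # made-up sample labels

# Sorted unique labels, which is what LabelEncoder keeps in classes_
classes = sorted(set(varieties))
to_code = {name: code for code, name in enumerate(classes)}
to_name = {code: name for name, code in to_code.items()}

codes = [to_code[v] for v in varieties]
print(to_code)
print(codes)
print([to_name[c] for c in codes])
```

With the fitted encoder from the cell above, `le.classes_` and `le.inverse_transform` provide the same information.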


In [22]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, stratify = data.variety)

print(f'Train : {train.shape}; Test : {test.shape}')

Train : (112, 5); Test : (38, 5)
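The 112/38 split follows from scikit-learn's default `test_size` of 0.25 when neither `test_size` nor `train_size` is given: the test count is rounded up and the rest goes to training. A short arithmetic check, assuming the 150 rows of the iris file:

```python
import math

n_samples = 150
test_size = 0.25  # default used by train_test_split when no size is passed

n_test = math.ceil(test_size * n_samples)  # 37.5 rounds up
n_train = n_samples - n_test

print(f"Train : {n_train}; Test : {n_test}")
```

Because no `random_state` is passed in the cell above, the rows that land in each split differ from run to run.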


In [23]:
train.to_csv('./iris-processed/train.csv', index = False)
test.to_csv('./iris-processed/test.csv', index = False)

## Upload a folder to ADLS

In [24]:
dir(BlobServiceClient)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_batch_send',
 '_configure_encryption',
 '_create_pipeline',
 '_format_query_string',
 '_format_url',
 '_rename_container',
 'api_version',
 'close',
 'create_container',
 'delete_container',
 'find_blobs_by_tags',
 'from_connection_string',
 'get_account_information',
 'get_blob_client',
 'get_container_client',
 'get_service_properties',
 'get_service_stats',
 'get_user_delegation_key',
 'list_containers',
 'location_mode',
 'primary_endpoint',
 'primary_hostname',
 'secondary_endpoint',
 'secondary_hostname',
 'set_service_properties',
 'undelete_container',
 'url']

In [25]:
from azure.storage.blob import BlobServiceClient

ADLS = BlobServiceClient.from_connection_string('DefaultEndpointsProtocol=https;AccountName=mlopsadlsxx;AccountKey=PS++nG6VtyW/aLcjGj6pPaNmhe8jZAkZ98wibzce1DBxIn5YmFRIAaKO+A8s1pxByaxRxE05iABm+AStKIn4Kw==;EndpointSuffix=core.windows.net')
ADLS.create_container(container_name)

<azure.storage.blob._container_client.ContainerClient at 0x7f51dd4bbdf0>

In [27]:
from pathlib import Path

BASE_DIR = './iris-processed'
BASE_PATH = Path(BASE_DIR)

for file in BASE_PATH.rglob('*.csv'):
    # Blob name is the path relative to the base folder, e.g. 'train.csv'
    azure_path = file.relative_to(BASE_PATH).as_posix()
    blob_client = ADLS.get_blob_client(container = container_name, blob = azure_path)
    print("uploading file --->", file)
    with open(file, "rb") as data:
        blob_client.upload_blob(data)

uploading file ---> iris-processed/test.csv
uploading file ---> iris-processed/train.csv
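When the folder contains subfolders, the blob name should keep the sub-path so that the remote layout mirrors the local one. A sketch of that name mapping on a throwaway directory: `blob_names` is a hypothetical helper and the `nested/extra.csv` file is made up for the demo.

```python
import tempfile
from pathlib import Path


def blob_names(base_dir):
    """Map each CSV under base_dir to a blob name relative to base_dir."""
    base = Path(base_dir)
    return {p.relative_to(base).as_posix(): p for p in sorted(base.rglob("*.csv"))}


# Throwaway folder that mimics ./iris-processed plus one subfolder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    for rel in ("train.csv", "test.csv", "nested/extra.csv"):
        (Path(tmp) / rel).write_text("a,b\n1,2\n")
    names = list(blob_names(tmp))

print(names)
```

Using `as_posix()` keeps forward slashes in the blob names even when the code runs on Windows.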


## Creating ADLS Datastore

- Credential-based access to an ADLS Gen2 datastore works only with a service principal (no account key or SAS token)

In [28]:
ADLS_NAME = 'ADLS'
ACCOUNT_NAME = 'mlopsadlsxx'
TENANT_ID = '84f1e4ea-8554-43e1-8709-f0b8589ea118'
CLIENT_ID = '21eaf05b-a2d1-4c65-a401-0319f9cf57a3'
SECRET = 'frk8Q~IJ8SsVSgwSSHD2PQP.-zJ5esJ1EyVIVbl4'
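Hardcoding a client secret in a notebook makes it easy to leak. A common alternative is to read it from an environment variable. The sketch below assumes a hypothetical variable name, `AML_SP_CLIENT_SECRET`, and sets a dummy value only so that the example runs on its own.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this cell")
    return value


# Demo only: provide a dummy value if the (hypothetical) variable is not set.
os.environ.setdefault("AML_SP_CLIENT_SECRET", "dummy-value-for-demo")

SECRET_FROM_ENV = require_env("AML_SP_CLIENT_SECRET")
print("secret loaded:", bool(SECRET_FROM_ENV))
```

The same pattern would apply to the SAS token and the storage connection string used earlier.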

In [30]:
from azure.ai.ml.entities import AzureDataLakeGen2Datastore
from azure.ai.ml.entities import ServicePrincipalConfiguration

store = AzureDataLakeGen2Datastore(
    name = ADLS_NAME,
    description="Datastore pointing to an Azure Data Lake Storage Gen2.",
    account_name = ACCOUNT_NAME,
    filesystem = container_name,
    credentials=ServicePrincipalConfiguration(
        tenant_id= TENANT_ID,
        client_id= CLIENT_ID,
        client_secret= SECRET,
    ),
)

ws.create_or_update(store)

resource_uri is not a known attribute of class <class 'azure.ai.ml._restclient.v2022_10_01.models._models_py3.ServicePrincipalDatastoreCredentials'> and will be ignored


AzureDataLakeGen2Datastore({'type': <DatastoreType.AZURE_DATA_LAKE_GEN2: 'AzureDataLakeGen2'>, 'name': 'adls', 'description': 'Datastore pointing to an Azure Data Lake Storage Gen2.', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/2213e8b1-dbc7-4d54-8aff-b5e315df5e5b/resourceGroups/1-87a0bbbf-playground-sandbox/providers/Microsoft.MachineLearningServices/workspaces/MLOPS101/datastores/adls', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/mlops101/code/Users/cloud_user_p_343e3e29', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f51dced7130>, 'credentials': {'authority_url': 'https://login.microsoftonline.com', 'resource_url': 'https://storage.azure.com/', 'tenant_id': '84f1e4ea-8554-43e1-8709-f0b8589ea118', 'client_id': '21eaf05b-a2d1-4c65-a401-0319f9cf57a3', 'type': 'service_principal'}, 'account_name': 'mlopsadlsxx', 'filesystem': 'sample-datastore', 'endpoint': 'core.window

## Easy data operations with azureml.fsspec

Upgrade to the latest version

https://learn.microsoft.com/en-us/azure/machine-learning/how-to-access-data-interactive?view=azureml-api-2&tabs=adls

In [62]:
!pip install --upgrade azureml-fsspec

Collecting azureml-fsspec
  Downloading azureml_fsspec-1.0.0-py3-none-any.whl (11 kB)
Collecting azureml-dataprep<4.11.0a,>=4.10.0a
  Downloading azureml_dataprep-4.10.7-py3-none-any.whl (38.2 MB)
Collecting azureml-dataprep-rslex~=2.17.6dev0
  Downloading azureml_dataprep_rslex-2.17.11-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (20.2 MB)
ERROR: azureml-dataset-runtime 1.49.0 has requirement azureml-dataprep<4.10.0a,>=4.9.0a, but you'll have azureml-dataprep 4.10.7 which is incompatible.
Installing collected packages: azureml-dataprep-rslex, azureml-dataprep, azureml-fsspec
  Attempting uninstall: azureml-dataprep-rslex
    Found existing installation: azureml-dataprep-rslex 2.16.1
    Uni

In [64]:
from azureml.fsspec import AzureMachineLearningFileSystem

#azureml://subscriptions/<subid>/resourcegroups/<rgname>/workspaces/<workspace_name>/datastores/datastorename
ds_url = f"azureml://subscriptions/{subscriptionID}/resourcegroups/{RG}/workspaces/{ws_name}/datastores/adls/paths/"
fs = AzureMachineLearningFileSystem(ds_url)
fs.ls()

['adls/test.csv', 'adls/train.csv']

In [65]:
with fs.open('adls/test.csv') as f:
    x = pd.read_csv(f)

x.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,6.7,3.1,4.4,1.4,1
1,6.4,3.2,5.3,2.3,2
2,5.4,3.4,1.7,0.2,0
3,6.9,3.2,5.7,2.3,2
4,6.4,2.7,5.3,1.9,2


In [None]:
fs.download(rpath='sample-datastore', lpath='data/download_files/', recursive=True)

In [None]:
fs.upload(lpath='./iris-processed', rpath='/iris-processed-v2/', recursive=True, **{'overwrite': 'MERGE_WITH_OVERWRITE'})
# lpath: local path; rpath: remote path, created if it does not exist
# recursive=True uploads the whole folder
# overwrite modes: append, MERGE_WITH_OVERWRITE, FAIL_ON_FILE_CONFLICT